Visual Tracking papers

Visual Tracking Review

Author: Ferhat Can ATAMAN
Created: 06/04/2020
Last Modified: 07/04/2020

The aim of this review is to investigate some of the core visual tracking papers that give solutions to basic problems in the area. Before moving on to deep learning methods, this review can give good intuition about the field.

Core papers

1. ASEF (CVPR09)

It tries to solve the robustness problem: changes in appearance should not affect the result of the localization. A training set is needed to learn a custom correlation filter.

Fig.1: ASEF algorithm explanation

2. MOSSE (CVPR10)

It tries to solve the visual tracking problem using correlation filters, similar to ASEF. In this approach, the correlation filter is updated online according to the current target appearance. The target window is selected in the first frame of the video, and MOSSE then initializes tracking. First, the initial MOSSE filter is constructed from the target window and N (a hyperparameter) affine permutations of it, such as rotation, scale, and translation (RST). After that, in each frame, the MOSSE filter is updated with a moving average, so the current frame contributes with a small weight relative to the accumulated history and the filter adapts gradually.
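The initialization and online update described above can be sketched with NumPy. This is a minimal sketch, not the paper's exact implementation: the perturbation scheme (circular shifts standing in for the RST permutations), the learning rate `eta`, and the `eps` regularization value are all illustrative assumptions.

```python
import numpy as np

def mosse_init(target, gaussian_peak, n_perturbations=8, eps=1e-5, rng=None):
    """Build an initial MOSSE filter from the first-frame target window.

    `target` and `gaussian_peak` are 2-D float arrays of the same shape;
    the peak is the desired correlation output (a Gaussian centred on the
    target). Small random circular shifts stand in for the N affine
    permutations (RST) described in the paper.
    """
    rng = rng or np.random.default_rng(0)
    G = np.fft.fft2(gaussian_peak)
    A = np.zeros_like(G)   # numerator accumulator
    B = np.zeros_like(G)   # denominator accumulator
    for _ in range(n_perturbations):
        shift = rng.integers(-2, 3, size=2)            # toy perturbation
        F = np.fft.fft2(np.roll(target, shift, axis=(0, 1)))
        A += G * np.conj(F)
        B += F * np.conj(F)
    return A, B, A / (B + eps)

def mosse_update(A, B, new_target, gaussian_peak, eta=0.125, eps=1e-5):
    """Moving-average update: older frames decay, the filter adapts gradually."""
    G = np.fft.fft2(gaussian_peak)
    F = np.fft.fft2(new_target)
    A = eta * (G * np.conj(F)) + (1 - eta) * A
    B = eta * (F * np.conj(F)) + (1 - eta) * B
    return A, B, A / (B + eps)
```

Tracking then amounts to correlating each new frame with the filter in the frequency domain (`np.fft.ifft2(np.fft.fft2(frame) * H)`) and taking the peak of the response map as the new target location.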

Two main contributions exist over ASEF:

1. ASEF needs too many training samples to obtain a reliable correlation filter $H(w, v)$. With a small amount of training data, the element-wise division in the frequency domain can be unstable: a training image that contains very little energy makes $F_i \odot F_i^*$ close to zero.
2. ASEF becomes too slow for tracking purposes.
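The instability of that element-wise division can be demonstrated in a few lines. This toy example (the image sizes, the tiny-energy scale, and the $10^{-5}$ regularizer are illustrative assumptions) compares the exact division against a regularized one:

```python
import numpy as np

# With a single low-energy training image, F ⊙ F* is close to zero in
# every frequency bin, so the exact element-wise division blows up.
rng = np.random.default_rng(1)
F = np.fft.fft2(1e-6 * rng.random((8, 8)))   # tiny-energy training image
G = np.fft.fft2(np.eye(8))                   # arbitrary desired output

exact = G * np.conj(F) / (F * np.conj(F))               # unstable
regularised = G * np.conj(F) / (F * np.conj(F) + 1e-5)  # regularized division
```

Averaging numerator and denominator over many training samples, as MOSSE does, has a similar stabilizing effect on the division.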

3. TADT (CVPR19)

The main idea of this approach is the following. A CNN pre-trained for object classification can be used to extract features of an object. However, a classification network tends to separate inter-class differences, while in tracking scenarios the tracked objects can belong to the same class: for example, two people crossing each other. In single-target tracking, intra-class separation is therefore also important. Thus, from the pre-trained network, the features that are most activated by the desired object can be found. Once these features are found, only they are used to localize the object in the next frame. For feature comparison, a Siamese matching network is used.

Basically, the important features are selected using the following equation: $\chi' = \varphi(\chi; \Delta)$, where $\chi$ is the input feature map and $\varphi(\cdot)$ selects important features according to the channel importance $\Delta$.

The channel importance is computed as $\Delta_i = G_{AP}\!\left(\frac{\partial L}{\partial z_i}\right)$, where $G_{AP}(\cdot)$ is the global average pooling function, $L$ is the loss function used to select features, and $z_i$ is the output feature of the $i^{th}$ filter.
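The selection step can be sketched as follows: global-average-pool the loss gradient per channel to get $\Delta$, then keep the highest-scoring channels. This is a minimal sketch; the `(C, H, W)` layout, the top-`k` cutoff, and the function name are assumptions for illustration, not the exact TADT implementation.

```python
import numpy as np

def select_channels(features, grads, k):
    """Rank feature channels by GAP of the loss gradient, keep the top k.

    `features` and `grads` have shape (C, H, W); `grads` plays the role
    of dL/dz_i obtained from the backward pass of the selection loss.
    """
    delta = grads.mean(axis=(1, 2))          # G_AP(dL/dz_i): one score per channel
    keep = np.argsort(-np.abs(delta))[:k]    # indices of the most important channels
    return features[keep], keep
```

In the full method this ranking is done once, on the first frame, and the chosen channel indices are reused for the rest of the sequence.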

There are two loss functions to select important features: a regression loss for target-active features and a ranking loss for scale-sensitive features.

For the given object, the pre-trained features are calculated first. Then the losses are computed, and their gradients are used to select features with the above equations.

WARNING: This part could be clearer. It is not explained very well in the paper; the paper that inspired it should be read.

The networks that compute the above losses to select important feature channels are trained on the first frame of the tracking sequence. Thus, the target-aware features from the pre-trained network are decided in the initialization phase, and after that the same feature channels are used until the end of tracking.

where $*$ denotes the convolution operation.
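The Siamese matching step can be sketched as a dense cross-correlation of the selected target-aware features: slide the template features over the search-region features and sum the per-channel products. This is a sketch under assumed `(C, H, W)` shapes and a naive sliding-window loop, not the paper's optimized implementation.

```python
import numpy as np

def siamese_response(template_feats, search_feats):
    """Dense matching of target-aware features.

    Slides the template over the search region and sums the per-channel
    correlations; the peak of the response map is the predicted location.
    Shapes: template (C, th, tw), search (C, sh, sw) with sh >= th, sw >= tw.
    """
    C, th, tw = template_feats.shape
    _, sh, sw = search_feats.shape
    resp = np.zeros((sh - th + 1, sw - tw + 1))
    for y in range(resp.shape[0]):
        for x in range(resp.shape[1]):
            patch = search_feats[:, y:y + th, x:x + tw]
            resp[y, x] = np.sum(patch * template_feats)
    return resp
```

In practice this inner product over all channels is exactly what a convolution of the search features with the template features computes, which is why the matching can run as a single convolution layer.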

The overall network architecture of the paper is shown in Fig.2. The dashed part is computed only during initialization.

Fig.2: TADT algorithm explanation

To conclude, this method does not achieve the best accuracy on the benchmark datasets, but it is very fast compared to other methods.