In this section, we first introduce our weakly supervised module based on co-saliency learning in Section 3.1. Then, we describe our weakly supervised data processing in Section 3.2. Finally, we present how we integrate our method into existing tracking frameworks to achieve weakly supervised tracking in Section 3.3 (CNN-based framework) and Section 3.4 (Transformer-based framework), respectively.
3.1. Weakly Supervised Co-Saliency Learning
Co-saliency learning aims to extract the common significant foreground areas from a set of images. Inspired by the prior work [14], we propose a co-saliency attention mechanism (CSA) to construct a weakly supervised tracking environment.
Figure 2 shows the detailed architecture of the proposed CSA module. The input of the module contains feature maps of multiple frames from the same video, denoted as $\{F_t\}_{t=1}^{T}$, where $T$ represents the total number of feature maps and each $F_t$ has $H \times W$ resolution with $C$ dimensions. Our CSA module consists of two branches, i.e., a spatial branch and a channel branch. In doing so, the common foreground information of multiple input frames can be obtained considering both spatial positions and common important channels. Specifically, to reduce computation and implement our CSA as a lightweight module, all of the input feature maps first go through the channel and spatial dimension reduction layers (shown as Channel Conv and Spatial Conv in Figure 2). These two dimension reduction layers are both implemented by a convolutional layer followed by ReLU, and output two types of low-dimension feature maps, channel-reduced maps $\{F_t^{s}\} \in \mathbb{R}^{C' \times H \times W}$ and spatially-reduced maps $\{F_t^{c}\} \in \mathbb{R}^{C \times H' \times W'}$, where $C'$, $H'$, and $W'$ are much smaller than $C$, $H$, and $W$. Then, we employ the normalized cross correlation (NCC) operation to calculate the similarity among multiple frames. The NCC operation of two feature vectors with dimension $d$ can be formulated as

$$\mathrm{NCC}(P, Q) = \frac{1}{d} \sum_{i=1}^{d} \frac{(P_i - \mu_P)\,(Q_i - \mu_Q)}{\sigma_P \, \sigma_Q},$$

where $(\mu_P, \sigma_P)$ and $(\mu_Q, \sigma_Q)$ represent the couples of the mean and standard deviation of the feature vectors $P$ and $Q$, respectively. The aforementioned two types of low-dimension feature map sets, $\{F_t^{c}\}$ and $\{F_t^{s}\}$, are, respectively, input to the channel NCC block and the spatial NCC block to calculate the co-salient information of all $T$ frames at the channel and spatial levels.
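For concreteness, the NCC operation above can be written as a small PyTorch helper that compares two sets of $d$-dimensional vectors at once. The function name and batching scheme below are illustrative choices of ours, not the authors' released code.

```python
import torch


def ncc(p: torch.Tensor, q: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalized cross correlation between two sets of d-dimensional vectors.

    p: (N, d) feature vectors, q: (M, d) feature vectors.
    Returns an (N, M) matrix of pairwise NCC similarities in [-1, 1].
    """
    # standardize each vector: subtract its mean, divide by its std
    p = (p - p.mean(dim=1, keepdim=True)) / (p.std(dim=1, unbiased=False, keepdim=True) + eps)
    q = (q - q.mean(dim=1, keepdim=True)) / (q.std(dim=1, unbiased=False, keepdim=True) + eps)
    # (1/d) * sum_i (P_i - mu_P)(Q_i - mu_Q) / (sigma_P * sigma_Q) for every pair
    return p @ q.t() / p.shape[1]
```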
In the spatial NCC block, every spatial location $(i, j)$ of the feature map $F_t^{s}$ with dimension $C' \times H \times W$ can be viewed as a $C'$-dimensional feature vector (denoted as $f_{t,(i,j)}^{s}$) that describes the local information at $(i, j)$. We employ NCC to calculate the similarity between the information at each position of the frame $t$ and the information at all positions of the other frames, formulated as

$$A_t^{s}\big((i, j), (t', i', j')\big) = \mathrm{NCC}\big(f_{t,(i,j)}^{s},\; f_{t',(i',j')}^{s}\big), \quad \forall\, t' \neq t,\ \forall\, (i', j').$$

The output of the spatial NCC block is a 3D attention map $A_t^{s}$. It records the correlation information between each position of the frame $t$ and all positions of the other frames. Thus, the spatial locations of the common foreground, determined by consulting multiple frames, are activated.
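A minimal sketch of this step, reusing the `ncc` helper above, is given below; the function name and the list-based frame indexing are our own illustration of the computation rather than the paper's implementation.

```python
import torch


def spatial_ncc_block(feats: list, t: int) -> torch.Tensor:
    """Illustrative spatial NCC block for frame t (reuses the `ncc` helper above).

    feats: list of T channel-reduced feature maps, each of shape (C', H, W).
    Returns an (H*W, (T-1)*H*W) map: every location of frame t compared
    against every location of every other frame.
    """
    c, h, w = feats[t].shape
    # each spatial location of frame t becomes a C'-dimensional descriptor
    query = feats[t].reshape(c, h * w).t()                        # (H*W, C')
    others = [f.reshape(c, h * w).t() for i, f in enumerate(feats) if i != t]
    keys = torch.cat(others, dim=0)                               # ((T-1)*H*W, C')
    return ncc(query, keys)
```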
The process of the channel NCC block is similar to that of the spatial NCC block. First, the feature map $F_t^{c}$ with dimension $C \times H' \times W'$ is flattened into a 2D matrix, so that every channel of the feature map can be viewed as an $(H' \cdot W')$-dimensional vector $f_{t,k}^{c}$. The channel NCC block computes the channel attention map $A_t^{c}$, which records the similarity between each channel of the frame $t$ and all channels of the other $T-1$ frames. The common important channels of all video frames can then be enhanced. This process can be formulated as

$$A_t^{c}\big(k, (t', k')\big) = \mathrm{NCC}\big(f_{t,k}^{c},\; f_{t',k'}^{c}\big), \quad \forall\, t' \neq t,\ \forall\, k'.$$
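The channel-level counterpart can be sketched analogously; again, the names below are illustrative and the helper `ncc` is the one sketched earlier.

```python
import torch


def channel_ncc_block(feats: list, t: int) -> torch.Tensor:
    """Illustrative channel NCC block for frame t (reuses the `ncc` helper above).

    feats: list of T spatially-reduced feature maps, each of shape (C, H', W').
    Returns a (C, (T-1)*C) map: every channel of frame t compared against
    every channel of every other frame.
    """
    c, h, w = feats[t].shape
    # each channel of frame t becomes an (H'*W')-dimensional descriptor
    query = feats[t].reshape(c, h * w)                            # (C, H'*W')
    others = [f.reshape(c, h * w) for i, f in enumerate(feats) if i != t]
    keys = torch.cat(others, dim=0)                               # ((T-1)*C, H'*W')
    return ncc(query, keys)
```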
The two attention maps recording the similarity information, i.e., $A_t^{s}$ and $A_t^{c}$, are then summarized by a convolutional layer and transformed into $\hat{A}_t^{s}$ and $\hat{A}_t^{c}$, respectively. The final spatial-channel attention map is formulated as follows:

$$A_t = \hat{A}_t^{c} \otimes \hat{A}_t^{s},$$

where $\otimes$ broadcasts the channel attention over the spatial attention. $A_t$ is a $C \times H \times W$ matrix in which the common important channels and the co-salient spatial positions are highlighted with reference to multiple video frames, while other information is weakened. In this paper, we calculate the co-saliency attention map of the current frame's feature map with reference to the information from two other frames, namely the initial template and a random intermediate frame. The attention map is multiplied with the feature map of the current frame, making the target position and the important channels more salient. Meanwhile, the background information and other trivial channels of the final feature map of the current frame are suppressed.
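The fusion and enhancement step could look as follows. The 1x1 convolution and linear layer used to summarize the raw NCC maps, the sigmoid gating, and the broadcast multiplication are our reading of the text; the exact fusion in the paper may differ.

```python
import torch
import torch.nn as nn


class CSAFusion(nn.Module):
    """Sketch of fusing the spatial/channel attention maps and enhancing the current frame."""

    def __init__(self, num_other_positions: int, num_other_channels: int):
        super().__init__()
        # collapse the (T-1)*H*W spatial similarities into one score per location
        self.spatial_conv = nn.Conv2d(num_other_positions, 1, kernel_size=1)
        # collapse the (T-1)*C channel similarities into one score per channel
        self.channel_fc = nn.Linear(num_other_channels, 1)

    def forward(self, a_s, a_c, feat):
        # a_s: (H*W, (T-1)*H*W), a_c: (C, (T-1)*C), feat: (C, H, W)
        c, h, w = feat.shape
        s = a_s.t().reshape(1, -1, h, w)                 # (1, (T-1)*H*W, H, W)
        s = torch.sigmoid(self.spatial_conv(s))          # (1, 1, H, W) spatial map
        ch = torch.sigmoid(self.channel_fc(a_c))         # (C, 1) channel weights
        attn = ch.reshape(1, c, 1, 1) * s                # broadcast to (1, C, H, W)
        return feat.unsqueeze(0) * attn                  # enhanced current-frame feature
```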
3.2. Weakly Supervised Data Processing
Since target positions and shapes usually do not change much within a 30-frame clip, which corresponds to about one second for most videos, we adopt sparsely labeled training datasets to achieve weakly supervised tracking; that is, only one bounding box of the target is labeled every 30 frames. The input images to CSA are used to generate co-saliency attention maps, which do not require precise target annotations but only need the target to be present within the search regions cropped based on pseudo-labels. As a result, CSA inputs from unlabeled frames are highly tolerant to pseudo-label inaccuracies. Therefore, in this work, we employ linear interpolation to generate pseudo-labels for unlabeled frames, as illustrated in Figure 3.
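A minimal sketch of the interpolation step is shown below, assuming axis-aligned [x, y, w, h] annotations; the function name is illustrative.

```python
import numpy as np


def interpolate_pseudo_labels(box_a, box_b, num_between: int) -> np.ndarray:
    """Linearly interpolate pseudo-label boxes between two labeled frames.

    box_a, box_b: [x, y, w, h] annotations of two labeled frames (e.g., 30 apart).
    num_between: number of unlabeled frames in between.
    Returns an array of shape (num_between, 4), one pseudo-box per unlabeled frame.
    """
    box_a = np.asarray(box_a, dtype=float)
    box_b = np.asarray(box_b, dtype=float)
    alphas = np.arange(1, num_between + 1) / (num_between + 1)   # interpolation weights
    return box_a[None] + alphas[:, None] * (box_b - box_a)[None]
```

For a labeling sparsity of 30, `num_between` would be 29, yielding one pseudo-box for every unlabeled frame between two annotated ones.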
Although linear interpolation is not reliable for modeling complex scenarios such as nonlinear target motion, it is sufficient for our method because precise annotations are not strictly required. To support this claim, we evaluate the quality of the pseudo-labels generated on the LaSOT dataset using two metrics. As shown in Table 1, we report results under varying annotation sparsity. “IoU” denotes the Intersection over Union between pseudo-labels and ground-truth annotations, while “Target ratio” measures the proportion of the target included in the search region cropped based on the pseudo-label. We observe that while the IoU between pseudo-labels and ground truth drops significantly with increased sparsity (e.g., from 0.679 at sparsity 30 to 0.445 at sparsity 150), the target ratio remains consistently high. Even at a sparsity level of 150, the target ratio exceeds 0.9, indicating that the cropped search regions still contain nearly complete target information. This property fits our method well, as the CSA module only requires the target to be present in the search region, not precisely localized. Thus, pseudo-labels from linear interpolation are sufficiently reliable, enabling effective weakly supervised training.
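The two metrics can be evaluated along the following lines; the search-region crop factor below is an illustrative assumption (trackers typically crop a region several times larger than the target), not necessarily the value used in the paper.

```python
def iou(a, b) -> float:
    """Intersection over Union of two [x, y, w, h] boxes."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    return inter / (a[2] * a[3] + b[2] * b[3] - inter + 1e-9)


def target_ratio(pseudo, gt, crop_factor: float = 4.0) -> float:
    """Fraction of the ground-truth box contained in the search region
    cropped around the pseudo-label (crop_factor is an illustrative choice)."""
    cx, cy = pseudo[0] + pseudo[2] / 2.0, pseudo[1] + pseudo[3] / 2.0
    side = crop_factor * max(pseudo[2], pseudo[3])
    sx1, sy1, sx2, sy2 = cx - side / 2, cy - side / 2, cx + side / 2, cy + side / 2
    gx2, gy2 = gt[0] + gt[2], gt[1] + gt[3]
    iw = max(0.0, min(sx2, gx2) - max(sx1, gt[0]))
    ih = max(0.0, min(sy2, gy2) - max(sy1, gt[1]))
    return (iw * ih) / (gt[2] * gt[3] + 1e-9)
```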
3.3. CNN-Based Weakly Supervised Tracking
Taking the baseline tracker SiamRPN++ [5] as an example, we present how we hierarchically integrate our CSA module into a CNN-based tracking framework to achieve weakly supervised training, as shown in Figure 4.
For the baseline tracker SiamRPN++, the inputs consist of an initial template image $z$ and a current search image $x$. A CNN-based backbone network is employed to extract their multi-level feature maps, which are then fed into the corresponding region proposal networks (RPN blocks in Figure 4). After fusing the outputs from multiple RPN blocks, the classification and regression branches give a dense prediction of the target. The former is in charge of choosing the best proposal by classifying foreground and background, while the latter is used to refine the proposal and estimate the offsets of the anchor boxes. This process can be formulated as

$$A^{cls}_{w \times h \times 2k} = [\varphi(x)]_{cls} \star [\varphi(z)]_{cls}, \qquad A^{reg}_{w \times h \times 4k} = [\varphi(x)]_{reg} \star [\varphi(z)]_{reg},$$

where $\varphi(\cdot)$ represents the feature maps extracted from the backbone network, $k$ represents the number of predefined anchors, and $\star$ is a convolution operation on the feature maps of the search region, with the template features as the convolution kernel.
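The $\star$ operation is commonly realized as a depth-wise cross-correlation in SiamRPN++-style trackers; the sketch below shows one standard way to implement it and is not taken from the authors' code.

```python
import torch
import torch.nn.functional as F


def xcorr_depthwise(search: torch.Tensor, kernel: torch.Tensor) -> torch.Tensor:
    """Depth-wise cross-correlation: template features act as per-channel
    convolution kernels slid over the search-region features.

    search: (B, C, Hs, Ws) search features, kernel: (B, C, Hk, Wk) template features.
    Returns (B, C, Hs-Hk+1, Ws-Wk+1) response maps.
    """
    b, c, hk, wk = kernel.shape
    out = F.conv2d(
        search.reshape(1, b * c, search.shape[2], search.shape[3]),  # fold batch into channels
        kernel.reshape(b * c, 1, hk, wk),
        groups=b * c,
    )
    return out.reshape(b, c, out.shape[2], out.shape[3])
```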
Different from the baseline, the inputs of our model contain three different search regions, which are cropped from the initial frame, an unlabeled intermediate frame, and the current search frame, denoted as $x_{0}$, $x_{m}$, and $x_{t}$, respectively. It is worth noting that the intermediate frames are cropped based on the generated pseudo-labels. Each of these three search regions is processed by a shared search branch to extract feature maps. As shown in Figure 4, multiple CSA modules are hierarchically inserted after different layers of the backbone network. The features of the corresponding layers output by the three search branches are input to our CSA modules. For the current search branch, the input of the next backbone layer is replaced by the output of the corresponding CSA module, i.e., the enhanced feature maps of the current search image. After integrating our CSAs, the tracking process can be reformulated as

$$A^{cls}_{w \times h \times 2k} = \big[\mathrm{CSA}\big(\varphi(x_{t}), \varphi(x_{0}), \varphi(x_{m})\big)\big]_{cls} \star [\varphi(z)]_{cls}, \qquad A^{reg}_{w \times h \times 4k} = \big[\mathrm{CSA}\big(\varphi(x_{t}), \varphi(x_{0}), \varphi(x_{m})\big)\big]_{reg} \star [\varphi(z)]_{reg}.$$
We can see that, after integrating our CSA modules, the tracking framework is able to use both labeled data (i.e., the initial and current search frames) and unlabeled data (i.e., intermediate frames), thereby constructing a weakly supervised training environment. The precisely labeled data provide accurate target information and guide the model optimization by calculating precise losses, while the unlabeled data are mainly utilized to calculate the co-salient information across multiple frames and then enhance the current search features.
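The hierarchical insertion can be pictured with the following schematic forward pass over the three shared search branches; the stage list, the CSA call signature, and the branch names are hypothetical, intended only to illustrate how the current-branch features are replaced after each enhanced stage.

```python
def forward_search_branches(backbone_stages, csa_modules, x_init, x_mid, x_cur):
    """Schematic hierarchical integration of CSA modules into a CNN backbone.

    backbone_stages: shared backbone stages (e.g., ResNet layers).
    csa_modules: one CSA module per stage, or None where no CSA is inserted.
    x_init, x_mid, x_cur: search regions cropped from the initial frame, an
    unlabeled intermediate frame (via its pseudo-label), and the current frame.
    """
    f_init, f_mid, f_cur = x_init, x_mid, x_cur
    for stage, csa in zip(backbone_stages, csa_modules):
        f_init, f_mid, f_cur = stage(f_init), stage(f_mid), stage(f_cur)
        if csa is not None:
            # only the current-frame features are replaced by the enhanced output;
            # the other two branches keep their original features
            f_cur = csa([f_init, f_mid, f_cur], t=2)
    return f_cur  # fed to the RPN blocks together with the template features
```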
3.4. Transformer-Based Weakly Supervised Tracking
Due to their powerful global representation ability, Transformer-based tracking methods have gradually become mainstream. To demonstrate the generalization ability of our method on Transformer-based tracking frameworks, we choose the one-stream Transformer-based tracker OSTrack [6] as the baseline and integrate our CSA modules into its framework. In OSTrack, the template and search region are first split into multiple patches of a fixed size. Each patch is treated as a processing unit of the Transformer network, namely a token. After passing through a patch embedding layer, the template tokens and search tokens are concatenated together and iteratively fed into multiple Transformer blocks (TF blocks), as shown in the top row of Figure 5. After executing self-attention and cross-attention in these TF blocks, the search tokens output from the last TF block are sent to a head module to predict the target state.
Similar to the CNN-based integration, we employ three shared backbone branches to encode three types of tokens along with the template tokens, namely initial tokens, intermediate tokens, and search tokens. These tokens are processed through a series of TF blocks, and three CSA modules are integrated after specific TF blocks to facilitate information interaction and enhancement among the three branches. Specifically, the output tokens from the current TF block in each of the three branches are reshaped into feature maps, which are then passed to the corresponding CSA module for co-saliency calculation. The CSA module enhances the search tokens by incorporating additional context and information from the other branches. The enhanced search tokens then replace the original search tokens, and the Transformer encoding continues with the updated tokens, as shown by the orange tokens in Figure 5.
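One enhancement step between TF blocks could look as follows; the token layout (template tokens first, then search tokens), the CSA call signature, and the helper name are assumptions made for illustration.

```python
import torch


def enhance_search_tokens(csa, tokens_init, tokens_mid, tokens_cur,
                          num_template: int, h: int, w: int) -> torch.Tensor:
    """Sketch of one CSA step between Transformer blocks.

    tokens_*: (B, num_template + H*W, D) token sequences of the three branches.
    Returns tokens_cur with its search part replaced by CSA-enhanced tokens.
    """
    def to_map(tokens):
        # drop the template tokens and reshape the search tokens to (B, D, H, W)
        search = tokens[:, num_template:, :]
        return search.transpose(1, 2).reshape(tokens.shape[0], -1, h, w)

    enhanced = csa([to_map(tokens_init), to_map(tokens_mid), to_map(tokens_cur)], t=2)
    enhanced = enhanced.flatten(2).transpose(1, 2)       # back to (B, H*W, D)
    # keep the template tokens, replace only the search tokens of the current branch
    return torch.cat([tokens_cur[:, :num_template, :], enhanced], dim=1)
```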