This section details our methodology. Section 3.1 describes the overall framework of our tracker. Sections 3.2 and 3.3 then introduce the global coordinate-aware enhancement (GCAE) module and the embedding feature aggregation (EFA) module, respectively. Lastly, Section 3.4 briefly describes the online matching strategy.
3.2. Global Coordinate-Aware Enhancement
The effectiveness of the attention mechanism for enhancing network learning is well established. However, attention constrained to specific regions has difficulty capturing interactive information across a global context. Global information helps the network comprehend the context and extract useful features. In object detection in particular, it is crucial to understand the global positional relationship between targets [42]. Therefore, to strengthen the representation power of the detection feature map and improve detection accuracy, we introduce the GCAE module, a pivotal component of the asymmetric feature enhancement network.
The specific structure of the GCAE module is illustrated in Figure 3. The module takes the shared feature map output by the feature extractor as input and generates a detection-specific feature map with an enhanced representation. In particular, we apply global max pooling and global average pooling along both the row direction and the column direction of the shared feature map to obtain channel-based global coordinate attention. For the c-th channel, global max pooling therefore yields one output along the row direction and one along the column direction, and global average pooling likewise yields one output along each of the two directions.
The four tensors obtained through these pooling operations are then concatenated along the channel dimension to produce the global coordinate-aware channel information Z. Fusing the four tensors with distinct spatial orientations and contextual details enables Z to capture comprehensive interdependencies in the spatial dimension while maintaining sensitivity to the region of interest.
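The four directional pooling operations can be illustrated with a minimal sketch. It assumes a feature map stored as nested lists in a C × H × W layout and interprets row-direction pooling as producing one value per row and column-direction pooling as producing one value per column; the function name `directional_pooling` and this axis convention are assumptions made for illustration, not the module's actual implementation.

```python
from statistics import fmean


def directional_pooling(feature):
    """Pool a feature map indexed as feature[c][h][w] (assumed C x H x W layout).

    Returns four per-channel results: row-wise max, column-wise max,
    row-wise average, and column-wise average.
    """
    row_max, col_max, row_avg, col_avg = [], [], [], []
    for channel in feature:
        # Assumed convention: row-direction pooling gives one value per row.
        row_max.append([max(row) for row in channel])
        row_avg.append([fmean(row) for row in channel])
        # Transpose the channel so that each tuple is one column.
        columns = list(zip(*channel))
        col_max.append([max(col) for col in columns])
        col_avg.append([fmean(col) for col in columns])
    return row_max, col_max, row_avg, col_avg


# Toy input with C=1, H=2, W=3.
feature = [[[1.0, 2.0, 3.0],
            [4.0, 5.0, 6.0]]]
row_max, col_max, row_avg, col_avg = directional_pooling(feature)
```

Each row-wise result has length H and each column-wise result has length W per channel, so the four outputs keep positional information along one axis while summarizing the other.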
After that, we obtain the refined global coordinate-aware channel information from Z. This is accomplished through a tailored transformation that adaptively reweights position information with diverse characteristics while reducing the dimensionality to match the original feature dimension. The transformation combines a learnable matrix, a standard 1 × 1 convolutional layer, batch normalization, and the rectified linear unit (ReLU). We then split the refined information into two distinct vectors along the height and width directions, and apply a separate (non-shared) 1 × 1 convolutional layer followed by a non-linear activation function to each vector. Lastly, one vector is expanded horizontally and the other is expanded vertically, and the resulting detection-specific feature map is obtained from the expanded vectors.
The proposed GCAE module allows the detection branch to prioritize the regions of the feature map where objects are located. In contrast to the standard global pooling process, which merely condenses global spatial information into a channel vector, our approach takes into account both the direction and the type of the pooling operation. This allows the detection-specific feature map to establish long-range interactions, thereby assisting the detection head in recognizing and locating targets.
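To make the expansion of the two direction-wise vectors concrete, the following sketch broadcasts a per-row weight and a per-column weight over a feature map and multiplies them element-wise with it. The multiplicative combination is an assumption borrowed from common coordinate-attention designs, and the name `apply_directional_weights` is hypothetical; the exact operation used by the GCAE module may differ.

```python
def apply_directional_weights(feature, weight_h, weight_w):
    """Reweight feature[c][h][w] by weight_h[c][h] and weight_w[c][w].

    Assumption: the height-wise weight is repeated along the width and the
    width-wise weight is repeated along the height, then both are multiplied
    element-wise with the input feature map.
    """
    return [
        [
            [
                feature[c][h][w] * weight_h[c][h] * weight_w[c][w]
                for w in range(len(feature[c][h]))
            ]
            for h in range(len(feature[c]))
        ]
        for c in range(len(feature))
    ]


# Toy input with C=1, H=2, W=2.
feature = [[[1.0, 2.0], [3.0, 4.0]]]
weight_h = [[0.5, 1.0]]  # one weight per row
weight_w = [[1.0, 0.5]]  # one weight per column
out = apply_directional_weights(feature, weight_h, weight_w)
```

Under this assumption, a position is emphasized only when both its row weight and its column weight are large, which is how two one-dimensional vectors can localize a region in two dimensions.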
3.3. Embedding Feature Aggregation
In this section, we introduce a novel EFA module, specifically designed to enhance high-level features for the ReID branch. This module aims to create a more distinctive task-related feature map, and it serves as a crucial component of the asymmetric feature enhancement network.
As depicted in Figure 4, the EFA module takes three inputs, namely the detection heatmap, the pseudo-Gaussian heatmap, and the shared feature map, and generates the ReID-specific feature map. The detection heatmap is generated by the detection head and contains the predicted center positions of the objects. Each target category is represented by a dedicated layer of the detection heatmap, with the number of channels K indicating the number of categories. As shown in the first stage of Figure 4, the non-maximum suppression (NMS) algorithm is applied to the detection heatmap to eliminate redundant or overlapping targets. It is implemented via a standard 3 × 3 max pooling layer for improved efficiency [24].
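The max-pooling form of NMS can be sketched as follows: a heatmap value is kept only if it equals the maximum of its 3 × 3 neighbourhood, which is what comparing a heatmap against its stride-1, 3 × 3 max-pooled version achieves. The function name `heatmap_nms` and the pure-Python, single-layer formulation are illustrative assumptions; a practical implementation would use a tensor library's pooling operator.

```python
def heatmap_nms(heat):
    """Keep only the local maxima of one 2-D heatmap layer (heat[y][x])."""
    height, width = len(heat), len(heat[0])
    kept = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # 3 x 3 neighbourhood, clipped at the borders.
            neighbourhood = [
                heat[j][i]
                for j in range(max(0, y - 1), min(height, y + 2))
                for i in range(max(0, x - 1), min(width, x + 2))
            ]
            # A peak survives only if it is the maximum of its neighbourhood.
            if heat[y][x] == max(neighbourhood):
                kept[y][x] = heat[y][x]
    return kept


heat = [[0.1, 0.2, 0.1],
        [0.2, 0.9, 0.3],
        [0.1, 0.3, 0.2]]
peaks = heatmap_nms(heat)
```

In this toy example only the central response survives, because every other cell has the larger central value inside its neighbourhood.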
Following the NMS operation, which extracts the peak keypoints from each layer of the detection heatmap, we construct the pseudo-Gaussian heatmap directly from the bounding box annotations. Each value of the pseudo-Gaussian heatmap is determined by the pixel coordinates on the heatmap, the center point of the annotated bounding box, and a standard deviation that controls the Gaussian radius. The layers of the pseudo-Gaussian heatmap correspond to those of the detection heatmap, and each layer represents an object category.
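A heatmap of this kind can be sketched with an isotropic Gaussian centered on each annotated box center. The exact expression, the use of a single fixed standard deviation, and the element-wise maximum where objects overlap are assumptions in the style of keypoint-based detectors; `gaussian_heatmap` is a hypothetical helper for one category layer.

```python
import math


def gaussian_heatmap(height, width, centers, sigma):
    """Render one heatmap layer with a Gaussian bump at each (cx, cy) center.

    Assumed form: exp(-((x - cx)^2 + (y - cy)^2) / (2 * sigma^2)).
    """
    heat = [[0.0] * width for _ in range(height)]
    for cx, cy in centers:
        for y in range(height):
            for x in range(width):
                squared_dist = (x - cx) ** 2 + (y - cy) ** 2
                value = math.exp(-squared_dist / (2.0 * sigma ** 2))
                # Where bumps overlap, keep the stronger response (assumption).
                heat[y][x] = max(heat[y][x], value)
    return heat


heat = gaussian_heatmap(5, 5, centers=[(2, 2)], sigma=1.0)
```

The response is 1 at the center and decays with distance, so a larger standard deviation widens the region that the heatmap marks as belonging to the object.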
As shown in the second stage of Figure 4, we use the pseudo-Gaussian heatmap to restrict the feature representation area in each layer of the detection heatmap. This effectively highlights the region of interest and mitigates background interference. We multiply the two heatmaps element-wise and sum the result along the channel dimension to create a unified aggregated feature map, denoted as M.
We focus on the fine-grained semantic information within M, employing it as prior knowledge to guide the enhancement of the feature representation in the ReID branch. Incorporating pixel-level priors enables the ReID branch to concentrate on potential target areas during training, which improves the discriminative power and robustness of the identity embeddings extracted at inference. Therefore, in the third stage depicted in Figure 4, M is used as an attentional weight to produce the ReID-specific feature map.
In summary, we integrate the EFA module into the ReID branch to enhance the reliability of the target identity. In contrast to adaptive optimization methods that rely on deep convolutional networks, we employ a task-guided learning strategy to facilitate pixel-level learning in the ReID branch. During training, both the pseudo-Gaussian and detection heatmaps are used together as attentional weights in the third stage to update the embedding feature. During inference, however, the pseudo-Gaussian heatmap is not available, since it is generated from bounding box annotations. This design allows the EFA module to balance network generalization against performance.
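The second-stage aggregation, element-wise multiplication of the two heatmaps followed by summation along the channel dimension, can be sketched as below. The nested-list K × H × W layout and the name `aggregate_heatmaps` are assumptions for illustration only.

```python
def aggregate_heatmaps(det_heat, gauss_heat):
    """Combine det_heat[k][y][x] and gauss_heat[k][y][x] into one map M[y][x].

    Each category layer of the detection heatmap is masked by the matching
    layer of the pseudo-Gaussian heatmap, and the K masked layers are summed.
    """
    num_classes = len(det_heat)
    height, width = len(det_heat[0]), len(det_heat[0][0])
    return [
        [
            sum(det_heat[k][y][x] * gauss_heat[k][y][x] for k in range(num_classes))
            for x in range(width)
        ]
        for y in range(height)
    ]


# Toy input with K=2 categories, H=1, W=2.
det_heat = [[[0.5, 0.0]], [[0.0, 1.0]]]
gauss_heat = [[[1.0, 0.5]], [[0.5, 0.5]]]
M = aggregate_heatmaps(det_heat, gauss_heat)
```

Because the sum runs over categories, M is a single-channel map regardless of K, which makes it convenient to apply as a spatial weight.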
3.4. Online Matching Strategy
In this section, we explain the matching strategy used to associate detected objects across consecutive frames. Our approach is based on the cascade matching strategy introduced in MOTDT [43] and incorporates the method from ByteTrack [37] to make full use of low-scoring detection results. The pseudocode of the online matching strategy is shown in Algorithm 1.
The association process takes as input a multi-frame video sequence in which each frame contains the detection results and their corresponding identity embeddings, with N denoting the number of detected targets. Additionally, we define two detection thresholds to categorize the detection results. The output of the association process is the set of tracks.
For each frame, we employ the Kalman filter to predict the positional status of the tracks in the current frame (lines 3 to 4). We initialize both the low-scoring and the high-scoring detection results along with their corresponding identity information, and subsequently categorize the detection results based on the detection thresholds (lines 5 to 11).
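The categorization step can be sketched as a simple split by confidence score. The dictionary-based detection format, the threshold values in the example, and the choice to discard detections below the lower threshold are assumptions in the spirit of ByteTrack, not settings reported here.

```python
def split_detections(detections, high_thresh, low_thresh):
    """Split detections into high-scoring and low-scoring groups by score."""
    high, low = [], []
    for det in detections:
        if det["score"] >= high_thresh:
            high.append(det)
        elif det["score"] >= low_thresh:
            low.append(det)
        # Assumption: anything below low_thresh is dropped as noise.
    return high, low


detections = [
    {"id": "a", "score": 0.9},
    {"id": "b", "score": 0.3},
    {"id": "c", "score": 0.05},
]
# Illustrative threshold values only.
high, low = split_detections(detections, high_thresh=0.5, low_thresh=0.1)
```

Keeping the low-scoring group separate is what allows it to be matched later by motion cues alone, without letting weak detections compete in the appearance-based stage.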
During the cascade matching stage (lines 12 to 26), we first use appearance-based information to establish associations between the predicted tracks and the high-scoring detections. This entails computing the Mahalanobis distance between the predicted tracks and the high-scoring detected bounding boxes. We then combine the Mahalanobis distance with the cosine distance computed from the identity embeddings into a composite distance metric, in which a weighting parameter balances the two terms; it is set to 0.98 in our experiments, following the default setting. The Hungarian algorithm with a matching threshold is used to determine the matching targets. Based on this threshold, a detection is either deemed successfully associated with the corresponding track or kept, together with the unmatched tracks, for the next step.
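The fusion of the two distances can be sketched as a convex combination of cost matrices. Which term receives the weight of 0.98 is an assumption here (the appearance term, as in several public joint detection and embedding trackers), and `fuse_distance` is a hypothetical helper; the resulting matrix would then be passed to a Hungarian solver.

```python
def fuse_distance(cosine_dist, mahalanobis_dist, weight=0.98):
    """Blend two track-by-detection cost matrices of identical shape.

    Assumption: `weight` scales the appearance (cosine) term and
    (1 - weight) scales the motion (Mahalanobis) term.
    """
    return [
        [weight * c + (1.0 - weight) * m for c, m in zip(c_row, m_row)]
        for c_row, m_row in zip(cosine_dist, mahalanobis_dist)
    ]


cosine_dist = [[0.5, 0.9]]        # one track, two detections
mahalanobis_dist = [[10.0, 2.0]]
fused = fuse_distance(cosine_dist, mahalanobis_dist)
```

With a weight this close to 1, appearance dominates the ranking of candidates, while the motion term mainly separates detections whose embeddings are similarly close.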
Secondly, we associate the remaining tracks and detections using the Intersection over Union (IoU) distance, which is based on motion information. The detections and tracks that are still unmatched after this second association are again collected into separate sets of remaining detections and remaining tracks. Lastly, we update the appearance features of the matched targets in each frame to accommodate appearance variations. The updated identity embedding of a matched track in the current frame is computed from the identity embedding of the matched target in the current frame and the identity embedding of the track from the previous frame. Additionally, we initialize new tracks for any detections that fail to correspond with previously identified targets (lines 25–26).
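One common way to realize such an update is an exponential moving average of the track embedding, sketched below. The moving-average form, the momentum value, and the final L2 normalization are assumptions for illustration; `update_embedding` is a hypothetical helper and not necessarily the update rule used in this work.

```python
import math


def update_embedding(track_embedding, detection_embedding, momentum=0.9):
    """Blend the previous track embedding with the newly matched embedding.

    Assumption: new = momentum * previous + (1 - momentum) * current,
    followed by L2 normalization so cosine distances stay comparable.
    """
    blended = [
        momentum * t + (1.0 - momentum) * d
        for t, d in zip(track_embedding, detection_embedding)
    ]
    norm = math.sqrt(sum(v * v for v in blended))
    return [v / norm for v in blended] if norm > 0 else blended


updated = update_embedding([1.0, 0.0], [0.0, 1.0], momentum=0.5)
```

A momentum close to 1 makes the track embedding change slowly, which damps the effect of a single occluded or blurred detection on later matching.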
For the low-scoring detection results, the IoU distance between them and the remaining tracks is used to preserve detections that may be subject to severe occlusion or motion blur. These detections are considered background during the cascade matching stage (lines 27–28).
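The IoU distance used in both motion-based association steps can be sketched as one minus the overlap ratio of two boxes. The corner format (x1, y1, x2, y2) and the name `iou_distance` are assumptions for illustration.

```python
def iou_distance(box_a, box_b):
    """Return 1 - IoU for two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    # Degenerate boxes with zero union are treated as maximally distant.
    return 1.0 - inter / union if union > 0 else 1.0


same = iou_distance((0, 0, 2, 2), (0, 0, 2, 2))
shifted = iou_distance((0, 0, 2, 2), (1, 0, 3, 2))
disjoint = iou_distance((0, 0, 1, 1), (5, 5, 6, 6))
```

Because this distance depends only on box geometry, it remains usable when the identity embedding of an occluded or blurred target is unreliable.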
Finally, the tracks that remain unmatched after the entire matching process are retained for 30 frames in case they reappear (lines 29–30).
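The retention of unmatched tracks can be sketched as a pruning step that runs once per frame. The dictionary-based track format, the `last_seen` field, and the inclusive boundary are assumptions; only the 30-frame window comes from the description above.

```python
MAX_LOST_FRAMES = 30  # retention window stated in the text


def prune_lost_tracks(lost_tracks, current_frame, max_lost=MAX_LOST_FRAMES):
    """Keep lost tracks whose last match is at most `max_lost` frames old."""
    return [
        track
        for track in lost_tracks
        if current_frame - track["last_seen"] <= max_lost
    ]


lost_tracks = [
    {"id": 1, "last_seen": 10},  # unmatched for 50 frames at frame 60
    {"id": 2, "last_seen": 50},  # unmatched for 10 frames at frame 60
]
kept = prune_lost_tracks(lost_tracks, current_frame=60)
```

Tracks that survive pruning stay available for re-association, so a target that reappears within the window can recover its original identity.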
Algorithm 1: Pseudocode of the online matching strategy.