Article

Object Part-Aware Attention-Based Matching for Robust Visual Tracking

Janghoon Choi
Graduate School of Data Science, Kyungpook National University, Daegu 41566, Republic of Korea
Signals 2025, 6(3), 47; https://doi.org/10.3390/signals6030047
Submission received: 14 July 2025 / Revised: 29 August 2025 / Accepted: 4 September 2025 / Published: 10 September 2025
(This article belongs to the Special Issue Recent Development of Signal Detection and Processing)

Abstract

In this paper, we propose a novel visual tracking method with an object part-aware attention-based matching (OPAM) mechanism, which leverages local–global attention to enhance visual tracking performance. Our method introduces three key components: (1) a local part-aware global self-attention mechanism that embeds rich contextual information among candidate regions, enabling the model to capture mutual dependencies and relationships effectively; (2) a local part-aware global cross-attention mechanism that injects target-specific information into candidate region features, improving the alignment and discrimination between the target and the background; and (3) a global cross-attention mechanism that extracts holistic object information from the target–search feature context for further discriminability. By integrating these attention modules, our approach achieves robust feature aggregation and precise target localization. Extensive experiments on large-scale tracking benchmarks demonstrate that our method achieves competitive accuracy and robustness, particularly under challenging scenarios such as occlusion and appearance changes, while running at real-time speeds.

1. Introduction

Visual object tracking is a fundamental problem in computer vision with diverse applications, including autonomous driving, surveillance, and human–computer interaction (HCI). Visual tracking involves estimating the trajectory of a target object across consecutive video frames given the initial target state, under challenging conditions such as occlusions, appearance changes, scale variations, and cluttered backgrounds. Despite recent advancements in deep learning-based tracking methods, achieving robust and accurate performance in dynamic real-world scenarios remains an unsolved challenge. The ability to effectively distinguish the target from its background while adapting to variations in scale, pose, and illumination is essential for reliable visual tracking. However, traditional approaches often fall short in capturing the intricate relationships between the target and its surrounding context regions, particularly in complex scenes containing multiple distractor objects with similar appearances.
One of the primary challenges in visual tracking lies in balancing local and global contextual information. While local part-based feature representations are crucial for capturing fine-grained details of the target, global feature representations provide essential context for understanding dependencies across the entire scene. Conventional cross-correlation-based matching approaches often focus on only one of these aspects, leading to suboptimal performance when handling scenarios involving occlusion, distractors, or abrupt appearance changes. Although attention mechanisms have emerged as powerful tools for enhancing feature representation in computer vision tasks, their potential for integrating local and global information simultaneously in visual tracking needs further investigation.
In this paper, we propose a novel visual tracking framework which involves object part-aware attention-based matching (OPAM) to address the aforementioned challenges. Our approach introduces novel attention modules that aim to enhance local–global feature aggregation and improve target classification and localization: (1) a local part-aware self-attention mechanism that captures rich contextual relationships among candidate regions, enabling the model to understand scene-level dependencies for enhanced discriminability; (2) a local part-aware cross-attention mechanism that injects target-specific information into candidate region features for precise classification among distractor objects; and (3) a global cross-attention mechanism that extracts comprehensive object holistic information from the target-search context. By incorporating these aforementioned components, OPAM aims to achieve a balance between sensitivity to local part-aware details and global scene-aware contextual awareness, resulting in improved robustness and discrimination between the target and the background regions.
In order to validate the effectiveness of our proposed approach, we conduct extensive experiments on the widely-used large-scale visual tracking benchmarks LaSOT [1] and GOT-10k [2]. The experimental results demonstrate that our proposed tracker incorporating OPAM achieves competitive performance in terms of both accuracy and robustness compared to state-of-the-art methods while running at real-time speeds. Notably, our method performs well under challenging conditions such as occlusion, distractors, and significant appearance variations. The motivation for our proposed tracking framework is illustrated in Figure 1.
In summary, our contributions are as follows: (1) We propose a novel attention-based framework for visual tracking that integrates part-aware local self-attention, local cross-attention, and global cross-attention mechanisms to achieve robust feature aggregation and modulation. (2) Our method effectively balances sensitivity to local part-based detail with discriminability in a global scene-aware context to improve target localization under diverse conditions. (3) Extensive experiments on standard visual tracking benchmarks demonstrate that our approach performs competitively against conventional methods in terms of accuracy and robustness, particularly under challenging scenarios. (4) Moreover, the lightweight design of our tracking model enables real-time operation, allowing wider application to other computer vision tasks.
The remainder of this paper is organized as follows: Section 2 reviews related work on deep learning-based methods for visual tracking and approaches based on attention mechanisms. Section 3 details the proposed OPAM tracking framework with the overall visual tracking framework and its operation. Section 4 presents experimental results including external comparisons and internal ablation experiments. Finally, Section 5 concludes the paper with future research directions and remaining challenges.

2. Related Work

The visual tracking field has seen significant advancements in recent years, particularly with the adoption of deep learning-based modeling techniques. Siamese network-based trackers have emerged as a dominant paradigm, with SiamFC [3] and SiamRPN [4] pioneering this approach. These methods formulate tracking as a similarity learning problem, comparing a template of the target object with candidate regions in subsequent frames. Subsequent works such as SiamRPN++ [5] and SiamBAN [6] have further improved performance by incorporating deeper backbone networks and more sophisticated region proposal mechanisms.
Another line of research focuses on online learning and model updating to adapt to appearance changes. ATOM [7] introduced an online optimization framework for accurate target state estimation, while DiMP [8] extended this approach with a discriminative model prediction architecture. These methods demonstrate improved robustness to appearance variations but can struggle with long-term tracking scenarios.

2.1. Attention Mechanisms in Visual Tracking

Attention mechanisms have shown great promise in various computer vision tasks, including object tracking. TransT [9] introduced a Transformer [10]-based architecture for visual tracking, leveraging self-attention to model long-range dependencies in feature maps. STARK [11] further extended this idea by incorporating a Transformer decoder for target localization. These methods demonstrate the potential of attention mechanisms for capturing global context in tracking scenarios, and subsequent methods improve upon these approaches using autoregressive sequence modeling [12,13,14]. Local attention has also been explored in the context of visual tracking. AlphaRefine [15] proposed a local self-attention mechanism to refine target localization, while TrDiMP [16] incorporated a target-aware attention module to enhance feature discrimination. However, these approaches often focus on either local or global attention, without fully integrating both aspects.

2.2. Object Part-Based Approaches

Part-based tracking methods have shown effectiveness in handling partial occlusions and deformations. SPM-Tracker [17] proposed a structure-preserving object matching framework that leverages part-level features for robust tracking. These approaches demonstrate the potential of part-aware representations for improving tracking performance under challenging conditions. The integration of local and global features has been explored in various computer vision tasks. In object detection, DETR [18] demonstrated the effectiveness of global attention for feature aggregation. For visual tracking, GlobalTrack [19] proposed a global instance search framework to handle full occlusions and out-of-view scenarios. However, the potential of combining local and global attention mechanisms for visual tracking remains underexplored.

2.3. Local–Global Context Integration for Visual Tracking

The integration of transformer-based architectures with convolutional networks has opened new avenues for contextual reasoning in tracking. Vision Transformers (ViTs) [20] demonstrated the efficacy of self-attention for capturing long-range dependencies, inspiring trackers like TransT [9], which fused CNN (Convolutional Neural Network) features with transformer encoders for global matching. While effective, pure transformer architectures often overlooked local spatial details critical for precise localization. Hybrid models, such as Deformable Siamese Attention Networks [21], combined deformable convolutions with cross-attention to preserve local geometric information while aggregating global context. Similarly, SiamAttnAT [22] employed temporal attention for adaptive template updates but lacked explicit mechanisms to link local part features with scene-level semantics. Our work advances these efforts by introducing a tripartite attention framework that concurrently models: (1) part-level contextual relationships via local self-attention, (2) target–candidate interactions through local cross-attention, and (3) holistic scene understanding via global cross-attention, enabling synergistic local–global feature integration unmatched by previous approaches.
Our proposed OPAM framework builds upon these previous works by introducing a novel combination of local self-attention, local cross-attention, and global cross-attention mechanisms for target–candidate feature aggregation and modulation. Our approach aims to leverage the strengths of both local and global feature representations, addressing the limitations of existing visual tracking methods in handling complex tracking scenarios, including videos with object deformations, scale changes, out-of-plane rotations, and distractor objects with similar appearances.

3. Proposed Method

In this section, we describe the details of our proposed object part-aware attention-based matching tracker. The overall framework largely consists of two stages: (1) feature extraction and region proposal stage, and (2) part-aware local–global matching and region classification stage. In the following subsections, we first introduce the overall flow of our tracking algorithm, and we describe the details of each stage, including the proposed OPAM scheme for part-aware context integration. Finally, we provide the details on the training process and specify the implementation details, comprising architectural design, training data, and hyperparameters. Figure 2 shows the overview of the proposed visual tracking process, with the aforementioned stages for region proposal and region classification.

3.1. Overview of the Tracking Framework

In the first stage of the proposed tracking framework, feature extraction from the input frames is performed using a feature extraction model. Input frames I_z, I_x ∈ R^{H×W×3} are provided from a video sequence, where I_z is the initial query frame, which contains the target object and its bounding box coordinates b_1 ∈ R^4, and I_x is the current search frame in which we wish to find the bounding box coordinates of the target object. Each image is a tensor consisting of H rows and W columns with RGB channels. Feature maps are obtained from the input images I_z, I_x using the backbone feature extractor network ϕ(·) as in
F_z = ϕ(I_z),  F_x = ϕ(I_x),   (1)
where the output query feature map F_z ∈ R^{h×w×c} and the search feature map F_x ∈ R^{h×w×c} have spatial sizes of height h and width w, with c channels. The backbone feature extractor network ϕ(·) can be any deep neural network pretrained on a large-scale image classification dataset, with the final linear classification layers removed to obtain the intermediate feature representation. To achieve real-time processing speed in our experiments, we chose a lightweight CNN-based feature extractor based on ResNet-18 [23], where the stride of the last residual block is modified from 2 to 1 for a higher spatial resolution of the output feature maps. Also, to prevent overfitting and ensure faster convergence, the parameters of all layers except the last residual block are fixed during the training stage. Additional technical details on the implementation are given in Section 3.4.
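As a concrete illustration of this step, the following PyTorch sketch builds such a backbone from a pretrained ResNet-18, reduces the stride of the last residual stage from 2 to 1, and freezes all earlier layers; the class name and layer handling are illustrative assumptions rather than the released implementation.

```python
import torch.nn as nn
import torchvision

class ResNet18Backbone(nn.Module):
    """Backbone feature extractor phi(.) as described above: ResNet-18 with the stride
    of the last residual stage changed from 2 to 1, and all layers except that stage frozen.
    (Minimal sketch; class name and layer handling are assumptions, not the paper's code.)"""
    def __init__(self):
        super().__init__()
        net = torchvision.models.resnet18(weights="IMAGENET1K_V1")
        net.layer4[0].conv1.stride = (1, 1)          # keep spatial resolution in the last stage
        net.layer4[0].downsample[0].stride = (1, 1)  # match the shortcut branch
        self.body = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool,
                                  net.layer1, net.layer2, net.layer3, net.layer4)
        for name, p in self.body.named_parameters():
            p.requires_grad = name.startswith("7.")  # train only layer4 (index 7 above)

    def forward(self, frames):        # frames: (B, 3, 480, 720)
        return self.body(frames)      # feature maps: (B, 512, 30, 45), i.e., stride 16
```

With H = 480 and W = 720 (Section 3.4), this backbone produces the h = 30, w = 45 feature maps mentioned below.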
Using the feature maps F_z, F_x obtained from the initial frame and the current frame, region proposal is performed to search for candidate regions that resemble the target object inside the current frame. First, both feature maps are processed with 1 × 1 convolution layers, respectively, for channel reduction (the reduced channel dimension is specified in Section 3.4). Then, using the query feature map F_z along with the initial bounding box coordinates b_1, an adaptive spatial pooling operation [24] is performed to obtain a fixed-size target feature representation z ∈ R^{s×s×c}, where s is the spatial size of the pooled feature representation. Subsequently, using the target feature z as a kernel on the search feature map F_x, a depth-wise cross-correlation operation is performed to obtain a refined search feature map as in
F̂_x = F_x ∗ z,   (2)
where ∗ represents the depth-wise cross-correlation operation with unit stride and zero padding of ⌊s/2⌋ for spatial dimension consistency. The refined search feature map F̂_x ∈ R^{h×w×c} is then processed through the region proposal network (RPN) branches with detection heads similar to [25], where the class label map P ∈ R^{h×w×2} and the box regression map Q ∈ R^{h×w×4} are obtained as in
P = f_c(F̂_x),  Q = f_r(F̂_x),   (3)
where both branches f_c, f_r include two residual blocks, each comprising a convolution layer, a group normalization [26] layer, and a ReLU nonlinearity, followed by a final 1 × 1 convolutional layer that converts the feature maps to the respective output channel dimensions. The class label map P contains binary logit values for each spatial position (i, j), where the corresponding vector P_{i,j} ∈ R^2 indicates whether this spatial position lies inside or outside the target region. Similarly, for a spatial position (i, j) inside the target bounding box, the corresponding vector Q_{i,j} ∈ R^4 indicates the spatial boundaries of the candidate bounding box, where the distances from (i, j) to the four sides (left, top, right, and bottom) are estimated. Finally, based on the positive scores obtained by applying a softmax operation to P and a non-maximum suppression operation using Q, the top-N candidate regions are obtained as bounding boxes {r_1, …, r_N}, with each box r_k = [x_k, y_k, w_k, h_k]^T ∈ R^4 representing the location and size of the bounding box, where x_k, y_k are the center coordinates and w_k, h_k are the width and height.
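The depth-wise cross-correlation of Equation (2) can be sketched with a grouped convolution, a common implementation trick in Siamese trackers; the function name and batching scheme below are assumptions rather than the paper's code.

```python
import torch.nn.functional as F

def depthwise_xcorr(search, kernel):
    """Depth-wise cross-correlation of Equation (2): the pooled target feature acts as a
    per-channel kernel over the search feature map. search: (B, C, h, w), kernel:
    (B, C, s, s); returns (B, C, h, w) using zero padding of s // 2 (s assumed odd)."""
    b, c, h, w = search.shape
    s = kernel.shape[-1]
    x = search.reshape(1, b * c, h, w)            # fold the batch into the channel axis
    k = kernel.reshape(b * c, 1, s, s)            # one filter per (batch, channel) pair
    out = F.conv2d(x, k, padding=s // 2, groups=b * c)
    return out.reshape(b, c, h, w)
```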
Using the candidate regions found in the region proposal stage, the second stage of our proposed tracking framework applies the object part-aware local–global attention operation, followed by a region classification and refinement process that selects a single region as the estimated target region. First, as in the region proposal stage, the query and search feature maps are processed with 1 × 1 convolution layers, respectively, for channel reduction. Then, using the N candidate regions {r_1, …, r_N} obtained from the previous region proposal stage, spatial pooling operations [24] are performed on the search feature map F_x to produce spatially pooled candidate feature representations X = [x_1, …, x_N]^T ∈ R^{N×s×s×c}, where each x_i ∈ R^{s×s×c} is obtained from the candidate region r_i. Afterwards, our proposed object part-aware attention module (OPAM) processes the feature representations X as in
X̂ = OPAM(X, z),   (4)
where X̂ = [x̂_1, …, x̂_N]^T ∈ R^{N×s×s×c} represents the modulated candidate region features; the details of the module's operation are discussed in the following section. Next, using the context-embedded candidate feature representations, region classification and bounding box refinement operations are performed as in
u_i = g_c(x̂_i),  v_i = g_r(x̂_i),   (5)
where vectors of region-wise class logits u_i ∈ R^2 and box refinement values v_i ∈ R^4 are obtained using the branches g_c, g_r. Similar to the aforementioned RPN branches, they also include two residual blocks, each comprising a convolution layer, a group normalization layer, and a ReLU nonlinearity, followed by a final s × s convolutional layer without padding that converts the feature maps to the respective output vectors. Then, by applying a softmax operation to the class logits u_i, each region is classified as target or background, and the k-th candidate region with the maximum positive score is chosen as the output, with the corresponding box r_k. Using the refinement values v_k = [v_k1, v_k2, v_k3, v_k4]^T on the candidate box coordinates r_k = [x_k, y_k, w_k, h_k]^T, the final refinement of the bounding box is performed as in
x′_k = x_k + w_k v_k1,  y′_k = y_k + h_k v_k2,  w′_k = w_k · exp(v_k3),  h′_k = h_k · exp(v_k4),   (6)
where the final target box coordinates r′_k = [x′_k, y′_k, w′_k, h′_k]^T are obtained by shifting the center coordinates proportionally to the original box size, and the box sizes are refined using exponential multipliers as in [25] for larger scale adjustments.
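The decoding step of Equation (6) translates directly into a few lines of code; the helper name below is illustrative.

```python
import math

def refine_box(box, delta):
    """Decode a refined box from a candidate box and its refinement values (Equation (6)).
    box: [x, y, w, h] in center/size format; delta: [v1, v2, v3, v4]."""
    x, y, w, h = box
    v1, v2, v3, v4 = delta
    return [x + w * v1,         # shift center x proportionally to the box width
            y + h * v2,         # shift center y proportionally to the box height
            w * math.exp(v3),   # exponential multiplier for the width
            h * math.exp(v4)]   # exponential multiplier for the height
```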

3.2. Incorporating the Object Part-Aware Attention Module

In this subsection, we elaborate on Equation (4) and describe the details of our proposed OPAM scheme for part-aware context integration. We employ the Transformer [10] encoder and decoder architectures for attention-based local–global context integration. First, the channel dimensions of the input feature representations X and z are reduced to d < c using 1 × 1 convolutions for faster and more memory-efficient subsequent computation. Then, a local part-aware self-attention (LPSA) operation using a Transformer encoder module is performed on the candidate features X, where we first vectorize the individual candidate features x_i and concatenate them row-wise for part-wise tokenization, yielding X ∈ R^{Ns²×d}.
A basic attention block in a Transformer architecture receives query Q ∈ R^{n_q×d}, key K ∈ R^{n_k×d}, and value V ∈ R^{n_k×d} matrices as inputs, where n_q, n_k are the numbers of query and key vectors, and we denote the attention mechanism as Att(Q, K, V) = softmax(QK^T/√d)V, where the column-wise softmax operation generates the attention weights for the values. To implement multi-head self-attention (MSA), we use X as the query, key, and value matrices simultaneously, and the outputs of multiple attention blocks are concatenated to produce the output X_S ∈ R^{Ns²×d} as in
X_S = MSA(X) = Concat({Att(X, X, X)}_{i=1,…,N_H}) W_S,   (7)
where N_H denotes the number of heads and Concat(·) represents tensor concatenation along the channel dimension, followed by dimension reduction using the linear projection matrix W_S ∈ R^{N_H d×d}, i.e., R^{Ns²×N_H d} → R^{Ns²×d}. By employing multiple attention blocks in parallel, various attention weight distributions can be learned in order to focus on different object parts and emphasize the subtle differences between distractor objects with similar appearances.
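A minimal sketch of the attention primitive Att(·) and the multi-head self-attention of Equation (7) is given below; the per-head query/key/value projections are an assumption, since the notation above writes each head simply as Att(X, X, X), and the hyperparameters follow Section 3.4 (d = 128, N_H = 2).

```python
import torch
import torch.nn as nn

def attention(Q, K, V):
    """Scaled dot-product attention, Att(Q, K, V) = softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    weights = torch.softmax(Q @ K.transpose(-2, -1) / d ** 0.5, dim=-1)
    return weights @ V

class MultiHeadSelfAttention(nn.Module):
    """Multi-head self-attention over the N*s*s candidate part tokens (Equation (7)).
    Per-head Q/K/V projections are an assumption; the paper's notation omits them."""
    def __init__(self, d=128, heads=2):
        super().__init__()
        self.qkv = nn.ModuleList([nn.Linear(d, 3 * d) for _ in range(heads)])
        self.proj = nn.Linear(heads * d, d)          # W_S: concatenated heads back to d
    def forward(self, X):                            # X: (N*s*s, d) part-wise tokens
        outs = []
        for head in self.qkv:
            Q, K, V = head(X).chunk(3, dim=-1)       # per-head projections
            outs.append(attention(Q, K, V))
        return self.proj(torch.cat(outs, dim=-1))    # X_S: (N*s*s, d)
```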
Subsequently, using the scene context-embedded candidate feature representation X_S and the part-wise vectorized target feature representation Z ∈ R^{s²×d}, a multi-head cross-attention (MCA) operation is performed to implement target part-aware cross-attention (TPCA). To implement MCA, we use Z as the key and value matrices and X_S as the query matrix, where the output X_C ∈ R^{Ns²×d} is obtained as in
X_C = MCA(X_S, Z) = Concat({Att(X_S, Z, Z)}_{i=1,…,N_H}) W_C,   (8)
where the same number of heads N_H is used, with the linear projection matrix W_C ∈ R^{N_H d×d}. The resulting candidate feature representation X_C is now embedded with information about the target object context, where useful characteristics that help identify the target receive larger attention weights and redundant features receive smaller ones.
Additionally, we employ a global part-agnostic cross-attention (GCA) operation to further emphasize the discriminability between objects. We first reshape the feature representations to restore their spatial dimensions, i.e., the target context-embedded feature representation X_C: R^{Ns²×d} → R^{N×s×s×d} and the target feature representation Z: R^{s²×d} → R^{s×s×d}. Then, spatial average pooling operations are performed to obtain the feature representations X̄_C ∈ R^{N×d} and Z̄ ∈ R^{1×d}, respectively. Using these feature representations, a multi-head cross-attention operation is performed to obtain the output feature X̄_G ∈ R^{N×d} as in
X̄_G = MCA(X̄_C, Z̄) = Concat({Att(X̄_C, Z̄, Z̄)}_{i=1,…,N_H}) W_G,   (9)
where the linear projection matrix W_G ∈ R^{N_H d×d} is used and the output is reshaped by adding singleton dimensions, X̄_G: R^{N×d} → R^{N×1×1×d}; these are used as channel-wise modulators for the candidate features X_C ∈ R^{N×s×s×d} as in
X̂ = X_C ⊕ X̄_G,   (10)
where ⊕ denotes element-wise addition, with the shape of X̄_G automatically broadcast and spatially repeated to match the shape of X_C. Afterwards, the channel dimension of the resulting candidate feature representation X̂ ∈ R^{N×s×s×d} is reverted to c using a 1 × 1 convolution in order to match the dimension of the input X. An overview of the mechanism of our proposed OPAM scheme is illustrated in Figure 3, and the overall step-by-step algorithm for our tracking framework is shown in Algorithm 1.
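Putting the three attention stages together, the following PyTorch sketch mirrors the LPSA → TPCA → GCA pipeline of Equations (7)–(10) using nn.MultiheadAttention; residual connections, normalization layers, and the channel reduction/restoration of the actual implementation are omitted, so this is a shape-level illustration rather than the released code.

```python
import torch
import torch.nn as nn

class OPAM(nn.Module):
    """Shape-level sketch of the OPAM pipeline: LPSA -> TPCA -> GCA -> channel-wise
    modulation (Equations (7)-(10))."""
    def __init__(self, d=128, heads=2):
        super().__init__()
        self.lpsa = nn.MultiheadAttention(d, heads, batch_first=True)
        self.tpca = nn.MultiheadAttention(d, heads, batch_first=True)
        self.gca = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, X, Z):                          # X: (N, s, s, d), Z: (s, s, d)
        N, s, _, d = X.shape
        x = X.reshape(1, N * s * s, d)                # part-wise tokens of all candidates
        z = Z.reshape(1, s * s, d)                    # part-wise tokens of the target
        x_s, _ = self.lpsa(x, x, x)                   # local part-aware self-attention
        x_c, _ = self.tpca(x_s, z, z)                 # target part-aware cross-attention
        X_c = x_c.reshape(N, s, s, d)
        x_bar = X_c.mean(dim=(1, 2)).unsqueeze(0)     # (1, N, d): pooled candidates
        z_bar = Z.mean(dim=(0, 1)).reshape(1, 1, d)   # (1, 1, d): pooled target
        x_g, _ = self.gca(x_bar, z_bar, z_bar)        # global part-agnostic cross-attention
        return X_c + x_g.reshape(N, 1, 1, d)          # broadcast channel-wise modulation
```

For example, with N = 64 candidates, s = 5, and d = 128 (Section 3.4), X has shape (64, 5, 5, 128) and the module returns a tensor of the same shape.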
Algorithm 1: Visual tracking with OPAM
Input   : Video sequence with length L, with frame images { I 1 , I 2 , , I L }
         Initial target bounding box coordinates b 1
Output: Target bounding box coordinates b t for all frames in the video
# Tracker Initialization
  Compute feature F z as in Equation (1)
  Obtain target feature z from F z , using initial box b 1 by ROIAlign [24]
# Tracking for frames t > 1
for t = 2 to L do
  # Region proposal stage
 Compute feature F x as in Equation (1)
 Perform depth-wise cross-correlation as in Equation (2)
 Compute output maps using RPN branches as in Equation (3)
 Obtain top-N candidate boxes { r 1 , , r N }
# OPAM modulation as in Equation (4)
 Extract candidate features X using ROIAlign
 MSA on X as in Equation (7), obtain X S
 MCA between X S and Z as in Equation (8), obtain X C
 Compute spatially pooled X ¯ C and Z ¯
 MCA between X̄_C and Z̄ as in Equation (9), obtain X̄_G
 Modulate X C using X ¯ G as in Equation (10), obtain X ^
# Region classification stage
 Compute output values as in Equation (5)
 Refine box coordinates as in Equation (6)
 Choose the k-th region r k with the highest score
end
Return: r_k as the output of the current frame
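For readers who prefer code, the per-frame loop of Algorithm 1 can be sketched as follows; backbone, roi_align, rpn, opam, classify, and refine_box are placeholder callables corresponding to the stages above, not the released API.

```python
def track_sequence(frames, init_box, backbone, roi_align, rpn, opam, classify, refine_box):
    """Sketch of Algorithm 1: initialize on the first frame, then run the two-stage
    pipeline on every subsequent frame. All callables are illustrative placeholders."""
    F_z = backbone(frames[0])                       # query feature map (Equation (1))
    z = roi_align(F_z, [init_box])                  # pooled target feature z
    outputs = [init_box]
    for frame in frames[1:]:
        F_x = backbone(frame)                       # search feature map (Equation (1))
        boxes = rpn(F_x, z)                         # top-N candidate boxes (Equations (2) and (3))
        X = roi_align(F_x, boxes)                   # pooled candidate features
        X_hat = opam(X, z)                          # OPAM modulation (Equation (4))
        scores, deltas = classify(X_hat)            # region classification (Equation (5))
        k = int(scores.argmax())                    # best-scoring candidate region
        outputs.append(refine_box(boxes[k], deltas[k]))  # final box decode (Equation (6))
    return outputs
```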

3.3. Training the Proposed Tracking Framework

To train our proposed visual tracking framework, we aim to minimize the multi-task loss function, which consists of two loss terms. The first term is for training the region proposal network, and the second term is for training the region classification network along with the OPAM module. In this section, we describe the details for the two aforementioned loss functions and the optimization method for the combined loss function.
First, the loss function for the region proposal network consists of two terms, where the first term L_cf trains the classification branch f_c and the second term L_rf trains the regression branch f_r. The combined loss function L_f can be denoted as
L_f({P, Q}, {P*, Q*}) = (1/hw) Σ_{i,j} L_cf(P_{i,j}, P*_{i,j}) + (λ/N_P) Σ_{i,j} 1{P*_{i,j} = 1} L_rf(Q_{i,j}, Q*_{i,j}),   (11)
where P* ∈ R^{h×w×1} and Q* ∈ R^{h×w×4} are the ground truth classification and regression maps, while P ∈ R^{h×w×2} and Q ∈ R^{h×w×4} are the output maps of the classification and regression branches in Equation (3), respectively. λ is a hyperparameter for balancing the importance of the two terms. N_P denotes the number of positions (i, j) in P* with a positive label P*_{i,j} = 1, which indicates that (i, j) is inside the bounding box of the target object. For the first term L_cf enforced on the classification map P, a focal loss formulation is used as in
L_cf(P_{i,j}, P*_{i,j}) = −P*_{i,j} (1 − P_{i,j})^γ log P_{i,j} − (1 − P*_{i,j}) P_{i,j}^γ log(1 − P_{i,j}),   (12)
where γ is the focusing hyperparameter for sample reweighting. For the second term L_rf enforced on the regression map Q, an IoU (intersection-over-union) loss is enforced only on positions (i, j) with P*_{i,j} = 1 as in
L_rf(Q_{i,j}, Q*_{i,j}) = 1 − |Q_{i,j} ∩ Q*_{i,j}| / |Q_{i,j} ∪ Q*_{i,j}|,   (13)
where |·| denotes the area of a region, ∩ is the intersection operator that returns the overlapping region between two regions, and ∪ is the union operator that returns the combined region of two regions.
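A sketch of the two loss terms in Equations (12) and (13) is shown below; the tensor layouts (probabilities for the focal loss, FCOS-style boundary distances for the IoU loss) are assumptions consistent with the description above.

```python
import torch

def focal_loss(p, target, gamma=2.0):
    """Focal classification loss of Equation (12). p: predicted positive probabilities
    in (0, 1); target: binary labels (1 inside the target box, 0 outside)."""
    eps = 1e-6
    pos = -target * (1 - p).pow(gamma) * torch.log(p + eps)
    neg = -(1 - target) * p.pow(gamma) * torch.log(1 - p + eps)
    return (pos + neg).mean()

def iou_loss(pred, gt):
    """IoU regression loss of Equation (13) for (left, top, right, bottom) boundary
    distances predicted at positive positions. pred, gt: (M, 4) tensors."""
    l, t, r, b = pred.unbind(-1)
    lg, tg, rg, bg = gt.unbind(-1)
    inter = (torch.min(l, lg) + torch.min(r, rg)) * (torch.min(t, tg) + torch.min(b, bg))
    union = (l + r) * (t + b) + (lg + rg) * (tg + bg) - inter
    return (1.0 - inter / union.clamp(min=1e-6)).mean()
```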
Regarding the second loss for the region classification network and the OPAM module, a similar loss function with two terms is enforced on the top-N candidate region features. The first term L_cg trains the region classification branch g_c and the second term L_rg trains the bounding box refinement branch g_r. The combined loss function L_g can be denoted as
L_g({u, u*}, {v, v*}) = (1/N) Σ_i L_cg(u_i, u*_i) + (λ/N_Q) Σ_i 1{u*_i = 1} L_rg(v_i, v*_i),   (14)
where u*_i ∈ R and v*_i ∈ R^4 are the ground truth classification and regression labels for each candidate region, while u_i ∈ R^2 and v_i ∈ R^4 are the outputs of the classification and regression branches in Equation (5), respectively. N_Q denotes the number of positive candidate boxes with IoU scores exceeding τ_P = 0.5, which are labeled as u*_i = 1. For the first term L_cg enforced on the class estimations u, the focal loss formulation in Equation (12) is used. For the second term L_rg enforced on the box refinement values v, the IoU loss formulation in Equation (13) is used.
Using the individual loss terms defined above, we can formulate the overall loss function L_{f,g} as in
L_{f,g}({P, Q}, {P*, Q*}, {u, u*}, {v, v*}) = L_f({P, Q}, {P*, Q*}) + λ_g L_g({u, u*}, {v, v*}),   (15)
where λ_g is a hyperparameter for balancing the importance of the two terms. By minimizing the above loss function using a gradient descent-based optimization method, the region proposal network and the region classification network, along with the proposed OPAM module, can be trained simultaneously using the upstream loss gradients. At the early training stage, candidate region proposals from the RPN are mostly inaccurate, which imposes a strong negative bias on the region classification network and the OPAM module. To alleviate this issue, we first set λ_g = 0 and train only the region proposal network for the initial training epoch. After this stage, we set λ_g > 0 to optimize the region classification network, and the overall tracking framework can be trained end-to-end.
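The staged schedule can be expressed as a simple weighting of the two losses when evaluating Equation (15); the one-epoch warmup follows the text, while the function itself is an illustrative sketch.

```python
def combined_loss(loss_f, loss_g, epoch, lambda_g=1.0, warmup_epochs=1):
    """Equation (15) with the warmup described above: the region classification loss
    is disabled (lambda_g = 0) during the first epoch, then enabled."""
    weight = 0.0 if epoch < warmup_epochs else lambda_g
    return loss_f + weight * loss_g
```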

3.4. Implementation Details

In this subsection, the implementation details and hyperparameter settings for training the proposed tracking framework are clarified, together with the specifications of the training and evaluation settings.
Details of the Architecture and Hyperparameters: For the backbone feature extractor ϕ(·) introduced in Section 3.1, the input frames I_z, I_x are resized to H = 480, W = 720, where the original aspect ratio is preserved by resizing and adding zero-padding on the right and bottom sides. The output feature maps F_z, F_x have spatial sizes of h = 30, w = 45 with channel dimensions of c = 512, which are reduced to c = 256 for subsequent computation. For the target feature representation z pooled from F_z, a spatial size of s = 5 is used.
For the region proposal network branches f_c, f_r, both include two residual blocks with a 1 × 1 convolution layer with input and output channel sizes of c = 256, and a group normalization layer with a group number of 16. To obtain the candidate regions from the output maps P, Q, the softmax operation on P and the non-maximum suppression (NMS) operation on Q are performed using a threshold value of 0.9, yielding the top N = 64 candidate bounding boxes with the highest scores. The region classification network branches g_c, g_r are formulated identically to f_c, f_r. For the proposed OPAM module, the self- and cross-attention modules operate on feature representations reduced to d = 128 dimensions, N_H = 2 attention heads are used across all attention modules, and the depth of each attention module is set to 2.
Details of the Training Process: To train the overall tracking framework, we employ the AdamW [27] algorithm to optimize Equation (15). The training batch size was set to 16 instances of (I_z, I_x, b_1) triplets, the learning rate was set to 10⁻⁴ throughout the training process, and the weight decay hyperparameter was set to 10⁻⁵. For the hyperparameters used in the loss functions, we use γ = 2 for the focal loss and λ = 1, λ_g = 1 for the loss balancing terms.
For the training data, we utilize the training sets of the ImageNetVID [28], LaSOT [1], and GOT-10k [2] datasets in our implementation. To sample a training instance, a video sequence is drawn from the aforementioned training datasets, where the dataset is chosen randomly with probability proportional to its size. From a sampled video sequence, an (I_z, I_x, b_1) triplet is obtained by choosing two random frame indices within the length of the sequence. For the sampled frames I_z, I_x, scale normalization to [0, 1] is first applied, followed by random data augmentations to prevent overfitting, including a horizontal flip (p = 0.5), additive Gaussian noise (σ = 0.05), additive brightness jitter (uniform, |ϵ| ≤ 0.01), Gaussian blur (kernel size 3 to 9), and color jitter (Gaussian, σ = 0.025). All augmentations are applied independently with a probability of 0.25. Afterwards, following the default setting of ResNet architectures, pixel-wise normalization F̄ = (F − μ)/σ with μ = [0.485, 0.456, 0.406] and σ = [0.229, 0.224, 0.225] is performed on both frames. Regarding the software and hardware environments, our framework was implemented in Python 3.9 using the PyTorch 2.5.1 library on Ubuntu 20.04, running on an Nvidia RTX 4090 GPU with 24 GB VRAM and an Intel Core i9 CPU with 128 GB RAM.
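The data augmentation and optimizer settings listed above can be sketched as follows; the torchvision-based pipeline is an approximation (e.g., the blur kernel size is fixed and the horizontal flip is omitted because it must also flip the box annotations), not the exact training code.

```python
import torch
import torchvision.transforms as T

def build_training_tools(model):
    """Approximate augmentation pipeline and optimizer following the settings above."""
    aug = T.Compose([
        # additive Gaussian noise, sigma = 0.05
        T.RandomApply([T.Lambda(lambda x: (x + 0.05 * torch.randn_like(x)).clamp(0, 1))], p=0.25),
        # additive brightness jitter, uniform in [-0.01, 0.01]
        T.RandomApply([T.Lambda(lambda x: (x + (0.02 * torch.rand(1) - 0.01)).clamp(0, 1))], p=0.25),
        # Gaussian blur (kernel size fixed to 9 here; the paper samples 3 to 9)
        T.RandomApply([T.GaussianBlur(kernel_size=9)], p=0.25),
        # per-channel color jitter, Gaussian sigma = 0.025
        T.RandomApply([T.Lambda(lambda x: (x + 0.025 * torch.randn(3, 1, 1)).clamp(0, 1))], p=0.25),
        # ImageNet mean/std normalization
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)
    return aug, opt
```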

4. Experiments

Herein, we conduct quantitative and qualitative evaluations of our proposed tracker, using the test sets of the LaSOT [1] and GOT-10k [2] datasets for the experiments. We also carry out external and internal comparisons for our proposed tracking algorithm, where we compare our tracker with other recently proposed algorithms and perform ablation experiments for an in-depth analysis of the contribution of our proposed OPAM scheme under different challenging scenarios.

4.1. Quantitative Evaluation

For the quantitative evaluations, we measure the performance of our tracker on the test set of the LaSOT dataset, which is a large-scale, long-term tracking benchmark containing 280 test sequences with an average sequence length of 84 s (30 fps). It contains video sequences of target objects from 70 common object categories, with dense bounding box annotations for every video frame. We evaluate the tracking algorithms under the conventional one-pass evaluation (OPE) setting, where trackers are initialized with the initial ground truth bounding box and executed sequentially throughout the subsequent frames. The estimated target bounding boxes are then evaluated with the IoU score and the center position error against the ground truth bounding boxes.
Using the frame-wise IoU scores and center position errors for each sequence, we can derive three performance plots: (1) the success plot measures the proportion of box predictions whose IoU exceeds a given threshold (varied over [0, 1]); (2) the precision plot measures the proportion of predicted box centers whose distance to the ground truth center is below a given threshold (0 to 50 pixels); and (3) the normalized precision plot uses a normalized location error, where the center distance is normalized by the square root of the area of the ground truth bounding box (thresholds over [0, 0.5]). From these plots, we obtain a comprehensive evaluation metric by employing the area under the curve (AUC), which concisely summarizes the performance of a given tracking algorithm. First, we compare our proposed tracking algorithm with other recently proposed tracking algorithms and provide the results in Table 1. Our tracker is denoted as OPAM, and we chose other trackers with ResNet-based backbone architectures for a fair comparison, namely GlobalTrack [19], ATOM [7], DiMP [8], SiamRPN++ [5], DASiam [29], and Ocean [30]. The results show that our tracking algorithm performs comparably while running at a real-time speed of 65 fps owing to its lightweight ResNet-18 architecture, and it remains competitive even against trackers with the more complex ResNet-50 backbone, such as GlobalTrack and DiMP-50. For a detailed comparison, the success, precision, and normalized precision plots under varying threshold values are shown in Figure 4. Additionally, we provide further comparisons on the GOT-10k dataset in Table 2, where the performance metrics are the success rates (SR) at overlap thresholds of 0.5 and 0.75 and the average overlap (AO) of the estimated bounding boxes relative to the ground truth. We observe characteristics similar to LaSOT in terms of performance metrics, with our tracker performing comparably to the other tracking algorithms.
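As an illustration of the success-plot AUC used throughout this section, the following sketch computes it from per-frame IoU values under the usual OPE protocol; the threshold sampling is an assumption.

```python
import numpy as np

def success_auc(ious, num_thresholds=21):
    """Success-plot AUC: fraction of frames whose IoU exceeds each threshold in [0, 1],
    averaged over the thresholds (the area under the success curve)."""
    ious = np.asarray(ious, dtype=float)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    success_rates = [(ious > t).mean() for t in thresholds]  # one point per threshold
    return float(np.mean(success_rates))

# Example: a toy sequence where the tracker briefly loses the target in one frame
print(success_auc([0.8, 0.75, 0.6, 0.0, 0.7]))
```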
To further analyze our proposed tracking framework under diverse tracking situations, we report the success plot AUC values for the different challenge attributes of the LaSOT test set in Table 3, where we achieve relatively high performance on the aspect ratio, camera motion, deformation, illumination variation, and background clutter attributes compared to other tracking algorithms. Owing to the object part-based attention mechanism embedded in our tracking framework, we were able to achieve higher performance by utilizing local part-aware information to identify the target object under conditions such as heavy deformation and partial occlusion, and by eliciting global scene-aware information to rule out distractor objects with similar appearances.
Finally, we provide an in-depth analysis to quantify the effect of our proposed OPAM scheme with the ablation experiments in Table 4. The experimental results on the LaSOT test set show that the proposed attention components, namely local part-aware self-attention (LPSA), target part-aware cross-attention (TPCA), and global part-agnostic cross-attention (GCA), each contribute to the tracking performance in terms of the success plot AUC metric when compared to the baseline model without the OPAM scheme. Furthermore, we report the computational complexity of our proposed framework in Table 5, where we list the total number of parameters for each module and the GFLOPs required to quantify the computational overhead. Our tracking framework requires a total of 57.64 GFLOPs when processing the query and search frames simultaneously. However, since the first query frame is processed only once for initialization and its features can be stored in memory throughout the tracking sequence, only new search frames need to be processed during tracking. Therefore, by subtracting 42.04/2 = 21.02 GFLOPs from the total, we require 36.62 GFLOPs per frame, and our tracking framework can run efficiently at real-time speeds.

4.2. Qualitative Evaluation

To provide further insight into the characteristics and performance of our proposed tracking framework, we visualize several output video sequences selected from the LaSOT test set in Figure 5, where we compare our method against the GlobalTrack, SiamRPN++, ATOM, SPLT, and VITAL trackers. The color of each bounding box represents the output of a single tracker, and the frame indices are denoted in yellow in the top-left corner. We visualize the results for four selected sequences, zebra-17, rubicCube-19, fox-5, and kite-6, in each row, where the leftmost column shows the initial frame of each sequence.
The results show that our tracker can successfully handle challenging scenes with partial occlusions, background distractor objects, and object deformations, whereas other trackers can be hijacked by distractor objects of the same class (zebra-17) or with a similar appearance (rubicCube-19). Other trackers also fail to estimate an accurate bounding box for the target object under partial occlusion (fox-5) and deformation (kite-6). Since our proposed tracker includes the OPAM mechanism for local–global context integration, it can successfully locate the target and estimate its bounding box under these circumstances.

5. Conclusions

In this paper, we proposed the object part-aware attention-based matching (OPAM) framework, a novel method designed to enhance visual tracking robustness. By integrating local part-aware self-attention, target-specific cross-attention, and global cross-attention mechanisms, our model successfully captures both fine-grained object details and the broader scene context. Extensive evaluations on the LaSOT benchmark confirmed that our approach provides competitive performance in accuracy and robustness, especially under challenging conditions such as occlusion, deformation, and the presence of distractors while maintaining real-time processing speeds. The success of the proposed attention mechanism underscores its potential for more complex tracking scenarios, and future research could focus on adapting this framework for multi-object tracking or improving its efficiency for broader applications in video understanding.

Funding

This research was supported by Kyungpook National University Research Fund, 2022.

Data Availability Statement

This study uses publicly available datasets. Training and evaluation were performed on LaSOT (http://vision.cs.stonybrook.edu/~lasot/), ImageNetVID (https://www.image-net.org/), and GOT-10k (http://got-10k.aitestunion.com/). All links were accessed and verified on 14 June 2025. The source code and data supporting the conclusions of this article are available from the author on request.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Fan, H.; Lin, L.; Yang, F.; Chu, P.; Deng, G.; Yu, S.; Bai, H.; Xu, Y.; Liao, C.; Ling, H. Lasot: A high-quality benchmark for large-scale single object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  2. Huang, L.; Zhao, X.; Huang, K. GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 1562–1577. [Google Scholar] [CrossRef] [PubMed]
  3. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-Convolutional Siamese Networks for Object Tracking. arXiv 2016, arXiv:1606.09549. [Google Scholar] [CrossRef]
  4. Li, B.; Yan, J.; Wu, W.; Zhu, Z.; Hu, X. High Performance Visual Tracking with Siamese Region Proposal Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018. [Google Scholar]
  5. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  6. Chen, Z.; Zhong, B.; Li, G.; Zhang, S.; Ji, R. Siamese Box Adaptive Network for Visual Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  7. Danelljan, M.; Bhat, G.; Khan, F.S.; Felsberg, M. ATOM: Accurate tracking by overlap maximization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  8. Bhat, G.; Danelljan, M.; Gool, L.V.; Timofte, R. Learning discriminative model prediction for tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  9. Chen, X.; Yan, B.; Zhu, J.; Wang, D.; Yang, X.; Lu, H. Transformer Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8126–8135. [Google Scholar]
  10. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the 2017 Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
  11. Yan, B.; Peng, H.; Fu, J.; Wang, D.; Lu, H. Learning Spatio-Temporal Transformer for Visual Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 10448–10457. [Google Scholar]
  12. Chen, X.; Peng, H.; Wang, D.; Lu, H.; Hu, H. SeqTrack: Sequence to Sequence Learning for Visual Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 14572–14581. [Google Scholar]
  13. Bai, Y.; Zhao, Z.; Gong, Y.; Wei, X. ARTrackV2: Prompting Autoregressive Tracker Where to Look and How to Describe. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 19048–19057. [Google Scholar]
  14. Liang, S.; Bai, Y.; Gong, Y.; Wei, X. Autoregressive Sequential Pretraining for Visual Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 11–15 June 2025; pp. 7254–7264. [Google Scholar]
  15. Yan, B.; Zhang, X.; Wang, D.; Lu, H.; Yang, X. Alpha-Refine: Boosting Tracking Performance by Precise Bounding Box Estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 5289–5298. [Google Scholar]
  16. Wang, N.; Zhou, W.; Wang, J.; Li, H. Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 1571–1580. [Google Scholar]
  17. Wang, G.; Luo, C.; Xiong, Z.; Zeng, W. SPM-Tracker: Series-Parallel Matching for Real-Time Visual Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  18. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  19. Huang, L.; Zhao, X.; Huang, K. GlobalTrack: A Simple and Strong Baseline for Long-term Tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020. [Google Scholar]
  20. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. In Proceedings of the 8th International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  21. Yu, Y.; Xiong, Y.; Huang, W.; Scott, M.R. Deformable Siamese Attention Networks for Visual Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  22. Zhang, B.; Liang, Z.; Dong, W. Siamese Attention Networks with Adaptive Templates for Visual Tracking. Mob. Inf. Syst. 2022, 1, 7056149. [Google Scholar] [CrossRef]
  23. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
  24. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017. [Google Scholar]
  25. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  26. Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  27. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  28. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  29. Zhu, Z.; Wang, Q.; Li, B.; Wu, W.; Yan, J.; Hu, W. Distractor-Aware Siamese Networks for Visual Object Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018. [Google Scholar]
  30. Zhang, Z.; Peng, H.; Fu, J.; Li, B.; Hu, W. Ocean: Object-aware Anchor-free Tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020. [Google Scholar]
  31. Yan, B.; Zhao, H.; Wang, D.; Lu, H.; Yang, X. ‘Skimming-Perusal’ Tracking: A Framework for Real-Time and Robust Long-Term Tracking. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019. [Google Scholar]
  32. Nam, H.; Han, B. Learning Multi-Domain Convolutional Neural Networks for Visual Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015. [Google Scholar]
  33. Valmadre, J.; Bertinetto, L.; Henriques, J.; Vedaldi, A.; Torr, P.H.S. End-To-End Representation Learning for Correlation Filter Based Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  34. Wang, Q.; Zhang, L.; Bertinetto, L.; Hu, W.; Torr, P.H. Fast online object tracking and segmentation: A unifying approach. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  35. Held, D.; Thrun, S.; Savarese, S. Learning to track at 100 fps with deep regression networks. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
  36. Danelljan, M.; Robinson, A.; Khan, F.S.; Felsberg, M. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
  37. Danelljan, M.; Bhat, G.; Khan, F.; Felsberg, M. ECO: Efficient Convolution Operators for Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017. [Google Scholar]
  38. Ma, C.; Huang, J.B.; Yang, X.; Yang, M.H. Hierarchical convolutional features for visual tracking. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015. [Google Scholar]
Figure 1. Motivation for the proposed method.
Figure 2. Overview for the proposed tracking algorithm.
Figure 3. Overview of the operation of the proposed OPAM module.
Figure 4. Success and precision plots for the LaSOT test set.
Figure 5. Qualitative results on the LaSOT test set.
Table 1. Quantitative Results on the LaSOT test set.

| Metric | OPAM | GlobalTrack [19] | ATOM [7] | DiMP-50 [8] | SiamRPN++ [5] | DASiam [29] | SPLT [31] | MDNet [32] | Ocean [30] | SiamFC [3] | CFNet [33] |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AUC | 0.551 | 0.521 | 0.518 | 0.569 | 0.496 | 0.448 | 0.426 | 0.397 | 0.560 | 0.336 | 0.275 |
| Precision | 0.569 | 0.529 | 0.506 | - | 0.491 | 0.427 | 0.396 | 0.373 | 0.566 | 0.339 | 0.259 |
| Normalized Precision | 0.608 | 0.599 | 0.576 | 0.650 | 0.569 | - | 0.494 | 0.460 | - | 0.420 | 0.312 |
| FPS | 65 | 6 | 30 | 43 | 35 | 110 | 25.7 | 0.9 | 25 | 58 | 43 |
Table 2. Quantitative comparison on the GOT-10k test set.

| (%) | OPAM | ATOM [7] | DiMP-50 [8] | SiamMask [34] | Ocean [30] | CFNet [33] | SiamFC [3] | GOTURN [35] | CCOT [36] | ECO [37] | CF2 [38] | MDNet [32] |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SR (0.50) | 64.1 | 63.4 | 71.7 | 58.7 | 72.1 | 40.4 | 35.3 | 37.5 | 32.8 | 30.9 | 29.7 | 30.3 |
| SR (0.75) | 48.9 | 40.2 | 49.2 | 36.6 | - | 14.4 | 9.8 | 12.4 | 10.7 | 11.1 | 8.8 | 9.9 |
| AO | 56.9 | 55.6 | 61.1 | 51.4 | 61.1 | 37.4 | 34.8 | 34.7 | 32.5 | 31.6 | 31.5 | 29.9 |
Table 3. Success plot AUC for challenge attributes of the LaSOT test set.

| Attribute | AUC |
|---|---|
| Aspect Ratio | 0.549 |
| Background Clutter | 0.462 |
| Camera Motion | 0.564 |
| Deformation | 0.588 |
| Fast Motion | 0.412 |
| Full Occlusion | 0.453 |
| Illumination Variation | 0.570 |
| Low Resolution | 0.457 |
| Motion Blur | 0.539 |
| Out-of-View | 0.530 |
| Partial Occlusion | 0.524 |
| Rotation | 0.557 |
| Scale Variation | 0.551 |
| Viewpoint Change | 0.532 |
Table 4. Ablation analysis on the OPAM mechanism.

| | Baseline | +LPSA | +TPCA | +GCA |
|---|---|---|---|---|
| AUC | 0.525 | 0.533 | 0.542 | 0.551 |
Table 5. Complexity analysis in terms of number of parameters and GFLOPs.

| Module | Backbone ϕ | Region Proposal | Region Classification | OPAM | Total |
|---|---|---|---|---|---|
| Num. of Param. | 11.18 M | 5.11 M | 4.96 M | 0.96 M | 22.21 M |

| Forward Scenario | Total (end-to-end) | Backbone ϕ | Continuous Tracking |
|---|---|---|---|
| GFLOPs | 57.64 | 42.04 | 36.62 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
