1. Introduction
Vehicle tracking plays a vital role in intelligent transportation systems, autonomous driving, traffic surveillance, and smart city applications [1,2,3,4]. Accurate and consistent tracking across video frames provides essential cues for downstream tasks such as trajectory prediction [5], traffic flow analysis [6], and collision avoidance [7]. Recent advances in deep learning have significantly improved vehicle localization accuracy under supervised learning frameworks [8,9].
Despite these advances, most trackers are still trained with frame-level objectives, while they are evaluated and deployed at the sequence level. This gap becomes critical when the tracker encounters challenging frames (e.g., occlusion, fast motion, clutter), where a single localization error can propagate and degrade subsequent predictions. As illustrated in Figure 1, a tracker may mislocalize the target in a difficult frame (left column) even if its performance in other frames is satisfactory. In real sequential tracking, however, such errors propagate to subsequent frames, eventually causing target loss and a sharp drop in sequence-level accuracy (right column). Therefore, improving sequence-level stability—rather than isolated per-frame accuracy—is essential for reliable vehicle tracking in real traffic scenes.
A key reason for this failure mode is the training–testing mismatch. During testing, the search region and target state depend on the tracker’s own historical predictions, so small errors accumulate into large state deviations. In contrast, frame-level training often relies on mildly perturbed ground-truth boxes, exposing the model to a much narrower distribution and optimizing for short-term accuracy instead of long-term robustness.
To address this mismatch, we formulate vehicle tracking as a sequential decision-making problem and optimize the tracker with a reinforcement learning (RL) objective. RL [10,11] provides a natural framework for this formulation, as it enables explicit modeling of temporal dependencies and optimization of long-term performance in an end-to-end manner. Unlike conventional frame-level training, the RL-based tracker learns from sequences sampled from real trajectories and directly optimizes evaluation metrics consistent with testing (e.g., the average overlap ratio (AOR), whose definition is given in Section 3). This directly aligns the training objective with sequence-level evaluation and reduces error accumulation.
In addition, we extend data augmentation into the temporal dimension by using a sliding-window mechanism to construct training clips with diverse motion and occlusion patterns. This temporal sampling enriches supervision signals for long-term tracking and improves robustness under reappearance and interaction scenarios.
The main contributions of this work are summarized as follows:
We propose an RL-based sequence-level vehicle tracking framework. In this framework, vehicle tracking is formulated as a sequential decision-making process, explicitly modeling inter-frame dependencies within an end-to-end training paradigm. By optimizing a sequence-consistent reward, the framework alleviates the mismatch between frame-level training and sequence-level evaluation.
We enhance robustness in challenging frames and occlusion scenarios. By learning from on-policy trajectories, the tracker suppresses error propagation and improves target recovery after temporary disappearance, leading to more stable sequence-level performance.
We introduce a temporal data augmentation strategy. Beyond conventional spatial perturbations, data augmentation is extended to the temporal domain through a sliding-window sampling mechanism. This produces motion-diverse training clips and improves generalization to complex traffic dynamics.
Extensive experiments on public benchmarks and our vehicle-specific dataset demonstrate consistent gains in both accuracy and stability, validating the effectiveness of sequence-level optimization for vehicle tracking.
3. RL–Based Vehicle Tracking Framework
To address the inconsistency between frame-level training and sequence-level evaluation, we adopt an RL framework [26] that enables end-to-end optimization through interaction with sequential tracking environments. In this framework, the tracker learns from sampled frame sequences of real trajectories and optimizes its policy using evaluation metrics consistent with testing, such as the AOR.
To formally define this metric, we compute the AOR over a trajectory as:

$$\mathrm{AOR} = \frac{1}{T} \sum_{t=1}^{T} \mathrm{IoU}_t,$$

which measures the mean region-overlap quality between the predicted bounding box sequence $\{b_t\}_{t=1}^{T}$ and the ground-truth sequence $\{g_t\}_{t=1}^{T}$. A higher AOR indicates more accurate and more stable tracking along the entire sequence. For completeness, the intersection-over-union (IoU) at time step $t$ is computed as:

$$\mathrm{IoU}_t = \frac{|b_t \cap g_t|}{|b_t \cup g_t|},$$

where $|b_t \cap g_t|$ denotes the area of spatial overlap between the predicted and ground-truth bounding boxes, and $|b_t \cup g_t|$ is the area of their union. Thus, IoU provides a normalized measure of spatial alignment at each frame, and AOR summarizes this alignment across the whole sequence, serving as a consistent sequence-level reward signal for reinforcement learning.
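For concreteness, the reward computation can be sketched in a few lines of Python. This is an illustrative helper rather than the released implementation, and it assumes axis-aligned boxes in $(x_1, y_1, x_2, y_2)$ format:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def aor(pred_boxes, gt_boxes):
    """Average overlap ratio over a trajectory: the mean per-frame IoU."""
    assert len(pred_boxes) == len(gt_boxes) and len(gt_boxes) > 0
    return sum(iou(p, g) for p, g in zip(pred_boxes, gt_boxes)) / len(gt_boxes)
```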
Although AOR is computed as a temporal average, it does not ignore long-term tracking failures. In practice, once the tracker drifts away from the target or loses it for multiple consecutive frames, the IoU values in those frames rapidly approach zero, which continuously decreases the accumulated AOR. Therefore, prolonged target loss results in a substantial penalty through the aggregation of low-overlap frames, rather than being dominated by early accurate predictions.
More importantly, under the RL formulation, tracking errors have temporally compounding effects. When a poor localization decision occurs, subsequent predictions are generated from an increasingly corrupted state, which further degrades future overlaps and yields persistently low rewards along the remainder of the sequence. This sequential dependency implicitly discourages policies that cause long-term drift, even without introducing explicit penalties for lost-target states.
We intentionally avoid introducing additional handcrafted reward terms (e.g., binary lost-target indicators or duration-based drift penalties), as such designs require heuristic thresholds and may reduce generality across different datasets and tracking scenarios. By directly optimizing AOR—the same metric used for sequence-level evaluation at test time—the proposed framework maintains a simple, consistent, and stable objective, while still effectively penalizing sustained tracking failures through cumulative sequence-level feedback.
This training paradigm not only alleviates the data distribution mismatch between training and testing but also captures temporal decision dependencies that are critical for long-term tracking performance. Furthermore, by extending data augmentation into the temporal dimension through sliding-window sampling, the model is exposed to diverse motion patterns and occlusion conditions across sequences.
3.1. Sequence-Level Training Under the RL Paradigm
Given a video sequence $\mathcal{V} = \{I_t\}_{t=1}^{T}$ with $T$ frames and the ground-truth bounding box $g_1$ of the target in the initial frame $I_1$, the tracker predicts a bounding box $b_t$ for each subsequent frame $I_t$, where $t = 2, \dots, T$. The tracker is parameterized by $\theta$ and modeled as a policy function $\pi_\theta$, which maps the current observation $o_t$ to a predicted bounding box:

$$b_t = \pi_\theta(o_t).$$
The observation $o_t$ aggregates all available information up to time $t$, including the video frames $\{I_1, \dots, I_t\}$, the initial ground-truth box $g_1$, and the history of previous predictions $\{b_1, \dots, b_{t-1}\}$. In practice, most trackers estimate the current bounding box $b_t$ based on the previous prediction $b_{t-1}$ and the current frame $I_t$. This is typically achieved by defining a local search region centered at $b_{t-1}$ and selecting the optimal bounding box through similarity matching or classification responses. To integrate this tracking formulation into an RL framework, the policy network follows the tracker's backbone and branches into two heads: a spatial classification head that produces a probability distribution over candidate locations, and a regression head that refines the selected location into a final bounding box. Stochastic sampling is performed on the classification head.
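A minimal PyTorch sketch of such a two-head policy is given below; the layer shapes and the `TrackingPolicyHead` interface are illustrative assumptions, not the exact architecture used in this work:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

class TrackingPolicyHead(nn.Module):
    """Illustrative two-head policy: a classification head that scores
    candidate locations and a regression head that refines each candidate
    into a bounding box."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.cls_head = nn.Linear(feat_dim, 1)  # confidence score per candidate
        self.reg_head = nn.Linear(feat_dim, 4)  # box refinement per candidate

    def forward(self, candidate_feats, stochastic=True):
        # candidate_feats: (N, feat_dim) features of N candidate locations.
        logits = self.cls_head(candidate_feats).squeeze(-1)  # (N,)
        boxes = self.reg_head(candidate_feats)               # (N, 4)
        dist = Categorical(logits=logits)
        # Sampling tracker explores; greedy tracker takes the arg-max score.
        idx = dist.sample() if stochastic else logits.argmax()
        # Return the chosen box and the log-probability needed for REINFORCE.
        return boxes[idx], dist.log_prob(idx)
```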
The objective of tracking is to maximize a sequence-level evaluation metric $r(B)$, where $B = \{b_t\}_{t=2}^{T}$ denotes the predicted bounding box sequence and $r(\cdot)$ measures overall tracking quality, such as the AOR. Because each decision $b_t$ influences subsequent predictions $b_{t+1}, \dots, b_T$, the tracking process is inherently temporal and sequentially interdependent.
Conventional frame-level training does not account for this dependency. During training, the model typically learns to localize targets from perturbed ground-truth boxes $\eta(g_{t-1})$ rather than from its own historical predictions $b_{t-1}$, where $\eta(\cdot)$ denotes a random perturbation function. This approximation, $b_{t-1} \approx \eta(g_{t-1})$, introduces additional hyperparameters and creates a data distribution gap between training and testing, which limits the model's ability to learn effective sequential decision strategies.
To address these limitations, we adopt a sequence-level training scheme under the RL framework, as illustrated in Figure 2. The tracker is optimized directly with respect to the sequence-level evaluation metric used at test time, thereby reducing the discrepancy between training and deployment. The objective function is defined as:

$$J(\theta) = \mathbb{E}_{B \sim \pi_\theta}\big[\, r(B) \,\big].$$
Training sequences are sampled from real tracking trajectories, allowing the tracker to experience realistic state transitions and motion variations. To stabilize this optimization, we apply several standard RL regularization techniques. First, reward values are normalized within each mini-batch to ensure a consistent gradient scale across different sequences. Second, gradient clipping is used to prevent exploding gradients when long trajectories produce large policy updates. Additionally, an entropy regularization term encourages sufficient exploration during early training, while a temperature-controlled sampling schedule gradually reduces randomness to improve convergence.
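These stabilization steps can be summarized in a short PyTorch sketch; the entropy coefficient `beta` and clipping norm `max_grad_norm` are illustrative values, not the tuned settings used in training:

```python
import torch

def stabilized_policy_update(policy, optimizer, log_probs, rewards, entropies,
                             beta=0.01, max_grad_norm=1.0):
    """Sketch of the stabilization techniques described above."""
    # 1) Normalize rewards within the mini-batch for a consistent gradient scale.
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # 2) Policy-gradient surrogate with an entropy bonus for early exploration
    #    (a temperature schedule on the sampling distribution plays a similar
    #    role and can be annealed over epochs).
    loss = -(rewards.detach() * log_probs).mean() - beta * entropies.mean()
    optimizer.zero_grad()
    loss.backward()
    # 3) Clip gradients so long trajectories cannot produce exploding updates.
    torch.nn.utils.clip_grad_norm_(policy.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()
```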
The objective in Equation (4) is optimized using a policy gradient method. The expected gradient is given by:

$$\nabla_\theta J(\theta) = \mathbb{E}_{B \sim \pi_\theta}\big[\, r(B)\, \nabla_\theta \log \pi_\theta(B) \,\big].$$

In practice, the expectation is approximated using a single Monte Carlo sample $B^s \sim \pi_\theta$:

$$\nabla_\theta J(\theta) \approx r(B^s)\, \nabla_\theta \log \pi_\theta(B^s).$$
To reduce the variance of the gradient estimate, we adopt the Self-Critical Sequence Training (SCST) algorithm [27], which introduces a baseline reward computed from a deterministic inference trajectory. Specifically, two trackers with shared parameters are used:
1. A sampling strategy that generates bounding boxes based on stochastic policy sampling;
2. A greedy tracker that selects actions with maximum confidence.
The greedy tracker serves as the baseline in SCST, providing a deterministic reference trajectory whose reward is detached from the computational graph. This baseline reduces variance in the policy gradient estimate and avoids introducing extra learnable components, ensuring that the optimization remains stable and computationally efficient.
During training, if the sampled trajectory yields a higher cumulative reward than the greedy one, the current policy is reinforced; otherwise, it is penalized. The resulting gradient is computed as:

$$\nabla_\theta L(\theta) \approx -\big(r(B^s) - r(\hat{B})\big)\, \nabla_\theta \log \pi_\theta(B^s),$$

where $r(B^s)$ and $r(\hat{B})$ denote the rewards obtained by the sampling and greedy trackers, respectively. The overall sequence-level training procedure is summarized in Algorithm 1.
Algorithm 1 Sequence-Level Reinforcement Learning for Vehicle Tracking

Require: Tracker parameters $\theta$, training dataset $\mathcal{D}$
1: while not converged do
2:  Sample video clip $X$ and annotations $Y$ from $\mathcal{D}$
3:  Initialize tracker with $g_1$ in the first frame
4:  Set $b_1^s = g_1$ for the sampling strategy, and $\hat{b}_1 = g_1$ for the greedy tracker
5:  for $t = 2$ to $T$ do
6:   Sample $b_t^s \sim \pi_\theta(\cdot \mid o_t^s)$
7:   Select $\hat{b}_t = \arg\max_b \pi_\theta(b \mid \hat{o}_t)$
8:  end for
9:  Compute rewards $r(B^s)$ and $r(\hat{B})$
10: Compute loss $L(\theta) = -\big(r(B^s) - r(\hat{B})\big)\, \log \pi_\theta(B^s)$
11: Update parameters $\theta$
12: end while
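For readers who prefer code, the following Python sketch mirrors Algorithm 1. The `extract_candidates` helper and the two-head `policy` interface are hypothetical stand-ins (not the released implementation), and `aor` is the reward helper sketched earlier in this section:

```python
import torch

def scst_iteration(policy, optimizer, frames, gt_boxes):
    """One SCST training iteration: roll out a sampling tracker and a greedy
    baseline tracker with shared parameters, then apply the self-critical loss."""
    log_probs, sampled, greedy = [], [], []
    prev_s = prev_g = gt_boxes[0]                 # both trackers start from g1
    for frame in frames[1:]:
        feats_s = extract_candidates(frame, prev_s)   # hypothetical helper
        box_s, logp = policy(feats_s, stochastic=True)
        with torch.no_grad():                     # greedy baseline stays detached
            feats_g = extract_candidates(frame, prev_g)
            box_g, _ = policy(feats_g, stochastic=False)
        log_probs.append(logp)
        sampled.append(box_s.detach().tolist())
        greedy.append(box_g.tolist())
        prev_s, prev_g = box_s.detach().tolist(), box_g.tolist()
    r_s = aor(sampled, gt_boxes[1:])              # reward of sampled trajectory
    r_g = aor(greedy, gt_boxes[1:])               # detached greedy baseline reward
    # Self-critical loss: reinforce sampling only when it beats the baseline.
    loss = -(r_s - r_g) * torch.stack(log_probs).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return r_s, r_g
```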
As illustrated in Figure 3, the proposed RL-based training strategy exhibits improved robustness under challenging conditions such as prolonged occlusion. When the target vehicle is fully occluded, the greedy tracker often mislocalizes nearby objects due to its reliance on maximum confidence scores. In contrast, the sampling strategy maintains exploration around previous estimates and successfully re-identifies the target after occlusion. In this scenario, a positive reward reinforces the sampling policy. Conversely, when the sampling strategy underperforms, a negative reward is applied to discourage suboptimal exploration.
This reward-driven adaptation enables the tracker to learn robust temporal decision strategies, effectively reducing error propagation and improving long-term tracking stability in complex real-world scenarios.
3.2. Sliding-Window Sampling Mechanism
In computer vision, data augmentation is commonly divided into spatial and temporal categories [28]. Traditional frame-level tracking algorithms mainly rely on spatial augmentation techniques, such as geometric transformations, color perturbations, and random noise injection. These methods help mitigate overfitting caused by limited training data. However, frame-level training usually treats a video sequence as a collection of independent image pairs, ignoring temporal dependencies among frames. As a result, temporal correlations are not explicitly modeled.
To address this limitation, we propose a sliding-window–based sampling strategy that integrates both spatial and temporal augmentation. This design enriches training diversity and strengthens spatiotemporal representation learning. In our sampling pipeline, both the temporal progression of the window and the stochastic spatial perturbations are governed by well-defined sampling ranges and probability distributions, ensuring that clip construction follows a consistent augmentation process.
Let $X \in \mathbb{R}^{T \times H \times W \times 3}$ denote an input video sequence and $Y = \{g_t\}_{t=1}^{T}$ its corresponding ground-truth bounding boxes, where $T$, $H$, and $W$ represent the number of frames, height, and width, respectively. Unlike conventional training schemes that sample only a single template–search pair, our method extracts a clip of $l$ consecutive frames using a sliding window, together with the corresponding annotations.
As shown in Figure 4a, a window of fixed size $W$ slides from the first frame $I_1$ to the last frame $I_T$ with a stride $S$. At each step, candidate frames and their ground-truth bounding boxes are selected until $l$ frames are obtained.
For the first window, the initial frame $I_1$ is chosen as the base frame $I_b$. For each reference frame $I_i$ in the window, we compute a motion offset $d_i$ between the base box $g_b$ and the candidate box $g_i$ using the CIoU distance:

$$d_i = 1 - \mathrm{CIoU}(g_b, g_i).$$
Here, $\mathrm{CIoU}(\cdot,\cdot)$ denotes the Complete Intersection-over-Union (CIoU) metric, which measures the geometric similarity between two bounding boxes by jointly considering three factors: the IoU, the normalized distance between box centers, and the consistency of aspect ratios. Compared with standard IoU, CIoU provides a more comprehensive assessment of spatial alignment by penalizing large center displacement and shape mismatch, making it particularly suitable for capturing motion variation between frames.
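A reference implementation of the CIoU-based offset, following the standard CIoU formulation, is sketched below. It reuses the `iou` helper from Section 3 and assumes boxes with positive width and height:

```python
import math

def ciou(box_a, box_b):
    """Complete IoU: IoU minus a center-distance penalty and an
    aspect-ratio consistency term. Boxes are (x1, y1, x2, y2)."""
    iou_val = iou(box_a, box_b)
    # Squared distance between box centers.
    cax, cay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cbx, cby = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    rho2 = (cax - cbx) ** 2 + (cay - cby) ** 2
    # Squared diagonal of the smallest enclosing box.
    cw = max(box_a[2], box_b[2]) - min(box_a[0], box_b[0])
    ch = max(box_a[3], box_b[3]) - min(box_a[1], box_b[1])
    c2 = cw ** 2 + ch ** 2 + 1e-9
    # Aspect-ratio consistency term.
    wa, ha = box_a[2] - box_a[0], box_a[3] - box_a[1]
    wb, hb = box_b[2] - box_b[0], box_b[3] - box_b[1]
    v = (4 / math.pi ** 2) * (math.atan(wa / ha) - math.atan(wb / hb)) ** 2
    alpha = v / (1 - iou_val + v + 1e-9)
    return iou_val - rho2 / c2 - alpha * v

def motion_offset(box_a, box_b):
    """CIoU distance used as the motion offset between two frames."""
    return 1.0 - ciou(box_a, box_b)
```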
As illustrated in Figure 4b, a larger motion offset $d_i$ indicates richer motion variation in frame $I_i$, which is beneficial for enhancing spatiotemporal representation learning. After each sliding step, the newly selected target frame becomes the base frame for the next iteration.
The temporal advance of the window employs a stride $S$ sampled from a uniform interval, leading to moderate variability in the positions from which clips are drawn. Within each window, candidate frames follow a discrete uniform distribution before motion offsets are computed, so that temporal selection is not biased toward specific frame indices. The resulting CIoU-based motion offsets are normalized to the range $[0, 1]$ for each sequence to maintain comparable motion magnitudes across videos of different scales. Meanwhile, spatial augmentation—comprising random cropping, horizontal flipping, and color jittering—is applied to each selected frame with fixed probabilities of 0.5, 0.5, and 0.3, forming a stable stochastic augmentation process during clip generation.
The sampling index for each selected frame is determined by:

$$i^* = \arg\max_{i \in \mathcal{W}} \tilde{d}_i,$$

where $\mathcal{W}$ denotes the set of candidate indices in the current window and $\tilde{d}_i$ is the normalized motion offset. The $\arg\max$ operator selects the frame with the largest normalized motion variation within each window, guiding the sampling process toward the most informative temporal transitions.
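Putting the pieces together, a simplified version of the sampler might look as follows. The `clip_len` and `stride_range` arguments are illustrative placeholders, while the window size defaults to the value of 10 used in our experiments (Section 4.1.2); `motion_offset` is the CIoU helper sketched above:

```python
import random

def sample_clip_indices(gt_boxes, window_size=10, clip_len=8, stride_range=(2, 6)):
    """Sketch of the sliding-window sampler: within each window, select the
    frame with the largest normalized CIoU motion offset w.r.t. the base frame."""
    indices, base, t = [0], 0, 1          # frame 0 is the initial base frame
    while len(indices) < clip_len and t < len(gt_boxes):
        window = list(range(t, min(t + window_size, len(gt_boxes))))
        if not window:
            break
        # CIoU-based motion offset of each candidate w.r.t. the base frame.
        offsets = [motion_offset(gt_boxes[base], gt_boxes[i]) for i in window]
        # Normalize within the window, then take the arg-max frame
        # (i* = argmax_i of the normalized offsets).
        max_off = max(offsets) + 1e-9
        best = window[max(range(len(window)), key=lambda k: offsets[k] / max_off)]
        indices.append(best)
        base = best                        # selected frame becomes the new base
        t = best + random.randint(*stride_range)  # stride from a uniform interval
    return indices
```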
Through the proposed sliding-window sampling mechanism, the extracted clip X is both temporally coherent and motion-diverse. The fixed window length ensures smooth temporal evolution, while inter-frame variation encourages learning dynamic motion patterns. As a result, the tracker can better handle challenging scenarios such as fast motion, occlusion, and camera movement.
3.3. Loss Function Design
The position prediction network outputs $N$ classification and regression candidates. To support stochastic sampling, we construct a classification probability distribution $p = (p_1, \dots, p_N)$ as:

$$p_i = \frac{\exp\big(\mathrm{logit}(s_i)\big)}{\sum_{j=1}^{N} \exp\big(\mathrm{logit}(s_j)\big)}, \qquad \mathrm{logit}(s) = \log \frac{s}{1 - s},$$

where $\mathrm{logit}(\cdot)$ denotes the logit function and $s_i$ is the classification score of the $i$-th candidate. Since $s_i$ lies in the range $(0, 1)$, it is first transformed using the logit function to remove the bounded constraint, and then normalized via softmax to form a valid probability distribution. To avoid numerical instability when $s_i$ approaches 0 or 1, we clip the values using $s_i \leftarrow \mathrm{clip}(s_i, \epsilon, 1 - \epsilon)$ with a small constant $\epsilon$ before computing the logit. This prevents extreme logit values and ensures stable probability estimation.
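The clipped logit-softmax construction can be expressed compactly in PyTorch; `eps` below is an assumed clipping constant:

```python
import torch

def candidate_distribution(scores, eps=1e-6):
    """Clip scores away from 0 and 1, apply the logit transform to remove
    the (0, 1) bound, and normalize with softmax."""
    scores = scores.clamp(min=eps, max=1.0 - eps)   # avoid infinite logits
    logits = torch.log(scores / (1.0 - scores))     # logit transform
    return torch.softmax(logits, dim=-1)            # valid probability dist.

# Usage: sample a candidate index during training, pick the argmax at inference.
scores = torch.rand(16)                  # classification scores in (0, 1)
probs = candidate_distribution(scores)
train_idx = torch.multinomial(probs, 1)  # stochastic sampling (training)
test_idx = probs.argmax()                # deterministic choice (inference)
```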
Importantly, stochastic sampling is performed over a discrete set of candidate locations generated by the tracker’s search region and regression head, rather than over unconstrained bounding boxes. Each sampled candidate corresponds to a valid bounding box predicted by the network and is geometrically anchored around the previous estimate. As a result, the sampling process explores alternative plausible hypotheses within a localized neighborhood, rather than producing arbitrary or unrealistic bounding boxes.
During training, bounding boxes are randomly sampled from $p$ to encourage exploration. Because the probability distribution is derived from normalized classification scores, candidates with extremely low confidence receive negligible sampling probability, which further suppresses implausible predictions. In addition, the regression losses applied to sampled boxes provide continuous geometric regularization, discouraging degenerate or highly inaccurate bounding boxes during optimization. During inference, a deterministic strategy is adopted by selecting the prediction with the highest confidence score.
For each training sequence, the classification loss is defined as:

$$L_{\mathrm{cls}} = -\sum_{t=2}^{T} \big(r(B^s) - r(\hat{B})\big)\, \log p\big(b_t^s\big),$$

where $r(B^s) - r(\hat{B})$ denotes the self-critical reward from SCST, and $b_t^s$ is the sampled bounding box at time step $t$.
The overall training loss is formulated as:

$$L = L_{\mathrm{cls}} + \lambda_1 L_{\mathrm{CIoU}} + \lambda_2 L_{1},$$

where $L_{\mathrm{CIoU}}$ and $L_{1}$ denote the CIoU loss and the L1 regression loss, respectively. The regularization coefficients $\lambda_1$ and $\lambda_2$ were selected based on empirical sensitivity analysis: the CIoU term enforces geometric consistency between sampled predictions and ground truth, while the L1 term stabilizes regression by penalizing large coordinate deviations. Together with the sequence-level reward, they ensure that stochastic exploration remains constrained within realistic spatial configurations and does not destabilize training.
This loss formulation jointly optimizes classification confidence, spatial localization accuracy, and sequence-level reward, enabling stable policy learning and robust bounding-box regression under stochastic sampling.
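A condensed sketch of the combined objective is shown below; the `lambda_ciou` and `lambda_l1` arguments are placeholders for the tuned coefficients, which are not reproduced here:

```python
import torch

def total_loss(sc_reward, log_probs, ciou_loss, l1_loss,
               lambda_ciou=1.0, lambda_l1=1.0):
    """Combined objective: self-critical classification term plus
    geometric regression regularizers."""
    # Sequence-level policy term: log-probabilities of sampled boxes weighted
    # by the (detached) self-critical reward r(B^s) - r(B_hat).
    cls_loss = -(sc_reward * torch.stack(log_probs)).sum()
    # Geometric regularizers keep sampled boxes spatially plausible.
    return cls_loss + lambda_ciou * ciou_loss + lambda_l1 * l1_loss
```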
4. Experiments
4.1. Experimental Setup
4.1.1. Datasets and Baselines
The proposed vehicle object tracking framework is trained on a combination of public benchmark datasets and a self-constructed vehicle-specific dataset. Specifically, we utilize OTB100 [29], LaSOT [30], and our own Vehicle Tracking Dataset (VTD). Among them, OTB100 and LaSOT are widely used general-purpose visual tracking benchmarks, frequently adopted for evaluating algorithmic performance. In contrast, the VTD dataset is specifically tailored for vehicle tracking tasks and captures a broader spectrum of real-world traffic conditions.
The VTD dataset consists of video footage collected from several representative and diverse traffic environments, including the South Second Ring Road in Xi’an, the Yongtaiwen Expressway, Haidian District in Beijing, and the Tiantaishan Tunnel. These scenes encompass varying levels of traffic density, lighting conditions, and environmental complexity. A total of approximately 120 min of video data were recorded and segmented into 610 annotated video clips. To increase the diversity and difficulty of the dataset, videos were captured from six distinct viewpoints: front, rear, front-left, front-right, rear-left, and rear-right.
Each target vehicle in the VTD dataset is annotated with a bounding box, a category label, and a trajectory ID. The primary object categories include four common vehicle types: car, truck, bus, and motorcycle. Figure 5 provides representative examples of annotated tracking scenarios under various real-world conditions.
Notably, the truck category exhibits substantial intra-class variation. As shown in Figure 6, the dataset covers multiple truck subtypes, including pickup trucks, medium-duty trucks, heavy-duty trucks, container trucks, tanker trucks, and flatbed trucks. Quantitatively, the bounding-box area of trucks spans a wide range, covering both small vehicles occupying only a small fraction of the frame and large vehicles occupying a substantial portion of the frame area. This scale variation, together with pronounced differences in aspect ratios and visual appearance, increases the difficulty of maintaining consistent localization and feature representation during tracking.
In addition to category diversity, the dataset contains substantial illumination variability. The recorded sequences include daytime, nighttime, tunnel environments, and scenes with strong backlighting or shadows. Approximately 30% of the clips are captured under low-light or rapidly changing illumination conditions, which often lead to reduced contrast and degraded visual cues. These factors contribute to frequent appearance changes and further challenge robust vehicle tracking.
For training, 60,000 spatiotemporally correlated sample pairs are randomly selected per iteration from the four datasets. Each sample pair comprises a template frame and a search frame, both extracted from the same video sequence with a temporal interval randomly chosen between 10 and 50 frames. To enhance temporal coherence, a consistency constraint is applied during training, which improves the model’s robustness to appearance changes and occlusions.
4.1.2. Implementation Details
The proposed tracking algorithm is trained using the AdamW optimizer [31] with weight decay. The training process spans 1000 epochs, with 1000 iterations per epoch. The initial learning rate is decayed after the first 500 epochs to facilitate convergence. A batch size of 32 is used throughout training. The template and search region images are resized to fixed resolutions before being fed to the network. To improve the generalization capability of the model, standard data augmentation techniques are applied, including horizontal flipping and brightness jittering.
All experiments are conducted using the following hardware and software environment: the implementation is based on Python 3.7 and PyTorch 1.11.0. The operating system is Ubuntu 20.04 with CUDA version 11.3. The hardware configuration includes an Intel® Core™ i7-6800K CPU (Intel Corporation, Santa Clara, CA, USA), 16 GB of RAM, and an NVIDIA GeForce RTX 3070 GPU (NVIDIA Corporation, Santa Clara, CA, USA).
During training, the network is trained for 1000 epochs, with 300 iterations per epoch. In each iteration, tracking sequences are randomly sampled from the training sets of the OTB, LaSOT, and VTD datasets. Each sampled sequence consists of video frames and one template frame. The batch size is set to 2.
For training sample generation using the sliding-window sampling mechanism, the window size is set to 10. The temporal stride is fixed to a value that ensures moderate overlap between neighboring windows, and the clip length is set to balance computational efficiency with sufficient temporal coverage. These values were selected based on preliminary empirical observations indicating that shorter clips cause insufficient temporal modeling, whereas excessively long clips introduce redundant frames without performance gain. The reward function $r$ in Equation (11) adopts the AOR as the evaluation metric, which guides the reinforcement learning optimization.
All experiments are conducted using a fixed random seed of 2024 for Python, NumPy 1.21, and PyTorch. The sampling frequency for reinforcement learning updates follows the SCST protocol, where each iteration generates both a sampled trajectory and a greedy baseline trajectory. The SCST baseline uses the deterministic inference output with rewards detached from the computational graph. The implementation code is publicly available at:
https://github.com/chd-via-lab/RL-VT (accessed on 10 December 2025).
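For reference, the seeding described above amounts to a few lines; this sketch is consistent with the fixed seed of 2024 reported here:

```python
import random
import numpy as np
import torch

def set_seed(seed=2024):
    """Fix random seeds for Python, NumPy, and PyTorch for reproducibility."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed()
```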
4.1.3. Evaluation Metrics
To ensure consistency and clarity, we adopt three commonly used tracking evaluation metrics: Area Under Curve (AUC), Normalized Precision ($P_{\mathrm{norm}}$), and Success Rate (SR). All metrics are defined as follows.
Area Under Curve (AUC). AUC measures the overall tracking success across a continuous range of IoU thresholds. For a given IoU threshold $\tau$, the success rate is computed as:

$$\mathrm{SR}(\tau) = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\big[\mathrm{IoU}_t > \tau\big],$$

and the AUC is obtained by integrating $\mathrm{SR}(\tau)$ over $\tau \in [0, 1]$:

$$\mathrm{AUC} = \int_{0}^{1} \mathrm{SR}(\tau)\, d\tau.$$
Normalized Precision ($P_{\mathrm{norm}}$). This metric evaluates center localization accuracy using the normalized center distance. Let $\delta_t$ denote the Euclidean distance between predicted and ground-truth box centers, normalized as:

$$\tilde{\delta}_t = \frac{\delta_t}{d},$$

where $d$ is the image diagonal used for normalization. The metric is computed as:

$$P_{\mathrm{norm}} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\big[\tilde{\delta}_t < \epsilon_d\big],$$

where $\epsilon_d$ is a normalized distance threshold.
Success Rate (SR). SR measures the proportion of frames whose predicted bounding boxes achieve sufficient overlap with the ground-truth. For a general IoU threshold $\tau$, SR is defined as:

$$\mathrm{SR}_{\tau} = \frac{1}{T} \sum_{t=1}^{T} \mathbb{1}\big[\mathrm{IoU}_t \geq \tau\big].$$
This metric reflects the tracker’s ability to maintain accurate localization throughout the sequence.
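These three metrics can be computed directly from per-frame IoU values and center distances. The sketch below is illustrative, with the `threshold` default in `normalized_precision` an assumed value rather than a benchmark-specified one:

```python
import numpy as np

def success_rate(ious, tau):
    """Fraction of frames whose IoU exceeds threshold tau."""
    return float((np.asarray(ious) > tau).mean())

def auc(ious, num_thresholds=101):
    """Numerically integrate the success curve over IoU thresholds in [0, 1]."""
    taus = np.linspace(0.0, 1.0, num_thresholds)
    return float(np.mean([success_rate(ious, t) for t in taus]))

def normalized_precision(pred_centers, gt_centers, diag, threshold=0.05):
    """Fraction of frames whose center error, normalized by the image
    diagonal, falls below `threshold`."""
    dist = np.linalg.norm(np.asarray(pred_centers) - np.asarray(gt_centers),
                          axis=1) / diag
    return float((dist < threshold).mean())
```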
4.2. Training Loss
The variation of the loss function during training is illustrated in Figure 7. Both the bounding box regression loss (Loss/bbox) and the CIoU loss (Loss/ciou) show a steep decrease during the early training epochs, followed by a gradual stabilization phase. This behavior indicates that the proposed RL-based sequence-level training framework effectively captures spatial and temporal dependencies among consecutive frames. In the initial iterations, the model rapidly learns the fundamental geometric structures required for vehicle localization, while in the later stages, the losses converge smoothly to a stable minimum, demonstrating the stability of the training process.
A closer examination of the loss curves further confirms the convergence stability of our RL optimization. Both Loss/bbox and Loss/ciou rapidly drop during the first 150–200 epochs and subsequently enter a smooth plateau without noticeable oscillations or gradient spikes. The narrow-band fluctuations observed in the later training stages indicate that the model maintains stable gradient behavior and avoids divergence issues commonly seen in RL-based optimization. Moreover, repeated experiments using three different random seeds produce nearly identical convergence profiles, with variance below 3% at stabilization epochs, demonstrating that the optimization process is robust to initialization.
We additionally report the training duration to provide deeper insight into convergence behavior. Across datasets, the training process averages 0.043 GPU-hours per epoch on an RTX 3070 GPU, resulting in approximately 43 GPU-hours for full training. The epoch-wise training time also reflects dataset composition: 2.2 min/epoch for OTB–dominant batches, 2.5 min/epoch for LaSOT–dominant batches, and 2.0 min/epoch for VTD–dominant batches. In terms of memory consumption, the peak GPU memory usage varies slightly across datasets due to differences in sequence length and image resolution. Specifically, OTB–dominant batches require approximately 6.5 GB of GPU memory, LaSOT–dominant batches require around 7.2 GB, and VTD–dominant batches consume about 6.8 GB under the same training configuration. Despite these variations, the memory footprint remains stable within a narrow range, and all experiments can be conducted on a single RTX 3070 GPU without memory overflow. These consistent per-epoch runtimes, together with the smooth loss trajectories, indicate that the RL-based training pipeline not only converges reliably but also maintains stable computational behavior across heterogeneous sequence distributions.
The overall downward trend and smooth convergence reflect the robustness and efficiency of the proposed sequence-level optimization strategy. Unlike conventional frame-level training that often results in fluctuating loss curves due to the lack of temporal consistency, our approach leverages reinforcement signals accumulated across sequences to maintain stable gradient updates. Furthermore, the absence of noticeable oscillations in both the Loss/bbox and Loss/ciou curves suggests that the model mitigates error propagation over time and maintains consistent convergence even under occlusion and complex motion scenarios.
These results validate the effectiveness of our temporal reinforcement learning paradigm and the introduced sliding-window temporal data augmentation strategy, which together contribute to improved tracking continuity, enhanced generalization across diverse motion patterns, and more reliable vehicle re-identification under challenging conditions.
4.3. Comparative Experimental Analysis
(1) Quantitative Analysis
To evaluate performance differences among various tracking algorithms, our method is compared with several state-of-the-art (SOTA) trackers, including KURL [20], BGS [19], SLT [23], FasterMDNet [32], ACSiam [33], LightFC [34], TransT [35], MixFormer [36], and MOTR [37] on the OTB100 dataset. All trackers are evaluated using the one-pass evaluation (OPE) protocol across the full sequence.
As shown in Figure 8, the left panel presents the precision curve, while the right panel shows the success rate curve. Quantitative results on OTB100 are reported in Table 1. Our proposed method achieves the best overall performance, surpassing all competing methods. Specifically, our method attains a success rate of 70.9%, outperforming the second-best KURL by 0.9%, and achieves a precision of 92.9%, exceeding KURL by 0.7%. These results confirm that the sequence-level training strategy enhances the tracker's understanding of temporal motion dynamics, improving robustness under occlusion and disappearance scenarios.
As shown in Figure 9, our method achieves a precision of 94.9% and a success rate of 68.4%, outperforming KURL by 4.8% and 0.8%, respectively, indicating that the proposed objective better guides learning under occlusion conditions.

As shown in Figure 10, our method attains a precision of 89.9% and a success rate of 68.4%, outperforming KURL by 2.1% and 0.6%. This demonstrates that the RL-based sequence-level framework effectively models temporal dependencies and can recover targets after disappearance.

In Figure 11, our method achieves 94.9% precision and 74.2% success rate, outperforming KURL by 0.4% and 0.5%. This verifies that the sliding-window sampling mechanism enables RLTransT to learn motion-rich representations, improving robustness under fast-moving conditions.
Results on the LaSOT dataset are summarized in Table 2. Our method achieves an AUC of 67.9%, surpassing KURL by 0.3%, and maintains strong normalized precision and precision scores, confirming its accuracy on long-term tracking sequences.
As shown in Table 3, our method achieves the highest overall performance, reaching 79.9% at SR$_{0.75}$, 65.8% at SR$_{0.50}$, and 70.7% in AUC, outperforming KURL by 1.8%, 1.7%, and 0.9%, respectively. This demonstrates our method's ability to robustly track targets under occlusion, scale variation, disappearance, and reappearance.
We further analyzed the SR distribution across all sequences in the OTB100, LaSOT, and VTD datasets and visualized the results using the box plots shown in Figure 12. As illustrated, the proposed method (Ours) not only achieves a higher average SR on all three datasets but also exhibits substantially lower variance and a noticeably more compact interquartile range (IQR).
Specifically, on OTB100, the SR values of Ours are concentrated within a narrow interval, demonstrating the most stable distribution among all compared methods. In contrast, KURL and MOTR present much wider IQRs and longer whiskers, indicating larger performance fluctuations. Similar trends are observed on LaSOT and VTD: the median SR of Ours consistently surpasses those of the baseline methods, while its IQR remains markedly smaller, suggesting that the performance improvements are consistently obtained across the majority of sequences. Moreover, Ours shows no apparent performance collapse (extreme low values) in any dataset, further verifying its robustness.
These statistical results indicate that, although reinforcement learning introduces certain stochasticity, the improvements achieved by our method are not accidental but remain stable across different sequences and scenarios. Therefore, the distributional analysis provides statistical support for the superiority of our method over KURL and MOTR, addressing the limitations of relying solely on single-score comparisons.
(2) Qualitative Analysis
Three representative video sequences—0118, 0114, and 0112—from the VTD dataset are selected for qualitative evaluation. Among them, 0118 represents a tunnel highway scenario, while 0114 and 0112 correspond to ordinary highway scenes. All sequences contain challenging situations such as target occlusion, disappearance, and reappearance.
As illustrated in Figure 13, qualitative comparisons are performed among five representative trackers: Ours, KURL, MOTR, BGS, and SLT.
Sequence 0118 (Tunnel Highway). In the 0118 sequence, a white sedan driving in the left lane is tracked, while a large container truck approaches from the right and completely occludes the sedan for several frames. As shown in Figure 13 (left), from frame 5 onward, DiMP and TransT begin to drift. At frame 6, the target vehicle becomes fully occluded, causing nearly all trackers to deviate. Our method shows only a minor offset, remaining close to the ground-truth bounding box. At frame 10, MOTR, BGS, and SLT incorrectly localize the occluding truck, whereas our method and KURL maintain bounding boxes near the occlusion area. When the target reappears at frame 15, our method and KURL successfully reacquire the vehicle, while MOTR and SLT still follow the wrong object, and SLT fails to recover. Throughout the sequence, our method demonstrates tighter bounding boxes and greater robustness under full occlusion and reappearance scenarios.
Sequence 0114 (Urban Highway). In the 0114 sequence, a taxi is tracked. As illustrated in Figure 13 (middle), at frame 43, most of the taxi body is occluded by an overpass pillar. SLT and TransT continue to localize the full-size vehicle, while our method and KURL adjust to only the visible portion. From frame 46 onward, as the taxi becomes fully occluded, MOTR enlarges its search region, while our method and KURL narrow it to avoid drift. Meanwhile, BGS and SLT incorrectly switch to similar-looking vehicles ahead. At frame 52, when the taxi front reappears, our method first locks onto it, MOTR mistakenly tracks another vehicle, and KURL performs a wider search. By frame 57, our method and KURL accurately track the target again, while other trackers still exhibit mis-tracking.
Sequence 0112 (Ordinary Highway). In the 0112 sequence, another taxi is tracked under similar conditions but without interference from similar vehicles. As shown in Figure 13 (right), at frame 174, most of the car body is occluded by a bridge pillar. MOTR, BGS, and SLT maintain full-vehicle bounding boxes, while our method and KURL localize only the visible part. At frame 32, when the vehicle's front appears, our method accurately detects it first, while MOTR roughly identifies it after expanding its search area. SLT and TransT remain focused on background regions. At frame 34, they mistakenly detect a nearby trash bin. After frame 56, when the vehicle fully emerges, all trackers except MOTR and SLT reacquire the target successfully, with our method achieving the highest localization precision.
Overall, our method demonstrates superior robustness and adaptability under complex dynamic conditions involving occlusion, disappearance, and reappearance, outperforming other state-of-the-art trackers qualitatively.
4.4. Ablation Study and Analysis
4.4.1. Framework-Level Ablation Across Datasets
To validate the effectiveness of our proposed tracking algorithm, ablation studies are conducted on the RL-based sequence-level training strategy and the sliding-window-based sampling mechanism. Experiments are carried out on the OTB, LaSOT, and VTD datasets, evaluating the success rate of different model variants on test sequences.
Model (a) serves as the baseline, TransT, trained using a traditional frame-level strategy. Model (b) extends this baseline with the proposed sequence-level training framework under the reinforcement learning paradigm, where each training sequence is sampled using a random interval strategy. Model (c) further incorporates the sliding-window sampling mechanism to extract temporally coherent and motion-diverse video clips for training. The results are summarized in Table 4.
Comparing models (a) and (b), the sequence-level training framework achieves improvements of 1.1%, 5.7%, and 0.6% in success rate on the OTB, LaSOT, and VTD datasets, respectively. When comparing (b) and (c), the proposed sliding-window sampling mechanism achieves further gains, confirming that temporally structured sampling enhances the model’s ability to learn motion continuity and long-term dependencies.
These findings demonstrate that the sequence-level framework substantially enhances the model’s temporal reasoning capability, improving its adaptability across different motion patterns and ensuring greater robustness in fast motion, disappearance, and reappearance scenarios.
4.4.2. Module-Level Ablation
While the framework-level ablation validates the effectiveness of sequence-level RL training and temporal sampling, it remains important to disentangle the contribution of individual components within the RL optimization pipeline. Therefore, we conduct fine-grained, module-level ablations on SCST baseline design, stochastic sampling, and reward design (reward shaping). All variants in this subsection share the same backbone, optimizer, training schedule, and sliding-window parameters, and only the studied module is changed. Following the standard OPE protocol on OTB100, we report Precision and Success Rate.
Ablation on SCST. To isolate the effect of SCST, we compare three baseline strategies in policy gradient optimization: no baseline (REINFORCE), batch-mean reward baseline, and SCST with greedy baseline. As shown in Table 5, SCST yields the best performance, indicating that the greedy inference baseline effectively reduces gradient variance and stabilizes sequence-level optimization without introducing extra learnable components.
Ablation on stochastic sampling. We next examine the contribution of stochastic sampling in the classification head. We compare deterministic greedy training (no sampling), stochastic sampling with fixed temperature, and stochastic sampling with an annealed temperature schedule. As reported in Table 6, purely greedy training leads to inferior performance, while stochastic sampling consistently improves both precision and success. The annealed schedule performs best by encouraging exploration in early training and improving convergence in later stages.
Ablation on reward design (reward shaping). Finally, we isolate the effect of reward design. We compare terminal-only IoU reward, frame-wise IoU average reward, and sequence-level AOR reward. The results in Table 7 show that sequence-level AOR provides the strongest supervision signal for long-term tracking stability, outperforming terminal-only and frame-wise alternatives.
These fine-grained ablations demonstrate that the improvements of our method are not due to a single factor. SCST provides an effective variance-reduction baseline, stochastic sampling enhances exploration and recovery under challenging scenarios, and the sequence-level AOR reward aligns training with evaluation to improve long-term stability. Together, these modules form a coherent RL optimization framework that consistently improves tracking performance on OTB100.
4.4.3. Ablation on Temporal Sampling Hyperparameters
Beyond the improvements brought by sequence-level training and the sliding-window sampling mechanism, it is also important to understand how the specific temporal sampling parameters affect the overall performance. To further analyze the temporal modeling behavior of our method, we conduct a detailed ablation study on the window size $W$, stride $S$, and clip length $l$. This study is performed on the LaSOT validation split, with results reported using AUC and SR metrics to comprehensively assess their impact on tracking accuracy and stability.
Effect of Window Size $W$. The results in Table 8 show that tracking performance consistently improves as the window size increases. Smaller windows provide limited temporal coverage and therefore miss longer motion patterns, while larger windows supply richer temporal context. The best performance is achieved at the largest window size evaluated, indicating that the evaluated sequences benefit from capturing broader temporal dependencies.
Effect of Stride $S$. Stride determines the temporal spacing between adjacent windows. A very small stride results in overly dense sampling with redundant frames, whereas a large stride skips meaningful temporal transitions and weakens continuity. A moderate stride yields the strongest performance, suggesting that balanced temporal overlap is essential for effective sequence-level training.
Effect of Clip Length $l$. Clip length controls the temporal duration modeled by each training sample. Short clips do not capture long-range motion cues, while longer clips provide more complete temporal evolution. Performance peaks at a longer clip length, demonstrating that extended temporal context enhances the RL policy's ability to learn stable decision patterns without causing optimization instability.
These findings demonstrate that temporal sampling plays a crucial role in sequence-level reinforcement learning: expanding temporal coverage (larger W and l) and maintaining sufficient overlap between sampled clips (smaller S) both contribute meaningfully to performance. Importantly, the results confirm that the temporal parameters are not arbitrarily chosen—each parameter affects the degree of motion continuity and temporal context available during training, and the trends observed across the ablation studies provide empirical evidence guiding their selection.
5. Conclusions
In this paper, we proposed a novel RL-based vehicle tracking algorithm that employs a sequence-level training strategy within a reinforcement learning framework. By designing a new training objective consistent with the OPE metric, the proposed approach effectively resolves the inherent inconsistency between frame-level training objectives and real-world sequential tracking requirements. Through the use of sequence-level training data, the algorithm learns temporal dependencies of target motion, thereby mitigating the discrepancy between training and testing data distributions. This enables the tracker to robustly handle complex scenarios such as occlusion, disappearance, and reappearance. Additionally, a sliding-window-based sample extraction mechanism is designed to provide training sequences with rich motion information, enhancing the model’s perception and understanding of diverse motion patterns. Extensive quantitative and qualitative experiments conducted on the OTB, LaSOT, and VTD datasets demonstrate that the proposed RLTransT achieves SOTA tracking performance, effectively addressing challenges such as complex backgrounds, similar objects, occlusion, and vehicle disappearance or reappearance in real-world traffic environments.
Although the proposed RL-based tracking framework demonstrates strong robustness and generalization, several limitations remain. First, the RL process relies on trajectory sampling, which increases computational cost during training and may limit scalability on large-scale datasets. Second, the current reward design primarily focuses on overlap-based performance; incorporating multi-modal cues such as motion consistency, long-term re-identification confidence, or scene understanding could further enhance robustness. Third, while our sliding-window-based sequence extraction enriches temporal diversity, it does not explicitly model long-range temporal dependencies that span extended occlusion periods. In future work, we plan to explore more efficient RL optimization strategies, richer reward formulations, and memory-augmented architectures capable of better handling long-term tracking challenges. Furthermore, integrating multi-agent interaction modeling may improve vehicle tracking performance in crowded traffic scenes.