A Transformer-Based Multi-Task Learning Model for Vehicle Traffic Surveillance

Hermosillo-Reynoso, Fernando; López-Pimentel, Juan-Carlos; Ruiz-Ibarra, Erica; García-Berumen, Armando; Del-Puerto-Flores, José A.; Gilardi-Velazquez, H. E.; Kumaravelu, Vinoth Babu; Luna-Rodriguez, L. A.

doi:10.3390/math13233832


by Fernando Hermosillo-Reynoso ¹,†, Juan-Carlos López-Pimentel ¹, Erica Ruiz-Ibarra ², Armando García-Berumen ², José A. Del-Puerto-Flores ¹,*,†, H. E. Gilardi-Velazquez ¹, Vinoth Babu Kumaravelu ³ and L. A. Luna-Rodriguez ¹

¹ Facultad de Ingeniería, Universidad Panamericana, Álvaro del Portillo 49, Zapopan 45010, Mexico
² Departamento de Ingeniería Eléctrica y Electrónica, Instituto Tecnológico de Sonora, Ciudad Obregon 85000, Mexico
³ Department of Communication Engineering, School of Electronics Engineering, Vellore Institute of Technology, Vellore 632014, India
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.

Mathematics 2025, 13(23), 3832; https://doi.org/10.3390/math13233832 (registering DOI)

Submission received: 31 October 2025 / Revised: 24 November 2025 / Accepted: 25 November 2025 / Published: 29 November 2025

(This article belongs to the Special Issue Data-Driven Decentralized Learning for Future Communication Networks)


Abstract

Vehicle traffic surveillance (VTS) systems automatically analyze video sequences to detect, classify, and track vehicles in urban environments. New VTS designs require architectures that are both computationally efficient and accurate. Conventional approaches based on multi-stage pipelines have been used successfully over the last decade, but they need to be improved to cope with complex, high-mobility traffic environments. This article proposes an efficient transformer-based system for VTS and evaluates it in scenarios with high vehicle density and occlusions. Compared with conventional multi-stage systems under the same conditions and parameters, the proposed scheme reduces the computational complexity required for multi-object detection and tracking, while achieving a Multiple Object Tracking Accuracy (MOTA) of 0.757, an identity F1 score (IDF1) of 0.832, and a detection precision of 0.934. These results support the viability of deploying the proposed system in practical high-density traffic surveillance applications.

Keywords: vehicle traffic surveillance; multi-task learning; transfer learning; transformer

MSC: 68T05; 68T07; 68T45

1. Introduction

For smart city applications, intelligent transportation systems (ITSs) have emerged as an economically, socially, and environmentally sustainable solution to the challenges of modern transportation [1,2]. ITSs interconnect vehicles, pedestrians, and infrastructure through sensing, communication, and computation technologies to improve transportation efficiency and safety [3,4].

As a key component of ITS, VTS systems automatically analyze traffic scenes to extract high-level semantic information, such as vehicle trajectories, speeds, traffic density, and lane occupancy. These data form the basis for higher-level ITS applications, including collision detection [5], route optimization [6], and intelligent traffic management [7,8].

Due to the complex nature of traffic scenes, a VTS system is typically broken into small, relatively independent tasks, such as vehicle detection, tracking, and classification. These tasks are often addressed under a single-task learning (STL) paradigm, in which each task is learned separately and later combined to solve the overall problem [9,10]. However, STL methods fail to leverage the shared and complementary knowledge among tasks, which may limit overall performance.

To overcome these limitations, multi-task learning (MTL) has emerged as a promising machine learning paradigm for simultaneously learning several tasks [11]. In contrast to STL, MTL leverages shared representations across tasks. This introduces an inductive bias that constrains the hypothesis space toward solutions that generalize better, reducing both the risk of overfitting and the computational complexity, particularly when the tasks are closely related [11,12]. Prior studies have reported cases in which MTL outperforms STL. For instance, ref. [13] reported that MTL improved document classification performance while also reducing computation time by 27%. Another study reported simultaneous improvements in depth estimation and semantic segmentation [14], highlighting the effectiveness of shared representations.

Recent advances in neural network architectures have further enhanced shared representation learning by introducing mechanisms that regularize the interactions between shared and task-specific knowledge. In particular, transformer-based architectures have shown remarkable success in extracting more expressive and transferable features across related computer vision tasks via self-attention mechanisms [15,16,17]. Unlike a convolutional neural network (CNN), which uses convolutional filters to extract local spatial features, a vision transformer treats an image as a sequence of patches and leverages self-attention to capture long-range relationships and global context across patches. This makes transformers particularly useful for MTL, in which related tasks can benefit from shared, context-aware representations.

Inspired by these ideas, this study presents a transformer-based MTL model for traffic monitoring. This work extends our previous research on multi-view and MTL VTS systems [18] by incorporating modern neural architectures to enhance feature abstraction and data representation. Specifically, we extend the MOTR framework [19], originally developed for multi-object tracking in human motion analysis, as an end-to-end multi-task model for vehicle detection, classification, and tracking in traffic surveillance.

The main contributions of this work are summarized as follows:

We present a high-performance transformer-based MTL model for VTS systems that jointly addresses detection, tracking, and classification.
We perform extensive experiments on standard VTS benchmarks, which demonstrate that the proposed approach achieves performance competitive with the state of the art across tasks.
To the best of our knowledge, this is the first end-to-end MTL model for vehicle detection, tracking, and classification, providing a scalable and generalizable solution for real-world traffic analysis.

The remainder of this paper is organized as follows: Section 2 reviews related work on vehicle detection, tracking, and classification. Section 3 formally defines the problem, followed by a comprehensive description of the MOTR architecture in Section 4. The experimental results and discussion are presented in Section 5 and Section 6, respectively. Finally, Section 7 concludes the paper and outlines future directions.

2. Related Work

The typical pipeline of a VTS system involves a series of related tasks for an accurate semantic understanding of traffic scenes. These tasks commonly include vehicle detection, tracking, and classification. Each of them contributes to the overall system: detection localizes vehicles in each frame, tracking maintains their temporal identities, and classification assigns semantic categories such as car, bus, or truck to each vehicle. In the following subsections, we review representative approaches to each of these tasks, emphasizing the evolution from traditional handcrafted methods to modern deep learning (DL).

2.1. Moving Vehicle Detection Task

Moving vehicle detection is an essential and challenging part of VTS systems, as it involves accurately identifying vehicles in complex environments affected by adverse weather, shadows, visual occlusions, and varying lighting conditions, and where multiple objects coexist, including cyclists, pedestrians, and waving vegetation [20].

In general, vehicle detection methods can be grouped into three main approaches: appearance-based, motion-based, and DL-based.

Appearance-based approaches model vehicle appearance using handcrafted descriptors such as color and edges [21], the Histogram of Oriented Gradients (HOG) [22], Scale-Invariant Feature Transform (SIFT) [23], Haar-like features [24], Local Binary Patterns (LBPs) [25,26], and wavelet-based representations [27]. Other studies have combined multiple descriptors using tensor decompositions, including Tucker decomposition to fuse HOG, LBP, and Four-Direction Features (FDFs) for vehicle detection at night [28].
Motion-based approaches exploit temporal changes of pixel values through frame differencing [29,30], background subtraction [31,32,33,34], optical flow [35,36], and subspace methods [37,38].
DL-based approaches have recently improved vehicle detection by automatically learning robust appearance features, overcoming the limitations of handcrafted descriptors. Notable models include YOLO [39], Faster R-CNN [40], and Mask R-CNN [41]. More recently, transformer-based architectures like Detection Transformer (DETR) [42] introduced an end-to-end detection paradigm that formulates object detection as a direct set prediction problem, leveraging attention mechanisms to capture long-range spatial dependencies and global contextual relationships across the entire scene.

Occlusion Handling

A major challenge of vehicle detection is visual occlusion, in which vehicles or parts of them are hidden behind other objects in the traffic scene. Occluded vehicles often lead to missed detections, inaccurate localization, or the misclassification of partially visible vehicles, potentially degrading overall vehicle detection performance.

Occlusion handling can fall into two main approaches: heuristic-based and DL-based.

Heuristic-based approaches typically exploit prior knowledge of vehicle geometry and motion—such as shape convexity, symmetry, aspect ratio, and bounding box overlapping—to detect occluded vehicles [9,43,44,45,46]. However, such assumptions often fail in complex scenarios involving heavy occlusion, irregular vehicle shapes, and unpredictable motion patterns.
DL-based approaches have significantly improved occlusion handling, as these models can not only detect and resolve partially hidden vehicles but also infer or reconstruct occluded parts by leveraging learned features and contextual information [47,48,49,50,51].

2.2. Vehicle Tracking Task

Tracking methods model the temporal evolution of vehicles, including motion trajectories and appearance features, across frames. In VTS systems, accurately tracking vehicles is challenging due to the presence of multiple vehicles and other moving objects. For multi-object tracking (MOT), a common strategy is tracking-by-detection, in which vehicles are first detected in each frame and then associated with those detected in previous frames using appearance or motion cues.

Typically, vehicle tracking approaches fall into three categories: appearance-based, motion-based, and DL-based approaches.

Appearance-based methods rely on visual similarity measures, such as color histograms, texture descriptors, and local features extracted from vehicle images [52,53,54]. However, these methods are sensitive to illumination, perspective, and occlusions, which can degrade tracking performance.
Motion-based methods model vehicle dynamics using IoU heuristics or probabilistic filters such as Kalman and particle filters, followed by assignment solvers like the Hungarian algorithm [9,55,56,57]. Classical motion-based trackers include SORT, which uses Kalman filtering and the Hungarian algorithm to estimate bounding box states and associate tracks in real-time [58], and ByteTrack, which extends SORT by also associating low-confidence detections to maintain consistent tracking in crowded or occluded scenes [59]. Nevertheless, they often fail under abrupt maneuvers, irregular trajectories, or heavy occlusions.
DL methods have improved tracking robustness by leveraging automatically learned feature embeddings for better discrimination and data association [60]. DeepSORT extends SORT by extracting appearance features via a CNN for re-identification after occlusions [61]. Other neural network approaches include Tracktor [62] and CenterTrack [63]. Transformer-based architectures, including TrackFormer [17] and TransTrack [64], leverage attention mechanisms to capture long-term temporal dependencies, enabling more robust tracking than traditional approaches such as Kalman filters, which rely on simplified observation models. Recently, MOTR [19] introduced an end-to-end detection and tracking pipeline with track queries for identity propagation, leveraging transformers to learn nonlinear temporal dependencies directly from the data. MOTRv2 [65] has further extended this framework by incorporating a pretrained YOLOX detector that supplies detection priors to enhance detection performance.
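The tracking-by-detection association step described above can be illustrated with a minimal sketch. It is not the implementation of SORT or any cited tracker: it uses a simple greedy match on intersection over union (IoU) in place of the Hungarian algorithm, omits the Kalman prediction step, and the function names and the `iou_min` threshold are assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, iou_min=0.3):
    """Greedily match existing tracks to new detections by descending IoU.

    tracks: dict mapping track id -> last known box.
    detections: list of boxes found in the current frame.
    Returns (matches, unmatched), where matches maps track id -> detection
    index and unmatched lists detection indices that may start new tracks.
    """
    pairs = sorted(
        ((iou(t_box, d_box), tid, j)
         for tid, t_box in tracks.items()
         for j, d_box in enumerate(detections)),
        reverse=True,
    )
    matches, used = {}, set()
    for score, tid, j in pairs:
        if score < iou_min:
            break  # remaining pairs overlap too little to be the same vehicle
        if tid in matches or j in used:
            continue
        matches[tid] = j
        used.add(j)
    unmatched = [j for j in range(len(detections)) if j not in used]
    return matches, unmatched


tracks = {1: (0, 0, 10, 10), 2: (50, 50, 10, 10)}
detections = [(52, 50, 10, 10), (1, 0, 10, 10), (200, 200, 10, 10)]
matches, unmatched = associate(tracks, detections)
```

In this toy frame, both tracks are continued and the third detection is left unmatched, so a tracker would treat it as a candidate new vehicle.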

2.3. Vehicle Classification Task

The vehicle classification task assigns detected vehicles to categories such as cars, buses, trucks, and motorcycles. These methods can fall into two main groups: appearance-based and DL-based approaches.

Appearance-based approaches model vehicle appearance using handcrafted features—including Image Moments, HOG, SIFT, and color histograms—that capture discriminative visual cues such as shape, texture, and color for different vehicle categories. Subsequently, these features are fed to traditional machine learning classifiers (e.g., support vector machines [9]), or clustering techniques (e.g., K-means [66]). However, the performance of such methods strongly depends on the quality and expressiveness of the handcrafted features.
DL approaches leverage CNNs to automatically learn discriminative vehicle features directly from images. Modern architectures, including YOLO [39], VGGNet [67], ResNet [68], DenseNet [69], and EfficientNet [70], have consistently outperformed traditional methods, enabling robust intra- and inter-class classification in challenging traffic scenarios.

Table 1 summarizes representative approaches in VTS systems. Nonetheless, despite substantial progress in vehicle detection, occlusion handling, tracking, and classification, most existing methods address these tasks independently, limiting their ability to learn shared representations and capture inter-task dependencies.

3. Problem Statement and Mathematical Formulation

3.1. Problem Statement

Given a traffic surveillance video recorded from a static camera at a specific frame rate, let $I^{(1)}, \dots, I^{(N)}$ denote a temporal sequence of frames, where $I^{(n)} \in \mathbb{R}^{H \times W \times 3}$ is the n-th frame and N is the total number of frames. Then, the goal of a VTS system is to analyze the sequence of frames to infer high-level semantic traffic information by jointly solving multiple tasks, such as vehicle detection, classification, and tracking.

3.2. Mathematical Formulation

Consider a collection of M supervised tasks in a VTS system. For the m-th task, we seek to learn a mapping $T_{m} : \mathbb{R}^{H \times W \times 3} \to Y_{m}$ that predicts, with high probability, the corresponding ground truth $y_{m} \in Y_{m}$ for a given frame $I$.

In our case study, we consider the following tasks:

Vehicle detection ( $T_{1}$ ): predicts the bounding box $y_{1}^{(n, k)} = b^{(n, k)}$ corresponding to the k-th vehicle on the road in frame n. Each bounding box is parameterized by the pixel coordinates of its top-left corner and its spatial size (width and height).
Vehicle classification ( $T_{2}$ ): assigns a category label $y_{2}^{(n, k)} = c^{(n, k)}$ to the k-th vehicle detected in the frame n.
Vehicle tracking ( $T_{3}$ ): maintains temporal identity $y_{3}^{(n, k)} = {id}^{(n, k)}$ for the k-th vehicle detected in the frame n.

Following the standard formulation of MTL [11], we further consider the parametric hypothesis class $T_{m} = h_{m} \circ g$ for the m-th task, where $g : \mathbb{R}^{H \times W \times 3} \to S$ denotes a mapping from the image space to a shared latent space $S$, parameterized by $\theta_{sh}$, and $h_{m} : S \to Y_{m}$ represents the task-specific mapping, parameterized by $\theta_{m}$. The shared parameters $\theta_{sh}$ encode features jointly learned from all tasks, while each task-specific $h_{m}$ specializes these features to solve the m-th task. Each task is also associated with a loss function, $L_{m} : Y_{m} \times Y_{m} \to \mathbb{R}_{+}$, that evaluates the discrepancy between the prediction and its corresponding ground truth.

MTL then seeks to simultaneously learn the set of task functions, $T_{1}, \dots, T_{M}$, by minimizing a weighted combination of their individual task losses, referred to as the multi-task empirical risk [71], as expressed in Equation (1):

$$\min_{\theta_{1}, \dots, \theta_{M}, \theta_{sh}} \sum_{m = 1}^{M} \frac{\lambda_{m}}{N} \sum_{n = 1}^{N} L_{m} \left( h_{m} \left( g \left( I^{(n)} \right) \right), Y_{m}^{(n)} \right), \tag{1}$$

where $I^{(1)}, \dots, I^{(N)}$ is a sequence of training frames and, for the m-th task, $\lambda_{m}$ denotes its task-importance weight, and $Y_{m}^{(n)} := \{ y_{m}^{(n, 1)}, \dots, y_{m}^{(n, K_{n})} \} \subseteq Y_{m}$ denotes the set of $K_{n}$ ground truth instances associated with the n-th frame $I^{(n)}$.
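As an illustration of Equation (1), the following sketch evaluates the multi-task empirical risk for given mappings. It only computes the objective and does not optimize it. The toy shared mapping, task heads, losses, and weights are assumptions chosen so the arithmetic is easy to follow, with scalars standing in for frames and labels.

```python
def multitask_risk(frames, targets, g, heads, losses, weights):
    """Weighted multi-task empirical risk, following Equation (1).

    frames:  list of N inputs I^(n).
    targets: targets[m][n] is the ground truth of task m for frame n.
    g:       shared mapping from an input to the latent space.
    heads:   list of task-specific mappings h_m.
    losses:  list of task losses L_m(prediction, ground_truth).
    weights: list of task-importance weights lambda_m.
    """
    n_frames = len(frames)
    # The shared representation is computed once per frame and reused by
    # every task head, which is where the computational saving comes from.
    shared = [g(frame) for frame in frames]
    risk = 0.0
    for m, (head, loss, lam) in enumerate(zip(heads, losses, weights)):
        task_sum = sum(loss(head(s), targets[m][n]) for n, s in enumerate(shared))
        risk += lam / n_frames * task_sum
    return risk


# Toy setup (all values hypothetical).
frames = [1.0, 2.0]
g = lambda x: 2.0 * x
heads = [lambda s: s + 1.0, lambda s: s - 1.0]
losses = [lambda p, y: abs(p - y), lambda p, y: (p - y) ** 2]
targets = [[3.0, 4.0], [0.0, 3.0]]
weights = [1.0, 0.5]

risk = multitask_risk(frames, targets, g, heads, losses, weights)
```

Here each task accumulates a loss of 1 over the two frames, so the risk is 1.0/2 + 0.5/2 = 0.75.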

Figure 1 shows a block diagram that visually summarizes the proposed MTL setup and complements the mathematical formulation. For a more detailed visual description of the architecture and the interactions between modules, the reader is referred to Figure 2.

4. Transformer-Based Multi-Task Learning Model for Vehicle Traffic Surveillance

Traditional VTS systems rely on multi-stage pipelines that sequentially address detection, classification, and tracking tasks for the effective monitoring of traffic environments. In such pipelines, each stage is addressed independently, which may lead to suboptimal performance due to their limited capacity to leverage shared representations across tasks.

The MTL formulation in Equation (1) addresses these limitations by enabling a unified end-to-end framework that jointly performs object detection, classification, and tracking, while promoting a shared representation across tasks. Nevertheless, two main challenges arise from this formulation. First, feature sharing must ensure that learned representations capture complementary spatial patterns across tasks. Second, temporal coherence must maintain vehicle identities and representations under occlusions, illumination variations, and dense traffic interactions.

Recent advances in DL have significantly enhanced the performance of VTS systems. For instance, models based on the YOLO architecture have demonstrated remarkable accuracy for vehicle detection and classification [72,73]. However, their inherent multi-stage design treats detection and temporal association independently, which limits their ability to capture spatio-temporal dependencies. To overcome these limitations, the proposed system adapts the MOTR framework for VTS applications. In the next subsection, we will describe the MOTR architecture.

4.1. Proposed Model

The overall architecture of the proposed VTS system is illustrated in Figure 2. Video streams from static traffic surveillance cameras are processed frame-by-frame and fed into the MOTR network, which forms the core of the VTS model. MOTR consists of a transformer-based encoder-decoder architecture designed for end-to-end multi-object tracking. The backbone encoder, based on ResNet-50 with deformable transformer layers [16], extracts rich spatial features, while the transformer decoder utilizes two types of learnable queries: detection queries, responsible for identifying newly appearing vehicles, and tracking queries, which propagate the identities of vehicles detected in previous frames.

This encoder–decoder mechanism allows MOTR to detect and re-identify vehicles in continuous video streams, producing reliable bounding boxes, high-confidence classification outputs, and maintaining consistent vehicle identities across frames.

Let $Q_{d}$ denote the learnable detection queries and $\hat{Q}_{tr}^{(n)}$ the predicted tracking queries at frame n. For the first frame ($n = 1$), the model only uses detection queries, i.e., $\hat{Q}_{tr}^{(0)} = \emptyset$. For subsequent frames ($n > 1$), the query set is recursively concatenated, as Equation (2) shows:

$$\hat{Q}_{tr}^{(n)} = Q_{d} \cup \hat{Q}_{tr}^{(n - 1)}. \tag{2}$$
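The recursion in Equation (2) can be sketched as a simple bookkeeping loop. This is a schematic illustration, not MOTR code: queries are represented by string labels rather than embeddings, the number of newborn vehicles per frame is a made-up input, and track termination is ignored.

```python
def build_query_set(detection_queries, prev_track_queries):
    """Concatenate the fixed detection queries with the track queries
    propagated from the previous frame (empty for the first frame)."""
    return list(detection_queries) + list(prev_track_queries)


detection_queries = ["det_0", "det_1", "det_2"]
track_queries = []                # no track queries before the first frame
newborns_per_frame = [2, 0, 1]    # hypothetical newborn counts per frame
sizes = []

for n, newborn in enumerate(newborns_per_frame, start=1):
    queries = build_query_set(detection_queries, track_queries)
    sizes.append(len(queries))
    # Each newborn vehicle found at frame n adds one track query that is
    # carried over to frame n + 1.
    track_queries = track_queries + [f"trk_f{n}_{k}" for k in range(newborn)]
```

The decoder input grows only when new vehicles appear: the first frame uses the three detection queries alone, and later frames add one query per vehicle already being tracked.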

To correctly associate predicted queries with ground-truth vehicle instances, a tracklet-aware label assignment (TALA) mechanism is employed. Following the principles in [19], this mechanism separates label assignment for detection and tracking queries:

Detection queries are matched exclusively to newborn vehicles using bipartite matching between the predictions of detection queries, $\hat{Q}_{d}^{(n)}$, and the ground truth of newborn objects, $O_{new}^{(n)}$, at frame n.
Tracking queries are matched according to their temporal identity consistency, inheriting their assignments from the previous frame to preserve temporal identity.

Formally, let $\omega_{d}^{(n)}$ and $\omega_{tr}^{(n)}$ denote the label assignments for detection and tracking queries, respectively. Then, the total assignment for the n-th frame is given via Equation (3),

$$\begin{aligned} \omega_{tr}^{(n)} &= \omega_{tr}^{(n - 1)} \cup \omega_{d}^{(n - 1)}, \\ \omega_{d}^{(n)} &= \underset{\omega_{d}^{(n)} \in \Omega^{(n)}}{\arg\min} \; L \left( \hat{Q}_{d}^{(n)} \mid \omega_{d}^{(n)}, O_{new}^{(n)} \right), \end{aligned} \tag{3}$$

where $L$ denotes the pairwise matching cost, as defined in DETR [42], and $\Omega^{(n)}$ represents the space of all possible bipartite associations between detection queries and newborn objects [19].
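The two rules in Equation (3) can be sketched as follows. This is an illustrative toy, not the MOTR implementation: the bipartite matching is solved by brute-force enumeration instead of the Hungarian algorithm (practical only for a handful of objects), the cost matrix holds made-up numbers rather than the DETR matching cost, and the function names and label strings are assumptions.

```python
from itertools import permutations


def match_newborns(cost):
    """Minimum-cost one-to-one matching by exhaustive search.

    cost[i][j] is the matching cost between detection query i and newborn
    object j; the number of queries must be at least the number of objects.
    Returns (assignment, total) where assignment maps object j -> query i.
    """
    n_queries = len(cost)
    n_objects = len(cost[0]) if cost else 0
    best, best_total = {}, float("inf")
    for perm in permutations(range(n_queries), n_objects):
        total = sum(cost[i][j] for j, i in enumerate(perm))
        if total < best_total:
            best, best_total = dict(enumerate(perm)), total
    return best, best_total


def tala_step(prev_track_assign, prev_det_assign, cost, newborn_ids):
    """One frame of tracklet-aware label assignment.

    Track queries inherit every assignment made in the previous frame;
    detection queries are matched only against newborn objects.
    """
    track_assign = {**prev_track_assign, **prev_det_assign}
    matching, _ = match_newborns(cost)
    det_assign = {newborn_ids[j]: f"det_{i}" for j, i in matching.items()}
    return track_assign, det_assign


# Three detection queries, two newborn vehicles (hypothetical costs and ids).
cost = [[0.9, 0.2],
        [0.1, 0.8],
        [0.5, 0.5]]
track_assign, det_assign = tala_step(
    prev_track_assign={7: "trk_0"},
    prev_det_assign={9: "det_2"},
    cost=cost,
    newborn_ids=[11, 12],
)
```

Vehicles 7 and 9 keep the queries they were assigned earlier, while the two newborn vehicles receive the cheapest pair of detection queries.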

This formulation ensures that newborn vehicles (e.g., cars entering the monitored area) are detected via $\hat{Q}_{d}^{(n)}$, while ongoing vehicle tracks are maintained through $\hat{Q}_{tr}^{(n)}$. This mechanism is particularly relevant to VTS applications, where vehicles frequently enter and exit the camera’s field of view due to occlusions, lane changes, or traffic flow dynamics.

During inference, the model processes frames sequentially and updates the query sets in an online manner. The hidden states corresponding to the tracked vehicles are passed to the next frame through a Query Interaction Module (QIM), maintaining inter-frame consistency [19].

4.2. Multi-Task Learning Framework

The MOTR learning problem can be modeled as a joint optimization problem that simultaneously learns to detect, classify, and track multiple objects within a unified transformer framework. This formulation enables the model to exploit representations shared between two closely related tasks: object detection and temporal identity association. Compared to traditional multi-stage VTS pipelines, transformer-based architectures offer a key advantage: they learn long-range spatial and temporal relationships in complex traffic scenes (including variations in vehicle appearance and occlusion, as well as complex maneuvers) without relying on naive assumptions or hand-crafted features.

Formally, given a sequence of N frames, $I^{(1)}, \dots, I^{(N)}$, with ground truths $Y = \{ (Y_{1}^{(n)}, Y_{2}^{(n)}, Y_{3}^{(n)}) \}_{n = 1}^{N}$, their corresponding predictions, $\hat{Y} = \{ (\hat{Y}_{1}^{(n)}, \hat{Y}_{2}^{(n)}, \hat{Y}_{3}^{(n)}) \}_{n = 1}^{N}$, and the set of matching assignments between predictions and ground truth instances, $\omega = \{ \omega^{(n)} \}_{n = 1}^{N}$, we seek to minimize the overall loss, $L_{o}$, of the entire video sequence, called the collective average loss (CAL), as defined in Equation (4):

$$L_{o} ( \hat{Y} \mid \omega, Y ) = \frac{\sum_{n = 1}^{N} \left( L ( \hat{Q}_{tr}^{(n)} \mid \omega_{tr}^{(n)}, Q_{tr}^{(n)} ) + L ( \hat{Q}_{d}^{(n)} \mid \omega_{d}^{(n)}, Q_{d}^{(n)} ) \right)}{\sum_{n = 1}^{N} V^{(n)}}. \tag{4}$$

Here, $V^{(n)} = V_{tr}^{(n)} + V_{d}^{(n)}$ is the total number of ground-truth objects in the frame n, where $V_{tr}^{(n)}$ and $V_{d}^{(n)}$ are the numbers of tracked objects and newborns, respectively. The function $L$ represents the loss for a single frame, which is defined as in Equation (5):

$$L ( \hat{Y}^{(n)} \mid \omega^{(n)}, Y^{(n)} ) = \lambda_{cls} L_{cls} + \lambda_{1} L_{1} + \lambda_{giou} L_{giou}, \tag{5}$$

where $L_{cls}$ is the focal loss [74], $L_{1}$ is the $\ell_{1}$ loss, and $L_{giou}$ is the generalized intersection-over-union (gIoU) loss [75], while $\lambda_{cls}$, $\lambda_{1}$, and $\lambda_{giou}$ are their corresponding weights.
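To make Equations (4) and (5) concrete, the sketch below computes a per-frame loss over already matched prediction/ground-truth pairs and then normalizes the summed frame losses by the total number of objects. It is a simplified illustration under several assumptions: boxes are `(x1, y1, x2, y2)` tuples with positive area, classification is reduced to the predicted probability of the correct class, and the focal-loss parameters and the weights `w_cls`, `w_l1`, and `w_giou` are placeholder values, since the text does not state the ones used.

```python
import math


def giou(a, b):
    """Generalized IoU of two boxes (x1, y1, x2, y2) with positive area."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Area of the smallest box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def focal_loss(p_correct, alpha=0.25, gamma=2.0):
    """Focal loss for a positive sample with probability p_correct."""
    p = max(p_correct, 1e-12)  # avoid log(0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)


def frame_loss(pairs, w_cls=2.0, w_l1=5.0, w_giou=2.0):
    """Equation (5), summed over matched (pred_box, gt_box, p_correct) pairs."""
    total = 0.0
    for pred, gt, p_correct in pairs:
        l1 = sum(abs(p - g) for p, g in zip(pred, gt))
        total += (w_cls * focal_loss(p_correct)
                  + w_l1 * l1
                  + w_giou * (1.0 - giou(pred, gt)))
    return total


def collective_average_loss(frame_losses, objects_per_frame):
    """Equation (4): summed frame losses divided by the total object count."""
    total_objects = sum(objects_per_frame)
    return sum(frame_losses) / total_objects if total_objects else 0.0
```

Normalizing by the total object count, rather than averaging per frame, means that crowded frames contribute in proportion to the number of vehicles they contain.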

5. Experimental Results

In this section, the performance of the model is evaluated using the UA-DETRAC benchmark dataset [76]. Our experiments assess the robustness of the model under various environmental conditions and traffic densities, as well as its ability to maintain stable multi-object detection and tracking accuracy. Quantitative results are reported in terms of standard MOT metrics (see Section 5.2). All experiments were conducted on a workstation equipped with an Intel Core i7-13650HX CPU, 16 GB of RAM, and an NVIDIA RTX 4050 GPU, running Windows 11 with CUDA 12.2 and PyTorch 2.1.0.

5.1. Dataset Description

We employ the UA-DETRAC benchmark [76] for evaluation. UA-DETRAC is a fully annotated, large-scale corpus for vehicle detection and multi-object tracking in real-world traffic surveillance. It contains more than 140,000 frames from 100 video sequences at a native resolution of 960 × 540 and 25 FPS, all acquired with fixed cameras. The sequences span urban scenes with diverse traffic densities, illumination conditions, and weather, yielding frequent partial/heavy occlusions and pronounced scale variation, conditions that stress both detection and association.

Each frame provides tight bounding boxes for every vehicle, together with a persistent object identifier (obj_id), to maintain temporal consistency across frames. Additional per-instance attributes include the vehicle category (car, bus, van, others), the occlusion level (none, partial, heavy), and the truncation ratio, defined as the fraction of the object area outside the image bounds. These annotations support a controlled analysis of tracking performance under visibility degradation and class-specific traffic composition.
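A per-instance annotation of this kind could be represented as in the sketch below, which also groups instances into tracklets and filters them by occlusion level. Only `obj_id` and the attribute values come from the description above; the class name, the remaining field names, and the sample records are hypothetical, and the benchmark's actual file format is not reproduced here.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class VehicleAnnotation:
    frame: int         # frame index within the sequence
    obj_id: int        # persistent identity across frames
    box: tuple         # (left, top, width, height) in pixels
    category: str      # "car", "bus", "van", or "others"
    occlusion: str     # "none", "partial", or "heavy"
    truncation: float  # fraction of the object area outside the image


def tracklets(annotations):
    """Group annotations by obj_id, each group ordered by frame index."""
    groups = defaultdict(list)
    for ann in annotations:
        groups[ann.obj_id].append(ann)
    return {oid: sorted(anns, key=lambda a: a.frame) for oid, anns in groups.items()}


def by_occlusion(annotations, level):
    """Select instances with a given occlusion level for targeted analysis."""
    return [ann for ann in annotations if ann.occlusion == level]


# Hypothetical records for two vehicles over two frames.
sample = [
    VehicleAnnotation(2, 1, (105, 40, 60, 40), "car", "partial", 0.0),
    VehicleAnnotation(1, 1, (100, 40, 60, 40), "car", "none", 0.0),
    VehicleAnnotation(1, 2, (900, 300, 80, 120), "bus", "heavy", 0.25),
]
tracks = tracklets(sample)
heavy = by_occlusion(sample, "heavy")
```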

5.2. Quantitative Results

The performance of the MOTR model was evaluated using ten representative videos from the UA-DETRAC dataset [76], covering diverse scenarios, including heavy traffic, partial occlusions, and varying motion patterns. The evaluation follows standard multi-object tracking protocols, reporting MOTA, IDF1, identity switches (IDS), identity precision (IDP), identity recall (IDR), precision, and recall metrics [76,77]. These are typical metrics used to evaluate detection reliability and identity continuity, where higher values of MOTA, IDP, IDR, IDF1, precision, and recall suggest superior overall performance, while lower IDS values mean enhanced temporal identity preservation across the sequence.
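For reference, the metrics above follow the standard MOT definitions, which the sketch below computes from raw counts. The counts in the example are invented purely to exercise the formulas and are not results from this study.

```python
def mota(fn, fp, ids, gt):
    """MOTA = 1 - (misses + false positives + identity switches) / GT objects."""
    return 1.0 - (fn + fp + ids) / gt


def precision(tp, fp):
    return tp / (tp + fp)


def recall(tp, fn):
    return tp / (tp + fn)


def identity_scores(idtp, idfp, idfn):
    """Return (IDP, IDR, IDF1) from identity-level true/false counts."""
    idp = idtp / (idtp + idfp)
    idr = idtp / (idtp + idfn)
    idf1 = 2 * idtp / (2 * idtp + idfp + idfn)
    return idp, idr, idf1


# Hypothetical counts for a single sequence.
gt, tp, fn, fp, ids = 1000, 800, 200, 50, 10
seq_mota = mota(fn, fp, ids, gt)
seq_precision = precision(tp, fp)
seq_recall = recall(tp, fn)
idp, idr, idf1 = identity_scores(idtp=780, idfp=70, idfn=220)
```

IDF1 is the harmonic mean of IDP and IDR, so a gap between the two (high IDP, lower IDR) pulls IDF1 toward the lower value.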

Table 2 presents the quantitative results. Sequences with moderate traffic density achieve superior performance, with MVI_40962 and MVI_40981 obtaining MOTA values of 0.863 and 0.796, respectively. These sequences also maintain low identity switch counts of 4 and 3, confirming effective identity preservation under favorable conditions. In contrast, sequences with heavy occlusion and dense traffic show reduced performance, with MVI_40963 and MVI_40204 reaching MOTA values of 0.710 and 0.697, along with elevated identity switch counts of 32 and 24, respectively.

The overall system performance in Table 3 shows a mean MOTA of 0.757 and IDF1 of 0.832 across the evaluated sequences. The high precision value of 0.934 indicates minimal false positives, while the recall of 0.796 reflects the system’s capability to maintain detection coverage in challenging environments. The standard deviation values demonstrate consistent performance across different traffic scenarios.

Figure 3 shows sample frames of four distinct UA-DETRAC videos in different traffic scenarios to supplement the numerical evaluation found in Table 2 and Table 3.

6. Discussion

The implementation of the transformer-based architecture in the proposed VTS system demonstrates robust multi-object tracking capabilities across diverse urban scenarios. The experimental results show that the system achieves a MOTA of 0.757 and an IDF1 of 0.832, while simultaneously maintaining a high detection precision of 0.934 across different traffic conditions.

Per-sequence analysis reveals performance variations aligned with environmental challenges. Sequences such as MVI_40962 and MVI_41073 achieve superior performance, with MOTA values of 0.863 and 0.822, respectively, under moderate traffic density and good visibility conditions. In contrast, sequences such as MVI_40212 and MVI_40204 show reduced performance, particularly in recall (0.711 and 0.775, respectively), highlighting limitations in handling severe occlusions and significant scale variations.

Identity preservation is further characterized by identity metrics. The high IDP value of 0.906 indicates that, once a track is established, identity assignments are rarely incorrect. Conversely, a lower IDR of 0.767 reveals that the system may fail to recover or maintain identities when targets reappear after occlusions or when appearance cues are degraded.

The number of IDS ranges from 2 to 32 per sequence, with elevated counts being aligned with challenging sequences: MVI_40963 (32), MVI_40204 (24), and MVI_20065 (20). These sequences also exhibit comparatively lower recall, supporting the hypothesis that identity fragmentation is primarily driven by missed detections during occlusion events. In contrast, MVI_20033 achieves the lowest IDS (2) while maintaining high precision (0.977), suggesting that the system can effectively stabilize identity assignment when scene dynamics are moderate.

The missed detections observed in Figure 3 are primarily attributed to the limited spatial resolution of the corresponding vehicles, which prevents the model from extracting sufficiently distinctive visual features. This is further supported by the fact that vehicles exhibiting similar visibility and occlusion conditions—but with a higher spatial resolution—are effectively detected. It is important to highlight that the UA-DETRAC benchmark does not consider these vehicles for tracking evaluation, so they do not impact the reported performance metrics.

6.1. Failure Modes and Operating Conditions

From the results in Table 2, the following observations can be made:

(a): Density and occlusion dominate errors. Sequences characterized by dense traffic and frequent partial occlusions (e.g., MVI_40212 and MVI_40204) show a lower recall and a higher IDS. This is consistent with identity drops caused by lower spatial visibility, rather than tracker confusion.
(b): Precision is uniformly strong. All sequences maintain precision $\geq 0.889$ , with several above $0.96$ (MVI_20064: $0.963$ ; MVI_20033: $0.977$ ; and MVI_40962: $0.988$ ). This suggests that false positives are well handled, and further improvements should focus on increasing recall while preserving precision.
(c): Association is reliable when observations persist. High IDP across sequences indicates that the association cost is well calibrated for temporally consistent tracks. Drops in IDR occur mostly when targets leave the field of view or are heavily truncated.

6.2. Implications and Future Improvements

The above findings imply that the current pipeline is well suited for deployment in fixed-camera traffic monitoring, where false alarms must be minimized. In highly congested scenes, performance is bounded by missed detections during transient occlusions. Accordingly, two avenues are expected to yield measurable improvements:

Occlusion-aware re-identification. Incorporating long-range appearance embeddings with explicit occlusion modeling and temporal memory (e.g., tracklet-level re-ID with motion gating) should raise IDR and reduce IDS in clips with frequent hide-and-reveal events.
Recall-oriented detection tuning. Threshold calibration and hard-example mining targeted at small, truncated, or partially visible vehicles can improve recall while preserving the current precision regime.
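The threshold calibration mentioned in the second avenue can be illustrated with a simple sweep. This sketch is not the authors' procedure. It assumes a held-out set of detections, each with a confidence score and a flag indicating whether it matched a ground-truth box, plus the total number of ground-truth boxes. It selects the score threshold with the highest recall among those whose precision stays at or above a target. The function name `calibrate_threshold`, the toy data, and the 0.8 target are hypothetical.

```python
def calibrate_threshold(scores, is_true_positive, num_gt, min_precision):
    """Return (threshold, precision, recall) with the highest recall among
    thresholds whose precision is >= min_precision, or None if none qualifies.
    Ties in recall are resolved in favor of the higher threshold."""
    # Visit detections from most to least confident.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best = None
    tp = fp = 0
    for rank, i in enumerate(order):
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        # Evaluate only once per distinct score so tied detections stay together.
        is_last = rank == len(order) - 1
        if not is_last and scores[order[rank + 1]] == scores[i]:
            continue
        precision = tp / (tp + fp)
        recall = tp / num_gt
        if precision >= min_precision and (best is None or recall > best[2]):
            best = (scores[i], precision, recall)
    return best


# Toy validation data (hypothetical): eight detections, six ground-truth boxes.
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30]
matched = [True, True, True, False, True, True, False, False]

result = calibrate_threshold(scores, matched, num_gt=6, min_precision=0.8)
print(result)
```

Keeping detections with a score at or above the returned threshold preserves the chosen precision floor on the validation set while admitting as many low-confidence true positives as possible. Whether the same operating point transfers to unseen sequences would need to be verified separately.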

6.3. Summary

In summary, the model delivers competitive tracking with high precision and stable identity assignment across diverse operating conditions (mean MOTA 0.757, IDF1 0.832). Performance degradations concentrate in dense-traffic, occlusion-heavy scenes, where recall and IDR are most affected, leading to higher IDS. Given that precision is already high and uniform, future work should prioritize recall enhancements and robust re-identification to close the remaining gap without compromising the low false-positive rate.

7. Conclusions

This paper has presented a comprehensive evaluation of a transformer-based model for multi-vehicle tracking in urban environments. The proposed system demonstrates consistent performance across various challenging scenarios, achieving a mean MOTA of 0.757 and IDF1 of 0.832 on the UA-DETRAC benchmark. The system maintains a high precision of 0.934 while effectively preserving vehicle identities with an average of 15 identity switches per sequence.

The unified architecture handles vehicle detection, classification, and tracking within a single framework, eliminating the complex multi-stage pipelines typically required in conventional VTS systems. The model remains robust under diverse conditions, including high traffic density, moderate partial occlusions, and varying illumination. It maintains tracking consistency when scene dynamics are moderate, as reflected in sequences such as MVI_20033, which contains only two identity switches. Multi-stage pipelines run an STL model for each task, which significantly increases inference time and memory usage. In contrast, the proposed model learns a shared representation that substantially reduces computational complexity, a crucial aspect in VTS systems, especially in resource-constrained environments.

Future work will focus on enhancing re-identification capabilities during prolonged occlusion events and optimizing the system for real-time deployment in ITS applications. The results demonstrate the viability of the proposed approach for practical VTS systems operating in complex urban environments, providing a solid foundation for next-generation urban traffic monitoring solutions.

Author Contributions

Conceptualization, F.H.-R. and J.A.D.-P.-F.; Methodology, F.H.-R. and J.A.D.-P.-F.; Software, F.H.-R., J.A.D.-P.-F. and H.E.G.-V.; Validation, J.-C.L.-P., A.G.-B., V.B.K. and L.A.L.-R.; Formal analysis, J.-C.L.-P., E.R.-I., J.A.D.-P.-F., H.E.G.-V., V.B.K. and L.A.L.-R.; Investigation, F.H.-R., J.-C.L.-P., E.R.-I., A.G.-B., J.A.D.-P.-F., V.B.K. and L.A.L.-R.; Resources, J.-C.L.-P. and E.R.-I.; Data curation, J.-C.L.-P., E.R.-I., A.G.-B., H.E.G.-V., V.B.K. and L.A.L.-R.; Writing—original draft, F.H.-R. and J.A.D.-P.-F.; Writing—review & editing, J.-C.L.-P., E.R.-I., A.G.-B., H.E.G.-V., V.B.K. and L.A.L.-R.; Visualization, J.-C.L.-P., A.G.-B., H.E.G.-V., V.B.K. and L.A.L.-R.; Supervision, F.H.-R., H.E.G.-V., V.B.K. and L.A.L.-R.; Project administration, J.-C.L.-P., E.R.-I., A.G.-B. and H.E.G.-V. All authors have read and agreed to the published version of the manuscript.

Funding

The authors would like to express their special thanks to the Programa de Fomento al Apoyo a Proyectos de Investigación (PROFAPI) for providing financial support through the project PROFAPI_2025-0562.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Hassan, M.A.; Javed, R.; Farhatullah; Granelli, F.; Gen, X.; Rizwan, M.; Ali, S.H.; Junaid, H.; Ullah, S. Intelligent Transportation Systems in Smart City: A Systematic Survey. In Proceedings of the 2023 International Conference on Robotics and Automation in Industry (ICRAI), Peshawar, Pakistan, 3–5 March 2023; pp. 1–9.
2. Elassy, M.; Al-Hattab, M.; Takruri, M.; Badawi, S. Intelligent transportation systems for sustainable smart cities. Transp. Eng. 2024, 16, 100252.
3. Tonix-Gleason, L.E.; Del-Puerto-Flores, J.A.; Castillo-Soria, F.R.; Parra-Michel, R.; Campos, F.P. Neural Network Aided M-PSK Detection in 802.11P V2V OFDM Systems Under ICI Conditions. IEEE Wirel. Commun. Lett. 2025, 14, 3420–3424.
4. Del Puerto-Flores, J.A.; Castillo-Soria, F.R.; Gutiérrez, C.A.; Peña-Campos, F. Efficient Index Modulation-Based MIMO OFDM Data Transmission and Detection for V2V Highly Dispersive Channels. Mathematics 2023, 11, 2773.
5. Kumar, N.; Shukla, H.; Rajalakhsmi, P. V2X Enabled Emergency Vehicle Alert System. arXiv 2024, arXiv:2403.19402.
6. Abdul-Hak, M.; Al-Holou, N.; Bazzi, Y.; Tamer, M.A. Predictive Vehicle Route Optimization in Intelligent Transportation Systems. Int. J. Data Sci. Technol. 2019, 5, 14–28.
7. Manikonda, P.; Yerrapragada, A.K.; Annasamudram, S.S. Intelligent traffic management system. In Proceedings of the 2011 IEEE Conference on Sustainable Utilization and Development in Engineering and Technology (STUDENT), Semenyih, Malaysia, 20–21 October 2011; pp. 119–122.
8. Hermosillo-Reynoso, F.; Torres-Roman, D.; Santiago-Paz, J.; Ramirez-Pacheco, J. A Novel Algorithm Based on the Pixel-Entropy for Automatic Detection of Number of Lanes, Lane Centers, and Lane Division Lines Formation. Entropy 2018, 20, 725.
9. Velazquez-Pupo, R.; Sierra-Romero, A.; Torres-Roman, D.; Shkvarko, Y.V.; Santiago-Paz, J.; Gómez-Gutiérrez, D.; Robles-Valdez, D.; Hermosillo-Reynoso, F.; Romero-Delgado, M. Vehicle Detection with Occlusion Handling, Tracking, and OC-SVM Classification: A High Performance Vision-Based System. Sensors 2018, 18, 374.
10. Chen, Y.; Hu, W. Robust Vehicle Detection and Counting Algorithm Adapted to Complex Traffic Environments with Sudden Illumination Changes and Shadows. Sensors 2020, 20, 2686.
11. Caruana, R. Multitask Learning. Mach. Learn. 1997, 28, 41–75.
12. Crawshaw, M. Multi-Task Learning with Deep Neural Networks: A Survey. arXiv 2020, arXiv:2009.09796.
13. Abdillah, A.F.; Hamidi, M.Z.; Esti Anggraeni, R.N.; Sarno, R. Comparative Study of Single-task and Multi-task Learning on Research Protocol Document Classification. In Proceedings of the 2021 13th International Conference on Information & Communication Technology and System (ICTS), Surabaya, Indonesia, 20–21 October 2021; pp. 213–217.
14. Lu, Y.; Sarkis, M.; Lu, G. Multi-Task Learning for Single Image Depth Estimation and Segmentation Based on Unsupervised Network. In Proceedings of the 2020 IEEE International Conference on Robotics and Automation (ICRA), Paris, France, 31 May–31 August 2020; pp. 10788–10794.
15. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929.
16. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv 2021, arXiv:2010.04159.
17. Meinhardt, T.; Kirillov, A.; Leal-Taixé, L.; Feichtenhofer, C. TrackFormer: Multi-Object Tracking with Transformers. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8834–8844.
18. Hermosillo-Reynoso, F.; Torres-Roman, D. A Tensor Space for Multi-View and Multitask Learning Based on Einstein and Hadamard Products: A Case Study on Vehicle Traffic Surveillance Systems. Sensors 2024, 24, 7463.
19. Zeng, F.; Dong, B.; Zhang, Y.; Wang, T.; Zhang, X.; Wei, Y. MOTR: End-to-End Multiple-Object Tracking with Transformer. arXiv 2022, arXiv:2105.03247.
20. Wang, Z.; Zhan, J.; Duan, C.; Guan, X.; Lu, P.; Yang, K. A Review of Vehicle Detection Techniques for Intelligent Vehicles. IEEE Trans. Neural Netw. Learn. Syst. 2023, 34, 3811–3831.
21. Tsai, L.W.; Hsieh, J.W.; Fan, K.C. Vehicle Detection Using Normalized Color and Edge Map. IEEE Trans. Image Process. 2007, 16, 850–864.
22. Yan, G.; Yu, M.; Yu, Y.; Fan, L. Real-time vehicle detection using histograms of oriented gradients and AdaBoost classification. Optik 2016, 127, 7941–7951.
23. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
24. Viola, P.; Jones, M. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2001, Kauai, HI, USA, 8–14 December 2001; Volume 1, pp. 1–13.
25. Ojala, T.; Pietikainen, M.; Maenpaa, T. Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 2002, 24, 971–987.
26. Hassaballah, M.; Kenk, M.A.; El-Henawy, I.M. Local binary pattern-based on-road vehicle detection in urban traffic scene. Pattern Anal. Appl. 2020, 23, 1505–1521.
27. Mallat, S. A theory for multiresolution signal decomposition: The wavelet representation. IEEE Trans. Pattern Anal. Mach. Intell. 1989, 11, 674–693.
28. Kuang, H.; Chen, L.; Chan, L.L.H.; Cheung, R.C.C.; Yan, H. Feature Selection Based on Tensor Decomposition and Object Proposal for Night-Time Multiclass Vehicle Detection. IEEE Trans. Syst. Man, Cybern. Syst. 2019, 49, 71–80.
29. Cucchiara, R.; Piccardi, M.; Mello, P. Image analysis and rule-based reasoning for a traffic monitoring system. IEEE Trans. Intell. Transp. Syst. 2000, 1, 119–130.
30. Rahim, H.A.; Sheikh, U.U.; Ahmad, R.B.; Zain, A.S.M.; Ariffin, W.N.F.W. Vehicle speed detection using frame differencing for smart surveillance system. In Proceedings of the 10th International Conference on Information Science, Signal Processing and their Applications (ISSPA 2010), Kuala Lumpur, Malaysia, 10–13 May 2010; pp. 630–633.
31. Piccardi, M. Background subtraction techniques: A review. In Proceedings of the 2004 IEEE International Conference on Systems, Man and Cybernetics (IEEE Cat. No.04CH37583), The Hague, The Netherlands, 10–13 October 2004; Volume 4, pp. 3099–3104.
32. Stauffer, C.; Grimson, W. Adaptive background mixture models for real-time tracking. In Proceedings of the 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No PR00149), Fort Collins, CO, USA, 23–25 June 1999; Volume 2, pp. 246–252.
33. Lo, B.; Velastin, S. Automatic congestion detection system for underground platforms. In Proceedings of the 2001 International Symposium on Intelligent Multimedia, Video and Speech Processing, ISIMP 2001 (IEEE Cat. No.01EX489), Hong Kong, China, 4 May 2001; pp. 158–161.
34. Oliver, N.; Rosario, B.; Pentland, A. A Bayesian computer vision system for modeling human interactions. IEEE Trans. Pattern Anal. Mach. Intell. 2000, 22, 831–843.
35. Li, L.; Huang, W.; Gu, I.Y.H.; Tian, Q. Foreground object detection from videos containing complex background. In Proceedings of the Eleventh ACM International Conference on Multimedia, New York, NY, USA, 2 November 2003; MULTIMEDIA ’03. pp. 2–10.
36. Chen, Y.; Wu, Q. Moving vehicle detection based on optical flow estimation of edge. In Proceedings of the 2015 11th International Conference on Natural Computation (ICNC), Zhangjiajie, China, 15–17 August 2015; pp. 754–758.
37. Candes, E.J.; Li, X.; Ma, Y.; Wright, J. Robust Principal Component Analysis? arXiv 2009, arXiv:0912.3599.
38. Lu, C.; Feng, J.; Chen, Y.; Liu, W.; Lin, Z.; Yan, S. Tensor Robust Principal Component Analysis with a New Tensor Nuclear Norm. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 42, 925–938.
39. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788.
40. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. arXiv 2016, arXiv:1506.01497.
41. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
42. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 213–229.
43. Yung, N.; Lai, A. Detection of vehicle occlusion using a generalized deformable model. In Proceedings of the ISCAS ’98, 1998 IEEE International Symposium on Circuits and Systems (Cat. No.98CH36187), Monterey, CA, USA, 31 May–3 June 1998; Volume 4, pp. 154–157.
44. Oneata, D.; Revaud, J.; Verbeek, J.; Schmid, C. Spatio-temporal Object Detection Proposals. In Proceedings of the Computer Vision—ECCV 2014, Zurich, Switzerland, 6–12 September 2014; Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T., Eds.; Springer: Cham, Switzerland, 2014; pp. 737–752.
45. Phan, H.N.; Pham, L.H.; Tran, D.N.N.; Ha, S.V.U. Occlusion vehicle detection algorithm in crowded scene for Traffic Surveillance System. In Proceedings of the 2017 International Conference on System Science and Engineering (ICSSE), Ho Chi Minh City, Vietnam, 21–23 July 2017; pp. 215–220.
46. Chang, J.; Wang, L.; Meng, G.; Xiang, S.; Pan, C. Vision-Based Occlusion Handling and Vehicle Classification for Traffic Surveillance Systems. IEEE Intell. Transp. Syst. Mag. 2018, 10, 80–92.
47. Yan, X.; Yu, Y.; Wang, F.; Liu, W.; He, S.; Pan, J. Visualizing the Invisible: Occluded Vehicle Segmentation and Recovery. arXiv 2019, arXiv:1907.09381.
48. Su, Y.; Sun, R.; Shu, X.; Zhang, Y.; Wu, Q. Occlusion-Aware Detection and Re-ID Calibrated Network for Multi-Object Tracking. arXiv 2023, arXiv:2308.15795.
49. Plaen, P.F.D.; Marinello, N.; Proesmans, M.; Tuytelaars, T.; Van Gool, L. Contrastive Learning for Multi-Object Tracking with Transformers. In Proceedings of the 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2024; pp. 6853–6863.
50. Seyfipoor, M.; Zafarqandi, M.J.S.; Mohammadi, S. Real-Time Occlusion-Aware Object Tracking. In Proceedings of the 2025 Fifth National and the First International Conference on Applied Research in Electrical Engineering (AREE), Ahvaz, Iran, 4–5 February 2025; pp. 1–7.
51. Zhang, Y.; Zheng, L.; Huang, Q. Multi-object tracking based on graph neural networks. Multimed. Syst. 2025, 31, 89.
52. Abd-Almageed, W.; Davis, L.S. Robust Appearance Modeling for Pedestrian and Vehicle Tracking. In Multimodal Technologies for Perception of Humans; Stiefelhagen, R., Garofolo, J., Eds.; Springer: Berlin/Heidelberg, Germany, 2007; pp. 209–215.
53. Choi, J.h.; Lee, K.h.; Cha, K.c.; Kwon, J.s.; Kim, D.w.; Song, H.k. Vehicle Tracking using Template Matching based on Feature Points. In Proceedings of the 2006 IEEE International Conference on Information Reuse & Integration, Waikoloa, HI, USA, 16–18 September 2006; pp. 573–577.
54. Kawamoto, K.; Yonekawa, T.; Okamoto, K. Visual vehicle tracking based on an appearance generative model. In Proceedings of the 6th International Conference on Soft Computing and Intelligent Systems, and The 13th International Symposium on Advanced Intelligence Systems, Kobe, Japan, 20–24 November 2012; pp. 711–714.
55. Kalman, R.E. A New Approach to Linear Filtering and Prediction Problems. J. Basic Eng. 1960, 82, 35–45.
56. Arulampalam, M.; Maskell, S.; Gordon, N.; Clapp, T. A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans. Signal Process. 2002, 50, 174–188.
57. Kuhn, H.W. The Hungarian method for the assignment problem. Nav. Res. Logist. Q. 1955, 2, 83–97.
58. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; pp. 3464–3468.
59. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. ByteTrack: Multi-Object Tracking by Associating Every Detection Box. arXiv 2022, arXiv:2110.06864.
60. Adžemović, M. Deep Learning-Based Multi-Object Tracking: A Comprehensive Survey from Foundations to State-of-the-Art. arXiv 2025, arXiv:2506.13457.
61. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649.
62. Bergmann, P.; Meinhardt, T.; Leal-Taixé, L. Tracking Without Bells and Whistles. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 941–951.
63. Zhou, X.; Koltun, V.; Krähenbühl, P. Tracking Objects as Points. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 474–490.
64. Sun, P.; Cao, J.; Jiang, Y.; Zhang, R.; Xie, E.; Yuan, Z.; Wang, C.; Luo, P. TransTrack: Multiple Object Tracking with Transformer. arXiv 2021, arXiv:2012.15460.
65. Zhang, Y.; Wang, T.; Zhang, X. MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 22056–22065.
66. Ng, J.Y.; Tay, Y.H. Image-based Vehicle Classification System. arXiv 2012, arXiv:1204.2114.
67. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556.
68. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
69. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely Connected Convolutional Networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 2261–2269.
70. Tan, M.; Le, Q.V. EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. arXiv 2020, arXiv:1905.11946.
71. Sener, O.; Koltun, V. Multi-Task Learning as Multi-Objective Optimization. arXiv 2019, arXiv:1810.04650.
72. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
73. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
74. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007.
75. Rezatofighi, H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 658–666.
76. Wen, L.; Du, D.; Cai, Z.; Lei, Z.; Chang, M.C.; Qi, H.; Lim, J.; Yang, M.H.; Lyu, S. UA-DETRAC: A new benchmark and protocol for multi-object detection and tracking. Comput. Vis. Image Underst. 2020, 193, 102907.
77. Luiten, J.; Ošep, A.; Dendorfer, P.; Torr, P.; Geiger, A.; Leal-Taixé, L.; Leibe, B. HOTA: A Higher Order Metric for Evaluating Multi-object Tracking. Int. J. Comput. Vis. 2021, 129, 548–578.

Figure 1. Block diagram of the proposed MTL setup.

Figure 2. The MOTR architecture [19]. Here, ⓒ denotes concatenation, while colors follow the same convention used in Figure 1: green for detection, blue for classification, orange for tracking, and yellow for the shared feature representations.

Figure 3. Sample frames from four representative UA-DETRAC videos, illustrating diverse traffic conditions, occlusions, and motion patterns. These frames are selected to complement the quantitative results shown in Table 2 and Table 3. Green bounding boxes are the ground truths from the UA-DETRAC dataset, while blue bounding boxes are the predictions produced by our model.

Table 1. Summary of representative approaches in VTS systems.

Task	Approach	Method	Limitations
Detection	Appearance	HOG [22]	Sensitive to lighting, occlusions, and viewpoint variations
Detection	Appearance	SIFT [23]	Same as above
Detection	Appearance	Haar-like features [24]	Same as above
Detection	Appearance	LBP [25,26]	Same as above
Detection	Motion	Background subtraction [31,32,33,34]	Fails under camera motion, dynamic background, or abrupt vehicle movement
Detection	Motion	Optical flow [35,36]	Same as above
Detection	Motion	Subspace models [37,38]	Same as above
Detection	DL	YOLO [39]	Requires large annotated datasets; higher computational cost
Detection	DL	Faster R-CNN [40]	Same as above
Detection	DL	Mask R-CNN [41]	Same as above
Detection	DL	DETR [42]	Same as above
Occlusion	Heuristic	Geometric priors [9,43,44,45,46]	Limited robustness for heavy occlusion or irregular shapes
Occlusion	DL	Feature reconstruction [47,48]	Computationally expensive; requires large datasets with occlusions
Occlusion	DL	Occlusion context-aware [49,50,51]	Same as above
Tracking	Appearance	Color histograms [52,53]	Sensitive to illumination, perspective changes, and occlusions
Tracking	Appearance	Texture descriptors [54]	Same as above
Tracking	Motion	Kalman filter [9]	Fails in the presence of nonlinear dynamics
Tracking	Motion	SORT [58]	Fails under abrupt maneuvers, nonlinear motion, or crowded scenes
Tracking	Motion	ByteTrack [59]	Same as above
Tracking	DL	DeepSORT [61]	High computational cost; requires large annotated sequences
Tracking	DL	TrackFormer [17]	Same as above
Tracking	DL	MOTR [19]	Limited detection performance
Classification	Appearance	SIFT and Color histograms [9,66]	Performance limited by feature design; sensitive to occlusions
Classification	Appearance	Image Moments [9]	Same as above
Classification	DL	YOLO, VGGNet, ResNet [39,67,68,69,70]	Requires large labeled datasets; high computational resources

Table 2. MOTR performance on ten representative UA-DETRAC videos, where the best value for each metric is highlighted in bold.

Video	MOTA	IDS	IDP	IDR	IDF1	Precision	Recall
MVI_40962	0.863	4	0.981	0.868	0.921	0.988	0.875
MVI_40981	0.796	3	0.909	0.877	0.893	0.913	0.881
MVI_40963	0.710	32	0.888	0.740	0.807	0.927	0.773
MVI_41073	0.822	17	0.922	0.875	0.898	0.934	0.886
MVI_20063	0.776	8	0.894	0.823	0.857	0.922	0.849
MVI_20033	0.754	2	0.972	0.769	0.859	0.977	0.773
MVI_40212	0.649	12	0.902	0.695	0.785	0.922	0.711
MVI_40204	0.697	24	0.873	0.743	0.803	0.910	0.775
MVI_20064	0.707	19	0.923	0.706	0.800	0.963	0.736
MVI_20065	0.700	20	0.853	0.769	0.809	0.889	0.801

Table 3. Average MOTR performance over the ten representative UA-DETRAC videos.

Metric	MOTA	IDF1	IDP	IDR	Precision	Recall
Mean	0.757	0.832	0.906	0.767	0.934	0.796
Std. Dev.	0.065	0.048	0.034	0.063	0.024	0.058

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

