Article

Dynamic Collaborative Optimization Method for Real-Time Multi-Object Tracking

School of Automation and Intelligence, Beijing Jiaotong University, Beijing 100044, China
*
Author to whom correspondence should be addressed.
Appl. Sci. 2025, 15(9), 5119; https://doi.org/10.3390/app15095119
Submission received: 24 March 2025 / Revised: 30 April 2025 / Accepted: 2 May 2025 / Published: 5 May 2025

Abstract

Multi-object tracking still faces significant challenges in complex conditions such as dense scenes, occlusion environments, and non-linear motion, especially regarding the detection and identity maintenance of small objects. To tackle these issues, this paper proposes a multi-modal fusion tracking framework that realizes high-precision tracking in complex scenarios by collaboratively optimizing feature enhancement and motion prediction. Firstly, a multi-scale feature adaptive enhancement (MS-FAE) module is designed, integrating multi-level features and introducing a small object adaptive attention mechanism to enhance the representation ability for small objects. Secondly, a cross-frame feature association module (CFAM) is put forward, constructing a global semantic association network via grouped cross-attention and a memory recall mechanism to solve the matching difficulties in occlusion and dense scenes. Thirdly, a Dynamic Motion Model (DMM) is developed, enabling the robust prediction of non-linear motion based on an improved Kalman filter framework. Finally, a Bi-modal dynamic decision method (BDDM) is devised to fuse appearance and motion information for hierarchical decision making. Experiments conducted on multiple public datasets, including MOT17, MOT20, and VisDrone-MOT, demonstrate that this method remarkably improves tracking accuracy while maintaining real-time performance. On the MOT17 test set, it achieves 63.7% in HOTA, 61.4 FPS in processing speed, and 79.4% in IDF1, outperforming current state-of-the-art tracking algorithms.

1. Introduction

Multi-Object Tracking (MOT), a critical computer vision task, finds extensive applications in intelligent transportation, security surveillance, and human–machine interactions [1]. As deep learning advances, the Tracking-by-Detection paradigm has emerged as the predominant approach [2], yet it faces significant challenges in practical scenarios, as follows: (1) small object detection and representation difficulties, where conventional feature extraction techniques struggle to retain discriminative information [3]; (2) frequent identity switches in dense environments, exacerbated by mutual object occlusion leading to feature degradation [4]; and (3) imprecise non-linear motion prediction, challenging traditional motion models to effectively interpret complex motion dynamics [5].
Current tracking methods exhibit significant limitations in addressing the aforementioned challenges [1]. In terms of feature extraction, mainstream tracking models employ additional feature extraction networks, which not only demonstrate limited capability in representing small objects and capturing fine-grained features but also consume considerable tracking time [6]. Regarding object association, methods like DeepSORT rely on local feature matching, failing to establish global semantic connections, which results in insufficient identity preservation in occlusion scenarios [7]. For motion prediction, SORT and its variants typically utilize linear Kalman filtering, which yields relatively low prediction accuracy for non-linear motion trajectories, particularly in scenarios with abrupt movements. Furthermore, most methods lack effective modality fusion strategies, making it difficult to balance the contributions of appearance and motion information across different scenarios [8].
Despite the mentioned advances, existing MOT methods face critical limitations in three key areas. For small object detection, conventional approaches suffer from severe feature degradation through downsampling operations, with detection accuracy dropping by up to 40% for objects smaller than 30 × 30 pixels due to insufficient receptive field adaptation. Occlusion handling remains particularly challenging, as current association algorithms lack robust temporal consistency mechanisms, leading to identity preservation failures when occlusions persist beyond 1.5 s or affect more than 25% of target objects simultaneously. Regarding motion prediction, traditional linear models fundamentally fail to capture acceleration variations and direction changes, resulting in trajectory estimation errors increasing exponentially for objects with non-linear motion patterns, especially in crowded scenes where complex interactions frequently occur.
To address these critical challenges, this paper introduces a multi-modal fusion object tracking framework, presenting the following four primary innovative contributions:
(1)
Multi-Scale Feature Adaptive Enhancement Module (MS-FAE): A feature enhancement module that fuses multi-level feature maps through a top-down pyramid pathway and a small object adaptive attention mechanism, complemented by deformable convolution calibration and dynamic anchor box compensation. This design preserves the discriminative information of small objects that is otherwise lost during downsampling.
(2)
Cross-Frame Feature Association Module (CFAM): A comprehensive global semantic association network leveraging grouped cross-attention mechanisms for efficient inter-frame feature interaction. The module incorporates an innovative memory recall mechanism to mitigate feature degradation from short-term occlusions, thereby augmenting system robustness in dense scene environments.
(3)
Dynamic Motion Model (DMM): An advanced Kalman filter framework extension that introduces acceleration components and implements a data-driven noise adjustment strategy. Coupled with a multi-hypothesis trajectory backtracking mechanism, this approach significantly improves prediction accuracy for non-linear and abrupt motion scenarios.
(4)
Bi-Modal Dynamic Decision Method (BDDM): A hierarchical strategy for fusing appearance and motion modal information, utilizing a dynamic weight fusion function to optimize contributions from diverse information sources, thereby enhancing long-term trajectory stability.
The You Only Look Once (YOLO) architecture, particularly YOLOv9, is a state-of-the-art solution for real-time object detection, balancing speed and accuracy. In this paper, we integrate YOLOv9 as our tracking model’s detector due to its proven performance in diverse applications, ensuring scientific validity and reliability. Ref. [9], focusing on improving detection performance for autonomous vehicles in adverse weather conditions through the application of metaheuristic algorithms, exemplifies efforts to push YOLO’s boundaries in difficult environments. Research [10] on the effectiveness of image augmentation techniques applied to YOLOv8 for detecting non-protective personal equipment highlights the importance of data-centric approaches and model optimization for specific detection tasks. Investigations [11] into heavy equipment detection on construction sites using YOLO-Version 10, incorporating Transformer architectures, further showcase the evolution and integration of advanced components within the YOLO family to handle complex scenes. Additionally, the development of specialized models like COTTON-YOLO [12] for enhancing cotton boll detection and counting in complex environmental conditions underscores YOLO’s versatility in adapting to specific, challenging real-world applications. These findings validate YOLOv9 as a robust, efficient detector for our tracking model, ensuring system-wide reliability.

2. Related Works

2.1. Small Object Detection and Feature Enhancement

Small object detection has always been a challenging problem in the field of computer vision. Traditional methods mainly build multi-scale representations through Feature Pyramid Networks (FPNs) [13], but their simple feature fusion methods struggle to preserve the discriminative information of small objects. Lin et al. [14] proposed PANet, which enhances feature fusion by adding a bottom-up path, but it still suffers from insufficient feature extraction for small objects. Chen et al. [15] introduced the Atrous Spatial Pyramid Pooling (ASPP) module, using convolution kernels with different dilation rates to capture multi-scale context, but it was not specifically optimized for small object features.
In recent years, attention mechanisms have emerged as a powerful approach for feature enhancement. Woo et al. [16] developed CBAM (Convolutional Block Attention Module), integrating channel and spatial attention to improve feature representation, yet its fixed weight strategy limits adaptive adjustment based on object size. Wang et al. [17] designed the ECA (Efficient Channel Attention) module, efficiently modeling inter-channel dependencies through one-dimensional convolution, but they overlooked critical spatial dimension information. Diverging from these approaches, the proposed MS-FAE module synthesizes multi-level features and introduces a Small Object Adaptive Attention Mechanism (SAAM), dynamically modulating channel and spatial attention weights according to object dimensions. Moreover, by implementing deformable convolution and dynamic anchor box compensation strategies, the module significantly enhances small object feature extraction accuracy and representational capabilities.
Unlike previous research, which often relied on single-modal data or traditional feature engineering methods with limited adaptability, our framework capitalizes on deep learning’s ability to automatically extract hierarchical features from complex multi-modal data. This strength parallels related empirical work in which deep learning enables the precise detection of exterior cladding materials from street view images.

2.2. Feature Association in Multi-Object Tracking

Detection-based multi-object tracking methods mainly focus on how to establish associations between targets across frames. Early methods like SORT [8] only used IoU as a matching metric, lacking the ability to model appearance features. DeepSORT [7] introduced deep features for target representation, but its simple feature comparison method struggles with occlusion and crowded scenes. ByteTrack [18] improved detection omission by handling low-confidence detection boxes but did not deeply optimize feature representation. StrongSORT [19] combined appearance and motion features but still used a local feature matching strategy, unable to establish global semantic associations.
Recent research has begun exploring the application of attention mechanisms in target association. TransTrack [20] introduced the Transformer architecture into tracking tasks, but its global self-attention computation has high complexity, making it difficult to meet real-time requirements. TransMOT [21] modeled inter-target relationships through spatiotemporal attention mechanisms but lacked effective utilization of historical features. Unlike these methods, the CFAM module proposed in this paper efficiently handles cross-frame feature interactions through a grouped cross-attention mechanism and designs a memory recall mechanism to dynamically compensate for feature degradation, significantly enhancing system robustness in occlusion and dense scenes while maintaining computational efficiency.

2.3. Motion Modeling and Multi-Modal Fusion

Motion prediction is a key component in multi-object tracking. SORT [8] used linear Kalman filtering to predict target positions but performed poorly on non-linear motions. DeepSORT [7] combined appearance features and motion information but still used a simple linear motion model. OC-SORT [22] improved Kalman filtering through observation correction but did not consider acceleration changes, leading to inaccurate predictions for sudden motions. BoT-SORT [23] introduced camera motion compensation but lacked an adaptive parameter adjustment mechanism.
In terms of multi-modal fusion, most methods use simple weighted averaging or cascading strategies. StrongSORT [19] linearly fused appearance and motion similarity, but the weights were fixed and could not adapt to different scenes. ByteTrack [18] and OC-SORT [22] mainly relied on a single modality (IoU or motion), lacking an effective multi-modal fusion strategy. Unlike existing methods, the DMM module proposed in this paper extends the Kalman filter framework, introduces an acceleration component, and designs a data-driven noise adjustment strategy, significantly improving the prediction accuracy for non-linear motions. Additionally, our dual-modal matching decision method adaptively balances appearance and motion information through a dynamic weight fusion function, effectively enhancing tracking performance in complex scenes.

3. Method

3.1. Comprehensive Framework

The proposed multi-object tracking framework, illustrated in Figure 1 and Algorithm 1, comprises the following four fundamental modules: a multi-scale feature adaptive enhancement (MS-FAE) module, a cross-frame feature association module (CFAM), a Dynamic Motion Model (DMM), and a Bi-Modal Dynamic Decision Method (BDDM). The architecture is distinguished by two pivotal innovations as follows:
(1) Efficient dual-modal feature representation: By exploiting the deep semantic features from YOLO [24] for target appearance extraction, the framework circumvents the computational overhead of conventional re-identification (ReID) models, enabling seamless feature sharing between detection and tracking processes.
(2) Dynamic fusion matching mechanism: A strategy is proposed to dynamically integrate motion matching with feature matching. The mechanism first employs a Transformer architecture to learn global feature correlations between cross-frame objects, and then combines multiple trajectory hypothesis motion matching. This approach effectively addresses the challenges of object disappearance and re-matching caused by occlusion.
Algorithm 1: Dynamic collaborative multi-object tracking
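Algorithm 1 appears only as an image in the original article. To make the overall pipeline concrete, the following is a minimal Python sketch of the per-frame tracking loop it describes; the detector, ms_fae, cfam, dmm, and bddm callables, their interfaces, and the 0.3 similarity threshold are hypothetical stand-ins for YOLOv9 and the paper's modules, not the authors' implementation.

```python
# Sketch of the dynamic collaborative tracking loop (cf. Algorithm 1).
# detector/ms_fae/cfam/dmm/bddm are hypothetical callables standing in for
# YOLOv9, MS-FAE, CFAM, DMM, and BDDM, respectively.
from scipy.optimize import linear_sum_assignment

def track_sequence(frames, detector, ms_fae, cfam, dmm, bddm, sim_threshold=0.3):
    tracks, next_id = [], 0          # each track: {"id", "feature", "state", "age"}
    for frame in frames:
        boxes = detector(frame)                       # detections for this frame
        feats = ms_fae(frame, boxes)                  # enhanced appearance features
        matched = []
        if tracks and len(boxes) > 0:
            s_app = cfam([t["feature"] for t in tracks], feats)       # appearance similarity
            s_mot = dmm.motion_similarity(tracks, boxes)              # motion similarity
            s_fuse = bddm(s_app, s_mot, [t["age"] for t in tracks])   # dynamic fusion
            rows, cols = linear_sum_assignment(-s_fuse)               # Hungarian assignment
            matched = [(r, c) for r, c in zip(rows, cols) if s_fuse[r, c] > sim_threshold]
        matched_tracks = {r for r, _ in matched}
        matched_dets = {c for _, c in matched}
        for r, c in matched:                          # update matched tracks
            tracks[r]["feature"] = feats[c]
            tracks[r]["age"] += 1
            dmm.update(tracks[r], boxes[c])
        n_existing = len(tracks)
        for c in range(len(boxes)):                   # spawn new tracks for unmatched detections
            if c not in matched_dets:
                tracks.append({"id": next_id, "feature": feats[c],
                               "state": dmm.init_state(boxes[c]), "age": 1})
                next_id += 1
        # Keep matched tracks, brand-new tracks, and lost tracks still inside the
        # retention window (e.g., 30 frames), which dmm.keep_alive is assumed to check.
        tracks = [t for i, t in enumerate(tracks)
                  if i in matched_tracks or i >= n_existing or dmm.keep_alive(t)]
        yield [(t["id"], t["state"]) for t in tracks]
```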

3.2. Multi-Scale Feature Adaptive Enhancement (MS-FAE) Module

To address the issue of detail loss during the feature extraction of small objects, this section introduces the Multi-Scale Feature Adaptive Enhancement (MS-FAE) method. While this study employs YOLOv9 as the object detector, using only the deepest feature map (P2 layer) proves insufficient in preserving discriminative information for small objects. To resolve this limitation, we design the MS-FAE module as illustrated in Figure 2, which comprises the following three key components: (1) multi-scale feature pyramid fusion; (2) small object adaptive attention mechanism; and (3) precise feature mapping with dynamic anchor box compensation. By integrating feature representations from different scales and incorporating an adaptive attention mechanism, this method significantly enhances the tracker’s capability to extract features from small objects.

3.2.1. Multi-Scale Feature Pyramid Fusion

To fully utilize the complementary information from feature maps at different levels, we design a top-down feature fusion pathway. Specifically, the P2 (1/8), P3 (1/16), and P4 (1/32) feature maps extracted from the backbone network are fused as follows: First, the P4 layer is upsampled to the size of P3 using transpose convolution and then added to the P3 feature map as follows:
C_3 = \mathrm{ConvTrans}(P_4) + P_3
Next, C3 is further upsampled to the size of P2 and fused with the P2 feature map as follows:
C_2 = \mathrm{ConvTrans}(C_3) + P_2
To further enhance the multi-scale receptive field, we introduce an Atrous Spatial Pyramid Pooling (ASPP) module at the C_2 layer, which extracts multi-scale contextual information using convolution kernels with different dilation rates (1, 3, 5) as follows:
F_{\text{context}} = \mathrm{Conv}_{1\times 1}\big(\mathrm{Concat}[C_2,\ D_{rate=1},\ D_{rate=3},\ D_{rate=5}]\big)
Here, Conv_{1×1} is used to reduce the channel count back to the original dimension. This approach preserves the spatial details of high-resolution feature maps while incorporating the semantic information from higher-level features.
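As a concrete illustration of the fusion pathway and ASPP context block described above, here is a minimal PyTorch sketch; the channel widths (256/512/1024 for P2/P3/P4) and the use of 3 × 3 dilated convolutions are assumptions not fixed by the text.

```python
# Sketch of the top-down pyramid fusion and ASPP context block.
# Channel widths are assumed values, not taken from the paper.
import torch
import torch.nn as nn

class PyramidFusionASPP(nn.Module):
    def __init__(self, c2=256, c3=512, c4=1024):
        super().__init__()
        # Transposed convolutions upsample by 2x and align channel counts.
        self.up43 = nn.ConvTranspose2d(c4, c3, kernel_size=2, stride=2)
        self.up32 = nn.ConvTranspose2d(c3, c2, kernel_size=2, stride=2)
        # ASPP branches with dilation rates 1, 3, 5 applied to the fused C2 map.
        self.aspp = nn.ModuleList([
            nn.Conv2d(c2, c2, kernel_size=3, padding=r, dilation=r) for r in (1, 3, 5)
        ])
        # 1x1 convolution reduces the concatenated channels back to c2.
        self.reduce = nn.Conv2d(c2 * 4, c2, kernel_size=1)

    def forward(self, p2, p3, p4):
        c3 = self.up43(p4) + p3            # C_3 = ConvTrans(P_4) + P_3
        c2 = self.up32(c3) + p2            # C_2 = ConvTrans(C_3) + P_2
        branches = [c2] + [conv(c2) for conv in self.aspp]
        return self.reduce(torch.cat(branches, dim=1))   # F_context

# Example shapes (stride 8/16/32 maps of a 640x640 input):
# p2, p3, p4 = torch.randn(1, 256, 80, 80), torch.randn(1, 512, 40, 40), torch.randn(1, 1024, 20, 20)
# out = PyramidFusionASPP()(p2, p3, p4)   # -> (1, 256, 80, 80)
```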

3.2.2. Small-Object Adaptive Attention Mechanism

To further enhance the feature representation of small objects, we propose the Small-object Adaptive Attention Module (SAAM). This module dynamically adjusts channel and spatial attention weights based on the size of the target, with the core idea of applying stronger feature enhancement to smaller objects. The SAAM first computes target size-related attention weights as follows:
\alpha = \sigma\big(\mathrm{MLP}([\log(w), \log(h)])\big)
where w and h are the width and height of the bounding box, σ is the Sigmoid function, and MLP is a two-layer perceptron. The SAAM includes the following two branches: channel attention and spatial attention. The channel attention branch generates channel weights through a two-layer convolutional network following global average pooling, while the spatial attention branch produces pixel-level weight maps via 3 × 3 convolution. The final feature enhancement is achieved as follows:
F_{out} = F_{in} \cdot \big(\alpha \cdot W_{ch} + (1 - \alpha) \cdot W_{sp}\big)
Here, W_ch and W_sp represent the channel attention and spatial attention weights, respectively. When the target size is small, α increases and channel attention dominates; conversely, spatial attention becomes more influential for larger targets.
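The following is a minimal PyTorch sketch of a SAAM-style block implementing the equations above; the channel width (256), the MLP hidden size (32), and the 4× channel reduction in the channel-attention branch are assumed values.

```python
# Sketch of the Small-object Adaptive Attention Module (SAAM).
# Hidden sizes and channel width are assumptions, not the paper's configuration.
import torch
import torch.nn as nn

class SAAM(nn.Module):
    def __init__(self, channels=256, hidden=32):
        super().__init__()
        self.size_mlp = nn.Sequential(             # alpha = sigmoid(MLP([log w, log h]))
            nn.Linear(2, hidden), nn.ReLU(), nn.Linear(hidden, 1), nn.Sigmoid()
        )
        self.channel_att = nn.Sequential(           # channel weights from pooled features
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // 4, 1), nn.ReLU(),
            nn.Conv2d(channels // 4, channels, 1), nn.Sigmoid()
        )
        self.spatial_att = nn.Sequential(           # pixel-level weight map via 3x3 conv
            nn.Conv2d(channels, 1, kernel_size=3, padding=1), nn.Sigmoid()
        )

    def forward(self, f_in, box_wh):
        # f_in: (B, C, H, W) ROI features; box_wh: (B, 2) box widths and heights in pixels.
        alpha = self.size_mlp(torch.log(box_wh)).view(-1, 1, 1, 1)
        w_ch = self.channel_att(f_in)               # (B, C, 1, 1)
        w_sp = self.spatial_att(f_in)               # (B, 1, H, W)
        # F_out = F_in * (alpha * W_ch + (1 - alpha) * W_sp), mixed by broadcasting.
        return f_in * (alpha * w_ch + (1 - alpha) * w_sp)
```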

3.2.3. Precise Feature Mapping and Dynamic Anchor Compensation

To address the precision loss in feature extraction for small objects, we introduce the following two key techniques: Deformable Convolution Calibration: Deformable convolution is applied before ROI pooling to compensate for coordinate quantization errors, improving feature alignment accuracy. By learning additional offsets, deformable convolution adaptively adjusts the receptive field to key parts of the target region as follows:
F_{aligned} = \mathrm{DeformConv}(F_{in}, \Delta p)
where Δp is the learned offset, generated by an additional convolutional layer based on the input features. Dynamic Anchor Compensation Strategy: For targets of different sizes, we design an adaptive pooling strategy. For small objects (area < 32 × 32 pixels), a bidirectional adaptive pooling method is used as follows:
F_{\text{fused}} = \lambda(w,h) \cdot \mathrm{MaxPool}(F) + (1 - \lambda(w,h)) \cdot \mathrm{AvgPool}(F)
The weight coefficient λ is dynamically calculated based on the target area using a Sigmoid function:
\lambda(w,h) = \frac{1}{1 + e^{\,k\,(w \cdot h - S_0)}}
Here, k = 0.1 controls the transition rate, and S_0 = 1024 is the area threshold. This design ensures that small objects retain the most salient features (MaxPool), while larger objects preserve more of the overall distribution (AvgPool).
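A short sketch of the dynamic anchor compensation, using the reconstructed form of λ(w, h) above; the 7 × 7 output size of the ROI pooling is an assumption.

```python
# Sketch: size-dependent blend of max and average pooling over an ROI feature crop.
# k = 0.1 and S0 = 1024 follow the text; the output size is an assumed value.
import torch
import torch.nn.functional as F

def dynamic_pool(roi_feat, w, h, out_size=7, k=0.1, s0=1024.0):
    """roi_feat: (C, H, W) feature crop; w, h: box size in pixels."""
    lam = 1.0 / (1.0 + torch.exp(torch.tensor(k * (w * h - s0))))   # lambda(w, h)
    max_p = F.adaptive_max_pool2d(roi_feat, out_size)               # salient details
    avg_p = F.adaptive_avg_pool2d(roi_feat, out_size)               # overall distribution
    return lam * max_p + (1.0 - lam) * avg_p                        # F_fused

# Example: a 20x20 box keeps mostly max-pooled features (lambda close to 1).
# feat = torch.randn(256, 14, 14); pooled = dynamic_pool(feat, 20, 20)
```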

3.3. Cross-Frame Association Module

The Cross-Frame Association Module (CFAM) constructs a global semantic association network and serves as the core component of the object tracking system. This module receives feature vectors outputted by the multi-scale feature enhancement module and is specifically designed to address the challenges of object matching in occluded and dense scenes. By ingeniously integrating temporal contextual information with a dynamic weight allocation mechanism, the CFAM achieves the following three key optimization objectives: establishing long-range dependencies, enhancing adaptive memory capabilities, and maintaining a lightweight computational architecture.
As shown in Figure 3, the workflow of this module is as follows: First, it encodes the target features of the current frame and historical frames in a multi-modal manner, integrating spatial location and motion information. Then, it calculates the similarity matrix between targets using a grouped cross-attention mechanism. Finally, it employs a memory recall mechanism to handle feature associations in occluded scenes. This multi-level design ensures stable target identification and association, even in complex scenarios, and its detailed computational procedure is also provided in Algorithm 2.
Algorithm 2: Cross-Frame Association Module (CFAM)

3.3.1. Multimodal Feature Encoding

To capture the comprehensive representation of targets, this module simultaneously embeds physical size and motion state information to form a multi-modal feature representation as follows:
Current Frame Feature Encoding: The spatial dimensions of the target are integrated into the feature representation through position encoding, enhancing spatial discriminative power as follows:
Q_i = W_Q f_i^t \oplus \mathrm{PE}(w_i, h_i), \qquad \mathrm{PE}(w,h) = \left[\sin\!\left(\tfrac{w}{W_0}\right),\ \cos\!\left(\tfrac{h}{H_0}\right)\right]
Here, f_i^t represents the original feature vector of the i-th target in the current frame; W_Q is a learnable projection matrix; w_i and h_i are the width and height of the target, respectively; W_0 and H_0 are normalization reference constants; and ⊕ denotes the feature concatenation operation.
Historical Trajectory Encoding: Motion state information is integrated to enhance temporal consistency and motion prediction capability as follows:
K_j = W_K f_j^{t-1} \oplus \mathrm{PE}(v_j^x, v_j^y), \qquad v_j^x = \frac{x_j^{t-1} - x_j^{t-2}}{\Delta t}, \quad v_j^y = \frac{y_j^{t-1} - y_j^{t-2}}{\Delta t}
Here, f_j^{t-1} denotes the feature vector of the j-th target in the previous frame; W_K is a learnable projection matrix; v_j^x and v_j^y are the velocities of the target in the x and y directions, respectively; and Δt is the time interval between adjacent frames.
This dual-modal feature encoding strategy enables the model to simultaneously focus on the target’s appearance features, spatial dimensions, and motion state, providing a more comprehensive basis for subsequent target associations. This approach particularly excels in multi-target scenes where appearance features are similar but motion patterns differ.

3.3.2. Grouped Cross-Attention Mechanism

To balance computational efficiency and feature representation capability, we adopt a grouped attention strategy, splitting the 256-dimensional feature space into four 64-dimensional subspaces for parallel processing as follows:
\mathrm{Attention}_k(Q, K, V) = \mathrm{Softmax}\!\left(\frac{Q^{(k)} (K^{(k)})^{T}}{\sqrt{d_k}} \odot M_{\text{temp}}\right) V^{(k)}, \qquad k = 1, \ldots, 4
Here, Q^{(k)}, K^{(k)}, and V^{(k)} represent the query, key, and value matrices of the k-th subspace, respectively; d_k = 64 is the subspace dimension; M_temp is a temporal mask matrix; and ⊙ denotes element-wise multiplication. The temporal mask matrix M_temp effectively prevents reverse temporal matching, ensuring temporal consistency in associations, and is defined as follows:
M_{\text{temp}}(i,j) = \begin{cases} 1, & \text{if target } j \text{ exists at time } t-1 \\ 0, & \text{otherwise} \end{cases}
The attention results from each subspace are concatenated and normalized to form the final output as follows:
f_{\text{out}}^{t} = \mathrm{LayerNorm}\big(\mathrm{Concat}[\mathrm{Attention}_1, \mathrm{Attention}_2, \mathrm{Attention}_3, \mathrm{Attention}_4]\big)
This grouping strategy significantly reduces computational complexity from O(n²d) to O(n²d/g), where n is the number of targets, d is the feature dimension, and g = 4 is the number of groups. Additionally, the multi-head design allows the model to learn complementary attention patterns from different subspaces, enhancing the diversity of feature representation.
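The grouped cross-attention can be sketched as follows; the linear projections and the multiplicative application of the temporal mask follow the equations above, while the interface (feature matrices of shape N × 256 and M × 256) is an assumption.

```python
# Sketch of grouped cross-attention: the 256-d feature space is split into
# g = 4 subspaces of 64 dimensions, attention is computed per group with a
# temporal mask, and the group outputs are concatenated and layer-normalized.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedCrossAttention(nn.Module):
    def __init__(self, dim=256, groups=4):
        super().__init__()
        assert dim % groups == 0
        self.groups, self.dk = groups, dim // groups
        self.w_q = nn.Linear(dim, dim)
        self.w_k = nn.Linear(dim, dim)
        self.w_v = nn.Linear(dim, dim)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cur_feats, hist_feats, temp_mask):
        # cur_feats: (N, dim) current-frame targets; hist_feats: (M, dim) previous-frame
        # targets; temp_mask: (N, M) with 1 where the historical target is valid.
        q = self.w_q(cur_feats).view(-1, self.groups, self.dk)    # (N, g, dk)
        k = self.w_k(hist_feats).view(-1, self.groups, self.dk)   # (M, g, dk)
        v = self.w_v(hist_feats).view(-1, self.groups, self.dk)
        outs = []
        for g in range(self.groups):
            logits = q[:, g] @ k[:, g].T / self.dk ** 0.5          # (N, M)
            logits = logits * temp_mask                            # element-wise temporal mask
            attn = F.softmax(logits, dim=-1)
            outs.append(attn @ v[:, g])                            # (N, dk)
        return self.norm(torch.cat(outs, dim=-1))                  # f_out^t, (N, dim)

# Example:
# out = GroupedCrossAttention()(torch.randn(5, 256), torch.randn(7, 256), torch.ones(5, 7))
```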

3.3.3. Memory Recall Mechanism

To address feature degradation caused by short-term occlusions, we designed a dynamic memory pool and its update rules. For each tracked target, a memory pool M_j^t containing the features of the last three frames is maintained as follows:
M_j^t = \begin{cases} \{\, f_j^{t-1},\ M_j^{t-1}(1),\ M_j^{t-1}(2) \,\}, & \text{if target } j \text{ is successfully matched at frame } t-1, \\ \{\, f_j^{t-1},\ M_j^{t-1}(0),\ M_j^{t-1}(1) \,\}, & \text{if target } j \text{ fails to match at frame } t-1. \end{cases}
Here, M_j^t(k) denotes the feature vector at the k-th position in the memory pool, with k = 0 representing the most recent feature.
When a target enters an occlusion state, a single frame’s feature may be insufficient to provide reliable association information. In such cases, a weighted historical feature is used to calculate comprehensive similarity as follows:
S_{\text{motion}}^{\text{occl}}(i,j) = \sum_{k=0}^{2} \gamma_k \cdot S_{\cos}\big(K_j, M_j^t(k)\big), \qquad \gamma_k = 0.85^{\,k}
Here, S_cos denotes the cosine similarity function, and γ_k is the decay weighting coefficient, which exponentially decreases with increasing temporal distance. The choice of the weighting coefficient γ_k is based on experimental validation, with 0.85 as the base, demonstrating optimal performance across multiple datasets. This coefficient strikes a balance between emphasizing the most recent features and not overly diminishing the contribution of historical information.
The proposed decay weighting design effectively balances the importance of recent and historical features, particularly mitigating challenges associated with short-term occlusions (typically spanning 1–3 frames). In contrast to methods employing simple averaging or exclusively relying on the most recent features, our memory recall mechanism substantially enhances target re-identification rates, thereby improving the system’s robustness under occlusion conditions.
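A minimal sketch of the memory recall mechanism is given below. The pool depth of 3 and the decay base of 0.85 follow the text; the update rule is a simplified interpretation (push the new feature only on a successful match), and the class interface is hypothetical.

```python
# Sketch of the memory recall mechanism: each track keeps its last three feature
# vectors, and occlusion-time similarity is a weighted sum with gamma_k = 0.85**k.
from collections import deque
import numpy as np

class MemoryPool:
    def __init__(self, depth=3, gamma=0.85):
        self.pool = deque(maxlen=depth)   # index 0 holds the most recent feature
        self.gamma = gamma

    def update(self, feature, matched):
        # Simplified interpretation: push the new feature on a successful match
        # (the oldest entry drops out); keep the pool unchanged on a failed match.
        if matched or not self.pool:
            self.pool.appendleft(np.asarray(feature, dtype=np.float32))

    def occlusion_similarity(self, query):
        # S_occl = sum_k gamma^k * cos(query, M(k)), k = 0 being the most recent.
        query = np.asarray(query, dtype=np.float32)
        score = 0.0
        for k, mem in enumerate(self.pool):
            cos = float(query @ mem / (np.linalg.norm(query) * np.linalg.norm(mem) + 1e-8))
            score += (self.gamma ** k) * cos
        return score

# Example:
# mp = MemoryPool(); mp.update(np.random.rand(256), matched=True)
# s = mp.occlusion_similarity(np.random.rand(256))
```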
Through the comprehensive design of the Cross-Frame Association Module, we have successfully achieved robust target tracking in complex visual environments, demonstrating exceptional performance in occluded and densely populated scenes. Experimental results validate that, compared to existing approaches, this module significantly improves tracking accuracy and continuity while maintaining computational efficiency.

3.4. Dynamic Motion Model

The proposed Dynamic Motion Model (DMM) advances the traditional Kalman filter framework [25] by providing robust predictions for non-linear trajectories and abrupt scene transitions. The module’s core innovation resides in its expanded state space representation and adaptive parameter recalibration mechanism, collectively enabling a sophisticated prediction system that dynamically adapts to complex environmental variations.

3.4.1. Enhanced State Transition

To capture the complex motion patterns of a target, we define a six-dimensional state vector s = [x, y, v_x, v_y, a_x, a_y]^T, explicitly incorporating acceleration components to improve the modeling of non-uniform motion. The enhanced state transition matrix is designed as follows:
A = \begin{bmatrix} 1 & 0 & \Delta t & 0 & 0.5\,\Delta t^2 \eta & 0 \\ 0 & 1 & 0 & \Delta t & 0 & 0.5\,\Delta t^2 \eta \\ 0 & 0 & 1 - \alpha_v & 0 & \Delta t\, \eta & 0 \\ 0 & 0 & 0 & 1 - \alpha_v & 0 & \Delta t\, \eta \\ 0 & 0 & 0 & 0 & \eta & 0 \\ 0 & 0 & 0 & 0 & 0 & \eta \end{bmatrix}, \qquad \eta = e^{-\beta \Delta t}
Here, the motion damping factor α_v = 0.1 introduces a velocity self-decay mechanism, while the time decay coefficient β = 0.05 controls the persistence of acceleration effects. Together, they effectively mitigate the divergence issue in long-term extrapolation. This design ensures the model maintains short-term prediction accuracy while avoiding the common trajectory divergence phenomenon in long-term predictions, making it particularly suitable for scenarios with frequent target direction changes.
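For reference, the reconstructed transition matrix can be assembled as follows, with α_v = 0.1 and β = 0.05 as stated above; the per-frame time step Δt = 1 is an assumption.

```python
# Sketch of the enhanced state transition matrix for the 6-d state
# [x, y, vx, vy, ax, ay], with velocity damping alpha_v and acceleration
# decay eta = exp(-beta * dt).
import numpy as np

def transition_matrix(dt=1.0, alpha_v=0.1, beta=0.05):
    eta = np.exp(-beta * dt)
    return np.array([
        [1, 0, dt, 0, 0.5 * dt**2 * eta, 0],
        [0, 1, 0, dt, 0, 0.5 * dt**2 * eta],
        [0, 0, 1 - alpha_v, 0, dt * eta, 0],
        [0, 0, 0, 1 - alpha_v, 0, dt * eta],
        [0, 0, 0, 0, eta, 0],
        [0, 0, 0, 0, 0, eta],
    ])

# One prediction step: s_next = transition_matrix() @ s, with s = [x, y, vx, vy, ax, ay].
```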

3.4.2. Multi-Trajectory Hypothesis

In scenarios of prolonged occlusion (continuous loss of three or more frames), a single motion model often fails to accurately predict the target’s reappearance location. To address this, we devised a multi-modal trajectory hypothesis strategy, simultaneously generating and maintaining three complementary motion model predictions as follows:
Constant Velocity Model (CV): \hat{s}_{\text{CV}} = A_{\text{CV}}\, s_{t_{\text{lost}}}
Constant Acceleration Model (CA): \hat{s}_{\text{CA}} = A_{\text{CA}}\, s_{t_{\text{lost}}}
Curved Motion Model (CT): \hat{s}_{\text{CT}} = A_{\text{CT}}\, s_{t_{\text{lost}}} + w_{\text{curv}}
Here, w_curv is a curvature compensation term estimated based on the target’s recent turning trend. The system prioritizes searching for matches in these three hypothesis regions in subsequent frames and selects the best hypothesis based on matching confidence until either the trajectory is successfully recovered or the tracking exceeds the preset time window (typically 30 frames).
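A sketch of the three hypotheses under stated assumptions (simple closed-form propagation of the last confirmed state, with w_curv passed in as an externally estimated curvature term):

```python
# Sketch of the multi-trajectory hypothesis: propagate the last confirmed state
# under constant-velocity, constant-acceleration, and curved-motion assumptions.
# The closed-form propagation and the w_curv interface are assumptions.
import numpy as np

def propagate_hypotheses(s_lost, n_frames, dt=1.0, w_curv=None):
    """s_lost: last confirmed state [x, y, vx, vy, ax, ay]."""
    x, y, vx, vy, ax, ay = s_lost
    t = n_frames * dt
    cv = np.array([x + vx * t, y + vy * t, vx, vy, 0.0, 0.0])                   # constant velocity
    ca = np.array([x + vx * t + 0.5 * ax * t**2, y + vy * t + 0.5 * ay * t**2,
                   vx + ax * t, vy + ay * t, ax, ay])                           # constant acceleration
    ct = ca.copy()
    if w_curv is not None:                                                      # curved motion
        ct[:2] += np.asarray(w_curv) * t
    return {"CV": cv, "CA": ca, "CT": ct}

# In subsequent frames, detections are matched against all three predicted regions
# and the hypothesis with the highest matching confidence is retained.
```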

3.4.3. Data-Driven Noise Adjustment

A key limitation of traditional Kalman filters lies in their fixed noise covariance matrices, which are unable to adapt to dynamic changes in target motion states. To address this issue, we propose an adaptive noise adjustment mechanism, where the process noise matrix Q and observation noise matrix R are dynamically updated based on real-time matching results as follows:
Q_t = Q_{t-1} \cdot \left(1 + \frac{\lVert v_t - v_{t-1} \rVert}{v_{\text{ref}}}\right), \qquad v_{\text{ref}} = 10\ \text{pixels/frame}
R_t = R_0 \cdot (2 - \text{conf}_t) \cdot e^{\,\text{AR}_{\text{dev}}}, \qquad \text{AR}_{\text{dev}} = \big|\, \text{AR}_{\text{det}} - \text{AR}_{\text{hist}} \,\big|
Here, Q and R are the process and observation noise matrices, respectively. The process noise matrix Q is updated based on the magnitude of the target’s velocity change, reflecting the degree of abruptness in its motion. The observation noise matrix R is adjusted based on the detection quality, taking into account both detection confidence and the consistency of the target’s aspect ratio with the historical data.
This adaptive adjustment mechanism modifies the filter parameters based on two key metrics as follows: (1) the rate of velocity change, reflecting abruptness in target motion states; and (2) detection quality assessment, combining detection confidence and the historical consistency of the target’s aspect ratio. When detection results are reliable, the system reduces observation noise to rely more on current observations; conversely, during rapid target direction changes or unstable detection, it automatically increases process or observation noise to maintain prediction stability.
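A compact sketch of the adaptive update, following the reconstructed equations above; the interfaces (scalar confidence and aspect ratios, diagonal initial matrices) are assumptions.

```python
# Sketch of the data-driven noise adjustment: Q grows with the magnitude of the
# velocity change; R shrinks when detections are confident and consistent with
# the historical aspect ratio (v_ref = 10 pixels/frame as in the text).
import numpy as np

def adapt_noise(Q_prev, R0, v_cur, v_prev, conf, ar_det, ar_hist, v_ref=10.0):
    # Process noise: scale by relative velocity change (motion abruptness).
    dv = np.linalg.norm(np.asarray(v_cur) - np.asarray(v_prev))
    Q = Q_prev * (1.0 + dv / v_ref)
    # Observation noise: low confidence or aspect-ratio drift inflates R.
    ar_dev = abs(ar_det - ar_hist)
    R = R0 * (2.0 - conf) * np.exp(ar_dev)
    return Q, R

# Example:
# Q, R = adapt_noise(np.eye(6) * 0.01, np.eye(4) * 1.0,
#                    v_cur=(3.0, 1.0), v_prev=(1.0, 0.5),
#                    conf=0.9, ar_det=0.45, ar_hist=0.5)
```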

3.5. Bi-Modal Dynamic Decision Method

The bimodal dynamic decision-making approach serves as the core decision-making mechanism in the tracking system. By dynamically fusing appearance and motion modality information through a carefully designed process, it effectively addresses critical issues such as multi-source information conflicts and matching ambiguity resolution. This method ensures matching accuracy while maintaining computational efficiency. The approach employs a dynamic weighting fusion function to adaptively balance appearance similarity and motion consistency as follows:
S_{\text{fuse}} = \beta(\tau)\, S_{\text{app}} + (1 - \beta(\tau))\, S_{\text{mot}}, \qquad \beta(\tau) = \frac{1}{1 + e^{\,0.1(\tau - 15)}}
Here, τ represents the trajectory duration (in frames), and β(τ) constitutes a Sigmoid-shaped dynamic weighting function.
As the trajectory duration increases, the weight of motion information gradually rises from an initial 30% to 70%. This design reflects the increasing reliance on motion consistency for long-term trajectories: new targets are identified primarily by their appearance features, while targets under long-term tracking establish stable motion patterns whose predictions become progressively more reliable.
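A minimal sketch of the fusion step, directly implementing β(τ) as written above; the matrix shapes are assumptions.

```python
# Sketch of the bi-modal dynamic fusion: beta(tau) is a sigmoid of trajectory age
# that shifts weight from appearance to motion as a track matures.
import numpy as np

def fuse_similarity(s_app, s_mot, tau, k=0.1, tau0=15):
    """s_app, s_mot: (N, M) similarity matrices; tau: (N,) track ages in frames."""
    beta = 1.0 / (1.0 + np.exp(k * (np.asarray(tau, dtype=float) - tau0)))  # appearance weight
    beta = beta[:, None]                           # broadcast over detections
    return beta * s_app + (1.0 - beta) * s_mot     # S_fuse

# A new track (small tau) relies mainly on appearance similarity; a long-lived
# track (large tau) relies mainly on motion consistency.
```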

3.6. Inter-Module Collaboration Mechanism

The proposed tracking framework achieves efficient and robust multi-object tracking through the tight collaboration of three core modules, each with distinct and complementary responsibilities. The Cross-Frame Association Module (CFAM) extracts and associates appearance features across frames, generating an appearance similarity matrix S_app. The Dynamic Motion Model (DMM) predicts target positions based on motion states, outputting a motion similarity matrix S_mot. The comprehensive matching decision method integrates these two modalities by dynamically weighting and employing a hierarchical decision strategy to finalize target association.
The system’s key strength lies in the complementary enhancement of multi-modal information. Under normal tracking conditions, appearance and motion information work synergistically to validate each other. In challenging scenarios, when one modality becomes unreliable, the other compensates effectively as follows:
  • When significant appearance changes occur (e.g., lighting variations or partial occlusion), DMM’s motion prediction provides a stable positional reference.
  • When abrupt motion changes occur (e.g., sudden stops or turns), CFAM’s appearance matching capability ensures target continuity.
  • During short-term occlusion, CFAM’s memory recall mechanism and DMM’s predicted positions jointly maintain trajectory continuity.
  • For long-term occlusion, DMM’s multi-hypothesis trajectory backtracking and CFAM’s memory recall mechanism work in tandem to maximize trajectory recovery probability.
Notably, CFAM’s memory recall mechanism and DMM’s multi-hypothesis trajectory backtracking complement each other across temporal scales as follows: the former enhances appearance features under short-term occlusion (1–3 frames), while the latter provides multi-path prediction strategies for mid- to long-term occlusion (more than 3 frames). This multi-scale, multi-modal collaboration significantly enhances the system’s robustness in complex scenarios, particularly excelling in handling challenges such as occlusion, dense targets, and non-linear motion.

4. Experimental Results and Analysis

This section presents a comprehensive evaluation of the proposed multi-modal tracking framework, demonstrating the effectiveness of its individual components and the overall performance advantages through experiments conducted on multiple benchmark datasets.

4.1. Experimental Setup

4.1.1. Datasets and Evaluation Metrics

The experimental evaluation was conducted on the following publicly available datasets:
  • MOT17 [26]: An extensive dataset with 14 sequences, including varying lighting conditions, camera movements, and crowded environments.
  • MOT20 [27]: A dataset focused on extremely crowded scenarios, with an average density of 246 people per frame.
  • VisDrone-MOT [6,28]: A dataset captured from a drone’s perspective, containing numerous small targets with significant size variations.
The evaluation employed the CLEAR MOT metrics [29], which include several important indicators for assessing multi-object tracking performance. The specific calculation formulas are as follows:

Multiple Object Tracking Accuracy (MOTA)

MOTA comprehensively considers the impact of false positives (FP), false negatives (FN), and identity switches (IDs) on tracking accuracy, calculated as follows:
\mathrm{MOTA} = 1 - \frac{\sum_t (FP_t + FN_t + IDs_t)}{\sum_t GT_t}
where t represents the frame index, FP_t is the number of false positives at frame t, FN_t is the number of false negatives at frame t, IDs_t is the number of identity switches at frame t, and GT_t is the number of ground truth targets at frame t. MOTA ranges within (−∞, 1], with values closer to 1 indicating better tracking performance.

Identification F1 Score (IDF1)

IDF1 measures the tracker’s performance in correctly identifying target identities, combining true positives for identities (TP_id), false positives (FP), and identity switches (IDs), calculated as follows:
\mathrm{IDF1} = \frac{2 \sum_t TP_{id,t}}{\sum_t \left( 2\, TP_{id,t} + FP_t + IDs_t \right)}
where TP_{id,t} is the number of correctly identified target identities at frame t. IDF1 ranges within [0, 1], with higher values indicating better performance in maintaining target identity consistency.

Higher Order Tracking Accuracy (HOTA)

HOTA is a more comprehensive metric that simultaneously considers detection precision and identity association accuracy by combining location accuracy (LA) and identity accuracy (IA). First, location accuracy LA is defined as follows:
LA = \frac{\sum_t \sum_{i \in GT_t \cap Det_t} IoU(gt_i, det_i)}{\sum_t \left| GT_t \cap Det_t \right|}
where GT_t is the set of ground truth targets at frame t; Det_t is the set of detected targets at frame t; gt_i and det_i are the bounding boxes of the ground truth and detected targets, respectively; and IoU(gt_i, det_i) is their intersection over union. Next, identity accuracy IA is defined as follows:
IA = \frac{\sum_t TP_{id,t}}{\sum_t \left( TP_{id,t} + \frac{1}{2}(FP_t + IDs_t) \right)}
Finally, HOTA is calculated as follows:
\mathrm{HOTA} = \sqrt{LA \times IA}
HOTA ranges within [ 0 , 1 ] , with higher values indicating better overall performance in both detection and identity association.
These key metrics enable a comprehensive assessment of the tracker’s performance in terms of identity retention capability and trajectory integrity.
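For clarity, the per-frame counts can be turned into the three metrics as in the following sketch; the HOTA function uses the geometric mean of LA and IA as reconstructed above, which simplifies the full HOTA formulation.

```python
# Sketch: CLEAR/identity metrics from per-frame counts (simplified HOTA form).
import numpy as np

def mota(fp, fn, ids, gt):
    # fp, fn, ids, gt: per-frame counts of false positives, false negatives,
    # identity switches, and ground-truth targets.
    return 1.0 - (np.sum(fp) + np.sum(fn) + np.sum(ids)) / np.sum(gt)

def idf1(tp_id, fp, ids):
    tp_id, fp, ids = map(np.asarray, (tp_id, fp, ids))
    return 2.0 * tp_id.sum() / (2.0 * tp_id + fp + ids).sum()

def hota(la, ia):
    # Geometric mean of location accuracy and identity (association) accuracy.
    return float(np.sqrt(la * ia))

# Example over three frames:
# mota(fp=[2, 1, 0], fn=[1, 0, 1], ids=[0, 1, 0], gt=[10, 10, 10])  # -> 0.8
```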

4.1.2. Implementation Details

All experiments were conducted on a workstation equipped with an Intel i9-10900K CPU and an NVIDIA RTX 4070 GPU. The model was trained using the following hyperparameters:
  • Learning rate: Initial value of 2 × 10⁻⁴, with cosine annealing scheduling.
  • Batch size: 16.
  • Training epochs: 30.
  • Optimizer: AdamW (weight decay of 1 × 10⁻⁴).
During the inference phase, the detection confidence threshold was set to 0.5, the IoU (Intersection over Union) threshold to 0.7, and the maximum track retention time to 30 frames.

4.2. Ablation Studies

4.2.1. Module Contribution Analysis

To demonstrate the contribution of each module to the overall performance, we conducted ablation studies based on the baseline model. The baseline model employs basic IoU matching and linear Kalman filtering for motion matching, utilizes average pooling to extract feature vectors from target regions, and calculates the cosine distance between feature vectors of consecutive frames for feature matching. Finally, it combines motion and feature matching through weighted fusion before applying the Hungarian algorithm to obtain tracking results.
As shown in Table 1, each module contributes significantly to the overall performance. The MS-FAE module improves MOTA by 12.6%, while CFAM achieves the most substantial increase in IDF1 (+9.8%), demonstrating its crucial role in maintaining ID consistency. The DMM module further reduces ID switches by 111 instances, and the bi-modal dynamic matching decision method enhances all evaluation metrics comprehensively. Notably, although the addition of these modules slightly decreases processing speed, the complete model still maintains real-time performance at 19.6 FPS, validating our emphasis on efficiency in algorithm design.

4.2.2. Parameter Sensitivity Analysis

The weights in the SAAM are designed to balance the contributions of different temporal and spatial features. In the spatio-temporal domain, features from different time steps and spatial locations carry varying degrees of importance for accurate multi-object tracking. The attention mechanism assigns higher weights to more relevant features. For example, in a scenario where an object’s appearance changes gradually over time, the features from recent time steps should be weighted more heavily to capture the current state of the object. Theoretically, this approach aligns with the concept of adaptively emphasizing informative features, which has been shown to be effective in many computer vision tasks.
We investigated the impact of key parameters on the performance of the SAAM, particularly focusing on parameters k and S_0 in the target-size adaptive weighting function.
As illustrated in Figure 4, the system achieves optimal performance equilibrium at k = 0.1 and S_0 = 1024. This validates the rationality of our parameter configuration while demonstrating that the SAAM maintains robust performance across a wide parameter range.
Overall, the experimental results conclusively demonstrate that the proposed method strikes an ideal balance between speed and accuracy. Particularly, while ensuring real-time operation, our approach exhibits superior tracking performance, offering an efficient and reliable solution for real-time video analysis applications.

4.2.3. Memory Pool Depth

The memory pool depth determines the amount of historical information that the model can access. A deeper memory pool allows the model to capture long-term dependencies and patterns in the tracking process. In multi-object tracking, objects may reappear after long occlusions, and having a sufficient memory depth enables the model to recall the object’s appearance and motion patterns from previous encounters. This is crucial for maintaining consistent object identities over time.
This ablation study systematically evaluates the impact of memory pool depth on target tracking performance, validating both the effectiveness of the dynamic memory recall mechanism and the rationality of its parameter settings. As shown in Table 2, tracking performance exhibits significant improvement as the memory pool depth increases from 1 to 5, followed by a gradual plateau in performance gains.
As shown in Table 2, the model exhibits different characteristics with varying memory pool depths. In terms of tracking precision, when the depth increases from 1 to 3, the MOTA metric improves from 37.4% to 39.3%, IDF1 rises from 46.7% to 48.2%, and the number of ID switches decreases significantly from 368 to 215. These results indicate that incorporating multi-frame historical features can mitigate feature degradation caused by short-term occlusion, while the weighted fusion of features from the most recent three frames substantially enhances target representation robustness. However, marginal returns diminish: performance improvements narrow noticeably once the depth exceeds 3, with a depth of 5 yielding only 0.3% MOTA and 0.4% IDF1 gains over a depth of 3. This suggests that a 3-frame memory window effectively covers typical short-term occlusion scenarios, and additional historical feature information provides limited benefit. Regarding computational efficiency, although increasing the memory pool depth gradually reduces the frame rate from 20.1 FPS to 19.2 FPS, the decrease is relatively modest, with each additional depth level costing approximately 0.2–0.3 FPS. This indicates that the designed weighted computation mechanism effectively controls computational complexity, maintaining real-time processing at 19.6 FPS with a depth of 3.
The study ultimately determines that a memory pool depth of 3 achieves the optimal balance among tracking precision (MOTA 39.3%), identity continuity (215 IDs), and computational efficiency (19.6 FPS). Compared to the baseline method (depth of 1), this configuration reduces the number of IDs by 31.6%, effectively validating the efficacy of the dynamic memory update rules and exponential decay weight design. The results demonstrate that the proposed approach significantly enhances target association reliability under occlusion scenarios without substantially increasing computational burden.

4.2.4. Ablation Study on Dynamic Motion Model Components

To evaluate the effectiveness of our Dynamic Motion Model and quantify the contribution of each proposed component, we conducted a comprehensive ablation study comparing the traditional Kalman filter with various configurations of our DMM. Table 3 presents the experimental results on the MOT17 dataset.
The results demonstrate the progressive performance improvement achieved by each component of our DMM. Starting from the traditional Kalman filter as a baseline, we observe the following contributions:
Enhanced State Transition (EST): By incorporating acceleration components into the state vector and implementing a velocity dampening mechanism, this component reduces prediction error by 24.6% (from 12.6 to 9.5 pixels) and decreases identity switches by 12.5%. The improvements are particularly significant in sequences with non-uniform motion patterns, where the standard linear model’s assumptions are frequently violated. The expanded state representation enables more accurate modeling of complex trajectories, such as those with varying acceleration and sharp turns.
Multi-Trajectory Hypothesis (MTH): Building on the enhanced state transition, the addition of multi-trajectory hypothesis generation further improves prediction accuracy and tracking stability. This component maintains multiple potential trajectory paths during occlusions, reducing ID switches by an additional 9.4% and prediction error by 7.4%. Analysis of difficult tracking scenarios reveals that MTH is particularly effective in recovering targets after prolonged occlusions (>10 frames), where traditional single-hypothesis approaches often fail to re-establish correct identities.
Data-Driven Noise Adjustment (DNA): The final component, which dynamically adjusts process and observation noise matrices based on motion characteristics and detection quality, provides the most substantial improvement in tracking identity consistency. When integrated with the previous components, it further reduces ID switches by 9.7% and prediction error by 17.0%. This adaptive approach proves especially valuable in heterogeneous scenes where target behavior varies significantly, allowing the filter to automatically adjust its reliance on prediction versus observation based on contextual cues.
The complete DMM configuration, incorporating all three components, achieves a 28.5% reduction in identity switches and a 42.1% reduction in prediction error compared to the traditional Kalman filter. These results validate our design hypothesis that addressing the specific limitations of conventional motion models (limited state representation, single-trajectory prediction, and static noise parameters) significantly enhances tracking performance in complex scenarios.
Further analysis of the sequence-specific results reveals that the combination of these components produces synergistic effects beyond their individual contributions. For example, the enhanced state transition works particularly well with the data-driven noise adjustment in handling abrupt motion changes, while the multi-trajectory hypothesis complements both components during occlusion handling. This demonstrates that the three components address complementary aspects of motion modeling challenges in multi-object tracking.

4.2.5. BDDM versus Static Fusion Methods

To demonstrate the advantages of our Bi-Modal Dynamic Decision Method over conventional fusion approaches, we compared it with fixed-weight strategies similar to those used in StrongSORT [19] and other trackers. Table 4 presents the results of this comparison on the MOT17 dataset.
The experimental results clearly illustrate the limitations of relying solely on either appearance or motion information. The appearance-only approach, while achieving reasonable MOTA and IDF1 scores, suffers from a high number of identity switches, particularly in occlusion scenarios where appearance features become unreliable. Conversely, the motion-only approach shows the lowest performance across all metrics, especially in IDF1 (71.4%), highlighting its inadequacy in maintaining long-term identity consistency.
Fixed-weight fusion strategies offer improvements over single-modality approaches, with the 70:30 weighting (favoring appearance over motion) performing better than the balanced 50:50 configuration. This suggests that appearance features generally provide more discriminative information than motion cues. However, our proposed BDDM significantly outperforms all fixed-weight strategies, achieving improvements of 0.9% in MOTA and 1.3% in IDF1 compared to the best fixed-weight configuration, while reducing identity switches by 8.6%.
The superior performance of BDDM can be attributed to its dynamic weighting mechanism, which adaptively adjusts the contribution of appearance and motion information based on trajectory duration and detection context. Specifically, for newly initialized trajectories with limited motion history, BDDM assigns higher weights to appearance features (approximately 70%), while for established trajectories with consistent motion patterns, it gradually increases the weight of motion information (up to 70%). This adaptive strategy enables the system to leverage the most reliable information source in different tracking scenarios.
Further analysis of sequence-specific results reveals that the advantages of the BDDM are most pronounced in challenging scenarios, such as crowded scenes with frequent occlusions and rapid motion changes. In such cases, the dynamic fusion strategy significantly reduces identity switches by up to 15.2% compared to fixed-weight methods, demonstrating its robustness to complex tracking conditions.

4.3. Performance Evaluation

4.3.1. Comparative Analysis with State-of-the-Art Methods

We conducted a comprehensive comparison between our proposed method and contemporary state-of-the-art tracking algorithms, and the results are as follows. The experimental results on the MOT17 test set are presented in Table 5.
As illustrated in Figure 5 and Table 5, the proposed model demonstrates significant advantages when compared with ByteTrack, FairMOT, and other models on the MOT17 dataset. While ByteTrack achieves competitive performance in detection accuracy (MOTA 80.3), it shows limitations in identity preservation and association in complex scenarios, evidenced by its higher number of ID switches (2196) compared to our approach (1134). Similarly, FairMOT, despite its integrated detection-tracking framework, struggles with maintaining identity consistency (3303 ID switches) and achieves lower comprehensive tracking metrics (HOTA 59.3, IDF1 72.3). These baseline methods face challenges in scenarios involving occlusions, crowded scenes, and non-linear motion patterns. In terms of accuracy-related metrics, our model achieves a HOTA score of 63.7 (compared to ByteTrack’s 63.1 and FairMOT’s 59.3), a MOTA score of 79.8 (versus ByteTrack’s 80.3 and FairMOT’s 73.7), and an IDF1 score of 79.5 (surpassing ByteTrack’s 77.3 and FairMOT’s 72.3). These results indicate superior performance in comprehensive tracking, object detection, and identity recognition accuracy, enabling precise detection, the tracking of multiple targets, and the maintenance of correct identity recognition. Regarding identity stability, our model records only 1134 ID switches, substantially lower than ByteTrack’s 2196 and FairMOT’s 3303, demonstrating robust continuous tracking capability in complex scenarios while reducing identity switches and enhancing tracking reliability. In terms of real-time performance measured by FPS, our model operates at 61.4 FPS, significantly outperforming ByteTrack (29.1 FPS), FairMOT (25.2 FPS), and StrongSORT (7.0 FPS), showcasing superior efficiency in processing video frames and real-time applications.
As illustrated in Figure 6 and Table 6, the proposed model demonstrates exceptional performance in the more challenging MOT20 dataset scenarios. In terms of accuracy metrics, it achieves a HOTA score of 61.5, MOTA of 76.1, and IDF1 of 76.5, maintaining high detection and tracking precision compared to FairMOT (HOTA 54.6, MOTA 61.8, and IDF1 67.3) and StrongSORT (HOTA 62.6, MOTA 73.8, and IDF1 77.0). Regarding identity stability, our model records only 1562 ID switches, substantially lower than FairMOT’s 5243, demonstrating superior continuous tracking capability in complex backgrounds with frequent target interactions. For real-time performance, our model operates at 23.6 FPS, outperforming both FairMOT (13.2 FPS) and StrongSORT (1.4 FPS), proving its ability to balance detection accuracy and real-time processing in complex scenarios while achieving efficient video stream processing with high-precision tracking.
Based on the experimental results from both the MOT17 and MOT20 datasets, our proposed tracking model achieves optimal balance between speed and detection performance through innovative algorithm design and optimization. This balance holds significant implications for practical applications, particularly in intelligent surveillance scenarios where real-time processing of extensive video data and accurate tracking of multiple targets are essential. In autonomous driving applications, vehicles must make split-second decisions while maintaining precise real-time tracking of surrounding objects. The exceptional performance of our proposed model provides robust technical support for these application scenarios, demonstrating considerable potential for practical implementation and promising to advance the development of multi-object tracking technology in real-world applications.
Moreover, there are significant differences in the performance of different multi-object tracking (MOT) methods on the MOT17 and MOT20 datasets. On the MOT17 dataset, our proposed method achieves scores of 63.7, 79.8, and 79.4 in the HOTA, MOTA, and IDF1 metrics, respectively, ranking just behind StrongSORT++, and it significantly outperforms other methods with only 1134 identity switches, demonstrating its advantage in maintaining object identity consistency. With an FPS of 61.4, it far surpasses StrongSORT++ in processing speed, balancing accuracy and efficiency. In contrast, on the MOT20 dataset, our method achieves HOTA, MOTA, and IDF1 scores of 61.5, 77.1, and 76.5, with slightly lower overall accuracy compared to MOT17, where BoostTrack++ leads with a HOTA of 66.4 and an IDF1 of 82.0. However, our method still maintains a low number of identity switches (1162 times) and has an FPS of 27.6, significantly higher than BoostTrack++. Overall, our method demonstrates robust performance on both datasets, especially excelling in reducing identity switches and improving processing speed, and it remains competitive in practical applications despite the slightly lower accuracy on MOT20.

4.3.2. Small Object Tracking Performance

Given this study’s particular focus on small object tracking, we conducted a dedicated evaluation using the VisDrone-MOT dataset, with the results presented in Table 7.
The experimental results demonstrate that our proposed method achieves optimal performance in both MOTA (39.3) and FPS (14.6), significantly outperforming other approaches. Additionally, it exhibits exceptional performance in IDF1 (48.2) and ID (215) metrics, particularly surpassing most comparative methods in identity consistency. Compared to CMOT, our method maintains a high IDF1 while substantially reducing the number of ID switches, and it demonstrates superior computational efficiency over DeepSORT and Flow-Tracker. Overall, our approach achieves a better balance among tracking accuracy, identity stability, and computational efficiency, providing an efficient and robust solution for multi-object tracking tasks.
As shown in Figure 7, the image sequence depicts a cyclist temporarily obscured by a bus before reappearing. Although UAV-SIFT detects the cyclist throughout this process, its ID changes before and after the occlusion, whereas our proposed method not only maintains continuous detection but also preserves a consistent ID for the cyclist before and after the occlusion.
When evaluating small object tracking performance using the VisDrone-MOT dataset, our method clearly outperforms others, with high MOTA and IDF1 scores indicating its effectiveness in detecting small objects and maintaining identity consistency. Compared to DeepSORT, the Multi-Scale Feature Adaptive Enhancement (MS-FAE) module in our method is a major advantage. It combines spatial details and semantic information through multi-scale feature pyramid fusion, which is crucial for small objects with indistinct single-scale features. Its small object adaptive attention mechanism also better highlights small object features. In terms of identity stability, our method has far fewer ID switches, mainly due to the Cross-Frame Association Module (CFAM). The CFAM’s grouped cross-attention captures global semantic associations, and its memory recall mechanism maintains object identity during occlusions, as demonstrated by the successful tracking of a cyclist during occlusion while UAV-SIFT failed. Regarding computational efficiency, our method runs at 14.6 FPS, faster than counterparts such as Flow-Tracker and DeepSORT. This is achieved through the optimized design of modules like CFAM, whose grouped cross-attention reduces computational complexity, enhancing both feature representation and calculation speed. Overall, the combination of innovative modules such as MS-FAE and CFAM, along with well-tuned hyperparameters, gives our method an edge in detection accuracy, identity stability, and computational efficiency for small object tracking.

5. Conclusions

This study presents an innovative multi-modal fusion tracking framework that leverages deep learning to overcome multi-object tracking challenges in complex scenarios. The key findings of our research include the significant performance enhancements brought by the following four novel modules: MS-FAE, CFAM, DMM, and BDDM. The MS-FAE module improves small object detection through multi-scale feature fusion, outperforming previous methods that struggled with fine-grained detail recognition. The CFAM module constructs semantic networks to resolve dense-scenario associations, addressing a long-standing challenge in crowded environments where traditional methods often suffer from identity switches. The DMM module optimizes motion prediction by integrating data-driven strategies into the Kalman filtering framework, offering a more accurate alternative to conventional motion models. Finally, the two-stage decision method effectively fuses appearance and motion data, surpassing prior approaches in terms of trajectory continuity. These advancements are quantitatively validated by our framework’s outstanding performance on the MOT17, MOT20, and VisDrone datasets, achieving 63.7% HOTA and 79.4% IDF1 on the MOT17 test set while ensuring real-time operation.
However, our research has several limitations. The framework's performance degrades in extreme scenarios, such as extremely low-light conditions, ultra-high-density crowds, and severe occlusions. These limitations stem from the framework's current reliance on a limited set of modalities and from its computational cost on resource-constrained devices. To address these gaps, future studies could focus on integrating additional modalities, such as depth information and semantic segmentation, to enhance the model's robustness in challenging environments. Exploring model compression and acceleration techniques, including knowledge distillation and lightweight network architectures, could also enable high-precision real-time tracking on devices with limited computational resources. Additionally, investigating domain adaptation methods to handle variations across application scenarios would further expand the framework's applicability.
In conclusion, our multi-modal fusion tracking framework demonstrates the significant potential of deep learning in multi-object tracking, extending the boundaries of previous research. By identifying its limitations and proposing targeted future directions, this study provides a clear roadmap for advancing the field, with implications for intelligent transportation, security surveillance, and smart city infrastructure.

Author Contributions

Conceptualization, Z.L. and D.J.; methodology, Z.L.; software, Z.L. and Z.H.; validation, Z.L. and N.W.; formal analysis, Z.L.; investigation, Z.H.; resources, N.W.; data curation, N.W.; writing—original draft preparation, Z.L.; writing—review and editing, Z.L. and D.J.; visualization, N.W.; supervision, D.J.; project administration, D.J. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The datasets presented in this article are not readily available because the data are part of an ongoing study. Requests to access the datasets should be directed to dongyaojia1974@163.com.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Li, S.; Ren, H.; Xie, X.; Cao, Y. A Review of Multi-Object Tracking in Recent Times. IET Comput. Vis. 2025, 19, e70010. [Google Scholar] [CrossRef]
  2. Solano-Carrillo, E.; Sattler, F.; Alex, A.; Klein, A.; Costa, B.P.; Rodriguez, A.B.; Stoppe, J. UTrack: Multi-Object Tracking with Uncertain Detections. arXiv 2024, arXiv:2408.17098. [Google Scholar]
  3. Cai, J.; Xu, M.; Li, W.; Xiong, Y.; Xia, W.; Tu, Z.; Soatto, S. Memot: Multi-object tracking with memory. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8090–8100. [Google Scholar]
  4. Huang, C.; Han, S.; He, M.; Zheng, W.; Wei, Y. DeconfuseTrack: Dealing with Confusion for Multi-Object Tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 19290–19299. [Google Scholar]
  5. Xiao, C.; Cao, Q.; Luo, Z.; Lan, L. Mambatrack: A simple baseline for multiple object tracking with state space model. In Proceedings of the 32nd ACM International Conference on Multimedia, Melbourne, VIC, Australia, 28 October–1 November 2024; pp. 4082–4091. [Google Scholar]
  6. Fan, H.; Du, D.; Wen, L.; Zhu, P.; Hu, Q.; Ling, H.; Shah, M.; Pan, J.; Schumann, A.; Dong, B.; et al. Visdrone-mot2020: The vision meets drone multiple object tracking challenge results. In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, 23–28 August 2020; Proceedings, Part IV 16; Springer International Publishing: New York, NY, USA, 2020; pp. 713–727. [Google Scholar]
  7. Veeramani, B.; Raymond, J.W.; Chanda, P. DeepSort: Deep convolutional networks for sorting haploid maize seeds. BMC Bioinform. 2018, 19, 289. [Google Scholar] [CrossRef] [PubMed]
  8. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing, Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
  9. Özcan, İ.; Altun, Y.; Parlak, C. Improving YOLO detection performance of autonomous vehicles in adverse weather conditions using metaheuristic algorithms. Appl. Sci. 2024, 14, 5841. [Google Scholar] [CrossRef]
  10. Park, S.; Kim, J.; Wang, S.; Kim, J. Effectiveness of Image Augmentation Techniques on Non-Protective Personal Equipment Detection Using YOLOv8. Appl. Sci. 2025, 15, 2631. [Google Scholar] [CrossRef]
  11. Eum, I.; Kim, J.; Wang, S.; Kim, J. Heavy Equipment Detection on Construction Sites Using You Only Look Once (YOLO-Version 10) with Transformer Architectures. Appl. Sci. 2025, 15, 2320. [Google Scholar] [CrossRef]
  12. Lu, Z.; Han, B.; Dong, L.; Zhang, J. COTTON-YOLO: Enhancing Cotton Boll Detection and Counting in Complex Environmental Conditions Using an Advanced YOLO Model. Appl. Sci. 2024, 14, 6650. [Google Scholar] [CrossRef]
  13. Ghiasi, G.; Lin, T.Y.; Le, Q.V. Nas-fpn: Learning scalable feature pyramid architecture for object detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 7036–7045. [Google Scholar]
  14. Wang, K.; Liew, J.H.; Zou, Y.; Zhou, D.; Feng, J. Panet: Few-shot image semantic segmentation with prototype alignment. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9197–9206. [Google Scholar]
  15. Sullivan, A.; Lu, X. ASPP: A new family of oncogenes and tumour suppressor genes. Br. J. Cancer 2007, 96, 196–200. [Google Scholar] [CrossRef] [PubMed]
  16. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  17. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  18. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-object tracking by associating every detection box. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 1–21. [Google Scholar]
  19. Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.; Su, F.; Gong, T.; Meng, H. Strongsort: Make deepsort great again. IEEE Trans. Multimed. 2023, 25, 8725–8737. [Google Scholar] [CrossRef]
  20. Sun, P.; Cao, J.; Jiang, Y.; Zhang, R.; Xie, E.; Yuan, Z.; Wang, C.; Luo, P. Transtrack: Multiple object tracking with transformer. arXiv 2020, arXiv:2012.15460. [Google Scholar]
  21. Chu, P.; Wang, J.; You, Q.; Ling, H.; Liu, Z. Transmot: Spatial-temporal graph transformer for multiple object tracking. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 4870–4880. [Google Scholar]
  22. Cao, J.; Pang, J.; Weng, X.; Khirodkar, R.; Kitani, K. Observation-centric sort: Rethinking sort for robust multi-object tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 11–15 June 2023; pp. 9686–9696. [Google Scholar]
  23. Aharon, N.; Orfaig, R.; Bobrovsky, B.Z. Bot-sort: Robust associations multi-pedestrian tracking. arXiv 2022, arXiv:2206.14651. [Google Scholar]
  24. Wang, C.Y.; Yeh, I.H.; Mark Liao, H.Y. Yolov9: Learning what you want to learn using programmable gradient information. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2024; pp. 1–21. [Google Scholar]
  25. Welch, G.; Bishop, G. An Introduction to the Kalman Filter; University of North Carolina: Chapel Hill, NC, USA, 1995. [Google Scholar]
  26. Milan, A.; Leal-Taixé, L.; Reid, I.; Roth, S.; Schindler, K. MOT16: A benchmark for multi-object tracking. arXiv 2016, arXiv:1603.00831. [Google Scholar]
  27. Dendorfer, P.; Rezatofighi, H.; Milan, A.; Shi, J.; Cremers, D.; Reid, I.; Roth, S.; Schindler, K.; Leal-Taixé, L. Mot20: A benchmark for multi object tracking in crowded scenes. arXiv 2020, arXiv:2003.09003. [Google Scholar]
  28. Wen, L.; Zhu, P.; Du, D.; Bian, X.; Ling, H.; Hu, Q.; Zheng, J.; Peng, T.; Wang, X.; Zhang, Y.; et al. Visdrone-mot2019: The vision meets drone multiple object tracking challenge results. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  29. Bernardin, K.; Stiefelhagen, R. Evaluating multiple object tracking performance: The clear mot metrics. EURASIP J. Image Video Process. 2008, 2008, 246309. [Google Scholar] [CrossRef]
  30. Zhang, Y.; Wang, C.; Wang, X.; Zeng, W.; Liu, W. Fairmot: On the fairness of detection and re-identification in multiple object tracking. Int. J. Comput. Vis. 2021, 129, 3069–3087. [Google Scholar] [CrossRef]
  31. Cao, X.; Li, J.; Zhao, P.; Li, J.; Qin, X. Corr-track: Category-level 6d pose tracking with soft-correspondence matrix estimation. IEEE Trans. Vis. Comput. Graph. 2024, 30, 2173–2183. [Google Scholar] [CrossRef] [PubMed]
  32. Stadler, D.; Beyerer, J. Modelling ambiguous assignments for multi-person tracking in crowds. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2022; pp. 133–142. [Google Scholar]
  33. Stanojevic, V.D.; Todorovic, B.T. BoostTrack: Boosting the similarity measure and detection confidence for improved multiple object tracking. Mach. Vis. Appl. 2024, 35, 53. [Google Scholar] [CrossRef]
  34. Bochinski, E.; Eiselein, V.; Sikora, T. High-speed tracking-by-detection without using image information. In Proceedings of the 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–6. [Google Scholar]
  35. Li, W.; Mu, J.; Liu, G. Multiple object tracking with motion and appearance cues. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshops, Seoul, Republic of Korea, 27–28 October 2019. [Google Scholar]
  36. Giglio, M.; Tauber, R.; Nadendla, S.; Munro, J.; Olley, D.; Ball, S.; Mitraka, E.; Schriml, L.M.; Gaudet, P.; Hobbs, E.T.; et al. ECO, the Evidence & Conclusion Ontology: Community standard for evidence information. Nucleic Acids Res. 2019, 47, D1186–D1194. [Google Scholar] [PubMed]
  37. Bae, S.H.; Yoon, K.J. Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH, USA, 23–28 June 2014; pp. 1218–1225. [Google Scholar]
  38. Lu, P.; Ding, Y.; Wang, C. Multi-small target detection and tracking based on improved YOLO and SIFT for drones. Int. J. Innov. Comput. Inf. Control 2021, 17, 205–224. [Google Scholar]
Figure 1. Overall framework of the tracking model.
Figure 2. Architecture of MS-FAE.
Figure 3. Architecture of CFAM.
Figure 4. Sensitivity analysis of the SAAM parameters. The yellow line indicates the MOTA values obtained for different settings of the parameter k, the blue line indicates the MOTA values for different settings of the parameter S₀, and the red line marks the parameter position at which the maximum MOTA was achieved.
Figure 5. Performance-speed trade-off comparison on the MOT17 dataset.
Figure 6. Performance-speed trade-off comparison on the MOT20 dataset.
Figure 7. Visualization comparison of VisDrone experiments. From left to right are three consecutive frames from the VisDrone dataset, where the first row shows detection results using UAV-SIFT and the second row presents the detection results from our proposed method.
Table 1. Ablation study on the VisDrone dataset.

Method | MOTA | IDF1 | IDs | FPS
Baseline | 19.8 | 38.5 | 568 | 22.1
+MS-FAE | 22.3 | 41.6 | 502 | 21.4
+CFAM | 35.6 | 45.7 | 379 | 20.7
+DMM | 38.6 | 46.4 | 268 | 19.9
+BDDM (Ours) | 39.3 | 48.2 | 215 | 19.6
Table 2. Impact of memory pool depth on tracking performance.

Memory Pool Depth | MOTA | IDF1 | IDs | FPS
1 | 37.4 | 46.7 | 318 | 20.1
2 | 38.5 | 47.6 | 243 | 19.8
3 | 39.3 | 48.2 | 215 | 19.6
4 | 39.4 | 48.5 | 201 | 19.4
5 | 39.6 | 48.6 | 197 | 19.2
Table 3. Ablation study of DMM components on MOT17.

Configuration | MOTA | IDF1 | IDs | Pred. Error (px)
Traditional Kalman | 77.3 | 75.4 | 1586 | 12.6
+Enhanced State Transition (EST) | 78.2 | 76.9 | 1387 | 9.5
+Multi-Trajectory Hypothesis (MTH) | 78.7 | 77.8 | 1256 | 8.8
+Data-Driven Noise Adjustment (DNA) | 79.8 | 79.4 | 1134 | 7.3
Table 4. Comparison of fusion methods on MOT17.

Fusion Method | MOTA | IDF1 | IDs
Appearance Only | 77.5 | 75.3 | 1653
Motion Only | 76.8 | 71.4 | 1872
Fixed Weight (50:50) | 78.3 | 77.6 | 1385
Fixed Weight (70:30) | 78.9 | 78.1 | 1241
Our BDDM (Dynamic) | 79.8 | 79.4 | 1134
Table 5. Performance comparison on MOT17.

Method | HOTA | MOTA | IDF1 | IDs | FPS
ByteTrack [18] | 63.1 | 80.3 | 77.3 | 2196 | 29.1
Fair-MOT [30] | 59.3 | 73.7 | 72.3 | 3303 | 25.2
StrongSORT++ [19] | 64.4 | 79.6 | 79.5 | 1194 | 7.0
SORT [8] | 34 | 43.1 | 39.8 | 4853 | 141.3
DeepSORT [7] | 61.2 | 78.0 | 74.5 | 1821 | 13.8
TransTrack [20] | 54.1 | 75.2 | 63.5 | 3603 | 58.2
CorrTracker [31] | 60.7 | 76.5 | 73.6 | 3369 | 14.9
Ours | 63.7 | 79.8 | 79.4 | 1134 | 61.4
Table 6. Performance comparison on MOT20.

Method | HOTA | MOTA | IDF1 | IDs | FPS
ByteTrack [18] | 61.3 | 77.8 | 75.2 | 1223 | 17.5
Fair-MOT [30] | 54.6 | 61.8 | 67.3 | 5243 | 13.2
StrongSORT++ [19] | 62.6 | 73.8 | 77.0 | 770 | 1.4
SORT [8] | 36.1 | 42.7 | 45.1 | 4470 | 57.3
DeepSORT [7] | 57.1 | 71.8 | 69.6 | 1418 | 3.2
MAATrack [32] | 57.3 | 73.9 | 71.2 | 1331 | 14.7
BoostTrack++ [33] | 66.4 | 77.7 | 82.0 | 762 | 2.1
Ours | 61.5 | 77.1 | 76.5 | 1162 | 27.6
Table 7. Small object tracking performance on the VisDrone dataset.

Method | MOTA | IDF1 | IDs | FPS
DeepSORT [7] | 10.1 | 38.3 | 590 | 2.9
IoU Tracker [34] | 12.6 | 38.3 | 576 | 6.3
Flow-Tracker [35] | 26.4 | 41.9 | 127 | 2.6
GOC-ECO [36] | 36.9 | 46.5 | 354 | 5.6
CMOT [37] | 31.5 | 51.3 | 789 | 7.3
UAV-SIFT [38] | 38.7 | 47.3 | 304 | 10.7
Ours | 39.3 | 48.2 | 215 | 14.6