Drones | Article | Open Access | 27 September 2023
DB-Tracker: Multi-Object Tracking for Drone Aerial Video Based on Box-MeMBer and MB-OSNet

College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
* Author to whom correspondence should be addressed.

Abstract

Drone aerial videos hold great promise for modern digital media and remote sensing applications, but effectively tracking multiple objects in these recordings is difficult. Aerial footage typically contains complicated scenes with moving objects such as people, vehicles, and animals, and challenging situations such as large-scale viewpoint shifts and object crossings may occur simultaneously. We incorporate random finite sets into a detection-based tracking framework that takes both the object’s location and its appearance into account. The method retains the detection box information of detected objects and builds the Box-MeMBer object position prediction framework on top of MeMBer random-finite-set point-object tracking. We develop a hierarchical connection structure within the OSNet network, yielding MB-OSNet, to obtain object appearance information; by connecting feature maps of different levels through the hierarchy, the network acquires rich semantic information at different scales. Similarity measurements for all detections and trajectories are collected in a cost matrix that estimates the likelihood of every possible match, with entries that compare tracks and detections in terms of both position and appearance. The DB-Tracker algorithm performs strongly in multi-object tracking of drone aerial videos, achieving MOTA of 37.4% and 46.2% on the VisDrone and UAVDT datasets, respectively. By jointly considering object position and appearance information, DB-Tracker achieves high robustness, especially in complex scenes and under target occlusion, making it a powerful tool for challenging applications such as drone aerial video analysis.

1. Introduction

Drones have been rapidly adopted as electronic communication technology has advanced, owing to their mobility, ease of operation, and low cost, compensating for the information loss that traditional means suffer under weather and time constraints. At the same time, the high agility of drones, compared with fixed cameras, makes the coverage of aerial videos more flexible and diverse. Drone video data are particularly rich in content and temporal information, which has made drone aerial videos increasingly important in numerous object detection and tracking applications [,]. However, compared with multi-object recognition and tracking tasks in ordinary-view videos, drone aerial videos face several obstacles, such as frame degradation, uneven object distribution density, tiny object size, and real-time requirements, which have attracted considerable interest and research from both academia and industry in recent years [].
Many researchers have developed multi-object tracking (MOT) algorithms for drone aerial videos in recent years, the most common being tracking based on visual object detection, which has become the dominant approach for drone aerial video. The detection-based tracking process is typically divided into two parts: (1) a motion model and state estimation to predict the trajectory’s bounding box in subsequent frames and (2) linking new-frame detections with the current trajectories [,]. Two main approaches handle the association task: (1) object appearance models and re-identification (re-id) and (2) object localization, specifically the intersection-over-union ratio between predicted trajectory bounding boxes and detection bounding boxes []. Both approaches quantify associations with distances and treat association as a global assignment problem. However, because of detection uncertainty and mutual occlusion and overlap between objects, these approaches can run into problems in complicated scenarios [,]. Moreover, relying on re-id to tie everything together implies that MOT rests mostly on detection and re-id information: most performance improvements come from better detection and re-id, with motion information serving only as an aid. Yet motion information frequently provides significant “compensation” for trajectory fragmentation caused by occlusion or missed detections by the detector, since the expected path is both continuous and gradual [].
Random finite sets (RFS)-based multi-object tracking is widely used to resolve the aforementioned difficulties []. The RFS method models the object state and observation using a probabilistic framework, and it uses motion information in multi-object tracking to deal with complex situations such as uncertainty and object occlusion, improving the robustness and accuracy of multi-object tracking. Objects may appear and disappear in drone aerial videos, be concealed by other objects, or vary radically in look and motion. The RFS technique effectively solves such issues by predicting and incorporating the uncertainty associated with the object’s presence into the tracking process. This probabilistic modeling technique not only enhances object association accuracy but also allows for an estimate of uncertainty in object state predictions, which is useful for decision-making in dynamic contexts.
Without relying on a one-to-one relationship between detections and objects, RFS-based algorithms efficiently address the data association problem. They employ Bayesian filtering techniques, such as probabilistic hypothesis density filters and multi-Bernoulli filters, to provide efficient and scalable results. This trait is particularly important in drone aerial videos where the number of objects varies over time and the relationship between detections and objects can be confusing due to crossing or overlapping objects. The RFS approach is ideal for demanding and dynamic contexts because it can adapt to varied object counts and manage scenarios with complicated object interactions.
We model the motion information of drone aerial multi-object tracking as a random finite set, a modeling approach that is closer to the reality of multi-object motion. The detection box information of detected objects is retained using MeMBer random-finite-set point-object tracking, the Box-MeMBer object position prediction framework is built on top of it, and the effectiveness of our proposed technique is validated experimentally.
The following are our primary contributions:
  • Multi-object tracking framework DB-Tracker based on detection and RFS: To achieve the combination of location information and appearance features, we utilize the label generalized multi-Bernoulli filter to collect the object’s position information and MB-OSNet to extract the object’s appearance features. Synthetic matching provides better resistance to occlusions and nonlinear motion.
  • Box-MeMBer: To better adapt to nonlinear systems and object motion patterns, a nonlinear motion model and a nonlinear observation model are employed for multi-object tracking using the stochastic finite set method based on multi-Bernoulli filtering. Simultaneously, the observation likelihood function is computed using the expected object state to eliminate the negative impacts of object superposition on the observation update.
  • MB-OSNet: We created MB-OSNet, introduced a multi-scale feature extraction module, and implemented a dense connection structure and multi-scale feature aggregation technique. By connecting the feature maps of different levels in layers, the network can obtain rich semantic information at different scales, improve the network’s perception of object features at different scales, and reduce information loss.
  • Verification and comparison on the VisDrone and UAVDT datasets: We verified the effectiveness of our method on the VisDrone and UAVDT datasets, obtaining 37.4% and 46.2% MOTA, respectively. The results also show that our algorithm is competitive when compared with eight recent multi-object tracking algorithms. In addition, we used a DJI Mini 3 to capture 1080p footage with varied viewing angles, altitudes, occlusions, and so on to test the algorithm’s robustness.

3. Methods

We begin the Box-MeMBer prediction step for the $k$-th frame using the tracking results $X_{k-1}$ from the previous frame. At this stage, we forecast the objects’ state positions, extract the predicted bounding boxes, and compute the association matrix with the detection boxes in the current frame. The objects fall into four classes: $T_L$ (missed objects), $T_C$ (completed-tracking objects), $T_S$ (surviving objects), and $\gamma$ (newly appearing objects). $T_L$ represents objects that were not detected, which has a significant impact on tracking accuracy; $T_C$ represents objects that no longer appear in the current frame, indicating that their tracking is complete; $T_S$ represents objects that exist in the current frame and will continue to be tracked; and $\gamma$ represents new objects that appear in the current frame for the first time. A secondary matching step differentiates missed objects from objects that no longer require tracking. To obtain the tracking results $X_k$ for the current frame, all objects undergo a Box-MeMBer update, followed by merging, pruning, and deleting erroneous objects. The Box-MeMBer prediction phase avoids the failure to associate two bounding boxes of the same object caused by excessive displacement in the intersection-over-union (IoU)-based association matrix, and it allows continuous tracking even when the object moves rapidly. Figure 1 illustrates the framework.
Figure 1. Multi-target tracking algorithm framework for drone aerial videos based on Box-MeMBer and MB-OSNet.
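To make the per-frame procedure described above (and illustrated in Figure 1) easier to follow, the following minimal Python sketch shows one DB-Tracker step with the three stages passed in as callables; the function names and signatures are illustrative assumptions, not the authors' implementation.

```python
def db_tracker_step(tracks_prev, detections, predict, associate, update):
    """One frame of the tracking loop described above (illustrative skeleton only).

    tracks_prev : tracking results X_{k-1} from the previous frame
    detections  : detection box set D_k of the current frame
    predict     : callable implementing the Box-MeMBer prediction (Section 3.1.1)
    associate   : callable building the association matrix and returning
                  (surviving pairs T_S, unmatched tracks, unmatched detections)
    update      : callable implementing the Box-MeMBer update plus merging/pruning
    """
    # 1. Predict every track's bounding box in frame k
    predicted = predict(tracks_prev)

    # 2. Associate predictions with detections (IoU + appearance, Sections 3.2-3.4)
    T_S, unmatched_tracks, unmatched_dets = associate(predicted, detections)

    # 3. Unmatched tracks are either missed (T_L) or finished (T_C); unmatched
    #    detections become newborn objects (gamma). Secondary matching resolves
    #    the T_L / T_C split before the update.
    gamma = unmatched_dets

    # 4. Box-MeMBer update, then merge, prune, and delete erroneous objects
    return update(T_S, unmatched_tracks, gamma)
```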

3.1. Box-MeMBer

The current frame’s object detection box set is $D_k = \{d_k^i\}_{i=1}^{N_k}$, where $d_k^i = (x_k^i, y_k^i, w_k^i, h_k^i, c_k^i)$ is the state vector of the $i$-th detection box; $x_k^i, y_k^i, w_k^i, h_k^i, c_k^i$ denote the horizontal and vertical coordinates of the box center, the width, the height, and the confidence of the detection box; and $N_k$ is the number of object detection boxes in the current frame. At the same time, the objects in the frame have a certain spatial extent, and their influence zones may be independent of one another or overlap. The frame detection model is therefore formulated as follows.
$$D_k = \gamma(S_k) + w_k \qquad (1)$$
where $w_k = (w_{k,1}, \ldots, w_{k,M})^T$ denotes detection noise with a Gaussian distribution, i.e., $w_k \sim \mathcal{N}(0, r)$, and $M$ denotes the number of pixels in the frame. $X_k$ represents the object state set at time $k$ as a random finite set. $h_k(s) = (h_{k,1}(s), \ldots, h_{k,M}(s))^T$ represents the spatial distribution of the measurement for an object image with state $s$, whereas $\gamma(S_k) = \sum_{s \in S_k} h_k(s)$ represents the sum of the frame measurements contributed by all objects in the frame.
The detection likelihood function is as follows:
$$L(D_k \mid S_k) = \mathcal{N}\!\left(D_k;\ \sum_{s \in S_k} h_k(s),\ r\right) = \mathcal{N}(D_k;\ 0,\ r)\prod_{s \in S_k} e_D(s) \qquad (2)$$
where:
$$e_D(s) = \exp\!\left(h_k^T(s)\, r^{-1}\left(D_k - h_k(s)/2 - q\right)\right) \qquad (3)$$
$$q = \sum_{t \in S_k,\ t \neq s} h_k(t)/2 \qquad (4)$$
To obtain the posterior distribution of the updated object set, $e_D$ must be estimated. Equation (3) shows that in $e_D(s)$, the quantities $h_k(s)$, $r^{-1}$, and $D_k$ are known, but $q$ is unknown. From Equation (4), $q$ is a vector, and without loss of generality we can approximate $P(q)$ by a multivariate Gaussian distribution with mean $\mu_0$ and covariance $\Sigma_0$. Because the object state set at time $k$ is unknown, the predicted object state set from time $k-1$ is used to estimate these parameters, giving the following estimation equations:
$$\mu_0 = \frac{1}{2}\sum_{j=1,\, j \neq i}^{N_{k|k-1}} r_j\, b_j \qquad (5)$$
$$\Sigma_0 = \frac{1}{4}\sum_{j=1,\, j \neq i}^{N_{k|k-1}} \left(r_j\, v_j - r_j^2\, b_j\, b_j^T\right) \qquad (6)$$
where $i$ is the label of the multi-Bernoulli component to which the current object $s$ belongs, $b_j = \langle p_{k|k-1}^j,\, h_k \rangle$, and $v_j = \langle p_{k|k-1}^j,\, h_k h_k^T \rangle$.
Taking $E_0[e_D]$ as the expected value of $e_D$ under $P(q) \approx \mathcal{N}(\mu_0, \Sigma_0)$, the following result is obtained:
$$\hat{e}_D = \exp\!\left(h_k^T(s)\, r^{-1}\left(D_k - h_k(s)/2 - \mu_0\right) + \tfrac{1}{2}\, h_k^T(s)\, r^{-1}\, \Sigma_0 \left(h_k^T(s)\, r^{-1}\right)^T\right) \qquad (7)$$
Equation (7) gives the estimated expression of $e_D$ in the presence of multi-object superposition, demonstrating that the object state can be obtained through prediction and that the influence of multi-object superposition can be eliminated, allowing it to be used for object state measurement updates.
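As a concrete illustration of Equations (5)-(7), the following numpy sketch computes the Gaussian approximation of $q$ and the resulting estimate $\hat{e}_D$. The inputs ($r_j$, $b_j$, $v_j$, $r^{-1}$) are assumed to be supplied by the prediction step; this is a sketch of the reconstructed formulas, not the authors' implementation.

```python
import numpy as np

def estimate_e_d(h_s, D_k, r_inv, r_probs, b_list, v_list, exclude):
    """Approximate e_D for one object via Equations (5)-(7) (illustrative sketch).

    h_s     : (M,)   measurement contribution h_k(s) of the current object
    D_k     : (M,)   superimposed frame measurement
    r_inv   : (M, M) inverse of the noise covariance r
    r_probs : existence probabilities r_j of the multi-Bernoulli components
    b_list  : b_j = <p_j, h_k>        (each an (M,) vector)
    v_list  : v_j = <p_j, h_k h_k^T>  (each an (M, M) matrix)
    exclude : index i of the component the current object belongs to
    """
    M = len(D_k)
    mu0 = np.zeros(M)
    sigma0 = np.zeros((M, M))
    for j, (r, b, v) in enumerate(zip(r_probs, b_list, v_list)):
        if j == exclude:
            continue
        mu0 += 0.5 * r * b                                     # Equation (5)
        sigma0 += 0.25 * (r * v - (r ** 2) * np.outer(b, b))   # Equation (6)

    a = h_s @ r_inv                                            # h_k^T(s) r^{-1}
    # Equation (7): expected likelihood under q ~ N(mu0, sigma0)
    return np.exp(a @ (D_k - h_s / 2 - mu0) + 0.5 * a @ sigma0 @ a)
```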
The filter has two steps: prediction and updating. Each multi-Bernoulli object component is labeled to improve estimation performance by extracting the object track and eliminating spurious objects. The fundamental steps are as follows.

3.1.1. Box-Bernoulli Prediction

We assume the object’s multi-Bernoulli distribution parameter set at time $k-1$ is $\pi_{k-1} = \{(r_{k-1}^{(i)}, p_{k-1}^{(i)}, L_{k-1}^{(i)})\}_{i=1}^{M_{k-1}}$, where $r^{(i)}$ is the existence probability of the $i$-th component, $p^{(i)}$ is the probability distribution of the $i$-th component, and $L^{(i)}$ is the label of the $i$-th component. Each label is a three-dimensional vector: the first dimension indicates the component’s starting point, the second dimension is the component’s serial number among all objects, and the third dimension reflects the size of the object corresponding to the component. The predicted multi-Bernoulli parameter set of the object distribution is:
$$\pi_{k|k-1} = \left\{\left(r_{P,k|k-1}^{(i)},\, p_{P,k|k-1}^{(i)},\, L_{P,k|k-1}^{(i)}\right)\right\}_{i=1}^{M_{k-1}} \cup \left\{\left(r_{\Gamma,k}^{(i)},\, p_{\Gamma,k}^{(i)},\, L_{\Gamma,k}^{(i)}\right)\right\}_{i=1}^{M_{\Gamma,k}} \qquad (8)$$
where $p_{P,k|k-1}^{(i)} = \dfrac{\langle f_{k|k-1}(s \mid \cdot)\, p_{k-1}^{(i)},\ p_{b,k}\rangle}{\langle p_{k-1}^{(i)},\ p_{b,k}\rangle}$, $r_{P,k|k-1}^{(i)} = r_{k-1}^{(i)}\,\langle p_{k-1}^{(i)},\ p_{b,k}\rangle$, and $L_{P,k|k-1}^{(i)} = L_{k-1}^{(i)}$.
Here, $f_{k|k-1}(s \mid \cdot)$ denotes the single-object state transition probability density to state $s$, $p_{b,k}(s)$ denotes the probability that an object with state $s$ at time $k-1$ survives to time $k$, and $\{(r_{\Gamma,k}^{(i)}, p_{\Gamma,k}^{(i)}, L_{\Gamma,k}^{(i)})\}_{i=1}^{M_{\Gamma,k}}$ denotes the parameter set of the birth multi-Bernoulli components at time $k$.
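A minimal sketch of this prediction step for one labeled component follows, assuming the density $p_{k-1}^{(i)}$ is represented by weighted particles; the particle representation, function names, and signatures are assumptions made here only for illustration.

```python
import numpy as np

def predict_component(r_prev, particles, weights, transition, p_survive):
    """Box-Bernoulli prediction for one labeled component (illustrative sketch).

    r_prev     : existence probability r_{k-1}^{(i)}
    particles  : (N, d) array of state samples representing p_{k-1}^{(i)}
    weights    : (N,)   normalized particle weights
    transition : callable propagating particles through f_{k|k-1}(. | x)
    p_survive  : callable giving the survival probability p_{b,k}(x) per particle
    """
    ps = p_survive(particles)                       # survival probability per particle
    r_pred = r_prev * np.sum(weights * ps)          # r_{P,k|k-1} = r_{k-1} <p_{k-1}, p_b>
    w_pred = weights * ps
    w_pred = w_pred / w_pred.sum()                  # renormalize the density
    x_pred = transition(particles)                  # propagate through the motion model
    return r_pred, x_pred, w_pred                   # label L is carried over unchanged
```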

3.1.2. Box-Bernoulli Update

Given the predicted object multi-Bernoulli distribution parameter set $\pi_{k|k-1} = \{(r_{k|k-1}^{(i)}, p_{k|k-1}^{(i)}, L_{k|k-1}^{(i)})\}_{i=1}^{M_{k|k-1}}$, the updated object multi-Bernoulli distribution parameter set is as follows:
$$\pi_k = \left\{\left(r_k^{(i)},\, p_k^{(i)},\, L_k^{(i)}\right)\right\}_{i=1}^{M_{k|k-1}} \qquad (9)$$
where $r_k^{(i)} = \dfrac{r_{k|k-1}^{(i)}\,\langle p_{k|k-1}^{(i)},\, e_D\rangle}{1 - r_{k|k-1}^{(i)} + r_{k|k-1}^{(i)}\,\langle p_{k|k-1}^{(i)},\, e_D\rangle}$, $p_k^{(i)} = \dfrac{p_{k|k-1}^{(i)}\, e_D}{\langle p_{k|k-1}^{(i)},\, e_D\rangle}$, and $L_k^{(i)} = L_{k|k-1}^{(i)}$. $e_D$ is estimated by Equation (7).
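The corresponding update step, under the same illustrative particle representation assumed above, could look as follows; `e_d` holds the estimated likelihood of Equation (7) evaluated per particle.

```python
import numpy as np

def update_component(r_pred, particles, weights, e_d):
    """Box-Bernoulli update for one component (illustrative sketch; particle form).

    r_pred    : predicted existence probability r_{k|k-1}^{(i)}
    particles : (N, d) predicted state samples representing p_{k|k-1}^{(i)}
    weights   : (N,)   normalized particle weights
    e_d       : (N,)   estimated likelihood e_D evaluated at each particle (Eq. (7))
    """
    lik = np.sum(weights * e_d)                           # <p_{k|k-1}, e_D>
    r_upd = r_pred * lik / (1.0 - r_pred + r_pred * lik)  # existence probability update
    w_upd = weights * e_d
    w_upd = w_upd / w_upd.sum()                           # p_k proportional to p_{k|k-1} e_D
    return r_upd, particles, w_upd                        # label is carried over unchanged
```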

3.1.3. Object Trajectory Information Extraction

We extract the state parameters and trajectory labels of multi-Bernoulli components that exist independently (i.e., a single object). For multiple multi-Bernoulli components fused into one object (i.e., several components representing the same object), we combine the position information and probability information of the related components to produce a single object state. The position information is used to compute the state parameters, and further feature extraction and analysis can be performed as needed. To ensure unique identification, the label of the first component is used as the track label. If no object with the same label appears in the preceding or following frames, indicating that the object is not continuously observed, it can be considered a false alarm and removed, or marked as an unreliable object in the present frame. The predicted trajectory set obtained is as follows:
$$P_{k|k-1} = \left\{p_{k|k-1}^{\,i}\right\}_{i=1}^{N_k} \qquad (10)$$
where $p_{k|k-1}^{\,i} = (x_k^i, y_k^i, w_k^i, h_k^i, ID_k^i)$ is the prediction box, comprising the predicted box’s center coordinates, width and height, and track number.
We can thus efficiently distinguish separate objects from multiple components fused into one object and extract their state parameters and trajectory labels, which ensures the accuracy and reliability of the multi-object trajectory prediction.

3.2. MB-OSNet

A simple convolutional neural network can learn the global characteristics of the object region of interest and extract a discriminative feature representation of the object, but it cannot discern intra-class differences. Global features help the network learn contour information over a larger field of view, whereas part-level features carry more fine-grained information. At the same time, partitioning the frame excessively is not recommended, since insufficient local information may cause frame information loss and diminish accuracy. Partial division therefore produces fine-grained yet restricted information: global cues alone are adequate in coverage but inadequate in local detail. OSNet is used as the backbone network to gather comprehensive, coordinated picture information, a multi-branch feature-collaboration structure is added, and a multi-branch OSNet (MB-OSNet) is developed to preserve both global and fine-grained picture information. Figure 2 depicts the network structure.
Figure 2. MB-OSNet Network Structure.
OSNet can learn a full-scale feature representation of the object and has the following main components. First, full-scale residual blocks and a unified aggregation gate are introduced by decomposing the convolutional layers. As illustrated in Figure 3b, depthwise separable convolution is used to decompose the 3 × 3 convolution, creating Lite 3 × 3 layers. Different convolution streams have different receptive fields, and by stacking Lite 3 × 3 layers to construct the bottleneck, as shown in Figure 3c, a wide range of scales can be captured. Furthermore, a parameter t denoting the feature scale expands the residual function through multiple Lite 3 × 3 layers in order to learn multi-scale features. Through short connections, the learned small-scale features are efficiently retained in the following layers, capturing the entire range of spatial scales. A learnable aggregation gate (AG) dynamically combines the outputs of the convolution streams at different scales to produce full-scale features. Multiple convolution streams share the AG, so the number of parameters is independent of the number of streams, making the model scalable. Finally, as illustrated in Figure 3d, the full OSNet network is built by stacking lightweight bottlenecks layer by layer, allowing a flexible balance of model size, computational cost, and performance.
Figure 3. OSNet network structure and its components. (a) Standard 3 × 3 convolution. (b) Lite 3 × 3 convolution. (c) Bottleneck. (d) OSNet network.
Two branch structures are developed for the multi-branch structure. The first is the local branch. In this branch, the feature map is partitioned into four horizontal stripes, and average pooling is applied to produce 1 × 1 × C local features. Note that the four local features are concatenated into a column vector, and the resulting concatenated feature is
$$f = \left[f_1^T,\, f_2^T,\, f_3^T,\, f_4^T\right]^T \qquad (11)$$
where $f_i^T$ denotes the $i$-th of the four column vectors obtained from the horizontally partitioned feature map.
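A minimal numpy sketch of the local branch, assuming a (C, H, W) feature map and four horizontal stripes as described above (a sketch, not the authors' implementation):

```python
import numpy as np

def local_branch(feature_map, n_stripes=4):
    """Local branch of MB-OSNet (illustrative numpy sketch).

    feature_map : (C, H, W) array, the OSNet output feature map
    Returns the concatenated column vector f = [f_1^T, ..., f_4^T]^T of Eq. (11),
    where each f_i is the average-pooled (1 x 1 x C) feature of one horizontal stripe.
    """
    stripes = np.array_split(feature_map, n_stripes, axis=1)  # split along the height axis
    parts = [s.mean(axis=(1, 2)) for s in stripes]            # average pool per stripe -> (C,)
    return np.concatenate(parts)                              # (n_stripes * C,) column vector
```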
The second branch is the global branch. In contrast to the local branch, GeM pooling is applied directly after the OSNet convolutional layers. To obtain the feature vectors, the initialization parameter $p_k$ is set to 2.8:
$$\operatorname{GeM}(f_k) = \operatorname{GeM}\left([f_1, \ldots, f_n]\right) = \left[\frac{1}{n}\sum_{i=1}^{n} (f_i)^{p_k}\right]^{\frac{1}{p_k}} \qquad (12)$$
where $f_k$ is an individual feature map with activations $f_i$. GeM corresponds to max pooling when $p_k \to \infty$ and to average pooling when $p_k = 1$. The MB-OSNet appearance feature extraction network takes the detected object region as input and produces a feature vector representation. A gallery set is created for each object to store its appearance feature vectors from different frames; at most the 100 frames preceding the current moment are stored for each object.
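A sketch of GeM pooling (Equation (12)) with $p_k = 2.8$, together with a simple per-object appearance gallery capped at 100 frames; the deque-based gallery is an illustrative assumption of how the feature store could be kept.

```python
import numpy as np
from collections import defaultdict, deque

def gem_pool(feature_map, p=2.8, eps=1e-6):
    """Generalized-mean (GeM) pooling over one feature map, Eq. (12) (sketch).

    feature_map : (C, H, W) non-negative activations
    p           : pooling parameter; p -> inf approaches max pooling, p = 1 is average pooling
    """
    x = np.clip(feature_map, eps, None)                # clamp for numerical stability
    return (x ** p).mean(axis=(1, 2)) ** (1.0 / p)     # one pooled value per channel -> (C,)

# Per-object appearance gallery keeping at most the last 100 frames (assumed structure):
gallery = defaultdict(lambda: deque(maxlen=100))
# gallery[track_id].append(gem_pool(feature_map))      # called once per frame per object
```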

3.3. Object Association

We designed the association method used between the object trajectories and the detection boxes. The MB-OSNet network computes a feature similarity score that measures the similarity between two object boxes, an association matrix is built, and the Euclidean distance between trajectory and object is minimized. Minimizing this distance reduces the number of label jumps, enhancing tracking accuracy and making the tracking boxes more precise.
The object trajectory state information $P_{k|k-1} = \{p_{k|k-1}^j\}_{j=1}^{M_k}$ predicted by the $k$-th frame Box-MeMBer is computed as given in Equation (10). Together with the detection set $D_k = \{d_k^i\}_{i=1}^{N_k}$, the object correlation matrix algorithm yields the associated surviving object set, newborn object set, and dead object set, which includes missing objects.
The Box-MeMBer-predicted state components $P_{k|k-1} = \{p_{k|k-1}^1, p_{k|k-1}^2, \ldots, p_{k|k-1}^{M_k}\}$ must be associated and matched with this frame’s detection set $D_k = \{d_k^i\}_{i=1}^{N_k}$, and the objects are divided into a surviving object set $T_S$, a new object set $R$ plus clutter $K$, a missed object set $T_L$, and an end-of-tracking object set $T_C$, where $M_k$ is the number of predicted components and $N_k$ is the number of detection boxes in this frame.
$$A = \begin{bmatrix} a_{11} & \cdots & a_{1 M_k} \\ \vdots & \ddots & \vdots \\ a_{N_k 1} & \cdots & a_{N_k M_k} \end{bmatrix} \qquad (13)$$
$$a_{ij} = \frac{\operatorname{Area}(z_i) \cap \operatorname{Area}(x_{k|k-1}^j)}{\operatorname{Area}(z_i) \cup \operatorname{Area}(x_{k|k-1}^j)} \qquad (14)$$
where $a_{ij}$ denotes the intersection-over-union of the $i$-th detection box and the $j$-th Gaussian component. Each detection box $d_i$ is compared once with each Gaussian component $x_{k|k-1}^j$. If the computed value exceeds the threshold $T_{iou}$, the pair is judged to be the same object and stored in the surviving object set $T_S$; otherwise, it is treated as a separate object.
If two or more Gaussian components exceed the threshold $T_{iou}$ for the same detection box, the one with the greatest intersection-over-union is taken as the final association result; if two values are equal, a feature similarity calculation is performed on the components. If no value in the $i$-th row exceeds the threshold $T_{iou}$, $d_i$ is regarded as a new object or clutter; if no value in the $j$-th column exceeds the threshold $T_{iou}$, $x_{k|k-1}^j$ is considered an end-of-tracking object or a missed object.
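A small sketch of the association matrix $A$ of Equations (13) and (14), assuming detections and Box-MeMBer predictions are given as (cx, cy, w, h) boxes in numpy arrays (an assumed box format, not prescribed by the paper):

```python
import numpy as np

def iou_matrix(detections, predictions):
    """Association matrix A of Eqs. (13)-(14): IoU of every detection/prediction pair.

    detections : (N, 4) array of detection boxes (cx, cy, w, h)
    predictions: (M, 4) array of predicted boxes (cx, cy, w, h)
    """
    def to_corners(b):  # (cx, cy, w, h) -> (x1, y1, x2, y2)
        return np.stack([b[:, 0] - b[:, 2] / 2, b[:, 1] - b[:, 3] / 2,
                         b[:, 0] + b[:, 2] / 2, b[:, 1] + b[:, 3] / 2], axis=1)

    d, p = to_corners(detections), to_corners(predictions)
    A = np.zeros((len(d), len(p)))
    for i, di in enumerate(d):
        x1 = np.maximum(di[0], p[:, 0]); y1 = np.maximum(di[1], p[:, 1])
        x2 = np.minimum(di[2], p[:, 2]); y2 = np.minimum(di[3], p[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_d = (di[2] - di[0]) * (di[3] - di[1])
        area_p = (p[:, 2] - p[:, 0]) * (p[:, 3] - p[:, 1])
        A[i] = inter / (area_d + area_p - inter + 1e-9)
    return A  # rows: detections, columns: predicted components, as in Eq. (13)
```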

3.4. Secondary Matching

There are some unmatched objects and trajectories in multi-object tracking, indicating that there is no successful object-trajectory link. Secondary matching can be used to uncover potential matching associations by examining the link between unmatched objects and trajectories. The motion features and appearance features are merged after thoroughly considering the object’s appearance features and positional relationship information.
At a given time, the appearance feature of the $j$-th detection box (denoted $r_j$) is obtained. The smallest cosine distance between $r_j$ and the appearance features $r_k^{(i)}$ stored in the gallery $G_i$ of track $i$ is then computed:
$$d_2(i, j) = \min\left\{\, 1 - r_j^T r_k^{(i)} \ \middle|\ r_k^{(i)} \in G_i \right\} \qquad (15)$$
Motion features and exterior aspects complement each other. Motion characteristics (obtained via the Mahalanobis distance computations) provide possible information on object localization in quadratic matching, which is extremely effective in short-term prediction. Appearance features (obtained from cosine distance computation) help recover an object’s ID number after it has been occluded for a long time, minimizing the frequency of ID switches.
A weighting procedure is conducted on the two features in order to integrate them
$$c(i, j) = \lambda\, d_1(i, j) + (1 - \lambda)\, d_2(i, j) \qquad (16)$$
where $d_1(i, j)$ represents the Mahalanobis distance, $d_2(i, j)$ represents the cosine distance, and $\lambda$ is the weighting factor. Only appearance features are used for matching and tracking when $\lambda = 0$.
Finally, a discriminative overall threshold determines whether an association match is established, with an association being admissible only if it falls within the gating region of both metrics:
$$b_{i,j} = \prod_{m=1}^{2} b_{i,j}^{(m)} \qquad (17)$$
We combine the two previously mentioned criteria (thresholds on the Mahalanobis distance and the cosine distance, respectively) to jointly appraise the object-trajectory relationship. If the secondary matching is successful, the matched object and trajectory pair is identified, and its state information, such as position, velocity, and object ID, is updated.
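A sketch of the combined cost and gating described above, using SciPy's Hungarian solver for the global assignment; the numeric gate values here are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def secondary_match(d1, d2, lam=0.5, gate1=9.4877, gate2=0.3):
    """Secondary matching with the weighted cost of Eq. (16) and gating of Eq. (17).

    d1, d2 : (T, D) Mahalanobis and cosine distance matrices (tracks x detections)
    lam    : weighting factor lambda; lam = 0 uses appearance only
    gate1, gate2 : gating thresholds for the two metrics (illustrative values)
    """
    cost = lam * d1 + (1.0 - lam) * d2                # Eq. (16)
    admissible = (d1 <= gate1) & (d2 <= gate2)        # Eq. (17): both gates must pass
    cost = np.where(admissible, cost, 1e6)            # forbid inadmissible pairs
    rows, cols = linear_sum_assignment(cost)          # global assignment
    return [(r, c) for r, c in zip(rows, cols) if admissible[r, c]]
```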
Secondary matching can improve the correlation between objects and trajectories, as well as the accuracy and stability of multi-object tracking. Secondary matching takes full use of the information between unmatched objects and trajectories, assisting in the discovery of more suitable matching results by re-evaluating their relationship and further optimizing the effectiveness of multi-object tracking.

4. Experiment

This section is divided by subheadings. It provides a concise and precise description of the experimental results, their interpretation, as well as the experimental conclusions that can be drawn.

4.1. Datasets and Evaluation Indicators

4.1.1. Datasets

We conducted a comprehensive evaluation of the object detectors and DB-Tracker on the VisDrone MOT [] and UAVDT [] datasets, comparing their performance with other excellent object detectors and multi-object trackers. Sample images of the VisDrone MOT and UAVDT datasets are shown in Figure 4 and Figure 5.
Figure 4. VisDrone dataset. Lines (a,c) represent the original frames, and lines (b,d) represent the corresponding labeled datasets, respectively.
Figure 5. UAVDT dataset. Lines (a,c) represent the original frames, and lines (b,d) represent the corresponding labeled datasets, respectively.

4.1.2. Evaluation Indicators

Two authoritative sets of MOT metrics are used to evaluate the performance of our MOT system: the identity-based metrics defined in [] and the CLEAR MOT metrics []. These metrics are designed to assess overall performance and to indicate potential shortcomings of each model. They are defined as follows:
(1) FP (↓): false positives in the entire video;
(2) FN (↓): false negatives in the entire video;
(3) IDSW (↓): ID switches in the entire video;
(4) FM (↓): number of times a ground-truth trajectory is interrupted (fragmented) during tracking;
(5) IDF1 (↑): ratio of correctly identified detections to the average number of ground-truth and computed detections;
(6) MOTA (↑): combines false positives, false negatives, and ID switches; the score is defined as
$$\text{MOTA} = 1 - \frac{FN + FP + IDSW}{GT}$$
(7) MOTP (↑): the mismatch between the ground truth and the predicted results, calculated as
$$\text{MOTP} = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t}$$
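For concreteness, MOTA and MOTP can be computed from sequence-level counts as in this small sketch (the example numbers in the comment are made up):

```python
def mota(fn, fp, idsw, gt):
    """MOTA from counts accumulated over the whole sequence (higher is better)."""
    return 1.0 - (fn + fp + idsw) / gt

def motp(total_distance, total_matches):
    """MOTP: total localization error of matched pairs divided by number of matches.

    total_distance : sum over frames t and matched objects i of d_{t,i}
    total_matches  : sum over frames t of the match count c_t
    """
    return total_distance / total_matches

# Example with made-up counts (illustrative only):
# mota(fn=25000, fp=9000, idsw=1200, gt=100000) -> 0.648
```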

4.2. Experimental Results

4.2.1. Object Detection

The model is initialized with weights obtained by pretraining the detector on the COCO dataset. The detector is trained using SGD with the following parameters: 150 epochs, batch size 16, learning rate 0.02, momentum 0.9, and weight decay 0.0001. The experiment is trained on the VisDrone and UAVDT datasets and verified on the corresponding validation-set images. The test is run on an NVIDIA RTX 3090 with 24 GB of memory, and the top 100 most reliable detection results are averaged. To achieve better average accuracy, we added an attention mechanism to YOLO-V8 to better identify small targets. The improved YOLO-V8 achieved 53.4% mAP on the VisDrone dataset and 71.1% mAP on the UAVDT dataset. Figure 6 and Figure 7 depict the visual results, and Figure 8 compares the training of the improved YOLO-V8 (with the attention mechanism) against the original YOLO-V8.
Figure 6. Detection results of improved YOLO-V8 on the VisDrone dataset.
Figure 7. Detection results of improved YOLO-V8 on the UAVDT dataset.
Figure 8. Comparison of YOLOV8 training with the original YOLOV8 after adding the attention mechanism. (a) the precision–confidence curve of original YOLOV8 training; (b) the precision–recall curve of original YOLOV8 training; (c) the recall–confidence curve of original YOLOV8 training; (d) the precision–confidence curve of improved YOLOV8 training; (e) the precision–recall curve of improved YOLOV8 training; and (f) the recall–confidence curve of improved YOLOV8 training.

4.2.2. Multi-Object Tracking

We compared the DB-Tracker with DeepSORT [], ByteTrack [], BoT-SORT [], UAVMOT [], Deep OC-SORT [], Strong SORT [], and SimpleTrack []. Due to the nonuniform distribution of object entities per class in the training set, detection models behave differently across classes. To this end, all tracking comparison methods use the same detections produced by the improved YOLO-V8 detector and, following [], threshold cars at 0.3, buses at 0.05, and trucks at 0.1; the threshold is set at 0.4 for pedestrians and 0.05 for vans, as sketched below.
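The per-class confidence filtering described above can be sketched as follows; the class-name keys and detection dictionary format are assumptions made here for illustration.

```python
# Per-class confidence thresholds used when filtering detector output (from the text above).
CLASS_THRESHOLDS = {
    "car": 0.3,
    "bus": 0.05,
    "truck": 0.1,
    "pedestrian": 0.4,
    "van": 0.05,
}

def filter_detections(detections, thresholds=CLASS_THRESHOLDS, default=0.3):
    """Keep only detections whose confidence exceeds the per-class threshold.

    detections : iterable of dicts like {"cls": "car", "conf": 0.42, "box": (...)}
    """
    return [d for d in detections
            if d["conf"] >= thresholds.get(d["cls"], default)]
```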
Table 1 and Table 2 present a comprehensive comparison between the DB-Tracker and other popular trackers evaluated on the VisDrone MOT and UAVDT datasets. The evaluation includes key metrics such as MOTA, MOTP, IDF1, and IDSW, as well as comparisons with other methods. The DB-Tracker excels by effectively utilizing both position and appearance information, leading to superior performance. In contrast, Deep SORT extends its framework to handle multiple classes by independently associating each class based on position information. ByteTrack utilizes low-score detections for similarity tracking and background noise filtering. Deep OC-SORT introduces camera motion compensation and adaptive weighting, while BoT-SORT incorporates camera motion compensation. UAVMOT enhances object feature association through an ID feature update module. StrongSORT models nonlinear motion based on Gaussian process regression. SimpleTrack presents a novel incidence matrix by merging the embedded cosine distance and GIOU distance of the objects. This superiority stems from its effective fusion of position and appearance information, resulting in enhanced tracking performance across various evaluation measures.
Table 1. Comparison between DB-Tracker and the latest multiple trackers tested on the VisDrone dataset.
Table 2. Comparison between DB-Tracker and the latest multiple trackers tested on the UAVDT dataset.
However, Table 1 and Table 2 also reveal problems that limit the performance of these existing methods on multi-target tracking in drone aerial videos. DeepSORT integrates deep learning with the SORT algorithm, but its high computing resource requirements make it less efficient in large-scale scenarios. ByteTrack is less robust in some complex situations, especially under target occlusion and appearance changes. BoT-SORT integrates a variety of tracking modules, but its complex model structure requires more computing resources. UAVMOT is optimized for multi-target tracking in UAV aerial video, yet it still needs further improvement in some complex scenes. Deep OC-SORT combines appearance and motion information, but its computational overhead is relatively large. Strong SORT has a certain degree of robustness but is less efficient in large-scale drone aerial video scenes. SimpleTrack is simple and efficient, but its robustness in complex scenarios is limited. Compared with these methods, the DB-Tracker algorithm combines detection and tracking and comprehensively considers the location and appearance information of the target. Its robustness enables it to handle challenges such as complex scenes and target occlusion, and it efficiently handles the diverse scenarios of drone aerial videos, making it a promising solution in this field.
Figure 9 and Figure 10 show chronological frames with bounding boxes and color-coded identities. Using object motion information, our trajectory scoring technique effectively handles missed and incorrect detections caused by occlusion, particularly for short-term overlapping objects. In contrast to previous algorithms based on bounding-box intersections, our algorithm reduces pedestrian identity swaps. The results show that the proposed method performs well in crowded scenes with numerous objects and keeps the bounding boxes and identities consistent throughout the sequence.
Figure 9. Output results on the VisDrone test set. When enlarging and displaying the overlapping part of the moving object in the figure, it can be seen that the object can still be tracked continuously, and the ID of the object has not changed.
Figure 10. Output results on the UAVDT test set.

4.2.3. Ablation Experiment

An in-depth analysis of the DB-Tracker method is conducted by evaluating the impact of each component on the VisDrone dataset. The evaluation includes metrics such as MOTA, MOTP, and IDSW, and the results are presented in Table 3. The baseline is IoU- and detection-based multi-object tracking. Adding Box-MeMBer increases MOTA by 14.1% and MOTP by 16.9% and reduces IDSW by 3235. Using MB-OSNet to extract the target’s appearance features boosts MOTA by a further 5% while decreasing the unmatched rate. Furthermore, secondary matching improves MOTA and MOTP by 3.5% and 2.3%, respectively. Utilizing all components, DB-Tracker achieves an excellent 37.4% MOTA on the VisDrone MOT dataset, outperforming the compared approaches.
Table 3. Testing the impact of various components of DB-Tracker on tracking performance on the VisDrone dataset.

4.2.4. Self-Collected Dataset

We use DJI drone data for testing and evaluation to verify the effectiveness and performance of the proposed multi-object tracking system. The data were gathered by flying DJI drones equipped with cameras in real scenarios, covering a wide range of object types, motion modes, and scene conditions. The resolution is 1920 × 1080 and the frame rate is 25 fps. A sample of the dataset is shown in the figure. To prepare the dataset for evaluating multi-object tracking algorithms, the data were preprocessed and organized using appropriate tools, including video frame extraction, object labeling, and trajectory labeling. The data are annotated in the VisDrone dataset format and used as a validation set to test the effectiveness of DB-Tracker.
DB-Tracker is used throughout the experiment to process and analyze the collected data. Figure 11 depicts the experimental outcomes. Using this approach, we can track several objects and collect information on their position, velocity, and trajectory. Through trials and evaluations with DJI Mini 3 drone data, we validated the usefulness of the proposed multi-object tracking approach in real-world scenarios. These data are valuable resources that allow us to better understand the algorithm’s performance and application scenarios, and they provide a foundation for future research and implementation.
Figure 11. Multi-object tracking results on self-collected data. (a,b) demonstrate the effectiveness of our algorithm on the self-collected data. The video is available at https://github.com/YubinYuan/DB-Tracker, accessed on 23 September 2023.

5. Conclusions

Our research integrates the advantages of detection-based visual multi-object tracking algorithms with those of RFS-based visual multi-object tracking algorithms in order to improve the performance of drone aerial video multi-object tracking. In addition, by modeling object motion information, a more comprehensive, robust, and efficient integrated multi-object tracking algorithm is proposed. Our method produced a number of new results in drone aerial multi-object tracking: DB-Tracker achieved MOTA of 37.4% and 46.2% on the VisDrone and UAVDT datasets, respectively, and it offers a practical solution to the multi-object tracking problem in real-world scenarios. At the same time, our method improves significantly across a variety of datasets and challenging settings, contributing new ideas and methodologies to the field of unmanned aerial vehicle visual tracking. In the future, we will continue to enhance and optimize the algorithm to further improve multi-object tracking performance, and we are committed to extending the applicability of our research findings to a broader spectrum of application domains.

Author Contributions

Conceptualization, Y.Y.; methodology, Y.Y., Y.W. and Q.Z.; software, J.C. and Q.Z.; formal analysis, Y.W.; investigation, Y.W.; resources, J.C.; data curation, L.Z., J.C. and Q.Z.; writing—original draft, Y.Y.; writing—review and editing, Y.W. and L.Z.; visualization, Q.Z.; supervision, L.Z.; project administration, J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61573183.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Tan, L.; Huang, X.; Lv, X.; Jiang, X.; Liu, H. Strong interference UAV motion target tracking based on target consistency algorithm. Electronics 2023, 12, 1773. [Google Scholar] [CrossRef]
  2. Fan, H.; Du, D.; Wen, L. Visdrone-mot2020: The vision meets drone multiple object tracking challenge results. In Proceedings of the Computer Vision–ECCV 2020 Workshops, Glasgow, UK, 23–28 August 2020; Proceedings, Part IV 16. Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 713–727. [Google Scholar]
  3. Wu, X.; Li, W.; Hong, D.; Tao, R.; Du, Q. Deep learning for unmanned aerial vehicle-based object detection and tracking: A survey. IEEE Geosci. Remote Sens. Mag. 2021, 10, 91–124. [Google Scholar] [CrossRef]
  4. Lin, Y.; Wang, M.; Chen, W.; Gao, W.; Li, L.; Liu, Y. Multiple object tracking of drone videos by a temporal-association network with separated-tasks structure. Remote Sens. 2022, 14, 3862. [Google Scholar] [CrossRef]
  5. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7380–7399. [Google Scholar] [CrossRef] [PubMed]
  6. Cheng, S.; Yao, M.; Xiao, X. DC-MOT: Motion deblurring and compensation for multi-object tracking in UAV videos. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 789–795. [Google Scholar]
  7. Xu, X.; Feng, Z.; Cao, C.; Yu, C.; Li, M.; Wu, Z.; Ye, S.; Shang, Y. STN-Track: Multiobject tracking of unmanned aerial vehicles by swin transformer neck and new data association method. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 8734–8743. [Google Scholar] [CrossRef]
  8. Liang, Z.; Wang, J.; Xiao, G.; Zeng, L. FAANet: Feature-aligned attention network for real-time multiple object tracking in UAV videos. Chin. Opt. Lett. 2022, 20, 081101. [Google Scholar] [CrossRef]
  9. Ariza-Sentís, M.; Baja, H.; Vélez, S.; Valente, J. Object detection and tracking on UAV RGB videos for early extraction of grape phenotypic traits. Comput. Electron. Agric. 2023, 211, 108051. [Google Scholar] [CrossRef]
  10. García-Fernández, Á.F.; Xiao, J. Trajectory poisson multi-bernoulli mixture filter for traffic monitoring using a drone. IEEE Trans. Veh. Technol. 2023, 2023, 1–12. [Google Scholar] [CrossRef]
  11. Al-Shakarji, N.M.; Bunyak, F.; Seetharaman, G.; Palaniappan, K. Multi-object tracking cascade with multi-step data association and occlusion handling. In Proceedings of the 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, 27–30 November 2018; pp. 1–6. [Google Scholar]
  12. Wang, J.; Simeonova, S.; Shahbazi, M. Orientation-and scale-invariant multi-vehicle detection and tracking from unmanned aerial videos. Remote Sens. 2019, 11, 2155. [Google Scholar] [CrossRef]
  13. Yu, H.; Li, G.; Zhang, W.; Yao, H.; Huang, Q. Self-balance motion and appearance model for multi-object tracking in UAV. In Proceedings of the 2019 ACM Multimedia Asia (MMAsia), Beijing, China, 15–18 December 2019; pp. 1–6. [Google Scholar]
  14. Dike, H.U.; Zhou, Y. A robust quadruplet and faster region-based CNN for UAV video-based multiple object tracking in crowded environment. Electronics 2021, 10, 795. [Google Scholar] [CrossRef]
  15. Zhang, H.; Wang, G.; Lei, Z.; Hwang, J.N. Eye in the sky: Drone-based object tracking and 3d localization. In Proceedings of the 2019 27th ACM International Conference on Multimedia (MM), Nice, France, 21–25 October 2019; pp. 899–907. [Google Scholar]
  16. He, Y.; Fu, C.; Lin, F.; Li, Y.; Lu, P. Towards robust visual tracking for unmanned aerial vehicle with tri-attentional correlation filters. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 1575–1582. [Google Scholar]
  17. Stadler, D.; Sommer, L.W.; Beyerer, J. Pas tracker: Position-, appearance-and size-aware multi-object tracking in drone videos. In Proceedings of the 2020 European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 604–620. [Google Scholar]
  18. Huang, W.; Zhou, X.; Dong, M.; Xu, H. Multiple objects tracking in the UAV system based on hierarchical deep high-resolution network. Multimed. Tools Appl. 2021, 80, 13911–13929. [Google Scholar] [CrossRef]
  19. Kapania, S.; Saini, D.; Goyal, S.; Thakur, N.; Jain, R.; Nagrath, P. Multi object tracking with UAVs using deep SORT and YOLO V3 RetinaNet detection framework. In Proceedings of the 2020 1st ACM Workshop on Autonomous and Intelligent Mobile Systems (AIMS), Bangalore, India, 11–22 January 2020; pp. 1–6. [Google Scholar]
  20. Emiyah, C.; Nyarko, K.; Chavis, C.; Bhuyan, I. Extracting vehicle track information from unstabilized drone aerial videos using YOLO v4 common object detector and computer vision. In Proceedings of the 2021 Future Technologies Conference (FTC), Vancouver, BC, Canada, 28–29 October 2021; pp. 232–239. [Google Scholar]
  21. Jadhav, A.; Mukherjee, P.; Kaushik, V.; Lall, B. Aerial multi-object tracking by detection using deep association networks. In Proceedings of the 2020 National Conference on Communications (NCC), Kharagpur, India, 21–23 February 2020; pp. 1–6. [Google Scholar]
  22. Avola, D.; Cinque, L.; Diko, A.; Fagioli, A.; Foresti, G.L.; Mecca, A.; Pannone, D.; Piciarelli, C. MS-Faster R-CNN: Multi-stream backbone for improved Faster R-CNN object detection and aerial tracking from UAV images. Remote Sens. 2021, 13, 1670. [Google Scholar] [CrossRef]
  23. Wu, Y.; Wang, Y.; Zhang, D.; Huang, Z.; Wang, B. Research on vehicle tracking method based on UAV video. In Proceedings of the 2022 International Conference on Internet of Things and Smart City (IOTSC), Xiamen, China, 18–20 February 2022; pp. 801–806. [Google Scholar]
  24. Wu, H.; Du, C.; Ji, Z.; Gao, M.; He, Z. SORT-YM: An algorithm of multi-object tracking with YOLO V4-tiny and motion prediction. Electronics 2021, 10, 2319. [Google Scholar] [CrossRef]
  25. Forti, N.; Millefiori, L.M.; Braca, P.; Willett, P. Random finite set tracking for anomaly detection in the presence of clutter. In Proceedings of the 2020 IEEE Radar Conference (RadarConf20), Florence, Italy, 21–25 September 2020. [Google Scholar]
  26. Jeong, H.M.; Lee, W.C.; Choi, H.L. Random finite set based safe landing zone detection and tracking. In Proceedings of the 2022 13th Asian Control Conference (ASCC), Jeju, Republic of Korea, 4–7 May 2022. [Google Scholar] [CrossRef]
  27. Chen, L.J. Multi-target tracking with dependent likelihood structures in labeled random finite set filters. In Proceedings of the 2021 IEEE 24th International Conference on Information Fusion (FUSION), Sun City, South Africa, 1–4 November 2021. [Google Scholar]
  28. LeGrand, K.; Zhu, P.; Ferrari, S. A random finite set sensor control approach for vision-based multi-object search-while-tracking. In Proceedings of the 2021 IEEE 24th International Conference on Information Fusion (FUSION), Sun City, South Africa, 1–4 November 2021. [Google Scholar] [CrossRef]
  29. Pang, S.; Morris, D.; Radha, H. 3D multi-object tracking using random finite set-based multiple measurement models filtering (rfs-m3) for autonomous vehicles. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13701–13707. [Google Scholar]
  30. Zhu, P.; Wen, L.; Du, D. Vision meets drones: Past, present and future. arXiv 2020, arXiv:1804.07437. [Google Scholar] [CrossRef]
  31. Du, D.; Qi, Y.; Yu, H. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386. [Google Scholar]
  32. Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the 2016 European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016; pp. 17–35. [Google Scholar]
  33. Bernardin, K.; Stiefelhagen, R. Evaluating multiple object tracking performance: The clear mot metrics. EURASIP J. Image Video Process. 2008, 2008, 246309. [Google Scholar] [CrossRef]
  34. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
  35. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-object tracking by associating every detection box. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 1–21. [Google Scholar]
  36. Aharon, N.; Orfaig, R.; Bobrovsky, B.Z. BoT-SORT: Robust associations multi-pedestrian tracking. arXiv 2022, arXiv:2206.14651. [Google Scholar] [CrossRef]
  37. Liu, S.; Li, X.; Lu, H.; He, W. Multi-object tracking meets moving UAV. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8876–8885. [Google Scholar]
  38. Maggiolino, G.; Ahmad, A.; Cao, J.; Kitani, K. Deep OC-SORT: Multi-pedestrian tracking by adaptive re-identification. arXiv 2023, arXiv:2302.11813. [Google Scholar] [CrossRef]
  39. Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.Y.; Su, F.; Gong, T.; Meng, H. Strong SORT: Make DeepSORT great again. IEEE Trans. Multimed. 2023, 2023, 1–14. [Google Scholar] [CrossRef]
  40. Li, J.; Ding, Y.; Wei, H.L. Simple Track: Rethinking and improving the JDE approach for multi-object tracking. Sensors 2022, 22, 5863. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
