Drones | Article | Open Access | 27 September 2023
DB-Tracker: Multi-Object Tracking for Drone Aerial Video Based on Box-MeMBer and MB-OSNet

College of Electronic and Information Engineering, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
* Author to whom correspondence should be addressed.

Abstract

Drone aerial videos hold great promise for modern digital media and remote sensing applications, but effectively tracking multiple objects in these recordings is difficult. Aerial footage typically contains complicated scenes with moving objects such as people, vehicles, and animals, and challenging situations such as large-scale viewpoint shifts and object crossings may occur simultaneously. We incorporate random finite sets into a detection-based tracking framework that takes both the object’s location and its appearance into account. The method retains the detection box information of detected objects and builds the Box-MeMBer object position prediction framework on top of MeMBer random-finite-set point-object tracking. We develop a hierarchical connection structure within the OSNet network, yielding MB-OSNet, to obtain object appearance information; by connecting feature maps of different levels through the hierarchy, the network acquires rich semantic information at different scales. Similarity measurements for all detections and trajectories are collected in a cost matrix that estimates the likelihood of every possible match, with entries that compare tracks and detections in terms of both position and appearance. The DB-Tracker algorithm performs strongly in multi-object tracking of drone aerial videos, achieving MOTA of 37.4% and 46.2% on the VisDrone and UAVDT datasets, respectively. By jointly considering object position and appearance information, DB-Tracker achieves high robustness, especially in complex scenes and under target occlusion, making it a powerful tool for challenging applications such as drone aerial video analysis.

1. Introduction

Drones have been rapidly adopted as electronic communication technology has advanced, owing to their mobility, ease of operation, and low cost, compensating for the information loss that traditional means suffer under weather and time constraints. At the same time, the high agility of drones, compared with fixed cameras, makes the coverage of aerial videos more flexible and diverse. Drone video data are particularly rich in content and temporal information, which has made drone aerial videos increasingly important in numerous object detection and tracking applications [,]. However, compared with multi-object recognition and tracking tasks in ordinary-view videos, drone aerial videos face several obstacles, such as frame degradation, uneven object distribution density, tiny object size, and real-time requirements, which have attracted considerable interest and research from both academia and industry in recent years [].
Many researchers have developed multi-object tracking (MOT) algorithms for drone aerial videos in recent years, the most common being tracking based on visual object detection, which has become the dominant approach for drone aerial video. The detection-based tracking process is typically divided into two parts: (1) a motion model and state estimation to predict the trajectory’s bounding box in subsequent frames and (2) linking new-frame detections with the current trajectories [,]. Two main approaches handle the association task: (1) object appearance models and re-identification (re-id) and (2) object localization, specifically the intersection-over-union ratio between predicted trajectory bounding boxes and detection bounding boxes []. Both approaches quantify associations with distances and treat association as a global assignment problem. However, because of detection uncertainty and mutual occlusion and overlap between objects, these approaches can run into problems in complicated scenarios [,]. Moreover, relying on re-id to tie everything together implies that MOT rests mostly on detection and re-id information: most performance improvements come from better detection and re-id, with motion information serving only as an aid. Yet motion information frequently provides significant “compensation” for trajectory fragmentation caused by occlusion or missed detections by the detector, since the expected path is both continuous and gradual [].
Random finite sets (RFS)-based multi-object tracking is widely used to resolve the aforementioned difficulties []. The RFS method models the object state and observation using a probabilistic framework, and it uses motion information in multi-object tracking to deal with complex situations such as uncertainty and object occlusion, improving the robustness and accuracy of multi-object tracking. Objects may appear and disappear in drone aerial videos, be concealed by other objects, or vary radically in look and motion. The RFS technique effectively solves such issues by predicting and incorporating the uncertainty associated with the object’s presence into the tracking process. This probabilistic modeling technique not only enhances object association accuracy but also allows for an estimate of uncertainty in object state predictions, which is useful for decision-making in dynamic contexts.
Without relying on a one-to-one relationship between detections and objects, RFS-based algorithms efficiently address the data association problem. They employ Bayesian filtering techniques, such as probabilistic hypothesis density filters and multi-Bernoulli filters, to provide efficient and scalable results. This trait is particularly important in drone aerial videos where the number of objects varies over time and the relationship between detections and objects can be confusing due to crossing or overlapping objects. The RFS approach is ideal for demanding and dynamic contexts because it can adapt to varied object counts and manage scenarios with complicated object interactions.
We model the motion information of drone aerial multi-object tracking as a random finite set, a modeling approach that is closer to the reality of multi-object motion. The detection box information of detected objects is retained using MeMBer random-finite-set point-object tracking, the Box-MeMBer object position prediction framework is built on top of it, and the effectiveness of our proposed technique is validated experimentally.
The following are our primary contributions:
  • Multi-object tracking framework DB-Tracker based on detection and RFS: To achieve the combination of location information and appearance features, we utilize the label generalized multi-Bernoulli filter to collect the object’s position information and MB-OSNet to extract the object’s appearance features. Synthetic matching provides better resistance to occlusions and nonlinear motion.
  • Box-MeMBer: To better adapt to nonlinear systems and object motion patterns, a nonlinear motion model and a nonlinear observation model are employed for multi-object tracking using the stochastic finite set method based on multi-Bernoulli filtering. Simultaneously, the observation likelihood function is computed using the expected object state to eliminate the negative impacts of object superposition on the observation update.
  • MB-OSNet: We created MB-OSNet, introduced a multi-scale feature extraction module, and implemented a dense connection structure and multi-scale feature aggregation technique. By connecting the feature maps of different levels in layers, the network can obtain rich semantic information at different scales, improve the network’s perception of object features at different scales, and reduce information loss.
  • Verification and comparison on the VisDrone and UAVDT datasets: We verified the effectiveness of our method on the VisDrone and UAVDT datasets, obtaining 37.4% and 46.2% MOTA, respectively. The results also show that our algorithm is competitive when compared with eight recent multi-object tracking algorithms. In addition, we used a DJI Mini 3 to capture 1080p footage with varied viewing angles, altitudes, occlusions, and so on to test the algorithm’s robustness.

3. Methods

We begin the Box-MeMBer prediction step for the $k$-th frame using the tracking results $X_{k-1}$ from the previous frame. At this stage, we forecast the objects’ state positions, extract the predicted bounding boxes, and compute the association matrix with the detection boxes in the current frame. The objects fall into four classes: $T_L$ (missed objects), $T_C$ (completed-tracking objects), $T_S$ (surviving objects), and $\gamma$ (newly appearing objects). $T_L$ represents objects that were not detected, which has a significant impact on tracking accuracy; $T_C$ represents objects that no longer appear in the current frame, indicating that their tracking is complete; $T_S$ represents objects that exist in the current frame and will continue to be tracked; and $\gamma$ represents new objects that appear in the current frame for the first time. A secondary matching step differentiates missed objects from objects that no longer require tracking. To obtain the tracking results $X_k$ for the current frame, all objects undergo a Box-MeMBer update, followed by merging, pruning, and deleting erroneous objects. The Box-MeMBer prediction phase avoids the failure to associate two bounding boxes of the same object caused by excessive displacement in the intersection-over-union (IoU)-based association matrix, and it allows continuous tracking even when the object moves rapidly. Figure 1 illustrates the framework.
Figure 1. Multi-target tracking algorithm framework for drone aerial videos based on Box-MeMBer and MB-OSNet.
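To make the per-frame procedure described above (and illustrated in Figure 1) easier to follow, the following minimal Python sketch shows one DB-Tracker step with the three stages passed in as callables; the function names and signatures are illustrative assumptions, not the authors' implementation.

```python
def db_tracker_step(tracks_prev, detections, predict, associate, update):
    """One frame of the tracking loop described above (illustrative skeleton only).

    tracks_prev : tracking results X_{k-1} from the previous frame
    detections  : detection box set D_k of the current frame
    predict     : callable implementing the Box-MeMBer prediction (Section 3.1.1)
    associate   : callable building the association matrix and returning
                  (surviving pairs T_S, unmatched tracks, unmatched detections)
    update      : callable implementing the Box-MeMBer update plus merging/pruning
    """
    # 1. Predict every track's bounding box in frame k
    predicted = predict(tracks_prev)

    # 2. Associate predictions with detections (IoU + appearance, Sections 3.2-3.4)
    T_S, unmatched_tracks, unmatched_dets = associate(predicted, detections)

    # 3. Unmatched tracks are either missed (T_L) or finished (T_C); unmatched
    #    detections become newborn objects (gamma). Secondary matching resolves
    #    the T_L / T_C split before the update.
    gamma = unmatched_dets

    # 4. Box-MeMBer update, then merge, prune, and delete erroneous objects
    return update(T_S, unmatched_tracks, gamma)
```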

3.1. Box-MeMBer

The current frame’s object detection box set is $D_k = \{d_k^i\}_{i=1}^{N_k}$, where $d_k^i = (x_k^i, y_k^i, w_k^i, h_k^i, c_k^i)$ is the state vector of the $i$-th detection box; $x_k^i, y_k^i, w_k^i, h_k^i, c_k^i$ denote the horizontal and vertical coordinates of the box center, the width, the height, and the confidence of the detection box; and $N_k$ is the number of object detection boxes in the current frame. At the same time, the objects in the frame have a certain spatial extent, and their influence zones may be independent of one another or overlap. The frame detection model is therefore formulated as follows.
$$D_k = \gamma(S_k) + w_k \qquad (1)$$
where $w_k = (w_{k,1}, \ldots, w_{k,M})^T$ denotes detection noise with a Gaussian distribution, i.e., $w_k \sim \mathcal{N}(0, r)$, and $M$ denotes the number of pixels in the frame. $X_k$ represents the object state set at time $k$ as a random finite set. $h_k(s) = (h_{k,1}(s), \ldots, h_{k,M}(s))^T$ represents the spatial distribution of the measurement for an object image with state $s$, whereas $\gamma(S_k) = \sum_{s \in S_k} h_k(s)$ represents the sum of the frame measurements contributed by all objects in the frame.
The detection likelihood function is as follows:
$$L(D_k \mid S_k) = \mathcal{N}\!\left(D_k;\ \sum_{s \in S_k} h_k(s),\ r\right) = \mathcal{N}(D_k;\ 0,\ r)\prod_{s \in S_k} e_D(s) \qquad (2)$$
where:
$$e_D(s) = \exp\!\left(h_k^T(s)\, r^{-1}\left(D_k - h_k(s)/2 - q\right)\right) \qquad (3)$$
$$q = \sum_{t \in S_k,\ t \neq s} h_k(t)/2 \qquad (4)$$
To obtain the posterior distribution of the updated object set, $e_D$ must be estimated. Equation (3) shows that in $e_D(s)$, the quantities $h_k(s)$, $r^{-1}$, and $D_k$ are known, but $q$ is unknown. From Equation (4), $q$ is a vector, and without loss of generality we can approximate $P(q)$ by a multivariate Gaussian distribution with mean $\mu_0$ and covariance $\Sigma_0$. Because the object state set at time $k$ is unknown, the predicted object state set from time $k-1$ is used to estimate these parameters, giving the following estimation equations:
$$\mu_0 = \frac{1}{2}\sum_{j=1,\, j \neq i}^{N_{k|k-1}} r_j\, b_j \qquad (5)$$
$$\Sigma_0 = \frac{1}{4}\sum_{j=1,\, j \neq i}^{N_{k|k-1}} \left(r_j\, v_j - r_j^2\, b_j\, b_j^T\right) \qquad (6)$$
where $i$ is the label of the multi-Bernoulli component to which the current object $s$ belongs, $b_j = \langle p_{k|k-1}^j,\, h_k \rangle$, and $v_j = \langle p_{k|k-1}^j,\, h_k h_k^T \rangle$.
Taking $E_0[e_D]$ as the expected value of $e_D$ under $P(q) \approx \mathcal{N}(\mu_0, \Sigma_0)$, the following result is obtained:
$$\hat{e}_D = \exp\!\left(h_k^T(s)\, r^{-1}\left(D_k - h_k(s)/2 - \mu_0\right) + \tfrac{1}{2}\, h_k^T(s)\, r^{-1}\, \Sigma_0 \left(h_k^T(s)\, r^{-1}\right)^T\right) \qquad (7)$$
Equation (7) gives the estimated expression of $e_D$ in the presence of multi-object superposition, demonstrating that the object state can be obtained through prediction and that the influence of multi-object superposition can be eliminated, allowing it to be used for object state measurement updates.
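As a concrete illustration of Equations (5)-(7), the following numpy sketch computes the Gaussian approximation of $q$ and the resulting estimate $\hat{e}_D$. The inputs ($r_j$, $b_j$, $v_j$, $r^{-1}$) are assumed to be supplied by the prediction step; this is a sketch of the reconstructed formulas, not the authors' implementation.

```python
import numpy as np

def estimate_e_d(h_s, D_k, r_inv, r_probs, b_list, v_list, exclude):
    """Approximate e_D for one object via Equations (5)-(7) (illustrative sketch).

    h_s     : (M,)   measurement contribution h_k(s) of the current object
    D_k     : (M,)   superimposed frame measurement
    r_inv   : (M, M) inverse of the noise covariance r
    r_probs : existence probabilities r_j of the multi-Bernoulli components
    b_list  : b_j = <p_j, h_k>        (each an (M,) vector)
    v_list  : v_j = <p_j, h_k h_k^T>  (each an (M, M) matrix)
    exclude : index i of the component the current object belongs to
    """
    M = len(D_k)
    mu0 = np.zeros(M)
    sigma0 = np.zeros((M, M))
    for j, (r, b, v) in enumerate(zip(r_probs, b_list, v_list)):
        if j == exclude:
            continue
        mu0 += 0.5 * r * b                                     # Equation (5)
        sigma0 += 0.25 * (r * v - (r ** 2) * np.outer(b, b))   # Equation (6)

    a = h_s @ r_inv                                            # h_k^T(s) r^{-1}
    # Equation (7): expected likelihood under q ~ N(mu0, sigma0)
    return np.exp(a @ (D_k - h_s / 2 - mu0) + 0.5 * a @ sigma0 @ a)
```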
The filter has two steps: prediction and updating. Each multi-Bernoulli object component is labeled to improve estimation performance by extracting the object track and eliminating spurious objects. The fundamental steps are as follows.

3.1.1. Box-Bernoulli Prediction

We assume the object’s multi-Bernoulli distribution parameter set at time $k-1$ is $\pi_{k-1} = \{(r_{k-1}^{(i)}, p_{k-1}^{(i)}, L_{k-1}^{(i)})\}_{i=1}^{M_{k-1}}$, where $r^{(i)}$ is the existence probability of the $i$-th component, $p^{(i)}$ is the probability distribution of the $i$-th component, and $L^{(i)}$ is the label of the $i$-th component. Each label is a three-dimensional vector: the first dimension indicates the component’s starting point, the second dimension is the component’s serial number among all objects, and the third dimension reflects the size of the object corresponding to the component. The predicted multi-Bernoulli parameter set of the object distribution is:
$$\pi_{k|k-1} = \left\{\left(r_{P,k|k-1}^{(i)},\, p_{P,k|k-1}^{(i)},\, L_{P,k|k-1}^{(i)}\right)\right\}_{i=1}^{M_{k-1}} \cup \left\{\left(r_{\Gamma,k}^{(i)},\, p_{\Gamma,k}^{(i)},\, L_{\Gamma,k}^{(i)}\right)\right\}_{i=1}^{M_{\Gamma,k}} \qquad (8)$$
where $p_{P,k|k-1}^{(i)} = \dfrac{\langle f_{k|k-1}(s \mid \cdot)\, p_{k-1}^{(i)},\ p_{b,k}\rangle}{\langle p_{k-1}^{(i)},\ p_{b,k}\rangle}$, $r_{P,k|k-1}^{(i)} = r_{k-1}^{(i)}\,\langle p_{k-1}^{(i)},\ p_{b,k}\rangle$, and $L_{P,k|k-1}^{(i)} = L_{k-1}^{(i)}$.
Here, $f_{k|k-1}(s \mid \cdot)$ denotes the single-object state transition probability density to state $s$, $p_{b,k}(s)$ denotes the probability that an object with state $s$ at time $k-1$ survives to time $k$, and $\{(r_{\Gamma,k}^{(i)}, p_{\Gamma,k}^{(i)}, L_{\Gamma,k}^{(i)})\}_{i=1}^{M_{\Gamma,k}}$ denotes the parameter set of the birth multi-Bernoulli components at time $k$.
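A minimal sketch of this prediction step for one labeled component follows, assuming the density $p_{k-1}^{(i)}$ is represented by weighted particles; the particle representation, function names, and signatures are assumptions made here only for illustration.

```python
import numpy as np

def predict_component(r_prev, particles, weights, transition, p_survive):
    """Box-Bernoulli prediction for one labeled component (illustrative sketch).

    r_prev     : existence probability r_{k-1}^{(i)}
    particles  : (N, d) array of state samples representing p_{k-1}^{(i)}
    weights    : (N,)   normalized particle weights
    transition : callable propagating particles through f_{k|k-1}(. | x)
    p_survive  : callable giving the survival probability p_{b,k}(x) per particle
    """
    ps = p_survive(particles)                       # survival probability per particle
    r_pred = r_prev * np.sum(weights * ps)          # r_{P,k|k-1} = r_{k-1} <p_{k-1}, p_b>
    w_pred = weights * ps
    w_pred = w_pred / w_pred.sum()                  # renormalize the density
    x_pred = transition(particles)                  # propagate through the motion model
    return r_pred, x_pred, w_pred                   # label L is carried over unchanged
```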

3.1.2. Box-Bernoulli Update

Given the predicted object multi-Bernoulli distribution parameter set $\pi_{k|k-1} = \{(r_{k|k-1}^{(i)}, p_{k|k-1}^{(i)}, L_{k|k-1}^{(i)})\}_{i=1}^{M_{k|k-1}}$, the updated object multi-Bernoulli distribution parameter set is as follows:
$$\pi_k = \left\{\left(r_k^{(i)},\, p_k^{(i)},\, L_k^{(i)}\right)\right\}_{i=1}^{M_{k|k-1}} \qquad (9)$$
where $r_k^{(i)} = \dfrac{r_{k|k-1}^{(i)}\,\langle p_{k|k-1}^{(i)},\, e_D\rangle}{1 - r_{k|k-1}^{(i)} + r_{k|k-1}^{(i)}\,\langle p_{k|k-1}^{(i)},\, e_D\rangle}$, $p_k^{(i)} = \dfrac{p_{k|k-1}^{(i)}\, e_D}{\langle p_{k|k-1}^{(i)},\, e_D\rangle}$, and $L_k^{(i)} = L_{k|k-1}^{(i)}$. $e_D$ is estimated by Equation (7).
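The corresponding update step, under the same illustrative particle representation assumed above, could look as follows; `e_d` holds the estimated likelihood of Equation (7) evaluated per particle.

```python
import numpy as np

def update_component(r_pred, particles, weights, e_d):
    """Box-Bernoulli update for one component (illustrative sketch; particle form).

    r_pred    : predicted existence probability r_{k|k-1}^{(i)}
    particles : (N, d) predicted state samples representing p_{k|k-1}^{(i)}
    weights   : (N,)   normalized particle weights
    e_d       : (N,)   estimated likelihood e_D evaluated at each particle (Eq. (7))
    """
    lik = np.sum(weights * e_d)                           # <p_{k|k-1}, e_D>
    r_upd = r_pred * lik / (1.0 - r_pred + r_pred * lik)  # existence probability update
    w_upd = weights * e_d
    w_upd = w_upd / w_upd.sum()                           # p_k proportional to p_{k|k-1} e_D
    return r_upd, particles, w_upd                        # label is carried over unchanged
```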

3.1.3. Object Trajectory Information Extraction

We extract the state parameters and trajectory labels of multi-Bernoulli components that exist independently (i.e., a single object). For multiple multi-Bernoulli components fused into one object (i.e., several components representing the same object), we combine the position information and probability information of the related components to produce a single object state. The position information is used to compute the state parameters, and further feature extraction and analysis can be performed as needed. To ensure unique identification, the label of the first component is used as the track label. If no object with the same label appears in the preceding or following frames, indicating that the object is not continuously observed, it can be considered a false alarm and removed, or marked as an unreliable object in the present frame. The predicted trajectory set obtained is as follows:
$$P_{k|k-1} = \left\{p_{k|k-1}^{\,i}\right\}_{i=1}^{N_k} \qquad (10)$$
where $p_{k|k-1}^{\,i} = (x_k^i, y_k^i, w_k^i, h_k^i, ID_k^i)$ is the prediction box, comprising the predicted box’s center coordinates, width and height, and track number.
We can thus efficiently distinguish separate objects from multiple components fused into one object and extract their state parameters and trajectory labels, which ensures the accuracy and reliability of the multi-object trajectory prediction.

3.2. MB-OSNet

A simple convolutional neural network can learn the global characteristics of the object region of interest and extract a discriminative feature representation of the object, but it cannot discern intra-class differences. Global features help the network learn contour information over a larger field of view, whereas part-level features carry more fine-grained information. At the same time, partitioning the frame excessively is not recommended, since insufficient local information may cause frame information loss and diminish accuracy. Partial division therefore produces fine-grained yet restricted information: global cues alone are adequate in coverage but inadequate in local detail. OSNet is used as the backbone network to gather comprehensive, coordinated picture information, a multi-branch feature-collaboration structure is added, and a multi-branch OSNet (MB-OSNet) is developed to preserve both global and fine-grained picture information. Figure 2 depicts the network structure.
Figure 2. MB-OSNet Network Structure.
OSNet can learn a full-scale feature representation of the object and has the following main components. First, full-scale residual blocks and a unified aggregation gate are introduced by decomposing the convolutional layers. As illustrated in Figure 3b, depthwise separable convolution is used to decompose the 3 × 3 convolution, creating Lite 3 × 3 layers. Different convolution streams have different receptive fields, and by stacking Lite 3 × 3 layers to construct the bottleneck, as shown in Figure 3c, a wide range of scales can be captured. Furthermore, a parameter t denoting the feature scale expands the residual function through multiple Lite 3 × 3 layers in order to learn multi-scale features. Through short connections, the learned small-scale features are efficiently retained in the following layers, capturing the entire range of spatial scales. A learnable aggregation gate (AG) dynamically combines the outputs of the convolution streams at different scales to produce full-scale features. Multiple convolution streams share the AG, so the number of parameters is independent of the number of streams, making the model scalable. Finally, as illustrated in Figure 3d, the full OSNet network is built by stacking lightweight bottlenecks layer by layer, allowing a flexible balance of model size, computational cost, and performance.
Figure 3. OSNet network structure and its components. (a) Standard 3 × 3 convolution. (b) Lite 3 × 3 convolution. (c) Bottleneck. (d) OSNet network.
Two branch structures are developed for the multi-branch structure. The first is the local branch. In this branch, the feature map is partitioned into four horizontal stripes, and average pooling is applied to produce 1 × 1 × C local features. Note that the four local features are concatenated into a column vector, and the resulting concatenated feature is
$$f = \left[f_1^T,\, f_2^T,\, f_3^T,\, f_4^T\right]^T \qquad (11)$$
where $f_i^T$ denotes the $i$-th of the four column vectors obtained from the horizontally partitioned feature map.
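A minimal numpy sketch of the local branch, assuming a (C, H, W) feature map and four horizontal stripes as described above (a sketch, not the authors' implementation):

```python
import numpy as np

def local_branch(feature_map, n_stripes=4):
    """Local branch of MB-OSNet (illustrative numpy sketch).

    feature_map : (C, H, W) array, the OSNet output feature map
    Returns the concatenated column vector f = [f_1^T, ..., f_4^T]^T of Eq. (11),
    where each f_i is the average-pooled (1 x 1 x C) feature of one horizontal stripe.
    """
    stripes = np.array_split(feature_map, n_stripes, axis=1)  # split along the height axis
    parts = [s.mean(axis=(1, 2)) for s in stripes]            # average pool per stripe -> (C,)
    return np.concatenate(parts)                              # (n_stripes * C,) column vector
```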
The second branch is the global branch. In contrast to the local branch, GeM pooling is applied directly after the OSNet convolutional layers. To obtain the feature vectors, the initialization parameter $p_k$ is set to 2.8:
$$\operatorname{GeM}(f_k) = \operatorname{GeM}\left([f_1, \ldots, f_n]\right) = \left[\frac{1}{n}\sum_{i=1}^{n} (f_i)^{p_k}\right]^{\frac{1}{p_k}} \qquad (12)$$
where $f_k$ is an individual feature map with activations $f_i$. GeM corresponds to max pooling when $p_k \to \infty$ and to average pooling when $p_k = 1$. The MB-OSNet appearance feature extraction network takes the detected object region as input and produces a feature vector representation. A gallery set is created for each object to store its appearance feature vectors from different frames; at most the 100 frames preceding the current moment are stored for each object.
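A sketch of GeM pooling (Equation (12)) with $p_k = 2.8$, together with a simple per-object appearance gallery capped at 100 frames; the deque-based gallery is an illustrative assumption of how the feature store could be kept.

```python
import numpy as np
from collections import defaultdict, deque

def gem_pool(feature_map, p=2.8, eps=1e-6):
    """Generalized-mean (GeM) pooling over one feature map, Eq. (12) (sketch).

    feature_map : (C, H, W) non-negative activations
    p           : pooling parameter; p -> inf approaches max pooling, p = 1 is average pooling
    """
    x = np.clip(feature_map, eps, None)                # clamp for numerical stability
    return (x ** p).mean(axis=(1, 2)) ** (1.0 / p)     # one pooled value per channel -> (C,)

# Per-object appearance gallery keeping at most the last 100 frames (assumed structure):
gallery = defaultdict(lambda: deque(maxlen=100))
# gallery[track_id].append(gem_pool(feature_map))      # called once per frame per object
```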

3.3. Object Association

We designed the association method used between the object trajectories and the detection boxes. The MB-OSNet network computes a feature similarity score that measures the similarity between two object boxes, an association matrix is built, and the Euclidean distance between trajectory and object is minimized. Minimizing this distance reduces the number of label jumps, enhancing tracking accuracy and making the tracking boxes more precise.
The object trajectory state information $P_{k|k-1} = \{p_{k|k-1}^j\}_{j=1}^{M_k}$ predicted by the $k$-th frame Box-MeMBer is computed as given in Equation (10). Together with the detection set $D_k = \{d_k^i\}_{i=1}^{N_k}$, the object correlation matrix algorithm yields the associated surviving object set, newborn object set, and dead object set, which includes missing objects.
The Box-MeMBer-predicted state components $P_{k|k-1} = \{p_{k|k-1}^1, p_{k|k-1}^2, \ldots, p_{k|k-1}^{M_k}\}$ must be associated and matched with this frame’s detection set $D_k = \{d_k^i\}_{i=1}^{N_k}$, and the objects are divided into a surviving object set $T_S$, a new object set $R$ plus clutter $K$, a missed object set $T_L$, and an end-of-tracking object set $T_C$, where $M_k$ is the number of predicted components and $N_k$ is the number of detection boxes in this frame.
$$A = \begin{bmatrix} a_{11} & \cdots & a_{1 M_k} \\ \vdots & \ddots & \vdots \\ a_{N_k 1} & \cdots & a_{N_k M_k} \end{bmatrix} \qquad (13)$$
$$a_{ij} = \frac{\operatorname{Area}(z_i) \cap \operatorname{Area}(x_{k|k-1}^j)}{\operatorname{Area}(z_i) \cup \operatorname{Area}(x_{k|k-1}^j)} \qquad (14)$$
where $a_{ij}$ denotes the intersection-over-union of the $i$-th detection box and the $j$-th Gaussian component. Each detection box $d_i$ is compared once with each Gaussian component $x_{k|k-1}^j$. If the computed value exceeds the threshold $T_{iou}$, the pair is judged to be the same object and stored in the surviving object set $T_S$; otherwise, it is treated as a separate object.
If two or more Gaussian components exceed the threshold $T_{iou}$ for the same detection box, the one with the greatest intersection-over-union is taken as the final association result; if two values are equal, a feature similarity calculation is performed on the components. If no value in the $i$-th row exceeds the threshold $T_{iou}$, $d_i$ is regarded as a new object or clutter; if no value in the $j$-th column exceeds the threshold $T_{iou}$, $x_{k|k-1}^j$ is considered an end-of-tracking object or a missed object.
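A small sketch of the association matrix $A$ of Equations (13) and (14), assuming detections and Box-MeMBer predictions are given as (cx, cy, w, h) boxes in numpy arrays (an assumed box format, not prescribed by the paper):

```python
import numpy as np

def iou_matrix(detections, predictions):
    """Association matrix A of Eqs. (13)-(14): IoU of every detection/prediction pair.

    detections : (N, 4) array of detection boxes (cx, cy, w, h)
    predictions: (M, 4) array of predicted boxes (cx, cy, w, h)
    """
    def to_corners(b):  # (cx, cy, w, h) -> (x1, y1, x2, y2)
        return np.stack([b[:, 0] - b[:, 2] / 2, b[:, 1] - b[:, 3] / 2,
                         b[:, 0] + b[:, 2] / 2, b[:, 1] + b[:, 3] / 2], axis=1)

    d, p = to_corners(detections), to_corners(predictions)
    A = np.zeros((len(d), len(p)))
    for i, di in enumerate(d):
        x1 = np.maximum(di[0], p[:, 0]); y1 = np.maximum(di[1], p[:, 1])
        x2 = np.minimum(di[2], p[:, 2]); y2 = np.minimum(di[3], p[:, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_d = (di[2] - di[0]) * (di[3] - di[1])
        area_p = (p[:, 2] - p[:, 0]) * (p[:, 3] - p[:, 1])
        A[i] = inter / (area_d + area_p - inter + 1e-9)
    return A  # rows: detections, columns: predicted components, as in Eq. (13)
```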

3.4. Secondary Matching

There are some unmatched objects and trajectories in multi-object tracking, indicating that there is no successful object-trajectory link. Secondary matching can be used to uncover potential matching associations by examining the link between unmatched objects and trajectories. The motion features and appearance features are merged after thoroughly considering the object’s appearance features and positional relationship information.
At a given time, the appearance feature of the $j$-th detection box (denoted $r_j$) is obtained. The smallest cosine distance between $r_j$ and the appearance features $r_k^{(i)}$ stored in the gallery $G_i$ of track $i$ is then computed:
$$d_2(i, j) = \min\left\{\, 1 - r_j^T r_k^{(i)} \ \middle|\ r_k^{(i)} \in G_i \right\} \qquad (15)$$
Motion features and exterior aspects complement each other. Motion characteristics (obtained via the Mahalanobis distance computations) provide possible information on object localization in quadratic matching, which is extremely effective in short-term prediction. Appearance features (obtained from cosine distance computation) help recover an object’s ID number after it has been occluded for a long time, minimizing the frequency of ID switches.
A weighting procedure is conducted on the two features in order to integrate them
$$c(i, j) = \lambda\, d_1(i, j) + (1 - \lambda)\, d_2(i, j) \qquad (16)$$
where $d_1(i, j)$ represents the Mahalanobis distance, $d_2(i, j)$ represents the cosine distance, and $\lambda$ is the weighting factor. Only appearance features are used for matching and tracking when $\lambda = 0$.
Finally, a discriminative overall threshold determines whether an association match is established, with an association being admissible only if it falls within the gating region of both metrics:
$$b_{i,j} = \prod_{m=1}^{2} b_{i,j}^{(m)} \qquad (17)$$
We combine the two previously mentioned criteria (thresholds on the Mahalanobis distance and the cosine distance, respectively) to jointly appraise the object-trajectory relationship. If the secondary matching is successful, the matched object and trajectory pair is identified, and its state information, such as position, velocity, and object ID, is updated.
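A sketch of the combined cost and gating described above, using SciPy's Hungarian solver for the global assignment; the numeric gate values here are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def secondary_match(d1, d2, lam=0.5, gate1=9.4877, gate2=0.3):
    """Secondary matching with the weighted cost of Eq. (16) and gating of Eq. (17).

    d1, d2 : (T, D) Mahalanobis and cosine distance matrices (tracks x detections)
    lam    : weighting factor lambda; lam = 0 uses appearance only
    gate1, gate2 : gating thresholds for the two metrics (illustrative values)
    """
    cost = lam * d1 + (1.0 - lam) * d2                # Eq. (16)
    admissible = (d1 <= gate1) & (d2 <= gate2)        # Eq. (17): both gates must pass
    cost = np.where(admissible, cost, 1e6)            # forbid inadmissible pairs
    rows, cols = linear_sum_assignment(cost)          # global assignment
    return [(r, c) for r, c in zip(rows, cols) if admissible[r, c]]
```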
Secondary matching can improve the correlation between objects and trajectories, as well as the accuracy and stability of multi-object tracking. Secondary matching takes full use of the information between unmatched objects and trajectories, assisting in the discovery of more suitable matching results by re-evaluating their relationship and further optimizing the effectiveness of multi-object tracking.

4. Experiment

This section is divided by subheadings. It provides a concise and precise description of the experimental results, their interpretation, as well as the experimental conclusions that can be drawn.

4.1. Datasets and Evaluation Indicators

4.1.1. Datasets

We conducted a comprehensive evaluation of the object detectors and DB-Tracker on the VisDrone MOT [] and UAVDT [] datasets, comparing their performance with other excellent object detectors and multi-object trackers. Sample images of the VisDrone MOT and UAVDT datasets are shown in Figure 4 and Figure 5.
Figure 4. VisDrone dataset. Lines (a,c) represent the original frames, and lines (b,d) represent the corresponding labeled datasets, respectively.
Figure 5. UAVDT dataset. Lines (a,c) represent the original frames, and lines (b,d) represent the corresponding labeled datasets, respectively.

4.1.2. Evaluation Indicators

Two authoritative sets of MOT metrics are used to evaluate the performance of our MOT system: the identity-based metrics defined in [] and the CLEAR MOT metrics []. These metrics are designed to assess overall performance and to indicate potential shortcomings of each model. They are defined as follows:
(1) FP (↓): false positives in the entire video;
(2) FN (↓): false negatives in the entire video;
(3) IDSW (↓): ID switches in the entire video;
(4) FM (↓): number of times a ground-truth trajectory is interrupted (fragmented) during tracking;
(5) IDF1 (↑): ratio of correctly identified detections to the average number of ground-truth and computed detections;
(6) MOTA (↑): combines false positives, false negatives, and ID switches; the score is defined as
$$\text{MOTA} = 1 - \frac{FN + FP + IDSW}{GT}$$
(7) MOTP (↑): the mismatch between the ground truth and the predicted results, calculated as
$$\text{MOTP} = \frac{\sum_{t,i} d_{t,i}}{\sum_t c_t}$$
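For concreteness, MOTA and MOTP can be computed from sequence-level counts as in this small sketch (the example numbers in the comment are made up):

```python
def mota(fn, fp, idsw, gt):
    """MOTA from counts accumulated over the whole sequence (higher is better)."""
    return 1.0 - (fn + fp + idsw) / gt

def motp(total_distance, total_matches):
    """MOTP: total localization error of matched pairs divided by number of matches.

    total_distance : sum over frames t and matched objects i of d_{t,i}
    total_matches  : sum over frames t of the match count c_t
    """
    return total_distance / total_matches

# Example with made-up counts (illustrative only):
# mota(fn=25000, fp=9000, idsw=1200, gt=100000) -> 0.648
```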

4.2. Experimental Results

4.2.1. Object Detection

The model is initialized with weights obtained by pretraining the detector on the COCO dataset. The detector is trained using SGD with the following parameters: 150 epochs, batch size 16, learning rate 0.02, momentum 0.9, and weight decay 0.0001. The experiment is trained on the VisDrone and UAVDT datasets and verified on the corresponding validation-set images. The test is run on an NVIDIA RTX 3090 with 24 GB of memory, and the top 100 most reliable detection results are averaged. To achieve better average accuracy, we added an attention mechanism to YOLO-V8 to better identify small targets. The improved YOLO-V8 achieved 53.4% mAP on the VisDrone dataset and 71.1% mAP on the UAVDT dataset. Figure 6 and Figure 7 depict the visual results, and Figure 8 compares the training of the improved YOLO-V8 (with the attention mechanism) against the original YOLO-V8.
Figure 6. Detection results of improved YOLO-V8 on the VisDrone dataset.
Figure 7. Detection results of improved YOLO-V8 on the UAVDT dataset.
Figure 8. Comparison of YOLOV8 training with the original YOLOV8 after adding the attention mechanism. (a) the precision–confidence curve of original YOLOV8 training; (b) the precision–recall curve of original YOLOV8 training; (c) the recall–confidence curve of original YOLOV8 training; (d) the precision–confidence curve of improved YOLOV8 training; (e) the precision–recall curve of improved YOLOV8 training; and (f) the recall–confidence curve of improved YOLOV8 training.

4.2.2. Multi-Object Tracking

We compared the DB-Tracker with DeepSORT [], ByteTrack [], BoT-SORT [], UAVMOT [], Deep OC-SORT [], Strong SORT [], and SimpleTrack []. Due to the nonuniform distribution of object entities per class in the training set, detection models behave differently across classes. To this end, all tracking comparison methods use the same detections produced by the improved YOLO-V8 detector and, following [], threshold cars at 0.3, buses at 0.05, and trucks at 0.1; the threshold is set at 0.4 for pedestrians and 0.05 for vans, as sketched below.
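The per-class confidence filtering described above can be sketched as follows; the class-name keys and detection dictionary format are assumptions made here for illustration.

```python
# Per-class confidence thresholds used when filtering detector output (from the text above).
CLASS_THRESHOLDS = {
    "car": 0.3,
    "bus": 0.05,
    "truck": 0.1,
    "pedestrian": 0.4,
    "van": 0.05,
}

def filter_detections(detections, thresholds=CLASS_THRESHOLDS, default=0.3):
    """Keep only detections whose confidence exceeds the per-class threshold.

    detections : iterable of dicts like {"cls": "car", "conf": 0.42, "box": (...)}
    """
    return [d for d in detections
            if d["conf"] >= thresholds.get(d["cls"], default)]
```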
Table 1 and Table 2 present a comprehensive comparison between the DB-Tracker and other popular trackers evaluated on the VisDrone MOT and UAVDT datasets. The evaluation includes key metrics such as MOTA, MOTP, IDF1, and IDSW, as well as comparisons with other methods. The DB-Tracker excels by effectively utilizing both position and appearance information, leading to superior performance. In contrast, Deep SORT extends its framework to handle multiple classes by independently associating each class based on position information. ByteTrack utilizes low-score detections for similarity tracking and background noise filtering. Deep OC-SORT introduces camera motion compensation and adaptive weighting, while BoT-SORT incorporates camera motion compensation. UAVMOT enhances object feature association through an ID feature update module. StrongSORT models nonlinear motion based on Gaussian process regression. SimpleTrack presents a novel incidence matrix by merging the embedded cosine distance and GIOU distance of the objects. This superiority stems from its effective fusion of position and appearance information, resulting in enhanced tracking performance across various evaluation measures.
Table 1. Comparison between DB-Tracker and the latest multiple trackers tested on the VisDrone dataset.
Table 2. Comparison between DB-Tracker and the latest multiple trackers tested on the UAVDT dataset.
However, Table 1 and Table 2 also reveal problems that limit the performance of these existing methods on multi-target tracking in drone aerial videos. DeepSORT integrates deep learning with the SORT algorithm, but its high computing resource requirements make it less efficient in large-scale scenarios. ByteTrack is less robust in some complex situations, especially under target occlusion and appearance changes. BoT-SORT integrates a variety of tracking modules, but its complex model structure requires more computing resources. UAVMOT is optimized for multi-target tracking in UAV aerial video, yet it still needs further improvement in some complex scenes. Deep OC-SORT combines appearance and motion information, but its computational overhead is relatively large. Strong SORT has a certain degree of robustness but is less efficient in large-scale drone aerial video scenes. SimpleTrack is simple and efficient, but its robustness in complex scenarios is limited. Compared with these methods, the DB-Tracker algorithm combines detection and tracking and comprehensively considers the location and appearance information of the target. Its robustness enables it to handle challenges such as complex scenes and target occlusion, and it efficiently handles the diverse scenarios of drone aerial videos, making it a promising solution in this field.
Figure 9 and Figure 10 show chronological frames with bounding boxes and color-coded identities. Using object motion information, our trajectory scoring technique effectively handles missed and incorrect detections caused by occlusion, particularly for short-term overlapping objects. In contrast to previous algorithms based on bounding-box intersections, our algorithm reduces pedestrian identity swaps. The results show that the proposed method performs well in crowded scenes with numerous objects and keeps the bounding boxes and identities consistent throughout the sequence.
Figure 9. Output results on the VisDrone test set. When enlarging and displaying the overlapping part of the moving object in the figure, it can be seen that the object can still be tracked continuously, and the ID of the object has not changed.
Figure 10. Output results on the UAVDT test set.

4.2.3. Ablation Experiment

An in-depth analysis of the DB-Tracker method is conducted by evaluating the impact of each component on the VisDrone dataset. The evaluation includes metrics such as MOTA, MOTP, and IDSW, and the results are presented in Table 3. The baseline is IoU- and detection-based multi-object tracking. Adding Box-MeMBer increases MOTA by 14.1% and MOTP by 16.9% and reduces IDSW by 3235. Using MB-OSNet to extract the target’s appearance features boosts MOTA by a further 5% while decreasing the unmatched rate. Furthermore, secondary matching improves MOTA and MOTP by 3.5% and 2.3%, respectively. Utilizing all components, DB-Tracker achieves an excellent 37.4% MOTA on the VisDrone MOT dataset, outperforming the compared approaches.
Table 3. Testing the impact of various components of DB-Tracker on tracking performance on the VisDrone dataset.

4.2.4. Self-Collected Dataset

We use DJI drone data for testing and evaluation to verify the effectiveness and performance of the proposed multi-object tracking system. The data were gathered by flying DJI drones equipped with cameras in real scenarios, covering a wide range of object types, motion modes, and scene conditions. The resolution is 1920 × 1080 and the frame rate is 25 fps. A sample of the dataset is shown in the figure. To prepare the dataset for evaluating multi-object tracking algorithms, the data were preprocessed and organized using appropriate tools, including video frame extraction, object labeling, and trajectory labeling. The data are annotated in the VisDrone dataset format and used as a validation set to test the effectiveness of DB-Tracker.
DB-Tracker is used throughout the experiment to process and analyze the collected data. Figure 11 depicts the experimental outcomes. Using this approach, we can track several objects and collect information on their position, velocity, and trajectory. Through trials and evaluations with DJI Mini 3 drone data, we validated the usefulness of the proposed multi-object tracking approach in real-world scenarios. These data are valuable resources that allow us to better understand the algorithm’s performance and application scenarios, and they provide a foundation for future research and implementation.
Figure 11. Multi-object tracking results on self-collected data. (a,b) demonstrate the effectiveness of our algorithm on the self-collected data. The video is available at https://github.com/YubinYuan/DB-Tracker, accessed on 23 September 2023.

5. Conclusions

Our research integrates the advantages of detection-based visual multi-object tracking algorithms with those of RFS-based visual multi-object tracking algorithms in order to improve the performance of drone aerial video multi-object tracking. In addition, by modeling object motion information, a more comprehensive, robust, and efficient integrated multi-object tracking algorithm is proposed. Our method produced a number of new results in drone aerial multi-object tracking: DB-Tracker achieved MOTA of 37.4% and 46.2% on the VisDrone and UAVDT datasets, respectively, and it offers a practical solution to the multi-object tracking problem in real-world scenarios. At the same time, our method improves significantly across a variety of datasets and challenging settings, contributing new ideas and methodologies to the field of unmanned aerial vehicle visual tracking. In the future, we will continue to enhance and optimize the algorithm to further improve multi-object tracking performance, and we are committed to extending the applicability of our research findings to a broader spectrum of application domains.

Author Contributions

Conceptualization, Y.Y.; methodology, Y.Y., Y.W. and Q.Z.; software, J.C. and Q.Z.; formal analysis, Y.W.; investigation, Y.W.; resources, J.C.; data curation, L.Z., J.C. and Q.Z.; writing—original draft, Y.Y.; writing—review and editing, Y.W. and L.Z.; visualization, Q.Z.; supervision, L.Z.; project administration, J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61573183.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Tan, L.; Huang, X.; Lv, X.; Jiang, X.; Liu, H. Strong interference UAV motion target tracking based on target consistency algorithm. Electronics 2023, 12, 1773. [Google Scholar] [CrossRef]
  2. Fan, H.; Du, D.; Wen, L. Visdrone-mot2020: The vision meets drone multiple object tracking challenge results. In Proceedings of the Computer Vision–ECCV 2020 Workshops, Glasgow, UK, 23–28 August 2020; Proceedings, Part IV 16. Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 713–727. [Google Scholar]
  3. Wu, X.; Li, W.; Hong, D.; Tao, R.; Du, Q. Deep learning for unmanned aerial vehicle-based object detection and tracking: A survey. IEEE Geosci. Remote Sens. Mag. 2021, 10, 91–124. [Google Scholar] [CrossRef]
  4. Lin, Y.; Wang, M.; Chen, W.; Gao, W.; Li, L.; Liu, Y. Multiple object tracking of drone videos by a temporal-association network with separated-tasks structure. Remote Sens. 2022, 14, 3862. [Google Scholar] [CrossRef]
  5. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and tracking meet drones challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7380–7399. [Google Scholar] [CrossRef] [PubMed]
  6. Cheng, S.; Yao, M.; Xiao, X. DC-MOT: Motion deblurring and compensation for multi-object tracking in UAV videos. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 789–795. [Google Scholar]
  7. Xu, X.; Feng, Z.; Cao, C.; Yu, C.; Li, M.; Wu, Z.; Ye, S.; Shang, Y. STN-Track: Multiobject tracking of unmanned aerial vehicles by swin transformer neck and new data association method. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 8734–8743. [Google Scholar] [CrossRef]
  8. Liang, Z.; Wang, J.; Xiao, G.; Zeng, L. FAANet: Feature-aligned attention network for real-time multiple object tracking in UAV videos. Chin. Opt. Lett. 2022, 20, 081101. [Google Scholar] [CrossRef]
  9. Ariza-Sentís, M.; Baja, H.; Vélez, S.; Valente, J. Object detection and tracking on UAV RGB videos for early extraction of grape phenotypic traits. Comput. Electron. Agric. 2023, 211, 108051. [Google Scholar] [CrossRef]
  10. García-Fernández, Á.F.; Xiao, J. Trajectory poisson multi-bernoulli mixture filter for traffic monitoring using a drone. IEEE Trans. Veh. Technol. 2023, 2023, 1–12. [Google Scholar] [CrossRef]
  11. Al-Shakarji, N.M.; Bunyak, F.; Seetharaman, G.; Palaniappan, K. Multi-object tracking cascade with multi-step data association and occlusion handling. In Proceedings of the 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, 27–30 November 2018; pp. 1–6. [Google Scholar]
  12. Wang, J.; Simeonova, S.; Shahbazi, M. Orientation-and scale-invariant multi-vehicle detection and tracking from unmanned aerial videos. Remote Sens. 2019, 11, 2155. [Google Scholar] [CrossRef]
  13. Yu, H.; Li, G.; Zhang, W.; Yao, H.; Huang, Q. Self-balance motion and appearance model for multi-object tracking in UAV. In Proceedings of the 2019 ACM Multimedia Asia (MMAsia), Beijing, China, 15–18 December 2019; pp. 1–6. [Google Scholar]
  14. Dike, H.U.; Zhou, Y. A robust quadruplet and faster region-based CNN for UAV video-based multiple object tracking in crowded environment. Electronics 2021, 10, 795. [Google Scholar] [CrossRef]
  15. Zhang, H.; Wang, G.; Lei, Z.; Hwang, J.N. Eye in the sky: Drone-based object tracking and 3d localization. In Proceedings of the 2019 27th ACM International Conference on Multimedia (MM), Nice, France, 21–25 October 2019; pp. 899–907. [Google Scholar]
  16. He, Y.; Fu, C.; Lin, F.; Li, Y.; Lu, P. Towards robust visual tracking for unmanned aerial vehicle with tri-attentional correlation filters. In Proceedings of the 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Las Vegas, NV, USA, 24 October 2020–24 January 2021; pp. 1575–1582. [Google Scholar]
  17. Stadler, D.; Sommer, L.W.; Beyerer, J. Pas tracker: Position-, appearance-and size-aware multi-object tracking in drone videos. In Proceedings of the 2020 European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 604–620. [Google Scholar]
  18. Huang, W.; Zhou, X.; Dong, M.; Xu, H. Multiple objects tracking in the UAV system based on hierarchical deep high-resolution network. Multimed. Tools Appl. 2021, 80, 13911–13929. [Google Scholar] [CrossRef]
  19. Kapania, S.; Saini, D.; Goyal, S.; Thakur, N.; Jain, R.; Nagrath, P. Multi object tracking with UAVs using deep SORT and YOLO V3 RetinaNet detection framework. In Proceedings of the 2020 1st ACM Workshop on Autonomous and Intelligent Mobile Systems (AIMS), Bangalore, India, 11–22 January 2020; pp. 1–6. [Google Scholar]
  20. Emiyah, C.; Nyarko, K.; Chavis, C.; Bhuyan, I. Extracting vehicle track information from unstabilized drone aerial videos using YOLO v4 common object detector and computer vision. In Proceedings of the 2021 Future Technologies Conference (FTC), Vancouver, BC, Canada, 28–29 October 2021; pp. 232–239. [Google Scholar]
  21. Jadhav, A.; Mukherjee, P.; Kaushik, V.; Lall, B. Aerial multi-object tracking by detection using deep association networks. In Proceedings of the 2020 National Conference on Communications (NCC), Kharagpur, India, 21–23 February 2020; pp. 1–6. [Google Scholar]
  22. Avola, D.; Cinque, L.; Diko, A.; Fagioli, A.; Foresti, G.L.; Mecca, A.; Pannone, D.; Piciarelli, C. MS-Faster R-CNN: Multi-stream backbone for improved Faster R-CNN object detection and aerial tracking from UAV images. Remote Sens. 2021, 13, 1670. [Google Scholar] [CrossRef]
  23. Wu, Y.; Wang, Y.; Zhang, D.; Huang, Z.; Wang, B. Research on vehicle tracking method based on UAV video. In Proceedings of the 2022 International Conference on Internet of Things and Smart City (IOTSC), Xiamen, China, 18–20 February 2022; pp. 801–806. [Google Scholar]
  24. Wu, H.; Du, C.; Ji, Z.; Gao, M.; He, Z. SORT-YM: An algorithm of multi-object tracking with YOLO V4-tiny and motion prediction. Electronics 2021, 10, 2319. [Google Scholar] [CrossRef]
  25. Forti, N.; Millefiori, L.M.; Braca, P.; Willett, P. Random finite set tracking for anomaly detection in the presence of clutter. In Proceedings of the 2020 IEEE Radar Conference (RadarConf20), Florence, Italy, 21–25 September 2020. [Google Scholar]
  26. Jeong, H.M.; Lee, W.C.; Choi, H.L. Random finite set based safe landing zone detection and tracking. In Proceedings of the 2022 13th Asian Control Conference (ASCC), Jeju, Republic of Korea, 4–7 May 2022. [Google Scholar] [CrossRef]
  27. Chen, L.J. Multi-target tracking with dependent likelihood structures in labeled random finite set filters. In Proceedings of the 2021 IEEE 24th International Conference on Information Fusion (FUSION), Sun City, South Africa, 1–4 November 2021. [Google Scholar]
  28. LeGrand, K.; Zhu, P.; Ferrari, S. A random finite set sensor control approach for vision-based multi-object search-while-tracking. In Proceedings of the 2021 IEEE 24th International Conference on Information Fusion (FUSION), Sun City, South Africa, 1–4 November 2021. [Google Scholar] [CrossRef]
  29. Pang, S.; Morris, D.; Radha, H. 3D multi-object tracking using random finite set-based multiple measurement models filtering (rfs-m3) for autonomous vehicles. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; pp. 13701–13707. [Google Scholar]
  30. Zhu, P.; Wen, L.; Du, D. Vision meets drones: Past, present and future. arXiv 2020, arXiv:1804.07437. [Google Scholar] [CrossRef]
  31. Du, D.; Qi, Y.; Yu, H. The unmanned aerial vehicle benchmark: Object detection and tracking. In Proceedings of the 2018 European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 370–386. [Google Scholar]
  32. Ristani, E.; Solera, F.; Zou, R.; Cucchiara, R.; Tomasi, C. Performance measures and a data set for multi-target, multi-camera tracking. In Proceedings of the 2016 European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016; pp. 17–35. [Google Scholar]
  33. Bernardin, K.; Stiefelhagen, R. Evaluating multiple object tracking performance: The clear mot metrics. EURASIP J. Image Video Process. 2008, 2008, 246309. [Google Scholar] [CrossRef]
  34. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649. [Google Scholar]
  35. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-object tracking by associating every detection box. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 1–21. [Google Scholar]
  36. Aharon, N.; Orfaig, R.; Bobrovsky, B.Z. BoT-SORT: Robust associations multi-pedestrian tracking. arXiv 2022, arXiv:2206.14651. [Google Scholar] [CrossRef]
  37. Liu, S.; Li, X.; Lu, H.; He, W. Multi-object tracking meets moving UAV. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 8876–8885. [Google Scholar]
  38. Maggiolino, G.; Ahmad, A.; Cao, J.; Kitani, K. Deep OC-SORT: Multi-pedestrian tracking by adaptive re-identification. arXiv 2023, arXiv:2302.11813. [Google Scholar] [CrossRef]
  39. Du, Y.; Zhao, Z.; Song, Y.; Zhao, Y.Y.; Su, F.; Gong, T.; Meng, H. Strong SORT: Make DeepSORT great again. IEEE Trans. Multimed. 2023, 2023, 1–14. [Google Scholar] [CrossRef]
  40. Li, J.; Ding, Y.; Wei, H.L. Simple Track: Rethinking and improving the JDE approach for multi-object tracking. Sensors 2022, 22, 5863. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
