1. Introduction
With the rapid development of remote sensing earth observation technology and the successive launch of video satellites such as Jilin-1 and SkySat [1,2,3], remote sensing video data are becoming increasingly easy to obtain, and they play an important role in tasks such as military precision strikes, civilian traffic flow estimation, and fire detection [4,5,6]. Video satellites complete dynamic ground observation by focusing on specific areas for a certain period of time, providing richer temporal information and a larger observation area. For example, in October 2015, the "Jilin-1" satellite developed by Changguang Satellite Company was launched. It can acquire full-color video covering an area of 11 km × 4.5 km at a resolution of 1 m, with a maximum length of 90 s and a frame rate of 25 frames per second. From 2013 to 2017, the United States launched 13 SkySat satellites developed by Skybox. These satellites can acquire full-color video at a 0.8 m resolution and multispectral video at a 1 m resolution, also with a maximum duration of 90 s. Therefore, multi-object tracking (MOT) in satellite videos (SV-MOT) has become one of the hotspots in satellite video processing and analysis. Unlike general MOT, SV-MOT faces additional challenges [7], such as cloud interference and targets so small that they occupy only about one-thousandth of the frame area.
In recent decades, significant breakthroughs have been made in general MOT, which can be divided into two branches: detection-based tracking (DBT) and joint detection and tracking (JDT). Specifically, as shown in Figure 1a, the DBT approaches first obtain detection boxes in every frame and then perform association over time. The detection and association modules have no interconnection, so tracking performance depends mainly on the quality of detection. The independent structure has low tracking efficiency and cannot be jointly optimized; these approaches readily achieve competitive performance, but at a heavy training cost. As shown in Figure 1b, the JDT approaches use a single network to perform the detection and association subtasks simultaneously, estimating the objects and learning re-ID features at the same time. These models are more efficient and faster because they consider detection and association jointly, but it is difficult to balance the two subtasks: one subtask tends to dominate the training process while the performance of the other decreases. For example, detection involves discovering all the targets in the current frame, whereas association involves matching all the current targets with the targets in the previous frames; the association task needs more low-level features than detection. Thus, the shared features in JDT approaches lead to a reduction in tracking performance.
As there are large differences between general videos and satellite videos, significant performance degradation occurs when any of the methods discussed above is applied directly to satellite videos. This is because, in comparison with natural scenes, the following challenges exist in remote sensing:
Small objects. As shown in Figure 2a, an object in a natural scene occupies about 30 × 70 pixels, whereas, due to the low spatial resolution of satellite videos, a target in a satellite video occupies approximately 10 × 8 pixels. The small size of objects in satellite videos weakens their features, making it difficult for the detectors widely used in natural scenes to recognize them. Therefore, in SV-MOT problems, it is difficult to accurately detect all the targets, resulting in a large number of missed detections and false alarms.
Low contrast and complex background. As shown in Figure 2b, low-contrast and complex-background scenarios can easily lead to a large number of missed detections and false alarms, resulting in a decrease in tracking performance. In addition, detected targets are easily removed during data association because of their low confidence, resulting in discontinuous trajectories. There is also a large amount of motion noise in satellite videos, which affects trajectory prediction and further increases the difficulty of multi-object tracking.
To fill the gap between general MOT and SV-MOT, a few tracking methods dedicated to satellite videos have been proposed in succession. However, they are unable to effectively solve the problems of small object detection and fragmented trajectories in satellite videos. Some methods cannot achieve online tracking, while others require complex and time-consuming network training.
It is worth mentioning that, although some progress has been made with the existing SV-MOT methods, the high frequency of missed tracking and ID switches when dealing with dense target occlusion is still a key issue that has not been well handled. In our opinion, the missed tracking problem is mainly due to the poor detection of small objects, while the high frequency of ID switches and tracking instability is mainly due to weak data association. Small targets cannot always be detected with high confidence in satellite videos, so filtering based on confidence may cause many small objects to be lost; however, associating all the detection boxes may result in the tracking of many false alarms. To tackle this issue, the similarity of the trajectories is used to distinguish false alarms from correct targets. Therefore, this study focuses on both the detection and association parts and proposes an adaptive association-based multi-object tracking approach for satellite videos, enhanced by high-resolution feature fusion (HRTracker). For better tracking performance, we follow the DBT branch, performing detection first and then association. In detail, a novel detection module with high-resolution feature fusion for small targets is introduced, combined with an association method for tracklet prediction based on adaptive Kalman filtering.
By repeatedly optimizing the detection network and matching strategy, the proposed method can reduce missed detections and ID switches and achieve precise and stable tracking, with a performance comparable to that of the SOTA methods. The contributions can be summarized as follows:
We propose a multi-object tracking method for satellite videos based on the detection-based tracking paradigm. Compared with other detection-based trackers, our method can reduce missed detections and false alarms.
An adaptive data association strategy is proposed to handle the problems related to the deletion of low-confidence detection boxes and fragmented trajectories.
The comprehensive experimental results using the VISO dataset [7] demonstrate the effectiveness of our method. The remainder of this paper is organized as follows. Section 2 introduces the related work that inspired this study. Section 3 explains the details of our method, and the extensive experiments are presented in Section 4. Finally, the conclusion and future work are outlined in Section 5.
3. Methods
At present, in research on satellite video multi-object tracking, we believe that the main challenges are excessive missed tracking and incomplete trajectories. We therefore propose an improved detection network to address the issue of missed tracking and an adaptive data association method for incomplete trajectories. In this section, we provide a detailed introduction to our method, which follows the detection-based multi-object tracking paradigm: it contributes three innovations and integrates them into a YOLOv5 + SORT framework.
3.1. Reasons for Selecting YOLOv5
Object detection is a fundamental and significant challenge in the field of computer vision, aimed at identifying and locating objects of interest within images or videos. Traditional object detection methods, such as the R-CNN series of algorithms, while achieving high accuracy, tend to be slow and are not suitable for real-time applications. The emergence of the YOLO algorithm, with its single-stage detection and high efficiency, addresses the demands of real-time object detection and fosters advancements in object detection technology. The development history of the YOLO algorithm illustrates the application and progress of deep learning in object detection tasks. From the initial YOLOv1 to the current YOLOv10, each iteration has demonstrated improvements in performance, speed, and accuracy. The popularity of the YOLO algorithm is also attributed to its ease of understanding and implementation, as well as its high adaptability to real-time performance. With the ongoing advancement of technology, the YOLO algorithm and its variants have been widely applied and studied in both industry and academia.
From YOLOv1 to YOLOv5, and now to the latest YOLOv10, each generation of models has undergone continuous optimization and improvement. This paper compares and analyzes the performance differences among the YOLOv10, YOLOv8, and YOLOv5 models. The performance comparison primarily focuses on two aspects: accuracy and inference speed. Accuracy refers to a model's ability to correctly identify targets, while inference speed indicates the model's real-time performance in practical applications. Existing research shows that YOLOv8 and YOLOv10 demonstrate some improvement in accuracy compared to YOLOv5, although the extent of this improvement is not substantial. In terms of inference speed, both YOLOv8 and YOLOv10 are faster than YOLOv5. YOLOv10 has further enhanced performance and efficiency through a consistent dual assignment strategy that eliminates the need for Non-Maximum Suppression (NMS) during training, along with a comprehensive efficiency-accuracy-driven model design strategy.
However, our experiments revealed that YOLOv5 outperforms both YOLOv8 and YOLOv10 in our setting, as evidenced by its higher mean Average Precision (mAP), recall, and precision. This discrepancy may be attributed to the structural differences among YOLOv5, YOLOv8, and YOLOv10. Although their architectures are generally similar, YOLOv5 may be more adept at detecting small objects in satellite videos for the following reasons: (1) YOLOv8 and YOLOv10 enhance the CSPDarknet backbone network and introduce anchor-free detection heads, whereas YOLOv5 employs anchor-based detection heads. While anchor-free methods are relatively straightforward and do not require the design of numerous anchor boxes, their performance tends to be suboptimal for multi-scale and multi-aspect-ratio targets, and inaccurate positioning is more likely to occur in scenarios with densely arranged targets. Conversely, anchor-based methods are well suited to targets with varying scales and aspect ratios, and anchor boxes can enhance detection performance for densely packed targets. Given the dense distribution and varying scales of targets in satellite videos, YOLOv5 may be more effective for small target detection in this context. (2) YOLOv8 and YOLOv10 are designed to be lighter and have significantly fewer computational parameters than YOLOv5. However, due to the larger frame size of satellite videos and the lower proportion of pixels occupied by targets, detecting targets in satellite imagery requires greater computational capacity, and the more intricate architecture of the YOLOv5 network may be better suited to object detection in satellite scenes.
We will provide a detailed explanation of the experimental results for YOLOv10, YOLOv8, and YOLOv5 in the experimental section. Our findings indicate that YOLOv5 demonstrates superior performance; therefore, we have selected YOLOv5 as the baseline detection method for this paper.
3.2. High-Resolution Detection Network
HRNet (High-Resolution Network) was originally designed for 2D human pose estimation, also referred to as keypoint detection. That network is primarily intended for estimating the pose of a single individual, meaning that the input image should contain only one human target. However, HRNet has also improved the performance of various computer vision tasks, including object detection, image super-resolution, image segmentation, image classification, and change detection. This enhancement is largely attributed to its capability to maintain high resolution.
In satellite videos, the appearance information of an object is weak because the object is very small, the background is complex, and the scene is affected by factors such as cloud interference, which results in a considerable number of missed detections. This large number of missed detections is one of the main reasons why general MOT cannot be directly introduced into satellite videos: if too few detection boxes are generated in the first place, any data association scheme will face tracking failure. Therefore, in this study, the general detection module is replaced by a small object detection module adapted to the characteristics of satellite video. At present, most small target detectors obtain strong semantic information through downsampling and then use upsampling to recover high-resolution location information. However, these approaches can lose a large amount of effective information during the repeated upsampling and downsampling. HRNet achieves both strong semantic information and precise positional information by parallelizing multiple resolution branches and continuously exchanging information between them. Thus, this multi-resolution concept is introduced for an essential improvement in tracking performance. Specifically, the representative detection framework YOLOv5 [29] is set as the baseline, and a high-resolution feature fusion strategy [30] is designed to make full use of the multi-resolution features; the network structure is shown in Figure 3.
As illustrated in the figure, we substituted the neck region of the YOLOv5 network with the HRNet network, pruned the YOLOv5 backbone, and preserved the original detection head of the YOLOv5 network. After modification, the backbone mainly consisted of two parts: the focus module and the C3 module. The focus structure, a convolutional layer for feature extraction, compresses and combines information from the input feature maps to extract high-level feature representations. It is a special convolutional operation in YOLOv5 and is used as the first layer in the network to downsample the input feature maps, thereby reducing computational demands and parameter counts. Specifically, the focus structure divides the input feature map into four subgraphs and concatenates them along the channel dimension to obtain a smaller feature map: assuming that the size of the input feature map is N × N × C, where N is the spatial size and C is the number of channels, the slicing yields a feature map of size N/2 × N/2 × 4C, which a subsequent convolution compresses to the desired number of channels. The C3 structure is an important component of YOLOv5, as shown in Figure 4, and is used to build the backbone network. The C3 structure can effectively reduce network parameters and computational complexity while improving the efficiency of feature extraction. Its core idea is to divide the input feature map into two parts: one part is processed by a small convolutional network, while the other passes through a shortcut branch; the two feature maps are then concatenated as input for the next layer. This combines low-level detail features with high-level abstract features to improve the efficiency of feature extraction. Finally, the size of the satellite data we input is 1024 × 1024 × 3; therefore, the feature map output from the backbone of the detection network is 128 × 128 × 256. A minimal sketch of both modules is given below.
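To make the two modules concrete, below is a minimal PyTorch sketch of the focus slicing and the C3 split-and-concatenate pattern as described above. It is a simplified illustration under our reading of the structure, not the exact YOLOv5 implementation; the channel counts, the use of SiLU, and the module signatures are assumptions.

```python
import torch
import torch.nn as nn

class Focus(nn.Module):
    """Slice the input into four interleaved sub-maps, concatenate along the
    channel dimension (N x N x C -> N/2 x N/2 x 4C), then compress with a conv."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(4 * c_in, c_out, kernel_size=1)

    def forward(self, x):
        slices = [x[..., ::2, ::2], x[..., 1::2, ::2],
                  x[..., ::2, 1::2], x[..., 1::2, 1::2]]
        return self.conv(torch.cat(slices, dim=1))

class C3(nn.Module):
    """Split the input into two branches: one through a small conv stack,
    one through a 1x1 shortcut; concatenate and fuse."""
    def __init__(self, channels, n_blocks=1):
        super().__init__()
        half = channels // 2
        body = [nn.Conv2d(channels, half, kernel_size=1)]
        for _ in range(n_blocks):
            body += [nn.Conv2d(half, half, kernel_size=3, padding=1),
                     nn.BatchNorm2d(half), nn.SiLU()]
        self.branch = nn.Sequential(*body)
        self.shortcut = nn.Conv2d(channels, half, kernel_size=1)
        self.fuse = nn.Conv2d(2 * half, channels, kernel_size=1)

    def forward(self, x):
        return self.fuse(torch.cat([self.branch(x), self.shortcut(x)], dim=1))

x = torch.randn(1, 3, 1024, 1024)   # a satellite frame, as in the paper
print(Focus(3, 64)(x).shape)        # torch.Size([1, 64, 512, 512])
```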
The idea of HRNet [31] was introduced into YOLOv5 to allow high-resolution feature fusion. The original HRNet network, as illustrated in Figure 5a, outputs only a high-resolution feature representation, meaning that only the high-resolution convolutional features are retained. Consequently, we also investigated outputting the feature representations from the other, lower-resolution convolutions. This approach benefits from the multi-resolution convolutional features while incurring minimal additional parameters and computational overhead. Unlike the original HRNet, which uses only the highest-resolution feature map as the model output, HRNetV2 [32] concatenates the feature maps of different resolutions and then subsamples the combined feature map with average pooling to obtain multiple feature maps of different resolutions. To apply it to object detection tasks, HRNetV2p post-processes the concatenated feature representation, as shown in Figure 5b,c. Specifically, HRNetV2 bilinearly upsamples the low-resolution outputs to match the size of the high-resolution output and concatenates them to produce a high-resolution feature representation, which is employed in semantic segmentation and facial landmark detection, among other applications. The fusion across different resolutions employs a 1 × 1 convolution. For object detection tasks, HRNetV2p adopts a multi-level feature representation, obtained by downsampling the fused high-resolution representation with average pooling to produce additional lower-resolution representations; a sketch follows.
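The fusion just described can be sketched as follows: bilinear upsampling of the lower-resolution branches, channel concatenation, a 1 × 1 mixing convolution, and average pooling to build the HRNetV2p-style pyramid. The branch channel counts and the output channel number below are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HRFusion(nn.Module):
    """HRNetV2-style fusion: upsample the low-resolution branches to the
    highest resolution, concatenate, mix with a 1x1 conv; then build an
    HRNetV2p-like pyramid by average-pooling the fused map."""
    def __init__(self, branch_channels=(32, 64, 128), out_channels=256):
        super().__init__()
        self.mix = nn.Conv2d(sum(branch_channels), out_channels, kernel_size=1)

    def forward(self, feats):
        h, w = feats[0].shape[-2:]
        ups = [feats[0]] + [F.interpolate(f, size=(h, w), mode="bilinear",
                                          align_corners=False) for f in feats[1:]]
        fused = self.mix(torch.cat(ups, dim=1))
        # Multi-level outputs for the detection head (HRNetV2p-like pyramid).
        return [fused, F.avg_pool2d(fused, 2), F.avg_pool2d(fused, 4)]

feats = [torch.randn(1, 32, 128, 128),
         torch.randn(1, 64, 64, 64),
         torch.randn(1, 128, 32, 32)]
for level in HRFusion()(feats):
    print(level.shape)   # 128x128, 64x64, and 32x32 maps with 256 channels
```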
Therefore, in this study, an operation similar to that of HRNetv2p is adopted. Specifically, the HRNet retains high-level semantic information by adopting a parallel structure with different resolutions. Within the forward propagation, multiple resolutions interact between different branches, which further reduces the impact caused by the decreasing channel dimensions. The HRNet is composed of basic blocks and fuse layers. The input of the basic block is $X$, and the output $Y$ of each basic block can be expressed as
$$Y = X + f(f(X)),$$
where $f(\cdot)$ denotes the convolutional layer with a batch normalization (BN) layer and an ReLU activation function layer. Regarding the fuse layer, the input is combined with multi-resolution features, which are concatenated in the channel dimension. Then, the concatenated feature goes through an $f(\cdot)$ operation to generate the output, where the output channel varies with the position of the fuse layer.
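A minimal sketch of the basic block defined by the equation above, assuming the standard residual form with two stacked $f(\cdot)$ = Conv-BN-ReLU operations; the channel count is illustrative:

```python
import torch.nn as nn

class BasicBlock(nn.Module):
    """Residual basic block: Y = X + f(f(X)), with f = Conv-BN-ReLU."""
    def __init__(self, channels):
        super().__init__()
        def f():
            return nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True))
        self.body = nn.Sequential(f(), f())

    def forward(self, x):
        return x + self.body(x)
```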
In addition, to reduce the information loss of the feature maps with different resolutions during the sampling process and to reduce the amount of computation, the feature maps at the current resolution are also concatenated. Finally, three feature maps with different resolutions and channel numbers are obtained and fed into the detection head for further prediction.
3.3. An Adaptive Association-Based SV-MOT
Although the improvements above raise the detection rate of targets in satellite videos, they focus on feature representation and do not address how to more efficiently associate inter-frame targets into trajectories. The robustness of trajectory estimation and data association therefore requires further attention. Below, we introduce our proposed adaptive Kalman trajectory prediction method and two-step spatial data association method.
3.3.1. The Adaptive Kalman Filter for Trajectory Prediction
Kalman filtering [32] is a commonly used method for object tracking. However, in practical applications, the variance of the observation noise is often unknown, which can degrade the performance of Kalman filtering. To solve this problem, an adaptive Kalman filtering method is adopted. A detailed introduction to this method is provided below.
Firstly, we introduce the universal Kalman filtering method, an algorithm that uses a linear system state equation to optimally estimate the system state from input and output observation data. In this study, the state vector of each object was chosen to be an eight-tuple:
$$\mathbf{x} = [x, y, a, h, \dot{x}, \dot{y}, \dot{a}, \dot{h}]^{T},$$
where $x$ and $y$ represent the horizontal and vertical coordinates of the object, while $a$ and $h$ represent the aspect ratio and the height of the object's bounding box, respectively. The last four parameters represent their respective rates of change.
In the prediction step, the state transition matrix and the control inputs are used to predict the new state of the target. The specific prediction formulae are as follows:
$$\hat{\mathbf{x}}_{k} = A\mathbf{x}_{k-1},$$
$$\hat{P}_{k} = A P_{k-1} A^{T} + Q,$$
where $\hat{\mathbf{x}}_{k}$ is the predicted state vector, $A$ is the state transition matrix, $\mathbf{x}_{k-1}$ is the previous state vector, $\hat{P}_{k}$ is the predicted state covariance matrix, and $Q$ is the process noise covariance matrix.
In the update step, compare the predicted state with the observation and update the state vector and covariance matrix. The specific update formulae are as follows:
$$K_{k} = \hat{P}_{k} H^{T} \left( H \hat{P}_{k} H^{T} + R \right)^{-1},$$
$$\mathbf{x}_{k} = \hat{\mathbf{x}}_{k} + K_{k}\left(\mathbf{z}_{k} - H\hat{\mathbf{x}}_{k}\right),$$
$$P_{k} = \left(I - K_{k} H\right)\hat{P}_{k},$$
where $K_{k}$ is the Kalman gain matrix, $H$ is the observation matrix, $R$ is the observation noise covariance matrix, $\mathbf{z}_{k}$ is the observation vector, and $I$ is the identity matrix.
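The two steps can be written compactly in NumPy. The constant-velocity transition matrix A and the observation matrix H below are the standard SORT-style choices for the eight-dimensional state; the noise values are placeholders, with R to be estimated adaptively as discussed below.

```python
import numpy as np

dim = 8                                  # state: [x, y, a, h, vx, vy, va, vh]
A = np.eye(dim)
A[:4, 4:] = np.eye(4)                    # constant-velocity transition (dt = 1)
H = np.eye(4, dim)                       # we observe [x, y, a, h] only
Q = np.eye(dim) * 1e-2                   # process noise (assumed values)
R = np.eye(4) * 1e-1                     # observation noise (assumed values)

def predict(x, P):
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    return x_pred, P_pred

def update(x_pred, P_pred, z):
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    x = x_pred + K @ (z - H @ x_pred)
    P = (np.eye(dim) - K @ H) @ P_pred
    return x, P
```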
In the prediction step of Kalman filtering, the model predicts the current state based on the most recent best estimate and the process noise covariance, Q. If Q is set too low, the model may place excessive trust in the predicted values, leading to estimation results that deviate from the true values. Conversely, if Q is set too high, the model may largely disregard the predicted values, resulting in estimation outcomes that are overly sensitive to changes in the observed values. The observed values are then used to correct the predicted values, yielding the optimal estimate. The weight assigned to the observed values is determined by the covariance, R, of the observation noise. If R is set too low, the model may place undue trust in the observed values, which can also lead to estimation results that deviate from the true values. On the other hand, if R is set too high, the model may overlook the observed values, causing the estimation results to rely excessively on the predicted values. Therefore, the process of adjusting Q and R is essentially about balancing the trustworthiness of model predictions and observed data. Generally, these two parameters can be adjusted based on the specific application scenario and the characteristics of the observed data. If the observed data are relatively accurate and reliable, the value of R can be reduced to increase the weight of the observed values in the estimation results. Similarly, if the model predictions are accurate and reliable, the value of Q can be decreased, thereby increasing the weight of the predicted values in the estimation results.
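This balance can be seen concretely in the one-dimensional case, where the Kalman gain reduces to K = P̂/(P̂ + R); the values below are chosen purely for illustration:

```python
# One-dimensional Kalman gain: K = P_pred / (P_pred + R).
P_pred = 1.0
for R in (0.1, 1.0, 10.0):
    K = P_pred / (P_pred + R)
    print(f"R={R:5.1f}  ->  gain K={K:.2f}  (weight on the observation)")
# Small R -> K near 1: the update trusts the observation.
# Large R -> K near 0: the update trusts the prediction.
```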
In standard Kalman filtering, the observation noise variance is considered to be a known constant. However, in practical applications, the variance of the observation noise may vary over time or be uncertain due to other factors; in satellite videos especially, the large motion noise means that the observation noise is constantly changing. To solve this problem, an adaptive method can be used to estimate the variance of the observation noise. A common choice is the minimum variance unbiased estimate. In statistics, a minimum variance unbiased estimate is an unbiased estimate that has the minimum variance among all unbiased estimates; an estimate whose variance is no greater than that of any other unbiased estimate, regardless of the true parameter values, is called the uniformly minimum variance unbiased estimate (MVUE). The basic idea is to estimate the variance of the unknown observation noise by minimizing the variance of the estimation error. The specific steps are as follows:
$$\mathbf{e}_{k} = \mathbf{z}_{k} - H\hat{\mathbf{x}}_{k},$$
$$R_{k} = \frac{1}{N}\sum_{i=k-N+1}^{k} \mathbf{e}_{i}\mathbf{e}_{i}^{T} - H\hat{P}_{k}H^{T},$$
where $\mathbf{e}$ is the estimation error (innovation) vector, $N$ is the length of the sliding window, and $R_{k}$ is the updated observation noise covariance matrix.
During the target tracking process, the prediction and update steps are iteratively executed. Each iteration predicts and updates the state based on new observations and adaptively updates the variance of the observation noise.
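A minimal sketch of this adaptive update, assuming the sliding-window innovation-based form given above; the window length, the initial R, and the diagonal clipping safeguard are our own illustrative choices, not the paper's exact settings.

```python
from collections import deque
import numpy as np

class AdaptiveR:
    """Sliding-window estimate of the observation noise covariance R from
    the innovations e_k = z_k - H @ x_pred."""
    def __init__(self, obs_dim=4, window=30):
        self.outer_products = deque(maxlen=window)
        self.R = np.eye(obs_dim) * 1e-1           # initial guess (assumed)

    def update(self, e, H, P_pred):
        self.outer_products.append(np.outer(e, e))
        C = np.mean(self.outer_products, axis=0)  # windowed innovation covariance
        R_est = C - H @ P_pred @ H.T
        # Simple safeguard: keep the diagonal and clip it to stay positive.
        self.R = np.diag(np.clip(np.diag(R_est), 1e-6, None))
        return self.R
```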
3.3.2. The Two-Step Data Association for Satellite Videos
Targets in satellite videos that are small or affected by a complex background easily receive low detection confidence. Common MOT association methods first filter out the detection boxes with low confidence and then perform association, which causes many objects to be lost. To enable more objects in satellite videos to be associated into trajectories, a two-step association method is proposed: it first associates the high-score detection boxes and then the low-score detection boxes, and finally filters out false alarms according to their similarity to the trajectories. The flowchart can be seen in Figure 6.
Below, we specifically discuss the data association method. Firstly, according to an adaptive threshold, the detection boxes are divided into high-score and low-score detection boxes. Secondly, the trajectories are predicted in the new frame using the adaptive Kalman filter. Thirdly, the high-score detection boxes in the new frame are associated with the existing trajectories; the similarity between the predicted trajectories and the high-score detection boxes is calculated using IoU, and the Hungarian algorithm [33] is used to solve the optimal matching between the tracks and the detection boxes. Fourthly, the unmatched trajectories are associated with the low-score detection boxes in order to achieve better tracking performance under scale change, occlusion, and motion ambiguity; false alarms are distinguished from real targets by their similarity to the trajectories. Finally, the high-score detection boxes that have no match are initialized as new tracks, and the tracklets that have not been matched for a certain number of frames are deleted. A sketch of this two-step matching is given below.
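A compact sketch of this pipeline, assuming IoU-based costs, the Hungarian solver from SciPy, and simple list-based bookkeeping; the threshold values and the box format ([x1, y1, x2, y2]) are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou_matrix(tracks, dets):
    """Pairwise IoU between predicted track boxes and detection boxes."""
    m = np.zeros((len(tracks), len(dets)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(dets):
            x1, y1 = max(t[0], d[0]), max(t[1], d[1])
            x2, y2 = min(t[2], d[2]), min(t[3], d[3])
            inter = max(0, x2 - x1) * max(0, y2 - y1)
            union = ((t[2]-t[0])*(t[3]-t[1]) + (d[2]-d[0])*(d[3]-d[1]) - inter)
            m[i, j] = inter / union if union > 0 else 0.0
    return m

def associate(tracks, dets, iou_thr=0.3):
    """Hungarian matching on IoU; returns matches and unmatched indices."""
    if not tracks or not dets:
        return [], list(range(len(tracks))), list(range(len(dets)))
    cost = 1 - iou_matrix(tracks, dets)
    rows, cols = linear_sum_assignment(cost)
    matches = [(r, c) for r, c in zip(rows, cols) if 1 - cost[r, c] >= iou_thr]
    un_t = [i for i in range(len(tracks)) if i not in {r for r, _ in matches}]
    un_d = [j for j in range(len(dets)) if j not in {c for _, c in matches}]
    return matches, un_t, un_d

def two_step_association(pred_tracks, det_boxes, det_scores, score_thr=0.5):
    """Step 1: match tracks with high-score boxes; step 2: match leftover
    tracks with low-score boxes. Low-score boxes that stay unmatched are
    discarded as false alarms; unmatched high-score boxes start new tracks."""
    high = [b for b, s in zip(det_boxes, det_scores) if s >= score_thr]
    low = [b for b, s in zip(det_boxes, det_scores) if s < score_thr]
    matches1, un_t, un_high = associate(pred_tracks, high)
    leftover = [pred_tracks[i] for i in un_t]
    matches2, _, _ = associate(leftover, low)   # indices are local to leftover/low
    new_tracks = [high[j] for j in un_high]
    return matches1, matches2, new_tracks
```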
4. Experiments
To verify whether our proposed method could solve the missed tracking and incomplete trajectory problems, we conducted a series of experiments. In this section, we first introduce the implementation details, including the dataset, metrics, and experimental setup. Then, ablation experiments are performed to verify the effectiveness of each proposed module. Finally, the performance of HRTracker is compared with that of the state-of-the-art methods, and the experimental results demonstrate its superiority.
4.1. Datasets
We validated our method using the VISO dataset, which consists of 47 satellite videos captured by the Jilin-1 satellite. The satellite videos obtained by the Jilin-1 satellite consist of a series of true-color images, and the coverage area of a real scene can reach several square kilometers. The majority of instances in VISO are smaller than 50 pixels, and the videos cover different types of city-scale elements, such as roads, bridges, lakes, and a variety of moving vehicles. VISO is a diverse and comprehensive dataset covering different traffic situations, such as dense lanes and traffic jams. Therefore, the satellite videos in the dataset cover a wide range of challenges, including complex backgrounds, illumination variations, dense targets, and so on. The size of every video frame is 1024 × 1024 pixels, and the number of frames per video ranges from 324 to 326. We randomly selected 20 videos as the training set and 5 non-overlapping videos as the testing set from the VISO dataset.
4.2. Evaluation Metrics
We evaluated the performance of the different methods and the proposed modules for multi-object tracking in satellite videos. Six representative metrics were utilized for the evaluation, including false positives (FPs), false negatives (FNs), ID switches (IDs), the ID F1 score (IDF1), multi-object tracking precision (MOTP), and multi-object tracking accuracy (MOTA). The metrics are defined as follows:
FP: the number of false positives in the whole video;
FN: the number of false negatives in the whole video;
IDs: the total number of ID switches;
IDF1: the ratio of correctly identified detections to the average number of ground-truth and computed detections;
The MOTP is defined as follows, where $d_{t,i}$ is the Euclidean distance between the $i$-th matched object and its hypothesized position in frame $t$, and $c_{t}$ is the number of matches in frame $t$:
$$\mathrm{MOTP} = \frac{\sum_{t,i} d_{t,i}}{\sum_{t} c_{t}}$$
The MOTA is defined as follows, where $\mathrm{GT}_{t}$ is the number of ground-truth objects in frame $t$:
$$\mathrm{MOTA} = 1 - \frac{\sum_{t}\left(\mathrm{FN}_{t} + \mathrm{FP}_{t} + \mathrm{IDs}_{t}\right)}{\sum_{t}\mathrm{GT}_{t}}$$
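As a quick worked example of the MOTA formula (with made-up per-frame counts):

```python
def mota(fn, fp, ids, gt):
    """MOTA from per-frame counts of false negatives, false positives,
    ID switches, and ground-truth objects."""
    return 1 - (sum(fn) + sum(fp) + sum(ids)) / sum(gt)

# Toy three-frame example: (2+1+0) + (1+0+1) + (0+1+0) = 6 errors over 30 GT boxes.
print(mota(fn=[2, 1, 0], fp=[1, 0, 1], ids=[0, 1, 0], gt=[10, 10, 10]))  # 0.8
```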
4.3. Experimental Setup
In our HRTracker, the detection part with high-resolution feature fusion was first trained on the VISO dataset, followed by the two-step association strategy to obtain the trajectories of multiple objects in the satellite video. Twenty videos from VISO were used for training on an RTX 3090 GPU. During the training process, stochastic gradient descent (SGD) was selected as the optimizer; the initial learning rate was 0.01, and the weight decay parameter was 0.0005. In the tracking process, the maximum number of lost track frames was set to 30, and the matching threshold of the tracking was set to 0.8. We compared our method with state-of-the-art trackers in natural and satellite scenarios, such as SORT, FairMOT, TGraM, OC-SORT, BoT-SORT, and Bytetrack. To ensure the fairness of the experiment, all the method settings were consistent with those of the original studies, and the same satellite video dataset was used for training on the RTX 3090 GPU.
4.4. Ablation Experiments
To demonstrate that the detection performance of YOLOv5 is superior to that of other YOLO models in remote sensing scenarios, we randomly selected 6000 images from the VISO dataset for experimentation. We maintained consistent parameters to ensure fairness while testing YOLOv5, YOLOv8, and YOLOv10. The results are shown in Figure 7, which indicates that in satellite scenarios, YOLOv5 outperforms both YOLOv8 and YOLOv10 in terms of mean Average Precision (mAP), precision, and recall.
To verify the effectiveness of the separate proposed modules in HRTracker, five representative videos in the VISO dataset (as shown in Figure 8), with challenges such as serious occlusion and complex backgrounds, were selected as the testing set.
YOLOv5 + SORT was selected as our baseline method. Each module was integrated independently into the baseline for an effective evaluation.
Table 1 shows the results of the ablation experiment using the method we proposed.
HRNet represents the integration of the high- and low-resolution fusion network into the baseline detection network. Because it parallelizes multiple resolution branches and continuously exchanges information between them, the detector is more suitable for small targets in satellite videos.
TSA represents the use of the two-step spatial data association method in the baseline method. By utilizing the low-score detection boxes, it can track targets more stably and improve the performance of HRTracker.
AKF represents the insertion of adaptive Kalman filtering into the baseline method, which adaptively calculates the variance of observations and performs interframe updates to improve the performance of HRTracker.
It can be seen from Table 1 that, compared with the baseline, YOLOv5 + TSA (with the two-step association strategy added) achieved a great tracking enhancement, improving MOTA by 36% and IDF1 by 45.3%. At the same time, the FN, FP, and IDs were significantly reduced, which verifies the effectiveness of the two-step association. After integrating HRNet into YOLOv5, we achieved further improvements in tracking performance, improving MOTA by 1.4% and IDF1 by 0.4%, which verifies the effectiveness of HRNet. Finally, we integrated AKF into the tracking network, which achieved a 0.6% increase in MOTA; the MOTA reached 74.6%, and FN and FP both decreased. However, the IDF1 and IDs indicators slightly worsened, indicating an increase in the number of ID switches for the targets. More accurate trajectory prediction lengthens the trajectory of each target in satellite videos, which also leads to more target box flickering and a decrease in the IDF1 index, but the improvement in the MOTA and MOTP indicators is sufficient to demonstrate the effectiveness of the AKF.
4.5. Comparison with Other Methods
In this section, the SOTA approaches, including FairMOT, SORT, Bytetrack, and TGraM, are compared with the proposed method on the VISO test set, and the experimental results are shown in Table 2. Although our proposed method showed a decrease in the MOTP metric, it can be seen from Table 2 that we achieved the best results for MOTA, IDF1, and the other metrics. The proposed HRTracker achieved 74.6% for MOTA and 78.9% for IDF1, which are much better results than those of previous work. Specifically, compared with the most representative MOT approach, Bytetrack, we achieved results that were 4.7% higher for MOTA and 0.3% higher for IDF1. The superiority of the proposed method is reflected not only in the accuracy but also in the speed. Although our tracking method follows the DBT paradigm, the tracking part does not involve a deep network model, and the detection network is simple, with fast inference. Our tracking speed can reach over 100 FPS in satellite videos, making it suitable for in-orbit tracking. The decrease in MOTP is due to the fact that, although our improved detection network can detect targets that the other methods cannot, the false positives also increase accordingly. This is an area we need to continue to improve in future work.
Some of the visualization results of the proposed method are shown in Figure 9, Figure 10, and Figure 11. Comparisons of the trajectories tracked by the proposed method and the SOTA multi-object tracking methods are shown in Figure 9 and Figure 10. It can be concluded that the trajectories of TGraM, FairMOT, SORT, and other methods tend to be fragmented and confused. However, in Figure 9d,e and Figure 10d,e, it can be seen that our two-step association strategy performs stable tracking and produces more complete tracklets, which demonstrates the viability of the two-step association. Finally, as shown in Figure 9f and Figure 10f, with high-resolution feature fusion added, the number of tracklets increases, which intuitively demonstrates the effectiveness of our detection network, YOLOv5 + HRNet. The key frames of the representative visual results regarding the challenges of occlusion and dense targets are shown in Figure 11. It can be seen in Figure 11a that, when two cars meet and block each other, the IDs of the two cars remain unchanged and there is no missed tracking. When a vehicle crosses the bridge, its ID remains unchanged before and after the crossing, which indicates that our method can effectively handle background occlusion and the mutual occlusion of targets in complex satellite scenes. Figure 11b shows that our method can maintain simultaneous and stable tracking of multiple targets in satellite scenes with dense targets. Overall, when facing rather complex tracking challenges, our HRTracker achieves more stable tracking, forming more complete trajectories with fewer ID switches and higher accuracy.
In addition, to demonstrate the scalability and real-time performance of our algorithm, we applied it to various datasets for multi-object tracking. Specifically, we directly applied the weights trained on our training set to other satellite videos from the VISO dataset. The tracking results are presented in Figure 12. Although there may be some missed detections in certain scenarios, the results sufficiently indicate the scalability and real-time capability of the algorithm. We also employed the algorithm to track ships and planes in satellite videos, with the tracking results displayed in Figure 13 and Figure 14.