1. Introduction
With the continuous development of space satellite technology, remote sensing video satellites have achieved multi-object tracking (MOT) of moving vehicles on the ground. The MOT of moving vehicles has a wide range of applications in fields such as security monitoring [1], motion analysis [2], and traffic control [3]. Furthermore, due to the large amount of data in satellite videos and the high latency of transmission to the ground, real-time, on-satellite processing is of great practical significance [4,5,6]. With regard to devices for implementing on-satellite algorithms, commercial off-the-shelf (COTS) Graphics Processing Units (GPUs) and Central Processing Units (CPUs) are difficult to use due to issues related to power consumption, spatial radiation, and extreme temperatures [7,8]. Application-Specific Integrated Circuit (ASIC) processing chips are expensive due to their single-function nature and are not suitable for algorithm deployment or adaptation to different scenarios [9], while Field-Programmable Gate Arrays (FPGAs) are highly versatile and portable and may be used in on-satellite applications [10].
In real-time videos captured by remote sensing satellites, vehicle targets are small and exhibit little texture information due to the large imaging distance and coverage [11]. Traditional MOT algorithms designed for natural scenes (using, e.g., surveillance cameras and video recorders) or aerial scenes (using, e.g., drones and aircraft) are therefore difficult to apply to remote sensing vehicle tracking [12]. Remote sensing MOT algorithms can be divided into traditional algorithms and deep learning-based algorithms [13], while their frameworks can be divided into detection–association frameworks and joint detection–association frameworks [14]. The detection–association framework has two steps, moving target detection and target association, which are independent of each other. In a joint detection–association framework, detection and association are combined, and the association performance is improved via predictions of target displacement in the detection stage. However, this framework increases the structural and computational complexity of the algorithm, which is not conducive to on-satellite implementation.
In recent years, deep learning-based remote sensing MOT algorithms have made great progress [15]. Xiao et al. [16] applied a recognition framework to remote sensing MOT through a two-stream framework [17], resulting in DSFNet. DSFNet effectively recognizes moving target information; however, five consecutive frames must be processed at a time to obtain a single frame of moving targets. Zhang et al. [18] proposed a bidirectional MOT framework based on trajectory criteria (BMTC). BMTC reduces the impact of performance degradation in the detection stage, predicting the target trajectory while simultaneously backtracking the trajectories of invalid segments. However, the trajectory backtracking process greatly increases the computational complexity, and the implementation runs at only 0.148 fps on a TITAN X GPU, which is far from real-time. Zhao et al. [19] proposed a mask propagation and motion prediction network (MP2Net), which enhances the features of small targets and combines implicit and explicit motion prediction to improve multi-target tracking. However, due to its complex network structure, MP2Net only achieves a tracking speed of about 3 fps on a Titan RTX GPU, which does not meet real-time tracking requirements. Deep learning-based MOT algorithms for remote sensing can model target and image features through continuous training, but this relies on high-performance hardware such as GPUs or Tensor Processing Units (TPUs), whose demands exceed what satellite power supply and cooling systems can provide [20]. At the same time, on-satellite MOT implementation requires quantization and model pruning, which affect model accuracy and require additional tuning and training [21].
Traditional methods for implementing MOT algorithms for remote sensing mainly use the detection–association framework. In terms of detection, Ahmadi et al. [22] proposed a moving vehicle detection, tracking, and traffic parameter estimation method (DTTP), which removes the background using the frame difference method to extract moving targets and uses a neighborhood search method to associate targets, while Wei et al. [23] proposed a detecting and tracking framework (D&T), which uses an exponential probability distribution to distinguish potential vehicles from noise patterns based on modeling the local noise, improving moving vehicle detection. In terms of association, Zhang et al. [24] proposed a bi-level K-shortest method for constructing spatio-temporal grid flows for association; the extracted trajectories can skip bad detections. Compared to deep learning-based methods, these traditional algorithms for remote sensing MOT have significant advantages for solving specific problems (e.g., background noise, occlusion, and lost tracks): deep learning-based methods increase the depth of the model and the amount of computation, requiring more operations during processing. The traditional method is therefore more suitable for application on satellites with limited power consumption and computing power.
Due to the limitations of on-satellite equipment, most MOT algorithms used on satellites in recent years have adopted a detection–association framework built on traditional algorithms. Liu et al. [25] used a Zynq FPGA to detect and associate targets using dynamic background difference and a Kalman filter, respectively. However, the Programmable Logic (PL) part of the FPGA is only used to implement the preprocessing algorithms, while both detection and tracking are implemented on the Processing System (PS) side; the acceleration provided by the PL is limited, and real-time processing cannot be achieved. Han et al. [26] achieved tracking through a shape center extraction algorithm and implemented it in an FPGA-based space-embedded system. This method tracks only the largest moving target without background removal and is not suitable for complex moving backgrounds. Su et al. [27] proposed using an improved local contrast method and the Kalman filter to detect and associate potential targets, using differences in motion states to suppress fixed stars, and implemented the algorithm on an FPGA and a DSP. However, this method is aimed at targets in deep space with a simple background and cannot track moving vehicles in complex moving backgrounds. Overall, recent research on the on-satellite implementation of remote sensing MOT algorithms is still limited to traditional algorithms for deep space backgrounds. In this paper, we propose an MOT algorithm for moving vehicles in complex moving backgrounds and implement it on a satellite to achieve real-time MOT.
Based on the above analysis, it can be observed that, when implemented on satellites, traditional MOT algorithms can effectively avoid the following problems faced by deep learning-based MOT algorithms:
(1) Dataset limitations
The basic framework of deep learning-based MOT algorithms features two steps: training and inference. During the training phase of the neural network, a large amount of labeled data is required, and the dataset's quality and the training method directly affect the inference results. However, there are currently few high-quality labeled datasets for on-satellite MOT, and labeling many small targets is costly. Traditional MOT algorithms rely on predefined target types and extract target features directly, achieving MOT in the absence of such datasets.
(2) Computational scale and on-satellite implementation capabilities
The performance of deep learning-based MOT algorithms can be further improved by increasing the size of the neural network, but this also leads to a sharp increase in the number of calculations. The implementation of a large-scale neural network will lead to substantial processing time. Moreover, the deployment of such a network may be impractical due to the power consumption and resource constraints of on-satellite equipment. The actions of moving equipment (e.g., camera turntables and propulsion devices) are highly dependent on real-time performance, and the images need to be processed in real time to control the attitude of the satellite.
(3) Traceability and reliability
Deep learning models have a range of structures and connections, and there is no clear conclusion yet on the specific role of certain parts. The traceability and reliability of results are crucial in critical tasks such as satellite control. Traditional MOT algorithms are based on clear target features and work steps, and the results of each step can be clearly explained, making them more suitable for on-satellite processing tasks.
The main contributions of this paper are two-fold:
(1) We propose an MOT algorithm with a detection–association framework designed for complex moving backgrounds in remote sensing scenarios. The algorithm employs corner feature matching and the neighbor pixel difference method for background compensation and motion pixel extraction. The extracted motion pixels enable precise bounding box generation for moving targets. For target association, the Jonker–Volgenant (JV) algorithm is adopted to solve the linear assignment problem (LAP) efficiently, ensuring accurate and continuous MOT.
(2) A pixel-level stream processing mode and a cache access processing mode are proposed to optimize the on-satellite implementation of the MOT algorithm. These modes leverage pipeline and parallel processing to enhance processing efficiency and real-time performance, aligning with the characteristics of the satellite hardware and sensor output. The complete algorithm is successfully implemented on-satellite, achieving a comparable tracking performance to existing algorithms while ensuring real-time applicability.
In this paper, we propose an MOT algorithm for moving vehicles in complex moving backgrounds and apply the tracking algorithm onboard a satellite to achieve real-time multi-object tracking. In the detection stage, we use feature matching to obtain the background's motion information, then use neighboring pixel differences to segment the motion pixels, and finally, to eliminate background noise, fuse the motion pixels with the enhanced raw image to obtain complete moving-vehicle targets. In the association stage, we use the Kalman filter and the LAPJV matching algorithm to predict and associate the targets, respectively. In addition, we implement the algorithm on an on-satellite FPGA and optimize it for pipelining according to the on-satellite equipment and sensors, so that the algorithm processes data while the sensor reads it out. The joint software and hardware design accelerates the processing speed of the algorithm and improves its real-time performance; the on-satellite processing method proposed in this paper achieves real-time processing of 1024 × 1024 px at 47 fps.
The remainder of this paper is structured as follows. Section 2 details the design of the tracking method, Section 3 introduces the hardware implementation, Section 4 presents the experimental results, Section 5 presents the discussion and limitations, and Section 6 concludes the paper.
3. Hardware Implementation
3.1. Overall Hardware Design
The multi-object vehicle tracking algorithm for moving backgrounds proposed in this paper was implemented on a satellite, where real-time MOT processing is performed entirely by an FPGA. The structure of the proposed on-satellite system is shown in Figure 4, and it mainly includes the input and output interfaces, a global control module, a pipeline control and status acquisition module, a DDR cache, and the image processing modules. The input interface receives the real-time data from the sensor and converts the pixel-by-pixel readout from the sensor into an image data stream that is sent to the image processing modules; this data stream is also stored in the DDR cache. The output interface outputs the target tracking results, which can be used by the satellite to control the turntable and transmit the results, thereby achieving autonomous control of the remote sensing satellite. The global control module controls all system functions and collects the system status for telemetry and remote control, while the pipeline control and status acquisition module controls the pipeline status of the image processing modules and collects the working status of each module. The DDR cache buffers the results of the image processing modules as well as multiple image frames. The image processing modules are the compensation calculation module, the background compensation and target detection module, and the multi-target association module. These three modules operate in a pipeline and perform read-and-write interactions with the data in the DDR cache to achieve real-time image processing.
The internal design of the submodules has two modes: the pixel-level stream processing mode and the cache access processing mode. The pixel-level stream processing mode is used to scan and process complete image data; its internal structure is shown in Figure 5a, and it includes a row buffer, versatile RAM, and a computational core. The input image data are cached in real time in k row buffers according to the size of the k × k calculation kernel. The row buffer consists of a chain of shift registers made up of block RAM, and the output buffer contains the pixels of the calculation kernel as each new pixel is input. The row buffer greatly simplifies image neighborhood processing, reduces the on-chip memory footprint, and satisfies pixel-by-pixel input and output processing. The submodule also contains versatile RAM, which is designed with different functions, such as convolution kernel storage, filter kernel storage, or temporary data storage (e.g., the locations of matching points and sampling points); these functions become available once the calculation kernel has been assembled. The computational core is designed with different computational functions according to the submodule, and it can be assembled and reused according to the design of the algorithm.
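To make the stream-processing behavior concrete, the following is a minimal Python sketch of the row-buffer mechanism: it emulates k block-RAM line buffers that expose a k × k window for every incoming pixel. The kernel size, zero padding, and the mean-filter usage example are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def stream_windows(frame, k=3):
    """Behavioral model of the k-row line buffer: for each input pixel,
    emit the k x k neighborhood window centered on it (zero-padded at
    the borders), mimicking pixel-by-pixel streaming through shift
    registers built from block RAM."""
    h, w = frame.shape
    pad = k // 2
    padded = np.pad(frame, pad, mode="constant")
    for y in range(h):
        for x in range(w):
            # In hardware this window becomes available one pixel per
            # clock once the first k-1 rows have filled the line buffers.
            yield y, x, padded[y:y + k, x:x + k]

# Usage example: a 3x3 mean filter computed in streaming fashion.
img = np.random.randint(0, 256, (8, 8), dtype=np.uint8)
out = np.empty_like(img)
for y, x, win in stream_windows(img, k=3):
    out[y, x] = win.mean()
```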
The other design mode is the cache access processing mode, which performs various complex operations on the data in the cache and writes the results back. Its internal structure is shown in Figure 5b, and it includes a multi-frame or multi-target cache, a load-and-save module, and a complex computation module. The cache may be composed of on-chip block RAM or DDR, and the cache configuration can be changed according to the amount of calculation and data. The load-and-save module adjusts the read and write timing according to the cache type and calculation mode to reduce the effect of the interface bandwidth. The complex computation module implements a variety of processing tasks, such as the prediction, matching, and lookup steps of the multi-object tracking algorithm. This mode is similar to the traditional processing method, but by decomposing complex computations into finer granularities, the demand for interface bandwidth is reduced, preventing real-time bandwidth bottlenecks from affecting processing. The following is a detailed introduction to each module of the algorithm implementation.
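The following is a toy Python model of the cache access processing mode, assuming a simple key-value cache and a constant-velocity prediction kernel as the compute step; it only illustrates the fine-grained load-compute-save pattern, not the actual on-chip timing logic.

```python
from dataclasses import dataclass, field

@dataclass
class CacheAccessModule:
    """Toy model of the cache access processing mode: data is loaded
    from a (block RAM or DDR) cache in small units, processed by a
    complex-computation kernel, and written back, so that no single
    transfer saturates the interface bandwidth."""
    cache: dict = field(default_factory=dict)

    def load(self, key):
        return self.cache[key]

    def save(self, key, value):
        self.cache[key] = value

    def process(self, key, kernel):
        # Fine-grained read-modify-write: one record per transaction.
        self.save(key, kernel(self.load(key)))

mod = CacheAccessModule()
mod.save("target_0_state", [10.0, 20.0, 1.0, 0.5])  # x, y, vx, vy
# Kalman-style constant-velocity prediction as the compute kernel.
mod.process("target_0_state",
            lambda s: [s[0] + s[2], s[1] + s[3], s[2], s[3]])
print(mod.load("target_0_state"))
```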
3.2. Compensation Calculation
In the compensation calculation module, the transformation matrix between frame T and frame T−1 is calculated by acquiring and matching the corner features of the image, i.e., the compensation information of the moving background. The structure of the compensation calculation module is shown in Figure 6, and it includes five submodules: corner detection, main direction calculation, feature description extraction, feature matching, and transformation matrix calculation. The corner detection submodule extracts the corner points of the current frame image, while the main direction calculation submodule calculates the main direction of the corner points. The feature description extraction submodule extracts rotation-invariant corner features according to the main direction and stores them in the DDR cache at the same time. The feature matching submodule matches the corner features of frame T with those of frame T−1, while the transformation matrix calculation submodule calculates the affine transformation matrix of the image from the matched feature points and obtains the translation amounts Δx and Δy.
The corner detection submodule performs gradient calculation, Gaussian smoothing, thresholding, and maximum-value output. It extracts the corners from the current frame by thresholding the smaller eigenvalue λ_min of the gradient covariance matrix M and then applying non-maximum suppression (NMS). The gradient covariance matrix M is shown in Formula (6):

$M=\begin{bmatrix}\sum I_x^{2} & \sum I_x I_y\\ \sum I_x I_y & \sum I_y^{2}\end{bmatrix}$ (6)

where I_x and I_y represent the image gradients in the x and y directions, respectively, which are obtained by the Sobel operator after Gaussian filtering. Gaussian smoothing is performed by convolving the calculated gradient magnitude with a Gaussian kernel, using a standard deviation of 0.5 for the Gaussian function.
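As an illustration, the following is a minimal Python sketch of this corner detection step; it computes the smaller eigenvalue of M in closed form and applies thresholding and NMS. Apart from the 0.5 standard deviation, the window size and threshold are assumed values, not those used on the satellite.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel, maximum_filter

def min_eig_corners(img, sigma=0.5, win=1.5, thresh=0.01):
    """Sketch of the corner detection submodule: smaller eigenvalue of
    the 2x2 gradient covariance matrix M (Formula (6)), followed by
    thresholding and non-maximum suppression."""
    img = gaussian_filter(img.astype(np.float64), sigma)
    ix, iy = sobel(img, axis=1), sobel(img, axis=0)
    # Windowed sums of gradient products form M at every pixel.
    sxx = gaussian_filter(ix * ix, win)
    syy = gaussian_filter(iy * iy, win)
    sxy = gaussian_filter(ix * iy, win)
    # Closed-form smaller eigenvalue of [[sxx, sxy], [sxy, syy]].
    lam_min = 0.5 * ((sxx + syy) - np.sqrt((sxx - syy) ** 2 + 4 * sxy ** 2))
    mask = (lam_min > thresh * lam_min.max()) & \
           (lam_min == maximum_filter(lam_min, size=5))  # NMS
    return np.argwhere(mask)  # (row, col) corner coordinates
```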
The main direction calculation submodule performs neighborhood extraction, neighborhood gradient averaging, and rotation angle calculation, obtaining the main direction of the corner points in the image. The feature direction θ of the central pixel is calculated using the information gained from 45 sampling-point pairs around the corner point, as shown in Formula (7):

$\theta=\operatorname{atan2}(O_y,O_x)$, with $(O_x,O_y)=\frac{1}{N}\sum_{(i,j)}\big(I(x_i,y_i)-I(x_j,y_j)\big)\,\frac{(x_i-x_j,\;y_i-y_j)}{\lVert(x_i-x_j,\;y_i-y_j)\rVert}$ (7)

where (x_i, y_i) and (x_j, y_j) are the coordinates of the sampling points in a pair, I(x_i, y_i) and I(x_j, y_j) are the pixel values after Gaussian filtering, and N is the number of sampling-point pairs.
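A minimal sketch of this orientation computation follows, assuming a `patch_lookup` accessor for Gaussian-filtered pixel values and a predefined list of orientation pairs; both are stand-ins for the paper's sampling tables, which are not reproduced here.

```python
import math

def main_direction(patch_lookup, pairs):
    """Sketch of Formula (7): average the intensity-weighted unit
    vectors of the sampling-point pairs and take the resulting angle.
    `pairs` is a list of ((xi, yi), (xj, yj)) coordinate pairs."""
    ox = oy = 0.0
    for (xi, yi), (xj, yj) in pairs:
        diff = patch_lookup(xi, yi) - patch_lookup(xj, yj)
        d = math.hypot(xi - xj, yi - yj)  # pair baseline length
        ox += diff * (xi - xj) / d
        oy += diff * (yi - yj) / d
    n = len(pairs)
    return math.atan2(oy / n, ox / n)  # feature direction in radians
```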
The feature descriptor extraction submodule performs sampling point selection and grayscale comparison and outputs the features. This submodule adjusts the sampling point positions of the feature descriptor according to the feature direction θ, obtains a feature descriptor with rotational invariance, and stores the coordinates of the corner point and the corresponding feature value in the DDR cache. The feature F of the corner point is extracted using information from 43 sampling points around the corner point. The calculation of F is shown in Formula (8):

$F=\sum_{0\le a<512}2^{a}\,T(P_a)$, with $T(P_a)=1$ if $I(P_a^{r_1})-I(P_a^{r_2})>0$ and $T(P_a)=0$ otherwise (8)

where P_a is a pair of sampling points and I(P_a^{r_1}) and I(P_a^{r_2}) are the Gaussian-filtered pixel values of the two points in the pair.
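The binary comparison of Formula (8) can be sketched as follows; the 512 comparison pairs over the 43 sampling points are assumed inputs, as the exact sampling pattern is not reproduced here.

```python
def freak_like_descriptor(patch_lookup, pairs):
    """Sketch of Formula (8): a 512-bit binary descriptor built from
    pairwise grayscale comparisons of the rotated sampling points.
    `pairs` holds 512 ((xi, yi), (xj, yj)) comparison pairs."""
    f = 0
    for a, ((xi, yi), (xj, yj)) in enumerate(pairs):
        if patch_lookup(xi, yi) - patch_lookup(xj, yj) > 0:
            f |= 1 << a  # set bit a of the descriptor
    return f  # integer holding the 512-bit feature F
```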
The feature matching submodule performs feature loading, region selection, and similarity comparison. By obtaining the corner coordinates and feature values of the current frame T and the previous frame T−1, the similarity between the two frames is calculated to obtain the matching point pairs. The area around the feature points in the current frame is filtered according to the background motion speed and the frame rate of the image sensor; the search neighborhood range needs to be adjusted according to the task, with a typical search window centered on the corner feature. The FREAK feature descriptor of a corner feature is a 512-bit vector; if the number of identical bits between the descriptors of the two frames exceeds 400, corner feature matching is considered successful. The similarity calculation uses a parallel adder tree in the circuit to improve the processing speed.
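A behavioral sketch of this matching step is given below, assuming descriptors stored as Python integers and a 16 px search radius; the radius is an illustrative value, since the paper adjusts the window to the task.

```python
def match_corners(curr, prev, radius=16, min_same_bits=400):
    """Sketch of the feature matching submodule: for each corner in the
    current frame, compare its 512-bit descriptor against previous-frame
    corners inside the search window and accept pairs whose descriptors
    agree in more than 400 bit positions. `curr`/`prev` are lists of
    ((x, y), descriptor_int) tuples."""
    matches = []
    for (xc, yc), fc in curr:
        for (xp, yp), fp in prev:
            if abs(xc - xp) > radius or abs(yc - yp) > radius:
                continue  # outside the motion-bounded search window
            same_bits = 512 - bin(fc ^ fp).count("1")
            if same_bits > min_same_bits:
                matches.append(((xc, yc), (xp, yp)))
                break
    return matches
```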
The transformation matrix calculation submodule performs random selection, matrix calculation, error calculation, and best-candidate selection. The best transformation matrix is obtained by repeatedly randomly selecting matching points and calculating the error. The RANSAC algorithm is used to randomly sample from the matched point pairs; performed a large number of times, this random sampling excludes outliers, bringing the calculation result closer to the correct transformation matrix and separating correct matching point pairs from abnormal ones. The specific process is as follows (a behavioral sketch follows the list):
(1) First, three matching point pairs are randomly sampled from the matched point sets P_T and P_{T−1}, and a transformation matrix is calculated for this set of sampled point pairs. Here, P_T represents the coordinates of the feature points in the current frame T, and P_{T−1} represents those in the previous frame T−1. An 11-bit linear feedback shift register (LFSR) is used to generate the pseudo-random numbers.
(2) The coordinates of the points in set P_{T−1} are transformed using the transformation matrix calculated in step (1).
(3) The error between the coordinates calculated in step (2) and the coordinates of the matched point pairs in set P_T is calculated. If the error is less than the set threshold, the point pair is an interior point under the current transformation matrix; if it exceeds the threshold, it is an exterior point. The number of interior points under this transformation is counted.
(4) Steps (1) to (3) are repeated until the maximum number of iterations is reached, and the transformation matrix with the most interior points is selected.
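The sketch below mirrors these four steps in Python, substituting numpy least squares for the hardware matrix solver and `random.sample` for the 11-bit LFSR; the error threshold is an illustrative value.

```python
import random
import numpy as np

def ransac_affine(pts_prev, pts_curr, iters=17, err_thresh=2.0):
    """Behavioral sketch of the transformation matrix submodule: repeat
    sample-3-pairs / solve / count-interior-points and keep the matrix
    with the most interior points."""
    pts_prev = np.asarray(pts_prev, float)
    pts_curr = np.asarray(pts_curr, float)
    n = len(pts_prev)
    best_h, best_inliers = None, -1
    for _ in range(iters):
        idx = random.sample(range(n), 3)
        # Solve [x' y'] = [x y 1] @ H for the 3 sampled pairs (affine, 6 DoF).
        a = np.hstack([pts_prev[idx], np.ones((3, 1))])
        h, *_ = np.linalg.lstsq(a, pts_curr[idx], rcond=None)
        # Count interior points: pairs mapped within the error threshold.
        proj = np.hstack([pts_prev, np.ones((n, 1))]) @ h
        err = np.linalg.norm(proj - pts_curr, axis=1)
        inliers = int((err < err_thresh).sum())
        if inliers > best_inliers:
            best_h, best_inliers = h, inliers
    return best_h  # 3x2 affine matrix; the translation is its last row
```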
The complexity of the RANSAC algorithm depends on the set number of iterations: the higher the number of iterations, the more accurate the transformation matrix calculation, but the longer the calculation time. Therefore, a reasonable number of iterations is required for the algorithm to run efficiently. The theoretical minimum number of iterations N is calculated [40] as shown in Formula (9):

$N=\dfrac{\log(1-p)}{\log\big(1-(1-\varepsilon)^{s}\big)}$ (9)

where s represents the number of samples, which is set to 3 when calculating the transformation matrix, indicating that three sampling point pairs are selected per iteration; p represents the probability that at least one sampled set consists entirely of interior points, generally set to 0.99; and ε represents the probability of an exterior point being among the matched point pairs. Setting s = 3, p = 0.99, and the expected exterior-point ratio ε leads to N = 17; seventeen random samples of point pairs were therefore drawn to calculate the optimal affine transformation matrix with the maximum number of interior point matches.
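For reference, Formula (9) can be evaluated directly. The exterior-point ratio ε is not stated in the text, so the value below is chosen only to reproduce N = 17 under s = 3 and p = 0.99.

```python
import math

def ransac_iterations(p=0.99, eps=0.38, s=3):
    """Formula (9): theoretical minimum RANSAC iteration count N.
    eps = 0.38 is an illustrative exterior-point ratio that yields
    N = 17 with s = 3 and p = 0.99."""
    return math.ceil(math.log(1 - p) / math.log(1 - (1 - eps) ** s))

print(ransac_iterations())  # -> 17
```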
3.3. Background Compensation and Target Detection
The background compensation and target detection module begins calculating in parallel with the compensation calculation module. Background compensation requires the image translation parameters Δx and Δy from the affine transformation matrix output by the compensation calculation module, and the module processes frames T and T−1. Frame T is written to the DDR cache while the compensation calculation module is operating, and the DDR already holds a cache of the original image of frame T−1. The background compensation and target detection module obtains the original images of frames T and T−1 and the affine transformation matrix of frame T from the DDR cache. The structure of the background compensation and target detection module is shown in Figure 7, and it includes six submodules: image translation, image enhancement, neighborhood pixel difference and thresholding, image enhancement and segmentation, connected domain marking, and target extraction.
The frame T−1 image is background-compensated by image translation and then Laplace-enhanced. The neighborhood pixel difference and thresholding submodule subtracts each pixel of the enhanced frame T from the pixels in a neighborhood of the background-compensated frame T−1 and takes the minimum value, which is then compared with the threshold; if it exceeds the threshold, the pixel is a motion pixel, and the motion pixel image is output. At the same time, the frame T image is Laplace-enhanced and, after subtraction from the original image, is segmented according to the threshold value. The connected domain labeling submodule assigns different label numbers to the connected domains in the segmented image, and the image of all possible targets is output.
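A minimal sketch of the neighbor pixel difference method follows; the 3 × 3 neighborhood and the threshold are assumed values.

```python
import numpy as np

def motion_pixels(enh_curr, comp_prev, k=3, thresh=20):
    """Sketch of the neighbor pixel difference method: for each pixel of
    the enhanced current frame, take the minimum absolute difference
    against the k x k neighborhood of the background-compensated
    previous frame, then threshold the result."""
    a = enh_curr.astype(np.int32)
    b = np.pad(comp_prev.astype(np.int32), k // 2, mode="edge")
    h, w = a.shape
    min_diff = np.full((h, w), np.iinfo(np.int32).max)
    for dy in range(k):
        for dx in range(k):
            # Shifted view of the previous frame: one neighborhood offset.
            min_diff = np.minimum(min_diff,
                                  np.abs(a - b[dy:dy + h, dx:dx + w]))
    return min_diff > thresh  # True where a motion pixel is detected
```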
The target extraction submodule calculates the bounding boxes of the possible-target image and filters out the bounding boxes of moving targets that contain motion pixels. Because the possible-target image has a large number of connected domains, this calculation is time-consuming. Therefore, in this paper, we use a single-scan connected-component analysis (CCA) algorithm for the possible-target image:
(1) The image is input row by row, and each connected component segment that appears in a row is assigned a tag from 1 up to the maximum number of columns, ensuring that there are no duplicate tags between adjacent rows.
(2) The tags belonging to the same connected component in each row are stored in the form of linked lists, comprising a head list, a next-tag list, and a tail list. In addition, there is a motion pixel list, where the entry for a tag is set to 1 if the pixel at the corresponding position is a motion pixel in the motion pixel image.
(3) During scanning, the bounding box coordinates (upper, lower, left, and right) of the tags in the linked lists belonging to the same connected domain are merged.
(4) After the scan of the current row of a connected component is complete: if the connected component ends there and corresponds to a motion pixel, the bounding box accumulated for this connected domain is output; if the connected component does not end, the bounding box coordinates corresponding to the current tag are re-initialized after merging.
(5) After the entire possible-target image has been input, the bounding boxes of all connected components containing motion pixels have been output.
The CCA algorithm is performed in a single scan and can immediately output the bounding boxes of moving objects whose connected components have already ended, even before the image has been fully input; a simplified behavioral sketch is given below.
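The following Python sketch illustrates the single-scan behavior, using plain dictionaries in place of the head/next-tag/tail linked lists; it emits a bounding box as soon as a motion-containing component ends, and simplifies some merge cases relative to the hardware design.

```python
def single_scan_cca(binary_img, motion_img):
    """Simplified sketch of the single-scan CCA: stream the segmented
    image row by row, merge row segments that touch segments of the
    previous row (8-connectivity), and emit a bounding box as soon as
    a component containing a motion pixel ends."""
    active = []   # previous-row segments: span, box, motion flag
    boxes = []
    for y, (row, mrow) in enumerate(zip(binary_img, motion_img)):
        # Extract the foreground segments [x0, x1] of the current row.
        segs, x = [], 0
        while x < len(row):
            if row[x]:
                x0 = x
                while x < len(row) and row[x]:
                    x += 1
                segs.append({"x0": x0, "x1": x - 1,
                             "box": [y, y, x0, x - 1],
                             "motion": any(mrow[x0:x])})
            else:
                x += 1
        # Merge previous-row segments into overlapping current segments.
        for prev in active:
            continued = False
            for seg in segs:
                if seg["x0"] <= prev["x1"] + 1 and seg["x1"] >= prev["x0"] - 1:
                    b, pb = seg["box"], prev["box"]
                    seg["box"] = [min(b[0], pb[0]), max(b[1], pb[1]),
                                  min(b[2], pb[2]), max(b[3], pb[3])]
                    seg["motion"] = seg["motion"] or prev["motion"]
                    continued = True
            if not continued and prev["motion"]:
                boxes.append(prev["box"])  # component ended: emit its box
        active = segs
    boxes += [s["box"] for s in active if s["motion"]]  # last-row endings
    return boxes  # [y_min, y_max, x_min, x_max] per moving component
```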
The bounding boxes of the background compensation and object detection module are output in real time as the data streams of frames T and T−1 are read from the DDR cache. Once a bounding box that meets the requirements is detected and its connected component ends, the bounding box is output in the next clock cycle. The background compensation and object detection module performs single scans using our proposed pixel-level stream processing mode, which avoids multiple reads from the DDR cache, reduces the bandwidth used by the DDR interface, and improves the processing speed. It can also operate in parallel with the compensation calculation module.
3.4. Multi-Target Association
In the second stage of the entire MOT process, moving targets are associated across multiple frames and target IDs are assigned to achieve target tracking. The structure of the multi-target association module is shown in Figure 8, and it includes five submodules: IoU calculation, ID assignment, Kalman prediction, Kalman correction, and target feature and Kalman parameter caching. In order to achieve pipelined parallelism, the multi-target association module processes the moving targets of frame T−1 while the sensor outputs frame T. First, the target information and Kalman filter parameters of frame T−2 are obtained from the target feature and Kalman parameter cache, and the Kalman prediction submodule predicts the target positions for frame T−1. The IoU calculation submodule then calculates the IoU between the moving targets of frame T−1 and the predicted targets to obtain the IoU matrix, as shown in Formula (4). The ID assignment submodule calculates the optimal assignment, assigns the IDs from the targets of frame T−2 to those of frame T−1, and outputs the tracked targets. At the same time, the Kalman correction submodule updates the parameters of the Kalman filter and stores them in the target feature and Kalman parameter cache.
The results of ID assignment are separated into three categories. The first category is for results in which the target box in the current frame is successfully associated with the tracking box. If the similarity is greater than the set threshold, it is considered to be a valid match, and the tracked ID number is assigned to the current tracking box. If the similarity is less than the set threshold, it is an invalid match, and the tracked ID number will be discarded. The second category is for results in which the target box in the current frame is not successfully associated with the tracking box. This indicates that the target box in the current frame is a newly emerged target, and a new ID number is assigned to this target at this time. The third category is for results in which the tracking box is not successfully associated with the current target box. This indicates that the tracking box does not have a corresponding target in the current frame, that is, the target is lost, and the ID number of the tracking box is discarded at this time.
After the above process, the target box of the current frame is assigned to that of the previous frame, and the ID number corresponding to the target box of the previous frame is assigned to the target of the current frame to continuously track the target.
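A sketch of this assignment logic is given below, using SciPy's `linear_sum_assignment` (based on a modified Jonker–Volgenant algorithm) in place of the hardware LAP solver; the IoU threshold is an assumed value.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_boxes, det_boxes, iou_thresh=0.3):
    """Sketch of the ID assignment step: build the IoU matrix between
    predicted tracking boxes and current detections, solve the linear
    assignment problem, and split the result into the three categories:
    valid matches, new targets, and lost tracks. Boxes are (x1, y1, x2, y2)."""
    iou = np.zeros((len(track_boxes), len(det_boxes)))
    for t, tb in enumerate(track_boxes):
        for d, db in enumerate(det_boxes):
            ix1, iy1 = max(tb[0], db[0]), max(tb[1], db[1])
            ix2, iy2 = min(tb[2], db[2]), min(tb[3], db[3])
            inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
            union = ((tb[2] - tb[0]) * (tb[3] - tb[1]) +
                     (db[2] - db[0]) * (db[3] - db[1]) - inter)
            iou[t, d] = inter / union if union > 0 else 0.0
    rows, cols = linear_sum_assignment(-iou)  # maximize total IoU
    matches = [(t, d) for t, d in zip(rows, cols) if iou[t, d] >= iou_thresh]
    new_targets = set(range(len(det_boxes))) - {d for _, d in matches}
    lost_tracks = set(range(len(track_boxes))) - {t for t, _ in matches}
    return matches, sorted(new_targets), sorted(lost_tracks)
```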
The MOT algorithm proposed in this paper is implemented in a pipelined parallel manner. In the compensation calculation module, pipelined calculation is implemented by means of a row buffer, which reduces the delay in the output of feature descriptors. In the background compensation and target detection module, motion pixel extraction is achieved using the neighbor pixel difference method, effectively removing background interference, and moving target extraction is implemented using the single-scan CCA, which reduces repeated reads and writes to the DDR cache. In the multi-target association module, the target features and Kalman parameters of each tracked target are independently cached on-chip, which improves the speed of target association. The algorithm implementation modules adopt the pixel-level stream processing mode and the cache access processing mode, enabling the entire processing algorithm to be implemented on the satellite's equipment.
5. Discussion and Limitations
5.1. Discussion of Experimental Results
The on-satellite MOT algorithm proposed in this paper is implemented at the sensor’s edge to continuously track multiple targets and extract their trajectories. The FPGA can also control the turntable to further continuously track and locate individual targets or groups of targets whose motion trajectories meet specific characteristics. The moving-background compensation algorithm proposed in this paper can effectively remove the background motion caused by the relative motion of the turntable and the satellite to the ground, enabling on-satellite real-time MOT.
First, we tested the algorithm on the publicly available SkySat dataset; the results are shown in Table 1. Compared with traditional algorithms [22,23,24], the proposed MOT algorithm improves the MOTA, IDF1, IDP, and IDR. In particular, the MOTA improved from 48.1% to 69.9%, an increase of about 22 percentage points. The multi-object tracking algorithm proposed in this paper is based on a traditional algorithm; on the same static dataset, its tracking metrics greatly improved compared with those of other traditional algorithms.
Compared with the latest deep learning-based object tracking algorithms [18,19], the proposed algorithm achieves an IDP of 84.3%, 2.8 percentage points higher than 81.5%, and the number of mostly tracked (MT) targets increased from 126 to 149. MP2Net [19] combines implicit (IMP) and explicit (EMP) motion prediction strategies: IMP can enhance targets with inconspicuous motion features, while EMP uses the predicted displacement information between target frames to suppress false positives in motion target tracking and compensate for missed detections. Therefore, MP2Net can detect more moving targets and thus obtains higher MOTA, IDF1, and IDR values. However, its multiple complex network structures make the inference process extremely cumbersome, and tracking results cannot be output in real time. Compared with the latest deep learning-based object tracking algorithms, the proposed algorithm can be implemented on satellites with only a small deterioration in tracking metrics.
The MOT algorithm proposed in this paper uses the neighbor pixel difference method to simultaneously detect motion pixels and extract complete targets from the original image. This combination can effectively extract complete moving targets, improving the MOTA compared to traditional algorithms. The extraction of complete moving targets also aids the Kalman prediction and LAPJV matching in the association stage, which leads to a superior IDF1 value.
Next, we altered the displacement of the selected region in the dataset to generate several test video sequences with a moving background. The MOTA and IDF1 values with respect to the background motion speed are shown in Figure 12. The video sequences match the public dataset at 300 frames, and there are few scenarios in which the tracked target is occluded. When the background motion speed is 0 px/frame, the algorithm proposed in this paper outperforms MP2Net [19] in both MOTA and IDF1. As the background motion speed increases, the proposed MOT algorithm can still effectively track targets against the moving background and maintain its superior tracking metrics.
The end position of the selected region of the dataset differs at different background motion speeds, which leads to a difference in the total number of targets. Therefore, the evaluation metrics at a background motion speed of 1.5 px/frame are better than those at 1.0 and 2.0 px/frame, and MP2Net [19] performs similarly to the algorithm proposed in this paper.
Analyzing the overall trends shows that the algorithm proposed in this paper achieves a good tracking effect in the presence of a moving background. The background motion compensation step accurately obtains the background offset from the motion relationship of the feature points; the moving background can thus be removed while the moving vehicle targets are effectively retained.
We also visualized the results of moving target detection, as shown in Figure 13. MP2Net [19] mistakenly identifies the background motion as moving vehicle targets, generating false alarms, and also misses vehicle targets moving at a speed similar to that of the background. The moving target detection algorithm we propose can effectively detect moving vehicle targets in the presence of background movement.
Based on the evaluation results using the public SkySat dataset, the algorithm proposed in this paper has good MOT results in both static and dynamic backgrounds. The algorithm compensates for and removes the background and extracts the moving target to achieve effective MOT.
Furthermore, we evaluated MOT using original output videos from the Jilin-1 satellite. The tracking results are shown in Figure 14, where targets with different IDs are distinguished by different colors. The original output video of the Jilin-1 satellite exhibits clear non-linear background motion. Our proposed MOT algorithm can compensate for this moving background while tracking the moving vehicle targets. When the speed of the moving background changes, the deep learning-based MOT algorithm MP2Net [19] loses all tracked targets due to the difference from the scenes in its training dataset and cannot output new tracked target boxes. In real situations, the speed of the background motion changes with the movement of the satellite camera turntable, and this change is not fixed; a network trained with deep learning methods cannot adapt to such varied, changing scenarios.
Finally, we compared the on-satellite implementation of the algorithm with implementations on other hardware, as shown in Table 2. In terms of application scenarios, the method proposed in this paper can be implemented on satellites to track multiple vehicle targets in complex moving backgrounds. Liu et al. [25] and Su et al. [27] only achieved multi-object tracking in static background scenarios, for ship and space-based target surveillance, respectively. BMTC [18] and MP2Net [19] are the latest deep learning-based multi-object tracking algorithms for remote sensing vehicles; however, they can only perform remote sensing tracking against static backgrounds. The on-satellite method in this paper achieves a frame rate of 47 fps, about 4 times higher than MP2Net's 11 fps [19]. In terms of power consumption, note that Su et al. [27] performed target tracking in deep space with a simple background; their system is relatively complex, with detection and tracking performed on an FPGA and a DSP, respectively, and its total power consumption of 13 W is 3 W higher than that of our method, which is implemented entirely on an FPGA. For GPU implementations in the same application scenario, MP2Net [19] consumes 310 W, about 31 times the power consumed by the on-satellite method presented in this paper. Furthermore, according to Nvidia's official specifications, the GPU used by BMTC [18] has a theoretical power consumption of 250 W, about 25 times that of the method in this paper.
A comparison of the comprehensive performance indicators shows that, due to the general-purpose design of the GPU and the frequent movement of data between different memories and caches, the interface bandwidth is the bottleneck of the entire implementation, which prevents the GPU from reaching its maximum performance (typical board power: 350 W) and prevents further increases in the frame rate. The on-satellite method presented in this paper is deployed at the sensor edge and directly processes the output data from the sensor. The pipelined parallel design of multiple modules for pixel-by-pixel input and output effectively increases the processing frame rate and eliminates the bottleneck caused by frequent data transfers, meeting the real-time requirements of on-satellite processing. Moreover, the background compensation algorithm proposed in this paper removes the effects of background motion in the output video at the sensor edge. The concise moving-target extraction and target matching algorithms achieve real-time online MOT at the edge, with only a small deterioration in performance metrics.
5.2. Limitations and Shortcomings of the Proposed Method
The MOT algorithm and on-satellite implementation method proposed in this paper have some limitations and shortcomings.
First, the image output of the on-satellite sensor may be affected by jitter due to the shaking of the satellite. If a frame is jittery, the entire picture appears displaced, and the calculated displacement between frames includes the amount of jitter. As compensation is performed based on this displacement, the frame difference cancels out the effect of the jitter, and the real moving vehicle targets can be extracted. However, the on-satellite implementation in this paper does not include image pre-processing algorithms such as image stabilization, and thus the jitter of the target vehicles themselves cannot be eliminated.
Next, moving vehicles may be occluded by environmental objects such as bridges and buildings, and the movement trajectories of different vehicles may intersect. The algorithm for target association between multiple frames used in this paper transfers target IDs between frames by calculating the IoU and using an assignment algorithm. In the short term, continuous prediction by the Kalman filter can provide some occlusion resistance; however, the algorithm proposed in this paper is not optimized for scenarios with intersecting vehicle trajectories. Intersecting vehicle trajectories often occur at intersections, but for vehicles moving normally on a road, intersecting trajectories are rare. In future work, the direction vector of the vehicle’s movement can be combined with the IoU to avoid the effects of intersecting trajectories.
In addition, due to the high mobility of vehicles, they have different speeds and directions of movement. For a vehicle whose speed and direction of movement coincide with those of the background, the moving background compensation used in this paper will treat the vehicle as background and remove it, causing the target to be lost. This may occur when one or more objects are continuously tracked by a controlled motion mechanism. This situation can be avoided by protecting the currently tracked objects.
Finally, the MOT algorithm proposed in this paper was designed based only on manually extracted target features, so it is not universally applicable to a variety of different scenarios: its target feature extraction capability is weak, and the algorithm needs to be adjusted and adapted for different scenarios. However, these limitations and shortcomings are typical defects of traditional algorithms. In order to adapt to more complex scenarios, more algorithm steps and processing operations would need to be added, which would negatively impact the on-satellite implementation.
In summary, although in this paper, we have implemented the MOT algorithm on a satellite in the context of moving backgrounds, the various limitations and shortcomings of the method need to be addressed by implementing more on-satellite processing algorithms. An increase in the number of these algorithms will inevitably lead to a decrease in processing speed and an increase in resource usage. Trade-offs need to be made between various metrics to meet the requirements of real on-satellite applications.