Joint Adjustment Image Stabilization Method Based on Trajectories of Maritime Multi-Target Detection and Tracking

Liu, Fangjian; Li, Yuan; Wang, Mi

doi:10.3390/app16084029


by Fangjian Liu ¹,²,*, Yuan Li ² and Mi Wang ¹

¹ State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University, Wuhan 430079, China

² Key Laboratory of Technology in Geo-Spatial Information Processing and Application Systems, Chinese Academy of Sciences, Beijing 100190, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(8), 4029; https://doi.org/10.3390/app16084029

Submission received: 11 February 2026 / Revised: 12 April 2026 / Accepted: 16 April 2026 / Published: 21 April 2026

(This article belongs to the Section Earth Sciences)


Abstract

Existing technologies can achieve relative geometric correction and stabilization of geostationary satellite image sequences through fixed land scene matching or homonymous point adjustment. However, these methods heavily rely on fixed land areas, rendering them completely ineffective in vast ocean regions with only ship targets. Additionally, the trajectories of ship targets after processing still exhibit noticeable jitter, hindering motion information analysis. To address these issues, this paper proposes a joint image adjustment and stabilization method based on multi-target trajectories in marine environments: (1) An optimized target detection algorithm based on a multi-scale heterogeneous convolution module is introduced, which extracts background and target features through convolutions of different scales, enabling accurate detection and tracking of weak small targets in the image sequence frame by frame. (2) Curve fitting is performed on the detected positions of the same ship across multiple frames to simulate its motion trajectory under stabilized conditions. Combined with the prior assumption of uniform motion, an equal-division strategy is adopted to determine the corrected positions of the target in the image sequence. (3) The deviation correction values of multiple targets within the same frame are obtained, and based on the principle of intra-frame deviation consistency, precise image stabilization is achieved under multi-target constraints. Experiments based on Gaofen-4 satellite image sequences demonstrate that this method reduces the average position deviation of ship targets in the original images from 8.5 pixels (425 m) to 3.4 pixels (170 m), a decrease of approximately 59.41%, effectively improving the relative geometric accuracy of the image sequence and significantly eliminating target trajectory jitter.

Keywords:

remote sensing; object detection; image stabilization; deep network

1. Introduction

Geostationary Earth orbit (GEO) satellites, due to their high revisit frequency and wide coverage, are capable of continuously monitoring dynamic Earth surface processes [1,2,3]. However, the image sequences captured by these satellites often suffer from random errors caused by platform jitter, which reduces the accuracy of motion information extraction and target tracking. Traditional GEO image sequence stabilization methods primarily rely on fixed terrestrial scene matching or homologous point adjustment. These techniques utilize stable reference points on the terrestrial surface to estimate inter-frame image differences, thereby achieving relative registration of image sequences [4]. While effective in terrestrial regions, these methods face severe limitations in vast oceanic areas lacking fixed features. Consequently, sequence image matching performance significantly declines, leading to uncorrected ship targets in open ocean regions, whose trajectories exhibit obvious jitter. This jitter not only disrupts the continuity of target motion analysis but also compromises the reliability of derived motion parameters (e.g., velocity and direction), which is critical for maritime surveillance, vessel traffic management, and environmental monitoring.

The challenges posed by the marine environment stem from the lack of stable reference points and the low contrast between ship targets and complex ocean backgrounds [5,6,7]. In particular, small ships often manifest as weak targets with low signal-to-noise ratios, making their detection and tracking in image sequences highly susceptible to noise and background interference [8]. Existing target detection algorithms, including those based on traditional hand-crafted features or deep learning models, struggle to maintain high target detection and tracking accuracy in such scenarios. Moreover, inter-frame positioning bias correction for ship targets in oceanic regions remains an unresolved issue, as the absence of fixed reference points prevents the application of traditional scene matching techniques. Thus, the jitter in ship trajectories derived from biased image sequences complicates motion analysis and limits the practicality of GEO satellite data in applications requiring precise target tracking.

To address these challenges, this paper proposes a joint image adjustment and stabilization method for marine environments. The proposed framework combines ship target detection, trajectory modeling, and multi-target collaborative correction to achieve accurate geometric adjustment and stabilization in open-ocean scenes lacking stable reference points. First, ship targets are detected and tracked frame by frame in the image sequence. Then, their motion trajectories are modeled to suppress random fluctuations and provide reliable position estimates. Finally, inter-frame deviation correction is performed by jointly optimizing the consistency of multiple target trajectories, enabling accurate registration of the image sequence. Experiments on Gaofen-4 satellite image sequences demonstrate the effectiveness of the proposed method. The average positional deviation of ship targets is reduced from 8.5 pixels in the original images to 3.4 pixels after correction, corresponding to a reduction of 59.41%. The main contributions of this paper are as follows:

Multi-scale heterogeneous convolution detection algorithm: A heterogeneous network module integrating convolutions of different scales is designed. It jointly learns features through target-scale, background-scale, and strip convolutions, enhancing the detection accuracy and robustness for weak small ship targets.
Trajectory correction method based on curve model fitting: To address the absence of stable feature points in maritime scenes, a target trajectory fitting method based on curve models is proposed. Combined with the prior assumption of uniform motion, an equal-division correction strategy is introduced to determine the corrected positions of ship targets along the fitted trajectories.
Joint stabilization method with multi-target deviation consistency constraint: A joint image stabilization method based on multiple target trajectories is constructed. By imposing a consistency constraint on the deviations of multiple targets within the same frame, the stabilization parameters are solved, which reduces trajectory jitter.

The subsequent sections of this paper are organized as follows: In Section 2, target detection and tracking methods for remote sensing images and sequence image stabilization techniques are systematically reviewed, with a focus on analyzing the advantages and limitations of existing methods. Section 3 elaborates on the principles of the proposed weak target detection and tracking and image stabilization methods. Section 4 introduces the experimental verification analysis, covering the introduction of the Gaofen-4 satellite marine dataset, experimental environment configuration, evaluation metrics, and comparative experimental analysis, validating the effectiveness of the method. Section 5 summarizes the innovative points of the full text and outlines future research directions.

2. Related Work

2.1. Small-Target Detection and Tracking in Remote Sensing Images

Small-target detection is one of the long-standing challenges in remote sensing image interpretation. Early studies mainly relied on handcrafted features, such as Scale-Invariant Feature Transform (SIFT) [9] and Histogram of Oriented Gradients (HOG) [10], combined with traditional classifiers for target recognition. Although these methods achieved certain success in simple scenes, their representation capability was limited under complex backgrounds and weak-target conditions. With the rapid development of deep learning, general object detectors such as Faster R-CNN [11], SSD [12], and the YOLO series [13] have significantly improved detection performance through end-to-end feature learning. However, directly applying these methods to remote sensing images is still suboptimal, because small targets often occupy only a few pixels and are easily submerged in cluttered backgrounds.

To improve the detection performance of small targets, recent studies have mainly focused on multi-scale feature fusion, contextual information enhancement, and oriented object representation. For example, feature pyramid based frameworks such as DFPN-YOLO [14] and MFPNet [15] improve the representation of small and multi-scale objects by strengthening cross-level feature interaction. In addition, oriented object detection methods, such as R³Det [16] and Oriented R-CNN [17], further improve localization accuracy for arbitrarily oriented targets in aerial and remote sensing imagery. Transformer-based detectors have also been introduced to capture long-range dependencies and enhance global context modeling [18]. Nevertheless, weak and tiny targets in low-resolution remote sensing imagery remain difficult to detect reliably, especially when background interference is strong. Recent review studies have also pointed out that scale variation, limited pixels, dense distribution, and orientation diversity are still major bottlenecks in remote sensing small object analysis [19].

Compared with detection, tracking of small targets in remote sensing data is relatively less studied but is equally important for dynamic scene understanding. In satellite video and image-sequence analysis, small-target tracking must address additional difficulties such as weak appearance features, motion blur, temporal discontinuity, and low frame rate. Existing studies show that satellite-video tracking generally relies on three key components, namely target representation, search strategy, and model update, while recent developments increasingly integrate temporal motion cues and cross-frame feature association into the tracking framework [20]. Therefore, for remote sensing small targets, a practical framework should not only enhance weak target perception in single frames, but also exploit temporal continuity across frames to improve tracking stability and localization consistency.

2.2. Marine Target Detection and Tracking

Marine target detection, especially ship detection, is one of the most important application branches of remote sensing target analysis. In marine scenes, targets are often characterized by small size, low contrast, sparse texture, arbitrary orientation, and strong interference from sea clutter, clouds, waves, and illumination variation [21,22]. These characteristics make marine target detection more challenging than general object detection, particularly in open-ocean environments without stable reference points. Recent survey studies have shown that optical remote-sensing ship detection still faces persistent challenges in weak target perception, scale variation, dense and rotated distributions, and false alarms under complex marine backgrounds [23].

To address these challenges, numerous dedicated ship detection methods have been developed, including multi-scale feature enhancement, attention mechanisms, rotated bounding box regression, and lightweight frameworks. For example, Hu et al. improved small-ship recall under low-signal-to-noise conditions using an attention-guided multi-scale fusion network [24], while S-DETR employs a Transformer-based architecture with scale-aware modeling for real-time detection [25]. Other studies further enhance performance through multi-scale feature extraction, oriented representation, and background suppression [26,27,28]. Despite these advances, most methods focus on single-frame detection and overlook temporal motion continuity.

Marine target tracking in satellite image sequences remains more challenging due to weak targets, dynamic backgrounds, and the lack of reliable reference points. Existing approaches improve performance by combining target enhancement, candidate filtering, and multi-frame association. For instance, Yu et al. proposed a GF-4-based tracking framework using joint probability data association (JPDA) [29]. Recent methods have evolved toward joint detection-and-tracking frameworks, integrating motion correlation and spatiotemporal features to reduce missed detections and identity switches [30]. However, for open-ocean scenarios, effectively integrating weak target detection, trajectory modeling, and inter-frame deviation correction remains a key challenge.

2.3. Sequence Image Stabilization Technology

Sequence image stabilization aims to eliminate inter-frame biases caused by factors such as platform jitter, serving as a crucial preprocessing step to ensure accurate and continuous motion target trajectories [31]. Classical methods fall primarily into two categories: (1) Feature matching-based methods, which estimate inter-frame transformation models by extracting and matching stable feature points (e.g., SIFT features [9]) between images [32]. These methods perform well in textured terrestrial scenes but are inaccurate in homogeneous marine regions lacking obvious textures, where feature points are sparse and unstable. (2) Region matching-based methods, which directly align image regions using metrics such as mutual information and normalized cross-correlation [33]. These methods are sensitive to noise and grayscale changes, and their matching accuracy is easily disturbed in dynamic marine scenes.

Currently, dedicated research on image stabilization for open marine scenes is relatively scarce. Most methods rely on stable reference points in terrestrial regions and cannot be directly transplanted. In recent years, point-of-view-free stabilization methods combined with deep learning have become a research hotspot. For example, James et al. utilized deep learning optical flow networks to estimate the global motion field of video sequences and achieved effective stabilization through knowledge distillation techniques [34]. In [35], a real-time video stabilization method was proposed, achieving global adjustment and local repair without hardware dependence through self-supervised learning, significantly reducing jitter. However, these methods still lack targeted optimization for sub-pixel level systematic jitter and precise correction of weak small-target trajectories in high-resolution geostationary orbit (GEO) satellite image sequences.

2.4. Multi-Target Trajectory Modeling in Image Processing

Multi-target trajectory analysis models the positions of multiple targets in consecutive frames to reveal their motion patterns and provides geometric constraints for image stabilization. In the field of computer vision, trajectory analysis has been widely applied in tasks such as video surveillance and behavior recognition [36,37]. Common methods include using models such as Kalman filters [38] and particle filters [39] for trajectory fitting and prediction, as well as trajectory clustering and association through metric learning [40].

In remote sensing image processing, particularly in marine scenes, the application of trajectory analysis is still insufficient. Existing research focuses more on single-target tracking [41] and lacks work that utilizes the spatial geometric relationships between multi-target trajectories to jointly enhance the precision of overall scene stabilization. The latest progress indicates that using Graph Neural Networks (GNNs) for relationship modeling and joint optimization of multi-target trajectories has become an emerging direction. For example, Deng et al. proposed a multi-target tracking framework based on frame graphs and association graphs, enhancing feature stability by modeling the topological relationships (geometric shapes) between targets and achieving high-precision target detection and tracking [42]. Although this method cannot be directly applied to image stabilization, it provides a cutting-edge reference for the idea of achieving stabilization through multi-target trajectory joint adjustment.

In summary, existing research methods still need further improvement in achieving precise weak small-target detection, tracking, and sequence image stabilization in marine scenes, especially when processing GEO satellite image sequences without stable ground control points. There is an urgent need to construct a sequence image stabilization method based on multi-target trajectory joint adjustment.

3. Methodology

This section presents the proposed joint image adjustment and stabilization method for multi-target trajectories, which consists of two main components: weak small-target detection and tracking, and multi-target joint stabilization. As illustrated in Figure 1, to address the small scale and weak features of ship targets in low-resolution remote sensing images, a multi-scale convolution module (c3Km2) is integrated into the YOLOv11 framework to enhance feature representation and improve detection accuracy. The detection results are then fed into a tracking module, where Kalman filtering and Hungarian matching are applied for consistent multi-frame tracking. Finally, a multi-target joint stabilization method is developed by fitting ship motion trajectories with a curve model, estimating frame-wise deviations, and enforcing consistency constraints across multiple targets to achieve stable image sequences.

3.1. Small-Target Detection and Tracking

Satellite remote sensing images are typically single-band 16-bit low-resolution images, where targets occupy only a few pixels, resulting in insufficient texture and color information for direct detection. To address this issue, temporal feature compensation is applied by associating consecutive frames. Specifically, three consecutive frames are combined as a three-channel input, transforming point-like targets into line-like representations and effectively enhancing target features.
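The three-frame stacking described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `stack_three_frames`, the choice of frames t-1, t, t+1 as the triplet, and the division by 65535 for normalization are all assumptions.

```python
import numpy as np

def stack_three_frames(frames, t):
    """Stack frames t-1, t, t+1 (2-D uint16 arrays) into an H x W x 3 float array."""
    if t < 1 or t > len(frames) - 2:
        raise IndexError("t needs one neighbouring frame on each side")
    triplet = [frames[t - 1], frames[t], frames[t + 1]]
    stacked = np.stack(triplet, axis=-1).astype(np.float32)
    # Scale 16-bit counts to [0, 1]; this normalization is an assumption.
    return stacked / 65535.0

# Toy sequence: a one-pixel "ship" moving one column to the right per frame.
frames = []
for k in range(5):
    img = np.zeros((8, 8), dtype=np.uint16)
    img[4, 1 + k] = 60000
    frames.append(img)

sample = stack_three_frames(frames, t=2)
print(sample.shape)
# The same ship lights up a different column in each channel,
# so a point target becomes a short line across the three channels.
print([int(np.argmax(sample[4, :, ch])) for ch in range(3)])
```

In the toy sequence the ship occupies columns 2, 3, and 4 in the three channels, which is the point-to-line effect the paragraph describes.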

The constructed three-frame images are then fed into a modified YOLOv11-based detection network, consisting of a backbone, a neck, and three multi-scale detection heads (Figure 2). A multi-scale convolution module (c3Km2) is introduced into both the backbone and neck to enhance feature extraction. This module employs 3 × 3 and 7 × 7 kernels to capture target and background features, respectively, while “+”- and “×”-shaped kernels extract motion information across frames (Figure 3). Through parallel multi-branch design, the network captures multi-scale and multi-directional features, significantly improving weak target detection. The module employs a parallel branching strategy for feature extraction during training, while during inference, the convolution kernel parameters of multiple branches are merged to achieve more efficient inference.
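The text does not spell out how the branch kernels are merged at inference. One common way to do this, assumed here, is structural re-parameterization: if the parallel branches are purely linear convolutions whose outputs are summed, the smaller kernels can be zero-padded to the largest size and added into a single kernel. The sketch below checks this numerically with NumPy; the 3 × 3 size of the "+"- and "×"-shaped kernels and the absence of normalization layers are assumptions.

```python
import numpy as np

def conv2d_same(image, kernel):
    """Naive 'same' cross-correlation with zero padding (odd square kernels only)."""
    k = kernel.shape[0]
    pad = k // 2
    padded = np.pad(image, pad)
    out = np.zeros(image.shape, dtype=np.float64)
    for r in range(image.shape[0]):
        for c in range(image.shape[1]):
            out[r, c] = np.sum(padded[r:r + k, c:c + k] * kernel)
    return out

def embed(kernel, size=7):
    """Zero-pad a small odd kernel to size x size, keeping it centred."""
    return np.pad(kernel, (size - kernel.shape[0]) // 2)

rng = np.random.default_rng(0)

plus_mask = np.zeros((3, 3))
plus_mask[1, :] = 1
plus_mask[:, 1] = 1
cross_mask = np.eye(3) + np.fliplr(np.eye(3))
cross_mask[1, 1] = 1

k3 = rng.normal(size=(3, 3))                    # target-scale branch
k7 = rng.normal(size=(7, 7))                    # background-scale branch
k_plus = rng.normal(size=(3, 3)) * plus_mask    # "+"-shaped branch
k_cross = rng.normal(size=(3, 3)) * cross_mask  # "x"-shaped branch

# Training-time view: four parallel branches whose outputs are summed.
image = rng.normal(size=(16, 16))
multi_branch = sum(conv2d_same(image, k) for k in (k3, k7, k_plus, k_cross))

# Inference-time view: one merged 7 x 7 kernel.
merged = k7 + embed(k3) + embed(k_plus) + embed(k_cross)
single_branch = conv2d_same(image, merged)

print(np.allclose(multi_branch, single_branch))  # True
```

The equality holds because convolution is linear in the kernel. Any batch normalization inside a branch would first have to be folded into that branch's weights, which is not shown here.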

After detection, results from each frame are input into a tracking module. A Kalman filter is initialized for each target based on the first frame and predicts subsequent positions under a constant velocity assumption. The Hungarian algorithm is then used to associate predicted and detected targets by minimizing the matching cost. Matched targets are updated using detection results, while unmatched targets are evaluated for disappearance based on a threshold. This combination of Kalman filtering and Hungarian matching enables robust multi-target tracking across consecutive frames.
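The predict, associate, and update cycle described above can be sketched as follows, under stated assumptions. The noise matrices `Q` and `R`, the gating distance, and all function names are illustrative. To keep the block dependency-light, the assignment is found by exhaustive search over permutations; this returns the same minimum-cost matching as the Hungarian algorithm but is only practical for a handful of targets.

```python
import itertools
import numpy as np

# Constant-velocity model with state [x, y, vx, vy] and one frame per step.
F = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.array([[1, 0, 0, 0],
              [0, 1, 0, 0]], dtype=float)   # only the position is observed
Q = np.eye(4) * 0.01                         # assumed process noise
R = np.eye(2) * 1.0                          # assumed measurement noise

def predict(x, P):
    """Kalman prediction step."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    """Kalman correction step with a detected position z = (x, y)."""
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)
    return x + K @ (z - H @ x), (np.eye(4) - K @ H) @ P

def associate(predicted, detections, gate=5.0):
    """One-to-one matching with minimum total distance (exhaustive search)."""
    n, m = len(predicted), len(detections)
    cost = np.linalg.norm(predicted[:, None, :] - detections[None, :, :], axis=2)
    if n <= m:
        candidates = (list(enumerate(p)) for p in itertools.permutations(range(m), n))
    else:
        candidates = ([(i, j) for j, i in enumerate(p)]
                      for p in itertools.permutations(range(n), m))
    best, best_cost = [], np.inf
    for pairs in candidates:
        total = sum(cost[i, j] for i, j in pairs)
        if total < best_cost:
            best, best_cost = pairs, total
    # Pairs farther apart than the gate are treated as unmatched.
    return [(i, j) for i, j in best if cost[i, j] <= gate]

tracks = [(np.array([10.0, 10.0, 2.0, 0.0]), np.eye(4)),
          (np.array([30.0, 30.0, 0.0, -1.0]), np.eye(4))]
detections = np.array([[30.2, 28.9], [12.1, 10.2], [50.0, 50.0]])  # last one is clutter

predicted = [predict(x, P) for x, P in tracks]
positions = np.array([x[:2] for x, _ in predicted])
matches = associate(positions, detections)
tracks = [update(*predicted[i], detections[j]) for i, j in matches]
print(matches)  # [(0, 1), (1, 0)]
```

The clutter detection at (50, 50) stays unmatched, and each track's position is pulled from its prediction toward the matched detection. Track initiation and the disappearance threshold mentioned in the text are left out of this sketch.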

3.2. Multi-Target Joint Image Stabilization

Based on the aforementioned ship target detection and tracking algorithm, the detection and tracking results of ship targets are generated on a frame-by-frame basis. The positional information of the ship target in each frame image is extracted. Assuming the same ship target i is observed across a sequence of T images, its position is denoted as

P_{i t} = (x_{i t}, y_{i t}), \quad i = 1, 2, \dots, N, \quad t = 1, 2, \dots, T,

where $x_{i t}$ and $y_{i t}$ represent the horizontal and vertical coordinates of the ship target’s position in the t-th frame, respectively. Subsequently, a quadratic curve model is constructed to fit the trajectory of the ship target. Thereafter, the horizontal and vertical distances between the start and end positions of the ship’s motion trajectory are calculated to determine whether the trajectory is predominantly horizontal or vertical. The criterion is expressed as follows:

\begin{cases} | x_{i 1} - x_{i T} | \geq | y_{i 1} - y_{i T} |, & \text{horizontal-dominated} \\ | x_{i 1} - x_{i T} | < | y_{i 1} - y_{i T} |, & \text{vertical-dominated} \end{cases}

Based on the determination result, a quadratic curve model is established, and the parameters of the curve model are solved using the least squares method. In the case of a horizontal-dominated trajectory, the quadratic curve model is given as follows:

y = f_{x} (x) = a x^{2} + b x + c

In the case of a vertical-dominated trajectory, the quadratic curve model is given as follows:

x = f_{y} (y) = a_{y} y^{2} + b_{y} y + c_{y}

where $(a, b, c)$ are the model parameters for the horizontal-dominated case, and $(a_{y}, b_{y}, c_{y})$ are the model parameters for the vertical-dominated case. Using the positional information of the i-th ship target obtained from T frames, denoted as $P_{i t} = (x_{i t}, y_{i t})$, $i = 1, 2, \dots, N$, $t = 1, \dots, T$, the parameters of the quadratic model are solved based on the least squares method.
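The dominance test and the least-squares quadratic fit can be sketched with NumPy as follows. This is an illustration under assumptions: the function name `fit_trajectory` and the toy track are hypothetical, and `numpy.polyfit` stands in for whatever least-squares solver the authors used.

```python
import numpy as np

def fit_trajectory(xs, ys):
    """Return the dominant direction and quadratic coefficients (highest power first)."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    # Compare start-to-end displacement along each axis.
    horizontal = abs(xs[0] - xs[-1]) >= abs(ys[0] - ys[-1])
    if horizontal:
        coeffs = np.polyfit(xs, ys, deg=2)   # y = a x^2 + b x + c
    else:
        coeffs = np.polyfit(ys, xs, deg=2)   # x = a_y y^2 + b_y y + c_y
    return ("horizontal" if horizontal else "vertical"), coeffs

# Toy track: a gently curved path with a few tenths of a pixel of jitter.
xs = np.arange(10, dtype=float)
jitter = np.array([0.3, -0.2, 0.1, -0.4, 0.2, 0.0, -0.1, 0.3, -0.3, 0.1])
ys = 0.01 * xs**2 + 0.5 * xs + 3.0 + jitter

mode, coeffs = fit_trajectory(xs, ys)
residuals = np.polyval(coeffs, xs) - ys
print(mode, coeffs, float(np.sum(residuals**2)))
```

Fitting along the dominant axis keeps the independent variable well spread, which avoids an ill-conditioned fit when a ship moves almost vertically in the image.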

The corrected position $P_{i t}^{'}$ ($i = 1, 2, \dots, N$; $t = 1, \dots, T$) of the i-th ship target in each frame is then determined from the fitted trajectory; it corresponds to the ship target location after stabilization correction. Because ships move approximately uniformly over short periods during marine navigation [43], as illustrated in Figure 4, the corrected ship target positions can be determined by equally dividing the fitted curve. The processing steps are as follows:

Assumptions:

Let t be the time index ( $t = 1, 2, \dots, T$ ), $t_{0}$ the start time, and $t_{T}$ the end time.
Let i be the vessel index ( $i = 1, 2, \dots, N$ ).
Let $(x_{i t}, y_{i t})$ denote the image coordinates of the i-th ship at time t.
Let $P_{i t}^{'} = (x_{i t}^{'}, y_{i t}^{'})$ denote the image coordinates of the i-th ship at time t obtained from the quadratic curve model fitting.
Let $(d x_{i t}, d y_{i t})$ represent the image offset at time t.
Let $(\Delta x_{i t}, \Delta y_{i t})$ represent the offset of the i-th ship at time t.

The positions of the i-th ship target on the fitted curve in the initial and final frames are determined as $P_{i t}^{'} = (x_{i t}^{'}, y_{i t}^{'})$ with $t \in \{1, T\}$. Specifically, when the ship trajectory is horizontal-dominated, the fitted vertical coordinate is calculated based on the horizontal coordinates of the ship target in the initial and final frames; when the trajectory is vertical-dominated, the fitted horizontal coordinate is calculated based on the vertical coordinates. The formulas are given as follows:

For the horizontal-dominated case,

x_{i t}^{'} = x_{i t}, \quad y_{i t}^{'} = f_{x} (x_{i t})

For the vertical-dominated case,

x_{i t}^{'} = f_{y} (y_{i t}), \quad y_{i t}^{'} = y_{i t}

where $(x_{i t}, y_{i t})$, $t \in \{1, T\}$, are the ship target positions in the initial and final frames. The curve length from $P_{i 1}^{'}$ to $P_{i T}^{'}$ along the fitted curve is calculated and denoted as $D_{1 T}$. Based on the uniform motion characteristic of the ship target, the fitted curve is divided equally to determine the corrected ship target position $P_{i t}^{'}$ ($i = 1, 2, \dots, N$; $t = 1, \dots, T$) in each frame. Specifically, the curve distance between ship targets in adjacent frames is calculated as

D = \frac{D_{1 T}}{T - 1}

Starting from the first frame, the corrected position $P_{i t}^{'}$ of the i-th ship target in each frame is successively determined along the curve such that the motion distance between adjacent frames is equal, i.e.,

[P_{i t}^{'}, P_{i, t + 1}^{'}] = D, \quad t = 1, 2, \dots, T - 1

where $[P_{i t}^{'}, P_{i, t + 1}^{'}]$ denotes the curve distance between $P_{i t}^{'}$ and $P_{i, t + 1}^{'}$ along the curve model. In practice, the curve can be discretized according to the image resolution, and the distances between successive discrete points can be accumulated to obtain the curve length between two points, thereby yielding the corrected position $P_{i t}^{'}$ of the i-th ship target.
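The equal-division step can be sketched by discretizing the fitted curve, accumulating segment lengths, and interpolating at equally spaced arc lengths. The function name `equal_division`, the sampling step of 0.01 pixel, and the example coefficients are assumptions for illustration. For a vertical-dominated trajectory the same routine applies with the roles of x and y swapped.

```python
import numpy as np

def equal_division(coeffs, u_start, u_end, n_frames, step=0.01):
    """Place n_frames points at equal arc-length spacing on v = polyval(coeffs, u).

    Returns the (u, v) points and the discretized curve length D_1T.
    """
    n_samples = max(int(abs(u_end - u_start) / step) + 1, 2)
    u = np.linspace(u_start, u_end, n_samples)
    v = np.polyval(coeffs, u)
    segment = np.hypot(np.diff(u), np.diff(v))
    s = np.concatenate(([0.0], np.cumsum(segment)))   # cumulative length along the curve
    targets = np.linspace(0.0, s[-1], n_frames)       # spacing D = D_1T / (T - 1)
    u_eq = np.interp(targets, s, u)
    v_eq = np.interp(targets, s, v)
    return np.column_stack([u_eq, v_eq]), s[-1]

# Horizontal-dominated example: y = 0.01 x^2 + 0.5 x + 3, x from 0 to 9, T = 10 frames.
points, length = equal_division([0.01, 0.5, 3.0], 0.0, 9.0, n_frames=10)
gaps = np.hypot(*np.diff(points, axis=0).T)
print(points[0], points[-1])
print(length / 9, gaps)
```

The first and last corrected positions coincide with the curve points at the start and end coordinates, and the gaps between consecutive corrected positions are equal up to the discretization error.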

Furthermore, based on practical experience, it is known that under the premise of accurate geometric calibration, systematic geometric positioning errors in the imagery have been eliminated, leaving only random errors caused by factors such as satellite platform jitter. Therefore, when a sufficient number of observations are available, the random errors within the imagery can cancel each other out; in other words, the observation error for a single target point tends to approach zero over time.

0 = \sum_{t = 1}^{T} (\Delta x_{i t} + \Delta y_{i t})

where $\Delta x_{i t}$ and $\Delta y_{i t}$ represent the error components of a single target i in the X-direction and Y-direction, respectively, over a single time interval.

In summary, adopting the adjustment theory for high-precision positioning of the sequence frames, let the initial transformation model parameters at time t be $(e_{0}^{t}, e_{1}^{t}, e_{2}^{t}, f_{0}^{t}, f_{1}^{t}, f_{2}^{t})$, which remain consistent for all targets at time t. The image offset at the i-th target can then be modeled by the following equations:

\begin{cases} \Delta x_{i 1} = e_{0}^{1} + e_{1}^{1} \cdot x_{i 1}^{'} + e_{2}^{1} \cdot y_{i 1}^{'} - x_{i 1} \\ \Delta y_{i 1} = f_{0}^{1} + f_{1}^{1} \cdot x_{i 1}^{'} + f_{2}^{1} \cdot y_{i 1}^{'} - y_{i 1} \\ \Delta x_{i 2} = e_{0}^{2} + e_{1}^{2} \cdot x_{i 2}^{'} + e_{2}^{2} \cdot y_{i 2}^{'} - x_{i 2} \\ \Delta y_{i 2} = f_{0}^{2} + f_{1}^{2} \cdot x_{i 2}^{'} + f_{2}^{2} \cdot y_{i 2}^{'} - y_{i 2} \\ \vdots \\ \Delta x_{i T} = e_{0}^{T} + e_{1}^{T} \cdot x_{i T}^{'} + e_{2}^{T} \cdot y_{i T}^{'} - x_{i T} \\ \Delta y_{i T} = f_{0}^{T} + f_{1}^{T} \cdot x_{i T}^{'} + f_{2}^{T} \cdot y_{i T}^{'} - y_{i T} \\ 0 = \sum_{t = 1}^{T} (\Delta x_{i t} + \Delta y_{i t}) \end{cases}

Meanwhile, under the premise of rigorous geometric calibration, the internal geometric offsets of the image remain relatively stable within a short period; that is, multiple targets in the same image exhibit identical offsets between consecutive frames. Assuming there are N targets in the image, after applying the affine transformation, the differences between the estimated offsets $(\Delta x_{i t}, \Delta y_{i t})$ for all targets at time t should be minimized; that is, the deviation vectors of all targets within the same image at the same moment should be as consistent as possible. Therefore, the offsets of different targets at time t can be modeled through constraints among multiple targets as follows:

\begin{cases} \Delta x_{1 t} = e_{0}^{t} + e_{1}^{t} \cdot x_{1 t}^{'} + e_{2}^{t} \cdot y_{1 t}^{'} - x_{1 t} \\ \Delta y_{1 t} = f_{0}^{t} + f_{1}^{t} \cdot x_{1 t}^{'} + f_{2}^{t} \cdot y_{1 t}^{'} - y_{1 t} \\ \Delta x_{2 t} = e_{0}^{t} + e_{1}^{t} \cdot x_{2 t}^{'} + e_{2}^{t} \cdot y_{2 t}^{'} - x_{2 t} \\ \Delta y_{2 t} = f_{0}^{t} + f_{1}^{t} \cdot x_{2 t}^{'} + f_{2}^{t} \cdot y_{2 t}^{'} - y_{2 t} \\ \vdots \\ \Delta x_{N t} = e_{0}^{t} + e_{1}^{t} \cdot x_{N t}^{'} + e_{2}^{t} \cdot y_{N t}^{'} - x_{N t} \\ \Delta y_{N t} = f_{0}^{t} + f_{1}^{t} \cdot x_{N t}^{'} + f_{2}^{t} \cdot y_{N t}^{'} - y_{N t} \\ \min \sum_{i = 1}^{N} \sum_{j = i + 1}^{N} {\| {\bar{v}}_{i t} - {\bar{v}}_{j t} \|}^{2} \\ {\bar{v}}_{i t} = (\Delta x_{i t}, \Delta y_{i t}) \\ {\| {\bar{v}}_{i t} - {\bar{v}}_{j t} \|}^{2} = {(\Delta x_{i t} - \Delta x_{j t})}^{2} + {(\Delta y_{i t} - \Delta y_{j t})}^{2} \end{cases}
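A short worked identity may help in reading the pairwise consistency term. It is a standard algebraic fact rather than something stated in the paper, and the symbol $m_{t}$ for the mean deviation vector at time t is introduced here only for illustration.

```latex
% Mean deviation vector over the N targets at time t (notation assumed here)
m_{t} = \frac{1}{N} \sum_{i=1}^{N} \bar{v}_{it}

% Pairwise differences equal N times the spread about the mean
\sum_{i=1}^{N} \sum_{j=i+1}^{N} \| \bar{v}_{it} - \bar{v}_{jt} \|^{2}
  = \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \| \bar{v}_{it} - \bar{v}_{jt} \|^{2}
  = N \sum_{i=1}^{N} \| \bar{v}_{it} \|^{2} - \Big\| \sum_{i=1}^{N} \bar{v}_{it} \Big\|^{2}
  = N \sum_{i=1}^{N} \| \bar{v}_{it} - m_{t} \|^{2}
```

Read this way, the objective is zero exactly when all targets in a frame share the same deviation vector. It is also unchanged when every deviation is shifted by the same amount, so the pairwise term alone does not appear to fix the constant terms of the transformation; the temporal zero-sum condition of the single-target system is a natural candidate for constraining them.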

By jointly solving the two systems of equations above (Formulas (1) and (2)), the affine transformation parameters of the image at each time t ($t = 1, 2, \dots, T$) can be obtained. Subsequently, the offsets $(\Delta x_{i t}, \Delta y_{i t})$ can be derived to correct the images, thereby achieving positional consistency across the image sequence.
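As a simplified illustration of the per-frame part of this solve, the sketch below fits the six affine parameters of one frame by ordinary least squares from the corrected positions and the observed positions of the targets in that frame. It is not the authors' joint adjustment: it treats each frame independently, ignores the temporal zero-sum condition, and needs at least three non-collinear targets. The function name `solve_frame_affine` and the example coordinates are hypothetical.

```python
import numpy as np

def solve_frame_affine(corrected, observed):
    """Least-squares fit of x ~ e0 + e1 x' + e2 y' and y ~ f0 + f1 x' + f2 y'.

    corrected: (N, 2) array of P'_it; observed: (N, 2) array of P_it, same frame t.
    Returns e, f (three parameters each) and the residuals (dx, dy) per target.
    """
    corrected = np.asarray(corrected, dtype=float)
    observed = np.asarray(observed, dtype=float)
    if len(corrected) < 3:
        raise ValueError("at least three targets are needed for an affine fit")
    A = np.column_stack([np.ones(len(corrected)), corrected])   # columns: 1, x', y'
    params, *_ = np.linalg.lstsq(A, observed, rcond=None)       # shape (3, 2)
    residuals = A @ params - observed
    return params[:, 0], params[:, 1], residuals

# Four targets whose observed positions are the corrected ones shifted by (3, -2) pixels.
corrected = np.array([[100.0, 200.0], [400.0, 150.0], [250.0, 600.0], [700.0, 500.0]])
observed = corrected + np.array([3.0, -2.0])

e, f, residuals = solve_frame_affine(corrected, observed)
print(np.round(e, 6), np.round(f, 6))
```

For a pure frame shift the fit recovers the shift in the constant terms and an identity linear part, and the residuals vanish. With noisy detections the residuals give the per-target deviations that the consistency constraint is meant to keep small.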

4. Experiments

4.1. Datasets

This study conducted experiments based on Gaofen-4 (GF-4) satellite data. The GF-4 satellite was launched on 29 December 2015, and is China’s first geostationary orbit (GEO) high-resolution optical remote sensing satellite. It operates at an altitude of approximately 36,000 km, enabling continuous observation of fixed areas with revisit intervals ranging from minutes to hours. The satellite is equipped with a staring optical camera system, providing a swath width exceeding 400 km. The spatial resolution of the visible light sensor at nadir is 50 m. Its image sequences are particularly suitable for dynamic monitoring applications such as disaster response, environmental monitoring, and target tracking.

To enhance the robustness of the target detection and tracking models, this paper constructed an object detection and tracking sample dataset by stacking three consecutive frames of GF-4 imagery, as shown in Figure 5. This method leverages the temporal continuity of the sequence to improve feature representation and reduce noise. The sample construction process includes:

Frame selection: Three consecutive frames are selected from the sequence, ensuring they capture the same geographic area with minimal time intervals. This aims to preserve temporal dynamics while maintaining spatial consistency.
Stacking operation: The selected frames are stacked along the time dimension to create a three-channel input, where each channel corresponds to a different frame image, thereby forming a three-channel RGB-like image. The original targets, which appear as gray patches, are transformed into red–green–blue point sequences.
Sample generation: The stacked images are divided into non-overlapping patches of size $512 \times 512$ pixels, which collectively constitute the complete dataset. Sample annotation is performed based on the synthetic sequence point targets from the three frames, with bounding boxes annotated accordingly. During testing, the predicted bounding box center points are used as the target positions in the current frame. The dataset contains a total of 260 target sequences, each consisting of 30 frames. The training, validation, and test sets are split in an 8:1:1 ratio.
Quality control: Each sample undergoes quality checks to verify the absence of severe artifacts such as cloud cover or sensor saturation. Significantly contaminated samples are excluded to ensure training reliability.
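The patch generation and the 8:1:1 split can be sketched as follows. Several details are assumptions rather than facts from the paper: incomplete tiles at the image border are skipped, the split is made at the level of target sequences, and the shuffle uses a fixed seed. The names `tile_image` and `split_ids` are hypothetical.

```python
import random
import numpy as np

def tile_image(stacked, patch=512):
    """Yield (row, col, tile) for non-overlapping tiles; incomplete edge tiles are skipped."""
    h, w = stacked.shape[:2]
    for r in range(0, h - patch + 1, patch):
        for c in range(0, w - patch + 1, patch):
            yield r, c, stacked[r:r + patch, c:c + patch]

def split_ids(ids, ratios=(8, 1, 1), seed=0):
    """Shuffle ids reproducibly and split them into train, validation, and test lists."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# A stacked three-channel image that is not a multiple of the patch size.
stacked = np.zeros((1200, 1100, 3), dtype=np.uint8)
tiles = list(tile_image(stacked))
print(len(tiles), tiles[0][2].shape)

# 260 target sequences split 8:1:1.
train, val, test = split_ids(range(260))
print(len(train), len(val), len(test))  # 208 26 26
```

Splitting by sequence rather than by patch keeps frames of the same ship out of both training and test sets, which matters for the disjointness the section relies on.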

In addition, the stabilization results of the proposed method were evaluated against a deviation ground truth derived from the original images. This ground truth was obtained through land-based stabilization references: image matching and deviation calculation were achieved by applying the KAZE method [44] to land regions within the image sequence. Since land regions have stable and texture-rich image features, image registration with a mature standard algorithm can yield stabilization ground truth with an average deviation within 1 pixel. Additionally, manual verification of the stabilization results was conducted to further ensure the accuracy of the deviation calculation. The final stabilization results can thus be regarded as geometrically stable ground truth for evaluating the positional deviations of marine targets. It is worth noting that the image sequences used for image stabilization evaluation in this paper are completely separate from those used for detection training. The existing image data were divided so that part of the images was selected for constructing the training sample set for object detection, while another independent set of image sequences was used for testing. The two sets are disjoint, which ensures the objectivity of the results.

4.2. Experimental Setting

The experimental procedures in this study were conducted in a Linux operating system environment. The hardware configuration included a workstation equipped with an NVIDIA GeForce RTX 4090 GPU with 24 GB of memory. The hyperparameters used during model training are summarized in Table 1.

4.3. Evaluation Metrics

To evaluate the effectiveness of the proposed method, this paper introduces evaluation metrics in two aspects: target detection and tracking, and image stabilization performance. The specific formulas are as follows.

4.3.1. Target Detection and Tracking Metrics

Precision (the proportion of true positives among all predicted positives):

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

Recall (the proportion of actual positives that are correctly predicted):

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

where TP denotes the number of true positive detections, FP denotes the number of false positives, and FN denotes the number of false negatives.
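
As a minimal sketch of the two formulas above, the following function computes precision and recall from detection counts. The counts in the example are hypothetical, and returning 0.0 when a denominator is zero is a convention chosen here, not one stated in the paper.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from detection counts.

    A metric whose denominator is zero is reported as 0.0
    (a convention chosen for this sketch).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts, for illustration only.
p, r = precision_recall(tp=90, fp=10, fn=20)
print(f"precision={p:.4f}, recall={r:.4f}")
```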

MOTA (Multiple-Object Tracking Accuracy, used for multi-object tracking tasks, which comprehensively considers false positives, false negatives, and ID switches):

$$\mathrm{MOTA} = 1 - \frac{FP + FN + IDSw}{GT}$$

where FP is the number of false positives, FN is the number of false negatives, IDSw is the number of identity switches, and GT is the number of ground truth targets.
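
The MOTA formula translates directly into code. In the sketch below all counts are hypothetical and are assumed to be summed over every frame of the sequence, as in the usual definition of the metric. Because the numerator can exceed GT, MOTA can be negative.

```python
def mota(fp, fn, idsw, gt):
    """MOTA = 1 - (FP + FN + IDSw) / GT.

    All counts are assumed to be accumulated over every frame.
    """
    if gt <= 0:
        raise ValueError("gt must be positive")
    return 1.0 - (fp + fn + idsw) / gt


# Hypothetical counts, for illustration only.
score = mota(fp=50, fn=150, idsw=35, gt=1000)
print(f"MOTA = {score:.3f}")

# A tracker with more errors than ground-truth objects scores below zero.
poor = mota(fp=600, fn=500, idsw=0, gt=1000)
print(f"MOTA = {poor:.3f}")
```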

4.3.2. Image Stabilization Deviation Evaluation Metric

In this paper, stabilization based on land regions (following existing methods in the literature) serves as the reference and is regarded as the ground truth for stabilization processing. The pixel deviation between the land-area-based correction results and, respectively, the original sequence images and the stabilized images produced by the proposed method is used as the stabilization performance metric:

$$D_{s} = \sqrt{(x_{2} - x_{1})^{2} + (y_{2} - y_{1})^{2}}$$

where $(x_{1}, y_{1})$ and $(x_{2}, y_{2})$ are the positions of the same point in the land-referenced result and in the image under evaluation (the original or the stabilized image).
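
A short sketch of the deviation metric, checked against two rows of Table 4. Treating the first argument as the land-referenced position and the second as the position in the evaluated image is an assumption of this sketch; the metric is symmetric, so the order does not affect the value.

```python
import math


def deviation(p_ref, p_eval):
    """Euclidean pixel deviation D_s between two positions (x, y)."""
    (x1, y1), (x2, y2) = p_ref, p_eval
    return math.hypot(x2 - x1, y2 - y1)


# Frame 5, original image (Table 4): horizontal -3.0, vertical 4.0.
d5 = deviation((0.0, 0.0), (-3.0, 4.0))

# Frame 6, original image (Table 4): horizontal -11.1, vertical 7.0.
d6 = deviation((0.0, 0.0), (-11.1, 7.0))

print(round(d5, 1), round(d6, 1))
```

Both values agree with the absolute deviations listed for these frames in Table 4 (5.0 and 13.1 pixels).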

4.4. Comparative Experiments

This section presents the complete experimental process and results. To verify the effectiveness of the proposed target detection and tracking method as well as the image stabilization method, experiments were conducted based on the Gaofen-4 sample set. Specifically, the experiments are divided into two parts: a comparative experiment on target detection and tracking, and a validation experiment on image stabilization.

4.4.1. Comparative Experiment on Target Detection and Tracking

In the comparative experiment, a multi-dimensional comparison scheme was designed to evaluate the performance of the proposed remote sensing target detection and tracking method. For the target detection part, general detection frameworks (Faster R-CNN [11], YOLOv11 [45]) and mainstream remote-sensing-specific detectors (e.g., R-DFPN [46]) were used as baseline methods. Due to the small size of the target, an Intersection-over-Union (IoU) threshold of 0.2 was set to evaluate detection accuracy, in order to effectively avoid judgment errors caused by minor positional deviations [47]. For the target tracking part, commonly used tracking methods such as DeepSort [48], ByteTrack [49], and TrackFormer [50] were compared, with tracking accuracy evaluated on sequence data, focusing on target loss rate and identity-switch frequency during tracking. To ensure fairness, all hyper-parameters other than those specific to each model were kept as consistent as possible.
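
To illustrate why a low IoU threshold suits very small targets, the sketch below computes the IoU of two axis-aligned boxes. The box coordinates are hypothetical, and treating a detection as matched when IoU is greater than or equal to the threshold is an assumption; the paper states only the threshold value of 0.2.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


IOU_THRESHOLD = 0.2  # threshold value used in the paper

gt_box = (0, 0, 3, 3)    # hypothetical 3 x 3 pixel target
pred_box = (1, 1, 4, 4)  # same size, shifted by one pixel in x and y

score = iou(gt_box, pred_box)  # intersection 4, union 14
is_match = score >= IOU_THRESHOLD
print(f"IoU = {score:.3f}, match at 0.2: {is_match}, match at 0.5: {score >= 0.5}")
```

For a target only three pixels wide, a one-pixel offset already lowers the IoU to about 0.29, so the prediction would be rejected at the common 0.5 threshold but accepted at 0.2.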

Table 2 compares target detection performance; the differences among methods stem mainly from their network architectures and feature extraction capabilities. Faster R-CNN, a classic two-stage detector, achieves reasonable results but performs worst among the compared methods because it is not specifically optimized for small and dim targets. YOLOv11 improves the backbone network structure and the design of its multi-scale detection heads, which enhances performance. R-DFPN performs well in remote sensing target detection by refining the feature pyramid network. However, all of these methods leave room for improvement when detecting small and dim targets in complex scenes where target features are insufficient. In contrast, the proposed method is optimized for small-target characteristics and achieves the best detection precision and recall. Compared with the YOLOv11 baseline, precision increases from 88.86% to 90.13% (about 1.3 percentage points) and recall from 81.67% to 82.45% (about 0.8 percentage points). This improvement is attributed to the multi-scale feature fusion mechanism, which better addresses challenges such as insufficient target features in remote sensing images.

As shown in Figure 6, visual comparison results are provided for qualitative analysis, where the first column presents the manually annotated ground truth bounding boxes, and the subsequent columns display the detection outputs of different algorithms. Experimental results demonstrate that, compared with other competing methods, the proposed method exhibits superior capability in detecting small and dim targets, achieving more complete coverage of all weak instances. Moreover, several baseline methods generate a significant number of false alarms in fragmented cloud regions, indicating their limited ability to suppress background clutter. In contrast, the proposed method, through optimized contextual modeling, effectively suppresses false positives while maintaining a high recall rate. Overall, the results indicate that existing approaches still face the challenge of coexisting missed detections and false alarms when handling low-resolution targets, whereas the proposed improvement strategy significantly enhances detection consistency and robustness under complex backgrounds.

Table 3 presents the quantitative comparison of target tracking methods. DeepSort and ByteTrack, which are based on traditional visual tracking algorithms, perform adequately in simple scenarios but are prone to tracking failures under complex occlusions and target deformations. TrackFormer, an end-to-end Transformer-based tracking method, has stronger modeling capability in theory; however, it still suffers from frequent identity switches in such complex scenes. The tracking method proposed in this paper achieves a MOTA score of 74.97%, an improvement of approximately 11% over the classical DeepSort. As illustrated in Figure 7, the proposed tracking algorithm performs strongly on the GF-4 satellite image sequence. The other methods tend to lose targets or produce broken trajectories when object features are insufficient or cloud occlusion is severe. In contrast, the proposed method maintains more stable target locking throughout the tracking process and is more robust to short-term partial occlusion and background clutter. Although occasional target loss or misassociation may still occur in extreme scenarios with heavy occlusion and intense background interference, the overall results indicate that the proposed algorithm is more practical.
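
The identity-switch count (IDSw) reported in Table 3 can be illustrated with a small sketch for a single ground-truth target. The function name, the input layout, and the rule that a frame with a missed target does not itself count as a switch are assumptions for illustration, following the common convention rather than a definition given in the paper.

```python
def count_id_switches(assigned_ids):
    """Count identity switches for one ground-truth target.

    assigned_ids: per-frame track ID matched to the target, or None when
    the target is missed in that frame. A switch is counted whenever the
    matched ID differs from the most recent matched ID.
    """
    switches = 0
    last_id = None
    for track_id in assigned_ids:
        if track_id is None:
            continue  # a miss is a false negative, not a switch
        if last_id is not None and track_id != last_id:
            switches += 1
        last_id = track_id
    return switches


# Hypothetical sequence: tracked as 7, lost for one frame,
# re-acquired as 9, then reassigned to 7.
n_switches = count_id_switches([7, 7, None, 9, 9, 7])
print(n_switches)
```

Summing this count over all ground-truth targets gives the IDSw term used in MOTA.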

4.4.2. Image Stabilization Validation Experiment

Based on the obtained target detection and tracking results, this study performed stabilization processing on the sequence frame images. Tests were conducted on multiple real remote sensing sequences with significant jitter, and the results were compared with the deviations of the original images to validate the effectiveness of the proposed method. In addition, since other stabilization methods primarily target fixed land-based areas and are not suitable for maritime scenarios, comparative experiments were not conducted in this context.

The results of the image stabilization validation are presented in Table 4. The mean ship target deviation in the original geostationary optical satellite image sequence is 8.5 pixels (a ground distance error of approximately 425 m); after processing with the proposed method, the remaining mean deviation is 3.4 pixels (approximately 170 m), a decrease of approximately 59.41%. The proposed method therefore substantially improves the stability of marine image sequences in regions where land-based stabilization methods cannot be applied.
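
The summary figures can be reproduced from the per-frame absolute deviations in Table 4, as sketched below. The rounded means (8.5 and 3.4 pixels) would give a reduction of exactly 60.0%, whereas the per-frame values give about 59.4%, which is consistent with the reported 59.41% up to rounding of the table entries. The ground sampling distance is not stated in this section; the value used here is simply the one implied by 8.5 pixels corresponding to 425 m.

```python
# Absolute deviations per frame (pixels), copied from Table 4.
original = [
    0.0, 4.0, 4.0, 3.2, 5.0, 13.1, 12.1, 13.1, 10.3, 4.5,
    7.7, 15.1, 14.2, 18.9, 12.6, 5.4, 1.4, 4.1, 4.5, 9.1,
    12.7, 16.9, 1.0, 4.5, 10.1, 13.1, 8.1, 5.1, 8.1, 11.7,
]
stabilized = [
    0.0, 1.2, 0.8, 1.6, 1.8, 0.4, 0.6, 1.4, 1.7, 2.2,
    1.4, 3.6, 3.1, 3.2, 4.4, 3.9, 5.7, 6.3, 5.9, 5.6,
    4.6, 2.6, 4.9, 2.6, 2.8, 4.1, 5.2, 6.3, 7.2, 7.8,
]

mean_orig = sum(original) / len(original)
mean_stab = sum(stabilized) / len(stabilized)
reduction = (mean_orig - mean_stab) / mean_orig * 100.0

# Metres per pixel implied by the paper's figures (425 m for 8.5 pixels).
METERS_PER_PIXEL = 425 / 8.5

print(f"mean original:   {mean_orig:.2f} px")
print(f"mean stabilized: {mean_stab:.2f} px")
print(f"reduction:       {reduction:.2f} %")
print(f"implied scale:   {METERS_PER_PIXEL:.0f} m/px")
```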

After the stabilization process, the target trajectories were re-plotted using the proposed target detection and tracking method. As shown in Figures 8, 9, and 10, the target trajectories in the stabilized image sequences are markedly improved. The original trajectories exhibit significant jitter and discontinuity, whereas after stabilization the target motion trajectories become smooth and continuous, and the trajectory drift caused by jitter is removed. Comparative analysis indicates that the stabilized trajectories reflect the true motion state of the targets more accurately, providing a reliable data foundation for subsequent motion analysis and behavior recognition. This smoothing not only improves the visual effect but, more importantly, enhances the stability and accuracy of target tracking, which verifies the effectiveness of the proposed stabilization method in practical applications.

In addition, comparative experiments were conducted to evaluate the stabilization performance under different numbers of available targets, as shown in Table 5. Stabilization performance decreases gradually as the number of available targets is reduced: the mean absolute deviation after stabilization is 3.2 pixels with three or more targets, 4.5 pixels with two targets, and 5.2 pixels with a single target. This suggests that, under joint stabilization, the proposed method exploits multi-target constraints to estimate inter-frame displacement more accurately and to suppress imaging jitter more effectively. Nevertheless, even when only a single target is available as a constraint, the mean absolute deviation after stabilization (5.2 pixels) remains well below that of the original imagery (8.1 pixels), demonstrating that the method still stabilizes effectively and retains a degree of robustness. Overall, the proposed method performs best under multi-target conditions and retains a useful correction capability in sparse-target scenarios, thereby improving image stabilization and ship trajectory correction in ocean scenes without fixed reference points.
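
One simple way to see why additional targets help is to treat each target's correction offset as a noisy estimate of the same frame-level shift and to combine them. The sketch below uses the component-wise mean, which is the least-squares estimate of a single shared translation. This is an illustrative simplification and not necessarily the joint adjustment used in the paper; the function names and offset values are hypothetical.

```python
def joint_frame_shift(offsets):
    """Least-squares estimate of one shared translation for a frame.

    offsets: list of (dx, dy), each the corrected position minus the
    detected position of one target in the same frame. Minimizing the
    sum of squared residuals to a common shift gives the mean.
    """
    if not offsets:
        raise ValueError("at least one target is required")
    n = len(offsets)
    return (sum(dx for dx, _ in offsets) / n,
            sum(dy for _, dy in offsets) / n)


def residuals(offsets, shift):
    """Per-target disagreement with the shared shift."""
    sx, sy = shift
    return [(dx - sx, dy - sy) for dx, dy in offsets]


# Hypothetical per-target offsets (pixels) for three ships in one frame.
offsets = [(-11.0, 7.5), (-12.0, 6.5), (-10.0, 7.0)]
shift = joint_frame_shift(offsets)
print(shift, residuals(offsets, shift))
```

With a single target the estimate inherits that target's full localization error, whereas with several targets the independent errors partly cancel, which is in line with the trend in Table 5.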

5. Discussion

This study demonstrates the effectiveness of the proposed image stabilization method. Introducing the multi-scale heterogeneous convolution module (C3KM2) improves feature extraction for small maritime vessels and thereby enhances detection precision. The study also addresses image stabilization in open-ocean scenarios, where traditional methods based on terrestrial features fail. Moving ships are used as “control points”, and their motion is fitted with a quadratic curve model that leverages the prior of uniform ship motion; the stabilized positions are then determined by dividing the curve’s arc length into equal segments. In addition, a multi-target joint stabilization framework minimizes displacement inconsistencies across multiple targets, so that frame jitter is compensated accurately throughout the image sequence. Comparative experiments on GF-4 sequential images show that the proposed method improves detection precision by approximately 1.3 percentage points over the YOLOv11 baseline (Table 2), increases MOTA by about 11% compared with the classical tracking method DeepSort, and reduces the mean image stabilization error from 425 m to 170 m.
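
The equal-division idea can be sketched as follows: given a quadratic already fitted to one ship's detections, place one point per frame at equal arc-length intervals along the curve, which corresponds to uniform speed if the frame interval is constant. The parameterization $y = ax^{2} + bx + c$, the polyline approximation of arc length, and all names and values are assumptions of this sketch; the paper's own formulation may differ (for example, this form cannot represent a near-vertical track).

```python
import math


def equal_arc_points(a, b, c, x_start, x_end, n_frames, samples=2000):
    """Place n_frames points at equal arc-length steps on y = a*x^2 + b*x + c.

    The curve between x_start and x_end is approximated by a dense
    polyline; each target arc length is located by linear interpolation.
    """
    if n_frames < 2:
        raise ValueError("need at least two frames")
    xs = [x_start + (x_end - x_start) * i / samples for i in range(samples + 1)]
    ys = [a * x * x + b * x + c for x in xs]

    # Cumulative polyline length at each sample.
    cum = [0.0]
    for i in range(1, len(xs)):
        cum.append(cum[-1] + math.hypot(xs[i] - xs[i - 1], ys[i] - ys[i - 1]))
    total = cum[-1]

    points, j = [], 0
    for k in range(n_frames):
        target = total * k / (n_frames - 1)
        while j < samples - 1 and cum[j + 1] < target:
            j += 1
        seg = cum[j + 1] - cum[j]
        t = (target - cum[j]) / seg if seg > 0 else 0.0
        points.append((xs[j] + t * (xs[j + 1] - xs[j]),
                       ys[j] + t * (ys[j + 1] - ys[j])))
    return points


# Hypothetical fitted curve and frame count.
corrected = equal_arc_points(a=0.01, b=0.5, c=2.0,
                             x_start=0.0, x_end=40.0, n_frames=5)
for x, y in corrected:
    print(f"({x:.2f}, {y:.2f})")
```

The difference between each corrected point and the detected position in the same frame then gives that target's correction offset for the frame.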

Nevertheless, certain limitations remain. First, the performance of the method depends heavily on the accuracy of the front-end target detection and tracking modules; in extreme weather or under severe occlusion among dense targets, the loss of some tracks may degrade the final stabilization quality. Second, the algorithm assumes that ships maintain approximately uniform motion within short time windows. While this assumption holds for most conventional navigation scenarios, it can introduce short-term modeling errors when targets maneuver, which reduces the stability and accuracy of the stabilization results. Relaxing this assumption will be a focus of future work, with the goal of improving the adaptability and robustness of the method across diverse operational scenarios.

6. Conclusions

To address the challenge of image stabilization in vast oceanic regions where geostationary optical satellite imagery lacks fixed reference features, this paper proposes a comprehensive solution. The main contributions and conclusions are summarized as follows. First, a detection and tracking method enhanced for small maritime targets is proposed. By constructing a three-frame temporal image channel to strengthen target signatures and introducing a multi-scale heterogeneous convolution module, the method extracts background and target features in parallel at different scales and orientations, improving the detection rate and tracking stability of ships in low-resolution, low-contrast remote sensing imagery. Second, a multi-target trajectory-based joint adjustment method for oceanic image stabilization is introduced. Instead of relying on fixed ground features, the method uses the moving ship targets themselves. By fitting their motion trajectories with a curve model, correcting positions based on the prior of uniform motion, and imposing a consistency constraint on the deviations of different targets within the same frame, a multi-target joint stabilization framework is established that estimates and removes systematic jitter across image sequences. Experiments on real GF-4 satellite data demonstrate that the proposed method outperforms the comparative algorithms in target detection accuracy and tracking continuity. The stabilization process reduces the average target position deviation from 8.5 pixels (425 m) to 3.4 pixels (170 m), a 59.41% reduction, which effectively eliminates trajectory jitter and provides a reliable data foundation for subsequent high-precision analysis of maritime target motion and behavior recognition.

In conclusion, this study not only presents a practical technical pathway for geometric consistency processing of geostationary satellite ocean imagery, but the core idea of “stabilizing motion using motion” also offers valuable insights for video or sequence image stabilization in other scenarios lacking stable control points. Future work will focus on enhancing the algorithm’s adaptability in complex dynamic environments and exploring deeper integration with spatiotemporal auxiliary information.

Author Contributions

Methodology, F.L.; Software, F.L.; Validation, F.L.; Formal analysis, M.W.; Resources, M.W.; Data curation, Y.L.; Writing—original draft, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Fund of Hubei Province Strategic Science and Technology Talent Cultivation Special Project, No. 2024DJA035.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to confidentiality.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Wang, B.; Xing, Y.; Wang, N.; Chen, C.P. Monitoring waste from unmanned aerial vehicle and satellite imagery using deep learning techniques: A review. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 1–15.
2. Zhu, W.; Gong, W.; Wang, Y.; Zhang, Y.; Hu, J. GMF-Net: A Gaussian-Matched Fusion Network for Weak Small Object Detection in Satellite Laser Ranging Imagery. Sensors 2026, 26, 407.
3. Aydin, A.; Avaroğlu, E. AERIS-ED: A Novel Efficient Attention Riser for Multi-Scale Object Detection in Remote Sensing. Appl. Sci. 2025, 15, 12223.
4. Wang, P.; Qin, P.; Chai, R.; Zeng, J.; Zhao, P.; Chen, Z.; Han, B. End-to-End Online Video Stitching and Stabilization Method Based on Unsupervised Deep Learning. Appl. Sci. 2025, 15, 5987.
5. Rekavandi, A.M.; Xu, L.; Boussaid, F.; Seghouane, A.K.; Hoefs, S.; Bennamoun, M. A guide to image-and video-based small object detection using deep learning: Case study of maritime surveillance. IEEE Trans. Intell. Transp. Syst. 2025, 26, 1234–1245.
6. Zhang, C.; Zhang, X.; Gao, G.; Lang, H.; Liu, G.; Cao, C.; Song, Y.; Guan, Y.; Dai, Y. Development and application of ship detection and classification datasets: A review. IEEE Geosci. Remote Sens. Mag. 2024, 14, 456–468.
7. Gao, F.; Tian, Y.; Wu, Y.; Zhang, Y. ST-YOLOv8: Small-Target Ship Detection in SAR Images Targeting Specific Marine Environments. Appl. Sci. 2025, 15, 3456–3468.
8. Liu, F.; Zhang, F.; Wang, M.; Xu, Q. Two-Level Supervised Network for Small Ship Target Detection in Shallow Thin Cloud-Covered Optical Satellite Images. Appl. Sci. 2024, 14, 11558.
9. Lowe, D.G. Distinctive Image Features from Scale-Invariant Keypoints. Int. J. Comput. Vis. 2004, 60, 91–110.
10. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2005; pp. 886–893.
11. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
12. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Computer Vision–ECCV 2016 (ECCV); Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
13. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2016.
14. Sun, Y.; Liu, W.; Gao, Y.; Hou, X.; Bi, F. A Dense Feature Pyramid Network for Remote Sensing Object Detection. Appl. Sci. 2022, 12, 4997.
15. Yuan, Z.; Liu, Z.; Zhu, C.; Qi, J.; Zhao, D. Object Detection in Remote Sensing Images via Multi-Feature Pyramid Network with Receptive Field Block. Remote Sens. 2021, 13, 862.
16. Yang, X.; Yan, J.; Feng, Z.; He, T. R3Det: Refined Single-Stage Detector with Feature Refinement for Rotating Object Detection. Proc. AAAI Conf. Artif. Intell. 2021, 35, 3163–3171.
17. Xie, X.; Cheng, G.; Wang, J.; Yao, X.; Han, J. Oriented R-CNN for Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2021; pp. 3520–3529.
18. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In European Conference on Computer Vision (ECCV); Springer: Berlin/Heidelberg, Germany, 2020.
19. Wang, X.; Wang, A.; Yang, J.; Chen, A.; Sun, Y. Small Object Detection Based on Deep Learning for Remote Sensing: A Comprehensive Review. Remote Sens. 2023, 15, 3265.
20. Zhang, Z.; Wang, C.; Song, J.; Xu, Y. Object Tracking Based on Satellite Videos: A Literature Review. Remote Sens. 2022, 14, 3674.
21. Li, Y.; Xu, Q.; Kong, Z.; Li, W. MULS-Net: A multilevel supervised network for ship tracking from low-resolution remote-sensing image sequences. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5624214.
22. Kong, Z.; Xu, Q.; Li, Y.; Han, X.; Li, W. TS-Track: Trajectory self-adjusted ship tracking for GEO satellite image sequences via multilevel supervision paradigm. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5639415.
23. Zhao, T.; Wang, Y.; Li, Z.; Gao, Y.; Chen, C.; Feng, H.; Zhao, Z. Ship Detection with Deep Learning in Optical Remote-Sensing Images: A Survey of Challenges and Advances. Remote Sens. 2024, 16, 1145.
24. Hu, J.; Zhi, X.; Jiang, S.; Tang, H.; Zhang, W.; Bruzzone, L. Supervised multi-scale attention-guided ship detection in optical remote sensing images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5630514.
25. Xing, Z.; Ren, J.; Fan, X.; Zhang, Y. S-DETR: A Transformer Model for Real-Time Detection of Marine Ships. J. Mar. Sci. Eng. 2023, 11, 696.
26. Liu, Q.; Xiang, X.; Yang, Z.; Hu, Y.; Hong, Y. Arbitrary Direction Ship Detection in Remote-Sensing Images Based on Multitask Learning and Multiregion Feature Fusion. IEEE Trans. Geosci. Remote Sens. 2021, 59, 1553–1564.
27. Li, L.; Zhou, Z.; Wang, B.; Miao, L.; Zong, H. A Novel CNN-Based Method for Accurate Ship Detection in HR Optical Remote Sensing Images via Rotated Bounding Box. IEEE Trans. Geosci. Remote Sens. 2021, 59, 686–699.
28. Zhang, X.; Wang, G.; Zhu, P.; Zhang, T.; Li, C.; Jiao, L. GRS-Det: An Anchor-Free Rotation Ship Detector Based on Gaussian-Mask in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2021, 59, 3518–3531.
29. Yu, W.; You, H.; Lv, P.; Hu, Y.; Han, B. A Moving Ship Detection and Tracking Method Based on Optical Remote Sensing Images from the Geostationary Satellite. Sensors 2021, 21, 7547.
30. Wang, B.; Sui, H.; Ma, G.; Zhou, Y. MCTracker: Satellite Video Multi-Object Tracking Considering Inter-Frame Motion Correlation and Multi-Scale Cascaded Feature Enhancement. ISPRS J. Photogramm. Remote Sens. 2024, 214, 181–199.
31. Wang, Y.; Huang, Q.; Jiang, C.; Liu, J.; Shang, M.; Miao, Z. Video Stabilization: A Comprehensive Survey. Neurocomputing 2023, 516, 205–230.
32. Tang, L.; Ma, S.; Ma, X.; You, H. Research on image matching of improved SIFT algorithm based on stability factor and feature descriptor simplification. Appl. Sci. 2022, 12, 8448.
33. Maes, F.; Collignon, A.; Vandermeulen, D.; Marchal, G.; Suetens, P. Multimodality Image Registration by Maximization of Mutual Information. IEEE Trans. Med Imaging 1997, 16, 187–198.
34. James, J.G.; Jain, D.; Rajwade, A. Globalflownet: Video stabilization using deep distilled global motion estimates. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision; IEEE: Piscataway, NJ, USA, 2023; pp. 5078–5087.
35. Choi, J.; Park, J.; Kweon, I.S. Self-supervised real-time video stabilization. In Proceedings of the 32nd British Machine Vision Conference, BMVC 2021, Online, 22–25 November 2021.
36. Morris, B.T.; Trivedi, M.M. A Survey of Vision-Based Trajectory Learning and Analysis for Surveillance. IEEE Trans. Circuits Syst. Video Technol. 2008, 18, 1114–1127.
37. Heravi, M.Y.; Jang, Y.; Jeong, I.; Sarkar, S. Deep learning-based activity-aware 3D human motion trajectory prediction in construction. Expert Syst. Appl. 2024, 239, 122423.
38. Zhang, B.; Yu, W.; Jia, Y.; Huang, J.; Yang, D.; Zhong, Z. Predicting vehicle trajectory via combination of model-based and data-driven methods using Kalman filter. Proc. Inst. Mech. Eng. Part D J. Automob. Eng. 2024, 238, 2437–2450.
39. Tian, M.; Chen, Z.; Wang, H.; Liu, L. An intelligent particle filter for infrared dim small target detection and tracking. IEEE Trans. Aerosp. Electron. Syst. 2022, 58, 5318–5333.
40. Luo, W.; Xing, J.; Milan, A.; Zhang, X.; Liu, W.; Zhao, X.; Kim, T.-K. Multiple Object Tracking: A Literature Review. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 3348–3365.
41. Li, X.; Zhang, T.; Liu, Z.; Liu, B.; ur Rehman, S.; Rehman, B.; Sun, C. Saliency guided siamese attention network for infrared ship target tracking. IEEE Trans. Intell. Veh. 2024, 10, 123–134.
42. Deng, C.; Wu, J.; Han, Y.; Wang, W.; Chanussot, J. Learning a robust topological relationship for online multiobject tracking in UAV scenarios. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5628615.
43. Billah, M.M.; Zhang, J.; Zhang, T. A method for vessel’s trajectory prediction based on encoder decoder architecture. J. Mar. Sci. Eng. 2022, 10, 1529.
44. Alcantarilla, P.F.; Bartoli, A.; Davison, A.J. KAZE features. In European Conference on Computer Vision (ECCV), Florence, Italy, 7–13 October 2012; Springer: Berlin/Heidelberg, Germany, 2012; pp. 214–227.
45. Jocher, G.; Qiu, J. Ultralytics YOLO11. GitHub Repository. 2024. Available online: https://docs.ultralytics.com/zh/models/yolo11/ (accessed on 10 February 2026).
46. Yang, X.; Liu, Y.; Wang, Y. Position detection and direction prediction for arbitrary-oriented ships via multitask rotation region convolutional neural network. IEEE Access 2018, 6, 50839–50849.
47. Yuan, X.; Wang, T.; Liu, J.; Shen, H. Small object detection via coarse-to-fine proposal generation and imitation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: Piscataway, NJ, USA, 2023.
48. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In 2017 IEEE International Conference on Image Processing (ICIP); IEEE: Piscataway, NJ, USA, 2017; pp. 3645–3649.
49. Zhang, Y.; Sun, Y.; Wang, Y.; Li, Y. Bytetrack: Multi-object tracking by associating every detection box. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 581–597.
50. Meinhardt, T.; Kainz, B.; Leal-Taixé, L. Trackformer: Multi-object tracking with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2022; pp. 15203–15213.

Figure 1. Schematic of weak object and object trajectories in Gaofen-4 satellite imagery before and after stabilization.

Figure 2. Overall framework of the object detection network. The C3KM2 module is introduced to enhance the network’s ability to extract features of small objects.

Figure 3. Schematic diagram of C3KM2 module.

Figure 4. Schematic diagram of the multi-object trajectory fitting method based on a curve model.

Figure 5. Schematic diagram of training samples. The sequences of red, green, and blue dots indicate ship targets.

Figure 6. Qualitative comparison of detection results on GF-4 images. The first column displays the ground truth annotations, followed by the outputs of various detection methods.

Figure 7. Qualitative comparison of tracking results on GF-4 images. The first column displays the ground truth, followed by the outputs of various tracking methods.

Figure 8. Comparison of image stabilization before and after processing. The red dashed box indicates a fixed small island, and the yellow dots represent the trajectory points of ship target tracking.

Figure 9. Comparison of image stabilization results in Scene 2 before and after processing.

Figure 10. Comparison of image stabilization results in Scene 3 before and after processing.

Table 1. Hyperparameter Settings.

Hyperparameter	Value
Epoch	150
Batch Size	16
Image Size	$512 \times 512$
Initial Learning Rate	0.001
Optimizer	Adam

Table 2. Comparative experimental results of target detection methods.

Task	Method	Precision (%)	Recall (%)
Detection	Faster R-CNN	83.93	78.56
Detection	YOLOv11	88.86	81.67
Detection	R-DFPN	89.22	81.52
Detection	Our approach	90.13	82.45

Note: Bold values represent the optimal results.

Table 3. Comparative experimental results of target tracking methods.

Task	Method	MOTA (%)	IDSw (n)
Tracking	DeepSort	63.59	63
Tracking	ByteTrack	68.44	56
Tracking	TrackFormer	73.26	38
Tracking	Our approach	73.97	35

Note: Bold values represent the optimal results.

Table 4. Multi-modal image sequence deviations before and after stabilization based on land-area testing (pixel-based).

Frame No.	Original Horizontal (Pixel)	Original Vertical (Pixel)	Original Absolute (Pixel)	Stabilized Horizontal (Pixel)	Stabilized Vertical (Pixel)	Stabilized Absolute (Pixel)
1	0.0	0.0	0.0	0.0	0.0	0.0
2	0.0	−4.0	4.0	0.9	−0.8	1.2
3	0.0	−4.0	4.0	0.6	−0.5	0.8
4	1.0	3.0	3.2	1.2	−1.0	1.6
5	−3.0	4.0	5.0	−0.4	1.8	1.8
6	−11.1	7.0	13.1	0.2	0.3	0.4
7	−8.1	9.0	12.1	0.5	0.2	0.6
8	−12.1	5.0	13.1	0.8	1.2	1.4
9	−10.1	−2.0	10.3	−1.1	1.3	1.7
10	−2.0	−4.0	4.5	−2.1	0.5	2.2
11	−7.0	−3.0	7.7	−1.2	0.8	1.4
12	−15.1	0.0	15.1	−3.3	1.3	3.6
13	−14.1	2.0	14.2	−2.6	1.8	3.1
14	−17.1	8.0	18.9	−2.9	1.4	3.2
15	−11.1	6.0	12.6	−4.3	1.1	4.4
16	−5.0	2.0	5.4	−3.8	0.9	3.9
17	1.0	−1.0	1.4	−5.4	1.8	5.7
18	−1.0	−4.0	4.1	−6.1	1.8	6.3
19	−4.0	−2.0	4.5	−5.8	0.8	5.9
20	−9.1	1.0	9.1	−5.6	−0.1	5.6
21	−12.1	4.0	12.7	−4.5	−0.9	4.6
22	−16.1	5.0	16.9	−2.5	−0.7	2.6
23	0.0	1.0	1.0	−3.6	−3.4	4.9
24	2.0	4.0	4.5	−1.7	−2.0	2.6
25	10.1	1.0	10.1	1.1	−2.6	2.8
26	13.1	1.0	13.1	0.9	−4.1	4.1
27	7.0	4.0	8.1	2.5	−4.5	5.2
28	1.0	5.0	5.1	2.1	−5.9	6.3
29	−4.0	7.0	8.1	3.7	−6.2	7.2
30	4.0	11.0	11.7	4.0	−6.7	7.8
Mean	−4.1	2.2	8.5	−1.2	−0.8	3.4

Table 5. Comparison of mean stabilization deviations under different target quantity conditions.

Object Number	Original Horizontal (Pixel)	Original Vertical (Pixel)	Original Absolute (Pixel)	Stabilized Horizontal (Pixel)	Stabilized Vertical (Pixel)	Stabilized Absolute (Pixel)
≥3	−3.8	2.1	7.9	−1.2	−0.7	3.2
2	−3.7	2.0	7.7	−1.6	−0.7	4.5
1	−3.9	2.1	8.1	−1.6	−0.6	5.2

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
