Article

Miniature Multi-Target Tracking in Sonar Images Using Dual Trajectory Storage Method

College of Electronic and Information Engineering, Guangdong Ocean University, Zhanjiang 524088, China
*
Authors to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2026, 14(6), 568; https://doi.org/10.3390/jmse14060568
Submission received: 13 February 2026 / Revised: 15 March 2026 / Accepted: 17 March 2026 / Published: 19 March 2026
(This article belongs to the Section Physical Oceanography)

Abstract

To address trajectory fragmentation and the trade-off between association efficiency and data integrity in the detection and tracking of micro-scale multi-target motion in underwater sonar video sequences, a multi-target motion detection and tracking algorithm based on a dual trajectory storage mechanism and adaptive trajectory association is proposed. The method first obtains target centroids through Gaussian mixture model foreground extraction, morphological post-processing, and connected region analysis. By employing a dual-storage structure consisting of real-time trajectories and complete trajectories, it dynamically adjusts association thresholds based on frame sampling rates to achieve adaptive distance calculation for trajectory tracking. Experimental results demonstrate that the proposed method achieves a completeness rate of 100% in recording valid trajectory point lengths. The adaptive threshold mechanism improves association accuracy to 96.07% while reducing the trajectory fragmentation rate to 0.9%. The average association time is 0.28 ms per frame, enabling efficient real-time association while ensuring the integrity of motion trajectory tracking. This research contributes to enhancing real-time detection and tracking capabilities for micro-scale underwater targets and provides support for applications such as underwater security surveillance, marine resource exploration, and intelligent autonomous underwater vehicle navigation.

1. Introduction

The detection and tracking of multi-target motion trajectories in underwater sonar imagery hold significant value for fields such as marine ecological monitoring, fisheries resource assessment, and autonomous underwater vehicle navigation [1,2]. Translating this potential into practical applications relies on a shared prerequisite: the accurate detection and stable tracking of moving targets. Traditional moving target detection methods, such as background subtraction, frame differencing, and optical flow, have been widely applied in terrestrial and aerial vision tasks [3,4,5]. However, their effectiveness relies on relatively ideal imaging conditions and sufficient data support. In contrast, the acquisition of underwater sonar images is often costly and inefficient, resulting in datasets that are typically limited in both quantity and quality [6,7,8]. Moreover, due to interference from water column reverberation, environmental noise, and inherent device characteristics, sonar images commonly exhibit low resolution, pronounced speckle noise, and blurred target contours [9]. These factors contribute to the degradation and distortion of target motion features [10], posing challenges for underwater multi-target trajectory detection and tracking. These challenges include frequent trajectory fragmentation, unstable target association accuracy, and limited model generalization capability under data-scarce conditions.
To address these challenges, researchers have made various improvements to traditional methods [11,12] to enhance their robustness in underwater settings. For instance, Wang et al. [13] optimized optical flow computation and edge fusion strategies to overcome the computational inefficiency and real-time limitations of traditional optical flow methods, significantly improving their speed and denoising capability. However, this approach depends heavily on the quality of edge detection, leading to suboptimal performance for partially or fully occluded moving targets. Chen et al. [14] proposed an improved ViBe algorithm that effectively suppresses ghosting artifacts caused by moving objects in the initial frame by integrating a three-frame differencing method and constructs adaptive thresholds to improve adaptability to dynamic backgrounds. Nevertheless, this method still falls short of achieving fully adaptive parameter adjustment. This limitation, along with the work of Zheng [15], underscores the fundamental constraint of traditional background subtraction methods: their reliance on manual parameter tuning and lack of inherent adaptability.
To surpass the accuracy and generalization limitations of traditional methods, researchers have turned to data-driven paradigms, particularly deep learning, aiming for superior detection performance [16,17,18,19]. However, as Zhang [20] and Liu [21] note in their studies, although deep learning methods can achieve higher detection accuracy and robustness in complex environments through end-to-end training, their performance is highly dependent on training with large-scale annotated datasets. This creates a fundamental contradiction with the scarcity of annotated underwater sonar datasets.
A comprehensive analysis reveals that while existing research has made progress in specific aspects, limitations remain in addressing underwater multi-target detection and tracking. On one hand, most methods focus on mitigating individual challenges such as noise or occlusion, with insufficient exploration of systematically maintaining trajectory continuity in long-term, multi-target scenarios. On the other hand, under data-scarce conditions, constructing efficient and highly adaptive lightweight frameworks remains a challenging area requiring further investigation. Furthermore, Cao et al. [22] proposed an observation-centric trajectory correction method that uses the ORU module to correct accumulated Kalman filter errors upon trajectory reactivation; Maggiolino et al. [23] introduced EMA-based dynamic appearance feature fusion to upgrade the pure motion model to a joint appearance–motion model; and Du et al. [24] addressed missing associations and missed detections through the AFLink and GSI modules, respectively, via offline post-processing repair. These SORT-series methods all adopt a single-layer storage structure: after a trajectory is lost, they rely primarily on prediction, post-correction, or offline remediation, making it difficult to retain complete historical information during the loss period.
To address the aforementioned multiple challenges, this paper proposes a solution that balances trajectory integrity and computational efficiency by optimizing the trajectory storage structure and adaptively adjusting the association threshold. Specifically, a micro-scale multi-target motion detection and tracking algorithm based on a dual trajectory storage mechanism and adaptive trajectory association is introduced. The method begins by employing an adaptively updated Gaussian Mixture Model (GMM) for dynamic foreground extraction, followed by morphological operations and connected component analysis to obtain target centroid positions. Building on this, a dual-storage structure combining real-time trajectories and complete trajectories is introduced. Real-time trajectories record the motion state of targets in the current frame, while complete trajectories serve as backup storage, preserving the historical motion paths of targets. When a target is temporarily lost due to occlusion or missed detection, the algorithm retrieves the missing trajectory segments from the complete trajectories, thereby preventing trajectory fragmentation and ensuring long-term tracking continuity. Furthermore, the association threshold is dynamically adjusted based on the frame sampling rate, which enhances association accuracy while achieving lightweight computation. This enables adaptive trajectory association and complete trajectory output, ultimately generating visualized motion trajectories and statistical information. By integrating local background modeling with an efficient data association mechanism, the proposed method aims to improve the stability and coherence of micro-scale multi-target tracking under limited data conditions. It provides a lightweight and practical technical approach for underwater motion analysis.

2. Data Acquisition and Methods

The sonar image data were collected at an experimental station located in the Duhekou Reservoir, Zhejiang Province, where the water level is consistently maintained between 27.0 m and 28.0 m. An M750d dual-frequency multi-beam imaging sonar (Blueprint Design Engineering Limited trading as Blueprint Subsea, Kendal, UK) was employed for data acquisition. This system integrates transmitting and receiving transducer arrays with a signal processing module, functioning as a complete underwater acoustic imaging device. The experiment primarily measured the relative radial distance and azimuth angle between the sonar transducer and underwater dynamic targets. The detection range was set to 0.5–50 m (with an imaging range of 5 m during the experiment). The horizontal scanning angle was 130°, the vertical scanning angle was 20°, the range resolution reached 10 mm, and the operating frequencies were 750 kHz and 1.2 MHz [25].
During the experiment, the target was rotated 360° at a constant speed using a turntable. The sonar recorded echo data in real time at a refresh rate of 15 Hz, while simultaneously capturing echo image data, as shown in Figure 1. The video sequence includes four targets (all “micro-scale” moving targets, with physical dimensions less than 0.5 m). An artificially suspended target (an aluminum three-cylinder structure, each cylinder approximately 10 cm in diameter and 40 cm in height) positioned beneath the turntable remains visible throughout the entire rotational motion. In addition, three moving targets (fish naturally active in the water, each approximately 10 cm in body length) appear at different time intervals. The first target appears from the 17th to the 41st second, lasting 21 s; the second appears from the 36th to the 51st second; and the third appears from the 94th to the 109th second. Each target remains continuously detectable during its corresponding period. A frame containing three simultaneous targets is shown in Figure 2, and a frame containing two simultaneous targets is shown in Figure 3.

2.1. Background Modeling and Foreground Extraction

A Gaussian Mixture Model (GMM) [26,27,28] was employed for background modeling to achieve foreground extraction by establishing a statistical distribution model of the background. Based on a pixel-wise probability density function model, $k$ Gaussian distributions are constructed for each pixel in the image to model multi-modal data, thereby adapting to complex background variations. For the current pixel value $X_t$ at time $t$, its probability density function is defined as:
$$P(X_t) = \sum_{i=1}^{k} \omega_{i,t}\,\eta\!\left(X_t, \mu_{i,t}, \Sigma_{i,t}\right) \quad (1)$$
In Equation (1), $\omega_{i,t}$ represents the weight of the $i$-th Gaussian component at time $t$, satisfying $\sum_{i=1}^{k}\omega_{i,t}=1$; $\mu_{i,t}$ is the mean vector; and $\Sigma_{i,t} = \sigma_{i,t}^{2} I$ is the covariance matrix, where $\sigma_{i,t}$ is the standard deviation and $I$ is the identity matrix. The probability density function of a single Gaussian component is:
$$\eta\!\left(X_t, \mu_{i,t}, \Sigma_{i,t}\right) = \frac{1}{(2\pi)^{n/2}\,\lvert\Sigma_{i,t}\rvert^{1/2}} \exp\!\left(-\frac{1}{2}\left(X_t - \mu_{i,t}\right)^{T} \Sigma_{i,t}^{-1} \left(X_t - \mu_{i,t}\right)\right) \quad (2)$$
Here, $n$ denotes the data dimensionality, which determines the mathematical structure of the function $\eta$. $X_t$ and $\mu_{i,t}$ are $n$-dimensional vectors, and $\Sigma_{i,t}$ is an $n \times n$ matrix. In this study, the data dimensionality is $n = 3$. Following initialization, the model parameters are updated to determine the Gaussian components suitable for each pixel.
The background model selection utilizes the stability index $\omega_{i,t}/\sigma_{i,t}$. The set of background components is defined as:
$$B_{\mathrm{set}} = \left\{\, i \;\middle|\; \omega_{i,t}/\sigma_{i,t} > T \,\right\} \quad (3)$$
where $T$ is the background selection threshold. For the current pixel $X_t$, if it matches any component in $B_{\mathrm{set}}$, it is classified as a background pixel; otherwise, it is classified as a foreground pixel. This classification produces a foreground binary image $f(x, y)$, where $f(x, y) = 1$ denotes a foreground pixel corresponding to a moving target, and $f(x, y) = 0$ denotes a background pixel representing the water body. Once the parameters remain relatively stable, the classification of foreground and background pixels is completed. The statistical distribution for each pixel is estimated through online updates.
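As a concrete illustration of Equations (1)–(3), the following Python sketch classifies a single pixel against its per-pixel mixture. The match rule (2.5 standard deviations) and all parameter values are illustrative assumptions, not the paper's implementation (which runs in MATLAB):

```python
import numpy as np

def classify_pixel(x, weights, means, sigmas, T=0.7, match_k=2.5):
    """Classify one pixel against its k-component Gaussian mixture.

    Hypothetical helper sketching Eq. (1)-(3): components whose
    stability index w/sigma exceeds the threshold T form the background
    set B_set; the pixel is background (0) if it lies within match_k
    standard deviations of any background component, else foreground (1).
    """
    # Stability index w/sigma: high weight and low variance => stable background
    stability = weights / sigmas
    b_set = np.where(stability > T)[0]          # background component set B_set
    for i in b_set:
        # Match test: distance under Sigma = sigma^2 * I reduces to Euclidean
        if np.linalg.norm(x - means[i]) < match_k * sigmas[i]:
            return 0                            # background
    return 1                                    # foreground
```

In a full pipeline this test runs per pixel per frame, with the weights, means, and variances updated online as described above.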

2.2. Target Detection and Segmentation

2.2.1. Morphological Preprocessing

A three-step morphological procedure was applied to the binary image obtained from GMM-based foreground extraction. First, an opening operation was performed to remove noise. This was followed by a closing operation to connect broken points and fill small gaps. Finally, a hole-filling algorithm was used to address completely enclosed cavities within the detected targets.
Morphological filtering of the foreground binary map $f(x, y)$ involved both opening and closing operations. The opening operation is defined as:
$$O(f) = \left(f(x, y) \ominus SE_1\right) \oplus SE_1 \quad (4)$$
In Equation (4), $SE_1$ is a $3 \times 3$ rectangular structuring element, $\ominus$ denotes erosion, and $\oplus$ denotes dilation. The opening operation, which performs erosion followed by dilation, is used to eliminate small protrusions and noise from the foreground binary map.
Subsequently, a closing operation was applied to the result of Equation (4), defined as:
$$C(f) = \left(O(f) \oplus SE_2\right) \ominus SE_2 \quad (5)$$
In Equation (5), $SE_2$ is a $10 \times 10$ rectangular structuring element. The closing operation, performing dilation followed by erosion, serves to fill small indentations and smooth the surfaces within the foreground binary map. Let $A$ denote the binary image after morphological filtering, i.e., $A = C(f)$.
To address the issue of cavities on the surfaces of targets arising from the image binarization process, a morphological hole-filling algorithm was employed. Given $A$ as the morphologically filtered binary image and $A^{c}$ as its complement, with $B$ a $3 \times 3$ rectangular structuring element, the iterative hole-filling process is:
$$X_k = \left(X_{k-1} \oplus B\right) \cap A^{c}, \quad k = 1, 2, 3, \ldots \quad (6)$$
In Equation (6), $X_0$ is the initial seed image, typically created by setting at least one pixel within each cavity to 1 (with all others set to 0). $X_k$ represents the binary image after the $k$-th iteration, where pixels with a value of 1 indicate the filled region. This iterative process expands the filling region through dilation and intersects it with the image complement $A^{c}$, ensuring the filled area does not exceed the cavity boundaries. Iteration continues until convergence, i.e., when $X_k = X_{k-1}$. At this point, all pixels with a value of 1 are extracted from $X_k$; the region formed by their positions constitutes the filled cavity area. This filled cavity region is then merged with the morphologically filtered foreground binary image to produce the final foreground binary map, $A_{\mathrm{final}}(x, y)$, which serves as the foundation for subsequent connected component analysis. The original grayscale frame, as shown in Figure 4a, is input into the GMM to obtain the foreground binary image presented in Figure 4b. The final morphological processing result is illustrated in Figure 4c.
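The iteration of Equation (6) can be sketched in Python as follows; the shift-based dilation helper and the example seed are illustrative assumptions, not the paper's MATLAB implementation:

```python
import numpy as np

def dilate(X):
    """Binary dilation with a 3x3 rectangular structuring element,
    implemented via array shifts (no external dependencies)."""
    p = np.pad(X, 1)
    out = np.zeros_like(X)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out |= p[1 + dy:1 + dy + X.shape[0], 1 + dx:1 + dx + X.shape[1]]
    return out

def fill_holes(A, seed):
    """Iterate X_k = (X_{k-1} dilated) & A^c until X_k = X_{k-1} (Eq. (6)),
    then merge the filled cavity with A. `seed` marks one pixel per hole."""
    Ac = ~A                      # complement of the filtered binary image
    X = seed.copy()
    while True:
        X_next = dilate(X) & Ac  # grow, but never cross cavity boundaries
        if np.array_equal(X_next, X):
            break                # convergence: X_k == X_{k-1}
        X = X_next
    return A | X                 # merged final foreground map
```

Because the dilation includes the centre pixel, the sequence $X_k$ grows monotonically inside the cavity and is guaranteed to converge.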

2.2.2. Connected Component Analysis

Connected component analysis was performed on the morphologically processed foreground binary image $A_{\mathrm{final}}(x, y)$ to identify and extract distinct target regions. A connected component labeling algorithm (Blob Analysis) [29] was employed to segment the binary image, enabling rapid identification, localization, and feature extraction of independent connected components (blobs).
For each foreground pixel in $A_{\mathrm{final}}(x, y)$, connectivity was determined using either a 4-connectivity or 8-connectivity criterion. The connected component labeling algorithm grouped all connected pixels, with each connected group forming an independent connected region. After labeling, $N$ connected regions were obtained, denoted $R = \{R_1, R_2, \ldots, R_N\}$, where $N$ is the total number of connected regions and $g \in \{1, 2, \ldots, N\}$ is an index variable. Each element $R_g$ in the set represents the $g$-th connected region (i.e., the $g$-th blob), defined as the set of coordinates of all pixels within that connected pixel group.
For each connected region $R_g$, key geometric features were extracted, primarily area and centroid. The area $A_g$ of the $g$-th connected region is defined as the total number of pixels within that region. The centroid coordinates $(x_g, y_g)$ of the $g$-th connected region, representing its geometric center, are defined as:
$$x_g = \frac{1}{A_g} \sum_{(x, y) \in R_g} x, \qquad y_g = \frac{1}{A_g} \sum_{(x, y) \in R_g} y \quad (7)$$
The results of blob detection are shown in Figure 5a, which displays the individual blob target regions identified by the connected component labeling algorithm along with their corresponding area values.
An area thresholding strategy was applied to filter the detected connected regions, aiming to select genuine targets while eliminating noise and false detections. Only connected regions satisfying the area threshold condition were retained as valid targets, as illustrated in Figure 5b. This figure presents area statistics for five detected blobs, with the red dashed line indicating the minimum area threshold (160 pixels). The areas of all five blobs exceed this threshold, demonstrating that the area thresholding strategy effectively filters out noise while preserving real targets. The set of valid connected regions is defined as:
$$R_{\mathrm{valid}} = \left\{\, R_g \;\middle|\; A_g \geq A_{\mathrm{threshold}},\; R_g \in R \,\right\} \quad (8)$$
Here, $A_{\mathrm{threshold}}$ is the area threshold, set to $A_{\mathrm{threshold}} = 160$ in this study. After filtering, the valid connected region set $R_{\mathrm{valid}}$ is obtained. The corresponding valid centroid set is defined as $C = \{C_j = (x_j, y_j) \mid R_j \in R_{\mathrm{valid}}\}$, where $j$ is the index of the valid connected regions. This valid centroid set $C$ serves as the input for the subsequent dual trajectory storage and association algorithm.
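A minimal Python sketch of the labeling, centroid, and area-filter steps (Equations (7) and (8)); the BFS flood fill stands in for the MATLAB Blob Analysis tool used in the paper, while the 8-connectivity criterion and the default threshold of 160 pixels follow the text:

```python
import numpy as np
from collections import deque

def blob_centroids(A, area_threshold=160):
    """8-connected blob labeling on a binary map, keeping only regions
    with area >= area_threshold (Eq. (8)) and returning their centroids
    computed as mean pixel coordinates (Eq. (7))."""
    H, W = A.shape
    seen = np.zeros_like(A, dtype=bool)
    centroids = []
    for sy in range(H):
        for sx in range(W):
            if A[sy, sx] and not seen[sy, sx]:
                # BFS flood fill collects one 8-connected region
                q = deque([(sy, sx)])
                seen[sy, sx] = True
                pix = []
                while q:
                    y, x = q.popleft()
                    pix.append((y, x))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if 0 <= ny < H and 0 <= nx < W \
                               and A[ny, nx] and not seen[ny, nx]:
                                seen[ny, nx] = True
                                q.append((ny, nx))
                if len(pix) >= area_threshold:        # area filter, Eq. (8)
                    ys, xs = zip(*pix)
                    # Centroid (x_g, y_g) = mean pixel coordinate, Eq. (7)
                    centroids.append((sum(xs) / len(pix), sum(ys) / len(pix)))
    return centroids
```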

2.2.3. Dual Trajectory Storage and Association Algorithm

To address the trade-off between real-time performance and data integrity in multi-target tracking, a dual trajectory storage mechanism coupled with an adaptive association algorithm is proposed. This method separates the storage of real-time trajectories from complete historical trajectories and incorporates the frame sampling rate into the association threshold calculation to enhance the algorithm’s adaptability under varying frame rates.
The dual trajectory storage structure consists of a real-time trajectory $T_m(t)$ and a complete trajectory $C_m(t)$, where $m$ is the trajectory index and $t$ is the temporal frame index. The algorithm takes the valid centroid set $C$ obtained from connected component analysis as its input and employs a synchronous update strategy. When a detected centroid $(x_j, y_j) \in C$ is successfully associated with trajectory $m$, this centroid is simultaneously appended to both $T_m$ and $C_m$. The real-time trajectory retains only the last valid centroid for fast Euclidean distance computation with newly detected centroids, while the complete trajectory preserves all centroids to ensure data integrity. Upon the detection of a new associated centroid, both trajectories are updated synchronously:
$$T_m(t+1) = T_m(t) \cup \{p_{\mathrm{new}}\} \quad (9)$$
$$C_m(t+1) = C_m(t) \cup \{p_{\mathrm{new}}\} \quad (10)$$
Here, $T_m(t+1)$ and $C_m(t+1)$ are the real-time and complete trajectories for target $m$ at time $t+1$, respectively, and $p_{\mathrm{new}} = (x_j, y_j) \in C$ is the newly detected centroid coordinate. If no associated centroid is detected in the current frame, $p_{\mathrm{new}}$ is a missing value (defined as 'NaN'), and both trajectories synchronously add a 'NaN' placeholder to maintain temporal alignment. The advantage of this mechanism is twofold: trajectory association only requires processing the last valid centroid from the real-time trajectory, while the complete trajectory retains all centroid information, ensuring data integrity for subsequent analysis.
Association decisions are based on Euclidean distance computation [30]. The Euclidean distance between the last valid centroid $(x_m, y_m)$ of trajectory $m$ and a newly detected centroid $C_j = (x_j, y_j) \in C$ is given by:
$$d_{mj} = \sqrt{(x_m - x_j)^2 + (y_m - y_j)^2} \quad (11)$$
To accommodate varying frame sampling rates, an adaptive association threshold is employed. If the target's maximum velocity is $v_{\mathrm{max}}$ (pixels/frame) and the sampling interval is $f_s$ (i.e., one frame is processed every $f_s$ frames), then the maximum possible displacement over $f_s$ frames is $\Delta_{\mathrm{max}}(f_s) = v_{\mathrm{max}} \times f_s$. The adaptive threshold is defined as:
$$D_{\mathrm{adj}} = D_{\mathrm{base}} \times f_s \quad (12)$$
where $D_{\mathrm{adj}}$ is the adaptive association distance threshold and $D_{\mathrm{base}}$ is the base association distance threshold, set to $D_{\mathrm{base}} = 17$ pixels in this study. Regarding the determination of $D_{\mathrm{base}}$, this threshold is derived from target kinematic constraints and sonar imaging parameters, with the fundamental relationship expressed as:
$$D_{\mathrm{base}} = \sigma \cdot \frac{v_{\mathrm{max}}}{f \cdot R}$$
where $v_{\mathrm{max}}$ is the maximum motion velocity, $f$ is the sonar frame rate, $R$ is the image spatial resolution, and $\sigma$ is a relaxation coefficient compensating for detection errors. In the 750 kHz mode, considering the typical motion speed of micro-scale targets and the device sampling interval, the calculated theoretical displacement range falls between 10 and 20 pixels; the value $D_{\mathrm{base}} = 17$ pixels is the optimal observed value under this physical constraint. When the sonar frequency changes, leading to variations in the resolution $R$, $D_{\mathrm{base}}$ scales linearly to maintain a constant physical search radius, thereby achieving adaptive adjustment for different sonar resolutions.
This adaptive formula ensures that $D_{\mathrm{adj}}$ scales appropriately with the maximum possible displacement $\Delta_{\mathrm{max}}(f_s)$ under different sampling rates. When $d_{mj} \leq D_{\mathrm{adj}}$, the newly detected centroid $C_j$ is successfully associated with trajectory $m$, and both trajectories are updated synchronously according to Equations (9) and (10). If the distance between $C_j$ and all existing trajectories exceeds $D_{\mathrm{adj}}$, a new trajectory is initialized as $T_{N_{\mathrm{traj}}+1}(t+1) = \{p_{\mathrm{new}}\}$ and $C_{N_{\mathrm{traj}}+1}(t+1) = \{p_{\mathrm{new}}\}$, where $p_{\mathrm{new}}$ is the coordinate of $C_j$, and the total trajectory count $N_{\mathrm{traj}}$ is incremented by one.
After trajectory association, a complete trajectory set $C_{\mathrm{traj}} = \{C_1(t), C_2(t), \ldots, C_{N_{\mathrm{traj}}}(t)\}$ is obtained, where each $C_m(t)$ is iteratively updated and $N_{\mathrm{traj}}$ is the total number of trajectories. This set serves as the input for the subsequent trajectory filtering algorithm.
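The dual-storage update and adaptive association steps (Equations (9)–(12)) can be sketched as follows. The greedy nearest-neighbour assignment order is an illustrative assumption; $D_{\mathrm{base}} = 17$ pixels and the sampling interval $f_s = 3$ follow the paper:

```python
import math

NAN = float("nan")

class DualTrajectoryTracker:
    """Sketch of the dual-storage association step (Eq. (9)-(12)).

    `realtime` keeps only the last valid centroid per trajectory;
    `complete` keeps every point, with NaN placeholders for missed
    frames so that temporal alignment is preserved.
    """

    def __init__(self, d_base=17.0, f_s=3):
        self.d_adj = d_base * f_s      # adaptive threshold D_adj, Eq. (12)
        self.realtime = []             # last valid centroid per trajectory
        self.complete = []             # full history per trajectory

    def update(self, centroids):
        assigned = [False] * len(self.realtime)
        for c in centroids:
            # Greedy association: nearest unassigned trajectory within D_adj
            best, best_d = None, self.d_adj
            for m, last in enumerate(self.realtime):
                if assigned[m]:
                    continue
                d = math.dist(last, c)             # Euclidean distance, Eq. (11)
                if d <= best_d:
                    best, best_d = m, d
            if best is not None:
                assigned[best] = True
                self.realtime[best] = c            # real-time update, Eq. (9)
                self.complete[best].append(c)      # complete update, Eq. (10)
            else:
                # No trajectory within D_adj: initialise a new one
                self.realtime.append(c)
                self.complete.append([c])
                assigned.append(True)
        # Trajectories with no association this frame get a NaN placeholder
        for m, ok in enumerate(assigned):
            if not ok:
                self.complete[m].append((NAN, NAN))
```

Note that a missed frame never overwrites the real-time entry, so the last valid centroid remains available when the target reappears.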

2.2.4. Trajectory Filtering

Factors such as background noise, occlusion, and detection failures often generate numerous short trajectory segments, which typically correspond to noise points or false detections and interfere with subsequent motion analysis. To extract valid motion trajectories, a post-processing algorithm based on a length threshold is applied to filter the complete trajectory set.
The algorithm takes the complete trajectory set $C_{\mathrm{traj}} = \{C_1(t), C_2(t), \ldots, C_{N_{\mathrm{traj}}}(t)\}$ output by the dual trajectory storage and association algorithm as its input. For each complete trajectory $C_m(t)$, its valid point set $P_m = \{(x, y) \in C_m(t) \mid (x, y) \neq (\mathrm{NaN}, \mathrm{NaN})\}$ is extracted, and its valid length $L_m = |P_m|$ is computed. Only trajectories with a valid length $L_m \geq \tau$ are retained. The filtered set of valid complete trajectories is defined as:
$$C' = \left\{\, C_m(t) \;\middle|\; C_m(t) \in C_{\mathrm{traj}},\; L_m \geq \tau \,\right\} \quad (13)$$
where $\tau$ is the minimum length threshold and $C'$ is the filtered set of valid complete trajectories. In this study, the video frame sampling rate is 1/3, and the minimum length threshold $\tau$ is determined based on specific conditions. This algorithm effectively filters out short trajectory segments caused by background noise and false detections while retaining motion trajectories with sufficient statistical significance.
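A minimal sketch of the length-threshold filter of Equation (13), assuming trajectories are stored as lists of (x, y) tuples with NaN placeholders for missed frames:

```python
import math

def filter_trajectories(c_traj, tau):
    """Length-threshold filtering (Eq. (13)): keep only trajectories whose
    count of non-NaN points (valid length L_m) is at least tau."""
    kept = []
    for traj in c_traj:
        L_m = sum(1 for (x, y) in traj if not math.isnan(x))  # L_m = |P_m|
        if L_m >= tau:
            kept.append(traj)
    return kept
```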

3. Results

After applying foreground extraction, morphological processing, connected component analysis, dual trajectory storage, and adaptive association, multi-target trajectory data with reduced noise, significantly fewer fractures, and high integrity are obtained. These results are visualized below. Figure 6a,b present the results obtained from two different video segments. The artificially suspended target in the video consists of multiple structural units. In Figure 6a, Artificial Target I is composed of three unit structures, resulting in three displayed trajectories in the trajectory plot. Additionally, three longer motion paths corresponding to naturally swimming fish appear during video monitoring. In Figure 6b, Artificial Target I consists of two unit structures; thus, two trajectories are shown. Furthermore, the motion trajectories of two fish are accurately tracked. Each trajectory is distinguished by a different color, with a green circle marking the start point and a red square marking the end point to clearly indicate the motion’s initiation and termination. Overall, the movement path morphology, directional changes, displacement range, and distribution patterns (whether clustered or dispersed) of each target can be directly observed from the visualization.
The visual results indicate significant differences in trajectory length and motion patterns among different targets, as detailed in Table 1 and Table 2. For Video 1, the average trajectory length is 179.16 trajectory points, with the longest trajectory spanning 285 points. For Video 2, the average trajectory length is 191 points, with the longest trajectory comprising 272 points. In Figure 6a, the dense cluster of trajectories in the region with x-coordinates 450–600 pixels and y-coordinates 150–300 pixels corresponds to the activity area of the fixed rotating Target I. Since this target is formed by three connected similar subunits, three distinct target trajectories are obtained. The dispersed independent trajectories in areas II, III, and IV represent the motion paths of individual targets, such as fish trajectories. Some trajectories exhibit long continuous paths, as seen in the 272-point trajectory within region I of Figure 6b, indicating that these targets remained consistently within the sonar field of view and were effectively tracked. This target is constructed from two connected similar subunits, hence two trajectories are generated. These statistical results demonstrate that, under a frame sampling rate of 1/3, the tracking performance achieves good continuity and stability. The extracted trajectories can serve as a foundational input for subsequent recognition tasks, including target classification and behavior analysis.

4. Discussion

To objectively evaluate the effectiveness of the proposed method, the following metrics were adopted: trajectory completeness, trajectory fragmentation rate, association accuracy, and association time. Trajectory completeness measures the integrity of data recording and is defined as:
$$Q_{\mathrm{completeness}} = \frac{1}{N_{\mathrm{traj}}} \sum_{m=1}^{N_{\mathrm{traj}}} \frac{L_m}{T_m} \quad (14)$$
where $N_{\mathrm{traj}}$ is the total number of trajectories in the complete trajectory set $C_{\mathrm{traj}} = \{C_1(t), C_2(t), \ldots, C_{N_{\mathrm{traj}}}(t)\}$; $L_m$ is the number of valid frames for the $m$-th complete trajectory, equal to the number of elements in its valid point set $P_m$, i.e., $L_m = |P_m|$; and $T_m$ is the total number of frames (including NaN frames) for the $m$-th complete trajectory $C_m(t)$.
Taking Video 2 as an example, the trajectory completeness was calculated. As shown in Figure 7a, all detected valid trajectory points were completely recorded, achieving a trajectory completeness of 100%. This indicates that the dual trajectory storage mechanism can fully record and preserve all detected target point information, ensuring that historical data are not physically lost due to filtering or elimination by real-time association logic.
The trajectory fragmentation rate is calculated as:
$$R_{\mathrm{break}} = \frac{\sum_{m=1}^{N_{\mathrm{traj}}} G_m}{\sum_{m=1}^{N_{\mathrm{traj}}} T_m} \quad (15)$$
where $G_m$ is the number of transitions from a NaN state to a valid state for the $m$-th trajectory, i.e., the count of trajectory breaks; $T_m$ and $N_{\mathrm{traj}}$ are defined as in Equation (14). Figure 7b shows that the overall trajectory fragmentation rate is as low as 0.9%, with the fragmentation rate for individual trajectories remaining at relatively low levels (2.24% to 5.15%). This indicates that the proposed method effectively controls trajectory fragmentation while maintaining trajectory completeness.
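Both metrics (Equations (14) and (15)) can be computed directly from the complete trajectory set, since missed frames are stored as NaN placeholders; a sketch:

```python
import math

def completeness_and_breaks(c_traj):
    """Compute Q_completeness (Eq. (14)) and R_break (Eq. (15)) from the
    complete trajectory set, where missed frames appear as NaN points."""
    ratios, breaks, total = [], 0, 0
    for traj in c_traj:
        valid = [not math.isnan(x) for (x, y) in traj]
        L_m, T_m = sum(valid), len(traj)   # valid frames / total frames
        ratios.append(L_m / T_m)
        total += T_m
        # G_m: NaN -> valid transitions (trajectory restarts after a gap)
        breaks += sum(1 for a, b in zip(valid, valid[1:]) if (not a) and b)
    q = sum(ratios) / len(ratios)          # mean of per-trajectory L_m / T_m
    r = breaks / total                     # total breaks over total frames
    return q, r
```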
It should be noted that due to the lack of precise ground truth annotations in underwater sonar images, this study did not employ classical multi-object tracking metrics that require frame-by-frame ground truth comparison (such as MOTA, MOTP, IDF1, etc.). Therefore, we adopt the trajectory fragmentation rate, which does not require external ground truth, as an evaluation metric for tracking continuity. However, it must be acknowledged that this metric cannot directly reflect identity switch errors. To compensate for this limitation, we introduce an adaptive association threshold mechanism at the algorithm level (Equation (12) and related derivation), which reduces unreasonable trajectory associations from the bottom layer through physical constraints, thereby lowering the probability of identity switches.
Figure 8 presents two core metrics for algorithm performance evaluation: association accuracy and association time. The association accuracy is calculated as:
$$A_{\mathrm{trajectory}} = \frac{N_{\mathrm{successful}}}{N_{\mathrm{total}}} \quad (16)$$
where $N_{\mathrm{successful}}$ is the number of successfully associated centroids satisfying $d_{mj} \leq D_{\mathrm{adj}}$, and $N_{\mathrm{total}}$ is the total number of centroids detected across all frames, i.e., the total number of elements in the input set $C = \{(x_j, y_j)\}$. Under a frame sampling rate of 1/3, Figure 8a compares the association accuracy using a fixed threshold (17.0 pixels) versus the adaptive threshold (51.0 pixels). The fixed-threshold method uses a constant distance threshold of 17.0 pixels, but when frames are skipped, targets may move more than 17 pixels, resulting in an average accuracy of only 45.81%, with a marked decline in the later part of the video. The red solid line in the figure shows gaps around frames 0–180 because no centroids were detected in those frames (targets left the field of view or were occluded), making instantaneous accuracy incalculable; NaN values are automatically skipped during plotting. The adaptive-threshold method dynamically adjusts the threshold based on the frame sampling rate ($D_{\mathrm{adj}} = D_{\mathrm{base}} \times f_s = 51$ pixels), ensuring the threshold covers the maximum possible displacement of targets during skipped frames and thereby maintaining stable association performance in frame-skipping scenarios. It achieves an average association accuracy of 96.07%, an improvement of 50.26 percentage points.
The association time is calculated as:
$$T_{\mathrm{association}} = \frac{1}{N_{\mathrm{frames}}} \sum_{t=1}^{N_{\mathrm{frames}}} t_{\mathrm{assoc}}(t) \quad (17)$$
where t a s s o c ( t ) is the association computation time for frame t , and N f r a m e s is the total number of processed frames. The experiments in this study were conducted on a laptop equipped with a 13th Gen Intel Core i5-1340P CPU (1.90 GHz), 16 GB of RAM, and an integrated Intel Iris Xe graphics card, running a 64-bit Windows 11 operating system. The algorithm was implemented using MATLAB R2022a (64-bit) and its Computer Vision Toolbox, with all processing executed on the CPU in the form of single-threaded scripts. The reported association time of 0.28 ms per frame represents the average time consumption of the association module over the complete test sequence and can be directly reproduced under the aforementioned experimental environment.
Figure 8b shows that the average association time is approximately 0.28 ms/frame, well below the real-time target of 1.0 ms/frame. The experimental results demonstrate that the adaptive threshold mechanism significantly improves association accuracy and stability by dynamically adjusting the association distance threshold, effectively addressing the performance degradation of fixed thresholds in frame-skipping scenarios. Furthermore, the algorithm meets real-time requirements, providing reliable trajectory data for subsequent motion analysis.
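The per-frame timing average defined above reduces to a simple mean over wall-clock measurements. The harness below is an illustrative sketch (the `timed_association` name and the dummy callable are assumptions, not part of the published MATLAB code):

```python
import time


def timed_association(associate, frames):
    """Run an association callable once per frame and return the mean
    per-frame association time in milliseconds.

    `associate` stands in for the trajectory-association module;
    `frames` is any iterable of per-frame inputs.
    """
    if not frames:
        return float("nan")
    total = 0.0
    for frame in frames:
        t0 = time.perf_counter()      # high-resolution monotonic clock
        associate(frame)
        total += time.perf_counter() - t0
    return (total / len(frames)) * 1000.0
```

`time.perf_counter` is preferred over `time.time` here because it is monotonic and has the highest available resolution for short intervals such as the sub-millisecond association step.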
Under the condition that all preprocessing parameters, detection parameters, visualization parameters, and ROI settings were kept completely consistent, a comparative experiment was conducted between the proposed algorithm and the classical “Kalman filter + nearest neighbor” tracker. The comparison results are shown in Figure 9a,b.
The recording results of the proposed algorithm align well with the actual motion logic observed in the experiment. In contrast, the “Kalman filter + nearest neighbor” tracker exhibited unstable association performance, generating trajectories that did not correspond to real targets and deviating significantly from the true physical process. This discrepancy is primarily attributed to two issues. First, the linear motion model struggles to match the nonlinear maneuvering behavior of targets such as fish, producing systematic deviations between predicted and actual positions. Second, the tracker is highly sensitive to observation interruptions: once a target is temporarily lost to clutter interference, open-loop prediction rapidly accumulates error, and when the target reappears, the deviation between the predicted and actual positions exceeds the association threshold. The algorithm then mistakenly registers a “new target” rather than recovering the original trajectory, resulting in frequent trajectory fragmentation.
The proposed method introduces a dual trajectory storage mechanism, which collaboratively manages trajectories by categorizing them into “real-time trajectories” and “complete trajectories.” When a target is temporarily lost due to signal interference, the algorithm retains its historical state. Upon target reappearance, the interrupted trajectory segments are effectively reconnected through logical backtracking, significantly improving trajectory recording completeness.
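The dual-storage idea can be sketched with the minimal structure below. This is an assumed interface for illustration, not the authors' MATLAB implementation: `realtime` holds only the last known position per target for cheap nearest-distance association, while `complete` preserves every recorded point, including NaN placeholders for frames where the target was lost.

```python
import math


class DualTrajectoryStore:
    """Sketch of the dual trajectory storage mechanism."""

    def __init__(self):
        self.realtime = {}   # target id -> last known (x, y)
        self.complete = {}   # target id -> full history of (x, y) points

    def update(self, tid, point):
        """Record a confirmed detection in both stores."""
        self.realtime[tid] = point
        self.complete.setdefault(tid, []).append(point)

    def mark_lost(self, tid):
        """Target temporarily lost: keep the last state in `realtime`
        so it can be re-associated later, and log the gap as NaN in
        the complete trajectory."""
        self.complete.setdefault(tid, []).append((math.nan, math.nan))

    def associate(self, tid, point, d_adj):
        """Reconnect a new detection to trajectory `tid` if it lies
        within the adaptive threshold of the retained last state."""
        last = self.realtime.get(tid)
        if last is None:
            return False
        if math.dist(last, point) <= d_adj:
            self.update(tid, point)
            return True
        return False
```

Because `realtime` is a flat id-to-point map, the association step touches only one point per trajectory, while `complete` grows monotonically and is consulted only when trajectories are finalized, which is the computation/storage separation the text describes.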
The selection of the frame sampling rate involves a trade-off among association accuracy, trajectory fragmentation rate, and computational efficiency. To systematically optimize this parameter, the following objective function is used to determine the optimal frame sampling rate:
f_skip^(opt) = argmin_{f_s} [ (1 − A_trajectory(f_s)) + α · R_break(f_s) + λ · (T_association(f_s) / T_max) ]
where f_s is a candidate frame-skip interval; T_max = 1.0 ms/frame is the real-time target threshold; and α and λ are the weight parameters for the trajectory fragmentation rate and the association time, respectively. Here, α = 0.5 balances accuracy against fragmentation rate to ensure tracking quality, while λ = 0.3 serves as an efficiency penalty that keeps the algorithm within real-time requirements.
By iterating over different f s values and computing the objective function, the f s that minimizes the objective function is selected as the optimal frame-skip interval. Based on this optimization framework, a frame sampling rate of 1/3 is chosen as the optimal interval, demonstrating good stability across test sequences with varying target numbers and motion complexities.
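The grid search over candidate skip intervals can be sketched as follows. The metric values in the usage example are hypothetical placeholders, not measurements from the paper:

```python
def objective(accuracy, frag_rate, assoc_ms, alpha=0.5, lam=0.3, t_max=1.0):
    """Objective from the text: (1 - A) + alpha * R_break + lam * T / T_max."""
    return (1.0 - accuracy) + alpha * frag_rate + lam * (assoc_ms / t_max)


def best_skip(metrics):
    """Select the frame-skip interval minimizing the objective.

    `metrics` maps each candidate skip interval to a tuple of
    (association accuracy, fragmentation rate, association time in ms).
    """
    return min(metrics, key=lambda fs: objective(*metrics[fs]))
```

For instance, with hypothetical per-interval measurements such as `{1: (0.99, 0.02, 0.9), 3: (0.9607, 0.009, 0.28), 5: (0.85, 0.05, 0.2)}`, the lower fragmentation and association time at a skip of 3 outweigh its small accuracy loss, consistent with the 1/3 sampling rate chosen in the text.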

5. Conclusions

(1)
To address trajectory fragmentation and the trade-off between association efficiency and data integrity in tracking the motion trajectories of micro-scale targets in underwater sonar images, a multi-target tracking method based on a dual trajectory storage mechanism is proposed. The method employs Gaussian Mixture Model (GMM) foreground extraction, morphological preprocessing, and connected component analysis to obtain target centroids, and uses an adaptive distance threshold for trajectory association, effectively handling trajectory breakage, association efficiency, and data integrity for micro-scale targets.
(2)
The dual trajectory storage mechanism uses real-time trajectories for fast association computation while maintaining complete trajectories that preserve all historical information, thereby achieving separation between computation and storage. The association algorithm adopts an adaptive distance threshold strategy, where the threshold is dynamically adjusted based on the frame sampling rate to effectively accommodate the maximum displacement of targets under different frame-rate scenarios. Experimental results show that the adaptive threshold mechanism significantly improves association accuracy. The algorithm meets real-time requirements while ensuring data integrity. Overall, the algorithm demonstrates clear advantages in both accuracy enhancement and trajectory completeness.
(3)
The proposed method successfully tracks all targets in the experiments, including artificial targets (aluminum three-cylinder and two-cylinder structures) and naturally swimming fish. False trajectories are filtered out based on trajectory length, outputting complete and quantifiable trajectory data. This provides high-quality trajectory feature data for the motion analysis of underwater micro-scale multi-targets, supporting the extraction of behavioral features such as movement distance, start/end positions, and other relevant metrics.
The method exhibits strong engineering transferability and can be applied to practical tasks such as marine ecological observation, underwater security surveillance, and fishery resource monitoring, offering robust support for micro-scale multi-target motion analysis in underwater environments.

When targets are in extremely close proximity and their detection centroids overlap, the algorithm assigns the merged point to the nearest trajectory, while the other trajectory temporarily holds a NaN value and enters a protected state. Short-term overlap therefore preserves trajectory integrity, whereas long-term overlap leads to removal by the length-filtering mechanism owing to an insufficient number of valid points. Beyond overlap scenarios, however, the distance-threshold-based association strategy remains susceptible to trajectory fragmentation or ID switching under prolonged occlusion, sudden acceleration, or frequent target crossing.

Regarding memory overhead, although the dual trajectory storage mechanism nominally doubles the storage of a single-trajectory design, quantitative analysis shows that it adds only approximately 320 KB, demonstrating good hardware adaptability. The design trades a small amount of space for improved processing efficiency and trajectory continuity, satisfying both real-time and stability requirements; the “double memory” refers to a logical functional partition rather than data redundancy.

Finally, concerning acoustic artifacts, the proposed method effectively suppresses interference through basic sonar equipment filtering combined with an area-threshold-based foreground extraction algorithm. Future work will incorporate acoustic propagation models to address more complex environments.

Author Contributions

Conceptualization, P.Z., R.W. and Z.H.; methodology, Z.H. and P.Z.; software, Z.H. and P.Z.; validation, Z.H.; formal analysis, Z.H., P.Z. and R.W.; investigation, Z.H. and P.Z.; resources, P.Z. and R.W.; data curation, Z.H., X.X., Q.W. (Qi Wang), J.H. and Q.W. (Qinyu Wu); writing—original draft preparation, Z.H., X.X., Q.W. (Qi Wang) and J.H.; writing—review and editing, P.Z. and R.W.; visualization, Z.H.; supervision, P.Z. and R.W.; project administration, Z.H., P.Z. and R.W.; funding acquisition, P.Z. and R.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Joint Fund for Offshore Wind Power under the Guangdong Basic and Applied Basic Research Fund (Project No. 2023A1515240013), the Guangdong Province College Students’ Innovation and Entrepreneurship Project (Project No. S202510566080), and the Scientific Research Start-up Funds of Guangdong Ocean University (Project No. 06032112311). Academic support and a research environment were provided by the Marine Acoustics and Information Processing Innovation Team of Guangdong Ocean University (Team No. CCTD201822).

Data Availability Statement

The sonar equipment and field trial work were supported and funded by Shanghai Jiao Tong University; the resulting data are currently not available to the public or other researchers.

Acknowledgments

The authors would like to thank the Marine Acoustic Information Processing Innovation Team of Guangdong Ocean University (Team code: CCTD201822) for providing academic support and research environment. The authors have reviewed and edited the output and take full responsibility for the content of this publication. The sonar equipment used in this study and the field trial work were supported and funded by Shanghai Jiao Tong University.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Figure 1. Underwater Target Sonar Measurement Experimental Platform.
Figure 2. A sonar image of the targets to be detected during the three-target coexistence phase (two fish and the artificially suspended target).
Figure 3. A sonar image of the targets to be detected during the two-target coexistence phase (one fish and the artificially suspended target).
Figure 4. Foreground modeling and processing with three-step morphological preprocessing. (a) Original grayscale frame. (b) Foreground binary image. (c) Image after three steps of morphological processing.
Figure 5. Blob Detection and Area Statistics. (a) Blob detection results image. (b) Blob area statistics chart (red dashed line represents the minimum area threshold).
Figure 6. Detection path. (a) Motion trajectory diagram from Video 1. (b) Motion trajectory diagram from Video 2.
Figure 7. Trajectory completeness and fragmentation-rate statistics for Video 2. (a) Video 2 trajectory integrity. (b) Video 2 fragmentation-rate statistics. The red dashed line represents the average fragmentation rate of the four trajectories.
Figure 8. Performance analysis of the target association algorithm. (a) Comparison of association accuracy between fixed-threshold and adaptive-threshold methods. (b) Statistics of association time per frame.
Figure 9. Kalman filter + nearest neighbor tracking trajectories. (a) Motion trajectory diagram from Video 1. (b) Motion trajectory diagram from Video 2.
Table 1. Statistics of motion trajectory in video 1.
| Object ID             | Trajectory Points | Start Point X | Start Point Y | End Point X | End Point Y |
|-----------------------|-------------------|---------------|---------------|-------------|-------------|
| Object I, Subobject 1 | 244               | 505.15        | 272.93        | 554.43      | 212.77      |
| Object I, Subobject 2 | 71                | 511.47        | 219.85        | 567.89      | 254.48      |
| Object I, Subobject 3 | 285               | 483.93        | 269.93        | 519.00      | 195.96      |
| Object II             | 235               | 130.48        | 215.18        | 387.72      | 77.42       |
| Object III            | 113               | 402.15        | 364.97        | 620.26      | 395.45      |
| Object IV             | 130               | 761.45        | 138.90        | 779.19      | 334.35      |
Table 2. Statistics of motion trajectory in video 2.
| Object ID             | Trajectory Points | Start Point X | Start Point Y | End Point X | End Point Y |
|-----------------------|-------------------|---------------|---------------|-------------|-------------|
| Object I, Subobject 1 | 272               | 627.5         | 302.7         | 654.9       | 293.6       |
| Object I, Subobject 2 | 231               | 628.0         | 330.2         | 647.8       | 342.7       |
| Object II             | 103               | 684.4         | 66.4          | 750.2       | 102.2       |
| Object III            | 103               | 601.6         | 62.3          | 642.2       | 107.8       |

Share and Cite

MDPI and ACS Style

Huang, Z.; Zhang, P.; Wang, R.; Xian, X.; Wang, Q.; Hu, J.; Wu, Q. Miniature Multi-Target Tracking in Sonar Images Using Dual Trajectory Storage Method. J. Mar. Sci. Eng. 2026, 14, 568. https://doi.org/10.3390/jmse14060568

