Article

Multi-Domain Intelligent State Estimation Network for Highly Maneuvering Target Tracking with Non-Gaussian Noise

by Zhenzhen Ma, Xueying Wang, Yuan Huang *, Qingyu Xu, Wei An and Weidong Sheng
College of Electronic Science and Technology, National University of Defense Technology, Changsha 410073, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(24), 4016; https://doi.org/10.3390/rs17244016
Submission received: 30 October 2025 / Revised: 21 November 2025 / Accepted: 10 December 2025 / Published: 12 December 2025
(This article belongs to the Section AI Remote Sensing)

Highlights

What are the main findings?
  • We propose a Multi-domain Intelligent Estimation Network (MIENet) for tracking highly maneuvering targets with non-Gaussian noise.
  • The MIENet consists of a fusion denoising model (FDM) and a parameter estimation model (PEM), which jointly enhance robustness and accuracy in state estimation for radar trajectory and remote sensing video data.
What are the implications of the main findings?
  • The proposed MIENet achieves robust tracking performance across various noise intensities and distributions, significantly reducing observation noise and improving estimation stability.
  • Our approach effectively generalizes from radar trajectory simulation to real satellite video tracking tasks, showing great potential for intelligent remote sensing target tracking in complex environments.

Abstract

In the field of remote sensing, tracking highly maneuvering targets is challenging due to their rapidly changing motion patterns and uncertainties, particularly under non-Gaussian noise conditions. In this paper, we consider the problem of tracking highly maneuvering targets under non-Gaussian noise without using preset parameters. We propose a multi-domain intelligent state estimation network (MIENet). It consists of two main models that estimate the key parameter for the Unscented Kalman Filter, enabling robust tracking of highly maneuvering targets under various intensities and distributions of observation noise. The first model, called the fusion denoising model (FDM), is designed to eliminate observation noise by enhancing multi-domain feature fusion. The second model, called the parameter estimation model (PEM), is designed to estimate key parameters of target motion by learning both global and local motion information. Additionally, we design a physically constrained loss function (PCLoss) that incorporates physics-informed constraints and prior knowledge. We evaluate our method on radar trajectory simulations and a real remote sensing video dataset. Simulation results on the LAST dataset demonstrate that the proposed FDM can reduce the root mean square error (RMSE) of observation noise by more than 60%. Moreover, the proposed MIENet consistently outperforms state-of-the-art state estimation algorithms across various highly maneuvering scenes, achieving this performance without requiring adjustment of noise parameters under non-Gaussian noise. Furthermore, experiments conducted on the real-world SV248S dataset confirm that MIENet generalizes effectively to satellite video object tracking tasks.

1. Introduction

In the context of Earth observation (EO), target tracking aims to estimate target states from noisy EO measurements acquired by satellite, radar, and UAV sensors. It supports UAV object tracking [1,2,3,4], remote sensing surveillance [5,6,7], and traffic control [8,9,10]. Within the domain of target tracking, the problem of tracking highly maneuvering targets is widely recognized as a fundamental challenge [11,12,13]. Specifically, highly maneuvering targets are characterized by motion involving high speeds, frequent maneuvers, and long-duration maneuvers. Tracking such targets presents two main challenges. The first challenge is high-speed and high-frequency maneuvers [14]. Such maneuvers cause rapid motion changes and lead to error accumulation. The second challenge arises from non-Gaussian noise with different intensities and distributions [12], causing instability in state estimation results and divergence in tracking performance. Multiple paradigms have been proposed to address these challenges. They can be broadly classified into two types: model-based methods and data-driven methods.
For model-based methods [15,16,17,18,19,20,21], the interacting multiple model (IMM) filter [19] is a fundamental algorithm for addressing the problem of rapidly changing motion patterns [22]. It combines several predefined motion models and works with improved Kalman filters [15,17,23,24,25,26,27,28,29] to achieve reliable tracking in nonlinear systems. To enhance adaptability under non-Gaussian noise, Xu et al. [30] proposed an EM-based extended Kalman filter (EKF) tracker that decomposed non-Gaussian noise into Gaussian components and incorporated bias compensation, significantly improving angle-of-arrival target tracking under non-Gaussian noise conditions. Building on the idea of Gaussian decomposition, Chen et al. [31] proposed a joint probabilistic data association method based on a noise-interaction Kalman filter (JPDA-IKF), which decomposed non-Gaussian noise into multiple Gaussian components and fused their results to enhance multitarget tracking performance in complex maritime environments. To further capture multimodal noise and motion characteristics, Wang et al. [32] introduced Gaussian mixture models (GMMs) into the IMM-KF, allowing accurate switching and fusion across multiple motion models. Liu et al. [33] proposed an IMM-MCQKF algorithm that integrated the maximum correntropy criterion into the quadrature Kalman filter within an IMM framework, improving state estimation under non-Gaussian noise and maneuvering target scenarios. Xie et al. [34] proposed an adaptive TPM-based parallel IMM algorithm that integrated online transition probability adaptation with a threshold-controlled model-jumping mechanism within the IMM framework.
The above model-driven methods require accurate motion patterns and noise covariance parameters based on manual experience. In practice, constructing an accurate and well-matched model is challenging, and such prior information cannot be obtained promptly and reliably before tracking [22]. Additionally, these methods process trajectories only in the temporal domain and do not fully exploit multi-domain information. As a result, they neglect the latent representational information in noisy observations, which reduces their sensitivity to maneuvers under non-Gaussian noise.
For data-driven methods, precise kinematic modeling of target motion is unnecessary. They leverage the strength of data mining to capture complex dynamics. They can be broadly categorized into two types: end-to-end architectures and hybrid architectures that integrate traditional algorithms. The former constructs an end-to-end model to directly map the input observations to the output states, eliminating the need for manually designed intermediate feature extraction. Cai et al. [35] proposed an adaptive LSTM-based tracking method that incorporated measurement–prediction errors to improve accuracy and adaptability for maneuvering targets. Liu et al. [11] learned the residual between the ground-truth and estimated trajectories. This framework mainly consisted of BiLSTM layers [36,37,38] and enabled target state estimation without the need for a predefined model set. Building upon these, Zhang et al. [13,39] employed an enhanced Transformer architecture to achieve further reductions in state estimation RMSE. Shen et al. [40] also proposed Transformer-based nonlinear target trackers, including a classical Transformer for smoothing and prediction and a recursive Transformer for filtering and prediction, achieving superior accuracy and efficiency in nonlinear radar target tracking. Liu et al. [41] proposed a physics-informed data-driven autoregressive nonlinear filter (DAF) based on the Transformer. Its end-to-end architecture embedded known dynamics and measurement models into the training process, ensuring physically consistent and efficient state estimation.
The above end-to-end data-driven methods primarily rely on a single model to capture complex nonlinear functions, which often leads to convergence difficulties in practice. Moreover, a wide variation range of target motion may degrade their performance and generalization.
The latter hybrid methods [12,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57] integrate deep learning with model-driven approaches, thereby combining the strengths of both paradigms. These methods retain the framework of model-driven algorithms while incorporating deep learning modules to estimate variable parameters. By involving less functional learning and relying on simpler architectures, they achieve lower computational cost and enhanced interpretability. KalmanNet [42] integrated a structural state space model with dedicated recurrent neural networks (RNNs) to estimate the Kalman gain. It operated within the KF framework and effectively addressed non-linearity. Building on this idea, KalmanFormer [53] employed a Transformer to learn the Kalman Gain. To handle mismatches between state and observation models, Split-KalmanNet [43] calculated the Kalman gain through the Jacobian of the measurement function and leveraged two RNNs with a split architecture. Cholesky-KalmanNet [45] further enhances state estimation by providing and enforcing transiently precise error covariance estimation. Recursive KalmanNet [48] extended this line of research by employing the recursive Joseph’s formulation, thereby enabling accurate state estimation with consistent uncertainty quantification under non-Gaussian noise. Latent-KalmanNet [44] advanced the framework by jointly learning a latent representation and Kalman-based tracking from high-dimensional signals. More recently, MAML-KalmanNet [54] integrated model-agnostic meta-learning with artificially generated labeled data to achieve fast adaptation and accurate state estimation under partially unknown models. In addition, Liu et al. [12] introduced a digital twin system that combined the Unscented Kalman Filter with intelligent estimation models. Zhu et al. [58] developed an adaptive IMM algorithm that employed a neural network to estimate TPMs in real time, thereby reformulating IMM as a generalized recurrent neural network. Fu et al. [46] proposed deep learning-aided Bayesian filtering methods for guarded semi-Markov switching systems with soft constraints, addressing the challenge of state estimation under complex sojourn times and state-dependent transitions. Furthermore, Jia et al. [50] proposed a Multiple Variational Kalman-GRU model to capture multimodal ship dynamics and provide reliable trajectory prediction.
This paradigm endows the network with stronger generalization ability and interpretability. Considering these advantages, we adopt this paradigm in our work. However, existing hybrid approaches that integrate deep learning into traditional model-driven frameworks still have notable limitations. They typically rely solely on temporal-domain processing and fail to exploit complementary information from spatial and frequency domains, thereby restricting their effectiveness. In addition, when dealing with maneuvering targets, these methods often emphasize local temporal features while overlooking broader global motion patterns, which are critical for capturing abrupt changes in dynamics.
To overcome these issues, we propose a versatile and adaptive tracking framework called the Multi-Domain Intelligent State Estimation Network (MIENet) for highly maneuvering target tracking that can effectively handle non-Gaussian noise. The MIENet first applies a nonlinear transformation for coordinate system conversion and comprises two main models. Due to the nonlinear, heavy-tailed, and spatially correlated characteristics of non-Gaussian noise, standard Gaussian-based methods fail to remove it effectively. The independent module in the MIENet enables adaptive learning of these complex noise features, thereby improving convergence stability and denoising efficiency. Furthermore, leveraging multi-domain information helps capture the statistical characteristics of non-Gaussian noise. Therefore, the first model, termed the fusion denoising model (FDM), mitigates non-Gaussian observation noise by integrating temporal, spatial, and frequency-domain features.
As discussed in [12], bias in the state transition matrix $F$ causes rapidly growing deviations, while bias in the initial state $x_0$ results in linear growth. Hence, errors from an inaccurate transition matrix become much greater than those from an incorrect initial state as the trajectory extends. The main challenge in estimating the maneuvering target motion model lies in the uncertainty of $F$, whose exact form is typically unknown; thus, accurately estimating its key parameters is critical to avoid computational bias. Motivated by this, the second model, termed the parameter estimation model (PEM), estimates the turn rate of target motion from temporal–spatial information. We also introduce a physically constrained loss function (PCLoss) to avoid motion estimation bias. The estimated parameters are then incorporated into the UKF to achieve accurate tracking of highly maneuvering targets. The entire tracking process is iteratively updated using a fixed sliding-window mechanism. The main contributions of this work are summarized as follows:
  • We propose a MIENet to estimate target states and motion patterns from non-Gaussian observation noise. The multi-domain information of target trajectories can be well incorporated, thus helping to infer the latent state of targets.
  • We design an FDM to handle noise with varying intensities and distributions. It employs a novel Inverted-UNet and FFT-based weighting to integrate temporal, spatial, and frequency-domain features within an encoder–decoder framework.
  • We develop a PEM to estimate target motion patterns by fusing local and global spatial–temporal features. Considering that the state transition matrix characterizes motion dynamics and enables trajectory inference, we further introduce a PCLoss to help the PEM estimate the key turn rate parameter ( ω ).
  • Experiments on the LAST [11] and SV248S [59] datasets demonstrate the superior performance of our method in various highly maneuvering scenes. Compared with existing methods, our method is more robust to observation noise with varying intensities and distributions.

2. Materials and Methods

2.1. Problem Formulation

Our method is dedicated to maneuvering target tracking and shows limited dependence on the observation modality. Using the LAST dataset as an example, we analyze a radar tracking system. As shown in Figure 1, the system structure is based on range and azimuth observations [12]. Because of the long distance to the target and the limited measurement resolution, we ignore the shape of targets and treat them as points in space [14]. The relationship between target states and observations is represented by the state space model (SSM) as follows:
State Equation: $x_k = F_k x_{k-1} + n_k$,
Observation Equation: $z_k = h(x_k) + m_k$,
where $x_k$ and $z_k$ are the target state and observation at time step $k$, respectively. $x_k$ is defined as $[p_{x,k}, p_{y,k}, v_{x,k}, v_{y,k}]$, where $[p_{x,k}, p_{y,k}]$ is the two-dimensional position and $[v_{x,k}, v_{y,k}]$ is the corresponding velocity. $z_k$ is defined as $[d_k, \theta_k]$, where $d_k$ and $\theta_k$ are the range and azimuth of the target detected by the sensor. $n_k$ and $m_k$ are additive noise terms at time step $k$; $F_k$ and $h(\cdot)$ are the state transition matrix and observation function.

2.1.1. State Equation

For most 2D maneuvering tracking scenes, Constant Turn (CT) models [14] are representative maneuvering patterns that can approximate most motion behaviors. The matrix $F_k$ is defined as follows:
$$F_k = \begin{bmatrix} 1 & 0 & \dfrac{\sin(\omega_k s_\tau)}{\omega_k} & \dfrac{\cos(\omega_k s_\tau) - 1}{\omega_k} \\ 0 & 1 & \dfrac{1 - \cos(\omega_k s_\tau)}{\omega_k} & \dfrac{\sin(\omega_k s_\tau)}{\omega_k} \\ 0 & 0 & \cos(\omega_k s_\tau) & -\sin(\omega_k s_\tau) \\ 0 & 0 & \sin(\omega_k s_\tau) & \cos(\omega_k s_\tau) \end{bmatrix},$$
where $\omega_k$ denotes the turn rate at time step $k$. It is a critical parameter whose accuracy significantly affects performance. The symbol $s_\tau$ represents the sampling interval of the trajectories. If $\omega_k$ equals zero, the motion model reduces to the Constant Velocity (CV) model, defined as follows:
$$F_k = \begin{bmatrix} 1 & 0 & s_\tau & 0 \\ 0 & 1 & 0 & s_\tau \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.$$
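To make the two motion models concrete, the following is a minimal Python sketch (not the authors' code) that builds $F_k$ from Equation (3) and falls back to the CV matrix of Equation (4) as the turn rate approaches zero; the sampling interval and state values are illustrative assumptions.

```python
import numpy as np

def f_ct(omega: float, s_tau: float) -> np.ndarray:
    """Constant Turn transition matrix for the state [px, py, vx, vy]."""
    if abs(omega) < 1e-8:                      # CV limit as omega -> 0
        return np.array([[1, 0, s_tau, 0],
                         [0, 1, 0, s_tau],
                         [0, 0, 1, 0],
                         [0, 0, 0, 1]], dtype=float)
    s, c = np.sin(omega * s_tau), np.cos(omega * s_tau)
    return np.array([[1, 0, s / omega, (c - 1) / omega],
                     [0, 1, (1 - c) / omega, s / omega],
                     [0, 0, c, -s],
                     [0, 0, s, c]], dtype=float)

# Example: one-step propagation at a 40 deg/s turn rate (illustrative values)
x_k = np.array([1000.0, 2000.0, 50.0, -30.0])   # [px, py, vx, vy]
x_next = f_ct(np.deg2rad(40.0), s_tau=1.0) @ x_k
```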

2.1.2. Observation Equation

(1) The Radar Tracking System
For the common radar tracking system, the detector at $(r_x, r_y)$ provides range and azimuth measurements contaminated by additive noise. The observation in polar coordinates is given as follows:
$$\begin{bmatrix} d_k \\ \theta_k \end{bmatrix} = \begin{bmatrix} \sqrt{(p_{x,k} - r_x)^2 + (p_{y,k} - r_y)^2} \\ \arctan\bigl((p_{y,k} - r_y)/(p_{x,k} - r_x)\bigr) \end{bmatrix} + \begin{bmatrix} q_{d,k} \\ q_{\theta,k} \end{bmatrix},$$
where $q_{d,k}$ and $q_{\theta,k}$ are additive noise terms in the range and azimuth observations at time step $k$.
Subsequently, the polar coordinates are transformed into Cartesian coordinates as the input of our method. In this step, the additive noise is converted into non-Gaussian noise by the nonlinear transformation. The process is formulated as follows:
$$\begin{bmatrix} p_{x,k} \\ p_{y,k} \end{bmatrix} = \begin{bmatrix} d_k \cos(\theta_k) \\ d_k \sin(\theta_k) \end{bmatrix} + \begin{bmatrix} r_x \\ r_y \end{bmatrix}.$$
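The effect of this nonlinear transformation on the noise can be illustrated with a short sketch, assuming a sensor at the origin and the Gaussian noise levels later used for training (10 m range, 0.006 rad azimuth); it is not the authors' code, only a demonstration that the converted position noise is no longer Gaussian.

```python
import numpy as np

rng = np.random.default_rng(0)
r_x, r_y = 0.0, 0.0                        # assumed sensor position
p_x, p_y = 15000.0, 20000.0                # assumed true target position

# Equation (5): noisy polar observations over many realizations
d_true = np.hypot(p_x - r_x, p_y - r_y)
theta_true = np.arctan2(p_y - r_y, p_x - r_x)
d_obs = d_true + rng.normal(0.0, 10.0, size=10000)
theta_obs = theta_true + rng.normal(0.0, 0.006, size=10000)

# Equation (6): conversion to Cartesian coordinates
px_obs = d_obs * np.cos(theta_obs) + r_x
py_obs = d_obs * np.sin(theta_obs) + r_y

# The Cartesian position errors form a curved, non-Gaussian cloud
err = np.stack([px_obs - p_x, py_obs - p_y], axis=1)
print("error mean:", err.mean(axis=0), "error std:", err.std(axis=0))
```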
(2) The Visual Remote Sensing Tracking System
For the visual remote sensing tracking system, the observation is linear and corresponds to the target’s position in the image. The observation matrix $H$ is defined as follows:
$$H = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}.$$
The SV248S dataset used in our experiments was captured by the Jilin-1 video satellite, where platform-induced distortions—primarily blur and contrast variations—can be reasonably approximated as Gaussian noise [59].

2.1.3. Problem Addressed in This Work

The main task is to estimate the target state $x_k$ from the nonlinear observation $z_k$. This work addresses two primary challenges. First, the observations are often corrupted by diverse types of noise. In practice, for nonlinear observations, the noise distributions we process are rarely Gaussian; instead, they may vary in intensity, exhibit non-Gaussian characteristics, or change dynamically across different environments. Second, maneuvering targets introduce additional difficulties due to their highly dynamic and uncertain motion patterns. A key challenge lies in accurately estimating critical motion parameters in the state transition matrix $F_k$. Errors in estimating such parameters can severely degrade trajectory inference, as the deviation induced by model mismatch grows rapidly with trajectory length. This work aims to design a robust framework for stable and reliable tracking of highly maneuvering targets. The framework can effectively handle non-Gaussian noise and accurately estimate key motion parameters.

2.2. Multi-Domain Intelligent State Estimation Network

In this section, we introduce our MIENet in detail. The overall architecture of the proposed method is shown in Figure 2.

2.2.1. Overall Architecture

As illustrated in Figure 2, we segment the acquired trajectories using a sliding window, and our MIENet processes fixed-length noisy polar-coordinate trajectories as input. It first applies a nonlinear transformation (Equation (6)) to convert polar coordinates $(d_k, \theta_k)$ into Cartesian coordinates $(p_{x,k}, p_{y,k})$. In this process, additive noise is transformed into non-Gaussian noise. A fusion denoising model (FDM, Section 2.2.2) is then employed to suppress non-Gaussian observation noise by integrating temporal, spatial, and frequency-domain features. Subsequently, a parameter estimation model (PEM, Section 2.2.3) estimates the turn rate from the denoised trajectories output by the FDM. It leverages both local and global temporal–spatial information. The estimated parameter is then incorporated into a UKF for maneuvering target tracking. Notably, the FDM and PEM are trained independently and evaluated jointly with the UKF during testing. A summary of the overall tracking process is provided in Algorithm 1. The FDM and PEM are described in detail below.
Algorithm 1 MIENet tracking process
Input: Noisy observations $z$ with $N$ time steps.
Output: Target states $x$ with $N$ time steps.
1: Initialization: Independently trained FDM and PEM; fixed model input length $n$.
2: for $k = 1$ to $N$ do
3:  while $k \le N - n$ do
4:   Segment extraction: $z_{\mathrm{seg}}(n)$
5:   Position recovery with Equation (6) to obtain noisy position sequences: $d_{\mathrm{seg}}(n)$
6:   Noise elimination: $D_{\mathrm{seg}}(n) = \mathrm{FDM}(d_{\mathrm{seg}}(n))$
7:   Turn rate estimation: $\tilde{\omega}_k = \mathrm{PEM}(D_{\mathrm{seg}}(n))$
8:   Calculate the state transition matrix with Equation (3): $\tilde{F} = F(\tilde{\omega}_k)$
9:  end while
10:  Prediction and update of the UKF: $\tilde{x}_k \rightarrow x_k$
11: end for
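For readers who prefer code, the following Python sketch mirrors Algorithm 1 under stated assumptions: `fdm`, `pem`, and `ukf` are hypothetical stand-ins for the trained FDM, the trained PEM, and an unscented Kalman filter, and `f_ct` is the CT-matrix helper from the earlier sketch; it is not the authors' implementation.

```python
import numpy as np

def to_cartesian(z_seg: np.ndarray) -> np.ndarray:
    """Convert (range, azimuth) rows to (px, py) rows for a sensor at the origin."""
    d, theta = z_seg[:, 0], z_seg[:, 1]
    return np.stack([d * np.cos(theta), d * np.sin(theta)], axis=1)

def track(z_polar: np.ndarray, fdm, pem, ukf, n: int = 20) -> np.ndarray:
    """Sliding-window MIENet tracking over N noisy polar observations."""
    N = len(z_polar)
    states = []
    for k in range(N):
        if k >= n:                                     # enough history for one window
            z_seg = z_polar[k - n:k]                   # segment extraction
            d_seg = to_cartesian(z_seg)                # position recovery, Eq. (6)
            D_seg = fdm.denoise(d_seg)                 # noise elimination
            omega_k = pem.estimate_turn_rate(D_seg)    # turn-rate estimation
            ukf.F = f_ct(omega_k, s_tau=1.0)           # transition matrix, Eq. (3)
        ukf.predict()
        ukf.update(z_polar[k])                         # UKF prediction and update
        states.append(ukf.x.copy())
    return np.asarray(states)
```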

2.2.2. Fusion Denoising Model

(1) Motivation
In practice, the statistical characteristics of observation noise exist not only in the temporal domain but also in the spatial and frequency domains. Temporal features alone are insufficient to capture oscillatory energy distributions in the frequency domain or spatial structural correlations, both of which are essential for robust denoising and stable tracking under non-Gaussian noise.
As noted in [60], the encoder–decoder structure enables effective separation of noise and signal in the latent space, ensuring noise suppression during reconstruction. However, a traditional encoder–decoder structure consists of an encoder, a decoder, and plain connections. The encoder increases the feature dimension while reducing the spatial resolution, and the decoder recovers the spatial resolution to match the size of the input trajectories. Such architectures mainly emphasize hierarchical semantic features derived from temporal information. Although deepening the network or modifying the encoder submodules can capture higher-level representations, these changes do not explicitly address the multi-domain characteristics of observation noise. This motivates the design of a dedicated module within the encoder–decoder framework to explicitly extract and fuse temporal, spatial, and frequency-domain features.
(2) The Overall Structure of FDM
As illustrated in Figure 3, the framework adopts a symmetric encoder–decoder structure with an additional multi-domain feature module (MDFM). The input trajectories are converted from polar to Cartesian coordinates and segmented into fixed-length sequences. These processed sequences are subsequently fed into the Transformer encoder to extract temporal-domain features. The MDFM further captures frequency-domain and spatial-domain information and fuses it with the temporal dynamics. Finally, the fused multi-domain features are passed into the decoder subnetworks for trajectory reconstruction and prediction.
(3) The Multi-Domain Feature Module
As shown in Figure 3, the multi-domain feature module (MDFM) is embedded within the encoder to enhance feature representation beyond the temporal domain. It first performs padding and reshaping on the input data and then passes it through the Inverted-UNet. The Inverted-UNet is specifically designed to capture multi-channel spatial structures while maintaining computational efficiency, which is critical for robust trajectory representation. The extracted features are then normalized with Layer Normalization and activated by a LeakyReLU, with this block repeated p times according to experimental results.
In parallel, the Fast Fourier Transform (FFT) is applied to the temporal features to reveal frequency-domain amplitude distributions. These amplitudes are used as weights, allowing the model to emphasize oscillatory patterns that are often hidden in noisy observations. The denoised spatial features from the Inverted-UNet and the weighted frequency-domain information are fused through a softmax layer.
By combining the Inverted-UNet and FFT, MDFM explicitly leverages spatial structures and frequency-domain energy patterns, providing a more comprehensive feature representation that significantly improves noise suppression. The following sections present a detailed description of these two key components.
(a) The Inverted-UNet
For low-dimensional trajectory sequences, the traditional dense U-Net (Figure 4a) is unsuitable, as its initial downsampling step leads to an immediate loss of information. To address this, we designed a novel Inverted-UNet (Figure 4b). It first upsamples the multi-channel trajectory features to preserve spatial structure and then downsamples them to restore the original trajectory shape. Skip connections promote gradient flow.
Let $R_{i,j}$ denote the output of each node, where $i$ indexes the $i$-th upsampling layer in the encoder and $j$ indexes the $j$-th neighboring layer. To balance network performance and computational efficiency, the Inverted-UNet was designed with a depth of four layers, where the indices $i$ and $j$ take values from 0 to 3. The computation of each node feature is given as follows:
$$R_{i,0} = \mathrm{UP}(R_{i-1,0}),$$
$$R_{i,1} = [\mathrm{UP}(R_{i-1,1}),\ R_{i,0},\ \mathrm{DN}(R_{i+1,0})],$$
$$R_{i,2} = [\mathrm{SK}(R_{i,0}),\ R_{i,1},\ \mathrm{DN}(R_{i+1,1})],$$
$$R_{i,3} = [\mathrm{SK}(R_{i,0}),\ R_{i,2},\ \mathrm{DN}(R_{i+1,2})],$$
$$R_{\mathrm{out}} = \mathrm{Conv}(R_{i,3}),$$
where $\mathrm{UP}$ and $\mathrm{DN}$ denote upsampling and max-pooling with a stride of 2, respectively; $\mathrm{Conv}$ denotes the convolution layer; $\mathrm{SK}$ denotes the skip connection; and $[\cdot, \cdot]$ denotes the concatenation layer. The detailed dimensions of each $R_{i,j}$ are shown in Table 1.
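As a rough illustration of the Inverted-UNet idea (upsample first, process, downsample back, with a skip connection), a simplified single-level PyTorch sketch is given below; the channel sizes, depth, and layer choices are assumptions and do not reproduce the four-level structure or the dimensions in Table 1.

```python
import torch
import torch.nn as nn

class InvertedUNetBlock(nn.Module):
    def __init__(self, channels: int = 16):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="linear", align_corners=False)
        self.enc = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.down = nn.MaxPool1d(kernel_size=2, stride=2)        # DN(.)
        self.fuse = nn.Conv1d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, length) multi-channel trajectory features
        r_up = torch.relu(self.enc(self.up(x)))                   # UP(.) then convolution
        r_dn = self.down(r_up)                                    # restore original length
        return self.fuse(torch.cat([x, r_dn], dim=1))             # skip connection SK(.)

feats = torch.randn(8, 16, 24)                 # e.g., a batch of 24-step feature maps
print(InvertedUNetBlock()(feats).shape)        # torch.Size([8, 16, 24])
```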
(b) FFT-Based Weighting
We apply the Fast Fourier Transform (FFT) to the temporal features to extract their spectrum amplitude distribution. These amplitudes serve as weights in the softmax layer to fuse the different features. This approach utilizes temporal- and frequency-domain information to assess the importance of different signals. The process is formulated as follows:
$$I_N = \sum_{n=0}^{N-1} I_N^{\mathrm{input}} \cdot e^{-j 2\pi f n / N},$$
$$A_u = |I_N|,$$
$$W_{\mathrm{normal}} = \mathrm{Softmax}(A_u, \tilde{R}_{\mathrm{out}}), \quad u = 1, 2, \ldots, 6,$$
where $I_N^{\mathrm{input}}$ represents the fixed-length input trajectory segment of the MDFM, generated through a sliding-window mechanism, $f$ denotes the frequency index, $N$ is the length of each fixed trajectory segment, and $n$ refers to the time-domain sampling index. The operator $|\cdot|$ computes the magnitude in the frequency domain. $\tilde{R}_{\mathrm{out}}$ denotes the output feature, which is processed by the Inverted-UNet, LayerNorm, and LeakyReLU modules in sequence.
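One plausible reading of this weighting, sketched below in PyTorch, applies a softmax to the FFT amplitudes of the input segment and uses the result to weight the Inverted-UNet output; the exact fusion used in the MDFM may differ, so this is an assumption-laden illustration rather than the authors' code.

```python
import torch

def fft_weighting(segment: torch.Tensor, r_out: torch.Tensor) -> torch.Tensor:
    # segment: (batch, length) input trajectory component; r_out: (batch, length) features
    amp = torch.abs(torch.fft.fft(segment, dim=-1))   # A_u = |FFT(I_input)|
    weights = torch.softmax(amp, dim=-1)              # frequency-domain weights
    return weights * r_out                            # emphasize oscillatory patterns

seg = torch.randn(8, 24)
r = torch.randn(8, 24)
print(fft_weighting(seg, r).shape)                    # torch.Size([8, 24])
```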

2.2.3. Parameter Estimation Model

(1) Motivation
In most 2D maneuvering tracking scenes, Constant Turn (CT) models are important maneuvering patterns [11,12,14], where the motion pattern $F_k$ and its turn rate $\omega_k$ are defined in Equation (3). The changing turn rate of a target affects the direction of its trajectory. Short-term trajectory features reflect instantaneous velocity variations and local motion trends, whereas long-term features characterize the overall motion pattern and the magnitude of velocity. Therefore, parameter estimation based on historical trajectory information requires a dedicated module to extract both global and local spatial features.
As noted in ref. [61], despite achieving accurate predictions with large-scale training data, networks still struggle to acquire inductive biases aligned with the underlying world model when adapting to new tasks. To address this limitation, the loss function becomes a crucial mechanism for embedding physical constraints into the learning process. Therefore, in our work, the PEM loss function should be designed to ensure that $F_k$ complies with physical laws and avoids calculation bias in the network output.
(2) The Overall Structure of PEM
As illustrated in Figure 5, PEM takes a denoised trajectory as input. It first uses a cross-and-dot-product (CADP) [12] to extract trajectory shape information by calculating the sine and cosine of the intersection angle. The equations are shown below:
$$s_k^{\sin} = \frac{\Delta p_{x,k}\, \Delta p_{y,k+1} - \Delta p_{y,k}\, \Delta p_{x,k+1}}{|\Delta p_k|\, |\Delta p_{k+1}|},$$
$$s_k^{\cos} = \frac{\Delta p_{x,k}\, \Delta p_{x,k+1} + \Delta p_{y,k}\, \Delta p_{y,k+1}}{|\Delta p_k|\, |\Delta p_{k+1}|},$$
where $\Delta p_{x,k} = p_{x,k} - p_{x,0}$, $\Delta p_{y,k} = p_{y,k} - p_{y,0}$, and $\Delta p_k = [\Delta p_{x,k}, \Delta p_{y,k}]^T$. Subsequently, the trajectory shape information is fed into the multi-head attention with convolutional neural nodes (MHCN) to capture both global and local features from the temporal and spatial domains. The attention output is fused with the original CADP features via a residual connection, and the result is further refined by convolutional neural nodes. Specifically, a novel physically constrained loss function is applied during training to avoid bias in computing $F_k$.
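A minimal NumPy sketch of the CADP features above is shown below; the toy trajectory and the small constant added to the denominator for numerical safety are assumptions, not part of the original formulation.

```python
import numpy as np

def cadp_features(traj: np.ndarray) -> np.ndarray:
    """traj: (T, 2) denoised positions; returns (T-2, 2) [sin, cos] shape features."""
    delta = traj - traj[0]                       # Δp_k = p_k − p_0
    a, b = delta[1:-1], delta[2:]                # Δp_k and Δp_{k+1}
    norm = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-12
    s_sin = (a[:, 0] * b[:, 1] - a[:, 1] * b[:, 0]) / norm   # cross product term
    s_cos = (a[:, 0] * b[:, 0] + a[:, 1] * b[:, 1]) / norm   # dot product term
    return np.stack([s_sin, s_cos], axis=1)

traj = np.cumsum(np.random.randn(24, 2), axis=0)   # toy denoised trajectory
print(cadp_features(traj).shape)                   # (22, 2)
```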
(3) Physics-Constrained Loss Function
Considering loss function design, we introduced a PCLoss to ensure that the estimated motion parameters and the resulting state transition matrix remain consistent with the underlying physical dynamics. Under both CV and CT motion models, the elements of the state transition matrix exhibit strict analytical dependencies dictated by the laws of motion. Constraining or estimating individual elements independently can violate these inherent relationships.
To avoid this issue, we jointly estimate the turn rate and impose analytical physical constraints on the entire transition matrix F k via the proposed PCLoss. By assigning a larger weight to the F k -related term, the model is encouraged to maintain strict physical consistency, thereby stabilizing turn rate estimation and preventing implausible parameter updates. This design forms the foundation for the robustness and reliability of the proposed tracking framework.
Building on this physical consistency, the PEM is designed to estimate the key unknown motion parameter—the turn rate $\omega_k$—which dominates the behavior of targets under coordinated-turn (CT) dynamics. Estimating $\omega_k$ allows the model to capture its temporal variation, enabling reliable tracking even under high maneuvering levels (typically $\omega_k > 10^\circ/\mathrm{s}$, e.g., transport aircraft [62]). However, estimating only a single parameter creates a large search space and may lead to unstable convergence. To mitigate this, we further incorporate physics-based constraints through the PCLoss and formulate the loss as follows:
$$\tilde{\omega}_k = \mathrm{PEM}(D_{\mathrm{seg}}(n)),$$
$$\tilde{F}_k = F_{\mathrm{CT}}(\tilde{\omega}_k),$$
$$\mathrm{Loss} = \lambda_1\, \mathrm{RMSE}(\tilde{F}_k) + \lambda_2\, \mathrm{RMSE}(\tilde{\omega}_k),$$
where $D_{\mathrm{seg}}(n)$ denotes the noise-eliminated trajectory segment produced by the FDM and used as input to the PEM, $\tilde{\omega}_k$ is the predicted turn rate, and $F_{\mathrm{CT}}(\cdot)$ is the analytical CT model used to generate the corresponding transition matrix. The loss weights $\lambda_1$ and $\lambda_2$ are set to 0.8 and 0.2, respectively.
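A minimal PyTorch sketch of this loss is given below, assuming the RMSE terms compare the predicted turn rate and its induced CT matrix against their ground-truth counterparts; the helper `f_ct_torch` and the toy values are illustrative, not the authors' code.

```python
import torch

def f_ct_torch(omega: torch.Tensor, s_tau: float = 1.0) -> torch.Tensor:
    """Batched CT transition matrices (Equation (3)) for nonzero turn rates `omega`."""
    s, c = torch.sin(omega * s_tau), torch.cos(omega * s_tau)
    one, zero = torch.ones_like(omega), torch.zeros_like(omega)
    rows = [
        [one, zero, s / omega, (c - one) / omega],
        [zero, one, (one - c) / omega, s / omega],
        [zero, zero, c, -s],
        [zero, zero, s, c],
    ]
    return torch.stack([torch.stack(r, dim=-1) for r in rows], dim=-2)

def pc_loss(omega_pred, omega_true, lam1: float = 0.8, lam2: float = 0.2):
    rmse_f = torch.sqrt(torch.mean((f_ct_torch(omega_pred) - f_ct_torch(omega_true)) ** 2))
    rmse_w = torch.sqrt(torch.mean((omega_pred - omega_true) ** 2))
    return lam1 * rmse_f + lam2 * rmse_w      # weighted physics-constrained loss

loss = pc_loss(torch.tensor([0.3, -0.2]), torch.tensor([0.35, -0.25]))
print(loss.item())
```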

3. Results

In this section, we first introduce our training protocol. Then, we compare our MIENet to several state-of-the-art state estimation methods. Finally, we present ablation studies to investigate our work.

3.1. Implementation Details

Our MIENet is trained using the publicly available LAST dataset [11], and the implementation details are summarized in Table 2. For testing, we primarily evaluate our method on the LAST dataset [11]. Additionally, experiments on the SV248S dataset [59] are included as supplementary validation to further demonstrate the applicability of our method to visual tracking.
For training on the LAST dataset [11], all trajectories involve both CV and CT kinetic patterns. The observed range in the training data contains additive Gaussian noise with a mean of 0 and a standard deviation of 10, while the observed azimuth contains additive Gaussian noise with a mean of 0 and a standard deviation of 0.006. After the nonlinear transformation (Equation (6)), the observation noise becomes non-Gaussian. We use the RMSE as the evaluation metric for network performance. Our MIENet was trained with a batch size of 10, using the Adam optimizer with an initial learning rate of 0.01. To prevent the network from converging to a local minimum, the learning rate was adjusted using the cosine annealing method. All experiments were conducted on an NVIDIA 4090 GPU.
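A minimal sketch of this optimization setup is shown below; `model`, `train_loader`, and `criterion` are hypothetical placeholders, and the number of epochs is an assumption not stated in the paper.

```python
import torch

def train(model, train_loader, criterion, epochs: int = 100):
    optimizer = torch.optim.Adam(model.parameters(), lr=0.01)          # initial lr 0.01
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)
    for epoch in range(epochs):
        for noisy_seg, target in train_loader:                          # batches of size 10
            optimizer.zero_grad()
            loss = criterion(model(noisy_seg), target)
            loss.backward()
            optimizer.step()
        scheduler.step()                                                # cosine-annealed lr
```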
For testing on the LAST dataset [11], detailed parameter settings of the testing scenes are provided in the caption of Table 3. We evaluate highly maneuvering scenes where $\omega_k$ exceeds 10°/s (e.g., transport aircraft [62]), and each segment has a maneuvering time exceeding 20 s. They also include abrupt turn-rate changes (e.g., from −30°/s to 70°/s). Specifically, in the first test scene, the target is more than 20 km away from the radar sensor. The angular velocities gradually decrease, and the second segment lasts up to 52.6 s. This tracking scenario presents a greater challenge than those discussed in previous papers [11,12]. In the second scene, the initial state matches the first scene, but we change the direction of the turn rate and increase its value, with turns lasting up to 40 s in the first and third segments. In the third scene, the radar sensor is closer to the target, within a range of less than 4.5 km. In the fourth scene, the turn-rate changes are more significant in both magnitude and direction (from −30°/s to 70°/s, and then from 70°/s to −10°/s) compared to the previous three trajectories. The four scenes vary in terms of target distance and maneuvering capability, which highlights the diversity and comprehensive coverage of the evaluation environments.
For testing on the SV248S dataset [59], the observations are generated using the results of the STAR [63] method, since our approach does not rely on a detector or an acceleration module tailored for image processing. Under this formulation, the observed and estimated coordinates ( p x , k , p y , k ) are directly comparable without any nonlinear transformation. We evaluate our method on all scenes included in the SV248S dataset, following the same evaluation protocol as STAR. The algorithm ranking is determined according to two criteria: the 5-pixel precision metric and the area under the curve (AUC) of the success plot, corresponding to the precision rate (PR) and success rate (SR), respectively [63].
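The two ranking criteria can be summarized by the short sketch below; it is a generic re-implementation of a 5-pixel precision rate and a success-plot AUC, not the SV248S or STAR evaluation toolkit, and the toy inputs are assumptions.

```python
import numpy as np

def precision_rate(pred_centers: np.ndarray, gt_centers: np.ndarray, thresh: float = 5.0) -> float:
    """Fraction of frames whose center location error is within `thresh` pixels (PR)."""
    err = np.linalg.norm(pred_centers - gt_centers, axis=1)
    return float(np.mean(err <= thresh))

def success_rate(ious: np.ndarray) -> float:
    """AUC of the success plot, approximated by the mean success over IoU thresholds (SR)."""
    thresholds = np.linspace(0.0, 1.0, 21)
    return float(np.mean([np.mean(ious > t) for t in thresholds]))

pred = np.random.rand(100, 2) * 4.0          # toy predicted centers near ground truth
gt = np.zeros((100, 2))
print(precision_rate(pred, gt), success_rate(np.random.rand(100)))
```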

3.2. Comparison to the State-of-the-Art (SOTA)

To demonstrate the superiority of our method, we compare our MIENet to several state-of-the-art (SOTA) methods, including traditional methods (UKF-A [64], IMM [19] combining multiple CT motion patterns) and deep learning-based methods (ISPM [12]).

3.2.1. Quantitative Results

(1) The LAST Dataset
To objectively evaluate the tracker performance, we record the momentary RMSEs over 100 Monte Carlo simulations.
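For clarity, the momentary RMSE over Monte Carlo runs can be computed as in the following sketch, where the number of runs, trajectory length, and error magnitudes are illustrative assumptions.

```python
import numpy as np

def momentary_rmse(estimates: np.ndarray, truth: np.ndarray) -> np.ndarray:
    """estimates: (runs, T, 2) estimated positions; truth: (T, 2) true positions."""
    sq_err = np.sum((estimates - truth[None]) ** 2, axis=-1)   # (runs, T) squared errors
    return np.sqrt(np.mean(sq_err, axis=0))                    # RMSE at each time step

runs, T = 100, 200
truth = np.cumsum(np.random.randn(T, 2), axis=0)
estimates = truth[None] + np.random.randn(runs, T, 2) * 5.0
print(momentary_rmse(estimates, truth).shape)                  # (200,)
```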
Quantitative results are shown in Table 4. The improvements achieved by our MIENet over traditional methods are significant. That is because our MIENet can learn multi-domain features that are robust to scene variations and leverages both local and global information to capture latent motion patterns. In contrast, the traditional methods are usually designed for specific scenes (e.g., specific motion pattern and measurement environment).
Moreover, as shown in Figure 6, we test each scene with three different noise levels. As the noise increases, traditional methods suffer a dramatic performance decrease, while our MIENet maintains its accuracy. That is because the performance of traditional methods relies heavily on manually chosen parameters (e.g., the process noise covariance $Q$ and the measurement noise covariance $R$) and cannot adapt to varying noise. It is worth noting that our method preserves error stability more effectively than traditional approaches when abrupt changes occur in turn-rate direction or magnitude. Moreover, it remains robust because it does not require re-adjustment of the noise parameters ($Q$ and $R$) under changing environmental noise.
As shown in Table 4, the improvements achieved by the MIENet over deep learning-based methods (i.e., ISPM) are also evident. That is because we redesign the FDM and optimize the PEM for highly maneuvering target tracking. The Inverted-UNet combined with FFT-based softmax weighting enables the fusion of spatial- and frequency-domain information, and the physics-constrained loss function avoids calculation bias. Consequently, the multi-domain information of trajectories is preserved and fully learned by the network.
In the tracking process, our MIENet takes 47.2 ms per iteration, and ISPM takes 11.3 ms. Both algorithms are suitable for real-time applications. The detailed memory requirements of the MIENet for the LAST dataset are shown in Table 5.
(2) The SV248S Dataset
Quantitative results are presented in Table 6 and Table 7. The proposed MIENet demonstrates applicability to remote visual tracking, effectively reducing localization errors across various target classes, difficulty levels, and attribute categories without requiring retraining. These results further indicate that our method can be readily extended to two-dimensional visual object tracking tasks. The detailed memory requirements of the MIENet for the SV248S dataset are shown in Table 8.

3.2.2. Qualitative Results

(1) The LAST Dataset
Qualitative results are presented in Figure 7 and Figure 8. As shown in Figure 7b,c, the MIENet effectively reduces peak tracking errors. That is because traditional methods detect pattern switching only after errors have accumulated, which causes delay and peak error accumulation. The deep learning-based methods (i.e., ISPM) perform much better than traditional methods. However, in highly maneuvering scenes, ISPM lacks multi-domain information and physical constraints, which leads to large parameter estimation errors. Compared with these methods, our MIENet is more robust across various highly maneuvering scenes.
Moreover, as shown in the enlarged view of Figure 8a, our method achieves the lowest tracking error even when the target undergoes high-speed motion (turn rate of 40°/s). This is because our FFT-based weighting module captures the rich frequency-domain information induced by highly maneuvering motion, preventing the loss of high-frequency features and enabling more accurate estimation of motion parameters. In the enlarged view of Figure 8b, our method again attains the lowest tracking error at a lower turn rate of 20°/s. That is because the Inverted-UNet structure effectively captures both temporal and spatial cues of the target’s motion, maintaining reliable tracking performance when inter-frame motion changes are relatively small. Finally, Figure 8c shows that our method continues to track the target accurately even after it performs a maneuver from 70°/s to −10°/s. This is attributed to the joint use of convolution and the MHN module, which facilitates the fusion of local and global information, enabling the network to capture historical motion features and recognize motion-pattern transitions, thus adapting robustly to diverse motion scenarios.
(2) The SV248S Dataset
Qualitative results are shown in Figure 9. Our MIENet is capable of performing satellite localization in remote sensing images.

3.3. Ablation Study

In this section, we compare our MIENet with several variants. The objective is to investigate the potential benefits arising from our chosen modules and network design.

3.3.1. Ablation Study on the Magnitude and Distribution of Noise

As shown in Table 9 and Table 10, we evaluate the denoising performance under injected Gaussian-distributed and Uniform-distributed additive noise at three intensity levels to assess adaptability and generalization. Importantly, although the injected noise follows Gaussian or Uniform distributions in the input space, the nonlinear observation transformation distorts these distributions, producing non-Gaussian noise in the observation space.
Our method reduces RMSEs by at least 60% (with an average reduction of 66.4%) for cases with Gaussian-injected noise and at least 55% (with an average reduction of 63.28%) for cases with Uniform-injected noise. This consistent improvement demonstrates that the FDM effectively exploits multi-domain information in observations, making it applicable to denoising tasks under varying noise levels and across different non-Gaussian noise distributions induced by nonlinear transformations.
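A minimal sketch of this noise-injection procedure is given below; the intensity values and the sensor-at-origin assumption are illustrative and do not correspond to the exact levels in Tables 9 and 10.

```python
import numpy as np

rng = np.random.default_rng(0)

def inject_noise(d, theta, dist: str, scale_d: float, scale_theta: float):
    """Inject Gaussian or Uniform noise into polar measurements, then apply Eq. (6)."""
    if dist == "gaussian":
        nd = rng.normal(0.0, scale_d, d.shape)
        nt = rng.normal(0.0, scale_theta, theta.shape)
    elif dist == "uniform":
        nd = rng.uniform(-scale_d, scale_d, d.shape)
        nt = rng.uniform(-scale_theta, scale_theta, theta.shape)
    else:
        raise ValueError(dist)
    d_n, t_n = d + nd, theta + nt
    # nonlinear conversion makes the resulting position noise non-Gaussian
    return d_n * np.cos(t_n), d_n * np.sin(t_n)

d = np.full(1000, 20000.0)
theta = np.full(1000, 0.8)
px, py = inject_noise(d, theta, "uniform", scale_d=15.0, scale_theta=0.01)
```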

3.3.2. Ablation Study on MDFM

The MDFM is used for temporal, spatial and frequency-domain features enhancement to achieve better feature fusion. To investigate the benefits introduced by this module, we compare its effects on the FDM and MIENet.
  • FDM w/o MDFM: As shown in Table 11, the original FDM achieves a 61.2% reduction in the RMSE, while removing the MDFM module decreases the reduction to 48.3%. The results are averaged over three noise intensity levels. This performance gap demonstrates that the Inverted-UNet is essential for denoising, as it extracts and fuses spatial-domain features more effectively from noisy observations.
  • MIENet w/o MDFM: As shown in Table 12, our MIENet achieves position and velocity RMSEs of 7.78 m and 10.64 m/s, respectively. In Ablation 4, removing the MDFM module increases RMSEs to 15.83 m and 29.61 m/s. That is because the FDM, as a component of the MIENet, effectively suppresses noise and thereby contributes to its tracking performance.

3.3.3. Ablation Study on Multi-Head Attention

We evaluate the contribution of multi-head attention (MHN) to the MIENet and present the results in Table 12. In Ablation 5, removing the MHN module increases RMSEs to 16.9 m and 14.07 m/s. That is because our proposed MHN can capture historical global information about trajectories to enhance temporal domain features.

3.3.4. Ablation Study on Loss Function Design

The PCLoss for the PEM replaces the single turn-rate term with a term over the eight unknown elements of the transition matrix. Results are shown in Table 12. Comparing Ablation 6 with our method demonstrates that the PCLoss reduces the distance RMSE by 21.41% and the velocity RMSE by 21.47%. That is because the PCLoss incorporates physical constraints and prevents bias in the calculations.

4. Discussion

Through extensive experiments on both radar-based trajectory simulations and real satellite video datasets, the MIENet has demonstrated strong robustness and adaptability in tracking highly maneuvering targets under non-Gaussian noise conditions. Across all evaluation scenarios, the MIENet consistently improves estimation accuracy and stability compared with traditional state-estimation methods such as UKF, PF, and their variants. These performance gains arise from the complementary roles of the fusion denoising model (FDM) and the parameter estimation model (PEM), which jointly enhance the reliability of the entire estimation pipeline. First, the FDM effectively suppresses observation noise through multi-domain feature extraction and fusion. Unlike conventional denoising filters or handcrafted feature-based approaches that are sensitive to noise intensity and distribution, the FDM learns noise-resilient representations directly from data. This allows it to handle a wide range of non-Gaussian noise patterns without relying on explicit assumptions or manual parameter tuning. Quantitative evaluations on the LAST dataset show that the FDM reduces observation noise by more than 60%, directly improving the stability of subsequent estimation. Second, the PEM accurately predicts key motion parameters by integrating global motion trends with local trajectory variations. This dual-level modeling enables the PEM to capture complex dynamics of highly maneuvering targets, especially in scenarios involving abrupt motion changes or large accelerations. In addition, the PCLoss function enforces consistency with motion laws and prior knowledge, preventing unrealistic predictions and improving model interpretability. Ablation studies confirm that each component—FDM, PEM, and PCLoss—contributes substantively to the overall performance, and their combined effect yields a more robust and accurate estimation process.
Another important observation is the strong generalization ability of the MIENet. Although trained only on radar trajectory simulations, the MIENet focuses on learning domain-invariant motion patterns rather than modality-dependent appearance cues. Its multi-domain learning framework extracts stable geometric and temporal features of target motion that are governed by physical laws and are therefore shared across radar and satellite video data. In addition, the PCLoss function further regularizes the model to produce physically consistent motion predictions, reducing reliance on dataset-specific noise characteristics. Together, these factors enable the MIENet to maintain robust performance when transferred to the real-world SV248S dataset despite substantial differences in sensing modality and noise distribution. Finally, the MIENet exhibits promising computational efficiency. With an inference time of approximately 47.2 ms per iteration and a stable memory footprint of only 5–8 MB, the MIENet is well suited for real-time remote sensing tasks such as maritime surveillance, UAV monitoring, or on-orbit object tracking. Its minimal dependence on sequence length further enhances applicability in long-duration tracking scenarios.

5. Conclusions

In this paper, we propose a versatile and adaptive multi-domain intelligent state estimation network for highly maneuvering target tracking under non-Gaussian noise. To address the challenge of converting noisy observation trajectories into clean inputs, we design the FDM with an encoder–decoder architecture. This model reduces the RMSE by at least 60% by exploiting multi-domain features. To achieve a more accurate motion model for state estimation, we propose the PEM to capture temporal-domain information across both global and local spatial scales. We also redesign the loss function to avoid calculation bias. Finally, experimental results demonstrate that our method achieves state-of-the-art performance and maintains robustness across various noise levels and distributions.
There are two promising directions for future work. First, this paper considers only range and azimuth observations; incorporating Doppler measurements may further enhance maneuvering target tracking. Second, this work is restricted to tracking in the two-dimensional X–Y plane; extending the framework to three-dimensional scenarios is a valuable direction.

Author Contributions

Conceptualization, Z.M. and Y.H.; methodology, Z.M. and Y.H.; software, Z.M. and Q.X.; validation, Z.M.; investigation, Z.M., Q.X., W.A., X.W. and W.S.; visualization, Z.M., W.A. and Y.H.; supervision, Y.H.; project administration, W.A.; writing—original draft preparation, Z.M., Y.H. and X.W.; writing—review and editing, Y.H., Q.X., W.A., X.W. and W.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (Grant No. 62207030) and in part by the Postgraduate Scientific Research Innovation Project of Hunan Province (Grant No. CX20240120).

Data Availability Statement

Publicly available datasets were used in this study. The LAST dataset is accessible at https://github.com/ljx43031/DeepMTT-algorithm (accessed on 9 December 2025), and the SV248S dataset can be obtained from https://github.com/xdai-dlgvv/SV248S (accessed on 9 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Chen, Z.; Zhu, E.; Guo, Z.; Zhang, P.; Liu, X.; Wang, L.; Zhang, Y. Predictive Autonomy for UAV Remote Sensing: A Survey of Video Prediction. Remote Sens. 2025, 17, 3423. [Google Scholar] [CrossRef]
  2. Fraternali, P.; Morandini, L.; Motta, R. Enhancing Search and Rescue Missions with UAV Thermal Video Tracking. Remote Sens. 2025, 17, 3032. [Google Scholar] [CrossRef]
  3. Bu, D.; Ding, B.; Tong, X.; Sun, B.; Sun, X.; Guo, R.; Su, S. FSTC-DiMP: Advanced Feature Processing and Spatio-Temporal Consistency for Anti-UAV Tracking. Remote Sens. 2025, 17, 2902. [Google Scholar] [CrossRef]
  4. Zhou, Y.; Tang, D.; Zhou, H.; Xiang, X. Moving Target Geolocation and Trajectory Prediction Using a Fixed-Wing UAV in Cluttered Environments. Remote Sens. 2025, 17, 969. [Google Scholar] [CrossRef]
  5. Meng, F.; Zhao, G.; Zhang, G.; Li, Z.; Ding, K. Visual detection and association tracking of dim small ship targets from optical image sequences of geostationary satellite using multispectral radiation characteristics. Remote Sens. 2023, 15, 2069. [Google Scholar] [CrossRef]
  6. Huang, J.; Sun, H.; Wang, T. IAASNet: Ill-Posed-Aware Aggregated Stereo Matching Network for Cross-Orbit Optical Satellite Images. Remote Sens. 2025, 17, 3528. [Google Scholar] [CrossRef]
  7. Li, S.; Fu, G.; Yang, X.; Cao, X.; Niu, S.; Meng, Z. Two-Stage Spatio-Temporal Feature Correlation Network for Infrared Ground Target Tracking. IEEE Trans. Geosci. Remote Sens. 2024, 62, 1–14. [Google Scholar] [CrossRef]
  8. Zhao, D.; He, W.; Deng, L.; Wu, Y.; Xie, H.; Dai, J. Trajectory tracking and load monitoring for moving vehicles on bridge based on axle position and dual camera vision. Remote Sens. 2021, 13, 4868. [Google Scholar] [CrossRef]
  9. Xia, Q.; Chen, P.; Xu, G.; Sun, H.; Li, L.; Yu, G. Adaptive Path-Tracking Controller Embedded with Reinforcement Learning and Preview Model for Autonomous Driving. IEEE Trans. Veh. Technol. 2025, 74, 3736–3750. [Google Scholar] [CrossRef]
  10. Chen, Z.; Liu, L.; Yu, Z. Towards Robust Visual Object Tracking for UAV with Multiple Response Incongruity Aberrance Repression Regularization. IEEE Signal Process. Lett. 2024, 31, 2005–2009. [Google Scholar] [CrossRef]
  11. Liu, J.; Wang, Z.; Xu, M. DeepMTT: A deep learning maneuvering target-tracking algorithm based on bidirectional LSTM network. Inf. Fusion 2020, 53, 289–304. [Google Scholar] [CrossRef]
  12. Liu, J.; Yan, J.; Wan, D.; Li, X.; Al-Rubaye, S.; Al-Dulaimi, A.; Quan, Z. Digital Twins Based Intelligent State Prediction Method for Maneuvering-Target Tracking. IEEE J. Sel. Areas Commun. 2023, 41, 3589–3606. [Google Scholar] [CrossRef]
  13. Zhang, Y.; Li, G.; Zhang, X.P.; He, Y. A deep learning model based on transformer structure for radar tracking of maneuvering targets. Inf. Fusion 2024, 103, 102120. [Google Scholar] [CrossRef]
  14. Li, X.R.; Jilkov, V.P. Survey of maneuvering target tracking. Part I. Dynamic models. IEEE Trans. Aerosp. Electron. Syst. 2003, 39, 1333–1364. [Google Scholar]
  15. Kalman, R.E. A new approach to linear filtering and prediction problems. Trans. Asme-J. Basic Eng. 1960, 82, 35–45. [Google Scholar] [CrossRef]
  16. Cortina, E.; Otero, D.; D’Attellis, C.E. Maneuvering target tracking using extended kalman filter. IEEE Trans. Aerosp. Electron. Syst. 1991, 27, 155–158. [Google Scholar] [CrossRef]
  17. Julier, S.; Uhlmann, J.; Durrant-Whyte, H.F. A new method for the nonlinear transformation of means and covariances in filters and estimators. IEEE Trans. Autom. Control 2000, 45, 477–482. [Google Scholar] [CrossRef]
  18. Gustafsson, F.; Gunnarsson, F.; Bergman, N.; Forssell, U.; Jansson, J.; Karlsson, R.; Nordlund, P.J. Particle filters for positioning, navigation, and tracking. IEEE Trans. Signal Process. 2002, 50, 425–437. [Google Scholar] [CrossRef]
  19. Blom, H.A.; Bar-Shalom, Y. The interacting multiple model algorithm for systems with markovian switching coefficients. IEEE Trans. Autom. Control 1988, 33, 780–783. [Google Scholar] [CrossRef]
  20. Sheng, H.; Zhao, W.; Wang, J. Interacting multiple model tracking algorithm fusing input estimation and best linear unbiased estimation filter. IET Radar Sonar Navig. 2017, 11, 70–77. [Google Scholar] [CrossRef]
  21. Sun, Y.; Yuan, B.; Miao, Z.; Wu, W. From GMM to HGMM: An approach in moving object detection. Comput. Inform. 2004, 23, 215–237. [Google Scholar]
  22. Chen, X.; Wang, Y.; Zang, C.; Wang, X.; Xiang, Y.; Cui, G. Data-Driven Intelligent Multi-Frame Joint Tracking Method for Maneuvering Targets in Clutter Environments. IEEE Trans. Aerosp. Electron. Syst. 2024, 61, 2679–2702. [Google Scholar] [CrossRef]
  23. Zhang, J.; Huang, Y.; Masouros, C.; You, X.; Ottersten, B. Hybrid Data-Induced Kalman Filtering Approach and Application in Beam Prediction and Tracking. IEEE Trans. Signal Process. 2024, 72, 1412–1426. [Google Scholar] [CrossRef]
  24. Zhang, W.; Zhao, X.; Liu, Z.; Liu, K.; Chen, B. Converted state equation kalman filter for nonlinear maneuvering target tracking. Signal Process. 2022, 202, 108741. [Google Scholar] [CrossRef]
  25. Sun, M.; Davies, M.E.; Proudler, I.K.; Hopgood, J.R. Adaptive kernel kalman filter based belief propagation algorithm for maneuvering multi-target tracking. IEEE Signal Process. Lett. 2022, 29, 1452–1456. [Google Scholar] [CrossRef]
  26. Singh, H.; Mishra, K.V.; Chattopadhyay, A. Inverse Unscented Kalman Filter. IEEE Trans. Signal Process. 2024, 72, 2692–2709. [Google Scholar] [CrossRef]
  27. Lan, H.; Hu, J.; Wang, Z.; Cheng, Q. Variational nonlinear kalman filtering with unknown process noise covariance. IEEE Trans. Aerosp. Electron. Syst. 2023, 59, 9177–9190. [Google Scholar] [CrossRef]
  28. Guo, Y.; Li, Z.; Luo, X.; Zhou, Z. Trajectory optimization of target motion based on interactive multiple model and covariance kalman filter. In Proceedings of the International Conference on Geoscience and Remote Sensing Mapping (GRSM), Lianyungang, China, 13–15 October 2023. [Google Scholar]
  29. Deepika, N.; Rajalakshmi, B.; Nijhawan, G.; Rana, A.; Yadav, D.K.; Jabbar, K.A. Signal processing for advanced driver assistance systems in autonomous vehicles. In Proceedings of the IEEE Uttar Pradesh Section International Conference on Electrical, Electronics and Computer Engineering (UPCON), Greater Noida, India, 1–3 December 2023. [Google Scholar]
  30. Xu, S.; Rice, M.; Rice, F.; Wu, X. An expectation-maximization-based estimation algorithm for AOA target tracking with non-Gaussian measurement noises. IEEE Trans. Veh. Technol. 2022, 72, 498–511. [Google Scholar] [CrossRef]
  31. Chen, J.; He, J.; Wang, G.; Peng, B. A Maritime Multi-target Tracking Method with Non-Gaussian Measurement Noises based on Joint Probabilistic Data Association. IEEE Trans. Instrum. Meas. 2025, 74, 1–12. [Google Scholar]
  32. Wang, J.; He, J.; Peng, B.; Wang, G. Generalized interacting multiple model Kalman filtering algorithm for maneuvering target tracking under non-Gaussian noises. ISA Trans. 2024, 155, 148–163. [Google Scholar] [CrossRef] [PubMed]
  33. Liu, B.; Wu, Z. Maximum correntropy quadrature Kalman filter based interacting multiple model approach for maneuvering target tracking. Signal Image Video Process. 2025, 19, 76. [Google Scholar] [CrossRef]
  34. Xie, G.; Sun, L.; Wen, T.; Hei, X.; Qian, F. Adaptive transition probability matrix-based parallel IMM algorithm. IEEE Trans. Syst. Man Cybern. Syst. 2019, 51, 2980–2989. [Google Scholar] [CrossRef]
  35. Cai, S.; Wang, S.; Qiu, M. Maneuvering target tracking based on LSTM for radar application. In Proceedings of the IEEE International Conference on Software Engineering and Artificial Intelligence (SEAI), Xiamen, China, 16–18 June 2023; pp. 235–239. [Google Scholar]
  36. Medsker, L.R.; Jain, L. Recurrent neural networks. Des. Appl. 2001, 5, 2. [Google Scholar]
  37. Graves, A.; Schmidhuber, J. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Netw. 2005, 18, 602–610. [Google Scholar] [CrossRef] [PubMed]
  38. Kawakami, K. Supervised Sequence Labelling with Recurrent Neural Networks. Ph.D. Thesis, University of Edinburgh, Edinburgh, UK, 2008. [Google Scholar]
  39. Zhang, Y.; Li, G.; Zhang, X.P.; He, Y. Transformer-based tracking network for maneuvering targets. In Proceedings of the ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5. [Google Scholar]
  40. Shen, L.; Su, H.; Li, Z.; Jia, C.; Yang, R. Self-attention-based Transformer for nonlinear maneuvering target tracking. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–13. [Google Scholar] [CrossRef]
  41. Liu, H.; Sun, X.; Chen, Y.; Wang, X. Physics-Informed Data-Driven Autoregressive Nonlinear Filter. IEEE Signal Process. Lett. 2025, 32, 846–850. [Google Scholar] [CrossRef]
  42. Revach, G.; Shlezinger, N.; Ni, X.; Escoriza, A.L.; Van Sloun, R.J.; Eldar, Y.C. KalmanNet: Neural network aided Kalman filtering for partially known dynamics. IEEE Trans. Signal Process. 2022, 70, 1532–1547. [Google Scholar] [CrossRef]
  43. Choi, G.; Park, J.; Shlezinger, N.; Eldar, Y.C.; Lee, N. Split-KalmanNet: A robust model-based deep learning approach for state estimation. IEEE Trans. Veh. Technol. 2023, 72, 12326–12331. [Google Scholar] [CrossRef]
  44. Buchnik, I.; Revach, G.; Steger, D.; Van Sloun, R.J.; Routtenberg, T.; Shlezinger, N. Latent-KalmanNet: Learned Kalman filtering for tracking from high-dimensional signals. IEEE Trans. Signal Process. 2024, 72, 352–367. [Google Scholar] [CrossRef]
  45. Ko, M.; Shafieezadeh, A. Cholesky-KalmanNet: Model-Based Deep Learning with Positive Definite Error Covariance Structure. IEEE Signal Process. Lett. 2024, 32, 326–330. [Google Scholar] [CrossRef]
  46. Fu, Q.; Lu, K.; Sun, C. Deep Learning Aided State Estimation for Guarded Semi-Markov Switching Systems with Soft Constraints. IEEE Trans. Signal Process. 2023, 71, 3100–3116. [Google Scholar] [CrossRef]
  47. Xi, R.; Lan, J.; Cao, X. Nonlinear Estimation Using Multiple Conversions with Optimized Extension for Target Tracking. IEEE Trans. Signal Process. 2023, 71, 4457–4470. [Google Scholar] [CrossRef]
  48. Mortada, H.; Falcon, C.; Kahil, Y.; Clavaud, M.; Michel, J.P. Recursive KalmanNet: Deep Learning-Augmented Kalman Filtering for State Estimation with Consistent Uncertainty Quantification. arXiv 2025, arXiv:2506.11639. [Google Scholar]
  49. Chen, X.; Li, Y. Normalizing Flow-Based Differentiable Particle Filters. IEEE Trans. Signal Process. 2024, 73, 493–507. [Google Scholar] [CrossRef]
  50. Jia, C.; Ma, J.; Kouw, W.M. Multiple Variational Kalman-GRU for Ship Trajectory Prediction with Uncertainty. IEEE Trans. Aerosp. Electron. Syst. 2024, 61, 3654–3667. [Google Scholar] [CrossRef]
  51. Yin, J.; Li, W.; Liu, X.; Wang, Y.; Yang, J.; Yu, X.; Guo, L. KFDNNs-Based Intelligent INS/PS Integrated Navigation Method Without Statistical Knowledge. IEEE Trans. Intell. Transp. Syst. 2025, 26, 12197–12209. [Google Scholar] [CrossRef]
  52. Lin, C.; Cheng, Y.; Wang, X.; Liu, Y. AKansformer: Axial Kansformer–Based UUV Noncooperative Target Tracking Approach. IEEE Trans. Ind. Inform. 2025, 21, 4883–4891. [Google Scholar] [CrossRef]
  53. Shen, S.; Chen, J.; Yu, G.; Zhai, Z.; Han, P. KalmanFormer: Using transformer to model the Kalman Gain in Kalman Filters. Front. Neurorobot. 2025, 18, 1460255. [Google Scholar] [CrossRef] [PubMed]
  54. Chen, S.; Zheng, Y.; Lin, D.; Cai, P.; Xiao, Y.; Wang, S. MAML-KalmanNet: A neural network-assisted Kalman filter based on model-agnostic meta-learning. IEEE Trans. Signal Process. 2025, 73, 988–1003. [Google Scholar] [CrossRef]
  55. Nuri, I.; Shlezinger, N. Learning Flock: Enhancing Sets of Particles for Multi Sub-State Particle Filtering with Neural Augmentation. IEEE Trans. Signal Process. 2024, 73, 99–112. [Google Scholar] [CrossRef]
  56. Zhang, H.; Liu, W.; Zhang, L.; Meng, Y.; Han, W.; Song, T.; Yang, R. An allocation strategy integrating power, bandwidth, and subchannel in an RCC network. Def. Technol. 2025, in press. [Google Scholar] [CrossRef]
  57. Zhang, H.; Liu, W.; Zhang, Q.; Liu, B. Joint Customer Assignment, Power Allocation, and Subchannel Allocation in a UAV-Based Joint Radar and Communication Network. IEEE Internet Things J. 2024, 11, 29643–29660. [Google Scholar] [CrossRef]
  58. Zhu, H.; Xiong, W.; Cui, Y. An adaptive interactive multiple-model algorithm based on end-to-end learning. Chin. J. Electron. 2023, 32, 1120–1132. [Google Scholar] [CrossRef]
  59. Li, Y.; Jiao, L.; Huang, Z.; Zhang, X.; Zhang, R.; Song, X.; Tian, C.; Zhang, Z.; Liu, F.; Yang, S.; et al. Deep learning-based object tracking in satellite videos: A comprehensive survey with a new dataset. IEEE Geosci. Remote Sens. Mag. 2022, 10, 181–212. [Google Scholar] [CrossRef]
  60. Fan, C.M.; Liu, T.J.; Liu, K.H. SUNet: Swin transformer UNet for image denoising. In Proceedings of the 2022 IEEE International Symposium on Circuits and Systems (ISCAS), Austin, TX, USA, 27 May–1 June 2022; pp. 2333–2337. [Google Scholar]
  61. Vafa, K.; Chang, P.G.; Rambachan, A.; Mullainathan, S. What has a foundation model found? using inductive bias to probe for world models. In Proceedings of the International Conference on Machine Learning, ICML, Vancouver, BC, Canada, 13–19 July 2025. [Google Scholar]
  62. Holleman, E.C. Flight Investigation of the Roll Requirements for Transport Airplanes in Cruising Flight; Technical Report; NASA: Washington, DC, USA, 1970.
  63. Chen, Y.; Yuan, Q.; Xiao, Y.; Tang, Y.; He, J.; Han, T. STAR: A Unified Spatiotemporal Fusion Framework for Satellite Video Object Tracking. IEEE Trans. Geosci. Remote Sens. 2025, 63, 1–22. [Google Scholar] [CrossRef]
  64. Huang, Z.; Marelli, D.; Xu, Y.; Fu, M. Distributed target tracking using maximum likelihood Kalman filter with non-linear measurements. IEEE Sens. J. 2021, 21, 27818–27826. [Google Scholar] [CrossRef]
Figure 1. The overall structure of our radar tracking system.
Figure 2. The overall structure of our MIENet. It comprises two main models for tracking highly maneuvering targets. The input data are first processed by low-pass filters to obtain auxiliary supervised signals (a1–a5). The auxiliary supervised signals and the observations are then fed into the fusion denoising model. The denoised data are subsequently fed into the parameter estimation model to estimate the key turn rate ω. Finally, the input observations and the estimated ω are provided to the UKF to estimate the target states.
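To make the data flow in Figure 2 concrete, a minimal Python sketch of the inference pipeline is given below. The fdm, pem, and ukf callables, their signatures, and the five low-pass cutoff frequencies are illustrative assumptions rather than the authors' implementation.

import numpy as np
from scipy.signal import butter, filtfilt

def lowpass(signal, cutoff, order=4):
    # Zero-phase Butterworth low-pass filter used to build an auxiliary supervised signal.
    b, a = butter(order, cutoff)  # normalized cutoff frequency (Nyquist = 1.0)
    return filtfilt(b, a, signal, axis=0)

def track_sequence(observations, fdm, pem, ukf):
    # observations: (T, 2) noisy measurements; fdm/pem: trained networks; ukf: coordinated-turn UKF.
    aux = np.stack([lowpass(observations, c) for c in (0.05, 0.1, 0.2, 0.3, 0.4)], axis=-1)
    denoised = fdm(observations, aux)      # fusion denoising model (FDM)
    omega = pem(denoised)                  # parameter estimation model (PEM): per-step turn rate
    states = []
    for z_k, w_k in zip(observations, omega):
        ukf.predict(turn_rate=w_k)         # transition model parameterized by the estimated omega
        states.append(ukf.update(z_k))     # measurement update with the raw observation
    return np.asarray(states)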
Figure 3. The detailed structure of the fusion denoising model (FDM).
Figure 4. The detailed structure of different UNets. The dotted lines indicate the alignment of neuron positions.
Figure 5. The detailed structure of the parameter estimation model (PEM).
Figure 6. Mean RMSEs of distance and velocity for four test trajectories. (a) Trajectory 1; (b) Trajectory 2; (c) Trajectory 3; (d) Trajectory 4.
Figure 7. Trajectory estimation results for the first scene. Our MIENet achieves the lowest state estimation error and the smallest fluctuations. (a) The first tracking scenario, with a detailed view of steps 0–62 shown in the upper-right inset. (b,c) The corresponding time-varying position and velocity errors.
Figure 8. Trajectory estimation results for the second, third, and fourth maneuvering target tracking scenarios. Our MIENet shows superior tracking accuracy, especially during periods of maneuvering transitions. (a) The second tracking scenario; the right panel is a detailed view corresponding to time steps 460–500. (b) The third tracking scenario, with a detailed view of steps 1000–1074. (c) The fourth tracking scenario, with zoomed-in views of the intervals [610, 670] and [530, 570].
Figure 9. Tracking trajectory results of our method on the SV248S dataset.
Table 1. Channel dimensions of the inverted-UNet at each stage.
Stage | R0,0 | R1,0 | R0,1 | R2,0 | R1,1 | R0,2 | R3,0 | R2,1 | R1,2 | R0,3 | Rout
Dim 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1
Dim 2 | 292958291162032917452281229
Dim 3 | 1 | 2 | 1 | 4 | 2 | 1 | 8 | 4 | 2 | 1 | 1
Dim 4 | 6 | 12 | 6 | 24 | 12 | 6 | 48 | 24 | 12 | 6 | 6
Table 2. Parameter settings for the LAST dataset.
Contents | Ranges
Initial distance from radar | [1 km, 10 km]
Initial velocity of targets | [100 m/s, 200 m/s]
Initial distance azimuth | [−180°, 180°]
Initial velocity azimuth | [−180°, 180°]
Maneuvering turn rate | [−90°/s, 90°/s]
Variance of acceleration noise | 10 (m/s)²
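For illustration, the ranges in Table 2 could be sampled as follows to initialize one trajectory. Uniform sampling, the variable names, and the Cartesian conversion are assumptions; this is not the actual LAST dataset generator.

import numpy as np

rng = np.random.default_rng(0)

def sample_initial_state():
    # Draw one initialization from the ranges listed in Table 2.
    r = rng.uniform(1e3, 10e3)                    # initial distance from radar [m]
    speed = rng.uniform(100.0, 200.0)             # initial speed [m/s]
    theta_p = np.deg2rad(rng.uniform(-180, 180))  # initial distance azimuth
    theta_v = np.deg2rad(rng.uniform(-180, 180))  # initial velocity azimuth
    omega = np.deg2rad(rng.uniform(-90, 90))      # maneuvering turn rate [rad/s]
    sigma_a = np.sqrt(10.0)                       # acceleration noise std from a variance of 10 (m/s)^2
    state = np.array([r * np.cos(theta_p), r * np.sin(theta_p),
                      speed * np.cos(theta_v), speed * np.sin(theta_v)])
    return state, omega, sigma_a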
Table 3. Parameter settings for tracking test trajectories.
Trajectories | Initial State | The First Part | The Second Part | The Third Part
1 | [1000 m, −18,000 m, 150 m/s, 200 m/s] | 27.4 s, ω = 50°/s | 52.6 s, ω = 40°/s | 27.4 s, ω = 30°/s
2 | [1000 m, −18,000 m, 150 m/s, 200 m/s] | 40 s, ω = −30°/s | 27.4 s, ω = 70°/s | 40 s, ω = 50°/s
3 | [500 m, 2000 m, −300 m/s, 200 m/s] | 25 s, ω = 60°/s | 42.4 s, ω = 50°/s | 40 s, ω = −20°/s
4 | [−20,000 m, −5000 m, 250 m/s, 180 m/s] | 30 s, ω = −30°/s | 17.4 s, ω = 70°/s | 20.9 s, ω = −10°/s
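The trajectories in Table 3 are piecewise constant-turn-rate segments. A short sketch of generating such a trajectory with a standard discrete coordinated-turn (CT) model follows; the sampling interval T = 0.1 s and the state ordering [x, y, vx, vy] are assumptions.

import numpy as np

def ct_transition(omega, T):
    # Discrete-time coordinated-turn transition matrix for the state [x, y, vx, vy] (omega != 0).
    s, c = np.sin(omega * T), np.cos(omega * T)
    return np.array([[1, 0, s / omega, -(1 - c) / omega],
                     [0, 1, (1 - c) / omega, s / omega],
                     [0, 0, c, -s],
                     [0, 0, s, c]])

def generate_trajectory(x0, segments, T=0.1):
    # segments: list of (duration in s, turn rate in deg/s), one entry per part of Table 3.
    x, traj = np.asarray(x0, dtype=float), []
    for duration, omega_deg in segments:
        F = ct_transition(np.deg2rad(omega_deg), T)
        for _ in range(int(round(duration / T))):
            x = F @ x
            traj.append(x.copy())
    return np.asarray(traj)

# Trajectory 1 of Table 3:
traj1 = generate_trajectory([1000, -18000, 150, 200], [(27.4, 50), (52.6, 40), (27.4, 30)])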
Table 4. Mean RMSEs of position and velocity for trajectories with UKF-A, IMM, ISPM and MIENet. The best results are in red, and the second best results are in blue. All entries are p (m)/v (m/s).
Part | UKF-A [64] | IMM [19] | ISPM [12] | MIENet (Ours)
Trajectory 1, noise (1 × 10⁻³ rad, 2 m):
Part 1 | 6.667/11.38 | 9.193/32.57 | 3.770/7.920 | 1.800/4.170
Part 2 | 6.329/7.463 | 8.689/28.72 | 3.370/8.030 | 1.500/3.100
Part 3 | 6.457/7.963 | 8.366/25.87 | 2.490/7.040 | 1.350/3.100
All | 6.420/7.860 | 8.720/28.84 | 3.280/7.770 | 1.580/3.470
Trajectory 1, noise (2 × 10⁻³ rad, 4 m):
Part 1 | 39.73/64.94 | 17.01/47.30 | 7.950/9.570 | 4.650/6.150
Part 2 | 13.23/13.67 | 15.92/39.23 | 7.360/9.630 | 2.650/4.380
Part 3 | 10.86/10.17 | 15.52/36.66 | 5.500/8.410 | 3.310/4.870
All | 22.90/30.05 | 16.07/40.23 | 7.110/9.330 | 3.570/5.300
Trajectory 1, noise (6 × 10⁻³ rad, 8 m):
Part 1 | 139.2/208.9 | 62.80/147.8 | 17.80/13.59 | 9.170/10.15
Part 2 | 136.4/124.3 | 41.62/63.69 | 15.88/14.67 | 7.250/12.20
Part 3 | 132.7/109.7 | 40.46/60.04 | 13.26/13.83 | 4.520/8.590
All | 133.5/125.8 | 42.90/67.37 | 15.80/14.21 | 7.780/10.64
Trajectory 2, noise (1 × 10⁻³ rad, 2 m):
Part 1 | 6.854/8.294 | 9.481/33.15 | 2.640/7.170 | 1.790/4.930
Part 2 | 10.10/26.66 | 8.692/29.30 | 5.560/12.70 | 3.340/9.200
Part 3 | 6.494/8.489 | 9.013/30.38 | 4.350/8.570 | 3.870/7.260
All | 6.920/10.36 | 9.090/31.10 | 4.170/9.290 | 2.840/6.190
Trajectory 2, noise (2 × 10⁻³ rad, 4 m):
Part 1 | 12.04/12.22 | 17.95/45.88 | 5.600/8.870 | 3.440/6.190
Part 2 | 94.06/172.2 | 15.82/44.09 | 11.74/10.53 | 10.06/9.740
Part 3 | 83.27/118.6 | 16.33/42.03 | 9.970/10.07 | 11.53/10.58
All | 67.66/107.1 | 16.78/43.87 | 9.110/9.750 | 7.600/7.690
Trajectory 2, noise (6 × 10⁻³ rad, 8 m):
Part 1 | 121.2/127.0 | 54.83/96.77 | 11.62/11.72 | 6.540/7.930
Part 2 | 223.4/293.5 | 39.80/76.48 | 28.47/17.84 | 17.26/15.48
Part 3 | 275.9/301.5 | 41.08/67.76 | 35.27/20.49 | 32.28/17.27
All | 244.4/272.4 | 44.05/73.98 | 24.30/16.07 | 18.65/12.64
Trajectory 3, noise (1 × 10⁻³ rad, 2 m):
Part 1 | 1.651/5.706 | 2.106/24.70 | 5.030/1.810 | 3.300/1.920
Part 2 | 1.561/5.125 | 2.010/25.08 | 4.590/1.650 | 2.590/1.550
Part 3 | 2.214/6.587 | 2.723/28.50 | 2.320/1.870 | 3.120/1.760
All | 1.810/5.640 | 2.290/26.23 | 4.000/1.770 | 2.710/1.670
Trajectory 3, noise (2 × 10⁻³ rad, 4 m):
Part 1 | 2.953/8.071 | 3.882/32.96 | 10.31/2.440 | 6.400/2.670
Part 2 | 2.758/6.747 | 3.730/32.38 | 9.540/2.150 | 5.690/2.270
Part 3 | 4.097/9.735 | 5.018/36.97 | 4.760/2.470 | 4.090/1.770
All | 3.260/7.590 | 4.230/34.19 | 8.280/2.340 | 4.950/2.180
Trajectory 3, noise (6 × 10⁻³ rad, 8 m):
Part 1 | 6.277/14.22 | 8.626/50.51 | 22.15/5.630 | 14.98/5.070
Part 2 | 5.747/10.03 | 8.169/46.74 | 21.03/3.200 | 10.43/4.040
Part 3 | 9.921/18.34 | 11.79/52.27 | 9.160/7.020 | 8.120/3.980
All | 7.210/12.06 | 9.620/49.53 | 17.92/5.450 | 10.49/4.100
Trajectory 4, noise (1 × 10⁻³ rad, 2 m):
Part 1 | 8.420/16.20 | 10.47/39.53 | 3.530/9.050 | 3.650/9.910
Part 2 | 15.38/43.98 | 9.549/33.30 | 8.440/19.11 | 6.380/17.36
Part 3 | 42.29/77.72 | 9.838/31.73 | 4.650/13.54 | 1.300/3.490
All | 18.13/40.22 | 10.01/35.40 | 5.430/13.49 | 4.180/11.38
Trajectory 4, noise (2 × 10⁻³ rad, 4 m):
Part 1 | 28.23/43.75 | 20.41/55.48 | 7.150/11.13 | 8.440/6.140
Part 2 | 99.56/175.2 | 17.93/50.32 | 18.42/13.35 | 16.01/18.94
Part 3 | 114.5/124.9 | 18.37/44.73 | 10.81/12.76 | 2.370/6.230
All | 85.48/121.5 | 19.06/50.31 | 11.89/12.21 | 10.16/11.14
Trajectory 4, noise (6 × 10⁻³ rad, 8 m):
Part 1 | 159.9/195.7 | 72.75/154.2 | 25.64/14.89 | 24.62/9.770
Part 2 | 301.7/361.7 | 43.35/85.78 | 53.94/31.10 | 57.04/33.48
Part 3 | 207.8/114.6 | 49.05/72.03 | 34.12/23.23 | 12.61/16.75
All | 235.5/244.0 | 53.04/87.63 | 36.33/22.10 | 34.57/20.50
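The per-part and overall entries of Table 4 are RMSEs over the corresponding time segments. A brief sketch of one way to compute them is shown below; averaging squared Euclidean errors over Monte Carlo runs and time steps is an assumed convention and may differ from the exact protocol used in the paper.

import numpy as np

def rmse(estimates, truth):
    # estimates, truth: (runs, T, d) arrays; returns the RMSE over all runs and time steps.
    err = estimates - truth
    return float(np.sqrt(np.mean(np.sum(err ** 2, axis=-1))))

def part_rmse(estimates, truth, boundaries):
    # boundaries: time-step indices separating the maneuvering parts (cf. the segments of Table 3).
    scores, start = [], 0
    for end in list(boundaries) + [truth.shape[1]]:
        scores.append(rmse(estimates[:, start:end], truth[:, start:end]))
        start = end
    return scores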
Table 5. Detailed memory requirements of the MIENet for the LAST dataset.
Trajectories | 1 | 2 | 3 | 4
Total Time Steps | 1074 | 1074 | 1074 | 683
Allocated GPU Memory (MB) | 8.05 | 8.36 | 7.99 | 8.07
Active GPU Memory (MB) | 1316
FLOPs (MB) | FDM: 10, PEM: 0.02
# Params (MB) | FDM: 1239, PEM: 14.7
Table 6. Category-wise and difficulty-wise PR and SR evaluations on SV248S. The best results are in red. Vehicle, L-Vehicle, Airplane, and Ship are category-wise; Simple, Normal, and Hard are difficulty-wise. Entries are PR/SR.
Trackers | Vehicle | L-Vehicle | Airplane | Ship | Simple | Normal | Hard
STAR [63] | 0.7542/0.4911 | 0.8739/0.6555 | 0.8399/0.7517 | 1.0000/0.7562 | 0.8800/0.6228 | 0.7138/0.4653 | 0.6652/0.4177
Ours | 0.7550/0.4944 | 0.8744/0.6598 | 0.8399/0.7530 | 1.0000/0.7611 | 0.8816/0.6278 | 0.7138/0.4675 | 0.6760/0.4440
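PR and SR in Tables 6 and 7 follow the usual satellite-video tracking benchmark conventions. The sketch below assumes a 20-pixel center-location-error threshold for PR and an AUC-style SR over IoU thresholds; these are common choices but are not stated in the tables themselves.

import numpy as np

def precision_rate(pred_centers, gt_centers, threshold=20.0):
    # Fraction of frames whose predicted center lies within `threshold` pixels of the ground truth.
    dist = np.linalg.norm(pred_centers - gt_centers, axis=1)
    return float(np.mean(dist <= threshold))

def success_rate(ious, thresholds=np.linspace(0.0, 1.0, 21)):
    # Area under the success curve: mean fraction of frames with IoU above each threshold.
    return float(np.mean([np.mean(ious > t) for t in thresholds]))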
Table 7. Attribute-wise PR/SR evaluations on SV248S. The best results are in red.
Trackers | STO | LTO | DS | IV | BCH | SM | ND | CO | BCL | IPR
STAR [63] | 0.676/0.444 | 0.472/0.301 | 0.731/0.489 | 0.692/0.446 | 0.796/0.534 | 0.700/0.462 | 0.752/0.489 | 0.700/0.444 | 0.730/0.458 | 0.697/0.464
Ours | 0.676/0.447 | 0.473/0.302 | 0.731/0.492 | 0.693/0.449 | 0.797/0.538 | 0.700/0.461 | 0.753/0.490 | 0.700/0.447 | 0.730/0.460 | 0.697/0.466
Table 8. The detailed memory requirements of the MIENet for the SV248S dataset.
Scenes | 1 | 2 | 3 | 4 | 5 | 6
Total Time Steps | 750 | 747 | 490 | 748 | 580 | 499
Allocated GPU Memory (MB) | 5.47
Active GPU Memory (MB) | 1310
FLOPs (MB) | FDM: 10, PEM: 0.02
# Params (MB) | FDM: 1239, PEM: 14.7
Table 9. Ablation study on the adaptation and generalization of the FDM (Gaussian-injected noise). The best results are in red. Entries are RMSE (m); noise levels σ are given as (10⁻³ rad, m).
Methods | σ = (2, 4) | σ = (8, 10) | σ = (11, 13)
Original Observation | 35.310 | 140.94 | 193.77
Butterworth Filter | 28.000 | 112.09 | 153.65
ISPM-NEN [12] | 17.359 | 61.978 | 84.804
FDM (ours) | 13.593 (↓61.5%) | 43.690 (↓69.0%) | 61.135 (↓68.4%)
Table 10. Ablation study on the adaptation and generalization of the FDM (uniform-injected noise). The best results are in red. Entries are RMSE (m); noise levels σ are given as (10⁻³ rad, m).
Methods | σ = (12, 21.3) | σ = (40.3, 56.3) | σ = (75, 133.3)
Original Observation | 61.536 | 112.78 | 153.84
Butterworth Filter | 47.836 | 87.673 | 119.59
ISPM-NEN [12] | 28.994 | 52.348 | 71.290
FDM (ours) | 23.295 (↓62.1%) | 40.333 (↓64.2%) | 54.277 (↓64.7%)
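Tables 9 and 10 inject Gaussian and uniform measurement noise of comparable intensity into the (azimuth, range) channels. A hedged sketch of such injection is given below; treating σ as the standard deviation of both distributions (so the uniform half-width is σ√3) is an assumption about how the levels were matched.

import numpy as np

rng = np.random.default_rng(0)

def add_polar_noise(azimuth, range_m, sigma_az, sigma_r, dist="gaussian"):
    # Add zero-mean noise with standard deviation (sigma_az, sigma_r) to polar measurements.
    n = azimuth.shape[0]
    if dist == "gaussian":
        n_az, n_r = rng.normal(0.0, sigma_az, n), rng.normal(0.0, sigma_r, n)
    else:  # uniform U(-a, a) with a = sigma * sqrt(3), so its standard deviation matches sigma
        a_az, a_r = sigma_az * np.sqrt(3.0), sigma_r * np.sqrt(3.0)
        n_az, n_r = rng.uniform(-a_az, a_az, n), rng.uniform(-a_r, a_r, n)
    return azimuth + n_az, range_m + n_r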
Table 11. MDFM to FDM ablation study. The best results are in red. Entries are position RMSE (m).
Methods | Part | (2 × 10⁻³ rad, 4 m) | (8 × 10⁻³ rad, 10 m) | (11 × 10⁻³ rad, 13 m)
FDM w/o MDFM | Part 1 | 15.53 | 49.34 | 74.20
FDM w/o MDFM | Part 2 | 16.34 | 42.67 | 58.67
FDM w/o MDFM | Part 3 | 15.69 | 44.04 | 62.43
FDM w/o MDFM | All | 15.97 (↓36.3%) | 44.81 (↓55.2%) | 63.91 (↓53.5%)
Ours | Part 1 | 11.63 | 39.28 | 55.11
Ours | Part 2 | 10.25 | 33.58 | 46.97
Ours | Part 3 | 9.960 | 38.76 | 57.77
Ours | All | 10.55 (↓57.9%) | 36.45 (↓63.5%) | 52.02 (↓62.1%)
Table 12. Results of the ablation study.
Tracking Variations | MDFM | MHN | PCLoss | Distance (m)/Velocity (m/s)
Ablation 1 25.88/21.61
Ablation 2 19.39/28.87
Ablation 3 33.29/38.21
Ablation 4 15.83/29.61
Ablation 5 16.90/14.07
Ablation 6 9.900/13.55
Ours7.780/10.64
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
