Article

HyperNCMD: A Scene-Adaptive Clutter Measurement Density Estimator for Radar Tracking via Hypernetworks and Normalizing Flows

by Zongqing Cao, Jianchao Yang, Wang Sun, Xingyu Lu, Ke Tan, Zheng Dai, Wenchao Yu and Hong Gu *
School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
* Author to whom correspondence should be addressed.
Remote Sens. 2026, 18(10), 1541; https://doi.org/10.3390/rs18101541
Submission received: 27 March 2026 / Revised: 28 April 2026 / Accepted: 10 May 2026 / Published: 13 May 2026

Highlights

What are the main findings?
  • We propose HyperNCMD, a scene-adaptive clutter measurement density (CMD) estimator that employs a hypernetwork to dynamically generate normalizing flow parameters conditioned on scene representations, enabling fast adaptation to unseen environments without full retraining.
  • HyperNCMD leverages Random Fourier Features (RFFs) and a proposed ISAB-LSTM module to encode spatio-temporal information from raw radar measurements, and further improves adaptation to novel environments via Feature-wise Linear Modulation (FiLM)-based test-time fine-tuning.
What are the implications of the main findings?
  • HyperNCMD demonstrates strong robustness and estimation accuracy across spatially and temporally varying clutter, highlighting the benefit of hypernetwork-driven parameter generation for adaptive radar CMD modeling.
  • The proposed framework provides a scalable and deployment-friendly solution for CMD estimation, enabling more reliable clutter distribution modeling for multi-target tracking (MTT) and downstream radar perception in complex environments.

Abstract

Accurate estimation of clutter measurement density (CMD) is crucial for radar-based multi-target tracking (MTT), especially under spatially non-uniform and temporally varying environments. Existing methods, including finite mixture models, kernel density estimation, and normalizing flows, often require scene-specific tuning and exhibit limited generalization. To address these limitations, we propose HyperNCMD, a scene-adaptive CMD estimator that employs hypernetworks to dynamically generate the parameters of normalizing flows. To capture spatial variability, radar measurements are first embedded using Random Fourier Features (RFFs), and then processed by a spatio-temporal encoder that jointly models spatial structures and temporal clutter dynamics. The hypernetwork leverages the encoded embedding to adaptively produce flow parameters, enabling flexible CMD estimation across diverse environments. Lightweight data augmentation further improves robustness, while a Feature-wise Linear Modulation (FiLM)-based fine-tuning scheme enhances test-time adaptation. Experiments on both synthetic and real radar datasets demonstrate that HyperNCMD achieves superior accuracy and robustness, attaining up to a 10.5% reduction in per-point negative log-likelihood under dynamically varying conditions. These results highlight the potential of hypernetwork-driven CMD modeling for reliable radar perception in complex sensing environments.

1. Introduction

Clutter Measurement Density (CMD) is a critical yet often underexplored parameter in radar data processing. It characterizes the spatial probability distribution of clutter (i.e., false alarm) measurements across the surveillance region [1,2,3], thereby reflecting environmental complexity and directly influencing the performance of multi-target tracking (MTT) systems. Accurate modeling of clutter not only improves the robustness and accuracy of tracking algorithms but also provides valuable priors for high-level perception and decision-making tasks.
CMD estimation plays a pivotal role in several key applications:
  • Adaptive Track Initiation: CMD modeling enables region-aware initiation strategies. For instance, ref. [4] employs M/N logic [5] in low-clutter areas for rapid track initiation, while adopting conservative multi-hypothesis approaches [6] in high-clutter regions to suppress false initiations.
  • MTT Algorithms: Advanced MTT methods such as JIPDA-MAP [2,7] rely on accurate clutter-target likelihoods for reliable data association. Prior studies [8,9,10] have shown that fine-grained CMD estimation substantially reduces false track rates and enhances overall tracking performance.
  • Environmental Perception: Beyond tracking, CMD estimation can reveal latent structural cues embedded in clutter measurements. In highway environments, for example, ref. [11] exploits persistent clutter reflections to extract road contours, which can serve as geometric priors for downstream perception tasks such as vehicle tracking [12].
Given its importance, a variety of CMD estimation methods have been developed, which can be broadly categorized into three classes [3]: track gate-based, measurement space-based, and clutter generator-based approaches. Track gate-based methods assume uniform clutter within validation gates and estimate density from local measurements [5,13], but suffer from limited spatial coverage and degraded performance in dense or spatially heterogeneous environments. Measurement space-based methods construct global clutter maps over time using histograms [1], finite mixture models (FMMs) [8], or kernel density estimation (KDE) [9], offering improved spatial resolution and robustness. Clutter generator-based methods, grounded in the Random Finite Set (RFS) framework, model clutter as a nonhomogeneous Poisson process (NHPP) [14,15], but often involve computationally intractable inference and rely on simplifying assumptions that degrade practical performance [16].
Among these methods, our prior work, NF Streaming [11], represents a recent advance in measurement space-based CMD estimation. It introduces a normalizing flow (NF)-based neural network [17] for global CMD estimation, achieving superior accuracy and robustness compared with FMMs [8] and KDE [9] methods, as demonstrated on both synthetic and real-world datasets. However, like FMMs and KDE, NF Streaming remains fundamentally scene-specific: it requires retraining from scratch under distribution shifts or when deployed to new environments, as it lacks the capability to exploit cross-scene priors. Moreover, repeated full retraining incurs substantial computational overhead, thereby limiting scalability and adaptability in real-world deployments.
To overcome these limitations, we propose HyperNCMD, a scene-adaptive and data-efficient neural CMD estimator designed to operate effectively across dynamic and previously unseen environments. Unlike NF Streaming, which learns scene-specific parameters separately for each scene, HyperNCMD employs a hypernetwork to generate NF parameters conditioned on extracted scene features. This mechanism enables rapid adaptation to novel clutter distributions with minimal tuning. By decoupling flow parameter learning from scene-specific optimization, HyperNCMD maintains high performance across diverse environments while significantly reducing computational cost, thereby enabling accurate, real-time CMD estimation for deployment in complex and time-varying scenarios.
Our main contributions are summarized as follows:
  • Scene-adaptive CMD estimation via hypernetworks. A hypernetwork-based architecture is designed to generate NF parameters conditioned on scene embeddings, enabling rapid adaptation to previously unseen environments with minimal tuning and facilitating knowledge transfer across scenes (see Section 4.1.2 and Section 4.1.3).
  • Temporal-aware scene encoding. A Temporal Set Transformer [18]-based encoder is proposed to capture spatio-temporal variations in radar clutter. Raw radar measurements are embedded using Random Fourier Features (RFFs) [19], and an ISAB-LSTM architecture models spatio-temporal dynamics to produce compact yet expressive scene-level representations (see Section 4.1.1).
  • Spatio-temporal data augmentation for point-based radar measurements. Two lightweight augmentation strategies—Coordinate Flip and Random Context Length Truncation—are introduced to enhance robustness and generalization for spatio-temporal radar data (see Section 4.2.2).
  • Efficient test-time adaptation via lightweight fine-tuning. A Feature-wise Linear Modulation (FiLM) [20]-based fine-tuning mechanism is developed to enable efficient adaptation to unseen scenes at test time, achieving a favorable trade-off between accuracy and computational efficiency (see Section 4.3.1).
Extensive experiments on both synthetic and real-world radar datasets demonstrate that HyperNCMD substantially outperforms existing CMD estimators in accuracy, robustness, and effectiveness across diverse and previously unseen environments.
The remainder of this article is organized as follows. Section 2 reviews the background relevant to this work. Section 3 formulates the problem under study. Section 4 presents the proposed HyperNCMD model in detail, including its training and inference procedures. Section 5 and Section 6 evaluate the effectiveness of the proposed method using simulation and real-world data, respectively. Finally, Section 7 concludes the article.

2. Background

2.1. CMD Modeling

In MTT, clutter is typically modeled as an NHPP characterized by a spatially varying intensity function $\rho(\mathbf{z})$, where $\mathbf{z} \in \mathbb{R}^D$ denotes a point in the D-dimensional measurement space [3,5]. For any bounded region $B$ within the measurement space, the expected number of clutter measurements is
$\lambda = \int_{B} \rho(\mathbf{z}) \, d\mathbf{z} < \infty$
where $\lambda$ denotes the expected clutter count per scan in $B$. The intensity function $\rho(\mathbf{z})$ can be normalized as
$\tilde{\rho}(\mathbf{z}) = \frac{1}{\lambda} \rho(\mathbf{z})$
This produces $\tilde{\rho}(\mathbf{z})$, which is a valid probability density function satisfying
$\int_{B} \tilde{\rho}(\mathbf{z}) \, d\mathbf{z} = 1, \quad \tilde{\rho}(\mathbf{z}) \geq 0$.
In practice, the mean clutter rate $\lambda$ is usually estimated via maximum-likelihood methods [8], while the spatial density $\tilde{\rho}(\mathbf{z})$ is inferred using density estimation techniques such as FMMs [8], KDE [9], or NFs [11]. In this work, we adopt a flow-based approach to model $\tilde{\rho}(\mathbf{z})$, following the NF paradigm, which provides both flexibility and expressive modeling capacity. A comprehensive review of CMD modeling can be found in [3], and its relevance to radar-based MTT is further discussed in [7,11].
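As a toy illustration of the two quantities above, the snippet below estimates $\lambda$ as the average clutter count per scan and a normalized density $\tilde{\rho}$ with a simple 1-D histogram standing in for the FMM/KDE/NF estimators discussed here; the simulated data, bin count, and variable names are invented for the example, not taken from the paper's implementation.

```python
import random

random.seed(0)

# Simulated scans: each scan is a list of 1-D clutter positions in [0, 1).
# Positions are drawn from a non-uniform density (squaring skews toward 0).
scans = [[random.random() ** 2 for _ in range(random.randint(8, 12))]
         for _ in range(200)]

# Mean clutter rate: lambda_hat = average number of clutter points per scan.
lambda_hat = sum(len(s) for s in scans) / len(scans)

# Normalized spatial density rho_tilde via a histogram over [0, 1).
NBINS = 10
counts = [0] * NBINS
for s in scans:
    for z in s:
        counts[min(int(z * NBINS), NBINS - 1)] += 1
total = sum(counts)
bin_width = 1.0 / NBINS
rho_tilde = [c / (total * bin_width) for c in counts]  # density per unit length

# rho_tilde integrates to 1 over [0, 1): sum(rho * bin_width) == 1.
print(lambda_hat, sum(r * bin_width for r in rho_tilde))
```

Any of the density estimators cited above could replace the histogram; the normalization constraint on $\tilde{\rho}$ is the same in each case.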

2.2. Neural Density Estimation

NFs are generative models that transform a simple base distribution (e.g., Gaussian) into a complex target distribution through a sequence of invertible and differentiable mappings [17]. These transformations enable expressive modeling and tractable density evaluation via the change-of-variables formula [21]:
$\log p_Z(\mathbf{z}) = \log p_V(f(\mathbf{z})) + \log \left| \det J_f(\mathbf{z}) \right|$
where $\mathbf{v} = f(\mathbf{z})$ denotes the transformed latent variable in space $V$ with base distribution $p_V(\cdot)$, while $p_Z(\cdot)$ represents the target distribution in the data space $Z$. The term $J_f(\mathbf{z}) = \partial f(\mathbf{z}) / \partial \mathbf{z}$ denotes the Jacobian matrix of the transformation $f(\cdot)$, whose determinant accounts for the local volume change introduced by the mapping.
Several NF architectures have been proposed to balance modeling capacity and computational efficiency. NICE [22] is designed with additive coupling layers and volume-preserving transformations, enabling efficient inference with simple mappings. RealNVP [23] employs affine couplings, offering increased flexibility while retaining tractable Jacobian computations. MAF [24] adopts an autoregressive design, supporting exact likelihood evaluation but requiring sequential sampling. NSF [25] introduces monotonic rational spline couplings, providing precise control over complex density shapes and enhancing expressive power.
Following NF Streaming [11], we adopt NSF as the base architecture for CMD modeling, as it captures intricate spatial patterns while maintaining tractable inference.
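The change-of-variables computation that underlies all of these flows can be checked on a one-layer affine "flow" in one dimension, where the pushed-forward density is known in closed form. The network-free transform below is purely illustrative (it is not the NSF architecture), and all names are invented:

```python
import math

# One-dimensional affine "flow": v = f(z) = (z - mu) / s, with s > 0.
# Pushing a standard-normal base p_V through f^{-1} yields p_Z = N(mu, s^2);
# the change-of-variables formula must recover exactly that density.
mu, s = 1.5, 0.7

def log_base(v):
    # log density of the standard normal base distribution p_V
    return -0.5 * v * v - 0.5 * math.log(2.0 * math.pi)

def log_p_z(z):
    # log p_Z(z) = log p_V(f(z)) + log |det J_f(z)|, with J_f = 1/s here
    v = (z - mu) / s
    return log_base(v) + math.log(1.0 / s)

def log_normal(z, mu, s):
    # analytic log density of N(mu, s^2) for comparison
    return -0.5 * ((z - mu) / s) ** 2 - math.log(s * math.sqrt(2.0 * math.pi))

for z in (-1.0, 0.0, 2.3):
    assert abs(log_p_z(z) - log_normal(z, mu, s)) < 1e-12
```

NSF layers replace the affine map with monotonic rational quadratic splines, but the log-density bookkeeping is identical.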

2.3. Hypernetwork Architecture

Hypernetworks [26] are neural architectures in which a separate network, called the hypernetwork, generates the parameters of a target network. Formally, given a conditioning input $\mathbf{c}$, the hypernetwork $H_{\theta}$ produces the weights $\mathbf{w}$ of the target network $T_{\mathbf{w}}$ as
$\mathbf{w} = H_{\theta}(\mathbf{c})$
where $\theta$ denotes the learnable parameters of the hypernetwork. This design allows the target network to adapt dynamically to the context $\mathbf{c}$, supporting flexible and context-aware inference.
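A minimal sketch of this parameter-generation pattern, assuming a linear hypernetwork and a one-neuron target network (both invented for illustration; real hypernetworks use deeper generators and larger targets):

```python
import random

random.seed(1)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Hypernetwork H_theta: a linear map from the conditioning input c to the
# full parameter vector w of a tiny target network T_w (one linear neuron).
D_C, D_X = 3, 4                       # context dim, target-input dim
theta = [[random.gauss(0.0, 0.5) for _ in range(D_C)]
         for _ in range(D_X + 1)]     # generates D_X weights + 1 bias

def hypernet(c):
    # w = H_theta(c): every target parameter is a function of the context c
    return [dot(row, c) for row in theta]

def target(w, x):
    # T_w(x): the target network uses the generated parameters, not its own
    return dot(w[:D_X], x) + w[D_X]

c1, c2 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]
x = [0.5, -0.2, 0.1, 0.9]
# Different contexts instantiate different target networks from shared theta.
print(target(hypernet(c1), x), target(hypernet(c2), x))
```

Only $\theta$ is trained; the target weights $\mathbf{w}$ are recomputed per context, which is exactly the mechanism HyperNCMD exploits per scene.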
Hypernetworks have been applied to a wide range of tasks, including few-shot learning [27], continual learning [28], and neural architecture search [29]. More recently, they have been integrated with NFs for conditional density estimation. For example, HCNAF [30] employs a hypernetwork to parameterize an autoregressive flow conditioned on LiDAR sequences for multimodal occupancy forecasting. HyperFlow [31] conditions a continuous NF [32] on latent codes for high-fidelity 3D surface generation, while Predicting Flow [33] applies a similar approach to pedestrian trajectory prediction.
Building on recent developments in hypernetwork-based NFs [30,31,33,34], we develop a context-aware CMD estimator for radar perception. The method dynamically parameterizes the NF with scene-level spatio-temporal features, enabling adaptive and efficient CMD estimation across diverse and time-varying environments, thereby improving robustness in subsequent MTT and perception tasks.

3. Problem Statement

In fixed-deployment radar systems, clutter is predominantly caused by quasi-static scatterers, including terrain, vegetation, and buildings. Although clutter evolves slowly—over minutes or even hours—a full-scene radar scan is typically completed within a few seconds. Hence, the clutter field can be regarded as locally stationary over short temporal windows, allowing past observations to inform the estimation of the current CMD.
To exploit this local stationarity, let $Z_k = \{\mathbf{z}_{m,k}\}_{m=1}^{M_k}$ denote the set of radar measurements at time step $k$, where each $\mathbf{z}_{m,k} \in \mathbb{R}^D$ (e.g., $\mathbf{z}_{m,k} = (x_{m,k}, y_{m,k})$ when $D = 2$) represents a spatial measurement, and $M_k$ is the number of measurements in that frame. We define a length-$L$ temporal window (also referred to as the context) as
$\mathcal{C} \triangleq \{Z_1, Z_2, \ldots, Z_L\}$
where the goal is to model the CMD of the final frame $Z_L$ conditioned on the preceding ones.
We aim to learn a conditional density estimator $f_{\theta}$ that, given the context $\mathcal{C}$, assigns a normalized density to any query location in the last frame:
$\hat{\rho}_L(\mathbf{z}_q) = f_{\theta}(\mathbf{z}_q \mid \mathcal{C}), \quad \mathbf{z}_q \in Z_L$.
To this end, we assume access to a collection of $S$ temporal windows $\{\mathcal{C}^{(s)}\}_{s=1}^{S}$ sampled from different times and locations, each representing a distinct clutter distribution:
$\mathcal{C}^{(s)} \triangleq \{Z_1^{(s)}, \ldots, Z_L^{(s)}\}, \quad Z_k^{(s)} = \{\mathbf{z}_{m,k}^{(s)}\}_{m=1}^{M_k^{(s)}}$.
The model parameters are learned by minimizing the empirical negative log-likelihood (NLL) of the last frame:
$\theta^{*} = \arg\min_{\theta} \mathcal{L}(\theta)$
$\mathcal{L}(\theta) = -\frac{1}{N} \sum_{s=1}^{S} \sum_{\mathbf{z}_q \in Z_L^{(s)}} \log f_{\theta}\left(\mathbf{z}_q \mid \mathcal{C}^{(s)}\right)$
where $N = \sum_{s=1}^{S} |Z_L^{(s)}|$ is the total number of training samples.
After training, the estimator $f_{\theta^{*}}$ is fixed and can be applied to unseen data. At test time, given a novel context $\mathcal{C}^{(u)}$, the model produces the density of its final frame as
$\hat{\rho}_L^{(u)}(\mathbf{z}_q) = f_{\theta^{*}}\left(\mathbf{z}_q \mid \mathcal{C}^{(u)}\right)$.
Unlike KDE [9], FMMs [8], or NF Streaming [11], which estimate densities independently for each window, the conditional formulation in Equation (7) amortizes density estimation across contexts [35]. By mapping the temporal context $\mathcal{C}$ to a density function for the current frame, the model shares parameters across scenes and thereby achieves stronger adaptability to previously unseen environments.
For efficient batched processing, each temporal window, whose frames may contain different numbers of measurements $M_k^{(s)}$, is reshaped into a fixed-size tensor $Z_{\text{ctx}}^{(s)} \in \mathbb{R}^{L \times M \times D}$, where $M$ is the maximum number of measurements per frame across the dataset. Frames with fewer than $M$ points are zero-padded, and a binary mask $I^{(s)} \in \{0,1\}^{L \times M}$ marks valid entries.
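The padding-and-mask construction can be sketched directly; the toy window below (three frames of 2-D measurements) is invented for illustration:

```python
# Variable-length frames from one temporal window (D = 2 measurements).
window = [
    [(0.1, 0.2), (0.4, 0.5)],                 # frame 1: 2 measurements
    [(0.3, 0.3)],                             # frame 2: 1 measurement
    [(0.7, 0.1), (0.2, 0.9), (0.6, 0.6)],     # frame 3: 3 measurements
]

L = len(window)
M = max(len(frame) for frame in window)  # max measurements per frame

# Zero-pad every frame to M points and record validity in a binary mask,
# so downstream set operations can ignore the padded entries.
Z_ctx = [[frame[m] if m < len(frame) else (0.0, 0.0) for m in range(M)]
         for frame in window]
mask = [[1 if m < len(frame) else 0 for m in range(M)] for frame in window]

print(M, mask)
```

In the paper, $M$ is taken over the whole dataset rather than a single window, but the mask semantics are the same.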

4. Proposed Method

This section introduces the proposed HyperNCMD model—a scene-adaptive and fully data-driven solution for CMD estimation. We first describe the overall architecture, followed by the training pipeline and two data augmentation strategies designed to enhance robustness. Finally, we present a lightweight FiLM-based adaptation module that further improves inference performance under previously unseen environments.

4.1. The HyperNCMD Model

To achieve scene-adaptive CMD estimation, HyperNCMD employs a hypernetwork to generate the NF parameters conditioned on the scene context. This design enables adaptive CMD estimation for each temporal window, effectively capturing spatio-temporal variability in the radar environment. As illustrated in Figure 1, the framework comprises three main components: a Scene Feature Extractor, a Hypernetwork-based Parameter Generator, and a Parameterized Normalizing Flow.
The Scene Feature Extractor encodes the input context $Z_{\text{ctx}}$ into a compact representation $\mathbf{h}_{\text{ctx}}$, capturing the spatial-temporal characteristics of radar measurements. This embedding is fed into a hypernetwork, which dynamically produces the parameters $\Omega$ of a context-specific NF. In this way, HyperNCMD eliminates the need for retraining across scenes, as required in prior work [8,9,11]. The Parameterized Normalizing Flow then estimates the CMD through a sequence of invertible and differentiable transformations $F_k(\cdot)$, $k = 1, \ldots, K$. Unlike NF Streaming [11], which relies on gradient-based optimization to update flow parameters, HyperNCMD generates them in real time, thereby enhancing adaptability to new environments while improving computational efficiency.
By integrating context-aware scene encoding with hypernetwork-driven parameterization, HyperNCMD provides a scalable and data-efficient approach to clutter modeling across diverse radar environments. The design details of each module are presented in the following subsections.

4.1.1. Scene Feature Extractor

To enable context-aware CMD estimation, a compact representation of the spatial-temporal clutter distribution is extracted from past radar measurements. As illustrated in Figure 2, the Scene Feature Extractor processes accumulated radar measurements $Z_{\text{ctx}}^{(s)} \in \mathbb{R}^{L \times M \times D}$ from scene $s$ and produces a global embedding $\mathbf{h}_{\text{ctx}} \in \mathbb{R}^{D_h}$, where $D_h$ denotes the embedding dimensionality. This embedding summarizes scene-specific spatio-temporal clutter patterns and serves as the conditioning input to the hypernetwork.
Each frame $Z_k = \{\mathbf{z}_{m,k}\}_{m=1}^{M}$ is treated as an unordered set of radar returns. Because such sets are permutation-invariant [36], conventional architectures such as CNNs or RNNs, which rely on spatial or temporal ordering, are not directly applicable. To address this issue, we adopt the Set Transformer [18], a self-attention-based architecture capable of modeling unordered sets and capturing higher-order interactions. Compared with alternatives such as Deep Sets [36] and PointNet [37], the Set Transformer offers a favorable balance between representational capacity and computational efficiency.
The standard Set Transformer assumes time-independent sets, whereas the accumulated context Z ctx spans multiple time steps and implicitly encodes temporal dynamics. To capture these dynamics, we introduce an ISAB-LSTM module that integrates the Induced Set Attention Block (ISAB) [18], which performs efficient self-attention on sets via a set of learnable inducing points, with a Long Short-Term Memory (LSTM)-based temporal encoder [38]. This hybrid design jointly models spatial structure and temporal evolution while maintaining permutation invariance within each frame.
To further enhance representational power, each measurement z m , k is first projected using Random Fourier Features [19] before being fed into the ISAB-LSTM module. This projection maps the inputs into a randomized high-dimensional feature space approximating a shift-invariant kernel, allowing the model to capture fine-grained spatial patterns in cluttered radar scenes.
Finally, a two-stage pooling strategy is applied. Mean pooling aggregates features across unordered measurements within each frame to ensure permutation invariance, followed by a temporal attention pooling layer that integrates information across frames. The resulting embedding h ctx provides a compact and informative descriptor of the scene’s spatio-temporal clutter, serving as an effective conditioning signal for adaptive CMD estimation via the hypernetwork.
(1)
Random Fourier Feature Mapping
As demonstrated in [39], standard multi-layer perceptrons (MLPs) tend to capture only the smooth, low-frequency components when learning from coordinate-based data such as radar raw measurements. To mitigate this limitation, RFFs are applied to map the inputs into a space spanned by sinusoidal bases of different frequencies, enabling the network to represent functions with higher spatial frequencies.
Given a set of 2D radar measurements at frame $k$, denoted as $Z_k \in \mathbb{R}^{M \times 2}$ (normalized to $[0,1]^2$, following [11]), each point is mapped as
$H_{\text{RFF}} = \left[ \cos(2\pi Z_k B^{\top}), \; \sin(2\pi Z_k B^{\top}) \right]$
where $B \in \mathbb{R}^{B_F \times 2}$ is randomly initialized with entries drawn from $\mathcal{N}(0, \sigma_F^2)$. The resulting representation $H_{\text{RFF}} \in \mathbb{R}^{M \times 2 B_F}$ approximates a shift-invariant kernel [19]. The hyperparameters $B_F$ and $\sigma_F$ control the frequency resolution and are empirically selected through ablation studies (see Section 5.2.3).
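A minimal sketch of the RFF mapping above, with illustrative values for $B_F$ and $\sigma_F$ (the paper selects these by ablation):

```python
import math
import random

random.seed(2)

B_F = 16        # number of random frequencies (illustrative)
SIGMA_F = 1.0   # frequency bandwidth (illustrative)

# Random frequency matrix B with rows b ~ N(0, sigma_F^2 I), one per feature.
B = [(random.gauss(0.0, SIGMA_F), random.gauss(0.0, SIGMA_F))
     for _ in range(B_F)]

def rff(z):
    # Maps a 2-D point z in [0,1]^2 to [cos(2*pi*z@B.T), sin(2*pi*z@B.T)].
    proj = [2.0 * math.pi * (b[0] * z[0] + b[1] * z[1]) for b in B]
    return [math.cos(p) for p in proj] + [math.sin(p) for p in proj]

Z_k = [(0.1, 0.9), (0.5, 0.5), (0.8, 0.2)]   # one frame of M = 3 measurements
H_RFF = [rff(z) for z in Z_k]                # shape M x 2*B_F
print(len(H_RFF), len(H_RFF[0]))
```

Larger $\sigma_F$ injects higher spatial frequencies, which is what lets the downstream MLP escape its low-frequency bias.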
(2)
ISAB-LSTM Module
To jointly capture spatial and temporal dependencies in clutter point sets, we propose the ISAB-LSTM module, which integrates the ISAB [18] with an LSTM [38]. The module operates in three stages: (i) spatial summarization using ISAB to efficiently model raw measurement sets, (ii) temporal modeling with an LSTM to track frame-to-frame evolution, and (iii) fusion, which re-encodes individual measurements with temporal context. This design is conceptually analogous to ConvLSTM [40], in that convolution-like operations are incorporated within the recurrent state updates to preserve spatial structure while modeling temporal dynamics.
Set-structured models such as Deep Sets [36], PointNet [37], and Set Transformer [18] typically aggregate element-wise features via a permutation-invariant pooling operation:
$\text{net}\left(\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}\right) = \rho\left(\text{pool}\left(\{\phi(\mathbf{x}_i)\}_{i=1}^{n}\right)\right)$
where ϕ ( · ) is a shared encoder, pool ( · ) is permutation-invariant, and ρ ( · ) maps the pooled feature to the output.
Self-attention [41] without positional encoding is also permutation-invariant and captures pairwise interactions, but its quadratic complexity $O(M^2)$ becomes prohibitive when the number of radar measurements per scan is large. To address this, ISAB introduces $I \ll M$ inducing points, reducing complexity to $O(IM)$ while retaining the modeling capacity of attention.
Given the RFF-encoded measurements $H_{\text{RFF}} \in \mathbb{R}^{M \times 2 B_F}$, the ISAB-LSTM processes the input in three sequential stages, described below.
(1) Spatial Summarization via Induced Attention: In the first stage, a learnable set of inducing points $\mathbf{I} \in \mathbb{R}^{I \times D_h}$ attends to the input representation $H_{\text{RFF}}$ via a Multihead Attention Block (MAB) [41], producing a compact latent summary $H_{\text{ind}} \in \mathbb{R}^{I \times D_h}$ for each frame:
$H_{\text{ind}} = \text{MAB}(\mathbf{I}, H_{\text{RFF}}, H_{\text{RFF}})$
Here, the inducing points $\mathbf{I}$ act as global queries that extract essential spatial structures while discarding redundancy. The MAB consists of two residual sublayers with layer normalization [42]: a multi-head attention mechanism $\text{MHA}(\cdot)$ and a position-wise feedforward network $\text{FFN}(\cdot)$ [41]. It is computed as follows:
$\text{MAB}(Q, K, V) = \text{LN}\left(\tilde{H} + \text{FFN}(\tilde{H})\right), \quad \tilde{H} = \text{LN}\left(Q + \text{MHA}(Q, K, V)\right)$
The input queries $Q$, keys $K$, and values $V$ are first linearly projected to a common feature space.
Through this design, the inducing points $\mathbf{I}$ serve as bottlenecks that aggregate information from the measurement set $H_{\text{RFF}}$ in a permutation-invariant manner, thereby yielding a compact representation of the spatial structure in each scan.
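The inducing-point summarization can be sketched with single-head, unprojected dot-product attention; the residual connections, layer normalization, FFN, and multi-head structure of the full MAB are omitted for brevity, and all values are invented:

```python
import math
import random

random.seed(3)

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(Q, K, V):
    # Single-head scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.
    d = len(Q[0])
    out = []
    for q in Q:
        w = softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                     for k in K])
        out.append([sum(wi * v[j] for wi, v in zip(w, V))
                    for j in range(len(V[0]))])
    return out

D_H, I, M = 4, 2, 6   # feature dim, inducing points, measurements
inducing = [[random.gauss(0, 1) for _ in range(D_H)] for _ in range(I)]
H = [[random.gauss(0, 1) for _ in range(D_H)] for _ in range(M)]

# I inducing points query the full measurement set: cost O(I*M), not O(M^2),
# and the output is invariant to reordering the M measurements.
H_ind = attend(inducing, H, H)
print(len(H_ind), len(H_ind[0]))
```

Because the inducing queries are fixed per frame, shuffling the measurement rows permutes keys and values identically and leaves `H_ind` unchanged, which is the permutation-invariance property the text relies on.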
(2) Temporal Modeling via LSTM Cell: While the ISAB summarizes spatial relationships within each frame, clutter statistics often evolve over time. To capture such temporal dynamics, the latent features $H_{\text{ind}}$ are sequentially processed by an LSTM cell [38]:
$\mathbf{h}_k, \mathbf{c}_k = \text{LSTMCell}(H_{\text{ind}}, \mathbf{h}_{k-1}, \mathbf{c}_{k-1})$
where $\mathbf{h}_k$ and $\mathbf{c}_k$ denote the hidden state and cell state at time step $k$, respectively. All operations are applied independently to each of the $I$ inducing points, with parameters shared across both time steps and inducing points. The updated hidden state $H_{\text{LSTM}} := \mathbf{h}_k$ is then forwarded to the subsequent fusion stage.
(3) Fusion of Spatial and Temporal Context: Finally, to inject temporal context back into the measurement data domain, the original measurement set $H_{\text{RFF}}$ serves as a query to attend to the temporally updated features $H_{\text{LSTM}}$:
$H_{\text{IL}} = \text{MAB}(H_{\text{RFF}}, H_{\text{LSTM}}, H_{\text{LSTM}})$
This step recontextualizes each measurement with the temporally evolved clutter representation, yielding $H_{\text{IL}} \in \mathbb{R}^{M \times D_h}$ that encodes both intra-frame spatial structures and inter-frame temporal dynamics in a permutation-invariant manner.
(3)
Mean Pooling
Given the spatio-temporal features $H_{\text{IL}} \in \mathbb{R}^{M \times D_h}$ obtained from the ISAB-LSTM module, a mask-aware mean pooling operation is applied to derive a compact per-frame representation. Let $\mathbf{I} \in \{0,1\}^{M}$ denote the binary validity mask introduced in Section 3, and $M_v = \sum_{i=1}^{M} I_i$ be the number of valid measurements. The pooled feature is computed as
$\mathbf{h}_m = \text{Linear}_{\text{MP}}\left(\frac{\mathbf{I}^{\top} H_{\text{IL}}}{M_v}\right)$
where $\mathbf{I}^{\top} H_{\text{IL}} \in \mathbb{R}^{1 \times D_h}$ sums the features of valid measurements, which are then averaged over $M_v$ and linearly projected to obtain $\mathbf{h}_m$.
In contrast to the Set Transformer [18], which utilizes attention-based pooling for global summarization, mean pooling is adopted here for its lower computational cost and empirically observed robustness. The resulting vector $\mathbf{h}_m \in \mathbb{R}^{D_h}$ serves as a compact per-frame descriptor that summarizes the spatial-temporal structure within each frame.
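A sketch of the mask-aware mean above, with the final linear projection omitted; the feature values are invented so the expected average is easy to verify by hand:

```python
# Mask-aware mean pooling over M = 4 slots, of which M_v = 2 are valid.
D_H = 3
H_IL = [[1.0, 2.0, 3.0],     # valid measurement
        [3.0, 4.0, 5.0],     # valid measurement
        [9.0, 9.0, 9.0],     # zero-padded slot (masked out)
        [7.0, 7.0, 7.0]]     # zero-padded slot (masked out)
mask = [1, 1, 0, 0]

M_v = sum(mask)
# Sum only valid rows, then divide by the number of valid measurements;
# padded slots contribute nothing regardless of their contents.
h_m = [sum(mask[i] * H_IL[i][j] for i in range(len(H_IL))) / M_v
       for j in range(D_H)]
print(h_m)  # -> [2.0, 3.0, 4.0]
```

Dividing by $M_v$ rather than $M$ is the step that keeps padded frames from biasing the descriptor toward zero.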
(4)
Temporal Attention Pooling
Having obtained per-frame descriptors $\mathbf{h}_{m,k} \in \mathbb{R}^{D_h}$ for $k = 1, \ldots, L$ from the ISAB-LSTM and mean pooling stages, the next step is to aggregate these features across time. Direct averaging treats all frames equally, which may not be optimal in dynamic environments where certain frames carry more informative cues. To address this limitation, we employ a temporal attention pooling mechanism that adaptively weights each frame according to its contextual relevance.
The sequence of frame-level embeddings is defined as
$H_m = \text{cat}[\mathbf{h}_{m,1}, \ldots, \mathbf{h}_{m,L}] \in \mathbb{R}^{L \times D_h}$
where $\text{cat}[\cdot]$ denotes the concatenation operation. A learnable query vector $\mathbf{q} \in \mathbb{R}^{D_h}$ is used to compute raw attention scores:
$\boldsymbol{\alpha}' = H_m \mathbf{q} \in \mathbb{R}^{L}$
The attention scores are then converted to attention weights:
$\boldsymbol{\alpha} = \text{Softmax}(\boldsymbol{\alpha}') \in \mathbb{R}^{L}$
Finally, the global temporal embedding is obtained as a weighted combination of frame embeddings:
$\mathbf{h}_{\text{ctx}} = \boldsymbol{\alpha}^{\top} H_m \in \mathbb{R}^{D_h}$
The resulting context vector h ctx serves as a compact and discriminative summary of the temporal evolution of clutter, providing a robust conditioning signal for the subsequent hypernetwork-based parameter generation.
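The three steps above (raw scores, softmax weights, weighted combination) can be sketched as follows; the frame embeddings and query vector are hand-picked for illustration rather than learned:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

# Frame-level descriptors h_{m,k} for L = 3 frames, D_h = 2.
H_m = [[0.2, 0.8],
       [1.0, 0.1],
       [0.5, 0.5]]
q = [1.0, -1.0]   # stand-in for the learnable query vector

# Raw scores alpha'_k = h_{m,k} . q, then softmax to attention weights.
scores = [sum(h_j * q_j for h_j, q_j in zip(h, q)) for h in H_m]
alpha = softmax(scores)

# Global context embedding: attention-weighted combination of frame embeddings.
h_ctx = [sum(a * h[j] for a, h in zip(alpha, H_m)) for j in range(len(q))]
print(alpha, h_ctx)
```

Frames whose descriptors align with the query receive larger weights, so the pooled $\mathbf{h}_{\text{ctx}}$ emphasizes the most informative frames instead of averaging uniformly.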

4.1.2. Hypernetwork-Based Parameter Generator

Once the scene-level context embedding $\mathbf{h}_{\text{ctx}}$ is obtained, a hypernetwork is employed to generate the parameters of the target NF model. In contrast to NF Streaming, where NF parameters are independently optimized for each scene via gradient-based maximum likelihood estimation, HyperNCMD adopts a different paradigm by learning a hypernetwork that generates NF parameters conditioned on the scene representation $\mathbf{h}_{\text{ctx}}$.
Given that the CMD exhibits substantial spatio-temporal variability due to environmental dynamics and sensor conditions, training a separate NF model for each scene is often inefficient and lacks generalization. Instead, the hypernetwork is optimized across multiple scenarios to learn a shared mapping from scene representations to NF parameters, thereby capturing common structure in CMD. This formulation shifts the learning process from per-scene parameter fitting to amortized parameter generation. As a result, the model can generalize to unseen environments by producing scene-adaptive NF parameters without extensive retraining, enabling effective cross-scenario knowledge transfer and improving both adaptability and efficiency in CMD estimation [30,31,33,34].
As illustrated in Figure 3, all $K$ layers of the NF share the same parameter generation mechanism. For each NF layer $k$, the context embedding $\mathbf{h}_{\text{ctx}} \in \mathbb{R}^{D_h}$ is first transformed into an enhanced representation $\mathbf{h}_e \in \mathbb{R}^{D_h}$ via a two-layer transformation that incorporates a channel reweighting mechanism ($\text{SE}(\cdot)$) to improve representation expressiveness:
$\hat{\mathbf{h}}_{\text{ctx}} = \text{Linear}_{h1}(\mathbf{h}_{\text{ctx}}) \in \mathbb{R}^{D_h}, \quad \mathbf{h}_e = \text{ReLU}\left(\text{Linear}_{h2}(\hat{\mathbf{h}}_{\text{ctx}}) + \text{SE}(\hat{\mathbf{h}}_{\text{ctx}})\right)$
where $D_h$ is set to ensure dimensional consistency with the subsequent parameter generation modules.
The Squeeze-and-Excitation (SE) module [43], originally developed for channel attention in CNNs, is adapted here to recalibrate 1D contextual embeddings by modeling inter-feature dependencies:
$\text{SE}(\hat{\mathbf{h}}_{\text{ctx}}) = \hat{\mathbf{h}}_{\text{ctx}} \odot \sigma\left(W_{se2} \, \text{ReLU}(W_{se1} \hat{\mathbf{h}}_{\text{ctx}})\right)$
where $W_{se1} \in \mathbb{R}^{(D_h/\gamma) \times D_h}$ and $W_{se2} \in \mathbb{R}^{D_h \times (D_h/\gamma)}$ are learnable matrices, $\gamma = 4$ is the reduction ratio, $\odot$ denotes element-wise multiplication, and $\text{ReLU}(\cdot)$ and $\sigma(\cdot)$ denote the rectified linear unit and sigmoid activation functions, respectively.
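A sketch of the SE recalibration for a 1-D embedding, with tiny hand-picked weight matrices ($D_h = 4$, $\gamma = 4$, so the squeeze dimension is 1); all values are illustrative:

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]

D_H, GAMMA = 4, 4
h = [0.5, -1.0, 2.0, 0.1]   # stand-in for the projected embedding

# Squeeze to D_h / gamma = 1 dim, excite back to D_h, gate with sigmoid.
W1 = [[0.2, -0.1, 0.3, 0.4]]                 # (D_h/gamma) x D_h
W2 = [[0.5], [-0.2], [0.1], [0.3]]           # D_h x (D_h/gamma)
gate = [sigmoid(g) for g in matvec(W2, [relu(v) for v in matvec(W1, h)])]

# SE(h) = h ⊙ sigmoid(W2 ReLU(W1 h)): per-feature recalibration in (0, 1).
se_h = [hi * gi for hi, gi in zip(h, gate)]
print(se_h)
```

Because each gate lies in $(0,1)$, the SE branch can only attenuate or preserve features, never amplify them, which is the recalibration behavior described above.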
The enhanced embedding $\mathbf{h}_e$ is subsequently used to generate the weights and biases of the residual MLPs within each NF layer. Specifically, two independent linear projections are applied to produce the corresponding weight and bias vectors, which are then reshaped to match the sublayer dimensions, as detailed in Section 4.1.3. For the $r$-th sublayer in the $k$-th NF layer, the generation process is defined as:
$W_r^k = \text{Reshape}\left(\text{Linear}_W(\mathbf{h}_e)\right), \quad \mathbf{b}_r^k = \text{Reshape}\left(\text{Linear}_b(\mathbf{h}_e)\right)$
where the reshaping operation ensures dimensional compatibility with the sublayer structure. The parameters generated for layer $k$ are collected as
$\omega^k = \left\{\omega_r^k = (W_r^k, \mathbf{b}_r^k)\right\}_{r=1}^{R}$
Applying this procedure across all $R$ sublayers and $K$ flow layers yields the complete parameter set:
$\Omega = \{\omega^k\}_{k=1}^{K}$
which is dynamically instantiated by the hypernetwork conditioned on the context embedding $\mathbf{h}_{\text{ctx}}$.

4.1.3. Parameterized Normalizing Flow Model

After the hypernetwork generates the target NF parameters, these parameters are used to perform a sequence of differentiable and invertible transformations on the input clutter points, enabling flexible density estimation over arbitrarily complex clutter distributions [17,21]. As illustrated in Figure 1, the NF model in HyperNCMD follows the architecture of NF Streaming [11] and comprises $K$ sequential flow blocks $F_k(\cdot)$, $k = 1, \ldots, K$. Each block consists of three components: a Rational Quadratic Spline (RQS) Coupling Layer [25], which performs expressive invertible transformations based on monotonic rational quadratic splines, a Permutation Layer [22,23], and an ActNorm Layer [44].
In the RQS Coupling Layer, a subset of input dimensions is transformed using monotonic rational quadratic splines, yielding a lower-triangular Jacobian that allows efficient computation of the log-determinant. The Permutation Layer reorders input dimensions to ensure that all variables are transformed across different layers. The ActNorm Layer performs affine normalization with data-dependent initialization, which improves stability and accelerates convergence while also removing the need for batch statistics at inference.
In the proposed HyperNCMD framework, the hypernetwork exclusively generates the parameters of the residual MLPs within the RQS Coupling Layers. The Permutation Layer is parameter-free, and the ActNorm parameters are shared globally across scenes and remain independent of the context embedding. Both the Permutation and ActNorm Layers follow standard implementations [11,23,25], whereas the design of the parameterized RQS Coupling Layer is detailed below.
The RQS Coupling Layer implements an invertible, continuously differentiable transformation using monotonic rational quadratic splines [25], as illustrated in Figure 4. Following the coupling scheme in [23], the input vector $v^I \in \mathbb{R}^D$ is split into two parts. The second block, $v_{d+1:D}^I$, is transformed using monotonic rational quadratic splines whose parameters are predicted from the first block $v_{1:d}^I$ via a residual MLP. The transformed second block replaces the original, while $v_{1:d}^I$ remains unchanged; the inverse transformation is obtained by applying the inverse RQS using the same predicted parameters.
Each transformed dimension is defined over a bounded interval $[-B, B]$. Outside this range, the identity mapping is applied to preserve invertibility. The interval is divided into $M$ non-overlapping bins. For each bin $m = 0, \ldots, M-1$, the residual MLP predicts unconstrained parameters, which are subsequently mapped (e.g., via softmax or exponential functions) to valid spline parameters: the bin width $\varphi_w^{(m)}$, bin height $\varphi_h^{(m)}$, and derivative at the bin boundary $\varphi_d^{(m)}$. To ensure smooth boundary behavior, the derivatives at the endpoints are fixed: $\varphi_d^{(0)} = \varphi_d^{(M)} = 1$. Let $\{(v^{I,(m)}, v^{O,(m)})\}_{m=0}^{M}$ denote the spline knot points. For $v_i \in [v_i^{I,(m)}, v_i^{I,(m+1)}]$, the transformation is [25]:
$g(v_i) = v^{O,(m)} + \dfrac{\varphi_h^{(m)} \left[ s^{(m)} \xi^2 + \varphi_d^{(m)} \xi (1 - \xi) \right]}{s^{(m)} + \left[ \varphi_d^{(m+1)} + \varphi_d^{(m)} - 2 s^{(m)} \right] \xi (1 - \xi)}$
where $\xi = (v_i - v_i^{I,(m)})/\varphi_w^{(m)} \in [0, 1]$ and $s^{(m)} = \varphi_h^{(m)}/\varphi_w^{(m)}$ is the secant slope. This formulation ensures monotonicity, smoothness, and invertibility, while allowing closed-form computation of the Jacobian determinant for likelihood estimation.
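The transformation above can be sketched as follows. This is a minimal scalar implementation of the monotonic RQS of [25], using softmax-normalized widths/heights and softplus-constrained derivatives as one plausible choice of the unconstrained-to-valid parameter mapping; it is a sketch, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def rqs_forward(v, w_raw, h_raw, d_raw, B=2.0):
    """Monotonic rational quadratic spline on [-B, B] with identity outside.
    w_raw, h_raw: (M,) unconstrained bin widths/heights; d_raw: (M-1,)
    unconstrained interior derivatives (endpoint derivatives fixed to 1)."""
    M = w_raw.shape[0]
    widths = F.softmax(w_raw, dim=0) * 2 * B      # bin widths, sum to 2B
    heights = F.softmax(h_raw, dim=0) * 2 * B     # bin heights, sum to 2B
    derivs = F.softplus(d_raw)                    # positive interior derivatives
    derivs = torch.cat([torch.ones(1), derivs, torch.ones(1)])  # phi_d(0)=phi_d(M)=1
    xk = torch.cat([torch.tensor([-B]), -B + torch.cumsum(widths, 0)])   # input knots
    yk = torch.cat([torch.tensor([-B]), -B + torch.cumsum(heights, 0)])  # output knots

    inside = (v > -B) & (v < B)
    out = v.clone()                               # identity outside [-B, B]
    m = torch.clamp(torch.searchsorted(xk, v[inside]) - 1, 0, M - 1)
    xi = (v[inside] - xk[m]) / widths[m]          # xi in [0, 1]
    s = heights[m] / widths[m]                    # secant slope s^{(m)}
    num = heights[m] * (s * xi**2 + derivs[m] * xi * (1 - xi))
    den = s + (derivs[m + 1] + derivs[m] - 2 * s) * xi * (1 - xi)
    out[inside] = yk[m] + num / den
    return out
```

Because every bin has positive width, height, and boundary derivatives, the resulting map is strictly increasing, so the inverse can be obtained in closed form per bin.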
The spline parameters $\{\varphi_i = (\varphi_w^i, \varphi_h^i, \varphi_d^i)\}_{i=d+1}^{D}$ are predicted by a residual MLP with $L_r$ residual blocks, conditioned on $v_{1:d}^I$. The MLP parameters $\omega^k$ are dynamically generated by the hypernetwork from the scene-level context embedding $h_{\text{ctx}}$, as detailed in Section 4.1.2. By conditioning on the context embedding, the hypernetwork produces adaptive NF parameters, enabling the model to generalize across diverse clutter distributions and capture spatio-temporal variations without scene-specific optimization.

4.2. Training

After introducing the proposed HyperNCMD network, this subsection begins with the description of the loss function, followed by two data augmentation techniques used in training, and concludes with the overall training procedure.

4.2.1. Loss Function

NFs are trained by minimizing the NLL in an unsupervised manner, enabling exact density estimation [11,17]. In contrast, our goal is to construct a scene-adaptive CMD estimator that generalizes across spatio-temporal variations in clutter. To this end, we consider a time-varying clutter distribution p Z k ( z ) , which is unknown.
At each discrete time step k, we define two sets of inputs (see Figure 1):
  • Context set $Z_{\text{ctx}}^k = \{Z^{k-L+1}, \ldots, Z^k\}$, which provides past observations for generating a scene-level context embedding.
  • Query set $Z_q^k = \{z_{q,j}^k\}_{j=1}^{M_q} \sim p_Z^k$, used to evaluate the NLL under the NF model.
Given $S$ scene instances $\{(Z_{\text{ctx}}^{k,(s)}, Z_q^{k,(s)})\}_{s=1}^{S}$, each context set is processed by the scene encoder to obtain a high-level embedding:
$h_{\text{ctx}}^{(s)} = \text{SceneEncoder}(Z_{\text{ctx}}^{k,(s)})$
which is then passed to the hypernetwork to generate the parameters of a scene-specific NF model:
Ω ( s ) = HyperNet ( h ctx ( s ) )
Using the change-of-variables formula for NFs [21], the NLL of the query samples is
$\mathcal{L} = -\dfrac{1}{S M_q} \sum_{s=1}^{S} \sum_{j=1}^{M_q} \log p_Z\!\left(z_{q,j}^{k,(s)}; \Omega^{(s)}\right) = -\dfrac{1}{S M_q} \sum_{s=1}^{S} \sum_{j=1}^{M_q} \left[ \log p_V\!\left(v_{q,j}^{(s)}\right) + \log \left| \det J_{F^{-1}}\!\left(z_{q,j}^{k,(s)}; \Omega^{(s)}\right) \right| \right]$
where $v_{q,j}^{(s)} = F^{-1}(z_{q,j}^{k,(s)}; \Omega^{(s)})$ is the corresponding latent variable and $p_V(\cdot)$ is a standard normal prior. The first term aligns the latent variable with the prior, while the second accounts for volume change under the NF transformation.
During simulation, both the query set Z q k and context set Z ctx k can be freely sampled from a predefined distribution. In real-world radar operation, the context set is constructed from scans of the preceding L frames, while the query set contains samples from the most recent frame(s), depending on the temporal characteristics of each scene.
By jointly training the scene encoder and hypernetwork, the model learns to adapt NF parameters to scene-specific spatio-temporal context. This enables accurate CMD estimation in dynamic environments, while retaining the expressiveness and tractability of NFs.
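Under the change-of-variables formula, the average NLL over a batch of query points can be sketched as below; `flow_inverse` is a hypothetical callable standing in for the scene-conditioned inverse flow $F^{-1}(\cdot; \Omega^{(s)})$, assumed to return both the latent variables and the log-determinant:

```python
import math
import torch

def nll_loss(z_query, flow_inverse):
    """Average NLL of query points under a flow with a standard normal prior.
    `flow_inverse` (assumed interface) returns (v, log_det) with
    v = F^{-1}(z) of shape (N, D) and log_det = log|det J_{F^{-1}}(z)| of
    shape (N,). Mirrors the loss L above for a single scene instance."""
    v, log_det = flow_inverse(z_query)
    # log-density of v under a D-dimensional standard normal prior
    log_pv = -0.5 * (v ** 2 + math.log(2 * math.pi)).sum(dim=1)
    return -(log_pv + log_det).mean()
```

Averaging this quantity over the $S$ scene instances in a batch recovers the full training objective.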

4.2.2. Data Augmentation

To enhance adaptability across dynamic and diverse clutter patterns, we employ two data augmentation strategies during training: (i) Coordinate Flip Augmentation, which perturbs the spatial layout of radar measurements, and (ii) Random Context Length Truncation, which simulates variability in temporal context. These augmentations increase the diversity of training samples and improve model robustness. The individual contributions of these strategies are evaluated in the ablation studies in Section 5.2.3.
(1) Coordinate Flip Augmentation. To improve robustness to spatial variations in clutter distributions, all points $z = (x, y)$ in both the context set $Z_{\text{ctx}}^k$ and query set $Z_q^k$ are independently flipped along the horizontal or vertical axis with probability 0.5. Specifically, assuming normalized coordinates $x, y \in [0, 1]$:
$(x, y) \mapsto (1 - x, y) \quad \text{(horizontal flip)}, \qquad (x, y) \mapsto (x, 1 - y) \quad \text{(vertical flip)}$
This augmentation enlarges the effective support of the training distribution, thereby enhancing generalization to unseen spatial layouts.
(2) Random Context Length Truncation. To increase robustness under varying temporal conditions, a context length $L' \le L$ is randomly sampled at each training iteration, and only the most recent $L'$ frames are used to construct the context set. By simulating variability in the available temporal context, this augmentation promotes robustness to both short- and long-term dependencies, enabling the model to adapt to dynamic temporal environments.
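The two augmentations can be sketched jointly as below; we assume the same flip is applied to the context and query sets so that both remain samples of the same (flipped) density, which matches the stated purpose of the augmentation:

```python
import torch

def augment(context, query, L_max):
    """Sketch of the two training augmentations: axis flips of normalized
    (x, y) coordinates and random context-length truncation.
    `context` is a list of per-frame (N_i, 2) tensors; `query` is (N, 2)."""
    flip_x = torch.rand(()) < 0.5        # horizontal flip with prob 0.5
    flip_y = torch.rand(()) < 0.5        # vertical flip with prob 0.5

    def flip(pts):
        pts = pts.clone()
        if flip_x:
            pts[:, 0] = 1.0 - pts[:, 0]  # (x, y) -> (1 - x, y)
        if flip_y:
            pts[:, 1] = 1.0 - pts[:, 1]  # (x, y) -> (x, 1 - y)
        return pts

    # sample L' in {1, ..., L_max} and keep only the most recent L' frames
    L_prime = int(torch.randint(1, L_max + 1, ()).item())
    context = [flip(f) for f in context[-L_prime:]]
    return context, flip(query)
```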

4.2.3. Training Procedure

As summarized in Algorithm A1, the overall training procedure consists of three main steps: (i) encoding the context set to obtain a scene-level embedding and generating NF parameters via the hypernetwork, (ii) applying the inverse NF transformations to the query samples and computing the log-likelihood, and (iii) updating all trainable parameters via gradient descent. Data augmentation is applied to both context and query sets during training to enhance robustness, as described in Section 4.2.2.
The training of HyperNCMD is fully unsupervised and end-to-end, relying solely on point samples drawn from clutter distributions across different spatial locations and time steps. The trainable parameters of HyperNCMD consist of the scene feature extractor θ enc , the hypernetwork θ hyp , and the ActNorm parameters θ nf act of the NF model. All of these parameters are updated via gradient-based optimization. In contrast, the parameters of the RQS residual MLPs, Ω ( s ) , are dynamically generated by the hypernetwork conditioned on the scene-specific context embedding h ctx ( s ) , and thus do not have independent trainable weights.

4.3. Inference

After training, HyperNCMD can be directly applied to previously unseen radar scenes. However, spatio-temporal variations in clutter may lead to distributional shifts. To address this, we employ a lightweight test-time adaptation strategy based on FiLM [20], which allows efficient fine-tuning of scene embeddings while keeping the pretrained backbone fixed.

4.3.1. Fine-Tuning with FiLM

The pretrained HyperNCMD model already provides a scene-adaptive CMD estimator. To further improve performance on a previously unseen test scene, we apply FiLM-style affine modulation to the scene embedding. Specifically, given a scene embedding h ctx R D h , the modulated embedding is computed as
$h_{\text{ctx}}' = \gamma \odot h_{\text{ctx}} + \beta$
where $\gamma, \beta \in \mathbb{R}^{D_h}$ are the only parameters updated during adaptation and $\odot$ denotes element-wise multiplication. This approach preserves all pretrained model parameters while enabling the scene embedding to adjust to new clutter distributions. The modulated embedding $h_{\text{ctx}}'$ is then fed into the hypernetwork to generate the NF parameters for the test scene.
This strategy achieves a favorable balance between adaptability and computational efficiency, enabling rapid fine-tuning in resource-constrained environments while preserving robustness, as demonstrated in Section 5.2.3.
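A minimal test-time adaptation loop might look as follows; `nll_fn` is a hypothetical closure that routes the modulated embedding through the frozen hypernetwork and flow and returns the query-set NLL:

```python
import torch

def film_finetune(h_ctx_raw, nll_fn, steps=3, lr=1e-3):
    """FiLM-style test-time adaptation sketch: only gamma and beta are
    optimized; the backbone behind `nll_fn` (assumed interface) stays frozen."""
    d = h_ctx_raw.shape[-1]
    gamma = torch.ones(d, requires_grad=True)   # initialized to identity modulation
    beta = torch.zeros(d, requires_grad=True)
    opt = torch.optim.Adam([gamma, beta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nll_fn(gamma * h_ctx_raw + beta)  # h'_ctx = gamma (elementwise) h_ctx + beta
        loss.backward()
        opt.step()
    return gamma * h_ctx_raw + beta
```

Initializing $\gamma = \mathbf{1}$ and $\beta = \mathbf{0}$ means adaptation starts exactly from the pretrained behavior, so a few gradient steps can only refine, not discard, the prior.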

4.3.2. Inference Pipeline

The complete test-time inference and adaptation procedure is summarized in Algorithm A2. Similar to the training phase, inference incorporates a FiLM-based modulation step applied to the scene embedding. Specifically, the pretrained model first encodes the context set to obtain the raw embedding h ctx raw , which is then modulated through learnable affine transformations at test time. This FiLM operation allows efficient adaptation to previously unseen environments by updating only the modulation parameters γ and β , while keeping all backbone parameters fixed.
CMD estimation with HyperNCMD can be integrated into MTT frameworks such as JIPDA-MAP [7]. Accurate clutter modeling serves as a prior to improve tracking performance. In particular, Algorithm A2 can be incorporated into the asynchronous update framework proposed in NF Streaming [11], enabling real-time, context-aware CMD estimation alongside tracking.

5. Simulation Experiments

This section presents simulation experiments to evaluate the proposed HyperNCMD framework. We conduct comparative studies against existing methods and perform ablation analyses to assess the impact of individual components. The experiments include CMD estimation and MTT evaluation under varying clutter conditions, thereby examining both overall performance and module-level contributions.

5.1. Experimental Setup

5.1.1. Datasets

Simulations are conducted in a two-dimensional surveillance region $S = [-5000, 5000] \times [-5000, 5000]$ m$^2$, where clutter is modeled as an NHPP. At each scan $k$, the number of clutter points $N_c$ follows a Poisson distribution with rate parameter $\lambda_c$, which is sampled once from a uniform prior and kept constant across scans:
$P(N_c = c_k \mid \lambda_c) = \dfrac{e^{-\lambda_c} \lambda_c^{c_k}}{c_k!}, \qquad \lambda_c \sim U(\lambda_{\min}, \lambda_{\max})$
where c k denotes the observed number of clutter points at scan k.
The spatial distribution of clutter is modeled as a mixture function:
$F_{c,k}(z) = \pi_c^1 U_b(z) + \sum_{i=2}^{N_s} \pi_c^i \left[ B_i \, \mathcal{N}(z; \mu_{c,k}^i, \Sigma_{c,k}^i) + (1 - B_i) \, U(z; a_{c,k}^i, b_{c,k}^i) \right]$
where $F_{c,k}(z)$ denotes the true CMD at scan $k$. The first component $U_b(\cdot)$ represents uniform background clutter, while the remaining $N_s - 1$ components correspond to local clutter sources. Each source is randomly defined as Gaussian $\mathcal{N}(\cdot)$ or uniform $U(\cdot)$ according to $B_i \sim \text{Bernoulli}(0.5)$, and the mixture weights satisfy $\sum_{i=1}^{N_s} \pi_c^i = 1$. All distributional parameters are drawn from predefined ranges following the protocol in [11], ensuring diverse and physically plausible clutter scenarios for evaluation.
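A simplified sampler for one scan of this clutter model is sketched below; the rate range, source count, and source extents are illustrative placeholders rather than the paper's exact protocol:

```python
import numpy as np

def sample_clutter_scan(rng, lam_range=(20.0, 60.0), n_sources=3,
                        region=(-5000.0, 5000.0)):
    """One scan of the NHPP clutter model: Poisson count, then points from a
    mixture of uniform background plus Gaussian/uniform local sources.
    Parameter ranges are placeholders, not the paper's protocol."""
    lo, hi = region
    lam = rng.uniform(*lam_range)                 # lambda_c ~ U(lam_min, lam_max)
    n = rng.poisson(lam)                          # N_c | lambda_c ~ Poisson(lambda_c)
    weights = rng.dirichlet(np.ones(n_sources + 1))  # mixture weights pi_c, sum to 1
    comp = rng.choice(n_sources + 1, size=n, p=weights)
    pts = rng.uniform(lo, hi, size=(n, 2))        # component 0: uniform background
    for i in range(1, n_sources + 1):
        idx = comp == i
        if rng.random() < 0.5:                    # B_i ~ Bernoulli(0.5): Gaussian source
            mu = rng.uniform(lo, hi, size=2)
            sigma = rng.uniform(100.0, 500.0)     # isotropic spread (placeholder range)
            pts[idx] = rng.normal(mu, sigma, size=(int(idx.sum()), 2))
        else:                                     # uniform local source on a square patch
            a = rng.uniform(lo, hi - 1000.0, size=2)
            pts[idx] = rng.uniform(a, a + 1000.0, size=(int(idx.sum()), 2))
    return pts
```

For dynamic scenes, the means $\mu_{c,k}^i$ and uniform bounds would additionally be advanced by a constant-velocity update between scans.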
To evaluate performance under different clutter dynamics, two scene types are considered: (i) static clutter scenes, in which mixture components remain unchanged across scans; and (ii) dynamic clutter scenes, where both Gaussian and uniform clutter components evolve independently over time. In dynamic scenes, Gaussian component centers follow a constant-velocity motion model, with speeds independently sampled within [ 0 , 10 ]  m/s for each component, ensuring smooth spatial evolution across scans. Meanwhile, the bounds of uniform clutter regions are updated using the same velocity-driven mechanism, allowing the support region to shift consistently over time.
A total of 100,000 static and 300,000 dynamic scenes are generated for training. Each scene spans 30 s with a 1 s scan interval (30 scans in total) and includes a query set at the final scan containing 2048 clutter points, used for likelihood-based learning as described in Section 4.2.1.
Test datasets are constructed following [11], comprising four distinct sets summarized in Table 1. For static scenes, CMD estimation is evaluated at the final scan, whereas for dynamic scenes, evaluations are performed every 20 s starting from 40 s. To assess tracking performance, the radar measurements are augmented with multiple constant-velocity targets, each detected with a probability of P D = 0.9 , and the corresponding ground-truth trajectories are provided for quantitative evaluation [11].
As shown in Figure 5, a representative MTT scenario and the corresponding CMD are depicted. Regions with high clutter density may lead to false track initiation or disrupt established target trajectories. Thus, accurate estimation of the CMD is crucial to ensure reliable MTT performance.

5.1.2. Evaluation Metrics

To assess the accuracy of CMD estimation, the 2D surveillance area is discretized into grid cells, forming a density map that can be treated as an image. Following the definitions in [11], the estimated maps are quantitatively evaluated using several standard image similarity metrics, including the Root Mean Squared Error (RMSE) [1], Mean Absolute Error (MAE), Peak Signal-to-Noise Ratio (PSNR) [45], Structural Similarity Index (SSIM) [46], Normalized Cross Correlation (NCC) [47], and Kullback-Leibler Divergence (KLD) [48].
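As one example of these map-level metrics, the KLD between two discretized density maps can be computed as below; the small epsilon for numerical stability is an implementation detail we assume:

```python
import numpy as np

def map_kld(p_map, q_map, eps=1e-12):
    """KLD between two discretized CMD maps (reference p, estimate q), both
    renormalized to sum to 1 over the grid. A sketch of one map-level metric."""
    p = p_map.ravel() + eps
    p /= p.sum()
    q = q_map.ravel() + eps
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))
```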
For the MTT evaluation, we adopt the CLEAR MOT protocol [11,49], which defines standard tracking metrics such as False Positives (FP), False Negatives (FN), Identity Switches (IDS), Fragmentations (Frag), False Track Rate (FTR), Precision (P), Recall (R), Multi-Object Tracking Accuracy (MOTA), and Multi-Object Tracking Precision (MOTP). Among these, MOTA and MOTP are generally regarded as the primary indicators of overall tracking performance, representing accuracy and precision, respectively. The detailed computation of these metrics follows the definitions provided in [11].

5.1.3. Baselines

We compare HyperNCMD with a set of representative CMD estimation approaches, including HistClassic, HistSpatial, and HistTemporal [1], as well as SCMDE [50,51], FMM [8], KDE [9], awKDE [10], and NF Streaming [11]. All baseline implementations and hyperparameter settings strictly follow the configurations in [11], ensuring fair and consistent comparison across methods.

5.1.4. Implementation Details

For HyperNCMD, the RFF module is configured with $B_F = 512$ features and kernel bandwidth $\sigma_F = 5.0$; the ISAB-LSTM encoder uses a hidden dimension of $D_h = 512$, $I = 8$ inducing points, and an MAB with $h = 4$ attention heads; the hypernetwork employs a hidden dimension of $D_h = 512$; and the parameterized NF model consists of $K = 4$ flow layers, each containing $L_r = 2$ residual MLP blocks with hidden dimension 64, with the RQS transformation parameterized by a bound of $B = 2$ and $M = 10$ bins.
For the baselines, following [11], the data buffer window size was set to L = 10 . For HyperNCMD, the context window size was set to 30, while the query window size was set to 10. Under the above configuration, the proposed HyperNCMD contains approximately 29.6 M parameters. Given a typical input size with up to 100 measurements per frame and 2048 query points, the overall computational cost is around 6.5 GFLOPs (FP32) per frame. Furthermore, the model can be adapted for embedded deployment, and its efficiency can be further improved through reduced-precision computation and hardware-aware acceleration techniques, enabling real-time inference in practical scenarios.
The model was trained for 50 epochs with a batch size of 32 using the Adam optimizer [52] (default parameters: $\beta_1 = 0.9$, $\beta_2 = 0.999$) and a weight decay of $1 \times 10^{-4}$. Gradients were clipped to a maximum norm of 0.25 to improve training stability. The learning rate was linearly warmed up from $1 \times 10^{-6}$ to a peak value of $5 \times 10^{-4}$ over the first 5 epochs, and then decayed according to a cosine schedule [53]. For test-time fine-tuning, the model was updated for 3 steps with a fixed learning rate of $1 \times 10^{-3}$.
All experiments were conducted on a 64-bit Windows 10 computer equipped with 64 GB of RAM, an Intel Core i5-12600K CPU running at 3.69 GHz, and an NVIDIA RTX A4000 GPU. The implementation was based on the PyTorch deep learning framework (version 1.12.1) [54].

5.2. Results and Analysis

5.2.1. CMD Evaluation Results

(1) Static Clutter Scenarios. The comparative performance of different CMD estimation methods under static clutter conditions is presented in Table 2. Several representative CMD estimation approaches are included for comparison, covering classical, adaptive, and learning-based paradigms to ensure a comprehensive evaluation. In addition, we report the performance of HyperNCMD with and without the FiLM-based fine-tuning module to assess its contribution to estimation accuracy.
As shown in Table 2, HyperNCMD achieves the best overall performance across all quantitative metrics under static clutter conditions. The FiLM-augmented variant of HyperNCMD achieves the lowest RMSE ($6.12 \times 10^{-7}$) and competitive MAE ($1.16 \times 10^{-7}$), representing a 6.7% reduction in RMSE compared with NF Streaming and a 20.1% gain over KDE. In addition, it attains the highest PSNR (33.7 dB), SSIM (0.96), and NCC (0.89), while maintaining the lowest KLD (0.11), indicating accurate modeling of both global and local clutter characteristics. While FiLM adaptation enhances estimation accuracy, it introduces a moderate increase in computational overhead, raising inference time from 68.1 ms to 227.3 ms per scan. Nevertheless, the latency remains well within real-time constraints (i.e., below 1 s), offering a controllable trade-off between accuracy and efficiency.
Overall, HyperNCMD exhibits superior fidelity, stability, and computational efficiency compared with existing CMD estimation techniques. By jointly leveraging scene-conditioned hypernetworks and lightweight test-time adaptation, it eliminates the need for scene-specific retraining while maintaining low variance across diverse clutter scenarios. These characteristics underline its robustness and practical suitability for operational radar environments.
(2) Dynamic Clutter Scenarios. Dynamic clutter introduces considerable challenges because scene statistics vary rapidly over time, rendering fixed-length temporal modeling strategies suboptimal. For example, NF Streaming [11] tends to degrade with longer buffer lengths due to its implicit assumption of temporal stationarity.
As summarized in Table 3, HyperNCMD achieves the best overall performance across all quantitative metrics under dynamic clutter conditions, with an average RMSE reduction of 8.2% and a 7.9% improvement in SSIM compared with NF Streaming, consistently outperforming both classical and learning-based baselines. As shown in Figure 6, the model's performance steadily improves with increasing context length ($L \in \{10, 20, 30\}$) in both static and dynamic scenarios. In addition, FiLM-based fine-tuning further enhances estimation accuracy across all settings, providing stable and consistent gains. This superior performance is attributed to HyperNCMD's ISAB-LSTM spatio-temporal module with temporal attention, which adaptively aggregates historical frames and assigns higher weights to informative temporal features, effectively capturing the evolution of dynamic clutter.
Figure 7 shows a representative dynamic scenario with CMD estimates from different methods at three time points ($t \in \{50, 75, 100\}$ s), with the corresponding PSNR and NCC values reported in each subfigure. HistSpatial shows noticeable spatial discontinuities, while awKDE and NF Streaming struggle to accurately represent low-intensity uniform clutter regions under limited samples. In contrast, HyperNCMD, trained offline across multiple scenarios, produces high-fidelity CMD estimates, achieving a PSNR of up to 39 dB and consistently higher NCC values, outperforming the compared methods.
Taken together, these results indicate that HyperNCMD provides robust and accurate CMD estimation under dynamic clutter conditions, exhibiting reliable performance and practical applicability in diverse radar environments.

5.2.2. MTT Evaluation Results

As discussed in Section 2.1, accurate CMD modeling plays a critical role in determining the performance of MTT, especially under dense and non-uniform clutter conditions where the traditional uniform clutter assumption becomes invalid. To assess the impact of CMD estimation on tracking accuracy, we conduct experiments within the JIPDA-MAP framework [7] under both static and dynamic backgrounds, using various CMD estimation methods. The results are summarized in Table 4 and Table 5.
Under static conditions, HyperNCMD with FiLM achieves near-ground-truth performance, with MOTA = 95.39%, closely matching the ideal baseline using exact CMD (95.47%). It also records the lowest false track rate (FTR = 0.049) and the fewest false positives (FP = 2015), indicating that HyperNCMD effectively suppresses false track initiation. Moreover, it maintains high detection precision (98.70%) and recall (96.72%), demonstrating excellent overall tracking reliability. Consistent with prior studies [9,11], MOTP remains nearly constant across methods, confirming that CMD estimation primarily influences data association accuracy rather than localization precision.
In more challenging dynamic backgrounds characterized by time-varying clutter statistics, HyperNCMD consistently outperforms all baseline approaches, achieving MOTA = 95.20%, FTR = 0.052, FP = 2133, and FN = 5391. These results confirm its robustness and adaptability to dynamic clutter conditions. In contrast, NF Streaming is sensitive to buffer length selection and tends to degrade under rapidly evolving clutter distributions. The superior performance of HyperNCMD stems from its temporal modeling capability, which enables accurate representation of dynamic clutter variations without requiring scene-specific retraining.
Overall, by incorporating context-conditioned modeling and FiLM-based test-time modulation, HyperNCMD significantly enhances CMD estimation accuracy and improves downstream MTT performance, all while maintaining computational efficiency. These results demonstrate its practical suitability for real-time MTT in complex and dynamic radar environments.

5.2.3. Ablation Studies and Visualization

This section presents ablation studies on key model components, hyperparameter settings, and temporal attention to quantify their individual contributions and inform the final design choices. Unless otherwise specified, all experiments are conducted using 10,000 static and 10,000 dynamic clutter scenarios for training, and 1000 static and 1000 dynamic scenarios for validation. Unlike the test set, the validation data spans a duration of 30 s.
(1) Ablation Study on Model Components. To quantify the contribution of individual architectural components and design choices, a series of ablation studies were conducted on HyperNCMD, with results averaged over $L \in \{10, 20, 30\}$ and summarized in Table 6.
The baseline configuration employs an ISAB module without explicit temporal modeling, serving as a reference for subsequent enhancements. Adding the RFFs module, which encodes high-frequency positional information, significantly improves spatial discrimination, resulting in a 2.66 dB PSNR gain. Introducing temporal modeling via the LSTM cell and temporal attention pooling captures inter-frame dependencies more effectively, yielding an additional 0.16 dB improvement under dynamic clutter conditions. The inclusion of the SE block further enhances representational adaptivity through channel-wise recalibration, contributing another 0.20 dB gain. Among data augmentation strategies, coordinate flipping provides the largest boost (+1.09 dB) by enriching spatial diversity, while random context-length sampling offers a smaller improvement (+0.11 dB). Together, these components cumulatively improve PSNR from 27.00 dB to 31.22 dB, highlighting the effectiveness of each enhancement in strengthening CMD estimation performance.
Collectively, these components provide complementary improvements in both estimation accuracy and stability, achieving a cumulative PSNR gain of 4.22 dB over the baseline, accompanied by consistent NCC increases and KLD reductions. Notably, RFFs and coordinate-based augmentation emerge as the dominant contributors, underscoring their critical role in enhancing the spatial fidelity and adaptability of clutter modeling across diverse environments.
(2) Ablation Study on RFF Parameters. We analyze the influence of two key RFF parameters—the feature dimension B F and the Gaussian kernel bandwidth σ F —on model performance, as illustrated in Figure 8.
As B F increases from 64 to 512, PSNR shows a consistent improvement, while NCC first increases and then exhibits a slight decline at higher values. This suggests that higher-dimensional RFF mappings enhance the representational capacity, although the correlation metric does not monotonically improve. However, when B F exceeds 512, the performance begins to degrade slightly, indicating that excessively large feature dimensions may introduce redundancy and lead to overfitting. Therefore, B F = 512 is selected as a trade-off between estimation accuracy and computational cost.
Likewise, increasing $\sigma_F$ from 1.0 to 5.0 leads to steady performance improvement, while further enlargement leaves performance nearly unchanged. Consequently, $\sigma_F = 5.0$ is used as the default hyperparameter setting. This corresponds to a dominant spatial wavelength of roughly $\lambda \approx (5000 - (-5000))/(2\sigma_F) = 1000$ m, covering approximately 95% of sampled RFF frequencies and closely matching the dominant spatial scales of the clutter. From a frequency perspective, $\sigma_F$ directly controls the bandwidth of the RFF mapping, which determines the spatial scale of the resulting representation. A small $\sigma_F$ leads to over-smoothing due to loss of fine structures, whereas a large $\sigma_F$ introduces redundant high-frequency components, resulting in performance saturation rather than further improvement.
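A minimal RFF mapping consistent with this discussion is sketched below; we assume the paper's convention that a larger $\sigma_F$ admits higher spatial frequencies (i.e., $\sigma_F$ scales the sampled frequency matrix) and that input coordinates are pre-normalized:

```python
import numpy as np

def rff_embed(z, B_F=512, sigma_F=5.0, seed=0):
    """Random Fourier Features of 2D measurements. Frequencies are drawn as
    W ~ N(0, sigma_F^2), so larger sigma_F passes higher spatial frequencies
    (the bandwidth convention assumed here). z: (N, 2) normalized coordinates."""
    rng = np.random.default_rng(seed)        # fixed frequencies, shared across inputs
    W = rng.normal(0.0, sigma_F, size=(2, B_F))
    proj = z @ W                             # (N, B_F) random projections
    # cos/sin concatenation gives unit-norm features whose inner product
    # approximates a shift-invariant kernel between the two inputs
    return np.concatenate([np.cos(proj), np.sin(proj)], axis=1) / np.sqrt(B_F)
```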
(3) Ablation Study on Hidden Dimension and Flow Layer Depth. We further investigate the effect of network capacity, controlled by the hidden dimension D h and the flow layer depth K, as illustrated in Figure 9.
As D h increases from 128 to 512, both PSNR and NCC steadily improve, reflecting the enhanced ability of the latent space to represent complex clutter structures. When D h exceeds 512, however, the marginal gain diminishes and a slight decline in PSNR is observed, suggesting that excessive hidden width may lead to overparameterization and degraded generalization.
Regarding flow layer depth, deeper transformations ($K = 2$ to $4$) yield clear benefits by improving distributional expressiveness. Nevertheless, further stacking ($K > 4$) introduces additional computational burden with limited accuracy gain. Based on these observations, $D_h = 512$ and $K = 4$ are adopted as the default model configuration, striking a balance between performance and efficiency while maintaining robust clutter modeling.
(4) Effect of ISAB-LSTM and Attention Pooling on Temporal Modeling. To assess the impact of temporal modeling on CMD estimation, we perform an ablation study with varying context lengths $L \in \{10, 20, 30\}$, as summarized in Table 7. We compare four variants:
(i) ISAB + Temporal Mean Pooling, which aggregates contextual frames via simple averaging without temporal modeling;
(ii) ISAB-LSTM + Temporal Mean Pooling, which incorporates an LSTM to capture sequential dependencies;
(iii) ISAB-LSTM + Temporal Attention Pooling, which further introduces a temporal attention mechanism to adaptively emphasize informative frames; and
(iv) ISAB + LSTM + Temporal Attention Pooling, which directly aggregates temporal information at the spatial point level using an LSTM.
For short contexts ( L = 10 ), all variants yield comparable results due to the relatively stable clutter statistics. However, as the context length increases, models equipped with temporal modeling exhibit clear advantages. In particular, the ISAB-LSTM with attention pooling consistently achieves the highest PSNR and NCC values and the lowest KLD, demonstrating superior robustness to long-term temporal variations.
Notably, the variant that directly applies an LSTM to the spatial point features after ISAB performs significantly worse than the proposed ISAB-LSTM. The ISAB-LSTM aggregates temporal information at the inducing point level, enabling more effective modeling of clutter dynamics over time.
Taken together, these results confirm that integrating LSTM-based sequence modeling with temporal attention pooling substantially enhances the model’s capacity to represent dynamic clutter evolution, underscoring the importance of adaptive temporal aggregation in realistic dynamic sensing environments.
(5) Comparison with the Conditional Encoding Method. In traditional conditional encoding methods, contextual features are concatenated with the input and processed by a shared-parameter network to predict the flow transformation parameters [17,21]. This input-level conditioning affects the model output only indirectly and often struggles to adapt to diverse conditions. As shown in Figure 10, the conditional encoding method exhibits slow convergence and fails to effectively reduce the training loss, whereas HyperNCMD, which generates context-specific flow parameters via a hypernetwork (parameter-level conditioning), demonstrates stable loss reduction and achieves a significantly lower final loss. This confirms that HyperNCMD better captures complex conditional distributions, validating our design choice.
(6) Visualization of Embedding Space. To further investigate the learned scene representations, the context embeddings h ctx of 2000 test scenes are visualized in Figure 11 using t-SNE [55]. The embeddings are well-dispersed across the 2D latent space, suggesting that the encoder effectively captures diverse clutter characteristics across different scenarios without mode collapse. Furthermore, embeddings corresponding to scenes with similar clutter statistics tend to cluster together. For instance, Scene A and Scene B, which exhibit comparable spatial clutter patterns, are located close to each other, whereas Scene C—characterized by distinct clutter composition—lies farther apart. These findings confirm that the proposed encoder learns semantically meaningful and discriminative representations of scene-dependent clutter, thereby enhancing the adaptability of the HyperNCMD model.

6. Real-World Experiments

In this section, we evaluate HyperNCMD on real-world radar measurements, further validating its ability to accurately estimate CMD beyond synthetic simulations and maintain robustness under diverse sensing conditions.

6.1. Experimental Setup

6.1.1. Data Collection

Experiments are conducted using a C-band phased-array radar system that performs a full-scene scan every 5 s [11,56]. The dataset comprises approximately 165 h of clutter measurements collected across multiple locations and time periods, covering a wide range of real-world clutter scenarios.
During training, each context sequence consists of a 3-min segment, with the final 1 min designated as the query set. For testing, the model is adapted using a 3-min context segment, and the subsequent five frames are used for evaluation. This setup ensures strict temporal independence between training and testing, preventing overlap and enabling realistic assessment of CMD estimation performance.

6.1.2. Evaluation Metrics

Since ground-truth clutter distributions are unavailable, all models are evaluated using the NLL computed on the test samples [23,25]:
$\text{Avg.NLL} = -\dfrac{1}{N} \sum_{j=1}^{N} \log p(z_j)$
where p(z_j) denotes the model-estimated density of the test sample z_j, and a lower Avg. NLL indicates better agreement with the observed clutter. The baseline models, awKDE [10] and NF Streaming [11], are evaluated on the same test samples under an identical protocol, enabling a consistent comparison of CMD estimation accuracy in real radar clutter scenarios.
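As a concrete check, the metric is a one-liner over the model's density evaluations; the p values below are hypothetical stand-ins for the model outputs p(z_j):

```python
import numpy as np

# Hypothetical model-estimated densities p(z_j) at N = 5 test samples.
p = np.array([0.8, 1.2, 2.5, 0.6, 1.9])

# Avg. NLL = -(1/N) * sum_j log p(z_j); lower is better. Negative values
# simply mean the average log-density is positive (densities above 1).
avg_nll = -np.mean(np.log(p))
print(round(avg_nll, 3))   # → -0.201
```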

6.1.3. Implementation Details

We employ the same model architecture as used in the simulation experiments (see Section 5.1.4). To evaluate model performance, we first pretrain the models on simulation data and then further train them using real-world radar measurements. At test time, we perform fine-tuning of the FiLM-based scene embeddings as well as the hypernetwork parameters, enabling the model to rapidly adapt and converge in previously unseen environments. Additionally, we apply the data augmentation strategies described in Section 4.2.2 to expand the effective size of the dataset and improve the models’ generalization capability.

6.2. Results and Analysis

Table 8 summarizes the performance of HyperNCMD on real-world radar clutter data under different training strategies, in comparison with representative baselines. Three key observations can be made.
First, directly applying the simulation-pretrained HyperNCMD model to real-world data yields an average NLL of −0.724, which is inferior to scene-specific optimization methods such as awKDE (−0.965) and NF Streaming (−0.981). This performance gap highlights the challenge of direct sim-to-real transfer, as real-world CMDs exhibit more complex characteristics that are difficult to fully capture in simulation. Nevertheless, the pretrained model provides a reasonable initialization for subsequent adaptation.
Second, applying test-time fine-tuning on the pretrained model substantially enhances estimation accuracy, reducing the average NLL to −0.989. This significant improvement demonstrates the effectiveness of fine-tuning in bridging the domain gap, enabling the model to better align with real data distributions while leveraging prior knowledge learned during pretraining.
Finally, retraining HyperNCMD on real measurements produces further improvements, achieving an average NLL of −1.004. Incorporating fine-tuning further reduces the NLL to −1.066, corresponding to a 10.5% relative improvement over awKDE and consistently outperforming both awKDE and NF Streaming across all test conditions. These results underscore the flexibility and effectiveness of HyperNCMD for accurate CMD estimation in real-world radar applications.
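The relative improvements quoted here follow directly from the average NLL values in Table 8; a quick sanity check:

```python
# Average NLL values taken from Table 8 (lower is better).
nll = {
    "awKDE": -0.965,
    "NF Streaming": -0.981,
    "HyperNCMD (Retrain, fine-tuned)": -1.066,
}

def delta_nll(method, baseline="awKDE"):
    """Relative improvement over the baseline, in percent."""
    return 100.0 * (nll[baseline] - nll[method]) / abs(nll[baseline])

print(round(delta_nll("NF Streaming"), 2))                      # → 1.66
print(round(delta_nll("HyperNCMD (Retrain, fine-tuned)"), 1))   # → 10.5
```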
We further evaluate HyperNCMD on the Dynamic Cloud Monitoring experiment reported in [11]. Figure 12a depicts a representative dynamic radar scene featuring a moving rain clutter cluster. The cumulative radar measurements over three consecutive intervals (0–5 min, 5–10 min, and 10–15 min) reveal multiple complex, high-density clutter regions. Figure 12b–d present the corresponding CMD estimates obtained by awKDE, NF Streaming, and HyperNCMD, with average NLL values annotated. While all methods capture the general clutter distribution, HyperNCMD more accurately resolves fine spatial and intensity structures, quantitatively demonstrating superior CMD estimation performance.
Collectively, these findings confirm that HyperNCMD consistently outperforms baseline methods on both simulated and real-world datasets, demonstrating strong robustness, adaptability, and practical utility for real-world CMD estimation.

7. Conclusions

In this article, we presented HyperNCMD, a scene-adaptive framework for CMD estimation in radar-based MTT. Unlike existing approaches that require scene-specific tuning, HyperNCMD leverages hypernetworks and parameterized NFs to enable adaptive and transferable estimation with minimal tuning. Extensive experiments on both synthetic and real-world radar datasets demonstrate consistent improvements in accuracy, robustness, and cross-environment performance, achieving up to a 10.5% reduction in per-point NLL in challenging dynamic environments. Beyond empirical results, HyperNCMD demonstrates the potential of hypernetwork-driven density modeling for radar perception, providing a principled solution for handling spatially non-uniform and temporally dynamic clutter.
In future work, we plan to extend the applicability of HyperNCMD beyond MTT, exploring its potential in tasks such as clutter region identification and clutter suppression, thereby broadening its utility in intelligent radar perception systems.

Author Contributions

Conceptualization, Z.C. and J.Y.; methodology, Z.C. and J.Y.; software, Z.C.; validation, Z.C., W.S. and J.Y.; formal analysis, W.S.; investigation, Z.C. and J.Y.; resources, H.G.; data curation, W.S.; writing—original draft preparation, Z.C.; writing—review and editing, Z.C., W.S. and J.Y.; supervision, X.L., K.T., Z.D., W.Y. and H.G.; project administration, H.G.; funding acquisition, J.Y., X.L., K.T., Z.D., W.Y. and H.G. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China under Grant 62001229, Grant 62101260, Grant 62101264 and Grant 62401263; in part by the Natural Science Foundation of Jiangsu Province under Grant BK20210334 and Grant BK20230915; in part by the China Postdoctoral Science Foundation under Grant 2020M681604; and in part by the Jiangsu Province Postdoctoral Science Foundation under Grant 2020Z441.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1. Training Details for HyperNCMD

Algorithm A1 Training Procedure for HyperNCMD

Input: Training set D_train = {(Z_ctx^{k,(s)}, Z_q^{k,(s)})}_{s=1}^{S}; initial parameters θ = {θ_enc, θ_hyp, θ_nf^act}; learning rate η, number of epochs N_e, batch size N_b.
Output: Trained parameters θ.
 1: for epoch = 1 to N_e do
 2:     Randomly shuffle and split D_train into mini-batches {B_1, …, B_M}
 3:     for each mini-batch B = {(Z_ctx^{k,(s)}, Z_q^{k,(s)})}_{s=1}^{N_b} do
 4:         Apply data augmentation to all (Z_ctx^{k,(s)}, Z_q^{k,(s)})
 5:         for each scene s = 1 to N_b do
 6:             h_ctx^{(s)} ← SceneEncoder(Z_ctx^{k,(s)}; θ_enc)
 7:             Ω^{(s)} ← HyperNet(h_ctx^{(s)}; θ_hyp)
 8:             V_q^{(s)} ← F^{−1}(Z_q^{k,(s)}; Ω^{(s)}, θ_nf^act)
 9:             Compute log-likelihood:
                log p_Z^{(s)} = (1/M_q) Σ_{j=1}^{M_q} [ log p_V(v_{q,j}^{(s)}) + log |det J_{F^{−1}}(z_{q,j}^{k,(s)}; Ω^{(s)}, θ_nf^act)| ]
10:         end for
11:         Compute batch NLL: L ← −(1/N_b) Σ_{s=1}^{N_b} log p_Z^{(s)}
12:         Update trainable parameters via Adam: θ ← ADAM(θ, ∇_θ L, η)
13:     end for
14: end for
15: return θ
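To make the inner loop of Algorithm A1 (steps 6–9) concrete, the toy sketch below replaces the RQS flow with a single affine layer whose shift and scale are read off a context embedding — a deliberately minimal stand-in for the encoder/hypernetwork pair, not the paper's architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def scene_encoder(z_ctx):
    # Toy permutation-invariant encoder: per-dimension mean and std of the
    # context set (the paper uses RFFs + an ISAB-LSTM module instead).
    return np.concatenate([z_ctx.mean(axis=0), z_ctx.std(axis=0)])

def hyper_net(h_ctx):
    # Toy hypernetwork: reads the embedding off as affine-flow parameters
    # Omega = (shift, scale), standing in for generated RQS parameters.
    d = h_ctx.size // 2
    return h_ctx[:d], h_ctx[d:]

def avg_log_likelihood(z_q, omega):
    # Step 8: v = F^{-1}(z) = (z - shift) / scale, with base density N(0, I).
    shift, scale = omega
    v = (z_q - shift) / scale
    log_det = -np.log(scale).sum()                      # log|det J_{F^{-1}}|
    d = z_q.shape[1]
    log_pv = -0.5 * (v ** 2).sum(axis=1) - 0.5 * d * np.log(2.0 * np.pi)
    return (log_pv + log_det).mean()                    # step 9 average

# One synthetic "scene": context and query share the same clutter statistics.
z_ctx = rng.normal(loc=[3.0, -1.0], scale=[2.0, 0.5], size=(500, 2))
z_q = rng.normal(loc=[3.0, -1.0], scale=[2.0, 0.5], size=(200, 2))

omega = hyper_net(scene_encoder(z_ctx))                 # steps 6-7
nll = -avg_log_likelihood(z_q, omega)                   # per-point NLL
```

Because the generated shift/scale match the query statistics, the per-point NLL lands near the 2-D standardized-Gaussian value 1 + ln 2π ≈ 2.84; in the full model the hypernetwork instead conditions K stacked RQS layers.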

Appendix A.2. Test-Time Fine-Tuning Details

Algorithm A2 Test-Time Fine-Tuning with FiLM

Input: Test scene data D_test = (Z_ctx^k, Z_q^k); pretrained parameters θ = {θ_enc, θ_hyp, θ_nf^act}; initial FiLM parameters θ_film = {γ = 1, β = 0}; learning rate η; adaptation steps N_a.
Output: Adapted FiLM parameters (γ, β).
 1: Freeze all pretrained parameters θ
 2: h_ctx^raw ← SceneEncoder(Z_ctx^k; θ_enc)
 3: for step = 1 to N_a do
 4:     h_ctx ← γ ⊙ h_ctx^raw + β
 5:     Ω ← HyperNet(h_ctx; θ_hyp)
 6:     V_q ← F^{−1}(Z_q^k; Ω, θ_nf^act)
 7:     L_adapt ← −(1/|Z_q^k|) Σ_{j=1}^{|Z_q^k|} [ log p_V(v_{q,j}) + log |det J_{F^{−1}}(z_{q,j}^k)| ]
 8:     (γ, β) ← ADAM((γ, β), ∇_{(γ,β)} L_adapt, η)
 9: end for
10: return (γ, β)
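The sketch below mirrors Algorithm A2 on a toy problem: a frozen one-layer affine flow whose parameters come from a FiLM-modulated context embedding, with (γ, β) adapted by finite-difference gradient descent (a stand-in for Adam and autodiff). All names are illustrative, and the context deliberately comes from a stale scene so that adaptation has something to correct.

```python
import numpy as np

rng = np.random.default_rng(1)

def encode(z):
    # Frozen toy encoder: embedding = per-dimension [mean, log-std].
    return np.concatenate([z.mean(axis=0), np.log(z.std(axis=0))])

def avg_nll(h, z_q):
    # The embedding directly parameterizes an affine flow (toy hypernetwork).
    d = z_q.shape[1]
    shift, log_scale = h[:d], h[d:]
    v = (z_q - shift) * np.exp(-log_scale)            # F^{-1}
    log_det = -log_scale.sum()                        # log|det J_{F^{-1}}|
    log_p = -0.5 * (v ** 2).sum(axis=1) - 0.5 * d * np.log(2 * np.pi) + log_det
    return -log_p.mean()

def num_grad(f, x, eps=1e-4):
    # Central finite differences, standing in for autodiff gradients.
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = eps
        g.flat[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

h_raw = encode(rng.normal(0.0, 1.0, size=(500, 2)))   # stale context scene
z_q = rng.normal(1.5, 1.0, size=(500, 2))             # drifted test scene

gamma, beta = np.ones_like(h_raw), np.zeros_like(h_raw)   # FiLM init
loss = lambda g, b: avg_nll(g * h_raw + b, z_q)

before = loss(gamma, beta)
for _ in range(200):                                  # N_a adaptation steps
    g_gamma = num_grad(lambda g: loss(g, beta), gamma)
    g_beta = num_grad(lambda b: loss(gamma, b), beta)
    gamma -= 0.05 * g_gamma                           # plain GD, lr = 0.05
    beta -= 0.05 * g_beta
after = loss(gamma, beta)
print(round(before, 2), "->", round(after, 2))
```

Only γ and β move; the encoder and flow stay frozen, so the adapted NLL drops from roughly 5 toward the Gaussian optimum near 2.84 — the same mechanism by which FiLM closes the sim-to-real gap in Section 6.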

References

  1. Musicki, D.; Suvorova, S.; Morelande, M.; Moran, B. Clutter map and target tracking. In Proceedings of the 7th International Conference on Information Fusion, Philadelphia, PA, USA, 25–28 July 2005; Volume 1, pp. 69–76. [Google Scholar]
  2. Musicki, D.; Evans, R. Clutter map information for data association and track initialization. IEEE Trans. Aerosp. Electron. Syst. 2004, 40, 387–398. [Google Scholar] [CrossRef]
  3. Fahad, A.M.; Saritha, D.S. Survey on Clutter Spatial Intensity Estimation Methods for Target Tracking. J. Adv. Inf. Fusion 2022, 17, 3–13. [Google Scholar]
  4. Luo, X.; Zhang, B.; Liu, J.; Lin, H.; Yu, J. Researches on the method of clutter suppression in radar data processing. Syst. Eng. Electron. 2016, 38, 37–44. [Google Scholar]
  5. Bar-Shalom, Y.; Li, X.R. Multitarget-Multisensor Tracking: Principles and Techniques; YBS Publications: Storrs, CT, USA, 1995. [Google Scholar]
  6. Reid, D. An Algorithm for Tracking Multiple Targets. IEEE Trans. Autom. Control 1979, 24, 843–854. [Google Scholar] [CrossRef]
  7. Challa, S.; Morelande, M.R.; Musicki, D.; Evans, R.J. Fundamentals of Object Tracking; Cambridge University Press: Cambridge, UK, 2011. [Google Scholar]
  8. Lv, N.; Lian, F.; Han, C. Unknown clutter estimation by FMM approach in multitarget tracking algorithm. Math. Probl. Eng. 2014, 2014, 938242. [Google Scholar] [CrossRef]
  9. Chen, X.; Tharmarasa, R.; Kirubarajan, T.; McDonald, M. Online clutter estimation using a Gaussian kernel density estimator for multitarget tracking. IET Radar Sonar Navig. 2015, 9, 1–9. [Google Scholar] [CrossRef]
  10. Wang, B.; Wang, X. Bandwidth selection for weighted kernel density estimation. arXiv 2007, arXiv:0709.1616. [Google Scholar]
  11. Cao, Z.; Yang, J.; Sun, W.; Lu, X.; Tan, K.; Dai, Z.; Yu, W.; Gu, H. Online Clutter Measurement Modeling for Surveillance Radar Tracking via Normalizing Flows. IEEE Trans. Instrum. Meas. 2025, 74, 1–18. [Google Scholar] [CrossRef]
  12. Li, T. Single-Road-Constrained Positioning Based on Deterministic Trajectory Geometry. IEEE Commun. Lett. 2019, 23, 80–83. [Google Scholar] [CrossRef]
  13. Li, X.R.; Li, N. Integrated Real-Time Estimation of Clutter Density for Tracking. IEEE Trans. Signal Process. 2000, 48, 2797–2805. [Google Scholar] [CrossRef]
  14. Mahler, R. CPHD and PHD Filters for Unknown Backgrounds I: Dynamic Data Clustering. In Proceedings of the SPIE Defense, Security, and Sensing, Orlando, FL, USA, 14–15 April 2009; Volume 7330, pp. 140–151. [Google Scholar]
  15. Mahler, R. CPHD and PHD Filters for Unknown Backgrounds II: Multitarget Filtering in Dynamic Clutter. In Proceedings of the SPIE Defense, Security, and Sensing, Orlando, FL, USA, 14–15 April 2009; Volume 7330, pp. 152–163. [Google Scholar]
  16. Chen, X.; Tharmarasa, R.; Pelletier, M.; Kirubarajan, T. Integrated Clutter Estimation and Target Tracking Using Poisson Point Processes. IEEE Trans. Aerosp. Electron. Syst. 2012, 48, 1210–1235. [Google Scholar] [CrossRef]
  17. Kobyzev, I.; Prince, S.J.; Brubaker, M.A. Normalizing flows: An introduction and review of current methods. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 43, 3964–3979. [Google Scholar] [CrossRef]
  18. Lee, J.; Lee, Y.; Kim, J.; Kosiorek, A.; Choi, S.; Teh, Y.W. Set Transformer: A Framework for Attention-Based Permutation-Invariant Neural Networks. In Proceedings of the 36th International Conference on Machine Learning (ICML), Long Beach, CA, USA, 9–15 June 2019; pp. 3744–3753. [Google Scholar]
  19. Rahimi, A.; Recht, B. Random Features for Large-Scale Kernel Machines. In Proceedings of the 20th Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 3–6 December 2007; pp. 1177–1184. [Google Scholar]
  20. Perez, E.; Strub, F.; de Vries, H.; Dumoulin, V.; Courville, A.C. FiLM: Visual Reasoning With a General Conditioning Layer. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, 2–7 February 2018; pp. 3942–3951. [Google Scholar]
  21. Papamakarios, G.; Nalisnick, E.; Rezende, D.J.; Mohamed, S.; Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. J. Mach. Learn. Res. 2021, 22, 1–64. [Google Scholar]
  22. Dinh, L.; Krueger, D.; Bengio, Y. NICE: Non-linear Independent Components Estimation. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  23. Dinh, L.; Sohl-Dickstein, J.; Bengio, S. Density estimation using Real NVP. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  24. Papamakarios, G.; Murray, I.; Pavlakou, T. Masked Autoregressive Flow for Density Estimation. In Proceedings of the 31st Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 2338–2347. [Google Scholar]
  25. Durkan, C.; Bekasov, A.; Murray, I.; Papamakarios, G. Neural Spline Flows. In Proceedings of the 32nd Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; pp. 7509–7520. [Google Scholar]
  26. Ha, D.; Dai, A.M.; Le, Q.V. HyperNetworks. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  27. Rusu, A.A.; Rao, D.; Sygnowski, J.; Vinyals, O.; Pascanu, R.; Osindero, S.; Hadsell, R. Meta-Learning with Latent Embedding Optimization. In Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  28. von Oswald, J.; Henning, C.; Sacramento, J.; Grewe, B.F. Continual Learning with Hypernetworks. In Proceedings of the 8th International Conference on Learning Representations (ICLR), Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
  29. Brock, A.; Lim, T.; Ritchie, J.M.; Weston, N. SMASH: One-Shot Model Architecture Search Through HyperNetworks. In Proceedings of the 6th International Conference on Learning Representations (ICLR), Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
  30. Oh, G.; Valois, J.S. HCNAF: Hyper-Conditioned Neural Autoregressive Flow and Its Application for Probabilistic Occupancy Map Forecasting. In Proceedings of the 2020 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 14538–14547. [Google Scholar]
  31. Spurek, P.; Zieba, M.; Tabor, J.; Trzcinski, T. HyperFlow: Representing 3D Objects as Surfaces. arXiv 2020, arXiv:2006.08710. [Google Scholar] [CrossRef]
  32. Grathwohl, W.; Chen, R.T.; Bettencourt, J.; Sutskever, I.; Duvenaud, D. FFJORD: Free-Form Continuous Dynamics for Scalable Reversible Generative Models. In Proceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019. [Google Scholar]
  33. Sun, J.; Wang, Z.; Li, J.; Lu, C. Unified and Fast Human Trajectory Prediction Via Conditionally Parameterized Normalizing Flow. IEEE Robot. Autom. Lett. 2022, 7, 842–849. [Google Scholar] [CrossRef]
  34. Fons, E.; Sztrajman, A.; El-Laham, Y.; Iosifidis, A.; Vyetrenko, S. HyperTime: Implicit Neural Representation for Time Series. arXiv 2022, arXiv:2208.05836. [Google Scholar] [CrossRef]
  35. Kosiorek, A.R.; Strathmann, H.; Zoran, D.; Moreno, P.; Schneider, R.; Mokrá, S.; Rezende, D.J. NeRF-VAE: A Geometry Aware 3D Scene Generative Model. In Proceedings of the 38th International Conference on Machine Learning (ICML), Virtual Event, 18–24 July 2021; pp. 5742–5752. [Google Scholar]
  36. Zaheer, M.; Kottur, S.; Ravanbakhsh, S.; Poczos, B.; Salakhutdinov, R.R.; Smola, A.J. Deep Sets. In Proceedings of the 30th Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 3391–3401. [Google Scholar]
  37. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 77–85. [Google Scholar]
  38. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
  39. Tancik, M.; Srinivasan, P.; Mildenhall, B.; Fridovich-Keil, S.; Raghavan, N.; Singhal, U.; Ramamoorthi, R.; Barron, J.; Ng, R. Fourier Features Let Networks Learn High Frequency Functions in Low Dimensional Domains. In Proceedings of the 33rd Advances in Neural Information Processing Systems (NeurIPS), Virtual, 6–12 December 2020; pp. 7537–7547. [Google Scholar]
  40. Shi, X.; Chen, Z.; Wang, H.; Yeung, D.Y.; Wong, W.K.; Woo, W.C. Convolutional LSTM Network: A machine learning approach for precipitation nowcasting. In Proceedings of the 29th Advances in Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 7–12 December 2015; pp. 802–810. [Google Scholar]
  41. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the 30th Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  42. Ba, J.L.; Kiros, J.R.; Hinton, G.E. Layer Normalization. arXiv 2016, arXiv:1607.06450. [Google Scholar] [CrossRef]
  43. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  44. Kingma, D.P.; Dhariwal, P. Glow: Generative Flow with Invertible 1x1 Convolutions. In Proceedings of the 31st Advances in Neural Information Processing Systems (NeurIPS), Montréal, QC, Canada, 3–8 December 2018; pp. 10236–10245. [Google Scholar]
  45. Huynh-Thu, Q.; Ghanbari, M. Scope of validity of PSNR in image/video quality assessment. Electron. Lett. 2008, 44, 800–801. [Google Scholar] [CrossRef]
  46. Wang, Z.; Bovik, A.; Sheikh, H.; Simoncelli, E. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  47. Lewis, J. Fast normalized cross-correlation. Vis. Interface 1995, 10, 120–123. [Google Scholar]
  48. Kullback, S.; Leibler, R. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  49. Bernardin, K.; Stiefelhagen, R. Evaluating multiple object tracking performance: The clear mot metrics. EURASIP J. Image Video Process. 2008, 2008, 246309. [Google Scholar] [CrossRef]
  50. Musicki, D.; Morelande, M. Non Parametric Target Tracking in Non Uniform Clutter. In Proceedings of the 7th International Conference on Information Fusion, Philadelphia, PA, USA, 25–28 July 2005; Volume 1, pp. 48–53. [Google Scholar]
  51. Song, T.L.; Musicki, D. Adaptive Clutter Measurement Density Estimation for Improved Target Tracking. IEEE Trans. Aerosp. Electron. Syst. 2011, 47, 1457–1466. [Google Scholar] [CrossRef]
  52. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  53. Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. In Proceedings of the 5th International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017. [Google Scholar]
  54. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the 33rd Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 8–14 December 2019; pp. 8024–8035. [Google Scholar]
  55. van der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008, 9, 2579–2605. [Google Scholar]
  56. Yang, J.; Lu, X.; Dai, Z.; Yu, W.; Tan, K. A Cylindrical Phased Array Radar System for UAV Detection. In Proceedings of the 6th International Conference on Intelligent Computing and Signal Processing (ICSP), Xi’an, China, 9–11 April 2021; pp. 894–898. [Google Scholar]
Figure 1. Overview of the proposed HyperNCMD model for CMD estimation in MTT.
Figure 2. Architecture of the Scene Feature Extractor.
Figure 3. Hypernetwork architecture for parameter generation of target NF layers.
Figure 4. Structure of the RQS transformation.
Figure 5. Representative MTT scenario with static clutter: (a) sensor measurements collected over T = 200 s; (b) corresponding CMD.
Figure 6. Impact of context length L on HyperNCMD performance in (a) static clutter and (b) dynamic clutter scenarios.
Figure 7. Temporal evolution of different CMD estimates in a dynamic clutter background: (a) Exact; (b) HistSpatial; (c) awKDE; (d) NF Streaming; and (e) HyperNCMD.
Figure 8. Effect of RFF parameters on model performance: (a) feature dimension B_F; (b) Gaussian kernel bandwidth σ_F.
Figure 9. Effect of model size on performance: (a) hidden dimension D_h; (b) flow layer depth K.
Figure 10. Training loss comparison between conditional encoding and HyperNCMD.
Figure 11. Visualization of the embedding space.
Figure 12. Comparison of the results of different CMD estimators over time in a dynamic cloud clutter environment: (a) Radar measurements; (b) awKDE; (c) NF Streaming; and (d) HyperNCMD.
Table 1. Test datasets for CMD estimation and MTT evaluation.

Dataset ID | Evaluation Task | Clutter Type | Duration | #Scenes
DS-I | CMD | Static | 50 s | 1000
DS-II | CMD | Dynamic | 100 s | 1000
DS-III | MTT | Static | 150 s | 500
DS-IV | MTT | Dynamic | 150 s | 500
Table 2. Performance of CMD Estimators in Static Clutter Scenarios (DS-I).

Methods | Platform | RMSE (↓) | MAE (↓) | PSNR (dB, ↑) | SSIM (↑) | NCC (↑) | KLD (↓) | Time (ms, ↓)
HistSpatial [1] | CPU | 8.68 ± 4.91 | 2.32 ± 0.58 | 30.39 ± 4.45 | 0.90 ± 0.04 | 0.74 ± 0.14 | 0.30 ± 0.13 | 16.4 ± 2.72
SCMDE [51] | CPU | 9.62 ± 5.50 | 2.61 ± 0.58 | 29.51 ± 4.40 | 0.88 ± 0.04 | 0.73 ± 0.13 | 0.32 ± 0.12 | 226.1 ± 57.7
FMM [8] | GPU | 9.93 ± 4.29 | 2.89 ± 0.32 | 28.90 ± 5.64 | 0.76 ± 0.13 | 0.76 ± 0.10 | 0.99 ± 0.44 | 581.9 ± 107.3
KDE [9] | GPU | 7.66 ± 3.97 | 3.17 ± 0.87 | 31.35 ± 4.44 | 0.78 ± 0.08 | 0.82 ± 0.08 | 0.82 ± 0.58 | 1024.0 ± 273.7
awKDE [10] | GPU | 8.13 ± 5.96 | 2.32 ± 0.43 | 31.35 ± 4.13 | 0.85 ± 0.08 | 0.81 ± 0.09 | 0.41 ± 0.11 | 0.14 ± 0.73
NF Streaming [11] | GPU | 6.56 ± 3.50 | 1.89 ± 0.35 | 32.76 ± 5.16 | 0.89 ± 0.05 | 0.86 ± 0.07 | 0.25 ± 0.07 | 950.1 ± 33.4
HyperNCMD (w/o FiLM) | GPU | 6.43 ± 4.48 | 1.15 ± 0.35 | 33.4 ± 5.19 | 0.96 ± 0.02 | 0.88 ± 0.07 | 0.11 ± 0.06 | 68.1 ± 10.7
HyperNCMD (w/ FiLM) | GPU | 6.12 ± 3.95 | 1.16 ± 0.32 | 33.7 ± 5.35 | 0.96 ± 0.02 | 0.89 ± 0.06 | 0.11 ± 0.05 | 227.3 ± 10.0
RMSE and MAE values should be multiplied by 10⁻⁷ to obtain the actual values. The best results for each metric are indicated in bold, and the second-best are underlined. For metrics marked with ↑, higher values indicate better performance, whereas for those marked with ↓, lower values are preferable. Notations are consistent across subsequent tables and figures.
Table 3. Performance of CMD Estimators in Dynamic Clutter Scenarios (DS-II).

Methods | Platform | RMSE (↓) | MAE (↓) | PSNR (dB, ↑) | SSIM (↑) | NCC (↑) | KLD (↓) | Time (ms, ↓)
HistSpatial [1] | CPU | 9.10 ± 5.13 | 2.37 ± 0.60 | 29.7 ± 4.12 | 0.90 ± 0.04 | 0.72 ± 0.13 | 0.31 ± 0.14 | 17.6 ± 2.82
SCMDE [51] | CPU | 9.48 ± 5.03 | 2.61 ± 0.57 | 29.2 ± 4.33 | 0.88 ± 0.05 | 0.72 ± 0.13 | 0.32 ± 0.13 | 177.5 ± 51.3
FMM [8] | GPU | 10.24 ± 4.51 | 2.95 ± 0.33 | 28.28 ± 5.33 | 0.75 ± 0.14 | 0.74 ± 0.10 | 1.02 ± 0.48 | 634.2 ± 106.5
KDE [9] | GPU | 8.10 ± 4.65 | 3.86 ± 2.64 | 30.67 ± 4.12 | 0.76 ± 0.08 | 0.80 ± 0.08 | 0.49 ± 0.21 | 1029.0 ± 268.7
awKDE [10] | GPU | 8.01 ± 5.23 | 2.34 ± 0.41 | 31.00 ± 4.08 | 0.84 ± 0.08 | 0.80 ± 0.09 | 0.41 ± 0.12 | 0.23 ± 1.79
NF Streaming [11] | GPU | 6.97 ± 3.74 | 1.93 ± 0.33 | 31.90 ± 4.75 | 0.89 ± 0.06 | 0.84 ± 0.07 | 0.26 ± 0.07 | 912.7 ± 54.6
HyperNCMD (w/o FiLM) | GPU | 6.64 ± 4.15 | 1.21 ± 0.35 | 32.63 ± 4.71 | 0.96 ± 0.02 | 0.86 ± 0.07 | 0.13 ± 0.06 | 113.2 ± 41.2
HyperNCMD (w/ FiLM) | GPU | 6.40 ± 3.84 | 1.21 ± 0.32 | 32.87 ± 4.81 | 0.96 ± 0.02 | 0.87 ± 0.06 | 0.12 ± 0.05 | 232.1 ± 9.2
Table 4. Performance of Different CMD Estimators in MTT with Static Backgrounds (DS-III).

Methods | MOTA (%, ↑) | MOTP (%, ↑) | FTR (↓) | P (%, ↑) | R (%, ↑) | FP (↓) | FN (↓) | IDS (↓) | Frag (↓)
HistSpatial [1] | 92.52 | 86.09 | 0.140 | 96.33 | 96.23 | 5822 | 5981 | 62 | 6441
SCMDE [51] | 92.46 | 86.09 | 0.121 | 96.80 | 95.67 | 5021 | 6869 | 58 | 6397
FMM [8] | 80.01 | 86.06 | 0.546 | 86.48 | 94.92 | 23,526 | 8053 | 45 | 6476
awKDE [10] | 94.75 | 86.06 | 0.066 | 98.24 | 96.52 | 2733 | 5500 | 72 | 6432
NF Streaming [11] | 95.35 | 86.08 | 0.052 | 98.63 | 96.75 | 2131 | 5141 | 78 | 6436
HyperNCMD (w/ FiLM) | 95.39 | 86.08 | 0.049 | 98.70 | 96.72 | 2015 | 5188 | 78 | 6433
Exact | 95.47 | 86.07 | 0.051 | 98.64 | 96.86 | 2118 | 4959 | 80 | 6465
Table 5. Performance of Different CMD Estimators in MTT with Dynamic Backgrounds (DS-IV).

Methods | MOTA (%, ↑) | MOTP (%, ↑) | FTR (↓) | P (%, ↑) | R (%, ↑) | FP (↓) | FN (↓) | IDS (↓) | Frag (↓)
HistSpatial [1] | 91.67 | 86.20 | 0.164 | 95.71 | 96.01 | 6833 | 6332 | 55 | 6360
SCMDE [51] | 91.86 | 86.19 | 0.131 | 96.52 | 95.33 | 5450 | 7405 | 56 | 6300
FMM [8] | 70.83 | 86.18 | 0.85 | 80.04 | 94.40 | 37,061 | 8822 | 30 | 6356
awKDE [10] | 94.20 | 86.20 | 0.082 | 97.84 | 96.37 | 3371 | 5741 | 61 | 6385
NF Streaming [11] | 95.06 | 86.29 | 0.057 | 98.50 | 96.57 | 2331 | 5417 | 64 | 6382
HyperNCMD (w/ FiLM) | 95.20 | 86.20 | 0.052 | 98.62 | 96.59 | 2133 | 5391 | 67 | 6386
Exact | 95.49 | 86.20 | 0.050 | 98.67 | 96.85 | 2068 | 4988 | 72 | 6397
Table 6. Quantitative Effect of Sequentially Integrated Components on CMD Estimation Performance.

Stage | Module Added | PSNR (↑) | NCC (↑) | KLD (↓) | ΔPSNR
0 | ISAB (Baseline) | 27.00 | 0.310 | 0.766 | —
1 | +RFFs | 29.66 | 0.716 | 0.296 | +2.66
2 | +Temporal Modeling | 29.82 | 0.739 | 0.265 | +0.16
3 | +SE Block | 30.02 | 0.749 | 0.253 | +0.20
4 | +Coordinate Flip | 31.11 | 0.794 | 0.205 | +1.09
5 | +Rand. Ctx. Length | 31.22 | 0.798 | 0.196 | +0.11
Table 7. Effect of Temporal Modeling under Different Context Lengths.

Ctx. Len | Module | PSNR (↑) | NCC (↑) | KLD (↓)
10 | ISAB + Temporal Mean | 30.76 | 0.779 | 0.220
10 | ISAB-LSTM + Temporal Mean | 30.95 | 0.788 | 0.212
10 | ISAB-LSTM + Temporal Attention | 31.01 | 0.789 | 0.209
10 | ISAB + LSTM + Temporal Attention | 30.64 | 0.766 | 0.229
20 | ISAB + Temporal Mean | 31.05 | 0.789 | 0.202
20 | ISAB-LSTM + Temporal Mean | 31.22 | 0.798 | 0.196
20 | ISAB-LSTM + Temporal Attention | 31.34 | 0.803 | 0.191
20 | ISAB + LSTM + Temporal Attention | 30.94 | 0.782 | 0.210
30 | ISAB + Temporal Mean | 30.92 | 0.778 | 0.208
30 | ISAB-LSTM + Temporal Mean | 31.07 | 0.787 | 0.203
30 | ISAB-LSTM + Temporal Attention | 31.32 | 0.802 | 0.190
30 | ISAB + LSTM + Temporal Attention | 30.96 | 0.782 | 0.208
Avg. | ISAB + Temporal Mean | 30.91 | 0.782 | 0.210
Avg. | ISAB-LSTM + Temporal Mean | 31.08 | 0.791 | 0.204
Avg. | ISAB-LSTM + Temporal Attention | 31.22 | 0.798 | 0.196
Avg. | ISAB + LSTM + Temporal Attention | 30.84 | 0.777 | 0.215
Table 8. Performance Comparison on Real-World Clutter Data.

Method | Avg. NLL (↓) | ΔNLL ¹
awKDE [10] | −0.965 ± 0.35 | baseline
NF Streaming [11] | −0.981 ± 0.39 | +1.66%
HyperNCMD (Pretrain, no fine-tuning) | −0.724 ± 0.27 | —
HyperNCMD (Pretrain, fine-tuned) | −0.989 ± 0.39 | +2.49%
HyperNCMD (Retrain, no fine-tuning) | −1.004 ± 0.38 | +4.04%
HyperNCMD (Retrain, fine-tuned) | −1.066 ± 0.43 | +10.5%
¹ ΔNLL indicates relative improvement over awKDE.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
