Article

An Improved RODNet for Object Detection Based on Radar and Camera Fusion

School of Information and Communication Engineering, Hainan University, Haikou 570228, China
* Author to whom correspondence should be addressed.
Sensors 2026, 26(2), 373; https://doi.org/10.3390/s26020373
Submission received: 28 November 2025 / Revised: 4 January 2026 / Accepted: 4 January 2026 / Published: 6 January 2026
(This article belongs to the Section Radar Sensors)

Abstract

Deep learning-based radar detection often suffers from poor cross-device generalization due to hardware heterogeneity. To address this, we propose a unified framework that combines rigorous calibration with adaptive temporal modeling. The method integrates three coordinated steps: (1) ensuring precise spatial alignment via improved Perspective-n-Point (PnP) calibration with closed-loop verification; (2) unifying signal statistics through multi-range bin calibration and chirp-wise Z-score standardization; and (3) enhancing feature consistency using a lightweight global–temporal adapter (GTA) driven by global gating and three-point attention. By combining signal-level standardization with feature-level adaptation, our framework achieves 86.32% average precision (AP) on the ROD2021 dataset. It outperforms the E-RODNet baseline by 22.88 percentage points with a 0.96% parameter increase, showing strong generalization across diverse radar platforms.

1. Introduction

Millimeter-wave radar sensors have become essential components in autonomous driving sensor suites due to their all-weather robustness, direct range–velocity–angle measurement capabilities, and cost-effectiveness. Unlike camera sensors, which degrade in low-light or high-glare conditions, and LiDAR sensors, constrained by high costs, mechanical complexity, and weather sensitivity [1], millimeter-wave radar sensors maintain stable sensing performance under adverse environmental conditions, including rain, fog, backlighting, and nighttime operation. These intrinsic sensor advantages have driven extensive research on radar-based perception systems for robust autonomous vehicle sensing [2,3,4].
However, radar sensor signals inherently exhibit low spatial resolutions and lack semantic features such as shape or texture, fundamentally limiting the object discrimination capabilities. Due to these sensor-level signal characteristics, traditional radar signal processing methods such as the constant false alarm rate (CFAR) [5] and clustering-based algorithms [6], which rely on classical signal processing and statistical modeling, often produce high false alarm rates in cluttered environments and lack semantic representation capabilities, making them insufficient for high-level perception tasks despite their computational efficiency. To overcome these fundamental sensor limitations, multi-sensor fusion combining radar and camera has emerged as a promising approach. By exploiting the complementary sensing modalities—radar’s robust range–velocity measurement and camera’s rich semantic information—fusion frameworks enhance perception performance through cross-modal learning while maintaining all-weather operation. Vision-guided supervision and deep learning-based sensor fusion have become effective strategies, enabling radar sensing networks to learn semantic-aware representations from synchronized camera sensor data [2,7,8,9,10]. However, cross-modal supervision alone cannot address the fundamental challenges of radar sensor signal quality variations and cross-device hardware heterogeneity.
Recent advances in deep learning have transformed radar perception from rule-based processing to data-driven paradigms. The Radar Object Detection Network (RODNet) [2] pioneered radar object detection directly from range–azimuth (RA) maps under camera supervision, providing pixel-level cross-modal training signals. E-RODNet [11] further improved spatiotemporal modeling through short-sequence fusion (SFF) and enhanced encoder–decoder architectures. Subsequent studies have explored radar–camera fusion and bird’s-eye-view (BEV) perception frameworks [9,12,13,14,15], leveraging cross-modal supervision and attention mechanisms to enrich semantic understanding. Nevertheless, despite these advances, deep learning-based radar detection still suffers from two critical bottlenecks that hinder practical deployment: (1) poor cross-device generalization due to hardware-dependent signal distributions and (2) unstable temporal consistency in dynamic driving scenes.
The first issue—cross-device domain shift—occurs because radar devices differ in antenna array configuration, frequency modulation, and noise statistics. Models trained on one sensor may perform poorly on another [16]. Although standardization-based domain adaptation techniques [17] have shown promise, existing radar frameworks still lack systematic calibration and distribution alignment mechanisms. The second issue—insufficient temporal modeling—arises because most current radar networks use local 3D convolutions with limited receptive fields and no explicit temporal attention, leading to detection jitter and inconsistency across frames [18].
To address these challenges, this paper proposes an improved RODNet-based radar–camera fusion framework for robust object detection. The framework synergistically integrates geometric alignment, time-division multiplexing multiple-input multiple-output (TDM-MIMO) channel calibration, chirp-wise Z-score standardization, and a lightweight GTA to tackle cross-device domain shift and temporal instability. These components work collaboratively: geometric alignment enables effective training supervision, calibration and standardization decouple hardware-specific characteristics from semantic features, and the GTA module stabilizes temporal predictions, collectively achieving robust cross-device object detection.
Experiments on the public ROD2021 dataset demonstrate that the proposed fusion framework achieves average precision of 86.32%, outperforming the baseline E-RODNet by 22.88 percentage points with only a 0.96% parameter increase. Experimental validation further demonstrates that channel calibration reduces the main lobe width by 42.3%, statistical standardization stabilizes the complex signal standard deviation to 0.037 ± 0.005, and the GTA module improves the detection performance with a minimal parameter overhead.
The main contributions of this paper are summarized as follows:
1. A radar–camera fusion framework with improved PnP-based extrinsic calibration and closed-loop geometric verification, enabling precise cross-modal coordinate alignment for effective training supervision.
2. A multi-range-bin joint TDM-MIMO channel calibration method that corrects receive-channel complex gains and transmit-phase deviations, improving cross-device array consistency.
3. A chirp-wise Z-score standardization strategy with formal mathematical definitions that achieves the statistical alignment of radar signal distributions (standard deviation converging to 0.037, 99th percentile at 0.19) across sensors, enhancing model generalization and transferability.
4. A lightweight GTA module that stabilizes short-term temporal features through global gating and temporal attention with only approximately 0.12 M parameters.
5. Comprehensive experiments on the ROD2021 dataset and signal-level cross-device verification on AWR1642 data, validating the effectiveness of the proposed preprocessing pipeline.
The remainder of this paper is structured as follows: Section 2 reviews related work in radar preprocessing, object detection, and attention mechanisms; Section 3 introduces the radar platform and signal modeling; Section 4 presents the proposed methodology, including geometric alignment, channel calibration, statistical standardization, and the GTA module; and Section 5 provides the experimental results and analysis. Finally, Section 6 summarizes the conclusions.

2. Related Work

2.1. Radar Data Preprocessing

Millimeter-wave radar signal processing typically involves multiple stages to transform raw time-domain signals into multi-dimensional representations. Range fast Fourier transform (FFT) extracts distance information, Doppler FFT captures velocity characteristics, and angle FFT estimates angular positions, providing essential features for object detection and tracking. The phase consistency of TDM-MIMO virtual arrays is crucial for accurate angle estimation [19]. However, different radar hardware exhibits significant variations in terms of transmit–receive channel numbers, sampling rate, bandwidth, and antenna geometry. Moreover, the amplitude distribution of raw signals is affected by environmental clutter, electromagnetic interference, and thermal noise. These factors lead to inconsistent signal characteristic distributions across device domains, significantly limiting the generalization performance of deep learning models.
Early preprocessing research mainly focused on single-device channel calibration and noise suppression. For example, correcting transmitter and receiver complex gains allows compensation for channel amplitude drift [20], while phase difference estimation-based calibration methods can alleviate the impacts of array errors on angle spectra [21]. Recent work has explored the statistical standardization of radar signals, including amplitude normalization, power spectrum compression, and adaptive noise modeling, to improve model robustness across different scenarios. However, these methods often rely on device-specific parameters or fixed environment settings and lack a unified cross-device standardization strategy. In addition, existing approaches generally ignore temporal feature statistical drift, such as chirp-level amplitude fluctuations and inter-frame power inconsistencies, which significantly weaken model performance in multi-device training and transfer scenarios.
To address these challenges, this paper proposes a general preprocessing pipeline with multi-range-bin joint channel calibration and chirp-wise Z-score standardization. The former achieves consistent array responses via the joint optimization of complex gains and phase deviations; the latter aligns cross-frame and cross-device statistical distributions. This approach improves the angle spectrum clarity and feature stability with low computational complexity.

2.2. Radar Object Detection

Millimeter-wave radar object detection methods include traditional signal processing algorithms and deep learning approaches. Traditional methods such as CFAR [5] and clustering [6] have high computational efficiency but struggle with clutter, occlusion, and semantic feature extraction in complex scenes.
Deep learning has transformed radar perception towards data-driven approaches. RODNet [2] pioneered learning from range–azimuth (RA) maps using 3D convolutional neural networks (CNNs) with camera pseudo-labels for cross-modal supervision. Subsequent variants (RODNet-HG, RODNet-CDC) introduced HourGlass and deformable convolution. E-RODNet [11] achieved significant improvements through short-sequence fusion (SFF) and an enhanced encoder–decoder architecture, providing a lightweight baseline. These networks leverage residual connections [22] and encoder–decoder architectures [23] for multi-scale feature fusion. T-RODNet [24] achieved 83.83% AP using Swin Transformer but requires 159.7M parameters, limiting real-time deployment.
Recent work has explored hybrid Transformer–CNN architectures. TC-Radar [25] integrates Transformers for long-range dependencies with CNNs for local features, TransRadar [26] proposes adaptive directional Transformers for multi-view segmentation, and SMIFormer [27] leverages multi-view Transformers for 4D radar detection. The K-Radar dataset [28] provides comprehensive 4D radar data for weather-robust evaluation. Recent 4D radar fusion work [29,30] combines LiDAR and radar for improved robustness in adverse conditions.
Beyond RA maps, radar point cloud methods leverage 3D architectures like PointNet++ [31], VoxelNet, and PointPillars. However, radar point clouds suffer from extreme sparsity (10–50 points/frame vs. LiDAR’s 10,000+), fundamentally limiting their effectiveness. RAMP-CNN [32] combines RA maps and point clouds for improved accuracy but at the cost of increased complexity. Recent efficient architectures include YOLOv8 [33], RT-DETR [34], and EfficientDet [35].
While existing research primarily focuses on network architecture innovation (HourGlass, deformable convolution, Transformers) to improve the detection performance on single-device datasets, these methods lack the systematic treatment of two fundamental challenges: (1) cross-device domain shift—existing domain adaptation methods [16,17] address high-level feature distribution alignment but ignore device-specific signal-level heterogeneity in radar systems (antenna configuration, frequency modulation, noise statistics); (2) temporal consistency—while hybrid Transformer–CNN models [25,26,27] employ global self-attention for temporal modeling, their $O(N^2)$ complexity and heavy parameter requirements (e.g., T-RODNet with 159.7 M parameters) limit real-time deployment.
This paper addresses both challenges through a dual-level approach. At the signal level, we propose multi-range-bin joint TDM-MIMO calibration and chirp-wise Z-score standardization to achieve hardware-agnostic preprocessing that decouples device-specific characteristics from semantic features, enabling cross-device model transferability. At the feature level, we introduce a lightweight GTA module (0.12 M parameters) that achieves efficient temporal modeling through three-point attention with linear $O(N)$ complexity, balancing performance and deployability. Unlike existing works that treat preprocessing and architectures separately, our framework synergistically integrates signal standardization and temporal attention, achieving 86.32% AP with only a 0.96% parameter increase—demonstrating that systematic signal-level preprocessing can achieve comparable or superior performance to heavy Transformer-based models while maintaining computational efficiency.

2.3. Attention Mechanisms

Attention mechanisms have been widely applied in computer vision and temporal modeling tasks. Vision Transformer (ViT) [36] pioneered the application of pure Transformer architectures to image recognition, demonstrating the effectiveness of self-attention for visual feature learning. The Squeeze-and-Excitation Network (SENet) [37] learns channel attention weights through global average pooling and two-layer fully connected networks, adaptively recalibrating channel-wise feature responses. The convolutional block attention module (CBAM) [38] further extends this concept by fusing both channel and spatial attention mechanisms. In temporal modeling, the temporal shift module (TSM) [39] captures temporal dependencies through efficient channel shift operations, and the temporal convolutional network (TCN) [40] uses causal convolution to construct a long-term temporal information flow. The Transformer [41], based on a self-attention mechanism, can explicitly model global dependencies, but its computational complexity is $O(N^2)$, which becomes prohibitive for high-resolution radar data.
In recent years, researchers have proposed various lightweight temporal attention structures to reduce the temporal modeling complexity while maintaining effectiveness. Lightweight Vision Transformers [42] achieve competitive performance with significantly reduced parameters through bidirectional interaction between local and global features. Drawing on the temporal neighborhood concept from TSM, this paper designs a GTA module consisting of three-point temporal attention based on roll operations and a global gating mechanism. This module explicitly models inter-frame dependencies in the temporal dimension while maintaining linear complexity $O(N)$, achieving efficient temporal alignment and dynamic feature aggregation and significantly improving the stability and robustness of object detection in dynamic scenes.

3. Radar Platform and Signal Modeling

This section introduces the AWR1642 radar hardware platform and the single-chirp RA map construction process, which serve as the foundation for subsequent preprocessing and deep learning-based object detection.

3.1. AWR1642 Hardware Configuration and Resolution Characteristics

The TI AWR1642 operates in the 77 GHz frequency band with a two-transmit and four-receive antenna configuration, where $T_1$ and $T_2$ denote the two transmit antennas and $R_1$–$R_4$ denote the four receive antennas, forming an eight-channel virtual array through TDM-MIMO. Key configuration parameters are shown in Table 1.
The AWR1642 radar operates within the 76–81 GHz frequency band as specified in the hardware manual. The sweep bandwidth is determined by the frequency slope and chirp duration as follows:
$$\text{BW} = \text{slope} \times \text{chirp duration} = 8.014\ \text{MHz}/\mu\text{s} \times 56\ \mu\text{s} = 448.784\ \text{MHz}.$$
The frame rate of 10 Hz (100 ms frame period) is independent of the active chirp sequence duration (56 μs × 128 = 7.168 ms), as the radar firmware inserts idle time between frames to achieve the configured frame periodicity.
The ROD2021 dataset [43] uses the TI AWR1843 radar with a two-transmit and four-receive TDM-MIMO configuration. The range resolution is 0.06 m and the angular resolution is approximately 0.15°. The AWR1642 used in this paper forms an eight-channel virtual array and achieves a range resolution calculated as
$$\Delta R = \frac{c}{2 \times \text{BW}} = \frac{3 \times 10^{8}\ \text{m/s}}{2 \times 448.784 \times 10^{6}\ \text{Hz}} \approx 0.334\ \text{m}.$$
These hardware differences lead to inconsistent data distributions, requiring systematic calibration and standardization processing.
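For reference, the bandwidth and range-resolution arithmetic above can be checked with a few lines of Python; the constants are taken directly from the equations in this subsection.

```python
# Minimal numeric check of the sweep bandwidth and range resolution above.
C = 3e8                              # speed of light (m/s)
slope_hz_per_s = 8.014e6 / 1e-6      # 8.014 MHz/us, expressed in Hz/s
chirp_duration_s = 56e-6             # 56 us

bw_hz = slope_hz_per_s * chirp_duration_s     # ~448.784 MHz
range_res_m = C / (2 * bw_hz)                 # ~0.334 m

print(f"BW = {bw_hz / 1e6:.3f} MHz, range resolution = {range_res_m:.3f} m")
```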

3.2. Single-Chirp RA Construction

To ensure geometric consistency and temporal synchronization, this paper adopts a single-chirp intra-frame RA construction scheme where each chirp generates an RA map independently with no cross-chirp Doppler FFT processing. In the spatial domain, TDM-MIMO virtual array expansion technology is adopted. Within the same chirp, $T_1$ and $T_2$ transmit in time division, and each transmit antenna corresponds to four receive antennas ($R_1$–$R_4$). After channel calibration (detailed in Section 4.2), they are concatenated to form an eight-channel virtual array for angle estimation.
Let the single-chirp complex-valued baseband be $x[n, a]$, where $n = 0, \ldots, 255$ is the fast-time sampling index (corresponding to range bins within a single chirp pulse) and $a = 0, \ldots, 3$ indexes the receive channels $R_1$ through $R_4$. First, after removing DC from the complex baseband signal, a Hanning window is applied and a 512-point range FFT is performed:
$$X_r[k, a] = \sum_{n=0}^{255} \left( x[n, a] - \bar{x}[a] \right) w[n]\, e^{-j 2\pi k n / 512}, \quad k = 0, \ldots, 511,$$
where $w[n]$ is the Hanning window, defined as $w[n] = 0.5 - 0.5\cos(2\pi n / N)$ with $N = 256$.
Subsequently, a Taylor window is applied to the four receive channels and a 128-point angle FFT is performed with fftshift to center the angle axis:
$$X_a[k, \ell] = \sum_{a=0}^{3} X_r[k, a]\, h[a]\, e^{-j 2\pi a \ell / 128}, \quad \ell = 0, \ldots, 127,$$
where $h[a]$ is the Taylor window, designed to suppress sidelobes in the angle spectrum and generated using standard parameters from the SciPy library ($\bar{n} = 4$, SLL = −30 dB).
Finally, the range dimension is resampled to 128 bins limited to the 20 m range. The 128-point angle FFT spans an approximately ±90° field of view, from which the central ±60° region is extracted, resulting in a 128 × 128 RA map. The complete process is shown in Figure 1.
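For clarity, a minimal NumPy/SciPy sketch of this single-chirp RA construction is given below. It follows the range-FFT and angle-FFT equations above; the exact crop indices for the ±60° sector and the final resampling to 128 angle bins are implementation details not specified in the text, so they are treated here as illustrative assumptions.

```python
import numpy as np
from scipy.signal.windows import taylor

def single_chirp_ra_map(x):
    """Sketch: build an RA map from one chirp of complex baseband data.

    x: complex array of shape (256, 4) = (fast-time samples, receive channels).
    """
    n_samples, n_rx = x.shape                     # 256, 4

    # DC removal, Hanning window, 512-point range FFT (range-FFT equation above).
    x = x - x.mean(axis=0, keepdims=True)
    w = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(n_samples) / n_samples)
    Xr = np.fft.fft(x * w[:, None], n=512, axis=0)

    # Taylor window across channels, 128-point angle FFT, fftshift to center 0 deg.
    h = taylor(n_rx, nbar=4, sll=30)              # -30 dB sidelobe level
    Xa = np.fft.fftshift(np.fft.fft(Xr * h[None, :], n=128, axis=1), axes=1)

    # Keep the first 128 range bins (<= 20 m) and the central ~2/3 of the
    # +/-90 deg angle axis (+/-60 deg); resampling back to 128 angle bins is omitted.
    ra = np.abs(Xa[:128, :])
    lo, hi = 21, 107                              # illustrative crop indices
    return ra[:, lo:hi]
```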

4. Methodology

Building upon the radar platform established in Section 3, this section presents the proposed radar–camera fusion framework for cross-device object detection. The AWR1642’s time-division multiplexing multiple-input multiple-output (TDM-MIMO) array structure requires channel calibration (Section 4.2), while device-specific signal statistics necessitate Z-score standardization (Section 4.3) for cross-device compatibility. The framework consists of four key components: (1) geometric alignment for radar–camera coordinate fusion, (2) TDM-MIMO channel calibration for cross-device consistency, (3) chirp-wise Z-score standardization for statistical alignment, and (4) GTA for temporal enhancement. These components enable effective cross-modal knowledge transfer from camera to radar while maintaining radar-only inference. The overall architecture is shown in Figure 2.

4.1. Radar–Camera Geometric Alignment for Cross-Modal Fusion

To achieve spatial consistency in radar–camera fusion, this subsection introduces the geometric alignment module that establishes precise coordinate correspondence between radar and camera. This module includes (1) coordinate system definition and extrinsic calibration via an improved PnP algorithm, (2) bidirectional projection for cross-modal mapping, and (3) closed-loop geometric consistency verification. The overall geometric relationship is shown in Figure 3.

4.1.1. Coordinate System Definition and Extrinsic Calibration

The precise calibration of camera and radar sensors is essential for cross-modal fusion [44]. We define right-handed world $\mathcal{W}$, camera $C$, and radar $R$ coordinate systems. The radar–camera extrinsics are represented by the rotation matrix $\mathbf{R}_{CR}$ and translation vector $\mathbf{t}_{CR}$, and the camera intrinsics are denoted by $\mathbf{K}$. The transformation is
$$\mathbf{X}_C = \mathbf{R}_{CR} \mathbf{X}_R + \mathbf{t}_{CR}.$$
The camera follows a pinhole imaging model with lens distortion already corrected. Its projection is
$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \mathbf{K} \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix}, \qquad \pi(\mathbf{X}_C) = \begin{bmatrix} X_c / Z_c \\ Y_c / Z_c \end{bmatrix},$$
where $s$ denotes the scale factor in homogeneous projection and $\pi(\cdot)$ denotes the perspective division that maps homogeneous coordinates to inhomogeneous pixel coordinates.
From Equations (5) and (6), the projection matrix is
$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \mathbf{P} \begin{bmatrix} \mathbf{X}_R \\ 1 \end{bmatrix}, \qquad \mathbf{P} = \mathbf{K} \left[ \mathbf{R}_{CR} \mid \mathbf{t}_{CR} \right] \in \mathbb{R}^{3 \times 4}.$$
Extrinsic calibration uses the PnP algorithm based on 2D–3D point matching. It introduces the ground plane prior $\Pi: \mathbf{n}^{T} \mathbf{X}_{\mathcal{W}} + d = 0$ (commonly $Z = Z_0$) and installation constraints such as the yaw angle, pitch angle, and camera height range to optimize solution stability. Extrinsic estimation is achieved by minimizing the reprojection error:
$$\min_{\mathbf{R}, \mathbf{t}} \sum_{i=1}^{N} \rho\left( \left\| \mathbf{p}_i - \pi\!\left( \mathbf{K} \left( \mathbf{R} \mathbf{P}_{w,i} + \mathbf{t} \right) \right) \right\|^2 \right),$$
where $\rho(\cdot)$ is the Huber loss function used to suppress outliers. To reduce inter-frame jitter, this paper introduces a temporal smoothing regularization term in video sequences:
$$\mathcal{L}_{\text{smooth}} = \sum_{t} \left\| \log\!\left( \mathbf{R}_{t-1}^{T} \mathbf{R}_{t} \right) \right\|_2^2 + \sum_{t} \left\| \mathbf{t}_{t} - \mathbf{t}_{t-1} \right\|_2^2.$$
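A minimal sketch of this extrinsic estimation is shown below, assuming OpenCV's PnP solver for initialization and SciPy's robust least squares (Huber loss, playing the role of $\rho$) for refinement; the ground-plane and installation constraints and the temporal smoothing term are omitted for brevity, and all variable names are illustrative.

```python
import cv2
import numpy as np
from scipy.optimize import least_squares

def calibrate_extrinsics(pts_3d, pts_2d, K, dist=None):
    """Sketch: estimate (R, t) from N 3D-2D correspondences with robust refinement.

    pts_3d: (N, 3) world points, pts_2d: (N, 2) pixels, K: (3, 3) camera intrinsics.
    """
    dist = np.zeros(5) if dist is None else dist

    # Initialization with RANSAC-based PnP to reject gross outliers.
    ok, rvec, tvec, _ = cv2.solvePnPRansac(
        pts_3d.astype(np.float64), pts_2d.astype(np.float64), K, dist)
    assert ok, "PnP initialization failed"

    def residuals(x):
        proj, _ = cv2.projectPoints(pts_3d, x[:3], x[3:], K, dist)
        return (proj.reshape(-1, 2) - pts_2d).ravel()

    # Huber loss suppresses the remaining outliers during refinement.
    x0 = np.hstack([rvec.ravel(), tvec.ravel()])
    sol = least_squares(residuals, x0, loss="huber", f_scale=2.0)
    R, _ = cv2.Rodrigues(sol.x[:3])
    return R, sol.x[3:]
```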

4.1.2. Bidirectional Projection and Visibility Determination

After completing extrinsic estimation, the geometric relationship between radar and camera can be established through bidirectional projection. As shown in Figure 4, the radar range–azimuth grid $(r, \theta)$ can be mapped to three-dimensional coordinates,
$$\mathbf{X}_R(r, \theta) = \begin{bmatrix} r \sin\theta \\ 0 \\ r \cos\theta \end{bmatrix},$$
and the corresponding pixel coordinates can be calculated with the projection matrix $\mathbf{P}$:
$$\mathbf{u}(r, \theta) = \pi\!\left( \mathbf{K} \left( \mathbf{R}_{CR} \mathbf{X}_R(r, \theta) + \mathbf{t}_{CR} \right) \right).$$
The inverse projection process is used to recover radar coordinates from image coordinates. For a pixel point $\mathbf{u} = [u, v]^T$, its normalized line-of-sight direction is
$$\mathbf{d}_C = \frac{\mathbf{K}^{-1} [u, v, 1]^T}{\left\| \mathbf{K}^{-1} [u, v, 1]^T \right\|}.$$
To convert image coordinates $(u, v)$ to radar coordinates $(r, \theta)$, we leverage the ground plane assumption $Z_w = Z_0$ (typically $Z_0 = 0$ for the road surface). The depth estimation is constrained by this prior: the line-of-sight ray $\mathbf{d}_C$ intersects the ground plane $\Pi$ at a unique 3D point $\mathbf{P}_w$, which is then transformed to radar coordinates via $\mathbf{P}_R = \mathbf{R}_{CR}^{-1} (\mathbf{P}_w - \mathbf{t}_{CR})$. The range and azimuth are computed as $r = \sqrt{P_{R,x}^2 + P_{R,z}^2}$ and $\theta = \arctan(P_{R,x} / P_{R,z})$. This geometric constraint eliminates depth ambiguity for ground-plane objects.
The intersection point of the line of sight with the ground plane $\Pi$ is mapped back to the radar frame through a rigid transformation to obtain $(r, \theta)$. If the projection point falls outside the image plane, satisfies $Z_c \leq 0$, or lies in semantically invalid regions such as the sky or a car roof, the visibility mask is set to $M(r, \theta) = 0$; otherwise, it is 1.
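The two projection directions can be summarized by the short sketch below. It assumes the ground plane is expressed in camera coordinates as $\mathbf{n}_c \cdot \mathbf{X}_C + d_c = 0$ (a simplification of the world-frame plane used above) and that the extrinsics $(\mathbf{R}_{CR}, \mathbf{t}_{CR})$ and intrinsics $\mathbf{K}$ are known; names are illustrative.

```python
import numpy as np

def pixel_to_radar(u, v, K, R_cr, t_cr, n_c, d_c):
    """Sketch: back-project a pixel to radar (r, theta) via the ground-plane prior."""
    # Normalized line-of-sight direction in the camera frame.
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    ray /= np.linalg.norm(ray)

    # Intersect the ray X_C = s * ray with the plane n_c . X_C + d_c = 0.
    denom = n_c @ ray
    if abs(denom) < 1e-9:
        return None                               # ray parallel to the ground plane
    s = -d_c / denom
    if s <= 0:
        return None                               # ray does not hit the ground ahead
    X_c = s * ray

    # Camera frame -> radar frame (inverse of the rigid transform above).
    X_r = R_cr.T @ (X_c - t_cr)
    return np.hypot(X_r[0], X_r[2]), np.arctan2(X_r[0], X_r[2])

def radar_to_pixel(r, theta, K, R_cr, t_cr):
    """Sketch: forward projection of a radar grid cell (r, theta) to a pixel."""
    X_r = np.array([r * np.sin(theta), 0.0, r * np.cos(theta)])
    X_c = R_cr @ X_r + t_cr
    if X_c[2] <= 0:
        return None                               # behind the camera: not visible
    uvw = K @ X_c
    return uvw[0] / uvw[2], uvw[1] / uvw[2]
```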

4.1.3. Closed-Loop Geometric Consistency Verification

To further evaluate and correct calibration errors, this paper introduces a closed-loop geometric consistency verification mechanism. For the bottom-center pixel $(u, v)$ of a camera detection box, the world coordinate $\mathbf{P}_w$ is obtained through back-projection and intersection with the ground plane and then projected back to the image plane through Equation (7) to obtain $(u', v')$. The reprojection error is calculated as
$$e = \sqrt{ (u' - u)^2 + (v' - v)^2 }.$$
If both the median error and the 95th percentile error are below a threshold (e.g., 10 pixels), geometric consistency is considered good. Otherwise, extrinsic fine-tuning is triggered. This closed-loop error can both serve as a regularization term for extrinsic optimization and be used to automatically filter out pseudo-labels, enhancing the reliability of cross-modal supervision. Figure 5 illustrates the closed-loop verification process.

4.2. TDM-MIMO Channel Calibration

TDM-MIMO radar uses the time-division multiplexing of transmitters combined with multiple receivers to construct a virtual array, thereby improving the angular resolution [21]. However, phase inconsistencies between the receive channels and transmit-phase deviations can lead to distorted angle spectra, widened main lobes, and false peaks, reducing the detection accuracy and cross-device consistency. This subsection introduces a multi-range-bin joint TDM-MIMO channel calibration method to correct the receive-channel complex gains and transmit-phase deviations, improving the cross-device array consistency.

4.2.1. TDM-MIMO Virtual Array Phase Model

For a two-transmit four-receive system (AWR1642), transmitters $T_1$ and $T_2$ transmit sequentially in time, and each transmission is received by all four receivers, forming two groups of four-channel data. After merging, this forms an eight-channel virtual array: $[T_1R_1, T_1R_2, T_1R_3, T_1R_4, T_2R_1, T_2R_2, T_2R_3, T_2R_4]$. Let the complex gain of receive channel $n$ be $\alpha_n e^{i\phi_n}$ (where $\alpha_n$ represents the amplitude gain due to hardware mismatch and $\phi_n$ denotes the phase offset from channel delay) and the transmit-phase deviation of transmitter $T_m$ be $\Delta\psi_m$ (representing the transmit-path phase error from TDM switching). Then, the response of virtual channel $(T_m, R_n)$ for a target at azimuth angle $\theta$ can be written as
$$V_{mn}(\theta) = \alpha_n\, e^{\,i \left( \phi_n + \Delta\psi_m + k_\theta d_{mn}(\theta) \right)},$$
where $d_{mn}(\theta)$ represents the equivalent baseline of the virtual channel and $k_\theta = 2\pi \sin\theta / \lambda$. If $\alpha_n$ and $\phi_n$ differ between channels or $\Delta\psi_m$ differs between transmitters, phase coherence is destroyed, resulting in distorted angle spectra.

4.2.2. Multi-Range-Bin Joint Calibration Algorithm

To correct receive-channel complex gains and transmit-phase deviations, we propose a multi-range-bin joint calibration method. Corner reflector targets are placed at known angles, and complex echoes $\mathbf{S}[k, c]$ from multiple range bins $k$ are extracted for each channel $c$. The calibration parameters $\{\alpha_n, \phi_n, \Delta\psi_m\}$ are jointly optimized via least squares:
$$\min_{\{\alpha_n, \phi_n, \Delta\psi_m\}} \sum_{k, \theta} \left\| \mathbf{a}(\theta) - \operatorname{diag}\!\left( \boldsymbol{\alpha} \circ e^{i \boldsymbol{\phi}} \right) \mathbf{G}(\Delta\boldsymbol{\psi})\, \mathbf{S}_k(\theta) \right\|^2,$$
where $\mathbf{a}(\theta)$ is the ideal steering vector, $\mathbf{G}$ encodes the transmit-phase shifts, and $\circ$ denotes the element-wise product. The corrected virtual array signal is
$$V_{mn}' = \frac{V_{mn}}{\alpha_n\, e^{\,i \left( \phi_n + \Delta\psi_m \right)}}.$$
This multi-range joint approach averages out noise and multipath effects, proving more robust than single-point calibration. As shown in Table 2, the method reduces the main lobe width by 42.3% and spurious peaks by 67.6%, significantly improving the angle spectrum clarity.
Figure 6 illustrates the calibration effect: the target becomes sharply localized with significantly reduced clutter after calibration.
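As a concrete illustration, the sketch below shows a simplified single-target variant of this calibration in NumPy: per-virtual-channel complex corrections are estimated from corner-reflector snapshots averaged over several range bins and then divided out. The full joint separation of receive gains and transmit phases in the objective above is not reproduced here, and all names are illustrative.

```python
import numpy as np

def estimate_channel_corrections(S, theta, d, lam):
    """Sketch: per-channel complex corrections from reflector snapshots.

    S:   complex array (n_bins, n_virt), echoes of a reflector at known azimuth theta,
         taken from several range bins around its peak.
    d:   (n_virt,) virtual-element positions, lam: wavelength.
    Returns corrections c such that the calibrated snapshot is V * c.
    """
    # Ideal steering vector for the virtual array at the known angle.
    a = np.exp(1j * 2 * np.pi * d * np.sin(theta) / lam)          # (n_virt,)

    # Averaging over range bins suppresses noise/multipath, as in the text;
    # the per-channel complex gain is then the ratio to the ideal response.
    g = S.mean(axis=0) / a

    # Correction divides out the estimated gain, referenced to channel 0.
    c = 1.0 / g
    return c / c[0]

def apply_corrections(V, c):
    """Apply corrections to a (n_range, n_virt) virtual-array snapshot."""
    return V * c[None, :]
```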

4.3. Chirp-Wise Z-Score Standardization

Radar signals exhibit significant statistical variability due to hardware differences, environmental clutter, and temporal drift. To achieve cross-device statistical alignment, this paper proposes a chirp-wise Z-score standardization strategy that normalizes each chirp independently before angle FFT.
For the single-chirp range spectrum $\mathbf{X}_r \in \mathbb{C}^{N_r \times N_a}$ after the range FFT (where $N_r$ is the number of range bins and $N_a$ the number of receive channels), standardization is applied:
$$\mathbf{X}_r' = \frac{\mathbf{X}_r - \mu}{\sigma + \epsilon},$$
where $\mu$ and $\sigma$ are the mean and standard deviation computed separately for the real and imaginary parts of $\mathbf{X}_r$, and $\epsilon = 10^{-6}$ provides numerical stability. This chirp-level standardization ensures consistent statistical properties independent of the device or environment.
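A minimal NumPy sketch of this per-chirp standardization, with real and imaginary parts normalized separately, is shown below.

```python
import numpy as np

def chirpwise_zscore(Xr, eps=1e-6):
    """Sketch: standardize one chirp's range spectrum Xr (complex, shape (N_r, N_a)).

    Real and imaginary parts are normalized separately, decoupling device-specific
    offsets and scaling from the signal before the angle FFT.
    """
    re = (Xr.real - Xr.real.mean()) / (Xr.real.std() + eps)
    im = (Xr.imag - Xr.imag.mean()) / (Xr.imag.std() + eps)
    return re + 1j * im
```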
Experiments show that chirp-wise Z-score standardization stabilizes the complex signal standard deviation to 0.037 ± 0.005 across all sequences and devices, with the 99th percentile at 0.19, effectively mitigating domain shift and improving model generalization. Table 3 shows the statistical standardization results.
As shown in Table 3, before standardization, AWR1642 exhibits Real std = 0.187 and Imag std = 0.203, significantly deviating from ROD2021’s 0.037 baseline. After chirp-wise Z-score standardization, both converge to 0.037, achieving perfect alignment with the ROD2021 reference. The amplitude P99 converges from 0.523 to 0.19, matching ROD2021 and indicating a controlled dynamic range. This statistical alignment enables cross-device generalization without distribution shift.

4.4. Global–Temporal Adapter Module

Radar object detection in dynamic driving scenarios faces temporal inconsistency challenges, where the detection results exhibit frame-to-frame jitter and false alarms due to signal noise, clutter, and insufficient temporal correlation modeling. While conventional 3D convolutions provide implicit temporal aggregation through local receptive fields, they lack explicit inter-frame dependency modeling and struggle to capture long-range temporal patterns. Transformer-based temporal attention [24,25] can model global dependencies but introduces quadratic complexity $O(N^2)$ and a heavy parameter overhead, limiting real-time deployment on automotive platforms.
To address these limitations, this paper proposes a lightweight GTA module with only 0.12 M parameters (a 0.96% increase over the baseline E-RODNet). The GTA adopts a dual-path architecture combining global spatial gating with explicit three-frame temporal attention, achieving linear complexity $O(N)$ while maintaining effective temporal modeling. The global gating path adaptively recalibrates channel-wise features based on the spatial context, while the three-point temporal attention explicitly captures inter-frame correlations through efficient roll-based frame alignment. The GTA module is inserted after the SFF block in the E-RODNet baseline, where temporal features from multiple frames have been preliminarily fused but lack explicit attention-based refinement. The GTA module architecture is shown in Figure 7.

4.4.1. Global Gating Mechanism

The global gating component uses global average pooling to extract the spatial context, followed by two 1 × 1 convolutions with ReLU activation and a sigmoid gate to produce channel-wise attention weights:
$$\mathbf{g} = \sigma\!\left( \mathbf{W}_2 \cdot \mathrm{ReLU}\left( \mathbf{W}_1 \cdot \mathrm{GAP}(\mathbf{F}) \right) \right),$$
where $\mathbf{F} \in \mathbb{R}^{C \times H \times W}$ is the input feature map, $\mathrm{GAP}$ is global average pooling, $\mathbf{W}_1, \mathbf{W}_2$ are the 1 × 1 convolution weights, and $\sigma$ is the sigmoid function. The gated feature is then
$$\mathbf{F}_{\text{gated}} = \mathbf{g} \odot \mathbf{F},$$
where ⊙ denotes element-wise multiplication.
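A minimal PyTorch sketch of this gating path is given below; the channel reduction ratio inside the bottleneck is an assumption, as it is not specified in the text.

```python
import torch
import torch.nn as nn

class GlobalGate(nn.Module):
    """Sketch of the global gating path: GAP -> 1x1 conv -> ReLU -> 1x1 conv -> sigmoid."""

    def __init__(self, channels, reduction=4):    # reduction ratio is an assumption
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, f):                          # f: (B, C, H, W)
        g = self.fc(self.gap(f))                   # channel-wise gate, (B, C, 1, 1)
        return f * g                               # element-wise recalibration
```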

4.4.2. Three-Point Temporal Attention

To model temporal dependencies explicitly, the GTA module applies a three-point temporal attention mechanism. For a sequence of three consecutive frames $\{F_{t-1}, F_t, F_{t+1}\}$, features are shifted along the temporal dimension using roll operations (circular temporal shifts) and then concatenated and processed by a 3 × 3 convolution:
$$\mathbf{A}_{\text{temp}} = \mathrm{Conv}_{3\times 3}\!\left( \mathrm{Concat}\left[ \mathrm{Roll}_{+1}(F_{t-1}),\, F_t,\, \mathrm{Roll}_{-1}(F_{t+1}) \right] \right),$$
where $\mathrm{Roll}_{\pm 1}$ denotes a circular shift along the temporal dimension by ±1 position (similar to NumPy's roll function), aligning adjacent frames before concatenation for temporal correlation modeling.
A Softmax operation is applied to generate temporal attention weights, which are then multiplied with the center frame to produce temporally enhanced features. This explicit three-frame attention enables the network to capture motion patterns and reduce detection jitter.
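A minimal PyTorch sketch of this three-point attention is shown below, assuming features shaped (B, C, T, H, W) and a softmax taken along the temporal dimension; the exact tensor layout and softmax axis in the original implementation may differ.

```python
import torch
import torch.nn as nn

class ThreePointTemporalAttention(nn.Module):
    """Sketch of roll-based three-point temporal attention."""

    def __init__(self, channels):
        super().__init__()
        # 3x3 spatial conv over the concatenated (previous, current, next) features.
        self.conv = nn.Conv2d(3 * channels, channels, kernel_size=3, padding=1)

    def forward(self, x):                          # x: (B, C, T, H, W)
        prev = torch.roll(x, shifts=1, dims=2)     # frame t-1 aligned to position t
        nxt = torch.roll(x, shifts=-1, dims=2)     # frame t+1 aligned to position t

        b, c, t, h, w = x.shape
        cat = torch.cat([prev, x, nxt], dim=1)                        # (B, 3C, T, H, W)
        cat = cat.permute(0, 2, 1, 3, 4).reshape(b * t, 3 * c, h, w)

        attn = self.conv(cat).reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
        attn = torch.softmax(attn, dim=2)          # attention weights along time
        return x * attn                            # temporally enhanced features
```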
The GTA module adds only 0.12 M parameters (0.96% increase) yet significantly improves the temporal stability and detection performance.

4.5. Overall Architecture and Training Strategy

Overall architecture: As shown in Figure 2, the framework builds upon the E-RODNet encoder–decoder architecture with GTA modules inserted after each encoder stage. GTA insertion can be controlled via a configuration file switch.
Loss function: We employ the smooth L1 loss, which provides better robustness to annotation noise compared to binary cross-entropy:
$$\mathcal{L}_{\text{SmoothL1}}(x) = \begin{cases} 0.5\, x^2, & \text{if } |x| < 1 \\ |x| - 0.5, & \text{otherwise.} \end{cases}$$
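With the transition point at $|x| = 1$, this matches PyTorch's built-in smooth L1 loss with its default beta, as the short usage sketch below illustrates (tensor shapes are illustrative).

```python
import torch
import torch.nn as nn

criterion = nn.SmoothL1Loss(beta=1.0)        # 0.5*x^2 for |x| < 1, |x| - 0.5 otherwise

pred = torch.rand(2, 3, 16, 128, 128)        # predicted confidence maps (illustrative shape)
target = torch.rand(2, 3, 16, 128, 128)      # pseudo-label confidence maps
loss = criterion(pred, target)
```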
Annotation generation and post-processing: YOLOv5s generates camera pseudo-labels, which are mapped to the RA space via geometric projection. In post-processing, location-based non-maximum suppression (L-NMS) using object location similarity (OLS) suppresses duplicate detections, and person+bicycle co-occurrence identifies the cyclist category to ensure training data quality and category consistency.

4.6. Method Limitations and Design Trade-Offs

While the proposed framework addresses cross-device generalization and temporal consistency, several inherent limitations and design trade-offs require discussion. The geometric alignment module (Section 4.1) requires accurate initial extrinsic calibration between camera and radar sensors, typically achieved through corner reflector-based procedures in controlled environments. Calibration errors propagate through the projection pipeline, affecting the pseudo-label quality. We mitigate this through closed-loop verification (Section 4.1.3) with reprojection error thresholds, but manual calibration remains a prerequisite for deployment.
Although the GTA module adds minimal overhead (0.96% parameters, 0.59% giga floating-point operations per second (GFLOPs)), the complete framework including preprocessing (TDM-MIMO calibration, Z-score standardization) and inference achieves 303.87 ms per frame on Tesla V100 GPUs. This satisfies typical autonomous driving requirements (10 Hz operation), but resource-constrained automotive processors may require model compression techniques.
The framework relies on YOLOv5s-generated camera pseudo-labels for training supervision. Camera detection failures in adverse conditions (heavy rain, dense fog, extreme lighting) directly impact radar network training quality. This dependency represents a trade-off in the radar–camera co-training paradigm: using camera semantics for radar feature learning requires camera reliability during the training phase. However, inference remains camera-free, maintaining radar’s all-weather advantage.
The TDM-MIMO calibration (Section 4.2) is tailored to two-transmit four-receive configurations (AWR1642, AWR1843). Radars with different antenna layouts (e.g., single-transmit, cascaded arrays) would require adapted calibration procedures. The chirp-wise Z-score standardization generalizes across devices but assumes similar signal processing pipelines (range–Doppler–angle FFT). This specialization enables effective cross-device alignment within the common TDM-MIMO category but limits direct applicability to fundamentally different radar architectures. These limitations represent deliberate design choices balancing performance, generalizability, and deployment practicality.

5. Experiments

This section evaluates the proposed radar–camera fusion framework on the public ROD2021 dataset [43]. We first describe the dataset and implementation details and then present quantitative results through comparison with baselines. Finally, cross-device verification and module analysis validate the effectiveness of the proposed components.

5.1. Dataset and Implementation Details

ROD2021 Dataset: The dataset contains synchronized radar and camera data captured with the TI AWR1843 radar and a monocular camera. It includes 10,158 training samples and 3289 validation samples with annotations for three categories: pedestrian, cyclist, and car. The radar operates at 77 GHz with a range resolution of 0.06 m and an angular resolution of approximately 0.15°.
Implementation: The framework is implemented in PyTorch 1.10.2 with Python 3.7.12, and all experiments are conducted on NVIDIA Tesla V100 GPUs (Nvidia Corporation, Santa Clara, CA, USA). For our AWR1642 radar data collection and processing (cross-device verification in Section 5.3), we use the TI AWR1642 radar (Texas Instruments, Dallas, TX, USA), and camera pseudo-labels are generated using YOLOv5s v6.0. The lightweight E-RODNet [11] architecture serves as the baseline, with an input sequence length of 16 frames. The GTA module is inserted after the SFF block. Training uses the Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$) with an initial learning rate of $1 \times 10^{-4}$ and a cosine annealing learning rate scheduler over 25 epochs. The batch size is set to 6. Data augmentation is applied with probability 0.5 for temporal flip and probabilities of 0.25 each for horizontal flip, Gaussian noise (amplitude ≈ 0.1 × std), and combined noise with horizontal flip.
Evaluation Metrics: AP is evaluated using object location similarity (OLS) instead of the traditional intersection over union (IoU), as radar detections are represented as point locations in range–azimuth coordinates rather than bounding boxes. OLS is defined as $\mathrm{OLS} = \exp\!\left( -d^2 / \left( 2 (s \kappa_{\text{cls}})^2 \right) \right)$, where $d$ is the distance (in meters) between the detection and the ground truth, $s$ is the object distance from the radar representing scale information, and $\kappa_{\text{cls}}$ is a per-class constant for error tolerance. For each category, precision–recall curves are computed by varying the detection confidence thresholds. We use OLS thresholds from 0.5 to 0.9 with a step of 0.05 and calculate the AP at each threshold. AP represents the average precision across all OLS thresholds from 0.5 to 0.9. The final AP is the mean across all three categories (pedestrian, cyclist, car). We also report the AP at specific OLS thresholds (AP0.5, AP0.7, AP0.9) to assess the localization accuracy under different error tolerances.
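For concreteness, a short sketch of the OLS computation is given below; the per-class tolerance constant used here is an illustrative assumption, not the value used in the official evaluation.

```python
import numpy as np

def ols(det_xy, gt_xy, gt_range, kappa_cls):
    """Sketch: object location similarity between one detection and one ground truth.

    det_xy, gt_xy: (x, y) positions in metres in the range-azimuth plane,
    gt_range:      ground-truth distance from the radar (the scale term s),
    kappa_cls:     per-class error-tolerance constant.
    """
    d2 = np.sum((np.asarray(det_xy) - np.asarray(gt_xy)) ** 2)
    return np.exp(-d2 / (2.0 * (gt_range * kappa_cls) ** 2))

# Illustrative usage: a detection 0.5 m away from a car at 10 m range.
print(ols((0.5, 10.0), (0.0, 10.0), gt_range=10.0, kappa_cls=0.1))
```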

5.2. Comparison with Baselines

Table 4 shows the quantitative comparison of the proposed method with baseline approaches on the ROD2021 validation set. The results demonstrate that the proposed method achieves 86.32% average precision, outperforming the E-RODNet baseline by 22.88 percentage points with only a 0.12 M parameter increase (0.96%). Compared to the heavy T-RODNet (159.7M parameters), our method achieves competitive performance with 92.1% fewer parameters, making it suitable for real-time on-board deployment.
In terms of computational efficiency, the GTA module adds only 2.06 GFLOPs (+0.59%) compared to the baseline E-RODNet, while the inference time remains almost unchanged (303.87 ms vs. 304.27 ms, −0.40 ms). This demonstrates that the proposed lightweight design achieves a significant performance improvement (86.32% AP vs. 63.44% AP) with minimal computational overhead. The proposed fusion framework maintains a lightweight design, with only a 0.96% parameter increase compared to the baseline E-RODNet (12.40 M to 12.52 M). Compared to T-RODNet, with 83.83% AP and 159.70 M parameters, our framework achieves 86.32% AP with only 7.8% of its parameters, demonstrating superior parameter efficiency with a ratio of 12.8:1.
Qualitative results are shown in Figure 8, which presents representative detection results at different time frames on the BMS1001 test sequence. Each subfigure shows four components: a camera image (top left), the predicted RA heatmap (top right), the detection overlay (bottom left), and the ground truth RA map (bottom right). The proposed fusion framework with GTA-enhanced temporal modeling demonstrates consistent and accurate vehicle detection across different time instances.

5.3. AWR1642 Cross-Device Verification

To verify the generalizability of the proposed preprocessing method across different radar hardware, comprehensive signal-level quality verification was conducted on AWR1642 radar data. The AWR1642 uses a TDM-MIMO configuration with two transmit and four receive antennas and a sweep bandwidth of 448.784 MHz (range resolution 0.334 m), which represent significant hardware parameter differences from the AWR1843 radar used in the ROD2021 dataset (range resolution 0.06 m). These hardware differences lead to inconsistent signal characteristics, making cross-device verification particularly challenging. The AWR1642 data were collected from campus road scenarios (five sequences, total 5000 frames) covering diverse environmental conditions, including vehicles, pedestrians, and cyclists at various ranges (2–20 m) and speeds (0–15 m/s).

5.3.1. Quantitative Analysis of Preprocessing Effects

Table 5 presents the comprehensive preprocessing results for the AWR1642 data, demonstrating the effectiveness of each processing stage in achieving cross-device statistical alignment.
Analyzing Table 5 row by row reveals the progressive improvement achieved by each preprocessing stage. First, the raw AWR1642 data exhibit significantly degraded angle spectrum quality compared to the ROD2021 baseline, with a main lobe width of 12.8 bins indicating severe angular spreading that would cause target localization errors of approximately ±12°. The presence of 3.7 spurious peaks per frame indicates high false alarm rates from uncalibrated array phase errors. The statistical distributions also show device-specific characteristics, with real/imaginary standard deviations of 0.187/0.203, substantially different from the ROD2021 reference values.
After applying TDM-MIMO channel calibration, the main lobe width decreases sharply from 12.8 bins to 7.4 bins, representing a 42.3% reduction. This improvement directly translates to an enhanced angular resolution, from ±12° to approximately ±7°, enabling more precise target localization. The peak power gain of +5.2 dB indicates that calibration restores coherent combining across virtual array elements, effectively recovering the array processing gain that was lost due to phase misalignment. Most significantly, spurious peaks decrease dramatically from 3.7 to 1.2 per frame (67.6% reduction), demonstrating that the multi-range-bin joint calibration successfully corrects transmit–receive-channel phase deviations that previously manifested as false targets.
The subsequent chirp-wise Z-score standardization stage achieves perfect statistical alignment. The real and imaginary standard deviations both converge to 0.037, precisely matching the ROD2021 reference values. This alignment is crucial because neural networks trained on ROD2021 data expect inputs with specific statistical properties. The convergence from (0.187, 0.203) to (0.037, 0.037) represents an 80.2% and 81.8% variance reduction, respectively, effectively standardizing the input distribution despite the substantial hardware differences. Notably, the angle spectrum metrics (lobe width, peak power, spurious peaks) remain stable during standardization, confirming that Z-score standardization does not degrade the spatial resolution, while successfully achieving statistical domain alignment.

5.3.2. Visual Analysis of Cross-Device Standardization

Figure 9 provides qualitative visualization of the standardization effect on a representative frame containing a vehicle target at 8.5 m range.
Examining Figure 9, the camera image (a) shows the ground truth scenario with a vehicle at medium range in a campus road environment. The corresponding standardized RA heatmap (b) demonstrates several critical qualities. First, the vehicle target appears as a well-localized high-intensity region centered at approximately 8–9 m range and 0° azimuth, with clear boundaries and minimal angular spreading. Second, the background clutter exhibits uniform low-intensity characteristics without significant artifacts or false alarms, validating the effectiveness of spurious peak suppression. Third, the embedded signal quality metrics ( σ = 0.037, P99 = 0.20) confirm quantitative alignment: the standard deviation exactly matches the ROD2021 reference, while the 99th percentile value of 0.20 indicates well-controlled dynamic range without saturation.
Comparing the visual quality with typical ROD2021 RA maps, the AWR1642 standardized output exhibits comparable signal-to-clutter characteristics despite originating from different radar hardware. The target-to-background contrast ratio exceeds 15 dB, and the angular localization precision appears consistent with the ROD2021 data. This visual similarity is essential in enabling model transferability, as convolutional neural networks are sensitive to both statistical and spatial feature distributions.

5.3.3. Cross-Device Generalization Implications

The successful alignment of the AWR1642 data to the ROD2021 statistical properties validates two key claims. First, the proposed preprocessing pipeline is hardware-agnostic and generalizes across radars with substantially different specifications (2.6× bandwidth difference). Second, the combination of multi-range-bin joint calibration and chirp-wise Z-score standardization effectively decouples device-specific signal characteristics from semantic target information, enabling knowledge transfer from models trained on one radar platform to another without retraining. This cross-device capability is particularly valuable for practical deployment scenarios, where different autonomous vehicles may be equipped with different radar sensors yet require consistent perception performance.

5.4. Impact of GTA Temporal Window Size

To quantify the performance improvement brought by the GTA module’s temporal modeling, we first establish the single-frame baseline. The E-RODNet baseline (without GTA) processes each input frame independently, achieving 63.44% AP. This represents the single-frame detection performance without explicit temporal attention. To evaluate the impact of the temporal window size, we compare the three-point neighborhood ($\{t-1, t, t+1\}$) with a five-point window ($\{t-2, t-1, t, t+1, t+2\}$). As shown in Table 6, adding GTA with a three-point temporal window improves the AP to 86.32% (+22.88 percentage points over the baseline), demonstrating the significant contribution of temporal modeling in reducing detection jitter and improving consistency. The five-point window only improves the AP to 86.51%, with the parameter count increased to 12.72 M (+0.2 M). The three-point window already captures short-term temporal dependencies effectively. Given that the ROD2021 dataset has a frame rate of 10 Hz (100 ms interval), the three-point window covers a ±200 ms time range. For targets at typical vehicle speeds of 20 m/s, the displacement within 200 ms is only 4 m, making the three-point window sufficient for capturing motion patterns.

6. Conclusions

This paper presents a systematic framework for robust radar–camera fusion, specifically addressing the challenges of hardware heterogeneity and temporal consistency. To overcome device-dependent signal variations, we established a comprehensive calibration and standardization pipeline. This methodology was explicitly validated by aligning a low-resolution TI AWR1642 sensor to the signal standards of the AWR1843 platform, successfully bridging a 5.6× gap in range resolution and unifying signal distributions (std ≈ 0.037). Building upon these standardized inputs, our proposed lightweight GTA module was introduced to enhance the temporal modeling capabilities. Experimental results on the ROD2021 dataset demonstrate the efficacy of this feature-level optimization, where the framework achieves 86.32% average precision, outperforming the baseline by 22.88 percentage points with a 0.96% parameter overhead. In summary, this work delivers a dual contribution: a generalizable preprocessing strategy for cross-device hardware alignment and a high-performance temporal attention mechanism for robust object detection in autonomous systems.
While the current framework demonstrates strong performance on ROD2021 and cross-device validation on AWR1642 hardware, several directions merit future investigation. Evaluation on additional public benchmarks and self-collected datasets under diverse scenarios would further strengthen the generalizability claims. Developing automatic calibration procedures and online adaptation mechanisms for sensor degradation would improve the deployment practicality. Exploring reduced supervision methods and deployment optimization techniques would facilitate broader adoption in production autonomous driving systems.

Author Contributions

Research design, M.F. (Manman Fan) and X.W.; data acquisition, M.F. (Manman Fan) and Y.Y.; writing—original draft preparation, M.F. (Manman Fan); writing—review and editing, M.F. (Mingcheng Fu), Y.Y., Y.G. and X.L.; supervision, X.W.; funding acquisition, X.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Hainan Province Science and Technology Special Fund under Grants ZDYF2023GXJS168, ZDYF2023GXJS159, and ZDYF2025SHFZ058 and in part by the Scientific Research Projects of Higher Education Institutions in Hainan Province under Grant Hnky2025-1.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this article are included in the manuscript. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ROD: Radar Object Detection
TDM-MIMO: Time-Division Multiplexing Multiple-Input Multiple-Output
RA: Range–Azimuth
GTA: Global–Temporal Adapter
SFF: Short-Sequence Fusion
CFAR: Constant False Alarm Rate
AP: Average Precision
OLS: Object Location Similarity
L-NMS: Location-Based Non-Maximum Suppression
CNN: Convolutional Neural Network
PnP: Perspective-n-Point
FFT: Fast Fourier Transform
BEV: Bird's Eye View
GFLOPs: Giga Floating-Point Operations Per Second

References

  1. Yin, T.; Zhou, X.; Krähenbühl, P. Center-based 3D object detection and tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 11784–11793. [Google Scholar]
  2. Wang, Y.; Jiang, Z.; Li, Y.; Hwang, J.N.; Xing, G.; Liu, H. RODNet: A real-time radar object detection network cross-supervised by camera-radar fused object 3D localization. IEEE J. Sel. Top. Signal Process. 2021, 15, 954–967. [Google Scholar] [CrossRef]
  3. Nobis, F.; Geisslinger, M.; Weber, M.; Betz, J.; Lienkamp, M. A deep learning-based radar and camera sensor fusion architecture for object detection. In Proceedings of the 2019 Sensor Data Fusion: Trends, Solutions, Applications (SDF), Bonn, Germany, 15–17 October 2019; pp. 1–7. [Google Scholar]
  4. Yao, S.; Guan, R.; Huang, X.; Li, Z.; Sha, X.; Yue, Y.; Lim, E.G.; Seo, H.; Man, K.L.; Zhu, X.; et al. Radar-camera fusion for object detection and semantic segmentation in autonomous driving: A comprehensive review. IEEE Trans. Intell. Veh. 2023, 9, 2094–2128. [Google Scholar] [CrossRef]
  5. Rohling, H. Radar CFAR thresholding in clutter and multiple target situations. IEEE Trans. Aerosp. Electron. Syst. 1983, AES-19, 608–621. [Google Scholar] [CrossRef]
  6. Major, B.; Fontijne, D.; Ansari, A.; Sukhavasi, R.T.; Gowaikar, R.; Hamilton, M.; Lee, S.; Grzechnik, S.; Subramanian, S. Vehicle detection with automotive radar using deep learning on range-azimuth-doppler tensors. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 3556–3563. [Google Scholar]
  7. Kim, J.; Seong, M.; Choi, J.W. CRT-Fusion: Camera, radar, temporal fusion using motion information for 3D object detection. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vancouver, BC, Canada, 9–15 December 2024; Volume 37, pp. 108625–108648. [Google Scholar]
  8. Wang, Y.; Jiang, Z.; Li, Y.; Hwang, J.-N.; Xing, G.; Liu, H. Temporal enhancement for 3D radar object detection. IEEE Trans. Intell. Transp. Syst. 2022, 23, 21613–21627. [Google Scholar]
  9. Li, J.; Zhang, H.; Wu, Z.; Xu, T. Radar-camera fusion for 3D object detection with aggregation transformer. Appl. Intell. 2024, 54, 10627–10639. [Google Scholar] [CrossRef]
  10. Hazarika, A.; Fotouhi, M.; Rahmati, M.; Arabshahi, P.; Cheng, W. MultiFusionNet: Spatio-Temporal Camera-Radar Fusion in Dynamic Urban Environments. IEEE Sens. J. 2025, 25, 33175–33185. [Google Scholar] [CrossRef]
  11. Xu, W.; Lu, P.; Zhao, Y. E-RODNet: Lightweight approach to object detection by vehicular millimeter-wave radar. IEEE Sens. J. 2024, 24, 16584–16595. [Google Scholar] [CrossRef]
  12. Lin, Z.; Liu, Z.; Xia, Z.; Wang, X.; Wang, Y.; Qi, S.; Dong, Y.; Dong, N.; Zhang, L.; Zhu, C. RCBEVDet: Radar-camera fusion in bird’s eye view for 3D object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 14928–14937. [Google Scholar]
  13. Hwang, J.J.; Kretzschmar, H.; Manela, J.; Rafferty, S.; Armstrong-Crews, N.; Chen, T.; Anguelov, D. CramNet: Camera-radar fusion with ray-constrained cross-attention for robust 3D object detection. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 388–405. [Google Scholar]
  14. Zheng, L.; Ma, Z.; Zhu, X.; Tan, B.; Li, S.; Long, K.; Sun, W.; Chen, S.; Zhang, L.; Wan, M.; et al. RCFusion: Fusing 4-D radar and camera with bird’s-eye view features for 3-D object detection. IEEE Trans. Instrum. Meas. 2023, 72, 5001914. [Google Scholar] [CrossRef]
  15. Xiong, X.; Chai, S.; Zhu, X.; Ma, Z.; Li, S.; Long, K.; Sun, W.; Chen, S.; Zhang, L.; Zhu, B. LXL: LiDAR excluded lean 3D object detection with 4D imaging radar and camera fusion. IEEE Trans. Intell. Veh. 2024, 9, 79–92. [Google Scholar] [CrossRef]
  16. He, Z.; Zhang, L. Domain adaptive object detection via asymmetric tri-way Faster-RCNN. In Proceedings of the European Conference on Computer Vision (ECCV), Glasgow, UK, 23–28 August 2020; pp. 309–324. [Google Scholar]
  17. Lee, J.; Lee, J.H.; Lee, J.K.; Kim, J.S.; Kwon, S.; Kim, S. Dual adaptive data augmentation for 3D object detection. In Proceedings of the 2023 14th International Conference on Information and Communication Technology Convergence (ICTC), Jeju Island, Republic of Korea, 11–13 October 2023; pp. 1732–1737. [Google Scholar]
  18. Decourt, C.; VanRullen, R.; Salle, D.; Oberlin, T. A recurrent CNN for online object detection on raw radar frames. IEEE Trans. Intell. Transp. Syst. 2024, 25, 13432–13441. [Google Scholar] [CrossRef]
  19. Texas Instruments. AWR1642 Single-Chip 77- and 79-GHz FMCW Radar Sensor; Technical Report SWRS213; Texas Instruments: Dallas, TX, USA, 2017. [Google Scholar]
  20. Richards, M.A. Fundamentals of Radar Signal Processing; McGraw-Hill: New York, NY, USA, 2005. [Google Scholar]
  21. Li, J.; Stoica, P. MIMO Radar Signal Processing; Wiley: Hoboken, NJ, USA, 2008. [Google Scholar]
  22. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  23. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention (MICCAI), Munich, Germany, 5–9 October 2015; pp. 234–241. [Google Scholar]
  24. Jiang, T.; Zhuang, L.; An, Q.; Wang, J.; Xiao, K.; Wang, A. T-RODNet: Transformer for vehicular millimeter-wave radar object detection. IEEE Trans. Instrum. Meas. 2022, 72, 5004012. [Google Scholar] [CrossRef]
  25. Jia, F.; Li, C.; Bi, S.; Qian, J.; Wei, L.; Sun, G. TC-Radar: Transformer-CNN hybrid network for millimeter-wave radar object detection. Remote Sens. 2024, 16, 2881. [Google Scholar] [CrossRef]
  26. Dalbah, Y.; Lahoud, J.; Cholakkal, H. TransRadar: Adaptive-directional transformer for real-time multi-view radar semantic segmentation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–7 January 2024; pp. 353–362. [Google Scholar]
  27. Shi, W.; Zhu, Z.; Zhang, K.; Chen, H.; Yu, Z.; Zhu, Y. SMIFormer: Learning spatial feature representation for 3D object detection from 4D imaging radar via multi-view interactive transformers. Sensors 2023, 23, 9429. [Google Scholar] [CrossRef] [PubMed]
  28. Paek, D.H.; Kong, S.H.; Wijaya, K.T. K-Radar: 4D radar object detection for autonomous driving in various weather conditions. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 3819–3829. [Google Scholar]
  29. Chae, H.; Kang, C.M.; Kim, B.D.; Kim, J.; Chung, C.C.; Choi, J.W. Towards robust 3D object detection with LiDAR and 4D radar fusion in various weather conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 28116–28126. [Google Scholar]
  30. Liu, Q.; Zheng, Y.; Chen, Z.; Lyu, Y.; Zhu, C.; Huang, K. Deep instance segmentation with automotive radar detection points. IEEE Trans. Intell. Veh. 2023, 8, 84–94. [Google Scholar] [CrossRef]
  31. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. PointNet++: Deep hierarchical feature learning on point sets in a metric space. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 5099–5108. [Google Scholar]
  32. Meyer, M.; Kuschk, G. Automotive radar dataset for deep learning based 3D object detection. In Proceedings of the 2019 16th European Radar Conference (EuRAD), Paris, France, 2–4 October 2019; pp. 129–132. [Google Scholar]
  33. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8. Available online: https://github.com/ultralytics/ultralytics (accessed on 1 January 2025).
  34. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  35. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  36. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR), Virtual, 3–7 May 2021. [Google Scholar]
  37. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  38. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  39. Lin, J.; Gan, C.; Han, S. TSM: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; pp. 7083–7093. [Google Scholar]
  40. Bai, S.; Kolter, J.Z.; Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar] [CrossRef]
  41. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  42. Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. CvT: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual, 11–17 October 2021; pp. 22–31. [Google Scholar]
  43. Wang, Y.; Wang, G.; Hsu, H.M.; Liu, H.; Hwang, J.N. Rethinking of radar’s role: A camera-radar dataset and systematic annotator via coordinate alignment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Virtual, 19–25 June 2021; pp. 2815–2824. [Google Scholar]
  44. Zhang, Y.; Carballo, A.; Yang, H.; Takeda, K. Perception and sensing for autonomous vehicles under adverse weather conditions: A survey. ISPRS J. Photogramm. Remote Sens. 2023, 196, 146–177. [Google Scholar] [CrossRef]
Figure 1. Single-chirp RA construction process. (a) Signal processing flow; (b) processed RA map.
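For readers who want to reproduce the single-chirp RA construction summarized in Figure 1a, the sketch below follows the standard FMCW processing chain: a range FFT over the fast-time samples of each receive channel, then an angle FFT across the (virtual) channels. The windowing choice, bin counts, and function names are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def single_chirp_ra_map(adc_cube, n_angle_bins=128):
    """Illustrative single-chirp range-azimuth construction.

    adc_cube: complex array of shape (n_rx, n_samples), one chirp's ADC
    samples per (virtual) receive channel. Window and bin counts are
    assumptions, not the paper's exact configuration.
    """
    n_rx, n_samples = adc_cube.shape

    # Range FFT along fast time, with a Hann window to suppress range sidelobes.
    win = np.hanning(n_samples)
    range_fft = np.fft.fft(adc_cube * win, axis=1)               # (n_rx, n_samples)

    # Angle FFT across the receive channels, zero-padded to n_angle_bins.
    angle_fft = np.fft.fftshift(
        np.fft.fft(range_fft, n=n_angle_bins, axis=0), axes=0)   # (n_angle_bins, n_samples)

    # Magnitude in dB gives the RA map: rows = angle bins, columns = range bins.
    ra_map = 20.0 * np.log10(np.abs(angle_fft) + 1e-12)
    return ra_map
```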
Figure 2. Overall architecture of the proposed radar–camera fusion framework. (Left): Training phase with radar–camera fusion (camera provides YOLOv5s pseudo-labels projected to RA space via geometric alignment). (Middle): Preprocessing pipeline (TDM-MIMO calibration and chirp-wise Z-score standardization). (Right): E-RODNet encoder–decoder with GTA modules inserted after SFF blocks. During inference, only radar input is required (camera-free operation).
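As one concrete illustration of how the camera pseudo-labels in Figure 2 can supervise the radar branch, the sketch below renders projected detections as Gaussian confidence maps on the RA grid, in the style commonly used for RODNet-type training. The grid size, Gaussian width, and class count are assumed values, not the paper's configuration.

```python
import numpy as np

def render_confmap(labels, n_range=128, n_azimuth=128, sigma=2.0, n_classes=3):
    """Render RODNet-style Gaussian confidence maps from pseudo-labels.

    labels: iterable of (class_id, range_bin, azimuth_bin) obtained by
    projecting camera detections into the RA grid. sigma (in bins) and the
    grid dimensions are illustrative assumptions.
    """
    confmap = np.zeros((n_classes, n_range, n_azimuth), dtype=np.float32)
    rr, aa = np.meshgrid(np.arange(n_range), np.arange(n_azimuth), indexing="ij")
    for cls, r_bin, a_bin in labels:
        gauss = np.exp(-((rr - r_bin) ** 2 + (aa - a_bin) ** 2) / (2.0 * sigma ** 2))
        confmap[cls] = np.maximum(confmap[cls], gauss)  # keep the strongest response per cell
    return confmap
```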
Figure 3. Radar–camera coordinate systems and geometric mapping. 𝒲: world coordinate system; 𝒞: camera coordinate system; ℛ: radar coordinate system; (u, v): pixel coordinates; K: camera intrinsic matrix; R and t: rotation and translation from radar to camera; P = K[R | t]: projection matrix.
Figure 4. Bidirectional projection between camera and radar coordinates. 𝒲: world frame; 𝒞: camera frame; ℛ: radar frame; P_w: 3D point in world coordinates; (u, v): pixel coordinates; (r, θ): radar range–azimuth coordinates; Π: ground plane. The blue highlighted regions in the camera and radar frames indicate the valid field-of-view overlap where cross-modal projection is geometrically feasible. P denotes the projection matrix, P = K[R | t].
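The forward direction of the mapping in Figures 3 and 4 can be sketched as follows: a radar (r, θ) detection is converted to Cartesian coordinates, transformed by the extrinsics (R, t), and projected through the intrinsics K, i.e., via P = K[R | t]. The axis convention and the fixed target height below are assumptions for illustration, not the paper's exact geometry.

```python
import numpy as np

def radar_to_pixel(r, theta, K, R, t, target_height=0.0):
    """Project a radar (range, azimuth) detection into pixel coordinates.

    Assumes a radar frame with x to the right, y along boresight, z up
    (an assumed convention). K is the 3x3 intrinsic matrix; (R, t) are the
    radar-to-camera extrinsics, so the full projection is P = K [R | t].
    """
    # Polar -> Cartesian in the radar frame.
    p_radar = np.array([r * np.sin(theta), r * np.cos(theta), target_height])

    # Radar frame -> camera frame, then perspective projection.
    p_cam = R @ p_radar + t
    uvw = K @ p_cam
    u, v = uvw[0] / uvw[2], uvw[1] / uvw[2]
    return u, v
```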
Figure 5. Closed-loop geometric consistency verification for radar–camera calibration. The process flow is as follows: (1) extract camera detection box bottom center (u, v); (2) back-project to 3D world point P_w via ground plane intersection; (3) forward-project to reprojection point (u′, v′); (4) compute reprojection error e = ‖(u′, v′) − (u, v)‖₂. The horizontal axis shows the frame index, and the vertical axis shows the reprojection error in pixels. Errors below 10 pixels indicate good calibration quality.
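A minimal sketch of the closed-loop check in Figure 5, assuming a calibrated pinhole camera and a known ground plane: the detection box bottom center is back-projected onto the ground plane, the resulting world point (in practice passing through the radar RA grid, which is why the error is nonzero) is reprojected, and the pixel discrepancy gives e. The world-to-camera convention for (R, t) is an assumption.

```python
import numpy as np

def backproject_to_ground(u, v, K, R_wc, t_wc, ground_z=0.0):
    """Intersect the camera ray through pixel (u, v) with the plane z = ground_z.

    (R_wc, t_wc) map world -> camera; the camera centre in world coordinates
    is C = -R_wc.T @ t_wc. These conventions are assumptions.
    """
    cam_center = -R_wc.T @ t_wc
    ray_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    ray_world = R_wc.T @ ray_cam
    s = (ground_z - cam_center[2]) / ray_world[2]   # scale factor to reach the ground plane
    return cam_center + s * ray_world

def reprojection_error(u, v, P_w, K, R_wc, t_wc):
    """Forward-project the (possibly radar-grid-snapped) world point P_w and
    compare with the original pixel: e = ||(u', v') - (u, v)||_2."""
    p_cam = R_wc @ P_w + t_wc
    uvw = K @ p_cam
    u_p, v_p = uvw[0] / uvw[2], uvw[1] / uvw[2]
    return np.hypot(u_p - u, v_p - v)
```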
Figure 6. Range–azimuth map comparison before (left) and after (right) TDM-MIMO channel calibration.
Figure 7. GTA module architecture. The module consists of two paths. (1) Global Gating: GAP → Conv 1×1 → ReLU → Conv 1×1 → Sigmoid produces channel attention weights g. (2) Three-Point Temporal Attention: roll-based temporal shift + concatenation + Conv 3×3 + Softmax generates the temporal attention A_temp. The outputs are fused via element-wise multiplication and residual addition. Note: * denotes element-wise multiplication (Hadamard product); × denotes repetition count. Conv3d(k, 1, 1) denotes a 3D convolution with kernel size k along the temporal dimension and a 1×1 spatial kernel.
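The following PyTorch sketch reflects one plausible reading of the GTA block in Figure 7 for 5D radar features of shape (B, C, T, H, W): a global gating path (GAP → 1×1 conv → ReLU → 1×1 conv → Sigmoid) and a three-point temporal attention path (roll-based {t−1, t, t+1} shift, concatenation, 3×3 spatial convolution, softmax), fused by element-wise multiplication and a residual connection. The channel widths, reduction ratio, and exact fusion order are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GTA(nn.Module):
    """Sketch of a global-temporal adapter for 5D radar features (B, C, T, H, W)."""

    def __init__(self, channels, reduction=4):
        super().__init__()
        hidden = max(channels // reduction, 4)
        # Path 1: global gating (GAP -> 1x1 conv -> ReLU -> 1x1 conv -> Sigmoid).
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool3d(1),
            nn.Conv3d(channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(hidden, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Path 2: three-point temporal attention. A 3x3 spatial conv maps the
        # concatenated {t-1, t, t+1} features to 3 attention maps (softmax over them).
        self.temporal_conv = nn.Conv3d(3 * channels, 3, kernel_size=(1, 3, 3), padding=(0, 1, 1))

    def forward(self, x):                           # x: (B, C, T, H, W)
        prev = torch.roll(x, shifts=1, dims=2)      # t-1 (roll along the temporal axis)
        nxt = torch.roll(x, shifts=-1, dims=2)      # t+1
        stacked = torch.cat([prev, x, nxt], dim=1)              # (B, 3C, T, H, W)
        attn = F.softmax(self.temporal_conv(stacked), dim=1)    # (B, 3, T, H, W)
        temporal = attn[:, 0:1] * prev + attn[:, 1:2] * x + attn[:, 2:3] * nxt
        g = self.gate(x)                            # (B, C, 1, 1, 1) channel weights
        return x + g * temporal                     # gated fusion with residual addition
```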
Figure 8. Detection results on ROD2021 dataset. Each row shows one time frame with four columns from left to right: RGB camera image, RF-predicted RA heatmap, ground truth (GT) annotation, and GTA (ours) detection result. Detection boxes are color-coded by object class: red for pedestrians, green for cyclists, and blue for cars. Note: Some detected targets may be more prominent in radar than in the corresponding RGB image due to radar–camera field-of-view differences, temporal synchronization offsets, or radar’s ability to detect partially occluded metallic objects.
Figure 9. Cross-device statistical standardization on AWR1642. (a) Camera image showing vehicle target; (b) standardized RA heatmap with signal quality metrics (σ = 0.037, P99 = 0.20) matching ROD2021 reference values.
Table 1. AWR1642 configuration parameters.
| Parameter | Value |
|---|---|
| Operating Frequency | 77 GHz |
| Frequency Slope | 8.014 MHz/μs |
| Sweep Bandwidth | 448.784 MHz |
| Sampling Frequency | 5 MHz |
| ADC Samples | 256 |
| Chirp Duration | 56 μs |
| Chirps per Frame | 128 |
| Frame Rate | 10 Hz |
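The configuration in Table 1 implies the usual FMCW resolution limits; the quick check below uses the standard relations ΔR = c/(2B) and R_max = f_s·c/(2S). These are derived values for orientation only, not figures quoted in the paper.

```python
# Sanity check of the AWR1642 configuration in Table 1 using standard FMCW relations.
c = 3e8                      # speed of light, m/s
bandwidth = 448.784e6        # sweep bandwidth, Hz
slope = 8.014e12             # frequency slope, Hz/s (8.014 MHz/us)
fs = 5e6                     # ADC sampling frequency, Hz

range_resolution = c / (2 * bandwidth)   # ~0.33 m per range bin
max_range = fs * c / (2 * slope)         # ~93.6 m maximum unambiguous range

print(f"range resolution = {range_resolution:.2f} m, max range = {max_range:.1f} m")
```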
Table 2. TDM-MIMO channel calibration results. Note: “bins” refers to discrete angular frequency bins in the FFT spectrum; each bin corresponds to approximately 1.4° angular spacing.
| Metric | Before | After | Unit |
|---|---|---|---|
| Main lobe width | 12.8 | 7.4 | bins |
| Peak sidelobe ratio | −18.3 | −27.0 | dB |
| Target peak power | 0.0 | +5.2 | dB |
| Angle RMSE | 2.1 | 1.2 | ° |
| Spurious peaks | 3.7 | 1.2 | /frame |
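One common way to obtain the kind of calibration gains reported in Table 2 is to estimate a complex gain/phase correction per virtual channel from a strong reference target (e.g., a corner reflector at a known angle) and apply it before the angle FFT. The sketch below follows that generic recipe; it is not necessarily the exact procedure used in this work.

```python
import numpy as np

def estimate_channel_coeffs(ref_range_fft, ref_range_bin):
    """Estimate per-channel complex corrections from a reference target.

    ref_range_fft: (n_virtual_rx, n_range_bins) range spectrum of a frame
    containing a strong reference target (e.g., a corner reflector at
    boresight). Normalising every channel to the first one equalises the
    per-channel gain and phase offsets.
    """
    peaks = ref_range_fft[:, ref_range_bin]   # complex response per channel at the target bin
    return peaks[0] / peaks

def apply_channel_calibration(range_fft, coeffs):
    """Multiply each virtual channel by its correction before the angle FFT."""
    return range_fft * coeffs[:, None]
```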
Table 3. Statistical standardization results.
| Statistic | Before | After | ROD2021 |
|---|---|---|---|
| Real std | 0.187 | 0.037 | 0.037 |
| Imag std | 0.203 | 0.037 | 0.037 |
| Amplitude P99 | 0.523 | 0.19 | 0.19 |
| Amplitude Mean | 0.142 | 0.048 | 0.045 |
| Value Range | [−0.98, 1.23] | [−0.65, 0.65] | [−0.65, 0.65] |
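A minimal sketch of chirp-wise Z-score standardization consistent with the statistics in Table 3: each chirp's real and imaginary parts are standardized independently and rescaled so that their standard deviations match the ROD2021 reference value (0.037). The array layout and the epsilon guard are assumptions.

```python
import numpy as np

def chirpwise_standardize(ra_chirps, target_std=0.037):
    """Chirp-wise Z-score standardisation of complex RA data.

    ra_chirps: complex array of shape (n_chirps, n_range, n_azimuth).
    Real and imaginary parts are standardised per chirp and then rescaled
    to target_std (0.037 in Table 3); the axis layout is an assumption.
    """
    real, imag = ra_chirps.real, ra_chirps.imag
    axes = (1, 2)                                   # statistics computed per chirp
    real_z = (real - real.mean(axis=axes, keepdims=True)) / (real.std(axis=axes, keepdims=True) + 1e-12)
    imag_z = (imag - imag.mean(axis=axes, keepdims=True)) / (imag.std(axis=axes, keepdims=True) + 1e-12)
    return (real_z + 1j * imag_z) * target_std
```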
Table 4. Performance comparison on ROD2021 dataset.
| Method | AP (%) | Params (M) | GFLOPs | Inference (ms) |
|---|---|---|---|---|
| E-RODNet (Baseline) | 63.44 | 12.40 | 348.54 | 304.27 |
| T-RODNet | 83.83 | 159.70 | – | – |
| Ours (E-RODNet + GTA) | 86.32 | 12.52 | 350.60 | 303.87 |
| Improvement vs. Baseline | +22.88 | +0.12 (+0.96%) | +2.06 (+0.59%) | −0.40 (−0.13%) |
Table 5. Preprocessing results on AWR1642 data.
| Method | Lobe Width (bins) | Peak Power (dB) | Spurious Peaks (/frame) | Real std | Imag std |
|---|---|---|---|---|---|
| Raw Data | 12.8 | 0.0 | 3.7 | 0.187 | 0.203 |
| +Calibration | 7.4 | +5.2 | 1.2 | 0.187 | 0.203 |
| +Standardization | 7.4 | +5.2 | 1.2 | 0.037 | 0.037 |
| ROD2021 Ref | – | – | – | 0.037 | 0.037 |
Table 6. GTA temporal window size comparison.
| Window | AP (%) | Params (M) |
|---|---|---|
| Baseline (E-RODNet, no GTA) | 63.44 | 12.40 |
| 3-point ({t−1, t, t+1}) | 86.32 | 12.52 |
| 5-point ({t−2, …, t+2}) | 86.51 | 12.72 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
