TC–Radar: Transformer–CNN Hybrid Network for Millimeter-Wave Radar Object Detection
Abstract
1. Introduction
- A Transformer is integrated with a CNN to build the TC–Radar model, yielding a significant improvement in radar detection accuracy. Comprehensive experiments on the CRUW and CARRADA datasets demonstrate that TC–Radar outperforms existing methods in both detection accuracy and robustness.
- The cross-attention (CA) and dense information fusion block (DIFB) modules establish an effective feature fusion mechanism between the encoder and decoder stages (a generic cross-attention sketch is given after this list).
- TC–Radar shows strong practical value under low-light conditions, compensating for the information lost by camera-based assisted-driving systems and exhibiting robust perception capabilities.
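As a rough illustration of the encoder–decoder fusion described above, the sketch below applies standard multi-head cross-attention in which the decoder feature map supplies the queries and the encoder skip feature of the same resolution supplies the keys and values. This is a generic PyTorch sketch, not the paper's exact CA module; the tensor shapes, head count, and flattening scheme are assumptions.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Generic cross-attention fusion between decoder and encoder features.

    Queries come from the decoder feature map; keys/values come from the
    encoder skip feature of matching spatial size (an assumption here).
    """

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, dec_feat: torch.Tensor, enc_feat: torch.Tensor) -> torch.Tensor:
        # dec_feat, enc_feat: (B, C, H, W) feature maps of the same size.
        b, c, h, w = dec_feat.shape
        q = dec_feat.flatten(2).transpose(1, 2)   # (B, H*W, C) queries from decoder
        kv = enc_feat.flatten(2).transpose(1, 2)  # (B, H*W, C) keys/values from encoder
        fused, _ = self.attn(q, kv, kv)
        fused = self.norm(fused + q)              # residual connection on the query path
        return fused.transpose(1, 2).reshape(b, c, h, w)

# Example with hypothetical shapes:
# dec = torch.randn(1, 64, 32, 32); enc = torch.randn(1, 64, 32, 32)
# out = CrossAttentionFusion(64)(dec, enc)  # -> (1, 64, 32, 32)
```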
2. Background and Related Work
2.1. Radar Signal Processing Chain
2.2. Problem Formulation
2.3. Radar Semantic Segmentation and Object Detection
2.4. Vision Transformer (ViT)-Based Perception Tasks
3. Methodology
3.1. Overall Architecture
3.1.1. Encoder
3.1.2. Decoder
3.2. Dense Information Fusion Block (DIFB)
3.3. Cross-Attention (CA)
3.4. Loss Function
4. Experiments
4.1. Dataset
- Object Detection on CRUW: The sensor suite comprises a camera and a 77 GHz FMCW millimeter-wave radar. A total of 3.5 h of driving data was captured in street, campus, highway, and parking-lot scenarios at 30 frames per second, amounting to roughly 400 K frames. The dataset also includes challenging conditions such as nighttime and glare, which test a model's comprehensive perception ability, making CRUW a high-quality dataset for radar-based object detection. Annotations in the RA view are generated automatically through cross-modal supervision and cover three classes: pedestrian, cyclist, and car. In contrast to the commonly used intersection-over-union (IoU) metric, CRUW adopts an anchor-free measure, object localization similarity (OLS), defined as
  $$\mathrm{OLS} = \exp\!\left(-\frac{d^{2}}{2\,(S\,\kappa_{\mathrm{cls}})^{2}}\right),$$
  where $d$, $S$, and $\kappa_{\mathrm{cls}}$ denote the distance between the two points in the RA image, the radial distance from the target to the radar, and the category-specific tolerance constant (defined by the mean object size within each class), respectively. The evaluation metrics are AP and average recall (AR), computed over OLS thresholds from 0.5 to 0.9 in steps of 0.05 (a numerical sketch of OLS follows this list).
- Semantic Segmentation on CARRADA: The CARRADA dataset comprises 12,666 frames annotated via a semi-automated process. Compared with CRUW, it offers less scene variety but provides complete RAD data, sliced into Range–Azimuth (RA), Range–Doppler (RD), and Azimuth–Doppler (AD) 2D tensors, which supply realistic material for further multi-view information fusion research. Across its 30 sequences, four classes are annotated, pedestrian, cyclist, car, and background, with dense mask annotations in both the RA and RD views. In contrast to the OLS evaluation used for CRUW, this dataset adopts IoU and Dice as metrics:
  $$\mathrm{IoU} = \frac{|A \cap B|}{|A \cup B|}, \qquad \mathrm{Dice} = \frac{2\,|A \cap B|}{|A| + |B|},$$
  where $A$ and $B$ denote the ground-truth (GT) and predicted masks, respectively (both metrics are also illustrated in the sketch after this list).
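To make the two evaluation protocols concrete, the following minimal NumPy sketch implements OLS as defined above and IoU/Dice on binary masks. The numerical values in the example (the distances, the tolerance constant 0.1, and the mask shapes) are purely illustrative; the actual per-class constants come from the CRUW toolkit.

```python
import numpy as np

def ols(d, s, kappa_cls):
    """Object localization similarity between a detection and a GT point.

    d         : distance between the two points in the RA image
    s         : radial distance from the target to the radar (scale term)
    kappa_cls : per-class tolerance constant (mean object size of the class)
    """
    return np.exp(-d**2 / (2.0 * (s * kappa_cls) ** 2))

def iou(gt_mask, pred_mask):
    """Intersection over union between two binary masks."""
    inter = np.logical_and(gt_mask, pred_mask).sum()
    union = np.logical_or(gt_mask, pred_mask).sum()
    return inter / union if union > 0 else 0.0

def dice(gt_mask, pred_mask):
    """Dice coefficient between two binary masks."""
    inter = np.logical_and(gt_mask, pred_mask).sum()
    total = gt_mask.sum() + pred_mask.sum()
    return 2.0 * inter / total if total > 0 else 0.0

if __name__ == "__main__":
    # OLS: a detection 0.5 m from a GT point that lies 10 m from the radar,
    # with an illustrative tolerance constant of 0.1.
    print(f"OLS  = {ols(d=0.5, s=10.0, kappa_cls=0.1):.3f}")

    # IoU / Dice on two overlapping 2-D masks.
    gt = np.zeros((8, 8), dtype=bool); gt[2:6, 2:6] = True
    pr = np.zeros((8, 8), dtype=bool); pr[3:7, 3:7] = True
    print(f"IoU  = {iou(gt, pr):.3f}")   # 9 / 23 ≈ 0.391
    print(f"Dice = {dice(gt, pr):.3f}")  # 18 / 32 = 0.563
```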
4.2. Implementation Details
4.3. Comparison with SOTAs
4.4. Ablation Study
4.5. Super Test
5. Conclusions
Author Contributions
Funding
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
1. Xiang, C.; Feng, C.; Xie, X.; Shi, B.; Lu, H.; Lv, Y.; Yang, M.; Niu, Z. Multi-Sensor Fusion and Cooperative Perception for Autonomous Driving: A Review. IEEE Intell. Transp. Syst. Mag. 2023, 15, 36–58.
2. Fernandes, D.; Silva, A.; Névoa, R.; Simões, C.; Gonzalez, D.; Guevara, M.; Novais, P.; Monteiro, J.; Melo-Pinto, P. Point-cloud based 3D object detection and classification methods for self-driving applications: A survey and taxonomy. Inf. Fusion 2021, 68, 161–191.
3. Venon, A.; Dupuis, Y.; Vasseur, P.; Merriaux, P. Millimeter Wave FMCW RADARs for Perception, Recognition and Localization in Automotive Applications: A Survey. IEEE Trans. Intell. Veh. 2022, 7, 533–555.
4. Zhou, Y.; Liu, L.; Zhao, H.; López-Benítez, M.; Yu, L.; Yue, Y. Towards Deep Radar Perception for Autonomous Driving: Datasets, Methods, and Challenges. Sensors 2022, 22, 4208.
5. Ignatious, H.A.; El-Sayed, H.; Kulkarni, P. Multilevel Data and Decision Fusion Using Heterogeneous Sensory Data for Autonomous Vehicles. Remote Sens. 2023, 15, 2256.
6. Rohling, H. Radar CFAR Thresholding in Clutter and Multiple Target Situations. IEEE Trans. Aerosp. Electron. Syst. 1983, AES-19, 608–621.
7. Ravindran, R.; Santora, M.J.; Jamali, M.M. Camera, LiDAR, and Radar Sensor Fusion Based on Bayesian Neural Network (CLR-BNN). IEEE Sens. J. 2022, 22, 6964–6974.
8. Wang, Y.; Deng, J.; Li, Y.; Hu, J.; Liu, C.; Zhang, Y.; Ji, J.; Ouyang, W.; Zhang, Y. Bi-LRFusion: Bi-Directional LiDAR-Radar Fusion for 3D Dynamic Object Detection. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 13394–13403.
9. Montañez, O.J.; Suarez, M.J.; Fernandez, E.A. Application of Data Sensor Fusion Using Extended Kalman Filter Algorithm for Identification and Tracking of Moving Targets from LiDAR–Radar Data. Remote Sens. 2023, 15, 3396.
10. Wang, Z.; Miao, X.; Huang, Z.; Luo, H. Research of Target Detection and Classification Techniques Using Millimeter-Wave Radar and Vision Sensors. Remote Sens. 2021, 13, 1064.
11. Yang, Y.; Wang, X.; Wu, X.; Lan, X.; Su, T.; Guo, Y. A Robust Target Detection Algorithm Based on the Fusion of Frequency-Modulated Continuous Wave Radar and a Monocular Camera. Remote Sens. 2024, 16, 2225.
12. Zhang, A.; Nowruzi, F.E.; Laganiere, R. RADDet: Range-Azimuth-Doppler based Radar Object Detection for Dynamic Road Users. In Proceedings of the 2021 18th Conference on Robots and Vision (CRV), Burnaby, BC, Canada, 26–28 May 2021; pp. 95–102.
13. Gao, X.; Xing, G.; Roy, S.; Liu, H. RAMP-CNN: A Novel Neural Network for Enhanced Automotive Radar Object Recognition. IEEE Sens. J. 2021, 21, 5119–5132.
14. Zhang, R.; Cheng, L.; Wang, S.; Lou, Y.; Gao, Y.; Wu, W.; Ng, D.W.K. Integrated Sensing and Communication with Massive MIMO: A Unified Tensor Approach for Channel and Target Parameter Estimation. IEEE Trans. Wirel. Commun. 2024, 1.
15. Jin, Y.; Hoffmann, M.; Deligiannis, A.; Fuentes-Michel, J.C.; Vossiek, M. Semantic Segmentation-Based Occupancy Grid Map Learning with Automotive Radar Raw Data. IEEE Trans. Intell. Veh. 2024, 9, 216–230.
16. Xu, Y.; Li, W.; Yang, Y.; Ji, H.; Lang, Y. Superimposed Mask-Guided Contrastive Regularization for Multiple Targets Echo Separation on Range–Doppler Maps. IEEE Trans. Instrum. Meas. 2023, 72, 5028712.
17. Tatarchenko, M.; Rambach, K. Histogram-based Deep Learning for Automotive Radar. In Proceedings of the 2023 IEEE Radar Conference (RadarConf23), San Antonio, TX, USA, 1–4 May 2023; pp. 1–6.
18. Meng, C.; Duan, Y.; He, C.; Wang, D.; Fan, X.; Zhang, Y. mmPlace: Robust Place Recognition with Intermediate Frequency Signal of Low-Cost Single-Chip Millimeter Wave Radar. IEEE Robot. Autom. Lett. 2024, 9, 4878–4885.
19. Ouaknine, A.; Newson, A.; Pérez, P.; Tupin, F.; Rebut, J. Multi-View Radar Semantic Segmentation. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 15651–15660.
20. Zou, H.; Xie, Z.; Ou, J.; Gao, Y. TransRSS: Transformer-based Radar Semantic Segmentation. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 6965–6972.
21. Wang, Y.; Jiang, Z.; Li, Y.; Hwang, J.N.; Xing, G.; Liu, H. RODNet: A Real-Time Radar Object Detection Network Cross-Supervised by Camera-Radar Fused Object 3D Localization. IEEE J. Sel. Top. Signal Process. 2021, 15, 954–967.
22. Ju, B.; Yang, W.; Jia, J.; Ye, X.; Chen, Q.; Tan, X.; Sun, H.; Shi, Y.; Ding, E. DANet: Dimension Apart Network for Radar Object Detection. In Proceedings of the 2021 International Conference on Multimedia Retrieval (ICMR '21), New York, NY, USA, 30 August 2021; pp. 533–539.
23. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
24. Decourt, C.; VanRullen, R.; Salle, D.; Oberlin, T. A Recurrent CNN for Online Object Detection on Raw Radar Frames. IEEE Trans. Intell. Transp. Syst. 2024, 1–10.
25. Jia, F.; Tan, J.; Lu, X.; Qian, J. Radar Timing Range–Doppler Spectral Target Detection Based on Attention ConvLSTM in Traffic Scenes. Remote Sens. 2023, 15, 4150.
26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Long Beach, CA, USA, 2017; Volume 30.
27. Liu, M.; Chai, Z.; Deng, H.; Liu, R. A CNN-Transformer Network With Multiscale Context Aggregation for Fine-Grained Cropland Change Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 4297–4306.
28. Gao, L.; Liu, H.; Yang, M.; Chen, L.; Wan, Y.; Xiao, Z.; Qian, Y. STransFuse: Fusing Swin Transformer and Convolutional Neural Network for Remote Sensing Image Semantic Segmentation. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 10990–11003.
29. Chen, J.; Yuan, Z.; Peng, J.; Chen, L.; Huang, H.; Zhu, J.; Liu, Y.; Li, H. DASNet: Dual Attentive Fully Convolutional Siamese Networks for Change Detection in High-Resolution Satellite Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2021, 14, 1194–1206.
30. Yang, C.; Kong, Y.; Wang, X.; Cheng, Y. Hyperspectral Image Classification Based on Adaptive Global–Local Feature Fusion. Remote Sens. 2024, 16, 1918.
31. Jin, Y.; Deligiannis, A.; Fuentes-Michel, J.C.; Vossiek, M. Cross-Modal Supervision-Based Multitask Learning with Automotive Radar Raw Data. IEEE Trans. Intell. Veh. 2023, 8, 3012–3025.
32. Orr, I.; Cohen, M.; Zalevsky, Z. High-resolution radar road segmentation using weakly supervised learning. Nat. Mach. Intell. 2021, 3, 239–246.
33. Grimm, C.; Fei, T.; Warsitz, E.; Farhoud, R.; Breddermann, T.; Haeb-Umbach, R. Warping of Radar Data Into Camera Image for Cross-Modal Supervision in Automotive Applications. IEEE Trans. Veh. Technol. 2022, 71, 9435–9449.
34. Zhuang, L.; Jiang, T.; Wang, J.; An, Q.; Xiao, K.; Wang, A. Effective mmWave Radar Object Detection Pretraining Based on Masked Image Modeling. IEEE Sens. J. 2024, 24, 3999–4010.
35. Schumann, O.; Hahn, M.; Scheiner, N.; Weishaupt, F.; Tilly, J.F.; Dickmann, J.; Wöhler, C. RadarScenes: A Real-World Radar Point Cloud Data Set for Automotive Applications. In Proceedings of the 2021 IEEE 24th International Conference on Information Fusion (FUSION), Sun City, South Africa, 1–4 November 2021; pp. 1–8.
36. Fang, S.; Zhu, H.; Bisla, D.; Choromanska, A.; Ravindran, S.; Ren, D.; Wu, R. ERASE-Net: Efficient Segmentation Networks for Automotive Radar Signals. In Proceedings of the 2023 IEEE International Conference on Robotics and Automation (ICRA), London, UK, 29 May–2 June 2023; pp. 9331–9337.
37. Zhang, L.; Zhang, X.; Zhang, Y.; Guo, Y.; Chen, Y.; Huang, X.; Ma, Z. PeakConv: Learning Peak Receptive Field for Radar Semantic Segmentation. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 17577–17586.
38. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16 × 16 Words: Transformers for Image Recognition at Scale. arXiv 2020, arXiv:2010.11929.
39. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the Computer Vision—ECCV 2020, Glasgow, UK, 23–28 August 2020; Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M., Eds.; Springer: Cham, Switzerland, 2020; pp. 213–229.
40. Zeller, M.; Behley, J.; Heidingsfeld, M.; Stachniss, C. Gaussian Radar Transformer for Semantic Segmentation in Noisy Radar Data. IEEE Robot. Autom. Lett. 2023, 8, 344–351.
41. Kim, Y.; Kim, S.; Choi, J.W.; Kum, D. CRAFT: Camera-Radar 3D Object Detection with Spatio-Contextual Fusion Transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 1160–1168.
42. Hwang, J.J.; Kretzschmar, H.; Manela, J.; Rafferty, S.; Armstrong-Crews, N.; Chen, T.; Anguelov, D. CramNet: Camera-Radar Fusion with Ray-Constrained Cross-Attention for Robust 3D Object Detection. In Proceedings of the Computer Vision—ECCV 2022, Tel Aviv, Israel, 23–27 October 2022; Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T., Eds.; Springer: Cham, Switzerland, 2022; pp. 388–405.
43. Lo, C.C.; Vandewalle, P. RCDPT: Radar-Camera Fusion Dense Prediction Transformer. In Proceedings of the ICASSP 2023—2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Rhodes Island, Greece, 4–10 June 2023; pp. 1–5.
44. Jiang, T.; Zhuang, L.; An, Q.; Wang, J.; Xiao, K.; Wang, A. T-RODNet: Transformer for Vehicular Millimeter-Wave Radar Object Detection. IEEE Trans. Instrum. Meas. 2023, 72, 1–12.
45. Dalbah, Y.; Lahoud, J.; Cholakkal, H. RadarFormer: Lightweight and Accurate Real-Time Radar Object Detection Model. In Proceedings of the Image Analysis; Gade, R., Felsberg, M., Kämäräinen, J.K., Eds.; Springer: Cham, Switzerland, 2023; pp. 341–358.
46. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 10–17 October 2021; pp. 9992–10002.
47. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
48. Agarwal, A.; Arora, C. Attention Attention Everywhere: Monocular Depth Prediction with Skip Attention. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2023; pp. 5850–5859.
49. Hatamizadeh, A.; Tang, Y.; Nath, V.; Yang, D.; Myronenko, A.; Landman, B.; Roth, H.R.; Xu, D. UNETR: Transformers for 3D Medical Image Segmentation. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 2–7 January 2022; pp. 1748–1758.
50. Yuan, Y.; Fu, R.; Huang, L.; Lin, W.; Zhang, C.; Chen, X.; Wang, J. HRFormer: High-Resolution Vision Transformer for Dense Predict. In Proceedings of the Advances in Neural Information Processing Systems, Online, 6–14 December 2021; Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., Vaughan, J.W., Eds.; Volume 34, pp. 7281–7293.
51. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
52. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015, Munich, Germany, 5–9 October 2015; Navab, N., Hornegger, J., Wells, W.M., Frangi, A.F., Eds.; Springer: Cham, Switzerland, 2015; pp. 234–241.
53. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y., Eds.; Springer: Cham, Switzerland, 2018; pp. 833–851.
54. Kaul, P.; de Martini, D.; Gadd, M.; Newman, P. RSS-Net: Weakly-Supervised Multi-Class Semantic Segmentation with FMCW Radar. In Proceedings of the 2020 IEEE Intelligent Vehicles Symposium (IV), Las Vegas, NV, USA, 19 October–13 November 2020; pp. 431–436.
55. Zhuang, L.; Jiang, T.; Jiang, H.; Wang, A.; Huang, Z. LQCANet: Learnable-Query-Guided Multi-Scale Fusion Network Based on Cross-Attention for Radar Semantic Segmentation. IEEE Trans. Intell. Veh. 2024, 9, 3330–3344.
| Branch | Kernel | Padding | Dilation | Out Channels |
|---|---|---|---|---|
| Branch1 | (3, 3, 3) | (1, 1, 1) | (1, 1, 1) | 1/4 C |
| Branch2-1 | (3, 3, 3) | (1, 2, 2) | (1, 2, 2) | 1/4 C |
| Branch2-2 | (3, 5, 5) | (2, 4, 4) | (2, 2, 2) | 1/4 C |
| Branch3-1 | (3, 3, 3) | (1, 4, 4) | (1, 4, 4) | 1/4 C |
| Branch3-2 | (3, 5, 5) | (2, 8, 8) | (2, 4, 4) | 1/4 C |
| Branch4-1 | (3, 3, 3) | (1, 6, 6) | (1, 6, 6) | 1/4 C |
| Branch4-2 | (3, 5, 5) | (2, 12, 12) | (2, 6, 6) | 1/4 C |
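A possible PyTorch rendering of the branch configuration above is sketched below. The kernel, padding, and dilation tuples are taken directly from the table; the choices that the two convolutions of each multi-scale branch are stacked sequentially, that each convolution is followed by batch normalization and ReLU, and that the four branch outputs (1/4 C channels each) are concatenated back to C channels are assumptions for illustration, not taken from the paper.

```python
import torch
import torch.nn as nn

class DIFBSketch(nn.Module):
    """Illustrative multi-branch 3D dilated-convolution block (DIFB-like)."""

    def __init__(self, channels: int):
        super().__init__()
        c4 = channels // 4  # each branch outputs 1/4 C channels

        def conv(in_c, out_c, k, p, d):
            return nn.Sequential(
                nn.Conv3d(in_c, out_c, kernel_size=k, padding=p, dilation=d, bias=False),
                nn.BatchNorm3d(out_c),
                nn.ReLU(inplace=True),
            )

        # Branch1: plain 3x3x3 convolution.
        self.branch1 = conv(channels, c4, (3, 3, 3), (1, 1, 1), (1, 1, 1))
        # Branch2: 3x3x3 (spatial dilation 2) followed by 3x5x5 (dilation 2).
        self.branch2 = nn.Sequential(
            conv(channels, c4, (3, 3, 3), (1, 2, 2), (1, 2, 2)),
            conv(c4, c4, (3, 5, 5), (2, 4, 4), (2, 2, 2)),
        )
        # Branch3: dilation-4 variants.
        self.branch3 = nn.Sequential(
            conv(channels, c4, (3, 3, 3), (1, 4, 4), (1, 4, 4)),
            conv(c4, c4, (3, 5, 5), (2, 8, 8), (2, 4, 4)),
        )
        # Branch4: dilation-6 variants.
        self.branch4 = nn.Sequential(
            conv(channels, c4, (3, 3, 3), (1, 6, 6), (1, 6, 6)),
            conv(c4, c4, (3, 5, 5), (2, 12, 12), (2, 6, 6)),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Every branch preserves the (T, H, W) size, so the outputs can be
        # concatenated along the channel dimension back to C channels.
        return torch.cat(
            [self.branch1(x), self.branch2(x), self.branch3(x), self.branch4(x)], dim=1
        )

# Example: C = 64 channels over 16 radar frames of 128 x 128 RA maps.
# feats = torch.randn(1, 64, 16, 128, 128)
# out = DIFBSketch(64)(feats)   # -> shape (1, 64, 16, 128, 128)
```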
| Model | FLOPs (G) | Params (M) | Total AP (%) | Total AR (%) | Pedestrian AP (%) | Pedestrian AR (%) | Cyclist AP (%) | Cyclist AR (%) | Car AP (%) | Car AR (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| RODNet-CDC [21] | 560.1 | 34.5 | 76.41 | 82.20 | 76.79 | 80.98 | 74.92 | 78.46 | 77.47 | 88.34 |
| RODNet-HG [21] | 2258.3 | 64.8 | 78.97 | 83.98 | 79.18 | 82.83 | 75.14 | 80.24 | 82.96 | 90.03 |
| RODNet-HWGI [21] | 5949.7 | 61.2 | 78.65 | 83.44 | 78.02 | 83.26 | 76.23 | 78.37 | 82.40 | 89.43 |
| RODNet-2D [45] | 389.5 | 2.8 | 76.80 | 83.50 | 74.31 | 81.46 | 80.21 | 81.89 | 77.03 | 88.62 |
| UNETR [49] | 2221.7 | 285.4 | 82.39 | 86.14 | 81.52 | 85.18 | 85.46 | 86.50 | 80.37 | 87.20 |
| HRFormer [50] | 1286.2 | 187.6 | 74.54 | 78.23 | 75.24 | 78.02 | 66.34 | 67.95 | 82.60 | 90.06 |
| RadarFormer [45] | 2125.7 | 6.4 | 81.51 | 86.03 | 81.49 | 86.67 | 81.86 | 82.93 | 81.17 | 88.46 |
| T-RODNet [44] | 182.5 | 159.7 | 83.27 | 86.98 | 82.19 | 85.41 | 82.28 | 84.30 | 86.22 | 92.53 |
| SS-RODNet [34] | 172.8 | 33.1 | 83.07 | 86.43 | 81.37 | 84.61 | 83.34 | 84.34 | 85.55 | 90.86 |
| TC–Radar (ours) | 384.6 | 32.6 | 83.99 | 88.02 | 85.19 | 88.52 | 82.74 | 84.42 | 83.46 | 91.25 |
| Model | FLOPs (G) | Params (M) | IoU Bkg (%) | IoU Ped (%) | IoU Cyc (%) | IoU Car (%) | mIoU (%) | Dice Bkg (%) | Dice Ped (%) | Dice Cyc (%) | Dice Car (%) | mDice (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| FCN-8s [51] | 134.3 | 31.0 | 99.8 | 14.8 | 0.0 | 23.3 | 34.5 | 99.9 | 25.8 | 0.0 | 37.8 | 40.9 |
| U-Net [52] | 56.6 | 17.3 | 99.8 | 22.4 | 8.8 | 0.0 | 32.8 | 99.9 | 36.6 | 16.1 | 0.0 | 38.2 |
| DeepLabv3+ [53] | 22.4 | 59.3 | 99.9 | 3.4 | 5.9 | 21.8 | 32.7 | 99.9 | 6.5 | 11.1 | 35.7 | 38.3 |
| RSS-Net [54] | 45.3 | 10.1 | 99.5 | 7.3 | 5.6 | 15.8 | 32.1 | 99.8 | 13.7 | 10.5 | 27.4 | 37.8 |
| RAMP-CNN [13] | 420.4 | 106.4 | 99.8 | 1.7 | 2.6 | 7.2 | 27.9 | 99.9 | 3.4 | 5.1 | 13.5 | 30.5 |
| MV-Net [19] | 58.1 | 2.4 | 99.8 | 0.1 | 1.1 | 6.2 | 26.8 | 99.0 | 0.0 | 7.3 | 24.8 | 28.5 |
| TMVA-Net [19] | 102.3 | 5.6 | 99.8 | 26.0 | 8.6 | 30.7 | 41.3 | 99.9 | 41.3 | 15.9 | 47.0 | 51.0 |
| T-RODNet [44] | 44.3 | 162.0 | 99.9 | 25.4 | 9.5 | 39.4 | 43.5 | 99.9 | 40.5 | 17.4 | 56.6 | 53.6 |
| SS-RODNet [34] | 44.3 | 33.1 | 99.9 | 26.7 | 8.9 | 37.2 | 43.2 | 99.9 | 42.2 | 16.3 | 54.2 | 53.2 |
| LQCANet [55] | 27.6 | 148.3 | 99.9 | 25.3 | 11.3 | 39.5 | 44.0 | 99.9 | 40.4 | 20.5 | 56.6 | 54.4 |
| PeakConv [37] | ✗ | 6.3 | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | 53.3 |
| TransRSS [20] | ✗ | ✗ | 99.6 | 24.9 | 13.6 | 33.9 | 43.0 | 99.7 | 37.4 | 15.9 | 47.0 | 51.0 |
| TC–Radar (ours) | 88.9 | 34.0 | 99.9 | 28.6 | 10.8 | 41.6 | 45.2 | 99.9 | 44.4 | 19.5 | 58.7 | 55.6 |
| DIFB | Transformer | CA | AP (%) | AR (%) |
|---|---|---|---|---|
| ✗ | ✓ | ✓ | 82.58 | 85.71 |
| ✓ | ✗ | ✓ | 81.88 | 85.87 |
| ✓ | ✓ | ✗ | 77.46 | 82.05 |
| ✓ | ✓ | WLR | 81.58 | 85.96 |
| ✓ | ✓ | WHR | 80.23 | 83.50 |
| ✓ | ✓ | ✓ | 83.99 | 88.02 |
| DIFB | Transformer | CA | mIoU (%) | mDice (%) |
|---|---|---|---|---|
| ✗ | ✓ | ✓ | 43.5 | 54.0 |
| ✓ | ✗ | ✓ | 42.7 | 52.3 |
| ✓ | ✓ | ✗ | 41.3 | 50.8 |
| ✓ | ✓ | WLR | 43.8 | 53.6 |
| ✓ | ✓ | WHR | 42.4 | 51.9 |
| ✓ | ✓ | ✓ | 45.2 | 55.6 |
| Dataset | Frames | Inference (ms) | AP (mIoU) (%) | AR (mDice) (%) |
|---|---|---|---|---|
| CRUW | 1 | 78.06 | 74.52 | 80.72 |
| CRUW | 16 | 117.12 | 83.99 | 88.02 |
| CARRADA | 1 | 34.09 | (40.6) | (49.0) |
| CARRADA | 4 | 36.06 | (45.2) | (55.6) |