TC-Radar: Transformer-CNN Hybrid Network for Millimeter-Wave Radar Object Detection

Abstract: In smart transportation, assisted driving relies on data integration from various sensors, notably LiDAR and cameras. However, their optical performance can degrade under adverse weather conditions, potentially compromising vehicle safety. Millimeter-wave radar, which can overcome these issues more economically, has been re-evaluated. Despite this, developing an accurate detection model is challenging due to significant noise interference and limited semantic information. To address these practical challenges, this paper presents the TC-Radar model, a novel approach that synergistically integrates the strengths of the transformer and the convolutional neural network (CNN) to optimize the sensing potential of millimeter-wave radar in smart transportation systems. The rationale for this integration lies in the complementary nature of CNNs, which are adept at capturing local spatial features, and transformers, which excel at modeling long-range dependencies and global context within data. This hybrid approach allows for a more robust and accurate representation of radar signals, leading to enhanced detection performance. A key innovation of our approach is the introduction of the Cross-Attention (CA) module, which facilitates efficient and dynamic information exchange between the encoder and decoder stages of the network. This CA mechanism ensures that critical features are accurately captured and transferred, thereby significantly improving overall network performance. In addition, the model contains a dense information fusion block (DIFB) to further enrich the feature representation by integrating different high-frequency local features, ensuring thorough incorporation of key data points.
Extensive tests conducted on the CRUW and CARRADA datasets validate the strengths of this method, with the model achieving an average precision (AP) of 83.99% and a mean intersection over union (mIoU) of 45.2%, demonstrating robust radar sensing capabilities.


Introduction
In recent years, advanced driver assistance systems (ADASs) have evolved rapidly due to the increasing demands for enhanced long-term stability and safety in smart transportation systems. Multi-sensor fusion has emerged as a reliable method for information acquisition, extensively applied across diverse applications [1]. LiDAR sensors encounter challenges such as microparticles and dense point clouds, while cameras are susceptible to light interference [2]. Conversely, radar technology is capable of overcoming these issues. However, the limitations of radar encompass low angular resolution, considerable signal noise, and a scarcity of detailed semantic information in echo signals, leading to limited radar-based road perception. Radar employs frequency-modulated continuous wave (FMCW) signals in the millimeter-wave spectrum to detect reflected signals, even in adverse weather conditions [3]. This capability offers a potential avenue to compensate for informational gaps in driving systems under particular conditions, thereby establishing radar as a crucial element of ADASs [4,5].
Traditional constant false alarm rate (CFAR) techniques [6] in radar detection do not provide object classification information and require manual parameter adjustment for practical deployment. With advancements in deep learning, radar signal processing via a deep neural network (DNN) has emerged as a significant research domain. Raw FMCW radar data can undergo conversion via the application of fast Fourier transform (FFT) and peak detection techniques to produce Range-Azimuth-Doppler (RAD) tensors and point cloud (PC) data. In contrast to LiDAR PC methodologies, radar point cloud information is inherently sparse, complicating the process of target feature extraction; therefore, fusion techniques are often employed to facilitate perceptual learning [7][8][9][10][11]. An alternative method of processing RAD data [12,13] involves substantial computational expenditure, where feature aggregation or dimensional reduction condenses the data into bi-dimensional tensors from various vantage points. This study utilizes Range-Azimuth (RA) millimeter-wave radar heatmaps to minimize the use of computational resources, a critical factor in automotive radar perception.
Recently, significant progress has been made by researchers in exploring radar semantic segmentation (RSS) and radar object detection (ROD) tasks as demonstrated, for example, in [14], which presents a unified tensor approach for channel and target parameter estimation within the context of integrated sensing and communication with massive MIMO systems. However, both single-view perception [15][16][17][18] and multi-view fusion methods [19,20] do not adequately consider continuous frame information in radar data. For RA data, the lack of Doppler information makes the network insensitive to the motion characteristics of objects. Therefore, it is crucial to model radar data by considering both temporal and spatial features. When multiple heatmaps are used as input simultaneously to extract motion information, the goal is to capture the temporal dynamics within the sequence. Each frame's heatmap is treated as an independent data point within the batch, enabling the model to learn motion features across the entire batch. This approach ensures that each frame contributes to the extraction of motion features, effectively addressing the issue of sparse semantic information to some extent. By utilizing this approach, some researchers have incorporated 3D convolutional neural networks (3D CNNs) [13,21,22] and LSTMs [23][24][25] into ROD networks, which have resulted in improved detection accuracy. However, while LSTM primarily focuses on temporal information prediction, it is now increasingly imperative to utilize information across multiple radar frames for feature extraction. Similarly, 3D CNNs attempt to extract long-range modeling information by increasing the size of the convolution kernel or by deepening the network structure, yet these improvements are limited. Additionally, the presence of noise in radar data makes models prone to confusing targets with the background, which leads to false positives. In response to these challenges, this paper proposes a novel radar signal
perception network. Inspired by the transformer architecture in the field of Natural Language Processing (NLP) [26], our research integrates CNN and transformer elements into our model, essentially acting as a combination of high-pass and low-pass filters to better distinguish high-frequency noise interference from low-frequency target information, which is vital for encapsulating both local information and global perception [27][28][29][30]. Our contributions can be summarized as follows:

1. Transformer was integrated with CNN to develop the TC-Radar model, resulting in a significant enhancement in radar detection accuracy. Comprehensive experiments conducted on the CRUW and CARRADA datasets demonstrated that the TC-Radar model outperformed existing methods in terms of detection accuracy and robustness.

2. The incorporation of CA and DIFB modules established an effective feature fusion mechanism between the encoder and decoder stages.

3. TC-Radar has demonstrated significant practicality under low-light conditions by compensating for lost information in assisted driving systems and exhibiting robust perception capabilities.
The remainder of this paper is organized as follows: Section 2 introduces the relevant background and achievements in radar perception; Section 3 details the design of the proposed TC-Radar; Section 4 presents experimental results and ablation studies; and Section 5 offers the conclusion of the paper.

Radar Signal Processing Chain
Radar utilizes electromagnetic waves to detect and analyze reflected signals originating from objects within the surrounding environment [3]. Various data formats are generated through a complex series of signal processing procedures, as illustrated in Figure 1. The raw radar data, acquired through analog-to-digital conversion (ADC) of the signals at the transmitting and receiving antennas, contain substantial noise and are inherently abstract, rendering them difficult to interpret without further processing. The FFT is employed along the chirp sequence, the sampling axis, and the antenna dimension to transform the data into a more interpretable format. This process results in the generation of RAD tensors, whereas PC data are produced through peak detection algorithms that extract meaningful features. Typically, the usability of RAD tensors and PC data is limited due to the former's high computational demands and the latter's sparse informational content.
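As a rough illustration of this processing chain, the RAD tensor can be sketched as three successive FFTs over a simulated ADC cube. The cube dimensions and axis ordering below are placeholders, not tied to a particular radar front end:

```python
import numpy as np

# Illustrative sketch of the FFT chain described above: FFTs along the
# sample (range), chirp (Doppler), and antenna (azimuth) axes of a raw
# complex ADC cube yield the RAD tensor. Dimensions are placeholders.
rng = np.random.default_rng(0)
adc = rng.standard_normal((64, 128, 8)) + 1j * rng.standard_normal((64, 128, 8))
# axes: (chirps, samples per chirp, virtual antennas)
range_fft = np.fft.fft(adc, axis=1)          # fast-time FFT -> range bins
doppler_fft = np.fft.fft(range_fft, axis=0)  # slow-time FFT -> Doppler bins
rad = np.fft.fft(doppler_fft, axis=2)        # antenna FFT -> azimuth bins
```

An RA heatmap of the kind used in this paper would then correspond to collapsing the Doppler axis of this tensor (e.g., by magnitude aggregation).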

Problem Formulation
Before delving into the methodology, we present a mathematical formulation of the radar target detection problem addressed in this paper. Let R = {R_1, R_2, ..., R_N} represent a set of N frames of radar sequences, where each R_i is a vector described by distance and angle dimensions, containing radar frequency-domain information at a specific time and spatial position. The detection problem is formulated as estimating the parameters θ of the target model M, with the goal of determining the optimal parameters θ* that maximize the probability of observing the target object in M, expressed as θ* = arg max_θ P(R_gt | M(R_i; θ)). This method corrects the model's learning results by addressing the discrepancy between the detection results R_i-output(r, θ) of model M and the true values R_gt(r_gt, θ_gt), aiming to improve detection accuracy. The specific mathematical relationship is detailed in Section 3.4. Here, r and θ represent the distance and angle of the target coordinates detected by the model, while r_gt and θ_gt denote the actual position information of the target.

Radar Semantic Segmentation and Object Detection
A low signal-to-noise ratio (SNR) in radar data poses significant challenges for dataset generation. Utilizing cross-modal supervised training methods across different sensors helps to reduce manual annotation costs. Specifically, Jin et al. [31] utilize synchronized camera images as ground truth (GT) for model training. Another method involves the supervision of radar DNN training with pseudo-labels derived from camera pre-training [32]. Additionally, as reported in [33], radar data are converted into camera images for object detection through differentiable data mapping. Recently, the SS-RODNet model, as proposed in [34], has employed masked image modeling (MIM) for self-supervised learning to address dataset limitations. However, these methods require data matching between different sensors or downstream task adjustments, failing to effectively leverage the intrinsic features of radar data itself. CRUW [21] and CARRADA [19] are particularly noteworthy among the several proposed millimeter-wave radar datasets [12,19,21,35]. Both datasets offer RA-perspective millimeter-wave radar heatmaps that include the distance and angle information of objects, which makes them more intuitive for ROD and RSS tasks. In previous experiments, methods such as DANet [22], RODNet [21], TMVA-Net [19], ERASE-Net [36], and RECORD [24] have shown the potential of CNN and LSTM for radar perception; however, they encounter problems with detection accuracy. This issue arises from the inherent limitations of CNN and the low semantic content of radar data. Consequently, PeakConv [37] has redefined convolutional operators to better suit radar signals, yielding effective results in RSS tasks. In summary, most current methods remain reliant on CNN and necessitate deeper networks to aggregate multiple local features. Thus, exploring methods capable of capturing global feature modeling is needed to better exploit the potential of radar sensors.

Vision Transformer (ViT)-Based Perception Tasks
The transformer was initially introduced to the vision field in [38], with the seminal work in [39] leveraging the encoder-decoder framework originally developed for NLP. Currently, transformers are increasingly being applied to a diverse range of perception tasks in computer vision. Sparse radar point clouds have achieved high-quality image segmentation by utilizing self-attention (SA) mechanisms [40]. Significant progress has also been made in radar-assisted 3D target perception [41,42] and monocular depth estimation [43]. Additionally, models that rely on millimeter-wave radar heatmap data, such as T-RODNet [44], Radarformer [45], and TransRSS [20], have successfully utilized this technology. However, traditional ViT methods are computationally intensive and fail to meet the stringent real-time requirements of automotive radar perception systems. The Swin Transformer [46] addresses this issue in part by employing within-window SA mechanisms and sliding-window information interaction to reduce computational complexity. This paper applies the aforementioned method to RSS and ROD tasks, combining the local high-frequency perception capabilities of CNN with the global contextual understanding of transformers to enhance radar detection performance.

Overall Architecture
TC-Radar employs a traditional encoder-decoder architecture. The model accepts radar sequences of shape x ∈ R^(D×W×H×C) as input and generates detection results of size x ∈ R^(N×D×H×W), where D and N represent the number of frames and categories, respectively, H represents the height of the radar data matrix, W its width, and C the number of channels. Multi-scale feature stacking is an effective method for model learning; however, for radar data, the low resolution and sparse texture details make deeper network structures less effective. Therefore, an asymmetric structure is employed, with three basic layers in the encoding stage for enhanced multi-scale information acquisition and two basic layers in the decoding stage, integrating CA modules for data restoration.

Encode
Specifically, in the encoding phase, we use the Swin Transformer method to calculate the SA mechanism. The corresponding patch embedding operation is the standard preprocessing stage, which divides the data into small blocks (patches) to facilitate subsequent processing. The accompanying patch merging reduces the resolution of the radar image data, enabling the model to learn features at different scales. In this stage, we use a 3 × 3 × 3 convolution kernel to downsample the data, with combined temporal-spatial and spatial-only downsampling rates set to (2, 2, 2) and (1, 2, 2), respectively. Within each basic layer, feature extraction occurs through parallel DIFB and transformer modules, which focus on learning high-frequency local features and global information, respectively, balanced by a learnable factor according to their output significance.
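The downsampling step above can be sketched with strided 3D convolutions; the channel widths and input size below are illustrative assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

# Sketch of the patch-merging downsampling described above: a 3x3x3
# convolution over (time, height, width), with stride (2,2,2) when both
# temporal and spatial resolution are halved, or (1,2,2) for spatial-only
# downsampling. Channel widths here are illustrative placeholders.
down_tsp = nn.Conv3d(32, 64, kernel_size=3, stride=(2, 2, 2), padding=1)
down_sp = nn.Conv3d(32, 64, kernel_size=3, stride=(1, 2, 2), padding=1)

x = torch.randn(1, 32, 16, 64, 64)  # (batch, channels, frames, H, W)
y_tsp = down_tsp(x)  # halves frames and spatial dims -> (1, 64, 8, 32, 32)
y_sp = down_sp(x)    # keeps frames, halves spatial -> (1, 64, 16, 32, 32)
```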

Decode
During the decoding stage, traditional channel-wise serial concatenation becomes unsuitable for radar perception tasks; thus, CA modules are integrated to enable information exchange between the encoder and decoder branches. This basic layer processes the inherent information x_u originating from the decoder and the corresponding feature information x_d from the encoder. The operation is designed using a multi-scale approach. We employ layer-wise upsampling of the data through patch expand calculations. Specifically, the first two patch expand calculations utilize the trilinear interpolation algorithm instead of transposed convolutions, which helps to reduce the model's complexity. The final layer is a linear classification layer that produces the output result. As depicted in Figure 2, the CA module serves as a channel for information exchange. The structure of this module is discussed in detail in Section 3.3.
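A minimal sketch of the trilinear patch-expand step, with illustrative shapes (the parameter-free interpolation is what replaces a transposed convolution here):

```python
import torch
import torch.nn.functional as F

# Sketch of the patch-expand upsampling used in the decoder: trilinear
# interpolation doubles the spatial resolution in place of a transposed
# convolution, avoiding extra learnable parameters. Shapes are illustrative.
x = torch.randn(1, 64, 4, 32, 32)  # (batch, channels, frames, H, W)
x_up = F.interpolate(x, scale_factor=(1, 2, 2), mode="trilinear",
                     align_corners=False)
```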

Dense Information Fusion Block (DIFB)
Considering the characteristics of radar imaging, background noise disturbances make it challenging to capture target information. Therefore, beyond leveraging the transformer to enhance long-range modeling capabilities, improving the perception capabilities of the CNN is also crucial. In this study, we propose the multi-branch feature aggregation module presented in Figure 3, where each branch Branch_i merges adjacent information streams through interconnected residual basic blocks. The Atrous Spatial Pyramid Pooling (ASPP) module [19] has demonstrated superiority in RSS tasks, effectively integrating multi-receptive-field information through dilated convolutions with different dilation rates while not escalating the computational burden. Inspired by this approach, we employ consecutive 3 × 3 × 3 and 3 × 5 × 5 convolution kernels in each branch with dilation rates of {(2, 2, 2), (2, 4, 4), (2, 6, 6)}. The detailed parameters of this module can be found in Table 1. The equivalent kernel size k_eq is related to the base kernel size k and the dilation factor d by k_eq = k + (k − 1)(d − 1); the resulting equivalent kernel sizes are {(5, 5, 5), (5, 9, 9), (5, 13, 13)} and {(5, 9, 9), (5, 17, 17), (5, 25, 25)}, respectively. Residual blocks with varying receptive fields, an acknowledged effective CNN structure [47], are utilized across all branches except the first. Within each branch, ResConv(·) denotes the internal residual operation. Unlike ASPP, which consolidates all information at the output stage, the i-th branch of DIFB already incorporates information from the preceding i − 1 branches during processing. This interleaving of multi-branch features leverages compact information transfer to reduce information loss in the neural network. Additionally, a 1 × 1 × 1 convolution kernel is applied for dimensionality reduction of the merged information to produce the overall output, matching the number of channels required for subsequent feature stacking. Each basic layer within the encoding stage incorporates a single DIFB module. The contribution of this component will be discussed subsequently.
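The equivalent-kernel relationship k_eq = k + (k − 1)(d − 1) can be checked numerically; the short helper below reproduces the equivalent kernel sizes listed in the text for both branch kernel shapes:

```python
# Equivalent kernel size of a dilated convolution: k_eq = k + (k - 1) * (d - 1).
def equivalent_kernel(k, d):
    return k + (k - 1) * (d - 1)

dilations = [(2, 2, 2), (2, 4, 4), (2, 6, 6)]

# Per-axis equivalent kernels for the 3x3x3 and 3x5x5 branches.
k333 = [tuple(equivalent_kernel(k, d) for k, d in zip((3, 3, 3), dil))
        for dil in dilations]
k355 = [tuple(equivalent_kernel(k, d) for k, d in zip((3, 5, 5), dil))
        for dil in dilations]

print(k333)  # [(5, 5, 5), (5, 9, 9), (5, 13, 13)]
print(k355)  # [(5, 9, 9), (5, 17, 17), (5, 25, 25)]
```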

Cross-Attention (CA)
With dense echo data such as radar signals, both coarse semantic features and fine-grained details are essential for accurate detection [48]. However, classical skip connections encounter difficulties in transmitting features: juxtaposing noisy shallow information with comparatively clear deep features introduces additional noise during merging, leading to feature misalignment and inadequate feature representation, thereby compromising decoder performance. To counter this issue, attention feature maps of particular dimensions from the encoder are integrated into the decoder's process, forming an information exchange channel between the two to facilitate adaptive feature aggregation. The structure is depicted in Figure 4. Specifically, the encoder-derived information x_d^(l+1) results from the combined processing of the DIFB and transformer, where λ is the learnable balancing factor, and the SA mechanism executes dot-product operations within a predefined window size of 4. The benefits of incorporating CA are manifold. First, the integration of same-sized data within the decoder guides the detail recovery process. To reinforce this guidance, the key (K) and value (V) matrices produced by the shifted-window/window multi-head self-attention (SW/W-MSA) in the corresponding branch are relayed to the CA. To safeguard against the loss of the decoder's intrinsic information, the module concurrently processes two streams of information: the global information x_g sourced from the encoder output x_d^(l+1), and the inherent decoder information x_u^l. The second facet entails comprehending the decoder's inherent features. In the SA operations for these aspects, Q_G denotes the global query (Q) matrix derived from the DIFB, the transformer, and the decoder's intrinsic information, while (·)_D and (·)_E denote the decoding and encoding stages, respectively. The CA module is used for the interaction and fusion of the different information streams. For the current decoder output state, the Q_D matrix is generated. Next, the association distribution is obtained by calculating the similarity between Q_D and the encoder K_E vector. The association is then multiplied by the encoder V_E vector to obtain the attention weight distribution, so that key feature positions in the encoder are highlighted with larger weights. This sequence information exposes the encoder's latent abstract associations to the decoder. The complete CA representation is obtained through subsequent fusion with the decoder's inherent information. At this juncture, the information within SA_G is more comprehensive, encapsulating abstract latent features, while SA_D primarily concentrates on the decoder's intrinsic details. Second, the intermingling of information promotes the amalgamation of analogous details in the decoding stage, aiding the structural replenishment of information gaps. The approach illustrated in Figure 4b is employed to generate the outputs x_gu^l and x_up^l. The inherent decoder information is processed using a multi-layer perceptron (MLP) and layer normalization (LN), while F(·) signifies Fourier positional encoding (FPE), which enhances the model's resilience to spatial transformations. In summary, the CA module thoroughly addresses the disparities in information between the encoder and decoder, processes features across various scales, and exhibits robust decoding and predictive capabilities.
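The core of this mechanism, decoder queries attending to encoder keys and values, can be sketched as follows. The projection layers, token count, and absence of windowing are simplifying assumptions for illustration, not the paper's exact module:

```python
import torch
import torch.nn.functional as F

# Minimal cross-attention sketch: the query Q_D comes from the decoder
# stream, while K_E and V_E come from the encoder branch, so encoder
# features re-weight the decoder's detail recovery. Shapes are illustrative.
tokens, dim = 16, 64
x_dec = torch.randn(1, tokens, dim)  # decoder (query) stream
x_enc = torch.randn(1, tokens, dim)  # encoder (key/value) stream

w_q = torch.nn.Linear(dim, dim)
w_k = torch.nn.Linear(dim, dim)
w_v = torch.nn.Linear(dim, dim)

q = w_q(x_dec)                   # Q_D from the decoder
k, v = w_k(x_enc), w_v(x_enc)    # K_E, V_E from the encoder

# Scaled dot-product similarity, softmax-normalized over encoder positions.
attn = F.softmax(q @ k.transpose(-2, -1) / dim ** 0.5, dim=-1)
out = attn @ v                   # encoder values weighted by decoder queries
```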

Loss Function
In the RSS task context, cross-entropy (CE) and soft Dice (SDice) losses are recognized as effective. To tackle the challenge posed by imbalanced segmentation labels, weighted coefficients are utilized to equilibrate the two losses: the weighted CE loss is L_wCE = −Σ_i w_k P(y_i) log P(ŷ_i), and the soft Dice loss is L_SDice = 1 − (2 Σ_i P(y_i)P(ŷ_i)) / (Σ_i P(y_i) + Σ_i P(ŷ_i)), where P(y_i) and P(ŷ_i) denote the GT and predicted probabilities, respectively, and w_k denotes the class weight, which is inversely proportional to the frequency of the respective class in the training set. The composite loss function for the RSS task is L_RSS = λ_wce L_wCE + λ_dice L_SDice, where the balancing factors λ_wce and λ_dice are set to 1 and 5, respectively.
In the ROD task, the mean squared error (MSE) loss function is utilized to quantify the disparity between the model's outputs and the ground truth: L_MSE = (1/N) Σ_i (P(y_i) − P(ŷ_i))², where the definitions of P(y_i) and P(ŷ_i) are the same as above.
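A sketch of the two task losses follows. The soft Dice variant and the uniform class weights below are illustrative assumptions (one common formulation), while the balancing factors λ_wce = 1 and λ_dice = 5 come from the text:

```python
import torch
import torch.nn.functional as F

# Soft Dice loss: 1 minus the mean per-class soft Dice coefficient.
def soft_dice_loss(probs, onehot, eps=1e-6):
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    denom = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    return 1.0 - ((2.0 * inter + eps) / (denom + eps)).mean()

# Composite RSS loss: weighted CE plus soft Dice, balanced by lambda factors.
def rss_loss(logits, target, class_weights, lam_wce=1.0, lam_dice=5.0):
    wce = F.cross_entropy(logits, target, weight=class_weights)
    probs = F.softmax(logits, dim=1)
    onehot = F.one_hot(target, logits.shape[1]).permute(0, 3, 1, 2).float()
    return lam_wce * wce + lam_dice * soft_dice_loss(probs, onehot)

logits = torch.randn(2, 4, 8, 8)          # (batch, classes, H, W)
target = torch.randint(0, 4, (2, 8, 8))   # GT class indices
seg_loss = rss_loss(logits, target, torch.ones(4))  # placeholder weights

# ROD task: MSE between predicted and GT confidence maps.
rod_loss = F.mse_loss(torch.rand(2, 3, 8, 8), torch.rand(2, 3, 8, 8))
```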

Experiments
Perception experiments for the ROD task on the CRUW dataset and the RSS task on the CARRADA dataset were conducted, with comparisons of our results to previous state-of-the-art (SOTA) methods. The visualization result of the data is shown in Figure 5.
1. Object Detection on CRUW: The sensor suite comprises a camera and a 77 GHz FMCW millimeter-wave radar. A total of 3.5 h of driving data is captured across street, campus, highway, and parking lot scenarios at 30 frames per second, amounting to roughly 400 K frames. Furthermore, the dataset encompasses diverse scenarios, including nighttime and glare conditions, which serve to evaluate the model's comprehensive perception abilities. Altogether, CRUW represents a high-quality dataset for radar-based object detection. The dataset autonomously produces radar annotations in the RA perspective using cross-modal supervision techniques, identifying three classes: pedestrians, bicycles, and vehicles. In contrast to the commonly used intersection over union (IoU) metric, CRUW employs an anchor-free measure known as object localization similarity (OLS) to gauge classification confidence, delineated as OLS = exp(−d² / (2(S · K_cls)²)), where d, S, and K_cls denote the distance between two points in the RA image, the radial distance from the target to the radar, and the category-specific tolerance constant (defined by the mean object size within each class), respectively. The evaluation metrics employed are AP and average recall (AR), assessed using OLS thresholds varying from 0.5 to 0.9 in increments of 0.05.

2. Semantic Segmentation on CARRADA: The CARRADA dataset comprises 12,666 frames, annotated via a semi-automated process. Relative to the CRUW dataset, CARRADA offers less variety in scene composition yet furnishes comprehensive RAD data. The dataset is decomposed into RA, Range-Doppler (RD), and Azimuth-Doppler (AD) 2D tensors, providing authentic data for further multi-perspective information fusion research. Within its 30 sequences, the primary entities encompass four classes: pedestrians, cyclists, vehicles, and the background, with detailed mask annotations available in both RA and RD views. In contrast to the OLS evaluation utilized for CRUW, this dataset employs IoU and Dice as assessment metrics, formulated as IoU = |Y_g ∩ Y_p| / |Y_g ∪ Y_p| and Dice = 2|Y_g ∩ Y_p| / (|Y_g| + |Y_p|), where Y_g and Y_p represent the GT and the predicted values, respectively.
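The evaluation measures used by the two datasets can be sketched as follows. The OLS expression follows the RODNet-style formulation used with CRUW, and the κ value 0.1 is a placeholder, since the per-class tolerance constants are dataset-defined:

```python
import math
import numpy as np

# OLS = exp(-d^2 / (2 * (S * K_cls)^2)); d is the detection-to-GT distance
# in the RA image, S the radial distance of the object, K_cls the per-class
# tolerance constant (placeholder value used below).
def ols(d, s, kappa_cls):
    return math.exp(-d ** 2 / (2 * (s * kappa_cls) ** 2))

# IoU = |Yg ∩ Yp| / |Yg ∪ Yp|; Dice = 2|Yg ∩ Yp| / (|Yg| + |Yp|).
def iou_dice(y_g, y_p):
    inter = np.logical_and(y_g, y_p).sum()
    union = np.logical_or(y_g, y_p).sum()
    return inter / union, 2.0 * inter / (y_g.sum() + y_p.sum())

perfect = ols(0.0, 10.0, 0.1)  # perfect localization gives OLS = 1.0
y_g = np.array([[1, 1], [0, 0]], dtype=bool)
y_p = np.array([[1, 0], [1, 0]], dtype=bool)
iou, dice = iou_dice(y_g, y_p)  # IoU = 1/3, Dice = 0.5
```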

Implementation Details
All experiments ran on a single RTX 3090 GPU within the PyTorch framework. GFLOPs and model parameters were computed via the third-party thop library, except where official paper data were cited. Experiments utilized the Adam optimizer and, based on empirical evidence, configured the transformer's dimension to 64 with a window size of 4. The initial learning rate was 1 × 10^−4, following a cosine annealing schedule for optimal training performance. With the CRUW dataset, the RA heatmap resolution was established at 128 × 128 pixels. To more effectively capture target motion information, 16 radar data frames were input simultaneously. The channel count and batch size were both set to 2. In contrast, the CARRADA dataset specified RA dimensions of 256 × 256 pixels. For the RSS task, the input frame count was reduced to 4, and the batch size was increased to 8. The channel count for this task was established at 1. Owing to varying evaluation criteria across datasets, network output sizes required corresponding adjustments. Specifically, within the RSS task, modifications were confined to the linear layer during the patch expansion phase.
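The optimizer setup described above can be sketched as follows; the model, epoch count, and T_max are placeholders, while the Adam optimizer and the 1e-4 initial learning rate with cosine annealing follow the text:

```python
import torch

# Sketch of the training configuration: Adam at lr = 1e-4 with a cosine
# annealing schedule. Model and schedule length are placeholders.
model = torch.nn.Linear(8, 3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

for _ in range(100):   # one scheduler step per training epoch
    optimizer.step()   # (forward/backward pass omitted in this sketch)
    scheduler.step()

final_lr = optimizer.param_groups[0]["lr"]  # annealed toward 0 by the schedule
```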

Comparison with SOTAs
This study conducts a comprehensive comparative analysis of alternative algorithms on the CRUW and CARRADA datasets. Tables 2 and 3 present the quantitative comparison results of all evaluated methods. Figure 6 shows the computational complexity comparison of different models. TC-Radar outperforms competing models in terms of AP, AR, mIoU, and mean Dice (mDice) metrics. In the ROD task, our model surpasses the baseline RODNet-HWGI by 5.3% in AP and 4.6% in AR, while its computational complexity, measured in floating-point operations (FLOPs), stands at a mere 6.2% of that baseline. This reduction is a critical factor for automotive radar perception systems, indicating significantly lower computational demands on edge devices and thus easier model deployment. In comparison, the lightweight Radarformer [45] model achieves satisfactory detection accuracy with fewer parameters but contends with significant computational complexity. Regarding specific category assessments, TC-Radar exhibits enhanced perceptual performance, particularly in pedestrian detection. Support for this conclusion is evident from the detection outcomes illustrated in the accompanying figures. Specifically, as depicted in Figure 7, rows one and five, overlapping instances of vehicles and bicycles complicate the detection of pedestrian subjects. Nonetheless, the TC-Radar detection results manifest in more intense hues, indicating a higher probability of accurate categorization. This is a testament to the effective local perception capabilities of the DIFB module and the transformer's robust noise differentiation capacity. During the quantitative evaluation of the RSS task, our model surpasses the baseline TMVA-Net in mIoU and mDice by 3.9% and 4.6%, respectively, while also achieving a 10% reduction in computational complexity. Given that TMVA-Net is a multi-view fusion model, it necessitates the input of RA, RD, and AD data for training. While this approach mitigates some information loss inherent to a single perspective, the segmentation results suffer from imprecise accuracy and misclassifications. This is particularly noticeable in the erroneous or missed detections of minor pedestrian and cyclist targets as observed in Figure 8. With the exception of LQCANet, which has marginally higher scores, competing models exhibit an approximate 2% performance deficit relative to TC-Radar. Furthermore, our model's parameter count is approximately 22% of that of LQCANet, indicating greater learning efficiency.
In summary, considering both qualitative and quantitative evaluations, the judicious design of the network architecture endows the model with robust perceptual abilities while maintaining a balance between detection speed and accuracy. Future experiments will delve into the specific contributions of each module in greater detail.

Ablation Study
To comprehensively assess the impacts of the DIFB, transformer, and CA modules, ablation studies are conducted on the CRUW and CARRADA datasets. Tables 4 and 5 illustrate the quantitative outcomes of the ablation studies. Analysis of the results reveals that each module contributes positively, with the transformer and CA modules making particularly substantial contributions. This underscores the robust noise differentiation capability of long-range feature modeling in radar signal processing.
Notably, during the impact assessment of the transformer, it is omitted only from the encoder segment to preserve the integral role of the CA module, given that CA relies on the K and V matrices from the transformer to direct detail refinement, and these components are produced by the decoding branch to compensate for informational deficiencies. To examine the impacts of CA, three experimental setups are created: operating the module exclusively with low-resolution (WLR) inputs, operating it exclusively with high-resolution (WHR) inputs, and fully eliminating the module so that patch details rely solely on the upsampling layer. The analyzed results indicate that both the WLR and WHR configurations lead to variations in detection and classification performance. However, the WLR-modeled network tends to yield predictions with comparatively greater certainty. This is attributed to two factors: first, the encoder's deeper features are more refined; second, the simplified resolution helps the model identify targets more readily, while the intricate features assist in the decoder's refinement process. Figure 9 corroborates this finding, demonstrating that WLR mode enhances accuracy in detection classification and segmentation details. Furthermore, across all results, the DIFB module consistently bolsters both segmentation and detection tasks, affirming its efficacy in enlarging the receptive field for feature analysis. In addition, we discuss the importance of input radar sequences. Compared to single-frame radar information, the detection performance of models using multi-frame input data is more reliable, suggesting that the redundancy inherent in radar sequence information can help mitigate the issue of low semantic content to some extent. We conduct the experiment according to the methodology outlined in [21], which involves recording the time taken for a single forward propagation of the neural network as the inference time metric, and extend this method to the CARRADA dataset. The specific quantitative metrics are presented in Table 6. The results indicate that incorporating temporal information positively impacts radar perception performance.

Super Test
To assess the feasibility of radar as a cost-effective alternative to optical sensors under low-light conditions, we conduct a comprehensive evaluation, termed the "super test", on a dataset specifically captured during nighttime. Due to the inherent difficulties in calibrating ground truth data under nighttime conditions, this data sequence is excluded from the quantitative analysis. As depicted in Figure 10, radar perception tests are conducted on streets at night, which are marked by insufficient and complex lighting conditions as well as a high density of clustered vehicles. While optical imaging systems fail to perform target detection under these conditions, the all-weather-capable radar model, which has been trained extensively, successfully identifies targets with high precision, demonstrating its independence from optical sensors. Consequently, the broad availability and robust performance of millimeter-wave radar technology can effectively compensate for the absence of perceptual data typically provided by optical sensors in assisted driving systems.

Conclusions
This paper presents the TC-Radar model, which leverages the advantages of a hybrid CNN and transformer architecture to achieve excellent results in radar sensing tasks. Compared with other detection models, TC-Radar fully utilizes both the local and global information in radar data. This success is primarily attributed to the integration of the DIFB and CA modules, which address high-frequency information perception and efficient data exchange, respectively. In experiments on the CRUW and CARRADA datasets, TC-Radar achieves the best detection performance among the compared SOTA methods while balancing detection accuracy and computational efficiency. The super-test experiments further validate the model's robustness and accuracy under low-light conditions. Additionally, we find that the strong fitting capability of the transformer enables the model to learn quickly, with the loss converging near a specific threshold. Compared with competing algorithms, the TC-Radar model therefore significantly reduces training time requirements, lowering time costs.
However, the current method still faces challenges in parameter count and computational complexity, which hinder subsequent model deployment. Future research will focus on new methods to optimize the model's detection speed, which is crucial for practical applications in vehicle radar systems. Additionally, efforts will be directed towards expanding radar data collection: with data gathered from enhanced radar equipment, the method can be validated and improved on larger-scale datasets with more diverse target types.

Figure 1 .
Figure 1. Overview of the radar signal processing chain. PC data must go through multiple preprocessing steps, and the information is relatively sparse.

Figure 2 .
Figure 2. Overall structure of the TC-Radar model. The top is the encoding branch, and the bottom is the decoding branch. The DIFB module and the Cross-Attention module act in these two parts, respectively, making full use of multi-scale information.

Figure 3 .
Figure 3. Overview of the DIFB framework, which extracts local high-frequency information through parallel atrous convolutions with different dilation rates.
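The parallel-atrous-convolution idea behind DIFB can be illustrated with a small NumPy sketch; the kernel, dilation rates, and fusion-by-stacking used here are illustrative stand-ins, not the paper's exact configuration (the paper uses 3D convolutions, see Table 1).

```python
import numpy as np

def dilated_conv2d(x, kernel, rate):
    """'Same'-padded single-channel 2D convolution with atrous rate `rate`."""
    k = kernel.shape[0]                  # assume a square kernel
    eff = (k - 1) * rate + 1             # effective receptive-field width
    pad = eff // 2
    xp = np.pad(x, pad, mode="constant")
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            # sample the padded input at stride `rate` within the window
            patch = xp[i:i + eff:rate, j:j + eff:rate]
            out[i, j] = np.sum(patch * kernel)
    return out

def difb_like_block(x, rates=(1, 2, 4)):
    """Parallel dilated convolutions, fused by stacking along a channel axis."""
    kernel = np.ones((3, 3)) / 9.0       # toy 3x3 averaging kernel
    branches = [dilated_conv2d(x, kernel, r) for r in rates]
    return np.stack(branches, axis=0)    # shape (n_rates, H, W)

x = np.random.default_rng(0).standard_normal((16, 16))
y = difb_like_block(x)
print(y.shape)  # (3, 16, 16)
```

Each branch sees the same number of kernel taps but a progressively wider window, which is how atrous convolution enlarges the receptive field without adding parameters.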

Figure 4 .
Figure 4. Cross-Attention module framework. (a) describes the data operation process, (b) represents the fusion method, and (c) shows the flow of cross information.
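The core data operation of cross-attention can be sketched as scaled dot-product attention in which the queries come from one branch while the keys and values come from the other (here, the decoding branch, matching the paper's description of CA). The token counts, feature widths, and random projections below are toy assumptions, not TC-Radar's actual dimensions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feats, kv_feats, d_k=32, seed=0):
    """Scaled dot-product cross-attention: Q from one branch,
    K and V from the other branch."""
    rng = np.random.default_rng(seed)
    Wq = rng.standard_normal((q_feats.shape[-1], d_k)) * 0.1   # toy projections
    Wk = rng.standard_normal((kv_feats.shape[-1], d_k)) * 0.1
    Wv = rng.standard_normal((kv_feats.shape[-1], d_k)) * 0.1
    Q = q_feats @ Wq                          # (n_q, d_k)
    K = kv_feats @ Wk                         # (n_kv, d_k)
    V = kv_feats @ Wv                         # (n_kv, d_k)
    attn = softmax(Q @ K.T / np.sqrt(d_k))    # (n_q, n_kv), rows sum to 1
    return attn @ V, attn

rng = np.random.default_rng(1)
enc = rng.standard_normal((64, 48))   # encoder-branch tokens (query source)
dec = rng.standard_normal((16, 48))   # decoder-branch tokens (K, V source)
out, attn = cross_attention(enc, dec)
print(out.shape, attn.shape)          # (64, 32) (64, 16)
```

Because each query row of `attn` is a probability distribution over the other branch's tokens, the module exchanges information dynamically rather than through a fixed fusion rule.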

Figure 5 .
Figure 5. Visualization of radar echoes; target information is often interfered with by clutter.

Figure 6 .
Figure 6. Comparison of model complexity. The results on the CRUW dataset (sections (a,b)) and the CARRADA dataset (sections (c,d)) demonstrate that TC-Radar achieves an optimal balance between detection accuracy and model complexity.

Figure 7 .
Figure 7. Visual detection effects of different algorithms on the CRUW dataset. Green, red, and blue represent bicycle, pedestrian, and car targets, respectively. The darker the color, the higher the confidence level of the target.

Figure 8 .
Figure 8. Comparison of the visual segmentation effect with the baseline model on the CARRADA dataset. Green, red, and blue represent bicycle, pedestrian, and car targets, respectively. Note that TC-Radar learns only from RA-view data, while the baseline model requires joint learning from multiple views.

Figure 9 .
Figure 9. Visualization results of the contribution of each module.The first four rows and the last four rows of results are based on CARRADA and CRUW data, respectively.

Figure 10 .
Figure 10. To validate the potential of radar as a low-cost alternative to optical sensors in low-light conditions, we conduct a comprehensive evaluation, termed the "super test", using a nighttime dataset.

Table 1 .
The configurations of 3D convolution in our DIFB.

Table 2 .
Comparison of metrics across models on the CRUW dataset. Bold represents the first place, and underline represents the second place.

Table 3 .
Comparison of metrics across models on the CARRADA dataset. Bold represents the first place, and underline represents the second place.

Table 4 .
Analysis of the contribution of each module on the CRUW dataset. Bold represents the first place.

Table 5 .
Analysis of the contribution of each module on the CARRADA dataset. Bold represents the first place.

Table 6 .
Analysis of the performance of the model as a function of radar frame number. Bold represents the first place.