1. Introduction
Safe and focused driving is fundamental to preventing traffic accidents and safeguarding the well-being of drivers, passengers, and other road users. However, driver fatigue and distraction have been widely recognized as major contributing factors to road traffic accidents, particularly in contexts such as long-distance transportation, commercial driving, and ride-sharing services. According to the
Global Status Report on Road Safety 2023 published by the World Health Organization (WHO) [
1], approximately 69% of global traffic fatalities involve individuals aged 18 to 59—the most economically active segment of the population. Furthermore, 18% of driver deaths are associated with commuting or work-related driving, underscoring the heightened risks of fatigue, especially in long-haul or professional driving scenarios. Distracted driving has also emerged as an increasingly critical concern, exacerbated by the widespread use of mobile devices. Studies have shown that drivers who use mobile phones while driving are significantly more likely to be involved in traffic collisions compared to those who do not. Such behaviors impair visual attention, cognitive processing, and reaction time, thereby posing substantial threats to road safety. As a result, the accurate and efficient detection of driver fatigue and distraction has become an urgent need in intelligent transportation systems, autonomous vehicles, and fleet management applications.
Existing methods for detecting driver fatigue and distraction can be broadly categorized into four main approaches, as follows: vehicle operation data analysis, physiological signal monitoring, facial landmark detection, and behavior analysis.
Vehicle operation data analysis methods infer driver state by analyzing abnormal vehicle behaviors, such as abrupt steering, lane deviations, or sudden acceleration, using data collected from the vehicle’s onboard diagnostics (OBD) system [
2,
3]. These methods typically employ statistical models or machine learning algorithms to identify deviations from standard driving patterns. While these methods are easy to deploy and cost-efficient, they are highly susceptible to external interferences, including weather conditions, road types, and driver habits, which may result in false alarms or missed detections.
Physiological signal monitoring methods aim to directly evaluate the driver’s internal mental state by measuring signals such as heart rate variability (HRV) [
4], brainwave activity (EEG) [
5,
6], and skin conductance level (EDA) [
7]. Although these methods offer high detection accuracy, their reliance on wearable or contact-based sensors may negatively affect user comfort and acceptance, and limit scalability in real-world driving environments.
Facial landmark detection methods capture video data of the driver’s face using in-cabin cameras to extract visual cues such as eyelid closure percentage (PERCLOS), gaze direction, and facial expressions [
8,
9]. These indicators are commonly used as proxies for drowsiness and distraction. Similarly, behavior-based approaches extend this analysis by observing the driver’s gross motor behaviors, including head nodding, posture shifts, and hand movements, through video analysis or motion sensors installed inside the cabin [
10]. While these methods are non-invasive and intuitive, they suffer from high sensitivity to lighting variations, facial occlusions (e.g., masks, sunglasses), and privacy concerns, particularly in shared mobility scenarios. These limitations hinder their wide-scale deployment and raise significant ethical and data protection issues.
The role of mmWave radar as a promising non-contact sensing technology has expanded significantly in recent years, with critical applications found in physiological signal monitoring [
11], and even in traffic safety, such as external environment perception [
12], as well as within the domain of in-cabin driver monitoring. It offers several advantages, including strong robustness to environmental conditions, the ability to penetrate obstructions, and enhanced privacy protection. The mmWave radar exhibits diverse potential in in-vehicle applications. For example, Cardillo et al. [
13] thoroughly explored microwave and mmWave radar circuits for next-generation contact-less in-cabin detection, covering scenarios such as passenger presence detection and touch-free infotainment system control. However, there is a significant opportunity to extend this robust technology to the critical safety domain of driver state monitoring.
Therefore, this paper proposes a novel approach that utilizes mmWave radar. By leveraging this “non-contact” and “privacy-preserving” technology, our system detects driver fatigue and distraction through the analysis of driver posture and behavior dynamics. The main contributions of this study are summarized as follows:
A driver monitoring framework is developed based on mmWave radar to robustly identify driver behavior and posture changes.
A deep learning architecture named the radar-based spatiotemporal fusion network (RTSFN) is proposed, which effectively fuses temporal and spatial features extracted from radar signals to identify risky driving behaviors.
The system is designed with a balance between detection accuracy and computational efficiency, allowing real-time inference on edge devices and supporting practical deployment in in-vehicle environments.
The remainder of this paper is structured as follows.
Section 2 introduces the principles of FMCW radar and details the hardware setup.
Section 3 provides a comprehensive overview of the proposed driver monitoring system.
Section 4 elaborates on the RTSFN architecture, including the input data pre-processing, the spatiotemporal fusion network, and the real-time optimization mechanism.
Section 5 presents the experimental setup, results, a comparative analysis with other methods, and the ablation study. Finally, concluding remarks are given in Section 6 and Section 7.
2. FMCW Radar Principles and Hardware Setup
2.1. Overview of FMCW mmWave Radar Technology
Frequency-modulated continuous wave (FMCW) radar transmits a chirp signal—a continuous wave whose frequency increases or decreases linearly over time. When this signal encounters a target, part of it is reflected back to the radar receiver. The system generates this chirp signal via a signal generator, amplifies it, and transmits it through the transmitting antenna. The reflected signal is captured by the receiving antenna, amplified, and mixed with a local copy of the transmitted signal in a process called dechirping, which produces an intermediate frequency (IF) signal. This IF signal encodes information about the target’s distance and velocity. After digitization, digital signal processing techniques such as the fast Fourier transform (FFT) are applied to extract key target parameters, including range and velocity.
Figure 1 presents a simplified block diagram of an FMCW radar system.
FMCW technology is particularly advantageous for real-time driver monitoring due to its high range and velocity resolution, low power consumption [
14], and cost-effectiveness. Furthermore, its continuous-wave nature enables real-time measurements and robustness against multipath fading, making it suitable for tracking moving targets in dynamic in-vehicle environments.
Based on the frequency and phase characteristics of the received signals, the system can estimate both the range and velocity of targets. These are derived through well-established signal processing techniques as described below.
2.1.1. Range Estimation
The distance $d$ of an object is estimated based on the beat frequency $f_b$, which represents the difference in frequency between the transmitted and received signals. The relationship between the distance and the beat frequency is given by the following:
$$d = \frac{c \, f_b \, T_c}{2B}$$
where $c$ is the speed of light, $B$ is the chirp bandwidth, and $T_c$ is the chirp duration. A higher beat frequency corresponds to a larger distance between the radar and the target.
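To make the relationship concrete, the following minimal Python sketch converts a measured beat frequency into a range estimate using the equation above; the bandwidth and chirp-duration values are illustrative placeholders, not the exact AWR1642 chirp configuration.

```python
# Minimal sketch (not the paper's code): range from beat frequency,
# using illustrative chirp parameters rather than the AWR1642 profile.
C = 3e8        # speed of light (m/s)
B = 3.2e9      # chirp bandwidth (Hz), assumed value
T_C = 50e-6    # chirp duration (s), assumed value

def beat_freq_to_range(f_b: float) -> float:
    """d = c * f_b * T_c / (2 * B)."""
    return C * f_b * T_C / (2.0 * B)

print(beat_freq_to_range(1.0e6))  # ~2.34 m for a 1 MHz beat frequency
```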
2.1.2. Velocity Estimation
The velocity $v$ of an object is estimated using the Doppler effect, which causes a frequency shift ($f_d$) in the reflected signal as the object moves relative to the radar. The Doppler frequency shift is directly proportional to the relative velocity and can be expressed as follows:
$$f_d = \frac{2v}{\lambda}$$
where $f_d$ is the Doppler frequency shift and $\lambda$ is the wavelength of the transmitted signal. The magnitude of the Doppler shift is proportional to the target’s speed, while its sign indicates the direction of motion relative to the radar.
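Analogously, the Doppler relation can be inverted to recover radial velocity. The sketch below assumes a 77 GHz carrier (wavelength of roughly 3.9 mm); the example Doppler shift is illustrative.

```python
# Minimal sketch: radial velocity from the Doppler shift f_d, assuming a
# 77 GHz carrier; the input value is illustrative.
C = 3e8
F_CARRIER = 77e9
WAVELENGTH = C / F_CARRIER          # ~0.0039 m

def doppler_to_velocity(f_d: float) -> float:
    """v = f_d * lambda / 2; the sign gives the direction of motion."""
    return f_d * WAVELENGTH / 2.0

print(doppler_to_velocity(256.0))   # ~0.5 m/s toward the radar
```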
2.2. Advantages and Selection Considerations of FMCW Radar
While Doppler radar excels at measuring object velocity and provides stable phase information when the target distance is constant, this study primarily selects FMCW radar as the core technology for driver monitoring applications due to its multiple significant advantages. FMCW radar can simultaneously provide highly accurate range and velocity information, which is crucial for precisely locating the driver and capturing subtle changes in their posture and movements. In contrast, traditional Doppler radar, lacking a high-resolution range dimension, struggles to differentiate targets at various distances that share the same velocity. This presents a significant limitation in complex, multi-target, and noisy in-cabin environments. FMCW radar’s superior range resolution enables it to effectively separate the driver from vehicle structures or other objects amidst complex backgrounds, thereby substantially improving signal quality. Furthermore, FMCW radar’s advantages in relatively low power consumption and cost-effectiveness make it an ideal choice for real-time deployment on edge computing devices. These characteristics allow FMCW radar to more comprehensively and reliably meet the needs of driver behavior detection in this study compared to a standalone Doppler radar.
2.3. Radar Hardware Setup and System Parameters
To enable robust and real-time driver monitoring, we selected the AWR1642BOOST FMCW radar, Texas Instruments, Dallas, TX, USA, as our primary sensing device. Texas Instruments offers a variety of mmWave radar sensors under the AWR and IWR series. The AWR series is specifically optimized for automotive and industrial applications, making it well-suited for driving-related scenarios such as fatigue and distraction detection. In contrast, the IWR series is geared toward broader applications, including industrial sensing, security, and IoT domains.
Given the automotive focus of this study, the AWR1642 device was chosen due to its high-frequency operation in the 77 GHz band, which provides improved range and Doppler resolution compared to lower-frequency alternatives. Additionally, the AWR1642 integrates a high-performance digital signal processor (DSP), which supports efficient on-chip processing of radar signals. This is particularly advantageous for real-time applications, where fast and reliable computation is critical for handling large volumes of streaming data and executing complex signal processing algorithms with low latency.
AWR1642BOOST’s main radar configuration parameters (as shown in
Table 1) are designed to maximize the performance of in-cabin driver monitoring. The system operates in the 77 GHz frequency band, whose high-frequency characteristics ensure excellent detection capabilities. A range resolution of 0.047 m and a radial velocity resolution of 0.13 m/s ensure that the system meticulously captures driver posture changes and motion dynamics. The maximum unambiguous range is set to 2.42 m, which fully covers the effective detection area within the cockpit, ensuring that even significant tilting or stretching by the driver can be completely captured. Furthermore, a maximum radial velocity of 1 m/s is chosen to encompass various typical driver motion speeds that may occur during fatigue (e.g., nodding, yawning) or distraction (e.g., using a phone, turning around to pick up objects). The frame duration of 250 ms (corresponding to four frames per second) strikes a balance between the temporal resolution required for behavior analysis and the computational load for real-time processing on edge devices. Detection thresholds are uniformly set to 15 dB, which helps effectively filter out low-amplitude noise while ensuring the detection of critical driver-induced reflections. Overall, the synergistic setting of these parameters ensures that the system can accurately capture and classify diverse driving behaviors in real-time within a real in-vehicle environment. Data were collected through the UART interface using the TLV (type–length–value) format, enabling structured communication between the radar and the host system. And the specific antenna position used in our experimental setup is detailed in
Section 5.2.
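As a quick sanity check, the Table 1 resolutions follow from standard FMCW relations. The sketch below back-computes them under assumed waveform parameters (a roughly 3.2 GHz sweep bandwidth and about 15 ms of coherent chirping per frame); these are not the exact chirp profile programmed on the AWR1642BOOST.

```python
# Minimal sketch: sanity-checking the Table 1 resolutions with standard
# FMCW relations; bandwidth and coherent frame time are assumptions.
C = 3e8
WAVELENGTH = C / 77e9

bandwidth = 3.2e9                        # Hz (assumed)
range_res = C / (2 * bandwidth)          # delta_R = c / (2B) -> ~0.047 m

frame_time = 0.015                       # s of coherent chirping per frame (assumed)
vel_res = WAVELENGTH / (2 * frame_time)  # delta_v = lambda / (2 T_obs) -> ~0.13 m/s

print(f"range resolution ~ {range_res:.3f} m, velocity resolution ~ {vel_res:.2f} m/s")
```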
3. System Overview
This study proposes a robust driver monitoring system based on FMCW mmWave radar, which focuses on analyzing behavioral cues for accurate driver state evaluation. As illustrated in
Figure 2, the system’s operation begins with the mmWave radar receiving raw radar signals. These raw signals then proceed to the radar signal decoding module, where they are converted into analyzable radar signal data. This data primarily includes three types of features: range–Doppler maps, noise profiles, and range profiles. Next, these decoded data enter the data pre-processing stage, which involves several key steps. First is Frame Stacking, where consecutive radar frames are stacked to capture temporal dynamics. This is followed by data concatenation, which integrates different feature data. Finally, normalization is performed to ensure consistency in data scale, which helps to improve the accuracy and stability of the model’s inference results.
The pre-processed data stream then serves as input to the radar-based spatiotemporal fusion network (RTSFN). Within the RTSFN, feature extraction is performed to derive behavior-relevant spatiotemporal features from the radar data. These features are subsequently fed into the behavior classification module, which identifies various driving behaviors. Ultimately, the behavioral classification results from the RTSFN are transmitted to the driver state evaluation module, which assesses the driver’s current state, for instance, whether abnormal driving behavior is present. The system specifically identifies and outputs multiple abnormal behaviors, including fatigue driving, using a phone (raising a hand to the ear for a call or picking up a phone to use), turning backward, picking up objects, bending forward, and drinking. By focusing on behavioral classification, the proposed system enables effective evaluation of driver fatigue and distraction, offering the potential for timely warnings in real-world driving environments.
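The end-to-end flow of Figure 2 can be summarized in pseudocode form. The sketch below uses placeholder callables for the decoder, pre-processing, RTSFN, and evaluator modules; the names are illustrative rather than the actual implementation.

```python
# High-level flow of Figure 2 as a Python sketch; all module names are
# placeholders, not the real implementation.
def monitoring_step(raw_packets, decoder, preprocess, rtsfn, evaluator):
    # 1. Decode raw radar packets into range-Doppler maps, noise profiles,
    #    and range profiles.
    rd_maps, noise_profiles, range_profiles = decoder(raw_packets)
    # 2. Frame stacking, concatenation, and normalization.
    x_temporal, x_spatial = preprocess(rd_maps, noise_profiles, range_profiles)
    # 3. RTSFN feature extraction and behavior classification.
    risk, behavior = rtsfn(x_temporal, x_spatial)
    # 4. Driver state evaluation (e.g., raise a warning on dangerous behavior).
    return evaluator(risk, behavior)
```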
4. Human Motion Detection: RTSFN-Based Driver Action Detection
4.1. Input Data of Radar Signals
mmWave radar signals provide three types of input data, each offering distinct information for analyzing target characteristics and driver actions. To illustrate how these data types capture a dynamic action, we use a driver nodding event as a running example (as depicted in
Figure 3).
Range profile: The range profile is a one-dimensional representation of the total reflected signal power received by the radar at various range points. This profile is generated by performing a 1D fast Fourier transform (1D-FFT) on the received signal, offering a comprehensive snapshot of all objects within the radar’s field of view. It serves as the foundation for estimating the driver’s approximate position and torso area.
Figure 4 illustrates a time series of range profiles during a driver nodding event. In frames 25 and 26, significant power fluctuations are visible within the “Driver Body Area.” These fluctuations directly correspond to the movement of the driver’s head and upper body, as this action changes their distance and reflection angle relative to the sensor. In contrast, the more stable peaks in other regions likely correspond to static structures such as the car seat or dashboard, which provide consistent reflections.
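For illustration, a range profile of this kind can be approximated from one frame of IF samples as the per-range-bin power after the 1D range FFT. The non-coherent averaging across chirps in the sketch below is an assumption about how the profile is aggregated, not the vendor's exact processing chain.

```python
import numpy as np

# Minimal sketch: a range profile as per-range-bin reflected power after the
# 1D range FFT, averaged non-coherently across chirps (an assumption).
def range_profile(if_frame: np.ndarray) -> np.ndarray:
    range_fft = np.fft.fft(if_frame, axis=1)          # (chirps, range_bins)
    return 10 * np.log10(np.mean(np.abs(range_fft) ** 2, axis=0) + 1e-9)

print(range_profile(np.random.randn(64, 256)).shape)  # (256,): one value per range bin
```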
Noise profile: The noise profile, the second one-dimensional data feature used in our model, is physically defined as the background noise floor, which typically serves as a reference for algorithms such as the constant false alarm rate (CFAR). While traditionally treated as a baseline to be filtered or subtracted, our empirical analysis revealed that this profile contains valuable information sensitive to dynamic events. As illustrated in
Figure 5, the profile exhibits a stable contour when the driver is stationary (e.g., Frame 23), but shows drastic, instantaneous peaks within the driver’s area during nodding periods (e.g., Frames 25 and 26). We posit that this is not because the profile directly measures motion energy, but because physical actions from the driver introduce micro-vibrations and subtle changes to the in-cabin signal environment, which in turn modulate the overall background energy level. Therefore, we treat this data stream as a sensitive auxiliary indicator of dynamic events rather than as random noise. The significant contribution of this unconventional feature to the model’s performance is empirically validated in our comprehensive ablation study (
Section 5.6).
The range–Doppler map: The range–Doppler map is generated by applying a two-dimensional FFT to the received FMCW radar signal. First, a range FFT is performed on each individual chirp signal along the fast-time dimension to obtain the range spectrum. Subsequently, a Doppler FFT is performed for each range bin along the slow-time dimension (across chirps) to resolve the target’s velocity. The resulting range–Doppler map (as shown in
Figure 6) is a two-dimensional plot that simultaneously presents the distribution of the radar echo signal in both the range and velocity dimensions. Its horizontal axis represents the range in meters (m), while the vertical axis represents the relative velocity in meters per second (m/s), with the central axis (zero Doppler) corresponding to stationary objects. The color intensity of each cell represents the signal energy reflected by an object located at that specific range and moving at a specific radial velocity. This map provides the most complete view of the scene’s dynamics, capable of simultaneously resolving a target’s range, velocity, and reflection intensity.
Figure 6 illustrates a sequence of range–Doppler maps during the nodding event. In the initial phase of the motion (e.g., frame 23), the energy is primarily concentrated on the zero-Doppler horizontal line at the driver’s position, indicating a strong stationary reflection. When the nodding motion occurs (especially in frames 25 and 26), the energy spreads significantly in the vertical direction, spanning multiple velocity bins. This phenomenon is known as Doppler broadening. This is precisely because the head movement generates velocity, causing its reflected energy to shift from the zero-Doppler channel to channels representing non-zero velocities. As the motion concludes (frames 27–28), the energy converges back to the zero-Doppler line.
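The 2D-FFT processing described above can be sketched in a few lines of NumPy. The window function and dB scaling below are illustrative choices rather than the exact on-chip processing of the AWR1642.

```python
import numpy as np

# Minimal sketch: building a range-Doppler map from one frame of IF samples,
# shaped (num_chirps, samples_per_chirp); windowing and scaling are assumptions.
def range_doppler_map(if_frame: np.ndarray) -> np.ndarray:
    # 1D range FFT along fast time (per chirp)
    range_fft = np.fft.fft(if_frame * np.hanning(if_frame.shape[1]), axis=1)
    # Doppler FFT along slow time (across chirps), shifted so zero Doppler is centered
    doppler_fft = np.fft.fftshift(np.fft.fft(range_fft, axis=0), axes=0)
    return 20 * np.log10(np.abs(doppler_fft) + 1e-6)  # magnitude in dB

rd_map = range_doppler_map(np.random.randn(64, 256))   # dummy frame
print(rd_map.shape)  # (64 Doppler bins, 256 range bins)
```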
4.2. Input Data Pre-Processing
Human behavior is recognized through dynamic movements that evolve over time. To capture these temporal dynamics, we employ a frame stacking strategy where consecutive radar frames are concatenated along the temporal axis. This forms a time-extended input tensor, representing a temporal window of $N$ frames for each inference (as illustrated in Figure 7). This approach is analogous to the windowing process in the short-time Fourier transform (STFT), which also analyzes a long signal in successive segments. However, a key distinction lies in the subsequent processing: instead of applying a Fourier Transform to extract pre-defined frequency features, our method inputs the raw tensor directly into a neural network. This allows the model to learn the most salient temporal features for behavior recognition in an end-to-end, data-driven manner.
Prior to model inference, each radar frame sequence undergoes median filtering to suppress impulsive noise and min–max normalization to maintain a consistent input scale. These preprocessing steps enhance the robustness and stability of the model in real-time operating conditions.
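A minimal version of this pre-processing is sketched below; the 3 × 3 median-filter kernel is an assumed value, and N = 14 frames is used to match the buffer length adopted later.

```python
import numpy as np
from scipy.ndimage import median_filter

# Minimal sketch of the pre-processing: per-frame median filtering, frame
# stacking along a temporal axis, and min-max normalization. The kernel size
# is an assumption; N = 14 matches the feature buffer length used later.
def preprocess(frames: list) -> np.ndarray:
    filtered = [median_filter(f, size=3) for f in frames]   # impulsive-noise suppression
    x = np.stack(filtered, axis=-1)                          # (..., N) temporal window
    return (x - x.min()) / (x.max() - x.min() + 1e-8)        # consistent [0, 1] scale

window = preprocess([np.random.randn(16, 64) for _ in range(14)])
print(window.shape)  # (16, 64, 14), e.g., one range-Doppler input tensor
```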
4.3. Radar-Based Spatiotemporal Fusion Network (RTSFN)
In this study, we propose a deep learning framework termed the radar-based spatiotemporal fusion network (RTSFN), as depicted in
Figure 8. The central design philosophy of RTSFN lies in the principle of functional decoupling of distinct physical observables derived from millimeter-wave radar signals. Driving behavior can be regarded as a composite of dynamic actions and static postures, which are inherently represented in the radar data streams. Accordingly, RTSFN employs a dual-stream parallel encoder architecture, where each branch is dedicated to processing radar data that encapsulates a specific physical property.
Temporal feature extraction (dynamic motion information): We select the range–Doppler map as the input for the temporal encoder, aiming to capture the driver’s dynamic motion information. The core advantage of the range–Doppler map lies in its velocity dimension, which directly reflects the target’s motion relative to the radar. When the driver performs actions such as nodding, reaching, or turning, their body parts generate distinct signatures in the Doppler spectrum (i.e., Doppler broadening). Therefore, a continuous sequence of range–Doppler maps provides a clear temporal evolution of such dynamic behaviors, making it an ideal source for extracting time-related features.
Spatial feature extraction (static postural information): We select the range profile and the noise profile as the inputs for the spatial encoder, aiming to extract static or semi-static postural information. The range profile represents the total reflected energy at each range bin, offering a stable snapshot of the driver’s body contour and spatial layout. The noise profile, while traditionally regarded as a baseline noise floor, is empirically observed to be sensitive to subtle environmental modulations caused by the driver’s posture and micro-movements. Together, the range profile and the noise profile can be considered as “cross-sectional snapshots” of the driver’s current state, making them highly suitable for extracting spatial features representative of static or semi-static postures.
Through the explicit separation of dynamic motion information (derived from range–Doppler maps) and static posture information (extracted from range profiles and noise profiles) at the signal level, RTSFN is able to learn temporal and spatial features in the most efficient manner from their respective optimal data sources. Finally, a dedicated fusion module deeply integrates these specialized features, thereby enabling a comprehensive and accurate assessment of the driver’s state.
4.3.1. Temporal Modality Encoder
The temporal modality encoder processes range–Doppler map sequences with an input shape of (16, 64, 14), corresponding to the height, width, and temporal length of the map. The core task of this encoder is to learn features representative of the driver’s dynamic behavior by employing a “spatial feature extraction followed by temporal modeling” strategy.
First, to process each frame independently, the input tensor is reshaped, allowing a lightweight convolutional neural network (CNN) wrapped in a TimeDistributed layer to extract spatial features from each frame. This CNN consists of two convolutional layers (with 32 and 64 3 × 3 filters, respectively, using the ReLU activation function) followed by a 2 × 2 max-pooling layer. This design aims to efficiently capture the localized Doppler broadening and energy distribution patterns caused by body movements within each individual map. The resulting sequence of frame-level feature vectors is then fed into a gated temporal convolutional network (gated TCN; Figure 9) for deep temporal modeling. We chose gated TCN over traditional recurrent neural networks (RNNs), such as LSTM, due to several intrinsic architectural advantages. Firstly, the 1D convolutional operations in a TCN are inherently highly parallelizable. This characteristic allows the model to fully leverage the computational power of modern GPUs during the training phase, significantly shortening the development and iteration cycles. Secondly, its structure of causal and dilated convolutions enables it to stably capture very long-range temporal dependencies while effectively avoiding the vanishing gradient problem common in RNNs. These properties make TCN a powerful and efficient foundation for our sequence modeling task.
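A compact Keras sketch of this temporal branch is given below. The per-frame CNN follows the filter sizes stated above, while the frame-level pooling and the single gated dilated-convolution block are simplifications of the full gated TCN; layer widths not specified in the text are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Simplified sketch of the temporal encoder: per-frame CNN (TimeDistributed)
# followed by one gated, dilated causal convolution block. Not the exact RTSFN.
def temporal_encoder(num_frames=14, h=16, w=64):
    inp = layers.Input(shape=(h, w, num_frames))                 # range-Doppler stack
    x = layers.Permute((3, 1, 2))(inp)                           # -> (frames, h, w)
    x = layers.Reshape((num_frames, h, w, 1))(x)                 # add a channel dim

    frame_in = layers.Input(shape=(h, w, 1))                     # per-frame CNN
    y = layers.Conv2D(32, 3, activation="relu", padding="same")(frame_in)
    y = layers.Conv2D(64, 3, activation="relu", padding="same")(y)
    y = layers.MaxPooling2D(2)(y)
    y = layers.GlobalAveragePooling2D()(y)                       # pooling is a simplification
    frame_cnn = tf.keras.Model(frame_in, y)

    x = layers.TimeDistributed(frame_cnn)(x)                     # (frames, features)
    # Gated TCN block: dilated causal convolutions with a sigmoid gate.
    f = layers.Conv1D(64, 3, padding="causal", dilation_rate=2, activation="tanh")(x)
    g = layers.Conv1D(64, 3, padding="causal", dilation_rate=2, activation="sigmoid")(x)
    x = layers.Multiply()([f, g])
    f_t = layers.GlobalAveragePooling1D()(x)                     # temporal feature vector F_t
    return tf.keras.Model(inp, f_t, name="temporal_encoder")
```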
4.3.2. Spatial Modality Encoder
The primary responsibility of the Spatial Modality Encoder is to extract spatial features representative of the driver’s static or semi-static ‘posture’ from the one-dimensional range profile and noise profile. The encoder receives the concatenated profile data with an input shape of (64, 2, 14), representing the number of range bins, signal channels, and temporal frames, respectively. Similar to the temporal encoder, the input tensor is reshaped to align the temporal frames, allowing a TimeDistributed layer to apply the same spatial feature extractor to the (64, 2) data of each time step.
This hierarchical spatial feature extraction is implemented via a lightweight convolutional neural network (CNN). The network consists of three 2D convolutional layers with progressively increasing filter counts (e.g., 16, 32, 64) and employs asymmetric kernel sizes suitable for this input shape. Each convolutional layer is followed by a ReLU activation function, and downsampling is performed at appropriate stages. Following the convolutional layers, we introduce a squeeze-and-excitation (SE) block. Its purpose is to enable the network to adaptively learn the importance of different feature channels. For instance, different feature channels might learn to detect different postural sub-features, with some focusing on the strong reflection peaks of the torso. The SE block uses global information to dynamically re-weight these channels, allowing the model to ‘pay more attention’ to the channels that are most critical for identifying the current specific posture, thereby improving the accuracy of posture recognition. Finally, the feature map undergoes global average pooling and flattening operations to produce a 64-dimensional spatial feature vector, $F_s$.
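The squeeze-and-excitation re-weighting at the heart of this encoder can be sketched as follows; the reduction ratio of 4 is an assumed hyperparameter.

```python
from tensorflow.keras import layers

# Minimal sketch of a squeeze-and-excitation (SE) block applied after the
# spatial CNN; the reduction ratio is an assumption.
def se_block(x, ratio=4):
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                     # squeeze: global context
    s = layers.Dense(channels // ratio, activation="relu")(s)  # excitation MLP
    s = layers.Dense(channels, activation="sigmoid")(s)        # per-channel weights
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                           # re-weight feature channels
```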
4.3.3. Spatial–Temporal Fusion Module
After extracting the temporal feature vector $F_t$ and the spatial feature vector $F_s$, respectively, the task of the fusion module is to deeply integrate them to generate a comprehensive understanding of the driver’s state. This process consists of several carefully designed stages. First, we employ a cross-gated fusion strategy, where temporal and spatial features act as mutual “attention filters” to enhance feature complementarity and suppress potential noise or misjudgments from a single modality. This mechanism is mathematically defined as follows:
$$G_t = \sigma(W_t [F_t; F_s] + b_t), \qquad G_s = \sigma(W_s [F_t; F_s] + b_s)$$
$$F_{\text{fused}} = G_t \odot F_t + G_s \odot F_s$$
Here, the gating signals ($G_t$, $G_s$) are dynamic weights learned from both the spatial ($F_s$) and temporal ($F_t$) features. These gates are applied to the original features via element-wise multiplication (⊙), and the results are summed to produce the fused feature vector, $F_{\text{fused}}$. This design allows the model to dynamically adjust feature importance based on context—for instance, amplifying postural features ($F_s$) when significant motion ($F_t$) is detected—thereby producing a more robust and comprehensive representation of the driver’s state. Next, to allow the downstream classification heads to utilize all available raw and fused information, we concatenate these three features: $F_{\text{concat}} = [F_t; F_s; F_{\text{fused}}]$. Furthermore, to enable the model to specialize for the two different tasks of “dangerous driving state (binary classification)” and “specific dangerous behavior (multi-class classification),” we designed a task-specific gating mechanism. An independent gating layer is introduced before each task-specific prediction head. This gating layer learns a weight vector and performs an element-wise multiplication with the concatenated features, thereby allowing each task to autonomously amplify the feature combinations most important to it while suppressing interference from irrelevant features.
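A minimal Keras sketch of the cross-gated fusion and feature concatenation is shown below, assuming 64-dimensional feature vectors and sigmoid gates learned from the concatenated features; exact layer sizes are not specified in the text and are therefore assumptions.

```python
from tensorflow.keras import layers

# Minimal sketch of cross-gated fusion: gates learned from both feature
# vectors, applied element-wise, then concatenated with the raw features.
def cross_gated_fusion(f_t, f_s, dim=64):
    both = layers.Concatenate()([f_t, f_s])
    g_t = layers.Dense(dim, activation="sigmoid")(both)     # gate for temporal features
    g_s = layers.Dense(dim, activation="sigmoid")(both)     # gate for spatial features
    fused = layers.Add()([layers.Multiply()([g_t, f_t]),
                          layers.Multiply()([g_s, f_s])])   # F_fused
    return layers.Concatenate()([f_t, f_s, fused])          # F_concat for the heads
```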
After the task-specific gating adjustments, the feature streams are processed by two prediction heads arranged in a hierarchical structure to achieve a more efficient classification strategy. First, the upstream prediction head is responsible for the binary classification of the overall driving state. Its input features first pass through an Adapter module, a lightweight residual structure that helps to stabilize the training process and enhance feature representation. Subsequently, after being processed by several fully-connected layers (MLPs), this head outputs a prediction of ‘Danger’ or ‘Safe’. This binary prediction then acts as a conditional gate to determine whether to activate the downstream multi-class behavior classification task. Specifically, when the prediction is ‘Danger’, the gate signal is 1, allowing the features of the lower branch to proceed; conversely, when the prediction is ‘Safe’, the gate signal is 0, effectively halting the unnecessary and more computationally intensive fine-grained classification. When activated, the features for the second prediction head similarly pass through an identical Adapter module and MLP structure to output the final prediction for the multi-class dangerous behaviors. This hierarchical design ensures that computational resources are intelligently allocated, performing detailed classification only when truly necessary. This strategy offers a dual advantage: firstly, it significantly enhances the system’s response time, allowing for immediate warnings with minimal latency upon detecting a potential hazard. Secondly, by enabling the upstream binary classifier to focus on the more distinct states of ‘Safe’ versus ‘Danger,’ it effectively filters out normal driving data. This reduces the burden and potential confusion for the downstream multi-class classifier, thereby indirectly improving the accuracy and stability of the overall behavior recognition.
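At inference time, the conditional gating between the two heads can be approximated by the simple control flow below; the 0.5 threshold and the hard skip are simplifications of the gating signal described above.

```python
# Inference-time sketch of the hierarchical heads: the multi-class head runs
# only when the binary head predicts 'Danger' (a simplification of the gate).
def hierarchical_predict(features, binary_head, behavior_head, threshold=0.5):
    p_danger = binary_head(features)            # probability of 'Danger'
    if p_danger < threshold:
        return "Safe", None                     # skip fine-grained classification
    return "Danger", behavior_head(features)    # identify the specific behavior
```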
In summary, the proposed RTSFN core architecture, featuring dual-stream encoders, a cross-gated fusion strategy, and hierarchical prediction heads, effectively integrates spatiotemporal features for accurate driver behavior recognition. However, this direct-processing approach presents a significant efficiency bottleneck for real-time streaming applications due to redundant computations on the entire sliding window of $N$ frames. To overcome this challenge and enable efficient deployment on resource-constrained edge devices, the following section details a corresponding performance optimization scheme.
4.4. Real-Time Optimization via Feature Buffering Mechanism
As mentioned in the previous section, the RTSFN core architecture faces an efficiency bottleneck due to redundant computations when directly processing real-time data streams. To overcome this challenge and enable the model to run efficiently on resource-constrained edge devices, we designed and implemented a key performance optimization scheme: the feature buffering mechanism. This mechanism, as illustrated in the overall architecture in
Figure 10, is located after the feature encoders and before the fusion module. Its operation is defined by the following two independent key parameters:
Feature buffer size ($N$): This refers to the capacity of each buffer, i.e., the number of feature vectors it can store. This value is determined by the input sequence length requirement of the downstream model (e.g., the gated TCN). In our design, $N$ is set to 14.
Step size (S): This represents the number of new raw radar frames processed in each update step. This value is a key hyperparameter that balances the trade-off between real-time responsiveness and computational cost.
The mechanism follows the principle of a fixed-size first-in, first-out (FIFO) queue, and its detailed operational flow is as follows:
Initialization: Upon system startup, the encoders process an initial $N$ raw frames to generate $N$ feature vectors, completely filling the buffer.
First inference: Once the buffer is full, the complete feature sequence of length $N$ is sent to the subsequent modules to perform the first inference.
Sliding update: During stable operation, the system only processes S new raw frames, and the encoders generate S new feature vectors. Concurrently, the buffer discards (dequeues) the S oldest feature vectors from its head and adds (enqueues) the S new feature vectors to its tail, maintaining its total capacity at $N$.
Subsequent inference: This updated feature sequence, now containing the latest information, is sent again to the subsequent modules for the next inference. This process repeats continuously with the incoming data stream.
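A minimal sketch of this buffering logic is given below, using a fixed-length FIFO of per-frame feature vectors; the encoder callable and the default step size are placeholders.

```python
from collections import deque
import numpy as np

# Minimal sketch of the feature buffering mechanism: a fixed-size FIFO of
# per-frame feature vectors, updated with S new frames per step. The encoder
# is a placeholder callable (e.g., the per-frame part of either branch).
class FeatureBuffer:
    def __init__(self, encoder, n=14, step=2):
        self.encoder, self.step = encoder, step
        self.buf = deque(maxlen=n)              # oldest features drop automatically

    def update(self, new_raw_frames):
        # Encode only the S newest raw frames, then enqueue their features.
        for frame in new_raw_frames[-self.step:]:
            self.buf.append(self.encoder(frame))
        # Return a fixed-length sequence only once the buffer is full.
        return np.stack(self.buf, axis=0) if len(self.buf) == self.buf.maxlen else None
```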
Through this elegant design, the feature buffering mechanism brings decisive advantages to the entire system. Firstly, in terms of efficiency, it significantly reduces computational load and power consumption by avoiding redundant calculations on the entire window. Secondly, this high efficiency directly translates to the system’s real-time performance, enabling high-frequency, low-latency responses. Finally, by acting as a ‘data adapter,’ the mechanism ensures the stability of the downstream models by providing them with a consistent, fixed-length input. The choice of the step size S directly determines the system’s operational characteristics and performance trade-offs. Choosing a smaller S value (e.g., 1 or 2) leads to a very high update frequency and low latency, enabling the system to react most quickly to instantaneous changes in the driver’s state, and the prediction results are also smoother and more stable. This is crucial for real-time monitoring applications that pursue the highest level of safety. Conversely, choosing a larger S value reduces the system’s update frequency and overall computational load, which is suitable for scenarios where real-time requirements are slightly lower, but more emphasis is placed on average power consumption. In the implementation of this study, we prioritize real-time response speed, and therefore primarily adopted smaller S values in our experiments.
4.5. Dataset Labeling
To obtain accurate ground-truth labels, we first perform manual, frame-by-frame annotation for all collected data by simultaneously cross-referencing visualized radar data (e.g., range–Doppler maps) with a synchronized video feed. This initial annotation assigns a specific label to each frame, where a label of ‘0’ indicates normal driving, and labels ‘1’ through ‘7’ correspond to one of the seven defined dangerous behaviors.
Based on these initial annotations, we generate two distinct sets of labels for our model’s multi-task learning framework:
Driving risk classification (binary task): This task determines if the overall driving state is safe or dangerous. Frames originally labeled as ‘Normal Driving’ (label 0) are mapped to the ‘Safe’ class. All frames corresponding to any of the seven dangerous behaviors (labels 1–7) are uniformly categorized into a single ‘Dangerous’ class. This allows the model to first learn the general patterns that distinguish abnormal states from normal driving.
Behavior recognition (multi-class task): This task aims to identify the specific type of dangerous behavior. Therefore, it exclusively uses the original detailed labels from ‘1’ to ‘7’. Data corresponding to ‘Normal Driving’ is masked or ignored in this task, enabling the model to focus on learning the subtle differences between the various dangerous actions.
After establishing these two sets of task-specific labels, we employ a sliding window approach to prepare the final training samples. The label for each sample window is determined by the task-specific annotation of the last frame within that window. Furthermore, to enhance labeling stability and reduce noise from transient actions, we adopt a delayed labeling strategy, as illustrated in
Figure 11. A behavior is only labeled if it persists for at least three consecutive frames, and the label is extended for an additional three frames after the behavior concludes. This entire approach ensures robust and consistent labeling for training our spatiotemporal model.
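The delayed labeling rule can be expressed compactly as follows; per-frame integer labels are assumed, with 0 denoting normal driving, and the three-frame persistence and extension follow the description above.

```python
import numpy as np

# Minimal sketch of the delayed labeling strategy: keep a behavior label only
# if it persists for at least `min_len` frames, then extend it for `extend`
# frames after it ends (0 = normal driving).
def delayed_labels(raw, min_len=3, extend=3):
    labels = np.asarray(raw)
    out = np.zeros_like(labels)
    i = 0
    while i < len(labels):
        if labels[i] == 0:
            i += 1
            continue
        j = i
        while j < len(labels) and labels[j] == labels[i]:
            j += 1                                  # [i, j) is one behavior segment
        if j - i >= min_len:                        # keep only persistent behaviors
            out[i:min(j + extend, len(labels))] = labels[i]
        i = j
    return out

print(delayed_labels([0, 3, 3, 3, 3, 0, 0, 0, 0, 5, 0]))
# -> [0 3 3 3 3 3 3 3 0 0 0]  (4-frame segment kept and extended; lone '5' dropped)
```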
4.6. Model Training
To simultaneously optimize the driving risk classification and behavior recognition tasks, we designed a multi-task learning framework that incorporates both uncertainty-based loss weighting and representation alignment.
During training, the model is supervised using three loss components, as follows:
Task-specific losses: We compute cross-entropy losses for both classification heads, yielding a binary risk classification loss $\mathcal{L}_{\text{bin}}$ and a multi-class behavior recognition loss $\mathcal{L}_{\text{multi}}$, each of the form
$$\mathcal{L}_{\text{CE}} = -\sum_{c} y_c \log(\hat{y}_c)$$
where $y$ is the ground truth label and $\hat{y}$ is the predicted probability.
Uncertainty-based weighting: To address the issue that the training process can be dominated by a specific task in multi-task learning, we adopt the uncertainty-based weighting method proposed by Kendall et al. [15]. This approach introduces two learnable log-variance parameters, $s_1$ and $s_2$, to adaptively balance the tasks. The uncertainty-weighted total task loss is as follows:
$$\mathcal{L}_{\text{task}} = e^{-s_1}\,\mathcal{L}_{\text{bin}} + e^{-s_2}\,\mathcal{L}_{\text{multi}} + s_1 + s_2$$
Cosine alignment loss: To encourage consistent feature representations across modalities (e.g., range and Doppler), we apply a cosine similarity penalty between the latent vectors $F_t$ and $F_s$:
$$\mathcal{L}_{\text{align}} = 1 - \frac{F_t \cdot F_s}{\lVert F_t \rVert\,\lVert F_s \rVert}$$
This acts as a form of regularization, compelling the model to learn shared, robust features from two different perspectives of the same event. By enforcing this consistency, we can prevent the model from overfitting to the idiosyncrasies of a single modality and improve its ability to generalize.
The total loss function is expressed as follows:
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \lambda\,\mathcal{L}_{\text{align}}$$
where $\lambda$ is a hyperparameter that controls the alignment strength. This strategy allows the model to automatically reweight the importance of each task based on their predictive uncertainty, while promoting cross-modal feature consistency. All parameters, including the uncertainty weights, are learned end-to-end through backpropagation.
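Putting the three components together, a minimal TensorFlow sketch of the combined loss is shown below; the variable names and the use of exp(-s) weighting follow a common implementation of Kendall et al.'s formulation and are not taken verbatim from the paper's code.

```python
import tensorflow as tf

# Minimal sketch of the combined loss: uncertainty-weighted task losses plus a
# cosine alignment term. s1, s2 are learnable log-variance parameters;
# lambda_align = 0.05 matches the final configuration reported below.
s1 = tf.Variable(0.0)          # log-variance for the binary task
s2 = tf.Variable(0.0)          # log-variance for the multi-class task

def total_loss(y_bin, p_bin, y_multi, p_multi, f_t, f_s, lambda_align=0.05):
    l_bin = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_bin, p_bin))
    l_multi = tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(y_multi, p_multi))
    l_task = tf.exp(-s1) * l_bin + s1 + tf.exp(-s2) * l_multi + s2
    cos = tf.reduce_mean(tf.reduce_sum(
        tf.nn.l2_normalize(f_t, -1) * tf.nn.l2_normalize(f_s, -1), axis=-1))
    return l_task + lambda_align * (1.0 - cos)      # alignment penalty: 1 - cosine
```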
In our experiments, the model was trained for a total of 50 epochs using the Adam optimizer with its default initial learning rate of 0.001. We used a batch size of 32. To address class imbalance in the driving risk classification task, we employed a cost-sensitive learning approach by adjusting the class weights in the loss function. The final model configuration includes the cosine alignment loss, with the hyperparameter $\lambda$ set to 0.05. The entire training process was conducted on a single NVIDIA GeForce RTX 3070 Ti GPU.
7. Conclusions
This paper introduces a novel driver monitoring system that leverages mmWave FMCW radar and a deep spatiotemporal fusion network to recognize high-risk driving behaviors in a fully non-contact and privacy-preserving manner. Unlike conventional approaches that rely solely on visual cues or physiological sensors, the proposed method combines multi-modal radar features and multi-task deep learning to robustly assess a range of high-risk driving behaviors.
The proposed system achieves high recognition accuracy while maintaining real-time performance on edge hardware, marking a crucial step towards deployment in practical automotive environments. By operating independently of ambient light and without requiring driver cooperation or wearable devices, the system offers a scalable and unobtrusive solution for enhancing road safety.
Overall, this work contributes a unified sensing and inference framework that advances the capabilities of radar-based driver monitoring. It opens new directions for future in-vehicle intelligence, paving the way for seamless integration into next-generation driver assistance and autonomous driving platforms.