1. Introduction
Safe and focused driving is fundamental to preventing traffic accidents and safeguarding the well-being of drivers, passengers, and other road users. However, driver fatigue and distraction have been widely recognized as major contributing factors to road traffic accidents, particularly in contexts such as long-distance transportation, commercial driving, and ride-sharing services. According to the
Global Status Report on Road Safety 2023 published by the World Health Organization (WHO) [
1], approximately 69% of global traffic fatalities involve individuals aged 18 to 59—the most economically active segment of the population. Furthermore, 18% of driver deaths are associated with commuting or work-related driving, underscoring the heightened risks of fatigue, especially in long-haul or professional driving scenarios. Distracted driving has also emerged as an increasingly critical concern, exacerbated by the widespread use of mobile devices. Studies have shown that drivers who use mobile phones while driving are significantly more likely to be involved in traffic collisions compared to those who do not. Such behaviors impair visual attention, cognitive processing, and reaction time, thereby posing substantial threats to road safety. As a result, the accurate and efficient detection of driver fatigue and distraction has become an urgent need in intelligent transportation systems, autonomous vehicles, and fleet management applications.
Existing methods for detecting driver fatigue and distraction can be broadly categorized into four main approaches, as follows: vehicle operation data analysis, physiological signal monitoring, facial landmark detection, and behavior analysis.
Vehicle operation data analysis methods infer driver state by analyzing abnormal vehicle behaviors, such as abrupt steering, lane deviations, or sudden acceleration, using data collected from the vehicle’s onboard diagnostics (OBD) system [
2,
3]. These methods typically employ statistical models or machine learning algorithms to identify deviations from standard driving patterns. While these methods are easy to deploy and cost-efficient, they are highly susceptible to external interferences, including weather conditions, road types, and driver habits, which may result in false alarms or missed detections.
Physiological signal monitoring methods aim to directly evaluate the driver’s internal mental state by measuring signals such as heart rate variability (HRV) [
4], brainwave activity (EEG) [
5,
6], and skin conductance level (EDA) [
7]. Although these methods offer high detection accuracy, their reliance on wearable or contact-based sensors may negatively affect user comfort and acceptance, and limit scalability in real-world driving environments.
Facial landmark detection methods capture video data of the driver’s face using in-cabin cameras to extract visual cues such as eyelid closure percentage (PERCLOS), gaze direction, and facial expressions [
8,
9]. These indicators are commonly used as proxies for drowsiness and distraction. Similarly, behavior-based approaches extend this analysis by observing the driver’s gross motor behaviors, including head nodding, posture shifts, and hand movements, through video analysis or motion sensors installed inside the cabin [
10]. While these methods are non-invasive and intuitive, they suffer from high sensitivity to lighting variations, facial occlusions (e.g., masks, sunglasses), and privacy concerns, particularly in shared mobility scenarios. These limitations hinder their wide-scale deployment and raise significant ethical and data protection issues.
The role of mmWave radar as a promising non-contact sensing technology has expanded significantly in recent years, with critical applications found in physiological signal monitoring [
11], and even in traffic safety, such as external environment perception [
12], as well as within the domain of in-cabin driver monitoring. It offers several advantages, including strong robustness to environmental conditions, the ability to penetrate obstructions, and enhanced privacy protection. The mmWave radar exhibits diverse potential in in-vehicle applications. For example, Cardillo et al. [
13] thoroughly explored microwave and mmWave radar circuits for next-generation contact-less in-cabin detection, covering scenarios such as passenger presence detection and touch-free infotainment system control. However, there is a significant opportunity to extend this robust technology to the critical safety domain of driver state monitoring.
Therefore, this paper proposes a novel approach that utilizes mmWave radar. By leveraging this “non-contact” and “privacy-preserving” technology, our system detects driver fatigue and distraction through the analysis of driver posture and behavior dynamics. The main contributions of this study are summarized as follows:
A driver monitoring framework is developed based on mmWave radar to robustly identify driver behavior and posture changes.
A deep learning architecture named the radar-based spatiotemporal fusion network (RTSFN) is proposed, which effectively fuses temporal and spatial features extracted from radar signals to identify risky driving behaviors.
The system is designed with a balance between detection accuracy and computational efficiency, allowing real-time inference on edge devices and supporting practical deployment in in-vehicle environments.
The remainder of this paper is structured as follows.
Section 2 introduces the principles of FMCW radar and details the hardware setup.
Section 3 provides a comprehensive overview of the proposed driver monitoring system.
Section 4 elaborates on the RTSFN architecture, including the input data pre-processing, the spatiotemporal fusion network, and the real-time optimization mechanism.
Section 5 presents the experimental setup, results, a comparative analysis with other methods, and the ablation study. Finally, concluding remarks are given in Section 6 and Section 7.
2. FMCW Radar Principles and Hardware Setup
2.1. Overview of FMCW mmWave Radar Technology
Frequency-modulated continuous wave (FMCW) radar transmits a chirp signal—a continuous wave whose frequency increases or decreases linearly over time. When this signal encounters a target, part of it is reflected back to the radar receiver. The system generates this chirp signal via a signal generator, amplifies it, and transmits it through the transmitting antenna. The reflected signal is captured by the receiving antenna, amplified, and mixed with a local copy of the transmitted signal in a process called dechirping, which produces an intermediate frequency (IF) signal. This IF signal encodes information about the target’s distance and velocity. After digitization, digital signal processing techniques such as the fast Fourier transform (FFT) are applied to extract key target parameters, including range and velocity.
Figure 1 presents a simplified block diagram of an FMCW radar system.
FMCW technology is particularly advantageous for real-time driver monitoring due to its high range and velocity resolution, low power consumption [
14], and cost-effectiveness. Furthermore, its continuous-wave nature enables real-time measurements and robustness against multipath fading, making it suitable for tracking moving targets in dynamic in-vehicle environments.
Based on the frequency and phase characteristics of the received signals, the system can estimate both the range and velocity of targets. These are derived through well-established signal processing techniques as described below.
2.1.1. Range Estimation
The distance $d$ of an object is estimated based on the beat frequency $f_b$, which represents the difference in frequency between the transmitted and received signals. The relationship between the distance and the beat frequency is given by the following:
$$d = \frac{c \, f_b \, T_c}{2B}$$
where $c$ is the speed of light, $B$ is the chirp bandwidth, and $T_c$ is the chirp duration. A higher beat frequency corresponds to a larger distance between the radar and the target.
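To make the relationship concrete, the following minimal Python sketch converts a measured beat frequency into a range estimate using the equation above; the bandwidth and chirp-duration values are illustrative placeholders, not the exact AWR1642 chirp configuration.

```python
# Minimal sketch (not the paper's code): range from beat frequency,
# using illustrative chirp parameters rather than the AWR1642 profile.
C = 3e8        # speed of light (m/s)
B = 3.2e9      # chirp bandwidth (Hz), assumed value
T_C = 50e-6    # chirp duration (s), assumed value

def beat_freq_to_range(f_b: float) -> float:
    """d = c * f_b * T_c / (2 * B)."""
    return C * f_b * T_C / (2.0 * B)

print(beat_freq_to_range(1.0e6))  # ~2.34 m for a 1 MHz beat frequency
```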
2.1.2. Velocity Estimation
The velocity $v$ of an object is estimated using the Doppler effect, which causes a frequency shift ($f_d$) in the reflected signal as the object moves relative to the radar. The Doppler frequency shift is directly proportional to the relative velocity and can be expressed as follows:
$$f_d = \frac{2v}{\lambda}$$
where $f_d$ is the Doppler frequency shift and $\lambda$ is the wavelength of the transmitted signal. The magnitude of the Doppler shift is proportional to the target’s speed, while its sign indicates the direction of motion relative to the radar.
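Analogously, the Doppler relation can be inverted to recover radial velocity. The sketch below assumes a 77 GHz carrier (wavelength of roughly 3.9 mm); the example Doppler shift is illustrative.

```python
# Minimal sketch: radial velocity from the Doppler shift f_d, assuming a
# 77 GHz carrier; the input value is illustrative.
C = 3e8
F_CARRIER = 77e9
WAVELENGTH = C / F_CARRIER          # ~0.0039 m

def doppler_to_velocity(f_d: float) -> float:
    """v = f_d * lambda / 2; the sign gives the direction of motion."""
    return f_d * WAVELENGTH / 2.0

print(doppler_to_velocity(256.0))   # ~0.5 m/s toward the radar
```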
2.2. Advantages and Selection Considerations of FMCW Radar
While Doppler radar excels at measuring object velocity and provides stable phase information when the target distance is constant, this study primarily selects FMCW radar as the core technology for driver monitoring applications due to its multiple significant advantages. FMCW radar can simultaneously provide highly accurate range and velocity information, which is crucial for precisely locating the driver and capturing subtle changes in their posture and movements. In contrast, traditional Doppler radar, lacking a high-resolution range dimension, struggles to differentiate targets at various distances that share the same velocity. This presents a significant limitation in complex, multi-target, and noisy in-cabin environments. FMCW radar’s superior range resolution enables it to effectively separate the driver from vehicle structures or other objects amidst complex backgrounds, thereby substantially improving signal quality. Furthermore, FMCW radar’s advantages in relatively low power consumption and cost-effectiveness make it an ideal choice for real-time deployment on edge computing devices. These characteristics allow FMCW radar to more comprehensively and reliably meet the needs of driver behavior detection in this study compared to a standalone Doppler radar.
2.3. Radar Hardware Setup and System Parameters
To enable robust and real-time driver monitoring, we selected the AWR1642BOOST FMCW radar, Texas Instruments, Dallas, TX, USA, as our primary sensing device. Texas Instruments offers a variety of mmWave radar sensors under the AWR and IWR series. The AWR series is specifically optimized for automotive and industrial applications, making it well-suited for driving-related scenarios such as fatigue and distraction detection. In contrast, the IWR series is geared toward broader applications, including industrial sensing, security, and IoT domains.
Given the automotive focus of this study, the AWR1642 device was chosen due to its high-frequency operation in the 77 GHz band, which provides improved range and Doppler resolution compared to lower-frequency alternatives. Additionally, the AWR1642 integrates a high-performance digital signal processor (DSP), which supports efficient on-chip processing of radar signals. This is particularly advantageous for real-time applications, where fast and reliable computation is critical for handling large volumes of streaming data and executing complex signal processing algorithms with low latency.
AWR1642BOOST’s main radar configuration parameters (as shown in
Table 1) are designed to maximize the performance of in-cabin driver monitoring. The system operates in the 77 GHz frequency band, whose high-frequency characteristics ensure excellent detection capabilities. A range resolution of 0.047 m and a radial velocity resolution of 0.13 m/s ensure that the system meticulously captures driver posture changes and motion dynamics. The maximum unambiguous range is set to 2.42 m, which fully covers the effective detection area within the cockpit, ensuring that even significant tilting or stretching by the driver can be completely captured. Furthermore, a maximum radial velocity of 1 m/s is chosen to encompass various typical driver motion speeds that may occur during fatigue (e.g., nodding, yawning) or distraction (e.g., using a phone, turning around to pick up objects). The frame duration of 250 ms (corresponding to four frames per second) strikes a balance between the temporal resolution required for behavior analysis and the computational load for real-time processing on edge devices. Detection thresholds are uniformly set to 15 dB, which helps effectively filter out low-amplitude noise while ensuring the detection of critical driver-induced reflections. Overall, the synergistic setting of these parameters ensures that the system can accurately capture and classify diverse driving behaviors in real-time within a real in-vehicle environment. Data were collected through the UART interface using the TLV (type–length–value) format, enabling structured communication between the radar and the host system. And the specific antenna position used in our experimental setup is detailed in
Section 5.2.
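As a quick sanity check, the Table 1 resolutions follow from standard FMCW relations. The sketch below back-computes them under assumed waveform parameters (a roughly 3.2 GHz sweep bandwidth and about 15 ms of coherent chirping per frame); these are not the exact chirp profile programmed on the AWR1642BOOST.

```python
# Minimal sketch: sanity-checking the Table 1 resolutions with standard
# FMCW relations; bandwidth and coherent frame time are assumptions.
C = 3e8
WAVELENGTH = C / 77e9

bandwidth = 3.2e9                        # Hz (assumed)
range_res = C / (2 * bandwidth)          # delta_R = c / (2B) -> ~0.047 m

frame_time = 0.015                       # s of coherent chirping per frame (assumed)
vel_res = WAVELENGTH / (2 * frame_time)  # delta_v = lambda / (2 T_obs) -> ~0.13 m/s

print(f"range resolution ~ {range_res:.3f} m, velocity resolution ~ {vel_res:.2f} m/s")
```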
3. System Overview
This study proposes a robust driver monitoring system based on FMCW mmWave radar, which focuses on analyzing behavioral cues for accurate driver state evaluation. As illustrated in
Figure 2, the system’s operation begins with the mmWave radar receiving raw radar signals. These raw signals then proceed to the radar signal decoding module, where they are converted into analyzable radar signal data. This data primarily includes three types of features: range–Doppler maps, noise profiles, and range profiles. Next, these decoded data enter the data pre-processing stage, which involves several key steps. First is Frame Stacking, where consecutive radar frames are stacked to capture temporal dynamics. This is followed by data concatenation, which integrates different feature data. Finally, normalization is performed to ensure consistency in data scale, which helps to improve the accuracy and stability of the model’s inference results.
The pre-processed data stream then serves as input to the radar-based spatiotemporal fusion network (RTSFN). Within the RTSFN, feature extraction is performed to derive behavior-relevant spatiotemporal features from the radar data. These features are subsequently fed into the behavior classification module, which identifies various driving behaviors. Ultimately, the behavioral classification results from the RTSFN are transmitted to the driver state evaluation module, which assesses the driver’s current state, for instance, whether abnormal driving behavior is present. The system specifically identifies and outputs multiple abnormal behaviors, including fatigue driving, using a phone (raising a hand to the ear for a call or picking up a phone to use), turning backward, picking up objects, bending forward, and drinking. By focusing on behavioral classification, the proposed system enables effective evaluation of driver fatigue and distraction, offering the potential for timely warnings in real-world driving environments.
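The end-to-end flow of Figure 2 can be summarized in pseudocode form. The sketch below uses placeholder callables for the decoder, pre-processing, RTSFN, and evaluator modules; the names are illustrative rather than the actual implementation.

```python
# High-level flow of Figure 2 as a Python sketch; all module names are
# placeholders, not the real implementation.
def monitoring_step(raw_packets, decoder, preprocess, rtsfn, evaluator):
    # 1. Decode raw radar packets into range-Doppler maps, noise profiles,
    #    and range profiles.
    rd_maps, noise_profiles, range_profiles = decoder(raw_packets)
    # 2. Frame stacking, concatenation, and normalization.
    x_temporal, x_spatial = preprocess(rd_maps, noise_profiles, range_profiles)
    # 3. RTSFN feature extraction and behavior classification.
    risk, behavior = rtsfn(x_temporal, x_spatial)
    # 4. Driver state evaluation (e.g., raise a warning on dangerous behavior).
    return evaluator(risk, behavior)
```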
4. Human Motion Detection: RTSFN-Based Driver Action Detection
4.1. Input Data of Radar Signals
mmWave radar signals provide three types of input data, each offering distinct information for analyzing target characteristics and driver actions. To illustrate how these data types capture a dynamic action, we use a driver nodding event as a running example (as depicted in
Figure 3).
Range profile: The range profile is a one-dimensional representation of the total reflected signal power received by the radar at various range points. This profile is generated by performing a 1D fast Fourier transform (1D-FFT) on the received signal, offering a comprehensive snapshot of all objects within the radar’s field of view. It serves as the foundation for estimating the driver’s approximate position and torso area.
Figure 4 illustrates a time series of range profiles during a driver nodding event. In frames 25 and 26, significant power fluctuations are visible within the “Driver Body Area.” These fluctuations directly correspond to the movement of the driver’s head and upper body, as this action changes their distance and reflection angle relative to the sensor. In contrast, the more stable peaks in other regions likely correspond to static structures such as the car seat or dashboard, which provide consistent reflections.
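For illustration, a range profile of this kind can be approximated from one frame of IF samples as the per-range-bin power after the 1D range FFT. The non-coherent averaging across chirps in the sketch below is an assumption about how the profile is aggregated, not the vendor's exact processing chain.

```python
import numpy as np

# Minimal sketch: a range profile as per-range-bin reflected power after the
# 1D range FFT, averaged non-coherently across chirps (an assumption).
def range_profile(if_frame: np.ndarray) -> np.ndarray:
    range_fft = np.fft.fft(if_frame, axis=1)          # (chirps, range_bins)
    return 10 * np.log10(np.mean(np.abs(range_fft) ** 2, axis=0) + 1e-9)

print(range_profile(np.random.randn(64, 256)).shape)  # (256,): one value per range bin
```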
Noise profile: The noise profile, the second one-dimensional data feature used in our model, is physically defined as the background noise floor, which typically serves as a reference for algorithms such as the constant false alarm rate (CFAR). While traditionally treated as a baseline to be filtered or subtracted, our empirical analysis revealed that this profile contains valuable information sensitive to dynamic events. As illustrated in
Figure 5, the profile exhibits a stable contour when the driver is stationary (e.g., Frame 23), but shows drastic, instantaneous peaks within the driver’s area during nodding periods (e.g., Frames 25 and 26). We posit that this is not because the profile directly measures motion energy, but because physical actions from the driver introduce micro-vibrations and subtle changes to the in-cabin signal environment, which in turn modulate the overall background energy level. Therefore, we treat this data stream as a sensitive auxiliary indicator of dynamic events rather than as random noise. The significant contribution of this unconventional feature to the model’s performance is empirically validated in our comprehensive ablation study (
Section 5.6).
The range–Doppler map: The range–Doppler map is generated by applying a two-dimensional FFT to the received FMCW radar signal. First, a range FFT is performed on each individual chirp signal along the fast-time dimension to obtain the range spectrum. Subsequently, a Doppler FFT is performed for each range bin along the slow-time dimension (across chirps) to resolve the target’s velocity. The resulting range–Doppler map (as shown in
Figure 6) is a two-dimensional plot that simultaneously presents the distribution of the radar echo signal in both the range and velocity dimensions. Its horizontal axis represents the range in meters (m), while the vertical axis represents the relative velocity in meters per second (m/s), with the central axis (zero Doppler) corresponding to stationary objects. The color intensity of each cell represents the signal energy reflected by an object located at that specific range and moving at a specific radial velocity. This map provides the most complete view of the scene’s dynamics, capable of simultaneously resolving a target’s range, velocity, and reflection intensity.
Figure 6 illustrates a sequence of range–Doppler maps during the nodding event. In the initial phase of the motion (e.g., frame 23), the energy is primarily concentrated on the zero-Doppler horizontal line at the driver’s position, indicating a strong stationary reflection. When the nodding motion occurs (especially in frames 25 and 26), the energy spreads significantly in the vertical direction, spanning multiple velocity bins. This phenomenon is known as Doppler broadening. This is precisely because the head movement generates velocity, causing its reflected energy to shift from the zero-Doppler channel to channels representing non-zero velocities. As the motion concludes (frames 27–28), the energy converges back to the zero-Doppler line.
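The 2D-FFT processing described above can be sketched in a few lines of NumPy. The window function and dB scaling below are illustrative choices rather than the exact on-chip processing of the AWR1642.

```python
import numpy as np

# Minimal sketch: building a range-Doppler map from one frame of IF samples,
# shaped (num_chirps, samples_per_chirp); windowing and scaling are assumptions.
def range_doppler_map(if_frame: np.ndarray) -> np.ndarray:
    # 1D range FFT along fast time (per chirp)
    range_fft = np.fft.fft(if_frame * np.hanning(if_frame.shape[1]), axis=1)
    # Doppler FFT along slow time (across chirps), shifted so zero Doppler is centered
    doppler_fft = np.fft.fftshift(np.fft.fft(range_fft, axis=0), axes=0)
    return 20 * np.log10(np.abs(doppler_fft) + 1e-6)  # magnitude in dB

rd_map = range_doppler_map(np.random.randn(64, 256))   # dummy frame
print(rd_map.shape)  # (64 Doppler bins, 256 range bins)
```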
4.2. Input Data Pre-Processing
Human behavior is recognized through dynamic movements that evolve over time. To capture these temporal dynamics, we employ a frame stacking strategy where consecutive radar frames are concatenated along the temporal axis. This forms a time-extended input tensor, representing a temporal window of $N$ frames for each inference (as illustrated in Figure 7). This approach is analogous to the windowing process in the short-time Fourier transform (STFT), which also analyzes a long signal in successive segments. However, a key distinction lies in the subsequent processing: instead of applying a Fourier Transform to extract pre-defined frequency features, our method inputs the raw tensor directly into a neural network. This allows the model to learn the most salient temporal features for behavior recognition in an end-to-end, data-driven manner.
Prior to model inference, each radar frame sequence undergoes median filtering to suppress impulsive noise and min–max normalization to maintain a consistent input scale. These preprocessing steps enhance the robustness and stability of the model in real-time operating conditions.
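A minimal version of this pre-processing is sketched below; the 3 × 3 median-filter kernel is an assumed value, and N = 14 frames is used to match the buffer length adopted later.

```python
import numpy as np
from scipy.ndimage import median_filter

# Minimal sketch of the pre-processing: per-frame median filtering, frame
# stacking along a temporal axis, and min-max normalization. The kernel size
# is an assumption; N = 14 matches the feature buffer length used later.
def preprocess(frames: list) -> np.ndarray:
    filtered = [median_filter(f, size=3) for f in frames]   # impulsive-noise suppression
    x = np.stack(filtered, axis=-1)                          # (..., N) temporal window
    return (x - x.min()) / (x.max() - x.min() + 1e-8)        # consistent [0, 1] scale

window = preprocess([np.random.randn(16, 64) for _ in range(14)])
print(window.shape)  # (16, 64, 14), e.g., one range-Doppler input tensor
```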
4.3. Radar-Based Spatiotemporal Fusion Network (RTSFN)
In this study, we propose a deep learning framework termed the radar-based spatiotemporal fusion network (RTSFN), as depicted in
Figure 8. The central design philosophy of RTSFN lies in the principle of functional decoupling of distinct physical observables derived from millimeter-wave radar signals. Driving behavior can be regarded as a composite of dynamic actions and static postures, which are inherently represented in the radar data streams. Accordingly, RTSFN employs a dual-stream parallel encoder architecture, where each branch is dedicated to processing radar data that encapsulates a specific physical property.
Temporal feature extraction (dynamic motion information): We select the range–Doppler map as the input for the temporal encoder, aiming to capture the driver’s dynamic motion information. The core advantage of the range–Doppler map lies in its velocity dimension, which directly reflects the target’s motion relative to the radar. When the driver performs actions such as nodding, reaching, or turning, their body parts generate distinct signatures in the Doppler spectrum (i.e., Doppler broadening). Therefore, a continuous sequence of range–Doppler maps provides a clear temporal evolution of such dynamic behaviors, making it an ideal source for extracting time-related features.
Spatial feature extraction (static postural information): We select the range profile and the noise profile as the inputs for the spatial encoder, aiming to extract static or semi-static postural information. The range profile represents the total reflected energy at each range bin, offering a stable snapshot of the driver’s body contour and spatial layout. The noise profile, while traditionally regarded as a baseline noise floor, is empirically observed to be sensitive to subtle environmental modulations caused by the driver’s posture and micro-movements. Together, the range profile and the noise profile can be considered as “cross-sectional snapshots” of the driver’s current state, making them highly suitable for extracting spatial features representative of static or semi-static postures.
Through the explicit separation of dynamic motion information (derived from range–Doppler maps) and static posture information (extracted from range profiles and noise profiles) at the signal level, RTSFN is able to learn temporal and spatial features in the most efficient manner from their respective optimal data sources. Finally, a dedicated fusion module deeply integrates these specialized features, thereby enabling a comprehensive and accurate assessment of the driver’s state.
4.3.1. Temporal Modality Encoder
The temporal modality encoder processes range–Doppler map sequences with an input shape of (16, 64, 14), corresponding to the height, width, and temporal length of the map. The core task of this encoder is to learn features representative of the driver’s dynamic behavior by employing a “spatial feature extraction followed by temporal modeling” strategy.
First, to process each frame independently, the input tensor is reshaped, allowing a lightweight convolutional neural network (CNN) wrapped in a TimeDistributed layer to extract spatial features from each frame. This CNN consists of two convolutional layers (with 32 and 64 3 × 3 filters, respectively, using the ReLU activation function) followed by a 2 × 2 max-pooling layer. This design aims to efficiently capture the localized Doppler broadening and energy distribution patterns caused by body movements within each individual map. The resulting sequence of frame-level feature vectors is then fed into a gated temporal convolutional network (gated TCN; Figure 9) for deep temporal modeling. We chose gated TCN over traditional recurrent neural networks (RNNs), such as LSTM, due to several intrinsic architectural advantages. Firstly, the 1D convolutional operations in a TCN are inherently highly parallelizable. This characteristic allows the model to fully leverage the computational power of modern GPUs during the training phase, significantly shortening the development and iteration cycles. Secondly, its structure of causal and dilated convolutions enables it to stably capture very long-range temporal dependencies while effectively avoiding the vanishing gradient problem common in RNNs. These properties make TCN a powerful and efficient foundation for our sequence modeling task.
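A compact Keras sketch of this temporal branch is given below. The per-frame CNN follows the filter sizes stated above, while the frame-level pooling and the single gated dilated-convolution block are simplifications of the full gated TCN; layer widths not specified in the text are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Simplified sketch of the temporal encoder: per-frame CNN (TimeDistributed)
# followed by one gated, dilated causal convolution block. Not the exact RTSFN.
def temporal_encoder(num_frames=14, h=16, w=64):
    inp = layers.Input(shape=(h, w, num_frames))                 # range-Doppler stack
    x = layers.Permute((3, 1, 2))(inp)                           # -> (frames, h, w)
    x = layers.Reshape((num_frames, h, w, 1))(x)                 # add a channel dim

    frame_in = layers.Input(shape=(h, w, 1))                     # per-frame CNN
    y = layers.Conv2D(32, 3, activation="relu", padding="same")(frame_in)
    y = layers.Conv2D(64, 3, activation="relu", padding="same")(y)
    y = layers.MaxPooling2D(2)(y)
    y = layers.GlobalAveragePooling2D()(y)                       # pooling is a simplification
    frame_cnn = tf.keras.Model(frame_in, y)

    x = layers.TimeDistributed(frame_cnn)(x)                     # (frames, features)
    # Gated TCN block: dilated causal convolutions with a sigmoid gate.
    f = layers.Conv1D(64, 3, padding="causal", dilation_rate=2, activation="tanh")(x)
    g = layers.Conv1D(64, 3, padding="causal", dilation_rate=2, activation="sigmoid")(x)
    x = layers.Multiply()([f, g])
    f_t = layers.GlobalAveragePooling1D()(x)                     # temporal feature vector F_t
    return tf.keras.Model(inp, f_t, name="temporal_encoder")
```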
4.3.2. Spatial Modality Encoder
The primary responsibility of the Spatial Modality Encoder is to extract spatial features representative of the driver’s static or semi-static ‘posture’ from the one-dimensional range profile and noise profile. The encoder receives the concatenated profile data with an input shape of (64, 2, 14), representing the number of range bins, signal channels, and temporal frames, respectively. Similar to the temporal encoder, the input tensor is reshaped to align the temporal frames, allowing a TimeDistributed layer to apply the same spatial feature extractor to the (64, 2) data of each time step.
This hierarchical spatial feature extraction is implemented via a lightweight convolutional neural network (CNN). The network consists of three 2D convolutional layers with progressively increasing filter counts (e.g., 16, 32, 64) and employs asymmetric kernel sizes suitable for this input shape. Each convolutional layer is followed by a ReLU activation function, and downsampling is performed at appropriate stages. Following the convolutional layers, we introduce a squeeze-and-excitation (SE) block. Its purpose is to enable the network to adaptively learn the importance of different feature channels. For instance, different feature channels might learn to detect different postural sub-features, with some focusing on the strong reflection peaks of the torso. The SE block uses global information to dynamically re-weight these channels, allowing the model to ‘pay more attention’ to the channels that are most critical for identifying the current specific posture, thereby improving the accuracy of posture recognition. Finally, the feature map undergoes global average pooling and flattening operations to produce a 64-dimensional spatial feature vector, $F_s$.
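The squeeze-and-excitation re-weighting at the heart of this encoder can be sketched as follows; the reduction ratio of 4 is an assumed hyperparameter.

```python
from tensorflow.keras import layers

# Minimal sketch of a squeeze-and-excitation (SE) block applied after the
# spatial CNN; the reduction ratio is an assumption.
def se_block(x, ratio=4):
    channels = x.shape[-1]
    s = layers.GlobalAveragePooling2D()(x)                     # squeeze: global context
    s = layers.Dense(channels // ratio, activation="relu")(s)  # excitation MLP
    s = layers.Dense(channels, activation="sigmoid")(s)        # per-channel weights
    s = layers.Reshape((1, 1, channels))(s)
    return layers.Multiply()([x, s])                           # re-weight feature channels
```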
4.3.3. Spatial–Temporal Fusion Module
After extracting the temporal feature vector $F_t$ and the spatial feature vector $F_s$, respectively, the task of the fusion module is to deeply integrate them to generate a comprehensive understanding of the driver’s state. This process consists of several carefully designed stages. First, we employ a cross-gated fusion strategy, where temporal and spatial features act as mutual “attention filters” to enhance feature complementarity and suppress potential noise or misjudgments from a single modality. This mechanism is mathematically defined as follows:
$$G_t = \sigma(W_t [F_t; F_s] + b_t), \qquad G_s = \sigma(W_s [F_t; F_s] + b_s)$$
$$F_{\text{fused}} = G_t \odot F_t + G_s \odot F_s$$
Here, the gating signals ($G_t$, $G_s$) are dynamic weights learned from both the spatial ($F_s$) and temporal ($F_t$) features. These gates are applied to the original features via element-wise multiplication (⊙), and the results are summed to produce the fused feature vector, $F_{\text{fused}}$. This design allows the model to dynamically adjust feature importance based on context—for instance, amplifying postural features ($F_s$) when significant motion ($F_t$) is detected—thereby producing a more robust and comprehensive representation of the driver’s state. Next, to allow the downstream classification heads to utilize all available raw and fused information, we concatenate these three features: $F_{\text{concat}} = [F_t; F_s; F_{\text{fused}}]$. Furthermore, to enable the model to specialize for the two different tasks of “dangerous driving state (binary classification)” and “specific dangerous behavior (multi-class classification),” we designed a task-specific gating mechanism. An independent gating layer is introduced before each task-specific prediction head. This gating layer learns a weight vector and performs an element-wise multiplication with the concatenated features, thereby allowing each task to autonomously amplify the feature combinations most important to it while suppressing interference from irrelevant features.
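A minimal Keras sketch of the cross-gated fusion and feature concatenation is shown below, assuming 64-dimensional feature vectors and sigmoid gates learned from the concatenated features; exact layer sizes are not specified in the text and are therefore assumptions.

```python
from tensorflow.keras import layers

# Minimal sketch of cross-gated fusion: gates learned from both feature
# vectors, applied element-wise, then concatenated with the raw features.
def cross_gated_fusion(f_t, f_s, dim=64):
    both = layers.Concatenate()([f_t, f_s])
    g_t = layers.Dense(dim, activation="sigmoid")(both)     # gate for temporal features
    g_s = layers.Dense(dim, activation="sigmoid")(both)     # gate for spatial features
    fused = layers.Add()([layers.Multiply()([g_t, f_t]),
                          layers.Multiply()([g_s, f_s])])   # F_fused
    return layers.Concatenate()([f_t, f_s, fused])          # F_concat for the heads
```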
After the task-specific gating adjustments, the feature streams are processed by two prediction heads arranged in a hierarchical structure to achieve a more efficient classification strategy. First, the upstream prediction head is responsible for the binary classification of the overall driving state. Its input features first pass through an Adapter module, a lightweight residual structure that helps to stabilize the training process and enhance feature representation. Subsequently, after being processed by several fully-connected layers (MLPs), this head outputs a prediction of ‘Danger’ or ‘Safe’. This binary prediction then acts as a conditional gate to determine whether to activate the downstream multi-class behavior classification task. Specifically, when the prediction is ‘Danger’, the gate signal is 1, allowing the features of the lower branch to proceed; conversely, when the prediction is ‘Safe’, the gate signal is 0, effectively halting the unnecessary and more computationally intensive fine-grained classification. When activated, the features for the second prediction head similarly pass through an identical Adapter module and MLP structure to output the final prediction for the multi-class dangerous behaviors. This hierarchical design ensures that computational resources are intelligently allocated, performing detailed classification only when truly necessary. This strategy offers a dual advantage: firstly, it significantly enhances the system’s response time, allowing for immediate warnings with minimal latency upon detecting a potential hazard. Secondly, by enabling the upstream binary classifier to focus on the more distinct states of ‘Safe’ versus ‘Danger,’ it effectively filters out normal driving data. This reduces the burden and potential confusion for the downstream multi-class classifier, thereby indirectly improving the accuracy and stability of the overall behavior recognition.
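At inference time, the conditional gating between the two heads can be approximated by the simple control flow below; the 0.5 threshold and the hard skip are simplifications of the gating signal described above.

```python
# Inference-time sketch of the hierarchical heads: the multi-class head runs
# only when the binary head predicts 'Danger' (a simplification of the gate).
def hierarchical_predict(features, binary_head, behavior_head, threshold=0.5):
    p_danger = binary_head(features)            # probability of 'Danger'
    if p_danger < threshold:
        return "Safe", None                     # skip fine-grained classification
    return "Danger", behavior_head(features)    # identify the specific behavior
```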
In summary, the proposed RTSFN core architecture, featuring dual-stream encoders, a cross-gated fusion strategy, and hierarchical prediction heads, effectively integrates spatiotemporal features for accurate driver behavior recognition. However, this direct-processing approach presents a significant efficiency bottleneck for real-time streaming applications due to redundant computations on the entire sliding window of $N$ frames. To overcome this challenge and enable efficient deployment on resource-constrained edge devices, the following section details a corresponding performance optimization scheme.
4.4. Real-Time Optimization via Feature Buffering Mechanism
As mentioned in the previous section, the RTSFN core architecture faces an efficiency bottleneck due to redundant computations when directly processing real-time data streams. To overcome this challenge and enable the model to run efficiently on resource-constrained edge devices, we designed and implemented a key performance optimization scheme: the feature buffering mechanism. This mechanism, as illustrated in the overall architecture in
Figure 10, is located after the feature encoders and before the fusion module. Its operation is defined by the following two independent key parameters:
Feature buffer size ($N$): This refers to the capacity of each buffer, i.e., the number of feature vectors it can store. This value is determined by the input sequence length requirement of the downstream model (e.g., the gated TCN). In our design, $N$ is set to 14.
Step size (S): This represents the number of new raw radar frames processed in each update step. This value is a key hyperparameter that balances the trade-off between real-time responsiveness and computational cost.
The mechanism follows the principle of a fixed-size first-in, first-out (FIFO) queue, and its detailed operational flow is as follows:
Initialization: Upon system startup, the encoders process an initial $N$ raw frames to generate $N$ feature vectors, completely filling the buffer.
First inference: Once the buffer is full, the complete feature sequence of length $N$ is sent to the subsequent modules to perform the first inference.
Sliding update: During stable operation, the system only processes S new raw frames, and the encoders generate S new feature vectors. Concurrently, the buffer discards (dequeues) the S oldest feature vectors from its head and adds (enqueues) the S new feature vectors to its tail, maintaining its total capacity at $N$.
Subsequent inference: This updated feature sequence, now containing the latest information, is sent again to the subsequent modules for the next inference. This process repeats continuously with the incoming data stream.
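A minimal sketch of this buffering logic is given below, using a fixed-length FIFO of per-frame feature vectors; the encoder callable and the default step size are placeholders.

```python
from collections import deque
import numpy as np

# Minimal sketch of the feature buffering mechanism: a fixed-size FIFO of
# per-frame feature vectors, updated with S new frames per step. The encoder
# is a placeholder callable (e.g., the per-frame part of either branch).
class FeatureBuffer:
    def __init__(self, encoder, n=14, step=2):
        self.encoder, self.step = encoder, step
        self.buf = deque(maxlen=n)              # oldest features drop automatically

    def update(self, new_raw_frames):
        # Encode only the S newest raw frames, then enqueue their features.
        for frame in new_raw_frames[-self.step:]:
            self.buf.append(self.encoder(frame))
        # Return a fixed-length sequence only once the buffer is full.
        return np.stack(self.buf, axis=0) if len(self.buf) == self.buf.maxlen else None
```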
Through this elegant design, the feature buffering mechanism brings decisive advantages to the entire system. Firstly, in terms of efficiency, it significantly reduces computational load and power consumption by avoiding redundant calculations on the entire window. Secondly, this high efficiency directly translates to the system’s real-time performance, enabling high-frequency, low-latency responses. Finally, by acting as a ‘data adapter,’ the mechanism ensures the stability of the downstream models by providing them with a consistent, fixed-length input. The choice of the step size S directly determines the system’s operational characteristics and performance trade-offs. Choosing a smaller S value (e.g., 1 or 2) leads to a very high update frequency and low latency, enabling the system to react most quickly to instantaneous changes in the driver’s state, and the prediction results are also smoother and more stable. This is crucial for real-time monitoring applications that pursue the highest level of safety. Conversely, choosing a larger S value reduces the system’s update frequency and overall computational load, which is suitable for scenarios where real-time requirements are slightly lower, but more emphasis is placed on average power consumption. In the implementation of this study, we prioritize real-time response speed, and therefore primarily adopted smaller S values in our experiments.
4.5. Dataset Labeling
To obtain accurate ground-truth labels, we first perform manual, frame-by-frame annotation for all collected data by simultaneously cross-referencing visualized radar data (e.g., range–Doppler maps) with a synchronized video feed. This initial annotation assigns a specific label to each frame, where a label of ‘0’ indicates normal driving, and labels ‘1’ through ‘7’ correspond to one of the seven defined dangerous behaviors.
Based on these initial annotations, we generate two distinct sets of labels for our model’s multi-task learning framework:
Driving risk classification (binary task): This task determines if the overall driving state is safe or dangerous. Frames originally labeled as ‘Normal Driving’ (label 0) are mapped to the ‘Safe’ class. All frames corresponding to any of the seven dangerous behaviors (labels 1–7) are uniformly categorized into a single ‘Dangerous’ class. This allows the model to first learn the general patterns that distinguish abnormal states from normal driving.
Behavior recognition (multi-class task): This task aims to identify the specific type of dangerous behavior. Therefore, it exclusively uses the original detailed labels from ‘1’ to ‘7’. Data corresponding to ‘Normal Driving’ is masked or ignored in this task, enabling the model to focus on learning the subtle differences between the various dangerous actions.
After establishing these two sets of task-specific labels, we employ a sliding window approach to prepare the final training samples. The label for each sample window is determined by the task-specific annotation of the last frame within that window. Furthermore, to enhance labeling stability and reduce noise from transient actions, we adopt a delayed labeling strategy, as illustrated in
Figure 11. A behavior is only labeled if it persists for at least three consecutive frames, and the label is extended for an additional three frames after the behavior concludes. This entire approach ensures robust and consistent labeling for training our spatiotemporal model.
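The delayed labeling rule can be expressed compactly as follows; per-frame integer labels are assumed, with 0 denoting normal driving, and the three-frame persistence and extension follow the description above.

```python
import numpy as np

# Minimal sketch of the delayed labeling strategy: keep a behavior label only
# if it persists for at least `min_len` frames, then extend it for `extend`
# frames after it ends (0 = normal driving).
def delayed_labels(raw, min_len=3, extend=3):
    labels = np.asarray(raw)
    out = np.zeros_like(labels)
    i = 0
    while i < len(labels):
        if labels[i] == 0:
            i += 1
            continue
        j = i
        while j < len(labels) and labels[j] == labels[i]:
            j += 1                                  # [i, j) is one behavior segment
        if j - i >= min_len:                        # keep only persistent behaviors
            out[i:min(j + extend, len(labels))] = labels[i]
        i = j
    return out

print(delayed_labels([0, 3, 3, 3, 3, 0, 0, 0, 0, 5, 0]))
# -> [0 3 3 3 3 3 3 3 0 0 0]  (4-frame segment kept and extended; lone '5' dropped)
```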
4.6. Model Training
To simultaneously optimize the driving risk classification and behavior recognition tasks, we designed a multi-task learning framework that incorporates both uncertainty-based loss weighting and representation alignment.
During training, the model is supervised using three loss components, as follows:
Task-specific losses: We compute cross-entropy losses for both classification heads, yielding a binary risk classification loss $\mathcal{L}_{\text{bin}}$ and a multi-class behavior recognition loss $\mathcal{L}_{\text{multi}}$, each of the form
$$\mathcal{L}_{\text{CE}} = -\sum_{c} y_c \log(\hat{y}_c)$$
where $y$ is the ground truth label and $\hat{y}$ is the predicted probability.
Uncertainty-based weighting: To address the issue that the training process can be dominated by a specific task in multi-task learning, we adopt the uncertainty-based weighting method proposed by Kendall et al. [15]. This approach introduces two learnable log-variance parameters, $s_1$ and $s_2$, to adaptively balance the tasks. The uncertainty-weighted total task loss is as follows:
$$\mathcal{L}_{\text{task}} = e^{-s_1}\,\mathcal{L}_{\text{bin}} + e^{-s_2}\,\mathcal{L}_{\text{multi}} + s_1 + s_2$$
Cosine alignment loss: To encourage consistent feature representations across modalities (e.g., range and Doppler), we apply a cosine similarity penalty between the latent vectors $F_t$ and $F_s$:
$$\mathcal{L}_{\text{align}} = 1 - \frac{F_t \cdot F_s}{\lVert F_t \rVert\,\lVert F_s \rVert}$$
This acts as a form of regularization, compelling the model to learn shared, robust features from two different perspectives of the same event. By enforcing this consistency, we can prevent the model from overfitting to the idiosyncrasies of a single modality and improve its ability to generalize.
The total loss function is expressed as follows:
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{task}} + \lambda\,\mathcal{L}_{\text{align}}$$
where $\lambda$ is a hyperparameter that controls the alignment strength. This strategy allows the model to automatically reweight the importance of each task based on their predictive uncertainty, while promoting cross-modal feature consistency. All parameters, including the uncertainty weights, are learned end-to-end through backpropagation.
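Putting the three components together, a minimal TensorFlow sketch of the combined loss is shown below; the variable names and the use of exp(-s) weighting follow a common implementation of Kendall et al.'s formulation and are not taken verbatim from the paper's code.

```python
import tensorflow as tf

# Minimal sketch of the combined loss: uncertainty-weighted task losses plus a
# cosine alignment term. s1, s2 are learnable log-variance parameters;
# lambda_align = 0.05 matches the final configuration reported below.
s1 = tf.Variable(0.0)          # log-variance for the binary task
s2 = tf.Variable(0.0)          # log-variance for the multi-class task

def total_loss(y_bin, p_bin, y_multi, p_multi, f_t, f_s, lambda_align=0.05):
    l_bin = tf.reduce_mean(tf.keras.losses.binary_crossentropy(y_bin, p_bin))
    l_multi = tf.reduce_mean(tf.keras.losses.sparse_categorical_crossentropy(y_multi, p_multi))
    l_task = tf.exp(-s1) * l_bin + s1 + tf.exp(-s2) * l_multi + s2
    cos = tf.reduce_mean(tf.reduce_sum(
        tf.nn.l2_normalize(f_t, -1) * tf.nn.l2_normalize(f_s, -1), axis=-1))
    return l_task + lambda_align * (1.0 - cos)      # alignment penalty: 1 - cosine
```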
In our experiments, the model was trained for a total of 50 epochs using the Adam optimizer with its default initial learning rate of 0.001. We used a batch size of 32. To address class imbalance in the driving risk classification task, we employed a cost-sensitive learning approach by adjusting the class weights in the loss function. The final model configuration includes the cosine alignment loss, with the hyperparameter $\lambda$ set to 0.05. The entire training process was conducted on a single NVIDIA GeForce RTX 3070 Ti GPU.
7. Conclusions
This paper introduces a novel driver monitoring system that leverages mmWave FMCW radar and a deep spatiotemporal fusion network to recognize high-risk driving behaviors in a fully non-contact and privacy-preserving manner. Unlike conventional approaches that rely solely on visual cues or physiological sensors, the proposed method combines multi-modal radar features and multi-task deep learning to robustly assess a range of high-risk driving behaviors.
The proposed system achieves high recognition accuracy while maintaining real-time performance on edge hardware, marking a crucial step towards deployment in practical automotive environments. By operating independently of ambient light and without requiring driver cooperation or wearable devices, the system offers a scalable and unobtrusive solution for enhancing road safety.
Overall, this work contributes a unified sensing and inference framework that advances the capabilities of radar-based driver monitoring. It opens new directions for future in-vehicle intelligence, paving the way for seamless integration into next-generation driver assistance and autonomous driving platforms.