3.1. Data Preprocessing
In FMCW radar systems, the transmitted signal frequency increases linearly over time, commonly referred to as a linear frequency modulated pulse (chirp). The transmitting antenna sends this pulse, which is reflected by an object and received by the radar’s receiving antenna. The mixer combines the received signal (RX) with the transmitted signal (TX) to generate an intermediate frequency (IF) signal.
To extract range and velocity information of the target, the FMCW radar transmits multiple chirps continuously. Consequently, the mixer outputs multiple IF signals, which are organized into an M × N matrix, where M represents the number of consecutively transmitted chirps and N denotes the number of ADC samples per chirp. For a radar with R receiving antennas, R such matrices are generated.
Each chirp in the IF signal matrix undergoes a Fast Fourier Transform (FFT) along the sampling dimension, referred to as range-FFT, to obtain distance information. The resulting spectrum exhibits peaks corresponding to objects at specific distances, effectively separating the radar data into different range bins. After performing the range-FFT, the signal received by each antenna is normalized by subtracting the mean over all chirps, which suppresses static objects:

$$\hat{s}_i(n) = s_i(n) - \frac{1}{M}\sum_{m=1}^{M} s_m(n),$$

where $\hat{s}_i(n)$ denotes the clutter-removed signal of the i-th chirp, $s_i(n)$ represents the IF signal after the range-FFT, $i$ is the chirp index, and $n$ is the ADC sampling index within each chirp.
To remove range bins with negligible information, we calculated the average signal across all antennas and then took the modulus of the result. Additionally, averaging across multiple receiving antennas helps reduce noise and improve accuracy. The range bin with the strongest reflected signal is considered the primary range bin. In this study, we extracted 48 range bins centered on this primary bin, which are assumed to contain the relevant activity information of the subjects.
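For illustration, the following minimal sketch shows one way to realize this bin-selection step; the function and variable names are hypothetical and the exact indexing of the original pipeline may differ. It averages the range-FFT output over the receiving antennas, takes the modulus, locates the strongest (primary) range bin, and extracts 48 bins centered on it:

```python
import numpy as np

def select_range_bins(range_fft, num_bins=48):
    """range_fft: complex array of shape (num_chirps, num_rx, num_range_bins)."""
    # Average across receiving antennas, take the modulus, then average over chirps
    # to obtain a single range profile.
    profile = np.abs(range_fft.mean(axis=1)).mean(axis=0)        # (num_range_bins,)
    # The bin with the strongest reflection is taken as the primary range bin.
    primary = int(np.argmax(profile))
    half = num_bins // 2
    start = max(0, min(primary - half, range_fft.shape[-1] - num_bins))
    return range_fft[:, :, start:start + num_bins]               # (num_chirps, num_rx, 48)
```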
To obtain the velocity component at each distance, an FFT is applied along the chirp dimension of each range bin, referred to as Doppler-FFT. Processing the radar signal through range-FFT and Doppler-FFT produces the RDM [28]. In the RDM, negative velocities indicate objects moving toward the radar, while positive velocities correspond to objects moving away. The RDM thus reflects the movement of different parts of the target. The range-FFT and Doppler-FFT processing pipeline is illustrated in Figure 1.
In the experimental setup of this study, the raw radar data are complex-valued arrays with shape (NUM_CHIRPS, NUM_TX, NUM_RX, NUM_ADC_SAMPLES). The specific parameters are as follows:
NUM_CHIRPS: 1500, denoting the number of chirps.
NUM_TX: 1, indicating the number of transmitting antennas.
NUM_RX: 4, indicating the number of receiving antennas.
NUM_ADC_SAMPLES: 108, corresponding to the number of sampling points per chirp.
As described above, the radar data were acquired at a frame rate of 500 Hz, with one chirp per frame, resulting in a total of 1500 chirps recorded over 3 s. To compute RDMs, a sliding-window Doppler-FFT was applied within each range bin, where the window length was set to 125 chirps and the step size to 40 chirps. This procedure produces RDM sequences with dimensions of 35 × 1 × 48 × 125. Here, 35 denotes the number of short-time frames generated by the sliding window, 1 denotes the number of channels, 48 corresponds to the number of selected range bins, and 125 represents the length of the Doppler spectrum.
Since the RDM is complex-valued, we compute its magnitude and then average across the receiving antennas to improve signal reliability. Furthermore, a logarithmic transformation is applied to the RDM to enhance weak reflected components, such as those caused by cough events, thereby helping the model capture discriminative information.
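The preprocessing chain described in this section can be summarized by the following NumPy sketch, with hypothetical array and function names; the 125-chirp window and 40-chirp step follow the values stated above, and range-bin selection (as in the earlier sketch) is omitted here for brevity:

```python
import numpy as np

def compute_rdm_sequence(adc_data, win_len=125, step=40):
    """adc_data: complex IF samples of shape (num_chirps, num_rx, num_adc_samples)."""
    # Range-FFT along the fast-time (ADC sample) dimension.
    range_fft = np.fft.fft(adc_data, axis=-1)
    # Static clutter removal: subtract the mean over chirps in each range bin.
    range_fft -= range_fft.mean(axis=0, keepdims=True)
    # (Selection of the 48 bins around the primary range bin would be applied here.)
    rdms = []
    for start in range(0, range_fft.shape[0] - win_len + 1, step):
        window = range_fft[start:start + win_len]            # (125, num_rx, num_bins)
        # Doppler-FFT along the chirp (slow-time) dimension within the window.
        doppler = np.fft.fftshift(np.fft.fft(window, axis=0), axes=0)
        # Magnitude, average over antennas, and log compression of weak components.
        rdm = np.log1p(np.abs(doppler).mean(axis=1))         # (125, num_bins)
        rdms.append(rdm.T)                                   # (num_bins, 125)
    return np.stack(rdms)[:, None]   # (num_frames, 1, num_bins, 125), e.g. (35, 1, 48, 125)
```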
3.2. Feature Extraction Module
In this experiment, human movements in front of the radar produce echo signals, which are subsequently processed into a sequence of RDMs. Each RDM characterizes the target’s range and Doppler velocity information within a short time interval. Consequently, the sequence of RDMs reflects the temporal evolution of the target’s range–velocity distribution relative to the radar. Accurate identification of cough events therefore requires the extraction of both spatial and temporal features from the RDM sequences.
Specifically, spatial features describe the range–velocity distribution within an individual RDM, while temporal features capture the dynamic variations in this distribution over time [29,30]. In recent years, ResNet has been widely adopted in image recognition tasks [31,32], and Self-Attention mechanisms [33] have demonstrated strong capability in modeling temporal dependencies. Motivated by these advances, we propose a spatiotemporal feature extraction framework consisting of a spatial submodule based on ResNet and a temporal submodule employing Self-Attention. The spatial submodule extracts range–velocity features from individual RDMs, while the temporal submodule learns the correlations across the entire RDM sequence. The fused spatiotemporal features are then passed to a downstream classification module, which outputs the recognition results. The overall structure of the proposed model is illustrated in Figure 2.
Figure 3 illustrates the spatial submodule of the proposed model. As shown, this module employs ResNet-34 to extract high-level spatial features from each RDM in the input sequence, producing the spatial feature sequence as follows:

$$f_t^g = \mathrm{ResNet34}\!\left(x_t^g;\ \theta_{\mathrm{res}}\right),\quad t = 1, \dots, T,$$

where $x_t^g$ denotes the t-th range-Doppler map in the sequence $X^g = \{x_1^g, \dots, x_T^g\}$ of subject g, and $\theta_{\mathrm{res}}$ represents the parameters of the modified ResNet-34 network. The extracted spatial feature for each frame is $f_t^g$, and the full sequence of features is expressed as $F^g = \{f_1^g, \dots, f_T^g\}$.
The rationale for adopting ResNet-34 lies in its residual learning mechanism, which enables the network to preserve low-level structural information while progressively capturing more abstract spatial patterns. Compared with shallower CNNs, ResNet-34 mitigates vanishing-gradient and overfitting problems, allowing the model to generalize effectively on radar data.
In our case, the input radar data are represented as RDMs, which inherently encode the joint distribution of target distance and relative velocity. The convolutional filters of ResNet-34 are well-suited for extracting localized patterns such as spectral ridges, motion trajectories, and micro-Doppler signatures from these 2D maps. After layer-by-layer abstraction, the network produces feature embeddings that highlight spatial correlations related to human respiratory motion and cough events.
By combining these spatial embeddings with the subsequent temporal submodule, the framework not only captures static geometric characteristics but also encodes dynamic variations across frames. This integration ensures that both the physical properties of radar echoes and their temporal evolution are preserved in the learned representation. Ultimately, the spatial submodule serves as the foundation for robust detection by transforming raw RDMs into compact yet discriminative features aligned with the underlying physical phenomena.
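A minimal PyTorch sketch of such a spatial submodule is given below. It assumes torchvision’s ResNet-34 with the two modifications described in Section 3.5 (single-channel input, final fully connected layer removed) and a 256-dimensional projection matching the temporal submodule’s hidden size; it is an illustrative approximation, not the authors’ exact implementation:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet34

class SpatialSubmodule(nn.Module):
    """Per-frame spatial feature extractor based on a modified ResNet-34 (sketch)."""
    def __init__(self, out_dim=256):
        super().__init__()
        backbone = resnet34(weights=None)
        # Adapt the first convolution to single-channel RDM input.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
        # Remove the final fully connected layer; keep the 512-d pooled features.
        backbone.fc = nn.Identity()
        self.backbone = backbone
        self.proj = nn.Linear(512, out_dim)   # project to the temporal submodule's width

    def forward(self, x):                     # x: (batch, T, 1, 48, 125)
        b, t = x.shape[:2]
        feats = self.backbone(x.flatten(0, 1))        # (batch*T, 512)
        return self.proj(feats).view(b, t, -1)        # (batch, T, 256)
```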
Figure 4 illustrates the temporal submodule of the proposed model. This module takes the sequence of spatial features $F^g = \{f_1^g, \dots, f_T^g\}$ as input and captures temporal dependencies through two stacked layers, each comprising a multi-head Self-Attention mechanism with four attention heads, followed by a position-wise feed-forward network, residual connections, and layer normalization. This design enables the model to jointly attend to information from multiple temporal perspectives while maintaining stable optimization. The learned temporal features are then concatenated with the corresponding spatial features to form a unified spatiotemporal representation of the RDM sequence, expressed as

$$H^g = \mathrm{SelfAttention}\!\left(F^g;\ \theta_{\mathrm{att}}\right),$$

where $\theta_{\mathrm{att}}$ denotes the learnable parameters of the Self-Attention module. For each subject g, the input sequence $F^g$ is transformed into a temporally enhanced representation $H^g = \{h_1^g, \dots, h_T^g\}$, where each $h_t^g$ encodes contextual dependencies across time.
The rationale for using Self-Attention lies in its ability to dynamically assign weights to different frames, enabling the model to emphasize frames containing salient respiratory or cough-induced variations while suppressing irrelevant or noisy segments. Compared with recurrent architectures, the Self-Attention mechanism can capture both short- and long-range dependencies without suffering from gradient vanishing, which is particularly beneficial given the temporal irregularities of cough events.
By integrating the spatial submodule with the temporal submodule, the framework not only preserves fine-grained spatial cues—such as Doppler shifts and spectral ridges—but also models their temporal evolution across frames. This ensures that both static information (e.g., subject posture) and dynamic information (e.g., transient motion patterns associated with coughing) are effectively represented. Ultimately, the fusion of spatial and temporal features allows the model to construct a compact yet discriminative representation of radar echoes, enhancing its robustness in contactless cough recognition.
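The temporal submodule can be approximated with PyTorch’s built-in Transformer encoder layers, as in the sketch below; the feed-forward width and other unstated hyperparameters are assumptions, and the subsequent fusion with the spatial features (concatenation, as described above) is indicated only in a comment:

```python
import torch.nn as nn

class TemporalSubmodule(nn.Module):
    """Two stacked multi-head Self-Attention layers over the spatial feature sequence (sketch)."""
    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True)                 # includes FFN, residual connections, LayerNorm
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, f):                     # f: (batch, T, 256) spatial features
        # Each frame attends to all other frames, yielding temporally enhanced features.
        # The paper then fuses these with the spatial features (e.g., by concatenation).
        return self.encoder(f)                # (batch, T, 256)
```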
Figure 5 illustrates the downstream classifier of the proposed model. The extracted spatiotemporal features $H^g = \{h_1^g, \dots, h_T^g\}$ are first processed through a mean pooling layer along the temporal dimension to obtain a compact representation $z^g$, which is subsequently fed into a linear classifier to generate the final prediction results $\hat{y}^g$:

$$z^g = \mathrm{MeanPool}\!\left(H^g\right) = \frac{1}{T}\sum_{t=1}^{T} h_t^g,$$

where $\mathrm{MeanPool}(\cdot)$ denotes temporal mean pooling applied across all T frames. Here, each $h_t^g$ represents the temporal feature at frame t, and the resulting vector $z^g$ is the aggregated representation for subject g, which serves as the input to the classification layer.

$$\hat{y}^g = \varphi\!\left(z^g;\ \theta_{\mathrm{cls}}\right),$$

where $\varphi(\cdot)$ denotes the linear classifier parameterized by $\theta_{\mathrm{cls}}$. The resulting probability vector $\hat{y}^g = [\hat{y}_0^g,\ \hat{y}_1^g]$ represents the predicted likelihood that the sample g belongs to class $c = 0$ (non-cough) or $c = 1$ (cough).
The objective of this study is to correctly identify the target’s behavior as coughing or non-coughing. To guide the training, we adopt the cross-entropy loss:

$$\mathcal{L}_{\mathrm{CE}} = -\frac{1}{N}\sum_{g=1}^{N}\sum_{c=0}^{1} y_c^g \log \hat{y}_c^g,$$

where $\mathcal{L}_{\mathrm{CE}}$ denotes the cross-entropy loss computed over a batch of N samples. Here, $y_c^g$ is the one-hot ground-truth label indicating whether the g-th sample belongs to class $c$ (non-cough or cough), and $\hat{y}_c^g$ is the corresponding predicted probability obtained from the classifier output $\hat{y}^g$.
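A minimal sketch of the downstream classifier and loss is shown below, assuming a 256-dimensional pooled feature (the width would change if spatial and temporal features are concatenated) and PyTorch’s CrossEntropyLoss, which combines the softmax and the negative log-likelihood above:

```python
import torch
import torch.nn as nn

class CoughClassifier(nn.Module):
    """Temporal mean pooling followed by a linear classifier (sketch)."""
    def __init__(self, feat_dim=256, num_classes=2):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, h):            # h: (batch, T, feat_dim) spatiotemporal features
        z = h.mean(dim=1)            # temporal mean pooling -> (batch, feat_dim)
        return self.fc(z)            # logits for {non-cough, cough}

# Cross-entropy over a batch; labels take values 0 (non-cough) or 1 (cough).
criterion = nn.CrossEntropyLoss()
# loss = criterion(logits, labels)
```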
3.4. Experimental Setup
Due to the lack of publicly available datasets, we validated the proposed method on a dataset collected in our laboratory. Specifically, we used the IWR6843ISK millimeter-wave radar and the DCA1000 real-time data capture board from Texas Instruments (Dallas, TX, USA).
For the radar configuration, the chirp repetition frequency was set to 500 Hz, and a total of 3 s of data was collected, resulting in 1500 chirps, with 108 sampling points per chirp. Detailed radar parameters are listed in Table 1. Under these settings, the radar achieves a range resolution of approximately 0.042 m and a velocity resolution of approximately 0.008 m/s. The system consists of one transmitting antenna and four receiving antennas.
Radar signal datasets were collected from 15 subjects in two rooms, as illustrated in Figure 8. In the first room, data were collected in a bed scene, while in the second room, data were collected in a sitting scene. As shown in the figure, our experimental environment was uncontrolled, containing multiple tables, chairs, and other objects, thereby representing a challenging scenario for radar-based monitoring.
For the bed scene, the radar was positioned about 96 cm above the ground and placed about 26 cm horizontally from the bed edge. It was oriented toward the upper part of the bed to primarily capture the subject’s body. Data were collected with subjects either facing the radar directly or at a diagonal angle, and in four postures: supine, left lateral, right lateral, and prone. The bed scene dataset comprises five activity categories: coughing, normal breathing, moving arms, turning over, and sitting up or lying down.
For the sitting scene, the radar was positioned about 94 cm above the ground and oriented horizontally. Data were collected at subject-to-radar distances of about 1 m and 1.5 m. Recordings were performed with subjects facing the radar directly, at a 45° angle, and perpendicular to the radar. The sitting scene dataset comprises five activity categories: coughing, normal breathing, moving arm, moving head, and standing up or sitting down.
Each activity type captures typical movements in the respective scenario. No constraints were imposed on the subjects’ movements to ensure diversity and realism. In addition to coughing and normal breathing, complex actions within each category (e.g., raising hands or scratching the head) were included to enhance the representativeness and practical relevance of the dataset, which also places higher demands on model generalization.
Each data sample was extracted from a 3-second segment. In total, 3165 samples were collected. For the bed scene, 21 cough samples and 82 non-cough samples per subject were recorded, resulting in 1545 samples across 15 subjects. For the sitting scene, 18 cough samples and 90 non-cough samples per subject were recorded, totaling 1620 samples for 15 subjects.
3.5. Model Setup and Training Details
As introduced previously, the input is an RDM sequence of shape 35 × 1 × 48 × 125, where 35 represents the sequence length T and each frame corresponds to a single-channel feature map. The spatial submodule employs a modified ResNet-34 for spatial feature extraction, with two key modifications: adjusting the number of input channels from 3 to 1 and removing the final fully connected layer. This processing yields a 512-dimensional spatial feature per frame, which is then projected via a linear layer to 256 dimensions, matching the input width of the temporal submodule. The temporal submodule adopts a two-layer Transformer-inspired architecture, where each layer integrates multi-head attention with a hidden dimension of 256 and four attention heads. Its output is a sequence of 35 temporal feature vectors, which are aggregated by average pooling along the temporal dimension to yield a single pooled feature representation. Finally, the linear classifier maps this pooled representation to a two-dimensional logit vector corresponding to the two target activity categories (i.e., cough and non-cough).
For data augmentation, image translation is applied with a maximum displacement of 12 pixels along the distance dimension. Random erasing is further employed by removing a region covering 5–20% of the RDM area, where the width-to-height ratio of the erased region is randomly sampled within the range of 0.3 to 3.33. To preserve temporal consistency, the same augmentation is applied uniformly across all frames within an RDM sequence.
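The sketch below illustrates one way to realize these augmentations while keeping them identical across all frames of a sequence; the names are hypothetical, and a circular shift is used as a simple stand-in for the translation:

```python
import numpy as np

def augment_sequence(rdm_seq, max_shift=12, rng=None):
    """rdm_seq: (T, 1, num_bins, num_doppler). The same random parameters are
    applied to every frame so that temporal consistency is preserved."""
    rng = np.random.default_rng() if rng is None else rng
    t, c, n_bins, n_dop = rdm_seq.shape
    # Random translation along the range (distance) dimension, up to 12 bins.
    shift = int(rng.integers(-max_shift, max_shift + 1))
    out = np.roll(rdm_seq, shift, axis=2)
    # Random erasing: remove a region covering 5-20% of the RDM area,
    # with an aspect ratio sampled in [0.3, 3.33].
    area = rng.uniform(0.05, 0.20) * n_bins * n_dop
    ratio = rng.uniform(0.3, 3.33)
    h = min(int(round(np.sqrt(area * ratio))), n_bins)
    w = min(int(round(np.sqrt(area / ratio))), n_dop)
    top = rng.integers(0, n_bins - h + 1)
    left = rng.integers(0, n_dop - w + 1)
    out[:, :, top:top + h, left:left + w] = 0.0
    return out
```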
Model performance is evaluated using F1-score and overall accuracy. Five-fold cross-validation is conducted as follows: three non-overlapping subjects are randomly selected from the 15 participants as the test set, and from the remaining 12 participants, 80% of the data are assigned to the training set and 20% to the validation set. The model is trained on the training set, and the parameters achieving the highest F1-score on the validation set are used for testing. Final results are obtained by averaging the performance metrics across the five folds. To ensure representative feature distributions and avoid sampling bias, stratified sampling is applied during dataset partitioning.
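A possible realization of this subject-wise five-fold protocol is sketched below, assuming scikit-learn for the stratified 80/20 split; the index arrays and helper name are hypothetical:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def subject_wise_folds(subject_ids, labels, n_folds=5, seed=0):
    """Yield (train_idx, val_idx, test_idx) per fold; 3 held-out subjects per fold.
    subject_ids, labels: per-sample NumPy arrays."""
    rng = np.random.default_rng(seed)
    subjects = rng.permutation(np.unique(subject_ids))   # 15 subjects
    for fold in np.array_split(subjects, n_folds):       # 5 disjoint groups of 3
        test_mask = np.isin(subject_ids, fold)
        test_idx = np.flatnonzero(test_mask)
        rest_idx = np.flatnonzero(~test_mask)
        # Stratified 80/20 split of the remaining 12 subjects' samples.
        train_idx, val_idx = train_test_split(
            rest_idx, test_size=0.2, stratify=labels[rest_idx], random_state=seed)
        yield train_idx, val_idx, test_idx
```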
The model is implemented in PyTorch (version 2.5.1, running on CUDA 12.1) [35] and trained on a workstation equipped with a 12-core Intel(R) Xeon(R) Silver 4214R CPU (Intel Corporation, Santa Clara, CA, USA), an NVIDIA GeForce RTX 3080Ti GPU (NVIDIA Corporation, Santa Clara, CA, USA) with 12 GB memory, and 90 GB of RAM. The network parameters are optimized using the Adam optimizer with a batch size of 32, a learning rate of 0.0001, a weight decay of 0.3, and a total of 15 training epochs.
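The training configuration can be summarized in the short sketch below; the model and data loaders are placeholders for the components described above, and the validation step with F1-based checkpoint selection is omitted:

```python
from torch import nn, optim

def train_model(model, train_loader, num_epochs=15, lr=1e-4, weight_decay=0.3):
    """Optimization settings used in this study (sketch); loaders use batch size 32."""
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=weight_decay)
    criterion = nn.CrossEntropyLoss()
    for _ in range(num_epochs):
        model.train()
        for rdm_seq, label in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(rdm_seq), label)   # logits vs. 0/1 labels
            loss.backward()
            optimizer.step()
        # Validation and F1-score-based checkpoint selection would follow here.
    return model
```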