Dual-Scale Doppler Attention for Human Identification

This paper considers a Deep Convolutional Neural Network (DCNN) with an attention mechanism, referred to as Dual-Scale Doppler Attention (DSDA), for human identification from a micro-Doppler (MD) signature given as input. The MD signature contains unique gait characteristics produced by body parts of different sizes: the arms and legs move rapidly, while the torso moves slowly. Each person is identified based on his/her unique gait characteristic in the MD signature. DSDA provides attention at different time-frequency resolutions to cater to the different MD components, which comprise both fast-varying and steady components. Through this, DSDA can capture the unique gait characteristic of each person used for human identification. We demonstrate the validity of DSDA on a recently published benchmark dataset, IDRad. The empirical results show that the proposed DSDA outperforms previous methods, and a qualitative analysis demonstrates its interpretability on MD signatures.


Introduction
Human Identification (HI) serves as an essential building block for many personal identification services including surveillance, security and identification systems. In general, HI adopts visual video as it has information that can be easily understood. However, HI through visual video can be problematic under low lighting conditions and with the privacy infringement issues [1]. As an alternative to the use of cameras, radar devices bypass these problems. Radar is a detection system that measures the distance, direction, angle, and speed of a target by emitting electromagnetic waves from a device and analyzing electromagnetic waves that are reflected and returned from an object. Radar can operate under low light conditions, and its tendency to bend around obstacles makes it suitable for identification in obscured environments [2][3][4]. Moreover, radar information is relatively safe regarding privacy, as people cannot directly interpret information obtained by radar. Furthermore, radar also has the advantage of observing far-distance targets. Overall, radar is a more robust sensor for HI than visual video. However, since it is difficult to observe the iris, voice, and face using radar, this paper adopts gait instead of aforementioned biometrics. Gait can be observed from a distance, and unlike the biometrics (iris, voice, and face), it is behavior over a certain period of time, so the security is relatively high. These advantages are consistent with the reason for using radar-based HI instead of video-based HI.
For this radar-based HI, the micro-Doppler (MD) signature has been a popular choice. As shown in Figure 1, it sequentially records Doppler effects in electromagnetic signals of moving targets [5], and the superposition of these Doppler signals is summarized to make MD signatures, where it holds granularity to specify their information. In the methodology, conventional machine learning (ML) algorithms have attempted to analyze the statistics of the MD signatures [6][7][8][9]. Unfortunately, ML has several drawbacks as it is based on heuristic feature extraction and a low capacity model. Recent Deep Convolutional Neural Network (DCNN), via capturing spatial relationships and comprehending high-level spatial features, overcomes these limitations and has revolutionized many applications including radar-based human identification and motion recognition. Motion recognition is easily conducted by training DCNN to recognize the common patterns among identical motions. However, human identification [10,11] should identify the unique characteristics of each person in an uncontrolled scenario, which requires more fine-grained understanding than motion recognition. For these fine-grained understandings, we exploit that MD signatures from different body parts such as arms, legs, and torso display different signal characteristics. Each of us has a unique gait characteristic distinguished by different sized body parts moving and swinging in a distinctive pattern, as the arms and legs move rapidly, while the torso moves slowly. Our empirical analysis validates that more than 95% of MD signatures on the radar dataset (Validation is performed on the IDRad dataset) include this distinguishable gait characteristic. Therefore, we utilize these unique gait characteristics from the MD signatures to identify humans. For the detailed explanation of the gait characteristics of MD signatures, as shown in Figure 1, MD signatures have fast and slow-moving components. 
The MD signature records the Doppler effects over time: the orange curve is a small-amplitude wave denoting slow-moving features, and the signal that spreads around the orange curve forms a high-amplitude wave representing fast-moving features.
Under this observation, our proposed Dual-Scale Doppler Attention (DSDA) is composed of Temporal Window Attention (TWA), which attends fast-moving MD signatures via building temporal windows for MD signatures, and Holistic Window Attention (HWA), which attends slow-moving MD signatures via building holistic windows. To perform TWA and HWA, we define a common attention module defined as Window Attention (WA), which conducts self-attention among windows by calculating their similarities. In the overall pipeline, TWA extracts fast-moving features by generating multiple temporal windows and applying WA among them. HWA generates slow-moving features by subtracting the attended fast-moving features from original signals, and WA again performs self-attention between subtracted and original signals. With these attended fast and slow-moving features, DSDA extracts unique characteristics of each person in fast and slow moving, and identifies human identity. Using the IDRad [10] dataset, we validate state-of-the-art performances on human identification tasks and show the interpretability of the proposed DSDA.

Doppler Radar Systems for Human Identification
Doppler radar, which uses a single-tone radio wave, has been frequently used for human identification [6,10,12,13]. By the Doppler effect, the received frequency $f_r$ of the moving target is shifted away from the transmitted frequency $f_t$, and the Doppler frequency $f_d$ is defined by subtracting $f_t$ from $f_r$ as:

$$f_d = f_r - f_t = \frac{2v}{c} f_t,$$

where c is the speed of light, and v is the radial speed of the moving target. By capturing the Doppler shift, the radar is able to detect human motions [5,14,15]. Moreover, frequency-modulated continuous-wave (FMCW) radar is commonly used for short-range multiple-target detection by generating a Doppler map within a certain range [10,16,17]. Vandersmissen et al. [10] utilized a low-power 77 GHz FMCW radar for person identification and constructed the IDentification with Radar (IDRad) dataset. IDRad consists of micro-Doppler maps received from several people walking around spontaneously in any possible direction. Our proposed DSDA is evaluated on IDRad, where we also validate its interpretability on MD signatures.
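To give a sense of scale for the Doppler shift, the short sketch below evaluates the standard two-way (monostatic) relation f_d = 2 v f_t / c. The 77 GHz carrier matches the radar used for IDRad; the 1.5 m/s walking speed and the function name `doppler_shift` are illustrative assumptions, not values from the paper.

```python
# Doppler shift of a monostatic radar return (standard two-way relation).
C_LIGHT = 299_792_458.0  # speed of light in m/s


def doppler_shift(radial_speed_mps: float, carrier_hz: float) -> float:
    """Return f_d = f_r - f_t = 2 * v * f_t / c in Hz."""
    return 2.0 * radial_speed_mps * carrier_hz / C_LIGHT


# Illustrative values: 77 GHz carrier, assumed 1.5 m/s radial walking speed.
f_d = doppler_shift(1.5, 77e9)
print(f"Doppler shift: {f_d:.1f} Hz")
```

At this carrier frequency a walking torso therefore produces a shift of a few hundred hertz, while faster limb motion spreads energy to higher Doppler frequencies.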

Deep Learning for Micro-Doppler Signatures
As the use of radar-based systems increases, several applications of MD signatures using deep learning have emerged [18][19][20][21]. Kim et al. [22] first applied a neural network to human motion recognition on MD signatures and showed the applicability of deep learning to radar signals. After that, several deep learning techniques were applied to radar-based motion recognition, including large-scale pre-training [23] and recurrent neural networks [24]. Lin et al. [25] proposed an iterative CNN followed by random forests on MD signatures, which showed a performance boost. Park et al. [23] utilized a DCNN pre-trained on a large-scale image classification dataset, ImageNet [26], which demonstrated the connection between radar and computer vision. Furthermore, Wang et al. [24] used a recurrent neural network (RNN) to detect dynamic gestures with a short-range radar-based sensor, Google's Soli. Recent studies have performed human identification (HI) on MD signatures, which requires understanding the unique characteristics of a single person. In detail, MD signatures from heartbeat signals have been utilized for HI [11,27]. More recently, MD signatures of human gait characteristics [10] have been used for HI, which is more challenging because they are recorded in an uncontrolled scenario where a target is allowed to walk around in a free and spontaneous way. Cao et al. [27] initially applied a DCNN to MD signatures for HI. Vandersmissen et al. [10] also used a DCNN and constructed the public dataset IDRad for HI, which contributed to subsequent research. Although several methods have been proposed for the aforementioned tasks, they do not fully utilize the details of MD signatures induced by moving human body parts. Therefore, we propose DSDA, which can specifically recognize the unique signals generated by human walking.

Recent Radar-Based Human Identification Analysis
Radar systems have been widely applied to human identification [10,11,27,28,29]. We compare these previous works with our research in terms of their differences, advantages and disadvantages. The work [28] detects humans in specific conditions (i.e., short-range through-wall and long-range foliage penetration). The difference between that study and ours is that it was carried out to find people under specific conditions, whereas our proposed DSDA addresses human identification from general human behaviors, including walking motions and arm and body movements. In terms of method, they applied an SVM for human detection, whereas our study includes an SVM model and also covers a broader set of experiments (SVM, CNN, RNN, and attention models). The advantage of that study, we think, is that it defines specific tasks well and suggests solutions for them. Its disadvantage is that it is highly task-specific, which reduces its applicability to other research.
The work [29] shares with our work the goal of human identification through the recognition of gait characteristics in MD signatures. However, that work mainly focuses on open-set feature analysis, i.e., how the model can perform better when an 'unknown class' exists at inference, whereas our research mainly concerns sequential feature pattern analysis; thus, our experimental contributions rely more on sequential pattern analysis and its solutions. The advantage of the work [29] is its novel problem definition for establishing the generality of human identification; its disadvantage is that the feature analysis is insufficient, given that MD signatures contain sequential information.
The works in [11,27] also share commonality with our work in that they perform feature pattern analysis (limbs, torso, heartbeat). The difference is that our model introduces an attention method to better recognize sequential feature patterns. As shown in Table 1, our initial attempts also included CNN models such as [10,11,27], but we confirmed that the RNN-based model performs better. Therefore, our further studies focused on designing sequential feature processing models (RNN and DSDA). The advantage of the works [11,27] is their specific feature pattern analysis; we speculate that their disadvantage is a lack of attention to models that can recognize the feature pattern well, owing to their reliance on the popular CNN model.
For the work [10], our proposed DSDA is validated on the same dataset (IDRad) released in [10]. However, the simple convolutional neural network used in [10] is limited in understanding sequential patterns in the continuous radar signal; thus, we focused on a method that can accurately recognize the sequential information of the radar sequence. In this respect, we conducted several experiments, including Recurrent Neural Network models and an attention model. Finally, we proposed the Dual-Scale Doppler Attention technique for recognizing the fast and slow-moving patterns that are prominently present in continuous signals, which we consider the main methodological difference from the work [10]. The advantages of the work [10] lie in its dataset release and task proposal; as for disadvantages, we consider that it contributes little on how to better understand the feature patterns in sequential MD signatures.

Generating Micro-Doppler Signatures
Our proposed Dual-Scale Doppler Attention (DSDA) is validated on micro-Doppler signatures (MD signatures), so we first explain how the MD signatures are generated. To make an MD signature of 45 time stamps, we first build a single-time-stamp MD signature and connect 45 of them in a row. As shown in Figure 2, the single-time-stamp MD signature is obtained from a single range-Doppler map. We integrate the range-Doppler map (of dimension 256 × R, where 256 is the number of Doppler channels and R is the number of discrete range bins) along the range axis, which yields the single MD signature (of dimension 256 × 1). This single MD signature contains the Doppler-shift information, where the Doppler-shift value denotes whether the target is moving closer or farther away (channels 129-256 contain the Doppler shift of a target moving closer, and channels 1-128 contain the Doppler shift of a target moving farther away). Following [10], some channels are not helpful for identifying the human: channels 127-129 correspond to zero Doppler, and the Doppler channels at both ends are too noisy. We remove them and finally obtain MD signatures of size 205 × 45 (205 Doppler channels by 45 time stamps).
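The generation steps above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' preprocessing code: integration is taken to be a plain sum over range bins, 24 channels are dropped at each end of the Doppler axis (the count given in the IDRad preprocessing description, which together with channels 127-129 leaves 205), and the names `keep_indices` and `md_signature` as well as R = 32 are placeholders.

```python
import numpy as np

N_DOPPLER = 256                 # Doppler channels per range-Doppler map
EDGE = 24                       # channels dropped at each end of the Doppler axis
ZERO_DOPPLER = (126, 127, 128)  # 0-based indices of channels 127-129


def keep_indices() -> np.ndarray:
    """0-based Doppler indices that survive the channel removal (205 in total)."""
    idx = np.arange(EDGE, N_DOPPLER - EDGE)
    return idx[~np.isin(idx, ZERO_DOPPLER)]


def md_signature(range_doppler_maps: np.ndarray) -> np.ndarray:
    """Turn T maps of shape (T, 256, R) into a (205, T) MD signature."""
    per_frame = range_doppler_maps.sum(axis=2)   # integrate over range -> (T, 256)
    return per_frame[:, keep_indices()].T        # drop channels -> (205, T)


rng = np.random.default_rng(0)
maps = rng.random((45, N_DOPPLER, 32))   # R = 32 range bins is a placeholder
print(md_signature(maps).shape)          # (205, 45)
```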

Model Overview
MD signatures are provided for identifying the human identity [10,11,27,28,29,30,31]. DSDA takes an MD signature $R \in \mathbb{R}^{C \times T}$, where C is the number of channels and T is the number of time stamps (C = 205, T = 45). The number of time stamps can be adjusted according to the size of the MD signatures. As shown in Figure 3, the MD signature is a time-evolving sequence representing the micro-Doppler effect of a person moving. Window Generation provides dual-scale windows: a temporal window and a holistic window. The temporal window $W_T$ focuses on recognizing fast-moving features in the MD signature, so the signature is uniformly divided with a temporal sliding window of stride S. The temporal window size is C × L, and the total number of temporal windows is N (S = 5, L = 25 and N = 5). The holistic window $W_H$ focuses on recognizing the slow-moving signal and is taken from the original MD signature. The Window Encoder (WE) embeds the temporal windows and the holistic window into d-dimensional feature representations through a series of Conv-ELU-MaxPooling layers. The encoded temporal features T and holistic features H are computed by WE together with layer normalization (LN) [32], using the Exponential Linear Unit (ELU) [33] as the nonlinearity. We present detailed specifications of the Window Encoder used in TWA and HWA in Tables 1 and 2. The number of temporal windows is set to N = 3 in Table 1 (this is intended to help in understanding the structure of the Window Encoder used in the Temporal Window Attention). The following sub-sections explain the details of the remaining model components.
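The window generation step can be illustrated with a small NumPy sketch. It assumes windows start at time stamp 0 and that only full-length windows are kept, which reproduces N = 5 for T = 45, L = 25 and S = 5; the function name `temporal_windows` is hypothetical.

```python
import numpy as np


def temporal_windows(signature: np.ndarray, length: int, stride: int) -> np.ndarray:
    """Slice a (C, T) signature into N windows of shape (C, length)."""
    n_steps = signature.shape[1]
    starts = range(0, n_steps - length + 1, stride)
    return np.stack([signature[:, s:s + length] for s in starts])


signature = np.zeros((205, 45))
w_t = temporal_windows(signature, length=25, stride=5)   # temporal windows W_T
w_h = signature                                          # holistic window W_H
print(w_t.shape, w_h.shape)   # (5, 205, 25) (205, 45)
```

Under the same assumptions, the number of windows is N = (T − L) / S + 1 whenever S divides T − L.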

Window Attention
The basic attention unit of TWA and HWA is referred to as Window Attention (WA), which calculates the attention weights among multi-element features. Given $X = [x_1 \ldots x_I]^T \in \mathbb{R}^{I \times d}$ and $Y = [y_1 \ldots y_J]^T \in \mathbb{R}^{J \times d}$, the WA $W(X, Y): \mathbb{R}^{I \times d} \times \mathbb{R}^{J \times d} \to \mathbb{R}^{I \times d}$ attends X using Y. It assigns each element $x_i$ a scalar attention weight $w(x_i, Y): \mathbb{R}^{d} \times \mathbb{R}^{J \times d} \to \mathbb{R}$, computed from the similarity $\mathrm{ELU}(W_1 x_i)\,\mathrm{ELU}(W_2 y_j)^T$ between $x_i$ and the elements of Y, where $W_1, W_2 \in \mathbb{R}^{d \times d}$ are learnable parameters (our experiments also validate this choice of attention method in Section 4.4). Using the ELU nonlinearity, WA tries to preserve common signatures between the input X and Y windows. Based on WA, the following TWA and HWA are defined and attend their respective moving signatures. We also illustrate the process of Window Attention in Figure 4, where WA is performed on two window features. For TWA, both inputs are temporal window features and, for HWA, both are holistic window features.
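One possible reading of Window Attention is sketched below in NumPy. Only the similarity term ELU(W1 x_i) ELU(W2 y_j)^T and the input/output shapes come from the text; how the similarities are aggregated over j and normalised over i is an assumption made here (mean over j, softmax over i), and `window_attention` is a hypothetical name.

```python
import numpy as np


def elu(z: np.ndarray) -> np.ndarray:
    """Exponential Linear Unit with alpha = 1."""
    return np.where(z > 0, z, np.expm1(np.minimum(z, 0.0)))


def window_attention(x, y, w1, w2):
    """Attend X (I, d) using Y (J, d); returns (I, d) features and the I weights.

    ASSUMPTION: scores are averaged over j and softmax-normalised over i.
    """
    sim = elu(x @ w1.T) @ elu(y @ w2.T).T   # (I, J): ELU(W1 x_i) ELU(W2 y_j)^T
    score = sim.mean(axis=1)                # aggregate over the J elements of Y
    weights = np.exp(score - score.max())
    weights /= weights.sum()                # one scalar w(x_i, Y) per element
    return weights[:, None] * x, weights


rng = np.random.default_rng(0)
d = 8                                       # toy size; the paper uses d = 1024
x = rng.normal(size=(5, d))                 # e.g. five encoded temporal windows
w1, w2 = 0.1 * rng.normal(size=(2, d, d))   # stand-ins for the learned W1, W2
attended, weights = window_attention(x, x, w1, w2)
print(attended.shape, weights.round(3))
```

Because each output row is the input row scaled by its weight, the output keeps the shape of X, matching the stated mapping to R^{I×d}.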

Temporal Window Attention
We observe that the fast-moving features of arms and legs are unique depending on the person's height and walking pattern. To preserve and highlight these features, the Temporal Window Attention (TWA) takes multiple temporal windows as inputs and applies WA among them (Equation (7)); the attended features are then averaged over the N temporal windows (Equation (8)):

$$T_{avg} = \mathrm{AvgPool}(T) \in \mathbb{R}^{1 \times d},$$

where AvgPool() is the average pooling function. TWA produces two types of features. One is the TWA output T in Equation (7), which contributes to classifying the person's identity, and the other is its average over the N temporal windows, $T_{avg}$ in Equation (8). This average, $T_{avg}$, is treated as characterizing the overall fast-moving features. Furthermore, $T_{avg}$ is used to remove fast-moving information from the original holistic features in the following Holistic Window Attention.

Holistic Window Attention
The Holistic Window Attention (HWA) is designed to recognize slow-moving signatures such as those of the torso. Different from TWA, HWA takes two types of features as input. One is the original holistic features H, and the other is the subtracted holistic features $H_{sub}$, obtained by subtracting $T_{avg}$ from H (Equation (9)):

$$H_{sub} = H - T_{avg}.$$

Semantically, H can include both slow and fast-moving features, whereas $H_{sub}$ includes only the slow-moving features because the characteristics of the overall fast-moving features have been removed. These two holistic features are concatenated, with [· ; ·] denoting the concatenation operation, and HWA attends the common slow-moving signals using WA (Equations (10) and (11)). The resulting holistic features $H^*$ represent the slow-moving feature in the MD signature and are utilized to classify the targets.
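The data flow from TWA to HWA can be traced with the following shape-level sketch. It is not the authors' implementation: `attend` is a simplified stand-in for WA (dot-product scores, no learnable matrices or ELU), the features are random, and d = 16 is a toy size; only the order of operations and the resulting shapes follow the description above.

```python
import numpy as np


def attend(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Stand-in for WA: reweights rows of x (I, d) by similarity to y (J, d).

    ASSUMPTION: plain dot-product scores with a softmax over the rows of x;
    the learnable matrices W1, W2 and the ELU are omitted for brevity.
    """
    score = (x @ y.T).mean(axis=1)
    w = np.exp(score - score.max())
    return (w / w.sum())[:, None] * x


rng = np.random.default_rng(0)
n, d = 5, 16                                 # toy sizes; the paper uses d = 1024
t = rng.normal(size=(n, d))                  # encoded temporal features T
h = rng.normal(size=(1, d))                  # encoded holistic features H

t_att = attend(t, t)                         # TWA: attention among temporal windows
t_avg = t_att.mean(axis=0, keepdims=True)    # AvgPool over the N windows -> (1, d)
h_sub = h - t_avg                            # remove the overall fast-moving component
h_cat = np.concatenate([h, h_sub])           # [H; H_sub] -> (2, d)
h_att = attend(h_cat, h_cat)                 # HWA: attend the shared slow-moving part
features = np.concatenate([t_att, h_att])    # (N + 2, d), passed to the classifier
print(features.shape)                        # (7, 16)
```

The (N + 2) rows of the final feature matrix match the (N + 2) × d input dimension of the classifier weight W_p.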

Classification
In classification, we use the two aforementioned temporal and holistic features for classifying targets. The attended temporal and holistic features are concatenated and passed to the final classifier (Equations (12)-(14)). Here, Flat() is a function that converts 2D data into 1D so that the CNN feature maps can be fed to the fully connected network; Equation (14) is the last Dropout-ELU-Linear block in the classifier; $W_p \in \mathbb{R}^{128 \times ((N+2) \times d)}$ and $W_c \in \mathbb{R}^{C_T \times 128}$ are the learnable parameters; $C_T = 5$ is the number of targets (the IDRad dataset contains five targets, corresponding to five different people); and p = {p[1], p[2], p[3], p[4], p[5]} is the inferred prediction distribution. At inference, the prediction is obtained by applying the argmax function to p. We also give the specifications of the classifier network in Table 3. The (3 + 2) in Table 3 denotes the concatenation of TWA and HWA features, where HWA contributes (2 × 1024) features from the holistic feature (1 × 1024) in Table 2 and the subtracted feature (1 × 1024), and TWA contributes (3 × 1024) features in Table 3.
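A reduced inference-time sketch of the classifier head is given below. It is an assumption-laden simplification: only the two weight matrices named in the text (W_p and W_c) are shown rather than all four Dropout-ELU-Linear blocks, dropout is omitted because it is inactive at inference, biases are left out, and the weights are random placeholders.

```python
import numpy as np


def elu(z: np.ndarray) -> np.ndarray:
    """Exponential Linear Unit with alpha = 1."""
    return np.where(z > 0, z, np.expm1(np.minimum(z, 0.0)))


def classify(features: np.ndarray, w_p: np.ndarray, w_c: np.ndarray):
    """Flatten (N+2, d) features, project to 128 units, then to C_T classes."""
    flat = features.reshape(-1)          # Flat(): 2D -> 1D, length (N+2)*d
    hidden = elu(w_p @ flat)             # W_p in R^{128 x ((N+2) d)}
    logits = w_c @ hidden                # W_c in R^{C_T x 128}
    p = np.exp(logits - logits.max())
    p /= p.sum()                         # softmax -> prediction distribution p
    return p, int(np.argmax(p)) + 1      # targets are numbered 1..C_T


rng = np.random.default_rng(0)
n, d, c_t = 5, 16, 5                     # toy d; the paper uses d = 1024
features = rng.normal(size=(n + 2, d))
w_p = 0.1 * rng.normal(size=(128, (n + 2) * d))
w_c = 0.1 * rng.normal(size=(c_t, 128))
p, target = classify(features, w_p, w_c)
print(p.round(3), target)
```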

Training Loss
The entire model is trained in an end-to-end manner using the cross-entropy loss $L_{HI}(y, p)$, where y is the ground-truth target information and p is the prediction of human identification from DSDA, as shown below:

$$L_{HI}(y, p) = -\sum_{i=1}^{C_T} y[i] \log p[i].$$
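As a worked example of the cross-entropy loss for a single sample, the sketch below assumes a one-hot ground-truth vector y over the five targets and a hypothetical prediction p; the small `eps` guards against log(0) and is an implementation detail added here.

```python
import numpy as np


def cross_entropy(y_onehot: np.ndarray, p: np.ndarray, eps: float = 1e-12) -> float:
    """L_HI(y, p) = -sum_i y[i] * log(p[i]) for one sample."""
    return float(-np.sum(y_onehot * np.log(p + eps)))


y = np.array([0, 0, 1, 0, 0])                   # ground truth: target 3
p = np.array([0.05, 0.05, 0.80, 0.05, 0.05])    # hypothetical prediction
print(round(cross_entropy(y, p), 4))            # 0.2231
```

With a one-hot y the loss reduces to −log of the probability assigned to the correct target, here −log(0.8).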

IDRad Dataset
As the MD signature dataset for HI, we use IDRad [10], recorded with an FMCW radar that captures range-Doppler maps at around 15 FPS. The IDRad dataset contains 95,650 frames of 20 min for the training set and 22,535 frames of 5 min for the test set. One frame contains the Doppler frequency channels along the range axis. To construct micro-Doppler signatures, each range-Doppler map is integrated along the range axis, and the integrated Doppler signals are connected in temporal order to make Doppler-time maps, as previously stated [10]. Here, one time stamp includes 256 Doppler channels, and one MD signature is composed of 45 time stamps. For a fair comparison, we performed the same preprocessing as [10] and used the default input of 205 × 45 MD signatures, i.e., 205 Doppler channels and 45 time stamps. Regarding the details of preprocessing, among the 256 Doppler channels, channels 127-129, which represent static objects, are removed, and 24 Doppler channels at each of the top and bottom of the Doppler axis are also removed because they correspond to speeds too high for humans to reach (256 − 3 − 2 × 24 = 205). The IDRad dataset records the movements of five subjects whose ages range from 23 to 32 years, weights from 60 to 90 kg, and heights from 178 to 185 cm. The five subjects are able to move freely within a certain range of the recorded room, and they can freely walk, run, or stop in various directions.

Experimental Details
The dimension of the hidden layer is set to d = 1024. For DSDA, we use three blocks of the Conv-ELU-MaxPooling layer in the Window Encoder and four blocks of the Dropout [34]-ELU-Linear layer in the classifier. Our model can be easily implemented with six layers of convolutional neural networks and is trained on an NVIDIA TITAN V GPU (12 GB of memory) with the Adam optimizer [35] with $\beta_1 = 0.9$, $\beta_2 = 0.98$ and $\epsilon = 10^{-9}$. For all experiments, we use a batch size of 64 and a dropout rate of 0.2, and the model is trained for up to 15 epochs. Beyond these settings, we do not perform any hyperparameter fine-tuning.
The overall evaluation of DSDA is performed using the error rate, calculated as: 'error rate' = 100 × (number of incorrect predictions)/(total samples). The total samples can come from the validation set or the test set. Table 4 summarizes the experimental results on the IDRad dataset, where we compare DSDA with several recent methods. Considering the temporal properties, we use 10 s MD signatures as input (150 time stamps), which differs from the 3 s MD signatures (45 time stamps) used in [10]. For a fair comparison, we also measured the baseline with the same input size: the baseline with 10 s MD signatures is reproduced using the public code of [10] and reported as 'Baseline (150 time stamp)' in Table 4. The extended input size gives the baseline a slight improvement, but the result remains within the variation of the original performance reported in [10]. To validate deep learning-based models on this task, we first built a PCA model with an SVM classifier, which shows the classification performance of a classical machine learning algorithm. Compared with the PCA model, the deep learning-based models (i.e., baseline, LSTM-based model, DSDA) give superior performance. Although the IDRad dataset contains sequential data, sequential models have not previously been used for human identification on it. Thus, for these sequential radar image data, we also consider a sequential Recurrent Neural Network (RNN) model and devised a Long Short-Term Memory (LSTM) [36,37] based model. The LSTM model performs better than the CNN-based baseline, which shows the necessity of sequential understanding for radar data. Motivated by the recent success of the Transformer [38][39][40], our proposed DSDA is based on the attention mechanism and also exploits radar-specific properties (fast-moving and slow-moving features).
DSDA achieves state-of-the-art performance, outperforming all other methods by a large margin. The results indicate that extracting slow and fast-moving features and attending to them improves the interpretability of MD signatures.
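The error-rate metric used throughout the evaluation is simple enough to state in a few lines of Python. The labels and predictions below are hypothetical and serve only to exercise the formula 100 × (number of incorrect predictions)/(total samples).

```python
def error_rate(predictions, labels) -> float:
    """Error rate in percent: 100 * incorrect / total."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    wrong = sum(p != y for p, y in zip(predictions, labels))
    return 100.0 * wrong / len(labels)


# Hypothetical predictions for ten samples over five targets.
labels      = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
predictions = [1, 2, 3, 1, 5, 1, 2, 3, 4, 5]
print(error_rate(predictions, labels))   # 10.0
```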

Ablation Study
We experiment with several variants of DSDA to measure the effectiveness of the proposed key components. The first block of Table 5 is the full DSDA with three multi-scale windows composed of strides s = {5, 10, 20} and window sizes L = {35, 25, 25}. The second block of Table 5 provides ablation results for TWA and HWA. Since both TWA and HWA improve performance, both fast-moving and slow-moving features carry target-specific information. In particular, as TWA boosts performance significantly, we can see that the unique information that identifies a person is inherent in fast-moving features. The third block of Table 5 provides ablation results on the scale of the temporal windows in TWA. The window size and the stride between windows influence the extraction of the proper characteristic features. We confirmed that, for a single fixed window, the best performance is achieved when the stride s is 5 and the window size L is 35. The fourth block of Table 5 provides ablation results with multi-scale windows. Here, we use multi-scale temporal windows of various sizes for TWA by selecting several windows from the third block of Table 5. M is the number of different types of windows. When M = 3, we try several combinations of three windows, and the best performance is obtained with strides s = {5, 10, 20} and window sizes L = {35, 25, 25}. We also evaluate M = 4 by adding one more window on top of the best-performing M = 3 configuration. Since none of the M = 4 results improved on it, M = 3 is taken as the best-performing configuration, and we report the averaged results for M = 4. To adopt multi-scale temporal windows, TWA is applied to each window scale, and HWA uses the several subtracted holistic features obtained from the multi-scale temporal windows. We consider that the several moving features extracted by multi-scale temporal windows make the model suitable for capturing a person's unique characteristics.

Table 5. Ablation study on model variants of DSDA on the validation split of IDRad.

Table 5 columns: Model Variants; Error Rate (%).

Table 6 presents the ablation on the attention methods for $\mathrm{ELU}(W_1 x_i)\,\mathrm{ELU}(W_2 y_j)^T$ in Equation (5). We experimented with three attention methods that highlight common features. Based on the results, we selected A * B for the final DSDA. Here, the operation ';' denotes concatenation along the d-dimensional axis. To satisfy the dimensional condition, we add an embedding matrix $W_{add} \in \mathbb{R}^{d \times 1}$ for A + B and an embedding matrix $W_{cat} \in \mathbb{R}^{2d \times 1}$ for A; B. The other attention methods we tried empirically are variants of A + B and A; B, and they do not give further performance gain.

Table 6. Comparison of attention methods in WA on the validation split of IDRad. Columns: Attention Method for Calculating Similarity; Error Rate (%).

Table 7 presents the Kappa Index Analysis of our proposed DSDA against the baseline [10]. 'TRUE' means 'correct' on the target, and 'FALSE' means 'incorrect' on the target. The kappa value is calculated as K = (Pa − Pc)/(1 − Pc), where Pa is the observed probability of agreement and Pc is the hypothetical expected probability of agreement. Pa is obtained as Pa = (1082 + 122)/1490 = 0.808 and Pc is obtained as Pc = (1328/1490) × (1122/1490) + (162/1490) × (368/1490) = 0.698. Thus, the kappa value is K = (0.808 − 0.698)/(1 − 0.698) = 0.36. Therefore, according to the kappa value analysis, the decisions of our proposed DSDA have a 'fair' strength of agreement with the baseline [10]. We consider that this is because DSDA has a lower error rate (10.8%) than the baseline (24.7%), which leads to disagreement with the baseline in the cases where the baseline makes an incorrect decision on the target.

Here, the temporal windows span 45 time stamps with a stride of 15. Although the multi-scale windows performed well in this human identification task, we select a single window scale here because it makes it easier to understand how the attention weights of TWA are formed and how these weights identify the fast-moving information in the MD signatures. Thus, we can find out which windows have been highlighted through the attention weights. The MD signatures corresponding to the three windows from the left (i.e., windows 1, 2, 3) contain fast-moving features, as do the signatures corresponding to the two windows from the right (i.e., windows 7, 8). Our analysis of the synchronized video identifies that these fast-moving features come from human walking. The other signatures (i.e., MD signatures from windows 4, 5, 6) are formed when the person stops moving and changes walking direction. The attention weights in TWA are high on the windows containing fast-moving features such as walking, whereas small weights are given to slow-moving features such as changing direction or standing still. Therefore, it is confirmed that TWA is properly trained to recognize the fast-moving information in the MD signatures.
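Returning to the kappa analysis of Table 7, the agreement statistic can be recomputed from the counts quoted in the text. The sketch below uses the standard Cohen's kappa, (Pa − Pc)/(1 − Pc); the two off-diagonal cells are not stated explicitly and are derived here from the reported totals (DSDA correct on 1328 samples, baseline correct on 1122, both correct on 1082, both incorrect on 122, out of 1490). The function name `cohen_kappa` is illustrative.

```python
def cohen_kappa(both_true: int, only_a: int, only_b: int, both_false: int) -> float:
    """Cohen's kappa for two binary raters from the four cells of a 2x2 table."""
    n = both_true + only_a + only_b + both_false
    p_a = (both_true + both_false) / n                       # observed agreement
    a_true, b_true = both_true + only_a, both_true + only_b
    p_c = (a_true * b_true + (n - a_true) * (n - b_true)) / n**2   # chance agreement
    return (p_a - p_c) / (1.0 - p_c)


# Rater A is DSDA, rater B is the baseline; off-diagonal cells derived from totals.
kappa = cohen_kappa(both_true=1082, only_a=1328 - 1082, only_b=1122 - 1082, both_false=122)
print(round(kappa, 2))   # 0.36
```

A value of 0.36 falls in the range commonly labelled 'fair' agreement, consistent with the interpretation given for Table 7.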
In Figure 6, we build a confusion matrix of DSDA. The vertical axis denotes the ground-truth category and the horizontal axis denotes the predicted category. Across all targets, DSDA achieves its highest accuracy, 100%, on target 3 and its lowest accuracy, 72%, on target 4. We consider that DSDA's prediction is lowest on target 4 because target 4, target 1 and target 5 show similar movements and have relatively similar body shapes, which makes it challenging to distinguish their characteristics. To qualitatively compare the confusion matrix with the baseline [10], we give the confusion matrix predicted by the baseline in panel (b) of Figure 6 (both confusion matrices are obtained from the fully trained baseline and DSDA). Compared with the baseline, our proposed DSDA improves performance on all targets, and both the baseline and DSDA show excellent performance on target 3. It is also notable that DSDA shows a larger improvement on target 4, which we attribute to DSDA being more robust in distinguishing fine-grained information in the MD signatures. In Figure 7, we also build target-wise confusion matrices to calculate the sensitivity and specificity for each target. The average sensitivity is 0.88, and the average specificity is 0.97. DSDA is particularly effective in terms of specificity, which means that our model reliably rejects negative targets. It is also notable that DSDA achieves 100% on target 3, which implies that the model finds a suitable feature space for classifying target 3 and also places the other targets' features in appropriate regions of the feature space.
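The target-wise sensitivity and specificity discussed above follow from a multi-class confusion matrix by treating each target in turn as the positive class. The sketch below shows the computation with rows as ground truth and columns as predictions; the 3-class matrix is hypothetical and is not taken from Figure 6 or Figure 7.

```python
import numpy as np


def sensitivity_specificity(cm: np.ndarray):
    """Per-class sensitivity and specificity from a confusion matrix.

    Rows are ground-truth classes, columns are predicted classes.
    """
    tp = np.diag(cm).astype(float)
    fn = cm.sum(axis=1) - tp        # samples of the class predicted as another class
    fp = cm.sum(axis=0) - tp        # samples of other classes predicted as the class
    tn = cm.sum() - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical 3-class confusion matrix, not taken from the paper.
cm = np.array([[8, 1, 1],
               [0, 10, 0],
               [2, 0, 8]])
sens, spec = sensitivity_specificity(cm)
print(sens, spec)
```

Averaging `sens` and `spec` over the classes gives summary figures of the kind reported for DSDA.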
In Figure 8, we perform an efficiency analysis. The attention mechanism in DSDA not only requires few memory resources (the only additional resources required are the joint space embedding matrices) but also leads to early saturation of the training error rate. As shown by the orange curve in Figure 8, the error rate on the IDRad training dataset converges earlier than that of the baseline (blue curve), which indicates that the window attention mechanism helps the network weights become sensitive to this identification task. After epoch 15, the two curves from the baseline and DSDA have converged and become saturated.
In Figure 9, we also perform a dataset sensitivity analysis according to the ratio of the training dataset used. As the training data are reduced, our proposed DSDA remains robust down to 60% usage of the training dataset, keeping the error rate below 30%; below that, the error rate increases. In contrast, the baseline remains robust only down to 80% usage of the training dataset, which shows that our model learns more efficiently from the dataset and is less sensitive to dataset scarcity.

Figure 9. Sensitivity analysis of model performances according to the training dataset usage (plot title: Sensitivity Analysis according to usage of Training dataset; x-axis: training dataset usage (%); y-axis: error rate (%)). The blue curve shows baseline [10], and the orange curve shows the proposed DSDA.
In Figure 10, to verify the practicality of the proposed model, we extend the HI task to a human localization task, where the MD signatures are extended to a longer size (i.e., 450 time stamps) and we perform time-stamp-wise human identification. This can be thought of as human localization in MD signatures, which is more challenging and requires a fine-grained understanding of MD signatures. The MD signatures are randomly selected from the validation set. The prediction for each time stamp is made at the center of a 150-time-stamp window. DSDA traverses all the MD signatures and predicts every time stamp; the predictions are shown in Figure 10, where the y-axis denotes the probability of each target. Thus, five distributions corresponding to the five targets are generated, and the bar at the top shows the ground-truth target for every signature. The distributions that match the ground-truth target are clearly highlighted for the given MD signatures. This indicates that DSDA is able to perform fine-grained human identification and extends properly to the human localization task in radar signals.
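The time-stamp-wise identification described above can be sketched as a sliding-window loop. The following is an illustration under stated assumptions: the window advances one time stamp at a time, time stamps too close to the borders to sit at the centre of a full window receive no prediction, and `predict` is a hypothetical callable standing in for a trained DSDA model (here replaced by a dummy that returns a uniform distribution).

```python
import numpy as np


def timestamp_predictions(signature: np.ndarray, predict, window: int = 150) -> np.ndarray:
    """Slide a window over a (C, T) signature; each prediction belongs to the window centre.

    `predict` is a hypothetical callable mapping a (C, window) array to a
    probability vector over targets. Returns shape (T - window + 1, n_targets).
    """
    n_steps = signature.shape[1]
    return np.stack([predict(signature[:, s:s + window])
                     for s in range(n_steps - window + 1)])


def dummy_predict(win: np.ndarray) -> np.ndarray:
    """Placeholder model: uniform distribution over five targets."""
    return np.full(5, 0.2)


probs = timestamp_predictions(np.zeros((205, 450)), dummy_predict)
centres = np.arange(probs.shape[0]) + 150 // 2   # time stamp each row refers to
print(probs.shape, centres[0], centres[-1])      # (301, 5) 75 375
```

Plotting each column of `probs` against `centres` would give five per-target probability curves of the kind described for Figure 10.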

Limitation
We present two limitations from two different perspectives. The first limitation concerns the task. Our current experiments are mainly performed on the human identification task. However, as shown in Figure 10, we found that this task can be extended to human detection in the radar sequence by stitching MD signatures and localizing the human in them. Our further studies will include this extended analysis and the possibilities of human detection. The second limitation is the lack of datasets for this study. We validate DSDA only on the IDRad dataset. To build a general radar-based model, we should also validate on more, and more diverse, radar datasets. In this respect, our further research will focus on how to establish model generality on radar signals.

Future Work
Our future work is three-fold. First, we will annotate the radar dataset so that it is suitable for localization tasks. As shown in Figure 10, we evaluated DSDA on both human identification and localization on the radar dataset, where DSDA shows promise for these tasks. Second, we will modify the windowing. Currently, our windows are generated by a sliding-window mechanism, but we will consider more diverse windowing methods. Third, we will validate DSDA on other radar datasets so that it can serve as a general attention module for radar understanding.

Conclusions
We propose Dual-Scale Doppler Attention (DSDA) for human identification. DSDA adopts Temporal Window Attention (TWA), which attends to the fast-moving MD signatures of the arms and legs, and Holistic Window Attention (HWA), which attends to the slow-moving MD signatures of the torso. Our experimental results on the IDRad dataset show state-of-the-art performance, and the qualitative results validate the efficiency of the proposed module. The model analysis, including sensitivity analysis and confusion matrices, shows DSDA's robustness. Extended experiments incorporating radar-based human detection tasks show the flexibility of DSDA. These experiments also make the directions for our future work clearer and more concrete.