Whitening-Aided Learning from Radar Micro-Doppler Signatures for Human Activity Recognition

Deep learning architectures are being increasingly adopted for human activity recognition using radar technology. A majority of these architectures are based on convolutional neural networks (CNNs) and accept radar micro-Doppler signatures as input. The state-of-the-art CNN-based models employ batch normalization (BN) to optimize network training and improve generalization. In this paper, we present whitening-aided CNN models for classifying human activities with radar sensors. We replace the BN layers in a CNN model with whitening layers, a substitution that is shown to improve the model's accuracy by not only centering and scaling the activations, similar to BN, but also decorrelating them. We also exploit the rotational freedom afforded by whitening matrices to align the whitened activations in the latent space with the corresponding activity classes. Using real data measurements of six different activities, we show that whitening provides superior performance over BN in terms of classification accuracy for a CNN-based classifier. This demonstrates the potential of whitening-aided CNN models to provide enhanced human activity recognition with radar sensors.


Introduction
Owing to its privacy-aware nature and robustness against a variety of operating conditions, radar technology is finding increasing applications in healthcare [1][2][3][4][5][6][7][8][9][10][11]. These include remote patient monitoring outside of a hospital setting, rehabilitation interventions with a focus on improving mobility, and eldercare for aging-in-place. From an algorithmic perspective, human activity recognition is a core component of radar-sensing solutions for such applications.
Classification of human activities using radar has recently experienced an influx of deep learning models due to their predictive power and ability to automatically learn relevant discriminant features from radar measurements [12][13][14][15][16][17][18]. In particular, convolutional neural networks (CNNs) are being extensively used for learning spatial hierarchies of features from micro-Doppler signatures of human activities [19][20][21][22][23][24][25][26]. In [19], a four-layer CNN-based activity classifier was used with Cepstral heatmaps, which were computed from the real radar spectrograms by applying an optimized filter bank generated on a diversified simulation database. A flexible deep CNN model was proposed in [20] to classify Doppler signatures of humans walking with different arm movements. Therein, a Bayesian learning technique was used to optimize the network. In [21], a dot-product attention-augmented convolutional autoencoder was proposed to learn both localized information and global features from micro-Doppler signatures. Superior classification accuracy was achieved by the attention-augmented model compared to its conventional counterpart. In [22], AlexNet was trained with an attention module to learn to highlight salient regions in micro-Doppler signatures, which in turn was shown to enhance the network predictions. A hybrid model comprising a long short-term memory (LSTM) network and a one-dimensional CNN was introduced in [23], which provided enhanced classification of human activities with relatively low complexity compared to two-dimensional (2-D) CNN methods. Complex-valued CNN-based architectures were investigated in [24] with micro-Doppler signatures, range-time plots, and range-Doppler maps as the data formats of choice. Using experimental data of nine human activities, the advantages of complex-valued models over their real-valued counterparts were demonstrated for certain data formats and network architectures. In [25], a multi-view CNN and LSTM hybrid network was proposed for human activity recognition, which fused multiple views of the time-range-Doppler radar data-cube. In [26], a millimeter-wave radar was used for real-time contactless fitness tracking via deep CNNs, providing an effective alternative to body-wearable fitness trackers.
Most CNN-based solutions for recognizing human activities with radar readily employ batch normalization (BN) [27], which standardizes the activations of each batch in a layer. This renders the loss function considerably smoother, which in turn leads to improved accuracy and training speed for gradient-based methods [28]. Benefits beyond those afforded by BN in terms of model optimization and generalization can be achieved by whitening the hidden layers' activations [29]. However, to the best of our knowledge, the impact of decorrelating the activations by whitening has not been investigated for the application at hand. In this paper, we propose the use of a whitening-aided CNN to effectively distinguish between radar micro-Doppler signatures of different human activities. We employ the iterative normalization (IterNorm) technique [30], which uses Newton's iterations to efficiently implement whitening, thereby avoiding the high computational load imposed by the eigen-decomposition of the data covariance matrix required otherwise. Convergence of IterNorm is guaranteed by normalizing the eigenvalues of the covariance matrix. Additionally, following the work in [31], we exploit the rotational freedom afforded by the whitening matrix to design an add-on rotation module, which can align different activity classes with orthogonal directions in the latent space. We test two different whitening-aided CNN models, one exploiting IterNorm only in lieu of BN layers and the other replacing BN layers with an IterNorm + rotation module, on real data measurements of six different activities, namely, sitting down, standing up, walking, drinking water, bending to pick up an object, and falling. We show that whitening the latent space of a model provides significant enhancements in classification accuracy compared to the CNN architecture with BN layers, with the alignment of the axes along the classes via rotation providing a slight advantage over the IterNorm-only model.
The remainder of the paper is organized as follows. Section 2 describes the radar signal model and the micro-Doppler signatures. The BN and whitening methods are presented in Section 3, while the whitening-aided CNN models for human activity classification are described in Section 4. With the aid of real data examples, we demonstrate in Section 5 the usefulness of the whitening-aided models in achieving higher classification accuracy and also provide insights into the achieved performance enhancements over a base model employing BN layers. Concluding remarks are provided in Section 6.

Signal Model and Micro-Doppler Signatures
Consider a frequency-modulated continuous-wave (FMCW) radar, with the transmit signal, s_T(t), given by

s_T(t) = A_T(t) cos(2π f_c t + π α t²),  (1)

where A_T(t) is the signal amplitude, f_c is the carrier frequency, and α is the chirp rate. For a moving point target, the radar return, s_R(t), can be expressed as

s_R(t) = A_R(t) cos(2π f_c (t − τ) + π α (t − τ)² + 2π f_D t),  (2)

where A_R(t) is the received signal amplitude, τ is the two-way travel time, and f_D is the Doppler shift. The in-phase (I) and quadrature-phase (Q) components of the complex baseband signal, s(t), can be obtained by demodulating s_R(t) using the I/Q demodulator as

s(t) = s_I(t) + j s_Q(t) = A(t) exp(j2π(f_c τ + α τ t + f_D t)),  (3)

where A(t) is the amplitude of s(t).
For the activity recognition problem, the human body can be viewed as a collection of moving point scatterers, which results in the corresponding radar return being a superposition of individual returns of the form of (3), represented by

s(t) = Σ_i A_i(t) exp(j2π(f_c τ_i + α τ_i t + f_{D_i} t)),  (4)

where A_i(t) is the amplitude, f_{D_i} is the Doppler frequency, and τ_i is the two-way travel time, all corresponding to the ith point scatterer.
Once the complex baseband signal has been sampled, it can be arranged as a 2-D matrix, s(n_1, n_2), with n_1 and n_2 denoting fast-time and slow-time, respectively. To compute the range map, R(p, n_2), we take the discrete Fourier transform (DFT) along the matrix columns, represented by

R(p, n_2) = Σ_{n_1=0}^{N_1−1} s(n_1, n_2) exp(−j2π p n_1 / N_1),  (5)

where N_1 is the number of samples (range bins) in one pulse repetition interval, p = 0, 1, ..., N_1 − 1, and n_2 = 0, 1, ..., N_2 − 1, with N_2 representing the total number of considered pulse repetition intervals. Next, the corresponding micro-Doppler signature is obtained through a two-step process. First, we sum the data over the range bins of interest as

v(n_2) = Σ_{p=p_1}^{p_2} R(p, n_2),  (6)

with p_1 and p_2 being the minimum and maximum range bins considered. Then, we apply the short-time Fourier transform (STFT) to v(n_2) and compute the micro-Doppler signature, D(k_1, k_2), as the spectrogram (the squared magnitude of the STFT). That is,

D(k_1, k_2) = | Σ_{n=0}^{N−1} v(k_1 h + n) w(n) exp(−j2π k_2 n / N) |²,  (7)

where w(n) represents the window of length N (< N_2) that determines the trade-off between time and frequency resolutions [32], the integer h determines the step size by which the window is shifted across the signal v(n_2), k_1 is the time index, and k_2 is the frequency index. These micro-Doppler signatures serve as the input to the CNN-based classifier for human activity recognition.
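As a minimal illustration of this processing chain, the following NumPy sketch computes the range map via a fast-time DFT, sums the range bins of interest, and forms the spectrogram via a windowed STFT. The sampling parameters and the toy single-scatterer signal are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def micro_doppler_signature(s, p1, p2, win_len=64, hop=2, n_fft=256):
    """Sketch of the range-map / spectrogram pipeline.

    s : complex baseband samples, shape (N1, N2) = (fast-time, slow-time).
    """
    # Range map: DFT along fast-time (the matrix columns).
    R = np.fft.fft(s, axis=0)
    # Sum over the range bins of interest.
    v = R[p1:p2 + 1, :].sum(axis=0)
    # STFT: window of length win_len shifted by `hop` samples at a time.
    w = np.hanning(win_len)
    n_frames = (v.size - win_len) // hop + 1
    frames = np.stack([v[k * hop:k * hop + win_len] * w
                       for k in range(n_frames)])
    # fftshift puts zero Doppler at the center of the frequency axis.
    stft = np.fft.fftshift(np.fft.fft(frames, n=n_fft, axis=1), axes=1)
    # Spectrogram = squared magnitude of the STFT.
    return np.abs(stft) ** 2

# Toy example: a single scatterer with a 125 Hz Doppler shift.
fs = 1000.0            # slow-time (pulse repetition) rate, assumed
n1, n2 = 128, 2000     # fast-time and slow-time samples, assumed
t = np.arange(n2) / fs
s = np.ones((n1, 1)) * np.exp(2j * np.pi * 125.0 * t)[None, :]
D = micro_doppler_signature(s, p1=0, p2=10)
```

The Doppler peak of `D` sits at the frequency bin corresponding to 125 Hz, i.e., bin 128 + 125/(fs/256) = 160 after the fftshift.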

Whitening Methods
We briefly review BN and present two whitening methods, which form integral algorithmic components of the proposed whitening-aided CNN-based models for classification of human activities.

Batch Normalization
Let X ∈ R^{d×m} be the batch input of a layer, with d denoting the dimension of the layer's vector input and m representing the number of samples in the batch. The BN operation first centers and scales X to produce a standardized output X_S as

X_S = Λ_s^{−1/2} X_C,  (8)

X_C = X − μ 1_m^T,  (9)

where the matrix Λ_s = diag(σ_1², ..., σ_d²) + ε I_d contains the batch variances σ_i² corresponding to the ith input dimension and incorporates diagonal loading for numerical stability via the second term, I_d is an identity matrix of size d, ε > 0 is the diagonal loading factor, μ ∈ R^d is the batch mean given by

μ = (1/m) X 1_m,  (10)

1_m is an m × 1 column vector of all ones, and the superscript (·)^T denotes matrix transpose. Each column of X_S has zero mean and unit variance for each dimension. To ensure that BN can represent an identity transformation when inserted in the deep learning model, a scale parameter α ∈ R^d and a shift parameter β ∈ R^d are introduced to yield the output of the BN layer as [27]

X_BN = (α 1_m^T) ⊙ X_S + β 1_m^T,  (11)

where '⊙' denotes the Hadamard product. Both α and β are learned during model training to restore the model's representation power; see [27] for more details.
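The BN forward pass described above (centering, variance scaling with diagonal loading, then learnable scale and shift) can be sketched in NumPy as follows; the diagonal loading factor and batch shape are illustrative.

```python
import numpy as np

def batch_norm(X, alpha, beta, eps=1e-5):
    """Sketch of the BN forward pass.

    X : batch input of shape (d, m); alpha, beta : learnable, shape (d,).
    """
    mu = X.mean(axis=1, keepdims=True)           # batch mean
    Xc = X - mu                                  # centering
    var = Xc.var(axis=1, keepdims=True)          # per-dimension variance
    Xs = Xc / np.sqrt(var + eps)                 # scaling with diagonal loading
    return alpha[:, None] * Xs + beta[:, None]   # learnable scale and shift

rng = np.random.default_rng(0)
X = rng.normal(3.0, 2.0, size=(4, 256))
Y = batch_norm(X, alpha=np.ones(4), beta=np.zeros(4))
```

With alpha = 1 and beta = 0, every row of `Y` has (approximately) zero mean and unit standard deviation, as stated for X_S.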

Whitening Method 1: IterNorm Batch Whitening
The output of a whitening layer is obtained by centering and decorrelating the batch input X through a d × d whitening matrix W as

X_W = W X_C,  subject to  W^T W = Σ^{−1},  (12)

where Σ is the covariance matrix of X, X_C is defined in (9), and (·)^{−1} denotes the matrix inverse. The constraint in (12), however, does not uniquely determine W [33]. A popular choice for the whitening matrix is given by

W = Σ^{−1/2},  (13)

where (·)^{−1/2} denotes the inverse square root of the matrix argument. Typically, the eigen-decomposition of Σ is used to determine Σ^{−1/2}. However, the eigen-decomposition is computationally demanding and can excessively increase the computational complexity of the deep learning model. Instead, the efficient IterNorm batch whitening [30] can be employed, which uses Newton's method to iteratively compute the whitening matrix W.
The IterNorm technique is provided in Algorithm 1. The batch mean μ is computed using (10) in line 1, followed by the centered activations X_C using (9) in line 2. The covariance matrix Σ is estimated in line 3 as (1/m) X_C X_C^T + ε I_d, where the second term represents diagonal loading for numerical stability. Next, in lines 5 through 8, the algorithm estimates the whitening matrix iteratively using

P_0 = I_d,   P_k = (1/2)(3 P_{k−1} − P_{k−1}³ Σ_N),  k = 1, ..., K,  (14)

where P_k is the estimated whitening matrix at the kth iteration and K is the total number of iterations. We note that, to guarantee convergence under a limited batch size, IterNorm uses the trace-normalized covariance matrix, Σ_N, instead of Σ, in (14). This is evident from line 7 of Algorithm 1, with Σ_N calculated in line 4 as Σ/tr(Σ), where tr(·) denotes the trace of its matrix argument. At the end of K iterations, the whitening matrix W is calculated using P_K in line 9, which is finally utilized together with X_C to compute the whitened output, X_W, in line 10.
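A compact NumPy sketch of the IterNorm computation in Algorithm 1 follows; the toy correlated batch, the diagonal-loading factor, and the iteration count are illustrative assumptions.

```python
import numpy as np

def iternorm_whiten(X, n_iter=10, eps=1e-5):
    """Sketch of IterNorm: Newton iterations for the inverse square root."""
    d, m = X.shape
    mu = X.mean(axis=1, keepdims=True)            # line 1: batch mean
    Xc = X - mu                                   # line 2: centering
    Sigma = Xc @ Xc.T / m + eps * np.eye(d)       # line 3: loaded covariance
    Sigma_N = Sigma / np.trace(Sigma)             # line 4: trace normalization
    P = np.eye(d)                                 # line 5: P_0 = I_d
    for _ in range(n_iter):                       # lines 6-8: Newton iteration
        P = 0.5 * (3.0 * P - P @ P @ P @ Sigma_N)
    W = P / np.sqrt(np.trace(Sigma))              # line 9: whitening matrix
    return W @ Xc, W                              # line 10: whitened output

# Toy batch: correlated, non-zero-mean 3-D activations.
rng = np.random.default_rng(1)
A = np.array([[2.0, 0.5, 0.0],
              [0.5, 1.0, 0.3],
              [0.0, 0.3, 1.5]])
X = A @ rng.normal(size=(3, 2000)) + 5.0
Xw, W = iternorm_whiten(X)
```

After whitening, the empirical covariance of `Xw` is (up to the Newton residual and the loading term) the identity matrix.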
Algorithm 1: IterNorm batch whitening.
1: Compute the batch mean: μ = (1/m) X 1_m
2: Center the activations: X_C = X − μ 1_m^T
3: Estimate the covariance matrix: Σ = (1/m) X_C X_C^T + ε I_d
4: Trace-normalize: Σ_N = Σ / tr(Σ)
5: Initialize: P_0 = I_d
6: for k = 1 to K do
7:   P_k = (1/2)(3 P_{k−1} − P_{k−1}³ Σ_N)
8: end for
9: Calculate the whitening matrix: W = P_K / √tr(Σ)
10: Calculate the whitened output: X_W = W X_C

Whitening Method 2: IterNorm with Rotation

While the constraint in (12) does not uniquely specify the whitening matrix W, it does enable rotational freedom. That is,

W_Q = Q^T W  (15)

is also a valid whitening matrix, provided Q is an orthogonal matrix with Q^T Q = Q Q^T = I_d. Equation (15) interprets whitening as a combination of rescaling by W of (13) and rotation by Q [33]. Following the work in [31] and assuming that we are interested in recognizing N_c classes of human activity, we align the ith column q_i of the orthogonal matrix Q with class c_i. That is, after whitening the activations using W obtained via IterNorm, we employ Q to rotate the samples such that the data corresponding to class c_i is maximally activated along q_i. Such a matrix Q can be determined by solving the optimization problem [31]

max_{Q: Q^T Q = I_d}  Σ_{i=1}^{N_c} (1/m_i) q_i^T X_{W,c_i} 1_{m_i},  (16)

where X_{W,c_i} ∈ R^{d×m_i} denotes the activations corresponding to class c_i after whitening with W and m_i is the number of samples for class c_i. The problem in (16) with orthogonality constraint can be solved via gradient-based approaches on the Stiefel manifold [31,34].
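The paper solves (16) with gradient-based optimization on the Stiefel manifold, which suits batchwise training. As an illustration only: for a fixed set of whitened activations with N_c equal to the latent dimension d, the maximizer of (16) has a closed-form orthogonal-Procrustes solution, namely the polar factor of the matrix of per-class means, which the following sketch uses instead.

```python
import numpy as np

def align_rotation(Xw, labels, n_classes):
    """Closed-form maximizer of the alignment objective for fixed activations.

    Builds M = [mu_1 ... mu_Nc] from the per-class means of the whitened
    activations Xw (shape (d, m)) and returns the orthogonal Q maximizing
    sum_i q_i^T mu_i, i.e., the polar factor of M (orthogonal Procrustes).
    """
    d = Xw.shape[0]
    M = np.zeros((d, n_classes))
    for c in range(n_classes):
        M[:, c] = Xw[:, labels == c].mean(axis=1)
    U, _, Vt = np.linalg.svd(M, full_matrices=True)
    return U @ Vt  # orthogonal: Q.T @ Q = I_d

# Toy whitened data with 3 classes in a 3-D latent space.
rng = np.random.default_rng(2)
Xw = rng.normal(size=(3, 300))
labels = np.repeat(np.arange(3), 100)
Q = align_rotation(Xw, labels, 3)
Xr = Q.T @ Xw  # rotated activations
```

Since Q is orthogonal, Q^T W remains a valid whitening matrix, so the rotated activations stay decorrelated while each class is pushed toward its own latent axis.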

Whitening-Aided CNN-Based Activity Classification
Having described the whitening methods, we are now in a position to present the whitening-aided CNN models for human activity recognition.
We consider a base CNN model consisting of a series of building blocks. Each building block comprises a convolutional layer, followed by a max-pooling layer and then a BN layer, as seen in Figure 1a. Each convolutional layer generates feature maps by convolving its input with 2-D filters in a sliding-window fashion and then feeding the filter outputs to an activation function. Considering a convolutional layer with L filters and denoting the input of the convolutional layer by C ∈ R^{h_t×w_t}, we can express the lth convolutional map, O^(l) ∈ R^{h_c×w_c}, corresponding to the lth filter as

O^(l) = σ(f^(l) * C + b^(l)),  (17)

where '*' denotes 2-D convolution, σ is the activation function, b^(l) is the bias term corresponding to the lth map, and f^(l) ∈ R^{h_f×w_f} is the lth 2-D convolutional filter. Next, the max-pooling layer downsamples the feature maps by taking the maximum over an h_p × w_p spatial window for complexity reduction [35]. Finally, the BN layer applies centering and scaling operations to normalize the downsampled feature maps within a batch. We note that the micro-Doppler signature of (7) serves as the input of the first building block, whereas the input of each subsequent block is the output of the previous block.
A whitening-aided CNN model is essentially the same as the base CNN model with the exception that it employs a whitening layer in lieu of BN in its building blocks. We consider two whitening-aided models, namely, whitening-aided models 1 and 2; the former replaces the BN layer with an IterNorm layer, as shown in Figure 1b, whereas the latter employs IterNorm + Rotation in place of BN, as depicted in Figure 1c.
We note that in Section 3, the activations for the BN and whitening methods are assumed to be vectors. However, the output of a convolutional layer comprises a total of L 2-D feature maps. As such, the batch input to any normalization layer in this case would be of size h_d × w_d × L × m, where h_d and w_d indicate the height and width of the downsampled feature maps (output of the max-pooling layer) and m is the number of samples in the batch. Following [27,30,31], we unroll the batch input as X ∈ R^{L×(m h_d w_d)}.
The BN and whitening operations can now proceed with the unrolled X as the batch input.
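The unrolling described above amounts to a transpose followed by a reshape; the sizes below are illustrative.

```python
import numpy as np

# Unroll a batch of L feature maps of size h_d x w_d into the
# (L, m*h_d*w_d) matrix fed to the normalization layer, then roll back.
m, L, h_d, w_d = 10, 32, 8, 8  # assumed batch / feature-map sizes
feat = np.random.default_rng(3).normal(size=(m, L, h_d, w_d))

# Move channels first, then flatten batch and spatial dims together.
X = feat.transpose(1, 0, 2, 3).reshape(L, m * h_d * w_d)

# The inverse operation restores the original tensor exactly.
feat_back = X.reshape(L, m, h_d, w_d).transpose(1, 0, 2, 3)
```

Each row of `X` collects every spatial position of one feature map across the whole batch, which is exactly the per-channel view that BN and whitening operate on.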

Experimental Results
In this section, we evaluate the performance of the whitening-aided CNN models for human activity classification using real data measurements.We compare the classification accuracy of the whitening-aided models with that of the base CNN model.

Experimental Dataset
We employ the human activity dataset collected at the University of Glasgow, UK [36]. This dataset consists of six smaller subsets, out of which we employ the three subsets collected in 2017 in a laboratory environment. The data were collected using an FMCW radar, model SDR-KIT-580B by Ancortek (Fairfax, VA, USA), with a 5.8 GHz carrier frequency, 400 MHz bandwidth, and a chirp duration of 1 ms, delivering an output power of approximately 18 dBm. Two Yagi antennas, each with a gain of about 17 dB, were used for signal transmission and reception. The number of samples per recorded beat-note signal was set to 128. The dataset contains six activity classes: walking, sitting down, standing up, bending to pick up an object, drinking water, and falling. A total of 33 participants served as test subjects, 31 male and 2 female, ranging in height from 149 cm to 188 cm and in age from 22 to 36 years. Each participant repeated each activity two to three times along the radar's line of sight, i.e., measurements were made at normal incidence. The spectrograms were computed using a Hanning window of length 256 with 2048 frequency points and a 254-point overlap, i.e., h = 2 in (7). The resulting micro-Doppler signatures were then cropped, downscaled, and converted to grayscale images with dimensions of 75 × 75 and pixel values ranging from 0 to 255. The dataset contains a total of 570 micro-Doppler signatures, with 95 signatures per class. Representative signatures of each of the six activities are shown in Figure 2; the horizontal axis represents time while the vertical axis is Doppler frequency.

CNN Models and Training
For illustration, we employ the learning architecture depicted in Figure 3, where the input to the network is a micro-Doppler signature of size 75 × 75. The network output is a one-hot encoded length-6 vector such that the location of a '1' indicates a specific human activity. The input is passed through a 3-layer CNN implementing 32, 64, and 128 filters, respectively, each of kernel size 3 × 3. A max-pooling layer with a stride of 3 follows each convolutional layer. A normalization layer is the last module in each building block. A dropout layer (not shown in Figure 3) with a 15% rate is also included before the fully-connected output layer. The ReLU activation function is used for all layers except the output layer, which uses a softmax function. Three different variants of this learning architecture are considered, differing in terms of the employed normalization method, as detailed in Figure 1. Specifically, these include the base model with BN layers, whitening-aided model 1 with IterNorm layers, and whitening-aided model 2 with IterNorm + Rotation layers. We utilize cross-entropy as the loss function for activity classification. To optimize the model, we apply stochastic gradient descent with a batch size of 10. We used an adaptive learning rate with an appropriate initial value for each CNN model, decreased by a factor of 10 after every seven epochs. A maximum of 30 epochs is used for training the base model and whitening-aided model 1, with the number of iterations for IterNorm set to 5. For whitening-aided model 2, we perform a warm start with the pretrained whitening-aided model 1, to which we add the rotation modules and continue the training for five additional epochs.
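The step schedule described above (divide the learning rate by 10 after every seven epochs) can be expressed as a small helper; the initial rate of 0.01 is a placeholder, since the paper uses a model-dependent initial value.

```python
def step_lr(initial_lr, epoch, drop=10.0, every=7):
    """Step learning-rate schedule: divide the rate by `drop`
    after every `every` epochs."""
    return initial_lr / drop ** (epoch // every)

# Learning rate for each of the 30 training epochs (initial value assumed).
schedule = [step_lr(0.01, e) for e in range(30)]
```

Epochs 0-6 use the initial rate, epochs 7-13 use one tenth of it, and so on.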

Classification Accuracy
We first examine the classification accuracy of the proposed whitening-aided models as a function of the number of training samples per class. We let the number of training samples vary from 20% to 80% of the dataset in increments of 30%. The remaining signatures in each instance are utilized for testing. We conduct 30 classification experiments over distinct training and testing datasets for each considered split using the base CNN model and its whitening-aided counterparts. We calculate the mean and standard deviation of the test data classification accuracy for all three classifiers. The results are provided in Table 1. We clearly observe that, for each training/testing split, both whitening-aided models significantly outperform the base model, especially under limited training samples. This is attributed to the reduced model confusion amongst the six classes resulting from the whitening of the latent space. The addition of the rotation module in whitening-aided model 2 to maximize the class activations along the latent space axes provides an additional 1.5% to 2% increase in average accuracy and relatively lower standard deviation values over whitening-aided model 1. This attests to the further class disentanglement brought about by constraining the latent space to represent the classes. For further illustration of the impact of whitening, we compute the confusion matrices, averaged over 30 trials, corresponding to the base and the whitening-aided models for the 50%-50% training/testing data split. These confusion matrices, depicted in Figure 4, clearly demonstrate that the addition of the whitening layers causes a reduction in the model confusion for all six classes, with whitening-aided model 2 providing slightly higher reductions as compared to whitening-aided model 1. Next, we consider the 50%-50% training/testing data split and investigate the impact of whitening on the classification performance when introduced as a replacement for a single BN layer in the base model, leaving the remaining
two BN layers intact. The corresponding average value and standard deviation of the classification accuracy are provided in Table 2, with the values corresponding to the base model under the column labeled "Base Model" and those corresponding to whitening methods 1 and 2 replacing BN in the first, second, and third layers of the network in the respective columns labeled "Layer 1", "Layer 2", and "Layer 3". We observe that, compared to the base model, even replacing one BN layer with either whitening module yields performance enhancements, with progressively higher improvements for the introduction of the whitening layer at increasing depth of the network. Again, whitening method 2 provides higher accuracy on average and lower standard deviation as compared to whitening method 1. Comparing the results in Table 1 for the 50%-50% training/testing data split and Table 2, we see that while replacing all BN layers with whitening layers yields the best performance, there is considerable value in replacing even a single BN layer with a whitening layer, especially deeper in the network and more so for whitening method 2 than method 1.

Correlation Coefficients
To visually highlight the decorrelation aspect of the whitening layers, we consider the 50%-50% training/testing data split and measure the output of the normalization modules for the test set in each layer of the base model, whitening-aided model 1, and whitening-aided model 2 after training. We then calculate the absolute value of the correlation coefficient of every feature pair in each layer of the respective models. As depicted in the top row of Figure 5, the base model with all BN layers exhibits relatively strong correlations. This is expected since BN only standardizes the activations and does not decorrelate them. On the other hand, when all BN layers are replaced by either IterNorm layers or IterNorm + Rotation layers, the features in every layer indeed become decorrelated, as seen in the middle and bottom rows of Figure 5, thereby leading to improved classification performance.
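The analysis described above reduces to computing the absolute correlation-coefficient matrix of the feature activations; a toy sketch with synthetic decorrelated and mixed features follows.

```python
import numpy as np

def abs_corr(X):
    """Absolute correlation coefficient of every feature pair.

    X : activations of shape (d, n) -- d features, n samples.
    """
    return np.abs(np.corrcoef(X))

rng = np.random.default_rng(4)
Z = rng.normal(size=(5, 1000))          # independent (decorrelated) features
C_white = abs_corr(Z)
A = np.array([[1.0, 0.0], [0.9, 0.1]])  # mixing induces strong correlation
C_mixed = abs_corr(A @ Z[:2])
```

`C_white` is close to the identity (off-diagonals near zero, as for whitened features), while `C_mixed` shows the large off-diagonal value that BN alone would leave in place.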

Top Activated Signatures
An important characteristic of whitening method 2 is its alignment of the axes of the latent space with the activity classes, which has been shown to enable an understanding of the learning process across the layers [31]. To this end, in this example, we assess the relationship between the test samples and a class label in the latent space for a trained whitening-aided model 2 with a 50%-50% training/testing data split. We calculate the activation values of the test samples on each axis for each label and identify the top activated signature for each class in each layer, depicted in Figure 6. We observe that in the third layer, the top activated signatures correspond to the correct class labels. However, in the first layer, as the convolutional layers capture low-level information, the alignment is not as accurate as in the higher layers. We also determine the empirical receptive fields of the top activated signatures by identifying those locations in each signature which, when masked, cause the largest reduction in the activation values on different latent space axes [31]. For this purpose, we apply 32 × 32 random masking patches with a stride of 5 on the top activated images. The corresponding results are shown as highlighted regions in Figure 6. Clearly, in the first layer, the extracted features appear to be related to the background, while by the third layer, the learned features are predominantly from the main pattern of the micro-Doppler signature. For example, the "Walking" axis in the third layer focuses on sinusoidal segments of the signature, while the "Falling" axis converges on the waterfall shape of the corresponding micro-Doppler signature.
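A sketch of the masking probe follows; for reproducibility it slides the patch deterministically rather than using random patch positions as in the paper, and a toy sum-of-pixels activation stands in for a latent-axis activation.

```python
import numpy as np

def occlusion_map(image, activation_fn, patch=32, stride=5):
    """Slide a zero patch over the image and record the drop in a scalar
    activation at each patch position; return the largest drop and its
    top-left corner (a deterministic variant of the random masking probe)."""
    h, w = image.shape
    base = activation_fn(image)
    drops = []
    for i in range(0, h - patch + 1, stride):
        for j in range(0, w - patch + 1, stride):
            masked = image.copy()
            masked[i:i + patch, j:j + patch] = 0.0
            drops.append((base - activation_fn(masked), i, j))
    return max(drops)

# Toy 75 x 75 "signature" with all of its energy near the top-left corner.
img = np.zeros((75, 75))
img[5:20, 5:20] = 1.0
drop, i, j = occlusion_map(img, activation_fn=lambda x: x.sum())
```

The patch position with the largest activation drop marks the empirical receptive field; here it is the patch covering the energetic corner region.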

Performance with Unseen Testing Data
In this final example, we examine the performance of the whitening-aided models under unseen testing data. Specifically, we retrain the networks using micro-Doppler signatures of 27 out of the 33 human subjects (77 samples per class). The signatures of the remaining six subjects (18 samples per class), which were excluded from the training data, are used for testing. This is roughly equivalent to an 80%/20% training/testing data split. The respective classification accuracy values of the base model, whitening-aided model 1, and whitening-aided model 2 are 85.18%, 89.81%, and 92.59%. We note that the accuracy of each model is lower than the corresponding average value reported in Table 1 for the 80%/20% data split. However, even in this case of unseen data, the superiority of the whitening-aided models over the base model is clearly evident, with whitening-aided model 2 outperforming whitening-aided model 1 as in the previous examples.

Summary of Findings
The above examples clearly demonstrate the superior performance of the whitening-aided CNN models over the base CNN model for human activity classification. The performance enhancements exist irrespective of whether testing is performed with unseen data or with samples from subjects the models have seen before during training. This superiority is attributed to the ability of the whitening layers to not only standardize but, more importantly, decorrelate the activations, and, in the case of whitening method 2, also to the alignment of the latent space axes with the activity classes. Further, while the results suggest replacing all BN layers in a CNN model with whitening layers to exploit their benefits to the fullest, considerable performance enhancements over the base model can be realized by using a whitening layer in lieu of even a single BN layer, with the level of improvement increasing with the depth at which this replacement occurs in the network. Furthermore, performance evaluation of the two whitening methods showed that the addition of the specific rotation module to IterNorm, which maximizes the activation of the classes along the latent space axes, provides model 2 with an appreciable advantage over model 1 in terms of classification accuracy, albeit at the additional expense of implementing the rotation module.

Conclusions
In this paper, we have presented CNN-based learning models that utilize whitening of the hidden layers' activations for enhanced human activity recognition using radar. We employed the IterNorm technique, based on Newton's method, to significantly reduce the computational burden associated with the traditional eigen-decomposition approach for computing the whitening matrix. A rotation of the whitened activations to align the latent space axes with the corresponding class labels was also utilized. Results using real radar measurements of six different human activities were provided, which validated the superior performance of the whitening-aided CNN models over the base CNN model in terms of classification accuracy. We also showed that the introduction of the specific rotation module can lead to appreciable improvements in classification accuracy over the IterNorm-only layer. These findings demonstrate the potential of whitening-aided CNN models in enhancing the accuracy of human activity recognition using radar micro-Doppler signatures.

Figure 1 .
Figure 1. Building blocks of the various CNN models, differing in terms of the employed normalization layer. (a) Base CNN model employing a BatchNorm layer to center and normalize the input. (b) Whitening-aided model 1 using an IterNorm whitening module to decorrelate the centered input. (c) Whitening-aided model 2, which adds a rotation module after the IterNorm module to maximize the class activations along the axes of the whitened latent space.

Figure 2 .
Figure 2. Micro-Doppler signatures of six human activities. Top row (from left to right): Drinking water, falling, and bending to pick up an object. Bottom row (from left to right): Sitting down, standing up, and walking. The horizontal axis denotes time whereas the vertical axis denotes Doppler frequency.

Figure 5 .
Figure 5. Absolute value of the correlation coefficient of every feature pair in the first, second, and third layers for the base CNN model (top row), whitening-aided model 1 (middle row), and whitening-aided model 2 (bottom row).

Figure 6 .
Figure 6. Most activated signatures on each axis in different layers for a trained whitening-aided model 2 under a 50%/50% training/testing data split. For each image, the empirical receptive field is also highlighted.

Table 1 .
Classification accuracy of the 3-layer CNN model with and without whitening in all layers for different training/testing splits.

Table 2 .
Classification accuracy of the 3-layer CNN model with and without whitening replacing BN in a single layer.