A Survey of Deep Learning-Based Human Activity Recognition in Radar

Abstract: Radar, as one of the sensors for human activity recognition (HAR), has unique characteristics such as privacy protection and contactless sensing. Radar-based HAR has been applied in many fields such as human–computer interaction, smart surveillance and health assessment. Conventional machine learning approaches rely on heuristic hand-crafted feature extraction, and their generalization capability is limited. Additionally, extracting features manually is time-consuming and inefficient. Deep learning acts as a hierarchical approach to learn high-level features automatically and has achieved superior performance for HAR. This paper surveys deep learning-based HAR in radar from three aspects: deep learning techniques, radar systems, and deep learning for radar-based HAR. In particular, we elaborate on deep learning approaches designed for activity recognition in radar according to the dimension of radar returns (i.e., 1D, 2D and 3D echoes). Because the echo forms differ, different deep learning approaches are needed to fully exploit the motion information in each. Experimental results have demonstrated the feasibility of applying deep learning for radar-based HAR in 1D, 2D and 3D echoes. Finally, we address some current research considerations and future opportunities.


Introduction
Research on human activity recognition (HAR) has made significant progress over the past decade. Successful HAR applications include surveillance [1], smart home [2], video analytics [3], autopilot [4] and human-computer interaction [5]. The purpose of HAR is to identify a user's behavior so as to allow computing systems to proactively provide assistance for the user [6]. There are two main categories of HAR [7]: vision-based and sensor-based. Taking advantage of the high resolution of optical sensors and the rapidly evolving computer vision (CV) techniques, vision-based HAR has yielded fruitful results [8][9][10][11][12]. Despite the strengths of vision-based HAR, many issues remain open, such as sensitivity to illumination, occlusion and privacy leakage [13][14][15]. With the rapid development of sensor technology, sensor-based methods have set off a new wave in HAR [16][17][18][19]. Sensor-based activity recognition acquires data from accelerometers, gyroscopes, radars, acoustic sensors and so on, and infers high-level information about human behavior from large numbers of low-level sensor readings.
As one of the sensor-based methods, radar-based HAR [20][21][22][23] has drawn much attention for the following reasons. Firstly, radar is robust to light and weather conditions, so it can be applied in harsh environments. Secondly, radar protects visual privacy. Instead of capturing the visual shape of the target, the returned signals modulated by the target carry abundant time-varying range and velocity information of activities [24]. Thirdly, radar is able to detect humans through walls, which makes radar-based HAR applicable to more scenarios. Lastly, radar systems do not need any tag attached to the human body, which makes them more user-friendly. Consequently, radar has increasingly been adopted for recognizing human activities.
In the past, radar-based HAR systems often adopted conventional machine learning (ML) techniques [25][26][27][28][29]. These traditional algorithms are built on theoretical foundations, so they are explicable and can be optimized theoretically. Compared with deep learning models, their complexity is often lower, so the computational burden is lighter. Support vector machine (SVM) [30], dynamic time warping (DTW) [31] and random forest classifiers [27] are the most used conventional ML algorithms for radar-based HAR. Y. Kim et al. [30] used SVM to recognize human activities based on micro-Doppler signatures. Features were extracted manually from the time-Doppler spectrograms, as illustrated in Figure 1. By employing a decision-tree structure, the system, which consists of six SVMs, succeeded in classifying 12 different human activities. In [27], a gesture recognition system using a 60 GHz mm-wave radar was built. A random forest classifier was employed in this system to perform real-time gesture recognition. In [31], an improved DTW algorithm was proposed for hand gesture recognition with a terahertz radar. Experiments showed that the improved DTW algorithm was capable of fully exploring the properties of range profiles and Doppler signatures. Although widely used for radar-based activity recognition, traditional ML solutions have several drawbacks that hinder further improvement of robustness and generalization. Firstly, features are extracted heuristically and manually, which relies heavily on human experience and domain knowledge. Secondly, hand-crafted features often capture only low-level statistical information such as mean, variance, frequency and amplitude [32], which is task-specific. When a model trained with shallow hand-crafted features is applied to a new dataset, the performance is usually worse than on the original dataset. Thirdly, traditional ML methods mainly learn from small-scale static data. However, in the real world, activity data arrive as a stream and change over time. Conventional ML approaches are not well suited to training a robust model in this circumstance.
Deep learning (DL), a rapidly evolving technology, tends to break through these restrictions. As a new branch of ML, DL has emerged as a powerful tool in the past few years. DL approaches extract high-level deep features automatically through hierarchical architectures. Manual feature extraction using specialized knowledge is not required. Furthermore, with the advent of GPUs, DL is capable of fast computation on huge amounts of data. DL algorithms can take full advantage of parallel computing to achieve fast processing. With its excellent ability to learn features from massive data, DL has not only promoted the development of visual object recognition, speech processing, natural language processing, etc. [33], but also made HAR more intelligent and versatile.
To the best of our knowledge, there is currently no survey that addresses the progress of DL-based HAR in radar, and this is the first article to present the recent advances in this area. We hope it provides a comprehensive summary and inspires relevant future research.
The remainder of the paper is organized as follows: In Section 2, we briefly overview DL techniques. Section 3 reviews examples of radar systems adopted to recognize human activities. Section 4 presents DL approaches for radar-based HAR in detail. We divide the existing literature into three parts according to the dimension of radar echoes, and then discuss the DL techniques applied to each part. Future research considerations and directions are discussed in Section 5. Finally, the paper is concluded in Section 6.

Deep Learning Techniques
With the emergence of deep learning algorithms, the development of many fields such as speech recognition, visual object recognition and even drug discovery has been accelerated [34]. A deep learning model has multiple processing layers to learn high-level representations automatically. Heavy feature engineering and domain knowledge are not required for DL. Furthermore, with so many deep transformations, very complex functions can be learned and difficult classification and recognition problems can be solved [35,36]. As a result, deep learning has advanced the development of many fields, including HAR. In this section, we investigate several deep learning models and analyze their unique advantages for HAR tasks. Table 1 describes all the models in brief.

Convolutional Neural Network
The convolutional neural network (CNN) is inspired by the structure of the visual cortex, which is composed of simple cells and complex cells. It adopts four key ideas: local connections, parameter sharing, pooling and multiple layers. In a CNN, the convolution operation replaces the general matrix multiplication of an ordinary neural network. In this way, the complexity of the network is reduced because the number of weights decreases. It should be noted that CNN is the first DL architecture with hierarchical layers [37]. Multiple convolutional layers enable a CNN to extract higher-level spatial features from lower-level ones hierarchically, avoiding the manual feature extraction procedure of conventional ML algorithms. After the convolutional layers, pooling and fully connected layers are usually employed for classification or regression tasks. Additionally, thanks to its excellent deep feature learning ability, CNN is often employed as an automatic feature extractor for a variety of tasks [38,39].
When utilized for HAR, CNN has two main advantages [32]: it takes nearby signals into consideration, and it is scale-invariant to different paces or frequencies. The first advantage allows CNN to extract localized features from positions that are spatially related, rather than from a single position. The second advantage allows pace or frequency information to be retained in the extracted features. In [23,40–49], CNNs with different architectures and convolution kernels of various sizes were employed as classifiers to recognize human activities with time-Doppler maps, as illustrated in Figure 2a. In [28,50–52], CNN acts as a spatial feature extractor and extracts high-level representations of human activities for further identification, as illustrated in Figure 2b.
Figure 2. (a) CNN employed as a classifier on time-Doppler maps. Adapted from [41]. (b) CNN adopted as a feature extractor to learn high-level features from the input time-range maps. Adapted from [28].
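To make the weight-reduction argument for local connections and parameter sharing concrete, the following is a minimal NumPy sketch, not an implementation from any surveyed work. The 64 x 64 input (a stand-in for a time-Doppler map), the 3 x 3 kernel and the function names are assumptions chosen only for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D cross-correlation: one shared kernel slides over the image."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Local connection: each output sees only a kh x kw neighbourhood.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(x, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a block are dropped."""
    h = (x.shape[0] // size) * size
    w = (x.shape[1] // size) * size
    blocks = x[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Hypothetical sizes, chosen only for illustration.
H, W, K = 64, 64, 3
rng = np.random.default_rng(0)
image = rng.standard_normal((H, W))    # stand-in for a time-Doppler map
kernel = rng.standard_normal((K, K))

feature_map = conv2d_valid(image, kernel)
pooled = max_pool2d(feature_map)

# Parameter sharing: the same K*K weights are reused at every position,
# whereas a fully connected layer needs one weight per input-output pair.
conv_weights = K * K
dense_weights = (H * W) * feature_map.size
```

With these assumed sizes the convolutional layer uses 9 weights, while a fully connected layer producing the same number of outputs would need more than 15 million.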

Recurrent Neural Network
Following its successful application in NLP and speech recognition, the recurrent neural network (RNN) has caught researchers' attention in HAR. RNN is well suited to modeling temporal sequences because of its ability to mine temporal and semantic information. From the perspective of network structure, RNN remembers previous information and uses it to influence the output of the following nodes. However, conventional RNN has its own limitation: it struggles to learn long-term dependencies. To overcome this shortcoming, long short-term memory (LSTM) (see Figure 3a) was introduced [53] and performs better in many tasks. LSTM has three special gates: the input gate, the output gate and the forget gate. By using these memory units, especially the forget gate, LSTM is able to access a long-range context of sequential data. Compared with CNN, which can only process data of a predefined size, the prediction of RNN and its variants is assumed to increase in accuracy as more data become available. The prediction result changes with time. Consequently, RNN is more sensitive to changes in the input data than CNN.
For HAR, RNN and its variants have the advantage of exploiting temporal correlations within an activity, which is crucial for recognizing human activities. Refs. [28,50,51,54] all utilized RNN and its variants to model the temporal characteristics of human activities. In [50], the features extracted from range-Doppler maps by a CNN are time-correlated, so LSTM was utilized to learn complex dependencies across time in those features. In this way, both spatial and temporal information was explored through the cooperation of CNN and LSTM.
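The gating mechanism described above can be sketched as a single LSTM time step in NumPy. This is an illustrative sketch of the standard LSTM equations, not the architecture of any surveyed system; the dimensions, the random weights and the gate ordering inside the stacked parameter matrices are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4*hidden, input), U: (4*hidden, hidden), b: (4*hidden,).
    Assumed stacking order: input gate, forget gate, output gate, candidate.
    """
    hidden = h_prev.shape[0]
    z = W @ x_t + U @ h_prev + b
    i = sigmoid(z[0:hidden])                  # input gate
    f = sigmoid(z[hidden:2 * hidden])         # forget gate
    o = sigmoid(z[2 * hidden:3 * hidden])     # output gate
    g = np.tanh(z[3 * hidden:4 * hidden])     # candidate cell content
    c_t = f * c_prev + i * g                  # forget old memory, write new memory
    h_t = o * np.tanh(c_t)                    # expose part of the memory
    return h_t, c_t

def run_lstm(sequence, W, U, b, hidden):
    """Process a sequence of any length and return the final hidden state."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x_t in sequence:
        h, c = lstm_step(x_t, h, c, W, U, b)
    return h

# Hypothetical sizes: 20 time steps of 8-dimensional features (e.g., CNN outputs).
rng = np.random.default_rng(0)
input_dim, hidden_dim, steps = 8, 4, 20
W = 0.1 * rng.standard_normal((4 * hidden_dim, input_dim))
U = 0.1 * rng.standard_normal((4 * hidden_dim, hidden_dim))
b = np.zeros(4 * hidden_dim)
sequence = rng.standard_normal((steps, input_dim))

final_state = run_lstm(sequence, W, U, b, hidden_dim)
```

Because the loop simply iterates over the time axis, the same parameters handle sequences of different lengths, which is the property contrasted with the fixed input size of a CNN.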

Auto-Encoder
Auto-encoder (AE) (see Figure 3b) is a feed-forward neural network that aims to reconstruct the input under certain constraints [52]. It learns deep feature representations of unlabeled input via several rounds of encoding-decoding procedures in an unsupervised fashion. Especially when the input data are highly similar, AE is able to discover nuances in the data itself through layer-wise unsupervised pre-training. Furthermore, unsupervised pre-training tends to function as a regularizer, which potentially prevents the network from overfitting [55].
The commonly used variants of AE in HAR are the following: (a) the stacked auto-encoder (SAE), which stacks multiple sparse AEs together to acquire a more compact feature coding; (b) the convolutional auto-encoder (CAE), which essentially combines CNN and AE, with the encoding-decoding procedures accomplished by convolution and deconvolution; (c) the de-noising AE and the contractive AE, which make the models more generic by adding noise to the input or adding a well-chosen penalty term to the loss function.
In [52], CAE was employed for unsupervised pre-training, as illustrated in Figure 4. Then the decoding part of the network was removed, and fully connected layers, as well as a softmax classifier, were added to the encoder for classification. Taking advantage of unsupervised pre-training and localized feature learning, CAE outperformed CNN and plain AE in identifying human activities. In [56], stacked AEs were applied to obtain the most prominent features in radar echoes, and then softmax classifiers were utilized for recognizing human motions. A similar approach was also adopted in [57][58][59].

Hybrid Deep Model
Every model has its own disadvantages and is not suited to all tasks. Hybrid deep models integrate several networks and exploit the strengths of each to obtain better performance. So far, in HAR, CNN and RNN are commonly combined as shown in Figure 2b, because they are good at abstracting features from different domains: CNN captures spatial relationships while RNN captures temporal relationships [32].
Refs. [28,50,51] provide good examples of how to combine CNN and RNN. These works have demonstrated that combining CNN and RNN tends to improve the recognition of activities that vary in time and space. In addition, AE is often combined with CNN or RNN owing to its ability to extract high-dimensional features in an unsupervised manner [52].

Radar System for Human Activity Recognition
Radar is an active sensing system that transmits radio waves and receives returned signals modulated by illuminated objects. Over the past few decades, it has mostly been used in remote sensing systems such as satellite remote sensing, air and terrestrial traffic control and geophysical monitoring [60]. Moreover, there has been a recent expansion of short-range radar for HAR tasks.
Radar-based HAR approaches are more robust than vision-based ones because of radar's insensitivity to light and weather conditions. They can detect human presence and activities directly without any tag attached to the human body. When a person is moving, the speeds and Doppler frequencies of the body parts vary with time according to the person's movement. Consequently, the ranges of these parts are not linear with respect to time. The range, speed and angle information of the targets that radar obtains can be utilized to recognize human activities [61].
Due to its intrinsic advantages such as simple architecture, easy system integration, relatively low cost, and penetration capability, radar is feasible as a human motion measurement technology [62]. A series of radars are used for HAR, such as continuous-wave radar, ultra-wideband radar and noise radar. While there are many advanced studies of noise radar HAR systems [63][64][65][66][67][68], they do not involve ML techniques and are thus out of the scope of this paper. Next, we introduce several kinds of radar designed for HAR purposes. Table 2 briefly outlines those radars and their basic characteristics.

Continuous-Wave (CW) Radar
CW radar transmits a known stable-frequency CW radio signal and receives the reflected signal that is modulated by objects on the radio signal path [60]. It is able to operate in either modulated or unmodulated mode. CW radar has a simple architecture with easy system integration and low power consumption, which makes it attractive for mobile and portable applications. Various commercial CW radar chips and systems are available for HAR applications [61], such as 77 GHz AWR1642 and AWR1443 of Texas Instruments (Dallas, TX, USA), 77 GHz TEF8181EN and TEF8102EN of NXP (Eindhoven, The Netherlands) and 24 GHz BGT24MTR11 of Infineon (Neubiberg, Germany). Typical CW radar systems for HAR are Doppler radar, frequency-modulated CW radar and interferometry radar.

A. Doppler Radar
Doppler radar, as shown in Figure 5a, is one of the most popular radars in HAR [20,41–43,45,52,69]. A Doppler radar sends out single-tone radio waves, and no modulation is involved. When the target is moving, the frequency of the received signals is shifted away from that of the transmitted ones because of the Doppler effect. The frequency $f_r$ of the backscattered signals is as follows [62]:
$$f_r = f_t \, \frac{1 + v/c}{1 - v/c},$$
where $f_t$ is the frequency of the single-tone radio signal sent by the Doppler radar, $c$ is the speed of light, and $v$ is the radial speed of the target. The Doppler frequency shift $f_d$ is thus
$$f_d = f_r - f_t = \frac{2 v f_t}{c - v} \approx \frac{2 v f_t}{c} \quad (v \ll c).$$
Doppler radar is used to detect the time-varying radial speeds of human motion due to its ability to capture Doppler shifts. Owing to the relatively simple signal processing, Doppler radar is capable of achieving appreciable performance in motion and displacement measurement [70,71].
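As a worked example of the Doppler relation, the sketch below evaluates the shift for a slowly moving target, assuming the standard monostatic low-speed approximation $f_d \approx 2 v f_t / c$ and the exact form $f_r = f_t (1 + v/c)/(1 - v/c)$. The 24 GHz carrier and the 1.5 m/s walking speed are assumed values, and the function names are hypothetical.

```python
C = 299_792_458.0  # speed of light in m/s

def received_frequency(f_t, v):
    """Exact backscattered frequency; v > 0 means the target approaches the radar."""
    return f_t * (1.0 + v / C) / (1.0 - v / C)

def doppler_shift(f_t, v):
    """Low-speed approximation of the Doppler shift, valid for v << c."""
    return 2.0 * v * f_t / C

def radial_speed(f_d, f_t):
    """Invert the approximation to recover radial speed from a measured shift."""
    return f_d * C / (2.0 * f_t)

# Assumed example values: a 24 GHz carrier and a torso moving at 1.5 m/s.
carrier_hz = 24e9
speed_mps = 1.5
shift_hz = doppler_shift(carrier_hz, speed_mps)   # roughly 240 Hz
```

At human walking speeds the shift is only a few hundred hertz on a 24 GHz carrier, which illustrates why the exact and approximate expressions are interchangeable in practice for HAR.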

B. Frequency-Modulated Continuous-Wave Radar
Frequency-modulated continuous-wave (FMCW) radar, as shown in Figure 5b, is able to sense the range and Doppler properties of targets simultaneously. When reflections from more than one source reach the radar antennas at the same time, both range and Doppler information are indispensable. Consequently, FMCW radar is widely employed in various short-range scenarios [28,50,51,57,73,74], especially scenarios with multiple targets [75,76].
In an FMCW radar system, a continuous wave whose frequency is swept up and down over a fixed period of time is transmitted, following a modulation pattern such as a sine wave or a sawtooth wave [77]. As illustrated in Figure 6, the backscattered echoes are mixed with the transmitted signals to produce beat signals. By demodulating the beat signals and calculating the frequency delay of the received signals, range information can be extracted [73]. The range resolution of an FMCW radar, which refers to the minimum separation in range of two objects of equal cross section, is proportional to $c/(2B)$, where $B$ is the modulation bandwidth. So, the larger the bandwidth, the finer the range resolution. The Doppler information is obtained in the same way as in unmodulated CW radar.
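The relation between bandwidth, beat frequency and range can be illustrated numerically. The sketch below assumes an ideal linear (sawtooth) chirp and a stationary target, so that the beat frequency equals the chirp slope times the round-trip delay; the Doppler contribution to the beat frequency is ignored. The 4 GHz bandwidth, the 40 microsecond chirp duration and the function names are assumptions for illustration only.

```python
C = 299_792_458.0  # speed of light in m/s

def range_resolution(bandwidth_hz):
    """Theoretical range resolution c / (2B) of a swept-frequency radar."""
    return C / (2.0 * bandwidth_hz)

def beat_frequency(target_range_m, bandwidth_hz, chirp_duration_s):
    """Beat frequency of a stationary target for a linear chirp."""
    slope = bandwidth_hz / chirp_duration_s      # Hz per second
    round_trip_delay = 2.0 * target_range_m / C  # seconds
    return slope * round_trip_delay

def range_from_beat(f_beat_hz, bandwidth_hz, chirp_duration_s):
    """Invert the beat-frequency relation to recover the target range."""
    slope = bandwidth_hz / chirp_duration_s
    return C * f_beat_hz / (2.0 * slope)

# Assumed example parameters.
B = 4e9       # modulation bandwidth in Hz
T = 40e-6     # chirp duration in s
resolution_m = range_resolution(B)               # about 3.7 cm
f_b = beat_frequency(2.0, B, T)                  # target at 2 m
recovered_range_m = range_from_beat(f_b, B, T)
```

Doubling the assumed bandwidth halves `resolution_m`, which is the sense in which a larger bandwidth yields a finer range resolution.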

C. Interferometry Radar
One disadvantage of Doppler radar is that the frequency shift depends strongly on radial velocity. As a result, it is hard to recognize noncooperative activities performed along the tangential direction. In this circumstance, interferometry radar is more helpful. It utilizes an interferometric receiver composed of two antennas, and the outputs of the two antennas are cross-correlated [78]. When a target moves under the interferometric mode of radar, a signal whose frequency is proportional to the angular velocity of the target is produced. As a consequence, interferometry radar produces micro-Doppler signatures regardless of the moving direction of a person.
Interferometry radar has been applied in many fields, such as engineering metrology, remote sensing and small-displacement measurement [79]. In HAR, interferometry radar is also adopted thanks to its ability to acquire tangential motion information [80,81]. In addition, the interferometry mode is often combined with the FMCW mode for precise indoor positioning, versatile life activity monitoring and vital sign tracking. In [82], time-Doppler maps of a walking person acquired from an interferometric radar and a Doppler radar were compared. The two time-Doppler maps look similar and both contain micro-Doppler features, which means that the classification algorithms developed for Doppler radar can be applied to interferometric radar in a straightforward manner.

Ultra-Wide Band Radar
Ultra-wideband (UWB) radar, in which the fractional bandwidth of the transmitted signals is greater than 25%, is another type of radar that is often utilized for human detection and activity recognition [22,44,46,83–86]. The fractional bandwidth of UWB radar is defined as
$$\frac{2\,(f_H - f_L)}{f_H + f_L},$$
where $f_H$ refers to the upper bound of the frequency band and $f_L$ refers to the lower bound. UWB radar transmits pulses with very short durations, in the nanosecond range or even less. Due to the wide bandwidth, UWB radar offers interference immunity, penetrability, fine range resolution and short-range detection. Thus, UWB radar is able to distinguish the major scattering centers of the target and identify short-range human activities [87,88]. Although range resolution and Doppler resolution are in tension, UWB radar is able to acquire the Doppler information of each scattering center of the human body by trading off the two. Additionally, it has low power consumption, which makes it more applicable to portable HAR devices with limited computational capabilities. The transmitted pulse signal of UWB radar has a certain bandwidth and theoretically has the ability to support multi-target activity recognition. However, there is no related work at present, mainly because of the high complexity of the algorithms for separating targets and identifying individual activities.
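The fractional-bandwidth criterion can be checked with a few lines of code. The sketch below assumes the common definition of fractional bandwidth as the absolute bandwidth divided by the center frequency, i.e., $2(f_H - f_L)/(f_H + f_L)$, and uses the 25% threshold stated above. The two example bands and the function names are chosen only for illustration.

```python
def fractional_bandwidth(f_low_hz, f_high_hz):
    """Bandwidth divided by center frequency: 2 (f_H - f_L) / (f_H + f_L)."""
    if not 0.0 < f_low_hz < f_high_hz:
        raise ValueError("expected 0 < f_low_hz < f_high_hz")
    return 2.0 * (f_high_hz - f_low_hz) / (f_high_hz + f_low_hz)

def is_uwb(f_low_hz, f_high_hz, threshold=0.25):
    """Apply the 25% fractional-bandwidth criterion used in the text."""
    return fractional_bandwidth(f_low_hz, f_high_hz) > threshold

# Illustrative bands (assumed values, not taken from the surveyed systems).
wide_band = fractional_bandwidth(3.1e9, 10.6e9)       # about 1.09, i.e., 109%
narrow_band = fractional_bandwidth(24.0e9, 24.25e9)   # about 0.01, i.e., 1%
```

Under this definition a 250 MHz band around 24 GHz is far from ultra-wideband, even though its absolute bandwidth is large by communication standards.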

Deep Learning Approaches for Human Activity Recognition in Radar
In Section 2, we discussed several common DL models and their advantages for HAR. Since radar echoes contain time, range and Doppler information, it is desirable that the DL algorithms be designed specifically for radar echoes. Motivated by this, in this section we describe deep learning approaches for human activity recognition in radar according to the dimension of radar returns. Table 3 lists all the surveyed work in this section.
Radar signals are transformed into a 3D time-range-Doppler data cube by range-Doppler (RD) processing [92], which uses the Doppler effect to determine the radial component of the target's velocity. In this way, multiple components of a target are resolved not only in range but also in Doppler. The 3D RD 'video' describes the slow-time evolution of the target's activity, as shown in Figure 7a. Radar signals can also be represented in 2D, namely as a time-Doppler map (Figure 7b), a time-range map (Figure 7c) or a range-Doppler map (Figure 7d). In order to make full use of the information in the echoes, deep learning methods should be designed carefully for the different forms of echoes.
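The construction of the time-range-Doppler cube can be sketched with two FFTs per frame. The example below is a simplified illustration, assuming each frame is a matrix of complex beat samples (chirps by samples per chirp) and an ideal point target whose range and Doppler fall exactly on FFT bins. The simulation, the frame sizes and the choice of collapsing an axis by summation to obtain 2D maps are assumptions, not the processing chain of [92].

```python
import numpy as np

def range_doppler_map(frame):
    """frame: complex array of shape (num_chirps, samples_per_chirp).

    Returns a magnitude map of the same shape:
    axis 0 = Doppler bins (zero Doppler centred), axis 1 = range bins.
    """
    range_fft = np.fft.fft(frame, axis=1)        # fast time -> range
    doppler_fft = np.fft.fft(range_fft, axis=0)  # slow time -> Doppler
    return np.abs(np.fft.fftshift(doppler_fft, axes=0))

def simulate_frame(range_bin, doppler_bin, num_chirps=32, num_samples=64):
    """Ideal point target: a 2D complex exponential on integer FFT bins."""
    n = np.arange(num_samples)[None, :]   # fast-time sample index
    m = np.arange(num_chirps)[:, None]    # slow-time chirp index
    phase = range_bin * n / num_samples + doppler_bin * m / num_chirps
    return np.exp(2j * np.pi * phase)

# Five frames of a target that moves one range bin per frame at constant Doppler.
frames = [simulate_frame(10 + k, 3) for k in range(5)]
cube = np.stack([range_doppler_map(f) for f in frames])  # (time, Doppler, range)

# One simple way to obtain 2D representations from the cube.
time_range_map = cube.sum(axis=1)     # collapse Doppler -> (time, range)
time_doppler_map = cube.sum(axis=2)   # collapse range   -> (time, Doppler)

peak = tuple(int(i) for i in np.unravel_index(np.argmax(cube[0]), cube[0].shape))
```

In this toy setting the peak of the first frame lies at Doppler index 19 (bin 3 after centring zero Doppler at index 16) and range bin 10, and the range peak advances by one bin per frame.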

Deep Learning Approaches in 3D Radar Echo
Range-Doppler frames reveal the moving properties, as well as the micro-Doppler properties, of targets [61]. Consisting of N time-sampled 2D range-Doppler frames, the 3D RD video sequence exhibits both spatial and temporal characteristics. Range and Doppler information is contained in every RD frame, while time information exists between frames. Compared with 1D and 2D echoes, the joint time-range-Doppler echoes contain almost all the activity information that radar receives. Models that are able to extract both temporal and spatial information are required. Since it is difficult to design features manually from 3D echoes, DL methods are more feasible and preferable for 3D echo-based HAR, thanks to their capability of automatically extracting deep features. Furthermore, the advent of GPUs makes it possible for DL models to process 3D data quickly and efficiently. Although few DL algorithms have been proposed for 3D radar echoes so far, DL approaches on 3D echoes are promising for HAR.
3D CNN has recently become one of the most used models for processing 3D data [4,93,94]. It extends the spatial CNN into a spatio-temporal model, and spatial-temporal features are learned automatically. Z. Zhang et al. [28] proposed a recurrent 3D CNN model for continuous dynamic gesture recognition using an FMCW radar. A 3D CNN was used for extracting short-term temporal-spatial features from continuous time-range maps, and then an LSTM was adopted for global temporal feature learning. Experiments showed that when the 3D CNN was replaced with a traditional 2D CNN, the recognition accuracy dropped by around 5%, which demonstrated that the 3D CNN was able to learn better representations of hand gestures than the 2D CNN. Though the input of the 3D CNN is time-range maps, this approach is also suitable for a 3D data cube because the cube contains almost all the activity information in continuous time-range maps.
A representative example using 3D radar echoes for HAR is GoogleSoli, as shown in Figure 8. GoogleSoli is the first gesture recognition system capable of recognizing a rich set of dynamic gestures based on short-range FMCW radar [50,51]. It is based on an end-to-end trained combination of deep convolutional and recurrent neural networks, and the dataset consists of 3D radar echoes.
Combining CNN and LSTM can enhance the ability to recognize different activities that have varied time spans and spatial distributions. It was shown that the approach with 3D range-Doppler videos was better than the frame-level classification approaches, and the end-to-end 'CNN + LSTM' method was able to explore the gesture information more fully than the single CNN or LSTM models. Since the advent of GoogleSoli, other DL architectures have been proposed based on it [28,31,94].

Deep Learning Approaches in 2D Radar Echo
Although 3D human backscattering echoes contain plentiful information about human activity, they are still complicated to process. 2D radar echoes, which mainly refer to the time-Doppler map, the time-range map and the range-Doppler map, also carry sufficient human activity information. Generally, 2D echoes are treated as images, so, following the line of computer vision, CNN has become the most commonly utilized model for 2D echoes. Thus, 2D echo-based HAR is often transformed into an image classification task.
(1) Time-Doppler map (also referred to as micro-Doppler signatures) includes sufficient time-varying Doppler information that is pivotal for radar-based HAR [95]. When a human target is moving, the main Doppler shift is caused by the torso while micro-Doppler is produced by rotating or vibrating parts, such as legs, feet and hands. The ranges and velocities of the body parts often differ, as shown in Figure 9. When the target acts differently, the time-Doppler maps corresponding to these activities differ accordingly. Time-Doppler maps are easy to obtain by transforming raw echoes with the short-time Fourier transform (STFT) [96] and other joint time-frequency analysis methods. A simple CW radar with one transmitter and one receiver can be employed for identifying basic human activities with time-Doppler maps. In addition, time-Doppler maps are intuitive and explicable. As a consequence, compared with other 2D radar echoes, time-Doppler maps have been the most commonly used for radar-based HAR up to now [20,41–43,45,48,49,54,56,69]. R.P. Trommel et al. [45] applied a 14-layer deep CNN (DCNN) to time-Doppler maps to classify human gaits. The experimental results showed that the DCNN architecture was able to extract effective micro-Doppler features of human gaits even at lower frequencies or low SNR levels, and it exceeded the performance of SVM and the artificial neural network. M.S. Seyfioglu et al. [52] employed a CAE architecture to discriminate 12 indoor human activities involving aided and unaided human motions, which often resulted in highly similar micro-Doppler signatures. The CAE model is composed of three convolutional layers and three deconvolutional layers, as illustrated in Figure 4. It is able to learn nuances in the micro-Doppler signatures and obtains a good recognition performance of 94.2%. This HAR method shows the potential of radar-based health monitoring systems for assisted living. In [42], a DCNN-based hand gesture recognition system using time-Doppler maps was proposed. There were three convolutional layers and a fully connected layer in the model. In addition, how effectively the DCNN recognizes hand gestures in uncontrolled environments was investigated. Results showed that the micro-Doppler signatures, and hence the recognition performance of the model, varied with aspect angle and distance to the radar across the different scenarios. Ref. [47] proposed a DCNN architecture composed of cascaded convolutional network layers to classify human activities with time-Doppler maps, as shown in Figure 10. Bayesian optimization with a Gaussian process prior was utilized to optimize the network. Experimental results showed that the performance of this method was better than that of three existing feature-based methods.
(2) Time-range map is composed of multiple pulses along time (see Figure 7c). It contains time-varying range information between the target and the radar. When a person is moving, different components of the human body have different relative distances from the radar, as illustrated in Figure 9a. As a result, although time-range maps neglect Doppler information, the time-varying range information of the human body can still be used for recognizing human activities [28].
In [98], time-range maps were utilized to detect falling in assisted living. By providing range information, the false alarms caused by fall-like activities such as sitting were reduced. In [22], Y. Shao et al. employed a three-layer DCNN to classify six human motions such as walking, running and boxing. It was shown that the time-range maps were more robust than the time-Doppler maps, especially when the radial velocity was low. Additionally, when increasing the incident angle, the recognition accuracy was maintained at a stable value, because the range information did not change significantly with the signal-to-noise ratio.
(3) Range-Doppler map (see Figure 7d) illustrates range and Doppler information of a moving target at a specific time. It has the ability to separate different components of the moving human body and locate the target accurately. In addition, range-Doppler maps are able to track multiple targets simultaneously, which is promising for multiple human activity recognition. P. Molchanov et al. [73] utilized a short-range monopulse FMCW radar with one Tx and three Rx to sense dynamic hand gestures. A 4D vector representing the spatial coordinates and radial velocity of the hand was estimated with range-Doppler maps from three antennas. Similarly, in [74], a 4D vector obtained from three range-Doppler maps was combined with a mask from a depth image. Then the resulting velocity layer was fed into a 3D CNN to identify dynamic car-driver hand gestures. The 3D CNN is able to extract spatial-temporal features, which are indispensable for recognizing dynamic hand gestures of short durations. In [57], two sparse AEs were stacked to learn sparse representations from range-Doppler maps gradually, and a softmax layer was employed for classification. In [58], a stacked AE was utilized to extract features from range-Doppler maps, and logistic regression was applied for identifying fall/non-fall. Refs. [57,58] give examples of applying DL methods to range-Doppler maps for HAR.
(4) Hybrid 2D maps. Up to now, most HAR systems based on 2D radar echoes have utilized only one of the above three kinds of maps. However, it is sometimes observed that activities which can be easily distinguished with one map may not be correctly identified with another. This motivates the use of multiple maps, aiming at reducing false alarms. Ref. [99] utilized the time-Doppler map, the time-range map and the range-Doppler map for fall detection. By extracting range and Doppler information from the three maps, the false alarm rate of fall detection was reduced. In [57], three stacked AEs and three softmax classifiers were employed to classify four human motions (falling, sitting, bending and walking), as described in Figure 11. In this method, time-Doppler maps, time-range maps and range-Doppler maps were all applied in order to fully explore the motion information that radar echoes contain. Then the three classification results were combined by a voting strategy to deliver the final result. Experiments showed that the performance was better than that obtained with only one kind of map. In [58], the fall detection procedure was divided into two stages: using a stacked AE composed of two sparse AEs to distinguish fall/walk from sit/bend with time-range maps, and using another stacked AE with the same structure to distinguish fall from walk with time-Doppler maps. A detection accuracy of 97.1% was achieved.
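Since time-Doppler maps underlie most of the 2D approaches above, a minimal sketch of how such a map can be computed with the short-time Fourier transform may be helpful. The example uses a toy complex echo with a constant torso Doppler and a sinusoidally varying limb micro-Doppler. The sampling rate, window length, hop size, signal model and function name are all assumptions for illustration and do not reproduce the processing of any cited work.

```python
import numpy as np

def stft_spectrogram(signal, window_len=64, hop=16):
    """Magnitude STFT with a Hann window.

    Returns an array of shape (window_len, num_frames) with zero Doppler
    centred on the frequency axis.
    """
    window = np.hanning(window_len)
    starts = range(0, len(signal) - window_len + 1, hop)
    frames = np.stack([signal[s:s + window_len] * window for s in starts])
    spectrum = np.fft.fftshift(np.fft.fft(frames, axis=1), axes=1)
    return np.abs(spectrum).T

# Assumed toy parameters: 1 kHz slow-time sampling, 2 s observation.
fs = 1000.0
t = np.arange(2000) / fs

# Torso: constant 100 Hz Doppler. Limb: weaker return whose Doppler swings
# by +/-100 Hz around the torso line at a 2 Hz gait-like rate.
torso = np.exp(2j * np.pi * 100.0 * t)
limb = 0.3 * torso * np.exp(1j * 50.0 * np.sin(2 * np.pi * 2.0 * t))
echo = torso + limb

time_doppler_map = stft_spectrogram(echo)
doppler_axis_hz = np.fft.fftshift(np.fft.fftfreq(64, d=1.0 / fs))
```

The resulting 2D array can be treated as an image, which is what allows the CNN-based classifiers discussed above to be applied directly; the window length trades Doppler resolution against time resolution.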

Deep Learning Approaches in 1D Radar Echo
Projecting the time-range-Doppler data cube onto the range dimension results in 1D radar echoes, namely high resolution range profiles (HRRPs), as shown in Figure 12. Though HRRP is not as intuitive as 2D and 3D radar echoes, it also carries enough information for identifying human activities. Ref. [100] applied HRRP to analyze human gaits with an ultra-wideband radar. Ref. [101] combined HRRP and micro-Doppler signatures to classify human gaits. Z. Zhou et al. [31] adopted multi-modal signals, including HRRPs and Doppler signatures acquired from a terahertz radar system, to recognize dynamic gestures, and the recognition rate reached more than 91%. 1D radar echoes are essentially time series, similar to the data obtained from sensors such as accelerometers and gyroscopes. Thus, many approaches used for time series could be adapted to 1D echo-based HAR. RNNs are often utilized for 1D data because of their advantages in modeling sequential data. For instance, A. Graves et al. proposed a speech recognition architecture composed of LSTM and the Connectionist Temporal Classification (CTC) algorithm, which is suitable for labeling unsegmented sequence data [102]. This provides insights into how to recognize continuous activities without manual annotation in advance. A. Hamid et al. [103] applied a 1D CNN to a hybrid NN-HMM model for speech recognition and proposed partial weight sharing for the first time. Although there are few DL-related studies on 1D radar echoes, DL approaches have the potential to extract sequential features and deliver good classification results for 1D radar echoes.
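As a rough illustration of the projection described above, the sketch below extracts a range profile for one time frame from a time-range-Doppler cube. The array layout (time, range, Doppler), the function name `range_profile`, and the choice of summing power over the Doppler axis are assumptions made for this example; other projections (for instance, taking the maximum over Doppler) are equally plausible.

```python
import numpy as np


def range_profile(cube, frame):
    """Return the 1D range profile of one time frame.

    cube: real or complex array of shape (time, range, doppler) (assumed layout).
    frame: index of the time frame to project.
    The profile is the power summed over the Doppler axis (an assumption).
    """
    power = np.abs(cube[frame]) ** 2      # shape (range, doppler)
    return power.sum(axis=1)              # shape (range,)


# Toy cube: 2 frames, 4 range bins, 3 Doppler bins, one scatterer at range bin 2.
cube = np.zeros((2, 4, 3), dtype=complex)
cube[0, 2, :] = 1.0
profile = range_profile(cube, frame=0)
strongest_bin = int(np.argmax(profile))
```

Stacking such profiles over successive frames yields a sequence that can be fed to the time-series models mentioned in the text.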

Future Directions
Although radar-based HAR with DL algorithms has made noteworthy progress, there is still a long way to go before it matures. As tools for feature extraction and activity identification, the designed DL architectures must be capable of exploiting the activity information in radar echoes as fully as possible. A few future research considerations are listed below.

A. Complex human activity recognition.
Complex human activities, such as drinking coffee and cooking, are composed of several simple activities that occur simultaneously. Compared with simple human activities such as walking, running and sitting, complex activities are more reflective of people's intentions and are therefore worth studying. Due to their complicated semantic and context information, complex activities are harder to recognize than simple activities.
(1) Hybrid deep model design. In Table 3, most work adopts single, basic DL models, such as CNN and AE. However, as described in Section 2, each type of DL model has its own characteristics for the HAR task. To take full advantage of the semantic and context information for identification, a single DL model is far from sufficient. Motivated by this, designing hybrid DL models for recognizing complex activities is imperative.
(2) Multiple forms of echoes. Table 3 shows that, compared with 2D radar echoes, there is less work based on 1D and 3D echoes so far. This is mainly because current radar-based HAR tasks mostly focus on identifying simple activities. In this case, the information in 2D echoes is enough to obtain good recognition performance. In addition, 2D echoes are intuitive and explicable, which makes them more widely accepted. Generally speaking, information is lost during the radar signal transformation process, regardless of whether the signals are converted into 1D, 2D or 3D echoes. However, in order to make radar-based HAR systems more robust and generic for complex scenarios, more of the activity information in radar echoes should be utilized. To this end, different types of radar echoes could be employed for information extraction. Consequently, it is necessary to combine multiple forms of echoes for HAR.
(3) Aspect angle sensitivity. In the HAR task, the Doppler shift is caused by the radial velocity of moving targets, and the radial velocity changes with the relative position between the target and the radar. When the motion directions are different, the radar backscattered signals produced by a subject differ considerably [104]. In this regard, the designed model should be robust to changes in aspect angle. Since there is little research on this issue [22,42], it needs further investigation.
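The dependence of the Doppler shift on aspect angle noted in item (3) can be made concrete with the standard monostatic relation f_d = 2 v cos(θ) / λ, where θ is the angle between the motion direction and the radar line of sight. The sketch below is illustrative only: the function name, the 24 GHz carrier and the 1.5 m/s walking speed are hypothetical values chosen for the example, not parameters taken from the surveyed systems.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def doppler_shift_hz(speed_mps, aspect_angle_deg, carrier_hz):
    """Doppler shift of a point target seen by a monostatic radar.

    speed_mps: target speed along its own direction of motion.
    aspect_angle_deg: angle between motion direction and radar line of sight.
    carrier_hz: radar carrier frequency (hypothetical value in the example).
    """
    wavelength = SPEED_OF_LIGHT / carrier_hz
    radial_velocity = speed_mps * math.cos(math.radians(aspect_angle_deg))
    return 2.0 * radial_velocity / wavelength


# Same walking speed observed at three aspect angles (assumed 24 GHz carrier).
shifts = {angle: doppler_shift_hz(1.5, angle, 24e9) for angle in (0, 60, 90)}
```

At 60 degrees the shift is half of the head-on value, and at 90 degrees it vanishes, which is why a model trained only on radial motion may fail on tangential motion.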

B. Radar-based human activity recognition in real-world scenarios.
So far, most recent radar-based HAR approaches are only applicable in controlled environments, where a human target performs several discontinuous, assigned activities with little interference. In addition, the real-time processing capability of the model is not taken into account. To apply radar-based HAR in real-world scenarios, several issues should be considered carefully.
(1) Light-weight deep model design. Training a DL model often requires substantial computing resources, so training is usually executed off-line with a limited amount of data. In reality, however, activity data often arrive as a stream and require robust online and incremental learning. Although traditional ML approaches are capable of processing and classifying data in real time, their heavy feature engineering and hand-crafted feature extraction hinder their use for real-world HAR. Consequently, it is necessary to design light-weight DL models for radar-based HAR. Two ideas are available for investigation: combining hand-crafted features with deep features, and combining DL models with conventional ML algorithms.
(2) Continuous activity segmentation and recognition. In real-world scenarios, a person acts continuously and freely rather than merely performing assigned activities. Accurate segmentation and recognition of the activities of interest is crucial. Recently, there has been a trend of addressing segmentation and recognition jointly. For example, in [28], a Connectionist Temporal Classification (CTC) algorithm [105] was employed to recognize continuous dynamic hand gestures. CTC enables gesture recognition without explicit pre-segmentation and addresses segmentation and recognition simultaneously. In further research, more algorithms that jointly segment and recognize a series of activities are desired.
(3) Multi-target activity recognition. How to identify the activities of multiple targets, or to separate one target from a group, is worth studying. In [45,54], multi-target human gait recognition with DL approaches was studied. In [75], an FMCW radar was utilized to separate and recognize several assigned hand gestures in the presence of multiple targets. However, those solutions often work only in less disturbed scenarios, such as one in which the human target is making gestures while another person is walking toward the radar. In circumstances where the radar echoes are modulated by multiple moving targets, applying DL models to learn high-level features is of great significance. More elaborate DL models should be designed for multi-target activity recognition.
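To illustrate how CTC, mentioned in item (2) above, yields a label sequence without explicit pre-segmentation, the sketch below shows greedy (best-path) CTC decoding: take the most likely class per frame, merge consecutive repeats, and drop the blank symbol. This covers decoding only, not CTC training, and it is not claimed to be the procedure used in [28]; the function name, the blank index of 0 and the toy frame scores are assumptions.

```python
import numpy as np


def ctc_greedy_decode(frame_probs, blank=0):
    """Best-path CTC decoding.

    frame_probs: array of shape (frames, classes) with per-frame class scores.
    blank: index of the CTC blank symbol (assumed to be 0 here).
    Returns the decoded label sequence as a list of class indices.
    """
    best_path = np.argmax(frame_probs, axis=1)
    decoded = []
    previous = None
    for label in best_path:
        label = int(label)
        # Emit a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != blank:
            decoded.append(label)
        previous = label
    return decoded


# Toy per-frame predictions: blank, 1, 1, blank, 1, 2, 2, blank (one-hot scores).
path = [0, 1, 1, 0, 1, 2, 2, 0]
frame_probs = np.eye(3)[path]
gestures = ctc_greedy_decode(frame_probs)
```

The blank between the two runs of class 1 is what allows the same gesture to be recognized twice in a row, which a plain "merge repeats" rule could not express.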

C. Unsupervised activity recognition in radar.
DL models require large-scale labeled data to prevent overfitting and obtain good generalization. In radar applications, however, acquiring a large amount of measured, labeled data is challenging due to constraints on manpower, cost, and other resources. As a result, unsupervised HAR in radar is urgently needed.
(1) Deep transfer learning. Transfer learning generally refers to transferring the knowledge or models learned from a certain task to another related, but different, task. Up to now, transfer learning for radar-based HAR has mainly taken two forms: transferring models trained with large-scale natural image datasets, such as ImageNet [84,89,90], and transferring models trained with simulated radar image datasets [83,85]. How to design a DL model that is capable of adequately learning the relatedness between the source domain and the target domain is an open issue in radar-based HAR.
(2) Cross-modal knowledge distillation. For the activity recognition task, it has been verified that shared representations exist in different types of sensory data. These shared representations could be utilized as supervision for training a radar-based DL model. Only synchronized but unlabeled data are employed during the cross-modal knowledge distillation process. Ref. [106] demonstrates that training a model by cross-modal knowledge distillation not only reduces the amount of required labeled data but also speeds up the training process. For radar-based HAR, cross-modal knowledge distillation is also effective, and large-scale labeled data are no longer necessary.
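One common way to realize the supervision described in item (2) is to train the radar model (the student) to match the softened class probabilities of a model trained on another, synchronized modality (the teacher). The sketch below computes such a distillation loss with NumPy. It rests on assumptions: the text only says that shared representations act as supervision, so the use of output probabilities, the KL-divergence form, the temperature value and all names are illustrative choices rather than the method of [106].

```python
import numpy as np


def softmax(logits, temperature=1.0):
    """Numerically stable softmax along the last axis."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Mean KL(teacher || student) over a batch of synchronized samples.

    teacher_logits: outputs of the non-radar model (no labels needed).
    student_logits: outputs of the radar model on the paired radar data.
    temperature: softening factor (hypothetical default).
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    eps = 1e-12  # guards against log(0)
    kl = np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)
    return float(kl.mean())


# Toy batch of one synchronized sample with three activity classes.
loss_matched = distillation_loss([[2.0, 0.0, 0.0]], [[2.0, 0.0, 0.0]])
loss_mismatched = distillation_loss([[2.0, 0.0, 0.0]], [[0.0, 2.0, 0.0]])
```

No activity labels appear anywhere in the loss; only paired teacher and student outputs are needed, which matches the "synchronized but unlabeled" setting described in the text.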

Conclusions
Human activity recognition is one of the interesting research topics in human-computer interaction and smart surveillance. As an active system for human activity recognition, radar has many unique advantages and has gradually attracted the attention of researchers. Deep learning is able to extract deep hierarchical features automatically and has achieved desirable classification performance. In this paper, we first survey several state-of-the-art deep learning models. Those models have different characteristics for identifying human activities, and there is a trend of combining multiple models to better learn the features of human activities. Then, the radar systems most commonly employed for HAR are described. Doppler radar is able to obtain Doppler information for HAR, while FMCW radar provides both range and Doppler information. UWB radar has a high range resolution and is capable of distinguishing the scattering centers of the human body. Interferometry radar provides Doppler information regardless of the direction of human movement. Furthermore, by classifying radar echoes into three forms (1D, 2D and 3D), we discuss the development of deep learning based HAR in radar. Various deep learning techniques designed specifically for 1D/2D/3D radar echoes have been discussed, and the experimental results demonstrate the feasibility of such techniques. 2D radar echoes, especially time-Doppler maps, are more commonly used for radar-based HAR because they are more intuitive and contain sufficient activity information. 3D echoes contain more information, but they are also more difficult to process than 2D and 1D echoes. Because of the simple form of 1D echoes, the activity information contained in them is still waiting to be fully mined. Thanks to their ability to learn features, DL techniques show potential for radar-based HAR. Finally, several future research directions for radar-based HAR are presented. Though the adoption of radar for HAR still lags behind vision-based technologies, we should be optimistic about the potential of radar-based HAR techniques because of radar's unique advantages, such as environment insensitivity and better privacy protection.

Figure 1. Illustration of features extracted from a time-Doppler map. Adopted from [30].

Figure 2. (a) CNN is performed as a classifier for radar-based HAR. Adopted from [41]. (b) CNN is adopted as a feature extractor to learn high-level features from the input time-range maps. Adopted from [28].

Figure 6. Illustration of how an FMCW radar acquires range and Doppler information, taking a sawtooth wave as an example. f_d is the Doppler shift while τ is the time delay. Adopted from [73].

Figure 8. Deep learning architecture of Google Soli, a hybrid model that consists of CNN and LSTM. Adopted from [50].

Figure 9. Moving trajectories of different body parts when a human target is walking: (a) Range of different parts. (b) Radial velocity of different parts. Adopted from [97].

Figure 12. High resolution range profiles of a hand at different times. Adopted from [31]. Each sub-figure illustrates the HRRP at a specific time.

Table 1. DL models and advantages for human activity recognition.

Table 2. Radar system and basic characteristics.

Table 3. Summary of existing works on DL based human activity recognition in radar.