Room-Level Fall Detection Based on Ultra-Wideband (UWB) Monostatic Radar and Convolutional Long Short-Term Memory (LSTM)

Timely calls for help can make a life-saving difference for elders who suffer falls, particularly in private locations. Considering privacy protection and convenience for the users, in this paper, we approach the problem by using impulse-radio ultra-wideband (IR-UWB) monostatic radar and propose a learning model that combines convolutional layers and convolutional long short-term memory (ConvLSTM) to extract robust spatiotemporal features for fall detection. The performance of the proposed scheme was evaluated in terms of accuracy, sensitivity, and specificity. The results show that the proposed method outperforms convolutional neural network (CNN)-based methods. On the six activities we investigated, the proposed method achieves a sensitivity of 95% and a specificity of 92.6% at a range of 8 meters. Further tests in a heavily furnished lounge environment showed that the model can detect falls with more than 90% sensitivity, even without re-training. The proposed method detects falls without exposing the identity of the users and is thus ideal for room-level fall detection in privacy-prioritized scenarios.


Introduction
With increasing life expectancy and a declining birth rate, population aging has become a grave social issue. Providing health and social care for the growing older population is drawing increasing attention. According to a report by the World Health Organization (WHO), millions of fall-induced injuries and deaths occur around the globe each year. Older people have weaker balance, so they have a high risk of falling [1,2]. Because older people are physically frail, they are vulnerable to the injuries caused by falls [3]. If they cannot get immediate medical assistance, their condition is likely to deteriorate, ultimately resulting in life-endangering illness. Thus, a timely call for help in the event of a fall can make a difference in the well-being of our elders.
When an elder falls and becomes injured, there are two possible scenarios. The first is that the elder is in the company of others or in a public area. In this scenario, the fall event can be easily noticed, and the elder is likely to get medical assistance in time. The second scenario is that the elder is alone in a private location. When this happens, the elder must attract the attention of others to get help. Clearly, the second scenario is more dangerous for the injured elder. However, in this scenario, given that video-based methods raise privacy issues in private locations and the elder may have left any carried devices behind, the problem needs to be approached from a different angle.
The current research on fall detection can be categorized into three groups: wearable-device-based [4], ambient-sensor-based, and camera(vision)-based [5] approaches. Many researchers have adopted deep learning in fall detection research and acquired good results. The work of Y. Lin et al. [32] used an iterative convolutional neural network (ICNN) followed by random forests to deal with radar signals. Sadreazami et al. employed a deep convolutional neural network [33] and a deep residual neural network [34] to learn features from radar time-series signals. However, those network structures either dealt only with one-dimensional radar time-series signals or used CNNs to extract shape-based features but failed to take advantage of the spatiotemporal structure of the data. In this paper, we approach UWB-radar-based fall detection by combining a CNN and convolutional long short-term memory (ConvLSTM) to extract spatiotemporal features from the radar ranging data flow. A summary of recent studies on fall detection systems based on machine learning is shown in Table 1.
In this paper, we propose a whole-room fall detection scheme by using IR-UWB radar that produces one-dimensional data frames. The fall detection problem was transformed into a classification problem of sequences of one-dimensional range data. A network structure combining CNN layers and ConvLSTM is proposed and evaluated in terms of accuracy, sensitivity, and specificity. The experimental results show that our method outperforms CNN-based methods in terms of accuracy and robustness due to its ability to extract both the shape-based features and time-based sequential features.

Data Acquisition with IR-UWB Monostatic Radar
IR-UWB monostatic radar consists of a UWB transmitter and a receiver. For each data frame, the UWB transmitter emits several UWB pulses p(t). The pulses bounce off reflective surfaces in the environment and are sampled by the receiver at a fixed sample rate, F_s. Signals that have traveled different distances can then be distinguished by their time of arrival (TOA). The received echo signal of frame t can be expressed in vector form as

R_t = [r_t(1), r_t(2), . . . , r_t(N)],    (1)

where N represents the number of samples in one data frame. A larger N indicates a longer detection range. A sample of the raw signal received by an IR-UWB monostatic radar is shown in Figure 1a. Note that the signal was mapped into [0, 1]. In a typical indoor setting, there are many static reflective surfaces in the environment. These surfaces account for the peaks in Figure 1a, and these peaks obscure the signal reflected from the target. To solve this problem, a reference datum R_Ref, taken with no target in the environment, can be subtracted from the raw received signal, yielding the difference ∆R_t = R_t − R_Ref.
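The clutter-removal step described above can be sketched in a few lines. The frame length, reference frame, and target location below are hypothetical stand-ins for illustration, not values from the experiments:

```python
import numpy as np

# Illustrative frame length; the real N depends on the configured detection range.
N = 512
rng = np.random.default_rng(0)

# Stand-ins for one raw echo frame R_t and the empty-room reference R_Ref,
# both already mapped into [0, 1] as in Figure 1a.
r_ref = rng.random(N)          # reference frame, no target present
r_t = r_ref.copy()
r_t[200:210] += 0.8            # hypothetical target reflection near bin 200

# Static clutter removal: subtract the reference frame from the raw frame.
delta_r = r_t - r_ref

# The strongest residual peak now marks the target's range bin.
target_bin = int(np.argmax(np.abs(delta_r)))
```

After the subtraction, the static peaks cancel and only the target's contribution remains, which is what the rest of the pipeline operates on.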
As the target enters the detection range of the IR-UWB radar, it introduces more reflective surfaces to the environment. As a result, a strong peak appears in ∆R t at the location of the target. Due to the effect of multipath in indoor environments, more than one peak can appear, as is shown in Figure 1c. The shape and strength of the peaks contain information on the pose of the target. Taking a step further, the activity of the target can be inferred by extracting features that represent the change of pose in consecutive frames. Then these features can be used to detect falls.
The goal of fall detection by using IR-UWB radar is to use the observed sequential radar echo frames to determine which activity the target is conducting. This problem can be regarded as a spatiotemporal sequence classification problem.
Let M be the number of frames each observation contains, S ⊂ R^(N×M) be the domain of collected echo data, and L = {1, . . . , K} be the domain of the output activity label l (with a total of K different activities). Then, the fall detection problem can be expressed as

l̂ = argmax_{l ∈ L} p(l | s), s ∈ S,    (2)

that is, finding the most probable activity label given an observed sequence of echo frames.

Figure 1. (a) Raw echo data that were collected from the impulse-radio ultra-wideband (IR-UWB) module and (b) reference echo data that were collected when no target was in the monitored area. (c) The signal difference that was calculated by subtracting the reference echo data from the raw echo data. (d) The result of applying wavelet denoising to the signal difference.

Dataset
All the data were collected by using an off-the-shelf IR-UWB radar in two indoor environments: a laboratory and a lounge at the Public Experimental Teaching Center of Shandong University. The environmental settings are shown in Figure 2. The laboratory environment had no furniture, while the lounge environment was heavily furnished. The radar module we used, the PulsON P440 [36], transmitted UWB pulses at 30 dBm by using two omnidirectional antennas. Figure 3 shows the IR-UWB radar and the antennas we used. The UWB radar operated at a 4 GHz center frequency with a 1.7 GHz bandwidth. The sampling interval of the UWB receiver was 61 ps, which enabled a spatial resolution of approximately 0.9 centimeters. The antennas were Time Domain's BroadSpec planar elliptical dipole antennas with a 3 dBi gain. The UWB modules were mounted on tripods at a height of 1.2 m, approximately the height of the torso of the test subjects. In our experiments, two UWB radars were employed to collect data. The UWB radar modules were connected to a PC via network cables. Interference between the radar modules was avoided by letting them take turns transmitting signals and processing data. The frame rate of the raw data flow was 5 frames/s.
Five volunteers, three men and two women, served as test subjects. Basic information on the five subjects is shown in Table 2. To distinguish falls from other daily activities, six different activities were examined: standing still, falling, lying still, standing up, walking, and jumping. Figure 4a shows the six activities we focused on. A reference signal was collected before the targets entered the area of interest. Volunteers conducted the six activities in the test area at random locations and facing random directions. To reduce the possibility of human error, the raw data recordings were manually labelled by three researchers, and the average value of their labelling results was treated as the ground truth.
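The stated spatial resolution follows directly from the 61 ps sampling interval; a quick sanity check using only the speed of light (no radar-specific assumptions):

```python
# Speed of light and the receiver's 61 ps sampling interval from the text.
C = 3.0e8          # m/s
T_S = 61e-12       # s per sample

# Each sample bin covers the round trip out to the reflector and back,
# so the one-way spatial resolution is c * T_s / 2.
resolution_m = C * T_S / 2
print(resolution_m)   # about 0.00915 m, i.e., the stated ~0.9 cm
```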
The recording time for each activity sample was 4 s, which meant that each sample contained 20 frames. In the laboratory environment, we collected 102 samples for each of the five volunteers performing each activity. Among the 102 samples, 62 (60.8%) were used to generate the training set, 20 (19.6%) were used to form the validation set, and the remaining 20 (19.6%) were used to form the test set. A total of 3060 samples were collected. The validation set and the test set contained 600 samples each. Some samples from the dataset are shown in Figure 4b. In the lounge environment, we collected 126 samples for one volunteer performing each activity, for a total of 756 samples. All the data collected in the lounge were used to form test set 2. The dataset is available online [37].

Preprocessing
As can be seen in Figure 1c, the signal difference was noisy. First, the raw echo data collected by the UWB monostatic radar were normalized into [0, 1]. To eliminate the effect of noise, a wavelet filter was used to denoise each scan. We used the Daubechies 5 (db5) wavelet, as it is a compactly supported wavelet with an extremal phase and the highest number of vanishing moments for a given support width [38], making it well suited to smoothing a noise-cluttered signal. An example result of the denoising process is shown in Figure 1d.
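The db5 filtering itself is not reproduced here; the following pure-NumPy sketch substitutes a one-level Haar transform with soft thresholding to illustrate the idea of wavelet shrinkage. The signal, noise level, and threshold are all hypothetical:

```python
import numpy as np

def haar_soft_denoise(x, thresh):
    """One-level Haar wavelet shrinkage; an illustrative stand-in for the db5 filter."""
    # Analysis: orthonormal Haar averages (approximation) and differences (detail).
    a = (x[0::2] + x[1::2]) / np.sqrt(2)
    d = (x[0::2] - x[1::2]) / np.sqrt(2)
    # Soft-threshold the detail coefficients, where most broadband noise lives.
    d = np.sign(d) * np.maximum(np.abs(d) - thresh, 0.0)
    # Synthesis: invert the transform.
    y = np.empty_like(x)
    y[0::2] = (a + d) / np.sqrt(2)
    y[1::2] = (a - d) / np.sqrt(2)
    return y

rng = np.random.default_rng(1)
clean = np.zeros(256)
clean[100:108] = 1.0                           # hypothetical target peak
noisy = clean + 0.05 * rng.standard_normal(256)
denoised = haar_soft_denoise(noisy, thresh=0.05)
```

Shrinking the detail coefficients suppresses broadband noise while leaving the broad target peak largely intact, which is the same effect shown in Figure 1d.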
In our experiments, we found that very high peaks sometimes existed in ∆R_t. After normalization, these peaks obscured all the other features in the result, which was detrimental to extracting robust features of a target's activities; as a result, very little useful information could be learned during the training phase. Thus, after denoising, the amplitude of the result was transformed into logarithmic coordinates before being mapped into [0, 1]. In this way, features with low amplitude were enhanced and the robustness of the system was improved.
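A minimal sketch of this log-compression step follows; the example frame and the small epsilon floor are assumptions for illustration only:

```python
import numpy as np

def log_compress(delta_r, eps=1e-6):
    """Map amplitudes to a log scale, then rescale into [0, 1].

    eps is a hypothetical floor that keeps log() finite for zero-valued bins.
    """
    mag = np.abs(delta_r)
    logged = np.log(mag + eps)
    # Min-max mapping into [0, 1], so weak reflections stay visible
    # next to an occasional very strong peak.
    return (logged - logged.min()) / (logged.max() - logged.min())

frame = np.array([0.0, 0.001, 0.01, 0.1, 10.0])   # one dominant peak
linear = frame / frame.max()                       # plain linear normalization
compressed = log_compress(frame)
```

After compression, a bin that linear normalization would squash to 0.01 of full scale keeps a substantial fraction of the dynamic range, which is exactly the low-amplitude enhancement described above.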
The training data were then augmented by linear translation of the sampling window. An illustration of the linear translation we employed is shown in Figure 5. The sampling window moved along the range axis with a predetermined step, and multiple training samples were generated from one raw measured data sample. After augmentation, the training set contained 9300 samples.
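A minimal sketch of the window-translation augmentation follows. The window width and step are hypothetical, chosen only so that each raw sample yields the five-fold growth (1860 to 9300 training samples) implied above:

```python
import numpy as np

def translate_windows(sample, window, step):
    """Slide a range-axis window over one (M, N) sample to create shifted copies.

    sample: M frames by N range bins; window: output width in bins; step: shift in bins.
    """
    m, n = sample.shape
    return [sample[:, start:start + window]
            for start in range(0, n - window + 1, step)]

raw = np.arange(20 * 200, dtype=float).reshape(20, 200)  # hypothetical 20-frame sample
augmented = translate_windows(raw, window=160, step=10)
# (200 - 160) / 10 + 1 = 5 shifted training samples per raw sample.
```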
Figure 5. The training data were augmented by moving the sample window along the distance axis with a fixed step length.


Convolutional LSTM
As a special recurrent neural network structure, LSTM is a powerful tool for modeling long sequences. The core contribution of the LSTM structure is the introduction of memory cells that accumulate or discard state information through dynamic gates. LSTM has been used in various studies to model long-range correlations. However, spatiotemporal data have to be converted into 1D sequences when serving as input to the LSTM structure, thus losing some of their structural information.
To solve this problem, Xingjian Shi et al. proposed the ConvLSTM structure [39]. By introducing convolution into the classic LSTM structure, their solution can extract spatial information as well as temporal information. In contrast to the classic LSTM structure, ConvLSTM updates the state of a cell from the current input and the past states of a small reception field. The update process of ConvLSTM can be expressed as

I_t = σ(W_xi * X_t + W_hi * H_(t−1) + W_ci • C_(t−1) + b_i),
F_t = σ(W_xf * X_t + W_hf * H_(t−1) + W_cf • C_(t−1) + b_f),
C_t = F_t • C_(t−1) + I_t • tanh(W_xc * X_t + W_hc * H_(t−1) + b_c),    (3)
O_t = σ(W_xo * X_t + W_ho * H_(t−1) + W_co • C_t + b_o),
H_t = O_t • tanh(C_t),

where X_t is the input, C_t is the cell output, H_t is the hidden state, I_t, F_t, and O_t are the gates, the W terms are convolutional kernels, b_i, b_f, b_c, and b_o are the biases of the gates, '*' denotes the convolution operator, and '•' denotes element-wise multiplication. The structure of the ConvLSTM cell is illustrated in Figure 6a. Note that in the ConvLSTM structure, the inputs, outputs, states, and gates are all 3D tensors instead of vectors, as in a standard LSTM structure. In the input-to-state and state-to-state transitions, product operations are replaced with convolution operations. The ConvLSTM structure was initially proposed to process video data in which each frame is a 2D image, so it employed 2D convolutional operations to enhance the standard LSTM structure such that it gained a small reception field over the two spatial dimensions. In our case, however, each frame had only one spatial dimension. Thus, 1D kernels were used in the convolutional LSTM layer.
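The update equations can be exercised with a toy single-channel 1D ConvLSTM cell in NumPy. The kernel size, weight scale, and the omission of the peephole terms are simplifications for illustration, not the implementation used in the experiments:

```python
import numpy as np

def conv_same(x, k):
    """1D 'same' convolution; the small kernel gives each cell a local reception field."""
    return np.convolve(x, k, mode="same")

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm1d_step(x_t, h_prev, c_prev, w, b):
    """One ConvLSTM update for a single-channel 1D frame (a sketch, not the paper's code).

    w: dict of 3-tap kernels for the input/hidden paths of gates i, f, o and candidate c.
    b: dict of scalar biases. Peephole connections are omitted for brevity.
    """
    i = sigmoid(conv_same(x_t, w["xi"]) + conv_same(h_prev, w["hi"]) + b["i"])
    f = sigmoid(conv_same(x_t, w["xf"]) + conv_same(h_prev, w["hf"]) + b["f"])
    g = np.tanh(conv_same(x_t, w["xc"]) + conv_same(h_prev, w["hc"]) + b["c"])
    o = sigmoid(conv_same(x_t, w["xo"]) + conv_same(h_prev, w["ho"]) + b["o"])
    c_t = f * c_prev + i * g          # the memory cell accumulates or discards state
    h_t = o * np.tanh(c_t)            # the hidden state keeps the 1D spatial layout
    return h_t, c_t

rng = np.random.default_rng(2)
n = 64                                # range bins per frame (hypothetical)
w = {k: rng.standard_normal(3) * 0.1 for k in
     ("xi", "hi", "xf", "hf", "xc", "hc", "xo", "ho")}
b = {k: 0.0 for k in ("i", "f", "c", "o")}
h, c = np.zeros(n), np.zeros(n)
for x_t in rng.standard_normal((20, n)):  # 20 frames, as in one 4 s sample
    h, c = convlstm1d_step(x_t, h, c, w, b)
```

Because the gates are convolutions rather than dense products, the hidden state retains the range-bin layout across time steps, which is what lets the layer learn spatiotemporal features directly.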
Before training, all the weights were randomly and orthogonally initialized. The Nesterov adaptive momentum (Nadam) optimizer [40] was utilized to train the model. During training and testing, the data were segmented into mini-batches of 100 samples. The accumulated gradient of each batch was computed and used to update the parameters. Dropout layers and L2 regularizers were utilized to prevent overfitting.
The network we propose has two 2D convolution layers and a 1D convolutional LSTM layer followed by a dense layer. In this network, the 2D convolution layers are used in the same way that 3D convolution layers are used in solving video classification problems. Each convolution layer uses 3 × 3 kernels and is followed by a max-pooling layer with a scaling factor of 2. Convolution layers with a small reception field followed by pooling layers extract low-level spatiotemporal features and reduce the number of parameters that need to be trained in the following layers. The output of the convolution layers is then fed into a 1D convolutional LSTM layer. Rather than returning the full sequence, the ConvLSTM layer is set to the mode where only the last output in the sequence is returned. Figure 6b,c illustrates the differences between the ConvLSTM structure proposed in [39] and the one that we employed in our method. The dense layer immediately after the ConvLSTM layer has 64 hidden neurons.
The proposed network structure is shown in Figure 6d. The total number of trainable parameters of the proposed network is 172,836, compared with 1,027,026 for a LeNet-5 network with the same input.
First, we employed two CNN layers to automatically define and extract features. Then, by adding a convolutional LSTM layer, the temporal structure of the falling process could be learned by the network.

Results and Discussion
The effectiveness of the proposed fall-detection method is evaluated in terms of accuracy, sensitivity, and specificity [41]. The metrics can be expressed as

Accuracy = (TP + TN) / (TP + TN + FP + FN),
Sensitivity = TP / (TP + FN),
Specificity = TN / (TN + FP),

where TP is the number of fall samples identified as falls, FP is the number of non-fall samples identified as falls, TN is the number of correctly classified non-fall samples, and FN is the number of fall samples that were not identified as falls. In our experiments, all activities other than falling were considered non-falls. A non-fall sample must be classified as its corresponding activity to contribute to the TN count. We compared the proposed method with the classic LeNet-5 architecture, which has been adopted in similar research [32,35]. Three classifiers, SoftMax, K-nearest neighbors (KNN), and random forest (RF), were selected to test the performance of the proposed method with different classifiers. In the training process, the drop rate of the dropout layers was set to 0.5, and the penalty of the L2 regularizers was 0.1. The learning rate was set to 0.001, and the maximum number of epochs was set to 1000 with an early stopper enabled.
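The three metrics follow directly from the four counts. The counts below are hypothetical, chosen only to reproduce the 95% sensitivity and 92.6% specificity figures quoted in the abstract:

```python
def fall_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, and specificity from fall/non-fall counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)       # fraction of true falls detected
    specificity = tn / (tn + fp)       # fraction of non-falls correctly rejected
    return accuracy, sensitivity, specificity

# Hypothetical counts for illustration only (not the paper's raw results):
acc, sens, spec = fall_metrics(tp=95, fp=37, tn=463, fn=5)
print(sens)   # 0.95
print(spec)   # 0.926
```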

Performance Evaluation
The performance comparison of the methods is shown in Table 3. The data from the lab environment were used in this evaluation. By combining the CNN and ConvLSTM, the proposed method achieved a better overall performance than the CNN-based network structure. Among the classifiers, the SoftMax classifier had the best performance in terms of accuracy and specificity. Please note that the classification task was challenging because the radar had a low frame rate and the test subjects performed the activities at randomized locations while facing random directions. Despite these challenges, the proposed method was still able to show good performance, comparable to state-of-the-art methods.
Figure 7 shows the confusion matrices of the classification results for the six activities when using different methods. Generally, our methods performed better than the CNN-based methods. It is interesting to note that standing still and lying still were the easiest to confuse. This might have been due to the lack of robust features to distinguish between the two activities, as the most appreciable difference between them was the power of the reflected signal rather than spatiotemporal motion. Compared with the CNN-based approach, the ConvLSTM-based approach has an advantage in identifying movement-based activities. This is because ConvLSTM layers have the ability to correlate long time sequences, while the convolution operation makes it possible to gain a larger area of perception than the classic LSTM structure.

Effect of the Number of Frames per Sample
The effect of the number of frames per sample on the sensitivity and specificity of the methods is shown in Figure 8. The data from the lab environment were used in this test. The datasets with fewer frames were generated from the dataset with the maximum number of frames by alternately removing frames from the two ends of each sample. It can be concluded from the figure that our method consistently yielded better results than the CNN-based methods. The best result was achieved when each sample contained 20 frames.
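The frame-reduction procedure might be implemented as follows; which end loses the first frame is an assumption, and the sample contents are synthetic:

```python
import numpy as np

def trim_to_frames(sample, target):
    """Shorten an (M, N) sample by alternately dropping frames from the two ends."""
    m = sample.shape[0]
    drop = m - target
    head = (drop + 1) // 2      # assume the first removal comes from the front
    tail = drop - head
    return sample[head:m - tail]

# A synthetic 20-frame sample whose frame index is encoded in every bin.
full = np.arange(20)[:, None] * np.ones((20, 8))
short = trim_to_frames(full, target=16)    # keeps the central 16 frames
```

Trimming symmetrically keeps the activity itself, which occurs near the middle of the 4 s window, while discarding the lead-in and lead-out frames first.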

Intuitively, as the number of frames per sample increased, the samples contained more transition information about the different poses that the target person assumed before and after the activity. This transition information could be utilized by the learning algorithms to obtain better results. This was generally true when each sample contained no more than 16 frames; beyond that point, however, the performance of the investigated methods could become worse. This may be caused by the introduction of redundant information that reduces the robustness of the extracted features. Note that the frame rate of the system was about 5 frames/s; thus, 20 frames covered a time span of 4 s. In our experiments, falls generally took 1-2 s, while the activity of standing up took 2-4 s.

Evaluation in a Heavily Furnished Environment
Table 4 shows the accuracy, sensitivity, and specificity of the trained models when predicting the data that were collected in the heavily furnished lounge environment. The models were trained with the data from the lab environment and tested with the data from the lounge environment. The confusion matrix of the six activities is shown in Figure 9. The model trained in Section 3.1 and the data from the lounge environment were used in this evaluation. It can be concluded that the proposed method could still detect falls with a high sensitivity even when the environment was changed. However, the trained model showed poor performance when classifying the jumping samples in the new environment. This was because, with the furniture in the environment, the movements of some body parts of the target person may have been blocked, especially the movements of the person's legs. As the model heavily relied on the motion of the legs to classify the jumping activity, its performance degraded accordingly.

Evaluation of Performance Classifying Activities of Unknown Subjects
The models were evaluated by five-fold cross-validation in a leave-one-subject-out fashion over the data from all five subjects, and the results are shown in Table 5. Specifically, each time a model was trained, the data of four subjects were used for training and validation, and the data of the remaining subject were used for testing. The data of the four subjects used for training and validation were divided into four folds: three folds for training and the remaining fold for validation. The results showed that the performance of the proposed method remained stable when classifying the activities of previously unseen subjects.

Table 5. Accuracy, sensitivity, and specificity of the schemes in the leave-one-subject-out cross-validation.

Table 6 shows the accuracy of the schemes in classifying the five subjects by using the deep learning methods. The dataset used in this test was the same as that in Section 3.1, but the label was changed to the serial number of the corresponding test subject. It appears that the proposed method was sensitive to different activities but insensitive to differences between individuals. This is good for preserving the privacy of the users, because the method can provide fall detection with a low risk of exposing the identity of the users.

Table 7 shows the execution times of the models. Note that this result was achieved on a laptop with an Intel i5-7300 central processing unit (CPU) running at 2.6 GHz. The values were calculated as the average execution time over 100 classification tasks. Though the proposed method had a longer execution time than the CNN-based models, since the frame rate of the UWB radar we used was 5 Hz, the laptop could provide real-time classification with less than 0.5% CPU time.
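A measurement of the same shape as Table 7's timing, averaging over 100 classification tasks, could look like the following; the classifier is a trivial stand-in, since the real models are not reproduced here:

```python
import time

def classify_stub(sample):
    """Stand-in for one model inference; returns a pretend label among six activities."""
    return sum(sample) % 6

sample = list(range(100))   # hypothetical preprocessed input
runs = 100
start = time.perf_counter()
for _ in range(runs):
    label = classify_stub(sample)
avg_s = (time.perf_counter() - start) / runs   # average over 100 tasks, as in Table 7
```

With a 5 Hz frame rate, any average inference time well below 200 ms per sample leaves the system comfortably real-time, which is the comparison the text draws.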

Conclusions
In this paper, a room-level fall detection scheme based on IR-UWB monostatic radar was proposed. After preprocessing and data augmentation of the raw data, the fall detection problem was transformed into a spatiotemporal sequence classification problem. To solve this problem, we devised a deep learning scheme based on the combination of CNN and 1D ConvLSTM layers. We evaluated the performance of our method against the CNN-based methods that have been used in similar studies. In terms of accuracy, sensitivity, and specificity, the experimental results suggest that the proposed method has a better and more robust performance than the CNN-based methods. Further tests in a lounge environment showed that the model could still detect falls with high sensitivity in heavily furnished environments. The experiments showed that the proposed method can provide accurate real-time fall detection with a low risk of exposing the identity of the users. The proposed method is therefore well suited for deployment in private rooms.
In future work, the influence of the environment should be further investigated to enhance the robustness of the extracted features.