Fall Detection with CNN-Causal LSTM Network

Falls are one of the main causes of elderly injuries. If the faller can be found in time, further injury can be effectively avoided. In order to protect personal privacy and improve the accuracy of fall detection, this paper proposes a fall detection algorithm using a CNN-Causal LSTM network based on three-axis acceleration and three-axis rotation angular velocity sensors. The neural network in this system includes an encoding layer, a decoding layer, and a ResNet18 classifier. The encoding layer includes three layers of CNN and three layers of Causal LSTM; the decoding layer includes three layers of deconvolution and three layers of Causal LSTM. The decoding layer maps the spatio-temporal information to a hidden-variable output that is better suited to the classification network, and this output is classified by ResNet18. Moreover, we used the public data set SisFall to evaluate the performance of the algorithm. The experimental results show that the algorithm achieves a high accuracy of 99.79%.


Introduction
Falls are one of the main causes of elderly injuries [1]. According to the World Health Organization (WHO), about 30% of people over 65 fall every year, causing more than 300,000 deaths [2]. Falls harm the elderly not only physically but also mentally [3]. Furthermore, medical analyses have demonstrated that the severity of fall injuries depends heavily on the response and rescue time [4]. Therefore, alerting emergency services and the hospital promptly when a fall occurs is of great significance for protecting the health of the elderly.
In order to reduce the risk of injury caused by falls, researchers have developed different protection measures, ranging from fall detection to fall prevention [5]. The main goal of fall detection systems is to distinguish between fall events and activities of daily living (ADLs) [6]. Some ADLs, such as moving from a standing or sitting position to a lying position, are similar to falls; therefore, it is difficult to develop a method that identifies them accurately. Beyond the accuracy and timeliness of the fall detection algorithm, the most important issues are user convenience and privacy.
In this research, from the perspective of versatility and privacy, a fall detection method is proposed based on three-axis acceleration and three-axis rotation angular velocity sensors that preserves privacy. The algorithm is based on a CNN-Causal LSTM, which includes an encoding layer, a decoding layer, and a ResNet18 classifier [7].
The organization of this paper is as follows. In Section 2, we discuss the existing methods for fall detection using wearable devices, with a focus on deep learning research. In Section 3, we introduce the details of the collected data set, data preprocessing steps, and CNN-Casual LSTM. The evaluation settings and results are presented and discussed in Section 4. The conclusion of this article is given in Section 5.

Related Work
Nowadays, existing fall detection technology can be roughly divided into three categories [4]: vision-based sensors [8,9], ambient sensors [10,11], and wearable sensors [12,13]. Vision-based sensors obtain motion information from monitoring equipment, extracting the inclination of the human body image [14] or human skeleton annotations [15] from the captured video or images to detect whether a fall has occurred. Subhash Chand Agrawal [16] used an improved GMM to perform background subtraction to find the foreground object and judged falls by calculating the distance between the top and the middle center of the rectangle covering the human body. With such methods, the user does not need to wear extra equipment; however, the camera is easily blocked, and these methods invade the privacy of the subjects. To address this issue, Xiangbo Kong [17] used a depth camera to obtain skeleton images of a person standing or falling down and used the FFT to encrypt the images and detect falls. Ambient sensors often detect falls by collecting infrared [18], radar [19], or other signals from scene sensors. Tao Yang [20] used radars placed in three fixed positions to measure data and used the time-frequency distribution and range-time intensity as input for feature extraction to detect falls. Although this approach does not raise privacy concerns, it comes with a high cost, is vulnerable to noise, and has a relatively limited detection range. Wearable sensors rely on various low-cost sensors to detect falls [21]. Their detection capability relies on the sensor being worn at all times, but the elderly may not be able to wear them in some situations, such as while bathing. In addition, wearing a device may cause discomfort for some elderly people.
In recent years, due to the low cost of sensors, wearable sensors have attracted increasing attention. Wearable sensors are most commonly placed on the calf, spine, head, pelvis, and feet [22] to obtain the three-axis acceleration and, from a gyroscope, the three-axis rotation angular velocity at different positions. Kazi Md. Shahiduzzaman [23] used smart helmets integrating wearable cameras, accelerometers, and gyroscope sensors and processed the multi-sensor data at the edge. Seyed Amirhossein Mousavi [24] proposed a method that uses smartphone sensors and acceleration signals to detect falls and report the person's position, with an accuracy of 96.33%. Kimaya Desai [12] performed human fall detection by deploying a simple 32-bit microcontroller on a wearable belt. Threshold methods and machine-learning classification algorithms are widely used for fall detection in wearable devices [25]. Threshold methods are roughly divided into static threshold methods [26] and dynamic threshold methods [27]: when the threshold is crossed (either upward or downward), a fall is declared. Machine learning methods are mainly divided into traditional pattern recognition and deep learning-based classification and recognition [5]. Traditional recognition algorithms (such as the support vector machine [26] and the K-nearest neighbor algorithm [28]) rely on manual feature extraction, which places higher demands on researchers. First, it is necessary to identify the parts of the human body involved in the process of falling. Second, it is essential to consider how these features are distinguished from ADLs such as sitting and jumping, and manual feature extraction introduces considerable delay. Deep learning-based classification, which can extract feature information automatically, is currently more commonly used in fall detection algorithms.
Due to this advantage, deep learning methods have become increasingly popular in the research community. They have been used in numerous areas in which they have performed on par with human experts. Generally, deep learning methods using wearable sensor data preprocess the acquired signals, extract features from signal segments, and train a model that uses these features as input [29]. Therefore, existing research on fall risk assessment from wearable sensor data mainly focuses on feature engineering and optimization; the extracted features are used as input to different deep learning algorithms to predict the occurrence of falls. Mirto Musci [30] used a fall detection algorithm based on LSTM, taking a long time sequence as input and automatically extracting temporal information. He Jian [31] used the FD-CNN network, preprocessing the input data into image format: the three-axis acceleration is normalized and mapped to the RGB channels, so that four hundred three-axis data windows can be regarded as 20 × 20 pixels from which spatial features are extracted automatically.

Materials and Methods
As shown in Figure 1, the framework proposed in this study comprises three main steps: motion and data acquisition, data pre-processing, and CNN-Causal LSTM algorithm-based feature extraction and classification. In detail, the framework consists of two stages. Stage 1: preprocessing the acquired data into a form suitable for applying the CNN-Causal LSTM. Stage 2: automatic feature extraction and learning of the changes in the data when a fall event occurs using the CNN-Causal LSTM algorithm, and using the trained CNN-Causal LSTM model to finally determine whether a fall occurs via image recognition and classification.

Figure 1. The proposed approach for fall detection by using a convolutional neural network.



Motion and Data Acquisition
We used the public data set SisFall [32] to test the accuracy and latency of the model. This data set uses a wearable embedded device to collect three-axis acceleration and three-axis angular velocity during volunteer activities; the device is worn at the waist. The data were provided by a total of 38 volunteers: (1) 23 young people and a 60-year-old judo athlete provided 19 types of non-fall and 15 types of fall data; (2) 15 types of non-fall data were provided by 14 healthy people over 62 years old.
The SisFall data set collects three-axis acceleration and three-axis rotation angular velocity information by fixing the embedded device on the volunteers' waists.

Data Pre-Processing
According to statistics, a human fall generally lasts less than 2 s [31]. Each three-axis accelerometer and three-axis gyroscope recording in SisFall lasts 12 s (sampling frequency 200 Hz, 2400 sampling points in total); thus, the data were divided into six pieces associated with time information. Each piece X (2 s) contains 400 sampling points:

  X_t = [X_t_accx, X_t_accy, X_t_accz, X_t_gyrox, X_t_gyroy, X_t_gyroz]

where X_t is the data point at time t; X_t_accx, X_t_accy, and X_t_accz represent the three-axis acceleration; and X_t_gyrox, X_t_gyroy, and X_t_gyroz represent the three-axis rotation angular velocity. We rearrange the feature values based on measurement types and axes. For each group, we reshape the 400 sampling points to 20 × 20 to obtain the spatial relationship between the following values:

  Y = [F_accx, F_accy, F_accz, F_gyrox, F_gyroy, F_gyroz]

where F_accx, F_accy, and F_accz are the feature vectors of the three-axis acceleration; F_gyrox, F_gyroy, and F_gyroz are the feature vectors of the three-axis rotation angular velocity; and Y is the aggregated output. Therefore, after processing, the output has dimensions of 6 × 6 × 20 × 20 and is used as the input of the subsequent network.
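The windowing and reshaping described above can be sketched as follows. The exact channel ordering inside the SisFall files is an assumption here, but the shape arithmetic (2400 samples → six 2 s windows → a 20 × 20 grid per channel) follows the text:

```python
import numpy as np

def preprocess(recording):
    """Split a 12 s, 200 Hz recording (2400 samples x 6 channels:
    acc x/y/z, gyro x/y/z) into six 2 s windows and reshape each
    channel's 400 samples into a 20x20 grid."""
    assert recording.shape == (2400, 6)
    windows = recording.reshape(6, 400, 6)   # six 2 s pieces of 400 samples
    windows = windows.transpose(0, 2, 1)     # (window, channel, 400)
    return windows.reshape(6, 6, 20, 20)     # (window, channel, 20, 20)

out = preprocess(np.zeros((2400, 6)))
```

The resulting 6 × 6 × 20 × 20 tensor matches the network input dimensions stated above.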

CNN-Casual LSTM Algorithm-based Feature Extraction and Classification
The architecture of the neural network in this paper is shown in Figure 2. The network is divided into an encoding layer, a decoding layer, and a classification layer. The encoding layer includes three layers of CNN, three layers of Causal LSTM, and a single layer of GHU; through a series of transformations, it converts the data dimension from [6,6,20,20] to [6,10,10,128]. The decoding layer includes three layers of deconvolution and three layers of Causal LSTM; it restores the data through deconvolution and transforms the data dimension into [1,6,20,20]. The decoding layer outputs a hidden variable that is classified by ResNet18. Each layer is introduced in detail below.



Causal LSTM
Causal LSTM is a gated recurrent neural network (gated RNN) proposed by Yunbo Wang [33] from the Tsinghua team in 2018. Causal LSTM has a three-layer structure. With the same number of levels (about 8000 samples), more non-linear operations are added to extract features, and the dual memories are linked in a cascaded manner. Therefore, compared with traditional LSTM, Causal LSTM obtains stronger spatial correlation and short-term dynamic modeling abilities, making it better suited to capturing short-term dynamic changes and emergencies.
The structure of Causal LSTM is shown in Figure 3. Compared with LSTM, all gates of Causal LSTM are jointly determined by X, H, and C. The "input gate" controls the information added to the cell; the "forget gate" determines the information to be discarded; and the "output gate" determines the final output.

Causal LSTM has a three-layer structure. The output of the first layer is C_t^k, and it is determined by the input X_t, the output response H_{t−1}^k, and C_{t−1}^k, where k indexes the hidden layer and C is the temporal state including temporal dimension information:

  g_t = tanh(W_1 * [X_t, H_{t−1}^k, C_{t−1}^k])
  i_t = σ(W_1 * [X_t, H_{t−1}^k, C_{t−1}^k])
  f_t = σ(W_1 * [X_t, H_{t−1}^k, C_{t−1}^k])
  C_t^k = f_t ⊗ C_{t−1}^k + i_t ⊗ g_t

where * is the convolution operation; σ(·) is the Sigmoid function, σ(x) = 1/(1 + e^(−x)); W_1 is a convolutional filter; ⊗ calculates the Hadamard product between vectors; f_t is a forget gate; i_t is an input gate; and g_t is an intermediate long-term memory state.

The output of the second layer is M_t^k, determined by X_t, C_t^k, and the previous layer's M_t^{k−1}. M determines the spatial state of the cell and contains spatial dimension information:

  g_t' = tanh(W_2 * [X_t, C_t^k, M_t^{k−1}])
  i_t' = σ(W_2 * [X_t, C_t^k, M_t^{k−1}])
  f_t' = σ(W_2 * [X_t, C_t^k, M_t^{k−1}])
  M_t^k = f_t' ⊗ tanh(W_3 * M_t^{k−1}) + i_t' ⊗ g_t'

where f_t' is a forget gate, i_t' is an input gate, and g_t' is an intermediate long-term memory state. The output of the third layer is the output of the cell and includes both temporal and spatial state information:

  o_t = tanh(W_4 * [X_t, C_t^k, M_t^k])
  H_t^k = o_t ⊗ tanh(W_5 * [C_t^k, M_t^k])

where the output H_t^k is determined by the temporal state C_t^k, the spatial state M_t^k, and the input X_t at time t, and o_t is an intermediate state (the output gate).
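As a rough illustration of the three-stage update just described, here is a dense (fully connected) sketch in which the convolutions are replaced by matrix products and the Hadamard products become elementwise multiplications; the weight shapes and this dense simplification are our assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def causal_lstm_step(x, h_prev, c_prev, m_below, W1, W2, W3, W4, W5):
    """Dense sketch of one Causal LSTM step. Stage 1 updates the temporal
    memory C; stage 2 updates the spatial memory M cascaded from the layer
    below; stage 3 fuses both memories into the output H."""
    d = x.size
    # stage 1: temporal memory
    z1 = W1 @ np.concatenate([x, h_prev, c_prev])
    g, i, f = np.tanh(z1[:d]), sigmoid(z1[d:2*d]), sigmoid(z1[2*d:])
    c = f * c_prev + i * g
    # stage 2: spatial memory, conditioned on the fresh C
    z2 = W2 @ np.concatenate([x, c, m_below])
    g2, i2, f2 = np.tanh(z2[:d]), sigmoid(z2[d:2*d]), sigmoid(z2[2*d:])
    m = f2 * np.tanh(W3 @ m_below) + i2 * g2
    # stage 3: output gate fuses both memories
    o = np.tanh(W4 @ np.concatenate([x, c, m]))
    h = o * np.tanh(W5 @ np.concatenate([c, m]))
    return h, c, m

rng = np.random.default_rng(0)
d = 4
x, h0, c0, m0 = (rng.standard_normal(d) for _ in range(4))
W1 = rng.standard_normal((3 * d, 3 * d))
W2 = rng.standard_normal((3 * d, 3 * d))
W3 = rng.standard_normal((d, d))
W4 = rng.standard_normal((d, 3 * d))
W5 = rng.standard_normal((d, 2 * d))
h, c, m = causal_lstm_step(x, h0, c0, m0, W1, W2, W3, W4, W5)
```

Note how C is computed first and then feeds the update of M, which is the cascaded linkage of the dual memories mentioned above.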
Causal LSTM still suffers from gradient problems in back-propagation over long sequences. In particular, due to the long transitions, temporal memory information may be forgotten, especially when processing information with periodic motions or frequent occlusions.
We need an information highway to learn skip-frame relations. Gradient Highway Unit (GHU) is a "high-speed channel" in neural networks that can effectively transmit gradients in very deep networks and then prevent long-term gradient dispersion.
The structure of GHU is shown in Figure 4. Its inputs are the output of the current lower layer, X_t, and the GHU state at the previous moment, Z_{t−1}. Connecting the inputs of the current and previous moments shortens the propagation distance.
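A minimal dense sketch of the GHU update, following the common gradient-highway formulation (a switch gate blends a transformed input with the previous highway state); the dense weights are an assumption for illustration of the idea, not the paper's convolutional implementation:

```python
import numpy as np

def ghu_step(x, z_prev, Wp, Ws):
    """One GHU step: a learned switch S blends the transformed input P
    with the highway state Z carried over from the previous moment."""
    inp = np.concatenate([x, z_prev])
    p = np.tanh(Wp @ inp)                    # candidate update
    s = 1.0 / (1.0 + np.exp(-(Ws @ inp)))    # switch gate in (0, 1)
    return s * p + (1.0 - s) * z_prev        # gradient highway blend

rng = np.random.default_rng(1)
d = 4
x, z0 = rng.standard_normal(d), rng.standard_normal(d)
Wp = rng.standard_normal((d, 2 * d))
Ws = rng.standard_normal((d, 2 * d))
z1 = ghu_step(x, z0, Wp, Ws)
```

When the switch gate is near zero, Z_{t−1} passes through almost unchanged, which is what lets gradients travel far back in time without vanishing.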

Proposed CNN-Causal LSTM Network
This paper designs a fall detection model based on the CNN-Causal LSTM network. It mainly consists of three parts:


Encoding Layer
As shown in Figure 5, this layer takes data from 6 moments as input and has a 3-layer network structure, including CNNs, Causal LSTMs, and a GHU high-speed channel. The first layer uses a CNN to sample the data at each moment with 1 × 1 convolutions. Then, through the Causal LSTM layer, a series of linear and nonlinear transformations are performed on the sampling results according to Equations (3)–(8). The latter two layers use CNNs to perform 3 × 3 downsampling, extracting feature information from the input and the spatial state of the previous moment, and extracting spatio-temporal information from the reduced-dimensional representation. The three-tier cascade structure ensures that the model can learn enough spatio-temporal features. The weight matrices inside the network are optimized iteratively during the training phase and learn spatio-temporal characteristics. The GHU network is inserted between the first and second layers and directly transmits the information obtained from the first layer to the next moment, which can effectively prevent gradient dispersion caused by the deep network.
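The 20 × 20 → 10 × 10 reduction stated earlier is consistent with a 3 × 3 convolution using stride 2 and padding 1 (the stride and padding values are our assumption); the standard output-size formula can be checked directly:

```python
def conv_out(size, kernel=3, stride=2, padding=1):
    """Output spatial size of a convolution (floor convention)."""
    return (size + 2 * padding - kernel) // stride + 1

# one stride-2 3x3 convolution halves the 20x20 grid to 10x10
print(conv_out(20))  # -> 10
```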


Decoding Layer
This layer also takes the form of a three-layer cascade, shown in Figure 6, including deconvolutional layers, Causal LSTMs, and a GHU high-speed channel in the reverse order of those in the encoding layer. According to Equations (9)–(12), the spatio-temporal information obtained by the encoding layer is processed and restored twice by the Causal LSTM and deconvolution layers. It is then processed through a GHU layer in order to directly obtain the information of the previous layer, and finally through a Causal LSTM layer. The three-layer cascade structure ensures that the spatio-temporal information is mapped to a hidden-variable output that is better suited to the classification network.
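Conversely, restoring the 10 × 10 maps back to 20 × 20 in the deconvolution layers is consistent with a 3 × 3 transposed convolution with stride 2, padding 1, and output padding 1 (again our assumed hyperparameters), using the PyTorch size convention:

```python
def deconv_out(size, kernel=3, stride=2, padding=1, output_padding=1):
    """Output spatial size of a transposed convolution (PyTorch convention)."""
    return (size - 1) * stride - 2 * padding + kernel + output_padding

# one stride-2 3x3 transposed convolution restores 10x10 to 20x20
print(deconv_out(10))  # -> 20
```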

Classification Layer
This layer classifies the output of the decoding layer as falling or non-falling. This article uses the standard ResNet18 network provided by PyTorch to perform the classification task. The network contains 17 convolutional layers and one fully connected layer.

Experiments
In order to ensure the integrity of the information, we use a 12 s time window to extract data for SisFall. After processing the data set, 1798 cases of falls and 6146 cases of non-falls are obtained.
The hardware of this experiment includes an Intel Core i7-9700 processor and an NVIDIA GeForce GTX 1660 graphics card. The software environment is Python 3.7.0 with PyTorch 1.2 and CUDA 10.2.

Ablation Experiments
The experimental data set is divided into an 80% training set and a 20% test set. The ablation experiment in this article consists of two parts. First, using structures similar to the one shown in Figure 5, we conducted experiments with networks of k = 2–4 layers to verify the rationality of the network structure proposed in this paper. The k = 2 structure is similar to the structure in Figure 5; compared with the k = 3 structure, it has one fewer CNN convolutional layer and Causal LSTM cell in the encoding layer. Conversely, the k = 4 structure adds one layer to the structure of this article, and that layer includes one more deconvolution in order to keep the same number of channels. In addition, the number of decoding layers corresponds to the number of encoding layers. Second, we use the ST-LSTM [34] unit proposed by Ashesh Jain as an alternative to the Causal LSTM unit: we replace every Causal LSTM unit with an ST-LSTM unit while keeping the other parameters unchanged, in order to explore the effectiveness of the Causal LSTM unit.
The experimental results in Table 1 show that, although ST-LSTM can also raise an alert for every fall event, Causal LSTM achieves better performance and a lower false alarm rate. In addition, the two-layer structure's ability to extract spatial information is clearly inferior to that of the three-layer structure: its training converges slowly, and the sensitivity of the results decreases (fall data misjudged as non-falls), which can lead to dangerous outcomes. The four-layer structure achieves equally excellent results, but we finally chose the three-layer structure after considering its better operating efficiency. Overall, the network proposed in this article is very sensitive to falls and has high recognition accuracy. After reviewing the three misclassified cases, we found that they all come from young people and include actions such as squatting or squatting after taking off, which shows that our network cannot accurately identify such behaviors.

Experimental Results
Table 2 provides the performance of the algorithms on three indicators. We selected RNN, LSTM [35], and a convolutional neural network (FD-CNN [31]) as baselines. According to the data in Table 2, the ACC of this model is 99.79%, the SEN is 100%, and the SPE is 99.73%. It can be observed that the network proposed in this paper is superior to the other methods on all three metrics.
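For reference, ACC, SEN, and SPE are computed from the confusion matrix in the usual way; the counts below are hypothetical, chosen only to illustrate that SEN = 100% corresponds to zero missed falls:

```python
def metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity (recall on falls), specificity (recall on non-falls)."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)
    spe = tn / (tn + fp)
    return acc, sen, spe

# hypothetical confusion-matrix counts: fn=0 gives SEN = 100%
acc, sen, spe = metrics(tp=100, tn=896, fp=4, fn=0)
```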
Among the baselines, the ACC of LSTM [35] is 99.58% and its SEN is 99.27%, both slightly lower than those of our method. We think this may be due to the network structure of LSTM: when dealing with long sequences, LSTM tends to pay more attention to adjacent information and is very prone to gradient dispersion, so long data sequences are underutilized. In this paper, the 12 s sequences are processed into six groups of 2 s data, and the 400 samples of each group are arranged as a 20 × 20 grid in space; convolution is then used to extract features. The feature extraction process therefore does not rely on a single time scale: long-interval features capture the overall action of the falling process, while short-interval features capture the details of the fall in progress. The model thus takes into account receptive fields of different time scales while avoiding gradient dispersion.
In addition, FD-CNN [31] treats four hundred three-axis samples as a 20 × 20 × 3 image input. It represents the original coordinate-axis information as spatial information, with each group of data spanning 2 s. However, a fall is a continuous process that may include leaning forward, bending the lower limbs, and rebounding from contact with the ground. This means that if a fall spans the end of one data group and the beginning of the next, part of the information describing the fall process may be missing, and the lack of temporal information may cause a misjudgment. This may be the reason why its ACC is 97.47%. The spatio-temporal network model proposed in this paper extracts the spatial features of each input through the encoding layer while retaining the influence of temporal information through the GHU fast channel. It can therefore make full use of both spatial and temporal information, giving it a high recognition rate for fall behavior: for every fall occurrence, it can trigger an alarm, which improves the safety of the elderly.

Conclusions
In this work, we studied a network model that relies on sensor input, compared it with other networks, and observed that the CNN-Causal LSTM network extracts temporal and spatial information better, thereby improving the accuracy of fall detection. The detection accuracy of the algorithm in this paper is 99.79% (0.21% false-positive rate and 0% false-negative rate). Its detection performance is better than that of the other methods, and it detects every fall occurrence, which further demonstrates the robustness and stability of this method. However, this work still has several directions worth improving. First, limited by the public data set, the training data are highly imbalanced: fall data account for only about 27% of all data. Second, the proposed network structure is somewhat redundant; in order to process the 12 s recordings of the public data set, six sets of 2 s data are used, which makes the training process of the entire network very long. In addition, several aspects of this work can be studied further. First, the application scenario of this article is a nursing home; however, considering the risk of falling, elderly data account for a relatively small proportion of the public data set. In future work, if permitted, we can collect more fall data from the elderly and retrain the network structure of this article. Second, the network structure is relatively complicated, and more convenient channels for transmitting information should be explored in the future.