Deep Learning Based Fall Detection Algorithms for Embedded Systems, Smartwatches, and IoT Devices Using Accelerometers

Abstract: A fall of an elderly person often leads to serious injuries or even death. Many falls occur in the home environment and remain unrecognized. Reliable fall detection is therefore essential for providing fast help. Wrist-worn, accelerometer-based fall detection systems have been developed, but their accuracy and precision are not standardized, are hard to compare, and are sometimes not even known. In this work, we present an overview of existing public databases with sensor-based fall datasets and harmonize existing wrist-worn datasets for a broader and more robust evaluation. Furthermore, we analyze the recognition rate currently achievable for fall detection using deep learning algorithms on mobile and embedded systems. The presented results and databases can be used for further research and optimization in order to increase the recognition rate and thereby support the independent life of the elderly. Finally, we give an outlook on a convenient application and wrist device.


Introduction
The independent life of an elderly person can change drastically after a fall. Depending on the health condition of the person, almost 10 percent of those who fall suffer serious injuries or might even die directly after the fall if no immediate help is available [1]. To prevent the severe consequences of such falls, reliable fall detection is needed. One common approach is to use wrist-worn detection systems that measure acceleration forces. These wrist devices are gaining acceptance across the population and are becoming powerful enough, in terms of computational performance, that the use of artificial intelligence is reasonable. In general, older adults appear to be interested in using such devices, although they express concerns over privacy and over understanding exactly what the device is doing at specific times [2]. The evaluation of mobile fall detection systems is challenging because live data from falls of elderly people are rare. Boyle et al. tried to collect real-time data from 15 adults over the course of 300 days and were only able to record four falls during that time [3]. Even simulated data are scarce, and the available datasets differ in their characteristics. Therefore, the aim of this paper is
• to present existing public datasets of fall detection,
• to describe the harmonization process for these datasets,
• to state the current accuracy of fall detection for tiny, mobile, and embedded systems using deep learning algorithms,
• to invite and motivate researchers to compete and contribute in fall detection by using the existing databases and to provide various achievements for the future life of the elderly.

Related Work
Falls can be detected through several approaches and technologies. Some approaches are infrastructure-based and use external cameras [4] or floor sensors [5,6], while others analyze data from body-worn sensors, e.g., accelerometers, gyroscopes, and air pressure sensors. Each system records a sensor data stream that can be analyzed to recognize a fall [2]. Various strategies have been employed to distinguish between fall and non-fall events. Often, datasets are used to train machine learning algorithms. Other approaches do not rely on datasets, as they construct rules to distinguish between those events. For the learning-based approaches, the quality of the datasets constrains the quality of the trained fall detection algorithm.

Datasets
The availability of fall data is essential for the development of fall recognition systems. At first, datasets were scattered throughout the scientific community until Casilari et al. [7] provided a comprehensive overview of publicly available fall detection datasets. Since then, new datasets have been published. We extended the overview of Casilari et al. by adding recently published datasets to our overview table (see Table 1) with their corresponding characteristics (see Table 2).

Position of Sensors
More than half (10 out of 19) of the datasets described in Table 1 contain waist-worn acceleration data, while only seven datasets contain wrist-worn acceleration data. We do not focus on thigh datasets due to their limited practicability in daily life. We suspect that a wrist- or waist-worn acceleration sensor is much easier to use and provides greater wearing comfort. Pannurat et al. pointed out that most fall detection systems use sensors mounted to the torso [25]. Krupizer et al. recently provided an overview of the most common positions for wearable sensors in fall detection systems [26]. They confirmed that the waist is a popular position for most automatic fall detection systems.

Wrist vs. Waist
Liao et al. showed in [27] that their approach performed best on the human activity recognition task when the data are obtained from a waist-worn accelerometer. In this paper, we nevertheless focus on wrist-worn data, as we strongly believe that sensor solutions for the wrist are more convenient for the end user and may be adopted earlier. Furthermore, a fall detection algorithm may be easily integrated into currently available smartwatch or fitness tracker products.

Fall Detection Algorithms
Several fall detection algorithms have been proposed in the last decade. These algorithms range from simple threshold-based approaches [28,29], through machine learning algorithms based on handcrafted features [30,31], to deep neural networks that extract features automatically [24,32,33]. Most of the aforementioned datasets were published together with a detailed approach for detecting falls. Some of the relevant work may be found in Table 3. Note that the reported performance may not express the true capability of these approaches, as their validation methods mix users between the training and validation sets, resulting in a fragile evaluation process.

1D Convolutional Neural Networks
Convolutional neural networks (CNN) gained wide attention when the network of Krizhevsky et al. [40] won the ImageNet challenge [41] by a large margin in 2012. Since then, new CNN architectures, e.g., ResNet [42], have been introduced and have enhanced the performance of CNNs significantly. CNNs were mostly used in the image domain. In 2016, Wang et al. [43] introduced a fully convolutional neural network (FCN) architecture to classify time series data. They validated their approach on 44 datasets from the UCR/UEA archive. CNNs for time series processing have achieved state-of-the-art performance in various domains, e.g., ECG classification, sound classification, and natural language processing. The eponymous convolution can be seen as sliding a one-dimensional filter over the time series. Unlike for images, the filters cover only one dimension (time) as opposed to two dimensions (width and height). A convolutional neural network consists of various filters, ranging from moving average filters to more complex filters. Those filters are learned through the backpropagation algorithm. By passing a univariate time series through a convolutional neural network, multiple convolutions with different filters are applied. This may be seen as using a filter bank to extract useful features, remove outliers, or perform general filtering. Because the convolution itself is a linear operation, a nonlinear function is applied after each convolution to ensure a nonlinear transformation of the time series data. An illustration of a one-dimensional convolutional neural network is shown in Figure 1. Mathematically, a convolutional neural network applies the following equation for each timestamp t of the time series data X:

$$C_t = \sigma\left(\sum_{i=0}^{k-1} W_i \, X_{t+i} + b\right)$$

where W, k, t, and b denote the weights of the kernel, the length of the kernel, the timestamp, and the bias, respectively, $C_t$ is the filtered output at timestamp t, and σ is the nonlinear activation function.
After applying n convolutions to the input accelerometer data of length l, we obtain n channels, where each channel represents a new filtered time series. These n channels, with shape n × l, are then convolved with m different filters with shape n × m × k, where the i-th of the m filters is slid across all n channels. This results in m new time series, where the i-th output channel is the sum of the convolutions of the i-th filter across all n channels.
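The channel-wise summation described above can be illustrated with a minimal NumPy sketch. The function name, the array layout (m, n, k) for the filters, and the example sizes are assumptions made for illustration; they are not taken from the evaluated models.

```python
import numpy as np

def conv1d_multichannel(x, w, b):
    """Valid 1D convolution (cross-correlation, as in deep learning frameworks).

    x: input of shape (n, l)      -> n channels, l timestamps
    w: kernels of shape (m, n, k) -> m filters, each spanning all n channels
    b: biases of shape (m,)
    returns: array of shape (m, l - k + 1)
    """
    m, n, k = w.shape
    l = x.shape[1]
    out = np.empty((m, l - k + 1))
    for t in range(l - k + 1):
        window = x[:, t:t + k]  # (n, k) slice starting at timestamp t
        # Sum over channels and kernel taps for all m filters at once.
        out[:, t] = np.tensordot(w, window, axes=([1, 2], [0, 1])) + b
    return np.maximum(out, 0.0)  # ReLU nonlinearity

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 500))   # hypothetical: n = 4 channels, l = 500 timestamps
w = rng.normal(size=(6, 4, 8))  # hypothetical: m = 6 filters, kernel length k = 8
b = np.zeros(6)
y = conv1d_multichannel(x, w, b)  # shape (6, 493)
```

Each of the m output channels is again a time series, so the same operation can be stacked layer by layer.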

Recurrent Neural Networks
Recurrent Neural Networks (RNN) are designed to work with sequential data such as time series. Their recurrent mechanism ensures that features in the time series data are learned jointly over time. To train a recurrent neural network, the backpropagation through time algorithm is used. Recurrent neural networks emerged in 1986 [44] and underlie various advances. One major drawback of traditional recurrent neural networks is the vanishing or exploding gradient problem, which occurs because the gradient is multiplied repeatedly over time and becomes smaller or larger with each multiplication. For this reason, training a recurrent neural network was a hard task. Long Short-Term Memory (LSTM) [45] neural networks mitigated this problem by introducing gating mechanisms such as forget gates. This mechanism enabled the effective training of recurrent neural network models.

Long Short-Term Memory
The major difference from traditional RNN models is what a Long Short-Term Memory (LSTM) learns. It can learn to keep only the information relevant for making predictions and to forget irrelevant data. This renders the LSTM easier to train and lets it perform better on classification or regression tasks. The selection is made through a sigmoid-like function, which outputs a value between zero (discard) and one (keep).
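The keep-or-discard behaviour of such a gate can be sketched in a few lines of NumPy. This is a simplified illustration of a forget gate only; the input and output gates of a full LSTM cell are omitted, and all names and values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forget_step(c_prev, h_prev, x_t, W_f, b_f):
    """Scale the previous cell state by a gate value between 0 (discard) and 1 (keep)."""
    f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]) + b_f)
    return f_t * c_prev, f_t

c_prev = np.array([1.0, 1.0])    # previous cell state with two memory units
h_prev = np.zeros(2)             # previous hidden state
x_t = np.zeros(1)                # current input (e.g., one SMV value)
W_f = np.zeros((2, 3))           # gate weights, zero here so the bias decides
b_f = np.array([10.0, -10.0])    # unit 0 is biased to keep, unit 1 to discard
c_new, f_t = forget_step(c_prev, h_prev, x_t, W_f, b_f)
```

With these hand-picked biases, the first memory unit is retained almost completely while the second is almost erased; in a trained network the gate values depend on the input and the hidden state.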

Data Preparation
In order to combine datasets containing wrist-worn accelerometer data, a series of steps is required before they can be used for training. Due to different sampling rates and sensors, several datasets need special handling. Some datasets include a label for each timestamp, while others provide each sample as a text file with the label being part of the file's name. For each recording (containing either falls or activities of daily living) in a dataset, we segment the recording into non-overlapping windows of 10 s and down- or upsample each window to 50 Hz. The range and quantization of the raw data remain unchanged. The variety in coupling between human body and sensor, specifically the attachment of the sensor at the wrist, and the different sensor weights are not considered. Each fall type is labeled as fall and every other activity is labeled as not fall.
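The segmentation into non-overlapping 10 s windows can be sketched as follows. The function name is hypothetical, and dropping an incomplete trailing window is an assumption; the text does not state how leftover samples are handled.

```python
import numpy as np

def segment_windows(recording, fs, window_s=10):
    """Split a (T, 3) recording into non-overlapping windows of window_s seconds.

    Assumption: samples that do not fill a complete window at the end are dropped.
    """
    win = int(round(fs * window_s))
    n_windows = recording.shape[0] // win
    trimmed = recording[:n_windows * win]
    return trimmed.reshape(n_windows, win, recording.shape[1])

# Hypothetical 25 s recording sampled at 31 Hz with x, y, z axes.
recording = np.arange(775 * 3, dtype=float).reshape(775, 3)
windows = segment_windows(recording, fs=31)  # shape (2, 310, 3)
```

Each window would subsequently be resampled to 50 Hz, i.e., to 500 timestamps.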

Down and Upsampling Technique
In order to perform robust down- and upsampling, we use a polyphase filtering approach provided by the Python SciPy 1.2.3 package [46]. The acceleration signal is upsampled by a factor up, a zero-phase low-pass FIR filter is applied, and the result is then downsampled by the factor down. The resulting sample rate is up/down times the original sample rate. We pad the signal boundaries by fitting a line at the start and end of the signal to avoid overshooting at the boundaries. We are aware that down- and upsampling affects the data quality to some degree, but this limitation is necessary to obtain a comparable dataset.
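A minimal sketch of this resampling step with `scipy.signal.resample_poly` is given below. The helper name and the reduction of the rate ratio via `Fraction` are illustrative assumptions. Note that the `padtype="line"` option used here is, to our knowledge, only available in SciPy 1.4.0 and later, so the exact call used with SciPy 1.2.3 may have differed.

```python
from fractions import Fraction

import numpy as np
from scipy.signal import resample_poly

def resample_to(window, fs_in, fs_out=50):
    """Resample a (T, 3) window from fs_in to fs_out (both integer rates in Hz)."""
    ratio = Fraction(fs_out, fs_in)          # e.g. 50/31 stays 50/31, 50/20 becomes 5/2
    up, down = ratio.numerator, ratio.denominator
    # Upsample by `up`, low-pass filter, downsample by `down`;
    # boundaries are extended linearly to limit edge overshoot.
    return resample_poly(window, up, down, axis=0, padtype="line")

rng = np.random.default_rng(0)
window_31hz = rng.normal(size=(310, 3))      # hypothetical 10 s window at 31 Hz
resampled = resample_to(window_31hz, fs_in=31)  # shape (500, 3)
```

The same helper maps 200 timestamps at 20 Hz, 180 timestamps at 18 Hz, and 250 timestamps at 25 Hz to 500 timestamps each.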

SmartFall, Smartwatch, and Notch
For the SmartFall, Smartwatch, and Notch datasets, the following steps are applied:

1. Cluster timestamps containing a fall to create separated 10 s segments.

2. Each segment is upsampled from 310 timestamps (31 Hz) to 500 timestamps (50 Hz).

After applying this strategy, SmartFall, Smartwatch, and Notch consist solely of fall samples. We do not use samples in which a fall is mixed with an activity of daily living and which are labeled as non fall. This would result in an ambiguity, as we already use samples containing both a fall and activities of daily living that are labeled as fall. To enrich the Notch, Smartwatch, and SmartFall datasets with additional activities of daily living, we added 500 random 10 s segments from the RealWorld Human Activity Recognition [47] dataset to them.

MUMA, UP Fall, and Sim Fall
The MUMA dataset consists of separate files containing either falls or activities of daily living, where each file consists of 300 timestamps. We crop a 200 timestamp window (10 s) out of the time series and upsample it to 500 timestamps. To harmonize the UP Fall dataset, we apply a slightly different strategy, as this dataset does not offer a file for each fall or activity of daily living. A peak detection algorithm provided by the Python SciPy 1.2.3 package [46] is used to find the segments containing a fall, where we set the prominence to the 95% quantile of the Signal Magnitude Vector values. We crop a 180 timestamp window centered on each detected peak and upsample this window to 500 timestamps. The Sim Fall dataset consists of separate files containing either falls or activities of daily living. We crop a 250 timestamp window centered on the midpoint of the corresponding time series.
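The peak-based cropping described for the UP Fall dataset might look like the following sketch, which uses `scipy.signal.find_peaks`. The function name, the clamping of windows at the recording boundaries, and the synthetic test signal are assumptions made for illustration.

```python
import numpy as np
from scipy.signal import find_peaks

def crop_around_peaks(acc, window=180):
    """Crop fixed-size windows centered on prominent SMV peaks of a (T, 3) recording."""
    smv = np.sqrt((acc ** 2).sum(axis=1))
    prominence = np.quantile(smv, 0.95)           # 95% quantile of the SMV values
    peaks, _ = find_peaks(smv, prominence=prominence)
    half = window // 2
    crops = []
    for p in peaks:
        # Assumption: windows are shifted inwards if a peak lies close to a boundary.
        start = min(max(p - half, 0), len(acc) - window)
        crops.append(acc[start:start + window])
    return peaks, crops

# Synthetic recording: gravity on the z-axis and one impact spike at index 400.
acc = np.zeros((1000, 3))
acc[:, 2] = 1.0
acc[400, 2] = 6.0
peaks, crops = crop_around_peaks(acc)
```

On this synthetic signal a single peak is found, and the cropped window places it at its center.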

Harmonized Dataset
The resulting harmonized dataset consists of 1716 falls and 3567 activities of daily living, where each sample consists of 500 timestamps with x-, y-, and z-axis acceleration values. This yields a total tensor shape of 5283 × 500 × 3 for our experiments. Table 4 shows the number of falls and activities of daily living for each dataset. Note that the number of activities of daily living is significantly larger than the number of falls. To compensate for this imbalance, we apply a class weight to the cross entropy loss function, calculated from the ratio of falls to activities of daily living. Note that we did not integrate the TST Fall Detection dataset, as it is no longer available and it is not clear which sensor was used on the wrist and which on the waist.
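One common way to derive such class weights is to weight each class inversely to its frequency. The sketch below uses the sample counts from the text; the exact weighting formula is an assumption, since the text only states that the weight is calculated from the class ratio.

```python
import numpy as np

n_fall, n_adl = 1716, 3567
counts = np.array([n_adl, n_fall])            # index 0 = not fall, index 1 = fall
# Assumed scheme: inverse class frequency, normalized by the number of classes.
weights = counts.sum() / (len(counts) * counts)

def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of the per-sample negative log-likelihood."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    sample_weights = weights[labels]
    nll = -log_probs[np.arange(len(labels)), labels]
    return (sample_weights * nll).sum() / sample_weights.sum()

logits = np.zeros((4, 2))                     # an undecided classifier
labels = np.array([0, 0, 0, 1])
loss = weighted_cross_entropy(logits, labels, weights)
```

With this scheme, a misclassified fall contributes roughly twice as much to the loss as a misclassified activity of daily living (3567 / 1716 ≈ 2.08).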

Problem Formulation
The fall detection or fall classification problem may be formulated as a time series classification problem. A fall may be represented as a univariate time series $X = [x_1, x_2, \ldots, x_t]$, an ordered set of real values, or as a multivariate time series $M = [X_1, X_2, \ldots, X_n]$, where M consists of n different univariate time series. A dataset is formed by pairing each time series $X_i$ with a label $Y_i$ to a tuple $(X_i, Y_i)$. This label is a numerical representation of the class, either fall with $Y_i = 1$ or non fall with $Y_i = 0$. The task is to train a classifier on a dataset consisting of multiple time series in order to map the space of possible inputs to a probability distribution over the class variable values, referred to as labels.

Preprocessing
Contrary to other approaches, e.g., [32], we do not think that scaling or standardization should be applied to the input data, as the magnitude information is crucial for distinguishing between falls and activities of daily living. Nevertheless, to evaluate the impact of preprocessing on the performance, we also apply min-max scaling to the training and testing datasets.

Signal Magnitude Vector (SMV)
To reduce the computational load, we solely use the acceleration data, specifically the Signal Magnitude Vector. The Signal Magnitude Vector maps the three-dimensional acceleration vector $X \in \mathbb{R}^3$ at each timestamp to a single magnitude value that carries no orientation information. To compute the Signal Magnitude Vector, the following equation is applied at each timestamp t of the time series X:

$$\mathrm{SMV}_t = \sqrt{x_t^2 + y_t^2 + z_t^2}$$

where $x_t$, $y_t$, and $z_t$ are the acceleration values of the three axes. The Signal Magnitude Vector transformation reduces the input size for the neural networks by a factor of three and thus reduces the required memory and computational load on an IoT device. Note that running deep learning models on IoT devices requires a trade-off between classification performance and computational complexity.
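The Signal Magnitude Vector is the Euclidean norm of the three axis values at each timestamp. A minimal NumPy sketch, with a hypothetical function name and a small example batch, could look like this:

```python
import numpy as np

def signal_magnitude_vector(acc):
    """Reduce acceleration data of shape (..., 3) to its per-timestamp magnitude."""
    return np.sqrt(np.sum(acc ** 2, axis=-1))

# Two hypothetical samples with 500 timestamps and constant acceleration (3, 4, 0).
batch = np.tile(np.array([3.0, 4.0, 0.0]), (2, 500, 1))
smv = signal_magnitude_vector(batch)   # shape (2, 500), every value equals 5.0
```

Applied to the harmonized dataset, the same call would reduce the tensor from 5283 × 500 × 3 to 5283 × 500 values.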

Data Augmentation
Data augmentation can be viewed as an injection of prior knowledge about the invariance of the data under certain transformations. Augmented data can cover unexplored input space, prevent overfitting, and improve the generalization ability of a deep learning model [48]. Um et al. reported that time series augmentation improves the classification accuracy of neural network architectures from 77.54% to 86.88% in the domain of Parkinson's disease classification [49]. In the domain of fall detection, augmentation has already shown a large performance boost in other works [32]. Before augmentation, one requires knowledge about the semantic characteristics of the input data. Scaling of the input data, for example, may induce bias, as falls typically have a defined range of acceleration values. To evaluate the effect of data augmentation, we apply a combination of transformations dynamically during training. We suspect that shifting in the time dimension does not contribute much to the performance of a one-dimensional CNN, as the learned filters are translation-invariant by definition. However, when the vector is shifted (rolled) in the time dimension, elements that roll beyond the last position are re-introduced at the first position. This induces some additional variance in the data and may affect the performance of the neural networks.
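The wrap-around behaviour of the time shift can be reproduced with `numpy.roll`. The helper below is a sketch; the function name and the maximum shift of 50 timestamps are hypothetical choices, not values from the experiments.

```python
import numpy as np

def random_time_shift(x, rng, max_shift=50):
    """Roll a time series along the time axis by a random number of timestamps."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    return np.roll(x, shift, axis=0), shift

# Elements pushed beyond the last position re-enter at the first position.
demo = np.roll(np.arange(5), 2)          # array([3, 4, 0, 1, 2])

rng = np.random.default_rng(0)
sample = rng.normal(size=500)            # hypothetical SMV sample
shifted, shift = random_time_shift(sample, rng)
```

Because the roll only reorders samples, the value distribution of the window is preserved, but an artificial discontinuity appears where the end of the signal meets its start.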

Model Architectures
To compare our IoT neural network to other neural network architectures, we chose a number of different architecture types. We used the PyTorch 1.2.0 framework [50] for training and testing our approach. All models were trained for 250 epochs with the Adam optimizer [51] with an initial learning rate of 0.001, a batch size of 32, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and no weight decay, using a weighted cross entropy loss function. As our convolutional neural networks use ReLU (Rectified Linear Unit) activation functions, all convolutional layers were initialized via He initialization [52]. The ReLU is a nonlinear activation function defined as σ(x) = max(0, x).

Classic 1D CNN
CNN-3B3Conv [32] consists of three layer blocks. The first block consists of three convolutional layers and one max-pooling layer. Each of the convolutional layers has 64 kernels of size 4, and the max-pooling size is 2. The second block also consists of three convolutional layers and one max-pooling layer, this time with a kernel size of 3, while the max-pooling size remains unchanged. The third block consists of three fully connected layers with 64 neurons, 32 neurons, and two neurons. To deal with the different input size (500 as opposed to 32), we changed the max-pooling size to 15.

1D ResNet
We use a small-scale 1D ResNet model consisting of two basic residual blocks. The model comprises a convolutional layer followed by two blocks, each with two convolutional layers and a shortcut connection. The kernel size is set to 8 for the first convolutional layer and to 4 for all subsequent convolutional layers. The classification head consists of a fully connected layer mapping 64 neurons to two neurons.

LSTM
The LSTM Classifier consists of a single unidirectional LSTM layer with 128 hidden neurons followed by a dropout layer with a dropout probability of p = 0.25. A fully connected layer is used to map the 128 hidden neurons to two neurons for classification.

Proposed Model Architecture
Our baseline CNN for fall classification includes several changes compared to Santos et al. [32]. As Santos et al. pointed out in their evaluation, a deeper network does not necessarily increase classification performance. Our approach to fall detection therefore greatly reduces the number of layers and filters of the convolutional neural network. Furthermore, we replaced the computationally expensive max-pooling layers with efficient strided convolutional layers [53]. By replacing the flattening layer and the subsequent fully connected layer with a global average pooling (GAP) layer followed by a 1 × 1 convolutional layer, we further reduced the number of trainable parameters significantly. To deploy our model effectively on IoT devices, we additionally use quantization, which reduces the footprint of the model drastically. After an extensive grid search, we settled on a structure in which each convolutional layer C(N), with N channels, uses a stride of 4 and a kernel size of 8 with no padding, except for the last convolutional layer, which uses a stride and kernel size of 1 with no padding.

Maxpooling vs. Strided Convolutions
Similar to max-pooling, strided convolutions reduce the size of the resulting feature map and thus the computational cost of processing the data. We replaced the max-pooling layers with convolutional layers with a stride of 4 to eliminate the computationally expensive calculation of the maximum value and to additionally reduce the number of dot products by a factor of 4. A strided convolution with stride n calculates the dot product only at every n-th timestamp and skips the n − 1 timestamps in between, resulting in a downsampling by a factor of n.
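The downsampling effect can be checked with a small NumPy sketch. It uses the kernel size of 8 and stride of 4 named in the text on a 500-timestamp input; the function names and the random kernel are illustrative assumptions.

```python
import numpy as np

def conv_output_length(length, kernel, stride):
    """Output length of a convolution without padding."""
    return (length - kernel) // stride + 1

def strided_conv1d(x, w, stride):
    """Compute the dot product only at every `stride`-th timestamp."""
    k = len(w)
    n_out = conv_output_length(len(x), k, stride)
    return np.array([np.dot(w, x[i * stride:i * stride + k]) for i in range(n_out)])

rng = np.random.default_rng(0)
x = rng.normal(size=500)              # one SMV window with 500 timestamps
w = rng.normal(size=8)                # kernel size 8, random stand-in weights

dense = strided_conv1d(x, w, stride=1)    # 493 dot products
strided = strided_conv1d(x, w, stride=4)  # 124 dot products
```

The strided result equals every fourth value of the dense result, so the layer computes roughly a quarter of the dot products and needs no separate pooling step.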

Global Average Pooling Layer
Global Average Pooling (GAP) allows the Convolutional Neural Network to process time series with different lengths and reduces the number of learned parameters significantly [54]. Furthermore, a GAP layer enables us to use Class Activation Maps for visualization. Global Average Pooling calculates the mean for each channel (activation map) resulting in a vector of averaged activation values of each channel.
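A minimal sketch of global average pooling shows why the length of the input no longer matters. The channel count of 16 and the two feature map lengths are hypothetical.

```python
import numpy as np

def global_average_pooling(feature_maps):
    """Average each channel of a (channels, time) feature map over time."""
    return feature_maps.mean(axis=-1)

rng = np.random.default_rng(0)
short_maps = rng.normal(size=(16, 30))    # feature maps from a short input
long_maps = rng.normal(size=(16, 120))    # feature maps from a longer input

pooled_short = global_average_pooling(short_maps)  # shape (16,)
pooled_long = global_average_pooling(long_maps)    # shape (16,)
```

Both inputs yield one value per channel, so the classification layer that follows always sees a vector of the same size and needs only one weight per channel and class.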

Learnable Parameters
The number of learnable parameters affects the training time, the computational complexity, and the footprint on the device. By replacing the flattening layer with GAP and the max-pooling layers with strided convolutions, we reduced the number of learnable parameters to around 5000, as opposed to 72,512 in [32], and further reduced the memory required for storing the activation maps (channels) significantly. Our model is approx. 14 times smaller than [32] and can easily be used on small-scale IoT devices, as it only uses approx. 20 KB of RAM for storing the weights and approx. 10 KB for storing the activation values of each channel.
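The stated size figures follow from simple arithmetic, sketched below. The parameter count of the proposed model is taken as exactly 5000 here, although the text only says "around 5000", so the results are approximate.

```python
params_reference = 72_512      # parameters reported for [32]
params_proposed = 5_000        # assumption: "around 5000" taken as exactly 5000

size_ratio = params_reference / params_proposed      # about 14.5
float32_bytes = params_proposed * 4                  # 20,000 bytes, approx. 20 KB
uint8_bytes = params_proposed * 1                    # 5,000 bytes after 8-bit quantization

print(f"ratio: {size_ratio:.1f}x, float32: {float32_bytes} B, uint8: {uint8_bytes} B")
```

The 20 KB figure therefore corresponds to storing each weight as a four-byte float.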

Quantization
Quantization greatly reduces the footprint of a model by converting the weights of the neural network from four-byte floats to one-byte unsigned integers. Wu et al. [55] showed that quantization incurs only a minor loss in accuracy while greatly reducing the computation time. We use a quantization method provided by the PyTorch 1.4.0 framework, which uses the min and max values to compute the necessary quantization parameters. To conserve space, we do not evaluate different quantization strategies.
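The idea of min/max-based quantization can be sketched in plain NumPy. This is an illustrative affine quantization scheme and not the PyTorch implementation itself; the function names and the random weights are assumptions.

```python
import numpy as np

def quantize_minmax(weights):
    """Map float weights to uint8 using scale and zero point derived from min and max."""
    w_min = min(float(weights.min()), 0.0)    # range is extended to include zero
    w_max = max(float(weights.max()), 0.0)
    scale = (w_max - w_min) / 255.0 or 1.0    # guard against an all-zero tensor
    zero_point = int(round(-w_min / scale))
    q = np.clip(np.round(weights / scale) + zero_point, 0, 255).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover approximate float weights from the uint8 representation."""
    return (q.astype(np.float32) - zero_point) * scale

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=5000).astype(np.float32)  # stand-in weights
q, scale, zero_point = quantize_minmax(weights)
restored = dequantize(q, scale, zero_point)
max_error = float(np.abs(weights - restored).max())            # at most about scale / 2
```

The quantized array occupies a quarter of the memory of the float32 weights, at the cost of a rounding error of up to about half a quantization step per weight.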

Evaluation Method
In order to employ a robust evaluation process, we apply Leave One Out (LOO) cross validation on whole datasets, e.g., using the MUMA dataset for testing and the remaining five wrist-worn datasets for training. We reset the weights and the optimizer parameters with each fold. This procedure is repeated until each of the six datasets has been used once as the testing dataset. To reduce a potential weight initialization bias, we repeated the LOO cross validation five times and averaged the results.
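The dataset-level splits can be enumerated with a short generator. The dataset names are taken from the data preparation sections; the function name is hypothetical, and model training is intentionally left out.

```python
datasets = ["SmartFall", "Smartwatch", "Notch", "MUMA", "UP Fall", "Sim Fall"]

def leave_one_dataset_out(names):
    """Yield (training datasets, test dataset) pairs, holding out one dataset at a time."""
    for test_name in names:
        train_names = [name for name in names if name != test_name]
        yield train_names, test_name

folds = list(leave_one_dataset_out(datasets))
for train_names, test_name in folds:
    # A fresh model and optimizer would be created here for every fold.
    print(f"test on {test_name}, train on {len(train_names)} datasets")
```

Because no recording of the held-out dataset is seen during training, users, sensors, and recording protocols of the test set are all unseen.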

Evaluation Metric
We assess the classification performance of our deep learning model by using the weighted $F_1$ score, precision, and recall. Note that, in the binary classification task, the recall of the positive class (fall) is referred to as sensitivity and the recall of the negative class (not fall) as specificity.
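These metrics can be derived from the four entries of a binary confusion matrix. In the sketch below, the weighted F1 score is interpreted as the support-weighted average of the per-class F1 scores, which is an assumption about the exact definition used; the counts are made up for illustration.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute sensitivity, specificity, precision, and weighted F1 from counts."""
    sensitivity = tp / (tp + fn)           # recall of the fall class
    specificity = tn / (tn + fp)           # recall of the not-fall class
    precision_fall = tp / (tp + fp)
    precision_adl = tn / (tn + fn)
    f1_fall = 2 * precision_fall * sensitivity / (precision_fall + sensitivity)
    f1_adl = 2 * precision_adl * specificity / (precision_adl + specificity)
    n_fall, n_adl = tp + fn, tn + fp       # class supports
    weighted_f1 = (n_fall * f1_fall + n_adl * f1_adl) / (n_fall + n_adl)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision_fall,
        "f1_fall": f1_fall,
        "f1_adl": f1_adl,
        "weighted_f1": weighted_f1,
    }

# Hypothetical counts: 100 falls and 200 activities of daily living.
metrics = binary_metrics(tp=90, fp=20, tn=180, fn=10)
```

With these counts, sensitivity and specificity are both 0.9, while the precision for falls is lower because the 20 false alarms are large relative to the 90 detected falls.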

Effect of Augmentation
In line with other work, we observed performance improvements when using data augmentation techniques for all tested models. The results achieved with data augmentation are reported in Tables 5 and 6. While our three-axis model improved in performance, our single-axis IoT-CNN achieved only little or negligible improvement. Larger models, e.g., CNN-3B3Conv, improved in performance when using dynamic data augmentation during both three-axis and single-axis training.

Learned Filters
To visualize the frequency response of the learned filters in the first convolutional layer, we use the freqz method provided by the Python package SciPy [46]. Many filters resemble low-pass, band-stop, and high-pass filters, as well as combinations of low-pass and high-pass filters. The learned kernel weights with their respective frequency responses are shown in Figures 2 and 3. Note that most learned filters activate on strong edge-shaped structures.
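The frequency response of a single kernel can be computed with `scipy.signal.freqz` as sketched below. Since the learned weights are not reproduced here, a moving average kernel of length 8 stands in for a learned low-pass filter; the sampling rate of 50 Hz matches the harmonized data.

```python
import numpy as np
from scipy.signal import freqz

fs = 50.0                         # sampling rate of the harmonized data in Hz
kernel = np.ones(8) / 8.0         # stand-in for a learned kernel: moving average

# Evaluate the response at 256 frequencies between 0 Hz and the Nyquist frequency.
freqs, response = freqz(kernel, worN=256, fs=fs)
magnitude = np.abs(response)

print(f"gain at 0 Hz: {magnitude[0]:.3f}")
print(f"gain at {freqs[64]:.2f} Hz: {magnitude[64]:.3f}")
```

The moving average passes slow changes unattenuated and has a null at 6.25 Hz (fs divided by the kernel length), which is the typical shape of a low-pass response.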

Class Activation Maps
Class Activation Maps (CAM) for time series data, introduced by Zhou et al. in [56], indicate which regions in a time series contribute to the decision-making process of a neural network. A Class Activation Map for a class c may be computed by the following equation:

$$\mathrm{CAM}_c(t) = \sum_{m=1}^{M} w_m^c \, A_m(t)$$

where $A_m(t)$ is the univariate time series for the variable $m \in [1, M]$, which is in fact the result of applying the m-th filter, and $w_m^c$ is the weight between the m-th filter (last convolutional layer) and the output neuron of class label c [54]. Examples of class activation maps with a sample for each dataset are depicted in Figure 4. In most cases, the impact region (region with the highest acceleration magnitude) and the free fall phase (region right before the impact) contribute most to the decision-making of our convolutional neural network.
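A class activation map is a weighted sum of the activation maps of the last convolutional layer, using the weights that connect each filter to the output neuron of the class. The sketch below uses made-up activations and weights to illustrate the computation.

```python
import numpy as np

def class_activation_map(activations, class_weights):
    """Weighted sum over M activation maps of shape (M, T), giving a map of shape (T,)."""
    return class_weights @ activations

# Hypothetical activations of M = 3 filters over T = 10 timestamps.
activations = np.zeros((3, 10))
activations[0, 4] = 2.0                        # filter 0 responds at t = 4
activations[1, 7] = 1.0                        # filter 1 responds at t = 7
fall_weights = np.array([1.0, 0.5, -1.0])      # weights towards the fall neuron

cam = class_activation_map(activations, fall_weights)
most_relevant_t = int(np.argmax(cam))          # timestamp contributing most

# With global average pooling, the class score (without bias) is the mean of the CAM.
score_from_gap = fall_weights @ activations.mean(axis=1)
```

In this toy example the map peaks at t = 4, where the filter with the largest positive weight responds.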

Results
Referring to Tables 5 and 6, our proposed algorithm achieves $F_1 = 0.96$ without and $F_1 = 0.97$ with data augmentation (both on SMV). The results in Figure 5 suggest that a larger number of parameters does not lead to increased performance in terms of AUC. Furthermore, the three-axis models do not perform better than their single-axis equivalents. The LSTM neural network in particular shows increased performance when using the Signal Magnitude Vector as input. The three-axis version of our IoT-CNN performs poorly on the MUMA and SimFall datasets if no augmentation is applied. We suspect that this may be due to the relatively small number of parameters compared to other models. All CNNs perform better on the SmartWatch, SmartFall, Notch, and UP Fall datasets. This indicates that certain datasets are easier to handle than others and may contain less variety. Note that the SmartWatch, SmartFall, and Notch datasets were published by the same researchers. We further augmented the SmartFall, Smartwatch, and Notch datasets with samples from a different dataset. This was done because of the sequence length of a fall in these datasets, as they only contain fall events shorter than 2 s. Regarding the LSTM-based neural network, the single-axis version shows a performance comparable to its CNN counterparts. Quantization, on the other hand, reduces the classification performance, as expected and in line with other work (see Table 7). While the performance on most datasets remains comparable to the results without quantization, the performance on the Notch dataset decreases by a large amount. This may be due to the gravitational offset of the sensor used.

Discussion
Historically, fall detection algorithms were discrete, and engineers developed a mobile system especially for this purpose. Today, neural networks can be deployed on embedded systems, so a flexible structure can be used. Because of the connectivity of IoT devices, a broad database is or will become available in the future. Therefore, reliable fall detection becomes feasible. The performance of fall detection approaches is hard to compare as long as varying datasets are used. Therefore, we identified available datasets and evaluated the capabilities of a small-scale neural network. Large networks that run on servers or high-end machines might outperform small-scale neural networks, but small-scale networks can be integrated on edge or small devices and are more relevant for real-life scenarios, as the computation is done on the device. This further ensures (a) that sensitive data are processed on the device and no network connection is needed, apart from the alarm mechanism, and (b) that privacy is protected. Regarding the existing fall datasets, we assume that resampling, range, resolution, and sensor type (internal filtering) only have a minor effect. However, this has to be confirmed by further research.

Outlook and Application
Typical deep neural network models require high computational power for training. Inference, however, consumes much less energy. As we could show in our research, even small models provide a reasonable recognition rate of 97%. The representation of the weights can be shrunk from 32-bit floats to unsigned 8-bit integers, which reduces the memory footprint on the device without a notable loss of accuracy. By leveraging a compact architecture with quantized weights, we are able to use our efficient neural network on embedded devices. This research forms the core of the lightweight wrist-based fall detection device called UMA, designed by the company Next Step Dynamics. The device has a long battery life, as the fall detection is processed in a dual recognition pipeline: an acceleration sensor monitors all movements, and a small trigger algorithm works as a gatekeeper and roughly detects possible falls. Due to the dual recognition pipeline, a fast processing speed of the neural network is not required. Furthermore, the trigger algorithm ensures continuous processing and provides easy-to-handle segmented data to the second inference step. The device (based on a Nordic nRF9160) is shown in Figure 6. We implemented our proposed neural network on the UMA fall detection device and will validate its capabilities in a real-life scenario in future work. Elderly persons are usually not very active; even a typical adult performs moderate to vigorous leisure-time physical activity for less than one hour a day [57]. The simple gatekeeper algorithm filters out inactive periods, so the deep learning algorithm is used only during active periods.

Conclusions
In this paper, we presented 19 datasets with raw fall data, recorded with accelerometers at the wrist (7), waist (10), and/or other positions (2). We illustrated that our optimized neural network can be applied on embedded systems such as IoT devices, smartwatches, or activity trackers. In former research, the accuracy of fall detection was assessed within a single focus group and its dataset. By identifying multiple datasets, a broader evaluation became feasible. We could show that neural networks perform well on our harmonized dataset. Furthermore, the increasing computational power of mobile devices enables the use of deep learning algorithms for fall detection on wrist-based embedded systems. We demonstrated that small-scale convolutional neural networks achieve a reasonable accuracy of 97% on our harmonized fall detection dataset. When quantization is applied, our neural network performs less accurately, which may be addressed in future work. We suspect that calibrating the neural network based on the activity of the user may enhance the performance significantly by lowering the false positive rate. For future work, we see that the sensitivity and specificity of fall detection are highly relevant in everyday usage. A waterproof, wrist-based sensing device can be worn 24/7 and should raise no false detections. This is a demanding requirement and calls for knowledge about the general condition of the user. Very active people move differently from passive and calm people. The requirement of a low false positive rate can be met by individualized algorithms or calibration. We assume that even energy-efficient mobile wrist devices allow for a reliable fall detection system that assists the elderly in everyday life.