Elderly Fall Detection with an Accelerometer Using Lightweight Neural Networks

Abstract: Falls are one of the main threats to people's health, especially for the elderly. Detecting falls in time can prevent long lie times, which can be fatal. This paper demonstrates the efficacy of detecting falls using a wearable accelerometer. Although the fall detection problem has been extensively studied in the past decade, the hardware resources of wearable devices are limited, so designing highly accurate embeddable models with feasible computational cost remains an open research problem. In this paper, different types of shallow and lightweight neural networks, including supervised and unsupervised models, are explored to improve fall detection results. Experimental results on a large open dataset show that the proposed lightweight neural networks obtain much better results than the machine learning methods used in previous work. Moreover, the storage and computation requirements of these lightweight models are only a few hundredths of those of deep neural networks in the literature. Among the tested lightweight neural networks, the best is a supervised convolutional neural network (CNN) that achieves an accuracy beyond 99.9% with only 411 parameters. Its storage and computation requirements are only 1.2 KB and 0.008 MFLOPs, which makes it well suited to wearable devices with restricted memory size and computation power.


Introduction
The world is currently experiencing an unprecedented aging of its population [1]. It has been estimated that the population of people aged 60 and over will keep increasing rapidly and exceed three billion by 2100. Such a huge elderly market will stimulate the development of the healthcare industry. Hence, healthcare services that reduce the living risks associated with the daily life of the elderly are increasingly in demand.
Meanwhile, falls are one of the main threats to the lives of elderly people [2]. Almost 80% of reported accidents among elderly patients are due to falls [3]. The situation is even worse in high-latitude areas that are covered with snow and ice for most of the year, such as parts of Canada, North America, and China. For instance, a living environment with a high risk of falling in Kelowna (Canada) is shown in Figure 1.
Early detection of falls can minimize the time between a fall and the arrival of medical caretakers, hence preventing long lie times that are potentially fatal. Therefore, fall detection has become a hot research topic during the past decade and a large number of fall detection systems have been proposed [4][5][6][7]. Based on the sensors used, these systems can be categorized into vision-based [8,9] and wearable sensor-based [10] systems. Vision-based fall detection has been an active research topic for a long time [11]. Recently, interest in wearable sensor-based systems has increased rapidly due to the emergence of low-cost physical sensors [12][13][14][15].
In the literature, different methods have been proposed to detect falls using wearable sensors; some are threshold-based and others machine-learning-based [16]. Among these, machine learning methods have shown performance superior to threshold methods and have therefore been widely explored in previous work. Methods including k-nearest neighbors (KNN), kernel Fisher discriminant (KFD) and support vector machines (SVM) were used in [17] to detect falls based on an integrated device attached to the waist of the human body. Five methods, including logistic regression (LR), naïve Bayes (NB), decision tree (DT), SVM and KNN, were evaluated together by Aziz et al. [18] for fall detection based on seven accelerometers distributed on the human body, and the SVM was found to perform best.
Moreover, neural networks have become increasingly popular in the machine learning field due to growing computational power and theoretical breakthroughs. Their advanced modeling capability has also attracted a large amount of attention in the fall detection field [19]. Different types of neural networks, including recurrent and convolutional neural networks, have been used in the literature.
In [20], a long short-term memory (LSTM) neural network named LSTM-Acc and a variant named LSTM-Acc Rot were proposed to detect falls. These LSTM models consist of two LSTM layers and two fully-connected layers, with each layer consisting of 200 neurons. Experimental results showed that the proposed LSTM models could achieve an accuracy of 98.57%. Furthermore, a gated recurrent unit (GRU) neural network was used in [21] to detect falls with a smartwatch. The GRU model consists of three nodes at the input layer, a GRU layer, a fully-connected layer, and a two-node softmax output layer.
Other researchers used convolutional neural networks in their work. A convolutional neural network (CNN) composed of four convolution layers and four pooling layers was used to recognize human falls in [22]. Experimental results showed that the proposed CNN model could achieve an accuracy of 99.1%. Another CNN model, composed of two convolutional and two max-pooling layers, was used in [23] to detect falls, achieving an accuracy of 98.61%. Furthermore, a CNN named CNN-3B3Conv was proposed in [24] to detect falls using acceleration measurements; the experimental results showed that CNN-3B3Conv obtained much better results than recurrent neural networks, with an accuracy near 99%. Indeed, good results have been obtained by machine learning methods in the context of fall detection, especially by deep learning techniques. However, most of the neural networks used are deep, complex and computationally intensive, and implementing them in wearable devices with limited hardware resources is a challenge. One solution to this problem is to run these deep neural networks not on the wearable device itself but on a base station instead, as in [23]: raw (or preprocessed) data are sent via a wireless link from the wearable device to the base station, where the data are processed to detect falls. However, this solution is not appropriate for outdoor environments, as the distance between the wearable device and the base station is limited in the technologies considered, e.g., ZigBee in [23]. Therefore, developing highly accurate embeddable models with lightweight architectures and feasible computational cost is mandatory to achieve an accurate wearable fall detector that works in both indoor and outdoor environments.
In this work, different types of lightweight neural networks, including supervised and unsupervised models, are explored for fall detection based on an accelerometer worn on the waist. The performance of these lightweight neural networks is evaluated against both the conventional machine learning methods and the deep neural networks used in the literature.
As shown in Figure 2, the standard process of machine-learning-based fall detection consists of three main steps: acquired sensor signals are first segmented into small data blocks, then features that reflect the characteristics of human falls are extracted and fed into classifiers for recognition. Following this process, the rest of this paper is organized as follows. The dataset, signal pre-processing methods and classification protocol used in this work are explained in Section 2. Section 3 provides a brief introduction to the classifiers used. Experimental results are presented in Section 4. Finally, Section 5 draws conclusions.

Dataset Description
To guarantee a reliable evaluation, a large public dataset known as the SisFall dataset is used in this work [25]. This dataset has been used in previous work for its diversity and integrity [26]. The dataset was recorded with a self-developed embedded device composed of a Kinetis MKL25Z128VLK4 microcontroller (NXP, Austin, TX, USA), an Analog Devices (Norwood, MA, USA) ADXL345 accelerometer, a Freescale MMA8451Q accelerometer, an ITG3200 gyroscope, and a 1000 mAh generic battery. During data collection, the device was attached to the waist of the subjects, as shown in Figure 3a, with a sampling rate of 200 Hz, and the different activities listed in Table 1 were performed in the classrooms and open spaces of a coliseum at the Universidad de Antioquia (Medellín, Colombia). Some of the data collection scenarios are shown in Figure 4. To guarantee safe conditions, falls were simulated using safety landing mats [25]. Overall, 38 volunteers, including 15 elderly people and 23 young people, were recruited to collect the dataset; their characteristics such as sex, age, height, and weight are summarized in Table 2.
In this work, only acceleration data acquired from the three-axial ADXL345 accelerometer are used, as in [25]. As shown in Figure 3b, the ADXL345 is an energy-efficient accelerometer that has been widely embedded in handsets, medical instrumentation, gaming and pointing devices, industrial instrumentation, and personal navigation devices. The ADXL345 is configured with a measuring range of ±16 g and a resolution of 13 bits, with a sensitivity of 3.9 mg/LSB. Its supply voltage range is 2.0 V to 3.6 V and its operating temperature range is −40 °C to +85 °C. Moreover, the accelerometer has a small size of 3 mm × 5 mm × 1 mm [27].
Since it has been found that there is no significant gain from sampling frequencies higher than 25 Hz in fall detection [26], the original acceleration measurements are first downsampled to 25 Hz. In downsampling, the original acceleration measurements are decimated by an integer factor instead of being resampled, which avoids the artifacts and distortion that resampling may introduce. When the original sensing data S = {s_1, s_2, ..., s_l} are downsampled by an integer factor n, the first sample out of every n samples is kept, starting from an integer offset m, as follows:

DS_m^n = {s_(m+αn)},

where 0 ≤ m < n, DS_m^n is the downsampled data, α is an integer and 0 ≤ α ≤ l/n. If the original sampling rate is R Hz, the rate after downsampling is R/n Hz. In this work, a factor of n = 8 is used to downsample the 200 Hz sensor signals to 25 Hz.
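The decimation step above can be sketched with a simple array slice; the 8 s of 200 Hz data below are illustrative values, not taken from the dataset.

```python
import numpy as np

def decimate(signal, n, m=0):
    """Downsample by keeping one sample out of every n, starting at offset m.

    Plain decimation as described above (no anti-alias filtering).
    """
    assert 0 <= m < n
    return signal[m::n]

# A 200 Hz signal downsampled by n = 8 gives 25 Hz.
raw = np.arange(1600)   # 8 s of data at 200 Hz
low = decimate(raw, 8)
print(len(low))         # 200 samples -> 8 s at 25 Hz
```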

Data Pre-Processing
In this section, the segmentation, feature extraction and data oversampling methods used to pre-process the acquired acceleration measurements are explained in detail. To segment sensor signals for classification, most researchers in the literature use the sliding window method shown in Figure 5a, where sensor data are continuously segmented by a moving window with an overlap. This method is simple but energy-intensive, since the classifier must run continuously at short intervals. Moreover, it is not accurate in extracting the data blocks of falls, since an overlapping sliding window may not align exactly with the whole data block of a fall: the window may cover only part of the fall together with part of the activities that occurred before it, such as walking or running, which may bias recognition.
To deal with this, an impact-point-based data segmentation method is used in this work. It is based on the fact that a fall is always associated with a strong impact between the human body and the ground. By detecting the impact, the sensor signals of falls can be accurately located. Moreover, a large amount of irrelevant sensor data (e.g., data of activities without an evident impact, such as sitting, standing or lying) can be excluded to avoid unnecessary recognition and save energy.
To detect the impact point, the acceleration magnitude (AM), which reflects the energy contained in the sensor signals, is used with a threshold of 1.6 g according to [28,29]. The AM is obtained as follows:

AM = sqrt(a_x² + a_y² + a_z²),

where a_x, a_y and a_z are the acceleration measurements on the three axes of the accelerometer. Figure 5b shows the process of impact-point-based data segmentation in fall detection. Once an impact is identified with the pre-defined AM threshold, a window is centered on the impact point to extract the complete fall process. In the experiments, a window of 3 s is used, following previous work [18].
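The impact-point segmentation can be sketched as follows; this is a minimal illustration assuming the 25 Hz rate, 1.6 g threshold and 3 s window described above, with acceleration given in units of g. The synthetic test signal is hypothetical.

```python
import numpy as np

FS = 25           # sampling rate (Hz) after downsampling
WINDOW = 3 * FS   # 3 s window centred on the impact
THRESHOLD = 1.6   # acceleration-magnitude threshold (g)

def acceleration_magnitude(acc):
    """AM = sqrt(ax^2 + ay^2 + az^2); acc has shape (samples, 3), in g."""
    return np.sqrt((acc ** 2).sum(axis=1))

def extract_impact_window(acc):
    """Return the 3 s segment centred on the first impact, or None."""
    am = acceleration_magnitude(acc)
    peaks = np.flatnonzero(am > THRESHOLD)
    if peaks.size == 0:
        return None                      # no impact: skip classification
    centre = peaks[0]
    start = max(0, centre - WINDOW // 2)
    return acc[start:start + WINDOW]

# Hypothetical signal: 1 g gravity on z, with a simulated impact spike.
acc = np.zeros((200, 3))
acc[:, 2] = 1.0
acc[100, 2] = 2.5
segment = extract_impact_window(acc)
```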

Feature Extraction
Once sensor signals are segmented, meaningful features should be extracted for classification. Neural networks can extract features automatically. However, hand-designed features that reflect the shape, energy, and dispersion of sensor signals are needed for conventional machine learning classifiers such as SVM and KNN. In this work, 13 types of statistical features that have been used in the literature [26] are extracted from the acceleration measurements on each axis: minimum, maximum, mean, median, interquartile range, variance, standard deviation, mean absolute deviation, root mean square, entropy, energy, skewness, and kurtosis.
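A sketch of the per-axis feature extraction, covering a representative subset of the 13 features (entropy, skewness and kurtosis are omitted to keep the example dependency-free); the names and shapes are illustrative.

```python
import numpy as np

def axis_features(a):
    """A subset of the statistical features listed above, for one axis."""
    q75, q25 = np.percentile(a, [75, 25])
    return {
        "min": a.min(),
        "max": a.max(),
        "mean": a.mean(),
        "median": np.median(a),
        "iqr": q75 - q25,                       # interquartile range
        "var": a.var(),
        "std": a.std(),
        "mad": np.mean(np.abs(a - a.mean())),   # mean absolute deviation
        "rms": np.sqrt(np.mean(a ** 2)),        # root mean square
        "energy": np.sum(a ** 2),
    }

def feature_vector(acc):
    """Concatenate per-axis features; acc has shape (samples, 3)."""
    feats = []
    for axis in range(acc.shape[1]):
        feats.extend(axis_features(acc[:, axis]).values())
    return np.array(feats)

fv = feature_vector(np.ones((75, 3)))   # one 3 s segment at 25 Hz
```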

Mitigating Effects of Class Imbalance
One issue in dataset generation that is frequently overlooked in previous work is class imbalance. Owing to the difficulty of collecting fall trials and the practical constraints of collecting data from multiple subjects, it is quite common in fall detection datasets that the number of data samples per class is unequal. Imbalance in the dataset can bias algorithms toward the classes with more data. The imbalance in the SisFall dataset is larger than 50:1 (ADLs to falls).
To deal with this, the synthetic minority oversampling technique (SMOTE) is applied to the training dataset to prevent imbalanced learning and avoid overfitting. SMOTE addresses the imbalance problem by oversampling the minority class: new minority-class instances are interpolated using the KNN within the feature space. A new synthetic data instance X is generated as follows:

X = X_i + rand(0, 1) × (X_j − X_i),

where X_i is a sample of the minority class and X_j is one of the nearest neighbors of X_i of the same class. This interpolation is repeated for the other nearest neighbors of X_i. As a result, SMOTE generates more general regions of the minority class, from which many machine learning classifiers can achieve better generalization. Figure 6 shows some fall trials generated by SMOTE during data oversampling.
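The interpolation rule above can be sketched in a few lines of numpy; this is a simplified SMOTE (one random neighbour per synthetic sample rather than a systematic sweep), and the minority data here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote(minority, k=5, n_new=100):
    """Generate synthetic minority samples by interpolating towards
    randomly chosen nearest neighbours: X = X_i + rand(0,1) * (X_j - X_i)."""
    new_samples = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x_i = minority[i]
        # distances from x_i to every minority sample
        d = np.linalg.norm(minority - x_i, axis=1)
        neighbours = np.argsort(d)[1:k + 1]      # skip x_i itself
        x_j = minority[rng.choice(neighbours)]
        new_samples.append(x_i + rng.random() * (x_j - x_i))
    return np.array(new_samples)

minority = rng.normal(size=(20, 4))              # placeholder fall features
synthetic_samples = smote(minority, k=3, n_new=50)
```

Because each synthetic point lies on a segment between two real minority samples, it always stays inside the minority class's bounding box.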

Evaluation Metrics
In this work, the performance of the different classifiers is presented with the confusion matrix, accuracy (ACC), sensitivity (SEN) and specificity (SPE). Table 3 shows the confusion matrix for fall detection. In the matrix, true positives (TP) are the observations that are falls and were predicted to be falls, false negatives (FN) are the observations that are falls but were predicted to be ADLs, true negatives (TN) are the observations that are ADLs and were predicted to be ADLs, and false positives (FP) are the observations that are ADLs but were predicted to be falls (false alarms). P is the number of fall observations, and N is the number of ADL observations.
Among these metrics, ACC = (TP + TN)/(P + N) measures the overall performance of a classifier, SEN = TP/P measures its ability to recognize falls, and SPE = TN/N assesses its capability to avoid false alarms. Since an accurate classifier that raises a large number of false alarms is still not acceptable in daily use, both the ability to recognize falls and the ability to exclude false alarms are important. Generally, a classifier is deemed to perform at a higher level only when its accuracy, sensitivity, and specificity are all higher than those of the others.
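The three metrics follow directly from the confusion-matrix counts; the counts below are illustrative, not results from this paper.

```python
def metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    p, n = tp + fn, tn + fp    # number of fall / ADL observations
    acc = (tp + tn) / (p + n)  # overall performance
    sen = tp / p               # falls correctly recognised
    spe = tn / n               # ADLs correctly excluded (no false alarm)
    return acc, sen, spe

# Hypothetical counts: 100 falls, 1000 ADLs.
acc, sen, spe = metrics(tp=98, fn=2, tn=990, fp=10)
```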

Classification Protocol
In order to present the performance of the different machine learning methods in a realistic way, the SisFall dataset is divided into two parts: the first contains the activities performed by young adults Y1, ..., Y12 and elderly subjects E1, ..., E8, while the second contains the activities performed by the remaining young adults Y13, ..., Y23 and elderly subjects E9, ..., E15. A two-fold cross-validation strategy is then conducted on these two datasets. In this way, the activities of a given subject are always tested with classifiers trained on different persons, which guarantees a realistic evaluation. Finally, the total numbers of TP, TN, FP, and FN are counted from the validation results and used to assess performance.

Machine Learning Methods
In this section, the background of the machine learning classifiers used in this paper is introduced to facilitate understanding. Overall, eight machine learning approaches are used: four conventional methods and four neural networks.

Conventional Machine Learning Methods
Conventional machine learning methods used in this work include SVM, decision tree (DT), KNN, and extreme gradient boosting method (XGB).

SVM
The SVM was proposed by Vapnik and Chervonenkis [30] and has proven very effective in problems such as handwritten digit recognition and face detection in images. The principle of the SVM is to find a decision boundary between two parallel hyperplanes that separate the samples of the two classes.
Given the training data X = {X_1, X_2, ..., X_N} and corresponding labels Y = {y_1, y_2, ..., y_N}, y_i ∈ {1, −1}, two hyperplanes can be found:

w · X + b = 1 and w · X + b = −1,

where w and b are the parameters that define the hyperplanes. The SVM finds the decision boundary between these two hyperplanes while maximizing the distance d = 2/‖w‖ between them.

KNN
KNN classifies an unseen feature vector based on the votes of its most similar samples in the training dataset. Generally, a Euclidean distance function is first used to measure the similarity between the target feature vector and the training samples:

d(X_i, X_j) = sqrt(Σ_k (x_ik − x_jk)²),

where d(X_i, X_j) is the distance between samples X_i and X_j. The k training samples closest to a new feature vector X form its neighbor group R_X^k, and X is then assigned to the class to which the majority of its k nearest neighbors belong.
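A minimal numpy sketch of this voting rule; the tiny training set is hypothetical, and k = 1 matches the value selected later in this paper.

```python
import numpy as np

def knn_predict(x, train_X, train_y, k=1):
    """Classify x by majority vote among its k nearest training samples
    under Euclidean distance."""
    d = np.linalg.norm(train_X - x, axis=1)   # distance to every sample
    nearest = np.argsort(d)[:k]               # indices of k nearest
    votes = train_y[nearest]
    labels, counts = np.unique(votes, return_counts=True)
    return labels[np.argmax(counts)]          # majority vote

# Hypothetical 2-D features: class 0 near the origin, class 1 near (5, 5).
train_X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
train_y = np.array([0, 0, 1, 1])
```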

DT
The DT solves a classification problem through a series of cascading decision questions: a feature vector that satisfies a specific set of questions is assigned to a specific class. The method is represented graphically by a tree structure, where each internal node tests a feature against a threshold and each leaf node represents a decided class. Its implementation reduces to a cascade of if/else conditions. Many types of DTs have been generated by different algorithms; in our research, a C4.5 tree is used.

XGB
XGB is a boosting meta-algorithm: a method that can be combined with other machine learning methods to improve recognition accuracy. It combines the outputs of many "weak" classifiers into a weighted sum that forms the final output. The individual learners can be weak, but as long as each performs slightly better than random guessing, the final model provably converges to a strong learner. In this paper, XGB with decision trees as base learners is used.
In the experiments, the performance of the SVM was compared for two different kernels, linear and radial basis function (RBF); the linear kernel was found to yield better results and was selected. The parameter k of KNN was searched over a wide range from 1 to 10, and a value of 1 was selected. The parameters of XGB were optimized using a grid search over the number of trees and the maximum tree depth; the best results were achieved with 50 trees and a maximum depth of 3.

Neural Networks
Neural networks are a family of statistical learning models inspired by the working principles of neurons in the human brain. Overall, four types of neural networks are used in this work: the supervised multi-layer perceptron (MLP) and convolutional neural network (CNN), and two unsupervised autoencoders.

MLP
The MLP, also known as the feed-forward neural network, is shown in Figure 7a. It processes information through a series of interconnected computational neurons. The inputs are propagated to the outputs via hidden neurons, which are grouped into layers and connected to the previous layer through weighted connections. Formally, the neurons are defined by the following function:

a^(l+1) = σ(W^(l) a^(l) + b^(l)),

where a^(l) is the vector of neuron values in layer l (a_i^(l) denotes the value of neuron i in layer l), W^(l) is the weight matrix between layers l and l + 1, b^(l) is the bias associated with the neurons in layer l, and σ is the activation function. For the first layer of the network, a^(1) = x, the input to the neural network (the flattened sensor signal in this work). MLPs use a fully-connected topology, where each neuron in a layer is connected to every neuron in the previous one.
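The layer recurrence a^(l+1) = σ(W a^(l) + b) can be sketched directly in numpy; the weights below are arbitrary illustrative values, not trained parameters.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def mlp_forward(x, layers):
    """Forward pass a = sigma(W @ a + b) through a list of
    (W, b, activation) layer tuples."""
    a = x
    for W, b, sigma in layers:
        a = sigma(W @ a + b)
    return a

# A one-hidden-layer network with hypothetical weights: ReLU hidden
# layer, sigmoid output (as used for classification in this paper).
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
W1, b1 = np.array([[1.0, -1.0], [0.5, 0.5]]), np.zeros(2)
W2, b2 = np.array([[1.0, 1.0]]), np.zeros(1)
out = mlp_forward(np.array([1.0, 2.0]), [(W1, b1, relu), (W2, b2, sigmoid)])
```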

CNN
The architecture of the CNN is shown in Figure 7b. Different from the MLP, there are additional convolutional layers between the input and the fully-connected layers. These convolutional layers extract more meaningful feature maps for recognition by convolving the input signals with different kernels; in this operation, the kernels act as different filters or feature detectors. Formally, a feature map is generated by a kernel as follows:

a_j^(l+1) = σ(Σ_{f=1}^{n} k_{jf}^(l) * a_f^(l) + b_j^(l)),

where a_j^(l+1) is the value of feature map j in layer l + 1, σ is the activation function, n is the number of feature maps in layer l, k_{jf}^(l) denotes the kernel that convolves over feature map f in layer l to create feature map j in layer l + 1, a_f^(l) is the value of feature map f in layer l, and b_j^(l) is the bias. Once the feature maps are generated by the convolutional layers, they are flattened and fed into the subsequent fully-connected layers for classification.
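The convolution above can be written out explicitly for multi-channel 1-D signals; this is a minimal valid-padding sketch with ReLU, and the shapes (75 samples, 3 axes, ten 1 × 5 kernels, stride 3) mirror the lightweight CNN described later only for illustration.

```python
import numpy as np

def conv1d_feature_map(a, kernels, bias, stride=1):
    """One convolutional layer for multi-channel 1-D signals.

    a       : input, shape (length, channels)
    kernels : shape (n_out, width, channels) -- one filter per feature map
    bias    : shape (n_out,)
    Returns feature maps of shape (out_length, n_out) after a ReLU.
    """
    length, _ = a.shape
    n_out, width, _ = kernels.shape
    out_length = (length - width) // stride + 1
    out = np.zeros((out_length, n_out))
    for j in range(n_out):                        # each feature map j
        for t in range(out_length):
            window = a[t * stride : t * stride + width]
            out[t, j] = np.sum(window * kernels[j]) + bias[j]
    return np.maximum(0.0, out)                   # ReLU activation

fm = conv1d_feature_map(
    np.ones((75, 3)), np.ones((10, 5, 3)), np.zeros(10), stride=3
)
```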
The training of the MLP and CNN optimizes their parameters (weights and biases) by minimizing the following cross-entropy error function:

E(w, b) = −(1/N) Σ_{n=1}^{N} [y_n log ŷ_n + (1 − y_n) log(1 − ŷ_n)],

where w and b denote the weight and bias parameters, N is the number of samples, y_n is the true label of sample n and ŷ_n is the prediction of the neural network. Given the training dataset X = {X_1, X_2, ..., X_N} and the corresponding labels Y = {y_1, y_2, ..., y_N}, y_i ∈ {1, 0}, the optimal parameter values of the MLP and CNN can be found with a gradient-descent approach.

Autoencoders
Autoencoders are neural networks that are trained in an unsupervised way. They aim to learn a representation (encoding) of the sensor signals with the purpose of reconstructing them. Since only unlabeled sensor signals of different activities are needed during training, autoencoders are known as unsupervised models. Figure 7c shows a dense autoencoder (DAE) built from an MLP: one MLP is used as the encoder δ, and another MLP with a symmetrical structure serves as the decoder ψ. Similarly, a convolutional autoencoder (CAE) can be built based on a CNN, as shown in Figure 7d.
The encoder and decoder of an autoencoder learn to condense the input signal into representative features and then use them to reconstruct the signal:

c = δ(x),  x̂ = ψ(c),

where x is the input signal, c the condensed code and x̂ the reconstructed signal. Different from the MLP and CNN, the training of the DAE and CAE minimizes the reconstruction error between the original and reconstructed signals, using a mean square error function:

L(x, x̂) = (1/N) Σ_{n=1}^{N} ‖x_n − x̂_n‖².

In this work, the DAE and CAE are built from the MLP and CNN described above. After unsupervised training, the encoders of the DAE and CAE are extracted and concatenated with a fine-tuned fully-connected layer for recognition.
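The encode–decode round trip and its mean-square reconstruction error can be sketched with a single dense layer per side; the identity weights below are placeholders standing in for trained parameters.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def autoencoder_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode x into a code c, then decode c into a reconstruction x_hat."""
    c = relu(W_enc @ x + b_enc)    # encoder: condense the signal
    x_hat = W_dec @ c + b_dec      # decoder: reconstruct it
    return c, x_hat

def reconstruction_error(x, x_hat):
    """Mean-square reconstruction error minimised during training."""
    return np.mean((x - x_hat) ** 2)

# Identity weights reconstruct a non-negative signal perfectly.
x = np.array([1.0, 2.0])
c, x_hat = autoencoder_forward(x, np.eye(2), np.zeros(2), np.eye(2), np.zeros(2))
```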

Neural Network Architectures
Overall, seven neural networks are evaluated in this paper. Three of them are models that have achieved superior performance in the literature; they are used as baselines against the lightweight neural networks proposed in this paper:
• CNN-HE [23]: consists of two convolutional layers (each followed by a max-pooling layer) and two fully-connected layers. The first convolutional layer consists of 32 kernels and the second of 64 kernels; the kernel size is 1 × 5 with a stride of 1. The first fully-connected layer consists of 512 neurons and the second of 8 neurons (changed to 1 in this work) for classification.
• CNN-3B3Conv [24]: consists of three layer blocks. The first block consists of three convolutional layers and one max-pooling layer; each convolutional layer consists of 64 kernels of size 1 × 4. The second block also consists of three convolutional layers and one max-pooling layer, but the kernel size is set to 1 × 3 empirically. The third block consists of three fully-connected layers with 64, 32 and two neurons (changed to one in this work), respectively.
• CNN-EDU [22]: consists of four convolutional layers composed of 16, 32, 64 and 128 kernels (1 × 5), respectively. Each convolutional layer is followed by a pooling layer, and two fully-connected layers are appended at the end.
The other four neural networks are the lightweight neural networks used in this work. They are designed based on the evaluation results in Tables 4 and 5. In Table 4, we compare the effect of the filter size, as well as the depth (number of layers) and width (number of kernels) of the CNN, on the resulting accuracy. Notably, the max-pooling layers and the additional fully-connected layers often appended after convolutional layers in previous work are abandoned here due to information loss [31] and parameter redundancy. Based on Table 4, a simple CNN consisting of a single convolutional layer with ten 1 × 5 kernels and a stride of 3, followed by one fully-connected layer, is chosen (highlighted in bold in Table 4). Similarly, a simple MLP consisting of a single hidden layer with 64 hidden neurons is selected according to the results in Table 5. A DAE and a CAE are then built from the selected lightweight MLP and CNN.
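The parameter count of the chosen lightweight CNN can be checked by hand; the arithmetic below assumes a 3 s window at 25 Hz (75 samples, 3 axes) and a convolutional output length of 25 steps per feature map, which is consistent with the 160 + 251 = 411 parameter breakdown reported later.

```python
# Parameter count of the lightweight CNN described above.
n_kernels, width, channels = 10, 5, 3
conv_params = n_kernels * (width * channels + 1)  # 10 * (15 weights + 1 bias)
flat = n_kernels * 25                             # flattened feature maps
fc_params = flat * 1 + 1                          # one output neuron + bias
total = conv_params + fc_params
print(conv_params, fc_params, total)              # 160 251 411
```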
In all of these neural networks, the rectified linear unit (ReLU) is used as the activation function, except in the last fully-connected layer, where a sigmoid function is used for classification. Moreover, a learning rate of 0.001 and a batch size of 128 proved best and are used with the Adam algorithm [32] for parameter optimization.

Lightweight Neural Networks against Conventional Methods
To guarantee reliable experimental results, each classifier was run for 10 rounds (see Appendix A for detailed results), and the averaged results are used for evaluation. Firstly, the performance of the lightweight neural networks is compared with that of the conventional machine learning methods in Table 6.
As the results show, XGB performs best among the conventional methods with an accuracy of 99.35%. DT and KNN come next with accuracies of 98.93% and 98.52%, and the SVM performs worst with an accuracy of 98.30%. The improvement of the boosting method over the other conventional classifiers is evident, especially in false positives (down to 496.7 for XGB from 1309.4 for SVM, 1000.9 for KNN and 799.4 for DT).
The lightweight neural networks obtain much better results. The accuracy of each neural network is higher than 99.5%, above even the best conventional method (99.35% for XGB). The best neural network results are obtained by the CNN, with an accuracy of 99.94%, a sensitivity of 98.71% and a specificity of 99.96%. These metrics show a significant improvement over the conventional methods, especially in reducing false alarms. Consider, for example, specificity: XGB achieves 99.36% while the CNN achieves 99.96%, an improvement of only about 0.6%. However, this difference is significant, as it corresponds to reducing the number of false alarms from 496.7 to only 26.9.
In our analysis, the better results of the CNN are partly due to its advanced modeling ability, but mainly due to its ability to extract local features. The convolutional kernels of the CNN are visualized in Figure 8, where X, Y, and Z denote the kernels on each axis of the acceleration measurements. As can be seen, these kernels have different patterns and shapes and also differ across axes. Some are line segments with a large slope and some are line segments fluctuating uniformly. These kernels act as various pattern detectors that move along the input signals to identify certain signal patterns at different locations for classification. Compared to methods that depend on features extracted from whole data segments, these automatically learned kernels help the CNN extract local features that reveal the differences between the signals of falls and ADLs on a much smaller scale. In this work, the kernels of the CNN extract local features on a scale as small as 0.2 s (1 × 5) at each step. On the other hand, although autoencoders have proved effective in learning the intrinsic characteristics of data, their slightly worse performance compared with the supervised neural networks shows that their efficacy is not evident in fall detection. This may be because the sensor signals used in fall detection are usually not complex and last only a few seconds; hence, supervised models are sufficient to learn effective features for recognition.

Lightweight Neural Networks against Baseline Models
The performance of the lightweight neural networks is compared with that of the baseline models from previous work in Table 7. Notably, to further compare the complexity of the different neural networks, the number of parameters (PARA) and the number of floating-point operations (FLOPs [33]; see Appendix B for the detailed calculation) of each neural network are also listed in Table 7.
As the accuracy metrics show, even though the baseline models are much deeper and more complex, they achieve only an accuracy similar to that of the lightweight models, around 99.93%. However, the parameter counts of the baseline models are generally hundreds of times those of the lightweight models, which means hundreds of times the storage requirement. The simplest models are the lightweight CNN and CAE with only 411 parameters, while the most complex is CNN-HE with 60.1 × 10^4 parameters.
Furthermore, the complex structure of the baseline models also leads to higher computational cost during classification. Even the simplest baseline model (CNN-EDU) still requires 1.4 MFLOPs to make one decision (fall/no fall), hundreds of times the cost of the lightweight CNN and CAE. Such large FLOP counts mean higher power requirements and more frequent battery recharging, making the wearable fall detector more obtrusive in daily life. Even though many deep neural networks consisting of more than three layers with thousands of neurons have been the focus of previous work, the experimental results show that lightweight neural networks consisting of only one hidden layer with fewer than 100 neurons are enough to achieve satisfying accuracy in fall detection. These lightweight neural networks have fewer parameters and smaller FLOP counts, making them more suitable for embedding in wearable devices, which usually have real-time requirements and restricted memory size and computation power. In this work, the simplest and most accurate neural network is the lightweight CNN, which has only 411 parameters (160 from the convolutional layer and 251 from the final fully-connected layer). Its total storage requirement is only about 1.2 KB (using 4-byte floating-point numbers) and its classification cost is 0.008 MFLOPs, only a few hundredths of those of the deep models used previously.

Conclusions
As the population of elderly people increases rapidly, healthcare services that reduce the living risks associated with their daily life are increasingly demanded. Falls are one of the main threats to the lives of elderly people and have caused a large number of accidents; their treatment is also a huge financial burden on society. Since early detection of falls can prevent potentially fatal long lie times, detecting the falls of elderly people with the highest possible accuracy using wearable sensors has been a hot research topic in the past decades.
Even though a large amount of work has been done, developing highly accurate embeddable models with lightweight architectures and feasible computational cost remains an obstacle to realizing a pervasive fall detector using wearable devices. In this paper, different types of lightweight neural networks are proposed, including supervised and unsupervised models. Experimental results prove the superior performance of the proposed lightweight neural networks. The best results are obtained by a lightweight CNN that provides an accuracy beyond 99.9% with a size of only 1.2 KB and a computational cost of 0.008 MFLOPs, making it well suited to implementation on wearable devices.
In the future, we plan to design different types of neural networks to detect human falls using other wearable devices, such as the smartphone, to provide a fall detection service to the general public. We also plan to extend our model to detect other human activities, such as walking, running and jumping, to realize a cognitive wearable module for use in the healthcare industry.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A
To guarantee the reliability of the experimental results, every classifier used in this work was run for 10 rounds. Detailed results are presented in this appendix.

Appendix B
To compute the number of floating-point operations (FLOPs), we assume that convolution is implemented as a sliding window and that the nonlinearity is computed for free. For convolutional layers and fully-connected layers, we compute FLOPs respectively as:

FLOPs_conv = 2 × (I / s) × K × C_in × C_out,
FLOPs_fc = 2 × I × O,

where I is the dimension of the input feature vector, C_in is the number of channels of the input feature vector, K is the kernel width, C_out is the number of channels of the output feature vector, s is the stride of the kernels, and O is the output dimensionality [33].
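Applying this counting rule (each multiply-add counted as two operations) to the lightweight CNN reproduces the reported 0.008 MFLOPs; the 75-sample input and 250-dimensional flattened feature map are assumptions consistent with the 411-parameter breakdown in the text.

```python
def conv_flops(i, c_in, k, c_out, stride):
    """Approximate FLOPs of a 1-D convolutional layer: each of the
    i/stride output positions performs k * c_in multiply-adds per
    output channel."""
    return 2 * (i // stride) * k * c_in * c_out

def fc_flops(i, o):
    """Approximate FLOPs of a fully-connected layer: i multiply-adds
    per output neuron."""
    return 2 * i * o

# Lightweight CNN: 75-sample, 3-axis input; ten 1x5 kernels with stride 3;
# 250-dimensional flattened feature map into a single output neuron.
total = conv_flops(75, 3, 5, 10, 3) + fc_flops(250, 1)
print(total / 1e6)   # ~0.008 MFLOPs
```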