Improving Fall Detection Using an On-Wrist Wearable Accelerometer

Fall detection is a very important challenge that affects both elderly people and their carers. Improvements in fall detection would reduce the aid response time. This research focuses on a method for fall detection with a sensor placed on the wrist. Falls are detected using a published threshold-based solution, although a study on threshold tuning has been carried out. The feature extraction is extended in order to balance the dataset for the minority class. Alternative models have been analyzed to reduce the computational cost so the solution can be embedded in smartphones or smart wristbands. Several published datasets have been used in the experimentation. Although these datasets do not include data from real falls of elderly people, a complete comparison study of fall-related datasets shows statistical differences between the simulated falls and real falls from participants suffering from impairment diseases. Given the obtained results, the rule-based systems represent a promising research line as they perform similarly to neural networks, but with a reduced computational cost. Furthermore, support vector machines performed with a high specificity. However, further research to validate the proposal in real on-line scenarios is needed. Furthermore, a slight improvement should be made to reduce the number of false alarms.


Introduction
Fall Detection (FD) is a very active research area, with many applications in health care, work safety, etc. [1]. Even though there are plenty of commercial products, the best rated products only reach an 80% success rate [2,3]. There are basically two types of FD systems: context-aware systems and wearable devices [4,5]. FD has been widely studied using context-aware systems, i.e., video systems [6]; nevertheless, the use of wearable devices is crucial because of the high percentage of elderly people and their desire to live autonomously in their own house [7].
Wearable-based solutions may combine different sensors, such as a barometer and inertial sensors [8], 3DACC combined with other devices, like a gyroscope [9], intelligent tiles [10] or a barometer in a necklace [11]. By far, 3DACC is the most used option within the literature [12][13][14][15][16]. Different solutions have been proposed to perform the FD; for instance, a feature extraction stage and SVM have been applied directly in [12,14], using some transformations and thresholds with very simple rules for classifying an event as a fall [15][16][17]. A comparison of classifiers has been presented in [13].
The common characteristic in all these solutions is that the wearable devices are placed on the waist or on the chest. The reason for this location is that it is by far easier to detect a fall using the sensory system in this placement [18]. Clearly, each location is the best option for some cases, while for other problems, it may not be the best one. For instance, placing the sensor on the waist is valid for patients with severe impairment; however, the requirement to use a belt with some dressing might not be valid in the case of healthy participants. Furthermore, this type of device lacks usability, and people might easily forget it on the bedside table [4,19]. Thus, this research limits itself to using a single sensor, a commercial smart wristband, placed on the wrist.
Furthermore, these previous studies do not focus on the specific dynamics of a falling event: although some of the proposals report good performance, they are just machine learning applied to the FD problem. There are studies concerned with the dynamics in a fall event with sensors located on the waist [20][21][22][23], establishing the taxonomy and the time periods for each sequence. Interestingly, it has been found that the vast majority of the solutions have been obtained using data gathered from simulated falls [3,24]; these studies have found that analyzing the solutions with data gathered from real falls produces a high error rate and rather poor performance. As far as we know, FARSEEING (FAll Repository for the design of Smart and sElf-adaptive Environments prolonging Independent livinG) is the first dataset including data from real elderly persons' falls [3]. The data have been gathered from patients suffering an impairment illness while using a 3DACC placed either on the thigh or on the lower back. It seems the data must come from the same population in order to record the inherent behavior of the subjects when falling, which may vary with age [25].
Focusing on FD using a wrist-worn bracelet, there are several works published in the literature. Ngu et al. proposed using a BLE link with a smartphone that has access to intelligent web services [26]. Basically, the agent running on the smartphone gathers the 3DACC data stream, analyzing each sliding window with a one-sample step. For each sliding window, up to four features are computed and fed to an SVM, which classifies the window as FALL or NOT_FALL. For training the SVM, the authors proposed a training stage for each user, sending the data stream to an intelligent web service to learn the model. Furthermore, the authors proposed a set of ADL to be performed by the user to gather data, including the fall simulation.
In [27], a smartwatch sends the data to a smartphone, where the detection takes place using Kalman filters and the CUMSUM algorithm [28] with adaptive thresholds. Similarly, a commercial wrist-worn 3DACC wearable was used together with a smartphone to detect falls and inactivity in [29]. In this case, the wearable delivers the processed data to the BLE-linked smartphone, which includes an implementation of WEKA. With the gathered data, an NN model was trained to discriminate between fall and several ADL; also, a threshold-based inactivity detection is continuously updated. As the authors stated, one of the problems is the energy efficiency of the wearable: the data stream, even with BLE, penalizes the battery life. Furthermore, an important problem refers to computing the models in the wearable, which also reduces the autonomy. Moreover, the authors stated the problem of the "heavy dependence on the availability of a smartphone which should therefore always be within a few meters of the user" [29].
A threshold-based solution was analyzed in [30], where up to four different features were computed for each sliding window with one sample shift. Up to 11 thresholds were defined; their values were found experimentally. The authors reported very good results when the alternative ADLs were walking, sitting and other. When the threshold algorithms ran both on the smartphone and on the wrist worn wearable device, the performance was enhanced between 5 and 15%. Although thresholds have been widely used in the literature, having only this type of discrimination might not apply to the general population. Additionally, depending on the availability of the smartphone, this represents a big challenge as the whole FD is computed in this device. A similar solution is proposed in [31], using threshold-based algorithms in both the smartwatch and in the smartphone.
An alternative solution was proposed in [32], where a smartwatch works autonomously to detect falls and send notifications. Threshold-based solutions were proposed assuming that only those falls for which the user faints are the ones to be detected: in the rest of the cases, the user is in charge of calling the ehealth service. Similarly, [27] made use of an Android smartphone to run threshold-based FD. In these latter studies, the authors proposed a continuous analysis of the acceleration magnitude in order to classify the current motion as FALL or NOT_FALL. As the authors stated in these papers, performing more complex models continuously would drain the battery, severely reducing the autonomy of the solution.
In one of the published solutions, Abbate et al. [33] proposed the use of the inherent dynamics of a fall as the basis of the FD algorithm with the sensor placed on the waist. A fall event detection is run continuously based on peak detection; once a peak is detected, a feature extraction is performed, and a feed forward NN classifies the event as FALL or NOT_FALL. A very interesting point of this approach is that the computational constraints for the first two stages are kept moderate so as to be deployed in a wearable device, although this solution includes a high number of thresholds to tune.
The aim of this study is to develop a wrist wearable solution for FD focused on the elderly. A wrist-worn 3DACC on a smart wristband is proposed to enhance the ergonomics of the solution. Based on [33], the solution has been implemented and enhanced with (i) an intelligent optimization stage to improve the peak detection, (ii) a dataset balancing stage to avoid biasing the models towards the majority class and (iii) alternative machine learning methods compared to the one originally proposed in order to reduce the computational complexity and promote a longer battery life. Finally, this study makes use of several published datasets, including real falls [3], simulated falls and ADL [34], ADL only [35] and ADL and simulated epileptic seizures [36]; all of them use 3DACC, the first with the sensor on the lower back or on the thigh, the latter three with the sensor placed on a wrist. A comparison between real falls suffered by elderly people and falls from young participants in ideal conditions is included to analyze the validity of the results using the simulated falls. Moreover, a complex cross-validation stage, including training, testing and validation, is performed. To our knowledge, this is the first study considering so many different published datasets and a complex scheme of comparison to analyze different FD solutions.
The remainder of this study is organized as follows. Next, the description of the solution proposed in this research is outlined. Section 3 details the experimentation that has been carried out, while Section 4 shows the experiment results and the discussion on them. The study ends with the conclusions drawn from this research.

Fall Detection with a Wrist-Worn Sensor
The block diagram depicted in Figure 1 is defined in this research, which basically is the proposal in [33]. The data gathered from a 3DACC located on the wrist are processed using a sliding window. A peak detection is performed, and if a peak is found, the data within the sliding window are analyzed to extract several features, which are ultimately classified as FALL or NOT_FALL. The FD block is performed with an AI classifier.
The next subsection describes the method for detecting a peak, as well as the feature extraction, while the method for training the FD block is detailed in Section 2.2. For each case, the proposed modifications are included. A discussion on the most suitable models to be used in this approach is held in Section 2.3. Finally, a new stage is included in the process devoted to the tuning of the peak detection threshold; this stage is explained in Section 2.4.

Feature Extraction Based on the Dynamics of a Fall
Abbate et al. [33] proposed the following scheme to represent the dynamics within a fall, so a possible fall event could be detected (refer to Figure 2). Let us assume that gravity is g = 9.8 m/s^2. Given the current timestamp t, we find a peak at peak time pt = t − 2500 ms (Point 1) if at time pt the magnitude of the acceleration a is higher than th1 = 3× g and there is no other peak in the period (t − 2500 ms, t) (no other a value higher than th1). If this condition holds, then it is stated that a peak occurred at pt.
Figure 2. Dynamics of a fall event as proposed in [33], showing the evolution of the magnitude of the acceleration in multiples of g. Analyzing the signal at timestamp t, the peak condition described in the text must be found in order to detect a fall. The X-axis represents the time, and each mark corresponds to 500 ms.
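The peak condition can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in [33]: the function names, the uniform sampling rate fs_hz and the representation of the signal as a list of magnitudes in multiples of g are all assumptions made for the example.

```python
import math

G = 9.8  # m/s^2, gravity as assumed in the text


def magnitude_in_g(ax, ay, az):
    """Acceleration magnitude in multiples of g from components given in m/s^2."""
    return math.sqrt(ax * ax + ay * ay + az * az) / G


def find_peak(mag, t_idx, fs_hz, th1=3.0, window_ms=2500):
    """Return the peak index pt if the peak condition holds at sample t_idx, else None.

    mag is a list of magnitudes in multiples of g, uniformly sampled at fs_hz
    (hypothetical representation chosen for this sketch).
    """
    lag = int(round(window_ms * fs_hz / 1000.0))
    pt = t_idx - lag  # candidate peak time, 2500 ms before the current sample
    if pt < 0 or mag[pt] <= th1:
        return None
    # No other sample above th1 strictly between pt and the current sample.
    if any(m > th1 for m in mag[pt + 1:t_idx]):
        return None
    return pt
```

With a 20 Hz signal, for instance, the candidate peak lies 50 samples before the current one, so a spike at index 10 is reported when the window reaches index 60, provided no second spike follows it.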
When a peak is detected, the feature extraction is performed, computing for this peak time several parameters and features. The impact end (ie) (Point 2) denotes the end of the fall event; it is the last time for which the a value is higher than th 2 = 1.5× g. Finally, the impact start (is) (Point 3) denotes the starting time of the fall event, computed as the time of the first sequence of an a <= th 3 (th 3 = 0.8× g) followed by a value of a >= th 2 . The impact start must belong to the interval [ie − 1200 ms, pt]. If no impact end is found, then it is fixed to pt + 1000 ms. If no impact start is found, it is fixed to pt.
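The search for the impact end and the impact start might look as follows. The sketch rests on assumptions the text leaves open: the impact end is searched in the 1000 ms after pt, the impact start is taken as the first sample reaching th2 after a sample at or below th3, and all times are sample indices of a uniformly sampled signal. The function names are hypothetical.

```python
def impact_end(mag, pt, fs_hz, th2=1.5, horizon_ms=1000):
    """Index of the last sample above th2 in (pt, pt + 1000 ms] (assumed search range).

    Falls back to pt + 1000 ms, clipped to the end of the signal, if none is found.
    """
    horizon = int(round(horizon_ms * fs_hz / 1000.0))
    last = min(pt + horizon, len(mag) - 1)
    for i in range(last, pt, -1):
        if mag[i] > th2:
            return i
    return last


def impact_start(mag, pt, ie, fs_hz, th2=1.5, th3=0.8, lookback_ms=1200):
    """Index of the first sample >= th2 that follows a sample <= th3 in [ie - 1200 ms, pt].

    Falls back to pt if no such sequence exists.
    """
    lookback = int(round(lookback_ms * fs_hz / 1000.0))
    first = max(ie - lookback, 0)
    seen_low = False
    for i in range(first, pt + 1):
        if mag[i] <= th3:
            seen_low = True
        elif seen_low and mag[i] >= th2:
            return i
    return pt
```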
With these three times (is, pt and ie) calculated, the following transformations should be computed:
• Average Absolute Acceleration Magnitude Variation, AAMV = (Σ_{t=is}^{ie} |a_{t+1} − a_t|) / N, with N the number of samples in the interval.
• Impact Duration Index, IDI = ie − is.
• Maximum Peak Index, MPI = max(a_t) for t in [is, ie].
• Minimum Valley Index, MVI = min(a_t) for t in [is − 500 ms, ie].
• Peak Duration Index, PDI = pe − ps, with ps the peak start defined as the time of the last magnitude sample below th_PDI = 1.8× g occurring before pt and pe the peak end defined as the time of the first magnitude sample below th_PDI = 1.8× g occurring after pt.
• Activity Ratio Index, ARI, calculated as the ratio between the number of samples not in [th_ARIlow = 0.85× g, th_ARIhigh = 1.3× g] and the total number of samples in the 700-ms interval centered in (is + ie)/2.
• Free Fall Index, FFI, the average magnitude in the interval [t_FFI, pt]. The value of t_FFI is the time of the first acceleration magnitude below th_FFI = 0.8× g occurring up to 200 ms before pt; if not found, it is set to pt − 200 ms.
• Step Count Index, SCI, measured as the number of peaks in the interval [pt − 2200 ms, pt].
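As an illustration, three of the indices above (PDI, ARI and FFI) can be computed as sketched below. The code assumes a uniformly sampled list of magnitudes in multiples of g and sample indices for pt, is and ie; the function names are hypothetical, and scanning backwards from pt to locate t_FFI is one reading of the definition, not necessarily the one used in [33].

```python
def peak_duration_index(mag, pt, fs_hz, th_pdi=1.8):
    """PDI in ms: from the last sample below th_pdi before pt to the first one after pt."""
    ps = pt
    while ps > 0 and mag[ps] >= th_pdi:
        ps -= 1
    pe = pt
    while pe < len(mag) - 1 and mag[pe] >= th_pdi:
        pe += 1
    return (pe - ps) * 1000.0 / fs_hz


def activity_ratio_index(mag, i_s, i_e, fs_hz, low=0.85, high=1.3, span_ms=700):
    """ARI: share of samples outside [low, high] in the 700 ms window centered at (is + ie) / 2."""
    half = int(round(span_ms * fs_hz / 2000.0))
    center = (i_s + i_e) // 2
    window = mag[max(center - half, 0):center + half + 1]
    outside = sum(1 for m in window if m < low or m > high)
    return outside / len(window)


def free_fall_index(mag, pt, fs_hz, th_ffi=0.8, lookback_ms=200):
    """FFI: mean magnitude in [t_FFI, pt], scanning backwards from pt (assumed direction)."""
    lookback = int(round(lookback_ms * fs_hz / 1000.0))
    earliest = max(pt - lookback, 0)
    t_ffi = earliest  # fallback: pt - 200 ms
    for i in range(pt - 1, earliest - 1, -1):
        if mag[i] < th_ffi:
            t_ffi = i
            break
    segment = mag[t_ffi:pt + 1]
    return sum(segment) / len(segment)
```

All three functions rely only on comparisons, sums and one division, which is in line with the low computational cost sought for the first stages of the pipeline.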
According to the block diagram, each sample of these eight features is classified as a fall event or not using the predefined model. Therefore, this model has to be trained; this topic is covered in the next subsection.

Training the FD Model
Provided there exists a collection of TS with data gathered from real falls or from ADL, a training phase can be proposed to train the FD model. Let us consider a dataset containing {TS_i^L}, with i = 1 · · · N, N the number of TS samples and L the assigned label; that is, a sample of this dataset is a TS_i^L with the data gathered from a participant using a 3DACC on the chosen location, i.e., on a wrist. Let us assume we know a priori whether or not this TS_i^L includes the signal gathered when a fall occurred; therefore, each TS is labeled as L = FALL or L = NOT_FALL. Now, let us evaluate the peak detection and the feature extraction blocks for each TS. Whenever a TS_i^L has no peak, the TS_i^L is discarded. When a peak is detected for TS_i^L, then the eight features are computed, and label L can be assigned to this new sample. Therefore, a new dataset is created with M labeled samples of eight features each, with M ≤ N. This dataset was used in [33] to train the feed-forward NN.
Nevertheless, it has been found that this solution (i) might generate more than a sample for a single TS L i , which is not a problem, and (ii) certainly will generate a very biased dataset, with the majority of the samples belonging to the class FALL. From their study [33], it can be easily seen that the main reason for a 100% detection is this biased dataset.
Consequently, in this research, we propose to include a dataset balancing stage using SMOTE [37], so at least a 40/60 ratio is obtained for the minority class.
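The idea behind the balancing stage can be sketched with a simplified, SMOTE-style oversampler written in plain Python. It is not the reference SMOTE implementation of [37] nor the one used in this study: the function name, the neighbourhood size k and the way the number of synthetic samples is derived from the 40% target are assumptions made for illustration.

```python
import math
import random


def smote_like(minority, n_majority, target_fraction=0.4, k=5, seed=0):
    """Create synthetic minority samples until the minority reaches target_fraction.

    minority: list of feature vectors (lists of floats) of the minority class.
    Each synthetic sample is a random interpolation between a minority sample
    and one of its k nearest minority neighbours.
    """
    if len(minority) < 2:
        raise ValueError("at least two minority samples are needed")
    rng = random.Random(seed)
    n_min = len(minority)
    needed = 0
    while (n_min + needed) / (n_min + n_majority + needed) < target_fraction:
        needed += 1
    synthetic = []
    for _ in range(needed):
        base = rng.choice(minority)
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> mate
        synthetic.append([b + gap * (m - b) for b, m in zip(base, mate)])
    return synthetic
```

With 4 minority and 20 majority samples, for example, 10 synthetic samples are generated, which brings the minority class to 14 out of 34 samples (about 41%).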

Model Complexity and the Battery Life
In [33], Abbate et al. made use of a feed-forward NN. Although the number of hidden neurons was set to seven, using a balanced training dataset as stated in the previous section raises this NN parameter up to 20. Basically, the use of any type of NN is a well-known solution that works quite well in computerized environments [12,14]. Nevertheless, it is known that the higher the number of operations with real numbers the higher the effort a computer has to perform; in the context of wearable and mobile devices, this extra cost matters [38].
In previous research, a comparison between models and their suitability to each possible scenario was presented [36,39]. As it has been shown, those models that include high computation seem to perform better. Actually, K-nearest neighbor outperformed many other solutions; however, its implementation in battery-powered devices could drain the battery in a relatively short period of time [40].
Therefore, this research proposes to constrain the models to those that include a low computational impact, reducing complex calculations as much as possible. Actually, in this research, only decision trees and rule-based systems are proposed. These models are based on comparison operations, which are much simpler; the hypothesis is that the obtained results are not going to significantly differ from those of an NN. Finally, to obtain a comparison with state-of-the-art modeling [12,14], we also include the SVM as an alternative.

Tuning the Peak Threshold
As stated in the Introduction of this study, several solutions in the literature are based on thresholds (for instance, [15,20,21,22], among others). In all of these studies, the thresholds were set up based on the data analysis, either by experts or by data engineers through data visualization.
The solution proposed in [33] is not different. Furthermore, several thresholds are used in that study, not only to detect a peak, but also to compute the extracted features. All of them have been fixed by analyzing the gathered data, establishing some typical values for the features for the class FALL.
However, this can be improved by means of computational intelligence and optimization. In this research, we propose to use well-known techniques (genetic algorithms and simulated annealing) to find the most suitable values for these thresholds. This study, in any case, requires not only optimization, but also some design decisions to modify the features. Therefore, for the purpose of this study, we constrain ourselves to focus on the optimization of the peak threshold, which is the most important threshold as it is the one responsible for finding fall event candidates.

Public Datasets
A common way of studying FD is by developing a dataset of simulated falls plus extra sessions of different ADL; all of these TS are labeled and become the test bed for the corresponding study. In this context, a simulated fall is performed by a set of healthy young participants wearing the sensory system, each of them letting him/herself fall towards a mattress from a standing still position.
The vast majority of these datasets were gathered with the sensor attached to the main body, either on the chest, waist, lumbar area or thigh. Interestingly, the UMAFall [34] dataset includes data gathered from 3DACC sensors placed on different parts of the body (ankle, waist, wrist and head) while performing simulated falls; this is the type of data needed in this research, since the main hypothesis of this study is to perform FD with a sensor worn on a wrist. Furthermore, there is no pattern in the number of repetitions of each activity or fall simulation. Some participants did not simulate any fall; some performed 6 or 9; and one participant simulated 60 falls.
Besides, this research also includes more publicly available datasets. On the one hand, the ADL and simulated epileptic seizure datasets published in [36] are considered because they include a high movement activity, the simulated partial tonic-clonic seizures, followed by a relatively calm period plus some other ADL, all of them measured using 3DACC placed on the dominant wrist. Although this dataset includes neither simulated, nor real falls, it includes activities that share similar dynamics with that proposed for a fall.
Additionally, the DaLiac [35] dataset is also considered in this study. This dataset includes several sensors, one on the wrist and one on the waist, among others. Up to 19 young healthy participants and up to 13 different ADLs are considered, from sitting to cycling.
On the other hand, the FARSEEING dataset [3] is also used for studying the validity of the simulated falls compared with real falls. As stated on the web page, "the FARSEEING real-world fall repository: a large-scale collaborative database to collect and share sensor signals from real-world falls". Data from 15 participants have been gathered for a total of 22 TS; each TS corresponds to a fall: 7 participants (producing 7 TS) have the 3DACC placed on a thigh, while 8 participants (producing 15 TS) have the 3DACC sensor placed on the lower back. Therefore, this dataset is used to validate the simulated falls, so the extent of the conclusions using the available datasets can be determined. Table 1 summarizes the datasets used in this study.

Dataset Comparison
As mentioned before, the published studies on FD usually base their experimentation on simulated falls with healthy participants, UMA Fall among them, whose age is outside the range of the population on which we focus in this research. In this context, it can be argued that the extrapolation of the conclusions might not be straightforward.
Therefore, a comparison of the signals recorded from the waist from UMA Fall and lower back from FARSEEING is performed, so a conclusion about the similarity of the simulated and the real falls can be drawn. This comparison will consider an exhaustive visual comparison of the signals. To do so, signals of falls from the UMA Fall dataset with the same direction-forward, backward or lateral-will be compared with each of the fall signals coming from a sensor placed on the lower back. The idea is to evaluate whether the dynamics from those TS are similar and if they are similar to that mentioned in [20,33].

A Complete Cross-Validation Scheme
In this research, a complete cross-validation (cv) scheme is performed, that is including training, testing and validation. Each of these stages includes all the TS from the same individual. In other words, once a participant is chosen to become part of the dataset partition, either validation or training and testing, all of his/her TS are included in that partition (refer to Figure 3). Therefore, none of the TS from a participant included in the validation dataset are used in the training and testing: these two partitions-on one side, the validation, and on the other side, the training and testing-are absolutely unrelated.
The first thing that has been done is choosing the participants from the UMA Fall and the simulated epileptic seizures datasets that are preserved for the validation. Fifteen percent of the participants from each dataset have been chosen to be included in the validation dataset. The remaining participants are assigned to the training and testing dataset. On this training and testing dataset, cross-validation is performed. Both 10-fold cv and 5 × 2 cv are performed on a participant basis as well. This means that for each fold, the participants are grouped for training or for testing. Once a participant is grouped for either training or testing, then all of his/her TS are used in the corresponding process. Again, for each fold, the training and testing partitions do not share any participant's TS; they are completely unrelated.
This scheme is outlined in Figure 3. The advantage of this cv scheme is that it will allow one to evaluate the performance of the solution with unseen participants, those preserved for validation, like would be the case in real life. Furthermore, this scheme allows one to perform training and testing on independent participants. This means that a model is trained with data from a set of participants and then it is tested with data from a different and independent set of participants. Therefore, the training models are tested against data from participants that are totally unseen by them. For sure, this will reduce the performance of the methods, but will allow one to evaluate the robustness of the solutions.
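A participant-based partition of this kind can be sketched as follows. The function names and the round-robin assignment of participants to folds are assumptions made for the example; the study itself does not state how participants were assigned to folds.

```python
import random


def split_participants(participants, validation_fraction=0.15, n_folds=10, seed=0):
    """Hold out a share of participants for validation and fold the rest by participant."""
    rng = random.Random(seed)
    ids = sorted(set(participants))
    rng.shuffle(ids)
    n_val = max(1, round(validation_fraction * len(ids)))
    validation, remaining = ids[:n_val], ids[n_val:]
    # Round-robin assignment: every participant lands in exactly one fold.
    folds = [remaining[i::n_folds] for i in range(n_folds)]
    return validation, folds


def train_test_ids(folds, test_fold):
    """Participants used for training and for testing when test_fold is held out."""
    test = list(folds[test_fold])
    train = [p for i, fold in enumerate(folds) if i != test_fold for p in fold]
    return train, test
```

Because the split operates on participant identifiers rather than on individual TS, all the TS of one participant automatically end up in the same partition.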
The general process is depicted in Figure 4. The training and testing dataset is used for tuning the threshold to perform the peak detection; this optimization process is detailed in Section 3.4. Once the threshold is obtained, then the peak detection takes place. Each TS included in the training and testing dataset is analyzed to find out whether there exists a peak or not. Those detected peaks are analyzed in depth, extracting the eight features and assigning a label: FALL or NOT_FALL.
This new intermediate dataset, called the model training dataset, might be highly imbalanced; therefore SMOTE is applied to obtain a more suitable dataset to use in the learning process. The learning process is detailed in Section 3.5 and includes up to four types of models: feed forward NN, SVM, DT and RBS. The SMOTE configuration will be to obtain a 40-60% representation of the minority class at least. This balanced dataset and the best model configuration found using a grid scheme are used in the training of the model.
Finally, the validation dataset is considered. It goes through the peak detection block, using the optimized threshold, and whenever a peak is found, the feature extraction stage is executed. Finally, the eight features are classified using the best model found in the previous stage. A TS from the validation dataset will be classified as FALL if a peak is detected and the subsequent classifier outputs the FALL label; otherwise, the TS will be assigned the label NOT_FALL.
Figure 4. The machine learning process within the cross-validation scheme. The training and testing dataset is used for (i) threshold optimization and (ii) peak detection and feature extraction. The labeled dataset is then used for the machine learning process to find the best modeling option. The best option is then evaluated with the validation dataset once processed so the real performance of the system can be obtained.
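To evaluate this validation stage, and every classification result in this study, the standard measurements accuracy, Kappa factor, precision, sensitivity, specificity and the geometric mean of these two latter will be computed. In order to compute the TP, TN, FP and FN, each TS is labeled with FALL if it includes a fall event; otherwise, it is labeled as NOT_FALL. Each TS is evaluated using each of the classifiers; a label FALL is assigned to the TS whenever a peak is detected and the corresponding output of the classifier is FALL; otherwise, the TS is labeled as NOT_FALL. Then, the following formulas hold, with p_e the agreement expected by chance.
\begin{align*}
\mathrm{Acc} &= \frac{TP + TN}{TP + TN + FP + FN} \\
\mathrm{Kappa} &= \frac{\mathrm{Acc} - p_e}{1 - p_e}, \qquad p_e = \frac{(TP + FP)(TP + FN) + (TN + FN)(TN + FP)}{(TP + TN + FP + FN)^2} \\
\mathrm{Prec} &= \frac{TP}{TP + FP} \\
\mathrm{Sens} &= \frac{TP}{TP + FN} \\
\mathrm{Spec} &= \frac{TN}{TN + FP} \\
G &= \sqrt{\mathrm{Sens} \times \mathrm{Spec}} \tag{8}
\end{align*}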
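The measurements named above follow the usual confusion-matrix definitions and can be computed as in the sketch below. The function name and the convention of returning 0.0 when a denominator is zero are assumptions made for the example.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Accuracy, Kappa, precision, sensitivity, specificity and their geometric mean."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (total * total)
    kappa = (accuracy - expected) / (1.0 - expected) if expected != 1.0 else 0.0
    return {
        "accuracy": accuracy,
        "kappa": kappa,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "g_mean": math.sqrt(sensitivity * specificity),
    }
```

As a worked example, TP = 40, TN = 45, FP = 5 and FN = 10 give an accuracy of 0.85, a sensitivity of 0.8, a specificity of 0.9 and a Kappa of 0.7, since the chance agreement is 0.5.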

Tuning the Peak Detection Threshold
A peak is detected whenever the acceleration magnitude is higher than 3× g as defined in [33] when the sensor is located on the waist. However, is this a valid value when the sensor is located on a wrist? This question will be answered using two metaheuristics: Genetic Algorithms (GA) and Simulated Annealing (SA).
The peak threshold is encoded as a real value ranging from 2.0 to 3.5. As explained in the dataset comparison experiments, these values were collected from the analysis of the TS from the UMA Fall gathered with the sensor on the wrist; for the sake of brevity, these TS are not plotted. The encoded real value represents a possible solution for both GA and SA approaches. The quality of the solution is evaluated using a fitness function based on the sensitivity and specificity obtained by the classification measurements generated using the current peak threshold. The fitness function used to guide the search process of the metaheuristics is the geometric mean of the specificity and the sensitivity, that is f (x) = G(x); see Equation (8).
The GA starts with a population of randomly-generated individuals. Each generation, convex crossover is applied with a certain probability between each individual and a mate selected using a binary tournament. The resulting offspring replaces the first parent if it has a better fitness value. Gaussian mutation is then applied to the current individual with a fixed probability. Mutation perturbs the peak threshold using a zero-mean Gaussian distribution, and the new obtained value is allowed to replace the current individual. This unconditioned replacement enhances the diversity of the population and benefits the search process. The parameter setting is performed with the aim to keep the number of fitness evaluations as low as possible in order to avoid high computational cost. To this end, the peak threshold optimization using GA is based on a population size and generation number of 10, crossover probability of 0.8 and mutation probability of 0.2.
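The GA loop described above can be sketched as follows. Several details are assumptions, since the text does not fix them: the standard deviation sigma of the Gaussian mutation, the clipping of mutated values to the search range and the separate bookkeeping of the best solution seen. The fitness argument stands for any callable returning G(x) for a candidate threshold.

```python
import random


def ga_tune_threshold(fitness, low=2.0, high=3.5, pop_size=10, generations=10,
                      p_cross=0.8, p_mut=0.2, sigma=0.1, seed=0):
    """Search the peak threshold in [low, high] maximizing fitness; returns (x, f(x))."""
    rng = random.Random(seed)
    pop = [rng.uniform(low, high) for _ in range(pop_size)]
    fit = [fitness(x) for x in pop]
    best_x, best_f = max(zip(pop, fit), key=lambda pair: pair[1])
    for _ in range(generations):
        for i in range(pop_size):
            if rng.random() < p_cross:
                # Binary tournament to select the mate.
                a, b = rng.randrange(pop_size), rng.randrange(pop_size)
                mate = pop[a] if fit[a] >= fit[b] else pop[b]
                lam = rng.random()
                child = lam * pop[i] + (1.0 - lam) * mate  # convex crossover
                child_fit = fitness(child)
                if child_fit > fit[i]:  # replaces the first parent only if better
                    pop[i], fit[i] = child, child_fit
            if rng.random() < p_mut:
                # Gaussian mutation, accepted unconditionally; clipping is an assumption.
                pop[i] = min(high, max(low, pop[i] + rng.gauss(0.0, sigma)))
                fit[i] = fitness(pop[i])
            if fit[i] > best_f:
                best_x, best_f = pop[i], fit[i]
    return best_x, best_f
```

In the study, each fitness evaluation requires running the peak detection and the classification for a candidate threshold, which is why the population size and the number of generations are kept small.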
The SA algorithm is based on a single solution initialized with a random value in the considered range [2.0, 3.5]. The neighborhood of a solution is defined based on the Gaussian mutation. A new solution y selected from the neighborhood of a current solution x is accepted as the new current solution if it has better fitness or, otherwise, with a probability defined according to the SA approach; in the standard Metropolis form for a fitness f to be maximized and a temperature T, this probability is P = exp((f(y) − f(x)) / T).
The probability of accepting a new solution from the neighborhood that does not improve the current fitness value depends on the difference between the fitnesses of the two solutions and on an SA parameter called temperature (denoted by T above). The cooling scheme for the temperature is based on a simple iterative function that returns the current T multiplied by a constant value α. For each value of T starting from the initial temperature to the minimum temperature, several iterations are allowed to select a new neighboring solution. In the current parameter setting, the value of T starts at 1.0, the minimum temperature is 0.1, the value of α is 0.9 and the number of iterations is set to 5. Parameter values for both GA and SA have been selected based on some preliminary experiments according to the results obtained and the computational cost.
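A compact sketch of this SA procedure is given below. The acceptance rule is the standard Metropolis criterion for a maximized fitness, which is assumed here; the mutation width sigma, the clipping to the search range and the tracking of the best solution are likewise assumptions made for the example.

```python
import math
import random


def sa_tune_threshold(fitness, low=2.0, high=3.5, t_start=1.0, t_min=0.1,
                      alpha=0.9, iters_per_t=5, sigma=0.1, seed=0):
    """Simulated annealing over the peak threshold; returns the best (x, f(x)) seen."""
    rng = random.Random(seed)
    x = rng.uniform(low, high)
    fx = fitness(x)
    best_x, best_f = x, fx
    t = t_start
    while t >= t_min:
        for _ in range(iters_per_t):
            # Neighbour drawn with a Gaussian perturbation, clipped to the range.
            y = min(high, max(low, x + rng.gauss(0.0, sigma)))
            fy = fitness(y)
            # Always accept improvements; accept worse moves with Metropolis probability.
            if fy > fx or rng.random() < math.exp((fy - fx) / t):
                x, fx = y, fy
                if fx > best_f:
                    best_x, best_f = x, fx
        t *= alpha  # geometric cooling
    return best_x, best_f
```

With T starting at 1.0, α = 0.9 and a minimum temperature of 0.1, the loop visits 22 temperature levels, so with 5 iterations per level the fitness is evaluated 111 times, including the initial solution.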

Model Learning
The original solution proposed in [33] made use of a feed-forward NN with 7 hidden neurons. However, in that original paper, the authors did not balance the model training dataset. In our experience, the feature extraction domain was clearly unbalanced toward the FALL label, so obtaining good results for the FALL label does not guarantee a good performance as the specificity might be really poor. Further, if this approach were to be deployed on a smart wristband or similar device, it would be advisable to use low computational models.
Therefore, in this study, several different models are proposed: the feed-forward NN, support vector machines (SVM), C5.0 decision trees (DT) and C5.0 rule-based systems (RBS). The first is the one proposed in the original work, and the last two are simpler models based on C4.5. SVM is included as an alternative state-of-the-art modeling technique that has been applied in FD [12,14]. All of them are implementations included in the caret package for R [41,42].
For each modeling technique, a grid search for the most relevant parameters will be performed after the balancing stage, even for the NN, since the model training dataset has changed from that originally published.

Dataset Comparison
The FARSEEING dataset includes up to 15 falls from elderly people using a 3DACC placed on the lower back; for each of them, there might be a brief description of the circumstances of the fall event. This context information is included in Table 2 with the corresponding ID within this research and within the FARSEEING dataset. Furthermore, in Figure 5, the evolution of the acceleration magnitude is plotted for F1 to F8. Although for the majority of the subjects, the 3× g threshold remains valid, some subjects perform with lower peak values, e.g., F3 in the figure. Furthermore, F9 has a peak value below the Abbate et al. threshold, though it has not been included for the sake of space.
Besides, Figures 6 and 7 depict several fall events from the participants in the UMA Fall dataset. In these figures, P x refers to the corresponding participant in that dataset, and the plots include the 3DACC magnitude (see Equation (10)) data from the sensor on the waist. Most of the participants behaved fairly similarly to the hypothesis of dynamics and also the thresholds in [33]. Nevertheless, there were also several exceptions; see Figure 6. For instance, Participants 1, 2 and 15 seem to have been falling with fear: their movements were clearly slower. For these participants, some tests were fair, even with a remarkable magnitude value higher than the 3× g threshold; for some other tests, they performed gently. In some tests, the participant behaved really differently, with the evolution of the magnitude of the acceleration having a totally different shape, as for Participant 12 in the backward fall included in the figure.
However, the majority of the simulations behaved as expected (refer to Figure 7). As seen in these plots, regardless of the fluctuation of the signal due to the different sampling frequencies, the dynamics can be considered similar to those shown in FARSEEING and agree to some degree with the dynamics proposed in [33]. Still, some differences can be observed.
On the one hand, the peak threshold is valid for the majority of the cases, but some of the TS stayed below that limit. These cases produce false negatives, that is, undetected falls. This is the reason why this research includes an optimization stage to tune the peak threshold. The range of candidates is derived from the smallest peak value found among all the TS from the UMA Fall dataset for the sensor on the wrist: this value was found to be 2.5× g; therefore, the lower limit was set to 2.0× g. The upper limit of the range is defined as a relatively large threshold, which was estimated as 3.5× g.
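The effect of the peak threshold on undetected falls can be illustrated with a minimal sketch. It assumes that the acceleration magnitude is the Euclidean norm of the three axes expressed in units of g (Equation (10) is not reproduced in this section, so this is an assumption) and that a peak is any sample whose magnitude exceeds the threshold. The function names and the sample values are hypothetical.

```python
import math

# Search range for the peak threshold in units of g, as stated in the text.
THRESHOLD_RANGE_G = (2.0, 3.5)

def magnitude(ax, ay, az):
    """Euclidean norm of one tri-axial sample (assumed form of the magnitude)."""
    return math.sqrt(ax * ax + ay * ay + az * az)

def has_peak(samples, threshold_g):
    """True if any sample magnitude exceeds the threshold (in g)."""
    return any(magnitude(*s) > threshold_g for s in samples)

# Hypothetical time series whose largest magnitude is about 2.69 g.
ts = [(0.0, 0.0, 1.0), (1.5, 1.0, 2.0), (0.1, 0.0, 0.9)]

detected_low = has_peak(ts, 2.5)   # peak found with the lower threshold
detected_high = has_peak(ts, 3.0)  # no peak found: a fall here would go undetected
```

With these illustrative values, the same time series triggers the detector at 2.5× g but not at 3.0× g, which is the kind of false negative that motivates tuning the threshold.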
On the other hand, the FARSEEING dataset includes some TS that cover walking followed by a sudden fall; the TS obtained for these cases may change the time periods mentioned in [33]. Moreover, each subject and participant has a different reaction speed. These two ideas must be considered in future work to revisit the definition of the extracted features.
Because there were visual differences in the behavior of the different datasets, and also because it allows a better understanding of the similarities between the simulated and real falls, a comparison between the TS from the FARSEEING and from the UMA Fall datasets is performed using the algorithm and thresholds proposed in [33]. Table 3 shows the mean and the standard deviation of the values of the extracted features for the TS that include a fall event. Using the Shapiro-Wilk normality test, it was found that not all features follow a normal distribution; thus, a Mann-Whitney-Wilcoxon test was used to evaluate whether the features from each dataset belonged to the same distribution or not. These results are included in Table 3, as well. As can be seen, the results clearly show differences between the simulated and the real falls. This is a very relevant finding, as it is common in the literature to use simulated falls in the evaluation of FD algorithms: there is now evidence against accepting simulated falls as a valid substitute for real falls. Although these differences might be explained by the fact that the participants in the FARSEEING dataset suffer from impairment illnesses, it is clear from the obtained results that the findings of the next subsections need to be validated in real scenarios, with participants from the target population living independently and keeping a log of any fall that might happen, so that real data can be gathered.
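As an illustration of the comparison step, the following sketch computes the Mann-Whitney U statistic for one feature from two groups, with a two-sided p-value from the normal approximation. It is a simplified stand-in, not the procedure used in this study (which relied on R): it applies no tie or continuity correction, the approximation is rough for small samples, and the feature values are hypothetical.

```python
import math

def average_ranks(values):
    """1-based ranks of the values, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U for x, two-sided p-value via the normal approximation)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u1 - mean_u) / sd_u
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return u1, p_value

# Hypothetical values of one extracted feature for real and simulated falls.
real_falls = [1.0, 2.0, 3.0, 4.0, 5.0]
simulated_falls = [6.0, 7.0, 8.0, 9.0, 10.0]
u, p = mann_whitney_u(real_falls, simulated_falls)
```

A small p-value would indicate that the feature is distributed differently in the two groups, which is how the differences reported in Table 3 are to be read.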
Notwithstanding the differences between the simulated and the real fall datasets found so far, we have no other option than to use the simulated fall dataset because, to the best of our knowledge, there are no publicly available real fall datasets using a 3DACC sensor placed on a wrist. Nevertheless, further research will be needed, as explained before.
Moreover, there are some issues in the UMA Fall dataset that need further addressing. When people fall, they use their arms to protect themselves and to try to grab something to avoid falling. Therefore, there will be much more movement variability, from those who fall without moving the arms to those that frantically try not to fall. Research with sensors worn on the wrist and in real scenarios will be needed.

Threshold Optimization
The GA and SA algorithms have been run 10 times based on the parameter setting given in Section 3.4. The results obtained have been analyzed according to the fitness function defined to guide the search process. The dataset used in this threshold optimization, following the experiment scheme shown in Figure 4, was the training and testing dataset.
The best fitness value generated by the GA is 0.870 for the peak threshold values 3.09629, 3.09632 and 3.0971. The average fitness over 10 runs is 0.8695, which only slightly deviates from the best run. The best thresholds detected by the GA are mostly in the range from 3.093 to 3.109, with a median value of 3.09590.
The SA algorithm obtains similar results to the GA. The best fitness value is 0.869, obtained for the peak threshold values 3.0936, 3.0921, 3.0940 and 3.0984. The average fitness for the 10 SA runs considered is 0.868, which is, as in the case of the GA, near the best value obtained. This indicates a stable performance for both algorithms over the independent runs. Most peak values detected by SA range from 3.078 to 3.093, with a median value of 3.09290. As already emphasized, these fitness values were obtained on the training and testing data. As can be noticed, GA and SA yield similar results in terms of the best and average fitness values, as well as the median peak threshold values.
After these optimization stages, and also considering the visualization stage performed in the previous subsection, the following thresholds will be compared:
• th25 = 2.5× g: the minimum value to detect any peak in the datasets.
• th3 = 3.0× g: the original proposal from [33].
• th309 = 3.09290× g: the median of the values obtained from the SA optimization. The median value obtained from the GA runs was 3.09590× g, which is quite a similar value; for the sake of length, only the SA-optimized value will be analyzed.
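The procedure of running a stochastic search several times over the threshold range and keeping the median of the results can be sketched as follows. This is a generic simulated annealing loop written for illustration, not the implementation or parameter setting of Section 3.4; the fitness function is a toy surrogate with an arbitrary maximum, and all names and parameter values are assumptions.

```python
import math
import random
import statistics

def simulated_annealing(fitness, low, high, steps=2000, t0=1.0, cooling=0.995, seed=0):
    """Maximize fitness(threshold) over [low, high] with a basic SA loop."""
    rng = random.Random(seed)
    current = rng.uniform(low, high)
    current_fit = fitness(current)
    best, best_fit = current, current_fit
    temp = t0
    for _ in range(steps):
        # Gaussian perturbation, clamped to the allowed threshold range.
        candidate = min(high, max(low, current + rng.gauss(0.0, 0.1)))
        cand_fit = fitness(candidate)
        delta = cand_fit - current_fit
        # Always accept improvements; accept worse moves with Boltzmann probability.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_fit = candidate, cand_fit
            if current_fit > best_fit:
                best, best_fit = current, current_fit
        temp *= cooling
    return best, best_fit

def toy_fitness(threshold_g):
    """Toy surrogate only: a smooth curve with an arbitrary maximum at 3.0 g."""
    return 0.87 - (threshold_g - 3.0) ** 2

# Ten independent runs over the range used in the text, then the median threshold.
runs = [simulated_annealing(toy_fitness, 2.0, 3.5, seed=s) for s in range(10)]
median_threshold = statistics.median(th for th, _ in runs)
```

Taking the median over independent runs, as done for th309, makes the selected threshold less sensitive to a single lucky or unlucky run.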

Model Training and Cross-Validation Results
Recall that the experimentation design included several published datasets; these datasets were split into training, testing and validation. When splitting, the participants (and all of the TS gathered from them) were assigned either to the training and testing or to the validation datasets. Furthermore, the majority of the available datasets gathered using a wrist-worn 3DACC do not include fall events but ADL, including jumping, simulated seizures or running, among others; this results in a more balanced feature extraction dataset than if only a single dataset were used. Nevertheless, a SMOTE stage was performed to guarantee 40% minority samples in the training and testing dataset.
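To make the balancing step concrete, the sketch below computes how many synthetic minority samples are needed to reach a 40% minority share and generates them by SMOTE-style interpolation between a minority sample and one of its nearest minority neighbours. It is a simplified illustration, not the implementation used in the experiments; the function names, the value of k and the feature vectors are hypothetical.

```python
import math
import random
from fractions import Fraction

def synthetic_needed(n_min, n_maj, target=Fraction(2, 5)):
    """Smallest s such that (n_min + s) / (n_min + n_maj + s) >= target."""
    s = (target * n_maj - (1 - target) * n_min) / (1 - target)
    return max(0, math.ceil(s))

def smote(minority, n_new, k=5, seed=0):
    """Create n_new points by interpolating towards one of the k nearest neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

# Hypothetical two-feature minority class (FALL) and majority class size.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
n_majority = 12
n_new = synthetic_needed(len(minority), n_majority)
balanced_minority = minority + smote(minority, n_new, k=2)
```

With 4 minority and 12 majority samples, 4 synthetic samples are required, giving 8 minority samples out of 20, i.e., the 40% share mentioned above.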
The best parameter subset was obtained for each pair of threshold and model type using a grid search. The obtained parameter subsets are shown in Table 4 for the feed-forward NN, in Table 5 for SVM and in Table 6 for both the decision tree and the rule-based system based on C5.0. Both 10-fold cv and 5 × 2 cv were performed, and the obtained results are shown in Figure 8 and Tables 7 and 8 for threshold th25. For thresholds th3 and th309, only the 5 × 2 cv results are included for the sake of readability and space; Table 9 shows the 5 × 2 cv results for threshold th3, and Table 10 shows the 5 × 2 cv results for threshold th309. Finally, Figure 9 depicts the boxplots for 5 × 2 cv for both th3 and th309.
Recall that these results regard the feature extraction dataset obtained for the corresponding threshold. This means that, in this stage of the experiment, we only consider whether, once a peak is found, it can be correctly labeled as belonging to the FALL or NOT_FALL class. Thus, this would allow us to choose the most suitable model, if enough evidence is found.
In general, the results for 10-fold cv are better due to the differences in the number of samples contained in the training and testing datasets; however, the same behavior of the statistics can be observed. This is the reason why only the results for 5 × 2 cv are shown in the remainder of this subsection.
Figure 8. Results for threshold th25: (a) ten-fold cv; (b) 5 × 2 cv.
Table 7. Results obtained from the 10-fold cv when the threshold is set to 2.5× g: the different statistics are the Accuracy (Acc), Kappa factor (Kp), Sensitivity (Se), Specificity (Sp), Precision (Pr) and the Geometric mean (G), all of them computed using Equations (1) to (8). The models are feed-forward NN, Support Vector Machines (SVM), Decision Trees learned with C5.0 (DT) and Rule-Based Systems learned with C5.0 (RBS).
Table 9. Results obtained from the 5 × 2 cv when the threshold is set to 3.0× g.
Table 10. Results obtained from the 5 × 2 cv when the threshold is set to 3.09290× g: the different statistics are the Accuracy (Acc), Kappa factor (Kp), Sensitivity (Se), Specificity (Sp), Precision (Pr) and the Geometric mean (G), all of them computed using Equations (1) to (8). The models are feed-forward NN, Support Vector Machines (SVM), Decision Trees learned with C5.0 (DT) and Rule-Based Systems learned with C5.0 (RBS).
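The statistics named in the table captions can be computed from a binary confusion matrix as sketched below. Equations (1) to (8) are not reproduced in this section, so the usual textbook definitions are assumed; in particular, taking G as the geometric mean of sensitivity and specificity is an assumption, since some works use precision and sensitivity instead. The counts in the example are hypothetical.

```python
import math

def classification_stats(tp, fn, fp, tn):
    """Standard statistics from a 2x2 confusion matrix with FALL as the positive class."""
    n = tp + fn + fp + tn
    acc = (tp + tn) / n
    se = tp / (tp + fn)   # sensitivity: share of falls that are detected
    sp = tn / (tn + fp)   # specificity: share of non-falls that raise no alarm
    pr = tp / (tp + fp)   # precision: share of alarms that are real falls
    # Cohen's kappa: agreement beyond what is expected by chance.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (acc - expected) / (1 - expected)
    g = math.sqrt(se * sp)  # assumed definition of the geometric mean
    return {"Acc": acc, "Kp": kappa, "Se": se, "Sp": sp, "Pr": pr, "G": g}

# Hypothetical counts: 40 detected falls, 10 missed falls, 5 false alarms, 45 correct rejections.
stats = classification_stats(tp=40, fn=10, fp=5, tn=45)
```

For these illustrative counts the accuracy is 0.85, the sensitivity 0.80, the specificity 0.90 and the kappa factor 0.70.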

We have statistically compared the different methods for each of the thresholds. To do so, we have used the analysis of variance and Tukey's honest significant difference test, both tools included in R. With a confidence level of 95%, it was found that:
• For th25, SVM outperforms the other methods in sensitivity. However, for the remaining classifier performance measurements, all the methods are comparable.
• For th3, all of the methods are comparable except in sensitivity. With this measurement, NN is outperformed by SVM, DT and RBS.
• For th309, all the methods are comparable for all the measurements.
As a conclusion of this stage, we can state that:
• The different models are largely comparable, and there is no evidence that one of the combinations outperforms the others.
• In this scenario of similar behavior, the DT or the RBS might be advisable due to their simplicity and tuning capability. However, SVM are at least as interesting as these two models.
• No threshold was observed to be better than the others. The performance obtained with the original threshold proposed in [33] lies between those obtained for th25 and th309.
• The decrease in performance from the 10-fold cv to the 5 × 2 cv suggests that the validation results might be significantly worse, as the participants chosen for validation have not been presented to the models; the validation results thus represent the performance to be expected in real scenarios if the solution were deployed.
Hence, there is no clear winner from this comparison, either among the threshold values or among the models. Thus, the next stage of the experimentation, which will evaluate the overall performance of each pair <threshold, model>, will be the definitive phase in this research.
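For reference, the 5 × 2 cv scheme referred to in this subsection repeats a 2-fold cross-validation five times, each time with a new random partition. A minimal index-level sketch follows; the function name and the seed are hypothetical, and stratification or participant-wise grouping, if used, is not reproduced here.

```python
import random

def five_by_two_splits(n_samples, seed=0):
    """Yield 10 (train, test) index pairs: 5 random halvings, each used in both directions."""
    rng = random.Random(seed)
    indices = list(range(n_samples))
    half = n_samples // 2
    for _ in range(5):
        rng.shuffle(indices)
        first, second = indices[:half], indices[half:]
        yield first, second   # train on the first half, test on the second
        yield second, first   # then swap the roles of the two halves

splits = list(five_by_two_splits(10, seed=1))
```

Each model is trained on only half of the data in every fold, whereas 10-fold cv trains on nine tenths, which is consistent with the lower figures observed for 5 × 2 cv.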

Final Validation
In this stage, the performance of the whole solution will be evaluated. To do so, for each threshold, a model will be learned using the corresponding best parameter subset and the full training and testing datasets. With these models, the following algorithm is performed:

For each participant included in validation
    For each TS from the current participant
        If a peak is found using the currently chosen threshold
            Extract the features
            Predict the class using the corresponding model
            Update the classifying statistics according to the TS label
        Otherwise
            Update the classifying statistics according to the TS label

The obtained results are included in Tables 11 and 12. The former shows the confusion matrices for each combination of threshold and model. The latter shows the classifying performance of the whole solution. In our opinion, the confusion matrices obtained for the th25 threshold, independently of the model, show a performance where (i) the number of false alarms is higher for the NN and RBS and (ii) there are several undetected fall events. The relevance of an undetected fall event makes the th25 threshold the worst candidate.
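The validation loop described in this subsection can be sketched in code as follows. The peak detector, the feature extractor and the classifier are passed in as callables because their actual definitions are given elsewhere in the paper; the toy versions used in the example are hypothetical. The sketch also assumes that a TS without any detected peak is counted as a NOT_FALL prediction, which is how an undetected fall shows up in the confusion matrix.

```python
from collections import Counter

def validate(participants, has_peak, extract_features, predict):
    """Count (true label, predicted label) pairs over all validation TS."""
    counts = Counter()
    for ts_list in participants.values():
        for samples, label in ts_list:
            if has_peak(samples):
                predicted = predict(extract_features(samples))
            else:
                predicted = "NOT_FALL"  # assumption: no peak means no alarm
            counts[(label, predicted)] += 1
    return counts

# Toy stand-ins operating on acceleration magnitudes in g (hypothetical).
toy_has_peak = lambda samples: max(samples) > 3.0
toy_features = lambda samples: (max(samples),)
toy_predict = lambda features: "FALL" if features[0] > 3.5 else "NOT_FALL"

validation_set = {
    "P1": [([1.0, 3.8, 1.0], "FALL"), ([1.0, 1.2], "NOT_FALL")],
    "P2": [([1.0, 2.7, 0.9], "FALL"), ([1.0, 3.2], "NOT_FALL")],
}
confusion = validate(validation_set, toy_has_peak, toy_features, toy_predict)
```

In this toy run, the fall of P2 never exceeds the threshold and is therefore counted as an undetected fall without the classifier being consulted, which mirrors why a poorly chosen threshold cannot be compensated by the model.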
Increasing the threshold has a clear impact on the number of undetected fall events. However, the false alarm number varies from one case to the other: the number of false alarms and the corresponding specificities suggest that further research is needed to tackle this problem. More importantly, for the th309, two of the models did detect all the fall events, which suggests this th309 threshold learned from the optimization stage may be considered the best solution.
Furthermore, the comparison of the two main models (NN and RBS) shows the latter to be more robust and reliable, as its number of false alarms is about one third of that reported for the NN.
Besides, the SVM performance is really good if we consider the small number of false alarms (showing a very good specificity indeed), although it was not able to detect all the falls.
Table 12. Results obtained for the best model for each threshold. Different statistics are shown: the Accuracy (Acc), Kappa factor (Kp), Sensitivity (Se), Specificity (Sp), Precision (Pr) and the Geometric mean (G), all of them computed using Equations (1) to (8). The models are feed-forward NN, Decision Trees learned with C5.0 (DT), Rule-Based Systems learned with C5.0 (RBS) and Support Vector Machines (SVM).
However, there is no real evidence about how this solution would perform with elderly people because (i) the intensity level of the ADL is expected to be lower for this population, which favors the solution; (ii) there are no real fall datasets with the sensor placed on a wrist, which goes against this solution; and (iii) adapting the thresholds to each individual cannot be addressed in the current design. Furthermore, there would be differences between the evolution of the 3DACC TS for healthy elderly people and those obtained for, say, elderly people with impairments. These reflections lead us to conclude that a solution should be independent of the user intensity level and easier to tune and adapt to the current user. Moreover, gathering data from the elderly population would help in obtaining a more representative dataset. In those cases where there is not enough data, such as faints, mimicking the event with human-like flexible mannequins can also help.

Conclusions
This research focuses on fall detection for elderly people. Several solutions were studied, and one of them was chosen for deployment and improvement with the premise of a reduced computational cost, because it has to be implemented on wearable sensors. A threshold-based peak detection plus an NN stage to label the features extracted from the data has been extended with (i) an optimization stage to find the best threshold candidate, (ii) a SMOTE stage to balance the classes in the feature extraction domain and (iii) alternative classifiers with reduced computation and higher adaptability.
The experimentation includes several published datasets: the FARSEEING dataset, which includes real falls gathered from a 3DACC placed on the lower back of patients suffering from some impairment illnesses; the UMA Fall dataset, including simulated falls and ADL with several sensors and locations; the DaLiac dataset, including ADL; and the simulated epilepsy dataset, including ADL and simulated seizures. The latter two datasets gathered the data from a 3DACC placed on a wrist. After a comparison of the falls included in FARSEEING and in UMA Fall, it has been found that simulated falls might not represent the real movements. Therefore, using simulated data might help in evaluating a solution, but extra research with data from real falls will be needed in order to validate it.
The threshold optimization introduced did not show a clear advantage over either the original threshold proposed in [33] or the manually chosen one. SVM, RBS and DT were found comparable for almost all the cases. Besides, SVM was the modeling technique that performed with the best specificity, producing the smallest number of false alarms.
More research is needed to find a solution that performs independently of the intensity level of the user. Furthermore, the relevance of the wrist orientation in the FD must be evaluated. Moreover, a dataset gathered from elderly people using the sensor on a wrist and including real falls is needed. Additionally, using mannequins would enrich the fall detection dataset. Finally, the use of different oracles for different types of falls, like faints, for instance, might be needed to cope with all the possible sources of fall events to detect. Perhaps introducing ensembles can enhance the final results, but always keeping in mind the battery life of the wearable smartwatches.
Author Contributions: José R. Villar, Samad Barri Khojasteh and Camelia Chira conceived of and designed the experiments. All the authors participated in performing the experiments and in collecting, processing, organizing and analyzing the data of the experiments. José R. Villar and Camelia Chira wrote the paper. All the authors participated in reading, improving and amending the paper.

Acknowledgments:
The authors thank all participating men and women in the FARSEEING project, as well as all FARSEEING research scientists, study and data managers and clinical and administrative staff who made this study possible.

Conflicts of Interest:
The authors declare no conflict of interest. The funding sponsors had no role in the design of the study; in the collection, analyses or interpretation of data; in the writing of the manuscript; nor in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: 3DACC Step Count Index