A Study of One-Class Classification Algorithms for Wearable Fall Sensors

In recent years, the popularity of wearable devices has fostered the investigation of automatic fall detection systems based on the analysis of the signals captured by transportable inertial sensors. Due to the complexity and variety of human movements, the detection algorithms that offer the best performance when discriminating falls from conventional Activities of Daily Living (ADLs) are those built on machine learning and deep learning mechanisms. In this regard, supervised machine learning binary classification methods have been massively employed in the related literature. However, the learning phase of these algorithms requires mobility patterns caused by falls, which are very difficult to obtain in realistic application scenarios. An interesting alternative is offered by One-Class Classifiers (OCCs), which can be exclusively trained and configured with movement traces of a single type (ADLs). In this paper, a systematic study of the performance of several typical OCCs (for diverse sets of input features and hyperparameters) is carried out on nine public repositories of falls and ADLs. The results show the potential of these classifiers, which are capable of achieving performance metrics very similar to those of supervised algorithms (with values for the specificity and the sensitivity higher than 95%). However, the study warns of the need to have a wide variety of types of ADLs when training OCCs, since activities with a high degree of mobility can significantly increase the frequency of false alarms (ADLs identified as falls) if not considered in the data subsets used for training.


Introduction
According to the World Health Organization (WHO), a fall is defined as an involuntary event that results in a person losing their balance and coming to lie unintentionally on the ground or other lower level [1]. Despite the fact that the majority of falls are not fatal, it is estimated that 646,000 fatal falls occur annually, which makes them the second worldwide cause of death due to accidental injuries [1].
Fall-related health problems are particularly serious among older people, as they are strongly associated with loss of autonomy, impairment, and early death. Worldwide, about 28-35% of adults over 65 suffer one or more falls per year, while this percentage rises to 32-42% among those over 70 [2]. This situation poses a logistical and economic challenge for national health systems, especially considering that the share of the population aged over 60 will double by 2050, reaching a figure of 2 billion people, compared to 900 million in 2015 [3]. This problem is aggravated because a significant proportion of older adults live alone, so that if an accident occurs, a caregiver (a family member, medical or nursing staff, etc.) must be alerted to provide help. In this context, the time that elapses between a fall and the moment in which the person is assisted has been shown to determine the physical consequences of the accident and even the probability of survival [4]. Consequently, the last decade has witnessed an increasing interest in the development of affordable Fall Detection Systems (FDSs), which are able to permanently monitor patients and to trigger an automatic alarm message to a remote agent as soon as the occurrence of a fall is presumed.
Existing FDSs can be categorized into two generic groups. Firstly, context-aware systems are grounded on the deployment of cameras, microphones, and/or other environmental sensors in the specific locations where the user must be monitored. On the other hand, wearable-based systems utilize small transportable sensors that can be easily integrated or attached to the users' clothing or garments to measure different parameters that describe their mobility.
When compared to context-aware solutions, the monitoring provided by wearable architectures offers a more ubiquitous service as they are not restricted to the particular area where the contextual sensors are installed. In addition, they are less privacy intrusive than camera-based methods and more robust to the presence of external artifacts or the alteration of the user's setting. In addition, this type of FDS can benefit from the widespread acceptability and decreasing costs of wearable devices (smartwatches, sport bands, etc.).
The fundamental purpose of automatic fall detectors is to achieve the most accurate discernment between falls and other movements or Activities of Daily Living (ADLs), simultaneously minimizing the number of undetected falls and false alarms (ADLs misjudged as falls). The efficiency of an FDS relies on the algorithm that makes the detection decision after processing and analyzing the measurements that are constantly captured by the wearable sensors (mainly accelerometers; less frequently, gyroscopes; and, in some prototypes, magnetometers, barometers, or heart rate sensors).
Detection strategies can be roughly classified into two groups [5]: threshold-based and machine learning methods. Threshold-based algorithms assume that a fall has occurred when one or several parameters (derived from the sensor measurements) exceed or drop below a certain threshold limit. These algorithms are easy to implement and have a low computational load, although they are too simplistic and rigid to correctly classify many complex movements (especially those ADLs that involve intense physical activity). Contrariwise, algorithms based on machine learning models usually outperform the thresholding schemes [6], as they have a greater potential to self-adapt to a wider typology of ADLs and falls, by directly learning from a set of samples or movement traces and without requiring the explicit and heuristic definition of a threshold value.
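For illustration, the thresholding principle can be sketched in a few lines of Python (a hypothetical minimal detector; the 2.5 g limit is an arbitrary illustrative value, not one taken from the surveyed works):

```python
import numpy as np

def threshold_fall_detector(acc_xyz, threshold_g=2.5):
    """Flag a fall when the peak acceleration magnitude exceeds a fixed limit.

    acc_xyz: array of shape (n_samples, 3), triaxial acceleration in g.
    threshold_g: heuristic limit; 2.5 g is merely an illustrative value.
    """
    smv = np.linalg.norm(acc_xyz, axis=1)   # acceleration magnitude per sample
    return bool(np.max(smv) > threshold_g)

# A quiet ADL never exceeds the limit; a trace with a brief impact peak does.
adl_trace = np.tile([0.0, 0.0, 1.0], (100, 1))   # ~1 g of gravity only
fall_trace = adl_trace.copy()
fall_trace[50] = [2.0, 1.5, 2.5]                 # magnitude ~3.5 g at the impact
print(threshold_fall_detector(adl_trace), threshold_fall_detector(fall_trace))  # False True
```

The sketch also makes the rigidity of the approach apparent: any ADL with a single acceleration peak above the fixed limit (e.g., jumping) would be misclassified as a fall.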
In most studies in the related literature, machine learning algorithms follow a fully supervised approach, so they need to be trained with labeled examples of both ADLs and falls. However, falls are rare events, and most studies on FDSs are heavily conditioned by the lack of real-world fall examples. Owing to the evident difficulty of capturing samples of actual falls experienced by the target public of these systems (older adults), the falls used to train and test new FDS proposals normally have to be generated in a testbed through the movements of young and healthy volunteers who emulate falls on cushioned surfaces according to a systematic and predefined test plan.
The validity of this procedure is still under discussion. Some related studies [7,8] have compared the dynamics of the falls experienced by older people with those 'mimicked' by young subjects in an experimental environment. The authors concluded that, although there are similarities between the characteristics of both fall patterns, there also exist relevant differences in the monitored magnitudes related to the reaction time and the mechanisms of the compensatory movements to avoid falling or further damage. In this respect, Aziz et al. showed in [9] that the effectiveness of some supervised learning algorithms may dramatically decrease when they are evaluated in real scenarios.
To cope with this problem, one-class classifiers (OCCs) are a subtype of machine learning architectures particularly adequate for developing binary pattern classifiers with heavily unbalanced datasets [10]. OCCs bypass the need to obtain laboratory samples of the minority class (falls), as they are conceived to be exclusively trained with traces of the most common class (ADLs). In this way, in the case of FDSs, once the training of the system is accomplished, a fall is detected whenever a certain movement is classified as an 'anomaly' ('novelty' or 'outlier'). This occurs when its features substantially diverge from the samples of the majority class used during the training phase.
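As a minimal sketch of this idea, the following toy Python one-class detector is trained only on (synthetic) ADL feature vectors and flags as an anomaly any sample whose mean distance to its nearest training neighbors exceeds a threshold; the data, the value of k, and the decision threshold are all illustrative assumptions:

```python
import numpy as np

class OneClassKNN:
    """Toy one-class k-NN: a test sample is flagged as an anomaly ('fall')
    when its mean distance to the k nearest training (ADL) samples exceeds
    a decision threshold. Illustrative only; parameter values are arbitrary."""

    def __init__(self, k=5, threshold=3.0):
        self.k, self.threshold = k, threshold

    def fit(self, adl_features):
        # Training uses ADL traces exclusively -- no fall samples are needed.
        self.train = np.asarray(adl_features)
        return self

    def anomaly_score(self, x):
        dists = np.linalg.norm(self.train - np.asarray(x), axis=1)
        return float(np.mean(np.sort(dists)[:self.k]))

    def predict(self, x):
        return "fall" if self.anomaly_score(x) > self.threshold else "ADL"

rng = np.random.default_rng(0)
adls = rng.normal(size=(500, 4))   # synthetic stand-in for ADL feature vectors
occ = OneClassKNN().fit(adls)
print(occ.predict(np.zeros(4)), occ.predict(np.full(4, 8.0)))  # ADL fall
```

A sample resembling the training ADLs obtains a low anomaly score, while a movement far from anything seen during training is catalogued as a fall.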
In a real use scenario, FDSs will most likely have to be adjusted or 'tuned' to the particular dynamics of the movements of the user to be monitored. In this vein, Medrano et al. evinced in [11] the benefits of 'personalizing' the configuration of the FDS by training the models with movements generated by the final user. Obviously, this process should not oblige the patient to emulate or generate fall patterns to particularize the FDS. In this regard, OCCs may greatly ease the implementation of this personalization, since any user could train a certain machine learning method from scratch just by wearing the system during a training period in which the sensors collect the traces generated by the daily routines of the user and feed the detector.
The idea of utilizing OCCs as the decision core of an FDS is not new. Table 1 summarizes the works that have assessed the performance of anomaly detectors when they are programmed to detect falls with a wearable device. In some specific cases, the FDS develops a 'hybrid' approach by combining an OCC and a thresholding method (such as the proposal by Viet et al. in [12]) or an OCC and a fully supervised classifier (such as that proposed by Lisowska et al. in [13]).
In all cases, the algorithms are primarily based on the analysis of the signals captured by a triaxial accelerometer, a strategy that has been massively adopted by the related literature on wearable FDSs. In only six papers is the information provided by the accelerometer complemented by other inertial sensors (a gyroscope, a magnetometer, or an orientation sensor), and in just two cases a more complex sensor-fusion policy is applied, so that the classifiers are also fed with signals captured by other types of wearable sensing units (e.g., a heart rate monitor in the paper by Nho et al. [14]). Table 1 indicates the best reported performance metrics (normally expressed in terms of sensitivity or specificity) of the corresponding OCC in the reviewed literature. When more than one type of classifier is compared, the best performing algorithm in each study is marked in bold in the third column of the table. The results show that in some works, OCCs may achieve a noteworthy efficacy in discriminating ADLs from falls (with sensitivities and specificities higher than 0.98 or 98%). Furthermore, in [15], Medrano et al. illustrate that one-class classifiers may even exhibit a significantly better performance than their supervised counterparts. However, as can also be appreciated from the last column of the table, all the works employ only one or at most two datasets to evaluate these algorithms. In some studies, these datasets are not obtained from a public repository but directly generated (and not released) by the authors. Due to the limited number of subjects and types of ADLs and falls considered in these datasets, it is legitimate to question whether these results can be extrapolated to other repositories. Furthermore, the design criteria of these benchmarking datasets do not follow any particular recommendation and strongly rely on the particular decisions of their creators.
In a recent work [16], we have shown that even a deep learning method may achieve very divergent results when it is applied to different datasets. Thus, the good performance metrics obtained with a certain repository should be confirmed by training and testing the classifier with other datasets.
Another key problem of OCCs that is normally neglected by the related literature is that these detectors may produce false alarms when tested with types of ADLs that were not part of the training subset [17]. This situation would not be uncommon in a realistic scenario, where the monitored user may execute unexpected movements (not caused by falls) that can consequently be catalogued as 'anomalies' by the detector and trigger an undesired alerting message. Contrariwise, in the previous works on OCC-based FDSs, the ADLs included in the data subsets used for testing incorporate the same types of movements utilized for the configuration of the detector, which inherently minimizes the possibility of experiencing these false alarms. (Notes to Table 1: (1) the system is actually designed to detect 'near-miss falls', not falls; (2) the system is actually designed to generically detect abnormal activities, not only falls.)

In this paper, we thoroughly analyze these two issues. To this end, we systematically analyze the behavior of five basic types of anomaly detectors (with diverse hyperparameter configurations and input feature sets) when they are employed with nine different well-known datasets captured on the same body position (the waist). We also investigate whether the classification efficacy degrades when new types of ADLs (not considered for training) are used for testing.
The paper is organized as follows: after the introduction and analysis of the related works presented in this section, Section 2 describes the different aspects of the methodology followed to evaluate the classifiers. Section 3 presents and discusses the main results for the considered study cases. Finally, Section 4 recapitulates the main conclusions of the article.

Election of the Datasets
To date, about 25 datasets have been released to benchmark detection algorithms for transportable FDSs (see [33] for a comprehensive review on this topic). These databases are formed by a set of numerical traces describing the signals captured by inertial sensors placed on one or several locations of the body. To the best of our knowledge, just one released dataset, provided by the FARSEEING project [34], publicly offers a very limited and unrepresentative number of traces captured from actual falls of older adults. In the other cases, the repositories are generated by recruiting a group of volunteers who systematically execute or emulate a series of predetermined ADLs or falls while transporting the corresponding sensor or sensors. For each movement, a trace (labeled as ADL or fall) is created.
Several studies [35][36][37][38][39] have shown that FDSs located on the waist outperform those placed on other body positions with a higher and more independent mobility (e.g., a limb), since the waist is close to the center of gravity of the human body. Therefore, in order to set up a common reference framework under optimal conditions, we limit our analysis to those 15 repositories that offer inertial data measured at the waist (although some of them also contain measurements captured on other body positions). For the study, we also discard those datasets that do not provide a significant number of samples (less than 400) or those that were collected with an accelerometer range of 2 g, which is too small to properly characterize the abrupt acceleration peaks caused by falls. After applying these criteria, we selected the 9 datasets (DLR, DOFDA, Erciyes, FallAllD, IMUFD, KFall, SisFall, UMAFall, and UP-Fall) described in Table 2. This quantity is clearly superior to the number of benchmarking repositories that are typically considered by the related literature to assess the performance of fall detection algorithms (in fact, as confirmed in Table 1, most proposals are validated against a single dataset). The need to evaluate the classifiers with different repositories is critical if we consider the remarkable heterogeneity [33,40] that exists among the available datasets in terms of the typology of the emulated ADLs and falls, the strategies to generate the movements, the duration of the traces, the testbed environment, the selection of the volunteers, etc.

Compared One-Class Classifying Algorithms
As aforementioned, one-class classifiers constitute a particularization of binary supervised classification systems, in which the detection algorithms are trained only with data of one class. After the classifier is trained on these one-class traces, data corresponding to a category different from that used during training can be detected as anomalies. Therefore, once the model of an OCC is developed, input patterns can be identified as anomalies when a certain parameter derived from the input signals (e.g., a distance) exceeds a predefined decision threshold.
In the case of FDSs, the concept of an anomaly fits well with that of a fall, which can be envisaged as an unexpected movement that presents atypical characteristics with regard to those of the common or majority class (ADLs). Thus, in our evaluation, the classifiers are trained exclusively with part of the ADL samples included in the datasets while they are tested with both the falls and the rest of the ADLs (those not employed during the training stage).
In order to thoroughly evaluate the feasibility of using an OCC as the core of FDSs, we analyze the performance of five well-known one-class classifiers [10]: an autoencoder, a Gaussian Mixture Model (GMM), a Parzen Probabilistic Neural Network (PPNN), a One-Class K-Nearest Neighbor (OC-KNN), and a One-Class Support Vector Machine (OC-SVM). All the classifiers were implemented and executed with Matlab scripts that used the Statistics and Machine Learning Toolbox [48]. Table 3 summarizes the considered values and alternatives for the hyperparameters of these classifiers. Through a grid search, we evaluated the performance of the algorithms for the different combinations of these hyperparameters.
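The grid-search procedure can be illustrated with the following Python sketch (the study itself used Matlab scripts; the hyperparameter names and the synthetic scoring surface below are purely illustrative stand-ins for training and testing one OCC on one dataset):

```python
import itertools
import numpy as np

def geometric_mean(se, sp):
    """Global performance metric used in the study: sqrt(Se * Sp)."""
    return np.sqrt(se * sp)

def evaluate(k, threshold):
    """Stand-in for training/testing an OCC: returns (Se, Sp) for one
    hyperparameter combination. This synthetic surface (optimum at k = 5,
    threshold = 1.0) merely mimics the shape of such results."""
    se = 1.0 - 0.02 * abs(k - 5) - 0.10 * abs(threshold - 1.0)
    sp = 1.0 - 0.01 * abs(k - 5) - 0.05 * abs(threshold - 1.0)
    return se, sp

# Exhaustive search over all hyperparameter combinations, keeping the one
# that maximizes the geometric mean of sensitivity and specificity.
grid = {"k": [1, 3, 5, 7], "threshold": [0.5, 1.0, 1.5]}
best = max(
    (dict(k=k, threshold=t) for k, t in itertools.product(grid["k"], grid["threshold"])),
    key=lambda p: geometric_mean(*evaluate(p["k"], p["threshold"])),
)
print(best)  # {'k': 5, 'threshold': 1.0}
```

In the actual study, `evaluate` would wrap the full training and testing of a classifier, and the grid would contain the hyperparameter alternatives listed in Table 3.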

One-Class Support Vector Machine (OC-SVM)
Kernel functions: linear, quadratic, cubic, medium Gaussian.

As the decision threshold to detect the anomaly for each OCC, we employ the variable described in Table 4.

Feature Selection
In order to characterize the mobility samples and feed the machine learning classifiers, we compute a set of features derived from the raw signals collected by the inertial sensors. As all the repositories include the data captured by an accelerometer, which is the most widely employed sensor in the literature on wearable FDSs, the features are derived from the triaxial acceleration measurements. Falls provoke sudden peaks of the acceleration magnitude when the body hits the ground. This Signal Magnitude Vector ($SMV_i$), for the i-th measurement, is computed as:

$SMV_i = \sqrt{A_{x_i}^2 + A_{y_i}^2 + A_{z_i}^2}$

where $A_{x_i}$, $A_{y_i}$, and $A_{z_i}$ indicate the values of the triaxial components of the acceleration for each axis. For every movement trace (ADL or fall), the feature extraction exclusively focuses on a time interval of ±1 s around the sample where the maximum value of $SMV_i$ is identified, while the rest of the measurements in the sequence are not considered. The choice of this 2 s observation window (centered on the acceleration peak) is justified by the fact that an interval between 1 and 2 s is a good trade-off between recognition speed and accuracy for recognizing most human activities [49]. In any case, the critical (impact) phase of a fall does not typically last longer than 0.5 s [50,51]. Thus, all the features are derived from the consecutive acceleration components collected in the interval:

$i \in \left[ i_0 - \frac{N_W}{2},\ i_0 + \frac{N_W}{2} \right]$

where $i_0$ is the index of the sample at which the maximum acceleration module is located:

$i_0 = \underset{1 \leq i \leq N}{\arg\max}\ SMV_i$

where $N$ denotes the number of measurements in the trace (for each axis), while $N_W$ describes the number of samples captured during the observation window. $N_W$ can be straightforwardly calculated as:

$N_W = f_s \cdot t_w$

where $f_s$ is the sampling rate of the trace and $t_w$ is the total duration of the window (2 s). As a proper selection of the input features is a crucial factor in the design of any machine learning method, we consider different alternative candidate feature sets.
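The peak-centered windowing described above can be sketched as follows (a Python stand-in for the original Matlab processing; the clamping of windows at the trace boundaries is our own assumption, as the paper does not specify how peaks near the edges are handled):

```python
import numpy as np

def extract_window(acc, fs, t_w=2.0):
    """Return the t_w-second segment of a triaxial trace centred on the SMV peak.

    acc: array of shape (N, 3) with the triaxial acceleration samples.
    fs:  sampling rate in Hz.
    """
    smv = np.sqrt(np.sum(acc ** 2, axis=1))   # SMV_i for every sample
    i0 = int(np.argmax(smv))                  # index of the acceleration peak
    n_w = int(fs * t_w)                       # N_W = f_s * t_w samples
    lo = max(0, i0 - n_w // 2)                # clamp at the trace boundaries
    hi = min(len(acc), i0 + n_w // 2)
    return acc[lo:hi]

# A 10 s trace sampled at 50 Hz with an artificial peak in the middle:
trace = np.zeros((500, 3))
trace[250] = [0.0, 0.0, 5.0]
window = extract_window(trace, fs=50)
print(window.shape)  # (100, 3): the 2 s (100-sample) segment around the peak
```

The candidate features would then be computed from `window` rather than from the full trace.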
Firstly, we employ a set of twelve statistical candidate features that are physically interpretable, as they entail a certain characterization of human dynamics. These features have been utilized by other works in the related literature on fall detection and activity recognition systems (refer, for example, to the comprehensive studies presented by Vallabh in [52] or by Xi in [53]). The symbols, labeling identifiers, and descriptions of these twelve features are presented in Table 5 (a more detailed formal description of these parameters is provided in [33]).
In order to select the most convenient combination of input features from these 12 candidate statistics, we performed a preliminary analysis of the effectiveness of these statistics when applied to the aforementioned datasets to discriminate falls and ADLs with the classifiers. For all the studies, all the features were z-score normalized before training and testing. After implementing all the possible permutations of the statistics to feed the detectors, the obtained results (not presented here for the sake of simplicity) revealed that the two combinations that yielded the best performance metrics (sensitivity and specificity) in the classifiers were the one using the seven features labeled as B, C, D, F, G, I, and K in Table 5 (the 'BCDFGIK' feature set) and the one employing the 12 candidate features (the 'ABCDEFGHIJKL' feature set).

Table 5. Values and alternatives of statistics analyzed to select the input feature set of the classifiers.

ID Symbol Description
A  $\mu_{SMV}$     Mean Signal Magnitude Vector (SMV)
B  $A_{diff}(max)$ Magnitude of the maximum variation of the acceleration components

As the selection of these input feature sets can still seem arbitrary, we also consider another set of features obtained from the hctsa (Highly Comparative Time-Series Analysis) Matlab software package [54]. This software is capable of extracting thousands of heterogeneous features from a time-series dataset to produce an optimized low-dimensional representation of the data.
In our case, a set of 12 features (the HCTSA feature set) has been selected according to the following procedure:

• The SisFall [31] repository is selected as the baseline reference, as it is considered one of the most complete in terms of the types and quantity of movements and the number and typology of subjects.
• The candidate features of the samples are obtained by using hctsa.
• The classification performance of each candidate feature is calculated by using it as the single input of a Support Vector Machine classifier with a linear kernel and a k-fold analysis (with k = 10).
• The tool then analyzes the correlation between the features that led to the best results, and the application is programmed to divide these features into 12 different clusters, grouping correlated features into the same cluster. From each cluster, hctsa selects the most representative feature (the one closest to the center of the cluster).

Performance Metrics and Model Evaluation
For each combination of hyperparameters, input feature set, and dataset, we trained an instance of each of the five contemplated OCCs with a certain number of ADLs and tested it with both ADLs and falls of the same repository. To assess the capability of the one-class classifiers to discriminate both categories, we employed two metrics universally used in the evaluation of binary classifiers: the sensitivity (Se) or recall, defined as the ratio of falls in the test subset that are properly recognized, and the specificity (Sp), defined as the proportion of test ADLs that are not misidentified as falls. Unlike other metrics (such as the accuracy or the F1 score), sensitivity and specificity are not affected if the data classes in the datasets are unbalanced. Once the model is trained, the classifier is tested with 2500 possible values of the detection threshold (between a minimum and a maximum that respectively guarantee the maximization of the sensitivity and the specificity). By estimating Se and Sp for each value of the discrimination threshold, we compute the Receiver Operating Characteristic (ROC) curve, which represents the evolution of Se (true positive rate) against 1 − Sp (false positive rate). From the curve, we calculate the Area Under the Curve (AUC), a metric commonly used to characterize the overall performance of binary classifiers. Additionally, as another global performance metric of the system, which describes the trade-off between an adequate recognition rate of falls (high sensitivity) and the absence of false alarms (high specificity), we also utilize the geometric mean of Se and Sp (√(Se·Sp)), together with the values of Se and Sp, at the point of the ROC where this statistic reaches its maximum. The selection of this optimal cut-point of the ROC to set the corresponding decision threshold has also been proposed in works such as [55].
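A possible implementation of this threshold sweep, assuming anomaly scores where higher values indicate a more fall-like trace, is sketched below (illustrative Python, not the original Matlab code):

```python
import numpy as np

def roc_analysis(scores_falls, scores_adls, n_thresholds=2500):
    """Sweep n_thresholds decision thresholds over the anomaly scores
    (higher score = more anomalous), compute Se and Sp at each threshold,
    the ROC AUC (trapezoidal rule), and the operating point that maximises
    the geometric mean sqrt(Se * Sp)."""
    scores = np.concatenate([scores_falls, scores_adls])
    thresholds = np.linspace(scores.min(), scores.max(), n_thresholds)
    se = np.array([(scores_falls >= t).mean() for t in thresholds])  # sensitivity
    sp = np.array([(scores_adls < t).mean() for t in thresholds])    # specificity
    fpr = 1.0 - sp
    order = np.lexsort((se, fpr))          # sort ROC points by FPR, then TPR
    f, s = fpr[order], se[order]
    auc = float(np.sum(np.diff(f) * (s[1:] + s[:-1]) / 2.0))
    best = int(np.argmax(np.sqrt(se * sp)))
    return auc, thresholds[best], se[best], sp[best]

# Perfectly separable toy scores: every fall scores higher than every ADL.
auc, thr, se_b, sp_b = roc_analysis(np.array([5.0, 6.0, 7.0, 8.0]),
                                    np.array([0.0, 1.0, 2.0, 3.0]))
print(round(auc, 3), se_b, sp_b)  # 1.0 1.0 1.0
```

For separable score distributions the AUC is 1 and the optimal cut-point achieves Se = Sp = 1; with overlapping distributions the geometric-mean criterion picks the threshold balancing the two error types.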
In order to minimize the impact of the selection of the data used for training and testing the models, we evaluated the classifiers by means of a k-fold cross-validation [56,57]. For that purpose, the ADL traces of every dataset were split into five partitions (k = 5). Thus, for each combination of OCC, hyperparameters, input feature set, and dataset, the classifier is independently trained and tested five times. In each iteration, one of the five partitions, together with all the falls in the corresponding database, is reserved for the testing phase, while the remaining ADL partitions are used to train the model. The performance metrics obtained with the test data for the five iterations (AUC, Se, and Sp for the threshold value that yields the highest value of √(Se·Sp)) are averaged to characterize the performance of the classifier.
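This ADL-only cross-validation scheme can be sketched as follows (Python sketch; `train_fn` and `eval_fn` are hypothetical callbacks standing in for the training and testing of one OCC, and the toy model and metric used in the demonstration are our own illustrative choices):

```python
import numpy as np

def one_class_kfold(adl_X, fall_X, train_fn, eval_fn, k=5, seed=0):
    """k-fold evaluation for a one-class classifier: only the ADL traces are
    partitioned; each fold's test set is the held-out ADL partition plus ALL
    the falls, and training uses the remaining ADL partitions exclusively."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(adl_X))
    folds = np.array_split(idx, k)
    metrics = []
    for i in range(k):
        train_adl = adl_X[np.concatenate([folds[j] for j in range(k) if j != i])]
        model = train_fn(train_adl)                       # ADLs only
        metrics.append(eval_fn(model, adl_X[folds[i]], fall_X))
    return float(np.mean(metrics))

# Toy demonstration: the 'model' is just the ADL centroid, and the toy metric
# is 1.0 when every fall lies farther from it than the farthest test ADL.
def train_fn(adls):
    return adls.mean(axis=0)

def eval_fn(centroid, test_adl, falls):
    d_adl = np.linalg.norm(test_adl - centroid, axis=1).max()
    d_fall = np.linalg.norm(falls - centroid, axis=1).min()
    return float(d_fall > d_adl)

adls = np.random.default_rng(1).normal(size=(50, 3))
falls = np.full((5, 3), 10.0)
print(one_class_kfold(adls, falls, train_fn, eval_fn))  # 1.0
```

Note that, unlike standard k-fold cross-validation, the fall traces are never split: they appear in every fold's test set, because they are never needed for training.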

Study for the 'Fair' Case
As previously commented and indicated in Table 2, the datasets were generated by considering different predetermined types of ADLs and falls, which were executed by the experimental subjects. In our first analysis, we investigate the performance of the OCCs when the different typologies of ADLs are evenly ('fairly') distributed among the five subsets for five-fold cross-validation. Thus, we guarantee that all the types of ADL movements are represented in the subsets with which the anomaly detectors are trained.
The performance metrics obtained for the five algorithms and the nine datasets are presented in Table 6. Due to the high number of evaluated combinations, for each dataset and each type of OCC, the table only shows the combination of hyperparameters and input feature set (also indicated in the table) with which the highest value of the geometric mean of sensitivity and specificity (√(Se·Sp)) was achieved. For each dataset, the row corresponding to the classifier with the best global metric is highlighted in bold. To give an insight into the confidence interval of the measurements, together with the mean value of the global metric √(Se·Sp), the table also includes in the last column (preceded by the sign ±) the standard deviation of this parameter over the five tests of the corresponding k-fold validation of the classifier. To ease the comparison of the algorithms, the particular results of the AUC and √(Se·Sp) are summarized in Tables 7 and 8, respectively. The highest values are also emphasized in bold. From the results, we can draw the following conclusions:

• The best results are achieved by the OC-KNN classifier, which outperforms the rest of the detection methods for five out of the nine analyzed datasets (in terms of the geometric mean of sensitivity and specificity), while it presents the second or third best results for the other three datasets.

• The one-class SVM detector produces the best results for three datasets, while it offers the second-best behavior for five repositories. In any case, if we take into account the confidence interval that can be derived from the measurements, we can conclude that the differences between the behavior of OC-KNN and OC-SVM are not statistically significant.
• In most cases, the best performance is attained with the simplest input feature set (the seven features labeled BCDFGIK and described in Table 5). This suggests that if the features are conveniently selected, a parsimonious OCC architecture can be sufficient to produce efficient detection decisions.
• GMM, autoencoder and, especially, PPNN classifiers offer a more variable and erratic behavior, as the quality of the classification strongly depends on the employed dataset. In several databases, the best achieved geometric mean of sensitivity and specificity is under 0.90.
• For all the datasets, the OC-KNN classifier yields a specificity and a sensitivity of at least 0.9. In most cases, both metrics are higher than 0.95. These results are in line with most of the supervised (two-class) machine learning methods that can be found in the related literature (see, for example, the surveys presented in [58][59][60][61][62][63]). This implies that if the decision threshold is properly chosen, an OCC can behave like a two-class classifier without requiring the detector to be trained with falls. In a realistic use scenario, the final user of the detector (e.g., an older adult) could be monitored during his/her daily routines to generate a dataset of ADLs, which could then be used to train and personalize an OCC-based FDS.

Study of the Benefits of Ensemble Learning
Ensemble methods offer a simple and efficient paradigm to boost the prediction capability of single machine learning methods based on the combined decision of multiple models [64]. In this subsection, we assess whether the aggregate knowledge reached by the models evaluated in the previous analysis can improve the individual performance of the classifiers. In particular, we re-calculate the detection decision when a simple majority voting of three classifiers is applied (a similar performance is achieved if a higher number of models is considered). In this case, for each dataset, we use as base learners the three combinations of hyperparameters, input feature sets, and OCCs with which the three highest global performance metrics (geometric mean of Se and Sp) were obtained. Thus, during the testing phase, a trace is identified as a fall if a majority of the classifiers (two or three) classify the movement as a fall.
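The majority-voting rule itself is straightforward; a minimal Python sketch (with made-up base-learner decisions, purely for illustration) is:

```python
import numpy as np

def majority_vote(predictions):
    """Combine binary fall/ADL decisions (1 = fall, 0 = ADL) from an odd
    number of base one-class classifiers by simple majority voting."""
    votes = np.asarray(predictions)               # shape: (n_models, n_traces)
    return (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)

# Three base learners disagreeing on four test traces:
p1 = [1, 0, 1, 0]
p2 = [1, 1, 0, 0]
p3 = [0, 1, 1, 0]
print(majority_vote([p1, p2, p3]))  # [1 1 1 0]
```

With three base learners, a trace is labeled a fall when at least two of the three classifiers agree, which is exactly the two-or-three rule described above.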
The obtained results are presented in Table 9. For comparison purposes, the table also indicates the best results (extracted from Table 6) corresponding to the best discrimination ratio achieved by a single OCC. In the table, the metrics of the ensemble classifier are marked in bold when they improve on those generated by the single learner. Conversely, the results are highlighted in italics when the majority voting underperforms the best single model. As can be observed, the use of the ensemble improves the global performance metric in six out of the nine analyzed datasets (in several cases, a value of √(Se·Sp) close to 0.99 is attained), while for just one repository (DLR), the application of the voting technique reduces the effectiveness of the binary classification process.

Impact of the Typology of ADLs Employed in the Training Phase
As mentioned above, OCCs avoid the need to obtain (or generate) the traces of real or emulated falls that are required to train supervised learning algorithms. In contrast, the use of one-class classifiers can suffer from a lower specificity, owing to a greater number of false alarms or false positives caused by ADLs that were not contemplated in the training dataset and are identified as anomalies.
To determine the extent of this problem, we repeat the previous study of Section 3.1 when a certain typology of ADLs is removed from the training set and included in the testing subset. For this purpose, as already suggested in our previous studies [33,40], the ADL movements of all the repositories have been split into the three categories displayed in Table 10, depending on the physical effort required to perform them.

Table 10. Categorization criteria to divide the ADL movements into different types.

ADL Category | Description | Examples

Basic movements | Simple movements of low intensity | Getting up from a bed or chair, sitting down, lying, turning over while lying down, standing, clapping hands, etc.
Standard movements | Routines of daily life that require an intermediate physical effort or a certain degree of mobility | Walking at a normal pace, climbing up/down stairs, squatting, picking up an object from the floor, etc.
Sporting movements | Activities that require a higher physical effort and brusque and/or repetitive movements | Running, jogging, hopping, walking fast, etc.
For each dataset (except for the DOFDA repository, which does not include sufficient traces for two of the categories), we generated three subsets of ADLs, each containing the traces of the corresponding category. Each type of OCC, configured with the best combination of hyperparameters and input feature set obtained in Section 3.1, was then trained and tested three times. In each experiment, the model was exclusively trained with the subsets of two categories and then tested with the falls and the ADLs of the remaining category, using the optimal decision threshold computed for the 'fair' case.
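The train/test segregation just described can be sketched as a leave-one-category-out loop. The category names follow Table 10, but the feature matrices below are random placeholders, not data from the repositories:

```python
import numpy as np

CATEGORIES = ("basic", "standard", "sporting")

def leave_one_category_out(adl_traces):
    """adl_traces: dict mapping an ADL category to its feature matrix.
    Yields (held_out_category, training_matrix, test_adl_matrix): the OCC
    is trained on the two remaining categories and tested on the held-out one."""
    for held_out in CATEGORIES:
        train = np.concatenate([adl_traces[c] for c in CATEGORIES if c != held_out])
        yield held_out, train, adl_traces[held_out]

# Illustrative use with random feature vectors (10 features per trace)
rng = np.random.default_rng(42)
traces = {c: rng.normal(size=(n, 10)) for c, n in zip(CATEGORIES, (30, 40, 20))}
for held_out, train, test in leave_one_category_out(traces):
    print(held_out, train.shape, test.shape)
```

In the actual experiments, the fall traces of the dataset are added to the test side of each split, since falls are never used for training an OCC.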
The results for all the analyzed datasets and the best-performing OCC of each type are shown in Tables 11-13 for the cases in which the training sets do not include basic, standard, and sporting activities, respectively. The last column of each table ('Loss') indicates the difference between the global performance metric obtained with this segregation of the training and test subsets (based on the categorization of the ADLs) and the performance metric achieved in the 'fair' case (Table 6), in which traces of all the categories of ADLs are incorporated into the training subset. Consequently, a negative value of this parameter denotes a deterioration of the recognition capacity of the classifier.
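For reference, the global performance metric used throughout (the geometric mean of sensitivity and specificity) can be computed directly from the confusion counts; the figures below are arbitrary examples, not values from the tables:

```python
import math

def se_sp_gmean(tp, fn, tn, fp):
    """Sensitivity (recall of falls), specificity (recall of ADLs),
    and their geometric mean, from confusion-matrix counts."""
    se = tp / (tp + fn)          # detected falls / all falls
    sp = tn / (tn + fp)          # correctly classified ADLs / all ADLs
    return se, sp, math.sqrt(se * sp)

se, sp, g = se_sp_gmean(tp=95, fn=5, tn=97, fp=3)
print(f"Se={se:.2f} Sp={sp:.2f} gmean={g:.3f}")  # Se=0.95 Sp=0.97 gmean=0.960
```

Unlike plain accuracy, the geometric mean collapses to zero if either class is completely misclassified, which makes it a stricter summary for the imbalanced fall/ADL problem.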
As could be expected, the results show that the presence in the testing sets of types of ADLs not considered during the training phase causes a strong degradation of the capability of the classifiers to discriminate falls from ADLs. This loss of effectiveness is particularly remarkable in those repositories (such as FallAllD) that encompass a greater number of types of ADLs.
In this regard, the poorest discrimination rate is achieved when the system is tested with sporting movements. In some datasets, the best results for this situation yield specificities below 80%, which implies that more than 20% of sporting actions are classified as falls and would trigger a false alarm. The brusque mobility patterns induced by this category of movements cause the classifiers (trained with much less agitated activities) to misinterpret them as anomalies.
Paradoxically, the results also indicate that very basic and less energetic activities produce false positives as well, since they can also be identified as 'novelties' if traces corresponding to low-motion movements are not included in the training subset. Nevertheless, these false alarms caused by 'sedentary' actions could most probably be avoided by a simple thresholding technique, so that a movement trace is fed to the OCC only if the magnitude of the acceleration exceeds a certain value and a fall can be reasonably suspected.
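Such a pre-filter can be sketched as follows; the 1.8 g threshold is an arbitrary illustrative value, not one proposed or evaluated in this study:

```python
import numpy as np

G = 9.81                 # gravity (m/s^2)
THRESHOLD = 1.8 * G      # illustrative value; would have to be tuned

def worth_classifying(acc_xyz):
    """Return True if the acceleration magnitude of the (n_samples, 3) trace
    ever exceeds the threshold, i.e., only then is it forwarded to the OCC."""
    mag = np.linalg.norm(acc_xyz, axis=1)
    return bool(mag.max() > THRESHOLD)

sedentary = np.tile([0.0, 0.0, G], (100, 1))   # quasi-static trace (~1 g)
fall_like = sedentary.copy()
fall_like[50] = [0.0, 0.0, 3.0 * G]            # single impact-like peak
print(worth_classifying(sedentary), worth_classifying(fall_like))  # False True
```

Since sedentary traces never reach the threshold, they are discarded before the classifier can mislabel them as anomalies, while any trace containing an impact-like peak is still analyzed.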
Finally, the movements included in the standard category seem to be the typology of activities with the lowest impact on the effectiveness of the training. This can be explained by the fact that these activities represent an intermediate point of physical intensity between basic and sporting movements. Thus, training with movements of both lower and higher intensity (basic activities and sports, respectively) gives the classifiers enough information to avoid labeling standard activities as anomalies. Yet, a relevant decay in the performance of certain OCCs is also perceived when this category is excluded from the training phase.

Conclusions
This work has assessed the effectiveness of utilizing one-class classifiers as the decision core of fall detection systems based on wearable inertial sensors. Unlike fully supervised methods, OCCs benefit from the fact that they can be trained exclusively with samples of a single class (conventional Activities of Daily Living), which avoids the need to obtain traces captured during falls to train the classifiers.
In particular, we have analyzed the performance of five well-known OCCs under different input feature sets and a wide selection of hyperparameters. In contrast with most studies in the literature, which base their analysis on the use of a single dataset, we have extended the study to nine public repositories.
The achieved results (with values of the geometric mean of sensitivity and specificity higher than 95%) have shown the capability of the OCCs to discriminate falls from ADLs with high accuracy if the choice of the decision threshold is optimized. This performance is comparable to that obtained with supervised systems in the literature. For almost all tests and datasets, the one-class KNN classifier stood out as the best (or second-best) detection algorithm, a conclusion that is coherent with previous analyses in the related works. The study has also revealed that the use of simple ensemble learning methods (such as voting) may improve the hit rate of the detector if the decisions of several OCCs are simultaneously considered.
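As a reference, a minimal one-class KNN of the kind evaluated here scores a test trace by its mean distance to the k nearest training (ADL-only) samples; the synthetic data below merely illustrate the mechanism and are not taken from any repository:

```python
import numpy as np

def ocknn_scores(train, test, k=3):
    """Anomaly score of each test sample: mean Euclidean distance to its
    k nearest neighbours in the (ADL-only) training set."""
    d = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=2)
    return np.sort(d, axis=1)[:, :k].mean(axis=1)

rng = np.random.default_rng(0)
adls = rng.normal(0.0, 1.0, size=(200, 4))        # training: ADL features only
test_adls = rng.normal(0.0, 1.0, size=(20, 4))    # unseen ADLs
test_falls = rng.normal(6.0, 1.0, size=(20, 4))   # fall-like outliers

# Fall-like traces obtain much higher anomaly scores than unseen ADLs
print(ocknn_scores(adls, test_falls).mean() > ocknn_scores(adls, test_adls).mean())
```

A sample is declared a fall when its score exceeds the decision threshold, which is precisely the parameter whose optimization proved critical in the experiments.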
In any case, the analyses have illustrated the extreme vulnerability of these classifiers to the typology of the ADLs used in the training phase. Actions that involve rapid movements (such as sports) and even very basic activities (which require hardly any physical effort) may be straightforwardly identified as anomalies if they are not represented in the patterns used for training. This problem, which could be alleviated by combining OCCs with other simple methods that prevent certain typical ADLs from being identified as falls, forces a rethinking of the way in which one-class detectors are adjusted and evaluated. The results clearly show the importance of having a sufficiently varied set of samples for training. Likewise, in the test phase, and as a stress test of the system, the evaluation should consider the use of ADLs (not used for training) that entail agitated movements that may affect the decision of the classifier. Future studies should also focus on methodologies that automatically optimize the choice of the decision threshold.