Inertial Data-Based AI Approaches for ADL and Fall Recognition

The recognition of Activities of Daily Living (ADL) has been a widely debated topic, with applications in a vast range of fields. ADL recognition can be accomplished by processing data from wearable sensors, especially those located at the lower trunk, which appear to be a suitable option in uncontrolled environments. Several authors have addressed ADL recognition using Artificial Intelligence (AI)-based algorithms, obtaining encouraging results. However, the number of ADL recognized by these algorithms is still limited, transitional activities are rarely considered, and falls are seldom addressed. Furthermore, the small amount of data used and the lack of information regarding validation processes are other drawbacks found in the literature. To overcome these drawbacks, a total of nine public and private datasets were merged in order to gather a large amount of data and improve the robustness of several ADL recognition algorithms. Furthermore, an AI-based framework was developed in this manuscript to perform a comparative analysis of several ADL Machine Learning (ML)-based classifiers. Feature selection algorithms were used to extract only the relevant features from the dataset's lower trunk inertial data. For the recognition of 20 different ADL and falls, the results showed that the best performance was obtained with the K-NN classifier with the first 85 features ranked by Relief-F (98.22% accuracy). However, the Ensemble Learning classifier with the first 65 features ranked by Principal Component Analysis (PCA) presented 96.53% overall accuracy while maintaining a lower classification time per window (0.039 ms), showing a higher potential for use in real-time scenarios in the future. Deep Learning algorithms were also tested.
Although their outcomes were not as good as those of the previous approach, their potential was also demonstrated (overall accuracy of 92.55% for the Bidirectional Long Short-Term Memory (LSTM) Neural Network), indicating that they could be a valid option in the future.


Introduction
The recognition of Activities of Daily Living (ADL) has been a widely debated topic of study for the past several years, with applications in a vast range of fields, from medicine to the supervision of a person's driving style, through surveillance, and even sports training analysis [1,2]. Several activities are being recognized nowadays. ADL related to human locomotion, such as walking, running, moving up and down stairs, or just sitting or lying down, are identified in several papers [2][3][4][5][6]. Other activities involving finer gestures with the upper limbs, such as driving, talking on the phone, or eating, are also addressed [3,7]. Fall detection [8], recognition of sedentary behavior [9] and comfort in smart homes [10] are just a few more examples of automatic ADL recognition applications. The richness and diversity of human activities, as well as the large dimensionality of the data gathered, make the recognition of human activities a difficult but promising task. Table 1 describes the datasets used by the analyzed works as well as the results obtained by their Deep Learning algorithms.

Table 1. Datasets used for the evaluation of the Deep Learning algorithms analyzed in the literature and respective description regarding sensing methods, sample frequency, participants, number of activities (classes) recorded and algorithm performance (accuracy). In this table: A = Accelerometer, G = Gyroscope, M = Magnetometer, B = Barometer and ADL = Activities of Daily Living.

Dataset | Work | Sensors | Sample Frequency | Participants | Nº of Classes | Accuracy
Private Dataset | Chung et al. [22] | A, G, M | 100 Hz | 5 | 9 | 93%
SisFall [16] | Wang et al. [8] | A, G | 200 Hz | 38 | 2 | <99%
PAMAP2 [18] | Gil-Martín et al. [23] | A, G, M | 9 Hz | 9 | 18 | 96.62%
UCI-HAD [15] | Altuve et al. [2]; Murad et al. [3] | A, G, M | 50 Hz | 30 | 6 | [2]: 96.7%; [3]: 92.9%
USC-HAD [24] | Murad et al. [3] | A, G | 100 Hz | 14 | 12 | 97.8%
Opportunity [17] | Murad et al. [3] | … | … | … | … | …

For the most part, the results achieved by Artificial Intelligence (AI)-based algorithms for ADL recognition are highly satisfactory. However, the number of activities recognized is still generally small, with the analyzed state of the art detecting a maximum of 14 ADL (in this case, with 80% accuracy) [4]. Moreover, the amount of data used, whether collected through data acquisition protocols or taken from public datasets, is small: the largest public dataset identified in our research contains data from 38 people [16], the average number of subjects found is below 22, and subject variability is low. As a result, there is a greater demand for large-scale data collection campaigns. Studies on merging datasets and on preprocessing pipelines are also required in order to properly combine data obtained from various sources and reduce the disparities between them [27]. Another issue is that, in some research, the validation processes used for the algorithms are not stated or are inadequately explained. This may cause the developed models to perform well on the dataset used but poorly when evaluated on new datasets [28]. Furthermore, although some articles implement feature selection techniques, they do not indicate which features and information are most essential for ADL recognition.
This manuscript addresses the aforementioned drawbacks and, in addition, covers a higher number of ADL than any of the works analyzed, recognizing a total of 20 classes, including transitional activities (Sit-to-Stand, Stand-to-Sit, Lying-to-Stand, Stand-to-Lying, picking objects from the ground, bending and turning) and falls (16 ADL and four types of falls). Furthermore, this manuscript aims to answer the following research questions concerning the recognition/classification of 16 ADL and four types of fall events from data collected from waist-located inertial sensors: (i) Which is the most suitable classifier and what are the most relevant features? and (ii) Which approach, ML- or DL-based models, presents better performance?
With these objectives in mind, an AI-based framework was developed that allows for the benchmarking of several classifiers, including Discriminant Analysis (DA), K-NN, Ensemble Learning, DTs, SVM, CNN, mono and bidirectional LSTM, and CNN-LSTM neural networks. Simultaneously, feature selection algorithms are used to extract only the relevant features from the dataset's lower trunk inertial data. Furthermore, a dataset fusion and normalization procedure is applied in order to gather a large amount of data to improve the robustness of the suggested algorithms during the several validation steps.
The remainder of this paper is organized as follows. Section 2 explains the inertial data-based dataset fusion and normalization process, the feature extraction and selection processes, and finally, the comparative analysis carried out during model building and evaluation. Then, Section 3 presents the results obtained with the proposed methods for validation, evaluation and best model selection, followed by comparative studies, which try to uncover how changes in window size and model type influence the final performance of the classification models. Section 4 discusses these results. Finally, Section 5 concludes the document, offering some remarks on the work developed along with directions for future improvements.

Public Dataset Fusion and Normalization
The proposed algorithm makes use of a large number of publicly available datasets to validate activity recognition models based on inertial data collected from the participants' waists. Thus, a vast dataset was built from the fusion and normalization of several public ADL datasets. This process, to the best of our knowledge, has not been carried out before in any other work of the kind, despite its importance having been highlighted multiple times [27,28]. Datasets meeting the following requirements were sought: (i) be publicly available online for download; (ii) contain inertial data (at least from accelerometers and gyroscopes) collected from the lower trunk (back); and (iii) contain postural daily activities and/or fall events. The research carried out resulted in the gathering of the public datasets and three team-owned datasets, which are described in Table 2. The gathered public datasets were as follows.

1. SisFall [16].
2. InertialLab [35]: a dataset which includes data from 11 able-bodied subjects (24.53 ± 2.09 years old, 171 ± 10 cm, 65.29 ± 9.02 kg). Gyroscopes and accelerometers were attached to six lower limb and trunk segments. The activities carried out by the subjects consisted of walking at varying speeds over different terrains (flat, ramps, and stairs), including turns.
The fusion of these various datasets, which include information from both adult and elderly subjects, as well as healthy and unhealthy individuals, served two purposes: first, to create a large and diverse dataset suitable for training ML models; second, to develop AI-based models capable of recognizing ADL regardless of the subject's age and health condition. A global dataset containing 6702 files covering a total of 20 ADL and falls conducted by 180 subjects (age = 33.60 ± 16.84 years, weight = 69.98 ± 10.99 kg, height = 168.99 ± 9.42 cm) was generated. This vast amount of data is crucial for validation purposes, since it is far larger than any other dataset used in the AI-based ADL recognition research surveyed. Furthermore, the global dataset presents a balanced distribution regarding the subjects' gender (M = 54%, F = 46%), also containing data from both young adults and elderly people. However, despite the presence of elderly data, the average and standard deviation values show that the global dataset is still made up mostly of young adults, with people over 65 years old representing just over 20% of the total dataset (37/180 subjects).
Due to the great variability found between datasets, it was necessary to normalize the data from every dataset according to the normalization procedure depicted in Figure 1a. First, only data corresponding to the accelerations (accelerometer) and angular velocities (gyroscope) of the sensors located at the subjects' waists were considered. Then, a sensor reorientation method was applied so that the axis orientation corresponded to the one depicted in Figure 1b. Finally, all datasets underwent a resampling process so that the sampling frequency was normalized to 50 Hz.
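The resampling step can be illustrated with a short sketch. This is written in Python for illustration (the processing in this work was implemented in MATLAB); the function and variable names are illustrative, and linear interpolation is only one possible resampling strategy.

```python
import numpy as np

def resample_to_50hz(t, signal, fs_target=50.0):
    """Linearly resample an inertial signal with timestamps `t` (seconds)
    onto a uniform grid at the target sampling frequency."""
    t = np.asarray(t, dtype=float)
    signal = np.asarray(signal, dtype=float)
    t_uniform = np.arange(t[0], t[-1], 1.0 / fs_target)
    return t_uniform, np.interp(t_uniform, t, signal)

# Example: a 100 Hz accelerometer channel resampled down to 50 Hz
fs_src = 100.0
t_src = np.arange(0, 2.0, 1.0 / fs_src)   # 2 s of data, 200 samples
acc = np.sin(2 * np.pi * 1.0 * t_src)     # synthetic 1 Hz component
t50, acc50 = resample_to_50hz(t_src, acc)
print(len(t_src), len(acc50))             # 200 -> 100 samples
```

The same interpolation is applied channel by channel; datasets recorded above 50 Hz are effectively downsampled, and those below are upsampled.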

ADL and Falls
The datasets, whether public or private, contain the great majority of the ADL used for activity recognition in the literature. Therefore, a total of 20 labels, including periodic activities, static postures, transitions between postures and falls, were used in order to cover all ADL listed in every dataset. It should be noted that some activities whose labels differed between public datasets were recognized as the same activity in this work, since their basic body movement is similar; e.g., SisFall's activities of sitting in high chairs or low chairs were included in the "Stand to Sit" class, and cases of standing in different places, such as in a room or in an elevator, were all included in the "Standing" class. Table 3 lists the ADL that were addressed in this work. A study of how the ADL were distributed showed that the global dataset is unbalanced, with a greater tendency toward cyclical activities, such as walking or lying (29.73% and 18.52%, respectively), and only a small percentage of transitions between activities and fall events, such as the fall by syncope, which is the activity with the least amount of data in the constructed dataset (0.27%). Figure 2 shows the percentage of each activity present in the created dataset. The activities are named according to Table 3.
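The label unification described above can be sketched as a simple lookup table (Python illustration; the raw label strings below are hypothetical stand-ins, not the actual identifiers used in the datasets):

```python
# Hypothetical mapping from (dataset, raw label) to the 20 unified classes.
LABEL_MAP = {
    ("SisFall", "sit_down_high_chair"): "Stand to Sit",
    ("SisFall", "sit_down_low_chair"):  "Stand to Sit",
    ("DatasetX", "standing_room"):      "Standing",
    ("DatasetX", "standing_elevator"):  "Standing",
}

def unify_label(dataset, raw_label):
    """Map a dataset-specific label to its unified class; labels that
    already match a unified class pass through unchanged."""
    return LABEL_MAP.get((dataset, raw_label), raw_label)

print(unify_label("SisFall", "sit_down_low_chair"))  # Stand to Sit
```

Centralizing the mapping in one table makes the fused dataset's class definitions auditable in a single place.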

Machine and Deep Learning Classifiers: Comparative Analysis
We performed a comparative analysis, whose strategy is illustrated in Figure 3, to determine the most suitable AI-based classification model and the subset of features that attains the best performance. In the next sections, every step of the proposed strategy is addressed. In this comparative analysis, the first approach consisted of training several ML classifiers to classify the addressed ADL and fall events with different feature subsets. The feature subsets with the best performance were then used in a second approach, where several Neural Network architectures were proposed and their performances were compared with the ML classifiers. Finally, for the best models found between the ML and DL approaches, a study was carried out to assess the influence of the window size used for feature extraction on the classification model's performance. It should be noted that all operations were performed on the global dataset without the use of any noise filters or other sorts of processing; i.e., the raw inertial data were directly used in the referred procedures. All the processes for the development, validation and evaluation of these ADL recognition algorithms were implemented offline using MATLAB R2021b on a Lenovo Legion Y540 (processor: Intel Core i5, 9th Gen; graphics card: NVIDIA GeForce GTX 1650; memory: 8 GB DDR4 at 2666 MHz; storage: 512 GB PCIe SSD).

Feature Extraction
Feature extraction was achieved through the sliding window method, in which a signal is segmented into several windows of equal size, over which different features are calculated. In the literature, the most used sliding window size corresponds to approximately 1 s for this type of activity classification, and the overlap between consecutive windows varies from 50% to 87% [1,5,14]. Within the scope of this work, a one-second window was first selected for the comparative analysis, which corresponds to a 50-sample window, with an overlap of 80% (Figure 4). In addition to the initial window size, four other sizes were explored in the window size study: 0.5 s, 1 s, 1.5 s, and 2 s. The overlap was kept at 80% for all tests, despite the literature suggesting that it can also have a high impact on the classification performance and on the computational cost for real-time applications [1,36]. The segmentation into windows with a size of 1 s and an overlap of 80% resulted in a total set of 666,660 windows for the models' training and evaluation. Similarly, for window sizes of 0.5 s, 1.5 s, and 2 s, totals of 1,366,289, 435,111, and 318,516 windows were obtained, respectively.
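The windowing described above can be sketched as follows (Python illustration of the generic sliding-window scheme, not the paper's MATLAB code). For a 1 s window (50 samples at 50 Hz) with 80% overlap, each window advances 10 samples past the previous one:

```python
def sliding_windows(n_samples, win, overlap):
    """Return the start indices of fixed-size windows over a signal of
    `n_samples` samples, with a fractional `overlap` between windows."""
    step = max(1, int(round(win * (1.0 - overlap))))  # samples advanced
    return list(range(0, n_samples - win + 1, step))

# 1 s windows (50 samples at 50 Hz) with 80% overlap -> step of 10 samples
starts = sliding_windows(n_samples=500, win=50, overlap=0.80)
print(len(starts), starts[:3])  # 46 windows: (500 - 50) // 10 + 1
```

The window advance time implied by the step (step / fs, here 0.2 s) is the budget a real-time classifier has to process each window.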
Thus, for each window, several features, such as the averages, maximums, minimums, and standard deviations of each of the acceleration and angular velocity signals, among other metrics, were extracted, making a total of 199 features. A summary of the extracted features can be seen in Table 4. Window labeling was carried out according to the Mode Labeling Method, in which the label of a given window is the mode of the labels of that window's samples [36].

Table 4. List of all extracted features from each window created. AP, V and ML refer to the anteroposterior, vertical and mediolateral axes, respectively.

Feature Number | Feature Description
1–6 | Acceleration and angular velocity (AP, V, ML)
7–8 | SumVM of acceleration and angular velocity
9–24 | Skewness and kurtosis of acceleration, angular velocity (AP, V, ML) and SumVM signals
25–… | …

After the aforementioned extraction, min-max scaling to [0, 1] was carried out to ensure a low computational cost when building the models [37]. Since each dataset was collected with different devices and sensors of various sensitivities, the normalization was carried out for each public dataset separately in order to reduce the possible bias derived from the different types, ranges, and sensitivities of the sensors used.
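A minimal sketch of the per-window feature computation, mode labeling, and per-dataset min-max scaling might look like this (Python with synthetic data; only four of the 199 features are shown):

```python
import numpy as np

def window_features(w):
    """Mean, max, min and std of one signal window (a small subset of
    the 199 features summarized in Table 4)."""
    return np.array([w.mean(), w.max(), w.min(), w.std()])

def mode_label(labels):
    """Mode Labeling Method: the window label is the most frequent
    sample label inside the window."""
    vals, counts = np.unique(labels, return_counts=True)
    return vals[np.argmax(counts)]

def minmax_scale(X):
    """Min-max scaling to [0, 1], applied per dataset to reduce bias
    from differing sensor ranges (columns = features)."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)

rng = np.random.default_rng(0)
windows = rng.normal(size=(100, 50))             # 100 windows of 50 samples
X = np.vstack([window_features(w) for w in windows])
Xs = minmax_scale(X)
print(Xs.min(), Xs.max())                        # 0.0 1.0
```

Scaling each source dataset separately, as the text describes, prevents a wide-range sensor in one dataset from compressing the feature range of the others.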

In order to understand what kind of improvements this feature selection process can bring, nine Feature Selection Methods (FSM) with diverse types of selection (Filtering, Wrapper and Embedded) were applied to the extracted features [38,39]. The selected methods, covering the three types of selection mentioned above, and their respective types are indicated in Table 5.

Feature Selection Method (FSM) | FSM Type
Infinite Latent Feature Selection (ILFS) [38] | Filtering
Unsupervised Feature Selection with Ordinal Locality (UFSOL) [40] | Wrapper
Feature Selection with Adaptive Structure Learning (FSASL) [41] | Wrapper
Minimum-Redundancy Maximum-Relevancy (MRMR) [42] | Filtering
Relief-F [43] | Filtering
Mutual Information Feature Selection (MutInfFS) [44] | Filtering
Feature Selection Via Concave Minimization (FSV) [45] | Embedded
Correlation-Based Feature Selection (CFS) [46] | Filtering
Least Absolute Shrinkage and Selection Operator (LASSO) [43] | Embedded
Principal Component Analysis (PCA) [47] | Filtering

Feature selection was also performed by Principal Component Analysis (PCA), as shown in Figure 5. Principal components (PCs) with a cumulative percent explained of 70% were chosen [47], and a resultant, proportional PC was created and used to rank features. PCA also serves a secondary goal: reducing the computational cost of the comparative analysis. Rather than using all 199 features, the number of features used was decided by multiplying by 2 the number of features having a PC value greater than 1/199 [34,48]. As a result, we chose a higher number of features for the comparative analysis than the number selected by PCA, while keeping those with significant contributions to the variability of the data. The success of this feature selection procedure is determined by achieving the best performance with fewer features than the number of features extracted (199).
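The 70% cumulative-variance criterion can be sketched as below (Python, synthetic stand-in for the 199-feature matrix). The "resultant proportional PC" used to rank individual features is only loosely specified in the text, so this sketch covers just the PC-count criterion:

```python
import numpy as np

def pca_explained(X):
    """Eigendecomposition of the covariance matrix; returns the
    cumulative fraction of variance explained by the sorted PCs."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # sorted descending
    return np.cumsum(eigvals) / eigvals.sum()

rng = np.random.default_rng(1)
# Synthetic data: a few strong latent directions plus noise, so that a
# handful of PCs dominates the variance (as happens with the 199 features).
latent = rng.normal(size=(1000, 5))
mixing = rng.normal(size=(5, 40))
X = latent @ mixing + 0.1 * rng.normal(size=(1000, 40))

cum = pca_explained(X)
n_pcs = int(np.searchsorted(cum, 0.70) + 1)   # PCs needed for >= 70%
print(n_pcs)
```

In the paper's data, this criterion selected 11 PCs, and doubling the count of features with PC value above 1/199 (55 features) gave the 110-feature cap used in the comparative analysis.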

Model Building and Evaluation
We compared the following five ML classifiers by using the set of procedures depicted in Figure 3: DA with linear and quadratic approaches; K-NN with squared inverse distances; Ensemble Learning; and DTs. Table 6 presents a description of these classifiers. A progressive comparative analysis was carried out for each of these classifiers, in which, using the several FSM (Table 5), we studied with how many of the ranked features (from most to least important) each classifier obtains its best performance. As previously mentioned, achieving the best classification performance with fewer features than the total number extracted in the previous step will show a successful application of this feature selection procedure and progressive analysis.

DA: A method that finds combinations of features that separate two or more classes of objects or events, searching for the most variance between classes and for information that maximizes the difference between classes.

K-NN [50]: Compares each new instance with all available data, and the instance closest by distance metrics is used to perform classification. Since every sample of the dataset must be checked for every instance, the time and complexity of the method rise with the dataset size.

Ensemble Learning [51]: Creates multiple instances of traditional ML methods and combines them to evolve a single optimal solution to a problem. This approach is capable of producing better predictive models than the traditional approach.

DTs [52]: A model that predicts the value of a target variable based on numerous input variables. A decision tree is constituted by internal nodes, at which the tree splits into branches. The end of a branch that does not split any longer is the decision.
Initially, we performed the hold-out (HO) method to split the created dataset into 70% of the data for training and 30% for testing. The mentioned classifiers were used to obtain the subset of the most relevant features by performing an initial five-fold cross-validation (CV) with one repetition, using only training data. Once we determined the subset of the most relevant features, we performed a five-fold CV with ten repetitions on the four best classifiers from the previous step in order to evaluate the generalization capabilities of each model. The two best classifiers from this step were chosen, and their hyperparameters were optimized through a grid-search process. The optimized ML classifiers and the neural network architectures were further trained with all training data and tested with the test data from the HO method, and the final performance of each classifier was compared in order to choose the best AI-based classification models.
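The validation pipeline above (70/30 hold-out split, with k-fold CV restricted to the training portion) can be sketched as follows. This is a Python illustration with a toy 1-NN classifier and synthetic two-class data standing in for the actual MATLAB classifiers and the 20-class dataset:

```python
import numpy as np

def knn1_predict(Xtr, ytr, Xte):
    """Minimal 1-NN classifier (Euclidean distance)."""
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    return ytr[np.argmin(d, axis=1)]

def kfold_accuracy(X, y, k=5, seed=0):
    """k-fold cross-validation accuracy, run on the training split only
    so that the hold-out test data stays unseen."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        te = folds[i]
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        accs.append((knn1_predict(X[tr], y[tr], X[te]) == y[te]).mean())
    return float(np.mean(accs))

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (60, 4)), rng.normal(4, 1, (60, 4))])
y = np.array([0] * 60 + [1] * 60)

perm = rng.permutation(120)
cut = int(0.7 * 120)                  # 70% train / 30% test hold-out
tr, te = perm[:cut], perm[cut:]
cv_acc = kfold_accuracy(X[tr], y[tr], k=5)
test_acc = float((knn1_predict(X[tr], y[tr], X[te]) == y[te]).mean())
print(round(cv_acc, 2), round(test_acc, 2))
```

The key design point mirrored here is that CV-based model and feature selection only ever touches the 70% training partition; the 30% test partition is reserved for the final comparison.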
In order to understand whether they can be a viable alternative to ML methods, the classification of ADL and fall events was also carried out using the four Neural Network architectures presented in Figure 6. The two best feature subsets obtained from the validation steps were used as input to train and evaluate the DL models. The performance of the neural networks with the two best sets of features was analyzed, and from the results obtained, a choice was made between the ML-based and DL-based models for the most suitable classification model for ADL recognition. During all operations, test specifications such as the loss function, number of epochs, optimizer, number of hidden layers, batch size and learning rate were kept constant for all the architectures. Table 7 provides a summary of these specifications and their respective values.

Table 7. Specifications for the use of the Deep Learning models depicted in Figure 6.

The last study evaluates which of the chosen window sizes provides better performance for the selected classifiers. Due to the great diversity of the classes to be classified (static postures, cyclical activities, transitions between postures and falls) and of their durations, this study became imperative in order to find the size that best fits the activities to be classified. Furthermore, the time required to train and test the K-NN and Ensemble Learning classifiers for each window size was also computed, with the objective of investigating the possibility of using one of the best-performing algorithms in real-time situations.

The key performance indicators in the developed work were the Matthews correlation coefficient (MCC) and the accuracy (ACC). The first metric represents unbalanced classes well, as is the case in this study [55]; we used it for model performance comparison and reporting. The ACC, as well as the sensitivity (Sens), specificity (Spec), precision (Prec), and F1-score (F1S), were calculated for benchmarking against the findings of the literature.
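For reference, both key metrics can be computed directly from the confusion matrix. The Python sketch below uses Gorodkin's multiclass generalization of the MCC, the usual formulation for multiclass problems such as this one:

```python
import numpy as np

def accuracy(y_true, y_pred):
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))

def mcc_multiclass(y_true, y_pred):
    """Multiclass Matthews correlation coefficient computed from the
    confusion matrix (Gorodkin's R_K formulation)."""
    classes = np.unique(np.concatenate([y_true, y_pred]))
    lut = {c: i for i, c in enumerate(classes)}
    C = np.zeros((len(classes), len(classes)))
    for t, p in zip(y_true, y_pred):
        C[lut[t], lut[p]] += 1
    s = C.sum()               # total samples
    c = np.trace(C)           # correctly classified samples
    t_k = C.sum(axis=1)       # true counts per class
    p_k = C.sum(axis=0)       # predicted counts per class
    num = c * s - (t_k * p_k).sum()
    den = np.sqrt((s**2 - (p_k**2).sum()) * (s**2 - (t_k**2).sum()))
    return float(num / den) if den else 0.0

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])
print(accuracy(y_true, y_pred))                 # ~0.833
print(round(mcc_multiclass(y_true, y_pred), 3))
```

Unlike plain accuracy, the MCC stays low when a model mostly predicts the majority class, which is why it is the preferred comparison metric for this unbalanced 20-class problem.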

PCA Outcomes
We determined that 11 PCs were necessary for a cumulative percent explained greater than 70%. Furthermore, the resulting PC demonstrated that there were 55 features with PC values greater than 1/199. After performing the PCA, we reduced the number of features to lower the computational cost of the comparative analysis (Figure 5); i.e., instead of using all 199 features, we only used the first 110 features ranked by any feature selection method included in this study. Only the training data split from the HO method was used.

ADL and Fall Events Classification
The results attained from the five-fold CV with one repetition disclosed the Ensemble Learning classifier as the one presenting the best performance among the used classifiers (MCC = 85.78%; ACC = 94.59%) when using the first 65 features ranked by the PCA method (Appendix A, Table A1). With the first 85 features ranked by Relief-F (Appendix A, Table A1), K-NN produced similar but inferior results (MCC = 85.10%; ACC = 93.63%). DTs performed worse with the first 74 features ranked by the same technique (MCC = 70.65%; ACC = 88.22%), and Quadratic and Linear DA had the worst performance results with the first 55 and 66 features ranked by the Relief-F method, respectively. The two best classifiers went through a five-fold CV with ten repetitions, and we verified that increasing the number of repetitions did not significantly change the CV results for either the Ensemble Learning (MCC = 85.79%; ACC = 94.59%) or the K-NN classifier (MCC = 85.05%; ACC = 93.62%). From this second CV stage, the Ensemble Learning using the first 65 features ranked by PCA and the K-NN using the first 85 features ranked by Relief-F were chosen for the next phases. Table 8 presents the main results for the two phases of the CV process. When using test data from the HO data split method, the two best classifiers presented slight improvements in their performance in comparison to the results shown in Table 8. However, the Ensemble Learning model presented lower results (MCC = 88.36%; ACC = 95.44%) than the K-NN classifier (MCC = 93.19%; ACC = 97.27%) when tested with unseen data, contrary to what was verified during the CV process. After the optimization stage, the K-NN hyperparameters were: (i) distance: Minkowski; (ii) distance weight: squared inverse; (iii) exponent: 0.5; and (iv) number of neighbors: 1. The Ensemble Learning hyperparameters were: (i) method: Bag; and (ii) number of learning cycles: 37. Table 9 depicts the main results obtained for the HO validation process.

Table 9. Hold-out test results for the Ensemble Learning with the first 65 features ranked by the PCA and for the K-NN classifier with the first 85 features ranked by the Relief-F.

Deep Learning Outcomes
The BiLSTM stood out among the neural networks in both case studies, using the first 85 features ranked by the Relief-F method (MCC = 82.83%; ACC = 92.55%) and the first 65 features ranked by PCA (MCC = 80.52%; ACC = 91.48%). However, in both cases, it presented lower results than the Ensemble Learning and K-NN classifiers. On the contrary, the CNN presented the lowest performance in both case studies, using the first 85 features ranked by the Relief-F method (MCC = 37.87%; ACC = 57.01%) and the first 65 features ranked by PCA (MCC = 24.90%; ACC = 42.67%). Thus, the ML-based methods, K-NN and Ensemble Learning, were considered the best-performing classifiers for ADL and fall event recognition among all tested classifiers and were selected for the window size study and classification time analysis. Table 10 contains the main results for the DL-based classification problem.

Window Size and Classification Time
The results attained in this last analysis for the optimized K-NN and Ensemble Learning classifiers are depicted in Table 11. It should be noted that the same labeling method was used during the feature extraction step for each window size selected for this analysis. The results show a decreasing trend in the performance of the two classifiers as the window size increases from 0.5 to 2 s.

Table 11. Window size comparative study results for the K-NN best optimized model with the Relief-F feature selection model.

In addition to the performance metrics, the time required to train and test the K-NN and Ensemble Learning classifiers for each window size used in this last study was also computed, with the objective of studying the possibility of using one of the best-performing combinations in real-time situations. Table 12 depicts the training and testing times obtained for each classifier and window size combination.
Direct observation of Table 12 shows that the K-NN classifier has a training time of around four seconds regardless of the window size, while the Ensemble Learning classifier's training time shows an increasing trend as the window size decreases. On the other hand, the time required to test a single window (last column of Table 12) is lower than 4.5 × 10⁻⁵ s for every window size tested in the case of the Ensemble Learning classification model. The test time per window for the K-NN classifier shows an increasing trend as the window size decreases.
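The per-window classification time measurement, and its comparison against the window advance time, can be sketched as follows (Python; the classifier is a trivial stand-in, not the actual Ensemble Learning model). With 0.5 s windows and 80% overlap, a new window arrives every 0.1 s:

```python
import time
import numpy as np

def time_per_window(predict, windows, repeats=50):
    """Average wall-clock classification time for a single window; this
    is the quantity compared against the window advance time."""
    t0 = time.perf_counter()
    for _ in range(repeats):
        for w in windows:
            predict(w)
    return (time.perf_counter() - t0) / (repeats * len(windows))

# Toy stand-in classifier: threshold on the window's feature-vector mean
predict = lambda w: int(w.mean() > 0)
windows = [np.random.default_rng(i).normal(size=65) for i in range(20)]
t = time_per_window(predict, windows)

# Real-time feasibility check: with 0.5 s windows and 80% overlap, a new
# window arrives every 0.1 s, so t must stay below that advance time.
print(t < 0.1)
```

Averaging over many repeated predictions, as done here, smooths out timer resolution and scheduling jitter in the per-window estimate.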

Discussion
The ML-based classification was established with a combination of different classifiers and feature selection methods, and it proved to be an accurate strategy. This process allowed the most relevant features to be found for the classification of 20 distinct activities and fall classes (Appendix A), which should aid in a time-effective and low computation strategy for real-time ADL recognition. The PCA-based approach for reducing the computational burden of the comparison analysis may therefore be judged effective, since for most of the tested models and feature selection methods, the CV best performance was achieved with fewer features than those initially extracted.
The success achieved regarding the performance of the classification models through different validation methods, namely hold-out and cross-validation, shows the robustness of the applied processes. The creation of a vast dataset also positively influenced the performance results obtained. According to the results presented in Tables 11 and 12, the Ensemble Learning classifier with a subset of the first 65 features ranked by the PCA feature selection method, with a window of 0.5 s and an overlap of 80%, was the model which showed the most potential for the classification of the 20 ADL classes in real time. Despite not being the best performer in terms of evaluation metrics, it had a classification time lower than the window advance time, allowing, in theory, its deployment in real-time situations. In addition, for the Ensemble Learning classifier, the comparative analysis of the feature sets produced by the different feature selection models provided a reduction of more than 65% in the number of features, from the 199 initial features to a subset of 65, further supporting the possibility of applying this type of model in real-time systems.
According to the literature, larger windows tend to perform worse in the recognition of shorter activities or transitions and better in the classification of cyclical or static activities, which are maintained for longer periods of time. Furthermore, as the window size increases, the model's capability to recognize ADL in real time decreases [1,36]. As stated in Table 11, the performance of the classifiers analyzed in this work increases for smaller windows, thus corroborating what is described in the literature.
A direct comparison between the two best results obtained in the tests performed with Neural Networks (Table 10) and the process developed for the ML classifiers (Table 11) shows that using different architectures with the same features ranked by the best feature selection methods as inputs did not produce results as good as those of the K-NN and Ensemble Learning classifiers. Among the DL models, the LSTM and BiLSTM achieved the best results in terms of performance metrics when compared to the CNN and CNN-LSTM architectures, as well as in terms of training and testing computation time. As mentioned in the literature, the LSTM has an architecture more suitable for classifying sequential data [3,22]. Moreover, as reported in the literature, the results achieved by the BiLSTM were slightly superior to those of the basic LSTM [5,56]. The 85 features ranked by Relief-F returned performance metrics slightly higher than the 65 features ranked by PCA, as also happened in the studies carried out on the ML models. CNNs are usually used in problems involving image inputs, and in this process, the implemented CNN architecture received as input 1D arrays of features extracted from sequential data. This usage may explain the poor results obtained with the CNN in terms of performance metrics, given that the convolution processes applied to extract features from the arrays used were not carried out in the most effective way. Despite these poor performance results, other architectures involving this type of network should continue to be tested with different types of inputs, such as raw inertial data, since several studies report positive results in the recognition of ADL with this type of neural network [8,14,23].
Thus, this initial study of the behavior of several neural networks showed that, despite positive results in some of the tested cases, they did not reach the desired potential when compared with the results obtained with the ML classifiers. This raises the need for further studies and changes in the future in order to obtain performance results higher than those found in the literature. The architectures used must be improved in search of better results, so that they become comparable with the studies currently reported in the literature. Finally, as mentioned in most of the works analyzed in the literature, new tests should be carried out in the future with windows segmented from raw inertial data as inputs, rather than features, given the ability of neural networks to perform feature extraction themselves [23].

Conclusions and Future Work
An algorithm to recognize sixteen ADL and four different types of fall (twenty classes in total) was built from several AI-based classification models and feature selection methods, in order to find the combination that presents the best performance in this type of classification. The performance of the different combinations was evaluated using the following parameters: (i) performance evaluation metrics, (ii) the subset of features used, and (iii) the classification time per window of the models.
Two different approaches (ML and DL) were investigated and compared when performing ADL and fall event recognition based on a single waist-located inertial sensor. Furthermore, a new procedure for the fusion and normalization of public datasets was carried out to generate a dataset large enough to validate the activity classification models and to address concerns observed in the literature. Our long-term objective is to evaluate these techniques on elderly waistband users to determine whether the results of this study translate into identical outcomes in continuous real-life usage.
Taking into account the performance values as well as the classification times observed in this work for the machine used, it is concluded that the most effective AI-based classifier was the Ensemble Learning classifier with the first 65 features ranked by the PCA feature selection method. Moreover, the classification time per window of this combination was lower than the window advance time for every window tested, which is an encouraging result for applying this algorithm in real time in the future. The DL outcomes were not as good as those of the prior procedure; however, their potential was demonstrated, indicating that they could be a good option in the future with the appropriate future work on the input data used and on their architectures. Although the established methodology allows for a reduction in computing cost, we recommend using a sufficiently powerful processor so that all classifiers respond rapidly.
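The real-time criterion invoked above can be stated compactly: a classifier keeps up with the sensor stream only if its classification time per window is below the window advance time. The sketch below encodes this check; the window length, overlap, and sampling rate are illustrative assumptions, not the exact configuration of this work.

```python
# Minimal sketch of the real-time feasibility criterion: classification
# must finish before the next window becomes available. All numeric
# values below are illustrative assumptions.

def is_real_time_capable(classification_time_s, window_size_samples,
                         overlap, sampling_rate_hz):
    """Return True if a window can be classified before the next one arrives."""
    # Hop between the starts of two consecutive windows, in samples.
    advance_samples = window_size_samples * (1.0 - overlap)
    # Time between consecutive windows, in seconds.
    advance_time_s = advance_samples / sampling_rate_hz
    return classification_time_s < advance_time_s

# e.g., a 200-sample window at 100 Hz with 50% overlap advances every 1 s,
# so a 0.039 ms classification time leaves a very comfortable margin.
feasible = is_real_time_capable(0.000039, 200, 0.5, 100)
```

With a per-window cost in the tens of microseconds, the margin is dominated by feature extraction and I/O rather than by the classifier itself.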
The constructed dataset contains a large amount of successfully normalized data for the validation of activity recognition algorithms. Different steps can still be applied for its continuous improvement, such as the addition of more data from senior subjects and the use of data-balancing techniques, such as data augmentation, to balance the number of samples of ADL represented on a smaller scale. However, the choice between using balanced data or preserving the overall distribution of activities should be backed by a critical analysis of the results obtained through validation with data collected in day-to-day circumstances over long periods of time. Data fusion from other sensors, such as magnetometers or barometers, as well as the addition of other devices for data collection (such as smartwatches, which do not impose any restrictions on the subject's movements), may aid in the differentiation of activities with similar movement patterns. Improved data-splitting methods must be verified in the future to guarantee that the results attained were not due to the use of similar data during the training and testing stages of the models. Despite attaining performance results comparable with the literature for the bidirectional LSTM, considering the higher number of classes classified, continued development of the tested neural networks is needed. New and improved architectures, as well as ablation studies, should also be conducted to determine the impact of the various settings and stages of each architecture on ADL and fall event recognition. Furthermore, the constructed classifiers should be trained and tested on the various public datasets independently, in order to enable a more direct comparison between the methods covered in this study and those in the literature that use the same datasets for evaluation.
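One common augmentation step for balancing under-represented inertial classes is jittering, i.e., creating noisy copies of existing windows. The sketch below illustrates the idea; the noise level, seed, and function name are illustrative assumptions rather than the procedure adopted in this work.

```python
# Minimal sketch of jitter-based data augmentation for inertial windows.
# Noise standard deviation and seed are illustrative assumptions.
import random

def augment_with_jitter(window, n_copies, noise_std, seed=0):
    """Create noisy copies of a window by adding Gaussian jitter per sample."""
    rng = random.Random(seed)
    copies = []
    for _ in range(n_copies):
        copies.append([x + rng.gauss(0.0, noise_std) for x in window])
    return copies

# Three jittered copies of a 4-sample window, e.g., to oversample a rare fall class.
window = [0.0, 0.5, 1.0, 0.5]
extra = augment_with_jitter(window, n_copies=3, noise_std=0.01)
```

As noted above, whether such synthetic balancing helps should itself be validated against data collected in real day-to-day conditions.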
This study contributes to the state of the art with two versatile ADL and fall event detection methods, capable of discriminating 20 classes of events. There is also evidence that dataset fusion and normalization are imperative to guarantee a vast and diverse amount of data for the validation of ADL recognition algorithms. Furthermore, this work may lead to the incorporation of these tools into instrumented waistbands or other devices for real-time fall risk assessment in the future [57].

Conflicts of Interest:
The authors declare no conflict of interest.