Detection of Fall Risk in Multiple Sclerosis by Gait Analysis—An Innovative Approach Using Feature Selection Ensemble and Machine Learning Algorithms

One of the common causes of falls in people with Multiple Sclerosis (pwMS) is walking impairment. Therefore, assessment of gait is of importance in MS. Gait analysis and fall detection can take place in the clinical context using a wide variety of available methods. However, combining these methods while using machine learning algorithms for detecting falls has not been performed. Our objective was to determine the most relevant method for determining fall risk by analyzing eleven different gait data sets with machine learning algorithms. In addition, we examined the most important features of fall detection. A new feature selection ensemble (FS-Ensemble) and four classification models (Gaussian Naive Bayes, Decision Tree, k-Nearest Neighbor, Support Vector Machine) were used. The FS-Ensemble consisted of four filter methods: Chi-square test, information gain, Minimum Redundancy Maximum Relevance and RelieF. Various thresholds (50%, 25% and 10%) and combination methods (Union, Union 2, Union 3 and Intersection) were examined. Patient-reported outcomes using specialized walking questionnaires such as the 12-item Multiple Sclerosis Walking Scale (MSWS-12) and the Early Mobility Impairment Questionnaire (EMIQ) achieved the best performances with an F1 score of 0.54 for detecting falls. A combination of selected features of MSWS-12 and EMIQ, including the estimation of walking, running and stair climbing ability, the subjective effort as well as necessary concentration and walking fluency during walking, the frequency of stumbling and the indication of avoidance of social activity achieved the best recall of 75%. The Gaussian Naive Bayes was the best classification model for detecting falls with almost all data sets. FS-Ensemble improved the classification models and is an appropriate technique for reducing data sets with a large number of features. Future research on other risk factors, such as fear of falling, could provide further insights.


Introduction
People with Multiple Sclerosis (pwMS) are often at a high risk of falling due to many neurological abnormalities associated with the disease progression [1]. There are several risk factors in pwMS for falling, such as a high Expanded Disability Status Scale (EDSS), disease activity, sensory or cognitive impairments, impaired perception resulting from visual impairment and bladder incontinence as well as gait abnormalities and fear of falling [2][3][4][5][6][7][8][9][10]. Among these risk factors, gait abnormalities are considered to be one of the most common impairments in pwMS [11,12].
There are different possibilities to assess gait impairment, for example, rater-based, sensor-based and patient-based methods. Rater-based methods are objective.
An ensemble benefits from different metrics used by each filter method to rate the features by relevance [50]. Therefore, we used four filter methods: Chi-square test (Chi-Square), information gain (InfoGain), Minimum Redundancy Maximum Relevance (MRMR) and RelieF. These methods are based on different metrics (statistical, information, mutual information, distance), which makes them suitable for an ensemble [46]. We limited the feature selection methods to filters because these methods are independent of a classification model in contrast to wrapper and embedded methods. As a result, these methods do not tend to overfit and will produce more general results than wrapper methods [32]. The filter methods Chi-Square, MRMR and RelieF were performed using MATLAB R2021b [51]. The filter method InfoGain was performed using WEKA 3.8 [52].
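As a hedged illustration of the ranking stage, the sketch below uses scikit-learn on synthetic data; `chi2` and `mutual_info_classif` stand in for the Chi-Square and InfoGain filters, while MRMR and RelieF (run in MATLAB and WEKA in this study) are omitted:

```python
# Minimal sketch of the ranking stage of a filter-based FS-Ensemble.
# Synthetic data; chi2 and mutual_info_classif approximate the paper's
# Chi-Square and InfoGain filters. MRMR and RelieF are not included.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import chi2, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=200, n_features=20, random_state=0)
X = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative inputs

def rank_features(scores):
    """Return feature indices ordered from most to least relevant."""
    return list(np.argsort(scores)[::-1])

rankings = {
    "chi_square": rank_features(chi2(X, y)[0]),
    "info_gain": rank_features(mutual_info_classif(X, y, random_state=0)),
}
for name, ranked in rankings.items():
    print(name, ranked[:5])  # top-5 features per filter
```

Each filter thus produces a full ranking of all features; the threshold and combination stages operate on these rankings.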
All four filter methods generated a ranked list of features. A threshold was applied to select a subset of features of each filter method (filter subsets). Then, a combination method combined the four filter subsets. In the literature, this approach showed a better performance than first applying a combination method and, subsequently, a threshold method [46,47].
Three thresholds and four combination methods were investigated according to a study by Seijo-Pardo et al. [46]. The thresholds were set to 50%, 25% or 10% and determined the number of features in each filter subset. The following combination methods were applied to merge the subsets:
• Union: a feature was selected for the final subset if this feature was present in at least one of the filter subsets.
• Union 2: a feature was selected in the final subset if this feature was in at least two of the filter subsets.
• Union 3: a feature was selected in the final subset if this feature was in at least three of the filter subsets.
• Intersection: a feature was selected in the final subset if this feature was in all of the filter subsets.
The final subset created in each process contains all input features used for the classification models. The combination methods were performed using Python version 3.8.8.
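The threshold and combination stages described above can be sketched as follows; the rankings here are illustrative toy lists, and the vote-counting formulation (Union = at least one vote, Intersection = a vote from every filter) is one straightforward way to implement the four combination methods:

```python
# Sketch of the threshold + combination stage of the FS-Ensemble,
# assuming each filter has already produced a full ranking of feature
# indices (best first). Toy rankings over 4 features for illustration.
from collections import Counter

def select_subset(ranking, threshold, n_features):
    """Keep the top `threshold` fraction of a ranked feature list."""
    k = max(1, round(threshold * n_features))
    return set(ranking[:k])

def combine(filter_subsets, min_votes):
    """Union (min_votes=1), Union 2/3, or Intersection (min_votes = number of filters)."""
    votes = Counter(f for subset in filter_subsets for f in subset)
    return {f for f, v in votes.items() if v >= min_votes}

rankings = [[0, 1, 2, 3], [1, 0, 3, 2], [2, 1, 0, 3], [1, 2, 3, 0]]
subsets = [select_subset(r, threshold=0.5, n_features=4) for r in rankings]
print(combine(subsets, min_votes=1))  # Union
print(combine(subsets, min_votes=4))  # Intersection
```

With a 50% threshold over four features, each filter contributes its top two features; Union keeps everything any filter voted for, while Intersection keeps only features all four filters agree on.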


Classification Models
Four classification methods were applied: Gaussian Naive Bayes, Decision Tree, k-Nearest Neighbor and Support Vector Machine. The Classification and Regression Tree algorithm (CART algorithm) was used to build up the Decision Tree. The Radial Basis Function (RBF) kernel was used for training the Support Vector Machine. The parameters of the Decision Tree, the k-Nearest Neighbor and the Support Vector Machine were optimized using grid search. The ranges of the hyperparameters are shown in Table 2. A stratified 5-fold cross-validation with the F1 score as the evaluation metric was implemented. The F1 score is more suitable than accuracy in the case of imbalanced data [53,54]. F1 score, recall, precision, specificity and Cohen's kappa were calculated to evaluate and compare the different models based on the hyperparameter optimization using 5-fold cross-validation. The stratified 5-fold cross-validation was repeated 10 times to reduce bias when splitting the data into the folds.
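This evaluation setup can be sketched with scikit-learn as below; the data are synthetic and the parameter grid is illustrative, not the one from Table 2:

```python
# Hedged sketch of the evaluation setup: grid search over hyperparameters
# with repeated stratified 5-fold cross-validation and F1 as the
# selection score. Synthetic imbalanced data; illustrative kNN grid.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, weights=[0.8, 0.2], random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": [3, 5, 7], "metric": ["euclidean", "manhattan"]},
    scoring="f1",  # preferred over accuracy on imbalanced data
    cv=cv,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Stratification keeps the faller/non-faller ratio stable across folds, and repeating the split reduces the variance introduced by any single partition.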
A permutation test was performed on each model to verify how the results compare to random guessing [55]. The test performed 1000 permutations. Each permutation consists of data points with randomly chosen labels. This removed the dependency between features and labels. The p-value of this test was calculated to estimate whether the predictions were better than random guessing. Machine learning classification and evaluation were performed using Python version 3.8.8 and scikit-learn version 0.18.1 [56].
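A permutation test of this kind can be sketched with scikit-learn's `permutation_test_score`; the data are synthetic, and only 100 permutations are used here for speed (the study used 1000):

```python
# Sketch of a permutation test checking that a classifier beats random
# guessing: labels are shuffled repeatedly to break the feature-label
# dependency, and the true score is compared to the null distribution.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, permutation_test_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=200, random_state=0)
score, perm_scores, p_value = permutation_test_score(
    GaussianNB(), X, y,
    scoring="f1",
    cv=StratifiedKFold(n_splits=5),
    n_permutations=100,  # the study used 1000
    random_state=0,
)
print(f"F1 = {score:.2f}, p = {p_value:.3f}")
```

A small p-value indicates that the model's score on the true labels is rarely matched by models trained on shuffled labels, i.e., the predictions are unlikely to be random guessing.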
Not all data sets were available for each patient due to mobility impairments, system errors or missing self-reports.

Feature Selection Ensemble and Classification
Four classification models were used to determine the most suitable method for fall detection. For a better overview, we focused on the best classification results for each data set. PROs (MSWS-12 and EMIQ) achieved the best F1 score (F1 = 0.54 ± 0.00). The All data set showed an almost equally good performance (F1 = 0.53 ± 0.00). In these three data sets, recall was higher than precision. Rater-based methods had a better F1 score (F1 = 0.48 ± 0.00 and F1 = 0.47 ± 0.01) than sensor-based methods. Among the sensor-based methods, the measurements with cognitive tasks showed worse F1 values (F1 = 0.43 ± 0.01 and F1 = 0.42 ± 0.00) than measurements during normal walking (F1 = 0.39 ± 0.00 and F1 = 0.38 ± 0.01). Balance tests showed the worst results (F1 = 0.30 ± 0.01 and F1 = 0.29 ± 0.02). Gaussian Naive Bayes was the best classification model for all data sets except for ML_S_EO, where the k-Nearest Neighbor was the best model (F1 = 0.29 ± 0.02). However, this performance was still poorer than the Gaussian Naive Bayes results on all other data sets. ML_S_EO and ML_S_EC achieved the worst results with the Support Vector Machine (F1 = 0.20 ± 0.01 and F1 = 0.19 ± 0.02). All of the best-performing classification models of each data set generated a highly significant value (p ≤ 0.001) in the permutation test.
The best results for the F1 score for each classification model are shown in Figure 2. The overall best performances for each data set are shown in Table 4. If two results showed the same F1 score for a data set, the result with fewer features was chosen.

Figure 2. Best performance (F1 score) of the four classification models on fall detection using different data sets. The error bars present the mean and standard deviation across the repeated 5-fold cross-validation. All data set = all features of the following data sets; Basic data set = 25 Foot Walk Test + 2-minute walk test + EDSS; Expanded Disability Status Scale (EDSS); GAITRite System normal (GR_N data set) and dual-task (GR_D data set) walking; Mobility Lab Gait normal (ML_N data set) and dual-task (ML_D data set) walking; Mobility Lab Romberg stance with open (ML_S_EO data set) and closed eyes (ML_S_EC data set); Twelve Item Multiple Sclerosis Walking Scale (MSWS-12); Early Mobility Impairment Questionnaire (EMIQ).

The optimized parameters for each classification model are shown in Appendix A (Table A1). Tables A2–A12 in Appendix B show the classification results for all eleven data sets. The FS-Ensemble improved the classification performance in most cases or at least did not lead to any fallbacks. There was no threshold or combination method that always worked best, but a restrictive selection often improved the performance.
When using the data set with all 428 features (All data set), all models showed a better F1 score when reducing the number of features. The Gaussian Naive Bayes outperformed the other classification models in each experiment. Additionally, the method showed a consistently lower standard deviation. Decision Tree achieved the worst performances in most cases. The best performance for the Gaussian Naive Bayes classification model was achieved by reducing to 9 features (99% reduction), for the Decision Tree when limiting the number of features to 9 (99% reduction), for the Support Vector Machine when restricting to 25 (94% reduction) features and for the k-Nearest Neighbor when using 44 features (90% reduction). Figure 3 shows the classification performances of the four classification models using all data (All data set) depending on the number of features.
Brain Sci. 2022, 12, 1477
No assumption is made for the performance of feature subsets that lie between the data points. All data set = 25 Foot Walk Test + 2-minute walk test + Expanded Disability Status Scale + GAITRite System data sets + Mobility Lab Gait data sets + 12 item Multiple Sclerosis Walking Scale data set + Early Mobility Impairment Questionnaire data set.
After determining the best method for each data set, we considered the list of features selected in each method. Table 5 shows the final feature subsets that achieved the best performance according to Table 4. (Table 5 excerpt, GR_N data set, GNB: Step Extremity (ratio) L, Step Length (cm) L, Step Length (cm) R, Stride Length (cm) R; 4/82 features.)

Although only features from EMIQ and MSWS-12 were selected by the filter methods, the All data set achieved the highest recall (74.0%). Q1_EMIQ, Q4_EMIQ, Q8_EMIQ and Q12_MSWS-12 were selected in both final subsets: using all data and the individual data set. For detecting falls with the Gaussian Naive Bayes when using the All data set, the features addressing the patient's judgment of walking, running and stair climbing ability, subjective effort as well as necessary concentration and walking fluency during walking, frequency of stumbling and indication of avoidance of social activities were used. For rater-based data sets (Basic and EDSS), all given features were used to achieve the best performance. GAITRite data sets concentrated on step count, velocity and step and stride length for the left and right leg, whereas under dual-task conditions, only step and stride length were used. The final feature subset of the Mobility Lab normal walking data set used for fall detection consisted of the percentage amount of the individual stride phases, the gait speed and the stride length of the right leg. Under dual-task conditions, the number of used features was reduced to the stride length of the left leg, the percentage amount of the Terminal Double Support phase of the right leg and the degree of the toe-off angle of both legs. During balance testing with eyes open, the jerk trajectory and degree of the 95% Ellipse rotation were the best features for detecting falls. For balance testing with eyes closed, almost all features were used. The final feature subset of the MSWS-12 data set included, just like the All data set, the concentration and estimation of the balance and needed walking aid at home. The final feature subset of the EMIQ data set included, just like the All data set, the concentration, frequency of stumbling and indication of avoidance of social activities.
Furthermore, the indication of avoidance of sports activities was used.

Discussion
Our objective was to investigate which of the eleven data sets is suitable for detecting falls and which features are most important for this purpose. A new feature selection ensemble (FS-Ensemble) and four classification models were applied. The questionnaires (MSWS-12 and EMIQ) achieved the best performance (F1 = 0.54 ± 0.00/0.01) and seemed most suitable for fall detection when using the Gaussian Naive Bayes. Thus, we are able to confirm our hypothesis that patients' self-assessments can better reflect health status than other methods. PROs can provide more in-depth insights into a specific domain, are thought to be more sensitive to changes in health status, and their questions may seem more relevant and, therefore, more acceptable to patients [13,57,58]. This statement is supported by the composition of the feature subset of the All data set, which consists exclusively of questionnaire parameters. A recall of 74% for fall detection can be attained using a selection of features of MSWS-12 and EMIQ.
Contrary to the assumption that balance parameters are more suitable for detecting falls, based upon a more sensitive recording of gait alterations compared to typical gait data [15,16], our results show that data sets with balance parameters (ML_S_EO, ML_S_EC) are less able to detect falls in pwMS (F1 = 0.30 ± 0.01 and F1 = 0.29 ± 0.02) than PROs and rater-based items with the machine learning methods used. Including further machine learning methods could still improve their F1 score.
Rater-based methods showed the second-best performance after patient-based methods. One reason could be the examination of different bodily functions. Research on fall risk factors in inpatients identified twenty different intrinsic and extrinsic factors [59]. This suggests that including multiple factors, as collected with the EDSS, leads to a better result than using only sensor-based data. The performances were not improved when including the T25-FW and the 2MWT in addition to the EDSS score and the functional scores (Basic data set). Thus, these methods bring no added benefit but instead increase the effort of data collection. In this case, we do not recommend this combination of tests and EDSS scores when detecting falls with machine learning.
Sensor-based systems such as the GAITRite system or the Mobility Lab show the best F1 scores with the Gaussian Naive Bayes but seem to be unsuitable for fall detection compared to the questionnaires and rater-based methods. The systems are only occasionally used in routine clinical practice, and clinicians only consider a few parameters in the assessment. An automatic analysis of the results of these sensor systems would be a benefit for the users. First, this could save personnel capacities, and second, an evaluation of all available parameters would be possible. Therefore, in our opinion, further research with the sensor-based system used is promising. Additionally, extending the database can improve performance by collecting more sensor-based data from fallen patients.
The Gaussian Naive Bayes outperformed the other classification methods and is suitable for handling these different data sets ( Figure 2). Thus, we recommend using the Gaussian Naive Bayes for detecting falls in pwMS. The fewer features included in the calculations, the better the detection of falls ( Figure 3).
Knowing that perceived fragility in the wake of fear of falling can lead to falls, we recommend including this factor in future studies. Fear of falling can reduce self-confidence, worsen physical health, including mobility, due to social withdrawal and physical inactivity, and subsequently lead to falls [8]. Therefore, some studies defined fear of falling as a risk factor for fall detection [60,61]. Thus, detecting fear of falling and appropriate treatment can reduce the risk of falling [8].
Our study is not without limitations. Self-reported questionnaires, as well as the question "How often have you fallen in the last thirty days?", always collect retrospective information, so a recall bias could impact the reporting of falls. A basic human need is independence. As reporting of falls indicates a lack of independence, MS patients may under-report their number of falls [10,62]. Bias could be reduced by recording falls more accurately via a diary, in which patients note falls daily. Furthermore, the use of longitudinal data would be interesting to predict falls.
Missing values are a major problem of studies, as they affect the significance of results. A natural selection of patients who were unable to complete gait tests due to the severity of their disease limits the validity of our results. Cognitive and balance tasks are more challenging since patients with increased cognitive impairment need more attention while walking in order to compensate for further disorders [14]. This may result in missing data in more severely impaired individuals and could explain the poor performance of the sensor-based data sets in our study ( Figure 2) compared to other studies [14][15][16][17].
The data show significant differences between patients who fall and those who do not in relation to their age and how their MS is treated. To achieve better comparability of the data, a propensity score matching procedure would be beneficial. Due to the high number of patients included in our study (N = 1240), we dispensed with this procedure.
Furthermore, the answer options for the questionnaires were coded from one to four or five. The questionnaire items have a natural order (ordinal features). Therefore, the items were treated as quasi-metric features in the preprocessing step to process all features equally and to use the same classification models. Thus, the results must be considered with caution: the numeric distances between the coded items could be unequal, so the results do not necessarily correspond to reality. However, this assumption is a common preprocessing step in statistics and machine learning [63,64].
Test sets were not used in our study due to the small data size. Therefore, the results could have a positive bias. We addressed this problem by only using filter methods, in contrast to other studies [46,48,65]. Filter methods are more general than other methods and do not tend to overfit [35,66]. In addition, a stratified 5-fold cross-validation was used for the grid search and performance evaluation. This validation method is often used for small data sizes and makes the results more generally valid [67]. However, the split of the folds into the train and validation set is a significant issue for calculating and interpreting the validation score, since the score could be obtained by chance [68]. Thus, the cross-validation was repeated 10 times for performance evaluation and preceded by a permutation test. These methods are sufficient to compensate for the missing test sets.
The data sets in this study were imbalanced due to the nature of fall detection. The used machine learning methods are not suitable for handling these imbalanced data sets [69]. This problem can be reduced with feature selection to delete redundant and irrelevant features. Thus, the methods could achieve better performance. Our FS-Ensemble consisted of four filter methods. These filter methods are suitable to ensure diversity for the ensemble [36,55], but another ensemble configuration could increase the diversity and, thus, improve the classification performance. Seijo-Pardo et al. used an ensemble that consists of four filter methods and two embedded methods [46]. They summarized that the ensemble works better than the individual feature selection methods. Moghimi et al. investigated filter, wrapper and embedded methods as an ensemble [70]. A combination of each one of these selection methods achieved the best classification performance.
The FS-Ensemble failed when the data set already consisted of very few features (e.g., Basic, EDSS, EMIQ and MSWS-12). The nature of our feature selection method allows the selection of a few or even only one feature. This feature set size would be insufficient for some of the machine learning algorithms in this study. Most machine learning methods detect patterns by considering feature dependencies; this benefit disappears when using only one or a few features. Furthermore, the applied permutation test failed in some of these cases (p > 0.001). This means that the classification results occurred randomly and confirms that one or a few features in a machine learning model are insufficient for providing valid results. Since some classification models can handle a single feature within the data set, the same method was applied throughout all machine learning algorithms for completeness to detect the most suitable one. For this purpose, the FS-Ensemble was an appropriate method. However, we recommend using a higher threshold and the Union combination method on data sets with few features.

Conclusions
We investigated eleven data sets derived from widely used gait assessment systems and PRO questionnaires for mobility assessment to evaluate their predictive power for the detection of falling in pwMS. We can confirm that patients' self-assessments can better reflect health status than other methods. Therefore, easy-to-use questionnaires (MSWS-12 and EMIQ) can provide a highly cost-effective tool for day-to-day clinical practice when assisted by a sufficiently trained machine learning algorithm. These kinds of algorithms could be implemented in AI-based medical software with very little effort compared to sensor-based assessment systems. Nevertheless, sensor data allow additional in-depth insights into the specific impairment as the root cause of the fall risk and a possible starting point for successful treatment.
The FS-Ensemble is appropriate to improve the classification performance using data sets with more than 30 features. Further research is needed to investigate whether classification performance can be improved and generalized with more data from fallen patients. Additionally, the inclusion of more risk factors of falling, such as fear of falling, must be investigated for preventing falls.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement: The data presented in this study are available on reasonable request from the corresponding author.

Table A1. Results of the hyperparameter optimization for each data set; Decision Tree (DT); k-Nearest Neighbor (kNN); Support Vector Machine (SVM); Euclidean (Eucl.); Manhattan (Manh.); Expanded Disability Status Scale (EDSS); Basic data set = 25 Foot Walk Test + 2-minute walk test + EDSS; All data set = all features of the Basic data set, the GAITRite System data sets, the Mobility Lab Gait data sets, the Twelve Item Multiple Sclerosis Walking Scale data set and the Early Mobility Impairment Questionnaire data set.

Table A6. Performance of the four classification models on fall detection using the GAITRite System dual-task walking data set; Gaussian Naive Bayes (GNB); Decision Tree (DT); k-Nearest Neighbor (kNN); Support Vector Machine (SVM); p-value via permutation test.

Table A7. Performance of the four classification models on fall detection using the Mobility Lab Gait normal walking data set; Gaussian Naive Bayes (GNB); Decision Tree (DT); k-Nearest Neighbor (kNN); Support Vector Machine (SVM); p-value via permutation test.

Table A12. Performance of the four classification models on fall detection using the Early Mobility Impairment Questionnaire data set; Gaussian Naive Bayes (GNB); Decision Tree (DT); k-Nearest Neighbor (kNN); Support Vector Machine (SVM); p-value via permutation test.