Random Forest for Automatic Feature Importance Estimation and Selection for Explainable Postural Stability of a Multi-Factor Clinical Test

Falling is a common incident that affects the health of elder adults worldwide. Postural instability is one of the major contributors to this problem. In this study, we propose a supplementary method for measuring postural stability that reduces doctor intervention. We used simple clinical tests, including the Timed Up and Go test (TUG), the Short-Form Berg Balance Scale (SFBBS), and the Short Portable Mental Status Questionnaire (SPMSQ), to measure different factors related to postural stability that have been found to increase the risk of falling. We attached an inertial sensor to the lower back of a group of elderly subjects while they performed the TUG test, providing us with a tri-axial acceleration signal, from which we extracted a set of features, including multi-scale entropy (MSE), permutation entropy (PE), and statistical features. Using the score for each clinical test, we classified our participants as fallers or non-fallers in order to (1) compare the features calculated from the inertial sensor data, and (2) compare the screening capabilities of the multifactor clinical test against each individual test. We used random forest to select features and classify subjects across all scenarios. The results show that the combination of MSE and statistical features provides the best overall classification results. Meanwhile, PE is not an important feature in any scenario in our study. In addition, a t-test shows that the multifactor test of TUG and BBS is a better classifier of subjects in this study.


Introduction
Almost 30% of adults over 65 years old worldwide experience falls [1,2], and falls represent the main cause of their injuries, which include movement impairment, fractures, long-term or permanent disabilities, and death [3]. Suffering from a fall represents not only a great health risk, but is also associated with important economic costs which can range from hospitalizations to long-term home care [4]. Furthermore, adults who have suffered from a traumatic fall can be affected by the fear of suffering from a second fall (FOF), which increases the probability of experiencing recurrent episodes [5]. Approximately 50% of adults who have suffered from a fall are considered potential recurrent fallers [6].
Falls are a multifactorial problem that can mainly be attributed to intrinsic (behavioral, physical, and cognitive) and environmental reasons [7]. Among the intrinsic factors, poor balance and gait abnormalities are estimated to be related to 10-25% of falls [8]. Moreover, previous studies have shown that problems with balance [9–12] and mobility [5,13–16] can greatly increase the probability of experiencing a fall. The abundant health implications associated with falling have fostered a substantial number of studies captured by our sensor. Multiscale Entropy (MSE) is a variation of traditional entropy, which facilitates the quantification of the complexity of physical and physiological time series. MSE has been implemented to assess differences in balance by evaluating the complexity between different groups of subjects [32]. Costa [33] analyzed the complexity of the gait signal of subjects while performing a free or paced walk, and concluded that MSE was capable of detecting characteristics of the signal that other statistical tools could not. Riva et al. [34] found MSE to have a positive relationship with fall history, making it a useful instrument to identify individuals at risk of falling. In a similar study, Lee et al. [35] concluded that MSE can be used as a tool to screen falling behavior among elder adults in a community-dwelling setting. The computation of MSE, however, requires the calculation of multiple time series under different scales, which can be time consuming, making it difficult to use when decisions must be made immediately. As an alternative, Bandt and Pompe [36] proposed Permutation Entropy (PE), which is computationally simple, making it ideal for larger time series or databases. The implementation of PE for gait analysis was studied by Lee et al. [37], where an inertial sensor was used to measure the gait information from a subject performing the TUG test.
Subjects also performed a Short-Form Berg Balance Scale (SFBBS) test as a means to measure balance. The authors used a set of statistical, PE, and weighted permutation entropy (WPE) features to successfully estimate the SFBBS score, which can provide doctors with information on the fall risk of patients. Despite promising results, this study did not implement a multifactorial assessment test, which has been shown to be more effective than a single clinical tool at capturing the complex nature of falls [38,39]. Furthermore, the study did not compare PE and MSE, even though both tools were designed to measure the complexity of a signal. This encouraged us to compare the importance that MSE and PE features can have in the analysis of gait signals while predicting scores for multiple clinical tests.
Our research focuses on studying the application of combined inertial sensors with multifactor assessments, namely (i) the Timed-up and Go test, (ii) Short-Form Berg Balance Scale, and (iii) Short Portable Mental Status Questionnaire to develop an auxiliary tool for medical professionals to assess mobility and balance. The main highlights of this research are: first, we use features that can be automatically extracted or calculated from data collected by inertial sensors without any processing or segmentation, thereby reducing the burden on medical staff and creating a tool that provides data that can be easily interpreted by the doctors or physicians across hospitals. Second, we also compare the performance of our method across different clinical tests in order to increase the robustness of our model by estimating multiple factors that can cause falls among the elder population. Finally, we use MSE features as a means to measure the complexity of the TUG signals and compare their impact to the classification performance against permutation entropy.

General Approach
Wearing an inertial sensor capable of measuring acceleration in three directions, subjects performed a series of balance and mobility tests. With the data collected by the sensor, we calculated a set of features, which included statistical, MSE, and PE features. We used these features to train a Random Forest classifier in order to predict the subjects' scores in the multifactor assessment. From these results, we estimated feature importance and compared the model performance when using the most important features for each clinical test. By doing this, we were able to determine a set of features that can best predict the mobility and stability scores of the participants in our study.

Subjects
Assisted by a team of medical professionals, which included physiotherapists, functional therapists, and rehabilitation physicians, we performed a series of clinical tests between April 2014 and May 2015 in a hospital in central Taiwan to assess fall risk among the elderly population. Subjects who participated in the study wore a belt around their waist with a tri-axial inertial sensor attached to it, which was located at their lower back while they performed a series of tests (which we discuss in more detail in Section 2.3). At the end of the study, we collected inertial acceleration data from 65 different elderly adults (average age 76.12 ± 6.99 years). The recruitment criteria for the participants stated that they must not have previously suffered from any musculoskeletal injuries, they must not have any history of central nervous system injuries, and they had to be able to walk independently in order to perform the clinical tests. Despite being collected almost seven years ago, this dataset remains relevant as it is focused on a problem that continues to affect the health of a growing elderly population. Moreover, as it was collected using a sensitive sensor, following a careful and scientific methodology from a wide range of elder subjects, and with the support and supervision of professional medical staff, it has allowed our team to continue to develop different methodologies to study it. A summary of the demographic data from the subjects who participated in the study is included in Table

Clinical Tests
Due to the demography of our subject population, we implemented quick and simple clinical assessment tests to study mobility and balance, which the participants performed under the supervision of our medical team. This set of tests included:

• The Timed Up and Go test (TUG) [40] is a common clinical test of gait and mobility. Different geriatric institutions recommend its implementation for fall-risk screening [41]. Previous studies have determined the effectiveness of using an inertial sensor during a TUG test to measure mobility [42], as well as to detect frailty [43], which could potentially result in a fall. It has also been shown to be an accurate measurement tool for predicting falls among community-dwelling elder adults [44]. Physicians commonly employ this clinical test in community settings due to its ease of implementation. Before starting the TUG test, subjects sit on a chair in a comfortable position, facing an object on the floor, which is located 3 m in front of them. When the test starts, subjects are asked to stand up, walk naturally towards the object, then return to the chair at their natural pace and sit down. The total time the subjects require to perform this test is recorded, and subjects who take more than 12.47 s are labeled as having mobility problems [44]. A summary of the label distribution for each clinical test can be observed in Table 2.

• The Short-Form Berg Balance Scale (SFBBS) [45] is the simplified version of the Berg Balance Scale (BBS) [46], which is used to assess balance. It is easier to perform as it has half the number of activities, greatly reducing the time required to assess a subject. These activities include (i) bending forward with outstretched arms, (ii) standing on both feet while keeping the eyes closed, (iii) standing with one foot in front of the other, (iv) turning the back and neck to look backwards without moving the feet or knees, (v) bending down to pick up an object from the floor, (vi) standing on one foot while having the other foot in the air, and (vii) standing up from a chair and sitting down again. While subjects perform an SFBBS test, a medical expert evaluates their performance by assigning scores to each activity. The performance criteria state that a score between zero points (subject was unable to perform the activity) and four points (subject completed the activity without problems) is assigned based on the expert's observations. Therefore, in this study, subjects who scored 28 points were considered to have correct balance since they were able to perform all seven tasks without problems. Meanwhile, subjects with a score below 23 were labeled as having balance problems [47,48].

• The Short Portable Mental Status Questionnaire (SPMSQ) [49] is a sensitive clinical tool to detect brain syndromes such as dementia in elderly adults [50]. Elderly patients diagnosed with dementia have a two to three times higher risk of suffering from a fall when compared to patients with healthy cognition [51–55], which makes detecting dementia an important step towards fall risk assessment. A previous study included SPMSQ among the clinical tools used to assess fall-risk factors in a community-dwelling environment [56]. The SPMSQ consists of a set of 10 questions which patients can choose to answer independently or with the help of their family members at home. These questions focus on evaluating cognitive functions such as memory, attention, thinking process, consciousness, general knowledge, and orientation. This provides a preliminary insight into the mental health status of elder adults. Granger et al. [57] determined that a score in the SPMSQ of 60 provides crucial information as patients are transitioning from assisted independence to dependence. In addition, subjects are considered to be at risk of suffering from dementia if they answer three or more questions incorrectly. We used this criterion to label subjects as not having normal brain functionality.
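As an illustration, the labeling criteria described for the three clinical tests can be sketched in Python as follows. The class, function, and field names are hypothetical and are not taken from the study's code. The sketch also assumes that only SFBBS scores below 23 are treated as balance problems, since the text does not state how scores between 23 and 27 were handled.

```python
from dataclasses import dataclass

# Cut-offs stated in the text for each clinical test.
TUG_THRESHOLD_S = 12.47      # more than 12.47 s -> mobility problems
SFBBS_THRESHOLD = 23         # score below 23 -> balance problems
SPMSQ_ERROR_THRESHOLD = 3    # three or more wrong answers -> cognitive risk


@dataclass
class ClinicalScores:
    """Hypothetical container for one subject's raw test results."""
    tug_seconds: float
    sfbbs_points: int
    spmsq_errors: int


def label_subject(scores: ClinicalScores) -> dict:
    """Return a 0/1 risk label per clinical test (1 = at risk)."""
    return {
        "tug": int(scores.tug_seconds > TUG_THRESHOLD_S),
        "sfbbs": int(scores.sfbbs_points < SFBBS_THRESHOLD),
        "spmsq": int(scores.spmsq_errors >= SPMSQ_ERROR_THRESHOLD),
    }
```

For example, a subject who needs 13.0 s for the TUG test, scores 28 points in the SFBBS, and answers every SPMSQ question correctly would be labeled as at risk only for the TUG test.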

Wearable Accelerometer
We collected the TUG accelerometer data using a wireless tri-axial accelerometer system (comprised of the Freescale RD3152MMA7260Q accelerometer, a Bluetooth transmitter, a battery as the power source, and an Arduino as the data-processing device). While the subjects performed their clinical tests, the sensor was located at the lower back of the subjects, around the area between the L3 to L5 vertebrae, since previous studies have concluded that this location approximates the center of mass of the human body [58], making this the most common location for similar studies published within the last two decades [59]. This sensor recorded TUG acceleration data in the mediolateral (ML), vertical (V), and anterior-posterior (AP) directions. An illustration of the sensor system with an overview of the axis directions is included in Figure 1.

Data Analysis
We used Python to automatically calculate the set of features, perform the analysis, and train the classification model. Table 3 summarizes the set of features we used in this study, which we calculated from the unprocessed TUG acceleration signal captured by the inertial sensor. We divided them into statistic, MSE, and PE groups. We selected these features as calculating them requires no signal processing or analysis. Among the statistic set, we calculated the mean, standard deviation, maximum value, minimum value, and zero-crossing rate (ZCR) for each axis. Lee and Sun [60] included ZCR among their feature set, which they used in order to screen fallers from non-fallers. In our study, ZCR measures the frequency at which the gait signal crosses through zero acceleration. The MSE feature set included the average and standard deviation across all time scales and the complexity index.
Table 3. List of features extracted from the inertial sensor data.
MSE is an effective tool to measure a physiologic time series' complexity [61]. Costa et al. [33] found that MSE is capable of detecting differences in gait of subjects under pathologic conditions. In our study, we calculated MSE for the entire TUG signal in order to obtain information of the gait of our subjects. A flowchart detailing the steps needed to calculate MSE is illustrated in Figure 2. As part of the first step, multiple overlapping segment windows with a length equal to the current scale factor are extracted from a given time series. For each window, the average value for all the data points is calculated, creating a new time series known as coarse-grained time series. The formula used to calculate this is shown in Equation (1), where τ is the scale factor, N is the length of the original time series, and x_i is a single data point from the original time series.
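The coarse-graining step can be illustrated with a short Python sketch. It assumes the original formulation of Costa et al., in which the windows are consecutive and do not overlap, so that scale factor τ yields ⌊N/τ⌋ averaged points. If the windows are meant to overlap, as the description above suggests, a moving average would replace the slicing shown here. The function name `coarse_grain` is ours and is not taken from the study's code.

```python
def coarse_grain(series, scale):
    """Average consecutive, non-overlapping windows of length `scale`.

    Assumption: standard Costa-style coarse-graining. Trailing samples
    that do not fill a complete window are discarded.
    """
    if scale < 1:
        raise ValueError("scale must be a positive integer")
    n_windows = len(series) // scale
    return [
        sum(series[j * scale:(j + 1) * scale]) / scale
        for j in range(n_windows)
    ]
```

With scale factor 1 the series is returned unchanged (as floats), and with scale factor 2 the series 1, 2, 3, 4, 5, 6 becomes 1.5, 3.5, 5.5.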
The next step involves calculating sample entropy (SampEn) for each coarse-grained series. SampEn was developed in order to analyze the complexity of biological time series [62]. SampEn is defined as the negative logarithm of the probability that a series has two sets of consecutive data points (of size m + 1) with distance < r, given that the same series contains two sets of consecutive data points (of size m) with distance < r. This is expressed in mathematical notation in Equation (2), where N represents the length of the input time series. For our study, we defined these parameters as m = 2 and r = 0.15.
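A minimal, unoptimized sketch of this definition is given below. Several details are assumptions, because the text only fixes m = 2 and r = 0.15: the tolerance is taken as r times the standard deviation of the input series (a common convention), the distance between two templates is the maximum absolute difference of their elements, and self-matches are excluded. The function name is ours.

```python
import math


def sample_entropy(series, m=2, r=0.15):
    """Sample entropy of `series`; quadratic time, for illustration only.

    Assumptions: tolerance = r * standard deviation of the series,
    Chebyshev distance, strict inequality (distance < tolerance).
    """
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    tolerance = r * std

    def count_matches(length):
        # Use the same N - m starting points for both template lengths.
        templates = [series[i:i + length] for i in range(n - m)]
        matches = 0
        for i in range(len(templates)):
            for j in range(i + 1, len(templates)):
                distance = max(
                    abs(a - b) for a, b in zip(templates[i], templates[j])
                )
                if distance < tolerance:
                    matches += 1
        return matches

    b = count_matches(m)       # matching pairs of length m
    a = count_matches(m + 1)   # matching pairs of length m + 1
    if a == 0 or b == 0:
        return float("inf")    # undefined: no matches were found
    return -math.log(a / b)
```

A perfectly periodic signal yields a value of zero, because every match of length m also matches at length m + 1, whereas an irregular signal yields a positive value.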
Finally, Costa et al. [61] proposed the use of the complexity index (CI) to evaluate fall behavior, which is defined as the summation of the sample entropy values for all scale factors τ. This measurement has been shown to be an effective tool to screen community-dwelling elderly people for falling behavior [35]. The formula used to calculate CI can be observed in Equation (3), where τ represents the scale factor. In our study, we defined τ = 10.
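For reference, the three quantities described in this section can be written out as follows. These are the standard definitions from the MSE literature and are only assumed to correspond to Equations (1) to (3). The symbols y_j^(τ) (coarse-grained series), B^m(r) and A^(m+1)(r) (numbers of matching templates of length m and m + 1), and τ_max (largest scale factor) are our notation.

```latex
\begin{align}
y_j^{(\tau)} &= \frac{1}{\tau} \sum_{i=(j-1)\tau+1}^{j\tau} x_i,
  \qquad 1 \le j \le \left\lfloor N/\tau \right\rfloor \\
\mathrm{SampEn}(m, r, N) &= -\ln \frac{A^{m+1}(r)}{B^{m}(r)} \\
\mathrm{CI} &= \sum_{\tau=1}^{\tau_{\max}} \mathrm{SampEn}\!\left(y^{(\tau)}\right),
  \qquad \tau_{\max} = 10
\end{align}
```

Under these definitions, the MSE mean and standard deviation features are the mean and standard deviation of the ten SampEn values that are summed to obtain CI.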

Permutation Entropy Calculation
PE quantifies complexity by estimating the frequency of sequence patterns within a time series. In the first step, a one-dimensional time series of length T is converted into a D × (T − (D − 1)τ) matrix, where D represents the embedding dimension, which defines the size of each column vector in the matrix, and τ represents the embedding time delay, which determines the number of time periods that separate consecutive elements within each column.
Next, the elements of every column are sorted in ascending order, which maps the column to one of D! possible permutations and yields the ordinal ranking of the data and the corresponding ordinal pattern. These ordinal patterns are labeled as π_i = {r_0, r_1, r_2, ..., r_{D!−1}}.
Using the ordinal patterns π_i, the relative frequency of each permutation (defined as the number of times the permutation is present in the time series divided by the total number of sequences) is calculated. This result can be interpreted as the probability p_i of finding each permutation in the time series.
Finally, using the previously calculated probabilities, the PE value can be calculated following Equation (4). As pointed out by [63], a more regular time series is characterized by having a lower PE value.
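The steps above can be sketched in Python as follows. The embedding dimension and delay used in the study are not stated, so the defaults below (`order=3`, `delay=1`) are placeholders, and the base-2 logarithm is an assumption about the form of Equation (4). The function name is ours.

```python
import math
from collections import Counter


def permutation_entropy(series, order=3, delay=1, normalize=False):
    """Permutation entropy in bits.

    `order` is the embedding dimension D and `delay` is the time delay.
    Both defaults are illustrative placeholders, not the study's values.
    """
    n_vectors = len(series) - (order - 1) * delay
    if n_vectors < 1:
        raise ValueError("series is too short for this order and delay")
    counts = Counter()
    for start in range(n_vectors):
        window = [series[start + k * delay] for k in range(order)]
        # Ordinal pattern: the indices that sort the window in ascending order.
        pattern = tuple(sorted(range(order), key=window.__getitem__))
        counts[pattern] += 1
    probabilities = [c / n_vectors for c in counts.values()]
    entropy = -sum(p * math.log2(p) for p in probabilities)
    if normalize:
        entropy /= math.log2(math.factorial(order))
    return entropy
```

A strictly increasing series contains a single ordinal pattern and therefore has a PE of zero, which matches the observation that more regular series have lower PE values.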

Random Forest for Feature Importance and Classification
In our study, we trained a Random Forest [64] classifier to estimate feature importance. Random Forest for feature selection has been used in problems such as power generation forecasting [65], network intrusion detection [66], and leukemia and cervical cancer classification [67]. To reduce the bias of our model towards the samples in the training set, we employed a 50-fold cross-validation approach. We repeated this process for each clinical test in order to determine the relationship between features and clinical tests. We estimated the importance of each feature by calculating the mean coefficient value for every feature across folds. Once the importance of all features was estimated, we compared the classification performance of our model by re-training it using the top 5, top 10, and top 15 features. A similar approach was employed in previous research [68], where the authors calculated the mean coefficient scores for each feature, then selected the top 30, 20, 10, and 5 features to test their Random Forest model's performance. We repeated the 50-fold cross-validation approach to obtain a mean AUC score for each scenario.
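The procedure can be sketched with scikit-learn as shown below. This is an illustration under stated assumptions rather than the study's implementation: the "coefficient" of each feature is taken to be the impurity-based `feature_importances_` attribute of `RandomForestClassifier`, the forest uses 100 trees, and 5 stratified folds are used instead of 50 so that every test fold contains both classes, which the AUC requires. How the AUC was computed on smaller folds is not stated in the text. All function names are ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold


def mean_importance(X, y, n_splits=5, seed=0):
    """Average the forest's feature importances over the training folds."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    per_fold = []
    for train_idx, _ in cv.split(X, y):
        forest = RandomForestClassifier(n_estimators=100, random_state=seed)
        forest.fit(X[train_idx], y[train_idx])
        per_fold.append(forest.feature_importances_)
    return np.mean(per_fold, axis=0)


def mean_auc(X, y, n_splits=5, seed=0):
    """Mean ROC AUC over the held-out folds."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in cv.split(X, y):
        forest = RandomForestClassifier(n_estimators=100, random_state=seed)
        forest.fit(X[train_idx], y[train_idx])
        positive_prob = forest.predict_proba(X[test_idx])[:, 1]
        scores.append(roc_auc_score(y[test_idx], positive_prob))
    return float(np.mean(scores))


def top_k_auc(X, y, k, n_splits=5, seed=0):
    """Re-train on the k most important features and report the mean AUC."""
    order = np.argsort(mean_importance(X, y, n_splits, seed))[::-1]
    top = order[:k]
    return mean_auc(X[:, top], y, n_splits, seed), top
```

Calling `top_k_auc(X, y, k)` with k set to 5, 10, and 15 reproduces the three scenarios described above. Because the ranking in this sketch is computed on the same subjects that are later used for evaluation, the resulting AUC values are likely to be optimistic.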

Results and Discussion
We began our analysis by classifying every subject as either fall risk or non-fall risk using the scores and the specific criteria for each clinical test. Next, we calculated the features from the TUG acceleration data collected by the inertial sensor. Using the Python library scikit-learn [69], we estimated feature importance with a Random Forest (RF) classifier, and compared it across clinical tests. We re-trained the RF classifier using the top 5, 10, and 15 features and tested the model's performance for each clinical test. We then compared the impact of including and excluding MSE from the feature set so as to estimate its effect in measuring balance and mobility, two key factors in fall-risk classification. Finally, we compared the model's performance across the multifactor clinical tests, in order to determine which assessment tool has the best screening capabilities for fall risk among the community-dwelling elderly subjects who participated in our study. The discussion of our results is divided into three main segments: (a) feature selection for each clinical test, (b) classification performance for each clinical test under multiple criteria for feature selection, and (c) comparison between the classification performance with and without MSE features.

Feature Selection for Each Clinical Test
The top 5, 10, and 15 features that our model determined to be the most important for each clinical test are summarized in Table 4. The results show that Standard Deviation (for all three directions), Maximum Value (for the ML and V directions), Minimum Value (V), Zero-Crossing Rate (ML), and MSE Mean (ML) are present within these features for all clinical tests. This is consistent with [37], as the authors found several of these features to have an impact on the screening performance of the model. Additionally, these results indicate that, from an axis point of view, ML and V are critical in the classification of the subjects. Furthermore, the selection of MSE features as important for the screening of fall-risk subjects is also consistent with previous studies [35], indicating that the measurement of signal complexity can help to detect differences in balance and mobility. The remaining features are different for each case since every clinical test measures different characteristics of the subject's posture. Additionally, PE was selected among the top features for two clinical tests; however, its importance is clearly lower than that of MSE. We attribute this to PE's focus on sample order without considering amplitude. Moreover, having multiple parameter pairs can also lead to testing problems since the values of PE are directly dependent on the parameter setting, as was discovered by [70].

Classification Performance for Each Clinical Test under Multiple Criteria for Feature Selection
After training Random Forest with 50-fold cross validation for multiple criteria of feature selection, we tested each model's performance by analyzing the mean AUC scores summarized in Table 5. As is evident from the results, the model can classify the subjects according to their respective clinical test scores with high accuracy in most cases. This indicates that the set of features we calculated from the TUG signal is sensitive enough to be used in our study, which is consistent with the findings of [20], who concluded that inertial sensors can be used in fall-risk assessment studies. From the results, it can also be observed that combining SPMSQ with other clinical tests yields the worst performance as it results in the lowest AUC scores. We attribute this to the nature of the SPMSQ, where the score is based on the answers of a written questionnaire, which are highly subjective. Moreover, this test does not measure any balance or motion from the subject, making it more difficult for the set of features we calculated from the TUG data to estimate its score. In addition, PE features were only present on SPMSQ tests, which clearly indicates that other features such as MSE have a higher importance for the clinical tests that directly test mobility and balance. This table also shows that the best results are obtained when selecting the top five features. Considering that such groups include MSE features, we tested the impact on the model's performance of removing MSE, and discuss the results in Section 3.3.

Classification Performance with and without MSE
In this comparison, we removed SPMSQ and its combinations as they have the worst screening results, as previously discussed. Table 6 summarizes the comparison in mean AUC scores of the model when MSE is excluded from the feature set. The results summarized in this table show an overall reduction in the model's performance, emphasizing the importance of MSE to analyze the complex TUG acceleration signal. The importance of MSE in our model is in accordance with [34], where it was concluded that MSE can help to identify subjects at risk of suffering from a fall. It can also be observed that including MSE improves the classification accuracy of the model across clinical tests, independent of the percentage of features selected. In addition, the multifactor test outperforms the single BBS assessment in all scenarios, which is consistent with previous studies that determined that a multifactor test is better at capturing the complex nature of falls [38,39]. Despite TUG having a higher AUC score than the multifactor test, it is important to point out that the latter is simultaneously assessing both mobility and balance, which are two of the main factors that affect falls. A similar tendency is observed in Table 7, where the average precision and average recall values for the classifiers clearly decrease after removing MSE from the feature set. It can also be observed that the highest precision and recall values are found when using the top 10 and top 5 features on the multifactor test. This further indicates the importance of using a multifactor test. In addition, the values presented in this table indicate that the models are robust and have high classification accuracy. We also tested the normality of our results from Table 6 using the Kolmogorov-Smirnov test, and compiled the results in Table 8. The p-value for each condition is > 0.05, indicating that the data are consistent with a normal distribution.
Finally, in Table 9 we included the results of our t-test, which we performed in order to determine whether including or excluding MSE has any statistical impact on the accuracy of each scenario. We found that the "Top 10" and "Top 5" combinations of features have statistically significant differences. Looking back at Table 4, we can see that the features ranked after the top 10 indeed do not appear in every combination. In particular, after comparing the results in Table 6, it can be seen that among the top 10 features, MSE has relative discriminative power. The results from our statistical analysis are consistent with [29]. Additionally, in the t-test we also included the different clinical tools we used for our study. From the results, we can also observe a statistical difference, further highlighting the discrimination capability of MSE.
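The two statistical checks can be sketched with SciPy as follows. The text does not state whether the t-test was paired or independent, nor how the reference normal distribution was parameterized, so the sketch assumes an independent two-sample t-test and a Kolmogorov-Smirnov test against a normal distribution fitted to each sample. Estimating the parameters from the same sample makes the Kolmogorov-Smirnov test conservative. The function name and the dictionary keys are ours.

```python
import numpy as np
from scipy import stats


def compare_scores(with_mse, without_mse, alpha=0.05):
    """Check normality of two score samples and compare their means.

    Assumptions: KS test against a normal distribution with the sample's
    own mean and standard deviation; independent two-sample t-test.
    """
    samples = {
        "with_mse": np.asarray(with_mse, dtype=float),
        "without_mse": np.asarray(without_mse, dtype=float),
    }
    normality_p = {}
    for name, sample in samples.items():
        result = stats.kstest(
            sample, "norm", args=(sample.mean(), sample.std(ddof=1))
        )
        normality_p[name] = float(result.pvalue)
    t_stat, t_p = stats.ttest_ind(samples["with_mse"], samples["without_mse"])
    return {
        "normality_p": normality_p,
        "t_stat": float(t_stat),
        "t_p": float(t_p),
        "significant": bool(t_p < alpha),
    }
```

In this sketch, a normality p-value above 0.05 means that normality cannot be rejected, and a t-test p-value below 0.05 means that the two feature sets lead to significantly different mean scores.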

Conclusions
This study analyzed the application of statistical, MSE, and PE features calculated from the inertial sensor data of elder subjects to estimate their scores from multiple clinical tests, as these tests can support medical professionals in screening elder adults for fall risk. We showed that using automatically extracted features from inertial sensor data can provide good screening performance, as our model was capable of estimating the multifactor score of the different tests with high accuracy. By analyzing the feature selection, we found that the important features belonged to a combination of statistical and MSE features, indicating that PE was less important when predicting clinical scores. Furthermore, MSE features were present among the top features for all clinical tests. This led to a comparison of the impact on classification AUC score of including and excluding MSE from the set of features to be used in the model. The results of this comparison showed that including MSE features increases the performance of the model when estimating BBS, TUG, and TUG + BBS medical scores. We also found that the multifactor assessment not only provides better results than the single BBS clinical tool, but also categorizes subjects based on mobility and stability, two factors that have been found to be related to falls. In the future, we plan to compare the impact that MSE has on fall risk assessment when two different groups of subjects participate in the study. Furthermore, we plan to investigate whether the same set of features is selected as important when different sensors are used to collect the data.

Informed Consent Statement: Informed consent was obtained from all subjects involved in the study.

Data Availability Statement: We have signed contracts with the hospital that prevent us from distributing or uploading the data collected.