Performance and Characteristics of Wearable Sensor Systems Discriminating and Classifying Older Adults According to Fall Risk: A Systematic Review

Sensor-based fall risk assessment (SFRA) utilizes wearable sensors for monitoring individuals’ motions in fall risk assessment tasks. Previous SFRA reviews recommend methodological improvements to better support the use of SFRA in clinical practice. This systematic review aimed to investigate the existing evidence of SFRA (discriminative capability, classification performance) and methodological factors (study design, samples, sensor features, and model validation) contributing to the risk of bias. The review was conducted according to recommended guidelines and 33 of 389 screened records were eligible for inclusion. Evidence of SFRA was identified: several sensor features and three classification models differed significantly between groups with different fall risk (mostly fallers/non-fallers). Moreover, classification performance corresponding to AUCs of at least 0.74 and/or accuracies of at least 84% was obtained from sensor features in six studies and from classification models in seven studies. Specificity was at least as high as sensitivity among studies reporting both values. Insufficient use of prospective design, small sample size, low in-sample inclusion of participants with elevated fall risk, high amounts and low degree of consensus in used features, and limited use of recommended model validation methods were identified in the included studies. Hence, future SFRA research should further reduce the risk of bias by continuously improving methodology.


Introduction
Falls are the second leading cause of accidental or unintentional injury resulting in death worldwide [1]. Approximately 35% of all people aged 65 years or older fall every year [2] and the incidence of falls increases with age [3]. Important risk factors include impaired balance and gait performance, polypharmacy, and a history of previous falls [4]. Interventions combining fall preventive physical activities with strategies to increase safety in home environments have proven to be the most effective in reducing the incidence and risk of falls [5]. Technologies can improve the efficiency and effectiveness of fall prevention interventions. Hence, fall prevention technologies are mainly used to assess and decrease fall risk, to increase adherence to fall prevention training interventions, or to detect falls and raise alarms in case of an accident [6].
Sensor-based fall risk assessment (SFRA) utilizes wearable sensors for monitoring individuals' motions during assessment tasks. The sensor signals are processed, and specific features are extracted and incorporated into algorithms which aim at predicting fall occurrence or classifying individuals into risk categories [7]. Several reviews of the state-of-the-art of SFRA research were published during 2012-2019.
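To make this pipeline concrete, the sketch below extracts a few commonly reported signal features (RMS amplitude, range, dominant frequency) from a simulated 3D accelerometer recording. Both the feature set and the simulated signal are illustrative assumptions, not taken from any study included in this review.

```python
import numpy as np

def extract_features(acc, fs):
    """Extract a few illustrative features from a 3D accelerometer signal
    (shape: n_samples x 3). Hypothetical feature set for illustration."""
    mag = np.linalg.norm(acc, axis=1)   # acceleration magnitude
    mag = mag - mag.mean()              # remove the static (gravity) component
    spectrum = np.abs(np.fft.rfft(mag))
    freqs = np.fft.rfftfreq(len(mag), d=1.0 / fs)
    return {
        "rms": float(np.sqrt(np.mean(mag ** 2))),   # overall signal intensity
        "range": float(mag.max() - mag.min()),      # amplitude range
        # dominant frequency, skipping the DC bin:
        "dominant_freq_hz": float(freqs[1:][np.argmax(spectrum[1:])]),
    }

# Simulated 10 s walk sampled at 100 Hz with a ~2 Hz step rhythm
fs = 100.0
t = np.arange(0.0, 10.0, 1.0 / fs)
acc = np.column_stack([
    0.5 * np.sin(2 * np.pi * 2.0 * t),          # anteroposterior axis
    0.2 * np.sin(2 * np.pi * 4.0 * t),          # mediolateral axis
    9.81 + 1.0 * np.sin(2 * np.pi * 2.0 * t),   # vertical axis incl. gravity
])
features = extract_features(acc, fs)
```

Features extracted in this way would then be fed into the discriminative or classification analyses discussed below.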
In 2012, Shany et al. discussed the practicalities and challenges associated with the use of wearable sensors for the quantification of older people's fall risk [7]. They identified several study design elements that need to be fulfilled in order to support future real-life use of SFRA. These include: (1) prospective design, (2) larger validations of higher quality enabling meta-analyses, (3) rigorous testing, including test-retest reliability and rater effects, (4) validation of SFRA tools on different samples and by research groups other than those suggesting/developing the tool, and (5) an increased focus on SFRA supporting clinical staff in supervised assessments [7].
The following year, Howcroft et al. (2013) published a systematic review of SFRA in geriatric populations using inertial sensors. The review was based on 40 articles published from 2003 to March 2013 and confirmed the need for prospective design in SFRA research [8]. Moreover, Howcroft et al. emphasized the need to use separate datasets in training and validation of classification models, and more appropriate intelligent computing methods, such as neural networks and Bayesian classifiers, instead of regression [8]. The use of separate datasets had been neglected in 50% of the studies involving classification models included in the review [8]. Howcroft et al. also identified a need for: (1) systematically assessing which combinations of sensor body locations and sensor-based variables result in high reliability, (2) investigating long-term user compliance with SFRA methods, (3) using SFRA in specialized populations with systematic matching of predictive variables to specific fall risk factors, and (4) comparing accuracies of SFRA methods with accuracies of clinical assessments, both obtained by prospective studies [8].
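Why separate training and validation datasets matter can be illustrated with a minimal synthetic example: tuning even a one-parameter classifier (a single cut-point on one feature) on the same data used for evaluation typically inflates the apparent accuracy relative to held-out data. The data and classifier below are entirely artificial assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
feature = rng.normal(size=n)                                       # one sensor feature
label = (feature + rng.normal(scale=1.0, size=n) > 0).astype(int)  # 1 = faller

train, test = np.arange(0, 50), np.arange(50, 100)  # separate datasets

# "Training": pick the cut-point maximizing accuracy on the training half
cuts = np.sort(feature[train])
train_accs = [np.mean((feature[train] > c) == label[train]) for c in cuts]
best_cut = cuts[int(np.argmax(train_accs))]

# Accuracy on the data used for tuning vs. on held-out data
acc_train = np.mean((feature[train] > best_cut) == label[train])
acc_test = np.mean((feature[test] > best_cut) == label[test])
print(f"training accuracy: {acc_train:.2f}, held-out accuracy: {acc_test:.2f}")
```

The gap between the two numbers is the optimism that in-sample evaluation introduces; with large feature pools and flexible models the effect grows, which is exactly the concern raised by Howcroft et al.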
In 2015, Shany et al. published a review of articles including features extracted from sensor signals in statistical models intended to estimate fall risk or predict falls in older people [9]. This review, which was based on 31 articles published 1997-2015, identified problems with publication bias, inadequate sample sizes, inadequate number of fall events in samples, misuse and lack of model validation, deficiencies in model selection and feature extraction procedures, and insufficient use of prospective fall occurrence as serious issues [9]. Shany et al. (2015) pointed out that some of the included studies reported classification accuracies exceeding the estimated theoretical maximal accuracy (0.81) in predicting the occurrence of a fall during a one-year period [10]. They concluded that the prediction performance was overestimated in the literature, mainly due to small samples, large feature pools, model overfitting, lack of validation, and misuse of modelling techniques [9]. Therefore, Shany et al. suggested that sample bias should be prevented by recruiting cohorts ensuring that an adequate number of falls occur and by considering the recommendations of 1:10 features/event [11] during feature selection [9]. They also suggested improvements in feature selection by tightening the significance thresholds, removing redundant features, and selecting the correct statistical methods [9]. Finally, the need for appropriate model validation methods, preferably by external validation of the final model, was stressed [9].
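The 1:10 features/event guideline translates into a simple feature budget. The helper below is a hypothetical illustration of that arithmetic, not code from the cited work.

```python
def max_features_for_events(n_fall_events, epv=10):
    """Feature budget under the events-per-variable rule of thumb:
    at most one candidate feature per `epv` outcome events (here, falls).
    Illustrative helper, not taken from the cited studies."""
    return n_fall_events // epv

# e.g., a cohort in which 40 participants fell during follow-up supports
# at most 4 candidate features under the 1:10 guideline
budget = max_features_for_events(40)
print(budget)  # -> 4
```

This makes the link between sample size and feature pools explicit: many of the small samples discussed later would support only a handful of candidate features.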
Roeing et al. [12] conducted a review on the use of mobile health applications for the assessment of balance, i.e., one of the fall risk factors. The review included 13 articles published 2011-2016. Several of the articles included young samples, while others lacked information on the studied group of participants. Five articles assessed the validity of mobile health applications by comparing the collected data with data collected using 3D motion capture measurement, an accelerometer, or a force platform.
Three systematic reviews were published in 2018 [13][14][15]. Sun and Sosnoff [13] reviewed the use of novel sensing technology in fall risk assessment in 22 articles published 2011-May 2017. Their recommendations for future research included the use of: (1) prospective fall occurrence of at least 6 months to label subjects, (2) a reduced number of variables, selection of variables based on previous research, and (3) appropriate model validation [13].
Montesinos et al. [14] presented a systematic review and meta-analysis of the use of wearable inertial sensors for fall risk assessment and prediction in older adults. The review included 13 articles published up until 2016. Montesinos et al. [14] identified strong/very strong associations between fall risk assessment outcomes and nine triads (combinations of a sensor feature category, a task, and a sensor placement). The recommended and not-recommended triads were found to be task-dependent when analyzing the tasks quiet standing, sit-to-stand/stand-to-sit, Timed Up and Go (TUG) test, and walking. For both quiet standing and sit-to-stand/stand-to-sit, the recommended feature category and sensor location were linear acceleration and lower back. For TUG, the recommended feature category and sensor location were temporal and shins. For walking, there existed both recommended and not-recommended triads. The recommended combinations of sensor feature category and sensor location for the walking task were: (1) angular velocity-shins, (2) frequency-upper back, and (3) frequency-lower back. The not-recommended combinations during walking were: (1) angular velocity-lower back, (2) frequency-shins, and (3) linear acceleration-shins. Hence, the sensor location recommended by [14] varies depending on the feature category, particularly for walking.
Rucco et al. [15] reviewed the type and location of wearable sensors for monitoring falls during static and dynamic tasks in healthy elderly. The review was based on 42 articles published 2002-2017. Rucco et al. concluded that the majority of studies used a maximum of two sensors, with accelerometers and gyroscopes being the most common, and that the majority of studies presented preliminary results [15]. The trunk was identified as the most studied body segment. The most frequently used tasks varied depending on whether the task was static or dynamic. For measuring static stability, a quiet standing test with eyes opened/closed was most common. For dynamic evaluations, the most common tasks were walking and stand-sit tests [15]. Finally, Rucco et al. [15] stated that information on performance, i.e., accuracy, sensitivity, and specificity, was too diverse and did not allow for evaluating the impact of different system characteristics. Therefore, they identified the need for gold standards in terms of sensors (types, position) and tasks.
In 2019, Bet et al. published a systematic review of fall detection and fall risk assessment in older persons using wearable sensors [16]. The review, which was based on 29 different articles published 2002-2019, presented performance metrics and reported on the number of sensors, sensor types, sensor locations, and assessment tasks. It should be noted that 20 of the articles included only accelerometer features. The use of other sensors was sparse: one article used only gyroscope features, five used a combination of accelerometer and gyroscope features, two used a combination of accelerometer and barometer features, and one used a combination of accelerometer, gyroscope, and magnetometer features. Bet et al. also analyzed sensor locations and found that the most common location was the waist (8 articles), followed by the lumbar region (7), ankle (4), pelvis (4), and head (3) [16].
It is worth noting here that different terminologies have potentially been used to denote the same sensor location in previous review articles. For example, Montesinos et al. [14], who identified recommended and not-recommended triads, used the notation shins in their triads, while Bet et al. [16] identified four articles with sensors located on the ankle. Further, the most frequently used locations in [16] were the waist and lower back (lumbar spine), whereas [14] stated that the most common placement was the lower back (approximately L3). Rucco et al. [15] used the notation trunk for sensors located at L3, L5, sternum, waist, pelvis, neck, and chest. Hence, comparing the results obtained in this review with results from the previous reviews is not straightforward.
The aim of this systematic review was to analyze the characteristics and performance of wearable sensor systems used to assess older people's fall risk by classifying individuals according to fall risk or by discriminating between groups of older people with different fall risk. The following research questions were in focus: RQ1 What is the evidence of SFRA in terms of (a) discriminative capability, and (b) classification performance? RQ2 Which of the previously identified risk factors for study bias can be identified among the included studies? The risk factors analyzed included: (a) low use of prospective study design, (b) use of small study samples with low amounts of fall events, (c) low consensus in features used in SFRA models; and (d) misuse of model validation methods.

Literature Search
The systematic literature review was conducted according to the PRISMA guidelines [17]. The review elements (aim including PICO elements, eligibility criteria and outcomes) are defined in Table 1. In brief, the inclusion criteria required that studies: (2) labelled participants' fall risk using (a) retrospective (RE) data, (b) prospective (PRO) data, (c) clinical (CLIN) data, or (d) a combination of a-c; (3) included a sample with N ≥ 10 and age ≥ 60 years; (4) used wearable or mobile inertial sensors to characterize movements by extracting features from sensor signals; and (5) reported evidence of SFRAs in terms of (a) discriminative capacity (statistically significant discriminatory features) and/or (b) classification performance (accuracy, sensitivity, specificity). Inclusion criteria 2-4 were based on a previous systematic review of SFRA [14].
Papers must not include participants with severe cognitive impairment, e.g., dementia.
Papers must not only include measurements of total physical activity by activity monitors.

Outcomes
(a) Qualitative data on features with statistically significant discriminative capacity (p < 0.05).
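The classification performance outcomes (accuracy, sensitivity, specificity) and the AUC values reported later can all be derived from test-set labels and model risk scores. The following pure-numpy sketch illustrates the computations; the labels and scores are made-up values, not data from any included study.

```python
import numpy as np

def classification_metrics(y_true, y_score, threshold=0.5):
    """Accuracy, sensitivity, specificity, and AUC from test-set labels
    (1 = faller) and model risk scores. Illustrative sketch."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    y_pred = (y_score >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    pos, neg = y_score[y_true == 1], y_score[y_true == 0]
    # AUC = P(score of a random faller > score of a random non-faller),
    # counting ties as 1/2
    auc = (np.sum(pos[:, None] > neg[None, :])
           + 0.5 * np.sum(pos[:, None] == neg[None, :])) / (pos.size * neg.size)
    return {
        "accuracy": (tp + tn) / y_true.size,
        "sensitivity": tp / (tp + fn),   # fallers correctly identified
        "specificity": tn / (tn + fp),   # non-fallers correctly identified
        "auc": float(auc),
    }

metrics = classification_metrics(
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
    [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.6, 0.1, 0.2, 0.4],
)
```

Note that accuracy depends on the chosen decision threshold, whereas AUC summarizes discrimination across all thresholds; the two are therefore reported separately throughout this review.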
The systematic literature search was done in four databases: Web of Science Core Collection (i.e., SCI-EXPANDED, SSCI, A&HCI, CPCI-S, CPCI-SSH and ESCI), IEEE Xplore, PubMed, and Medline. Search phrases and search dates for each database are presented in Table 2. Web of Science and IEEE Xplore were searched twice with modifications made in search phrases.

Study Selection
The systematic literature search in the four databases identified 614 records. After removal of 225 duplicates, 389 publications were screened for eligibility according to the inclusion and exclusion criteria in Table 1. The titles and abstracts of potentially relevant articles were screened independently by two researchers (ME and AK); a total of 304 articles were excluded in this screening. Eligibility assessments of the remaining full text records were performed independently by the same two researchers. In both steps, disagreement was resolved through discussion until consensus was reached. Full text copies were downloaded for the 33 articles included in this review (Figure 1).

Data Extraction
Data from the full text articles was extracted to a study specific template with defined variables (Table A1 in Appendix A). Data extraction was performed independently by three researchers (ME, AK, and JD). All reported data/results were discussed by at least two researchers until consensus was reached.

Research Questions and Data Analysis
The study's two main research questions were: RQ1 What is the evidence of SFRA in terms of (a) discriminative capacity, and (b) classification performance? RQ2 Which of the previously identified risk factors for study bias can be identified among the included studies? The risk factors analyzed included: (a) low use of prospective study design, (b) use of small study samples with low amounts of fall events, (c) low consensus in features used in SFRA models, and (d) misuse of model validation methods.
In order to guide the analysis of the collected data, a larger number of more detailed research questions were formulated. These questions guided how the data, collected in the study specific data collection template (Appendix A), were summarized and presented in eight tables. Table 3 and Section 3.1 present data on study characteristics; Tables 4-6 and Sections 3.2 and 3.3 present data on fall risk assessment system characteristics. Table 4 presents articles performing discrimination by feature selection. Tables 5 and 6 present articles applying classification methods/models with and without machine learning algorithms. Section 3.4 presents the results of an analysis of whether Montesinos et al.'s triads [14] can be identified in the included articles and whether the triad theory also applies to articles using classification models. Tables 7-10 and Section 3.5 present data on the evaluation methodology and fall risk discrimination/classification performance.

Qualitative data were analyzed according to content, and quantitative data were analyzed using descriptive statistics where possible.

Results
The presented results include study characteristics (Section 3.1), wearable sensors used for fall risk assessment (Section 3.2), signal processing (Section 3.3), the identification of triads and assessment of their applicability on classification methods/models (Section 3.4), and statistical analysis on the sensor-based methods' capabilities to assess fall risk (Section 3.5).
The studies were published between January 2010 and December 2019. The number of articles per year was highest in 2017 (n = 8), followed by 2011, 2016, 2018, and 2019 (n = 5 each), and 2014 (n = 3). Only one article each from 2013 and 2015 is included. None of the included articles were published in 2010, 2012, or 2020.

Study Characteristics
The characteristics of the 33 included studies are presented in Table 3. The 33 included articles were authored by 145 authors affiliated in 16 countries on four continents (Asia, Europe, North America, and Oceania). Five authors were affiliated with organizations in two different countries. Most authors (116/145) authored one article. However, 21 authors authored two articles, and eight authors were on the author lists of at least four articles (number of articles in parentheses): Brodie (4), Caulfield (4), Delbaere (4), Greene (6), Hausdorff (5), Lord (4), Redmond (4), and Weiss (4).
Most articles (25/33) were written by authors affiliated in the same country. The distribution per continent was as follows: Asia: Israel (n = 1), Japan (n = 2), South Korea (n = 1); Europe: Belgium (n = 1), Germany (n = 1), Ireland (n = 4), Italy (n = 1); North America: Canada (n = 2), United States (n = 6); and Oceania: Australia (n = 6). Eight articles had authors affiliated in different countries: Australia-Ireland (n = 1), Ireland-USA (n = 1), Israel-Taiwan-USA (n = 1), Israel-Norway (n = 1), Germany-Israel-Norway (n = 1), Czech Republic-France-Italy (n = 1), Belgium-Netherlands (n = 1), Belgium-Israel-Italy-Netherlands-UK-USA (n = 1). It is worth noting that the Australian articles were authored by two groups, one group authoring [45,48], and another group authoring [25,28,35,41]. Greene was on the author list of all articles from Ireland and on the author lists of the Australia-Ireland and Ireland-USA articles. The USA articles were almost exclusively written by different research groups, although two authors were on the author list for 2/6 articles from the USA. The articles including authors from Israel were mostly authored in collaboration with authors from other countries.

Study Populations
The study participants were classified as community-dwelling (18 articles), patients (four articles), residential care/continuing-care retirement community (two articles), and other (eight articles) if none of the aforementioned labels matched the reported population (e.g., "people from cohort" or "convenience sample"). In addition, one study [49] had a large, stratified sample including subgroups of community-dwelling, residential care, and patients (neurological and rehabilitation). The populations of all studies per publication year are presented in Figure 2. Community-dwelling was the most studied population, and none of the other populations were studied in publications from 2013-2015. No other clear trends in study population could be identified among the included studies.
The most common method to label a participant as a faller or a non-faller (or equivalent) was RE data alone (n = 18) or in combination with CLIN data (n = 4). Three of the included studies solely used CLIN data (formulas or functional tests) to label participants. However, two of these studies used clinical formulas which included RE data. Five studies used PRO data alone and two studies combined RE and PRO data (one of them compared the performance of retrospective and prospective classification models [27]). Finally, one study [44] stated that clinical partners determined whether a participant was labelled as high fall risk or age-matched low fall risk. This technique was categorized as "other" in the current review (Figure 3 and Table 4).

Figure 3 presents the number of studies per publication year that applied the respective faller/non-faller (or equivalent) labelling method. As can be seen here, the use of PRO data (alone or in combination with RE data) had not increased during 2011-2019. In total, seven studies used PRO data, either alone or in combination with RE data, to label participants. PRO data was mainly followed up for 12 months (5/7 studies), although 6- and 24-month periods were also used. In total, 25 studies used RE data to label participants (either alone, in combination with PRO and CLIN data, or as part of CLIN data). RE data was mostly retrieved from the past 12 (16/25 studies), 60 (5/25 studies), 6 (2/25 studies), 3 (1/25 studies), or 18 months (1/25 studies). Moreover, one study did not specify the length of the period used to collect RE data.

Size and Proportion of Participants Labelled as Fallers of Study Samples
The studies' sample sizes ranged from 13 to 6295 participants (mean 289, median 73, standard deviation (SD) 1041). One study used three different datasets [27], which were counted as three separate samples in our analysis. One study published in 2019 [49] had an exceptionally large sample of 6295 participants. The studies were categorized into eight categories according to sample size. The distribution of studies for each categorized sample size is presented in Figure 4. Approximately one third (12/35) of the studies had a sample of at least 100 participants.

The proportion of participants labelled as having elevated fall risk (faller, frail or at risk) according to RE data (recorded during periods of 3-60 months) and/or PRO data (during periods of 6-24 months) and/or CLIN data ranged from 14% to 71% (mean 44%, median 46%, SD 14.7%), see Table 3. The threshold used to define a person with elevated fall risk (faller, frail or at risk) varied between the studies. For example, while most studies required at least one previous fall to label a participant as a faller, a few studies (pointed out in Table 3) required at least two falls. Moreover, most studies performed binary classification of participants (faller/non-faller) while a few studies classified participants into three groups (non-faller/once-faller/multiple faller). One of the study samples in [27] did not specify the percentage of fallers in the sample; this sample was therefore omitted from the analysis. Moreover, one study used both RE and PRO data to label participants and obtained different proportions of fallers depending on the method. Both values were included in the analysis.
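The large gap between the mean (289) and median (73) sample size follows from the single very large study: a distribution of many small studies plus one outlier pulls the mean far above the median. The calculation below illustrates the effect with hypothetical sample sizes (not the actual sizes of the included studies).

```python
import numpy as np

# Hypothetical sample sizes: several small/medium studies plus one very
# large study (analogous to the n = 6295 study [49]); illustrative values
sizes = np.array([20, 35, 50, 60, 73, 90, 120, 300, 6295])
print(f"mean = {sizes.mean():.0f}, median = {np.median(sizes):.0f}")
```

For skewed distributions of this kind, the median is the more representative summary of a typical study's sample size.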

Sensor-Based Fall Risk Assessment Tasks and Degree of Supervision
Most of the studies (25/33) performed supervised SFRAs where assessment tasks tested walking (n = 9), sit-to-stand transitions in combination with walking (mostly in TUG) (n = 6), standing balance function (n = 4), sit-to-stand transitions (n = 2), turning balance (n = 1), choice stepping reaction time (n = 1), upper extremity function (n = 1), and TUG in combination with other clinical tests (n = 1). Two of the supervised tests were performed in a home setting. In two of the 33 studies, the SFRA tasks (walking on flat surfaces and stairs, and stair ascent) were performed in semi-supervised conditions at research facilities. Six of the 33 studies analyzed sensor data from unsupervised assessment tasks in a home environment, either ADL or free-living daily gait.
The number of different fall risk assessment tasks identified in this review was higher than the four tasks (quiet standing, sit-to-stand/stand-to-sit, TUG, and walking) included in the triads identified by [14].
Studies basing SFRA on classification methods/models with machine learning used fewer assessment tasks than studies basing SFRA on feature selection and on classification models without machine learning. In addition, the use of unsupervised and semi-supervised assessments was higher among studies using classification methods/models with machine learning (50% supervised, 33% unsupervised and 17% semi-supervised) than among studies performing discrimination by feature selection (77% supervised, 18% unsupervised and 5% semi-supervised) and studies using classification models without machine learning (100% supervised).

Wearable Sensor Used for Fall Risk Assessment
This section provides an overview of trends in the number of wearable sensors and sensor types used, as well as the distribution of wearable sensors at different body locations. The identified differences between studies performing discrimination by feature selection and studies using classification methods/models with and without machine learning algorithms are presented.

Number of Wearable Sensors
The average number of sensors per study varied between 1 and 5 among the articles. Most studies (26/33) used 1-2 sensors. As shown in Figure 5, the variation in average number of sensors used per article and year was higher for studies using classification methods/models (i.e., studies in Tables 5 and 6 where the number varied between 1 and 10) than for studies not using classification methods/models (i.e., the studies in Table 4 where the number varied between 1 and 4). However, the difference in average number of sensors per publication year was not statistically significant between the two groups of studies.

Sensor Types
This section provides information on different types of wearable sensors identified in the included articles, differences between article categories, and identified trends in sensor types.
The following sensor types were identified among the included studies (number of articles given in parenthesis): accelerometers (13), gyroscopes (5), a combination of accelerometers and gyroscopes (6), a combination of accelerometers, gyroscopes and magnetometers (3), a combination of accelerometers and barometer (2), a combination of accelerometers, gyroscopes, magnetometers and barometer (1), a combination of a 2D accelerometer and load cell (1), a combination of accelerometers and photoelectric heart rate (1), and a combination of accelerometers and pressure (1).
During 2011-2019, the number of different sensor types used, i.e., their dimensionality, increased after 2016. As shown in Figure A1a (in Appendix B), new sensor type combinations were introduced in 2016, for example a combination of a 3D accelerometer and pressure sensor (1 article). During 2017-2019, the number of different sensor types continued to increase, and the dimensionality of the sensor systems increased to 9D (i.e., 3D accelerometers, 3D gyroscopes, 3D magnetometers) and even to 10D by adding barometer data as well.
The variation in number of different sensor types used was higher among the studies performing discrimination by feature selection (Table 4) than among studies using classification methods/models (Tables 5 and 6). However, the difference was not statistically significant. As shown in Figure A1b (in Appendix B), 1-5 different sensor types were used per publication year among the articles performing discrimination by feature selection. Only 1-2 different sensor types were used per publication year among the studies using classification methods/models with or without machine learning (see Figure A1c in Appendix B). Moreover, the use of 3D accelerometers was higher among the studies performing discrimination by feature selection (see Table 4) than among the studies using classification methods/models (Tables 5 and 6). None of the studies in Tables 5 and 6 that were published during 2017-2019 used 3D accelerometers.
Among the identified studies performing discrimination by feature selection (Table 4), gyroscopes started to be used in 2016. In this period, gyroscopes were used to an equal extent as accelerometers. During 2018-2019, two studies using a combination of accelerometer, gyroscope, and magnetometer features were identified, one of them also combined with barometer features.

Distribution of Wearable Sensors at Different Body Locations
This section provides information on how wearable sensors were distributed between body locations, whether there were differences between article categories, as well as trends in distribution.
Starting with the studies performing discrimination by feature selection (i.e., the articles in Table 4), Figure A2 in Appendix B shows that a total of three sensors were used in the two articles from 2011. Two sensors were located on the upper body (pelvis and sternum) and one on the lower body (thigh). In 2013-2014, three articles used a total of four sensors, all located on the upper body (two on the lumbar spine, one on the cervical spine and one on sternum). In 2016, all five articles included sensors located on the upper body (lumbar spine) and one of them also included sensors located on the top of the feet. In 2017, seven articles used a total of nine different sensor body locations. Four of them had sensors located only on the upper body (pelvis, sternum, biceps, wrist, head). Two had sensors located both on the upper and lower body (in [39] at the sternum, lumbar spine, and on the feet, while in [41] on the lumbar spine and one of the ankles) and one [40] positioned the sensors on the shanks/shins. In 2018, all four articles used a sensor located on the upper body (three on the lumbar spine and one on the sternum and pelvis). In addition, two of them used sensors located on at least one shin/shank, and one of them used sensors located on the thighs. All three articles published in 2019 also used sensors located on the upper body (lumbar spine). In addition, one of them [50] used a sensor located on one of the heels.
To summarize, 64% of the wearable sensors used in the studies performing discrimination by feature selection were located on the upper body (Figure 6a). Figure 6b shows that the most common body location on the upper body was the lumbar spine (13 sensors). Other upper body locations used more than once include the sternum (5) and pelvis (3). Lower body locations used more than once include the shin/shank (5), top of foot (4), and thigh (3).
Figure 6. Information on sensor locations for articles in Table 4, i.e., the articles performing discrimination by feature selection: (a) Distribution of sensors located at the upper or lower body; (b) Number of sensors per body location.
Continuing with the studies using classification methods/models with or without machine learning algorithms (i.e., the articles in Tables 5 and 6), Figure A3 in Appendix B shows that most articles reported on sensors located on the lower body, with the exception of the publication years 2011 and 2015. However, in 2011, 6/7 of the reported upper body sensors were used in one article where a total of ten sensors were used [20], and only one article was included from 2015. The two articles published in 2014 [27] and 2017 [36] have the same main author and report on the use of sensors located at the shin/shank. The two included articles from 2016 used five different sensor locations, and 6/7 of the sensors were used in one article [29], where most were located on the lower body (shins/shanks and under the soles of the feet) and two on the upper body (head and pelvis). The other article from 2016 [32] used a sensor located on the upper body (lumbar spine). Only one article from 2018 [46] used sensors located both on the upper (lumbar spine) and lower body (thighs and shins/shanks). In 2019, [51] used a sensor combining a 3D accelerometer and a photoelectric heart rate sensor located on the wrist, while the other study [49] positioned the sensors at the shins/shanks.
To summarize, 61% of the wearable sensors used in studies using classification methods/models with or without machine learning were located on the lower body (Figure 7a). Figure 7b shows that the most common body location on the lower body was the shin/shank (12 sensors). Other lower body locations used more than once include the thigh (4), under foot (2), and ankle (2). Upper body locations used more than once include the lumbar spine (3), wrist (3), biceps (2), and shoulder blade (2). Hence, the body locations used in studies using classification methods/models with or without machine learning are quite different from the body locations used in the studies performing discrimination by feature selection, where 64% of the sensors were located on the upper body.

Figure 7. Information on sensor locations for articles in Tables 5 and 6, i.e., the articles using classification methods/models with or without machine learning algorithms: (a) Distribution of sensors located at the upper or lower body; (b) Number of sensors per body location.

Signal Processing
The analysis of the methods used for signal/data processing and analysis in the 33 studies showed that the signal processing approaches could be classified into three main categories: discrimination by feature selection (22 studies, presented in Table 4), classification using methods/models without machine learning algorithms (5 studies, presented in Table 5), and classification using methods/models with machine learning algorithms (6 studies, presented in Table 6).

Sensor Features
The number of sensor features selected for fall risk assessment analysis (either discrimination or classification) varied from one to hundreds between different studies.
Among studies discriminating by feature selection, i.e., performing statistical analysis directly on selected features (Table 4), most studies (14/21) used up to 10 (4-10) sensor features and four studies used 15-21 sensor features [21,34,41,47]. The highest number of sensor features used among the studies was 60 [31]; this number differed markedly from the other studies discriminating fall risk by feature selection. Some studies evaluated the discriminatory capabilities of generated sensor features, e.g., the Step Stability Index (SSI) [25], Local Dynamic Stability (LDS) [30], and the Biometric Signature Trajectory (BST) [45].
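Discrimination by feature selection of this kind reduces to a between-group test on each candidate feature. As an illustrative sketch only (the feature name, group sizes, and effect size are fabricated; no included study is reproduced), a non-parametric Mann-Whitney U test comparing a hypothetical gait-variability feature between fallers and non-fallers could look like this:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)
# Synthetic values of a hypothetical gait-variability feature (higher = more variable).
# Group sizes and the group difference are fabricated for illustration only.
fallers = rng.normal(loc=0.9, scale=0.2, size=30)
non_fallers = rng.normal(loc=0.6, scale=0.2, size=50)

# Two-sided test: does the feature distribution differ between the two groups?
stat, p_value = mannwhitneyu(fallers, non_fallers, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.2e}")
```

A non-parametric test is a common choice here because sensor-feature distributions in small samples are rarely demonstrably normal.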
In the studies using classification models without machine learning algorithms (Table 5), three articles by the same main author [27,36,49] used over 40 sensor features in regularized discriminant classifier models. The other two studies [22,23], which both used regression models, utilized 10 and 14 sensor features, respectively.
The studies using classification methods/models with machine learning algorithms (Table 6) used more sensor features than the studies assessing fall risk based on feature extraction (Table 4) and on classification models without machine learning (Table 5). Two studies [29,46] used approximately 150 sensor features, while two other studies [20,32] used approximately 70 sensor features for four different machine learning algorithms. One study [28], which built on the decision tree (DT) machine learning classification algorithm, used only seven sensor features. A total of 38 sensor features were combined with 210 variables from the Resident Assessment Instrument-Home Care (RAI-HC) and analyzed using machine learning algorithms [51].
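Sensor features of the kinds counted above are typically scalar summaries computed from raw inertial signals. A minimal sketch on synthetic data with two generic summaries (RMS acceleration and a jerk-based smoothness measure; these are illustrative, not the features of any specific included study):

```python
import numpy as np

fs = 100.0                               # assumed sampling rate in Hz
t = np.arange(0.0, 10.0, 1.0 / fs)
rng = np.random.default_rng(4)
# Synthetic vertical acceleration during walking: ~2 Hz step pattern plus sensor noise
acc = np.sin(2.0 * np.pi * 2.0 * t) + 0.05 * rng.normal(size=t.size)

rms_acc = np.sqrt(np.mean(acc ** 2))     # overall signal magnitude (RMS acceleration)
jerk = np.diff(acc) * fs                 # numerical derivative of acceleration
mean_abs_jerk = np.mean(np.abs(jerk))    # a simple smoothness-related summary
print(rms_acc, mean_abs_jerk)
```

Note that numerical differentiation amplifies sensor noise, which is one reason jerk-based features are usually computed on low-pass filtered signals in practice.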

Feature Selection
All studies employed feature selection, regardless of whether they used statistical analysis directly on the selected features to assess fall risk or used the selected features in classification methods/models which used machine learning algorithms or other types of classifiers.
In Table 4, these feature selection methods were mostly used to assess the individual differences and significance of features, i.e., to decide which features to include in the fall discrimination analysis. In Tables 5 and 6, however, the methods were mostly used to prepare data for the classification models and machine learning algorithms, supporting the prediction and assessment of classification ability or performance.
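For the classification pipelines in Tables 5 and 6, such data preparation can be approximated by univariate feature selection. A hedged sketch using scikit-learn's SelectKBest on a fabricated dataset (the dataset, labels, and the informative feature are all invented for illustration):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

rng = np.random.default_rng(1)
# Fabricated dataset: 60 participants x 20 candidate sensor features
X = rng.normal(size=(60, 20))
y = rng.integers(0, 2, size=60)          # 1 = faller, 0 = non-faller (hypothetical)
X[y == 1, 0] += 1.0                      # make feature 0 carry group information

# Keep the 5 features with the strongest univariate association with fall status
selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)
print("selected feature indices:", np.flatnonzero(selector.get_support()))
```

The reduced matrix `X_selected` would then be passed on to a classifier, mirroring the select-then-classify workflow described above.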

Fall Risk Assessment
The 22 studies in Table 4 employed selection and comparison of features, performing statistical tests to discriminate between fallers and non-fallers or to classify individuals as fallers or non-fallers. The majority of these articles (17/22) compared features between groups using different statistical tests, with the aim of identifying features with significant discrimination ability. Some articles (5/22) proposed novel measures derived from sensor measurements and analyzed their feasibility, i.e., a novel measure of SSI [25], LDS [30], refined composite multiscale entropy (RCME) and refined multiscale permutation entropy (RMPE) [31], BST [45], and the Comprehensive Gait Assessment using Inertial sensor (C-GAITS) score [50].

Identification of Triads and Assessment of Applicability on Classification Methods/Models
As described in Section 3.1.5, the current review reports on SFRA performed under supervised, semi-supervised, and unsupervised conditions (such as ADL or free-living gait) and the degree of supervision varied between article categories. For example, all studies using classification models without machine learning were supervised while only 50% of the studies using classification methods/models with machine learning algorithms were supervised.
A previous systematic review and meta-analysis of the best available evidence on optimal combinations of sensor locations, tasks, and features for fall risk assessment [14] discussed discriminating sensor features for four specific tasks performed while wearing sensors. The six recommended triads were: (1) angular velocity-walking-shins, (2) frequency-walking-lower back, (3) frequency-walking-upper back, (4) linear acceleration-quiet standing-lower back, (5) linear acceleration-sit-to-stand/stand-to-sit-lower back, and (6) temporal-TUG-shins. The three not-recommended triads were: (1) angular velocity-walking-lower back, (2) frequency-walking-shins, and (3) linear acceleration-walking-shins [14]. In the current review, it was not possible to outline the aforementioned triads for the 11 studies using classification methods/models, i.e., the studies in Tables 5 and 6. The main reason is that they present methods rather than sensor features. Further, the current review identified several assessment tasks that were not included in [14], for example reaction tests, stair ascent and descent, ADL, balance tests under different conditions, and a UEF test. Most of these newer tasks were used in studies performing discrimination by feature selection. Therefore, rather than trying to identify triads like the ones in [14] or counting sensor types/sensor locations for all studies, Sections 3.2.1-3.2.3 present information on the number of sensors, sensor types, and sensor locations, differences between article categories, and trends during 2011-2019. Nevertheless, an analysis relating to the triads in the previous systematic review by Montesinos et al. [14] has been conducted.
Starting with the studies performing discrimination by feature selection (i.e., the studies in Table 4), most sensors were located on the upper body, and most of these were located on the lumbar spine. The lower back was included in the triads recommended by [14] for the quiet standing and stand-to-sit/sit-to-stand tasks but not for TUG. For TUG, the recommended triad included temporal-shins. Only two studies [39,40] used TUG as an assessment task, and [40] used the recommended sensor location, i.e., the shin. However, neither of them presented results that distinguished fallers from non-fallers by using temporal sensor features. Four studies included standing balance tests under different conditions, VR included. Excluding the VR study [38], the three other studies [34,43,48] used sensors located on the lower back or pelvis. Hence, while these are not assessment tasks listed in [14], the sensor location mimics the one in recommended triad (4) above.
Regarding the walking task, [14] identified both recommended and not-recommended triads with respect to the lower back. Eight studies used different walking tasks, including walking under different conditions, daily-life walking, stair walking, the 6MWT, and 15 m walking tests. The sensor was located on the lower back in seven studies, and at a nearby location (pelvis) in one study. However, the dimensionality of the collected sensor data varied. In the studies [19,24,25,31,41,42], features from one or more 3D accelerometers were used. A combination of 3D accelerometer and 3D gyroscope features was used in [50], and an even more complex combination (3D accelerometer, 3D gyroscope, 3D magnetometer, and barometer features) was used in [44]. Several of the studies using walking as the assessment task also used more than one sensor, but their locations varied. We note that the triad linear acceleration-walking-lower back was not identified as a recommended triad in [14].
The triad angular velocity-walking-lower back was identified as not-recommended in [14]. Nevertheless, gyroscope features were included in the C-GAITS score [50]. Finally, Table 4 includes two studies [21,30] that used the sit-to-stand assessment task. Both used one or more sensors providing 3D accelerometer features. For sit-to-stand, recommended triad (5) above includes linear acceleration-lower back. This sensor location was used in [30], but the sensors were positioned on one of the thighs and the sternum in [21].
The triads by [14] are not directly applicable to the studies using classification models/methods. The majority of the sensors used in the studies presented in Tables 5 and 6 were located on the lower body, with the shin/shank being the most common location. Three of the studies in Table 5 [27,36,49] used TUG as the assessment task. All of them positioned the sensors on the shin/shank, i.e., the same location as in recommended triad (6) above for TUG, which included temporal-shins. One study [22], conducted by the same research group, instead used walking as an assessment task with sensors located on the shins/shanks. It should be acknowledged that although the Shimmer sensor (i.e., a combination of 3D accelerometer and 3D gyroscope features) was used during the assessments, the research has resulted in a commercial quantitative TUG assessment tool called QTUG, provided by the company Kinesis Health Technology.
One article [23] used both walking and TUG as assessment tasks. The location chosen for the 3D accelerometer was the lower back; this triad was also not identified in [14].
None of the recommended triads for walking (1-3 above) include linear acceleration, and neither does the recommended triad for TUG (6 above). Hence, the studies in Table 5 show that it is also possible to use other triads when using classification models/methods. The assessment tasks vary considerably between the studies in Table 6; therefore, no further analysis of body locations and identification of triads is provided here.

Statistical Analyses on the Sensor-Based Methods' Capabilities to Assess Fall Risk
Statistical analyses were performed on the SFRA methods' capabilities to assess fall risk, either to discriminate between groups with different fall risk or to classify individuals as faller/non-faller. Methodological data and main findings on the discriminatory capabilities of sensor features and classification methods/models are presented in Tables 7 and 8. Methodological data and classification performance of sensor features and classification methods/models are presented in Tables 9 and 10.

Table 7 (excerpt). Statistical analyses on the sensor-based features' abilities to discriminate groups with distinct levels of fall risk. BMI = Body Mass Index. Columns: study, fallers/total (fall ascertainment), main findings, and number of features and type of assessment task able to discriminate groups with different levels of fall risk (fallers/non-fallers).

[21] 19/39 (RE-60, CLIN): Fallers took significantly longer time to complete sit-stand transitions than non-fallers; fallers exhibited increased jerk over the complete assessment; SEF was significantly higher for fallers than non-fallers for the total test, sit-stand-sit components, and sit-stand and stand-sit transitions. 6 features; sit-stand and stand-sit transitions.
[25] 39/81 (RE-12): The SSI was significantly higher for fallers than non-fallers under all three walking conditions (baseline with and without harness, obstacle negotiation with harness). 3 features; gait.
[26] 36/104 (RE-12): Significantly longer times to regain balance after movement initiation and slower stability time for fallers than for non-fallers. 2 features; stability/balance.
[45]: The DTW difference between the reference BST and each participant's BST was significantly higher among elderly multiple fallers than non-fallers.
Most of the included studies (15/33, see Table 7) evaluated the discriminative capability of sensor features and a few (3/33, see Table 8) evaluated the discriminative capability of classification methods/models.
The 15 studies evaluating the capabilities of sensor features to discriminate between groups (presented in Table 7) included 5-122 fallers (mean 31, median 20.5, SD 28). Only 2/15 studies identified fallers based on PRO data, one over 12 months and one over 6 months. Each study identified 1-6 sensor features which differed significantly between groups of participants with different fall risk levels. Almost half (7/15) of the studies identified sensor features related to gait, both more complex measures [25,50] and specific gait characteristic features such as within-walk variability [35]. Moreover, one study identified that stair descent rate differed significantly between multiple- and non-multiple-fallers [41]. Five studies identified that features related to balance were significantly different between groups, for example during tandem stand [34], in regain of balance after movement initiation [26], and upon external stimuli [43].
The three studies evaluating the capabilities of classification models/algorithms to discriminate between groups (presented in Table 8) included 31-1637 fallers (mean 574, median 54, SD 752). One of the studies used an exceptionally large sample to validate a model that had been previously reported [49]. One of the studies [28] identified fallers based on PRO data, and a period of 12 months was used. Both classification methods/models with [28] and without [22,49] machine learning were evaluated. Two of the articles stated that model validation was performed.

Capability in Classifying Individuals as Fallers/Non-Fallers (or Equivalent)
Fifteen (15/33) studies analyzed the SFRA methods' performance in classifying older adults as fallers/non-fallers or equivalent. Seven of them evaluated the classification performance of sensor features directly (Table 9) and eight evaluated classification methods/models which used sensor-derived features (Table 10).
Six (6/7) studies reported Area Under the Curve (AUC, from the Receiver Operating Characteristic curve) values (range 0.67-0.90); half of them reported 95% confidence intervals (CIs) of the average AUCs and four of them reported AUC values of 0.81 and higher. These six studies also reported values of sensitivity, i.e., the probability of classifying a true faller as a faller (range 53-88%), and specificity, i.e., the probability of classifying a true non-faller as a non-faller (range 72-90%). As shown in Figure 8, 5/6 of these studies reported a specificity that was at least as high as the sensitivity. This indicates that the methods' performance in classifying non-fallers was at least as high as their performance in classifying fallers. For example, refs. [39,42] reported sensitivity values of 53% and 54.3%, respectively (see values marked with (1) and (2) in Figure 8). Ihlen et al. [31] reported the highest sensitivity (88%, see value marked with (3) in Figure 8).
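The sensitivity, specificity, and AUC definitions used here can be made concrete with a small worked example (the labels and risk scores are hypothetical, chosen only to illustrate the computations):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Hypothetical true fall status (1 = faller) and a classifier's risk scores
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
scores = np.array([0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.6, 0.1, 0.2, 0.3])
y_pred = (scores >= 0.5).astype(int)     # classify as faller above a 0.5 threshold

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)             # P(classified as faller | true faller)
specificity = tn / (tn + fp)             # P(classified as non-faller | true non-faller)
auc = roc_auc_score(y_true, scores)      # threshold-free ranking performance
print(sensitivity, specificity, auc)     # 0.75, 0.8333..., 0.9583...
```

Unlike sensitivity and specificity, the AUC does not depend on the chosen classification threshold, which is why studies often report it alongside the thresholded metrics.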
Eight studies evaluated classification performance of sensor-based classification methods/models. One of them did not report on number of fallers in their sample but the other seven studies, presented in Table 10, included 11-33 fallers (mean 21, median 22, SD 7), all identified based on RE data.
Type of metrics used to report on classification performance varied between studies: most studies (7/8) reported classification accuracy with best performance values in the range 70-91%. In addition, values on sensitivity (best values of studies in range 36-100%), specificity (best value of studies in range 55-100%) and AUC-values (best value of studies in range 0.67-0.93) were reported. Three of the eight studies presented CIs for the reported values and three studies [23,42,48] reported performance metrics of clinical fall risk assessment methods for comparison.
More than 60% of the studies (5/8) evaluated the performance of models using machine learning algorithms. Each study evaluated the performance of three to six models such as NB, SVM, and multi-layer perceptron NN. Three of the four studies that included SVM-based models in their comparisons identified that this type of machine learning algorithm resulted in the best performance [29,32,46]. The remaining 40% of the studies (3/8) evaluated classification models based on logistic regression algorithms [23] and regularized discriminant classifier algorithms [27,36].
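Comparisons of the kind reported here, e.g., an SVM versus a regression-based classifier, can be sketched as follows (the feature matrix and labels are synthetic; this does not reproduce any included study's pipeline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# Fabricated sensor-feature matrix (80 participants x 10 features) and fall labels
X = rng.normal(size=(80, 10))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=80) > 0).astype(int)

results = {}
for name, model in [("SVM", SVC(kernel="rbf")),
                    ("LogReg", LogisticRegression(max_iter=1000))]:
    # 5-fold cross-validated accuracy for each candidate classifier
    results[name] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
print(results)
```

Reporting the mean cross-validated accuracy per model, as above, mirrors how the included studies compared three to six candidate models on the same sample.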
Accuracy values reported from studies using classification models with machine learning (79.7-91%) were higher than accuracy values reported from studies that used classification models without machine learning (70-72.7%) (see Table 10). It should be noted, though, that only the highest achieved accuracy value per classification model is reported in this review. Moreover, studies employing models with machine learning reported higher sensitivity and specificity values than classification methods/models using other types of classifiers (see Figure 9). Here, the lowest sensitivity value, presented by [27], was below 50% (see data point marked with double asterisks in Figure 9). The study by Caby et al. [20], which compared four different machine learning algorithms, reported sensitivity values of 0 and 1 (see data points marked with one asterisk in Figure 9).
Figure 9. Specificity versus Sensitivity from studies using classification methods/models (presented in Table 10). Data from studies using models with machine learning are represented by blue dots and data from studies using models without machine learning are represented by red dots. The blue line indicates specificity = sensitivity. Data from evaluations using methods other than CV (hold-out or independent dataset) are marked with arrows.
Two data points from [29] are marked with asterisks and one data point from [41] is marked with double asterisks.

All studies validated their classification models, mostly by cross validation (CV). However, one study used a hold-out method with 75% of the data in the training set and 25% in the validation set [29], and one study used an independent dataset for model validation [36]. In both cases, the reported specificity was lower than the sensitivity (see data points marked with arrows in Figure 9), and this indicated that their performance in classifying fallers were at least as high as their performance in classifying non-fallers. Moreover, one of the studies which used CV reported that pruning was used in the model training to avoid overfitting [51].
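The two validation strategies mentioned, a 75/25 hold-out split and cross-validation, can be illustrated side by side as follows (the data are synthetic; only the 75/25 split proportion follows [29], everything else is fabricated for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(3)
# Fabricated dataset: 100 participants x 8 sensor features, binary fall labels
X = rng.normal(size=(100, 8))
y = (X[:, 0] - X[:, 2] + rng.normal(scale=0.5, size=100) > 0).astype(int)

# Hold-out validation: 75% training / 25% validation, as the split used in [29]
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.25, random_state=0)
holdout_acc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_va, y_va)

# Cross-validation: 5-fold CV over the full sample
cv_acc = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5).mean()
print(f"hold-out accuracy = {holdout_acc:.2f}, mean CV accuracy = {cv_acc:.2f}")
```

The hold-out estimate uses each observation only once, whereas CV averages over folds; with the small samples common in SFRA studies, the CV estimate is typically less variable.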

Discussion
This article presents a systematic review of evaluations of SFRA methods in peer-reviewed literature published 2010-2020. A total of 389 publications were screened for eligibility and 33 articles were included in the final assessment.
The current review identified that the most studied population was community-dwelling older adults. Although sample sizes varied widely between studies, 33% (12/35) of the samples in the 33 included studies had at least 100 participants. This percentage was higher than what had been identified in the review by Rucco et al. [15], which reported that only 10% (4/42) of their included studies had more than 100 participants. The review by Shany et al. [7] pointed out the need for high-quality validations of concepts that had been established as proof-of-concept in previous research. The current review identified one example of a large-scale validation published in 2019 [49] with a sample of 6295 participants.
RE data was identified as the most used comparator in the included studies. This result is in accordance with previous reviews, which have identified RE data alone [14] or in combination with CLIN data [8,9,13] as the most common basis for generating outcomes to compare SFRAs against. Although the need for using PRO data has been emphasized in previous reviews [7-9,13], and a positive trend of increased use of PRO data was identified by Shany et al. [9], the current review could not identify an increase in prospective studies over the publication period.
The proportion of fallers (or equivalent outcomes indicating increased fall risk) in the study samples varied between 14 and 71%, with an average of 44%. The same range was reported in [15]. However, the thresholds used to define a person with increased fall risk (at least one or two falls) varied. Most of the included studies based the SFRA on supervised assessment tasks. This is positive with regard to the need, described in [7], for research supporting supervised SFRA rather than focusing solely on unsupervised SFRA.
The number of different sensor types used in the included articles increased over time: while only 1-2 sensor types were used in the 10 articles published during 2011-2015, the number of sensor types increased from 2016. A higher number of sensor types was used among studies performing discrimination by feature selection than among those using classification methods/models. Moreover, the accelerometer, commonly used according to Bet et al. [16], was not used at all in the studies using classification methods/models published during 2017-2019.
This review identified that studies performing discrimination by feature selection and studies using classification methods/models differed in the sensor locations used: while studies using classification methods/models mostly used sensors located on the lower body (shin/shank was the most common location), studies using feature selection mostly used sensors on the upper body (lumbar spine was the most common location). Another review by Bet et al. [16] also analyzed sensor locations and found that the most common location was the waist (8 articles), followed by the lumbar region (7), ankle (4), pelvis (4), and head (3). It is worth noting here that different terminologies may be used to denote the same sensor location. For example, [14] identified recommended and not-recommended triads including the shins but lists no articles including features from the ankle, while [16] identified four articles with sensors located on the ankle but none on the shin. Further, while [16] reported that the most frequently used locations were the waist and lower back (lumbar spine), [14] stated that the most common placement was the lower back (approximately L3). In addition, Rucco et al. [15] used the notation trunk for sensors located at L3, L5, sternum, waist, pelvis, neck, and chest. Hence, a direct comparison of results obtained in this review with results from the previous reviews is not straightforward.
The review by Sun and Sosnoff [13] focused on four major sensing technologies (inertial sensors, video/depth camera, pressure sensing platform and laser sensing) for SFRA in older adults. The authors presented outcome measures related to different assessment tasks (steady state walking, TUG test, standing postural sway, and dynamic tests) [13]. Howcroft et al.'s review [8] focused solely on inertial sensors. The current review included studies using wearable or mobile inertial sensors used to characterize movements by extracting features from sensor signals. Hence, the range of sensors used in the included studies was more limited in this review compared to the range reported in [13] but somewhat broader than the range reported in [8].
In accordance with the review by Shany et al. (2015) [9], selected features, methods for selecting/extracting them, as well as the number of features incorporated into each model varied substantially between studies. Shany et al. presented both numbers of features subjected to analysis and numbers of sensor features. In addition, they highlighted uncertainty of numbers by using the symbol "?" [9].
Montesinos et al. [14] identified strong/very strong associations between fall risk assessment outcomes and nine triads (combinations of a feature category, a task, and a sensor placement). In the current review, it was not possible to outline triads for the 11 studies performing fall risk assessment using classification methods/models. Further, several assessment tasks not included in Montesinos et al.'s [14] analysis were identified in the current review. Most of these newer tasks were used in studies performing discrimination by feature selection. Rather than trying to identify triads like the ones outlined by Montesinos et al. [14] or counting sensor types/sensor locations for all studies, the current review has instead presented information on the sensor locations and sensor types used, separately for studies performing discrimination by feature selection and for studies using classification methods/models.
Previous reviews have categorized SFRA signal processing methods differently compared to this review. For example, Howcroft et al. [8] reported that regression models were employed to predict fall risk in 65% of their included studies. Other methods employed were mathematical classifiers (25%), DT (15%), NN (15%), SVM (10%), and cluster analysis (10%). Some of the studies (30%) employed more than one method. In addition, Sun and Sosnoff [13] presented a diverse collection of quantitative models/methods employed to predict fall risk, including logistic regression, linear regression, RBNC, SVM, NB, multi-layer perceptron NN, locally weighted learning, DT, cluster analysis, kNN, NN, neuroevolution of augmenting topologies (NEAT), and discriminant analysis [13]. Both of these reviews categorized regression models as classification methods [8,13]. In contrast, Bet et al. [16] used two main categories of data processing (feature extraction and machine learning techniques) in their analysis of included articles. They classified as "feature extraction" only those data processing methods that carried out fall risk assessment by comparing features using statistical tests [16]. Notably, the current review categorized signal processing methods based both on the type of method and on the results the method produced. In Table 4, seven articles employed logistic regression, i.e., logistic regression [41]; logistic regression and ROC curve [42]; logistic regression and ANOVA regression [37]; stepwise logistic regression [39]; stepwise logistic regression and ROC curve [19,24]; and univariate logistic regression [30]. In these articles, logistic regression was used to identify individual features significantly associated with fall risk, not directly for classification. Therefore, these signal processing methods are categorized as "feature selection" in this review.
Moreover, simple linear regression, which was employed in only one article to assess the correlation between C-GAITS score and walking speed [50], was not characterized as a classification method. In Table 5, four articles employed logistic regression as a classification model [22,23,36,49]; these articles developed classification models based on regression models with related data. Moreover, in Table 6, a logistic regression model was employed as a machine learning classification model in one article by Yang et al. [51].
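The distinction drawn above, logistic regression used to screen individual features versus logistic regression used as a classifier, can be made concrete with a small sketch. Everything below is hypothetical: the data are synthetic, the one-feature gradient descent fit is a stand-in for the statistical software used in the reviewed studies, and ranking features by slope magnitude stands in for the p-value-based significance tests they actually report.

```python
import math
import random

def fit_logistic_1d(x, y, lr=0.5, epochs=3000):
    """Fit P(y=1) = sigmoid(a*x + b) for a single feature by
    gradient descent; returns slope a and intercept b."""
    a = b = 0.0
    n = len(x)
    for _ in range(epochs):
        ga = gb = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(a * xi + b)))
            ga += (p - yi) * xi
            gb += p - yi
        a -= lr * ga / n
        b -= lr * gb / n
    return a, b

# Synthetic cohort: fallers (y = 1) walk slower on average.
# Feature values are invented, chosen only to make the contrast visible.
rng = random.Random(1)
y = [1] * 25 + [0] * 25
gait_speed = [rng.gauss(0.85 if yi else 1.15, 0.1) for yi in y]
noise = [rng.gauss(1.0, 0.1) for _ in y]

# Use 1: feature selection -- rank features by the strength of their
# univariate association with faller status.
a_speed, b_speed = fit_logistic_1d(gait_speed, y)
a_noise, _ = fit_logistic_1d(noise, y)
print(abs(a_speed) > abs(a_noise))  # the informative feature dominates

# Use 2: classification -- threshold the fitted probability at 0.5.
pred = [1 if 1 / (1 + math.exp(-(a_speed * xi + b_speed))) >= 0.5 else 0
        for xi in gait_speed]
accuracy = sum(p == t for p, t in zip(pred, y)) / len(y)
```

In this review's terms, the Table 4 articles stop after the first step (reporting which features are significant), while the Table 5 and 6 articles carry the fitted model through to classification.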
The included studies evaluated either the SFRA methods' capability to discriminate groups of older adults with different fall risk (55% of the studies) or their performance in classifying individuals according to fall risk (45% of the studies). The SFRA methods used either sensor features or classification methods/models (with and without machine learning) for discrimination/classification. This review identified a large number of sensor features (47% of them related to gait) and three classification models that differed significantly between groups with different fall risk levels. Moreover, the review identified that classification performance was mainly reported using accuracy (highest values per feature/model 70-91%), sensitivity (highest values per feature/model 36-100%), specificity (highest values per feature/model 55-100%) and AUC (highest values per feature/model 0.67-0.93). The review by Sun and Sosnoff [13] presented data on the full ranges of accuracy, sensitivity and specificity reported in their included articles, while the current review only reported the highest values identified for each of the evaluated SFRA methods. Moreover, the review by Sun and Sosnoff [13] and the current review had only four studies [19,20,23,24] in common. Hence, Sun and Sosnoff [13] present lower minima in the ranges for accuracy, sensitivity and specificity than the current review. In general, the methods' specificity (performance in classifying non-fallers) was higher than their sensitivity (performance in classifying fallers) in the current review. In accordance with the previously identified need to compare the accuracy of SFRA methods with the accuracy of CLIN data [8], the current review identified three studies [23,42,48] that reported performance metrics of clinical fall risk assessment methods for comparison.
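The metrics summarized above follow their standard definitions, which a short sketch can make concrete. The labels and risk scores below are invented, not data from any included study; 1 denotes a faller.

```python
def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = share of fallers correctly classified;
    specificity = share of non-fallers correctly classified."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, scores):
    """AUC as the probability that a randomly chosen faller receives a
    higher risk score than a randomly chosen non-faller (ties count 1/2)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

sens, spec = sensitivity_specificity([1, 1, 1, 0, 0], [1, 0, 1, 0, 1])
print(round(sens, 2), spec)                       # 0.67 0.5
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))   # 0.75
```

Unlike accuracy, sensitivity and specificity are unaffected by the proportion of fallers in the sample, which is one reason studies with few fallers can still report high accuracy.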
Howcroft et al. [8] have previously pointed out that some reported accuracy values exceed the theoretical maximal accuracy (81%) for SFRA prediction of at least one fall in the upcoming year calculated by [10], and concluded that prediction performance is overestimated in the current literature, mainly due to small samples, large feature pools, model overfitting, lack of validation, and misuse of modelling techniques. The current review identified one study which reported an idealistic model performance (error = 0, sensitivity = 1, specificity = 1) [20] and five studies that reported model classification accuracy values exceeding 81% [27,29,32,46,51]. All these studies, except [27], used machine learning algorithms. The current review identified that all the studies presenting fall risk classification performance also reported on model validation methods. Although CV was used in most cases, one study performing validation with an independent sample [36] was also identified; this validation method has been recommended by [9]. In addition, one example of the hold-out method was identified, in which data from 75 participants was included in a training set and data from 25 participants was used in a test set [29]. The use of model validation among the studies included in the current review is higher than the level identified by Sun and Sosnoff [13], where only 50% of the included studies had applied the recommended model validation techniques (including leave-one-out CV, ten-fold CV, the 0.632 bootstrap technique and the hold-out method).
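The cross-validation referred to above partitions the sample into folds so that every participant is used for testing exactly once; leave-one-out CV is the special case where the number of folds equals the sample size. A minimal sketch of the partitioning (hypothetical helper, not code from any reviewed study):

```python
def kfold_indices(n_samples, k):
    """Yield (train, test) index lists for k-fold cross-validation:
    each sample index appears in exactly one test fold."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        # Early folds absorb the remainder so all samples are covered.
        stop = start + fold_size + (1 if fold < remainder else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
        start = stop

# Ten-fold CV on a sample of 33 participants.
folds = list(kfold_indices(33, 10))
print(len(folds), sorted(len(test) for _, test in folds))
```

Because every model in a k-fold scheme is evaluated on data held out from its own training, the averaged performance across folds is less prone to the overfitting-driven overestimation discussed above than performance measured on the training data itself.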

Conclusions
This review identified evidence of SFRA, both in terms of discriminative capacity and classification performance: (1) A large number of sensor features (almost 50% related to gait) and three classification models (one with machine learning) using sensor features (related to gait and stair descent) differed significantly between groups of older adults with different fall risk levels.
(2) Six studies reported on sensor features (1-5 features per study, in one study combined with the Tinetti balance score) being able to classify individuals as fallers/non-fallers (or equivalent) with AUCs of at least 0.75. Five of these six studies used only 3D accelerometers and one used only gyroscope data. The assessment tasks monitored were walking (4/6 studies), the TUG test (1/6), and standing balance (1/6).
(3) Seven studies reported on classification models (four with machine learning and three without) being able to classify individuals as fallers/non-fallers (or equivalent) with accuracies of at least 84% and/or AUCs of at least 0.74. All these studies used accelerometers, either alone (1 study) or in combination with 1-5 other sensors including gyroscopes (4 studies), magnetometers (1), pressure sensors (1) and heart rate sensor (1). The number of sensor features analyzed in these studies ranged between 38 and 155. Although more than half of the studies (4/7) used clinical tests (mainly TUG test) as assessment task, ADL (2 studies) and walking (1) were also used.
However, the review also identified several factors previously reported to increase risk of bias [7-9,12-16]: (1) The use of prospective study design was limited among the included studies and no positive trend over the publication period could be identified. Two thirds of the included studies used cross-sectional designs with RE and/or CLIN data as outcomes to compare SFRA with. Potential sources of bias associated with RE data include limited accuracy of fall recall among older adults [55] and the risk of altered motion patterns due to a history of falls [9]. Moreover, clinical assessments can introduce study bias since they are often scored subjectively and do not achieve 100% clinical accuracy [7,13].

Appendix B
This appendix provides complementary graphs to the information provided in Section 3.2.

Figure A1. … articles in Table 4; (c) articles in Tables 5 and 6; and (d) the sensor types. HR = heart rate. Please note that HR is used for heart rate only in this figure.
Figure A2. Number of sensors per body location for articles in Table 4, per publication year.
Figure A3. Number of sensors per body location for articles in Tables 5 and 6, per publication year.

Appendix C
This appendix provides complementary information for the feature selection methods in Tables 4-6.

Table A2. Complementary information for the feature selection methods in Table 4. COG = center of gravity, ICC = intra-class correlation coefficient, SSQ = Simulator Sickness Questionnaire.

[19]
1. Assessment of intra- and inter-observer reliability for gait parameters (ICC, CV of standard error of measurement)
2. Assessment of whether each parameter differed significantly between fallers/non-fallers and between walks (ANOVA and t-test; Wilcoxon signed-rank and Kruskal-Wallis tests for step time asymmetry)
3. Analysis of each gait parameter's predictive value (stepwise logistic regression: forward likelihood ratio)
4. Analysis of discriminative capacity (ROC curve)

[21]
1. Assessment of whether each parameter differed significantly between repetitions for each participant (ANOVA)


Table A4. Complementary information for the feature selection methods in Table 6.