The Contribution of Machine Learning in the Validation of Commercial Wearable Sensors for Gait Monitoring in Patients: A Systematic Review

Gait, balance, and coordination are important in the development of chronic disease, but the ability to accurately assess these in the daily lives of patients may be limited by traditional, potentially biased assessment tools. Wearable sensors offer the possibility of minimizing the main limitations of traditional assessment tools by generating quantitative data on a regular basis, which can greatly improve the home monitoring of patients. However, these commercial sensors must be validated in this context with rigorous validation methods. This scoping review summarizes the state of the art between 2010 and 2020 in terms of the use of commercial wearable devices for gait monitoring in patients. For this period, 10 databases were searched and 564 records were retrieved. This scoping review included 70 studies investigating one or more wearable sensors used to automatically track patient gait in the field. The majority of studies (95%) utilized accelerometers, either alone (N = 17 of 70) or embedded in a device (N = 57 of 70), and/or gyroscopes (51%) to automatically monitor gait via wearable sensors. All of the studies (N = 70) used one or more validation methods in which “ground truth” data were reported. Regarding the validation of wearable sensors, studies using machine learning have become more numerous since 2010, reaching 17% of included studies. This scoping review highlights the current ability of commercial sensors to enhance traditional methods of gait assessment by passively monitoring gait in daily life, over long periods of time, and with minimal user interaction. Considering our review of the last 10 years in this field, machine learning approaches are algorithms to be considered for the future. These are data-based approaches which, as long as the data collected are numerous, annotated, and representative, allow for the training of an effective model.
In this context, commercial wearable sensors allowing for increased data collection and good patient adherence through efforts in miniaturization, energy consumption, and comfort will contribute to their future success.


Introduction
Human gait assessments study human movement and aim to quantify gait characteristics with various spatiotemporal parameters, such as stride speed and length, step length, cadence, standing, double support, and swing times [1]. Normal gait corresponds to an individual's motion pattern, and deviation in gait from this normal pattern can indicate a change in health status. In this regard, recent works have demonstrated that gait could have a link to functional health and could be an indicator for the course of chronic disease and, hence, rehabilitation feedback [2]. For example, ref. [3] demonstrated the value of studying gait asymmetry in post-stroke patients, ref. [4] identified gait variability as a marker of balance in Parkinson's disease, and ref. [5] described changes in gait and balance in the elderly. As a result, there is a move towards using gait analysis to aid in patient health assessment and monitoring.
Traditional methods for gait analysis in patients typically use walk tests as a standard assessment [6,7]. A walk test is an examination carried out over a fixed duration and/or distance in order to easily obtain speed measurements. The most commonly used walk test is the six-minute walk test (6MWT) [8], which assesses endurance at a speed comfortable for the subject by measuring the distance walked in 6 min along a straight corridor. Even though these tests are widely used to establish a link between the gait and physical state of the patient, important longitudinal gait patterns and transition patterns from one daily activity to another are not measured and cannot be explored. The ability to explore these patterns, such as the transition from turning to sitting [9], frequency of falls [10], or freezing episodes [11], is important because recent literature suggests that they may be able to inform about a deterioration in the patient's state of health and, therefore, of their chronic condition.
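For reference, the 6MWT's primary outcome reduces to simple arithmetic: the distance covered divided by the fixed test duration. The distance in the following sketch is purely hypothetical:

```python
# Hypothetical 6MWT outcome: distance walked during the fixed 6 min test.
distance_m = 420.0                         # metres covered (illustrative)
duration_s = 6 * 60                        # fixed duration: 360 s
mean_speed_mps = distance_m / duration_s   # mean comfortable speed, m/s
```

This single summary number is precisely what the test captures, and what it misses: no stride-to-stride variability, transitions, or long-term fluctuation is retained.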
Emerging technologies offer the possibility to improve the evaluation of traditional methods by increasing the quality and the duration of the window of data acquisition by measuring gait in daily activities over long periods of time. Wearable devices with embedded sensors allow in particular for the passive collection of various data sets, which can then be used to develop algorithms to assess gait in real life conditions and over long periods of time [12,13]. This opens up many perspectives, especially in the case of chronic diseases where the disease profile varies for each individual and has fluctuating symptoms. Twenty-four hour home monitoring in a real environment is an ideal solution for an accurate diagnosis of symptoms as well as good patient compliance [14].
In the past decade, commercial wearable sensors have been used not only in the consumer market but also in research studies. In particular, wearable sensors are used in physical activity monitoring for measurements and goal setting [15]. More recently, a more specific use of these sensors was introduced in research studies in medicine and rehabilitation [16,17]. Studies of wearable sensors for gait assessment have primarily been conducted in a lab and with controlled protocols [18], indicating that commercial sensors can be challenging to deploy and validate. More recently, the testing of the sensors in patient monitoring has expanded into real-life conditions. Previous research has shown significant differences in spatiotemporal gait parameters between similar in-lab and in-field studies [19], illustrating the importance of establishing commercial sensor validity for long-term patient monitoring and for detecting events and, more particularly, deviations from normal human gait.
There are already many reviews on the validation of commercial wearable sensors available in the literature, and most were interested in monitoring activity in healthy subjects [15,20–22] while others have taken a descriptive approach centered on a very specific medical application [18,23,24]. However, few studies focus on the validation methods, the ground truth used, and how the reference data are annotated. A common validation method is to use inferential statistics, such as a regression analysis to explore and model the relationship between sensor and ground truth data. These approaches typically assume that the relationship between sensor and ground truth data follows a linear pattern. Linear regression has the advantage of being simple to use and to interpret. In comparison with these linear methods, the nonlinear methods fit more types of data in terms of shape and are hence recognized as being more general. Some nonlinear approaches such as machine learning have the advantage of being less dependent on the assumption of the model and very recently produced promising results in sensor validation [25,26]. Nonlinearity seems particularly interesting in terms of patient monitoring in order to integrate networks of several sensors placed at different places on the patient [27,28] and for high-level tasks (such as the classification of patients into groups according to the evolution of a disease) [29,30], which requires the integration of various information on locomotion and control systems involved in complex gait regulation [31,32].
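As a minimal sketch of the linear-validation idea described above, sensor outputs can be compared against reference ("ground truth") measurements via a correlation coefficient and a root mean square error. All values below are hypothetical; Bland-Altman limits of agreement are a common companion analysis not shown here:

```python
import math

# Hypothetical paired gait-speed measurements (m/s): one series from a
# wearable sensor, one from a reference motion-capture system.
sensor    = [1.02, 1.15, 0.98, 1.30, 1.21, 0.87, 1.05, 1.18]
reference = [1.00, 1.12, 1.01, 1.27, 1.25, 0.90, 1.03, 1.15]

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def rmse(x, y):
    """Root mean square error of the sensor against the reference."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))

r = pearson_r(sensor, reference)   # strength of the linear relationship
err = rmse(sensor, reference)      # absolute disagreement in m/s
```

The implicit assumption of linearity in this kind of analysis is exactly what nonlinear approaches such as machine learning relax.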
In this paper, our aim was to conduct a systematic review (i) to determine the statistical methods currently used for the validation of sensors and (ii) to determine to what extent machine learning (ML) is used as a statistical method for this validation step.

Methods
This scoping review is reported using the Preferred Reporting Items for Systematic reviews and Meta-Analyses Extension for Scoping Reviews (PRISMA-ScR) checklist [33].

Databases
We conducted a literature search of the PubMed, SCOPUS, ScienceDirect, Web of Science, IEEE Xplore, ACM Digital Library, Collection of Computer Science Bibliographies, Cochrane Library, DBLP, and Google Scholar (first 50 results) databases for all literature published between 2010 and 2020.

Inclusion Criteria
Only peer-reviewed journal or conference papers were included in this review if they were published between January 2010 and December 2020 and were written in English. In addition, eligible articles had to meet all of the following criteria, based on the content given in the article:

1.
The study must be centered on gait or posture analysis (e.g., detecting stance and swing phases or the risk of falling). Studies focusing only on activities or step counting were excluded.

2.
Given the application to remote monitoring in patients, only devices allowing a wireless data flow were considered. This flow had to be conducted via Bluetooth between the device and a smartphone, which then sent the data via Wi-Fi to a remote server. Sensors that temporarily store the data locally and send them a posteriori, when a Wi-Fi connection is available, were also included.

3.
The devices had to have been used in a clinical setting for long-term follow-up or rehabilitation of a chronic pathology. Studies on young or healthy subjects and on animals were excluded.

4.
The validity of the sensor and the resulting indicators must have been assessed. Therefore, a ground truth must be proposed, and the study must include at least one statistical measure (e.g., statistical test, correlation, or mean square error) or one evaluation metric (e.g., accuracy, F1-score, precision, or sensitivity) to indicate the performance of the sensor in detecting the associated gait feature.

Review articles, commentary articles, study protocol articles, and any other articles without reported results from empirical research were excluded.
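To make the evaluation metrics named in criterion 4 concrete, the following sketch (with hypothetical step-detection counts) derives them from a binary confusion matrix:

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, sensitivity (recall), and F1-score from a
    binary confusion matrix of detected vs. true gait events."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, precision, sensitivity, f1

# Hypothetical step-detection outcome: 90 true positives, 10 false
# positives, 5 missed steps, and 95 correctly ignored non-step events.
acc, prec, sens, f1 = classification_metrics(90, 10, 5, 95)
```

Statistical measures (correlation, mean square error) apply to continuous sensor outputs, whereas these metrics apply to discrete detections; the criterion accepts either.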

Selection of Articles
The records retrieved from the databases were gathered in CSV files. All duplicate articles were removed. First, we reviewed the titles and abstracts of all articles (Figure 1). During this first phase of selection, articles were excluded if they did not describe at least one wearable device used to automatically assess gait as part of the follow-up of a chronic pathology, with particular attention paid to the validation of the device. If this information could not be verified from the title and/or abstract, the article's full text was reviewed in a further screening phase to determine whether it fit the eligibility criteria. Moreover, if the abstract indicated that the study was not peer-reviewed, was not written in English, was not accessible online, or corresponded to a study conducted on animals, it was excluded. After the initial title/abstract selection process, we evaluated the full text of the remaining articles. Articles were then excluded if they did not meet the eligibility criteria (Figure 1).

Data Extraction
Three research assistants independently extracted the following study characteristics from the final set of eligible studies using a custom-made data extraction worksheet. The following characteristics were considered for the analysis of the papers included in our systematic review:

1.
Sample size: the total number of participants for each study.

2.
Pathology: the disease monitored in the study.

3.
Duration of data collection: how long the participants wore the sensor(s) to collect data for the study.

4.
Condition of data collection: specifies whether the study was conducted in a laboratory or in free-living conditions.

5.
Number of wearable devices: the total number of wearable devices whose sensor signal data were used to study the patient's gait. Any other equipment that was part of the acquisition system but did not provide data to evaluate gait was not included in this count.

6.
Type of sensor(s): the type of sensor embedded within the wearable device(s) used to assess gait.

7.
Device brand(s) and model(s): the specific brand and model of the wearable device(s) used in the study.

8.
Location of device(s): details specific to the placement/location of the wearable device(s) on the patient's body.

9.
Gait indicators measured by the device(s): gait outcomes derived from the signal recorded by the device. In some studies, several gait indicators were extracted from the raw data.

10.
Ground-truth method(s): the method used in the study to evaluate the performance of the device(s) in assessing gait.

11.
Evaluation metric(s) of the device(s): any evaluation metric, reported either in the text, a figure, or a table, that described the performance of the wearable device(s) in assessing gait. Only evaluation metrics that were exclusively used to study gait were included.

Summarizing Data and Categories
Mean and standard deviation were calculated for each extracted numerical variable (sample size, duration of data collection, and number of devices). Frequency tables were constructed for each extracted categorical variable (pathology, condition of data collection, sensor types, device brand and model, device location, ground-truth methods, gait features, and evaluation metrics). Regarding these categorical variables, here are the categories that we considered and their meanings. These categories are not exhaustive of all possible types of categories but correspond to those proposed in the context of the included studies.
The devices are categorized according to three types: (i) smartphone, (ii) inertial measurement unit (IMU), and (iii) single sensor.
The device location is categorized according to four levels: (i) superior, if the device was carried in the hands or on the arms; (ii) inferior, if the device was carried on the legs or feet; (iii) chest, if the device was carried on the chest or the trunk; and (iv) free location, if the device was in a pocket or more prone to moving around, or if its location on the body was not distinguished.
The ground-truth methods are categorized according to six levels: (i) controls, where a group of subjects served as a reference; (ii) expert, where the data were analyzed with regard to annotations made by experts; (iii) med device, where the data were analyzed with regard to a portable device already used in clinical routine; (iv) medical, where the data were analyzed with regard to a medical examination/test or clinical score; (v) metrologic, where other high-resolution equipment was used as a reference; and (vi) user annotations, where the data were analyzed with regard to annotations made by patients during the use of the device.
The gait features are categorized according to three levels: (i) low, where the analysis was conducted on raw signals without postprocessing; (ii) medium, where the analysis was based on statistical descriptors extracted from the signals (mainly statistical moments or common signal processing features); and (iii) high, where the analysis was based on descriptors at a high level of representation that disregards the technical characteristics of the equipment or methods used (e.g., step length, cadence, and number of steps).
Finally, the evaluation methods are categorized according to five levels: (i) descriptive stat, where evaluation was carried out through descriptive statistics only; (ii) descriptive stat + test, where evaluation was carried out through descriptive statistics with statistical tests; (iii) linear models + stat test, where evaluation was carried out through linear models with statistical tests; (iv) machine learning, where evaluation was carried out through machine learning only; and (v) machine learning + stat test, where evaluation was carried out through machine learning with statistical tests.
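To make the three gait-feature levels concrete, the following minimal Python sketch (with entirely illustrative values and an illustrative threshold) derives medium- and high-level descriptors from a toy raw acceleration trace:

```python
from statistics import mean, stdev

# Toy vertical-acceleration trace (m/s^2): the "low"-level (raw) data.
low = [9.7, 10.2, 11.5, 9.9, 8.8, 9.4, 11.2, 12.0, 9.8, 8.9]

# "Medium" level: statistical descriptors extracted from the raw signal.
medium = {"mean": mean(low), "std": stdev(low), "range": max(low) - min(low)}

# "High" level: an equipment-independent gait indicator, here a step
# count estimated as local acceleration peaks above an illustrative
# 11 m/s^2 threshold.
steps = sum(
    1 for a, b, c in zip(low, low[1:], low[2:])
    if b > a and b > c and b > 11.0
)
```

The high-level quantity (a step count) no longer depends on the sampling rate, units, or placement of the particular sensor, which is what distinguishes it from the medium-level descriptors.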

Results
In this section, we analyze the selected papers by categorizing them following different criteria in order to extract common patterns and trends. Figure 1 details the entire process of paper selection for this review. The literature search (made from the queries given in Table A1 of Appendix A) produced 564 research articles, with 118 duplicates, resulting in 446 articles to be screened. After an initial screening, which consisted of reviewing all article titles and abstracts, the full content of 102 of these articles was screened in more detail for eligibility. After removing the articles that did not meet the inclusion criteria detailed in Section 2.3, 70 articles were deemed eligible for the review.

Literature Search
The number of studies addressing the validation of sensors used for patient monitoring has significantly increased since 2010, with more than twice as many papers published between 2017 and 2020 as between 2010 and 2017 (see Figure 2). Studies using machine learning as a validation method have also become more numerous since 2010 [34–36,38,45,53,60,63,68–70,77,79–81,86,95,97], with a stable proportion relative to the total number of studies per year.

Figure 2. Evolution of the number of papers considering the issue of validation for the use of commercial wearable devices in chronic disease monitoring, with a distinction between papers using machine learning (in red) or not (in blue). The percentages given in red represent the proportion of studies using machine learning.


Wearable Sensor Types
As detailed in Table 2, many of the included studies used multi-sensor systems (incorporating more than one sensor) to automatically assess gait in chronic pathologies. On average, 5.78 wearable sensors (SD = 8.43) were used in the studies, with a range of 1 to 64 sensors (see Table 2). As depicted in Table 3, the most commonly utilized sensor was the accelerometer (95%), either alone (N = 17) or embedded in a device (N = 57). The second most frequently used sensor was the gyroscope (51%), followed by the magnetometer (14%) and others (16%). Figure 4 reports the different brands used for smartphones, sensors, and IMUs. Regarding smartphones, Samsung [41,45,51,68,69,77,86,103] and Apple iPhone [40,42,69,76,89] are the most represented, likely because of their health applications for gait recording. ActiGraph is the most commonly used brand for sensors [38,40,48,49,67,71,74,85,96,103]. Regarding the different brands of IMU, no particular brand stands out.

Table 2. Criteria related to commercial wearable devices through the 70 selected papers. Abbreviations used in the column "No. of device(s)": IMU (inertial measurement unit), S (sensor), and SPHN (smartphone). Abbreviations used in the column "Sensor Type(s)": A (accelerometer), G (gyroscope), M (magnetometer), and O (others).

A closer look at the studies using ML highlights that machine learning-based approaches are often used for high-level validation tasks (see Table 7), such as distinguishing between different groups of patients or stages of disease progression [34–36,45,68,70,80,86,97]. This is an important point because ML aims to generalize a model to patients not included in the initial data set. Another point to emphasize, as illustrated in Table 8, is that studies using machine learning as a validation method incorporate a large number of variables (the complete raw signal or a collection of different sensors) [34,60,63,70,77,80,81]. This is not the case in studies using statistical methods, which work with a few dozen variables at most, often in a univariate way, two by two [37,56,59,90,102,103].

Table 7. Selection of papers that use machine learning methods in validation. Abbreviations used in the column "Model type": SVM (support vector machine), GPR (Gaussian process regression), NN (neural network), RF (random forest), LSTM (long short-term memory), HMM (hidden Markov model), kNN (k-nearest neighbors), CNN (convolutional neural network), ROC (receiver operating characteristic), and LDA (linear discriminant analysis). Abbreviations used in the column "Outcome": r (correlation coefficient), NRMSE (normalized root mean square error), RMSE (root mean square error), AUC (area under curve), sens (sensitivity), spe (specificity), and IQR (interquartile range). Studies that use raw data as input have a number of descriptors corresponding to the number of sensors and/or axes multiplied by the length of the recorded data; this is noted (*n) in the table.

Figure 6. Pie chart representing the percentage of papers using different levels of evaluation identified among the 70 selected papers. These different levels correspond to the categories described in Section 2.6.

Table 8. Frequency of studies using fewer than 10 descriptors, between 10 and 100 descriptors, and more than 100 descriptors for the validation of both statistical and ML methods.

Summary of Key Findings
This scoping review included 70 studies related to the validation of commercial wearable sensors to automatically monitor gait in patients published between 2010 and 2020. The majority of studies (95%) used accelerometers, either alone (N = 17 of 70) or embedded in a device (N = 57 of 70), and/or gyroscopes (51%) to automatically monitor gait via wearable sensors. Labeling according to two groups (group of patients and healthy controls) was the most frequently used method (N = 39 of 70) for annotating ground-truth gait data, followed by annotations made by experts on data from videos or measurements during the experiment (N = 15 of 70) and patient self-reports (N = 4 of 70). The references against which the sensor data were compared were a metrological device and a medical examination in equal parts and, to a lesser extent, a third-party portable medical device. Finally, studies using machine learning as a validation method have become more numerous since 2010, at 17% of included studies.

Discussion
Gait monitoring of patients during daily life using commercial wearable sensors is a growing field and offers novel opportunities for future public health research. However, despite their rapid expansion, the use of commercial wearable sensors remains contested in the medical community: objections concern the quality of the data collected as well as the reliability of the technologies in a clinical context where the pathologies are diverse and sometimes combined [104]. Previous literature reviews on the validation of wearable sensors were interested in monitoring activity in healthy subjects [15,20–22] or have often placed a focus on a very specific medical application [18,23,24]. No review to date has focused on studies using wearable devices in a very general way to automatically detect gait in patients in their daily life and via machine learning, which is an approach increasingly used to learn a recognition task from data. By examining the validation methods and performances of wearable devices and sensors that automatically monitor patient gait, several major trends and challenges can be identified.

Trends and Challenges
Acquisition context. Most of the first studies were restricted to the laboratory environment and to short acquisition times (of the order of a few minutes). The first papers to report sensor validation in a free-living environment appeared in 2011 [53,74]. As seen in Table 9, from 2017 onwards, studies of this type became more frequent [46,50–52,55,59,62,66,77,86,94,96,98,103] due to changes in the sensors, which are detailed in the following section.

Table 9. Data acquisition criteria through the 70 selected papers. Abbreviations used in the column "Duration of data collection": min (t < 1 h), hours (1 ≤ t < 24 h), days (1 ≤ t < 7 days), weeks (1 ≤ t < 4 weeks), months (1 ≤ t < 12 months), and year (t ≥ 1 year). The cohort size is given as the number of patients.

[Table 9 (excerpt; author, year, pathology, cohort size, duration, condition): Derungs et al. [52], 2018, hemiparesis, 11, weeks, free living; Mileti et al. [81], 2018, Parkinson, 26, min, laboratory; Aich et al. [35], 2018, Parkinson, 51, min, laboratory; Cheong et al. [46], 2018, cancer, 102, months, free living; Ata et al. [40], 2018, artery disease, 114, min, laboratory; Kim et al. [70], 2018 (row truncated).]

Sensors. In this review, we observe that early research efforts attempted to improve gait monitoring in patients by experimenting with new sensor types and/or sensor locations. The first paper to report the validation of a wearable sensor for monitoring gait in patients appeared in 2010 [90], but the topic did not become more prevalent until 2017, during which nine other papers on this subject were published [45,47,54,63,73,78,79,88,96]. Over time, research efforts have focused on refining validation protocols, whether in terms of the number of sensors or their locations, with emphasis on two major criteria: the ability of sensors to capture gait patterns and practicality in everyday life. As seen in Tables 2 and 3, the majority of studies (95%) used accelerometers and/or gyroscopes, typically embedded within an IMU or smartphone. This observation highlights the emergence of commercial wearable devices as a practical and user-friendly modality for gait monitoring in daily life. In addition to user adoption, commercial wearable devices also have engineering advantages, such as a compact format with suitable computing and power resources. A single sensor is usually worn near the center of gravity, in a pocket [42,43,45,50,51,77,86], on the chest [39,44,64,84,92], or on the pelvis [59,61,65,72,94,96].
Another trend that emerges from Table 2 is that several sensors were often used together, generally at various on-body locations [37,48,52,54–56,60,63,65,67,70,73,75,79,83,87,90,93,95,97–99,102]. However, using a multi-sensor system introduces several challenges, including the integration of different sampling rates and signal amplitudes and the alignment of signals from multiple devices and, therefore, different clock times. Despite these challenges, the multi-sensor approach offers high potential for the real-time monitoring of gait, where multi-sensor fusion can provide context awareness (e.g., whether the patient stays mainly at home or leaves home from time to time) and can contribute to the optimization of power (e.g., a low-power sensor can trigger a higher-power sensor only when necessary).
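As a minimal sketch of the sampling-rate alignment problem, and assuming simple linear interpolation is acceptable for the signals in question, two streams sampled at different rates can be brought onto a common clock as follows (all values are illustrative):

```python
import bisect

def resample(times, values, new_times):
    """Linearly interpolate a signal onto a new time grid: a common
    first step when fusing sensors with different sampling rates."""
    out = []
    for t in new_times:
        # Index of the first sample strictly after t, clamped so that
        # (i - 1, i) is always a valid bracketing pair.
        i = bisect.bisect_right(times, t)
        i = max(1, min(i, len(times) - 1))
        t0, t1 = times[i - 1], times[i]
        w = (t - t0) / (t1 - t0)
        out.append(values[i - 1] + w * (values[i] - values[i - 1]))
    return out

# A 100 Hz accelerometer timeline resampled onto a 50 Hz gyroscope clock.
accel_t = [i / 100 for i in range(11)]        # 0.00 .. 0.10 s
accel_v = [float(i) for i in range(11)]       # toy linearly rising signal
gyro_t = [i / 50 for i in range(6)]           # 0.00 .. 0.10 s
aligned = resample(accel_t, accel_v, gyro_t)  # accelerometer on gyro clock
```

Real deployments also need to correct clock offset and drift between devices (e.g., by cross-correlating a shared event), which this sketch deliberately omits.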
Ground truth. Our review indicates that 53% of the included studies use annotations. As seen in Figure 5, there is still a strong reliance on labels defined by groups of individuals (56%; mainly a group of patients versus a group of healthy subjects), followed by annotations made by experts on data from videos or measurements during the experiment (21%) and patient self-report (6%). These last two annotation methods are surely less numerous because they can be very costly and time-intensive; they are also of questionable quality because maintaining logs is a burdensome process for the participant and ultimately relies on their memory. This has notably led to the emergence of initiatives in terms of intelligent annotation [105].
As seen in Table 7, researchers have used machine learning on sensor data for different tasks: regression for continuous labelled data (speed, step length, or distance) [53,69,79,86] and classification of discrete labelled data such as groups of patients [35,36,38,45,68,80,86,97] or medical functional scores [34,45,63,95,100]. Classification, less commonly used for the validation of sensors, aims for higher-level analyses, namely to identify a robust methodology able to monitor patients in time while at the same time discriminating between a pathological and physiological gait, or the evolution of the disease studied on the basis of gait movements.
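As an illustrative sketch of the classification setting (not any specific study's method), the following toy example evaluates a nearest-centroid classifier separating hypothetical patients from controls, using leave-one-out validation, a scheme common with small cohorts:

```python
from statistics import mean

# Hypothetical (stride-time variability in s, gait speed in m/s) pairs
# for controls (label 0) and patients (label 1).
data = [
    ((0.02, 1.30), 0), ((0.03, 1.25), 0), ((0.02, 1.35), 0),
    ((0.04, 1.28), 0), ((0.09, 0.95), 1), ((0.11, 0.90), 1),
    ((0.10, 1.00), 1), ((0.08, 0.98), 1),
]

def nearest_centroid_predict(train, x):
    """Assign x to the class whose feature centroid is closest."""
    dist = {}
    for label in {l for _, l in train}:
        pts = [f for f, l in train if l == label]
        centroid = tuple(mean(p[i] for p in pts) for i in range(len(x)))
        dist[label] = sum((a - b) ** 2 for a, b in zip(x, centroid))
    return min(dist, key=dist.get)

# Leave-one-out evaluation: hold out each subject in turn.
correct = sum(
    nearest_centroid_predict(data[:i] + data[i + 1:], f) == l
    for i, (f, l) in enumerate(data)
)
accuracy = correct / len(data)
```

On such cleanly separated toy data, the held-out accuracy is perfect; real cohorts overlap, which is why the reviewed studies report metrics such as sensitivity, specificity, and AUC rather than accuracy alone.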
The families of machine learning algorithms used have evolved over time: standard approaches were used before 2017, and deep learning approaches, with automatic feature extraction requiring no human intervention (unlike most traditional machine learning algorithms), appeared for the first time in 2018 [77]. It should be noted that, in the context of the papers studied in this review [60,70,77,80], these approaches concern studies with a significant number of patients (≥30) and/or relatively long acquisition times [77,80] in order to guarantee a sufficiently representative and realistic sample. Other studies based on machine learning preferred more standard approaches with a small number of expert features when their samples were more limited in terms of the number of patients [38,63,68,69,79,81,86,95,100] or the acquisition time [34–36,45,97]. Comparing the results of the different studies in terms of performance seems, at this stage, to be a difficult task because, as stated previously, it depends on the complexity of the task to be performed and on the complexity of the machine learning algorithm implemented.
Finally, it should be mentioned that machine learning also has drawbacks, with the first being the computational time required to train a model [106]. This is justified for complex analysis tasks such as classification or significant performance increases for a regression task. Moreover, ML may require the adjustment of hyperparameters that may demand theoretical knowledge in optimization. Finally, ML tends to be more difficult to interpret for a clinician who looks for the most relevant parameters to analyze the gait patterns of patients. However, it should be noted that recent initiatives have been carried out to demystify these two points [107,108].

Recommendations
Advanced inertial sensors, including accelerometers and gyroscopes, are commonly integrated into smartphones and smart devices nowadays. It is therefore convenient and cheap to collect inertial gait data for gait monitoring with high accuracy. However, most existing validation methods ask the person to walk along a specified path (e.g., a straight corridor) and/or at a normal speed. Such strict requirements heavily limit wide application, which motivates us to give some recommendations for future work in this context.

Data acquisition. A first step would be to precisely define validation protocols, in consultation with medical staff, adapted to the study of chronic pathologies. Indeed, many studies only validate sensors for a given medical application without having tested them outside the laboratory, on a very limited number of patients, and over a relatively short time window (at most a few hours). The protocol to be defined should therefore impose experimental constraints closer to the daily life of patients: the data should be acquired at home, on a sufficient number of patients, and over a sufficiently long acquisition period (several weeks or even months).
It would also be necessary to define within the protocol which types of sensors would be more suitable according to the studied pathology, how many sensors would be necessary, and where to place them on the patient [18]. There is a clear trade-off between the accuracy of the recorded data and the invasiveness of the portable system: the greater the number of sensors and the more varied they are placed on different parts of the patient's body, the more accurate the measurements will be, but this is at the expense of a practical, accommodating, and portable use.
Data collection and processing. Today, most sensors record a lot of data about their users. However, most wearable devices do not have the memory and computing power to process and analyze all of the recorded signals. Faced with this problem, two solutions are generally considered: either the system uses only a part of the recorded data to provide accurate indicators (throwing away a massive amount of potentially interesting data) [109,110] or the system stores and analyzes all raw data on the cloud [111,112]. The latter option is often problematic because the traditional architecture is centralized and offers little protection against potential cyber attacks. Centralizing raw data on a server poses some risk, especially if the data is sent to an external server, as it facilitates access to malicious attackers. A more reliable and secure alternative regarding the collection and processing of data would therefore be to process the raw inertial signal on the user's smartphone and to transfer only relevant features unlinked to the identity of users to the cloud [113,114]. Finally, the mobile clients associated with wearable devices have to send a lot of data to a centralized server for training and model inference. This is especially difficult due to user billing plans and user privacy. Thus, very recently, decentralized architectures dedicated to machine learning have emerged [115].
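A minimal sketch of this privacy-preserving option, with hypothetical feature names and window values: raw samples are reduced on the device to a few aggregate features, and only those are serialized for upload, so the identifiable raw signal never leaves the phone:

```python
import json
from statistics import mean, stdev

def summarize_window(samples):
    """Reduce a raw acceleration window (m/s^2) to a few aggregate
    features; only these, not the raw signal, are transmitted."""
    return {
        "n": len(samples),
        "mean": round(mean(samples), 3),
        "std": round(stdev(samples), 3),
    }

# Hypothetical one-second window of vertical-acceleration samples.
window = [9.8, 10.4, 11.9, 10.1, 8.7, 9.2, 10.8, 12.1, 10.0, 8.9]
payload = json.dumps(summarize_window(window))  # compact upload payload
```

Beyond privacy, the payload is orders of magnitude smaller than the raw window, which directly addresses the billing-plan and bandwidth concerns mentioned above.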
Validation. It is mandatory to ensure that sensor recordings are accurate and sensitive enough for medical diagnosis and prognosis. This is crucial to ensure not only the generalizability of a sensor within a target population but also its ability to measure day-to-day variability, which can be corroborated with disease symptoms. To this end, data acquired by commercial wearable sensors should be systematically compared to data acquired by reference medical devices (i.e., reliable gold-standard systems, medical scores, or groups of subjects). Machine learning approaches make it possible to loosen the strict framework of acquisition protocols but require that the data set collected for training be large, labelled, and realistic. Deep approaches, which automatically extract features from data, offer very interesting perspectives, given that manual feature engineering is a task that can take teams of data scientists years to accomplish; automating it augments the power of small expert teams, which by their nature do not scale.
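As a toy illustration of such a comparison against a gold standard, the sketch below computes common agreement statistics (Bland-Altman bias and limits of agreement, Pearson correlation, mean absolute percentage error) between paired step counts from a wearable and a reference system; all numbers are invented for illustration only.

```python
import numpy as np

# Hypothetical paired step counts: wearable vs. gold-standard reference
gold     = np.array([980, 1240, 1105, 860, 1430, 1010, 1320, 905])
wearable = np.array([962, 1251, 1088, 871, 1398, 1022, 1301, 890])

diff = wearable - gold
bias = diff.mean()                         # systematic over/under-count
loa = 1.96 * diff.std(ddof=1)              # Bland-Altman limits of agreement
r = np.corrcoef(gold, wearable)[0, 1]      # Pearson correlation
mape = np.mean(np.abs(diff) / gold) * 100  # mean absolute percentage error

print(f"bias={bias:.1f} steps, LoA=+/-{loa:.1f}, r={r:.3f}, MAPE={mape:.2f}%")
```

Reporting bias and limits of agreement alongside correlation matters because a sensor can correlate strongly with the reference while still systematically under- or over-counting.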
Statistical models versus ML. Statistical models are designed for inference about the relationships between variables and are suited to data with a few dozen input variables and small sample sizes. Machine learning models, on the other hand, are designed to make the most accurate predictions possible. Statistical models can make predictions, but predictive accuracy is not their strength; indeed, they typically require no training and test sets. Furthermore, machine learning aims to build a model that can make repeatable predictions in a high-dimensional space without formulating a hypothesis about the underlying data-generation mechanism, and ML methods are particularly useful when the number of input variables exceeds the number of samples [116]. Hence, whether to use machine learning in a validation task depends strongly on the purpose of the study. To prove that a sensor is able to respond to a certain kind of stimulus (such as a walking speed), a statistical model should be used. Conversely, to predict from a collection of different sensors whether a patient is affected by a certain grade of a disease of the musculoskeletal system, machine learning is probably the better approach: such a multi-dimensional space (one or more dimensions per sensor) is difficult to interpret and therefore to analyze directly. The ML model would then probably be a neural network or a random forest, in order to capture the nonlinearities arising from the complex relationship between the physical sensors and the classification output.
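The random-forest scenario above can be sketched as follows on fully synthetic data, using scikit-learn: more features than samples, a nonlinear relationship between sensor features and the binary disease grade, and evaluation on a held-out split. The data set, feature count, and labels are all simulated assumptions, not results from the reviewed studies.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical cohort: 120 patients, 150 gait features pooled from
# several sensors (more features than samples), binary disease grade.
n, p = 120, 150
X = rng.normal(size=(n, p))
# Nonlinear ground truth driven by two informative features plus noise
y = (X[:, 0] + 0.8 * X[:, 1] ** 2 + 0.2 * rng.normal(size=n) > 1.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

clf = RandomForestClassifier(n_estimators=300, random_state=0)
clf.fit(X_tr, y_tr)
acc = accuracy_score(y_te, clf.predict(X_te))
print("held-out accuracy:", acc)
```

A classical linear statistical model would struggle here both with the p > n setting and with the squared term, which is exactly the regime where ensemble methods such as random forests are useful.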

Conclusions
The field of gait monitoring in patients is still emerging, and the accuracy of commercial wearable sensors still depends on careful constraints during data acquisition. Collecting data in daily life is considerably more challenging than conducting research in a laboratory: in free-living conditions, continuous control of the sensors, participants, and hardware or software is lost. Successful sensor deployment therefore requires highly robust algorithms, and if the objective is to monitor gait completely freely over a long period of time, accuracy must not be sacrificed. Considering this review of the last 10 years in the field, validation occupies an increasingly important place in the literature, with the number of studies having gradually increased since 2010. In these studies, a significant part of the validation was based on traditional statistical approaches (75%), with a stable contribution of machine learning-based approaches (25%). Machine learning approaches are algorithms to be considered for the future: they are data-driven approaches which, as long as the data collected are numerous, annotated, and representative, allow an effective model to be trained. It should be noted that commercial wearable sensors enabling large-scale data collection and good patient adherence, through efforts in miniaturization, energy consumption, and comfort, will contribute to their future success.

Conflicts of Interest:
The authors declare no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript:
6MWT  Six-minute walk test
ML    Machine learning
SD    Standard deviation
IMU   Inertial Measurement Unit

Appendix A. Extraction from Databases

(database not named in source), in Title/Abstract/Keyword, Jan 2010 to October 2020 (15 records):
((gait OR actimetry OR actigraphy OR walk) AND (smartphone OR wearable OR iot) AND ("chronic disease" OR rehabilitation OR medicine) AND (validity OR reliability OR reproductibility OR validation))

DBLP (31 records):
(gait | walk | actimetry) (smartphone | device | iot) (valid | rehabilitation)

IEEE Xplore (54 records):
((gait OR actimetry OR actigraphy OR walk) AND (smartphone OR wearable OR iot) AND ("chronic disease" OR rehabilitation OR medicine) AND (validity OR reliability OR reproductibility OR validation))

PubMed, filters: from 2010-2020 (52 records):
((gait OR actimetry OR actigraphy OR walk) AND (smartphone OR wearable OR iot) AND ("chronic disease" OR rehabilitation OR medicine) AND (validity OR reliability OR reproductibility OR validation))

Scholar (1010 records):
title:(gait smartphone "wearable device" rehabilitation validity)

ScienceDirect #1 (3 records):
((gait OR actimetry) AND (smartphone OR iot) AND ("chronic disease" OR medicine) AND (validity OR validation))

ScienceDirect #2 (10 records):
((gait OR walk) AND (smartphone OR wearable) AND (rehabilitation OR medicine) AND (validity OR reliability))