Development of an Automatic Functional Movement Screening System with Inertial Measurement Unit Sensors

Background: In this study, an automatic scoring system for the functional movement screen (FMS) was developed. Methods: Thirty-five healthy adults fitted with full-body inertial measurement unit sensors completed six FMS exercises. The system recorded kinematics data, and a professional athletic trainer graded each participant. To reduce the number of input variables for the predictive model, ordinal logistic regression was used for subset feature selection. The ensemble learning algorithm AdaBoost.M1 was used to construct classifiers. Accuracy and F score were used for classification model evaluation. The consistency between automatic and manual scoring was assessed using a weighted kappa statistic. Results: When all the features were used, the predictive model presented moderate to high accuracy, with kappa values indicating fair to very good agreement. After feature selection, model accuracy decreased by about 10%, with kappa values indicating poor to moderate agreement. Conclusions: The results indicate that higher prediction accuracy was achieved using the full feature set than using the reduced feature set.


Introduction
The functional movement screen (FMS) is a screening tool widely used by sports medicine practitioners to evaluate fundamental movement patterns in competitive athletes who are at risk of, but not currently experiencing, signs or symptoms of musculoskeletal injury. The FMS comprises seven tests that require both mobility and stability for successful completion, namely deep squat, hurdle step, in-line lunge, shoulder mobility, active straight leg raise, trunk stability pushup, and rotary stability. Each test is scored on a scale of 0-3 points according to quality of movement, with a maximum total composite score of 21 [1,2,3,4]. When the total score is lower than 14 points, injury risk increases by a factor of 2.74 (95% confidence interval 1.70-4.43) [2]. Similar to other fitness tests, the FMS must be manually administered and scored by professionally trained individuals [5], which hinders its wide implementation in gyms, sports studios, and other settings. In the present study, inertial measurement unit (IMU) sensors were used to record the movement data of participants performing six FMS tests. Artificial intelligence (AI) was used to perform automatic movement screening.
With advancements in science and technology, small, compact, portable, low-cost, multifunctional wearable sensors play increasingly important roles in the sports and medical fields. These sensors allow the acquisition of real-time movement data for analysis. IMUs are typically equipped with a three-axis gyroscope and accelerometer. Their applications include observing the movement patterns of athletes or the gait patterns of patients with a stroke [6,7]. IMU sensors have been used as movement indicators in studies of one-leg squat movements [8], weight training [9], running fatigue [10], walking status [11], skiing [12], and tennis [13], all of which indicated high correlations between IMU sensor signals and movement performance. However, in a 2014 study by Whiteside et al., discrepancies were observed between automatic FMS scoring (achieved with 17 full-body IMU sensors) and manual scoring [14]. These differences may be attributable to the use of self-set kinematic thresholds for FMS scoring; whether such thresholds are objective for each test remains questionable. Various kinematic parameters have been applied in other studies on FMS assessment (e.g., [15]), including the excursion angle of the limb relative to any plane (in degrees), whether the limb is aligned or passes through a certain plane, and the displacement or minimum distance between various landmarks (in cm). To simplify system analysis, in the present study, the quality of physical movement was considered based on mobility and stability, the essential elements of the FMS. Mobility, defined as the "ability of an individual to initiate, control, or sustain active movements of the body to perform motor tasks", is essential for achieving sufficient range of motion (ROM). Stability, defined as the "ability to actively control one's body within a limit of range", refers to the degree of control over a movement [16].
Furthermore, to realize the objective of developing an automatic FMS system, no specific upper or lower ROM thresholds were set for either mobility or stability. The automated score results generated using AI were compared with manual scores for verification of classification accuracy.
AI involves various modeling methods, each with its own advantages, disadvantages, and suitable application areas. The size of the training data set and the number and type of parameters all affect the selection of the most suitable model [17]. The establishment of a good classification model is crucial [18]. In machine learning, an ensemble model denotes a single predictive model composed of submodels. Ensemble models often perform better than any single classifier. Numerous ensemble learning methods, including boosting [19], bagging [20], and stacking [21], have been investigated. The purpose of the present study was to develop a boosting ensemble machine learning method, in which score and ROM are the dependent and independent variables, respectively, that assesses movement dysfunction by automatically detecting movement deviations during the FMS test from ROM data collected by IMU sensors.
Although sufficiently large data input may facilitate the determination of appropriate input combinations for improving outputs, the large number of IMUs required to measure full-body movement greatly limits the applicability of wearable devices. To reduce the number of IMUs used for FMS score prediction, we used ordinal logistic regression as the feature selection algorithm. After selecting a subset of candidate variables, models were built and validated. The resulting prediction accuracy was compared with the accuracy obtained using all features. In short, the aim of this study was to develop a highly accurate automatic scoring system with minimal human intervention, establish the optimal combination of prediction parameters, and determine whether the screening accuracy can be maintained after reduction of the number of sensors.

Participants
The participants comprised 35 healthy adults (20 men and 15 women, age: 24.9 ± 2.4 years, height: 166.82 ± 9.91 cm, and weight: 60.54 ± 14.35 kg). Individuals who had musculoskeletal diseases or who had experienced trauma of the upper or lower extremities (e.g., fractures) within the past 6 months that caused them pain or prevented them from performing the test normally were excluded. All subjects gave their informed consent for inclusion before they participated in the study. The study was conducted according to the principles of the Declaration of Helsinki, and the protocol was approved by the Institutional Review Board of Kaohsiung Medical University Chung-Ho Memorial Hospital (KMUHIRB-E(I)-20200146, 06/16/2020).

Experimental Equipment and Instruments
BoostFix 6-axis IMU sensors (Compal Electronics Inc., Taipei City, Taiwan) were used to measure the kinematic data for each limb. The corresponding BoostFix smartphone application was used to calculate the movement angle of each joint. According to the manufacturer's specifications, the BoostFix sensor measurement angle error was less than ±1° [22]. As shown in Figure 1, the 11 sensors were placed on the forehead, chest, sacrum, and the midpoints of the left and right upper arm, thigh, calf, and foot. They were positioned in similar orientations such that all were facing the same general direction in three-dimensional space. Before FMS testing, the participants were asked to hold a standing A-pose to calibrate the sensor orientations (by defining the zero angles for each joint). The raw data were entered into a quaternion algorithm to allow conversion of the relative angle changes captured by the sensors into three-dimensional visualizations of joint motion. Table 1 presents the 31 evaluated joint motions.
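The source does not describe the BoostFix quaternion algorithm itself, so the following is only a minimal sketch of one common way to obtain a calibrated relative angle between two body segments from sensor orientations. It assumes unit quaternions in (w, x, y, z) order; all function names are hypothetical, and the result is the total rotation angle rather than a per-plane decomposition such as those listed in Table 1.

```python
import math

def q_conj(q):
    """Conjugate of a unit quaternion (equals its inverse)."""
    w, x, y, z = q
    return (w, -x, -y, -z)

def q_mul(a, b):
    """Hamilton product a * b of two quaternions (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    )

def relative_rotation(q_proximal, q_distal):
    """Orientation of the distal segment expressed in the proximal segment's frame."""
    return q_mul(q_conj(q_proximal), q_distal)

def rotation_angle_deg(q):
    """Total rotation angle (degrees) encoded by a unit quaternion."""
    w = max(-1.0, min(1.0, abs(q[0])))  # clamp against rounding error
    return math.degrees(2.0 * math.acos(w))

def joint_angle_deg(q_prox, q_dist, q_prox_cal, q_dist_cal):
    """Joint angle relative to the calibration pose (assumed to define zero)."""
    rel_now = relative_rotation(q_prox, q_dist)
    rel_cal = relative_rotation(q_prox_cal, q_dist_cal)
    return rotation_angle_deg(q_mul(q_conj(rel_cal), rel_now))

# Hypothetical example: distal sensor rotated 90 degrees about x since calibration
identity = (1.0, 0.0, 0.0, 0.0)
rot90_x = (math.cos(math.pi / 4), math.sin(math.pi / 4), 0.0, 0.0)
print(joint_angle_deg(identity, rot90_x, identity, identity))
```

Removing the calibration-pose rotation first is what makes the standing pose correspond to a zero joint angle in this sketch.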

FMS
The regular FMS comprises seven tests: deep squat, hurdle step, in-line lunge, shoulder mobility, active straight leg raise, trunk stability pushup, and rotary stability. Each test is scored between 0 and 3 points. A score of 3 means the movement is accomplished without compensatory movement. A score of 2 is given for a movement accomplished with some compensation, and a score of 1 is given for a movement that could not be performed according to the criteria. If pain is noted, the associated test is given a score of 0. Because all the participants were healthy adults, the scores ranged from 1 to 3. Because none of the participants were athletes, no one obtained a score of 3 on the rotary stability test; scores ranged between 1 and 2 points [23]. In addition, the shoulder mobility test was excluded because it was determined to be less suitable for measurement only by IMU sensors [14], as this exercise already involves an objective measurement modality (a ruler) and the scoring standard is related to each person's hand length. Manual scoring was conducted by a trained scorer. Each participant performed the six tests (Figure 2) in a random order. In total, each participant performed the bilateral symmetrical exercises (i.e., the deep squat and the trunk stability pushup tests) twice. Each participant performed the bilateral asymmetrical exercises (i.e., the hurdle step, in-line lunge, active straight leg raise, and rotary stability tests) once on each side of the body; the raw score represents the results for the right and left side, respectively. The hurdle step, in-line lunge, straight leg raise, and rotary stability tests were scored from the side of the moving leg, front leg, side of the moving limb, and side of the upper moving limb, respectively. For simplicity, the results are discussed in terms of the scoring and non-scoring sides. For these bilateral asymmetrical exercises, 31 joint motion predictive variables (Table 1) were evaluated.
For the two tests (i.e., the deep squat and trunk stability pushup tests) that do not have right and left side scores, we analyzed only the 20 joint motions (Table 1) that excluded the left side data and designated the right side as the scoring side. Finally, we used the FMS score and joint angle data from each test for subsequent model building.

Data Acquisition and Preprocessing
The BoostFix application on the Apple iPhone 6s was used for data collection. After each test was complete, the program saved the data to iTunes, from which they were downloaded to a computer for analysis. The data were entered into Microsoft Excel spreadsheets displaying the angular motion of each joint in each plane. The researcher then calculated the ROM of each joint on each test and recorded the manual scores. The minimum and maximum ROM values among all the participants' recordings of the same exercise were also calculated. Prior to modeling, min-max normalization of ROM data was conducted, and the data were scaled to the interval [0, 1].
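The preprocessing described above can be sketched as follows. This is an illustrative example only: the function names and the angle traces are hypothetical, and ROM is assumed here to be the difference between the largest and smallest angle in a recording.

```python
def range_of_motion(angles):
    """ROM of one joint motion in one recording: max angle minus min angle."""
    return max(angles) - min(angles)

def min_max_normalize(values):
    """Scale values to [0, 1] using the min and max across all recordings."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All recordings identical: no spread to scale, return zeros
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical thigh flexion traces (degrees) from three recordings
# of the same exercise
recordings = [
    [0.0, 35.0, 101.0, 40.0, 2.0],
    [1.0, 50.0, 118.0, 60.0, 1.0],
    [0.0, 45.0, 110.0, 30.0, 0.0],
]
roms = [range_of_motion(r) for r in recordings]
scaled = min_max_normalize(roms)
print(roms)    # [101.0, 117.0, 110.0]
print(scaled)  # [0.0, 1.0, 0.5625]
```

Normalizing per joint motion and per exercise keeps joints with large absolute ranges from dominating joints with small ranges in the classifier input.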

Feature Selection and Modeling
The ensemble machine learning algorithm AdaBoost.M1 was used to construct multiclass classifiers with either the full set of features or the reduced set of best predictors. Tenfold cross-validation was applied to the models (k = 10, mfinal = 50, coeflearn = Breiman, and maxdepth = 5) to improve reliability of model selection. To model the classification system, a training dataset of input-output pairs of main-joint ROM values and the manual scores was collected. Initially, nine folds were used for training and one fold was used for validation. The number of cross-validation folds was changed over each of 10 repetitions. This repeated cross-validation was used to estimate any error due to data partitioning. An automatic scoring system was then constructed and used to predict the most appropriate score for each test. To reduce the number of input variables, we used ordinal logistic regression for subset feature selection. For each test, up to five statistically significant predictors (p < 0.05) were selected for the model. Feature selection and modeling were performed using RStudio Cloud (Version 4.0.2, RStudio, Boston, MA, USA, 2020).
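To make the boosting step more concrete, the following is a simplified sketch of AdaBoost.M1 with the Breiman-style coefficient alpha = 0.5 ln((1 − err)/err). It is not the implementation used in the study, which was written in R: the weak learner here is a one-split decision stump rather than a tree of depth 5, cross-validation is omitted, and the helper names and toy data are hypothetical.

```python
import math
from collections import defaultdict

def weighted_majority(labels, weights):
    """Label with the largest total weight."""
    totals = defaultdict(float)
    for label, w in zip(labels, weights):
        totals[label] += w
    return max(totals, key=totals.get)

def fit_stump(X, y, w):
    """Find (feature, threshold, left_label, right_label) with lowest weighted error."""
    best, best_err = None, float("inf")
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            left = [i for i, row in enumerate(X) if row[j] <= thr]
            right = [i for i, row in enumerate(X) if row[j] > thr]
            if not left or not right:
                continue
            ll = weighted_majority([y[i] for i in left], [w[i] for i in left])
            rl = weighted_majority([y[i] for i in right], [w[i] for i in right])
            err = (sum(w[i] for i in left if y[i] != ll)
                   + sum(w[i] for i in right if y[i] != rl))
            if err < best_err:
                best, best_err = (j, thr, ll, rl), err
    return best

def stump_predict(stump, row):
    j, thr, ll, rl = stump
    return ll if row[j] <= thr else rl

def adaboost_m1(X, y, mfinal=50):
    """Return a list of (alpha, stump) pairs; mfinal is the number of rounds."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(mfinal):
        stump = fit_stump(X, y, w)
        if stump is None:
            break
        miss = [stump_predict(stump, row) != yi for row, yi in zip(X, y)]
        err = sum(wi for wi, m in zip(w, miss) if m)
        if err >= 0.5:
            break  # simplification: stop when the weak learner is too weak
        alpha = 0.5 * math.log((1.0 - max(err, 1e-10)) / max(err, 1e-10))
        ensemble.append((alpha, stump))
        if not any(miss):
            break  # perfect weak learner, nothing left to reweight
        # Increase the weight of misclassified samples, then renormalize
        w = [wi * math.exp(alpha) if m else wi for wi, m in zip(w, miss)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Weighted vote of all weak learners."""
    votes = defaultdict(float)
    for alpha, stump in ensemble:
        votes[stump_predict(stump, row)] += alpha
    return max(votes, key=votes.get)

# Hypothetical toy data: one normalized ROM feature, FMS scores 1-3
X_toy = [[0.05], [0.10], [0.20], [0.45], [0.50], [0.55], [0.80], [0.90], [0.95]]
y_toy = [1, 1, 1, 2, 2, 2, 3, 3, 3]
model = adaboost_m1(X_toy, y_toy)
print([predict(model, row) for row in X_toy])
```

A single stump cannot separate three classes, but the weighted vote over several reweighted stumps can, which is the point of boosting.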

Model Evaluation
The general confusion matrix for binary classification models was not applicable to most of the tests because the dependent variable was divided into three categories. The methods used to calculate the models' accuracy, recall rate, precision, and F scores are presented in Table 2 and Equations (1)-(4). In Table 2, the actual and predicted classes represent the manual scores and the scores estimated by the model from the joint movement angles, respectively. After the multijoint angle values were entered, the algorithm yielded the classification results (scored as 1, 2, or 3). In good classifiers, these parameters should be close to 100%. To simplify the data display, only the accuracy and F scores (the latter calculated from recall rate and precision) are shown. For the rotary stability test, because the dependent variable was divided into only two categories, the general confusion matrix for binary classification was used.
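Table 2 and Equations (1)-(4) are not reproduced in this text, so the sketch below uses the standard per-class definitions of precision, recall, and F score together with overall accuracy; these are assumed to match the paper's equations. The function names and the score lists are hypothetical.

```python
def confusion_matrix(actual, predicted, classes):
    """Rows are actual (manual) scores, columns are predicted (automatic) scores."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Share of all samples on the diagonal."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total

def per_class_metrics(matrix, k):
    """Precision, recall, and F score for class index k (one-vs-rest)."""
    tp = matrix[k][k]
    fn = sum(matrix[k]) - tp
    fp = sum(row[k] for row in matrix) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score

# Hypothetical manual and automatic scores for ten recordings
manual = [1, 1, 2, 2, 2, 3, 3, 1, 2, 3]
auto = [1, 2, 2, 2, 2, 3, 2, 1, 2, 2]
cm = confusion_matrix(manual, auto, classes=[1, 2, 3])
print(cm)                         # [[2, 1, 0], [0, 4, 0], [0, 2, 1]]
print(accuracy(cm))               # 0.7
print(per_class_metrics(cm, 2))   # metrics for the 3-point class
```

Under these definitions, a class with no correctly predicted samples receives an F score of 0, which is how a sparsely populated score group can show an F score of 0 even when overall accuracy is acceptable.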

Statistical Analysis
The consistency or level of agreement between automatic and manual scoring was assessed using a weighted kappa statistic, whose values range between −1 and 1. Kappa values of 1 and −1 represent complete agreement and complete disagreement, respectively. The level of agreement was further evaluated using the scales developed by Fleiss et al. [16]. Kappa values of ≤0.20, 0.21-0.40, 0.41-0.60, 0.61-0.80, and >0.80 were defined as poor, fair, moderate, good, and very good agreement, respectively.
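As an illustration of the statistic, the sketch below computes a weighted kappa from two lists of scores and maps it to the agreement labels given above. The weighting scheme is not stated in the text, so linear disagreement weights are an assumption (setting power=2 gives quadratic weights); the function names and example scores are hypothetical.

```python
def weighted_kappa(rater_a, rater_b, categories, power=1):
    """Weighted kappa with disagreement weights (|i - j| / (k - 1)) ** power."""
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[idx[a]][idx[b]] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    num = den = 0.0
    for i in range(k):
        for j in range(k):
            w = (abs(i - j) / (k - 1)) ** power
            num += w * observed[i][j]       # observed weighted disagreement
            den += w * row[i] * col[j]      # disagreement expected by chance
    # Note: den is zero if both raters use a single identical category
    return 1.0 - num / den

def agreement_level(kappa):
    """Labels from the scale in the text; 0.20 itself is treated as poor (assumption)."""
    if kappa <= 0.20:
        return "poor"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "good"
    return "very good"

# Hypothetical manual and automatic scores for ten recordings
manual = [1, 1, 2, 2, 2, 3, 3, 1, 2, 3]
auto = [1, 2, 2, 2, 2, 3, 2, 1, 2, 2]
kw = weighted_kappa(manual, auto, categories=[1, 2, 3])
print(round(kw, 3), agreement_level(kw))  # 0.583 moderate
```

With ordinal scores, weighting penalizes a 1-versus-3 disagreement more heavily than a 1-versus-2 disagreement, which an unweighted kappa would treat identically.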

Results
The average scores for the deep squat, hurdle step, in-line lunge, active straight leg raise, trunk stability pushup, and rotary stability tests, as graded by the professional athletic trainer and the automatic grading system, were 1.83 ± 0.85 and 1.79 ± 0.85, 1.81 ± 0.55 and 1.93 ± 0.43, 2.03 ± 0.48 and 1.96 ± 0.32, 2.33 ± 0.74 and 2.44 ± 0.69, 1.80 ± 0.93 and 1.86 ± 0.95, and 1.61 ± 0.49 and 1.69 ± 0.47 points, respectively (Table 3). The differences between the mean manual and automatic scores were well within 1 point, ranging from 0.04 to 0.12. Table 3 also shows the number and classification rate for each class.
Table 3. The average scores of the six functional movement screen (FMS) tests graded by the professional athletic trainer and automatic grading system (mean ± SD) and the number and classification rate in each class.
As demonstrated in Table 4, the Nagelkerke R² values from the ordinal logistic regression analysis for the deep squat, hurdle step, in-line lunge, active straight leg raise, trunk stability pushup, and rotary stability tests were 82.5%, 18.9%, 44.1%, 78.6%, 40.1%, and 34.3%, respectively. Higher R² values indicate better goodness of fit. The deep squat and hurdle step tests had the highest and lowest R² values and were thus the best and worst models, respectively. In addition, the regression models, each of which contained 1-4 variables, determined the best predictor variables for each exercise. Table 4 presents the given subsets of predictor variables in each model.
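For reference, Nagelkerke's R² rescales the Cox and Snell R² so that its maximum is 1. The sketch below computes it from the log-likelihoods of the null (intercept-only) and fitted models; the function name and the input values are hypothetical and are not taken from the study.

```python
import math

def nagelkerke_r2(loglik_null, loglik_model, n):
    """Nagelkerke R^2 from the null and fitted model log-likelihoods.

    n is the number of observations used to fit the models.
    """
    # Cox and Snell R^2: 1 - (L_null / L_model) ** (2 / n)
    cox_snell = 1.0 - math.exp((2.0 / n) * (loglik_null - loglik_model))
    # Largest value Cox and Snell R^2 can reach: 1 - L_null ** (2 / n)
    max_cox_snell = 1.0 - math.exp((2.0 / n) * loglik_null)
    return cox_snell / max_cox_snell

# Hypothetical log-likelihoods for a model fitted to 70 recordings
print(nagelkerke_r2(loglik_null=-70.0, loglik_model=-40.0, n=70))
```

A model that does not improve on the null model yields 0, and a model that fits the data perfectly (log-likelihood of 0) yields 1.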
For the deep squat, hurdle step, in-line lunge, active straight leg raise, trunk stability pushup, and rotary stability tests, the best classifiers after feature selection were trained using the reduced feature sets of the selected M = 4 joint angles (shoulder horizontal abduction (S-SHAB), pelvic tilt (PT), scoring side thigh flexion (S-ThF), and trunk rotation (TR); Table 5), the M = 1 joint angle (head rotation (HR)), the M = 1 joint angle (trunk flexion (TF)), the M = 2 joint angles (S-ThF and PT), the M = 2 joint angles (TF and scoring side shoulder rotation (S-SR)), and the M = 3 joint angles (non-scoring side shoulder rotation (NS-SR), scoring side shoulder flexion (S-SF), and S-ThF), respectively. ROM parameters presented in bold in Table 4 have negative regression coefficients, that is, they are negatively correlated with the score. A negative coefficient suggests that as the value of the independent variable increases, the dependent variable tends to decrease. A positive coefficient indicates that as the value of the independent variable increases, the dependent variable tends to increase. Table 4 also presents the stepwise regression analysis of the ROM data (means ± standard deviations) for all participants who scored 1, 2, or 3 points on each test. As Table 5 shows, the best and worst classification results before feature selection were approximately 91% and 66% in the trunk stability pushup and active straight leg raise tests, respectively. The best and worst classification results after feature selection were approximately 79% and 57% in the in-line lunge and active straight leg raise tests, respectively. Following feature selection, the accuracy of the predictive model was reduced by 2%-20%.
The degree of agreement of the deep squat, active straight leg raise, and trunk stability pushup tests dropped from good to moderate, moderate to fair, and very good to moderate, respectively. That of the hurdle step, in-line lunge, and rotary stability tests did not change. Poor classification accuracy was attained in certain score groups with fewer samples; for example, an F score of 0 was noted in the 3-point group for the hurdle step test. Only five participants (7.14%, Table 3) achieved a score of 3 on the hurdle step test, constituting a much lower proportion than that of the other score groups.

Discussion
Overall, the prediction accuracy (Table 5) was between 57% and 79% for the reduced feature set, compared with 66%-91% when the full feature set was used. This demonstrates that higher prediction accuracy was attained with the full feature set. The prediction accuracy for all tests decreased by 2%-20% when the reduced feature set was used.

In-Line Lunge
In the in-line lunge test, the accuracy achieved by using only one parameter (0.79) was close to that achieved by using all of them (0.81; Table 5). TF (Table 4) is a key parameter for accurately assessing whether the execution of this exercise meets the specified criteria. Excessive forward-backward trunk tilt leads to a larger ROM (41.62°) and indicates poor trunk stability (Table 4). We observed that movement quality in this test could be monitored using only one sensor, which provides convenience and reduces testing time considerably in real-world application.

Deep Squat and Trunk Stability Pushup
The highest accuracy was found for the symmetrical exercises, namely the deep squat and trunk stability pushup tests, with accuracy as high as 0.87 and 0.91, respectively, when all parameters were used. The FMS essentially identifies imbalances in stability and mobility in fundamental movement patterns. After screening by regression analysis, both stability and mobility factors were retained for these two tests. In the deep squat test, the stability factors were S-SHAB, PT, and TR, and the mobility factor was S-ThF. Excessive range of S-SHAB, PT, and TR indicated instability, whereas insufficient S-ThF indicated insufficient mobility, which is consistent with the findings of a past study [24] in which the final thigh position was the completion index of the deep squat. For example, 3-point scorers had a greater average thigh flexion (118.5°) than did 1-point scorers (101.34°; Table 4), indicating that they could perform deeper squats. The stability and mobility factors in the trunk stability pushup test were TF and S-SR, respectively. Good performance involves lifting the body as a unit with no lag in the spine; therefore, excessive trunk mobility, including excessive TF, indicates poor posture maintenance. However, some participants could not perform a complete pushup because of insufficient shoulder rotation mobility. One-point scorers (i.e., poor performers) had a smaller average shoulder rotation ROM than did 3-point scorers (63.37° vs. 88.73°; Table 4).

Hurdle Step and Active Straight Leg Raise
Head rotation (HR) was the parameter selected for the hurdle step test, and S-ThF and PT were selected for the active straight leg raise test (Table 4). The main problem that must be considered for the hurdle step test is its high difficulty. Poor stability of the stance leg may increase the difficulty of performing the exercise. Greater HR may be attributable to loss of stability during the exercise, causing the head (the body part farthest from the supporting foot) to shake visibly. In the active straight leg raise test, S-ThF is a key parameter. Whiteside et al. (2014) also demonstrated that peak hip flexion angle is a more sensitive indicator of flexibility in this test [14]. This finding, together with the test instruction to raise the scoring leg as high as possible, is consistent with the results of the present study. In addition, an excessive PT range may reflect compensatory pelvic movement used to achieve a greater leg raise range.

Rotary Stability
The parameters selected for the rotary stability test were NS-SR and S-SF as the stability factors and S-ThF as the mobility factor. The test assesses multiplane stability of the pelvis, core, and shoulder girdle during performance of a combined upper- and lower-extremity movement. It represents the coordination of stability and mobility. To obtain a score of 2, the individual must flex the shoulder while extending the opposite-side hip and knee and then bring the elbow to the knee while maintaining spinal alignment with the board on the ground. Our results indicate that in the rotary stability test, 1-point scorers demonstrate obvious instability of the supporting hand on the nondominant side and more shaking of the raised hand on the dominant side, both of which involve larger shoulder ranges of motion. In contrast, instability of the supporting hand would restrict the tested leg ROM.

Consistency between Automatic and Manual Scoring
The kappa values between manually and automatically assigned scores for the six tests were computed in this study. A previous study showed only poor to moderate intermethod agreement (kappa coefficients 0.05 and 0.52) [14]. That sensor-based semiautomatic system used manually set kinematic thresholds corresponding to the FMS grading criteria. In the present study, the kappa values before feature selection for the trunk stability pushup, deep squat, active straight leg raise, in-line lunge and rotary stability, and hurdle step tests displayed very good, good, moderate, fair, and poor agreement (kw = 0.85, 0.80, 0.42, 0.37 and 0.34, and 0.18), respectively. The overall kappa values were between 0.18 and 0.85, suggesting that the machine learning method used to model the kinematic thresholds was more accurate than manual threshold setting. Although we present suitable parameters, because of the large variations in body movement, we recommend using full-body IMU sensors for FMS assessment to obtain the most accurate evaluation.

Limitation
The test reliability and validity of the FMS and its prediction accuracy for future injury risk remain controversial. Nevertheless, as a movement-based diagnostic system, its theoretical relevance for quantifying movement control ability is indisputable. In view of the possibility of erroneous data labeling, we suggest using unsupervised machine learning (i.e., unlabeled data) for motor skill classification. This approach does not require manual score labeling in advance and allows the model to work independently to discover previously undetected information and data patterns. Thus, the approach maximizes the advantages of AI and helps realize the goal of implementing AI in healthcare.

Conclusions
An IMU sensor-based system can potentially be applied to the automatic screening of functional movement. In this study, the results indicate that higher prediction accuracy was achieved using the full feature set than using the reduced feature set. Future studies should collect more data and improve machine learning performance to attain more accurate prediction results.

Informed Consent Statement:
Informed consent was obtained from all subjects involved in the study.

Data Availability Statement:
The data presented in this study are available on request from the corresponding author.

Conflicts of Interest:
The authors declare no conflict of interest.