Reliability and Usability Analysis of an Embedded System Capable of Evaluating Balance in Elderly Populations Based on a Modified Wii Balance Board

This paper analyzes the reliability and usability of a portable electronic instrument that measures balance and balance impairment in older adults. The center of pressure (CoP) metrics are measured with a modified Wii Balance Board (mWBB) platform. In the intra- and inter-rater testing, 16 and 43 volunteers (mean 75.66 and standard deviation (SD) of 7.86 years and 72.61 (SD 7.86) years, respectively) collaborated. Five volunteer raters (5.1 (SD 3.69) years of experience) answered the System Usability Scale (SUS). The most reliable CoP index in the intra-examiner tests was the 95% power frequency in the medial-lateral displacement of the CoP with closed-eyes. It had excellent reliability with an intraclass correlation coefficient ICC = 0.948 (C.I. 0.862–0.982) and a Pearson’s correlation coefficient PCC = 0.966 (p < 0.001). The best index for the inter-rater reliability was the centroidal frequency in the anterior-posterior direction closed-eyes, which had an ICC (2,1) = 0.825. The mWBB also obtained a high usability score. These results support the mWBB as a reliable complementary tool for measuring balance in older adults. Additionally, it does not have the limitations of laboratory-grade systems and clinical screening instruments.


Introduction
Human balance is a complex ability to achieve postural stability, which counteracts the inherently unstable perturbations and body sways induced by the gravitational effect [1]. An efficient balance control depends on the visual, vestibular, somatosensory, muscular, and nervous systems. Assessing human balance helps evaluate the integrity of these systems. In this regard, it is well known that the aging process involves a reduction in physiological capacities and balance [2]. These conditions usually lead to falls which directly and negatively impact older adults' quality of life [3]. According to the World Health Organization (WHO), 28-35% of the elderly population (above 65 years old) fall each year, reaching 32-42% for adults over 70 years old. This means that the frequency of falls increases with age and frailty level [4]. The relevance of this public health problem is remarkable due to the accelerated growth of the world population of older adults, the intrinsic and extrinsic multifactorial nature of falls [5], and the negative economic impact of attending to the problem, both personally and for governmental health institutions and systems [6,7]. Thus, the correct and timely diagnosis regarding balance anomalies can lead to clinical actions to avoid their impact.

Study Population
Participants were recruited voluntarily from different nursing homes, universities, and neighborhoods of the cities of Toluca, Metepec, and Villa Guerrero in the State of Mexico, Mexico. Persons eligible to participate were those aged 65 years and over who could stand for at least 2 min, even if using assistive devices. Individuals who had consumed alcoholic beverages or coffee in the previous 24 h or could not complete the physical performance tests (described below) were excluded. The mWBB raters were invited through an open call at the School of Medicine and the School of Nursing and Obstetrics of the Autonomous University of the State of Mexico. All raters were either undergraduate students pursuing a bachelor's degree or held a higher degree in gerontology, physical therapy, nursing, or geriatrics, and had over one year of experience in geriatric care and management.

Variables
A total of 78 CoP indices (39 with open-eyes and 39 with closed-eyes) previously described [36] were estimated using the mWBB. Table A1 contains the description of the CoP indices used in this study. For this purpose, subjects were placed on the platform surface with their feet together (closely positioned, side by side, and no opening angle), barefoot, assuming the most upright posture possible, with the arms crossed over the chest [37]. Individuals were asked to focus on a fixed point in front of them, located half a meter away and at a height of 1.5 m above the ground. Participants stood on the mWBB; after a 5-s countdown, the device automatically recorded the CoP data for one minute. Immediately afterwards, prompted by an auditory stimulus, the subjects closed their eyes and another minute was recorded. The test was carried out once. The CoP trajectory data were recorded at a stable sampling rate of 50 Hz, with a resolution of 1/100th of a millimeter, and saved to a MicroSD card.
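To illustrate how indices of this kind are derived from a recorded trajectory, the following is a minimal sketch of three time-domain measures (range, mean velocity, and RMS distance) for a single CoP direction sampled at 50 Hz. The definitions used here (mean-centered series, path length divided by record duration) are common formulations and are an assumption; the function names and the synthetic signal are hypothetical, and the 78 indices actually used are those described in Table A1 and [36].

```python
import math

FS_HZ = 50.0  # sampling rate reported for the mWBB


def center(series):
    """Subtract the mean so positions are referenced to the mean CoP."""
    mean = sum(series) / len(series)
    return [x - mean for x in series]


def cop_range(series):
    """Range: distance between the two extreme positions (input units)."""
    return max(series) - min(series)


def mean_velocity(series, fs_hz=FS_HZ):
    """Mean velocity: total excursion divided by the record duration."""
    path = sum(abs(b - a) for a, b in zip(series, series[1:]))
    duration_s = len(series) / fs_hz
    return path / duration_s


def rms_distance(series):
    """Root-mean-square distance from the mean CoP."""
    centered = center(series)
    return math.sqrt(sum(x * x for x in centered) / len(centered))


# Synthetic record (assumption): 2400 samples (48 s) of a 0.5 Hz sway, 5 mm amplitude
ml_mm = [5.0 * math.sin(2 * math.pi * 0.5 * i / FS_HZ) for i in range(2400)]

print("range (mm):", cop_range(ml_mm))
print("mean velocity (mm/s):", mean_velocity(ml_mm))
print("RMS distance (mm):", rms_distance(ml_mm))
```

For this synthetic sine wave the three values come out close to 10 mm, 10 mm/s, and 5/√2 mm, which gives a quick sanity check of the definitions before applying them to real recordings.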
Age in years was used as a continuous variable and sex as a dichotomous variable (woman/man) to describe the sample. Anthropometry (height in cm and weight in kg) was determined following validated methodology and by standardized personnel.
Gait was assessed by the time in seconds taken to complete the Timed Up and Go (TUG) test [38]. A gait deficit (yes/no) was defined as taking 12 s or more to complete the test. Leg strength was assessed by the number of full stands achieved when performing the 30-s Chair Stand test, and leg strength deficit (yes/no) was adjusted by sex and age [39]. Balance was assessed with the 4-Stage Balance Test; a balance deficit (yes/no) was present if the individual could not hold the feet-together, semi-tandem, and tandem positions for ten seconds without moving the feet or needing support, or could not maintain the one-legged stance for five seconds [40].
The use of gait assistive devices, the presence of lower limb prostheses, complete or partial visual and hearing impairments, diagnosis of diabetes or hypertension, fear of falling (FES-I score ≥ 23 [41]), and whether the participants had fallen in the year before the study (yes/no) were also analyzed.
The usability of the mWBB was assessed with a custom System Usability Scale (SUS) questionnaire (see Table A2). It has a continuous scale ranging from 0 to 100, administered to all raters immediately upon completion of the reliability tests. The age of the raters, years of experience in geriatric care and management, profile, and score of the SUS test were also recorded.

Reliability
All raters gave standardized instructions to the participants on each trial for the reliability tests. Intra-rater reliability (also known as test-retest reliability) consisted of the same examiner applying the balance test to the same participants twice, on different days, in the same room. Based on a previous systematic review [19], the time between the test and retest in the present study was kept as close as possible to 48 h. For inter-rater reliability, several examiners applied the balance test to the same participants. Each rater performed one test, with intervals between raters as close as possible to 48 h, in the same room, and the order of raters was randomized [19].

Statistical Analysis
A descriptive analysis of the sample characteristics, the 78 CoP indices for the reliability tests, the characteristics of the raters, and the results of the usability questionnaire was performed. Continuous variables were represented using means and standard deviations (SD), and categorical variables were expressed as numbers and percentages. The normality of the continuous variables was assessed using a Shapiro-Wilk test with α = 0.05. Comparisons of individuals included in the intra-rater and inter-rater tests were estimated through a Wilcoxon test for continuous variables, and a χ² test for categorical variables.
For the intra-rater reliability tests, comparisons of the 78 CoP indices of the test vs. the retest were performed using a t-test for dependent variables for indices with normal distribution. A Wilcoxon test was used for non-parametric indices. To measure the test-retest reliability of the normally distributed CoP indices, Pearson's correlation coefficient (PCC) and the intraclass correlation coefficient (ICC) with 95% confidence intervals, based on a single rater/measurement, absolute agreement, and a two-way mixed effects model [42], were estimated. For those CoP indices that were not normally distributed, Spearman's correlation coefficient (SCC) and the ICC with 95% confidence intervals were estimated by the bootstrap technique.
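The distinction between the PCC and an absolute-agreement ICC can be made concrete with a small sketch. It assumes the usual mean-squares expression for a single-measurement, absolute-agreement, two-way ICC, namely (MSR − MSE) / (MSR + (k − 1)·MSE + k·(MSC − MSE)/n), where MSR, MSC, and MSE are the between-subject, between-session, and residual mean squares. The function names and the toy data are hypothetical; the study itself used SPSS and Stata.

```python
import math


def icc_absolute_single(scores):
    """Single-measurement, absolute-agreement, two-way ICC.

    scores: list of rows, one row per subject, one column per session/rater.
    """
    n = len(scores)
    k = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between subjects
    msc = ss_cols / (k - 1)              # between sessions or raters
    mse = ss_err / ((n - 1) * (k - 1))   # residual
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def pearson(x, y):
    """Pearson's correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Toy data (assumption): the retest is the test shifted by a constant offset
test = [1.0, 2.0, 3.0, 4.0, 5.0]
retest = [2.0, 3.0, 4.0, 5.0, 6.0]

print("PCC:", pearson(test, retest))
print("ICC:", icc_absolute_single([list(p) for p in zip(test, retest)]))
```

With a constant offset between sessions the PCC is 1 while the absolute-agreement ICC is about 0.83, which shows why the two coefficients are reported together: the correlation ignores a systematic shift that the ICC penalizes.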
For the inter-rater reliability, Mauchly's W test was used to check sphericity. Comparisons of the 78 CoP indices among raters were performed using a Friedman test for non-parametric indices. For normally distributed indices with homogeneity of variances, a dependent variables one-way ANOVA test was used with the Pillai trace statistic. For metrics with normal distribution and heterogeneity of variances, a dependent variables one-way ANOVA test was performed with the Greenhouse-Geisser statistic. To measure the test reliability of the normally distributed CoP indices, the PCC, the intraclass correlation coefficient (ICC (2,1)), and their 95% confidence intervals based on a single measurement, absolute agreement, and a two-way random effects model [29] were estimated. For those CoP indices that were not normally distributed, the SCC, ICC (2,1), and their 95% confidence intervals were calculated by the bootstrap technique.
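The bootstrap intervals used for the non-normally distributed indices can be sketched as follows. This is a percentile bootstrap for Spearman's coefficient that resamples subjects with replacement; the number of resamples, the seed, the function names, and the data are assumptions made for illustration, and the source does not state which bootstrap variant Stata was configured to use.

```python
import math
import random


def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's coefficient: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_ci(x, y, stat=spearman, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI, resampling subjects (pairs) with replacement."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    while len(estimates) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        xs, ys = [x[i] for i in idx], [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # degenerate resample: correlation undefined
        estimates.append(stat(xs, ys))
    estimates.sort()
    tail = (1 - level) / 2
    lo = estimates[int(tail * n_boot)]
    hi = estimates[min(n_boot - 1, int((1 - tail) * n_boot))]
    return lo, hi


# Hypothetical values of one CoP index measured by two raters on ten subjects
rater_a = [12.1, 9.8, 15.3, 11.0, 13.7, 8.9, 14.2, 10.5, 16.0, 12.9]
rater_b = [11.8, 10.1, 14.9, 11.4, 13.1, 9.5, 12.7, 10.2, 15.5, 12.5]

rho = spearman(rater_a, rater_b)
ci_low, ci_high = bootstrap_ci(rater_a, rater_b)
print("SCC:", rho, "95% CI:", (ci_low, ci_high))
```

The same `bootstrap_ci` helper accepts any two-argument statistic through `stat`, so an ICC function could be passed instead of `spearman` under the same resampling scheme.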
For the usability tests, the correlation between age and years of experience in geriatric care and management versus SUS scores was calculated by the SCC. For the estimation of the degree of usability, SUS scores between 50 and 70 indicate deficient usability, SUS scores above 70 indicate acceptable usability, and values above 90 indicate excellent usability [43,44].
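As an illustration of how a SUS total maps onto these bands, the sketch below assumes the standard ten-item SUS scoring (odd items contribute the response minus 1, even items contribute 5 minus the response, and the sum is multiplied by 2.5). Because the questionnaire in this study is described as custom, that scoring rule is an assumption, as are the function names and the handling of scores below 50 or exactly on a boundary, which the text does not specify.

```python
def sus_score(responses):
    """Standard SUS scoring for ten responses on a 1-5 scale (assumed here)."""
    if len(responses) != 10:
        raise ValueError("SUS expects exactly 10 responses")
    total = 0
    for item, response in enumerate(responses, start=1):
        # Odd items are positively worded, even items negatively worded
        total += (response - 1) if item % 2 == 1 else (5 - response)
    return total * 2.5


def sus_category(score):
    """Usability band using the cut-offs quoted in the text."""
    if score > 90:
        return "excellent"
    if score > 70:
        return "acceptable"
    if score >= 50:
        return "deficient"
    return "unspecified"  # the text gives no label below 50


print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # best possible answers
print(sus_category(92.5))                          # mean score reported for the mWBB
```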
For the reliability tests, it was assumed that 95% confidence interval limits of the ICC below 0.5 indicate poor reliability, values between 0.5 and 0.75 indicate moderate reliability, values between 0.75 and 0.9 indicate good reliability, and values above 0.90 indicate excellent reliability [42].
For Spearman and Pearson correlation coefficients, it was assumed that values between 0.90 and 1.00 indicate very high correlation, values between 0.70 and 0.90 high correlation, values between 0.50 and 0.70 moderate correlation, values between 0.30 and 0.50 low correlation, and values between 0.00 and 0.30 indicate insignificant correlation [45].
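The two interpretation scales can be written as simple lookup functions. This is a sketch only: the function names are hypothetical, the treatment of values lying exactly on a boundary is an assumption, and applying the correlation scale to the magnitude of the coefficient is also an assumption, since the text lists only non-negative ranges.

```python
def interpret_icc(value):
    """Reliability band for an ICC estimate or confidence limit [42]."""
    if value < 0.5:
        return "poor"
    if value < 0.75:
        return "moderate"
    if value <= 0.90:
        return "good"
    return "excellent"


def interpret_correlation(value):
    """Correlation band for a Pearson or Spearman coefficient [45]."""
    magnitude = abs(value)
    if magnitude >= 0.90:
        return "very high"
    if magnitude >= 0.70:
        return "high"
    if magnitude >= 0.50:
        return "moderate"
    if magnitude >= 0.30:
        return "low"
    return "insignificant"


# Values reported in the abstract for the best intra-rater index
print(interpret_icc(0.948), interpret_icc(0.862), interpret_icc(0.982))
print(interpret_correlation(0.966))
```

Applied to the values in the abstract, the point estimate (0.948) and the upper limit (0.982) fall in the excellent band, while the lower confidence limit (0.862) falls in the good band.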
The discrimination accuracy of presenting a balance deficit for the 78 CoP indices was assessed using the Hosmer-Lemeshow Goodness of Fit test and the area under the receiver-operating characteristic curve (AUC). The optimal cut-off points were obtained for the indices with the higher AUC that best distinguished between people with and without a balance deficit based on Youden's statistic. The accuracy of the classification was evaluated with the AUC, sensitivity, and specificity.
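A minimal sketch of the cut-off selection follows. It computes the AUC as the proportion of deficit/no-deficit pairs ranked correctly and picks the threshold that maximizes Youden's statistic J = sensitivity + specificity − 1. It assumes that larger index values indicate a deficit and that a subject is classified as positive when the index is at or above the threshold; the function names and data are hypothetical, and the Hosmer-Lemeshow test is not covered.

```python
def auc(scores, labels):
    """AUC as the fraction of (deficit, no-deficit) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def youden_cutoff(scores, labels):
    """Return (J, threshold, sensitivity, specificity) maximizing Youden's J."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1
        if best is None or j > best[0]:
            best = (j, t, sens, spec)
    return best


# Hypothetical index values; label 1 = balance deficit, 0 = no deficit
scores = [1.0, 1.2, 1.5, 2.0, 2.2, 2.8, 3.0, 3.5, 4.0]
labels = [0, 0, 0, 1, 0, 1, 1, 1, 1]

j, threshold, sensitivity, specificity = youden_cutoff(scores, labels)
print("AUC:", auc(scores, labels))
print("cut-off:", threshold, "sens:", sensitivity, "spec:", specificity)
```

When several thresholds tie on J, this sketch keeps the lowest one; a different tie-breaking rule would be equally defensible.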
All statistical tests were performed with α = 0.05 using IBM SPSS Statistics (version 26.0, Armonk, NY, USA), except for the bootstrap technique and the discrimination accuracy run in Stata Statistical Software (version 15, College Station, TX, USA).

Sample Size Calculation
The sample size for the test-retest reliability was calculated using the correlation coefficient formula (Equation (1)) [46], where n_TRT is the sample size for the test-retest reliability; z_α = 1.64, assuming a 95% confidence level; z_β = 1.44, assuming a β error of 0.075; and r = 0.70 is the expected correlation coefficient.
This calculation resulted in 16 participants needed to achieve the desired correlation coefficient.
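Equation (1) is not reproduced in this text. As a plausibility check, the sketch below assumes the widely used sample-size formula for a correlation coefficient based on the Fisher z-transformation, n = ((z_α + z_β) / C)² + 3 with C = 0.5·ln((1 + r)/(1 − r)). Whether this is exactly the expression given in [46] is an assumption, although it reproduces the reported figure with the stated parameters. The function name is hypothetical.

```python
import math


def n_test_retest(r, z_alpha, z_beta):
    """Sample size to detect a correlation r (Fisher z approximation)."""
    c = 0.5 * math.log((1 + r) / (1 - r))  # Fisher z-transform of r
    return ((z_alpha + z_beta) / c) ** 2 + 3


n_raw = n_test_retest(r=0.70, z_alpha=1.64, z_beta=1.44)
print(n_raw, math.ceil(n_raw))  # raw value, then rounded up to whole participants
```

With these inputs the raw value is approximately 15.6, which rounds up to the 16 participants stated in the text.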
For the estimation of inter-rater reliability, it is necessary to establish the rho (ρ) level, the proportion of variation between subjects in relation to the total variation [47]. The sample size can be calculated by using Equation (2), where n_IR is the sample size for the inter-rater reliability; z_α/2 = 1.96, assuming a 95% confidence level; ρ = 0.70 is the expected correlation coefficient; ω = 0.25 is the width of the confidence interval; and n = 3 is the number of examiners.
This formula resulted in 43 participants needed to achieve the desired correlation coefficient.
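Equation (2) is likewise missing from this text. The sketch below assumes Bonett's approximation for the number of subjects needed to estimate an ICC with a confidence interval of width ω, n = 8·z²·(1 − ρ)²·(1 + (k − 1)ρ)² / (k·(k − 1)·ω²) + 1, where k is the number of examiners. Treating this as the expression behind Equation (2) is an assumption; it is used here because it is consistent with the stated parameters and the reported result. The function name is hypothetical.

```python
def n_inter_rater(rho, width, k, z=1.96):
    """Subjects needed for an ICC confidence interval of the given width."""
    numerator = 8 * z ** 2 * (1 - rho) ** 2 * (1 + (k - 1) * rho) ** 2
    denominator = k * (k - 1) * width ** 2
    return numerator / denominator + 1


n_raw = n_inter_rater(rho=0.70, width=0.25, k=3)
print(n_raw, round(n_raw))  # raw value, then nearest whole number
```

The raw value is approximately 43.5, which matches the 43 participants in the text when rounded to the nearest integer; rounding up instead would give 44.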

Results
In total, 19 individuals aged 65 and older took part in the intra-rater reliability tests. One participant dropped out of the study, and two were excluded because they could not complete the physical performance tests. Therefore, 16 individuals were included in the test-retest reliability analysis. The mean age of these participants was 75.7 (SD 7.6) years and 56.3% of the sample were women. In total, 13 (81.3%) of all individuals presented gait and balance deficits, and 4 (25%) used assistive gait devices and lower limb prostheses. A total of 3 participants (18.3%) reported visual and hearing impairments. The complete sample had a leg strength deficit. Diabetes was present in 37.5%, fear of falling in 56.3%, and 56.3% of the individuals suffered a fall in the previous year.
Of the 46 individuals who participated in the inter-rater reliability tests, 2 dropped out of the study, and 1 did not complete the physical performance tests. Thus, 43 individuals, of whom 19 (44.2%) were women, were included in the inter-rater analysis. The mean age was 72.6 (SD 7.9) years. In total, 27 individuals (62.8%) presented a gait deficit, 30 (69.8%) a balance deficit, 27 (62.8%) had a fear of falling, 21 (48.8%) reported having suffered a fall in the previous year, and 12 (27.9%) were diagnosed with hypertension. All participants had a leg-strength deficit. No significant difference was found between people who participated in the intra-rater tests and individuals who participated in the inter-rater tests. A complete description of the samples is shown in Table 1.
For intra-rater reliability, there was no significant difference in any CoP index between the test and retest mean values (see Table A3). Table 2 shows the 17 indices with ICC higher than 0.80. The CoP indices with the best level of reliability in the intra-rater tests are POWER95MLCE (ICC = 0.948 and PCC = 0.966), MVELMLOE (ICC = 0.920 and PCC = 0.926), and RDISTMLOE (ICC = 0.883 and PCC = 0.880). A total of 41 indices (52.6%) presented an ICC higher than 0.7 and a correlation coefficient higher than 0.7 (see Table A4 for full results).
For the inter-rater reliability, there was no significant difference among the three examiners for any CoP index except MFREQOE, POWER50APOE, FREQDMLOE, and FREQDAPOE (see Table A5 for complete results). Table 3 shows the 11 indices with ICC (2,1) higher than 0.75. The CoP indices with the best level of reliability in the inter-rater tests are CFREQAPCE (ICC (2,1) = 0.825 (0.717-0.934)), MFREQAPCE (ICC (2,1) = 0.819 (0.711-0.927)), and POWER95APCE (ICC (2,1) = 0.809 (0.701-0.918)). When comparing the three examiners, 30 indices (38.46%) presented an ICC (2,1) higher than 0.7.
The correlation coefficient was higher than 0.7 for 46 indices (59.0%) when comparing Rater 1 vs. Rater 2, for 25 indices (32.05%) when comparing Rater 1 vs. Rater 3, and for 23 indices (29.5%) when comparing Rater 2 vs. Rater 3 (the complete set of reliability results can be consulted in Table A6).
Table 3. Statistical analysis of CoP indices with the best level of reliability in the inter-rater tests.

Three gerontology students and two physiotherapists participated in the usability study. The rater who performed the test-retest trials also served as one of the three evaluators in the inter-rater test (Rater 1 in Tables 3 and 4). The two evaluators (Raters 4 and 5) who participated in our previous study [35] also responded to the SUS questionnaire. The five female raters (age: 25.8 (SD 7.12) years; experience in geriatric care and management: 5.1 (SD 3.69) years) answered the SUS questionnaire at the end of all the experimental balance tests. The results indicate that the mWBB has a mean SUS score of 92.5 points and a standard deviation of 6.84 points. Four of the five raters rated the usability of the mWBB as excellent, and one rated it as acceptable (see the scores in Table 4). The evaluators' age and years of experience did not appear to be related to the SUS scores.
To estimate the discrimination accuracy of the mWBB for identifying a balance alteration, we considered all measurements taken by Rater 1: the first trial of the 16 participants of the intra-rater tests and the results obtained from evaluating the 43 older adults included in the inter-rater tests. A Youden index analysis was run to calculate the optimal cut-off values that provide the best trade-off between sensitivity and specificity for identifying a balance deficit. Then, a ROC analysis was carried out and the 10 CoP indices with the highest AUC were obtained (Table 5). The highest AUC was found for the mean frequency of the anterior-posterior CoP time series with eyes open, MFREQAPOE (AUC = 0.778, sensitivity = 0.93, specificity = 0.625). The mean CoP velocity in the anterior-posterior direction and the range of the anterior-posterior CoP presented AUC, sensitivity, and specificity higher than 0.7.
Table 5. Statistical analysis of the 10 CoP indices with the highest area under the curve (AUC) related to presenting a balance deficit, optimal cut-off values for the indices, sensitivity and specificity.

Discussion
When assessing static balance in a group of individuals aged 65 years and over with a high prevalence of poor physical performance, the most reliable CoP index in the intra-rater tests was the 95% power frequency in the medial-lateral displacement of the CoP with closed-eyes (POWER95MLCE). It had excellent reliability, with an ICC = 0.948 (0.862-0.982) and a PCC = 0.966. The best index for the inter-rater reliability was the centroidal frequency in the anterior-posterior direction with closed-eyes (CFREQAPCE), which had an ICC (2,1) = 0.825. The mWBB also obtained an excellent average usability score of 92.5, showing that the examiners found it useful and easy to use and would recommend it to other health professionals, regardless of their age or professional experience.
The key indicators when measuring an instrument's quality are validity and reliability [48]. The first estimates the extent to which a measure agrees with the gold standard. Thus, it has been demonstrated that the WBB is a valid instrument that performs comparably to a laboratory-grade force platform for static standing computerized posturography [18]. Furthermore, previous research showed that the mWBB is a valid device that identifies balance alterations in independent, active older adults with no acute condition. Seventy-three percent of the CoP indices obtained with the mWBB were able to detect balance alterations, with the mean velocity of the CoP in the antero-posterior direction with open-eyes (MVELAPOE) being the best at discriminating between groups [35].
Reliability is defined by the consistency among successive measurements of a variable, on the same subject, and under similar conditions [49]. Some of the instrument's most critical reliability tests are inter-device, intra-rater, and inter-rater reliability.
Inter-device reliability refers to the consistency of measurements carried out by different devices. Several studies have shown that the WBB presents low inter-device variability [23]. Even after years of use, these devices do not present significant alterations in their measurements, and the battery charge level does not affect the sensor data [50].
Intra-rater reliability refers to the consistency of measurements performed under similar assessment conditions at two separate times by the same examiner (test-retest). Inter-rater reliability, in turn, refers to the consistency of measurements carried out by different examiners. Previous evidence [18] has indicated that the WBB is a reliable, safe, and feasible tool to assess static balance in highly functional individuals [51], older adults at risk of falls [52], and adults with stroke [53]. The primary reported drawback of using the WBB for medical assessment is the inconsistent sampling frequency [19]. However, in the design of the mWBB, this problem was addressed and solved [20]. The number of available variables derived from the trajectory of the CoP recorded in quiet standing varies greatly in the literature [54,55]. Most studies only analyze the total length of the CoP path and velocity in stance (time-domain "distance" measures), but further analysis of the other CoP indices can be useful to improve the reliability results, as shown in the present study, where time-domain "area", time-domain "hybrid", and frequency domain measures appear among the most reliable indices [36].
It is important to note that there was a high prevalence of physical deficit in both reliability test groups. All the participants presented strength deficits, and over 60% of the sample showed gait and balance deficits. The decline in balance with increased gait variability and lower limb strength [56,57] is associated with an increased risk of falls, resulting in measurements varying wildly from test to test. Despite this, the reliability results of the mWBB corroborate the hypothesis that it is a reliable instrument for assessing the balance in older adults.
For the inter-rater reliability, it is interesting to notice that four indices showed significant differences between raters. Comparisons between pairs of raters indicated that the number of highly correlated indices decreased when comparing Raters 1 and 2 with Rater 3 (see Table A6). The repeatability of the tests could be affected by the degree of the physical decline of the participants. Additionally, the little experience of the raters attending older adults with these characteristics also affected these results. Specifically, Rater 3 had shorter experience in geriatric care.
Our results showed that the CoP indices in the ML direction are the most reliable for intra-rater tests (Table 2). On the other hand, the parameters in the AP direction indicated greater reliability for the inter-rater tests (Table 3). The direction of the variation of the CoP indices depends on the muscles involved in maintaining balance and the contribution of the joints to postural oscillations [55,58]. Clinical and anthropometric factors influencing the CoP variables include sex, presence of vestibular impairments, comorbidities, height, weight, maximum foot width, base of support area, and foot opening angle [55,59]. However, as shown in Table 1, no significant difference was found in the characteristics between the individuals in the two samples. Therefore, given the high degree of physical deterioration of the participants, other features affect the sway direction in both reliability tests. Future research should include variables that affect balance in older people, such as the presence of dementia, depression, sarcopenia, or frailty [60–63].
Usability is one of the crucial requirements for health technology [64]. The System Usability Scale (SUS) is frequently used because of its validity and availability and its easy score interpretation. However, it is important to notice that it is a weak indicator of critical and severe usability issues compared to the task completion rates. It is a subjective evaluation instrument and only provides a general score of the usability [65]. Furthermore, a larger sample size of evaluators is needed to generalize the results. Therefore, despite the high usability score obtained by the mWBB, further research is needed to establish its use among health professionals who care for older adults.
Due to the high variability between methodological variables, there is no universal consensus on which CoP indices are the best to assess balance and risk of falling [35,55]. The majority of studies show AUC values between 0.7 and 0.8, most of them presenting sensitivity or specificity below 0.7 [35,66–71] (comparisons between studies can be found in [35]). Therefore, it is interesting to note that for the classification accuracy, the mean CoP velocity in the anterior-posterior direction with open-eyes (MVELAPOE) and the range of the anterior-posterior CoP with open-eyes (RANGEAPOE) presented AUC = 0.747, sensitivity = 0.744, and specificity = 0.75 (equality of values for both indices is a coincidence). Furthermore, in our previous study of predictive validity [35], MVELAPOE had the best value of AUC to identify a balance deficit (AUC = 0.714, sensitivity = 0.478, specificity = 0.930). We attributed the low level of sensitivity to the fact that the studied population in [35] was independent, active, and without any acute conditions. Thus, further research is needed to select indices with high sensitivity and specificity in intergroup classifications, depending on the origin of the equilibrium alterations.
Despite all the benefits the WBB could bring as a measurement tool in clinical settings [24,72,73], there is an ongoing debate concerning its scientific value [12,25–32]. Some studies have raised concerns about the accuracy of the WBB, the interchangeability of the device with other force platforms, and its use in clinical applications. On the other hand, scientists and clinicians have drawn attention to the need for affordable evaluation tools in non-specialized clinics and less developed countries, regular follow-ups to adapt treatment according to the patient's performance, and access to tools to prevent the risk of falls. The mWBB presented in this work aims to contribute to the development of more agile and better-adapted hardware and methods that can be available to more patients than current high-end solutions by solving the technical drawbacks of the WBB, and by demonstrating its capability to quantify balance deficits in older adults and the reliability of its measurements.
This study has some limitations. First, reliability tests should be performed under similar assessment conditions, and the high degree of physical deterioration of the participants could have affected the tests. However, the results showed which indices were the most appropriate to assess older adults with these characteristics. Second, the difference between the years of experience of the evaluators could have affected the inter-rater reliability tests. Third, a larger sample of experienced personnel is required to generalize the usability results. Fourth, a larger sample is needed to verify the classification accuracy. Finally, like most mass-produced technology, the WBB has a defined life cycle of availability and Nintendo is no longer producing it. However, as prior research has shown similar results between new and used WBBs [74], old platforms could still be used for physical function assessments. Furthermore, the same principle used in these boards is used in electronic bathroom scales, which are still widely produced and used; these devices could also be modified to serve as low-cost balance assessment devices.

Conclusions
Adding to the literature on the WBB as an acceptable, low-cost, portable, easy to use, and valid device for balance measurement, the mWBB is a reliable device to quantify the CoP displacement during balance tests in older adults, capable of discriminating between people with and without balance deficits.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results. No support was received from Nintendo, the authors are not related in any form to Nintendo or its subsidiaries, and the purpose of this project is purely scientific, dedicated to the development of knowledge.

Appendix A
FREQDAP: Frequency Dispersion of Anterior-Posterior CoP data.
N is the number of data points included in the CoP time series (N = 2400 with open eyes and with closed eyes). T is the period of time selected for analysis (T = 48 s in this work). G_x(m) is the discrete power spectral density (x stands for RD, ML or AP). u is the smallest integer that converges in recursive sums.
Appendix B
Table A2. Rater questionnaire on the usability of the modified Wii Balance Board (mWBB).

Strongly Agree
1. I think I would like to use the mWBB frequently.
Appendix C