Accuracy of Mobile Applications versus Wearable Devices in Long-Term Step Measurements

Fitness sensors and health systems are paving the way toward improving the quality of medical care by exploiting the benefits of new technology. For example, the great amount of patient-generated health data available today offers new opportunities to measure life parameters in real time and to revolutionize communication between professionals and patients. In this work, we concentrated on the basic parameter typically measured by fitness applications and devices: the number of steps taken daily. The main goal of this study was to compare the accuracy and precision of smartphone applications versus those of wearable devices, to give users an idea of the relative difference in measurements that can be expected from the different system typologies. The data obtained showed a difference of approximately 30%, indicating that smartphone applications provide inaccurate measurements in long-term analysis, while wearable devices are precise and accurate. Accordingly, we challenge the reliability of previous studies reporting data collected with phone-based applications, and, besides discussing the current limitations, we support the use of wearable devices for mHealth.


Introduction
Innovative applications (apps) and smart devices such as smartphones and tablets have emerged as integral parts of people's lives [1]. They provide information to profile society [2], but also make users aware of body signals important for predicting life expectancy [3], promote patient engagement and self-management of diseases, and assist doctors in remotely following up patients [4]. In particular, in the healthcare sector, we have seen the development of systems to assist users in different ways during the day [5] and night [6]. These devices offer incredible potential to generate big data, often identified as patient-generated health data (PGHD), that could influence medical doctors' decision-making [7], as well as the potential for earlier diagnosis [8]. The World Health Organization (WHO) defines these tools under the labels electronic health (eHealth) and mobile health (mHealth): eHealth refers to any use of information and communication technology for health care; mHealth is a subset of eHealth specifically referring to the use of mobile and wireless devices. For instance, fitness trackers, blood glucose meters [9], blood pressure monitors [10], smoking sensors [11], and temperature-detection devices [12] are today popular for monitoring different fitness parameters and vital signs used for health analysis [13] and many other motion-related applications, including tracking, positioning, activity recognition, and augmented reality [14]. In addition, electronic textiles (e-textiles), smart clothes, and flexible/printable electronics bring us closer to scenarios where electronic systems are totally embedded in what we wear. Having both physical activity (PA) data and patients' clinical status at the scientific community's disposal would help in better profiling patients and understanding the correlations in order to improve people's wellbeing [36].
In this work, we considered a basic parameter for PA evaluation: the number of steps taken daily. Our aim was to give users an idea of how many steps they miss when measuring them through a smartphone-based application. In particular, we compared six fitness tracking applications for Android mobiles and three commercial fitness tracking wristbands, and we analyzed the precision and accuracy of the phone-based applications versus the dedicated wearable devices. Several brands and models of consumer trackers have been examined for their step-measurement accuracy in laboratory settings [37], e.g., treadmill walking [38], level walking [39], and stair walking [40]. However, count accuracy in real environments remains a major challenge [41]. To assess the accuracy and precision of the considered trackers, we performed two types of experiments (Figure 1). The first experiment was similar to those performed by Case et al. [42] and Modave et al. [43]: we evaluated the accuracy of the trackers in controlled outdoor tests with known ground truth, i.e., the real number of steps counted manually by analyzing videos recorded with a camera to serve as the benchmark (Figure 1a) [44]. In the second experiment, we asked a healthy 35-year-old man to install the six applications on his mobile and, at the same time, wear all the fitness wristbands, and we measured the precision of the different systems in a 2-month, 24 h/7 days experiment (Figure 1b). All the collected data and acquired videos are publicly available on FigShare (https://doi.org/10.6084/m9.figshare.c.4923645.v2).
The results obtained show that despite the good accuracy of all the tracking systems, the mobile applications did not provide reliable data in the long-term analysis due to an intrinsic problem, a missing parameter: the length of time the mobile was carried. In daily routines, mobiles are not always carried by people, especially when acting in a small indoor environment and performing standard daily life activities (e.g., cleaning, cooking, visiting the toilet), and this may generate unreliable day-based measurements. Generally, young people carry their mobiles most of the time, so the number of steps recorded is reliable, but the situation is very different for elders. In our case, even though the man performing the experiment was aware of the purpose of the collected data, the number of steps recorded by the phone-based applications was, on average, 30% lower than that recorded by the fitness wristbands, and this would probably be misleading for all types of subsequent statistics, especially if the data were collected for clinical trials. Furthermore, such untrustworthy data may be perceived by patients as reliable and may also have a negative impact on users' behavior.
We decided to share our data to make the scientific community aware that many studies based on steps counted with smartphone-based applications may be unreliable. Several recent as well as ongoing studies are based on data collected using smartphone-based applications [45], but our data show that the findings of these studies should be reconsidered. In our opinion, the only way to collect reliable data is to use an accurate fitness tracker worn on the body 24 h a day, 7 days a week. A major limitation of these devices is that they may not be sensitive enough for non-ambulatory physical activities such as cycling, swimming, and dancing [24]. Furthermore, most of them become unreliable when used by people with disabilities [46]. However, they are still the best compromise available today for collecting reliable data passively without influencing people's lives.

Phone-Based Applications and Wearable Fitness Trackers
Today, the market offers a wide range of phone-based applications and wearable devices for fitness tracking [47]. They differ in terms of available hardware and software installed, and it is known that the accuracy depends on both software and hardware [48]. However, even when the operating system, application programming interface (API), and sensors are the same, different implementations can affect the computation of the number of steps.
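Implementation differences of the kind mentioned above are easy to see in even the simplest pedometer logic. As an illustration (not the algorithm used by any of the tested apps or devices), the following minimal Python sketch counts steps as local peaks in the accelerometer magnitude; the sampling rate, threshold, and refractory period are all assumed tuning parameters, and changing any of them changes the count:

```python
import math

def count_steps(samples, fs=50.0, threshold=1.2, refractory=0.3):
    """Count steps from tri-axial accelerometer samples.

    samples:    list of (ax, ay, az) tuples in units of g (assumed layout).
    fs:         sampling rate in Hz (assumed).
    threshold:  magnitude (in g) a peak must exceed to count as a step.
    refractory: minimum time (s) between consecutive steps.
    """
    min_gap = int(refractory * fs)  # samples to skip after a detected step
    # acceleration magnitude removes the dependence on device orientation
    mags = [math.sqrt(ax * ax + ay * ay + az * az) for ax, ay, az in samples]
    steps = 0
    last_step = -min_gap
    for i in range(1, len(mags) - 1):
        # local maximum above threshold, outside the refractory window
        if (mags[i] > threshold and mags[i] >= mags[i - 1]
                and mags[i] > mags[i + 1] and i - last_step >= min_gap):
            steps += 1
            last_step = i
    return steps
```

Two apps running this same sketch with different `threshold` or `refractory` values would already disagree on the same walk, which is the point made above about implementations.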
In this work, we mimicked a user interested in installing an application for fitness tracking by downloading one from among those freely available. The applications and devices included in this study were selected considering different software companies, prices, sensors, and algorithms used to analyze the data. Our decision was to install the applications in a mid-range smartphone, considering this to be the representative technology of the average consumer today. In particular, all of the applications were downloaded from the Google Play store and were installed on a Huawei P Smart FIG-LX1 phone with the Android 9 operating system.
In this work, 6 phone-based fitness tracking applications (APPs) and 3 wearable fitness trackers (WFTs) were tested. The 6 applications were APP1 (Huawei Health), APP2 (Bits&Coffee ActivityTracker), APP3 (Best Simple Apps Contapassi), APP4 (GALA MIX WinWalk), APP5 (LG Electronics LG Health), and APP6 (Pacer Health's Pacer); the 3 wearable fitness trackers were WFT1 (Decathlon OnCoach 100), WFT2 (Crane activity tracker), and WFT3 (Suunto 9).

Tracker Accuracy: Experiment Description
To assess how closely the different trackers agreed, we asked 3 operators to perform a test carrying a mobile with the 6 apps installed in a trouser pocket and wearing the 3 fitness wristbands on their left arm. The operators were 3 healthy subjects: a 35-year-old man (hereafter, Operator1), a 65-year-old man (Operator2), and a 65-year-old woman (Operator3). We asked each operator to walk in a fairly flat park while counting approximately 1200 steps. For the test, we chose this outdoor environment and not an indoor treadmill because it is known that in real-world conditions, especially on difficult terrain, there can be far more variation in step counts, given the changes in gait and wrist movement [43]. To have the ground truth, a cameraman recorded a video by following the operator performing the test with a camera. The acquired videos were post-processed to count the actual number of steps. Accuracy was then assessed by comparing the number of steps recorded by each tracker with the ground truth. All videos considered in the experiments are publicly available on FigShare (https://doi.org/10.6084/m9.figshare.c.4923645.v2).

Tracker Precision: Experiment Description
To evaluate the precision of the different trackers, we performed a long-term experiment, lasting a few months, designed to collect 60 days of reliable data. Basically, we asked an operator to carry the mobile with the apps installed and to wear the fitness wristbands 24 h/7 days. We then kept the data of the first 60 days in which all the tracking systems were working: every time one of the tracking systems was recharging, we discarded the data collected by all the other systems. Furthermore, all the days when the operator could wear the wristbands but not carry the mobile in a trouser pocket (due to free-time activities that could potentially damage the smartphone, for instance swimming, working in a vegetable garden, riding a horse, or dancing) were removed from the analysis.
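The day-filtering rule described above (discard a day as soon as any tracker was off or recharging) can be sketched as follows; the record layout, with `None` marking a missing count, is an assumption made purely for illustration:

```python
def keep_complete_days(daily_records, n_days=60):
    """Return the first n_days day-records in which every tracker reported
    a step count. A value of None marks a tracker that was off, recharging,
    or not carried, and causes the whole day to be discarded."""
    complete = [day for day in daily_records
                if all(count is not None for count in day.values())]
    return complete[:n_days]
```

This is why the 60 analyzed days were not continuous: any day with a single gap is dropped for every system, so all trackers are always compared over exactly the same days.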
The operator was Operator1, a healthy 35-year-old man who leads a standard office job life (he is a computer science researcher) with several typical activities like working at the computer, answering the office phone, going to the printer, sharing documents with colleagues, and meeting with collaborators. The raw data recorded by the different trackers are reported in Table 1. Once again, it is worth remarking that the 60 days considered in the experiment were not continuous, mainly due to battery recharge issues of the wearable devices and operator needs.

Statistics
In the accuracy assessment, the percentage absolute normalized difference (PAND) between the step value (V_i) recorded by tracker i and the ground truth (G), determined in the post-processing of the videos, was computed to measure the closeness of agreement of the different counts according to Equation (1):

PAND_i = |V_i − G| / G × 100  (1)

In the precision assessment, unbalanced one-way analysis of variance (ANOVA) was performed to measure whether the step values recorded daily by the apps significantly differed from those recorded by the fitness wristbands. Then, the percentage normalized difference (PND) between the value (V_i) recorded by tracker i and the average value (A), computed considering the values of the apps (A_a) or of the wearables (A_w) together, was calculated to evaluate the precision of each tracker according to Equation (2):

PND_i = (V_i − A) / A × 100, with A = A_a for apps and A = A_w for wearables  (2)

The tests were carried out using MATLAB (MathWorks, Inc., Natick, MA, USA), and a p-value ≤ 0.05 was considered statistically significant. The asterisks in column 1 of Table 2 flag the three most commonly used levels of significance: a p-value less than 0.05 is flagged with one asterisk (*), less than 0.01 with two asterisks (**), and less than 0.001 with three asterisks (***). The bold values in column 1 of Table 2 highlight the days when the step values recorded by the apps did not significantly differ (considering p-value = 0.05 as the threshold) from those recorded by the fitness wristbands. The same analysis was repeated by accumulating the steps and subdividing the data into 6 blocks of 10 days (Table 3).
Table 2. Steps recorded by trackers in the 2-month experiment. * p-value less than 0.05; ** p-value less than 0.01; *** p-value less than 0.001. Bold values in column 1 highlight days when step values recorded by APPs do not significantly differ (considering p-value = 0.05 as the threshold) from those recorded by fitness wristbands (WFTs).
Table 3. Cumulative number of steps recorded by subdividing the data reported in Table 2 into 6 blocks of 10 days. Asterisks in column 1 highlight blocks where step values recorded by APPs significantly differ from those recorded by fitness wristbands (WFTs). * p-value less than 0.05; ** p-value less than 0.01; *** p-value less than 0.001.
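Equations (1) and (2) reduce to two one-line functions. The following sketch (in Python, rather than the MATLAB the authors used) is a direct transcription:

```python
def pand(v_i, g):
    """Equation (1): percentage absolute normalized difference of a
    tracker count v_i from the ground-truth count g."""
    return abs(v_i - g) / g * 100.0

def pnd(v_i, a):
    """Equation (2): signed percentage normalized difference of a tracker
    count v_i from the group average a (A_a for apps, A_w for wearables)."""
    return (v_i - a) / a * 100.0
```

For example, a tracker reporting 1100 steps against a ground truth of 1200 has a PAND of about 8.3%, comparable to the worst accuracy values reported below.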

Tracker Accuracy: Results
To assess the accuracy of the trackers, we asked three operators to wear the devices and walk in a park while counting approximately 1200 steps. To obtain the ground truth number of steps, a cameraman followed the operator and recorded a video that was subsequently used to determine the real number of steps. Table 1 reports the name of the operator performing the test in row 1, the ground truth number of steps in row 2, the number of steps recorded by the devices in rows 3-11, and the PAND values in rows 12-21. It is worth noting that all apps counted the same number of steps in all tests, except APP4 in the tests performed by Operator2 and Operator3 (Figure 2a). However, all PAND values were lower than 10% (the worst PAND was 8.23% for the apps and 8.08% for the wristbands), proving the accuracy of all the trackers in a short-term analysis performed under the controlled conditions of a standard walk in a park.
Figure 2. (a) Step values reported in Table 1, and (b) day-based averages of the APP and WFT step values reported in Table 2. The gray bars in (b) are the standard deviations. * p-value less than 0.05; ** p-value less than 0.01; *** p-value less than 0.001. Bold values on the x-axis of (b) highlight days when step values recorded by APPs do not significantly differ (considering p-value = 0.05 as the threshold) from those recorded by fitness wristbands (WFTs).

Tracker Precision: Results
To analyze the precision of the trackers, PND values were computed considering A_a as the normalization factor for the apps and A_w for the fitness trackers. Table 2 reports the steps counted by the trackers in the 60-day experiment. Day-based average values and standard deviations of the APPs and WFTs are reported in Figure 2b, and PND values are reported in Table 4. The average PND computed by considering all the data collected (last line of Table 4) ranges from −3% to 8% for the apps and from −6% to 4% for the wearables, showing good precision of the trackers. The same analysis was repeated not just day-based, but also by accumulating the steps and subdividing the data into 6 blocks of 10 days. Table 3 reports the cumulative step numbers and Table 5 the block-based PND values. The average PND computed by considering the 6 blocks together (last line of Table 5) ranges from −4% to 8% for the apps and from −8% to 4% for the wearables. This shows that despite some local values strongly differing from the average, across the whole experiment there were no significant differences within the APPs or within the WFTs considered separately. Accordingly, although the recorded values obviously depend on the quality of the algorithms, on average, thanks to the available technology, all apps and wearables gave practically the same number of counted steps in the long-term analysis.
Table 5. Percentage normalized difference (PND) values computed by analyzing the 6 blocks of cumulative steps reported in Table 3.
However, the number of steps recorded by the phone-based apps was significantly lower than that recorded by the fitness wristbands: on average, 34% lower according to the day-based analysis (Figure 2b) and 32% lower according to the cumulative-based one. Furthermore, on approximately 80% of the days (50 out of 60), the values recorded by the phone-based apps differed statistically significantly from those recorded by the wearable devices, according to an unbalanced one-way ANOVA test performed on the day-based data. This suggests lower accuracy of the apps in long-term analysis with respect to wearables worn on the body. The cumulative-based data reported in Table 3 emphasize this point: in 100% of the 6 blocks, the cumulative values of the phone-based apps differed strongly and statistically significantly from those of the wearable devices (Block I is characterized by a p-value lower than 0.01; all the other blocks by p-values lower than 0.001).
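The F statistic behind the unbalanced one-way ANOVA used here can be computed without any toolbox. The following pure-Python sketch (the authors used MATLAB) shows the computation for a list of unequal-size groups, such as the daily counts of the six apps versus the three wristbands; in practice, `anova1` in MATLAB or `scipy.stats.f_oneway` in Python also returns the associated p-value:

```python
def one_way_anova_F(groups):
    """F statistic of an unbalanced one-way ANOVA.

    groups: list of groups, each a list of observations (e.g., the step
    counts of all APPs in one group and all WFTs in another); group sizes
    may differ, which is what makes the design "unbalanced".
    """
    all_vals = [x for g in groups for x in g]
    n, k = len(all_vals), len(groups)
    grand = sum(all_vals) / n
    # between-group sum of squares, weighted by group size
    ss_between = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # within-group sum of squares
    ss_within = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    df_between, df_within = k - 1, n - k
    return (ss_between / df_between) / (ss_within / df_within)
```

A large F (relative to the F distribution with the two degrees of freedom above) corresponds to the small p-values flagged with asterisks in Tables 2 and 3.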
It is worth remarking that the operator performing the experiment was well informed about the purpose of the acquired data. However, to keep the data realistic, he did not modify his daily routine or the way he used his mobile. Accordingly, he was not carrying the mobile during several daily movements typical of a standard office job, like standing up from the desk to answer the office phone, going to the printer, bringing documents to colleagues, and visiting the toilet. Furthermore, during his free time (mainly in the evenings at home), he was not carrying the mobile during standard activities like cooking and cleaning. The recorded data proved that these ordinary movements strongly affect the final step count, and the difference between the steps recorded by phone-based apps and fitness wristbands would be even larger for elderly people, who typically leave their mobile on a desk in a central room of the house while performing standard activities around it.

Discussion
One of the goals of this work was to compare the accuracy and precision of smartphone-based apps versus those of wearable fitness wristbands to suggest the best way to collect data for long-term clinical trials. Even though several research studies have confirmed that both the accuracy and precision of activity tracking devices have steadily improved over the last decade, device accuracy still remains a concern. However, technological improvements, increased knowledge of how to use the devices and for what purposes, and a better statistical approach to understanding the data all contribute to greater confidence in the use of wearables and sensors for clinical trials.
Logically, to be effective in measuring PA, patients must wear a wristband or use an app for the entire day. Therefore, design, battery life, water resistance, and comfort must all be maximized [29]. However, since in daily life people do not always carry their smartphone, data collected by phone-based apps may be unreliable for long-term analysis, and asking people to carry their smartphone 24 h a day, 7 days a week would be impossible and would definitively change their lifestyle. Moreover, other studies have already shown that even fitness wristbands generally slightly underestimate the number of steps [38]. Consequently, there are several reasons to conclude that data obtained by smartphone-based apps are unreliable. Fitness wristbands, on the other hand: (a) are only minimally invasive; (b) do not influence people's lifestyle; and (c) can collect accurate and precise data for long-term analysis. However, since foot pods measure walking or running cadence directly, we do not exclude the idea that more reliable data could be acquired using ankle bracelets, which are probably more accurate than devices measuring at the wrist or the hip [43]. Previous comparisons of data from various consumer- and commercial-grade wearables have already demonstrated variability among devices related to body placement. For instance, Hildebrand et al. [49] examined the effects of tracker placement on acceleration values. In particular, they showed a significant difference between hip and wrist placement, with the output from the wrist generally being higher than that from the hip. A hip tracker usually has several limitations, including underestimation of energy expenditure during activities with little or no movement at the hip and potential loss of data due to removing the tracker when dressing, as well as influencing the user's lifestyle. Indeed, several studies have documented noncompliance resulting in loss of data when using a hip-mounted accelerometer [16,50].
How to improve the general accuracy of step counting still remains a very important issue. For instance: (a) exploiting the decreasing prices and shrinking sizes of electronic sensors, three sensors could be embedded in every device so that a median value is always provided; (b) GPS measurements of the distances walked or run could be incorporated into all step-counting algorithms to improve the accuracy of the system; (c) other, less noise-prone body placements could be tested, for instance the neck or waist, or smart wearable technologies, such as smart shirts or pants, could be considered. However, with the information available today, we think that the best trade-off for collecting reliable data without affecting people's lifestyle is offered by wearable fitness wristbands. Therefore, we suggest that these types of devices be used for long-term clinical trials, and we caution the community to reconsider the findings of previous studies based on smartphone applications.
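Suggestion (a), redundant sensors fused by a median, is trivial to express in code; the sketch below assumes three independent counts from three co-located sensors, which is a hypothetical design rather than a feature of any tested device:

```python
def fused_count(c1, c2, c3):
    """Median of three redundant step counts: a single outlying sensor
    (stuck, noisy, or miscalibrated) cannot move the reported value."""
    return sorted([c1, c2, c3])[1]
```

The median is preferred over the mean here because one faulty counter reporting, say, zero would drag an average down but leaves the median untouched.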

Conclusions
At this time, the number of clinical trials involving physical activity measurements is growing. Smartphone-based applications, fitness wristbands, ankle bracelets, hip belts, and many other tracking systems are very common in society. However, the accuracy and precision of these devices still remain a concern.
In this work, we analyzed the precision and accuracy of six phone-based applications (i.e., APP1-Huawei Health; APP2-Bits&Coffee ActivityTracker; APP3-Best Simple Apps Contapassi; APP4-GALA MIX WinWalk; APP5-LG Electronics LG Health; APP6-Pacer Health's Pacer) and three wearable fitness wristbands (i.e., WFT1-Decathlon OnCoach 100; WFT2-Crane activity tracker; WFT3-Suunto 9). The experiments performed were conceived as a proof of concept to give an idea of what can be expected in terms of relative differences in the measurements achieved using different system typologies. In particular, the number of steps recorded by the trackers in controlled tests with ground truth was used to assess the accuracy of the trackers, while a long-term analysis based on data acquired in a 2-month experiment was used to estimate the precision. The first experiment proved that the accuracy of the smartphone applications is comparable to that of the wearable fitness wristbands; the second proved that, because people cannot always carry a smartphone, step measurements in long-term analysis may differ even by 30%. Providing significant statistics about the absolute performance of mobile applications and fitness wristbands is beyond the scope of this work. However, we can summarize that among the applications, only APP4 gave significantly different results, and among the wristbands, WFT1 gave, in general, lower step measurements.
The main outcome of the experiments is that despite good accuracy in short-term analysis under controlled conditions, data acquired with phone-based applications may be unreliable in long-term analysis. This is not due to system accuracy, but to the fact that people do not always carry their smartphone in their trouser pocket. For instance, they typically leave their mobile on a desk in a central room of the house and do not carry it when performing standard daily life activities (e.g., cleaning, cooking, visiting the toilet) in small indoor environments. We proved that the sum of these moments, added to the moments when people do not carry their mobile because they are performing activities that could potentially damage it (e.g., swimming, working in the vegetable garden, riding a horse, dancing), may, at the end of the day, have a significant impact on step measurement.
To conclude, besides suggesting the use of wearable fitness wristbands to collect data, we also caution the scientific community to reconsider outcomes of studies based on data collected with mobile trackers.