Employing Machine Learning to Estimate Hallmark Measures of Physical Activities from Wrist-worn Devices Across Age Groups

Wrist-worn fitness trackers and smartwatches are proliferating, with increasing attention to health tracking. Given the popularity of wrist-worn devices across all age groups, a rigorous evaluation of how well they recognize hallmark measures of physical activity and estimate energy expenditure is needed to compare their accuracy across the lifespan. The goal of the study was to build machine learning models to recognize physical activity type (sedentary, locomotion, and lifestyle) and intensity (low, light, and moderate), identify individual physical activities, and estimate energy expenditure. The primary aim was to build and compare models for different age groups: young [20-50 years], middle (50-70 years], and old (70-89 years]. Participants (n = 253, 62% women, aged 20-89 years) performed a battery of 33 daily activities in a standardized laboratory setting while wearing a portable metabolic unit that measured energy expenditure, which was used to gauge metabolic intensity. A tri-axial accelerometer collected data at 80-100 Hz from the right wrist, and the data were processed into 49 features. The random forests algorithm recognized physical activity type accurately; the F1-score range across age groups was: sedentary [0.955 – 0.973], locomotion [0.942 – 0.964], and lifestyle [0.913 – 0.949]. Recognizing physical activity intensity resulted in lower performance; the F1-score range across age groups was: sedentary [0.919 – 0.947], light [0.813 – 0.828], and moderate [0.846 – 0.875]. The root mean square error range was [0.835 – 1.009] for the estimation of energy expenditure. The F1-score range for recognizing individual physical activities was [0.263 – 0.784]. Performance was relatively similar, and the accelerometer data features were ranked similarly, between age groups.
In conclusion, data features derived from wrist-worn accelerometers lead to moderate-to-high accuracy in estimating physical activity type, intensity, and energy expenditure, and are robust to potential age differences.

Health Organization (WHO) PA recommendations [1]. Mobility is an essential factor for independence and engagement in social life. Those who lose mobility have a higher risk of morbidity, disability, and mortality [2][3][4][5]. Recently, the WHO published the Global action plan on physical activity 2018-2030 (GAPPA) to enhance PA, with a target of a 15% reduction in physical inactivity by 2030 [6]. The most recent WHO guidelines on physical activity and sedentary behavior [7] suggest that adults (aged 18 and older) should do at least 150-300 minutes of moderate-intensity aerobic PA; or at least 75-150 minutes of vigorous-intensity aerobic PA; or an equivalent combination of moderate- and vigorous-intensity activity throughout the week. Additionally, adults should replace their time spent being sedentary with PA.
To meet the WHO goals, accurate estimation of physical activity type, intensity, and duration is required. The proliferation of fitness trackers and wearable accelerometers offers an excellent opportunity to achieve this goal. The literature contains many examples of machine learning algorithms, including decision trees [8], random forests [8,9], and bag-of-words [10], for processing and modeling accelerometer data. However, these models are often limited to a specific age group (e.g., adults 20-40 years old). The open question is whether known age differences in movement patterns influence the performance of machine learning models. Few studies have examined the differences between models built to recognize PA type and intensity, recognize individual PA, and estimate energy expenditure (EE) across different age groups. Such knowledge will be useful in deriving age-specific models that improve prediction accuracy.
Historically, the approach adopted to recognize PA type and intensity and to estimate energy expenditure (EE) relied on data collected from the hip position in standardized laboratory settings. The advantage of the hip over other positions is its proximity to the body's center of mass, offering a convenient and accurate approach for capturing ambulatory activity [11]. However, the hip position is prone to patient/participant compliance issues and cannot easily be used to gather 24-hour data [12].
Alternatively, the wrist position has become popular for collecting accelerometer data due to a rise in smartwatches, convenience, ability to capture sleep quality (24 hours) and enhanced compliance in research studies [13][14][15][16]. Unfortunately, despite the popularity of wrist-worn accelerometers, there is a paucity of models that are deemed viable for accurately assessing PA [17,18]. The use of the wrist position to recognize PA type and intensity and estimate EE is challenging due to its potential limitation in quantifying and capturing large lower limb movements and other lifestyle activities. Therefore, models that can accurately recognize PA type and intensity and estimate energy expenditure from the wrist are greatly needed to meet the current demand.
This study utilizes a large amount of high-resolution raw accelerometer data collected from the wrist position, coupled with metabolic intensity assessed in 253 adults aged 20-89 years. An aggregated set of relevant features was used as input to machine learning models to recognize PA type and intensity, identify individual PA, and estimate EE. Machine learning models developed on specific age groups (young [20-50], middle (50-70], and old (70-89]) were then compared to test the hypothesis that model performance varies across age groups. Results are expected to help evaluate whether machine learning models used to represent wrist-worn accelerometer data need to be tailored to known age differences in movement and behavior to optimize their accuracy.

Participants
Participants were community-dwelling adults 20+ years old who were able to read and speak English, were willing to undergo all testing procedures, and whose weight had been stable over the last three months (+/-5 lbs). Two hundred and fifty-three (253) of the 264 enrolled participants were included in the analysis. Those excluded had missing start/end times of activities (6 participants), insufficient activity length or missing values (3 participants), or missing demographic information (2 participants). The Institutional Review Board at the University of Florida approved all study procedures, and all participants provided written informed consent before the study.

Prescribed Activities and Visits
The ChoresXL study methods have been described previously by our group [19,20]. Briefly, participants performed a battery of 33 typical daily activities that were categorized into activity types and intensities calculated post hoc from metabolic unit data (supplemental Table S1). Tasks were chosen because they mimic daily chore activities that are common among most Americans and are consistent with the average time spent in the 2010 American Time Use Survey [21]. All tasks were performed in a standardized laboratory setting with scripted instructions for approximately 8-10 minutes to achieve a steady-state energy expenditure. Participants performed all tasks at their own speed. Tasks were ordered from lowest to highest metabolic demand to reduce the carryover of high metabolic effects from one task to another. To ease burden and exhaustion, participants performed the tasks over four visits. However, some did not complete all visits.
Overall, 213 participants attended all 4 visits, 21 attended 3 visits, 7 attended only 2 visits, and 12 attended only 1 visit. In total, there were 941 data collection visits.

Instrumentation
Participants wore an ActiGraph GT3X-BT monitor on the right wrist (ActiGraph Inc, Pensacola, FL). The ActiGraph GT3X-BT monitor is a lightweight tri-axial accelerometer that records accelerations in units of gravity (1 g) in perpendicular, anterior-posterior, and medio-lateral axes. Accelerometers were programmed to collect data at a 100 Hz sampling rate. Participants also wore a 2 kg portable metabolic unit (Cosmed K5; COSMED, Rome, Italy) that estimated energy expenditure using the principles of indirect calorimetry. Before data collection, the oxygen (O2) and carbon dioxide (CO2) sensors were calibrated using a gas mixture sample of 16.0% O2 and 5.0% CO2 and a room air calibration. The turbine flow meter was calibrated using a 3.0-L syringe. A flexible facemask was positioned over the participant's mouth and nose and attached to the flow meter. Oxygen consumption (VO2, in mL·min⁻¹·kg⁻¹) was measured breath-by-breath and subsequently smoothed with a 30-second running average window. Steady-state VO2 for each task was manually calculated over approximately 2 minutes when there was evidence of a plateau, which indicates that metabolic demand is matched to physical workload. Data were expressed as METs by dividing the VO2 values by the traditional standard of 3.5 mL·min⁻¹·kg⁻¹ [22].
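The conversion from breath-by-breath VO2 to a steady-state MET value can be illustrated with a minimal Python sketch. It is not the study's processing code. It assumes a centered 30-second window (the text does not say whether the window was centered or trailing), and the function names and the plateau bounds start_s and end_s are hypothetical; in the study the plateau was identified manually.

```python
REST_VO2 = 3.5  # mL/min/kg, the traditional 1-MET standard cited in the text


def running_average(times_s, values, window_s=30.0):
    """Smooth irregularly timed breaths with a centered time window (assumed)."""
    half = window_s / 2.0
    smoothed = []
    for t in times_s:
        inside = [v for u, v in zip(times_s, values) if abs(u - t) <= half]
        smoothed.append(sum(inside) / len(inside))
    return smoothed


def steady_state_mets(times_s, vo2, start_s, end_s):
    """Average the smoothed VO2 over the plateau [start_s, end_s], then convert to METs."""
    smoothed = running_average(times_s, vo2)
    plateau = [v for t, v in zip(times_s, smoothed) if start_s <= t <= end_s]
    return (sum(plateau) / len(plateau)) / REST_VO2


# Illustrative data only: one breath every 3 s for 10 minutes at a constant 10.5 mL/min/kg.
times = list(range(0, 600, 3))
vo2 = [10.5] * len(times)
mets = steady_state_mets(times, vo2, start_s=360, end_s=480)
```

With a constant VO2 of 10.5 mL·min⁻¹·kg⁻¹, the sketch yields 3 METs, since 10.5 / 3.5 = 3.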

Problem Formulation
In this paper, we targeted four main tasks to capture the hallmark measures of PA: 1) recognize PA type (classification task), split into three binary classification tasks: i) sedentary vs non-sedentary, ii) locomotion vs non-locomotion, and iii) lifestyle vs non-lifestyle; 2) recognize PA intensity (classification task), split into three binary classification tasks: i) low vs non-low, ii) light vs non-light, and iii) moderate vs non-moderate; 3) recognize individual PA (classification task); and 4) estimate the energy expenditure while performing the scripted activities (regression task).
We extracted consecutive non-overlapping 60-second windows from the raw accelerometer data. Previous studies used various window lengths, ranging from 0.1 seconds to 128 seconds [23][24][25][26][27]. A 60-second window was chosen as a compromise between having sufficient data for accurate feature extraction and limiting computational cost. In total, 49 time- and frequency-domain features, listed in Table 1, were extracted. During data processing, some cases with different collection frequencies were discovered (15 at 80 Hz and 100 at 30 Hz). However, no resampling was performed because the resolution was sufficient to extract features over a 60-second window.
In all tasks, participants were randomly distributed into 5 folds. We used 5-fold nested cross validation (nested-CV), which has an inner CV loop nested in an outer CV loop. The inner loop is responsible for hyperparameter tuning (the process of searching for the optimal parameters of the model), while the outer loop is responsible for error estimation and generalization. We used random search for hyperparameter tuning (number of trees, maximum number of features, maximum depth of each tree, and minimum number of samples per leaf), in which 10 sets of hyperparameters are set up and combined randomly for training the model.
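The windowing and validation scheme described above can be sketched in plain Python. This is an illustration under stated assumptions rather than the study's implementation. The search-space values are hypothetical, fit_and_score stands in for training and scoring a random forest, and the inner loop reuses the four remaining participant folds, because the text does not specify how the inner folds were formed.

```python
import random
import statistics

# Hypothetical search space over the four tuned random forest hyperparameters.
SEARCH_SPACE = {
    "n_trees": [100, 200, 500],
    "max_features": ["sqrt", "log2"],
    "max_depth": [5, 10, 20],
    "min_samples_leaf": [1, 5, 10],
}


def split_windows(samples, rate_hz, window_s=60):
    """Cut a recording into consecutive non-overlapping windows.

    The window length in samples follows the recording's own rate, so no
    resampling is needed; an incomplete trailing window is dropped.
    """
    size = int(rate_hz * window_s)
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def assign_folds(participant_ids, k=5, seed=0):
    """Randomly distribute participants (not windows) into k folds."""
    ids = sorted(set(participant_ids))
    random.Random(seed).shuffle(ids)
    return {pid: i % k for i, pid in enumerate(ids)}


def sample_hyperparameters(space, n_sets=10, seed=0):
    """Random search: draw n_sets random combinations from the search space."""
    rng = random.Random(seed)
    return [{name: rng.choice(values) for name, values in space.items()}
            for _ in range(n_sets)]


def nested_cv(participant_ids, fit_and_score, space, k=5, n_sets=10, seed=0):
    """Return the mean and standard deviation of the outer-fold scores.

    fit_and_score(train_ids, test_ids, params) is a caller-supplied
    (hypothetical) function that trains a model and returns a score
    where higher is better, e.g. an F1-score.
    """
    folds = assign_folds(participant_ids, k, seed)
    candidates = sample_hyperparameters(space, n_sets, seed)
    outer_scores = []
    for outer in range(k):
        test_ids = [p for p, f in folds.items() if f == outer]
        dev_ids = [p for p, f in folds.items() if f != outer]

        def inner_score(params):
            # Inner loop: validate on each remaining fold in turn.
            scores = []
            for inner in range(k):
                if inner == outer:
                    continue
                val_ids = [p for p in dev_ids if folds[p] == inner]
                train_ids = [p for p in dev_ids if folds[p] != inner]
                scores.append(fit_and_score(train_ids, val_ids, params))
            return statistics.mean(scores)

        best = max(candidates, key=inner_score)
        # Outer loop: refit on all development data, score on the held-out fold.
        outer_scores.append(fit_and_score(dev_ids, test_ids, best))
    return statistics.mean(outer_scores), statistics.stdev(outer_scores)
```

Assigning whole participants to folds, as the text describes, keeps all windows from one person on the same side of every split, so the outer-fold error reflects performance on unseen individuals.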
Then, the model with the highest F1-score was chosen. The F1-score was used to compare across age groups because it is robust to the class imbalance seen in the PA type and intensity categories. There is no absolute criterion for a "good" F1 value, but values above 0.80 generally indicate good performance. For the continuous energy expenditure data (METs), the root mean square error (RMSE) was used to evaluate performance.
Figures 4-6 show the confusion matrices of recognizing PA intensity across age groups.
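For reference, the two evaluation metrics can be written out with their standard definitions in a short Python sketch. The function names and the toy labels are illustrative only and are not taken from the study data.

```python
import math


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for one binary task."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def rmse(y_true, y_pred):
    """Root mean square error, here in the units of the target (METs)."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Toy example: 1 = sedentary window, 0 = non-sedentary window.
labels = [1, 1, 0, 0, 1]
predictions = [1, 0, 0, 1, 1]
f1 = f1_score(labels, predictions)   # precision = recall = 2/3
error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

Because F1 ignores true negatives, a model that labels every window as the majority class scores poorly on the minority class, which is why it suits the imbalanced binary tasks.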

Results
Similarly, the confusion matrices of the models are consistent with the F1 scores shown in Table 3. Performance metrics of recognizing physical activity type and estimating energy expenditure.
Each value is the mean and standard deviation of the 5-fold nested cross validation.

Discussion
The goal of the study was to build accurate machine learning models to recognize the hallmark measures of physical activity and estimate energy expenditure across different age groups. We analyzed a large dataset of raw accelerometer data collected from the wrist position. We used the random forests algorithm, a widely used ensemble method in machine learning, to build the models. Results showed that the machine learning models were quite accurate at recognizing physical activity type and intensity and at estimating energy expenditure. However, models performed less well when recognizing individual physical activities. Our hypothesis that increasing age would impact model performance was rejected, as only slight differences were detected among age groups.
The models built to recognize physical activity type showed high performance for all age groups, as shown in Table 3. The model built on the young age group achieved the highest performance, followed by the middle and then the old age group, for all activity types. Additionally, performance was highest for sedentary, then locomotion, then lifestyle activities for all age groups. Physical activity types seem to be more distinguishable and cause less confusion at younger ages, as reflected in the confusion matrices shown in Figures 1-3. It is hard to interpret the drop in performance from the young to the old age group. One potential cause is that deviations from the standardized protocol are more common in older adults. For example, there was more variability in the trash removal activity among older adults than among younger adults (older adults could not pull the trash bag quickly). This suggests that the ML models need to account for these compensations more accurately in older populations. Another reason is that older adults prefer to wear the wrist device less tightly than younger adults do, which can result in unintended artifactual movement that occurred more commonly among older participants. An additional cause could be that the middle and old age groups include more participants' data than the young age group. Therefore, the models tend to generalize better and be less optimistic. On the other hand, the drop in performance from sedentary to lifestyle activity types is intuitive. Lifestyle activities typically require more wrist involvement (e.g., ironing, trash removal) than other physical activity types. This means more variability in physical activities as we move from sedentary to lifestyle activities, which can increase the confusion in recognizing physical activity types, as reflected in the confusion matrices shown in Figures 1-3.
The models built to recognize physical activity intensity showed relatively high performance for all age groups, but lower than the performance for recognizing physical activity type, as shown in Table 4. The highest performance alternated between the young and middle age groups, followed by the old age group, for all activity intensities. Additionally, performance was highest for low, then moderate, then light intensities for all age groups. As mentioned above, it is hard to interpret the drop in performance from the young to the old age group. Performance metrics and confusion for labeling physical activity intensities showed a consistent, although slight, reduction in the older age groups (see Table 4). If this error were scaled to free-living conditions over a typical day (16 hours), older adults would be expected to have 2% (~19 minutes) more mislabeling of PA intensity compared to a younger group.
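The scaling of the 2% difference to a typical 16-hour day is simple arithmetic, shown here as a small Python check. The function name is illustrative only.

```python
def extra_mislabeled_minutes(waking_hours, error_difference):
    """Minutes per day mislabeled due to a given difference in error rate."""
    return waking_hours * 60 * error_difference


# 16 waking hours = 960 minutes; 2% of 960 minutes is about 19 minutes.
extra = extra_mislabeled_minutes(16, 0.02)
```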
Models built to recognize individual physical activities showed lower performance than recognizing physical activity type. The highest F1-score was 0.784 in recognizing the computer work activity in the middle age group and the lowest was 0.263 for recognizing the dressing activity in the old age group.
The overall deterioration in recognition performance for individual activities compared to the other recognition tasks is intuitive, given the high number of classes and the data imbalance. Aggregating these activities into categories such as physical activity types or physical activity intensities helps to improve recognition performance, as observed in Table 3 and Table 4. In general, there were no consistent differences among age groups.
The scaled impurity-based feature importance ranking generated by the random forest algorithm shows how relevant these features are to the problem at hand and helps in better understanding the model.
a small number of participants, age-range being mostly < 40 years, a low number and diversity of activity types, and most importantly lacking sufficient data from the wrist position. Given these substantial differences, the models presented here show relatively higher performance than others. Additionally, the current model may generalize better due to the high diversity of activities, the wide age span, the gender and racial diversity, and the larger number of participants enrolled.
A limitation of the current study is that data were collected in controlled lab settings, which is an appropriate first step in evaluating positional differences [35]. Data collected in free-living settings better reflect the numerous transitions between activity types, but labeling the activity type in such data is challenging. Another limitation is the choice of window size, which was based on previous studies that extracted time- and frequency-domain features. This window size may not be the most appropriate for all tasks and age groups. Additional simulation work should evaluate different window sizes for optimizing performance.

Conclusions
In this study, we tested the hypothesis that machine learning model performance varies across age groups for recognizing hallmark measures of physical activity and estimating energy expenditure.
Overall, the results suggest that data features derived from wrist-worn accelerometers lead to moderate-to-high accuracy in estimating physical activity type, intensity, and energy expenditure in all age groups. In conclusion, machine learning models used to represent accelerometry data are robust to age differences, and a generalizable approach might be sufficient for use in accelerometer-based devices (smartwatches and activity trackers).