Physical Activity in Community-Dwelling Older Adults: Which Real-World Accelerometry Measures Are Robust? A Systematic Review

Measurement of real-world physical activity (PA) data using accelerometry in older adults is informative and clinically relevant, but not without challenges. This review appraises the reliability and validity of accelerometry-based PA measures of older adults collected in real-world conditions. Eight electronic databases were systematically searched, with 13 manuscripts included. Intraclass correlation coefficients (ICCs) for inter-rater reliability were: walking duration (0.94 to 0.95), lying duration (0.98 to 0.99), sitting duration (0.78 to 0.99) and standing duration (0.98 to 0.99). ICCs for relative reliability ranged from 0.24 to 0.82 for step counts and 0.48 to 0.86 for active calories. Absolute reliability ranged from 5864 to 10,832 steps for step counts and from 289 to 597 kcal for active calories. ICCs for responsiveness for step count were 0.02 to 0.41, and for active calories 0.07 to 0.93. Criterion validity for step count ranged from 0.83 to 0.98. Percentage of agreement for walking ranged from 63.6% to 94.5%; for lying 35.6% to 100%, sitting 79.2% to 100%, and standing 38.6% to 96.1%. Construct validity between step count and criteria for moderate-to-vigorous PA was rs = 0.68 and 0.72. Inter-rater reliability and criterion validity for walking, lying, sitting and standing duration are established. Criterion validity of step count is also established. Clinicians and researchers may use these measures with a limited degree of confidence. Further work is required to establish these properties and to extend the repertoire of PA measures beyond “volume” counts to include more nuanced outcomes such as intensity of movement and duration of postural transitions.


Introduction
Physical activity (PA) has been defined as "any bodily movement produced by skeletal muscles that requires energy expenditure" [1]. An increase in PA in older adults is associated with improved muscular strength [2] and lower risk of disability [3], and may also protect against cognitive decline [4]. The beneficial effects of PA on functional tasks such as walking in older adults have also been reported [5,6], which is important given that loss of functional ability is associated with functional dependence [7,8] and can lead to social isolation [9] and malnutrition among older adults [10].
Measurement of physical activity in older adults is therefore informative and relevant. The use of body-worn sensors (wearables) to objectively quantify PA is a welcome advancement in the field, given the potential for inaccuracy and bias inherent in self-reported data from questionnaires, which are most commonly used [11]. Wearables are defined as devices that can be worn on the skin or attached to clothing to continuously and closely monitor an individual's activities, without interrupting or limiting the user's motions (adapted from [12]). Wearables typically incorporate accelerometers to enable continuous (usually seven days), unobtrusive monitoring in real-world environments [13]. Real-world environments generally include the home (which could be in a retirement village) and other free-living environments such as parks and cafes. This confers an advantage over data collected in controlled or simulated environments, which may be subject to observational bias and other influences [13,14] and are not reflective of real-world conditions [15,16]. Data collected and processed in a controlled laboratory environment, usually over shorter periods than in real-world settings, are not reflective of data collected and processed in real-world environments, especially for older adults [17].
Associated with this advance is the development of novel outcomes from accelerometry data. Frequency counts (e.g., number of sit-to-stand transitions in a day), intensity (e.g., stroll versus run), pattern (e.g., accumulation of bouts of walking), and within-person variability give more information than simple volume measures (e.g., total amount of walking time) and therefore provide a greater understanding of the composition of physical activity and its association with functional tasks [18].
Several challenges to capturing real-world PA data have been identified. Measurement accuracy in older adults is compromised by the use of walking aids, slower gait, lower physical activity intensity, reduced cognitive ability and reduced adherence to research instructions, all of which pose technical challenges to the detection, capture and analysis of movement [13,19]. Moreover, hardware-related costs and the technical competency required to interpret accelerometry data could further add to these challenges [20].
Despite these constraints, there has been a marked increase in accelerometry-based PA research, including large-scale, population-based studies (e.g., [21,22]), which in turn has raised questions about the robustness of these metrics: how reliable, valid and responsive are they? Few studies have examined these questions in depth (e.g., [23,24]). A recent review update reported that consumer-grade activity trackers tend to overestimate step count (167.6 to 2690.3 steps per day), with slower and impaired gait reducing the level of agreement (<10% at gait speeds of 0.4-0.9 m/s for ankle placement) with reference methods (e.g., ActiGraph) [25]. However, that review included both laboratory and real-world environments, which limits generalizability. Wearables need to be validated under conditions representative of their intended use, that is, in real-world environments [15,16].
In view of the questions that arise from this rapidly expanding field, a synthesis of current research concerning accelerometry-based PA measurement using wearables is required. We pose three key questions: (a) Which PA movements (e.g., sitting, standing, walking) are reliably and validly measured using accelerometers in community-dwelling older adults in real-world conditions? (b) Is the measurement of these PA movements able to show change? (c) How were these PA movements quantified in terms of the type, number, and location of the accelerometers, and duration (time spent) of monitoring? We also report on adherence, usability, and acceptability for wearables where reported.

Materials and Methods
This systematic review followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines [26] and was registered with the National Health Service PROSPERO database under the registration number: CRD42021228010.

Search Strategy
Systematic searches were conducted across the following eight databases: AMED, CINAHL, Embase, IEEE, Medline, PsycINFO, Web of Science and Scopus. In addition, reference lists of review articles and included studies were hand searched to identify additional relevant studies. The search criteria were limited to studies conducted in the English language. An initial search included articles published between 1 January 2010 and 18 January 2021. A follow-up search was conducted on 25 November 2022. A lower limit of 2010 was chosen because, given the rapid technological advancement in the design and development of accelerometers, earlier devices are not comparable to those currently used.

Inclusion and Exclusion Criteria
Table 1 shows the eligibility criteria that were employed in this review.

Data Extraction and Abstraction
All searches were imported and screened for duplicates in EndNote X9 (Version 3.3). The titles were initially screened by KAJ in EndNote X9, then the selected titles and their respective abstracts were exported as a CSV file and imported into a web-based systematic review software, Rayyan [27]. Thereafter, the remaining abstracts were screened by two reviewers (RMA & KAJ) in a blinded process. Disagreements over inclusion were adjudicated and resolved by a third reviewer (SL). Reasons for exclusion were recorded for abstracts based on the inclusion/exclusion criteria. Following the abstract screening, the remaining full texts were independently reviewed by two reviewers (RMA & KAJ).
A data extraction form was used to standardize the information extracted from each article. KAJ extracted the data, which were verified by RMA.

Clinimetric Properties
Inter-rater reliability was established as the degree of agreement between two independent observers of duration of activities from videos and reported as intra-class correlation (ICC, 95% CI). Relative reliability was established as the degree of agreement in terms of ranks or position of individuals within a group and reported as intra-class correlation (ICC, 95% CI). Absolute reliability was established as the degree of agreement in terms of precision of the individual measurements and reported as minimal detectable change (calculated using the standard error of measurement). Responsiveness was established as the ratio between minimally clinically important change on the measure and mean squared error of the response obtained from an analysis of variance model and reported as Guyatt's responsiveness (GR) coefficient [28].
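To illustrate how absolute reliability relates to the ICC, the following minimal Python sketch computes the standard error of measurement (SEM) and the minimal detectable change at the 95% level (MDC95). It assumes the conventional formulas SEM = SD × √(1 − ICC) and MDC95 = 1.96 × √2 × SEM; the included studies may have used other variants. The function names and input values are hypothetical and are not taken from any included study.

```python
import math


def standard_error_of_measurement(sd: float, icc: float) -> float:
    """SEM = SD * sqrt(1 - ICC), where sd is the between-subject standard deviation."""
    if not 0.0 <= icc <= 1.0:
        raise ValueError("icc must lie in [0, 1]")
    return sd * math.sqrt(1.0 - icc)


def minimal_detectable_change(sem: float, z: float = 1.96) -> float:
    """MDC = z * sqrt(2) * SEM; z = 1.96 corresponds to the 95% level (MDC95)."""
    return z * math.sqrt(2.0) * sem


# Hypothetical daily step-count example (assumed values, for illustration only).
sem = standard_error_of_measurement(sd=3000.0, icc=0.75)
mdc95 = minimal_detectable_change(sem)
print(f"SEM = {sem:.0f} steps, MDC95 = {mdc95:.0f} steps")
```

Under these assumptions, a lower ICC or a larger between-subject spread gives a larger MDC, meaning that a larger change is needed before it can be distinguished from measurement error.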
Criterion validity was established either as agreement between a gold standard reference and accelerometry or as percentage of agreement between video observation and accelerometry and was reported as ICC or as F-Score (for comparison between different algorithms) or as sensitivity, specificity, accuracy, precision, or positive predictive values.
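The agreement statistics named above can be derived from a two-by-two comparison of accelerometer output against the reference (e.g., video annotation). The sketch below is a generic illustration using the standard definitions; the function name and the epoch counts are hypothetical, and it assumes that every denominator is non-zero.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard agreement statistics from confusion-matrix counts.

    tp/fn: reference-positive epochs detected/missed by the accelerometer;
    tn/fp: reference-negative epochs correctly rejected/wrongly flagged.
    """
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)     # also called positive predictive value
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts of one-second epochs labelled "walking" (illustration only).
metrics = classification_metrics(tp=80, fp=20, tn=880, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

When the activity of interest is rare, as in this hypothetical example, accuracy and specificity can be high while precision remains modest, which is why several of these statistics are usually reported together.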

Risk of Bias Assessment
The Appraisal tool for Cross-Sectional Studies (AXIS) checklist was used to evaluate the risk of bias for all studies included in this review [34]. Two reviewers (RMA and KAJ) independently assessed the quality of the studies, with a third reviewer (SL) settling any disagreements.

Study Selection
The initial search identified 13,872 records, of which 6206 duplicates were removed. The remaining 7666 titles were screened, resulting in 768 records carried through to abstract review. The updated search conducted on 25 November 2022 identified 3179 records, of which 1455 duplicates were removed. The remaining 1724 titles were screened by KAJ, resulting in 144 records for the abstract stage (Figure 1). Two reviewers (RMA and KAJ) screened the abstracts based on the inclusion and exclusion criteria and identified 79 records for full-text review. Thirteen records passed through to the final full-text review stage. Two further publications were retrieved from reference lists, one of which was classified as a Research Letter. Reasons for exclusion included study setting other than "real-world", such as a semi-structured or simulated real-world environment (n = 24); PA metrics relevant to the review were not reported (n = 16); or the study did not report any clinimetric data (n = 12). EndNote was used to index all records.

Quality of Studies
Thirteen studies included in this review achieved a minimum score of 70% (i.e., 12 out of a possible 17) based on the AXIS toolkit [34] (Table 2). Thus, two studies were excluded from the review due to methodological weaknesses [35,36].

Note: "Q" refers to question, so "Q1" denotes "Question 1". Each 1 (in green font) represents an affirmative appraisal score for that question, while each 0 (in red font) represents a negative appraisal score for that question. Please see Downes et al. [34] for more details.
1 These questions related to non-responders were not included in the tabulation of the scores.
2 A negative response to the question "Were there any funding sources or conflicts of interest that may affect the authors' interpretation of the results?" is taken as a score of "1" and vice versa.
Brand et al. [38] Cross-sectional validation study; free-living measurements were carried out in participants' homes and in the community.
To detect gait from wrist worn tri-axial accelerometer recordings of daily living of older adults, using an anomaly detection algorithm and compared its performance to four previously published gait detection algorithms.

ND * ND * ND *
The current study did not investigate shorter gait bouts (<6 s).
Briggs et al. [31] Cross-sectional validation study; free-living measurements were carried out in participants' homes and in the community.
To determine the content validity of the Garmin Vivosmart HR compared to ActiGraph GT3X+ for the domains of daily step count and MVPA 2 .
Veterans aged ≥65; able to perform ADL; able to follow instructions in a group setting; free from ischemic heart and severe lung diseases. Does not require walking assistive devices; able to provide written consent.

ND * ND *
The participants were mainly male-this limits the generalizability of the findings.

ND *
The accuracy of the consumer-level devices is based upon agreement with existing reference devices, which assumes their validity in an older population. The population was composed solely of healthy older adults who were independently ambulatory, so the findings might not be generalizable to frail older populations. There was no objective means to determine whether participants wore the device in accordance with the protocol.
To assess the validity of a sensor-based method to detect time-on-legs (standing) and daily life mobility related postures in older adults based on a necklace-worn motion sensor.To evaluate user opinion about the practical use of the sensor.
Community-dwelling or living in an older adult home; aged ≥70 years; able to walk 10 m without support or with a cane or walker.
Orthopaedic impairments that debilitate the ability to walk unsupported for ten metres; total hip or knee replacement surgery in the previous six months; having had a stroke within the last six months; Parkinson's disease stage 4/5 or other neurologic diseases that can impair daily functioning or visual problems to a degree that make it impossible for the participant to accurately read the questionnaires or walk around safely.
Validation was carried out in semi-structured home environment and not in lab settings.Both frail and non-frail participants were included in the study.
Free living data collection was limited to 30 min only.
Outdoor activities, such as cycling, were not included in the study. Participants performed movements in a rushed manner to complete several tasks, which has an impact on accuracy.
* ND: not described. 1 Bourke et al. [44,45]. This was based on 20 subjects, who included obese older adults. Data for the 16 involved in the free-living studies alone were not provided. 2 MVPA: moderate-to-vigorous physical activity.

Study Protocol
Study protocols for validity testing varied with respect to the reference standard, the outcome of interest, environment and duration of testing, as well as the location of sensors.

Reference Standard
Six studies validated consumer-grade wearables against research-grade reference accelerometers [28,29,31,38,39,42]. Five studies used video or visual observation as the "gold standard" reference [32,33,37,40,41]. One study used both video and research-grade reference accelerometers [30]. One study used the doubly labelled water (DLW) method as its validation reference [43] (Table 5). The DLW method is an established technique for measuring energy expenditure, based on estimating the rate of CO2 elimination from the body [46].

Environment
Nine studies collected and validated real-world accelerometry data exclusively within the participants' home/retirement village environment [29–32,37,38,41–43]. One study investigated criterion validity in a controlled setting within a retirement village as well as in participants' home environment [33]. Dijkstra and colleagues investigated criterion validity in the laboratory environment with 20 participants and also carried out further validation in a real-world home environment with five participants [40]. Burton and colleagues investigated intra-rater reliability using the two-minute walk test (2MWT) in a laboratory environment, but construct validity in the home [39]. Kastelic and colleagues conducted a test battery that included common real-life tasks within the laboratory environment, as well as an uncontrolled free-living study [28]. For the purposes of this review, we have included only the home or free-living environment data in our analysis.

Duration of Wear
Duration of wear varied among the studies, ranging from under 10 min to 14 days: <10 min (n = 45) [32,33]; 30 min (n = 25) [40,41]; 100 min (n = 16) [37]; 12 h (n = 37) [42]; two days (n = 35) [31]; four days (n = 50) [28]; seven days (n = 57) [29,30]; ten days (n = 12) [38]; and 14 days (n = 74) [39,43]. None of the studies exceeded the 14-day duration. Duration of wear was influenced by the choice of criterion, in the sense that studies relying on research-grade accelerometers [28–31,39,42] or the DLW method [43] as their reference standard captured uninstructed daily activities (excluding water activities) during waking hours and exceeded a 12 h period. By contrast, studies that employed video or direct observation as reference [32,33,37,40,41] limited their duration of real-world observation to a maximum of 100 min, with two studies capturing less than 10 min of activities [32,33] (Supplementary Table S2).

Thereafter, they were allowed to move freely for 3 min with the only instruction that taking the stairs, sitting and lying had to be completed at least once. They also performed usual domestic activities (such as doing the dishes, watering plants, hanging up laundry or mowing the lawn). During all activities, participants were video recorded. The start and end of each activity were scored by an observer from the video analysis. Inter-rater reliability was determined for the fixed sequence task by two raters for 10 participants.

Mean activity duration of walking, sitting, standing and lying. Walking was determined from the heel-off of the initial step until full floor contact of the foot making the last step (see note 4), with two or more steps taken. Persons were considered to be sitting when their upper body was upright and at a 90° angle to the legs. Standing was determined when the participant was in an upright position with no or a small displacement, but no distinctive steps, of the feet. Lying was defined as the person being in a horizontal position and either the side or the back of the body contacting the bed.
Inter-rater reliability was calculated from two independent raters. Intraclass correlation coefficients (ICCs) for the duration of walking, sitting, standing and lying were 0.95, 0.78, 0.99 and 0.98, respectively.
Agreement per participant ranged between 68.3 and 85.9% (mean = 79.8%; SD = 6.9).
ND *
* ND: not described.
1 Refer to Awais et al. [37] for more details on these algorithms.
2 From Bourke et al. [45]. See Bourke et al. [44] for a complete list of recommended activities, which included sitting, lying, preparing food or drink while standing and setting up the table. See also Awais et al. [47].
3 TUG: Timed Up and Go assessment.
4 A step was defined as a forward displacement of the foot together with a forward displacement of the trunk.
5 Walking was defined when the person was moving the feet forward in a walking pattern with the trunk in a forward displacement, from when the heel of the foot cleared the ground for the initial step until the foot of the closing step made complete contact with the floor, with a minimum of 2 steps.
6 Only data from the extended protocol group are discussed.
7 Only step counts and active calories burned are reported here as common metrics between the three accelerometers and the reference.
8 cpm: counts per minute.
9 2MWT: two-minute walk test.
10 [48].
11 Based on Najafi et al. [49].
12 Only results of unscripted data are further discussed in this paper.
13 These are based on combined durations of scripted and unscripted activities. Separate data on scripted and unscripted activities were not provided.
14 [50].

Sensor Location
The most common location of wear was the wrist (n = 205) [28,29,31,37,39,42], followed by the back (lower back or back of the waist) (n = 104) [32,33,40,43]. One study also extended the validation of the Misfit Shine accelerometer to the waist as an additional location of wear because this device was designed to be worn in both places [29]. One study used the wearable on the right hip (n = 32) [30] and another used the wearable as a necklace (n = 20) [41]. Awais and colleagues investigated data from participants who wore four accelerometers concurrently on the wrist, chest, lower back and thigh (n = 16) [37] (Supplementary Table S2).
The location of sensors appeared to influence accuracy. Studies that used a single sensor close to the participant's center of gravity, such as the waist [29], hip [30] or lower back [32,33,40], reported higher sensitivities than those with sensors placed on the wrist or around the neck, supporting the findings of a recent review [25].

Reliability
Inter-rater reliability under real-world conditions was reported in four studies (n = 63) [33,37,40,41]. Dijkstra et al. [40] reported inter-rater reliability of activity durations for ten participants, rated by two independent observers via video analysis: walking (0.95), sitting (0.78), standing (0.99) and lying (0.98). Taylor et al. [33] also reported excellent inter-rater reliability between two independent observers on ten randomly selected videos: walking (0.94), sitting (0.99), standing (0.98) and lying (0.99). Geraedts et al. [41] reported an ICC of 0.91 for inter-rater agreement of the video observation in the free movement protocol. Awais et al. [37] reported that the overall level of agreement for out-of-lab activities was above 0.90 for one randomly selected video that was rated by five independent raters.
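For readers unfamiliar with how an inter-rater ICC is obtained from observer-scored durations, a minimal Python sketch follows. The included studies do not all state which ICC form was used; this sketch assumes the Shrout and Fleiss ICC(2,1) (two-way random effects, absolute agreement, single rater). The function name and the example durations are hypothetical.

```python
def icc_2_1(ratings: list[list[float]]) -> float:
    """ICC(2,1) from a subjects-by-raters table (one row per subject).

    Assumes a complete table with at least two subjects and two raters.
    """
    n = len(ratings)       # number of subjects
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares: subjects, raters, and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                # between-subject mean square
    msc = ss_cols / (k - 1)                # between-rater mean square
    mse = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical walking durations (seconds) scored by two raters for five participants.
durations = [[120, 118], [95, 97], [60, 61], [150, 149], [30, 33]]
print(f"ICC(2,1) = {icc_2_1(durations):.3f}")
```

Because this form penalizes systematic differences between raters as well as random disagreement, it is a conservative choice when raters are treated as interchangeable.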
Intra-rater reliability under real-world conditions was reported as relative and absolute reliability. Relative reliability of step counts from commercial accelerometers ranged from poor to good for both single-day and three-day averages. The results were similar for average active calories (calculated as the difference between total calories computed by the accelerometers and basal metabolic rate estimated according to Harris and Benedict [51]). Absolute reliability was generally better (i.e., lower) for three-day averages of step count and active calories than for single-day measures [28].
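The derivation of active calories described above can be sketched as follows. This is an illustration only: it assumes the original Harris-Benedict coefficients (weight in kg, height in cm, age in years), whereas the source does not state which version of the equation or which participant inputs were used. The function names and example values are hypothetical.

```python
def harris_benedict_bmr(sex: str, weight_kg: float, height_cm: float, age_years: float) -> float:
    """Basal metabolic rate (kcal/day) using the original Harris-Benedict coefficients (assumed)."""
    if sex == "male":
        return 66.4730 + 13.7516 * weight_kg + 5.0033 * height_cm - 6.7550 * age_years
    if sex == "female":
        return 655.0955 + 9.5634 * weight_kg + 1.8496 * height_cm - 4.6756 * age_years
    raise ValueError("sex must be 'male' or 'female'")


def active_calories(total_kcal: float, bmr_kcal: float) -> float:
    """Active calories = device-reported total calories minus estimated BMR."""
    return total_kcal - bmr_kcal


# Hypothetical participant and device reading (assumed values, illustration only).
bmr = harris_benedict_bmr("female", weight_kg=65.0, height_cm=160.0, age_years=70.0)
print(f"BMR = {bmr:.0f} kcal/day, active = {active_calories(1700.0, bmr):.0f} kcal/day")
```

Because the BMR term is an estimate, any error in it carries directly into the active-calorie value, which may partly explain why this derived measure behaves differently from step counts.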

Validity of Accelerometers
The overall sample size for testing criterion validity was n = 321, including diverse populations and incorporating a range of study protocols. Criterion validity between research-grade wearables and consumer-grade wearables was excellent for step counts measured at the right hip, ICC = 0.94 (95% CI [0.88, 0.97]) (FitBit One/Zip versus ActiGraph GT3X+) [30], and at the waist, ICC = 0.96 (95% CI [0.91, 0.99]) (Misfit Shine versus ActiGraph GT3X+) and ICC = 0.91 (95% CI [0.79, 0.97]) (versus NL2000i) [29], but lower at the wrist, where ICC ranged from 0.83 (95% CI [0.59, 0.93]) (Misfit Shine versus NL2000i) to 0.86 (95% CI [0.67, 0.94]) (Misfit Shine versus ActiGraph GT3X+) [29]. Consumer-grade wearables overestimated the average daily step count relative to reference devices in two studies [28,30]. Results were mixed in another, which employed two different locations of wear (wrist and waist) as well as two different research-grade reference devices (ActiGraph and NL2000i) [29]. Garmin Vivosport and Garmin Vivoactive 4s performed much better than Polar Vantage M for step counts (0.98 versus 0.37 and 0.95 versus 0.37, respectively) [28]. Briggs et al. [31] found no significant difference (p = 0.22) between the daily step count from the wrist-worn Garmin Vivosmart HR and the reference, the hip-worn ActiGraph GT3X+: ICC = 0.94 (95% CI [0.88, 0.97]). This study also reported that differences in step-count-derived MVPA were reduced using age-specific cut-offs [31]. Kastelic et al. [28] reported that computed measures such as activity calories, which were derived from accelerometry and heart rate data, did not perform as well as step counts from accelerometry. For all three devices, the agreement for activity calories was lower than the agreement for step counts: Garmin Vivosport, 0.58; Garmin Vivoactive 4s, 0.55; and Polar Vantage M, 0.15.
Two studies validated algorithms developed to detect the duration of gait bouts estimated from a single wrist-worn consumer-grade wearable. Brand et al. [38] collected ten days of data from 12 older adults, and reported that the new algorithm had 76.2% accuracy, 29.9% precision, 67.6% sensitivity and 78.1% specificity for detecting gait bouts. Soltani et al. [42] collected 12 h of data from 37 older adults, and reported even better scores: accuracy was 95.2%, precision was 71.8%, sensitivity was 87.1% and specificity was 96.7% for detecting gait bouts. They also compared their method with previously published algorithms. Their algorithm's F1-score was 74.9%, which was better than or close to that of earlier studies which utilized multiple accelerometers [42,47].
Construct validity of step counts and moderate-to-vigorous physical activity (MVPA) between the GENEActiv accelerometer and consumer-grade wearables (Fitbit Flex and Fitbit ChargeHR) was reported as a moderate level of agreement between the devices (ICC for Flex: 0.68; ICC for ChargeHR: 0.72) [39].
In summary, these results suggest that step counts and duration of walking, lying, sitting and standing can be measured robustly to a certain degree using a single accelerometer.However, further work is required to understand better how the location of wear and type of reference standard affect accuracy.
One study [43], which investigated the validity of a triaxial accelerometer against the doubly labelled water (DLW) method for total energy expenditure, reported that the 24 h average metabolic equivalent (MET) of the Actimarker was significantly correlated with the PA level assessed by DLW but significantly underestimated it (p < 0.001). Furthermore, the correlation between daily step counts and the PA level from DLW was moderate: R² = 0.248 (p < 0.001).

Responsiveness of Accelerometers
Only one study reported on the responsiveness of accelerometry (i.e., the capacity of an accelerometer to identify possible changes in PA outcomes associated with a clinical condition over time) [28]. Single-day measures of step counts performed better than three-day averages for Garmin Vivoactive 4s (GR = 0.411 vs. 0.041) and Polar Vantage M (GR = 0.126 vs. 0.060), but not for Garmin Vivosport (GR = 0.022 vs. 0.288). However, all three devices showed relatively weak to moderate responsiveness for active calories (GR > 0.232) for both single-day and averaged-day measurements, except for Garmin Vivosport (single-day GR = 0.073) [28] (Table 5).

Acceptability and Adherence of Accelerometers
Only one study planned and purposefully measured adherence. Geraedts et al. [41] reported 100% adherence during daytime and 80% during sleep from a necklace sensor worn for seven days. The authors also collected information on the level of comfort, weight, size and usability of the sensor when worn during the daytime using a user-evaluation questionnaire on a scale of 1 to 5. They reported a high mean score of 4.4 ± 0.6 and concluded that user acceptance was high. Three studies reported adherence based on missing data [28,29,39]. Farina et al. [29] required participants to wear five devices (two on the wrist and three on the waist) over seven consecutive days and reported excluding three participants (12%) from their analysis due to receiving less than four days of data from the reference device, which indicated that adherence was low for longer durations of data capture. Burton et al. [39] reported that close to 50% of participants had some missing data from their wrist-based wearables over 14 days, also suggesting that adherence declines with increasing duration of data capture. Kastelic and colleagues [28] reported adherence, based on wear time, for three different accelerometers worn on the non-dominant wrist (together with a reference accelerometer on the waist) over 12 days, each device for four days. Wear time with the Polar Vantage M, Garmin Vivoactive 4s and Garmin Vivosport was as high as 24.0 ± 0.1 h/day, 23.9 ± 0.5 h/day and 23.9 ± 0.5 h/day, respectively. None of the four studies reported age- or gender-related differences (Supplementary Table S2).

Summary of Results
Table 6 summarizes the clinimetric properties of accelerometry-based PA measures of older adults collected in real-world conditions.

Discussion
To our knowledge, this review is the first to examine the reliability and validity of accelerometry-based PA measures of older adults collected in real-world conditions. Moderate to strong ICCs for inter-rater reliability and criterion validity tentatively establish step count and duration of walking, sitting, standing and lying as robust outcomes. Variations in the methods, such as location of sensors and duration of wear, highlight differences in the strength of the validity and reliability of the outcome measures. This also points to a need for standardization of protocols for wearing accelerometers in future studies. However, this review identified limitations in the current literature, specifically that most of the outcomes are limited to volume metrics.

Reliability of PA Measures
Good to excellent inter-rater reliability was observed for the durations of sitting, standing, walking and lying activities. Inter-rater reliability of step counts in real-world environments was not reported. Intra-rater reliability differed by brand and type of measure. The Garmin Vivosport and Garmin Vivoactive 4s had better relative and absolute reliability than the Polar Vantage M for both step counts and active calories. Metrics derived from step counts, such as activity intensities (e.g., MVPA), were not as reliable as step counts. None of the studies investigating duration of PA activities reported intra-rater reliability. The reasons for omitting reliability testing were not discussed by the authors. This omission limits our understanding as to whether accelerometry-based PA measures, such as the durations of sitting, standing, walking and lying activities, are affected by the individual observers when captured in real-world conditions.

Validity of PA Measures
The most common "gold-standard" reference for criterion validity was research-grade accelerometers, followed by video or direct observation. In all but one reported study, a single tri-axial accelerometer was sufficient to distinguish PA validly. However, there was a lack of homogeneity for real-world assessments with respect to sensor location and duration, the brand of accelerometer employed, and the instructions given to participants when carrying out uninstructed daily activities.
As noted above, the duration of wear varied amongst studies, which is partially attributable to the level of intrusiveness of the reference devices used. There seems to be no consensus on the minimum duration for accelerometry-related validation studies, but a minimum of 30 min of semi-structured activities has been previously recommended for real-world settings [52]. Capturing, processing and annotating videos that are several days in length might be challenging, and the alternative seems to be to aim to capture as many commonly performed activities as possible within a shorter timeframe [52]. Additionally, merging and synchronizing of data is challenging, although the use of platforms seems to offer some promise [53]. Intrinsic factors (motivation, personal preferences) and extrinsic factors (weather, environment) may affect habitual physical activity performance [15,54]. Although a shorter timeframe seems to be a reasonable compromise between duration and practicality, it is questionable whether the variations in intrinsic and extrinsic factors within daily PA could be captured within such a timeframe.
Chigateri et al. [32] and Dijkstra et al. [40] provided limited instructions for unstructured real-world activities, e.g., "what they normally do during the day", whereas others were more explicit. Taylor et al. [33] instructed their participants to include common activities such as walking, lying, sitting, and standing, while Geraedts et al. [41] included common household chores such as vacuuming and clearing dishes. It is noteworthy that, among these four commonly reported PA (walking, lying, sitting and standing), the sensitivity for sitting and standing was lower than that for walking and lying. A single sensor on the lower back could not sufficiently distinguish sitting from standing, which could have led to misclassification of these two activities in two studies [33,40]. However, Chigateri et al. [32] reported that walking duration was overestimated with the uSense wearable device and postulated that there was a higher likelihood for algorithms to overestimate walking duration since inactive durations such as 'pauses during walking' between walks could have been misclassified as walking time [32]. The dataset of Awais et al. [37] consisted of 15 common free-living activities (see [44,45]) that were performed in an order that suited the participants' preferences, but with no other instructions. This study compared the use of machine learning and deep-learning techniques to classify data from four accelerometers, concurrently worn on four different locations on the body, as walking, lying, sitting, and standing activities. Although the use of additional accelerometers to classify activities produced much better results than studies that used a single sensor, it was not conclusive as to which technique (machine learning or deep learning) was superior, since the results of both methods plateaued [37].
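To make the sensitivity and duration comparisons above concrete, the following Python sketch scores epoch-by-epoch activity labels from a wearable against reference labels (for example, video annotation). The function names, the one-second epoch length and the label sequences are hypothetical and are not taken from any included study; the example simply shows how a pause labelled as walking inflates walking duration while lowering the sensitivity for standing.

```python
from collections import Counter

ACTIVITIES = ("walking", "lying", "sitting", "standing")


def per_activity_sensitivity(reference, predicted):
    """Share of reference epochs of each activity that the device labelled correctly."""
    if len(reference) != len(predicted):
        raise ValueError("label sequences must have the same length")
    totals = Counter(reference)
    hits = Counter(r for r, p in zip(reference, predicted) if r == p)
    return {a: hits[a] / totals[a] for a in ACTIVITIES if totals[a] > 0}


def duration_seconds(labels, activity, epoch_s=1.0):
    """Total time assigned to one activity, assuming fixed-length epochs."""
    return sum(1 for label in labels if label == activity) * epoch_s


# Hypothetical 20-epoch recording: a 2-epoch standing pause between two walks
reference = (["walking"] * 4 + ["standing"] * 2 + ["walking"] * 4
             + ["sitting"] * 5 + ["standing"] * 3 + ["lying"] * 2)
# Hypothetical device output: the pause is labelled as walking,
# and two sitting epochs are labelled as standing
predicted = (["walking"] * 10
             + ["sitting"] * 3 + ["standing"] * 2
             + ["standing"] * 3 + ["lying"] * 2)

sensitivity = per_activity_sensitivity(reference, predicted)
walk_ref = duration_seconds(reference, "walking")
walk_dev = duration_seconds(predicted, "walking")
print(sensitivity)
print(f"walking: reference {walk_ref:.0f} s, device {walk_dev:.0f} s")
```

Here walking sensitivity is perfect even though the device reports more walking time than the reference, which illustrates why sensitivity and duration agreement can diverge for the same recording.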
Step counts were generally overestimated by commercial-grade wearables, but the evidence was not conclusive since different brands of accelerometers elicited different results [28]. Metrics derived from step counts generally did not perform as well as direct step counts, although the choice of age-specific cut-offs could improve their accuracy [31].
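One common way to quantify over- or underestimation of step counts against a reference device is the mean bias with Bland-Altman 95% limits of agreement, alongside the mean absolute percentage error (MAPE). The Python sketch below assumes paired daily step counts; the function name and all values are hypothetical, and the included studies may have summarized agreement differently.

```python
from statistics import mean, stdev


def agreement_summary(device, reference):
    """Mean bias, Bland-Altman 95% limits of agreement and MAPE for paired counts."""
    if len(device) != len(reference):
        raise ValueError("paired measurements are required")
    diffs = [d - r for d, r in zip(device, reference)]
    bias = mean(diffs)      # positive bias means the device overestimates
    sd = stdev(diffs)       # sample SD of the paired differences
    mape = 100 * mean(abs(d - r) / r for d, r in zip(device, reference))
    return {
        "bias": bias,
        "loa_lower": bias - 1.96 * sd,
        "loa_upper": bias + 1.96 * sd,
        "mape_percent": mape,
    }


# Hypothetical paired daily step counts (commercial wearable vs research-grade reference)
wearable_steps = [6400, 8900, 4100, 10350, 7200]
reference_steps = [6000, 8500, 3900, 10000, 7100]
summary = agreement_summary(wearable_steps, reference_steps)
print(summary)
```

Reporting the limits of agreement as well as the mean bias matters here because two brands can share a similar average overestimation while differing widely in how consistent that error is across participants.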

Study Protocol
Validity and accuracy of the metrics varied with the duration of data collection. Soltani et al. [42] achieved very high accuracy in identifying gait bouts from 12 h of data. Brand et al. [38], who also used the wrist but collected data for up to ten days, reported worse results. However, the two studies used different accelerometers, and the choice and wear location of their reference devices also differed: one used the Axivity AX3 on the lower back, while the other used the ActiGraph GT9X Link on the shank. Furthermore, the algorithms implemented by these two studies were also different [31,42].
These discrepancies highlight the need for standardization of the methodology used in validation studies to allow comparison between their results and findings. Future validation studies should aim to adopt recommended methods and protocols relevant for community-dwelling older adults [45,52].
Interestingly, only one study investigated whether wearables could detect change over time, but the findings were mixed, inconclusive and device-specific [28]. The responsiveness of single-day measures of step counts was generally better than the three-day average, but this needs to be cautiously interpreted. The lack of evidence on responsiveness from more studies may reflect the recruitment of generally healthier older adults. There is greater impetus to establish responsiveness for people with neuro-musculoskeletal conditions, for example those with age-related degenerative conditions such as osteoarthritis [55].

Adherence to Study Protocol
The duration and location of wear of the sensor affected the level of adherence. Wrist-worn sensors yielded a high level of adherence, but increasing the duration of data capture could reduce the level of adherence and compliance [28,29,39]. Although wearing sensors on the wrist may be more natural than other locations such as the lower back and the ankle, there was a possibility that older adults might forget to put them back on after they had removed them, perhaps during sleep. Adherence was also high when sensors were worn as a necklace, perhaps because there was no need to remove them during sleep and because the study samples included a high proportion of females who may already be in the habit of wearing necklaces; however, this came at the expense of sensitivity. Only one study investigated the level of acceptance of the wearables they tested, possibly because the investigators were developing a new wearable prototype [41].
Real-world validation studies of older adults for different intensities of PA, such as different speeds or intensities of walking, are missing in the literature. We know that the accuracy of step counts was lower in participants who walked with a slower gait speed [25] or walked with lower intensity [56]. Also lacking are validation studies that test more nuanced metrics such as the duration of postural transitions, including sit-to-stand and stand-to-sit, which are important indicators of functional mobility and lower limb strength.
Real-world postural transitions, similar to other PA, are ecologically more valid when performed at home as they are executed in a familiar environment [52].
Despite this, there is a growing body of inference-based evidence from studies that use accelerometry to investigate associations between mortality, health and functioning in large populations.These studies indirectly examine aspects of validation such as construct validity [57] and predictive validity [58], thereby providing some assurances.

Strengths and Limitations of the Review
This systematic review used a comprehensive search strategy of eight databases, included clear inclusion and exclusion criteria, utilized the AXIS checklist to assess risk of bias, and followed the PRISMA guidelines. It also adopted a blinded adjudication process for the abstract and full-text review. The process followed in the review was designed to minimize bias and increase the transparency of the reporting.
Limitations included a focus on studies in the English language and exclusion of grey literature. Secondly, the sample size for most of the studies was small and predominantly female. Finally, not all the studies reported on the reliability of the wearables, and of those that did, all failed to report test-retest reliability. Both these latter limitations could have weakened the overall strength of the studies reported. In addition, larger-scale and longer-duration studies could better inform us on the level of adherence in wearing the accelerometers among older adults.

Conclusions and Implications for Future Research
Step counts and the durations of walking, sitting, standing and lying are reliably and validly measured using accelerometers in community-dwelling older adults in real-world conditions. However, only step counts have been reported to show change over time.
Robust outcomes from accelerometry monitoring of PA are limited to 'volume' counts such as number of steps and duration of sitting, standing, walking and lying, which points to the need for further research on nuanced PA outcomes to provide a more in-depth understanding of how PA affects functional tasks. Wrist-worn and neck-worn accelerometers are not as metrically robust as those worn at the waist, hip and lower back. Adherence and usability are negatively associated with duration of wear.
To extend the field of research, more real-world studies are needed, in particular studies that focus on generally healthy older adults and investigate more nuanced aspects of PA such as intensity of movement (e.g., slow walking versus running) and duration of postural transitions. Data from non-Caucasian populations are also needed. More longitudinal studies are needed to investigate the responsiveness of the metrics, for example, whether step counts are sensitive enough to detect fall risk in healthy community-dwelling older adults. Finally, future studies should also investigate wearability and acceptance of their wearables in larger sample cohorts. This will inform researchers on whether such wearables could be used in longer-term data collection processes.

Figure 1. PRISMA flow chart of study design.

Table 1. Eligibility criteria for article selection in this review.

Table 2. Methodological quality assessment of selected articles based on AXIS checklist.

Table 3. Sample size and basic demographic information of each population from the studies included in the systematic review.

Table 4. Design, settings and aims of studies included in the systematic review.

Table 5. Clinimetric properties and methods of studies included in the systematic review.

Table 6. Summary of clinimetric properties of PA measures in real-world conditions.