Objective Measurement of Posture and Movement in Young Children Using Wearable Sensors and Customised Mathematical Approaches: A Systematic Review

Given the importance of young children’s postures and movements to health and development, robust objective measures are required to provide high-quality evidence. This study aimed to systematically review the available evidence for objective measurement of young (0–5 years) children’s posture and movement using machine learning and other algorithm methods on accelerometer data. From 1663 papers, a total of 20 papers reporting on 18 studies met the inclusion criteria. Papers were quality-assessed and data extracted and synthesised on sample, postures and movements identified, sensors used, model development, and accuracy. A common limitation of studies was a poor description of their sample data, yet over half scored adequate/good on their overall study design quality assessment. There was great diversity in all aspects examined, with evidence of increasing sophistication in approaches used over time. Model accuracy varied greatly, but for a range of postures and movements, models developed on a reasonable-sized (n > 25) sample were able to achieve an accuracy of >80%. Issues related to model development are discussed and implications for future research outlined. The current evidence suggests the rapidly developing field of machine learning has clear potential to enable the collection of high-quality evidence on the postures and movements of young children.


Introduction
The first five years of a child's life are characterised by substantial and rapid neurophysiological development. An important aspect of development is the ability to assume different postures and perform a range of movements. Infants typically develop rapidly to be able to roll from supine to prone (~3-6 months), sit (~5-8 months), and crawl (~6-11 months) [1]. Movement capacity continues to develop throughout the toddler to preschooler phases of childhood, including learning to walk, then more dynamic and challenging tasks like climbing stairs, running, and jumping. Achieving these posture and movement abilities signals healthy development [2]. Failure or substantial delay in developing these abilities hinders full participation in society [3] and may increase risks for physical and mental health issues, for example, by reducing the ability to be sufficiently physically active [4]. There is therefore much interest from health and education professionals and parents in measuring the posture and movement of young children (0-5 years of age).
The most common method of measuring the quantity of a child's posture and movement in clinical and research settings is through subjective interview or a survey completed by a child's caregiver. However, these methods are known to be imprecise and biased [5]. Observation methods, either directly or from video, can be very accurate [6] and are the current gold standard. However, these approaches are limited in only capturing what a child does during the short period they are being observed/videoed, and a child may modify their behaviour when they know they are under observation [7]. Additionally, observational methods have a high human resource requirement, meaning population surveillance is not practical. Objective yet low-burden methods that can measure postures and movements over longer periods of time and in a child's natural environment are therefore desirable.
Small, wearable sensors known as accelerometers are commonly used to quantify time spent at different physical activity intensities [8]. A recent systematic review concluded that accelerometers and accompanying physical activity intensity software were feasible for all-day assessment in children and can provide a good indication of the total amount of activity and temporal patterns of activity [8]. The commercially available software often uses count-based algorithms that sum the data over pre-set time periods (e.g., 15 s), and then uses thresholds to classify these data into different intensities of movement, such as sedentary, light, moderate, or vigorous-intensity activity [9]. These algorithms were established by comparing activity counts with gold standard energy expenditure measures and have been found to be sufficiently accurate [9]. Studies utilising this technology have been pivotal in understanding the link between childhood physical activity and health. However, categorising children's movements into energy expenditure intensity categories overlooks potentially important aspects of specific postures and movements such as prone lying, sitting, standing, walking, and running [8]. For example, current intensity-based measures typically are not able to differentiate sitting from standing, despite these postures having different health implications [8]. Parents may also understand posture- and movement-based messages (e.g., 'tummy time') better than intensity-based measures (e.g., 'moderate' intensity). Thus, detailed information regarding the postures and movements a child performs daily will help clarify links with health and development outcomes, and refine policy, interventions, and public health messaging.
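The count-based approach described above can be illustrated with a minimal sketch. The cut-point values and function names below are hypothetical placeholders chosen for illustration; real cut points are device- and age-specific and are not given in this review.

```python
from typing import List

# Hypothetical cut points (counts per epoch), lowest first.
# These values are illustrative assumptions, not validated thresholds.
CUT_POINTS = [(0, "sedentary"), (100, "light"), (500, "moderate"), (1000, "vigorous")]


def epoch_counts(sample_counts: List[int], samples_per_epoch: int) -> List[int]:
    """Sum activity counts over consecutive, non-overlapping epochs.

    A trailing partial epoch is discarded.
    """
    last_start = len(sample_counts) - samples_per_epoch
    return [
        sum(sample_counts[i:i + samples_per_epoch])
        for i in range(0, last_start + 1, samples_per_epoch)
    ]


def classify_epoch(count: int) -> str:
    """Return the label of the highest cut point that the epoch count reaches."""
    label = CUT_POINTS[0][1]
    for threshold, name in CUT_POINTS:
        if count >= threshold:
            label = name
    return label


if __name__ == "__main__":
    counts = epoch_counts([40, 30, 50, 300, 400, 350], samples_per_epoch=3)
    print([(c, classify_epoch(c)) for c in counts])
```

Because the output is an intensity label only, a sketch like this cannot distinguish, for example, sitting from standing, which is the limitation the paragraph above describes.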
Accelerometry has also been the most frequently used hardware for posture and movement tracking in children. However, a major challenge is to provide a software solution that adequately recognises specific postures and movements. Traditionally, software was developed to predict posture and movements from key features in the raw accelerometer data, using mathematical approaches such as regression-based equations and thresholds. Key features were selected based on knowledge of each posture and movement; for example, the thigh is typically horizontal during sitting but vertical during standing. While these relatively simple algorithm approaches have demonstrated activity recognition accuracy often above 80% in adults, only limited postures and movements have been targeted [10]. Research using these approaches in young children (0-5 years old) is more limited, with lower accuracies being suggested to be related to children spending more time in other postures such as kneeling and crawling [11,12].
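As a rough illustration of such a knowledge-based rule, the sketch below estimates thigh inclination from a single static accelerometer sample and applies a threshold. The axis convention (y along the thigh), the 45° threshold, and the function names are assumptions made for this example, not details taken from any reviewed paper.

```python
import math


def thigh_inclination_deg(ax: float, ay: float, az: float) -> float:
    """Angle (0-90 degrees) between the thigh axis and the gravity vector.

    Assumes the sensor's y axis runs along the thigh and that the sample
    was taken while the limb was still, so the signal is mostly gravity.
    """
    magnitude = math.sqrt(ax * ax + ay * ay + az * az)
    if magnitude == 0:
        raise ValueError("zero-magnitude acceleration sample")
    # abs() makes the rule indifferent to whether the sensor is mounted upside down.
    cos_angle = min(1.0, abs(ay) / magnitude)
    return math.degrees(math.acos(cos_angle))


def classify_posture(ax: float, ay: float, az: float, threshold_deg: float = 45.0) -> str:
    """Label a sample 'standing' if the thigh is near vertical, otherwise 'sitting'."""
    return "standing" if thigh_inclination_deg(ax, ay, az) < threshold_deg else "sitting"


if __name__ == "__main__":
    print(classify_posture(0.0, 1.0, 0.0))  # thigh vertical
    print(classify_posture(0.0, 0.0, 1.0))  # thigh horizontal
```

A two-class rule of this kind has no category for kneeling or crawling, which is consistent with the lower accuracies suggested for young children above.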
Recently, sophisticated machine learning computational approaches have evolved and become more accessible. Machine learning is the overarching term used to define a branch of artificial intelligence and is a rapidly advancing field [13]. When applied to wearable sensor data such as accelerometers, the models are trained to learn from the data [13], rather than follow simple rules based on human-defined key features. Machine learning software algorithms have demonstrated reasonable accuracy (>80%) in the identification of various postures and movements in adults [14]. There has been particular interest in applying machine learning to identify specific postures and movements in sporting contexts [15] and daily activity monitoring for people with movement impairments [16]. Collectively, this research has shown that adult postures and movements can be identified with reasonable accuracy (>80%) in a range of different contexts [14-16]. Whilst several studies have focused on the accurate identification of postures and movements using accelerometry data collected on young children, this information is yet to be synthesised.
Therefore, this systematic review aimed to answer the following research questions:
1. How have young children's posture and movement been objectively classified and measured using accelerometry and machine learning or other non-machine learning algorithm-based methods?
2. What is the degree of accuracy of systems developed for the measurement of young children's posture and movement using machine learning models, or other non-machine learning algorithm-based methods, applied to accelerometry data?

Materials and Methods
This review was registered with PROSPERO (328600) and adheres to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) statement for systematic reviews.

Search Strategy
Six online databases (Medline, Pubmed, CINAHL, SCOPUS, EMBASE, and IEEE) were systematically searched using terms related to the concepts of "accelerometer", "postures and movements", "pre-school aged children", and "machine learning". See additional file Table A1 for a sample search strategy. Searches were limited to the English language and papers published since January 2010. The initial database searches were completed in October 2022.

Eligibility Criteria
To be included in this systematic review, studies needed to meet the following criteria: to be published in a peer-reviewed journal or conference proceedings; to use data from accelerometers or inertial measurement units (IMUs) for machine learning model or algorithm development; to use data processing and model development methods inclusive of machine or deep learning algorithms and algorithm-based approaches for semi-automated or automated posture and/or movement recognition; to have developed, validated, or utilised machine learning models for the classification and measurement of posture and/or movement; and to include data on children aged 0-5 years. Studies including typically developing children or children with clinical diagnoses were included. Studies were excluded if they were protocol or review studies, or if they did not include a posture or movement, for example, studies that were focused on the output of sleep time, energy expenditure, or levels of physical activity intensity defined by thresholds or cut points (e.g., moderate-intensity exercise).

Screening for Relevant Studies to Include in the Review
All retrieved papers were exported into Endnote (v20) and duplicates were removed. Title and abstract screening was performed by two researchers (DH and LS) independently, with assistance from "Research Screener version 1.0", an artificial intelligence-based software system that iteratively learns from screening decisions to reduce the need to review irrelevant papers [17]. Any researcher disagreement on the eligibility of studies was resolved through discussion, without the need for escalation to a third reviewer. Following title and abstract screening, full text screening of articles was completed independently by the same two researchers.

Quality Assessment of Individual Studies Included in the Review
Quality assessment for each paper was conducted by DH, LS, and ALR using the COSMIN general recommendations for study design and criterion validity subscales [18]. These were modified slightly to fit the context and purpose of the study. For the general recommendations regarding study design, items requiring a conceptual framework (#4) and describing existing evidence (#6) were removed, as they were not necessarily representative of the quality of a study in this area. For the criterion validity items, continuous scores (#4) and dichotomous scores (#5) were merged, as studies could appropriately use either type of score. For each of the eight study design items and five criterion validity items, each paper was scored "good", "adequate", "doubtful", "inadequate", or "not applicable" based on the information provided. The numbers of "good" and "adequate" ratings were summed for each paper and item.

Data Extraction and Synthesis of Individual Studies Included in the Review
Twelve parameters were extracted and collated by DH, LS, ALR, AC, and CLR from the full manuscripts identified for final review. The first five parameters were related to study design and included participant details, study aims, sensor information, specific postures and movements performed, and data collection methods. Participant details included the number of participants, their age range, sex, whether they were from a clinical population or typically developing, how they were recruited, and the country the study was conducted in. Sensor information included the type of wearable movement sensor, sampling rate, and sensor location on the body. The next four parameters were related to classification model development methods and included whether papers had described a machine learning approach or non-machine learning algorithm-based approach, window details, feature extraction, and the specific machine learning algorithms used. The final parameters were related to the accuracy of the classification model developed and included the gold standard used for comparison, the validation approach used, and the overall accuracy of the system. A final table compiled the accuracy of the various classification models for each of the postures and movements assessed across all studies.

Results
An outline of the search results and study exclusions is provided in Figure 1. The initial database search identified 1663 results, of which 20 papers, reporting on data from 18 studies, met the inclusion criteria. Of these, 17 papers reported on the development and evaluation of machine learning-based approaches applied to wearable movement sensor data for the recognition of specific postures and movements in young children. The remaining three papers reported on non-machine learning algorithm-based approaches for the recognition of specific postures and movements in young children [19-21]. Tables 1-6 provide characteristics of all the reviewed studies and are discussed in the following sections. Within each table, papers are presented chronologically by year to demonstrate the evolution of methodological approaches over time.

Quality Assessment
Table 1 provides details of the quality assessment using the COSMIN guideline items on study design. For the eight components of general recommendations for study design, the total number of 'good' or 'adequate' ratings ranged from 1/8 to 8/8. Twelve out of 20 papers scored ≥7/8. The remaining eight papers scored ≤5/8. Papers most commonly fell short in describing their sample (e.g., exclusion criteria, inclusion criteria, participant recruitment, and whether the sample was representative of the population).

General design items:
1. Provide a clear research aim, including the machine learning approach, sensor, population, and specific postures and movements classified.
2. Provide a clear description of the postures and movements to be measured.
3. Provide a clear description of the development approach for the machine learning or algorithm model, including a description of the target population for which the machine learning was developed.
4. This criterion was not used in this review as it was related to the conceptual framework used to define the construct measured, which was not necessary for posture and movement measurement (greyed out).
5. Provide a clear description of the structure of the final machine learning or algorithm model.
6. This criterion was not used in this review as describing existing evidence on the quality of measures was not necessary for posture and movement measurement (greyed out).
7. Provide a clear description of the intended context of use for the machine learning.
8. Provide a clear description of the inclusion and exclusion criteria for the sample (e.g., clinical condition or typically developing) and characteristics (e.g., age, sex, country).
9. Provide a clear description of the method used to recruit and select the sample.
10. Describe whether the sample is representative of the target population for use of the machine learning.
Assessed as good (G), adequate (A), doubtful (D), or inadequate (I).
Table 2 provides the ratings of the COSMIN guideline items on criterion validity quality assessment. For the five components of criterion validity, the total number of 'good' or 'adequate' ratings ranged from 1/5 to 5/5. A total of ten papers scored ≥3/5; the remaining papers scored ≤2/5. Papers most commonly fell short in adequately describing the gold standard used and in providing information about missing data.

Study Design
Table 3 details characteristics of the study design for each paper reviewed.

Participants
Reviewed papers acquired data for model development and evaluation from 1 [22] to 100 [23] participants, with most papers (n = 14) having fewer than 25 participants. Most of the papers included an even number of male and female participants. The age range varied widely: eight papers included children under the age of 3 years, four papers included children aged between 3-5 years [26,27,33,34], and the remaining 12 papers included children over the age of 5 years or adults with a small set of children who fit the inclusion criteria. Some of these only included one participant fitting the inclusion criteria or were unclear in how many participants of each age were within the sample. Most papers did not explicitly state whether children were typically developing; however, 17 papers appeared to develop their models on data acquired from typically developing children without clinical diagnoses. Of the three papers that included children with clinical diagnoses, one included children with asthma [31], one included both typically developing children and children with cerebral palsy [29], and one appeared to include children with unknown clinical conditions, where the paper refers to "newborn's physical condition and other medical devices attached" [36].
[24] n = 6, 3-5 years old, all female, clinical/TD not reported, how recruited not reported, Japan.
To evaluate the accuracy of one arm accelerometer for activity recognition, the difference in accuracy between child and adult, and whether SOM has advantages over other classifiers.
Standardised tasks, each activity performed for at least 15 s. ~4 min for each participant. Unclear environment.
To describe and evaluate an activity recognition system using a single 3-axis accelerometer and a barometric sensor worn on the waist of the body.
To develop and compare multinomial logistic regression and SVM classification of physical activities among preschool children using triaxial accelerometry data.
1 sensor (ActiGraph GT3x+). Hardware type: accelerometer, magnitude range ±6 g, sampled at 30 Hz. Mounted on right hip. 12 activities: Sleep, watch TV, seated colouring at desk, seated video games, seated floor puzzles, play toy kitchen/blocks, ball toss and quick walking, standing active video game, dance following video instructor, aerobics following video instructor, running in place on game mat. Reclassified into six activities: sleep, rest reclining, quiet sitting play, low active play standing, moderately active play standing, very active play standing.
Standardised tasks. Children wore the sensor one full day (9 a.m.-4 p.m.) and performed a series of activities in a set order, each for 10 min to 2 h duration with some free-time light activities in between.
To develop and evaluate a single arm sensor and SOM system to classify infant activities.
Seven activities subcategorised into two classes: dynamic activities (walking, running, playing) and static activities (sleeping, eating, hand motion, sitting).
To evaluate conventional feed-forward artificial neural network with more advanced deep learning-inspired neural network for predicting physical activity types in preschool children.
Standardised tasks. 12 structured activity trials (e.g., watching TV, doing collage, playing active game) for 4-5 min each over two lab sessions within a three-week period. First visit: watching television, sitting on the floor reading, standing making a collage on a wall, walking, playing an active game, and completing an obstacle course. Second visit: sitting on a chair, playing a computer tablet game, sitting on floor playing quietly with toys, treasure hunt, cleaning up toys, bicycle riding, and running.
Hegde, 2018 [29] n = 21, 11 typically developing children (mean age = 6.6 ± 1.5 years), 55% male, 10 children with cerebral palsy (mean age 6.2 ± 1.5 years), 60% male, recruitment unclear, USA.
To develop a wearable sensor system for combined activity and gait monitoring in children with cerebral palsy.
6 sensors. Hardware types: 1 3-D accelerometer and 5 Force Sensitive Resistor (FSR) sensors (intelink), sampled at 400 Hz. FSR sensors in insole. Accelerometer mounted on back of heel of shoe within a plastic enclosure.
Four classes (each with different conditions): sitting (on child chair, on adult chair, on parent's lap, on floor playing with toys); standing (standing still, standing while playing with toys, standing while being dressed); walk (slow walk, fast walk, run, each also completed on GAITRite).
Standardised tasks in a laboratory. Each condition completed for 2 min. When child walked on GAITRite, it was for the span of the GAITRite mat.
To develop, test, and compare human activity recognition algorithms trained on raw accelerometer signal from wrist, hip and the combination of wrist and hip in preschool-aged children. Compared conventional physical activity cut-point methods with activity class recognition models.
Li, 2019 [31] n = 16, age 5-15 years old, sex not stated, clinical/TD likely asthmatic, unclear recruitment, however, reference dataset, BREATHE cohort, USA. Final data n = 14 (as two had substantial missing data).
To develop a sensor-based integrated health monitoring system for studying paediatric asthma, specifically monitoring physical activity. To compare greedy Gaussian segmentation (GGS) with a standard fixed-size window/sliding-window approach using data from 2 HAR studies (one adult, one child) of different durations and sensor locations (just one for children).
Ahmadi and Brookes, 2020 [33] n = 31, 3-5 years old, mean age 4.0 ± 0.9 years, 22 male, clinical/TD not stated, mainly word of mouth/local recruitment, Australia.
To evaluate the classification accuracy in free-living conditions of an existing laboratory-developed ML system for preschoolers.
Five classes: sedentary, light activities and games, moderate-to-vigorous activities and games, walk, run.
Free play. 20 min free play in home or park chosen by parent, some age-appropriate toys provided, no prompting for activities performed.
To evaluate ML developed on free-living data, using 1-15 s windows (1, 5, 10, 15 s), lagged and lead frames, and based on multiple sensors.
Identical to Ahmadi and Brookes, 2020.
Identical to Ahmadi and Brookes, 2020.
Identical to Ahmadi and Brookes, 2020.
To develop a wearable sensor suit-based system to assess infant movements as early indicator of neurocognitive disorders.
Franchak, 2021 [37] Laboratory study: N = 15, 6-18 months old, eight female, TD unclear, recruited via social media advertisements and local community recruitment events, USA. Home data collection case study: N = 2, 10.5-11 months old, sex unclear. Likely from the lab study; however, unclear in reporting. Note neither infant could walk independently; however, both could stand, cruise along furniture, and walk while supported with a push toy or caregiver assistance.
To develop and validate a classification system using infant-worn inertial sensors to classify typical postures and movements in an infant's day, to assist with monitoring infant movement behaviours in the home environment.Aimed to assess whether the method could accurately detect individual differences in how much time infants spend in different postures, to characterise everyday movement experiences and their potential for developmental impact.
Laboratory study: 3 sensors (MetamotionR IMU); accelerometer and gyroscope, sampled at 50 Hz. Mounted on right hip, thigh, and ankle. Home data collection case study: four Biostamp IMUs (accelerometer and gyroscope), sampled at 62.5 Hz, embedded in a pair of customised infant leggings, placed at bilateral hip and ankle.
Five body positions: supine (lying on back), prone (lying flat on stomach or in crawling position), sitting (sitting on a surface with or without support from caregiver, the highchair, or on caregiver's lap), upright (standing, walking, or cruising along furniture), held by caregiver (carried in caregiver's arms, excluding times they were seated on caregiver's lap).
Standardised tasks in a laboratory: 10 activities (assisted or unassisted): standing upright, walking, crawling, sitting on the floor, lying supine, lying prone, held by a stationary caregiver, held by a caregiver walking in place, sitting restrained in a highchair. Completed each activity for 1 min, total session lasting 10 min, followed by free play to allow for spontaneous body positions. Standardised tasks in the home environment: Experimenter guided caregiver via phone through a set of procedures to elicit different body positions; tasks the same as the laboratory tasks, completed each activity 1 min, followed by 10 min of free play. Following free play, infant and caregiver went about day as normal wearing IMUs for approx. 8 h; video recording was for up to 180 min during this time.
To determine whether there is a difference in physical activity assessment between wrist-worn sensor on the dominant and non-dominant arm and between lower back and hip-worn sensor.

4 sensors (Mbient Lab Meta-motion IMUs). Hardware type: accelerometer, gyroscope and magnetometer. Accelerometer range was ±16 g at 100 Hz, magnetometer range was ±1300 uT at 25 Hz and gyroscope range was ±2000 °/s. The 4 sensors were mounted on both wrists, lower back, and hip on dominant hand side (upper limb collected separately to low back and hip).
Nine activities: jumping, rotating, running, walking, walking on tiptoe, clapping hands, standing still, sitting still, and dancing.
Standardised tasks. All activities performed for 15 s with 5 s standing between and done twice: once with two wrist sensors and once with two lower-body sensors. 10 s of each activity was used for analysis.
EMG = Electromyography. HAR = human activity recognition. IMU = inertial measurement units. ML = machine learning. RFID = radio frequency identification. SOM = Self-Organising Map. SVM = Support vector machine. TD = typically developed.

Aims
Eleven of the reviewed papers aimed to develop systems that would allow for physical activity or general activity monitoring, to be used in understanding child development and preventing lifestyle diseases such as obesity. However, the aims of the remaining nine papers varied widely. Specifically, of the nine remaining papers, two evaluated systems in a different environmental context (e.g., free-living) [33,34]. Two papers had the specific aim of evaluating infants' movements as early indicators of neurocognitive disorders, and one focused on evaluating methods for measuring the time infants spent in prone lying postures (i.e., tummy time) [35,38]. One paper was focused on the prevention of falls [22], and two papers focused on the development of a system to measure postures and movements in children [27,36].

Sensor Information (Type, Sampling Rate, Number of Sensors, Locations)
A range of commercially available and custom-built wearable movement sensors were utilised. Eleven of the papers developed/evaluated models using only accelerometer data, whilst five papers utilised inertial measurement unit sensors [20,31,35,37,38]. Four papers combined accelerometer data with other types of non-sensor-based and sensor-based data, including calorimetry (n = 1 [23]), air pressure sensors (n = 2 [22,25]), and force pressure sensors (n = 1 [29]). One paper compared approaches using different numbers of accelerometers and additional data [19]. Eleven of the papers utilised a single accelerometer, and accelerometers were located in a range of locations: upper chest (n = 1 [36]), upper arm (n = 2 [24]), wrist (n = 1 [31]), waist (n = 2 [23,25]), hip (n = 2 [26,28]), back pocket of trousers (n = 1 [22]), and shoe (n = 1 [29]). Four of the papers used two accelerometers, and these were located on the hip and wrist [30,32-34]. One paper used three accelerometers located on the hip, thigh, and ankle [37]. This paper also described a second home-data collection phase of their model development and evaluation, where they used four lower limb accelerometers (bilateral hip and ankle) based on the results of the laboratory-based study [37]. The remaining four studies used four accelerometers with a range of location combinations, which all combined upper-body- and lower-body-mounted accelerometers [19,21,35,38].

Postures and Movements Measured
The postures and movements most commonly included were lying, sitting, standing, walking, and running. For papers that included lying, half specifically focused on the orientation that the child was lying in (e.g., prone/supine/side lying), and the other half did not differentiate the orientation. All papers that differentiated lying included children under the age of three years old. For papers that included sitting, the sitting data used for model development sometimes included sitting on varied surfaces and in different conditions; however, many did not report the specific posture the child was sitting in, or if the child self-selected their sitting posture when the data was collected. Additionally, very few of the models developed included sitting and standing with and without movement, and those that did focused on physical activity intensity classifications. Five of the reviewed papers developed machine learning models that classified both specific movements and physical activity intensity, where the specific postures and movements included were walking and running. More diverse and child-specific movements such as crawling (n = 5 [25,32,35,37,38]), climbing (n = 1 [22]), and jumping (n = 1 [20]) were less commonly identified.

Data Collection Methods
Most papers acquired data in a laboratory setting: seven explicitly stated that data were acquired in a laboratory-based environment, whilst a further six did not state the location, although the methods suggested that it was within a laboratory. One paper [37] conducted a laboratory study which was repeated (with some modifications to sensor locations) in a home environment. Within the home environment, both prescribed activities and free-living activities were collected. The remaining papers (n = 7) collected data within various "free living" environments, which included indoor play centres, healthcare clinics, childcare centres, the child's own home, and a park. Tasks were performed during data collection for between 15 s and 5 min each.

Classification Model Development
Table 4 details the classification model development for all included papers, with a focus on window size, feature extraction methodology, machine learning approach applied, model development, and validation approaches.
A range of different features were extracted for model development, and these were typically amplitude and frequency domain features. This was performed for all non-deep learning models (n = 15), as well as the first model of the first Airaksinen paper [35]. For the second model of the first paper [35] and the second paper [38], the time signal was input into supervised deep learning models as well as into an unsupervised deep learning model (n = 1).
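To make the idea of windowed amplitude and frequency domain features concrete, the sketch below segments a one-axis signal into fixed-length windows and computes three illustrative features per window. The feature set, window parameters, and function names are assumptions for this example; the reviewed papers used a variety of features and window sizes.

```python
import cmath
import math
from typing import Dict, List


def sliding_windows(signal: List[float], window_len: int, step: int) -> List[List[float]]:
    """Split a signal into fixed-length windows; step < window_len gives overlap."""
    return [signal[i:i + window_len] for i in range(0, len(signal) - window_len + 1, step)]


def window_features(window: List[float], sample_rate_hz: float) -> Dict[str, float]:
    """Compute mean and standard deviation (amplitude) and dominant frequency."""
    n = len(window)
    if n == 0:
        raise ValueError("empty window")
    mean = sum(window) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / n)
    # Naive discrete Fourier transform of the mean-removed window.
    # Only bins 1..n//2 are examined; bin 0 (the mean) is excluded.
    best_bin, best_power = 0, 0.0
    for k in range(1, n // 2 + 1):
        coeff = sum((x - mean) * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(window))
        power = abs(coeff) ** 2
        if power > best_power:
            best_bin, best_power = k, power
    return {"mean": mean, "std": std, "dominant_freq_hz": best_bin * sample_rate_hz / n}


if __name__ == "__main__":
    # Synthetic 2 Hz oscillation sampled at 50 Hz, standing in for a repetitive movement.
    signal = [math.sin(2 * math.pi * 2 * t / 50) for t in range(100)]
    for window in sliding_windows(signal, window_len=50, step=25):
        print(window_features(window, sample_rate_hz=50.0))
```

Feature vectors of this kind would then be passed to a classifier, whereas the deep learning models mentioned above take the time signal itself as input.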

Model Accuracy
Tables 5 and 6 detail the accuracy of the developed models. Human coding by direct observation or later video observation was most commonly used as the gold standard comparison; however, several papers did not clearly report on the method for the collection and annotation of the comparison data.
A range of approaches were used in developing and validating model accuracy. These included leave-one-subject-out cross-validation (n = 6), 10-fold cross-validation (n = 3), and 3-fold cross-validation (n = 1 [22]). One paper [31] split the data set into a training set and test set, where the model was trained on 12 participants and tested on the remaining two. One paper split the data into three evenly sized data sets, one for training, one for validation, and one for testing [23]. The remaining machine learning model papers used a combination of validation approaches.
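Leave-one-subject-out cross-validation can be sketched as follows. The function name and the representation of the data as one subject identifier per window are assumptions for illustration only.

```python
from typing import Iterator, List, Tuple


def leave_one_subject_out(subject_ids: List[str]) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject.

    subject_ids[i] is the participant that window i was recorded from.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


if __name__ == "__main__":
    # Six windows recorded from three hypothetical children.
    ids = ["child_a", "child_a", "child_b", "child_c", "child_c", "child_c"]
    for subject, train_idx, test_idx in leave_one_subject_out(ids):
        print(subject, train_idx, test_idx)
```

The design point is that every window from the held-out child is kept out of training, so each fold estimates accuracy on a child the model has not seen. A k-fold split made over windows rather than over participants would not guarantee this.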
The majority of papers used confusion matrices to determine accuracy, although there was little consistency of what was included in the confusion matrices.For example, recall and precision (%) [31], prevalence, sensitivity, and positive predictive value [37], or just 'accuracy' [25].How each of the statistics reported were calculated was often not explicitly stated.
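To make the relationship between the statistics named above explicit, the sketch below derives overall accuracy, recall (also called sensitivity), and precision (also called positive predictive value) from a multi-class confusion matrix. The matrix values and class labels are invented for illustration and do not come from any reviewed paper.

```python
def per_class_metrics(confusion, labels):
    """Compute overall accuracy plus per-class recall and precision.

    confusion[i][j] is the number of windows whose true label is labels[i]
    and whose predicted label is labels[j].
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(labels)))
    metrics = {"overall_accuracy": correct / total}
    for i, label in enumerate(labels):
        true_pos = confusion[i][i]
        actual = sum(confusion[i])                    # row sum: true instances
        predicted = sum(row[i] for row in confusion)  # column sum: predictions
        metrics[label] = {
            "recall": true_pos / actual if actual else float("nan"),
            "precision": true_pos / predicted if predicted else float("nan"),
        }
    return metrics


# Hypothetical confusion matrix (rows: true class, columns: predicted class).
labels = ["sit", "stand", "walk"]
confusion = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
metrics = per_class_metrics(confusion, labels)
```

Here the overall accuracy is 24/30 = 0.8, while sitting has a recall of 8/10 and a precision of 8/9, which shows why a single reported 'accuracy' can be ambiguous unless its calculation is stated.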
Table 6 summarises the accuracy reported for the models validated in each paper. Overall model accuracy (i.e., the degree of accuracy considering all postures and movements included in the model) ranged widely, from 59% to 97%. Further, accuracy for each specific posture and movement also commonly varied widely (see Table 6). For example, reported accuracy was 53-100% for sitting, 9-99% for walking, and 18-100% for running. Five of the papers only reported overall accuracy or did not report posture- or movement-specific accuracy [19,20,24,31,32].
Not stated.
Overall average recall when using GGS was 73%. Overall averaged precision when using GGS was 86%. Instantaneous accuracy from XGBoost using GGS was 79.4%. Highest fixed-size window accuracy was 72.7%.
Kwon, 2019 [32]: GoPro video recorded. Three coders independently coded first four participants using draft coding scheme; after discussion and revision, two coders independently coded rest with 96% concordance. Accel and video synched using visual inspection of active/still.
Madej, 2022 [20]: Manually labelled offline (not explicitly stated what was used as the reference).
Mean activity vector distance used to conclude whether the constructed feature vector allowed the authors to distinguish between the analysed activities. Euclidean distance was averaged over subjects for each sensor separately, then for all sensors in selected IMU and configuration.
Not stated.
Best result: accelerometer and magnetometer on non-dominant arm (trace of minimum distances matrix = 8); worst: gyroscope on lower back and magnetometer on hip (trace = 4). Conclude no differences between wrists, nor between low back and hip.

Discussion
The aims of this systematic review were to determine how young children's postures and movements have been objectively classified and measured using accelerometer hardware and accompanying software, and the accuracy of current systems. The review identified 20 peer-reviewed journal papers, 17 of which reported customised machine learning-based algorithms, and three of which reported simpler, human-defined prediction approaches. While the quality of papers varied greatly, over half scored very well across study design and concurrent validity items. This review highlights the diversity of approaches that have been used to objectively capture children's posture and movement and the impact this had on the reported accuracy. The results highlight that there is currently little consensus on: (1) which postures and movements to record, (2) the participant sample, (3) the study type/developmental approach, (4) the hardware, (5) the software, and (6) the validation approaches used. The synthesis below includes recommendations for each of these factors to help guide future development.

Posture and Movement
The synthesis of the 20 peer-reviewed papers in this review highlights that there is little consistency in which activities have been selected for objective classification in young children. Indeed, over 30 postures and movements were targeted across the included papers, with only some overlap between studies. As such, there is a need to establish a consensus on the types of postures and movements to assess, a finding echoed in a recent scoping review that summarised how postures and movements had been objectively measured in healthy adult populations [14]. The results highlight an emphasis on measuring different lying postures in non-ambulatory-aged children (<3 years old) [19,36], which aligns with the evidence linking time spent lying in different positions with important developmental milestones in this age group. However, for older age groups (>3 years old), there was considerable diversity in the movements and postures measured. The most commonly measured postures and movements across the included papers were lying, sitting, standing, walking, and running. These align well with the most frequently reported movements and postures measured in adults [14]. Few papers examined child-specific movements (e.g., crawling [25,37] and climbing [22,24,25,32]) or child-specific adaptations of postures (e.g., different types of sitting such as kneeling and side sitting [29]). This might be due to most papers (n = 19) using a prescribed, standard set of tasks that may not be reflective of free-living conditions. Even in studies examining postures and movements, physical activity intensity remained frequently studied, which was justified through established links with childhood health [39]. Future research should continue to focus on the postures and movements that have been identified as important, while also considering diversification to more child-specific variations of these postures and movements.

Participant Sample
The number of participants evaluated in the reviewed papers ranged from 1 to 100. However, most studies involved fewer than 25 participants, and studies rarely reported any sample size justification. This weakness is consistent with research conducted in adults, where most samples included approximately 20 participants [14]. It is accepted that the generalisability of models developed on small cohorts is limited, despite them often reporting very high levels of accuracy. For example, the highest and most consistent accuracy reported for predicting walking was in a study with only one participant (99-100%), while the studies with the largest samples had much greater variance in the reported accuracy (e.g., 58-100%, n = 100, [23]). If the goal is to utilise the models beyond the sample they are developed on, larger sample sizes are required that are representative of the intended application population. While most of the samples were balanced for sex, it remains unclear from these studies whether sex has a specific effect on the objective measurement of posture and movement in children. Similarly, a range of childhood age groups was studied; however, none of the included studies specifically investigated the influence of child age on the objective classification of postures and movements. A deeper understanding of the influence of child age on prediction accuracy is warranted, as children's postures and movements change with age (e.g., the change in gait patterns from toddlers to preschoolers). It has also been shown that a model developed on adults may not perform well on children [24]. Thus, future research should specifically investigate the influence of both sex and age on model predictions. Lastly, most of the studies involved only typically developing children (19 out of 20 papers). The success of any developed objective classification system should be specifically checked on populations with atypical developmental profiles, for example, children who are known to move differently to typically developing children, such as children with cerebral palsy [40].

Study Type/Development Approach
The twenty included papers were all aimed at methodological advancement and therefore all utilised a validation study design. More than half collected the data in a laboratory-based, controlled manner (i.e., with a prescribed set of activities). This approach is common when developing methods, with most similar work on healthy adults also collected in a laboratory environment using standardised activities [14]. While the remaining papers all included a more ecologically valid environment (such as a play centre, the home, or a childcare centre), they mostly used a structured set of activities rather than free play. Only two investigations were cross-validated in a completely uncontrolled free-living space [33,34]. Collectively, these studies found that the accuracy of posture and movement prediction achieved in a laboratory-based study was reduced by 15-20% when assessed in free-living conditions [33,34]. All studies used either human coding of video or direct observation as the gold standard. While this gold standard was demonstrated to be sufficiently accurate and reliable [7], it might limit the study design to controlled settings, given that videoing and observing free-living conditions is time-intensive and impractical for longer durations. Therefore, while future work should include a cross-validation component in free-living environments, this would be facilitated by computational approaches, such as machine learning, in the processing of gold standard data.

Hardware
There was very little consensus on the type of hardware applied and how it was used across the studies reviewed. Actigraph was the most frequently used commercial device (eight studies), with a diverse range of other commercial and non-commercial sensors otherwise applied. It is generally accepted that more sensors will increase the accuracy of the algorithm developed [41]. However, this is not practical when long-term or large-scale monitoring is planned. Given that most studies aimed to develop a system that could be used for multiple-day recording in large samples to establish links between postures and movements and childhood health, they generally used only a single sensor. There was diversity in the location of single sensors, although the hip and the wrist were the most common locations. One paper compared the accuracy between locations and found that a hip sensor was slightly more accurate than the wrist for identifying 23 different activities [34]. Research in healthy adults has demonstrated that a single sensor located on the thigh is more suitable for differentiating sitting and standing [10]; however, this has not yet been confirmed in children. While it appears that there is some consensus that a single sensor is optimal for long-term childhood activity tracking, more research is required to determine which location is optimal.

Software
The results of this review highlight a range of different software prediction approaches used to objectively record children's postures and movements. The one consistency was that machine learning methods were favoured. Only three studies used algorithms that required human-specified criteria, suggesting that this approach is becoming less favoured amongst the research community. Although the majority of studies employed conventional machine learning, the most recent papers have started utilising deep learning, consistent with human activity recognition (HAR) work in adults. In the conventional machine learning models, there was considerable diversity relating to feature extraction. There was no consistency in the choice of overlapping or non-overlapping windows, the size of windows, or the features selected. Some papers highlighted the influence of these aspects on accuracy, with longer window sizes appearing more suitable. A large range of traditional machine learning models was also used, with Random Forests being the most common. However, the interplay between features, window size, and model makes recommendations on optimal approaches difficult.

Validation Approach and Accuracy
Traditionally, machine learning models are validated by splitting the data into training, validation, and test data sets and, in the case of HAR, by splitting by participant rather than by random windows of data. The training and validation data sets are then used to train the model, with the validation set used to tune the hyperparameters of models coming from the training set. The accuracy is then determined by applying the model to the previously unseen test data. In this scenario, the test data set is not seen at all during model development, so it can be considered to give an independent measure of accuracy. However, this approach was not undertaken in any of the papers reviewed (with the exception of Ahmadi, where a previous model derived from laboratory data was tested against free-living data [33]). Further, this practice requires large, labelled data sets, which are not readily available for children. One potential solution is for research groups to share data where similar accelerometers in similar positions have been utilised, so that one group can test its model on unseen data from another group and vice versa. More recently, validation has been performed through cross-validation techniques such as n-fold and leave-one-subject-out (LOSO) cross-validation. Many papers in this review employed n-fold and LOSO cross-validation methodologies but applied them across all of the data collected, leaving no separate held-out test set.
The results of this review highlight a very large range of prediction accuracies across each of the examined postures and movements. Walking and running were most commonly examined, with accuracy ranging from as low as 9% (walking) to as high as 100% (running). Importantly, for the 13 postures and movements evaluated in samples of more than 25 children (with the exception of 'pivoting'), an accuracy of greater than 80% was reported. This is meaningful, as this 80% threshold of accuracy has been largely accepted as the cut-off for acceptable implementation of a model. However, given that only one study assessed their model in free-living conditions, caution should still be applied when interpreting these results.
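The participant-wise splitting described above, as opposed to splitting by random windows, can be sketched as follows. This is an assumption-laden illustration only: the function name, the 60/20/20 split fractions, the fixed seed, and the participant identifiers are hypothetical and are not drawn from the reviewed papers.

```python
import random


def split_by_participant(participant_ids, fractions=(0.6, 0.2, 0.2), seed=0):
    """Assign whole participants to train/validation/test and return window indices.

    `participant_ids[i]` is the participant that window `i` was recorded from,
    so every window from a given participant ends up in exactly one data set.
    """
    participants = sorted(set(participant_ids))
    random.Random(seed).shuffle(participants)
    n = len(participants)
    n_train = int(round(fractions[0] * n))
    n_val = int(round(fractions[1] * n))
    groups = {
        "train": set(participants[:n_train]),
        "validation": set(participants[n_train:n_train + n_val]),
        "test": set(participants[n_train + n_val:]),
    }
    return {
        name: [i for i, p in enumerate(participant_ids) if p in members]
        for name, members in groups.items()
    }


# Hypothetical data set: 10 participants with 4 windows each.
participant_ids = [f"p{k}" for k in range(10) for _ in range(4)]
splits = split_by_participant(participant_ids)
```

Because assignment happens at the participant level, windows from the same child cannot appear in both the training and test data, which is the property that makes the test accuracy independent of model development.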

Strengths and Weaknesses
A strength of this review is that the author team included both human health and computational expertise, enabling transdisciplinary understanding and translation of the findings. A further strength was that the review included quality assessment of the papers, which is not always included in machine learning systematic reviews, again likely reflective of transdisciplinary differences. The lack of quality assessment in other machine learning reviews may also be due to the limited applicability of most quality assessment tools to machine learning-type studies, as reflected in the need to modify the COSMIN quality assessment to ensure it met the needs of this study.
A limitation of the study was that it did not include studies of older children, which may contain useful information, but these studies were partly covered by another recent review [14]. This review focused on studies that applied machine learning and algorithm-based approaches to data from wearable movement sensors. Thus, papers that objectively measured movements and postures using other data sources, such as video data, were not summarised.

Implications
This review suggests a number of implications for future machine learning model developments, including the importance of ensuring an adequate sample in terms of size and representativeness, sample age and sex, sensor location, separation of training and testing data, laboratory and field testing, and inclusion of a broad range of postures and movements commonly used by children. Further research should also explore the strengths and weaknesses of various machine learning approaches.

Conclusions
Young children's postures and movements are critical to their current and future health and development, so high-quality evidence from robust measures is essential to understanding how to support healthy development. The findings of this review suggest that the rapidly developing field of machine learning has the potential to substantially enhance the quality of such evidence.

Figure 1. PRISMA flowchart of included studies. # Screening took place using "Research Screener" artificial intelligence screening software. Nine rounds of screening (50 in each round) took place before the screeners determined that no further useful papers were being shown.

4 sensors. Hardware: triaxial accelerometer and gyroscope (Movesense IMU), sampled at 52 Hz, mounted on upper arms and legs. Iteratively developed five posture categories: prone, supine, side left, side right, crawl position. Eight movement categories: macro still, turn left, turn right, pivot left, pivot right, crawl proto, crawl commando, crawl 4 limbs (crawl 4 limbs omitted as only one recording utilised category). Standardised tasks. In clinic-like settings for 30-60 min. Physiotherapist encouraged a range of postures and movements by play without touching infant. Movements collected while infant was placed on a foam mattress. Mean 29 min of data collection (range 9-40 min). Total of 12.1 h recorded.

Free play at home (n = 40) or home-like clinic (n = 24). Average data 67 min (range 18-199 min); total recording time 71 h and 30 min. Children encouraged to free play with little adult interference; differences in environment/play opportunities. Participants collected at home instructed to play for at least 1 h.

Madej, 2022 [20]: n = 10, 4-40 years old (mean 24 years ± 14 years), 7 men, unclear if clinical/TD, unclear recruitment, Poland.

Table 1. Quality assessment based on general recommendations for the design of a study.

Table 2. Quality assessment based on criterion validity.

Table 3. Details of study design as reported in each paper.

Table 4. Details of classification model development as reported in each paper.

Table 5. Details of classification model accuracy.

Table 6. Summary of classification model accuracy for each posture and movement, rows ordered by combined studies' sample size and columns ordered by paper's chronological order.