Computer Vision-Based Unobtrusive Physical Activity Monitoring in School by Room-Level Physical Activity Estimation: A Method Proposition

As sedentary lifestyles and childhood obesity are becoming more prevalent, research in the field of physical activity (PA) has gained much momentum. Monitoring the PA of children and adolescents is crucial for ascertaining and understanding the phenomena that facilitate and hinder PA in order to develop effective interventions for promoting physically active habits. Popular individual-level measures are sensitive to social desirability bias and subject reactivity. Intrusiveness of these methods, especially when studying children, also limits the possible duration of monitoring and assumes strict submission to human research ethics requirements and vigilance in personal data protection. Meanwhile, growth in computational capacity has enabled computer vision researchers to successfully use deep learning algorithms for real-time behaviour analysis such as action recognition. This work analyzes the weaknesses of existing methods used in PA research; gives an overview of relevant advances in video-based action recognition methods; and proposes the outline of a novel action intensity classifier utilizing sensor-supervised learning for estimating ambient PA. The proposed method, if applied as a distributed privacy-preserving sensor system, is argued to be useful for monitoring the spatio-temporal distribution of PA in schools over long periods and assessing the efficiency of school-based PA interventions.


Introduction
Over the past four decades, a 10-fold increase in the number of obese children and adolescents has been observed, and it is estimated that almost one in every five children globally is overweight [1]. Meanwhile, physical inactivity (PI), which has been associated with various health risks [2], and which is also one of the main contributors to overweight, has been described as a global pandemic [3]. Concurrently, smartphones have become more accessible even to lower-income families, and this is enabling screen time to increasingly compete with healthier activities in the temporal budgets of the youth even outside of their homes. Children and adolescents spend a large part of their time in school where their health behaviour can be researched and possibly influenced. So far, school-based physical activity (PA) interventions have mostly shown modest [4][5][6][7] and only temporary [8] effects on PA, if any at all [9]. There are still many ambiguities in this field [10][11][12] due to limited evidence. To maximize impact on public health, evidence-based best practices of PA interventions should be determined before making large investments into scaling up the intervention programs [13,14].

Methods of Assessing Physical Activity of Children
Physical activity is defined as bodily movement via skeletal muscles that results in energy expenditure (EE) [15]. Measurement of PA in the context of the PI epidemic is mostly concerned with assessing habitual PA and determining whether some populations of youth are meeting the established guidelines [16] of 60 min or more daily moderate to vigorous PA (MVPA) with moderate PA defined by the World Health Organization as 3-6 Metabolic Equivalent of Task units (METs) and vigorous PA above 6 [17]. MET is the PA intensity unit defined by the ratio of a person's working metabolic rate relative to their resting metabolic rate with individual metabolic differences normalized based on body weight [18]. The PA (proxy-) measures described below are often converted to this metric.
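The MET thresholds and the daily MVPA guideline quoted above can be expressed as a short check. The following is a minimal sketch only: the function names and the assumption of one MET estimate per minute are illustrative and not part of the cited guidelines.

```python
def classify_met(met: float) -> str:
    """Map a MET value to an intensity label using the thresholds quoted above:
    moderate PA is 3-6 METs, vigorous PA is above 6 METs."""
    if met < 0:
        raise ValueError("MET values cannot be negative")
    if met < 3.0:
        return "below moderate"
    if met <= 6.0:
        return "moderate"
    return "vigorous"


def meets_guideline(minute_mets, required_minutes: int = 60) -> bool:
    """Check the 60 min/day MVPA guideline.

    Assumption (hypothetical input format): `minute_mets` holds one MET
    estimate per minute of a single day.
    """
    mvpa_minutes = sum(1 for met in minute_mets if met >= 3.0)
    return mvpa_minutes >= required_minutes
```

In practice the conversion from a proxy measure to METs is itself an estimate, so a check like this inherits the uncertainty of that conversion.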
This work is concerned with brief expressions of PA (bodily movement lasting no more than a few seconds, here termed PA microexpressions) that might be wholly observable in an indoor video camera's field of view (FoV). Therefore, the descriptions of methods do not go deeply into concerns of population-level PA inference, but rather address the relation of the measurement techniques to age- and context-specific PA patterns observable in school.
Methods with varying levels of objectivity have been used in research on children's PA, ranging from indirect approaches like survey questionnaires, interviews, and activity diaries to direct methods such as observation and physical measurements like accelerometry, heart-rate monitoring, and doubly labelled water (for an overview see [19]). Assessing children's PA with indirect measures has been shown to be unreliable, often overestimating PA [20][21][22]. Self-report or parent-assisted measures, while relatively cheap, suffer from reliability and validity issues concerning inaccuracy of assessment, recall, and social desirability bias [19,23].
Direct systematic observation can provide rich insight into the PA dynamics of a group of children in a specific context: one can observe the subjects' interactions with each other and their immediate environment while taking notes on the intensity and duration of PA these interactions entail. Results based on such observation, however, are not strictly reliable as the observer's senses are limited and interpretations subjective. Thorough training and "recalibration" of observers can somewhat mitigate these problems and increase comparability, but it also increases the cost of applying the method [24].
Direct physical measures are often used for estimating EE in epidemiological and kinesiological research. To this end, doubly labelled water (DLW) provides accurate measures of overall EE [25], but the method only allows EE assessment averaged over long periods of time (sampling rates counted in days), it is very intrusive, expensive, and does not directly measure the construct of PA. DLW's accuracy, however, has made it a useful tool for validation of the methods described here.
Heart rate monitoring can provide high sampling rates and is well correlated with EE, but the relation varies widely between and within individuals [26]. Consequently, thorough calibration for factors like age, sex, body weight, and physical fitness is required to assess EE via heart rate monitoring [27]. Further inference of PA from heart rate monitors benefits from the additional modality of movement measured with an accelerometer [28]. Combined heart rate and acceleration sensors have been deemed valid for assessing PA of children [29]. Before an overview of accelerometry, pedometers should be mentioned as a relatively cheap and reasonably valid option for assessing the PA levels of children [30][31][32]. However, pedometers are essentially single-axis inertial sensors that are individually calibrated for each subject to detect their stepping patterns, so these devices are not designed to register horizontal motion and cannot quantify the intensity of PA at a given moment.
Triaxial accelerometers can provide more information by quantifying the inertial forces on each of its three axes at high sampling rates (up to 100 Hz in practice). This allows modelling of acceleration vector magnitude (AVM) in 3D space which can be corrected for gravitation (Euclidean Norm Minus One g or ENMO) to obtain a measure of the force applied to the sensor by the subject. However, due to the restricted functionalities of popular wearable accelerometers, "activity counts" (arbitrary quantities reflecting PA intensity over fixed epochs that are calculated on board during measurement) are often used in practice [19]. Accelerometers have seen wide and methodologically varied application in PA research [33][34][35][36][37]. Although accelerometers provide rather good indication of PA intensity and sedentary behavior, especially when combined with additional sensors such as inclinometers and gyroscopes, decisions related to sensor data management and analysis remain somewhat subjective [37][38][39][40]. Specifically, devices of different manufacturers calculate activity counts using different formulae (which are not always published), and there is no consensus on parameters of recording and methods of aggregating acceleration data to reflect comparably [41] the concepts of moderate and vigorous PA. This has led Migueles et al. to conclude "that it is not possible (and probably will never be) to know the prevalence of meeting the PA guidelines based on accelerometer data" [42].
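The two quantities described above, AVM and its gravity-corrected form ENMO, can be written down in a few lines. This is a minimal sketch assuming the accelerations are already expressed in units of g; clipping negative values to zero is a common convention assumed here, not something stated in the text.

```python
import math


def avm(x: float, y: float, z: float) -> float:
    """Acceleration vector magnitude: Euclidean norm of the three axes (in g)."""
    return math.sqrt(x * x + y * y + z * z)


def enmo(x: float, y: float, z: float) -> float:
    """Euclidean Norm Minus One g.

    Assumption: negative results (norm below 1 g) are clipped to zero,
    which is a widely used convention but not specified in the text.
    """
    return max(0.0, avm(x, y, z) - 1.0)
```

For a sensor at rest the norm is about 1 g, so ENMO is close to zero; anything above that reflects acceleration added by movement (or by noise and calibration error).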
Since researchers mostly cannot know whether the forces reflected in acceleration signals are truly applied by the subject or whether the device is worn as instructed (sensor jitter, vehicular transport, and non-compliant uses), machine learning approaches have gained popularity for classifying the type [43] and the intensity [44] of PA from wearable accelerometers. Fergus et al. [45] explored thoroughly several machine learning approaches and feature combinations to classify children's PA type and intensity based on wearable accelerometers achieving best performance on test data with a multilayer perceptron artificial neural network. Deep neural networks have achieved state of the art performance for the prediction of PAEE in pre-school children [46]. Machine learning becomes even more relevant when considering reduced study control of wrist-worn accelerometers compared to hip-wear [47] and especially for PA monitoring via smartphone sensors, where the researcher has even less control over the positioning of the sensors in relation to the body.
Smartphones contain various sensors that can provide relevant information about the intensity and type of PA while the subject is carrying the device. Accelerometer, magnetometer, gyroscope, and GPS have a clear association with PA, but additionally light, proximity and WiFi sensors, barometers, microphones, and cameras can provide extra modalities for PA analysis (for overview see [48]). The interactive nature of the smartphone also allows for attempts at influencing the users' PA [49], which itself is an important field of inquiry for promoting PA behavior change among the youth [50]. The growing popularity of smartphone and wearable fitness apps is leading to huge amounts of data potentially useful for large-scale PA analysis. However, the differences between devices and software, privacy concerns, and data ownership issues lead to a situation where unification and comparison of data collected by different companies is very difficult [51].
An overview of the advantages and disadvantages of PA assessment methods is presented in Table 1. Compared to adults, children's PA is intermittent in nature [52,53], so methods analyzing their PA patterns should consider higher sampling rates and shorter PA intensity estimation epochs than is required for measuring adults. It is also important to consider that it is more difficult to achieve high accelerometer wear protocol compliance in children, and especially early teens [54][55][56]. Individual-level measurement methods described above often assume recording and processing of personal information by researchers and are generally intrusive, requiring human research ethics reviews, the subjects' and their parents' informed consent, and the general bothering of subjects. Intrusiveness also potentially compromises the results by observer effects. Below, ways of overcoming these limitations in school-based PA research are explored.

Spatio-Temporal Distribution of Physical Activity in School
Schools are very specific semi-closed environments where children are required by law to spend lots of time. In OECD countries children spend on average 14% of their waking hours in compulsory classes during primary education (calculated assuming 8 h of sleep based on [57]). If one only considers the school year (September to May) and counts in recess between the classes, then it adds up to a large proportion of time spent in this specific environment. Parts of these spaces with differing attributes can facilitate more or less PA. Playground size has been shown to correlate with PA [58], but the evidence on the relation of other aspects of school architecture to PA is insufficient [59]. There is some evidence indicating positive effects of playground redesigns, markings, and physical structures on PA [60], but others have reached conflicting results [61]. Specifically, there is a lack of evidence on which kinds of playground equipment, and which of their specific features, have the strongest and longest-lasting effects on PA [58]. While one should strive to design the perfect playground for increasing all students' PA, one size might not fit all. Boys and girls of different ages have significantly different play preferences [62] and might require different stimuli for increasing PA [63]. Ethnographic evidence also suggests specific approaches to playground and classroom designs might be necessary for motivating the high-risk group of least physically active students [64]. In addition, Nicaise et al. [65] reported that the effects of playground redesign on PA might not be reflected in wearable sensor data, while observations imply a positive effect.
Exergaming, as a branch of the emerging health behavior intervention paradigm of gamification [66][67][68], has received much attention [69][70][71][72] concerning the PI epidemic. The idea of taking advantage of the neurochemical reward mechanisms utilized in the gaming industry to achieve positive health outcomes is becoming increasingly relevant in the context of pervasive computing. Prevalence of smartphones and wearables in combination with increasing feasibility of integrating gaming hardware into the school environment provide a valuable opportunity for the gamification of PA. Baranowski et al. [73] defined the identification of optimal game designs for attaining PA change as an important research priority while emphasizing specific game context (e.g., recess on playground, in hallway or classroom before or after lunch) and context-specific game design elements (cooperation or competition with self or others while using various reward systems).
All of this implies a need for informed and efficient zoning of schools to facilitate increased PA for all students throughout the school year, ideally with a custom design for each school, season, and day of week within a season. Location-based PA information can be useful for determining the areas that facilitate or hinder PA and for assessing the utilization rate of a playground, its sections, or specific stationary PA equipment.
So far, the spatial distribution of physical activity in school has been studied in schoolyards using GPS combined with heart rate monitoring [74] and accelerometry [75,76]. However, GPS signals are sensitive to environmental factors such as tall buildings [77] and cannot reliably reveal the altitude of the sensor, making GPS inapplicable in multi-level buildings.

Method Proposition
To better understand the spatial aspects of PA in school while also minimizing participant burden in the research process, a hypothetical method for assessing the spatio-temporal distribution of PA in school is proposed: ambient sensors capable of detecting the number of children at a location and classifying the intensity of their PA in real-time without recording any personal information. A computer vision application for accurate estimation of ambient PA based solely on video frames temporarily stored in random-access memory (RAM) could be a viable solution. Adopting a smart sensor with such capacity would allow researchers to delegate the processing of personal information to artificial intelligence thus obtaining PA estimations at a location without violating the subjects' privacy or bothering them at all. The proposed method can be considered as automated direct observation, except while the type and severity of human error during observation can be variable, algorithmic errors should be consistent and therefore easier to account for. Just one or a few of such sensors could suffice for assessing the effectiveness of stationary PA equipment and stimuli aimed at increasing PA at a specific location. Covering a school building with a distributed sensor system could provide a flow of location-based PA data at high temporal resolutions over any length of time that the system is maintained. Internet Protocol (IP) cameras with relatively wide FoV could be placed at strategic locations throughout the building, or alternatively with a uniform distribution to capture the PA in the building. As a semi-closed environment with students arriving and leaving based on a known time schedule, even a rather sparse distribution of the sensors could potentially reveal hallway-, floor-, and school-level PA patterns. 
Ability to detect long-term building-level changes in PA patterns enabled by continuous monitoring of ambient PA could open a new field of intervention research designs. Proposed sensors could also be useful for other settings where one is interested in assessing PA at a location in a privacy-preserving manner.
The output of a single sensor, or the basic measurement unit of the method, is currently envisioned as the PA intensity of each detected child during the length of the prediction epoch. One sample from a single sensor would be the PA intensity levels for each detection during a prediction period/frame range (varying number of values, depending on the number of children visible). In other words, the sensors would measure the intensities of brief displays of PA in FoV (ambient PA) as opposed to measuring PA of individuals. For example, one student can step out of the perceptive field of the sensor during one second and another can step in during the next. Then, if both students were moving at the same PA intensity during the corresponding successive predictions, the measure of ambient PA would remain the same during the 2 s, even though originating from separate individuals.
The raw output of the proposed distributed sensor system could be aggregated and visualized on a 2D graph of a single floor plan or a 3D model of the whole school building where the size of a circle/sphere could represent the average number of students detected by a particular sensor during some period and a color scale could be used to represent the average intensity of the PA of the detections during the period. One can imagine a graph of a school floor plan where at the locations of the sensors a small blue dot would signify a single student standing still; a large purple circle indicating lots of detected students in the scene, but a medium average PA; a large red circle could indicate lots of students performing a group activity entailing vigorous PA. Similar visualizations could be done for various aggregations, computing the average number of detected children and the average PA intensity of the detections during a longer period at the sensor locations. These could then be further aggregated to reveal seasonality (average first recess PA distribution, average Monday within a semester PA distribution, etc.) or for pre-post intervention testing (average PA during the weeks before, during, and after an intervention).
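The aggregation behind such a visualization (marker size from the average number of detections, color from the average intensity) can be sketched as follows. The data layout, one list of per-detection intensity values per prediction epoch, and the function name are assumptions made for illustration only.

```python
def summarize_sensor(epochs):
    """Summarize one sensor over a period.

    `epochs` is a list of prediction epochs; each epoch is a list holding one
    PA intensity value per detected child (hypothetical data layout).
    Returns (mean detections per epoch, mean intensity over all detections).
    The mean intensity is None when nothing was detected.
    """
    n_epochs = len(epochs)
    intensities = [value for epoch in epochs for value in epoch]
    mean_count = len(intensities) / n_epochs if n_epochs else 0.0
    mean_intensity = sum(intensities) / len(intensities) if intensities else None
    return mean_count, mean_intensity


# Two different students seen in successive epochs at the same intensity
# yield the same ambient PA summary as one student seen twice.
two_students = summarize_sensor([[2.0], [2.0]])
```

The first value would drive the circle or sphere size and the second the color scale; longer aggregations (a recess, a weekday, an intervention phase) only change which epochs are passed in.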
For observing changes in whole school PA levels, additional measures can be taken to increase reliability. In schools that record the number of students in the building at the beginning of each lesson, the sensor system data could potentially be normalized by considering the number of students present and measures such as accessible floor area of the building and the floor area monitored by the sensors. Such "student-density" measures should increase over-time and between-school comparability of the estimated ambient PA levels, especially during winter in colder climates when students remain indoors during recess.
Combining ambient PA measures with direct observations, interviews, and/or questionnaires would enable thorough analysis of PA in school. The value of such data could be further increased by simultaneously collecting rich contextual data such as lunch menu and its estimated sugar content, weather conditions, concurrent events, group vaccination, student sick leave rates, etc. Even though such a method would not have the capacity to reveal whether the students achieve their recommended hour of daily MVPA, it could be a valuable tool for assessing the capacity of an intervention to activate children on location and whether the PA reactions to intervention remain similar over time.

The Promise of Computer Vision
Motion detection and object (such as a human) tracking tasks have received much attention in the computer vision field [78][79][80][81][82][83]. When using stationary cameras in a school building where the background should be mostly static over a given length of time, background-subtraction methods [84,85] could potentially be used for obtaining a proxy for PA intensity as the ratio of black and white pixels in the subtracted image. However, such a simple approach would likely be sensitive to variance of scale and changes in illumination, and would not differentiate between the PA of children and any other motion. A more advanced approach to ambient PA estimation stems from human action recognition (HAR) (for an overview see [86]). HAR algorithms are usually developed and tested on datasets containing up to 101 actions such as brushing teeth, bowling, frisbee catch, playing guitar, baby crawling, band marching, etc. [87]. Since actions are defined through time, HAR research emphasizes temporal features (difference between consecutive video frames), while object recognition algorithms are mostly concerned with just the spatial information (the image). Thanks to advances in hardware and machine learning methods, HAR has seen rapid development in recent years [88]. Simultaneously applying two convolutional neural networks (CNNs), one for the spatial and the other for the temporal domain, has been shown to be an effective approach for learning features of many abstract actions from video [89]. These two-stream methods have been shown to benefit from fusing together the spatial and temporal features learned by the separate networks to increase recognition accuracy. This fusion can be applied in the convolutional layers (early fusion) [90] or the fully connected layers (late fusion) [91,92]; either approach can provide task-specific feature learning benefits.
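The simple background-subtraction proxy mentioned at the start of this section can be illustrated with a toy example. This is a minimal sketch using plain nested lists of greyscale values; the threshold value and function name are hypothetical, and it reports the fraction of changed ("white") pixels rather than a black-to-white ratio.

```python
def foreground_ratio(frame, background, threshold: int = 25) -> float:
    """Fraction of pixels whose grey value differs from the static background
    by more than `threshold` (an illustrative value, not taken from the text).

    `frame` and `background` are equally sized 2D lists of 0-255 grey values.
    """
    total = 0
    changed = 0
    for frame_row, background_row in zip(frame, background):
        for pixel, reference in zip(frame_row, background_row):
            total += 1
            if abs(pixel - reference) > threshold:
                changed += 1
    if total == 0:
        raise ValueError("empty frame")
    return changed / total
```

As noted above, a proxy of this kind responds to any change in the scene, including illumination shifts and non-human motion, which is why the learned approaches discussed next are of interest.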
Significant advances have recently been made in processing efficiency in action recognition. Several approaches have managed to reduce the computational cost of the task to enable real-time action recognition on established benchmarking datasets [93][94][95]. Singh et al. [93] translated the Single Shot Detector [96] network architecture, designed for rapid detection of multiple objects in images, to the action recognition task in the temporal domain resulting in capacity for online independent construction of multiple "action tubes" containing the humans whose actions are to be classified. By applying a novel greedy classification algorithm to the tubes, they achieved performance superior to state-of-the-art algorithms that are not capable of online action localization and did it all at real-time speeds. Such online capacity is especially important for the proposed method whereby PA intensity predictions are to be made at a constant frequency based on live video input.
Zhang et al. [94] combined several methods of knowledge transfer, enabling a CNN operating on low-resolution motion vector images to utilize the knowledge of another CNN learned from high-resolution optical flow, allowing reasonable action recognition performance at faster-than-real-time speeds. Another approach [95] applied the efficient object detection architecture of YOLO v2 [97] to the output of FlowNet2 [98] (an optical flow estimation CNN) as the temporal stream and the same architecture to the spatial stream. Task-specific fine tuning and integration of FlowNet2 into the two-stream architecture in combination with early fusion of the spatial and temporal features enables end-to-end trainability and real-time speeds [95].
Another recent HAR innovation proposed by Li et al. [99] introduced convolutions exploiting the spatial correlations in images for efficient motion-based action localization by using a Long Short-Term Memory cell (LSTM, a neural network mechanism that enables "remembering" previous states [100]) between convolutional instead of fully connected layers. Spatial correlations between the previous hidden state provided by the VideoLSTM and the current input reveal the likely location of an action based on motion, thereby making the whole process more efficient. Such motion-based attention can be especially useful for the intended setting utilizing stationary cameras. Building on this technique of motion-based attention, Zhao and Snoek developed an algorithm for detecting the spatiotemporal extent of actions by embedding the RGB spatial and optical flow temporal streams into a single two-in-one stream network [101]. Aside from simplifying the computation of action recognition, their approach also assigns motion direction to the actor as an extra feature distinctive to many actions (e.g., the difference between sitting down and standing up, or PA-entailing motion towards a direction relevant to the research questions studied with proposed smart sensors).
The processing speed and energy efficiency of proposed smart sensors could potentially also benefit from Deep Compression [102]. This technique, developed by Han et al., minimizes redundancies in deep neural networks by pruning ineffective connections, quantizing the weights, and Huffman coding the resulting weight distribution. Such compression, when applied to convolutional neural networks, was accompanied by a 3-4-fold increase in processing speeds and a 3-7-fold increase in energy efficiency without significant loss of classification performance. The size reduction achieved by Deep Compression allows large neural networks to fit in on-chip SRAM, thus removing the need for accessing DRAM during processing, which consumes the most power during neural network operation. Han et al. [103] propose a specific hardware design that would take full advantage of Deep Compression and power efficiency of SRAM-based computation (120-fold energy saving compared to DRAM-based implementations). Furthermore, novel hardware architectures utilizing the emerging Resistive RAM technology are being developed precisely with the goal of efficient neural network computation on very small chips [104,105]. While the proposed distributed sensor system is currently planned as a centralized computing implementation, the developments in specialized low-power artificial intelligence chips suggest that a distributed computing implementation may become possible in the future.
Considering the machine learning task described below and potentially a low number of classes to be distinguished, using single-channel greyscale input might also be a viable option for further reducing network size and computational cost. Similarly, lower resolutions of input might be considered as the indoor environment forces relatively small distances between the subjects and the camera. Whether such approaches would be accompanied by severe loss of performance is to be determined with experimentation.

Action Intensity Classification by Acceleration Vector Magnitude Estimation
Supervised learning in video analysis is usually implemented by assigning semantically subjective labels to frames of video and learning the "typical" features from instances of visual data with such labels. One approach to action intensity classification would be to create a training dataset where the frames of moving children are annotated by visually assessing the intensity of PA displayed in a given range of video frames. Instead, this work proposes to annotate the data based on real accelerations measured from subjects in the training video. Such an approach could essentially fuse the sensors and the research fields of accelerometer-based PA monitoring and video-based HAR. Synchronizing accelerometers recording at 30 Hz with a video camera recording at 30 fps can create raw training data where each frame corresponds to three acceleration scores on the accelerometer's axes for each visible subject. AVM can be then calculated for each subject in each frame, which can then be preprocessed and aggregated (e.g., cumulative sum or average) according to the acceleration prediction frequency, forming the ground truth.
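The label construction described above (per-frame AVM, then aggregation over the prediction epoch) could look roughly like this. It is a minimal sketch for a single subject, assuming the accelerometer samples are already aligned one-to-one with video frames; the function names and the choice between sum and mean are illustrative.

```python
import math


def frame_avm(sample):
    """AVM of one (x, y, z) accelerometer sample aligned to one video frame."""
    x, y, z = sample
    return math.sqrt(x * x + y * y + z * z)


def epoch_labels(samples, frames_per_epoch: int = 30, how: str = "mean"):
    """Aggregate per-frame AVM into one ground-truth value per epoch.

    Assumption: `samples` is a list of (x, y, z) tuples for one subject, one
    per frame. Trailing frames that do not fill a whole epoch are dropped.
    """
    if frames_per_epoch <= 0:
        raise ValueError("frames_per_epoch must be positive")
    if how not in ("mean", "sum"):
        raise ValueError("how must be 'mean' or 'sum'")
    labels = []
    last_start = len(samples) - frames_per_epoch
    for start in range(0, last_start + 1, frames_per_epoch):
        window = [frame_avm(s) for s in samples[start:start + frames_per_epoch]]
        total = sum(window)
        labels.append(total if how == "sum" else total / frames_per_epoch)
    return labels
```

Any preprocessing mentioned in the text, such as gravity correction or smoothing, would be applied to the per-frame values before this aggregation step.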
To explore the potential of such a dataset, a sample of proposed training data was collected using a Logitech C922 webcam (Logitech International S.A., Newark, CA, USA) and four Actigraph wGT3x-BT accelerometers (ActiGraph LLC, Pensacola, FL, USA) worn on the hips of three children and one adult as actors. The camera's FoV covered a ~5 × ~4 m floor area with a fixed camera at ~2 m height. Table 2 presents the analysis of a 10.4-min synchronized clip where all four subjects or at least their torsos are mostly visible. For the most part, the clip contains structured play where the adult is acting as the game leader/instructor. It is important to note that due to the nature of the games played, the clip contains significant amounts of synchronized motion (game leader says "Go!" and children react by jumping/moving) which can increase the correlation coefficients. The bottom row of Table 2, corresponding to the section with least synchronized motion, should better represent a real-world situation in a school hallway. Conversely, the webcam's auto-focus created some noise in the motion features, which can somewhat reduce the correlations compared to using stable focus. Columns of the table present different proposed acceleration prediction frequencies as cumulative aggregates of both total motion information in the video (represented as total H.264-encoded motion vector magnitude per frame) and acceleration domains (represented as the sum of all four subjects' ENMO). Analysis of the sample of raw training data shows moderate correlations between the temporal features in video and the accelerations of subjects in the scene. If one considers the task of the algorithm as regression of video to AVM of objects in FoV, then a 0.5 correlation with the target in the temporal stream suggests that even a relatively simple spatial stream architecture might provide the additional features for accurate estimation of AVM intervals.
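The kind of analysis summarized in Table 2 (cumulative aggregation of both signals at a given prediction frequency, followed by a correlation) can be sketched as below. This is an illustrative reconstruction, not the authors' analysis code: the function names are hypothetical and Pearson's coefficient is assumed as the correlation measure.

```python
import math


def aggregate(values, frames_per_epoch: int):
    """Cumulative (summed) aggregation of a per-frame series into epochs.
    Trailing frames that do not fill a whole epoch are dropped."""
    last_start = len(values) - frames_per_epoch
    return [sum(values[i:i + frames_per_epoch])
            for i in range(0, last_start + 1, frames_per_epoch)]


def pearson(a, b) -> float:
    """Pearson correlation between two equally long series."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_a * var_b)


def epoch_correlation(motion_per_frame, enmo_per_frame, frames_per_epoch: int):
    """Correlate summed motion-vector magnitude with summed ENMO per epoch."""
    return pearson(aggregate(motion_per_frame, frames_per_epoch),
                   aggregate(enmo_per_frame, frames_per_epoch))
```

Calling `epoch_correlation` with different `frames_per_epoch` values would correspond to the different columns of such a table.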
In a simple form, the temporal stream could quantify the total visible motion and the spatial stream could count the number of children to get average PA intensity per subject per prediction epoch. A more accurate model would separate the motion of children from other motion, noise, and changes in illumination while the RGB stream would not only count the children but based on features such as body position, and its variance along the action tube, also gain information about how much of the motion is associated with which subject. This can provide additional information on the nature of ambient PA displayed in a specific scene with several subjects. Deep learning models could become even more precise to reflect PA EE if training data labels were calculated accounting for the weight and/or the body mass index (BMI) of the actors. The formula for calculating the PA EE proxy labels for the whole dataset could likely be derived from thorough analysis of only a few clips where actors of various body types perform activities entailing the full range of PA intensities. Sliding window averaging of the accelerations prior to label aggregation can potentially enhance feature learning and the method's construct validity by reducing the effects of device jitter and sensor noise.
As the analysis presented in Table 2 concerns a clip where subjects do not step into or out of the camera's FoV, it does not reflect the eventual PA measurement setting very well. Since the proposed measurement units are defined by the prediction frequency (30 frames or 1 s in Figure 1), some criteria should be developed for what should constitute a valid detection. For example, when a child or their head/torso appears in the corner of the scene creating an action tube of length only 15 frames within a 30-frame epoch, then it likely should not be classified. Action tubes of length 25 within the epoch, on the other hand, could already carry enough features to accurately assess the level of PA performed by the detected child during that second (perhaps they just stepped into the scene five frames after the beginning of the current prediction epoch or stepped out before the end of the prediction epoch). Such detection-validity thresholds should be determined by thorough experimentation and expert assessment of the algorithm's performance on the test set. The optimal period of aggregating acceleration scores to ground truth labels depends on the temporal resolution requirements of real-time implementation on one hand and the need for a good representation of age-specific (intermittency) and environment-specific (bouts wholly observable within the FoV of the camera in a school hallway) PA patterns on the other. In the early stages of such research a 5 Hz "data unification frequency" could be convenient so that data filmed with cameras using NTSC (29.97 or 59.94 fps) and PAL (25 or 50 fps) standards could be easily combined in the dataset. However, this would restrict the selection of the final prediction epoch to lengths divisible by 200 ms and would not allow cumulative aggregation of accelerations if both types of cameras were combined with a 30 Hz accelerometer sampling rate.
On the other hand, using different acceleration sampling rates could compromise construct validity by distorting the ground truth. Additionally, to maintain some comparability with other accelerometer-based PA research, using NTSC cameras and 30 Hz acceleration sampling might be preferable, as it is the most common frequency used in PA accelerometry [40].
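The detection-validity criterion and the 5 Hz unification frequency discussed above can be sketched as follows. The 25-of-30-frame threshold is taken only from the illustrative example in the text and is an assumption; the actual threshold would have to be determined experimentally. NTSC rates are treated at their nominal values (30 and 60 fps), which is also an assumption, and both function names are hypothetical.

```python
from fractions import Fraction


def is_valid_detection(tube_length, epoch_length=30,
                       min_fraction=Fraction(5, 6)):
    """True if the action tube covers enough of the prediction epoch.

    min_fraction = 25/30 is an assumed placeholder, not a validated value.
    """
    return Fraction(tube_length, epoch_length) >= min_fraction


def frames_per_unit(nominal_fps, unification_hz=5):
    """Whole number of video frames in one data unification unit."""
    frames, remainder = divmod(nominal_fps, unification_hz)
    if remainder:
        raise ValueError("frame rate is not a multiple of the unification rate")
    return frames


# Nominal NTSC (30, 60) and PAL (25, 50) rates all divide evenly at 5 Hz.
units = {fps: frames_per_unit(fps) for fps in (25, 30, 50, 60)}
```

At 5 Hz each unit spans 200 ms, which is why the final prediction epoch would be limited to multiples of that duration.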
Aside from sampling frequency, the optimal FoV should also be determined for the cameras to be used in the collection of training data. Very large FoV or even 360° cameras could provide the best floor-area coverage per sensor when applied in school, but the distortions ("fisheye" effect) in such video could make the machine learning task much more difficult and therefore compromise measurement accuracy of the proposed sensors. Hence, there is likely a trade-off between the sensor's precision and the size of its perceptive field.
A great benefit of such continuous training data is that by selecting different starting moments (t + 15|10|6|5 frames and accelerations) as data augmentation prior to label aggregation, the size of the dataset could be multiplied with little effort and this can be helpful for learning more general features of PA intensity displays.
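A minimal sketch of this offset-based augmentation is given below. It assumes one acceleration value per video frame (for example 30 fps video with 30 Hz accelerometry) and a mean-based epoch label; the function names and the data layout are hypothetical.

```python
from statistics import fmean


def epochs_from_offset(frames, accelerations, epoch_length, offset):
    """Cut aligned (frame window, label) pairs starting at a given offset."""
    if len(frames) != len(accelerations):
        raise ValueError("frames and accelerations must be aligned 1:1")
    pairs = []
    for start in range(offset, len(frames) - epoch_length + 1, epoch_length):
        stop = start + epoch_length
        # The label is aggregated only after the window has been chosen.
        pairs.append((frames[start:stop], fmean(accelerations[start:stop])))
    return pairs


def augment(frames, accelerations, epoch_length, offsets):
    """One set of epochs per starting offset, keyed by that offset."""
    return {offset: epochs_from_offset(frames, accelerations,
                                       epoch_length, offset)
            for offset in offsets}


# Two seconds of hypothetical data: frame indices stand in for images.
frames = list(range(60))
accelerations = [float(i) for i in range(60)]
augmented = augment(frames, accelerations, epoch_length=30, offsets=(0, 15))
```

Epochs cut from different offsets overlap in time, so clips derived from the same recording should presumably be kept in the same train or test split to avoid leakage.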
This kind of action intensity classification can benefit from the advances in deep learning methods applied in HAR, but the task of the algorithm is fundamentally different. HAR is mostly concerned with recognizing actions with a specific relevant function (e.g., detecting getting up from bed in a smart home to start the coffeemaker), but the proposed method would attempt to classify the intensity of any human action and inaction regardless of the goal of the behavior. In other words, very different sequences of human body positions can fall under the same action intensity category, even though they may not represent the same actions. For example, the estimated AVM can be the same for two students moving at the same intensity, one of whom is jumping while the other is sprinting. Therefore, the variance in the appearance of features in each PA intensity cluster would be much higher than for classes representing specific actions. Nevertheless, the ground truth is based on directly measured accelerations, and as initial tests show (Table 2), acceleration as a feature of PA expression reflects well in video. As such, the proposed deep learning approach does not really belong to the domain of action recognition but is better described as a form of multimodal or sensor-supervised learning where the neural network learns the features in two-dimensional spatial data representing movement and based on this knowledge makes predictions in the form of AVM intervals. This action intensity classification task is essentially an ordinal regression problem: more variance in body positions per detected child (action tube entropy) and/or bigger displacement distance in the sequence of frames (action tube shape) should indicate a bigger AVM and a higher-order PA intensity class.
Due to the somewhat linear nature of the task, the CNN architecture should benefit from class correlations for class distinction: features of vigorous PA can be somewhat similar to features of moderate PA, but should differ more from the features of light PA. Class correlations have been shown to improve performance [107] even when classifying abstract actions.
In general, as a classifier, the algorithm would be working with large overlapping feature spaces of ordinal classes. The optimal neural network architecture for this type of computer vision system is yet to be determined and partly depends on the quality and amount of training data available and necessary to learn PA intensity features well enough for application as a measurement technique.
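One common way to expose the ordering of classes to a network is a cumulative ("rank") target encoding, sketched below as an illustration of the ordinal regression framing. This is a swapped-in, generic technique rather than the architecture proposed here, and the AVM cut points, the number of classes, and all function names are hypothetical.

```python
from bisect import bisect_right


def intensity_class(avm, cut_points):
    """Map an AVM value to an ordinal class index via sorted cut points."""
    return bisect_right(cut_points, avm)


def ordinal_target(class_index, n_classes):
    """Cumulative encoding: element k answers 'is the class above k?'."""
    return [1 if class_index > k else 0 for k in range(n_classes - 1)]


def decode(probabilities, threshold=0.5):
    """Count leading outputs above the threshold to recover the class."""
    rank = 0
    for p in probabilities:
        if p <= threshold:
            break
        rank += 1
    return rank


# Assumed cut points separating light, moderate, and vigorous PA.
CUT_POINTS = [1.0, 2.0]
label = intensity_class(1.4, CUT_POINTS)       # index of the middle class
target = ordinal_target(label, n_classes=3)    # training target for it
predicted = decode([0.9, 0.3])                 # hypothetical network output
```

With this encoding, neighbouring classes share most of their target bits, so confusing moderate with vigorous PA is penalized less than confusing light with vigorous PA.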

Discussion
This work poses the following two hypotheses: (i) room-level measurement of PA is useful for determining best practices of school-based physical activity interventions; and (ii) modern computer vision technology is capable of privacy-preserving room-level physical activity estimation. A course of action is also proposed to test these hypotheses: (ii) deep learning on a dataset of synchronized video and accelerometry; and (i) location-specific or whole-school pre-post intervention analysis of data provided by proposed smart sensors.
Construct validity of the proposed sensor to measure ambient PA is currently difficult to estimate as, to the knowledge of the author, ambient PA has not been researched in this manner. However, deep neural networks continue to perform tasks previously thought to be impossible for machines, and due to the nature of the training data, construct validity for a single sensor can be thoroughly assessed. Besides visually analyzing the PA displays and corresponding predictions and ground truth labels in the test set, the sensors could also be validated in the field, potentially using additional sensors (heart-rate monitor, thermometer and/or thermal camera) in combination with subject-specific attributes such as body weight, BMI, the weight of their clothes and backpack, and the hardness and density of their shoe soles (potential effects of footwear on hip-worn accelerometer signals and inferred PA microexpression intensity). Assessing construct validity of the proposed distributed sensor system to measure school-level PA would be much more difficult. The sensors could either be placed strategically (ends of hallways and larger open areas) or be spaced at uniform distances with a standard sensor-FoV-to-floor-area ratio. The former, cheaper approach could be viable to compare school-level PA over time, but comparison between schools of different architecture would be rather limited. The latter approach would allow increased between-school comparability, but the distributed sensor systems would be more complex and expensive. Ideally, such sensors could also be deployed with total coverage by maintaining some FoV overlap between the sensors. This would enable stitching together a whole-school perceptive field that would allow seamless tracking of individuals and their PA levels.
Currently, such an approach seems excessive and a more sparse deployment of wide-FoV sensors should suffice to capture PA distribution patterns well enough to test hypotheses.
Occlusion in crowded scenes and the presence of adults in the building threaten the reliability of such a method. These issues should be addressed early on while creating the machine learning dataset. To reduce occlusion, the sensors should be placed relatively high, so that the spatial stream can count the subjects more reliably. For this, the training data should be collected with cameras fixed at heights varying from 2 to 3 m at viewing angles that maximize the perceptive field at specific heights; this should also ensure applicability of the sensors in various architectural settings. The data should also contain crowded scenes of various PA distributions among the crowd. To avoid false positives due to school personnel, grownups could be included in the training dataset, either left unannotated or labelled as "non-detection". The latter option would provide researchers with additional information regarding their research questions (e.g., teachers actively implementing a PA intervention); however, it would also change the regression-like nature of the machine learning task and therefore increase computational complexity. Coming back to the idea of school as a semi-closed system, assumptions could be made that occlusion and grownup false positives follow a somewhat constant distribution at least within a semester of a school year. For good measure, events that bring more adults or larger crowds into the building or out of it should be recorded as contextual data. Considering these notions, the inference of student PA distribution from such a sensor system might yet be valid even at considerable occlusion and adult presence rates. Since the proposed method would be gathering high-resolution data throughout the school year, ideally several years in a row, the sheer amount of data would likely enable detection of relevant school-level PA patterns.
Aside from technical and statistical issues, human factors could threaten such a method as well. Even when certifying such a sensor system as truly privacy-preserving with no possibility to retrieve video frames from RAM, the teachers, parents of students, and the wider public might not trust such activity. A camera in a school, even when called a "smart sensor", might cause concerns regarding potential surveillance and security of the data. Therefore, informing the public with an adequate science communication strategy could have an important role when attempting to apply such a method.
Since ambient PA has not been studied before, testing specific interventions might not be the only scientific value of such measurement; there is also a large explorative component to this research. This new form of data could potentially lead to discovery and new hypotheses, and not necessarily only concerning students' health behavior. New insights into crowd and pedestrian behavior dynamics and communal building architecture could be gained in addition to currently unforeseeable phenomena. For this, it would be important to collect dense contextual data alongside the ambient PA distribution and supporting indirect measures.

Conclusions and Future Work
The proliferation of physical inactivity is increasing the demand for PA research, and for effective large-scale manipulation of health behavior, best practices of school-based PA interventions need to be developed. Verification of the effectiveness of interventions could benefit from unobtrusive privacy-preserving monitoring of ambient PA in schools. To this end, recent advances that have enabled real-time recognition of many rather complex actions seem promising. This work proposes a novel method for unobtrusive ambient PA monitoring whereby the processing of personal data is delegated to deep learning neural networks while maintaining enough validity and reliability to draw meaningful inferences on the PA at the location.
Future work priorities should entail development of the synchronized dataset of video and accelerometry, preferably in a way such that it could be shared between researchers and used for benchmarking; if deemed ethical, this could be achieved by contracting child actors. Developing a segmentation-based annotation tool can likely simplify data annotation by indicating the start and termination of labelling when subjects step into and out of the frame. Once raw accelerations are annotated to the video frames, different options for constructing labels should be explored: different weights on the accelerometers' vertical and horizontal axes, cutting acceleration peaks to potentially ease learning, and normalization of accelerations by actor BMI and other individual attributes to better reflect the measurement construct. This should eventually be followed by testing different machine learning frameworks to come to a real-time capable model for a privacy-preserving sensor.
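The label-construction options listed above can be sketched as three small, composable functions. Everything here is a hypothetical illustration: which accelerometer axis is vertical, the weighting scheme, the clipping ceiling, and in particular the direction and form of the BMI scaling are assumptions that would need to be established empirically.

```python
from math import sqrt


def weighted_magnitude(ax, ay, az, horizontal_weight=1.0, vertical_weight=1.0):
    """Acceleration magnitude with separate axis weights.

    Assumes az is the vertical axis and ax, ay are horizontal.
    """
    return sqrt(horizontal_weight * (ax * ax + ay * ay)
                + vertical_weight * az * az)


def clip_peak(value, ceiling):
    """Cut acceleration peaks above an assumed ceiling."""
    return min(value, ceiling)


def scale_by_bmi(value, bmi, reference_bmi):
    """Assumed linear scaling by BMI relative to a reference BMI."""
    return value * bmi / reference_bmi


def make_label(ax, ay, az, bmi, reference_bmi=20.0, ceiling=5.0):
    """Chain the three options; parameter defaults are placeholders."""
    magnitude = weighted_magnitude(ax, ay, az)
    return scale_by_bmi(clip_peak(magnitude, ceiling), bmi, reference_bmi)


label = make_label(3.0, 4.0, 0.0, bmi=20.0)
```

Keeping each option as a separate function makes it straightforward to compare label variants against each other before committing to one definition of the measurement construct.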