Computer Vision Tasks for Ambient Intelligence in Children’s Health

Computer vision is a powerful tool for healthcare applications since it can provide objective diagnosis and assessment of pathologies, independent of clinicians’ skills and experience. It can also help speed up population screening, reducing healthcare costs and improving the quality of service. Several works summarise applications and systems in medical imaging, whereas less work is devoted to surveying approaches for healthcare goals using ambient intelligence, i.e., observing individuals in natural settings. Moreover, there is a lack of papers providing a survey of works exhaustively covering computer vision applications for children’s health, which is a particularly challenging research area considering that most existing computer vision technologies have been trained and tested only on adults. The aim of this paper is therefore to survey, for the first time in the literature, the papers covering children’s health-related issues by ambient intelligence methods and systems relying on computer vision.


Introduction
Computer vision (CV) offers powerful tools to assist healthcare applications, especially when coupled with artificial intelligence and machine learning. CV applications can provide objective evidence of the presence of pathologies or an assessment that is not dependent on clinicians' skills and experience. They can also help speed up population screening, reducing healthcare costs and improving the quality of service [1].
According to the related scientific literature [2], there are two distinct levels at which CV can be effectively exploited: physician-level diagnostics (medical imaging) and medical scene perception (ambient intelligence). In medical imaging, the interior of the body is represented for clinical diagnosis and medical intervention, whereas ambient intelligence covers techniques aimed at recognizing people's activity and their physical, motor and mental status while they move and act in physical spaces.
In its broader sense, ambient intelligence is an umbrella term that encompasses intelligent and ubiquitous sensing, smart computing, and human-centred interfaces combined to deliver environments that are sensitive and responsive to people's presence and activities. In healthcare, ambient intelligence can refer to a continuous, non-invasive awareness of the activity and health status of individuals, patients and people in need in a physical space, which can assist doctors, nurses and other healthcare workers with clinical tasks such as patient monitoring, automated documentation and protocol compliance monitoring [3,4]. Cameras and visual sensors are key ingredients of ambient intelligence, as they convey precious information about the activity and the behaviour of people in an environment [3]. Visual data can also be processed to unobtrusively measure individuals' vital signs and to support visual analyses of disease signs and symptoms [5]. CV tools come into play here as key enablers of physicians' and caregivers' tasks based on visual inspection.
Regarding the related scientific literature, whereas several works summarise applications and systems in medical imaging [6], less work is devoted to surveying approaches for ambient intelligence [7,8]. More importantly, most of the current literature focuses on CV for ambient intelligence in adult and older adult care, whereas there is a lack of papers that comprehensively review work on CV for children's health. This is an emerging and cogent topic, which has lately been receiving growing attention from health organizations and healthcare institutions [9,10]. The perspectives that ambient intelligence and innovative health technologies may open in paediatric care are manifold and can strongly benefit from research and technology advancement [10,11]. In particular, CV coupled with artificial intelligence and machine learning can support several clinical tasks, for disease detection or well-being monitoring. Among them, the clinical tasks most commonly considered comprise the detection and assessment of:

• Neurocognitive impairment (e.g., based on the Prechtl General Movement Assessment, GMA) or early signs of neurocognitive developmental disorders (e.g., Autism Spectrum Disorders, ASD, or Attention Deficit Hyperactivity Disorder, ADHD).
• Dysmorphisms (e.g., cleft lip) or physical or motor impairments (e.g., gait and walking disorders) due to genetic disorders or surgery.
• The well-being and health status of newborns (e.g., vital signs and sleep monitoring in the nursery or in the Neonatal Intensive Care Unit, NICU) and children.
To support these clinical tasks, CV tools need to be able to perform low-level tasks such as face detection and head pose estimation, gaze tracking and analysis, motion detection and tracking (e.g., legs and arms), and measurement of physiological signs (e.g., heart rate). These low-level tasks underpin more complex inferences for the detection and assessment of activities, vital signs or disease symptoms.
So far, some survey papers have analysed the state of the art on one specific low-level task (e.g., body motion, gaze tracking, head pose estimation), which might be related to neurological diseases or motor impairments. For instance, methods and systems aimed at an early neurological disorder diagnosis have been recently collected in [12]. Prechtl general movement assessment by CV was summarized in [13,14]. A review of works dealing with gait deviations (also in children) in individuals with intellectual disabilities has been proposed in [15]. Data-driven detection techniques that quantify behavioural differences between autism cases and controls are reported in [16,17].
This paper aims to fill the aforementioned gap by providing, for the first time in the literature, a comprehensive overview of the papers covering children's health-related issues by ambient intelligence methods and systems relying on computer vision. A coarse taxonomy for the paper can be obtained by dividing works according to the body part they concentrate on: the face, e.g., for extracting gaze direction and facial expressions, or the whole body, e.g., for gait analysis, posture estimation, and human-object interaction.
The proposed taxonomy is schematized in Figure 1, and the paper is arranged accordingly: Section 2 discusses works that introduced CV-based systems relying on tasks related to children's head and face, such as face analysis and head-pose estimation, whereas Section 3 deals with works involving CV tasks aimed at analysing the human body or part of it. These sections map each CV method to the clinical problem it addresses, providing the reader with the background clinical motivation.
Section 4 then discusses some challenges to be addressed to reach a level of performance that allows the instruments to be effectively used in clinical practice and, finally, Section 5 concludes the paper.

Face Analysis and Head Movements
Children's faces contain a variety of valuable information regarding their state of health. Indeed, due to physiological or behavioural responses, certain pathological conditions alter the expression or appearance of children's faces.
Contactless approaches, such as computer vision methods, may detect and analyse the most relevant facial features, thus providing clinicians (or parents, teachers, caregivers, etc.) with unobtrusive and objective information on children's health status.
In the literature, many efforts have been made in this field, documented by a plethora of research papers that are reported and discussed in this section. The studies range from the analysis of children's face morphology to recognize genetic disorders, to head and gaze tracking as a tool for large-scale screening of neurocognitive problems, to children's facial expression recognition.
For each reported work, the clinical aim, the methods used, the performance and, where applicable, the limitations of the study are pointed out.
A selection of papers was initially made by querying the research databases. Among all the scientific papers retrieved from these databases, a further selection was conducted based mainly on the scientific content (many works were in fact not relevant for the purposes of the proposed survey), the type of publication (journals were preferred to conferences in case of comparable ideas) and finally on the number of citations (articles prior to 2020 with fewer than 10 citations were not considered).
The remaining documents were split depending on the task: (i) analysis of the morphology of the face (see Section 2.1); (ii) head pose estimation and/or gaze tracking (see Section 2.2); and (iii) facial expression recognition (see Section 2.3).
In the following subsections, tables are used to organize the information. Each table reports (i) the proposed computer vision approach used to perform the analysis, (ii) the clinical task, (iii) the obtained performance and (iv) the test data population in terms of cardinality and age of children. Regarding the performance, the scores and the metrics released by the authors are reported. In the case of non-quantitative measurements, the term 'qualitative' has been used in the relevant table cell. Section 2.4 concentrates on papers exploiting multimodal data to capture and analyse as many aspects of face-related behaviours as possible. Then, Section 2.5 reports some datasets that have been made public and available to researchers and data scientists to enable them to train and validate their methods. Finally, in Section 2.6, the most recent and promising methods that attempt to address automatic face analysis challenges are introduced and discussed.

Face Morphology Analysis
Facial morphology refers to a series of many different complex traits, each influenced by genetic and environmental factors. In Table 1, a summary of the selected work for face morphology analysis is reported.
Table 1. Summary of the selected work for face morphology analysis. 'acc' = accuracy (correct predictions/total number of predictions with respect to the clinical goal occurrences tested); CNN = Convolutional Neural Network; RMSE = root-mean-square error (it measures the differences between values predicted by a model and the values provided by experts); SVR = Support Vector Regression.

Table 1 columns: Work (Year); Method; Clinical Task; Metrics; Dataset Population/Age (h = hours, w = weeks, m = months, y = years).

In the literature, an early clinical task of interest has been the quantification of facial asymmetry in children with unilateral cleft lip nasal deformity. The authors in [20,21] developed a computer vision-based method in which a template mesh is deformed to fit a target mesh using a geometric point detector. To this end, the authors (1) identify the three-dimensional midfacial plane in children with an unrepaired cleft lip, (2) quantify nasolabial symmetry (by assessing, for instance, symmetry scores for cleft severity) and (3) determine the correlation of these measures to clinical expectations. A total of 35 infants (ages 4 to 10 months) with unrepaired unilateral cleft lips and 14 infant controls were enrolled in this study. Significant differences in symmetry scores were found between cleft types, and before and after surgery.
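As a rough illustration of how a symmetry score can be derived once a midfacial plane is known, the sketch below mirrors left-side landmarks across the plane and measures their mean distance to the paired right-side landmarks. The plane representation (a point and a unit normal), the landmark pairs and the function names are assumptions made for this example only; the cited works operate on dense 3D meshes and define their own scores.

```python
import math

def reflect(point, plane_point, normal):
    """Mirror a 3D point across the plane through plane_point with unit normal."""
    d = sum((p - q) * n for p, q, n in zip(point, plane_point, normal))
    return tuple(p - 2.0 * d * n for p, n in zip(point, normal))

def asymmetry_score(left, right, plane_point=(0.0, 0.0, 0.0), normal=(1.0, 0.0, 0.0)):
    """Mean distance between mirrored left landmarks and their right counterparts.

    0 means perfect symmetry about the plane; larger values mean more asymmetry.
    """
    dists = [math.dist(reflect(l, plane_point, normal), r) for l, r in zip(left, right)]
    return sum(dists) / len(dists)

# Hypothetical paired landmarks (in mm), with the midfacial plane at x = 0.
left_pts = [(-10.0, 5.0, 2.0), (-6.0, -3.0, 8.0)]
symmetric_right = [(10.0, 5.0, 2.0), (6.0, -3.0, 8.0)]
shifted_right = [(10.0, 8.0, 2.0), (6.0, -3.0, 12.0)]

print(asymmetry_score(left_pts, symmetric_right))  # 0.0
print(asymmetry_score(left_pts, shifted_right))    # 3.5
```

A score of this kind could be compared across cleft types or before and after surgery, which is the type of comparison reported above.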
Mercan and colleagues [18,19] also aimed at developing a computer vision-based approach to analyse 3D facial images of 50 infants and 50 children (aged 8-10 years) before and after primary cleft lip repair. They assessed a specific set of features related to unilateral cleft lip nasal deformity: dorsal deviation, columellar deviation, nasal tip asymmetry, and blunting of the alar-cheek junction. They also showed a correlation between this set of measures related to nasolabial symmetry and aesthetic appraisal, demonstrating that computer vision analysis techniques can quantify nasal deformity at different stages.
Another application for automatic facial morphology analysis is the estimation of the postnatal gestational age, to assess whether or not infants are premature, which helps clinicians to decide on suitable post-natal treatment. The work of Torres et al. [23] focused on the development of a novel system for postnatal gestational age estimation using small sets of images of a newborn's face, foot and ear. A Convolutional Neural Network with a two-stage architecture predicts broad classes of gestational age; then, it fuses the outputs of these discrete classes with the baby's weight to make fine-grained predictions of gestational age using Support Vector Regression.
Recently, most studies aimed at analysing children's face morphology by computer vision methods have instead focused on recognising facial dysmorphisms due to genetic disorders. This is a complex recognition problem: several genetic disorders can cause facial dysmorphism, possibly combined with dysfunctions in other organs [24]. Based on facial features, a geneticist or a paediatrician can reach a possible diagnosis and order appropriate tests to confirm it. Nonetheless, while some of these syndromes can be associated with distinctive facial features, others can be harder to detect at first sight. A computer vision approach, aimed at automatically analysing the face of children with facial dysmorphism, may avoid a delay in diagnosis by supporting geneticists and paediatricians in recognizing the facial gestalt of genetic syndromes. In [22], the authors test a computer vision approach to identify dysmorphic syndromes in Indian children. Fifty-one children with definite chromosomal abnormalities, microdeletion/duplication syndromes, or single gene disorders, with recognizable facial dysmorphism, were enrolled in the study. Their facial photographs (frontal and lateral) were uploaded to the Face2Gene CLINIC app [25], where a deep convolutional neural network compares a patient's gestalt to its database for syndrome suggestion. Of the 51 patients, the software predicted the correct diagnosis in 37 patients (72.5%). The method works quite well in classifying the facial dysmorphisms seen during training. The open challenge is then to handle "unseen" cases, since there is a vast number of genetic disorders causing dysmorphism and providing all of them during model training is unrealistic [26].
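The two metrics defined in the caption of Table 1, and the 72.5% figure reported for [22], reduce to short formulas. A minimal sketch follows; the function names and the gestational-age values used in the RMSE example are hypothetical and are not taken from any cited study.

```python
import math

def accuracy(n_correct, n_total):
    """Fraction of correct predictions over all predictions."""
    return n_correct / n_total

def rmse(predicted, reference):
    """Root-mean-square error between model outputs and expert-provided values."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))

# 37 correct diagnoses out of 51 patients, as reported for [22].
print(f"{100 * accuracy(37, 51):.1f}%")  # 72.5%

# Hypothetical predicted vs. expert gestational ages (weeks).
print(round(rmse([38.0, 31.0, 40.0], [39.0, 33.0, 40.0]), 2))  # 1.29
```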

Head and Gaze Tracking and Analysis
In Table 2, a summary of the selected work for head pose estimation and/or gaze tracking is reported.
Problems in neurocognitive development, ASD in particular, are associated with disorders in the processing of social information, difficulties in social interaction, and atypical attention and gaze patterns. Atypical eye gaze is an early-emerging symptom of ASD and holds promise for autism screening. Traditionally, gaze tests rely on manual assessments of children's visual fixations to pictorial stimuli, but are very time-consuming and difficult to standardize.
Many studies aimed at developing faster and low-cost solutions to reproduce the two principal visual-based ASD clinical diagnostic tests: (i) the analysis of gaze fixation patterns, which represent the region of an individual's visual focus, and (ii) the analysis of visual scanning methods, which correspond to the way in which individuals scan their surrounding environment [27-29]. For instance, the framework and computational tool proposed in [27] for measuring attention was tested on a population of 104 children (age 16-31 months), 22 of them diagnosed with ASD. The computer vision algorithm detailed in [30] was used to automatically track children's gaze and head position from a recorded video. The video was recorded using an iPad front-facing camera while the children watched a movie displaying dynamic, social and non-social stimuli on the device screen. The authors detected and tracked 51 facial landmarks, thus allowing for the detection of head, mouth, and eye position to assess the direction of attention. They estimated the head positions relative to the camera by computing the optimal rotation parameters between the detected landmarks and a 3D canonical face model. The study showed that children in the ASD group paid less attention to the video stimulus and to the social as compared to the non-social stimuli, and often fixated their attention on one side of the screen.

Robot-assisted tools have also been of interest in intervention for children with ASD, showing impressive results both in diagnosis and in therapeutic intervention when compared to classical methods. The study reported in [33] aimed at the early detection of ASD signs in naturalistic behavioural observation through child-robot interaction. The proposed system is composed of a responsive robotic platform, a flexible and scalable vision sensor network, and an automated face analysis algorithm based on machine learning models. The latter is developed using state-of-the-art trained neural models, available in Dlib [35] and OpenFace [36], and involves face detection, recognition, segmentation and tracking, landmark detection and tracking, head pose, eye gaze and visual focus of attention estimation. The authors also present a proof-of-concept test, with the participation of three typically developing children and three children at risk of suffering from ASD.
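The idea of recovering head orientation as the optimal rotation between detected landmarks and a canonical face model can be sketched in a deliberately simplified form. The example below handles only the in-plane (roll) angle in 2D, for which the least-squares rotation has a closed form; the cited work estimates a full 3D rotation against a 3D model. The landmark coordinates and function names are hypothetical.

```python
import math

def centroid(points):
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def optimal_rotation_deg(canonical, detected):
    """Least-squares in-plane rotation (degrees) aligning canonical to detected.

    Both point sets are centred first, so translation does not affect the result.
    """
    cx, cy = centroid(canonical)
    dx, dy = centroid(detected)
    cross = dot = 0.0
    for (px, py), (qx, qy) in zip(canonical, detected):
        px, py, qx, qy = px - cx, py - cy, qx - dx, qy - dy
        cross += px * qy - py * qx
        dot += px * qx + py * qy
    return math.degrees(math.atan2(cross, dot))

def rotate_translate(points, deg, tx, ty):
    """Helper used only to synthesise 'detected' landmarks for the demo."""
    a = math.radians(deg)
    return [(x * math.cos(a) - y * math.sin(a) + tx,
             x * math.sin(a) + y * math.cos(a) + ty) for x, y in points]

# Hypothetical canonical landmarks: eye corners, nose tip, mouth corners.
canonical = [(-30.0, 20.0), (30.0, 20.0), (0.0, 0.0), (-20.0, -25.0), (20.0, -25.0)]
detected = rotate_translate(canonical, 20.0, 100.0, 50.0)

print(round(optimal_rotation_deg(canonical, detected), 3))  # 20.0
```

In 3D the same least-squares problem is typically solved with a singular value decomposition rather than a single arctangent.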
Gaze detection and tracking may also be useful to monitor paediatric patients, especially in critical settings (e.g., Intensive Care Unit, ICU). The authors in [34] used the Faster R-CNN algorithm to fine-tune a pre-trained ResNet-101 model [37] to automatically detect and track eye regions for paediatric ICU patient monitoring. The last two layers of the CNN were fine-tuned during training with 59 images and annotations for the eye and mouth regions. The mouth landmark was included to improve model performance: it was found in earlier testing that the mouth and eyes were often confused by object detectors because of their similar shape and intensity profile on the face. By explicitly training the model to detect both landmarks, the mouth serves as negative training data for the eye localisation task. With a localization rate of 84%, this study demonstrated the potential of convolutional neural networks for eye localization and tracking in a paediatric ICU setting.
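The text does not state how the localization rate in [34] was computed. One common convention, assumed here purely for illustration, is to count a frame as correctly localized when the predicted box overlaps the annotated box with an intersection over union (IoU) of at least 0.5. The box coordinates below are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def localization_rate(predicted, ground_truth, threshold=0.5):
    """Share of frames whose prediction reaches the IoU threshold (assumed criterion).

    A prediction of None means the detector returned nothing for that frame.
    """
    hits = sum(1 for p, g in zip(predicted, ground_truth)
               if p is not None and iou(p, g) >= threshold)
    return hits / len(ground_truth)

# Hypothetical annotated and predicted eye boxes for four frames.
gt = [(10, 10, 30, 20), (50, 12, 70, 22), (10, 10, 30, 20), (40, 40, 60, 50)]
pred = [(12, 10, 32, 20), (50, 12, 70, 22), (25, 10, 45, 20), None]

print(localization_rate(pred, gt))  # 0.5
```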

Facial Expressions Analysis
In Table 3, a summary of the selected work for facial expressions analysis is reported. Over the past decade, research on automatic analysis of children's facial expressions has made great strides. One of the challenges in this area has been the recognition of regular facial expressions. Another has been the attention to micro-expressions or compound ones, i.e., the combination of several facial expressions, which can be critical for the success of an automated system. The computational analysis of facial expressions can overcome the limitations of human perception and provide fast and objective results in a wide range of clinical tasks.
For instance, commonly used screening tools for autism spectrum disorder (ASD) generally rely on subjective caregiver questionnaires. While behavioural observation carried out by specialists is more accurate, it is also expensive, time-consuming and requires considerable expertise. Many efforts have been made in the field of CV to overcome such limitations and automatically recognise ASD children's facial expressions to:

• Handle meltdown crisis. Studies such as [47,48] consider the safety of autistic children during a meltdown crisis. Meltdown signals are not associated with a specific facial expression, but with a mixture of abnormal facial expressions related to complex emotions. Through the evaluation of a set of spatio-temporal geometric facial features of micro-expressions, the authors demonstrate that the proposed system can automatically distinguish a compound emotion of autistic children during a meltdown crisis from the normal state and notify caregivers in a timely manner.
• Support specialists in diagnosing and evaluating ASD children. In [41], the authors propose a CV module consisting of four main components aimed at face detection, facial landmark detection, multi-face tracking and facial action unit extraction. The authors highlight how the proposed system could provide a noninvasive framework to apply to pre-school children in order to understand the underlying mechanisms of the difficulties in the use, sharing and response to emotions typical of ASD.
• Computationally analyse how children with ASD produce facial expressions with respect to their typically developing peers. In [56-58], the authors propose a framework aimed at computationally assessing how ASD and typically developing children produce facial expressions. Such a framework, which works on a sequence of images captured by a webcam under unconstrained conditions, locates and tracks multiple landmarks to monitor facial muscle movements involved in the production of facial expressions (thus performing a type of virtual electromyography). The output from these virtual sensors is then fused to model the individual's ability to produce facial expressions. The results correlate with psychologists' ratings, demonstrating how the proposed framework can effectively quantify the emotional competence of children with ASD to produce facial expressions.
• Detect early symptoms of autism. Despite advances in the literature, it is still difficult to identify early markers that can effectively detect the manifestation of symptoms of ASD. Carpenter and colleagues [49] collected videos of 104 young children (22 with ASD) watching short movies on a tablet. They then used a CV approach to automatically detect and track specific facial landmarks in the recorded videos to estimate the children's facial expressions (positive, neutral, all others) and differentiate between children with and without ASD. In these cases, children with ASD were more likely to show 'neutral' facial expressions, while children without ASD were more likely to show 'all other' facial expressions (raised eyebrows, open mouth, engaged, etc.).
Another fundamental goal in healthcare involves detecting and monitoring pain and discomfort in children.
Children are particularly vulnerable to the effects of pain and discomfort, which can lead to abnormal brain development, yielding long-term adverse neurodevelopmental outcomes. Nowadays, when the patient is unable to verbally express his/her experience of pain, as is the case with babies, the evaluation of pain depends mainly on continuous monitoring by the medical staff. Hence the need for alternative methods for its evaluation and detection.
For instance, PainCheck Infant [50] is a mobile point-of-care application that uses automated facial evaluation and analysis to assess procedural pain in infants. Based on an artificial intelligence algorithm, it enables the detection of six facial action units (AUs) that indicate the presence of pain: AU4 (brow lowering), AU9 (nose wrinkling), AU15 (lip corner depressing), AU20 (horizontal mouth stretching), AU25 (lip parting) and AU43 (eye closure). These facial actions, as classified by the Baby Facial Action Coding System [59], represent specific muscle movements (contraction or relaxation). The authors reported the good psychometric properties of PainCheck Infant after collecting video recordings from 40 infants (aged 2-9 months).
The authors in [60] also proposed an infant monitoring system to detect a broader spectrum of facial expressions consisting of discomfort, unhappiness, joy and neutral. They also aimed at detecting some states, including sleep, pacifier and open mouth. The proposed system was based on combining expression detection using Fast R-CNN with compensated detection using a Hidden Markov Model. The experimental results showed an average precision for discomfort detection of up to 90%.
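To give an intuition of how a Hidden Markov Model can compensate for noisy per-frame detections, the sketch below applies Viterbi decoding to a sequence of frame labels, so that isolated flips are smoothed out. This is not the model of [60], whose details are not given here: the two-state setup, the transition probability `p_stay` and the detector reliability `p_correct` are hypothetical values, and the detector itself is replaced by a hand-written label sequence.

```python
import math

STATES = ("neutral", "discomfort")

def viterbi_smooth(observed, p_stay=0.9, p_correct=0.8):
    """Most likely hidden state sequence given noisy per-frame labels."""
    def trans(a, b):
        return math.log(p_stay if a == b else 1.0 - p_stay)

    def emit(state, label):
        return math.log(p_correct if state == label else 1.0 - p_correct)

    # Uniform prior over the two states for the first frame.
    score = {s: math.log(0.5) + emit(s, observed[0]) for s in STATES}
    back = []
    for label in observed[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + trans(r, s))
            score[s] = prev[best] + trans(best, s) + emit(s, label)
            pointers[s] = best
        back.append(pointers)

    # Backtrack from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

N, D = "neutral", "discomfort"
frames = [N, N, D, N, N, D, D, D, D]  # hypothetical detector output; frame 3 is a spurious flip
print(viterbi_smooth(frames))
# ['neutral', 'neutral', 'neutral', 'neutral', 'neutral',
#  'discomfort', 'discomfort', 'discomfort', 'discomfort']
```

With these settings a single isolated "discomfort" frame is overruled, while a sustained run of four frames is kept.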
The studies reported in [42,46] focus on texture and geometric descriptors to analyse infants' faces and detect expressions of discomfort. In particular, Martinez et al. [46] used three different texture descriptors for pain detection: Local Binary Patterns, Local Ternary Patterns and Radon Barcodes. A Support Vector Machine (SVM) based model was implemented for their classification. The proposed features gave a promising classification accuracy of around 95% for the infant COPE image database [61,62]. In [42], a two-phase classification workflow was developed: phase 1, subject-independent, derived geometric and appearance features; phase 2, subject-dependent, incorporated template matching based on facial landmarks. Finally, to detect comfort or discomfort facial expressions, an SVM classifier was applied to the video frames. Videos of 22 infants were used to evaluate the proposed method. Experiments showed an AUC of 0.87 for the subject-independent phase and 0.97 for the subject-dependent phase.
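Of the texture descriptors mentioned above, Local Binary Patterns are the simplest to illustrate. The sketch below computes the basic 3 × 3 LBP code of a pixel and a normalised histogram of codes, which is the kind of feature vector that could be passed to an SVM. It is a textbook variant written for this example; the exact LBP configuration used in [46] is not stated here, and the neighbour ordering and the tiny image patch are assumptions.

```python
def lbp_code(img, r, c):
    """Basic 8-neighbour LBP code of pixel (r, c).

    Bit i is set when the i-th neighbour (clockwise from top-left) is >= the centre.
    """
    centre = img[r][c]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        if img[r + dr][c + dc] >= centre:
            code |= 1 << bit
    return code

def lbp_histogram(img):
    """Normalised 256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            hist[lbp_code(img, r, c)] += 1
    total = sum(hist)
    return [h / total for h in hist]

# Hypothetical grey-level patch.
patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]

print(lbp_code(patch, 1, 1))  # 241
```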
However, there is a view among some researchers that pain is a multimodal emotion, often expressed through several different modalities. For this reason, in [51] Salekin and colleagues show that there is a need for a multimodal assessment of pain, particularly in the case of post-operative pain (acute and prolonged pain). They integrated visual and vocal signals using a multimodal spatio-temporal approach. For neonatal face analysis, the proposed algorithm first detects the face region in each video frame using a pre-trained YOLO-based [63] face detector. Then, a VGG-16 [38] network extracts visual features from the face. Finally, they used LSTM [64] with deep features to learn the temporal pattern and dynamics typical of postoperative discomfort. Experimental results on a real-world dataset (known as USF-MNPAD-I, the University of South Florida Multimodal Neonatal Pain Assessment Dataset, consisting of 58 neonates with a gestational age that ranges from 27 to 41 weeks [65]) show that the proposed multimodal spatio-temporal approach achieves the highest AUC (0.87) and accuracy (79%), which are on average 6.67% and 6.33% higher, respectively, than those of unimodal approaches. The work of Zamzmi et al. [54] also presented a comprehensive multimodal pain assessment system that fuses facial expressions, crying sounds, body movement and vital signs. In terms of face analysis, the proposed system acquires video of infants being monitored in the neonatal intensive care unit and implements four feature extraction methods, namely strain-based, geometric-based, texture-based, and gradient-based, to extract relevant features from the newborns' faces. The system achieved an accuracy of 95.56%.
The area of children's social interactions is also considered clinically relevant, since the ability to produce and decode facial expressions in both childhood and adolescence promotes social competence, whereas deficits characterise several forms of psychopathology. However, even in this area, the study of facial expressions has been hampered by the labour-intensive and time-consuming nature of human coding. Therefore, some efforts have been made to automatically analyse children's facial expressions in order to study their social interactions.
For example, primary social interactions, namely the family context, are the focus of the studies reported in [39,66]. In the latter, the intensity of twelve infants' facial expressions is detected and measured in order to model the dynamics of face-to-face interactions with their mothers. Certified Facial Action Coding System (FACS) coders manually coded facial AUs related to the positive and negative affect from the video. Then, relevant facial features were tracked using Active Appearance Models (AAM) and registered to a canonical view before extracting Histogram of Oriented Gradients (HOG) features. Finally, using these features, the authors compared two dimensionality reduction approaches (Principal Components Analysis with Large Margin Nearest Neighbour and Laplacian Eigenmap) and two classifiers, SVM and K-Nearest Neighbour.
In [40,67], the pro-social and antisocial behaviour of children is studied. In particular, lie detection is carried out. Zanette et al. [40] first collected video recordings of a group of children (6-11 years old). Non-verbal behaviour was analysed using the Computer Expression Recognition Toolbox (CERT), which uses FACS to automatically code children's facial expressions while lying. The results showed the reliability of CERT in detecting differences in children's facial expressions when telling antisocial versus prosocial lies.
Regarding expression recognition aimed at emotion detection, most of the works in this area have used deep neural networks for automatic classification of children's facial expressions, such as [53], where a VGG-16 network [38] was used. Here, the authors trained the network on adult videos and refined it using two publicly available databases of toddler videos that differ in context, head pose, lighting, video resolution, and toddler age: the FF-NSF-MIAMI [68,69] and CLOCK [70] databases. The resulting AU detection system, which the authors call Infant AFAR (Automated Facial Action Recognition), is available to the research community for further testing and applications.
In [55], the authors present a lightweight shallow learning approach to emotion classification that uses skip connections for the recognition of facial behaviour in children. In contrast to previous deep neural networks, they limit the alternative paths for the gradient in the early part of the network and increase them gradually with the depth of the network. They show that the progressive ShallowNet is not only able to explore more feature space, but also solves the overfitting problem for smaller data, using the LIRIS-CSE [71] database to train the network.
Nagpal et al. [45] incorporated supervision into the traditionally unsupervised Deep Boltzmann machine [72] and proposed an average supervised deep Boltzmann machine for classifying an input face image into one of the seven basic emotions [73]. The proposed approach was evaluated on two child face datasets: Radboud Faces [74] and CAFE [75].
However, emotion recognition classifiers traditionally predict discrete emotions, whereas classifying emotion expressions often requires a method for dealing with compound and ambiguous labels. In [52], Washington and colleagues explored the feasibility of using crowdsourcing to obtain reliable soft-target labels and evaluated an emotion detection classifier trained with such labels. Reporting an emotion probability distribution, which takes into account the subjectivity of human interpretation, may be more useful than an absolute label for many applications of affective computing. For the experiments, they used the Child Affective Facial Expression (CAFE) data set [75] and a ResNet-152 neural network [37] as a classifier.
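The notion of a soft-target label can be made concrete with a small sketch: crowd votes for one image are turned into a probability distribution, and a classifier output is scored against it with cross-entropy. The reduced four-emotion label set, the votes and the two classifier outputs are hypothetical and serve only to illustrate the idea; they do not reproduce the protocol of [52].

```python
import math
from collections import Counter

EMOTIONS = ("happy", "sad", "angry", "neutral")  # hypothetical reduced label set

def soft_target(votes):
    """Turn a list of crowd votes into a probability distribution over EMOTIONS."""
    counts = Counter(votes)
    return [counts[e] / len(votes) for e in EMOTIONS]

def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a soft target and a predicted distribution."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

# Five hypothetical annotators disagree on an ambiguous expression.
votes = ["happy", "happy", "neutral", "happy", "neutral"]
target = soft_target(votes)
print(target)  # [0.6, 0.0, 0.0, 0.4]

# A classifier that reflects the ambiguity is penalised less than an over-confident one.
pred_soft = [0.55, 0.05, 0.05, 0.35]
pred_hard = [0.97, 0.01, 0.01, 0.01]
print(cross_entropy(target, pred_soft) < cross_entropy(target, pred_hard))  # True
```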
In healthcare, social robotics is experiencing a rapid increase in applications. Some of these applications include robot-assisted therapy for children [76]. Empathy, or the ability to correctly interpret the manifestations of human affective states, is a critical capability of social robots. The study reported in [44] proposes a method based on deep neural networks that fuses information from the skeleton of the body posture with facial expressions for the automatic recognition of emotions. The network is composed of two different branches, one focusing on facial expressions and the other focusing on body posture. The two branches are then combined at a later stage to form the branch for the recognition of the whole body expression. The authors evaluated their method on a previously collected database of emotional expressions recorded during child-robot interaction (children aged 6 to 12 years).

Multimodal Analysis
Several papers combine different types of data (e.g., gaze tracking and facial morphology data, or head pose estimation and expression classification, etc.) to capture and analyse as many aspects of the condition under study as possible.Some of the most relevant ones are resumed in Table 4 For instance, several studies have focused on analysing facial features for detecting early symptoms of ASD and on the automatic diagnosis of attention deficit hyperactivity disorder (ADHD) based on children's attention patterns and facial expressions [78].
For example, to detect early indicators of ASD, the authors in [30] analysed both facial expressions and head postures of twenty 16- to 30-month-old children with and without autism. They extracted 49 facial landmarks using the IntraFace software [79]; with regard to the analysis of facial expressions, three classes of emotions were taken into account: Neutral, Positive (Happy) and Negative (Anger, Disgust and Sad). However, the facial expression classifier was trained on the standard Cohn-Kanade dataset [80], which contains video sequences from a total of 123 subjects between the ages of 18 and 50.
Xu et al. [77] and Nag and colleagues [81] also attempted to find notable indicators for early detection of ASD in both facial expressions and gaze patterns. The system proposed in [77] provides participants with three modes of virtual interaction: videos, images and virtual interactive games. Computer vision-based methods are used to automatically detect the subject's emotion and attentional characteristics in the three interaction modes. The system is intended to aid in the early detection of autism. Its accuracy has been verified through experiments on a publicly available dataset and on data collected from 10 children with ASD.

Publicly Available Datasets
Large numbers of adult facial image datasets are available for research purposes, but very few equivalent datasets for children can be found in the literature. The most relevant datasets reporting infant, toddler, and children faces are reported here and listed in Table 5:
• COPE Database [61,62]: This database contains 204 photographs of 26 newborns (between 18 and 36 h old) who were photographed while experiencing the pain of a heel lance and a variety of stressors, including being moved from one cot to another (a stressor that produces crying that is not in response to pain), a puff of air on the nose (a stressor that produces eye squinting), and friction on the outer lateral surface of the heel (a stressor that produces facial expressions of distress similar to those of pain). In addition to these four facial displays, the database contains images of the newborns in a neutral resting state.
• [83]. Two age-appropriate emotion induction tasks were used to elicit positive and negative facial expressions. In the positive emotion task, an experimenter blew bubbles at the infant. In the negative emotion task, an experimenter presented the infant with a toy car, allowed the infant to play, then removed the car and covered it with a clear plastic container. Each video was approximately 2 min long (745 K and 634 K recorded frames). The video resolution was 1920 × 1080. FACS coders manually annotated nine action units: AU1 (inner brow raised), AU2 (outer brow raised), AU3 (inner brow pulled together), AU4 (lowered eyebrow), AU6 (raised cheek), AU9 (nose wrinkle), AU12 (corner of lips pulled back), AU20 (lip stretching) and AU28 (lip sucking).
Face detection, head pose estimation and facial expression recognition are challenging tasks, whose success can be hindered by varying conditions such as facial occlusion, lighting, unusual expressions, distance from the cameras, skin type, complex real-world background, low data resolution and noise. These challenges, which are well known when dealing with adult face analyses, might even be exacerbated in the case of children and newborns.
Among the most recent and promising methods that attempt to address such challenges, those based on deep learning models are gathering more and more momentum, as they guarantee remarkable results in terms of accuracy and robustness.To mention a few, DeepFace [86] and FaceNet [87], on which OpenFace [36] is based, have been some of the pioneering solutions that have demonstrated state-of-the-art performance and paved the way for further breakthroughs in the field.
However, these methods have been trained, implemented and tested mainly on adult faces, so further development is needed to test their generalisability to newborn and infant faces.
Regarding face detection, we note that ArcFace [95] (code available at: https://github.com/1996scarlet/ArcFace-Multiplex-Recognition), RetinaFace [96] (code available at: https://github.com/1996scarlet/ArcFace-Multiplex-Recognition) and FaceYolov5 [97] (code available at: https://github.com/deepcam-cn/yolov5-face/tree/master) performed exceptionally well in detecting adult faces. Nevertheless, as several studies [70,98,99] reported, face recognition methods designed for adults fail when applied to the neonatal population due to the unique craniofacial structure of neonates' faces as well as the large variations in pose and expression as compared to adults. Therefore, further research in this domain should concentrate on designing algorithms trained specifically on datasets collected from the neonatal population, as pointed out by Zamzmi et al. in [43], where a novel Neonatal Convolutional Neural Network for assessing neonatal pain from facial expression is described.
Regarding expression recognition aiming at emotion detection, most of the research has focused on adult face images so far [100][101][102][103], with no dedicated research on automating expression classification for children. As infants' faces have different proportions, less texture, fewer wrinkles and furrows, and unique facial actions with respect to adults, automated detection of facial action units in infants is challenging. More thorough experiments are needed to assess the applicability and robustness of the cited methods when tested on newborn and child data. Furthermore, emotion recognition classifiers typically predict isolated emotions. A strategy for addressing complex, compound emotions may involve integrating multiple modalities and other types of sensors (e.g., thermal cameras), thus including temporal, auditory, and visual data, to enhance the precision and robustness of the models, as demonstrated in [104].

Body Analysis
Automatic methods to analyse the movements of babies and children (behavioural coding) are increasingly needed. On the one hand, when movement anomalies are reported by parents or general practitioners, such methods relieve the workload on specialized health professionals, reducing the cost and time needed to obtain a diagnosis. On the other hand, they enable continuous screening of a larger population, making early diagnosis of possible diseases feasible even before symptoms become evident to non-expert observers [12]. These automatic methods leverage human pose estimation algorithms. Deep learning architectures have obtained significant results for human pose estimation in the last few years, but they have been trained on images picturing adults. The estimation of the pose of children (infants, toddlers, children) is sparsely studied, although it can be extremely useful in different application domains [105]. In this section, the works dealing with the estimation of the body posture of babies and children are reported and discussed. In particular, existing benchmarks are introduced first and, subsequently, the most relevant works in the literature introducing algorithmic pipelines exploited for the healthcare of young subjects are discussed. For each work, the clinical aim and, where applicable, how the authors addressed the additional bias of dealing with children are pointed out.
A coarse selection of related papers was initially carried out by querying the research databases. Among all the documents retrieved from the databases, a fine selection was then conducted based mainly on the scientific content (some documents were in fact not relevant for the purposes of the proposed survey), the type of publication (journals were preferred to conferences in case of comparable ideas) and finally also on the number of citations (articles prior to 2020 with fewer than 10 citations were not considered).
This led to the following content organization: at first, in Section 3.1, documents describing datasets and common tools exploited for pose estimation are reported. Then, the remaining documents have been split depending on how the infants were recorded, i.e., lying in a bed/crib or standing/walking, and two different subsections are used to describe them accordingly. Similarly to Section 2, in each subsection some tables have been used to organize the information. Finally, in Section 3.4, new research directions for more accurate infants' pose estimation are reported and discussed.

Common Datasets and Tools for Human Pose Estimation
Healthcare would benefit from powerful and reliable algorithms fine-tuned on specific goals (i.e., pathology classification or evaluation of its stage); nevertheless, interdisciplinary specialists could also benefit from tools oriented to more generic processing that provide semi-raw data ready for further analysis. This includes many tools that allow the research community to set up pipelines aiming at the final goal of analysing infant movements. Such an approach enables a wider range of scientists to analyse child movement patterns and, at the same time, represents a starting point for the image-processing research community. Among these tools, the most common library for human pose detection is OpenPose. It is a real-time multi-person human pose detection library [106] that maps 25 points on the body including shoulders, elbows, wrists, hips (+mid-hip), knees, ankles, heels, big toes, little toes, eyes, ears, and nose. It was trained on adults but, as reported in the following section, it has also been largely used on infants, with or without a specific domain adaptation learning phase. It is available at https://github.com/CMU-Perceptual-Computing-Lab/openpose (accessed on 15 September 2023). OpenPose has then been integrated into AutoViDev [107], a system specifically created for automated video action recognition. It provides a highly modular implementation of 188 primitives, on which users can flexibly create pipelines. It also supports automated tuners and an easy-to-use GUI to help researchers/practitioners develop prototypes. AutoVideo is released under MIT license at https://github.com/datamllab/autovideo (accessed on 13 February 2023).
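As an illustration of the "semi-raw data" such tools provide, the sketch below reads the 25 body keypoints of one frame into named joints. It assumes the per-frame JSON layout written by OpenPose's JSON export (a `people` list whose entries carry a flat `pose_keypoints_2d` array of x, y, confidence triplets in BODY_25 order); the confidence threshold, the function name and the sample frame are hypothetical.

```python
import json

# BODY_25 keypoint order of OpenPose's 25-point body model.
BODY_25 = [
    "Nose", "Neck", "RShoulder", "RElbow", "RWrist",
    "LShoulder", "LElbow", "LWrist", "MidHip",
    "RHip", "RKnee", "RAnkle", "LHip", "LKnee", "LAnkle",
    "REye", "LEye", "REar", "LEar",
    "LBigToe", "LSmallToe", "LHeel", "RBigToe", "RSmallToe", "RHeel",
]

def parse_openpose_frame(json_text, min_confidence=0.1):
    """Return one {joint_name: (x, y) or None} dict per detected person.

    Joints whose confidence is below `min_confidence` (a hypothetical
    threshold) are mapped to None so that downstream code can skip them.
    """
    frame = json.loads(json_text)
    people = []
    for person in frame.get("people", []):
        flat = person.get("pose_keypoints_2d", [])
        if len(flat) < 3 * len(BODY_25):
            continue  # incomplete record, ignore it
        joints = {}
        for i, name in enumerate(BODY_25):
            x, y, conf = flat[3 * i: 3 * i + 3]
            joints[name] = (x, y) if conf >= min_confidence else None
        people.append(joints)
    return people

# Synthetic frame with one person: only the nose and the neck are "detected".
keypoints = [0.0] * (3 * len(BODY_25))
keypoints[0:3] = [100.0, 50.0, 0.9]   # Nose
keypoints[3:6] = [100.0, 80.0, 0.8]   # Neck
sample = json.dumps({"people": [{"pose_keypoints_2d": keypoints}]})

people = parse_openpose_frame(sample)
print(people[0]["Nose"], people[0]["RWrist"])
```

Keeping undetected joints as `None` rather than as (0, 0) avoids spurious jumps in the trajectories, which matters for infants, whose limbs are frequently occluded.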
Another efficient deep architecture for markerless pose estimation and semantic features detection is DeepLabCut (DLC) [108]. Open source Python code for selecting training frames, checking human annotator labels, generating training data in the required format, and evaluating the performance on test frames is available at https://github.com/DeepLabCut/DeepLabCut (accessed on 13 February 2023).
Some useful annotation tools have been introduced to build image databases for computer vision research. One of the earliest is LabelMe [109], which works online at http://labelme2.csail.mit.edu/Release3.0/index.php (accessed on 13 April 2023). Another is Kinovea, designed for sports analysis, open-source and freely available at www.kinovea.org (accessed on 29 March 2023). Finally, the tool in [110] is more oriented to pose estimation: it is interactive and relies on a heuristic weakly supervised human pose (HW-HuP) solution to estimate 3D human poses in contexts where no ground truth 3D pose data are accessible, even for fine-tuning.
Unfortunately, all the above-mentioned instruments suffer from a bias in terms of target patients. They are designed and/or trained for adults, which reduces their reliability when applied to infants. This makes specifically child-oriented tools particularly valuable. One such tool, Movelab, designed for the semi-automatic annotation of baby joints, has recently been introduced in [111]. It consists of a GUI that allows users to browse videos and to choose an algorithm for baby pose detection among MediaPipe Pose [112] and two ResNet architectures fine-tuned on a proprietary dataset of 600 videos of children lying on a bed.
AVIM is another tool, developed using the OpenCV image processing library and specifically designed for an objective analysis of infants from 10 days to the 24th week of age. It acquires and records images and signals from a webcam and a microphone and allows users to perform audio and video editing [113]. It is similar to MOVIDEA, a software tool developed using MATLAB [114]. Both tools rely on manual annotation of points of interest on the body and provide kinematic measurements.
All these discussions raise a problem that concerns artificial intelligence in general and becomes more serious in highly specialised environments: lack of data. Retrieving data, providing accurate annotations and complying with all regulations can be extremely tedious and time-consuming. In addition, the specific care required by the category of children and the need for long-term monitoring make datasets from this sector extremely rare. Some of the most relevant ones are summarised in Table 6. Some of them have been introduced to help in pose estimation and markerless joint detection and tracking. Under this umbrella, we can cite the Moving INfants In RGB-D (MINI-RGBD) [115] dataset, which was generated by mapping real infant movements to the Skinned Multi-Infant Linear body model (SMIL) with realistic shapes and textures and generating RGB and depth images with precise ground truth 2D and 3D joint positions. The dataset is available for research purposes at http://s.fhg.de/mini-rgbd (accessed on 23 August 2023).
Another relevant contribution to this topic has been recently provided in [116], where a hybrid synthetic and real infant pose (SyRIP) dataset was collected and made publicly available. It came with a multi-stage invariant representation learning strategy that could transfer the knowledge from the adjacent domains of adult poses and synthetic infant images into a fine-tuned domain-adapted infant pose (FiDIP) estimation model. The code is available at https://github.com/ostadabbas/Infant-Pose-Estimation (accessed on 23 August 2023).
Other relevant datasets for infant body parsing and pose estimation from videos are the BHT dataset [117], the AIMS dataset [118] and the YouTube-infant dataset [119]. BHT consists of 20 movement videos of infants aged 0-6 months. YouTube-infant has 90 infant movement videos collected from YouTube. Both datasets contain annotations for five classes: background, head, arm, torso and leg. Pose annotation was made by the LabelMe online annotation tool and the dataset comes with BINS scores describing neurological risks associated with each infant [120]. The AIMS dataset contains 750 real and 4000 synthetic infant images with Alberta Infant Motor Scale (AIMS) pose labels [121]. Code and data referenced in [119] are provided at https://github.com/cchamber/Infant_movement_assessment/ (accessed on 23 August 2023).
In [122,123], the BabyPose dataset, consisting of 16 depth videos of 16 preterm infants recorded during the actual clinical practice in a neonatal intensive care unit (NICU), has been introduced. Each video lasts 100 s (at 10 fps). Each frame was annotated with the limb-joint locations. Twelve joints were annotated, i.e., left and right shoulder, elbow, wrist, hip, knee and ankle. The database is freely accessible at https://zenodo.org/record/3891404 (accessed on 13 February 2023).
Concerning autism-related behaviours, the Self-Stimulatory Behaviour Dataset (SSBD) [124] collected stimming behaviour videos of children available on public domain websites and video portals, such as YouTube, Vimeo, Dailymotion, etc. The dataset contains 75 videos grouped into three categories, each containing 25 videos. The mean duration of a video is 90 s. The resolution of the videos varies, but is greater than 320 × 240 pixels. Videos are related to Armflapping, Headbanging, and Spinning repetitive behaviours. The dataset can be found at https://github.com/antran89/clipping_ssbd_videos (accessed on 23 August 2023) with annotated data.

Monitoring of Lying Children
Early diagnosis plays a key role in most healthcare areas, including neurological disorders. A diagnosis in the first weeks of a child's life is crucial, especially in preterm infants, to recognise signs of possible lesions in the developing brain and to plan timely and appropriate rehabilitation interventions. Unfortunately, this can only be achieved by monitoring the child lying down, with two main constraints: on the one hand, it is mandatory to use completely non-invasive methods, and on the other hand, considering the specific movement dynamics under investigation, ad hoc datasets are required, underlining what was discussed in the previous section. Beyond this critical analysis, further monitoring can be carried out to track vital signs or movements of discomfort that represent manifestations of the child's distress. All these kinds of analysis are generally performed using a camera mounted above the crib in a neonatal intensive care unit (NICU); the typical setup for data acquisition and processing is shown in Figure 2. The most relevant work is summarised in Table 7.
Table 7. A summary of selected work concerning the analysis of the body of infants lying in a crib. SVM = Support Vector Machine, AUC = Area Under Curve, FVGAN = factorized video generative adversarial network, GMA = Prechtl General Movement Assessment [126], acc = accuracy (correct predictions / total number of predictions with respect to the clinical goal occurrences tested), RF = Random Forest, RMSE = root-mean-square error (it measures the differences between values predicted by a model and the values provided by experts), b.p.m. = beats per minute, r.p.m. = respirations per minute.
The simplest approaches rely on optical flow to estimate pixel motion vectors between frames. One of the earliest applications related to infants' health care was recognising comfort or discomfort. In [127], the authors calculated the motion acceleration rate and 18 time- and frequency-domain features characterizing motion patterns and fed them to a support vector machine (SVM) classifier. The method was evaluated using 183 video segments for 11 infants from 17 heel prick events. The experimental results show an AUC of 0.94 for discomfort detection and an average accuracy of 0.86 when combining all proposed features, which is promising for clinical use.

A more effective computer-aided pipeline to characterize and classify infants' motion from 2D video recordings has been proposed in [132]. The authors used data from 142 preterm infants, acquired from a viewpoint perpendicular to the plane where the infants lay, at 40 weeks of gestational age. The final goal was detecting anomalous motion patterns. The ground truth was built starting from brain MRI evidence at birth and neurological examinations 30 months after the video recording. DeepLabCut was exploited, but it was fine-tuned to detect a small set of meaningful landmark points (nose, hands and feet) on the infants' bodies. The authors motivated these choices by noting that classical full-body pose estimation algorithms, if not fine-tuned on infants' poses, have proven not always to be appropriate for infants, since they are trained and implemented for detecting adults' poses. Since fine-tuning requires a significant amount of data, the authors focused only on some key points that provide meaningful information regarding infants' motion, thereby guaranteeing a higher per-point accuracy and greater control over the interpretability of the results. Starting from the trajectories of the detected landmark points, quantitative parameters describing infants' motion patterns were extracted and classified as normal or abnormal motion patterns by means of different shallow and deep classifiers. Despite the accurate setup, the mean overall accuracy did not exceed 60%. The main problems were the unbalanced dataset and the sparsity of landmarks tracked over time. Obtaining a dense body motion analysis of babies is indeed particularly challenging, as body part dimensions vary significantly between infants and adults.
Similarly, in [119], a framework that predicts the neuromotor risk level of 19 infants (younger or older than 10 weeks of age) was proposed. The training was conducted using 420 YouTube video segments. OpenPose was used to extract pose information. Due to differences between adults and infants in their appearance and pose, pose tracking using OpenPose was initially limited in performance. Therefore, the authors specialized OpenPose for infants by creating a dataset of infant images with labels of joint positions. The root-mean-square error on joint positioning decreased from 0.05 with standard OpenPose to 0.02 after the specialization of the algorithm on infants. The adapted pose estimator allowed the authors to extract movement trajectories from videos of infants moving. Finally, the authors combined many features into one estimate for assessing neuromotor risk, demonstrating a correlation between the score and the risk assigned to each infant by clinicians.
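The root-mean-square error on joint positions mentioned above can be computed as in the following minimal sketch. It assumes that predicted and annotated joints are given as (x, y) pairs in the same (possibly normalised) coordinate system; the normalisation actually used in [119] is not restated here, and the sample coordinates are hypothetical.

```python
import math

def joint_rmse(predicted, annotated):
    """Root-mean-square Euclidean error between two lists of (x, y) joints."""
    if len(predicted) != len(annotated) or not predicted:
        raise ValueError("joint lists must be non-empty and of equal length")
    squared_errors = [
        (px - ax) ** 2 + (py - ay) ** 2
        for (px, py), (ax, ay) in zip(predicted, annotated)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical normalised coordinates for two joints of one frame.
annotated = [(0.50, 0.50), (0.30, 0.70)]
predicted = [(0.53, 0.54), (0.30, 0.70)]
rmse = joint_rmse(predicted, annotated)
print(round(rmse, 4))
```

The same function can be applied before and after fine-tuning on infant images to quantify the improvement of an adapted pose estimator.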
Recently, in [133], a method to perform general movement assessment (GMA) from infant movement videos has been proposed. It uses a semi-supervised model, termed SiamParseNet (SPN), which consists of two branches, one for intra-frame body parts segmentation and another for inter-frame label propagation. Another important contribution is the adoption of two training strategies that alternately employ different training modes to achieve optimal performance. A factorized video GAN was exploited to augment training. Similarly, in [131], the automated analysis of general movements was achieved using low-cost instrumentation in the home. Videos from a single commercial RGB-D sensor were processed using DeepLabCut to estimate the 2D trajectories of selected points and then to reconstruct 3D trajectories by aligning data recorded with the depth sensor. Eight infants were recorded in the home at 3, 4, and 5 months of age.
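A common way to lift 2D keypoint trajectories to 3D with an aligned depth map is pinhole back-projection. The sketch below illustrates this generic technique; it is not necessarily the procedure used in [131], and the camera intrinsics (fx, fy, cx, cy), pixel coordinates and depth values are hypothetical.

```python
def backproject(u, v, depth_m, fx, fy, cx, cy):
    """Map a pixel (u, v) with depth in metres to camera coordinates (x, y, z)."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

def lift_trajectory(points_2d, depths_m, fx, fy, cx, cy):
    """Lift a 2D trajectory to 3D; frames with invalid depth (<= 0) give None."""
    trajectory = []
    for (u, v), z in zip(points_2d, depths_m):
        if z <= 0:
            trajectory.append(None)  # depth hole, e.g. due to sensor noise
        else:
            trajectory.append(backproject(u, v, z, fx, fy, cx, cy))
    return trajectory

# Hypothetical intrinsics and a three-frame trajectory of one keypoint.
FX, FY, CX, CY = 500.0, 500.0, 320.0, 240.0
points = [(420.0, 240.0), (430.0, 250.0), (440.0, 260.0)]
depths = [1.0, 0.0, 1.0]  # the second frame has no valid depth reading
trajectory_3d = lift_trajectory(points, depths, FX, FY, CX, CY)
print(trajectory_3d)
```

Frames without a valid depth reading are kept as `None` so that the temporal indexing of the trajectory is preserved.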
The potential ability of computer vision to accurately characterize infant reaching motion is the topic of the papers in [130,134]. Analysing reaching motion (fast movement towards a given target, usually a toy) may contribute to the early diagnosis and assessment of infants at risk for upper extremity motor impairments. In [130], the analysed videos were obtained from 12 infants (5 with developmental disorders) of about 12 months of age or less. The total number of reaching actions analysed was 65. The x and y coordinates of hand key points were obtained from OpenPose and compared with those manually annotated (frame-by-frame), resulting in 95% confidence intervals. The authors concluded that OpenPose may be used for markerless automatic tracking of infant reaching motion from recorded videos, but did not provide evidence of the ability to automatically classify disorders.
In [134], a lightweight network was tested on videos of infants (up to 12 months of age) performing reaching/grabbing actions collected from an online video-sharing platform and semi-automatically annotated by exploiting the tool Kinovea. A total of 193 reaches performed by 21 distinct subjects were processed, with a precision of 0.57 and 0.66 and a recall of 0.72 and 0.49 for reaching and no-reaching actions, respectively.
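For reference, per-class precision and recall of the kind reported above are derived from confusion counts as in the sketch below. The counts are hypothetical and are not the confusion matrix of [134]; the F1 score is included only as the usual summary of the two quantities.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Precision, recall and F1 score of one class from its confusion counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1

# Hypothetical counts for the "reaching" class.
precision, recall, f1 = precision_recall_f1(
    true_positives=6, false_positives=2, false_negatives=4
)
print(precision, recall, round(f1, 3))
```

A high recall combined with a lower precision, as in the reaching class above, means that few reaches are missed at the price of more false alarms.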
Persistent asymmetrical body behaviour in early life provides a prominent prodromal risk marker of neurodevelopmental conditions like autism spectrum disorder and congenital muscular torticollis. The authors in [135] proposed a computer vision method for assessing bilateral infant postural symmetry from images, based on 3D human pose estimation, adapted to the challenging setting of infant bodies. In particular, the HW-HuP interactive annotation tool was modified to correct 3D poses predicted on infants in the SyRIP dataset. A Bayesian estimator of the ground truth derived from a probabilistic graphical model of fallible human raters was proposed.
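To give an intuition of how bilateral symmetry can be quantified from a 3D pose, the sketch below compares the left and right elbow angles. This angle-difference measure is a simplified illustration and not the Bayesian estimator of [135]; the joint names and coordinates are hypothetical.

```python
import math

def joint_angle_deg(a, b, c):
    """Angle in degrees at joint b formed by the 3D points a, b and c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    if n1 == 0 or n2 == 0:
        raise ValueError("degenerate limb: coincident joints")
    cosine = sum(x * y for x, y in zip(v1, v2)) / (n1 * n2)
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding errors
    return math.degrees(math.acos(cosine))

def elbow_asymmetry_deg(pose):
    """Absolute difference between left and right elbow angles."""
    left = joint_angle_deg(pose["LShoulder"], pose["LElbow"], pose["LWrist"])
    right = joint_angle_deg(pose["RShoulder"], pose["RElbow"], pose["RWrist"])
    return abs(left - right)

# Hypothetical pose: left arm fully extended, right arm bent at 90 degrees.
pose = {
    "LShoulder": (1.0, 0.0, 0.0), "LElbow": (2.0, 0.0, 0.0), "LWrist": (3.0, 0.0, 0.0),
    "RShoulder": (-1.0, 0.0, 0.0), "RElbow": (-2.0, 0.0, 0.0), "RWrist": (-2.0, 1.0, 0.0),
}
asymmetry = elbow_asymmetry_deg(pose)
print(asymmetry)
```

In practice such a difference would be tracked over many frames, since a single asymmetric posture is not informative on its own.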
A less debated area of research is devoted to measuring vital signs (especially in the neonatal intensive care unit) in a contactless fashion by exploiting RGB or RGB-D data. These solutions are aimed at avoiding the trauma and pain observed in traditional sensor-based monitoring when removing the strong adhesive bond between the electrode and the epidermis of pre-term infants. A preliminary study of a non-contact system based on photoplethysmography imaging (PPGi) and motion magnification is reported in [128]. The proposed framework involved skin colour and motion magnification, region of interest (ROI) selection, spectral analysis and peak detection. Non-contact heart rate (HR) and respiratory rate (RR) in 10 infants were monitored and compared with ECG data. The authors concluded that the non-contact technique requires further investigation to reach the accuracy necessary for use with neonates. One of the main causes of failure was identified as the smaller ROI available for analysis compared with experiments involving adults. A similar approach was also proposed in [129], but only one baby was used to create the dataset, and two different motion detection methods (based on frame differences and background subtraction, respectively) were applied individually and integrated to achieve better accuracy. For the same aim, the authors of [136] used depth information captured by two RGB-D cameras in order to reconstruct a 3D surface of a patient's torso with high spatial coverage. The volume was computed based on an octree subdivision technique of the 3D space. Finally, respiratory parameters were calculated from the estimated volume-time curve, but experiments were carried out only on a baby mannequin with an artificial test lung for infants. The lung was connected to a mechanical ventilator. Recently, the authors of [125] remotely monitored both HR and RR of neonates in the NICU using colour- and motion-based methods. The most interesting contribution of the paper is the use of YOLO V3 weights to obtain a baby detection model, thereby detecting the ROI automatically.
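The spectral analysis and peak detection steps of such contactless pipelines can be sketched as follows: the mean ROI intensity over time is transformed with a discrete Fourier transform and the strongest frequency inside a physiological band is converted to a rate per minute. The synthetic signal, the frame rate and the band limits are assumptions made for illustration; they are not the settings used in [125,128,129].

```python
import math

def dominant_frequency(signal, fps, f_min, f_max):
    """Frequency (Hz) with the highest DFT power inside [f_min, f_max]."""
    n = len(signal)
    mean = sum(signal) / n
    centred = [s - mean for s in signal]  # remove the DC component
    best_freq, best_power = None, -1.0
    for k in range(1, n // 2 + 1):
        freq = k * fps / n
        if not (f_min <= freq <= f_max):
            continue
        re = sum(c * math.cos(2 * math.pi * k * i / n) for i, c in enumerate(centred))
        im = -sum(c * math.sin(2 * math.pi * k * i / n) for i, c in enumerate(centred))
        power = re * re + im * im
        if power > best_power:
            best_freq, best_power = freq, power
    return best_freq

# Synthetic 10 s ROI signal at 30 fps: a 2.0 Hz "pulse" plus a 0.5 Hz "breathing" term.
FPS = 30
signal = [
    math.sin(2 * math.pi * 2.0 * i / FPS) + 0.3 * math.sin(2 * math.pi * 0.5 * i / FPS)
    for i in range(10 * FPS)
]

# Hypothetical search bands for heart rate and respiratory rate.
heart_rate_bpm = 60 * dominant_frequency(signal, FPS, f_min=1.5, f_max=4.0)
resp_rate_rpm = 60 * dominant_frequency(signal, FPS, f_min=0.3, f_max=1.0)
print(heart_rate_bpm, resp_rate_rpm)
```

On real recordings the ROI signal is far noisier, which is consistent with the observation above that a small ROI is one of the main causes of failure with neonates.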

Posture/Gait Analysis
Problems related to standing infants are linked to motor deficits or to temporary or chronic illnesses. They involve the way a child walks, stands or sits and require precise quantitative assessment to evaluate both the severity of the pathology and the effectiveness of clinical treatments. From this perspective, the spectrum of dynamics involved is much broader and includes monitoring how children walk or sit and how they perform specific actions.
In this area, we can find works presenting tools for the diagnosis of motor impairments, for assessing temporary or chronic diseases, and for evaluating the efficacy of drugs or the outcomes of therapeutic sessions. The most relevant work is summarised in Table 8. It is worth noting that by examining these works, once again, it becomes clear how the demand for specific datasets can be pivotal for the development and assessment of specific algorithms. The work in [140] focuses on Duchenne muscular dystrophy (DMD) and is aimed at developing a digital platform to enable innovative outcome measures. Eleven participants were involved (the median age was 13). Six participants were ambulant and five non-ambulant. Each participant was recorded at home while performing tasks decided by medical experts. Video analysis was then performed using OpenPose software, and different parameters, such as trajectory, smoothness and symmetry of movement, and voluntary or compensatory movements, were extracted. Data from the videos of DMD participants were compared to data from the healthy control on four tasks: walking, Hands-to-head while standing, Hands-to-head while sitting and Sit-to-stand then hands-to-head while standing. Front and side views were used.
In [139], videos of children with ASD in an uncontrolled environment were analysed by a multi-modality fusion network (RGB and optical flow) based on 3D CNNs. The final goal was recognizing autistic behaviours in videos. The method is based on the I3D architecture pre-trained on a large-scale action recognition dataset and fine-tuned on a small dataset of stereotypic actions. The child was detected by Yolov5 [141] and tracked by the DeepSORT algorithm [142]. Optical flow extraction was performed by the RAFT algorithm [143]. Extensive experiments on different deep learning frameworks were performed to propose a baseline. The best accuracy achieved was 86.04%, using a fusion of RGB and flow streams.
The authors in [137] analysed clinical gait analysis videos from young patients (average patient age was 11 years). For each video, they used OpenPose to extract time series of anatomical landmarks. Next, these time series were processed to create features for supervised machine learning models (CNN, RF, and RR) to predict gait parameters and clinical decisions. The approach relying on CNN for classification outperformed the others with an AUC of 0.71 in correctly predicting surgery decisions.
An interesting study was conducted in [138], in which the authors observed the coordination patterns in 11-month-old pre-walking infants with a range of cruising (moving sideways in an upright posture while holding onto support) and crawling experiences. Computer vision tasks were delegated to the AutoViDev system. Subsequently, the authors identified infants' coordination patterns, demonstrating how infants learn to assemble solutions in real time as they encounter new problems. This evolutionary model could be used to assess motor or neurological impairments.

New Research Directions for More Accurate Infants Pose Estimation
In this section, up-to-date computer vision strategies for human pose estimation are reported and discussed with reference to their possible application to infants and viable research directions. First, it is important to observe that all the listed strategies have been trained and tested on adults. Assessing their performance on children is therefore the first pathway to be suggested to the research community, in the hope that their effectiveness (and, in some cases, their reduced computational workload) will be preserved on datasets involving children. Recently, transformer-based solutions have shown great success in 3D human pose estimation. Under this premise, a breakthrough work is the one introducing PoseFormer [144], a pure transformer-based approach for 3D pose estimation from 2D videos. The spatial transformer module encodes the local relationships between the 2D joints and the temporal transformer module captures global dependencies across frames regardless of their temporal distance. Extensive experiments show that the PoseFormer model achieved state-of-the-art performance on popular 3D pose datasets. Code is available at https://github.com/zczcwh/PoseFormer (accessed on 4 September 2023). Another important related achievement that deserves a mention is PoseAug [145], a novel auto-augmentation framework that learns to augment the available training poses towards greater diversity and thus enhances the generalization power of the trained 2D-to-3D pose estimator. It has been conceived to address the inferior generalization performance of existing 3D human pose estimation methods on new datasets. In other words, it augments the diversity of 2D-3D pose pairs in the training data. The code is available at https://github.com/jfzhang95/PoseAug (accessed on 4 September 2023). Both methods could speed up the clinical assessment and diagnosis of children thanks to their capability to localize joints with higher precision independently of specific acquisition setups and camera views. In [146], a Spatio-Temporal Crisscross attention (STC) block has been introduced to improve joint correlation computation for comparing trajectories in 3D space, including spatial and temporal analysis.
The system works very well on complicated pose articulations (such as those of children, especially while they lie in a bed). These systems are highly complex: to overcome this, a tokenization mechanism can make it possible to operate on temporally sparse input poses while still generating dense 3D pose sequences, as proposed in [147]. The code and models can be accessed at https://github.com/goldbricklemon/uplift-upsample-3dhpe (accessed on 4 September 2023). This could particularly help with children, where occlusions often appear and reduce the availability of 2D data. Viable alternatives to transformers, also beyond CNNs, have also been recently proposed. For example, in [148], capsule networks (CapsNets) have been introduced for 3D human pose estimation, ensuring viewpoint equivariance and drastically reducing both the dataset size and the network complexity, while retaining high output accuracy. Its peculiarities make this approach very suitable for modelling children's poses with few shots, even using domestic setups.

Discussion
Protecting and safeguarding children's health is a key priority that benefits society as a whole. The World Health Organization and the United Nations Children's Fund have specialised in improving children's health, including care before and after birth [149]. There is evidence from longitudinal studies showing that the benefits of healthy childhood development extend to older ages [150].
Advances in the use of health technologies have the potential to bring further benefits to neonatal and paediatric healthcare, as is recognised by the same health organisations [10]. Ambient intelligence and CV, as a means of unobtrusive, contactless, remote monitoring of children's physical, motor and mental health status and activities in both healthcare and private settings, can assist in a range of clinical tasks and thereby contribute to a better understanding of child physiology and pathophysiology [151].
The scientific and technological communities have become more aware of this potential in recent years, as is evidenced by the rising trend in the number of publications we have retrieved in the past five years (i.e., 54 out of the 65 papers retrieved have been published in the last 5 years).
Research in the field is spread across several countries. In Italy, there is a vibrant and lively community of scientists and scholars working to advance the scientific frontiers. Initially, research has mostly focused on monitoring and improving the interaction with children with ASD [41,56,105,152,153]. Most recently, attention has moved to the prediction of neurological development disorders [12,132], especially in relation to general movements [111,114,123] and for preterms [122,132].
Although great strides have been made in the past few years, several open issues and challenges still need to be addressed, also in relation to ethical and legal concerns, to reach a significant level of performance and to allow the instruments to be effectively used in clinical practice, as we will discuss in the following subsections.

Gaps and Open Challenges
Among existing challenges, the lack of task-specific public datasets, which are missing in several areas, represents one of the most critical issues. On the one hand, this lack forces several research groups to spend effort building datasets from scratch; on the other, it makes a fair comparison of algorithms difficult. Of course, collecting datasets of children is even more challenging for several reasons, ranging from privacy regulations and parental consent to limited sample sizes and consent withdrawal.
The availability of large, shared datasets is still a long way off, but there are other gaps and limitations in this topic that should be addressed. From the analysis of the literature, it has emerged that a large number of metrics are used to assess the proposed machine learning and computer vision methods. Selecting the most suitable metric for each specific task is a big challenge, and that choice should be shared and accepted across research groups. In fact, even if data were available, experimental baselines would still need to share common reference metrics. This is a big challenge, especially in the case of face analysis. Indeed, by observing the tables in Sections 2 and 3, it is possible to see a large number of metrics in use, depending on the specific task addressed. Some metrics look at the broad clinical problem (normal/atypical), whereas others concentrate on the finer visual task (e.g., landmark positioning), which still requires a supervisor (human or automatic) to make a diagnosis. As a result, methods are not easily comparable, and it is not trivial to understand which one might help in clinical practice. On the other hand, there are still many qualitative evaluations that do not help the clinician in clinical follow-up, since subjectivity is not removed but rather reinforced once it drives the automated process.
Another limitation that slows the development of effective machine learning approaches involving children is the need for long-term follow-up: many medical conditions require long-term follow-up to accurately observe and evaluate the clinical evolution. This extended time frame is necessary to assess the effectiveness of treatments, the progression of diseases, or the occurrence of relevant events. Waiting for this follow-up period adds to the time required for verification. This can also affect the statistical significance and sample size. Efforts in this research direction could also allow researchers to deploy foundation models, which are at the leading edge of machine learning research right now and are pushing ahead the so-called generalist medical AI [154].
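To make the comparability problem concrete, the following minimal sketch computes one metric of each kind discussed above: accuracy on the clinical question (correct predictions over total predictions, as defined in the table captions) and a landmark-level error. The normalised mean error and the choice of normalising length are assumptions made for illustration; they are not prescribed by any of the surveyed works, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def accuracy(predicted: Sequence[str], actual: Sequence[str]) -> float:
    """Clinical-level metric: correct predictions / total predictions."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def normalised_mean_error(pred: Sequence[Point], truth: Sequence[Point], norm: float) -> float:
    """Visual-task metric: mean landmark distance divided by a reference length.

    `norm` is assumed to be a per-face length such as the inter-ocular
    distance, so that the error does not depend on image resolution.
    """
    if len(pred) != len(truth) or not truth or norm <= 0:
        raise ValueError("inputs must be non-empty, equal length, with norm > 0")
    total = sum(math.dist(p, t) for p, t in zip(pred, truth))
    return total / (len(truth) * norm)
```

The two numbers live on different scales and answer different questions: a method reporting a low landmark error says nothing, by itself, about how often the downstream normal/atypical decision is right, which is why a shared reference metric per task is needed before results can be compared.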
Addressing these challenges requires interdisciplinary collaboration between researchers, ethicists, and legal experts to ensure that the collection and use of children's data for AI model training aligns with ethical and legal guidelines while prioritising the privacy and well-being of children. In the following section, we overview some of the most common ethical and legal concerns and suggest possible solutions where available.

Ethico-Legal Considerations
The debate on ethical and legal issues in neonatology and paediatrics is broad and long-standing and has been addressed in a large body of literature in the field, both from a general perspective [155] and in specific scenarios [156,157]. The ethical mandates to which clinical practice should adhere include respect for parental autonomy, the primacy of the best interests of the child, doing no harm, and the right to be informed and to give consent.
As far as ambient intelligence and CV are concerned, an ethical approach to technology development and a thorough understanding of all the relevant ethical, legal and social issues raised by monitoring technologies should be a top priority for researchers and innovators. This is the only way to ensure immediate acceptance and long-term use. The debate in this respect is more mature when ambient intelligence and assistive technologies are targeted at older adults and their caregivers [158]. In neonatology and paediatrics, a child-focused approach is certainly the way forward to ensure a safe, effective, ethical and equitable future for these technologies.
Most of the papers published to date have addressed ethics and bioethics for any technological aid in clinical practice [159,160], while some recent publications have addressed the ethical and legal implications of the use of artificial intelligence in child education [161,162], child entertainment [163] and child care [164,165].
Overall, we can identify some key ethical and legal issues that researchers in CV and ambient intelligence in childcare should be aware of. These are privacy, extensive validation, transparency and accountability, and are discussed below along with some recommendations to address them.
• Privacy: the privacy and confidentiality of children and their parents are treated with high standards, as already introduced in the previous section. This hinders the rapid development of technology to some extent but ensures that children's dignity and respect are properly taken into account. It is worth noting that when ambient intelligence comes into play, privacy becomes an issue not only for patients and parents but also for clinicians and caregivers. Addressing this issue at the technical level requires the adoption of privacy-preservation approaches such as those based on privacy-preserving visual sensors (e.g., depth or thermal sensors) or those based on ad hoc techniques able to ensure context-aware visual privacy and retain all the information contained in RGB cameras [166]. This may help reduce the feeling of intrusion in parents and caregivers.
• Extensive validation: scientists are aware of the inherent limitations of data-inductive techniques, such as those CV methods that use machine learning approaches. The accuracy of these methods is closely related to the type and quality of data used to train and develop them. For this reason, it is very important to perform extensive technical and clinical validation of such methods to verify their ability to generalise and handle unknown conditions. Standardised external validation and multi-centre studies should be carefully planned, together with standardised evaluation metrics, to demonstrate the reliability of the methods developed, particularly in terms of generalisability, safety and clinical value.
• Transparency: the use of technology should be made clear and transparent, thus avoiding any grey areas and uncertainties in their adoption. This entails accounting for the relevant details about the data used, the actors involved, the choices and processes enacted during development along with the main scope and limitations of the CV and ambient intelligence tools. In addition, meaningful motivations behind their outputs should be provided, especially when they are used to support diagnostic and prognostic processes. Only in this way can end-users and beneficiaries, mainly children, caregivers, clinicians, nurses and parents, really be aware of and empowered by the CV- and AI-powered technologies and gather trust in them [167–169]. The final goal is actually to contribute to collaborative decision-making, by augmenting caregivers and recipients with powerful information-processing tools.
• Accountability: healthcare professionals are responsible for justifying their actions and decisions to patients and their families, and are liable for any potential positive or negative impact on the patient's health. The use of decision support technologies, such as those based on CV and ambient intelligence, should be clearly modelled in the legal framework of medical liability to avoid any grey area when clinicians decide to use the results of a tool or follow a suggestion received. This is still a very controversial issue. On a technical level, CV applications can implement traceability tools that document their entire development lifecycle, making it easier to deal with cases where something goes wrong.
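As an illustration of the external, multi-centre validation recommended above, the sketch below builds leave-one-centre-out splits: each centre is held out in turn, so that a method is always tested on data from a site it never saw during training. This is one common protocol among several and is offered as an assumption, not as the procedure used by any surveyed work; `leave_one_centre_out` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple


def leave_one_centre_out(
    centres: Sequence[str],
) -> Iterator[Tuple[str, List[int], List[int]]]:
    """Yield (held_out_centre, train_indices, test_indices) for every centre.

    `centres[i]` is the acquisition site of sample i.
    """
    by_centre: Dict[str, List[int]] = defaultdict(list)
    for i, centre in enumerate(centres):
        by_centre[centre].append(i)
    if len(by_centre) < 2:
        raise ValueError("external validation needs at least two centres")
    for held_out in sorted(by_centre):
        test = list(by_centre[held_out])
        train = sorted(
            i for centre, idx in by_centre.items() if centre != held_out for i in idx
        )
        yield held_out, train, test
```

Reporting the chosen metric separately for each held-out centre, rather than only its average, makes it visible when performance depends on a particular acquisition setup.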

Conclusions
This paper surveyed, for the first time in the literature, the works covering children's health-related issues by ambient intelligence methods and systems relying on computer vision. A taxonomy has been introduced by dividing works according to the part they concentrate on, e.g., the face for extracting gaze direction and facial expressions, or the whole body, for gait analysis, posture estimation, and human-object interaction. For each research area, publicly available datasets and new computer vision perspectives have been discussed with particular attention to some challenges that still need to be addressed to reach a level of performance that allows the instruments to be effectively used in clinical practice.
In the coming years, we expect to see a significant increase in work in this area, both from the ethico-legal community and from the scientific and technological community. In particular, with regard to scientific and technological advances, future developments are expected to take place in several directions:

• The collection and availability of larger datasets, also covering longer periods of children monitoring;
• The improvement of current solutions thanks to more precise and advanced methods, also based on foundational vision models;
• The integration of different types of visual sensors, such as thermal cameras that might provide relevant information, for instance, about the development of the thermoregulatory system of newborns;
• The integrated processing of multimodal data, such as audio signals (e.g., to monitor children's crying), IoT data (e.g., from smart mattresses) and videos, thereby allowing, for example, a comprehensive monitoring of the health and well-being status of newborns in nurseries or in NICUs;
• The optimization of computing and sensing facilities to enable technology diffusion in resource-limited and most needy countries.
Overall, considering the new perspectives that CV and machine learning tools can open, we deem it relevant to stress that researchers and innovators should strive to comply with several mandates at technical, socio-ethical and organizational levels. Solutions should strictly comply with existing and emerging regulations, such as the Artificial Intelligence Act (COM/2021/206 final, available at https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:52021PC0206, accessed on 15 September 2023). Only in this way can they aspire to real-life adoption and, thus, have an actual impact. Currently, innovation endeavours in this field are still in their early stages, but we are sure they can benefit from the more mature discussion going on in the field of ambient intelligence and Active and Assisted Living, towards a really beneficial application for children, parents, caregivers and society at large [158].

Figure 1. Schema of the proposed taxonomy.

• Scopus QUERY "TITLE-ABS-KEY ((newborn OR baby OR children OR toddler OR infant) AND (face OR facial) AND (analysis OR detection OR recognition OR tracking) AND "computer vision" AND PUBYEAR > 2014 AND PUBYEAR < 2024 that returned 158 documents;
• Web of Science Core Collection ((((ALL=(children)) OR ALL=(infant)) OR ALL=(baby)) OR ALL=(newborn)) AND ((ALL=(face)) OR ALL=(facial)) AND ((ALL=(analysis)) OR ALL=(detection) OR ALL=(recognition)) AND ALL=(computer) AND ALL=(vision)), refined in the YEARS from 2015 to 2023, that returned 197 documents;
• Scholar allintitle: children OR newborn OR babies OR infants OR face OR facial OR analysis OR recognition OR detection OR tracking OR "computer vision", refined in the YEARS from 2015 to 2023, that returned 158 documents.

Figure 2. A typical experimental setup for children monitoring in a NICU. Image has been taken from [125].

Table 2. A partial summary of the selected work for head pose estimation and/or gaze tracking.

Table 3. Summary of the selected work for face expressions analysis. AAM = Active Appearance Models; 'acc' = accuracy (correct predictions/total number of predictions with respect to the clinical goal occurrences tested); AU = Action Units; AUC = Area Under the Curve; CERT = Computer Ex-

Table 4. Summary of the selected work for multimodal face analysis. 'acc' = accuracy (correct predictions/total number of predictions with respect to the clinical goal occurrences tested); ADHD = Attention deficit hyperactivity disorder; ICC = Intra Class Correlation coefficient.
All subjects were born in a large Midwestern hospital in the United States. All newborns involved in the study were Caucasian, evenly divided between the sexes (13 boys and 12 girls), and in good health.
• CAFE Database [75]: The CAFE set is a collection of 1192 photographs of 2- to 8-year-old children posing with the six basic emotions defined by Ekman [82]: sadness, happiness, surprise, anger, disgust and fear. It also includes a seventh neutral expression. The set is also racially and ethnically diverse, with 27 African American, 16 Asian, 77 Caucasian/European American, 23 Latino, and 11 South Asian children. Photographs include enough face variability to allow independent researchers to determine and study the natural variation in human facial expressions. The children were asked to pose with their mouths open and closed for each expression except surprise. Surprised faces were open-mouthed only. Open-mouthed, disgusted faces usually included a tongue protrusion.
• CLOCK Database [70]: This database was generated by a multi-site longitudinal project known as CLOCK (Craniofacial microsomia: Longitudinal Outcomes in Children pre-Kindergarten), which examined the neurodevelopmental and phenotypic outcomes of children with craniofacial microsomia (CFM) and demographically matched controls [85].
• LIRIS-CSE Database [71]: It features video clips and dynamic images consisting of 26,000 frames depicting 12 children from diverse ethnic backgrounds. This database showcases children's natural, unforced facial expressions across various scenarios, featuring six universal or prototypical emotional expressions: happiness, sadness, surprise, anger, disgust, and fear, as defined by Ekman [73]. The recordings were made in unconstrained environments, enabling free head and hand movements while sitting freely. In contrast to other public databases, the authors assert that they were able to capture children's natural expressions as they happened, thanks to the unconstrained environment. The database has been validated by 22 human raters.
• GestATional Database [23]: It comprises 130 neonates recruited between October 2015 and October 2017. Clinical staff at Nottingham University NHS Trust Hospital, Nottingham, UK carried out recruitment and sorted the neonates into five groups based on their prematurity status. The data gathered included: (i) images of the neonates' faces, feet, and ears; (ii) case report forms with important information such as the baby's gestational age, days of life at the time of the visit, current weight, Ballard Score, the mother's medical history, and information related to the delivery.
• FF-NFS-MIAMI Database [68,69]: It is a database documenting spontaneous behaviour in 43 four-month-old infants. Infants' interactions with their mothers were recorded during a Face-to-Face/Still-Face (FF/SF) protocol [84]. The FF/SF protocol elicits both positive and negative affect. It assesses infant responses to parent unresponsiveness, an age-appropriate stressor. AUs were manually annotated from the video by certified FACS coders for four action units: AU4 (brow lowering), AU6 (cheek raising), AU12 (lip corner pulling) and AU20 (lip stretching). The combination of AU6 and AU12 is associated with positive affect; AU4 and AU20 are associated with negative affect. The video resolution is 1288 × 964. There are 116,000 manually annotated frames in 129 videos of 43 infants.
• USF-MNPAD-I Database [65]: The University of South Florida Multimodal Neonatal Pain Assessment (USF-MNPAD-I) Dataset was collected from 58 neonates (27-41 weeks gestational age) while they were hospitalised in the NICU, undergoing procedural and postoperative procedures. It comprises video footage (face, head, and body), audio (crying sounds), vital signs (heart rate, blood pressure, oxygen saturation), and cortical activity. Additionally, it includes continuous pain scores, following the NIPS (Neonatal Infant Pain Scale) scale [85], for each pain indicator and medical notes for all neonates. This dataset was obtained as a component of a continuous project centred on creating cutting-edge automated approaches for tracking and evaluating neonatal pain and distress.

Table 8. Summary of selected work for posture/gait analysis. AUC = Area Under Curve; acc = accuracy (correct predictions/total number of predictions with respect to the clinical goal occurrences tested); ASD = Autism Spectrum Disorders.
(GDPR) and the Children's Online Privacy Protection Act (COPPA) in the United States. These laws require obtaining explicit consent from parents or guardians and ensuring the anonymity and security of children's personal information. Meeting these requirements can be complex and time-consuming.
• Parental consent: Obtaining parental consent for data collection can be difficult, especially if it involves sensitive information or requires active participation from children. Parents may be concerned about the potential risks of data misuse or the potential impact on their child's privacy. Building trust and addressing these concerns is crucial, and it often involves clear communication and transparency about data handling practices.
• Limited accessibility: Children may have limited access to technology or may not be able to provide consistent or reliable data due to various factors like socioeconomic disparities, geographical location, or cultural norms. This can result in biased or incomplete datasets, which can negatively impact the performance and fairness of AI models.
• Dynamic and diverse nature of children's behaviour: Children's behaviour, cognition, and language skills undergo rapid development and change over time. Creating a dataset that adequately captures this dynamic nature requires extensive longitudinal studies, which can be resource-intensive and time-consuming.
• Ethical considerations in data collection: Collecting data from vulnerable populations, such as children, requires special care to ensure their well-being and protection. Researchers must consider the potential emotional or psychological impact on children and ensure that the data collection process is designed ethically and with sensitivity.
• Limited sample size: Children constitute a smaller population subset compared to adults, making it challenging to gather a sufficiently large and diverse dataset. Limited data can lead to overfitting, where the AI model performs well on the training data but fails to generalize to new examples.
• Consent withdrawal and data management: Children's participation in data collection should be voluntary, and they or their parents should have the right to withdraw consent at any time. Managing and removing data associated with withdrawn consent can be challenging, especially if it has already been incorporated into AI training models.
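The consent-withdrawal point can be made concrete with a minimal sketch of how a dataset index might be purged when a family withdraws. It assumes that every recording is linked to a pseudonymous subject identifier; the `Recording` record and the `purge_withdrawn` helper are hypothetical and are not taken from any surveyed system.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Recording:
    subject_id: str  # pseudonymous identifier, never a real name
    path: str        # location of the stored video or image


def purge_withdrawn(
    recordings: Iterable[Recording], withdrawn: Set[str]
) -> Tuple[List[Recording], List[Recording]]:
    """Split recordings into those to keep and those to delete.

    The second list is returned so that the deletion itself can be
    carried out and logged by the caller.
    """
    kept: List[Recording] = []
    removed: List[Recording] = []
    for rec in recordings:
        (removed if rec.subject_id in withdrawn else kept).append(rec)
    return kept, removed
```

Filtering the index only covers the stored data: a model already trained on the removed recordings is unaffected by this step, which is exactly the difficulty noted in the list above.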