Deep learning systems for estimating visual attention in robot-assisted therapy of children with autism and intellectual disability

Recent studies suggest that some children with autism prefer robots as tutors for improving their social interaction and communication abilities which are impaired due to their disorder. Indeed, research has focused on developing a very promising form of intervention named Robot-Assisted Therapy. This area of intervention poses many challenges, including the necessary flexibility and adaptability to real unconstrained therapeutic settings, which are different from the constrained lab settings where most of the technology is typically tested. Among the most common impairments of children with autism and intellectual disability is social attention, which includes difficulties in establishing the correct visual focus of attention. This article presents an investigation on the use of novel deep learning neural network architectures for automatically estimating if the child is focusing their visual attention on the robot during a therapy session, which is an indicator of their engagement. To study the application, the authors gathered data from a clinical experiment in an unconstrained setting, which provided low-resolution videos recorded by the robot camera during the child–robot interaction. Two deep learning approaches are implemented in several variants and compared with a standard algorithm for face detection to verify the feasibility of estimating the status of the child directly from the robot sensors without relying on bulky external settings, which can distress the child with autism. One of the proposed approaches demonstrated a very high accuracy and it can be used for off-line continuous assessment during the therapy or for autonomously adapting the intervention in future robots with better computational capabilities.


Introduction
Recent technology solutions in intelligent systems and robotics have made possible innovative intervention and treatment for individuals affected by Autism Spectrum Disorder (ASD), which is a pervasive neurodevelopmental disorder in which deficits in social interaction and communication can make ordinary life challenging from childhood through adulthood [1]. Children with ASD often experience anxiety when interacting with other people due to the complexity and unpredictability of human behaviors. The controllable autonomy of robots has been exploited to provide acceptable social partners for these children [2] in the treatment of this disorder. Indeed, several studies have shown that some individuals with ASD prefer robots to humans and that robots generate a high degree of motivation and engagement, including in children who are unlikely or unwilling to interact socially with human therapists (see [3] for a review).
Epidemiological data show that ASD is often comorbid with some level of Intellectual Disability (ID) [4]. In fact, it has been reported that 54% of children with ASD have an IQ below 85, which makes therapeutic interventions more difficult due to the limited capabilities of the subjects, who often need hospitalization. The treatment of these children is more likely to benefit from the introduction of technological aids. For instance, in the therapeutic imitation training presented in this article, the standard approach is to employ two clinical personnel: one must stay close to the child to provide constant support, while the other performs the tasks to imitate. Intelligent semi-autonomous systems like robots could assist by performing the imitation tasks in this case and would, therefore, require only one therapist to support the child while controlling the system.
The work presented in this paper is part of the EU H2020 MSCA-IF CARER-AID project, which aims to improve robot-assisted therapy interventions for children with ASD and ID via an automatic personalization of the robot behavior that should meet the patient's condition. The aim is to fully integrate the robot within a standard treatment, the TEACCH [5] (Treatment and Education of Autistic and related Communication Handicapped Children) approach, which is commonly used for the treatment of ASD.
One of the most common characteristics of ASD is the impairment in using eye gaze to establish attention in social interaction; it is one of the most important traits assessed for diagnosis and one of the main areas of intervention in therapy [6].
Small robotic platforms used for robot-assisted therapy usually have limited on-board sensors, and the use of external devices is very common in Human-Robot Interaction (HRI) for computing the human behavior in this context. Therefore, the challenge is to estimate the child's visual attention directly from the robot cameras, possibly without the need for external devices, such as high-resolution cameras and/or Microsoft Kinects (or equivalent), which can increase the performance but, at the same time, limit the portability of the system and make its actual integration within the standard therapeutic environment more difficult.
Moreover, especially in the case of ID, children are unlikely to adhere to any imposed constraint like those typically required to maximize algorithm performance. Therefore, flexibility and adaptability are essential prerequisites for the inclusion of any technology in the actual therapy [7]. In the case of robot-assisted therapy, preliminary experiments suggested that deriving attentional focus from a single camera would not be accurate enough [8]. By contrast, more recent experiments found acceptable levels of agreement between the robot observations and manual annotations [9].
To investigate this issue, the authors considered training attention classifiers based on state-of-the-art algorithms from the popular deep learning family of neural networks [10,11]. These artificial neural network architectures have achieved exceptional results in computer vision [12], and the authors hypothesize that these technological improvements could empower robots to perform reliably in vision tasks such as estimating engagement via the attention focus, and to use this information to personalize the interaction.
This article presents the results of the evaluation of two architectures for estimating the child's attention to the robot from low-resolution video recordings taken by the robot camera while interacting with the children. The approach was tested on videos collected during up to 14 sessions of robot-assisted therapy in an unconstrained real setting with six hospitalized children with ASD and ID. The clinical results of the field experiment presented in this article are analyzed and discussed in [13,14].
The rest of the paper is organized as follows: Section 2 reviews the scientific literature that constitutes the background for this work; Section 3 introduces our approach for classifying the child's attention status during the robot-assisted therapy, describes the procedure used to collect the video recordings and produce the dataset for the experimentation, and presents the algorithms that constitute our system; Section 4 analyzes the numerical results; finally, Section 5 gives our discussion and conclusion.

Social Assistive Robotics for Children with Autism
Considering the complexity and breadth of the autism "spectrum", which encompasses different disabilities and severity levels, it is appropriate to use a multi-modal intervention that can be adapted to the individual's needs to obtain the best benefits from the therapy. Therefore, the controllable autonomy of robots has been exploited to provide acceptable social partners for these children [2].
Socially Assistive Robotics (SAR) is a novel field of application in robotics, which merges assistive and social robotics to design new platforms and services to help users through advanced interaction driven by their needs (e.g., tutoring, physical therapy, daily life assistance, emotional expression) via multimodal interfaces (speech, gestures, and input devices) [15]. Several studies in this area have shown that some individuals with ASD prefer social robots to humans. For example, the social robot "Probo" could help the social performance of some children with ASD in specific situations like making a fruit salad [16]. Robins et al. [17] showed that children with ASD and limited verbal skills prefer interacting with robots to interacting with humans. Recent studies have successfully presented robots as mediators between humans and individuals with ASD [18]. For instance, Duquette et al. [19] showed improvements in affective behavior and attention sharing with co-participating human partners during an imitation task solicited by a simple robotic doll. Social robots might be especially beneficial for individuals with ASD who face communication difficulties because practicing communication can be less intimidating with a robot than with another person [20,21].
Humanoid robots, which look like a human being but are much less complex, could make learning easier for a child with ASD and then facilitate the transfer of skills learned through models of imitative human-robot interaction to child-human interaction [22]. Indeed, imitation as a means of communication can be related to positive social behavior and it is considered a good predictor of social skills [23]. Children with autism are often characterized by difficulties in imitating the behavior of other people and, therefore, imitation is employed in therapy to promote better body awareness, sense of self, creativity, leadership, and ability to initiate interaction [24].
Recent robotics research has shown numerous benefits of robot assistants in the treatment of children with ASD [17,25].

Visual and Social Attention in Children with ASD and Robots
People naturally tend to look at and focus their attention on objects that are of immediate interest [26]. Visual attention is also normally established in social contexts like a conversation between two people, and the ability to correctly simulate focusing the visual attention on the interaction partner is considered a way for robots to exhibit social intelligence and awareness and facilitate HRI [27].
This natural process can be identified as part of social attention, which unifies several domains of attention in social contexts [28]. In this article, as is common in the case of ASD, the authors use social attention to identify the domain of social motivation. Indeed, they look at the capability of the children to direct the visual focus of attention on the interaction partner, i.e., the robot, as a way of assessing their engagement in the therapeutic activities.
Due to this innate attitude of human beings toward social attention, the deficit in social attention with others is one of the most noticeable features of ASD [29] and it is among the earliest signs of the disorder [30]. Consequently, it plays a crucial role in the detection and assessment of this disorder as well as in therapies.
Thus, it is crucial that a robot assistant is able to evaluate whether the child is looking at it while engaged in a therapeutic session. This capability can be used in the assessment of the child's condition for the diagnosis and, then, during the therapy to keep track of progress. Furthermore, the robot can use this information to autonomously adapt the interaction with the child during the therapeutic session, for instance by playing a sound to attract the child's attention when he/she is distracted.
The authors remark that they use "Attention" to identify when the child is directing the attentional focus on the robot or "Distraction" when he/she is not monitoring or following the robot prompts.

Video Analysis for Face Detection
Human face localization plays an essential role in a countless number of applications such as human-computer and human-robot interaction systems [27,31], pilot/driver attention levels [32,33], video surveillance [34], face recognition [35], and facial expression analysis [36,37].
Face detection aims at finding whether faces are present in a given image and, if any is found, returning their locations. In practice, locations are returned as bounding boxes expressed in pixel coordinates, e.g., the upper and lower corners of the rectangle that contains the face.
Generally, the application of a face detection algorithm is a prerequisite for most face recognition and tracking algorithms, which assume that the face location is known [38]. Conversely, a few techniques originally developed for face recognition have also been used to detect faces, but their computational requirements are excessive for the task and they have demonstrated limited generalization capability, as their performance is significantly reduced when applied to "uncultivated" contexts [39].
Recent studies show that deep learning neural network approaches can achieve impressive performance in face detection and face alignment [38]. Face alignment algorithms aim at identifying the geometric structure of human faces in digital images and return an estimate of the positions of the face's landmarks, also known as facial feature points. Facial alignment calculation usually begins from a rectangular bounding box returned by a face detector [40]. However, novel deep learning algorithms have been proposed to simultaneously detect the face location and estimate basic landmarks. For instance, the algorithm used in this work estimates the positions of the eyes, nose, and mouth corners [41].

Face Detection and Alignment for Estimating the Visual Focus of Attention in HRI
Face detection and localization of facial features are part of the common processing scheme for determining driver visual attention from video cameras [42].Facial features can be used to estimate the head pose [43], which is a reliable indicator of the gaze, and, thus, of the visual focus of attention, which can be effectively estimated by head pose only [44,45].
Visual attention in human-robot interaction has been widely studied because it is crucial, but it is often evaluated using external devices, such as high-resolution cameras and/or Kinects, e.g., [46,47], due to the limited video recording resources on the most common commercial robotic platforms for social applications. For example, the authors of [18] proposed a system for robot-assisted therapy that employs five individual sensors, including three RGB cameras and two Microsoft Kinects.
Research on face detection from a single camera has been carried out in the field of ASD, where the majority of the work has focused on tracking the face of the individual with ASD and moving the robot face accordingly [48,49,50]. However, these studies usually employed standard software like OpenCV and did not evaluate the accuracy of the algorithms. Two studies evaluated the face detection performance from a single camera in the context of robot-assisted therapy, but they came to different conclusions: one suggested that deriving attentional focus from a single camera would not be accurate enough [8], whereas more recent experiments found acceptable levels of agreement between the robot observations and manual post-hoc annotations [9]. However, neither of them tested the system in the clinical context with children with ASD and ID.

Materials and Methods
The methodology used for this work comprises gathering data from a field experiment in which the authors tested the integration of robot-assisted sessions into the standard therapy of hospitalized children with ASD and ID. The video recordings collected during the field experiment are used for training and evaluating an attention classifier based on the low-resolution video recorded by the robot assistant during the therapy sessions.
Section 3.1 describes the clinical experiment and explains how the data were collected and their format. Section 3.2 presents our architecture, its components, and a widely used algorithm for real-time face detection used for comparison.

Data Gathering from the Clinical Experiment
This section presents the details of the clinical experiment that generated the recordings used to create the database for the machine learning experiments. The clinical results of this experiment are presented in [13,14]. To summarize, the robot-assisted therapy proved effective in imitation training for four children with ID levels from mild to severe, who were able to learn all three imitation tasks. However, the two other children with the lowest IQ (profound ID) did not show significant improvement, probably due to their severe limitation in comprehending the instructions. This result suggests that other ways of interaction or technologies should be investigated for these extremely difficult cases.

Participants
Six children were selected among patients diagnosed with ASD and ID, who are currently receiving treatment at the IRCCS Oasi Maria SS of Troina (Italy), a specialized institution for the rehabilitation of intellectual disabilities. The children are inpatients of the institution, where they live for most of the time and follow a daily clinical program of training using the TEACCH approach with psychologists and highly specialized personnel. All children have ASD of grade 3, while the ID levels range from mild to profound.
Participants' ASD and ID levels were diagnosed before the start of this study with the following standard psycho-diagnostic instruments: Leiter-R, WISC, PEP-3, VABS, ADI-R, and CARS-2. For more details about the diagnostic procedure and the instruments, see [4,22].
Ethical approval was obtained from both the ethical council of IRCCS Oasi Maria SS of Troina and Sheffield Hallam University. All the parents signed consent forms before their children were included in the study. Children were free to leave the experiment at any time and were always supported by a professional educator other than the researchers.

The Robot Therapist: the Softbank Robotics Nao
The robot used for leading the robot-assisted therapy was the Softbank Robotics Nao v4, which is a small toy-like humanoid robot, very popular for child-robot interaction studies [18,22,51,52].
Unless otherwise specified, this study used the default settings and the standard equipment of the Nao robot v4, which includes two 1.22-megapixel cameras that can be used to take pictures and record videos from the robot's perspective. Camera and other sensor positions are shown in Figure 1. Although, according to the specifications, image resolution is up to 1280 × 960 pixels and video recording is up to 30 fps, the actual resolution and frame rate are usually restricted to 320 × 240 and 10 fps due to the limited computing capacity of the main processor and memory resources.
Among its software features, Nao has a face detection and tracking functionality that was used in the clinical experiments to direct the robot toward the child during the interaction.

Protocol for the Clinical Experiments
To assess the children's behavior, the study adopted the Verbal Behaviour Milestones Assessment and Placement Program (VB-MAPP) [53]. To match the specific level and training needs of the participants, a human therapist conducted a preliminary evaluation of the children's capability to perform the VB-MAPP imitation tasks of levels 1 and 2; the tasks that the children were not able to perform were then selected, and the robot was programmed for their training. The authors selected three tasks (T1, T2, and T3) for the experiments, on which the children had received a milestone score of 0 or 0.5, i.e., they were not able to perform the task or did not perform it properly.
The robot was included in the TEACCH program among the daily activities and identified via a specific "visual schedule". A visual schedule communicates the sequence of upcoming activities or events using objects, photographs, icons, words, or a combination of tangible supports. During each session, the children performed three gross motor imitation tasks managed by the robot only (Figure 2).

To facilitate the interaction, the robot-led therapy sessions were carried out in the same room where the children usually had their treatment sessions. During a training encounter, the robot was placed on a table, at approximately the same height as the child, initially at a distance of at least 1 m. The child could move backward or forward to be more comfortable. The training encounters usually comprised three sessions, one for each task. During each session, the children were encouraged to imitate the task performed by the robot. Tasks were proposed in randomized order to avoid stereotypical learning. First, the robot verbally presented the behavior to perform in simple and clear language; then it prompted the child to imitate its movements while performing them (Figure 2).
A professional educator, selected among those involved in the everyday treatment of the children, was always present to represent a "secure base" for the children. The professional educator gave a positive verbal reinforcement ("good" and/or "right") along with, in some cases, a physical reinforcement (a caress). These reinforcements were different for each child and were connected directly to the child's responses, behavior, and difficulties. During all the tasks, the robot called the children by name to make the interaction more personalized.
The procedure comprised a preliminary session to decrease the novelty effect. During this preliminary session, the robot was presented to all the children in a non-therapeutic context for a total of approximately 10 min.
The actual experimentation began 7 days after the preliminary encounter. The study comprised a total of 14 encounters over one month, i.e., 3 sessions per week. The total length of each session was approximately 6-8 min per child.

Video Recording and Labeling
During each therapeutic session, the interaction was recorded using the robot's top camera (Figure 1). The video recording was restricted to 320 × 240 pixels per frame at up to 10 fps. The restriction is a default setting due to the limited computing capabilities of the robot's CPU, which was not capable of supporting a higher resolution or frame rate while executing the behavior for the therapy. The recordings included the execution of the rehabilitation tasks only. Figure 3 presents one frame as an example of the video recordings. The videos were upscaled to double their resolution, i.e., to 640 × 480, using bicubic interpolation because, in many cases, the children's faces were too small to be recognized by the algorithms. The authors underline that during the sessions, children were free to move and, especially at the beginning, some were positioned more than 1 m from the robot, with the image of their face contained in less than 20 × 20 pixels.
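As an illustration of the upscaling step, the following is a minimal pure-Python sketch of bicubic interpolation on a grayscale image stored as a list of rows. It is not the tool used in the study, which is not stated here; the function names, the kernel coefficient a = -0.5 (a common choice for the cubic convolution kernel), and the edge-replication border handling are all assumptions made for this example.

```python
import math


def cubic_kernel(x, a=-0.5):
    """Cubic convolution kernel; a = -0.5 is an assumed, commonly used value."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0


def resize_bicubic(img, new_w, new_h):
    """Resize a grayscale image (list of rows of numbers) to new_w x new_h."""
    h, w = len(img), len(img[0])
    out = []
    for j in range(new_h):
        # Map the centre of the output pixel back to source coordinates.
        sy = (j + 0.5) * h / new_h - 0.5
        y0 = math.floor(sy)
        row = []
        for i in range(new_w):
            sx = (i + 0.5) * w / new_w - 0.5
            x0 = math.floor(sx)
            value = 0.0
            # Weighted sum over the 4 x 4 neighbourhood of source pixels.
            for m in range(-1, 3):
                yy = min(max(y0 + m, 0), h - 1)  # replicate the border
                wy = cubic_kernel(sy - (y0 + m))
                for n in range(-1, 3):
                    xx = min(max(x0 + n, 0), w - 1)
                    wx = cubic_kernel(sx - (x0 + n))
                    value += img[yy][xx] * wx * wy
            row.append(value)
        out.append(row)
    return out
```

Under these assumptions, a 320 × 240 frame would be upscaled with `resize_bicubic(frame, 640, 480)`, so that a face contained in 20 × 20 pixels would occupy about 40 × 40 pixels. In practice, an optimized image-processing library would normally be preferred for speed.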
The robot's face detection functionality was active and was used to center the top camera on the child during the interaction. However, due to the presence of a professional educator, the camera was occasionally centered on the educator's face. The video recordings used for this work were edited to remove the educator's face. To this end, the authors asked the educator to wear a green headband so that their face was easily recognizable with a computationally inexpensive algorithm. The educators kindly agreed to comply with this request.
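The algorithm used to recognize the headband is not detailed in this section. As a purely hypothetical sketch of how such an inexpensive check could work, the code below counts green-dominant pixels in the upper strip of a face bounding box. The function names, the box format (x, y, width, height), and all thresholds are assumptions chosen for illustration only.

```python
def is_green(pixel, margin=40):
    """True if the green channel dominates red and blue by an assumed margin."""
    r, g, b = pixel
    return g > r + margin and g > b + margin


def has_green_headband(frame, box, strip_fraction=0.25, min_ratio=0.3):
    """Check the top strip of a face box for a green headband.

    frame: list of rows, each row a list of (r, g, b) tuples.
    box:   (x, y, w, h) of a detected face, in pixels.
    strip_fraction and min_ratio are illustrative values, not from the study.
    """
    x, y, w, h = box
    strip_h = max(1, int(h * strip_fraction))
    total = 0
    green = 0
    for row in frame[y:y + strip_h]:
        for pixel in row[x:x + w]:
            total += 1
            green += is_green(pixel)
    return total > 0 and green / total >= min_ratio
```

A face box for which `has_green_headband` returns True would then be treated as the educator's face and excluded, leaving only the child's face for the attention analysis.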
To build the ground truth for the attention estimation experiments, some frames were extracted and manually annotated (attention vs. distraction) by two researchers, who separately compiled a record sheet organized by frame. Once the annotation was completed, the discrepancies were resolved via discussion, and the two researchers agreed on the final labels, which were used to build the database for this study's machine learning experiments.
The authors remark that a frame was labeled "Attention" when the child was monitoring the robot, i.e., when the visual focus of attention was on the robot, and "Distraction" otherwise.
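The comparison of the two record sheets can be sketched as follows. This is a minimal illustration, assuming each annotator's sheet is a list with one label per frame; the function names are hypothetical. Cohen's kappa is included only as an optional, commonly used agreement measure and is not reported as part of the procedure described above.

```python
from collections import Counter


def find_discrepancies(labels_a, labels_b):
    """Return the frame indices on which the two annotators disagree."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both sheets must cover the same frames")
    return [i for i, (a, b) in enumerate(zip(labels_a, labels_b)) if a != b]


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators (illustrative extra)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

The indices returned by `find_discrepancies` correspond to the frames that the two researchers would need to discuss before agreeing on the final label.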

Estimating Attention from Low-Resolution Camera Images
Figure 4 presents the methodology used in the study's machine learning experiments. For each step of the approach, the authors tested two alternatives with different expected performance and computational requirements, in order to analyze the trade-off between benefit and computational cost. The following sections present the alternative algorithms tested in our experiments.
The Naïve estimation is an inexpensive algorithm that simply assumes that a child established visual attention if the robot detects the child's face.
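The naïve rule can be written in a few lines. The sketch below assumes that a face detector has already produced, for each frame, a list of bounding boxes; the function names and the data layout are hypothetical, and the accuracy helper is added only to show how the predictions could be compared with the manual labels.

```python
def naive_attention(face_boxes):
    """Label a frame 'Attention' if at least one face was detected in it."""
    return "Attention" if len(face_boxes) > 0 else "Distraction"


def label_frames(detections_per_frame):
    """Apply the naive rule to a sequence of per-frame detection lists."""
    return [naive_attention(boxes) for boxes in detections_per_frame]


def accuracy(predicted, ground_truth):
    """Fraction of frames whose predicted label matches the manual label."""
    correct = sum(p == t for p, t in zip(predicted, ground_truth))
    return correct / len(ground_truth)
```

The quality of this estimate depends entirely on the detector: a detector that also fires on faces turned away from the robot would label those frames as "Attention".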

The Viola-Jones (VJ) Framework for Face Detection
The Viola-Jones framework is a widely used method for object detection and, in particular, for face detection, which was the first application demonstrated in the seminal paper [54]. The reasons for its success are its real-time computation and its robustness (very high true-positive and very low false-positive rates). It has been used as part of the procedure for recognizing the visual focus of attention and its level in human-robot interaction [27].
The method has four key characteristics: (i) feature selection using Haar basis functions; (ii) the creation of an "integral image" that allows for very fast feature evaluation; (iii) AdaBoost classifier training; (iv) the combination of successively more complex classifiers in a cascade structure, which increases the speed of the detector by focusing attention on promising regions of the image.
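To make characteristics (i) and (ii) concrete, the sketch below builds an integral image and uses it to evaluate a two-rectangle Haar-like feature with a constant number of lookups per rectangle. It is a simplified illustration on a grayscale image stored as a list of rows; the function names are hypothetical, and the boosting and cascade stages are not shown.

```python
def integral_image(img):
    """Return a (h+1) x (w+1) table where ii[y][x] = sum of img[0:y][0:x]."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, x, y, w, h):
    """Sum of the pixels in the rectangle at (x, y) of size w x h: 4 lookups."""
    return ii[y + h][x + w] - ii[y][x + w] - ii[y + h][x] + ii[y][x]


def haar_two_rect_horizontal(ii, x, y, w, h):
    """Two-rectangle feature: right half minus left half of the window."""
    half = w // 2
    left = rect_sum(ii, x, y, half, h)
    right = rect_sum(ii, x + half, y, half, h)
    return right - left
```

Because every rectangle sum costs four table lookups regardless of its size, a large number of such features can be evaluated at many positions and scales in real time.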
The rationale for using a "Frontal Face" detector to estimate attention is that the latter is related to the head pose. In fact, the basic implementation of the VJ algorithm does not recognize partial faces, i.e., when the child's face is pointing elsewhere and not looking at the robot. Thus, the VJ approach will recognize a face only when it is oriented toward the robot, which could be interpreted as focusing the visual attention on it.

The Viola-Jones (VJ) Framework for Face Detection
The Viola-Jones framework is a widely used method for object detection and, in particular, for face detection, which was the first application demonstrated in the seminal paper [54]. The reasons for its success are its real-time computation and its robustness (very high true-positive and very low false-positive rates). It has been used as part of the procedure for recognizing the visual focus of attention and its level in human-robot interaction [27].
The method has four key characteristics: (i) feature selection using Haar basis functions; (ii) an "integral image" representation that allows very fast feature evaluation; (iii) AdaBoost classifier training; and (iv) a cascade structure that combines successively more complex classifiers, which increases the speed of the detector by focusing attention on promising regions of the image.
The rationale for using a "frontal face" detector to estimate attention is that attention is related to the head pose. The basic implementation of the VJ algorithm does not recognize partial faces, i.e., cases in which the child's face is turned elsewhere and the child is not looking at the robot. Thus, the VJ approach will recognize a face only when it is oriented toward the robot, which can be interpreted as the child focusing visual attention on it.
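
To illustrate why the integral image makes Haar-like features cheap to evaluate, the following minimal Python sketch builds a summed-area table and uses it to sum any rectangle with four lookups. It is only an illustration of the general technique, not the implementation used in the study; the function names (`integral_image`, `rect_sum`, `two_rect_haar`) and the list-of-lists image format are assumptions made for this example.

```python
def integral_image(img):
    """Return a (h+1) x (w+1) summed-area table for a 2-D list of numbers."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            # Sum of everything above and to the left, inclusive.
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii


def rect_sum(ii, top, left, height, width):
    """Sum of the pixels inside a rectangle, using four table lookups."""
    bottom, right = top + height, left + width
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]


def two_rect_haar(ii, top, left, height, width):
    """Two-rectangle Haar-like feature: left half minus right half.

    `width` is assumed to be even (hypothetical convention for this sketch).
    """
    half = width // 2
    left_sum = rect_sum(ii, top, left, height, half)
    right_sum = rect_sum(ii, top, left + half, height, half)
    return left_sum - right_sum
```

Once the table is built, the cost of each feature does not depend on the rectangle size, which is what allows a cascade to evaluate many candidate windows quickly.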

Faster Convolution Neural Networks with Region Proposal (R-CNN)
State-of-the-art object detection methods include region proposal-based convolutional neural networks such as the popular Faster R-CNN [55], which combines fast prediction with high accuracy by training an auxiliary neural network to make region proposals; these speed up execution by restricting the search area for a Convolutional Neural Network (CNN). Faster R-CNN is composed of two modules: a Region Proposal Network (RPN) and a CNN detector, which can also be a pre-trained object detection network. A four-step alternating training is employed to better integrate the two modules [55].
This study adopted a technique known as "transfer learning" [56], in which pre-trained CNN models are fine-tuned for a specific task using new data. The model fine-tuned in this study's computational experiments is VGG-16 [57], a very deep CNN architecture for image recognition.

Multitask Cascaded Convolutional Networks for Face Detection and Landmark Estimation (MTCNN)
The MTCNN is a deep learning neural network architecture that employs multi-task cascaded CNNs, which are the state-of-the-art for many computer vision applications [58]. The three main characteristics of the MTCNN for performance improvement are: (i) the cascaded CNNs architecture; (ii) an online hard sample mining strategy; and (iii) joint face alignment learning. The last characteristic is of particular interest because facial landmarks could be used to estimate the head pose and, thus, the attention. The MTCNN can outperform state-of-the-art methods across several benchmarks while achieving real-time performance for 640 × 480 VGA images [41]. These are both interesting features for the application being investigated and, therefore, it has been selected for this study's experimental testing.
In this study's method, the coordinates of the landmarks were expressed relative to the bounding box position and rescaled according to its size:

$$x_l = \frac{x_0 - b_1}{b_2 - b_1}$$

where $x_l$ is the landmark coordinate used for classification, $x_0$ is the coordinate in the image, $b_1$ is the top left corner of the bounding box, and $b_2$ is the bottom right corner.
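
The rescaling described above can be sketched in a few lines of Python. The sketch assumes that each coordinate is shifted by the top left corner of the bounding box and divided by the box size along the same axis, so that landmarks fall in the range [0, 1] when they lie inside the box. The function name `normalize_landmarks` and the `(x1, y1, x2, y2)` box layout are hypothetical choices for this example, not taken from the study.

```python
def normalize_landmarks(landmarks, box):
    """Express landmarks relative to a bounding box.

    landmarks: list of (x, y) pairs in image coordinates.
    box: (x1, y1, x2, y2), top left and bottom right corners (assumed layout).
    Returns a list of (x, y) pairs rescaled by the box width and height.
    """
    x1, y1, x2, y2 = box
    width, height = x2 - x1, y2 - y1
    if width <= 0 or height <= 0:
        raise ValueError("bounding box must have positive width and height")
    return [((x - x1) / width, (y - y1) / height) for x, y in landmarks]
```

Normalizing in this way makes the landmark features independent of where the face appears in the frame and of its apparent size, so the classifier sees only the relative arrangement of eyes, nose, and mouth.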

Histograms of Oriented Gradients (HOGs)
Histograms of oriented gradients (HOGs) are the concatenation of histograms obtained by dividing the image into small connected regions called cells. Thus, the local object appearance and shape within an image can be described by the distribution of intensity gradients or edge directions. HOG descriptors are particularly suited for human detection, as it has been shown that they outperform other feature sets in this task [59].
The authors extracted the HOG descriptors from the face image to test whether they improve the attention estimation in this study's approach. Thirty-six HOG features were calculated on the face only, as delimited by the bounding box and rescaled to 20 × 20.
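
As a rough illustration of how a 20 × 20 face crop can yield 36 HOG features, the Python sketch below uses a 2 × 2 grid of cells with 9 unsigned orientation bins each, normalized as a single block. This cell layout is only an assumption that happens to give 36 values; the study does not report its HOG parameters, and the function name `hog_36` is hypothetical. The sketch also omits refinements of full HOG implementations, such as bin interpolation.

```python
import math


def hog_36(img, cells=2, bins=9):
    """Simplified HOG for a square grayscale image given as a list of lists.

    Returns cells * cells * bins values (36 with the defaults), L2-normalized.
    """
    n = len(img)
    cell = n // cells
    hist = [0.0] * (cells * cells * bins)
    for y in range(cells * cell):
        for x in range(cells * cell):
            # Central differences, clamped at the image border.
            gx = img[y][min(x + 1, n - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, n - 1)][x] - img[max(y - 1, 0)][x]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue
            # Unsigned orientation in [0, 180) degrees.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            b = min(int(angle / (180.0 / bins)), bins - 1)
            c = (y // cell) * cells + (x // cell)
            hist[c * bins + b] += mag  # vote weighted by gradient magnitude
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0
    return [v / norm for v in hist]
```

The resulting vector can be passed to a classifier in the same way as the landmark coordinates.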

K-Nearest Neighbor Classifier (K-NN)
The authors trained a classifier using the K-Nearest Neighbor (K-NN) algorithm to distinguish the two statuses "Attention" (visual focus on the robot) and "Distraction" (not monitoring the robot) from either the 5 facial landmarks (centers of the eyes, nose, and edges of the mouth) or the HOG descriptors.
K-NN is a classical computational intelligence technique that is widely used in pattern recognition and data mining applications, since it can be applied to both supervised and unsupervised learning problems [60]. The method is based on the simple but powerful idea that items can be associated by similarity and, therefore, when applied to classification, any new item can be given the label of the "most similar" items in the labeled dataset [61]. This algorithm was selected after preliminary experiments with different classifier methodologies, using a Bayesian optimization process to explore the parameter space and select the best parameters [62]. Indeed, in a ten-fold cross-validation, the K-NN classifier achieved the best classification accuracy, ahead of Support Vector Machines, Decision Trees, and the Naïve Bayes Classifier.
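
The K-NN idea can be sketched in plain Python as follows. This is a minimal illustration with Euclidean distance and majority voting; the value `k=3`, the function names `knn_predict` and `knn_score`, and the use of the neighbor fraction as a score are assumptions for this example, since the study selected its parameters through Bayesian optimization and does not list them here.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training vectors."""
    neighbours = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def knn_score(train_x, train_y, query, k=3, positive="Attention"):
    """Fraction of the k nearest neighbours that carry the positive label."""
    neighbours = sorted(
        (math.dist(x, query), label) for x, label in zip(train_x, train_y)
    )[:k]
    return sum(1 for _, label in neighbours if label == positive) / len(neighbours)
```

A continuous score such as the one returned by `knn_score` is what allows a discrimination threshold to be varied when an ROC curve is drawn.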

Performance Measures
The aim of the computational experiments was to validate the deep learning approaches and compare them to a standard approach for the evaluation of attention and distraction in human-robot interaction. Given that both conditions are important, the authors considered, along with the classic classification accuracy, several measures that focus on the attention and distraction conditions, as detailed in Table 1. In the table, P is the number of positive cases, i.e., frames with the attention condition; N is the number of negative cases, i.e., frames with the distraction condition; TP is the number of true positives (correct classification of attention); TN is the number of true negatives (correct classification of distraction); FP is the number of false positives (distraction that has been classified as attention); and FN is the number of false negatives (attention that has been classified as distraction).

Accuracy: $\frac{TP + TN}{P + N}$
Accuracy is the proximity of the classification results to the true values. It evaluates the overall performance of classification.

Precision: $\frac{TP}{TP + FP}$
Precision is the positive predictive value. In this case, it indicates the reliability of the classification of attention.

Negative Predictive Value: $\frac{TN}{TN + FN}$
The negative predictive value indicates the reliability of the classification of distraction.

Sensitivity: $\frac{TP}{P}$
Also known as Recall, Hit Rate, or True Positive Rate, Sensitivity focuses only on how good the performance in classifying attention is.

Specificity: $\frac{TN}{N}$
Opposite to Sensitivity, Specificity, or True Negative Rate, evaluates only the performance in classifying distraction.
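
The measures of Table 1 follow directly from the four confusion counts, as the short Python sketch below shows. The function name `classification_measures` and the dictionary keys are hypothetical, and the counts used in the usage note are invented for illustration only; they are not results from the study.

```python
def classification_measures(tp, tn, fp, fn):
    """Compute the Table 1 measures from confusion counts.

    Positive = "Attention", negative = "Distraction".
    """
    p = tp + fn  # all frames labelled as attention
    n = tn + fp  # all frames labelled as distraction

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "accuracy": ratio(tp + tn, p + n),
        "precision": ratio(tp, tp + fp),
        "negative_predictive_value": ratio(tn, tn + fn),
        "sensitivity": ratio(tp, p),
        "specificity": ratio(tn, n),
    }
```

For example, with the made-up counts `tp=40, tn=30, fp=10, fn=20`, the function returns an accuracy of 0.7, a precision of 0.8, and a specificity of 0.75.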
A higher value indicates a better performance for all the measures considered in the comparisons. The authors also calculated the Receiver Operating Characteristic (ROC) curve, which is a graph that illustrates the performance of a binary classifier by plotting Sensitivity (true positive rate) against the false positive rate (FP/N) at various threshold settings. The points of the ROC curve are calculated from the K-NN classification scores (group similarities) by varying the discrimination threshold. A measure for model comparison is the Area Under the Curve (AUC), which, in the case [...]
Quite the opposite, the Naïve classification of R-CNN detections shows a better sensitivity, i.e., correct classification of attention, but achieves the worst overall performance as it overestimates attention by detecting partial faces. Results for the Naïve R-CNN are in Table 5. Table 6 reports a significant improvement obtained by applying the K-NN classification to the HOGs extracted from the R-CNN detections. This can be explained by the fact that the R-CNN detects more faces (true positives) than the VJ algorithm thanks to the ad-hoc training. Meanwhile, the K-NN classification stage reduces the false positive detections and therefore improves the overall performance.
The performance of the MTCNN is reported in Tables 7 and 8 and is similar to that of the R-CNN. The version with the landmarks has the best sensitivity and the best reliability in the prediction of the negative status, i.e., distraction. However, this is achieved in practice by overestimating attention, which results in many false positives and the worst specificity (Table 7). This performance is too low and cannot be considered reliable enough for a real application. Finally, Table 8 shows the best overall result in terms of all the performance metrics considered.
To summarize the results and show a direct comparison, Figure 5 presents the ROC curves for all the approaches considered. The AUCs are: VJ-Naïve 0.7382; VJ-HOG-K-NN 0.7926; MTCNN-Landmarks-K-NN 0.8681; MTCNN-HOG-K-NN 0.9314; RCNN-Naïve 0.6192; RCNN-HOG-K-NN 0.9128. The AUC values are consistent with the other metrics, as they confirm that the classification of HOGs with the K-NN is the best approach and that the MTCNN is the most effective algorithm for detecting the faces. However, AUC values give a biased ranking because this measure is based on the true and false positives, i.e., the algorithms that achieve a better sensitivity are at an advantage over the others, while the authors remark that in this field of application true negatives (i.e., "Distraction") are also important.
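
For readers who wish to reproduce this kind of comparison, the following Python sketch shows one common way to obtain ROC points and a trapezoidal AUC from classification scores. It is a generic illustration and not the procedure used to produce the values above; the function names `roc_points` and `auc` are hypothetical, labels are assumed to be 1 for "Attention" and 0 for "Distraction", and the false positive rate is computed as FP divided by the number of negative frames.

```python
def roc_points(scores, labels):
    """Return ROC points (false positive rate, true positive rate).

    scores: higher means more likely "Attention".
    labels: 1 for "Attention", 0 for "Distraction".
    """
    p = sum(labels)
    n = len(labels) - p
    if p == 0 or n == 0:
        raise ValueError("both classes must be present")
    pairs = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Frames sharing the same score cross the threshold together.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n, tp / p))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

With binary scores, as produced by a Naïve classifier, the curve reduces to a single operating point joined to the corners (0, 0) and (1, 1) by straight lines.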

Computational Execution
For further evaluation, the authors also considered the computational execution of the approaches included in the numerical comparisons. Table 9 presents the performance evaluation of the methods. Experiments were run on a workstation equipped with an Intel Xeon CPU E5-2683 v4 at 2.10 GHz and an NVIDIA Tesla K40. Provided that the images can be transferred to the workstation quickly enough, the Viola-Jones algorithm is confirmed to be the quickest at processing the frames from the robot, while the R-CNN and MTCNN achieve decent performance only when accelerated with a GPU, as was done in these experiments.
Note that, for the purpose of this application, even the lowest rate of 2 frames per second can be enough to estimate the behavior of the children, which changes in the span of seconds.

Discussion and Conclusions
Robot-assisted therapy is a promising field of application for intelligent social robots. However, most of the studies in the literature focus on ASD individuals without ID or neglect to analyze comorbidity. Indeed, very little has been done in this area, and it can be considered one of the current gaps between scientific research and clinical application [7,64]. In the clinical context of ASD with ID, the aim is to use socially assistive robots to provide assistance to the therapist and, consequently, to reduce the workload by allowing the robot to take over some parts of the intervention. This includes monitoring and recording the child's activities, proactively engaging the child when he/she is distracted, and adapting the robot behaviors according to the levels of intervention for every child on an individual basis [18].
To this end, computational intelligence techniques should be used to increase the robot's capabilities, favoring the greater adaptability and flexibility that can allow the robot to be integrated into any therapeutic setting according to the specific needs of the therapist and the individual child.
This article describes a step forward in this direction: the authors tackled the problem of estimating the child's visual focus of attention from the low-resolution video recordings of the robot camera. To investigate the applicability in a clinical setting, the authors created a database of annotated video recordings from a clinical experiment and compared several computer vision approaches, including popular deep neural network architectures for face detection combined with HOG feature extraction and a K-NN classifier.
The results show that the approaches based on CNNs can significantly outperform the benchmark algorithm only if the detection is corrected using HOG features and a classifier to adjust the attention status estimation. Overall, the approach that achieves the best result is the one that uses the MTCNN to identify faces. The MTCNN achieved the highest accuracy on the test set, 88.2%, and it could be used to adapt the robot behavior to the child's current attention status, even though the computational requirements of CNNs demand that a proper workstation be attached to control the robot. However, thanks to the introduction of ultralow-power processors as accelerators for these architectures [65], the authors hypothesize that these could empower the robots and allow them to perform vision tasks such as estimating the attention focus reliably and in real time, and to use this information to personalize the interaction. Despite the good result, a deeper analysis of the results for each child shows some cases in which the performance was poor, which suggests the need for a careful review by experts whenever the automated attention estimation is used for diagnosis. It should be noted that the researchers who analyzed the videos did not initially give the same label to some frames, and they had to discuss these frames to reach an agreement when creating the final benchmark. These reasons, along with the low resolution, might explain the modest performance of some of the computational intelligence approaches.
However, it should be noted that the visual focus of attention is only one of the components of social attention, and the human labels were also influenced by other factors, for example whether the child's posture, behavior, and actions were coherent with the task, the robot, and the environment. This study therefore estimates social attention from only one of its components: the focus of visual attention on the interaction partner.
Future work should focus on refining the algorithms and, moreover, on increasing the hardware support for them. Other cues, such as the adherence of the child's behavior to the robot's prompts, should be evaluated and considered to refine the classification in the case of social attention evaluation.

Figure 1. The Nao robot. The camera used for recording the child activity is the one on the top (circled in red).

Figure 2. Example of the child-robot interaction during the therapeutic session (the child is imitating the robot moving his arm). A professional educator was always present nearby to support the child.

Figure 3. A frame extracted from one of the videos recorded by the robot (the child is imitating the robot raising his arms up).

Figure 4. The steps of the study approach. For each step, some alternative approaches were tested.

Figure 5. Receiver Operating Characteristic (ROC) curves calculated from the K-NN classification scores with varying discrimination thresholds. Note that Naïve classifiers produced only binary (0 or 1) scores.

Table 1. Measures for evaluating classification performance.

Table 3. Results of the Viola-Jones with Naïve classification.

Table 4. Results of the Viola-Jones with histograms of oriented gradients (HOGs) and K-Nearest Neighbor (K-NN) classification.

Table 5. Results of the Faster Convolution Neural Networks with Region Proposal (R-CNN) Naïve classification.

Table 6. Results of the R-CNN with HOGs and K-NN classification.

Table 7. Results of the Multitask Cascaded Convolutional Networks (MTCNN) Landmarks classification via K-NN.

Table 8. Results of the MTCNN with HOGs and K-NN classification.

Table 9. Evaluation of the execution time and maximum frames per second (fps) at 640 × 480.