The face conveys very rich information that is critical in many aspects of everyday life. Face appearance is the primary means of identifying a person. It plays a crucial role in communication and social relations: a face can reveal age, sex, race, and even social status and personality. In addition, skilled observation of the face is relevant in the diagnosis and assessment of mental or physical diseases. The facial appearance of a patient may indeed provide diagnostic clues to the illness, the severity of the disease, and some of the patient’s vital values [1
]. For this reason, since the beginning of studies on automatic image processing, researchers have investigated the possibility of automatically analyzing the face to speed up the related processes and make them independent of human error and of the caregiver’s skill level, but also to build new assistive applications.
One of the earliest and most investigated topics in the computer vision community, which is still quite active today, is face detection: its primary goal is to determine whether there are any faces in the image and, if so, where the corresponding image regions are. Several new methods have emerged in recent years and have improved the accuracy of face detection to the point that it can be considered a solved problem in many real applications, even though the detection of partially occluded or unevenly illuminated faces is still a challenge. The most advanced approaches for face detection have been reviewed in [3
Face detection is the basic step for almost all the algorithmic pipelines that in some way aim at analyzing facial cues. The subsequent computer vision approaches involved in face-related algorithmic pipelines are instead still under investigation, and details about recent advancements can be found in several outstanding survey papers on face analysis from the technological point of view. They cover algorithmic approaches for biometric identification [5
] (even in the presence of plastic surgery tricks [7
], occlusions [8
], or distortion, low resolution, and noise [9
]), facial muscle movement analysis [10
], and emotion recognition [11
Looking closely at the works in the literature, it is possible to identify three different levels on which methodological progress moves forward. The first level, which evolves very fast and has therefore produced solutions that reach outstanding accuracy and robustness on benchmark datasets, concerns theoretical research. It mainly deals with the study and introduction of novel neural models, more effective training strategies, and more robust features. At this level, classical classification topics such as object recognition [12
] are addressed. There are several hot topics at this level, but the most relevant for the scope of this paper are few-shot learning [17
], advanced transfer learning [18
], automatic data augmentation [19
], prototypical class learning [20
], adaptive integration of local features with their global dependencies [21
], a better understanding of CNN behavior to discover how to build more spatially efficient and better-performing architectures [22
], and exploiting spatio-temporal dynamics [23
]. The introduction of new challenging datasets for more comprehensive and unbiased comparisons [24
] is an additional hot topic, whereas the most pioneering academic research addresses unconventional problems such as face recognition in the presence of disguise variations [25
The second level, namely, applied research, tries instead to leverage theoretical findings to solve more specific, but still cross-contextual, issues such as robust facial landmarks detection [26
], facial action unit estimation [27
], human pose estimation [28
], anomaly detection in video sequences [29
], and so on. Finally, the third level involves on-field research that leverages the outcomes of theoretical and applied research to solve contextual issues, i.e., those related to healthcare, autonomous driving, sports analysis, security, safety, and so on. In context-related research, technological aspects are only a part of the issues to be addressed in order to obtain an effective framework. Often, domain-specific challenges have to be addressed by a multidisciplinary team of researchers that has to find the best trade-off between domain-related constraints and available technologies. This is even more true in the healthcare scenario, as the deployment has to take into account how the final users (i.e., medics, caregivers, or patients) will use the technology and, to do that, clinical, technological, social, and economic aspects have to be weighed [30
]. For instance, recent face analysis systems (e.g., those that perform facial emotion recognition) have reached outstanding accuracy by exploiting deep learning techniques. Unfortunately, they have been trained on typically developed persons and cannot be used as supplied to evaluate the ability to perform facial expressions in the case of cognitive or motor impairments. In other words, existing approaches may require re-engineering to handle specific tasks involved in healthcare services. This has to be carried out by drawing on life science knowledge as well as biological, medical, and social background [31
]. At the same time, the demand for smart, interactive healthcare services is increasing, as several challenging issues (such as accurate diagnosis, remote monitoring, and cost–benefit rationalization) cannot be effectively addressed by established stakeholders [32
]. From the above, it emerges that it would be very useful to summarize works in the literature that, by exploiting computer vision and machine learning, address specific issues related to healthcare applications. This paper is motivated by the lack of similar works in the literature, and its main goal is to fill this gap. In particular, the main objectives of this survey are
to give an overview of the cutting-edge approaches that perform facial cue analysis in the healthcare area;
to find the critical aspects that rule the transfer of knowledge among academic, applied, and healthcare research;
to pave the way for further research in this challenging domain, starting from the latest exciting findings in machine learning and computer vision; and
to point out benchmark datasets specifically built for the healthcare scenario.
The document is not limited to global face analysis; it also concentrates on methods related to local cues. A research taxonomy is introduced by dividing the face into its main features: eyes, mouth, muscles, skin, and shape. For each facial feature, the computer vision-based tasks aiming at analyzing it and the related healthcare goals that could be pursued are detailed. This leads to the scheme in Figure 1
From Figure 1
, the organization of the rest of the paper arises. In each section, one of the listed computer vision tasks is addressed with reference to the healthcare issues involved. Accordingly, the rest of the paper is organized as follows. Section 2
reports studies concentrating on the analysis of the eye region for gaze tracking purposes, Section 3
gives an overview of research exploiting automatic facial expression analysis and emotion recognition, Section 4
presents the state of the art in soft/hard biometrics, Section 5
analyzes strategies for extracting vital parameters from images framing an individual, and finally Section 6
points out applications involving visual speech recognition and animation. Section 7
provides directions for further improvements and concludes the paper.
2. Eye Analysis
Eye movements play a crucial role in an individual’s perception of and attention to the visual world [33
]; consequently, non-intrusive eye detection and tracking have been investigated for decades in the development of human–computer interaction [34
], attentive user interfaces [35
], or cognitive behavioral therapy [36
]. Eye-tracking is the measurement of eye movement/activity and gaze (point of regard) tracking is the analysis of eye tracking data with respect to the head/visual scene [37
], and they have systematically been employed in healthcare applications [38
The detection and analysis of eye movements have recently reached maturity thanks to convolutional neural networks, which have also made computer vision-based methods very effective. The subsequent analysis of eye tracking data in the healthcare domain is instead still an open issue, and it has therefore been a very active research topic in recent decades. This section first highlights recent achievements in applied research concerning eye movements and gaze estimation, and then it focuses on on-field research in the healthcare domain.
With the software iTracker [39
], CNNs have been employed to achieve eye tracking at 10–15 fps while running on commodity hardware such as mobile phones and tablets. A Tolerant and Talented (TAT) scheme has also been employed to improve performance on tablets and smartphones in [40
]. In particular, TAT consists of knowledge distillation from randomly selected teachers, with the aim of removing ineffective weights and giving the pruned weights (by suitably using cosine similarity) another direction in the optimization process. Finally, a Disturbance with Ordinal (DwO) scheme generates adversarial samples, enhancing network robustness. The possibility of inferring gaze in natural environments has been investigated in [41
]. The authors proposed an appearance-based CNN solution that works in real time, as well as a challenging dataset with both gaze and head pose information, using a motion capture system and mobile eye tracking glasses to extract ground truth data. In [42
], a CNN is designed to extract features from all frames, which are then fed to a many-to-one recurrent module that predicts the 3D gaze vector of the last frame, outperforming previous results on the EYEDIAP [43
] dataset. Conditional Local Neural Fields (CLNF) have been introduced in [44
], where the network can provide a full facial behavior analysis. The rest of this section will introduce the recent outcomes in the healthcare domain.
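As an illustration of how a predicted 3D gaze vector is commonly compared against ground truth, the following minimal sketch computes the angle between two gaze directions. It is not taken from any of the cited works: the function name `angular_error_deg` and the convention of representing gaze as a 3-element tuple are assumptions made here for illustration only.

```python
import math

def angular_error_deg(pred, truth):
    """Angle in degrees between two 3D gaze vectors (need not be unit length).

    Hypothetical helper for illustration; not an API of any cited system.
    """
    dot = sum(p * t for p, t in zip(pred, truth))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in truth))
    if norm_p == 0.0 or norm_t == 0.0:
        raise ValueError("gaze vectors must be non-zero")
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_angle = max(-1.0, min(1.0, dot / (norm_p * norm_t)))
    return math.degrees(math.acos(cos_angle))

# Example: a prediction slightly off a straight-ahead gaze direction.
error = angular_error_deg((0.05, 0.0, -1.0), (0.0, 0.0, -1.0))
```

Averaging this quantity over all test frames gives a single figure that can be compared across methods, provided the same dataset and evaluation protocol are used.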
A temporal analysis of fixation data, which uses eye tracking to understand group and individual patterns, has been proposed in [45
]. The authors used the proposed system to investigate emotion regulation by studying attention to different segments of a video among different age groups, emphasizing the importance of temporal patterns. Variation in eye gaze after sad mood induction in previously depressed and never depressed women has been studied in [46
], and this information has been fused with head pose and speaking behavior to detect depression [47
]. The possibility of tracking human gaze in an unconstrained environment for assistive applications has been proposed in [48
]: the authors employed an RGB-D device and a head pose estimation algorithm, proposing the system for remote device control, as a rehabilitation device, and as an aid for people with neurological impairments. A pilot study showing the potential of eye tracking for enhancing debriefing and educational outcomes has been proposed by [49
], which also highlights the open challenges and the high costs of operating in real environments. In [50
], eye tracking has been employed for diagnosing strabismus; moreover, it has been employed to detect disconjugate eye movements in the case of structural traumatic brain injury and concussion [51
], and in mild traumatic brain injury [52
]. Oculomotor abnormalities as a biomedical marker for stroke assessment have also been investigated [53
]. In [54
], eye tracking has been combined with video debriefing techniques in simulated learning scenarios, first to improve the quality of feedback and second to determine the satisfaction of students with the system. Other ocular wearable sensors, such as contact lenses [55
] and egocentric vision sensors [56
] have also been widely employed. In [57
], the use of smart glasses is investigated in different cases, i.e., as a viewer of information and as a source of medical data and healthcare information, showing that smart glasses can measure the vital signs of the observed patient reliably enough for medical screening. A system for supporting the daily living of a user has been proposed in [58
], organizing the data acquired by the user over different days using an unsupervised segmentation. In [60
], the doctor’s head position is estimated and tracked in order to project augmented reality content onto the patient’s body surface.
As expected, recent advances in machine learning have had many implications for healthcare systems, also leading to new applications that try to solve problems and challenges that had not been addressed in the state of the art until the past few years. An example is the work proposed in [61
], where CNNs have been employed for the first time to predict the user’s knowledgeability from their eye gaze. In [62
], a learning-based approach has been applied in egocentric videos to detect engagement. A system that incorporates the gaze signal and the egocentric camera of an eye tracker to identify the objects that the user focuses on has been proposed in [63
]. In particular, deep learning is used to classify objects and to construct episodic memories of egocentric events in real time whenever the user directs attention to an object.
Many works have also been proposed in the field of Autism Spectrum Disorder (ASD). Although it is difficult to collect and summarize the whole literature on gaze estimation for ASD, it is possible to outline the interest it has received from the medical and scientific communities. First of all, it is believed that the focus of attention in the scene is fundamentally different for individuals who have autism compared with typical controls, in particular for socially relevant information and processing of faces [64
]. Moreover, autism causes social attention impairments and deprivation of social information input during infancy and preschool development, further disrupting normal brain and behavioral development [66
]; this cycle acts as a self-reinforcing loop with negative effects, with the consequence of affecting the whole social development of the individual. Thus, it is not surprising that eye tracking has been evaluated in social attention analysis and in triadic interaction with an object and the therapist. In [67
], an interface that uses automatic video analysis to support and guide human judgment of social attention during the assessment is proposed. In [68
], low-cost computer vision tools to measure and identify ASD behavioral signs have been proposed and evaluated. Gaze estimation as a tool to analyze visual exploration of a closet containing toys in children with ASD has been proposed in [69
]. In this work, the gaze trajectories of the child are integrated for the first time into an Early Start Denver Model (ESDM) program built on the child’s spontaneous interests and game choice and delivered in a natural setting.
In this specific domain, too, recent advances in deep learning have been integrated. In [70
], a deep learning framework that estimates levels of the child’s affective states and engagement by multimodal sensor fusion is proposed. A computer vision-based pipeline for the automatic and quantitative screening of ASD has been proposed in [71
], integrating multiple modalities for the assessment. People are classified using a photo-taking task during free exploration, and the analysis is based on the user’s attention. Temporal information in eye movements is also integrated, outperforming the state of the art on the Saliency4ASD [72
] dataset. In [73
], a deep learning model for human action recognition is integrated to automate the response measurement for screening, diagnosis and behavioral treatment for ASD.
In the last few years, the analysis of the gaze/face interaction of ASD children with social robots has become a very important research topic [74
], with the aim of providing an analysis of joint attention [77
], face-to-face interaction [78
], and joint attention with the therapist in a triadic interaction [79
A comparison of the results obtained by computer vision-based works in the healthcare domain is provided in Table 1
. It can be observed that, although the gap between the techniques employed for healthcare applications and state-of-the-art performance is not very large, further research is still necessary to incorporate the latest research outcomes; moreover, validation on benchmark datasets and a uniform evaluation method are still very desirable, as they are often missing.
In fact, regarding the latter, eye analysis techniques can be benchmarked on the numerous existing datasets, often acquired with a professional and calibrated eye tracker. Among them, it is worth mentioning the following.
The USC eye-1 [88
] dataset, designed to analyze the role of memory in visual interaction.
The dataset presented in [89
], containing behavioral and pupil size data from non-diagnosed controls and ADHD-diagnosed children performing a visuospatial working memory task.
], made of eye movements of 14 children with Autism Spectrum Disorder (ASD) and 14 healthy controls, with the aim of evaluating specialized models to identify the individuals with ASD.
Self-Stimulatory Behaviour Dataset (SSBD) [90
], designed for the automatic behavior analysis in uncontrolled natural settings.
Multimodal Dyadic Behavior Dataset [91
], containing 160 sessions of 3–5 min semistructured play interaction between a trained adult examiner and a child aged 15–30 months. Each session aims at eliciting social attention, back-and-forth interaction, and non-verbal communication.
Note that such datasets focus on data about gaze points on a target, saccade movements, target description, and clinical information of the users, without recording the visual information of the eye region. This implies that computer vision-based methods must often reproduce the experiment described in the datasets, as they cannot use the data directly. In many cases this is done, but the data collected for the healthcare task are not made publicly available. This gap could be filled by new publicly available datasets that include eye tracker, clinical, and RGB/RGB-D data but, for the healthcare domain, this is a missing piece in the literature.
3. Facial Expression
The ability to effectively communicate emotion is essential for adaptive human function. Of all the ways that we communicate emotion, facial expressions are among the most flexible: their universality allows us to rapidly convey information to people of different ages, cultures, and languages. Computer vision has reached very high accuracy in the automatic recognition of facial expressions and, more generally, in the behavioral analysis of gestures (i.e., facial muscle activity). A modern and comprehensive taxonomy of computer vision approaches can be found in [92
From the literature review, it clearly emerges that existing approaches suffer when used in the wild because challenging conditions, such as large inter-personal variations in performing the same expression, accessories and appearance (e.g., glasses, mustache, and haircut), and variations in pose and illumination, make it harder to perform sub-tasks, especially face alignment. To quantify the performance decrease in the recognition of facial expressions in video when moving from constrained acquisition conditions to unconstrained ones, it must be considered that it drops from
(using a deep learning algorithm that incorporates face domain knowledge to regularize the training of an expression recognition network [94
]) of recognition of 8 expressions (Anger, Disgust, Fear, Happiness, Neutral, Sadness, and Surprise) on the CK+ dataset [95
of the SFEW 2.0 dataset [96
], where the best performance so far has been achieved by a complex framework involving multiple deep CNNs and several learning strategies [97
]. To improve performance, multimodal (audio and video) feature fusion frameworks that can continuously predict emotions have recently been introduced [98
], but, of course, synchronized audio is not always available. It follows that the definition of facial expression recognition (FER) modules that work effectively in a healthcare scenario is still an open research issue.
In particular, healthcare frameworks that include an emotion or expression recognition module have been introduced to provide suitable solutions for the following.
Ubiquitous healthcare systems
Computational diagnosis and assessment of mental or facial diseases
Ubiquitous healthcare systems provide person-centered and integrated care especially suited for long-term care as they also achieve emotional and psychological cognition for human beings. These application scenarios are spreading rapidly and their development has recently been further accelerated by the development of architectures based on 5G communication technologies [100
]. In the e-Healthcare framework proposed in [101
], images are acquired by a smart device (a smartphone or any installed camera) and transmitted to a cloud along with the medical data for further processing. There, a cloud manager first authenticates the user and then sends the face image data to the emotion detection module. The emotion information is subsequently sent to the appropriate healthcare professionals. As a practical follow-up, if the detected emotion is not positive (e.g., pain), caregivers can visit the patient. The maximum classification accuracy on a proprietary dataset was 99.8%, but only three classes (normal, happy, and pain) were considered. The machine learning pipeline is, in fact, not suitable for properly managing a greater number of classes, given that the features of the acquired facial images are extracted by using local binary patterns and, according to a traditional scheme that dates back to 2005 [102
], Support Vector Machines are exploited for classification. The drawbacks of the above processing scheme are also highlighted in [103
] where a satisfaction detection system is presented as part of a smart healthcare framework. As customer satisfaction (of users and patients) is an important goal for the smart healthcare business, a smart home is equipped to capture signals from the users. These signals are processed in a cloud server, and a cloud manager then sends the result to the stakeholder. The results gathered on a dedicated dataset, collected by involving 40 male students, were not convincing (best accuracy of 78% on three classes: satisfied, unsatisfied, or indifferent), demonstrating that more sophisticated classification approaches are required in this domain.
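To make the feature extraction step of such traditional pipelines concrete, the following minimal sketch computes a histogram of local binary pattern (LBP) codes from a grayscale image. It assumes the basic 8-neighbor, 3×3 LBP variant and a plain list-of-rows image representation; the cited works may use a different LBP variant, and the function name `lbp_histogram` is hypothetical. The resulting 256-bin vector is the kind of descriptor that would then be passed to a classifier such as an SVM, which is not shown here.

```python
def lbp_histogram(image):
    """Normalized 256-bin histogram of basic 3x3 LBP codes.

    `image` is a list of rows of grayscale intensities; border pixels are
    skipped. Illustrative sketch only, not the cited authors' implementation.
    """
    # Eight neighbours visited clockwise, starting from the top-left corner.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    rows, cols = len(image), len(image[0])
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            center = image[r][c]
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # A neighbour at least as bright as the centre sets its bit.
                if image[r + dr][c + dc] >= center:
                    code |= 1 << bit
            hist[code] += 1
    total = sum(hist)
    return [h / total for h in hist] if total else hist

# Example: a tiny synthetic patch with a single bright interior pixel.
patch = [[10, 10, 10],
         [10, 50, 10],
         [10, 10, 10]]
features = lbp_histogram(patch)
```

Because each code depends only on the sign of local intensity differences, the descriptor is insensitive to monotonic illumination changes, but it captures little of the global facial configuration, which is consistent with the limited number of classes such pipelines handle well.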
Facial expressions also play a relevant role in the case of diagnosis or assessment of cognitive impairments (e.g., autism and schizophrenia). In [104
], a complex pipeline is introduced, and tests on a large number of adults and children, with and without autism spectrum disorders, are reported. The pipeline is able to quantify in a personalized manner the patient’s ability to perform four basic expressions and to monitor improvements over time. The authors exploited the Convolutional Experts Constrained Local Model (CE-CLM) for facial landmark localization and the concatenation of dimensionality-reduced HOGs and facial shape features (from CE-CLM) for action unit intensity prediction. In addition, a novel statistical approach is used to regularize estimations on the basis of geometrical and temporal constraints. A proprietary dataset (27 children with and without autism spectrum disorders) was used, and the comparison with annotations provided by experts demonstrated an average precision of about 90% in recognizing correctly executed facial expressions. Facial landmark detection for atypical 3D facial modeling in facial palsy cases has been investigated in [106
]. Such modeling can potentially assist medical diagnosis using atypical facial features (e.g., an asymmetrical face). A face alignment network with a stacked hourglass architecture and a residual block was shown to be a high-performing method (in terms of normalized mean error) for landmark localization on unseen atypical faces recorded in a proprietary dataset of 87 subjects.
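The normalized mean error mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the choice of normalizing distance (for example, inter-ocular distance or face bounding-box size) is not specified in the text, so it is passed in as a parameter, and the function name `normalized_mean_error` is hypothetical.

```python
import math

def normalized_mean_error(pred, truth, norm_dist):
    """Mean Euclidean landmark error divided by a normalizing distance.

    `pred` and `truth` are equal-length lists of (x, y) or (x, y, z) points.
    `norm_dist` is an assumed normalizer chosen by the evaluator.
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("landmark lists must be non-empty and equal in length")
    if norm_dist <= 0:
        raise ValueError("normalizing distance must be positive")
    total = sum(math.dist(p, t) for p, t in zip(pred, truth))
    return total / (len(pred) * norm_dist)

# Example: two landmarks, one exact and one 5 pixels off, normalized by 10.
nme = normalized_mean_error([(0, 0), (3, 4)], [(0, 0), (0, 0)], 10.0)
```

Dividing by a face-dependent distance makes the error comparable across images of different resolution, but results obtained with different normalizers are not directly comparable.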
Patient pain can be detected highly reliably from facial expressions using a set of facial muscle-based action units. Automated detection of pain would be highly beneficial for efficient and practical pain monitoring. In the healthcare domain, pain monitoring can be exploited to provide effective treatment and ultimately to relieve patient pain (e.g., in fibromyalgia patients) [107
]. The most up-to-date approach for detecting pain [108
] makes use of a generic AU detector based on Gabor filters and an SVM classifier, coupled with a Multiple Instance Learning (MIL) framework that treats pain detection as a weakly supervised learning problem in a low-dimensional feature space. Experimental results show an 87% pain recognition accuracy with 0.94 AUC (Area Under Curve) on the UNBC-McMaster Shoulder Pain Expression dataset.
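The weakly supervised setting and the AUC figure can be illustrated with a small sketch. It assumes the standard MIL convention that a video sequence (a bag) is positive if at least one of its frames (instances) is positive, implemented here with max pooling over frame scores; this is a generic stand-in and not necessarily the aggregation used in the cited approach. The names `bag_score` and `roc_auc` are hypothetical, and the scores are made-up illustrative values.

```python
def bag_score(instance_scores):
    """Standard MIL assumption: a bag is as positive as its most positive instance."""
    return max(instance_scores)

def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative bag")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Example: per-frame pain scores for four sequences, with sequence-level labels only.
sequences = [[0.1, 0.2, 0.9], [0.3, 0.1], [0.2, 0.35], [0.4, 0.05]]
labels = [1, 0, 1, 0]
auc = roc_auc([bag_score(s) for s in sequences], labels)
```

Only sequence-level labels are needed for training and evaluation in this setting, which is what makes it attractive when frame-level pain annotation by clinicians is too costly.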
In addition to supporting the diagnosis and assessment of psychological and mental problems, modules for the automatic recognition of facial expressions are also of great help when technological rehabilitation frameworks are used. It has been shown that communication between humans and computers benefits from sensor-based emotion recognition, as humans feel uncomfortable when emotions are absent [109
]. For instance, such modules have been used during robot–ASD children interactions aimed at teaching young autistic patients by imitation, making possible an objective evaluation of the children’s behaviors [110
] and then giving the possibility to introduce a metric about the effectiveness of the therapy [111
]. As part of smart environments, a facial expression module can be used to recognize people’s emotions from their facial expressions and to react in a friendly manner according to the users’ needs [112
summarizes the computer vision techniques involved in the healthcare-related systems. From left to right: the first column reports the referenced works, the second and third columns indicate the techniques used to extract the features and to classify the data, respectively, the fourth column refers to the benchmark dataset used in the literature to validate the computer vision technique, the fifth column reports the performance of the technique on the dataset, and the rightmost column reports the performance on the same dataset of the best technique in the state of the art.
From the literature overview, it emerges that the analysis of the face aimed at medical and health applications is still at an embryonic stage. There is indeed great untapped potential linked to the latest computer vision and machine learning methods, which are currently confined to the academic sector. It can easily be observed that healthcare applications often exploit approaches that are not state of the art, perhaps because they are ready to use. In fact, bringing the best-performing approaches into applications requires a lot of time, which researchers often prefer to spend on designing and implementing experiments that involve the recruitment of people and the involvement of specialists with multidisciplinary skills. This is a relevant drawback that has to be addressed, especially with regard to the analysis of the face, whose complex structure requires advanced approaches, ideally even able to detect micro-movements of facial muscles, so as not to invalidate the entire experimental architecture with unreliable image/video data processing. Deep learning-based end-to-end approaches could fix this crucial issue, but they require annotated data that medical staff are often unable to provide due to the subjectivity of annotation and the complexity of the images. This leads data scientists to adapt existing computational models (through transfer learning with either domain adaptation or task adaptation, or even by going back to handcrafted features). Some examples of specific benchmark datasets already exist: (1) iCOPEvid [123
] for infant classification of pain expressions in videos, (2) Emopain [124
] and UNBC-McMaster [125
] for adult classification of pain expressions in videos, (3) AVEC 2019 [126
] for detecting depression. Nevertheless, a strong effort is required to provide larger-scale datasets that can speed up research on facial expression recognition in videos for healthcare purposes by exploiting the end-to-end training methods provided by academic and applied research studies.
4. Soft/Hard Biometrics
Biometrics have been employed with success in several healthcare fields, starting from socially assistive technologies, where they improve, for example, the level of human–machine interaction in applications for autistic individuals [127
], as well as people with dementia [128
] and, generally, for elderly care [129
Human–Robot Interaction (HRI) for Socially Assistive Robotics (SAR) is a new, growing, and increasingly popular research area at the intersection of a number of fields, including robotics, computer vision, medicine, psychology, ethology, neuroscience, and cognitive sciences. New applications for robots in health and education have been developed for a broad population of users [130
]. In these application fields, the level of realism is a key factor that can be substantially increased by the introduction of biometrics, as this gives the robot the possibility to change its behavior depending on the observed peculiarities of the interacting individual. In this way, traditional applications in the field of socially assistive robotics, such as interaction with autistic children, considering their well-known interest in computers and electronic devices [131
], as well as people in rehabilitation in cases of dementia [128
] or post-stroke [133
], and generally for elderly care [129
], could benefit, and their level of acceptance by the involved individuals could be improved. In addition, biometrics could be used to enable the robot to autonomously start a specific task, thus increasing the level of realism of the interaction perceived by the user.
], soft biometrics are defined as the set of all those characteristics that provide some information about the individual but lack the distinctiveness and permanence to sufficiently differentiate any two individuals. Soft biometric traits can be either continuous (e.g., height and weight) or discrete (e.g., gender, eye color, ethnicity, etc.). The term hard biometrics, on the other hand, denotes all those characteristics by means of which two individuals can be perfectly differentiated, such as the visual features describing facial traits used to perform face recognition tasks.
In the last few decades, computer vision, as well as other information science fields, has largely investigated the problem of automatically estimating the main soft biometric traits by means of mathematical models and ad hoc coding of visual images. In particular, the automatic estimation of gender, race, and age from facial images is among the most investigated issues, but there are still many open challenges, especially for race and age. The extraction of this kind of information is not trivial due to the ambiguity related to the anatomy of each individual and their lifestyle. In particular, in race recognition, the somatic traits of some populations may not be well defined: for example, one person may exhibit some features more than another. Similar considerations apply to age estimation, where the apparent biological age can be very different from the chronological one.
], a humanoid robot able to automatically recognize soft biometric traits related to the gender and age of the interacting individuals is introduced. The recognition tasks are based on hand-crafted feature extraction, with Histogram of Oriented Gradients (HOG) for gender and Spatial Weber Local Descriptor (SWLD) for age, and on a Support Vector Machine (SVM) for the final classification. An interesting work on hand-crafted visual face features for gender, age, and ethnicity has been proposed in [136
], where different algorithmic configurations based on LBP, HOG, SWLD, and CLBP have been validated. In recent years, owing to their greater ability to describe visual appearance for pattern recognition tasks, methodologies based on deep neural networks have been employed in the field of soft biometrics, in particular for age, gender, and race estimation. A method for automatic age and gender classification, based on a simple CNN architecture that can be used even when training data are limited, is presented in [137
]. In [138
], the authors introduce, in a CNN architecture for age estimation, learning strategies based on local regressors and gating networks to tackle the non-stationary aging process: because the human face matures in different ways at different ages, age estimation data are heterogeneous. To deal with such heterogeneous data, in [139
], the authors propose Deep Regression Forests (DRFs), in which the split nodes are connected to a fully connected layer of a CNN, so that input-dependent data partitions at the split nodes and data abstractions at the leaf nodes are learned jointly. In [140
], a new CNN loss function, named mean-variance loss, is introduced. It consists of a mean loss, which penalizes the difference between the mean of the estimated age distribution and the ground-truth age, and a variance loss, which penalizes the variance of the estimated age distribution. Among hard biometrics, those related to the face recognition task are of great interest, in the field of HRI in particular, and the best-performing methods are those based on CNNs. In recent years, several CNN architectures with different levels of complexity have been proposed [141
]. In particular, recent efforts concern the definition of new loss functions able to learn more discriminative features. Concerning this last point, in [144
], a loss function named center loss is reported; it minimizes the intraclass distances of the deep features and, when employed jointly with the softmax loss, yields more discriminative features for robust face recognition. Following the seminal work on the center loss, in [145
], the authors introduce the Additive Angular Margin Loss, which enhances intraclass compactness and interclass discrepancy and corresponds to a geodesic distance margin between the sample and the centers of the identities distributed on a hypersphere, further improving CNN performance in terms of highly discriminative features for the face recognition task.
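As an illustration of the mean-variance loss described above, the following is a minimal NumPy sketch rather than the cited authors' implementation. It assumes that the network outputs one score per age class, that a softmax turns these scores into the estimated age distribution, and that the mean term is a halved squared error; the function and argument names are hypothetical. How the two terms are weighted, and whether they are combined with a classification loss, is left to the caller.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def mean_variance_loss(logits, true_ages, ages):
    """Return (mean_loss, variance_loss) averaged over a batch.

    logits:    (N, K) raw scores, one per age class (assumed layout)
    true_ages: (N,)   ground-truth ages
    ages:      (K,)   age value represented by each class
    """
    p = softmax(logits)                       # estimated age distribution
    mean = p @ ages                           # expected age per sample, shape (N,)
    var = (p * (ages[None, :] - mean[:, None]) ** 2).sum(axis=1)
    # Mean loss: squared gap between expected age and ground truth.
    mean_loss = 0.5 * np.mean((mean - true_ages) ** 2)
    # Variance loss: spread of the estimated distribution.
    variance_loss = np.mean(var)
    return mean_loss, variance_loss
```

A sharply peaked distribution centered on the correct age drives both terms toward zero, whereas a broad distribution with the right mean is penalized only by the variance term.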
Although CNN-based techniques have proven to be the best performing in soft biometrics recognition, it is well known that they require more complex hardware than techniques based on shallow networks and handcrafted features. Since the soft and hard biometrics involved in healthcare tasks are often meant for long-term use at the patient’s home or in public healthcare centers with limited funding, the hardware resources are often limited, and a trade-off usually has to be made between the accuracy of the results and the computational lightness of the implementation. This has driven most researchers in this area to make assumptions about the application scenarios, e.g., by assuming a limited number of subjects to be analyzed and then implementing lighter algorithms that can also run on older hardware.
5. Vital Parameters Monitoring
The accurate measurement of vital signs such as (i) blood pressure (BP), (ii) heart rate (HR), (iii) breathing rate (BR), and (iv) body temperature with a noninvasive and non-contact method is a very challenging task. The techniques pursuing these measurements could be applied to any part of the human body but, in general, they are applied to the face, which is the part that remains uncovered both in the medical field (bedridden people) and in the civil sphere (for example, when monitoring crowded areas to identify subjects with fever in order to contain the spread of viral diseases). The reference work in this research field dates back to 2000 [146
], when the first system for Photoplethysmography Imaging (PPGI; sometimes also referred to as camera-based PPG, i.e., cbPPG) was presented. The system consisted of a camera, a light source made up of near-infrared (NIR) light-emitting diodes (LEDs), and a high-performance PC. It estimated blood pressure by detecting the rhythmic changes in the optical properties of the skin caused by variations in the microvasculature, and it did so with just a webcam, without a photodetector in contact with the skin. It is important to highlight that the accuracy of this kind of algorithm depends mainly on the acquisition technology put in place. For example, in an intraoperative application, an NIR camera can be coupled with the RGB camera to improve accuracy [147
]. In [148
], a more complex acquisition setup was exploited. Green light generated by eight light-emitting diodes (LEDs) was projected onto the subject (whose eyes were protected by special glasses that do not transmit green light), and all video recordings were carried out in a dark laboratory room. This setup made it possible to analyze microcirculation in migraine patients and healthy controls, both for diagnostic purposes and for predicting personalized treatments for migraine patients. Concerning algorithmic strategies, there are two tasks to be faced by researchers: the segmentation of the skin area to be monitored and the processing of the extracted optical data to best estimate the vital signs. A Bayesian skin classifier and a level set segmentation approach to define and track ROIs based on spatial homogeneity were used in [147
]. Skin detection becomes easier, however, when infrared thermography images are used. For instance, in [149
], this technology was exploited to estimate the respiratory rate of 28 patients in a post-anesthesia care unit, simply by defining a region of interest (ROI) around the nose. An approach that dynamically selects individual face regions and outputs the HR measurement while simultaneously selecting the most reliable face regions for robust HR estimation has been proposed in [150
]. The authors of [151
] used convolutional neural networks (CNNs) to optimize ROIs, whereas the authors of [152
] combined Eulerian Magnification and CNN approaches to extract HR from facial video data. After preprocessing, a regression CNN is applied to the so-called “feature-image” to extract HR.
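Whatever method selects the ROI, the extracted optical data usually start as a per-frame summary of the pixels inside it. The sketch below shows one simple option, spatial averaging over a rectangular ROI followed by per-channel normalization. It is an illustrative assumption and not the procedure of any work cited above; the function names and the `(top, left, height, width)` ROI layout are hypothetical.

```python
import numpy as np


def roi_mean_trace(frames, roi):
    """Average each color channel over a rectangular ROI, frame by frame.

    frames: (T, H, W, C) video array
    roi:    (top, left, height, width) in pixels; hypothetical layout,
            e.g. produced by a face or skin detector
    returns a (T, C) array with one sample per frame and channel
    """
    top, left, h, w = roi
    patch = frames[:, top:top + h, left:left + w, :].astype(np.float64)
    if patch.shape[1] == 0 or patch.shape[2] == 0:
        raise ValueError("ROI lies outside the frame")
    return patch.mean(axis=(1, 2))


def normalize_trace(trace):
    """Zero-mean, unit-variance normalization per channel.

    Channels with no variation are returned as all zeros.
    """
    centered = trace - trace.mean(axis=0, keepdims=True)
    std = centered.std(axis=0, keepdims=True)
    return centered / np.where(std == 0, 1.0, std)
```

Averaging over many skin pixels suppresses sensor noise, which is one reason larger regions such as the forehead or cheeks are often preferred when they can be tracked reliably.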
Some works do not rely on skin detection; instead, they detect and track specific regions. A common source of the signal in this kind of work is the nostril region: it is much smaller than, for example, the forehead, but it can be detected and tracked more easily by using textural features. For example, in [153
], the ROI around the nostril zone is manually initialized through a graphical user interface and then tracked by the tracking, learning, and detection (TLD) predator algorithm [154
]. To avoid manual initialization, in [155
], automatic detection of the medial canthus of the periorbital regions is carried out by analyzing edges.
Concerning data processing, noise suppression and data reduction are primary tasks to be faced. Strategies for accomplishing this step can be categorized into blind source separation (BSS), model-based, and data-driven methods. In [156
], both Independent Component Analysis and Principal Component Analysis [157
] were used for blind source separation and data reduction, with the final aim of extracting the cardiac pulse from skin images. In [158
], a set of stochastically sampled points from the cheek region was used to estimate the PPG waveform via a Bayesian minimization approach. The posterior probability required for the Bayesian estimation is estimated through an importance-weighted Monte Carlo sampling approach, in which observations likely to yield valid PPG data are predominant. A Fourier transform is applied to the estimated PPG waveform and the frequency bin corresponding to the maximum peak within an operational band is selected as the heart rate frequency.
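The final step described above, picking the strongest spectral peak inside an operational band, can be sketched as follows. This is a minimal illustration and not the cited authors' code: the 0.7 to 4.0 Hz band (42 to 240 beats per minute), the Hann window, and the function name are assumptions made for the example.

```python
import numpy as np


def heart_rate_from_ppg(ppg, fs, band=(0.7, 4.0)):
    """Estimate heart rate in beats per minute from a PPG waveform.

    ppg:  1-D sequence of samples
    fs:   sampling rate in Hz (for video, the frame rate)
    band: operational band in Hz; the default is an assumed choice
    """
    ppg = np.asarray(ppg, dtype=np.float64)
    ppg = ppg - ppg.mean()                        # remove the DC component
    window = np.hanning(ppg.size)                 # limit spectral leakage
    power = np.abs(np.fft.rfft(ppg * window)) ** 2
    freqs = np.fft.rfftfreq(ppg.size, d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    if not in_band.any():
        raise ValueError("signal too short for the requested band")
    # Frequency bin with the maximum power inside the operational band.
    peak_freq = freqs[in_band][np.argmax(power[in_band])]
    return 60.0 * peak_freq
```

The frequency resolution is `fs / len(ppg)`, so a 20 s window gives bins 0.05 Hz (3 beats per minute) wide; longer windows give finer estimates at the cost of slower response to heart rate changes.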
The authors of [159
] introduced a mathematical model that incorporates the pertinent optical and physiological properties of skin reflections, with the objective of improving the understanding of the algorithmic principles behind remote photoplethysmography. A CNN-based approach for the analysis of breathing patterns acquired with thermography was also used in [160
]. However, the CNN architecture was applied to extracted spectrogram data and not to the raw thermal images.
Another key aspect is the assessment of the PPGI data quality, e.g., the capability to automatically segment the periods during which the patient is stable and in the frame. In [161
], the authors carried out a beat-by-beat quality assessment on every PPGI signal to identify data windows suitable for heart rate estimation. The PPGI quality assessment starts by applying a Bayesian change point detection algorithm to find step changes in the signal and discarding the heart rate estimates obtained during these periods. Heart rate was then extracted from face videos of 40 patients undergoing hemodialysis. One more task is the amplification of weak skin color variations. To this end, the authors of [162
] used Eulerian Video Magnification, which amplifies, by spatio-temporal filtering, the pulsatile signal in every pixel of the skin image. The PPGI was tested in a clinical environment to control regional anesthesia procedures. Convolutional neural networks can help to address several of the above tasks simultaneously. For instance, the authors of [163
] introduce the first application of deep learning in camera-based vital sign estimation, exploiting a multi-task convolutional neural network for the detection of neonates and their skin regions in an incubator. Similarly, the framework in [163
] exploits a multi-task convolutional neural network model that automatically detects the presence or absence of a patient and segments the patient’s skin regions if the patient is found in front of the camera.
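To make the idea behind Eulerian Video Magnification mentioned above more concrete, the following sketch amplifies per-pixel temporal variations within a frequency band. It is a deliberately simplified assumption: the spatial decomposition used by the full method is omitted, an ideal FFT band-pass filter stands in for the temporal filter, and the function name and parameters are hypothetical.

```python
import numpy as np


def amplify_temporal_band(frames, fs, low, high, alpha):
    """Amplify per-pixel temporal variations inside [low, high] Hz.

    frames: (T, H, W) or (T, H, W, C) array of pixel intensities
    fs:     frame rate in Hz
    alpha:  amplification factor (hypothetical tuning parameter)
    """
    frames = np.asarray(frames, dtype=np.float64)
    n = frames.shape[0]
    spectrum = np.fft.rfft(frames, axis=0)        # temporal FFT of every pixel
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    keep = (freqs >= low) & (freqs <= high)
    # Ideal band-pass: zero every temporal frequency outside the band.
    spectrum[~keep] = 0.0
    band = np.fft.irfft(spectrum, n=n, axis=0)
    # Add the amplified band-limited signal back to the original video.
    return frames + alpha * band
```

With a band chosen around plausible heart rates, the subtle color change caused by the pulse becomes `1 + alpha` times larger, while the static appearance of the skin is left unchanged.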
To go deeper into the possible application contexts, in addition to the already mentioned monitoring of patients in intraoperative/post-operative phases and for diagnostic purposes, another growing field of application for PPGI is the non-contact monitoring of neonates, particularly in the neonatal intensive care unit (NICU). To this end, several groups have presented works that detect respiration from camera-based measurements taken from the top view of an incubator [164
] and cardiac information [165
]. Data from NICU equipment have also been processed in [168
] (HR estimation, one subject) as well as in [163
], where a CNN has been trained to automatically detect skin regions (automated skin segmentation, 15 subjects). Some works approach real-world measurement scenarios with healthy subjects only. For example, monitoring subjects during sports exercises is attractive but quite challenging because of motion artifacts. This issue has been addressed in [169
] with PPGI for HR extraction. The estimation of the respiratory rate of subjects on stationary exercise bikes by using thermography has been addressed in [153
], whereas the vital signs have been estimated both by a thermographic and an RGB camera in [173
]. Another promising yet challenging environment for camera-based monitoring is the car. For this scenario, two groups have presented results on HR estimation for one subject each [174
] obtained by capturing the color variations resulting from blood circulation in facial skin. Some authors proposed a motion-resistant spectral peak tracking (MRSPT) framework and evaluated their approach in both fitness and driving scenarios. The proposed strategy eliminates motion artifacts by integrating facial motion signals [176
]. A NIR camera-based set-up for driver monitoring was also used in [177
], where the authors used an rPPG signal tracking and denoising algorithm (sparsePPG) based on Robust Principal Components Analysis and sparse frequency spectrum estimation. Another application field is the use of PPGI to synchronize magnetic resonance imaging (MRI) with the subject’s cardiac activity. This synchronization is an essential part of many MRI protocols and is referred to as cardiac “gating” or “triggering”. Pioneering work in this application area was presented in [178
], demonstrating that cardiac triggering using PPGI is technically feasible only in the presence of a reliable signal-to-noise ratio of the videos. Other works deal with specific aspects that reach beyond the basic vital signs such as the estimation of blood pressure variability [179
], pulse wave delay [148
], the jugular venous pulse waveform [180
], and venous oxygen saturation [181
From the above literature review, it emerges that there is a lack of reproducibility and comparability in the rPPG (remote photoplethysmography) field. This is because only a few datasets are publicly available, UBFC-RPPG [182
] and MAHNOB-HCI [183
], that are specifically designed for the remote heart rate measurement task, and the OBF [184
], which is a recent release for a study about remote physiological signal measurement. These datasets incorporate the three main challenges for rPPG algorithms: changes in skin tone, motion, and large heart/pulse rate changes. All the datasets refer to the ECG ground truth.
The UBFC-RPPG dataset contains 42 videos from 42 different subjects. The videos are recorded with a resolution of in an uncompressed 8-bit RGB format. Each subject is in front of a camera (1 m away). The participant is required to play a time-sensitive mathematical game to keep their heart rate varied. The MAHNOB-HCI dataset includes 527 facial videos with corresponding physiological signals from 27 subjects. The videos are recorded with 61 fps with a resolution of , which are compressed in AVC/H.264. The OBF dataset contains 200 five-minute-long RGB videos recorded from 100 healthy adults. The videos are recorded at 60 fps with a resolution of and compressed in MPEG-4.
Note that in this research area, in addition to the computer vision techniques exploited (for face recognition, skin detection, ROI feature extraction and detection, etc.), the signal resolution plays an important role in rPPG accuracy, especially when the camera–subject distance is over 1 m [185
]. However, based on the latest outcomes obtained by using CNN [186
] on OBF [184
] and MAHNOB-HCI [183
] datasets, it is possible to recover rPPG signals from highly compressed videos. Nevertheless, a comprehensive survey on rPPG that collects data from all the available datasets and compares all the state-of-the-art approaches is still missing. Finally, note that although vital sign estimation is now possible with ubiquitous, inexpensive, consumer-grade equipment, even from great distances, its spread has raised privacy concerns, which have been addressed in [187
] with an approach capable of eliminating physiological information from facial videos. To overcome privacy issues in camera-based sensing, a non-negligible degree of signal components associated with non-skin areas was proposed in [188
]. It is essentially a single-pixel photo-detector that has no spatial resolution; it therefore does not allow facial analysis (e.g., face detection or recognition) and thus fundamentally eliminates privacy concerns.
6. Visual Speech Recognition and Animation
Speech recognition is a relatively new topic in the field of computer vision based applications for health assistance. Recently, it has been observed that the detection of lips and the automatic evaluation of their animations in the process of speech formation can be a strategic task in the development of assistive applications.
Traditionally, approaches to the speech detection task have focused on the detection and processing of audio signals, independently of the visual information. However, an algorithm based only on audio information suffers in the presence of acoustic noise. For this reason, in recent years, some works have started to consider the correlation between speech formation and lip animation, giving rise to a specific computer vision task called lip-reading [189
Deep bottleneck features (DBNFs) can be considered the link between audio- and video-based approaches: they were initially used successfully for acoustic speech recognition from audio [190
]; subsequently, DBNFs have also been used for speech recognition from video sequences. One of the most interesting approaches is proposed in [192
]: here, the authors applied DBNFs immediately after Local Binary Patterns to reduce computational time, then concatenated the output with Discrete Cosine Transform (DCT) features and fed the result to a Hidden Markov Model for temporal analysis. The approach proposed in [193
] is quite similar, but here DBNFs are applied directly to the image pixels.
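For readers unfamiliar with Local Binary Patterns, the hand-crafted descriptor placed before the DBNF stage in the first of the two approaches above, the sketch below computes the basic 8-neighbour, radius-1 variant and its histogram for a grayscale mouth region. The exact LBP variant and parameters used in the cited work are not stated here, so this choice is an assumption, and the function names are hypothetical.

```python
import numpy as np


def lbp_codes(gray):
    """Basic 8-neighbour, radius-1 LBP code for every interior pixel.

    gray: (H, W) grayscale image with H, W >= 3
    returns an (H-2, W-2) uint8 array of codes in [0, 255]
    """
    gray = np.asarray(gray, dtype=np.float64)
    h, w = gray.shape
    if h < 3 or w < 3:
        raise ValueError("image must be at least 3x3")
    center = gray[1:-1, 1:-1]
    # Neighbour offsets (dy, dx), clockwise from the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = gray[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        # Set this bit wherever the neighbour is at least as bright as the center.
        codes |= (neighbour >= center).astype(np.uint8) * np.uint8(1 << bit)
    return codes


def lbp_histogram(gray):
    """Normalized 256-bin histogram of LBP codes, usable as a feature vector."""
    hist = np.bincount(lbp_codes(gray).ravel(), minlength=256).astype(np.float64)
    return hist / hist.sum()
```

Because each code depends only on the sign of local intensity differences, the descriptor is insensitive to monotonic illumination changes, which helps explain its popularity for face and mouth regions.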
As previously stated, the main characteristic of this branch of applications is its multi-modality, i.e., the integration of signals coming from different types of sources. The implications of this have been clearly highlighted in [194
], where the authors propose an approach to learn the dependency between the data and then present results obtained by a neural network trained with audio signals and tested with video ones (and vice versa). The multisensory representation is also the starting point of the methodology proposed in [195
]: here, the authors assert that the visual and audio components of a video signal are strictly correlated, and they propose a self-supervised approach, by means of a neural network, to evaluate the alignment of video frames and audio. In [196
], the authors introduce a joint audio–visual method to assign audio to a specific speaker in a complex environment. They used a specific neural network whose inputs are the recorded sound mixture and the faces detected in each frame, and whose output is the assignment of each audio stream to the correct detected speaker. The approach requires human interaction to specify which faces in the video the user wants to hear speech from. As is common in this kind of application, the authors also present a dataset (called AVSpeech) composed of 1500 h of video clips in which the speaker is clearly visible and is associated with clean speech without noise.
On the contrary, the authors of [197
] propose an approach based only on image processing: a CNN architecture able to effectively learn and recognize hundreds of words from in-the-wild videos from the web. The results they present are quite encouraging and confirm that video information can be used independently of audio information. The authors further develop their idea by proposing a similar approach in [198
]. Here, they propose a network architecture they call WLAS (Watch, Listen, Attend and Spell) that is specialized for speech recognition; they also include a new learning strategy for redundancy reduction and overfitting limitation.
An interesting aspect is analyzed in [199
]: here, the authors focus their attention on the creation of a synthetic training dataset, built by means of 3D modeling software, that overcomes one of the main limitations of current datasets: the lack of non-frontal images of the lips/mouth. Indeed, the most widely used datasets mainly contain frontal images, and this can affect the performance of a neural network in the presence of non-frontal test images.
Even though several datasets can be found on the web, a group of researchers recently developed an approach for automatic dataset construction for speech recognition starting from YouTube videos. The authors applied several filters and processing algorithms to the videos, with the goal of extracting samples suitable for training a neural speech recognition system [200
]. The creation of a specific dataset is also one of the goals of [198
], where a new dataset of over 100,000 natural sentences from British television is presented to the community. Two datasets have been proposed in [201
], where a good comparison between the benefits of audio- and video-based processing methodologies for speech recognition is also presented. Finally, special mention should be made of the approach proposed in [202
], which gives a different point of view on this topic. Here, the authors propose a methodology to train a Visual Speech Recognition (VSR) model by distilling from an Automatic Speech Recognition (ASR) model by means of a deep 1D-convolutional residual network. In this way, any dataset available on the web can be used to train a network, even if its images/videos are not annotated. Subtitle generation becomes a redundant operation, and the effort of synchronizing subtitles and images is also avoided.
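The core idea of distilling a visual model from an audio one can be illustrated with a generic knowledge distillation term: the student (VSR) is trained to match the per-frame output distribution of the teacher (ASR), so no manual transcription is required. The sketch below is a minimal assumption-laden example and not the cited authors' training objective; it presumes frame-aligned teacher and student outputs over the same vocabulary, and the function names and the `temperature` parameter are hypothetical.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax along the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Mean KL(teacher || student) over frames.

    student_logits: (T, V) scores of the visual model per frame
    teacher_logits: (T, V) scores of the audio model for the same frames
    temperature:    softening factor; hypothetical hyperparameter
    """
    log_p_teacher = log_softmax(np.asarray(teacher_logits, dtype=np.float64) / temperature)
    log_p_student = log_softmax(np.asarray(student_logits, dtype=np.float64) / temperature)
    p_teacher = np.exp(log_p_teacher)
    # KL divergence per frame, then averaged over the sequence.
    kl = (p_teacher * (log_p_teacher - log_p_student)).sum(axis=-1)
    return kl.mean()
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, so minimizing it transfers the teacher's predictions to the visual model.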
All the approaches presented above address the visual speech recognition task without any reference to healthcare applications. Traditionally, lip reading is used as a support for people with hearing loss [203
]. Interpretation of sign language is an interesting field, as highlighted in [204
] and [205
]. Similarly, in [206
], this problem is faced from the rehabilitation point-of-view. Another very interesting application field is the lip reading applied to ventilated patients [207
], i.e., people who are able to move their lips and mouth correctly but cannot produce any sound.
An example of signal processing applied to healthcare is proposed in [32
], but here speech recognition is performed only by means of audio processing tools, which limits its applicability in the presence of noise or disturbed audio signals. An overview of applications of deep learning for healthcare is proposed in [208
], but again the focus is on other aspects, overlooking details about video-based approaches for speech recognition. Surprisingly, to our knowledge, computer vision based approaches applied to this specific topic are quite rare in the literature.
], the authors present a range of applications of deep learning in healthcare, and they also focus their attention on speech recognition. However, they identify as the crucial point of next-generation AI in this field the development of voice assistants that accurately transcribe patient visits, limiting the time doctors spend on documentation. This is surely true and relevant, but in our work we show and motivate how AI can also improve healthcare applications from another, active point of view, including in speech recognition.
Starting from the overview presented here, it is reasonable to expect that the vision-based approaches to speech recognition described at the beginning of this section can be applied to healthcare applications: in this way, the traditional limitations of speech recognition algorithms (noise, distortion, and signal overlap) can be overcome, with the goal of realizing architectures that provide reliable lip reading and speech recognition algorithms usable in many heterogeneous contexts.
7. Discussion and Conclusions
From the literature overview, it emerges that the analysis of the face for medical and health applications is still in an embryonic state. There is indeed a great untapped potential linked to the latest methods of computer vision and machine learning, which are currently confined to theoretical or applied research and are only marginally leveraged in on-field research activities such as those related to healthcare.
Throughout the previous sections, it has clearly emerged that healthcare applications often rely on computer vision tasks that exploit out-of-date approaches. The proposed comparative tables help to establish that the accuracy and reliability of the algorithms involved are often below the performance of the latest results in theoretical and even applied research. This stems from the common practice of designing computer vision based healthcare frameworks and systems by combining consolidated algorithmic modules, preferably available as APIs or toolkits that are very easy to integrate. Examples are open source toolkits such as OpenFace [210
] for emotion analysis and OpenBR [211
] for soft and hard biometrics, or cloud-based multitask computer vision platforms provided as a service, such as Amazon Rekognition [212
] and Microsoft Azure [213
Another crucial issue is the possibility of obtaining scalable deep learning algorithms able to work in real time even on mobile and non-specialized hardware. From this perspective, enhanced convolutional neural tangent kernels could be an interesting research line [214
]. On the other hand, the design, implementation, and validation of frameworks addressing healthcare tasks can be very difficult and time-consuming, owing to the need to recruit control and clinical groups and because the assessment has to be carried out involving several subjects with multidisciplinary technical backgrounds. This is a relevant drawback that has to be seriously faced, especially with regard to the analysis of the face, whose complex structure requires advanced approaches, desirably even able to detect micro-movements of facial muscles, so as not to invalidate the entire experimental architecture with unreliable image/video data computation.
A possible way out is the massive exploitation of a very hot topic in machine learning named deep visual domain adaptation [215
], by which it is possible to learn more transferable representations by embedding domain adaptation in the pipeline of deep learning. The idea is to utilize abundant labeled data from an auxiliary domain (generic computer vision tasks), i.e., source domain, for classifying the data from a label-scarce domain, i.e., target domain (healthcare) [216
]. Under this new perspective, it is becoming easier to glimpse deep learning-based end-to-end approaches specifically designed for face analysis in the healthcare domain.
Subjectivity and complexity of annotation of clinical data will remain an open challenge that could benefit from accurate annotation guidelines, standardized processes and clinical entity recognition tools, and formal specifications [217