Driver Stress State Evaluation by Means of Thermal Imaging: A Supervised Machine Learning Approach Based on ECG Signal

Featured Application: A procedure for a driver’s stress state monitoring was provided by means of thermal infrared imaging. It was validated on ECG-derived parameters through the application of supervised machine learning techniques. Abstract: Tra ﬃ c accidents determine a large number of injuries, sometimes fatal, every year. Among other factors a ﬀ ecting a driver’s performance, an important role is played by stress which can decrease decision-making capabilities and situational awareness. In this perspective, it would be beneﬁcial to develop a non-invasive driver stress monitoring system able to recognize the driver’s altered state. In this study, a contactless procedure for drivers’ stress state assessment by means of thermal infrared imaging was investigated. Thermal imaging was acquired during an experiment on a driving simulator, and thermal features of stress were investigated with comparison to a gold-standard metric (i.e., the stress index, SI) extracted from contact electrocardiography (ECG). A data-driven multivariate machine learning approach based on a non-linear support vector regression (SVR) was employed to estimate the SI through thermal features extracted from facial regions of interest (i.e., nose tip, nostrils, glabella). The predicted SI showed a good correlation with the real SI ( r = 0.61, p = ~0). A two-level classiﬁcation of the stress state (STRESS, SI ≥ 150, versus NO STRESS, SI < 150) was then performed based on the predicted SI. The ROC analysis showed a good classiﬁcation performance with an AUC of 0.80, a sensitivity of 77%, and a speciﬁcity of 78%. the regression output. Support Vector Regression with Radial Basis Function kernel (SVR-RBF) was used as regressor. A leave-one-subject-out cross-validation was employed to test the generalization of the regression. The result of the regression (SIcross) was then used to perform a two-level classiﬁcation of the driver’s stress (i.e., STRESS versus NO STRESS). Since the two classes were not balanced, a bootstrap procedure was implemented. Receiver Operating Characteristic (ROC) analysis was executed to investigate the performance of the classiﬁer.


Introduction
According to the latest estimates by the World Health Organization, approximately 1.35 million people die each year from road traffic accidents and between 20-50 million people suffer from non-fatal injuries [1].
Advanced driver-assistance systems (ADAS) are designed to support humans during the driving process, leading to an increase in road safety. Conventional ADAS technologies are mainly based on controlling the vehicle state through proprioceptive (i.e., Odometry, inertial sensors) and exteroceptive sensors (i.e., Lidar, vision sensors, radar, infrared, and ultrasonic sensors) [2]. These state-of-the-art technologies allow for the recognition of objects [3], alerting the driver about dangerous road accuracy, and the Modified Cooper Harper Scale ratings. Stemberger et al. [27] presented a system for the estimation of cognitive workload levels based on the analysis of facial skin temperature. Beyond thermal infrared imaging of the face, the system relied on head pose estimation, measurement of the temperature variation across regions of the face, and an artificial neural network classifier. The system was capable of accurately classifying mental workload into high, medium, and low levels 81% of the time.
Given the advantages of the use of thermography in driver state monitoring, a relevant number of scientific works on this research field are available. Most of these publications concern driver drowsiness/fatigue monitoring and emotional state detection. Ebrahimian-Hadikiashari et al. [28] investigated driver drowsiness by analyzing the breathing function, monitored by thermography. The authors observed a significant decrease of driver respiration rate from extreme drowsiness to wakefulness conditions. Moreover, Knapik et al. [29] presented an approach for the evaluation of driver's fatigue, based on yawn detection using thermal imaging. Zhang et al. [30] demonstrated the feasibility of discriminating emotions (e.g., fear versus no fear) by means of thermal imaging, assessing the forehead temperature as indicative for the emotional dimension of drivers' fear. Focusing on driver stress monitoring, Yamakoshi et al. [31] combined measures from facial skin temperature and hemodynamic variables. The authors observed an increase of sympathetic activity, peripheral vasoconstriction, hence, a significant decrease in peripheral skin temperature during monotonous driving simulation. Basing on differential skin temperatures between peripheral (i.e., nose tip) and truncal parts of the face (i.e., cheeks, jaw, and forehead), they were able to assess an index of a driver's stress. More recently, Pavlidis et al. [32] studied the effects of cognitive, emotional, sensorimotor, and mixed stressors on driver arousal and performance with respect to a baseline of 59 drivers in a simulation experiment. Perinasal perspiration, revealed by thermal imaging, together with the measure of steering angle and the range of lane departures on the left and right side of the road, showed a more dangerous driving condition in the case of sensorimotor and mixed stressors with respect to the baseline condition.
In this paper, the driver stress state was established by means of IR imaging and supervised machine learning methods. Supervised machine learning approaches are part of artificial intelligence (AI) algorithms, able to automatically learn functions that map an input to an output based on known input-output pairs (training dataset). The function is inferred from labeled training data and can be used for mapping new dataset (test dataset) that allow to evaluate the accuracy of the learned function and understand the level of generalization of the applied model [33].
On the basis of key features of thermal signals extracted from peculiar regions of interest (ROIs), indicative of the psycho-physiological state, an estimation of the ECG-derived SI was performed employing a support vector regression with radial basis function (SVR-RBF) [17]. To test the generalization performances of the model, a leave-one-subject-out cross-validation was utilized. After the cross-validation process, a two-level classification of the stress state (STRESS versus NO STRESS) was performed, relying on the estimated SI.
This work describes a novel approach for a contactless methodology dedicated to driver stress state detection and classification, constituting a significant improvement to actual ADAS technology and, in general, to road security level.

Participants
The experimental session involved 10 adults (6 males, age range 22-35, mean 28.4). Before the start of the experimental trials, the participants were adequately informed about the purpose and protocol of the study, and they signed an informed consent form outlining the methods and the purposes of the experimentation in accordance with the Declaration of Helsinki [34].

Procedure and Data Acquisition
Prior to testing, each subject was left in the experimental room for 20 min to allow the baseline skin temperature to stabilize. The recording room was set at a standardized temperature (23 • C) and humidity (50-60%) by a thermostat.
To perform the experiment, a static driver simulator was used (Figure 1a). It was composed of driver's seat, steering wheel, clutch, brake, and gas pedals, and gearshift. To display the scenario, three 27 inch monitors were used. The total video resolution for the stimulation was 5760 × 1080 pixels. The distance between the driver and road screen was approximately 1.5 m. Participants' horizontal view angle was 150 degrees. The simulator could produce starter and engine sounds, left and right signal indicators, and flashers and wiper blades. In this study, the sound of the engine, starter, and lights switches were provided.

Procedure and Data Acquisition
Prior to testing, each subject was left in the experimental room for 20 min to allow the baseline skin temperature to stabilize. The recording room was set at a standardized temperature (23 °C) and humidity (50-60%) by a thermostat.
To perform the experiment, a static driver simulator was used (Figure 1a). It was composed of driver's seat, steering wheel, clutch, brake, and gas pedals, and gearshift. To display the scenario, three 27 inch monitors were used. The total video resolution for the stimulation was 5760 × 1080 pixels. The distance between the driver and road screen was approximately 1.5 m. Participants' horizontal view angle was 150 degrees. The simulator could produce starter and engine sounds, left and right signal indicators, and flashers and wiper blades. In this study, the sound of the engine, starter, and lights switches were provided.
Participants sat comfortably on the seat of the driving simulator during both acclimatization and measurement periods. The software used for driving simulation was City Car Driving, Home Edition software (version 1.5) [35] (Figure 1b). The experimental protocol consisted of performing a driving simulation lasting 45 min in an urban context. The experimental conditions were set a priori to ensure the adverse driving condition and guarantee the uniformity of the experimental protocol for all the subjects. An overview of the experimental setting of City Car Driving software is reported in Table 1.  Participants sat comfortably on the seat of the driving simulator during both acclimatization and measurement periods.
The software used for driving simulation was City Car Driving, Home Edition software (version 1.5) [35] (Figure 1b). The experimental protocol consisted of performing a driving simulation lasting 45 min in an urban context. The experimental conditions were set a priori to ensure the adverse driving condition and guarantee the uniformity of the experimental protocol for all the subjects. An overview of the experimental setting of City Car Driving software is reported in Table 1. The conditions reported in Table 1 were selected to induce stress in the participants. In particular, the settings associated to traffic and emergency situations guaranteed non-comfortable driving, since the participants were often experiencing non-monotonous situations.
During the execution of the experimental protocol, ECG signals and visible and thermal IR videos were acquired.
The ECG signals were recorded by means of AD Instruments PowerLab system using the lead configuration determined by the Standard Limb Leads (i.e., electrodes positioned at the right arm (RA), left arm (LA), and left leg (LL)) [36].
Visible and thermal IR videos were acquired by the depth camera Intel RealSense D415 and FLIR Boson 320LW IR thermal camera, respectively. The technical characteristics of the two acquisition devices are summarized in Table 2. For the purpose of this study, the two imaging devices were held together and aligned horizontally. Figure 2 shows the entire imaging acquisition system.


Accident on the road: Often  Dangerous entrance of the vehicle into the oncoming lane: Rarely The conditions reported in Table 1 were selected to induce stress in the participants. In particular, the settings associated to traffic and emergency situations guaranteed non-comfortable driving, since the participants were often experiencing non-monotonous situations.
During the execution of the experimental protocol, ECG signals and visible and thermal IR videos were acquired.
The ECG signals were recorded by means of AD Instruments PowerLab system using the lead configuration determined by the Standard Limb Leads (i.e., electrodes positioned at the right arm (RA), left arm (LA), and left leg (LL)) [36].
Visible and thermal IR videos were acquired by the depth camera Intel RealSense D415 and FLIR Boson 320LW IR thermal camera, respectively. The technical characteristics of the two acquisition devices are summarized in Table 2. For the purpose of this study, the two imaging devices were held together and aligned horizontally. Figure 2 shows the entire imaging acquisition system.

Analysis of ECG Signals
The ECG signals were recorded at a rate of 1 kHz. The elapsed time periods between the two successive R-peaks of the ECGs (RR signals) were extracted from LabChart7, ADInstruments, and analyzed by the software Kubios HRV Standard [37]. Baevsky's Stress Index (SI) [18] was evaluated for each subject in 30 s consecutive windows. The SI from Kubios is the square root (to make the index normally distributed) of the Baevsky's Stress Index proposed in Reference [18].

Analysis of ECG Signals
The ECG signals were recorded at a rate of 1 kHz. The elapsed time periods between the two successive R-peaks of the ECGs (RR signals) were extracted from LabChart7, ADInstruments, and analyzed by the software Kubios HRV Standard [37]. Baevsky's Stress Index (SI) [18] was evaluated for each subject in 30 s consecutive windows. The SI from Kubios is the square root (to make the index normally distributed) of the Baevsky's Stress Index proposed in Reference [18].
Baevsky's SI is calculated based on the distribution of the RR intervals as reported in Equation (1): Appl. Sci. 2020, 10, 5673 where Mo is the mode (the most frequent RR interval), AMo is the mode amplitude expressed in percent, and MxDMn is the variation scope reflecting the degree of RR interval variability. Values of Baevsky's stress index between 80 and 150 are considered normal [18].

Analysis of Visible and Thermal Imaging Data
Visible and IR videos of the subjects' faces were simultaneously recorded during the driving experiment at an acquisition frame rate of 30 Hz and 10 Hz, respectively.
Given the availability of computer vision algorithms for visible videos, in the present study, visible imaging was used as reference for tracking facial landmark features. The purpose of the visible tracking was to transfer the visible facial landmark features tracked to the thermal imagery, estimating the geometrical transformation between the two imaging optical devices.

Visible and Thermal Data Co-Registration
The first step of the developed procedure consisted of an optical co-registration between visible and thermal optics. The co-registration process was a fundamental step of the whole pipeline, since it allowed a proper mapping from an imaging coordinate system to another.
The optical co-registration relied on procedures implemented in OpenCV [38], and it is described in depth in Reference [39]. A root mean square error (RMSE) value was provided by the co-registration procedure, thus indicating the accuracy in the coordinate transformation from visible to IR imagery at the specific distance of 1 m.

Facial Landmark Detection in the Visible Domain
Visible videos were analyzed through OpenFace [40,41], an open-source software able to perform facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation. For each frame, a set of 68 facial landmarks was estimated during the experiment. Figure 3 shows the distribution of the 68 facial landmarks.
Baevsky's SI is calculated based on the distribution of the RR intervals as reported in Equation (1): where Mo is the mode (the most frequent RR interval), AMo is the mode amplitude expressed in percent, and MxDMn is the variation scope reflecting the degree of RR interval variability. Values of Baevsky's stress index between 80 and 150 are considered normal [18].

Analysis of Visible and Thermal Imaging Data
Visible and IR videos of the subjects' faces were simultaneously recorded during the driving experiment at an acquisition frame rate of 30 Hz and 10 Hz, respectively.
Given the availability of computer vision algorithms for visible videos, in the present study, visible imaging was used as reference for tracking facial landmark features. The purpose of the visible tracking was to transfer the visible facial landmark features tracked to the thermal imagery, estimating the geometrical transformation between the two imaging optical devices.

Visible and Thermal Data Co-Registration
The first step of the developed procedure consisted of an optical co-registration between visible and thermal optics. The co-registration process was a fundamental step of the whole pipeline, since it allowed a proper mapping from an imaging coordinate system to another.
The optical co-registration relied on procedures implemented in OpenCV [38], and it is described in depth in Reference [39]. A root mean square error (RMSE) value was provided by the coregistration procedure, thus indicating the accuracy in the coordinate transformation from visible to IR imagery at the specific distance of 1 m.

Facial Landmark Detection in the Visible Domain
Visible videos were analyzed through OpenFace [40,41], an open-source software able to perform facial landmark detection, head pose estimation, facial action unit recognition, and eye-gaze estimation. For each frame, a set of 68 facial landmarks was estimated during the experiment. Figure 3 shows the distribution of the 68 facial landmarks. The landmark detector algorithm within OpenFace relied on the constrained local neural fields (CLNF) procedure [42], whereas the face detector algorithm employed a multi-task convolutional cascaded network (MTCNN) approach [43].

Thermal Data Extraction and Analysis
The sets of the 68 facial landmarks detected in the visible images were identified in the corresponding frames of IR imaging, applying the geometrical transformation obtained from the optical co-registration process. Figure 4a  The landmark detector algorithm within OpenFace relied on the constrained local neural fields (CLNF) procedure [42], whereas the face detector algorithm employed a multi-task convolutional cascaded network (MTCNN) approach [43].

Thermal Data Extraction and Analysis
The sets of the 68 facial landmarks detected in the visible images were identified in the corresponding frames of IR imaging, applying the geometrical transformation obtained from the optical co-registration process. Figure 4a  A fundamental aspect for obtaining an accurate co-registration was the need of temporal synchronization between visible and IR videos. Since the acquisition frame rate of thermal videos was lower than that of the visible camera, the corresponding frames within the visible domain were determined according to the specific timestamps of IR frames. Specifically, among the visible frames acquired around the IR frame timestamp, the one that minimized the temporal difference with IR imaging was chosen. The timestamps of the frames were considered reliable as the videos were acquired on the same PC.
For each thermal video, four ROIs were considered and positioned on facial areas of physiological importance (nose tip, right and left nostrils, and glabella) [44]. The ROIs' coordinates A fundamental aspect for obtaining an accurate co-registration was the need of temporal synchronization between visible and IR videos. Since the acquisition frame rate of thermal videos was lower than that of the visible camera, the corresponding frames within the visible domain were determined according to the specific timestamps of IR frames. Specifically, among the visible frames acquired around the IR frame timestamp, the one that minimized the temporal difference with IR imaging was chosen. The timestamps of the frames were considered reliable as the videos were acquired on the same PC.
For each thermal video, four ROIs were considered and positioned on facial areas of physiological importance (nose tip, right and left nostrils, and glabella) [44]. The ROIs' coordinates were automatically determined from the location of the 68 landmarks. In this way, the initialization of the position of the ROIs was automatically determined (Figure 4c).
With reference to the topographical distribution of the points as represented in Figure 3, the coordinates of the four ROIs were determined, as described in Table 3: For each ROI, the average value of the pixels was extracted over time (Figure 4d). Relatively to the nostrils' ROIs, the average value between ROI 2 and ROI 3 was considered for further statistical analysis, them being related to the same physiological process (i.e., breathing function).
For each of the extracted signals, six representative features were computed over consecutive temporal window of 30 s:

Application of Supervised Machine Learning
Firstly, a machine learning approach was utilized to predict SI relying on features extracted from thermal signals. Specifically, an SVR with RBF kernel was trained on the SI obtained from Kubios through a supervised learning procedure. The SVR-RBF was trained on z-scored data with a fixed nonlinearity exponential parameter γ = 1.
Because of the multivariate (6 regressors) SVR approach, in-sample performance of the procedure did not reliably estimate the out-of-sample performance. The generalization capabilities of the procedure were thus assessed through cross-validation. Specifically, a leave-one-subject-out cross-validation was performed [45]. This cross-validation procedure consisted in leaving one subject (specifically all the samples from the same subject) out of the regression and in estimating the predicted output value on the given subject using the other participants as the training set of the SVR model. This procedure was iterated for all the subjects, and further statistical analyses were performed on the out-of-training-sample estimation of SI from thermal features. Such a metric was labelled SIcross.
Although several machine learning approaches could be suited for such a purpose, given the limited number of independent features available and the exploratory nature of the implemented approach, an SVR-RBF followed by a classification procedure was chosen to limit the procedural complexity. In fact, although SVR-RBF is not a sophisticated approach, it ensures performances which are comparable to more complex machine learning techniques [46].
Secondly, SIcross was used to perform a two-level classification of the driver's stress (i.e., STRESS versus NO STRESS). The two classes were defined on the basis of the threshold associated to a stress condition assessed by the SI (i.e., SI > 150 for stress condition) [18]. Notably, the experimental recordings confirmed the accordance between SI and the driving conditions. In particular, stressful situations assessed by SI were associated to adverse events during driving simulations (e.g., traffic accidents, collisions with pedestrians, sudden car braking).
Since the two classes did not have an equal number of samples, a bootstrap procedure was implemented to test classification performance on balanced classes [47]. The performances of the classification were evaluated by means of receiver operating characteristic (ROC) analysis [48]. situations assessed by SI were associated to adverse events during driving simulations (e.g., traffic accidents, collisions with pedestrians, sudden car braking). Since the two classes did not have an equal number of samples, a bootstrap procedure was implemented to test classification performance on balanced classes [47]. The performances of the classification were evaluated by means of receiver operating characteristic (ROC) analysis [48]. Figure 5 reports the flow chart relating to the described machine learning approach.

Figure 5.
Flow chart of the applied machine learning approach: the thermal features are used as predictors whilst the Electrocardiogram(ECG)-derived Stress Index (SI) is considered as the regression output. Support Vector Regression with Radial Basis Function kernel (SVR-RBF) was used as regressor. A leave-one-subject-out cross-validation was employed to test the generalization of the regression. The result of the regression (SIcross) was then used to perform a two-level classification of the driver's stress (i.e., STRESS versus NO STRESS). Since the two classes were not balanced, a bootstrap procedure was implemented. Receiver Operating Characteristic (ROC) analysis was executed to investigate the performance of the classifier.

Visible and Thermal Imaging Co-Registration and Processing
The spatial RMSE of the optical co-registration was 0.66 ± 0.25 pixels, thus indicating that the accuracy in the coordinate transformation from visible to IR imagery at the specific distance of one meter was less than one pixel.
The percentage value of the correctly identified landmark on the total amount of considered frames and the confidence value in correctly classifying a face are reported in Table 4 for each subject. These parameters were returned by the software OpenFace [40]. The confidence value ranged from 0 (total misclassification of face) to 1 (correct face classification), and it was the result of a landmark A leave-one-subject-out cross-validation was employed to test the generalization of the regression. The result of the regression (SIcross) was then used to perform a two-level classification of the driver's stress (i.e., STRESS versus NO STRESS). Since the two classes were not balanced, a bootstrap procedure was implemented. Receiver Operating Characteristic (ROC) analysis was executed to investigate the performance of the classifier.

Visible and Thermal Imaging Co-Registration and Processing
The spatial RMSE of the optical co-registration was 0.66 ± 0.25 pixels, thus indicating that the accuracy in the coordinate transformation from visible to IR imagery at the specific distance of one meter was less than one pixel.
The percentage value of the correctly identified landmark on the total amount of considered frames and the confidence value in correctly classifying a face are reported in Table 4 for each subject. These parameters were returned by the software OpenFace [40]. The confidence value ranged from 0 (total misclassification of face) to 1 (correct face classification), and it was the result of a landmark detection validation process. In detail, to avoid tracking drift over time, it was necessary to determine if landmark detection succeeded during video processing. The landmark detection validation was performed transforming the area surrounded by the landmarks to a pre-defined reference shape. The vectorized resulting image was then used as a feature vector for a classifier which acts as the validator (i.e., input of the classifier). To train the classifier on the vectorized reference warp, positive and negative landmark detection examples were considered. The positive samples were ground truth landmark labels, whereas the negative samples were generated from the ground truth labels, applying offset and scale transformations. The classifier employed in OpenFace is SVM [49]. On average, 94.66% of the video frames were correctly processed, whereas the confidence index for face classification was 0.90%.
To notice, concerning subjects 3 and 5, the average success and confidence scores were lower with respect to the other subjects, given the scarce lighting conditions of the acquisitions (Table 4). However, in general, for all the subjects, only the frames with high confidence and success scores were considered for further analysis (i.e., success index > 90%, confidence value > 0.8). This ensures there was no impact on the ROIs' identification and, consequently, on their features' estimation.
Finally, the average execution time of the developed algorithm was 0.09 s/frames with MATLAB 2016b© (64-bit Windows 7 Pro, Service Pack 1; Intel (R) Core (TM) i5 CPU; 8.00 GB RAM).

Performances of Supervised Machine Learning Approach
Across subjects, 849 samples were available for the regression analysis. A significant correlation between SI and predicted SI (SIcross) was obtained (r = 0.61, p =~0) (Figure 6a), demonstrating a good performance of the multivariate analysis [50]. The weights associated to each z-scored regressor for each ROI are shown in Figure 6b. Considering that both the regressors and SI were normalized, the values of the weights were indicative of the contribution of each model input in the estimation of the SI.
Since the two classes had a different number of samples (125 samples for STRESS versus 696 samples for NO STRESS conditions), a bootstrap procedure was implemented to provide a classification estimates using balanced classes [47]. Since the two classes had a different number of samples (125 samples for STRESS versus 696 samples for NO STRESS conditions), a bootstrap procedure was implemented to provide a classification estimates using balanced classes [47]. Figure 7a reports the among iterations average ROC curve (bootstrap performed for n = 10,000 iterations). The average area under curve (AUC) was 0.80 and with standard deviation of 0.01. The distribution of the AUC obtained after the bootstrap is reported in Figure 7b.   Since the two classes had a different number of samples (125 samples for STRESS versus 696 samples for NO STRESS conditions), a bootstrap procedure was implemented to provide a classification estimates using balanced classes [47]. Figure 7a reports the among iterations average ROC curve (bootstrap performed for n = 10,000 iterations). The average area under curve (AUC) was 0.80 and with standard deviation of 0.01. The distribution of the AUC obtained after the bootstrap is reported in Figure 7b.  By choosing a specific threshold for SIcross, a sensitivity of 77% and a specificity of 78% were obtained as reported in the confusion matrix (Table 5).

Discussion
In this study, a novel method for driver stress evaluation based on thermal IR imaging and supervised machine learning approaches was described. Thermal IR imaging and ECG were acquired on ten subjects, while performing an experiment on a driving simulator using the software City Car Driving v.1.5 [35]. The experimental session consisted of 45 min of urban context driving with pre-established weather and traffic conditions. Electrocardiography (ECG) signals were used to infer the stress condition of the drivers. Among the variety of indices derived from the ECG signals, stress was considered [18]. In this study, the SI was evaluated in consecutive 30 s time windows by the software Kubios [36]. In the same temporal window, six representative features from average thermal signals on four ROIs (i.e., nose tip, left and right nostrils, glabella) were extracted. The thermal signals were automatically determined by a real-time tracking procedure. The tracking relied on state-of-the-art computer vision algorithms applied on visible images and the optical co-registration between the visible and thermal imaging devices ensuring high performance on signal processing and speed of extraction. Indeed, the high performances were highlighted by the percentage of correctly processed frames which reached an average of 94.66% by the confidence index for face classification that was 0.90 and by the average processing time that was only 0.09 s/frames.
A multivariate machine learning approach based on Support Vector Regression (SVR) with Radial Basis Function (RBF) kernel was employed to estimate the ECG-based SI through peculiar thermal features extracted from facial ROIs. Those ROIs were chosen on the basis of their physiological importance for stress detection [43]. A total amount of 18 thermal features (six features for each ROIs) were computed and used as predictors, while the SI, evaluated through the ECG signals, was considered as the regression output. A leave-one-subject-out cross-validation was employed to test the generalization of the regression. This procedure was iterated for all the subjects and further statistical analyses were performed on the out-of-training-sample estimation of SI. Such a metric was labeled as SIcross. The correlation between SI and SIcross was r = 0.61 (p =~0) thus indicating a good estimation of the SI through the considered thermal features (Figure 6a). A feature-based analysis was performed to investigate the relevance of each feature (Figure 6b).
Concerning the nose tip region, the most contributing features to the SI estimation were the kurtosis, the skewness and the standard deviation. The weights associated to the kurtosis and skewness had negative values, thus indicating an inverse relation between SI and the features' trends. The opposite trend was observed for the weights associated to the standard deviation. This pattern seems to be correlated with sweating or vasoconstriction phenomena, occurring with increasing stress [51,52]. In fact, an increase in the standard deviation and a reduction of the kurtosis and skewness parameters (i.e., flatness and asymmetry of the distribution of the related signal [53]) can be associated to a decrease of uniformity of the signal, thus indicating the presence, for instance, of "cold spots" typically present during sweating and vasoconstriction processes [19]. Concerning the nostrils region, instead, a strong inverse relation between the weight associated to the standard deviation and the SI was found. Since the thermal signals from nostrils are highly related to the breathing function, a lower signal variation (revealed by a decrease in the standard deviation) could be associated to a high breathing rate [54]. This result is in accordance with the findings from References [55,56], where it was shown that stress is associated with an increased respiratory rate. Finally, referring to the glabella region, the most relevant feature to the SI estimation was the 90th percentile, i.e., the value below which 90% of data falls. The weight associated to the 90th percentile was directly related to the SI, thus indicating that an increase in temperature of the glabella could be indicative of a stress condition. This result is in accordance with the findings reported in Reference [57] in which an increase in forehead temperature was associated to the execution of high difficult tasks.
To be noted, when using non-linear regressors, the contribution of each feature in predicting the output does not only depend on the relative weight but also on the non-linearities of the model. Nonetheless, the SVR-RBF employed a single parameter depicting the non-linearity extent for all the features considered. Thus, although not directly regressing the input with the output, the weights of RBF-SVR were still associated to the importance of each regressor.
The SIcross was, then, used to perform a two-level classification of the driver's stress (i.e., STRESS versus NO STRESS). The two classes were defined on the basis of the threshold associated to a stress condition assessed by Baevsky's SI [18]. Since the two classes were not balanced, a bootstrap procedure was implemented [47]. The ROC analysis showed a good performance of the classifier with an average AUC of 0.80 (Figure 7b), a sensitivity of 77%, and a specificity of 78% (Table 4).
It is worth noting that the cross-validation and the bootstrap procedures provided the generalization performances of the model, testing its applicability to a wide cohort of drivers. In fact, although stress conditions could elicit different physiological responses among subjects, for ADAS applications, it could be more relevant to detect stress conditions across participants, rather than focusing on a single subject's stress level.
The main benefit of the developed method with respect to the available literature is the use of supervised machine learning approaches, based on the only thermal features, without accounting for vehicle-or driver's behavioral-related parameters, reaching performances comparable with more complex approaches [27]. Furthermore, the developed method opens the way to efficient real-time implementation of drivers' stress state monitoring relying only on thermal IR imaging, being the model already validated and ready to use.
Nonetheless, further studies should be performed to increase the sample size. The machine learning approach used in this study relied on supervised learning which is inherently a data-driven analysis; data-driven analysis is highly affected by the sample size and the performance of the model could indeed improve reducing a possible overfitting effect driven by the limited sample numerosity. To be noted, the present study focused on drivers with a limited age range (i.e., 22-35 years old), involving only young subjects. The most important improvement of the method will be to include in the study sample people with a wider age range. In future studies, beyond increasing the sample size and age range, other factors, such as gender, thermal comfort, and weather conditions during simulated driving sessions, will be considered [58][59][60]. In fact, taking these factors into account could be of fundamental valence in driving stress research, leading to a wide overview of all the aspects concerning the matter of the study.
Furthermore, the present results are relative to simulated driving conditions in which determining variables for IR measurements, like direct ventilation or sunlight, were not considered. Thus, it would be desirable to apply the developed methodology also on real-driving situations, to generalize the applicability of the technique.
As for being state-of-the-art, this is an original and novel study concerning drivers' stress state evaluation by means of thermal imaging, employing supervised machine learning algorithms. This is a preliminary study, addressed to limited and specific experimental conditions which, however, underlines the feasibility of the method to be verified under wider operating situations.

Conclusions
In the present work, a novel and original method allowing for drivers' stress state evaluation was presented. By using machine learning approaches, it was possible to understand and classify, with a good level of accuracy, the stress state of the subjects while driving in a simulated environment. The presented work constitutes the first step towards the establishment of a reliable detection of the stress s in a non-invasive fashion, ensuring to maintain an ecologic condition during measurements.