An Adaptive Fatigue Detection System Based on 3D CNNs and Ensemble Models



Introduction
Drivers falling asleep at the wheel has received considerable attention from researchers in the automotive field, who have developed a range of drowsiness detection systems. This active area of research incorporates various components of the Internet of Things (IoT) and application technology [1], such as sensors, cloud computing, facilities, smartphones, and distributed data processing. To develop a reliable and effective fatigue detection system, researchers typically employ three primary methodologies: behavior-based, vehicle-based, and physical-based approaches [2]. Figure 1 presents an overview of the distinct characteristics of each methodology. Behavior-based methods use computer vision and image processing techniques to evaluate images and videos of the driver in order to assess their level of alertness. This strategy analyzes essential indicators, such as eye blinking, eye closure, yawning and other lip movements, nodding, and head posture, to ascertain whether the driver is awake, drowsy, or asleep [3].
A different approach, known as vehicle-based systems, involves incorporating a driver fatigue detection system into the steering wheel of the vehicle using embedded sensors and devices. This integrated system measures various indicators, including steering wheel velocity, steering wheel angle, steering wheel movement, hand position, lane departure, and hand absence [4].
Physical-based fatigue detection methods employ human bio-signals such as Electroencephalography (EEG), Electrooculography (EOG), and Electrocardiography (ECG) to monitor the driver behind the steering wheel. These methods also rely on other signs, such as respiratory and pulse rates [5].

Related Work
Efforts to develop fatigue detection systems fall into two primary groups: conventional algorithms and machine learning algorithms [6]. Among the machine learning algorithms, the Convolutional Neural Network (CNN) and the Support Vector Machine (SVM) are the most commonly employed and efficient classifiers [7]. Although SVM is quick and precise on small datasets, its accuracy and speed decrease on larger datasets. Conversely, CNN exhibits high accuracy and stability for both small and large datasets, but its training can be time-consuming on CPUs and may incur high processing costs on GPUs [7].
A fatigue detection system is a crucial component in improving driving safety. Previous behavior-based efforts used software to observe driving behavior by capturing real-time images of the driver under infrared illumination [8]. This approach considers multiple parameters, including the percentage of eyelid closure (PERCLOS), face position, blink frequency, nodding frequency, and eye closure duration, to monitor the driver's conduct. A classifier evaluates these parameters to determine the driver's level of alertness. This system surpasses other algorithms in that it can observe and analyze a wide range of factors and collect data in both daytime and nighttime conditions.
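As a rough illustration of the PERCLOS parameter mentioned above, the following minimal sketch computes the fraction of frames in a window in which the eyes are judged closed. The per-frame closed/open flags, the window length, and the function name `perclos` are assumptions made for this example; the source does not specify how [8] computes the measure.

```python
def perclos(eye_closed_flags):
    """Fraction of frames in the window where the eyes are judged closed.

    eye_closed_flags: iterable of booleans, one per video frame
    (hypothetical per-frame output of an eye-state classifier).
    """
    flags = list(eye_closed_flags)
    if not flags:
        raise ValueError("window must contain at least one frame")
    return sum(1 for closed in flags if closed) / len(flags)


# Hypothetical 2-second window at 30 frames per second:
# 15 frames with closed eyes out of 60 frames in total.
window = [True] * 15 + [False] * 45
print(perclos(window))  # 0.25
```

A deployed system would typically evaluate this over a sliding window and compare the value against a threshold; both choices are left open here.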
Abtahi et al. [9] devised a simple image processing method to detect signs of driver fatigue. The approach monitors the driver's face in the image and captures facial features, such as eye and lip movements, to identify yawning and eye fatigue. Fatigue is recognized from changes in the geometric properties of the driver's face. Flores et al. [10] proposed an Advanced Driver Assistance System (ADAS) that detects tiredness by analyzing the driver's face and eyes to evaluate facial expressions and eye movements. The authors tested the system in real time under varying lighting conditions.
Several techniques, such as those described in references [11][12][13][14][15][16], strive to improve the precision of fatigue detection by identifying the same facial characteristics as described in reference [9]. To this end, Sigari et al. [17] developed a method that compares the driver's present head orientation to a pre-existing facial template and projects the top half of the driver's facial image horizontally to detect alterations in eye closure and eyelid distance. A fuzzy-based approach that integrates both parameters was used to automatically activate the algorithm, and it was found to be effective. Nonetheless, it faces difficulties during daylight hours and is incapable of detecting fatigue when the driver is wearing glasses.
In a prior investigation [18], a deep neural network architecture was proposed to identify drowsiness. The approach analyzed the driver's facial characteristics from RGB footage using a feature fusion architecture built from three separate convolutional neural network models: VGG16, InceptionV3, and ResNet50. However, the accuracy of this approach was limited to 78%. On the other hand, Galarza et al. [19] presented an interactive drowsiness detection system running on an Android smartphone that used behavioral data, such as eye position, head posture, and yawning frequency. This method offered consistent performance across different settings (e.g., lighting conditions and driver accessories, such as glasses, caps, or hearing aids) and achieved a detection accuracy of 93.37%.
Bassi et al. [20] conducted a study wherein they developed a fatigue detection system that employed machine learning techniques, including local binary pattern, SVM, and Principal Component Analysis (PCA). The primary goal of the system was to enhance the performance of SVM by selecting the optimal linear, polynomial, and quadratic kernels and assessing their effectiveness. The SVM model's accuracy differed for different kernels, with the polynomial kernel yielding the highest accuracy of 99%. However, this approach was deemed computationally intensive, and its testing necessitated a considerable amount of time, despite its efficacy.
In their study, Ouabida et al. [21] proposed a method for detecting driver fatigue using an optical correlator for driver-eye tracking. Specifically, they employed the Vander Lugt Correlator (VLC) to estimate the position of the eye center and filter out visual noise in challenging settings. Their approach yielded an impressive 95% accuracy rate. However, a notable disadvantage of this technique is its susceptibility to light reflections from external sources, such as other vehicles or streetlights.
An alternative approach to detecting driver fatigue was introduced by Maior et al. [22], who used computer vision and machine learning techniques to extract eye patterns and monitor blink movements from video streams. This method employed SVM, Random Forest (RF), and Multilayer Perceptron (MLP) algorithms and yielded a 94% accuracy rate. Nonetheless, this technique is associated with relatively lengthy processing times.
In their study, Saurav et al. [23] presented a system that utilizes video streaming technology to identify occurrences of yawning. To enhance the accuracy of fatigue detection, advanced deep learning models, specifically Bi-directional Long Short-Term Memory (Bi-LSTM) and CNN, were employed. The system also leverages a camera feed to capture data from the mouth area and distinguish between typical mouth movements and indications of fatigue. The effectiveness of the system was assessed by evaluating it against two datasets, the Yawning Detection Dataset (YawDD) [24] and the National Tsing Hua University Yawning Detection Dataset (NTHUDDD) [25]. The outcome of the evaluation showed a high accuracy rate of 96%.
Biswal et al. [26] devised an intelligent monitoring system that can detect and warn against driver fatigue. The system relies on video streaming and blink analysis techniques to estimate the distance between the eye and the face, as well as the Eye Aspect Ratio (EAR). An advantage of this system is its ability to integrate with IoT modules for traffic incident alerts. Another approach, proposed by Jeon et al. [27], combines vehicle-based and behavioral methods. This method captures data from the steering wheel and pedal pressure sensors and employs Convolutional Neural Networks (CNNs) for classification, with a reported success rate of 94%. However, its primary limitation is that its accuracy fluctuates with changes in the road environment.
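For readers unfamiliar with the Eye Aspect Ratio, the sketch below uses a commonly used six-landmark definition, EAR = (|p2 - p6| + |p3 - p5|) / (2 |p1 - p4|), where p1 and p4 are the eye corners and the remaining points lie on the upper and lower eyelids. The source does not state which formula or landmark layout [26] uses, so the landmark ordering and the coordinates below are assumptions for illustration only.

```python
import math


def eye_aspect_ratio(landmarks):
    """Eye Aspect Ratio from six (x, y) eye landmarks.

    Assumed ordering: p1 and p4 are the horizontal eye corners,
    p2 and p3 lie on the upper eyelid, p6 and p5 on the lower eyelid.
    """
    p1, p2, p3, p4, p5, p6 = landmarks
    horizontal = math.dist(p1, p4)
    if horizontal == 0:
        raise ValueError("eye corners coincide; EAR is undefined")
    vertical = math.dist(p2, p6) + math.dist(p3, p5)
    return vertical / (2.0 * horizontal)


# Hypothetical landmark coordinates (in pixels) for an open and a nearly closed eye.
open_eye = [(0, 0), (1, 1), (2, 1), (3, 0), (2, -1), (1, -1)]
closed_eye = [(0, 0), (1, 0.1), (2, 0.1), (3, 0), (2, -0.1), (1, -0.1)]

print(round(eye_aspect_ratio(open_eye), 3))  # 0.667
print(eye_aspect_ratio(closed_eye) < eye_aspect_ratio(open_eye))  # True
```

The ratio drops toward zero as the eyelids close, which is why it is often thresholded to detect blinks or prolonged eye closure.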
By contrast, a number of algorithms have incorporated machine learning and deep learning models to create physical techniques that rely on input from EEG [28,29], ECG [30], and EOG. These techniques represent a fusion of physical and behavioral strategies. For example, Ko et al. [31] proposed a system that extracts Differential Entropy (DE) from EEG signals and applies CNN for classification. This process generates hierarchical features and class-discriminative information, enabling the detection of sleepiness via a density-connected layer. Similarly, Zhu et al. [32] employed CNN to gather and analyze data from wearable EEG sensors. They employed a pre-trained AlexNet model with CNN to classify the collected EEG signals, resulting in a 94% accuracy rate. However, the primary difficulty associated with this approach lies in the time delay between acquiring EEG data and processing it with CNN.

Novelty and Contributions
The reviewed literature shows that a multitude of feature extraction techniques have been employed to extract the essential features from the input data, and that various classification strategies have been used to attain optimal detection accuracy. Many researchers have investigated fatigue detection methods to prevent driver drowsiness [33][34][35], and others detect sleepiness from the eye closure rate [36]. Nevertheless, despite the use of robust feature extraction and classification methods, the highest detection accuracy attained was 97%, obtained with an ensemble machine learning method on a dataset collected using a driving simulator. The objective of this article is to develop an enhanced system that achieves higher detection accuracy than the existing systems in the literature. This objective is achieved using a cascaded decision system that comprises two stages. The first stage detects yawning and tiredness, while the second detects the eye closure state.
In addition, we deployed machine learning and deep learning, which have been employed in various applications, including emotion recognition [37], speech recognition [38], image reconstruction [39], and medical diagnosis [40][41][42][43]. The present study proposes a deep learning approach based on 2D and 3D CNNs for identifying fatigue in images and videos. The contributions of this study are as follows:
1. Feature extraction from images and videos is performed using the Haar Cascaded Classifier (HCC).
2. A cascaded system that detects both tiredness and eye closure is investigated. To the best of the authors' knowledge, this is the first such system on this topic.
3. An improved approach to detecting fatigue from images is explored, based on the machine learning methods SVM, RF, DT, KNN, QDA, MLP, and LR.
4. Deep learning models based on 2D and 3D CNNs are designed to handle input data in the RGB modality with specific hyperparameters.
5. The proposed models are compared to identify the optimal one in terms of detection accuracy and testing time.
The remaining sections of this paper can be divided into four parts. Section 2 outlines the materials and proposed methods utilized in this study. Section 3 presents a comprehensive analysis of the results obtained from these methods. In Section 4, a brief comparison between the proposed methods and previous works in the literature is discussed. Lastly, Section 5 serves as the conclusion of this paper.

Materials and Methods
This section introduces a method for detecting fatigue-induced drowsiness among drivers by monitoring their behavior in videos and images. The system is composed of four principal phases: feature extraction, preprocessing, scaling, and classification. In the feature extraction phase, the Haar Cascaded Classifier (HCC) is used to extract the driver's facial and ocular features from the captured videos or images. The preprocessed facial images are then enhanced with data augmentation techniques to increase the amount of input data for classification. The enhanced data is standardized and scaled before being passed to the classification models. The overall architecture of the proposed system is depicted in Figure 2. The concept of the proposed system is to detect fatigue (tiredness), which is a primary cause of drowsiness. The algorithm of the proposed system is shown in Algorithm 1. First, the input video, recorded at a rate of 30 frames per second, is fed into the system. Then, the face is extracted from the video frame in the "Face extraction" step. The extracted faces are passed to the "Face detection" step for face and eye recognition. The classification of drowsiness from facial symptoms and yawning is performed in Step 3, while the detection of drowsiness from the detected eyes is performed in Steps 4 and 5. The HCC detects the eyes, which are then fed into the classifiers to determine whether they are open or closed. The steps of the proposed algorithm are as follows:

Input Data
Step 1: Face extraction
Step 2: Face detection
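The two-stage cascade described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the classifier callables, the label strings, and in particular the rule that combines the two stages into an alert are hypothetical, since the text does not specify how the stage outputs are merged.

```python
def cascaded_decision(face_crop, eye_crops, face_classifier, eye_classifier):
    """Run stage 1 on the face crop and stage 2 on each detected eye crop.

    face_classifier and eye_classifier are hypothetical callables that map
    an image crop to a label string.
    """
    # Stage 1: facial symptoms (tiredness, yawning).
    face_state = face_classifier(face_crop)
    # Stage 2: eye closure state for every detected eye.
    eye_states = [eye_classifier(eye) for eye in eye_crops]
    eyes_closed = bool(eye_states) and all(s == "eye closed" for s in eye_states)
    # Assumed combination rule: alert on any fatigue symptom or on closed eyes.
    raise_alert = face_state in {"tired", "non-vigilant", "yawn"} or eyes_closed
    return {"face_state": face_state, "eye_states": eye_states, "raise_alert": raise_alert}


# Stand-in classifiers that return fixed labels, for demonstration only.
def always_alert(crop):
    return "alert"


def always_closed(crop):
    return "eye closed"


result = cascaded_decision("face", ["left eye", "right eye"], always_alert, always_closed)
print(result["raise_alert"])  # True
```

Keeping the two stages as separate callables lets either classifier (for example, an RF model or a 3D CNN) be swapped in without changing the decision logic.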

Image Augmentation
This paper presents a data augmentation technique based on a Generative Adversarial Network (GAN), which has been shown to be effective in several applications. In this study, we apply the Convolutional GAN (CGAN) method to augment the input images. Unlike the standard use of GANs, our study uses them solely for data augmentation and not for classification purposes.
Specifically, our CGAN comprises a generator network and a discriminator network, as shown in Figure 3. The generator consists of five convolutional transpose layers and a denoising fully connected layer to generate feature maps from input images. The discriminator comprises five convolutional layers and a denoising fully connected layer to reconstruct the original image. The generated images are used to augment the available dataset, improving the performance of the deep learning models.

Classification
The machine learning approach uses the following classifiers: Decision Tree (DT), K-Nearest Neighbors (KNN), SVM, RF, MLP with backpropagation, Quadratic Discriminant Analysis (QDA), and Logistic Regression (LR). The hyperparameters of these algorithms are chosen automatically using the grid search method [44,45].
The deep learning methodology consists of two deep learning models. The first model, the 3D CNN, comprises 16 layers. Its structure covers several tasks: feature extraction, feature reduction, full connectivity, and classification. The feature extraction stage employs four 3D convolutional layers with 32, 64, 64, and 128 filters. The feature reduction task is accomplished through four 3D max pooling layers with a window size of 2. Both tasks operate in the 3D modality to handle the depth of the input images, where the input video frames are fed into the model with an input shape of (162, 162, 3). Consequently, this model is intended to account for any color changes that may occur across the frame's color channels.
The fully connected task is handled by a 3D global average pooling (3D GAP) layer, which processes the 3D feature map produced by the sequence of convolutional and pooling layers. The GAP layer is followed by a series of dense layers that produce a feature vector, which is then fed into the classification layer. This classification layer is a dense layer with a softmax activation function. The architectures of the proposed 2D and 3D models are shown in Figures 4 and 5.
Figure 4. Architecture of the proposed 3D CNN model.
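To make the effect of the four pooling stages concrete, the sketch below traces the spatial size of the feature maps through the convolution and pooling sequence described in this section. It assumes that the convolutions preserve the spatial size ("same" padding) and that each pooling layer uses a stride equal to its window of 2 without padding; neither detail is stated in the text, and the depth axis is left out because its handling is not specified.

```python
def trace_spatial_size(size, pool_window=2, stages=4):
    """Return the spatial side length before and after each pooling stage.

    Assumes size-preserving convolutions and unpadded pooling whose
    stride equals the pooling window (floor division).
    """
    sizes = [size]
    for _ in range(stages):
        size = size // pool_window
        sizes.append(size)
    return sizes


FILTERS = [32, 64, 64, 128]  # filters per convolutional layer, as given in the text
sizes = trace_spatial_size(162)

for stage, (n_filters, side) in enumerate(zip(FILTERS, sizes[1:]), start=1):
    print(f"stage {stage}: {side} x {side} spatial size, {n_filters} channels")

# Global average pooling collapses the spatial axes, leaving one value per channel.
print("vector length after global average pooling:", FILTERS[-1])
```

Under these assumptions the 162 x 162 input shrinks to 81, 40, 20, and finally 10 pixels per side, and global average pooling yields a 128-element vector for the dense layers.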

Results
This section assesses the proposed fatigue detection techniques. First, the datasets employed in this research are described. Second, the evaluation metrics used to measure the effectiveness of the proposed methods are presented. Third, the hyperparameters used in this study are given. The experimental results are then reported and discussed. Finally, a comparative analysis is conducted to clarify the strengths and limitations of the proposed approaches.
The proposed techniques were assessed on a personal computer with an Intel Core i7 CPU, an NVIDIA GPU with 8 GB of memory, and 32 GB of RAM, running the Windows 11 operating system. These specifications were sufficient to process videos at a frame rate of 30 frames per second (a frame interval of 33.33 ms), as shown in the simulation results. The code for the proposed approaches, including the design of the models with their layers and learning parameters, was developed in Python 3.8 using the Keras and TensorFlow toolkits.

Datasets
The proposed techniques are evaluated on the "ULg Multimodality Drowsiness Database," commonly abbreviated as DROZY [46]. This database comprises two segments. The first segment contains data from 14 young, healthy individuals, three males and eleven females, collected through video monitoring. The data in this segment was gathered using Kinect technology and video sensors with Near-Infrared (NIR) sensitivity, at a resolution of 512 × 424 pixels in MP4 format. Examples of NIR intensity scenes generated from video frames are displayed in Figure 6. This dataset is recorded at 30 frames per second (a frame interval of 33.33 ms), with 17,000 frames per person. Table 1 lists the number of images for each person in this publicly available dataset. The second dataset is the drowsiness dataset [24]. This dataset comprises images of drivers with different eye and face symptoms: eyes closed or open, and yawning or not. The objective is to distinguish among these states using the proposed deep learning models. In addition, data augmentation is employed to increase the number of images fed into the deep learning models. Figure 7 shows samples from this dataset.
The datasets are shuffled and split into training and testing subsets with an 80/20 ratio. In addition, the training process is performed using k-fold cross-validation with a k value of 10.
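The split described above can be sketched with the standard library alone. The sketch assumes that the 10-fold cross-validation is applied to the 80% training subset, which the text does not state explicitly; the function names and the fixed random seed are illustrative choices.

```python
import random


def train_test_split_indices(n_samples, test_ratio=0.2, seed=0):
    """Shuffle sample indices and split them into training and testing lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(round(n_samples * test_ratio))
    return indices[n_test:], indices[:n_test]


def k_fold(indices, k=10):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# Hypothetical dataset of 100 samples.
train_idx, test_idx = train_test_split_indices(100)
print(len(train_idx), len(test_idx))  # 80 20

for fold_train, fold_val in k_fold(train_idx, k=10):
    # Each fold holds out a different tenth of the training subset.
    assert len(fold_train) + len(fold_val) == len(train_idx)
```

Every training sample appears in exactly one validation fold, while the 20% test subset stays untouched until the final evaluation.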

Evaluation Metrics
To evaluate the proposed approaches, several evaluation metrics are used: accuracy, recall, precision, F1 score, and Matthews Correlation Coefficient (MCC). These metrics are defined by Equations (1) to (5) [47]. False Negative (FN) refers to the number of instances in which drowsy states are mistakenly identified as normal. True Positive (TP) denotes the number of drowsy states correctly identified as such. True Negative (TN) is the number of normal states accurately identified as normal. False Positive (FP) refers to the number of normal states inaccurately identified as drowsy.
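Equations (1) to (5) are not reproduced in this text. As a reference point, the sketch below computes the five metrics from the four counts using their standard binary-classification definitions; how the authors average them over more than two classes is not stated, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
import math


def binary_metrics(tp, tn, fp, fn):
    """Standard accuracy, precision, recall, F1 score, and MCC from confusion counts."""

    def safe_div(num, den):
        # Convention for this sketch: an undefined ratio is reported as 0.0.
        return num / den if den else 0.0

    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    f1 = safe_div(2 * precision * recall, precision + recall)
    mcc = safe_div(
        tp * tn - fp * fn,
        math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    )
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "mcc": mcc}


# Hypothetical counts: 40 drowsy states detected, 10 missed,
# 45 normal states recognized, 5 normal states flagged as drowsy.
metrics = binary_metrics(tp=40, tn=45, fp=5, fn=10)
print(round(metrics["accuracy"], 2), round(metrics["recall"], 2))  # 0.85 0.8
```

Unlike accuracy, MCC uses all four counts in both numerator and denominator, so it stays informative when the drowsy and normal classes are imbalanced.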

Hyperparameter Setting
This study uses grid search to select the hyperparameters for both the machine learning and deep learning approaches. The objective is to identify the hyperparameter values that yield maximum accuracy. Table 2 lists the hyperparameters selected for the proposed methods, which are obtained through 100 iterations for each model. For the deep learning approach, the model is trained repeatedly with different values for the optimizer, learning rate, and activation function of the layers in order to select the optimal hyperparameters. Figure 8 presents the accuracy learning curve during hyperparameter optimization, demonstrating that model performance improves across runs as the hyperparameter values vary. Table 3 displays some of the iterations conducted for hyperparameter optimization of the deep learning model.
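The grid search procedure can be illustrated with a minimal, framework-free sketch. The candidate values and the scoring function below are placeholders invented for this example; in practice the scoring function would train a model with the given hyperparameters and return its validation accuracy, and the actual search space is the one reported in Tables 2 and 3.

```python
from itertools import product


def grid_search(param_grid, evaluate):
    """Try every combination in param_grid and return the best one.

    param_grid: dict mapping a hyperparameter name to a list of candidate values.
    evaluate: callable taking a dict of hyperparameters and returning a score.
    """
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Hypothetical search space over the three hyperparameters named in the text.
grid = {
    "optimizer": ["adam", "sgd"],
    "learning_rate": [1e-2, 1e-3, 1e-4],
    "activation": ["relu", "tanh"],
}


def toy_score(params):
    # Stand-in for "train the model and measure validation accuracy".
    return (
        (params["optimizer"] == "adam")
        + (params["learning_rate"] == 1e-3)
        + (params["activation"] == "relu")
    )


best_params, best_score = grid_search(grid, toy_score)
print(best_params)
```

The cost grows with the product of the list lengths (12 combinations here), which is why the number of candidate values per hyperparameter is usually kept small.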

Simulation Results
This section presents the simulation results of the proposed models on the image and video datasets discussed previously. In addition, the proposed models are compared to identify the best method. The proposed methods are evaluated in three main scenarios. The first scenario includes three states: alert, tired, and non-vigilant. The second comprises four categories: eye open, eye closed, yawn, and no yawn. The last scenario combines the first two and therefore comprises seven categories. The following subsections discuss the simulation results of each scenario.

Simulation Results of DROZY Video Dataset
This study employs a fatigue detection technique that analyzes video frames to determine the driver's level of awareness, specifically whether the driver is alert, tired, or non-vigilant. The objective is to develop an accurate model with low testing time, using both machine learning and deep learning models. The machine learning approach includes SVM, RF, DT, KNN, QDA, MLP, and LR, while the deep learning approach involves 2D and 3D CNNs. Figures 9 and 10 present the learning curves of the proposed models, demonstrating that model performance improves during training. Table 4 lists the evaluation metrics of the proposed models, including precision, recall, accuracy, and F1-score. The simulation results indicate that the SVM, RF, and KNN machine learning models perform best for fatigue detection from videos, each achieving an accuracy of 99%. The proposed 3D CNN deep learning model also achieves an accuracy of 99%. Thus, the proposed models offer effective solutions for video fatigue detection.
A second scenario is based on still images of the driver. The proposed models are applied to an image dataset of facial symptoms with four categories: eye open, eye closed, yawn, and no yawn. As in the previous scenario, both the machine learning and deep learning models are evaluated. The learning curves of the proposed models are shown in Figures 11 and 12 for machine learning and deep learning, respectively. Table 5 lists the evaluation metrics of the proposed models. This scenario is more challenging than the previous one, as the performance of the proposed models shows. The RF and LR machine learning models achieved an accuracy of 93%, while the 2D and 3D CNNs achieved 95% and 98% accuracy, respectively. Therefore, in this scenario, the proposed 3D CNN outperforms the other proposed models for detecting facial symptoms.
To provide a general and robust scenario, we combined the image and video datasets and fed them into the proposed models. This scenario provides seven classification categories covering the driver's status and facial symptoms: alert, non-vigilant, tired, eye open, eye closed, yawn, and no yawn. The proposed machine learning and deep learning models are evaluated on the combined dataset. Figures 13 and 14 show the learning curves of the proposed machine learning and deep learning models, respectively, and the simulation results are listed in Table 6. The results reveal that RF is the best of the machine learning models and the 3D CNN is the best of the deep learning models, achieving 90% and 98% accuracy, respectively. Therefore, they can be considered efficient solutions for this more general setting.

Explainability and Features Impact
In this section, SHAP summary plots are used to rank the features by their impact. The SHAP summary plot in Figure 15 displays each feature as a row, with each dot denoting the impact of that feature in a specific instance. The colors on the plot denote the feature value, with blue indicating a low value and red indicating a high value. Analysis of the summary plot reveals several key observations: (1) feature "2865" exerts a significant influence on the overall decision; (2) an increase in this feature has a positive effect on the overall score; (3) conversely, a decrease in the value of features "3255", "4810", and "6724" has a positive impact on the calculated score. Features with long tails to the right are likely to have a positive effect on the overall decision.

Results Discussion and Comparison
This paper presents a system for detecting drowsiness based on video and image monitoring. The proposed system encompasses three key tasks: feature extraction, preprocessing, and classification. The Haar Cascaded Classifier (HCC) is used to extract the required features, identifying the face and eyes in both images and video frames. Classification is then carried out using both machine learning and deep learning algorithms. The effectiveness of the proposed system is evaluated in three-class, four-class, and seven-class scenarios. The proposed models are also compared with one another to identify the optimal method. Figure 16 illustrates a comparative analysis of the proposed models in the different scenarios. To demonstrate the effectiveness of the proposed methods, their performance is also compared to existing works in the literature, covering works that deployed the same methods as this study and works carried out on the same datasets. Specifically, the deep learning approach introduced in this study is compared to similar methods presented by Maior et al. [22], Biswal et al. [26], Jeon et al. [27], Gwak et al. [48], and Bakheet and colleagues [49]. The algorithm proposed by Gwak et al. [48] is included in both comparisons since it involves experiments in both video streaming scenarios. A comparison between the proposed deep learning approach and other video streaming-based algorithms is presented in Table 7. The simulation results indicate that the proposed methods outperform previous efforts in this field.

Conclusions
This paper has addressed the issue of fatigue detection and proposed an artificial intelligence-based solution. The proposed system comprises two main tasks: feature extraction and classification. The HCC algorithm has been employed to extract the relevant features, and several datasets have been used to evaluate the performance of the proposed classifiers: SVM, RF, DT, KNN, QDA, MLP, LR, 2D CNN, and 3D CNN. The results indicate that the proposed 3D CNN classifier outperforms the other models. Additionally, the proposed models perform well compared with those reported in the literature, making them a promising and effective solution for fatigue detection.
In future work, the authors intend to expand the scope of this research in several directions. First, the proposed models can be implemented as a practical solution for the market, for example on an NVIDIA Jetson Nano. Second, the proposed models can be validated on additional datasets that include more categories. Finally, the authors aim to explore fatigue detection using fused features obtained from both visual and medical signal modalities.