1. Introduction
With the rapid development of industrial technology, the structure of transportation has changed fundamentally. Although the popularity of cars has made travel more efficient and convenient, it has also brought inevitable traffic accidents. According to statistics, the main causes of traffic accidents are closely related to fatigue, drunk driving, overloading, and speeding. In particular, fatigued driving accounts for 14–20% of all traffic accidents, and its share of major traffic accidents reaches as high as 43%; accidents involving large trucks and those on highways account for approximately 37% [1]. This is because, after long periods of intense driving, a driver's muscles and mental state become relaxed and fatigued, reducing reaction and anticipation abilities and thereby posing a serious threat to the driver and the surroundings [2]. Therefore, in-depth research on fatigued-driving detection is of great significance for reducing the rate of traffic accidents and ensuring personal and property safety.
Currently, research on driver-fatigue detection focuses mainly on road traffic and can be divided into three approaches: detection based on vehicle driving characteristics [3,4,5], detection based on driver physiological characteristics [6,7,8], and computer-vision detection based on driver facial features [9,10,11,12,13,14,15,16,17,18,19,20,21,22]. Among them, vision-based detection uses cameras or other image sensors to capture changes in the driver's facial features or head movement. Deep-learning algorithms locate and analyze eye features (blink frequency, eye aspect ratio, cumulative closed-eye time, etc.), mouth features (degree of mouth opening, mouth aspect ratio, etc.), head-pose features (yaw angle, pitch angle, etc.), and facial-expression features, achieving fatigue detection through single- or multi-feature fusion. This method is non-invasive: it can not only accurately determine the driver's fatigue level but also issue timely and effective warnings, making it the main current research focus. Among the many extractable facial features, those that reliably indicate fatigued driving are particularly important. Accordingly, in 1998, an important evaluation index called "PERCLOS" [9] emerged, which determines fatigue from the percentage of closed-eye time within a given time period and has been widely used in fatigue identification. The research by Dziuda et al. [10] further validated the importance of PERCLOS in fatigued-driving detection: they evaluated eight professional truck drivers by calculating PERCLOS, duration of eye closure, and blink frequency from facial images. The results showed that PERCLOS is an important predictive variable and is regarded as a key indicator in fatigued-driving detection research.
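As a rough illustration, PERCLOS over a sliding window of per-frame eye-state decisions can be computed as in the sketch below; the window length and alarm threshold here are illustrative assumptions, not values from the cited studies:

```python
from collections import deque

def perclos(closed_flags):
    """Fraction of frames within the window in which the eyes are
    judged closed (illustrative; not the exact formulation of [9])."""
    if not closed_flags:
        return 0.0
    return sum(closed_flags) / len(closed_flags)

# Sliding window of per-frame eye states (True = closed),
# e.g. 60 s at 30 fps; in a live system, append one flag per frame.
window = deque([False] * 1500 + [True] * 300, maxlen=1800)

score = perclos(list(window))
print(round(score, 3))   # 300 / 1800 ≈ 0.167
print(score > 0.15)      # True under an illustrative 0.15 alarm threshold
```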
Early fatigued-driving detection methods mainly extracted a single facial feature. Alioua et al. [11] proposed a yawning-detection algorithm that extracts the mouth region via the Hough transform and classifies it with an SVM detector, reaching an accuracy of 98%. Zhang et al. [12] combined long short-term memory networks and convolutional neural networks in a fatigued-driving detection method that analyzes temporal features, achieving an accuracy of over 87% on drivers' continuous yawning states. Knapik et al. [13] introduced a more innovative approach, using an infrared-thermography model to detect drivers' yawning and integrating it into an advanced driver-assistance system, enabling fatigued-driving detection without interference under both day and night conditions. However, these methods did not consider the feature loss and increased false-detection rates caused by occlusion and large changes in driver posture, and they exhibited poor stability. To address these issues, researchers have widely adopted facial multi-feature fusion to overcome the drawbacks of single-feature extraction and the interference of the external environment with the driver's state. Among these approaches, combining the multi-keypoint localization of MTCNN [14] and Dlib [15] with multiple features such as the eyes, mouth, and head pose has been applied most often. Deng et al. [16] used an improved MedianFlow face tracking-and-detection algorithm with MTCNN to locate and track the eyes and fused blink frequency with head-position changes to realize fatigued-driving detection. Liu et al. [17] used the multi-task cascaded convolutional neural network MTCNN to locate five key facial positions around the driver's eyes and mouth; based on the PERCLOS criterion and fuzzy-inference principles, eye and mouth fatigue-feature parameters were fused to determine the driver's fatigue level. Liu et al. [18] used multi-scale block local binary patterns (MB-LBP) and an Adaboost classifier to extract 24 facial keypoints; from the states of the eyes and mouth, PERCLOS and yawning frequency were calculated, and a fuzzy-inference system inferred the driver's fatigue state. However, MTCNN's five-point localization and 24 facial keypoints cannot fully cover the facial-feature region, and the stability and accuracy of feature extraction are easily affected by the external environment. Some scholars therefore used Dlib to extract 68 facial keypoints, which cover the facial features more comprehensively, and on this basis calculated the eye and mouth aspect ratios and combined them with PERCLOS and the driver's head-pose changes for recognition and judgment. Experimental results showed that head-pose estimation based on the 68 facial keypoints can accurately determine a fatigue state with high robustness [19], but it lacks consideration of lightweight processing. Building on Dlib, Li et al. [20] used an improved lightweight network, YOLOv3-tiny, to build driver-identification and fatigue-evaluation models through online evaluation; by calculating the driver's eye-closure time, blink frequency, and yawning frequency, they achieved a fatigued-driving detection accuracy of 95.1%, a remarkable result. However, the YOLOv3-tiny weight model is still not lightweight enough and needs further optimization for deployment on in-vehicle terminals. Babu et al. [21] developed a drowsiness-recognition system using Python and Dlib, including face detection and head-pose detection, which achieved 94.51% accuracy in real-time video detection. Cai et al. [22] used a multi-thread-optimized Dlib to narrow the facial-feature region to the real-time changes of the eyes, mouth, and head and fused multiple feature subsets to realize a fatigued-driving detection method based on D-S evidence theory. However, a critical problem largely overlooked in the above studies is that the Dlib 2D facial-landmark extraction library tends to lose feature points and performs poorly in real time when the driver's head pose changes significantly.
In summary, effective detection of fatigued driving first requires high-precision detection of facial features. However, most previous studies, in pursuing higher network detection accuracy, overlooked the limited computing resources of application terminals. Moreover, traditional methods such as MTCNN five-point localization and Dlib two-dimensional keypoint extraction still need improvement in stability and detection speed, and their relatively poor real-time performance restricts the deployment of fatigued-driving detection systems on onboard terminals. The current research focus is therefore on improving the accuracy of multi-feature facial detection while keeping the model lightweight, and on efficiently and stably extracting facial-feature points to construct a fatigued-driving detection model. To address these issues, this paper first lightens the backbone network of the YOLOv5s baseline model using ShuffleNetv2_BD. Then, the maxpool cross-scale aggregation module (M-CFAM) and the context-information fusion module (L-CIFM) are used to promote the fusion of deep and shallow features, enhance the ability of deep features to extract facial information, and reduce the loss of shallow-feature categories. In addition, the CIoU in the baseline model is replaced with WIoU, and the loss function is reconstructed using the static focusing mechanism (MF) to accelerate the convergence of the model. Finally, on the basis of lightweight facial-feature detection, Attention Mesh is used to extract 468 three-dimensional facial keypoints and calculate the aspect ratios of the eyes and mouth. A fatigued-driving detection model is then constructed by fusing the number of continuous closed-eye frames, the number of continuous yawning frames, and the eye and mouth PERCLOS thresholds.
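As background, the eye (and mouth) aspect ratio is typically computed from a small set of landmarks per region; the sketch below uses the common six-landmark formulation with made-up coordinates, not actual Attention Mesh landmark indices:

```python
import math

def aspect_ratio(p):
    """Aspect ratio from six landmarks: p[0] and p[3] are the horizontal
    corners, (p[1], p[5]) and (p[2], p[4]) the two vertical pairs.
    Landmark ordering is illustrative; real indices depend on the
    landmark model in use."""
    vertical = math.dist(p[1], p[5]) + math.dist(p[2], p[4])
    horizontal = 2.0 * math.dist(p[0], p[3])
    return vertical / horizontal

# Toy coordinates: a wide-open eye vs. a nearly closed one
open_eye = [(0, 0), (1, 1), (2, 1), (3, 0), (2, -1), (1, -1)]
closed_eye = [(0, 0), (1, 0.1), (2, 0.1), (3, 0), (2, -0.1), (1, -0.1)]
print(round(aspect_ratio(open_eye), 3))    # 0.667
print(round(aspect_ratio(closed_eye), 3))  # 0.067
```

The ratio is largest when the eye or mouth is open and falls toward zero as it closes, which is why a simple threshold on it serves as a per-frame open/closed decision.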
Experimental results show that the proposed model achieves high-precision real-time detection and judgment with a significantly reduced parameter count and computational complexity, laying a theoretical foundation for deployment on mobile terminals.
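The kind of threshold-based multi-feature fusion summarized above can be sketched as follows; the window length and all thresholds here are placeholders, not the tuned values used in this paper:

```python
def longest_run(flags):
    """Length of the longest consecutive run of True values."""
    best = cur = 0
    for f in flags:
        cur = cur + 1 if f else 0
        best = max(best, cur)
    return best

def is_fatigued(eye_closed, mouth_open,
                perclos_thresh=0.25, run_thresh=45):
    """Flag fatigue when the eye or mouth PERCLOS-style ratio exceeds a
    threshold, or when a closed-eye/yawn run persists too long.
    All parameter values are illustrative placeholders."""
    n = len(eye_closed)
    return (sum(eye_closed) / n > perclos_thresh
            or sum(mouth_open) / n > perclos_thresh
            or longest_run(eye_closed) >= run_thresh
            or longest_run(mouth_open) >= run_thresh)

# 90-frame toy window: a 50-frame continuous eye closure triggers the flag
eyes = [False] * 20 + [True] * 50 + [False] * 20
mouth = [False] * 90
print(is_fatigued(eyes, mouth))  # True
```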
The remaining parts of this article are organized as follows:
Section 2 introduces the basic architecture of YOLOv5s and proposes optimization and improvement solutions for face-detection networks.
Section 3 describes the extraction of fatigue-feature points and the construction of the fatigue-determination model.
Section 4 validates and discusses the effectiveness of the improved face-detection algorithm and fatigued-driving determination model through experiments.
Section 5 presents the conclusion and summarizes the entire article.
5. Discussion and Conclusions
This paper introduces a lightweight and robust driver-face detection method and a fatigued-driving determination model. Firstly, ShuffleNetv2_BD is used to reconstruct the backbone of YOLOv5s, making the network lightweight and improving the training speed of the deep linear network. Secondly, M-CFAM is introduced between the backbone and neck networks to enhance the cross-scale fusion of facial features and reduce the loss of shallow features. Then, L-CIFM is introduced to enhance the neck network's extraction of facial-region features. In addition, to accelerate the convergence of the model, WIoU is introduced as a new loss metric and the loss function is redefined. Comparative experiments show that the proposed algorithm reduces the parameters and model size by 58% and 56.3%, respectively, compared with the baseline model, while the floating-point operations amount to only 5.9 GFLOPs. The mAP on the self-built dataset reaches 95.7%, an improvement of 1%. This indicates that the algorithm not only performs well in terms of being lightweight but also effectively improves detection performance. Finally, based on the eye-mouth aspect-ratio thresholds calculated from three-dimensional keypoints and the detection of facial-feature categories, the designed fatigued-driving determination model comprehensively judged 135 instance samples within a specified unit-cycle frame using the number of frames of continuous eye closure and continuous yawning as well as the eye-mouth PERCLOS thresholds, achieving a recognition accuracy of 96.3% and a high level of fatigued-driving detection.
This verifies that the face-detection algorithm and fatigue-determination model can effectively detect and assess drivers in real time across different driving scenarios, genders, and driving characteristics, demonstrating strong robustness and providing support for the porting and deployment of fatigued-driving detection.
The algorithm in this paper is designed for fatigued-driving detection and has achieved good results. However, it does not yet consider extremely complex scenes or face-occlusion problems. In future work, we will increase the coverage of complex-scene data and introduce tracking algorithms to compensate for insufficient or lost feature extraction in extreme driving scenarios with occlusion, improving applicability across multiple scenarios.