Research on Fatigued-Driving Detection Method by Integrating Lightweight YOLOv5s and Facial 3D Keypoints

In response to the problem of the high computational and parameter requirements of fatigued-driving detection models, as well as their weak facial-feature keypoint extraction capability, this paper proposes a lightweight, real-time fatigued-driving detection model based on an improved YOLOv5s and the Attention Mesh 3D keypoint extraction method. The main strategies are as follows: (1) Using Shufflenetv2_BD to reconstruct the Backbone network to reduce parameter complexity and computational load. (2) Introducing and improving the fusion method of the Cross-scale Aggregation Module (CAM) between the Backbone and Neck networks to reduce information loss in shallow features of the closed-eyes and closed-mouth categories. (3) Building a lightweight Context Information Fusion Module by combining the efficient multi-scale attention module (EMA) and Depthwise Over-Parameterized Convolution (DoConv) to enhance the Neck network's ability to extract facial features. (4) Redefining the loss function using Wise-IoU (WIoU) to accelerate model convergence. Finally, the fatigued-driving detection model is constructed by combining the classification detection results with the thresholds of continuous closed-eye frames, continuous yawning frames, and the PERCLOS (Percentage of Eyelid Closure over the Pupil over Time) of the eyes and mouth. With the number of parameters and the model size reduced by 58% and 56.3%, respectively, relative to the baseline, and floating-point computation of only 5.9 GFLOPs, the average precision is increased by 1% over the baseline and the fatigue-recognition rate reaches 96.3%, which shows that the proposed algorithm achieves accurate, stable, real-time detection while remaining lightweight. It provides strong support for lightweight deployment on vehicle terminals.


Introduction
With the rapid development of industrial technology, there have been fundamental changes in the structure of transportation. Although the popularity of cars has made travel more efficient and convenient, it has also brought about inevitable traffic accidents. According to statistics, the main causes of traffic accidents are closely related to fatigue, drunk driving, overloading, and speeding. In particular, fatigued driving accounts for 14-20% of all traffic accidents, with the occurrence rate of major traffic accidents reaching as high as 43%; traffic accidents involving large trucks and highways account for approximately 37% [1]. This is because, after long periods of intense driving, a driver's muscles relax and mental state becomes fatigued, reducing reaction and anticipation abilities and thereby posing a serious threat to life and the surroundings [2]. Therefore, in-depth research on fatigued-driving detection is of great significance in reducing the occurrence rate of traffic accidents and ensuring personal and property safety.
Currently, research on driver-fatigue detection mainly focuses on the field of road traffic and can be divided into three methods: detection based on vehicle driving characteristics [3][4][5], detection based on driver physiological characteristics [6][7][8], and detection based on facial features. Among the facial-feature methods, one study built on YOLOv3-tiny calculated the driver's eye-closed time, blink frequency, and yawning frequency, achieving a fatigued-driving detection accuracy of 95.1%, with remarkable effect. However, the YOLOv3-tiny weight model is still not lightweight enough and needs further optimization for lightweight deployment on in-vehicle terminals. Furthermore, Babu et al. [21] developed a drowsiness-recognition system using Python and Dlib, including face detection and head-pose detection, which achieved 94.51% accuracy in real-time video detection. Cai et al. [22] used a multi-thread-optimized Dlib to narrow the face-feature region to the real-time changes of the eyes, mouth, and head, and fused multiple feature subsets to realize a fatigued-driving detection method based on D-S evidence theory. However, a critical problem largely overlooked in the above studies is that the Dlib 2D facial-landmark extraction library tends to lose feature points and performs poorly in real time when the driver's head pose changes significantly.
In summary, to achieve effective detection of fatigued driving, high-precision detection of facial features is a primary requirement. However, in most previous studies, in order to improve the detection accuracy of the network model, the limited computing resources of the application terminal were often overlooked. Moreover, traditional methods, such as MTCNN five-point localization and Dlib two-dimensional keypoint extraction, still need improvement in terms of stability and detection speed, and their relatively poor real-time performance to some extent restricts the deployment of fatigued-driving detection systems in onboard terminals. Therefore, the current research focus is on how to improve the accuracy of multi-feature detection of facial features while keeping the model lightweight, and how to efficiently and stably extract facial-feature points to construct a fatigued-driving detection model. To address these issues, this paper first applies lightweight processing to the backbone network using ShuffleNetv2_BD on the basis of the YOLOv5s baseline model. Then, the maxpool cross-scale feature aggregation module (M-CFAM) and the lightweight contextual-information-fusion module (L-CIFM) are used to promote the fusion of deep and shallow features, enhance the ability of deep features to extract facial information, and reduce information loss in shallow-feature categories. In addition, the CIoU in the baseline model is replaced with WIoU, and the loss function is reconstructed using the dynamic non-monotonic focusing mechanism (FM) to accelerate the convergence of the model. Finally, based on lightweight facial-feature detection, Attention Mesh is used to extract 468 three-dimensional facial keypoints and calculate the aspect ratios of the eyes and mouth. A fatigued-driving detection model is constructed based on the fusion of features including the number of continuous closed-eye frames, the number of continuous yawning frames, and the thresholds of eye and mouth PERCLOS.
Experimental results show that the model designed in this paper can achieve high-precision real-time detection and judgment with significantly reduced parameter and computational complexity, laying a theoretical foundation for deployment on mobile terminals.
The remaining parts of this article are organized as follows: Section 2 introduces the basic architecture of YOLOv5s and proposes optimization and improvement solutions for face-detection networks. Section 3 extracts fatigue-feature points and constructs a fatigue-determination model. Section 4 validates and discusses the effectiveness of the improved face-detection algorithm and fatigued-driving determination model through experiments. Section 5 presents the conclusion and summarizes the entire article.

The Basic Architecture of YOLOv5s
As a one-stage object-detection algorithm, YOLOv5 can be divided into five models with progressively increasing scales based on different depth factors and width factors: YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, and YOLOv5x. Considering practical application scenarios and computational costs, this article takes YOLOv5s as the baseline model, which consists of the following four components (as shown in Figure 1): (1) Input: Adopting Mosaic data augmentation, adaptive image scaling, and anchor box computation to enhance model training speed and reduce redundant information. (2) Backbone Network: By introducing the CBS convolutional structure and C3 module, the backbone network can perform targeted downsampling, selectively preserving detailed information of the target features, and effectively preventing degradation of network performance. (3) Feature Fusion Network (Neck): The FPN [23] and PAN [24] structures can enhance the network's feature-fusion capability, reduce information loss during downsampling, and achieve effective fusion of information at different scales, enriching the texture information of shallow features and the semantic structure of deep features. (4) Prediction: The CIoU [25] loss function is used, which considers the area overlap, aspect ratio, and center point distance between the ground truth box and the predicted box. This ensures a good fit for width and height even when the center points of the ground truth and predicted boxes overlap or are very close. The predicted redundant information is then filtered using NMS (non-maximum suppression) to enhance the effective detection of the target region.
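The NMS filtering step mentioned in the Prediction component can be sketched in plain Python; this is a minimal greedy implementation, and the function names and the 0.5 IoU threshold are illustrative, not taken from the paper:

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2); IoU = intersection area / union area.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    # Greedy non-maximum suppression: keep the highest-scoring box,
    # then drop any remaining box that overlaps it above iou_thresh.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

For example, two heavily overlapping face boxes collapse to the higher-scoring one, while a distant box survives: `nms([(0,0,10,10), (1,1,10,10), (20,20,30,30)], [0.9, 0.8, 0.7])` keeps indices `[0, 2]`.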

Feature Extraction Backbone
The backbone of YOLOv5s utilizes CSPDarknet, which contains multiple deep convolutions for feature extraction, resulting in a relatively high computational load. Therefore, in order to effectively balance detection speed and accuracy and reduce the model's parameter and computational load, ShuffleNetv2 [26] is introduced as the backbone of the baseline model. It is designed on the basis of ShuffleNet [27] and consists of two parts: the basic unit and the downsampling unit. First, the basic unit adopts Channel Split to divide the input feature channels into two paths. One path performs identity mapping to preserve the original features, while the other applies two 1 × 1 standard convolutions and one 3 × 3 depthwise convolution for dimension reduction and speedup, balancing the channels. Then, Channel Shuffle is used to increase information transfer between the two branches and promote feature fusion. The downsampling unit removes the Channel Split and introduces a depthwise convolution with a stride S of 2 in both channel paths to achieve a lightweight network.
The adoption of depthwise convolution can effectively reduce the computational and parameter complexity of the model. However, compared to standard convolution, it reduces the search space of convolutional kernel parameters, decreasing the network's representation capacity during feature extraction and fusion. Therefore, depthwise over-parameterized convolution (DoConv) [28] is introduced to replace DWConv in the Shuffle_Block, allowing the depthwise convolution to be folded into a compact single-layer representation so that only one layer is used during inference. The basic unit structure of the improved Shufflenetv2_BD is shown in Figure 2a, and the downsampling unit structure is shown in Figure 2b.
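The Channel Split and Channel Shuffle operations above can be sketched in plain Python on a channels-first feature map, represented here simply as a list of per-channel planes (the group count of 2 matches the two branches; the names are illustrative):

```python
def channel_split(channels):
    # ShuffleNetV2 basic unit: split the channel list into two equal
    # branches (identity branch and convolution branch).
    half = len(channels) // 2
    return channels[:half], channels[half:]

def channel_shuffle(channels, groups=2):
    # Reshape (groups, c/groups) -> transpose -> flatten, so that after
    # the two branches are concatenated, their channels interleave and
    # information flows between branches in the next block.
    per_group = len(channels) // groups
    grouped = [channels[g * per_group:(g + 1) * per_group]
               for g in range(groups)]
    return [grouped[g][i] for i in range(per_group) for g in range(groups)]
```

With channel labels `['a0', 'a1', 'b0', 'b1']` (branch a, branch b), `channel_shuffle(..., groups=2)` yields `['a0', 'b0', 'a1', 'b1']`: the two branches are interleaved rather than kept in separate blocks.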

Maxpool Cross-Scale Feature Aggregation
During the driving process, extreme weather or environmental changes may cause confusion or loss of information for shallow facial features (such as closed eyes or closed mouth). Therefore, this paper introduces the Cross-scale Aggregation Module (CAM) [29] between the Backbone and Neck networks and improves it to enhance the fusion of information between different feature levels in the facial region, reducing the loss of shallow category features.
The CAM structure, as shown in Figure 3a, consists of 5 cross-scale fusion nodes (CFN) arranged in a "V" shape module layout. The intermediate layer of the CFN input is the output of the previous CFN. It integrates the facial features from the backbone network in a bottom-up manner while allowing interaction between the top and bottom layers. The 5-level features of the backbone are aggregated into low, medium, and high-level feature maps, and then the shallow feature expression of the Neck network is enhanced using the 3-level features from different mappings, strengthening the filtering of invalid information.

The improved CAM is called M-CFAM (maxpool cross-scale feature aggregation module), as shown in Figure 3b. In general, it uses the maxpool cross-scale fusion node M-CFN to aggregate the three adjacent features C_{i − 1}, C_{i}, and C_{i + 1} (2 ≤ i ≤ 4) of the backbone as inputs to the continuous nodes for fusion with the C1, C2, C3, and C4 features of the backbone. The structure of M-CFN is shown in Figure 4. Firstly, it combines standard convolution and residual connections to reconstruct the bottleneck structure, reducing the loss of feature information caused by multi-layer continuous convolutions and reducing computational complexity. Secondly, although the initial downsampling module Focus enhances the connections between facial features, it increases the difficulty of training the deep convolutions; moreover, frequent slicing operations are not friendly to embedded platforms, and network quantization does not support the Focus module. In this paper, the Focus slicing operation is therefore replaced by a maxpool layer (MP) with a stride of 2 to fully preserve the upper facial texture features and further reduce computational complexity. This forms the "trapezoidal" maxpool cross-scale feature aggregation module M-CFAM.
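The stride-2 maxpool downsampling that replaces the Focus slicing can be sketched in plain Python on a 2D feature plane (a 2 × 2 window with stride 2, halving each spatial dimension; the function name and window size are illustrative):

```python
def maxpool2x2(feature):
    # feature: 2D list (H x W) with even H and W.
    # Each output cell keeps the maximum of a 2x2 input window, so the
    # strongest local activation survives while H and W are halved --
    # the same downsampling ratio as the Focus slice it replaces.
    h, w = len(feature), len(feature[0])
    return [
        [max(feature[i][j], feature[i][j + 1],
             feature[i + 1][j], feature[i + 1][j + 1])
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]
```

For example, a 4 × 4 plane `[[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]]` downsamples to the 2 × 2 plane `[[6,8],[14,16]]`. Unlike slicing, this involves no channel rearrangement, which is why it quantizes and deploys more easily on embedded platforms.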

Lightweight Contextual Information Fusion Module
Due to their close relationship with surrounding-area information, facial features play a crucial role in feature extraction and face detection. Therefore, in order to enhance the fusion of contextual information in the Neck's C3 and improve the feature-extraction capability for each category, this paper introduces the idea of the RFB [30] module and constructs a lightweight contextual-information-fusion module (L-CIFM) based on DoConv and the efficient multi-scale attention module (EMA) [31] to replace C3 in the Neck network. This reduces computation relative to equivalent traditional convolutional layers, improves the training speed of deep linear networks, and optimizes overall performance.
The EMA structure, as shown in Figure 5, extracts the attention weights of the grouped feature maps through three parallel subnets, embedding precise positional information into EMA and integrating contextual information of different scales so that the convolution module generates better pixel-level attention on high-level feature maps. Then, a cross-spatial learning method is used to enhance the network structure by handling short- and long-term dependency relationships. In contrast to the progressive behavior formed by a limited receptive field, the parallel use of 3 × 3 and 1 × 1 convolutions allows more contextual information to be utilized in the intermediate features, and finally the fused features are refined to obtain the output result f_e.

The L-CIFM structure, as shown in Figure 6, consists of two convolutional branches and one residual edge for context feature extraction. Firstly, the upper-level feature F_i is preprocessed with a 1 × 1 standard convolution and input into the bottleneck module for extracting adjacent contextual features. The left branch uses a 3 × 3 DoConv to enhance facial-feature perception of the input feature, improving the deep network's perception of global information in the original input image, and then a 1 × 1 standard convolution refines the rich semantic information from the upper level. The right branch first uses a 1 × 1 standard convolution to obtain the position information of facial labels and then enriches local features through a 3 × 3 DoConv. The contextual fusion feature is obtained by element-wise addition (⊕), and on this basis it is concatenated (Cat) with the upper-level information enhanced by EMA, realizing the merging of information from multiple feature dimensions and forming the lightweight contextual-information-fusion module. Finally, the output of the fused context feature information F_o is expressed as:

F_o = Cat( f_e(F_i), f_{1×1}( f^{DOC}_{3×3}(F_b) ) ⊕ f^{DOC}_{3×3}( f_{1×1}(F_b) ) ), F_b = f_{1×1}(F_i)

In the equation, f_{1×1} represents a 1 × 1 Conv, f^{DOC}_{3×3} represents a 3 × 3 DoConv, and f_e represents enhancing the attention on effective facial features of the upper-layer input using EMA.

Improvement of Loss Function
The loss function of bounding-box regression mainly consists of three parts: bounding-box-localization loss, object-confidence loss, and object-classification loss. It is a key component of object detection and has a significant impact on the predictive performance of the model. The CIoU loss used in the baseline model does not effectively reflect the differences between the true values of width and height and their confidence, and there is also an imbalance in the bounding-box regression loss between high-quality and low-quality samples. This paper introduces the bounding-box regression losses proposed in reference [32], called WIoU. WIoU can be divided into WIoU v1 based on attention, WIoU v2 with a monotonic focusing mechanism (FM), and WIoU v3 with a dynamic non-monotonic FM. WIoU v3 can assign a smaller gradient gain to anchor boxes with larger outlier degrees, effectively preventing large gradient loss from low-quality samples. This fully exploits the tuning potential of the dynamic non-monotonic FM, reduces the penalty for distance and aspect ratio on low-quality samples, and speeds up the convergence of the model. Therefore, this paper adopts WIoU v3 to reconstruct the loss function of the baseline model, effectively balancing the impact of sample-quality differences on the model and improving the extraction performance of facial local features.
Taking Figure 7 as an example, the Intersection over Union (IoU) loss between the ground truth box (green) and the predicted box (blue) can be obtained, denoted as L_IoU:

L_IoU = 1 − IoU = 1 − (W_i H_i) / S_u

where W_i and H_i are the width and height of the overlap region between the two boxes and S_u is the area of their union.

Then the penalty factor R_WIoU is defined to amplify the L_IoU of low-quality anchor boxes, and on this basis the focus of bounding-box regression is adjusted toward the quality of anchor boxes using the gradient gain r, thus obtaining the loss L_WIoUv3 of WIoU v3:

R_WIoU = exp( ((x − x_gt)² + (y − y_gt)²) / (W_g² + H_g²)* )

L_WIoUv3 = r · R_WIoU · L_IoU, r = β / (δ · α^(β − δ))

where (x, y) and (x_gt, y_gt) are the center points of the predicted and ground truth boxes, W_g and H_g are the width and height of the smallest enclosing box, and the superscript * indicates that the term is detached from the computational graph.

In the equation, the outlier degree β = L*_IoU / L̄_IoU, where L̄_IoU is the running mean of L_IoU and the superscript * denotes detachment from the computational graph; the hyperparameters δ and α are used to control the mapping relationship between the outlier degree β and the gradient gain r. The improved face-detection model is shown in Figure 8.

Keypoints Extraction and Fatigue-Judgment Model Construction
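A minimal numerical sketch of the WIoU v3 gradient gain described above, assuming the formulation r = β / (δ · α^(β − δ)) from reference [32]; the α and δ values and the constant running mean used here are illustrative, not the paper's settings:

```python
def wiou_v3_gain(liou, liou_mean, alpha=1.9, delta=3.0):
    # Outlier degree: this anchor's IoU loss relative to the running
    # mean of IoU losses (the mean is treated as a constant here,
    # i.e. detached from the computational graph).
    beta = liou / liou_mean
    # Dynamic non-monotonic focusing: the gain peaks for ordinary-
    # quality anchors and shrinks for strong outliers, so low-quality
    # boxes do not produce harmful large gradients.
    return beta / (delta * alpha ** (beta - delta))

# An anchor near the average loss (beta = 1) gets a larger gain than a
# strong outlier (beta = 5), which is down-weighted:
g_avg = wiou_v3_gain(0.2, 0.2)   # beta = 1
g_bad = wiou_v3_gain(1.0, 0.2)   # beta = 5
```

Note that the gain is non-monotonic in β: it first rises and then decays, which is exactly the behavior that distinguishes WIoU v3 from the monotonic FM of WIoU v2.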

Extraction of 3D Facial Keypoints
The localization coordinates of facial landmarks are crucial for calculating the aspect ratios of the eyes and mouth. Considering that traditional MTCNN five-point localization only includes the positions of the left and right eyes, the nose, and the corners of the mouth, it can only locate the facial contour and cannot determine whether the person is in a fatigued state; additionally, due to its three-level cascaded network, its detection speed is slow. Moreover, the 68 2D keypoints extracted by Dlib suffer from lost feature information and poor real-time performance when the driver's head rotates significantly. Therefore, in order to accurately, quickly, and stably extract facial keypoints and enhance the focus on semantically meaningful regions, this paper adopts a lightweight architecture called Attention Mesh [33] for predicting the coordinates of 468 facial landmarks, which directly predicts the positions of the vertices of a 3D facial mesh.
As shown in Figure 9, the implementation mechanism of 3D facial keypoint extraction consists of two parts: the face extractor and the end-to-end feature-extraction model. The input of the detected video frame image is provided either through previous-frame tracking or directly by the detector.
Then, these inputs are divided into separate sub-models by the feature-extraction model, which directly extracts the predicted coordinates of the eye and mouth regions. Each sub-model can independently control the grid size of each feature region based on feature changes, thus improving the quality of grid coverage. Finally, a set of normalization is applied to horizontally align and evenly size the eye and mouth features, further improving the accuracy of prediction. Therefore, it can achieve the same or even higher accuracy in facial keypoint localization as multi-stage cascaded methods, while also improving the speed of localization extraction.

Eye-Mouth Aspect Ratio and the Determination of Its Threshold
The height of the eye opening varies with blinking: it decreases rapidly and approaches zero during closing, while during opening it remains within a certain threshold range. Therefore, this paper assesses the driver's eye opening and closing by calculating the eye aspect ratio (EAR) presented in reference [34] and determining a corresponding threshold value. In addition to changes in the driver's eyes, yawning is another noticeable state change associated with fatigued driving. When a driver yawns, the distance between the upper and lower lips increases significantly while the distance between the mouth corners decreases, and both remain briefly stable within a certain threshold range. Therefore, to enrich the criteria for determining fatigue-related conditions, the mouth aspect ratio (MAR) is calculated analogously to the EAR, and a corresponding threshold is determined. Additionally, to avoid losing inner-contour keypoints due to variation in mouth features among different drivers, this study extracts eight points from the external contour of the mouth to calculate the mouth aspect ratio.
The formulas for calculating EAR and MAR are as follows:

$$\mathrm{EAR}=\frac{\left|Y_{u_1}-Y_{l_1}\right|+\left|Y_{u_2}-Y_{l_2}\right|+\left|Y_{u_3}-Y_{l_3}\right|}{2\left|X_{362}-X_{263}\right|},\qquad \mathrm{MAR}=\frac{\left|Y_{u_1}-Y_{l_1}\right|+\left|Y_{u_2}-Y_{l_2}\right|+\left|Y_{u_3}-Y_{l_3}\right|}{2\left|X_{61}-X_{291}\right|}$$

In the equations, $X_{362}$, $X_{263}$ and $X_{33}$, $X_{133}$ represent the horizontal coordinates of the four eye-corner keypoints of the left and right eyes (the left eye is shown above; the right eye is computed analogously with corners $X_{33}$ and $X_{133}$), and $X_{61}$, $X_{291}$ represent those of the two mouth-corner keypoints. $Y_{u_k}$ and $Y_{l_k}$ represent the vertical coordinates of the paired upper and lower contour keypoints: twelve for the left and right eyes and six for the mouth outline (e.g., $Y_{405}$), respectively.
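As a concrete illustration of the aspect-ratio calculation above, the following sketch computes a generic eye/mouth aspect ratio from landmark coordinates. The corner indices (61, 291 for the mouth) come from the text; the vertical-pair indices and coordinates below are made up for the toy example, since the paper does not list all twelve/six contour indices explicitly.

```python
def aspect_ratio(landmarks, corner_pair, vertical_pairs):
    """Ratio of summed vertical openings to twice the horizontal span.

    landmarks: dict mapping keypoint index -> (x, y) coordinate.
    corner_pair: (left_idx, right_idx) giving the horizontal extent.
    vertical_pairs: list of (upper_idx, lower_idx) tuples.
    """
    lx = landmarks[corner_pair[0]][0]
    rx = landmarks[corner_pair[1]][0]
    width = abs(lx - rx)
    height = sum(abs(landmarks[u][1] - landmarks[l][1])
                 for u, l in vertical_pairs)
    return height / (2.0 * width)

# Toy example: a "mouth" 4 units wide whose three vertical pairs are each
# 1 unit apart gives MAR = (1 + 1 + 1) / (2 * 4) = 0.375.
pts = {61: (0.0, 0.0), 291: (4.0, 0.0),
       1: (1.0, 1.0), 2: (1.0, 0.0),
       3: (2.0, 1.0), 4: (2.0, 0.0),
       5: (3.0, 1.0), 6: (3.0, 0.0)}
mar = aspect_ratio(pts, (61, 291), [(1, 2), (3, 4), (5, 6)])
print(round(mar, 3))  # 0.375
```

The same function serves for the EAR by passing the eye-corner pair and the eye contour pairs instead.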
As shown in Figure 11, a frame-by-frame analysis was conducted on the process of a driver transitioning from a normal state to a fatigued state of closing eyes and yawning, using randomly selected video data from the publicly available YawDD [35] simulated-driving dataset. It can be observed that, as the number of frames increases, the MAR remains between 0.2 and 0.3 while the driver is in a normal closed-mouth state. When yawning occurs, the MAR rapidly increases and stabilizes at around 1.1. The difference between yawning and regular mouth states is most obvious when the MAR exceeds 0.65; therefore, 0.65 is determined as the threshold for yawning detection, meaning that when MAR > 0.65, the driver is considered to have yawned once. Similarly, based on the observation that the minimum EAR during eye closure remains around 0, a threshold of 0.02 is determined for eye-closure detection. Hence, when EAR < 0.02, the driver is considered to have closed their eyes once.
Sensors 2023, 23, x FOR PEER REVIEW
Figure 11. Analysis of EAR and MAR results.

The Number of Frames of Continuous Eye Closure and Yawning in a Single Instance
From Figure 11, it can be seen that during fatigued driving, the number of continuous closed-eye frames F e and continuous yawning frames F m differ significantly from the normal driving state. Studies have shown that, under normal conditions, a yawn lasts about 6.5 s [36], which is approximately 150 frames. Therefore, driver fatigue can be determined by counting the continuous closed-eye frames and continuous yawning frames. The calculation formulas are as follows:

$$F_e = F_{ej} - F_{ei},\qquad F_m = F_{mj} - F_{mi}$$

In the formulas, $F_{ei}$, $F_{ej}$, $F_{mi}$, and $F_{mj}$, respectively, represent the starting and ending frames of closing eyes and yawning.
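In an implementation, the continuous-frame counts above can be obtained by scanning the per-frame threshold flags for the longest uninterrupted run. This is a minimal sketch under that assumption; the function name and flag encoding are illustrative, not from the paper.

```python
def longest_run(flags):
    """Length (in frames) of the longest run of consecutive True flags.

    flags: per-frame booleans, e.g. EAR < 0.02 for "eye closed" or
    MAR > 0.65 for "yawning". The result corresponds to F_e (or F_m),
    i.e. the span between the start F_ei and end F_ej of one event.
    """
    best = cur = 0
    for f in flags:
        cur = cur + 1 if f else 0
        best = max(best, cur)
    return best

# 5 consecutive closed-eye frames, then a gap, then 3 more -> F_e = 5.
closed = [True] * 5 + [False] * 2 + [True] * 3
print(longest_run(closed))  # 5
```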

PERCLOS Criteria and the Determination of Its Threshold
PERCLOS, which stands for Percentage of Eyelid Closure over the Pupil over Time, is a physical parameter used to determine driver fatigue. Taking the P80 measurement standard as an example, when the eye-opening ratio is below 0.2, the eye is considered completely closed; when the ratio is above 0.8, it is considered completely open. If the PERCLOS value exceeds a certain threshold, it can be determined that the driver is in a fatigued state. Therefore, in order to make a more accurate determination of driver fatigue, the percentage of yawning in a unit of time is proposed analogously. Within a specified unit cycle of F 0 frames, the PERCLOS scores for the eyes (P eyes ) and mouth (P mouth ) are calculated from the total numbers of closed-eye and yawning frames:

$$P_{eyes}=\frac{N_{eyes}}{F_{end}-F_{start}},\qquad P_{mouth}=\frac{N_{mouth}}{F_{end}-F_{start}}$$

In the equations, $N_{eyes}$ and $N_{mouth}$ denote the total numbers of closed-eye and yawning frames within the cycle, and $F_{start}$ and $F_{end}$ represent the starting frame and ending frame of the specified unit cycle F 0 .
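A minimal sketch of the PERCLOS calculation over a unit cycle, assuming per-frame boolean flags as input; the function name and flag encoding are illustrative rather than taken from the paper.

```python
def perclos(flags, f_start, f_end):
    """PERCLOS score: fraction of frames in the cycle [f_start, f_end)
    flagged as closed-eye (for P_eyes) or yawning (for P_mouth)."""
    window = flags[f_start:f_end]
    return sum(window) / float(f_end - f_start)

# 30 closed-eye frames inside a 150-frame unit cycle -> P_eyes = 0.2,
# which exceeds the 0.15 fatigue threshold used in this paper.
closed = [True] * 30 + [False] * 120
print(perclos(closed, 0, 150))  # 0.2
```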
In order to determine the number of frames for continuous eye closure and continuous yawning, as well as the fatigue thresholds for the eyes and mouth, a detection experiment was conducted on the collected video dataset based on the EAR and MAR thresholds. The total number of closed-eye and yawning frames within a unit cycle was counted, and the PERCLOS score was calculated. Based on the three indicators above, a model for determining fatigued driving was established. After analyzing the experimental results, it was determined that within a unit cycle of 150 frames (F 0 = 150 frames), if the PERCLOS score for the driver's eyes or mouth is not less than 0.15, or the number of continuous closed-eye frames is not less than 20, or the number of continuous yawning frames is not less than 30, then the driver is determined to be in a state of fatigued driving. Otherwise, the driver is considered to be in a normal driving state. The fatigue-determination formula is as follows:

$$\text{State}=\begin{cases}\text{Fatigue}, & P_{eyes}\ge 0.15\ \lor\ P_{mouth}\ge 0.15\ \lor\ F_e\ge 20\ \lor\ F_m\ge 30\\[2pt] \text{Normal}, & \text{otherwise}\end{cases}$$

The fatigued-driving detection process is shown in Figure 12. In order to address the low detection accuracy of current methods based on single-feature recognition or single-target detection of the eyes and mouth, as well as the serious false positives, false negatives, and keypoint loss caused by large head-pose variations of drivers, this paper combines three-dimensional keypoint extraction, eye-mouth aspect-ratio calculation, and feature-classification detection to comprehensively determine changes in the driver's blinking and yawning states. Specifically, when the driver's mouth is detected as open (o_mouth) and the mouth aspect ratio exceeds the specified threshold, both conditions are met and the driver is determined to be yawning. When the driver's eyes are detected as closed (c_eyes) or the eye aspect ratio is below the specified threshold, one of the conditions is met and the driver is determined to be blinking.
The formulas for discriminating blinking and yawning are as follows:

$$\text{Yawn}:\ (\text{class}=\text{o\_mouth})\ \land\ (\mathrm{MAR}>0.65),\qquad \text{Blink}:\ (\text{class}=\text{c\_eyes})\ \lor\ (\mathrm{EAR}<0.02)$$
Finally, a fatigued-driving detection model is built by integrating eye and mouth features. The current state of the driver, whether fatigued or not, is determined based on the number of continuous closed-eye frames, continuous yawning frames, and the fatigue thresholds for eyes and mouth. The model outputs two judgment results: normal driving and fatigued driving.
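The overall decision rule described above (PERCLOS scores plus continuous-frame counts against their thresholds) can be sketched as follows; parameter names are illustrative, while the threshold values come from the paper.

```python
def is_fatigued(p_eyes, p_mouth, f_e, f_m,
                p_thr=0.15, eye_frames_thr=20, yawn_frames_thr=30):
    """Fatigue decision over one unit cycle (F0 = 150 frames): fatigued if
    either PERCLOS score reaches 0.15, or a single continuous eye closure
    lasts at least 20 frames, or a single yawn lasts at least 30 frames."""
    return (p_eyes >= p_thr or p_mouth >= p_thr
            or f_e >= eye_frames_thr or f_m >= yawn_frames_thr)

print(is_fatigued(0.05, 0.04, f_e=25, f_m=0))   # True  (long eye closure)
print(is_fatigued(0.05, 0.04, f_e=10, f_m=10))  # False (normal driving)
```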

Dataset and Experimental Conditions
The dataset of this study consists of a total of 8021 images, including public datasets YawDD, CEW [37], DrivFace, and Drozy, which consist of male and female drivers of different races, with and without glasses, and in normal and fatigued driving states (speaking and non-speaking). Additionally, there are self-built video datasets of different drivers in real driving scenes during daytime and nighttime.
First, frames are sampled from the YawDD and self-built videos (one image every 20 frames) and augmented by mirroring, rotation, and cropping. Then, the CEW, DrivFace, and Drozy databases are added to enrich the dataset and compensate for the limited diversity and scene variation of the YawDD and self-built video data, enhancing the robustness and generalization ability of the model. The dataset is annotated using the Python annotation tool LabelImg, resulting in 8021 face bounding-box labels, 3579 o_mouth labels, 4634 o_eyes labels, 4310 c_mouth labels, and 3166 c_eyes labels, totaling 23,710 facial-feature bounding-box labels across five categories. The dataset is then split into a training set and a validation set in an 8:2 ratio.
During model training, the number of epochs is set to 150, the input image size to 640, and the batch size to 16. Rectangular (rect) training is used to reduce redundant padding in image preprocessing, decrease memory usage during training, and accelerate inference. The specific experimental conditions during training are shown in Table 1.

Evaluation Indicators
This study evaluates the effectiveness of the improved face-detection algorithm using the metrics of average precision (AP), mean average precision (mAP), floating-point operations (FLOPs), parameters (Params), and model size (Size). AP and mAP are performance metrics for target prediction: higher values indicate a higher recognition rate for the different facial categories and stronger overall model performance. The remaining metrics evaluate how lightweight the model is: a smaller FLOPs value indicates lower computational complexity, while smaller Params and Size values indicate a lighter model. The formulas for calculating AP and mAP are as follows:

$$P=\frac{TP}{TP+FP},\qquad R=\frac{TP}{TP+FN},\qquad AP=\int_{0}^{1}P(R)\,\mathrm{d}R,\qquad mAP=\frac{1}{5}\sum_{i=1}^{5}AP_i$$

In the formulas, P and R represent precision and recall, respectively. TP represents the number of correctly predicted faces in each category, FP the number of incorrectly predicted faces in each category, and FN the number of faces in each category that were not predicted. AP represents the average precision of predicting each category, and the number 5 corresponds to the five facial-feature categories being classified.

Table 2 shows the experimental comparison results of introducing Shufflenetv2 and the improved Shufflenetv2_BD for backbone reconstruction based on the baseline model. From the table, it can be seen that Shufflenetv2 achieves a relatively higher mAP of 92.7% with a minimal increase in parameters after the addition of DoConv, minimizing the overall performance loss of the model. Therefore, this paper chooses Shufflenetv2_BD to reconstruct the backbone of YOLOv5s into a lightweight network model.
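The precision, recall, and mAP formulas above can be checked with a short numeric sketch; the counts and per-class AP values below are made-up examples, not results from the paper.

```python
def precision_recall(tp, fp, fn):
    """Precision P = TP/(TP+FP) and recall R = TP/(TP+FN)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return p, r

def mean_ap(ap_per_class):
    """mAP: mean of the per-class AP values (here, 5 facial categories)."""
    return sum(ap_per_class) / len(ap_per_class)

p, r = precision_recall(tp=90, fp=10, fn=30)
print(p, r)  # 0.9 0.75
print(round(mean_ap([0.95, 0.92, 0.96, 0.90, 0.93]), 3))  # 0.932
```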

Ablation Experiment
In order to verify the effectiveness of the improved algorithm in facial-feature detection, ablation experiments are conducted to demonstrate the contribution of each optimization point. Each group of experiments uses the same hyperparameters and training method, and the results are shown in Table 3. Compared with the baseline model, the improved algorithm is significantly more lightweight, with a 56.3% reduction in model size, a 58% reduction in parameters, and floating-point operations of only 5.9 GFLOPs. After introducing the improved Shufflenetv2_BD, the AP50 of each category remains basically unchanged, but the mAP decreases by 2%, a significant loss. By aggregating shallow and deep facial texture features across scales, enhancing the fusion of contextual features with semantic information, and redefining the loss function using WIoU, the AP50 and mAP are ultimately improved by 1.5% and 3%, respectively. Compared to the baseline model, this amounts to an additional improvement of 1% and 1.7% in AP50 and mAP, demonstrating that the improved algorithm achieves stronger detection performance than the baseline model while remaining more lightweight than the reconstructed-backbone variant alone.
As shown in Figures 13 and 14, the classification detection results and visualized heatmaps before and after improvement, based on the ablation experiments, demonstrate the advantages of the improved algorithm in predicting facial features of each category. From the figures, it can be seen that introducing EMA extracts more complete facial-feature information, providing accurate location information for categories with smaller targets and shallower features and effectively enhancing attention to feature regions, especially for the c_mouth and c_eyes categories. The recall and mean average precision of c_eyes improve the most, by 9.2% and 2.3%, respectively.

The improved algorithm's classification detection was further evaluated on 50,000 face images with different feature variations and scenes, randomly selected from the CelebA [38] face dataset, with classification recognition accuracy as the evaluation metric. As shown in Table 4, thanks to the fused feature enhancement of M-CFAM and L-CIFM, the improved algorithm achieved a recognition rate of 98.6% for the facial-feature target regions. Specifically, the recognition rates for the face and o_eyes categories reached almost 100%, while the c_eyes and c_mouth categories had relatively higher false-detection rates, leaving room for improvement in classification judgment. Nonetheless, the average recognition rate exceeded 97%, indicating good evaluation performance. The results show that although our method yields a modest improvement in overall mAP, the recognition precision P for face, o_mouth, o_eyes, and c_mouth is also comparable to the baseline model.
Given the point-by-point optimization improvement on the baseline model, it can achieve accurate extraction and prediction of various facial features and categories while ensuring extremely low system overhead, thereby verifying the effectiveness of the improved algorithm.

Horizontal Network Comparison Experiment
Considering that this article mainly focuses on lightweight, real-time fatigued-driving detection, we conducted a comprehensive performance comparison with current mainstream detection algorithms. From Table 5, it can be seen that the model size of our algorithm is only 6.3 MB, much smaller than YOLOv3-Tiny, YOLOv4-Tiny, YOLOv7-Tiny, and SSD, giving it a stronger advantage for lightweight terminal deployment. Even compared with YOLOv4-Tiny and YOLOv5n, which are similarly lightweight, our algorithm improves mAP by 7.6% and 2.9%, respectively, despite slightly higher floating-point operations, preserving overall performance in predicting facial-feature categories while keeping the model lightweight. In addition, compared with the most advanced object-detection algorithms, the YOLOv8 series, the proposed algorithm has a 2% lower mAP and relatively weaker detection performance, but it holds a significant advantage in lightweight design and overall balance. Therefore, it can be concluded that this algorithm achieves high robustness with very low computational power, meeting the portability requirements of embedded mobile terminals and further validating its advancement.

Fatigue Sample Test Result Analysis
The fatigued-driving detection model in this study integrates the weight file "best.pt", obtained during training of the improved algorithm, into the judgment model to achieve multi-indicator, multi-feature fusion detection. To verify the accuracy of the model, 135 untrained video segments averaging 560 frames were collected from the YawDD dataset as instance samples, covering normal driving (speaking and not speaking) and fatigued driving in different driving scenarios, with drivers of different genders, wearing and not wearing glasses. The recognition accuracy obtained from instance detection is used as the evaluation metric, and the judgment results of the fatigued-driving detection model in this study are compared with those of references [39,40], which adopt the MTCNN five-point positioning method, as shown in Table 6. From Table 6, it can be seen that the fatigued-driving detection of reference [39] is poor, with weak discrimination of the normal state, eye changes, and yawning. Its fatigue recognition rate in the yawning state is the lowest, only 23.1%, and its overall performance is much lower than the 98.1% of reference [40] and the 96.7% of this study. Compared with reference [40], the fatigued-driving detection model constructed in this study, which combines eye and mouth features, achieves an overall judgment accuracy of 96.3%; the recognition accuracy for 45 fatigued-driving samples and 90 normal-driving samples reached 93.3% and 97.8%, respectively. Because eye fatigue is also considered, the overall comprehensive performance is slightly weaker than the 97% of reference [40]. However, if eye fatigue is ignored, keeping the same conditions as reference [40], the overall accuracy of this study reaches 97.5%, the strongest performance.
In addition, although references [39] and [40] use different judgment methods and sample sizes, the comparison still indicates that the fatigued-driving determination model designed in this study achieves better fatigue-judgment results and a higher detection level. However, due to the lack of validation in real driving scenarios, the discrimination mechanism still produces some mixed errors when the driver is speaking or when fatigue features are not obvious; the classification discrimination ability of the model needs further improvement. Nevertheless, in overall detection and discrimination, our method shows significant advantages compared to other approaches.

Figure 15 shows the comparison of detection results using Dlib and Attention Mesh for keypoint extraction during normal and fatigued driving, with and without glasses (other variables kept the same). The EAR and MAR in the figure dynamically track changes in the driver's eyes and mouth in real time. When the driver closes their eyes or yawns, the EAR or MAR falls below or rises above the set threshold, and "Blink" and "Yawn" display the respective counts. The fatigue-detection model then counts the continuous eye closures and continuous yawns within a unit cycle frame, as well as the total numbers of eye closures and yawns, and displays the calculated eye and mouth PERCLOS scores under "PERCLOS". If the score is below the specified threshold, the state is judged "Normal"; otherwise it is judged "Fatigue".

From the figure, it can be seen that the lightweight face-detection algorithm designed in this study accurately recognizes the various facial-feature categories. In addition, compared to Dlib, which suffers from keypoint loss that causes the fatigue-detection model to fail and has poor real-time performance, the fatigue-detection model using Attention Mesh achieves a faster detection speed of 28 FPS, which basically meets the requirements of real-time detection and verifies the effectiveness of Attention Mesh in quickly and stably extracting keypoints.
The final result achieves a good continuous detection effect, which is in line with the experimental design.

Discussion and Conclusions
This paper introduces a lightweight and robust driver-face detection method and a fatigued-driving determination model. Firstly, Shufflenetv2_BD is used to reconstruct the backbone of YOLOv5s, achieving a lightweight network and improved training speed for deep linear networks. Secondly, M-CFAM is introduced between the backbone and neck networks to enhance the cross-scale fusion of facial features and reduce the loss of shallow features. Then, L-CIFM is introduced to enhance the neck network's extraction of facial-region features. In addition, to accelerate model convergence, WIoU is introduced as a new loss metric and the loss function is redefined. Comparative experiments show that the proposed algorithm reduces the parameters and model size by 58% and 56.3% compared to the baseline model, with floating-point operations of only 5.9 GFLOPs, while the mAP on the self-built dataset reaches 95.7%, an improvement of 1%. This indicates that the algorithm not only performs well in terms of being lightweight but also effectively improves detection performance. Finally, based on the eye-mouth aspect-ratio thresholds calculated from three-dimensional keypoints and the detection of facial-feature categories, the designed fatigued-driving determination model comprehensively judged 135 instance samples within a specified unit cycle frame, using the numbers of continuous closed-eye and continuous yawning frames as well as the eye-mouth PERCLOS thresholds, achieving a recognition accuracy of 96.3% and reaching a high level of fatigued-driving detection.
Thus, it verifies that the face-detection algorithm and fatigue-determination model can effectively detect and determine drivers in real-time in different driving scenarios, different genders, and different driving characteristics, demonstrating strong robustness and providing support for the transplantation and deployment of fatigued-driving detection.
The algorithm in this paper is designed for fatigued-driving detection and has achieved good results. However, it does not yet consider extremely complex scenes or face-occlusion problems. In the future, we will increase the coverage of data in complex scenes and introduce tracking algorithms to compensate for insufficient or lost feature extraction in extreme driving scenarios with occlusion, in order to improve applicability across multiple scenarios.

Data Availability Statement: Data are available on request due to restrictions, e.g., privacy or ethical restrictions.