1. Introduction
The mining industry is one of the most dangerous industries [1,2,3], with mining trucks operating in challenging environments characterized by extreme weather conditions [4,5,6], uneven terrain, and limited visibility [7,8]. According to recent industry reports, driver distraction contributes to mining truck accidents, with operator fatigue and inattention identified as primary causal factors in approximately 15% of serious incidents involving large haul trucks [9,10,11]. Unlike conventional on-road vehicles, mining trucks operate in isolated areas where immediate assistance is often unavailable, making real-time driver monitoring systems not only beneficial but also essential for operational safety.
Within the operational context of real-world mining transportation, drivers exhibit a wide spectrum of distraction-related behaviors, ranging from mobile phone usage and fatigue to environmental scanning [12,13,14]. Head pose estimation is the most important factor in detecting driver distraction accurately. Figure 1 illustrates this point: part (a) shows the original image, and part (b) shows the Euler angles obtained from head detection followed by head pose estimation. Precisely detecting the driver’s head and then estimating its pose is the foundation of high-accuracy distraction detection. This paper therefore focuses on head pose estimation methods for mining truck drivers. In current practice, achieving robust head pose estimation with fully supervised learning requires large amounts of annotated training data captured under varying lighting and vibration conditions [15,16,17]. This requirement applies both to fully supervised object detection approaches, such as the YOLO series and its variants, and to fully supervised head pose estimation methods, including TRG [18], CIT [19], and WHENet [20]. Acquiring labeled datasets across varying environmental conditions is both prohibitively expensive and time-consuming [21]. Semi-supervised learning has emerged as a compelling paradigm to address this challenge, enabling models to learn effectively from a limited set of labeled examples while leveraging abundant unlabeled data.
In the domain of object detection, landmark contributions include STAC [22], which establishes the foundational paradigm of using weakly augmented samples to generate pseudo labels while training on strongly augmented data. Unbiased Teacher [23] systematically diagnoses and addresses foreground–background and class imbalance issues inherent in the training process. Soft Teacher [24] combines a soft-teacher mechanism with box jitter, enabling end-to-end collaborative evolution between teacher and student networks and substantially improving the utility of pseudo labels. Building on these advances, Efficient Teacher [25] integrates these principles with the efficient YOLO [26,27,28] detector family, yielding a mature, industry-ready solution that closes the loop from academic innovation to engineering practice. The applicability of semi-supervised learning extends beyond object detection. For instance, Basak et al. [29] demonstrated the feasibility of semi-supervised learning for 3D head pose estimation from synthetic data, employing domain adaptation techniques to bridge the distributional gap between simulated and real-world environments. Similarly, SemiUHPE [30], a semi-supervised approach for head pose estimation, has reported promising results. Recent efforts have extended such methodologies to unconstrained, real-world settings.
Despite these advances, there is a substantial discrepancy between the datasets commonly used by these methods, such as BIWI and AFLW2000, and the conditions found in mining scenarios. Operators of mining trucks frequently wear masks, sunglasses, and safety helmets, and the acquired data are typically infrared images. Consequently, the application of semi-supervised approaches within the specific context of the mining environment remains largely unexplored. Two principal challenges persist when applying current semi-supervised approaches within autonomous mining systems. First, existing two-stage head pose estimation approaches typically depend on fully supervised object detection models for head localization, without integrating semi-supervised detection methods. Second, the literature lacks empirical validation of these methods through deployment and testing in authentic, operational mining scenarios.
This paper proposes SemiCHPE to address the aforementioned challenges. SemiCHPE comprises two stages. The first stage, HeaDet, is built upon YOLOv8 with Distribution Focal Loss (DFL), incorporating an enhanced confidence prediction branch and trained using the Efficient Teacher framework for semi-supervised learning. The second stage is a MobileNetV3-based [31] head pose estimator (HPE) that estimates 3D head orientation using a probabilistic rotation representation based on the Matrix Fisher distribution. This probabilistic approach provides a complete distribution over the space of rotations, enabling principled uncertainty quantification for filtering unreliable predictions during both training and inference. To better leverage pseudo labels for semi-supervised training, we adopt a curriculum learning method based on loss weights to optimize the learning process and further boost the performance of HPE. Finally, SemiCHPE is deployed within an operational open-pit mining truck transportation system.
The contributions of this work are fourfold:
- (1) A cascade framework named SemiCHPE is proposed, in which both head detection and head pose estimation are trained using semi-supervised learning.
- (2) A head detector named HeaDet is introduced, which is adapted to the Efficient Teacher framework to improve detection performance.
- (3) A loss-weight-based curriculum learning method is introduced to train the head pose estimator (HPE).
- (4) Real-world deployment on open-pit mining trucks validates SemiCHPE as a semi-supervised cascade pipeline for mining truck driver head pose estimation.
The remainder of this paper is organized as follows.
Section 1 reviews related work in semi-supervised object detection and head pose estimation.
Section 2 describes the proposed semi-supervised cascade head pose estimation method in detail.
Section 3 elaborates on the experimental setup, dataset characteristics, and evaluation metrics, presenting both quantitative and qualitative results, including ablation studies and deployment benchmarks.
Section 4 concludes the paper and discusses limitations and directions for future research.
2. Method
This section provides a detailed introduction to the semi-supervised cascade head pose estimation method and the semi-supervised learning approach used for model training.
2.1. Semi-Supervised Cascade Head Pose Estimation Method
We approach driver distraction detection as a two-stage cascade learning problem under semi-supervised settings. Our labeled dataset Dl contains Nl samples with corresponding head bounding boxes and pose annotations, while the unlabeled dataset Du contains more samples (Nu >> Nl). Each pose is represented as a rotation matrix R in SO(3).
Figure 2 illustrates the overall framework of the semi-supervised cascade head pose estimation method, which includes a head detector and a head pose estimator. Because mining truck driver head pose estimation demands both high accuracy and real-time performance, we build the detector on YOLOv8, an object detection architecture capable of real-time, high-accuracy detection in complex scenes. To provide accurate head positions for the second stage, we select the YOLOv8 model with DFL for head detection. To better exploit Efficient Teacher, a confidence prediction branch is added to the decoupled head of YOLOv8, which facilitates semi-supervised training of a better detection model. The improved model is referred to as the head detector (HeaDet) and consists of three key components: Backbone, Neck, and Head. For head pose estimation, we train MobileNetV3 using a modified semi-supervised method.
The basic components constituting the Backbone, Neck, and Head include CBR, C2f, SPPF, Bottleneck, and decoupled heads. CBR consists of a 3 × 3 convolutional layer with stride 2, batch normalization, and ReLU activation function, used for down-sampling feature maps. The C2f module is designed based on the Cross Stage Partial (CSP) architecture and includes two 1 × 1 convolutional layers with a stride of 1 (cv1 and cv2) and multiple bottleneck layers. The bottleneck layers enhance gradient flow through residual connections, each containing two 3 × 3 convolutional layers with a stride of 1 for extracting high-level features. The input feature map of C2f first passes through the cv1 convolutional layer, expanding the number of output channels to twice that of the input. It is then split into two parts: one part is directly passed to the subsequent concatenation layer, while the other enters the Bottleneck modules for deep feature extraction. Finally, the outputs of all Bottleneck modules are concatenated with the directly passed feature map along the channel dimension and compressed to the target number of channels through the cv2 convolutional layer. The Spatial Pyramid Pooling Fast (SPPF) enhances the model’s receptive field through multi-scale feature fusion while reducing computational redundancy. Its structure includes two 1 × 1 convolutional layers with a stride of 1 (cv1 and cv2). The input feature map of SPPF first passes through the cv1 convolutional layer and is then split into two parts: one part is directly passed to the subsequent concatenation layer, while the other enters a series of three cascaded 5 × 5 max-pooling layers for feature extraction. The pooled feature maps are concatenated with the feature map processed by cv1 along the channel dimension and then compressed to the target number of channels through the cv2 convolutional layer. 
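The multi-scale effect of the three cascaded 5 × 5 max-pooling layers in SPPF can be illustrated with a short calculation. The sketch below assumes stride-1 pooling and assumes that the output of every pooling stage is concatenated with the cv1 output; the function names and the channel count of 256 are hypothetical and serve only as an example.

```python
def cascaded_pool_receptive_fields(kernel: int = 5, stages: int = 3) -> list[int]:
    """Effective window size after each stage of a stride-1 max-pooling chain."""
    # Every additional stride-1 pooling layer widens the window by (kernel - 1).
    return [1 + n * (kernel - 1) for n in range(1, stages + 1)]


def sppf_concat_channels(hidden_channels: int, stages: int = 3) -> int:
    """Channels entering cv2: the cv1 output plus one map per pooling stage (assumed)."""
    return hidden_channels * (stages + 1)


print(cascaded_pool_receptive_fields())  # [5, 9, 13]
print(sppf_concat_channels(256))         # 1024
```

Under these assumptions, three cheap 5 × 5 poolings cover the same windows as separate 5 × 5, 9 × 9, and 13 × 13 poolings, which is where the reduction in computational redundancy comes from.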
The decoupled head consists of two branches, which output class information and predicted bounding box information, respectively. Each branch is composed of two CBR modules and a 1 × 1 convolutional layer with stride 1, where the convolutional layers in CBR have a stride of 1 and a kernel size of 3 × 3.
When an image enters the Backbone, it first passes through two CBR modules and then one C2f module for feature extraction. It then passes through a CBR module followed by a C2f module, and this CBR and C2f pair is repeated three times. Finally, the feature map enters the SPPF module. The feature maps output by the CBR modules in the Backbone are sequentially labeled as [C1, C2, C3, C4, C5]. The Backbone output then enters the Neck. It is first upsampled with nearest-neighbor interpolation and concatenated with the C4 feature map along the channel dimension. A C2f module then extracts features, and the resulting feature map is upsampled again. After concatenation with the C3 feature map, another C2f module is applied, producing a feature map labeled as P3. The P3 feature map passes through a CBR module and is concatenated with the C4 feature map. The concatenated feature map is fed into a C2f module, generating a feature map labeled as P4. The P4 feature map passes through a CBR module and is concatenated with the C5 feature map, followed by a C2f module, producing a feature map labeled as P5. Finally, the feature maps [P3, P4, P5] are fed into three decoupled heads, respectively, to generate prediction information.
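To make the resolution of each level concrete, the following sketch computes the side length of C1 to C5 for a square input, given that each CBR module down-samples with a 3 × 3, stride-2 convolution. A padding of 1 and an input size of 640 are assumptions made for illustration; neither value is stated in this paper.

```python
import math


def backbone_feature_sizes(input_size: int, num_downsamples: int = 5) -> dict[str, int]:
    """Side length of C1..C5 after successive stride-2 CBR modules."""
    sizes = {}
    size = input_size
    for level in range(1, num_downsamples + 1):
        # A 3x3 convolution with stride 2 and padding 1 (assumed) maps n to ceil(n / 2).
        size = math.ceil(size / 2)
        sizes[f"C{level}"] = size
    return sizes


sizes = backbone_feature_sizes(640)
print(sizes)  # {'C1': 320, 'C2': 160, 'C3': 80, 'C4': 40, 'C5': 20}

# P3, P4, and P5 share the resolution of C3, C4, and C5, respectively.
head_inputs = {f"P{k}": sizes[f"C{k}"] for k in (3, 4, 5)}
print(head_inputs)  # {'P3': 80, 'P4': 40, 'P5': 20}
```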
In the second stage, the detected head regions are passed through a lightweight MobileNetV3 network, referred to in this paper as the HPE. This estimator, adapted from the SemiUHPE architecture, is embedded within a mean teacher framework to estimate 3D head orientation. MobileNetV3 is specifically adopted to enable efficient and fast inference on embedded devices.
HPE is based on inverted residual blocks and linear bottlenecks, which enhance model representational capacity while maintaining computational efficiency. The core components of MobileNetV3 are its convolutional module designs. In each depthwise convolutional (DW) module, the number of channels is first increased via a 1 × 1 convolution, expanding the channel dimension of the input features. This contrasts with traditional residual blocks, which typically reduce and then increase the number of channels, hence the term “inverted” residual. Subsequently, a 3 × 3 depthwise convolution is applied for spatial feature extraction. Finally, another 1 × 1 convolution reduces the channel dimension, restoring it to the original or target dimensionality. No nonlinear activation function is used after the final 1 × 1 convolution, which prevents information loss and is part of the linear bottleneck design. Each inverted residual block incorporates a skip connection that directly links the input and output, facilitating the training of deeper networks. The DSDW module is similar to the DW module, except that it lacks a skip connection and employs a stride of 2 in its depthwise convolution. The model’s task head consists of a dropout layer, a fully connected layer, and a batch normalization layer, and it outputs the rotation matrix representing the head pose.
2.2. Semi-Supervised Head Detection
Efficient Teacher enables HeaDet to achieve superior head detection performance through semi-supervised training via the Pseudo Label Assigner (PLA), the Epoch Adaptor (EA), and a gradient reversal layer (GRL). PLA introduces two thresholds, a high one τ1 and a low one τ2, to separate pseudo labels into reliable and uncertain categories. Pseudo labels with scores above τ1 are considered reliable, while those falling between τ2 and τ1 are treated as uncertain. An unsupervised loss is then designed to make effective use of the uncertain pseudo labels. The loss function is given as follows:

$$L = L_s + \lambda L_u$$

Here $L_s$ is the loss computed on labeled images, and $L_u$ is the loss computed on unlabeled images. The hyperparameter λ balances the supervised and semi-supervised losses; in this work, it is set to 3.0. The supervised loss
is defined in terms of the following quantities: CE denotes the cross-entropy loss function, X(h,w) is the output of the student model at position (h, w) on the feature map, and Y(h,w) is the sampling result produced by the detector label assigner. The unsupervised loss is defined over three student outputs for the sample drawn by PLA at position (h, w) on the feature map, namely the classification score, the regression output, and the objectness score, together with the objectness score and the overall score of the pseudo label at (h, w). An indicator function, which takes the value 1 when the stated condition holds and 0 otherwise, gates each term.
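The two-threshold partition performed by PLA and the weighted combination of the supervised and unsupervised losses can be sketched as follows. This is a minimal illustration rather than the Efficient Teacher implementation: the function names and the threshold values 0.8 and 0.3 are hypothetical, and treating scores at or below the low threshold as discarded is an assumption.

```python
def split_pseudo_labels(scores, tau_high, tau_low):
    """Partition pseudo-label indices by score into reliable, uncertain, and discarded."""
    reliable, uncertain, discarded = [], [], []
    for index, score in enumerate(scores):
        if score > tau_high:
            reliable.append(index)      # above the high threshold
        elif score > tau_low:
            uncertain.append(index)     # between the low and high thresholds
        else:
            discarded.append(index)     # assumed handling of low-score labels
    return reliable, uncertain, discarded


def total_loss(supervised, unsupervised, lam=3.0):
    """Weighted sum of the supervised and unsupervised losses (lambda = 3.0 in this work)."""
    return supervised + lam * unsupervised


print(split_pseudo_labels([0.95, 0.50, 0.10, 0.81], tau_high=0.8, tau_low=0.3))
# ([0, 3], [1], [2])
print(total_loss(1.0, 0.5))  # 2.5
```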
During the Burn-in phase, EA feeds both labeled and unlabeled data to the network and employs a domain classifier to confound the detector’s ability to discriminate between the two data sources. This alleviates the overfitting observed when the Burn-in phase uses only labeled data. The domain adaptation loss is computed from the output of the domain classifier and a binary domain label that distinguishes labeled from unlabeled data. A gradient reversal layer (GRL) is employed: the domain classifier is optimized via standard gradient descent, whereas the gradient sign is flipped during backpropagation through this layer, so that the base network is optimized through the GRL toward features that the domain classifier cannot separate. During Burn-in, the supervised loss for a single image is augmented with the domain adaptation term, whose balancing weight is set to 0.1.
During distribution adaptation, the class-wise thresholds τ1 and τ2 are set adaptively at each epoch. The associated hyperparameter is fixed at 60 in all experiments. The thresholds are computed from the list of pseudo-label scores for each class at the current epoch, the numbers of labeled and unlabeled samples, and the count of ground truths of that class tallied by EA at that epoch. Adaptively determining the thresholds per epoch makes the joint training more robust to evolving data distributions.
2.3. Semi-Supervised Head Pose Estimation
Upon detection of a head region by HeaDet, the corresponding image patch is cropped and resized to a spatial resolution of 224 × 224 pixels before being passed to the HPE. Instead of directly regressing Euler angles, an approach susceptible to periodicity artifacts and gimbal lock, we adopt the probabilistic rotation representation based on the Matrix Fisher distribution (MFD), as introduced in SemiUHPE. The MFD is adopted as the representation model for head pose estimation because it is defined directly on the three-dimensional rotation group SO(3), enabling the modeling of arbitrary rotations unambiguously and without singularities. Moreover, as a probabilistic distribution, the MFD not only yields the most probable pose but also quantifies predictive uncertainty through its entropy or singular values. This characteristic is particularly critical in semi-supervised learning, as it allows the model to assess the reliability of pseudo labels dynamically and filter out low-quality samples accordingly, thereby enhancing both the stability of training and the accuracy of the final pose estimates. The probability density function of the MFD is as follows:

$$p(R; A) = \frac{1}{F(A)} \exp\left(\mathrm{tr}\left(A^{\top} R\right)\right)$$

where $R \in SO(3)$, $A$ denotes a generic 3 × 3 matrix, and $F(A)$ represents the normalization factor. Subsequently, the principal orientation $\hat{R}$ and the spread parameter S of the distribution are formulated as:

$$\hat{R} = U \, \mathrm{diag}\left(1, 1, \det(U V^{\top})\right) V^{\top}, \qquad S = \mathrm{diag}(s_1, s_2, s_3)$$

where U and V are the matrices obtained from the singular value decomposition of A, expressed as $A = U S V^{\top}$, and S is a diagonal matrix containing the singular values sorted in descending order. Each singular value reflects the concentration strength of the distribution along the corresponding axis. To quantify prediction uncertainty, we adopt an entropy-based confidence measure. During training, the network regressor takes a single RGB image x as input and outputs a 3 × 3 matrix $A_x$, which parameterizes an MFD $p(R; A_x)$. This distribution inherently encodes both the predicted rotation, captured by the mode $\hat{R}$, and the dispersion, captured by S, as detailed in Equation (2). The entropy of this predictive distribution, which serves as a confidence measure for uncertainty estimation, is given by the following expression.
In this expression, one term remains constant with respect to the parameter matrix, and the remaining terms depend on a diagonal matrix whose diagonal entries are derived from unit quaternions. Specifically, given the singular value decomposition of the parameter matrix and the standard mapping from a unit quaternion to a rotation matrix, a quantity is defined for each column of the identity matrix, and each diagonal entry is then obtained as a trace. A detailed derivation of this formulation can be found in [32]. In general, a lower entropy corresponds to a more peaked distribution, indicating reduced uncertainty and higher confidence.
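For reference, the standard mapping from a unit quaternion to a rotation matrix mentioned above can be written in a few lines. The sketch assumes the scalar-first convention q = (w, x, y, z), which is not specified in this paper, and the function name is hypothetical.

```python
import math


def quaternion_to_rotation(q):
    """Map a quaternion (w, x, y, z) to a 3x3 rotation matrix, normalizing it first."""
    w, x, y, z = q
    norm = math.sqrt(w * w + x * x + y * y + z * z)
    if norm == 0.0:
        raise ValueError("zero quaternion has no associated rotation")
    w, x, y, z = (c / norm for c in (w, x, y, z))
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


# A 90-degree rotation about the z-axis maps the x-axis onto the y-axis.
half = math.radians(90.0) / 2.0
rotation = quaternion_to_rotation((math.cos(half), 0.0, 0.0, math.sin(half)))
for row in rotation:
    print([round(value, 6) for value in row])
```

Because q and −q yield the same matrix, the mapping is two-to-one, which is one reason the distribution is parameterized on SO(3) rather than directly on quaternions.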
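When the predicted entropy is below a fixed threshold τ, the teacher prediction is retained as a pseudo label. The resulting unsupervised loss, summed over the unlabeled samples x in the batch, is:

$$L_{\mathrm{unsup}} = \sum_{x} \mathbb{1}\left(H(p_t) < \tau\right) L_{ce}(p_t, p_s) \tag{14}$$

where $\mathbb{1}(\cdot)$ is the indicator function (1 if the condition holds, 0 otherwise), $H(p_t)$ denotes the prediction entropy computed via Equation (13), and $L_{ce}$ is the cross-entropy loss enforcing consistency between two continuous Matrix Fisher distributions. The terms $p_t$ and $p_s$ are the Matrix Fisher distributions parameterized by $A_t$ and $A_s$, where $A_t$ and $A_s$ are the outputs of the teacher and student models, respectively.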
The unlabeled set Du contains many challenging heads, making it difficult for the teacher to separate in-distribution samples from out-of-distribution ones via a fixed threshold. The teacher’s prediction entropy over Du shows that most samples receive confident (low-entropy) predictions. High-entropy samples fall into two categories: hard heads that still belong to the in-distribution subset (e.g., severe occlusion, atypical poses rare in labeled data but potentially correctable) and noisy heads from the out-of-distribution subset (unrecognizable poses due to missing context or wrong category). Moreover, the teacher’s predictive capability improves during training, meaning the difficulty and uncertainty of a given sample evolve. We therefore introduce dynamic entropy-based filtering to improve pseudo-label quality and enhance robustness in real-world settings. Assuming that Du contains a share of out-of-distribution samples, we retain only a portion of unlabeled data for unsupervised training.
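The filtering threshold $\tau_k$ is progressively updated over K stages and computed as:

$$\tau_k = \mathrm{percentile}\left(\left\{ H\left(p_t^{k}(x)\right) : x \in D_u \right\},\ \eta\right)$$

where η is the fraction of unlabeled data retained, linked to the unknown proportion of out-of-distribution samples in Du. The function percentile(·, η) gives the η-th percentile value, H(·) is the prediction entropy, and $p_t^{k}(x)$ is the distribution predicted for sample x by the teacher model at the k-th stage. Equation (14) is then revised by replacing the fixed threshold τ with the stage-wise threshold $\tau_k$.
For a given η, $\tau_k$ declines as the stage progresses; the optimal η is inversely related to the quantity of out-of-distribution samples in Du. Notably, the separation of in-distribution and out-of-distribution samples here captures pose inference difficulty and reliability, not classical covariate shift, enabling the dynamic threshold to preserve plausible hard samples while suppressing highly noisy ones.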
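The stage-wise filtering described above can be sketched as follows: the threshold is taken as a percentile of the teacher's entropies on the unlabeled set, and only samples at or below it are kept as pseudo labels. The nearest-rank percentile definition, the function names, and the example entropy values are assumptions made for illustration.

```python
import math


def percentile_threshold(entropies, keep_fraction):
    """Entropy value at the given fraction of the sorted entropies (nearest-rank rule)."""
    if not entropies:
        raise ValueError("entropies must not be empty")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must lie in (0, 1]")
    ordered = sorted(entropies)
    rank = max(1, math.ceil(keep_fraction * len(ordered)))
    return ordered[rank - 1]


def select_pseudo_labels(entropies, keep_fraction):
    """Return the stage threshold and the indices of samples kept as pseudo labels."""
    threshold = percentile_threshold(entropies, keep_fraction)
    kept = [i for i, h in enumerate(entropies) if h <= threshold]
    return threshold, kept


# Hypothetical teacher entropies for four unlabeled crops; lower means more confident.
stage_entropies = [-3.2, -1.0, -2.5, 0.4]
print(select_pseudo_labels(stage_entropies, keep_fraction=0.5))  # (-2.5, [0, 2])
```

Recomputing the threshold at every stage lets it follow the teacher: as the teacher's entropies shrink, the same retained fraction corresponds to a lower threshold.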
We further introduce a loss-weight-based curriculum learning method that uses prediction uncertainty, specifically entropy, as a difficulty measure to dynamically adjust the loss weights of unlabeled data across training stages. The dynamic loss weight depends on the following quantities. A steepness parameter controls the shape of the weighting curve: a smaller value makes the discrimination sharper, while a larger value gives a smoother transition. In this work, it is initially set to 0.1. A curriculum control variable determines how permissive the weighting is. At the early stage, it is close to 0, so only samples with very high confidence receive a weight near 1. As training progresses, it approaches 1, and most samples obtain a relatively high weight. This variable is computed from the current training epoch and the total number of training epochs, scaled by a pace factor; the pace factor is set to 1.5, meaning the curriculum advances slightly faster than the actual training time. Finally, the uncertainty of each sample is normalized using the minimum and maximum uncertainty values within the current batch.
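The behavior of the curriculum can be illustrated with a small sketch. The curriculum progress (pace factor 1.5, capped at 1) and the batch-wise min-max normalization follow the description above; the exponential form of the weight itself is an assumption chosen only because it reproduces the stated behavior, and it should not be read as the formula used in this work. All function names are hypothetical.

```python
import math


def curriculum_progress(epoch, total_epochs, pace=1.5):
    """Curriculum control variable in [0, 1]; it runs ahead of training by the pace factor."""
    return min(1.0, pace * epoch / total_epochs)


def normalize_uncertainty(u, u_min, u_max):
    """Min-max normalize an uncertainty value within the current batch."""
    if u_max <= u_min:
        return 0.0
    return (u - u_min) / (u_max - u_min)


def loss_weight(u, u_min, u_max, epoch, total_epochs, steepness=0.1, pace=1.5):
    """Assumed weight form: near 1 only for confident samples early, near 1 for all late."""
    t = curriculum_progress(epoch, total_epochs, pace)
    u_norm = normalize_uncertainty(u, u_min, u_max)
    return math.exp(-(1.0 - t) * u_norm / steepness)


# Weights of the most confident, a mid-range, and the least confident sample over time.
for epoch in (0, 30, 60, 100):
    weights = [loss_weight(u, 0.0, 2.0, epoch, 100) for u in (0.0, 1.0, 2.0)]
    print(epoch, [round(w, 4) for w in weights])
```

With the pace factor of 1.5, the control variable reaches 1 after two thirds of the semi-supervised epochs, after which every unlabeled sample that passes the entropy filter is weighted equally under this assumed form.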
The total loss for semi-supervised training is given below:
We introduce two domain-specific data augmentations tailored to the head pose estimation task. The first, termed Cut Occlusion, randomly masks rectangular regions centered on the head to simulate partial occlusions, a frequent occurrence in mining truck environments due to mechanical vibration and variable lighting conditions. The second augmentation, Rotation Consistency, applies random in-plane rotations ranging from −30° to 30° and enforces that the resulting pose predictions remain geometrically consistent with the applied rotation through matrix multiplication. We employ aspect ratio-preserving cropping followed by zero-padding, rather than naive resizing, to better retain natural facial proportions and mitigate distortion-induced bias.
2.4. Semi-Supervised Training Method
The training framework for both Efficient Teacher and SemiUHPE comprises two distinct phases: an initial warm-up phase followed by a semi-supervised training phase. During the warm-up phase, the model is trained exclusively on labeled data using standard supervised loss functions.
In the subsequent semi-supervised phase, training uses both labeled and unlabeled data. The teacher model generates pseudo labels online, starting from the warmed-up weights. The learning rate follows a cosine annealing schedule, decaying from an initial value of 1 × 10⁻³ to a minimum of 1 × 10⁻⁵, with periodic restarts every 50 epochs to facilitate escaping local minima. During training, each batch is composed of one labeled sample and four unlabeled samples, a configuration that preserves task semantics through labeled supervision while maximizing the use of unlabeled data to enhance model generalization. The parameter configurations employed in the training framework are primarily inherited from Efficient Teacher [25] and SemiUHPE [30].
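The learning-rate schedule described above can be written in closed form. The sketch below uses the usual cosine annealing formula with warm restarts and the values given in the text (1 × 10⁻³, 1 × 10⁻⁵, period 50); updating once per epoch rather than per iteration is an assumption, and the function name is hypothetical.

```python
import math


def cosine_restart_lr(epoch, lr_max=1e-3, lr_min=1e-5, period=50):
    """Cosine-annealed learning rate that restarts at lr_max every `period` epochs."""
    position = (epoch % period) / period  # progress within the current cycle, in [0, 1)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * position))


# Start of a cycle, mid-cycle, end of a cycle, and the first epoch after a restart.
for epoch in (0, 25, 49, 50):
    print(epoch, f"{cosine_restart_lr(epoch):.2e}")
```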
3. Experimental Section, Results and Discussion
3.1. Experimental Environment Settings and Dataset
The experimental hardware consisted of an Intel Xeon Silver 4210 processor and an NVIDIA RTX 3090 GPU. The software configuration included Ubuntu 22.04 LTS, PyTorch 2.3, CUDA 11.2, ONNX 1.8, and Python 3.12. For deployment, the embedded terminal was an NVIDIA Jetson Orin platform running Ubuntu 20.04 with JetPack 5.1.4 and TensorRT 8.5.
We constructed a large-scale dataset of mining truck drivers containing 20,000 near-infrared images. These images were captured by industrial-grade in-cab cameras at a resolution of 1920 × 1080 and a frame rate of 30 FPS, showing the upper bodies of the drivers. To avoid relying only on ideal conditions, we intentionally collected videos throughout the day, from early morning to evening, under various weather conditions. We recorded driving data from 50 drivers using only ambient light. To avoid temporal correlation between consecutive frames, we sampled one frame every 15 frames from the original 30 FPS video. The dataset was split into training, validation, and test sets at a ratio of 8:1:1, that is, 16,000, 2000, and 2000 frames. Only 10% of the training data were annotated, giving 1600 labeled training frames from 30 drivers, while the validation and test sets were fully labeled, each containing 2000 frames from 10 drivers. During dataset partitioning, we ensured that the drivers in the training, validation, and test sets are mutually exclusive: no driver appears in more than one subset. This design eliminates potential bias from driver-specific characteristics and preserves the integrity of the evaluation protocol, guaranteeing that model performance reflects genuine generalization rather than memorization of driver-dependent patterns. Ground-truth head poses were provided by an IM600 sensor with an accuracy of 0.05°, synchronized with the image signal via hardware triggering. The annotations include both head bounding boxes and head pose angles. Two annotators labeled the data, achieving an agreement rate above 95% at an IoU threshold of 0.5. The distribution of mining truck driver head poses in the dataset is shown in Figure 3. This study was approved by the Institutional Review Board of Jiangsu University School of Medicine under approval number JSDX2002601010089, and written informed consent was obtained from all drivers. All experimental results reported in this study are based on the test set.
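The split sizes and the driver-exclusive partition follow directly from the figures above, as the following sketch shows. The function names and the integer driver identifiers are hypothetical, and assigning drivers to subsets in list order is a simplification; the actual assignment procedure is not specified.

```python
def split_counts(total_frames, ratios=(8, 1, 1), labeled_percent=10):
    """Frame counts for an 8:1:1 split with a given share of labeled training frames."""
    unit = total_frames // sum(ratios)
    train, val, test = (unit * r for r in ratios)
    labeled = train * labeled_percent // 100
    return {
        "train": train,
        "val": val,
        "test": test,
        "labeled_train": labeled,
        "unlabeled_train": train - labeled,
    }


def split_drivers(driver_ids, group_sizes=(30, 10, 10)):
    """Assign each driver to exactly one of the train, validation, and test subsets."""
    ids = list(driver_ids)
    if len(ids) != sum(group_sizes):
        raise ValueError("number of drivers must match the requested group sizes")
    groups, start = [], 0
    for size in group_sizes:
        groups.append(ids[start:start + size])
        start += size
    return groups


print(split_counts(20000))
# {'train': 16000, 'val': 2000, 'test': 2000, 'labeled_train': 1600, 'unlabeled_train': 14400}
train_drivers, val_drivers, test_drivers = split_drivers(range(50))
print(len(train_drivers), len(val_drivers), len(test_drivers))  # 30 10 10
```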
The head pose data captured by the IM600 sensor reveal distinct distributional characteristics for each Euler angle, all of which align closely with the operational context of mining truck operation. The Pitch angle, spanning from −60° to +60°, displays a slight asymmetry with a higher proportion of downward postures, particularly concentrated in the −15° to 0° range associated with instrument panel monitoring. The Yaw angle, covering a full range of −90° to +90°, exhibits a bimodal distribution with a rightward bias, reflecting the driver’s need to frequently check the right-side mirror from a left-side driving position. In contrast, the Roll angle is more narrowly distributed between −45° and +45°, with a pronounced concentration in the central range and a subtle rightward tendency attributable to the driver’s seating posture.
HeaDet and HPE were trained using stochastic gradient descent with a learning rate of 0.01, weight decay of 1 × 10⁻⁴, and momentum of 0.9. The training comprised 200 warm-up epochs followed by 100 semi-supervised epochs. Teacher network weights were maintained as the exponential moving average of the student network weights, with a momentum coefficient β of 0.9996. For HPE, we applied Cut Occlusion and Rotation Consistency as data augmentation methods. In Cut Occlusion, each occluded block covers 2% to 5% of the image area, and the number of such blocks ranges from two to four. Rotation Consistency uses rotation angles from −30° to +30°. For HeaDet, Mosaic data augmentation was applied with a probability of 0.8.
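The exponential moving average that links teacher and student weights can be sketched with plain lists standing in for the parameter tensors. The function name and the toy values are hypothetical; only the momentum coefficient of 0.9996 is taken from the text.

```python
def ema_update(teacher, student, momentum=0.9996):
    """One EMA step: the teacher moves a small fraction of the way toward the student."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


teacher_weights = [1.0, -2.0]
student_weights = [0.0, 0.0]
teacher_weights = ema_update(teacher_weights, student_weights)
print(teacher_weights)

# With beta = 0.9996, the average spans roughly 1 / (1 - beta) = 2500 update steps.
print(round(1.0 / (1.0 - 0.9996)))  # 2500
```

Such a slow-moving teacher changes little between consecutive iterations, which keeps the pseudo labels stable while the student is updated by gradient descent.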
Evaluation metrics for head detection included precision (ratio of true positives to predicted positives), recall (ratio of true positives to actual positives), F1-Score (harmonic mean of precision and recall), and AP50 (average precision at an intersection-over-union threshold of 0.5), which reflects overall detection performance. For head pose estimation, we report the Mean Absolute Error (MAE) and Root Mean Square Error (RMSE) for each Euler angle (pitch, yaw, roll) in degrees.
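The metrics above follow their standard definitions, sketched below. The function names and the toy angle values are hypothetical, and angle errors are assumed to be plain differences in degrees with no wrap-around handling.

```python
import math


def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, and false-negative counts."""
    return tp / (tp + fp), tp / (tp + fn)


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    return 2.0 * precision * recall / (precision + recall)


def mae(predicted, target):
    """Mean absolute error between two equally long sequences of angles."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def rmse(predicted, target):
    """Root mean square error between two equally long sequences of angles."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target))


# Precision and recall reported for the 1% labeled-data setting give the listed F1-Score.
print(round(f1_score(78.5, 66.2), 2))  # 71.83

yaw_pred, yaw_true = [10.0, -5.0, 2.0], [12.0, -4.0, 2.0]
print(mae(yaw_pred, yaw_true), round(rmse(yaw_pred, yaw_true), 3))  # 1.0 1.291
```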
3.2. HeaDet Semi-Supervised Training with Different Labeled Data Ratios
To evaluate the data efficiency of our semi-supervised approach, we assess the performance of HeaDet with the proportion of labeled data set to 1%, 3%, 5%, and 10%. Table 1 summarizes the results using standard object detection metrics: Precision, Recall, F1-Score, and Average Precision at an IoU threshold of 0.5 (AP50).
At the lowest annotation level of 1%, the model achieves a precision of 78.5%, recall of 66.2%, F1-Score of 71.83%, and AP50 of 70.1%. These results indicate that the framework retains a fundamental level of detection capability even under extreme data scarcity. The observed disparity between Precision and Recall, a gap of 12.3 percentage points, reflects a conservative prediction tendency, characterized by a relatively high false negative rate.
When the proportion of labeled data is increased to 3%, all metrics exhibit marked improvement. Precision rises to 84.2%, while Recall shows a more pronounced increase to 82.8%, reducing the precision-recall gap to only 1.4 percentage points. The F1-Score reaches 83.49%, and AP50 improves to 81.6%. The convergence of Precision and Recall suggests a more balanced classification behavior and reduced prediction bias, indicating improved model calibration.
With 5% labeled data, performance approaches near saturation. Precision climbs to 97.3%, Recall to 96.1%, and the F1-Score reaches 96.70%, with AP50 at 95.2%. The narrow margin between Precision and Recall (1.2 percentage points), together with an F1-Score exceeding 96%, implies that this level of supervision is sufficient to achieve near-optimal performance for the HeaDet architecture.
At the highest annotation ratio of 10%, the model attains its best overall performance: Precision of 99.1%, Recall of 99.5%, F1-Score of 99.30%, and AP50 of 99.4%. Notably, Recall marginally surpasses Precision, indicating enhanced sensitivity to positive instances. However, relative to the 5% setting, the improvements are incremental, with a ΔF1-Score of 2.6 percentage points and a ΔAP50 of 4.2 percentage points, suggesting diminishing returns with increasing annotation effort.
Collectively, these results delineate three distinct phases in model behavior under semi-supervised conditions: (1) a data-scarce regime at 1%, where performance is constrained by limited supervision; (2) a rapid improvement phase between 1% and 5%, where incremental labeled data yield substantial performance gains; and (3) a saturation phase beyond 5%, where further annotations contribute only marginal improvements. These findings underscore the efficiency of the HeaDet framework in leveraging limited supervision and highlight the diminishing utility of additional labeled data beyond a critical threshold.
3.3. Ablation Study on the HeaDet Model
The ablation study systematically evaluates the incremental contributions of the GRL and the objectness branch to HeaDet for object detection in semi-supervised learning. The results demonstrate clear synergistic improvements in performance with the integration of each component. The experimental results are presented in Table 2.
The baseline YOLOv8 model achieves a Precision of 91.3%, Recall of 94.5%, F1-Score of 92.9%, and AP50 of 89.2%. Introducing the GRL alone yields moderate gains across all metrics, with Precision increasing to 91.8%, Recall to 95.9%, F1-Score to 93.8%, and AP50 to 90.1%. These improvements, corresponding to increases of 0.5, 1.4, 0.9, and 0.9 percentage points respectively, are primarily attributed to enhanced domain adaptation and feature alignment facilitated by the GRL.
More notably, the addition of the objectness branch in conjunction with the GRL leads to substantial performance gains. The full model achieves a Precision of 99.1%, Recall of 99.5%, F1-Score of 99.3%, and AP50 of 99.4%. Relative to the GRL-only configuration, these figures represent increases of 7.3 percentage points in Precision, 3.6 in Recall, 5.5 in F1-Score, and 9.3 in AP50. Compared to the baseline YOLOv8, the improvements are even more pronounced, with gains of 7.8, 5.0, 6.4, and 10.2 percentage points, respectively.
In semi-supervised settings, classifier calibration on unlabeled data may deteriorate, leading to high classification scores in some background regions. The objectness branch, trained with a binary foreground/background loss, effectively rejects these false positive detections to obtain higher-quality pseudo-label data, explaining HeaDet’s superiority over YOLOv8 in semi-supervised learning. These results confirm the objectness branch’s critical role in refining confidence estimation and suppressing false positives. Together with the GRL for cross-domain feature alignment, the two components operate synergistically to improve detection accuracy, enabling the complete model to achieve near-optimal performance across all key metrics and validating their complementary nature.
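The role of the objectness branch in pseudo-label selection can be illustrated with a minimal sketch. The `Detection` class, the `filter_pseudo_labels` helper, and both threshold values are hypothetical and chosen only for illustration; the paper does not state how the two scores are combined, so the simple AND-gating below is an assumption rather than the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Detection:
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
    cls_score: float  # classifier confidence for the head class
    obj_score: float  # foreground probability from the objectness branch


def filter_pseudo_labels(
    dets: List[Detection],
    cls_thresh: float = 0.5,  # assumed value, not taken from the paper
    obj_thresh: float = 0.5,  # assumed value, not taken from the paper
) -> List[Detection]:
    """Keep a teacher prediction only if both scores clear their thresholds."""
    return [
        d for d in dets
        if d.cls_score >= cls_thresh and d.obj_score >= obj_thresh
    ]


teacher_out = [
    # Plausible head: both scores are high.
    Detection((120, 80, 260, 240), cls_score=0.93, obj_score=0.96),
    # Background region with an inflated class score: rejected by objectness.
    Detection((400, 30, 470, 110), cls_score=0.88, obj_score=0.12),
    # Low class score: rejected by the classifier threshold.
    Detection((10, 300, 60, 360), cls_score=0.21, obj_score=0.70),
]

pseudo_labels = filter_pseudo_labels(teacher_out)
print(len(pseudo_labels), "pseudo-label(s) kept")
```

In this toy input, the second detection is the case described above: a background region with a high classification score that a classifier-only filter would accept.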
3.4. Comparative Experiment of HeaDet Model
To demonstrate the superiority of HeaDet, we conducted fully supervised comparative experiments on the same dataset with RetinaNet, Faster R-CNN, YOLOv8, YOLOv13, and YOLO26. All baseline methods were trained in a fully supervised manner using the same 10% labeled data.
Table 2 and Table 3 show that semi-supervised HeaDet outperforms all baselines. It exceeds Faster R-CNN by 4.3 percentage points in AP50 and RetinaNet by 4.9 percentage points. Among single-stage detectors, YOLO26 is the best baseline; HeaDet still improves on it by 1.3 percentage points in F1-Score. Against YOLOv13, the improvements are larger: 4.35 percentage points in F1-Score and 6.3 percentage points in AP50. These results indicate that our improvements are effective. Under fully supervised training, HeaDet achieves an F1-Score of 98.35% and an AP50 of 98.1%, outperforming the next best method, YOLO26, by 0.35 and 0.4 percentage points, respectively. Compared with Faster R-CNN and RetinaNet, HeaDet not only maintains high precision but also delivers a substantial gain in recall.
3.5. Difficulty Analysis of HeaDet Model Detection
To analyze detection difficulty, we divided the test set images into eleven groups numbered 1 through 11. These groups are large head, small head, occluded, unoccluded, no accessories, wearing glasses, wearing a mask, wearing a hat, good facial lighting, facial reflection, and uneven facial lighting. Semi-supervised HeaDet and fully supervised YOLOv8 were then evaluated on these eleven groups. The results are shown in
Table 4. Since an image may belong to more than one group, the detection difficulty analysis can only provide a rough indication of the actual difficulty.
The quantitative results presented in the table indicate that HeaDet outperforms YOLOv8 in the majority of challenging scenarios. Notably, HeaDet demonstrates a clear performance advantage in handling occlusion and accessories. In the occluded group, HeaDet achieves an F1-Score of 99.12% compared to 97.68% for YOLOv8. Similarly, for subjects wearing accessories, HeaDet maintains superior accuracy, surpassing YOLOv8 by margins of 1.43 percentage points for glasses, 0.91 for masks, and 1.07 for hats. This suggests that the proposed method is particularly effective at learning discriminative features even when target objects are partially obscured or modified by external items.
Furthermore, HeaDet exhibits enhanced robustness under difficult lighting conditions. In scenarios characterized by facial reflection and uneven lighting, HeaDet improves the F1-Score over the baseline by 1.57 and 0.57 percentage points, respectively. The model also shows greater stability across different object scales, achieving higher F1-Scores in both the large head and small head categories compared to YOLOv8.
While YOLOv8 shows competitive performance in specific unconstrained groups, such as the unoccluded and good facial lighting categories where it slightly edges out HeaDet, the proposed method demonstrates a more consistent and stable performance level. HeaDet achieves an F1-Score exceeding 98.5% across all eleven groups, whereas the performance of YOLOv8 fluctuates more, dropping below 98% in seven of the eleven groups. Given that the groups overlap, these results suggest, rather than prove, that HeaDet generalizes better and mitigates the impact of occlusion, accessories, and adverse lighting on detection accuracy.
3.6. Ablation Study on the HPE Model
An ablation study was conducted to systematically evaluate the incremental contributions of Curriculum Learning, Rotation Consistency, and Cut Occlusion to the baseline HPE model. Performance was assessed using MAE and RMSE across keypoint predictions. The experimental results are presented in
Table 5.
The baseline model yields an Average MAE of 4.7° and an Average RMSE of 5.8°. Introducing the Cut Occlusion component alone reduces Average MAE to 3.5°, a reduction of 1.2°, and Average RMSE to 4.1°, a reduction of 1.7°. This improvement demonstrates the effectiveness of Cut Occlusion in enhancing model robustness to occluded keypoints.
The subsequent integration of Rotation Consistency further decreases Average MAE to 3.0° and Average RMSE to 3.6°, corresponding to additional reductions of 0.5° in both metrics. This gain confirms the contribution of Rotation Consistency in improving rotational invariance and generalization across varying poses.
Finally, the addition of Curriculum Learning yields marginal but consistent improvements, with Average MAE declining to 2.8° and Average RMSE to 3.4°, a further reduction of 0.2° in each metric. This result indicates that Curriculum Learning facilitates progressive learning by gradually increasing task complexity during training.
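For reference, the two error measures used in this ablation can be computed as in the following sketch. The function name `mae_rmse` and the toy angle values are illustrative only, and two simplifications are assumed: the Average is taken as the unweighted mean of the three per-angle values, and angle wrap-around at ±180° is ignored.

```python
import math
from typing import Dict, Sequence

ANGLES = ("pitch", "yaw", "roll")


def mae_rmse(
    pred: Sequence[Sequence[float]],
    truth: Sequence[Sequence[float]],
) -> Dict[str, Dict[str, float]]:
    """Per-angle MAE and RMSE in degrees, plus their unweighted averages."""
    if not pred or len(pred) != len(truth):
        raise ValueError("pred and truth must be non-empty and equally long")
    out: Dict[str, Dict[str, float]] = {}
    for k, name in enumerate(ANGLES):
        errs = [p[k] - t[k] for p, t in zip(pred, truth)]
        mae = sum(abs(e) for e in errs) / len(errs)
        rmse = math.sqrt(sum(e * e for e in errs) / len(errs))
        out[name] = {"mae": mae, "rmse": rmse}
    # Assumption: "Average" is the mean of the three per-angle values.
    out["average"] = {
        m: sum(out[a][m] for a in ANGLES) / len(ANGLES) for m in ("mae", "rmse")
    }
    return out


# Toy (pitch, yaw, roll) predictions and ground truth, in degrees.
pred = [(10.0, -5.0, 2.0), (0.0, 20.0, -3.0)]
truth = [(8.0, -5.0, 1.0), (4.0, 17.0, -3.0)]
res = mae_rmse(pred, truth)
for name, vals in res.items():
    print(f"{name}: MAE={vals['mae']:.2f}, RMSE={vals['rmse']:.2f}")
```

Because RMSE squares each error before averaging, it is never smaller than MAE and grows faster when a few predictions are far off, which is why the two metrics are reported side by side.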
3.7. Comparative Experiment of HPE Model
To demonstrate that HPE is better suited for the task, we compared it on the same dataset using ResNet, EfficientNetV2, GhostNet, FasterNet, and RetinaNet. All methods were trained fully supervised with the same 10% labeled data.
Table 6 reports the experimental results. HPE achieves the lowest errors on both evaluation metrics, recording an MAE of 3.2° and an RMSE of 4.5°. In comparison, ResNet ranks as the second-best-performing model with an MAE of 3.8° and an RMSE of 4.7°. The remaining models exhibit progressively higher errors, with EfficientNetV2 yielding an MAE of 4.1° and an RMSE of 4.8°, followed by GhostNet and FasterNet, the latter of which demonstrates the least favorable performance with an MAE of 4.8° and an RMSE of 5.3°. The gap between HPE and the strongest baseline, ResNet, is clear in MAE and smaller in RMSE: the proposed method reduces the MAE by approximately 15.8% (from 3.8° to 3.2°) and the RMSE by roughly 4.3% (from 4.7° to 4.5°) relative to ResNet.
3.8. Comparison with YOLO Variants Under Efficient Teacher Framework
To validate the efficacy of the proposed HeaDet architecture, a series of comparative experiments was conducted under a semi-supervised paradigm. Specifically, YOLOv5, YOLOv6, YOLOv7, YOLOv8, and HeaDet were trained using the Efficient Teacher framework with access to only 10% of the labeled data. The resulting performance comparisons are illustrated in
Figure 4.
Figure 4 presents a comparative analysis of detection performance across the YOLO family and the proposed HeaDet architecture. Within the YOLO series, clear generational differences emerge. YOLOv5 delivers the most balanced performance among baseline models, achieving a Precision of 96.3%, Recall of 97.5%, F1-Score of 96.9%, and AP50 of 96.2%. The narrow margin between Precision and Recall, just 1.2 percentage points, reflects a well-calibrated trade-off between false positive suppression and detection completeness. YOLOv6 exhibits a marked decline across all metrics, with a Precision of 92.1%, a Recall of 94.3%, an F1-Score of 93.2%, and AP50 of 93.4%. YOLOv7 partially recovers classification accuracy, attaining a Precision of 95.2%, though its Recall trails at 94.1%, yielding an F1-Score of 94.6% and AP50 of 91.8%. Notably, despite gains in pointwise classification, the drop in AP50 suggests underlying deficiencies in the precision-recall trade-off.
HeaDet outperforms all YOLO baselines across every evaluation metric. It attains a Precision of 99.1%, an improvement of 3.9 percentage points over YOLOv7, a Recall of 99.5%, a gain of 3.6 percentage points relative to YOLOv8, an F1-Score of 99.30%, an increase of 4.65 percentage points compared to YOLOv7, and an AP50 of 99.4%, an enhancement of 6.0 percentage points over YOLOv6. Notably, HeaDet achieves both the highest Precision and Recall simultaneously, effectively overcoming the conventional trade-off between these two measures observed in other architectures. The near equality of Precision and Recall (a gap of only 0.4 percentage points) reflects well-calibrated classification and robust feature learning.
Compared to YOLOv7, HeaDet delivers substantial gains of 3.9 percentage points in Precision, 5.4 percentage points in Recall, 4.65 percentage points in F1-Score, and 7.6 percentage points in AP50. The consistent superiority across all metrics and the marked improvement in AP50 underscore the effectiveness of the proposed architectural innovations in enhancing both localization accuracy and detection completeness. An F1-Score of 99.3% indicates a near-optimal balance of Precision and Recall, while an AP50 exceeding 99% reflects excellent precision-recall characteristics across varying confidence thresholds.
We attribute the performance gains of HeaDet to three key architectural innovations: the dense sampling strategy inherited from Efficient Teacher, an objectness branch designed to enhance pseudo-label quality, and an adaptive threshold mechanism that effectively mitigates the distribution shift between labeled and unlabeled data in the mining truck domain.
3.9. Comparison with State-of-the-Art Semi-Supervised Methods
To further validate the superiority of HeaDet, we compare it with three representative semi-supervised object detection methods: Soft Teacher, Unbiased Teacher, and STAC. All methods are evaluated using 10% of the labeled data. The experimental results are presented in
Figure 5.
As shown in
Figure 5, Soft Teacher achieves a Precision of 94.2%, Recall of 92.1%, and F1-Score of 93.14%. The relatively narrow gap between Precision and Recall (2.1 percentage points) suggests effective pseudo-label filtering, although the modest AP50 indicates potential difficulties in handling objects across varying scales.
Unbiased Teacher outperforms Soft Teacher across all metrics, achieving a Precision of 95.5%, Recall of 94.3%, and F1-Score of 94.90%. These correspond to gains of 1.3, 2.2, 1.76, and 2.9 percentage points in Precision, Recall, F1-Score, and AP50, respectively. The improved balance between Precision and Recall (a difference of 1.2 percentage points) is consistent with the effectiveness of its teacher-student mutual learning strategy in enhancing pseudo-label quality.
STAC presents a different optimization profile, with a precision of 93.8%, recall of 91.5%, and F1-Score of 92.64%. These metrics are slightly lower than those of Unbiased Teacher, with the F1-Score being the lowest among all methods.
HeaDet establishes a new state-of-the-art performance across all evaluation criteria. It achieves a Precision of 99.1%, a Recall of 99.5%, and an F1-Score of 99.3%. These results correspond to gains of 3.6, 5.2, and 4.40 percentage points over the strongest competitor in each respective metric. Notably, HeaDet achieves this superior performance while effectively resolving the trade-off observed in competing approaches. The near equality of Precision and Recall (a gap of only 0.4 percentage points) reflects exceptional pseudo-label quality control and effective mitigation of confirmation bias, a persistent challenge in iterative self-training. The F1-Score approaching 99.3% further demonstrates the framework’s ability to maintain detection precision across varying confidence thresholds, addressing the calibration limitations evident in both Soft Teacher and Unbiased Teacher.
This comparative analysis reveals that existing semi-supervised methods tend to specialize: Unbiased Teacher excels in classification calibration, while STAC demonstrates strong localization consistency. Neither, however, achieves unified optimization across both dimensions. The comprehensive superiority of HeaDet suggests that its design, integrating adaptive pseudo-label weighting, uncertainty-aware consistency regularization, and dynamic threshold adjustment, effectively addresses the core challenges of noise sensitivity and confirmation bias that plague semi-supervised object detection.
3.10. The HPE and SemiUHPE Pose Estimation with Different Labeled Data Ratios
To assess how HPE behaves as supervision varies, we evaluate HPE and SemiUHPE for head pose estimation under different levels of labeled data availability.
Table 7 presents the MAE for each Euler angle, along with the average values, for both SemiUHPE and HPE. Evaluations are conducted with 1%, 3%, 5%, and 10% of the full dataset labeled.
As the fraction of labeled data increases from 1% to 10%, both methods exhibit a consistent decline in Average MAE, indicating that access to more ground-truth annotations systematically improves estimation accuracy. This trend holds across all individual Euler angles and confirms the expected utility of labeled data in semi-supervised regression tasks.
Comparing the two approaches, HPE achieves lower Average MAE than SemiUHPE in all four experimental settings. At the lowest annotation level of 1%, the difference is marginal, with Average MAE values of 5.7° and 5.6° for SemiUHPE and HPE, respectively. However, as the labeled data proportion increases to 3%, HPE demonstrates a more pronounced advantage, reducing the Average MAE from 4.7° to 4.1°, a relative improvement of approximately 13%. At 5% labeled data, the gap narrows, with HPE attaining an Average MAE of 3.4° compared to 3.6° for SemiUHPE. At 10% annotation, HPE again outperforms its counterpart, achieving an Average MAE of 2.8° versus 3.0°.
Examining individual Euler angles reveals a similar pattern. HPE generally yields lower MAE for Pitch, Yaw, and Roll, with the largest gain in Average MAE observed at 3% labeled data. These results demonstrate that while both methods benefit from increased labeled data, HPE offers superior estimation accuracy at every labeled ratio tested. The advantage is most evident in the 3% to 5% labeled data regime, where HPE consistently outperforms SemiUHPE across all metrics. This finding suggests that the architectural or training innovations embedded in HPE contribute meaningfully to more efficient utilization of limited supervision in head pose estimation tasks.
Table 8 presents the RMSE for each Euler angle, along with the corresponding average values.
Experimental findings indicate that both SemiUHPE and HPE benefit from increased supervision, with RMSE values for individual Euler angles, as well as their average, declining monotonically as the proportion of labeled data increases from 1% to 10%. This trend reinforces the established relationship between annotation availability and regression accuracy in head pose estimation.
Across all experimental configurations, HPE consistently achieves lower error than SemiUHPE. At the lowest annotation level of 1%, HPE attains an Average RMSE of 6.6°, with per-angle errors of 8.0°, compared to 6.8° for SemiUHPE. This performance margin persists as labeled data increases. At 3%, HPE records an Average RMSE of 5.4° versus 5.8° for SemiUHPE; at 5%, the difference narrows to 4.1° versus 4.2°; and at 10%, HPE maintains an advantage with 3.4° compared to 3.6° for SemiUHPE. The margin, ranging from 0.1° to 0.4° in Average RMSE, suggests that HPE makes more effective use of limited supervision across all data regimes.
A further consistent observation is the elevated difficulty associated with Pitch estimation relative to Yaw and Roll. For HPE, the gap between Pitch error and Roll error spans 1.6 to 2.4 points across data proportions, while for SemiUHPE, this gap ranges from 1.2 to 1.7 points. This persistent discrepancy underscores the inherent complexity of modeling vertical head rotations, which may involve greater ambiguity and variability compared to horizontal or torsional movements.
3.11. Comparison Between HPE and Other Head Pose Estimation Methods
To better position this study within the broader literature, we compare our method with several fully supervised approaches evaluated on datasets such as BIWI and AFLW2000. While these datasets differ substantially from ours, the comparison still offers a useful reference point for assessing the proposed approach. Cross-dataset comparisons are intended only as qualitative references. The results are summarized in
Table 9.
As presented in the table, the average MAE values for fully supervised methods across different datasets lie within the range of 2° to 5°. The lowest among these is achieved by the TRG method, with an average MAE of 2.75°. In comparison, our semi-supervised approach achieves an average MAE of 2.8° on our dataset. These results indicate that the proposed method remains competitive relative to existing approaches.
3.12. Edge Deployment Optimization on NVIDIA Jetson Orin NX
For real-world deployment, achieving an appropriate balance between accuracy and inference speed is essential. We evaluated the performance of HeaDet and HPE under three numerical precision configurations: FP32 for full precision, FP16 for half-precision, and INT8 for 8-bit integer quantization. All models used in this evaluation were trained with ten percent of the dataset. The corresponding results for object detection and head pose estimation are presented in
Figure 6. The FP32, FP16 and INT8 versions of HeaDet have model sizes of 53.2 MB, 24.1 MB and 14.0 MB respectively. Their memory usage during inference is 86.3 MB, 38.4 MB and 22.5 MB. The FLOPs for all three HeaDet variants are 28.7 GFLOPs. For HPE, the three versions have model sizes of 5.6 MB, 2.3 MB and 1.4 MB with inference memory usage of 18.6 MB, 5.2 MB and 2.1 MB. The FLOPs for all HPE variants are 0.3 GFLOPs.
For HeaDet, the full-precision FP32 configuration achieves an AP50 of 99.4%, a recall of 99.5%, and an F1-Score of 99.3%, establishing a high-performance baseline for object detection. Transitioning to FP16 results in only minor degradation: AP50 decreases to 98.8%, recall to 98.9%, and F1-Score to 98.7%. These small losses indicate that 16-bit floating-point representation preserves sufficient numerical fidelity for accurate head localization, with minimal impact on detection quality. In contrast, INT8 quantization leads to more pronounced performance drops: AP50 declines to 97.4%, recall to 96.5%, and F1-Score to 97.8%, reflecting the reduced dynamic range and discretization error inherent in integer quantization. Despite this, the retention of an AP50 above 97% suggests that the INT8 quantization procedure preserves most of the detection accuracy.
Inference speed improvements are substantial across quantization levels. HeaDet achieves 22 FPS under FP32, increasing to 51 FPS in FP16 (a 132% improvement), and further to 106 FPS in INT8 (a 382% improvement). Correspondingly, latency reduces from 45.5 ms to 19.6 ms and then to 9.4 ms. This demonstrates significant gains in real-time capability without compromising detection reliability at moderate quantization levels.
For HPE, the FP32 configuration yields an MAE of 2.8° and RMSE of 3.4°, indicating high angular estimation accuracy. FP16 inference increases MAE to 3.1° and RMSE to 3.9°, representing a modest degradation likely due to limited precision in activation and weight representations. INT8 quantization results in higher errors, with an MAE of 3.5° and RMSE of 4.3°, indicating increased sensitivity to quantization noise in regression tasks. However, these errors remain within acceptable bounds for practical applications involving head orientation estimation.
Inference throughput for HPE also scales with reduced precision. The model runs at 55 FPS under FP32, improves to 102 FPS in FP16 (an 85.5% increase), and reaches 185 FPS in INT8 (a 236% increase). Latency is reduced from 18.2 ms to 9.8 ms and further to 5.4 ms, yielding response times below the ten-millisecond threshold that are well-suited for real-time systems.
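The latency and speed-up figures quoted for both models follow directly from the reported frame rates. The short sketch below reproduces that arithmetic; the helper names `latency_ms` and `speedup_pct` are illustrative, and the only inputs are the FPS values stated in the text.

```python
def latency_ms(fps: float) -> float:
    """Per-frame latency in milliseconds for a given throughput."""
    return 1000.0 / fps


def speedup_pct(base_fps: float, new_fps: float) -> float:
    """Relative throughput increase over the baseline, in percent."""
    return (new_fps / base_fps - 1.0) * 100.0


# Frame rates as reported in the text for the Jetson Orin NX.
reported_fps = {
    "HeaDet": {"FP32": 22, "FP16": 51, "INT8": 106},
    "HPE": {"FP32": 55, "FP16": 102, "INT8": 185},
}

for model, fps_by_precision in reported_fps.items():
    base = fps_by_precision["FP32"]
    for precision, fps in fps_by_precision.items():
        print(
            f"{model} {precision}: {latency_ms(fps):.1f} ms per frame, "
            f"{speedup_pct(base, fps):+.1f}% vs FP32"
        )
```

Note that latency derived this way assumes one frame is processed at a time; batched or pipelined inference would decouple the two quantities.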
Collectively, these results demonstrate that both HeaDet and HPE benefit from low-precision inference in terms of throughput and latency. While INT8 introduces measurable accuracy degradation, especially in regression-based HPE, the performance remains adequate for downstream distraction classification. FP16 emerges as a favorable compromise, offering near-optimal accuracy with substantial efficiency gains. This makes it well-suited for deployment in embedded systems where real-time processing and energy efficiency are important. The findings support the feasibility of deploying deep learning-based driver monitoring systems using quantized models without sacrificing operational effectiveness.
To better elucidate the impact of quantization on HPE performance, the MAE and RMSE for different Euler angles are visualized, as illustrated in
Figure 7.
As shown in
Figure 7, a progressive degradation in estimation accuracy is observed as the numerical precision decreases from FP32 to INT8. The FP32 configuration serves as the baseline, demonstrating the highest fidelity with MAE values of 2.8°, 3.1°, and 2.5° for Pitch, Yaw, and Roll, respectively. The corresponding RMSE values for the baseline are 4.1° (Pitch), 3.3° (Yaw), and 2.8° (Roll).
Transitioning to the FP16 precision results in a slight but consistent increase in error across all metrics. The MAE for all angles increases by approximately 0.3°, while the RMSE shows a more pronounced rise, particularly for the Pitch angle, which increases from 4.1° to 4.5°. This suggests that the reduction to half-precision introduces minor quantization noise that marginally impacts the model’s predictive stability.
A more substantial performance drop is evident with the INT8 quantization. Compared to the FP32 baseline, the MAE for Yaw experiences the most significant relative increase, rising from 3.1° to 3.8°. The RMSE metrics are notably more sensitive to this aggressive quantization. The RMSE for Pitch escalates sharply from 4.1° in FP32 to 5.7° in INT8, representing a 39% increase. Similarly, the RMSE for Roll increases from 2.8° to 4.2°. This marked elevation in RMSE, relative to MAE, indicates the presence of larger and more frequent estimation outliers under INT8 quantization.
The experimental results elucidate a fundamental trade-off between model precision and inference efficiency in quantized head pose estimation. FP32 remains the preferred configuration for accuracy-critical scenarios where computational resources are unconstrained. FP16 provides an optimal compromise, delivering near FP32 accuracy with improved throughput, making it well-suited for balanced deployment scenarios. INT8, while introducing higher estimation errors, remains viable for real-time applications where latency constraints and hardware limitations are paramount.
3.13. Deployment of SemiCHPE on Real-World Mining Trucks
The proposed framework was deployed across 15 mining trucks operating within a large open-pit mine in Namibia. The deployment leveraged an edge computing platform based on the NVIDIA Jetson Orin NX, executing the SemiCHPE method using FP16 numerical precision to balance accuracy and inference efficiency under real-world operating conditions. In this configuration, HeaDet achieves an inference speed of 51 frames per second, while the head pose estimation module operates at 102 frames per second. The complete pipeline, which includes both detection and pose estimation, delivers an overall inference speed of 32 frames per second. To further improve inference speed, techniques such as multithreading can be utilized to optimize performance.
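The relationship between the per-stage frame rates and the end-to-end rate can be approximated with a simple model. The sketch below assumes two idealized cases that are not described in the paper: a strictly sequential pipeline in which each frame passes through detection and then pose estimation, and a fully overlapped pipeline in which the slower stage sets the pace. Pre-processing, post-processing, and data transfer are ignored.

```python
def sequential_fps(*stage_fps: float) -> float:
    """Throughput when each frame runs through all stages back to back."""
    return 1.0 / sum(1.0 / f for f in stage_fps)


def pipelined_fps(*stage_fps: float) -> float:
    """Upper bound when stages overlap on different frames."""
    return min(stage_fps)


# FP16 stage rates reported in the text: HeaDet 51 FPS, HPE 102 FPS.
seq = sequential_fps(51, 102)
par = pipelined_fps(51, 102)
print(f"sequential bound: {seq:.1f} FPS, pipelined bound: {par:.1f} FPS")
```

Under these assumptions the sequential bound is about 34 FPS, close to the measured 32 FPS, while overlapping the two stages (for example, through multithreading) could in principle approach the 51 FPS of the detector.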
Over a continuous deployment period, a total of 1901 video clips were recorded, comprising 671 instances of distracted driving and 1230 instances of normal driving. The classification of driver distraction was defined in accordance with the established research literature and industrial safety standards. Head poses in the training set during driver distraction events were analyzed, and the following thresholds for classifying a driver as distracted were determined. A driver is considered distracted if the head orientation meets any of these conditions: Pitch angle below −23.5° or above 30.4°, Yaw angle below −47.3°, or Roll angle below −18.8° or above 19.6°. Furthermore, a distraction event is counted only when such a condition is observed for at least 80 out of 100 consecutive frames.
To mitigate the influence of transient head movements and short-term pose variations, we introduced a temporal persistence criterion. Specifically, a video segment was considered a distraction event only when at least 80 out of 100 consecutive frames exceeded the predefined angular thresholds. Such a temporal filtering mechanism effectively distinguishes sustained distraction from incidental head motions, thereby improving the reliability of ground-truth annotations for subsequent evaluation. The threshold used to classify driver distraction was determined based on the practical experience of seasoned mine safety supervisors in conjunction with onboard validation. Although this threshold may not be broadly generalizable, it is adequate for the current experimental stage, which emphasizes algorithmic validation and iterative refinement. The final evaluation results are presented in
Figure 8.
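The angular thresholds and the temporal persistence criterion described above can be combined as in the following sketch. This is an illustrative reading rather than the deployed code: the function names are hypothetical, the lower bounds are interpreted as negative angles, the Yaw upper bound is left open because none is stated, and the 80-of-100 window follows the persistence criterion.

```python
from collections import deque
from typing import Dict, Iterable, Optional, Tuple

# (lower, upper) bounds in degrees; None means no bound is stated in the text.
# Assumption: lower bounds are negative angles.
BOUNDS: Dict[str, Tuple[Optional[float], Optional[float]]] = {
    "pitch": (-23.5, 30.4),
    "yaw": (-47.3, None),
    "roll": (-18.8, 19.6),
}


def frame_out_of_range(pitch: float, yaw: float, roll: float) -> bool:
    """True if any Euler angle violates its bound in this frame."""
    angles = {"pitch": pitch, "yaw": yaw, "roll": roll}
    for name, value in angles.items():
        lo, hi = BOUNDS[name]
        if (lo is not None and value < lo) or (hi is not None and value > hi):
            return True
    return False


def has_distraction_event(
    frames: Iterable[Tuple[float, float, float]],
    window: int = 100,
    min_hits: int = 80,
) -> bool:
    """True if some full window of consecutive frames has >= min_hits flags."""
    recent: deque = deque(maxlen=window)
    hits = 0
    for pitch, yaw, roll in frames:
        if len(recent) == window:
            hits -= recent[0]  # this flag is evicted by the append below
        flag = int(frame_out_of_range(pitch, yaw, roll))
        recent.append(flag)
        hits += flag
        if len(recent) == window and hits >= min_hits:
            return True
    return False


normal = (0.0, 0.0, 0.0)
away = (0.0, -60.0, 0.0)  # yaw beyond the assumed -47.3 degree bound

print(has_distraction_event([away] * 30 + [normal] * 70))  # brief glance
print(has_distraction_event([away] * 85 + [normal] * 15))  # sustained
```

The sliding window keeps a running count of flagged frames, so each new frame is processed in constant time, which matters at the frame rates reported for the edge device.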
As illustrated in
Figure 8, the performance of two head pose estimation-based driver distraction detection methods, namely, the proposed SemiCHPE framework and a conventional Facial Landmarks baseline, is evaluated using confusion matrices and standard classification metrics, including accuracy, precision, recall, and F1-Score. The results reveal a substantial disparity in discriminative capability between the two approaches, underscoring the effectiveness of the SemiCHPE framework.
Examining the confusion matrix for SemiCHPE, the model correctly classifies 1150 out of 1230 normal instances, while 80 normal cases are misclassified as distracted (false positives). Among the 671 distracted driving instances, 69 are incorrectly identified as normal (false negatives), and 602 are correctly detected. These figures yield an overall accuracy of 92.1%, reflecting strong generalization performance across both classes. The precision of 88.3% indicates that the majority of instances predicted as distracted correspond to true positive events, demonstrating reliable confidence in positive predictions. With a recall of 89.7%, the model successfully detects nearly 90% of actual distracted drivers, suggesting robust sensitivity to distraction-related behavioral patterns. The balanced F1-Score of 89.0% further confirms the method’s ability to maintain an effective trade-off between precision and recall.
In contrast, the Facial Landmarks method exhibits inferior performance. It correctly identifies 516 distracted drivers while failing to detect 155 instances, resulting in an elevated false-negative rate. A total of 190 normal driving instances are misclassified as distracted, indicating a notable increase in false alarms. Correct classification of normal cases reaches 1040. Thus, overall accuracy declines to 81.9%, with precision and recall falling to 73.1% and 76.9%, respectively. The corresponding F1-Score of 74.9% reflects a less favorable balance between detection sensitivity and predictive reliability.
A comparative analysis reveals that SemiCHPE consistently outperforms the Facial Landmarks method across all evaluation metrics. Specifically, the proposed method achieves absolute improvements of 10.3 percentage points in accuracy, 15.2 percentage points in precision, 12.8 percentage points in recall, and 14.1 percentage points in F1-Score. These gains suggest that SemiCHPE is capable of extracting more discriminative cues associated with driver distraction, likely attributable to enhanced spatial–temporal modeling or more robust pose estimation under challenging conditions, including variable lighting, partial occlusion, and dynamic head motion.
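The reported metrics can be re-derived from the confusion-matrix counts given above. The sketch below treats distracted driving as the positive class; the helper name `binary_metrics` is illustrative.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int, tn: int) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 from binary confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Counts as stated in the text (positive class = distracted).
semichpe = binary_metrics(tp=602, fp=80, fn=69, tn=1150)
landmarks = binary_metrics(tp=516, fp=190, fn=155, tn=1040)

for name, m in (("SemiCHPE", semichpe), ("Facial Landmarks", landmarks)):
    summary = ", ".join(f"{k}={100 * v:.1f}%" for k, v in m.items())
    print(f"{name}: {summary}")
```

Small differences in the last decimal place relative to the values in the text can arise from rounding versus truncation and from computing differences before or after rounding.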
Based on error analysis, representative failure cases of the proposed SemiCHPE framework are illustrated in
Figure 9. These false negatives predominantly stem from pronounced lateral leaning, forward flexion, or backward extension of the driver’s torso, often accompanied by partial occlusion of the head region.
In summary, these findings establish the proposed SemiCHPE method as a more accurate and dependable solution for driver distraction detection. Its superior performance in mitigating false negatives renders it well-suited for safety-critical scenarios where timely identification of distracted behavior is paramount.