Semi-Supervised Cascade Head Pose Estimation for Drivers in Open-Pit Mining Trucks

Jiang, Feng; Hu, Bin; Liu, Yulong; Chen, Xiaonian; Zhang, Wei; Li, Yong

doi:10.3390/electronics15122576


by Feng Jiang ¹, Bin Hu ², Yulong Liu ¹,*, Xiaonian Chen ², Wei Zhang ² and Yong Li ³

¹ CGNPC Uranium Resources Co., Ltd., Beijing 100084, China

² Suzhou Automotive Research Institute (Wujiang), Tsinghua University, Suzhou 215200, China

³ Automotive Engineering Research Institute, Jiangsu University, Zhenjiang 212013, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(12), 2576; https://doi.org/10.3390/electronics15122576

Submission received: 22 April 2026 / Revised: 5 June 2026 / Accepted: 9 June 2026 / Published: 11 June 2026


Abstract

Driver distraction causes accidents in mining trucks, posing significant safety risks in open-pit mining operations. Estimating the driver’s head pose is a key task for detecting distraction. However, accurate head pose estimation typically requires large amounts of high-quality annotated data. Obtaining a high-precision head pose estimation model under conditions of limited labeled data is challenging. To address the scarcity of annotated data in mining scenarios, this paper proposes a semi-supervised framework named the semi-supervised cascade head pose estimator (SemiCHPE) for driver head pose estimation. The framework adopts a two-stage cascade architecture: the first stage involves a semi-supervised head detector (HeaDet) for head detection, while the second stage comprises a semi-supervised head pose estimator (HPE) for pose estimation. Extensive experiments conducted on our proprietary dataset of mining truck drivers demonstrate that, using only 10% of the dataset, the proposed framework achieves an F1-Score of 99.3% for head detection and a mean absolute error (MAE) of 2.8° for head pose estimation. When deployed on an NVIDIA Orin NX platform within operational mining trucks, the framework attains real-time inference at 32 frames per second with an accuracy of 91.6%, validating its effectiveness for real-world deployment in intelligent mining transportation systems.

Keywords:

semi-supervised learning; mining truck driver distraction; head pose estimation; object detection; edge deployment

1. Introduction

The mining industry is one of the most dangerous industries [1,2,3], with mining trucks operating in challenging environments characterized by extreme weather conditions [4,5,6], uneven terrain, and limited visibility [7,8]. According to recent industry reports, driver distraction contributes to mining truck accidents, with operator fatigue and inattention identified as primary causal factors in approximately 15% of serious incidents involving large haul trucks [9,10,11]. Unlike conventional on-road vehicles, mining trucks operate in isolated areas where immediate assistance is often unavailable, making real-time driver monitoring systems not only beneficial but also essential for operational safety.

Within the operational context of real-world mining transportation, drivers exhibit a wide spectrum of distraction-related behaviors, ranging from mobile phone usage and fatigue to environmental scanning [12,13,14]. To accurately detect driver distraction, the most important factor is head pose estimation. Figure 1 illustrates this point. Part a shows the original image, and part b shows the Euler angle results from head detection and head pose estimation. Precisely detecting the driver’s head and then estimating its pose is the foundation of high-accuracy distraction detection. This paper focuses on head pose estimation methods for mining truck drivers. In current practice, achieving robust head pose estimation with fully supervised learning requires large amounts of annotated training data captured under varying lighting and vibration conditions [15,16,17]. This includes both fully supervised object detection approaches, such as the YOLO series and its variants, and fully supervised head pose estimation methods, including TRG [18], CIT [19], and WHENet [20]. The process of acquiring labeled datasets across varying environmental conditions is both prohibitively expensive and time-consuming [21]. Semi-supervised learning has emerged as a compelling paradigm to address this challenge, enabling models to learn effectively from a limited set of labeled examples while leveraging abundant unlabeled data.

In the domain of object detection, landmark contributions include STAC [22], which establishes the foundational paradigm of using weakly augmented samples to generate pseudo labels while training on strongly augmented data. Unbiased Teacher [23] systematically diagnoses and addresses foreground–background and class imbalance issues inherent in the training process. Soft Teacher [24] introduces a Soft Teacher mechanism combined with box jitter, enabling end-to-end collaborative evolution between teacher and student networks and substantially improving the utility of pseudo labels. Building on these advances, Efficient Teacher [25] seamlessly integrates the aforementioned principles with the efficient YOLO [26,27,28] detector family, yielding a mature, industry-ready solution that closes the loop from academic innovation to engineering practice. The applicability of semi-supervised learning extends beyond object detection. For instance, Basak et al. [29] demonstrated the feasibility of semi-supervised learning for 3D head pose estimation from synthetic data, employing domain adaptation techniques to bridge the distributional gap between simulated and real-world environments. Similarly, SemiUHPE [30], a semi-supervised approach for head pose estimation, has reported promising results. Recent efforts have extended such methodologies to unconstrained, real-world settings. Despite these advances, there exists a substantial discrepancy between the datasets commonly used in these methods, such as BIWI and AFLW2000, and the conditions found in mining scenarios. Operators of mining trucks frequently wear items including masks, sunglasses, and safety helmets, and the acquired data are typically in the form of infrared images. Consequently, the application of semi-supervised approaches within the specific context of the mining environment remains largely unexplored. 
Two principal challenges persist when applying current semi-supervised approaches within autonomous mining systems. First, existing two-stage head pose estimation approaches typically depend on fully supervised object detection models to provide head localization, without integrating semi-supervised detection methods. Second, the literature lacks empirical validation of these methods through deployment and testing within authentic, operational mining scenarios.

This paper proposes SemiCHPE to address the aforementioned challenges. SemiCHPE comprises two stages. The first stage, HeaDet, is built upon YOLOv8 with Distribution Focal Loss (DFL), incorporating an enhanced confidence prediction branch and trained using the Efficient Teacher framework for semi-supervised learning. The second stage is a MobileNetV3 [31] based HPE that estimates 3D head orientation using a probabilistic rotation representation based on the Matrix Fisher distribution. This probabilistic approach provides a complete distribution over the space of rotations, enabling principled uncertainty quantification for filtering unreliable predictions during both training and inference. To better leverage pseudo labels for semi-supervised training, we adopt a curriculum learning method based on loss weights to optimize the learning process and further boost the performance of HPE. Finally, SemiCHPE is deployed within an operational open-pit mining truck transportation system.

The contributions of this work are fourfold:

(1): A cascade framework named SemiCHPE, in which both head detection and head pose estimation are trained using semi-supervised learning methodologies, is proposed.
(2): A head detector named HeaDet, adapted for the Efficient Teacher framework that improves model performance, is introduced.
(3): A loss-weight-based curriculum learning method is introduced to train the HPE head pose estimator.
(4): Real-world deployment on open-pit mining trucks validates SemiCHPE, a semi-supervised cascade pipeline for mining truck driver head pose estimation.

The remainder of this paper is organized as follows. Section 1 reviews related work in semi-supervised object detection and head pose estimation. Section 2 describes the proposed semi-supervised cascade head pose estimation method in detail. Section 3 elaborates on the experimental setup, dataset characteristics, and evaluation metrics, presenting both quantitative and qualitative results, including ablation studies and deployment benchmarks. Section 4 concludes the paper and discusses limitations and directions for future research.

2. Method

This section provides a detailed introduction to the semi-supervised cascade head pose estimation method and the semi-supervised learning approach used for model training.

2.1. Semi-Supervised Cascade Head Pose Estimation Method

We approach driver distraction detection as a two-stage cascade learning problem under semi-supervised settings. Our labeled dataset D_l contains N_l samples with corresponding head bounding boxes and pose annotations, while the unlabeled dataset D_u contains more samples (N_u >> N_l). Each pose is represented as a rotation matrix R in SO(3).

Figure 2 illustrates the overall framework of the semi-supervised cascade head pose estimation method, which includes a head detector and a head pose estimator. Mining truck driver head pose estimation requires both high accuracy and real-time performance, so we chose YOLOv8, which is capable of real-time, high-accuracy detection in complex scenes, as the base object detection model. To provide accurate head positions for the second stage, we select the YOLOv8 model with DFL for head detection. To better exploit Efficient Teacher, a confidence prediction branch is added to the decoupled head of YOLOv8, which facilitates semi-supervised training of a better detection model. The improved model is referred to as the head detector (HeaDet) and consists of three key components: Backbone, Neck, and Head. For head pose estimation, we trained MobileNetV3 using a modified semi-supervised method.

The basic components constituting the Backbone, Neck, and Head include CBR, C2f, SPPF, Bottleneck, and decoupled heads. CBR consists of a 3 × 3 convolutional layer with stride 2, batch normalization, and ReLU activation function, used for down-sampling feature maps. The C2f module is designed based on the Cross Stage Partial (CSP) architecture and includes two 1 × 1 convolutional layers with a stride of 1 (cv1 and cv2) and multiple bottleneck layers. The bottleneck layers enhance gradient flow through residual connections, each containing two 3 × 3 convolutional layers with a stride of 1 for extracting high-level features. The input feature map of C2f first passes through the cv1 convolutional layer, expanding the number of output channels to twice that of the input. It is then split into two parts: one part is directly passed to the subsequent concatenation layer, while the other enters the Bottleneck modules for deep feature extraction. Finally, the outputs of all Bottleneck modules are concatenated with the directly passed feature map along the channel dimension and compressed to the target number of channels through the cv2 convolutional layer. The Spatial Pyramid Pooling Fast (SPPF) enhances the model’s receptive field through multi-scale feature fusion while reducing computational redundancy. Its structure includes two 1 × 1 convolutional layers with a stride of 1 (cv1 and cv2). The input feature map of SPPF first passes through the cv1 convolutional layer and is then split into two parts: one part is directly passed to the subsequent concatenation layer, while the other enters a series of three cascaded 5 × 5 max-pooling layers for feature extraction. The pooled feature maps are concatenated with the feature map processed by cv1 along the channel dimension and then compressed to the target number of channels through the cv2 convolutional layer. 
The decoupled head consists of two branches, which output class information and predicted bounding box information, respectively. Each branch is composed of two CBR modules and a 1 × 1 convolutional layer with stride 1, where the convolutional layers in CBR have a stride of 1 and a kernel size of 3 × 3.
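The effect of cascading the three 5 × 5 max-pooling layers in SPPF can be checked on a one-dimensional signal. The following is a minimal sketch rather than the authors' implementation: the helper `max_pool_1d` and the test signal are hypothetical, and stride-1 pooling with "same" padding is assumed. It shows that two and three stacked pools of size 5 match single pools of size 9 and 13, which is how SPPF gathers several receptive-field sizes from one small kernel.

```python
def max_pool_1d(signal, kernel):
    """Stride-1 max pooling with 'same' padding (windows are clipped at the borders)."""
    half = kernel // 2
    n = len(signal)
    return [max(signal[max(0, i - half): min(n, i + half + 1)]) for i in range(n)]


# Hypothetical 1D feature row used only for illustration.
signal = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3, 2, 3, 8, 4]

pool_once = max_pool_1d(signal, 5)       # first 5-wide pool
pool_twice = max_pool_1d(pool_once, 5)   # second pool applied to the first output
pool_thrice = max_pool_1d(pool_twice, 5) # third pool applied to the second output

# Stacked pools of size 5 behave like single pools of size 9 and 13.
cascade_matches_k9 = pool_twice == max_pool_1d(signal, 9)
cascade_matches_k13 = pool_thrice == max_pool_1d(signal, 13)
```

In SPPF the outputs of all stages are concatenated, so the block sees the unpooled features together with the three effective window sizes.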

When an image enters the Backbone of the multi-scale object detection model, it first passes through two CBR modules for feature extraction, followed by one C2f module for further feature extraction. Subsequently, it sequentially passes through a CBR module and a C2f module, and this process is repeated three times. Finally, the feature map enters the SPPF module for feature extraction. The feature maps output by the CBR modules in the Backbone are sequentially labeled as [C1, C2, C3, C4, C5]. The feature maps output by the Backbone then enter the Neck. First, they undergo upsampling using the nearest neighbor method via upsample and are then concatenated with the C4 feature map along the channel dimension. Subsequently, a C2f module is used for feature extraction, and the generated feature map is upsampled again. After concatenation with the C3 feature map, another C2f module is applied for feature extraction, producing a feature map labeled as P3. The P3 feature map passes through a CBR module and is concatenated with the C4 feature map. The concatenated feature map is then fed into a C2f module, generating a feature map labeled as P4. The P4 feature map passes through a CBR module and is concatenated with the C5 feature map, followed by feature extraction using a C2f module, producing a feature map labeled as P5. Finally, the feature maps [P3, P4, P5] are respectively fed into three decoupled heads to generate prediction information.
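The resolutions implied by this data flow can be worked out with simple arithmetic. The sketch below assumes a 640 × 640 input and 2× upsampling in the Neck, neither of which is stated in the text, and the function name `feature_map_sizes` is hypothetical. Each stride-2 CBR module halves the spatial size, and P3, P4, and P5 share the resolutions of C3, C4, and C5 because they are concatenated with them.

```python
def feature_map_sizes(input_size, num_downsamples=5):
    """Spatial size after each stride-2 CBR module in the Backbone (C1..C5)."""
    sizes = {}
    size = input_size
    for level in range(1, num_downsamples + 1):
        size = size // 2  # a stride-2 convolution halves height and width
        sizes[f"C{level}"] = size
    return sizes


# Assumed input resolution; the paper does not state the detector input size.
backbone = feature_map_sizes(640)

# Neck outputs are concatenated with C3, C4, C5, so they share those resolutions.
neck = {f"P{k}": backbone[f"C{k}"] for k in (3, 4, 5)}
```

Under these assumptions the three decoupled heads operate on 80 × 80, 40 × 40, and 20 × 20 grids, corresponding to strides of 8, 16, and 32.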

In the second stage, the detected head regions are passed through a lightweight MobileNetV3 network, referred to in this paper as the HPE. This estimator, adapted from the SemiUHPE architecture, is embedded within a mean teacher framework to estimate 3D head orientation. MobileNetV3 is specifically adopted to enable efficient and fast inference on embedded devices.

HPE is based on inverted residual blocks and linear bottlenecks, which enhance model representational capacity while maintaining computational efficiency. The core components of MobileNetV3 are its convolutional module designs. In each depthwise convolutional (DW) module, the number of channels is first increased via a 1 × 1 convolution, expanding the channel dimension of the input features. This contrasts with traditional residual blocks, which typically reduce and then increase the number of channels, hence the term “inverted” residual. Subsequently, a 3 × 3 depthwise convolution is applied for spatial feature extraction. Finally, another 1 × 1 convolution reduces the channel dimension, restoring it to the original or target dimensionality. No nonlinear activation function is used after the final 1 × 1 convolution, which prevents information loss and constitutes the linear bottleneck design. Each inverted residual block incorporates a skip connection that directly links the input and output, facilitating the training of deeper networks. The DSDW module is similar to the DW module, except that it lacks a skip connection and employs a stride of 2 in its depthwise convolution. The model’s task head consists of a dropout layer, a fully connected layer, and a batch normalization layer, ultimately outputting the rotation matrix representing the head pose.

2.2. Semi-Supervised Head Detection

Efficient Teacher enables HeaDet to achieve superior head detection performance through semi-supervised training via the Pseudo Label Assigner (PLA), the Epoch Adaptor (EA), and a gradient reversal layer (GRL). The PLA method introduces two thresholds, a low one τ₁ and a high one τ₂, to separate pseudo labels into reliable and uncertain categories. Pseudo labels with scores of at least τ₂ are considered reliable, while those falling between τ₁ and τ₂ are treated as uncertain. An unsupervised loss is then designed to make effective use of the uncertain pseudo labels. The overall loss function is given as follows:

L = L_{s} + λ L_{u}

(1)

Here L_{s} is the loss computed on labeled images, and L_{u} is the loss computed on unlabeled images. The hyperparameter λ balances the supervised and unsupervised losses; in this work, it is set to 3.0. The supervised loss L_{s} is defined as follows:

L_{s} = \sum_{h,w} ( \mathrm{CE}(X_{(h,w)}^{cls}, Y_{(h,w)}^{cls}) + \mathrm{CIoU}(X_{(h,w)}^{reg}, Y_{(h,w)}^{reg}) + \mathrm{CE}(X_{(h,w)}^{obj}, Y_{(h,w)}^{obj}) )

(2)

CE denotes the cross-entropy loss function, and CIoU denotes the Complete IoU loss. X_{(h,w)} is the output of the student model, and Y_{(h,w)} is the sampling result produced by the detector label assigner. The unsupervised loss L_{u} is defined as follows:

L_{u} = L_{u}^{cls} + L_{u}^{reg} + L_{u}^{obj}

(3)

L_{u}^{cls} = \sum_{h,w} ( 1_{\{p_{(h,w)} \geq τ_{2}\}} \mathrm{CE}(X_{(h,w)}^{cls}, \hat{Y}_{(h,w)}^{cls}) )

(4)

L_{u}^{reg} = \sum_{h,w} ( 1_{\{p_{(h,w)} \geq τ_{2} \ \mathrm{or} \ \widehat{obj}_{(h,w)} > 0.99\}} \mathrm{CIoU}(X_{(h,w)}^{reg}, \hat{Y}_{(h,w)}^{reg}) )

(5)

L_{u}^{obj} = \sum_{h,w} ( 1_{\{p_{(h,w)} \leq τ_{1}\}} \mathrm{CE}(X_{(h,w)}^{obj}, 0) + 1_{\{p_{(h,w)} \geq τ_{2}\}} \mathrm{CE}(X_{(h,w)}^{obj}, \hat{Y}_{(h,w)}^{obj}) + 1_{\{τ_{1} < p_{(h,w)} < τ_{2}\}} \mathrm{CE}(X_{(h,w)}^{obj}, \widehat{obj}_{(h,w)}) )

(6)

where \hat{Y}_{(h,w)}^{cls}, \hat{Y}_{(h,w)}^{reg}, and \hat{Y}_{(h,w)}^{obj} denote, respectively, the classification score, the regression output, and the objectness score of the sample drawn by PLA at position (h, w) on the feature map. The term \widehat{obj}_{(h,w)} represents the objectness score of the pseudo label at (h, w), and p_{(h,w)} is the score of the pseudo label at (h, w). 1_{\{\cdot\}} denotes the indicator function, which takes the value 1 when the stated condition holds and 0 otherwise.
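The indicator conditions in Equations (4) to (6) amount to a three-way routing of each pseudo label, with τ₁ acting as the lower threshold and τ₂ as the upper one. The sketch below is an illustrative reading of those conditions, not the Efficient Teacher code: the function name `pla_assign`, the returned labels, and the numeric thresholds are all hypothetical.

```python
def pla_assign(p, obj_hat, tau1, tau2):
    """Decide which unsupervised loss terms apply to one pseudo label.

    p       : score of the pseudo label at a feature-map position
    obj_hat : objectness score of that pseudo label
    tau1    : lower threshold; tau2 : upper threshold (tau1 < tau2)
    Returns (use_cls, use_reg, obj_target_kind).
    """
    if not tau1 < tau2:
        raise ValueError("expected tau1 < tau2")
    reliable = p >= tau2
    use_cls = reliable                    # indicator in Eq. (4)
    use_reg = reliable or obj_hat > 0.99  # indicator in Eq. (5)
    if reliable:
        obj_target_kind = "assigned"      # Eq. (6): target is the PLA-assigned objectness
    elif p <= tau1:
        obj_target_kind = "background"    # Eq. (6): target is 0
    else:
        obj_target_kind = "soft"          # Eq. (6): target is obj_hat itself
    return use_cls, use_reg, obj_target_kind


# Hypothetical thresholds, chosen only for illustration.
TAU1, TAU2 = 0.1, 0.6
reliable_case = pla_assign(0.80, 0.95, TAU1, TAU2)
uncertain_case = pla_assign(0.30, 0.995, TAU1, TAU2)
background_case = pla_assign(0.05, 0.20, TAU1, TAU2)
```

The uncertain case illustrates the design choice: a label that is not reliable enough for classification can still contribute to box regression when its objectness is very high, and to the objectness loss through a soft target.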

During the Burn-in phase, EA feeds both labeled and unlabeled data to the network and employs a domain classifier to confound the detector’s ability to discriminate between the two data sources. This alleviates the overfitting observed when the Burn-in phase uses only labeled data. The domain adaptation loss is defined as:

L_{da} = - \sum_{h,w} [ D \log p_{(h,w)} + (1 - D) \log(1 - p_{(h,w)}) ]

(7)

where p_{(h,w)} is the output of the domain classifier, with D = 0 for labeled data and D = 1 for unlabeled data. A gradient reversal layer (GRL) is employed: the domain classifier is optimized via standard gradient descent, while the gradient sign is flipped during backpropagation through this layer, so that the base network is optimized via the GRL. During Burn-in, the supervised loss for a single image is reformulated as:

L_{s} = \sum_{h,w} ( \mathrm{CE}(X_{(h,w)}^{cls}, Y_{(h,w)}^{cls}) + \mathrm{CIoU}(X_{(h,w)}^{reg}, Y_{(h,w)}^{reg}) + \mathrm{CE}(X_{(h,w)}^{obj}, Y_{(h,w)}^{obj}) ) + λ L_{da}

(8)

In Equation (8), λ weights the domain adaptation term and is set to 0.1; it is separate from the λ of Equation (1), which is set to 3.0. During distribution adaptation, the thresholds τ₁ and τ₂ at the k-th epoch are set as follows:

τ_{1}^{k} = P_{c}^{k} [ n_{c}^{k} \cdot \frac{N_{u}}{N_{l}} ]

(9)

τ_{2}^{k} = P_{c}^{k} [ α\% \cdot n_{c}^{k} \cdot \frac{N_{u}}{N_{l}} ]

(10)

In all experiments, α is fixed at 60. The list of pseudo-label scores for the c-th class at the k-th epoch is denoted by P_{c}^{k}, the numbers of labeled and unlabeled samples are denoted by N_{l} and N_{u}, and the count of ground truths of the c-th class tallied by EA at the k-th epoch is denoted by n_{c}^{k}. Adaptively determining the thresholds per epoch makes the joint training more robust to evolving data distributions.
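Equations (9) and (10) index into the score list P_{c}^{k}, so the thresholds can be read off once the list is ordered. The sketch below is one possible reading, under two assumptions that the text does not state: the score list is sorted in descending order, and the bracketed index is truncated to an integer and clipped to the list length. The function name `adaptive_thresholds` and the example numbers are hypothetical.

```python
def adaptive_thresholds(scores, n_c, n_unlabeled, n_labeled, alpha=60):
    """Pick tau1 and tau2 for one class at one epoch, following Eqs. (9) and (10).

    scores      : pseudo-label scores for the class (P_c^k)
    n_c         : ground-truth count for the class tallied on labeled data (n_c^k)
    n_unlabeled : number of unlabeled samples (N_u)
    n_labeled   : number of labeled samples (N_l)
    """
    ranked = sorted(scores, reverse=True)  # assumption: descending order
    last = len(ranked) - 1
    index1 = min(n_c * n_unlabeled // n_labeled, last)                  # Eq. (9)
    index2 = min(alpha * n_c * n_unlabeled // (100 * n_labeled), last)  # Eq. (10)
    return ranked[index1], ranked[index2]


# Hypothetical example: 100 evenly spaced scores, 5 labeled objects, N_u / N_l = 8.
example_scores = [i / 100 for i in range(100)]
tau1, tau2 = adaptive_thresholds(example_scores, n_c=5, n_unlabeled=8, n_labeled=1)
```

With a descending list, the smaller index used for τ₂ selects a higher score, which agrees with τ₂ being the stricter threshold in Equations (4) to (6).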

2.3. Semi-Supervised Head Pose Estimation

Upon detection of a head region by HeaDet, the corresponding image patch is cropped and resized to a spatial resolution of 224 × 224 pixels prior to being passed to the HPE. Instead of directly regressing Euler angles, an approach susceptible to periodicity artifacts and gimbal lock, we adopt the probabilistic rotation representation based on the Matrix Fisher distribution (MFD), as introduced in SemiUHPE. The MFD is defined on the three-dimensional rotation group SO(3), which enables arbitrary rotations to be modeled unambiguously and without singularities. Moreover, as a probability distribution, the MFD not only yields the most probable pose but also quantifies predictive uncertainty through its entropy or singular values. This characteristic is particularly critical in semi-supervised learning, as it allows the model to assess the reliability of pseudo labels dynamically and filter out low-quality samples accordingly, thereby enhancing both the stability of training and the accuracy of the final pose estimates. The probability density function of the MFD, MF(R; A), is as follows:

p(R) = MF(R; A) = \frac{1}{F(A)} \exp(\mathrm{tr}(A^{T} R))

(11)

where A \in \mathbb{R}^{3 \times 3} denotes a generic 3 × 3 matrix and F(A) represents the normalization factor. Subsequently, the principal orientation R and the spread parameter S of the distribution are formulated as:

R = U \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & \det(U V) \end{bmatrix} V^{T}

(12)

where U and V are the matrices obtained from the singular value decomposition of A, expressed as A = U S V^{T}, and S = diag(s_{1}, s_{2}, s_{3}) is a diagonal matrix containing the singular values sorted in descending order. Each singular value reflects the concentration strength of the distribution along the corresponding axis. To quantify prediction uncertainty, we adopt an entropy-based confidence measure. During training, the network regressor N takes a single RGB image x as input and outputs a 3 × 3 matrix A_{f} = N(x), which parameterizes an MFD f \sim MF(A_{f}). This distribution inherently encodes both the predicted rotation, captured by the mode R_{f}, and the dispersion, captured by S_{f}, as detailed in Equation (12). The entropy of this predictive distribution, which serves as a confidence measure for uncertainty estimation, is given by the following expression:

H(f) = \log F_{f} - \sum_{i=1}^{4} ( z_{f_{i}} \frac{1}{F_{f}} \frac{\partial F_{f}}{\partial z_{f_{i}}} )

(13)

where F_{f} denotes the normalizing constant with respect to the parameter matrix Z = diag(0, z_{1}, z_{2}, z_{3}), a 4 × 4 diagonal matrix with 0 \geq z_{1} \geq z_{2} \geq z_{3}. Each element z_{i} derives from a unit quaternion q \in S^{3}. Given the singular value decomposition A_{f} = U_{f} S_{f} V_{f}^{T}, γ denotes the standard mapping from a unit quaternion to a rotation matrix. For e_{i}, the i-th column of the identity matrix I_{4}, we define E_{i} = γ(e_{i}). Then each z_{f_{i}} is obtained as the trace of E_{i}^{T} S_{i}. A detailed derivation of this formulation can be found in [32]. In general, a lower entropy corresponds to a more peaked distribution, indicating reduced uncertainty and higher confidence.

When the entropy of the teacher’s prediction is below a fixed threshold τ, entropy-based filtering accepts the prediction as a pseudo label. The resulting unsupervised loss is:

L_{unsup}(x^{u}) = 1_{(H(p_{tea}) \leq τ)} L^{CE}(p_{tea}, p_{stu})

(14)

where 1_{(\cdot)} is the indicator function (1 if the condition holds, 0 otherwise), H(p_{tea}) denotes the prediction entropy computed via Equation (13), and L^{CE}(\cdot, \cdot) is the cross-entropy loss enforcing consistency between two continuous Matrix Fisher distributions. The terms p_{tea} and p_{stu} are defined as p_{tea} = MF(A_{tea}^{u}) and p_{stu} = MF(A_{stu}^{u}), where A_{tea}^{u} = N_{tea}(x^{u}) and A_{stu}^{u} = N_{stu}(x^{u}) are the outputs of the teacher and student models, respectively.

The unlabeled set D^{u} contains many challenging heads, making it difficult for the teacher to separate the in-distribution samples D_{id}^{u} from out-of-distribution ones D_{ood}^{u} via a fixed threshold. The teacher’s prediction entropy for D^{u} shows that most samples receive confident (low-entropy) predictions. High-entropy samples fall into two categories: hard heads still belonging to D_{id}^{u} (e.g., severe occlusion, atypical poses rare in labeled data but potentially correctable) and noisy heads from D_{ood}^{u} (unrecognizable poses due to missing context or wrong category). Moreover, the teacher’s predictive capability improves during training, meaning the difficulty and uncertainty of a given sample evolve. We therefore introduce dynamic entropy-based filtering to improve pseudo-label quality and enhance robustness in real-world settings. Assuming D^{u} = D_{id}^{u} \cup D_{ood}^{u}, we retain only a portion of unlabeled data for unsupervised training.

The filtering threshold τ_{k} is progressively updated over K stages and computed as:

τ_{k} = percentile ⟨ H(MF(N_{tea}^{k}(x_{i}^{u}))) |_{i=1}^{N_{u}}, δ ⟩

(15)

where δ is the fraction of unlabeled data retained, linked to the unknown quantity of D_{ood}^{u} in D^{u}. The function percentile⟨·, ·⟩ gives the δ-th percentile value, and N_{tea}^{k} denotes the teacher model at the k-th stage (k \in \{1, 2, ..., K\}). Equation (14) is then revised as:

L_{unsup}^{'}(x^{u}) = 1_{(H(p_{tea}^{k}) \leq τ_{k})} L^{CE}(p_{tea}^{k}, p_{stu})

(16)

p_{tea}^{k} = MF(N_{tea}^{k}(x^{u}))

(17)

For a given δ, τ_{k} declines as stage k progresses; the optimal δ is inversely related to the quantity of D_{ood}^{u} in D^{u}. Notably, the separation of D_{id}^{u} and D_{ood}^{u} here captures pose inference difficulty and reliability, not classical covariate shift, enabling the dynamic threshold to preserve plausible hard samples while suppressing highly noisy ones.
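The percentile-based threshold of Equation (15) and the indicator of Equation (16) can be illustrated on a small list of entropies. This is a minimal sketch, assuming a nearest-rank percentile (the interpolation rule is not specified in the text); the function names and entropy values are hypothetical.

```python
import math


def entropy_threshold(entropies, keep_fraction):
    """Nearest-rank percentile of teacher entropies, as in Eq. (15).

    keep_fraction plays the role of delta: the fraction of unlabeled
    samples (those with the lowest entropy) that should be retained.
    """
    ranked = sorted(entropies)
    keep = max(1, math.ceil(len(ranked) * keep_fraction))
    return ranked[keep - 1]


def pseudo_label_mask(entropies, keep_fraction):
    """Indicator of Eq. (16): True where the teacher entropy is at most tau_k."""
    tau_k = entropy_threshold(entropies, keep_fraction)
    return [h <= tau_k for h in entropies]


# Hypothetical teacher entropies for eight unlabeled crops at one stage.
stage_entropies = [-3.0, -1.0, -2.5, 0.5, -2.0, 1.2, -0.2, -1.5]
tau_k = entropy_threshold(stage_entropies, keep_fraction=0.75)
mask = pseudo_label_mask(stage_entropies, keep_fraction=0.75)
```

Recomputing `tau_k` from the current teacher at every stage is what makes the filter dynamic: the retained fraction stays fixed while the entropy cut-off follows the teacher's confidence.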

We further introduce a loss-weight-based curriculum learning method that uses prediction uncertainty, specifically entropy, as a difficulty measure to dynamically adjust the loss weights of unlabeled data across different training stages. The formula for calculating dynamic loss weight is as follows:

λ_{curriculum}(x^{u}, t) = \frac{1}{1 + \exp( - \frac{1}{α} [ \frac{H_{max} - H(p_{tea})}{H_{max} - H_{min}} - γ(t) ] )}

(18)

γ(t) = \min(1, \frac{t}{T_{total}} \cdot k)

(19)

The parameter α controls the steepness of the curve. A smaller α makes the discrimination sharper, while a larger α gives a smoother transition. In this work, α is initially set to 0.1. The variable γ(t) serves as the curriculum control. At the early stage, γ(t) is close to 0, so only samples with very high confidence receive a weight near 1. As training progresses, γ(t) approaches 1, and most samples obtain a relatively high weight. Here t is the current training epoch, and T_{total} is the total number of training epochs. The factor k controls the pace of the curriculum; it is set to 1.5, meaning the curriculum advances slightly faster than the actual training time. Finally, H_{min} and H_{max} denote the minimum and maximum uncertainty values within the current batch.
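For reference, Equations (18) and (19) can be evaluated directly. The sketch below transcribes the two formulas as written, using α = 0.1 and k = 1.5 from the text; the function names, the entropy values, the choice of 100 total epochs, and the handling of a batch with identical entropies are assumptions made only for illustration.

```python
import math


def curriculum_gamma(t, total_epochs, k=1.5):
    """Eq. (19): curriculum control, clipped at 1."""
    return min(1.0, t / total_epochs * k)


def curriculum_weight(entropy, h_min, h_max, t, total_epochs, alpha=0.1, k=1.5):
    """Eq. (18): sigmoid of (normalized confidence - gamma(t)) / alpha."""
    if h_max == h_min:
        confidence = 1.0  # assumption: identical entropies are treated as fully confident
    else:
        confidence = (h_max - entropy) / (h_max - h_min)
    margin = confidence - curriculum_gamma(t, total_epochs, k)
    return 1.0 / (1.0 + math.exp(-margin / alpha))


TOTAL_EPOCHS = 100  # hypothetical value of T_total
gamma_start = curriculum_gamma(0, TOTAL_EPOCHS)
gamma_mid = curriculum_gamma(50, TOTAL_EPOCHS)
gamma_late = curriculum_gamma(80, TOTAL_EPOCHS)

# Batch with entropies between -3.0 and 1.0, evaluated at the first epoch.
weight_most_confident = curriculum_weight(-3.0, -3.0, 1.0, 0, TOTAL_EPOCHS)
weight_least_confident = curriculum_weight(1.0, -3.0, 1.0, 0, TOTAL_EPOCHS)
```

With k = 1.5, γ(t) reaches its cap of 1 once t ≥ T_{total}/1.5, that is, after two thirds of the training epochs.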

The total loss for semi-supervised training is given below:

L_{total} = L_{sup}(x^{l}, y^{l}) + λ_{curriculum}(x^{u}, t) \cdot L_{unsup}^{'}(x^{u})

(20)

We introduce two domain-specific data augmentations tailored to the head pose estimation task. The first, termed Cut Occlusion, randomly masks rectangular regions centered on the head to simulate partial occlusions, a frequent occurrence in mining truck environments due to mechanical vibration and variable lighting conditions. The second augmentation, Rotation Consistency, applies random in-plane rotations ranging from −30° to 30° and enforces that the resulting pose predictions remain geometrically consistent with the applied rotation through matrix multiplication. We employ aspect ratio-preserving cropping followed by zero-padding, rather than naive resizing, to better retain natural facial proportions and mitigate distortion-induced bias.
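The geometric constraint behind Rotation Consistency can be sketched with plain 3 × 3 matrices. The code below is illustrative and rests on assumptions not given in the text: the in-plane rotation is modeled as a rotation about the camera's optical (z) axis, the expected pose is obtained by left-multiplication, and the gap is measured with a Frobenius norm. All function names are hypothetical, and the sign convention depends on the image coordinate system.

```python
import math


def rot_z(degrees):
    """Rotation matrix for an in-plane rotation about the optical (z) axis."""
    a = math.radians(degrees)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul3(a, b):
    """Product of two 3 x 3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def expected_pose_after_rotation(pose, degrees):
    """Pose that a consistent model should predict for the rotated image."""
    return matmul3(rot_z(degrees), pose)


def consistency_gap(pred_original, pred_rotated, degrees):
    """Frobenius distance between the rotated-image prediction and its target."""
    target = expected_pose_after_rotation(pred_original, degrees)
    return math.sqrt(sum((target[i][j] - pred_rotated[i][j]) ** 2
                         for i in range(3) for j in range(3)))


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
gap_consistent = consistency_gap(identity, rot_z(25.0), 25.0)
gap_inconsistent = consistency_gap(identity, identity, 25.0)
```

A prediction that follows the applied 25° rotation gives a gap of zero, whereas a prediction that ignores the rotation is penalized.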

2.4. Semi-Supervised Training Method

The training framework for both Efficient Teacher and SemiUHPE comprises two distinct phases: an initial warm-up phase followed by a semi-supervised training phase. During the warm-up phase, the model is trained exclusively on labeled data using standard supervised loss functions.

In the subsequent semi-supervised phase, training uses both labeled and unlabeled data. Starting from the warmed-up weights, the teacher model generates pseudo labels online. The learning rate follows a cosine annealing schedule, decaying from an initial value of 1 × 10⁻³ to a minimum of 1 × 10⁻⁵, with periodic restarts every 50 epochs to facilitate escaping local minima. During training, each batch is composed of one labeled sample and four unlabeled samples, a configuration that preserves task semantics through labeled supervision while maximizing the utilization of unlabeled data to enhance model generalization. The parameter configurations employed in the training framework are primarily inherited from Efficient Teacher [25] and SemiUHPE [30].
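The learning rate schedule described above can be sketched with the standard cosine annealing formula with warm restarts. This assumes the commonly used SGDR form and per-epoch updates, which may differ from the authors' exact scheduler; the function name `cosine_restart_lr` is hypothetical, while the limits and the 50-epoch period come from the text.

```python
import math


def cosine_restart_lr(epoch, lr_max=1e-3, lr_min=1e-5, period=50):
    """Cosine annealing from lr_max towards lr_min, restarting every `period` epochs."""
    t_cur = epoch % period  # epochs elapsed since the last restart
    cosine = math.cos(math.pi * t_cur / period)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + cosine)


lr_at_start = cosine_restart_lr(0)
lr_at_half_period = cosine_restart_lr(25)
lr_before_restart = cosine_restart_lr(49)
lr_after_restart = cosine_restart_lr(50)
```

In this form the rate is halfway between the two limits at epoch 25, approaches the minimum at epoch 49, and jumps back to the initial value at epoch 50.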

3. Experimental Section, Results and Discussion

3.1. Experimental Environment Settings and Dataset

The experimental hardware consisted of an Intel Xeon Silver 4210 processor and an NVIDIA RTX 3090 GPU. The software configuration included Ubuntu 22.04 LTS, PyTorch 2.3, CUDA 11.2, ONNX 1.8, and Python 3.12. For embedded deployment, the model ran on an NVIDIA Jetson Orin platform with Ubuntu 20.04, JetPack 5.1.4, and TensorRT 8.5.

We constructed a large-scale dataset of mining truck drivers containing 20,000 near-infrared images. These images were captured by industrial-grade in-cab cameras at a resolution of 1920 × 1080 and a frame rate of 30 FPS, showing the upper bodies of the drivers. To avoid relying only on ideal conditions, we intentionally collected videos throughout the day from early morning to evening under various weather conditions. We recorded driving data from 50 drivers using only ambient light. The dataset was split into training, validation, and test sets at a ratio of 8:1:1. Only 10% of the training data were annotated, giving 1600 labeled training frames from 30 drivers, while the validation and test sets were fully labeled, each containing 2000 frames from 10 drivers respectively. To avoid temporal correlation between consecutive frames, we sampled one frame every 15 frames from the original 30 FPS video. During dataset partitioning, we ensured that the drivers in the training, validation, and test sets are mutually exclusive: no driver appears in more than one subset. This design eliminates potential bias from driver-specific characteristics and preserves the integrity of the evaluation protocol, guaranteeing that model performance reflects genuine generalization rather than memorization of driver-dependent patterns. Ground-truth head poses were provided by an IM600 sensor with an accuracy of 0.05°, synchronized with the image signal via hardware triggering. The annotations include both head bounding boxes and head pose angles. Two annotators labeled the data, achieving an agreement rate above 95% at an IoU threshold of 0.5. The distribution of mining truck driver head poses in the dataset is shown in Figure 3. This study was approved by the Institutional Review Board of Jiangsu University School of Medicine under approval number JSDX2002601010089, and written informed consent was obtained from all drivers. 
All experimental results reported in this study are based on the test set.
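The driver-disjoint partition and the 1-in-15 frame subsampling described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function names, the `(driver_id, frame_index)` record layout, and the shuffling seed are assumptions, while the 30/10/10 driver counts and the stride of 15 come from the text.

```python
import random

def subsample(frame_indices, stride=15):
    """Keep one frame out of every `stride` consecutive frames."""
    return [f for f in frame_indices if f % stride == 0]

def split_by_driver(driver_ids, n_train=30, n_val=10, n_test=10, seed=0):
    """Assign whole drivers to subsets so that no driver appears in two subsets."""
    ids = sorted(set(driver_ids))
    if len(ids) != n_train + n_val + n_test:
        raise ValueError("driver count does not match the requested split")
    rng = random.Random(seed)  # hypothetical seed, only for reproducibility
    rng.shuffle(ids)
    return {
        "train": set(ids[:n_train]),
        "val": set(ids[n_train:n_train + n_val]),
        "test": set(ids[n_train + n_val:]),
    }

# Hypothetical records: (driver_id, frame_index) pairs for 50 drivers.
kept_frames = set(subsample(range(60)))
records = [(d, f) for d in range(50) for f in range(60) if f in kept_frames]
split = split_by_driver(d for d, _ in records)
train_records = [(d, f) for d, f in records if d in split["train"]]
```

Splitting on driver identity before touching individual frames is what keeps frames of the same person from leaking between the training and evaluation subsets.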

The head pose data captured by the IM600 sensor reveal distinct distributional characteristics for each Euler angle, all of which align closely with the operational context of mining truck operation. The Pitch angle, spanning from −60° to +60°, displays a slight asymmetry with a higher proportion of downward postures, particularly concentrated in the −15° to 0° range associated with instrument panel monitoring. The Yaw angle, covering a full range of −90° to +90°, exhibits a bimodal distribution with a rightward bias, reflecting the driver’s need to frequently check the right-side mirror from a left-side driving position. In contrast, the Roll angle is more narrowly distributed between −45° and +45°, with a pronounced concentration in the central range and a subtle rightward tendency attributable to the driver’s seating posture.

HeaDet and HPE were trained using stochastic gradient descent with a learning rate of 0.01, weight decay of 1 × 10⁻⁴, and momentum of 0.9. The training comprised 200 warm-up epochs followed by 100 semi-supervised epochs. Teacher networks were maintained as the exponential moving average of the student network weights, with a momentum coefficient β set to 0.9996. For HPE, we applied Cut Occlusion and Rotation Consistency as data augmentation methods. In Cut Occlusion, each occluded block covers 2% to 5% of the image area, and the number of such blocks ranges from two to four. Rotation Consistency uses rotation angles from −30° to +30°. For HeaDet, Mosaic data augmentation was applied with a probability of 0.8.
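Two of the training ingredients above, the exponential moving average teacher and Cut Occlusion, can be sketched in a few lines. This is an illustrative sketch under stated assumptions: weights are represented as a plain dictionary of floats, occluding blocks are taken to be square and filled with zeros, and block positions are drawn uniformly. Only β = 0.9996, the 2% to 5% area range, and the two-to-four block count come from the text.

```python
import math
import random

def ema_update(teacher, student, beta=0.9996):
    """In-place update: teacher <- beta * teacher + (1 - beta) * student."""
    for name, weight in student.items():
        teacher[name] = beta * teacher[name] + (1.0 - beta) * weight
    return teacher

def cut_occlusion(image, rng, min_frac=0.02, max_frac=0.05,
                  min_blocks=2, max_blocks=4, fill=0):
    """Return a copy of `image` (a list of rows) with 2 to 4 blocks masked.

    Square blocks and zero fill are assumptions made for this sketch.
    """
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for _ in range(rng.randint(min_blocks, max_blocks)):
        frac = rng.uniform(min_frac, max_frac)          # share of image area
        side = max(1, min(h, w, round(math.sqrt(frac * h * w))))
        top = rng.randint(0, h - side)
        left = rng.randint(0, w - side)
        for y in range(top, top + side):
            for x in range(left, left + side):
                out[y][x] = fill
    return out

teacher = {"w": 1.0}
ema_update(teacher, {"w": 0.0})
masked = cut_occlusion([[255] * 100 for _ in range(100)], random.Random(0))
```

With β this close to 1, a single step moves the teacher by only 0.04% of the distance to the student, which is why the teacher acts as a slowly changing, smoothed copy of the student.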

Evaluation metrics for head detection included precision (ratio of true positives to predicted positives), recall (ratio of true positives to actual positives), F1-Score (harmonic mean of precision and recall), and AP₅₀ (average precision at an intersection over union threshold of 0.5); AP₅₀ summarizes the precision-recall curve and therefore reflects overall detection performance. For head pose estimation, we report MAE and Root Mean Square Error (RMSE) for each Euler angle (pitch, yaw, roll) in degrees.
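For reference, the scalar metrics named above follow directly from their definitions. The sketch below uses hypothetical counts and angle values purely for illustration; AP₅₀ is omitted because it requires the full ranked list of detections.

```python
import math

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1-Score from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def mae(pred, target):
    """Mean absolute error, in the same unit as the inputs (degrees here)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def rmse(pred, target):
    """Root mean square error, in the same unit as the inputs."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred))

# Hypothetical example values.
p, r, f = precision_recall_f1(tp=8, fp=2, fn=2)
yaw_mae = mae([1.0, 2.0, 3.0], [1.0, 2.0, 6.0])
yaw_rmse = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 6.0])
```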

3.2. HeaDet Semi-Supervised Training with Different Labeled Data Ratios

To evaluate the data efficiency of our semi-supervised approach, we assess the performance of HeaDet with varying proportions of labeled data, ranging from 1% to 10%. The corresponding results are summarized in Table 1.

Table 1 reports the detection performance of the HeaDet framework at labeled-data ratios of 1%, 3%, 5%, and 10%, using standard object detection metrics: Precision, Recall, F1-Score, and Average Precision at an IoU threshold of 0.5 (AP₅₀).

At the lowest annotation level of 1%, the model achieves a precision of 78.5%, recall of 66.2%, F1-Score of 71.83%, and AP₅₀ of 70.1%. These results indicate that the framework retains a fundamental level of detection capability even under extreme data scarcity. The observed disparity between Precision and Recall, a gap of 12.3 percentage points, reflects a conservative prediction tendency, characterized by a relatively high false negative rate.

When the proportion of labeled data is increased to 3%, all metrics exhibit marked improvement. Precision rises to 84.2%, while Recall shows a more pronounced increase to 82.8%, reducing the precision-recall gap to only 1.4 percentage points. The F1-Score reaches 83.49%, and AP₅₀ improves to 81.6%. The convergence of Precision and Recall suggests a more balanced classification behavior and reduced prediction bias, indicating improved model calibration.

With 5% labeled data, performance approaches saturation. Precision climbs to 97.3%, Recall to 96.1%, and the F1-Score reaches 96.70%, with AP₅₀ at 95.2%. The narrow margin between Precision and Recall (1.2 percentage points), together with an F1-Score exceeding 96%, implies that this level of supervision is sufficient to achieve near-optimal performance for the HeaDet architecture.

At the highest annotation ratio of 10%, the model attains its best overall performance: Precision of 99.1%, Recall of 99.5%, F1-Score of 99.30%, and AP₅₀ of 99.4%. Notably, Recall marginally surpasses Precision, indicating enhanced sensitivity to positive instances. However, relative to the 5% setting, the improvements are incremental (ΔF1-Score of 2.6 percentage points and ΔAP₅₀ of 4.2 percentage points), suggesting diminishing returns with increasing annotation effort.

Collectively, these results delineate three distinct phases in model behavior under semi-supervised conditions: (1) a data-scarce regime at 1%, where performance is constrained by limited supervision; (2) a rapid improvement phase between 1% and 5%, where incremental labeled data yields substantial performance gains; and (3) a saturation phase beyond 5%, where further annotations contribute only marginal improvements. These findings underscore the efficiency of the HeaDet framework in leveraging limited supervision and highlight the diminishing utility of additional labeled data beyond roughly 5%.
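The three phases can be made concrete by recomputing the F1-Score from the Precision and Recall values quoted above and dividing each F1 gain by the increase in labeled data. The sketch below uses only numbers from the text; the "gain per additional 1% of labels" measure is introduced here for illustration and is not a quantity reported by the authors.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (inputs in percent)."""
    return 2 * precision * recall / (precision + recall)

# (labeled ratio in %, Precision in %, Recall in %) as quoted in the text.
rows = [(1, 78.5, 66.2), (3, 84.2, 82.8), (5, 97.3, 96.1), (10, 99.1, 99.5)]
scores = [(ratio, f1_score(p, r)) for ratio, p, r in rows]

# F1 gain per additional 1% of labeled data between consecutive settings.
gains = []
for (ratio_a, f1_a), (ratio_b, f1_b) in zip(scores, scores[1:]):
    gains.append((f1_b - f1_a) / (ratio_b - ratio_a))
    print(f"{ratio_a}% -> {ratio_b}%: F1 {f1_a:.2f} -> {f1_b:.2f}, "
          f"{gains[-1]:.2f} points per 1% of labels")
```

The per-label gain stays high up to 5% and then drops by roughly an order of magnitude, which is the saturation behavior described above.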

3.3. Ablation Study on the HeaDet Model

The ablation study systematically evaluates the incremental contributions of the GRL and the objectness branch to HeaDet for object detection in semi-supervised learning. The results demonstrate clear synergistic improvements in performance with the integration of each component. The experimental results are presented in Table 2.

The baseline YOLOv8 model achieves a Precision of 91.3%, Recall of 94.5%, F1-Score of 92.9%, and AP₅₀ of 89.2%. Introducing the GRL alone yields moderate gains across all metrics, with Precision increasing to 91.8%, Recall to 95.9%, F1-Score to 93.8%, and AP₅₀ to 90.1%. These improvements, corresponding to increases of 0.5, 1.4, 0.9, and 0.9 percentage points respectively, are primarily attributed to enhanced domain adaptation and feature alignment facilitated by the GRL.

More notably, the addition of the objectness branch in conjunction with the GRL leads to substantial performance gains. The full model achieves a Precision of 99.1%, Recall of 99.5%, F1-Score of 99.3%, and AP₅₀ of 99.4%. Relative to the GRL-only configuration, these figures represent increases of 7.3 percentage points in Precision, 3.6 in Recall, 5.5 in F1-Score, and 9.3 in AP₅₀. Compared to the baseline YOLOv8, the improvements are even more pronounced, with gains of 7.8, 5.0, 6.4, and 10.2 percentage points, respectively.

In semi-supervised settings, classifier calibration on unlabeled data may deteriorate, leading to high classification scores in some background regions. The objectness branch, trained with a binary foreground/background loss, effectively rejects these false positive detections to obtain higher-quality pseudo-label data, explaining HeaDet’s superiority over YOLOv8 in semi-supervised learning. These results confirm the objectness branch’s critical role in refining confidence estimation and suppressing false positives. Together with the GRL for cross-domain feature alignment, the two components operate synergistically to improve detection accuracy, enabling the complete model to achieve near-optimal performance across all key metrics and validating their complementary nature.
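The role of the objectness branch in pseudo-label selection can be illustrated with a small filter. This is a hedged sketch, not the authors' implementation: combining the two scores by multiplication follows a common convention in detectors with an objectness output, and the dictionary layout, the function name, and the threshold of 0.5 are all assumptions.

```python
def filter_pseudo_labels(detections, score_thr=0.5):
    """Keep teacher detections whose joint confidence passes the threshold.

    Each detection is a dict with 'box', 'cls_score', and 'objectness'.
    The joint score cls_score * objectness is an assumed combination rule.
    """
    kept = []
    for det in detections:
        joint = det["cls_score"] * det["objectness"]
        if joint >= score_thr:
            kept.append({**det, "score": joint})
    return kept

# Hypothetical teacher outputs on one unlabeled frame.
teacher_outputs = [
    {"box": (410, 220, 560, 400), "cls_score": 0.93, "objectness": 0.96},  # head
    {"box": (30, 40, 120, 150), "cls_score": 0.90, "objectness": 0.10},    # background
]
pseudo_labels = filter_pseudo_labels(teacher_outputs)
```

In this example the background region has a high class score but a low objectness score, so it is rejected even though a class-score-only filter would have kept it.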

3.4. Comparative Experiment of HeaDet Model

To demonstrate the superiority of HeaDet, we conducted fully supervised comparative experiments on the same dataset against RetinaNet, Faster R-CNN, YOLOv8, YOLOv13, and YOLO26. All methods were trained in a fully supervised manner using the same 10% labeled data.

Table 2 and Table 3 show that semi-supervised HeaDet outperforms all baselines. It exceeds Faster R-CNN by 4.3 percentage points in AP₅₀ and RetinaNet by 4.9 percentage points. Among single-stage detectors, YOLO26 is the best baseline; HeaDet still improves on it by 1.3 percentage points in F1-Score. Against YOLOv13, the improvements are larger: 4.35 percentage points in F1-Score and 6.3 percentage points in AP₅₀. These results confirm that our improvements are effective. Under fully supervised training, HeaDet achieves an F1-Score of 98.35% and an AP₅₀ of 98.1%, outperforming the next best method, YOLO26, by 0.35 and 0.4 percentage points, respectively. Compared with Faster R-CNN and RetinaNet, HeaDet not only maintains high precision but also delivers a substantial gain in recall.

3.5. Difficulty Analysis of HeaDet Model Detection

To analyze detection difficulty, we divided the test set images into eleven groups numbered 1 through 11. These groups are large head, small head, occluded, unoccluded, no accessories, wearing glasses, wearing a mask, wearing a hat, good facial lighting, facial reflection, and uneven facial lighting. Semi-supervised HeaDet and fully supervised YOLOv8 were then evaluated on these eleven groups. The results are shown in Table 4. Since an image may belong to more than one group, the detection difficulty analysis can only provide a rough indication of the actual difficulty.

The quantitative results presented in Table 4 indicate that HeaDet outperforms YOLOv8 in the majority of challenging scenarios. Notably, HeaDet demonstrates a clear performance advantage in handling occlusion and accessories. In the occluded group, HeaDet achieves an F1-Score of 99.12% compared to 97.68% for YOLOv8. Similarly, for subjects wearing accessories, HeaDet maintains superior accuracy, surpassing YOLOv8 by margins of 1.43 percentage points for glasses, 0.91 for masks, and 1.07 for hats. This suggests that the proposed method is particularly effective at learning discriminative features even when target objects are partially obscured or modified by external items.

Furthermore, HeaDet exhibits enhanced robustness under difficult lighting conditions. In scenarios characterized by facial reflection and uneven lighting, HeaDet improves the F1-Score by 1.57 percentage points and 0.57 percentage points respectively, over the baseline. The model also shows greater stability across different object scales, achieving higher F1-Scores in both the large head and small head categories compared to YOLOv8.

While YOLOv8 shows competitive performance in specific unconstrained groups, such as the unoccluded and good facial lighting categories where it slightly edges out HeaDet, the proposed method demonstrates a more consistent and stable performance level. HeaDet achieves an F1-Score exceeding 98.5% across all eleven groups, whereas the performance of YOLOv8 fluctuates more significantly, dropping below 98% in seven of the eleven groups. These results confirm that HeaDet possesses superior generalization capabilities, effectively mitigating the impact of occlusion, accessories, and adverse lighting on detection accuracy.

3.6. Ablation Study on the HPE Model

An ablation study was conducted to systematically evaluate the incremental contributions of Curriculum Learning, Rotation Consistency, and Cut Occlusion to the baseline HPE model. Performance was assessed using MAE and RMSE averaged over the three Euler angles. The experimental results are presented in Table 5.

The baseline model yields an Average MAE of 4.7° and an Average RMSE of 5.8°. Introducing the Cut Occlusion component alone reduces Average MAE to 3.5°, a reduction of 1.2°, and Average RMSE to 4.1°, a reduction of 1.7°. This improvement demonstrates the effectiveness of Cut Occlusion in enhancing model robustness to partial occlusion of the head.

The subsequent integration of Rotation Consistency further decreases Average MAE to 3.0° and Average RMSE to 3.6°, corresponding to additional reductions of 0.5° in both metrics. This gain confirms the contribution of Rotation Consistency in improving robustness to in-plane rotation and generalization across varying poses.
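One way to express a rotation consistency constraint is sketched below. The exact loss used for HPE is not given in this section, so everything here is an assumption: the pose is represented as a 3 × 3 rotation matrix, rotating the image in-plane by an angle is modeled as a rotation about the camera's optical (z) axis, and disagreement is measured by the geodesic angle between two rotations. The sign of the rotation depends on the image coordinate convention.

```python
import math

def rot_z(deg):
    """Rotation matrix about the z axis (assumed to be the optical axis)."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def transpose(a):
    return [list(col) for col in zip(*a)]

def geodesic_deg(r1, r2):
    """Angle in degrees of the relative rotation between r1 and r2."""
    rel = matmul(transpose(r1), r2)
    trace = rel[0][0] + rel[1][1] + rel[2][2]
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

def rotation_consistency_loss(pred_original, pred_rotated, angle_deg):
    """Disagreement between the pose predicted on the rotated image and the
    original prediction composed with the known in-plane rotation."""
    expected = matmul(rot_z(angle_deg), pred_original)
    return geodesic_deg(expected, pred_rotated)

# Hypothetical predictions: a pure 10 degree roll, image rotated by 20 degrees.
consistent = rotation_consistency_loss(rot_z(10.0), rot_z(30.0), 20.0)
inconsistent = rotation_consistency_loss(rot_z(10.0), rot_z(25.0), 20.0)
```

Because the rotation applied to the image is known, such a term needs no pose label, which is what makes this kind of constraint usable on the unlabeled portion of the data.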

Finally, the addition of Curriculum Learning yields marginal but consistent improvements, with Average MAE declining to 2.8° and Average RMSE to 3.4°, a further reduction of 0.2° in both metrics. This result is consistent with Curriculum Learning facilitating progressive learning by gradually increasing task complexity during training.

3.7. Comparative Experiment of HPE Model

To demonstrate that HPE is better suited for the task, we compared it on the same dataset using ResNet, EfficientNetV2, GhostNet, FasterNet, and RetinaNet. All methods were trained fully supervised with the same 10% labeled data. Table 6 shows the experimental results.

HPE achieves the lowest errors on both evaluation metrics, recording an MAE of 3.2° and an RMSE of 4.5°. In comparison, ResNet ranks as the second-best-performing model with an MAE of 3.8° and an RMSE of 4.7°. The remaining models exhibit progressively higher errors, with EfficientNetV2 yielding an MAE of 4.1° and an RMSE of 4.8°, followed by GhostNet and FasterNet, the latter of which shows the least favorable performance with an MAE of 4.8° and an RMSE of 5.3°. Relative to the strongest baseline, ResNet, the proposed method reduces the MAE by approximately 15.8% and the RMSE by roughly 4.3%.
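The relative reductions quoted above follow from the usual formula (baseline − proposed) / baseline. A short check, using only the ResNet and HPE values from the text:

```python
def relative_reduction(baseline, proposed):
    """Relative error reduction of `proposed` with respect to `baseline`, in percent."""
    return (baseline - proposed) / baseline * 100.0

mae_reduction = relative_reduction(3.8, 3.2)   # ResNet MAE vs. HPE MAE, degrees
rmse_reduction = relative_reduction(4.7, 4.5)  # ResNet RMSE vs. HPE RMSE, degrees
print(f"MAE: {mae_reduction:.1f}%, RMSE: {rmse_reduction:.1f}%")
```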

3.8. Comparison with YOLO Variants Under Efficient Teacher Framework

To validate the efficacy of the proposed HeaDet architecture, a series of comparative experiments was conducted under a semi-supervised paradigm. Specifically, YOLOv5, YOLOv6, YOLOv7, YOLOv8, and HeaDet were trained using the Efficient Teacher framework with access to only 10% of the labeled data. The resulting performance comparisons are illustrated in Figure 4.

Within the YOLO series, clear generational differences emerge. YOLOv5 delivers the most balanced performance among baseline models, achieving a Precision of 96.3%, Recall of 97.5%, F1-Score of 96.9%, and AP₅₀ of 96.2%. The narrow margin between Precision and Recall, just 1.2 percentage points, reflects a well-calibrated trade-off between false positive suppression and detection completeness. YOLOv6 exhibits a marked decline across all metrics, with a Precision of 92.1%, a Recall of 94.3%, an F1-Score of 93.2%, and an AP₅₀ of 93.4%. YOLOv7 partially recovers classification accuracy, attaining a Precision of 95.2%, though its Recall trails at 94.1%, yielding an F1-Score of 94.6% and an AP₅₀ of 91.8%. Notably, although YOLOv7 improves on YOLOv6 in Precision, its lower AP₅₀ suggests a weaker precision-recall trade-off across confidence thresholds.

HeaDet outperforms all YOLO baselines across every evaluation metric. It attains a Precision of 99.1%, an improvement of 3.9 percentage points over YOLOv7, a Recall of 99.5%, a gain of 3.6 percentage points relative to YOLOv8, an F1-Score of 99.30%, an increase of 4.65 percentage points compared to YOLOv7, and an AP₅₀ of 99.4%, an enhancement of 6.0 percentage points over YOLOv6. Notably, HeaDet achieves both the highest Precision and Recall simultaneously, effectively overcoming the conventional trade-off between these two measures observed in other architectures. The near equality of Precision and Recall (a gap of only 0.4 percentage points) reflects well-calibrated classification and robust feature learning.

Compared to YOLOv7, HeaDet delivers substantial gains of 3.9 percentage points in Precision, 5.4 percentage points in Recall, 4.65 percentage points in F1-Score, and 7.6 percentage points in AP₅₀. The consistent superiority across all metrics and the marked improvement in AP₅₀ underscore the effectiveness of the proposed architectural innovations in enhancing both localization accuracy and detection completeness. An F1-Score of 99.3% indicates a near-optimal balance between Precision and Recall, while an AP₅₀ exceeding 99% reflects excellent precision-recall characteristics across varying confidence thresholds.

We attribute the performance gains of HeaDet to three key architectural innovations: the dense sampling strategy inherited from Efficient Teacher, an objectness branch designed to enhance pseudo-label quality, and an adaptive threshold mechanism that effectively mitigates the distribution shift between labeled and unlabeled data in the mining truck domain.

3.9. Comparison with State-of-the-Art Semi-Supervised Methods

To further validate the superiority of HeaDet, we compare it with three representative semi-supervised object detection methods: Soft Teacher, Unbiased Teacher, and STAC. All methods are evaluated using 10% of the labeled data. The experimental results are presented in Figure 5.

As shown in Figure 5, Soft Teacher achieves a Precision of 94.2%, Recall of 92.1%, and F1-Score of 93.14%. The relatively narrow gap between Precision and Recall (2.1 percentage points) suggests effective pseudo-label filtering, although the modest AP₅₀ indicates potential difficulties in handling objects across varying scales.

Unbiased Teacher outperforms Soft Teacher across all metrics, achieving a Precision of 95.5%, Recall of 94.3%, and F1-Score of 94.90%. These correspond to gains of 1.3, 2.2, 1.76, and 2.9 percentage points in Precision, Recall, F1-Score, and AP₅₀, respectively. The improved balance between Precision and Recall (a difference of 1.2 percentage points) is consistent with the effectiveness of its teacher-student mutual learning strategy in enhancing pseudo-label quality.

STAC presents a different optimization profile, with a Precision of 93.8%, Recall of 91.5%, and F1-Score of 92.64%. These metrics are lower than those of both Soft Teacher and Unbiased Teacher, with the F1-Score being the lowest among all methods.

HeaDet establishes a new state-of-the-art performance across all evaluation criteria. It achieves a Precision of 99.1%, a Recall of 99.5%, and an F1-Score of 99.3%. These results correspond to gains of 3.6, 5.2, and 4.40 percentage points over the strongest competitor in each respective metric. Notably, HeaDet achieves this superior performance while effectively resolving the trade-off observed in competing approaches. The near equality of Precision and Recall (a gap of only 0.4 percentage points) reflects exceptional pseudo-label quality control and effective mitigation of confirmation bias, a persistent challenge in iterative self-training. The F1-Score approaching 99.3% further demonstrates the framework’s ability to maintain detection precision across varying confidence thresholds, addressing the calibration limitations evident in both Soft Teacher and Unbiased Teacher.

This comparative analysis reveals that existing semi-supervised methods tend to specialize: Unbiased Teacher excels in classification calibration, while STAC demonstrates strong localization consistency. Neither, however, achieves unified optimization across both dimensions. The comprehensive superiority of HeaDet suggests that its design, integrating adaptive pseudo-label weighting, uncertainty-aware consistency regularization, and dynamic threshold adjustment, effectively addresses the core challenges of noise sensitivity and confirmation bias that plague semi-supervised object detection.

3.10. The HPE and SemiUHPE Pose Estimation with Different Labeled Data Ratios

To assess HPE under limited supervision, we evaluate HPE and SemiUHPE for head pose estimation with varying amounts of labeled training data, specifically 1%, 3%, 5%, and 10% of the full dataset. Table 7 presents the MAE for each Euler angle, along with the average values, for both methods at each labeled-data proportion.

As the fraction of labeled data increases from 1% to 10%, both methods exhibit a consistent decline in Average MAE, indicating that access to more ground-truth annotations systematically improves estimation accuracy. This trend holds across all individual Euler angles and confirms the expected utility of labeled data in semi-supervised regression tasks.

Comparing the two approaches, HPE achieves lower Average MAE than SemiUHPE in all four experimental settings. At the lowest annotation level of 1%, the difference is marginal, with Average MAE values of 5.7° and 5.6° for SemiUHPE and HPE, respectively. However, as the labeled data proportion increases to 3%, HPE demonstrates a more pronounced advantage, reducing the Average MAE from 4.7° to 4.1°, a relative improvement of approximately 13%. At 5% labeled data, the gap narrows, with HPE attaining an Average MAE of 3.4° compared to 3.6° for SemiUHPE. At 10% annotation, HPE again outperforms its counterpart, achieving an Average MAE of 2.8° versus 3.0°.

Examining individual Euler angles reveals a similar pattern. HPE generally yields lower MAE for Pitch, Yaw, and Roll, with the most substantial gains observed under limited supervision. These results demonstrate that while both methods benefit from increased labeled data, HPE offers superior estimation accuracy, particularly when labeled examples are scarce. The advantage is most evident in the 3% to 5% labeled data regime, where HPE consistently outperforms SemiUHPE across all metrics. This finding suggests that the architectural or training innovations embedded in HPE contribute meaningfully to more efficient utilization of limited supervision in head pose estimation tasks.

Table 8 presents the RMSE for each Euler angle, along with the corresponding average values.

Experimental findings indicate that both SemiUHPE and HPE benefit from increased supervision, with RMSE values for individual Euler angles, as well as their average, declining monotonically as the proportion of labeled data increases from 1% to 10%. This trend reinforces the established relationship between annotation availability and regression accuracy in head pose estimation.

Across all experimental configurations, HPE consistently achieves lower error than SemiUHPE. At the lowest annotation level of 1%, HPE attains an Average RMSE of 6.6°, with per-angle errors of 8.0°, compared to 6.8° for SemiUHPE. This performance margin persists as labeled data increases. At 3%, HPE records an Average RMSE of 5.4° versus 5.8° for SemiUHPE; at 5%, the difference narrows to 4.1° versus 4.2°; and at 10%, HPE records 3.4° compared to 3.6° for SemiUHPE. The margin, ranging from 0.1° to 0.4° in Average RMSE, suggests that HPE makes more effective use of limited supervision across all data regimes.

A further consistent observation is the elevated difficulty associated with Pitch estimation relative to Yaw and Roll. For HPE, the gap between Pitch error and Roll error spans 1.6 to 2.4 points across data proportions, while for SemiUHPE, this gap ranges from 1.2 to 1.7 points. This persistent discrepancy underscores the inherent complexity of modeling vertical head rotations, which may involve greater ambiguity and variability compared to horizontal or torsional movements.

3.11. Comparison Between HPE and Other Head Pose Estimation Methods

To better position this study within the broader literature, we compare our method with several fully supervised approaches evaluated on datasets such as BIWI and AFLW2000. While these datasets differ substantially from ours, the comparison still offers a useful reference point for assessing the proposed approach. Cross-dataset comparisons are intended only as qualitative references. The results are summarized in Table 9.

As presented in the table, the average MAE values for fully supervised methods across different datasets lie within the range of 2° to 5°. The lowest among these is achieved by the TRG method, with an average MAE of 2.75°. In comparison, our semi-supervised approach achieves an average MAE of 2.8° on our dataset. These results indicate that the proposed method remains competitive relative to existing approaches.

3.12. Edge Deployment Optimization on NVIDIA Jetson Orin NX

For real-world deployment, achieving an appropriate balance between accuracy and inference speed is essential. We evaluated the performance of HeaDet and HPE under three numerical precision configurations: FP32 for full precision, FP16 for half precision, and INT8 for 8-bit integer quantization. All models used in this evaluation were trained with 10% labeled data. The corresponding results for object detection and head pose estimation are presented in Figure 6.

The FP32, FP16, and INT8 versions of HeaDet have model sizes of 53.2 MB, 24.1 MB, and 14.0 MB, respectively. Their memory usage during inference is 86.3 MB, 38.4 MB, and 22.5 MB. All three HeaDet variants require 28.7 GFLOPs. For HPE, the three versions have model sizes of 5.6 MB, 2.3 MB, and 1.4 MB, with inference memory usage of 18.6 MB, 5.2 MB, and 2.1 MB. All HPE variants require 0.3 GFLOPs.

For HeaDet, the full-precision FP32 configuration achieves an AP₅₀ of 99.4%, a recall of 99.3%, and an F1-Score of 99.5%, establishing a high-performance baseline for object detection. Transitioning to FP16 results in only minor degradation: AP₅₀ decreases to 98.8%, recall to 98.9%, and F1-Score to 98.7%. These small losses indicate that 16-bit floating-point representation preserves sufficient numerical fidelity for accurate head localization, with minimal impact on detection quality. In contrast, INT8 quantization leads to more pronounced performance drops: AP₅₀ declines to 97.4%, recall to 96.5%, and F1-Score to 97.8%, reflecting the reduced dynamic range and discretization error inherent in integer quantization. Despite this, the retention of an AP₅₀ above 97% suggests that post-training calibration or quantization-aware training effectively mitigates accuracy loss.
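The discretization error mentioned above can be seen in a toy example of symmetric per-tensor INT8 quantization. This is a generic illustration with hypothetical weight values; it is not the calibration procedure used by TensorRT or by the authors.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to signed 8-bit integers."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map integers back to approximate floating-point values."""
    return [q * scale for q in quantized]

# Hypothetical weights spanning a wide dynamic range.
weights = [0.003, -0.25, 0.8, -1.27, 0.0412]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The rounding error of each value is bounded by half the step size `scale`, so a single large weight widens the step for the whole tensor and small weights such as 0.003 collapse to zero. This is the reduced dynamic range that calibration tries to manage.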

Inference speed improvements are substantial across quantization levels. HeaDet achieves 22 FPS under FP32, increasing to 51 FPS in FP16 (a 132% improvement), and further to 106 FPS in INT8 (a 382% improvement). Correspondingly, latency reduces from 45.5 ms to 19.6 ms and then to 9.4 ms. This demonstrates significant gains in real-time capability without compromising detection reliability at moderate quantization levels.

For HPE, the FP32 configuration yields an MAE of 2.8° and RMSE of 3.4°, indicating high angular estimation accuracy. FP16 conversion increases MAE to 3.1° and RMSE to 3.9°, representing a modest degradation likely due to limited precision in activation and weight representations. INT8 quantization results in higher errors, with an MAE of 3.5° and an RMSE of 4.3°, indicating increased sensitivity to quantization noise in regression tasks. However, these errors remain within acceptable bounds for practical applications involving head orientation estimation.

Inference throughput for HPE also scales with reduced precision. The model runs at 55 FPS under FP32, improves to 102 FPS in FP16 (an 85.5% increase), and reaches 185 FPS in INT8 (a 236% increase). Latency is reduced from 18.2 ms to 9.8 ms and further to 5.4 ms, so that both the FP16 and INT8 configurations respond in under 10 ms, which is well suited to real-time systems.
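The latency and speed-up figures for both models follow from the reported frame rates: latency in milliseconds is 1000 divided by FPS, and the improvement is measured relative to FP32. A short check using only the frame rates from the text:

```python
def latency_ms(fps):
    """Per-frame latency in milliseconds for a given throughput."""
    return 1000.0 / fps

def speedup_pct(base_fps, new_fps):
    """Throughput increase relative to the baseline, in percent."""
    return (new_fps - base_fps) / base_fps * 100.0

fps = {
    "HeaDet": {"FP32": 22, "FP16": 51, "INT8": 106},
    "HPE": {"FP32": 55, "FP16": 102, "INT8": 185},
}
for model, table in fps.items():
    for mode, value in table.items():
        print(f"{model} {mode}: {latency_ms(value):.1f} ms, "
              f"{speedup_pct(table['FP32'], value):+.1f}% vs FP32")
```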

Collectively, these results demonstrate that both HeaDet and HPE benefit from low-precision inference in terms of throughput and latency. While INT8 introduces measurable accuracy degradation, especially in regression-based HPE, the performance remains adequate for downstream distraction classification. FP16 emerges as a favorable compromise, offering near-optimal accuracy with substantial efficiency gains. This makes it well-suited for deployment in embedded systems where real-time processing and energy efficiency are important. The findings support the feasibility of deploying deep learning-based driver monitoring systems using quantized models without sacrificing operational effectiveness.

To better elucidate the impact of quantization on HPE performance, the MAE and RMSE for different Euler angles are visualized, as illustrated in Figure 7.

As shown in Figure 7, a progressive degradation in estimation accuracy is observed as the numerical precision decreases from FP32 to INT8. The FP32 configuration serves as the baseline, demonstrating the highest fidelity with MAE values of 2.8°, 3.1°, and 2.5° for Pitch, Yaw, and Roll, respectively. The corresponding RMSE values for the baseline are 4.1° (Pitch), 3.3° (Yaw), and 2.8° (Roll).

Transitioning to the FP16 precision results in a slight but consistent increase in error across all metrics. The MAE for all angles increases by approximately 0.3°, while the RMSE shows a more pronounced rise, particularly for the Pitch angle, which increases from 4.1° to 4.5°. This suggests that the reduction to half-precision introduces minor quantization noise that marginally impacts the model’s predictive stability.

A more substantial performance drop is evident with the INT8 quantization. Compared to the FP32 baseline, the MAE for Yaw experiences the most significant relative increase, rising from 3.1° to 3.8°. The RMSE metrics are notably more sensitive to this aggressive quantization. The RMSE for Pitch escalates sharply from 4.1° in FP32 to 5.7° in INT8, representing a 39% increase. Similarly, the RMSE for Roll increases from 2.8° to 4.2°. This marked elevation in RMSE, relative to MAE, indicates the presence of larger and more frequent estimation outliers under INT8 quantization.
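The reasoning that a larger rise in RMSE than in MAE points to outliers can be illustrated with synthetic error values. The numbers below are hypothetical and chosen only so that both error sets share the same MAE.

```python
import math

def mae(errors):
    """Mean absolute error of a list of signed errors."""
    return sum(abs(e) for e in errors) / len(errors)

def rmse(errors):
    """Root mean square error of a list of signed errors."""
    return math.sqrt(sum(e * e for e in errors) / len(errors))

uniform = [3.0] * 10                 # every frame is off by 3 degrees
with_outlier = [2.0] * 9 + [12.0]    # mostly better, but one large miss

print(f"uniform:      MAE {mae(uniform):.2f}, RMSE {rmse(uniform):.2f}")
print(f"with outlier: MAE {mae(with_outlier):.2f}, RMSE {rmse(with_outlier):.2f}")
```

Both sets have an MAE of 3.0°, but the set with one large miss has a clearly higher RMSE, because squaring weights large errors more heavily.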

The experimental results elucidate a fundamental trade-off between model precision and inference efficiency in quantized head pose estimation. FP32 remains the preferred configuration for accuracy-critical scenarios where computational resources are unconstrained. FP16 provides an optimal compromise, delivering near FP32 accuracy with improved throughput, making it well-suited for balanced deployment scenarios. INT8, while introducing higher estimation errors, remains viable for real-time applications where latency constraints and hardware limitations are paramount.

3.13. Deployment of SemiCHPE on Real-World Mining Trucks

The proposed framework was deployed across 15 mining trucks operating within a large open-pit mine in Namibia. The deployment leveraged an edge computing platform based on the NVIDIA Jetson Orin NX, executing the SemiCHPE method using FP16 numerical precision to balance accuracy and inference efficiency under real-world operating conditions. In this configuration, HeaDet achieves an inference speed of 51 frames per second, while the head pose estimation module operates at 102 frames per second. The complete pipeline, which includes both detection and pose estimation, delivers an overall inference speed of 32 frames per second. To further improve inference speed, techniques such as multithreading can be utilized to optimize performance.
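The relation between the per-stage frame rates and the end-to-end rate can be estimated with a simple model. This is a sketch under stated assumptions: the sequential estimate adds the two stage latencies and ignores cropping, pre-processing, and post-processing, and the pipelined estimate assumes that the two stages can fully overlap on different frames. Neither function describes the authors' actual scheduler.

```python
def sequential_fps(stage_fps):
    """Throughput when the stages run back to back on each frame."""
    return 1.0 / sum(1.0 / f for f in stage_fps)

def pipelined_fps(stage_fps):
    """Upper bound when stages overlap on different frames:
    the slowest stage determines the throughput."""
    return min(stage_fps)

stages = [51.0, 102.0]  # HeaDet and HPE in FP16, from the text
print(f"sequential estimate: {sequential_fps(stages):.1f} FPS")
print(f"pipelined bound:     {pipelined_fps(stages):.1f} FPS")
```

The sequential estimate of 34 FPS sits slightly above the measured 32 FPS, which is plausible once per-frame overhead is included, and the pipelined bound indicates the headroom that multithreading could recover.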

Over a continuous deployment period, a total of 1901 video clips were recorded, comprising 671 instances of distracted driving and 1230 instances of normal driving. The classification of driver distraction was defined in accordance with the established research literature and industrial safety standards. Head poses in the training set during driver distraction events were analyzed, and the following thresholds for classifying a driver as distracted were determined. A driver is considered distracted if the head orientation meets any of these conditions: Pitch angle below −23.5° or above 30.4°, Yaw angle below −47.3°, or Roll angle below −18.8° or above 19.6°. Furthermore, a distraction event is counted only when such a condition is observed for at least 83 out of 100 consecutive frames.
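The per-frame rule and the frame-count criterion above can be combined in a small sliding-window check. This is an illustrative sketch, not the deployed code: the lower bounds are taken as negative angles, the yaw rule is one-sided as in the text, and the function names and the `(pitch, yaw, roll)` tuple layout are assumptions. The window length and minimum hit count are parameters so that other settings can be tried.

```python
from collections import deque

# Thresholds in degrees; negative lower bounds are an assumption of this sketch.
LIMITS = {"pitch": (-23.5, 30.4), "yaw": (-47.3, None), "roll": (-18.8, 19.6)}

def frame_exceeds_limits(pitch, yaw, roll, limits=LIMITS):
    """True if any angle falls outside its allowed range."""
    angles = {"pitch": pitch, "yaw": yaw, "roll": roll}
    for name, (low, high) in limits.items():
        value = angles[name]
        if low is not None and value < low:
            return True
        if high is not None and value > high:
            return True
    return False

def detect_distraction(poses, window=100, min_hits=83):
    """True if some run of at most `window` consecutive frames contains
    at least `min_hits` frames that exceed the limits."""
    recent = deque(maxlen=window)
    hits = 0
    for pitch, yaw, roll in poses:
        if len(recent) == window:
            hits -= recent[0]          # the oldest flag is about to be evicted
        flag = int(frame_exceeds_limits(pitch, yaw, roll))
        recent.append(flag)
        hits += flag
        if hits >= min_hits:
            return True
    return False

looking_away = (0.0, -60.0, 0.0)   # yaw beyond the limit
looking_ahead = (0.0, 0.0, 0.0)
clip = [looking_away] * 83 + [looking_ahead] * 17
is_distracted = detect_distraction(clip)
```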

To mitigate the influence of transient head movements and short-term pose variations, we introduced a temporal persistence criterion. Specifically, a video segment was considered a distraction event only when at least 80 out of 100 consecutive frames exceeded the predefined angular thresholds. Such a temporal filtering mechanism effectively distinguishes sustained distraction from incidental head motions, thereby improving the reliability of ground-truth annotations for subsequent evaluation. The threshold used to classify driver distraction was determined based on the practical experience of seasoned mine safety supervisors in conjunction with onboard validation. Although this threshold may not be broadly generalizable, it is adequate for the current experimental stage, which emphasizes algorithmic validation and iterative refinement. The final evaluation results are presented in Figure 8.

As illustrated in Figure 8, the performance of two head pose estimation-based driver distraction detection methods, namely, the proposed SemiCHPE framework and a conventional Facial Landmarks baseline, is evaluated using confusion matrices and standard classification metrics, including accuracy, precision, recall, and F1-Score. The results reveal a substantial disparity in discriminative capability between the two approaches, underscoring the effectiveness of the SemiCHPE framework.

Examining the confusion matrix for SemiCHPE, the model correctly classifies 1150 out of 1230 normal instances, while 80 normal cases are misclassified as distracted (false positives). Among the 671 distracted driving instances, 69 are incorrectly identified as normal (false negatives), and 602 are correctly detected. These figures yield an overall accuracy of 92.2%, reflecting strong generalization performance across both classes. The precision of 88.3% indicates that the majority of instances predicted as distracted correspond to true positive events, demonstrating reliable confidence in positive predictions. With a recall of 89.7%, the model successfully detects nearly 90% of actual distracted drivers, suggesting robust sensitivity to distraction-related behavioral patterns. The balanced F1-Score of 89.0% further confirms the method’s ability to maintain an effective trade-off between precision and recall.

In contrast, the Facial Landmarks method exhibits inferior performance. It correctly identifies 516 distracted drivers while failing to detect 155 instances, resulting in an elevated false-negative rate. A total of 190 normal driving instances are misclassified as distracted, indicating a notable increase in false alarms. Correct classification of normal cases reaches 1040. Thus, overall accuracy declines to 81.9%, with precision and recall falling to 73.1% and 76.9%, respectively. The corresponding F1-Score of 74.9% reflects a less favorable balance between detection sensitivity and predictive reliability.

A comparative analysis reveals that SemiCHPE consistently outperforms the Facial Landmarks method across all evaluation metrics. Specifically, the proposed method achieves absolute improvements of 10.3 percentage points in accuracy, 15.2 percentage points in precision, 12.8 percentage points in recall, and 14.1 percentage points in F1-Score. These gains suggest that SemiCHPE is capable of extracting more discriminative cues associated with driver distraction, likely attributable to enhanced spatial–temporal modeling or more robust pose estimation under challenging conditions, including variable lighting, partial occlusion, and dynamic head motion.
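The reported metrics follow directly from the confusion-matrix counts given above, with "distracted" treated as the positive class. The short sketch below recomputes them; the function name is hypothetical, and the gains are taken as differences of the one-decimal rounded percentages, which is an assumption about how the percentage-point improvements were derived.

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, precision, recall and F1 in percent, rounded to one decimal."""
    total = tp + fn + fp + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": round(100 * (tp + tn) / total, 1),
        "precision": round(100 * precision, 1),
        "recall": round(100 * recall, 1),
        "f1": round(100 * f1, 1),
    }


# Counts from the two confusion matrices (positive class = distracted).
semichpe = metrics(tp=602, fn=69, fp=80, tn=1150)
landmarks = metrics(tp=516, fn=155, fp=190, tn=1040)

# Percentage-point gains, computed from the rounded values.
gains = {name: round(semichpe[name] - landmarks[name], 1) for name in semichpe}

for label, row in [("SemiCHPE", semichpe), ("Facial Landmarks", landmarks), ("gain (pp)", gains)]:
    print(f"{label:>16}: {row}")
```

Running this gives gains of 10.3, 15.2, 12.8, and 14.1 percentage points for accuracy, precision, recall, and F1-Score, matching the comparison above.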

Based on error analysis, representative failure cases of the proposed SemiCHPE framework are illustrated in Figure 9. These false negatives predominantly stem from pronounced lateral leaning, forward flexion, or backward extension of the driver’s torso, often accompanied by partial occlusion of the head region.

In summary, these findings establish the proposed SemiCHPE method as a more accurate and dependable solution for driver distraction detection. Its superior performance in mitigating false negatives renders it well-suited for safety-critical scenarios where timely identification of distracted behavior is paramount.

4. Conclusions

In this paper, we present SemiCHPE, a semi-supervised framework for driver head pose estimation in mining truck scenarios. The proposed method integrates HeaDet for head detection and HPE for head pose estimation in a two-stage cascade architecture.

Our extensive experiments demonstrate the effectiveness of the proposed framework. Under semi-supervised settings with only 10% labeled data, HeaDet achieves 99.1% precision, 99.5% recall, and 99.4% AP₅₀ for head detection. Compared to state-of-the-art semi-supervised methods including Soft Teacher, Unbiased Teacher, and STAC, HeaDet outperforms the best competitor by 3.6 percentage points in precision and 5.9 points in AP₅₀. For head pose estimation, a curriculum learning method based on loss weights autonomously adjusts the learning difficulty of HPE under semi-supervised conditions, enabling it to achieve a mean absolute error of 2.8° using only 10% of the labeled dataset.

We further evaluate the deployment feasibility on NVIDIA Jetson Orin NX. With FP16 quantization, the detection module runs at 51 FPS with 0.3% AP₅₀ degradation, while the pose estimation module achieves 102 FPS with only a 0.3° MAE increase. The combined cascade pipeline processes at 32 FPS, exceeding our 30 FPS real-time requirement for mining truck monitoring.

In this study, we propose a semi-supervised detection framework that attains state-of-the-art data efficiency for driver monitoring in mining trucks, while also incorporating a comprehensive optimization pipeline that enables real-time edge deployment on resource-constrained hardware. The proposed method has been successfully deployed in an operational setting. However, this distraction detection method, which depends largely on heuristic thresholds applied to Euler angles and frame-level counting, offers limited granularity and does not yet fully exploit the capabilities of the SemiCHPE framework. The core challenge stems from the inherent variability in driver posture within mining truck environments, where fixed thresholds prove insufficient for capturing the full spectrum of distraction-related behaviors. As highlighted in prior research [33], spatio-temporal modeling tends to yield stronger generalization in such dynamic settings. Building on the findings of this study, future work will focus on integrating temporal modeling into the detection pipeline to enable more robust and context-aware distraction recognition across video sequences.

Author Contributions

Conceptualization, F.J.; Methodology, B.H.; Validation, F.J.; Formal analysis, W.Z.; Investigation, W.Z.; Resources, X.C.; Data curation, W.Z. and X.C.; Writing—original draft, F.J.; Visualization, Y.L. (Yong Li); Supervision, Y.L. (Yulong Liu), X.C. and W.Z.; Project administration, Y.L. (Yong Li) and B.H.; Funding acquisition, Y.L. (Yulong Liu). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study was reviewed and approved by the Medical Ethics Committee of Jiangsu University (IRB Approval No. JSDX202601089, approved 1 January 2026).

Informed Consent Statement

All experiments involving human participants followed the Declaration of Helsinki and relevant Chinese regulations. All participants provided written informed consent before data collection, and all raw personal identifiers were removed to protect privacy.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

Authors Feng Jiang and Yulong Liu were employed by the company CGNPC Uranium Resources Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Li, Z.; Liu, X.; Liu, S.; Ma, H.; Wu, G. Building and Validating a Coal Mine Safety Question-Answering System with a Large Language Model Through a Two-Stage Fine-Tuning Method. Appl. Sci. 2026, 16, 971.
2. Nowak-Senderowska, D.; Pyra, J. Accidents in the Production, Transport, and Handling of Explosives: TOL Method Hazard Analysis with a Mining Case Study. Appl. Sci. 2025, 15, 10150.
3. Burtan, Z.; Nowak-Senderowska, D.; Szczepański, P. Methodology for Identification of Occupational Hazards Using Their Characteristic Features in Hard Coal Mining. Appl. Sci. 2025, 15, 7079.
4. Tian, Z.; Chen, F.; Ma, S.; Guo, M. Analysis of the Severity of Heavy Truck Traffic Accidents Under Different Road Conditions. Appl. Sci. 2024, 14, 10751.
5. You, K.; Gu, Y.; Shao, H.; Wang, Y. A Liquid-Impulse Neural Network Model Based on Heterogeneous Fusion of Multimodal Information for Interpretable Rotating Machinery Fault Diagnosis. Mech. Syst. Signal Process. 2026, 246, 113923.
6. Chen, Z.; Yang, J.; Chen, L.; Li, F.; Feng, Z.; Jia, L.; Li, P. RailVoxelDet: A Lightweight 3-D Object Detection Method for Railway Transportation Driven by Onboard LiDAR Data. IEEE Internet Things J. 2025, 12, 37175–37189.
7. Akiduki, T.; Nagasawa, J.; Zhang, Z.; Omae, Y.; Arakawa, T.; Takahashi, H. Inattentive Driving Detection Using Body-Worn Sensors: Feasibility Study. Sensors 2022, 22, 352.
8. Halin, A.; Verly, J.G.; Van Droogenbroeck, M. Survey and Synthesis of State of the Art in Driver Monitoring. Sensors 2021, 21, 5558.
9. Jegham, I.; Ben Khalifa, A.; Alouani, I.; Mahjoub, M.A. A Novel Public Dataset for Multimodal Multiview and Multispectral Driver Distraction Analysis: 3MDAD. Signal Process. Image Commun. 2020, 88, 115960.
10. Li, W.; Huang, J.; Xie, G.; Karray, F.; Li, R. A Survey on Vision-Based Driver Distraction Analysis. J. Syst. Archit. 2021, 121, 102319.
11. Trifunović, A.; Senić, A.; Čičević, S.; Ivanišević, T.; Vukšić, V.; et al. Evaluating the Road Environment Through the Lens of Professional Drivers: A Traffic Safety Perspective. Mechatron. Intell. Transp. Syst. 2024, 3, 31–38.
12. Fonseca, T.; Ferreira, S. Truck Driver Safety: Factors Influencing Risky Behaviors on the Road—A Systematic Review. Appl. Sci. 2025, 15, 9662.
13. Wang, J.; Zheng, X.; Shahani, N.M.; Guo, X.; Xin, W.; Yue, W.; Liu, L.; Yan, K. Review of Major Influencing Factors Contributing to Persisting Safety Problems in Coal Mines: Addressing Systemic Challenges. Appl. Sci. 2024, 14, 9665.
14. Fu, S.; Yang, Z.; Ma, Y.; Li, Z.; Xu, L.; Zhou, H. Advancements in the Intelligent Detection of Driver Fatigue and Distraction: A Comprehensive Review. Appl. Sci. 2024, 14, 3016.
15. Liu, H.; Wang, D.; Xu, K.; Zhou, P.; Zhou, D. Lightweight Convolutional Neural Network for Counting Densely Piled Steel Bars. Autom. Constr. 2023, 146, 104692.
16. Liu, H.; Xu, K. Recognition of Gangues from Color Images Using Convolutional Neural Networks with Attention Mechanism. Measurement 2023, 206, 112273.
17. Liu, H.; Xu, K. Densely End Face Detection Network for Counting Bundled Steel Bars Based on YoloV5. In Proceedings of the Pattern Recognition and Computer Vision; Ma, H., Wang, L., Zhang, C., Wu, F., Tan, T., Wang, Y., Lai, J., Zhao, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2021; pp. 293–303.
18. Chun, S.; Chang, J.Y. 6DoF Head Pose Estimation Through Explicit Bidirectional Interaction with Face Geometry. In Proceedings of the European Conference on Computer Vision 2024; Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G., Eds.; Springer Nature Switzerland: Cham, Switzerland, 2024; Volume 15091, pp. 146–163.
19. Li, Y.; Tan, G.; Gou, C. Cascaded Iterative Transformer for Jointly Predicting Facial Landmark, Occlusion Probability and Head Pose. Int. J. Comput. Vis. 2023, 132, 1242–1257.
20. Zhou, Y.; Gregson, J. WHENet: Real-Time Fine-Grained Estimation for Wide Range Head Pose. In Proceedings of the British Machine Vision Conference, Virtually, 7–10 September 2020; Volume 2020, p. 189.
21. Zhang, Z.; Zhang, R.; Sun, J. Research on the Comprehensive Evaluation Method of Driving Behavior of Mining Truck Drivers in an Open-Pit Mine. Appl. Sci. 2023, 13, 11597.
22. Sohn, K.; Zhang, Z.; Li, C.-L.; Zhang, H.; Lee, C.-Y.; Pfister, T. A Simple Semi-Supervised Learning Framework for Object Detection. arXiv 2020, arXiv:2005.04757.
23. Liu, Y.-C.; Ma, C.-Y.; He, Z.; Kuo, C.-W.; Chen, K.; Zhang, P.; Wu, B.; Kira, Z.; Vajda, P. Unbiased Teacher for Semi-Supervised Object Detection. arXiv 2021, arXiv:2102.09480.
24. Xu, M.; Zhang, Z.; Hu, H.; Wang, J.; Wang, L.; Wei, F.; Bai, X.; Liu, Z. End-to-End Semi-Supervised Object Detection with Soft Teacher. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: Montreal, QC, Canada, 2021; pp. 3040–3049.
25. Xu, B.; Chen, M.; Guan, W.; Hu, L. Efficient Teacher: Semi-Supervised Object Detection for YOLOv5. arXiv 2023, arXiv:2302.07577.
26. Li, C.; Li, L.; Geng, Y.; Jiang, H.; Cheng, M.; Zhang, B.; Ke, Z.; Xu, X.; Chu, X. YOLOv6 v3.0: A Full-Scale Reloading. arXiv 2023, arXiv:2301.05586.
27. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022.
28. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430.
29. Basak, S.; Corcoran, P.; Khan, F.; Mcdonnell, R.; Schukat, M. Learning 3D Head Pose From Synthetic Data: A Semi-Supervised Approach. IEEE Access 2021, 9, 37557–37573.
30. Zhou, H.; Jiang, F.; Yuan, J.; Rui, Y.; Lu, H.; Jia, K. Semi-Supervised Unconstrained Head Pose Estimation in the Wild. IEEE Trans. Pattern Anal. Mach. Intell. 2026, 48, 3082–3099.
31. Howard, A.; Sandler, M.; Chen, B.; Wang, W.; Chen, L.-C.; Tan, M.; Chu, G.; Vasudevan, V.; Zhu, Y.; Pang, R.; et al. Searching for MobileNetV3. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision; IEEE: Seoul, Republic of Korea, 2019; pp. 1314–1324.
32. Yin, Y.; Cai, Y.; Wang, H.; Chen, B. FisherMatch: Semi-Supervised Rotation Regression via Entropy-Based Filtering. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11154–11163.
33. Hu, T.; Jha, S.; Busso, C. Temporal Head Pose Estimation from Point Cloud in Naturalistic Driving Conditions. IEEE Trans. Intell. Transp. Syst. 2022, 23, 8063–8076.

Figure 1. Schematic diagram of the mining truck driver head pose estimation task.

Figure 2. The framework of the SemiCHPE method.

Figure 3. The distribution of mining truck driver head poses in the dataset.

Figure 4. Performance comparison of YOLO variants.

Figure 5. Comparison with state-of-the-art semi-supervised object detection methods.

Figure 6. HeaDet and HPE deployment performance on NVIDIA Jetson Orin NX.

Figure 7. MAE and RMSE of the proposed model under different quantization configurations.

Figure 8. Real-world deployment performance comparison.

Figure 9. Samples of head pose estimation errors in SemiCHPE.

Table 1. HeaDet performance with different labeled data ratios.

Labeled Data	Precision (%)	Recall (%)	F1-Score (%)	AP₅₀ (%)
1%	78.5	66.2	71.83	70.1
3%	84.2	82.8	83.49	81.6
5%	97.3	96.1	96.70	95.2
10%	99.1	99.5	99.30	99.4

Table 2. Results of the HeaDet ablation experiment.

Baseline	GRL	Objectness Branch	Precision (%)	Recall (%)	F1-Score (%)	AP₅₀ (%)
✓			91.3	94.5	92.9	89.2
✓	✓		91.8	95.9	93.8	90.1
✓	✓	✓	99.1	99.5	99.3	99.4

Table 3. Comparison of fully supervised training performance between HeaDet and other models.

Model	Precision (%)	Recall (%)	F1-Score (%)	AP₅₀ (%)
YOLOv13	95.6	94.3	94.95	93.1
YOLO26	98.1	97.9	98.00	97.7
RetinaNet	92.8	95.3	94.03	94.5
Faster R-CNN	96.2	95.9	96.05	95.1
HeaDet	98.5	98.2	98.35	98.1

Table 4. Detection difficulty on the test set of semi-supervised HeaDet versus fully supervised YOLOv8.

Group ID	HeaDet F1-Score (%)	YOLOv8 F1-Score (%)
1	99.44	97.66
2	99.12	97.68
3	98.56	97.57
4	98.85	98.88
5	98.92	99.07
6	99.43	98.16
7	98.93	97.36
8	99.10	99.03
9	98.58	97.67
10	98.89	98.03
11	98.60	98.32

Table 5. Results of the HPE ablation experiment.

Baseline	Cut Occlusion	Rotation Consistency	Curriculum Learning	Average MAE (°)	Average RMSE (°)
✓				4.7	5.8
✓	✓			3.5	4.1
✓	✓	✓		3.0	3.6
✓	✓	✓	✓	2.8	3.4

Table 6. Comparison of fully supervised training performance between HPE and other models.

Model	Average MAE (°)	Average RMSE (°)
ResNet	3.8	4.7
EfficientNetV2	4.1	4.8
GhostNet	4.6	5.1
FasterNet	4.8	5.3
HPE	3.2	4.5

Table 7. The performance of the SemiUHPE and HPE in terms of MAE.

Labeled Data	Method	MAE Pitch (°)	MAE Yaw (°)	MAE Roll (°)	Average MAE (°)
1%	SemiUHPE	6.1	5.3	5.6	5.7
1%	HPE	5.8	6.2	4.9	5.6
3%	SemiUHPE	5.0	4.8	4.3	4.7
3%	HPE	4.2	4.5	3.6	4.1
5%	SemiUHPE	3.2	4.1	3.6	3.6
5%	HPE	3.5	3.8	3.0	3.4
10%	SemiUHPE	3.0	2.9	3.1	3.0
10%	HPE	2.8	3.1	2.5	2.8

Table 8. The performance of the SemiUHPE and HPE in terms of RMSE.

Labeled Data	Method	RMSE Pitch (°)	RMSE Yaw (°)	RMSE Roll (°)	Average RMSE (°)
1%	SemiUHPE	7.8	6.5	6.1	6.8
1%	HPE	8.0	6.4	5.6	6.6
3%	SemiUHPE	6.3	5.9	5.2	5.8
3%	HPE	5.9	4.7	4.1	5.4
5%	SemiUHPE	4.8	3.6	4.3	4.2
5%	HPE	5.0	3.9	3.4	4.1
10%	SemiUHPE	4.2	3.5	3.1	3.6
10%	HPE	4.1	3.3	2.8	3.4
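For readers reproducing Tables 7 and 8 on their own data, the sketch below shows one way to compute per-axis MAE and RMSE and their averages. It is illustrative only: the sample angles are hypothetical, wrapping angle differences into [-180°, 180°) is an assumption rather than something stated in the paper, and the average columns are taken as the arithmetic mean over pitch, yaw, and roll.

```python
import math


def angular_error(pred, true):
    """Signed difference between two angles in degrees, wrapped to [-180, 180)."""
    return (pred - true + 180.0) % 360.0 - 180.0


def mae_rmse(preds, trues):
    """Return (MAE, RMSE) in degrees for paired predictions and labels."""
    errors = [angular_error(p, t) for p, t in zip(preds, trues)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return mae, rmse


# Hypothetical predictions and ground-truth labels (degrees).
predicted = {"pitch": [10.0, -5.0, 2.0], "yaw": [30.0, -12.0, 1.0], "roll": [0.0, 4.0, -3.0]}
ground_truth = {"pitch": [12.0, -4.0, 2.0], "yaw": [27.0, -12.0, 5.0], "roll": [1.0, 2.0, -3.0]}

per_axis = {axis: mae_rmse(predicted[axis], ground_truth[axis]) for axis in predicted}
avg_mae = sum(mae for mae, _ in per_axis.values()) / len(per_axis)
avg_rmse = sum(rmse for _, rmse in per_axis.values()) / len(per_axis)

for axis, (mae, rmse) in per_axis.items():
    print(f"{axis:>5}: MAE = {mae:.2f}°, RMSE = {rmse:.2f}°")
print(f"  avg: MAE = {avg_mae:.2f}°, RMSE = {avg_rmse:.2f}°")
```

Because RMSE squares each error before averaging, it is never smaller than MAE and grows faster when a few frames have large errors, which is why the RMSE columns sit above the corresponding MAE columns.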

Table 9. Comparison of the performance between our method and fully supervised approaches.

Dataset	Method	MAE Pitch (°)	MAE Yaw (°)	MAE Roll (°)	Average MAE (°)
BIWI	TRG	3.04	3.44	1.78	2.75
BIWI	CIT	3.01	4.54	4.15	3.90
BIWI	WHENet	3.60	4.10	2.73	3.48
AFLW2000	CIT	2.68	4.38	3.45	3.50
AFLW2000	WHENet	4.44	5.75	4.31	4.83
ours	HPE	2.80	3.10	2.50	2.80
