An Improved Tiered Head Pose Estimation Network with Self-Adjust Loss Function

As an important task in computer vision, head pose estimation has been widely applied in both academia and industry. However, two challenges remain in the field: (1) even for the same application (e.g., tiredness detection), existing algorithms usually treat the estimation of the three angles (i.e., roll, yaw, and pitch) as separate facets of a single task, which disregards both their interplay and their differences and thus shares the same parameters across all layers; and (2) discontinuity in angle estimation reduces accuracy. To solve these two problems, a THESL-Net (tiered head pose estimation with self-adjust loss network) model is proposed in this study. Specifically, first, an idea of tiered estimation using distinct network layers is proposed, gaining greater freedom during angle estimation. Furthermore, the causes of the discontinuity in angle estimation are revealed, including not only labeling the dataset with quaternions or Euler angles, but also loss functions that simply add the classification and regression losses. Subsequently, a self-adjustment constraint on the loss function is applied, making the angle estimation more consistent. Finally, to examine the influence of different angle ranges on the proposed model, experiments are conducted on three popular public benchmark datasets, BIWI, AFLW2000, and UPNA, demonstrating that the proposed model outperforms state-of-the-art approaches.


Introduction
As an important task in computer vision, head pose estimation has been applied in a wide range of scenarios, such as tiredness detection and autonomous driving. The primary approaches rely mainly on either landmark detection [1][2][3][4][5] or depth information [6][7][8][9]. For example, when building fine 3D face models, landmark-based approaches usually perform 3D-to-2D mapping and matching. When depth information is used, it usually compensates for the spatial information missing from 2D images. These approaches are robust to small-area occlusion, but perform badly when the masked area grows or the face is strongly deflected [6,[10][11][12]. It has also been shown that introducing convolutional neural networks (CNNs) into head pose estimation tasks can alleviate the performance degradation caused by missing facial key points [10][11][12][13][14][15][16][17][18]. Among these approaches, the difficulty is generally addressed by direct regression [16][17][18]. Inspired by the idea of soft stagewise regression in age estimation tasks [19], CNNs have been applied to head pose estimation with remarkable results [10][11][12][13][14][15].
Furthermore, capsule networks [20] have also been employed in head pose estimation, and these works share commonalities with CNN-based ones. Among these studies, a balance between yaw, pitch, and roll is preserved by linearly combining the features extracted from the network, estimating all three angles simultaneously [6,[10][11][12][13][14][15]. The effectiveness of these approaches has been confirmed on numerous public benchmark datasets [21][22][23][24]. For clarification, the two difficulties mentioned above are illustrated. Figure 1 shows the estimation for a single image from the 300W-LP dataset. The yaw, with a smaller expected loss, may become worse when the model's parameters are adjusted using the loss feedback from the other angles. When the head pose's true angles are [6.1°, −3.2°, −15°] and the estimated angles are [5.9°, −1.9°, −9.9°], the traditional loss incorrectly reverses the true loss relationship between yaw and pitch, leading to an imbalance of losses on both sides of the classification line. The intermittent nature of the losses and the erroneous inversions make the model difficult to learn; consequently, this problem is discussed and solved in Section 3 without using the rotation matrix or soft stagewise regression.
Furthermore, it has been reported that an imbalance in the dataset's distribution can damage the model's performance [33][34][35][36][37][38][39]. To eliminate this imbalance, the oversampling described in [33] is employed, the effect of angle distribution is examined on the BIWI dataset, and the result is then compared with datasets that have different angular ranges [24].
Apart from these challenges, some exciting findings have been reported in studies related to neural networks. Among them, multi-scale feature fusion, as a combination of the feature pyramid network [40,41] and attention-based feature weight assignment [42], has a positive effect on almost all computer vision (CV) tasks [43][44][45]. Additionally, some studies have attempted to enhance the performance of the optimizer [46] and the activation function [47], with positive findings. Based on the above studies, a series of advancements are made in this work, aiming to minimize the estimation loss of head pose estimation. In summary, the primary contributions of our study are as follows: (1) An idea of tiered estimation combining multi-output tasks and multi-scale estimation fusion is proposed, which not only provides greater freedom of adjustment for the three head attitude angles, but also efficiently minimizes the interaction between tuning angles and further lowers the estimation loss of each angle. (2) To remove the inconsistency in the loss function, which is the main cause of angle estimation discontinuity, an easy-to-use dynamic self-adjusting loss function is developed. (3) To examine the influence of the range of angle distributions on the proposed model, tests are conducted on three public benchmark datasets, demonstrating that our approach maintains remarkable performance over various angle ranges.
The rest of the paper is organized as follows. The existing work on head pose estimation is presented in Section 2. The tiered estimation module and loss limitation method are described in Section 3. The experimental findings on various datasets are depicted in Section 4. Finally, a summary is given in Section 5.

Estimation with Key Points
By matching key facial points recognized in images with 3D face landmarks, landmark-based approaches can compute the head pose. For instance, in [5], every landmark was considered a separate part, and a tree-structured model was employed to capture the global elastic deformation of the face. In addition, direct predictive estimation of face landmark positions using an ensemble of regression trees was suggested in [1], which optimizes the sum of squared error loss. In parallel to these machine learning approaches, in [3,4], a 3D face model combined with specially developed algorithms was employed, in which depth information captured by the camera was used for the head pose estimation task.
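To make the landmark-matching idea concrete, the rigid rotation that best aligns a reference 3D face model with observed 3D landmarks can be recovered in closed form via the Kabsch algorithm. The sketch below is illustrative only: the function names and the ZYX Euler convention are our assumptions, not the specific algorithms of [1,3,4,5].

```python
import numpy as np

def kabsch_rotation(model_pts, observed_pts):
    """Least-squares rotation aligning model_pts to observed_pts.

    model_pts, observed_pts: (N, 3) arrays of corresponding 3D landmarks.
    Returns a 3x3 rotation matrix R such that R @ model ~ observed.
    """
    P = model_pts - model_pts.mean(axis=0)
    Q = observed_pts - observed_pts.mean(axis=0)
    H = P.T @ Q
    U, _, Vt = np.linalg.svd(H)
    # Reflection guard: force det(R) = +1
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

def yaw_pitch_roll(R):
    """Euler angles (degrees) from a rotation matrix, ZYX convention (assumed)."""
    yaw = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    pitch = np.degrees(np.arcsin(-R[2, 0]))
    roll = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    return yaw, pitch, roll
```

A yaw/pitch/roll readout then follows directly from the recovered matrix, which is the step that landmark-based pipelines perform after fitting.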
Some deep learning-based approaches have also generated findings. For example, in [7], a CNN-based model was developed, in which the classification and regression were integrated to evaluate approximate regression confidence. Their results demonstrated that the training of the CNN can achieve near saturation with both 2D and 3D facial landmark-labeled datasets. In addition, in [3], a residual network was integrated with landmark localization structures. In [18], a Face-pose-Net network was built, showing how a simple CNN can be precisely trained and robustly regressed to head pose directly from a single image. In [4], to tackle the face alignment issue, an iterative approach for learning an effective Heatmap-CNN regressor was introduced for unrestrained face crucial points estimation and pose estimation.
Although a great deal of work exists in this area to enhance the accuracy of landmark detection, the reliance on landmark detection hinders its performance in the cases of a significant area occlusion and substantial angle deflection.

Estimation without Key Points
With the remarkable performance of deep learning approaches across CV tasks, head pose estimation models independent of landmarks have been developed. In [14], a CNN paired with adaptive gradient algorithms was employed to achieve estimation on in-the-wild datasets without depending on key points, but the estimation precision was not ideal. Thereafter, a milestone in landmark-free head pose estimation was reached in [15], which employed the basic Resnet-50 structure [20] and classified the head pose into 3° bins. In [10], the concept of soft stagewise regression was presented, and a fine-grained structural mapping of spatial features was employed to discover the spatial relationship between features. Shortly thereafter, in [3], a feature decoupling module was added to the CNN, which can explicitly learn the discriminative features of each pose by adaptively calibrating the channel response and bounding the variable subspace distribution.
In addition, regarding the angular annotation of datasets, it has been demonstrated that labeling with quaternions or Euler angles can lead to discontinuities in angle estimation [31]. To solve this non-stationarity problem (caused by labeling datasets with Euler angles), on the one hand, an L2 loss was integrated with a quaternion-based regression loss [16]; on the other hand, rotation matrices were applied. For instance, in [30], the Frobenius norm solution was computed by replacing the singular value decomposition with fundamental algebraic operations. In [48], a two-dimensional Lorentz distribution and angular weight assignment were applied to solve the problems caused by uneven label distribution. In [49], an anisotropic angular distribution learning (AADL) network was proposed, in which the Kullback-Leibler divergence was chosen to measure the discrepancy between predicted and ground-truth labels. In [12], the matrix Fisher distribution was presented, using the rotation matrix to model head rotation uncertainty. In the latest study [11], the head pose was represented as three vectors, and model performance was evaluated using the mean absolute error of vectors (MAEV).
In summary, in the above methods, the features related to head pose were generally learned autonomously through neural networks, which did not require additional key point information and can return the head pose directly from the image perspective. Although the addition of the rotation matrix can efficiently eliminate the angle estimation discontinuity, the loss function or even the model itself needs to be further redesigned and improved.

Multitask and Feature Pyramid
Previously, several estimation tasks were conducted simultaneously using multitask approaches within one CNN model. For example, in [28], CNNs with residual blocks and lateral skip connections were employed to simultaneously perform landmark-based face alignment and head pose estimation. Likewise, a cascaded structure was employed in [27] for face alignment and face detection, which improved performance significantly because the correlation between tasks allows them to provide complementary information to each other; this inter-task synergy was also specifically explained in [28]. In [25], model construction and selection for multitask convolution were explained in detail. In 2021, a fine-feature encoder and three decoders were employed to perform estimation for three different tasks [29].
At the same time, the idea of multi-scale prediction emerged in object detection. For example, the feature pyramid was proposed in [40] to efficiently capture small-scale information that is usually neglected in deep layers. A global-and-local transformation was used in [44] to solve the reconfiguration and reuse of feature hierarchies when constructing feature pyramids. Recently, top-down and bottom-up feature connections were proposed in [41], integrating features at various scales. Furthermore, an adaptive spatial feature-fusion structure was proposed in [43], which can spatially filter conflicting information to eliminate inconsistency.
To the best of our knowledge, the estimation of three head pose angles has been considered as three branches belonging to the same task and sharing the same layers. However, this increases the burden of model tuning for each angle. Inspired by the multitasking output, in this study, the three angles of the head pose are considered as three different tasks, which are assigned to the three network layers and the corresponding feature scales are enriched using a feature pyramid.

Method
In this section, the basic process of head pose estimation is first outlined, and the proposed THESL-Net model is then described in detail: the concept of tiered estimation is introduced, and the modified loss function is given.

Problem Formulation
Generally, head pose estimation can be summarized as follows. Given a set of face images X = {x_n | n = 1, . . . , N} and a pose vector y_n for each image x_n, where N represents the number of images, the elements of y_n comprise the angles of yaw, pitch, and roll, denoted as φ, θ, and ψ, respectively. The aim is to discover a mapping function F by minimizing the mean absolute error (MAE) between the estimation ŷ = F(x) and the ground truth y:

MAE = (1 / 3N) · ∑_{i=1}^{N} ( |φ̂_i − φ_i| + |θ̂_i − θ_i| + |ψ̂_i − ψ_i| ),  (1)

where φ̂_i, θ̂_i, and ψ̂_i represent the components of ŷ_i after the estimation target is split into the three angles.
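As a minimal sketch of this metric (the function name is ours), using NumPy:

```python
import numpy as np

def mae_degrees(pred, gt):
    """Mean absolute error per angle and overall.

    pred, gt: (N, 3) arrays of (yaw, pitch, roll) estimates in degrees.
    """
    err = np.abs(np.asarray(pred) - np.asarray(gt)).mean(axis=0)
    return {"yaw": err[0], "pitch": err[1], "roll": err[2], "mean": float(err.mean())}
```

For the single-image example from Figure 1 (ground truth [6.1°, −3.2°, −15°], estimation [5.9°, −1.9°, −9.9°]), this yields per-angle errors of roughly 0.2°, 1.3°, and 5.1°.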

Overview of THESL-Net
The framework of the proposed THESL-Net model is shown in Figure 2. The proposed model comprises one backbone and one tiered estimation module. In particular, the proposed THESL-Net model is an end-to-end model, and the backbone is Resnet-50 with a feature pyramid structure. Ideally, the loss predicted by the proposed model should have a similar growth trend as that of the real loss; thus, a limiting factor β is added to the cross-entropy loss used in this study.

After the fixed-size images go through the model, a feature mapping is obtained at each stage of the backbone network, and the features extracted from neighboring stages are fused using down-sampling and maximum pooling to keep c × w × h constant. The final fused features are input into the tiered estimation module, and three head branches with distinct parameters are generated by reducing the channel number. The traditional regression and classification losses are employed to compute the total estimation loss during training, where each head branch ends in a linear layer. Furthermore, external attention [37] is used to perform feature selection, which better differentiates the three angles.
Details on feature fusion, tiered estimation, and limitations on the loss function will be depicted in the following subsections.

Tiered Estimation
Three linear layers, each responsible for predicting a single angle vector, are commonly employed in head pose estimation. The three linear layers share the same convolutional layer parameters, as shown in Equation (2):

ŷ = KΓ + b,  (2)

where K denotes the various weights, Γ denotes the feature obtained by the convolution layer, and b represents the bias factor. Suppose the estimation loss of an image is L(ŷ, y) = [0, 5, 10]. Since the network layers are shared during gradient backpropagation, the estimation loss after tuning can be denoted as L(ŷ, y) = [2, 3, 5]. Although the total predicted loss is lowered, the result is no longer the best model for yaw.

Inspired by the idea of the feature pyramid network, a tiered structure is developed in this study. In the feature fusion, only the down-sampling technique is adopted, and the estimation results at various scales are not fused. For the 1/2 ratio case, a 3 × 3 convolution layer with a stride of 2 is employed; for the 1/4 ratio case, a two-step max-pooling layer is added before the 2-stride convolution; and for the 1/8 ratio case, fusion is not applied, as shown in Figure 2. With each stage of the backbone network denoted by S, the features are fused as follows:

F_j = γ1 · S_{1→j} + γ2 · S_{2→j} + γ3 · S_{3→j},  (3)

where S_j (j = 3, 4) denotes the last two stages, →j denotes fusion with the current layer as the spatial scale standard, and γ represents the fusion weight. When j equals 1 or 2, γ2 or γ3 is 0, respectively. Similar to [43], we force

γ1 + γ2 + γ3 = 1,  γ1, γ2, γ3 ∈ [0, 1].  (4)

In particular, three 1 × 1 convolution layers are employed to compute the weight scalar maps λ_{γ1}, λ_{γ2}, and λ_{γ3} from γ1, γ2, and γ3, respectively.
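A small NumPy sketch of the weight normalization and fusion described above; a softmax across the three weight maps is one common way to enforce that the fusion weights are non-negative and sum to 1, as in [43]. Names, and the random arrays standing in for the 1 × 1 convolution outputs, are ours.

```python
import numpy as np

def fuse_scales(features, weight_logits):
    """Weighted fusion of three same-scale feature maps.

    features: list of three (c, h, w) maps already resampled to one scale.
    weight_logits: (3, h, w) raw scalar maps (e.g., from 1x1 convolutions).
    A softmax across the first axis makes the weights sum to 1 at every pixel.
    """
    logits = np.asarray(weight_logits)
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    gamma = e / e.sum(axis=0, keepdims=True)       # (3, h, w), sums to 1
    stacked = np.stack(features)                   # (3, c, h, w)
    return (gamma[:, None] * stacked).sum(axis=0)  # (c, h, w)
```

When one weight map dominates, the fusion degenerates to selecting the corresponding scale, which is the limiting case of the learned per-pixel weighting.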
In the tiered estimation module, a 3 × 3 convolution layer with padding of 1 is employed to keep the spatial resolution unchanged, while 1/2 spatial scale ratio down-scaling is performed three times, generating features dw1, dw2, and dw3 in sequence. The external attention comprises two layers of 1 × 1 convolution that are responsible for the common feature selection in the dataset. Then, softmax is conducted on the probability matrices of yaw, pitch, and roll, which are generated from the linear layers. From this, the interaction between the three angles is weakened, as shown in Equation (5):

(ŷ_φ, ŷ_θ, ŷ_ψ) = (σ(K1Γ1 + b1), σ(K2Γ2 + b2), σ(K3Γ3 + b3)),  (5)

where Γ1, Γ2, and Γ3 are the parameters of dw1, dw2, and dw3, respectively. Γ1, Γ2, and Γ3 are related to each other as follows:

Γ2 = W1Γ1 + b4,  Γ3 = W2Γ2 + b5.  (6)

In Equation (6), W1 and W2 are parameters of the new convolutions, and b4 and b5 are new bias terms.
In the proposed model, head pose estimation is considered to be three tasks, and additional tuning space is also employed. As demonstrated in Figure 3, Grad-CAM [50] is used to visualize the original single-branch structure and the proposed three-branch structure (i.e., dw1, dw2, and dw3), aiming to show the changes brought about by the tiering: the areas of concern are no longer identical between the three angles.

Dynamic Loss Adjustment
Rotation matrices are employed to solve the angle discontinuity caused by quaternion or Euler angle labeling; although effective, specially designed models are often required. However, it is discovered that the loss function's incoherence is another cause of the discontinuity; in detail, this discontinuity in angle estimation arises because the classification loss is larger than the MSE loss at about 1° from the classification edge. Taking a single picture as an example, the typical loss function is as follows:

L = (ŷ − y)² − ∑_{c=1}^{k} Y_ic log( σ(Ŷ_i)_c ),  (7)

where k represents the number of categories; Y_ic is 0 or 1, corresponding to whether the classification is correct; Ŷ_i is the probability matrix; and σ denotes softmax.
Another simple example illustrates the loss imbalance at the two ends of a classification interval. Set the ground truth to [0°, 3°, 5°] and the estimation to [1°, 3.5°, 7°], and divide (−99°, 99°) into 66 groups with 3° as the interval. When the true error between estimation and ground truth is within 1°, the regression task falls into two cases: the estimation is correctly classified, called intra-class regression, or the estimation is incorrectly classified, called inter-class regression. Specifically, when the estimation is intra-class, the cross-entropy loss is minimal, and the total loss follows the true loss trend. However, when the estimation is inter-class, the cross-entropy loss is larger than the mean squared loss (because of the exponent of 2), so the total loss runs inverse to the true loss trend, as stated in Section 1. This makes the model difficult to learn.
In [15], a coefficient α = 2 is applied to the MSE, as shown in Equation (8):

L = L_ce + α · L_mse,  with  L_mse = ( d · ∑_{c=1}^{k} c · σ(Ŷ_i)_c − 99 − y )²,  (8)

where d represents the category length and Y_id represents the category label. Here, 99 is the regression constant term, resulting from restricting the angles to between −99° and 99° when processing the dataset. When L_mse is considerably small, multiplying it by a factor α = 2 can partly alleviate the discontinuity caused by the loss function. However, this does not address the crux of the matter and can even increase the incongruity when an intra-class loss is greater than an inter-class loss.
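For reference, the expected-value regression of [15] converts the k-bin classification output into a continuous angle by taking a softmax expectation over the 3°-wide bins and subtracting the 99° offset; a minimal sketch (our naming):

```python
import numpy as np

def expected_angle(logits, bin_width=3.0, offset=99.0):
    """Continuous angle (degrees) from per-bin logits via softmax expectation."""
    logits = np.asarray(logits, dtype=float)
    p = np.exp(logits - logits.max())
    p /= p.sum()                       # softmax over the k bins
    idx = np.arange(len(p))
    return bin_width * float((p * idx).sum()) - offset
```

With 66 bins, a distribution peaked at bin 33 maps to roughly 0°, and one peaked at bin 40 to roughly 21°, i.e., the regression term is differentiable through the classification probabilities.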
Considering the synergy between the two losses, we set an additional constraint for the classification loss: β = (ŷ − y)² / ((ŷ − y)² + 1). The cross-entropy loss after the update is then given by

L'_ce = −β · ∑_{c=1}^{k} Y_ic log( σ(Ŷ_i)_c ).  (9)

After this restriction, β ∈ [0, 1] also enters the backpropagation gradient, making the resulting penalty small when the true loss is small. In the above example, multiplying by β lowers the CE loss of pitch to 1/5 of its original value. This restores the model's total loss to the same trend as the true loss. Subsequently, 2β is employed to strengthen the error penalty for losses above 1°, which can accelerate the model's convergence.
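The self-adjusting constraint can be sketched as follows; the exact way β is combined with the α-weighted MSE is our reading of the text, so treat this as illustrative rather than the authors' implementation:

```python
def beta_factor(pred_angle, true_angle):
    """Self-adjust factor beta = e^2 / (e^2 + 1), in [0, 1)."""
    sq = (pred_angle - true_angle) ** 2
    return sq / (sq + 1.0)

def self_adjust_loss(ce_loss, mse_loss, pred_angle, true_angle, alpha=2.0, scale=2.0):
    """Cross-entropy damped (or amplified) by scale * beta, plus alpha-weighted MSE.

    With scale = 2, errors below 1 degree shrink the CE term (2 * beta < 1),
    while errors above 1 degree enlarge it, matching the 2-beta variant above.
    """
    return scale * beta_factor(pred_angle, true_angle) * ce_loss + alpha * mse_loss
```

For the pitch example above (ground truth 3°, estimation 3.5°), β = 0.25/1.25 = 0.2, so the CE term is lowered to 1/5 of its original value before the 2× scaling; at exactly a 1° error, 2β = 1 and the CE term passes through unchanged.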
To confirm the effectiveness of the proposed approach, another loss function is also developed in our study based on the rotation matrix, as demonstrated in Equation (10), which combines the MSE and the MAEV. The idea is that the vectors corresponding to the three angles in the rotation matrix must be mutually perpendicular; otherwise, a penalty is imposed.

Optimization
To further improve the proposed model, a series of measures are employed to enhance the baseline of Resnet-50 as the backbone, and the resulting improvements are listed in Figure 4. First, the dataset is balanced using both oversampling and left-right mirroring, with Hopenet [15] as the benchmark. The distribution ratio of the large, medium, and small (about 30° per interval) angles is 2:2:1 in the balanced dataset. Then, following previous research on the Resnet network and transformer structure [35,36], the ReLU is replaced with the Dynamic ReLU of [47] to enhance the model's representation ability, and the AdamW optimizer [46] is employed instead of the Adam optimizer to improve the model's generalization. The combination of these approaches leads to a 0.5° reduction in baseline loss. The experimental findings reveal that our loss-limiting approach (i) has performance similar to the rotation matrix-based approach under the same conditions and (ii) can solve the discontinuity problem from two aspects, as demonstrated in Section 4. Algorithm 1 details the proposed approach's training process.


Implementation Details
Pytorch is used to implement the proposed network. All images are cropped to 224 × 224 (around the face) and then normalized using the transform mean and standard deviation. During training, random masks are applied to all images using CutOut. An AdamW optimizer with a weight decay of 1 × 10⁻⁵ is employed, the learning rate is set to 1 × 10⁻³ and decayed to 0.9 of its value every 20 epochs, and the loss-limit factor is set to 2β. In addition, the linear layer's learning rate is set to 5 × 10⁻³, and both the first convolution layer and the batch norm layer are kept frozen. The model is trained for 200 epochs with a batch size of 64 on four GTX 1080Ti GPUs.
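The stepwise schedule described above (multiply the rate by 0.9 every 20 epochs) admits a closed form; a small sketch with the stated hyperparameters, collected into a hypothetical config dictionary of our own naming:

```python
def lr_at_epoch(epoch, base_lr=1e-3, gamma=0.9, step=20):
    """Learning rate after a 0.9 multiplicative decay applied every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

# Training configuration as stated in the text (dictionary layout is ours).
config = {
    "optimizer": "AdamW", "weight_decay": 1e-5,
    "backbone_lr": 1e-3, "linear_lr": 5e-3,
    "epochs": 200, "batch_size": 64,
}
```

Over 200 epochs this gives nine decay steps, ending at 1 × 10⁻³ · 0.9⁹ for the backbone.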

Datasets and Evaluation
As shown in Figure 5, the proposed model is examined on four popular public benchmark datasets: 300W-LP [21], BIWI [23], AFLW2000 [22], and UPNA [24].

300W-LP:
The 300W-LP [21] dataset is an extended version of the 300W [51] dataset and contains over 120 k images for face alignment with 68 landmarks.
BIWI: The BIWI dataset [23] has 24 videos of 20 subjects, totaling 15,678 frames, each with both an RGB and a depth image. Since face positions are not provided in this dataset, Yolo5-face [52] is employed in our study to produce the subjects' head bounding boxes. For comparison with other state-of-the-art approaches, as in Hopenet [15], FSA-Net [10], and TriNet [11], the same training and testing setup is used, and images with Euler angle deflections outside −99° to 99° are filtered out. In particular, the angle distributions of the UPNA and BIWI datasets are found to lie within [−48°, 36°] and [−75°, 85°], respectively. Figure 5 shows samples from the datasets, and this study is conducted in the following two scenarios: (1) The model is trained and evaluated on 300W-LP, BIWI, AFLW2000, and UPNA. (2) In total, 70% of the BIWI and UPNA datasets are employed for training and 30% for testing, with no overlap between the training and test sets; for example, in the BIWI dataset, 16 videos are used for training and 8 for testing.
In all of the above studies, the MAE is used as the evaluation metric to assess the performance of the proposed model.
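The MAE over the three predicted Euler angles is a straightforward average of absolute errors; a minimal sketch (the function name is ours):

```python
def mae_degrees(pred, target):
    """Mean absolute error over (yaw, pitch, roll) angle predictions, in degrees."""
    errors = [abs(p - t) for p, t in zip(pred, target)]
    return sum(errors) / len(errors)

mae_degrees([10.0, -5.0, 2.0], [12.0, -4.0, 1.0])  # (2 + 1 + 1) / 3
```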


Competing Methods
To show its effectiveness, we compare the proposed approach with other state-of-the-art approaches on public benchmark datasets, with results taken either from the original articles or from our re-run experiments.
The following is a brief description of previous work related to the proposed model, all based on RGB images. Dlib [1] addresses 2D-to-3D fitting challenges by matching facial landmark points for head pose estimation. 3DDFA [21] employed a CNN to fit 3D face models to 2D images, skipping the step of facial landmark detection. There are also popular methods that do not rely on key points. For example, Hopenet [15] proposed landmark-free head pose estimation based on ResNet-50, considerably enhancing performance in complex scenes. Thereafter, FSA-Net [10] introduced the idea of soft stagewise regression and developed a fine-grained structural mapping to capture spatial features. QuatNet [16] employed a multivariate loss function based on quaternions to address the non-stationarity caused by the Euler angle representation. FDN [13] designed a feature decoupling network with a cross-category center loss to restrict the distribution of the latent variable subspaces. MFDNet [12] constructed a triplet module and a matrix Fisher distribution module to address the uncertainty of head rotation. TriNet [11] re-labeled dataset samples using orthogonal constraints on the three vectors and assessed them using MAEV. To enhance the accuracy of head pose estimation for drivers, ref. [54] proposed a spatial temporal vision transformer (ST-ViT) model, taking a pair of image frames rather than a single frame as the input.

Experiment Results
We explore the performance variation of the model using different backbone networks. The comparison among three backbones (ResNet-50, ResNeXt-101, and the recent ConvNeXt) is given in Table 1. Notably, in this study, all orientations are reported in degrees. First, we note that "w" denotes with the proposed method (see the odd rows in Table 1), and "w/o" denotes without it (see the even rows in Table 1). The comparison between the odd and even rows shows that the proposed method improves the model performance for all three backbones. Taking ResNet-50 as an example, by introducing the proposed method, the average MAE (over yaw, pitch, and roll) on the AFLW2000 dataset improves from 6.16° to 4.40°, and the average MAE on the BIWI dataset improves from 5.18° to 3.56°.
Second, the comparison among the three backbones shows that the best performance is achieved with ResNet-50. Taking the validation on the AFLW2000 dataset, for example, the MAE values for ResNet-50, ResNeXt-101, and ConvNeXt are 4.40°, 5.62°, and 7.84°, respectively. Since the best results are achieved with the ResNet-50 backbone, the remaining experiments are conducted on ResNet-50. Tables 2 and 3 compare our proposed model with other state-of-the-art approaches. We note that the proposed model is trained on the 300W-LP dataset. Table 2 shows the test results on the AFLW2000 dataset. From this table, we can see that the proposed THESL-Net attains the minimum error on roll, and its MAE is somewhat higher than that of MFDNet, but the structure of the proposed approach is much simpler and can thus be readily applied to other models. Furthermore, Table 3 shows the test results on the BIWI dataset. From this table, we can see that THESL-Net realizes the best performance, with an MAE reduction of 0.06° compared to the second-best approach (MFDNet). The proposed approach does not rely on landmark detection, and the loss-limit factors are adjusted automatically during evaluation without additional settings. Table 4 reports the comparison with other approaches on the BIWI dataset, where 70% and 30% of the data are employed for training and testing, respectively, without crossover. All compared methods are based on RGB, and the findings of Hopenet [15] are derived from re-runs in [11]. THESL-Net is first fine-tuned, resulting in the best finding on yaw, where the MAE decreases by 0.36° compared to the second place. The other indicators are also in the upper-middle range, which indicates the effectiveness of our tiered estimation concept.
The performance of the proposed method on the UPNA dataset is given in Table 5, where '/' means the corresponding value is not given in the original article. To make a fair comparison, we conduct an additional experiment using 90% of the UPNA dataset for training and 10% for testing. From this table, it can be seen that the best MAE is achieved by the proposed method when using the same dataset-partitioning method. To examine the influence of the head deflection angle range on the proposed model, we further compare the BIWI dataset with the UPNA dataset and report the findings in Figure 6. Both datasets are obtained in an experimental setting with low disturbance, containing three angles over different intervals. We only employ the MAE to evaluate the change in model performance. The experimental findings reveal that the proposed model performs well over various angle ranges. Table 6 shows the details. Equation (10) further develops a new loss function consisting of MSE and MAEV. It is compared with the proposed approach to show the extent to which the loss function and labeling affect the angle estimation discontinuity, as shown in Table 6.
By combining the loss limitation and the rotation matrix, as shown in Equation (11), the overall loss increases instead.
L(ŷ, y) = 2βL_ce + L_mse + L_maev (11)
A reasonable explanation is that the loss-limiting and labeling approaches have similar influences, and simply adding them together yields an effective factor of 4β, which again destroys the loss function's coordination within 1° of the prediction error.
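The combination in Equation (11) is a weighted sum of three component losses; a minimal sketch, where the component values are placeholders (the paper's exact definitions of L_ce, L_mse, and L_maev are not reproduced here):

```python
def combined_loss(l_ce, l_mse, l_maev, beta=1.0):
    """Equation (11) as a weighted sum: the loss-limit factor 2*beta scales
    the classification term before the two regression terms are added."""
    return 2 * beta * l_ce + l_mse + l_maev
```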

Visualization
In this section, the process of model training and the comparison between different approaches are visualized. First, Figure 7 shows the performance of the proposed approach under occlusion and significant angle deflection. We selected a subset of images with significant angle deflection from the AFLW2000 dataset. Both the Hopenet and THESL-Net models, which share a similar backbone, are employed to predict the head pose. We plot colored lines to visualize the head deflection, where the blue, green, and red lines indicate the front, bottom, and side of the face, respectively. Our approach reduces the MAE by more than 10° in the deflection cases and by about 4° in the cases where the face is occluded.

Figure 8 shows the function of the tiered estimation module in the training process. A batch of features generated from the backbone network is taken as input, and a 1 × 1 convolution layer is then used to reduce the number of channels. The three colors in the figure denote the respective regions of interest in the estimation tasks of yaw, pitch, and roll. Finally, the features after weight assignment pass through another 1 × 1 convolution layer to reduce the channel count before being output to the linear layer. Notably, we use the external attention mechanism to detect common features among different character samples, although other tasks may require different attention mechanisms. The concept of tiered estimation minimizes the interference of fine-tuning between the three angles.
Furthermore, to demonstrate the changes in the model during training more explicitly, Grad-CAM [50] is employed to visualize the areas that the model focuses on before the tiered layer, as shown in Figure 9: columns (a) and (c) have separate identities, columns (a) and (b) have different postures, and columns (a) and (d) differ in both. As training progresses, the external attention makes the model's area of interest gradually focus on common features, which gives the head pose estimation model good robustness for people with a similar pose but separate identities. Additionally, for the same person, the regions that the model focuses on differ for different head poses. This indicates that the proposed model is simultaneously identity-robust and pose-robust.
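The 1 × 1 convolutions used for channel reduction amount to a per-pixel mix of input channels; a dependency-free sketch of that operation (a real implementation would use a framework's pointwise convolution layer):

```python
def conv1x1(features, weights):
    """Pointwise (1 x 1) convolution as a per-pixel channel mix:
    out[c_out][h][w] = sum over c_in of weights[c_out][c_in] * features[c_in][h][w].
    `features` has shape [C_in][H][W]; `weights` has shape [C_out][C_in]."""
    c_in, height, width = len(features), len(features[0]), len(features[0][0])
    out = []
    for w_row in weights:  # one output channel per weight row
        plane = [[sum(w_row[c] * features[c][h][x] for c in range(c_in))
                  for x in range(width)]
                 for h in range(height)]
        out.append(plane)
    return out
```

With fewer weight rows than input channels, the output has fewer channels than the input, which is exactly the reduction step feeding the tiered branches.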

Ablation Study
In this section, the effect of the different blocks (the tiered estimation module and the various loss limits) on THESL-Net's performance is investigated. The ablation studies are performed on the enhanced ResNet-50; the three techniques used are shown in Figure 4. Two sets of studies are developed. The first set is trained on 300W-LP and tested on the AFLW2000 and BIWI datasets. The second set employs 70% of each of the BIWI and UPNA datasets as the training set and 30% as the test set. Each set of studies separately examines the influence of the tiered idea (with/without) and of the loss limit (β, 2β, or none) on the findings. The experimental findings are shown in Tables 7 and 8.

Table 7. Ablation study over different components (with/without tiered module and with/without loss limit) on the AFLW2000 and BIWI datasets. All methods are trained on the 300W-LP dataset.
As observed in Table 7, the MAE of the base model is 5.65° on AFLW2000 and 4.82° on the BIWI dataset when neither module is used. However, the model performance is significantly improved when either of the two modules is added alone. Among them, the performance of THESL-Net is optimal when using the tiered module with loss limit = 2β, which reduces the MAE by 1.25° and 1.26° on the AFLW2000 and BIWI datasets, respectively. This shows that both of our strategies are effective.
In Table 8, we present the ablation findings for the BIWI and UPNA datasets, which have different angle distribution ranges, with the UPNA dataset having a smaller and more concentrated one. The losses using the best combination on the BIWI and UPNA datasets are reduced by 0.94° and 1.39°, respectively, and the final MAEs of the two are not considerably different, indicating that our model performs well over various angle ranges. Figure 10 further details the experimental findings for each module of the model at various angles. As seen from the figure, the combination of the two techniques always attains the optimal findings.

Conclusions
To overcome the two challenges in the field of head pose estimation, in this study, THESL-Net is proposed, which comprises the tiered estimation module and the loss-limit component. Specifically, to solve the problem of mutual interference between the angles during regression, the tiered structure forms three branches by dimensionality reduction, corresponding to the three angles of head pose estimation. By separating the three angles' network parameters, the mutual interference between yaw, pitch, and roll tuning is substantially decreased, which leaves more room for the estimation loss to decrease. In addition, to solve the problem of discontinuity in angle prediction, unlike rotation matrix-based approaches, we address the problem from the perspective of the loss function by restricting it, with an effect comparable to that of the rotation matrix.
On the popular public standard datasets AFLW2000, BIWI, and UPNA, the experimental findings reveal that our approach has better identity robustness than previous approaches and demonstrates state-of-the-art performance.
