SD-HRNet: Slimming and Distilling High-Resolution Network for Efficient Face Alignment

Face alignment is widely used in high-level face analysis applications, such as human activity recognition and human–computer interaction. However, most existing models involve a large number of parameters and are computationally inefficient in practical applications. In this paper, we aim to build a lightweight facial landmark detector by proposing a network-level architecture-slimming method. Concretely, we introduce a selective feature fusion mechanism to quantify and prune redundant transformation and aggregation operations in a high-resolution supernetwork. Moreover, we develop a triple knowledge distillation scheme to further refine a slimmed network, where two peer student networks could learn the implicit landmark distributions from each other while absorbing the knowledge from a teacher network. Extensive experiments on challenging benchmarks, including 300W, COFW, and WFLW, demonstrate that our approach achieves competitive performance with a better trade-off between the number of parameters (0.98 M–1.32 M) and the number of floating-point operations (0.59 G–0.6 G) when compared to recent state-of-the-art methods.


Introduction
Face alignment, also known as facial landmark detection, aims at locating a set of semantic points on a given face image. It usually serves as a critical step in many face applications, such as face recognition [1], expression analysis [2], and driver-status tracking [3], which are significant components of human-computer interaction systems. As an example, face alignment is used to generate a canonical face in the preprocessing of face recognition [4,5]. In the past decade, many methods and common datasets have been reported in the literature [6][7][8][9][10][11][12][13][14][15][16][17][18][19][20][21][22] to promote the development of face alignment. Nevertheless, it remains a challenging task to develop an efficient and robust facial landmark detector that performs well in various unconstrained scenarios.
In early works, methods based on cascaded regression [6][7][8][9] made significant progress on face alignment. They learn a mapping function that iteratively refines the estimated landmark positions from an initial face shape. Despite their success on near-frontal face alignment, their performance degraded dramatically on challenging benchmarks. The main reason is that these methods rely on handcrafted features and simple learned regression models, which cannot take full advantage of the data to learn an accurate shape mapping for unconstrained faces.
With the development of deep learning in computer vision, convolutional neural network (CNN)-based methods have achieved impressive performance on unconstrained face alignment. Most existing works focus on improving landmark localization accuracy by utilizing large backbone networks (e.g., VGG-16 [23], ResNet-50/152 [24], and Hourglass [25]). Although these networks have powerful feature extraction ability, they involve many parameters and a high computational cost, making them difficult to apply in resource-limited environments.
Recently, some researchers have sought to balance the accuracy and efficiency of facial landmark detectors. They either train a small model from scratch [26,27] or use knowledge distillation (KD) for model compression [28][29][30][31]. The former aims to design a lightweight network combined with an effective learning strategy, while the latter considers how to apply the KD technique to transfer the dark knowledge from a large network to a small one. However, these methods are not flexible enough to adapt to different computing resources, as they usually rely on a fixed and carefully designed network structure.
Inspired by works on neural architecture search and neural network pruning for image classification [32,33], in which a compact target network is derived from a large supernetwork, we attempted to search for a lightweight face alignment network from a dynamically learned neural architecture. Concretely, we first trained a high-resolution supernetwork based on the structure of HRNet [34]. In this network, a lightweight selective feature fusion (LSFF) block was designed to quantify the importance of the built-in transformation and aggregation operations. Then, we selectively pruned the redundant operations, or even entire blocks, to obtain a slimmed network. To reduce the performance gap between the slimmed network and the supernetwork, we developed a triple knowledge distillation scheme, where two peer student networks with masked inputs learn the ensemble of landmark distributions while receiving knowledge from a frozen teacher network. Our main contributions are summarized as follows:

• We propose a flexible network-level architecture-slimming method that can quantify and reduce the redundancy of the network structure to obtain a lightweight facial landmark detector adapted to different computing resources.

• We design a triple knowledge distillation scheme, in which a slimmed network can be improved without additional complexity by jointly learning the implicit landmark distribution from a teacher network and two peer student networks.

• Extensive experimental results on challenging benchmarks demonstrate that our approach achieves a better trade-off between accuracy and efficiency than recent state-of-the-art methods (see Figure 1).
The rest of this paper is organized as follows: Section 2 reviews related work on existing face alignment methods. In Section 3, we describe the details of our proposed slimming and distillation methods. Section 4 presents the experimental results and analysis on common datasets. Finally, we give a brief conclusion in Section 5.

Related Work
In this section, we provide a detailed review of the related methods on face alignment.

Conventional Face Alignment
In the early literature [6][7][8][9], cascaded regression was popular and widely used to predict facial landmark positions by solving a regression problem. Representative methods included SDM [6], ESR [7], LBF [8], and CFSS [9]. The main differences among these methods were the choice of extracted features and the landmark regression method. SDM used the scale-invariant feature transform (SIFT) as a feature descriptor applied to a cascaded linear regression model. ESR was a two-stage boosted regression method that predicted the landmark coordinates using shape-indexed features. LBF combined the random forest algorithm with local binary features to accelerate the landmark localization process. To avoid a local optimum due to poor initialization, CFSS exploited hybrid image features to estimate the landmark positions in a coarse-to-fine manner. These methods struggled to detect landmarks on unconstrained face images due to their handcrafted features and simple learned regression models. In our work, we build a CNN model to jointly learn deep feature extraction and facial landmark heatmap regression.

Large CNN-Based Face Alignment
In recent years, advanced approaches reported in the literature [10][11][12][13][14][15][16][17][18][19] have exploited large CNN models to drastically improve landmark localization accuracy. Wu and Yang [10] proposed a deep variation leveraging network (DVLN), which contained two strongly coupled VGG-16 networks for landmark prediction and candidate decision. Lin et al. [11,12] adopted a classic two-stage detection architecture [35] based on the VGG-16 backbone for joint face detection and alignment. Feng et al. [13] and Dong et al. [14] applied the ResNet-50 [24] and ResNet-152 [24] networks, respectively, as the feature extraction module in the landmark detection process. The stacked hourglass network [25] is a popular CNN backbone used in recent state-of-the-art works [15,16,18] to generate features with multiscale information. Xia et al. [19] combined the HRNet backbone with a transformer structure to achieve a coarse-to-fine face alignment framework. These methods achieved high accuracy on challenging benchmarks but inevitably required a large number of parameters and a high computational cost. Our approach only utilizes a large CNN model (HRNet) as the teacher network and adopts a lightweight model for face alignment.

Lightweight CNN-Based Face Alignment
Due to the limited applicability of large CNN models, some researchers have begun to study lightweight network design for face alignment. Bulat et al. [26] applied the network quantization technique to construct a binary hourglass network. Guo et al. [27] trained a lightweight network consisting of MobileNetV2 [36] blocks by using an auxiliary 3D pose estimator. To exploit the learning ability of large models, some recent works [28][29][30][31] used the teacher-guided KD technique to make a small student network learn the dark knowledge from a large teacher network. The student networks were usually based on existing lightweight networks (e.g., MobileNetV2, EfficientNet-B0 [37], and HRNetV2-W9 [34]), while the teacher networks used large CNN models (e.g., ResNet-50, EfficientNet-B7 [37], and HRNetV2-W18 [34]) as the network backbone. It is worth mentioning that the KD technique has also been applied to improve facial landmark detectors [38][39][40] by mining the spatial-temporal relation from unlabeled video data. Inspired by student-guided KD [41], which makes student networks learn from each other without a teacher network, we introduce a student-guided learning strategy into the original KD framework, which can generate more robust supervision knowledge for learning the landmark distribution. Moreover, our student network is derived from a supernetwork and thus has a more flexible structure than other handcrafted models.

Methods
As illustrated in Figure 2, our approach is a two-stage process consisting of network-level architecture slimming and triple knowledge distillation, which results in a lightweight facial landmark detector.


Network-Level Architecture Slimming
Our high-resolution supernetwork (HRSuperNet) follows a structure similar to HRNet in Figure 3 and begins with a stem composed of two 3 × 3 convolutions with a stride of 2. The spatial resolution is downsampled to H/4 × W/4, where H and W denote the height and width of an input image $I \in \mathbb{R}^{3 \times H \times W}$. The main body consists of ten stages maintaining high-resolution representations throughout the network. Different from HRNet, the supernetwork contains a single-resolution LSFF block with a downsampling ratio of 1 in the first stage and repeats four-resolution blocks with downsampling ratios of {1, 1/2, 1/4, 1/8} from the beginning of the second stage. Each block has four stacked mobile inverted bottleneck convolutions (MBConvs [36]) with a 3 × 3 kernel size and an expansion ratio of 1. This design gives the supernetwork a larger architecture space but fewer parameters and a lower computational cost than HRNet. Except for the first stage, the LSFF block is designed to transform and aggregate features from the previous stage and generate new features as inputs to the next stage. The process is formulated as follows:

$$Y_{i,k} = E\left(\sum_{j=1}^{J_{i-1}} \alpha_{i,j}^{k}\, T\big(X_{i-1,j}^{k}\big)\right), \tag{1}$$

where $Y_{i,k}$ is the output of the kth block in the ith stage and $X_{i-1,j}^{k}$ denotes the kth output from the jth block in the (i − 1)th stage. $J_{i-1}$ is the number of blocks in the (i − 1)th stage.
T represents a transformation operation that is either a 1 × 1 convolution with a stride of 1 followed by bilinear interpolation for upsampling, a sequence of 3 × 3 convolutions with a stride of 2 for downsampling, or an identity shortcut connection. E denotes a feature encoding operation implemented by the stacked MBConvs. The factor α is used as the weight of each transformation operation in the follow-up aggregation process. The head of the supernetwork consists of two 1 × 1 convolutions with a stride of 1 and generates the landmark heatmaps $P \in \mathbb{R}^{N \times M \times H/4 \times W/4}$ when receiving N samples with M facial landmark points. During training, we make the supernetwork learn the landmark heatmap regression along with the subnetwork architecture search by imposing an L1 regularization on α to enforce the sparsity of the operations that contribute little to the network. Formally, the overall training loss is:

$$\mathcal{L} = \sum_{n=1}^{N}\sum_{m=1}^{M} \mathrm{MSE}(P_{n,m}, G_{n,m}) + \lambda \sum_{i=1}^{I}\sum_{j=1}^{J_i} \lVert \alpha_{i,j} \rVert_1, \tag{2}$$

where $\mathrm{MSE}(P_{n,m}, G_{n,m})$ denotes the standard mean square error between the predicted heatmap $P_{n,m}$ and the ground-truth heatmap $G_{n,m}$ of the mth landmark in the nth sample. The ground-truth heatmap is generated by applying a 2D Gaussian centered on the ground-truth location of each landmark. λ is the weight that balances the MSE and the L1 penalty term. I and $J_i$ denote the number of stages and the number of blocks in the ith stage, respectively. We first train the supernetwork by alternately optimizing the importance factors and the network weights until they converge. Then, we prune the redundant transformation and aggregation operations in the LSFF blocks whose corresponding factors are smaller than a given pruning threshold. Note that an entire block is discarded if all of its associated operations are pruned.
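The fusion, regularization, and pruning steps above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the transformation and encoding operations are omitted, and all helper names are ours.

```python
import numpy as np

def lsff_fuse(inputs, alphas):
    # Equation (1) sketch: features from the previous stage are scaled
    # by their importance factors and summed before being passed to the
    # encoding operation (identity transforms are assumed here).
    return sum(a * x for a, x in zip(alphas, inputs))

def slimming_objective(mse, alphas, lam=5e-5):
    # Equation (2) sketch: heatmap MSE plus an L1 penalty that drives
    # low-contribution importance factors toward zero.
    return mse + lam * float(np.abs(np.asarray(alphas)).sum())

def prune_mask(alphas, threshold):
    # Operations whose factor falls below the threshold are removed;
    # a block with no surviving operations is discarded entirely.
    return np.abs(np.asarray(alphas)) > threshold

alphas = [0.9, 0.0008, 0.4, 0.001]
keep = prune_mask(alphas, threshold=0.0017)  # the threshold used on 300W
```

With the 300W threshold of 0.0017, the second and fourth operations above would be pruned while the other two survive.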

Triple Knowledge Distillation
In our distillation scheme, we adopt the slimmed network as the peer student networks S1 and S2 and use the pretrained HRNet as the teacher network T. To increase model diversity, we use occluded images with a random-sized mask as the inputs of the student networks.
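A random-sized mask augmentation of this kind could look like the sketch below. The mask-size bounds are an assumption (the paper only states that the size is random), and the helper name is hypothetical.

```python
import random

def random_mask(image, min_frac=0.1, max_frac=0.3):
    # Zero out a randomly sized, randomly placed rectangle.
    # min_frac/max_frac are hypothetical bounds on the mask size.
    h, w = len(image), len(image[0])
    mh = random.randint(max(1, int(min_frac * h)), max(1, int(max_frac * h)))
    mw = random.randint(max(1, int(min_frac * w)), max(1, int(max_frac * w)))
    top, left = random.randint(0, h - mh), random.randint(0, w - mw)
    out = [row[:] for row in image]  # copy so the input stays intact
    for r in range(top, top + mh):
        for c in range(left, left + mw):
            out[r][c] = 0
    return out
```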
Specifically, we define a KD loss for a network to learn the landmark distribution from another network as follows:

$$\mathcal{L}_{KD}(P_1, P_2) = D_{KL}\big(\mathcal{S}(P_1)\,\Vert\,\mathcal{S}(P_2)\big), \tag{3}$$

where $D_{KL}$ is the Kullback-Leibler (KL) divergence measuring the distance between the landmark distributions $\mathcal{S}(P_1)$ and $\mathcal{S}(P_2)$, and $\mathcal{S}$ is the softmax function applied to the predicted landmark heatmaps $P_1$ and $P_2$.
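Equation (3) can be illustrated with a small NumPy sketch. Flattening each heatmap into a single spatial distribution is our simplifying assumption; the function names are ours.

```python
import numpy as np

def spatial_softmax(p):
    # Softmax over all heatmap positions, yielding a distribution.
    e = np.exp(p.ravel() - p.max())
    return e / e.sum()

def kd_loss(p1, p2, eps=1e-12):
    # KL divergence D_KL(S(p1) || S(p2)) between two heatmaps.
    q1, q2 = spatial_softmax(p1), spatial_softmax(p2)
    return float(np.sum(q1 * np.log((q1 + eps) / (q2 + eps))))
```

The loss is zero when the two heatmaps induce the same distribution and grows as they diverge, which is what lets one network mimic another's landmark distribution.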
During training, we use MSE and $\mathcal{L}_{KD}$ as the main criteria to make the student networks learn the explicit landmark distribution from the ground-truth heatmap, while allowing them to learn the implicit landmark distribution from their ensemble predictions and the output of the teacher network. The overall training loss of a student network $S_i$ is formulated as:

$$\mathcal{L}_{S_i} = \mathrm{MSE}(P_{S_i}, G) + \lambda_1 \mathcal{L}_{KD}\big(P_{ens}, P_{S_i}\big) + \lambda_2 \mathcal{L}_{KD}\big(P_T, P_{S_i}\big), \tag{4}$$

where $P_{ens}$ denotes the ensemble of the student predictions, and $P_{S_1}$, $P_{S_2}$, and $P_T$ denote the predicted landmark heatmaps of $S_1$, $S_2$, and T, respectively. The weights $\lambda_1$ and $\lambda_2$ balance the MSE and the KD losses.
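A self-contained sketch of the three-term student loss in Equation (4) follows. Averaging the peer heatmaps to form the ensemble is our assumption, and the default weights are the 300W settings reported later.

```python
import numpy as np

def _softmax(p):
    e = np.exp(p.ravel() - p.max())
    return e / e.sum()

def _kl(p_from, p_to, eps=1e-12):
    a, b = _softmax(p_from), _softmax(p_to)
    return float(np.sum(a * np.log((a + eps) / (b + eps))))

def student_loss(p_s, p_peer, p_t, g, lam1=4.0, lam2=1.0):
    # Equation (4) sketch: ground-truth MSE plus KD from the student
    # ensemble and from the teacher. Averaging the peer heatmaps is
    # our assumption for the ensemble; lam1/lam2 are the 300W values.
    mse = float(np.mean((p_s - g) ** 2))
    p_ens = 0.5 * (p_s + p_peer)
    return mse + lam1 * _kl(p_ens, p_s) + lam2 * _kl(p_t, p_s)
```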

Datasets

COFW: It contains 1852 face images with different degrees of occlusion, including 1345 training images and 507 test images. Each face image has 29 annotated landmarks.
WFLW: There are 7500 images for training and 2500 images for testing, where the test set includes six subsets: large pose (326 images), illumination (698 images), occlusion (736 images), blur (773 images), make-up (206 images), and expression (314 images).

Evaluation Metrics
We followed previous works and used the normalized mean error (NME) to evaluate facial landmark detection performance:

$$\mathrm{NME} = \frac{1}{NM}\sum_{n=1}^{N}\sum_{m=1}^{M} \frac{\lVert p_{n,m} - g_{n,m} \rVert_2}{d} \times 100\%,$$

where $p_{n,m}$ and $g_{n,m}$ denote the coordinate vectors of the predicted landmark and the ground-truth landmark, respectively, and d is the interocular distance. We also report the failure rate with a maximum NME of 10%. The number of parameters (#Params) and the number of floating-point operations (FLOPs) were used to measure model size and computational cost, respectively.
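For a single image, the metric reduces to the short pure-Python sketch below (landmarks as coordinate tuples; the function name is ours).

```python
import math

def nme(preds, gts, d):
    # Mean Euclidean distance between predicted and ground-truth
    # landmarks, normalized by the interocular distance d.
    err = sum(math.dist(p, g) for p, g in zip(preds, gts))
    return err / (len(preds) * d)
```

For example, with two landmarks whose errors are 0 and 5 pixels and an interocular distance of 10 pixels, the NME is 0.25 (i.e., 25%).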

Implementation Detail
Following the work [15], all faces were cropped based on the provided bounding boxes and resized to 256 × 256. We augmented the data with 1.0 ± 0.25 scaling, ±30-degree rotation, and random flipping with a probability of 50%. The pseudocode in Algorithm 1 shows the training pipeline of our approach in the slimming and distilling stages.
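The augmentation parameters described above can be sampled as in the following sketch (a hypothetical helper, shown only to make the ranges concrete):

```python
import random

def sample_augmentation():
    # Scaling in 1.0 +/- 0.25, rotation in +/- 30 degrees,
    # horizontal flip with probability 0.5.
    scale = 1.0 + random.uniform(-0.25, 0.25)
    angle = random.uniform(-30.0, 30.0)
    flip = random.random() < 0.5
    return scale, angle, flip
```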

Slimming stage: We alternately optimized the importance factors and the network weights for 60 epochs. To optimize the importance factors, we used the Adam optimizer with learning rates of 1.8 × 10−4 on 300W and WFLW, and 3.5 × 10−4 on COFW. The weight λ was set to 5 × 10−5. To update the network weights, we used the Adam optimizer with a momentum of 0.9 and a weight decay of 4 × 10−5. The initial learning rate was 1 × 10−4 on 300W and COFW, and 2 × 10−4 on WFLW, and was dropped by a factor of 0.1 at epochs 40 and 50. The pruning threshold was set to 0.0017 on 300W, 0.0042 on COFW, and 0.002 on WFLW. The batch size was set to 16 on 300W and COFW, and 32 on WFLW.
Distilling stage: We jointly fine-tuned the two slimmed networks for 60 epochs. The settings of the optimizer, learning rate, and batch size were the same as those in the slimming stage. The weights λ1 and λ2 were set to 4 and 1 on 300W, 3 and 0.1 on COFW, and 0.3 and 0.1 on WFLW.

Comparison with State-of-the-Art Methods
In this section, we compare our approach with recent state-of-the-art methods on 300W, COFW, and WFLW. S-HRNet is the network slimmed from HRSuperNet. SD-HRNet 1 and SD-HRNet 2 denote two S-HRNets refined through the proposed distillation scheme. We report the average results with the standard deviation for SD-HRNet 1 and SD-HRNet 2, obtained by training each of them five times with different seeds.

Results on 300W
We report the #Params and FLOPs in Table 1 as well as the NME on the 300W subsets in Table 2. Compared to the advanced models (e.g., LAB and SLPT) with large backbones, our method (SD-HRNet) had far fewer parameters and FLOPs while achieving competitive or even better performance. Compared to HRNet, SD-HRNet only increased the NME by about 2% but reduced the #Params by 89.5% and the FLOPs by 86.3%. SD-HRNet also had fewer parameters (0.98 M) and a lower computational cost (0.59 G FLOPs) than existing lightweight models. Moreover, the results support the effectiveness of the proposed slimming and distillation approaches, as the #Params and FLOPs of HRSuperNet were reduced by 70.0% and 52.8%, respectively, and the NME of S-HRNet was reduced by about 3%. Table 3 shows the performance of the methods on Masked 300W. We found that SD-HRNet showed an obvious improvement on occluded faces due to the introduction of masked inputs. Although our method obtained a competitive NME compared to most previous methods, it underperformed the recent state-of-the-art methods [47,48] that focus on the occlusion problem.

Ablation Study
In this section, we conduct an ablation study on 300W and analyze the effect of the proposed components.

HRNet vs. HRSuperNet
To verify the rationality of our supernetwork, we trained HRNet and HRSuperNet on 300W without pretraining and used each of them as the supernetwork to generate a series of slimmed networks. The original residual units [34] or stacked MBConvs were used as the feature encoding operation in the proposed LSFF block. As seen in Figure 6, most networks derived from HRSuperNet had a lower NME than those derived from HRNet at similar FLOPs, which suggests that a larger architecture space is more likely to generate better subnetworks.

KD Components
In Table 6, we show the effect of different KD components in our distillation scheme on the 300W performance. We observed that each component incrementally improved the slimmed network. This suggests that the combination of teacher-guided KD and student-guided KD is an effective way to transfer implicit knowledge. In addition, the introduction of masked inputs can increase the diversity of the student networks and make them learn a robust landmark distribution from each other.

Visualization of the Architectures
We visualize the slimmed architecture trained on 300W in Figure 2 and the other two architectures, trained on COFW and WFLW, in Figure 7. The proposed selective feature fusion mechanism can produce different network structures from a unified architecture space, adapted to different datasets and landmark detection tasks. For example, the architectures from 300W and COFW tended to preserve more high-resolution blocks from the first and second branches than the architecture from WFLW. In addition, we found that more than 94% of the blocks in HRSuperNet were utilized by the slimmed architectures. This suggests that the designed architecture space is reasonable enough to cover most cases for generating an efficient face alignment network.

Conclusions
In this paper, we proposed a network-level slimming method and a hybrid knowledge distillation scheme, which work together to generate an efficient and accurate facial landmark detector. Compared to existing handcrafted models, our model achieved competitive performance with a better trade-off between model size (0.98 M-1.32 M parameters) and computational cost (0.59 G-0.6 G FLOPs). In addition, our method is more flexible in practical applications thanks to its adaptive architecture search technique, which can be applied to real-time human-computer interaction systems under different resource-limited environments. Nevertheless, there is still a performance gap between our method and recent state-of-the-art large models, especially for dense or strongly occluded landmark detection. In future work, we will explore how to design a more reasonable architecture search space to improve the upper bound of performance and extend our method to other computer vision tasks such as human pose estimation and semantic segmentation.
Figure 1. Comparison of the computational cost (i.e., FLOPs) and the performance (i.e., NME) on 300W between the proposed approach and existing state-of-the-art methods. The size of a circle represents the number of parameters. Our approach (SD-HRNet) achieves a better trade-off between accuracy and efficiency than its counterparts.

Figure 2. Illustration of the proposed slimming and distillation procedures for face alignment. A lightweight model is first obtained by slimming the HRSuperNet. Then, the lightweight model is refined in a triple knowledge distillation scheme consisting of two peer student networks and a teacher network. We visualize the architecture of the lightweight model trained on the 300W dataset, where the redundant TA operations and LSFF blocks are pruned.

Figure 3. Detailed structures of HRNet and HRSuperNet. The proposed lightweight selective feature fusion (LSFF) block is composed of transformation and aggregation (TA) operations with different importance factors α and stacked mobile inverted bottleneck convolutions (MBConvs).

Algorithm 1: SD-HRNet Algorithm
Input: training set D_T, initialized importance factors α and network weights w, training epochs N, pruning threshold p, pretrained teacher network T
Output: two lightweight networks S1 and S2
1: for i = 1 to N do
2:   for mini-batch D_t in D_T do
3:     Calculate the loss L by Equation (2)
4:     Update α by gradient descent: α = α − ∇_α L
5:   end
6:   for mini-batch D_t in D_T do
7:     Calculate the loss L by Equation (2)
8:     Update w by gradient descent: w = w − ∇_w L
9:   end
10: end
11: Obtain the lightweight networks S1 and S2 by pruning with threshold p
12: Initialize the importance factors in S1 and S2: α1, α2 = α
13: Initialize the network weights in S1 and S2: w1, w2 = w
14: for i = 1 to N do
15:   for mini-batch D_t in D_T do
16:     Calculate the losses L_S1 and L_S2 by Equation (4)
17:     Update w1 and w2 by gradient descent: w1 = w1 − ∇_w1 L_S1, w2 = w2 − ∇_w2 L_S2
18:   end
19: end

Figure 6. Comparison of HRNet and HRSuperNet based on the original residual units (a) or stacked MBConvs (b), which were used as the supernetwork in the proposed slimming method. We obtained a series of slimmed networks with different NME and FLOPs on the 300W full set by using different pruning thresholds.

Figure 7. Visualization of the slimmed architectures trained on the COFW (a) and WFLW (b) datasets.

Table 1. Comparison of different methods in backbone, #Params, and FLOPs.

Table 4. Comparison of NME (%) and failure rate (%) for a maximum NME of 10% on the COFW test set.

Table 6. NME (%) of our method using different KD components on the 300W full set.