A Multi-View Face Expression Recognition Method Based on DenseNet and GAN

Abstract: Facial expression recognition (FER) techniques can be widely used in human-computer interaction, intelligent robots, intelligent monitoring, and other domains. Currently, FER methods based on deep learning have become the mainstream schemes. However, these methods have some problems, such as a large number of parameters, difficulty in being applied to embedded processors, and the fact that recognition accuracy is affected by facial deflection. To solve the problem of a large number of parameters, we propose a DSC-DenseNet model, which replaces the standard convolution in DenseNet with depthwise separable convolution (DSC). To solve the problem wherein face deflection affects the recognition effect, we propose a posture normalization model based on GAN: a GAN with two local discriminators (LD-GAN) that strengthen the discriminatory abilities of the expression-related local parts, such as the parts related to the eyes, eyebrows, mouth, and nose. These discriminators improve the model's ability to retain facial expressions and evidently benefit FER. Quantitative and qualitative experimental results on the Fer2013 and KDEF datasets have consistently shown the superiority of our FER method when working with multi-pose face images.


Introduction
With the rapid development of computer technology and artificial intelligence technology, the demand for human-computer interaction is increasingly strong. Enabling computers to understand and recognize human facial expressions is valuable in the domains of intelligent robotics, intelligent monitoring, virtual reality, medical assisted diagnosis, and so on. Benefiting from the improvement of computer performance, algorithms based on deep learning have become the mainstream scheme of FER. [...] and achieves accuracy similar to that of AlexNet with fewer parameters [9]. Aiming at running on mobile terminals, ShuffleNet, a lightweight model proposed in [10], adds group convolution to the networks, and this makes the model smaller and faster. In the past two years, researchers have proposed many lightweight models that continuously improve FER accuracy [11][12][13][14][15].
The FER algorithms mentioned above focus on frontal face images, but since facial deflection at various angles cannot be avoided in natural environments, the accuracy of FER is lowered to some degree [16]. Therefore, pose normalization is performed before FER; i.e., the face is corrected to the frontal view in the case of deflection.
In order to correct face deflection, early researchers proposed some 3D modeling methods. However, when the facial deflection angle is too large, the face normalization results of this type of method are unsatisfactory. In 2014, Goodfellow et al. proposed the generative adversarial network (GAN) [17], which provides a new solution to the problem of missing features caused by face deflection. Up to now, a series of GAN variants have been developed to correct facial deflection. These GAN models for face normalization focus on preserving contour features in the process of synthesizing face images in order to facilitate identification. If the downstream task is FER, the synthesized frontal face does not meet the requirements very well because these models do not focus on the preservation of local features related to facial expressions.
In this paper, we present an expression recognition method which combines the DenseNet FER model with the GAN-based posture normalization model. This method solves the problems of the large number of parameters in the FER model and low accuracy in multi-view face normalization. The contributions of this paper are summarized as follows:

1. A lightweight FER model, DSC-DenseNet, is proposed. It reduces network parameters and computations by replacing the standard convolution in DenseNet with DSC. With 0.16M parameters, the FER rate of this model is 96.7% for frontal face input and 77.3% for profiles without posture normalization.

2. A posture normalization model, a GAN with two local discriminators (LD-GAN), is proposed based on the TP-GAN model. An encoder-decoder structure implements a two-pathway generator with a global pathway and a local pathway. In order to preserve more local features related to facial expressions in generated frontal faces, the discriminator was improved by adding two local discriminators besides the global discriminator to enhance its adversarial capability against the local pathway encoder. The loss functions are also improved to achieve better effects in network training.

3. The effectiveness of this method was verified on three public datasets. The validity of the lightweight FER model was verified on the CK+ and Fer2013 datasets, and the final effect of the combination of the posture normalization model and the FER model was verified on the KDEF dataset. Compared to other representative models, this method effectively reduces the number of model parameters and achieves a higher FER rate (92.7%) under the condition of multi-angle deflection.
The remaining parts of this paper are organized as follows. Section 2 describes the previous related work. Section 3 describes the lightweight FER model and the posture normalization model that we propose in detail. Section 4 describes the experimental datasets, results, and related analysis. Section 5 gives conclusions and suggestions for future work.

DenseNet
Although various CNN-based FER models have improved recognition rates, the consequent increase in the number of parameters has also resulted in more computational requirements. DenseNet is a model with a narrow network structure, as shown in Figure 1 [8]. DenseNet consists of dense blocks and transition layers, which lie between two adjacent blocks and change feature map sizes via convolution and pooling. The basic idea is that in a dense block, as in ResNet, direct connections from preceding layers to following layers are created. The difference is that a dense block establishes a dense connection between all preceding layers and each following layer; i.e., each layer takes all preceding feature maps as input. The feature reuse of DenseNet improves the transmission of information throughout the entire network and reduces the number of parameters. To achieve the same accuracy as ResNet, DenseNet only needs about half of ResNet's parameters and half of its FLOPs (floating-point operations).

Depthwise Separable Convolution (DSC)
Compared to standard convolution, DSC has far fewer parameters and much lower computational complexity. Thus, it has been successfully applied to two well-known models, Xception [18] and MobileNet [19], by the Google team. DSC splits the computation of standard convolution into two steps: depthwise convolution, which applies a single convolutional filter to each input channel, and pointwise convolution, which creates a linear combination of the outputs of the depthwise convolution. For example, the depthwise conv applies N convolution kernels of size M × M × 1 to N input channels of size W × H, yields N feature maps of size W × H × 1, and concatenates them into one feature map of size W × H × N. In other words, the depthwise conv has the same number of channels for input and output feature maps. However, there is no connection between the different channels in the process so far. The pointwise conv solves this problem by applying K standard convolutions of size 1 × 1 × N. It weights the feature map along the depth dimension to generate a feature map of size W × H × K; i.e., it has the ability to fuse channels. The ratio of the cost of DSC to that of standard convolution is $1/K + 1/M^2$.
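The ratio above can be checked by counting weights directly. The following is a minimal sketch, assuming bias terms are ignored; the function names and the example values (M = 3, N = 128, K = 32) are illustrative and not taken from the paper's tables.

```python
def standard_conv_params(m, n, k):
    """Weights of a standard M x M convolution mapping N channels to K channels (no bias)."""
    return m * m * n * k


def dsc_params(m, n, k):
    """Weights of a depthwise separable convolution with the same shape (no bias)."""
    depthwise = m * m * n  # N kernels of size M x M x 1, one per input channel
    pointwise = n * k      # K kernels of size 1 x 1 x N that fuse the channels
    return depthwise + pointwise


def dsc_ratio(m, n, k):
    """Cost of DSC relative to standard convolution; equals 1/K + 1/M^2."""
    return dsc_params(m, n, k) / standard_conv_params(m, n, k)


if __name__ == "__main__":
    m, n, k = 3, 128, 32  # illustrative values only
    print("standard:", standard_conv_params(m, n, k))
    print("DSC:     ", dsc_params(m, n, k))
    print("ratio:   ", dsc_ratio(m, n, k), "closed form:", 1 / k + 1 / m**2)
```

Because both costs scale with the same W × H output positions, the same ratio holds for multiply-accumulate operations as for weights.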

GAN and Its Variants
The GAN is a deep learning model, and its framework is shown in Figure 2a [20]. This framework corresponds to a minimax two-player game. The GAN consists of two models: a generative model G that captures the data distribution and a discriminative model D that estimates the probability that a sample came from the training data rather than from G. The aim of the training procedure for G is to maximize the probability of D making a mistake.
Huang et al. proposed a two-pathway GAN (TP-GAN) [21] for photorealistic frontal view synthesis by simultaneously perceiving global structures and local details. As shown in Figure 2b, TP-GAN uses two pathways in G. Four landmark-located patch networks, in addition to the commonly used global encoder-decoder network, are used to attend to local textures. Then, the synthesized frontal image is used for the downstream task: identity recognition.
The training strategy and the loss function are challenging problems facing GANs. Combining a 3D-morphable model with a traditional GAN, FF-GAN [22] addresses the difficulty of training GANs by providing shape and appearance priors to guide the training on insufficient samples. Additionally, a new symmetry loss is introduced into the loss function. Similarly providing additional information to assist in training, the disentangled representation learning GAN (DR-GAN) [23] introduces a pose code to G and a pose estimation to D. Hu et al. proposed a couple-agent pose-guided GAN (CAPG-GAN) [24]. In the learning process for this network, the pose-guided G uses posture information provided by landmark heatmaps of input profile images and ground truth images. The couple-agent D essentially consists of two independent discriminators: one for discriminating rotation angles and the other for discriminating textures. Differing from the approaches above, Hardy et al. proposed a learning procedure for distributed GANs, MD-GAN [25], which can be trained over datasets that are spread across multiple servers.
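For reference, the minimax two-player game mentioned at the start of this subsection is commonly written in the form below. This is the standard objective of the original GAN formulation, not the modified loss used by LD-GAN; $p_{data}$ denotes the data distribution and $p_z$ the noise prior.

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{data}(x)} \left[ \log D(x) \right]
  + \mathbb{E}_{z \sim p_{z}(z)} \left[ \log \left( 1 - D(G(z)) \right) \right]
```

D is trained to push $V$ up by scoring real samples high and generated samples low, while G is trained to push it down, which corresponds to maximizing the probability of D making a mistake.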

Proposed Approach

The Framework of Dense Block
The lightweight FER model in this paper is based on DenseNet's feature reuse strategy, which is shown in Figure 3. In a dense block, the original feature $x_0$ is input into the layer $h_1$, and $x_1$ is the output. The input of the layer $h_2$ includes not only $x_1$, the output of the layer $h_1$, but also the original feature $x_0$. The input of the layer $h_3$ includes not only $x_2$, the output of the layer $h_2$, but also $x_1$ and $x_0$. To further lighten the FER model, we improve the feature map extraction in the dense block by using DSC instead of standard convolution to simplify the calculation.
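The feature reuse pattern described above can be traced with a few lines of Python. This is a minimal sketch that tracks only which feature maps each layer receives and how many channels that implies; the initial channel count of 64 is a hypothetical value, and the growth rate of 32 is the value used later in this section.

```python
def dense_block_inputs(num_layers):
    """Return, for each layer h_1..h_L, the names of the feature maps it receives."""
    features = ["x0"]              # the original feature entering the block
    inputs_per_layer = []
    for i in range(1, num_layers + 1):
        inputs_per_layer.append(list(features))  # h_i sees every earlier feature map
        features.append(f"x{i}")                 # the output x_i is reused by later layers
    return inputs_per_layer


def dense_block_input_channels(c0, growth_rate, num_layers):
    """Input channel count of each layer when every layer emits `growth_rate` channels."""
    return [c0 + i * growth_rate for i in range(num_layers)]


if __name__ == "__main__":
    for i, names in enumerate(dense_block_inputs(3), start=1):
        print(f"h{i} receives {names}")
    # c0 = 64 is an assumed example value, not taken from the paper
    print(dense_block_input_channels(c0=64, growth_rate=32, num_layers=3))
```

The linear growth of the input channel count is what motivates limiting the per-layer cost, either with a bottleneck or with DSC.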
The non-linear transformation function H(·) in a dense block is defined as batch normalization (BN) + rectified linear unit (ReLU) + 3 × 3 convolution. As the number of layers increases, the number of input channels increases dramatically with the number of concatenated feature maps. For this reason, a 1 × 1 conv is used before the 3 × 3 conv to limit the number of input channels. Then H(·) is defined as BN + ReLU + 1 × 1 Conv + BN + ReLU + 3 × 3 Conv. Assume that the input size of layer i in a dense block is 48 × 48, the feature maps of all the preceding layers are concatenated with N channels, the bottleneck layer reduces the number of channels to 128, and the growth rate k (the number of output channels per layer; i.e., the number of input channels added to the next layer after concatenation) is 32. Then, the operation of layer i is as shown in Figure 4.
After replacing standard convolution with DSC, H(·) is defined as BN + ReLU + 3 × 3 DSC. Then, the operation of layer i is as shown in Figure 5. The number of pointwise conv kernels, i.e., the number of channels of the feature map output by DSC, is the growth rate k of DenseNet.
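To give a feel for the difference between the two definitions of H(·), the sketch below counts convolution weights for one layer under each of them. It assumes that the DSC variant is applied directly to the N concatenated channels without a bottleneck, and it ignores BN parameters and biases; N = 256 is a hypothetical example value.

```python
def bottleneck_layer_params(n_in, bottleneck=128, growth_rate=32, kernel=3):
    """Weights of 1 x 1 Conv (N -> bottleneck) followed by 3 x 3 Conv (bottleneck -> k)."""
    conv_1x1 = n_in * bottleneck
    conv_3x3 = kernel * kernel * bottleneck * growth_rate
    return conv_1x1 + conv_3x3


def dsc_layer_params(n_in, growth_rate=32, kernel=3):
    """Weights of a 3 x 3 DSC mapping N channels to k channels."""
    depthwise = kernel * kernel * n_in  # one 3 x 3 filter per input channel
    pointwise = n_in * growth_rate      # k kernels of size 1 x 1 x N
    return depthwise + pointwise


if __name__ == "__main__":
    n = 256  # assumed number of concatenated input channels
    print("bottleneck layer:", bottleneck_layer_params(n))
    print("DSC layer:       ", dsc_layer_params(n))
```

Under these assumptions the DSC layer needs roughly one seventh of the weights of the bottleneck layer for N = 256; the exact saving depends on N.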

The Architecture of DSC-DenseNet
Because of its use of DSC, we refer to this network, which is shown in Figure 6, as DSC-DenseNet. The parameters of its components are given in Table 1.
DSC-DenseNet consists of four dense blocks, three transition layers, and a classification layer. The four dense blocks contain three, six, twelve, and eight DSC layers, respectively. ReLU is used as the activation function. The transition layer uses 2 × 2 average pooling. If too many channels are output by the previous dense block, a 1 × 1 conv is added to reduce the number of channels and thus simplify operations. Examples of this addition include the 1 × 1 × 128 conv in transition layer 2 and the 1 × 1 × 256 conv in transition layer 3. In this situation, the transition layer is BN + 1 × 1 Conv + 2 × 2 average pooling. Finally, the classification layer recognizes seven major types of facial expressions through a SoftMax multi-class classifier.
Assume that the size and number of channels of the input feature map of the convolution layer are $H_i \times W_i \times C_i$, and those of the output feature map are $H_o \times W_o \times C_o$, with a convolution kernel of size $H_k \times W_k$. After summing the computational quantities of the depthwise and pointwise convolutions and dividing the sum by that of standard convolution, a fraction can be obtained: $1/C_o + 1/(H_k W_k)$. As the number of feature maps increases, $1/C_o$ can be ignored, and the size of the depthwise convolution kernel determines the computational quantity. Due to the use of DSC with 3 × 3 convolution kernels, the computational complexity in a dense block can be reduced to $1/32 + 1/3^2 \approx 14.2\%$. Because of the other layers in DSC-DenseNet, the overall computational complexity is actually reduced to about 30%.
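The fraction quoted above follows from the usual multiply-accumulate counts for the three operations. The derivation below is a sketch under that standard assumption (one multiply-accumulate per kernel weight per output position), using the symbols $C_i$, $C_o$, $H_k$, $W_k$, $H_o$, and $W_o$ as they appear in the text.

```latex
\begin{aligned}
\text{standard conv:}  &\quad H_k W_k \, C_i \, C_o \, H_o W_o \\
\text{depthwise conv:} &\quad H_k W_k \, C_i \, H_o W_o \\
\text{pointwise conv:} &\quad C_i \, C_o \, H_o W_o \\[4pt]
\frac{H_k W_k C_i H_o W_o + C_i C_o H_o W_o}{H_k W_k C_i C_o H_o W_o}
  &= \frac{1}{C_o} + \frac{1}{H_k W_k}
\end{aligned}
```

With $C_o = 32$ and $H_k = W_k = 3$, this gives $1/32 + 1/9 \approx 0.142$, matching the 14.2% figure.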

Frontal Face Normalization Model: GAN with Two Local Discriminators (LD-GAN)
Facial pose variations still remain a great challenge for FER models, especially for lightweight ones that sacrifice some accuracy. Therefore, facial pose normalization is a commonly adopted step. Synthesizing a frontal face from a profile image is a highly non-linear transformation.

The Framework of LD-GAN
Based on the idea of a two-pathway generator for TP-GAN, we propose LD-GAN, whose framework is shown in Figure 7. A global pathway is used to process facial contour features and a local pathway is used to process facial expression features in G. Two local feature discriminators are added in D to enhance the adversarial operation between G and D.

The profile $x^p$ is input into the global pathway, and the global encoder $G^i_{enc}$ extracts the global contour features $f_i = G^i_{enc}(x^p)$. After landmark localization and cropping, four local patches of the eyes, nose, and mouth, $x^p_k$, are input into the local pathway, and the local encoder $G^e_{enc}$ extracts the facial expression features $f_{ek} = G^e_{enc}(x^p_k)$. The purpose of cropping out expression-related parts is to reduce image noise by eliminating pixels that are less associated with expressions and to force $G^e_{enc}$ to focus on extracting expression features. This way, the generated image can preserve the original expression better. To fuse the information from all pathways, feature maps need to be concatenated together, but only if they have the same spatial resolution. Thus, we first fuse the two feature maps from the eye patches together, specifically according to their landmarks, and then fuse those of the nose and mouth. At last, we obtain $f_s$ simply by concatenating the global pathway feature map $f_i$ with the two fused feature maps, $f_e$ and $f_m$.
$f_s$ is decoded by $G_{dec}$ and a frontal face $I^f_s$ is generated. In the local pathway, $f_e$ and $f_m$ are decoded by $G_{dec}$, and two frontal patches, the eye patch $I^f_e$ and the mouth-nose patch $I^f_m$, are obtained in order to feed D. The reason we use only two patches instead of four is to simplify the framework of D.
Inspired by the multi-discriminator strategy [26,27], we propose a $D_{group}$ including a global discriminator $D_1$ and two local discriminators, the eye discriminator $D_2$ and the mouth-nose discriminator $D_3$. The input of $D_{group}$ includes four images: the real frontal face $I^r_s$, the generated frontal face $I^f_s$, the generated eye patch $I^f_e$, and the generated mouth-nose patch $I^f_m$. The next step is to combine these four images into three image pairs and then input them into $D_1$-$D_3$, respectively.

The Architecture of LD-GAN
As a key component for extracting features from facial images, the core of the encoder is a CNN. We use Light-CNNs as encoders in both global and local pathways because of their advantages, i.e., having fewer parameters and better robustness [28]. As shown in Table 2, $G^i_{enc}$ and $G^e_{enc}$ have the same architecture of Conv0, Conv1-Conv4, fc1, and fc2. The activation function of Conv1-Conv4 is Maxout. The size of the input RGB image is 128 × 128 and the size of the final feature map obtained is 1 × 1 × 256. The decoder $G_{dec}$ is a deconvolution neural network. The deconvolution process has no learning ability and can only visualize global contour or local expression features. The network architecture of $G_{dec}$ is shown in Table 3.
The architecture of $D_1$ is shown in Table 4. The size of the generated frontal face input into $D_1$ is 128 × 128 × 3, and it is changed to 64 × 64 × 64 after the Conv0 layer and to 1 × 1 × 1024 after the Conv1-Conv5 layers. The architectures of $D_2$ and $D_3$ are identical to that of $D_1$ except for the sizes of the input images. The size of the eye patch is 95 × 20, and that of the mouth-nose patch is 50 × 75.

The Improved Loss Function
We have improved the loss function of LD-GAN. Content loss $L_{con}$ is added on the basis of adversarial loss $L_{ck}$. $L_{con}$ consists of pixel loss $L_P$ and symmetric loss $L_S$.
$L_P$ is the difference in pixels between a generated frontal face and an input profile image. The smaller $L_P$ is, the closer the quality of the generated image will be to that of the input image, and so $L_P$ should be minimized. $L_S$ is the Manhattan distance between the left and right sides of the generated frontal face. Calculating $L_S$ accelerates the convergence of G.
As D's ability to discriminate between true and false improves via training, G needs to compete against it to minimize its probability of discrimination. Thus, $L_{ck}$ needs to be minimized for G. To sum up, the loss function of the LD-GAN generator is a weighted combination of these losses, where $\beta_1$, $\beta_2$, and $\beta_3$ are the weights that affect the loss and can be adjusted during training to achieve the best training results.
The D of LD-GAN does not involve content loss, and its training process only includes adversarial loss. The loss function of $D_{group}$ is a weighted sum of three adversarial losses, where $\omega_1$, $\omega_2$, and $\omega_3$ are weighting hyperparameters. D identifies the true or false images generated by G and obtains a probability that the images will be judged as false; thus, D needs to maximize the adversarial loss.
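The content-loss terms described in this subsection can be illustrated with a small pure-Python sketch on grayscale images stored as nested lists. It is not the authors' implementation: the averaging over pixels, the pairing of $\beta_1$, $\beta_2$, $\beta_3$ with the individual terms, and the default weight values are all assumptions made for illustration.

```python
def pixel_loss(generated, reference):
    """Mean absolute per-pixel difference between two equally sized images."""
    h, w = len(generated), len(generated[0])
    total = sum(abs(generated[y][x] - reference[y][x])
                for y in range(h) for x in range(w))
    return total / (h * w)


def symmetry_loss(image):
    """Mean Manhattan distance between the left half and the mirrored right half."""
    h, w = len(image), len(image[0])
    half = w // 2
    total = sum(abs(image[y][x] - image[y][w - 1 - x])
                for y in range(h) for x in range(half))
    return total / (h * half)


def generator_loss(l_p, l_s, l_adv, beta1=1.0, beta2=1.0, beta3=1.0):
    """Weighted combination of pixel, symmetry, and adversarial losses (assumed pairing)."""
    return beta1 * l_p + beta2 * l_s + beta3 * l_adv


if __name__ == "__main__":
    face = [[1, 2, 2, 1], [0, 5, 5, 0]]       # perfectly symmetric toy image
    other = [[1, 2, 2, 3], [0, 5, 5, 0]]      # differs in one pixel
    print(symmetry_loss(face))                # 0.0 for a symmetric image
    print(pixel_loss(face, other))
    print(generator_loss(pixel_loss(face, other), symmetry_loss(other), l_adv=1.0))
```

In a PyTorch training loop the same quantities would normally be computed on tensors so that gradients can flow back to G; the list-based version here only shows what each term measures.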

Experimental Results and Discussions
The experiments were carried out on the Windows 10 operating system and the recognition methods were implemented using the Python language and PyTorch library. The experimental environment included an Intel(R) Core (TM) i7-10750H CPU @ 2.60 GHz processor, 16 GB memory, and GeForce GTX 1650Ti graphics card.
The effectiveness of the proposed DSC-DenseNet and LD-GAN was verified on three public datasets. The final results of the combination of DSC-DenseNet and LD-GAN are demonstrated below.

Datasets
In experiments, we used the following public datasets:
Extended Cohn-Kanade Dataset (CK+) [29]: It is one of the most widely used expression datasets, and was released in 2010. There are 593 sequences in it, and each sequence begins with a neutral expression and proceeds to a peak expression. FER based on a static image often takes the last frames as samples. The eight included expressions are disgust, happiness, surprise, fear, anger, contempt, sadness, and neutral, as shown in Figure 8.
Karolinska Directed Emotional Faces (KDEF) [31]: It includes 4900 RGB images of size 562 × 762 in seven expressions. Unlike Fer2013, every expression in this dataset is represented with five different views, −90°, −45°, 0°, +45°, and +90°, as shown in Figure 10.

Preprocessing
To ensure the effectiveness of FER, we preprocessed the original images before the experiments, including face detection and alignment. We used multitask cascaded convolutional networks (MTCNN) to perform face detection and alignment [32,33].
We converted the last three frames of the labeled CK+ expression sequences to 48 × 48 grayscale images. Since there were too few samples of contempt, and in order to match the seven basic expressions across the datasets, contempt samples from CK+ were excluded from the experiments. Then, we expanded the number of samples to ten times the original number by using common data augmentation methods such as scale augmentation, changing contrast, changing brightness, and flipping from left to right. To preserve an expression, the central area of the image should be maintained. Thus, we did not use random cropping or severe rescaling. This yielded 10,500 samples from seven expression categories in total. In Fer2013, most classes contain many more samples than those in CK+, but the disgust class has too few samples. Thus, we expanded the disgust class and excluded samples that did not contain faces or had severe facial occlusion. In KDEF, the sample size of each expression in every view is the same (140). We expanded this to 1400, so the total number of utilized samples was 49,000.
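The augmentation operations named above (left-to-right flipping, brightness change, and contrast change) can be sketched as follows. This is a minimal illustration on 8-bit grayscale images stored as lists of rows; the function names, the additive brightness model, and the mean-centered contrast model are assumptions, not the authors' implementation.

```python
def _clip(value):
    """Round to the nearest integer and clamp to the 8-bit range [0, 255]."""
    return max(0, min(255, int(round(value))))


def flip_left_right(image):
    """Mirror each row; the central face region stays in place."""
    return [row[::-1] for row in image]


def adjust_brightness(image, delta):
    """Add a constant offset to every pixel (assumed brightness model)."""
    return [[_clip(p + delta) for p in row] for row in image]


def adjust_contrast(image, factor):
    """Scale each pixel's distance from the image mean (assumed contrast model)."""
    pixels = [p for row in image for p in row]
    mean = sum(pixels) / len(pixels)
    return [[_clip(mean + factor * (p - mean)) for p in row] for row in image]


# Tiny 2 x 3 "image" used only to show the calls.
sample = [[10, 20, 30],
          [40, 50, 60]]
augmented = [
    flip_left_right(sample),
    adjust_brightness(sample, 15),
    adjust_contrast(sample, 1.2),
]
print(len(augmented))
```

None of these operations crops the image or moves its center, which matches the stated goal of keeping the central facial area intact.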

Experiments for Effectiveness of DSC-DenseNet
The network parameters during training were as follows: the number of epochs was 150 on CK+ and 250 on Fer2013, the batch size was 32, the initial learning rate was 0.01, and the learning rate was halved every 8 epochs. We used 250 images from each expression category for testing on CK+ and 4000 images from all expression categories for testing on Fer2013. The confusion matrices of FER percentages are shown in Figures 11 and 12.
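The learning rate schedule stated above (initial rate 0.01, reduced to 50% every 8 epochs) can be written as a one-line step function. The sketch below assumes zero-based epoch indices and that the first reduction happens at epoch 8; the function name `step_lr` is hypothetical.

```python
def step_lr(epoch, base_lr=0.01, factor=0.5, step=8):
    """Learning rate for a zero-based epoch index under step decay."""
    # Integer division counts how many full 8-epoch blocks have elapsed.
    return base_lr * factor ** (epoch // step)


# Learning rates at the start of the first four 8-epoch blocks.
for epoch in (0, 8, 16, 24):
    print(epoch, step_lr(epoch))
```

In PyTorch, a comparable schedule can be obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=8, gamma=0.5)`, although the paper does not state which scheduler was used.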

The final evaluation metric was R_expression, the mean of the recognition rates over the seven expressions; where no confusion arises, it is abbreviated as "recognition rate". The experimental results showed that:
1. On the CK+ dataset, the recognition rate of our model was 96.7%. The recognition percentages for the happiness, surprise, and disgust classes were the highest: 100%, 99%, and 98%, respectively. The happiness class was recognized well because its features are more obvious than those of other emotions and thus are not easily confused with them. These results are consistent with those of other existing FER methods. The recognition rate for the sadness class was the lowest (92%), and 6% of sad expressions were misclassified as angry. These two classes were easily confused because both involve frowning to some degree. Neutral and fear expressions were misclassified as sad in 4% and 3% of cases, respectively.
2. On the Fer2013 dataset, the recognition rate of our model was 77.3%. The recognition percentages for the happiness and surprise classes were 93% and 86%, respectively. The recognition rates for the fear and anger classes were the lowest: 62% and 69%, respectively. The main classes that were confused with the fear class were sadness and anger, while the main classes that were confused with the anger class were fear and sadness.
3. For CK+, the frontal faces dataset, the recognition rate of our model could meet practical requirements. For Fer2013, the dataset with profile faces, the recognition rate of our model was significantly lower. Part of the reason was that facial occlusion affected recognition to some extent, although severely occluded samples were removed. Another reason was that the dataset contains multi-view images, and facial deflection significantly affected the effectiveness of FER. This is also the consensus among researchers, and it indicates the necessity of the facial pose normalization model studied in this paper.
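The evaluation metric used above, the mean of the per-class recognition rates, can be computed directly from a confusion matrix of counts. The sketch below assumes rows are true classes and columns are predicted classes; the function names and the toy three-class matrix are illustrative only and are not data from the paper.

```python
def per_class_rates(confusion):
    """Recognition rate of each class: correct predictions / true samples."""
    rates = []
    for i, row in enumerate(confusion):
        total = sum(row)
        # A class with no samples contributes a rate of 0.0 (assumed convention).
        rates.append(row[i] / total if total else 0.0)
    return rates


def mean_recognition_rate(confusion):
    """Unweighted mean of the per-class recognition rates."""
    rates = per_class_rates(confusion)
    return sum(rates) / len(rates)


# Toy 3-class confusion matrix (rows: true class, columns: predicted class).
toy = [[9, 1, 0],
       [2, 8, 0],
       [0, 0, 10]]
print(per_class_rates(toy))
print(mean_recognition_rate(toy))
```

Because every class is weighted equally, this mean differs from overall accuracy when the classes have different numbers of test samples.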

Comparison of DSC-DenseNet with Other Lightweight Models
We performed comparison experiments on Fer2013 to compare our model to the classical lightweight classification models (SqueezeNet, ShuffleNet, ResNet, and MobileNet) and the state-of-the-art classification models (Separate-loss and RAN). The learning curves of some models are shown in Figure 13. We also compared our model to recently proposed FER models (Light-SE-ResNet and PGC-DenseNet). The FER recognition rate, params, and FLOPs of these models are shown in Table 5 (sorted by recognition rate).

Figure 13. The learning curves of some lightweight models on the Fer2013 dataset.

The experimental results showed that:
1. The FER recognition rate of our model on multi-view faces was 77.3% when the params value was 0.16M. Compared to the models with better accuracy (the lower half of Table 5), such as SqueezeNet and ShuffleNet, our model had higher FLOPs because of its concatenation of feature maps. However, its recognition rate was 2.3% and 4.3% higher than that of SqueezeNet and ShuffleNet, respectively. This can also be seen in their learning curves. On the other hand, the upper half of Table 5 shows that speed could be significantly improved at the cost of accuracy: the FLOPs of MobileNetV3 and Light-SE-ResNet were extremely small.
2. Our model achieved comparable accuracy with a smaller params value. Compared to Separate-loss and RAN, the two models with the closest accuracy to ours, the params value of DSC-DenseNet was only about 15% of theirs, and the FLOPs value of DSC-DenseNet was between theirs. Therefore, our model achieved a practical recognition rate, meaning that the lightweight FER model proposed in this paper achieved a balance between accuracy and the performance requirements of the hardware platform.

Training Strategy
Compared to the generation task of G, D's true or false discrimination task was more difficult to train. In order to achieve a dynamic balance between the performances of G and D, the update frequency ratio between G and D was 1:2 in training. At the same time, a small learning rate was given to D to slow down its convergence and avoid an imbalance between the abilities of D and G. This effectively prevented G's loss from increasing continuously and G's internal parameters from failing to improve in the desired direction.
During the training process, it was not possible to know in advance when the performance of the GAN would be optimal. Too much training could produce negative effects, making the parameters unstable and damaging the effectiveness of the original model. Therefore, it was necessary to store the parameters when the model achieved excellent performance during training. After a certain number of epochs in the learning process, 10 random face images in the test set were compared with the generated faces. If the difference between the two sets of images increased abruptly during the process, the training was terminated, because this indicated that the previously stable parameters had been destroyed.
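The stopping rule described above can be sketched as a simple check on a history of image differences. The paper does not specify how the difference is measured or what counts as an abrupt increase, so the mean absolute pixel difference and the `jump_ratio` threshold below are assumptions, and both function names are hypothetical.

```python
def mean_abs_difference(real, generated):
    """Mean absolute pixel difference between two equally sized grayscale images."""
    total = 0.0
    count = 0
    for real_row, gen_row in zip(real, generated):
        for r, g in zip(real_row, gen_row):
            total += abs(r - g)
            count += 1
    return total / count


def should_stop(history, current, jump_ratio=2.0):
    """Return True if `current` exceeds `jump_ratio` times the previous check.

    `history` holds the differences from earlier checks; `jump_ratio` is an
    assumed threshold for what counts as an abrupt increase.
    """
    if not history:
        return False  # nothing to compare against on the first check
    return current > jump_ratio * history[-1]


# Differences recorded at earlier checks, followed by a sudden jump.
previous_checks = [12.0, 10.5, 9.8]
print(should_stop(previous_checks, 9.5))
print(should_stop(previous_checks, 25.0))
```

In practice the difference would be averaged over the 10 sampled test images, and the parameters saved at the last stable check would be kept as the final model.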
During the training process, G and D were optimized using the Adam optimizer. The learning rates for G and D were set to 0.001 and 0.0005, respectively. After each epoch, the learning rate was adjusted by exponential decay with a decay parameter of 0.999. The maximum number of epochs was set to 500 and the batch size was set to 128. In order to improve the generalization of the network, label smoothing was used; i.e., the labels 0 and 1 were replaced by random numbers in the ranges 0-0.1 and 0.8-1, respectively, when the true or false judgment was made.
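The label smoothing and exponential decay described above can be sketched with the standard library alone. The sketch assumes uniform sampling within the stated ranges and a decay applied once per epoch starting from epoch 0; the function names are hypothetical and the paper does not state the sampling distribution.

```python
import random


def smooth_label(is_real, rng=random):
    """Replace the hard label 1 (real) or 0 (fake) with a random soft label."""
    if is_real:
        return rng.uniform(0.8, 1.0)  # soft "true" label
    return rng.uniform(0.0, 0.1)      # soft "false" label


def exponential_lr(base_lr, epoch, gamma=0.999):
    """Learning rate after `epoch` decay steps with decay parameter `gamma`."""
    return base_lr * gamma ** epoch


# Soft labels for one real and one generated sample.
print(smooth_label(True), smooth_label(False))

# Learning rates of G and D at the first and last epoch of a 500-epoch run.
for name, base in (("G", 0.001), ("D", 0.0005)):
    print(name, exponential_lr(base, 0), exponential_lr(base, 499))
```

With a decay parameter of 0.999, the learning rate falls to roughly 60% of its initial value after 500 epochs, so the schedule is gentle. In PyTorch, `torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.999)` provides a comparable per-epoch decay.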

Experiments for Effectiveness of LD-GAN
We performed an experiment on the KDEF dataset, and the loss curve is shown in Figure 14. The trend of the loss curve indicates that the training strategy used for LD-GAN was effective. The result of the posture normalization is shown in Figure 15. The frontal images generated by LD-GAN were then recognized by DSC-DenseNet. The mean FER recognition rate of each expression over all views was calculated, and the confusion matrix of the seven facial expressions was derived, as shown in Figure 16. The experimental results showed that:
1. The recognition rate for happy expressions was the highest: 98.7%. The recognition rates for the surprise and anger classes were the second highest, at more than 96%. The recognition rates for the fear and sadness classes were lower: 85.7% and 87.5%, respectively. The mean recognition rate for the seven expressions was 92.7%.
2. When compared to the results without pose normalization (Figure 12), the proposed method significantly improved the recognition rate and reduced the misclassification rate between different expressions, thus meeting the needs of practical application.

Comparison of Our Model with Others
We performed comparison experiments on the KDEF dataset. The profiles in the ±45° and ±90° views were corrected to frontal faces by three GAN variants (FF-GAN, TP-GAN, and DR-GAN) and our model, and then FER was performed using DSC-DenseNet. We used 5-fold cross validation, and the results are shown in Table 6.
Table 6. Comparison of FER recognition rates, with 95% CIs, of the four models on the KDEF dataset.
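A mean recognition rate with a 95% confidence interval can be derived from five fold scores as sketched below. The paper does not state how its intervals were computed, so the Student-t interval used here is an assumption, and the fold scores are placeholder values for illustration only, not results from the paper.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 4 degrees of freedom (5 folds).
T_CRIT_95_DF4 = 2.776


def mean_with_ci(fold_scores, t_crit=T_CRIT_95_DF4):
    """Return (mean, half_width) of a t-based confidence interval.

    The default `t_crit` is only valid for exactly five folds.
    """
    mean = statistics.mean(fold_scores)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(fold_scores) / math.sqrt(len(fold_scores))
    return mean, t_crit * sem


# Placeholder fold scores (NOT results from the paper).
placeholder_scores = [0.90, 0.92, 0.91, 0.93, 0.89]
mean, half_width = mean_with_ci(placeholder_scores)
print(f"{mean:.3f} +/- {half_width:.3f}")
```

With only five folds the t critical value is noticeably larger than the normal-approximation value of 1.96, which widens the interval accordingly.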

Ablation Study
In order to generate facial images with more expression features, we used two pathways in G to generate local features that were closely related to the selected expressions. Two local discriminators were added to D to preserve the local details and to improve the accuracy of expression classification. To verify the validity of our model, an ablation experiment was performed on the KDEF dataset.
For comparison with LD-GAN, the two local discriminators were removed from the model. The normalization effects of the two models are illustrated in Figure 17. Figure 17a shows a sad-faced man's profile at a −90° view and a neutral-faced woman's profile at a +45° view. In the images generated by the GAN without the local discriminators (Figure 17b), compared to the real frontal faces in Figure 17d, the man's mouth does not show the drop it should have, and the woman's mouth shows an excessive rise. These errors increase the probability that the man's sad expression would be misclassified as neutral and the woman's neutral expression would be misclassified as happy in subsequent FER steps.
In Figure 17c, which shows the faces generated by LD-GAN, the corners of the subjects' mouths are closer to those in the real frontal faces.

Conclusions
The lightweight DSC-DenseNet FER model proposed in this paper reduces the number of parameters and the complexity of computation while achieving a useful FER rate. The proposed LD-GAN face posture normalization model improves the ability to preserve local features related to expression and can generate a face that is more conducive to FER. Experiments on multiple datasets show that the recognition rate for multi-view faces is higher than 92% when LD-GAN and DSC-DenseNet are combined.
In future research, we will investigate the impact of different lighting environments and local occlusion on our model and establish a lightweight FER method for natural scenes.

Data Availability Statement:
The code and data presented in this study are available at https://gitee.com/djw_Hrbust/fer.git (accessed on 1 June 2023).