3D Expression-Invariant Face Verification Based on Transfer Learning and Siamese Network for Small Sample Size

Abstract: Three-dimensional (3D) face recognition has become a trending research direction in both industry and academia. However, traditional facial recognition methods carry high computational and data storage costs. Deep learning has led to a significant improvement in the recognition rate, but small sample sizes represent a new problem. In this paper, we present an expression-invariant 3D face recognition method based on transfer learning and Siamese networks that can resolve the small sample size issue. First, a landmark detection method utilizing the shape index was employed for facial alignment. Then, a convolutional neural network (CNN) was constructed with transfer learning and trained with the aligned 3D facial data for face recognition, enabling the CNN to recognize faces regardless of facial expressions. Following that, the weights of the trained CNN were shared by a Siamese network to build a 3D facial recognition model that can identify faces even with a small sample size. Our experimental results showed that the proposed method reached a recognition rate of 0.977 on the FRGC database, and the network can be used for facial recognition with a single sample.


Introduction
Face recognition has been widely used in various areas and is one of the leading computer vision challenges that has been intensively researched for two decades [1,2]. Although there have been significant breakthroughs for 2D face recognition with emerging deep learning techniques [3], recognition accuracy and stability are still a challenge because the facial appearance and surface of a person can vary significantly due to illumination conditions and changes in pose [4]. With 3D modalities, some research has focused on finding a more robust facial feature representation or descriptor based on the geometric information of a 3D face.
Three-dimensional (3D) face recognition has evolved quickly with the development of 3D scan acquisition devices [5]. It was formerly common to design specific features to extract from 3D face scans because of the restricted acquisition conditions. Lei et al. [6] proposed a face recognition method that combined kernel principal component analysis (KPCA) with a support vector machine (SVM) and achieved a verification rate of 97.8%. However, the manually designed feature in that work is extracted from the height of the forehead, which is not stable enough. Therefore, the local binary pattern (LBP) was used in a 3D face recognition algorithm under expression variations [7]. These methods achieved strong recognition performance but involve relatively complex algorithmic operations compared to end-to-end deep learning models.
Recently, deep learning has reshaped the landscape of face recognition. Baptista et al. [8] used a convolutional neural network (CNN) coupled with low-level 3D local features (3DLBP) to recognize faces captured by a Kinect sensor. However, the network size grew with the number of people to be recognized, which made it hard to scale to large numbers of people. Cai [9] proposed a fast and robust face recognition method utilizing facial landmarks transferred to a deep neural network. A CNN coupled with transfer learning has the potential to overcome expression variations and over-fitting, thereby improving the robustness of such methods.
However, the considerable number of training samples that deep learning requires cannot always be obtained for face recognition, and the performance of a face recognition system degrades when the variability of the acquired faces increases. Therefore, some deep learning methods convert the classical classification task into a search for the best-matching pattern. The angular softmax loss makes a CNN learn angularly discriminative features to improve classification accuracy [10]. CosFace uses an improved loss function that strengthens discrimination by maximizing inter-class variance and minimizing intra-class variance [11]. The Additive Angular Margin Loss (ArcFace) obtains highly discriminative features for face recognition [12]. Moreover, a category of losses has been proposed that learns a universal feature embedding whose magnitude measures the quality of the given face; it introduces an adaptive mechanism that learns well-structured within-class feature distributions by pulling easy samples toward class centers while pushing hard samples away [13].
Nowadays, the Siamese neural network has become a popular method in image recognition, partly because it does not require a large amount of data in the inference phase [14–18]. A Siamese neural network takes two samples as input and outputs their features after dimensionality reduction. By comparing the two output features, the similarity of the two samples can be obtained. The network can have a deep structure composed of convolutional or recurrent neural networks. It is often used to calculate the similarity of two inputs in fields where recognition is difficult or high accuracy is required, such as portrait recognition, target tracking, and fingerprint recognition. Some applications use the Siamese network for target detection or classification. Wang et al. [17] proposed a person recognition algorithm based on an improved Siamese network and introduced a bilinear channel fusion attention mechanism to improve the bottleneck of ResNet50. This method can improve human and object recognition accuracy but cannot completely recognize local details of complex facial expressions. Figueroa-Mata et al. [19] proposed the use of a convolutional Siamese network (CSN) for image-based plant species recognition on small datasets, distinguishing plant species based on leaf images. Sameer et al. [18] proposed a deep Siamese network for source camera classification with limited labels. This method has been applied in label classification and recognition, but it has not been studied in face recognition. Chang et al. [20] proposed a new light Siamese network for feature extraction, whose weights are updated online during tracking to deal with dynamic changes of the target and background. Kim et al. [15] proposed the Siamese adversarial network tracker (SANT), which uses similarity learning with an SAN discriminator and also uses the structure of the residual long short-term memory tracker. Although the Siamese network does not need many training faces in the inference phase, the number of face classes increases quickly during the face-matching phase, which increases the data storage burden. Liu et al. [21] proposed a joint face alignment and 3D face reconstruction method to enhance face recognition accuracy, although the reconstruction method requires the complete raw face data to be stored for support.
In this study, a transfer-learning-based Siamese network was used to identify faces with a small sample size. The proposed algorithm integrates a pre-trained CNN and a Siamese network structure to obtain a transformation from the high-dimensional input 3D face surface to a relatively sparse feature space for face recognition. In contrast to a traditional neural network, the faces in the dataset are transformed into intermediate feature encodings that are stored for later network prediction, which saves storage space. The network was first trained to extract features for end-to-end face recognition, and the trained network weights were then shared to construct a Siamese network for small sample size recognition.
Specifically, the proposed face recognition method based on transfer learning and Siamese networks can resolve the small sample size issue. The specific objectives of the current research are as follows:

• Develop a face recognition method with the Siamese network that converts the face recognition problem from a single-network classification problem into a facial pattern search problem.
• Replace the distance-based face-matching process with a small, fully connected neural network trained with different face sample pairs, which effectively reduces the computational cost of face matching.
• Utilize transfer learning to pre-train the network, improving network efficiency and training speed when the sample size is small.

Dataset and Facial Scan Pre-Processing
The three-dimensional face data used in this study were obtained from the FRGC v2.0 database [22], collected by the University of Notre Dame in the United States, which includes 4007 frontal face scans of 466 individuals. Between 1 and 22 scans of each person were collected using a three-dimensional scanner in autumn 2003, spring 2004, and autumn 2004. The scans cover facial expressions such as expressionless, smiling, laughing, frowning, bulging, and surprised, as well as small changes in pose. This database is currently one of the most widely used for face recognition.
Three-dimensional (3D) scanning data usually contain many redundant features, including ears, hair, shoulders, and other areas unrelated to the human face, as well as outliers caused by the collection equipment and environmental conditions. Therefore, the face scan needs to be pre-processed. First, outliers were eliminated. Second, the shape index (SI) [23] was used to identify the nose tip in order to correct the face position. The shape index $s$ ranges from −1 to 1 and is computed from the x and y components, written $(\cdot)_x$ and $(\cdot)_y$, of the vectors $\partial n/\partial x$ and $\partial n/\partial y$, which are formed by the corresponding point $n$ and its surrounding points in the x-direction and y-direction, respectively. Different shape index values correspond to different curved surface shapes: a more concave surface has a smaller shape index, and a more prominent surface has a larger shape index. Finally, the face scan was smoothed by interpolation. Figure 1 shows the visualization results of face point cloud data before and after pre-processing.

The pre-processed point cloud was projected and converted into a two-dimensional depth image, as shown in Figure 2. The face depth images were used to construct a small sample size dataset for face recognition in this work. More specifically, 995 face images were selected from 199 people with five different facial expressions in the FRGC spring 2004 dataset. These face images were transformed into face depth images and grouped into 199 classes, one per person. Notably, in order to simulate the practical small sample size situation, only five scans of each person were used as our experimental data, of which three were used for training and the remaining two were used to estimate the performance of the proposed method.
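The projection of a pre-processed point cloud onto a two-dimensional depth image can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `point_cloud_to_depth`, the grid resolution, and the choice to keep the largest z value per pixel (assuming the face is viewed along the z-axis) are all assumptions made for the example.

```python
def point_cloud_to_depth(points, width, height):
    """Project (x, y, z) points onto a height x width grid of depth values.

    Assumption: the face is viewed along the z-axis, so when several points
    fall into the same pixel the largest z is kept. Empty pixels stay None.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_min, y_min = min(xs), min(ys)
    # Guard against a degenerate (zero-width) bounding box.
    x_span = (max(xs) - x_min) or 1.0
    y_span = (max(ys) - y_min) or 1.0

    depth = [[None] * width for _ in range(height)]
    for x, y, z in points:
        # Map each coordinate to a pixel index, clamping the upper edge.
        col = min(int((x - x_min) / x_span * width), width - 1)
        row = min(int((y - y_min) / y_span * height), height - 1)
        if depth[row][col] is None or z > depth[row][col]:
            depth[row][col] = z
    return depth


# Hypothetical toy scan: four points, two of which share a pixel.
toy_points = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (0.0, 1.0, 3.0), (0.1, 0.1, 5.0)]
toy_depth = point_cloud_to_depth(toy_points, width=2, height=2)
```

In practice the empty pixels would still need to be filled, for example by the interpolation smoothing step mentioned above.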

Method
The process of the proposed face recognition algorithm is shown in Figure 3, and it includes three main phases: (1) Training phase: The pre-processed faces are used for transfer learning from the pre-trained ResNet18 weights to build a classification model. Then, the transferred weights are shared to construct a Siamese network, which is fine-tuned for binary face classification with paired face samples. (2) Warehousing phase: The collected faces are transformed into a feature space, and their features form a feature gallery for future face matching. (3) Testing phase: The input face is scanned and pre-processed to obtain a depth image. Then, the trained network performs a forward pass to obtain the network output, which serves as a confidence level for matching faces.
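The warehousing and testing phases can be illustrated with a small sketch. Everything here is hypothetical: `encode` stands in for one branch of the trained network, and `match_score` stands in for the matching head. In the proposed method both would be learned networks, whereas the stand-ins below are simple hand-written functions chosen only so the example runs on its own.

```python
def encode(depth_image):
    """Stand-in for one branch of the trained network (hypothetical).

    The 'feature' here is just the mean of each image row.
    """
    return [sum(row) / len(row) for row in depth_image]


def match_score(feat_a, feat_b):
    """Stand-in for the matching head (hypothetical): higher means more alike."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(feat_a, feat_b))
    return 1.0 / (1.0 + sq_dist)


def build_gallery(enrolled):
    """Warehousing phase: store one feature vector per identity, not the raw scan."""
    return {person_id: encode(image) for person_id, image in enrolled.items()}


def identify(probe_image, gallery):
    """Testing phase: encode the probe and return the best-scoring identity."""
    probe_feat = encode(probe_image)
    scores = {pid: match_score(probe_feat, feat) for pid, feat in gallery.items()}
    best_id = max(scores, key=scores.get)
    return best_id, scores[best_id]


# Toy 2x2 'depth images' for two hypothetical identities.
gallery = build_gallery({
    "person_a": [[1.0, 1.0], [2.0, 2.0]],
    "person_b": [[5.0, 5.0], [6.0, 6.0]],
})
best_id, confidence = identify([[1.1, 0.9], [2.0, 2.2]], gallery)
```

Storing only the encoded features in the gallery is what allows the raw scans to be discarded after enrollment.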

Transfer Learning of Convolutional Neural Network Model for Face Recognition
We established a primary convolutional neural network, which can be used to extract facial features and achieve face classification when there are sufficient facial data and a small number of categories.
The convolutional layer in the network can be defined as:

$$(I * K)(i, j) = \sum_{m} \sum_{n} I(i - m, j - n)\, K(m, n)$$

where I is the two-dimensional face scan input, K is the two-dimensional kernel, K(m, n) denotes element (m, n) of K, and I(i − m, j − n) denotes element (i − m, j − n) of I. The convolutional neural network is able to map different inputs to the same output and is thus widely applied for classification tasks. In our work, this characteristic is used to make our network expression-invariant. Expression invariance means that for any inputs with different expressions from the same person, the network obtains the same output. Expression invariance can be defined as:

$$f(x_1) = f(x_2)$$

where f(x) denotes the network output for input x, and x_1 and x_2 are scans of the same person with different expressions.

Some studies have shown that convolutional neural networks based on transfer learning have improved performance, which is mainly reflected in three aspects [24–26]. First, transfer learning may improve the initial network performance; second, transfer learning may increase the training speed of the network, so that its performance improves faster; finally, transfer learning may lead to better convergence, enabling the network to obtain a better training result.
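Returning to the convolution described at the start of this subsection, in which each output element sums the products of I(i − m, j − n) and K(m, n) over the kernel indices, a direct and unoptimized sketch is given below. The function name `conv2d_full` and the zero treatment of out-of-range indices are assumptions made for illustration.

```python
def conv2d_full(image, kernel):
    """Discrete 2D convolution: out[i][j] = sum_m sum_n image[i-m][j-n] * kernel[m][n].

    Terms whose image index falls outside the image are treated as zero,
    which yields the 'full' output of size (H + kh - 1) x (W + kw - 1).
    """
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = [[0.0] * (w + kw - 1) for _ in range(h + kh - 1)]
    for i in range(h + kh - 1):
        for j in range(w + kw - 1):
            total = 0.0
            for m in range(kh):
                for n in range(kw):
                    if 0 <= i - m < h and 0 <= j - n < w:
                        total += image[i - m][j - n] * kernel[m][n]
            out[i][j] = total
    return out


image = [[1.0, 2.0], [3.0, 4.0]]
kernel = [[0.0, 1.0], [2.0, 3.0]]
result = conv2d_full(image, kernel)
```

Note that many deep learning frameworks implement the closely related cross-correlation (without flipping the kernel) in their convolutional layers; since the kernel is learned, the distinction does not affect what the network can represent.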
In transfer learning, the learner must perform two or more different tasks, but we assume that many of the factors that explain the variations in $P_1$ are relevant to the variations that need to be captured for learning $P_2$. Many visual categories share low-level notions of edges and visual shapes, the effects of geometric changes, changes in lighting, and so on.
This characteristic of transfer learning can be observed in Figure 4, which shows the changes in network performance during the training process in two scenarios. The higher starting point in Figure 4 indicates that the network using transfer learning has better initial performance, and the steeper slope reflects the faster improvement in accuracy with transfer learning [27]. The higher final asymptote indicates that transfer learning converges to a better final result. When fewer face data are available, transfer learning has a greater positive effect. Based on these advantages, the face classifier was trained with transfer learning to construct a face feature extraction network.
At present, commonly used convolutional neural networks include AlexNet [28], ResNet18 [29], and VGG16 [30], among others. These neural networks have achieved good performance in related application fields, so choosing them for transfer learning can be more effective in improving the classification performance of the whole face recognition system. The three networks were proposed in different periods and have different structures.
The output layer of each of these three networks is a 1000-dimensional fully connected layer. The networks differ in the number of convolutional layers they include and in their overall structure. In AlexNet and VGG16, the number, configuration, and sequential arrangement of convolutional layers differ. ResNet18 contains a particular residual module, represented as Conv-x, through which the network can perform residual learning. We replaced the output dimension of the original fully connected layer with the number of faces to be classified and performed transfer learning to form the required face classifier and construct a suitable convolutional network for face feature extraction.
Considering that the network performs a classification task in training, we used cross-entropy [31] as the loss function, which can be defined as follows:

$$J(\theta) = -\mathbb{E}_{x, y \sim p_{\text{data}}} \log p_{\text{model}}(y \mid x; \theta)$$

where $p_{\text{model}}(y \mid x; \theta)$ is the output probability distribution of the model with parameter θ when the network input is x and the ground truth output is y, J(θ) is the loss function with model parameter θ, and $p_{\text{data}}$ is the empirical data distribution. The cross-entropy function measures the degree of difference between two probability distributions over the same random variable. Here, it expresses the difference between the true face label probability distribution and the predicted face label probability distribution. The smaller the value of cross-entropy, the better the prediction of the model.
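As a small numerical illustration of the cross-entropy loss, the sketch below averages the negative log-probability that the model assigns to the true label of each sample. The function name and the toy probabilities are hypothetical and chosen only to show that more confident correct predictions give a lower loss.

```python
import math


def cross_entropy(predicted_probs, true_labels):
    """Mean negative log-probability assigned to the true class of each sample."""
    total = 0.0
    for probs, label in zip(predicted_probs, true_labels):
        total -= math.log(probs[label])
    return total / len(true_labels)


# Toy predictions over three hypothetical identities for two samples.
confident = cross_entropy([[0.9, 0.05, 0.05], [0.1, 0.8, 0.1]], [0, 1])
uncertain = cross_entropy([[0.4, 0.3, 0.3], [0.3, 0.4, 0.3]], [0, 1])
```

Here `confident` is smaller than `uncertain`, matching the statement that a smaller cross-entropy corresponds to a better prediction.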

Siamese Network for Face Recognition
The convolutional neural network constructed could effectively extract the face area to achieve the classification task. However, as the number of categories increases, the accuracy of the classification decreases, and it is not possible to build an output layer for all people. Thus, we used a Siamese network to avoid this impact caused by the small number of samples.
There are two main differences between a Siamese network and a traditional convolutional neural network. In terms of neural network structure, the Siamese neural network constructs two convolutional neural network branches in the form of weight sharing, and the two branches are used to extract the prediction results from faces of two different persons or one person with two different expressions. In terms of the loss function, the Siamese neural network replaces the cross-entropy function commonly used in classification problems through a unique distance measurement method, which improves the network's interpretability and makes the neural network feature extraction training more in line with the extraction difference of the gradient direction.
A transfer Siamese neural network can be constructed with the feature extraction results obtained by transfer learning, and its structure is shown in Figure 5. There are two feature extraction branches with shared weights. The w box in Figure 5 represents the shared weights, which are used to extract facial features. A fully connected layer is used to complete the comprehensive mapping of features at the end of the network. Using the Siamese network, it is possible to transform the network mapping from a high-dimensional face image X to a low-dimensional encoding output. The detailed parameters of the network are listed in Table 1.

There are three further essential characteristics of the Siamese network. First, the distance in the output code space can reflect the similarity and the neighbor relationship of the input data in the high-dimensional space. Second, the mapping is not limited by the distance of the input face data but can learn complex features inside the face data. Finally, this mapping can be effective for new sample data with unknown neighbor relationships. Therefore, the face feature output layers are connected to a fully connected layer for the final regression, and the mean squared error (MSE) is adopted as the loss function to optimize the network.
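The weight sharing and the regression head described above can be sketched with a tiny hand-written network. This is only an illustration under stated assumptions: the class `TinySiamese`, the single-layer branch, the concatenation of the two codes before the head, and the hand-picked (untrained) weights are all hypothetical, and the actual layer parameters are those listed in Table 1.

```python
def linear(vec, weights, bias):
    """Fully connected layer: one output per row of weights."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def relu(vec):
    return [max(0.0, v) for v in vec]


class TinySiamese:
    """Two branches that reuse the same weights, followed by a small head."""

    def __init__(self, branch_w, branch_b, head_w, head_b):
        self.branch_w, self.branch_b = branch_w, branch_b  # shared by both branches
        self.head_w, self.head_b = head_w, head_b

    def encode(self, x):
        # Both inputs pass through this same method, hence the same weights.
        return relu(linear(x, self.branch_w, self.branch_b))

    def forward(self, x1, x2):
        # Assumption: the two codes are concatenated before the head.
        joint = self.encode(x1) + self.encode(x2)
        return linear(joint, self.head_w, self.head_b)[0]


def mse(predictions, targets):
    """Mean squared error between predicted and target similarity values."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Hand-picked toy weights (hypothetical, not trained).
net = TinySiamese(
    branch_w=[[1.0, 0.0], [0.0, 1.0]], branch_b=[0.0, 0.0],
    head_w=[[0.5, 0.5, -0.5, -0.5]], head_b=[1.0],
)
same_pair = net.forward([1.0, 2.0], [1.0, 2.0])
diff_pair = net.forward([1.0, 2.0], [3.0, 4.0])
# Target 1.0 for a same-person pair and 0.0 for a different-person pair.
loss = mse([same_pair, diff_pair], [1.0, 0.0])
```

Training would adjust the head (and, during fine-tuning, the shared branch) so that same-person pairs score near 1 and different-person pairs near 0, which drives this loss toward zero.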

Results and Discussion
To effectively evaluate the accuracy of face recognition with different numbers of categories, the classification accuracy of the face network on the test set was used as a unified evaluation index for the performance of different models, which was defined as follows:

$$ACC = \frac{N_c}{N}$$

where $N_c$ represents the number of correctly classified faces, N represents the total number of faces, and ACC represents the classification accuracy.
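Since the accuracy is the ratio of correctly classified faces to all test faces, it can be computed in a few lines. The function name and the toy label lists below are hypothetical.

```python
def accuracy(predicted, actual):
    """ACC = N_c / N: fraction of test faces assigned the correct identity."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("predicted and actual must be non-empty and equal in length")
    n_correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return n_correct / len(actual)


# Four toy test faces, three of which are classified correctly.
acc = accuracy([0, 1, 2, 2], [0, 1, 2, 3])
```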

Evaluating the Effectiveness of Transfer Learning CNN
To compare the recognition performance of the three types of networks and the improvement in classification accuracy brought by transfer learning, the three convolutional network structures were trained separately with different numbers of categories, and the best accuracy of each network after training was recorded. The training data were the faces with at least five different expression images in the FRGC v2.0 database. We used a total of 199 groups of face data, one group per person, and each experiment drew the specified number of people from these 199 groups. Each group had five face samples with different expressions, of which three were used for training and the other two were used as validation data to simulate the practical small sample condition. The highest accuracy on the validation data was used as the evaluation result for the model. The number of training epochs was fixed at 600, the batch size for AlexNet was set to 128, and the batch size for VGG16 was set to 32 due to the limitation of the experimental hardware. The training processes with and without pre-training were then compared.
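The per-person split described above (three scans for training, two for validation) can be sketched as follows. Which three of the five scans go to training is not stated in the text, so taking the first three is an assumption, as are the function name and the toy file names.

```python
def split_per_person(scans_by_person, n_train=3):
    """Put the first n_train scans of each person in training, the rest in validation."""
    train, val = [], []
    for person_id, scans in scans_by_person.items():
        for idx, scan in enumerate(scans):
            (train if idx < n_train else val).append((scan, person_id))
    return train, val


# Two hypothetical people with five expression scans each.
toy = {pid: [f"{pid}_expr{k}" for k in range(5)] for pid in ("p001", "p002")}
train_set, val_set = split_per_person(toy)
```

With all 199 people, the same split gives 199 × 3 = 597 training images and 199 × 2 = 398 validation images, which together account for the 995 selected scans.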
The accuracy and loss curves of the different networks during training, which can be seen in Supplementary Material Figures S1–S3, improved with transfer learning. In terms of training accuracy, transfer learning increased the accuracy of the network more rapidly, and the final converged result was also more accurate. Transfer learning also made the loss function decline faster and converge to a lower loss value.
In addition, AlexNet had lower classification accuracy than the other two networks, and its loss function converged to a worse value, as there was obvious over-fitting [32]. From Supplementary Material Figure S1b, it can be seen that the loss function of AlexNet first decreased and then increased. This is because the network over-fitted the training set and therefore fitted the validation data poorly, so the loss increased after multiple epochs. The same situation is evident in Supplementary Material Figure S3b, which is also due to over-fitting of the training set. The ResNet18 network performed better than the VGG16 network: its final loss was the lowest, and its prediction accuracy was the highest.
In order to further evaluate the effectiveness of transfer learning, Figures 6 and 7 show the face classifier's classification results for different numbers of categories with and without transfer training. It can be seen from Figure 7 that without pre-training, the face classification accuracy of AlexNet was low, and it was almost impossible to classify faces effectively. This is because there were few training epochs and the amount of training data was small, so the convolution kernels in AlexNet could not be effectively trained, resulting in low classification accuracy. When the number of categories increased, the classification accuracy decreased further. When there were fewer categories, the results were subject to more random variation, so the measured accuracy was less reliable; at the same time, the output of the network was less challenging to map, which improved the classification accuracy.
The other two networks shown in Figure 7 had higher classification accuracy than AlexNet. Overall, the classification performance of ResNet18 was better than that of VGG16. This conclusion is also evident in Figure 7 after transfer training. This is because the ResNet18 network structure includes the residual module, which can better extract features through residual learning. Based on the above results, ResNet18 had the best performance in face classification tasks, followed by VGG16, while AlexNet had the poorest performance. The use of transfer learning improved the classification accuracy of all the networks, and the effect was particularly large for AlexNet.

Table 2 shows the changes in classification accuracy of the different models with and without transfer learning. It is clear that the classification accuracy improved when transfer learning was used. The average classification accuracy of ResNet18 was 95.2% with transfer learning and 89.0% without, an increase of 6.2 percentage points. The average classification accuracy of VGG16 was 90.0% with transfer learning and 83.4% without, an increase of 6.6 percentage points. The average classification accuracy of AlexNet was 71.0% with transfer learning but only 3.1% without, an increase of 67.9 percentage points. These results indicate that transfer learning dramatically improved the accuracy of AlexNet's face classification, and it also had a significant effect on the classification accuracy of the other two networks. When there were fewer categories, the amount of test data was lower due to the small sample size. Therefore, the validation accuracy in those cases was not sufficient to reflect the full performance of the network.
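The gains quoted above are absolute differences between the reported mean accuracies, which the following short check reproduces. The dictionary layout and variable names are illustrative; the numbers are the averages reported in the text.

```python
# Reported mean accuracy in %: (with transfer learning, without transfer learning).
reported = {
    "ResNet18": (95.2, 89.0),
    "VGG16": (90.0, 83.4),
    "AlexNet": (71.0, 3.1),
}

# Absolute gain in percentage points, rounded to one decimal place.
gains = {
    model: round(with_tl - without_tl, 1)
    for model, (with_tl, without_tl) in reported.items()
}
```

Because these are differences between two percentages, they are best read as percentage points rather than as relative improvements.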

Evaluating the Effectiveness of Transfer Siamese Neural Network
To improve the coding stability of the network, a Siamese neural network was used to replace the traditional prediction structure. This structure improves the network's coding stability and makes over-fitting less likely. Table 3 shows the accuracy comparison of the transfer Siamese neural network and the convolutional transfer network for different numbers of face classes. For the convolutional network, the standard deviation of the classification accuracy was 0.0325; for the transfer convolutional network, it was 0.0251; and for the transfer Siamese neural network, it was 0.0166. The classification accuracy of the Siamese neural network was therefore less sensitive to the number of classes. This is because the coding structure of the Siamese neural network is fixed and, unlike the other networks, does not change with the number of samples. In addition, the accuracy comparison also shows that the face classification accuracy of the transfer Siamese neural network was higher than that of the other two networks.

Table 2. Classification accuracy of models using different convolutional neural networks with and without transfer learning. Data are in the format µ ± ∆x, where µ represents the mean value of the results and ∆x gives the confidence interval at a confidence level of 0.95.

Evaluating the Performance of Different Methods on Face Recognition
In order to test the validity of our algorithm, we conducted another experiment to calculate the overall accuracy of the proposed method. We ran our algorithm five times and obtained the final average recognition rate shown in Table 4. The proposed method achieves the best rank-one recognition rate, 97.70%, on the FRGC v2.0 dataset. Compared with the rank-one recognition rates of several state-of-the-art methods, the proposed method performs slightly better. The improvement mainly relies on the robustness of the transferred convolutional layers and is further enhanced by the Siamese network architecture.

Table 4. Classification accuracy of different models.

Method                 Accuracy (%)
Mahoor et al. [33]     93.70
Tang et al. [7]        94.89
Lei et al. [6]         96.70
Deng et al. [34]       95.70
The proposed method    97.70

Computational efficiency is a critical concern for practical face recognition (FR) system applications, and it is one of the major advantages of the proposed method. Table 5 presents a timing comparison between the proposed method and several state-of-the-art methods that have published their computation times. The timing experiments were conducted on a PC with 3.2 GHz Intel Core processors and an NVIDIA GeForce RTX 2080 Ti GPU, using a PyTorch implementation for training and testing. To analyze the computational cost of the proposed method, we recorded two costs: processing time, which covers pre-processing and feature extraction, and matching time, which is consumed by feature matching. Our method spends less time in both processing and matching. In the testing phase, the proposed method extracts the feature end to end using only one branch of the Siamese network, so it has an edge over the method in [9], which contains not only a network but also a manually designed feature augmentation. Moreover, the matching time is reduced by a factor of 2 to 45 by replacing the classic Euclidean distance measurement with a simple two-layer fully connected network. The Euclidean distance measurement spends much of its time on the L2 computation, which is more time-consuming than the several matrix product operations in a small neural network. Furthermore, the proposed approach can easily profit from parallel processing due to its neural-network-based structure and rapidly developing GPU acceleration techniques.

Conclusions
In this paper, we proposed a face recognition method based on a transfer learning convolutional network and a Siamese network to solve small sample size issues and reduce the influence of facial expressions. In the face feature training phase, the pre-trained convolutional network was fine-tuned for expression-invariant face recognition. The experimental results show that the transferred network improved face recognition accuracy by 6.2% on average. The transferred network weights were then used to construct a Siamese network, which was trained on random combinations of the samples. Siamese neural network training made the network model more powerful in coding, which reduced the overall face data size relative to the raw 3D data and achieved a recognition accuracy of 97.7%.
In our future work, we will try to minimize the coding space of the proposed face feature generation network, and the loss function of the proposed method may need to be changed to enhance the network's learning of discriminative regions. The proposed method introduced a new kind of feature representation learning with a Siamese network, which enriches the face-matching research field.
Author Contributions: All authors designed this work; Z.L. and H.Z. contributed equally to this work. Z.L., H.Z. and X.S. carried out the experiments and validation of this work; Z.L. and T.Z. wrote original draft preparation; T.Z. and C.N. reviewed and edited the manuscript. All authors have read and agreed to the published version of the manuscript.