1. Introduction
With convolutional neural networks, the vision community has made great progress in recent years on many challenging problems, such as object detection [1], semantic segmentation [2], object classification [3], and so on. At the same time, face verification methods based on deep convolutional neural networks (CNNs) have achieved high performance [4,5,6,7]. Because it does not require much user cooperation, face verification offers a better user experience than iris verification, fingerprint verification, and other methods. Thus, face verification has recently attracted more and more attention. In general, face verification with a convolutional neural network proceeds in the following steps: a pair of face images is fed into the network for feature extraction; the two extracted features are sent to a classifier to compute a similarity score; and, according to the relationship between the similarity and a threshold, the system judges whether the two images show the same person. Face representation learning plays an important role in face recognition. Some researchers combine deep face representation and verification into one system, learning to map faces into a similarity space directly [4]. However, such a mapping is much harder to learn given a lack of training data. In this paper, we use a deep CNN as a feature extractor and adopt an extra classifier to make the face representation more discriminative, as in [5,6,7].
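The verification pipeline described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the feature vectors stand in for CNN outputs, and the threshold value is an arbitrary assumption for the example.

```python
import numpy as np

def cosine_similarity(f1, f2):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2)))

def verify(f1, f2, threshold=0.5):
    """Declare 'same person' when the similarity exceeds the threshold."""
    return cosine_similarity(f1, f2) >= threshold

# Toy feature vectors standing in for CNN outputs.
f_a = np.array([1.0, 0.0, 1.0])
f_b = np.array([0.9, 0.1, 1.1])   # close to f_a: judged same identity
f_c = np.array([-1.0, 1.0, 0.0])  # far from f_a: judged different identity
print(verify(f_a, f_b), verify(f_a, f_c))
```

In a real system, the threshold is tuned on a validation set to trade off false acceptance against false rejection.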
Common convolutional neural networks for object classification, such as AlexNet [3], VGG [8], GoogLeNet [9], and Residual Networks [10], use only the softmax loss function for multi-class classification. During inference, the input is directly classified by the network. However, unlike object classification, face verification not only needs the ability to distinguish different identities, but also needs to make the distance within the same identity small enough. A network trained only with the softmax loss function cannot pull intra-class samples closer. Siamese networks [11] take a pair of images as input and directly output their similarity. Although Siamese networks attend to the distance between two samples, they ignore the separation of different classes. DeepID2 [6] adds a verification loss, called contrastive loss, to the softmax loss function and thereby addresses this problem. FaceNet [4] uses triplet loss for the same purpose. Triplet loss exploits the distance relationships among an anchor, a positive, and a negative sample: it minimizes the distance between the anchor and the positive while maximizing the distance between the anchor and the negative. Contrastive loss and triplet loss require pairing the training samples, and there is currently no reasonable way to pair samples efficiently. Some work uses online search to find hard examples to pair, which requires selecting training samples before each iteration and thus increases training time. In this paper, we design two CNNs to extract face features with strong identification and verification abilities. CNN1, supervised only by an identification signal, is designed to set different identities apart. CNN2 is supervised by a combination of identification and semi-verification signals, which makes the distance within the same identity small enough. Semi-verification is inspired by the triplet loss in FaceNet and the verification signal in DeepID2, and it represents the distance between pairs from the same identity. Unlike DeepID2 and FaceNet, we do not need to select training samples before each iteration, which avoids the extra time consumption. Our idea is similar to center loss [12], which makes intra-class samples as close as possible. Center loss computes the distance between samples and their class centers to minimize intra-class variation. During backward propagation, center loss must update the class centers, which requires extra computation or parameters, though not complex ones. Our method needs no extra parameters and reduces the intra-class variation that the softmax loss function cannot handle.
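For reference, the triplet loss discussed above can be sketched as follows. This is a minimal numpy illustration of the general idea, not FaceNet's implementation; the margin value is an arbitrary assumption.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||a-p||^2 - ||a-n||^2 + margin):
    pull the positive closer than the negative by at least the margin."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])
p = np.array([0.1, 0.0])          # same identity as the anchor
print(triplet_loss(a, p, np.array([1.0, 0.0])))   # easy negative: zero loss
print(triplet_loss(a, p, np.array([0.05, 0.0])))  # hard negative: positive loss
```

The loss is zero for "easy" triplets, which is exactly why hard-example mining before each iteration is needed and why it adds training time.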
In face pre-processing, it is hard to achieve good normalization for faces with pose variation. In [13], Li proposes using the distance between landmarks instead of eye centers for face normalization, which is reported to be relatively invariant to yaw variation. In our system, we combine this method with the eye-centers method, using the latter under certain conditions.
Inspired by [9,14], we add auxiliary classifiers to assist the training of the CNN. These auxiliary classifiers also provide multi-scale features for recognition; thus, a stronger feature can be obtained by concatenating them. Recently, most face verification methods concatenate face representations of multiple resolutions and multiple regions based on deep CNNs to construct a high-dimensional feature [6,7]. This leads to high computation and a large storage burden. In our work, we combine the face representations of two networks to obtain a compact feature as the final face representation. For each network, only one resolution and one region of the face are used. Because the final feature combines multi-scale features from two CNNs trained with different signals, we call it multi-task and multi-scale feature fusion.
The overall framework of our face verification method is illustrated in Figure 1. Our effective face representation, combined with a Joint Bayesian classifier, achieves high performance (99.71%) on the LFW dataset with a small training database.
The rest of this paper is organized as follows: in Section 2, we introduce the semi-verification signal, which is used to supervise the training of one deep CNN. In Section 3, we present the two deep CNNs and the training algorithm. Face verification based on the proposed framework is presented in Section 4. In Section 5, we compare the performance of our method with that of other deep CNN-based methods. Conclusions are drawn in Section 6.
2. The Proposed Loss Function
Recently, many methods have added verification information to CNNs for face verification tasks, such as contrastive loss [6], triplet loss [4], and lifted structured embedding [15]. A CNN trained with verification information can adjust its parameters end-to-end, so the features it generates have greater discriminative power than those from networks that use only the cross-entropy loss. However, contrastive loss [6] and triplet loss [4] require pairing the training samples. Contrastive loss [6] requires not only positive pairs but also negative pairs (where a positive pair refers to two different face images of the same identity, and a negative pair refers to two face images of different identities). However, the numbers of positive and negative pairs are extremely unbalanced. For a dataset containing n individuals and m face images per person, the number of positive pairs is nm(m − 1)/2, and the number of negative pairs is n(n − 1)m²/2. When n ≫ 1, n(n − 1)m²/2 ≫ nm(m − 1)/2, which means that the number of negative pairs is much larger than the number of positive pairs. Therefore, unreasonable pairing cannot improve the performance and may even make it worse. Triplet loss [4] proposed online and offline methods for selecting training pairs, where each anchor uses a semi-hard sample as its corresponding negative. Although lifted structured embedding [15] does not need a complex pairing method, if the batch size is N, a cost of O(N²) is entailed. The research community still lacks reasonable ways to pair samples.
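The pair-count imbalance above is easy to verify numerically. The dataset size below is a hypothetical example, not the paper's data:

```python
from math import comb

def positive_pairs(n, m):
    # each of the n identities contributes C(m, 2) same-identity pairs
    return n * comb(m, 2)

def negative_pairs(n, m):
    # C(n, 2) identity pairs, and m * m cross-identity image pairs for each
    return comb(n, 2) * m * m

n, m = 10_000, 40  # e.g. 10k identities with 40 images each
print(positive_pairs(n, m))  # 7,800,000 positive pairs
print(negative_pairs(n, m))  # 79,992,000,000 negative pairs
```

With these numbers, negative pairs outnumber positive pairs by a factor of roughly 10,000, which is why naive random pairing is dominated by negatives.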
In order to solve the above problems, we propose a semi-verification signal and a corresponding pair selection method so that the verification information can be added to the CNN reasonably and efficiently.
The semi-verification signal means that only the pairs of the same identity will be used to compute verification loss. It minimizes the L2-distance between the face images of the same identity:
where
S is an index set of face pairs belonging to the same identity. It does not contain pairs of different identities, which is different from verification signals. The negative pairs do not need to be selected, and the imbalance between positive and negative pairs talked above exists no more. In addition, it is the reason why we call it a semi-verification signal. Reducing the intra-class variations and keeping the separable inter-class differences unchanged can also achieve the same purpose as the contrastive loss [
6].
Suppose that there are n different face images of one person; this yields n(n − 1)/2 positive pairs. We only want to use a part of these pairs. However, randomly selected sample pairs cannot establish close relationships among all samples. Suppose that we randomly select m pairs from the n(n − 1)/2 pairwise combinations; some images may then not appear in any selected pair at all. As shown in Figure 2, this can cause the images of this person to be divided into two clusters after training. As a result, the distance between the m pairs of face images is small enough within one cluster but not in the other, and the distance between the two clusters will not be small enough.
To solve the problems mentioned above, we construct positive pairs by forming a circle as the pair selection method. Suppose that there are N training samples of class i in the training dataset; we number these samples x_1, x_2, …, x_N, and the CNN extracts features f_1, f_2, …, f_N for them. As shown in Figure 3, each feature, corresponding to one image, is connected only with its direct neighbors, and there are no extra connections between it and other features. In other words, we use only the pairs (f_i, f_j) with j = i + 1 or (i, j) = (N, 1). This easily solves the problem above: on the one hand, it reduces the computation cost from N(N − 1)/2 pairs to N pairs; on the other hand, it establishes direct or indirect relationships between all face images.
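The circle pairing above can be sketched in a few lines. This is an illustrative rendering of the selection rule, not the paper's code:

```python
def circle_pairs(num_samples):
    """Pair each sample with its next neighbor, closing the circle at the end,
    so every sample appears in exactly two pairs."""
    return [(i, (i + 1) % num_samples) for i in range(num_samples)]

# 5 samples yield 5 pairs instead of C(5, 2) = 10, and the circle keeps
# every image directly or indirectly connected to every other one.
print(circle_pairs(5))  # [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
```

Because the pairs are fixed by the sample ordering, no per-iteration hard-example search is needed, which is the efficiency gain claimed over contrastive and triplet pairing.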
To make the facial features extracted by the CNN have strong identification and verification performance, two kinds of loss functions are used in this paper. One is the identification loss, and the other is the joint identification and semi-verification loss. The identification loss is the cross-entropy loss:

L_id = −Σ_i p_i log q_i,

where p_i is the target probability distribution and q_i is the predicted probability distribution. If t is the target class, then p_t = 1 and p_i = 0 for i ≠ t.
The joint identification and semi-verification loss can be formulated as follows:

L = L_id + λ · (1/2) Σ_{(i,j)∈S} ‖f_i − f_j‖²₂,

where L_id represents the identification part and the second term denotes the semi-verification signal; S is an index set of face pairs belonging to the same identity, and λ is a hyper-parameter used to balance the contributions of the two signals.
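The joint loss above can be illustrated with a small numpy sketch. This is a per-sample, forward-only illustration under assumed toy inputs; the value of λ here is arbitrary, not the paper's setting:

```python
import numpy as np

def softmax_ce(logits, target):
    """Identification (cross-entropy) loss for one sample."""
    z = logits - logits.max()                 # for numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]

def semi_verification(features, pairs):
    """Half the sum of squared L2 distances over same-identity pairs."""
    return 0.5 * sum(np.sum((features[i] - features[j]) ** 2) for i, j in pairs)

def joint_loss(logits, target, features, pairs, lam=0.01):
    """Identification loss plus lambda-weighted semi-verification loss."""
    return softmax_ce(logits, target) + lam * semi_verification(features, pairs)

feats = np.array([[1.0, 0.0], [0.8, 0.2]])  # two images of the same identity
print(joint_loss(np.array([2.0, 0.0]), 0, feats, [(0, 1)]))
```

When the paired features are identical, the semi-verification term vanishes and only the identification signal remains, which matches the λ = 0 behavior discussed in the experiments.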
5. Experiments
Our experiments are based on Caffe [22], using a single NVIDIA GT980X GPU with 4 GB of onboard memory. We train our model on the TesoFaces database, which contains 400,000 face images of 15,340 celebrities collected from the internet, covering both Eastern and Western celebrities. Any names appearing in LFW [23] are removed to make it possible to train on this new dataset. For each class, the method described in Section 4.1 is used to detect faces. We delete images for which no bounding box exists or the bounding box is too small. After that, we manually delete other images such as duplicates or images of bad quality. The final database has 0.4 M images of 15,340 identities. The faces are cropped and aligned to a fixed size in both the training and testing phases. Furthermore, we evaluate our method on LFW [23] and the YouTube Faces Database (YTF) [24], which are challenging datasets for face verification in the wild. We do not use LFW or TesoFaces to train Joint Bayesian and PCA; CASIA-WebFace [13] is used instead.
5.1. Experiment on Hyper-Parameter
We study the balance between the identification and semi-verification signals through the hyper-parameter λ. In our experiment, we explore five different values of λ. As λ increases, the contribution of semi-verification to the loss function grows. If λ = 0, only the identification signal takes effect.
We decide the value of λ from two views. In the first view, the decrease of the loss function is used to measure the performance of the different values. The second view uses face verification accuracy on the LFW dataset to determine whether λ should be zero or not. The TesoFaces database is split into training and validation sets in a 7:3 proportion. We first train our model on the training set of TesoFaces and test on the validation set to compare the performance of different λ values. The final model is trained on the entire TesoFaces database.
Figure 9 shows the loss curves for the different hyper-parameter values. As we can see, the two largest values of λ cause the loss to stop falling, and for the next value the loss decreases very slowly. In contrast to the results of the other three values, the loss for the smallest non-zero λ decreases quickly and finally converges.
Table 2 shows the face verification rate on the LFW dataset using Cosine Distance for λ = 0 and λ ≠ 0. We can see that training with the combination of identification and semi-verification, i.e., λ ≠ 0, performs well. Furthermore, the weight of semi-verification should be fairly small; that is to say, the identification signal plays a much more important role in the CNN training process, while semi-verification acts as a regularizer to a certain extent.
5.2. Learning Effective Face Representation
To learn the most effective face representation, we evaluate various combinations of features from the auxiliary classifiers using Cosine Distance for face verification. We train each deep CNN with three auxiliary classifiers, so there are seven possible combinations of features for each CNN. As shown in Table 3, adding more features from the auxiliary classifiers improves the performance.
This suggests that features from deeper layers have a much stronger classification ability, and that face verification accuracy increases with the number of combined features. Combining all three features increases the accuracy by 0.40% and 0.11% over the best single feature for each CNN, respectively. The trend also suggests that using more auxiliary classifiers may improve the accuracy further.
Furthermore, we compare the face verification rate of each CNN with that of the combination of the two CNNs. The results are shown in Table 4. We can see that the result of CNN2 is not better than that of CNN1: because CNN2 is supervised by an additional verification signal, it cannot separate classes as well as CNN1. If we simply compare the two networks, CNN1 is better than CNN2. However, when we aggregate the outputs of the two networks, the result improves greatly. CNN1 is good at separating classes, and CNN2 is good at minimizing intra-class variation, so the two networks complement each other, and aggregating their features captures the advantages of both. Our final effective face representation is formed by concatenating the features from CNN1 and CNN2, where each network's feature is itself a combination of the three auxiliary-classifier outputs. This shows that multi-task and multi-scale feature fusion has great power in face verification.
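The fusion step amounts to a simple concatenation of per-classifier feature vectors. The sketch below illustrates the shape bookkeeping only; the per-classifier dimension of 160 is a hypothetical value, not the paper's configuration:

```python
import numpy as np

def fuse(features_cnn1, features_cnn2):
    """Concatenate the multi-scale features of the two networks into one
    compact final face representation."""
    return np.concatenate(features_cnn1 + features_cnn2)

# Each CNN contributes three auxiliary-classifier outputs (dimension assumed).
cnn1_feats = [np.zeros(160), np.zeros(160), np.zeros(160)]
cnn2_feats = [np.ones(160), np.ones(160), np.ones(160)]
fused = fuse(cnn1_feats, cnn2_feats)
print(fused.shape)  # (960,)
```

The fused vector is what gets passed to the Cosine Distance or Joint Bayesian classifier evaluated in the next subsection.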
5.3. Evaluation on Classifiers
Learning compact and discriminative features is the key to face verification tasks. For the final verification step, we explore Cosine Distance and Joint Bayesian to improve the discriminative ability of the features. The verification rates of the two methods are shown in Table 5.
As a result, Joint Bayesian appears more appropriate for our face representation and performs much better on face verification tasks. The reason Joint Bayesian outperforms Cosine Distance is that the former takes intra- and inter-identity variance into consideration; in other words, it can further increase inter-identity differences and reduce intra-identity differences after face representation learning.
5.4. Comparison with Other Methods
To show the performance of our algorithm, we compare pairwise accuracy on the LFW and YTF datasets with state-of-the-art deep methods.
In Table 6, we show the comparison results together with the scale of the training database used by each method. Our method achieves 99.71% test accuracy and outperforms most deep face verification algorithms. The method in [5] is only 0.06% higher than ours, but it used 12 times as much training data. Therefore, our face verification method achieves high performance at a small cost.
Figure 10 compares the receiver operating characteristic (ROC) curves of the different methods; the curve of our algorithm is much smoother than the others. In the experiment, there are 17 wrong pairs, three of which are wrongly labeled; thus, our final pairwise accuracy is 99.75%. For safety-critical settings such as financial institutions, a large true positive rate at a small false acceptance rate is important. Although Baidu [5] achieved better overall accuracy than ours, Figure 11 and Table 7 show that when the false acceptance rate is small, our method achieves a better true positive rate.