GicoFace: A Deep Face Recognition Model Based on Global-Information Loss Function †

Abstract: As CNNs have a strong capacity to learn discriminative facial features, they have greatly promoted the development of face recognition, and the loss function plays a key role in this process. Nonetheless, most existing loss functions do not simultaneously apply weight normalization, apply feature normalization, and follow the two goals of enhancing discriminative capacity (minimizing intra-class variance and maximizing inter-class variance). In addition, they are updated using only the feedback information of each mini-batch and ignore the information from the entire training set. This paper presents a new loss function called Gico loss. The deep model trained with Gico loss is called GicoFace. Gico loss satisfies the four aforementioned key points and is calculated with global information extracted from the entire training set. Experiments are carried out on five benchmark datasets: LFW, SLLFW, YTF, MegaFace and FaceScrub. The results confirm the efficacy of the proposed method and show its state-of-the-art performance.


Introduction
CNNs have greatly promoted the development of face recognition, and the loss function plays a key role in training them. Among the large number of loss functions, cross entropy loss is the most widely used in deep learning-based classification, but it is not the best choice for face recognition, as it aims only at learning separable features rather than discriminative features [1]. Most face recognition tasks are open-set tasks that require the features to have strong discriminative capacity. To enhance the discriminative capacity of the learned features, two targets should be considered: (1) minimizing intra-class variance, and (2) maximizing inter-class variance. Over the past decade, many loss functions [1][2][3][4][5][6][7][8][9][10][11][12] have been proposed for learning highly discriminative features for face recognition. These loss functions can be broadly grouped into two categories: the Euclidean distance-based loss functions [1][2][3][4][5] and the cosine similarity-based loss functions [6][7][8][9][10][11][12]. The vast majority of them are derived from cross entropy loss, either by modifying it with additional constraints or by adding a penalty to it. However, only a few of them explicitly follow the two aforementioned targets.
Typical Euclidean distance-based losses include Center loss [1], Marginal loss [2] and Range loss [3]. All of them add another penalty to implement joint supervision with cross entropy loss. Specifically, Center loss adds a penalty to softmax by computing and limiting the distances between the within-class samples and the corresponding class center, but it does not significantly optimize the inter-class margin. Marginal loss specifies a threshold value and considers all possible combinations of the sample pairs in a mini-batch, forcing the sample pairs from different classes to have a margin larger than the threshold and the sample pairs from the same class to have a margin smaller than the threshold. However, it is not reasonable to use only one threshold to limit the intra-class and inter-class distances simultaneously. Range loss calculates the distances between the samples within each class and chooses the two sample pairs that have the largest distances as the intra-class constraint; at the same time, Range loss calculates the distances between the class centers and forces the class center pair that has the smallest distance to have a larger margin than the designated threshold. This method can effectively optimize the positions of the hard samples in the feature space, but it ignores the optimization of the other samples, so it is unable to learn the optimal feature space. The experimental results reported for the methods above [1][2][3][4][5] show that face recognition performance benefits from both targets of improving discriminative capacity.
Typical cosine similarity-based losses include L-Softmax loss [8], A-Softmax loss [9] and AM-Softmax loss [10]. L-Softmax transforms the measurement from Euclidean distance to cosine similarity by reformulating the output of the softmax layer from W · f to |W| · |f| · cos θ. In addition, L-Softmax enlarges the angular margins between different identities by adding multiplicative angular constraints to cos θ. Nevertheless, L-Softmax does not apply L2 weight and feature normalization. Therefore, the difference between samples is determined by both the angle and the magnitude of the feature vectors, which is inconsistent with the effort to optimize the feature space by angle only. Based on L-Softmax loss, A-Softmax applies L2 weight normalization, so W · f can be further reformulated to |f| · cos θ, which simplifies the training target. With L2 weight normalization, A-Softmax helps CNNs to learn features with a geometrically interpretable angular margin. The experiments in [9] show that performance can be enhanced by L2 weight normalization, although the improvement is very limited. However, A-Softmax still keeps the multiplicative angular constraints, which are difficult to control and whose geometrical meaning is difficult to explain.
AM-Softmax uses additive constraints instead of the multiplicative angular constraints, that is, it replaces cos(mθ) with cos θ − m. AM-Softmax also applies feature normalization and makes |W| · |f| = s, where s = 30 is introduced as the global scaling factor. Hence, the training target |W| · |f| · cos θ is again simplified to s · cos θ. In addition, feature normalization brings benefits such as higher recognition accuracy, better mathematical interpretation and better geometrical interpretation. These benefits are reported in [13][14][15][16].
The properties of the best-performing and most recent losses are summarized in Table 1, from which we can see that loss functions such as Center loss, Range loss, Contrastive loss, Marginal loss and Triplet loss do not apply weight and feature normalization, and loss functions such as A-Softmax loss, AM-Softmax loss, L-Softmax loss and ArcFace do not explicitly follow the two targets of improving discriminative capacity. However, as described above, each of these four properties contributes to recognition performance to some degree. This paper presents a new loss function called Global Information-based Cosine Optimal loss (i.e., Gico loss), and the deep model trained with Gico loss is named GicoFace accordingly. An overview of the proposed training framework is shown in Figure 1. Table 1 also lists the properties of Gico loss, which satisfies all four aforementioned properties. To break through the hardware constraints and make Gico loss possible, Gico loss is calculated with the global distribution information from the entire training set, which distinguishes it from all other loss functions. The main contributions of this paper are as follows:

1. We propose a novel loss function to enhance the discriminative capacity of the deep features. To the best of our knowledge, it is the first loss that simultaneously satisfies all the first four properties in Table 1 and also the first attempt to use global information as the feedback information;
2. We propose and implement three different versions of Gico loss and analyze their performance variation on multiple datasets;
3. To break through the hardware constraints and make Gico loss possible, we propose an algorithm to learn the cosine similarity between the class center and the class edge;
4. We conduct extensive experiments on multiple public benchmark datasets including the LFW [17], SLLFW [18], YTF [19], MegaFace [20] and FaceScrub [21] datasets. Experimental results presented in Section 3 confirm the efficacy of the proposed method and show its state-of-the-art performance.

Table 1 (partial; entries in the order they appear in each row):
[7]: No, Yes, Yes, Yes, mini-batch
SFace loss [12]: Yes, Yes, Yes, Yes, mini-batch
CVM loss [11]: Yes, Yes, No, No, mini-batch
L-Softmax loss [8]: No, Yes, No, No, mini-batch
A-Softmax loss [9]: No, Yes, Yes, No, mini-batch
AM-Softmax loss [10]: No, Yes, Yes, Yes, mini-batch
ArcFace [6]: No

Please note that an earlier version of this paper [22] was presented at the International Conference on Image Processing. Compared with the earlier version, this journal paper adds about 50% new content: (1) experiments on the MegaFace and FaceScrub datasets to further verify the effectiveness of the proposed methods; (2) a more detailed description of related works; (3) more discussion of the proposed methods to answer some key scientific questions; (4) more details about the complete algorithm.

From Cross Entropy Loss to Gico Loss
To better understand the proposed loss, firstly we give a brief review of related works including cross entropy loss, Center loss and some variants of cross entropy loss based on cosine similarity. Then we focus on the proposed Gico loss and give a detailed analysis.

Cross Entropy Loss and Center Loss
Cross entropy loss is the most commonly used loss function in deep learning, which can be formulated as:

$$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{W_{y_i}^{T} f_i + b_{y_i}}}{\sum_{j=1}^{P} e^{W_j^{T} f_i + b_j}} \tag{1}$$

where $W_j \in \mathbb{R}^d$ is the $j$th column of the weight matrix $W$ in the final fully connected layer, $f_i \in \mathbb{R}^d$ is the feature vector of the $i$th sample belonging to the $y_i$th class, $b_j$ is the bias term of the $j$th class, $P$ is the number of classes in the entire training set and $N$ denotes the batch size. A summary of the notation used in this paper is shown in Table 2. From Equation (1), it can be seen that cross entropy loss essentially calculates the cross entropy between the predicted label and the true label, indicating that cross entropy loss focuses only on optimizing the correctness of the classification results on the training set.
In other words, cross entropy loss aims at separating the training samples of different classes instead of learning highly discriminative features and enlarging the margin between overlapped or non-overlapped neighbor classes. Cross entropy loss is appropriate for closed-set tasks, where all the testing classes are predefined in the training set, as in most cases of object recognition and behavior recognition. Nevertheless, in face recognition, it is almost impossible to collect all the faces that may appear in the test stage, so most real applications of face recognition are open-set tasks. Open-set tasks require the learned features to have strong discriminative capacity so as to classify unseen samples correctly. To improve the discriminative capacity of the features, Center loss was proposed by Wen et al. [1]. Center loss minimizes the intra-class distance and is formulated as follows:

$$L_C = \frac{1}{2}\sum_{i=1}^{N}\left\lVert f_i - c_{y_i}\right\rVert_2^2 \tag{2}$$

where $c_{y_i}$ denotes the class center of the $y_i$th class. Center loss is the sum of all the distances between each sample and its class center. Center loss is used in conjunction with cross entropy loss:

$$L = L_{CE} + \lambda L_C \tag{3}$$

where $L_{CE}$ is the cross entropy loss and $\lambda$ is a hyper-parameter for adjusting the relative impact of the two losses. Center loss optimizes only the intra-class variance and does not apply weight and feature normalization.
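The joint supervision described above can be illustrated with a minimal sketch in plain Python. It assumes the usual forms of the two terms (cross entropy averaged over the mini-batch, and Center loss as half the summed squared Euclidean distance to the class center); the function names and the toy inputs are hypothetical and are not taken from the paper.

```python
import math


def cross_entropy_loss(logits, labels):
    """Mean softmax cross entropy; logits[i][j] plays the role of W_j^T f_i + b_j."""
    total = 0.0
    for row, y in zip(logits, labels):
        m = max(row)  # subtract the maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[y]
    return total / len(labels)


def center_loss(features, labels, centers):
    """Half the summed squared distance between each feature and its class center."""
    total = 0.0
    for f, y in zip(features, labels):
        total += sum((a - b) ** 2 for a, b in zip(f, centers[y]))
    return 0.5 * total


def joint_loss(logits, features, labels, centers, lam):
    """Cross entropy plus lambda times Center loss (lam is the balancing weight)."""
    return cross_entropy_loss(logits, labels) + lam * center_loss(features, labels, centers)
```

Note that the Center loss term depends only on the distance of each sample to its own class center, which is why it tightens classes without pushing different classes apart.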

Variants of Cross Entropy Loss Based on Cosine Similarity
L-Softmax loss, A-Softmax loss, AM-Softmax loss and ArcFace loss are variants of cross entropy loss based on cosine similarity. They have all been proposed in the past three years. All of them are derived from the original cross entropy loss in Equation (1), changing the distance measurement from Euclidean distance to cosine similarity. In the cosine space, the similarity between two vectors depends only on the angle between them if feature normalization and weight normalization are applied. This makes the training process more focused on distinguishing different types of samples by optimizing the angle between the vectors, without having to consider the complex multi-dimensional spatial structure in the Euclidean space. The aforementioned variants transform the FC layer formulation from $W_{y_i}^{T} f_i + b_{y_i}$ to $\lVert W_{y_i}\rVert \lVert f_i\rVert \cos\theta_{y_i}$ by setting the bias $b_{y_i}$ to 0, where $\theta_{y_i}$ is the angle between $W_{y_i}$ and $f_i$. However, they make different choices for weight and feature normalization, and use different ways to add marginal constraints. Equations (4) and (5) show the formulation of the L-Softmax loss and the A-Softmax loss, respectively:

$$L_{L} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\lVert W_{y_i}\rVert \lVert f_i\rVert \psi(\theta_{y_i})}}{e^{\lVert W_{y_i}\rVert \lVert f_i\rVert \psi(\theta_{y_i})} + \sum_{j=1, j\neq y_i}^{P} e^{\lVert W_j\rVert \lVert f_i\rVert \cos\theta_j}} \tag{4}$$

$$L_{A} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\lVert f_i\rVert \psi(\theta_{y_i})}}{e^{\lVert f_i\rVert \psi(\theta_{y_i})} + \sum_{j=1, j\neq y_i}^{P} e^{\lVert f_i\rVert \cos\theta_j}} \tag{5}$$

where $\psi(\theta) = (-1)^{k}\cos(m\theta) - 2k$ for $\theta \in \left[\frac{k\pi}{m}, \frac{(k+1)\pi}{m}\right]$, $k \in \{0, \ldots, m-1\}$, and $m \geq 1$ is the angular margin. With greater $m$, the between-class margin becomes larger and the learning objective also becomes harder. In L-Softmax loss and A-Softmax loss, $m$ is used as a multiplier on the angle, so we say that L-Softmax loss and A-Softmax loss apply the multiplicative angular margin. Different from L-Softmax loss, A-Softmax loss introduces weight normalization, which sets $\lVert W_{y_i}\rVert = 1$ by L2 normalization and makes all class centers lie on the hypersphere.
On the basis of L-Softmax loss and A-Softmax loss, AM-Softmax loss further adopts feature normalization and uses the additive cosine margin to replace the multiplicative angular margin. Feature normalization makes the samples of all classes lie on the hypersphere, while the additive cosine margin forces the different classes to be separated at the cosine similarity level. AM-Softmax loss is formulated as follows:

$$L_{AM} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos\theta_{y_i} - m)}}{e^{s(\cos\theta_{y_i} - m)} + \sum_{j=1, j\neq y_i}^{P} e^{s\cos\theta_j}} \tag{6}$$

where $\lVert f_i\rVert$ is fixed by L2 normalization and re-scaled to $s$, so $\lVert f_i\rVert$ is replaced with $s$ in Equation (6). After AM-Softmax loss, ArcFace loss replaces $\cos\theta_{y_i} - m$ with $\cos(\theta_{y_i} + m)$, which gives $m$ a clear geometric meaning as an angle. ArcFace loss is therefore computed as follows:

$$L_{Arc} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\cos(\theta_{y_i} + m)}}{e^{s\cos(\theta_{y_i} + m)} + \sum_{j=1, j\neq y_i}^{P} e^{s\cos\theta_j}} \tag{7}$$
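To make the difference between the additive cosine margin and the additive angular margin concrete, the following sketch computes the per-sample loss for both variants from a list of cosines cos θ_j. It is an illustration only: s = 30 follows the text, while the margin values, the function names and the toy cosines are assumptions, and the sketch ignores the special handling some implementations apply when θ + m exceeds π.

```python
import math


def am_softmax_target(cos_theta, s, m):
    """Target-class logit with an additive cosine margin: s * (cos(theta) - m)."""
    return s * (cos_theta - m)


def arcface_target(cos_theta, s, m):
    """Target-class logit with an additive angular margin: s * cos(theta + m)."""
    theta = math.acos(max(-1.0, min(1.0, cos_theta)))  # clamp against rounding error
    return s * math.cos(theta + m)


def margin_softmax_loss(cosines, label, s=30.0, m=0.35, kind="am"):
    """Per-sample loss, -log p(label), for cosines[j] = cos(theta_j)."""
    logits = [s * c for c in cosines]
    if kind == "am":
        logits[label] = am_softmax_target(cosines[label], s, m)
    else:
        logits[label] = arcface_target(cosines[label], s, m)
    mx = max(logits)  # log-sum-exp with the maximum subtracted for stability
    return mx + math.log(sum(math.exp(v - mx) for v in logits)) - logits[label]
```

With either margin, the target logit is lowered relative to the plain scaled cosine, so a sample must sit closer to its class weight vector to reach the same loss value.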

The Proposed Gico Loss
After reviewing the recent loss functions used in deep face recognition, we present a new loss function, namely Gico loss (Global Information-based Cosine Optimal loss). Gico loss utilizes the global information from the entire training set and integrates the advantages of the existing losses. Firstly, L2 weight normalization is applied by fixing $b_j = 0$ and $\lVert W_j\rVert = 1$. Secondly, we apply L2 normalization to the feature vector $f_i$ and re-scale $\lVert f_i\rVert$ to $s$. Similar to Center loss, Gico loss is used in conjunction with another loss function. Whereas Center loss is paired with cross entropy loss, we choose AM-Softmax loss, as it shows slightly better performance than cross entropy loss. The total loss is formulated as follows:

$$L = L_{AM} + \lambda L_{G} \tag{8}$$

where $L_{AM}$ is the AM-Softmax loss, $L_{G}$ is the Gico loss and $\lambda$ balances the two terms. In designing the Gico loss, two sub-tasks are considered: minimizing the intra-class variance and maximizing the inter-class variance. To cope with these two sub-tasks, two "lite" versions of Gico loss are designed, respectively. Finally, we construct a standard version of Gico loss, which is the combination of these two lite versions. To minimize the intra-class variance, we propose a "lite" version of Gico loss (Gico Lite A), which is formulated as below: where $c_j$ is the center of class $j$, $e_j$ represents the farthest sample of class $j$ from the class center, $R(j)$ represents the cosine range of class $j$, namely the cosine similarity between the class center and the edge of class $j$, and $P$ is the number of classes in the entire training set. During training, the deep features change after each mini-batch, which also changes $c_j$ and $e_j$. To make $c_j$ and $e_j$ as accurate as possible, they should ideally be calculated by traversing the entire training set and updated after each mini-batch. Nevertheless, this is infeasible given the power of existing hardware.
The reason lies in two constraints: the computing power and the memory size of GPUs, TPUs or other similar processing units. If the computing power constraint could be ignored, the deep neural network could take the entire training set as the source of feedback information; if the memory size constraint could be ignored, the deep neural network could load the entire training set into memory and get rid of the size limitation of a mini-batch. Perhaps because of these two constraints, no existing loss uses the entire dataset as the source of feedback information to optimize CNNs in face recognition. In this paper, the first constraint is overcome with two approximations. From Equation (6), it can be seen that the key optimization objective of the AM-Softmax loss is to minimize $\theta_{y_i}$ and maximize $\theta_j$, where $\theta_{y_i}$ represents the angle between $f_i$ and $W_{y_i}$, and $\theta_j$ represents the angle between $f_i$ and $W_j$ with $j \neq y_i$. In other words, AM-Softmax loss aims at decreasing the distances between $W_j$ and the sample features in the $j$th class ($j = 1, 2, \ldots, P$). As the training goes on, $W_j$ is updated automatically toward the center of class $j$ ($j = 1, 2, \ldots, P$), as this leads to the minimum sum of distances between $W_j$ and the sample features in the $j$th class. Therefore, we can simply use $W_j$ as a substitute for $c_j$ without any extra computing power. For $e_j$ and $R(j)$, we propose a learning algorithm to recursively update the range of each class. $R(j)$ is initially set to 1, and then we update $R(j)$ using the following iterations: where $\beta$ is the shrink rate for adjusting the shrink speed of the learned class range, and $\varphi(y_i, j) = 0$ when $y_i = j$, otherwise $\varphi(y_i, j) = 1$.
The learning algorithm takes two cases into consideration and performs two operations accordingly: (a) Replace the class range directly with the cosine similarity between the input sample and its corresponding class center, if their cosine similarity is smaller than the recorded class range; (b) Let the class range shrink by scaling their cosine similarities with β, if the cosine similarity between the input sample and its corresponding class center is larger than the recorded class range. Operation (a) keeps the learned class range up to date. Nevertheless, as the training goes on, the real class range will become smaller and smaller, so operation (b) is performed to help the learned class range shrink to its real value.
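The two operations described above can be sketched as follows. This is only one possible reading, not the paper's exact update rule: operation (a) is taken literally, while operation (b) is assumed to move the recorded range a fraction β of the way toward the observed cosine similarity. The function names and toy data are hypothetical; β = 0.01 matches the shrink rate reported in the experiment settings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def update_class_range(recorded, cos_sim, beta=0.01):
    """Update the recorded cosine range R(j) of one class with one sample."""
    if cos_sim < recorded:
        # Operation (a): the sample lies outside the recorded range, so replace it.
        return cos_sim
    # Operation (b), assumed form: let the range shrink slowly toward the sample.
    return recorded + beta * (cos_sim - recorded)


def update_ranges(ranges, features, labels, centers, beta=0.01):
    """Apply the update for every sample of a mini-batch; returns a new list."""
    ranges = list(ranges)
    for f, y in zip(features, labels):
        ranges[y] = update_class_range(ranges[y], cosine(f, centers[y]), beta)
    return ranges
```

Starting from R(j) = 1, the first sample of a class always triggers operation (a), after which the recorded range tracks the farthest samples seen while slowly contracting as the features become more compact.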
To maximize the inter-class variance, we also propose another "lite" version of Gico loss (Gico Lite B): where $A$ is a set and $\sum \mathrm{Top}(A, K)$ denotes the sum of the $K$ largest elements in $A$. Gico Lite B aims at finding the $K$ pairs of nearest class centers in the entire training set and then calculates the sum of their distances. Compared with non-adjacent class centers, the classes of adjacent centers have a high probability of having small margins or overlaps. If all adjacent classes have proper margins, the non-adjacent classes will have larger margins. Therefore, taking all center pairs into account is unnecessary. The most effective way is to optimize the distances of all the adjacent centers, but it is time-consuming to calculate the number of adjacent center pairs that exist on the hypersphere. Here, a conservative strategy is adopted, namely setting $K$ to $P$, where $P$ is the number of classes, because the minimum number of adjacent center pairs is $P$, which occurs when all the class centers line up in a circle on the hypersphere. For best performance, we finally propose the standard version of Gico loss (Gico Std), which integrates the above two lite versions:

$$L_{G_{std}} = L_{G_A} \cdot L_{G_B} \tag{13}$$

Algorithm 1 shows the basic learning steps in the CNNs with the finally proposed Gico Std.

Output: The parameters $\theta_C$ and the total loss $L$.
4: Calculate $L_{G_{std}}$ by $L_{G_{std}} = L_{G_A} \cdot L_{G_B}$.
5: Calculate the total loss by $L = L_{AM} + \lambda L_{G_{std}}$.
6: Calculate the backpropagation error ∂L t
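The Top(A, K) selection used by Gico Lite B can be sketched as below, under the assumption that the set A holds the cosine similarities of all pairs of class centers, so that the K largest entries correspond to the K nearest center pairs. The function names are hypothetical, K defaults to P as in the text, and any normalization constant in the actual loss is not reproduced here.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def top_k_center_similarity(centers, k=None):
    """Sum of the K largest pairwise cosine similarities among class centers."""
    num_classes = len(centers)
    if k is None:
        k = num_classes  # conservative choice from the text: K = P
    sims = [cosine(centers[a], centers[b])
            for a, b in combinations(range(num_classes), 2)]
    # The largest similarities belong to the nearest (most adjacent) center pairs.
    return sum(sorted(sims, reverse=True)[:k])
```

Enumerating all pairs costs P(P − 1)/2 similarity evaluations, which is acceptable for a sketch; with thousands of classes a matrix product on the accelerator would be the natural replacement.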

1.
Why combine $L_{G_A}$ and $L_{G_B}$ using multiplication instead of simple addition? Does it cause instability?
The idea of multiplication is inspired by LDA (Linear Discriminant Analysis). Using multiplication, only one parameter λ is needed for adjusting the impact of Gico Std. Using addition, two parameters are needed, one for each part of Gico Std. Roughly speaking, Gico Std is the quotient of the average inter-class distance and the average intra-class distance, as shown in Equation (13). Both the denominator and the numerator are bounded, and they constrain each other; thus, their quotient does not lead to instability. We checked the loss curves and confirmed that no instability occurred.

2.
Are the improvements in recognition accuracy somewhat incremental? Our observation is that incremental improvements are common in General Face Recognition (GFR). GFR has reached a very high level of performance, so the scope for improvement is limited. Most recent GFR methods show marginal improvements over the state of the art, or even perform slightly worse, but are aimed at solving specific problems. For example, Sphereface+ [9], Center loss [1] and CosFace [15] have improvements from −0.19% to 0.31% on the LFW dataset.

3.
What are the highlights of the proposed method? Our method creates two "firsts". It is the first loss function that simultaneously satisfies all five properties in Table 1 and is the first to use global information as feedback.
Therefore, the proposed loss has its own merits, will encourage others to carefully consider the use of global information and will create opportunities for new research.

4.
"Cross entropy loss separates the samples of different classes, but does not enlarge the margin between neighbor classes". What's the difference?
These two cases correspond to two kinds of features: separable features and discriminative features. Separable features can be separated into classes by decision boundaries. Discriminative features are further required to have better intra-class compactness and inter-class separability to enhance predictivity. An example can be found in Figure 1 of [1].

5.
Is using global information better than using only mini-batches? Why is global information introduced?
No, both of them are necessary for training a deep learning model. It is widely observed that mini-batch SGD (Stochastic Gradient Descent) makes a neural network generalize better than standard gradient descent that takes the entire dataset as input, as the randomness helps the network escape from some local minima, which benefits generalization. Therefore, on the one hand, the proposed deep model is trained with mini-batch data. On the other hand, the proposed methods also introduce global information, as the mini-batch data cannot provide the loss functions with precise measurement information, such as the positions of the class center and the class edge in Gico loss. Introducing global information makes the measurement information precise, thus improving the final recognition accuracy.

Experiment Settings
Our network models are implemented in TensorFlow with Inception-ResNet-v1 [23] as the trunk network. We combine Inception-ResNet-v1 with different losses, resulting in five different combinations. In all experiments, we set 320 as the epoch size, 120 as the batch size, $5 \times 10^{-4}$ as the weight decay, 0.4 as the keep probability of the fully connected layer, 512 as the embedding size and 0.01 as the shrink rate. We manually optimize the hyperparameter λ. Since the performance is not sensitive to it, we simply try multiple values on the verification set and choose the value that leads to the minimum total loss. The initial learning rate is set to 0.05 and is reduced by a factor of 10 every 100,000 iterations. Table 3 summarizes all experimental simulation parameters. In all experiments, VGGFace2 [24] is used as the training data. To guarantee the reliability of the results, we removed from VGGFace2 the identities that might overlap with the testing sets, but we did not perform data cleaning, as VGGFace2 is a very clean dataset. Finally, there are 3.05 million face images in the preprocessed training set. For testing, we use diverse public benchmark datasets: the LFW [17], SLLFW [18], YTF [19], MegaFace [20] and FaceScrub [21] datasets. For image preprocessing, we applied the same pipeline to every raw image in the training set and the testing sets. At first, MTCNN [25] is employed for face detection. MTCNN occasionally fails to detect the face.
If this occurs for a training image, the image is simply discarded. If it occurs for a testing image, we use the provided official landmarks or bounding boxes instead. All the face images are cropped to the size of 160 × 160. To strengthen the randomness of the training data, random horizontal flipping is performed on the training images. The final features of a testing image are generated by concatenating the features of the original image and the features of its horizontally flipped counterpart so as to improve the recognition accuracy.

MegaFace Challenge 1 on FaceScrub
In this section, we evaluate the performance of the proposed Gico loss on the MegaFace dataset [20] and the FaceScrub dataset [21]. Following the experimental protocol of MegaFace Challenge 1, we use the MegaFace dataset as the distractor set with 1 million distractors. The FaceScrub dataset is used as the testing set. The evaluation is conducted with the officially provided code [20]. Figure 2a shows the identification performance. From Figure 2a, we can observe that the three versions of Gico loss outperform Softmax, AM-Softmax and the other benchmark methods on the Rank1 identification rate by 5% to 22%. On Rank10, the best-performing comparable method is Vocord, but Gico Std still outperforms it by 7%. At all rank values, Gico Std shows better performance than Gico Lite B and Gico Lite A, while Gico Lite A performs better than Gico Lite B. Figure 2b shows the verification performance, where all three versions of Gico loss significantly outperform the other methods across the range of False Positive Rates. Specifically, the proposed Gico loss has a higher True Positive Rate than the other methods by at least 4% when the False Positive Rate is $10^{-6}$. Gico Std again shows better performance than Gico Lite B and Gico Lite A. These results on the FaceScrub dataset demonstrate the effectiveness of the proposed Gico loss.

Results on LFW, YTF and SLLFW
In this section, the proposed methods and the state-of-the-art methods are evaluated on the LFW, YTF and SLLFW datasets. The LFW [17] face image dataset is collected from the web. It contains 13,233 face images with large variations in facial paraphernalia, pose and expression. Following the standard experimental protocol of "unrestricted with labeled outside data" [26], 6000 face pairs are tested according to the given pair list. The YTF [19] face video dataset contains 3425 videos and is obtained from YouTube. We also follow the standard experimental protocol of "unrestricted with labeled outside data" to evaluate the relevant methods on the given 5000 video pairs. Table 4 shows the experimental results of different methods on the LFW and YTF datasets. As we follow the same experimental protocol and settings, the results shown in the upper part of the table are cited from their original papers. From Table 4, it can be observed that Gico Std shows higher verification accuracy on LFW than Softmax, AM-Softmax, Gico Lite A and Gico Lite B by about 0.1%. Gico Std ties with FaceNet for first place on LFW. However, Gico Std utilizes only 3.05 million images for training, whilst FaceNet utilizes 200 million images. Gico Std also beats the other benchmark methods, most of which were published in leading computer vision conferences, by 0.11% to 2.28% on LFW. As for the results on the YTF dataset, all three versions of Gico loss perform better than the comparable methods, by up to 3.42%, which demonstrates the state-of-the-art performance of the Gico loss. LFW is a popular face dataset. However, more and more methods are approaching its theoretical upper limit. Consequently, it is becoming increasingly difficult to differentiate methods on LFW. To confirm the performance of the proposed methods, we conducted an additional experiment on SLLFW [18].
SLLFW uses the same positive pairs as LFW for testing, but in SLLFW, 3000 similar-looking face pairs are deliberately selected out from LFW by human crowdsourcing to replace the random negative pairs in LFW. SLLFW adds more challenges to the testing, causing the accuracy of the same state-of-the-art methods to drop by about 10-20%.
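Pair-based verification of the kind used on LFW, YTF and SLLFW can be illustrated with the sketch below: each pair is scored by the cosine similarity of its two features and declared "same identity" when the score reaches a threshold. The function name, the fixed threshold and the toy scores are assumptions for illustration; how the threshold is selected under the official protocols is not reproduced here.

```python
def verification_accuracy(similarities, same_identity, threshold):
    """Fraction of pairs whose thresholded similarity matches the ground truth."""
    correct = sum(
        (score >= threshold) == is_same
        for score, is_same in zip(similarities, same_identity)
    )
    return correct / len(similarities)


# Toy usage with four hypothetical pairs (two positive, two negative).
scores = [0.9, 0.2, 0.6, 0.4]
truth = [True, False, True, False]
accuracy_at_half = verification_accuracy(scores, truth, threshold=0.5)
```

Replacing random negative pairs with similar-looking ones, as SLLFW does, raises the similarity scores of the negatives and therefore makes any single threshold harder to satisfy.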
From Table 5, we can see the verification accuracy of different methods on SLLFW. The results of some benchmark methods are shown in the top half of the table; they are provided by the SLLFW team [32] and are publicly accessible (http://www.whdeng.cn/SLLFW/index.html#reference, accessed on 30 June 2021). As shown in Table 5, Gico loss achieves considerably higher verification accuracy on SLLFW than the other methods. In the top half of Table 5, the accuracy of the benchmark methods drops by between 4.68% and 16.75% from LFW to SLLFW. By comparison, the accuracy of the proposed Gico loss drops by between 1.45% and 1.49%. The experimental results on SLLFW further confirm the effectiveness of the proposed methods.

Conclusions
This paper presents a novel loss function, Global Information-based Cosine Optimal loss (i.e., Gico loss). To the best of our knowledge, Gico loss is the first attempt to use global information as the feedback in face recognition. We propose a novel algorithm to learn the cosine similarity between the class center and the class edge so as to break through the hardware constraints and make Gico loss possible. In addition, the advantages of the best losses proposed in recent years are also integrated into the Gico loss. Extensive experiments are conducted on the LFW, SLLFW, YTF, MegaFace and FaceScrub datasets. The experimental results show that the proposed Gico loss outperforms all comparable methods on all datasets. In particular, on the FaceScrub dataset, the three versions of Gico loss outperform the comparable methods on the Rank1 identification rate by 5% to 22%. The results demonstrate the effectiveness of the Gico loss and show that it achieves state-of-the-art performance. However, since the class center and the class range used in Gico loss are obtained through a learning process, there is a time lag, which lengthens the time needed to reach convergence. Future work will focus on reducing the convergence time while ensuring the learning accuracy of the class center and class range.