Double Additive Margin Softmax Loss for Face Recognition

Abstract: Learning large-margin face features whose intra-class variance is small and whose inter-class diversity is large is one of the important challenges in feature learning when applying Deep Convolutional Neural Networks (DCNNs) to face recognition. Recently, an appealing line of research has been to incorporate an angular margin into the original softmax loss function to obtain discriminative deep features during the training of DCNNs. In this paper we propose a novel loss function, termed the double additive margin softmax loss (DAM-Softmax). The presented loss has a clearer geometrical interpretation and can obtain highly discriminative features for face recognition. An extensive experimental evaluation of several recent state-of-the-art softmax loss functions is conducted on the relevant face recognition benchmarks CASIA-WebFace, LFW, CALFW, CPLFW, and CFP-FP. We show that the proposed loss function consistently outperforms the state of the art.


Introduction
Face recognition problems are ubiquitous in the computer vision domain. In the past few years, Deep Convolutional Neural Networks (DCNNs) have set the face recognition (FR) community on fire [1]. Owing to effective layered end-to-end learning frameworks and careful local-to-global deep feature extraction techniques, which are the most important ingredients of their success, DCNNs have immensely improved the state of the art in real-world face recognition scenarios. Numerous layered network architectures for face recognition tasks, such as AlexNet [2], VGG [3], InceptionNet [4], ResNet [5], and DenseNet [6], have been proposed. Among them, the most representative is AlexNet, originally proposed by Krizhevsky et al.; it became a pioneering architecture for image classification and won the ImageNet Large Scale Visual Recognition Challenge in 2012.
It is well known that effective feature representations for face images play an important role in FR. In recent years, a hot research trend in DCNNs has been devoted to learning more discriminative deep features. Intuitively, learned deep features for FR are desirable if the maximal within-class variance is less than the minimal between-class variance. However, learning deep features satisfying this condition is generally not easy, owing to the inherently large intra-class variation and high inter-class similarity [7] in many FR applications. Although the softmax function with the cross-entropy loss (called the softmax loss) is popularly used in the training of DCNNs, recent studies [8,9] made it clear that the softmax loss is insufficient to encourage deep features meeting the above condition. To boost the discriminative ability of DCNNs, and inspired by this observation, the center loss [10], pairwise loss [11], and triplet loss [12] were proposed. They all aim to enhance the discriminative power of deep features by minimizing within-class variance and maximizing between-class variance in the Euclidean feature space. While these methods are superior to the traditional softmax loss in classification performance, they suffer from some drawbacks. The center loss only explicitly enhances intra-class compactness while disregarding inter-class separability. The pairwise loss and triplet loss require careful mining of pairs or triplets of samples, which is highly time-consuming.
Because few existing softmax losses can effectively achieve the discriminative condition that the maximal within-class variance is less than the minimal between-class variance under the conventional Euclidean metric, approaches have more recently been proposed to address this problem by transforming the original Euclidean feature space into a corresponding angular space [10,13-15]. Specifically, both the Large-Margin Softmax loss [13] and the A-Softmax loss [14] are angular softmax losses that enable DCNNs to learn angular deep features by imposing an angular margin constraint for larger inter-class variance. Compared to the Euclidean margin suggested by [2,16], the learned angular features are more discriminative with the angular margin, because the angular metric with cosine similarity is intrinsically more suitable for the softmax loss. During training with the A-Softmax loss, the original softmax loss must be combined with it to ensure convergence. To overcome this optimization problem of the A-Softmax loss, the Additive Margin Softmax loss (AM-Softmax) [15] was proposed. This loss integrates an angular margin into the softmax loss in an additive manner. Its implementation and optimization are much easier than those of the A-Softmax loss, which integrates the angular margin in a multiplicative way. AM-Softmax is easily reproducible and achieves state-of-the-art performance.
Motivated by the AM-Softmax loss, this paper proposes a new additive angular margin loss, namely the double additive margin softmax loss (DAM-Softmax). The idea behind the proposed loss is to impose an additive margin m on both the intra-class angular variation and the inter-class angular variation simultaneously, to enhance the intra-class compactness and inter-class discrepancy of the learned features. Compared to the AM-Softmax loss, our loss has a stronger geometrical significance and leads to more discriminative features. Experimental results on relevant face recognition benchmarks show that the proposed loss achieves better classification performance than current state-of-the-art losses.
The rest of this paper is organized as follows. In Section 2, we briefly introduce related work, including the original softmax loss, L-Softmax loss, A-Softmax loss, and AM-Softmax loss. In Section 3, we discuss the proposed loss, the double additive margin softmax loss, in detail. Finally, extensive experiments are presented in Section 4.

Preliminaries
In order to clearly understand the proposed DAM-Softmax loss, we first briefly review the classical softmax loss and the AM-Softmax loss. The classical softmax loss is formulated as

$$L_{S} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{w_{y_i}^{T} f_i}}{\sum_{c=1}^{C} e^{w_{c}^{T} f_i}}, \qquad (1)$$

where $w_c$ ($c = 1, \ldots, C$, with $C$ the number of classes) denotes the weight vector of the last fully connected classifier layer, $f_i$ is the learned deep feature vector fed into that layer for the original input $x_i$ with label $y_i$, and $N$ is the number of training samples in a mini-batch. The inner product $w_c^T f_i$ can be factorized as $\|w_c\|\|f_i\|\cos(\theta_c)$, where $\theta_c$ is the angle between $w_c$ and $f_i$; the loss can thus be rewritten as

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{\|w_{y_i}\|\|f_i\|\cos(\theta_{y_i})}}{\sum_{c=1}^{C} e^{\|w_c\|\|f_i\|\cos(\theta_c)}}. \qquad (2)$$

The A-Softmax loss is derived from the classical softmax loss by imposing the constraint $\|w_c\| = 1$ and generalizing the modified softmax loss to an angular softmax (A-Softmax) loss, replacing $\|f_i\|\cos(\theta_{y_i})$ with $\|f_i\|\psi(\theta_{y_i})$, where the authors define

$$\psi(\theta) = (-1)^k \cos(m\theta) - 2k, \quad \theta \in \left[\tfrac{k\pi}{m}, \tfrac{(k+1)\pi}{m}\right], \; k \in [0, m-1], \qquad (3)$$

to remove the restriction that $\theta$ must lie in $[0, \frac{\pi}{m}]$. In the AM-Softmax loss, the authors instead introduce an additive margin into the decision boundary by defining $\psi(\theta) = \cos(\theta) - m$. In addition, both the deep feature vector $f_i$ and the weight vectors $w_c$ are normalized during the implementation. Thus, the AM-Softmax loss is given by

$$L_{AM} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos(\theta_{y_i})-m)}}{e^{s(\cos(\theta_{y_i})-m)} + \sum_{c=1, c\neq y_i}^{C} e^{s\cos(\theta_c)}}, \qquad (4)$$

where $s$ is a hyper-parameter that scales the cosine values.
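As a concrete illustration of the logit transforms reviewed above, the following sketch (the function names and example values are ours, not from the cited papers) compares the target-class logit of the normalized softmax loss with that of AM-Softmax:

```python
def softmax_logit(cos_theta, s=30.0):
    # Normalized softmax: with unit-norm weights and features,
    # the logit for a class is just the scaled cosine s * cos(theta).
    return s * cos_theta

def am_softmax_target_logit(cos_theta, m=0.35, s=30.0):
    # AM-Softmax replaces cos(theta) with cos(theta) - m for the
    # target class only, making the classification task harder and
    # forcing the feature closer to its class weight vector.
    return s * (cos_theta - m)

# With cos(theta) = 0.8, the margin lowers the target logit
# from about 24.0 to about 13.5.
print(softmax_logit(0.8))
print(am_softmax_target_logit(0.8))
```

The margin only penalizes the ground-truth class during training; at test time plain cosine similarity is used.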
It is clear that both the A-Softmax loss and the AM-Softmax loss share a common idea: they generalize the original softmax loss to an angular softmax loss by introducing a margin parameter m that quantitatively controls the decision boundary, in order to simultaneously enlarge the between-class angular margin and compress the within-class angular variation. Specifically, in the binary-class case, given a learned feature f from class 1 with $\theta_i$ the angle between f and $w_i$, the A-Softmax loss requires $\cos(m\theta_1) > \cos(\theta_2)$ to correctly classify f, whereas the AM-Softmax loss instead requires $\cos(\theta_1) - m > \cos(\theta_2)$. Both explicitly enforce intra-class compactness to achieve more discriminative deep features by imposing an intra-class angular margin, in a multiplicative and in an additive manner, respectively. Compared with the A-Softmax loss, the AM-Softmax loss is simpler and reaches better performance. In addition, it is much easier to implement because the complicated gradient computation for back-propagation is no longer required.

Double Additive Margin Softmax Loss
The AM-Softmax loss obtains better performance by incorporating a single additive margin into the intra-class angular variation. Inspired by this, we propose to impose an additive margin on both the intra-class angular variation and the inter-class angular distribution simultaneously, to enhance intra-class compactness and inter-class discrepancy. To give a formal formulation of this idea, we first define a function $g(\theta) = \cos(\theta)$. Equation (4) can then be rewritten as

$$L = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s\,\psi(\theta_{y_i})}}{e^{s\,\psi(\theta_{y_i})} + \sum_{c=1, c\neq y_i}^{C} e^{s\,g(\theta_c)}}, \qquad (5)$$

where $\psi(\theta_{y_i}) = \cos(\theta_{y_i}) - m$ and $g(\theta_c) = \cos(\theta_c)$.
As analyzed above, we impose an additive margin m on both the intra-class and inter-class angular variations simultaneously. We then have the formulations:

$$\psi(\theta_{y_i}) = \cos(\theta_{y_i}) - m, \qquad g(\theta_c) = \cos(\theta_c) + m. \qquad (6)$$

Compared to the AM-Softmax loss, our formulation is equally simple while explicitly encouraging intra-class compactness and inter-class separability simultaneously; we thus term it the Double Additive Margin Softmax loss (DAM-Softmax). Finally, the proposed loss function can be formulated as

$$L_{DAM} = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{e^{s(\cos(\theta_{y_i})-m)}}{e^{s(\cos(\theta_{y_i})-m)} + \sum_{c=1, c\neq y_i}^{C} e^{s(\cos(\theta_c)+m)}}. \qquad (7)$$

Geometric Interpretation
Our double additive margin has a more explicit geometric interpretation on the hypersphere manifold. To simplify the interpretation, we project the features onto a two-dimensional space and discuss the binary classification case on the hypersphere manifold, where there are only $w_1$ and $w_2$, with $\|w_1\| = \|w_2\| = 1$. The classification performance then depends entirely on the angle $\theta_1$ between f and $w_1$ and the angle $\theta_2$ between f and $w_2$.
Classification Boundary. In Figure 1, we draw a schematic diagram showing the classification boundaries of the classical softmax loss, the AM-Softmax loss, and the proposed DAM-Softmax loss. The classification boundary of the traditional softmax loss is denoted by the vector $p_0$, at which $w_1^T p_0 = w_2^T p_0$ ($w_1 \in$ class 1, $w_2 \in$ class 2). For the AM-Softmax loss, the boundary becomes a marginal region instead of a single vector. At the new boundary $p_1$ for class 1, one has $w_1^T p_1 - m = w_2^T p_1$, which gives $m = (w_1 - w_2)^T p_1 = \cos(\theta_{w_1,p_1}) - \cos(\theta_{w_2,p_1})$. If we further assume that all classes have the same intra-class variance and that the boundary for class 2 is at $p_2$, we get $\cos(\theta_{w_2,p_1}) = \cos(\theta_{w_1,p_2})$. Thus $m = \cos(\theta_{w_1,p_1}) - \cos(\theta_{w_1,p_2})$, which is the difference of the cosine scores for class 1 between the two sides of the margin region. For our DAM-Softmax loss, the boundary becomes a wider marginal region than that of the AM-Softmax loss. At the new boundary $p_3$ for class 1, one has $w_3^T p_3 - m = w_4^T p_3 + m$ ($w_3 \in$ class 1, $w_4 \in$ class 2), which gives $2m = (w_3 - w_4)^T p_3 = \cos(\theta_{w_3,p_3}) - \cos(\theta_{w_4,p_3})$. Again assuming equal intra-class variance, with the boundary for class 2 at $p_4$, we get $\cos(\theta_{w_4,p_3}) = \cos(\theta_{w_3,p_4})$, and thus $2m = \cos(\theta_{w_3,p_3}) - \cos(\theta_{w_3,p_4})$, the difference of the cosine scores for class 1 between the two sides of the margin region. Obviously, the DAM-Softmax loss leads to a larger classification margin between class 1 and class 2.
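The cosine-score gaps derived above (m for AM-Softmax, 2m for DAM-Softmax) can be checked with a trivial numeric sketch; the concrete values below are our own illustration:

```python
# Binary-case boundary conditions, in cosine scores:
#   AM-Softmax boundary for class 1:  cos(t1) - m = cos(t2)      -> gap of m
#   DAM-Softmax boundary for class 1: cos(t1) - m = cos(t2) + m  -> gap of 2m
m = 0.4
cos_t2 = 0.1  # hypothetical cosine score for the competing class

am_boundary_cos_t1 = cos_t2 + m       # smallest target cosine AM accepts
dam_boundary_cos_t1 = cos_t2 + 2 * m  # smallest target cosine DAM accepts

# DAM's margin region is exactly twice as wide in cosine score.
gap_am = am_boundary_cos_t1 - cos_t2
gap_dam = dam_boundary_cos_t1 - cos_t2
assert abs(gap_dam - 2 * gap_am) < 1e-12
```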

Feature Distribution Visualization on MNIST Dataset
In order to better study and verify the effectiveness of the proposed DAM-Softmax loss function, we conducted an experiment on the MNIST dataset [17] to visualize the learned feature distributions. We trained 7-layer CNN models with the original softmax loss, the AM-Softmax loss, and the DAM-Softmax loss, requiring each to output two-dimensional deep features for visualization. After the 2-dimensional features were obtained, we normalized them and plotted them on a circle in the two-dimensional space.
The visualization in Figure 2 demonstrates that our DAM-Softmax outperforms AM-Softmax [15] when the hyperparameters s and m are set to 30 and 0.4, respectively. Compared to AM-Softmax [15], the DAM-Softmax loss leads to a larger inter-class margin and a smaller intra-class variance in the features without tuning many hyper-parameters.

Algorithm
The proposed DAM-Softmax loss is extremely easy to implement in popular deep learning frameworks, e.g., PyTorch [18] and TensorFlow [19]. The algorithm for the DAM-Softmax loss is given as follows.

Algorithm 1: The steps of the DAM-Softmax Loss
Input: feature scale s, margin parameter m in Equation (7), randomly initialized weights w, input features f, batch size N.
1. Normalize the input feature f: f̂ = f / ||f||, and set f = f̂.
2. Normalize the weights w: ŵ = w / ||w||, and set w = ŵ.
3. Using the normalized quantities in Equation (7), compute cos(θ_{y_i}) = ŵ_{y_i}ᵀ f̂_i.
4. Likewise, compute cos(θ_c) = ŵ_cᵀ f̂_i for every class c.
5. Compute cos(θ_{y_i}) − m and the scaled logit s · (cos(θ_{y_i}) − m) of Equations (6) and (7).
6. Compute cos(θ_c) + m and the scaled logit s · (cos(θ_c) + m) of Equations (6) and (7).
7. Construct the loss function according to Equation (7).
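A minimal PyTorch sketch of Algorithm 1 might look as follows; the class name, argument names, and default hyperparameter values are ours, not part of the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DAMSoftmaxLoss(nn.Module):
    """Sketch of Algorithm 1: double additive margin softmax loss."""

    def __init__(self, feat_dim, num_classes, s=30.0, m=0.4):
        super().__init__()
        self.s, self.m = s, m
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, features, labels):
        # Steps 1-2: normalize the features and the class weight vectors.
        f = F.normalize(features, dim=1)
        w = F.normalize(self.weight, dim=1)
        # Steps 3-4: cosine similarities cos(theta_c) for every class.
        cos_theta = f @ w.t()  # shape (N, C)
        # Steps 5-6: subtract m from the target cosine, add m to all others,
        # then scale by s.
        one_hot = F.one_hot(labels, cos_theta.size(1)).to(cos_theta.dtype)
        logits = self.s * (cos_theta - self.m * one_hot + self.m * (1 - one_hot))
        # Step 7: cross-entropy over the margined, scaled logits (Equation (7)).
        return F.cross_entropy(logits, labels)

# Usage sketch with random features (dimensions are illustrative).
criterion = DAMSoftmaxLoss(feat_dim=512, num_classes=10)
loss = criterion(torch.randn(4, 512), torch.tensor([0, 1, 2, 3]))
```

Because the margins appear only in the logits, the whole loss remains differentiable end-to-end and trains with any standard optimizer.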

Experiment
In this section, we first introduce the experiment settings. Then, we will discuss the effect of the hyperparameters. Finally, we will evaluate the performance of our loss function with several existing state-of-the-art loss functions on the benchmark datasets.

Datasets
Training Datasets. The CASIA-WebFace [7] dataset used for training consists of 494,414 color face images from 10,575 classes.
Test Datasets. The LFW dataset [17] contains 13,233 web-collected images of 5,749 different identities, with large variations in pose, expression, and illumination. The CFP dataset [18] consists of 500 subjects, each with 10 frontal and 4 profile images. Its evaluation protocol includes frontal-frontal (FF) and frontal-profile (FP) face verification, each with 10 folds of 350 same-person pairs and 350 different-person pairs. In our experiments, we test performance on the most challenging subsets: CFP-FP, which contains images of celebrities in frontal and profile views, and CPLFW [20] and CALFW [21], which have higher pose and age variations for the same identities as LFW. Specific details of these three datasets are shown in Table 1, and some example images from CFP-FP, CPLFW, and CALFW are given in Figures 3-5, respectively.

Data Preprocessing. We adopt the data preprocessing method used in [14,22] to detect faces and facial landmarks in images and align them. We then crop the aligned face images, resize them to 112 × 112, and normalize the cropped face images by subtracting 128 and dividing by 128.
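The pixel normalization described above (subtract 128, divide by 128) can be sketched as follows; the function name is ours, and we assume detection, alignment, and cropping have already happened upstream:

```python
import numpy as np

def preprocess_face(img_uint8):
    """Map an aligned uint8 face crop from [0, 255] to roughly [-1, 1]
    by subtracting 128 and dividing by 128, as described in the text."""
    img = img_uint8.astype(np.float32)
    return (img - 128.0) / 128.0

# Hypothetical 112x112 aligned RGB crop, all-white for illustration.
face = np.full((112, 112, 3), 255, dtype=np.uint8)
out = preprocess_face(face)
assert out.shape == (112, 112, 3)
assert -1.0 <= out.min() and out.max() <= 1.0
```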

Dataset Overlap Removal.
For open-set evaluation, we use the overlap-checking code provided by F. Wang [22] to get rid of the overlapping subjects between the CASIA-WebFace training dataset and the LFW testing dataset.

Network Architecture and Parameter Settings
For fair comparison, the CNN architecture used in all experiments of this paper is the ResNet-face18 model, a modified ResNet [5] specially designed for face recognition training. The model has an improved residual block with a BN-Conv-BN-PReLU-Conv-BN structure, in which the kernel size and stride of the first convolutional layer are 3 × 3 and 1 instead of the original 7 × 7 and 2, and the stride of the second convolutional layer is set to 2 instead of 1. In addition, PReLU [23] replaces the original ReLU. All implementations in this paper use PyTorch [18]. We set the batch size to 256 and the weight decay parameter to 5 × 10⁻⁴. The initial learning rate is set to 10⁻¹. We set the learning decay rate to 0.05, meaning the learning rate is reduced by 5% whenever the loss value increases. The total number of epochs is set to 110. SGD [9] is used to optimize ResNet-face18.
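A sketch of the described BN-Conv-BN-PReLU-Conv-BN residual block might look as follows; the class name, channel handling, and the 1 × 1 projection shortcut are our assumptions, since the paper does not specify the shortcut branch:

```python
import torch
import torch.nn as nn

class ResFaceBlock(nn.Module):
    """Sketch of the modified residual block described in the text:
    BN-Conv-BN-PReLU-Conv-BN, with a 3x3/stride-1 first conv and a
    stride-2 second conv (shortcut design is our assumption)."""

    def __init__(self, in_ch, out_ch, stride=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1,
                      padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.PReLU(out_ch),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=stride,
                      padding=1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        # 1x1 projection shortcut when the spatial size or channel
        # count changes (a common choice, assumed here).
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=1, stride=stride, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        return self.body(x) + self.shortcut(x)

# The stride-2 second conv halves the spatial resolution:
y = ResFaceBlock(64, 128)(torch.randn(1, 64, 56, 56))  # -> (1, 128, 28, 28)
```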

Effect of Hyperparameter m
According to the discussion in Section 3, our proposed DAM-Softmax loss has two hyperparameters: the scale s and the margin m. These two hyperparameters play a key role in the performance of the proposed loss. Several recent works [15,24] have already discussed the scale s, so we follow [15,24] and directly set it to 30 without further discussion in this paper. We can thus focus on the other hyperparameter, the margin m. We train the ResNet-face18 model with the DAM-Softmax loss on the CASIA-WebFace dataset to seek the best angular margin. For comparison, we train the same network model with the AM-Softmax loss on the same dataset. The results are reported in Tables 2 and 3.

Comparison with State of the Art Loss Functions on LFW Dataset
In this part, we evaluate the performance of the proposed DAM-Softmax loss against state-of-the-art loss functions. Following the previous experimental setting, we train a ResNet-face18 model under the guidance of the original softmax, L-Softmax, A-Softmax, AM-Softmax, and DAM-Softmax losses on the CASIA-WebFace training dataset. The experimental results on the LFW test dataset are shown in Table 4. Table 4. Some results of the comparative testing experiment.

Model / Accuracy Rate
Softmax (ResNet-face18, 110 epochs): 97.08%
L-Softmax (ResNet-face18, 110 epochs) [10]: 97.33%
A-Softmax (ResNet-face18, 110 epochs) [14]: 97.52%
AM-Softmax (ResNet-face18, 110 epochs) [15]:
From Figure 6, it can be seen that the verification accuracy of the DAM-Softmax loss exceeds 80% after one epoch, while the AM-Softmax loss requires 20 epochs to achieve similar accuracy. At the 40th epoch, the DAM-Softmax loss reaches its best performance, which remains superior to that of the AM-Softmax loss. Figure 7 reports the training loss as a function of the epoch. The training loss of the original softmax loss stabilizes at a value of about 13 around epoch 75, and the AM-Softmax training loss stabilizes at around 10 by epoch 55, while the DAM-Softmax loss stabilizes by the 40th epoch at a lower value. This demonstrates that the proposed loss converges faster than the AM-Softmax loss. As can be seen in Table 4, our proposed DAM-Softmax loss consistently achieves competitive results compared to the other losses, which demonstrates its effectiveness.

Comparison with State of the Art Loss Functions on CFP-FP, CPLFW and CALFW Datasets
In order to further verify the effectiveness and robustness of DAM-Softmax, we compare the performance of the proposed loss with related baseline methods, i.e., the original softmax, L-Softmax, A-Softmax, and AM-Softmax losses, on three datasets featuring large pose, large age, and varied angle conditions. The experimental results are listed in Table 5, and the details of the CFP-FP, CPLFW, and CALFW datasets are listed in Table 1. As seen in Table 5, the proposed DAM-Softmax loss obtains the best performance, working notably better than the AM-Softmax loss on all three datasets. This further demonstrates the robustness of our DAM-Softmax loss.

Conclusions and Future Work
In this paper, we presented a novel double additive margin softmax loss function for face recognition. Specifically, we propose to simultaneously impose an additive angular margin on the intra-class and inter-class variations on the hypersphere manifold, which effectively enhances the discriminative power of the learned deep features. Competitive performance on several popular face benchmarks verifies the superiority and robustness of our approach.
Author Contributions: Conceiving the idea, S.Z., and C.C.; writing original draft, S.Z., and C.C.; writing final original manuscript, S.Z., and C.C.; supervision, C.C.; data curation, S.Z., and G.H.; All authors discussed and revised the results, and have read and approved the published version of the manuscript.
Funding: This work was supported by the National Natural Science Foundation of China (60875004).

Conflicts of Interest:
The authors declare there is no conflict of interest regarding the publication of this paper.