Multi-View Cosine Similarity Learning with Application to Face Verification

Abstract: An instance can easily be depicted from different views in pattern recognition, and it is desirable to exploit the information of these views to complement each other. However, most metric learning or similarity learning methods developed over the past two decades are designed for single-view feature representations, so they are not suitable for dealing with multi-view data directly. In this paper, we propose a multi-view cosine similarity learning (MVCSL) approach to efficiently utilize multi-view data and apply it to face verification. The proposed MVCSL method is able to leverage both the common information of multi-view data and the private information of each view: it jointly learns a cosine similarity for each view in the transformed subspace and integrates the cosine similarities of all the views in a unified framework. Specifically, MVCSL enforces the constraint that the joint cosine similarity of positive pairs is greater than that of negative pairs. Experiments on fine-grained face verification and kinship verification tasks demonstrate the superiority of our MVCSL approach.


Introduction
Metric learning or similarity learning aims to develop an effective metric to measure the similarities of samples [1]. Samples from the same class are projected into neighboring locations in the embedding space, while samples from different categories are separated. Recently, numerous metric learning or similarity learning approaches have been introduced [1][2][3], and they have achieved great success in numerous visual understanding tasks, including face verification [2], image retrieval [3], image classification [4] and person re-identification.
Face verification is a representative task in pattern recognition and computer vision; its purpose is to decide whether a pair of facial images belongs to the same subject or not. Face verification has received wide attention since unconstrained face image datasets were released to the public, for example, labeled faces in the wild (LFW) [5], MegaFace [6] and other benchmark face image datasets [7]. A variety of metric learning-based face verification methods have been introduced in the literature [2,8] to advance the performance of face verification. Guillaumin et al. [9] proposed a logistic discriminant method and a nearest neighbor method to learn a distance metric for calculating the similarity of two face images. Koestinger et al. [10] introduced a large-scale metric learning method that computes the Mahalanobis distance of images from the statistical inference perspective and achieved state-of-the-art performance. Schroff et al. [2] exploited deep convolutional neural networks for face verification and clustering and proposed the FaceNet method to measure the distance of face images in Euclidean space. For more detailed introductions to the development of face verification, we refer the reader to the survey paper [8].
Most of the metric learning approaches in face verification are developed for single-view data, so they are not suitable for exploiting multi-view data efficiently. Multi-view data are very common in practical applications and usually describe the information of the examples more comprehensively than single-view data. For instance, we can use different feature representations to depict a face image, e.g., the scale-invariant feature transform (SIFT) [11], local binary patterns (LBP) [12] and the histogram of oriented gradients (HOG) [13]. Multi-view learning aims to improve the performance of classification or recognition tasks by making use of multi-view representations of data. To utilize multi-view data, many multi-view learning methods have been introduced in the last decade [14,15]; however, only a small number of them are developed from the multi-view metric learning perspective, and the existing multi-view metric learning methods are mainly formulated in the framework of Mahalanobis distance metric learning.
In this paper, we develop a multi-view cosine similarity learning (MVCSL) approach to efficiently utilize multi-view data within the cosine similarity learning framework. To capture the correlation across multiple views and exploit the private information of each view, the proposed MVCSL method jointly learns a cosine similarity for each view in the transformed subspace and seeks the optimal combination of the cosine similarities of the multi-view feature representations under a unified framework, where the joint cosine similarity of each positive pair is forced to be greater than a large constant and the joint cosine similarity of each negative pair is forced to be less than a small threshold. Experimental results on fine-grained face verification and facial kinship verification demonstrate the advantages of our MVCSL method. Figure 1 illustrates the basic idea of the proposed MVCSL approach: given multi-view feature representations of each sample, MVCSL learns, under the large margin framework, the optimal combination of the cosine similarities of multi-view data, which constrains the joint cosine similarity of positive pairs to be greater than a large value τ_p and that of negative pairs to be less than a small value τ_n.
The remainder of this paper is structured as follows. Section 2 briefly reviews the related work. In Section 3, we detail the proposed multi-view cosine similarity learning approach, and the experiments for face verification are presented in Section 4. Finally, the paper is concluded in Section 5.

Related Work
The target of metric learning or similarity learning is, broadly speaking, to learn a proper similarity measure that increases the dissimilarity of inter-class samples and the similarity of intra-class samples. A variety of metric learning approaches have been proposed over the past decade, and they have been used in various applications of pattern recognition, including face recognition, image retrieval and fine-grained recognition. For example, Xing et al. [16] formulated metric learning as a convex optimization problem by adopting a semidefinite programming formulation with pairwise similarity side information. Weinberger et al. [17] introduced the classical large margin nearest neighbor (LMNN) algorithm, which encourages the k-nearest neighbors of each sample to come from the same class while samples belonging to different categories are separated by a large margin. Davis et al. [18] presented the information-theoretic metric learning (ITML) approach, which learns a Mahalanobis distance by minimizing the differential relative entropy between two multivariate Gaussians. The keep-it-simple-and-straightforward metric learning (KISSME) method [10] learns a distance metric with maximum likelihood estimators from the perspective of statistical inference. Nguyen and Bai [19] introduced a cosine similarity metric learning (CSML) method using the cosine distance as the similarity measure for face verification. In addition, several methods based on fractal theory [20][21][22] were introduced to find an appropriate distance metric for face recognition. Over the past few years, with the prosperity of deep learning, more and more deep metric learning approaches [2,3,23] have been presented to learn nonlinear mapping functions with deep neural networks.
Although many metric learning methods have been developed, most of them seek a metric or similarity function for either a single-view feature or a concatenation of multiple types of features, so they cannot efficiently exploit multi-view feature representations. To better exploit multi-view data, which usually contain complementary information, several multi-view metric learning methods [7,15,24,25,26] have been introduced to learn a more comprehensive metric than single-view metric learning approaches. For instance, Lu et al. [7] developed a multi-view neighborhood repulsed metric learning approach that utilizes multiple feature representations of samples for kinship verification. Xie and Xing [24] introduced a multi-modal distance metric learning method that maps the samples into a single latent feature space. Hu et al. [15] proposed a sharable and individual multi-view metric learning method that makes use of both the private characteristics of each view and the representation shared across views. Jia et al. [26] introduced a semi-supervised multi-view deep discriminant representation learning method, which utilizes the consensus content of inter-view features and reduces the redundancy of feature representations. However, these existing multi-view metric learning methods are mainly formulated in the framework of Mahalanobis distance metric learning. In this paper, we present a multi-view cosine similarity learning approach within the cosine similarity learning framework that collaboratively learns multiple cosine similarities to better exploit the complementary information of multi-view feature representations.

Multi-View Cosine Similarity Learning
Suppose that we have a training set with N training samples X = {x_i ∈ R^q | i = 1, 2, · · · , N}, where q is the dimension of sample x_i. Any sample can easily be depicted in multiple views with various feature representations. Let X^κ = {x^κ_i ∈ R^{q_κ} | i = 1, 2, · · · , N} denote the feature set of X from the κ-th view, where x^κ_i is the κ-th view representation of x_i, and K and q_κ are the total number of views and the dimension of x^κ_i, respectively.
In general, it is not desirable to directly map multi-view features into a unified subspace: the distributions of different views differ in their own subspaces, so a single mapping cannot take advantage of the specific characteristics of each view and ignores the differences in information among the views. To overcome this limitation, we map the samples of each view into an individual subspace via the linear transformation W^κ, and the cosine similarity between x^κ_i and x^κ_j in the κ-th view is computed by:

cs_{W^κ}(x^κ_i, x^κ_j) = (W^κ x^κ_i)^T (W^κ x^κ_j) / (‖W^κ x^κ_i‖ ‖W^κ x^κ_j‖).  (1)

Considering that the different view representations of each sample depict the same subject and are able to complement each other with their difference information, the joint cosine similarity between the samples x_i and x_j is written as:

cs_θ(x_i, x_j) = (1/K) Σ_{κ=1}^{K} cs_{W^κ}(x^κ_i, x^κ_j),  (2)

where cs_{W^κ}(x^κ_i, x^κ_j) is the cosine similarity of the κ-th view between x_i and x_j. Obviously, the cosine similarity score varies from −1 to 1, so it is very suitable for similarity learning.
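The per-view and joint similarities described above can be sketched in a few lines of numpy. This is a minimal illustration rather than the authors' code: it assumes the joint score is the average of the per-view cosines so that it stays in [−1, 1], and the function names and toy dimensions are ours.

```python
import numpy as np

def view_cosine(W, xi, xj):
    """Cosine similarity of one view after the linear map W (cf. the per-view formula)."""
    ui, uj = W @ xi, W @ xj
    return float(ui @ uj / (np.linalg.norm(ui) * np.linalg.norm(uj)))

def joint_cosine(Ws, xi_views, xj_views):
    """Average the per-view cosine similarities so the joint score stays in [-1, 1]."""
    K = len(Ws)
    return sum(view_cosine(W, xi, xj)
               for W, xi, xj in zip(Ws, xi_views, xj_views)) / K

rng = np.random.default_rng(0)
dims = [5, 7, 4]                    # illustrative per-view input dimensions q_kappa
Ws = [np.eye(3, q) for q in dims]   # W^kappa initialized with ones on the diagonal
xi = [rng.standard_normal(q) for q in dims]
s_pos = joint_cosine(Ws, xi, xi)    # identical views give a joint similarity of 1
```

With the identity-like initialization, a pair of identical samples scores exactly 1, and any pair scores within [−1, 1], which is what makes fixed thresholds τ_p and τ_n meaningful.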
We formulate our multi-view cosine similarity learning (MVCSL) method under the large margin framework to learn the optimal parameter θ = {W^κ}_{κ=1}^{K}. The objective function of our proposed MVCSL method is:

J(θ) = Σ_{l_ij = 1} h(τ_p − cs_θ(x_i, x_j)) + Σ_{l_ij = −1} h(cs_θ(x_i, x_j) − τ_n) + η Σ_{κ=1}^{K} ‖W^κ − W_0‖²_F,  (3)

in which η is the coefficient of the regularization term and h(x) = max(x, 0). τ_p and τ_n are the cosine similarity thresholds for positive and negative pairs, respectively, with −1 ≤ τ_n ≤ τ_p ≤ 1. W_0 is a transformation matrix with ones on the diagonal and zeros elsewhere. We treat the derivative of h as h′(0) = 0 at the point x = 0. The pairwise label l_ij = 1 indicates that x_i and x_j come from the same subject (i.e., a positive pair), and l_ij = −1 denotes that x_i and x_j are from different subjects (i.e., a negative pair). By setting appropriate thresholds, the joint cosine similarity of positive pairs is forced to be greater than the large value τ_p; simultaneously, the joint cosine similarity of negative pairs is forced to be less than the small value τ_n.

To derive the gradients, we rewrite the cosine similarity of the κ-th view in quotient form:

cs_{W^κ}(x^κ_i, x^κ_j) = f^κ_{ij} / g^κ_{ij}, with f^κ_{ij} = (W^κ x^κ_i)^T (W^κ x^κ_j) and g^κ_{ij} = ‖W^κ x^κ_i‖ ‖W^κ x^κ_j‖.  (4)

The gradient of J with regard to W^κ is calculated by:

∂J/∂W^κ = −(1/K) Σ_{l_ij = 1} h′(τ_p − cs_θ(x_i, x_j)) ∂cs_{W^κ}/∂W^κ + (1/K) Σ_{l_ij = −1} h′(cs_θ(x_i, x_j) − τ_n) ∂cs_{W^κ}/∂W^κ + 2η (W^κ − W_0),  (5)

in which the gradients of f^κ_{ij} and g^κ_{ij} with regard to W^κ are computed by:

∂f^κ_{ij}/∂W^κ = W^κ (x^κ_i (x^κ_j)^T + x^κ_j (x^κ_i)^T),  (6)

∂g^κ_{ij}/∂W^κ = (‖W^κ x^κ_j‖ / ‖W^κ x^κ_i‖) W^κ x^κ_i (x^κ_i)^T + (‖W^κ x^κ_i‖ / ‖W^κ x^κ_j‖) W^κ x^κ_j (x^κ_j)^T.  (7)

Substituting Formulas (6) and (7) into the quotient rule, we can obtain the gradient of the cosine similarity needed in Formula (5):

∂cs_{W^κ}/∂W^κ = (g^κ_{ij} ∂f^κ_{ij}/∂W^κ − f^κ_{ij} ∂g^κ_{ij}/∂W^κ) / (g^κ_{ij})².  (8)

After obtaining the gradient ∂J/∂W^κ, the stochastic gradient descent method is used to update W^κ iteratively for each view until the objective function of our MVCSL method converges:

W^κ ← W^κ − µ ∂J/∂W^κ,  (9)

where µ is the learning rate, κ = 1, 2, . . . , K. Algorithm 1 summarizes the main steps of the MVCSL approach, in which we initialize the κ-th transformation W^κ as a matrix with ones on the diagonal and zeros elsewhere, κ = 1, 2, . . . , K.
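As a rough illustration of the update rule, the following numpy sketch implements the quotient-rule gradient of a per-view cosine similarity and one stochastic hinge-loss step. It assumes an averaged joint similarity and uses our own function names; it is a sketch of the derivation under those assumptions, not the authors' implementation.

```python
import numpy as np

def cos_and_grad(W, xi, xj):
    """Per-view cosine similarity and its gradient w.r.t. W via the f/g quotient rule."""
    ui, uj = W @ xi, W @ xj
    ni, nj = np.linalg.norm(ui), np.linalg.norm(uj)
    f, g = ui @ uj, ni * nj
    cs = f / g
    df = np.outer(uj, xi) + np.outer(ui, xj)                           # d f / d W
    dg = (nj / ni) * np.outer(ui, xi) + (ni / nj) * np.outer(uj, xj)   # d g / d W
    return cs, (df - cs * dg) / g                                      # d cs / d W

def sgd_step(Ws, pair_i, pair_j, label, tau_p, tau_n, eta, mu, W0s):
    """One stochastic update: hinge loss on the joint (averaged) cosine similarity."""
    K = len(Ws)
    res = [cos_and_grad(W, xi, xj) for W, xi, xj in zip(Ws, pair_i, pair_j)]
    joint = sum(cs for cs, _ in res) / K
    # subgradient of h(x) = max(x, 0), with h'(0) treated as 0
    if label == 1:
        coef = -1.0 / K if tau_p - joint > 0 else 0.0
    else:
        coef = 1.0 / K if joint - tau_n > 0 else 0.0
    return [W - mu * (coef * dcs + 2 * eta * (W - W0))
            for (cs, dcs), W, W0 in zip(res, Ws, W0s)]

# one illustrative update on a random "positive" pair with two views
rng = np.random.default_rng(1)
Ws = [np.eye(3, 5), np.eye(3, 4)]
xi = [rng.standard_normal(5), rng.standard_normal(4)]
xj = [rng.standard_normal(5), rng.standard_normal(4)]
Ws = sgd_step(Ws, xi, xj, label=1, tau_p=0.8, tau_n=0.1, eta=0.5, mu=0.01,
              W0s=[np.eye(3, 5), np.eye(3, 4)])
```

The gradient returned by `cos_and_grad` can be checked against a finite-difference approximation, which is a useful sanity test before running the full training loop.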

Experiments
This section reports experiments on fine-grained face verification and kinship verification to demonstrate the advantages of our MVCSL method for exploiting multi-view data.
Following the common settings [15], we compare the proposed method with three similarity learning baseline approaches: • MVC-s: the single-view cosine similarity learning method, which learns a single similarity metric via the objective function (3) using a single-view feature representation; • Concatenation (abbrev., Con): all the multi-view feature representations are concatenated into a high-dimensional feature vector, and then the MVC-s method is employed to learn the cosine similarity; • MVC-i: we independently learn the mapping for each view and then add up the cosine similarities of all views as the final cosine similarity of a sample pair.
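The difference between the Con and MVC-i scoring rules can be illustrated with untransformed features; the learned mappings W^κ are omitted here for brevity, and the helper names are ours.

```python
import numpy as np

def cosine(u, v):
    """Plain cosine similarity of two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def score_con(views_i, views_j):
    """Con baseline: concatenate all views, then compute one cosine similarity."""
    return cosine(np.concatenate(views_i), np.concatenate(views_j))

def score_mvc_i(views_i, views_j):
    """MVC-i baseline: sum the independently computed per-view cosines."""
    return sum(cosine(vi, vj) for vi, vj in zip(views_i, views_j))

rng = np.random.default_rng(2)
vi = [rng.standard_normal(d) for d in (6, 8)]   # two toy views of one sample
vj = [rng.standard_normal(d) for d in (6, 8)]   # two toy views of another sample
```

Note that Con mixes the views before normalization, so a high-dimensional view can dominate the score, whereas MVC-i normalizes each view separately; for K views its score lies in [−K, K].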
For the parameter settings of our MVCSL and the baseline methods, we empirically set the thresholds τ_p and τ_n to 0.8 and 0.1, respectively, and the learning rate µ to 0.01 for all experiments.

Fine-Grained Face Verification
Fine-grained face verification is a special case of face verification in which each negative sample pair consists of very similar face images, such as images of twins or similar-looking faces of different subjects; it is therefore more difficult than general face verification in real-world scenarios.

Dataset and Settings
We evaluate our proposed MVCSL and the baseline approaches on the well-aligned version of the fine-grained face verification (FGFV) dataset [27], in which facial images were aligned and cropped to 64 × 64 pixels. We then convert all images to gray-scale and extract three hand-crafted features for each face image: • LBP [12]: we partition an image into 8 × 8 segments and obtain a 59-dimensional LBP histogram for each segment; concatenating them yields a 3776-dimensional feature representation. • HOG [13]: we split an image into non-overlapping blocks of two different sizes, 4 × 4 and 8 × 8 pixels, and compute a nine-dimensional HOG feature on each block, which yields a feature representation of 2880 dimensions for each image. • SIFT [11]: each facial image is segmented into 49 blocks to extract a feature representation of 6272 dimensions.
Lastly, each feature representation is reduced to 200 dimensions by the principal component analysis (PCA) method. We employ a 5-fold cross-validation strategy to evaluate our method on the FGFV dataset under the image restricted setting, which only exploits the pairwise labels of positive and negative pairs.
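As an illustration of the dimensionality-reduction step, the following numpy-only sketch projects stand-in descriptors of the three views down to 200 dimensions with PCA via the SVD. The random matrices merely mimic the LBP/HOG/SIFT dimensions quoted above; real features would come from a descriptor library.

```python
import numpy as np

def pca_reduce(X, n_components=200):
    """Project row-vector features onto the top principal components (numpy-only sketch)."""
    Xc = X - X.mean(axis=0)
    # SVD of the centered data; rows of Vt are the principal axes, sorted by variance
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

rng = np.random.default_rng(3)
# random stand-ins for the LBP / HOG / SIFT descriptors of 500 face images
views = {"LBP": rng.standard_normal((500, 3776)),
         "HOG": rng.standard_normal((500, 2880)),
         "SIFT": rng.standard_normal((500, 6272))}
reduced = {name: pca_reduce(F, 200) for name, F in views.items()}
```

Each view is reduced independently, so every sample ends up with three 200-dimensional representations, matching the multi-view input expected by MVCSL.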
The fine-grained LFW (FGLFW) [28] dataset includes 10 folds of face image pairs, and every fold is composed of 300 positive and 300 negative pairs of face images. The positive pairs are the same as in LFW [5], but the negative pairs are similar face images that were manually selected from the LFW dataset. Figure 2 presents several negative pairs of face images from the LFW, FGFV and FGLFW datasets, where the negative pairs of the FGFV and FGLFW datasets are easily mistaken for positive pairs.

Experimental Results
This section compares the proposed MVCSL method and the baseline methods with several traditional metric learning methods, namely ITML [18], side-information-based linear discriminant analysis (SILD) [29], KISSME [10], similarity metric learning over the intra-personal subspace (Sub-SML) [30] and CSML [19]. Tables 1 and 2 show the mean accuracies and standard errors of the various methods under the restricted setting on the FGFV and FGLFW datasets, respectively. The ITML, SILD and KISSME methods are formulated under the Mahalanobis distance framework, while CSML and MVCSL are designed under the cosine similarity framework. Compared with the Mahalanobis distance-based methods, the cosine similarity-based methods achieve better performance, and our proposed MVCSL obtains the best performance on both the FGFV and FGLFW datasets. The reason is that MVCSL can collaboratively learn multiple similarity measures from multiple feature representations so that they supplement each other with their difference information. In addition, Figures 3 and 4 plot the receiver operating characteristic (ROC) curves of the various approaches on the FGFV and FGLFW datasets, respectively; these experiments further show the promising performance of the proposed MVCSL method.

Kinship Verification
Kinship verification aims to predict whether a given pair of face images has a kin relationship or not; it is a challenging subtask of face verification and has attracted much attention in pattern recognition and computer vision.

Dataset and Settings
In this subsection, we evaluate the proposed MVCSL approach on the KinFaceW-I [7] and KinFaceW-II [7] datasets for the kinship verification task. The samples of both datasets were collected under unconstrained conditions with obvious variations in lighting, age, expression and pose. They cover four kin relationships, i.e., father-son (F-S), father-daughter (F-D), mother-son (M-S) and mother-daughter (M-D).
Following the experimental settings provided with the datasets, each positive pair consists of two face images with a kin relationship, and each negative pair is randomly selected from two unrelated face images without kinship. To reduce the background information of the sample images, we use the aligned KinFaceW-I and KinFaceW-II datasets, where each image was scaled to 64 × 64 pixels. For feature representation, we adopt the same setting as for the FGFV dataset and extract LBP, HOG and SIFT features for each sample, and each feature representation is reduced to 200 dimensions by PCA. According to the benchmark protocol of the KinFaceW-I and KinFaceW-II datasets, we adopt the positive and negative samples under the image restricted setting. In the experiments, we use a 5-fold cross-validation strategy that divides the positive and negative sample pairs into five groups: four groups for training and one group for testing.
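The 5-fold partition of pair indices described above can be sketched as follows. This is a hypothetical helper for illustration; in practice, the benchmark protocol's own predefined folds should be used.

```python
import numpy as np

def five_fold_splits(n_pairs, seed=0):
    """Shuffle pair indices into 5 folds; rotate each fold through the test role."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_pairs)
    folds = np.array_split(idx, 5)
    for k in range(5):
        test = folds[k]
        train = np.concatenate([folds[m] for m in range(5) if m != k])
        yield train, test

# e.g., 400 sample pairs for one kin relation: 4 groups train, 1 group test, 5 rotations
splits = list(five_fold_splits(400))
```

Positive and negative pairs would each be split this way so that every pair is tested exactly once across the five rotations.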

Experimental Results
We compare the MVCSL method with the three baseline approaches on the KinFaceW-I and KinFaceW-II datasets. Tables 3 and 4 list the mean verification accuracies (%) on the two datasets. The mean verification accuracies of the MVC-s method with the LBP, HOG and SIFT descriptors are 73.15%, 77.07% and 75.91% on the KinFaceW-I dataset, and 75.50%, 79.50% and 80.05% on the KinFaceW-II dataset, respectively. We also notice that the MVC-i and Con methods can learn the discriminative information of multiple feature representations, and our MVCSL further mines the potential information among the various features. In Tables 3 and 4, we also compare the proposed MVCSL with several representative kinship verification methods, including block-based neighborhood repulsed metric learning (BNRML) [31], geometric mean metric learning (GMML) [32], multi-view geometric mean metric learning (MVGMML) [33], the discriminative compact binary face descriptor (D-CBFD) [34], local large-margin multi-metric learning (L2M3L) [25] and weakly supervised compositional metric learning (WSCML) [35]. We can see from the two tables that our MVCSL method obtains competitive performance on both the KinFaceW-I and KinFaceW-II datasets. Moreover, Figures 5 and 6 plot the ROC curves of the proposed MVCSL and the baseline approaches on the two kinship datasets, respectively. The experimental results on the benchmark kinship verification datasets show that our MVCSL approach is able to efficiently exploit the common information of different feature representations and the private information of each feature representation to improve the performance of the kinship verification task.

Conclusions
This paper proposes a multi-view cosine similarity learning (MVCSL) approach to make use of multi-view feature representations of data and applies it to fine-grained face verification and facial kinship verification tasks. The proposed MVCSL method enables the multiple views to complement each other with their difference information by jointly learning a cosine similarity for each view in a unified framework. Moreover, in order to mine non-trivial samples, we set margins to ensure that the joint cosine similarity of positive pairs is greater than a large value and that of negative pairs is less than a small value. Experimental results on the fine-grained face verification and facial kinship verification tasks demonstrate the advantages of our MVCSL method for exploiting multi-view data.
The main novelty and contribution of our work is to advance multi-view metric learning within the cosine similarity learning framework to better exploit multi-view data, which differs from the existing multi-view metric learning methods that are mainly formulated in the framework of Mahalanobis distance metric learning. A shortcoming of the proposed MVCSL method is that a gradient-descent-based method is used to find the linear transformation matrices, so we may not obtain the globally optimal solution. In the future, we hope to further improve MVCSL and derive a closed-form solution.
In future work, we will apply our approach to other applications such as visual recognition, classification and clustering in pattern recognition.