SADG: Self-Aligned Dual NIR-VIS Generation for Heterogeneous Face Recognition

Abstract: Heterogeneous face recognition (HFR) has aroused significant interest in recent years, but it still faces challenges such as the misalignment problem and limited HFR data. Misalignment occurs among different modalities' images mainly because of misaligned semantics. Although recent methods have attempted to settle the low-shot problem, they suffer from the misalignment problem between paired near infrared (NIR) and visible (VIS) images. Misalignment degrades the performance of most image-to-image translation networks. In this work, we propose a self-aligned dual generation (SADG) architecture for generating semantically aligned pairwise NIR-VIS images with the same identity, without the additional guidance of external information learning. Specifically, we propose a self-aligned generator to align the data distributions between the two modalities. Then, we present a multiscale patch discriminator to obtain high-quality images. Furthermore, we introduce the mean landmark distance (MLD) to measure the alignment between NIR and VIS images with the same identity. Extensive experiments and an ablation study of SADG on three public datasets show significant alignment performance and recognition results. Specifically, Rank1 accuracies close to 99.9% were achieved on the CASIA NIR-VIS 2.0, Oulu-CASIA NIR-VIS and BUAA VIS-NIR datasets.


Introduction
Face recognition, as a biometric identification method, has attracted a large number of researchers over the past several decades owing to its widespread applications and contactless acquisition [1]. Nevertheless, face recognition primarily concentrates on the visible spectrum, which makes recognizing a subject under low-light or no-light conditions more difficult. Consequently, heterogeneous face recognition (HFR) research has emerged in recent years. Heterogeneous face images acquired under different spectra differ in their modalities, such as sketch images [2], near infrared (NIR) images [3] and polarimetric thermal images [4]. HFR with NIR images is an important computer vision task [5].
The literature over the years on HFR can mainly be divided into three branches: feature representation-based methods, common subspace projection-based methods and synthesis-based methods. Feature representation-based methods reduce the domain gap between heterogeneous images through handcrafted features such as linear discriminant analysis (LDA) and principal component analysis (PCA) [6]. Common subspace projection-based methods project different modality images into a common subspace [5]. However, there is a lack of a large-scale near infrared-visible (NIR-VIS) face database relative to visible face datasets, which boosts the research of synthesis-based methods [7]. They usually translate face images from one modality to the other so that matching can be performed within a single modality.
To address the misalignment problem, we propose a self-aligned generation architecture to align the data distributions between the two modalities semantically. Specifically, we use two encoders and two decoders to generate paired images in different domains. A training method with a shared latent code and a self-aligned block is introduced to train our network. The shared latent code plays the primary role in alignment, while the self-aligned block subsidiarily redresses the remaining unaligned attributes. These strategies guarantee the alignment of images between the two domains. After the training stage, we can not only use our model to generate abundant NIR-VIS images from the same noise sampled from a standard Gaussian distribution, but also redress the misaligned raw datasets with our well-trained model by reconstructing them. Furthermore, we present a multiscale patch discriminator to ensure the high quality of the generated aligned NIR-VIS images. For evaluation, a new metric, the mean landmark distance (MLD), is introduced to measure the alignment of the generated NIR and VIS images with the same identity. We train our model on the CASIA NIR-VIS 2.0 dataset [3] together with our numerous generated NIR-VIS images, and then evaluate the model on the Oulu-CASIA NIR-VIS [11] and BUAA VIS-NIR [12] datasets. The code of the self-aligned dual generation (SADG) method will be available at https://github.com/Renrenren6666/SADG.
In summary, our contributions are as follows:
1. We analyze the mechanism of the semantic misalignment problem between two modalities' images;
2. A self-aligned dual generation (SADG) architecture is proposed to align NIR and VIS images, including a self-aligned block and a multiscale patch discriminator;
3. Extensive experiments and an ablation study conducted on three popular datasets show the state-of-the-art alignment and recognition performance of our method.

Related Works
For HFR, the main focus is to reduce the domain gap. Generally, the literature over the years can be categorized into three groups: feature representation-based methods, common subspace projection-based methods and synthesis-based methods.
Feature representation-based methods mainly explore illumination invariance through hand-crafted features to reduce the modality gap. The earliest work on NIR-to-VIS face recognition was proposed by Yi et al. [6], who utilized PCA, LDA and canonical correlation analysis (CCA) in three steps to gain better performance. Sarfraz and Stiefelhagen [13] treated the NIR-VIS problem as a task with high illumination variation. They solved this problem by designing an effective illumination-invariant descriptor: the logarithmic gradient histogram (LGH).
The LGH was superior to the local binary pattern (LBP) and scale-invariant feature transform (SIFT) descriptors used in [14,15], as it was a purely function-based approach requiring no training data. These methods can extract features from images of different modalities because they remove the domain information. However, they could not reach satisfactory recognition performance in most cases, achieving a Rank1 accuracy of only 70-80%.
Common subspace projection-based methods try to map heterogeneous face images into a latent common subspace, where images from different domains can be matched directly. Lin and Tang [16] first proposed the common discriminant feature extraction (CDFE) approach, which could extract the recognition information and location information simultaneously. Kan et al. [17] proposed the multi-view discriminant analysis (MVDA) method to study both the inter-view and intra-view correlations of heterogeneous face images. Huo et al. [18] presented a form of margin-based cross-modality metric learning to reduce the gaps between different modalities. To better extract discriminative information, a regularized discriminative spectral regression method was developed in [19] to find a common spectral space, using the locality information in kernel space for discrimination. An extreme learning machine (ELM) combined with multi-task clustering was used in [20] for cross-modal feature learning. Similar to feature representation-based methods, these methods cannot achieve high performance for HFR, reaching a Rank1 accuracy of only about 90%.
Synthesis-based methods have aroused great interest among researchers with the development of deep learning [21] and generation networks such as generative adversarial networks (GANs) [22] and variational autoencoders (VAEs) [23]. They usually translate NIR face images to identity-preserved VIS ones, and then use the generated VIS images for matching. Riggan et al. [24] trained a regression network to estimate the projection between the features of the visible and thermal modalities, and then reconstructed the visible face image from the estimated features. Zhang et al. [25,26] leveraged GANs to synthesize visible images from polarimetric thermal images. Recently, a novel network that generates large-scale paired NIR-VIS images from noise was proposed [8], bringing new insight for HFR. Duan et al. [27] proposed a pose-aligned cross-spectral hallucination (PACH) algorithm, which deals with the facial shape and the spectrum information in two individual stages. Yu et al. [9] proposed a pose-preserving cross-spectral face hallucination (PCFH) model to synthesize a VIS image with the same identity as the given NIR image while preserving poses and expressions. However, these methods remain within the image-to-image translation framework.
As for the metrics in HFR, the recent literature can mainly be divided into two types: common subspace-based measurement functions and bilinearity-based similarity measurement functions. Common subspace-based measurement functions compare the features of two domains in the same space, obtained by a common subspace projection. Wu et al. [28] proposed a measurement function to evaluate the distance between two domains' features, mapping the code features and text features into the same semantic space. Siena, Boddeti and Kumar [29] proposed a method for HFR named maximizing margin coupled mappings (MMCM), which narrowed the gap between samples of the same subject and increased the distance between samples of different subjects in a pair. Similarity measurement functions evaluate the domain gap by the similarity of features from different modalities. Zhen et al. [30] put forward a probability learning framework to distinguish similar images from different domains. However, these methods focus on the domain gap at the feature level; they cannot measure the alignment of paired heterogeneous images with the same identity at the image level.

Method
As mentioned above, a novel dual variational generation (DVG) framework was proposed in [8] to reduce the domain gap, generating massive paired NIR-VIS images with the same identity per pair from noise. However, it suffers from the misalignment problem. We utilize the backbone of DVG as a baseline model and improve it for alignment purposes. In this part, we will briefly introduce the baseline model first, and then analyze the mechanism by which DVG gives rise to the misalignment. Finally, we will dwell on the details of our method and discuss our improved architecture.

Baseline
As proposed in [8], the baseline network mainly consists of a dual variational autoencoder. In particular, the autoencoder includes two encoder networks E_N and E_V, as well as a decoder network D. Given the pairwise NIR-VIS images {x_N, x_V}, the encoder E_N maps the NIR images x_N to the latent space z_N with the reparameterization trick: z_N = μ_N + σ_N ⊙ τ, where μ_N and σ_N are defined as the mean and standard deviation of the feature maps of the NIR images in a mini-batch, respectively, τ is sampled from a multivariate standard Gaussian and ⊙ denotes the Hadamard product. E_V conducts the same operation correspondingly and finally obtains the latent code z_V of the VIS images. Note that this encoding process can also be found in our architecture, as shown in Figure 2. Finally, the latent codes z_N and z_V are concatenated to get a joint code z_I, which is utilized to reconstruct the input images by the single decoder network of the baseline.
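To make the encoding step concrete, the following is a minimal PyTorch sketch of the dual encoding with the reparameterization trick; the layer sizes, latent dimension and the way μ and σ are predicted are illustrative placeholders, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Toy encoder predicting the mean and log-variance of q(z|x)."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.fc_mu = nn.Linear(64, latent_dim)      # mean of q(z|x)
        self.fc_logvar = nn.Linear(64, latent_dim)  # log-variance of q(z|x)

    def forward(self, x):
        h = self.features(x)
        return self.fc_mu(h), self.fc_logvar(h)

def reparameterize(mu, logvar):
    """z = mu + sigma ⊙ tau, with tau ~ N(0, I)."""
    tau = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * tau

E_N, E_V = Encoder(), Encoder()
x_N = torch.randn(4, 3, 128, 128)  # dummy NIR batch
x_V = torch.randn(4, 3, 128, 128)  # dummy VIS batch
mu_N, logvar_N = E_N(x_N)
mu_V, logvar_V = E_V(x_V)
z_N = reparameterize(mu_N, logvar_N)
z_V = reparameterize(mu_V, logvar_V)
z_I = torch.cat([z_N, z_V], dim=1)  # joint code fed to the baseline decoder
```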
Figure 2. The training process of our proposed self-aligned generation architecture. Given the paired data from the training dataset, we first reverse them by E_N, D_N, E_V and D_V, respectively (the left part). Then, we use a multiscale patch discriminator (the right part) for adversarial learning with the generator. The solid-lined rectangle stands for the residual block, and the dotted-lined rectangle stands for the latent code z.

As with the VAE in the original work [23], DVG constrains the learning of the posterior distributions p_φN(z_N|x_N) and p_φV(z_V|x_V) by a KL divergence:
L_kl = KL(p_φN(z_N|x_N) || p(z_N)) + KL(p_φV(z_V|x_V) || p(z_V))  (1)
where the prior distributions p(z_N) and p(z_V) are both multivariate standard Gaussian distributions. Then, a reconstruction loss is used to force the decoder network to reconstruct the input images {x_N, x_V} from the learned distribution; this loss is modified in our method for alignment, and we will discuss it in Section 3.2. Besides L_kl, a simplified Wasserstein distance [8] between the two distributions is also utilized for posterior distribution learning in the latent space, where i stands for the identity:
L_dist = (1/2) Σ_i ( ||μ_N^i − μ_V^i||₂² + ||σ_N^i − σ_V^i||₂² )  (2)
For the image space, DVG uses a pre-trained LightCNN-v2 [31] model as the identity feature extractor F_id for calculating the identification loss. Because DVG produces a pair of NIR-VIS images at a time, the identity information can be retained by constraining the feature distance between the reconstructed paired images. The same operation is conducted between each reconstructed image and its original image, and the loss functions are as follows:
L_id-pair = ||F(x̂_N) − F(x̂_V)||₂²  (3)
L_id-recpair = ||F(x̂_N) − F(x_N)||₂² + ||F(x̂_V) − F(x_V)||₂²  (4)
where F(·) denotes the normalized output of the last fully connected layer in F_id, and x̂_N and x̂_V are the reconstructed images. For the purpose of increasing the diversity of the generated images, a diversity loss [32], denoted L_div, is adopted as in the original model [8].
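The baseline losses above can be written compactly as follows. This is a hedged PyTorch sketch against the mu/logvar tensors from the previous snippet; the feature arguments of the identity losses stand in for the outputs of the pre-trained LightCNN extractor F_id.

```python
import torch
import torch.nn.functional as F

def kl_loss(mu, logvar):
    # Closed-form KL(q(z|x) || N(0, I)), averaged over the batch and dims.
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

def dist_loss(mu_N, logvar_N, mu_V, logvar_V):
    # Simplified Wasserstein-style distance of Eq. (2): match the means and
    # standard deviations of the two posteriors.
    sigma_N = torch.exp(0.5 * logvar_N)
    sigma_V = torch.exp(0.5 * logvar_V)
    return 0.5 * ((mu_N - mu_V).pow(2) + (sigma_N - sigma_V).pow(2)).mean()

def id_pair_loss(feat_rec_N, feat_rec_V):
    # Eq. (3): feature distance between the reconstructed NIR/VIS pair.
    return F.mse_loss(feat_rec_N, feat_rec_V)

def id_recpair_loss(feat_rec, feat_raw):
    # Eq. (4), per modality: reconstruction vs. its source image.
    return F.mse_loss(feat_rec, feat_raw)
```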
After training the network, DVG can generate paired NIR-VIS images with the same identity by the same noise sampled from Gaussian space, which can boost the limited dataset.

Architecture
As described above, DVG learns the data distribution from the training dataset, which suffers from the semantic misalignment between the two modalities, as shown in Figure 1. Therefore, the misalignment first arises in the latent codes z_N and z_V during the encoding process. Note that L_dist in Equation (2) decreases the Wasserstein distance between the two distributions representing the different modalities, which can only maintain the identity information so that the two codes share the same semantic direction for the same identity. The misaligned latent codes z_N and z_V are then concatenated into z_I as the input of the decoder. As such, it is challenging for DVG to generate semantically aligned NIR-VIS images from the same noise.
Recently, precise semantic face editing by manipulating the latent code z was explored, in which one specific face attribute can be edited by a linear operation that changes the semantic direction: z′ = z + z₀ [33]. Here, z₀ is the direction of a certain face attribute, and the generator can then create attribute-changed images from z′. This indicates that the semantic information in a code can be redressed with a specific operation. However, we cannot align the paired NIR-VIS images with one such operation in our scenario, because the original paired images are misaligned in diverse attributes, such as poses with different angles, different expressions, or wearing or not wearing glasses. That is to say, we cannot align the semantic distributions of the two modalities' images in the latent distributions learned by E_N and E_V, since different identities have different misaligned face attributes, and no single simple operation can align all the codes from different paired heterogeneous face images. Therefore, we modify the baseline and propose a new architecture for generating aligned pairwise NIR-VIS images with the same identity. To be specific, we improve the generation process and the adversarial learning part in our architecture, which includes a self-aligned generator and a multiscale patch discriminator, as shown in Figure 2.

Self-Aligned Generator
Inspired by [7,33,34], we propose a self-aligned generator, including a self-aligned block, for producing ID-consistent aligned NIR-VIS images. As shown in the left part of Figure 2, we utilized two decoders, D_N and D_V, with the same structure in our generator, which is different from the baseline. Note that the architecture of our decoders is the same as that of the baseline.
As for the training stage, given pairwise NIR and VIS images {x_N, x_V}, we also encoded them to latent codes z_N and z_V as in the baseline. L_kl and L_dist in Equations (1) and (2) were used for learning the data distribution. Considering the semantic deviation between z_N and z_V, we sent the same code from one domain to both D_N and D_V at the same time for reconstructing the input paired images x_N and x_V, which could align the semantic information at the code level. We further aligned one domain's semantic information to the other's at the feature level with a self-aligned block. Note that z in Figure 2 is one code from z_N or z_V; we used z_V for exhibition in the experimental part.
Figure 3 shows our self-aligned block, mainly made up of a self-attention module [35]. First of all, we fetched the feature maps from the same layer in the decoders D_α and D_β. Secondly, we fed the features from one domain into the self-attention module. Then, the feature maps were transformed into three parts, a, b and c, with two 1 × 1 convolutional layers, which halved the channels of features a and b. Next, parts a and b performed element-wise multiplication, followed by a 1 × 1 convolutional layer and a softmax operation to get the attention maps m. Finally, the attention maps m were applied to part c to obtain the attended feature maps. We used the Euclidean distance to reduce the semantic gap between the feature maps of the two domains:
L_a = Σ_{i=n−2}^{n} ||Y_α^i − Y_β^i||₂²  (5)
where n stands for the number of residual blocks in the two decoders and Y^i denotes the feature maps of the ith residual block. Note that we used the self-aligned block on the final convolution layer of the last three residual blocks. With the help of the self-aligned block, we could extract the most important semantic information in one domain and align the other domain's semantic information to it. We used it on multiscale feature maps in several layers of the two decoders for better alignment performance. Regarding the reconstructed images, we employed a reconstruction loss [36] with a low weight for domain α and a normal weight for domain β. Specifically, we needed the decoders log p_θ(x_α|z_α) and log q_θ(x_β|z_β) to reconstruct the input images {x_N, x_V} to different extents:
L_rec = −ρ E[log p_θ(x_α|z_α)] − E[log q_θ(x_β|z_β)]  (6)
where ρ is a very low weight for the images x_α from domain α, and x_β is the image to be aligned, whose domain is β. Note that the self-aligned block and the code z should be adjusted according to the modality to be aligned. For example, if we wanted to align the semantic information of the NIR modality to the VIS one, we should first send the same VIS code z_V to D_N and D_V. Then, we extracted the significant semantic information of the VIS modality with our self-aligned block and used L_a to constrain the semantic gap between the VIS and NIR modalities. Finally, we used the reconstruction loss L_rec to force the NIR images to maintain their domain information at the image level and force the VIS images to be rebuilt faithfully. In this way, we could generate aligned NIR-VIS images, where the created NIR images were aligned to the VIS images semantically. Note that the codes from both domains were used equally in the training stage.
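A minimal sketch of the self-aligned block under our reading of this description is given below: two 1 × 1 convolutions produce the halved-channel branches a and b, their element-wise product passes through another 1 × 1 convolution and a softmax to give the attention maps m, which then re-weight part c (taken here to be the input features). The exact placement of the softmax and the choice of c are assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class SelfAlignedBlock(nn.Module):
    """Sketch of the self-attention module inside the self-aligned block."""
    def __init__(self, channels):
        super().__init__()
        self.conv_a = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.conv_b = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.conv_m = nn.Conv2d(channels // 2, channels, kernel_size=1)

    def forward(self, feat):
        a = self.conv_a(feat)      # halved-channel branch a
        b = self.conv_b(feat)      # halved-channel branch b
        m = self.conv_m(a * b)     # element-wise product, then 1x1 conv
        bsz, ch, h, w = m.shape
        # softmax over spatial positions gives the attention maps m
        m = torch.softmax(m.view(bsz, ch, -1), dim=-1).view(bsz, ch, h, w)
        return m * feat            # attention re-weights part c (= input)

def alignment_loss(feats_alpha, feats_beta):
    # L_a of Eq. (5): Euclidean distance between corresponding decoder
    # feature maps, summed over the chosen (e.g., last three) blocks.
    return sum(torch.dist(fa, fb, p=2)
               for fa, fb in zip(feats_alpha, feats_beta))

block = SelfAlignedBlock(channels=256)
out = block(torch.randn(2, 256, 16, 16))  # same shape as the input features
```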
We adopted L_id-pair and L_id-recpair in Equations (3) and (4) to restrict the identity consistency, as [8] did. To be specific, L_id-pair restricts the identity consistency between the generated paired images of different modalities, and L_id-recpair makes the reconstructed data keep the same identity as the raw data. After the reconstruction procedure, we sent the original data {x_N, x_V} and the reconstructed data {x̂_N, x̂_V} to our multiscale patch discriminator for adversarial learning, which will be demonstrated in detail in the next section.

Multiscale Patch Discriminator
We propose a multiscale patch discriminator to promote the quality of generated pairwise NIR-VIS images. Similar to [37], a patch discriminator was used as our backbone to distinguish the authenticity of the input images, which consisted of several convolutional blocks followed by batch normalization and Leaky ReLU operations. We inserted these convolutional blocks before the final fully connected layer to get the multiscale feature maps, which can help to improve the capability of the discriminator.
As shown in Figure 4, in the training stage, we first matched the raw data and the reconstructed data of the same corresponding modality and sent them into the feature extractor part of the network to get the feature maps. Different from [37], we then used four different convolutional blocks to get final feature maps of different sizes, followed by a sigmoid function to get confidence scores normalized between 0 and 1. The adversarial loss is defined as follows:

L_adv = Σ_{j=1}^{p} Σ_{d∈{N,V}} ( E_{x_d∈X_d}[log D_j(x_d)] + E_{x̂_d∈X̂_d}[log(1 − D_j(x̂_d))] )  (8)
where p is the number of scales of the multiscale patch discriminator and X̂, X, x̂, x, N and V indicate the reconstructed image set, the raw image set, one reconstructed image, one raw image, the NIR domain and the VIS domain, respectively. We designed the spatial sizes of the final feature maps to be 1/2, 1/4, 1/8 and 1/16 of those of the input images, respectively.
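A hedged PyTorch sketch of such a discriminator is shown below: a stack of strided convolutional blocks provides the shared features, and four 1 × 1 scoring heads produce sigmoid confidence maps at 1/2, 1/4, 1/8 and 1/16 of the input resolution. The channel widths are illustrative.

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out, stride):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 4, stride=stride, padding=1),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.2),
    )

class MultiscalePatchDiscriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.trunk = conv_block(3, 64, stride=2)     # 128 -> 64  (1/2 scale)
        self.down1 = conv_block(64, 128, stride=2)   # 64  -> 32  (1/4 scale)
        self.down2 = conv_block(128, 256, stride=2)  # 32  -> 16  (1/8 scale)
        self.down3 = conv_block(256, 512, stride=2)  # 16  -> 8   (1/16 scale)
        # one 1x1 scoring head per scale; sigmoid squashes scores to [0, 1]
        self.heads = nn.ModuleList(
            [nn.Conv2d(c, 1, kernel_size=1) for c in (64, 128, 256, 512)])

    def forward(self, x):
        feats = [self.trunk(x)]
        feats.append(self.down1(feats[-1]))
        feats.append(self.down2(feats[-1]))
        feats.append(self.down3(feats[-1]))
        return [torch.sigmoid(h(f)) for h, f in zip(self.heads, feats)]

D = MultiscalePatchDiscriminator()
scores = D(torch.randn(2, 3, 128, 128))
print([s.shape for s in scores])  # patch score maps at the four scales
```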
Hence, the total loss for training our generation model can be formulated as
L = L_kl + α₁ L_dist + α₂ L_id-pair + α₃ L_id-recpair + α₄ L_div + α₅ L_rec + L_a + L_aN + L_aV  (9)
where α₁, α₂, α₃, α₄ and α₅ are the trade-off parameters. After training our model, we could use the two decoders, D_N and D_V, to produce numerous pairwise aligned NIR-VIS images, with the same noise sampled from Gaussian space for each pair, while keeping the identity consistency between them. The details of the generation process are presented in Figure 5.
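Generation at inference time then amounts to sampling one Gaussian noise vector and decoding it twice, as in the sketch below; the decoders here are trivial placeholders standing in for the trained D_N and D_V.

```python
import torch
import torch.nn as nn

latent_dim = 128

def make_decoder():
    # Placeholder decoder: latent vector -> 3 x 128 x 128 image.
    return nn.Sequential(
        nn.Linear(latent_dim, 64 * 8 * 8), nn.Unflatten(1, (64, 8, 8)),
        nn.Upsample(scale_factor=16),
        nn.Conv2d(64, 3, 3, padding=1), nn.Tanh())

D_N, D_V = make_decoder(), make_decoder()

z = torch.randn(1, latent_dim)   # one shared Gaussian noise vector per pair
with torch.no_grad():
    fake_nir = D_N(z)            # the same code goes to both decoders,
    fake_vis = D_V(z)            # yielding a semantically aligned NIR-VIS pair
```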

NIR-VIS Face Recognition
We followed the baseline [8] to adopt the LightCNN method F for NIR-VIS face recognition. We trained F with both the limited, original labeled data and the plentiful, unlabeled aligned NIR-VIS images, which were generated by our well-trained generator. Quantitative analysis will be conducted for comparison in Section 4.
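How the two data sources are combined is not spelled out here; one plausible reading, following the baseline's recipe, is softmax cross-entropy on the labeled raw images plus a feature-consistency term on the unlabeled generated pairs, as sketched below. The (logits, features) model interface and the weight lam are our assumptions, not the paper's specification.

```python
import torch.nn.functional as F

def recognition_loss(model, x_raw, y_raw, gen_nir, gen_vis, lam=0.001):
    # Supervised term on the labeled raw NIR/VIS images.
    logits, _ = model(x_raw)
    ce = F.cross_entropy(logits, y_raw)
    # Unsupervised pairing term: a generated NIR-VIS pair shares one
    # identity by construction, so their features should coincide.
    _, feat_n = model(gen_nir)
    _, feat_v = model(gen_vis)
    pair = F.mse_loss(feat_n, feat_v)
    return ce + lam * pair
```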

Mean Landmark Distance
We proposed the mean landmark distance (MLD) to quantitatively evaluate the alignment of two NIR-VIS images with the same identity. To be specific, we employed two facial landmark localization models, Landmark-5 [38] and Landmark-68 [39], for facial keypoint localization, both of which are widely used in face detection and face recognition tasks. Landmark-5 detects the coordinates of 5 keypoints, which locate the mouth, eyes and nose, while Landmark-68 detects the coordinates of 68 keypoints, which locate the mouth, eyes, nose and face contour. For each paired set of NIR-VIS images, we first computed the mean coordinate deviation over all keypoints of the face between the two domains. Then, we repeated this operation over 100,000 pairwise generated fake images and took the average value as our MLD. MLD5 was computed with Landmark-5, which reflects the alignment of the facial organs, and MLD68 was computed with Landmark-68, which reflects not only the alignment of the facial organs but also the alignment of the face contour:
MLD = (1/n) Σ_{i=1}^{n} (1/(2K)) Σ_{k=1}^{K} ( |l_{x_N}^{(k)} − l_{x_V}^{(k)}| + |l_{y_N}^{(k)} − l_{y_V}^{(k)}| )  (10)
where n stands for the number of generated paired images, K is the number of keypoints, l_{x_N}^{(k)} is denoted as the kth keypoint's x-coordinate of the NIR image and l_{x_V}^{(k)} is denoted as the kth keypoint's x-coordinate of the VIS image; the same holds for the y-coordinates l_{y_N}^{(k)} and l_{y_V}^{(k)}.
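A small NumPy sketch of this computation is given below, assuming each landmark detector returns a (K, 2) array of (x, y) keypoints per face; the detectors themselves (Landmark-5 and Landmark-68) are external models, and averaging |dx| and |dy| per keypoint is our reading of the definition.

```python
import numpy as np

def mean_landmark_distance(landmarks_nir, landmarks_vis):
    """landmarks_*: arrays of shape (n_pairs, K, 2) with (x, y) keypoints."""
    lm_n = np.asarray(landmarks_nir, dtype=float)
    lm_v = np.asarray(landmarks_vis, dtype=float)
    # per-keypoint coordinate deviation: mean of |dx| and |dy| ...
    per_keypoint = np.abs(lm_n - lm_v).sum(axis=2) / 2.0   # shape (n, K)
    # ... averaged over keypoints and over all generated pairs
    return per_keypoint.mean()

# usage: MLD5-style call with dummy 5-point landmarks on 128 x 128 faces
lm_nir = np.random.rand(1000, 5, 2) * 128
lm_vis = np.random.rand(1000, 5, 2) * 128
print(mean_landmark_distance(lm_nir, lm_vis))
```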

Experiment
In this section, our proposed self-alignment method is evaluated on three popular datasets: CASIA NIR-VIS 2.0 [3], Oulu-CASIA NIR-VIS [11] and BUAA VIS-NIR [12]. First of all, we introduce these three datasets with their training and testing protocols. Secondly, the implementation details are described. Then, the qualitative alignment results and quantitative experimental results are given. Finally, an ablation study is conducted to demonstrate the effect of our proposed methods.

Datasets and Protocols
The CASIA NIR-VIS 2.0 dataset [3] has the largest number of NIR-VIS images (from 725 subjects), each of which includes 5-50 NIR images and 1-22 VIS images. The images have the same resolution of 640 × 480 but vary in properties such as expressions, poses, lighting conditions and whether glasses are worn or not. The Oulu-CASIA NIR-VIS dataset [11] consists of 80 subjects with 6 expressions, covering anger, happiness, sadness, surprise, disgust and fear, as well as three illuminations: darkness, normal indoor lighting and weak light. The BUAA VIS-NIR dataset [12] contains 150 subjects, each of which includes 9 NIR images and 14 VIS images varying in illumination, with diverse poses and expressions. We followed the protocols of [3] to split the CASIA NIR-VIS 2.0 dataset into training and testing sets, which contained a total of 10-fold experimental settings. We chose the Rank1 accuracy, the verification rate (VR) at a false accept rate (FAR) of 1%, and VR@FAR = 0.1% for quantitative comparisons on all three datasets. We used our proposed mean landmark distance (MLD) to test the alignment between the generated NIR and VIS images with the same identity.
Following [9], the training sets of the Oulu-CASIA NIR-VIS [11] and the BUAA VIS-NIR [12] datasets were not used. We directly employed our model, trained on the tenth fold of the CASIA NIR-VIS 2.0 dataset for generation, and evaluated it on the testing sets in all three databases as in [40].

Implementation Details
Our proposed network was trained on the CASIA NIR-VIS 2.0 dataset with two NVIDIA Titan XP GPUs. All images in the dataset were aligned and cropped to 128 × 128 resolutions. For the self-aligned generation part, Adam was used as the optimizer, and the learning rate was fixed to 0.0002. α 1 , α 2 , α 3 , α 4 and α 5 in Equation (9) were set to 50, 5, 1000, 0.2 and 0.001, respectively. We chose a LightCNN [31] model as the ID-feature extractor F id , which was pre-trained on the MS-Celeb-1M database [41]. For the NIR-VIS face recognition part, LightCNN-v29 [31] was selected. We first randomly produced 100,000 paired NIR-VIS images by our well-trained generative model as expanded images, which would boost the limited raw data. Then, we used both the raw data and the generated images to train our recognition network. Stochastic gradient descent (SGD) was adopted as the optimizer, where the momentum was set to 0.9 and the weight decay was set to 5 × 10 −4 . The learning rate was set to 7 × 10 −4 initially and decayed 1/10 per 5 epochs. The batch size was set to 128, and the dropout ratio was 0.5.
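For reference, the recognition-network optimization settings above translate into PyTorch roughly as follows; recognition_net is a placeholder for the LightCNN-v29 model.

```python
import torch

recognition_net = torch.nn.Linear(256, 725)  # placeholder for LightCNN-v29

optimizer = torch.optim.SGD(recognition_net.parameters(),
                            lr=7e-4,            # initial learning rate
                            momentum=0.9,
                            weight_decay=5e-4)
# decay the learning rate to 1/10 of its value every 5 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)
```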

Alignment Analysis
Owing to our proposed self-aligned encoder-decoder architecture, we could not only generate abundant NIR-VIS images, but also align paired images in the raw dataset with our well-trained model by reconstructing them. Figures 6 and 7 show the reconstructed and generated results of the baseline (DVG) and of our method, respectively. Figure 6 shows the reconstructed results of DVG and SADG. We divided Figure 7 into two parts: part a shows the generated results of DVG and part b shows those of SADG.
It can be observed that DVG could reverse images in the dataset well, yet the reconstructed NIR-VIS images were semantically unaligned, just like the raw data. As shown in Figure 6, the pairwise reconstructed images had attributes different from each other, such as distinct poses, expressions and glasses. For example, we can see that the VIS image of the first person has no glasses, while the NIR image has glasses. With the faithful reversal process during the training stage, DVG learned two misaligned data distributions corresponding to the two modalities and thus naturally generated misaligned fake images, as presented in Figure 7a.
Different from DVG, our model could redress the misalignment in the raw dataset and generate paired, well-aligned NIR-VIS images. As shown in Figure 6, the glasses were removed in the reversed NIR image of the first person, and the misaligned poses were redressed in the reversed NIR images of the other person. We can clearly observe in Figure 7b that our generated fake images were well aligned. It is worth mentioning that even less-obvious glasses were expressed in both of the paired fake NIR-VIS images, as marked in the first pair of Figure 7b. We further utilized our proposed MLD5 and MLD68 metrics to evaluate the alignment performance quantitatively. Table 1 presents the MLD5 and MLD68 values computed on the CASIA NIR-VIS 2.0 dataset, the DVG-generated images and the images generated by our model. The MLD5 and MLD68 values of DVG (4.7 and 6.2 pixels per keypoint) were close to those of the raw data, which were 5.2 and 6.5, respectively. This verifies that DVG learned the misaligned data distributions of the two domains from the dataset. The slight improvement of DVG (0.5 in MLD5 and 0.3 in MLD68) may reflect the aligning operation that DVG performs in the latent space with L_dist. However, it is not enough for alignment in appearance. By contrast, our generated NIR-VIS images performed clearly better, with 1.4 in MLD5 and 3.1 in MLD68, which substantiates our better alignment capability compared with the baseline.

Quantitative Comparison
We compared our proposed self-aligned dual generation (SADG) method with state-of-the-art approaches on the three datasets, as recorded in Table 2. Note that DVG is our baseline method. On the CASIA NIR-VIS 2.0 dataset, our proposed self-aligned method slightly improved the VR@FAR = 0.1% of the baseline from 99.4% to 99.6% while maintaining the Rank1 accuracy and VR@FAR = 1%. Compared with the baseline, our method boosted VR@FAR = 0.1% and VR@FAR = 1% by 0.4% and 0.3%, respectively, on Oulu-CASIA. Moreover, our model reached 97.3% in VR@FAR = 0.1% on the BUAA dataset, higher than the baseline by 0.7%. However, the performance of our model slightly dropped by 0.1% in Rank1 accuracy on Oulu-CASIA and in VR@FAR = 1% on the BUAA dataset, compared with the strong baseline (DVG).
On the whole, our method gained impressive improvements over the VGG [40] and LightCNN [31] models, the PACH [27] and PCFH [9] models and the baseline, which demonstrates the better performance of our proposed method. Numerous better-aligned generated NIR-VIS images can tremendously benefit the performance of the heterogeneous face recognition network.

Ablation Study
In this section, we verify the effectiveness of the four components used in our proposed self-aligned generation architecture: the same z, the self-aligned block, the low-weight reconstruction loss and the multiscale patch discriminator. Concretely, we set up four contrast experiments, a, b, c and d. Experiment a used different z codes for training in our method, b used the same z code but without our self-aligned block, c used a normally weighted reconstruction loss when training the network and d trained the network without the multiscale patch discriminator.
As for experiment a, when using two different z codes for training, the generated NIR-VIS images showed slight misalignment. As exhibited in Figure 8a, the NIR image of the second person had an eyebrow misaligned with that of the VIS image. In addition, the third paired NIR-VIS images had slightly different poses. The MLD68 in experiment a increased to 4.7 (lower than the 6.2 of DVG but higher than the 3.1 of ours), and the Rank1 accuracy decreased to 97.2% (lower than our 99.9%). This clearly verifies that the latent code substantially affects the alignment performance, and that using the same latent code z brings better alignment and recognition performance.
As tabulated in Table 3, b gained a lower MLD68 (3.5) and better recognition accuracy (99.2%) than a, which demonstrates the more important role that the same z value played in the alignment task. That is to say, the same z primarily determined the alignment performance, while the self-aligned block subsidiarily redressed the remaining unaligned attributes.
Table 3. Results of the ablation study. a: the same z value; b: without the self-aligned block; c: normal weighted reconstruction loss; d: without a multiscale patch discriminator.
As for experiment c, we found the worst results in both appearance and performance, as shown in Figure 8c and Table 3. The MLD68 of c reached 5.8, nearly as high as that of DVG (6.2). The images in Figure 8c show diverse misaligned attributes such as poses, expressions and eyebrows. We attribute the poor results to the normally weighted reconstruction loss, which closes the pixel-level distance between the raw data and the reconstructed data through a strong constraint in the training stage. Thus, the results were as unaligned as the raw data.
When removing the multiscale patch discriminator in experiment d, the generated images were quite blurry, as shown in Figure 8d, and the MLD68 of d increased by 1 pixel per keypoint to 4.1 due to the difficulty of detecting landmarks on blurry images. However, the blurry images only slightly affected the recognition performance, because the heterogeneous recognition network is insensitive to image quality, unlike human observers. The Rank1 accuracy of d only decreased by 0.5% to 99.4%, as shown in Table 3.

Conclusions
In this paper, we first analyzed the misalignment problem in the baseline model. Then, we proposed a self-aligned dual generation (SADG) architecture to address it, including the self-aligned block and the multiscale patch discriminator. After training our model, we could generate numerous well-aligned paired NIR-VIS images. We further proposed the mean landmark distance (MLD) for evaluating the alignment performance quantitatively. Extensive experiments on three popular NIR-VIS datasets were conducted, achieving state-of-the-art quantitative results and showing the best alignment performance with our generated images. Finally, an ablation study was performed to demonstrate the effectiveness of our proposed self-aligned generation architecture.
There is still room for improvement in the field of heterogeneous face image generation. Though we produced numerous semantically aligned paired NIR-VIS images, we have not improved the attribute diversity of the heterogeneous face data. Semantic face editing has shown its ability to produce versatile semantic classes that are nonexistent in the training data. In the future, we may try to apply it to heterogeneous face images generated by SADG, boosting their attribute diversity.