Multimodal Biometric Template Protection Based on a Cancelable SoftmaxOut Fusion Network

Authentication systems that employ biometrics are commonplace, as they offer a convenient means of authenticating an individual’s identity. However, these systems give rise to concerns about security and privacy due to insecure template management. As a remedy, biometric template protection (BTP) has been developed. Cancelable biometrics is a non-invertible form of BTP in which the templates are changeable. This paper proposes a deep-learning-based end-to-end multimodal cancelable biometrics scheme called cancelable SoftmaxOut fusion network (CSMoFN). By end-to-end, we mean a model that receives raw biometric data as input and produces a protected template as output. CSMoFN combines two biometric traits, the face and the periocular region, and is composed of three modules: a feature extraction and fusion module, a permutation SoftmaxOut transformation module, and a multiplication-diagonal compression module. The first module carries out feature extraction and fusion, while the second and third are responsible for the hashing of fused features and compression. In addition, our network is equipped with dual template-changeability mechanisms with user-specific seeded permutation and binary random projection. CSMoFN is trained by minimizing the ArcFace loss and the pairwise angular loss. We evaluate the network, using six face–periocular multimodal datasets, in terms of its verification performance, unlinkability, revocability, and non-invertibility.


Introduction
The scope of deployment of biometrics-based systems is rapidly expanding. In particular, the use of systems that rely on biometrics, such as mobile, banking, and online systems, is increasing. Biometrics capture unique physiological or behavioral trait information about users, and are therefore a convenient and highly accurate means of identity management. However, a biometric trait can no longer be used safely once it has been exposed or abused, and the same biometric template should not be stored across multiple devices, as this may increase the risk of a cross-matching attack [1,2]. A system employing biometrics must therefore prioritize security, privacy, and accuracy when authenticating individuals.
Cancelable biometrics (CB), a biometric template protection method, has been proposed to address the abovementioned concerns. A CB scheme generally consists of a feature extractor, a user-specific parameterized transformation function, and a matcher, as shown in Figure 1. CB does not directly use the original biometric template for matching, but instead uses the results of a non-invertible transformation. The original biometric data cannot be restored after being transformed, although a CB template can be generated immediately from its original counterpart. Furthermore, cross-matching of different CB templates generated from the same biometric template is highly unlikely. In brief, there are four conditions that need to be satisfied by the CB scheme, as follows [3]:
- Non-invertibility: It must be extremely difficult to restore the original biometric template from the CB template.
- Revocability: If a CB template is exposed, a new template should be generated immediately from the original biometric data. This implies that there should be no limit on the number of CB templates generated from one biometric template.
- Unlinkability: Two or more CB templates generated from the same user should not be linkable to one another, in order to reduce the risk of a cross-matching attack.
- Performance: The accuracy of a CB-based system should not be poorer than that of its unprotected counterpart.
In general, biometric systems can be classified into two types based on the number of biometric modalities used, namely unimodal and multimodal biometrics [4,5]. A unimodal system recognizes a user based on a single biometric modality, whereas a multimodal system performs recognition based on more than one biometric modality. Unimodal biometrics has traditionally been applied, and although its performance has been proven, it has certain limitations. Since only a single biometric modality is deployed, the presence of sensor noise may affect the accuracy. Other problems may also arise, such as non-universality, vulnerability to spoofing attacks, and intra-class and inter-class similarities [6]. Multimodal biometrics is an approach that can compensate for these limitations. Since multimodal biometrics uses multiple biometric modalities, the probability of all modalities being unavailable or missing is low. The accuracy can also be improved due to the fusion of biometric information. Furthermore, the use of multiple modalities increases the robustness to security attacks such as spoofing [7]. However, the risk of template abuse and attack remains the same as for unimodal biometric systems, and the consequences could be catastrophic, as more private information about the user would be revealed from multiple compromised templates.
In this paper, we propose a deep-learning-based end-to-end multimodal CB scheme called cancelable SoftmaxOut fusion network (CSMoFN). By end-to-end, we mean a model that receives raw biometric data as input and produces a CB template as output [8]. Our model takes two biometric traits, the face and the periocular region, as input.

Related Work

Multimodal Biometrics with Deep Learning
In this subsection, we review several works that have applied deep learning to multimodal biometric systems. Ding et al. [9] proposed a multimodal face recognition system that combined global facial features, a frontal face image rendered using a 3D face model, and uniformly sampled local face image patches. A combination of multiple convolutional neural networks (CNNs) and a stacked autoencoder was used for feature learning and feature-level fusion.
Al-Waisy et al. [10] outlined a multimodal biometric system known as IrisConvNet, which fuses the left and right irises at the ranking level. Alay et al. [11] considered the iris, face, and finger veins and processed them with separate CNNs, with the output of each network fused at the score level. Gunasekaran et al. [12] also proposed a deep multimodal biometric system consisting of the iris, face, and fingerprint, called the deep contourlet derivative weighted rank (DCD-WR) network. Matching is achieved with a deep learning template matching algorithm.
The study in [13] presented a multifeature deep learning network (MDLN) architecture that fused the facial and periocular regions, with the addition of texture descriptors. MDLN was designed as a feature-level fusion approach that correlated raw biometric data with texture descriptors to produce a new representation.
Algashaam [14] fused the iris and periocular region using a hierarchical fusion network. Their network allowed the system to automatically explore and discover the best strategy for combining the individual biometric scores. Luo et al. [15] outlined a deep neural network for fusion of the iris and periocular features. A co-attention feature fusion module was used to fuse the features adaptively, to obtain iris-periocular features for accurate recognition.
Jung et al. [16] proposed a teacher-student network for periocular representation learning. The teacher network, which is pre-trained with face images, is leveraged to regulate the student (periocular) network in order to enhance the periocular performance. Soleymani et al. [17] suggested a generalized compact bilinear fusion algorithm composed of multiple CNNs. Three multimodal features (iris, face, and fingerprint) are fused through a fully connected layer.
In summary, deep learning is a natural and well-suited approach for multimodal biometric systems, as deep neural networks enable feature extraction, fusion, and authentication to be performed "under one roof". However, the works reviewed above do not consider the issue of template protection.

Cancelable Multimodal Biometrics with Deep Learning
Multimodal biometrics with template protection is not a new topic, with many papers having been published on this subject [18][19][20][21][22]. However, deep-learning-based multimodal biometric systems that include template protection remain very scarce.
Abdellatef et al. [23] designed a multi-instance CB for the face using multiple CNNs to extract features from multiple regions of a face image, such as the whole face, eyes, nose, and mouth. After fusion of several deep features, a cancelable template was generated via bioconvolving encryption. Their method achieved a better performance than when a unimodal biometric was applied. However, a detailed analysis of the CB design criteria was not provided.
Talreja et al. [24] introduced a method for generating CB templates from face and iris biometrics. Each biometric image was subjected to feature extraction via a CNN, with a random component selected from the generated features and used as a transformation key. The transformed templates generated from the CB module were then converted to a secure sketch via a forward error correction (FEC) decoder and cryptographic hashing. However, a compromised transformation key may pose the risk of CB template inversion.
Sudhakar et al. [25] put forward a finger vein and iris-based CB scheme in which a CNN was applied to perform feature extraction and a support vector machine (SVM) was used for user verification. The template was protected with a random projection-based approach.
More recently, El-Rahiem et al. [26] proposed a multibiometric CB system using fingerprint, finger vein, and iris images. First, feature extraction was performed for each biometric modality through a CNN, and fusion of the three modalities was achieved using feature maps. The DeepDream algorithm was then applied to give a cancelable template. However, a security analysis was not performed.
A detailed comparison with the above papers is provided in Section 4.6.

Motivations and Contributions
In this paper, we propose an end-to-end CSMoFN scheme for multimodal biometric systems. Our network fuses face and periocular biometric traits at the feature level, producing a single CB template as output. More specifically, CSMoFN is composed of two components, the first of which transforms the fused biometric vectors based on the notion of the random permutation maxout transform (RPMoT), which was proposed in [27]. Although RPMoT is a CB transformation scheme, it is not learnable and is data-agnostic, and hence barely meets the performance requirement for a CB scheme. In this work, RPMoT is reformulated as a part of the deep neural network and is referred to as the permutation SoftmaxOut transform (PSMoT). PSMoT is data-driven, learnable, and parameterized by user-specific permutation seeds to satisfy the revocability and unlinkability criteria. PSMoT can be viewed as a locality-sensitive hashing process that transforms biometric data from a high-dimensional input space to a relatively low-dimensional hash space [28].
The PSMoT comes with a customized layer called the permutation SoftmaxOut layer. The layer is composed of maxout units and a modified softmax function to approximate the permutation and winner-takes-all operations in the RPMoT. In addition, the modified softmax function helps to minimize the quantization errors introduced by the SoftmaxOut approximation. Since PSMoT aims to produce a discrete hash vector from the network directly, it is a representation learning problem. Hence, a pairwise distance-based loss, called the pairwise angular (PA) loss, is introduced to optimize the margin between intra- and inter-class distances.
The output of PSMoT (i.e., a hash vector) is passed directly to the second component of the network, called the multiplication-diagonal compression (MDC) module. The MDC module is designed to further enhance the security and compress the PSMoT hash vector, and offers a user-specific seeded binary random projection mechanism as a means to enhance the revocability and unlinkability of the proposed method.
Both the PSMoT and MDC transformations follow the many-to-one mapping principle owing to their hashing nature, meaning that inversion (one-to-many mapping) of the terminal hash output is theoretically impossible and computationally hard in practice. This is essential to satisfy the non-invertibility requirement for a CB scheme.
In this paper, we opt to fuse the face and periocular features. The periocular region, also known as the periphery of the ocular area, includes the vicinity of a person's eyes and contains information on the subject's eyebrows, eyelashes, and skin texture. The periocular region is a complementary biometric to the face that helps to enhance the biometric performance of the face alone. It is particularly useful in situations such as when a mask is worn, where the face is occluded and a performance degradation is expected. As we will show in the experiment section, fusion of the face and periocular region outperforms the respective unimodal counterparts, i.e., the face or periocular region alone.
We can summarize our contributions as follows:
• A deep-learning-based CB scheme for multimodal biometrics is proposed. Although the face and periocular biometrics form the focus of this paper, our proposed method can also be applied to other biometric modalities, provided the input is a raw image.
• A deep network, CSMoFN, composed of three modules (a feature extraction and fusion module, a PSMoT module, and an MDC module), is proposed to realize the above scheme. The first module is responsible for performing feature extraction and fusion, and the latter two are cancelable transformation functions, which are devised with respect to the four CB design criteria.
• The three modules are trained in an end-to-end manner with a combination of classification loss and representation learning, namely the ArcFace loss and the PA loss.
• We evaluate the proposed network on six face-periocular multimodal datasets in terms of verification performance, unlinkability, revocability, and non-invertibility.

Preliminaries: Random Permutation Maxout Transform
RPMoT [27] is a data-agnostic CB scheme that transforms a biometric feature vector into a discrete hash vector, as illustrated in Figure 2. RPMoT is parameterized by a user-specific seeded permutation matrix, which means that the hash vector can be revoked if it is compromised. The flow of the algorithm is summarized as follows:
1. A user-specific permutation matrix is first created. Suppose the size of the biometric feature vector X is d, so that each permutation matrix is d × d. A total of m permutation matrices are generated and stacked to form P.
2. X and P are multiplied to yield a matrix W of size m × d.
3. The first q column vectors of W are retained and the rest are discarded, yielding a matrix Y of size m × q.
4. Finally, the position of the feature with the largest value in each row of Y is recorded as the index value. When all rows have been processed, the RPMoT hash vector u of size m is obtained. Note that u is an integer-valued vector with entries ranging from 1 to q.
In this paper, RPMoT is redesigned as a component of CSMoFN, which is learnable and data-driven. However, the essence of RPMoT as a CB scheme that satisfies the requirements of non-invertibility, revocability, and unlinkability remains intact.
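The four RPMoT steps can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation of [27]: the function name `rpmot`, the use of Python's `random.Random` as the seeded permutation source, and the toy feature values are all assumptions. Each permutation is applied to X directly rather than by building an explicit d × d matrix, which gives the same row of W.

```python
import random


def rpmot(x, m, q, seed):
    """Hash feature vector x into m indices, each in {1, ..., q}."""
    d = len(x)
    if not 1 <= q <= d:
        raise ValueError("q must satisfy 1 <= q <= len(x)")
    rng = random.Random(seed)  # seed stands in for the user-specific seed
    u = []
    for _ in range(m):
        # Steps 1-2: one seeded permutation applied to x gives one row of W.
        perm = list(range(d))
        rng.shuffle(perm)
        # Step 3: keep only the first q permuted entries (one row of Y).
        row = [x[p] for p in perm[:q]]
        # Step 4: record the 1-based position of the largest entry.
        u.append(row.index(max(row)) + 1)
    return u


# Toy values, chosen only for illustration.
features = [0.12, -0.40, 0.88, 0.05, -0.73, 0.31, 0.64, -0.09]
template = rpmot(features, m=6, q=3, seed=2024)
reissued = rpmot(features, m=6, q=3, seed=7)  # a new seed yields a new template
```

Because only the rank order of the retained entries matters, the hash is unchanged by any positive rescaling of the feature vector, and many different feature vectors map to the same hash vector.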

Overview
The proposed CSMoFN system takes face and periocular biometric information as its input and is composed of three modules: a feature extraction and fusion module, a PSMoT module, and an MDC module. As portrayed in Figure 3, the backbone of CSMoFN is based on a CNN, which performs feature extraction from images of faces and periocular regions via multiple convolutional blocks. The extracted features are fused at the feature level. PSMoT then transforms the fused vector to a discrete hash vector, which is further compressed by the MDC module to yield a terminal hash vector (CB template). A user-specific token or password is required to generate a random seed for the permutation and the binary random projection in the PSMoT and MDC modules, respectively.
In essence, the proposed system is a two-factor cancelable multimodal biometric system for which both biometric inputs and a user-specific token or password are required. The entire network is trained end-to-end following an open-set (database and identity independence) evaluation protocol [29]. This means that the model is trained on datasets that are independent of the enrolled subjects, which is preferable for biometric systems as the model does not need to be retrained when a new user is enrolled or an old CB template is reissued. In the latter case, the user only needs to change the token or password.
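How a token or password is turned into a seed is not specified in the text. One possible approach, shown purely as an assumption, is to hash the token with SHA-256 and use part of the digest to seed a pseudo-random generator. The helper names `seed_from_token` and `seeded_permutation` and the example tokens are hypothetical.

```python
import hashlib
import random


def seed_from_token(token):
    """Derive a 64-bit integer seed from a user token (assumed scheme)."""
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def seeded_permutation(token, size):
    """Return a permutation of range(size) determined by the token."""
    rng = random.Random(seed_from_token(token))
    perm = list(range(size))
    rng.shuffle(perm)
    return perm


old_perm = seeded_permutation("old-token", 8)
new_perm = seeded_permutation("new-token", 8)  # reissue with a new token
```

Under this assumed scheme, reissuing a template requires no retraining: the same trained network is reused and only the token, and hence the seed, changes.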

Feature Extraction and Fusion Module
As shown in Figure 4, we adopt ResNet-50 as the backbone for the proposed method, which consists of 49 convolutional layers and a linearly activated fully connected layer with $p$ neurons, thus producing a $p$-dimensional feature vector for each face and periocular image. The backbone is pre-trained using the MS-Celeb-1M dataset [30]. The two feature vectors from the face, $z_{face}$, and the periocular region, $z_{periocular}$, are fused at the feature level by concatenation, and hence the number of dimensions of the fused vector $z = [z_{face} \; z_{periocular}]$ is $2p$. Although concatenation increases the feature size compared to other strategies such as the feature sum or average, it largely preserves the useful information from both biometrics. In this work, we set $p = 512$.

Permutation SoftmaxOut Transform (PSMoT) Module
The PSMoT module is located immediately after the fully connected (FC) layer of the feature extraction and fusion module. As depicted in Figure 5, it is composed of two ReLU-activated hidden layers that perform a nonlinear transformation, the first of which ($h_1$) consists of $l_1$ neurons and the second ($h_2$) of $l_2$ neurons. We set $l_1 = l_2 = 2014$. The SoftmaxOut layer is a dedicated layer designed for hashing. It contains $m$ maxout units $m_i$, each composed of $q$ permutable, linearly activated neurons. The permutation is user-specific and/or application-specific. The maxout unit is a function that returns the index of the maximal entry of the $q$ neurons. The hashing layer produces $m$ discrete hash codes $v_i$, forming a hash vector $v \in \{1, \ldots, q\}^m$.
Recall that the RPMoT (Section 2) produces the index value of the maximum entry of a $q$-dimensional permuted vector, as described in Step 4. This is equivalent to taking the index value $v_i$ from a permutable maxout unit, i.e., the index of the largest of its $q$ activations [31]; we refer to this operation as Equation (1). However, Equation (1) is non-differentiable and hence cannot be trained with backpropagation. In view of this, we approximate Equation (1) with a function built on $s_\mu(\cdot)$ (Equation (2)), where $s_\mu(\cdot)$ is the Softmax function parameterized with $\mu > 1$:
$$s_\mu(v)_j = \frac{e^{\mu v_j}}{\sum_{t=1}^{q} e^{\mu v_t}} \quad (3)$$
Unlike the conventional Softmax function, $s_\mu(v)$ is parameterized by a scalar factor $\mu > 1$ that forces the output of the network towards zero or one, thereby allowing the PSMoT to learn a discrete hash code. In our experiments, we use $\mu = 9$.
Unlike RPMoT, which is data-agnostic, PSMoT is data-driven. In addition, RPMoT transforms a biometric feature vector that is separately processed by a feature extractor, whereas the PSMoT module is connected to the CNN backbone and both are trained in an end-to-end manner. The inclusion of two hidden layers is beneficial in terms of improving the feature discrimination, which can be attributed to the nonlinear transformation of the fused features.
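The effect of the scalar factor µ can be illustrated numerically. The sketch below assumes the usual scaled form of the Softmax function, in which each input is multiplied by µ before exponentiation. The function names and the toy activations are hypothetical, and the code shows only the sharpening behaviour of a single maxout unit, not the full PSMoT layer.

```python
import math


def softmax_mu(v, mu=9.0):
    """Softmax of mu * v, shifted by max(v) for numerical stability."""
    shift = max(v)
    exps = [math.exp(mu * (x - shift)) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def hard_index(v):
    """Non-differentiable maxout output: 1-based index of the largest entry."""
    return v.index(max(v)) + 1


activations = [0.2, 1.0, 0.5]  # toy activations of one maxout unit (q = 3)
soft_plain = softmax_mu(activations, mu=1.0)  # conventional Softmax
soft_sharp = softmax_mu(activations, mu=9.0)  # value of mu used in the text
```

With µ = 1 the largest probability stays below one half for these toy values, whereas with µ = 9 nearly all of the mass sits on the winning neuron, so the soft output is close to the one-hot encoding of the hard index while remaining differentiable.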

Multiplication-Diagonal Compression (MDC) Module
MDC is a learning-free module located immediately after the PSMoT. As shown in Figure 6, the PSMoT hash vector of size $m$ is first reshaped into a matrix $V'$ of size $k \times n$, where $k = \lceil m/n \rceil$. Then, based on a user-specific seed, a binary random matrix $R \in \{0, 1\}^{n \times k}$ is generated and multiplied with $V'$, yielding a matrix $Q \in \{0, 1\}^{k \times k}$. Finally, the diagonal elements of $Q$ are extracted and a terminal hash vector (CB template) $s \in \{1, \ldots, q\}^k$ is obtained.
The MDC module is devised to further enhance the non-invertibility, revocability, and unlinkability of the CSMoFN. Another important goal for the MDC module is to compress the PSMoT hash vector from size $m$ to $k$ (where $k \ll m$) without sacrificing accuracy. While offering computational advantages, the multiplication of the user-specific binary random matrix and the hash matrix is a special kind of random projection [32] that can approximately preserve the pairwise distances of the hash vectors with respect to the distances of their original counterparts [33].
Finally, the extraction of the diagonal elements from $Q$ can be seen as yet another many-to-one mapping that enhances the non-invertibility property of the proposed scheme.
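The reshape, multiply, and take-the-diagonal sequence can be sketched as follows. Several details are assumptions made only for illustration: the reshape is row-major, the hash vector is zero-padded when m is not a multiple of n, the binary matrix is drawn with Python's `random.Random`, and the function name `mdc` is hypothetical. The sketch returns the raw integer diagonal of the product; any further mapping of these values onto a fixed code range is not reproduced here.

```python
import math
import random


def mdc(v, n, seed):
    """Compress hash vector v to ceil(len(v) / n) values via a seeded binary projection."""
    k = math.ceil(len(v) / n)
    padded = list(v) + [0] * (k * n - len(v))  # zero-padding is an assumption
    rows = [padded[i * n:(i + 1) * n] for i in range(k)]  # k x n matrix
    rng = random.Random(seed)  # seed stands in for the user-specific seed
    proj = [[rng.randint(0, 1) for _ in range(k)] for _ in range(n)]  # n x k, binary
    # Only the diagonal of the k x k product is needed:
    # entry i is the dot product of row i with column i of the projection.
    return [sum(rows[i][j] * proj[j][i] for j in range(n)) for i in range(k)]


hash_vector = [3, 1, 4, 1, 5, 2, 6, 5, 3, 5, 2, 4]  # toy PSMoT output, m = 12
compressed = mdc(hash_vector, n=4, seed=11)          # k = 3
```

Each output value is a sum over a seed-dependent subset of one row, so many different rows produce the same value, which illustrates the many-to-one character of the compression.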

• ArcFace Loss
The ArcFace loss [34] is used in the feature extraction and fusion module as a means of enhancing the terminal hash code discrimination. It is a modified Softmax classification loss in which the prototype weight vector of the $j$-th identity, $w_j$, is L2-normalized. Specifically, the target logit (the activation of the classification layer before applying the Softmax function) is redefined as $w_j^T z_i = \|w_j\| \|z_i\| \cos\theta_j$, where $z_i$ is the L2-normalized fused feature vector (Section 3.2) of the $i$-th sample, belonging to the $j$-th identity. The normalization of the fused features and weights means that the predictions rely only on the angle between $z_i$ and $w_j$, denoted as $\theta_j$. The prediction of the $y_i$-th identity then depends only on $\theta_{y_i}$. The ArcFace loss is defined as:
$$L_{ArcFace} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{e^{\gamma \cos(\theta_{y_i} + \beta)}}{e^{\gamma \cos(\theta_{y_i} + \beta)} + \sum_{j=1, j \neq y_i}^{N} e^{\gamma \cos\theta_j}} \quad (4)$$
where $B$ is the batch size, $N$ is the number of identities, $\beta$ is an angular margin introduced to force the classification boundary closer to the prototype weight $w_j$, and $\gamma$ is a feature rescaling factor. In this manner, the learned fused features are distributed on a hypersphere with a radius of $\gamma$. In this paper, we set $\beta = 0.35$ and $\gamma = 25$.
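As a worked illustration, the additive angular margin loss in its standard form [34] can be computed directly from the cosines between each sample and the class prototypes. The sketch below is not the training code: the function name `arcface_loss`, the 0-based labels, and the toy cosine values are assumptions, and the cosines are taken as given rather than computed from normalized features and weights.

```python
import math


def arcface_loss(cosines, labels, beta=0.35, gamma=25.0):
    """Mean ArcFace loss; cosines[i][j] = cos(theta_j) for sample i, labels are 0-based."""
    total = 0.0
    for cos_row, y in zip(cosines, labels):
        # Add the angular margin beta to the target angle only.
        theta_y = math.acos(max(-1.0, min(1.0, cos_row[y])))
        logits = [gamma * c for c in cos_row]
        logits[y] = gamma * math.cos(theta_y + beta)
        # Numerically stable log of the softmax denominator.
        shift = max(logits)
        log_denominator = shift + math.log(sum(math.exp(l - shift) for l in logits))
        total += log_denominator - logits[y]
    return total / len(labels)


toy_cosines = [[0.9, 0.1, -0.2], [0.0, 0.8, 0.3]]  # two samples, three identities
toy_labels = [0, 1]
with_margin = arcface_loss(toy_cosines, toy_labels)            # beta = 0.35, gamma = 25
without_margin = arcface_loss(toy_cosines, toy_labels, beta=0.0)
```

Setting the margin to zero reduces the expression to an ordinary Softmax cross-entropy over rescaled cosines. With a positive margin the target logit shrinks, so the same samples incur a larger loss and the network is pushed to make the target angle smaller.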

• Pairwise Angular Loss
The pairwise angular (PA) loss function is introduced to optimize the PSMoT module. Face-periocular pairs are associated with similarity labels $c_{ij}$, where $c_{ij} = 1$ implies that two face-periocular pairs are from the same identity (positive pair) and $c_{ij} = 0$ indicates that they are from different identities (negative pair). The aim of the loss function is to ensure that the similarity between a pair of hash vectors $s_i$ and $s_j$ in the MDC module is high if they form a positive pair, and to make the dissimilarity greater than a given margin for negative pairs. Our PA loss is defined in terms of a scaling factor $\delta$, an angular margin $\alpha$, and the angular distance $\varphi$ between $\hat{s}_i$ and $\hat{s}_j$ (i.e., $\varphi = \cos^{-1}(\hat{s}_i^T \hat{s}_j)$, given that $\hat{s}$ is the L2-normalized hash code to be rescaled based on $\delta$).
Similarly to the ArcFace loss, normalization and rescaling of $\hat{s}_i$ and $\hat{s}_j$ mean that the similarity measure relies only on the angle between the two hash codes, which forces the hash codes to lie on a hypersphere with radius $\delta$. An additive angular margin penalty $\alpha$ between $\hat{s}_i$ and $\hat{s}_j$ is introduced to enhance the intra-class compactness and the inter-class separation simultaneously. Here, we set $\delta = 2.5$ and $\alpha = 0.5$.
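The angular quantity φ on which the PA loss operates can be sketched as follows. This is only an illustration of φ = cos⁻¹(ŝᵢᵀŝⱼ) for L2-normalized hash codes; the helper name `angular_distance` is hypothetical, and the sketch does not implement the loss itself.

```python
import math

def angular_distance(s_i, s_j):
    """Angle in radians between two hash codes after L2 normalization."""
    norm_i = math.sqrt(sum(x * x for x in s_i))
    norm_j = math.sqrt(sum(x * x for x in s_j))
    cosine = sum(a * b for a, b in zip(s_i, s_j)) / (norm_i * norm_j)
    # Clamp to the valid domain of acos to absorb rounding error
    return math.acos(max(-1.0, min(1.0, cosine)))
```

Because both vectors are normalized before the inner product is taken, rescaling either hash code leaves φ unchanged, which is why the similarity measure depends only on the angle.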

• Total Loss
In a nutshell, the total loss function $L_{total}$ used to optimize the CSMoFN combines the ArcFace loss and the PA loss with a weight decay regularizer $L_2$, where $\alpha$ is a coefficient that helps to reduce overfitting. In this paper, we set $\alpha = 0.5$.

Experiments

Datasets
Our experiments were performed on six face-periocular datasets. Although these datasets originally contained only face images, periocular images were obtained by cropping the eye region from the face images. The six datasets considered are AR [35], Ethnic [36], Facescrub [37], IMDB Wiki [38], Pubfig [39], and YTF [40]. We followed the open-set evaluation protocol [29], in which the training and testing datasets do not overlap. The training set was constructed from subsets of Ethnic and IMDB Wiki, and its subjects were independent of those in the testing sets. All six datasets were used for testing. Table 1 gives a summary of the composition of each of the training and testing datasets.

Experimental Setup
Evaluations were carried out in the authentication (verification) setting. The equal error rate (EER) and the receiver operating characteristic (ROC) curve were used as authentication metrics for the proposed method. The experiments were run on a computer with an Intel(R) Core(TM) i7-6700K CPU @ 4.00 GHz, 32 GB of RAM, and an NVIDIA GeForce GTX 1080 Ti GPU, with the model implemented using the PyTorch library.
For network training, the batch size B was fixed to 256, the number of epochs was 90, and the learning rate was 0.0001. Matching of the hash vectors was performed using the Hamming distance. To enable a fair comparison, all experiments were conducted under the stolen-token scenario [27], in which the same user-specific seeds are used.
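The matching and evaluation steps described above can be sketched as follows. This is a minimal illustration rather than the evaluation code used in the experiments: the function names are hypothetical, the Hamming distance is assumed to be normalized by the vector length, and a comparison is assumed to be accepted when its distance does not exceed the threshold.

```python
def hamming_distance(a, b):
    """Fraction of positions at which two equal-length hash vectors differ."""
    if len(a) != len(b):
        raise ValueError("hash vectors must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)

def equal_error_rate(genuine, imposter):
    """Approximate EER from genuine and imposter distance scores.

    Sweeps every observed score as a candidate threshold and returns
    the mean of FAR and FRR where the two rates are closest.
    """
    best_gap, eer = None, None
    for t in sorted(set(genuine) | set(imposter)):
        far = sum(d <= t for d in imposter) / len(imposter)  # imposters accepted
        frr = sum(d > t for d in genuine) / len(genuine)     # genuines rejected
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

With a finite number of scores the FAR and FRR curves rarely cross exactly, so the value returned is the midpoint at the closest threshold; finer interpolation is possible but not shown here.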

Hyperparameter Analysis
In this section, we explore the impact of the two essential hyperparameters, q and m, used in CSMoFN, where q determines the range of elements of the PSMoT hash vector and m is the hash vector size. We set m = 256, 512, 1024, 2048, and 4096, and q = 8, 16, 32, and 64.
Table 2 shows the performance on the six datasets at various settings of m and a fixed value of q = 32. For all datasets, it can be seen that the smaller the value of m, the lower the accuracy, which implies a loss of information. As m increases, the degree of information loss decreases, which leads to a lower EER.
For Table 3, we set m = 4096 and checked the performance for different values of q. We can observe that although the parameter q does not have a significant effect on the overall accuracy, medium (i.e., q = 32) and small values (such as q = 8) give a slightly higher EER. However, it is better to set q to a larger value to enhance the security, as a larger q increases the complexity of a brute-force guessing attack on a hash vector.
Figure 7 shows the EER (%) performance for m and q, while Figure 8 shows the ROC curve and area under the curve (AUC) for six datasets with m = 4096 and q = 32.
As discussed in Section 3.4, the MDC module reshapes the PSMoT hash vector with size m to a matrix V with size k × n, and hence m = k × n. Note that the size of the final hash vector (CB template) is k, with the value of k dependent on m and n. For this experiment, we examine two combinations of k and n, i.e., (k, n) = (128, 16) and (256, 8), with a fixed value of m = 2048. The EER is shown in Table 4. We observe that (256, 8) outperforms (128, 16), which implies that k = 256 is a better choice than k = 128. It is worth noting that the EERs for the PSMoT hash vector (without reshaping) and for (k, n) = (256, 8) are identical, although the size of the final hash vector in the latter case is only 256. This demonstrates the effectiveness of the MDC module in terms of compression.

Performance Comparison with Unimodal CB Systems
In this section, we demonstrate the advantage of the proposed multimodal CB system through a comparison with unimodal CB systems (i.e., where either the face or the periocular region alone is adopted).
Table 5 shows the average EER performance for two unimodal CB systems and a multimodal CB system for six datasets. Figure 9 shows the change in performance for varying m and with q fixed at 32. In general, the performance of each system improves with large m and a moderate value of q. This is consistent with the finding in Section 4.3. However, we also note that the face-periocular CB system achieves the best EER, with a reduction of around 50% relative to its unimodal counterparts at m = 4096 and q = 32. This suggests that fusion is essential for performance gain, despite its simplicity.

Ablation Studies
This section presents an ablation study on the proposed CSMoFN. We first explore the accuracy of the feature extraction module alone with cosine distance, which is equivalent to an unprotected biometric system and serves as a baseline. We then examine the feature extraction module + PSMoT, and lastly the entire CSMoFN. The latter two are CB systems.
From Table 6, we can observe that the baselines for the face and periocular region alone perform better than their CB counterparts (i.e., PSMoT and CSMoFN). This is as expected, and can be attributed to the performance-security tradeoff made in CB systems, where the performance may be degraded after the CB transformation. However, the use of face-periocular fusion largely restores the verification performance for CSMoFN.

In this section, we present a summary with remarks in Table 7 rather than a comparison between different approaches. This is because a fair comparison between different template protection schemes is very difficult, or even impossible, due to several factors such as the choice of biometric modality, fusion method, datasets, and evaluation metrics.

Unlinkability Analysis
In our unlinkability analysis, we follow the protocol and method proposed in [41]. The "mated score" and "non-mated score" first have to be calculated and are defined as follows.
Mated sample scores: These scores are calculated through cross-matching of templates from the same subject. In our case, we use the face-periocular pair $X$ of the same user with different permutation and random projection seeds $r$.
Let the mated CB template pair be $T_{m1} = \mathrm{CSMoFN}(X_1, r_1)$ and $T_{m2} = \mathrm{CSMoFN}(X_1, r_2)$. The mated sample score can then be obtained via $s = d_H(T_{m1}, T_{m2})$, where $d_H$ is the Hamming distance.
The mated sample distribution is denoted as $p(s|H_m)$, where $H_m$ is the hypothesis that both CB templates are mated.
Non-mated sample scores: The non-mated score is calculated in a similar way but for different subjects. Using a similar notation to that given above, let the non-mated CB template pair be $T_{nm1} = \mathrm{CSMoFN}(X_1, r_1)$ and $T_{nm2} = \mathrm{CSMoFN}(X_2, r_2)$. The non-mated scores are then estimated as $s = d_H(T_{nm1}, T_{nm2})$, and the distribution of the non-mated sample scores is $p(s|H_{nm})$, where $H_{nm}$ is the hypothesis that both templates are non-mated.
In addition, we use two measures of unlinkability: a local and a global measure.
Local measure $D_{\leftrightarrow}(s)$: This measure is based on the likelihood ratio between the two score distributions, $D_{\leftrightarrow}(s) = p(H_m|s) - p(H_{nm}|s) \in [0, 1]$.
Global measure $D_{\leftrightarrow}^{sys}$: Unlike the local measure, this metric evaluates the unlinkability of the overall system independently of the score domain. This measure also has a range of [0, 1].
A CB scheme ideally satisfies the unlinkability criterion if $p(H_m|s) = p(H_{nm}|s)$. If the two distributions are completely separated, the CB templates are fully linkable; conversely, if both the local and global measures are close to zero, the CB scheme is deemed unlinkable.
According to the benchmark protocol proposed in [41], we carried out experiments by generating three CSMoFN hashed vectors with m = 4096 and q = 32, using the Pubfig, Facescrub, and YTF datasets with different user-specific seeds. The mated sample score distribution, the non-mated sample score distribution, and the local measure values are all plotted together in Figure 10. It can be seen that the mated and non-mated score distributions clearly overlap, which indicates that the CB templates are unlinkable. Furthermore, the global measures $D_{\leftrightarrow}^{sys}$ for the three datasets are 0.056, 0.039, and 0.104, respectively. For each specific linkage score $s$, $D_{\leftrightarrow}(s) = 0$ denotes full unlinkability, while $D_{\leftrightarrow}(s) = 1$ denotes full linkability of the two transformed templates. With this significant overlap, the overall linkability of the proposed method is close to zero. This indicates that the CSMoFN hashed vectors are unlinkable.
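The two measures can be estimated from score histograms, as in the sketch below. It rests on several assumptions that are not stated in this paper: scores are normalized to [0, 1], the prior probabilities of the mated and non-mated hypotheses are equal, negative differences are clipped to zero so that the local measure stays in [0, 1], and the global measure is taken as the local measure averaged over the mated score distribution, which is our reading of [41]. The function name `unlinkability` and the bin count are hypothetical.

```python
def unlinkability(mated, non_mated, bins=20):
    """Return (local measure per bin, global measure) for scores in [0, 1]."""
    def histogram(scores):
        counts = [0] * bins
        for s in scores:
            counts[min(int(s * bins), bins - 1)] += 1
        return [c / len(scores) for c in counts]

    p_m = histogram(mated)       # bin masses estimating p(s | H_m)
    p_nm = histogram(non_mated)  # bin masses estimating p(s | H_nm)

    local = []
    for pm, pnm in zip(p_m, p_nm):
        if pm + pnm == 0:
            local.append(0.0)    # no scores observed in this bin
        else:
            # Equal priors: p(H_m|s) - p(H_nm|s), clipped at zero
            local.append(max(0.0, (pm - pnm) / (pm + pnm)))

    # Global measure: local measure weighted by the mated score mass
    d_sys = sum(pm * d for pm, d in zip(p_m, local))
    return local, d_sys
```

Identical mated and non-mated score sets give a global value of 0, while completely separated sets give 1, matching the two extremes described above.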

Revocability Analysis
To analyze the revocability of the proposed scheme, we generated three score distributions: the mated-imposter score, the genuine score, and the imposter score [42]. The genuine and imposter score distributions were calculated by matching CSMoFN hashed vectors generated from the same and different subjects, respectively. The mated-imposter score is identical to the mated-samples score described in Section 5.1, and is calculated from the matching of two CB templates generated by the same subject with different user-specific seeds. In other words, it is assumed that the user revokes the old CSMoFN hashed vectors and creates a new instance, meaning that the mated-imposter score is the matching score of the old and new CSMoFN hashed vectors. The revocability criterion is deemed to be satisfied if the mated-imposter score distribution overlaps with the imposter score distribution.
It can be observed from Figure 11 that the distributions of the mated-imposter and imposter scores substantially overlap for the three datasets, which indicates that the revocability criterion is satisfied.

Non-Invertibility Analysis
For our non-invertibility analysis, we consider two types of attack: brute-force and false acceptance (FA) attacks.

Brute-Force Attack
The goal of a brute-force attack is to guess the CSMoFN hashed vectors by exhaustive search, under the assumption that the attacker knows the structure of the CSMoFN and the corresponding hyperparameters [42].
The CSMoFN hashed vector $s$ is a discrete vector with size $k$, where every element is within the range $[1, q]$. For a configuration such as $q = 32$ and $k = 512$ ($m = 4096$), the guessing complexity for each element is $q = 32 = 2^5$. Since there are $k$ entries, the minimum guessing complexity is $2^{5 \times 512} = 2^{2560}$, which is prohibitively large in practice and prevents the attacker from going through all possible combinations. Furthermore, since CSMoFN is revocable, the hash vector can be replaced with a new one if it is found to be compromised.
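The arithmetic above can be checked with a short calculation. The helper name `brute_force_bits` is hypothetical; it simply expresses the number of candidate vectors, q^k, as a power of two.

```python
import math

def brute_force_bits(q, k):
    """Return log2 of the number of candidate hash vectors, q ** k."""
    return k * math.log2(q)

# Configuration discussed in the text: q = 32, k = 512
print(brute_force_bits(32, 512))  # 2560.0
```

Doubling q adds only one bit per element, whereas doubling k doubles the exponent, so the template length dominates the guessing complexity.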

False Acceptance Attack
An FA attack, also called a dictionary attack, is an attempt to gain illegal access to a biometric system [43]. This attack is realistic for any biometric system that relies on a decision threshold value. In other words, if the matching score of the authentication instance $s_a$ and the transformed template $s$ is less than a pre-defined threshold value $\tau$, the right to access the biometric system is obtained. In the stolen-token scenario, $s_a$ is a CSMoFN hashed vector generated with a biometric vector $X_a$ and stolen user-specific keys. The decision rule for authentication is then $\mathrm{Dist}_H(s_a, s) > \tau$, where $\mathrm{Dist}_H(\cdot)$ is the Hamming distance and $\tau$ is the threshold value at which the False Acceptance Rate (FAR) is equal to the False Rejection Rate (FRR).
To mitigate the FA attack, the threshold value should be set high, to achieve FAR = 0%. However, this implies a reduction in the GAR (Genuine Acceptance Rate), that is, a degradation in accuracy. To balance the performance with security, a suitable threshold value $\tau$ should be carefully calibrated.
In this paper, the distance between the fake template $s^*$ and $s$ is calculated via $\mathrm{Dist}_H(s, s^*) = UB_{imp}$, where $UB_{imp}$ is the upper bound on the imposter scores for the considered dataset, and represents the worst-case scenario. To succeed in this approach, an attacker can attempt to find $s^*$ to satisfy $\mathrm{Dist}_H(s_a, s^*) = \tau$. That is, the goal is to generate a fake template such that the distance score with $s_a$ falls into the interval $[\tau, UB_{imp}]$. Hence, the complexity of an FA attack can be estimated as $q^{m(\tau - UB_{imp})}$.
To analyze and respond to FA attacks in our context, it is necessary to determine the threshold value according to each GAR. Table 8 shows the complexity of an FA attack calculated for the six datasets used with the proposed method, with m = 4096 and q = 64. The complexity of the FA attack is evaluated based on GAR = 85%, 90%, and 95%. The $UB_{imp}$ for each dataset can be obtained from the largest value of the imposter scores. Note that if $\tau - UB_{imp}$ is negative, the complexity of the attack cannot be estimated.
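The estimate q^{m(τ − UB_imp)} can be expressed in bits as in the following sketch. The function name `fa_attack_bits` is hypothetical, the scores are assumed to be normalized to [0, 1], and the values of τ and UB_imp in the example are made up for illustration rather than taken from Table 8.

```python
import math

def fa_attack_bits(q, m, tau, ub_imp):
    """Return log2 of q ** (m * (tau - ub_imp)), or None if it is undefined.

    The estimate is only meaningful when tau is not below the imposter
    upper bound; otherwise None is returned, mirroring the note in the text.
    """
    gap = tau - ub_imp
    if gap < 0:
        return None
    return m * gap * math.log2(q)

# Illustrative values only: tau = 0.5, UB_imp = 0.25
print(fa_attack_bits(64, 4096, 0.5, 0.25))  # 6144.0
```

The expression grows linearly in m and in the gap τ − UB_imp, which is why a longer hash vector or a stricter operating point raises the cost of the attack.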
In summary, the complexity of an FA attack can be increased in two ways: by increasing the value of m or by reducing the GAR. However, the latter also implies a compromise in accuracy. For GAR = 90%, our proposed method is reasonably robust in resisting FA attacks. In addition, an FA attack can be prevented by restricting the number of attempts at the authentication stage.

Conclusions
In this paper, we propose a deep-learning-based multimodal cancelable biometrics scheme, which we call CSMoFN. Our scheme fuses two biometric traits, namely the face and the periocular region, and is composed of three modules: a feature extraction and fusion module, a PSMoT module, and an MDC module. CSMoFN is trained by minimizing the ArcFace loss and the pairwise angular loss. Experiments were conducted on six datasets, with the verification performance approximately preserved with respect to its original counterpart. In addition, we have analyzed the four criteria for a cancelable biometrics scheme (verification performance, unlinkability, revocability, and non-invertibility) and have shown that the proposed method satisfies them. In future research, we will consider more than two biometric modalities.

Figure 1. Illustration of a general cancelable biometrics scheme.

Figure 3. Overview of the proposed CSMoFN model.

Figure 4. The feature extraction and fusion module.

Figure 6. Structure of the multiplication-diagonal compression module.

Figure 8. ROC curves and AUC analysis for six datasets with m = 4096 and q = 32.

Figure 11. Revocability results from the proposed method on the (a) Ethnic, (b) IMDB Wiki, and (c) AR datasets.

Table 1. Description of training and testing datasets.

Table 2. Accuracy of the model for various values of m and q = 32.

Table 3. Accuracy of the model for various values of q and m = 4096.

Table 4. Performance comparison for varying values of (k, n) with m = 2048 and q = 32.

Table 5. Average performance of unimodal and multimodal CB systems on six datasets for varying values of m and q.

Table 7. Summary of works related to cancelable multimodal biometric systems.

Table 8. Complexity of the false acceptance attack on the proposed system.