Learning to Combine Local and Global Image Information for Contactless Palmprint Recognition

In the past few years, there has been a leap from traditional palmprint recognition methodologies, which use handcrafted features, to deep-learning approaches that are able to automatically learn feature representations from the input data. However, the information that is extracted from such deep-learning models typically corresponds to the global image appearance, where only the most discriminative cues from the input image are considered. This characteristic is especially problematic when data is acquired in unconstrained settings, as in the case of contactless palmprint recognition systems, where visual artifacts caused by elastic deformations of the palmar surface are typically present in spatially local parts of the captured images. In this study we address the problem of elastic deformations by introducing a new approach to contactless palmprint recognition based on a novel CNN model, designed as a two-path architecture, where one path processes the input in a holistic manner, while the second path extracts local information from smaller image patches sampled from the input image. As elastic deformations can be assumed to most significantly affect the global appearance, while having a lesser impact on spatially local image areas, the local processing path addresses the issues related to elastic deformations thereby supplementing the information from the global processing path. The model is trained with a learning objective that combines the Additive Angular Margin (ArcFace) Loss and the well-known center loss. By using the proposed model design, the discriminative power of the learned image representation is significantly enhanced compared to standard holistic models, which, as we show in the experimental section, leads to state-of-the-art performance for contactless palmprint recognition. Our approach is tested on two publicly available contactless palmprint datasets—namely, IITD and CASIA—and is demonstrated to perform favorably against state-of-the-art methods from the literature. The source code for the proposed model is made publicly available.


Introduction
The area of palmprint recognition is a well-established subfield of biometrics and has been widely deployed commercially throughout the years [1,2]. Palmprint recognition techniques typically rely on two kinds of features for identity recognition in addition to the obvious principle lines, i.e., palmar friction ridges (the ridge and valley structures like fingerprint) and the palmar flexion creases (discontinuities in the epidermal ridge patterns), as illustrated in Figure 1a,b, respectively [1,3,4]. These characteristics represent rich sources of discriminative information based on which efficient recognition models can be designed.
Palmprint-based biometric systems have been increasingly researched recently due to their high recognition accuracy, usability, and acceptability [1,2]. Over the last few years, there has been a transition from contact-based to contactless palmprint recognition approaches with a considerable amount of research focusing on the latter problem [1,2,[4][5][6][7][8][9][10][11]. Different from palmprint images captured with contactbased sensors, palmprint images captured in a contactless manner may be taken with commercial off-the-shelf cameras and in unconstrained environments [1]. This not only widens the range of potential application scenarios where palmprint recognition can be used, but also brings improved convenience to users. Additionally, contactless palmprint acquisition offers many other benefits over contact-based alternatives, such as higher user friendliness, increased privacy, and better hygiene [4]. Nevertheless, capturing palmprints in a contactless manner also leads to new challenges related to increased appearance variability caused by rotations, ambient illumination, translations, different distances from the sensor, elastic deformations, as well as by the presence of higher levels of noise [12]. Among these, elastic deformations of the palmar surface (originating from the lack of contact surface that would constrain the placement of the hand) are one of the biggest issues for the recognition technology. This is illustrated in Figure 2, where a contact-based (shown at the top) as well as a contactless (shown at the bottom) acquisition procedures are presented. As can be seen, the unconstrained placement of the hand typically results in palmprint samples that exhibit significant deformations compared to images captured with contact-based sensors. It has been shown that conventional palmprint recognition methods often degrade in performance when applied on such types of data due to the severe inter-class variability induced by the acquisition procedure. To avoid such performance degradations, more powerful models capable of extracting robust and discriminative images representations are needed.
To address the problems encountered in challenging deployment scenarios, researchers are increasingly looking into deep-learning (DL) techniques when designing biometric solutions. Such techniques are capable of: (i) extracting knowledge and learning discriminative representations from (partially noisy) data automatically and in an end-to-end manner, (ii) adapting to the characteristics of the biometric samples, and (iii) achieving more accurate recognition performance in less-constrained environments compared with traditional hand-crafted feature extraction techniques. Several recent solutions, therefore, consider DL techniques for contactless palmprint recognition [1,4,6,13,14].
However, they exhibit at least one (or more) of the following shortcomings. First, the majority of existing techniques process input images in a holistic manner, where the local information that is preserved in the computed representation is limited only to the most discriminative parts of the input. In unconstrained settings this can be problematic, since different parts of the input might be elastically deformed, which greatly affects the performance of such methods. Second, deep-learning models are often designed for closedset recognition tasks (using a softmax over a fixed set of classes), raising questions about the discriminability of the feature representations learned internally for open-set problems.
Moreover, existing (hybrid) methods that utilize both global and local image information fail to do so in a transparent way, that would allow interpretation of the results and lead to explainable decisions. This has implications for the future application of such models and their compliance with existing privacy laws, such as GDPR [15,16].
(a) (a) (a) (a) Figure 2. Illustrative comparison between a contact-based (at the top) and a contactless acquisition procedure used in palmprint recognition systems. Because the placement of the hand is not constrained by a contact surface, images acquired in a contactless manner exhibit larger variability (also due to elastic deformations of the palmar surface) as shown by the example palmprint images on the right. Note that both example palmprints originate from the same subject.
In this work, we aim to address the problems outlined above by presenting a hybrid DL based technique for contactless palmprint recognition, which: (i) utilizes both global and local information, (ii) learns highly discriminative features with considerable generalization capability, while not requiring extensive parameter tuning, and (iii) ensures interpretable recognition results. To the best of our knowledge, the proposed approach represents the first hybrid DL technique developed specifically for contactless palmprint recognition. Different from other DL based solutions, our model also exhibits a high-level of robustness to elastically deformed images and, as we show in the experimental section, ensures state-of-the-art recognition performance.
In summary, we make the following contributions in this study: • We present a hybrid DL approach for contactless palmprint recognition based on a novel two-path network architecture that takes global as well as local image information into account. • We propose an attention-based channel pooling operation, which is capable of (adaptively) extracting the most discriminative (local) information from sampled patches of the given input image. • We introduce thin plate splines (TPS) as an augmentation procedure for explicitly modeling elastic deformations and demonstrate that the proposed augmentation procedure is beneficial for recognition performance. • We show the benefit of combining local and global image information into discriminative representations that lead to a state-of-the-art performance for contactless palmprint recognition on two publicly available datasets.

Related Work
In this section we discuss prior work related to this study and cover some of the most important representatives of conventional as well as deep learning-based palmprint recognition methods. The reader is referred to some of the excellent surveys on palmprint recognition for a more in-depth coverage of the field [17][18][19].

Conventional Methods
Early work on palmprint recognition focused on hand-crafted feature extraction techniques to encode the palmprint structure. Jia et al. [8], for example, presented a palmprint descriptor, called Histogram of Oriented Lines (HOL) inspired by the established Histogram of Oriented Gradients (HOG) descriptor [8] and developed a recognition approach around HOL that also considered subspace learning methods for dimensionality reduction, i.e., PCA, LDA, and SRDA, as well as their corresponding kernel versions KPCA, KLDA, and KSRDA. The proposed approach was tested on two palmprint datasets, i.e., PolyU II [20] and PolyU MB [21]. Kumar [2] explored the use of improved alignment and matching strategies for contactless palmprint recognition using several publicly available contactless palmprint datasets and showed that these lead to state-of-the-art performance. Wu et al. [22] presented a contactless palmprint verification method that used an isotropic filter for preprocessing and SIFT based feature extraction and matching. The proposed method utilized a two-stage strategy, i.e., in the first stage, an iterative RANSAC (I-RANSAC) algorithm was employed to remove the mismatched points, and in the second stage, Local Palmprint Descriptors (LPDs) were extracted at SIFT keypoints and used to further refine the keypoint matching procedure. As a detection score, the authors used the final number of matched SIFT keypoints.To evaluate the performance of the proposed approach, two publicly available contactless palmprint databases were considered, i.e., the IIT Delhi (IITD) Touchless Palmprint Database [23], and the CASIA palmprint database [24]. Zheng et al. [11] presented a new descriptor referred to as Difference of Vertex Normal Vectors (DoN) for 2D palmprint matching. The descriptor is based on the ordinal measure which partially describes the difference of the neighboring points' normal vectors. Using the DoN descriptor, the authors were able to extract 3D information from 2D palmprint images, which was demonstrated to be highly stable under commonly occurring illumination variations during contactless imaging. The proposed method was evaluated on several publicly available 2D palmprint databases in identification as well as verification mode with highly encouraging results.
While the solutions reviewed above achieved impressive performance for palmprint recognition, they are still limited in the amount of discriminative information they are able to extract from the input images given their hand-crafted design. A richer and more descriptive set of features can in general be obtained through the use of deep learning models, such as the one presented in this work.

Deep Learning Methods
More recently, the focus in the field of palmprint recognition shifted from traditional feature engineering approaches to the deep learning based solutions [1,13,14,25]. Fei et al. [18], for example, evaluated a number of feature extraction methods for contactless palmprint recognition, including four deep learning architectures, i.e., AlexNet [26], VGG-16 [27], GoogLeNet [28] and Res-Net-50 [29]. These networks were pre-trained on ImageNet [30] and then fine tuned using extracted palmprint regions of interest (ROIs). The authors concluded that deep learning methods achieve comparable or even higher performance than conventional palmprint methods. Genovese et al. [6] introduced the PalmNet architecture, a novel Convolutional Neural Network (CNN) that is capable of fine tuning specific palmprint filters through an unsupervised procedure based on Gabor responses and Principal Component Analysis (PCA), while not requiring class labels during the training process. The authors validated their approach on several publicly available datasets, i.e., the CASIA palmprint database V1 [24], the IITD database (Version 1.0) [23], the REgim Sfax Tunisia (REST) hand database 2016 [25] and the Tongji Contactless Palmprint Dataset [5]. In all cases, they obtained recognition accuracies greater than those of the prior methods from the literature [6]. Svoboda et al. [1], presented a Siamese-type CNN for palmprint image recognition based on a novel loss function related to the d-prime index. Their approach automatically learns features from the data, thus not requiring extensive parameter tuning. By using their design, the authors achieved greater class separation between the genuine and the impostor score distributions and better scalability than competing methods, while not requiring large amounts of training data. The authors also achieved state-of-the-art verification results on the standard IIT Delhi [23] and CASIA palmprint [24] datasets. Liu and Kumar [4], designed a fully convolutional and highly optimized network that used a soft-shifted triplet (SSTL) loss function to learn discriminative palmprint features. The presented method was shown to offer superior generalization capability over different publicly available contactless palmprint databases, while not requiring database-specific parameter tuning. By using the proposed design, the authors consistently outperformed several classical and state-of-the-art palmprint recognition methods in identification, as well as in verification experiments. Zhu et al. [31] introduced an adversarial metric learning methodology for palmprint recognition and reported superior performance over competing techniques on eight different datasets.
The approach in this paper is related to the above methods in that it also uses deep learning to address the problem of contactless palmprint recognition but differently from the existing solutions, it takes not only global image appearance into account, but also the most discriminative local image cues that are aggregated through a novel attention based channel pooling operation. Moreover, to further boost the verification performance the proposed model uses a powerful learning objective that allows it to learn highly discriminative feature representations.

Proposed Method
In this section we present our new DL-based methodology for contactless palmprint recognition. First, we explain how the model encodes global and local image characteristics from the palmprint images. Next, we introduce the attention mechanism. Finally, we proceed to the learning mechanism used to further enhance the discriminative power and the quality of the learned deep palmprint features.

Overview of the Proposed Model
The idea behind our approach is illustrated on Figure 3. Compared to other relevant deep learning methods, our method relies on a dual-path model architecture, where one path processes the input in a holistic manner through a backbone CNN feature extractor and the second path captures local (patch-based) information through a backbone Siamese-based CNN feature extractor. In the local processing path feature representations are extracted from each of the patches sampled from the input image and aggregated by a novel pooling mechanism, called channel-wise attention pooling. At the final stage, the global holistic and the aggregated local features are combined by using a simple concatenation procedure. The model is trained end-to-end with a learning objective that combines the Additive Angular Margin (ArcFace) loss with the center loss. This overall training objective helps the model learn highly discriminative features, which is crucial for contactless palmprint recognition. The overall idea with our design is that the complementary information is learned within the two processing paths, where the local path compensates for the image characteristics that are not captured well by the global processing path. Since the presence of elastic deformations of the palmar surface is a major obstacle for palmprint images acquired in a contactless manner, we designed our model under the assumption that a combination of such global and local information and joint supervision through the combined loss, would lead to more robust and discriminative image representations. As we show in the experimental section, the addition of local image representations contributes toward better and more robust overall recognition performance. The following sections provide detailed explanations of all the building blocks of the proposed method.

The Global Processing Path
The global processing part of the model (shown in Figures 3a and 4) is responsible for capturing global palmprint characteristics, similarly to standard CNNs used for recognition purposes in other areas of biometrics [27,32,33]. The processing path consists of a simple CNN backbone feature extractor ψ g that takes a palmprint ROI x ∈ R w×h×3 as input, where w and h denote image dimensions, and generates a global image representation f g as the results. Formally, this can be written as: where θ g denotes the parameters of the global processing path that need to be learned during training. While any backbone CNN can be used in our approach, we select the popular VGG-16 model [27] in this work because of (i) its competitive performance, and (ii) the fact that an open source implementation is publicly available.
To make the global processing path applicable to differently sized input images, we make sure the backbone CNN is fully convolutional. Thus, the d-dimensional global features f g ∈ R d are generated through an average pooling layer in case of the VGG-16 backbone, as also illustrated in Figure 4.

The Local Patch-Based Processing Path
In the second processing path of our model (shown in Figure 3b and in detail in Figure 5), the global palmprint image is first decomposed into N smaller patches-sampled from a fixed grid with overlap ( Figure 5). Here, N is an open hyper-parameter that can be set depending on the data characteristics. This processing path differs from the global path in that it represents a Siamese CNN with shared model parameters. Thus, the same feature extractor is used for all image patches. This design is able to process all N patches in a single shot. By focusing only on a small palmprint area at the time, the local processing path is able to encode important local image characteristics that are not properly represented in the holistic representation extracted by the global processing path. If we denote the set of sampled local patches as , where x i ∈ R w ×h ×3 and w ≤ w and h ≤ h, then the local image representations can be described as: where ψ l i denotes the i-th component of the Siamese structure, θ l are the shared model parameters,  Similarly as in the global processing path, the local processing path can be implemented with arbitrary backbone CNNs in the Siamese structure. However, we again utilize a fully convolutional VGG-16 model in this work to make our approach more coherent as illustrated in Figure 5. Next, we aggregate all local feature vectors into a N × d dimensional feature matrix F as shown in Equation (3), where N denotes the number of patches and d is the feature dimension, i.e., The generated feature matrix then serves as the basis for a novel pooling operation that is described in depth in the next section.

The Attention Mechanism
In order to obtain the final most relevant d-dimensional feature representation from the local feature matrix F, a novel pooling operation called channel-wise attention-pooling is proposed in this paper. This operation is implemented through a dedicated CNN, as shown in Figure 6, and has its own trainable parameters. Suppose the output of the local processing path is the N × d feature matrix F, then the output from the attention-pooling layer can be calculated as follows: where and whereF denotes the weighted feature matrix, represent the element-wise or Hadamard product, w is a vector of weights, 1 d ∈ R d is a d-dimensional vector of all ones, y is the output from the last fully connected (FC) layer with the number of output units equal to the number of patches N, and ζ is the softmax activation function. After applying the softmax activation function, the vector of weights w = [w 1 , w 2 , . . . , w N ] T has the following property: and can be considered as a vector of probabilities. The final local feature vector f l is obtained by averaging the weighted feature matrix over the patch dimension (indicated by the superscript) as shown in Equation (7), i.e., The main idea behind the proposed channel-wise attention-pooling mechanism is to encourage the model to focus on the most discriminative parts of the input, while giving less importance to less discriminative parts. The proposed approach also offers a straightforward way of exploring the importance of each image part (by examining the weights in w) which contributes towards high explainability of our model. In summary the mapping ψ a performed by the proposed attention mechanism is formally described as: where θ a stands for the set of parameters of the attention block from Figure 6. Figure 6. Illustration of the channel-wise attention-pooling mechanism. The local feature matrix F is reweighted along the patch dimension to give more importance to more informative local image patches. Shown is the architecture, implemented in the model we have used.

Combining Information
In the next step, the global and local features are concatenated to form the combined feature representation f c , i.e., where ⊕ is the concatenation operator. This representation is then fed to a series of additional processing layers with parameters θ f , as shown in Figure 3d. Various forms of regularizations are also included to prevent overfitting and to enhance the stability of the model during the training process. One such form of regularization is the dropout regularization, which prevents the model from learning interdependent sets of feature weights. The use of the batch normalization as a second form of regularization, increases the stability of the model by normalizing the output of the previous layer, i.e., by subtracting the batch mean and dividing by the batch standard deviation. This layer has its own trainable parameters that are learned along with the remaining model parameters. Before computing the loss, the final embeddings are normalized to unit L 2 norm, which is a common strategy that proved beneficial in the open literature when training models with angular margin losses [34,35].

Discriminative Feature Learning Approach for Deep Palmprint Recognition
The training process for the proposed palmprint recognition model is done by using a combined learning objective of the following form: where and In the above equations, L aam is the Additive Angular Margin (aam) loss, also known as the ArcFace loss [34], L C is the center loss [36] and λ is a hyper parameter that balances the two [36]. In Equation (11), m denotes the angular margin penalty on angle Θ i , s is a feature scale parameter, cos Θ y j is the logit for each class y j , while the batch size and the class number are denoted as M and n, respectively. For the center loss in Equation (12), x i denotes the i-th feature vector belonging to the y i -th class, and c y i is the y i -th class center. The center update in the center loss function is done over mini-batches as follows: where The hyper parameter α controls the learning rates of the centers and helps to avoid large perturbations caused by a few mislabelled samples, ∆c t j (Equation (14)) denotes the difference between the current center vector and the individual feature vectors, t stands for the iteration number, and δ(·) is equal to 1 when the inside condition is satisfied and 0 otherwise.
The motivation for using the combined loss is two-fold: (i) the ArcFace loss L aam contributes towards better separation between classes in the embedding space and ensures high discriminability of the learned feature representations, whereas (ii) the center loss L C minimizes intra-class variations and encourages features from the same classes to be pulled closer to their corresponding class centers [36,37]. While the ArcFace loss was originally designed to avoid the need for additional (discriminative) learning objectives (such as center loss), we show in the experimental section that the combination of the two losses still leads to improved performance.

Model Implementation
We implement our model in the PyTorch framework (available at: https://pytorch.org latest accessed on 21 December 2021) and use a built-in implementation of the VGG-16 model initialized with weights learned on the ImageNet dataset for all of our backbone feature extractors. We modify the backbone models in order to accept arbitrary sized input images by replacing the last max pooling layer in both processing paths with the average pooling layer with kernel size of 2, which yields 2048-dimensional embeddings. We set the number of patches in the local processing path to N = 9 with a size of 75 × 75 pixels, which in our preliminary experiments was observed to provide a good trade-off between performance improvement of the overall model and its computational complexity. Both the global (holistic) and the aggregated local features are then concatenated to form a 4096 dimensional feature representation. After that, the combined feature vector is fed to a series of layers as summarized in Table 1 and illustrated in Figure 3d. Additional implementation details can be found in the publicly available source code: https://github. com/Marjan1111/tpa-cnn latest accessed on 21 December 2021. Table 1. Description of the used layers with their corresponding output dimensions after the concatenation procedure. This part of the model is shown in Figure 3d. Throughout all of our experiments we compare the global processing path ( Figure 4) as a standalone CNN feature extractor against the two-path CNN feature extractor ( Figure 5) in order to explore how the model benefits from using both global and local image information.

Experimental Datasets
Two publicly available contactless palmprint datasets are used to evaluate the performance of the proposed method in this study, namely the IITD and CASIA palmprint datasets. Details on the two datasets are given below.

IITD Palmprint Image Dataset
The IIT Delhi (IITD) Touchless Palmprint Dataset [23], collected by the Indian Institute of Technology Delhi, provides images from the left and right hands of more than 230 volunteers in the age group between 14-57 years. Images in the database were collected with a simple touchless imaging setup in an indoor environment and with a circular fluorescent illumination around the camera lens. A total of 2601 color hand images from 460 palms with 7 samples for each hand of each person were acquired in a single session. The IITD database also provides automatically segmented (i.e., extracted Regions-Of-Interests-ROIs) and normalized images with the size of 150 × 150 pixels and at least 5 samples for each of the two hands. Some typical example images from the IITD dataset from the left and right palmprints with their corresponding ROIs are shown in Figure 7a.

CASIA Palmprint Image Dataset
The second palmprint database used in our experiments is the CASIA Palmprint Image Dataset [24] (or CASIA-palmprint for short), collected by the Institute of Automation of the Chinese Academy of Sciences, which is one of the largest (to date) palmprint database available in terms of the number of individuals [24]. It contains 5052 palmprint images captured from 312 individuals, both males and females, with at least 8 samples for each hand of each individual acquired in a single session. The images were acquired by a self-developed palmprint recognition device with a CMOS imaging sensor fixed on top of it. No pegs were used to restrict postures and positions of the palms. The acquired images are in an 8-bit grayscale format with dimensions of 640 × 480 pixels. During data capture, the subjects were required to put their palms into the device and lay them on a uniform-colored background, to generate evenly distributed illumination. As explained in [2], there are several things that need to be considered w.r.t. this database. The individual "101" is the same as individual "19", and therefore these two classes were merged into one class. The 11th image from the left hand of individual "270" was misplaced to the right hand, and the 3rd from the left hand of individual "76" represents a distorted sample with very poor quality. These problematic images were excluded from our experiments.
As the CASIA database does not provide segmented palmprint images, we used the segmentation procedure from [6] to extract the palmprint ROIs. The final dataset adopted for our evaluation contained 301 individuals and a total of 5467 palmprint ROIs. A few illustrative sample images from the dataset are presented in Figure 7b.

Experimental Setup
To evaluate the performance of the proposed model, verification experiments were conducted over the two experimental datasets. Every two samples in the datasets are matched. If the two images were from one person (class), the matching is considered genuine (a mated comparison), otherwise, it was considered an impostor matching (nonmated comparison). Following the all-vs.-all experimental protocol, the number of genuine verification attempts conducted (N genuine ) and the number of impostor attempts (N impostor ) can be computed with Equations (15) and (16), as follows: where N hands is the number of hands, N enroll is the number of enrollment palmprint images of the the subjects, and N class is the number of unique classes in a given dataset. The cosine similarity function is used to compute client and impostor matching scores S, as seen in Equation (17), where n is the length of the feature vector, f p is the feature vector extracted from the given probe image, and f e is the feature vector extracted from the enrollment image.
where · denotes the scalar product, || · || is the L 2 norm operator, and f ·,i is the i-th element of vector f · .
For the experiments we partition the image datasets into three disjoint subsets, i.e., (i) a training set, which is used to learn the model parameters, (ii) a validation set, a subset of the overall training data, used to monitor the training progress and provide an unbiased estimate of the model fit when tuning hyperparameters, and (iii) a testing set which is utilized to evaluate the final verification performance.
Since the two datasets contain different numbers of samples per class, we perform a simple undersampling process. This process ensures that each class has as many samples as the class with the minimum number of samples. The number of samples in the datasets before and after the undersampling are shown in Table 2. By using this process, we balance the datasets with the goal of removing potential biases for the training procedure. On the other hand, the undersampling discards some potentially valuable sources of information. We address this issue through severe data augmentations that further stimulate training and prevent overfitting.

Performance Metrics
Following standard methodology [32,38,39], the following error rates commonly used for verification experiments are adopted to evaluate the performance of the tested models: • Area Under the Curve (AUC): AUC represents a performance metric that measures the overall performance of a learned binary model and is typically computed from a standard Receiver Operating Characteristic (ROC) curve. This metric is widely used to assess performance of biometric systems operating in verification mode. A poor model fit results in AUC ≈ 0.5 which indicates randomness, on the other hand, a good model fit results in AUC ≈ 1. • Verification Rate at the False Accept Rate of 0.1% (VER@0.1FAR): This performance measure corresponds to an operating point on the ROC curve and is computed as follows: where θ ver01 is the relevant decision threshold, FAR denotes the False Accept Rate, and FRR is the False Reject Rate. FAR corresponds to the percentage/fraction of times an invalid (impostor) user is accepted by the system and is calculated as follows: Similarly, FRR represents the percentage/fraction of times a valid (genuine) user is rejected by the system and is calculated as follows: In Equations (20) and (21), {d imp } and {d gen } represent sets of impostor and genuine scores generated during the experiments, respectively, | · | denotes a cardinality measure, and θ k represents the decision threshold. • Verification Rate at the False Accept Rate of 1% (VER@1FAR): This measure corresponds to another operating point on the ROC curve defined as follows: where θ ver1 denotes the relevant decision threshold. • Equal Error Rate (EER): The last performance measure, EER, is defined with a decision threshold θ eer that ensures equal values of FAR and FRR as follows: Additionally, we also use Receiver Operating Characteristic (ROC) curves to visualize the performance of the models throughout all experiments. Note that all performance scores defined above are provided in % when reporting results.

Training Details
We train our model(s) with the combined objective function from Equation (10). During the training process, some hyperparameters remain fixed, while others are further analyzed in a broader range to study their impact. The parameters that we vary in our experiments are the hyper-parameters of the combined objective function, i.e., the λ parameter, which balances the impact of the ArcFace and the center losses, and the parameter α that controls the learning rate of the centers in the center loss function. The optimization of the base model parameters is done with the Adam optimizer with a learning rate of 1 × 10 −4 . The channel-wise attention-pooling operation, which has its own trainable parameters, is trained with a learning rate of 1 × 10 −2 .
To prevent overfitting we use severe data augmentations and modify data on the fly, so that our CNN models is transformation invariant to the maximum possible extent. The augmentations are done with the Imgaug (available at: https://github.com/aleju/imgaug, accessed on 12 December 2021) library and the following image transformations: • Histogram equalization, • Rotations in the range ±10 • , • Random cropping (with a 50% chance), where between 5% and 30% of the original image is cropped away, • Multiplication of all image pixels with a random value v sampled once per image in the range (0.9, 1.2), • Application of one of the following: (1) Random corruption of p percent of all pixels within an image area by salt-andpepper noise, where p ∈ (1, 10)%. The size of the image area is between 2% and 20% of the size of the input image, (2) Random replacement of a certain fraction of pixels from the image with zero, with a percentage value p ∈ (1, 10)%, with 20% per-channel probability, (3) Random replacement of rectangular masks from the image with zeros, with a percentage value p ∈ (1, 10)%. The size of the masks is between 2% and 20% of the size of the input image, • Application of one of the following: (1) Blurring with a Gaussian kernel with a sigma value between (0, 3.0), (2) Blurring using averaging over neighborhoods that have random sizes, which can vary between 5 and 11 in height and 1 and 3 in width, (3) Application of motion blur with a kernel size of 15 × 15 pixels. Through these data augmentations we ensure that our model sees new variations of the data at each and every epoch of the training procedure, and (almost) never sees the exact same image multiple times, which is always beneficial when fine-tuning models trained initially on large scale datasets like ImageNet to downstream tasks. By carefully choosing the image augmentations, we can additionally boost the predictive performance of the learned model, which further enhances its generalization capability. Another reason to use data augmentation is because we want to lower the gap between the training and validation performance during the training process. All augmentations are performed in random order and with 50% probability, which means there is a (small) chance for not performing augmentations at all.
For illustrative purposes, several image augmentations with the patch decomposition procedure from a single augmented image are shown in Figure 8a for IITD, and in Figure 8b for the CASIA dataset. It is important to note that the second input in the proposed Two-Path CNN model ( Figure 5), takes the same augmented image as input, but decomposes it into N smaller patches sampled in a grid-like fashion with overlap. By doing this, we ensure that the global and the local processing path are looking at the same input data during training.

Quantitative Results
The main goal of this section is to rigorously evaluate the proposed Two-Path CNN palmprint recognition model and to compare its performance with competing state-of-theart methods from the literature. Throughout the experiments, we evaluate all models on different subsets of images by randomly splitting the test data into five equal folds and adopting the all-vs-all experimental protocol in each of the folds. We use this approach to be able to also estimate confidence scores on all reported results in addition to the overall performance.
For the competing methods we report results for several traditional feature extraction methods, using the code provided in [37,40]. Furthermore, we also compare our results against four state-of-the-art methods from the literature, i.e., three DL-based from [4,31,41], and one traditional (non DL-based) solution [11], previously reported to achieve state-ofthe-art performance on standard benchmarks.

Comparison with Competing Methods
For the comparison with traditional feature extraction approaches we consider seven dense-descriptor based methods, i.e., Local Binary Patterns (LBPs) [42], Rotation Invariant and Local Phase Quantization (RILPQ [43] and LPQ [44]) features, Histograms of Oriented Gradients (HOG) [8,45], Binarized Statistical Image Features (BSIF) [46], Patterns of Oriented Edge Magnitudes (POEM) [47] and features extracted with the help of a bank of the Gabor filters [48]. These feature extraction techniques are used (with default parameters from [40]) to infer compact representations from the palmprint images for matching. To facilitate a fair comparison, the cosine similarity is again used to produce the comparison scores.
The results of the comparison are presented in the form of ROC curves in Figure 9 and in the form of performance scores in Table 3. We can see that the proposed Two-path CNN approach clearly outperforms all traditional feature extraction methods at every evaluated operating point, with BSIF, RILPQ and POEM being the closest competitors. We also observe that the Global-CNN, when used as a standalone feature extractor, is able to outperform the traditional methods in almost all cases except for BSIF, RILPQ and POEM, when evaluated on the right palmprints from the CASIA dataset. Additionally, we notice that the local image properties captured by recognition techniques based on HOG, LBP and Gabor features, are not sufficient to extract useful information from contactless palmprint images where the presence of elastic deformations is known to be problematic. We can conclude that even though the Global-CNN model processes the input image in a holistic manner, the very nature of the CNN models helps to extract discriminative local information due to the local connectivity of the convolutional kernels. However, the addition of the local image properties through the local processing path in the proposed Two-path model boosts the verification performance while making the deeply learned features more discriminative and more resilient to the presence of elastic deformations and other artefacts caused by the contactless acquisition procedure.

Comparison with State-of-the-Art Methods
Next, we compare the proposed method against three state-of-the-art DL-based palmprint recognition techniques and one non DL-based method. To ensure a fair comparison, we either use the evaluation protocol used by the authors of the baseline techniques or utilize the code provided by the authors of the baseline methods (where available), and use our evaluation protocol for the comparison.
The first state-of-the-art method is a highly optimized DL architecture, referred to as Residual Feature Network (RFN) [4]. RFN is based on a fully convolutional network that generates deeply learned residual features trained with the Soft-Shifted Triplet Loss (SSTL).
The key advantage of this method is that it offers superior generalization capabilities across different datasets. FRN-SSTL does not have fully connected layers which results in pure feature map outputs and is beneficial for preserving spatial correspondences among the most discriminative palmprint areas. The second method is referred to as Difference of Vertex Normal Vectors (DoN) and represents a non-DL based method [11].
The key advantage of DoN lies in the ability to extract 3D information from 2D palmprint images, which is beneficial when dealing with contactless palmprint images, where the presence of different variations in illuminations is common. The next method selected for the comparison represents a DL approach initially presented in [31] and uses the GoogLeNet DL architecture as a backbone feature extractor. We denote this method as AML_N-Pair in this paper. To obtain discriminative and equidistributed embeddings, the authors propose a novel optimization objective which consists of a pair of adversarial loss functions, specifically, a distance metric term and a confusion term. By using the joint supervision of those two terms, this method was reported to achieve strong generalization in cross-dataset contactless palmprint recognition experiments. We also compare the results against the Weight-based Meta Metric Learning (W2ML) approach proposed in [41]. This method uses metric learning in a meta way to extract discriminative palmprint features using an end-to-end DL network, which was shown to be robust and efficient for diverse palmprint recognition scenarios.
It is important to note that the experimental results are done only on the IITD dataset since some authors did not provide results for the CASIA dataset. Table 4 summarizes the experimental results for IITD in terms of EER values. The proposed Two-path model clearly outperforms the DoN, AML_N-Pair and W2ML approaches and ensures an EER reduction of around 50% compared to any of these methods. Additionally, it performs comparably to the recent RFN-SSTL method, where a very similar EER score is achieved by the proposed approach. The difference in the point estimates of the EER scores of RFN-SSTL and our approach is statistically not significant. The reported results point to the discriminative power of feature representations generated by the proposed Two-path model and their suitability of contactless palmprint recognition. Furthermore, we note that we are able to further improve on the results reported here by explicitly modeling elastic deformations using a Thin Plate Spline (TPS) transformation. We provide detailed information on this process (not yet used for generating results in this section) in Section 5.4. Table 4. Comparison against state-of-the-art methods from the literature. The comparison is done on the IITD dataset in terms of EER values. All results are presented in (%).

Sensitivity Analysis and Ablations
In this section for the ablation study we conduct a wide range of experiments by examining different parts of our method.
To explore the impact of various model components on performance we conduct a sensitivity analysis. Specifically, we investigate the sensitiveness of models to changes of the parameters in the combined training objective, i.e., the λ parameter (Equation (10)), which balance the contribution of the two learning objectives, and the α parameter that controls the learning rate of the centers (Equation (13)) in the center loss function. Note that the experiments also include cases where λ = 0 and, hence, one of the loss terms is ablated. Additionally, we also provide a detailed look into the performance of the global and local processing paths and their contributions to the overall performance of the complete Two-path model.

Impact of λ and α Parameters
In the next series of experiments we train multiple models with different λ and α parameters and investigate how sensitive the training procedure is to changes in hyperparameter values and how these values affect performance. Note that we use only the IITD dataset for this sensitivity analysis. In the first experiment, we fix the α parameter to 0.5 and vary λ from 0 to 0.1, and in the second experiment we vary α between 0.2 and 1, while using the value of λ from the best performing model obtained in the first experiment. The reason why we restrict α in that specific range is because when using values below 0.2, we observed that the model did not converge.
The results of the analysis are reported in the form of box plots of the calculated EER scores in Figure 10. To highlight the differences between the learned models we use different colors for each box plot, where the best and the worst performing ones are highlighted with lime and red colors, respectively. We can clearly see that the best performing model when trained on IITD left palmprints (Figure 10a,b), is obtained when λ = 0.01 and α = 0.5. The same applies when the model is trained on the right IITD palmprints. We notice that the model trained only with the ArcFace loss already ensures competitive performance. However, the inclusion of the center loss with the right set of parameters is observed to further improve results. Thus, the box-plots clearly show that both loss terms are important for the proposed model.

Contribution of Global and Local Processing Paths
The proposed Two-path model consist of two distinct processing paths. To explore the contribution of the two paths to the overall performance of the proposed model, we extract the global features as well as the local features before the concatenation operation and perform verification experiments based on each image representation. We compare the performance of the two processing paths with the full two-path model in Table 5. It is evident that the use of combined image information plays an important role to further improve the verification performance of the global processing path. The addition of local information consistently improves results and allows the Two-path model to produce more reliable verification decisions. In terms of EER, for example, we observe relative error reductions ranging from 6% and up to 48%-depending on the dataset and the used experimental setup. Thus, the information added by the local processing path clearly complements the information extracted by the global path and helps addressing issues caused by the contactless acquisition procedure. When interpreting the reported results, it is important to note that the complete two-path model is trained end-to-end, so the performance of the learned global and local representations needs to be considered in this context. The reported performances for the global and local features are, therefore, correctly interpreted as the partial contributions of the processing paths to the overall performance of the model and not in terms of capacities of the individual representations for recognition. The performances here would be quite different if the representations would be trained one without the other. Table 5. Summary of averaged performance metrics when using different types of learned image descriptors. The results are presented in the form of a µ ± σ and in (%).

Explicit Modeling of Elastic Deformations
In this series of experiments we try to explicitly address the problem of elastic deformations which represent one of the major problems with contactless palmprint recognition techniques.
A common way to tackle this problem is to achieve invariance to warping through some kind of spatial transformation on the input images. Based on this insight, we use a Spatial Transformer Network (STN) [49,50] to learn a warping procedure based on Thin Plate Splines (TPS). TPS is selected as the basis for warping because it ensures a higher degree of freedom when modeling deformations compared to other affine transformations. The TPS algorithm has been applied successfully in several computer vision applications, especially in medical image registration [51], as well as a minutiae matching algorithms that model the elastic deformations in fingerprint images [52]. For detailed explanation about the STN and TPS, the reader is referred to [49,50,53], respectively. However, to the best of our knowledge, it has not yet been applied as a warping technique for modelling elastic deformations in palmprint images acquired in a contactless manner. It is important to note that we did not integrate the STN directly into our Two-path model. Instead we used it as an external source to generate TPS warped images. This means that in each iteration we generate an additional batch of warped images which are then combined with the current batch of images when training the model. With this procedure, we are forcing our augmentations to be TPS warped with 50% chance.
The results before and after the TPS-STN warping applied on sample images from the IITD and CASIA datasets are shown in Figure 11. We can clearly see that the images after the warping procedure are somewhat deformed. This effect can be mostly seen around the principal lines of the palmprint, which is also the region where elastic deformations are most prominent. The ROC curves generated with and without data augmentation based on TPS warping are shown in Figure 12, and the respective box-plots of the corresponding EER scores are presented in Figure 13. As can be seen from the results, the TPS warping has a significant impact on the verification performance, which results in lower EER values and narrower confidence intervals.

Before TPS warping
After TPS warping

Qualitative Results
In this section we present different qualitative results related to the Two-path architecture. We mainly focus on the following analyses: • Exploring the discriminative power of the deeply learned embeddings through the Tdistributed Stochastic Neighbor Embedding (t-SNE) [54] data visualization technique, • Investigating the feature importance and attention mechanism of the Twopath architecture, • Qualitative evaluations of edge cases generated in the verification experiments.

Embedding Visualization with t-SNE
We use t-Distributed Stochastic Neighbor Embedding (t-SNE) [54] to visualize the high dimensional embeddings generated by the proposed model from IITD and CASIA images. t-SNE represents a nonlinear dimensionality reduction technique developed by the authors in [54], which is suitable for embedding high-dimensional data for visualization into a low-dimensional space of two or three dimensions. For a detailed explanation about the t-SNE technique, the reader is referred to [54].
In our case, we use t-SNE to generate 2-D visualizations in order to see the clustering effect produced from the joint supervision between ArcFace loss and the center loss. In Figure 14 the learned 2-D embeddings for IITD and CASIA datasets are shown. From the 2-D representation, we can clearly see the effect of the center loss which efficiently pulls the features to their corresponding centers, thus, minimizing the intra-class variations. On the other hand, the ArcFace loss helps the embeddings from different classes to spread uniformly on the hyperspace by maximizing the inter-class variations. We also notice that the embeddings from the CASIA dataset are less uniformly spread compared to the embeddings from IITD dataset. This effect can be seen from Figure 14c, where we can see regions of less discriminative embeddings. As seen from Figure 14d, there are also regions where the discriminative power between the embeddings is preserved, however these regions suffer from high locality, i.e., there are features located in several small areas of the embedding space isolated from the rest of the embeddings. Such cases where the combined objective did not produce sufficiently discriminative features can possibly lead to miss-classifications of palmprint images.

Feature Importance and the Attention Mechanism
Next, we report qualitative results with respect to the attention mechanism integrated within the proposed Two-path architecture. As discussed earlier (in Section 3.4), the so called channel-wise attention pooling mechanism is a trainable attention operation which offers a straight-forward way of exploring the importance of each image part through the learned feature vector of weights. This feature vector is then converted to a vector of probabilities by using the softmax activation function. To generate interpretable results, we transform the learned vector of probabilities into heatmaps and then superimpose the computed heatmaps over the palmprint images. In Figure 15a,b some qualitative evaluations for the IITD and CASIA datasets are shown, respectively. We can see how the attention mechanism selects certain image regions that are not only locally important but are also beneficial for the overall image representation. Note how this procedure is adaptive in nature and selects a different area in each palmprint depending on image characteristics. Additionally, the selected areas are also quite localized due to the use of the softmax function in the attention mechanism.

Verification Experiment
Finally, we present qualitative results related to the verification experiments conducted on the IITS and CASIA dataset. Specifically, we are interested in separability of the score distributions for the mated and non-mated experiments as well as failure cases generated by our Two-path model.
In Figure 16 the mated and non-mated score distributions expressed as probability density functions (PDFs) [1] when evaluating on different palmprints from the datasets are shown. It can be observed that the Two-path architecture achieved good separation between the genuine and impostors score distributions, where the combined learning objective efficiently minimized the intra-class variations, while maximizing the inter-class variations, which validates our previous analyses. It is also clear that the impostor score distributions are centered at zero, which points to the fact that inter-class embeddings are well distributed in the hyper feature space [55]. Here we can also see the presence of the "double-peak" phenomenon [55]in our case in the mated (genuine) score distributions [55]. This implies that some features of same classes exhibit weaker similarities and are therefore harder to recognize. Next, we preform an verification experiment based on the decision threshold θ eer , which ensures equal values of the FAR and FRR. Since there is a slight overlap between the score distributions, we can expect that some of the genuine verification attempts will be falsely rejected, i.e., false negatives (FN), and vice versa, some of the impostor verification attempts will be falsely accepted i.e., false positives (FP). For illustration purposes, we perform the experiment on the palmprint images from both datasets. We present the results in Table 6 by visualizing some of the compared image pairs and their respective cosine similarity scores. As can be seen from Table 6, the matching scores between the correctly accepted mated/genuine attempts, i.e., true positives (TP), are positive and above the decision threshold, whereas the matching scores between the correctly rejected impostor attempts, i.e., true negatives (TN), are negative, which indicates opposite feature vectors. The results in the overlapping region between the genuine and impostor distribution are also quite interesting. As seen from the falsely rejected genuine attempts or FNs, palmprints from the same subjects share less similarities due to the high appearance variability caused by the palmprint acquisition process. For example, in CASIA right palmprint images there is an extreme case where one palmprint image between the genuine matches is completely white due to high illuminations that happened during the palmprint acquisition process. This means that the Two-path architecture is not able to find any shared patterns between such images, which results in lower matching score. On the other hand, there are cases where the exact opposite happens. Different palmprints from different individuals share more similarities, which results in relatively high scores and above the threshold similarity scores.

Conclusions
In this study we introduced a hybrid Two-path DL based model for contactless palmprint recognition. Unlike competing hand-crafted feature extraction techniques, our architecture automatically learns discriminative features from the data while not requiring extensive parameters tuning. Compared to other DL techniques, our model is designed as dual path CNN, where one input encodes the input image holistically using a backbone CNN-based feature extractor, whereas the second path, a parallel Siamese architecture with shared model parameters, captures local information from image patches sampled from the input image. Moreover, we showed that the discriminative power of the deeply learned features can be highly enhanced by combining the ArcFace loss with the center loss to jointly supervise the learning of our model. By using this design we were able to achieve great genuine-impostor score distribution separation which is key in biometric recognition systems and paramount for achieving state-of-the-art performance on standard benchmarks.
We also performed extensive experiments related to different parts of the architecture and showed how the addition of local information contributes to enhanced robustness to elastic deformations, as well as, increased generalization capability across different datasets. Furthermore, we showed that our design choices ensure a level of explainability facilitated by the proposed channel-wise attention pooling mechanism. Given the promising results, further extension of this work will focus on palmprint recognition in the wild and explore how the model generalizes when images are acquired under complex backgrounds, in different environments and with various postures and illuminations. Furthermore, different feature pooling strategies and attention mechanisms that could maximize the information content that can be extracted from the local image patches, especially when considering detailed image aligned will also be investigated. Particular attention will be payed to the impact of alignment on the patch generation strategy used.