Contactless Palm Vein Recognition Based on Attention-Gated Residual U-Net and ECA-ResNet

Abstract: Palm vein recognition has received considerable attention for its use in biometric identification. Palm vein characteristics offer a higher level of security and reliability in personal identification than extrinsic traits such as fingerprint, face, and palm print, as vein patterns are difficult to duplicate and do not change throughout one's lifetime. This study proposes both segmentation and recognition methods to enhance authentication performance and achieve correct identification using palm vein features. First, we propose a segmentation model based on U-Net, enhanced with an attention gate, to effectively segment palm vein patterns. Incorporating both the attention gate and the residual block allows the segmentation model to learn the essential features required for the specific segmentation task. The Hessian-based Jerman filtering method is used for ground-truth labeling. The segmentation model extracts the palm vein patterns and filters out irrelevant and noisy pixels before recognition. An efficient channel attention residual network is then trained with combined margin-based loss functions to learn discriminative features for personal identification. The channel attention module enhances useful information and suppresses irrelevant features in the feature maps, which overcomes the problems of rotation, position translation, and scale transformation and improves the recognition rate. The combined loss function increases the similarity between intra-class samples and the diversity between inter-class samples. The proposed recognition model achieved 100% accuracy for palm vein recognition and an equal error rate of 0.018 for palm vein verification.


Introduction
Biometric identification is an authentication process that leverages unique physical or behavioral human features. During the COVID-19 pandemic, researchers showed substantially increased interest in contactless biometric identification over contact-based biometrics such as fingerprints [1], because contact-based methods require users to place their fingers directly on a sensor, which is undesirable for health reasons. Contactless biometric systems, such as those using the iris [2], face [3], or palm print [4,5], are considered more practical recognition systems for this reason. However, these extrinsic biometric features are susceptible to spoofing and can be significantly affected by factors such as the age and health of the person, the condition of the skin, or injuries. In contrast, the palm vein is an intrinsic contactless biometric trait and offers several advantages. First, it provides better privacy and security, as the palm vein pattern can only be captured using special equipment, which hinders forgery. Second, palm vein patterns remain stable throughout an individual's lifespan and disappear in the absence of blood flow. Third, human palms contain highly complicated and unique vein patterns; the structure of the pattern differs even between identical twins. Moreover, the collection of contactless palm images is comfortable, easily accepted by users, and more hygienic, as there is no need for any contact between the user's hand and a sensor on a public device. This complexity and consistency make palm veins an exceptionally reliable biometric feature for personal identification, surpassing external features. Palm veins are often captured via the reflection method. In this approach, near-infrared light from the sensor illuminates the person's hand. Because the hemoglobin in a vein absorbs more near-infrared radiation than the surrounding tissues, the palm vein patterns appear dark. This reflective approach allows for contactless pattern recognition and user identification. In conclusion, palm vein authentication is considered more secure, private, and practical, and it therefore offers excellent research potential and broad application prospects.
Recently proposed approaches for palm vein recognition generally face three problems. The first is image quality at the data collection stage. The acquisition of contactless palm images using near-infrared light often results in poor contrast between vein pixels and non-vein areas. The visibility of the palm vein can be compromised by various factors, including surrounding temperature, lighting conditions, and illumination. This causes noise and optical blurring in the palm images, which degrades recognition accuracy. Enhancing or segmenting accurate vein patterns is crucial for improving the robustness of palm vein features; however, it is difficult to extract blood vessels and effectively remove the optical blurring. In particular, this makes it challenging to extract the precise vein pattern using hand-crafted methods based on distribution assumptions [6,7]. Second, contactless palm vein biometric systems suffer from image rotation, position translation, and scale transformation. Third, most deep-learning-based palm vein recognition studies have so far focused on a classification approach using the SoftMax loss, a popular loss function for recognition tasks. Although the SoftMax loss offers excellent inter-class separation, it is ineffective in minimizing intra-class diversity, which makes the learned features less discriminative [8,9].
The attention-gated residual U-Net addresses the first problem. The U-Net-based segmentation model segments near-accurate palm vein vessels from the original grayscale images and filters out irrelevant and noisy pixels before recognition. To attain more accurate segmentation with minimal computation, we integrated an attention gate mechanism into the U-Net model. Inspired by the residual U-Net [10], the segmentation model introduces residual learning into each convolutional block. The efficient channel attention residual network (ECA-ResNet) addresses the second problem by enhancing the channel-wise information in the feature maps and suppressing useless features, which overcomes the problems of rotation, position translation, and scale transformation. This recognition model uses ResNet as the backbone to alleviate the vanishing-gradient problem, in which repeated multiplication during back-propagation may make the gradient vanishingly small. The combined loss function proposed in this paper addresses the third problem by reducing inter-class similarity and improving intra-class compactness for palm vein recognition systems in which only a few samples are available for each class.
The main contributions of this work can be summarized in four points:

The rest of this paper is organized as follows. Section 2 discusses the related work. Section 3 presents the proposed methodology for palm vein recognition. Sections 4 and 5 describe the experiments and results. Finally, Section 6 concludes the paper.

Handcrafted Methods
The different handcrafted approaches used for palm vein recognition can be categorized into four groups: geometric-based methods, statistical-based methods, local invariant-based methods, and subspace-based methods.
Geometric-based methods, which use geometric elements such as points, lines, or curves, have been studied as a means of extracting the palm vessels [6,11,12]. However, problems such as external light conditions, low contrast, skin scattering, and optical blurring, which generally occur in contactless palm vein images, make precise vein segmentation difficult. Moreover, geometry-based methods offer low discrimination, as they suffer directly from rotation, scaling, and translation.
Statistical-based methods, such as local binary patterns (LBP) [13][14][15][16], modified local binary patterns [17,18], local texture patterns [19], local tetra patterns [20], and the local directional texture pattern (LDTP) [21,22], are used to extract the rich texture-based features of blood vessels. Because they rely on pixel-to-pixel image processing, these methods suffer from weak texture, are very vulnerable to image noise, and are sensitive to the image rotation and shifting caused by displaced hands. Therefore, they often achieve a low identification rate. Moreover, encoding the palm vessel structure also degrades the feature representation of the local binary pattern.
Local invariant-based methods [23][24][25][26][27], such as the scale-invariant feature transform (SIFT), speeded-up robust features (SURF), and RootSIFT, can overcome the problems of scale uncertainty, orientation, and translation. Thus, this approach can be considered a competitive handcrafted feature extraction method. However, it incurs a lengthy computation time and can produce incorrect verification results, because external factors create unstable feature points in low-quality palm images.
Subspace-based approaches [28][29][30][31][32][33][34] have also been proposed to project the training data into a lower-dimensional space. These methods include principal component analysis (PCA), linear discriminant analysis (LDA), Fisher linear discriminant (FLD), and independent component analysis (ICA). Because they rely on manual feature extraction rather than learned features, these feature-based systems are generally time-consuming and error-prone.

Deep Learning Methods
Obayya et al. [35] proposed a CNN architecture for palm vein recognition using Bayesian optimization. The recognition accuracy of the CNN model is very high, but its error rate is still high for verification problems. This approach applies the Jerman filtering method to raw ROIs at different scales to enhance the palm vein images. The maximum filter responses across scales are taken as the final output, which produces complex palm vein structures that may differ from the actual vein pattern. Moreover, applying handcrafted vein enhancement at different scales to each ROI image is time-consuming and impractical in real applications. Pan et al. [36] proposed a multi-scale deep representation approach for palm vein recognition. Because the training database is small, training a deep convolutional neural network becomes challenging, since it relies substantially on the amount of training data. That study proposed multi-scale deep representation aggregation to remove noisy features from a pretrained CNN and refine the feature maps using a local mean threshold approach.
Wu et al. [37] proposed the wavelet denoising ResNet. The wavelet denoising (WD) model removes noise and optical blurring from palm vein images by enhancing the low-frequency features. The network combines the ResNet-18 model with the squeeze-and-excitation (SE) module to achieve better performance. However, there is a trade-off between performance and complexity, as the fully connected layers in the excitation path of the SE module increase the complexity of the model. That work used wavelet denoising as a sub-band to enhance the low-frequency features, which are fused with the deep learning network through a residual connection. The WD model helps remove the image noise caused by skin scattering and optical blurring in the high-frequency part. However, this approach requires further enhancement for low-contrast and blurry palm vein images. Pan et al. [38] extracted semantic palm vein features using multilayer convolutional feature concatenation. Chen et al. [39] proposed a lightweight CNN and an adaptive augmentation method for palm vein authentication. Categorical cross-entropy loss has been widely used in these approaches, but it lacks discriminative separation between intra-class and inter-class samples.
Moreover, only a few studies have focused on the segmentation problem. Felix et al. [40] and Wang et al. [41] both proposed palm vein segmentation models based on the U-Net architecture. However, the experimental results were not satisfactory because of the poor feature representation in the initial layers used in the skip connections, which may cause redundant low-level feature extraction. Similar to the U-Net architecture, PVSNet [42] proposed a novel Siamese method using triplet loss and an adaptive hard mining technique. The pretrained model is composed of an encoder and a decoder for learning enforced palm vein features. Positive and negative samples are separated using triplet loss, a popular loss function introduced by FaceNet [3]. However, the triplet loss function still has drawbacks: the model accuracy depends on the triplet selection during training.

Attention-Gated Residual U-Net and ECA-ResNet
Figure 1 shows the overall flowchart of the proposed palm vein recognition system. First, the ROI extraction method is used to locate the area that is to be identified. The U-Net-based segmentation model removes redundant noisy information and enhances the domain-specific features that are important for recognition. However, the ground-truth label data needed to train the U-Net model are not provided by any of the available palm vein databases. Therefore, we used a handcrafted method to label the palm vessels. Once the segmentation model became stable after optimization, its parameters were frozen, and the segmentation output was connected to the ECA-ResNet for further training for identification. The margin-based ArcFace loss, focal loss, and triplet loss functions were applied to train the ECA-ResNet. For testing and evaluating the model, the output of ECA-ResNet can be used in two ways. First, the 512-dimensional feature embedding can be used for a distance comparison between the registered template and a query using Euclidean distance or cosine similarity. Second, the SoftMax probability prediction can be obtained from the binary head of ECA-ResNet. The following sections detail each step separately.
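The embedding-based comparison mentioned above can be illustrated with a minimal sketch. It assumes two 512-dimensional embeddings are already available; the function names and the decision threshold of 0.5 are hypothetical and are not taken from the paper.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (1 means same direction)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    """Euclidean (L2) distance between two embedding vectors."""
    return float(np.linalg.norm(np.asarray(a, float) - np.asarray(b, float)))

def verify(template, query, threshold=0.5):
    """Accept the query if it is similar enough to the registered template.

    The threshold is a hypothetical value; in practice it would be chosen
    on a validation set, for example at the equal-error-rate operating point.
    """
    return cosine_similarity(template, query) >= threshold
```

In a verification setting, the threshold trades false acceptances against false rejections, so it is normally tuned rather than fixed in advance.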

Region of Interest Extraction
Similarly to face recognition systems, only part of the information acquired by the near-infrared (NIR) camera is necessary, and region of interest (ROI) extraction is essential for improving model performance. This step also improves the computational efficiency of the palm vein recognition system by reducing the template size. Despite advantages such as better hygiene and user-friendliness, contactless systems produce inconsistent images, with hand displacement, rotation, and zooming to different degrees. Therefore, this study uses a reliable background subtraction and ROI extraction process as the preprocessing method to overcome these problems. The preprocessing involves five steps: segmenting the hand from the background using binarization, locating hand contours, positioning the centroid and detecting key points, normalization, and ROI extraction. First, Gaussian blur is applied to remove the image noise caused by external light. The contactless palm images generally contain the entire hand against a darker background due to the variations in NIR light response. Therefore, the hand region can be segmented using the Otsu binarization method. Occasionally, when lighting makes some background areas as bright as the hand, the output of the Otsu binarization may contain a few smaller, irregularly shaped areas in addition to the segmented hand. These areas can introduce noise and compromise the accuracy of the subsequent steps. Thus, morphological operations such as erosion and dilation are used to eliminate these irregular areas and obtain better segmentation results.

After the hand contour is obtained from binarization, the point C is positioned at the centroid to compute the radial distance function (RDF). The RDF is computed as the distance between each point on the hand contour and the centroid C. Once the radial distances for all contour points are calculated, the maxima and minima can be determined by finding the local maxima and minima of the RDF curve, as shown in Figure 2. The maxima typically correspond to the fingertips, as these are the points furthest from C. The minima correspond to the finger valleys, as these are the points closest to C. From the five fingertip maxima, the thumb, identified as the fingertip with the smallest radial distance from C, is excluded to simplify the process. Finally, four fingertip maxima and three corresponding finger-valley minima are obtained, as shown in Figure 3c. From these points, the leftmost and rightmost valley points are defined as P1 and P2, respectively. Because normalizing the hand position is a fundamental step for identification, we rotationally normalized the images by aligning them horizontally. The normalizing angle θ is computed as the angle between d, given in Equation (1), and the dotted red line, as shown in Figure 3d. The hand images are then zoomed and rotated according to θ in Equation (2) using bilinear interpolation, as depicted in Figure 3e.
Most existing methods rely solely on P1 and P2 for ROI extraction. However, such an approach is error-prone because of the diversity of hand shapes: part of the selected ROI may fall outside the actual hand boundary. Thus, this study adaptively determines the size and location of each ROI image. First, from the four fingertip points obtained, the leftmost and rightmost ones are marked as T1 and T2, respectively. Next, two boundary points, P3 and P4, are obtained from the outer hand contour, followed by two reliable points, E1 and E2, as presented in Figure 3e. P3 lies on the hand contour at the same boundary distance from P1 as T1. Similarly, P4 lies at the same boundary distance from P2 as T2. E1 and E2 are then located at the midpoint between P3 and P1 and the midpoint between P4 and P2, respectively. The ROI image is extracted directly from E1 and E2.
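The radial distance function step described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the contour is assumed to be an ordered array of (x, y) points, the centroid C is approximated by the mean of the contour points, and the function names are hypothetical. A real contour would usually need smoothing before extrema are searched.

```python
import numpy as np

def radial_distance_function(contour):
    """Distance from every contour point to the centroid C.

    contour: (N, 2) array of ordered (x, y) points on the closed hand contour.
    The centroid is taken as the mean of the contour points (an assumption).
    """
    contour = np.asarray(contour, dtype=float)
    centroid = contour.mean(axis=0)
    rdf = np.linalg.norm(contour - centroid, axis=1)
    return centroid, rdf

def local_extrema(rdf):
    """Indices of strict local maxima and minima, treating the curve as circular."""
    rdf = np.asarray(rdf, dtype=float)
    prev, nxt = np.roll(rdf, 1), np.roll(rdf, -1)
    maxima = np.flatnonzero((rdf > prev) & (rdf > nxt))
    minima = np.flatnonzero((rdf < prev) & (rdf < nxt))
    return maxima, minima

def drop_thumb(maxima, rdf):
    """Discard the fingertip candidate with the smallest radial distance."""
    order = np.argsort(np.asarray(rdf)[maxima])
    return np.sort(maxima[order[1:]])
```

On a hand contour, the maxima returned by `local_extrema` would be the fingertip candidates and the minima the finger-valley candidates, from which P1 and P2 are then chosen.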

Palm Vein Segmentation
Palm vein segmentation is the process of extracting vascular vein patterns from noisy and low-contrast palm vein images. This section describes the ground-truth preparation method, the detailed network architecture of the attention-gated residual U-Net, and the loss functions used in this process.

Ground-Truth Labeling
Currently, no existing database includes annotated labels for training and evaluating palm vein segmentation algorithms. As a result, we had to generate the ground-truth labels for palm vein segmentation ourselves by employing an existing handcrafted vein enhancement algorithm. This approach allowed us to create a reliable and accurate set of labeled data for training and evaluating our algorithm. In this study, the Jerman filtering method [43] was used. It analyzes the eigenvalues of the Hessian, and with carefully selected parameters such as the kernel size, its output closely approximates the actual palm vein pattern. The Jerman filtering method is widely used to strengthen the intensity of vessels in contrast to non-vein areas. Denoting $I(X)$ as the 2-D input at coordinate $X = [x_1, x_2]$, the Hessian of $I(X)$ at $X$ and scale $s$ is represented as Equation (3), where $G$ is a bivariate Gaussian filter. The Jerman filter response is then computed as Equation (4). The Gaussian kernel size and the parameter τ are the crucial factors when using a handcrafted method such as the Jerman filter, as they can produce responses ranging from a simple vessel output to complex and sensitive textures, which is also the practical reason behind the need for deep-learning-based segmentation. Depending on the variations in the light, texture, size, and shape of vein images collected with different devices and setups, arbitrarily chosen parameters may give inconsistent results and lack versatility. The Jerman filter is used to extract the vascular shape of the palm vessels from the ROI, as shown in Figure 4b. After experimenting with different Gaussian kernels, a kernel size of three was selected as the optimal value, as smaller kernels make the response more sensitive to image noise and larger kernels can create unnecessary and complicated texture in the image. It is crucial to obtain near-accurate ground-truth labels for palm vein segmentation. A threshold is then applied to the filtered images to remove weak and unclear responses, and the pixel values are set to 0 for the palm vessel and 255 for the background to form the final labels.
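Because the filter equations did not survive in the text above, the following is only a hedged sketch of the commonly cited 2-D form of the Jerman response, followed by the thresholding and labeling step. It assumes the Hessian eigenvalue of largest magnitude has already been computed per pixel and sign-normalized so that vessel pixels give positive values; `lam`, `tau`, and `threshold` are hypothetical names and values, not parameters reported by the authors.

```python
import numpy as np

def jerman_response(lam, tau=0.5):
    """Vesselness in [0, 1] from a sign-normalized Hessian eigenvalue map.

    lam: array of eigenvalues (positive where a vessel-like ridge is present).
    tau: regularization parameter in (0, 1]; smaller values boost weak vessels.
    """
    lam = np.asarray(lam, dtype=float)
    lam_max = lam.max()
    # Regularized eigenvalue: low positive values are lifted to tau * max.
    lam_rho = np.where(lam > tau * lam_max, lam, tau * lam_max)
    lam_rho = np.where(lam > 0, lam_rho, 0.0)

    response = np.zeros_like(lam)
    valid = (lam > 0) & (lam_rho > 0)
    saturated = valid & (lam >= lam_rho / 2.0)
    partial = valid & ~saturated
    response[saturated] = 1.0
    l, r = lam[partial], lam_rho[partial]
    response[partial] = l**2 * (r - l) * (3.0 / (l + r)) ** 3
    return response

def to_label(response, threshold=0.5):
    """Binary label image: 0 for vessel pixels, 255 for background."""
    return np.where(np.asarray(response) >= threshold, 0, 255).astype(np.uint8)
```

The 0/255 convention in `to_label` follows the labeling rule stated above; the threshold value itself would need to be tuned per dataset.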

Attention Mechanism
We incorporated the attention gate mechanism of [44] into our segmentation model. This mechanism is applied after a series of convolutions and serves as an enhancement module that diminishes irrelevant regions and amplifies crucial features. Combining the attention mechanism with the U-Net skip connections can achieve better results while keeping the added computation minimal. The detailed architecture of the attention mechanism is shown in Figure 5. $g_i$ is the gating signal, which provides contextual information that can be used to define focus areas for a given feature map $x_i$. The attention coefficient α detects the salient regions while refining feature responses to retain only the activations that are relevant for segmentation. The final feature map is obtained by multiplying $x_i$ and α element-wise, which is defined as Equation (6). To compute the gating coefficient α, additive attention is used, which is defined as Equation (7), where $\sigma_1$ and $\sigma_2$ are the ReLU and sigmoid functions, respectively, $W_x$, $W_g$, and $\psi$ are linear transformations, and $b_g$ and $b_\psi$ are biases.
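The additive attention gate described above can be sketched in NumPy as follows. This is a minimal illustration under assumptions: the linear transformations are applied per pixel (equivalent to 1 × 1 convolutions), the gating signal is assumed to be already resampled to the resolution of the skip features, and all shapes and names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x, g, W_x, W_g, psi, b_g, b_psi):
    """Additive attention gate.

    x:   (H, W, Cx) skip-connection features
    g:   (H, W, Cg) gating signal at the same spatial resolution (assumed)
    W_x: (Cx, Ci), W_g: (Cg, Ci) per-pixel linear maps
    psi: (Ci,) projection to a scalar per pixel; b_g: (Ci,), b_psi: scalar
    Returns the gated features and the attention map alpha of shape (H, W).
    """
    q = np.maximum(x @ W_x + g @ W_g + b_g, 0.0)  # sigma_1: ReLU
    alpha = sigmoid(q @ psi + b_psi)              # sigma_2: sigmoid, values in (0, 1)
    return x * alpha[..., None], alpha            # element-wise re-weighting of x
```

Because α lies strictly between 0 and 1, the gate can only attenuate the skip features, which is how irrelevant regions are suppressed before concatenation.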

Residual Units
In many multi-layer neural network models, the number of layers is increased to improve model performance. However, this impedes training and may cause a degradation problem [45]. Several existing studies use the residual neural network, which eases training and alleviates the degradation issue. The residual neural network is composed of stacked residual units. Adding residual units allows the model to be trained more efficiently, and the skip connections within each residual unit between the low and high levels facilitate back-propagation without degradation. Moreover, the model can be designed with fewer parameters while maintaining comparable performance on the specific task. Each residual unit is described by Equations (8) and (9):

$$y_l = h(x_l) + F(x_l) \qquad (8)$$

$$x_{l+1} = \mathrm{activation}(y_l) \qquad (9)$$

where $x_l$ and $x_{l+1}$ are the input and output features of the $l$th residual block, $F$ represents the residual function, and $h(x_l)$ is the identity mapping function, $h(x_l) = x_l$.
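As a small illustration of the residual unit described above, the sketch below assumes the identity mapping h(x) = x and uses ReLU as the activation; `residual_fn` stands in for the stacked convolutions F and is a hypothetical placeholder.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def residual_unit(x, residual_fn, activation=relu):
    """One residual unit: add the residual branch to the identity path, then activate."""
    y = x + residual_fn(x)  # identity mapping h(x) = x plus residual F(x)
    return activation(y)
```

If the residual branch outputs zero, the unit simply passes its input through the activation, which is why adding such units does not make the network harder to optimize than a shallower one.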

Proposed Attention-Gated Residual U-Net
U-Net [46] has demonstrated remarkable success in medical image analysis. The architecture comprises a down-sampling path and an up-sampling path, linked by skip connections. The network architecture of the proposed attention-gated residual U-Net is illustrated in Figure 6. In the U-Net model, using a pooling layer during down-sampling might result in the loss of certain image features. Additionally, the U-Net model employs skip connections to concatenate low-level features with high-level features, which can lose spatial detail, as low-level features often lack spatial information. To address these challenges, we incorporated the attention mechanism into the U-Net model. This addition helps suppress irrelevant feature responses and provides a more precise segmentation result. Moreover, deep neural networks often encounter the vanishing-gradient problem, in which the gradient shrinks toward zero due to repeated multiplication during back-propagation. To overcome this issue, we replaced the convolutions at each level of the encoder and decoder networks with residual blocks. The combination of residual blocks and the attention mechanism in the skip connections enhanced the segmentation results of our proposed method, surpassing the performance of the baseline methods.
The contracting path of the proposed model is composed of several residual units. Each residual unit includes two 3 × 3 convolution blocks and an identity mapping, where each convolution block contains a batch normalization layer, a ReLU activation layer, and a convolutional layer. The identity mapping connects the input and output of the residual unit. In the up-sampling part, the features obtained from the previous residual unit are tuned by computing the attention response using the lower-level features as the attention gate. The tuned output is then concatenated with the up-sampled feature maps. This process is repeated at each up-sampling stage, and the segmented output image is reconstructed in the final layer. With the benefits offered by the attention gate, the model can correctly predict the vascular shape of vein patterns.

Loss Function
The primary function of segmentation is to classify each pixel into the specified output classes. Therefore, cross-entropy loss, a popular loss function for classification problems, is often used to classify pixels. However, unlike character or text recognition, in which the text usually occupies a relatively large portion of the image, vein pixels occupy a much smaller portion of the image than the background, which leads to a class imbalance problem. This imbalance can bias a traditional loss function such as cross-entropy toward the majority class (background), affecting the model's performance. Hence, dice loss, which is better suited to such imbalances and is not affected by the ratio of foreground pixels to background pixels, is used in this research. The dice coefficient resolves the imbalance between foreground and background, but it ignores another imbalance, between easy and difficult instances. Dice loss (DL) is formalized as Equation (10), where p and g represent pairs of corresponding pixel values of the prediction and ground truth, respectively.
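A common form of the dice loss can be sketched as follows. The exact formulation used by the authors is not recoverable from the text, so this is an assumption: it uses the soft dice coefficient with a small smoothing term `eps` (a hypothetical addition that avoids division by zero on empty masks).

```python
import numpy as np

def dice_loss(pred, target, eps=1e-7):
    """Soft dice loss between predicted probabilities and a binary ground truth.

    pred, target: arrays of the same shape with values in [0, 1].
    Returns 1 - dice, so 0 means perfect overlap and 1 means no overlap.
    """
    p = np.asarray(pred, dtype=float).ravel()
    g = np.asarray(target, dtype=float).ravel()
    intersection = np.sum(p * g)
    dice = (2.0 * intersection + eps) / (np.sum(p) + np.sum(g) + eps)
    return 1.0 - dice
```

Since the loss depends only on the overlap and the sizes of the predicted and true foreground, a large background area does not dominate it the way it dominates pixel-wise cross-entropy.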

Palm Vein Recognition
Palm vein recognition is the method of identifying and verifying identities using palm vein templates. This section describes in detail the architecture of the proposed ECA-ResNet-50.

Efficient Channel Attention
Inspired by the SENet architecture [47], the ECA module [48] was proposed to enhance the performance of CNN models by highlighting valuable information in the feature maps and suppressing irrelevant features. Figure 7 illustrates the ECA module. Each feature map $F \in \mathbb{R}^{W \times H \times C}$ is squeezed by global average pooling (GAP) into the feature $F_{avg} \in \mathbb{R}^{1 \times 1 \times C}$; this is defined as Equation (11). ECA introduced local cross-channel interaction (local CCI) to reduce the computational overhead of the cross-channel interaction in SENet. Local CCI captures cross-channel interaction at a considerably lower cost by letting each channel interact only with the other channels in a small local group. First, the global parametric space described by $C \times C$ is reduced to a smaller localized space defined by $k \times C$, where $k$ is the pre-defined size of the local region, such that $k < C$. Attention based on the local CCI may thus be represented as Equation (12), where $\Omega_i^k$ is the set of $k$ adjacent channels of channel $y_i$, $y_i^j \in \Omega_i^k$, and σ represents a sigmoid activation function. The parametric space overhead $k \times C$ of the attention mechanism can be reduced further by having all channels share the same learnable weights, as shown in Equation (13), which decreases the number of parameters from $k \times C$ to $k$, which is relatively small.
Moreover, the shared local cross-channel interaction described above is achieved using a 1D convolution with kernel size $k$. By doing so, ECA can be expressed as Equation (14):
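The channel attention described above can be sketched in NumPy as follows. This is an illustration under assumptions rather than the authors' code: the feature map is stored as (H, W, C), the shared 1D kernel has odd length k, and the channel vector is zero-padded so that the output keeps C entries.

```python
import numpy as np

def eca(feature_map, kernel):
    """Efficient channel attention on a single feature map.

    feature_map: (H, W, C) array.
    kernel: (k,) shared 1D convolution weights with odd k.
    Returns the re-weighted feature map and the per-channel weights.
    """
    kernel = np.asarray(kernel, dtype=float)
    k = len(kernel)
    pooled = feature_map.mean(axis=(0, 1))   # global average pooling -> (C,)
    padded = np.pad(pooled, k // 2)          # zero padding keeps C outputs
    # Each channel interacts only with its k neighbours, using shared weights.
    interaction = np.array(
        [padded[i:i + k] @ kernel for i in range(len(pooled))]
    )
    weights = 1.0 / (1.0 + np.exp(-interaction))  # sigmoid
    return feature_map * weights, weights
```

The only learnable parameters in this sketch are the k kernel weights, which matches the reduction from k × C to k parameters discussed above.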

Proposed ECA-ResNet
Feature extraction is an essential process in biometric recognition, which aims to extract key features from the input image. At this stage, to improve prediction accuracy and generalization ability, a modified CNN based on ECA and ResNet-50 is proposed as the feature extractor to produce 512-dimensional feature embedding vectors for classification and metric learning. The ECA module is introduced into ResNet-50 to enable the model to learn the channel-wise information in the feature maps adaptively and efficiently, without significant computational overhead. The modified ResNet-50 model alleviates the vanishing-gradient and degradation problems, in which repeated multiplication during back-propagation may make the gradient vanishingly small. After the residual blocks, an average pooling operation is applied, and the 512-D feature embedding is fed separately into the binary head and the margin head for multi-task training.
Figure 8 shows the overall proposed network architecture. The modified ResNet-50 architecture, which consists of convolution layers, pooling layers, and an efficient channel attention module, is used as the backbone to extract palm vein features. The proposed network takes the segmented palm vein images from the preceding U-Net model as input, with a size of 1 × 112 × 112. The backbone comprises four stages of convolutional blocks, containing 3, 4, 6, and 3 residual blocks, with feature maps of size 64 × 56 × 56, 128 × 28 × 28, 256 × 14 × 14, and 512 × 7 × 7, respectively. The ECA module is applied after the two convolutions in each residual block to enhance the critical features while suppressing irrelevant ones. The ECA module consists of global average pooling, a 1D convolution, and a sigmoid activation function, which together compute the attention weights and refine the feature map to form $F_o$. This operation is repeated until the last convolutional block, followed by an average pooling operation and batch normalization to form the 512-dimensional feature embeddings.

Loss Function
As the proposed recognition model is trained with combined loss functions, the layers after the feature embedding output are arranged for multi-task training, with one branch per loss function. First, the feature embedding is used to calculate the triplet loss [49] by computing the Euclidean distances between pairs of genuine and imposter palm samples. To help the triplet loss learn features that generalize better to hard negative samples, a hard-mining strategy is applied before the triplet loss is computed. The embedding outputs are then passed separately to the binary head and the margin head, which are trained simultaneously with focal loss and ArcFace loss, respectively. In the margin head, an angular margin m is added to the target angle, and the result is multiplied by the feature scale s to compute the ArcFace loss. By combining triplet loss, ArcFace loss, and focal loss, the network learns to separate hard samples with highly discriminative features for identification and verification. The triplet loss in our experiment is defined as Equation (15), where a represents the anchor sample, p represents the positive sample, hn represents the hard negative sample, and M is the margin. The ArcFace loss [8], which is a modification of the SoftMax loss function, is used to obtain discriminative embedding features for palm vein samples. An additive angular margin penalty is applied to tighten the distances between intra-class samples and boost inter-class diversity. The margin penalty also corresponds exactly to the geodesic distance. The ArcFace loss is formulated as Equation (16).
$$L_{ArcFace} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{e^{s\cos(\theta_{y_i} + m)}}{e^{s\cos(\theta_{y_i} + m)} + \sum_{j=1,\, j \neq y_i}^{n} e^{s\cos\theta_j}} \tag{16}$$

where θ_j is the angle between the feature vector and the weight vector of class j, s is the scaling factor for the feature vectors, m represents the penalty imposed on the angular margin, and N and n stand for the size of the batch being processed and the total number of classes, respectively.
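As a rough illustration of the two losses described so far, the sketch below computes a triplet loss with the hardest negative for a single anchor and an ArcFace loss for a single sample. The function names, the toy vectors, and the values chosen for the margin M, the scale s, and the angular margin m are all hypothetical; the hyperparameters used in the experiments are not stated in this section.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def hard_triplet_loss(anchor, positive, negatives, margin):
    """Triplet loss using the hardest (closest) negative for this anchor."""
    d_ap = euclidean(anchor, positive)
    d_an = min(euclidean(anchor, n) for n in negatives)  # hard mining
    return max(d_ap - d_an + margin, 0.0)

def arcface_loss(cosines, target, s, m):
    """ArcFace loss for one sample, given cos(theta_j) for every class j."""
    logits = []
    for j, c in enumerate(cosines):
        if j == target:
            # Add the angular margin m to the target-class angle only.
            theta = math.acos(max(-1.0, min(1.0, c)))
            logits.append(s * math.cos(theta + m))
        else:
            logits.append(s * c)
    # Numerically stable negative log-softmax of the target logit.
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]

# Toy values (assumptions): one anchor, one positive, two negatives.
t_loss = hard_triplet_loss([0.0, 0.0], [1.0, 0.0], [[3.0, 0.0], [0.0, 2.0]], margin=1.5)
a_plain = arcface_loss([0.8, 0.1], target=0, s=30.0, m=0.0)
a_margin = arcface_loss([0.8, 0.1], target=0, s=30.0, m=0.5)
```

Setting m = 0 reduces the second function to a scaled SoftMax cross-entropy, and a positive margin makes the same sample more costly, which is what pushes the classes apart during training.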
Focal loss [50], a cross-entropy loss that is dynamically scaled by a scaling factor, was proposed to solve the class imbalance problem in binary classification. The scaling factor down-weights easy-to-classify samples so that hard-to-classify samples are penalized more heavily during training. In this study, the focal loss is modified to apply to multi-class classification, as shown in Equation (17), where C denotes the number of categories, y_i denotes the predicted probability distribution, and γ is the focusing parameter that controls how strongly the loss concentrates on hard samples. When γ > 0, the focal loss assigns more weight to hard-to-classify samples and less weight to easy-to-classify samples. Thus, ArcFace loss broadens the inter-class margins with the margin penalty, while hard triplet loss enhances the intra-class compactness. Meanwhile, focal loss is combined with triplet loss and ArcFace loss to focus more on hard-to-classify palm vein samples. The combination of these losses allows our deep recognition model to learn more specific discriminative features for palm vein recognition.
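The effect of the focusing parameter and the way the three terms might be combined can be sketched as follows. This is only an illustration under stated assumptions: it uses the common SoftMax-based multi-class focal loss, −(1 − p_t)^γ log p_t for the true class t, which may differ from the exact form used in this study, and it combines the terms with equal weights because the weighting is not given in the text. All function names are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def focal_loss(logits, target, gamma):
    """Multi-class focal loss for one sample: -(1 - p_t)^gamma * log(p_t)."""
    p_t = softmax(logits)[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

def combined_loss(triplet, arcface, focal, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three loss terms (equal weights are an assumption)."""
    w_t, w_a, w_f = weights
    return w_t * triplet + w_a * arcface + w_f * focal

easy_ce = focal_loss([5.0, 0.0], target=0, gamma=0.0)  # plain cross-entropy
easy_fl = focal_loss([5.0, 0.0], target=0, gamma=2.0)  # strongly down-weighted
hard_ce = focal_loss([0.0, 1.0], target=0, gamma=0.0)
hard_fl = focal_loss([0.0, 1.0], target=0, gamma=2.0)  # mildly down-weighted
total = combined_loss(0.2, 0.3, 0.5)
```

With γ = 0 the function reduces to cross-entropy. With γ = 2 the confidently classified sample keeps a far smaller fraction of its cross-entropy than the misclassified one, which is the down-weighting behaviour described above.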

Experiments

Dataset
The CASIA Multi-Spectral Palmprint Image Database [51] is a public palm print and palm vein dataset that is available for research. It consists of 7200 samples collected from 100 different people. The palm images of each hand were captured in two sessions separated by more than a month. Three samples were taken in each session, and each sample comprises six palm images captured simultaneously under six distinct illumination spectra (460 nm, 630 nm, 700 nm, 850 nm, 940 nm, and white light). Hand posture was also varied between samples to increase intra-class diversity and simulate real-world use.
Since each person provides palm samples for the left and right hands separately, and the palm patterns of the two hands differ, each hand is treated as one identity; the two hands of 100 persons therefore yield 200 identities in this study. Moreover, because palm veins appear clearly only under NIR illumination, only the samples captured at 850 nm and 940 nm are selected: images captured in white light contain no vein information, and spectra below 850 nm produce unclear vein images. Thus, of the 7200 images in the database, 2400 usable images are obtained.
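The dataset counts above can be checked with a few lines of arithmetic. The sketch assumes three samples per hand in each session, which is the reading that reproduces the stated totals; the variable names are illustrative.

```python
persons = 100
hands = 2
sessions = 2
samples_per_session = 3  # assumed to be per hand, per session
spectra = ["460 nm", "630 nm", "700 nm", "850 nm", "940 nm", "white"]
nir_spectra = [s for s in spectra if s in ("850 nm", "940 nm")]

identities = persons * hands                      # each hand is one identity
samples_per_identity = sessions * samples_per_session
total_images = identities * samples_per_identity * len(spectra)
selected_images = identities * samples_per_identity * len(nir_spectra)
images_per_identity = selected_images // identities

print(identities, total_images, selected_images, images_per_identity)
# prints: 200 7200 2400 12
```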

Experimental Setup
Initially, the palm images are normalized and aligned to a vertical orientation, which is required for calculating the RDF, and resized to 112 × 112 in the ROI extraction step. This orientation requirement has only a minor impact: for horizontally oriented palm images, a few adaptations may be necessary, such as adjusting the ROI extraction to account for changes along the x axis. However, once the ROI images have been identified and extracted, orientation changes do not significantly affect the performance of the U-Net segmentation model, as the model can cope with rotation and scale transformation. When training ECA-ResNet-50, data augmentation methods such as random cropping and random rotation (0.6) are applied. The network is optimized with the Adam optimizer and a weight decay of 2 × 10⁻⁴. An adaptive learning rate strategy reduces the learning rate from 1 × 10⁻⁴ at the initial epoch to 1 × 10⁻⁵ at the final epoch. The network is trained for a total of 60 epochs on the CASIA dataset. Training took approximately 3 h for the segmentation model and approximately 2 h for the recognition model. In terms of complexity and size, the attention-gated residual U-Net segmentation model comprises 2.4 M parameters with a model size of 9.2 MB, whereas the ECA-ResNet recognition model contains 26.9 M parameters with a model size of 98 MB. In our experiments, the average computational time per palm image through the entire pipeline was approximately 40 ms. This short inference time supports a better user experience, especially in applications that require instant identification. All experiments are run in parallel on four NVIDIA Titan X GPUs (12 GB of memory per GPU).
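The text gives only the two endpoints of the learning rate schedule, not its shape. As one possible reading, the sketch below interpolates exponentially between them over the 60 epochs; the exponential form and the function name `learning_rate` are assumptions rather than the schedule actually used.

```python
def learning_rate(epoch, max_epoch=60, lr_start=1e-4, lr_end=1e-5):
    """Decay exponentially from lr_start (epoch 0) to lr_end (last epoch)."""
    if max_epoch < 2:
        return lr_start
    progress = epoch / (max_epoch - 1)  # 0.0 at the first epoch, 1.0 at the last
    return lr_start * (lr_end / lr_start) ** progress

schedule = [learning_rate(e) for e in range(60)]
```

Any monotone schedule with the same endpoints (for example, linear or cosine decay) would match the description equally well.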

Evaluation Metrics
This section describes the primary evaluation metrics used for both the segmentation and authentication tasks. First, to assess the performance of palm vein segmentation using the attention-gated residual U-Net model, both the intersection over union (IoU) and the dice coefficient are used. For biometric authentication, accuracy alone is not a sufficient evaluation measure. Thus, the proposed ECA-ResNet-50 is evaluated using identification accuracy, precision, recall, F1 score, and equal error rate (EER).

IoU Coefficient
The IoU coefficient, commonly known as the Jaccard index, is a popular metric for evaluating segmentation performance. It measures the overlap between the ground-truth mask pixels and the predicted output pixels and can be defined as Equation (18):

$$IoU = \frac{|p \cap g|}{|p \cup g|} \tag{18}$$

where p and g denote the pixel values of the prediction result and label, respectively.
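For binary masks, the IoU can be computed by counting pixels. The following is a minimal sketch, assuming masks stored as equally sized nested lists of 0/1 values; the function name `iou` and the convention of returning 1.0 for two empty masks are illustrative choices.

```python
def iou(pred, label):
    """Intersection over union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for p_row, g_row in zip(pred, label):
        for p, g in zip(p_row, g_row):
            inter += 1 if (p and g) else 0  # pixel is vein in both masks
            union += 1 if (p or g) else 0   # pixel is vein in either mask
    # Two empty masks agree perfectly; treat that case as IoU = 1.
    return inter / union if union else 1.0

pred = [[1, 1], [0, 0]]
label = [[1, 0], [1, 0]]
score = iou(pred, label)  # 1 shared pixel out of 3 marked pixels
```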

Dice Similarity Coefficient
In computer vision tasks, the dice similarity coefficient (DSC) is widely used to measure the similarity between the segmentation output and its label. DSC can be defined as Equation (19):

$$DSC = \frac{2\,|p \cap g|}{|p| + |g|} \tag{19}$$

where p and g denote the pixel values of the prediction result and label, respectively.
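A corresponding sketch for the dice similarity coefficient on binary masks is given below, under the same assumptions as before (nested lists of 0/1 values, a hypothetical function name, and a score of 1.0 for two empty masks).

```python
def dice(pred, label):
    """Dice similarity coefficient of two binary masks (nested lists of 0/1)."""
    inter = 0
    p_sum = 0
    g_sum = 0
    for p_row, g_row in zip(pred, label):
        for p, g in zip(p_row, g_row):
            inter += 1 if (p and g) else 0
            p_sum += 1 if p else 0
            g_sum += 1 if g else 0
    total = p_sum + g_sum
    # Two empty masks agree perfectly; treat that case as DSC = 1.
    return 2 * inter / total if total else 1.0

pred = [[1, 1], [0, 0]]
label = [[1, 0], [1, 0]]
score = dice(pred, label)  # 2 * 1 / (2 + 2) = 0.5
```

For binary masks the two metrics are monotonically related by DSC = 2·IoU / (1 + IoU), so they always rank models in the same order; the example masks give IoU = 1/3 and DSC = 0.5.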

Identification Accuracy
In biometric identification, which seeks to determine to whom a palm template belongs, identification accuracy is the percentage of correctly categorized samples, as shown in Equation (20).

EER can be defined as the error rate at which the false acceptance rate (FAR) and the false rejection rate (FRR) are equal. In general, the smaller the EER, the better the biometric system's accuracy and verification performance.
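One simple way to estimate the EER from two lists of matching scores is to sweep a decision threshold and find the point where FAR and FRR are closest. The sketch below assumes distance-like scores (smaller means more similar) and uses a hypothetical function name and toy scores. It is an approximation on a finite set of thresholds, not necessarily the procedure used in this study.

```python
def equal_error_rate(genuine, imposter):
    """Approximate EER from distance scores (smaller = more similar)."""
    best_gap = None
    eer = None
    # Every observed score is a candidate threshold: accept if score <= t.
    for t in sorted(set(genuine) | set(imposter)):
        far = sum(1 for s in imposter if s <= t) / len(imposter)  # false accepts
        frr = sum(1 for s in genuine if s > t) / len(genuine)     # false rejects
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap = gap
            eer = (far + frr) / 2
    return eer

# Toy scores (assumptions): one genuine pair and one imposter pair overlap.
genuine_scores = [0.1, 0.2, 0.3, 0.6]
imposter_scores = [0.4, 0.7, 0.8, 0.9]
eer = equal_error_rate(genuine_scores, imposter_scores)
```

In this toy example the threshold 0.4 accepts one of four imposter pairs and rejects one of four genuine pairs, so FAR = FRR = 0.25.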

Results
Experiments were carried out by splitting the dataset into 1920 training samples and 480 test samples. Initially, the attention-gated residual U-Net is trained until the network becomes stable, after which its weights are frozen and it is connected to the ECA-ResNet-50 feature extractor, which is then trained to learn discriminative features for recognition and verification. The segmentation and authentication results are reported separately.

Palm Vein Segmentation
To analyze and evaluate the proposed segmentation network's performance, the IoU coefficient and the dice similarity coefficient are calculated. In addition to the baseline U-Net model, U-Net variants that use only residual blocks or only an attention gate module are trained and evaluated as separate configurations. As shown in Figure 9, the segmentation result of the U-Net model (c) contains several incorrect predictions. In contrast, our proposed model (d) achieves better segmentation while suppressing irrelevant areas. Table 1 compares the state-of-the-art methods with our proposed method for palm vein segmentation. To assess the contribution of the segmentation model to palm vein recognition and verification, the ECA-ResNet model is additionally trained on the original grayscale ROI images without the segmentation model. As can be seen in Table 2, both the recognition and verification results are worse than those of the proposed approach, as the network could not learn sufficiently discriminative features from the low-resolution and unclear palm vein ROI images.

Palm Vein Authentication
This section includes two experiments, one for palm vein recognition and one for palm vein verification. Palm vein recognition involves identifying a palm vein image using a SoftMax probability output within a fixed number of classes. Palm vein verification involves a one-to-one comparison between the query palm image and the existing template to verify whether they have the same identity.
For recognition, the binary head of the proposed ECA-ResNet model is used, and the SoftMax prediction probabilities are obtained by applying the sigmoid activation to the logits, giving a 1-to-N classification over 200 classes (identities). Each class represents one identity (that is, one hand of one person). The input is a single palm ROI image produced by the segmentation model, and the output is a fixed-length vector of size 200 containing the probability of each identity. As shown in Table 3, our proposed network correctly predicts all 480 test samples, thereby matching or exceeding the accuracy of existing methods.

Method | Year | Accuracy (%)
PCANet with deep learning [52] | 2017 | 96.50
PVSNet [42] | 2018 | 85.16
Hong et al. [7] | 2019 | 96.33
CNN + Bayesian optimization [35] | 2020 | 99.40
TripletGAN VeinNet [53] | 2021 | 97
Explainable palm vein recognition [39] | 2021 | 100
Ensembling scale invariant and multiresolution Gabor scores [54] | 2022 | 99.73
Proposed Method | 2023 | 100

For verification, an experiment using a 1-to-1 matching technique is performed. First, genuine and imposter pairs are created from the 480 test samples, which cover 200 classes. In this experiment, the feature embedding layer immediately before the binary head and margin head of our proposed model (Figure 8) is used for comparison. The feature embeddings are normalized using the L2 norm before the distance between two vectors is computed to decide whether the vein sample belongs to the template. Since the test set includes at least two sample images (three for some classes) for each identity, the number of genuine matching scores is 360, and the number of imposter matching scores is 114,600. For both genuine and imposter pairs, each image is matched with every other sample exactly once. This ensures that no pair of samples is compared repeatedly, which could cause unbalanced and biased verification scores. The matching scores are computed using the cosine distance, so that scores are closer to 0 for intra-class pairs and closer to 1 for inter-class pairs. The cosine similarity between two samples with an angle θ is defined as Equation (24). As presented in Table 4, an equal error rate of 0.018 is obtained, which is a relatively low error rate for palm vein verification.
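The reported pair counts and the matching score can be reproduced with a short sketch. The split of 120 identities with two test images and 80 identities with three is not stated in the text; it is inferred here as the only combination of two- and three-image classes that gives 480 images over 200 identities. The function names are hypothetical, and the score is taken as one minus the cosine similarity of the L2-normalized embeddings.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """1 - cosine similarity, computed on L2-normalized embeddings."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    similarity = sum((a / norm_u) * (b / norm_v) for a, b in zip(u, v))
    return 1.0 - similarity

def count_pairs(labels):
    """Count genuine and imposter pairs when every unordered pair is used once."""
    genuine = 0
    imposter = 0
    for a, b in combinations(labels, 2):
        if a == b:
            genuine += 1
        else:
            imposter += 1
    return genuine, imposter

# Inferred test split (assumption): 120 identities x 2 images + 80 identities x 3 images.
labels = [i for i in range(120) for _ in range(2)]
labels += [i for i in range(120, 200) for _ in range(3)]
genuine, imposter = count_pairs(labels)
print(len(labels), genuine, imposter)  # prints: 480 360 114600
```

Under this split, comparing every unordered pair of the 480 test images once yields exactly the 360 genuine and 114,600 imposter scores reported above.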

Ablation Study
To study the importance of each loss function, an ablation study over the different loss functions was also conducted. As presented in Figure 10, the margin-based ArcFace loss yields a significantly larger improvement than triplet loss. The imposter distribution resulting from triplet loss is wider, and the overlap between the imposter and genuine scores is also larger. Moreover, the proposed combined loss performs better still, with a higher recall rate and F1 score and a clearer separation between the matching scores of genuine and imposter samples, as shown in Figure 10f. The verification results obtained with each loss function are given in Table 5.

Conclusions
In this study, we proposed two sequential methods for palm vein authentication. First, we proposed an attention-aware segmentation model and explained the vein labeling approach using the Hessian-based blood vessel filtering method. With correctly labeled palm vein patterns, this research examined the effectiveness of integrating the attention mechanism and residual block within a single U-Net architecture, aiming to distinguish between salient and noisy features while enhancing critical features. Our proposed segmentation method demonstrated superior performance compared to the baseline models. Second, a palm vein authentication model was proposed that addresses the problem of separating intra-class and inter-class feature embeddings in palm vein verification. In particular, efficient channel attention is integrated into the ResNet-50 architecture with no additional computational cost, followed by a multi-task learning approach with the binary head and margin head, which further strengthens the model's authentication capability. As a third contribution, we designed a combined loss function that aims to learn discriminative features effectively, thereby enhancing the overall authentication process. The proposed methodology achieves 100% accuracy for palm vein recognition and an equal error rate of 0.018 for palm vein verification. Moreover, the ablation study demonstrates that our proposed combined loss function enhances inter-class diversity and intra-class compactness while focusing on hard-to-classify palm vein samples, ultimately achieving more discriminative features for improved palm vein recognition.

Figure 1. Overall proposed method for palm vein authentication.

Figure 4. (a) Original ROI; (b) ROI image filtered by Jerman method; and (c) final labeled ROI image for segmentation.

Figure 6. Detailed network architecture of proposed attention-gated residual U-Net segmentation model.

Figure 9. Palm vein segmentation results: (a) Histogram equalized original palm images; (b) Ground-truth label images; (c) Segmentation results from original U-Net model; and (d) Segmentation results from proposed attention-gated residual U-Net model.
• … structure from low contrast and blurry palm images, allowing for more precise authentication performance;
• We propose a light-weight attention-aware feature extractor for palm vein recognition that can efficiently extract palm vein features without any extra computational overhead;
• We propose the most effective loss function for palm vein discriminative learning by combining state-of-the-art loss functions, such as ArcFace, focal loss, and triplet loss.

Table 1. Comparison with other state-of-the-art models for palm vein segmentation.

Table 2. Results of palm vein recognition with/without the segmentation model.

Table 3. Comparison of the proposed method with other existing studies using identification accuracy.

Table 4. EERs of different methods for palm vein verification.

Table 5. Comparison of the results obtained by training the proposed ECA-ResNet-50 with different loss functions.