Finger-Vein Verification Based on LSTM Recurrent Neural Networks

Finger-vein biometrics has been extensively investigated for personal verification. A challenge is that finger-vein acquisition is affected by many factors, which results in many ambiguous regions in the finger-vein image. Generally, the separability between vein and background is poor in such regions. Despite recent advances in finger-vein pattern segmentation, current solutions still lack the robustness to extract finger-vein features from raw images because they do not take into account the complex spatial dependencies of vein patterns. This paper proposes a deep learning model that extracts vein features by combining a Convolutional Neural Network (CNN) model and a Long Short-Term Memory (LSTM) model. Firstly, we automatically assign labels based on a combination of known state-of-the-art handcrafted finger-vein image segmentation techniques, and generate various sequences for each labeled pixel along different directions. Secondly, several Stacked Convolutional Neural Networks and Long Short-Term Memory (SCNN-LSTM) models are independently trained on the resulting sequences. The outputs of the various SCNN-LSTMs form a complementary and over-complete representation and are jointly fed into a Probabilistic Support Vector Machine (P-SVM) to predict the probability of each pixel being foreground (i.e., a vein pixel) given several sequences centered on it. Thirdly, we propose a supervised encoding scheme to extract the binary vein texture. A threshold is automatically computed by taking into account the maximal separation between the inter-class distance and the intra-class distance. In our approach, the CNN learns robust features for vein texture pattern representation and the LSTM stores the complex spatial dependencies of vein patterns, so the pixels in any region of a test image can be classified effectively. In addition, supervised information is employed to encode the vein patterns, so the resulting encoded images contain more discriminative features. The experimental results on one public finger-vein database show that the proposed approach significantly improves the finger-vein verification accuracy.


Introduction
With the wide application of the Internet and the increasing risk of terrorist attacks, information security has become a hot topic and received more and more attention. A key point is how to recognize a person in order to protect personal property and privacy. Biometrics as an authentication method for recognizing a person has been widely investigated in past years. Currently, various biometric characteristics such as fingerprints [1], palm-print [2], finger-vein [3,4], hand-vein [5], palm-vein [6], face [7], iris [8], voice [9], and signature [10] have been employed for verification. They can be broadly classified into two categories: (1) extrinsic characteristics (e.g., fingerprints, palm-print, face, voice, signature); (2) intrinsic characteristics (e.g., finger-vein, palm-vein, hand-vein). The extrinsic characteristics are prone to attack because fake faces and fingerprints can successfully cheat the verification system [11]. As intrinsic characteristics such as finger-vein are concealed beneath the skin and are not easily copied or forged, they offer high security and privacy in practical applications.
However, vein verification faces serious challenges. In practical applications, various factors such as environmental illumination [12][13][14], ambient temperature [3,14,15], light scattering [16,17], and user behavior [12,13] affect the finger-vein image quality. Generally, these factors are not controlled, so many captured images contain not only vein patterns but also noise and irregular shadowing. The separability between the vein and non-vein patterns is poor in the regions associated with noise and irregular shadowing, and matching on such regions degrades the verification accuracy. To solve this problem, many segmentation-based methods have been proposed to segment a robust vein network for finger-vein recognition. Broadly, they can be categorized into two groups.
(1) Handcraft-based segmentation approaches. In this category, researchers employ existing mathematical models to detect vein features based on attribute assumptions such as valleys and straight lines. For example, some assume that the vein patterns can be approximated by line-like textures in a predefined neighborhood region, and descriptors such as Gabor filters are used to extract the vein pattern. Representative works include the wide line detector (WLD) [13], Gabor filters [3,18,19,20,21], and matched filters [22]. Other researchers observe that the cross-sectional profile of a vein pattern has a valley shape. Therefore, many models are built to detect valleys for vein pattern extraction. For instance, curvature is sensitive to valleys, so various approaches enhance the vein patterns by computing the mean curvature [14], difference curvature [23], and maximum curvature [15] of pixels in an image. In [24][25][26][27], the vein patterns are detected by computing the depth of the valley. In the region growth approach [27], both the depth and the symmetry of the valley are combined to extract the vein pattern. Recently, based on anatomical knowledge, some characteristics of the finger-vein structure, e.g., directionality, continuity, width variability, smoothness, and solidness, have been taken into account for finger-vein texture extraction [28].
(2) Deep learning-based segmentation approaches. Unlike handcrafted approaches, deep learning-based approaches are capable of extracting the vein patterns from a raw image without manual assumptions on the attribute distribution, and have shown promising performance in medical image segmentation tasks such as neuronal membrane segmentation [29], prostate segmentation [30], retinal blood vessel segmentation [31], and brain image segmentation [32]. In [33], a Convolutional Neural Network (CNN) model is employed for finger-vein segmentation for the first time and outperforms handcrafted feature-based approaches in terms of verification error. In that work, the pixels are automatically labeled and a patch-based dataset is built for CNN training. For testing, an image is split into patches and each patch is fed into the CNN to predict the probability of its center point being a vein pixel.
The approaches described above achieve good performance on some finger-vein recognition tasks, but they suffer from the following problems. Existing handcrafted approaches segment vein patterns based on assumptions. However, these assumptions are not always effective for detecting the finger-vein patterns because some vein pixels may follow more complicated distributions than valleys or straight lines. Also, they explicitly extract some vein features by an image processing method, which might discard relevant information about the finger-vein pattern. In addition, they do not gain any prior knowledge from different images because they segment each image independently of the others. For the deep learning-based approach [33], these problems are alleviated to an extent because it directly uncovers hierarchical features from raw images to minimize its decision errors on vein patterns without attribute distribution assumptions. Meanwhile, rich prior knowledge is harnessed by training it on a huge patch-based training set built from different images. However, all of these approaches, including the CNN in [33], segment each pixel independently based on a predefined neighborhood region (e.g., a patch) instead of considering spatial dependencies among neighboring pixels.
In fact, finger-vein patterns extend from the finger root to the fingertip and show clear direction and good connectivity [34]. Therefore, spatial dependencies exist among neighboring vein pixels, and the performance of the existing approaches is still limited for finger-vein texture pattern extraction.
Recurrent neural networks have shown a powerful capacity for representing long-term dependency information and have been successfully applied to human activity recognition [35,36], speech recognition [37], and handwriting recognition [38]. In recent years, LSTM networks [39], the most successful extension of recurrent neural networks, have received more and more attention. The Long Short-Term Memory (LSTM) model adopts a gating mechanism that controls the contents of an internal memory cell, so that it is capable of learning a better and more complex representation of long-term dependencies in sequential input data. Consequently, LSTM networks work well for feature learning over time series data. Some studies employ them to learn complex spatial dependencies for scene labeling and action recognition [40][41][42].
Inspired by this idea, in this paper we propose a stacked Convolutional Neural Network and Long Short-Term Memory (SCNN-LSTM) model for finger-vein texture segmentation by combining the CNN model and the LSTM model. Compared to existing segmentation-based methods, our approach predicts the probability of a pixel not only from the pixels in a local region and their correlations, but also from the spatial dependencies in its neighboring contexts, through a feature representation learned by the LSTM from a large sequence training set. The main contributions of this paper are summarized as follows.
(1) We propose a stacked Convolutional Neural Network and Long Short-Term Memory model to automatically learn features from raw pixels for finger-vein verification. First, the vein and background pixels are automatically labeled based on several baselines. For each labeled pixel, we generate four sequences along different directions. As a result, there are several sequence-based training sets, on which several SCNN-LSTMs are independently trained to form a complementary and over-complete representation. Secondly, for a testing image, the probability of each sequence belonging to a vein pattern is predicted, and the scores from the patch-based sequences are jointly input to the P-SVM to segment the vein patterns. As the CNN model has the capacity to represent vein texture features in a local region (i.e., a patch) and the LSTM model captures the spatial dependencies among neighboring regions, the proposed SCNN-LSTM model is capable of predicting the probability of a pixel belonging to a vein pattern. The experimental results on a public finger-vein database imply that the proposed approach is able to extract the vein pattern, which results in a significant improvement in finger-vein verification accuracy.
(2) This paper investigates a new approach to encode the finger-vein for verification. Generally, the existing finger-vein segmentation approaches encode an image to extract binary vein patterns based on one or more thresholds, which are not related to verification error reduction. In contrast, an effective supervised scheme is employed here to automatically select the threshold for vein pattern encoding. We search for a robust threshold to encode the image by maximizing the inter-class distance and minimizing the intra-class distance, which does not rely on human domain knowledge. The proposed scheme therefore directly targets biometric verification performance instead of human perception. We analyze the experimental results and estimate the verification performance.

The Proposed Approach
To learn compositional representations of the texture features and the spatial dependency information, an SCNN-LSTM model is proposed for finger-vein feature extraction. First, we employ seven baselines to label the pixels from a training set and a validation set. Secondly, for each labeled pixel, different sequences are created along different orientations. Thirdly, each sequence is forwarded to an SCNN-LSTM to predict its probability of belonging to a vein pattern. As a result, there are several scores for the different orientations, which are taken as the input of the SVM to extract a vein feature. Applying the proposed SCNN-LSTM model to the whole image in this way, the vein images are enhanced.
To achieve verification, the resulting enhancement image is encoded by a supervised encoding scheme. The framework of the proposed approach is illustrated in Figure 1.

Label Vein Patterns
Similar to work [33], for each input finger-vein image, seven baselines, i.e., Repeated line tracking [24], Maximum Curvature points [15], Mean curvature [14], Difference Curvature [43], Region growth [27], Wide line detector [13], and Gabor filters [3], are employed to segment the vein pattern, resulting in seven binary images (as shown in Figure 2a-i). The values in each binary image (0 and 1 denote background and vein pixels, respectively) are treated as labels of the corresponding pixels in the input image. We compute the average of the seven binary images and obtain an average image F (Figure 2i). A pixel (x, y) is labeled as vein if F(x, y) = 1 (white region in Figure 2j), and as background if F(x, y) = 0 (black region in Figure 2j). We do not label the pixels in the remaining region (the color region in Figure 2j).
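The labeling rule can be illustrated with a minimal Python sketch. It assumes each baseline output is a nested list of 0/1 values; the function name `label_pixels` and the use of `None` for unlabeled pixels are illustrative choices, not part of the original method description.

```python
def label_pixels(binary_maps):
    """Label only the pixels on which all baseline segmentations agree.

    binary_maps: list of equally sized 2-D lists holding 0 (background) or 1 (vein).
    Returns a 2-D list with 1 (vein), 0 (background), or None (unlabeled).
    """
    n = len(binary_maps)
    rows, cols = len(binary_maps[0]), len(binary_maps[0][0])
    labels = [[None] * cols for _ in range(rows)]
    for y in range(rows):
        for x in range(cols):
            votes = sum(m[y][x] for m in binary_maps)
            if votes == n:       # average F(x, y) = 1: every baseline says vein
                labels[y][x] = 1
            elif votes == 0:     # average F(x, y) = 0: every baseline says background
                labels[y][x] = 0
    return labels
```

Counting votes as integers avoids comparing a floating-point average against exactly 0 or 1.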

Stacked Convolutional Neural Networks and Long Short-Term Memory
The proposed stacked Convolutional Neural Network and Long Short-Term Memory (SCNN-LSTM) model consists of a CNN model and an LSTM model (as shown in Figure 3) and is trained to learn joint texture and spatial dependency representations for finger-vein texture segmentation. Our SCNN-LSTM takes a sequence of K patches as its input. In SCNN-LSTM, a deep CNN model is built for the vein texture representation by removing the output layer of an existing CNN model [33]. Each patch is taken as an input of the CNN model, which outputs a fixed-length vector representation that is further forwarded to a recurrent sequence learning module (LSTM) to learn the compositional representations in space, as shown in Figure 3b. Figure 4 shows the architecture of the proposed SCNN-LSTM. The CNN model (the red box in Figure 4) consists of three convolutional layers and one fully connected layer. There are 24 kernels of 5 × 5 in the first convolutional layer, 48 kernels of 5 × 5 in the second convolutional layer, and 100 kernels in the fully connected layer. The LSTM model (the blue region in Figure 4) includes 128 kernels. For SCNN-LSTM training, the input is a sequence of 7 patches of size 11 × 11. Each patch in the sequence is forwarded to the CNN model to obtain a 100-dimensional vector. As a result, there are 7 vectors for an input sequence of length 7. The resulting vectors are taken as the input of the LSTM model to obtain a 100-dimensional representative vector. Finally, the output of the LSTM model is fed into the last layer for classification. The output of the last layer is a 2-dimensional vector because there are two classes (vein and background) for vein segmentation. When the input size changes, the width and height of the map in each convolutional layer change accordingly. Along the forward direction, a patch-based sequence is thus represented effectively.

CNN Module
As the existing three-layer CNN model described in [33] has achieved promising performance for vein feature segmentation, we create a CNN module for the feature representation of a vein or background patch by removing the output layer of the CNN in [33]. During the training stage, our CNN is initialized using the weights of the existing CNN [33]. Our CNN model consists of one input layer, three convolutional layers, two max-pooling layers, and one fully connected layer. The numbers of kernels in the three layers are 24, 48, and 100, respectively, and the kernel size in both convolutional layers is 5. Each layer is detailed as follows.
Convolutional layer: Rectified Linear Units (y = max(0, x)) are used to activate the hidden neurons.
Pooling layer: Max-pooling is employed to extract location information while ensuring robustness to translation:
$$R^k_{i,j} = \max_{0 \le m,\, n < r} I^k_{i \cdot r + m,\; j \cdot r + n},$$
where $I^k$ denotes the k-th output map obtained by the k-th filter. The value $R^k_{i,j}$ pools over non-overlapping r × r local regions in $I^k$ to extract a compact feature.
Dropout: The dropout technique [44] is applied in three fully connected layers. Overfitting can be greatly reduced by randomly omitting half of the hidden units.
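As an illustration of the activation and pooling steps, the following Python sketch applies the rectifier and non-overlapping r × r max-pooling to a feature map stored as nested lists. The function names are hypothetical, and dropping any rows or columns that do not fill a complete r × r block is an assumption of this sketch.

```python
def relu(values):
    """Rectified Linear Unit, y = max(0, x), applied element-wise to a list."""
    return [max(0.0, v) for v in values]


def max_pool(feature_map, r):
    """Max-pool a 2-D feature map over non-overlapping r x r regions.

    Trailing rows/columns that do not fill a whole block are ignored (assumption).
    """
    rows, cols = len(feature_map), len(feature_map[0])
    return [
        [
            max(feature_map[i * r + m][j * r + n] for m in range(r) for n in range(r))
            for j in range(cols // r)
        ]
        for i in range(rows // r)
    ]
```

For example, a 4 × 4 map pooled with r = 2 yields a 2 × 2 map in which each entry is the maximum of one 2 × 2 block.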

LSTM Module
The LSTM module is a subnet of our SCNN-LSTM that makes it easy to memorize context information over long ranges in sequence data. In general, LSTM is proposed to model temporal dependencies. For images, this temporal dependency learning is converted to the spatial domain [41]. Therefore, we employ an LSTM unit as described in [39] to model spatial dependencies by mapping the deep feature sequences produced by the CNN to hidden states. To predict a distribution at each spatial step, softmax is employed in the output layer. Finally, we average the outputs of the LSTM network's softmax layer to compute the predicted distribution, as shown in Figure 3b. Given inputs $x_t$, $h_{t-1}$, and $c_{t-1}$, the LSTM updates at position t are
$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i), \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f), \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o), \\
g_t &= \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c), \\
c_t &= f_t \ast c_{t-1} + i_t \ast g_t, \\
h_t &= o_t \ast \tanh(c_t),
\end{aligned}
$$
where the W and b terms denote the weight matrices and bias vectors of the corresponding gates, σ and tanh are the logistic sigmoid and hyperbolic tangent functions, which are defined as
$$\sigma(x) = \frac{1}{1 + e^{-x}}, \qquad \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}},$$
and $\ast$ is the element-wise product. In addition, $h_t$, $i_t$, $f_t$, $o_t$, $g_t$, and $c_t$ denote the hidden unit, input gate, forget gate, output gate, input modulation gate, and memory cell, respectively, at position t.
Output layer: The outputs from the last hidden layer are normalized with the softmax function:
$$p(l = n) = \frac{\exp(z_n)}{\sum_{n'} \exp(z_{n'})},$$
where $z_n$ is a linear combination of the outputs in the LSTM hidden states.
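To make the recurrence concrete, here is a minimal pure-Python sketch of one step of a standard LSTM cell with input, forget, and output gates and an input modulation gate, followed by softmax averaging over the steps. The parameter layout (`params[name] = (W_x, W_h, b)`) and all function names are assumptions made for illustration; this is not the authors' implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W_x, x, W_h, h, b):
    """Compute W_x x + W_h h + b for list-based matrices and vectors."""
    return [
        sum(w * v for w, v in zip(W_x[r], x))
        + sum(w * v for w, v in zip(W_h[r], h))
        + b[r]
        for r in range(len(b))
    ]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM update; params maps "i", "f", "o", "g" to (W_x, W_h, b)."""
    def gate(name, activation):
        W_x, W_h, b = params[name]
        return [activation(a) for a in affine(W_x, x, W_h, h_prev, b)]

    i = gate("i", sigmoid)      # input gate
    f = gate("f", sigmoid)      # forget gate
    o = gate("o", sigmoid)      # output gate
    g = gate("g", math.tanh)    # input modulation gate
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


def softmax(z):
    m = max(z)                  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def average_distribution(logits_per_step):
    """Average the per-step softmax outputs into one predicted distribution."""
    probs = [softmax(z) for z in logits_per_step]
    return [sum(column) / len(probs) for column in zip(*probs)]
```

In a full model, `lstm_step` would be applied to the CNN feature vector of each patch in the sequence, and the per-step class scores would be passed to `average_distribution`.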

Multi-SCNN-LSTM Feature Representation
For a pixel with a label l ∈ {0, 1} from a given finger-vein image F, we produce a sequence $S_{\theta^*} \in \mathbb{R}^{s \times s \times K}$ along an orientation $\theta^*$ and label it as $L_{\theta^*} \in \mathbb{R}^{K \times 1}$ using the scheme described in Section 2.1, where 0 and 1 denote background and vein, respectively. The training set used for vein segmentation is represented as
$$\{(S^{n}_{\theta^*}, L^{n}_{\theta^*})\}_{n=1}^{N},$$
where N is the number of sequences from finger-vein images in the training database. As we quantize all the possible vein orientations into four orientations, we obtain 4 training datasets in this way.
A different SCNN-LSTM is then trained independently on each dataset, and each SCNN-LSTM produces a score for a particular sequence. We combine the outputs of the 4 SCNN-LSTMs to generate a 4-dimensional vector v, which is taken as the input of the P-SVM to predict the probability of the pixel (Figure 5).
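The construction of an oriented patch sequence can be sketched as follows. The text only states that orientations are quantized into four directions; the specific angles (0°, 45°, 90°, 135°), the one-pixel stride, centering the sequence on the labeled pixel, and zero-padding at the image border are all assumptions of this sketch, as are the function names.

```python
# Unit steps (dx, dy) for four quantized orientations (assumed angles).
STEPS = {0: (1, 0), 45: (1, -1), 90: (0, -1), 135: (-1, -1)}


def sequence_centers(x, y, theta, K=11, stride=1):
    """Centers of the K patches of a sequence centered on (x, y) along theta."""
    dx, dy = STEPS[theta]
    half = K // 2
    return [(x + k * stride * dx, y + k * stride * dy) for k in range(-half, half + 1)]


def extract_patch(image, cx, cy, s=11):
    """s x s patch centered on (cx, cy); pixels outside the image are set to 0."""
    half = s // 2
    rows, cols = len(image), len(image[0])
    return [
        [
            image[yy][xx] if 0 <= yy < rows and 0 <= xx < cols else 0
            for xx in range(cx - half, cx + half + 1)
        ]
        for yy in range(cy - half, cy + half + 1)
    ]


def build_sequence(image, x, y, theta, K=11, s=11):
    """Sequence of K patches (each s x s) for pixel (x, y) along orientation theta."""
    return [extract_patch(image, cx, cy, s) for cx, cy in sequence_centers(x, y, theta, K)]
```

Calling `build_sequence` once per orientation gives the four sequences whose SCNN-LSTM scores form the 4-dimensional vector passed to the P-SVM.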

Generating Score
An SVM model is employed to compute the probability of a pixel belonging to a vein pattern based on its predicted distribution along four orientations. In this work, we employ the P-SVM model [45], which requires a set of vectors for training, to combine the features from all orientations (shown in Figure 5). Let v be a vector extracted from the four sequences of a pixel with a label l ∈ {0, 1}. The P-SVM is trained to provide a probabilistic value p (0 to 1: from background to vein):
$$p(l = 1 \mid v) = \frac{1}{1 + \exp(w\,\varepsilon(v) + \gamma)}, \tag{12}$$
where ε(v) is the output of a general two-class SVM [46] with v as the input feature vector, and w and γ are fitting parameters trained by the P-SVM. After training, we are able to compute the probability of any pixel based on its feature vector v and Equation (12).
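A minimal sketch of the probabilistic mapping is given below. It assumes the usual sigmoid form of probabilistic SVM outputs, p = 1 / (1 + exp(w ε(v) + γ)), with the SVM decision value passed in directly; the function name and the example parameter values are hypothetical.

```python
import math


def platt_probability(svm_output, w, gamma):
    """Map a raw two-class SVM output to a probability with a fitted sigmoid.

    svm_output: decision value epsilon(v) of the SVM for feature vector v.
    w, gamma:   fitting parameters (assumed to have been learned beforehand).
    """
    return 1.0 / (1.0 + math.exp(w * svm_output + gamma))
```

With a negative w, larger decision values map to probabilities closer to 1 (vein), and a decision value of 0 with γ = 0 maps to 0.5.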

Supervised Feature Encoding
In this section, we propose a scheme to obtain the threshold for vein feature encoding. After applying SCNN-LSTM to all pixels, an enhanced vein image is obtained, which we then encode for matching. In existing works [3,13,14,15,27,33,43], the vein patterns are encoded by one or more thresholds. For example, a probability of 0.5 is employed to obtain vein patterns in [33]. In [3], the vein image is enhanced by Gabor filters and then binarized using a threshold of 0. In the classic repeated line tracking approach [24], two global thresholds (i.e., 85 and 175) are used to divide an image into three regions for matching. Some curvature-based approaches [14,15,23] enhance vein patterns by computing the curvature of all pixels, and an empirical threshold is employed to encode the resulting enhanced image. For finger-vein verification, the primary target of feature encoding is to improve performance, mainly the verification error rates. However, these approaches determine the threshold based on human perception instead of minimizing the verification error, so the resulting binary code (vein texture features) may not be robust for finger-vein verification. To overcome this problem, a supervised scheme is proposed to encode the vein pattern. Our approach decides the threshold by maximizing the distance between the intra-class score set and the inter-class score set computed from a training set, such that the resulting threshold is directly related to verification performance. The robust threshold T is computed as follows.
Assume that there are N classes in the training set and each class provides M samples. Using the proposed SCNN-LSTM model (Figure 5), all finger-vein images are enhanced, and we denote the m-th enhanced image in the n-th class as $x_{m,n}$, where m = 1, 2, ..., M and n = 1, 2, ..., N. We aim to find a function that maps and quantizes each enhanced image into a binary image $b_{m,n} \in \{0, 1\}^{I \times J}$ which encodes more discriminative information for verification error minimization. In our work, the binary code (vein texture pattern) $b_{m,n}$ of $x_{m,n}$ is computed by
$$b_{m,n} = 0.5 \times (\mathrm{sgn}(x_{m,n} - T) + 1), \tag{13}$$
where sgn(z) is equal to −1 if z ≤ 0 and 1 otherwise, and T ∈ [0, 1] is a parameter which is determined as follows.
Based on Equation (13), all training samples are mapped into Hamming space, so the Hamming distance [47] is employed to match two images for verification. We match the binary codes from the same class to generate intra-class scores, while the inter-class scores are produced by matching the binary codes from different classes. So there are $a_1 = N \times C_M^2$ genuine matching scores $\Omega_1 = \{d_1(T), d_2(T), ..., d_{a_1}(T)\}$ and $a_2 = N \times (N - 1) \times M \times M / 2$ impostor scores $\Omega_2 = \{d_1(T), d_2(T), ..., d_{a_2}(T)\}$. To make $b_{m,n}$ discriminative, we enforce an important criterion for encoding the enhanced images: the resulting binary codes should maximize the distance between the two sets $\Omega_1$ and $\Omega_2$. Therefore, we formulate the following optimization objective function:
$$\max_{T} J(T) = \frac{|u_1(T) - u_2(T)|}{D_1(T) + D_2(T)}, \tag{14}$$
where | · | represents the absolute value, $u_1(T)$ and $u_2(T)$ are the means of the scores in $\Omega_1$ and $\Omega_2$, and $D_1(T)$ and $D_2(T)$ are the variances of the scores in $\Omega_1$ and $\Omega_2$. To facilitate the search for the threshold T, all enhanced images are converted to gray-scale images with integer values between 0 and 255. The parameter T is varied from 0 to 255 to transform the enhanced image into a binary code map according to Equation (13). So, 256 different values J(T) (T = 0, 1, ..., 255) are computed using Equation (14). The parameter $T^*$ that maximizes Equation (14) is selected to encode the vein pattern. The binary code of $x_{m,n}$ is computed by
$$b_{m,n} = 0.5 \times (\mathrm{sgn}(x_{m,n} - T^*/255) + 1). \tag{15}$$
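The threshold search can be sketched in Python as an exhaustive scan over the 256 gray levels. The separation criterion used here, the absolute difference of the genuine and impostor means divided by the sum of their variances, is an assumed Fisher-style form consistent with the quantities named in the text. The normalized Hamming distance, the small constant added to the denominator, and all function names are likewise illustrative assumptions.

```python
from itertools import combinations
from statistics import mean, pvariance


def binarize(image, t):
    """Flattened binary code: 1 where the gray value exceeds t, else 0."""
    return [1 if v > t else 0 for row in image for v in row]


def hamming(a, b):
    """Normalized Hamming distance between two equal-length binary codes."""
    return sum(p != q for p, q in zip(a, b)) / len(a)


def separation(classes, t):
    """Assumed criterion J(t) = |u1 - u2| / (D1 + D2) for threshold t.

    classes: list of classes, each a list of gray-scale images (0..255).
    """
    codes = [[binarize(img, t) for img in cls] for cls in classes]
    genuine = [hamming(a, b) for cls in codes for a, b in combinations(cls, 2)]
    impostor = [
        hamming(a, b)
        for c1, c2 in combinations(codes, 2)
        for a in c1
        for b in c2
    ]
    spread = pvariance(genuine) + pvariance(impostor)
    # The small constant guards against division by zero (an implementation choice).
    return abs(mean(genuine) - mean(impostor)) / (spread + 1e-12)


def best_threshold(classes):
    """Gray level in 0..255 that maximizes the separation criterion."""
    return max(range(256), key=lambda t: separation(classes, t))
```

With N classes of M images each, `genuine` holds N × C(M, 2) scores and `impostor` holds N × (N − 1) × M × M / 2 scores, matching the counts given above.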

Feature Matching
After all training images are mapped into Hamming space, the Hamming distance is employed to match two images. In general, the captured images are subject to translation and rotation normalization, but some variations remain due to inaccurate localization and normalization. The plain Hamming distance is not robust to these variations. So, an enhanced Hamming distance is employed to compute the non-overlapping region between two images with possible spatial shifts for finger-vein matching. Assuming Q and B are the enrolment and test binarized feature codes with size I × J, respectively (as shown in Figure 6), the height and width of Q are extended to 2E + I and 2H + J, and its expanded image $\tilde{Q}$ is obtained and expressed as
$$\tilde{Q}(x, y) = \begin{cases} Q(x - E,\, y - H), & \text{if } 1 + E \le x \le I + E \text{ and } 1 + H \le y \le J + H, \\ -1, & \text{otherwise.} \end{cases} \tag{16}$$
Figure 6b illustrates the extended image $\tilde{Q}$ of a template Q, and the extended region with values of −1 is marked in color. The matching distance between Q and B is obtained by In Equation (17)
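One plausible form of shift-tolerant matching is sketched below in Python. The template is padded with −1 by E rows and H columns on each side, the probe is compared at every offset, padded positions are excluded, and the minimum mismatch ratio over all offsets is returned. Normalizing by the number of valid (non-padded) positions and taking the minimum over shifts are assumptions of this sketch, not a reproduction of Equation (17); the function names are hypothetical.

```python
def extend(template, E, H, fill=-1):
    """Pad a binary template with `fill` by E rows and H columns on every side."""
    rows, cols = len(template), len(template[0])
    ext = [[fill] * (cols + 2 * H) for _ in range(rows + 2 * E)]
    for i in range(rows):
        for j in range(cols):
            ext[i + E][j + H] = template[i][j]
    return ext


def shifted_hamming(template, probe, E, H):
    """Minimum mismatch ratio between probe and template over all spatial shifts."""
    ext = extend(template, E, H)
    rows, cols = len(probe), len(probe[0])
    best = 1.0
    for e in range(2 * E + 1):
        for h in range(2 * H + 1):
            mismatches = 0
            valid = 0
            for i in range(rows):
                for j in range(cols):
                    q = ext[i + e][j + h]
                    if q == -1:          # padded region: excluded from the count
                        continue
                    valid += 1
                    mismatches += q != probe[i][j]
            if valid:
                best = min(best, mismatches / valid)
    return best
```

With E = H = 0 the function reduces to the plain normalized Hamming distance, which makes the effect of the shift search easy to compare.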

Experiments and Results
To estimate the performance of our approach, we compare various approaches with respect to verification performance. In our experiments, we reproduce the results of classic approaches, i.e., Repeated line tracking [24] and Maximum Curvature points [15], and of recent approaches, i.e., Mean curvature [14], Difference Curvature [43], Region growth [27], Wide line detector [13], and Gabor filters [3], for comparison. We also report the performance of the deep learning-based segmentation approach [33]. In addition, based on the supervised encoding scheme in Equation (15), we extract the finger-vein patterns from the probability map computed by the proposed SCNN-LSTM approach. To simplify the description, we denote this scheme as SCNN-LSTM + Supervised encoding. To test our encoding approach, we also encode the resulting probability map using a probability threshold of 0.5. This scheme is denoted as SCNN-LSTM + Unsupervised encoding. The corresponding performance is shown in the following experiments. We compare all finger-vein extraction approaches mentioned above with the proposed one to gain more insight into the problem of finger-vein verification. All experiments are carried out on one public database, namely the PolyU finger-vein database [3], which is described below.

HKPU Database
The Hong Kong Polytechnic University (HKPU) finger-vein image database [3] includes 3132 images with a resolution of 513 × 256 pixels. All images were collected from 156 subjects using an open and contactless imaging device. The first 105 subjects provided 2520 finger images (105 subjects × 2 fingers × 6 images × 2 sessions) in two separate sessions with a minimum interval of one month and a maximum of over six months, with an average of 66.8 days. In each session, each subject provided 2 fingers (index finger and middle finger) and each finger provided 6 image samples. The remaining 51 subjects only provided image data in one session. To verify our approach, the 2520 finger images captured in two sessions are employed in our experiments because this setting is closer to a practical capture environment. A pre-processing method [3] is employed to extract the region of interest (ROI) image and carry out translation and orientation alignment. In addition, the image background is cropped because it contributes to matching errors and computation cost. As a result, all images are normalized to 39 × 146.

Experimental Setting
To test our approach, we split the database into three data sets: a training set with 660 (55 fingers × 12) images, a validation set with 600 (50 fingers × 12) images, and a testing set with 1260 (105 fingers × 12) images. Based on the labeling scheme described in Section 2.1, we label vein and background pixels in the training set and validation set. To train our model, we select the sequences centered on vein pixels as positive samples and the sequences centered on background pixels as negative ones. For each image in the training set, we employ only about 80 positive sequences and about 80 negative sequences. As the length of the sequences is fixed to 11 based on the experiments in Section 3.3, there are about 1760 (80 sequences × 11 (length of sequences) × 2 (positive and negative sequences)) patches for an image. This results in a total of 100,000 training sequences (50,000 positive sequences and 50,000 negative sequences) from 660 images. In the testing phase, we generate a patch for each pixel in a test image. So, for an image with a size of 39 × 146, there are 5694 (39 × 146) patches, based on which a sequence is created for each pixel along a given orientation. In our work, the length of the sequence is 11. Therefore, for a pixel, the patches centered on its 11 adjacent pixels form a sequence along a given orientation (shown in Figure 3), which results in 5694 sequences for a test image with a size of 39 × 146. Then, the sequence of each pixel is fed into our model, the output of which is taken as the probability of this pixel belonging to a vein pattern.

Parameter Estimation
As described in Section 2.1, each sequence from the images in the training set consists of K patches of size s × s. The CNN module in our SCNN-LSTM is trained by fine-tuning the CNN in [33], which takes an 11 × 11 patch as input. Such a size has also shown good performance in [33], so the patch size s for SCNN-LSTM is fixed to 11. The length of the sequence is important for achieving high verification accuracy. If K is too small, more detailed vein patterns are extracted, but they include more noise. Matching pixels in noisy regions can create errors which reduce the verification accuracy. On the contrary, sequences with a large K suppress vein feature details, leading to overly smooth vein features, which also degrades the verification accuracy. Therefore, we determine the appropriate sequence length for SCNN-LSTM experimentally. Firstly, we train the proposed SCNN-LSTM model to extract the vein features of the finger-vein images in the training and validation sets at different sequence lengths. To reduce redundant information, we obtain patches with a sampling interval of one pixel to create training sequences. Secondly, the first 6 images acquired in the first session are employed as registration templates and the remaining ones as testing images. Therefore, there are 300 (50 × 6) genuine scores and 14,700 (50 × 49 × 6) impostor scores. The False Rejection Rate (FRR) is computed from the genuine scores and the False Acceptance Rate (FAR) is computed from the impostor scores. The Equal Error Rate (EER) is the error rate at which the FAR is equal to the FRR. Figure 7 illustrates the relationship between the sequence length and the EER; the results are obtained using only the validation data. From Figure 7, we can see that a smaller equal error rate is achieved at sequence lengths of 11 and 13. The computation time increases with the length K. Therefore, we fix the length of a sequence to 11 in our experiments.
To check for over-fitting of our model, we show learning curves in Figure 8. Figure 8a,b shows the accuracy on the validation dataset and the loss on the training dataset, respectively. From Figure 8, we can observe that the accuracy on the validation dataset increases to about 65% and the loss decreases slowly after 2000 backpropagations. When the number of iteration steps is between 5000 and 10,000, the accuracy increases to more than 90% and the loss drops dramatically. After 10,000 iteration steps, the loss fluctuates but still decreases slowly. Therefore, our SCNN-LSTM model shows good convergence for finger-vein segmentation.
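For reference, the FAR, FRR, and EER can be computed from lists of matching distances as in the following sketch. It assumes that smaller distances indicate better matches and that a comparison is accepted when its distance is at or below the decision threshold. Approximating the EER at the threshold where FAR and FRR are closest is a common discrete approximation rather than a procedure stated in the text.

```python
def far_frr(genuine, impostor, threshold):
    """FAR and FRR at a distance threshold (accept when distance <= threshold)."""
    frr = sum(d > threshold for d in genuine) / len(genuine)
    far = sum(d <= threshold for d in impostor) / len(impostor)
    return far, frr


def equal_error_rate(genuine, impostor):
    """Approximate EER: mean of FAR and FRR where their gap is smallest."""
    best_gap, best_eer = None, None
    for t in sorted(set(genuine) | set(impostor)):
        far, frr = far_frr(genuine, impostor, t)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (far + frr) / 2
    return best_eer
```

In the validation setting described above, `genuine` would hold the 300 genuine scores and `impostor` the 14,700 impostor scores for one sequence length.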

Visual Assessment
In this experiment, we visually analyze the finger-vein patterns extracted by various approaches to gain more insight into the proposed approach. The seven baselines and a state-of-the-art approach [33] are employed to segment the vein texture. The vein patterns encoded by a threshold of 0.5 and by the supervised threshold are also reported. Figure 9 shows the extraction results of the various approaches. We can see from Figure 9 that the deep learning-based approaches suppress the noise and extract better connected and smoother vein texture compared to the seven baselines. Comparing the results in Figure 9i,j,f, we see that the SCNN-LSTM-based approaches outperform the CNN in terms of extracting connected vein patterns.

Verification Results Based on Image Dataset from One Session
In this section, we evaluate the performance of various approaches on the HKPU finger-vein dataset by considering the vein images collected in each of the two sessions. First, the performance is evaluated in each session individually. In one session, there are 630 images from 105 fingers. Therefore, the total numbers of genuine scores and impostor scores are 1575 (105 × $C_6^2$) and 196,560 (105 × 104 × 36/2), respectively. Symmetric matches are not executed when computing the impostor scores. Second, the performance obtained by combining the scores from the two sessions is reported, so there are 3150 (1575 × 2 sessions) genuine scores and 393,120 (196,560 × 2 sessions) impostor scores. Table 1 lists the verification errors of the various approaches for each session taken separately and for the two sessions mixed. The receiver operating characteristic (ROC) curves for the corresponding performances are illustrated in Figure 10. The experimental results in Table 1 imply that the proposed SCNN-LSTM approach outperforms existing approaches, including the CNN [33], and achieves low errors, e.g., 1.12%, 0.62%, and 1.01% for the data in the first session, the second session, and the two mixed sessions, respectively. The EERs are further reduced to 1.08%, 0.58%, and 0.95% using the proposed encoding approach. We also observe from Figure 10 that the SCNN-LSTM-based approaches significantly improve the FRR when the FAR is lower than 0.01%, which implies that our system achieves a lower verification error than the other methods considered in our work in high-security settings.
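The score counts quoted above follow from simple combinatorics, which the short Python check below reproduces. The function names are illustrative only.

```python
from math import comb


def genuine_count(fingers, samples):
    """Each finger contributes C(samples, 2) within-class comparisons."""
    return fingers * comb(samples, 2)


def impostor_count(fingers, samples):
    """All cross-finger sample pairs, with symmetric matches counted once."""
    return fingers * (fingers - 1) * samples * samples // 2


per_session = (genuine_count(105, 6), impostor_count(105, 6))
both_sessions = tuple(2 * count for count in per_session)
```

For 105 fingers with 6 samples each, this gives 1575 genuine and 196,560 impostor scores per session, and 3150 and 393,120 when the two sessions are combined.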

Verification Results Based on Image Dataset from Two Sessions
This experiment aims at estimating the effectiveness and robustness of various algorithms on the finger-vein image data from both sessions. The testing dataset contains 1260 (105 fingers × 6 images × 2 sessions) images acquired in two sessions. For each finger, we select the 6 images captured in the first session as enrollment samples and the 6 images captured in the second session as testing samples. The genuine matching scores are produced by matching samples from the same finger, while the impostor scores are produced by matching samples from different fingers. This results in a total of 630 (105 × 6) genuine scores and 32,760 (105 × 104 × 6/2) impostor scores, based on which we compute the FRR and FAR. In addition, we compute the sensitivity index (d′) [48], defined as d′ = Z(hit rate) − Z(false alarm rate), where Z(·) is the inverse of the standard normal cumulative distribution function, to estimate the performance of various approaches.
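As a minimal sketch of these quantities, the code below computes the FAR and FRR at one decision threshold and then the sensitivity index (written d′ here). It rests on several assumptions that this section does not confirm: scores are treated as distances (smaller means more similar), the hit rate is taken as 1 − FRR, and the false alarm rate is taken as the FAR. The function names and toy scores are illustrative only.

```python
from statistics import NormalDist

def far_frr(genuine, impostor, threshold):
    """FAR and FRR for distance scores: a pair is accepted if score <= threshold."""
    frr = sum(s > threshold for s in genuine) / len(genuine)     # genuine pairs rejected
    far = sum(s <= threshold for s in impostor) / len(impostor)  # impostor pairs accepted
    return far, frr

def sensitivity_index(hit_rate, false_alarm_rate):
    """d' = Z(hit rate) - Z(false alarm rate), Z = inverse standard normal CDF.

    Both rates must lie strictly between 0 and 1, otherwise inv_cdf raises an error.
    """
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(false_alarm_rate)

# Toy distance scores, not taken from the paper.
genuine_scores = [0.1, 0.2, 0.3, 0.6]
impostor_scores = [0.4, 0.7, 0.8, 0.9]

far, frr = far_frr(genuine_scores, impostor_scores, threshold=0.5)
d_prime = sensitivity_index(hit_rate=1.0 - frr, false_alarm_rate=far)
print(far, frr, round(d_prime, 3))  # 0.25 0.25 1.349
```

A larger d′ means the genuine and impostor score distributions are better separated, which is why it is reported alongside the EER.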
The experimental results of the various approaches are summarized in Table 2, and the ROC curves for the corresponding performances are illustrated in Figure 11. The results in Table 2 show trends consistent with those from the experiments in each session. The proposed SCNN-LSTM-based approaches (SCNN-LSTM + Unsupervised encoding and SCNN-LSTM + Supervised encoding) obtain the best results, especially at lower FAR. The lowest EER of 2.38% is achieved using the supervised encoding approach. Similarly, the proposed method achieves a higher d′ (e.g., 3.89 and 3.95) than existing approaches, which implies that the lowest verification error is achieved using our SCNN-LSTM model.
Table 2. EER of various approaches on image dataset from two different sessions.

Discussion
The experiments reported in Tables 1 and 2 and Figures 10 and 11 show that the proposed SCNN-LSTM-based models achieve the best performance among all approaches considered in our work, including the seven baselines and the CNN-based model. For example, on the dataset acquired from two sessions, the EER achieved by the best existing approach (CNN) is reduced to 2.53% by the proposed SCNN-LSTM model with the unsupervised encoding scheme. The verification accuracy may be further improved by combining the features of sequences along more directions or by enlarging the training set. The good performance can be explained as follows. The existing handcrafted approaches (the seven baselines) explicitly extract features by image processing methods, which might discard relevant information about the finger-vein pattern. Also, they do not gain any prior knowledge from different images, as they segment each image independently of the others. In addition, all existing approaches, including CNN, process each pixel independently based on a predefined neighborhood region or cross-sectional profile during segmentation, and ignore the spatial dependencies among different vein pixels. By contrast, the proposed approach uncovers hierarchical features for vein texture representation by training its CNN module and harnesses rich dependency information by training its LSTM module on a huge set of sequences from different images. Therefore, it is capable of predicting the probability of a pixel belonging to a vein pattern.
We can also observe from the experimental results (Tables 1 and 2, Figures 10 and 11) that the performance improves after adopting the supervised encoding scheme. For instance, the EER is reduced to 0.95% (about 6% relative error reduction) on the data from the two mixed sessions. When we employ the images from the first session as templates and the images captured in the second session as testing samples, SCNN-LSTM + Supervised encoding achieves an EER of 2.38% (about 8.1% relative error reduction). These results can be explained as follows. The existing finger-vein encoding approaches do not infer any prior knowledge from different images, because they compute the threshold for each image independently of the others or employ empirical threshold values such as 0.5 and 0. By contrast, the proposed encoding approach harnesses rich prior knowledge acquired by maximizing the distance between the genuine score set and the impostor score set (as shown in Equation (14)), so the resulting threshold is directly related to verification error reduction. Therefore, our approach can extract discriminative vein texture for verification. The experimental results also show that the supervised encoding yields a larger improvement on the data acquired in two sessions. The reason is that the one-session case leaves little room for improvement, because it is easier to distinguish images from one session than images from two sessions. As the two-session scenario is the more realistic one, the supervised encoding scheme is effective in reducing the verification error in practice.
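The idea of choosing the encoding threshold from labeled training data can be sketched as a simple search. Equation (14) is not reproduced in this section, so the criterion below (mean impostor distance minus mean genuine distance, with a normalized Hamming distance between binary codes) is a stand-in assumption rather than the paper's objective. All names, the candidate thresholds, and the toy probability maps are hypothetical.

```python
from itertools import combinations

def hamming(a, b):
    """Fraction of positions where two binary codes differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def separation(samples, threshold):
    """Mean impostor distance minus mean genuine distance at one threshold.

    samples: list of (finger_id, probability_map) pairs, where each
    probability_map is a flat list of per-pixel vein probabilities.
    """
    codes = [(fid, [1 if p > threshold else 0 for p in pmap])
             for fid, pmap in samples]
    genuine, impostor = [], []
    for (fa, ca), (fb, cb) in combinations(codes, 2):
        (genuine if fa == fb else impostor).append(hamming(ca, cb))
    return sum(impostor) / len(impostor) - sum(genuine) / len(genuine)

def supervised_threshold(samples, candidates):
    """Pick the candidate threshold with the largest separation."""
    return max(candidates, key=lambda t: separation(samples, t))

# Toy 4-pixel probability maps for two fingers, two samples each.
samples = [
    ("A", [0.9, 0.8, 0.55, 0.1]),
    ("A", [0.8, 0.9, 0.4, 0.2]),
    ("B", [0.2, 0.3, 0.9, 0.8]),
    ("B", [0.1, 0.4, 0.8, 0.9]),
]
best = supervised_threshold(samples, candidates=[0.4, 0.5, 0.6])
print(best)  # 0.6
```

In this toy case the fixed value 0.5 leaves one ambiguous pixel (0.55) encoded inconsistently within finger A, whereas 0.6 removes that within-class difference. This mirrors the argument that a threshold tuned on genuine and impostor scores can be more discriminative than an empirical one.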
Comparing the experimental results in Section 3.5 (Table 1 and Figure 10) with those in Section 3.6 (Table 2 and Figure 11), we see that all approaches achieve significantly higher verification accuracy on the image datasets acquired in one session. This can be attributed to the smaller within-class variations among images captured in the same session: the imaging environment is similar, and the subjects become more familiar with presenting their fingers during image acquisition within a short duration. By contrast, there are larger within-class variations in the data acquired in two different sessions, which cause more mismatching errors.
In addition, we compare our approach with existing approaches with respect to computational cost. All experiments are carried out in Matlab 2014a on a high-performance computer with an 8-core E3-1270v3 3.5 GHz processor, 16 GB of RAM, and an NVIDIA Quadro GTX1070 graphics card. Our approach and CNN [33] are trained with the Caffe package [49] on the graphics card and tested with Matlab on the central processing unit (CPU). To reduce the time cost, we optimize how SCNN-LSTM extracts the vein features of a test image. First, as described in Section 3.2, a test image of size 39 × 146 is divided into 39 × 146 overlapping patches, from which 39 × 146 sequences are generated for all pixels along a given orientation using the scheme in Section 2.1.2. The sequences of adjacent pixels therefore share patches. If we input the sequence for each pixel into SCNN-LSTM for feature extraction, the CNN model repeats many feature extraction operations. To avoid this, the 39 × 146 patches from an image are separately input into the CNN model of SCNN-LSTM, and we take its output (a 100-dimensional vector) as the feature vector of each patch. Then, for each pixel, we arrange the resulting vectors along a given orientation to form a sequence, which is forwarded to the LSTM model to extract its spatial dependency feature. As each patch undergoes only one CNN feature extraction operation, the computation time of our model is significantly reduced. Second, the four SCNN-LSTMs (shown in Figure 5) for the four orientations are run in parallel to further reduce the time cost. The remaining approaches considered in our work are all implemented in Matlab on the CPU. The average verification time per image for the various methods is listed in Table 3.
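The first optimization amounts to caching CNN features per patch before building the sequences. The sketch below illustrates this under stated assumptions: the CNN is replaced by a counting stand-in, sequences are horizontal only, the sequence length of 5 is hypothetical, and border positions are clamped to the image edge. None of these details are specified in this section.

```python
def build_sequences(height, width, seq_len, cnn):
    """Run the CNN once per patch, then assemble one sequence per pixel.

    cnn(r, c) stands in for extracting the feature vector of the patch
    centered on pixel (r, c); seq_len is assumed to be odd.
    """
    # One CNN forward pass per patch (one patch per pixel).
    features = [[cnn(r, c) for c in range(width)] for r in range(height)]

    half = seq_len // 2
    sequences = {}
    for r in range(height):
        for c in range(width):
            # Horizontal neighbors centered on (r, c); clamping at the
            # image border is an assumption made for this sketch.
            cols = [min(max(c + k, 0), width - 1) for k in range(-half, half + 1)]
            sequences[(r, c)] = [features[r][cc] for cc in cols]
    return sequences

calls = 0

def counting_cnn(r, c):
    """Stand-in for the CNN: counts calls and returns a dummy feature."""
    global calls
    calls += 1
    return (r, c)

sequences = build_sequences(height=39, width=146, seq_len=5, cnn=counting_cnn)
print(calls)              # 5694 CNN calls with caching (39 * 146)
print(39 * 146 * 5)       # 28470 calls if every sequence were processed separately
```

With caching, the number of CNN calls depends only on the number of patches, not on the sequence length, while the LSTM still receives one full sequence per pixel.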
We can see from Table 3 that the proposed method, CNN, and the Repeated line tracking approach require more than two seconds to verify a finger-vein image (3.25 s, 2.13 s, and 2.53 s, respectively), which is longer than the time required by the remaining approaches. This can be explained as follows. The proposed approach and CNN process the patch centered on each pixel and predict its probability of belonging to a vein pattern, which is computationally expensive when the test image is large. The Repeated line tracking approach starts at a seed point and then tracks the vein patterns pixel by pixel by detecting the local dark line. When a dark line is not detectable, a new tracking operation starts at another position. The local line tracking operation is performed repeatedly, and the number of times each pixel is tracked is recorded in a tracking matrix for segmentation. A larger number of tracking operations enhances the vein pattern and results in higher verification accuracy, but increases the computational cost. Overall, our approach has a high time cost, but it achieves the best performance for finger-vein verification (as shown by the experimental results in Tables 1 and 2 and Figures 10 and 11). Moreover, these time costs are expected to be significantly reduced after code optimization. For example, implementing the algorithms in C++ can improve the computation speed. With the development of parallel computing technologies such as CUDA, the computing performance can also be dramatically improved by harnessing the power of the graphics processing unit (GPU). Therefore, our approach can meet the computational requirements of practical applications after GPU acceleration.

Conclusions
In this paper, we proposed an approach to extract the finger-vein pattern for verification. First, an SCNN-LSTM is proposed to predict the probability of a pixel belonging to a vein pattern. As SCNN-LSTM combines recurrent models such as LSTMs with deep convolutional networks, it can be jointly trained to learn complex spatial dependencies and convolutional perceptual representations. Second, to improve the performance, we proposed a supervised scheme to encode the vein patterns. As the encoding threshold is related to verification performance, the scheme can extract robust vein texture features for verification. Experimental results show that the proposed approach extracts robust vein features and significantly reduces the verification error rate with respect to the state of the art.
As our model can learn complex spatial dependencies, it extracts a continuous vein network for verification. Our approach can also be employed to extract hand-vein and palm-vein patterns for recognition. In medical image analysis, some images, such as retinal images, brain images, and images of neuronal membranes, contain continuous texture patterns, so the proposed approach can be applied to segment such patterns for disease diagnosis. In addition, if the patterns in a vision image show connectivity similar to that of the vein pattern (as shown in Figure 1), our approach can be used to process the image. In future work, we will extend the application of our approach to further verify its generalization.