Article

A High-Speed Finger Vein Recognition Network with Multi-Scale Convolutional Attention

Ziyun Zhang, Peng Liu, Chen Su and Shoufeng Tong

1 School of Optoelectronic Engineering, Changchun University of Science and Technology, Changchun 130022, China
2 Institute of Space Optoelectronic Technology, Changchun University of Science and Technology, Changchun 130022, China
3 School of Electronic Information Engineering, Changchun University of Science and Technology, Changchun 130022, China
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(5), 2698; https://doi.org/10.3390/app15052698
Submission received: 7 February 2025 / Revised: 28 February 2025 / Accepted: 28 February 2025 / Published: 3 March 2025
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract: With the advancement of technology, biometric recognition technology has gained widespread attention in identity authentication due to its high security and convenience. Finger vein recognition, as a biometric technology, utilizes near-infrared imaging to extract subcutaneous vein patterns, offering high security, stability, and anti-spoofing capabilities. Existing research primarily focuses on improving recognition accuracy; however, this often comes at the cost of increased model complexity, which, in turn, affects recognition efficiency, making it difficult to balance accuracy and speed in practical applications. To address this issue, this paper proposes a high-accuracy and high-efficiency finger vein recognition model called Faster Multi-Scale Finger Vein Recognition Network (FMFVNet), which optimizes recognition speed through the FasterNet Block module while ensuring recognition accuracy with the Multi-Scale Convolutional Attention (MSCA) module. Experimental results show that on the FV-USM and SDUMLA-HMT datasets, FMFVNet achieves recognition accuracies of 99.80% and 99.06%, respectively. Furthermore, the model’s inference time is reduced to 1.75 ms, a 20.8% reduction relative to the fastest baseline model and a 62.7% reduction relative to the slowest, achieving more efficient finger vein recognition.

1. Introduction

With the continuous development of modern society and the ongoing advancement of technology, the security issues surrounding information authentication have received increasing attention. Traditional identity authentication technologies mainly rely on physical identifiers (such as identity cards, keys, access cards, etc.) or knowledge-based identifiers (such as passwords, security questions, passcodes, etc.) for identity verification [1]. However, these methods have many limitations. Not only are they cumbersome to use and prone to being forgotten or lost, but they are also vulnerable to cracking and theft, leading to lower security and inconvenience in terms of portability and management. Biometric recognition technology is a method that uses unique physiological or behavioral characteristics of the human body to authenticate an identity. As an alternative to traditional identity authentication methods, it offers higher security and convenience [2]. In recent years, biometric recognition technology has developed rapidly and is gradually becoming an important component of various organizations’ systems and even national security systems. It has been widely applied across society and holds significant application and research value. Biometric recognition technology can be divided into first-generation and second-generation technologies based on the type of biological characteristics used. First-generation biometric technologies mainly rely on external features and morphological characteristics of the human body, including fingerprint [3], iris [4], and facial recognition [5]. Fingerprint recognition, as one of the most widely used biometric methods, uses compact devices and offers low cost, ease of integration, and quick response times. However, its performance can be greatly affected by finger wear, sweat, or stains, and fingerprints are easily stolen through methods such as photography or capture, posing a security risk. Iris recognition matches identity based on unique visual characteristics, such as stripes and spots, found in the annular membrane between the pupil and sclera of the human eye. It is highly secure, but the devices are expensive and difficult to miniaturize, and the user’s eye must be placed close to and aligned with the device, which may reduce user acceptance and provoke resistance. Facial recognition technology has matured significantly and can maintain high accuracy even with facial coverings such as masks or glasses, but it is highly susceptible to theft and abuse, raising privacy and ethical concerns. For these reasons, there is an increasing demand for biometric technologies that are secure, convenient, and difficult to forge or misuse. Compared to first-generation technologies that rely on external features, second-generation biometric technologies have enhanced anti-forgery capabilities and stability, using internal human characteristics that cannot be directly observed, such as finger veins [6] and palm veins [7].
Finger vein recognition, as a second-generation biometric recognition technology based on physiological characteristics, captures internal vein images of the finger using near-infrared imaging. Under near-infrared light illumination, finger veins exhibit higher light absorption than the surrounding tissues, resulting in a series of distinctive dark patterns in the captured images—these represent the finger vein patterns. Subsequently, through image processing techniques, the vein image is converted into individual biometric data, enabling identity recognition and authentication. Finger vein recognition has the following advantages: (1) Uniqueness: the distribution of finger veins is as diverse as that of the iris, with each individual having a unique vein pattern [8]. (2) Stability: since finger veins are located inside the human body, they are unaffected by external factors such as abrasion, moisture, or temperature. Additionally, the shape of human veins remains nearly unchanged after adulthood [9]. (3) Contactlessness: finger vein images can be acquired using contactless sensors, effectively preventing bacterial transmission caused by physical contact, making the method hygienic and convenient [9]. (4) High security: finger veins are internal human features that can only be captured using specialized devices, making the features difficult to leak. Moreover, because finger vein recognition relies on the flow of blood within the finger [10], it is highly resistant to forgery using inanimate molds, enhancing its anti-spoofing capabilities. Due to its superior security, low-cost acquisition devices, and potential for large-scale applications, finger vein recognition technology has become a research hotspot, with particularly rapid developments over the past decade.
In recent years, numerous outstanding finger vein recognition algorithms have emerged. By incorporating deeper feature extraction networks and module stacking techniques, these algorithms have significantly improved the accuracy of finger vein recognition, thereby promoting its widespread application across various fields. However, the resulting increase in model complexity and lower computational efficiency have notably prolonged recognition time, making it challenging to simultaneously meet the demands of real-time performance, user experience, and security in practical applications [11]. In scenarios such as smartphone unlocking, access control systems, and financial payments, users expect identity verification to be completed almost instantaneously, as prolonged waiting times can degrade user experience and even hinder the adoption of the technology [12]. In environments such as banks, subways, and train stations, where hundreds or even thousands of people may require identity verification per second, excessive system response times can lead to congestion and disrupt operations [13]. Simply reducing model depth and complexity can enhance computational efficiency, but it often results in decreased recognition accuracy, making it difficult to meet the core security and reliability requirements of biometric recognition technology. Therefore, achieving a balance between recognition accuracy and efficiency remains one of the key challenges in finger vein recognition technology.
To address the aforementioned issues and enable finger vein recognition algorithms to achieve both recognition accuracy and efficiency, this paper proposes a deep learning network model called Faster Multi-Scale Finger Vein Recognition Network (FMFVNet), which ensures a high recognition accuracy while significantly improving recognition efficiency.

2. Related Works

Before deep learning techniques were widely used, research on finger vein recognition was mainly based on traditional image processing algorithms. Kono et al. [14] pioneered the use of vein features for authentication, enhancing vein features with a background-noise-suppression filter and quantifying feature similarity between rotated images via normalized cross-correlation. Miura et al. [10] proposed a repeated-line-tracking identification method that constructs vein patterns through multiple tracking operations, enhancing the ability to recognize dark line structures in images. Later, they extracted vein features by detecting points of maximum curvature in image profiles, thus effectively coping with fluctuations in vein width and brightness [15]. Park et al. [16] proposed a finger vein recognition method combining global and local features, extracting local information with local binary patterns (LBPs) and obtaining global features with a wavelet transform. Lu et al. [17] proposed the Polydirectional Local Line Binary Pattern (PLLBP) to address the limitation of the local line binary pattern, which can only extract horizontal and vertical features, and verified the importance of different directional features in finger vein recognition. Asaari et al. [18] proposed a multimodal biometric method that combines finger vein and finger geometry recognition, computing finger vein similarity with band-limited phase-only correlation (BLPOC). Li et al. [19] proposed an improved maximum curvature method that enhances traditional maximum curvature feature extraction by incorporating Gabor filtering and a gray-level grouping enhancement mechanism. This approach effectively removes noise from the image, enhances contrast, and improves the preservation of detailed vein structures. Song et al. [20] introduced a finger vein verification approach based on mean curvature, where vein features are extracted by identifying regions with negative mean curvature.
With the breakthrough of deep learning in image processing tasks such as classification [21], target detection [22], and digital image processing [23], many researchers have applied deep learning methods to finger vein recognition tasks with good results. Deep learning methods require little human intervention and can learn deep, stable features from large numbers of training samples. Huang et al. [24] achieved finger vein feature matching using a deep convolutional neural network (DCNN), employing a hard mining strategy to optimize network training and enhance verification accuracy and robustness. Song et al. [25] reduced the effect of noise and improved verification accuracy and robustness by combining two finger vein images into a composite image, which was fed into a DenseNet model for feature extraction and pattern matching. Qin et al. [26] improved finger vein recognition by integrating texture features extracted via CNN with spatial dependencies captured by Long Short-Term Memory (LSTM). Noh et al. [27] applied multi-channel convolutional modeling to finger vein images, using both shape and texture images as composite inputs; the multi-channel convolutional layers used for feature extraction and pattern matching effectively handle illumination variations and background noise, improving verification accuracy and robustness. Zhang et al. [28] proposed a Generative Adversarial Network (GAN)-based image augmentation method, which generates high-quality synthesized images using a Fully Convolutional Generative Adversarial Network (FCGAN) to improve the finger vein classification performance of convolutional neural networks (CNNs). Hou et al. [29] proposed a finger vein verification method combining a Convolutional Autoencoder (CAE) with Support Vector Machines (SVMs), where high-level feature representations of finger vein images are extracted by the CAE and classified by SVMs. Zhong et al. [30] proposed a lightweight convolutional neural network model based on the pre-trained MobileNetV2 architecture, incorporating customized auxiliary network modules to optimize the training process and improve finger vein recognition performance. Ren et al. [31] proposed a finger vein authentication system with template protection, achieving secure and efficient authentication through RSA encryption and a convolutional neural network. FV-ViT [32] incorporated the Vision Transformer (ViT) into the finger vein recognition task, enhanced its architecture with a regularized MLP (regMLP), and optimized model performance by training from scratch. SegNeXt [33] shows that, compared to the self-attention mechanism in Transformers, convolutional attention is more effective and efficient. Liu et al. [34] proposed a multi-scale convolutional network called MCNet, which enhances structural features through the Multi-Scale Feature Extraction (MFE) module and improves local feature representation using the Cross-Information Fusion Attention (CFA) module. This approach addresses the limitations of deep learning methods in finger vein recognition, particularly in capturing long-texture features and local details. Bhushan et al. [35] proposed an automatic finger vein recognition method based on the Swin Transformer (SwinT) and the Super Graph Gluing (SGG) model to reduce dependence on domain knowledge and optimize finger vein feature extraction through deep learning. This approach enhances the accuracy of feature extraction and matching by integrating multi-stage preprocessing, advanced feature extraction, and an optimized classification framework. Bai et al. [36] proposed a feature extraction algorithm based on an open-set testing protocol, which enhances vein feature learning through segmentation-assisted classification. The method integrates multi-scale feature fusion and spatial attention while employing a hybrid loss function to improve feature discriminability, achieving promising recognition performance on public datasets.
This paper proposes FMFVNet, a finger vein recognition network that balances recognition speed and accuracy. The model adopts the FasterNet Block as its primary feature extraction module; its partial convolution reduces computational redundancy and memory access, decreases model complexity, and enhances actual computational efficiency, thereby shortening inference time. To further improve recognition accuracy and mitigate the impact of factors such as finger position variations, uneven illumination, and noise from image acquisition devices, this paper introduces Multi-Scale Convolutional Attention (MSCA) to enhance the model’s multi-scale feature extraction capability, allowing it to focus more on vein features while ignoring irrelevant background information. The main contributions of this work can be summarized as follows:
1. This study proposes an efficient finger vein recognition network called FMFVNet to meet the requirements of finger vein recognition tasks for both accuracy and efficiency. Experimental results demonstrate that the proposed model achieves higher recognition accuracy and lower inference time compared to other methods while maintaining fewer parameters and lower computational complexity.
2. In this study, MSCA is introduced for finger vein recognition. By incorporating multi-branch deep strip convolutions of different scales, the model captures rich details of finger vein images at multiple scales, focusing more on vein feature extraction. Experimental results show that integrating MSCA effectively improves the recognition accuracy of the finger vein recognition model.
3. To validate the model’s performance, we conducted comparative tests using the FV-USM dataset from Universiti Sains Malaysia and the SDUMLA-HMT dataset from Shandong University. Furthermore, an ablation study was conducted to thoroughly analyze the impact of the MSCA module on FMFVNet. The experimental results indicate that the proposed FMFVNet achieves outstanding performance across various finger vein databases and exhibits a strong generalization ability.

3. Materials and Methods

3.1. Network Architecture

Figure 1 illustrates the overall network architecture of the proposed FMFVNet. FMFVNet consists of four feature extraction stages, with spatial downsampling and channel expansion performed through convolutional units between each stage. Before the first stage, a 4 × 4 convolution layer is used as the embedding layer, with a stride of 4. This convolution operation maps the input finger vein image into a high-dimensional feature space, generating feature representations rich in semantic information. This serves as the initial step for transforming raw image data into high-dimensional features that the network can process. Before the following three stages, 2 × 2 standard convolution layers are used as merging layers, which reduce the spatial dimensions of the feature maps while preserving and integrating important features, thereby improving the efficiency of subsequent feature extraction. In the feature extraction stages, the FasterNet Block is employed as the primary feature extraction module due to its low parameter count and high computational efficiency, which helps improve the model’s inference speed while maintaining recognition accuracy. After the four feature extraction stages, a Global Average Pooling Layer, a 1 × 1 convolution layer, and a Fully Connected Layer are used for feature transformation and classification. Table 1 presents the detailed architecture of the entire network, specifically illustrating the feature extraction process and dimensional changes in the proposed network. The table lists the spatial resolution, key operations, and final output results at each stage of the network, where the input and output dimensions are represented as width × height × channels . The model takes as input a 224 × 224 × 3 finger vein image.
In the first stage, a single FasterNet Block is used for initial feature extraction, mapping the input finger vein image from the original pixel space to a lower-dimensional feature space. Since this is the initial stage of the model, the extracted features are relatively simple, and the computational demand is low; therefore, a single lightweight network module is sufficient to meet the requirements.
After passing through the first Merging layer, the feature complexity increases to a moderate level. A slight increase in the number of lightweight network modules helps enhance the model’s feature extraction capability while maintaining a better balance between computational efficiency and recognition accuracy. Therefore, in the second stage, two FasterNet Blocks are used to further refine and enhance the feature representation, improving the model’s ability to capture finger vein patterns.
As the features undergo progressive refinement in the first two stages, the feature map size is significantly reduced, leading to lower memory access overhead in the subsequent stages and enhancing actual computational efficiency. Therefore, in the third stage, increasing the number of FasterNet Blocks to eight enables more effective extraction of critical features from finger vein images while preserving recognition efficiency and strengthening the model’s representational capacity.
By the fourth stage, the features have already been sufficiently extracted and processed in the previous stages. Therefore, further increasing the number of modules would not significantly improve the model’s performance; instead, it could increase the risk of overfitting. Reducing the number of modules helps avoid excessively complex feature representations while maintaining sufficient model capacity to process global information. This approach ensures stability in the final stage and mitigates potential negative impacts on the model’s generalization ability. Hence, only two FasterNet Blocks are used in this stage to generate the final output features suitable for the recognition task.
To further improve recognition accuracy and mitigate the effects of finger position variations, uneven illumination, and noise from image acquisition devices, while enhancing the model’s multi-scale representation ability and recognition accuracy, we incorporated the Multi-Scale Convolutional Attention (MSCA) module in the third stage. By introducing a multi-scale attention mechanism into the convolution layers, the model assigns appropriate weights to features at different scales. Since the first two stages mainly extract low-level features such as textures and edges, using a complex attention mechanism at these stages would introduce redundant information, wasting computational resources and potentially reducing the efficiency of attention computation. Although the fourth stage has a lower computational load, the feature map size is too small, meaning that using MSCA would not provide significant multi-scale information gain, making the additional computational cost unjustifiable. Therefore, MSCA is introduced exclusively in the third stage, ensuring that multi-scale attention is fully utilized while keeping computational resources under control, ultimately enhancing feature extraction efficiency.
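To make the stage layout in Table 1 concrete, the following is a minimal PyTorch sketch of the FMFVNet pipeline. It reflects our reading of Table 1 rather than the authors’ released code: the FasterNetBlock and MSCA modules are sketched in Sections 3.2 and 3.3 below, the exact placement of MSCA within stage 3 is an assumption, and num_classes is a placeholder for the number of enrolled finger classes.

```python
import torch.nn as nn

class FMFVNet(nn.Module):
    """Sketch of the Table 1 layout: 4x4 embedding conv, four stages of
    FasterNet Blocks with 2x2 merging convs between them, MSCA in stage 3,
    then global average pooling, a 1x1 conv, and a fully connected layer."""
    def __init__(self, num_classes, dims=(40, 80, 160, 320), depths=(1, 2, 8, 2)):
        super().__init__()
        layers = [nn.Conv2d(3, dims[0], kernel_size=4, stride=4)]  # 224x224x3 -> 56x56x40
        for i, (dim, depth) in enumerate(zip(dims, depths)):
            if i == 2:
                layers.append(MSCA(dim))                 # MSCA only in the third stage
            layers += [FasterNetBlock(dim) for _ in range(depth)]
            if i < len(dims) - 1:                        # 2x2 merging conv between stages
                layers.append(nn.Conv2d(dim, dims[i + 1], kernel_size=2, stride=2))
        self.features = nn.Sequential(*layers)
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                     # 7x7x320 -> 1x1x320
            nn.Conv2d(dims[-1], 1280, kernel_size=1),    # 1x1x320 -> 1x1x1280
            nn.Flatten(),
            nn.Linear(1280, num_classes),                # fully connected classifier
        )

    def forward(self, x):
        return self.head(self.features(x))
```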

3.2. FasterNet Block

In finger vein recognition tasks, traditional network modules typically incur high computational overhead and exhibit low efficiency in channel and spatial information interaction. Although they may achieve excellent recognition accuracy, they often suffer from slow recognition speed, making it difficult to meet the real-time requirements of practical applications. Utilizing the lightweight FasterNet Block [37] for feature extraction enables efficient feature extraction while maintaining recognition accuracy.
As shown in Figure 2a, the FasterNet Block consists of a Partial Convolution (PConv) layer, two 1 × 1 convolution layers with a stride of 1, a Batch Normalization (BN) layer, and a GELU activation function. The BN layer normalizes the input features, enhancing the stability and speed of model training, while the GELU activation function improves non-linear expressiveness by smoothly activating the input, thereby enhancing model performance. To further optimize computational efficiency, the FasterNet Block employs a targeted strategy by placing the BN layer and GELU activation function only after the first 1 × 1 convolution layer in each module. This reduces redundant computation steps and improves inference speed. Each FasterNet Block maintains the same number of channels, with channel changes occurring only between stages. This design reduces computational complexity and memory consumption while ensuring consistency in feature transmission across modules, allowing the network to better capture both global and local feature relationships and effectively extract structural information from images. Since Partial Convolution operates on only a subset of channels, a Skip Connection is used at the end of the module to fully utilize all of the channel information and prevent feature loss. The input to the Partial Convolution layer is directly passed to the final addition operation, preserving input features and fusing them with the convolution-processed features. This design enhances the model’s learning capability and stability, facilitating better retention and transmission of information in deep networks.
As a core component of the FasterNet Block, PConv is specifically designed to minimize computational redundancy while optimizing memory access. As shown in Figure 2b, the core idea of PConv is to perform regular convolution on a subset of the input feature map channels for spatial feature extraction. To keep memory access sequential and regular, the first or last $c_p$ consecutive channels are computed, treating them as representative of the entire feature map. For a standard convolution with $c$ channels, when the ratio $r = c_p / c = \frac{1}{4}$, PConv requires only $\frac{1}{16}$ of the floating-point operations (FLOPs) of the standard convolution. The FLOPs of PConv are as follows:
$$h \times w \times k^2 \times c_p^2$$
In addition, the memory accesses of PConv are significantly reduced, to the following:
$$h \times w \times 2c_p + k^2 \times c_p^2 \approx h \times w \times 2c_p$$
This is only 1/4 of the memory access of a standard convolution, and the actual computational efficiency on the GPU is 10.5 times higher than that of Depthwise Separable Convolution (DWConv) [37]. This localized computation is highly efficient for processing the sparse and irregular vascular textures in finger vein images, substantially reducing the computational resources needed for each feature map. The first 1 × 1 Conv after PConv provides higher channel capacity for the feature maps, enabling the model to flexibly express and integrate features. Since it does not affect the spatial dimensions (acting only on the channels), it can efficiently transform and expand the feature vector at each location along the channel dimension, so that the subsequent nonlinear activation layer performs better. The second 1 × 1 Conv recompresses the expanded features to their original dimensions, keeping the output features in the same channel dimension as the input and facilitating summation with the residual connection. The effective receptive field of PConv, combined with the first 1 × 1 Conv, forms a T-shaped Conv on the input feature map. Compared to a regular convolution that uniformly processes a region, a T-shaped convolution focuses more on the central position, minimizes attention to less important surrounding areas, and reduces computational cost and memory usage, improving overall model efficiency. Moreover, compared to directly using a T-shaped Conv, decomposing it into two convolutions exploits the redundancy between filters, further saving FLOPs. For the same input and output, the FLOPs of a T-shaped Conv can be computed as follows:
$$h \times w \times \left(k^2 \times c_p \times c + c \times (c - c_p)\right)$$
This is higher than the FLOPs of PConv plus a 1 × 1 Conv, which are as follows:
$$h \times w \times \left(k^2 \times c_p^2 + c^2\right)$$
where $c > c_p$ and $c - c_p > c_p$ (when $r = \frac{1}{4}$). Moreover, this two-step decomposition can be conveniently implemented using standard convolution operations: the PConv step is followed by a 1 × 1 convolution that adjusts the number of channels.
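As a concrete illustration, the following is a minimal PyTorch sketch of PConv and the FasterNet Block as described above. The 3 × 3 PConv kernel, the r = 1/4 channel split, and the expansion factor of 2 in the two 1 × 1 convolutions follow the FasterNet defaults [37]; they are assumptions here, not hyperparameters reported for FMFVNet.

```python
import torch
import torch.nn as nn

class PConv(nn.Module):
    """Partial Convolution: a k x k conv over the first c_p = r*c channels;
    the remaining channels pass through untouched (Figure 2b)."""
    def __init__(self, dim, kernel_size=3, ratio=0.25):
        super().__init__()
        self.cp = int(dim * ratio)
        self.conv = nn.Conv2d(self.cp, self.cp, kernel_size,
                              padding=kernel_size // 2, bias=False)

    def forward(self, x):
        x1, x2 = x[:, :self.cp], x[:, self.cp:]         # split along channels
        return torch.cat([self.conv(x1), x2], dim=1)

class FasterNetBlock(nn.Module):
    """PConv followed by two 1x1 convs (expand, then compress back), with BN
    and GELU only after the first 1x1 conv, plus a skip connection (Figure 2a)."""
    def __init__(self, dim, expansion=2):
        super().__init__()
        hidden = dim * expansion
        self.pconv = PConv(dim)
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, hidden, kernel_size=1, bias=False),
            nn.BatchNorm2d(hidden),
            nn.GELU(),
            nn.Conv2d(hidden, dim, kernel_size=1, bias=False),
        )

    def forward(self, x):
        return x + self.mlp(self.pconv(x))              # skip connection
```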

3.3. Multi-Scale Convolutional Attention (MSCA)

In finger vein recognition tasks, since finger vein images are captured under near-infrared light, the contrast between the veins and the background may sometimes be relatively low. Additionally, factors such as uneven finger illumination can introduce significant background interference and noise. Figure 3 illustrates a clear finger vein image alongside images with uneven illumination and low contrast. Traditional networks often rely on single-scale convolution for feature extraction, making it difficult to simultaneously capture both local details and global information. As a result, their representation of finger vein texture details is insufficient, leading to poor robustness. The Multi-Scale Convolutional Attention (MSCA) module [33] combines the advantages of multi-scale convolution and attention mechanisms, effectively addressing the aforementioned challenges in finger vein recognition.
As shown in Figure 4, the input feature map first passes through a BN layer to stabilize the training process and accelerate model convergence. It then goes through a 1 × 1 Conv layer, where the GELU activation function introduces non-linearity to enhance the model’s expressiveness. Next, the feature map enters the Multi-Scale Convolutional Attention (MSCA) module for further processing and optimization. After being processed by MSCA, the feature map passes through another 1 × 1 convolution layer and is then added to the original input feature map via a skip connection. This fusion of the original features with the attention-adjusted features helps retain the original information while further enhancing the representation of key features.
The Multi-Scale Convolutional Attention (MSCA) module consists of three main components. First, a 5 × 5 depth-wise convolution (DWConv) is applied to the input feature map for initial spatial feature extraction, aggregating local information. Next, the feature map undergoes multi-branch depth-wise strip convolutions at different scales to capture multi-scale contextual information. These features are then fused through skip connections, forming a richer feature representation. Subsequently, the fused features pass through a 1 × 1 convolution layer to model the relationships between different channels. The output of the 1 × 1 convolution serves directly as the attention weights for reweighting. Finally, the attention weights are element-wise multiplied with the original input feature map, making feature fusion selective and input-dependent. This approach retains original information while further enhancing the representation of key features. By using different types of residual connections, the module maintains feature propagation while increasing flexibility. Through the combination of multi-scale strip convolutions and depth-wise convolutions, the MSCA module demonstrates excellent performance in handling complex image tasks. Specifically, in finger vein recognition, which requires fine-grained details, MSCA significantly improves recognition accuracy while maintaining high computational efficiency. Mathematically, MSCA can be expressed as follows:
$$\mathrm{Att} = \mathrm{Conv}_{1 \times 1}\left(\sum_{i=0}^{3} \mathrm{Scale}_i\left(\mathrm{DWConv}(F)\right)\right)$$
$$\mathrm{Out} = \mathrm{Att} \otimes F$$
Here, $F$ represents the input features; $\mathrm{Att}$ and $\mathrm{Out}$ denote the attention map and the output, respectively; and $\otimes$ represents the element-wise matrix multiplication operation. $\mathrm{DWConv}$ stands for depth-wise convolution, while $\mathrm{Scale}_i$ ($i \in \{0, 1, 2, 3\}$) represents the $i$-th branch in the diagram, where $\mathrm{Scale}_0$ is an identity connection. In each branch, we use two depth-wise strip convolutions to approximate a standard depth-wise convolution with a large kernel. The kernel sizes for the three strip-convolution branches are set to 7, 11, and 21, respectively. There are two main reasons for choosing depth-wise strip convolutions. First, strip convolutions are lightweight. By applying two one-dimensional convolutions, $7 \times 1$ and $1 \times 7$, we can achieve a receptive field equivalent to that of a standard $7 \times 7$ two-dimensional convolution while significantly reducing computational cost. Assuming the input feature map has a size of $H \times W$, the FLOPs of a standard $7 \times 7$ two-dimensional convolution can be computed as follows:
$$\mathrm{FLOPs}_{2D} = H \times W \times C_{\mathrm{in}} \times C_{\mathrm{out}} \times K_h \times K_w = H \times W \times C_{\mathrm{in}} \times C_{\mathrm{out}} \times 49$$
If the $7 \times 7$ two-dimensional convolution is decomposed into two one-dimensional convolutions, namely $7 \times 1$ and $1 \times 7$, the FLOPs become:
$$\mathrm{FLOPs}_{1D} = H \times W \times C_{\mathrm{in}} \times C_{\mathrm{out}} \times 7 \times 1 \times 2 = H \times W \times C_{\mathrm{in}} \times C_{\mathrm{out}} \times 14$$
The computational cost of the two one-dimensional convolutions is thus reduced to $\frac{14}{49} = \frac{2}{7} \approx 28.6\%$ of that of the two-dimensional convolution. This decomposition significantly reduces computational costs and allows for more efficient utilization of hardware resources. Furthermore, vein structures in finger vein images typically appear as elongated strip-like or tubular patterns. Strip convolutions are particularly well-suited for capturing these highly directional features, as they are specifically designed to handle stripe-like structures. Depth-wise strip convolutions enhance the response to vein patterns, thereby improving the model’s ability to capture vein features [38].
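The following is a minimal PyTorch sketch of the MSCA operator defined by the two equations above; the surrounding BN, 1 × 1 convolutions, GELU activation, and skip connection described at the start of this subsection are omitted for brevity.

```python
import torch.nn as nn

class MSCA(nn.Module):
    """Multi-Scale Convolutional Attention: a 5x5 depth-wise conv, three
    depth-wise strip-conv branches (kernel sizes 7, 11, 21) plus an identity
    branch (Scale_0), a 1x1 conv to mix channels, then element-wise
    reweighting of the input (Out = Att (x) F)."""
    def __init__(self, dim):
        super().__init__()
        self.dw5 = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(dim, dim, (1, k), padding=(0, k // 2), groups=dim),
                nn.Conv2d(dim, dim, (k, 1), padding=(k // 2, 0), groups=dim),
            )
            for k in (7, 11, 21)
        )
        self.mix = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        u = self.dw5(x)
        att = u + sum(branch(u) for branch in self.branches)  # Scale_0 = identity
        return self.mix(att) * x                              # reweight the input
```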

4. Experiments and Results

4.1. Dataset

In this study, the proposed algorithm was evaluated on two publicly available finger vein datasets: the FV-USM dataset [18] and the SDUMLA-HMT dataset [39]. The FV-USM dataset provides extracted finger vein region-of-interest (ROI) images, and a total of 2952 images were used in this study. In contrast, the SDUMLA-HMT dataset does not include pre-extracted ROI images. To eliminate irrelevant background and noise while improving recognition accuracy and efficiency, we performed ROI extraction on the dataset images, resulting in a total of 3816 extracted ROI images. First, Laplacian edge detection was applied to enhance the edge features of the image, followed by Gaussian filtering for smoothing to reduce noise interference. Subsequently, the least squares method was used to fit the upper and lower edges and the centerline of the finger, calculate its inclination angle, and perform image rotation correction to align the finger orientation. Next, based on the fitted edge lines, the largest inscribed rectangle of the finger was extracted to obtain a more standardized finger region. Since finger joints exhibit weaker absorption of near-infrared light, the horizontal light intensity distribution typically presents two peaks. We extracted the region between these two peaks as the finger vein region of interest (ROI) to ensure that the selected area contains the main vein information while reducing background interference. The ROI is the region in the finger vein image where vein information is most concentrated and informative. Using ROI images for recognition enhances the distinction between different vein regions while maintaining the stability and consistency of same-class vein regions.
To match the input requirements of deep learning models, all dataset ROI images underwent zero-padding and were normalized to 224 × 224 pixels to maintain a uniform input size. Both datasets were split into training, testing, and validation sets with a ratio of 4:1:1.
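A minimal sketch of the padding, resizing, and splitting steps described above is given below, assuming grayscale ROI images and centered zero-padding (the text does not state where the padding is placed); the ROI extraction itself (edge detection, rotation correction, and peak detection) is omitted.

```python
import random
import cv2
import numpy as np

def pad_and_resize(roi: np.ndarray, size: int = 224) -> np.ndarray:
    """Zero-pad a grayscale ROI to a square, then resize to size x size."""
    h, w = roi.shape[:2]
    side = max(h, w)
    canvas = np.zeros((side, side), dtype=roi.dtype)
    top, left = (side - h) // 2, (side - w) // 2     # assumed centered padding
    canvas[top:top + h, left:left + w] = roi
    return cv2.resize(canvas, (size, size))

def split_4_1_1(samples: list, seed: int = 0):
    """Shuffle and split samples into train/test/validation at a 4:1:1 ratio."""
    rng = random.Random(seed)
    samples = samples[:]
    rng.shuffle(samples)
    n = len(samples) // 6
    return samples[:4 * n], samples[4 * n:5 * n], samples[5 * n:]
```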

4.2. Experimental Configuration

The training and testing of the proposed model in this study were conducted in a Windows 11 environment. The hardware configuration included an Intel(R) Core i9-14900K CPU, 48 GB of RAM with a standard frequency of 7200 MHz, an NVIDIA GeForce RTX 4090 GPU with 24 GB of VRAM, and an ROG STRIX Z-790A motherboard. For the software environment, the experiments were performed using PyTorch 2.2.1+cu121 as the deep learning framework with Python version 3.11.8. The Adam optimizer was employed in these experiments, with an initial learning rate of 0.001 and a weight decay of 0.0001. A cosine annealing learning rate scheduler was applied, setting the maximum cycle to 200 and the minimum learning rate to 1 × 10 6 . The loss function used for optimization was primarily the cross-entropy loss. The training batch size was set to 32, and the model was trained for 200 epochs.
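The configuration above translates into the following minimal PyTorch training sketch; train_loader and num_classes are assumed to be defined elsewhere, and FMFVNet refers to the sketch in Section 3.1.

```python
import torch
from torch import nn, optim

model = FMFVNet(num_classes=num_classes).cuda()      # num_classes: enrolled classes
criterion = nn.CrossEntropyLoss()                    # cross-entropy loss
optimizer = optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)
# Cosine annealing: maximum cycle of 200 epochs, minimum learning rate 1e-6.
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=200, eta_min=1e-6)

for epoch in range(200):
    model.train()
    for images, labels in train_loader:              # batch size 32
        images, labels = images.cuda(), labels.cuda()
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step()
```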

4.3. Comparative Experiments with Different Models

To more accurately and objectively evaluate the performance of the proposed model in this chapter, we conducted a comprehensive assessment from two perspectives: recognition accuracy and recognition efficiency. First, for a finger vein recognition system, ensuring a high recognition accuracy is crucial. To measure the model’s recognition accuracy, we adopted accuracy (ACC) as the evaluation metric, where a higher ACC indicates more precise recognition and enhanced model security. ACC is suitable for classification tasks with balanced class distributions. It is computed by dividing the number of correctly predicted samples by the total number of samples and multiplying by 100% to express the final result. The formula for calculating ACC is given as follows:
$$\mathrm{ACC} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \times 100\%$$
To comprehensively evaluate the accuracy of the proposed FMFVNet model, we compared it with several representative models that have been applied in the field of finger vein recognition, including MobileNetV2 [40], EfficientNet-B0 [41], ResNet-50 [42], and Swin-T [43]. Among these models, MobileNetV2 and EfficientNet-B0 are lightweight convolutional neural network architectures that have gained significant popularity in computer vision in recent years. They are particularly suitable for resource-constrained devices, such as mobile or embedded systems, as both architectures effectively optimize computational efficiency and model size. ResNet-50, with its strong feature extraction capability, is well-suited for processing complex textures and detailed features in finger vein images. It has been widely used as a backbone network for finger vein recognition to extract deep vein features. Swin-T is an attention-based network belonging to the Transformer architecture family. It adopts a shifted window attention mechanism to optimize the traditional Transformer’s global self-attention computation, enabling precise modeling of the correlation between local and global features in finger vein images. Additionally, we compared FMFVNet’s recognition accuracy with several other models specifically designed for finger vein recognition, including Coding SA [30], LFVRN-CE [31], and FV-ViT [32].
Table 2 presents the accuracy of different models on the FV-USM and SDUMLA-HMT datasets. From the table, we can observe that the deep residual structure of ResNet-50 enables it to capture more complex features, leading to excellent performance and a higher recognition accuracy compared to the lightweight networks MobileNetV2 and EfficientNet-B0. However, Swin-T exhibits lower recognition accuracy than the other models. This is because the Transformer architecture relies heavily on large datasets for training, and when the dataset size is relatively small, the shifted window attention mechanism cannot be fully trained, limiting its feature representation capability. Among all the compared methods, the proposed FMFVNet model achieves the highest recognition accuracy. On the FV-USM dataset, FMFVNet achieves an accuracy of 99.80%, surpassing Swin-T by 1.22%. On the SDUMLA-HMT dataset, it reaches an accuracy of 99.06%, which is 0.70% higher than that of MobileNetV2. These results demonstrate that the proposed model has superior feature extraction capability, meaning it can more precisely capture and learn deep features and subtle variations in finger vein images. This suggests that FMFVNet holds more reliable application potential in finger vein recognition systems.
We used the number of model parameters (Params) as a metric to assess model complexity. The number of parameters refers to the total count of trainable parameters in a deep learning model, including weights and biases. It reflects the scale and complexity of the model and serves as an important indicator of resource consumption and computational requirements. A larger number of parameters indicates a more complex model, requiring more GPU memory and computational resources during training and inference. Since finger vein recognition models are often deployed on embedded devices or hardware terminals (such as access control systems and mobile payment devices), where hardware performance and resources are typically constrained, models with fewer parameters are better suited for practical applications. Additionally, we measured the floating-point operations (FLOPs), which represent the total number of floating-point computations required for a single forward pass of the model. FLOPs quantify the total computational workload for matrix multiplications, additions, convolutions, and other operations, making them a key metric for evaluating computational complexity. FLOPs are closely related to inference efficiency and computational resource requirements. A higher FLOPs count indicates a more computationally intensive model, requiring more processing power. Generally, models with lower FLOPs tend to have faster inference speeds. However, in real-world inference scenarios, multiple additional factors can influence overall model performance.
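For reference, the parameter count reported in Table 3 can be reproduced directly from a model instance; a small sketch is shown below. FLOPs, by contrast, are usually measured with a profiling tool (e.g., thop or fvcore) rather than by hand.

```python
import torch.nn as nn

def count_params_m(model: nn.Module) -> float:
    """Trainable parameter count in millions (the Params column of Table 3)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```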
Table 3 presents the comparison of Params and FLOPs across different models. It can be observed that due to differences in architectural design, ResNet-50 and Swin-T have a significantly larger number of parameters than lightweight networks, making them less suitable for deployment in resource-constrained environments. In contrast, MobileNetV2 and EfficientNet-B0 have relatively fewer parameters, making them more suitable for deployment on embedded devices with limited computational resources.
The proposed FMFVNet model reduces the number of parameters by 1.71 M compared to EfficientNet-B0, classifying it as a lightweight network while achieving superior accuracy and feature extraction capabilities. This balance between model size and recognition performance makes FMFVNet well-suited for deployment in finger vein recognition systems. Additionally, FMFVNet reduces FLOPs by 27.42 M compared to EfficientNet-B0 and is approximately 11.5 times more efficient than ResNet-50 in terms of floating-point operations, demonstrating a significant advantage in finger vein recognition tasks by balancing efficiency and practicality.
A higher FLOPs count indicates greater computational complexity; however, it does not necessarily correlate directly with actual runtime speed. The real execution speed of a model is influenced by factors such as architectural efficiency, hardware optimizations, and execution pipeline design. Therefore, evaluating model runtime performance should not rely on FLOPs as the only reference metric. To evaluate inference performance in real-world scenarios, it is necessary to include inference time per image as part of the model assessment for a more accurate evaluation. Inference time per image typically refers to the duration of the entire forward pass, from the moment the model receives an input to the generation of the output. It serves as a key metric for measuring the performance of deep learning models and their deployment efficiency in practical applications.
Table 4 presents the inference time per image for different models. To ensure measurement accuracy, we repeated the inference process on the input tensor 1000 times. Among lightweight models, although EfficientNet-B0 has a parameter count and FLOPs close to that of MobileNetV2, its more complex compound scaling mechanism increases the actual computational workload, resulting in a longer inference time. ResNet-50 benefits from well-optimized convolutional computations, allowing the GPU to process convolutions efficiently, making its inference time lower than that of EfficientNet-B0 despite having a higher parameter count and FLOPs. MobileNetV2, which employs depthwise separable convolutions, reduces FLOPs compared to partial convolution. However, frequent memory access results in a lower floating-point operations per second (FLOPS), leading to an increased actual inference time [37]. This explains why MobileNetV2, despite having fewer FLOPs than FMFVNet, exhibits a longer inference time per image. The proposed FMFVNet achieves an inference time of 1.75 ms per image, reducing the inference time by 20.81% compared to MobileNetV2, achieving a speedup of approximately 26%. Compared to Swin-T, FMFVNet reduces inference time by 62.7%, achieving a speedup of approximately 168%. These results demonstrate that FMFVNet offers significant advantages in inference efficiency. This improvement makes FMFVNet more suitable for deployment on embedded devices or portable identity authentication terminals, where rapid and efficient identification is required under limited hardware resources. It fully meets the real-time and performance requirements of finger vein recognition systems.
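A minimal sketch of such a per-image latency measurement is shown below, assuming a CUDA device; the warm-up runs and CUDA-event timing are our assumptions about the measurement details, which the text above does not specify beyond the 1000 repetitions.

```python
import torch

def time_per_image_ms(model: torch.nn.Module, runs: int = 1000, warmup: int = 50) -> float:
    """Average forward-pass latency (ms) over repeated runs on one 224x224 input."""
    model.eval().cuda()
    x = torch.randn(1, 3, 224, 224, device="cuda")
    with torch.no_grad():
        for _ in range(warmup):                      # warm up GPU kernels
            model(x)
        torch.cuda.synchronize()
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(runs):
            model(x)
        end.record()
        torch.cuda.synchronize()                     # wait for all queued kernels
    return start.elapsed_time(end) / runs            # total elapsed ms / runs
```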

4.4. Ablation Experiment

To verify the contribution of the Multi-Scale Convolutional Attention (MSCA) module, we conducted ablation experiments on the FV-USM and SDUMLA-HMT datasets, evaluating the accuracy variations when the model is used with and without MSCA. As shown in Table 5, the incorporation of multi-scale convolution and attention mechanisms enhances the model’s ability to capture fine-grained features, thereby improving recognition accuracy. The use of the MSCA module resulted in a 0.61% increase in accuracy on the FV-USM dataset and a 0.70% increase on the SDUMLA-HMT dataset.

5. Conclusions

To balance the trade-off between recognition accuracy and efficiency in finger vein recognition systems, this paper proposes a more efficient finger vein recognition network model—Faster Multi-Scale Finger Vein Network (FMFVNet). By integrating a lightweight design with the Multi-Scale Convolutional Attention (MSCA) module, FMFVNet significantly enhances computational efficiency while maintaining recognition accuracy, making it more suitable for real-world applications. FMFVNet employs FasterNet Block as the primary feature extraction network, utilizing efficient Partial Convolution (PConv) to optimize model complexity and computational efficiency, ensuring fast inference capability. The combination of multi-scale convolution and attention mechanisms in MSCA strengthens the model’s ability to capture vein vessel features, significantly improving its adaptability and robustness. Experimental results on the FV-USM and SDUMLA-HMT datasets demonstrate that FMFVNet achieves a higher recognition accuracy and lower inference time compared to existing models, validating its feasibility for finger vein recognition. Furthermore, the ablation study results confirm that the MSCA module effectively improves the recognition accuracy, providing a viable solution for finger vein detection.
Although the proposed FMFVNet method demonstrates a high recognition accuracy and computational efficiency in finger vein recognition tasks, it still has certain limitations. First, the experiments in this study were conducted on high-performance GPUs, whereas real-world applications often involve embedded devices with limited computational capacity. The suitability of this method in embedded environments therefore requires further verification, and future research will explore techniques such as model pruning, quantization, and knowledge distillation to reduce computational costs and enhance the applicability of the model in embedded environments. Second, the input data used in this study underwent ROI extraction. The robustness of the proposed method for raw finger vein images without ROI extraction still needs further investigation. Future work will consider directly recognizing unprocessed finger vein images, enabling the network to learn effective vein features directly from raw images. This approach aims to reduce image preprocessing steps, further enhancing the system’s real-time performance and practical application convenience.

Author Contributions

Conceptualization, Z.Z.; Methodology, Z.Z.; Software, Z.Z.; Validation, Z.Z.; Resources, C.S.; Writing—original draft, Z.Z. and C.S.; Writing—review & editing, P.L.; Project administration, P.L. and S.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are publicly available in the FV-USM and SDUMLA-HMT datasets, reference numbers [18,39].

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Sundararajan, K.; Woodard, D.L. Deep learning for biometrics: A survey. ACM Comput. Surv. 2018, 51, 65.
2. Nguyen, K.; Fookes, C.; Sridharan, S.; Tistarelli, M.; Nixon, M. Super-resolution for biometrics: A comprehensive survey. Pattern Recognit. 2018, 78, 23–42.
3. Cozzolino, D.; Verdoliva, L. Noiseprint: A CNN-based camera model fingerprint. IEEE Trans. Inf. Forensics Secur. 2019, 15, 144–159.
4. Wang, C.; Muhammad, J.; Wang, Y.; He, Z.; Sun, Z. Towards complete and accurate iris segmentation using deep multi-task attention network for non-cooperative iris recognition. IEEE Trans. Inf. Forensics Secur. 2020, 15, 2944–2959.
5. Jeevan, G.; Zacharias, G.C.; Nair, M.S.; Rajan, J. An empirical study of the impact of masks on face recognition. Pattern Recognit. 2022, 122, 108308.
6. Hashimoto, J. Finger vein authentication technology and its future. In Proceedings of the 2006 Symposium on VLSI Circuits, Honolulu, HI, USA, 13–17 June 2006; pp. 5–8.
7. Zhou, Y.; Kumar, A. Human identification using palm-vein images. IEEE Trans. Inf. Forensics Secur. 2011, 6, 1259–1274.
8. Yanagawa, T.; Aoki, S.; Oyama, T. Diversity of human finger vein patterns and its application to personal identification. Bull. Inform. Cybern. 2009, 41, 1–9.
9. Syazana-Itqan, K.; Syafeeza, A.R.; Saad, N.M.; Hamid, N.A.; Saad, W.H.B.M. A review of finger-vein biometrics identification approaches. Indian J. Sci. Technol. 2016, 9, 19.
10. Miura, N.; Nagasaka, A.; Miyatake, T. Feature extraction of finger-vein patterns based on repeated line tracking and its application to personal identification. Mach. Vis. Appl. 2004, 15, 194–203.
11. Mohsin, A.H.; Zaidan, A.A.; Zaidan, B.B.; Albahri, A.S.; Albahri, O.S.; Alsalem, M.A.; Mohammed, K.I. Real-time remote health monitoring systems using body sensor information and finger vein biometric verification: A multi-layer systematic review. J. Med. Syst. 2018, 42, 238.
12. Zirjawi, N.; Kurtanovic, Z.; Maalej, W. A survey about user requirements for biometric authentication on smartphones. In Proceedings of the 2015 IEEE 2nd Workshop on Evolving Security and Privacy Requirements Engineering (ESPRE), Ottawa, ON, Canada, 31 August 2015; IEEE: Piscataway, NJ, USA, 2015; pp. 1–6.
13. Elharrouss, O.; Almaadeed, N.; Al-Maadeed, S. A review of video surveillance systems. J. Vis. Commun. Image Represent. 2021, 77, 103116.
14. Kono, M. A new method for the identification of individuals by using vein pattern matching of a finger. In Proceedings of the 5th Symposium on Pattern Measurement, Yamaguchi, Japan, 20–22 January 2000; pp. 9–12.
15. Miura, N.; Nagasaka, A.; Miyatake, T. Extraction of finger-vein patterns using maximum curvature points in image profiles. IEICE Trans. Inf. Syst. 2007, 90, 1185–1194.
16. Park, K.R. Finger vein recognition by combining global and local features based on SVM. Comput. Inform. 2011, 30, 295–309.
17. Lu, Y.; Xie, S.J.; Yoon, S.; Park, D.S. Finger vein identification using polydirectional local line binary pattern. In Proceedings of the International Conference on ICT Convergence (ICTC), Jeju, Korea, 14–16 October 2013; pp. 61–65.
18. Mohd Asaari, M.S.; Suandi, S.A.; Rosdi, B.A. Fusion of band limited phase only correlation and width centroid contour distance for finger based biometrics. Expert Syst. Appl. 2014, 41, 3367–3382.
19. Li, J.; Ma, H.; Lv, Y.; Zhao, D.; Liu, Y. Finger vein feature extraction based on improved maximum curvature description. In Proceedings of the Chinese Control Conference (CCC), Guangzhou, China, 27–30 July 2019; pp. 7566–7571.
20. Song, W.; Kim, T.; Kim, H.C.; Choi, J.H.; Kong, H.-J.; Lee, S.-R. A finger-vein verification system using mean curvature. Pattern Recognit. Lett. 2011, 32, 1541–1547.
21. Chen, Y.; Lin, Z.; Zhao, X.; Wang, G.; Gu, Y. Deep learning-based classification of hyperspectral data. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 2094–2107.
22. Hou, Q.; Cheng, M.; Hu, X.; Borji, A.; Tu, Z.; Torr, P. Deeply supervised salient object detection with short connections. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 815–828.
23. Tamang, L.D.; Kim, B.W. Deep D2C-Net: Deep learning-based display-to-camera communications. Opt. Express 2021, 29, 11494–11511.
24. Huang, H.; Liu, S.; Zheng, H.; Ni, L.; Zhang, Y.; Li, W. DeepVein: Novel finger vein verification methods based on deep convolutional neural networks. In Proceedings of the IEEE International Conference on Identity, Security and Behavior Analysis (ISBA), New Delhi, India, 22–24 February 2017; pp. 1–8.
25. Song, J.M.; Kim, W.; Park, K.R. Finger-vein recognition based on deep DenseNet using composite image. IEEE Access 2019, 7, 66845–66863.
26. Qin, H.; Wang, P. Finger-vein verification based on LSTM recurrent neural networks. Appl. Sci. 2019, 9, 1687.
27. Noh, K.J.; Choi, J.; Hong, J.S.; Park, K.R. Finger-vein recognition based on densely connected convolutional network using score-level fusion with shape and texture images. IEEE Access 2020, 8, 96748–96766.
28. Zhang, J.; Lu, Z.; Li, M.; Wu, H. GAN-based image augmentation for finger-vein biometric recognition. IEEE Access 2019, 7, 183118–183132.
29. Hou, B.; Yan, R. Convolutional autoencoder model for finger-vein verification. IEEE Trans. Instrum. Meas. 2020, 69, 2067–2074.
30. Zhong, Y.; Li, J.; Chai, T.; Prasad, S.; Zhang, Z. Different dimension issues in deep feature space for finger-vein recognition. In Proceedings of the Chinese Conference on Biometric Recognition, Shanghai, China, 10–12 September 2021; pp. 295–303.
31. Ren, H.; Sun, L.; Guo, J.; Han, C.; Wu, F. Finger vein recognition system with template protection based on convolutional neural network. Knowl.-Based Syst. 2021, 227, 107159.
32. Li, X.; Zhang, B.-B. FV-ViT: Vision transformer for finger vein recognition. IEEE Access 2023, 11, 75451–75461.
33. Guo, M.-H.; Lu, C.-Z.; Hou, Q.; Liu, Z.-N.; Cheng, M.-M.; Hu, S.-M. SegNeXt: Rethinking convolutional attention design for semantic segmentation. arXiv 2022, arXiv:2209.08575.
34. Liu, J.; Ma, H.; Guo, Z. Multi-scale convolutional neural network for finger vein recognition. Infrared Phys. Technol. 2024, 143, 105624.
35. Bhushan, K.; Singh, S.; Kumar, K.; Kumar, P. Deep learning based automated vein recognition using Swin Transformer and Super Graph Glue model. Knowl.-Based Syst. 2025, 310, 112929.
36. Bai, H.; Tan, Y.; Li, Y.-J. Mask-guided network for finger vein feature extraction and biometric identification. Biomed. Opt. Express 2024, 15, 6845–6863.
37. Chen, J.; Kao, S.-H.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, don’t walk: Chasing higher FLOPS for faster neural networks. arXiv 2023, arXiv:2303.03667.
38. Hou, Q.; Zhang, L.; Cheng, M.M.; Feng, J. Strip pooling: Rethinking spatial pooling for scene parsing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 4003–4012.
39. Yin, Y.; Liu, L.; Sun, X. SDUMLA-HMT: A multimodal biometric database. In Proceedings of the 6th Chinese Conference on Biometric Recognition, Beijing, China, 3–4 December 2011; Springer: Cham, Switzerland, 2011; pp. 260–268.
40. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4510–4520.
41. Tan, M.; Le, Q.V. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv 2019, arXiv:1905.11946.
42. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
43. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 9992–10002.
Figure 1. Overall architecture of FMFVNet. The network consists of four feature extraction stages, with spatial downsampling and channel expansion performed through convolutions.
Figure 2. (a) FasterNet Block structure; (b) Partial Convolution (PConv).
Figure 3. (a) Clear image; (b) uneven illumination image; (c) low-contrast image.
Figure 4. Overall structure of the Multi-Scale Convolutional Attention (MSCA) module.
Table 1. Spatial resolution, number of channels, and applied operations of each stage in FMFVNet.

Stage | Input Size | Output Channels | Output Size | Operation
Embedding Layer | 224 × 224 × 3 | 40 | 56 × 56 × 40 | 4 × 4 Conv
Stage 1 | 56 × 56 × 40 | 40 | 56 × 56 × 40 | FasterNet Block × 1
Merging Layer | 56 × 56 × 40 | 80 | 28 × 28 × 80 | 2 × 2 Conv
Stage 2 | 28 × 28 × 80 | 80 | 28 × 28 × 80 | FasterNet Block × 2
Merging Layer | 28 × 28 × 80 | 160 | 14 × 14 × 160 | 2 × 2 Conv
Stage 3 | 14 × 14 × 160 | 160 | 14 × 14 × 160 | MSCA + FasterNet Block × 8
Merging Layer | 14 × 14 × 160 | 320 | 7 × 7 × 320 | 2 × 2 Conv
Stage 4 | 7 × 7 × 320 | 320 | 7 × 7 × 320 | FasterNet Block × 2
Global Pooling Layer | 7 × 7 × 320 | 320 | 1 × 1 × 320 | Global Average Pooling
1 × 1 Conv | 1 × 1 × 320 | 1280 | 1 × 1 × 1280 | 1 × 1 Conv
Table 2. Comparison of ACC with other methods on FV-USM and SDUMLA-HMT.

Methods | FV-USM ACC (%) | SDUMLA-HMT ACC (%)
MobileNetV2 | 99.19 | 98.36
EfficientNet-B0 | 99.19 | 98.67
ResNet-50 | 99.39 | 98.85
Swin-T | 98.58 | 98.42
LFVRN-CE [31] | 98.58 | 97.74
Coding SA [30] | 99.39 | 96.01
FV-ViT [32] | 99.59 | 94.51
FMFVNet | 99.80 | 99.06
Table 3. Comparison of Params and FLOPs across different models.

Method | Params (M) | FLOPs
MobileNetV2 | 3.47 | 300.84 M
EfficientNet-B0 | 5.24 | 385.88 M
ResNet-50 | 24.81 | 4.13 G
Swin-T | 28.27 | 4.32 G
FMFVNet | 3.53 | 358.46 M
Table 4. Comparison of inference time across different models.

Method | Inference Time (ms)
MobileNetV2 | 2.21
EfficientNet-B0 | 3.02
ResNet-50 | 2.63
Swin-T | 4.69
FMFVNet | 1.75
Table 5. Comparison of model accuracy on datasets with and without MSCA.

MSCA | FV-USM ACC (%) | SDUMLA-HMT ACC (%)
Without | 99.19 | 98.36
With | 99.80 | 99.06
