1. Introduction
With the rapid development of security systems and the internet economy, personal identification has become an essential technology for security. Moreover, during outbreaks of seasonal influenza, face recognition requires users to remove their masks, while fingerprint and palm-print recognition require contact with the equipment; both increase the risk of viral infection. Veins are vascular structures beneath the skin that are difficult to damage or replicate. A vein pattern can only be captured from a living person and does not change over time, so vein-based biometrics provide accurate results as well as a higher level of security. The proposed dorsal hand vein (DHV) recognition system is therefore contactless, low cost, and more secure than other popular biometric systems.
In recent years, DHV recognition has gained much attention as an emerging biometric technology. Owing to its safety, accuracy, and effectiveness, an increasing number of researchers have become involved [1,2,3,4]. Lefkovits et al. [5] presented a dorsal hand vein recognition method based on convolutional neural networks (CNNs). Wang et al. [6] put forward a dorsal hand vein recognition method based on bit planes and block mutual information. Chin et al. [7] described dorsal hand vein recognition using statistical and Gray Level Co-occurrence Matrix (GLCM)-based feature extraction techniques and artificial neural networks (ANNs). Liu et al. [8] presented an improved biometric graph matching method that included edge attributes for graph registration and a matching module to extract discriminative features. Sayed et al. [9] proposed a real-time dorsal hand recognition system that achieves good results at a high frame rate.
It is well known that DHV recognition belongs to the family of hand-based biometrics: it verifies personal identity by analyzing and matching the subcutaneous vein structure on the back of the hand. Region of interest (ROI) extraction is a key step in DHV recognition. Lin et al. [10,11,12] binarized an image of an open hand, calculated the Euclidean distance between each edge pixel of the hand and the midpoint of the wrist, and used these distances to construct a distance distribution map whose shape closely follows the geometry of the dorsal hand; they selected the second and fourth finger webs as reference points to define a square ROI. Damak et al. [13] used Otsu thresholding for hand segmentation, traced the hand boundary, drew hand-boundary distance contours by scanning the contour, and rotated the image so that the line connecting the first and third finger valleys became horizontal; four hand boundaries (vertical left limit, vertical right limit, horizontal lower limit, and horizontal upper limit) then defined the ROI. Cimen [14] segmented the hand image and determined the boundaries of the hand region. The entire image was then scanned pixel by pixel from right to left and top to bottom; the first pixel with a value of 255 was taken to be the tip of the bone, and a square region of 256 × 256 pixels located 150 pixels below this point was selected as the ROI. Meng et al. [15,16,17,18,19,20] converted the image of a clenched hand into a binary image and used morphological methods to detect the dorsal boundary of the hand; after computing the distance between each boundary point and the midpoint of the wrist, the valley points between the fingers were located as the corresponding valley points in the distance contour, and a fixed-size sub-image based on valley points 1 and 3 was extracted. A total of 240 images of 80 users were obtained from the Bosphorus Hand Vein Database [21]. Nozaripour et al. [22] combined sparse representation, the kernel trick, and an ROI extraction technique different from previous work, introducing a new dorsal hand vein recognition method that is robust to rotation. A general ROI extraction algorithm usually consists of the following main steps: (1) converting the hand image to a binary image with a segmentation algorithm; (2) tracking the hand boundary; (3) calculating the distances between the contour points (points located on the hand contour) and a reference point (the midpoint of the wrist is generally used); (4) locating the ROI based on the detected points; and (5) cropping the ROI sub-image.
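The distance-profile idea at the heart of steps (2)–(4) can be sketched in a few lines of NumPy. The toy contour, wrist midpoint, and valley-picking heuristic below are illustrative assumptions for exposition only, not the algorithm of any cited work:

```python
import numpy as np

def wrist_distance_profile(contour, wrist_mid):
    """Step (3): distance from each hand-contour point to the wrist midpoint."""
    return np.linalg.norm(contour - wrist_mid, axis=1)

def finger_valleys(profile, n=4):
    """Step (4): local minima of the distance profile mark the finger webs."""
    interior = profile[1:-1]
    is_min = (interior < profile[:-2]) & (interior < profile[2:])
    idx = np.where(is_min)[0] + 1
    # keep the n deepest minima as candidate valley points
    return idx[np.argsort(profile[idx])][:n]

# Toy "contour": a zig-zag whose dips mimic fingertips and finger valleys.
contour = np.array([[0, 10], [1, 4], [2, 9], [3, 3], [4, 8],
                    [5, 2], [6, 9], [7, 4], [8, 10]], dtype=float)
wrist_mid = np.array([4.0, 0.0])

profile = wrist_distance_profile(contour, wrist_mid)
valleys = finger_valleys(profile, n=3)   # indices of the deepest dips
```

On a real binary hand image, the contour would come from boundary tracking of the segmented hand (steps (1)–(2)), and the ROI square would then be anchored on a chosen pair of valley points (step (5)).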
The research objects to date have been dorsal hand vein images acquired in constrained environments, with restricted hand positions and very clean backgrounds. Traditional image processing algorithms segment hands well in such images; however, they struggle to segment complete hands against complex backgrounds, which limits their practical applications. So far, deep learning techniques have not been studied for ROI extraction from dorsal hand vein images. Today, deep learning has become one of the most important techniques in the field of computer vision: many classical neural networks [23,24,25,26,27,28] have been proposed, and impressive results have been achieved in many recognition tasks.
Keypoint detection methods fall into two categories: one directly regresses the keypoint coordinates through the fully connected layer of a neural network; the other removes the fully connected layer and outputs a heat map, where the coordinates of the heat map's peak are the keypoint coordinates. Heat map-based keypoint regression is a computer vision technique that generates a heat map highlighting the presence or absence of specific features in an image and predicting their locations. In this approach, each keypoint is represented as a Gaussian function centered at its true location, and the heat map is produced by summing these Gaussian functions. The resulting heat maps can be used to train a convolutional neural network to perform keypoint detection and regression on new images. Heat map regression can also be combined with color encoding to visually highlight targets and relevant keypoints based on their confidence scores [29]. Most face keypoint detection methods regress the keypoint coordinates directly from the fully connected layer [30,31]: although this discards the spatial information of the feature map and thus lacks spatial generalization, it works because the face can be regarded as a rigid body in which the relative positions of the points are nearly constant. The hand, in contrast, is very flexible. When non-contact dorsal hand vein images are collected without a hand-fixation device, the hand may be pitched, bent, or opened and closed to different degrees, and the relative positions of the keypoints vary greatly; this study therefore used the heat map method to detect the keypoints of dorsal hand vein images. A U-Net [32] network model, a classical fully convolutional neural network, was adopted. It is divided into downsampling and upsampling paths, and low-level features are fused with high-level features through skip connections to obtain richer feature information, allowing the keypoints of dorsal hand vein images to be detected more accurately.
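The Gaussian heat map encoding described above, and the corresponding peak decoding, can be illustrated with a minimal NumPy sketch; the grid size, keypoint location, and σ below are arbitrary assumptions for the example:

```python
import numpy as np

def gaussian_heatmap(h, w, center, sigma=2.0):
    """Render one keypoint as a 2-D Gaussian centered at its true location."""
    ys, xs = np.mgrid[0:h, 0:w]
    cy, cx = center
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

# Ground-truth map for a keypoint at row 20, column 45 of a 64x64 grid.
hm = gaussian_heatmap(64, 64, center=(20, 45))

# Decoding: the coordinates of the heat map's peak are the keypoint coordinates.
peak = np.unravel_index(np.argmax(hm), hm.shape)
```

A network trained against such maps outputs one heat map per keypoint, and the argmax (or a differentiable variant such as Soft-argmax) recovers the coordinates.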
Deep learning is developing rapidly in the field of computer vision and outperforms traditional image processing algorithms for detection and recognition in complex scenes, so it is important to systematically study deep learning for ROI extraction via keypoint detection in non-contact dorsal hand vein images. To this end, this study constructed a dorsal hand vein dataset by capturing unconstrained dorsal hand vein images with a self-developed infrared image acquisition device and proposed an improved U-Net network model. Residual modules [33] were added to the downsampling path of the original U-Net to mitigate the model degradation caused by deepening the network, and the transposed convolution [34] in the upsampling path was replaced with bilinear interpolation [35] to avoid “checkerboard artifacts”. To encourage the final output feature map to approximate a Gaussian distribution, we introduced the Jensen–Shannon (JS) divergence loss function as supervision. Finally, Soft-argmax [36] was used to decode the keypoint coordinates from the feature map, enabling end-to-end training. We applied the improved network to dorsal hand vein image keypoint detection and then extracted the ROI based on the detected keypoints.
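As a rough illustration of the last two components, a NumPy sketch of Soft-argmax decoding and the JS divergence is given below; the temperature β, grid size, and test keypoint are illustrative assumptions, not this paper's settings:

```python
import numpy as np

def soft_argmax(heatmap, beta=10.0):
    """Differentiable peak decoding: expected coordinates under a softmaxed map."""
    h, w = heatmap.shape
    p = np.exp(beta * (heatmap - heatmap.max()))  # softmax with temperature beta
    p /= p.sum()
    ys, xs = np.mgrid[0:h, 0:w]
    return (p * ys).sum(), (p * xs).sum()         # expectation over the grid

def js_divergence(p, q, eps=1e-12):
    """Symmetric divergence pulling the predicted map toward the target Gaussian."""
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log((a + eps) / (b + eps)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# A Gaussian target map centered at row 12, column 7 of a 32x32 grid.
ys, xs = np.mgrid[0:32, 0:32]
g = np.exp(-((ys - 12) ** 2 + (xs - 7) ** 2) / (2 * 2.0 ** 2))
cy, cx = soft_argmax(g)   # close to (12, 7)
```

Because the expectation is a weighted sum rather than a hard argmax, gradients flow through the decoding step, which is what permits the end-to-end training described above.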