Face Recognition Using the SR-CNN Model

In order to solve the problem of face recognition in complex environments being vulnerable to illumination change, object rotation, occlusion, and so on, which leads to the imprecision of target position, a face recognition algorithm with multi-feature fusion is proposed. This study presents a new robust face-matching method named SR-CNN, combining the rotation-invariant texture feature (RITF) vector, the scale-invariant feature transform (SIFT) vector, and the convolution neural network (CNN). Furthermore, a graphics processing unit (GPU) is used to parallelize the model for an optimal computational performance. The Labeled Faces in the Wild (LFW) database and self-collection face database were selected for experiments. It turns out that the true positive rate is improved by 10.97–13.24% and the acceleration ratio (the ratio between central processing unit (CPU) operation time and GPU time) is 5–6 times for the LFW face database. For the self-collection, the true positive rate increased by 12.65–15.31%, and the acceleration ratio improved by a factor of 6–7.


Introduction
Face matching is a biometric technology that is widely used in a variety of areas [1], such as public security control, intelligent video monitoring, verification of identity, robot vision, etc. [2]. In the geometrical domain, it is a novel way of converting what the human eyes normally do in recognizing one person from another by implicitly extracting some morphological features [3,4]. It was suggested that multimodal emotion recognition can be combined with speech and facial expressions for emotion analysis [5]. Face matching is the most crucial stage in face recognition. Among these stages, local features are invariant to image scale, rotation, illumination, and angle of view, and still have good matching performance under the influence of noise, occlusion, and other factors. The extraction method of local features mainly includes two steps: detection of feature point and generation of feature description. Compared with the selection of the detection algorithm, the algorithm of local feature description has a more significant impact on the performance of the local feature. Furthermore, how to construct a feature descriptor with robustness and uniqueness is of paramount importance in the research of local features.
The purpose of constructing the feature descriptor is to establish the description of the neighborhood information of the feature points, thereby constructing the descriptive vectors of the feature points. These descriptors are based either on the regional distribution, the spatial frequency,

Detection of Scale-Space Extrema
Firstly, the Gaussian pyramid and the difference of Gaussian pyramid were established, and then the space extremum was detected. In the construction process of the Gaussian pyramid, the size of the image was doubled; on this basis, Gaussian blur was carried out on the image under this size, and the set of a few blurred images constituted an octave. Then, down-sampling was carried out on the most blurred image under this octave, the length and width of which were halved once each, making the area of the image into a quarter of the original. The processed image was the initial image of the next octave, based on the completion of Gaussian blur processing on this octave; this process was repeated until the complete construction of all octaves, thus forming the Gaussian pyramid. Then, the two adjacent images of the same octave were subtracted to obtain the interpolation image. Finally, the DOG pyramid was built using the collection of these images of all octaves.
where k is the scale of the Gaussian kernel; the initial value of k is set to 1, and the scale is incremented by k times. The establishment of the DOG Pyramid is shown in Figure 2.

Detection of Scale-Space Extrema
Firstly, the Gaussian pyramid and the difference of Gaussian pyramid were established, and then the space extremum was detected. In the construction process of the Gaussian pyramid, the size of the image was doubled; on this basis, Gaussian blur was carried out on the image under this size, and the set of a few blurred images constituted an octave. Then, down-sampling was carried out on the most blurred image under this octave, the length and width of which were halved once each, making the area of the image into a quarter of the original. The processed image was the initial image of the next octave, based on the completion of Gaussian blur processing on this octave; this process was repeated until the complete construction of all octaves, thus forming the Gaussian pyramid. Then, the two adjacent images of the same octave were subtracted to obtain the interpolation image. Finally, the DOG pyramid was built using the collection of these images of all octaves.
D(x, y, σ) = [G(x, y, kσ) − G(x, y, σ)] ⊗ I(x, y) = L(x, y, kσ) − L(x, y, σ) (1) where k is the scale of the Gaussian kernel; the initial value of k is set to 1, and the scale is incremented by k times. The establishment of the DOG Pyramid is shown in Figure 2.

Detection of Scale-Space Extrema
Firstly, the Gaussian pyramid and the difference of Gaussian pyramid were established, and then the space extremum was detected. In the construction process of the Gaussian pyramid, the size of the image was doubled; on this basis, Gaussian blur was carried out on the image under this size, and the set of a few blurred images constituted an octave. Then, down-sampling was carried out on the most blurred image under this octave, the length and width of which were halved once each, making the area of the image into a quarter of the original. The processed image was the initial image of the next octave, based on the completion of Gaussian blur processing on this octave; this process was repeated until the complete construction of all octaves, thus forming the Gaussian pyramid. Then, the two adjacent images of the same octave were subtracted to obtain the interpolation image. Finally, the DOG pyramid was built using the collection of these images of all octaves.
where k is the scale of the Gaussian kernel; the initial value of k is set to 1, and the scale is incremented by k times. The establishment of the DOG Pyramid is shown in Figure 2. For each octave of scale space, the initial image was repeatedly convolved with Gauss to obtain the image set shown on the left. By subtracting the Gaussian filter, the difference graph of the Gaussian function on the right could be obtained. After each octave, the Gaussian image was sampled and the process was repeated.
The next step was to traverse and detect the extremum points in three-dimensional space (two-dimensional image, one-dimensional scale). The specific process was to compare the gray value of the current feature point with the other 26 points (eight points adjacent to this point at the current scale, and the nearest 9 × 2 points before and after the scale).

Extrema Localization of Key Points
Since the image is a discrete space, the coordinates of the feature points are all integers, but the coordinates of the key point detected in the scale space are not always integers. To precisely locate the coordinates of key points, the Taylor formula was adopted, and the location of the extreme points was determined through five iterations.
The known point A was used to estimate the value of the nearby point B in the above formula, where x = (x, y, σ) T is the offset of point B relative to point A. The extreme value of Equation (2) was determined by taking the derivative of this formula, and setting it to 0, as follows: If the offset is larger than 0.5 in any dimension, indicating that the point is closer to another sample point C, then it is updated as C; otherwise, the iteration ends and x 0 is appended to the current integer coordinates of the feature point. This can be obtained by substituting Equation (3) into Equation (2), as follows: When D(x 0 ) ≤ 0.03 [25], we can determine that the point is with low contrast, which is susceptible to noise and becomes unstable; thus, it needs to be removed.
After precise positioning, there are some edge points in the feature points which are unstable due to some noise; thus, they need to be removed. The characteristic of edge points is that the principle curvature in the direction of the vertical edge is larger, and is smaller in the direction of the tangent to the edge. According to these characteristics, the edge points can be removed. The principle curvature can be obtained using the second-order Hessian matrix, as follows: where D xy and D yy can be obtained by subtracting the current pixel points from the surrounding pixel points. The principal curvature of the point is proportional to the eigenvalue of the Hessian matrices of a certain point; thus, it can be determined by calculating the eigenvalues of the Hessian matrix. Let α and β be the matrix eigenvalues (assuming α > β), giving the following equations:

Orientation Assignment of the Feature Point
For the key points detected in the DOG, the gradient and distribution of direction of the pixels in the 3σ domain of the pyramid are collected.
where l(x, y) and ρ(x, y) are the modulus and direction of the gradient at (x, y) in the DOG, respectively.

Construction of SIFT/RITF Descriptor
For the traditional SIFT feature descriptor, firstly, the image was selected from the pyramid according to the feature point scale; then, we calculated the gradient direction and the amplitude of each point in the domain scope. Each feature point was composed of 16 (4 × 4) sub-points, of which each sub-point had vector data of the eight directions, generating 128 (4 × 4 × 8) dimensions of the eigenvector. Finally, to avoid the influence of illumination, the length of the eigenvector was normalized, and the eigenvector was denoted as F si f t .
Then, for the construction of the feature vector of the RITF, with the feature point as the centre and the longest diagonals D of the image as the diameter, a circular region was constructed as the field of the feature points. The field was divided into a number of concentric circular regions, and, upon eliminating the Gaussian blur and zeroing operation, it still maintained a good rotation invariance, while also avoiding the computational cost from the rotation of the image.
On the circumference where the key point is the center and r (desirable D/5, D/4, D/3, etc.) is the radius, m-points are sampled by the same interval. We compared the size of m pixels (denoted as I 0 (x, y), I 1 (x, y), . . . , I m−1 (x, y)) with the center pixel (denoted as I c (x, y)), and the image binary was implemented. That is, if I c (x, y) < I i (x, y), (i = 0, 1, . . . , m − 1), the sampling pixel was set to 0; otherwise, it was set to 1. Therefore, the calculation formula of the central pixel point of the RITF feature (denoted as RITF ri m,r ) is as follows: where where ROR(x, k) indicates the right cycle shift k times of m-bit binary x. Upon rotating the image area to the reference direction according to the main direction of the key points, and the image area of 8 × 8 was taken as the area to be described. In this area, for each pixel point I i (x, y), (i = 1, 2, . . . , 64), the RITF feature RITF ri m,r performed calculations, which was centered on it, and denoted as ritf i (i = 1,2, . . . ,64). The further the I i (x, y) distance is from the center point I c (x, y), the less description data are provided. Therefore, it is necessary for ritf i to weight, and the weight coefficient is as follows: where (x i , y i ) and (x c , y c ) are the coordinates of the pixel point I i (x, y) and the center point I c (x, y), respectively, and β is a constant. Then, all the weighted RITF eigenvalues are composed of a one-dimensional eigenvector, which is denoted as R i , giving the following: Finally, in order to eliminate the effect of the changes due to illumination, R i was normalized as R i R i → R i . Thereby, a 64-dimensional eigenvector F rit f was obtained, which is the RITF feature descriptor for the surrounding area of the key point I c (x, y).
The SIFT/RITF feature descriptor is defined as follows: where F si f t is a 128-dimensional SIFT feature descriptor, F rit f is a 64-dimensional RITF feature descriptor, F sr is a 192-dimensional SIFT/RITF feature descriptor, and α (α < 1) is a relative weighting factor depending on the pixel size and the type of the image. In this paper, according to the experimental optimization, we set α to be 0.6.

Feature Extraction with CNN Model
The CNN model proposed in this paper is from Alexnet. The CNN model consists of five convolution layers, three max-pooling layer, and two fully connected layers. The last layer adopts a Softmax classifier [9]. After all kinds of images reach the optimal classification effect, the parameters of each layer of the convolutional neural network are determined. Global features are extracted according to these parameters, and the features of the third layer are extracted. The CNN firstly uses a large sample for pre-training, and then uses the pre-trained CNN model to extract features. Figure 3 shows the structure of the CNN network.
where (xi, yi) and (xc, yc) are the coordinates of the pixel point Ii(x, y) and the center point Ic(x, y), respectively, and β is a constant. Then, all the weighted RITF eigenvalues are composed of a one-dimensional eigenvector, which is denoted as Ri, giving the following: Finally, in order to eliminate the effect of the changes due to illumination, Ri was normalized as . Thereby, a 64-dimensional eigenvector was obtained, which is the RITF feature descriptor for the surrounding area of the key point Ic(x, y).
The SIFT/RITF feature descriptor is defined as follows: where sift F is a 128-dimensional SIFT feature descriptor, ritf F is a 64-dimensional RITF feature descriptor, sr F is a 192-dimensional SIFT/RITF feature descriptor, and  ( 1)

 
is a relative weighting factor depending on the pixel size and the type of the image. In this paper, according to the experimental optimization, we set  to be 0.6.

Feature Extraction with CNN Model
The CNN model proposed in this paper is from Alexnet. The CNN model consists of five convolution layers, three max-pooling layer, and two fully connected layers. The last layer adopts a Softmax classifier [9]. After all kinds of images reach the optimal classification effect, the parameters of each layer of the convolutional neural network are determined. Global features are extracted according to these parameters, and the features of the third layer are extracted. The CNN firstly uses a large sample for pre-training, and then uses the pre-trained CNN model to extract features. Figure 3 shows the structure of the CNN network.

Batch Normalization Layer
The Alexnet consists of five convolution layers, three max-pooling layers, and three fully connected layers. The input is the scale-normalized image. The training of the deep network is a complex process; as long as the first few layers of the network change slightly, then the changes in the next few layers will be accumulated and magnified. Once the distribution of the input data

Batch Normalization Layer
The Alexnet consists of five convolution layers, three max-pooling layers, and three fully connected layers. The input is the scale-normalized image. The training of the deep network is a complex process; as long as the first few layers of the network change slightly, then the changes in the next few layers will be accumulated and magnified. Once the distribution of the input data changes, the layer requires constant adaptation to the new distribution, thereby affecting the training speed of the network.
The stochastic gradient descent (SGD) is a simple and efficient method for the training depth network; however, it has the disadvantage of us having to carefully adjust model parameters such as learning rate, initial values, weight attenuation coefficient, dropout ratio, and so on. The choice of these parameters is crucial to the training result; thus, we spent a lot of time on this work. To this end, we deleted the local response normalization (LRN) layer and introduced the batch normalization (BN) layer [45]. Due to BN having the ability to improve network generalization, we could remove the dropout parameter. In each layer of the network, a normalization layer was inserted, which is a normalization process (normalized to a mean of 0, a variance of 1).
Consider a mini-batch α of size n. Since the normalization is applied to each activation independently, we can take any dimension as an example. We have n values of this activation in the mini-batch, α = {x 1 , . . . , x n }. The elements of each mini-batch are sampled from the same distribution. Let the normalized values be {x 1 , . . . ,x n }, and their linear transformations be β = {y 1 , . . . , y n }. The forward conduction process formula of the batch normalization network layer is shown in Algorithm 1.

Algorithm 1. Batch normalization (BN)
Input: Values of n over a mini-batch: α = {x 1 , . . . , x n }; Parameters to be learned: η, λ. Output: β = {y 1 , . . . , y n }; In this algorithm, BN can increase the speed of training, whereby it is possible to choose a larger initial learning rate, which can accelerate the attenuation such that the "optimal learning rate" is obtained, increasing the speed of adjustment of the learning rate. Moreover, it also alleviates the problem of internal covariate shift.

Activation Function Layer
The main function of the activation function in neural networks is to add some nonlinear factors to the neural network, such that the neural network can better solve more complex problems. Without the activation function, the weights and deviations are only linearly transformed. The activation function makes backpropagation possible because the error gradient of the activation function can be used to adjust weights and deviations. This is not possible without a differentiable nonlinear function. When a number is applied to a deep neural network, it mainly affects the training of the network in the two processes of forward propagation and backward propagation. Figure 4 shows the PReLU activation function. The PReLU is a variant of ReLU, where the slope of the negative region is determined by data, and is not predefined. It was found that PReLU converges faster than ReLU [46]. PReLU was chosen over ordinary ReLU to solve the dying ReLU problem; its mathematical representation is as follows: The PReLU is a variant of ReLU, where the slope of the negative region is determined by data, and is not predefined. It was found that PReLU converges faster than ReLU [46]. PReLU was chosen over ordinary ReLU to solve the dying ReLU problem; its mathematical representation is as follows: where a i is a learning parameter; when a i is a fixed nonzero decimal, it is equivalent to LeakyReLU; when it is 0, PReLU is equivalent to ReLU. Its functional expressions are as follows:

Backpropagation (BP) of CNN Networks
BP neural networks have strong nonlinear mapping capabilities, as well as a high ability of self-learning and self-adaptation, the ability to apply learning results to new knowledge, and a certain fault tolerance. Based on the above advantages, we used BP to reverse-adjust the parameters. Since the cross-entropy loss function can be assigned a linear gradient, effectively preventing the gradient from vanishing, it can overcome the problem of slow updating of variance cost function. In this paper, we used the cross-entropy loss function instead of the mean variance loss function, and the last activation function was replaced by the sigmoid function, because the cross-entropy loss function is more compatible with it. The cross-entropy cost function is defined as follows: where y is the expected output, and a is the actual output. The following equations are its derivatives: ∂C ∂b The weight update is affected by σ(z) − y, which is the impact of the error. Therefore, when the error is large, the weight updates quickly, and, when the error is small, the weight updates slowly.

Feature Vector Extraction
The input face image passes through the convolution layer, pooling layer, and fully connected layer. The output of the second fully connected layer is taken as the feature vector extracted by the CNN network, and is denoted as F cnn . The fully connected layer works by looking at the output of the previous layer (the activation diagram representing the high-order characteristics) and determining which functions are the most relevant to the particular class. The structure of the fully connected layer is shown in Figure 5.
layer. The output of the second fully connected layer is taken as the feature vector extracted by the CNN network, and is denoted as cnn F . The fully connected layer works by looking at the output of the previous layer (the activation diagram representing the high-order characteristics) and determining which functions are the most relevant to the particular class. The structure of the fully connected layer is shown in Figure 5.

… …
Output Layer Figure 5. Structure of the fully connected layer.

Feature Fusion and Classification
The SR-CNN descriptors were constructed from 128-dimensional SIFT descriptors, 64-dimensional RITF descriptors, and 4096-dimensional CNN features, which contain a variety of image information and should adopt different matching strategies. For the SIFT local feature vector S, Euclidean distance was used as its matching strategy.
where Si,m and Sj,m represent the m-th eigenvector of the i-th feature point and the m-th eigenvector of the j-th feature point (m ≤ 128) of the SIFT local feature, respectively. For the RITF local feature vector R, the χ 2 statistic was used as the matching strategy.
where Ru,m and Rv,m represent the m-th eigenvector of the u-th feature point and the m-th eigenvector of the v-th feature point (m ≤ 64) of the RITF local feature, respectively. For a given face image, in this paper, we extracted two groups of features, denoted as and cnn F . Considering the large differences in the features of different categories, this paper used the mean and variance of two groups of features to normalize the corresponding feature vectors. The mean and variance of each group's features were obtained in the training dataset. The two normalized features were connected together to get the fusion feature vector, which is expressed as Figure 5. Structure of the fully connected layer.

Feature Fusion and Classification
The SR-CNN descriptors were constructed from 128-dimensional SIFT descriptors, 64-dimensional RITF descriptors, and 4096-dimensional CNN features, which contain a variety of image information and should adopt different matching strategies. For the SIFT local feature vector S, Euclidean distance was used as its matching strategy.
where S i,m and S j,m represent the m-th eigenvector of the i-th feature point and the m-th eigenvector of the j-th feature point (m ≤ 128) of the SIFT local feature, respectively. For the RITF local feature vector R, the χ 2 statistic was used as the matching strategy.
where R u,m and R v,m represent the m-th eigenvector of the u-th feature point and the m-th eigenvector of the v-th feature point (m ≤ 64) of the RITF local feature, respectively. For a given face image, in this paper, we extracted two groups of features, denoted as F sr and F cnn . Considering the large differences in the features of different categories, this paper used the mean and variance of two groups of features to normalize the corresponding feature vectors. The mean and variance of each group's features were obtained in the training dataset. The two normalized features were connected together to get the fusion feature vector, which is expressed as For the fusion feature vector F extracted from each face image, this paper used the random forest (RF) learning method [47] to construct a classifier for feature classification. The number of decision trees in the random forest method was set to 100; the classification results had high accuracy with a feature screening mechanism.

Architecture of Parallel Computing
In 2007, NVIDIA released the official development platform Compute Unified Device Architecture (CUDA) for GPU programming [48]. The core idea of CUDA is based on three important abstract concepts: shared memory, shielding synchronization, and thread group hierarchy.
The GPU can be used for the computing of parallel data, such that many data elements can be executed in parallel with a higher computational density. The model of CUDA programming releases the GPU's parallel capabilities. Due to the unique hardware architectural design of GPU, it has strong parallel data computing capabilities (a large number of transistors on GPU are used for arithmetic logic operations). Since several data elements run independently on a thread and have high computational density, there is no need for larger data caches.

Design of GPU Implementation
This paper introduces the optimization and acceleration of the SR-CNN model using the aspects delineated in Figure 6. In this framework, the left side of the dotted line indicates that the program is running on the CPU side, and the right side indicates that it is running on the GPU side.

Architecture of Parallel Computing
In 2007, NVIDIA released the official development platform Compute Unified Device Architecture (CUDA) for GPU programming [48]. The core idea of CUDA is based on three important abstract concepts: shared memory, shielding synchronization, and thread group hierarchy.
The GPU can be used for the computing of parallel data, such that many data elements can be executed in parallel with a higher computational density. The model of CUDA programming releases the GPU's parallel capabilities. Due to the unique hardware architectural design of GPU, it has strong parallel data computing capabilities (a large number of transistors on GPU are used for arithmetic logic operations). Since several data elements run independently on a thread and have high computational density, there is no need for larger data caches.

Design of GPU Implementation
This paper introduces the optimization and acceleration of the SR-CNN model using the aspects delineated in Figure 6. In this framework, the left side of the dotted line indicates that the program is running on the CPU side, and the right side indicates that it is running on the GPU side.

Establishment of DOG and Detection of Extrema
In the process of the establishment of the Gaussian pyramid, images of each layer need to perform Gaussian filtering, which is time-consuming; thus, two-dimensional convolution was used for the calculations. According to their separability, two dimensions (rows and columns) were used to calculate the parallel convolution kernel function.
Since the column convolution is similar to the row convolution, the parallel construction of the Gaussian pyramid is described below in detail using the column convolution operation as an example. The block was set to two dimensions and the thread was set to one dimension. Part of the image data read in blocks after each calculation are stored in shared memory, such that the data communication between threads can speed up the parallel computing of the program. After the establishment of the Gaussian pyramid, the DOG was obtained by directly subtracting the pixels on the image of the adjacent scale. During the subtraction, non-adjacent images were not affected by the calculation, and they could be processed in parallel on GPU. In the same way, it was possible to detect the extreme points in parallel to the pixels of each layer of images in DOG and find the potential key points. Then, the location and scale of the key points were marked using the binary bitmap, which was inserted into the sequence and uploaded to the host terminal.

Histogram Statistics
In this paper, a circular area with a radius of 15 pixels was used as domain information for the key point descriptors. To compensate for the feature point instability caused by no affine invariance, the gradient amplitude was firstly weighted by the Gaussian smoothing in the whole circular domain; then, division of the region was carried on to this field, dividing 2π into 36 equal points, with each region corresponding to a block. The histogram of the feature points could be statistically acquired using gradient information of the pixel gradient in every block, and the value of the maximum group distance (bin) in the histogram was the main direction of the feature point. The data were imported into red/green/blue/alpha (RGBA) texture storage, and the multi-objective rendering algorithm on the advanced graphics card was used to accelerate the processing, to further optimize the calculation. Finally, the calculated results were compressed to a one-dimensional array and passed back to the host terminal from the device terminal, followed by updating the gradient direction on the CPU and establishing the SIFT descriptor.

Construction of RITF Descriptor
Since an image typically has hundreds of feature points, and the calculation of RITF features of each feature point is time-consuming, we processed the RITF on the GPU. Firstly, the images and the data of the feature points were copied from the host to the device, and then saved to global memory; then, each block was assigned to calculate the RITF eigenvalue of a feature point. As a grid can allow up to 65,535 × 65,535 blocks, the number of blocks is much larger than the number of feature points; thus, there is no need to worry about the problem of data overflow. From Equations (10), (12), and (13), we know that the calculations of the RITF eigenvalue and of the weighting coefficient do not affect each other. Therefore, different kernel functions were designed to be called in parallel in two streams, and the RITF eigenvalues and weighting coefficients were stored in different arrays. To ensure that their values corresponded to each other, the calculation order of the RITF eigenvalue was made to be consistent with that of the weighting coefficient. Finally, before the construction of the RITF descriptor, it was ensured that all calculations proceeded. To do this, in these kernel functions, in addition to returning the calculation results, the state information of calculations was also returned.

Feature Matching of SR-CNN
In the process of feature matching, the face image data were uploaded from the host to the device, and we used the GPU to process in parallel. SIFT/RITF feature vectors of each feature point were put in a block of shared memory, and each block was assigned 192 threads; the first 128 threads executed the calculation of Euclidean distance, and the following 64 threads executed the calculation of the χ 2 statistic. For CNN features, each feature map was mapped to a thread block, and each neuron on the feature map was mapped to each thread on the thread block. The three dimensions of the thread grid (x, y, z) correspond to the width, height, and number of each layer, respectively. The settings in the kernel function were as follows: kernel <z, x, y>. The kernel function started z thread blocks, each containing x × y threads, and started z × x × y threads altogether. Because the number of threads in a thread block was 512, multiple thread blocks were allocated here. Finally, by calling the _syncthreads() function to achieve thread synchronization, each thread was evaluated before executing the subsequent process.

Experimental Results and Analysis
This section evaluates the performance of the SR-CNN model using a public dataset and a self-collecting dataset. One open dataset was the Labeled Faces in the Wild (LFW) dataset, which is explained in detail later. To verify the validity of the model, we compared the experimental results with other methods. The specific flow is shown in Figure 7.
neuron on the feature map was mapped to each thread on the thread block. The three dimensions of the thread grid (x, y, z) correspond to the width, height, and number of each layer, respectively. The settings in the kernel function were as follows: kernel <z, x, y>. The kernel function started z thread blocks, each containing xy  threads, and started z x y  threads altogether. Because the number of threads in a thread block was 512, multiple thread blocks were allocated here. Finally, by calling the _syncthreads() function to achieve thread synchronization, each thread was evaluated before executing the subsequent process.

Experimental Results and Analysis
This section evaluates the performance of the SR-CNN model using a public dataset and a self-collecting dataset. One open dataset was the Labeled Faces in the Wild (LFW) dataset, which is explained in detail later. To verify the validity of the model, we compared the experimental results with other methods. The specific flow is shown in Figure 7.

Experimental Database
Experimental Platform Experimental Evaluation Index

Experimental Database
The LFW dataset contains 13,000 images of human faces collected from the network, each named after the person being photographed. Of these images, 1680 people had two or more different photographs. We used these images for training and testing, and they were generated randomly and independently of the splits for 10-fold cross-validation. All images were 250 × 250 pixels. Figure 8 shows some sample images from the LFW database [49].

Experimental Database
The LFW dataset contains 13,000 images of human faces collected from the network, each named after the person being photographed. Of these images, 1680 people had two or more different photographs. We used these images for training and testing, and they were generated randomly and independently of the splits for 10-fold cross-validation. All images were 250 × 250 pixels. Figure 8 shows some sample images from the LFW database [49].
The settings in the kernel function were as follows: kernel <z, x, y>. The kernel function started z thread blocks, each containing xy  threads, and started z x y  threads altogether. Because the number of threads in a thread block was 512, multiple thread blocks were allocated here. Finally, by calling the _syncthreads() function to achieve thread synchronization, each thread was evaluated before executing the subsequent process.

Experimental Results and Analysis
This section evaluates the performance of the SR-CNN model using a public dataset and a self-collecting dataset. One open dataset was the Labeled Faces in the Wild (LFW) dataset, which is explained in detail later. To verify the validity of the model, we compared the experimental results with other methods. The specific flow is shown in Figure 7.

Experimental Database
Experimental Platform Experimental Evaluation Index

Experimental Database
The LFW dataset contains 13,000 images of human faces collected from the network, each named after the person being photographed. Of these images, 1680 people had two or more different photographs. We used these images for training and testing, and they were generated randomly and independently of the splits for 10-fold cross-validation. All images were 250 × 250 pixels. Figure 8 shows some sample images from the LFW database [49]. The self-collecting face database includes orthotopic face, lateral face, and tilted face images, among others. It contains 1002 images of 50 people, with each image tagged. Images were taken The self-collecting face database includes orthotopic face, lateral face, and tilted face images, among others. It contains 1002 images of 50 people, with each image tagged. Images were taken with different levels of light intensity, posture, and expression. All images were in color and manually cut to 300 × 300 pixels.

Experimental Platform
The CPU was an Intel Core i5-6300U processor, clocked at 2.40 GHz, with 8 GB of memory. The GPU was a GeForce GTX 1060 M, with 640 CUDA cores, a memory bandwidth of 80.16 GB/s, and a core frequency of 1097 MHz.
The operating system was Microsoft Windows 10, and the software used was the compiler platform for Visual Studio 2010, CUDA Toolkit 8.0, and OpenCV 3.4.

Experimental Evaluation Index
The example was divided into positive and negative classes. If an instance was positive, samples were predicted as positive samples, denoted as TP.
If negative examples were predicted as positive examples, this was denoted as FP. Correspondingly, if negative samples were predicted as negative samples, this was denoted as TN, and if positive classes were predicted as negative samples, this was denoted as FN. T CPU represents the serial execution time for the CPU, T GPU represents the parallel execution time for the GPU, S r is the speed ratio, TPR is the true positive rate, FPR is the false positive rate, and ACC is the recognition accuracy rate; the formulas are given below.

Experimental Analysis
In this subsection, we analyze the performance of the CNN network with different settings, including training epochs, activation function, and whether the CNN network was pre-trained or fine-tuned. The specific flow is shown in Figure 9.

Experimental Platform
The CPU was an Intel Core i5-6300U processor, clocked at 2.40 GHz, with 8 GB of memory. The GPU was a GeForce GTX 1060 M, with 640 CUDA cores, a memory bandwidth of 80.16 GB/s, and a core frequency of 1097 MHz.
The operating system was Microsoft Windows 10, and the software used was the compiler platform for Visual Studio 2010, CUDA Toolkit 8.0, and OpenCV 3.4.

Experimental Evaluation Index
The example was divided into positive and negative classes. If an instance was positive, samples were predicted as positive samples, denoted as TP .

Experimental Analysis
In this subsection, we analyze the performance of the CNN network with different settings, including training epochs, activation function, and whether the CNN network was pre-trained or fine-tuned. The specific flow is shown in Figure 9.

Analysis of the Accuracy in Different Training Epochs
The training epochs of the CNN network inevitably affect the accuracy. Figure 10 shows that the accuracy rate increased with the increase in epoch numbers. The training epochs of the CNN network inevitably affect the accuracy. Figure 10 shows that the accuracy rate increased with the increase in epoch numbers. When the network reached the convergence state, the accuracy tended to be stable upon increase in epoch number. The larger the value of the epoch was, the easier it would be to converge, but the longer it would take. As we can see from the graph, when the number of epochs was 20, the network state was already converged, and SR-CNN had the best performance on the self-collection database. Therefore, in this paper, the training epoch of the CNN network used was set to 20.

Analysis of the Accuracy of Different Activation Functions
Residuals in the backpropagation (BP) algorithm diminish with the depth of network transmission, which makes the underlying network unable to be trained effectively. Residual attenuation is closely related to the selection of activation function in the network model. A better activation function can inhibit the residuals in the process of network transmission attenuation, and improve the convergence speed of the models. We used ELU, Tanh, ReLU, and PReLU instead of the sigmoid function in the original network to perform the experiments, and we achieved better results. Then, we compared the network performance with different activation functions using a table of comparison. The development of activation functions plays a crucial role in the further application of deep neural networks. We conducted experiments on two benchmark datasets; we chose the LFW dataset and self-collection dataset as test sets. In this paper, the activation function of the CNN network used was the PReLU activation function. The experimental results are shown in Figure 11.    When the network reached the convergence state, the accuracy tended to be stable upon increase in epoch number. The larger the value of the epoch was, the easier it would be to converge, but the longer it would take. As we can see from the graph, when the number of epochs was 20, the network state was already converged, and SR-CNN had the best performance on the self-collection database. Therefore, in this paper, the training epoch of the CNN network used was set to 20.

Analysis of the Accuracy of Different Activation Functions
Residuals in the backpropagation (BP) algorithm diminish with the depth of network transmission, which makes the underlying network unable to be trained effectively. Residual attenuation is closely related to the selection of activation function in the network model. A better activation function can inhibit the residuals in the process of network transmission attenuation, and improve the convergence speed of the models. We used ELU, Tanh, ReLU, and PReLU instead of the sigmoid function in the original network to perform the experiments, and we achieved better results. Then, we compared the network performance with different activation functions using a table of comparison. The development of activation functions plays a crucial role in the further application of deep neural networks. We conducted experiments on two benchmark datasets; we chose the LFW dataset and self-collection dataset as test sets. In this paper, the activation function of the CNN network used was the PReLU activation function. The experimental results are shown in Figure 11. The training epochs of the CNN network inevitably affect the accuracy. Figure 10 shows that the accuracy rate increased with the increase in epoch numbers. When the network reached the convergence state, the accuracy tended to be stable upon increase in epoch number. The larger the value of the epoch was, the easier it would be to converge, but the longer it would take. As we can see from the graph, when the number of epochs was 20, the network state was already converged, and SR-CNN had the best performance on the self-collection database. Therefore, in this paper, the training epoch of the CNN network used was set to 20.

Analysis of the Accuracy of Different Activation Functions
Residuals in the backpropagation (BP) algorithm diminish with the depth of network transmission, which makes the underlying network unable to be trained effectively. Residual attenuation is closely related to the selection of activation function in the network model. A better activation function can inhibit the residuals in the process of network transmission attenuation, and improve the convergence speed of the models. We used ELU, Tanh, ReLU, and PReLU instead of the sigmoid function in the original network to perform the experiments, and we achieved better results. Then, we compared the network performance with different activation functions using a table of comparison. The development of activation functions plays a crucial role in the further application of deep neural networks. We conducted experiments on two benchmark datasets; we chose the LFW dataset and self-collection dataset as test sets. In this paper, the activation function of the CNN network used was the PReLU activation function. The experimental results are shown in Figure 11.    Figure 11. The matching accuracy of the activation function.
As can be seen from Figure 11, the matching accuracy with the PReLU activation function was higher than that with the other functions. Meanwhile, with the same activation function, the accuracy was higher when using the self-collecting database.

Analysis of Pre-Trained and Fine-Tuned CNN Networks
Generally, convolutional neural networks trained from scratch are prone to problems, and fine-tuning can quickly converge the network to an ideal state, but there is a large difference between the target dataset and the pre-trained dataset. In the recognition task of the target dataset, it is difficult to extract the visual features of the image using the pre-trained CNN model. In order to make the pre-trained CNN parameters more appropriate to the target dataset, we fine-tuned the CNN models based on the target dataset. With the exception of the last layer, the parameters of the pre-training model were used to initialize other layers, and the output layer was initialized by a zero-mean Gaussian distribution with a standard deviation of 0.01. Unlike the pre-training, we assumed that the weights of the pre-trained CNN were relatively good; thus, we set a small learning rate for the CNN weights to be fine-tuned. We did not want to change them too quickly or too much; thus, we kept our learning rate and learning rate attenuation very small.
As can be seen from Table 1, the fine-tuned network helped boost the accuracy, while the performance could be further improved. Interestingly, fully connected layer 7 (FC7) worked best after fine-tuning, unlike the pre-trained CNN. Therefore, the output of the FC7 layer was taken as the feature vector.

Comparative Experiment on Performance
In this subsection, we evaluate the proposed model in this paper in different situations. The comparisons included face matching methods before and after improvement, our model with other matching models, and the GPU before and after optimization. The details are shown in Figure 12. As can be seen from Figure 11, the matching accuracy with the PReLU activation function was higher than that with the other functions. Meanwhile, with the same activation function, the accuracy was higher when using the self-collecting database.

Analysis of Pre-Trained and Fine-Tuned CNN Networks
Generally, convolutional neural networks trained from scratch are prone to problems, and fine-tuning can quickly converge the network to an ideal state, but there is a large difference between the target dataset and the pre-trained dataset. In the recognition task of the target dataset, it is difficult to extract the visual features of the image using the pre-trained CNN model. In order to make the pre-trained CNN parameters more appropriate to the target dataset, we fine-tuned the CNN models based on the target dataset. With the exception of the last layer, the parameters of the pre-training model were used to initialize other layers, and the output layer was initialized by a zero-mean Gaussian distribution with a standard deviation of 0.01. Unlike the pre-training, we assumed that the weights of the pre-trained CNN were relatively good; thus, we set a small learning rate for the CNN weights to be fine-tuned. We did not want to change them too quickly or too much; thus, we kept our learning rate and learning rate attenuation very small.
As can be seen from Table 1, the fine-tuned network helped boost the accuracy, while the performance could be further improved. Interestingly, fully connected layer 7 (FC7) worked best after fine-tuning, unlike the pre-trained CNN. Therefore, the output of the FC7 layer was taken as the feature vector.

Comparative Experiment on Performance
In this subsection, we evaluate the proposed model in this paper in different situations. The comparisons included face matching methods before and after improvement, our model with other matching models, and the GPU before and after optimization. The details are shown in Figure 12. The face database used in this experiment was provided by LFW (the size of each face was 250 × 250 pixels) and the self-collection face database (the size of each face was 300 × 300 pixels), including

Comparison of Face Matching Methods Before and After Improvement
The face database used in this experiment was provided by LFW (the size of each face was 250 × 250 pixels) and the self-collection face database (the size of each face was 300 × 300 pixels), including front face, side face, and rotated face images, among others. The method of face matching under the traditional SIFT model is only suitable for face matching with the rotation of the positive face and images with small rotation (tilt) angles, but the matching effect is poor for faces with large rotation angles. The improved method combined RITF and SIFT, which improved matching for faces with large rotation (tilt) angle, on the basis of maintaining the effect of the original matching. In this experiment, the improved face matching method was compared with the traditional SIFT model for the same face database. The experimental results were averaged over 10 experiments.
For the LFW face database, as shown in Figure 13, the SIFT model could match three pairs of feature points, but only two pairs of matching points were correct. The improved SIFT/RITF model could match 10 pairs of feature points, and eight pairs of matching points were correct.
The matching effects of two methods for the LFW face database are shown in Figure 14 (acc represents the recognition accuracy rate, and tpr is the true positive rate). For the improved face matching method, the recognition accuracy rate improved by 10.97-13.24% and the true positive rate improved by 8.50-10.11% (due to the low number of feature points in LFW's face database, the threshold of matching point pairs m was set to 3). For the LFW face database, as shown in Figure 13, the SIFT model could match three pairs of feature points, but only two pairs of matching points were correct. The improved SIFT/RITF model could match 10 pairs of feature points, and eight pairs of matching points were correct.
The matching effects of two methods for the LFW face database are shown in Figure 14 (acc represents the recognition accuracy rate, and tpr is the true positive rate). For the improved face matching method, the recognition accuracy rate improved by 10.97-13.24% and the true positive rate improved by 8.50-10.11% (due to the low number of feature points in LFW's face database, the threshold of matching point pairs m was set to 3).   For the LFW face database, as shown in Figure 13, the SIFT model could match three pairs of feature points, but only two pairs of matching points were correct. The improved SIFT/RITF model could match 10 pairs of feature points, and eight pairs of matching points were correct.
The matching effects of two methods for the LFW face database are shown in Figure 14 (acc represents the recognition accuracy rate, and tpr is the true positive rate). For the improved face matching method, the recognition accuracy rate improved by 10.97-13.24% and the true positive rate improved by 8.50-10.11% (due to the low number of feature points in LFW's face database, the threshold of matching point pairs m was set to 3).   For the self-collection face database, as shown in Figure 15, the SIFT model had 19 pairs of matching points, for which only nine pairs of matching points were correct. The improved SIFT/RITF model had 31 pairs of matching points, and 23 pairs of matching points were correct.
The matching effects of both methods for the self-collection face database are shown in Figure 16 (acc represents recognition accuracy rate, and tpr is the true positive rate). For the improved face matching method, the recognition accuracy rate improved by 12.65-15.31% and the true positive rate improved by 7.49-9.09% (the threshold of matching point pairs m was set to 16). For the self-collection face database, as shown in Figure 15, the SIFT model had 19 pairs of matching points, for which only nine pairs of matching points were correct. The improved SIFT/RITF model had 31 pairs of matching points, and 23 pairs of matching points were correct.
The matching effects of both methods for the self-collection face database are shown in Figure  16 (acc represents recognition accuracy rate, and tpr is the true positive rate). For the improved face matching method, the recognition accuracy rate improved by 12.65-15.31% and the true positive rate improved by 7.49-9.09% (the threshold of matching point pairs m was set to 16).    For the self-collection face database, as shown in Figure 15, the SIFT model had 19 pairs of matching points, for which only nine pairs of matching points were correct. The improved SIFT/RITF model had 31 pairs of matching points, and 23 pairs of matching points were correct.
The matching effects of both methods for the self-collection face database are shown in Figure  16 (acc represents recognition accuracy rate, and tpr is the true positive rate). For the improved face matching method, the recognition accuracy rate improved by 12.65-15.31% and the true positive rate improved by 7.49-9.09% (the threshold of matching point pairs m was set to 16).

Comparison of Our Model with Other Face Matching Models
The LFW face database and self-collection database were used to experiment with different improved SIFT models. In this experiment, we compared eleven different models: principal component analysis with SIFT (PCA/SIFT) [50], Gabor/SIFT [51], speeded up robust features with SIFT (SURF/SIFT) [52], partial-descriptor SIFT (PDSIFT) [53], person-specific SIFT [54], SIFT/Kepenekci [55], wavelet transform of the SIFT feature [56], hexagonal SIFT (H-SIFT) [57], and SIFT/RITF. For PCA/SIFT, gradient patches were used around the key points instead of the original patches to make the representation robust to changes in lighting and to reduce the changes that PCA needs to model [50]. Reference [51] proposes to combine Gabor and SIFT for face recognition. The SURF/SIFT method uses a SURF detector with a SIFT descriptor to augment the proficiency of contemporary face recognition systems [52]. The proposed PDSIFT method retains key points at large or near-face boundaries to achieve better performance than the original SIFT [53]. The person-specific SIFT model uses the SIFT features of a particular person and a non-statistical matching strategy to solve the face recognition problem in combination with the local and global similarities on the key point group [54]. SIFT/Kepenekci is a SIFT model-based Kepenekci approach [55]. The method proposed in Reference [56] is based on wavelet transform of the SIFT features. H-SIFT takes advantage of hexagonal transformed image pixels and applies processing on a hexagonal coordinate system, rather than using SIFT on the square image coordinates, displaying better performance [57].
The experimental results are shown in Table 2. According to the experimental results, the SR-CNN model presented in this paper obtained the optimal face recognition performance compared with the other ten methods. As we can see from the Table 3, the SIFT/RITF model performed well with the LFW database, and the CNN model performed well with the self-collection database. Correspondingly, SR-CNN performed well with both databases, which proves that the proposed SR-CNN model is very effective for face recognition. We fully considered the complementary information of different characteristics, which can solve the problems that CNN cannot avoid. We fused complementary features, and the results of the experiment show that the fusion was a success. Based on the above experiments, compared with the traditional method, the facial feature matching method using the SR-CNN model is greatly improved. For the LFW face database, the size of the face image is smaller. Although there are fewer matching points, the optimization effect was better, and the true positive rate could still be increased to 97.32%. For the self-collection database, the size of the face image was larger, and all of them were color images. Therefore, there were more pairs of matching points, and the optimization effect was better, with the true positive rate as high as 97.61%. In addition, for the improved method, there was also an improvement in the number of matching points pairs, such that the true positive rate with both databases could reach about 99%.

Comparison of GPU before and after Optimization
To carry out the comparison experiment of multi-group data, the LFW face database and the self-collection face database were arranged into five groups according to the number of images. In each experiment, the serial computing of the CPU was compared with the parallel computing of the GPU, and the test time of each group was the average of 10 experiments.
According to Figure 17 and Table 4, it can clearly be seen that, with the increase in the number of face images, the calculating time exhibited an upward trend, the acceleration ratio of which was a factor of about 6-7 for the LFW database. Based on the above experiments, compared with the traditional method, the facial feature matching method using the SR-CNN model is greatly improved. For the LFW face database, the size of the face image is smaller. Although there are fewer matching points, the optimization effect was better, and the true positive rate could still be increased to 97.32%. For the self-collection database, the size of the face image was larger, and all of them were color images. Therefore, there were more pairs of matching points, and the optimization effect was better, with the true positive rate as high as 97.61%. In addition, for the improved method, there was also an improvement in the number of matching points pairs, such that the true positive rate with both databases could reach about 99%.

Comparison of GPU before and after Optimization
To carry out the comparison experiment of multi-group data, the LFW face database and the self-collection face database were arranged into five groups according to the number of images. In each experiment, the serial computing of the CPU was compared with the parallel computing of the GPU, and the test time of each group was the average of 10 experiments.
According to Figure 17 and Table 4, it can clearly be seen that, with the increase in the number of face images, the calculating time exhibited an upward trend, the acceleration ratio of which was a factor of about 6-7 for the LFW database. Figure 17. Comparison of central processing unit (CPU) and graphics processing unit (GPU) operation time (using the LFW database). Table 4. Acceleration ratios of central processing unit (CPU) and graphics processing unit (GPU) using the LFW face database.  Figure 17. Comparison of central processing unit (CPU) and graphics processing unit (GPU) operation time (using the LFW database). Table 4. Acceleration ratios of central processing unit (CPU) and graphics processing unit (GPU) using the LFW face database.

200
6.06 500 6.37 800 6.51 1100 6.69 1400 6.92 Similarly, as can be seen from Figure 18 and Table 5, for the self-collection database, the acceleration ratio was a factor of about 5-6. Similarly, as can be seen from Figure 18 and Table 5, for the self-collection database, the acceleration ratio was a factor of about 5-6.  Table 5. Acceleration ratios of CPU and GPU using the self-collection face database. Based on the above experiments, the use of GPU parallel computing could effectively improve the running speed of the program and reduce the execution time. Compared with the self-collection database, the face size in the LFW database was small; thus, calculation time of the feature extraction was lower for each image, the acceleration ratio was relatively large, and the optimization effect was more distinct. When the number of pictures was small, and there was no comprehensive use of resources on the GPU thread, the acceleration was small and the speed was not very high. However, when the number of images gradually increased, all the threads on the GPU were called, and the acceleration ratio of both databases gradually increased. Therefore, GPU parallel computing is optimal for situations where there are many pictures.

Conclusions and Future Work
This paper introduced a robust face matching method based on an SR-CNN model. It improved the rotation invariance of the traditional SIFT model while maintaining the advantages of the original SIFT model, CNN model, and the RITF model, which tackles the problem caused by the traditional SIFT model when there are multiple similar regions in an image. Furthermore, the parallel acceleration based on the GPU could not only ensure the matching accuracy, but also effectively improve the running efficiency and shorten the running time. However, in the case of complex backgrounds, the accuracy of face recognition methods will be affected, which will be further studied in the future.  Based on the above experiments, the use of GPU parallel computing could effectively improve the running speed of the program and reduce the execution time. Compared with the self-collection database, the face size in the LFW database was small; thus, calculation time of the feature extraction was lower for each image, the acceleration ratio was relatively large, and the optimization effect was more distinct. When the number of pictures was small, and there was no comprehensive use of resources on the GPU thread, the acceleration was small and the speed was not very high. However, when the number of images gradually increased, all the threads on the GPU were called, and the acceleration ratio of both databases gradually increased. Therefore, GPU parallel computing is optimal for situations where there are many pictures.

Conclusions and Future Work
This paper introduced a robust face matching method based on an SR-CNN model. It improved the rotation invariance of the traditional SIFT model while maintaining the advantages of the original SIFT model, CNN model, and the RITF model, which tackles the problem caused by the traditional SIFT model when there are multiple similar regions in an image. Furthermore, the parallel acceleration based on the GPU could not only ensure the matching accuracy, but also effectively improve the running efficiency and shorten the running time. However, in the case of complex backgrounds, the accuracy of face recognition methods will be affected, which will be further studied in the future.