Deep CNNs with Robust LBP Guiding Pooling for Face Recognition

The pooling layer in Convolutional Neural Networks (CNNs) is designed to reduce dimensions and computational complexity. Unfortunately, CNNs are easily disturbed by noise in images when extracting features from input images. The traditional pooling layer directly samples the input feature maps without considering whether they are affected by noise, which leads to accumulated noise in the subsequent feature maps as well as undesirable network outputs. To address this issue, a robust Local Binary Pattern (LBP) Guiding Pooling (G-RLBP) mechanism is proposed in this paper to downsample the input feature maps and lower the noise impact simultaneously. The proposed G-RLBP method calculates the weighted average of all pixels in the sliding window of this pooling layer as the final result, based on their corresponding probabilities of being affected by noise, and thus lowers the noise impact from input images at the first several layers of the CNNs. The experimental results show that the carefully designed G-RLBP layer can successfully lower the noise impact and improve the recognition rates of the CNN models over the traditional pooling layer. The performance gain of the G-RLBP is most pronounced when the images are severely affected by noise.


Introduction
Nowadays, using deep learning architectures to mine information and extract features from images has drawn a lot of attention in computer vision and machine learning tasks. Among them, the CNN has gradually become the most effective method since it can extract essential features quickly from images, and it has been widely applied in face recognition, target tracking, expression analysis, and other fields. For instance, in 2014, the Deepface [1] method achieved 97.35% accuracy on the LFW database [2]. In the DeepID2 [3] and DeepID2+ [4] models, the authors combined face identification and verification to increase inter-class variations and reduce intra-class variations simultaneously. This mechanism achieves a clear improvement on some typical databases. DeepID3 [5] further enlarges and deepens the network, finally reaching 99.53% accuracy on the LFW database. VGGnet [6] is another influential model for learning effective features from input images, and it has been used in many visual recognition tasks [7][8][9]. In this model, many convolutional layers are stacked together to obtain more complex features. GoogleNet [10] ranked at the top of ILSVRC 2014. This model contains many inception modules, which combine pooling with convolutional layers to form a new feature extraction layer. Moreover, in 2014, Gong et al. [11] proposed the Multi-scale Orderless Pooling (MOP) CNN to extract CNN activations for local patches at multiple scale levels. Later, Faster R-CNN [12], proposed in 2016, merged the Region Proposal Network (RPN) and Fast R-CNN [13] into a single network by sharing their convolutional features, which is not only a cost-efficient solution for practical usage but also an effective way to improve object detection accuracy.
CNNs are effective for visual recognition, but they are also very susceptible to noise injected into the input images in real-world applications. Taking Alexnet [14] and ZF-5net [15] as examples, we select 100 subjects from CASIA-WebFace [16] to train and test the two networks (10,900 images for training and 1700 for testing). To evaluate the networks in noisy conditions, the testing images were injected with Gaussian noise of different intensities, as shown in Figure 1. It is clear that the edges and other feature information in the face images become increasingly difficult to recognize as the noise intensity grows. Table 1 shows the recognition rates of the Alexnet and ZF-5net. In Table 1, we can see that the recognition rates of the networks decrease drastically as the noise intensity increases. When the variance of the Gaussian noise increases to 0.01, the recognition rates of both networks drop to 50%, and the networks can hardly be used in practice. However, in real-world applications, the obtained face images are easily affected by various factors and end up containing some noise introduced during collection, processing, and transmission.
A carefully designed CNN mainly comprises three types of layers: convolutional layers, pooling layers, and fully-connected layers. In general terms, the objective of pooling is to transform the common feature representation into a new, more usable one that preserves essential information while discarding irrelevant details [17]. However, most pooling methods, such as max pooling and average pooling, downsample the input feature maps in the corresponding sliding window based on a constant criterion, and all the pixels are treated equally. Once some of the pixels in the sliding window are affected by noise, they are still likely to be preserved or averaged after the pooling layer, since current pooling methods do not respond to the noise injected into the input.
To address this issue, a new pooling method based on robust LBP guiding in deep CNNs, named RLBP Guiding Pooling (G-RLBP), is proposed in this paper to deal with the noise injected into the input images.
There are many effective hand-crafted methods to extract features from images. For instance, HOG [18] constructs features by computing and counting histograms of gradient directions in local regions of images. HOG features combined with an SVM classifier have been widely applied in pedestrian detection. The SIFT [19] feature is a very stable local feature, which is invariant to rotation, scaling, and luminance change. To address the impact of makeup on automated face recognition, Chen et al. [20] proposed another useful method in which a set of feature descriptors such as Local Gradient Gabor Pattern (LGGP) [21] and Densely Sampled Local Binary Pattern (DS-LBP) are utilized to represent each patch of the face images. LBP is another representative hand-crafted feature extraction method which has been widely used in many face recognition tasks [22][23][24]. Compared with other feature extraction methods such as HOG and SIFT, LBP-based methods can capture small movements in facial images, and they are described in a much lower-dimensional feature space, which benefits real-time applications. In the proposed G-RLBP, the robust LBP algorithm is utilized to guide the pooling mechanism. The proposed G-RLBP first analyzes each pixel in the sliding window, calculates its probability of being affected by noise, and obtains the robust LBP (RLBP) weight maps. Then, according to the RLBP weight maps, the weighted average of all the pixels is taken as the final result of the current sliding window in this pooling layer. Here, we utilize the fact that most of the LBP patterns in face images belong to the uniform patterns and only a small part belongs to the non-uniform patterns [22,25]. Moreover, the non-uniform patterns are usually caused by noise injected into the images. Thus, we can utilize the pattern of each pixel to guide the pooling process and decrease the noise injected into the feature maps.
In this way, the parameters of the input feature maps can be reduced as in traditional pooling methods, and the impact of noise injected into the feature maps can be effectively lowered at the same time. The experimental results also show that the performance of some CNNs equipped with the G-RLBP pooling layer can be improved notably in noisy conditions. The remainder of this paper is organized as follows. Section 2 describes the proposed G-RLBP pooling method together with its theoretical analysis. Section 3 reports the experimental design and performance comparisons of the G-RLBP pooling method. Section 4 concludes this paper.

Proposed Method
In traditional CNNs, the convolutional, pooling, and fully-connected layers take little account of the noise injected into the input images. However, the noise introduced by the input images accumulates layer by layer. When the intensity of the noise reaches a certain degree, the recognition rate of the network drops sharply, as Table 1 shows. Therefore, it is necessary to lower noise interference at the first several layers of the network. Figure 2 shows the structure of our G-RLBP pooling layer, which is designed to reduce the noise impact in the first pooling layer. There are three main modules in the G-RLBP pooling layer (see Figure 2). The values in the RLBP weight maps reflect the probability of each pixel in the convolutional feature maps being affected by noise. When these weight maps are used to downsample the convolutional feature maps, the pixels that are more likely to be affected by noise are assigned smaller weights to lower the noise interference to the networks. We give a detailed description of the RLBP weight maps in the next section.

Robust LBP
LBP is a very effective method to extract local texture features from images. In recent years, LBP and its variants have been successfully applied to various pattern recognition tasks, such as texture analysis, face detection, facial expression recognition and so on.
Ahonen et al. [22] first introduced the LBP method into the face recognition field. They cut the face image into several sub-images and then calculated the LBP value of each pixel in each sub-image. This method combines the local and overall features of the face image and performs well in real-world applications. However, in the coding process, the traditional LBP method compares the central pixel with its neighbors to get a binary string in the sliding window, so small changes of the pixels can result in very different coding results. For instance, Figure 3 shows two coding processes of LBP; the red numbers indicate the pixels affected by noise. It is clear that the normal LBP value of the central pixel is 1011 0000; once some neighboring pixels are slightly affected by noise, the coding results are very different from the normal one. Thus, LBP is very sensitive to noise, which can easily modify the gray values of the image and may result in entirely different coding results. In this section, we utilize a noise-robust LBP coding method called RLBP, which uses a probability to judge whether a pixel value of the image is affected by noise before it is fed into the network.
The basic LBP algorithm encodes the signs of the pixel differences between the central pixel and its neighboring pixels in a sliding window. The coding criterion is as follows:

$$s(z_p) = \begin{cases} 1, & z_p \geq 0 \\ 0, & z_p < 0 \end{cases} \qquad (1)$$

where $z_p$ indicates the difference between the central pixel and one of its neighbors in a sliding window (e.g., 3 × 3), and $z_p$ is encoded into 1 or 0 according to Equation (1). The central pixel is compared with each of its neighbors in turn, and the resulting 1-bit binary numbers are concatenated in a fixed direction to get a P-bit binary string (P is the number of neighbors of the central pixel). Finally, this P-bit binary string is converted to a decimal number between 0 and $2^P - 1$, which is regarded as the LBP value of the central pixel in this 3 × 3 sliding window.
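As an illustration of the coding step described above, the following minimal Python sketch computes the basic LBP code of the central pixel of a 3 × 3 window. The function name `lbp_code`, the clockwise neighbor order starting at the top-left pixel, the choice of the first neighbor as the most significant bit, and the sample pixel values are assumptions made for this example; they are not taken from Figure 3.

```python
def lbp_code(window):
    """Return (bit string, decimal code) for the centre pixel of a 3x3 window.

    Assumptions for this sketch: neighbours are visited clockwise starting
    at the top-left pixel, z_p = neighbour - centre, and z_p >= 0 maps to 1.
    """
    center = window[1][1]
    # Clockwise neighbour coordinates (row, column), starting at top-left.
    order = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    bits = "".join("1" if window[r][c] - center >= 0 else "0" for r, c in order)
    # The first neighbour is treated as the most significant bit.
    return bits, int(bits, 2)


if __name__ == "__main__":
    # Hypothetical 3x3 window with centre value 90.
    sample = [[95, 80, 93],
              [70, 90, 96],
              [60, 50, 91]]
    print(lbp_code(sample))
```

Changing a single neighbor by a few gray levels (for example, 91 to 89 in the sample window) flips one bit of the code, which is the noise sensitivity that motivates the robust variant.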
There are $2^P$ different patterns in the LBP algorithm. Among them, P × (P − 1) + 2 LBP patterns (58 for P = 8) are defined as uniform patterns, which have at most two circular bitwise transitions from 0 to 1 or vice versa, and the rest are non-uniform patterns. Most LBP values in natural images are uniform patterns [22]. Thus, uniform patterns are statistically more significant, and their occurrence probabilities can be more reliably estimated. In contrast, non-uniform patterns are statistically insignificant, and hence noise-prone and unreliable. Figure 4 shows some of the local primitives (spots, flat regions, edges, edge ends and corners) represented by uniform LBP patterns [26]. RLBP differs from LBP in that $z_p$ is encoded into a ternary pattern (0, 1 and u) according to Equation (2).
Here, $t_p$ is the threshold, and u is an uncertain bit that is encoded to 0 or 1 with a certain probability. Obviously, a binary string which contains an uncertain u cannot be encoded into a definite decimal number between 0 and $2^P - 1$. LBP uniform patterns can capture the main structural information of the image while reducing the noise interference in the texture. In natural images, uniform patterns appear far more frequently than non-uniform ones. Experimental data show that 90.6% of the LBP patterns in face images belong to the uniform patterns, and only a small part belongs to the non-uniform patterns. Moreover, these non-uniform patterns are often caused by noise. A simple experiment illustrates this phenomenon in face images: when salt and pepper noise of different intensities (d) is injected into the samples of the ORL database [27], the proportion of non-uniform patterns increases continuously with the noise intensity. When d is 0.05, 0.1 and 0.15, the proportion of non-uniform patterns is 27.86%, 31.46% and 34.49%, respectively. Based on this observation, we can preset the value of u in different coding patterns. Generally, there are three cases in total according to the number of uniform patterns.
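The uniformity criterion used throughout this section (at most two circular 0/1 transitions) can be checked with a few lines of Python. This is a minimal sketch; the helper names `transitions`, `is_uniform`, and `count_uniform` are hypothetical. Enumerating all 8-bit strings gives 58 uniform patterns, which agrees with the standard count P × (P − 1) + 2 for P = 8.

```python
def transitions(bits):
    """Count circular 0->1 and 1->0 transitions in a bit string."""
    n = len(bits)
    return sum(bits[i] != bits[(i + 1) % n] for i in range(n))


def is_uniform(bits):
    """A pattern is uniform if it has at most two circular transitions."""
    return transitions(bits) <= 2


def count_uniform(p):
    """Enumerate all p-bit patterns and count the uniform ones."""
    return sum(is_uniform(format(code, "0{}b".format(p))) for code in range(2 ** p))


if __name__ == "__main__":
    print(count_uniform(8))          # number of uniform 8-bit patterns
    print(is_uniform("11110000"))    # two transitions
    print(is_uniform("10110000"))    # four transitions
```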
Case 1: Only One Pattern Belongs to the Uniform Patterns
Figure 5 is a comparison of the LBP and RLBP coding processes in a 3 × 3 sliding window. Here, $t_p$ is set to 5; a detailed discussion of $t_p$ is given in Section 3.1.1. The corresponding uncertain RLBP pattern collection in the RLBP algorithm is defined as C(U) = 1$u_1$110$u_2$00, with two uncertain bits U = {$u_1$, $u_2$}. According to the combinations of U, there are four different binary strings of P = 8 bits: 1011 0000, 1011 0100, 1111 0000, and 1111 0100. Among them, 1011 0000, 1011 0100 and 1111 0100 belong to the non-uniform patterns, and they are likely to be affected by noise. Therefore, the central pixel '90' of the 3 × 3 sliding window is encoded to 1111 0000 in the RLBP algorithm because it is the only uniform pattern, and we can also calculate the probability of this code. Let p(u = 1) be the probability of $z_p$ being encoded to 1, as given by Equations (3) and (4). We can easily get p($u_1$ = 1) = 0.3 and p($u_2$ = 0) = 0.8. Finally, the probability of the central pixel being encoded to 1111 0000 is p($u_1$ = 1) × p($u_2$ = 0) = 0.3 × 0.8 = 0.24.

Case 2: More Than One Pattern Belongs to the Uniform Patterns
Figure 6 shows another case, in which more than one pattern belongs to the uniform patterns. The uncertain RLBP pattern in this case is C(U) = 11$u_1$00$u_2$00. According to the values of $u_1$ and $u_2$, there are four candidate binary strings: 1110 0000, 1110 0100, 1100 0100, and 1100 0000. Two of them belong to the uniform patterns: 1110 0000 and 1100 0000. Here, Equations (3) and (4) give the probabilities of $u_1$ and $u_2$, from which the encoding probabilities of 1110 0000 and 1100 0000 are obtained. In the RLBP algorithm, the binary string with the maximum probability is defined as the final coding result; thus, the central pixel in Figure 6 is encoded as 1100 0000 with a probability of 0.72.
Figure 6. Comparison of LBP and RLBP patterns in a 3 × 3 sliding window in which more than one pattern in the uncertain RLBP pattern collection belongs to the uniform patterns.
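The expansion of an uncertain pattern into its candidate binary strings, as used in the two cases above, can be sketched in Python as follows. The function names `ternary_pattern` and `expand_uncertain` are hypothetical, and the thresholding convention (a difference of at least $t_p$ gives 1, at most $-t_p$ gives 0, anything in between gives u) is an assumption of this sketch rather than a restatement of Equation (2). The pixel differences in the usage example are illustrative and are not read from Figure 5.

```python
from itertools import product


def ternary_pattern(diffs, t):
    """Encode pixel differences into a string over '0', '1', 'u'.

    Assumed convention: z >= t -> '1', z <= -t -> '0', otherwise 'u'.
    """
    symbols = []
    for z in diffs:
        if z >= t:
            symbols.append("1")
        elif z <= -t:
            symbols.append("0")
        else:
            symbols.append("u")
    return "".join(symbols)


def expand_uncertain(pattern):
    """List every binary string obtained by setting each 'u' to 0 or 1."""
    slots = [i for i, s in enumerate(pattern) if s == "u"]
    candidates = []
    for values in product("01", repeat=len(slots)):
        chars = list(pattern)
        for i, v in zip(slots, values):
            chars[i] = v
        candidates.append("".join(chars))
    return candidates


if __name__ == "__main__":
    # Illustrative differences that yield the Case 1 pattern with t = 5.
    pattern = ternary_pattern([7, -2, 9, 6, -8, -3, -10, -6], 5)
    print(pattern)
    print(expand_uncertain(pattern))
```

For the Case 1 pattern this yields the four strings 1011 0000, 1011 0100, 1111 0000, and 1111 0100 listed in the text.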

Case 3: No Pattern Belongs to the Uniform Patterns
In the third case, no pattern belongs to the uniform patterns, as Figure 7 shows. In Figure 7, the uncertain RLBP pattern is C(U) = $u_1$0110110. There are two binary strings when $u_1$ is set to 0 or 1: 0011 0110 and 1011 0110. Both belong to the non-uniform patterns. Therefore, the central pixel can only be encoded to 1011 0110, with a probability of p($u_1$ = 1) = 0.7.
In conclusion, the RLBP value of the central pixel in any sliding window can be summarized as follows:
1. Calculate all the uncertain RLBP values of the central pixel according to Equation (2), where $\Phi_u$ and $\bar{\Phi}_u$ are the collections of the uncertain uniform patterns and non-uniform patterns, and U = {$u_1$, $u_2$, ..., $u_n$}.
2. If $URLBP_{\Phi_u}$ is empty and $URLBP_{\bar{\Phi}_u}$ is not, calculate the probabilities of all non-uniform patterns in $URLBP_{\bar{\Phi}_u}$ according to Equation (7). Otherwise, calculate the probabilities of all uniform patterns in $URLBP_{\Phi_u}$.
3. Finally, the pattern with the maximum probability is regarded as the RLBP value of the central pixel, where R indicates the radius of the current sliding window and P is the number of neighbors of the central pixel.
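The three steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `rlbp_select` is hypothetical, the probabilities p($u_i$ = 1) are passed in directly rather than derived from Equations (3) and (4), and the uncertain bits are treated as independent so that a candidate's probability is the product of its per-bit probabilities (consistent with the value 0.72 reported for Case 2).

```python
from itertools import product


def is_uniform(bits):
    """Uniform pattern: at most two circular bit transitions."""
    n = len(bits)
    return sum(bits[i] != bits[(i + 1) % n] for i in range(n)) <= 2


def rlbp_select(pattern, p_one):
    """Pick the RLBP code for a ternary pattern.

    pattern: string over '0', '1', 'u'.
    p_one:   list with p(u_i = 1) for each 'u', in order of appearance.
    Returns (code, probability, number of uniform candidates).
    Assumption: uncertain bits are independent.
    """
    slots = [i for i, s in enumerate(pattern) if s == "u"]
    if len(slots) != len(p_one):
        raise ValueError("one probability is needed per uncertain bit")
    uniform, non_uniform = [], []
    for values in product((0, 1), repeat=len(slots)):
        chars = list(pattern)
        prob = 1.0
        for i, v, p in zip(slots, values, p_one):
            chars[i] = str(v)
            prob *= p if v == 1 else 1.0 - p
        candidate = "".join(chars)
        (uniform if is_uniform(candidate) else non_uniform).append((candidate, prob))
    # Step 2: prefer uniform candidates; fall back to non-uniform ones.
    pool = uniform if uniform else non_uniform
    # Step 3: keep the candidate with the maximum probability.
    code, prob = max(pool, key=lambda item: item[1])
    return code, prob, len(uniform)


if __name__ == "__main__":
    print(rlbp_select("1u110u00", [0.3, 0.2]))  # Case 1 values from the text
    print(rlbp_select("u0110110", [0.7]))       # Case 3 values from the text
    # Case 2: the individual probabilities below are assumed for illustration;
    # the text only reports the resulting probability of 0.72.
    print(rlbp_select("11u00u00", [0.1, 0.2]))
```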

RLBP Weight Maps
We have given a detailed discussion of the RLBP coding process. There are three cases to be considered to generate the RLBP weight maps in our G-RLBP method.

• Case 1: If the center pixel of the sliding window is encoded as only one uniform pattern according to Equation (2), the corresponding RLBP weight of the center pixel is defined as 1.
• Case 2: If the center pixel of the sliding window is encoded as more than one uniform pattern, the probability of each uncertain RLBP pattern in $URLBP_{\Phi_u}$ is calculated, and the maximum probability is taken as the RLBP weight of the central pixel.
• Case 3: If all the uncertain RLBP patterns are non-uniform, the RLBP weight of this central pixel is set to 0.

Figure 8 visualizes the pooling process in our G-RLBP layer. After the RLBP pooling weight maps are generated, we can downsample the convolutional feature maps according to Equation (10), where N is the number of pixels in a sliding window, and $w_i$ and $x_i$ indicate the RLBP weight and the corresponding value in the input feature maps, respectively. Here, N is set to 9, and y is the output of the current sliding window after the G-RLBP pooling.
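The weight assignment and the weighted pooling step can be sketched with NumPy as follows. Several details are assumptions of this sketch and are not confirmed by the text: the weighted sum is normalized by the sum of the weights in the window, a window whose weights are all zero falls back to the plain average, and the stride defaults to 2. The function names `rlbp_weight` and `g_rlbp_pool` are hypothetical.

```python
import numpy as np


def rlbp_weight(uniform_probs):
    """RLBP weight of one pixel from the probabilities of its uniform candidates."""
    if len(uniform_probs) == 0:   # Case 3: no uniform candidate
        return 0.0
    if len(uniform_probs) == 1:   # Case 1: exactly one uniform candidate
        return 1.0
    return float(max(uniform_probs))  # Case 2: maximum probability


def g_rlbp_pool(features, weights, size=3, stride=2, eps=1e-12):
    """Weighted-average pooling of a 2-D feature map guided by a weight map.

    Assumptions: normalisation by the sum of weights, and a plain average
    when every weight in the window is zero.
    """
    features = np.asarray(features, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if features.shape != weights.shape:
        raise ValueError("feature map and weight map must have the same shape")
    h, w = features.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            x = features[r:r + size, c:c + size]
            wt = weights[r:r + size, c:c + size]
            total = wt.sum()
            out[i, j] = (wt * x).sum() / total if total > eps else x.mean()
    return out


if __name__ == "__main__":
    fmap = np.ones((3, 3))
    fmap[1, 1] = 100.0            # a pixel corrupted by noise
    wmap = np.ones((3, 3))
    wmap[1, 1] = rlbp_weight([])  # flagged as non-uniform, weight 0
    print(g_rlbp_pool(fmap, wmap))
```

In the usage example, the corrupted pixel receives weight 0 and no longer influences the pooled value, whereas max pooling would return the corrupted value itself.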

Baseline Network Architectures
The Alexnet, ZF-5net, and GoogleNet are the three baseline network architectures studied in the experiments. The specific configurations of the Alexnet and ZF-5net are shown in Table 2. In contrast to the Alexnet, the ZF-5net uses smaller filters in Conv1 to preserve more of the original pixel information. In our models, only the Pool1 layer of each baseline network is replaced by the proposed G-RLBP layer. Most of the noise in the feature maps can be reduced after the first G-RLBP layer in the network, which is verified in the following sections, and replacing the other pooling layers as well brings little additional benefit. Besides, the G-RLBP layer needs to calculate the weight maps of each pooling window, which is time-consuming to some extent. To further evaluate the proposed G-RLBP layer, we also used data augmentation, injecting different random noise into the training data, as another experiment.
The three networks are transferred to the face recognition task through fine-tuning. The networks are implemented with the Caffe toolbox [28]. Stochastic Gradient Descent (SGD) with back propagation is used for optimization. We set the weight decay and momentum to 0.005 and 0.9, respectively. The base learning rate is initially set to 0.001 for training the original ZF-5net and Alexnet models. All networks in our experiments are trained for 80 epochs. The evaluation is performed on a machine with 64 GB of memory, a 2.1 GHz Xeon CPU, and a GeForce GTX 1080 Ti GPU.
The training database used in the experiments is the CASIA-WebFace database, which includes 10,575 subjects with 494,414 face images collected from the web. We selected 100 subjects from the CASIA-WebFace database to train the models. The testing data were selected from the ORL and AR databases [29].
The output of the fc6 layer, with a feature dimension of 4096, is regarded as the feature extracted from the networks. In the recognition stage, a nearest neighbor classifier calculates the distance between two feature vectors with three different distance measures: the Chi-Square distance, the Euclidean distance, and the Cosine distance. During the experiments, the AR and ORL databases were injected with Gaussian noise and salt and pepper noise of different intensities to test the performance of the G-RLBP pooling method.
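The recognition stage described above can be sketched with NumPy as follows. This is an illustrative sketch: the function names are hypothetical, the Chi-Square distance is written in one common form that assumes non-negative feature values, and the small constant `eps` is added only to avoid division by zero. The exact definitions used in the experiments are not given in the text.

```python
import numpy as np


def euclidean_distance(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))


def cosine_distance(a, b, eps=1e-12):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return 1.0 - float(a @ b) / (float(np.linalg.norm(a) * np.linalg.norm(b)) + eps)


def chi_square_distance(a, b, eps=1e-12):
    # One common form; assumes non-negative features.
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sum((a - b) ** 2 / (a + b + eps)))


def nearest_neighbor(query, gallery, labels, distance):
    """Return the label of the gallery feature closest to the query."""
    dists = [distance(query, g) for g in gallery]
    return labels[int(np.argmin(dists))]


if __name__ == "__main__":
    # Toy 3-D features standing in for the 4096-D fc6 features.
    gallery = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    labels = ["subject_a", "subject_b"]
    query = [0.9, 0.1, 0.0]
    for dist in (euclidean_distance, cosine_distance, chi_square_distance):
        print(dist.__name__, nearest_neighbor(query, gallery, labels, dist))
```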
3.1.1. The Discussion of $t_p$
$t_p$ is one of the important parameters in our method, and it affects both the algorithm complexity and the performance. We conducted an experiment on the ORL database to give a simple discussion of it. The histogram of the RLBP patterns was regarded as the feature of each face image. For brevity, only the Euclidean distance was utilized to measure the similarity of two features. In this experiment, one image was selected as the training set and the rest as the testing set for evaluation. Figure 9 shows the recognition rates of the RLBP with respect to $t_p$. We can see clearly in Figure 9 that the RLBP achieves the highest recognition rate on the ORL database when $t_p$ is set to 5. Thus, in the following experiments, $t_p$ was set to 5, which is suitable in most cases.

Experiments on the ORL Database
The first set of experiments was carried out on the ORL database, which contains 40 subjects with 10 images for each subject. In each turn, one image of each subject was selected as the training set and the rest as the testing set. The experiment was repeated 10 times so that each image could be used as the training set for evaluation. Before the testing images were fed into the network, we injected Gaussian noise and salt and pepper noise of different intensities to evaluate the performance of the G-RLBP pooling layer. Figure 10 visualizes the output of the Pool1 layer in the baseline Alexnet and of the G-RLBP layer in the Alexnet equipped with the G-RLBP layer.
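The noise injection used for the testing images can be sketched with NumPy as follows. This is a minimal sketch under assumptions that the text does not state: images are floating-point arrays scaled to [0, 1], the Gaussian noise is zero-mean with variance σ², and for salt and pepper noise a fraction d of the pixels is corrupted, half set to 0 and half set to 1. The function names are hypothetical.

```python
import numpy as np


def add_gaussian_noise(image, variance, rng):
    """Add zero-mean Gaussian noise with the given variance, then clip to [0, 1]."""
    noisy = image + rng.normal(0.0, np.sqrt(variance), image.shape)
    return np.clip(noisy, 0.0, 1.0)


def add_salt_and_pepper(image, density, rng):
    """Corrupt roughly a fraction `density` of the pixels with 0 (pepper) or 1 (salt)."""
    noisy = image.copy()
    r = rng.random(image.shape)
    noisy[r < density / 2] = 0.0
    noisy[(r >= density / 2) & (r < density)] = 1.0
    return noisy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    face = np.full((112, 92), 0.5)  # placeholder with the ORL image size
    for variance in (0.002, 0.005, 0.01):
        print(variance, add_gaussian_noise(face, variance, rng).std())
    for d in (0.05, 0.1, 0.15):
        print(d, (add_salt_and_pepper(face, d, rng) != 0.5).mean())
```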
Comparing Figure 10a,b, it is worth noticing that, when the input image is injected with some noise, some random noise points appear in the feature maps of the pooling layer. Meanwhile, the crucial edges of the output are no longer clear, and some of the original texture information is even lost in some feature maps. The noise in the image can severely affect the output of some intermediate layers in the networks, and the noise accumulates layer by layer which would eventually reduce the recognition rate of the whole network. However, once the network is equipped with the G-RLBP pooling layer, as Figure 10c shows, the noise injected into the output feature maps can be effectively decreased, and some edges in the feature maps of the G-RLBP also become much clearer compared with Figure 10b. In Tables 3 and 4, we summarize the recognition rates when the testing images were injected with Gaussian noise and salt and pepper noise respectively.
The results in Tables 3 and 4 show that the recognition rates of all six network models decrease as the noise intensity increases, especially for the baseline networks with the max pooling layer and with data augmentation. The fourth and seventh columns in Tables 3 and 4 correspond to the Alexnet and ZF-5net whose training data were injected with random noise of different intensities (Gaussian and salt and pepper noise). When the Gaussian noise intensity increases to $\sigma^2$ = 0.005, the recognition rates of the baseline networks with the max pooling layer begin to fall below 50%. The original pooling method is sensitive to noise, which leads to poor performance even when the input images are affected by slight noise. In the tables, we can see that, when the training data are augmented with random noise, the recognition results are much better than those of the original networks with the max pooling method. In comparison, the G-RLBP pooling method proposed in this paper achieves the best results and is robust to noise. Although the recognition rates of the two networks equipped with the G-RLBP pooling layer also decrease as the noise intensity increases, the recognition results in these cases are much better than those of the baseline networks. Even when the intensity of the Gaussian noise reaches $\sigma^2$ = 0.01, the recognition rates remain higher than 50%, which indicates that the G-RLBP pooling layer is more effective than the original one. The main reason is that, when the networks are equipped with the G-RLBP layer, a smaller weight is assigned to any pixel that is likely to be affected by noise. Thus, after the G-RLBP pooling, the noise injected into the feature maps is kept within a smaller extent. In addition, when the testing images are clean ($\sigma^2$ = 0 in Table 3 or d = 0 in Table 4), the recognition rates of the networks with G-RLBP are also slightly higher than those of the baseline networks.
In conclusion, the G-RLBP can be used to reduce the model dimensions like the original pooling method while also lowering the noise interference to the networks in real-world applications. Finally, we also evaluated the time cost of these network models at both the training and classification stages, as Table 5 shows. All the network models in our experiments are trained for 80 epochs. The classification times in Table 5 refer only to the extraction times of the 4096-dimensional features, since the feature matching times of all the network models are the same. In Table 5, we can see that training a network with data augmentation is the most costly, since it uses the largest amount of training data. Compared with the network with a max pooling layer, the classification time of the G-RLBP pooling method is a little longer. However, considering the improvement of the G-RLBP pooling method in the recognition rates in Tables 3 and 4, the extra classification time is acceptable.

Experiments on the AR Database
The second set of experiments was conducted on the AR database, which contains 126 subjects with 13 images for each subject. The experimental settings were the same as in Experiment 1. In each turn, we chose one image from each subject as the training set. Finally, all recognition rates were averaged as the final results. The AR database is more challenging than the ORL database, with more subjects and some uncontrolled conditions.
A face image from the AR database was selected randomly and fed into the networks (Figure 10). The visual results are shown in Figure 11.
It is evident that the edges and other texture information in Figure 11b are severely affected by noise, since the max pooling method does not respond to noise contained in the inputs. Thus, many random noise points mixed into the feature maps are preserved after pooling. If the max pooling layer is replaced by the G-RLBP pooling method, we can see clearly that the noise impact is lowered effectively (Figure 11c). Furthermore, some crucial texture edges in Figure 11c become more recognizable compared with Figure 11b.
We also quantitatively analyzed the recognition rates of these network models (Tables 6 and 7) when the testing images were injected with Gaussian noise and salt and pepper noise of different intensities. The recognition rates of the six network models in this section are much lower than those of Experiment 1. If the testing images are clean, the networks equipped with the proposed G-RLBP pooling layer achieve a nearly 2% improvement in recognition rate over those with the max pooling layer. The recognition rates of the six network models in Tables 6 and 7 decrease as the intensity of the noise injected into the testing images increases, especially for the four baseline networks with the max pooling layer and with data augmentation. However, the networks with G-RLBP perform better than the baseline networks. This is mainly because the G-RLBP pooling method is less sensitive to noise and can preserve more crucial texture information of the input feature maps, and thus obtains more discriminative features. However, if the testing images are severely affected by noise, none of the six network models performs well; in this case, it is difficult even for a human to identify the contour information of the images.

Experiments Based on the GoogleNet
GoogleNet is another effective visual recognition model which has been used in many recognition fields such as face recognition, image classification, and target tracking. Here, we conduct some experiments on this model to further evaluate our pooling layer. We use the entire CASIA-WebFace database to train our models. For data augmentation, the training data were injected with Gaussian noise of different intensities. The other experimental settings were the same as in Experiments 1 and 2. The testing data were chosen from the ORL database and the AR database, respectively. Before the images were fed into the networks, they were injected with Gaussian noise of different intensities. Tables 8 and 9 show the results. Finally, to make the experimental results more convincing, we also utilized the average filter and the BM3D algorithm [30], respectively, as pre-processing steps to remove the noise injected into the testing data, as can be seen in the fourth and fifth columns of Tables 8 and 9. We can see that the average filter and the BM3D algorithm are both effective when slight noise is injected into the testing images ($\sigma^2$ = 0.002 in Tables 8 and 9). In this case, the BM3D algorithm is more robust than the average filter and gives better recognition performance. However, when the testing images are clean, the recognition rates with the average filter and with the BM3D algorithm are both reduced to some extent. In contrast, the G-RLBP pooling method achieves the best recognition results in all cases and can still perform well even when the testing images are seriously affected by noise. It is also clear that, when the G-RLBP pooling method is transferred to other networks such as the GoogleNet, it also improves the recognition performance. Thus, the proposed G-RLBP pooling method is effective and can be used in more modern CNN architectures.

Discussion
In this paper, we propose the G-RLBP pooling method to downsample the feature maps of convolutional layers. Our work has two main contributions: (1) With the robust LBP guiding, each pixel in the input feature maps is assigned a different weight based on its probability of being affected by noise. In this way, the proposed G-RLBP can suppress the pixels that are likely to be affected by noise and then calculate the weighted average of the remaining pixels as the final result to obtain more noise-robust features. (2) The proposed pooling method can extract more discriminative information from the feature maps and preserve more crucial edges of the face images during downsampling. The experimental results in Section 3 show that the proposed G-RLBP pooling method can be used as an effective method to further improve the performance of deep CNNs. It achieves the best recognition results compared with the max pooling method and with data augmentation that injects different random noise into the training data. In particular, the networks equipped with the G-RLBP pooling layer perform better in some uncontrolled noisy conditions. In future work, we will further improve our networks to adapt to more complex conditions, for example by performing experiments on other modern CNN architectures and on different image databases with varying degrees of challenge. It should be mentioned that the proposed G-RLBP pooling method has so far been applied only to face recognition. Thus, transferring this method to other fields will be our main work in the future. Another possible direction is to involve sparsity-based models [31] to further improve the cost-effectiveness and robustness of the recognition system.