Multi-Descriptor Random Sampling for Patch-Based Face Recognition

Abstract: While there has been a massive increase in research into face recognition, it remains a challenging problem under real-life conditions. This paper focuses on partial occlusion, a distortion inherent to real face recognition applications, and proposes an approach to tackle it. First, face images are divided into multiple patches, and the Local Binary Patterns and Histograms of Oriented Gradients local descriptors are applied to each patch. Next, the resulting histograms are concatenated, and their dimensionality is reduced using Kernel Principal Component Analysis. Patches are then randomly selected using random sampling to construct several Support Vector Machine sub-classifiers. The results obtained from these sub-classifiers are combined to generate the final recognition outcome. Experimental results on the AR face database and the Extended Yale B database show the effectiveness of the proposed technique.


Introduction
Face images can be captured easily at a distance and can be used in various applications including surveillance, tracking, and access control. Therefore, the face has been investigated more widely in the biometric research field than other biometric modalities such as iris, fingerprint, and palmprint.
Currently, the human face can be accurately recognised in a restricted environment. However, in an unrestricted environment, several challenges are encountered where faces are exposed to distortions. These distortions include illumination changes, pose variations, and partial occlusion. Moreover, while multiple algorithms have been proposed to tackle them in recent years, they have their limitations or requirements that cannot be met for faces in the wild.
An image-based recognition system comprises a feature extraction and representation process followed by a classification stage. Feature extraction methods can be classified into two main approaches: holistic feature-based and local feature-based methods [1].
In holistic approaches, the features extracted from the whole image are processed using linear statistical techniques, nonlinear ones, or a combination of both. The more conventional holistic methods include popular linear techniques such as Principal Component Analysis (PCA) [2], Independent Component Analysis (ICA) [3], and Linear Discriminant Analysis (LDA) [4]. However, these methods may not be efficient due to the nonlinear characteristics of face images. Therefore, some nonlinear kernel-based techniques have been investigated to address the problem by exploiting the contours of face images, including the detailed information of the curves. Kernel Principal Component Analysis (KPCA) and Kernel Fisher Analysis (KFA) [5,6] are widely used methods in this category.
Local feature-based approaches are proven to be more robust to deal with complex backgrounds and occlusions inherently present in real image data. Unlike global descriptors which compute features from the whole image, local descriptors [7] have been shown to be more effective. Patch-based face recognition, which was proposed in [8], is another effective technique and operates by dividing an image into multiple overlapping or nonoverlapping patches using either global or local descriptors for matching. In the case of patch-based approaches, the extraction of the local features is performed for each region (or patch) of the images where each face image is divided into a number of either overlapping or non-overlapping blocks. There exists a number of approaches for patch-based face recognition in the literature. The authors in [9] have proposed a feature concatenation method including a block selection with similarity measure. On the other hand, the work described in [10] suggests the use of a weight for classification results of the patches by calculating the genuine classification rates extracted from the test set. The work [11] proposes to employ the concept of subspace by using a majority voting scheme for combining the results of classification generated from the patches using random subspaces. The work discussed in [12] proposes carrying out the training of classifiers using separate random patches of the images and suggests a combination using a two-step layer decision: (i) using a weighted summation and (ii) combining the outcome from local ensemble classifiers with that of a global classifier obtained from the whole faces. The work described in [13] proposes to determine and select face areas containing more discriminative information for use in the classification phase. 
Although this method is highly robust against illumination distortions and partial occlusions, its classification performance remains limited, mainly because one single classifier is constructed for all the image patches. The authors in [14] propose to first determine the area having the largest matching score at each point of the face. This is then used to carry out an occlusion de-emphasis stage in order to deal with partial occlusion distortions. However, such a de-emphasis procedure can be challenging to develop because occlusions vary in type and extent.
Recently, deep learning [15][16][17] has gained popularity in face recognition problems. This technology gives outstanding results and clearly outperforms conventional machine learning algorithms. However, deep learning architectures generally require a considerable amount of data as well as specialised high performance hardware for the training stage, especially in practical situations. This makes them hard to deploy and less suited to embedded and low power applications.
Therefore, this paper proposes a random patch sampling method for human face recognition under various distortions, partial occlusion in particular. Local descriptors are deployed to capture smaller texture patterns, which can be more discriminative in human faces, while still keeping the spatial relations. This paper is a follow-up of our previous work [18], deploying a multi-descriptor approach instead of a single descriptor. In addition, the proposed method has been validated using a dataset with more challenging illumination and occlusion conditions.
The paper is organised as follows: Section 2 gives an overview of the method, including a brief description of face patching, the multi-LBP approach, feature extraction using HOG and Kernel PCA and their application in the proposed method, and finally the proposed Random Patching method and its adaptation to the problem of face recognition. Section 3 discusses the validation process and the experiments performed, together with a comparison against existing methods. Finally, conclusions and future work are given in Section 4.

Overview of the Proposed Method
As mentioned above, this paper proposes a random patch (RP) sampling method for face recognition under distortions, targeting partial occlusion in particular. The use of multiple local descriptors helps capture smaller texture patterns. They therefore offer higher accuracy than holistic feature-based descriptors, which tend to average over the given image. The local descriptors used are Local Binary Patterns (LBP) and Histograms of Oriented Gradients (HOG). These two descriptors provide different types of features, which are complementary, so their combination offers more discriminative power than a single descriptor. For dimensionality reduction, KPCA, which is a nonlinear extension of conventional PCA, offers more refined features. For the matching process, the proposed approach uses Random Patch Sampling based on several Support Vector Machine (SVM) classifiers. It treats all generated face patches equally when building multiple sub-classifiers to further improve the recognition performance.
The proposed algorithm works as follows. First, each image is partitioned into several 50% overlapping regions/blocks. Then, the LBP and HOG descriptors are used to individually extract features from the generated image patches. Since this step generates high-dimensional descriptors, potentially including redundancies, KPCA is used to extract the most significant feature patterns of the descriptors. Next, the reduced descriptors of the image patches are normalised and fed to the classification module. A number of patches are then randomly sub-sampled within each image training set, and an SVM classifier is built from each subset. Finally, the results generated by all the sub-classifiers are combined with a union rule. Figure 1 depicts the process. The proposed algorithm was validated through extensive experiments using a single sample per person, as in real world conditions.

Face Patching
Let S be a greyscale image. S can be defined as a collection of k patches. The blocks can be overlapping or non-overlapping, and covering or non-covering. Their shapes and sizes can vary as well. Figure 2 illustrates overlapping blocks. Selecting the optimum patch size is an important step since it can significantly affect the recognition performance: the extracted features may correlate adversely in small blocks, while the more discriminative ones may not be captured in large patches. In this work, the face patches are rectangular and overlap by 50%, which helps to capture more distinguishing features while avoiding excessive redundancy. To determine the appropriate patch size, initial experiments were carried out by varying the block size [18] and noting the performance. A size of 33 × 30 was found to be the best; it relates to the image's original size of 165 × 120 (165 = 5 × 33 and 120 = 4 × 30).
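The patching step can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `patch_origins` and `extract_patches` are hypothetical, and the 50% overlap is assumed to be realised as a stride of half the patch size, rounded down.

```python
def patch_origins(height, width, patch_h, patch_w, overlap=0.5):
    """Return the top-left (row, col) corner of every patch on a regular grid."""
    # Assumption: the stride is the non-overlapping share of the patch, rounded down.
    step_r = max(1, int(patch_h * (1.0 - overlap)))
    step_c = max(1, int(patch_w * (1.0 - overlap)))
    origins = []
    for r in range(0, height - patch_h + 1, step_r):
        for c in range(0, width - patch_w + 1, step_c):
            origins.append((r, c))
    return origins


def extract_patches(image, patch_h, patch_w, overlap=0.5):
    """Slice a 2-D image, given as a list of rows, into overlapping patches."""
    height, width = len(image), len(image[0])
    return [[row[c:c + patch_w] for row in image[r:r + patch_h]]
            for r, c in patch_origins(height, width, patch_h, patch_w, overlap)]
```

Under these assumptions, a 165 × 120 image with 33 × 30 patches yields a 9 × 7 grid of 63 patches. The text does not state the patch count, so this figure is a property of the sketch only.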

Multi-Scale Local Binary Patterns
The LBP operator has gained much popularity as a local texture descriptor for various computer vision and biometric security applications including face recognition [19]. It is based on a combination of greyscale invariants and works by thresholding and labelling a pixel of an image neighbourhood (P, R) (P sampling points on a circle of radius R) against the central pixel value. This results in a binary number, and the histogram of the labels can then be used as a texture descriptor.
One of the earliest LBP neighbourhoods introduced is (8, 1), generated by the 8 neighbouring pixels at a radius of 1, as shown in Figure 3. This scheme was later extended to larger neighbourhoods. As can be seen in Figure 3, the threshold is generally the value of the central pixel $g_c$, against which each neighbourhood pixel $g_p$ is compared. The operator yields 1 if $g_p$ is greater than or equal to $g_c$ and 0 otherwise. The final LBP code is an integer, and the features extracted by the LBP operator can be represented as histograms. Mathematically, this can be expressed as:
$$\mathrm{LBP}_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^{p}, \qquad s(x) = \begin{cases} 1, & x \geq 0 \\ 0, & x < 0 \end{cases}$$
The local neighbourhood (P, R) is a set of P evenly spaced sampling points on a circle of radius R centred at a fixed pixel.
Uniform patterns [20] were inspired by the observation that some binary patterns occur more commonly in facial images than others. An LBP is called uniform when the binary pattern contains at most two bitwise transitions from 0 to 1 or vice versa when the bit pattern is considered circular. By using uniform patterns and computing the occurrence histogram, structural and statistical approaches are effectively combined: the uniform histogram estimates the distribution of micro-structures such as edges, lines, and flat areas. Uniform patterns also reduce the usual histogram length from 256 bins to 59 bins [20].
LBP histograms were introduced for face description in [21], where the face images are divided into a number of local regions, allowing the texture descriptors to be extracted from each region. The descriptors are then combined into one uniform histogram representing the face image, as depicted in Figure 4. In addition, since the area covered by a conventional LBP operator is usually small, a uniform multiscale LBP has been chosen in our work.
This ensures that neighbourhoods of varying sizes can be used. The LBP is therefore computed with P = 8, P = 16, and P = 24 sampling points. The feature vectors extracted from each neighbourhood are then concatenated to form one uniform LBP histogram with 857 bins. This method covers a larger area, thus providing a much wider range of discriminative descriptors. The choice of LBP neighbourhoods is based on the best results obtained in initial experiments. The neighbourhoods LBP (8, 3), LBP (16, 8), and LBP (24, 6) offer different ranges of features at different scales.
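The LBP code, the uniformity test, and the histogram lengths quoted above can be checked with a short Python sketch. The helper names are hypothetical, the neighbour ordering is an arbitrary convention, and ties between a neighbour and the centre are assumed to map to 1, as in the common definition of the operator.

```python
def lbp_code_3x3(window):
    """LBP (8, 1) code of a 3x3 window given as a list of three rows."""
    center = window[1][1]
    # Neighbours visited clockwise from the top-left pixel (a convention).
    neighbours = [window[0][0], window[0][1], window[0][2], window[1][2],
                  window[2][2], window[2][1], window[2][0], window[1][0]]
    # Assumption: a neighbour equal to the centre contributes a 1 bit.
    return sum(1 << p for p, g in enumerate(neighbours) if g >= center)


def transitions(code, bits):
    """Count 0/1 changes when the bit pattern is read circularly."""
    rotated = (code >> 1) | ((code & 1) << (bits - 1))
    return bin(code ^ rotated).count("1")


def is_uniform(code, bits):
    """A pattern is uniform if it has at most two circular transitions."""
    return transitions(code, bits) <= 2


def uniform_bins(bits):
    """Histogram length: one bin per uniform pattern plus one shared bin."""
    return bits * (bits - 1) + 3


# Histogram lengths for P = 8, 16, 24 and their concatenation.
lengths = [uniform_bins(p) for p in (8, 16, 24)]
total_bins = sum(lengths)
```

With P = 8, 16, and 24 the lengths are 59, 243, and 555, which add up to the 857 bins of the concatenated multi-scale histogram.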

Histograms of Oriented Gradients
Histograms of Oriented Gradients (HOG) is a representation that captures the edge or gradient structures that are very characteristic of local shapes by counting occurrences of edge orientations. It is also invariant to geometric transformations that are smaller than the local spatial or orientation bin size. HOG has been used mainly in human detection [22,23] and later for recognition [24]. HOG features are calculated by taking orientation histograms of the edge intensity in local regions. An image can be divided into N local regions called 'blocks'. Each block can then be divided into smaller spatial areas called 'cells', so that each block is defined as a set of cells. Figure 5 gives a step-by-step overview of the method. Each image patch is first divided into blocks of A × B pixels, then each block is divided into a number of a × b cells from which histograms of oriented gradients with k orientations are computed. After that, the histograms from each cell are concatenated into one histogram representing the whole block. These block histograms are then concatenated to represent each patch.
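The per-cell histogram computation can be sketched with NumPy as follows. This is a simplified illustration rather than the configuration used in the paper: the patch is treated as a single block, no block normalisation is applied, orientations are unsigned, and the defaults of 8 × 8 cells with k = 9 bins are assumptions. The function name `hog_patch` is hypothetical.

```python
import numpy as np


def hog_patch(patch, cell=(8, 8), k=9):
    """Concatenate per-cell orientation histograms of one greyscale patch."""
    patch = np.asarray(patch, dtype=float)
    gy, gx = np.gradient(patch)              # derivatives along rows, columns
    magnitude = np.hypot(gx, gy)
    # Unsigned gradient orientation in degrees, folded into [0, 180).
    angle = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    # Clamp guards against a rounding result of exactly 180 degrees.
    bins = np.minimum((angle / (180.0 / k)).astype(int), k - 1)

    cell_h, cell_w = cell
    histograms = []
    for r in range(0, patch.shape[0] - cell_h + 1, cell_h):
        for c in range(0, patch.shape[1] - cell_w + 1, cell_w):
            b = bins[r:r + cell_h, c:c + cell_w].ravel()
            m = magnitude[r:r + cell_h, c:c + cell_w].ravel()
            # Each pixel votes for its orientation bin, weighted by magnitude.
            histograms.append(np.bincount(b, weights=m, minlength=k))
    return np.concatenate(histograms)
```

For a 33 × 30 patch these defaults give 4 × 3 complete cells and therefore a 108-dimensional vector; pixels beyond the last complete cell are ignored in this sketch.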

Kernel Principal Component Analysis
Kernel PCA, which remains one of the most effective nonlinear dimensionality reduction techniques [25], is a nonlinear extension of conventional PCA. Conventional PCA uses second order statistics and therefore takes into account only partial statistical information of the face image at hand; extending PCA with kernels makes higher order statistics available as well. KPCA works by mapping the texture patterns of the original input space to a higher-dimensional nonlinear feature space [25] and carrying out PCA there. Performing PCA directly in the feature space was previously not possible due to the high computational expense of the dot product computation in the high dimensional feature space [26]. KPCA overcomes this issue because it is implemented in the input space using various kernels, without the need to perform the mapping explicitly [27]. Let $x_1, x_2, \ldots, x_m \in \mathbb{R}^N$ be the data in the input space, and let $\Phi : \mathbb{R}^N \rightarrow F$ be a nonlinear mapping from the input space to the feature space $F$.
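For reference, the textbook kernel PCA derivation that follows from this setup is summarised below. It is the standard formulation, stated under the assumption that the mapped data are centred in F, and is not recovered from the original text. The symbols C, λ, v, α, K, k, c, and d are introduced here for illustration.

```latex
\begin{aligned}
C &= \frac{1}{m}\sum_{i=1}^{m}\Phi(x_i)\,\Phi(x_i)^{\top}, \qquad \lambda v = C v, \\
v &= \sum_{i=1}^{m}\alpha_i\,\Phi(x_i), \qquad m\lambda\,\alpha = K\alpha, \qquad
K_{ij} = k(x_i, x_j) = \Phi(x_i)\cdot\Phi(x_j), \\
y(x) &= v\cdot\Phi(x) = \sum_{i=1}^{m}\alpha_i\,k(x_i, x), \qquad
k(x, y) = (x\cdot y + c)^{d} \ \text{(polynomial kernel)}.
\end{aligned}
```

Only kernel evaluations k(x_i, x_j) appear in the eigenproblem and in the projection y(x), which is why the mapping Φ never has to be computed explicitly.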
KPCA has been used extensively in various face recognition applications [28][29][30], including facial expression recognition under illumination variations, and has proven to give satisfactory results compared to other feature reduction techniques, hence its use in this work. Furthermore, this work uses the polynomial kernel since it has been shown to extract discriminative facial features effectively.
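A compact NumPy sketch of KPCA with a polynomial kernel is given below for illustration. The kernel degree, the offset `coef0`, and the function names are assumptions, since the text does not report the kernel parameters, and the sketch assumes the retained eigenvalues are strictly positive.

```python
import numpy as np


def polynomial_kernel(A, B, degree=2, coef0=1.0):
    """k(a, b) = (a . b + coef0) ** degree for every pair of rows."""
    return (A @ B.T + coef0) ** degree


def kpca_fit(X, n_components, degree=2, coef0=1.0):
    """Fit kernel PCA on the rows of X and keep the leading components."""
    m = X.shape[0]
    K = polynomial_kernel(X, X, degree, coef0)
    one = np.full((m, m), 1.0 / m)
    # Centre the kernel matrix, i.e. centre the data in feature space.
    Kc = K - one @ K - K @ one + one @ K @ one
    eigvals, eigvecs = np.linalg.eigh(Kc)          # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scale so that each principal axis has unit norm in feature space.
    alphas = eigvecs / np.sqrt(eigvals)
    return {"X": X, "K": K, "alphas": alphas, "degree": degree, "coef0": coef0}


def kpca_transform(model, Y):
    """Project the rows of Y onto the fitted kernel principal components."""
    X, K = model["X"], model["K"]
    Kt = polynomial_kernel(Y, X, model["degree"], model["coef0"])
    # Centre the new kernel rows using the training statistics.
    Kt_c = Kt - Kt.mean(axis=1, keepdims=True) - K.mean(axis=0) + K.mean()
    return Kt_c @ model["alphas"]
```

In the pipeline described here, X would hold the concatenated LBP and HOG descriptors of the training patches, and the reduced vectors returned by `kpca_transform` would be normalised before classification.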

Random Patch-Based SVM
In previous papers employing patched faces, researchers have either deployed all the patches or selected only a smaller number of blocks to construct a global classifier. In our approach described in [18], we chose a random sampling method to construct more than one classifier in order to improve the recognition performance. Here, a random sample is a subset of a population selected such that all samples have an equal probability of occurrence. The Support Vector Machine [31], which has been selected for the matching stage, has been found to be very effective. An SVM is a binary supervised learning algorithm in which the classification module is trained by mapping the training set feature vectors into a space that efficiently separates them using some kernel function (for example, polynomial or Gaussian). Once done, the test set is mapped onto the same space. Typically, an SVM classifier determines an optimal hyperplane for use as a decision function in a high-dimensional space, namely the one that maximises the margin between the classes. The novelty of our approach lies in training multiple SVM classifiers on the sub-training sets and combining the individual results with a union rule to obtain the final score, as illustrated in Figure 6. SVM has shown clear advantages in different applications [32] dealing with nonlinear data as well as high dimensionality and small samples, thus making it well suited to the problem at hand.
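The sampling and fusion logic around the sub-classifiers can be sketched in plain Python. Several points are assumptions rather than details from the text: patches are assumed to be partitioned without replacement into groups of p, and the union rule, which the text does not define, is read here as summing the per-class scores of all sub-classifiers. Both function names are hypothetical, and the training of each SVM is left out.

```python
import random


def random_patch_subsets(n_patches, p, seed=None):
    """Shuffle the patch indices and cut them into groups of at most p."""
    rng = random.Random(seed)
    indices = list(range(n_patches))
    rng.shuffle(indices)  # every patch is equally likely to land in any group
    return [indices[i:i + p] for i in range(0, n_patches, p)]


def fuse_by_score_sum(scores_per_classifier):
    """Sum per-class scores over all sub-classifiers and return the best class.

    Assumption: this is one possible reading of the 'union rule'.
    """
    n_classes = len(scores_per_classifier[0])
    totals = [0.0] * n_classes
    for scores in scores_per_classifier:
        for class_index, score in enumerate(scores):
            totals[class_index] += score
    return max(range(n_classes), key=totals.__getitem__)


# Example: 63 patches split into sub-training sets of p = 3 patches each.
subsets = random_patch_subsets(63, 3, seed=0)
```

Each group in `subsets` would supply the patch descriptors for one sub-classifier, and `fuse_by_score_sum` would then combine the score vectors that the sub-classifiers produce for a test face.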

Experiments and Analysis
In order to assess the effectiveness of the proposed approach, experiments were carried out using two different and well known datasets.

AR Face Dataset
The first dataset used, the cropped AR face database [27], contains 2600 images of 100 individuals (26 different images per person) taken in two sessions under various distortions including facial expression, lighting, and occlusions. The images have been resized to 165 × 120 pixels in this experiment. Some sample images of this dataset are shown in Figure 7.

Experiments Part 1: Single Descriptor
Experiments started by testing the proposed approach using LBP and HOG separately. First, the following LBP neighbourhoods were used: LBP (8,3), LBP (16,8), and LBP (24,6). After extracting the features at each scale separately, the resulting feature vectors are concatenated into one large feature set. Figure 9 presents the accuracy rate of each neighbourhood separately and of the concatenated features. Each LBP neighbourhood gives different results depending on the testing set and the type of occlusion present in the images, with LBP (8,3) scoring the highest rate for both sets. It is therefore concluded that the combination will tackle different types of challenges better than a single scale. From the same figure, it is seen that the multi-LBP reaches as high as 95% for set2 and averages around 70% for set1. Next, when extracting HOG features, the following cell sizes were used: 6 × 6, 7 × 7, and 8 × 8, as seen in Figure 10. The results show that each cell size performs differently on each testing set. With a cell size of 7 × 7, the recognition rate reaches 83% for set1 and 96% for set2. The latter is lower than that of cell size 6 × 6, which reaches 98%. Cell size 8 × 8 records a lower recognition rate than both smaller cells. Note that the smaller the cell, the more cells there are and hence the more features HOG produces.

Experiments Part 2: Multi-Descriptor
The second set of experiments focused on the combination of both HOG and LBP features for classification. First, the multi-LBP features used previously (see Figure 9) are concatenated with the HOG features from Figure 10. The results in Figure 11 show that the recognition rate for the combination is higher, especially for test set1, where it reaches as high as 81%, but the improvement is only slight.
Figure 11. Results of experiments carried out by combining HOG (with a cell size of 6 × 6) and LBP.
Another experiment was carried out to validate the approach with different HOG features, using a cell size of 8 × 8. The results, depicted in Figure 12, show a significant improvement for test set1, which increases sharply to 91% compared with previous results that fall below 83%. Test set2 sees an increase as well, to a high rate of 98.5%. Although the HOG features used in this experiment have lower recognition rates on their own than those obtained with other HOG cell sizes (see Figure 10), their combination with LBP features gives superior results. It can be concluded that both types of features are complementary for both testing sets, making the combination more robust against different types of partial occlusion.

Experiments Part 3: Classifier Size
In the third set of experiments, the number of samples used per SVM classifier is varied in order to find the best subset size. Figure 13 presents the results of the experiments, in which the number of samples is decreased each time. For testing set1, the accuracy rate increases noticeably as the number of samples decreases, from 75% when the number of samples is p = 6 to a maximum of 91% when p = 3. The same is observed for testing set2, which starts at 94.5% when p = 6 and rises as p decreases, reaching 98.5% when p = 3. These results can be explained by the fact that a smaller training set decreases the possibility of error, thus making the accuracy higher and the approach more robust in general.
Another observation relates to the difference between the performance rates of test set1 and test set2. Even though the scarf used as partial occlusion in set2 hides a larger part of the face than in set1, where only the eyes are hidden (see Figure 8), set2 gives higher accuracy under the proposed approach. It can be concluded that the features extracted from the eyes and eyebrows using the proposed method play a more significant role in recognition than those from other parts of the face. Table 1 compares our approach against some existing and similar approaches available in the literature, including our previous work [18]. From the results shown in the table, one can observe that the proposed method compares favourably with some of the best performing algorithms. For example, it attains 98% accuracy on the scarf-occluded set, thus matching [14,33]. It is also worth mentioning that the authors in [33] used more than one training sample in their analysis, unlike the proposed method, which uses a single training sample and therefore operates under conditions closer to the real world.

Extended Yale B Dataset
The Extended Yale B database [34] consists of 2414 frontal face images of 38 persons captured under 64 different illumination conditions. In addition, an image with ambient illumination was captured for every subject in all poses. The images are grouped into five subsets depending on the lighting angle with respect to the axis of the camera. Subset1 and subset2 cover the range of 0° to 25°, subset3 covers 25° to 50°, subset4 covers 50° to 77°, and subset5 covers angles larger than 78°. To simulate different levels of contiguous occlusion, the widely used technique described in [35] replaces a randomly located square patch of each test image with a baboon image, chosen because its texture is similar to that of the human face. The sizes of the synthetic occlusions vary from 10% to 80% of the original image size. Figure 14 shows some samples of randomly occluded faces generated from the Extended Yale B database. For this set of experiments, subset1 was used for training while the remaining four subsets were used for testing. For the other parameters, the best performing ones from the previous experiments were used. The original image size of 192 × 168 was kept, and the 50% overlapping patches were sized equally at 32 × 28 each. The classifier size was set to p = 3, and HOG with a cell size of 6 × 6 was combined with multi-LBP.
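The synthetic occlusion procedure can be illustrated with a short NumPy sketch. It assumes that the occlusion percentage refers to the fraction of the image area covered by the square, and that the replacement texture (the baboon image in the text) is available as an array at least as large as the square. The function name `occlude` is hypothetical.

```python
import numpy as np


def occlude(image, texture, fraction, rng):
    """Paste a square crop of `texture` at a random location of `image`."""
    occluded = np.array(image, copy=True)        # leave the input untouched
    height, width = occluded.shape
    # Assumption: `fraction` is the share of the image area to cover.
    side = int(round(np.sqrt(fraction * height * width)))
    side = min(side, height, width)              # keep the square inside
    top = int(rng.integers(0, height - side + 1))
    left = int(rng.integers(0, width - side + 1))
    occluded[top:top + side, left:left + side] = texture[:side, :side]
    return occluded


# Example: cover 50% of a 192 x 168 image with a constant stand-in texture.
rng = np.random.default_rng(0)
face = np.zeros((192, 168), dtype=np.uint8)
stand_in_texture = np.full((168, 168), 255, dtype=np.uint8)
occluded_face = occlude(face, stand_in_texture, 0.5, rng)
```

Even at the 80% level, the square side for a 192 × 168 image stays below 168 pixels, so the clamp to the image bounds is not triggered at the occlusion levels used here.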
The average recognition accuracy for each subset for an occlusion level ranging between 10% and 80% is depicted in Figure 15.
The obtained results have been compared against some state-of-the-art algorithms, and Table 2 gives the accuracy percentages. From the table, it can be observed that the proposed method achieves consistent results throughout the experiments. Although it does not reach the highest accuracy at small occlusion levels, it is less affected by increasing occlusion than the SSR-P/W method proposed in [36]. It eventually outperforms the latter when the occlusion is at 50%, reaching 90.58% compared to 88.6% for SSR-W in subset5. Further results can be seen in Table 3; they remain consistent even when the occlusion increases. The accuracy remains above 85% for any occlusion level and under different lighting conditions. This can also be seen in Figure 15, which illustrates the average accuracy for each of the four testing sets.

Conclusions
This paper has proposed a novel face recognition algorithm using the concept of random patching. The method operates by dividing the face images into a number of overlapping patches. Next, the LBP operator is employed as a local descriptor and combined with the HOG technique to extract a concatenated descriptor of the image patches. A dimensionality reduction step using KPCA is then applied to the inherently high-dimensional descriptors. Once done, a random patch sampling operation is employed, allowing us to build a number of SVM sub-classifiers. Finally, the classification results obtained from the SVMs are fused using a simple union rule. The experiments carried out suggest that the proposed algorithm performs favourably when compared against conventional global SVM face classifiers when the lower part of the face is missing (up to 98.5%). Furthermore, the algorithm outperforms other similar state-of-the-art techniques, thus demonstrating its potential even when working in an under-sampled and challenging operational environment.

Conflicts of Interest:
The authors declare no conflict of interest.