Periocular Recognition in the Wild: Implementation of RGB-OCLBCP Dual-Stream CNN

Featured Application: The proposed periocular biometric network can be applied to any application that requires identity management, such as homeland security, border control, access control, and criminal investigation.
Abstract: Periocular recognition remains challenging for deployments in unconstrained environments. Therefore, this paper proposes an RGB-OCLBCP dual-stream convolutional neural network, which accepts an RGB ocular image and a colour-based texture descriptor, namely Orthogonal Combination-Local Binary Coded Pattern (OCLBCP), for periocular recognition in the wild. The proposed network aggregates the RGB image and the OCLBCP descriptor by using two distinct late-fusion layers. We demonstrate that the proposed network benefits from both the RGB image and the OCLBCP descriptor and thereby gains better recognition performance. A new database of periocular images in the wild, namely the Ethnic-ocular database, is introduced and shared for benchmarking. In addition, three publicly accessible databases, namely AR, CASIA-iris distance and UBIPr, have been used to evaluate the proposed network. When compared against several competing networks on these databases, the proposed network achieved better performance in both recognition and verification tasks.


Introduction
Biometric systems have been widely deployed worldwide since the late 1990s for identity management, banking, homeland security, etc. [1]. Among the different biometric systems, face recognition offers flexibility, availability, and user-friendliness [2]. However, biometrics experts and police departments in the United States agreed after the "Boston Marathon bombings" in 2013 that face recognition technology remains challenging [3]. For instance, changes in a subject's appearance caused by cosmetic products, plastic surgery, or masks may cause the identification of suspects to fail. To sidestep the complexity of the full facial region, periocular recognition has been gaining attention owing to its promising recognition performance [4].
What does periocular refer to? According to the definition in [5], periocular denotes the region around the eyes, which includes the eyelids, eyelashes, and eyebrows (see Figure 1). The periocular region is more tolerant of variability in expression and occlusion, for example at crime scenes where perpetrators intentionally mask part of their faces. This makes it better suited to matching partial faces [6,7]. In addition, the rapid growth of camera use in social networks, surveillance, and smartphones has arguably increased the interest in periocular recognition [8,9]. For all these reasons, periocular recognition has become an area of intense study in the biometrics and computer vision communities.
Figure 1. Samples of periocular regions. The sample images show the periocular region, including the eyebrows. The images are collected from The Korea Times [10] and Kitchen Decor [11].

In this paper, we address the challenges of periocular recognition in unconstrained or "in-the-wild" environments, which remain insufficiently addressed by current works [12,13]. These challenges stem from dissimilarities between periocular images caused by sensor placement, pose alignment, illumination level, occlusion, etc. We therefore study this problem by means of a fusion approach with a dual-stream Convolutional Neural Network (CNN), which accepts an RGB ocular image and a novel colour-based texture descriptor, known as Orthogonal Combination-Local Binary Coded Pattern (OCLBCP). We have also developed and shared a new database, namely the Ethnic-ocular database, by collecting periocular region images in the wild to validate the proposed network.

Related Works
The early study on periocular biometrics presented in [5] shows promising results in human recognition. The authors adopted several handcrafted descriptors, such as Histogram of Oriented Gradients (HOG), Local Binary Patterns (LBP), and Scale-Invariant Feature Transform (SIFT), as periocular feature representations, followed by score fusion for classification. Fernandez et al. [14] and Cao et al. [15] introduced a similar approach, which convolves periocular features extracted from the HOG or LBP feature matrix with Gabor filters, followed by score fusion. Several other research articles focus on combining texture descriptors with a fusion algorithm for periocular representation and recognition [16-20]. All these approaches mainly amalgamate various handcrafted texture descriptors and then apply learning machines to obtain decent periocular recognition performance. However, they are less robust to "in the wild" variations such as resolution, illumination level, pose, and occlusion, because handcrafted texture descriptors are inadequate and inflexible in representing periocular features. Therefore, periocular recognition in the wild remains a challenge.
In recent years, CNNs have gained escalating attention in image classification [21,22]. CNNs can extract image texture features from different layers, whereas handcrafted texture descriptors are limited to low-level features, which are equivalent to the features of the first convolutional (conv) layer of a CNN. Apart from conv layers at different levels, features can also be extracted from the max pooling (maxpool) and fully-connected (fc) layers of CNNs. Several researchers have employed CNNs for periocular recognition. For instance, Gangwar et al. [23] proposed two CNNs (for the left and right oculars), namely DeepIrisNet, which extract comprehensive information to boost recognition performance. Other studies, e.g., by Proença et al. [24] and Zhao et al. [25], have demonstrated enhanced CNN frameworks for periocular recognition in which prior knowledge is exploited to discard unnecessary information. Proença et al. [24] suggested removing or separating the iris and sclera from the periocular regions, while Zhao et al. [25] identified the critical regions (only the eyebrow and eye regions) from which more discriminative information can be extracted to improve periocular recognition. However, these networks were found to underperform when periocular images are misaligned, when the eyebrows are missing, or when the ocular region is missing.
The relevant works that deal with non-ideal ocular images are those by Zhang et al. [26] and Soleymani et al. [27]. Zhang et al. [26] fused iris and periocular modalities through a weighted concatenation. The network achieved significant results when compared to other CNNs. Similarly, Soleymani et al. [27] proposed a new multimodal CNN, namely multi-fusion CNN, where the iris, face, and fingerprint features are fused at the fc layer. A fusion layer is designed to fuse different levels of fc layers as multi-feature representations with the sole RGB image. However, these works leverage several biometric modalities, not all of which may be available at the same time; for example, the face may be occluded with the mouth covered, or the iris may be captured from a distance. Furthermore, relying on multiple biometric modalities may jeopardise the usability of the system, because fingerprint and iris acquisition require cooperation from the users.
In earlier work on CNNs that consume face texture descriptors, Levi et al. [28] demonstrated the use of a colour-based LBP descriptor as input to a CNN, rather than the raw RGB face image, for emotion recognition. The authors showed that a colour-based texture descriptor is useful for training their network in the wild environment. This work motivates us to investigate and analyse the impact of a colour-based texture descriptor within a CNN for periocular recognition in the wild.

Motivation and Contributions
In the early days of periocular recognition, research was mostly concerned with the best way to handle periocular images in the presence of illumination changes, pose misalignment, and occlusion [5,6]. Many periocular databases were built using carefully controlled images for each of these issues. UBIPr [12], the CASIA-iris database [29], and the MICHE database [30] are the most comprehensive efforts in this direction, and they were created in well-controlled environments.
Presently, the challenges of periocular recognition concern images that have large variations due to in-the-wild environments, such as ageing, appearance, camera location, illumination level, occlusion, pose alignment, and others [18,31]. In addition, many existing databases [12,13,29,30] and research works [18,23,27] are not yet prepared for the challenge of periocular recognition in the wild. In particular, periocular appearance altered by cosmetic products or plastic surgery can affect recognition performance negatively. This paper offers a solution for periocular recognition in the wild by investigating the fusion of RGB periocular images and a novel texture descriptor, i.e., OCLBCP, by means of a dual-stream CNN. OCLBCP exploits the colour information in the periocular texture to better represent the periocular features for recognition in the wild. The two streams share their parameters, and a late fusion takes place at the last conv layer before the fc layer.
For validation of the proposed network, a new database, namely Ethnic-ocular, is introduced by collecting periocular region images in the wild. The database includes five ethnic groups: African, Asian, Latin American, Middle Eastern, and White. The database is built on the observation that each ethnic group has a distinctive periocular shape and periocular skin texture [32]. Therefore, the database avoids unbalanced selection, as there are differences in the configuration of oculars among different ethnicities.
Hence, the contributions of this paper are as follows:
• To study the complementarity between the CNN and its input features, we investigate and analyse the combination of the RGB image and a novel texture descriptor, namely OCLBCP, for periocular recognition in the wild.
• Two distinct late-fusion layers are introduced in the proposed CNN. The role of the late-fusion layers is to aggregate the RGB image and the OCLBCP descriptor. Hence, the proposed two-stream CNN benefits from the new features of the late-fusion layers to deliver better accuracy.
• A new periocular in the wild database, namely Ethnic-ocular, is created and shared in [33]. The images were collected across highly uncontrolled subject-camera distances, appearances, resolutions, locations, levels of illumination, and so on. The database includes training and testing schemes for performance analysis and evaluation.
The paper is organised as follows: Section 2 describes the structure of the proposed colour-based Orthogonal Combination-Local Binary Coded Pattern (OCLBCP) texture descriptor. The proposed network with fusion algorithm is presented in Section 3 and the detailed database information is presented in Section 4. Section 5 discusses the experimental results and analysis. A conclusion is summarised in Section 6.

Colour-Based Orthogonal Combination-Local Binary Coded Pattern
This section introduces a new colour-based texture descriptor known as Orthogonal Combination-Local Binary Coded Pattern (OCLBCP). OCLBCP is devised based on the notion of an orthogonal combination of Local Binary Pattern (LBP) [34] and Local Ternary Pattern (LTP) [35]. The OCLBCP descriptor yields a more vibrant texture representation since it is less sensitive to the image noise and levels of illuminations.
Let $I_p \in \mathbb{R}^{x \times y}$ be the periocular grayscale image, where $x$ and $y$ are the width and height of $I_p$, respectively. Because the apparent changes in the images are related to illumination and pose, we deploy the pre-processing method used in [36] to reduce the noise in $I_p$. First, we transform $I_p$ into the Fourier domain as $Z$. Next, we apply the Butterworth filter ($B$) to $Z$ to reduce the illumination noise and enhance the reflectance [37]. Finally, we apply an inverse Fourier transform to obtain the filtered image $I_p$.
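The pre-processing step above can be sketched in NumPy as follows. This is a minimal illustration rather than the implementation from [36]: the log transform, the cut-off frequency, the filter order, and the two gains (`low_gain`, `high_gain`) are assumptions chosen only for demonstration.

```python
import numpy as np

def butterworth_highpass(shape, cutoff=30.0, order=2):
    """Butterworth high-pass response on the FFT frequency grid (assumed parameters)."""
    rows, cols = shape
    u = np.fft.fftfreq(rows) * rows          # integer frequency indices along rows
    v = np.fft.fftfreq(cols) * cols          # integer frequency indices along columns
    dist = np.sqrt(u[:, None] ** 2 + v[None, :] ** 2)
    dist[0, 0] = 1e-6                        # avoid division by zero at the DC term
    return 1.0 / (1.0 + (cutoff / dist) ** (2 * order))

def filter_illumination(image, cutoff=30.0, order=2, low_gain=0.5, high_gain=1.5):
    """Attenuate low frequencies (illumination) and boost high ones (reflectance)."""
    log_img = np.log1p(image.astype(np.float64))   # assumed homomorphic log step
    Z = np.fft.fft2(log_img)                       # image in the Fourier domain
    B = low_gain + (high_gain - low_gain) * butterworth_highpass(image.shape, cutoff, order)
    filtered = np.real(np.fft.ifft2(Z * B))        # back to the spatial domain
    return np.expm1(filtered)
```

With these assumed gains, a perfectly uniform image is simply dimmed, since it contains only the zero-frequency component that `low_gain` scales down.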
To construct the OCLBCP descriptor, $I_p$ first has to be processed according to the LBP [34] and LTP [35] transformations. LBP summarises the local structure in an image by comparing each pixel with its neighbourhood [34]. The descriptor works by thresholding a neighbourhood matrix against the grey level of the central pixel to produce a binary code. LTP is an extension of LBP with three-valued codes [35]. It also compares each pixel with its neighbouring pixels, but the thresholded results are combined into a ternary pattern. The ternary pattern is then split into two binary patterns, called the positive and negative matrices.
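As an illustration of the two transformations just described, the following sketch computes the LBP bits and the LTP positive and negative bits for a single 3 × 3 patch. The clockwise neighbour ordering and the LTP tolerance `t` are assumptions for this example; the text does not specify them.

```python
import numpy as np

# Neighbour offsets, clockwise from the top-left corner (an assumed ordering).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]

def lbp_bits(patch):
    """Binary code: 1 where a neighbour is at least as bright as the centre."""
    centre = int(patch[1, 1])
    return np.array([1 if int(patch[1 + dr, 1 + dc]) >= centre else 0
                     for dr, dc in OFFSETS])

def ltp_bits(patch, t=5):
    """Ternary code split into positive and negative binary patterns."""
    centre = int(patch[1, 1])
    neighbours = np.array([int(patch[1 + dr, 1 + dc]) for dr, dc in OFFSETS])
    ternary = np.where(neighbours >= centre + t, 1,
                       np.where(neighbours <= centre - t, -1, 0))
    positive = (ternary == 1).astype(int)
    negative = (ternary == -1).astype(int)
    return positive, negative

patch = np.array([[54, 57, 60],
                  [45, 50, 70],
                  [12, 48, 52]])
print(lbp_bits(patch))   # LBP bits for the patch
print(ltp_bits(patch))   # (positive, negative) LTP bits
```

Neighbours within `t` grey levels of the centre map to zero in the ternary code, which is what makes LTP less sensitive to small intensity noise than LBP.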
In this paper, the LBP consists of the 3 × 3 neighbourhood matrix, and the LTP consists of the positive and negative matrices. To compute them, $I_p$ is partitioned into sub-matrices of size 3 × 3, and the neighbourhood values of each sub-matrix are binarised according to the centre value of the sub-matrix, which serves as the reference value for thresholding. After that, the descriptor combines the sub-matrices of LBP and LTP into four orthogonal groups: $D_1$, $D_2$, $D_3$, and $D_4$ (see Figure 2). The orthogonal groups serve to achieve illumination invariance and uncover better texture information by removing outlying disturbances. Specifically, to obtain $D_1$, the bits from the yellow boxes in the LBP and the bits from the green boxes in the LTP positive matrix in Figure 2 are combined. The same process is repeated for $D_2$, $D_3$, and $D_4$. Let $\theta$ be the OCLBCP descriptor. We first convert the binary codes $D_k$ into decimal numbers $D_{ck}$, $k = 1, 2, 3, 4$, and then choose the largest value among all the orthogonal groups. Specifically, $\theta$ is formed by combining the groups as follows:

$$\theta(i, j) = \max\left(D_{c1}, D_{c2}, D_{c3}, D_{c4}\right) \quad (1)$$

where $i$ and $j$ are the indices of $\theta$.
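The final selection step, converting each orthogonal group to a decimal number and keeping the largest, can be sketched as below. The grouping of bits into $D_1$ to $D_4$ depends on Figure 2 and is not reproduced here; the four 4-bit groups and the most-significant-bit-first ordering in this example are hypothetical.

```python
def bits_to_decimal(bits):
    """Interpret a bit sequence as an unsigned integer, most significant bit first (assumed)."""
    value = 0
    for bit in bits:
        value = (value << 1) | int(bit)
    return value

def oclbcp_value(groups):
    """Return the largest decimal code among the orthogonal groups D_1..D_4."""
    return max(bits_to_decimal(group) for group in groups)

# Hypothetical bit groups for one 3 x 3 sub-matrix.
groups = [[0, 1, 1, 0], [1, 0, 0, 1], [0, 0, 1, 1], [1, 1, 0, 0]]
print(oclbcp_value(groups))  # decimal codes are 6, 9, 3, 12
```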
To map $\theta(i, j)$ into a colour-based texture descriptor, we create a distance pattern matrix $\Delta$ (Equation (2)) that represents the similarity of the image intensity patterns across all possible pixel values, based on [28]. Its entries are $\delta_{r,c}$, where $r$ and $c$ are the indices of $\delta$, and each $\delta_{r,c}$ is calculated with the Earth Mover's Distance. After that, the Multi-Dimensional Scaling (MDS) algorithm is adopted to seek the mapping of $\Delta$ to a low-dimensional metric space, the colour pattern matrix $M$ (Equations (3) and (4)) [38]. The mapping is governed by a scale factor and by $f(\delta_{r,c})$, a monotonic transformation function of $\delta_{r,c}$. In this paper, we set the scale factor to three because a colour image has three RGB channels. Note that $M$, the output of $\mathrm{MDS}(\cdot)$, is a three-channel colour matrix that contains R, G, and B pixel values. Finally, we map $\theta(i, j)$ with $M$ to generate the colour-based texture descriptor OCLBCP. The mapping process matches the pixel values of $\theta(i, j)$ against the pixel values in the R channel of $M$. After that, each value of $\theta(i, j)$ is replaced with the corresponding RGB values from $M$. Algorithm 1 summarises the process of generating OCLBCP.
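To make the mapping from a distance matrix to a three-channel colour table more concrete, the sketch below uses classical (Torgerson) MDS. This is a swapped-in simplification: the paper's MDS applies a monotonic transformation $f(\delta_{r,c})$ and uses Earth Mover's Distance, whereas this example accepts any precomputed symmetric distance matrix. The helper names and the rescaling to 0-255 are assumptions.

```python
import numpy as np

def classical_mds(delta, dims=3):
    """Embed points in `dims` dimensions from a symmetric distance matrix."""
    n = delta.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ (delta ** 2) @ centring   # double-centred squared distances
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:dims]             # largest eigenvalues first
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale

def to_rgb(coords):
    """Rescale each embedding axis to 0..255 so a row can be read as an (R, G, B) triple."""
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return np.round(255 * (coords - lo) / span).astype(np.uint8)

# Toy example: six code values with a stand-in Euclidean distance matrix.
rng = np.random.default_rng(0)
points = rng.normal(size=(6, 3))
delta = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
M = to_rgb(classical_mds(delta, dims=3))   # one RGB triple per code value
theta = np.array([[0, 5], [3, 1]])         # hypothetical code image
colour_image = M[theta]                    # lookup gives an H x W x 3 image
```

The lookup in the last line replaces every code value in `theta` with its RGB triple, which mirrors the final mapping step described above.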

Algorithm 1 Creating the colour-based texture descriptor OCLBCP.
Input: $I_p \in \mathbb{R}^{x \times y}$
Output: OCLBCP
1: Perform pre-processing on $I_p$ and obtain the filtered image $I_p$
2: Apply the LBP and LTP transformations to $I_p$
3: Perform Equation (1) with the LBP, LTP positive, and LTP negative matrices to obtain $\theta$
4: Construct the distance pattern matrix $\Delta$ using Equation (2)
5: Generate the colour-based pattern matrix $M$ with $\delta$ by using Equations (3) and (4)
6: Map $\theta$ with $M$ to generate OCLBCP

RGB-OCLBCP of Dual-Stream CNN
We propose a dual-stream CNN that takes the periocular RGB image and the OCLBCP descriptor as the first and second input streams of the network. Note that the dual-stream CNN was originally proposed by Feichtenhofer et al. [39] for action detection and recognition, where the two input streams are the spatial and temporal streams. In our work, the network accepts and processes the periocular colour image and the texture descriptor, and feature fusion layers are then devised to extract a better feature representation for ocular recognition.
As shown in Figure 3, the architecture of the proposed network consists of 16 convolutional (conv) layers and 8 max-pooling (maxpool) layers. The conv layers are designed to learn the correspondence between the RGB image and the OCLBCP descriptor and to discriminate between them using shared weights. Table 1 tabulates the architecture of the proposed network.

Table 1. Network layers and configurations. Notes: f refers to the size of the feature map in the conv layers; k is the filter size.

Fusion Layers
Two fusion layers, namely $fuse_{max}$ and $fuse_{sum}$, are designed to aggregate the information from the RGB image and the OCLBCP descriptor, as shown in Figure 3. The $fuse_{max}$ layer takes the larger activation from the $flat_{RGB}$ and $flat_{OCLBCP}$ layers, each with $m$ nodes, which are flattened from $conv^{(15)}_{RGB}$ and $conv^{(16)}_{OCLBCP}$, respectively. The $fuse_{max}$ layer can be represented as:

$$fuse_{max} = \max\left(flat_{RGB}, flat_{OCLBCP}\right)$$

On the other hand, $fuse_{sum}$ takes the sum of the activations of $flat_{RGB}$ and $flat_{OCLBCP}$. The layer is defined as follows:

$$fuse_{sum} = flat_{RGB} + flat_{OCLBCP}$$

Both operations are applied element-wise over the $m$ nodes.
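The two fusion operations are simple element-wise combinations, as the following NumPy sketch shows. The function names mirror the layer names in the text; the toy activation vectors are made up for illustration.

```python
import numpy as np

def fuse_max(flat_rgb, flat_oclbcp):
    """Element-wise maximum of the two flattened activation vectors."""
    return np.maximum(flat_rgb, flat_oclbcp)

def fuse_sum(flat_rgb, flat_oclbcp):
    """Element-wise sum of the two flattened activation vectors."""
    return flat_rgb + flat_oclbcp

flat_rgb = np.array([0.2, 1.5, 0.0, 0.7])      # toy activations, m = 4 nodes
flat_oclbcp = np.array([0.9, 0.3, 0.0, 0.7])
print(fuse_max(flat_rgb, flat_oclbcp))
print(fuse_sum(flat_rgb, flat_oclbcp))
```

The maximum keeps whichever stream responds more strongly at each node, while the sum lets both streams contribute, so the two layers provide different views of the same pair of inputs.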

Total Loss for Training
For training, we define a total loss function, $\mathcal{L}_{total}$, which is the summation of the softmax cross-entropy $\mathcal{L}$ between each logit vector and its respective encoded label:

$$\mathcal{L}_{total} = \sum_{V \in \{V_{max}, V_{sum}\}} \mathcal{L}(V, L)$$

where $V \in \{V_{max}, V_{sum}\}$. $V_{max}$ and $V_{sum}$ are defined as the features of the $fuse_{max}$ and $fuse_{sum}$ layers for the training samples, respectively. In the cross-entropy, $L$, $N$, and $C$ denote the class labels, the number of training samples in $V$, and the number of classes, respectively. Note that a periocular region contains left and right oculars; we therefore train each side with a separate network (Figure 3).
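A minimal NumPy sketch of this total loss is given below. Averaging the cross-entropy over the $N$ samples and using one-hot encoded labels are assumptions; the text only states that the two softmax cross-entropy terms are summed.

```python
import numpy as np

def softmax_cross_entropy(logits, labels_onehot):
    """Mean softmax cross-entropy over N samples (logits and labels are N x C)."""
    shifted = logits - logits.max(axis=1, keepdims=True)        # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels_onehot * log_probs).sum(axis=1).mean()

def total_loss(logits_max, logits_sum, labels_onehot):
    """Sum of the cross-entropy terms for the fuse_max and fuse_sum branches."""
    return (softmax_cross_entropy(logits_max, labels_onehot)
            + softmax_cross_entropy(logits_sum, labels_onehot))
```

Because both branches are penalised against the same labels, gradients from both fusion layers flow back into the shared conv layers during training.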

Score Fusion Layer for Recognition
To recognise an unknown identity, a score fusion layer $S_{total}$ is devised to merge the distance scores from the softmax vectors for decision-making. Let $Y_{max} = \mathrm{softmax}(V_{max}) \in \mathbb{R}^C$ and $Y_{sum} = \mathrm{softmax}(V_{sum}) \in \mathbb{R}^C$ be the softmax vectors of $fc^{(3)}$ and $fc^{(4)}$, respectively. Since we train the proposed network separately for the left and right oculars, we differentiate the softmax vector $Y$ into $Y_{left}$ and $Y_{right}$. Note that each of $Y_{left}$ and $Y_{right}$ is still the sum of its corresponding softmax vectors, $Y = Y_{max} + Y_{sum}$.
We evaluated the proposed system in two common biometric working modes: recognition and verification. For the former, the testing data are divided into a gallery set and a probe set. Each subject $j$ in the gallery set is composed of his/her left and right softmax vectors as $Y^G_j = \{Y^G_{j,left}, Y^G_{j,right}\}$. The score fusion layer is computed with the sum rule as follows:

$$S_{total}(Y^P, Y^G_j) = \sum_{* \in \{left, right\}} s\left(Y^P_*, Y^G_{j,*}\right)$$

where $s(Y^P_*, Y^G_{j,*}) = 1 - \cos(Y^P_*, Y^G_{j,*})$ is the cosine distance and $* \in \{left, right\}$. To identify $Y^P$, $\phi$ is decided as follows:

$$\phi = \arg\min_{j} S_{total}(Y^P, Y^G_j)$$

The verification protocol refers to deciding whether a claimed identity is genuine or an impostor. Let $Y^R = \{Y^R_{left}, Y^R_{right}\}$ be the reference set (template) and $Y^A = \{Y^A_{left}, Y^A_{right}\}$ be the query set. To decide whether $Y^A$ is genuine or an impostor, $\zeta$ is decided by using Equation (12) as follows:

$$\zeta = \begin{cases} \text{genuine}, & S_{total}(Y^A, Y^R) \le \tau \\ \text{impostor}, & \text{otherwise} \end{cases}$$

where $\tau$ is a threshold value that depends on the training dataset.
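The sum-rule score fusion and the two decision modes can be sketched as follows. The dictionary layout for templates (`"left"` and `"right"` keys), the function names, and the use of "less than or equal to" at the threshold are assumptions made for this example.

```python
import numpy as np

def cosine_distance(a, b):
    """s(a, b) = 1 - cos(a, b)."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def fused_score(probe, template):
    """Sum rule over the left and right ocular distances."""
    return sum(cosine_distance(probe[side], template[side]) for side in ("left", "right"))

def identify(probe, gallery):
    """Recognition: return the gallery identity with the smallest fused distance."""
    scores = {subject: fused_score(probe, template) for subject, template in gallery.items()}
    return min(scores, key=scores.get)

def verify(query, reference, tau):
    """Verification: accept as genuine when the fused distance does not exceed tau."""
    return fused_score(query, reference) <= tau

gallery = {
    "a": {"left": np.array([1.0, 0.0]), "right": np.array([0.0, 1.0])},
    "b": {"left": np.array([0.0, 1.0]), "right": np.array([1.0, 0.0])},
}
probe = {"left": np.array([0.9, 0.1]), "right": np.array([0.2, 0.8])}
print(identify(probe, gallery))               # closest gallery identity
print(verify(probe, gallery["a"], tau=0.5))   # genuine claim against subject "a"
```

Since $s$ is a distance, smaller fused scores indicate a better match, which is why recognition uses a minimum and verification compares against an upper threshold.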

Database
A large-scale collection of periocular in the wild images from different ethnic groups was created, namely the Ethnic-ocular database. This database is built for periocular recognition and contains left and right oculars that were extracted from 85,394 images downloaded from the web. All images were collected in the wild, with uncontrolled subject-camera distances, poses, appearances with and without make-up, and levels of illumination.
We propose this new database to support balanced selection in the configuration of oculars among different ethnicities, and also to stimulate research on periocular recognition in the wild, in which all periocular images are taken in common, everyday settings. Figure 4 shows several sample images.

Collection Setup
To create our database, we selected subject names randomly from BBC News [40], CNN News [41], Naver News [42], and FaceScrub database [43]. The subjects were randomly selected based on different ethnicities. They mostly are celebrities, politicians, athletes, etc.
From the search results, the top 300 images for each subject were downloaded using Python scripts. After that, the images were manually verified to ensure that they were correctly labelled with the subjects. We first extracted the facial regions in these images by using the face detector from Matlab [44] for periocular region extraction. Then, the coordinates of the facial feature points were fixed based on the face detector bounding box for image alignment, and the images of the subjects were labelled manually. After that, we implemented the technique from [45], which allowed us to crop the images into left and right oculars. The database contains 85,394 images (including left and right ocular images) of 1034 subjects. Note that the views of these images are between −45° and 45°.

Training Protocol
For the training protocol, 623 subjects were randomly selected. Note that no training subjects overlapped with the subjects used for benchmarking. To train our own models, we divided the images of each subject into training, testing, and validation sets with a ratio of 70:15:15.
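A per-subject 70:15:15 split of this kind can be sketched as below. The function name, the fixed random seed, and the choice to give any rounding remainder to the validation set are assumptions; the text specifies only the ratio.

```python
import random

def split_subject_images(images, seed=0):
    """Shuffle one subject's images and split them 70:15:15 into train/test/validation."""
    items = list(images)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 70 // 100      # integer arithmetic avoids float rounding
    n_test = len(items) * 15 // 100
    train = items[:n_train]
    test = items[n_train:n_train + n_test]
    validation = items[n_train + n_test:]  # remainder goes to validation (assumed)
    return train, test, validation

train, test, validation = split_subject_images([f"img_{i:03d}.jpg" for i in range(20)])
print(len(train), len(test), len(validation))
```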

Benchmark Protocol
We selected the remaining 411 subjects for benchmarking. In the benchmarking scheme, we created recognition and verification tasks. For the recognition task, images of a specific set of individuals to be recognised (the gallery set) were gathered and new images (the probe set) were presented; the task was to decide which of the gallery identities was represented by each probe. In the experiments, we divided the images of each subject into the gallery set and the probe set with a ratio of 50:50. This division process was repeated three times.
For the verification task, the goal was to analyse two sets of periocular images and decide whether they represent the same person or two different people. In the experiments, we randomly selected 1200 pairs labelled "same" and 1200 pairs labelled "not same". This selection process was repeated three times.

Experiments
We conducted several experiments to compare the recognition and verification performance of our network with that of other benchmark networks. All configurations of the networks are described in Section 5.1 and the experimental results are presented in Section 5.2.

Configuration of Proposed Network
The proposed network was implemented using the open source deep learning toolkit TensorFlow [46]. We applied an annealed learning rate that started from $1.0 \times 10^{-3}$ and was subsequently reduced by a factor of $10^{-1}$ every 10 epochs. The minimum learning rate was defined as $1.0 \times 10^{-5}$. We applied an Adam optimiser in this network, where the weight decay and momentum were set to $1.0 \times 10^{-4}$ and 0.9, respectively.
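The annealing rule described above corresponds to a step schedule with a floor. The sketch below is framework-independent; the function and parameter names are illustrative and not taken from the authors' code.

```python
def learning_rate(epoch, base_lr=1e-3, decay=0.1, step=10, min_lr=1e-5):
    """Multiply the rate by `decay` every `step` epochs, never going below `min_lr`."""
    return max(base_lr * decay ** (epoch // step), min_lr)

for epoch in (0, 9, 10, 20, 50):
    print(epoch, learning_rate(epoch))
```

Under these values the floor is reached at epoch 20, so the rate stays at the minimum for the rest of the 200-epoch run.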
In our experiments, the batch size was set to 64 and the training was carried out across 200 epochs. The training was done by using our database and following the protocols mentioned in Section 4.2 and it was performed by an NVidia Titan Xp GPU.

Configuration of Benchmark Networks
We selected several deep networks to evaluate the performance of periocular recognition: AlexNet [21], DeepIrisNet-A [23], DeepIrisNet-B [23], FaceNet [47], LCNN29 [48], Multi-fusion CNN [27], and VGG16 [49]. According to the work of Gangwar et al. [23], Soleymani et al. [27], Schroff et al. [47], Wu et al. [48], and Hernandez et al. [50], these networks have proven successful in very large recognition tasks. In the experiments, we took the pre-trained models provided by the authors and fine-tuned them separately on the left and right oculars. DeepIrisNet-A, DeepIrisNet-B, and Multi-fusion CNN are not publicly available. Therefore, we made our best effort to implement these networks from scratch by following Gangwar et al. [23] and Soleymani et al. [27], respectively.

Experimental Results
We present the experimental results for periocular recognition and verification on databases collected in the wild and in controlled environments. For recognition, we evaluated the performance by using the Cumulative Matching Characteristic (CMC) curve with a 95% confidence interval (CI). For verification, we evaluated the performance using the Receiver Operating Characteristic (ROC) curve together with the Equal Error Rate (EER) and the Area under the ROC curve (AUC).
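For readers unfamiliar with these metrics, the sketch below computes a CMC curve from a probe-by-gallery distance matrix and an EER from genuine and impostor distances. It is a generic illustration, not the authors' evaluation code: it assumes one gallery template per identity, that every probe identity appears in the gallery, and that smaller distances mean better matches.

```python
import numpy as np

def cmc_curve(dist, probe_ids, gallery_ids, max_rank=10):
    """Fraction of probes whose true identity appears within the top-k gallery matches."""
    order = np.argsort(dist, axis=1)                 # gallery indices, best match first
    ranked_ids = gallery_ids[order]
    hits = ranked_ids == probe_ids[:, None]
    first_hit = hits.argmax(axis=1)                  # rank position of the true identity
    return np.array([(first_hit < k).mean() for k in range(1, max_rank + 1)])

def equal_error_rate(genuine, impostor):
    """Error rate at the threshold where false accepts and false rejects are closest."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor <= t).mean() for t in thresholds])   # false accept rate
    frr = np.array([(genuine > t).mean() for t in thresholds])     # false reject rate
    best = np.argmin(np.abs(far - frr))
    return (far[best] + frr[best]) / 2

dist = np.array([[0.1, 0.5, 0.9],
                 [0.6, 0.2, 0.4],
                 [0.3, 0.8, 0.7]])
ids = np.array([0, 1, 2])
print(cmc_curve(dist, ids, ids, max_rank=3))
print(equal_error_rate(np.array([0.1, 0.2, 0.3, 0.6]), np.array([0.4, 0.7, 0.8, 0.9])))
```

The first entry of the CMC curve is the Rank-1 accuracy and the fifth is the Rank-5 accuracy, the two figures reported throughout the results.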

Performance Analysis on Proposed Network
This section analyses the robustness and performance of our network and other networks using the Ethnic-ocular database; the experimental results are reported in Table 2. Table 2 shows that the proposed network achieved the highest Rank-1 and Rank-5 recognition accuracies, with 85.03 ± 1.88% and 94.23 ± 1.26%, respectively. By comparison, a single-stream CNN using only the RGB image achieved Rank-1 and Rank-5 accuracies of 80.79 ± 1.43% and 90.42 ± 1.29%, respectively. In addition, the CNN using only OCLBCP achieved 66.65 ± 2.22% and 89.73 ± 1.91% for Rank-1 and Rank-5 accuracies, respectively. These results indicate that our network exploits more complementary information than a single-stream CNN. This suggests that the proposed late-fusion layers effectively correlate the RGB image and OCLBCP to achieve better recognition performance.
Furthermore, we also evaluated the dual-stream CNN without shared weights. This network achieved only 82.09 ± 1.59% and 92.11 ± 1.32% at Rank-1 and Rank-5 accuracies (see Table 2), respectively. The experimental results show that the proposed network improves Rank-1 accuracy by about 2.9 percentage points over the dual-stream CNN without shared weights. The shared conv layers and the fusion layers are what the network uses to aggregate the RGB image and OCLBCP. Thus, the proposed network successfully learned new knowledge representations that yield better recognition in the wild.
In Table 2, we also notice that the space complexity (total number of weights) and time complexity (flops) of the proposed network are significantly smaller than those of its single-stream and unshared-weight dual-stream counterparts, while it still outperforms them.

Performance Evaluation on Recognition and Verification Tasks
We used Ethnic-ocular, as well as three public databases, the AR [51], CASIA-iris distance [29], and UBIPr [12], to evaluate the performances of the proposed network and other benchmark networks. All the experimental results are outlined in the following sections.

Evaluation on AR Database
The AR database is designed under a constrained environment, which consists of 117 subjects with varying neutrals, expressions, illuminations, and occlusion conditions, who were captured across two sessions. We opted for this database as it provides a good baseline to evaluate the robustness and performance in constrained environments, such as different levels of illuminations and expressions in an indoor environment. Extraction for the periocular regions was done by using the method in [45].
The experimental protocol for recognition was as follows: ten images for each subject were used as gallery sets from Session 1 and another ten per subject as probe sets from Session 2. The verification protocol was designed by randomly selecting 250 reference-query pairs as "same" and another 250 pairs as "not same". Table 3 presents the performance comparisons on recognition. As can be seen in the table, our network achieved the highest Rank-1 and Rank-5 recognition accuracies with 96.32% and 98.80%, respectively. DeepIrisNet-A had the best Rank-1 and Rank-5 performance among the benchmark approaches, with accuracies of 95.24% and 98.38%, respectively. Figure 5a illustrates that the proposed network outperformed all the benchmark approaches from Rank-1 to Rank-10 recognition.
For the verification task, we report the experimental results in Table 4. The proposed network also achieved the best EER and AUC with 5.13% and 0.9880, respectively. DeepIrisNet-A, Multi-fusion CNN, and VGG16 achieved the second-best performances among the other benchmark approaches with 7.69% for EER. Figure 5b illustrates the ROC curve and shows that the proposed network (red solid line with diamond) outperformed the benchmark approaches.

Evaluation on CASIA-Iris Distance Database
To evaluate whether our approach performs well on another standard database, we also tested its performance in a more subjective experiment with CASIA-iris distance database. This database consists of 142 subjects under a long-range subject-camera distance and indoor environment. The images were captured by a high-resolution camera so both dual-eye iris and periocular are included in the image region of interest. The further details of the database can be found in [29].
The experimental protocol for recognition was designed with the ratio of the gallery set to probe set as 50:50 and the division process was repeated three times. The experimental protocol for the verification was designed by randomly selecting 250 reference-query pairs as "same" and another 250 pairs as "not same". This selection process was repeated three times.
According to Table 3, the proposed network achieved the highest average accuracies for Rank-1 and Rank-5 recognitions with 96.62 ± 1.3% and 98.45 ± 0.4%, respectively. Besides, FaceNet achieved the second-best performance with 96.09 ± 2.1% and 98.10 ± 0.4% for Rank-1 and Rank-5 recognition accuracies, respectively. We also present in Figure 6a the Rank-1 to Rank-10 recognition results. As can be seen, our network achieved the best results among the benchmark networks.
For the verification, the proposed network achieved the lowest EER of 4.35 ± 0.5% and an AUC of 0.9860. DeepIrisNet-B attained the second-lowest EER with 5.87 ± 1.5% and an AUC of 0.9756. Figure 6b illustrates the ROC curve, which demonstrates that our network obtained the best AUC and the lowest EER. Both the recognition and verification results indicate that the proposed network is capable of learning the features of the RGB image and OCLBCP well enough to improve the performance of recognition and verification tasks.

Evaluation on UBIPr Database
We also conducted a more challenging experiment with the UBIPr database to verify the robustness of the proposed network. This database consists of 342 subjects with varying subject-camera distances, levels of illumination, and poses [12]. This experiment evaluated the performance of all the networks with varying poses and subject-camera distances. Six images from each subject were randomly selected as the gallery set; the remaining images were used as the probe set. The division process was repeated three times. For the verification, we randomly selected 600 reference-query pairs as "same" and another 600 pairs as "not same". This selection process was also repeated three times. Table 3 shows that our network achieved the highest average Rank-1 and Rank-5 recognition accuracies with 91.28 ± 1.2% and 98.59 ± 0.4%, respectively. The second best was achieved by multi-fusion CNN with 90.75 ± 1.0% and 97.44 ± 0.3% as Rank-1 and Rank-5 accuracies, respectively. Figure 7a illustrates the CMC curve and shows that our network achieved the best recognition performance for all ranks.
For the verification, Table 4 reveals that our network achieved the lowest EER with 3.41 ± 1.8% and an AUC of 0.9938. This supports the claim that the proposed network can verify unconstrained periocular images robustly. Figure 7b shows that our network outperformed most of the benchmark networks and achieved the highest recall rate against all other approaches.

Evaluation on Ethnic-Ocular Database
We present the experimental results in Table 3 by following the recognition protocol mentioned in Section 4.3. To evaluate the performance of the proposed approach, we compared our results with seven benchmark approaches (see Table 3). For recognition, our network achieved 84.79 ± 1.9% and 94.23 ± 1.3% as Rank-1 and Rank-5 accuracies, respectively. Figure 8a illustrates the CMC curve of the proposed network, showing that the proposed method outperformed the other benchmark methods from Rank-1 to Rank-10 recognition accuracies. The results indicate that the late-fusion layers are capable of correlating the RGB image and the OCLBCP descriptor. Table 4 also shows that the proposed network achieved the lowest EER of 6.63 ± 1.5% for verification. Figure 8b illustrates the ROC curve, showing that our network outperformed all benchmark networks. The results suggest that our approach can learn new features from the late-fusion layers and transfer knowledge between the streams to achieve better recognition performance.

Discussion
Through the experimental analysis and results, we observed that having access to both the RGB image and the OCLBCP descriptor allows the network to exploit discriminative features from its inputs for better periocular recognition. In addition, the proposed network utilises colour-based texture information, which contributes to a more robust feature representation for the challenges of recognition and verification in the wild. This is because a handcrafted texture descriptor can offer latent and complementary information for learning from complex data.
In the evaluations on constrained environments, our network consistently scored higher accuracies. Periocular recognition and verification in the wild bring more challenges than the constrained environment. The experimental results indicate that our network is able to perform better recognition owing to its ability to learn new features from the proposed late-fusion layers. The effectiveness of the fusion layers supports our assumption that multi-feature learning can work much better than using the RGB image alone in periocular recognition.

Conclusions
This paper proposed a dual-stream CNN, which accepts an RGB ocular image and OCLBCP for periocular recognition in the wild. Aggregating the RGB image and OCLBCP features in two distinct late-fusion layers yields more robust and better recognition performance. We collected and shared a new Ethnic-ocular database, which consists of a large collection of periocular images in the wild from different ethnic groups. In extensive experiments against several competing networks on the new Ethnic-ocular database and on publicly available databases, the proposed network achieved better performance in both recognition and verification tasks.
In the near future, we plan to investigate different kinds of fusion stages and fusion layers in CNNs, which could improve the performance of multi-feature learning. Periocular recognition currently fails for subjects wearing sunglasses. As a remedy, we shall incorporate a generative adversarial model, which is useful for recovering the periocular area in the face image.