Recognizing Human Races through Machine Learning—A Multi-Network, Multi-Features Study

Abstract: The human face holds a privileged position in multi-disciplinary research as it conveys much information: demographic attributes (age, race, gender, ethnicity), social signals, emotion expression, and so forth. Studies have shown that, due to the distribution of ethnicity/race in training datasets, biometric algorithms suffer from the "cross race effect": their performance is better on subjects closer to the "country of origin" of the algorithm. The contributions of this paper are two-fold: (a) first, we gathered, annotated and made public a large-scale database of over 175,000 facial images by automatically crawling the Internet for celebrities' images belonging to various ethnicities/races, and (b) we trained and compared four state of the art convolutional neural networks on the problem of race and ethnicity classification. To the best of our knowledge, this is the largest, data-balanced, publicly-available face database annotated with race and ethnicity information. We also studied the impact of various face traits and image characteristics on the race/ethnicity deep learning classification methods and compared the obtained results with the ones extracted from psychological studies and anthropomorphic studies. Extensive tests were performed in order to determine the facial features to which the networks are sensitive. These tests and a recognition rate of 96.64% on the problem of human race classification demonstrate the effectiveness of the proposed solution.


Introduction
The definition and taxonomy of human races is complex, subjective and fluid, as there are many paradigms that can be used when defining them. Generally speaking, the concept of race is related to biological traits, such as (facial) bone structure, or hair, eye and skin colour. In Encyclopaedia Britannica [1] the concept of human race is defined as "the idea that the human species is divided into distinct groups on the basis of inherited physical and behavioural differences". The idea of human race, as known today, arose in the period of European colonization, when the colonizers came into contact with local populations with different physical traits, languages, traditions and culture. However, to this day, scientists have not reached a consensus regarding the definition or even the existence of human races, the number of taxa or the features that can be used to discriminate between the races. For example, the naturalist Charles Darwin [2] argued that the human races "graduate into each other" and therefore it is almost impossible to find cues that can clearly taxonomize the human race.
A standard taxonomy for human races was never established. For computer vision based recognition, the seven most commonly accepted racial groups are: "African/African American, Caucasian, East Asian, Native American/American Indian, Pacific Islander,

• The gathering of a large-scale "in the wild" face dataset (FaceARG) annotated with race and ethnicity information. To our knowledge, we gathered the largest available face database (of more than 175,000 images) annotated with race, age, gender and accessories information.
The remainder of this work is organized as follows: in Section 2 we present various anthropomorphic features that might be used to distinguish between racial groups and some theories regarding how humans perceive race. In Section 3 relevant works from the scientific literature that tackle the problem of race classification are presented. Section 4 details the proposed solution and Section 5 describes the training and test dataset (their gathering and the augmentation process). In Section 4 we mention some implementation details and in Section 5 we discuss the experiments we performed and their results. Section 5.4 presents various visualization techniques for the CNNs and their relation to how humans perceive race. Finally, Section 6 concludes this work.

Related Works
A detailed survey on racial and ethnic classification from facial images is presented in [3]. The problem of race classification is discussed starting from a fundamental and analytical understanding of race based on interdisciplinary expertise (psychology, cognitive neuroscience, anthropometry etc.), with emphasis on the various racial feature representations; also, the most relevant works in the field of automatic race classification from facial images are presented and compared.
From the perspective of the racial features used, the state of the art methods use either chromatic information [8], global features [9], local features [10] or a combination of the above [10]. Chromatic methods are usually based on the skin tone and are highly sensitive to illumination conditions. Global methods are the most commonly used and exploit the interrelations between different facial regions to establish the racial belonging. On the other hand, local feature based methods categorize the race based on lower level features, such as Gabor filters [10] or histograms of gradient directions. Finally, hybrid methods combine some or all of the above methods to obtain the optimal representation for race classification.
None of these methods is suitable for all use cases: for degraded, low resolution facial images, the face should be treated as a whole object and the chromatic information can bring important information. On the other hand, for high resolution images, geometrical and local feature models would be more appropriate.
Recently, with the emergence of deep learning, several works address the problem of race and ethnicity recognition using convolutional neural networks (CNN). In [11], the authors propose a deep network to classify visible and multi-distance near-infrared images solely into Asian or Caucasian race groups, as well as Male and Female groups.
The network is trained and tested on images belonging to only 203 subjects and for the task of race classification it achieves an accuracy of 95%. In [12], the authors train and evaluate a deep learning race and ethnicity classification approach on three scenarios: (a) the recognition of white and black subjects, (b) the recognition of Chinese and Non-Chinese subjects, and (c) the classification of Han, Uyghurs and Non-Chinese people.
Some works attempted a more fine-grained classification, by classifying humans into ethnic groups. An analysis of both human and machine performance on a challenging ethnic classification task of Indian faces is presented in [13]. The authors gathered the Centre for Neuroscience Indian Face Dataset (CNSIFC), which consists of 1650 faces labelled with ethnicity (South vs. North Indian). Several classifiers were trained for this binary ethnic classification problem using spatial intensity features, local shape features or CNN based features. The best attained accuracy is 62%, using CNN based features.
InclusiveFaceNet [14] learns gender and race attributes from a held-out dataset; the race and gender labels are not explicitly specified, just that the dataset contains two gender identities and four ethnic groups. Then, these learned representations are transferred to face attribute recognition models, allowing them to leverage ethnic and gender representations without having to predict these attributes on the subject.
In [15], the authors gathered the Racial Faces in the Wild (RFW) dataset, consisting of 4K images divided into four ethnic groups: Caucasian, Asian, Indian and African. The database comprises images from MS-Celeb-1M [16], which were labelled with the Face++ API [17], or for which the ethnic group was selected based on the information available on a list of celebrities. Using this benchmark, the study confirmed the racial bias of state of the art face recognition systems. To overcome the racial bias problem, the authors proposed a deep information maximization adaptation network that uses Caucasian as the source domain and the other races as target domains.
Another work that addresses the problem of racial bias in face recognition is [18]. The authors envisioned an image generation method which transfers racial characteristics while preserving the identity features. The main idea of this image augmentation method is to make the race related features irrelevant to the identity recognition problem.

Race and Gender Faces in the Wild
The gathering of training data, a time consuming process that also requires domain specific knowledge, is a crucial step in the context of machine learning. The training data determines what the network learns to recognize before being applied to unseen data. Nowadays, multiple facial image databases are publicly available and they have made a significant contribution to the progress of machine learning, as they are used to train and evaluate the performance of machine learning algorithms.
However, datasets have received some criticism as they often narrow the focus of object recognition research by reducing it to a single benchmark performance number. In [19], the authors raised a warning about database bias, a subject that has not received the appropriate attention from the scientific community. Starting from the fact that a scientist can easily (with 75% accuracy) determine the database from which an image came, the study analyses the different kinds of bias that can appear in datasets, as well as the impact of this bias on detection and classification accuracy. Four types of bias were identified: capture bias (datasets contain images captured in similar conditions), selection bias (datasets often contain some particular type of images: for example, street images), category bias (semantic labels are subjective and can be interpreted differently by labelers) and negative bias (what the dataset defines as "the rest of the world").
Since the publicly available image databases are rarely annotated with race information, we decided to gather a large dataset of public personalities annotated with race, gender, age and accessories (eyeglasses) information. With the observations from [19] in mind, we collected several lists of famous celebrities (actors, singers, athletes, politicians and mathematicians) belonging to each race class and automatically crawled the Internet for their images.
We collected lists of famous Afro-American, Asian, Caucasian and Indian subjects, and we grouped the images by race and gender. Faces were detected and cropped from all the images as described below (Section 3). As some images contain multiple faces, three human labelers carefully analysed the downloaded data, and annotated it with age and accessories information (eyeglasses). Also, the labelers discarded the false positive faces reported by the face detector. This dataset will be made public.
The face database we gathered contains approximately 175,000 facial images labelled with four race labels (African-American, Asian, Caucasian and Indian), gender, age and face accessories (eyeglasses) information (Figure 1). The number of images belonging to each race is approximately the same (24.02% African-American, 25.60% Asian, 24.42% Caucasian and 25.94% Indian).
The distribution of the other labels in the dataset is depicted in Figure 2. We argue that our training dataset minimizes the selection and capture bias, as the images are collected automatically from the Internet and were captured in different scenarios and with different cameras. The problem of negative bias does not occur because the first step is to detect the face in the input image, and only then is the image labeled.
The racial belonging is subjective, especially in the case of multiracial subjects, and we tried to address the category bias using three independent human labelers. In order to further correct the database capture bias (that arises from the fact that photographers tend to capture pictures in similar ways) and to enlarge the dataset, we also applied several data augmentation techniques: random horizontal flips, small affine transformations, alterations of the brightness and contrast of the image, and random horizontal crops. For the validation of the proposed algorithm, we used a leave one subject out methodology: we selected images of persons which were not seen by the network and only these images were used to determine the performance of the classifier. In order to make the network equally discriminative for all classes, we equalized the race distribution for training.
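Some of the augmentations listed above can be illustrated with a minimal NumPy sketch. The flip probability, the contrast and brightness ranges and the 90% crop ratio are assumed values chosen for illustration, the crop is taken along both axes for simplicity, and the small affine transformations are omitted; none of these choices is taken from the actual training pipeline.

```python
import numpy as np

def augment(image, rng):
    """Return a randomly augmented copy of an H x W x 3 uint8 face image."""
    out = image.astype(np.float32)
    # Random horizontal flip (probability 0.5 is an assumed value).
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    # Contrast (gain) and brightness (offset) jitter; both ranges are assumptions.
    gain = rng.uniform(0.8, 1.2)
    offset = rng.uniform(-20.0, 20.0)
    out = np.clip(gain * out + offset, 0.0, 255.0)
    # Random crop keeping 90% of each side (assumed ratio).
    h, w = out.shape[:2]
    ch, cw = int(0.9 * h), int(0.9 * w)
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    out = out[top:top + ch, left:left + cw, :]
    return out.astype(np.uint8)

rng = np.random.default_rng(0)
face = rng.integers(0, 256, size=(100, 100, 3), dtype=np.uint8)  # stand-in image
augmented = augment(face, rng)
```

The cropped result would still have to be resized to the input size of the network before training.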

Race Detection Using Convolutional Neural Networks
After gathering the data, we trained several convolutional neural networks to recognize the race classes in these images, such that we can establish the performance of state of the art algorithms for this problem. The outline of this basic framework is depicted in Figure 3.

[Figure 3: outline of the framework; recoverable panel labels: Input image, Face detection + 40% margin]
The first step of the algorithm is to locate the face area in the input image. As opposed to other methods that use complex alignment operations [20], we use a fast and simple normalization procedure. An off the shelf face detector [21] is used to find a square region corresponding to the face area. We heuristically determined that the area around the face (such as the hair structure and texture) contains valuable information, so the face region is also enlarged by 40% in width and in height. If the enlarged region exceeds the image, it is clamped to the image edges. Finally, all the images are resized to a standard size as dictated by the input of the network architecture.
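The enlargement and clamping step can be sketched as follows. This is a minimal illustration that assumes the detector returns a box as (x, y, width, height) in pixels and that the 40% margin is split evenly between the two sides of each axis; the function name is hypothetical.

```python
def enlarge_and_clamp(box, image_width, image_height, margin=0.40):
    """Grow a face box by `margin` of its width and height, then clamp it to the image."""
    x, y, w, h = box  # top-left corner plus size, in pixels (assumed convention)
    dx = margin * w / 2.0  # half of the extra width goes to each side
    dy = margin * h / 2.0
    left = max(0, int(round(x - dx)))
    top = max(0, int(round(y - dy)))
    right = min(image_width, int(round(x + w + dx)))
    bottom = min(image_height, int(round(y + h + dy)))
    return left, top, right, bottom

# A face well inside a 640 x 480 image, and one close to the border of a 220 x 220 image.
crop_box = enlarge_and_clamp((100, 100, 200, 200), 640, 480)
edge_box = enlarge_and_clamp((10, 10, 200, 200), 220, 220)
```

In the second call the enlarged region would extend past all four borders, so the crop collapses to the whole image.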
We chose to train and compare four different convolutional neural network (CNN) architectures to classify the input images into one of the four race categories (Afro-American, Asian, Caucasian and Indian): VGG19 [22], Inception ResNet v2 [23], Se-ResNet [24] and Mobilenet V3 [25].
Convolutional neural networks achieved near human performance on multiple computer vision tasks and since then nearly all computer vision tasks have been re-examined from a deep learning perspective [26][27][28]. Recently, significant progress has been made due to new paradigms and improved network architectures. Multiple attempts have been made to improve the accuracy of CNNs. The Visual Geometry Group (VGG) from the University of Oxford [22] performed a thorough evaluation of CNNs by iteratively increasing the depth of the VGG architecture, that is, adding more convolutional layers (up to 19 layers), and evaluating its performance. In order to make this approach feasible, the size of the convolutional filters was significantly reduced (to 3 × 3 convolutional filters with stride 1 in all layers, as opposed to the 11 × 11 receptive field with stride 4 in [29]). It can be easily observed that stacking two 3 × 3 convolutional filters covers the same receptive field as one 5 × 5 filter. However, in the VGG case, two nonlinearities are applied, so the decision function is more discriminative. In addition, the number of parameters is considerably decreased.
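The receptive field and parameter argument can be checked with a few lines of arithmetic. The sketch below assumes that input and output have the same number of channels and ignores biases; the channel count of 64 is an arbitrary value used only for illustration.

```python
def conv_weights(kernel, channels, layers=1):
    """Weights of `layers` stacked kernel x kernel convolutions (biases ignored)."""
    return layers * kernel * kernel * channels * channels

def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked stride-1 convolutions."""
    return 1 + layers * (kernel - 1)

channels = 64  # assumed channel count, for illustration only
two_3x3 = conv_weights(3, channels, layers=2)  # 18 * C^2 weights
one_5x5 = conv_weights(5, channels)            # 25 * C^2 weights
field = stacked_receptive_field(3, 2)          # same field as a single 5 x 5 filter
```

Under these assumptions the stacked 3 × 3 layers use 18/25 = 72% of the weights of the single 5 × 5 layer, independent of the channel count.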
The "network in network" architecture [30] replaces the traditional convolutional layers of a CNN by building micro-networks to abstract the data within the receptive field; each micro-network is instantiated with a multilayer perceptron. Based on this work, Google developed the Inception model [31] and proposed GoogLeNet, a 22 layer convolutional neural network, which won the ILSVRC 2014 with a top-5 error of 6.7%. The network achieved very good performance at a relatively low computational cost and with fewer parameters. The main contributions introduced by GoogLeNet are the use of Inception modules, which perform multiple convolutions (with different filter sizes) in parallel and concatenate the resulting feature maps, stacked on top of each other, and the use of 1 × 1 convolutions before the more expensive (3 × 3 or 5 × 5) convolutions as a means of dimensionality reduction. Using these concepts, the depth of the network is increased without an uncontrolled increase in computational complexity. In addition, the visual information is processed at different scales and then aggregated, so that the subsequent layers can analyze features at different scales simultaneously. The original architecture was iteratively improved [32,33], up to the latest version, Inception-v4 [23], which achieved 3.08% top-5 error on the ILSVRC classification challenge. A similar performance was achieved by Inception ResNet v2 [23], an architecture that combines Inception modules with residual connections [34]; the main advantage of residual connections is that they can significantly speed up training.
The SENet architecture [24] ranked first in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2017 [26], bringing a relative improvement of roughly 25% over the winner of the previous year. The main contribution of SENet is the Squeeze and Excitation block (SE block), which models channel-wise feature dependencies to increase the representational power of the network. An SE block is composed of two parts: the squeeze and the excitation. The purpose of the squeeze step is to capture the global spatial information of a feature map into a channel descriptor; a global average pooling layer is used to generate channel-wise statistics. The output of this squeeze step is then fed into the excitation block, which computes a set of channel specific weights, thereby modelling channel-wise dependencies. This is achieved by a simple gating mechanism with a sigmoid activation.
This SE block can be incorporated into classical convolutional neural network architectures. To this end, we experimented with the SE-ResNet architecture, in which the SE block is inserted into the ResNet [34] architecture right before the summation with the identity branch.
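The squeeze and excitation steps can be sketched in NumPy as below. The two-layer bottleneck with a ReLU in between follows the usual formulation of the SE block and is an assumption with respect to the description above, which only mentions a sigmoid gate; the weight matrices are random placeholders rather than trained parameters, and the tensor sizes are arbitrary.

```python
import numpy as np

def se_block(features, w1, w2):
    """Rescale the channels of an H x W x C feature map with learned gates."""
    squeezed = features.mean(axis=(0, 1))           # squeeze: global average pooling -> (C,)
    hidden = np.maximum(0.0, squeezed @ w1)         # bottleneck layer with ReLU (assumed)
    weights = 1.0 / (1.0 + np.exp(-(hidden @ w2)))  # excitation: sigmoid gate -> (C,)
    return features * weights, weights              # channel-wise rescaling

rng = np.random.default_rng(0)
channels, reduction = 16, 4  # assumed sizes, for illustration only
features = rng.standard_normal((8, 8, channels))
w1 = rng.standard_normal((channels, channels // reduction))  # placeholder weights
w2 = rng.standard_normal((channels // reduction, channels))  # placeholder weights
rescaled, channel_weights = se_block(features, w1, w2)
```

In SE-ResNet the rescaled map would then be added to the identity branch of the residual unit.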
Finally, the last neural network that we trained and evaluated for the problem of race classification is Mobilenet V3 [25]. MobileNets are lighter architectures which are more suitable to be run on devices with limited computational power, such as embedded systems or mobile devices. Mobilenet V1 proposed depthwise separable convolutions (a depthwise convolution followed by a pointwise convolution) to improve the computational efficiency of the networks. Its successor, Mobilenet v2 [35], further improves on this design by introducing the inverted residual block. Recently, reinforcement learning has been used to improve and automate the convolutional neural architecture design process [36][37][38]. This is the case for Mobilenet V3 [25], in which the network is generated by a hardware-aware network architecture search combined with the NetAdapt algorithm [38]. Also, Mobilenet V3 features some network improvements in the first and final layers and introduces the h-swish nonlinearity, a faster and more "quantization-friendly" version of the previously introduced swish activation.
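For reference, the two activations can be compared numerically. The sketch below uses the standard definitions, swish(x) = x · sigmoid(x) and h-swish(x) = x · ReLU6(x + 3) / 6; the evaluation interval is an arbitrary choice.

```python
import numpy as np

def swish(x):
    """Swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def h_swish(x):
    """Hard swish: x * ReLU6(x + 3) / 6, a piecewise approximation of swish."""
    return x * np.clip(x + 3.0, 0.0, 6.0) / 6.0

x = np.linspace(-6.0, 6.0, 121)  # arbitrary evaluation interval
max_gap = np.max(np.abs(swish(x) - h_swish(x)))
```

The hard variant avoids the exponential, which is what makes it cheaper to evaluate and easier to quantize.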
The Chicago Face Database [39] contains high-resolution, standardized images of 158 participants with ages between 18 and 40 years and extensive data about the subjects: race (Asian, Black, Hispanic/Latino, White), gender, facial attributes (feminine, attractive, baby-faced etc.). The labels of this dataset were converted to our taxonomy using the following rules: Asian-Asian, Black-African-American, Hispanic/Latino-Caucasian, White-Caucasian.
The Minear-Park [40] face database contains frontal face images from 575 individuals, with ages ranging from 18 to 93; the majority of participants are Caucasian (76%), followed by African-Americans (16%) and the remaining 8% are Asian, South Asian and Hispanic. The labels of this dataset were converted to our taxonomy using the following rules: Asian-Asian, African-American-African-American, Hispanic-Caucasian, Caucasian-Caucasian, South Asian-Indian.
The Japanese Female Facial Expression (JAFFE) [41] database contains 213 images of 7 facial expression posed by 10 Japanese female models. All the images from this dataset were added to the training data and labeled with the Asian class.
The Multi-Racial Mega-Resolution (MR2) [42] face database contains high quality, high resolution images of 74 European, African and East Asian subjects captured in a professional photography studio. The subjects have facial features that unambiguously place them in one of the racial categories and don't have any unnatural hair colour or accessories. The labels of this dataset were converted to our taxonomy using the following rules: East Asian-Asian, African-African American, European-Caucasian.
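The conversion rules quoted for these datasets can be collected in a single lookup table, as in the sketch below. The dataset keys and the helper name are hypothetical; only the label pairs come from the rules listed in the text.

```python
# Source-dataset label -> label in the four-class taxonomy.
# Dataset keys ("chicago", "minear_park", ...) are hypothetical identifiers.
LABEL_MAP = {
    "chicago": {
        "Asian": "Asian",
        "Black": "African-American",
        "Hispanic/Latino": "Caucasian",
        "White": "Caucasian",
    },
    "minear_park": {
        "Asian": "Asian",
        "African-American": "African-American",
        "Hispanic": "Caucasian",
        "Caucasian": "Caucasian",
        "South Asian": "Indian",
    },
    "mr2": {
        "East Asian": "Asian",
        "African": "African-American",
        "European": "Caucasian",
    },
}

def to_taxonomy(dataset, label):
    """Map a source label to the four-class taxonomy; JAFFE images are all Asian."""
    if dataset == "jaffe":
        return "Asian"
    return LABEL_MAP[dataset][label]
```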
Several other databases that contain thousands of images annotated with ethnic and/or racial labels are available. The Indian Movie Face Database (IMFDB) [43] contains more than 34,000 images of Indian actors that were manually selected and cropped from 100 Indian movies. All the images are annotated with age, pose, gender, expression and the presence of occlusions. The main disadvantage of this database is that the images are cropped to a tight bounding box (from the forehead to the chin) and so they do not meet the requirements of the proposed solution. Another ethnic face database is the Korean Face Database (KFDB) [44]; it contains videos and images of approximately 1000 Korean subjects captured in controlled scenarios over a period of 3 years. However, this database is not publicly available. The Morph face database [45] is a large scale longitudinal face database with meta-data (gender, race and age) collected over a span of 8 years. The tagged racial labels are: African, European, Asian, Hispanic and Other. This database is not available free of charge; in addition, the database is unbalanced, as more than 70% of the samples are of African subjects.
The code was written in Python, and the CNNs were trained on Nvidia Tesla K40 GPUs using Google's TensorFlow framework [46]. Training the network took around 10 days.
We used the same training procedure for all the trained networks. The training is performed using the RMSprop optimizer [47], which divides the gradient by a running average of its recent magnitude. The batch size was set to 32, the RMSprop momentum to 0.9 and the decay to 0.9. The learning rate was initially set to 10^-2 and then exponentially decreased with a learning rate decay factor of 0.94. The learning rate is decreased every 2 epochs.
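The learning rate schedule can be written out as a short function. The sketch assumes a staircase decay, meaning that the rate stays constant within each 2-epoch interval and is multiplied by 0.94 at its end; whether the decay was applied in steps or continuously is not stated above.

```python
def learning_rate(epoch, base_lr=1e-2, decay=0.94, epochs_per_decay=2):
    """Staircase exponential decay: one decay step every `epochs_per_decay` epochs."""
    return base_lr * decay ** (epoch // epochs_per_decay)

# Learning rate used during the first six epochs (epoch index starts at 0).
schedule = [learning_rate(epoch) for epoch in range(6)]
```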

Evaluation Protocol
To evaluate the proposed solution, leave-one-subject-out cross-validation is used in order to see if the trained models generalize well to an independent data set. To gather the dataset, we collected several lists of famous persons belonging to each ethnicity class and automatically crawled the Internet for their images. For each downloaded image, we also stored the query used to obtain it. Based on this information, we were able to use the leave one subject out validation technique. We selected the names of some celebrities that were not used in the training process and the proposed solution is evaluated on these images. In total, we selected 10,000 facial images (2500 for each race category).
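A subject-disjoint split of this kind can be sketched as follows, assuming each sample is stored as a pair of an image path and the celebrity name taken from the search query. The function name, the 20% test fraction and the toy file names are hypothetical and only illustrate the idea that no person appears on both sides of the split.

```python
import random

def split_by_subject(samples, test_fraction=0.2, seed=0):
    """Split (image_path, subject_name) pairs so that no subject is in both sets."""
    subjects = sorted({name for _, name in samples})
    random.Random(seed).shuffle(subjects)
    n_test = max(1, int(len(subjects) * test_fraction))
    held_out = set(subjects[:n_test])  # subjects never shown to the network
    train = [s for s in samples if s[1] not in held_out]
    test = [s for s in samples if s[1] in held_out]
    return train, test

# Toy data: 5 subjects with 4 images each (made-up names).
samples = [(f"img_{i}.jpg", f"subject_{i % 5}") for i in range(20)]
train, test = split_by_subject(samples)
```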
Moreover, we also evaluated the trained networks on other publicly available datasets which were not used in the training process, LDHF [48,49] and CAS-PEAL [50], in order to study the impact of cross-dataset variance. The CAS-PEAL dataset is a large-scale Chinese database containing images of 1040 individuals (595 male and 445 female) with different variations (pose, expression, accessories and lighting). The LDHF database contains facial images of 100 subjects (70 males and 30 females), in both the visible (VIS) and near-infrared (NIR) spectrum, captured at distances of 60 m, 100 m, and 150 m outdoors and at a 1 m distance indoors.

Race Detection
In Table 1 we report the performance of our trained convolutional neural networks for the four-race classification problem. The networks attained similar performances: the best performance is obtained by Inception Resnet-v2 (96.36%). We report the well-known metrics: precision (or positive predictive value), recall (or true positive rate) and the F1 score, the harmonic mean of precision and recall:
$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},$$
where TP, FP and FN denote the numbers of true positives, false positives and false negatives.
As we deal with a multi-class classification problem, when computing the precision, recall and F1 scores, we report the results for each individual class label (the rows Afro-American, Asian, Caucasian and Indian), and we also aggregate the metrics globally (the row Overall) by counting the total true positives, false positives and false negatives. In all the networks, the Indian and Caucasian classes have lower precision rates. The confusion matrices of all networks for the race classification problem are reported in Table 2. The majority of confusions occur between Indian and Caucasian subjects, followed by Indian and African-American subjects. Some examples of correctly classified and misclassified images are depicted in Figure 4.
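The per-class and globally aggregated metrics can be derived from a confusion matrix as in the sketch below. It assumes that rows correspond to true classes and columns to predicted classes; the toy matrix contains made-up counts and does not reproduce the values of Table 2.

```python
import numpy as np

def per_class_metrics(confusion):
    """Precision, recall and F1 per class, plus globally aggregated values."""
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)           # correct predictions per class
    fp = confusion.sum(axis=0) - tp   # predicted as the class, but wrong
    fn = confusion.sum(axis=1) - tp   # members of the class that were missed
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    # Global aggregation: sum the counts over all classes first.
    overall_p = tp.sum() / (tp.sum() + fp.sum())
    overall_r = tp.sum() / (tp.sum() + fn.sum())
    overall_f1 = 2 * overall_p * overall_r / (overall_p + overall_r)
    return precision, recall, f1, (overall_p, overall_r, overall_f1)

# Toy 3-class confusion matrix (made-up counts, not the paper's results).
toy = [[8, 1, 1], [0, 9, 1], [2, 0, 8]]
precision, recall, f1, overall = per_class_metrics(toy)
```

When every image has exactly one label, each misclassification counts once as a false positive and once as a false negative, so the globally aggregated precision, recall and F1 all coincide with the accuracy.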

Robustness Analysis
There are several salient features of the face that influence the way humans perceive race: the iris texture, the whole peri-ocular region (eyelids, eyelashes and the canthus), the nostrils and the skin tone. We also performed some experiments in order to determine if the convolutional neural network "learned" and used the same classification features.
In early anthropometric studies, the skin color was used as a cue for race classification and anthropometry. In the 18th century, in the first edition of Systema Naturae [51], Carl Linnaeus proposed a four class taxonomy for humans based on the skin colour and continent. Another example of skin colour taxonomy is the von Luschan scale [52] which classifies the skin colour into 36 categories based on painted glass tiles. Nowadays, it is well known that human skin colour is determined by the melanin pigment, as a part of a natural process which controls the biochemical effects of ultraviolet radiation (UV) that penetrates the skin. As a result, a direct correlation between the geographical latitude (in other words, the UV radiation level) and the skin pigmentation can be established. The skin colour can greatly vary within the same ethnic group, so it should not be taken as a differentiating factor for ethnic or racial groups.
The first comprehensive and broad study on cranio-facial anthropometric measurements and the comparison of facial variations between races and gender is presented in [53]. The authors compared fourteen linear facial measurements of North American White (NAW) subjects with measurements of subjects from other countries in the world. As anticipated, there were few differences between the NAW and the Caucasian groups from Europe; the nose height of Caucasian was the measurement that seldom differed from that of the reference group. The measurements from the periocular region were found to be "one of the cranio-facial areas most exposed to visual judgement": subjects from Asia and Middle East have a significantly larger intercanthal and binocular width, while the eye fissure length is much smaller than the one of the NAWs.
A more recent study [54] reviewed and collated numerous facial photogrammetric studies from the specialized literature in order to determine inter-ethnic and racial variations of various angular and linear facial measurements. The main inter-racial angular differences found by the study are: African males have a smaller naso-frontal angle compared to Caucasian males and a larger naso-facial angle than Asian males. The naso-labial angle is more obtuse in Caucasian females than in African and Asian females. Regarding the linear facial measurements, Caucasian females have, on average, a shorter facial height and a smaller face width compared to Asian females.
The straightforward approach to investigate the sensitivity of the network towards the facial features is to iteratively mask the input test image with an occluder object (a gray rectangle) and plot the probability of the ground truth class as a function of the occluder position. Figure 5 shows the occluding process and, as heat maps, the probabilities obtained after occluding images belonging to each one of the race classes. For the Asian class, the peri-ocular region has a big impact on the classification result: if the eye area is masked, the subject is considered a Caucasian subject. This is explainable, as the inner canthus region is essential in differentiating between Caucasians and Asians.
For the African-American class the masking process does not cause confusion with another class; this could be a clue that, in this case, the chromatic information is used as a cue in the classification problem. The same behaviour is observed for the Caucasian class. Finally, in the Indian case, the intra-ocular and the nose regions seem to have a large impact on the classification problem.
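The occlusion procedure can be sketched as a sliding gray patch. In this minimal example the patch size, the stride and the `toy_model` function are assumptions: `toy_model` is a hypothetical stand-in for a trained CNN that returns class probabilities, and it is built so that only the top-left corner of the image matters.

```python
import numpy as np

def occlusion_map(image, predict_proba, true_class, patch=16, stride=8, fill=128):
    """Probability of `true_class` for each position of a gray occluder."""
    h, w = image.shape[:2]
    rows = (h - patch) // stride + 1
    cols = (w - patch) // stride + 1
    heat = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            occluded = image.copy()
            top, left = i * stride, j * stride
            occluded[top:top + patch, left:left + patch] = fill  # gray rectangle
            heat[i, j] = predict_proba(occluded)[true_class]
    return heat

def toy_model(image):
    """Hypothetical classifier: class 0 score is the brightness of the top-left corner."""
    score = image[:16, :16].mean() / 255.0
    return np.array([score, 1.0 - score])

image = np.full((64, 64, 3), 255, dtype=np.uint8)  # stand-in for a face image
heat = occlusion_map(image, toy_model, true_class=0)
```

Low values in the resulting map mark the occluder positions that hurt the prediction most, which is how regions such as the eyes would show up for a real network.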
In order to numerically express the sensitivity of the network towards certain facial features, several transformations (occlusions, blurs, color alterations) were applied on the test images in order to degrade the most prominent features of the face. The modified images are then fed to the CNN and the results are re-examined.
First, 68 facial landmarks were localized on the face using the dlib framework [55]. Based on the position of these features, different rectangular regions of interest (ROIs) on the face were established and masked/blurred.
The occlusion operation corresponds to a simple overlay of a grey rectangle (the red, green and blue color components are set to 128) over the ROI. The main disadvantage of this operation is that it also introduces strong edges around the area of interest. In order to overcome this issue, a radial blur operation is proposed: the region of interest is strongly blurred in its center and the blur factor decreases with the distance from the center.
To implement the radial blur, the area of interest is blurred with a strong Gaussian filter and a radial gradient mask is generated based on the distance from the region's center; the original and the blurred ROIs are blended according to the mask pixels in order to obtain the radial blur (Figure 6). The following alterations were performed (as illustrated in Figure 7):
• Blur/occlusion of the eye region: the eyes were masked/blurred in order to determine their importance in the race classification problem.
• Blur/occlusion of the nose region: the nostril region was masked/blurred in order to determine its importance in the race classification problem.
• Blur/occlusion of the mouth region: the mouth region was masked/blurred in order to determine its importance in the race classification problem.
• Grayscale conversion: this transformation is performed in order to determine the importance of the chromatic information.
The results of this experiment are reported in Table 3. The periocular region seems to have the highest impact on the classification performance: the overall accuracy decreases by 5.99% when the eyes are blurred and by 15.62% when the eyes are totally occluded. The mouth and the nose area seem to have a lower importance. The accuracy on the Asian and Caucasian classes is mostly impaired when altering the eye area, because the canthus is essential for differentiating Caucasians from other races.
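The radial blur described above can be sketched in NumPy as follows. The linear fall-off of the mask, the value of sigma and the edge padding are assumptions, since only the general blending scheme is given; the Gaussian filter is implemented directly so that the example depends on NumPy alone.

```python
import numpy as np

def gaussian_blur(image, sigma):
    """Separable Gaussian blur of an H x W x C float image (edge-padded)."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x**2 / (2.0 * sigma**2))
    kernel /= kernel.sum()
    out = image
    for axis in (0, 1):  # filter the rows, then the columns
        pad = [(0, 0)] * out.ndim
        pad[axis] = (radius, radius)
        padded = np.pad(out, pad, mode="edge")
        out = np.apply_along_axis(
            lambda v: np.convolve(v, kernel, mode="valid"), axis, padded)
    return out

def radial_blur(roi, sigma=5.0):
    """Blend a blurred ROI with the original: full blur at the center, none at the border."""
    roi = roi.astype(np.float64)
    h, w = roi.shape[:2]
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    yy, xx = np.mgrid[0:h, 0:w]
    # Normalized distance: 0 at the center, at least 1 on the ROI border.
    dist = np.hypot((yy - cy) / cy, (xx - cx) / cx)
    mask = np.clip(1.0 - dist, 0.0, 1.0)[..., None]  # assumed linear fall-off
    return mask * gaussian_blur(roi, sigma) + (1.0 - mask) * roi

rng = np.random.default_rng(0)
roi = rng.integers(0, 256, size=(21, 31, 3)).astype(np.float64)  # stand-in ROI
blurred_roi = radial_blur(roi)
```

Because the mask vanishes on the border of the ROI, the blended region joins the rest of the image without the strong edges introduced by the gray rectangle.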
These observations are consistent with the way humans perceive the facesstudies [56,57] revealed the importance of the periocular region followed by the mouth and then the nose in human face perception and recognition. Other studies suggest that the eyebrows could be even more important than the eyes [58], due to their role in non-verbal communication and because they are large, high-frequency facial features.
As stated in [58], the geometrical relationship between the facial parts is at least as important as the appearance of each facial feature, and although in some cases, features alone are sufficient for face recognition, "the geometric relationship between each feature and the rest of the face can override the diagnosticity of that feature". This could be an explanation of why the detection performance does not drastically decreases when independently masking facial parts.
Chromaticity is known to play an important role in the human visual system, but studies reveal that color is not the primary trait when it comes to face identification [59]. Moreover, people are better at encoding facial information of their own race than of other races. Our CNN-based classifiers seem to behave similarly, and chromaticity does not seem to have a large impact: when the images are converted to grayscale, the overall performance decreases by only 1.9%.
In today's context of globalization, racial mixing between humans is accelerating, and a discrete identification of race becomes much more difficult than that of other demographic traits that can be extracted from the face (age, gender, eye color etc.). Biracial/multiracial refers to individuals belonging or related to two/many races.
Although the networks were not trained on multi-racial subjects, we also evaluated the performance of the proposed solution on multi-racial subjects in order to determine whether the races to which the subject is related receive a higher prediction score.
For this task we downloaded from the Internet images of subjects with multiracial descent [60].
The average probability of the correct class is 98.07% with a standard deviation of 6.36%. Little research has been done in this field and a sufficiently large-scale database for the problem of multi-racial classification does not yet exist.

Comparison to the State of the Art
Recently, several other works tackled the problem of race [11,12] and ethnicity [12,13] classification using convolutional neural networks. The obtained results compared to the state of the art are summarized in Table 4. However, a direct numerical comparison with all previous works is not possible because some methods were trained and evaluated on private databases that do not seem to be available anymore. Moreover, because the race and ethnicity classification problem is fluid and not rigorously defined, different taxonomies are often used. In [11] the authors propose a deep learning framework for extracting soft-biometric information (gender and ethnicity) from near-infrared (NIR), long-range, night-time face images and from visible (VIS) images. However, their method requires face normalization: the eye centers must be located manually and their positions are used to generate the canonical images by applying an affine transformation. In addition, the racial taxonomy contains only the Asian and Caucasian classes. The method was tested on images captured by the authors and on the LDHF [48,49] dataset.
To compare with [11], we also tested the proposed method on the LDHF dataset, on both the visible and the near-infrared images, even though the proposed method was trained only with VIS images. As the database contains only Asian subjects, we consider two output classes: Asian and Non-Asian (i.e., Afro-American, Caucasian and Indian). In other words, we transform the output of the trained network into a binary classification problem with the classes Asian vs. Non-Asian: if the network predicts one of the labels Afro-American, Caucasian or Indian, the prediction is considered to be Non-Asian; otherwise, if the network predicts the Asian class, the label is left unchanged.
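The mapping from the four-class output to the Asian vs. Non-Asian problem can be sketched as below. This is an illustrative sketch only: the class order in `CLASSES`, the function names and the toy probabilities in the usage note are assumptions and do not come from the trained networks.

```python
import numpy as np

# Assumed order of the network outputs (hypothetical; the real order
# depends on how the labels were encoded during training).
CLASSES = ("Afro-American", "Asian", "Caucasian", "Indian")

def to_binary_label(label):
    """Keep 'Asian' unchanged; map every other race label to 'Non-Asian'."""
    return "Asian" if label == "Asian" else "Non-Asian"

def binary_accuracy(probabilities, true_binary_labels):
    """Accuracy of the collapsed Asian vs. Non-Asian predictions.

    probabilities: array of shape (n_images, 4) with per-class scores.
    true_binary_labels: sequence of 'Asian' / 'Non-Asian' strings.
    """
    probs = np.asarray(probabilities, dtype=np.float64)
    # Take the top-scoring four-class label first, then collapse it.
    predicted = [CLASSES[i] for i in probs.argmax(axis=1)]
    collapsed = [to_binary_label(p) for p in predicted]
    correct = sum(c == t for c, t in zip(collapsed, true_binary_labels))
    return correct / len(true_binary_labels)
```

For example, with three made-up score vectors whose top classes are Asian, Afro-American and Asian, and with all three subjects being Asian, `binary_accuracy` returns 2/3. Note that the four-class decision is taken first and only then collapsed; summing the probabilities of the three Non-Asian classes before the decision would be a different rule.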
In Table 5 we report the classification accuracy of the proposed solution on all the subsets of the LDHF database (Asian vs. Non-Asian classes). For the near-infrared spectrum, even though the network was not trained on near-infrared images, the method obtains good accuracy on images captured at short distances. For images captured outdoors, in near-infrared and at large distances, the performance decreases drastically. However, the proposed method is not intended for the near-infrared case, and we argue that on these images ethnicity recognition is very challenging even for a human (Figure 8).
In [12], the authors trained a 5-layer convolutional neural network (3 convolutional layers and 2 fully connected layers) for several race and ethnicity scenarios: Black vs. White, Chinese vs. Non-Chinese, and Han, Uygurs and Non-Chinese. The method requires face alignment and normalization. For the Chinese vs. Non-Chinese case, the authors tested their method on 542 images randomly selected from the CAS-PEAL database and obtained an accuracy of 99.81%. We tested the trained Inception Resnet-v2 network on all the frontal pose subsets of CAS-PEAL (8658 images). Under normal image capturing conditions, the network achieved an accuracy of 99.61%, a value comparable to that of [12]. The worst results are obtained on the Lighting subset, in which the subjects are illuminated by a fluorescent light source located at different azimuth and elevation coordinates. In extreme cases, the subject's face is in [total] darkness and it is quite challenging to distinguish his/her facial features.
In [61], the authors fine-tuned several convolutional neural networks for classifying Chinese, Japanese and Korean subjects. They gathered a dataset of 39883 images annotated with these three ethnicity labels; the best accuracy (75.03%) is obtained by the Resnet network [34]. For the problem of ethnicity classification between Asian subjects we obtained an accuracy of 77.87%. A similar approach is proposed in [13]: the authors fine-tuned the VGG network for classifying the ethnicity of Indian subjects as South Indian or North Indian. They obtained 62% accuracy. We cannot compare to this method, as there is no intersection between the proposed taxonomies. As opposed to the other methods from the literature, we proposed a network that is trained to distinguish between four race classes at once (Afro-American, Asian, Caucasian and Indian); moreover, the method is able to work with "in the wild" images and it does not require any face normalization.

Conclusions
The human face is one of the most important visual stimuli for humans, as it encodes information about one's emotional state and demographics (race, ethnicity, age, gender etc.). In this paper, we built an annotated dataset of facial images and proposed a deep learning approach for automatic human race and ethnicity detection from facial images; four state-of-the-art convolutional neural networks were fully trained to differentiate between the following racial classes: African-American, Asian, Caucasian and Indian.
To train the networks, we gathered over 175,000 facial images from the Internet and used four independent human subjects to label the images with race information. The training database is made publicly available. To the best of our knowledge this is the largest publicly available, free and balanced face database annotated with race information.
The networks were evaluated both on images from our database, using leave-one-subject-out cross-validation in order to determine whether they generalize well to new images, and on images from publicly available datasets, which were not used in the training process. The average prediction time is 70 milliseconds.
The best results are obtained by the Inception Resnet-v2 network (96.36% accuracy). In order to compare with the state of the art, we also assessed the accuracy of our classifiers on near-infrared images, even though they were not trained for such use cases. The networks attained surprisingly good results in indoor conditions (100% accuracy for distinguishing between Asian and non-Asian subjects). Lower accuracies were obtained for outdoor near-infrared images, but we argue that this is a difficult task even for a human. Moreover, this use case was outside the scope of the current method.
We employed various visualization techniques (plotting the probability distribution function as a function of an occluding object) in order to determine the facial features to which the network is sensitive. In addition, we altered the test images such that the most prominent parts of the face are occluded or blurred. The numerical results revealed the importance of the periocular region followed by the mouth; this pattern is consistent with the way humans process and recognize faces. Another experiment involved the grayscale conversion of the test images and chromatic alterations (increasing and decreasing the brightness). It was determined that the chromatic information is not crucial to the classification performance. This idea is also supported by the fact that the network performed well on near-infrared images. Psychological studies also show that, for humans, color information is important only when facing degraded images.
In conclusion, we proposed a human race classification system from facial images which is illumination invariant. The experimental results show that the proposed method is comparable to or better than the state of the art. The visualization experiments we performed show that the way the network "perceives" the human face is similar to the way humans perceive faces. Using fine-tuning, the network can be used to distinguish between ethnic groups.
The topic of race is often the target of ethical criticism, as race is perceived as a factor of social and political discrimination. Recent psychological and anthropological studies state that there is no scientific basis for any claim supporting hierarchical human categories of race and ethnicity based on patterns of human genetic variation, and we share this view. However, being able to scientifically discern and infer the source race of an individual or of groups of individuals could help towards a better cultural and social understanding of human relations. In the end, any scientific discovery has the potential to be used for good or for bad. That choice belongs to humanity.
As future work, we plan to gather more training data such that the system can classify between seven human races (African/African American, Caucasian, East Asian, Native American/American Indian, Pacific Islander, Asian Indian and Hispanic/Latino). In addition, we plan to address the problem of multi-racial subjects.