Measuring Embedded Human-Like Biases in Face Recognition Models

Abstract: Recent work in machine learning has focused on understanding and mitigating bias in data and algorithms. Because pre-trained models are trained on large real-world datasets, they are known to learn implicit biases that humans have unconsciously built up over a long time. However, there has been little discussion of social biases in pre-trained face recognition models. This study therefore investigates the robustness of these models against racial, gender, age, and intersectional bias. We also examine racial bias toward an ethnicity other than white and black: Asian. Specifically, we introduce the Face Embedding Association Test (FEAT) to measure social biases in the image vectors of faces of different race, gender, and age. FEAT measures social bias in face recognition models under the hypothesis that a specific group is more likely to be associated with a particular attribute in a biased manner. The presence of these biases within DeepFace, DeepID, VGGFace, FaceNet, OpenFace, and ArcFace critically undermines fairness in our society.


Introduction
Recent advances in machine learning allow computer vision researchers to use massive datasets from the web to train models with general-purpose image representations, for tasks ranging from face recognition to image classification [1,2]. However, failing to scrutinize those datasets can disproportionately harm racial and ethnic minorities as well as other vulnerable individuals [3]. Without the necessary precautions against these problematic narratives, image classification and labeling practices can entail stereotypes and prejudices [4,5]. Machine learning models trained on such datasets may reinforce and normalize these stereotypes, inflicting unprecedented harm on those who are already at the margins of our society.
Therefore, it is essential to understand how datasets are sourced and labeled, and what representations the models are trained on. One common measure, the Word Embedding Association Test (WEAT), is used to assess undesirable associations in word embeddings [6]. WEAT has been used to show that humans and natural language processing models exhibit many of the same biases with similar significance. For instance, WEAT reveals racial bias in the word vector space by quantifying the close relations between pleasant words and European American names and between unpleasant words and African American names. Ross et al. [7] extend this work with a metric that operates on the interaction between vision and language embeddings to measure biases in social and cultural concepts, such as race. We extend these prior works with a metric, which we term the Face Embedding Association Test (FEAT), to probe race, gender, and age biases in the embeddings of pre-trained face recognition models. Unlike previous measurements that quantify bias within the facial image representation itself [8,9], our measurement evaluates associations between pairs of semantic categories, resembling the implicit attitudes underlying human cognitive priming procedures [10]. That is, FEAT measures the models' automatic associations as if estimating humans' stereotypical discrimination toward social categories, represented by associations between a target and an attribute dimension. In addition, a strong advantage of FEAT is that it can be extended to additional discrimination tests, making it adaptable for assessing a wide range of biases in our society.
By taking advantage of the extensibility of FEAT, we assess social biases toward a relatively unexplored racial group. Few studies have measured biases across various races; most have focused only on white and black ethnicity. It is a significant oversight to disregard ethnic group differences within a racial category, which is another common form of discrimination experienced not only by Asian people but by other racial groups as well [11]. Understanding the nuances in how different groups of people are affected by their ethnicities represents the next step in advancing this field of study. Thus, we take this step by asking whether the models are significantly affected by biases toward racial groups other than white and black. To achieve this goal, we employ face images of European American (EU), African American (AF), and Asian American (AS) people. Moreover, we measure the interaction between racial and gender biases, namely the prevalent stereotype that Asian women are submissive and incapable of becoming leaders [12]. In short, our contributions are:
• We introduce FEAT to measure racial, gender, age, and intersectional bias in face recognition models with images.
• We find statistically significant social biases embedded in pre-trained DeepFace [13], DeepID [14], VGGFace [15], FaceNet [16], OpenFace [17], and ArcFace [18].
• Our new dataset and implementations are publicly available (https://github.com/sange1104/face-embedding-association-test, accessed on 28 February 2022).

Related Work
Bias mitigation methods can be broadly divided, according to the stage of the model pipeline they target, into pre-processing, in-processing, and post-processing [19]. The most widely used pre-processing technique is to re-balance datasets [20,21] or to use synthetic data [22]. The datasets used in face recognition tasks have been shown to have an imbalanced class distribution in both gender and race [23]. To address this problem, previous studies have proposed several datasets balanced in gender, ethnicity, and other attributes, including Racial Faces in the Wild [24], Balanced Faces in the Wild [25], and DiveFace [26]. Although these datasets help mitigate skewed distributions, they do not demonstrate that training on them leads to impartial results, because the ethnicity labels in the datasets are not widely accepted as ground truth and depend heavily on the annotator's decision [27]. This motivates researchers to develop in-processing and post-processing methods.
In-processing approaches use several methods to remove partiality during training. For example, cost-sensitive training and adversarial learning are used to remove sensitive information from the learned features [20,21]. Moreover, adjusting the parameters of loss functions and training in an unsupervised way are used to protect minorities by training models with unbiased representations [26,28]. Examples of post-processing techniques include re-calibrating the similarity scores of two feature vectors based on the demographic groups of the images [29] and attaching layers to the feature extractor to remove sensitive information from the representation [26].
Along the same line, a growing number of measures have appeared to assess the effectiveness of the mitigation approaches. In the natural language processing field, various tests have been proposed to quantify bias in pre-trained word embedding models. Bolukbasi et al. [30] and Manzini et al. [31] employed word analogy tests and demonstrated undesirable bias toward gender, racial, and religious groups in word embeddings.
Moreover, Nadeem et al. [32] present a new evaluation metric that measures how close a model is to an idealistic model, showing that word embeddings contain several stereotypical biases.
Although less work has measured bias in computer vision than in text, several approaches examine embedded bias in visual recognition tasks. Acien et al. [33] investigate to what extent sensitive data such as gender or ethnic origin attributes are present in face recognition models. Wang et al. [34] propose a set of measurements of the encoded bias in vision tasks and demonstrate that models amplify the gender biases of an existing dataset. Furthermore, recent studies focus on generative models to explore biases in face classification systems [22,35].
One widely used method to examine bias is to evaluate the representations produced by the model [6,36], as they can be readily used as a tool to analyze human bias [37,38]. To analyze implicit bias, WEAT [6] calculates word associations between target words and attribute words. Replacing words with sentences, the Sentence Encoder Association Test (SEAT) applies WEAT to measure biases in sentence embeddings [39]. Moreover, recent studies generalize WEAT to contextualized word embeddings and investigate gender bias in contextual word embeddings from ELMo [40,41]. Steed and Caliskan [1] adapt WEAT to the image domain to evaluate embedded social biases. However, to our knowledge, there are no principled tests for measuring bias toward diverse racial subgroups, especially Asians, in face recognition models. Our work aims to generalize WEAT to facial image embeddings in order to examine social biases toward a wide range of subgroups in pre-trained face recognition models.

Face Embedding Association Test
Existing bias measures in natural language processing assess the bias of words or sentences based on the Implicit Association Test administered to humans [6,42,43]. We introduce the Face Embedding Association Test (FEAT) by extending these prior works to face embeddings. The details of FEAT are as follows.
FEAT uses sets of face images, rather than sets of words or sentences, to represent race and gender. Two sets of face images, X and Y, denote two target races and are of the same size, while A and B are two sets of attribute images. For example, as in Figure 1, a face image x represents EU, while y represents AS; a is an example of the career attribute images A, and b is an example of the family attribute images B. The indicator of bias is based on the average cosine similarity between the embeddings of pairs of images. Equation (1) measures the association of a target face image f with the two attribute sets:
$$s(f, A, B) = \operatorname{mean}_{a \in A} \cos(f, a) - \operatorname{mean}_{b \in B} \cos(f, b) \qquad (1)$$
where the function s measures how close the embedding of face image f is, on average, to the attribute set A compared with the attribute set B. A higher proximity of f to A than to B indicates that f is more closely related to the concept represented by A. All target face images (i.e., X and Y) can then be used to measure the bias in the vector space. Bias is defined as one of the two target sets being significantly closer to one set of attribute images than the other target set is. For example, a social bias is present when one of the target sets, EU or AS, is significantly closer to the concept of career than to family. The following equation, s(X, Y, A, B), measures the differential association of the two sets of target images with the attributes:
$$s(X, Y, A, B) = \sum_{x \in X} s(x, A, B) - \sum_{y \in Y} s(y, A, B) \qquad (2)$$
To compute the significance of the association between (X, Y) and (A, B), a permutation test on s(X, Y, A, B) is used:
$$p = \Pr_i \left[ s(X_i, Y_i, A, B) > s(X, Y, A, B) \right] \qquad (3)$$
where the probability is computed over the space of partitions (X_i, Y_i) of X ∪ Y such that X_i and Y_i are of the same size. The effect size, a normalized difference of the means of s(f, A, B), is used to measure the magnitude of the association:
$$d = \frac{\operatorname{mean}_{x \in X} s(x, A, B) - \operatorname{mean}_{y \in Y} s(y, A, B)}{\operatorname{std}_{f \in X \cup Y} s(f, A, B)} \qquad (4)$$
This normalized measure indicates how separated the two distributions of associations between the targets and the attributes are. That is, a larger effect size indicates a larger differential association.
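As an illustration, the four quantities described above (the per-image association, the differential association, the permutation p-value, and the effect size) can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' released implementation: embeddings are assumed to be plain lists of floats, the function names are hypothetical, the sample standard deviation is assumed for the normalization, and the permutation test is approximated with random re-partitions rather than an exhaustive enumeration.

```python
import math
import random
from statistics import mean, stdev


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def association(f, A, B):
    """s(f, A, B): mean similarity of f to A minus mean similarity of f to B."""
    return mean(cosine(f, a) for a in A) - mean(cosine(f, b) for b in B)


def differential_association(X, Y, A, B):
    """s(X, Y, A, B): summed association of X minus summed association of Y."""
    return (sum(association(x, A, B) for x in X)
            - sum(association(y, A, B) for y in Y))


def effect_size(X, Y, A, B):
    """d: difference of mean associations, normalized by the std over X and Y.

    Assumption: the sample standard deviation (n - 1) is used here.
    """
    sx = [association(x, A, B) for x in X]
    sy = [association(y, A, B) for y in Y]
    return (mean(sx) - mean(sy)) / stdev(sx + sy)


def permutation_p_value(X, Y, A, B, n_perm=1000, seed=0):
    """One-sided p-value estimated from random equal-size re-partitions."""
    rng = random.Random(seed)
    scores = [association(f, A, B) for f in X + Y]  # computed once, then shuffled
    n_x = len(X)
    observed = sum(scores[:n_x]) - sum(scores[n_x:])
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(scores)
        permuted = sum(scores[:n_x]) - sum(scores[n_x:])
        if permuted > observed:
            exceed += 1
    return exceed / n_perm
```

With toy two-dimensional embeddings in which X points toward A and Y points toward B, `effect_size(X, Y, A, B)` is positive and swapping X and Y flips its sign, which matches the interpretation of d as a signed differential association.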

Face Recognition Models
To evaluate the robustness of the models toward social biases, we employed popular pre-trained face recognition models. All the models are widely used in real-world applications, where the models learn to produce embeddings based on the implicit patterns in the entire training set of image features. Moreover, with different structures of multiple hidden layers, each model learns a different level of abstraction [1]. We extracted image representations from the last layer of each model, where each model encodes a different set of information. The details of each model are given below.
DeepFace. DeepFace is a face recognition model based on a deep neural network. DeepFace uses a pre-trained three-dimensional face geometry model to perform face alignment with affine transformations after landmark extraction, and then learns a feature representation with a nine-layer neural network. This model is trained on the Social Face Classification (SFC) dataset, which consists of 4.4 million face images.
DeepID. DeepID is one of the well-known face recognition models. DeepID learns a set of high-level feature representations through deep learning, referred to as deep hidden identity features. This model is trained on the CelebFaces+ dataset and achieved a state-of-the-art score on the Labeled Faces in the Wild (LFW) dataset (http://vis-www.cs.umass.edu/lfw/, accessed on 1 December 2021) [44,45].
VGGFace. VGGFace is a very deep CNN model with a VGG16 architecture that employs 15 convolutional layers. VGGFace is trained on the VGG Face dataset, a large-scale collection of face images gathered from Internet image searches. This dataset contains over 2.6 million images of 2622 celebrities.
FaceNet. FaceNet is another face recognition model, which returns 128-dimensional face feature representations. To achieve better performance, FaceNet measures face similarity by mapping face images to a compact Euclidean space. The model uses a triplet loss to optimize the weights of the deep convolutional layers. This model was pre-trained on the Microsoft Celebrity dataset (MS Celeb) (https://megapixels.cc/msceleb/, accessed on 1 December 2021).
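To make the triplet loss mentioned above concrete, the following minimal sketch computes it for a single triplet of embeddings. It assumes the commonly used formulation with squared Euclidean distances and a margin of 0.2; the function names and the plain-list embedding format are hypothetical and are not taken from the FaceNet code.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on the gap between anchor-positive and anchor-negative distances.

    The loss is zero once the negative is farther from the anchor than the
    positive by at least `margin` (an assumed default value).
    """
    gap = squared_distance(anchor, positive) - squared_distance(anchor, negative)
    return max(0.0, gap + margin)
```

The loss vanishes when the same-identity pair is already closer than the different-identity pair by the margin, so only violating triplets contribute to the weight updates.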
OpenFace. OpenFace is an approximate version of FaceNet. With 3.7 million parameters, it is more frequently adopted in the face recognition field. The model is trained on 500k images obtained by combining two labeled face recognition datasets, CASIA WebFace [46] and FaceScrub [47].
ArcFace. ArcFace is a face recognition model that learned features from the CASIA [46], VGGFace2 [48], ms1m-arcface, and DeepGlint-Face (http://trillionpairs.deepglint.com/overview, accessed on 1 December 2021) datasets. This model proposes a new loss function, Additive Angular Margin Loss, which uses the arc-cosine function to calculate the angles between the input features and the target weights.
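The additive angular margin idea can be sketched as follows: the angle between a feature and each class weight vector is recovered with the arc-cosine, a margin is added to the angle of the ground-truth class only, and the cosines are rescaled into logits that would then feed a softmax cross-entropy. This is a minimal sketch rather than the ArcFace implementation; the function names are hypothetical, and the scale of 64 and margin of 0.5 are assumed default values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def angular_margin_logits(feature, class_weights, target, scale=64.0, margin=0.5):
    """Return scaled logits with an angular margin added to the target class."""
    logits = []
    for j, weight in enumerate(class_weights):
        # Clamp to [-1, 1] so rounding error cannot break acos.
        cos_theta = max(-1.0, min(1.0, cosine(feature, weight)))
        theta = math.acos(cos_theta)
        if j == target:
            theta += margin  # penalize only the ground-truth class angle
        logits.append(scale * math.cos(theta))
    return logits
```

Because the margin lowers the logit of the correct class, the network has to pull features closer to their class weight in angle to reach the same loss, which is what encourages more compact classes.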

Dataset
To measure the social biases in face embeddings, we compared the closeness between target images and attribute images. For target images, we used the UTKFace dataset (https://susanqq.github.io/UTKFace/, accessed on 1 December 2021), which consists of 24,190 face images cropped to 200 × 200 pixels with diverse demographic profiles. To measure racial bias in face recognition models, we randomly selected 3434 images from each of the EU, AF, and AS groups, which is the size of the smallest of the three categories. For the attribute images, we combined images from Ross et al. [7] with top-ranked hits on Google Images. As we additionally examined racial bias toward Asian Americans, we collected the same attribute images for Asians as for the other racial groups. Specifically, we used search queries of the form "Asian, Attribute" to obtain images from a search engine in line with our interest. To measure gender bias, 5244 male and 5058 female images were employed. For the attribute images, we used images from Ross et al. [7].
A similar approach was used to collect data for measuring age bias. We categorized individuals between 19 and 50 years old as young adults and those over 60 as old adults [49]. Following this, we randomly selected 851 face images for each of the young and old adult groups from the UTKFace dataset. For the attribute images, we crawled images from Google Images by adapting the search rule used in the gender query.
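The grouping and equal-size sampling described above can be sketched as follows. This is only an illustration under assumptions: the helper names are hypothetical, the 19 to 50 range is treated as inclusive, "over 60" is treated as strictly greater than 60, and each group is assumed to be a plain list of image identifiers.

```python
import random


def age_group(age):
    """Map an age to "young" (19 to 50, assumed inclusive) or "old" (over 60).

    Ages outside both ranges return None and are left out of the age test.
    """
    if 19 <= age <= 50:
        return "young"
    if age > 60:
        return "old"
    return None


def balanced_sample(groups, seed=0):
    """Randomly draw the same number of items from every group.

    The sample size is the size of the smallest group, mirroring the
    equal-size target sets used for the bias tests.
    """
    rng = random.Random(seed)
    n = min(len(items) for items in groups.values())
    return {name: rng.sample(items, n) for name, items in groups.items()}
```

Fixing the seed makes the random subsets reproducible, which matters when the same target sets are reused across several attribute pairs.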
In order to measure an intersectional bias in the face recognition models, we employed 1515, 1684, and 1859 images of European American Female, African American Female, and Asian American Female, respectively. To analyze a specific stereotype regarding the incompetence of Asian Female, we employed images from the "Competent" and "Incompetent" attributes. Detailed statistics of the collected dataset are described in Table 1.
Table 1. The statistics of the dataset used in our paper. To measure racial bias, targets are EU, AF, and AS, while attributes are Career/Family, Pleasant/Unpleasant, Likable/Unlikable, and Competent/Incompetent. For the gender bias test, targets are Male and Female, while attributes are the same as in the racial bias test. In the age bias measure, targets are young and old, while attributes are also the same as in the gender bias test. To measure gendered racism, the most common stereotype of Asian Female (ASF) having the Incompetent attribute, we sorted out images of each racial group with a certain gender (i.e., European American Female (EUF) and African American Female (AFF)) and attribute (i.e., Competent/Incompetent).

Experiments and Results
In this paper, we validate FEAT in line with previous studies [1,6,7] to measure social biases based on the human Implicit Association Test (IAT) [10] with face image stimuli. FEAT aims to measure the biases embedded during pre-training by systematically comparing the relative associations of image embeddings. We present four tests to measure racial, gender, age, and intersectional bias:
1. Race test, in which two target race concepts are tested for association with a pair of stereotypical attributes (e.g., "European American" vs. "Asian American", "Pleasant" vs. "Unpleasant").
2. Gender test, in which a gender pair (male/female) is tested for association with the same attribute pairs as in the race test.
3. Age test, in which an age pair (young/old) is tested for association with the same attribute pairs.
4. Intersectional test, which we term gendered racism, to measure a well-known stereotype toward Asian Female: "Asian women are considered as incompetent; not a leader, submissive, and expected to work at a low-level gendered job" [12].
In line with the human IATs, we find several significant racial biases, gender stereotypes, age biases, and an intersectional bias shared by pre-trained face recognition models.

Experiment 1: Do Face Recognition Models Contain Racial Biases?
We first present a racial bias test in which the targets have different ethnicities: European American, African American, and Asian American. For the attributes, we replicate the same concepts as the original IATs [10]. We adapted sets of attribute pairs, which include Career/Family, Pleasant/Unpleasant, Likable/Unlikable, and Competent/Incompetent, into images. In this experiment, we hypothesized that European Americans would be more strongly related than the other groups to the first attribute of each pair, namely career, pleasantness, likability, and competence, in line with previous studies [1,6,7,50]. To validate this assumption, we measured the association of races with attributes using FEAT. For example, we calculated s(EU, AF, Career, Family) to compare the relative distance between the vectors of the target sets, EU and AF, and career attributes such as "business" and "ceo" versus family-related attributes such as "children" and "home".
Effect sizes and p-values from permutation tests with 100,000 permutations are reported for each racial bias measurement in Table 2. As we hypothesized, EU is more likely than the other racial groups to be related to the attributes career and pleasant in all models. In particular, large effect sizes indicate a strong bias in which EU faces are associated with pleasantness and AF faces with unpleasantness (VGGFace: d = 0.939, p < 10^{-4}; FaceNet: d = 1.081, p < 10^{-4}). Moreover, EU is significantly associated with the attribute likable when embeddings are extracted from any model except VGGFace.
On the other hand, the differential associations of EU vs. AS images with the attributes show less pronounced biases. Even where the associations differ significantly, the effect sizes are below 0.5, which is considered a small magnitude of bias. Meanwhile, regardless of the race of the counterpart, OpenFace and ArcFace present an inherent bias in which EU is significantly more related to the concepts of career, pleasant, likable, and competent (p < 10^{-4}).

Experiment 2: Do Face Recognition Models Contain Gender Stereotypes?
This experiment measures gender biases in the pre-trained face recognition models. Concretely, the target is a gender pair (i.e., male/female) and the attributes are the same as those employed in the racial bias test. To examine gender stereotypes, we calculated the association as s(Male, Female, Career, Family), which measures the relative association of the category men with career attributes and the category women with family-related attributes. We hypothesized that male faces would be more highly associated with concepts such as career and competence than with the opposing attributes. To examine the magnitude of the gender biases in the models, we quantified the effect size and p-value as described above.

Experiment 3: Do Face Recognition Models Contain Age Stereotypes?
This experiment explores whether face recognition models reproduce stereotypes toward a particular age group, such as the belief that the elderly are slow, incompetent, and forgetful [52,53]. To measure age bias, we replicated the same attributes as in the racial and gender bias tests. Specifically, the target is an age pair (i.e., young/old) and the attributes are the pairs Career/Family, Pleasant/Unpleasant, Likable/Unlikable, and Competent/Incompetent. One possible stereotype is that young adults are more likely to be associated with the concepts of career and competence than with the opposing attributes. As in the preceding experiments, effect sizes and p-values are quantified to examine the magnitude of stereotypes toward each age group.
The results in Table 4 show that DeepID, VGGFace, OpenFace, and ArcFace present age biases. That is, young people are associated with the attributes pleasant (VGGFace: d = 1.406, p < 10^{-4}; OpenFace: d = 0.551, p < 10^{-4}), likable (DeepID: d = 0.290, p < 10^{-4}; VGGFace: d = 1.222, p < 10^{-4}; OpenFace: d = 0.431, p < 10^{-4}; ArcFace: d = 0.509, p < 10^{-4}), and competent (VGGFace: d = 1.046, p < 10^{-4}; OpenFace: d = 0.225, p < 10^{-4}). In particular, VGGFace shows age-biased representations for all four attributes. Moreover, its effect sizes d for three attribute pairs, Pleasant/Unpleasant, Likable/Unlikable, and Competent/Incompetent, exceed one, which is considered a large magnitude of bias. In contrast, we cannot observe any significant differences in associations for DeepFace and FaceNet. Further studies are needed to confirm that neither model shows age bias.

Experiment 4: Do Face Recognition Models Contain Intersectional Biases?
We attempt to replicate a stereotype toward the Asian American Female (ASF). Asian women are usually seen as incapable of being or becoming leaders because they are perceived as quiet and lacking leadership qualities. Instead, they are assumed to work at a low-level gendered job, such as being a maid or working in a nail salon [12]. To test this intersectional stereotype, we used the incompetent attribute, which includes "passive" and "indecisive". We set European American Female (EUF) and African American Female (AFF) as the targets for comparison with ASF. As in the bias tests above, we computed the relative distances between the pairs of targets and attributes. For example, s(EUF, ASF, Competent, Incompetent) is used to compare the distances of EUF and ASF to the concepts of competence and incompetence. Effect sizes and p-values are measured to systematically present the gendered racism in the pre-trained models. Table 5 presents the results of gendered racism for each model, which indicate that the biases are prevalent in VGGFace, FaceNet, OpenFace, and ArcFace.
In detail, AFF is more likely to be related to competence notions, while ASF is associated with incompetence (VGGFace: d = 1.424, p < 10^{-4}; FaceNet: d = 0.451, p < 10^{-4}; OpenFace: d = 0.453, p < 10^{-4}). Moreover, compared to EUF, ASF is significantly related to incompetence concepts (FaceNet: d = 0.165, p < 10^{-4}; ArcFace: d = 0.354, p < 10^{-4}). The results indicate that the incompetent Asian women stereotype is prevalent in several face recognition models, which can hamper the accuracy of the models. In addition to the incompetent Asian women stereotype, it appears that EUF is more likely to be associated with competence, while AFF is related to incompetence (DeepID: d = 0.465, p < 10^{-4}; FaceNet: d = 0.748, p < 10^{-4}; ArcFace: d = 0.358, p < 10^{-4}). This runs counter to past stereotypes of black women as self-reliant, strong, resourceful, autonomous, and responsible for providing materially for their family [54].

Race Sensitivity Analysis
To verify that the racial features of the images give rise to racial bias in the pre-trained models, we measured how racial bias changes as the racial features vary. We hypothesized that if a strong association between a target and an attribute weakens as the racial features change, the model tends to link targets that have specific race-dependent features with that attribute. To this end, we reversed the races of the images and measured the associations between the race-reversed targets and the attributes with FEAT. We synthesized target images with reversed races (i.e., EU to AF and AF to EU), increasing the level of transformation from 0% to 100% in 25% intervals. We preserved the identity-related features of the images while reversing the race-dependent features of the faces. Following prior research, AF and EU faces differ in several external facial features [55]: (1) skin color, (2) nose shape, and (3) lip shape. Skin color is one of the most representative features for visually distinguishing race. Moreover, AF individuals typically have shorter, wider, and shallower noses than the EU population [56], and their lips are thicker and wider [57]. Therefore, these facial features of EU were converted into AF features and vice versa.
To check the reliability of the racial transformation, we validated whether the race of a given image is represented differently as the level of transformation increases. We employed a convolutional neural network (CNN), a model family that has shown good performance on image classification tasks [58], to classify the race of each image. We trained the CNN on a race-balanced dataset consisting of 774 EU and 774 AF images. Using the trained CNN, we classified the race-transformed dataset, which contains 500 EU and 500 AF images, into one of the two race classes. For each degree of transformation, we averaged the race classification probabilities of the transformed images, where 0 indicates the EU class and 1 indicates the AF class. The classification probabilities are shown in Figure 2. As the transformation level of EU toward AF moves from 0% to 100%, the probability of EU images being classified as AF increases. Similarly, AF images become more likely to be classified as EU as the level of race transformation increases. These changes in classification imply that the perceived race of an image shifts with the extent of the transformation.
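The per-level averaging described above can be sketched as follows. This is a hypothetical helper, not the authors' code: it assumes the classifier outputs have already been collected as (transformation level, predicted AF probability) pairs, with 0 corresponding to EU and 1 to AF.

```python
from collections import defaultdict
from statistics import mean


def mean_probability_by_level(predictions):
    """Average the predicted AF probability for each transformation level.

    `predictions` is an iterable of (level_percent, p_af) pairs, where p_af
    is the classifier's probability for the AF class (0 = EU, 1 = AF).
    """
    by_level = defaultdict(list)
    for level, p_af in predictions:
        by_level[level].append(p_af)
    return {level: mean(values) for level, values in sorted(by_level.items())}
```

If the transformation behaves as intended for EU source images, the averaged values should rise with the level, since more transformed images are classified as AF.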
Having verified the racial transformation, we measured FEAT while varying the racial features of the target images. For example, we calculated s(EU25, AF25, Career, Family), where EU25 denotes the EU images transformed 25% toward AF, while AF25 denotes the AF images transformed 25% toward EU. Table 6 describes the FEAT results for race sensitivity. As the race is converted, the number of significant differences decreases. In other words, as the race becomes more converted, the associations between targets and attributes are no longer significantly different. For instance, EU25 is more likely to be related to career than to family, while EU100 is not significantly related to either attribute. Consistent with this result, AF100 is not associated with either attribute, but AF25 is linked with family rather than career. In particular, for the Career/Family attribute, we found that a significant difference in association exists only in the 25% race-transformed embeddings for all models. As EU becomes AF (i.e., at 50% to 100%), and vice versa, the associations between the targets and the attribute become insignificant. That is, the models are sensitive to racial features, which would be the cause of the discriminative associations.
Figure 2. The x-axis indicates the level of race transformation, while the y-axis indicates the probability of prediction as EU (0) or AF (1).

Discussion
The current study demonstrates that pre-trained face recognition models are prone to stereotypical bias even though they are widely used as building blocks for various vision tasks. We investigated a wide range of social biases to show how human-like biases are automatically encoded in the vector spaces of face recognition models. By introducing FEAT, we systematically evaluated how pre-trained models interpret an image containing a bias target and associate it with a specific attribute. We confirmed that racial, gender, age, and intersectional biases are reproduced through the embeddings from pre-trained models by assessing differences in evaluative associations between pairs of semantic or social categories. Specifically, the results show an intersectional bias toward minorities, such as females of an ethnicity that is relatively unexplored in the field. This implies that a wide range of subgroups and ethnicities should be considered when diagnosing social biases.
The new measurement, FEAT, would be useful for quantification of the social biases from the way people are portrayed in images that are used to train machine learning models. This alerts practitioners to be cautious against using pre-trained models for transfer learning, which implies the importance of monitoring the harms these biases may pose. Moreover, the different levels of social biases in each model emphasize the importance of model selection when fair decisions are to be made in the real world. Leveraging these developments will spur future research in understanding human bias in pre-trained models and further mitigating social biases in models to build a fair society.
However, our study has some limitations to be addressed in future work. We did not explore whether the discriminative associations result from an underlying biased data distribution or from the training procedure. Moreover, as we collected our test data in the wild, the test set might amplify the biases of the models, because most of the models are fine-tuned on task-specific datasets. That is, the absence of a fine-tuning process on the new dataset might deteriorate the accuracy of the models. Therefore, to confirm the origins of these biases in face images, syntactic and semantic features from the contextual representation will be analyzed in a future study, following previous work [59]. Furthermore, measuring biases at each training batch is another direction for future work. That is, we can apply FEAT to the face embeddings from every batch to detect the stage at which the social biases emerge while training with the pre-trained model. In addition, once the main factors of the biases within the embeddings are analyzed, bias mitigation techniques could be proposed to contribute to fairness in the field of computer vision.