# Quantifying Bias in a Face Verification System


## Abstract


## 1. Introduction

## 2. Related Work

#### 2.1. Sources of Bias

**Historical Bias** arises when injustice in the world conflicts with values we want encoded in a model. Since systemic injustice creates patterns reflected in data, historical bias can exist despite perfect sampling and representation.

**Representation Bias** arises when training data under-represent a subset of the target population and the model fails to optimize for the under-represented group(s).

**Measurement Bias** arises when data are a noisy proxy for the information we desire, e.g., in FV, camera quality and discretized race categories contribute to measurement bias.

**Aggregation Bias** arises when inappropriately using a “one-size-fits-all” model on distinct populations, as a single model may not generalize well to all subgroups.

**Evaluation Bias** arises when the evaluation dataset is not representative of the target population. An evaluation may purport good performance, but miss a disparity for populations under-represented in the benchmark dataset.

**Deployment Bias** arises from inconsistency between the problem that a model is intended to solve and how it is used to make decisions in practice, as there is no guarantee that measured performance and fairness will persist.

#### 2.2. Statistical Fairness Definitions

#### 2.3. Bias in the Embedding Space

## 3. Method

#### 3.1. FV Pipeline

- Pass a pair of face images to MTCNN to crop them to bounding boxes around the faces (we discard data where MTCNN detects no faces). Each input pair has an “actual classification” of 1 (genuine) or 0 (imposter).
- Pass each cropped image tensor into the model (FaceNet, for our experiments) to produce two face embeddings.
- Compute the cosine similarity between the two embeddings (the “similarity score”).
- Apply a pre-determined threshold to the similarity score to produce a “predicted classification” of 1 (genuine) or 0 (imposter). The threshold is chosen to yield a false accept rate (FAR) of 0.05 on a 20% held-out validation set; in every dataset, no person appears in both the testing and validation splits.
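The last two steps of the pipeline can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the toy scores, and the rule "accept when the similarity is strictly above the threshold" are assumptions made for the example, and the embeddings are plain Python lists standing in for FaceNet outputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def threshold_at_far(imposter_scores, target_far=0.05):
    """Pick a threshold so that at most `target_far` of the imposter
    validation scores lie strictly above it (assumes 0 <= target_far < 1)."""
    ranked = sorted(imposter_scores, reverse=True)
    # Number of imposter pairs we may wrongly accept
    # (the small epsilon guards against floating-point error).
    allowed = math.floor(target_far * len(ranked) + 1e-9)
    return ranked[allowed]


def predict(similarity, threshold):
    """Predicted classification: 1 (genuine) if above threshold, else 0 (imposter)."""
    return 1 if similarity > threshold else 0


# Toy validation scores for imposter pairs only (hypothetical values).
val_imposter_scores = [i / 100 for i in range(100)]
threshold = threshold_at_far(val_imposter_scores, target_far=0.05)
observed_far = sum(s > threshold for s in val_imposter_scores) / len(val_imposter_scores)

# Two identical toy embeddings give similarity 1.0, which is accepted.
score = cosine_similarity([1.0, 0.0], [1.0, 0.0])
decision = predict(score, threshold)
```

Only imposter pairs enter the threshold selection, because the FAR is the fraction of imposter pairs that are wrongly accepted; genuine pairs affect the false negative rate instead.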

#### 3.2. Datasets

#### 3.3. Statistical Fairness

#### 3.4. Cluster-Based Fairness

**Clustering Metrics** We employ the following three metrics [10] to assess how strongly the embedding space is partitioned into clusters according to each sensitive attribute.

- Mean silhouette coefficient [37]: A value in the range [−1, 1] indicating how similar elements are to their own cluster. A higher value indicates that elements are more similar to their own cluster and less similar to other clusters (good clustering).
- Calinski–Harabasz index [38]: The ratio of between-cluster variance to within-cluster variance. A larger index means greater separation between clusters and less dispersion within clusters (good clustering).
- Davies–Bouldin index [39]: A value greater than or equal to zero that averages, over all clusters, the similarity of each cluster to its most similar cluster (a lower index means better-separated clusters, i.e., better clustering).
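To make the three definitions concrete, the sketch below computes them from scratch on a toy set of labelled points. It is an illustration under stated assumptions, not the authors' implementation: Euclidean distance is assumed, the function names are hypothetical, and the labels play the role of a sensitive attribute such as race or gender.

```python
import math
from collections import defaultdict


def _group(points, labels):
    """Collect the points belonging to each label."""
    clusters = defaultdict(list)
    for p, l in zip(points, labels):
        clusters[l].append(p)
    return clusters


def _centroid(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))


def mean_silhouette(points, labels):
    clusters = _group(points, labels)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for k, other in clusters.items() if k != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def calinski_harabasz(points, labels):
    clusters = _group(points, labels)
    n, k = len(points), len(clusters)
    overall = _centroid(points)
    cents = {key: _centroid(pts) for key, pts in clusters.items()}
    between = sum(len(pts) * math.dist(cents[key], overall) ** 2
                  for key, pts in clusters.items())
    within = sum(math.dist(p, cents[key]) ** 2
                 for key, pts in clusters.items() for p in pts)
    return (between / (k - 1)) / (within / (n - k))


def davies_bouldin(points, labels):
    clusters = _group(points, labels)
    cents = {key: _centroid(pts) for key, pts in clusters.items()}
    # scatter: mean distance of a cluster's members to its centroid
    scatter = {key: sum(math.dist(p, cents[key]) for p in pts) / len(pts)
               for key, pts in clusters.items()}
    worst = [max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                 for j in clusters if j != i)
             for i in clusters]
    return sum(worst) / len(worst)


# Toy 2-D "embeddings": two well-separated groups of two points each.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
aligned = [0, 0, 1, 1]    # labels that match the geometric clusters
scrambled = [0, 1, 0, 1]  # labels that cut across them
```

On this toy data the aligned labels score better on every metric than the scrambled ones, which is the pattern one would expect when a sensitive attribute partitions the embedding space. For real embeddings, scikit-learn provides `silhouette_score`, `calinski_harabasz_score`, and `davies_bouldin_score` in `sklearn.metrics`.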

**Intra-Cluster Visualizations** To observe whether or not there is inequality in the embedded cluster quality of protected and unprotected groups, we produce intra-cluster visualizations and compare clusters using pairwise distance distribution, centroid distance distribution, and persistent homology ${H}_{0}$ death time distribution [40,41].
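The three per-cluster quantities can be sketched as below for a single subgroup's embeddings. This is a hedged illustration rather than the authors' code: Euclidean distance and the function names are assumptions. The ${H}_{0}$ death times are obtained here through a standard equivalence: in a Vietoris–Rips filtration where an edge appears at its length, each finite ${H}_{0}$ death time is the length of an edge in the minimum spanning tree, so Kruskal's algorithm with a union-find structure stands in for a persistent homology library.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Distances between all unordered pairs of points in the cluster."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]


def centroid_distances(points):
    """Distance from each point to the cluster centroid."""
    centroid = tuple(sum(c) / len(points) for c in zip(*points))
    return [math.dist(p, centroid) for p in points]


def h0_death_times(points):
    """Finite H0 death times, computed as minimum-spanning-tree edge lengths.

    Every point starts as its own connected component. Scanning edges from
    shortest to longest, an edge that joins two different components merges
    them, and one H0 class dies at that edge length.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    edges = sorted((math.dist(points[i], points[j]), i, j)
                   for i, j in combinations(range(len(points)), 2))
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            deaths.append(length)
    return deaths


# Toy cluster of three collinear points (hypothetical values).
cluster = [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0)]
pair_d = pairwise_distances(cluster)
cent_d = centroid_distances(cluster)
deaths = h0_death_times(cluster)
```

A cluster of n points yields n − 1 finite death times; the one component that never dies is omitted. Comparing the histograms of these three lists across subgroups gives the kind of intra-cluster comparison described above.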

## 4. Experiments

#### 4.1. Statistical Fairness Metrics

#### 4.2. Clustering Metrics

#### 4.3. Intra-Cluster Fairness Visualizations

#### 4.3.1. Pairwise Distance Distribution

**average** threshold will be lower than optimal for Asian, Indian, and Black face pairs, leading to more frequent false positives (supported by Figure 3).

#### 4.3.2. Centroid Distance Distribution

#### 4.3.3. Persistent Homology

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Abbreviations

Abbreviation | Meaning |
---|---|
BFW | Balanced Faces in the Wild |
CH | Calinski–Harabasz Index |
DB | Davies–Bouldin Index |
FNR | False Negative Rate |
FPR | False Positive Rate |
FR | Face Recognition |
FV | Face Verification |
IJBC | IARPA Janus Benchmark C |
ML | Machine Learning |
MS | Mean Silhouette Coefficient |
MTCNN | Multi-Task Cascaded Convolutional Networks |
NPV | Negative Predictive Value |
PPV | Positive Predictive Value |
RFW | Racial Faces in the Wild |
t-SNE | t-distributed Stochastic Neighbor Embedding |

## Appendix A. Pair Generation

**Racial Faces in the Wild** Table A1 displays the breakdown of positive and negative pairs for the RFW testing split for each race subgroup. Positive and negative pairs are same-race faces (there is no gender attribute for this dataset).

Pairs | Asian | Indian | Black | White |
---|---|---|---|---|
% positive | 25.0 | 25.0 | 25.2 | 25.0 |
% negative | 75.0 | 75.0 | 74.8 | 75.0 |

**Janus-C** Table A2 details the Janus-C test set’s positive and negative pairs across skin tone and gender subgroups. All pairs are same-skin-tone and same-gender faces. Because Janus-C is not balanced over sensitive attributes, we had to vary positive and negative pair generation for each skin tone and gender subgroup. The drastically different numbers of faces across skin tones and genders make it difficult to achieve parity in the number of pairs for these subgroups while maintaining a large enough sample for testing. This should be considered when interpreting Janus-C results.

**Table A2.** The test set percentages of positive and negative pairs generated per subgroup for Janus-C. Columns 1 to 6 are skin tone groups.

Gender | Pairs | 1 | 2 | 3 | 4 | 5 | 6 |
---|---|---|---|---|---|---|---|
Female | % positive | 54.9 | 40.6 | 36.5 | 14.9 | 14.4 | 7.1 |
Female | % negative | 45.1 | 59.4 | 63.5 | 85.1 | 85.6 | 92.9 |
Male | % positive | 54.7 | 37.3 | 29.5 | 13.7 | 8.6 | 5.5 |
Male | % negative | 45.3 | 62.7 | 70.5 | 86.3 | 91.4 | 94.5 |

**VGGFace2 Test Set** Table A3 shows the breakdown across gender subgroups of positive and negative pairs for the VGG testing split. All pairs are same-gender faces (VGGFace2 does not have a race attribute). The VGGFace2 test set is not balanced over its sensitive attribute, so we had to vary positive and negative pair generation by gender subgroup. Because VGGFace2 has less inequality than Janus-C in number of faces per subgroup, we achieved positive-to-negative pair ratios much closer to 25:75.

**Figure A1.** Statistical fairness metric results for BFW subgroups. A = Asian; I = Indian; B = Black; W = White; F = Female; M = Male.

**Table A3.** The test set percentages of positive and negative pairs generated per subgroup for the VGGFace2 test set.

Pairs | Female | Male |
---|---|---|
% positive | 23.6 | 29.6 |
% negative | 76.4 | 70.4 |

## Appendix B. Statistical Fairness Metric Experiments

**Figure A2.** Statistical fairness metric results for RFW race subgroups. A = Asian; I = Indian; B = Black; W = White.

**Figure A3.** Statistical fairness metric results for VGGFace2 test set gender subgroups. F = Female; M = Male.

**Figure A4.** Statistical fairness metric results for Janus-C skin tone subgroups. Dark blue bars represent original data; light blue bars represent blurred data. Skin tone groups are labelled from 1 (lightest skin) to 6 (darkest skin).

## Appendix C. Clustering Metrics

**Table A4.** Clustering metric results for RFW. ↑ means that a higher value indicates better clustering and ↓ means that a lower value indicates better clustering.

Metric | Race |
---|---|
MS↑ | 0.112 |
CH↑ | 1423 |
DB↓ | 4.21 |

Metric | Gender |
---|---|
MS↑ | 0.026 |
CH↑ | 1835 |
DB↓ | 8.44 |

Metric | Gender | Skin Tone |
---|---|---|
MS↑ | 0.034 | −0.002 |
CH↑ | 380 | 227 |
DB↓ | 7.57 | 7.81 |

## Appendix D. Clustering Visualizations

**Figure A5.** Intra-cluster visualizations for RFW. Pairwise distance distribution (**left**); centroid distance distribution (**middle**); persistent homology 0th class deaths distribution (**right**).

**Figure A6.** Intra-cluster visualizations for the VGGFace2 test set. Pairwise distance distribution (**left**); centroid distance distribution (**middle**); persistent homology 0th class deaths distribution (**right**).

**Figure A7.** Intra-cluster visualizations for Janus-C. Pairwise distance distribution (**left**); centroid distance distribution (**middle**); persistent homology 0th class deaths distribution (**right**).

#### Intra-Cluster Distribution T-Tests

**Table A7.** Corrected p-values of the 2-sample independent t-test results for BFW race (top) and gender (bottom) subgroup pairs. A: Asian; I: Indian; B: Black; W: White; F: Female; M: Male.

Subgroup Pair | Pairwise Distance Distributions | Centroid Distance Distributions | ${H}_{0}$ Death Time Distributions |
---|---|---|---|
A vs. I | <0.001 | <0.001 | >0.999 |
A vs. B | <0.001 | <0.001 | >0.999 |
A vs. W | <0.001 | <0.001 | <0.001 |
I vs. B | <0.001 | <0.001 | >0.999 |
I vs. W | <0.001 | <0.001 | <0.001 |
B vs. W | <0.001 | <0.001 | <0.001 |
F vs. M | <0.001 | >0.999 | >0.03 |

**Table A8.** Corrected p-values of the 2-sample independent t-test results for RFW race subgroup pairs. A: Asian; I: Indian; B: Black; W: White.

Subgroup Pair | Pairwise Distance Distributions | Centroid Distance Distributions | ${H}_{0}$ Death Time Distributions |
---|---|---|---|
A vs. I | <0.001 | <0.001 | <0.001 |
A vs. B | <0.001 | <0.001 | >0.999 |
A vs. W | <0.001 | <0.001 | <0.001 |
I vs. B | <0.001 | <0.001 | <0.001 |
I vs. W | <0.001 | <0.001 | <0.001 |
B vs. W | <0.001 | <0.001 | <0.001 |

**Table A9.** Corrected p-values of the 2-sample independent t-test results for VGGFace2 test set gender subgroup pairs. F: Female; M: Male.

Subgroup Pair | Pairwise Distance Distributions | Centroid Distance Distributions | ${H}_{0}$ Death Time Distributions |
---|---|---|---|
F vs. M | <0.001 | <0.001 | 0.02 |

**Table A10.** Corrected p-values of the 2-sample independent t-test results for Janus-C skin tone (top) and gender (bottom) subgroup pairs. Results are for non-blurred data. Skin tone groups are labelled from 1 (lightest skin) to 6 (darkest skin). F: Female; M: Male.

Subgroup Pair | Pairwise Distance Distributions | Centroid Distance Distributions | ${H}_{0}$ Death Time Distributions |
---|---|---|---|
1 vs. 2 | >0.999 | <0.001 | <0.001 |
1 vs. 3 | >0.999 | <0.001 | 0.81 |
1 vs. 4 | 0.01 | <0.001 | >0.999 |
1 vs. 5 | 0.62 | <0.001 | 0.13 |
1 vs. 6 | >0.999 | <0.001 | <0.001 |
2 vs. 3 | 0.70 | <0.001 | <0.001 |
2 vs. 4 | <0.001 | <0.001 | <0.001 |
2 vs. 5 | 0.02 | >0.999 | 0.22 |
2 vs. 6 | >0.999 | <0.001 | >0.999 |
3 vs. 4 | 0.30 | <0.001 | 0.98 |
3 vs. 5 | >0.999 | <0.001 | 0.03 |
3 vs. 6 | 0.54 | <0.001 | <0.001 |
4 vs. 5 | 0.91 | <0.001 | 0.98 |
4 vs. 6 | <0.001 | <0.001 | 0.06 |
5 vs. 6 | 0.003 | <0.001 | 0.99 |
F vs. M | 0.08 | >0.999 | >0.999 |

## References

1. Monahan, J.; Skeem, J.L. Risk Assessment in Criminal Sentencing. Annu. Rev. Clin. Psychol. **2016**, 12, 489–513.
2. Christin, A.; Rosenblat, A.; Boyd, D. Courts and Predictive Algorithms. Data & Civil Rights: A New Era of Policing and Justice. 2016. Available online: https://www.law.nyu.edu/sites/default/files/upload_documents/Angele%20Christin.pdf (accessed on 28 February 2022).
3. Romanov, A.; De-Arteaga, M.; Wallach, H.; Chayes, J.; Borgs, C.; Chouldechova, A.; Geyik, S.; Kenthapadi, K.; Rumshisky, A.; Kalai, A.T. What’s in a Name? Reducing Bias in Bios without Access to Protected Attributes. arXiv **2019**, arXiv:1904.05233.
4. De-Arteaga, M.; Romanov, A.; Wallach, H.; Chayes, J.; Borgs, C.; Chouldechova, A.; Geyik, S.; Kenthapadi, K.; Kalai, A.T. Bias in Bios: A Case Study of Semantic Representation Bias in a High-Stakes Setting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, Atlanta, GA, USA, 29–31 January 2019.
5. Fuster, A.; Goldsmith-Pinkham, P.; Ramadorai, T.; Walther, A. Predictably Unequal? The Effects of Machine Learning on Credit Markets. SSRN Electron. J. **2017**, 77, 5–47.
6. Mitchell, S.; Potash, E.; Barocas, S.; D’Amour, A.; Lum, K. Prediction-Based Decisions and Fairness: A Catalogue of Choices, Assumptions, and Definitions. arXiv **2020**, arXiv:1811.07867.
7. Verma, S.; Rubin, J. Fairness Definitions Explained. In Proceedings of the 2018 IEEE/ACM International Workshop on Software Fairness (FairWare), Gothenburg, Sweden, 29 May 2018; pp. 1–7.
8. Robinson, J.P.; Livitz, G.; Henon, Y.; Qin, C.; Fu, Y.; Timoner, S. Face Recognition: Too Bias, or Not Too Bias? In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; IEEE: Seattle, WA, USA, 2020; pp. 1–10.
9. Buolamwini, J.; Gebru, T. Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, New York, NY, USA, 23–24 February 2018; ACM: New York, NY, USA, 2018.
10. Gluge, S.; Amirian, M.; Flumini, D.; Stadelmann, T. How (Not) to Measure Bias in Face Recognition Networks. In Artificial Neural Networks in Pattern Recognition; Schilling, F.P., Stadelmann, T., Eds.; Springer International Publishing: Berlin/Heidelberg, Germany, 2020; pp. 125–137.
11. Bhattacharyya, D.; Ranjan, R. Biometric Authentication: A Review. Int. J. u- e-Serv. Sci. Technol. **2009**, 2, 13–28.
12. Schroff, F.; Kalenichenko, D.; Philbin, J. FaceNet: A Unified Embedding for Face Recognition and Clustering. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015.
13. Wheeler, F.W.; Weiss, R.L.; Tu, P.H. Face recognition at a distance system for surveillance applications. In Proceedings of the 2010 Fourth IEEE International Conference on Biometrics: Theory, Applications and Systems (BTAS), Washington, DC, USA, 27–29 September 2010; IEEE: Washington, DC, USA, 2010; pp. 1–8.
14. Van der Maaten, L.; Hinton, G. Visualizing Data using t-SNE. J. Mach. Learn. Res. **2008**, 9, 2579–2605.
15. Suresh, H.; Guttag, J.V. A Framework for Understanding Unintended Consequences of Machine Learning. arXiv **2020**, arXiv:1901.10002.
16. Hardt, M.; Price, E.; Srebro, N. Equality of Opportunity in Supervised Learning. In Proceedings of the Advances in Neural Information Processing Systems 29, Barcelona, Spain, 5–10 December 2016; Lee, D.D., Sugiyama, M., Luxburg, U.V., Guyon, I., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2016; pp. 3315–3323.
17. Chouldechova, A. Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments. Big Data **2017**, 5, 153–163.
18. Corbett-Davies, S.; Pierson, E.; Feller, A.; Goel, S.; Huq, A. Algorithmic Decision Making and the Cost of Fairness. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Halifax, NS, Canada, 13–17 August 2017; Association for Computing Machinery: Halifax, NS, Canada, 2017; pp. 797–806.
19. Zemel, R. Learning Fair Representations. In Proceedings of the ICML, Atlanta, GA, USA, 16–21 June 2013; pp. 325–333.
20. Feldman, M.; Friedler, S.A.; Moeller, J.; Scheidegger, C.; Venkatasubramanian, S. Certifying and Removing Disparate Impact. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Sydney, NSW, Australia, 10–13 August 2015; Association for Computing Machinery: Sydney, NSW, Australia, 2015; pp. 259–268.
21. Louizos, C.; Swersky, K.; Li, Y.; Welling, M.; Zemel, R. The Variational Fair Autoencoder. arXiv **2017**, arXiv:1511.00830.
22. Rothblum, G.N.; Yona, G. Probably Approximately Metric-Fair Learning. In Proceedings of the ICML, Stockholm, Sweden, 10–15 July 2018.
23. Dwork, C.; Hardt, M.; Pitassi, T.; Reingold, O.; Zemel, R. Fairness through awareness. In Proceedings of the 3rd Innovations in Theoretical Computer Science Conference, Cambridge, MA, USA, 8–10 January 2012; Association for Computing Machinery: Cambridge, MA, USA, 2012; pp. 214–226.
24. Kusner, M.J.; Loftus, J.; Russell, C.; Silva, R. Counterfactual Fairness. In Advances in Neural Information Processing Systems 30; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 4066–4076.
25. Kilbertus, N.; Rojas Carulla, M.; Parascandolo, G.; Hardt, M.; Janzing, D.; Schölkopf, B. Avoiding Discrimination through Causal Reasoning. In Advances in Neural Information Processing Systems 30; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; pp. 656–666.
26. Nabi, R.; Shpitser, I. Fair Inference On Outcomes. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018.
27. Berk, R.; Heidari, H.; Jabbari, S.; Kearns, M.; Roth, A. Fairness in Criminal Justice Risk Assessments: The State of the Art. Sociol. Methods Res. **2018**, 50, 3–44.
28. Kleinberg, J.; Mullainathan, S.; Raghavan, M. Inherent Trade-Offs in the Fair Determination of Risk Scores. arXiv **2018**, arXiv:1609.05807.
29. Zhang, K.; Zhang, Z.; Li, Z.; Qiao, Y. Joint Face Detection and Alignment using Multi-task Cascaded Convolutional Networks. IEEE Signal Process. Lett. **2016**, 23, 1499–1503.
30. Cao, Q.; Shen, L.; Xie, W.; Parkhi, O.M.; Zisserman, A. VGGFace2: A dataset for recognising faces across pose and age. arXiv **2018**, arXiv:1710.08092.
31. Wang, M.; Deng, W.; Hu, J.; Tao, X.; Huang, Y. Racial Faces in-the-Wild: Reducing Racial Bias by Information Maximization Adaptation Network. arXiv **2019**, arXiv:1812.00194.
32. Wang, M.; Zhang, Y.; Deng, W. Meta Balanced Network for Fair Face Recognition. IEEE Trans. Pattern Anal. Mach. Intell. **2021**.
33. Wang, M.; Deng, W. Mitigate Bias in Face Recognition using Skewness-Aware Reinforcement Learning. arXiv **2019**, arXiv:1911.10692.
34. Wang, M.; Deng, W. Deep Face Recognition: A survey. Neurocomputing **2021**, 429, 215–244.
35. Maze, B.; Adams, J.; Duncan, J.A.; Kalka, N.; Miller, T.; Otto, C.; Jain, A.K.; Niggel, W.T.; Anderson, J.; Cheney, J.; et al. IARPA Janus Benchmark – C: Face Dataset and Protocol. In Proceedings of the 2018 International Conference on Biometrics (ICB), Gold Coast, QLD, Australia, 20–23 February 2018; IEEE: New York, NY, USA, 2018; pp. 158–165.
36. Orloff, J.; Bloom, J. Bootstrap confidence intervals. 2014. Available online: https://math.mit.edu/~dav/05.dir/class24-prep-a.pdf (accessed on 28 February 2022).
37. Rousseeuw, P.J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. **1987**, 20, 53–65.
38. Calinski, T.; Harabasz, J. A dendrite method for cluster analysis. Commun. Stat.-Theory Methods **1974**, 3, 1–27.
39. Davies, D.L.; Bouldin, D.W. A Cluster Separation Measure. IEEE Trans. Pattern Anal. Mach. Intell. **1979**, 1, 224–227.
40. Chazal, F.; Michel, B. An introduction to Topological Data Analysis: Fundamental and practical aspects for data scientists. arXiv **2017**, arXiv:1710.04019.
41. Wasserman, L. Topological Data Analysis. Annu. Rev. Stat. Appl. **2018**, 5, 501–532.
42. Domingos, P. A few useful things to know about machine learning. Commun. ACM **2012**, 55, 78–87.
43. Saul, N.; Tralie, C. Scikit-TDA: Topological Data Analysis for Python. 2019. Available online: https://zenodo.org/record/2533369 (accessed on 28 February 2022).

**Figure 1.** A two-dimensional t-SNE [14] visualization of Balanced Faces in the Wild (BFW) [8] embeddings, colored by race and gender. Clusters roughly correspond to race and gender, with varied densities (e.g., Asian clusters are tighter than White clusters). Note that t-SNE embeddings are not completely representative of actual relationships due to information loss during dimensionality reduction.

**Figure 2.** An overview of our approach. We use diverse face datasets to assess bias in FaceNet [12] by leveraging the face embeddings that it produces for various fairness experiments.

**Figure 3.** Statistical fairness metric results for BFW race and gender subgroups. See Table 1 for metric descriptions. Blue bars denote race subgroups; gray bars denote gender subgroups. A = Asian; I = Indian; B = Black; W = White; F = Female; M = Male.

**Figure 4.** Pairwise distance distribution for BFW race (**left**) and gender (**right**) subgroups. Top plots include all pairs for each subgroup and bottom plots include distinct curves for genuine pairs (solid) and imposter pairs (dashed) for each subgroup.

**Figure 5.** Centroid distance distribution for BFW race subgroups (**left**) and BFW gender subgroups (**right**).

**Figure 6.** Distribution of persistent homology class 0 (${H}_{0}$) death times for BFW race (**left**) and gender (**right**) subgroups.

**Table 1.** Statistical fairness metrics. $d$ is the predicted classification, $Y$ the actual classification, $S$ the similarity score, and ${A}_{1},\dots ,{A}_{N}$ the subgroups.

Metric | Description | Definition | References |
---|---|---|---|
Overall Accuracy Equality | Equal prediction accuracy across protected and unprotected groups | $P(d=Y \mid {A}_{1}) = P(d=Y \mid {A}_{2}) = \cdots = P(d=Y \mid {A}_{N})$ | Berk et al. [27]; Mitchell et al. [6]; Verma and Rubin [7] |
Predictive Equality | Equal FPR across protected and unprotected groups | $P(d=1 \mid Y=0, {A}_{1}) = P(d=1 \mid Y=0, {A}_{2}) = \cdots = P(d=1 \mid Y=0, {A}_{N})$ | Chouldechova [17]; Corbett-Davies et al. [18]; Mitchell et al. [6]; Verma and Rubin [7] |
Equal Opportunity | Equal FNR across protected and unprotected groups | $P(d=0 \mid Y=1, {A}_{1}) = P(d=0 \mid Y=1, {A}_{2}) = \cdots = P(d=0 \mid Y=1, {A}_{N})$ | Chouldechova [17]; Hardt et al. [16]; Kusner et al. [24]; Mitchell et al. [6]; Verma and Rubin [7] |
Conditional Use Accuracy Equality | Equal PPV and NPV across protected and unprotected groups | $P(Y=1 \mid d=1, {A}_{1}) = P(Y=1 \mid d=1, {A}_{2}) = \cdots = P(Y=1 \mid d=1, {A}_{N})$ AND $P(Y=0 \mid d=0, {A}_{1}) = P(Y=0 \mid d=0, {A}_{2}) = \cdots = P(Y=0 \mid d=0, {A}_{N})$ | Berk et al. [27]; Mitchell et al. [6]; Verma and Rubin [7] |
Balance for the Positive Class | Equal avg. score S for the positive class across protected and unprotected groups | $AVG(S \mid Y=1, {A}_{1}) = AVG(S \mid Y=1, {A}_{2}) = \cdots = AVG(S \mid Y=1, {A}_{N})$ | Kleinberg et al. [28]; Mitchell et al. [6]; Verma and Rubin [7] |
Balance for the Negative Class | Equal avg. score S for the negative class across protected and unprotected groups | $AVG(S \mid Y=0, {A}_{1}) = AVG(S \mid Y=0, {A}_{2}) = \cdots = AVG(S \mid Y=0, {A}_{N})$ | Kleinberg et al. [28]; Mitchell et al. [6]; Verma and Rubin [7] |
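As an illustration of how the quantities in these definitions could be estimated per subgroup, the sketch below tallies confusion-matrix counts and mean scores from labelled verification results. It is a minimal example under stated assumptions, not the authors' code: the record layout `(group, actual, predicted, score)`, the function name, and the toy data are all hypothetical.

```python
from collections import defaultdict


def group_fairness_metrics(records):
    """Per-subgroup rates from (group, actual, predicted, score) records.

    `actual` and `predicted` are 1 (genuine) or 0 (imposter); `score` is the
    similarity score. Undefined ratios (empty denominators) become NaN.
    """
    by_group = defaultdict(list)
    for group, actual, predicted, score in records:
        by_group[group].append((actual, predicted, score))

    def ratio(num, den):
        return num / den if den else float("nan")

    results = {}
    for group, rows in by_group.items():
        tp = sum(1 for y, d, _ in rows if y == 1 and d == 1)
        fn = sum(1 for y, d, _ in rows if y == 1 and d == 0)
        fp = sum(1 for y, d, _ in rows if y == 0 and d == 1)
        tn = sum(1 for y, d, _ in rows if y == 0 and d == 0)
        results[group] = {
            "accuracy": ratio(tp + tn, len(rows)),  # overall accuracy equality
            "fpr": ratio(fp, fp + tn),              # predictive equality
            "fnr": ratio(fn, fn + tp),              # equal opportunity
            "ppv": ratio(tp, tp + fp),              # conditional use accuracy
            "npv": ratio(tn, tn + fn),              # conditional use accuracy
            "avg_score_pos": ratio(sum(s for y, _, s in rows if y == 1), tp + fn),
            "avg_score_neg": ratio(sum(s for y, _, s in rows if y == 0), fp + tn),
        }
    return results


# Toy results for two hypothetical subgroups.
toy_records = [
    ("A", 1, 1, 0.9), ("A", 1, 0, 0.4), ("A", 0, 0, 0.1), ("A", 0, 1, 0.6),
    ("B", 1, 1, 0.8), ("B", 0, 0, 0.2),
]
metrics = group_fairness_metrics(toy_records)
```

Each fairness definition holds when the corresponding entry is equal (in practice, statistically indistinguishable) across subgroups. In the toy data, group A has an FPR of 0.5 and group B an FPR of 0, so predictive equality would be violated.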

**Table 2.** The four benchmark datasets that we use in our experiments. Faces/ID is the average number of faces per ID. * VGG Test represents the VGGFace2 test set.

Dataset | # IDs | Faces/ID | Attributes | Notes |
---|---|---|---|---|
BFW | 800 | 25 | Race, Gender | Equal balance for race and gender |
RFW | 12,000 | 6.7 | Race | Equal balance for race |
IJBC | 3531 | 6 | Skin Tone, Gender | Occlusion, occupation diversity |
VGG Test * | 500 | 375 | Gender | Variation in pose and age |

**Table 3.** The percentage of positive and negative pairs per subgroup for the BFW testing split. Ratios for the validation set are similar.

Gender | Pairs | Asian | Indian | Black | White |
---|---|---|---|---|---|
Female | % positive | 25 | 25 | 25 | 25 |
Female | % negative | 75 | 75 | 75 | 75 |
Male | % positive | 25 | 25 | 25 | 25 |
Male | % negative | 75 | 75 | 75 | 75 |

**Table 4.** Clustering metric results for BFW. ↑ means that a higher value indicates better clustering and ↓ means that a lower value indicates better clustering.

Metric | Gender | Race | Both |
---|---|---|---|
MS↑ | 0.034 | 0.091 | 0.103 |
CH↑ | 280 | 572 | 444 |
DB↓ | 7.55 | 4.36 | 3.98 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Frisella, M.; Khorrami, P.; Matterer, J.; Kratkiewicz, K.; Torres-Carrasquillo, P.
Quantifying Bias in a Face Verification System. *Comput. Sci. Math. Forum* **2022**, *3*, 6.
https://doi.org/10.3390/cmsf2022003006

**AMA Style**

Frisella M, Khorrami P, Matterer J, Kratkiewicz K, Torres-Carrasquillo P.
Quantifying Bias in a Face Verification System. *Computer Sciences & Mathematics Forum*. 2022; 3(1):6.
https://doi.org/10.3390/cmsf2022003006

**Chicago/Turabian Style**

Frisella, Megan, Pooya Khorrami, Jason Matterer, Kendra Kratkiewicz, and Pedro Torres-Carrasquillo.
2022. "Quantifying Bias in a Face Verification System" *Computer Sciences & Mathematics Forum* 3, no. 1: 6.
https://doi.org/10.3390/cmsf2022003006