A Responsible Machine Learning Workflow with Focus on Interpretable Models, Post-hoc Explanation, and Discrimination Testing

Abstract: This manuscript outlines a viable approach for training and evaluating machine learning systems for high-stakes, human-centered, or regulated applications using common Python programming tools. The accuracy and intrinsic interpretability of two types of constrained models, monotonic gradient boosting machines and explainable neural networks (a deep learning architecture well-suited for structured data), are assessed on simulated data and publicly available mortgage data. For maximum transparency and the potential generation of personalized adverse action notices, the constrained models are analyzed using post-hoc explanation techniques, including plots of partial dependence and individual conditional expectation, and with global and local Shapley feature importance. The constrained model predictions are also tested for disparate impact and other types of discrimination using measures with long-standing legal precedents (adverse impact ratio, marginal effect, and standardized mean difference), along with straightforward group fairness measures. By combining interpretable models, post-hoc explanations, and discrimination testing with accessible software tools, this text aims to provide a template workflow for machine learning applications that require high accuracy and interpretability and that mitigate risks of discrimination.


Introduction
Responsible artificial intelligence (AI) has been variously conceptualized as AI-based products or projects that use transparent technical mechanisms, that create appealable decisions or outcomes, that perform reliably and in a trustworthy manner over time, that exhibit minimal social discrimination, and that are designed by humans with diverse experiences, both in terms of demographics and professional backgrounds (e.g., Responsible Artificial Intelligence, Responsible AI: A Framework for Building Trust in Your AI Solutions, PwC's Responsible AI, Responsible AI Practices). Although responsible AI is today a somewhat broad and amorphous notion, at least one aspect is becoming clear. Machine learning (ML) models, a common application of AI, can present serious risks. ML models can be inaccurate and unappealable black-boxes, even with the application of newer post-hoc explanation techniques [1] (e.g., When a Computer Program Keeps You in Jail). ML models can perpetuate and exacerbate discrimination [2][3][4], and ML models can be hacked, resulting in manipulated model outcomes or the exposure of proprietary intellectual property or sensitive training data [5][6][7][8]. This manuscript makes no claim that these interdependent issues of ML opaqueness, discrimination, privacy harms, and security vulnerabilities have been resolved, even as singular

Materials and Methods
The simulated data (see Section 2.1) are based on the well-known Friedman datasets. Their known feature importance and augmented discrimination characteristics are used to gauge the validity of interpretable modeling, post-hoc explanation, and discrimination testing techniques [10,11]. The mortgage data (see Section 2.2) are sourced from the Home Mortgage Disclosure Act (HMDA) database, a fairly realistic data source for demonstrating the template workflow [12] (see Mortgage data (HMDA)). To provide a sense of fit differences, performance is compared on simulated data and collected mortgage data between the more interpretable constrained ML models and the less interpretable unconstrained ML models. Because the unconstrained ML models, gradient boosting machines (GBMs, e.g., [13,14]) and artificial neural networks (ANNs, e.g., [15][16][17][18]), do not exhibit convincing accuracy benefits on the simulated or mortgage data and can also present the unmitigated risks discussed above, further explanation and discrimination analyses are applied only to the constrained, interpretable ML models [1,19,20]. Here, monotonic gradient boosting machines (MGBMs, as implemented in XGBoost or h2o, see Section 2.3) and explainable neural networks (XNNs, e.g., [21,22], see Section 2.4) serve as those more interpretable models for subsequent explanatory and discrimination analyses. MGBM and XNN interpretable model architectures are selected for the example workflow because they are straightforward variants of popular unconstrained ML models. If practitioners are working with GBM and ANN models, it should be relatively uncomplicated to also evaluate the constrained versions of these models.
The same can be said of the selected explanation methods and discrimination tests. Due to their post-hoc nature, they can often be shoe-horned into existing ML workflows and pipelines. Presented explanation techniques include partial dependence (PD) and individual conditional expectation (ICE) (see Section 2.5) and Shapley values (see Section 2.6) [14,23,24,25]. PD, ICE, and Shapley values provide direct, global, and local summaries and descriptions of constrained models without resorting to the use of intermediary and approximate surrogate models. Discrimination testing methods discussed (see Section 2.7) include adverse impact ratio (AIR, see Part 1607-Uniform Guidelines on Employee Selection Procedures (1978) §1607.4), marginal effect (ME), and standardized mean difference (SMD) [2,26,27]. Accuracy and other confusion matrix measures are also reported by demographic segment [28]. All outlined materials and methods are implemented in open source Python code, and are made available on GitHub (see Section 2.8).

Simulated Data
Simulated data are created based on a signal-generating function, f, applied to input data, X, first proposed in Friedman [10] and extended in Friedman et al. [11]:
$$f(X) = 10 \sin(\pi X_{\text{Friedman},1} X_{\text{Friedman},2}) + 20 (X_{\text{Friedman},3} - 0.5)^2 + 10 X_{\text{Friedman},4} + 5 X_{\text{Friedman},5} \tag{1}$$
where each X Friedman,j is a random uniform feature in [0,1]. In Friedman's texts, a Gaussian noise term was added to create a continuous output feature for testing spline regression methodologies. In this manuscript, the signal-generating function and input features are modified in several ways. Two binary features, a categorical feature with five discrete levels, and a bias term are introduced into f to add a degree of complexity that may more closely mimic real-world settings. For binary classification analysis, the Gaussian noise term is replaced with noise drawn from a logistic distribution, coefficients are re-scaled to be one-fifth of the size of those used by Friedman, and any f(X) value above 0 is classified as a positive outcome, while f(X) values less than or equal to zero are designated as negative outcomes. Finally, f is augmented with two hypothetical protected class-control features with known dependencies on the binary outcome to allow for discrimination testing. The simulated data are generated to have eight input features (twelve after numeric encoding of categorical features), a binary outcome, two class-control features, and 100,000 instances. The simulated data are then split into a training and test set, with 80,000 and 20,000 instances, respectively. Within the training set, a five-fold cross-validation indicator is used for training all models. For an exact specification of the simulated data, see the software resources referenced in Section 2.8.
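The core of this simulation can be sketched in a few lines of NumPy. The sketch below is only an illustration under stated assumptions: it covers the five Friedman features, the one-fifth coefficient re-scaling, the logistic noise, and the zero cutoff, but it omits the binary, categorical, and class-control features, and the `bias` value and function names are hypothetical choices rather than the specification used in the referenced software resources.

```python
import numpy as np


def friedman_signal(X, scale=0.2):
    """Friedman signal with coefficients re-scaled to one-fifth of their size.

    X must have at least five columns with values in [0, 1].
    """
    return scale * (
        10.0 * np.sin(np.pi * X[:, 0] * X[:, 1])
        + 20.0 * (X[:, 2] - 0.5) ** 2
        + 10.0 * X[:, 3]
        + 5.0 * X[:, 4]
    )


def simulate_binary_outcome(n_rows, bias=-3.0, seed=0):
    """Draw uniform features, add logistic noise, and threshold at zero.

    The bias value is a hypothetical choice that roughly balances the classes.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_rows, 5))
    noise = rng.logistic(loc=0.0, scale=1.0, size=n_rows)
    latent = friedman_signal(X) + bias + noise
    y = (latent > 0).astype(int)  # values above zero are positive outcomes
    return X, y


X_sim, y_sim = simulate_binary_outcome(100_000)

# 80,000 training instances and 20,000 test instances, as in the text.
X_train, X_test = X_sim[:80_000], X_sim[80_000:]
y_train, y_test = y_sim[:80_000], y_sim[80_000:]
```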

Mortgage Data
The mortgage dataset analyzed here is a random sample of consumer-anonymized loans from the HMDA database. These loans are a subset of all originated mortgage loans in the 2018 HMDA data that were chosen to represent a relatively comparable group of consumer mortgages. A selection of features is used to predict whether a loan is high-priced, i.e., the annual percentage rate (APR) charged was 150 basis points (1.5%) or more above a survey-based estimate of other similar loans offered around the time of the given loan. After data cleaning and preprocessing to encode categorical features and create missing markers, the mortgage data contain ten input features and the binary outcome, high-priced. The data are split into a training set with 160,338 loans and a marker for 5-fold cross-validation and a test set containing 39,662 loans. While lenders would almost certainly use more information than the selected features to determine whether to offer and originate a high-priced loan, the selected input features (loan to value (LTV) ratio, debt to income (DTI) ratio, property value, loan amount, introductory interest rate, customer income, etc.) are likely to be some of the most influential factors that a lender would consider. See the resources put forward in Section 2.8 and Appendix A for more information regarding the HMDA mortgage data.
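As a rough illustration of the outcome definition and the cross-validation marker described above, the following sketch derives a high-priced flag from an APR and a benchmark rate and assigns a five-fold indicator. The function and argument names are hypothetical, and the actual preprocessing in the referenced software resources may differ.

```python
import numpy as np


def high_priced_flag(apr, benchmark_apr, spread=1.5):
    """Return 1 where the APR is at least `spread` percentage points
    above the benchmark estimate for similar loans, else 0."""
    apr = np.asarray(apr, dtype=float)
    benchmark_apr = np.asarray(benchmark_apr, dtype=float)
    return ((apr - benchmark_apr) >= spread).astype(int)


def fold_indicator(n_rows, n_folds=5, seed=0):
    """Assign each row to one of `n_folds` folds of nearly equal size."""
    rng = np.random.default_rng(seed)
    return rng.permutation(n_rows) % n_folds


flags = high_priced_flag([5.0, 6.5, 7.0], [5.0, 5.0, 5.0])
folds = fold_indicator(160_338)  # training set size reported in the text
```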

Monotonic Gradient Boosting Machines
MGBMs constrain typical GBM training to consider only tree splits that obey user-defined positive and negative monotonicity constraints, with respect to each input feature, X j, and a target feature, y, independently. An MGBM remains an additive combination of B trees trained by gradient boosting, T b, and each tree learns a set of splitting rules that respect monotonicity constraints, Θ mono b. For an instance, x, a trained MGBM model, g MGBM, takes the form:
$$g^{\text{MGBM}}(x) = \sum_{b=0}^{B-1} T_b(x; \Theta^{\text{mono}}_b) \tag{2}$$
As in unconstrained GBM, Θ mono b is selected in a greedy, additive fashion by minimizing a regularized loss function that considers known target labels, the predictions of all previously trained trees in g MGBM, and the b-th tree splits applied to x, T b (x; Θ mono b), in a numeric loss function (e.g., squared loss, Huber loss), and a regularization term that penalizes complexity in the current tree. See Appendices B.1 and B.2 for details pertaining to MGBM training.
Herein, two g MGBM models are trained: one on the simulated data and one on the mortgage data. In both cases, positive and negative monotonic constraints for each X j are selected using domain knowledge, random grid search is used to determine other hyperparameters, and five-fold cross-validation and test partitions are used for model assessment. For exact parameterization of the two g MGBM models, see the software resources referenced in Section 2.8.
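To make the idea of constrained splits concrete, the toy sketch below builds an additive ensemble of depth-one trees (stumps) and keeps only candidate splits whose leaf values respect a per-feature constraint. It is a deliberately simplified illustration, not the training procedure of XGBoost or h2o: the stumps and leaf values are made up, no loss is minimized, and deeper trees would need additional bookkeeping so that constraints hold across subtrees.

```python
import numpy as np


def split_respects_constraint(left_value, right_value, constraint):
    """constraint: +1 for increasing, -1 for decreasing, 0 for unconstrained."""
    if constraint > 0:
        return left_value <= right_value
    if constraint < 0:
        return left_value >= right_value
    return True


def predict_stump_ensemble(X, stumps):
    """Additive prediction; each stump is (feature, threshold, left, right)."""
    X = np.asarray(X, dtype=float)
    total = np.zeros(X.shape[0])
    for feature, threshold, left_value, right_value in stumps:
        total += np.where(X[:, feature] < threshold, left_value, right_value)
    return total


# Hypothetical constraints: feature 0 increasing, feature 1 decreasing.
constraints = (1, -1)
candidate_stumps = [
    (0, 0.3, -0.5, 0.4),   # increasing in feature 0: kept
    (0, 0.6, 0.2, -0.1),   # decreasing in feature 0: rejected
    (1, 0.5, 0.3, -0.3),   # decreasing in feature 1: kept
    (1, 0.8, -0.2, 0.1),   # increasing in feature 1: rejected
]
kept_stumps = [
    s for s in candidate_stumps
    if split_respects_constraint(s[2], s[3], constraints[s[0]])
]
```

Because every retained stump is monotonic in the required direction, their sum is as well, which is the property that PD and ICE plots are later used to confirm.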

Explainable Neural Networks
XNNs are an alternative formulation of additive index models in which the ridge functions are neural networks [21]. XNNs also bear a strong resemblance to generalized additive models (GAMs) and so-called explainable boosting machines (EBMs or GA2Ms), which consider main effects and a small number of two-way interactions and may also incorporate boosting into their training [14,29]. XNNs enable users to tailor interpretable neural network architectures to a given prediction problem and to visualize model behavior by plotting ridge functions. A trained XNN function, g XNN, applied to some instance, x, is defined as:
$$g^{\text{XNN}}(x) = \mu_0 + \sum_{k=0}^{K-1} \gamma_k \, n_k\!\left(\sum_{j=0}^{J-1} \beta_{k,j} x_j\right) \tag{3}$$
where µ 0 is a global bias for K individually specified ANN subnetworks, n k, with weights γ k. The inputs to each n k are themselves a linear combination of the J modeling inputs and their associated β k,j coefficients in the deepest, i.e., projection, layer of g XNN. Two g XNN models are trained by mini-batch stochastic gradient descent (SGD) on the simulated data and mortgage data. Each g XNN is assessed in five training folds and in a test data partition. L1 regularization is applied to network weights to induce a sparse and interpretable model, where each n k and corresponding γ k are ideally associated with an important X j or combination thereof. g XNN models appear highly sensitive to weight initialization and batch size. Be aware that g XNN architectures may require manual and judicious feature selection due to long training times. For more details regarding g XNN training, see the software resources in Section 2.8 and Appendices B.1 and B.3.
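The structure of the XNN described above (a projection layer, K subnetworks acting as ridge functions, and a weighted output layer) can be illustrated with a small NumPy forward pass. In this sketch the subnetworks are replaced by simple fixed functions, and all weights are made up for illustration; in a trained XNN each n k would be a small neural network and all parameters would be learned.

```python
import numpy as np


def xnn_forward(X, beta, subnetworks, gamma, mu_0):
    """Forward pass of an additive index model.

    X:           (n_rows, J) input matrix
    beta:        (K, J) projection layer weights
    subnetworks: list of K callables standing in for the ridge functions n_k
    gamma:       (K,) output layer weights
    mu_0:        global bias
    """
    projections = X @ beta.T  # (n_rows, K): one linear combination per subnetwork
    ridge_outputs = np.column_stack(
        [n_k(projections[:, k]) for k, n_k in enumerate(subnetworks)]
    )
    return mu_0 + ridge_outputs @ gamma


# Hypothetical toy parameters with K = 2 subnetworks and J = 3 inputs.
beta = np.array([[1.0, 0.0, 0.0],
                 [0.0, 0.5, 0.5]])
subnetworks = [np.tanh, np.square]  # stand-ins for trained subnetworks
gamma = np.array([2.0, -1.0])
mu_0 = 0.1

prediction = xnn_forward(np.array([[0.0, 2.0, 2.0]]), beta, subnetworks, gamma, mu_0)
```

Sparse rows of `beta` and sparse entries of `gamma` are what make the plots of projection weights, ridge functions, and output weights readable.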

One-Dimensional Partial Dependence and Individual Conditional Expectation
PD plots are a widely-used method for describing and plotting the estimated average prediction of a complex model, g, across some partition of data, X, for some interesting input feature, X j ∈ X [14]. ICE plots are a newer method that describes the local behavior of g with regard to values of an input feature in a single instance, x j . PD and ICE can be overlaid in the same plot to create a holistic global and local portrait of the predictions for some g and X j [23]. When PD(X j , g) and ICE(x j , g) curves diverge, such plots can also be indicative of modeled interactions in g or expose flaws in PD estimation, e.g., inaccuracy in the presence of strong interactions and correlations [23,30]. For details regarding the calculation of one-dimensional PD and ICE, see the software resources in Section 2.8 and Appendices B.1 and B.4.
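One-dimensional ICE and PD can be computed directly from any prediction function by overwriting a single feature with each grid value. The following sketch uses a hypothetical toy model containing an interaction, so that the ICE curves diverge from the flat PD curve, which is the pattern the text describes as indicative of interactions. The function names are illustrative and not taken from the referenced software resources.

```python
import numpy as np


def ice_curves(predict, X, feature, grid):
    """Return an (n_rows, len(grid)) array with one ICE curve per row of X."""
    X = np.asarray(X, dtype=float)
    curves = np.empty((X.shape[0], len(grid)))
    for i, value in enumerate(grid):
        X_mod = X.copy()
        X_mod[:, feature] = value  # set the feature to the grid value for all rows
        curves[:, i] = predict(X_mod)
    return curves


def partial_dependence(predict, X, feature, grid):
    """PD is the average of the ICE curves at each grid value."""
    return ice_curves(predict, X, feature, grid).mean(axis=0)


def toy_model(X):
    """Hypothetical model with a pure interaction between features 0 and 1."""
    return X[:, 0] * X[:, 1]


X_toy = np.array([[0.0, -1.0], [0.0, 1.0]])
grid = np.array([0.0, 1.0, 2.0])
ice = ice_curves(toy_model, X_toy, feature=0, grid=grid)
pd_curve = partial_dependence(toy_model, X_toy, feature=0, grid=grid)
```

Here the two ICE curves have slopes of -1 and +1 while PD stays at zero, so the average alone would hide the modeled behavior.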

Shapley Values
Shapley explanations are a class of additive, locally accurate feature contribution measures with long-standing theoretical support [24,31]. Shapley explanations are the only known locally accurate and globally consistent feature contribution values, meaning that Shapley explanation values for input features always sum to the model's prediction, g(x), for any instance x, and that Shapley explanation values should not decrease in magnitude for some instance of x j when g is changed such that x j truly makes a stronger contribution to g(x) [24,25]. Shapley values can be estimated in different ways, many of which are intractable for datasets with large numbers of input features. Tree Shapley Additive Explanations (SHAP) is a specific implementation of Shapley explanations that relies on traversing internal decision tree structures to efficiently estimate the contribution of each x j for some g(x) [25]. Tree SHAP has been implemented in popular gradient boosting libraries such as h2o, LightGBM, and XGBoost, and Tree SHAP is used to calculate accurate and consistent global and local feature importance for MGBM models in Section 3.2.2 and Appendix E.1. Deep SHAP is an approximate Shapley value technique that creates SHAP values for ANNs [24]. Deep SHAP is implemented in the shap package and is used to generate SHAP values for the two g XNN models discussed in Section 3.2.2 and Appendix E.1. For more information pertaining to the calculation of Shapley values, see Appendices B.1 and B.5.
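The local accuracy property can be checked on a small example by computing Shapley values exactly. The sketch below is a brute-force enumeration over feature subsets with a background-data value function; it is not Tree SHAP or Deep SHAP, it scales exponentially with the number of features, and the toy margin function and data are hypothetical. It also shows how contributions computed in a margin space map to a probability through the logistic (inverse logit) function.

```python
import itertools
import math

import numpy as np


def exact_shapley(predict, x, background):
    """Exact Shapley values for one instance x by subset enumeration.

    The value of a feature subset is the average prediction when those
    features are fixed to x and the rest are taken from the background rows.
    """
    x = np.asarray(x, dtype=float)
    background = np.asarray(background, dtype=float)
    n_features = x.shape[0]

    def value(subset):
        X_mix = background.copy()
        idx = list(subset)
        if idx:
            X_mix[:, idx] = x[idx]
        return predict(X_mix).mean()

    phi = np.zeros(n_features)
    for j in range(n_features):
        others = [f for f in range(n_features) if f != j]
        for size in range(n_features):
            weight = (
                math.factorial(size)
                * math.factorial(n_features - size - 1)
                / math.factorial(n_features)
            )
            for subset in itertools.combinations(others, size):
                phi[j] += weight * (value(subset + (j,)) - value(subset))
    return phi


def toy_margin(X):
    """Hypothetical model output in the margin (log-odds) space."""
    return 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + X[:, 0] * X[:, 2]


background = np.array([[0.0, 0.0, 0.0], [2.0, 2.0, 2.0], [1.0, 0.0, 1.0]])
x = np.array([3.0, 0.0, 1.0])

phi = exact_shapley(toy_margin, x, background)
base_value = toy_margin(background).mean()
margin = base_value + phi.sum()               # local accuracy: equals toy_margin(x)
probability = 1.0 / (1.0 + np.exp(-margin))   # logistic (inverse logit) transform
```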

Discrimination Testing Measures
Because many current technical discussions of fairness in ML appear inconclusive (e.g., Tutorial: 21 Fairness Definitions and Their Politics), this text will draw on regulatory and legal standards that have been used for years in regulated, high-stakes employment and financial decisions. The discussed measures are also representative of fair lending analyses and pair well with the mortgage data.
(See Appendix C for a brief discussion regarding different types of discrimination in US legal and regulatory settings, and Appendix D for remarks on practical vs. statistical significance for discrimination measures.) One common measure of disparate impact (DI) used in US litigation and regulatory settings is ME. ME is simply the difference between the percent of the control group members receiving a favorable outcome and the percent of the protected class members receiving a favorable outcome:
$$\text{ME} = 100 \cdot \left( \Pr(\hat{y} = 1 \mid X_c = 1) - \Pr(\hat{y} = 1 \mid X_p = 1) \right) \tag{4}$$
where ŷ are the model decisions (ŷ = 1 indicating a favorable decision), X p and X c represent binary markers created from some demographic attribute, c denotes the control group (often whites or males), p indicates a protected group, and Pr(·) is the operator for conditional probability. ME is a favored DI measure used by the US Consumer Financial Protection Bureau (CFPB), the primary agency charged with regulating fair lending laws at the largest US lending institutions and for various other participants in the consumer financial market (see Supervisory Highlights, Issue 9, Fall 2015). Another important DI measure is AIR, more commonly known as a relative risk ratio in settings outside of regulatory compliance.
AIR is equal to the ratio of the proportion of the protected class that receives a favorable outcome to the proportion of the control class that receives a favorable outcome:
$$\text{AIR} = \frac{\Pr(\hat{y} = 1 \mid X_p = 1)}{\Pr(\hat{y} = 1 \mid X_c = 1)} \tag{5}$$
where ŷ = 1 denotes a favorable decision. Statistically significant AIR values below 0.8 can be considered prima facie evidence of discrimination. An additional long-standing and pertinent measure of DI is SMD. SMD is often used to assess disparities in continuous features, such as income differences in employment analyses, or interest rate differences in lending. It originates from work on statistical power, and is more formally known as Cohen's d. SMD is equal to the average protected class outcome, ȳ p, minus the average control class outcome, ȳ c, divided by a measure of the standard deviation of the population, σ ŷ:
$$\text{SMD} = \frac{\bar{y}_p - \bar{y}_c}{\sigma_{\hat{y}}} \tag{6}$$
(There are several measures of the standard deviation of the score that are typically used: 1. the standard deviation of the population, irrespective of protected class status, 2. a standard deviation calculated only over the two groups being considered in a particular calculation, or 3. a pooled standard deviation, using the standard deviations for each of the two groups with weights.) Cohen defined values of this measure to have small, medium, and large effect sizes if the values exceeded 0.2, 0.5, and 0.8, respectively.
The numerator of SMD is roughly equivalent to ME; SMD then divides by the standard deviation as a standardizing factor. Because of this standardization, SMD allows for a comparison across different types of outcomes, such as inequity in mortgage closing fees or in the interest rates given on certain loans. Here, one may apply the definitions in Cohen [26] of small, medium, and large effect sizes, which represent a measure of practical significance (described in detail in Appendix D). Finally, confusion matrix measures in demographic groups, such as accuracy, false positive rate (FPR), false negative rate (FNR), and their ratios, are also considered as measures of DI in Section 3.2.3 and Appendix E.2.
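The three measures described in this section reduce to a few lines of NumPy. The sketch below assumes binary favorable-outcome decisions, boolean group masks, and the first of the standard deviation options listed above (the population standard deviation, irrespective of class status). Function names and the toy data are hypothetical, and significance testing is omitted.

```python
import numpy as np


def marginal_effect(favorable, protected, control):
    """Percent of control receiving a favorable outcome minus percent of protected."""
    favorable = np.asarray(favorable, dtype=float)
    return 100.0 * (favorable[control].mean() - favorable[protected].mean())


def adverse_impact_ratio(favorable, protected, control):
    """Protected favorable rate divided by control favorable rate."""
    favorable = np.asarray(favorable, dtype=float)
    return favorable[protected].mean() / favorable[control].mean()


def standardized_mean_difference(scores, protected, control):
    """Protected mean minus control mean, divided by the overall standard deviation."""
    scores = np.asarray(scores, dtype=float)
    return (scores[protected].mean() - scores[control].mean()) / scores.std(ddof=1)


# Toy data: the first four rows are the control group, the last four the protected group.
control = np.array([True, True, True, True, False, False, False, False])
protected = ~control
favorable = np.array([1, 1, 1, 0, 1, 0, 0, 0])
scores = np.array([0.1, 0.1, 0.3, 0.3, 0.5, 0.5, 0.7, 0.7])

me = marginal_effect(favorable, protected, control)
air = adverse_impact_ratio(favorable, protected, control)
smd = standardized_mean_difference(scores, protected, control)
```

In this toy data the favorable rates are 75% for the control group and 25% for the protected group, so ME is 50 percentage points and AIR is one-third, well below the 0.8 guideline.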

Results
Results are laid out for the simulated and mortgage datasets. Accuracy is compared for unconstrained, less interpretable g GBM and g ANN models and constrained, more interpretable g MGBM and g XNN models. Then, for the g MGBM and g XNN models, intrinsic interpretability, post-hoc explanation, and discrimination testing results are explored.

Simulated Data Results
Fit comparisons between unconstrained and constrained models and XNN interpretability results are discussed in Sections 3.1.1 and 3.1.2. As model training and assessment on the simulated data is a rough validation exercise meant to showcase expected results on data with known characteristics, and given that most of the techniques in the proposed workflow are already used widely or have been validated elsewhere, reporting of simulated data results in the main text will focus mostly on fit measures and the more novel g XNN interpretability results. The bulk of the post-hoc explanation and discrimination testing results for the simulated data are left to Appendix E.
Table 1 presents a variety of fit measures for g GBM, g MGBM, g ANN, and g XNN on the simulated test data. g XNN exhibited the best performance, but the models exhibited only a fairly small range of fit results. Interpretability and explainability benefits of the constrained models appeared to come at little cost to overall model performance, or in the case of g ANN and g XNN, no cost at all. For the displayed measures, g MGBM performed ∼2% worse on average than g GBM. g XNN performed ∼0.5% better on average than g ANN, and g XNN actually showed slightly better fit than g ANN across all fit measures except specificity. Fit measures that required a probability cutoff were taken at the best F1 threshold for each model.
Table 1. Fit measures for g GBM (X), g MGBM (X), g ANN (X), and g XNN (X) on the simulated test data. Arrows indicate the direction of improvement for each measure and the best result in each column is displayed in bold font.
For g XNN, inherent interpretability manifested as plots of sparse γ k output layer weights, n k subnetwork ridge functions, and sparse β k,j weights in the bottom projection layer. Figure 1 provides detailed insights into the structure of g XNN (also described in Equation (3)).
Figure 1a displays the sparse γ k weights of the output layer, where only n k subnetworks with k ∈ {1, 4, 7, 8, 9} were associated with large magnitude weights. The n k subnetwork ridge functions appear in Figure 1b as simplistic but distinctive functional forms. Color-coding between Figure 1a,b visually reinforces the direct feed-forward relationship between the n k subnetworks and the γ k weights of the output layer.

n k subnetworks were plotted across the output values of their associated ∑ j β k,j x j projection layer hidden units, and color-coding between Figure 1b,c links the β k,j weights to their n k subnetworks. Most of the heavily utilized n k subnetworks had sparse weights in their ∑ j β k,j x j projection layer hidden units. In particular, subnetwork n 1 appeared to be almost solely a function of X Friedman,3 and appeared to exhibit the expected quadratic behavior for X Friedman,3. Subnetworks n 7, n 8, and n 9 appeared to be most associated with the globally important X Friedman,1 and X Friedman,2 features, likely betraying the effort required for g XNN to model the nonlinear sin() function of the X Friedman,1 and X Friedman,2 product, and these subnetworks, especially n 7 and n 8, appeared to display some noticeable sinusoidal characteristics. Subnetwork n 4 seemed to be a linear combination of all the original input X j features, but did weigh the linear X Friedman,4 and X Friedman,5 terms roughly in the correct two-to-one ratio. As a whole, Figure 1a-c exhibited evidence that g XNN learned about the signal-generating function in Equation (1), and the displayed information should help practitioners understand which original input X j features were weighed heavily in each n k subnetwork, and which n k subnetworks have a strong influence on g XNN (X) output. See Appendix B.3 for additional details regarding general XNN architecture.
Figure 1. Output layer γ k weights, corresponding n k subnetwork ridge functions, and associated projection layer β k,j weights for g XNN on the simulated data.
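The fit measures above that require a probability cutoff are taken at the best F1 threshold for each model. A minimal sketch of such a threshold search follows; it assumes that probabilities at or above the threshold are classified as positive and that candidate thresholds are the unique predicted probabilities, which may differ from the conventions used in the referenced software resources.

```python
import numpy as np


def best_f1_threshold(y_true, y_prob, thresholds=None):
    """Return (threshold, F1) for the candidate threshold that maximizes F1."""
    y_true = np.asarray(y_true).astype(bool)
    y_prob = np.asarray(y_prob, dtype=float)
    if thresholds is None:
        thresholds = np.unique(y_prob)  # sorted unique predicted probabilities
    best_threshold, best_f1 = None, -1.0
    for t in thresholds:
        y_pred = y_prob >= t
        tp = np.sum(y_pred & y_true)
        fp = np.sum(y_pred & ~y_true)
        fn = np.sum(~y_pred & y_true)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom > 0 else 0.0
        if f1 > best_f1:  # ties keep the lowest threshold
            best_threshold, best_f1 = float(t), float(f1)
    return best_threshold, best_f1


threshold, f1 = best_f1_threshold([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```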

Mortgage Data Results
Results for the mortgage data are presented in Sections 3.2.1-3.2.3 to showcase the example workflow. g ANN and g XNN outperformed g GBM and g MGBM on the mortgage data, but as in Section 3.1.1, the constrained variants of both model architectures did not show large differences in model fit with respect to unconstrained variants. Assuming that in high-stakes applications small fit differences on static test data do not outweigh the need for enhanced model debugging facilitated by high interpretability, only g MGBM and g XNN interpretability, post-hoc explainability, and discrimination testing results are presented.
Table 2 shows that g ANN and g XNN noticeably outperformed g GBM and g MGBM on the mortgage data for most of the fit measures. This is at least partially due to the preprocessing required to present directly comparable post-hoc explainability results and to use neural networks and TensorFlow, e.g., numerical encoding of categorical features and missing values. This preprocessing appears to hamstring some of the tree-based models' inherent capabilities. g GBM models trained on non-encoded data with missing values repeatedly produced receiver operating characteristic area under the curve (AUC) values of ∼0.81 (not shown, but available in resources discussed in Section 2.8). Regardless of the fit differences between the two families of models, the difference between the constrained and unconstrained variants within the two types of models is small for the GBMs and smaller for the ANNs, ∼3.5% and ∼1% worse fit respectively, averaged across the measures in Table 2.
Table 2. Fit measures for g GBM (X), g MGBM (X), g ANN (X), and g XNN (X) on the mortgage test data. Arrows indicate the direction of improvement for each measure and the best result in each column is displayed in bold font.

Interpretability and Post-hoc Explanation Results
For g MGBM (X), intrinsic interpretability was evaluated with PD and ICE plots of mostly monotonic prediction behavior for several important X j , and post-hoc Shapley explanation analysis was used to create global and local feature importance. Global Shapley feature importance for g MGBM (X) on the mortgage test data is reported in Figure 2. g MGBM placed high importance on LTV ratio, perhaps too high, and also weighed DTI ratio, property value, loan amount, and introductory rate period heavily in many of its predictions. Tree SHAP values are reported in the margin space, prior to the application of the logit link function, and the reported numeric values can be interpreted as the mean absolute impact of each X j on g MGBM (X) in the mortgage test data in the g MGBM (X) margin space. The potential over-emphasis of LTV ratio, and the de-emphasis of income, likely an important feature from a business perspective, and the de-emphasis of the encoded no introductory rate period flag feature may also contribute to the decreased performance of g MGBM (X) as compared to g XNN (X). Domain knowledge was used to positively constrain DTI ratio and LTV ratio and to negatively constrain income and the loan term flag under g MGBM . The monotonicity constraints for DTI ratio and LTV ratio were confirmed for g MGBM (X) on the mortgage test data in Figure 3. Both DTI ratio and LTV ratio displayed positive monotonic behavior at all selected percentiles of g MGBM (X) for ICE and on average with PD. Because PD curves generally followed the patterns of the ICE curves for both features, it is also likely that no strong interactions were at play for DTI ratio and LTV ratio under g MGBM . Of course, the monotonicity constraints themselves could have dampened the effects of non-monotonic interactions under g MGBM , even if they did exist in the training data (e.g., LTV ratio and the no introductory rate period flag, see Figure 6). 
This rigidity could also have played a role in the performance differences between g MGBM (X) and g XNN (X) in the mortgage data not observed for the simulated data, wherein strong interactions appeared to be between features with the same monotonicity constraints (e.g., X Friedman,1 and X Friedman,2 , see Figure 1).
PD and ICE are displayed with a histogram to highlight any sparse regions in an input feature's domain. Because most ML models will always issue a prediction on any instance with a correct schema, it is crucial to consider whether a given model learned enough about an instance to make an accurate prediction. Viewing PD and ICE along with a histogram is a convenient method to visually assess whether a prediction is reasonable and based on sufficient training data. The DTI ratio and LTV ratio do appear to have had sparse regions in their univariate distributions. The monotonicity constraints likely play to the advantage of g MGBM in this regard, as g MGBM (X) appears to carry reasonable predictions learned from dense domains into the sparse domains of both features. Figure 3 also displays PD and ICE for the unconstrained feature property value. Unlike the DTI ratio and LTV ratio, PD for property value did not always follow the patterns established by ICE curves. While PD showed monotonically increasing prediction behavior on average, apparently influenced by large predictions at extreme g MGBM (X) percentiles, ICE curves for individuals at the 40th percentile of g MGBM (X) and lower exhibited different prediction behavior with respect to property value. Some individuals at these lower percentiles displayed monotonically decreasing prediction behavior, while others appeared to show fluctuating prediction behavior. Property value was strongly right-skewed, with little data regarding high-value property from which g MGBM can learn. For the most part, reasonable predictions did appear to be carried from more densely populated regions to more sparsely populated regions. However, prediction fluctuations at lower g MGBM (X) percentiles were visible, and appeared in a sparse region of property value. 
This divergence of PD and ICE could be indicative of an interaction affecting property value under g MGBM [23], and analysis by surrogate decision tree did show evidence of numerous potential interactions in lower prediction ranges of g MGBM (X) [32] (not shown, but available in resources discussed in Section 2.8). Fluctuations in ICE could also have been caused by overfitting or by leakage of strong non-monotonic signal from important constrained features into the modeled behavior of non-constrained features.
In Figure 4, local Tree SHAP values are displayed for selected individuals at the 10th, 50th, and 90th percentiles of g MGBM (X) in the mortgage test data. Each Shapley value in Figure 4 represents the difference between g MGBM (x) and the average of g MGBM (X) associated with this instance of some input feature x j [33]. Accordingly, applying the inverse logit (logistic) function to the sum of the Shapley values and the average of g MGBM (X) in the margin space yields g MGBM (x), the prediction in the probability space for any x. The value of Figure 4 lies in the ability of Tree SHAP to accurately and consistently summarize any single g MGBM (x) prediction in this manner, which is generally important for enabling logical appeal or override of ML-based decisions, and is specifically important in the context of lending, where applicable regulations often require lenders to provide consumer-specific reasons for denying credit to an individual. In the US, applicable regulations are typically ECOA and FCRA, and the consumer-specific reasons are commonly known as adverse action codes.
The Shapley values in Figure 5 are the estimated average absolute impact of each input, X j, in the projection layer and probability space of g XNN (X) for the mortgage test data. g XNN distributes importance more evenly across business drivers and puts stronger emphasis on the no introductory rate period flag feature than does g MGBM. Like g MGBM, g XNN puts little emphasis on the other flag features.
Unlike g MGBM, g XNN assigned higher importance to property value, loan amount, and income, and lower importance to LTV ratio and DTI ratio. The capability of g XNN to model nonlinear phenomena and high-degree interactions, and to do so in an interpretable manner, is on display in Figure 6. Figure 6a presents the sparse γ k weights of the g XNN output layer in which the n k subnetworks with k ∈ {0, 1, 2, 3, 5, 8, 9} had large magnitude weights and n k subnetworks, k ∈ {4, 6, 7}, had small or near-zero magnitude weights. Distinctive ridge functions that fed into those large magnitude γ k weights are highlighted in Figure 6b and color-coded to pair with their corresponding γ k weight. As in Section 3.1.2, n k ridge function plots varied with the output of the corresponding projection layer ∑ j β k,j x j hidden unit, with weights displayed in matching colors in Figure 6c. In both the simulated and mortgage data, n k ridge functions appeared to be elementary functional forms that the output layer learned to combine to generate accurate predictions, reminiscent of basis functions for the modeled space. Figure 6c displays the sparse β k,j weights of the projection layer ∑ j β k,j x j hidden units that were associated with each n k subnetwork ridge function. For instance, subnetwork n 3 was influenced by large weights for LTV ratio, no introductory rate period flag, and introductory rate period, whereas subnetwork n 9 was nearly completely dominated by the weight for income. See Appendix B.3 for details regarding general XNN architecture.
Figure 6. Output layer γ k weights, corresponding n k ridge functions, and associated projection layer β k,j weights for g XNN on the mortgage data.
To complement the global interpretability of g XNN , Figure 7 displays local Shapley values for selected individuals, estimated from the projection layer using Deep SHAP in the g XNN probability space. Similar to Tree SHAP, local Deep SHAP values should sum to g XNN (x). While the Shapley values appeared to follow the roughly increasing pattern established in Figures A4, A6, and 4, their true value was their ability to be calculated for any g XNN (x) prediction, as a means to summarize model reasoning and allow for appeal and override of specific ML-based decisions.

Discrimination Testing Results
Table 3a,b show the results of the discrimination tests using the mortgage data for two sets of class-control groups: blacks as compared to whites, and females as compared to males. As with the simulated data in Table A1, several measures of disparities are shown, with the SMDs calculated using the probabilities from g MGBM (X) and g XNN (X), and the accuracy, FPRs, and FPR ratios, MEs, and AIRs calculated using a binary outcome based on a cutoff of 0.20 (anyone with probabilities of 0.2 or less receives the favorable outcome; see Appendix F for comments pertaining to discrimination testing and cutoff selection). Since g MGBM and g XNN were predicting the likelihood of receiving a high-priced loan, g MGBM and g XNN assume that a lower score was favorable. Thus, one might consider FPR ratios as a measure of the class-control disparities. FPR ratios were higher under g XNN than g MGBM (2.45 vs. 2.10) in Table 3b, but overall FPRs were lower for blacks under g XNN (0.295 vs. 0.315) in Table 3a. This is the same pattern seen in the simulated data results in Appendix E.2, leading to the question of whether a fairness goal should not only consider class-control relative rates, but also intra-class improvements in the chosen fairness measure. Similar results were found for the female-male comparison, but the relative rates are less stark: 1.15 for g MGBM (X) and 1.21 for g XNN (X).
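The measures named above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the code used to produce Table 3: the function name `disparity_measures`, its arguments, and the toy scores are hypothetical, and scaling the SMD by the standard deviation of all scores combined is an assumption. It follows the convention stated above that a score at or below the cutoff is the favorable outcome.

```python
import numpy as np


def disparity_measures(p_protected, p_control, y_protected, y_control, cutoff=0.20):
    """Compare a protected group to a control group when a LOW score is favorable."""
    p_protected = np.asarray(p_protected, dtype=float)
    p_control = np.asarray(p_control, dtype=float)
    y_protected = np.asarray(y_protected)
    y_control = np.asarray(y_control)

    # Favorable outcome: predicted probability at or below the cutoff.
    fav_protected = p_protected <= cutoff
    fav_control = p_control <= cutoff

    # AIR: protected favorable rate divided by control favorable rate.
    air = fav_protected.mean() / fav_control.mean()
    # ME: control favorable rate minus protected favorable rate, in percentage points.
    me = 100.0 * (fav_control.mean() - fav_protected.mean())
    # SMD: difference in mean scores scaled by the standard deviation of all scores
    # (the choice of denominator is an assumption of this sketch).
    all_scores = np.concatenate([p_protected, p_control])
    smd = (p_protected.mean() - p_control.mean()) / all_scores.std(ddof=1)

    # FPR: share of actual negatives (y == 0) scored above the cutoff.
    fpr_protected = (~fav_protected)[y_protected == 0].mean()
    fpr_control = (~fav_control)[y_control == 0].mean()

    return {"air": air, "me": me, "smd": smd, "fpr_ratio": fpr_protected / fpr_control}


# Hypothetical scores and labels, for illustration only.
result = disparity_measures(
    p_protected=[0.10, 0.30, 0.50, 0.15], p_control=[0.05, 0.10, 0.25, 0.15],
    y_protected=[0, 0, 1, 0], y_control=[0, 0, 0, 1],
)
```

Because the SMD uses the raw probabilities while AIR, ME, and FPR depend on the cutoff, changing `cutoff` moves the latter three measures but leaves the SMD unchanged.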
Both ME and AIR showed higher disparities for blacks under g XNN than g MGBM . Blacks receive high-priced loans 21.4% more frequently using g XNN vs. 18.3% for g MGBM . Both g MGBM and g XNN showed AIRs that were statistically significantly below parity (not shown, but available in resources discussed in Section 2.8), and which were also below the EEOC's 0.80 threshold. This would typically indicate a need for further review to determine the cause and validity of these disparities, and a few relevant remediation techniques for such discovered discrimination are discussed in Section 4.3. On the other hand, women improved under g XNN vs. g MGBM (MEs of 3.6% vs. 4.1%; AIRs of 0.955 vs. 0.948). The AIRs, while statistically significantly below parity, were well above the EEOC's threshold of 0.80. In most situations, the values of these measures alone would not likely flag a model for further review. Black SMDs for g XNN (X) and g MGBM (X) were similar: 0.621 and 0.628, respectively. These exceeded Cohen's guidelines of 0.5 for a medium effect size and would likely trigger further review. Female SMDs were well below Cohen's definition of small effect size: 0.105 and 0.084 for g XNN (X) and g MGBM (X), respectively. Similar to results for female AIR, these values alone are unlikely to prompt further review.

Figure 8 play an important role in increasing human trust and understanding of ML, a few pertinent references and Python resources are highlighted below as further reading to augment this text's focus on certain interpretable models, post-hoc explanation, and discrimination testing techniques.
Any discussion of interpretable ML models would be incomplete without references to the seminal work of the Rudin group at Duke University and EBMs, or GA 2 Ms, pioneered by researchers at Microsoft and Cornell [29,34,35]. In keeping with a major theme of this manuscript, models from these leading researchers and several other kinds of interpretable ML models are now available as open source Python packages. Among several types of currently available interpretable models, practitioners can now use Python to evaluate EBM in the interpret package, optimal sparse decision trees, GAMs in the pyGAM package, a variant of Friedman's RuleFit in the skope-rules package, monotonic calibrated interpolated lookup tables in tensorflow/lattice, and "this looks like that" interpretable deep learning [34][35][36][37] (see Optimal Sparse Decision Trees, ProtoPNet (this looks like that)). Additional, relevant references and Python functionality include:

See Awesome Machine Learning Interpretability for a longer, community-curated metalist of related software packages and resources.

Figure 8. An example responsible ML workflow in which interpretable models, post-hoc explanations, discrimination testing and remediation techniques, among several other processes, can create an understandable and trustworthy ML system for high-stakes, human-centered, or regulated applications.

Appeal and Override of Automated Decisions
Interpretable models and post-hoc explanations can play an important role in increasing transparency into model mechanisms and predictions. As seen in Section 3, interpretable models often enable users to enforce domain knowledge-based constraints on model behavior, to ensure that models obey reasonable expectations, and to gain data-derived insights into the modeled problem domain. Post-hoc explanations generally help describe and summarize mechanisms and decisions, potentially yielding an even clearer understanding of ML models. Together they can allow for human learning from ML, certain types of regulatory compliance, and crucially, human appeal or override of automated model decisions [32]. Interpretable models and post-hoc explanations are likely good candidates for ML use cases under the FCRA, ECOA, GDPR and other regulations that may require explanations of model decisions, and they are already used in the financial services industry today for model validation and other purposes. (For example uses in financial services, see Deep Insights into Explainability and Interpretability of Machine Learning Algorithms and Applications to Risk Management. Also note that many non-consistent explanation methods can result in drastically different global and local feature importance values across different models trained on the same data or even for refreshing a model with augmented training data [33]. Consistency and accuracy guarantees are perhaps a factor in the growing momentum behind Shapley values as a candidate technique for generating consumer-specific adverse action notices for explaining and appealing automated ML-based decisions in highly-regulated settings, such as credit lending [57].)
In general, transparency in ML also facilitates additional responsible AI processes such as model debugging, model documentation, and logical appeal and override processes, some of which may also be required by applicable regulations (e.g., US Federal Reserve Bank SR 11-7: Guidance on Model Risk Management). Among these, providing persons affected by a model with the opportunity to appeal ML-based decisions may deserve the most attention. ML models are often wrong ("All models are wrong, but some are useful." George Box, Statistician (1919-2013)) and appealing black-box decisions can be difficult (e.g., When a Computer Program Keeps You in Jail). For high-stakes, human-centered, or regulated applications that are trusted with mission- or life-critical decisions, the ability to logically appeal or override inevitable wrong decisions is not only a possible prerequisite for compliance, but also a failsafe procedure for those affected by ML decisions.

Discrimination Testing and Remediation in Practice
A significant body of research has emerged around exploring and fixing ML discrimination [58]. Methods can be broadly placed into two groups: more traditional methods that mitigate discrimination by searching across possible algorithmic and feature specifications, and many approaches that have been developed in the last 5-7 years that alter the training algorithm, preprocess training data, or post-process predictions in order to diminish class-control correlations or dependencies. Whether these more recent methods are suitable for a particular use case depends on the legal environment where a model is deployed and on the use case itself. For comments on why some recent techniques could result in regulatory non-compliance in certain scenarios, see Appendix G.
Of the newer class of fairness-enhancing interventions, within-algorithm discrimination mitigation techniques that do not use class information may be more likely to be acceptable in highly regulated settings today. These techniques often incorporate a loss function where more discriminatory paths or weights are penalized and only used by the model if improvements in fit overcome some penalty. (The relative level of fit-to-discrimination penalty is usually determined via a hyperparameter.) Other mitigation strategies that only alter hyperparameters or algorithm choice are also likely to be acceptable. Traditional feature selection techniques (e.g., those used in linear models and decision trees) are also likely to continue to be accepted in regulatory environments. For further discussion of techniques that can mitigate DI in US financial services, see Schmidt and Stephens [59].
Regardless of the methodology chosen to minimize disparities, advances in computing have enhanced the ability to search for less discriminatory models. Prior to these advances, only a small number of alternative algorithms could be tested for lower levels of disparity without causing infeasible delays in model implementation. Now, large numbers of models can be quickly tested for lower discrimination and better predictive quality. An additional opportunity arises as a result of ML itself: the well-known Rashomon effect, or the multiplicity of good ML models for most datasets. It is now feasible to train more models, find more good models, and test more models for discrimination, and among all those tested models, there are likely to be some with high predictive performance and low discrimination.
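One simple way to act on this multiplicity of good models is to keep every candidate whose predictive quality is close to the best one found, and then prefer the candidate with the least disparity. The sketch below is only an illustration of that idea: the candidate records, the use of AUC and AIR as the two criteria, and the `auc_tolerance` parameter are all hypothetical choices, not a procedure prescribed by this text.

```python
def pick_less_discriminatory(candidates, auc_tolerance=0.01):
    """Among near-best models by AUC, return the one whose AIR is closest to parity."""
    best_auc = max(c["auc"] for c in candidates)
    # Keep every model within the tolerance of the best AUC (the set of "good" models).
    good_models = [c for c in candidates if c["auc"] >= best_auc - auc_tolerance]
    # Prefer the good model whose AIR is closest to 1.0.
    return min(good_models, key=lambda c: abs(1.0 - c["air"]))


# Hypothetical search results, for illustration only.
candidates = [
    {"name": "model_a", "auc": 0.800, "air": 0.70},
    {"name": "model_b", "auc": 0.795, "air": 0.85},
    {"name": "model_c", "auc": 0.700, "air": 0.99},
]
chosen = pick_less_discriminatory(candidates)
```

In this toy example `model_c` has the best AIR but is excluded for poor fit, so the search returns `model_b`, which gives up very little AUC relative to `model_a` for a large improvement in AIR.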

Intersectional and Non-static Risks in Machine Learning
The often black-box nature of ML, the perpetuation or exacerbation of discrimination by ML, or the privacy harms and security vulnerabilities inherent in ML are each serious and difficult problems on their own. However, evidence is mounting that these harms can also manifest as complex intersectional challenges, e.g., the fairwashing or scaffolding of biased models with ML explanations, the privacy harms of ML explanations, or the adversarial poisoning of ML models to become discriminatory [8,19,20] (e.g., Tay, Microsoft's AI chatbot, gets a crash course in racism from Twitter). (While the focus of this paper is not ML security, proposed best practices from that field do point to transparency of ML systems as a mitigating factor for some ML attacks and hacks [55]. High system complexity is sometimes considered a mitigating influence as well [60]. This is sometimes known as the transparency paradox in data privacy and security, and it likely applies to ML security as well, especially in the context of interpretable ML models and post-hoc explanation (see The AI Transparency Paradox).) Practitioners should of course consider the discussed interpretable modeling, post-hoc explanation, and discrimination testing approaches as at least partial remedies to the black-box and discrimination issues in ML. However, they should also consider that explanations can ease model stealing, data extraction, and membership inference attacks, and that explanations can mask ML discrimination. Additionally, high-stakes, human-centered, or regulated ML systems should generally be built and tested with robustness to adversarial attacks as a primary design consideration, and specifically to prevent ML models from being poisoned or otherwise altered to become discriminatory.

Accuracy, discrimination, and security characteristics of a system can change over time as well. Simply testing for these problems at training time, as presented in Section 3, is not adequate for high-stakes, human-centered, or regulated ML systems. Accuracy, discrimination, and security should be monitored in real-time and over time, as long as a model is deployed.

Conclusion
This text puts forward results on simulated data to provide some validation of constrained ML models, post-hoc explanation techniques, and discrimination testing methods. These same modeling, explanation, and discrimination testing approaches are then applied to more realistic mortgage data to provide an example of a responsible ML workflow for high-stakes, human-centered, or regulated ML applications. The discussed methodologies are solid steps toward interpretability, explanation, and minimal discrimination for ML decisions, which should ultimately enable increased fairness and logical appeal processes for ML decision subjects. Of course, there is more to the responsible practice of ML than interpretable models, post-hoc explanation, and discrimination testing, even from a technology perspective, and Section 4 also points out numerous additional references and open source Python software assets that are available to researchers and practitioners today to increase human trust and understanding in ML systems. While the complex and messy problems of racism, sexism, privacy violations, and cyber crime can probably never be solved by technology alone, this work and many others illustrate numerous ways for ML practitioners to mitigate such risks.
Author Contributions: N.G., data cleaning; GBM and MGBM, assessment and results; P.H., primary author; K.M., ANN, and XNN, implementation, assessment, and results; N.S., secondary author, data simulation and collection, and discrimination testing. All authors have read and agreed to the published version of the manuscript.
Funding: This work received no external funding.
Acknowledgments: Wen Phan for work in formalizing notation. Sue Shay for editing. Andrew Burt for ideas around the transparency paradox.

Conflicts of Interest:
The authors declare no conflict of interest. XNN was first made public by the corporate model validation team at Wells Fargo bank. Wells Fargo is a customer of, and investor in, H2O.ai and a client of BLDS, LLC. However, communications regarding XNN between Wells Fargo and Patrick Hall at H2O.ai have been extremely limited prior to and during the drafting of this manuscript. Moreover, Wells Fargo exerted absolutely no editorial control over the text or results herein.

Abbreviations
The following abbreviations are used in this manuscript:

Appendix A. Mortgage Data Details
The US HMDA law, originally enacted in 1975, requires many financial institutions that originate mortgage products to provide certain loan-level data about many types of mortgage-related products on an annual basis. This information is first provided to the CFPB, which subsequently releases some of the data to the public. Regulators often use HMDA data to, "... show whether lenders are serving the housing needs of their communities; they give public officials information that helps them make decisions and policies; and they shed light on lending patterns that could be discriminatory" (see Mortgage data (HMDA)). In addition to regulatory use, public advocacy groups use these data for similar purposes, and the lenders themselves use the data to benchmark their community outreach relative to their peers. The publicly available data that the CFPB releases include information such as the lender, the type of loan, loan amount, LTV ratio, DTI ratio, and other important financial descriptors. The data also include information on each borrower and co-borrower's race, ethnicity, gender, and age. Because the data include information on these protected class characteristics, certain measures that can be indicative of discrimination in lending can be calculated directly using the HMDA data. Ultimately, the HMDA data represent the most comprehensive source of data on highly-regulated mortgage lending that is publicly available, which makes it an ideal dataset to use for the types of analyses set forth in Sections 2 and 3.

Appendix B. Selected Algorithmic Details
Appendix B.1. Notation
To facilitate descriptions of data, modeling, and other post-hoc techniques, notation for input and output spaces, datasets, and models is defined.

• Input features come from the set $\mathcal{X}$ contained in a $P$-dimensional input space, $\mathcal{X} \subset \mathbb{R}^P$. An arbitrary, potentially unobserved, or future instance of $\mathcal{X}$ is denoted $\mathbf{x}$, $\mathbf{x} \in \mathcal{X}$.
• Labels corresponding to instances of $\mathcal{X}$ come from the set $\mathcal{Y}$.
• Learned output responses of models are contained in the set $\hat{\mathcal{Y}}$.

Appendix B.1.2. Data
• An input dataset $\mathbf{X}$ is composed of observed instances of the set $\mathcal{X}$ with a corresponding dataset of labels $\mathbf{Y}$, observed instances of the set $\mathcal{Y}$.
• Each $i$-th observed instance of $\mathbf{X}$ is denoted as $\mathbf{x}^{(i)}$, with corresponding $i$-th labels in $\mathbf{Y}$, $y^{(i)}$, and corresponding predictions in $\hat{\mathbf{Y}}$, $\hat{y}^{(i)}$.
• $\mathbf{X}$ and $\mathbf{Y}$ consist of $N$ tuples of observed instances, $(\mathbf{x}^{(i)}, y^{(i)})$.

Appendix B.1.3. Models
• A type of ML model $g$, selected from a hypothesis set $\mathcal{H}$, is trained to represent an unknown signal-generating function $f$ observed as $\mathbf{X}$ with labels $\mathbf{Y}$ using a training algorithm $\mathcal{A}$: $\mathbf{X}, \mathbf{Y} \xrightarrow{\mathcal{A}} g$, such that $g \approx f$.
• $g$ generates learned output responses on the input dataset, $g(\mathbf{X}) = \hat{\mathbf{Y}}$, and on the general input space, $g(\mathcal{X}) = \hat{\mathcal{Y}}$.
• A model to be explained or tested for discrimination is denoted as $g$.

Appendix B.2. Monotonic Gradient Boosting Machine Details
For some $g^{\text{MGBM}}$ model (see Equation (2)), monotonic splitting rules, $\Theta^{\text{mono}}_b$, are selected in a greedy, additive fashion by minimizing a regularized loss function, $\mathcal{L}$, that considers known target labels, $\mathbf{y}$, the predictions of all previously trained trees in $g^{\text{MGBM}}$, $g^{\text{MGBM}}_{b-1}(\mathbf{X})$, and the $b$-th tree splits applied to some instance $\mathbf{x}$, $T_b(\mathbf{x}; \Theta^{\text{mono}}_b)$, in a numeric error function (e.g., squared error, Huber error), $l$, in addition to a regularization term that penalizes complexity in the $b$-th tree, $\Omega(T_b)$. For the $b$-th iteration over $N$ instances, $\mathcal{L}_b$ can generally be defined as:

$$\mathcal{L}_b = \sum_{i=0}^{N-1} l\left(y^{(i)},\; g^{\text{MGBM}}_{b-1}(\mathbf{x}^{(i)}),\; T_b(\mathbf{x}^{(i)}; \Theta^{\text{mono}}_b)\right) + \Omega(T_b)$$

In addition to $\mathcal{L}$, $g^{\text{MGBM}}$ training is characterized by monotonic splitting rules and constraints on tree node weights. Each binary splitting rule in $T_b$, $\theta_{b,j,k} \in \Theta_b$, is associated with a feature, $X_j$, is the $k$-th split associated with $X_j$ in $T_b$, and results in left, $L$, and right, $R$, child nodes with numeric weights, $\{w_{b,j,k,L}, w_{b,j,k,R}\}$. For terminal nodes, $\{w_{b,j,k,L}, w_{b,j,k,R}\}$ can be direct numeric components of some $g^{\text{MGBM}}$ prediction. For two values, $x^{\alpha}_j$ and $x^{\beta}_j$, of some feature $X_j$ with $x^{\alpha}_j \le x^{\beta}_j$, $g^{\text{MGBM}}$ is positive monotonic with respect to $X_j$ if $g^{\text{MGBM}}(x^{\alpha}_j) \le g^{\text{MGBM}}(x^{\beta}_j)$ for all $\{x^{\alpha}_j \le x^{\beta}_j\} \in X_j$. The following rules and constraints ensure positive monotonicity in $\Theta^{\text{mono}}_b$:

1. For the first and highest split in $T_b$ involving $X_j$, any $\theta_{b,j,0}$ resulting in $T(x_j; \theta_{b,j,0}) = \{w_{b,j,0,L}, w_{b,j,0,R}\}$ where $w_{b,j,0,L} > w_{b,j,0,R}$ is not considered.
2. For any subsequent left child node involving $X_j$, any $\theta_{b,j,k \ge 1}$ resulting in $T(x_j; \theta_{b,j,k \ge 1}) = \{w_{b,j,k \ge 1,L}, w_{b,j,k \ge 1,R}\}$ where $w_{b,j,k \ge 1,L} > w_{b,j,k \ge 1,R}$ is not considered.
3. Moreover, for any subsequent left child node involving $X_j$, $T(x_j; \theta_{b,j,k \ge 1}) = \{w_{b,j,k \ge 1,L}, w_{b,j,k \ge 1,R}\}$, the weights $\{w_{b,j,k \ge 1,L}, w_{b,j,k \ge 1,R}\}$ are bounded above by the associated $\theta_{b,j,k-1}$ set of node weights.
4. Rules 1 and 2 are also applied to all right child nodes, except that for right child nodes $w_{b,j,k,L} \le w_{b,j,k,R}$ and $\{w_{b,j,k \ge 1,L}, w_{b,j,k \ge 1,R}\}$ are bounded below by the associated $\theta_{b,j,k-1}$ set of node weights.

Note that for any one $X_j$ and subtree in $g^{\text{MGBM}}$, left subtrees will always produce lower predictions than right subtrees, and that any $g^{\text{MGBM}}(\mathbf{x})$ is an addition of each full $T_b$ prediction, with the application of a monotonic logit or softmax link function for classification problems. Moreover, each tree's root node corresponds to some constant node weight that by definition obeys monotonicity constraints. Together these additional splitting rules and node weight constraints ensure positive monotonicity of $g^{\text{MGBM}}$ with respect to $X_j$. For a negative monotonic constraint, i.e., $g^{\text{MGBM}}(x^{\alpha}_j) \ge g^{\text{MGBM}}(x^{\beta}_j)$ for all $\{x^{\alpha}_j \le x^{\beta}_j\} \in X_j$, left and right splitting rules and node weight constraints are switched. Also consider that MGBM models with independent monotonicity constraints between some $X_j$ and $\mathbf{y}$ likely restrict non-monotonic interactions between multiple $X_j$. Moreover, if monotonicity constraints are not applied to all $X_j \in \mathbf{X}$, any strong non-monotonic signal in training data associated with some important $X_j$ may be forced onto some other arbitrary unconstrained $X_j$ under some $g^{\text{MGBM}}$ models, compromising the end goal of interpretability.
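The splitting rules and node weight bounds for a positive monotonic constraint can be illustrated with a small sketch. The helper names `split_allowed` and `child_bounds` are hypothetical, and the specific bound used here, the midpoint of the parent split's two child weights, is an assumption of this sketch (it mirrors a common choice in gradient boosting libraries) rather than a statement of the exact constraint used by any particular $g^{\text{MGBM}}$ implementation.

```python
import math


def split_allowed(w_left, w_right, lower=-math.inf, upper=math.inf):
    """Check a candidate split on a positively constrained feature.

    The split is rejected if the left weight exceeds the right weight, or if
    either weight falls outside the bounds inherited from splits higher in the tree.
    """
    if w_left > w_right:
        return False
    # With w_left <= w_right, checking the two extremes covers both weights.
    return lower <= w_left and w_right <= upper


def child_bounds(w_left, w_right, lower=-math.inf, upper=math.inf):
    """Return (lower, upper) weight bounds for the left and right child nodes.

    Assumption: the midpoint of the two child weights is used as the bound.
    """
    mid = (w_left + w_right) / 2.0
    return (lower, mid), (mid, upper)


# First split on the feature: left weight 0.2, right weight 0.5.
assert split_allowed(0.2, 0.5)
left_bounds, right_bounds = child_bounds(0.2, 0.5)

# A later split inside the left child must stay at or below the midpoint.
ok_split = split_allowed(0.1, 0.3, *left_bounds)
bad_split = split_allowed(0.1, 0.4, *left_bounds)
```

Bounding every descendant weight in this way is what guarantees that no prediction reached through a left branch can exceed a prediction reached through the corresponding right branch.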

Appendix B.3. Explainable Neural Network Details
$g^{\text{XNN}}$ is composed of three meta-layers:

1. The first and deepest meta-layer, composed of $K$ linear $\sum_j \beta_{k,j} x_j$ hidden units (see Equation (3)), which should learn higher magnitude weights for each important input, $X_j$, is known as the projection layer. It is fully connected to each input $X_j$. Each hidden unit in the projection layer may optionally include a bias term.
2. The second meta-layer contains $K$ hidden and separate $n_k$ ridge functions, or subnetworks. Each $n_k$ is a neural network itself, which can be parametrized to suit a given modeling task. To facilitate direct interpretation and visualization, the input to each subnetwork is the 1-dimensional output of its associated projection layer $\sum_j \beta_{k,j} x_j$ hidden unit. Each $n_k$ can contain several bias terms.
3. The output meta-layer, called the combination layer, is an output neuron composed of a global bias term, $\mu_0$, and the $K$ weighted 1-dimensional outputs of each subnetwork, $\gamma_k n_k(\sum_j \beta_{k,j} x_j)$. Again, each $n_k$ subnetwork output into the combination layer is restricted to 1-dimension for interpretation and visualization purposes.
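The three meta-layers can be traced in a short NumPy forward pass. This is a minimal sketch under stated assumptions: the function name `xnn_forward` is hypothetical, fixed functions such as `np.tanh` stand in for trained subnetworks, projection-layer bias terms are omitted, and any output link function for classification is left out.

```python
import numpy as np


def xnn_forward(X, beta, subnetworks, gamma, mu_0):
    """Forward pass through projection, subnetwork, and combination layers.

    X: (N, P) inputs; beta: (K, P) projection weights;
    subnetworks: K callables mapping an (N,) array to an (N,) array;
    gamma: (K,) combination weights; mu_0: global bias.
    """
    # Projection layer: K linear combinations of the inputs, one per hidden unit.
    projections = X @ beta.T  # shape (N, K)
    # Subnetwork layer: each ridge function sees only its own 1-dimensional projection.
    ridge_outputs = np.column_stack(
        [n_k(projections[:, k]) for k, n_k in enumerate(subnetworks)]
    )
    # Combination layer: global bias plus the gamma-weighted sum of subnetwork outputs.
    return mu_0 + ridge_outputs @ gamma


# Hypothetical two-feature, two-subnetwork example.
X = np.array([[1.0, 2.0], [0.0, 1.0]])
beta = np.array([[1.0, 0.0], [0.0, 1.0]])
gamma = np.array([2.0, 0.5])
output = xnn_forward(X, beta, [np.tanh, np.square], gamma, mu_0=0.1)
```

Because each subnetwork takes a single scalar input, its learned shape can be plotted directly against its projection, and sparse `beta` and `gamma` weights show which inputs and ridge functions matter.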

Appendix B.4. One-dimensional Partial Dependence and Individual Conditional Expectation Details
Following Friedman et al. [14], a single input feature, $X_j \in \mathbf{X}$, and its complement set, $X_{\mathcal{P} \setminus \{j\}} \in \mathbf{X}$, where $X_j \cup X_{\mathcal{P} \setminus \{j\}} = \mathbf{X}$, are considered. $\text{PD}(X_j, g)$ for a given $X_j$ is the estimated average output of the learned function $g(\mathbf{X})$ when all the observed instances of $X_j$ are set to a constant $x_\gamma \in \mathcal{X}$ and $X_{\mathcal{P} \setminus \{j\}}$ is left unchanged. $\text{ICE}(x_j, g)$ for a given instance $\mathbf{x}$ and feature $x_j$ is estimated as the output of $g(\mathbf{x})$ when $x_j$ is set to a constant $x_\gamma \in \mathcal{X}$ and all other features $x \in X_{\mathcal{P} \setminus \{j\}}$ are left untouched. PD and ICE curves are usually plotted over some set of constants drawn from $\mathcal{X}$, as displayed in Section 3.2.2 and Appendix E.1. Due to known problems for PD in the presence of strong correlation and interactions, PD should not be used alone. PD should be paired with ICE or be replaced with accumulated local effect (ALE) plots [23,30].
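The PD and ICE estimates described above reduce to a few array operations. The sketch below assumes only a hypothetical `predict` callable that maps an (N, P) array to N predictions; the function names `ice_curve` and `partial_dependence` and the toy model are illustrative.

```python
import numpy as np


def ice_curve(predict, x, j, grid):
    """ICE: predictions for one instance x as feature j is set to each grid value."""
    rows = np.tile(np.asarray(x, dtype=float), (len(grid), 1))
    rows[:, j] = grid  # all other features are left untouched
    return predict(rows)


def partial_dependence(predict, X, j, grid):
    """PD: average prediction over the dataset with feature j fixed at each grid value."""
    X = np.asarray(X, dtype=float)
    pd_values = []
    for value in grid:
        X_fixed = X.copy()
        X_fixed[:, j] = value  # set every observed instance of feature j to the constant
        pd_values.append(predict(X_fixed).mean())
    return np.array(pd_values)


# Toy model with an interaction between the two features, for illustration only.
def toy_predict(X):
    return X[:, 0] * X[:, 1]


X = np.array([[1.0, 2.0], [3.0, 4.0]])
grid = np.array([0.0, 1.0])
pd_curve = partial_dependence(toy_predict, X, j=0, grid=grid)
ice_curves = np.array([ice_curve(toy_predict, x, 0, grid) for x in X])
```

PD is the pointwise average of the ICE curves, so ICE curves that fan out or cross, as they do in this toy interaction model, signal that the PD average is hiding instance-level behavior.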

Appendix B.5. Shapley Value Details
For some instance $\mathbf{x} \in \mathcal{X}$, Shapley explanations take the form:

$$g(\mathbf{x}) = \phi_0 + \sum_{j \in \mathcal{P}} \phi_j z_j \tag{A2}$$

In Equation (A2), $\mathbf{z} \in \{0, 1\}^P$ is a binary representation of $\mathbf{x}$ where 0 indicates missingness. Each Shapley value, $\phi_j$, is the local feature contribution value associated with $x_j$, and $\phi_0$ is the average of $g(\mathbf{X})$. Each $\phi_j$ is a weighted combination of model predictions with $x_j$, $g_x(S \cup \{j\})$, and the model predictions without $x_j$, $g_x(S)$, for every possible subset of features $S$ not including $j$, $S \subseteq \mathcal{P} \setminus \{j\}$, where $g_x$ incorporates the mapping between $\mathbf{x}$ and the binary vector $\mathbf{z}$.

Local, per-instance explanations using Shapley values tend to involve ranking $x_j \in \mathbf{x}$ by $\phi_j$ values or delineating a set of the $X_j$ names associated with the $k$-largest $\phi_j$ values for some $\mathbf{x}$, where $k$ is some small positive integer, say five. Global explanations are typically the mean absolute value of the $\phi_j$ associated with a given $X_j$ across all of the instances in some set $\mathbf{X}$.
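For a handful of features, Shapley values can be computed exactly by enumerating every subset $S$, which makes the weighted combination described above concrete. The sketch below is illustrative only and is not Tree SHAP or Deep SHAP: the function name `exact_shapley` is hypothetical, the standard Shapley weights $|S|!\,(P - |S| - 1)!/P!$ are used, and, as an assumption of this sketch, a "missing" feature is replaced by its value in a single `baseline` instance, so that $\phi_0$ is the prediction at the baseline rather than an average over a dataset.

```python
from itertools import combinations
from math import factorial

import numpy as np


def exact_shapley(predict, x, baseline):
    """Exact Shapley values for one instance by enumerating all feature subsets."""
    x = np.asarray(x, dtype=float)
    baseline = np.asarray(baseline, dtype=float)
    P = len(x)

    def g_x(subset):
        # Features in `subset` take their values from x; the rest come from the baseline.
        z = baseline.copy()
        idx = np.array(list(subset), dtype=int)
        z[idx] = x[idx]
        return float(predict(z[None, :])[0])

    phi = np.zeros(P)
    for j in range(P):
        others = [m for m in range(P) if m != j]
        for size in range(P):
            weight = factorial(size) * factorial(P - size - 1) / factorial(P)
            for S in combinations(others, size):
                # Marginal contribution of feature j when added to subset S.
                phi[j] += weight * (g_x(S + (j,)) - g_x(S))
    return g_x(()), phi


# Toy model with one interaction term, for illustration only.
def toy_predict(X):
    return 2.0 * X[:, 0] + 3.0 * X[:, 1] + X[:, 0] * X[:, 2]


phi_0, phi = exact_shapley(toy_predict, x=[1.0, 1.0, 1.0], baseline=[0.0, 0.0, 0.0])
ranking = np.argsort(-np.abs(phi))  # local ranking of features by contribution size
```

The values returned satisfy `phi_0 + phi.sum() == toy_predict(x)`, and stacking `phi` for many instances and taking `np.abs(...).mean(axis=0)` gives a global importance of the kind described above. The enumeration costs on the order of $2^P$ model evaluations per feature, which is why approximate or model-specific methods are used in practice.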

Appendix C. Types of Machine Learning Discrimination in US Legal and Regulatory Settings
It is important to explain and draw a distinction between the two major types of discrimination recognized in US legal and regulatory settings, disparate treatment (DT) and disparate impact (DI). DT occurs most often in an algorithmic setting when a model explicitly uses protected class status (e.g., race, sex) as an input feature or uses a feature that is so similar to protected class status that it essentially proxies for class membership. With some limited exceptions, the use of these factors in an algorithm is illegal under several statutes in the US. DI occurs when some element of a decisioning process includes a facially neutral factor (i.e., a reasonable and valid predictor of response) that results in a disproportionate share of a protected class receiving an unfavorable outcome. In modeling, this is most typically driven by a statistically important feature that is distributed unevenly across classes, which causes more frequent unfavorable outcomes for the protected class. However, other factors, such as hyperparameter or algorithm choices, can drive DI. Crucially, legality hinges on whether changing the model, for example exchanging one feature for another or altering the hyperparameters of an algorithm, can lead to a similarly predictive model with lower DI.

Appendix D. Practical vs. Statistical Significance for Discrimination Testing
A finding of practical significance means that discovered disparity is not only statistically significant, but also passes beyond a chosen threshold that would constitute prima facie evidence of illegal discrimination. Practical significance acknowledges that any large dataset is likely to show statistically significant differences in outcomes by class, even if those differences are not truly meaningful. It further recognizes that there are likely to be situations where differences in outcomes are beyond a model user's ability to correct them without significantly degrading the quality of the model. Moreover, practical significance is also needed by model builders and compliance personnel to determine whether a model should undergo remediation efforts before it is put into production. Unfortunately, guidelines for practical significance, i.e., the threshold at which any statistically significant disparity would be considered evidence of illegal discrimination, are not as frequently codified as the standards for statistical significance. One exception, however, is in employment discrimination analyses, where the US Equal Employment Opportunity Commission (EEOC) has stated that if the AIR is below 0.80 and statistically significant, then this constitutes prima facie evidence of discrimination, which the model user must rebut in order for the DI not to be considered illegal discrimination. (Importantly, the standard of 0.80 is not a law, but a rule of thumb for agencies tasked with enforcement of discrimination laws. "Adoption of Questions and Answers To Clarify and Provide a Common Interpretation of the Uniform Guidelines on Employee Selection Procedures," Federal Register, Volume 44, Number 43 (1979).) 
It is important to note that the 0.80 measure of practical significance, also known as the 80% rule and the 4/5ths rule, is explicitly used in relation to AIR, and it is not clear that the use of this threshold is directly relevant to testing fairness for measures other than the AIR.
The legal thresholds for determining statistical significance are clearer and more consistent than those for practical significance. The first guidance in US courts occurred in a case involving discrimination in jury selection, Castaneda vs. Partida (430 US 482, Supreme Court (1977)). Here, the US Supreme Court wrote that, "As a general rule for such large samples, if the difference between the expected value and the observed number is greater than two or three standard deviations, then the hypothesis that the jury drawing was random would be suspect to a social scientist." This "two or three standard deviations" test was then applied to employment discrimination in Hazelwood School District vs. United States (433 US 299 (1977)). Out of this, a 5% two-sided test (z = 1.96), or an equivalent 2.5% one-sided test, has become a common standard for determining whether evidence of disparities is statistically significant.
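The two standards can be combined in a short check. The sketch below is an illustration, not the testing procedure used for the results in this text: it assumes a pooled two-proportion z-test as the significance test, and the function name `significance_flags` and the counts are hypothetical. The 1.96 critical value and the 0.80 AIR threshold are those discussed above.

```python
import math


def significance_flags(fav_protected, n_protected, fav_control, n_control,
                       z_critical=1.96, air_threshold=0.80):
    """Flag statistical and practical significance for favorable-outcome counts."""
    rate_protected = fav_protected / n_protected
    rate_control = fav_control / n_control

    # Pooled two-proportion z-test (an assumed choice of test for this sketch).
    pooled = (fav_protected + fav_control) / (n_protected + n_control)
    std_err = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_protected + 1.0 / n_control))
    z = (rate_protected - rate_control) / std_err

    air = rate_protected / rate_control
    statistically_significant = abs(z) > z_critical
    return {
        "air": air,
        "z": z,
        "statistically_significant": statistically_significant,
        # Practical significance here requires both conditions, per the 4/5ths rule.
        "practically_significant": statistically_significant and air < air_threshold,
    }


# Hypothetical counts: 300 of 1000 protected vs. 500 of 1000 control receive the favorable outcome.
flags = significance_flags(300, 1000, 500, 1000)
```

With large samples even a small gap in rates can exceed the 1.96 threshold, which is why the AIR condition is checked separately rather than inferred from the z-statistic.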

Appendix E. Additional Simulated Data Results
As seen in Section 3.1.1, little or no trade-off is required in terms of model fit to use the constrained models. Hence, intrinsic interpretability, post-hoc explainability, and discrimination are explored further for the g MGBM and g XNN models in Appendices E.1 and E.2. Intrinsic interpretability for g MGBM is evaluated with PD and ICE, and post-hoc explainability is highlighted via global and local Shapley explanations. For g XNN , Shapley explanation techniques are also used to generate global and local feature importance to augment interpretability results exhibited in Section 3.1.2. Both g MGBM (X) and g XNN (X) are evaluated for discrimination using AIR, ME, SMD, and other measures.

Appendix E.1. Interpretability and Post-hoc Explanation Results
Global mean absolute Shapley value feature importance for g MGBM (X) on the simulated test data is displayed in Figure A2.

Figure A2. Global mean absolute Tree SHAP feature importance for g MGBM (X) on the simulated test data.

As expected, the X Friedman,j features from the original Friedman [10] and Friedman et al. [11] formula are the main drivers of g MGBM (X) predictions, with encoded versions of the augmented categorical and binary features contributing less on average to g MGBM (X) predictions. Figure A3 highlights PD, ICE, and histograms of the most important features from Figure A2.

Figure A3. PD, ICE for ten instances across selected percentiles of g MGBM (X), and histograms for the three most important input features of g MGBM on the simulated test data.
X Friedman,1 , X Friedman,2 , and X Friedman,4 were positively monotonically constrained under g MGBM for the simulated data, and positive monotonicity looks to be confirmed on average with PD and at numerous local percentiles of g MGBM (X) with ICE. As the PD curves generally follow the patterns of the ICE curves, PD is likely an accurate representation of average feature behavior for X Friedman,1 , X Friedman,2 , and X Friedman,4 . Also because PD and ICE curves do not obviously diverge, g MGBM is likely not modeling strong interactions, despite the fact that known interactions are included in the simulated data signal-generating function in Equation (1). The one-dimensional monotonic constraints may hinder the modeling of non-monotonic interactions, but do not strongly affect overall g MGBM accuracy, perhaps due to the main drivers, X Friedman,1 , X Friedman,2 , and X Friedman,4 , all being constrained in the same direction and able to weakly interact as needed.
Local Shapley values for records at the 10th, 50th, and 90th percentiles of g MGBM (X) in the simulated test data are displayed in Figure A4.

Figure A4. Tree SHAP values for three instances across selected percentiles of g MGBM (X) for the simulated test data.

The Shapley values in Figure A4 appear to be a logical result. For the lower prediction at the 10th percentile of g MGBM (X), the largest local contributions are negative and the majority of local contributions are also negative. At the median of g MGBM (X), local contributions are roughly split between positive and negative values, and at the 90th percentile of g MGBM (X), most large contributions are positive. In each case, large local contributions generally follow global importance results in Figure A2 as well. Figure A5 shows global mean absolute Shapley feature importance for g XNN (X) on the simulated test data, using the approximate Deep SHAP technique.

Figure A5. Global mean absolute Deep SHAP feature importance for g XNN (X) on the simulated test data.
Like g MGBM , g XNN ranks the X Friedman,j features higher in terms of importance than the categorical and binary features. The consistency between the feature rankings of g MGBM and g XNN is somewhat striking, given their different hypothesis families and architectures. Both g MGBM and g XNN rank X Friedman,1 , X Friedman,2 , and X Friedman,4 as the most important features, both place X Categorical,2 and X Categorical,3 above the X Binary,1 and X Binary,2 features, both rank X Binary,1 above X Binary,2 , and both place the least importance on X Categorical,4 and X Categorical,0 .
Local Deep SHAP feature importance in Figure A6 supplements the global interpretability of g XNN displayed in Figures A5 and 1. Local Deep SHAP values are extracted from the projection layer of g XNN and reported in the probability space. Deep SHAP values can be calculated for any arbitrary g XNN (x), allowing for detailed, local summarization of individual model predictions.

Figure A6. Deep SHAP values for three instances across selected percentiles of g XNN (X) on the simulated test data.
As expected, Deep SHAP values generally increase from the 10th percentile of g XNN (X) to the 90th percentile of g XNN (X), and the globally important features are the primary contributors to the selected local g XNN (x) predictions.
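The relationship between local contributions, global mean absolute importance, and the percentile-based selection of records can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: it substitutes a linear model with independent features, for which Shapley values have the closed form w_j (x_j − mean(x_j)), in place of Tree SHAP or Deep SHAP, and the data, weights, and function names are hypothetical rather than taken from the simulated data set.

```python
import numpy as np

def linear_shap(X, weights):
    """Shapley values for f(x) = b + w.x, assuming independent features
    (a stand-in for Tree SHAP / Deep SHAP, not the same algorithm)."""
    return (X - X.mean(axis=0)) * weights

def global_importance(shap_values):
    """Global importance: mean absolute local contribution per feature."""
    return np.abs(shap_values).mean(axis=0)

def rows_at_percentiles(preds, percentiles=(10, 50, 90)):
    """Index of the record whose prediction is closest to each percentile."""
    return [int(np.argmin(np.abs(preds - np.percentile(preds, q))))
            for q in percentiles]

# Hypothetical data and model, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
weights = np.array([2.0, -1.0, 0.5, 0.0])
intercept = 0.3
preds = intercept + X @ weights

phi = linear_shap(X, weights)          # local contributions, one row per record
base_value = preds.mean()              # average prediction
importance = global_importance(phi)    # global mean absolute importance
rows = rows_at_percentiles(preds)

for q, i in zip((10, 50, 90), rows):
    print(f"percentile {q}: prediction={preds[i]:.3f}, "
          f"contributions={np.round(phi[i], 3)}")
```

Because each row of local contributions sums to the prediction minus the average prediction, records at low percentiles are dominated by negative contributions and records at high percentiles by positive ones, which is the pattern described for Figures A4 and A6.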

Appendix E.2. Discrimination Testing Results
Tables A1a,b show the results of the disparity tests using the simulated data for two hypothetical sets of class-control groups. Several measures of disparity are shown. The SMDs are calculated using the probabilities from g MGBM (X) and g XNN (X), whereas the FNRs, their ratios, MEs, and AIRs are calculated using a binary outcome based on a cutoff of 0.6 (anyone with a probability of 0.6 or greater receives the favorable outcome).
Since g MGBM and g XNN assume that a higher score is favorable (as might be the case if the model were predicting responses to marketing offers), one might consider the relative FNRs as a measure of the class-control disparities. Table A1b shows that protected group 1 has a higher relative FNR under g XNN (1.13 vs. 1.06). However, Table A1a shows that the overall FNRs are lower for g XNN (0.357 vs. 0.401). This illustrates a danger in considering relative class-control measures in isolation when comparing across models: despite g MGBM appearing to be the relatively fairer model, more protected group 1 members experience negative outcomes under g MGBM . This is because the FNR improves under g XNN for both protected group 1 and control group 1, but members of control group 1 benefit more than those in protected group 1. Of course, the choice of which model is truly fairer is a policy question. For g XNN (X), 12.0% fewer control group 1 members receive the favorable offer under the ME column in Table A1b. Note that a 12.0% difference is not meaningful without context. If the population of control group 1 and control group 2 were substantially similar in relevant characteristics, 12.0% could represent an extremely large difference and would require remediation. However, if they represent substantially different populations, then 12.0% could represent a reasonable deviation from parity. As an example, if a lending institution that has traditionally focused on high credit quality clients were to expand into previously under-banked communities, a 12.0% class-control difference in loan approval rates might be expected because the average credit quality of the new population would be lower than that of the existing population. Protected group 1's AIR under g XNN is 0.727, below the EEOC 4/5ths rule threshold. It is also highly statistically significant (not shown, but available in resources discussed in Section 2.8). 
Together these would indicate that there may be evidence of illegal DI. As with ME and other measures, the reasonableness of this disparity is not clear outside of context. However, most regulated institutions that do perform discrimination analyses would find an AIR of this magnitude concerning and warranting further review. Some pertinent remediation strategies for discovered discrimination are discussed in Section 4.3.
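For readers who wish to reproduce measures of this kind, the following is a minimal sketch of how AIR, ME, and relative FNR might be computed from model probabilities at a fixed cutoff. It assumes the usual conventions (AIR as the protected group's favorable-outcome rate divided by the control group's, ME as the control rate minus the protected rate in percentage points, and relative FNR as the protected FNR divided by the control FNR). The function names and the ten-record toy data are hypothetical and are not drawn from Tables A1a,b.

```python
import numpy as np

def favorable_rate(decisions, mask):
    """Share of a group receiving the favorable (positive) decision."""
    return decisions[mask].mean()

def false_negative_rate(y_true, decisions, mask):
    """Share of a group's actual positives that are predicted negative."""
    positives = mask & (y_true == 1)
    return (decisions[positives] == 0).mean()

def disparity_measures(y_true, probs, protected, control, cutoff=0.6):
    """AIR, ME (percentage points), and relative FNR at a given cutoff.
    A higher score is assumed to be favorable."""
    decisions = (probs >= cutoff).astype(int)
    rate_p = favorable_rate(decisions, protected)
    rate_c = favorable_rate(decisions, control)
    fnr_p = false_negative_rate(y_true, decisions, protected)
    fnr_c = false_negative_rate(y_true, decisions, control)
    return {
        "AIR": rate_p / rate_c,
        "ME": 100.0 * (rate_c - rate_p),
        "FNR_protected": fnr_p,
        "FNR_control": fnr_c,
        "relative_FNR": fnr_p / fnr_c,
    }

# Toy data (hypothetical): first five records protected, last five control.
protected = np.array([True] * 5 + [False] * 5)
control = ~protected
probs = np.array([0.70, 0.50, 0.65, 0.30, 0.40, 0.80, 0.62, 0.55, 0.90, 0.20])
y_true = np.array([1, 1, 1, 0, 1, 1, 1, 1, 1, 0])

results = disparity_measures(y_true, probs, protected, control, cutoff=0.6)
results["below_four_fifths"] = results["AIR"] < 0.8  # EEOC 4/5ths screen
for name, value in results.items():
    print(name, value)
```

Reporting the group-level FNRs alongside their ratio, as this sketch does, guards against the pitfall noted above, in which a model with a better relative measure can still produce more negative outcomes for the protected group.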
SMD is used here to measure g MGBM (X) and g XNN (X) probabilities prior to being transformed into classifications. (This measurement would be particularly relevant if the probabilities are used in combination with other models to determine an outcome.) The results show that g MGBM (X) has less DI than g XNN (X) (−0.206 vs. −0.274), but both are close to Cohen's small effect threshold of −0.20. Whether a small effect would be a highlighted concern would depend on an organization's chosen threshold for flagging models for further review.
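A short sketch of an SMD calculation on raw probabilities follows. It assumes SMD is defined as the protected group's mean score minus the control group's mean score, divided by the standard deviation of the scores over the whole population. Other standardizations (for example, a pooled within-group standard deviation) are also in use, so this definition should be checked against the one adopted in a given analysis. The scores are synthetic and the function name is hypothetical.

```python
import numpy as np

def standardized_mean_difference(scores, protected, control):
    """(mean protected score - mean control score) / population std. dev.
    Negative values mean the protected group scores lower on average."""
    diff = scores[protected].mean() - scores[control].mean()
    return diff / scores.std(ddof=1)

# Synthetic scores (hypothetical): protected group shifted slightly lower.
rng = np.random.default_rng(0)
n = 1000
scores = np.clip(
    np.concatenate([rng.normal(0.45, 0.15, n), rng.normal(0.50, 0.15, n)]),
    0.0, 1.0,
)
protected = np.array([True] * n + [False] * n)
control = ~protected

smd = standardized_mean_difference(scores, protected, control)
flagged = abs(smd) >= 0.20  # Cohen's small-effect threshold as a review trigger
print(f"SMD = {smd:.3f}, flagged for review: {flagged}")
```

The 0.20 trigger is only one possible choice; as noted above, the threshold for flagging a model is an organizational decision.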

Appendix F. Discrimination Testing and Cutoff Selection
The cutoff used in production is typically selected based on the model's use case, rather than solely on the statistical properties of the predictions themselves. For example, a model developer at a bank might build a credit model where the F1 score is maximized at a delinquency probability cutoff of 0.15. For purposes of evaluating the quality of the model, she may review confusion matrix statistics (accuracy, recall, precision, etc.) using cutoffs based on the maximum F1 score. However, because of its risk tolerance and other factors, the bank itself might be willing to lend to anyone with a delinquency probability of 0.18 or lower, which would mean that anyone who is scored at 0.18 or lower would receive an offer of credit. Because disparity analyses are concerned with how people are affected by the deployed model, it is essential that any confusion matrix-based measures of disparity be calculated on the in-production classification decisions, rather than on cutoffs that are not related to what those affected by the model will experience.
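The distinction between a statistically chosen cutoff and the in-production cutoff can be illustrated with a small sketch. The data below are synthetic: the score distribution, the 0.03 shift applied to the protected group, and all function names are hypothetical. The production cutoff of 0.18 mirrors the example above, and the F1-maximizing cutoff is found by a simple grid search rather than being fixed at 0.15.

```python
import numpy as np

def f1_at_cutoff(y_true, p_delinq, cutoff):
    """F1 score when records scoring above the cutoff are predicted delinquent."""
    pred = p_delinq > cutoff
    tp = np.sum(pred & (y_true == 1))
    fp = np.sum(pred & (y_true == 0))
    fn = np.sum(~pred & (y_true == 1))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom > 0 else 0.0

def air_at_cutoff(p_delinq, protected, control, cutoff):
    """AIR of approval rates, where a score at or below the cutoff is approved."""
    approved = p_delinq <= cutoff
    return approved[protected].mean() / approved[control].mean()

# Synthetic delinquency probabilities (hypothetical).
rng = np.random.default_rng(1)
n = 4000
protected = rng.random(n) < 0.3
control = ~protected
p_delinq = np.clip(rng.beta(2, 10, size=n) + 0.03 * protected, 0.0, 1.0)
y_true = (rng.random(n) < p_delinq).astype(int)

# Cutoff a developer might use for model evaluation: maximize F1 on a grid.
grid = np.linspace(0.05, 0.50, 46)
f1_scores = [f1_at_cutoff(y_true, p_delinq, c) for c in grid]
f1_cutoff = float(grid[int(np.argmax(f1_scores))])

# Cutoff the institution actually deploys.
production_cutoff = 0.18

air_f1 = air_at_cutoff(p_delinq, protected, control, f1_cutoff)
air_prod = air_at_cutoff(p_delinq, protected, control, production_cutoff)
print(f"F1-maximizing cutoff {f1_cutoff:.2f}: AIR = {air_f1:.3f}")
print(f"Production cutoff {production_cutoff:.2f}: AIR = {air_prod:.3f}")
```

Only the AIR at the production cutoff describes what applicants would actually experience, so that is the figure a disparity analysis should report.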

Appendix G. Recent Fairness Techniques in US Legal and Regulatory Settings
Great care must be taken to ensure that the appropriate discrimination measures are employed for any given use case. Additionally, the effects of changing a model must be viewed holistically. For example, the mortgage data disparity analysis in Section 3.2.3 shows that if one were to choose g MGBM over g XNN because g MGBM has a lower FPR ratio for blacks, it would ultimately lead to a higher FPR for blacks overall, which may represent doing more harm than good. Furthermore, using some recently developed discrimination mitigation methods may lead to non-compliance with anti-discrimination laws and regulations. A fundamental maxim of US anti-discrimination law is that (to slightly paraphrase), "similarly situated people should be treated similarly." (In the pay discrimination case, Bazemore v. Friday, 478 U.S. 385 (1986), the US Supreme Court found that, "Each week's paycheck that delivers less to a black than to a similarly situated white is a wrong actionable ..." Beyond the obvious conceptual meaning, what specifically constitutes similarly situated is controversial and its interpretation differs by circuit.) A model developed without inclusion of class status (or proxies thereof) considers similarly situated people the same on the dimensions included in the model: people who have the same feature values will have the same model output (though there may be some small or random differences in outcomes due to computational issues). Obviously, the inclusion of protected class status will change model output by class. With possible rare exceptions, this is likely to be viewed with legal and regulatory skepticism today, even if including class status is done with fairness as the goal. (In a reverse discrimination case, Ricci v. DeStefano, 557 U.S. 557 (2009), the court found that any consideration of race which is not justified by correcting for past proven discrimination is illegal and, moreover, a lack of fairness is not necessarily evidence of illegal discrimination.) 
Preprocessing and post-processing techniques may be similarly problematic, because industries that must provide explanations to those who receive unfavorable treatment (e.g., adverse action notices in US financial services) may have to incorporate the class adjustments into their explanations as well.