A General Framework for Fair Regression

Fairness, through its many forms and definitions, has become an important issue facing the machine learning community. In this work, we consider how to incorporate group fairness constraints into kernel regression methods, applicable to Gaussian processes, support vector machines, neural network regression and decision tree regression. Further, we focus on examining the effect of incorporating these constraints in decision tree regression, with direct applications to random forests and boosted trees amongst other widely used inference techniques. We show that the order of complexity of memory and computation is preserved for such models and tightly bound the expected perturbations to the model in terms of the number of leaves of the trees. Importantly, the approach works on trained models and hence can be easily applied to models in current use, and group labels are only required on training data.


Introduction
As machine learning and algorithmic decision making continue to proliferate throughout industry, their net societal impact has come under greater scrutiny. In the USA under the Obama administration, a report on big data collection and analysis found that "big data technologies can cause societal harms beyond damages to privacy" [1]. The report feared that algorithmic decisions informed by big data may have harmful biases, further discriminating against disadvantaged groups. This, along with other similar findings, has led to a surge in research around algorithmic fairness and the removal of bias from big data.
The term fairness, with respect to some sensitive feature or set of features, has a range of potential definitions. In this work, impact parity is considered. In particular, this work is concerned with group fairness under the following definitions as taken from [2].
Group Fairness: A predictor $H : X \to Y$ achieves fairness with bias $\epsilon$ with respect to groups $A, B \subseteq X$ and $O \subseteq Y$ being any subset of outcomes iff,
$$\left| P\{H(x_i) \in O \mid x_i \in A\} - P\{H(x_j) \in O \mid x_j \in B\} \right| \leq \epsilon.$$
The above definition can also be described as statistical or demographic parity. Group fairness has found widespread application in India and the USA, where affirmative action has been used to address discrimination against caste, race and gender [3][4][5].
The above definition does not, unfortunately, have a natural application to regression problems. One approach to get around this would be to alter the definition to bound the absolute difference between the respective marginal distributions over the output space. However, this is a strong requirement and may hinder the model's ability to model the function space appropriately. Rather, a weaker and potentially more desirable constraint would be to force the expectations of the marginal distributions over the output space to be equal.
Group Fairness in Expectation: A regressor $f : X \to Y$ achieves fairness with respect to groups $A, B \subseteq X$ iff,
$$\mathbb{E}[f(x_i) \mid x_i \in A] - \mathbb{E}[f(x_j) \mid x_j \in B] = 0.$$
There are many machine learning techniques with which Group Fairness in Expectation constraints (GFE constraints) may be incorporated. While constraining kernel regression is introduced in Section 3, the main focus of the paper is examining decision tree regression and the ensemble methods which build on it, such as random forests, extra trees and boosted trees, due to their widespread use in industry and hence their extensive impact on society [6]. The aim is to show that such an approach does not affect the order of computational or memory complexity of the model.
The main contributions of this paper are:
I. We use quadrature approaches to enforce GFE constraints on kernel regression with applications to Gaussian processes, support vector machines, neural network regression and decision tree regression, as outlined in Section 3.
II. We incorporate these constraints on decision tree regression without affecting the computational or memory requirements, as outlined in Sections 5 and 6.
III. We derive a tight bound for the variance of the perturbations due to the incorporation of GFE constraints on decision tree regression in terms of the number of leaves of the tree, as outlined in Section 7.
IV. We show that these fair trees can be combined into random forests, boosted trees and other ensemble approaches while maintaining fairness, as shown in Section 8.

Related Work
There are many ways in which the now huge volume of literature on algorithmic fairness may be split. One such approach is to break the literature into three branches of research based upon the stage of the machine learning life cycle to which they belong. The first is the data alteration approach, which endeavours to modify the original dataset in order to prevent discrimination or bias due to the protected variable [7,8]. The second is an attempt to regularise such that the model is penalised for bias [9][10][11][12][13]. Finally, the third endeavours to use post-processing to re-calibrate and mitigate against bias [14,15].
The literature also differs dramatically as to the objective of the fairness algorithm. Recent work has made efforts towards grouping these into consistent objective formalisations [2,16]. Often, the focus of algorithmic fairness is on classification problems, with regression receiving very little attention.
The approach applied to enforce fairness may draw on a plethora of definitions. Anti-classification [16], also referred to as fairness through unawareness [2], endeavours to treat data agnostic of protected variables and hence enforces fairness via treatment rather than outcome. The second popular method is classification parity, i.e., the error with respect to some given measure is equal across groups defined by the protected variable. Finally, calibration is the term used when outcomes are independent of protected group conditioned on risk.
Narrowing our focus to regression, two contradicting objectives once again arise, namely group level fairness and individual fairness. Individual fairness implies that small changes to a given characteristic of an individual lead to small changes in outcome. Group fairness, on the other hand, endeavours to make aggregate outcomes of protected groups similar. The latter is the focus of this work and an overview of where this fits into the broader literature may be found in Table 1.
Table 1. This table is amended from [2], highlighting some of the major contributions currently in the domain of fairness in machine learning. Parity versus preference refers to whether fairness means achieving equality or satisfying the preferences. Treatment versus impact refers to whether fairness is to be maintained in treatment or process of the learning algorithm or resulting output of the system. To the best of the authors' knowledge, this work is the first group fair framework for regression problems.
Specific to decision trees, discrimination aware decision trees have been introduced [30] for classification. They offer dependency aware tree construction and leaf relabelling approaches. Later, fair forests [13] introduced a further tree induction algorithm that encourages fairness through a new gain measure. However, the issue with adding such regularisation is two-fold. Firstly, discouraging bias via a regularising term does not make any guarantee about the bias of the trained model. Secondly, it is hard to make any theoretical guarantees about the underlying model or the effect the new regulariser has had on the model.
The approach offered in this work seeks to perform model inference in a constrained space, leveraging basic theory from Bayesian quadrature such that the predicted marginal distributions are guaranteed to have equal means. Such moment constraints have a natural relationship to maximum entropy methods. By utilising quadrature methods, it is also possible to derive bounds for the expected absolute perturbation induced by constraining the space. This is shown explicitly in Section 7. Ultimately, the paper develops a general framework to perform group-fair regression, an important open problem as pointed out in [23].
We emphasise to the reader that there are many definitions of fairness, each with reasonable motives but conflicting values. Group fairness, addressed in this work, inherently leads to individual unfairness, i.e., to create equal aggregate statistics between sub-populations, individuals in each sub-population are treated inconsistently. The reverse is also true. As such, we should always think through the adverse effects of our approach before applying it in the real world. The experiments in this paper aim to explore and demonstrate the approach introduced, but are not meant to advocate using group fairness specifically for the task in hand.

Constrained Kernel Regression
We first show how one can create such linear constraints on kernel regression models. This work builds on the earlier contributions in [31], where the authors examined the incorporation of linear constraints on Gaussian processes (GPs). Gaussian processes are a Bayesian kernel method most popular for regression. For a detailed introduction to Gaussian processes, we refer the reader to [32]. However, the reader unfamiliar with GPs may simply think of a high dimensional Gaussian distribution parameterised by a kernel $K(\cdot, \cdot)$, with zero mean and unit variance without loss of generality. Given a set of inputs and respective outputs, $\{x_i, y_i\}_{i=1}^{N}$, split into training and testing sets, the predictive distribution at a test point $\hat{x}$ is,
$$p(\hat{y} \mid \hat{x}, x, y) = \mathcal{N}\left( K_{\hat{x},x} K_{x,x}^{-1} y,\; K_{\hat{x},\hat{x}} - K_{\hat{x},x} K_{x,x}^{-1} K_{x,\hat{x}} \right),$$
where $K_{x,x}$ denotes the kernel matrix between training examples, $K_{\hat{x},x}$ is the kernel matrix between the test and training examples and $K_{\hat{x},\hat{x}}$ is the prior variance on the prediction point defined by the kernel matrix. Gaussian processes differ from high dimensional Gaussian distributions as they can model the relationships between points in continuous space, via the kernel function, as opposed to being limited to a finite dimension.
An important note is that Gaussian distributions are closed under addition and subtraction, i.e., the sum or difference of jointly Gaussian variables is also Gaussian. While this may at first appear trivial, it is, in fact, a very useful property. For example, let us assume there are two variables, $a$ and $b$, drawn from Gaussian distributions with means $\mu_a, \mu_b$ and variances $\sigma_a^2, \sigma_b^2$, respectively. Further, assume that the correlation coefficient $\rho$ describes the interaction between the two variables. Then, a new variable $c$, which is equal to the difference of $a$ and $b$, is drawn from a Gaussian distribution with mean and variance,
$$\mu_c = \mu_a - \mu_b, \qquad \sigma_c^2 = \sigma_a^2 + \sigma_b^2 - 2\rho\sigma_a\sigma_b.$$
We can thus write all three variables in terms of a single mean vector and covariance matrix,
$$\begin{bmatrix} a \\ b \\ c \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu_a \\ \mu_b \\ \mu_a - \mu_b \end{bmatrix}, K \right), \qquad K = \begin{bmatrix} \sigma_a^2 & \rho\sigma_a\sigma_b & \sigma_a^2 - \rho\sigma_a\sigma_b \\ \rho\sigma_a\sigma_b & \sigma_b^2 & \rho\sigma_a\sigma_b - \sigma_b^2 \\ \sigma_a^2 - \rho\sigma_a\sigma_b & \rho\sigma_a\sigma_b - \sigma_b^2 & \sigma_a^2 + \sigma_b^2 - 2\rho\sigma_a\sigma_b \end{bmatrix}.$$
Given any two of the above observations, the third can be inferred exactly.
We refer to this as a degenerate distribution as $K$ will naturally be low rank. If we observe that $\mu_a - \mu_b$ is equal to zero, we are thus constraining the distribution of $a$ and $b$. This can easily be extended to the relationship between sums and differences of more variables.
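The low-rank structure can be checked numerically. The following is a minimal sketch in plain Python; the function names and the example variances and correlation are illustrative assumptions rather than values from this work. It builds the joint covariance of $(a, b, c)$ with $c = a - b$ and confirms that its determinant vanishes.

```python
def joint_covariance(var_a, var_b, rho):
    """Covariance of (a, b, c) with c = a - b; a and b have correlation rho."""
    cov_ab = rho * (var_a * var_b) ** 0.5
    cov_ac = var_a - cov_ab               # Cov(a, a - b)
    cov_bc = cov_ab - var_b               # Cov(b, a - b)
    var_c = var_a + var_b - 2.0 * cov_ab  # Var(a - b)
    return [
        [var_a, cov_ab, cov_ac],
        [cov_ab, var_b, cov_bc],
        [cov_ac, cov_bc, var_c],
    ]


def det3(m):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


# Illustrative values only.
K = joint_covariance(var_a=2.0, var_b=1.5, rho=0.3)
# The third row equals the first row minus the second, so K is rank deficient.
print(abs(det3(K)) < 1e-9)
```

Because the third row is a linear combination of the first two, observing any two of the variables fixes the third, which is the property the constraint construction relies on.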
Bayesian quadrature [33] is a technique used to incorporate integral observations into the Gaussian process framework. Essentially, quadrature can be derived through an infinite summation and the above relationship between these summations can be exploited [34]. An example covariance structure thus looks akin to,
$$\begin{bmatrix} \iint K(x', x'')\, p(x')\, p(x'')\, dx'\, dx'' & \int K(x', x)\, p(x')\, dx' \\ \int K(x, x')\, p(x')\, dx' & K_{x,x} \end{bmatrix},$$
where $p(x)$ is some probability distribution over the domain of $x$, on which the Gaussian process is defined and against which the quadrature is performed.
Reiterating the motivation of this work, given two generative distributions $p_A(x)$ and $p_B(x)$ from which subpopulations $A$ and $B$ of the data are generated, we wish to constrain the inferred function $f(\cdot)$ such that,
$$\int f(x)\, p_A(x)\, dx = \int f(x)\, p_B(x)\, dx.$$
This constraint can be rewritten as,
$$\int f(x) \left( p_A(x) - p_B(x) \right) dx = 0,$$
which allows us to incorporate the constraint on $f(\cdot)$ as an observation in the above Gaussian process. Let $q_{A,B}(x) = p_A(x) - p_B(x)$ be the difference between the generative probability distributions of $A$ and $B$; then, by setting the corresponding observation as zero, the covariance matrix becomes,
$$\begin{bmatrix} \iint K(x', x'')\, q_{A,B}(x')\, q_{A,B}(x'')\, dx'\, dx'' & \int K(x', x)\, q_{A,B}(x')\, dx' \\ \int K(x, x')\, q_{A,B}(x')\, dx' & K_{x,x} \end{bmatrix}.$$
We refer to these as equality constrained Gaussian processes. Let us now turn to incorporate these concepts into decision tree regression.

Trees as Kernel Regression
Decision tree regression (DTR) and related approaches offer a white box approach for practitioners who wish to use them. These methods are among the most popular in machine learning practice [6], as they are generally intuitive even for those without a statistics, mathematics or computer science background. It is their proliferation, especially in businesses without machine learning researchers, that makes them of particular interest.
DTR regresses data by sorting them down binary trees based on partitions of the input domain. The trees are created by recursively partitioning the input domain along axis aligned splits determined by a given metric of the data in each partition, such as information gain or variance reduction. In this work, we do not consider the many possible techniques for learning decision trees, but rather assume that the practitioner has a trained decision tree model. For a more complete description of decision trees, we refer the reader to [35].
For the purposes of this work, DTR can be described as a partitioning of space such that predictions are made by averaging the observations in the local partition, referred to as the leaves of the tree. As such, DTR has a very natural formulation as a degenerate kernel whereby,
$$K(x_i, x_j) = \begin{cases} 1 & \text{if } L(x_i) = L(x_j), \\ 0 & \text{otherwise,} \end{cases}$$
where $L(\cdot)$ is the index of the leaf to which the argument belongs. The kernel hence becomes naturally block diagonal and the classifier/regressor is written as,
$$f(\hat{x}) = K_{\hat{x},x} K_{x,x}^{-1} y,$$
with $K_{\hat{x},x}$ denoting the vector of kernel values between $\hat{x}$ and the observations, $K_{x,x}$ denoting the covariance matrix of the observations as defined by the implicit decision tree kernel and $y$ denoting the values of the observations. It is also worth noting that one can write the decision tree as a two-stage model: first by averaging the observations associated with each leaf and then by using a diagonal kernel matrix to perform inference. Trivially, the diagonal kernel matrix acts only as a lookup and outputs the leaf average that corresponds to the point being predicted. Let us refer to this compressed kernel matrix approach as the compressed kernel representation and the block diagonal variant as the explicit kernel representation.
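The two representations can be illustrated on a toy example. This is a minimal sketch, assuming a hypothetical one-dimensional tree with two split thresholds and a kernel that equals one when two points share a leaf and zero otherwise; `THRESHOLDS`, the function names and the data are illustrative and not taken from this work.

```python
from bisect import bisect_right

# Hypothetical 1-D tree: three leaves separated by two split thresholds.
THRESHOLDS = [0.0, 1.0]


def leaf_index(x):
    """L(x): index of the leaf whose interval contains x."""
    return bisect_right(THRESHOLDS, x)


def leaf_kernel(x_i, x_j):
    """Degenerate tree kernel: 1 if both points share a leaf, else 0."""
    return 1.0 if leaf_index(x_i) == leaf_index(x_j) else 0.0


def leaf_means(xs, ys):
    """Stage one of the compressed representation: average y per leaf."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        j = leaf_index(x)
        sums[j] = sums.get(j, 0.0) + y
        counts[j] = counts.get(j, 0) + 1
    return {j: sums[j] / counts[j] for j in sums}


def predict(x_new, means):
    """Stage two: the diagonal kernel acts as a lookup of the leaf average."""
    return means[leaf_index(x_new)]


xs = [-2.0, -1.0, 0.2, 0.7, 1.5, 3.0]  # sorted, so each leaf is contiguous
ys = [1.0, 3.0, 10.0, 14.0, -4.0, -6.0]
K = [[leaf_kernel(a, b) for b in xs] for a in xs]  # explicit, block diagonal
means = leaf_means(xs, ys)                         # compressed, one value per leaf
print(K[0])                 # ones only for the two points in leaf 0
print(predict(0.5, means))  # average of the observations in leaf 1
```

With the inputs sorted by leaf, the explicit matrix `K` shows the block diagonal pattern directly, while `means` holds everything the compressed representation needs at prediction time.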

Fairness Constrained Decision Trees
Borrowing concepts from the previous section on equality constrained Gaussian processes using Bayesian quadrature, decision trees may be constrained in a similar fashion. The first consideration to note is that we wish the constraint observation to act as a hard equality, i.e., noiseless. In contrast, we are willing for the observations to be perturbed in order to satisfy this hard equality constraint. To achieve this, let us add a constant noise term, $\sigma^2_{\text{noise}}$, to the diagonals of the decision tree kernel matrix. Similar to ordinary least squares regression, the regressor now minimises the L2-norm of the error induced on the observations, conditioned on the equality constraint, which is noise free. In the explicit kernel representation, this implies the minimum induced noise per observation, whereas in the compressed kernel representation this implies the minimum induced noise per leaf.
An important note is that the constraint is applied to the kernel regressor equations, hence the method is exact for regression trees or when the practitioner is concerned with relative outcomes of various predictions. However, in the case that the observations range within [0, 1], as is the case in classification, then we must renormalise the output to [0, 1]. This no longer guarantees a minimum L2-norm perturbation and while potentially still useful, is not the focus of this work.
The second consideration is how to determine the generative probability distributions $p_A(x)$ and $p_B(x)$. Given the frequentist nature of decision trees, it makes sense to consider $p_A(x)$ and $p_B(x)$ as the empirical distributions of subpopulations $A$ and $B$, as described in Section 1. Thus, the integral of the empirical distribution on a given leaf, $\int_{L_i} p_A(x)\,dx$, is defined as the proportion of population $A$ observed in the partition associated with leaf $L_i$. We emphasise that how $p_A(x)$ and $p_B(x)$ are determined is not the core focus of this work and many approaches have merit. For example, a Gaussian mixture model could be used to model the input distribution, in which case $\int_{L_i} p_A(x)\,dx$ would equal the cumulative distribution of the generative PDF over the bounds defined by the leaf. This is demonstrated in the Experimental Section. Many other such models would also be valid and determining which method to use to model the generative distribution is left to the practitioner with domain expertise.
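For the empirical-distribution case, the per-leaf difference between the two groups reduces to a difference of leaf-wise proportions. The sketch below is one possible way to compute it, assuming each training point comes with a leaf index and a group label; the function name, the labels "A" and "B" and the toy data are illustrative assumptions.

```python
from collections import Counter


def leaf_imbalance(leaf_ids, groups, n_leaves):
    """Share of group A in each leaf minus the share of group B in that leaf.

    Assumes both groups are non-empty in the training data.
    """
    count_a = Counter(leaf for leaf, g in zip(leaf_ids, groups) if g == "A")
    count_b = Counter(leaf for leaf, g in zip(leaf_ids, groups) if g == "B")
    n_a = sum(count_a.values())
    n_b = sum(count_b.values())
    # A Counter returns 0 for leaves in which a group never appears.
    return [count_a[j] / n_a - count_b[j] / n_b for j in range(n_leaves)]


# Toy data: the leaf reached by each training point and its group label.
leaf_ids = [0, 0, 1, 1, 1, 2, 2, 2]
groups = ["A", "A", "A", "B", "B", "A", "B", "B"]
z = leaf_imbalance(leaf_ids, groups, n_leaves=3)
print(z)       # [0.5, -0.25, -0.25]
print(sum(z))  # both empirical distributions sum to one, so this is zero
```

Only group labels on the training data are needed to build this vector; prediction-time inputs do not require a label.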

Efficient Algorithm For Equality Constrained Decision Trees
At this point, an equality constrained variant of a decision tree has been described, in both explicit representation and compressed representation. In this section, we show that equality constraints on a decision tree do not change the computational or memory order of complexity. The motivation for considering the order of complexities is that decision trees are one of the more scalable machine learning models, whereas kernel methods such as Gaussian processes naively scale at $O(n^3)$ in computation and $O(n^2)$ in memory, where $n$ is the number of observations. While the approach presented in this work utilises concepts from Bayesian quadrature and linearly constrained Gaussian processes, the model's usefulness would be drastically hindered if it no longer maintained the performance characteristics of the classic decision tree, namely computational cost and memory requirements.

Efficiently Constrained Decision Trees in Compressed Kernel Representation
As Figure 1 shows, the compressed kernel representation of the constrained decision tree creates an arrowhead matrix. It is well known that the inverse of an arrowhead matrix is a diagonal matrix with a rank-1 update. Letting $D$ represent the diagonal principal sub-matrix, with diagonal elements equal to one plus the noise term, $1 + \sigma_n^2$, and $z$ be the vector whose $i$th element is the relative difference in generative population distributions for leaf $i$, $z_i = \int_{L_i} (p_A(x) - p_B(x))\,dx$, the arrowhead inversion properties state that,
$$\begin{bmatrix} 0 & z^T \\ z & D \end{bmatrix}^{-1} = \begin{bmatrix} 0 & 0 \\ 0 & D^{-1} \end{bmatrix} + \rho\, u u^T, \qquad u = \begin{bmatrix} -1 \\ D^{-1} z \end{bmatrix}, \qquad \rho = -\frac{1}{z^T D^{-1} z}.$$
Note that the integral of the difference between the two generative distributions when evaluated over the entire domain is equal to zero, as both $p_A(x)$ and $p_B(x)$ must integrate to one by definition and hence their difference to zero. Returning to the equation of interest, namely $f(\hat{x}) = K_{\hat{x},x} K_{x,x}^{-1} y$ with $y$ as the average value of each leaf of the tree, and substituting $K_{\hat{x},x}$ as a vector of zeros with a one indexing the $j$th leaf, to which the predicted point belongs, and with the entry corresponding to the constraint equal to zero, as it does not contribute to the empirical distributions, we arrive at,
$$f(\hat{x}) = \frac{1}{1 + \sigma_n^2} \left( y_j - z_j \frac{z^T y}{z^T z} \right).$$
Figure 1. This is a visualisation of a decision tree kernel matrix with marginal constraint, left in explicit representation and right in compressed representation. The dark cell in the upper left of the matrix is the double integrated kernel function with respect to the difference of input distributions, which constrain the process. The solid grey row and column are single integrals of the kernel function. White cells have zero values and the dashed (block) diagonals are the kernel matrix between observations or leaves of the tree. We can note that the compressed representation kernel matrix is an arrowhead matrix, which we exploit to create an efficient algorithm.
The term $\frac{1}{1+\sigma_n^2}$ is the effect of the prior under the Gaussian process perspective; however, by post-multiplying by $(1 + \sigma_n^2)$, this prior effect can be removed. While relatively simple to derive, the above equation shows that only an additive update to the predictions is required to ensure group fairness in decision trees. Further, if the same relative population is observed for Group A and Group B on a single leaf $j$, then $z_j = 0$ and no change is applied to the original inferred prediction before the constraint is applied other than the effect of the noise. In fact, the perturbation to a leaf's expectation grows linearly with the bias in the population of the leaf.
From an efficiency standpoint, only the difference in generative distributions, $z$, needs to be stored, which is an additional $O(L)$ memory requirement, and the update per leaf can be pre-computed in $O(L)$. These additional memory and computational requirements are negligible compared to the $O(N)$ cost of the decision tree itself.
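A minimal sketch of this additive update follows. It assumes the update takes the projection form $f_j = y_j - z_j \frac{z^T y}{z^T z}$, i.e., the smallest L2 change to the leaf averages that satisfies $\sum_j z_j f_j = 0$, with the prior shrinkage factor already removed; the function name and the numbers are illustrative assumptions.

```python
def constrain_leaf_means(y, z):
    """Shift leaf means y so that sum_j z_j * f_j = 0 with minimal L2 change."""
    zz = sum(v * v for v in z)
    if zz == 0.0:  # both groups populate every leaf identically
        return list(y)
    scale = sum(zj * yj for zj, yj in zip(z, y)) / zz  # z^T y / z^T z
    return [yj - zj * scale for yj, zj in zip(y, z)]


y = [2.0, 5.0, 9.0]      # leaf averages of an already trained tree
z = [0.5, -0.25, -0.25]  # leaf imbalance between groups A and B
f = constrain_leaf_means(y, z)

# Difference in expected prediction between the groups, before and after.
gap_before = sum(zj * yj for zj, yj in zip(z, y))
gap_after = sum(zj * fj for zj, fj in zip(z, f))
print(gap_before, gap_after)
```

The update touches each leaf once and needs only the stored vector `z`, which matches the $O(L)$ overhead discussed above. A leaf with `z[j] == 0` keeps its original average.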

Efficiently Constrained Decision Trees in Explicit Kernel Representation
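Let us now turn our attention to the explicit kernel representation case, where the $D$ in the previous subsection is replaced with the block diagonal matrix equivalent. First, let us state the bordering method, a special case of the block diagonal inversion lemma,
$$\begin{bmatrix} 0 & z^T \\ z & D \end{bmatrix}^{-1} = \begin{bmatrix} 0 & 0 \\ 0 & D^{-1} \end{bmatrix} + \rho \begin{bmatrix} -1 \\ D^{-1} z \end{bmatrix} \begin{bmatrix} -1 & z^T D^{-1} \end{bmatrix},$$
with $\rho = -\frac{1}{z^T D^{-1} z}$ once again. Here $z$ holds one entry per observation, equal to the imbalance $z_j$ of the leaf containing that observation. Substituting this into the kernel regression equation once more, we find,
$$f(\hat{x}) = I_j^T \left( D^{-1} + \rho\, D^{-1} z z^T D^{-1} \right) y,$$
where $I_j$ denotes a vector of zeros with ones placed in all elements relating to observations in the same leaf. Expanding the above linear algebra, with $m_j$ denoting the number of observations in leaf $j$,
$$\rho = -\left( \sum_j \frac{m_j z_j^2}{m_j + \sigma_n^2} \right)^{-1},$$
where $j$ is iterating over the set of leaves. Note that, when $m_j = 1$ for all $j$, we arrive at the same value for $\rho$ as we did in the previous subsection. We can continue to apply this result to the other terms of interest,
$$X_1 = I_j^T D^{-1} y = \frac{m_j \bar{y}_j}{m_j + \sigma_n^2}, \qquad X_2 = I_j^T D^{-1} z = \frac{m_j z_j}{m_j + \sigma_n^2}, \qquad X_3 = z^T D^{-1} y = \sum_k \frac{m_k z_k \bar{y}_k}{m_k + \sigma_n^2},$$
where $\bar{y}_j$ is once again the average output observation over leaf $j$. The terms have been labelled $X_1$, $X_2$ and $X_3$ for shorthand. The three terms, along with $\rho$, can be computed in linear time with respect to the size of the data, $O(n)$, and can be pre-computed ahead of time, hence not affecting the computational complexity of a standard decision tree. Once again, only $z_j$ and $m_j$ have to be stored for each leaf and hence the additional memory cost is only $O(L)$. As such, we can simplify the full expression for the expected outcome as,
$$f(\hat{x}) = X_1 + \rho\, X_2 X_3.$$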
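The explicit-representation computation can be sketched per leaf. The forms used below are assumptions obtained by taking each leaf block of $D$ to be a matrix of ones with $\sigma_n^2$ added to its diagonal, which gives $X_1 = \frac{m_j \bar{y}_j}{m_j + \sigma_n^2}$, $X_2 = \frac{m_j z_j}{m_j + \sigma_n^2}$, $X_3 = \sum_k \frac{m_k z_k \bar{y}_k}{m_k + \sigma_n^2}$ and $\rho = -\left(\sum_k \frac{m_k z_k^2}{m_k + \sigma_n^2}\right)^{-1}$; the function and argument names are illustrative.

```python
def constrain_explicit(y_bar, z, m, noise):
    """Constrained leaf predictions in the explicit kernel representation.

    y_bar[j]: mean target in leaf j, z[j]: imbalance of leaf j,
    m[j]: number of training points in leaf j, noise: sigma_n^2 > 0.
    """
    w = [mj / (mj + noise) for mj in m]                         # per-leaf shrinkage
    x1 = [wj * yj for wj, yj in zip(w, y_bar)]                  # X_1 for each leaf
    x2 = [wj * zj for wj, zj in zip(w, z)]                      # X_2 for each leaf
    x3 = sum(wj * zj * yj for wj, zj, yj in zip(w, z, y_bar))   # X_3, shared
    denom = sum(wj * zj * zj for wj, zj in zip(w, z))           # z^T D^-1 z
    if denom == 0.0:  # no imbalance in any leaf, nothing to correct
        return x1
    rho = -1.0 / denom
    return [a + rho * b * x3 for a, b in zip(x1, x2)]


y_bar = [2.0, 5.0, 9.0]
z = [0.5, -0.25, -0.25]
m = [2, 3, 3]
f = constrain_explicit(y_bar, z, m, noise=0.1)
print(sum(zj * fj for zj, fj in zip(z, f)))  # constraint residual, near zero
```

All quantities are accumulated in a single pass over the leaves, and only `z` and `m` need to be stored alongside the tree. Setting every `m[j]` to one recovers the compressed-representation update scaled by $\frac{1}{1+\sigma_n^2}$.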

Expected Perturbation Bounds
In imposing equality constraints on the models, the inferred outputs become perturbed. In this section, the expected magnitude of the perturbation is analysed for the compressed kernel representation. We define the perturbation due to the equality constraint, not due to the incorporation of the noise, as,

Theorem 1.
Given a decision tree with $L$ leaves, with expected value of leaf observations denoted by the vector $y \in \mathbb{R}^L$ normalised to have zero mean and unit variance and leaf frequency imbalance denoted as $z \in \mathbb{R}^L$, the expected variance induced by the perturbation due to incorporating a Group Fairness in Expectation constraint is bounded by,
Proof. As the expectation of $z_j$ is zero due to it being the difference of two probability distributions, the variance is equal to the expectation of the squared perturbation, with $\bar{z}$ equal to $z$ after normalisation. By Lemma 1, the expectation of the squared dot product $(\bar{z}^T y)^2$ is equal to $\frac{1}{L}$. Further, the 2-norm of $z$ can be cancelled from the numerator and denominator. Finally, using the $L_1$, $L_2$ norm inequality, $\|z\|_2 \leq \|z\|_1 \leq \sqrt{L}\,\|z\|_2$, we can then tightly bound the worst case introduced variance as,
Lemma 1. Given two vectors $y, \bar{z}$ uniformly distributed on the unit hypersphere $S^{L-1}$, the expectation of their dot product is zero and its variance is,
$$\mathrm{Var}(\bar{z}^T y) = \frac{1}{L}.$$
Proof. As the inner product is rotation invariant when applied to both $\bar{z}$ and $y$, let us denote the vector $\bar{z}$ as $[1, 0, \ldots, 0]$ without loss of generality. The first element of the vector $y$, denoted by $y_0$, is thus equal to $\bar{z}^T y$. The probability density of the random variable $y_0$ is proportional to the surface area lying at a height between $y_0$ and $y_0 + dy_0$ on the unit hypersphere. That proportion occurs within a belt of height $dy_0$ and radius $\sqrt{1 - y_0^2}$, which is a conical frustum constructed out of an $S^{L-2}$ of radius $\sqrt{1 - y_0^2}$, of height $dy_0$, and slope $\frac{1}{\sqrt{1 - y_0^2}}$. Hence, the probability is proportional to,
$$P(y_0) \propto \frac{\left( \sqrt{1 - y_0^2} \right)^{L-2}}{\sqrt{1 - y_0^2}} = \left( 1 - y_0^2 \right)^{\frac{L-3}{2}}.$$
Substituting $u = \frac{y_0 + 1}{2}$, we find that,
$$P(u) \propto \left( 1 - (2u - 1)^2 \right)^{\frac{L-3}{2}} \propto u^{\frac{L-3}{2}} (1 - u)^{\frac{L-3}{2}}.$$
Note that this last simplification of $P(u)$ matches, up to normalisation, the probability density function of the Beta distribution with both shape parameters equal, $\alpha = \beta = \frac{L-1}{2}$. The variance of the Beta distribution is,
$$\mathrm{Var}(u) = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)} = \frac{1}{4L}.$$
Rescaling to find the variance of $y_0 = 2u - 1$, we arrive at $\frac{1}{L}$. As $\mathbb{E}[\bar{z}^T y] = 0$ due to the properties of symmetry, $\mathbb{E}[(\bar{z}^T y)^2] = \frac{1}{L}$.
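The $\frac{1}{L}$ variance of the dot product of two random unit vectors can be checked by simulation. The sketch below is a Monte Carlo estimate under the assumption that both vectors are drawn independently and uniformly on the unit hypersphere (sampled by normalising standard Gaussian draws); the function names, sample size and seed are illustrative choices.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Uniform sample on the unit hypersphere via a normalised Gaussian draw."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def dot_product_variance(dim, n_samples=10000, seed=0):
    """Monte Carlo estimate of E[(z^T y)^2] for independent unit vectors."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        z = random_unit_vector(dim, rng)
        y = random_unit_vector(dim, rng)
        total += sum(a * b for a, b in zip(z, y)) ** 2
    return total / n_samples


# Compare the estimate with 1 / L for a few dimensions.
for L in (3, 10, 30):
    print(L, dot_product_variance(L), 1.0 / L)
```

The estimates should sit close to $\frac{1}{L}$ for each dimension, with the residual error shrinking as `n_samples` grows.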
This is an interesting result as it implies that, if the model is not exploiting biases in the generative distribution evenly across all of the leaves of the tree, i.e., $\|z\|_1 = \sqrt{L}\,\|z\|_2$, then the resulting predictions receive the greatest expected absolute perturbation when averaged over all possible $y$.
For the explicit kernel representation, the expected absolute perturbation bound can be analysed for the case in which each leaf holds the same number of observations. In such a scenario, $m_i = m$ is equal for all leaves $i \in 1, \ldots, L$. Substituting this into the equations for $\rho$, $X_2$ and $X_3$, we can find that the bounded expected perturbation is equal to,
For the sake of conciseness, the full derivation of the above is left to the reader but follows the same steps as the compressed kernel representation.

Combinations of Fair Trees
While it is intuitive to say that ensembles of trees with GFE constraints preserve the GFE constraint, for the sake of completeness, this is now shown more formally. Random forests [36], extremely random trees (ExtraTrees) [37] and tree bagging models [38] combine tree models by averaging over their predictions. Denoting the predictions of the trees at point $x$ as $f_i(x)$ for each $i \in 1, \ldots, T$, where $T$ is the number of trees, we can easily show that the combined difference in expectation marginalised over the space is equal to zero,
$$\int \left( \frac{1}{T} \sum_{i=1}^{T} f_i(x) \right) \left( p_A(x) - p_B(x) \right) dx = \frac{1}{T} \sum_{i=1}^{T} \int f_i(x) \left( p_A(x) - p_B(x) \right) dx = 0.$$
It can also be easily shown that modelling residual errors of the trees with other fair trees, as is the case for boosted tree models [39], also results in fair predictors. These concepts are not limited to tree methods either, and the core concepts set out in this paper of constraining kernel matrices can have applications in models such as deep Gaussian process models [40].
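The averaging argument can be illustrated with a small numerical example. In this sketch the per-tree predictions are hypothetical values chosen so that each tree already has equal group means under the empirical distribution; the function names and data are illustrative assumptions.

```python
def group_gap(predictions, groups):
    """Mean prediction for group A minus mean prediction for group B."""
    a = [p for p, g in zip(predictions, groups) if g == "A"]
    b = [p for p, g in zip(predictions, groups) if g == "B"]
    return sum(a) / len(a) - sum(b) / len(b)


def average_ensemble(per_tree_predictions):
    """Average the predictions of several trees point by point."""
    n_trees = len(per_tree_predictions)
    return [sum(col) / n_trees for col in zip(*per_tree_predictions)]


groups = ["A", "A", "B", "B"]
# Hypothetical predictions of three trees, each satisfying the constraint.
tree_predictions = [
    [1.0, 3.0, 2.0, 2.0],
    [0.0, 4.0, 5.0, -1.0],
    [2.5, 2.5, 1.0, 4.0],
]
forest = average_ensemble(tree_predictions)
print([group_gap(p, groups) for p in tree_predictions])  # zero for every tree
print(group_gap(forest, groups))                         # zero up to rounding
```

Since the gap is linear in the predictions, any fixed weighted average of constrained trees inherits a zero gap, which is the property used for forests and bagging.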

Synthetic Demonstration
The first experiment was a visual demonstration to better communicate the validity of the approach. The models examined are ExtraTrees, Gaussian processes and a single hidden layer perceptron. They endeavour to model an analytic function, $f(x) = x\cos(\alpha x^2) + \sin(\beta x)$, with observations drawn from two beta distributions, $p_A(x)$ and $p_B(x)$, respectively. The parameters of the two beta distributions are presented in Table 2. Figure 2 shows the effect of perturbing the models using the approach presented to constrain the expected means of the two populations. The figure shows that the greater the disparity between $p_A(x)$ and $p_B(x)$, the greater the perturbation in the inferred function. Both the compressed and explicit kernel representation lead to very similar plots for the tree-based models, thus only the compressed kernel representation algorithm has been shown for conciseness. Note that, in the case of the ExtraTrees model, each tree was individually perturbed before being combined. Further, in the case of the perceptron, a GMM was fit to the data in the inferred latent space rather than in the original input space. A downside to group fairness algorithms more generally, as pointed out in [7], is that candidate systems which impose group fairness can lead to qualified candidates being discriminated against. This can be visually verified as the perturbation pushes down the outcome of many orange points below the total population mean in order to satisfy the constraint. By choosing to incorporate group fairness constraints, the practitioner should be aware of these tradeoffs.

ProPublica Dataset-Racial Biases
Across the USA, judges, and probation and parole officers are increasingly using algorithms to aid in their decision making. The ProPublica dataset (https://www.propublica.org/datastore/dataset/compas-recidivism-risk-score-data-and-analysis) contains data about criminal defendants from Florida in the United States. It includes the scores of the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS) algorithm [41], which is often used by judges to estimate the probability that a defendant will be a recidivist, a term used to describe re-offenders. However, the algorithm is said to be racially biased against African Americans [42]. To highlight the proposed algorithm, we first endeavoured to use a random forest to approximate the decile scoring of the COMPAS algorithm and then perturbed each tree to remove any racial bias from the system.
The two subpopulations we considered constraining are thus African American and non-African American. We encode the COMPAS algorithm's decile score into an integer between zero and ten such that minimising the $L_2$ perturbation is an appropriate objective function. The fact that the decile scores are bounded in [0, 10] was not taken into account. The random forest used 20 decision trees as base estimators and the explicit kernel representation version of the algorithm was used for demonstrative purposes. Figure 3 presents the marginal distribution of predictions on a 20% held out test set before and after the GFE constraint was applied. It is visible that the expected outcome for African Americans is decreased while that for non-African Americans is increased. Notice that, while the means are equal, the structures of the two distributions are quite different, indicating that GFE constraints still allow greater flexibility than stricter group fairness definitions such as that described in Section 1. The root square difference between the predicted points before and after perturbation was 0.8. Importantly, the GFE constraint described in this work was verified numerically with the average outputs recorded as shown in Table 3. We can see that the respective means (vertical lines) become approximately equal after the inclusion of the constraint using the empirical input distribution.

Intersectionality: Illinois State Employee Salaries
The Illinois state employee salaries (https://data.illinois.gov/datastore/dump/1a0cd05c-7d17-4e3d-938d-c2bfa2a4a0b1) since 2011 can be seen to have a gender bias and a bias between veterans and non-veterans. The motivation of this experiment was to show how we can deal with intersectionality issues (multiple compounding constraints), such as if one wished to predict a fair salary for future employees based on current staff. Gender labels were inferred using the employees' first names, parsed through the gender-guesser Python library. GFE constraints were applied between all intersections of gender and veteran/non-veteran status, the marginals of gender and the marginals of veteran/non-veteran status. Figure 4 visualises the perturbations to the marginals of each demographic intersection due to the GFE constraints. The train-test split was set as 80-20% and the incorporation of the GFE constraints increased the root mean squared error from $12,086 to $12,772, the cost of fairness. The only difference needed to allow for intersectionality is that $z$ is no longer a vector, but rather a matrix with a column for each constraint. Thus, $f(\hat{x}) = y_j - z_j (z^T z)^{-1} z^T y$, where $z_j$ denotes the $j$th row of $z$.
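The multi-constraint case can be sketched as a projection of the leaf averages onto the space orthogonal to every constraint column. The sketch assumes the update $f = y - z (z^T z)^{-1} z^T y$ and linearly independent constraint columns; the small solver, the function names and the numbers are illustrative assumptions, not the configuration used in the experiment.

```python
def solve(a, b):
    """Solve a x = b for a small square system (Gaussian elimination, pivoting)."""
    n = len(b)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def constrain_multi(y, z_cols):
    """f = y - Z (Z^T Z)^-1 Z^T y, with one column of Z per constraint.

    Assumes the columns are linearly independent; redundant constraints
    make the Gram matrix singular and should be dropped beforehand.
    """
    gram = [[sum(p * q for p, q in zip(ci, cj)) for cj in z_cols] for ci in z_cols]
    rhs = [sum(p * q for p, q in zip(ci, y)) for ci in z_cols]
    coef = solve(gram, rhs)
    return [
        yj - sum(c * col[j] for c, col in zip(coef, z_cols))
        for j, yj in enumerate(y)
    ]


y = [40.0, 55.0, 62.0, 48.0]          # leaf averages (illustrative values)
z_gender = [0.25, -0.25, 0.5, -0.5]   # hypothetical imbalance columns,
z_veteran = [0.5, 0.25, -0.25, -0.5]  # each summing to zero
f = constrain_multi(y, [z_gender, z_veteran])
print([sum(zj * fj for zj, fj in zip(col, f)) for col in (z_gender, z_veteran)])
```

Each printed residual is the remaining gap for one constraint and should be zero up to rounding. The Gram matrix has one row per constraint, so the extra cost depends on the number of constraints rather than on the size of the data.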

Conclusions
This work offers an easily implementable approach to constrain the means of kernel regression, which has direct applicability to decision tree regression, Gaussian process regression, neural network regression, random forest regression, boosted trees and other tree-based ensemble models.