Adequate and fair explanations

Explaining sophisticated machine-learning based systems is an important issue at the foundations of AI.


Introduction
Explaining the predictions of sophisticated machine-learning algorithms is an important issue for the foundations of AI. Recent efforts [Ribeiro et al., 2016; Ribeiro et al., 2018; Wachter et al., 2017; Ignatiev et al., 2019; Bachoc et al., 2018] have shown various methods for providing explanations. These approaches can be broadly divided into two schools: those that provide a local and human-interpretable approximation of a machine-learning algorithm, and logical approaches that completely characterise one aspect of the decision. In this paper we compare complete explanations with partial, epistemically accessible ones.
There is an epistemological problem with these complete methods. While they can furnish complete explanations, such explanations may be too complex for humans to understand or even to write down in human readable form. Interpretability requires epistemically accessible explanations, explanations humans can grasp. Yet what is a sufficiently complete or adequate epistemically accessible explanation still needs analysis. We provide such an analysis in terms of counterfactuals, following [Wachter et al., 2017].
With counterfactual explanations, many of the assumptions needed to provide a complete explanation are left implicit. To do so, counterfactual explanations exploit the properties of a particular data point or sample, and as such are also local as well as partial explanations. We explore how to move from local partial explanations to what we call complete local explanations and then to global ones. But to preserve accessibility we argue for the need for partiality. This partiality makes it possible to hide explicit biases present in the algorithm that may be injurious or unfair. We investigate how easy it is to uncover these biases in providing complete and fair explanations by exploiting the structure of the set of counterfactuals providing a complete local explanation.
To make the point about biases in counterfactual explanations concrete, consider the following scenario. An ML program judges A's application for a loan. A is turned down. When A asks the bank for an explanation of the decision, the bank returns with the following.
(1) Your income is 50K euro per year.
(2) If your income had been 100K euro per year, you would have gotten the loan.
The counterfactual in (2) might be true, but it might also be misleading, hiding a bias that one might find unfair. Suppose that (1)-(2) is the explanation the bank gives A. But suppose also that there is another, more morally repugnant explanation: A is black, and if A had been white, A would have gotten the loan with A's current income of 50K per year. There is also a question of features that indirectly encode bias. For example, tying the loan availability to the postal code of A's residence could be a way of encoding racial bias. The problem of dissimulating a bias when giving an explanation is a direct outcome of the partiality of epistemically accessible explanations.

Background on explanations
Suppose that $f : X^n \to Y$ is the "ideal" function taking data encoded in an n-dimensional space of features $X^n$ into the representations in $Y$, and that $\hat{f} : X^n \to Y$ is the function that the algorithm has encoded. For this paper, we assume that $\hat{f}$ is some sort of classifier, so that $Y$ is a set of classes. We want an explanation of why $\hat{f}$ outputs the predictions it does. We might want to know its behavior over the total space $X^n$; this would be a complete explanation. But for many purposes, we might only need to know how $\hat{f}$ behaves on a data point of interest or focal point, like A's profile that was submitted to the loan program. Note that we are implicitly assuming that $\hat{f}$ is too complex or opaque for its behaviour to be analyzed statically. For instance, $\hat{f}$ might be a neural network with multiple layers, parameters and variables. Or it might be the case that we have access only to the binaries of $\hat{f}$, from which the actual algorithm cannot be reverse-engineered.
There are two sorts of explanations of program behavior. Internal explanations involve the internal states of the program; if these are linked by logic, then we can have a deductive explanation for a particular response in $Y$ to input data from $X^n$. However, there are also external explanations that link features of $X$ with the output in $Y$ (here too the link can be deductive or something else). These are initially attractive because they do not involve unpacking the algorithm's internal states and assigning them a meaning, which in the case of deep learning networks with multiple hidden layers can be a very complicated affair. Even partial explanations that exploit internal states of a complex neural architecture may be epistemically inaccessible.
[Ignatiev et al., 2019] provide a definition of complete explanations. They assume a classifier $\hat{f}$ can be represented as a set of logic formulas, which we assume here too. In addition, we will assume that $\hat{f}$ has a constant set of features with binary values, making the encoding into logic transparent. (By increasing the number of literals we can simulate non-binary values, so this is not really a limitation as long as the features are finite.) For [Ignatiev et al., 2019], an explanation, or what we call an MS explanation, of a prediction $\pi$ of a classifier $\hat{f}$ given a feature space $X$ is a subset-minimal set of literals $E$ (each one describing a value of a feature in the problem space) such that
$$E \models (\hat{f} \to \pi),$$
where $\hat{f}$ ranges over the set of formulas encoding the classifier. $E \models (\hat{f} \to \pi)$ means that we can prove in the logic representation that $\hat{f}$ predicts $\pi$ given any instance with features as described in $E$. A complete explanation of the behavior of $\hat{f}$ in our sense would be the set of all possible MS explanations.
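To make the definition of an MS explanation concrete, here is a minimal brute-force sketch for a tiny Boolean classifier. The names `entails`, `ms_explanations` and `toy_loan`, and the two-feature loan rule, are hypothetical and chosen only for illustration; the sketch checks entailment by enumerating every instance, which is feasible only for a handful of features and is not the method of [Ignatiev et al., 2019].

```python
from itertools import combinations, product

def entails(partial, f_hat, n, pi):
    """True if every instance extending `partial` (dict: index -> bool) gets prediction pi."""
    free = [i for i in range(n) if i not in partial]
    for values in product([False, True], repeat=len(free)):
        x = [None] * n
        for i, v in partial.items():
            x[i] = v
        for i, v in zip(free, values):
            x[i] = v
        if f_hat(tuple(x)) != pi:
            return False
    return True

def ms_explanations(f_hat, n, pi):
    """All subset-minimal sets of literals E such that E entails prediction pi."""
    found = []
    for size in range(n + 1):  # increasing size guarantees subset-minimality
        for idxs in combinations(range(n), size):
            for values in product([False, True], repeat=size):
                E = dict(zip(idxs, values))
                # Skip E if it contains an explanation that was already found.
                if any(all(E.get(i) == v for i, v in e.items()) for e in found):
                    continue
                if entails(E, f_hat, n, pi):
                    found.append(E)
    return found

# Hypothetical classifier over (high_income, fraud_conviction).
def toy_loan(x):
    high_income, fraud = x
    return "loan" if high_income and not fraud else "no loan"

print(ms_explanations(toy_loan, 2, "no loan"))  # [{0: False}, {1: True}]
```

Under this toy rule, either literal alone (low income, or a fraud conviction) suffices for a refusal, whereas the only MS explanation of an approval needs both literals.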
An instance in such a setup is a set of literals that assigns values to every feature in the feature space. The underlying logical form of the explanations discussed in [Ignatiev et al., 2019] thus exploits universal generalizations and a deductive consequence relation. These explanations in principle explain sets of instances; they are known as global explanations and are a version of deductive-nomological explanation, where a relation of entailment holds between the explanans and the explanandum.
Counterfactuals offer a natural way to provide epistemically accessible, partial explanations geared toward properties of individuals or focal points. Such explanations directed to a particular case are often called local explanations, in contrast to global ones [Ribeiro et al., 2016; Ribeiro et al., 2018]. One reason for using counterfactuals in explanations is that counterfactuals lend themselves to an attractive analysis of causation, as [Lewis, 1973] proposed. Counterfactual explanations furnish natural candidates for partial, epistemically accessible explanations because they single out properties or features that would make a difference to a decision about an individual, as in (2), other things being as equal as they can be given that the individual has the property described by the counterfactual's antecedent. This ceteris paribus property of counterfactuals means that many factors that would be mentioned in a complete explanation can remain implicit. They are thus more partial than MS explanations.
The canonical semantics for counterfactuals, symbolised via $\Box\!\!\rightarrow$, as outlined in [Lewis, 1973] exploits a possible-worlds model for propositional logic and, crucially, a similarity relation: $A \mathbin{\Box\!\!\rightarrow} B$ is true at world $w$ just in case $B$ is true at all worlds $w'$ that are closest to $w$ among the worlds in which $A$ is true.
The similarity relation in Lewis's semantics is used to model the complicated nature of causal laws, which are themselves often formulated with ceteris paribus assumptions. In particular, it allows us to have consistent laws with conflicting consequents and antecedents ordered by entailment, as in the following set of cascading counterfactuals:
(3) a. If I were making 100K euro or more, I would have gotten the loan.
b. If I were making 100K euro or more but were convicted of a serious financial fraud, I would not get the loan.
c. If I were making 100K euro or more and were convicted of a serious financial fraud, but then the conviction was overturned and I was awarded a medal, I would get the loan.
In a cascading set of counterfactuals we can count how many times the value of the consequent changes, or flips, as we move from one antecedent to a logically more specific one. We will call the number of flips the degree of the cascading set. The counterfactual semantics with weak centering permits the counterfactuals in (3) to be satisfiable at a world without forcing the antecedents of (3)b or (3)c to be inconsistent. The reason for this is that strengthening of the antecedent fails for counterfactuals; the closest worlds in which I make 100K euro do not include a world $w$ in which I make 100K euro but am also convicted of fraud. Counterfactuals share this property with other conditionals that have been used as the basis for nonmonotonic reasoning [Ginsberg, 1986; Pearl, 1990]. However, if the actual world turns out to be like $w$, then by weak centering (3)a turns out to be false, because the ceteris paribus assumption in (3)a is that the actual world is one in which I am not convicted of fraud.
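The joint satisfiability of the cascade in (3) can be checked mechanically. The following is a minimal sketch, assuming worlds are Boolean feature vectors and that similarity is Hamming distance; both are illustrative assumptions, as are the names `counterfactual` and `gets_loan` and the toy loan rule, none of which come from [Lewis, 1973].

```python
from itertools import product

def hamming(u, v):
    """Number of features on which two worlds differ (assumed similarity measure)."""
    return sum(a != b for a, b in zip(u, v))

def counterfactual(antecedent, consequent, w):
    """Truth at world w of 'antecedent []-> consequent' on closest-world semantics."""
    worlds = product([False, True], repeat=len(w))
    a_worlds = [v for v in worlds if antecedent(v)]
    if not a_worlds:
        return True  # vacuously true: no antecedent-world exists
    closest = min(hamming(w, v) for v in a_worlds)
    return all(consequent(v) for v in a_worlds if hamming(w, v) == closest)

# Hypothetical worlds: (income_100k, fraud_conviction, overturned_with_medal).
def gets_loan(v):
    income, fraud, overturned = v
    return income and (not fraud or overturned)

actual = (False, False, False)

a = counterfactual(lambda v: v[0], gets_loan, actual)                         # (3)a
b = counterfactual(lambda v: v[0] and v[1], lambda v: not gets_loan(v), actual)  # (3)b
c = counterfactual(lambda v: v[0] and v[1] and v[2], gets_loan, actual)       # (3)c
# Strengthening the antecedent of (3)a with a fraud conviction fails:
strengthened = counterfactual(lambda v: v[0] and v[1], gets_loan, actual)
print(a, b, c, strengthened)  # True True True False
```

All three counterfactuals hold at the same world because each antecedent selects a different set of closest worlds, which is exactly why strengthening the antecedent is not valid.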
In adapting counterfactuals to provide explanations of a learning algorithm's behavior around a focal point, it is natural to interpret the similarity relation appealed to in the semantics of counterfactuals as a distance function over the feature space $X$ used to describe data points, in effect identifying the latter as the relevant "worlds" for the semantics of the counterfactuals to be defined over. To find the relevant counterfactuals to explain the behavior of $\hat{f}$ around a focal point $x_p \in X^n$, where $X$ has $n$ dimensions, we exploit a linear map $\Delta_i : X^n \to X^n$ that changes $x_p$ only in the dimensions $i$ and for which $\hat{f}(\Delta_i(x_p)) = \pi$ while $\|\Delta_i(x_p) - x_p\|_X$ is minimal, where $\|\cdot\|_X$ is a natural norm on $X$ like the Euclidean norm. Thus, $\Delta_i$ represents a minimal change in the features of $x_p$ (the values of $x_p$ in the dimensions $i$) needed to shift the prediction of $\hat{f}$ to $\pi$ from the unwanted prediction $\hat{f}(x_p)$. Exploiting the translation from feature values to literals, the antecedents of our target counterfactuals will express these features as a conjunction of literals.
As discussed in [Kusner et al., 2017], these linear maps can be generated via techniques of adversarial perturbations. A typical definition of an adversarial perturbation of an image $x$, given a classifier, is that it is a smallest change to $x$ such that the classification changes. Essentially, this is a counterfactual by a different name. Finding a closest possible world to $x$ such that the classification changes is, under the right choice of distance function, the same as finding the smallest change to $x$ to get the classifier to make a different prediction. Adversarial perturbation has been the locus of a lot of recent research activity and can be computed quite efficiently [Dube, 2018]. Proofs for minimal perturbations can be found using optimal transport theory [Bachoc et al., 2018].
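The equivalence between a closest counterfactual world and a smallest perturbation can be illustrated with an exhaustive grid search. This is a minimal sketch under stated assumptions: the classifier `toy_loan`, the feature grid, and the function `minimal_change` are hypothetical, and the brute-force search stands in for the far more efficient adversarial-perturbation techniques cited above.

```python
import math
from itertools import product

def minimal_change(f_hat, x_p, target, grid):
    """Return a grid point closest to x_p (Euclidean norm) that f_hat maps to `target`."""
    best, best_dist = None, math.inf
    for y in product(*grid):
        if f_hat(y) != target:
            continue
        dist = math.dist(x_p, y)
        if dist < best_dist:
            best, best_dist = y, dist
    return best

# Hypothetical classifier over (income in K euro, debt in K euro).
def toy_loan(x):
    income_k, debt_k = x
    return "loan" if income_k - 2 * debt_k >= 100 else "no loan"

grid = [range(0, 201, 10), range(0, 51, 10)]  # candidate feature values
x_p = (50, 0)                                  # focal point: 50K income, no debt
y = minimal_change(toy_loan, x_p, "loan", grid)
delta = tuple(b - a for a, b in zip(x_p, y))   # the change applied to x_p
print(y, delta)  # (100, 0) (50, 0)
```

The recovered change corresponds to the counterfactual in (2): raising the income to 100K while leaving everything else as it is.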
The fact that counterfactuals are closely tied to adversarial examples relative to some focal point invites a comparison to recent work by [Ignatiev et al., 2019] on explanations and adversarial examples in a deductive framework. The discussion in [Ignatiev et al., 2019] of counterexamples and adversarial examples builds a bridge between deductive-nomological explanations and local counterfactual explanations based on a particular focal point. A counterexample to a prediction $\pi$ is a subset-minimal set of literals $A$ such that $A \models (\hat{f} \to \neg\pi)$. [Ignatiev et al., 2019] show that counterexamples and explanations are incompatible, in that every explanation and every counterexample to a prediction $\pi$ contain literals $e_i$ and $c_j$ such that $e_i$ is inconsistent with $c_j$. An adversarial example to a prediction $\pi$ for $x$ is then an instance that has all the features of a counterexample to $\pi$ and otherwise has the features of the instance $x$. An adversarial example for a learning model $\hat{f}$ is thus a closest element $y$ in $X$ to a focal point $x_p$, in which certain features are shifted so that $\hat{f}(y)$ is a different prediction from $\hat{f}(x_p)$. An adversarial example can then be defined in terms of a linear map $\Delta_i$ on $X$, and this map links the adversarial example with a counterfactual.
Counterexamples can also serve as the basis of explanations of properties of a focal element.
• Why not π for x?
• x has features of A and A is a counterexample to π ($A \models (\hat{f} \to \neg\pi)$).
Note that this deductive explanation is distinct from counterfactual explanations. When a counterfactual is used to give an explanation, the relationship between the explanans and the explanandum is not logical consequence but a more pragmatic relation based on a Lewisian analysis of causation. The counterfactual in (2) gives a sufficient reason for A's getting the loan, all other factors of A's situation being equal, or being as equal as possible given the assumption of a different salary for A. Deductive explanations specify those ceteris paribus conditions; this makes them more complex but also invariant with respect to the choice of focal point. Counterfactual explanations, by contrast, depend on the nature of the focal point. Their (relative) simplicity comes at a cost. Counterfactuals may offer only a partial explanation in some cases [Wachter et al., 2017]. In fact there are two sorts of partiality in a counterfactual explanation. First, a counterfactual explanation doesn't specify the ceteris paribus conditions and so doesn't specify what is necessary for the prediction; call this partiality. On the other hand, counterfactual explanations are also partial in the sense that they don't specify all the sufficient conditions for the prediction; they are hence what are called local explanations. A given counterfactual might give only a partial picture of the behavior of the program or agent. It might not give a complete local explanation of a decision, as such a decision might be overdetermined for the given point at hand. A prediction $\pi$ is overdetermined for a focal point $x_0$ iff the following set of linear maps contains at least two elements: $\{\Delta_i : \hat{f}(\Delta_i(x_0)) = \pi\}$. Many real-world applications like our bank loan example will have this feature.
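Overdetermination can be tested directly on a small Boolean feature space. The sketch below is illustrative only: it assumes that each map $\Delta_i$ flips a set of Boolean features, it enumerates the subset-minimal flip sets that reach the prediction $\pi$, and it uses a hypothetical classifier `biased_loan` that mirrors the loan scenario of the introduction.

```python
from itertools import combinations

def flip(x, dims):
    """Return x with the Boolean features in `dims` negated."""
    return tuple((not v) if i in dims else v for i, v in enumerate(x))

def minimal_flip_sets(f_hat, x_0, target):
    """Subset-minimal sets of dimensions whose flipping at x_0 yields `target`."""
    found = []
    for size in range(1, len(x_0) + 1):
        for dims in combinations(range(len(x_0)), size):
            if any(set(d) <= set(dims) for d in found):
                continue  # a smaller change already suffices
            if f_hat(flip(x_0, dims)) == target:
                found.append(dims)
    return found

def is_overdetermined(f_hat, x_0, target):
    """True if at least two distinct minimal changes lead to `target`."""
    return len(minimal_flip_sets(f_hat, x_0, target)) >= 2

# Hypothetical classifier over (high_income, is_white), as in the loan scenario.
def biased_loan(x):
    high_income, is_white = x
    return "loan" if high_income or is_white else "no loan"

x_0 = (False, False)
print(minimal_flip_sets(biased_loan, x_0, "loan"))  # [(0,), (1,)]
print(is_overdetermined(biased_loan, x_0, "loan"))  # True
```

Because two independent changes each reverse the decision, an explainer is free to report only the income counterfactual and stay silent about the other one.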

From partial to complete explanations
In principle, we can move from a partial picture of the behavior of $\hat{f}$ to a more complete one. The linear maps $\Delta_i$ associated with counterfactuals permit us to plot the local behavior of $\hat{f}$ around a focal point $x_p$. If we look at all possible $\Delta_i$ using all the possible combinations of dimensions of $X$, we can plot a neighborhood $N_{\hat{f},x_p}$ around $x_p$, where for all points $z$ in the interior of $N_{\hat{f},x_p}$, $\hat{f}(z) = \hat{f}(x_p)$, and for points $w$ on the boundary of $N_{\hat{f},x_p}$, $\hat{f}(w) \neq \hat{f}(x_p)$. We call the collection of such linear maps a complete local explanation of the decision at $x_p$.
$N_{\hat{f},x_p}$ captures the fact that there may be several distinct conditions the lack of which would be causally responsible for a particular prediction, like A's not getting the loan in our motivating example. Let us call the set of counterfactuals corresponding to $N_{\hat{f},x_p}$ $S(N_{\hat{f},x_p})$. $S(N_{\hat{f},x_p})$ may contain many cascading counterfactuals, and its cascading degree may be high.
There are some cases in which the geometry of the prediction space allows us to move from complete local to global explanations of the behavior of $\hat{f}$. Suppose $\hat{f}$ changes values only once for each feature/dimension $d_i$ moving out from a focal point $x_p$. In such a case $N_{\hat{f},x_p}$ forms a convex subspace of $\hat{f}[X]$ and a complete local explanation provides a full global explanation.
There is an important connection between the cascading degree of $S(N_{\hat{f},x_p})$ and the geometry of $\hat{f}$ on the feature space. If $\hat{f}$ changes values only once for each feature/dimension $d_i$, $S(N_{\hat{f},x_p})$ has cascading degree 1. In addition, $S(N_{\hat{f},x_p})$ has cascading degree 1 iff the features/dimensions are pairwise independent of each other with respect to $\hat{f}$'s predictions.
Proposition 1. Suppose that the feature space is Boolean valued. Then $S(N_{\hat{f},x_p})$ has a cascading degree $\leq 2$ iff $N_{\hat{f},x_p}$ is convex.
Proof: Note that a cascading set of counterfactuals exhibits an entailment relation between antecedents. Thus, if $\varphi \mathbin{\Box\!\!\rightarrow} \psi$ and $\chi \mathbin{\Box\!\!\rightarrow} \neg\psi$ are counterfactuals of degrees $n$ and $n-1$, $\varphi$ entails $\chi$. This means that if we have a set $S$ of cascading counterfactuals of degree 3 or more, we will have antecedents $\varphi, \chi, \delta$, which are conjunctions of feature values, such that $\varphi \models \chi \models \delta$; in the Manhattan space of Boolean features, this puts the points in the feature space corresponding to $\varphi, \chi, \delta$ on the same line. But if $S$ has degree 3 or greater, this means that one of the points will not be in the space of predictions made by $\hat{f}$ at the other two points. So $N_{\hat{f},x_p}$ is non-convex. Conversely, suppose $N_{\hat{f},x_p}$ is non-convex. Then the construction of counterfactuals from $N_{\hat{f},x_p}$ will immediately yield a cascading set of degree 3 or higher.
The cascading degree of $S(N_{\hat{f},x_p})$ for non-convex $N_{\hat{f},x_p}$ thus gives a measure of the degree of non-convexity of $N_{\hat{f},x_p}$, and a measure of the complexity of an explanation.
Remark 1. Suppose $N_{\hat{f},x_p}$ gives rise to a set of cascading counterfactuals of degree 2. In this case two complete local explanations can determine a full global explanation (and determine the behavior of $\hat{f}$).
The problem with complete local explanations is that they may still be too complex for any human to understand. It is not unusual for AI applications to encode data via hundreds or even thousands of features. The complete local explanation would involve too many counterfactuals for humans to grasp.

Pragmatic constraints on explanations
Because the set of causal factors can be so complex, explanations must have an important pragmatic component. All explanations have an "explainee" who asks for the explanation, and an explanation for an explainee must respond to the particular conundrum that brought the explainee to ask for one [Bromberger, 1962; Achinstein, 1980]; a more contemporary view aligned with this is [Miller, 2019]. For instance, our loan seeker A may wonder why the bank refused her a loan when she has what she thinks is an adequate qualifying income and other qualities. An appropriate explanation for A would then explain which of her assumptions was faulty or incomplete, thus solving the conundrum. In essence, A has in mind the "ideal" function $f$ and is confused about why the value of $\hat{f}$ on her data $x_p$ is not that of $f$; i.e., her conundrum is that $f(x_p) \neq \hat{f}(x_p)$. The conundrum can come about for two reasons: either A is simply mistaken about the nature of $\hat{f}$ (perhaps she is also mistaken about $f$ or, if not, she is mistaken about how $\hat{f}$ differs from $f$), or her understanding of $\hat{f}$ is incomplete.
For the incompleteness sense, suppose $x_p$ is decomposed into $(x_{d_1}, x_{d_2})$; for A, $\hat{f}$ only pays attention to the values of the dimensions $d_1$, in the sense that for her the value of $\hat{f}$ does not depend on the values of the dimensions $d_2$. An adequate explanation will then point out that, for some $\Delta$ that changes the values of the dimensions $d_2$ to $y_{d_2}$, $\hat{f}(\Delta(x_p)) = \pi$ while $\hat{f}(x_p) = \eta$. So we have, simplifying, conundra resulting from incompleteness and conundra resulting from mistaken information. We then stipulate:

CI Suppose we have a conundrum based on incompleteness. An adequate explanation for explainee A who requests an explanation why $\hat{f}(x_p) = \eta$ must resolve A's conundrum arising from attending only to some dimensions $d_1$ of $x_p = (x_{d_1}, x_{d_2})$. More precisely, the explanation must provide a $\Delta$ such that $\Delta(x_p) = (x_{d_1}, y_{d_2})$ and $\hat{f}(\Delta(x_p)) = \pi$.
CM Suppose A's conundrum is based on error. An adequate explanation for explainee A who requests an explanation why $\hat{f}(x_p) = \eta$ must resolve A's conundrum by providing the values for the dimensions $d_2$ of $x_p$ on which A is mistaken. More precisely, supposing a de-

The fairness / bias problem
Partial explanations seem good epistemologically, but they can also be dangerous. We remarked that several counterfactuals that point to a difference in behavior may be simultaneously true. But this means that without the complete local explanation, $\hat{f}$ may act in ways, unknown to the agent $x_0$ or to the public, that are biased or unfair. Worse, the constructor or owner of $\hat{f}$ will be able to conceal this fact if the decision for $x_0$ is overdetermined, by offering counterfactual explanations using maps $\Delta_i$ that don't mention the biased feature. So we need another constraint on satisfactory explanations: a satisfactory explanation must make clear the biases of the system which may account for the explainee's incomplete understanding of $f$ or $\hat{f}$. To be more precise, we say that $\hat{f}$ exhibits a biased dependency on a particular prejudicial factor $P$, which we can also take to be a map on $X$, just in case for some $\Delta_i$ and for some incompatible predictions $\delta$ and $\pi$,
$$\hat{f}(\Delta_i(x_p)) = \delta \quad\text{and}\quad \hat{f}(P(\Delta_i(x_p))) = \pi.$$
Note that the incompleteness condition for a conundrum mirrors the notion of a biased dependency. As a person with data $x_p$ might reasonably want to know whether such biases played a role in a particular decision, we put one additional constraint on appropriate explanations:
CB An appropriate explanation for explainee A must lay bare any prejudicial factors $P$ that affect A. That is, where $\pi$, $\delta$ and $P$ are defined as above, the explanation must provide $\hat{f}(\Delta(x_p)) = \delta$ and $\hat{f}(P(\Delta(x_p))) = \pi$.
In our loan example, explanations that obey (CB) might not be in the interest of the bank that owns the ML algorithm. For instance, the bias of the bank against loans to people of color is unfair to A. But the bank might not want to have this bias exposed. This is the ethical problem for explanations.
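The biased-dependency condition can be checked by brute force on a small Boolean feature space. The following minimal sketch assumes that each $\Delta$ flips a (possibly empty) set of Boolean features and that the prejudicial map $P$ flips one sensitive feature; the classifiers `biased_loan` and `fair_loan` and the function `biased_dependencies` are hypothetical names introduced only for illustration.

```python
from itertools import combinations

def flip(x, dims):
    """Return x with the Boolean features in `dims` negated."""
    return tuple((not v) if i in dims else v for i, v in enumerate(x))

def biased_dependencies(f_hat, x_p, P):
    """List (dims, delta, pi) where f_hat(D(x_p)) = delta differs from f_hat(P(D(x_p))) = pi."""
    witnesses = []
    for size in range(len(x_p) + 1):          # size 0 is the identity map (assumption)
        for dims in combinations(range(len(x_p)), size):
            y = flip(x_p, dims)
            delta, pi = f_hat(y), f_hat(P(y))
            if delta != pi:
                witnesses.append((dims, delta, pi))
    return witnesses

# Hypothetical classifiers over (high_income, is_white).
def biased_loan(x):
    return "loan" if x[0] or x[1] else "no loan"

def fair_loan(x):
    return "loan" if x[0] else "no loan"

P = lambda y: flip(y, (1,))   # prejudicial map: change only the race feature
x_p = (False, False)
print(biased_dependencies(biased_loan, x_p, P))
# [((), 'no loan', 'loan'), ((1,), 'loan', 'no loan')]
print(biased_dependencies(fair_loan, x_p, P))   # []
```

An explanation meeting (CB) would have to report at least one of the witnesses found for `biased_loan`; the income counterfactual in (2) alone would not.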
To attain an appropriate explanation meeting CB, we cannot simply rely on the bank's proffered explanation, as the bank may have something to hide. In similar fashion, even if we have access to all of $\hat{f}$, we cannot rest content with just one counterfactual explanation that satisfies CM and CI. We have to ensure that CB is met as well. We will say that a set of counterfactuals provides an adequate local explanation just in case it obeys CM, CI and CB.
Let us suppose that we do not have access to $\hat{f}$. To find an appropriate explanation, we imagine a game played between the bank and the would-be loan taker, in which the loan taker can ask questions of the bank (or the owner or developer of the algorithm) about the algorithm's decisions. More particularly, we propose to use a two-player game, an explanation game, to get at appropriate explanations for A.
To define an explanation game, we first fix a set of two players {0, 1}. (1 − i) denotes the opponent of i.
The moves or actions in the game for 0, denoted $A_0$, are to request an explanation from 1 about the behavior of $\hat{f}$ at $x_p$, which means requesting a particular linear map $\Delta_i$. 0 may accept an explanation provided by 1, or reject 1's explanation and request an explanation different from the explanations previously offered by 1; that is, request a $\Delta_j$ where $j$ and $i$ are different sets of dimensions of $X$ (this move is known as restriction). Alternatively, 0 may force 1 to provide a linear map generating a counterfactual explanation for a particular set of dimensions. If 0 is allowed to use forcing, then she can name which dimensions she wants varied to find a boundary point of $N_{\hat{f},x_p}$. 1's moves $A_1$ consist of the following: 1 may offer a $\Delta_j$ generating a counterfactual explanation; 1 may also claim that a particular map $\Delta_j$ for dimensions $j$ does not generate a counterfactual explanation, that is, for no values in dimensions $j$ do we get the requested prediction; 1 may also insist that a previously given $\Delta$ solves 0's conundra. The game terminates when (a) 0 is convinced that her conundra are resolved, or (b) all possible $\Delta$ have been examined, or (c) 0 does not want to continue anymore.
We assume that in an explanation game it is common knowledge among the players that one can check the truth of 1's responses and whether 1's counterfactuals address the constraints we have mentioned. Thus, we do not address issues of deceit here.
We can now specify an explanation game and its winning condition for player 0.
Definition 1. An explanation game $G$ concerning a polynomially computable function $\hat{f} : X^n \to Y$, where $X^n$ is a space of data and $Y$ a set of predictions, is a tuple $(A_0 \cup A_1, \hat{f} : X^n \to Y, x_p, C)$ where:
i. 1, but not 0, has access to the behavior of $\hat{f}$.
ii. $x_p \in X^n$ is the starting position, and 0 opens $G$ with a request for an explanation.
iii. 1 responds to 0's requests (forcing or restriction) or claims an adequate explanation is already provided.
iv. $C \subseteq X^n$ is a local minimum of $\hat{f}$ with respect to the neighbourhood $N_{\hat{f},x}$, where $x$ is the current position in the game. Every $c \in C$ resolves one or more of 0's conundra CM or CI.
We say that 0 wins $G$ just in case in $G$ she acquires a set of counterfactuals about the behavior of $\hat{f}$ concerning features $C$ that answer her conundrum.
If the dimensions encoding data in $X$ are common knowledge, then 0 always has a winning strategy in an explanation game. We will assume this, and so the real question is how quickly 0 can achieve her winning condition. This depends on three parameters. The first has to do with whether 0 knows which are the prejudicial dimensions for $\hat{f}$'s behavior. In general this is not known; an algorithm $\hat{f}$ may be shown to be insensitive, say, to an explicit gender dimension in the data, in the sense that it makes the same predictions whether this feature is taken into account or not. Nevertheless, $\hat{f}$ might still be gender sensitive, because other dimensions of $X$ encode gender information, making the gender dimension itself superfluous. In general we do not assume that 0 knows which dimensions are prejudicial, i.e., which dimensions encode prejudicial factors.
The second parameter governing complexity is the degree of the set $C$ of cascading counterfactuals associated with $N_{\hat{f},x_p}$. The third has to do with whether 0 is allowed forcing moves or only restriction moves. If we assume that 1 has already calculated the boundary of $N_{\hat{f},x_p}$, then the following results are in order.
Proposition 2. Suppose that in an explanation game $G$ involving $\hat{f}$, 0 has focal point $x_p \in X^n$ and the cascading degree of $S(N_{\hat{f},x_p})$ is $< 3$. If 0 knows which dimensions encode prejudicial factors, she has an at worst linear time winning strategy using forcing. If she does not know which dimensions encode prejudicial factors, 0 has an at worst $n^2$ time winning strategy.
To see this, note that 0 needs only to force 1 to provide the counterfactuals that vary the known prejudicial factors together with at most one other feature value. The size of the set of counterfactuals uncovered is a linear function of $n$.
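The forcing strategy just described can be sketched in a few lines. This is an illustration under stated assumptions, not the construction used in the proof: features are Boolean, player 1 is modelled as a truthful oracle that simply evaluates a hypothetical classifier `toy_loan`, and player 0 forces each known prejudicial dimension to be varied together with at most one other feature.

```python
def flip(x, dims):
    """Return x with the Boolean features in `dims` negated."""
    return tuple((not v) if i in dims else v for i, v in enumerate(x))

def forcing_strategy(f_hat, x_p, prejudicial):
    """Vary each known prejudicial dimension with at most one other feature.

    Returns (witnesses, queries); a witness (p, dims, without_p, with_p) records
    that flipping p changes the prediction once the features in `dims` are flipped.
    """
    n = len(x_p)
    witnesses, queries = [], 0
    for p in prejudicial:
        contexts = [()] + [(d,) for d in range(n) if d != p]
        for dims in contexts:
            without_p = f_hat(flip(x_p, dims))       # answer forced from player 1
            with_p = f_hat(flip(x_p, dims + (p,)))   # same change plus the factor p
            queries += 2
            if without_p != with_p:
                witnesses.append((p, dims, without_p, with_p))
    return witnesses, queries

# Hypothetical classifier over (high_income, is_white, good_postcode, owns_home).
def toy_loan(x):
    return "loan" if x[0] or (x[1] and x[2]) else "no loan"

x_p = (False, False, False, False)
print(forcing_strategy(toy_loan, x_p, prejudicial=[1]))
# ([(1, (2,), 'no loan', 'loan')], 8)
```

With one known prejudicial dimension the number of forced queries is $2n$, so it grows linearly with the number of features; here the bias only surfaces when race is varied together with the postal code.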
If we do not assume 0 knows the prejudicial factors, then:
Proposition 3. Suppose that in an explanation game $G$ involving $\hat{f}$, with 0's data as focal point $x_p \in X^n$, the cascading degree of $S(N_{\hat{f},x_p})$ is $k \geq 3$. Then 0 has a winning strategy using forcing in $G$ that is in the complexity class Polynomial Local Search (PLS) in the worst case [Johnson et al., 1988].
Given that $\hat{f}$ is polynomial and the cascading degree of $S(N_{\hat{f},x_p})$ is fixed, finding the neighbors of a boundary point of $N_{\hat{f},x_p}$ is poly-time. Thus, 0's winning strategy for the game meets the conditions for a PLS problem.
What happens when 0 cannot force 1 to provide certain counterfactuals, but can restrict 1 to shift a single dimension in his explanation?
Proposition 4. Suppose that for $X \subset \mathbb{R}^n$, in an explanation game $G$ with $\hat{f}$, 0 has focal point $x_p \in X^n$ and the cascading degree of $S(N_{\hat{f},x_p})$ is $< 2$. Then 0 has a winning strategy in $G$ that is worst-case linear time with respect to $n$.
Given that 0 can only ask for a single dimension to be shifted, her strategy is just to go through all the dimensions by continuing to ask for a new explanation. Eventually 0 will have gone through all of the dimensions of $X^n$. As the cascading degree of $S(N_{\hat{f},x_p})$ is $< 2$, she will have determined $S(N_{\hat{f},x_p})$.
Once the cascading degree associated with $N_{\hat{f},x_p}$ is $\geq 2$, Player 1 can hide, if he wishes, prejudicial factors for exponential time.
Proposition 5. Suppose that for $X \subset \mathbb{R}^n$, in an explanation game $G$ with $\hat{f}$, 0 has focal point $x_p \in X^n$ and the cascading degree of $S(N_{\hat{f},x_p})$ is $\geq 2$. Then 0 has a winning strategy in $G$ that is worst-case exponential time with respect to $n$.
Player 1 in this case is free to provide counterfactuals that satisfy constraints (CI) and (CM) but possibly not (CB), and can provide counterfactuals with arbitrarily complex antecedents. Without forcing, 0 cannot make 1 provide counterfactuals that satisfy (CB). However, given that the boundary of $N_{\hat{f},x_p}$ is computable, 0 still has a winning strategy with restriction in an explanation game. She will eventually compute the boundary of $N_{\hat{f},x_p}$, in exponential time with respect to $n$ in the worst case.
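The gap between the linear and the exponential case can be illustrated by comparing a single-dimension sweep with an exhaustive sweep over all sets of dimensions. The sketch below is a simplification under stated assumptions: features are Boolean, every request is answered truthfully by evaluating a hypothetical classifier `toy_loan`, and the query counts stand in for the length of the game.

```python
from itertools import combinations

def flip(x, dims):
    """Return x with the Boolean features in `dims` negated."""
    return tuple((not v) if i in dims else v for i, v in enumerate(x))

def single_dimension_sweep(f_hat, x_p, target):
    """Ask about one dimension at a time: n queries."""
    hits = [(d,) for d in range(len(x_p)) if f_hat(flip(x_p, (d,))) == target]
    return hits, len(x_p)

def exhaustive_sweep(f_hat, x_p, target):
    """Examine every non-empty set of dimensions: 2**n - 1 queries."""
    hits, queries = [], 0
    for size in range(1, len(x_p) + 1):
        for dims in combinations(range(len(x_p)), size):
            queries += 1
            if f_hat(flip(x_p, dims)) == target and not any(set(h) <= set(dims) for h in hits):
                hits.append(dims)  # keep only subset-minimal changes
    return hits, queries

# Hypothetical classifier over (high_income, is_white, good_postcode, owns_home):
# race matters only jointly with the postal code, so the dimensions are not independent.
def toy_loan(x):
    return "loan" if x[0] or (x[1] and x[2]) else "no loan"

x_p = (False, False, False, False)
print(single_dimension_sweep(toy_loan, x_p, "loan"))  # ([(0,)], 4)
print(exhaustive_sweep(toy_loan, x_p, "loan"))        # ([(0,), (1, 2)], 15)
```

The single-dimension sweep returns only the income counterfactual and misses the joint dependence on race and postal code, which is recovered only by the sweep whose cost doubles with every added feature.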
We draw from this the moral that, in order to be effective, an explanation game must allow 0 to force 1 to vary certain dimensions that 0 picks, and $S(N_{\hat{f},x_p})$ must have a fixed cascading degree. We also claim that having such a fixed cascading degree might be an important aspect of good explanations, as there is an important connection between cascading degrees and the generality of laws. A low cascading degree means a general set of laws, which a priori is preferable scientifically.

Conclusions and outlook
We have used the semantics and logic of counterfactuals to explore epistemically accessible and adequate but partial explanations. Counterfactual explanations are promising vehicles for epistemic accessibility, and we have shown that they can algorithmically provide adequate explanations where all biases are made clear if certain conditions (forcing) obtain and the cascading degree of the set of counterfactuals describing the local neighborhood around the focal point is fixed. However, we have not explored here another important parameter for epistemic adequacy. Adequacy also depends on the characterization of the input data $X$ for a learning algorithm $\hat{f} : X \to Y$. If the dimensionality of $X$ is too high, or if the dimensions don't correspond to intuitive concepts, then even partial explanations may not be satisfactory. So to get fully explanatory counterfactuals for the behavior of $\hat{f}$, it will be important in such cases to find the right representation of the data for the explainee.
Counterfactual explanations may also express adversarial examples. These typically aren't good explanations of the phenomenon $\hat{f}$ is trying to model. Nevertheless, such counterfactuals explain the behavior of $\hat{f}$ and can as such be very valuable.