# Interpretability Is in the Mind of the Beholder: A Causal Framework for Human-Interpretable Representation Learning


## Abstract


## 1. Introduction

`positive`” by a black-box predictor. In this example, a saliency map would highlight those pixels that are most responsible for this prediction: these do not say whether the prediction depends on the image containing a “`car`”, on the car being “`red`”, or on the car being “`sporty`”. As a consequence, it is impossible to understand what the model is “thinking” and how it would behave on other images based on this explanation alone [9].

#### 1.1. Limitations of Existing Works

`intraepithelial`” may be essential for the former, while being complete gibberish to the latter. However, even when concept annotations are employed, they are gathered from offline repositories and, as such, may not capture concepts that are meaningful to a particular expert, or that, despite being associated with a familiar name, follow semantics incompatible with those the user attaches to that name. Of course, there are exceptions to this rule; these are discussed in Section 6.

#### 1.2. Our Contributions

**human-interpretable representation learning**, or hrl for short. Successful communication is essential for ensuring human stakeholders can understand explanations based on the learned concepts and, in turn, for realizing the potential of CBEs and CBMs. This view is compatible with recent interpretations of the role of symbols in neuro-symbolic AI [10,30]. The key question is how to model this human element in a way that can actually be operationalized. We aim to fill this gap.

#### 1.3. Outline

## 2. Preliminaries

#### 2.1. Structural Causal Models and Interventions

#### 2.2. Disentanglement

`color`” of an object and ${G}_{2}$ its “`shape`”, disentanglement of variables implies that changing the object’s color does not impact its shape. This should hold even if the variables $\mathbf{G}$ have a common set of parents $\mathbf{C}$—playing the role of confounders, such as sampling bias or choice of source domain [38]—meaning that they can be both disentangled and correlated (via $\mathbf{C}$). From a causal perspective, disentanglement of variables can be defined as follows:
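The intuition behind this notion can be sketched as a toy simulation: the two factors below are correlated through a common confounder $\mathbf{C}$, yet an intervention on one leaves the other untouched. All names and mechanisms here are illustrative assumptions, not part of the formal framework.

```python
import random

random.seed(0)

def sample_factors():
    # A confounder C (e.g., source domain) influences BOTH factors, so
    # G1 (color) and G2 (shape) are correlated via C while remaining
    # causally disentangled from each other.
    c = random.random()
    color = 0.9 * c + 0.1 * random.random()   # G1 <- C
    shape = 0.9 * c + 0.1 * random.random()   # G2 <- C
    return color, shape

def intervene_on_color(color, shape, new_color):
    # do(G1 := new_color) replaces the mechanism for color; under
    # disentanglement of variables, shape is left untouched.
    return new_color, shape

color, shape = sample_factors()
new_color, new_shape = intervene_on_color(color, shape, 0.9)
print(new_shape == shape)  # True: disentangled, yet correlated via C
```

Observationally, `color` and `shape` co-vary (both track `c`); only the interventional test reveals that neither causes the other.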

**Definition 1**(Disentanglement of variables).

**Definition 2**

**Definition 3**(Disentanglement of representations).

**Definition 4**(Content-style separation).

## 3. Human Interpretable Representation Learning

**Definition 5**(Intuitive statement).

#### 3.1. Machine Representations: The Ante-Hoc Case

#### 3.2. Machine Representations: The Post Hoc Case

#### 3.3. From Symbolic Communication to Alignment

`color`” or “`shape`” of that object, or any other properties deemed relevant by the human. The choice and semantics of these concepts depend on the background and expertise of the human observer, and possibly on the downstream task the human may be concerned with (e.g., medical diagnosis or loan approval), and, as such, may vary between subjects. It is to these concepts that the human attaches names, as in Figure 4, and it is these concepts that they would use for communicating the properties of $\mathbf{x}$ to other people.

`redness`” is not causally related to the apple’s appearance, and yet easily understood by most human observers, precisely because it is a feature that is evolutionarily and culturally useful to those observers. In this sense, the concepts $\mathbf{H}$ are understandable, by definition, to the human they belong to.

## 4. Alignment as Name Transfer

#### 4.1. Alignment: The Disentangled Case

**Definition 6**(Alignment).

**D1.** The index map $\pi :\mathcal{J}\to \mathcal{I}$ is surjective and, for all $j\in \mathcal{J}$, it holds that, as long as ${G}_{\pi \left(j\right)}$ is kept fixed, ${M}_{j}$ remains unchanged even when the other generative factors $\mathbf{G}\setminus \{{G}_{\pi \left(j\right)}\}$ are forcibly modified.
**D2.** Each element-wise transformation ${\mu}_{j}$, for $j\in \mathcal{J}$, is monotonic in expectation over ${N}_{j}$:$$\exists \bowtie \in \{>,<\}\;\mathrm{s.t.}\;\forall\, {g}_{\pi \left(j\right)}^{\prime}>{g}_{\pi \left(j\right)}:\quad {\mathbb{E}}_{{N}_{j}}\left[{\mu}_{j}({g}_{\pi \left(j\right)}^{\prime},{N}_{j})\right]\bowtie {\mathbb{E}}_{{N}_{j}}\left[{\mu}_{j}({g}_{\pi \left(j\right)},{N}_{j})\right]$$
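As a sketch of how **D2** could be checked empirically, the following estimates ${\mathbb{E}}_{{N}_{j}}[{\mu}_{j}(g,{N}_{j})]$ on a grid of factor values and tests whether one ordering applies throughout. The cubic map and Gaussian noise are illustrative assumptions, not part of the definition.

```python
import random

random.seed(0)

def mu(g, n):
    # Hypothetical element-wise map from factor value g to machine concept
    # M_j: non-linear, perturbed by independent noise n, but monotonic
    # in expectation.
    return g ** 3 + 0.05 * n

def expected_mu(g, n_samples=2000):
    # Monte Carlo estimate of E_{N_j}[mu_j(g, N_j)].
    return sum(mu(g, random.gauss(0.0, 1.0)) for _ in range(n_samples)) / n_samples

grid = [0.0, 0.25, 0.5, 0.75, 1.0]
means = [expected_mu(g) for g in grid]
# D2 holds if a single ordering (here "<") applies to every pair g < g'.
is_monotonic = all(a < b for a, b in zip(means, means[1:]))
print(is_monotonic)
```

Because the check is statistical, in practice the grid spacing must be coarse enough that the estimation error in each mean is small relative to the gap between consecutive expectations.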

**D1** requires that $\alpha $ should not “mix” multiple ${G}_{i}$’s into a single ${M}_{j}$, regardless of whether the former belong to ${\mathbf{G}}_{\mathcal{I}}$ or not. For instance, if ${M}_{j}$ blends together information about both color and shape, or about color and some uninterpretable factor, human observers would have trouble pinning down which one of their concepts it matches. If it does not, then turning the ${G}_{\pi \left(j\right)}$ knob only affects ${M}_{j}$, facilitating name transfer. The converse is not true: as we will see in Section 4.4, interpretable concepts with “compatible semantics” can in principle be blended together without compromising interpretability. We will show in Section 4.2 that this is equivalent to disentanglement.

**D2** is also related to name transfer. Specifically, it aims to ensure that, whenever the user turns a knob ${G}_{\pi \left(j\right)}$, they can easily understand what happens to ${M}_{j}$ and thus figure out that the two variables encode the same information. To build intuition, notice that both **D1** and **D2** hold for the identity function, as well as for those maps $\alpha $ that reorder or rescale the elements of ${\mathbf{G}}_{\mathcal{I}}$, which clearly preserve semantics and naturally support name transfer. Monotonicity captures all of these cases as well as more expressive non-linear element-wise functions, while conservatively guaranteeing that a human would be able to perform name transfer. Notice that the mapping need not be exact, in the sense that the output can depend on independent noise factors $\mathbf{N}$. This leaves room for stochasticity due to, e.g., variance in the concept learning step. Notice also that **D2** can be constrained further based on the application.

#### 4.2. Disentanglement Does Not Entail Alignment

**D1**:

**Proposition 1.** **D1** holds if and only if the representations are disentangled in $({\mathbf{G}}_{\mathcal{I}},{\mathbf{M}}_{\mathcal{J}})$ (see Definition 3).

**D1** implies that disentanglement is insufficient for interpretability: even if $\mathbf{M}$ is disentangled, i.e., each ${M}_{j}$ encodes information about at most one ${G}_{i}\in {\mathbf{G}}_{\mathcal{I}}$, nothing prevents the transformation from ${G}_{i}$ to its associated ${M}_{j}$ from being arbitrarily complex, complicating name transfer. In the most extreme case, $\alpha {(\cdot)}_{j}$ may not be injective, making it impossible to distinguish between different ${g}_{i}$’s, or could be an arbitrary shuffling of the continuous line: this would clearly obfuscate any information present about ${G}_{i}$. This means that, during name transfer, a user would be unable to determine what value of ${M}_{j}$ corresponds to what value of ${G}_{i}$ or to anticipate how changes to the latter affect the former.

**D2** in Definition 6 requires the map between each ${G}_{i}\in {\mathbf{G}}_{\mathcal{I}}$ and its associated ${M}_{j}$ to be “simple”. This extra desideratum makes alignment strictly stronger than disentanglement.
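To make the gap concrete, here is a small numerical illustration (with an ad hoc Spearman implementation): a monotone map preserves the rank structure of a factor, while an information-preserving but shuffled map destroys it, even though both encode a single ${G}_{i}$ and are thus “disentangled”.

```python
import random

random.seed(0)

def spearman(xs, ys):
    # Spearman rank correlation, implemented from scratch (no ties assumed).
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

g = [i / 99 for i in range(100)]        # values of a single factor G_i
monotone_m = [x ** 3 for x in g]        # disentangled AND aligned
lookup = list(range(100))
random.shuffle(lookup)
# Bijective, hence information-preserving, but an arbitrary shuffling:
shuffled_m = [g[lookup[i]] for i in range(100)]

print(spearman(g, monotone_m))   # 1.0: name transfer is easy
print(spearman(g, shuffled_m))   # typically near 0: values of M_j are scrambled
```

Both encodings carry exactly the same information about ${G}_{i}$; only the monotone one lets a user anticipate how changes to the factor move the representation.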

#### 4.3. Alignment Entails No Concept Leakage

**Example 1.** Consider a `dSprites` image [52] picturing a white sprite, determined by generative factors including “`position`”, “`shape`”, and “`size`”, on a black background. Now imagine training a concept extractor ${p}_{\theta}(\mathbf{M}\mid \mathbf{X})$ so that ${\mathbf{M}}_{\mathcal{J}}$ encodes `shape` and `size` (but not `position`) by using full concept-level annotations for `shape` and `size`. The concept extractor is then frozen. During inference, the goal is to classify sprites as either positive ($Y=1$) or negative ($Y=0$) depending on whether they are closer to the top-right corner or the bottom-left corner. When concept leakage occurs, the label, which clearly depends only on `position`, can be predicted with above-random accuracy from ${\mathbf{M}}_{\mathcal{J}}$, meaning these concepts somehow encode information about `position`, which they are not supposed to.
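The mechanics of the example can be mimicked with synthetic numbers; everything here (the leakage coefficient, the trivial threshold classifier) is a made-up illustration rather than the actual dSprites pipeline.

```python
import random

random.seed(0)

# Hypothetical dSprites-like data: the extractor is meant to encode only
# shape, but a fraction of position "leaks" into the shape concept.
n = 5000
data = []
for _ in range(n):
    shape, pos = random.random(), random.random()
    m_shape = shape + 0.5 * pos   # leaked position information (assumed)
    y = 1 if pos > 0.5 else 0     # the label depends ONLY on position
    data.append((m_shape, y))

# A trivial threshold classifier on the leaky concept beats chance (0.5),
# even though shape itself carries no label information.
threshold = sum(m for m, _ in data) / n
acc = sum((m > threshold) == y for m, y in data) / n
print(acc > 0.55)
```

With the leakage term removed (`m_shape = shape`), the same classifier would hover around 50% accuracy, since `shape` is independent of the label.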

`red`” also activates on a few objects that are, in fact, blue, due to leakage. Assume also that the predictor labels a blue object as positive because `red` fires. Then, an explanation for that prediction would be that the blue object is positive because (according to the model) it is red. Clearly, this hinders trustworthiness. The only existing formal account of concept leakage was provided by Marconato et al. [19], who view it in terms of (lack of) out-of-distribution (OOD) generalization. Other works instead focus on in-distribution behavior and argue that concept leakage is due to encoding discrete generative factors using a continuous representation [37,53]. We go beyond these works by providing the first general formulation of concept leakage and showing that it is related to alignment. Specifically, we propose to view concept leakage as a (lack of) content-style separation, and show that this explains how concept leakage can arise both in- and out-of-distribution. To do so, we start by proposing a general reformulation of the concept leakage problem (Definition 7) and derive two bounds from mutual information properties (Proposition 2). Then, we show that a model that achieves perfect alignment avoids concept leakage entirely (Proposition 3 and Corollary 1).

`dSprites` experiment [19] outlined in Example 1. Here, during training, the “`position`” of the sprite is fixed (i.e., ${\mathbf{G}}_{pos}={\mathbf{G}}_{-\mathcal{I}}$ are fixed to the center), while at test time the data contains different interventions over the position ${\mathbf{G}}_{pos}={\mathbf{G}}_{-\mathcal{I}}$, and free variations in the other factors ${\mathbf{G}}_{\mathcal{I}}$ (e.g., “`shape`” and “`size`”). Essentially, these interventions move the sprite around the top-right and bottom-left borders, where the factors ${\mathbf{G}}_{pos}$ are extremely informative for the label $Y$.

**Definition 7**(Concept leakage).

**Proposition 2.**

**Proposition 3.**

`tail`” and “`fur`”—and the task of distinguishing between images of cats and dogs using these (clearly non-discriminative) concepts. According to [53], concept leakage can occur when binary concepts like these are modeled using continuous variables, meaning the concept extractor can unintentionally encode “spurious” discriminative information. In light of our analysis, we argue that concept leakage is instead due to a lack of content-style separation, and thus of alignment. To see this, suppose there exists a concept ${G}_{k}\in {\mathbf{G}}_{-\mathcal{I}}$ useful for distinguishing cats from dogs and that it is disentangled, as in Definition 1, from the concepts of fur ${G}_{fur}$ and tail ${G}_{tail}$. Then, by content-style separation, any representation ${\mathbf{M}}_{\mathcal{J}}$ that is aligned to ${G}_{fur}$ and ${G}_{tail}$ does not encode any information about ${G}_{k}$, leading to zero concept leakage.

**Corollary 1.**

**D1**. In fact, if the representations $({M}_{{j}_{1}},{M}_{{j}_{2}})$ are aligned to the concepts ${G}_{fur}$ and ${G}_{tail}$, respectively, they cannot be used to discriminate between cats and dogs.

#### 4.4. Alignment: The Block-Wise Case

**Definition 8**(Block-wise alignment).

**D1.** There exists a partition ${\mathcal{P}}_{\mathbf{G}}$ of $\mathcal{I}$ such that $\mathsf{\Pi}:{\mathcal{P}}_{\mathbf{M}}\to {\mathcal{P}}_{\mathbf{G}}$. In principle, we can extend this notion to a family of subsets ${\mathcal{P}}_{\mathbf{G}}$ of $\mathcal{I}$. As an example, for $xyz$ positions, one can consider the blocks $\{xy,yz,xz\}$, which are mapped, respectively, to block-aligned representations. We call this condition block-wise disentanglement.
**D2.** Each map ${\mu}_{{\mathcal{J}}^{\prime}}$ is simulatable and invertible (for continuous variables, we require it to be a diffeomorphism) on the first statistical moment; that is, there exists a unique pre-image ${\alpha}^{-1}$ defined as:$${\mathbf{G}}_{\mathsf{\Pi}\left({\mathcal{J}}^{\prime}\right)}={\alpha}^{-1}{\left(\mathbb{E}\left[{\mathbf{M}}_{\mathcal{J}}\right]\right)}_{{\mathcal{J}}^{\prime}}:={\left({\mathbb{E}}_{{\mathbf{N}}_{\mathcal{J}}}\left[{\mu}_{{\mathcal{J}}^{\prime}}\right]\right)}^{-1}\left(\mathbb{E}\left[{\mathbf{M}}_{{\mathcal{J}}^{\prime}}\right]\right)$$

By **D1**, changes to any block of human concepts only impact a single block of machine concepts, and, by **D2**, the change can be anticipated by the human observer; that is, the human interacting with the machine grasps the general mechanism behind the transformation from $\mathbf{G}$ (or $\mathbf{H}$) to $\mathbf{M}$ (and vice versa). Both properties support name transfer.

**D2** intuitively constrains the variables within each block to be “semantically compatible”. In the context of image recognition, for instance, placing concepts such as “`nose shape`” and “`sky color`” in the same block is likely to make name transfer substantially more complicated, as changes to “`nose shape`” might end up affecting the representation of “`sky color`” and vice versa. In that case, it would not be easy for a user to figure out how the concepts have been mixed, undermining simulatability. An example of semantic compatibility is a rototranslation of the coordinates followed by element-wise rescaling. This condition is identical to “weak identifiability” in representation learning [58,59]. A counter-example would be a map $\alpha $ given by a conformal map of the 2D position of an object in a scene. Albeit invertible, it may not be simple to simulate at all.
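The rototranslation-plus-rescaling example can be written out explicitly; the specific angle, scale, and offsets below are arbitrary choices for illustration. The map acts on the $(x,y)$ position block as a whole, is invertible, and its inverse is simple to simulate.

```python
import math

# Block-wise map over the (x, y) position block: rotation, rescaling,
# and translation. Invertible, hence "semantically compatible" per D2.
def block_map(x, y, theta=math.pi / 4, scale=2.0, tx=1.0, ty=-1.0):
    xr = math.cos(theta) * x - math.sin(theta) * y
    yr = math.sin(theta) * x + math.cos(theta) * y
    return scale * xr + tx, scale * yr + ty

def block_map_inv(mx, my, theta=math.pi / 4, scale=2.0, tx=1.0, ty=-1.0):
    # Undo translation and scale, then apply the inverse (transposed) rotation.
    xr, yr = (mx - tx) / scale, (my - ty) / scale
    return (math.cos(theta) * xr + math.sin(theta) * yr,
            -math.sin(theta) * xr + math.cos(theta) * yr)

x, y = 0.3, 0.8
recovered = block_map_inv(*block_map(x, y))
print(all(abs(a - b) < 1e-9 for a, b in zip(recovered, (x, y))))
```

A conformal map of the same block would also be invertible, but its inverse would be far harder for a human to simulate, which is exactly the distinction **D2** draws.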

#### 4.5. Alignment: The General Case

`temperature` and ${G}_{2}$ the `color` of a metal object: the user knows that temperature affects color, and would not assign the same name to a representation of temperature that does not have a similar effect on the representation of color. In order to ensure the preservation of these semantics, we say that a machine representation $\mathbf{M}$ is aligned to $\mathbf{G}$ if, whenever the human intervenes on one ${G}_{i}$, affecting those generative factors ${\mathbf{G}}_{{\mathcal{I}}^{\prime}}$ that depend on it, an analogous change occurs in the machine representation.

**Example 2.** Let ${G}_{1}$ be the `temperature` and ${G}_{2}$ the `color` of a metal solid. Correspondingly, the aligned representation $\mathbf{M}$ encodes the `temperature` in two distinct variables, ${M}_{1}$ and ${M}_{3}$, corresponding, say, to the temperature measured in degrees Celsius and in degrees Fahrenheit, respectively. ${M}_{2}$ encodes the `color` variable.
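The example admits a one-line sketch: two monotone unit conversions of the same factor, plus a separate color channel. The function name `represent` is of course illustrative, not part of the framework.

```python
# Example 2 as code: temperature is encoded twice, via two monotone unit
# conversions, so a human can transfer the name "temperature" to either
# machine variable; color gets its own variable.
def represent(temp_c, color):
    m1 = temp_c                  # degrees Celsius
    m3 = temp_c * 9 / 5 + 32     # degrees Fahrenheit (also monotone in temp_c)
    m2 = color
    return m1, m2, m3

m1, m2, m3 = represent(100.0, 0.7)
print(m1, m2, m3)  # 100.0 0.7 212.0
```

Both ${M}_{1}$ and ${M}_{3}$ are monotone in the same factor, so turning the temperature knob moves them consistently, which is what makes the shared name unambiguous.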

**Proposition 4.**

Intervening on the `temperature` concept will affect both the corresponding representations and the ones aligned to the concept of `color`, whereas intervening on the latter only amounts to changing the representation of `color`, irrespective of the value assumed by the `temperature` representation.

#### 4.6. Alignment and Causal Abstractions

**Definition 9**($\beta $-aligned causal abstraction).

**Example 3.**

We refer to ${H}_{1}$ and ${M}_{1}$ as the `temperature` variable and to ${H}_{2}$ and ${M}_{2}$ as the `color` variable. Despite the different structure, we suppose ${M}_{1}$ and ${M}_{2}$ are aligned to ${H}_{1}$ and ${H}_{2}$, respectively, via an aligned map $\beta $. We indicate the overall causal graph as ${\mathfrak{C}}_{\mathbf{H}\to \mathbf{M}}$; see Figure 7 (right).

In ${\mathfrak{C}}_{\mathbf{M}}$, a change in the `temperature` amounts to modifying only the corresponding variable ${M}_{1}$ and does not affect ${M}_{2}$, as evident in Figure 7 (left). Conversely, a change in the `temperature` under alignment also corresponds to a change in `color` for the variable ${M}_{2}$, as depicted in Figure 7 (right). The two interventional effects, hence, do not coincide, and ${\mathfrak{C}}_{\mathbf{M}}$ is not an aligned causal abstraction of $\mathbf{H}$.

## 5. Discussion and Limitations

#### 5.1. Is Perfect Alignment Sufficient and Necessary?

`dog`” and how it would change if they were to delete the dog from the picture. This last point takes into account, at least partially, the limited cognitive processing abilities of human agents.

`ripe` or not to be interpretable only requires the concepts appearing in its explanations to be aligned, and possibly only those values that actually occur in the explanations (e.g., $\mathtt{color}=\mathtt{red}$ but not $\mathtt{color}=\mathtt{blue}$). This can be understood as a more lax form of alignment applying only to a certain subset of (values of) the generative factors ${\mathbf{g}}_{\mathcal{I}}$, e.g., those related to apples. It is straightforward to relax Definition 6 in this sense by restricting it to a subset of the support of ${p}^{*}\left({\mathbf{G}}_{\mathcal{I}}\right)$ from which the inputs $\mathbf{X}$ are generated, as these constrain the messages that the two agents can exchange.

#### 5.2. Measuring Alignment

For measuring disentanglement (**D1**), one option is to leverage one of the many measures of disentanglement proposed in the literature [64]. The main issue is that most of them provide little information about how simple the map $\alpha $ (**D2**) is and, as such, they cannot be reused as-is. However, for the disentangled case (see Section 4.1), Marconato et al. [19] noted that one can measure alignment using the linear DCI [40]. Essentially, this metric checks whether there exists a linear regressor that, given ${\mathbf{m}}_{\mathcal{J}}$, can predict ${\mathbf{g}}_{\mathcal{I}}$ with high accuracy, such that each ${M}_{j}$ is predictive for at most one ${G}_{i}$. In practice, doing so involves collecting a set of annotated pairs $\left\{({\mathbf{m}}_{\mathcal{J}},{\mathbf{g}}_{\mathcal{I}})\right\}$, where the ${m}_{j}$’s and ${g}_{i}$’s are rescaled to $[0,1]$, and fitting a linear regressor on top of them using ${L}_{1}$ regularization. DCI then considers the (absolute values of the) regressor coefficients $B\in {\mathbb{R}}^{\left|\mathcal{J}\right|\times \left|\mathcal{I}\right|}$ and evaluates the average dispersion of ${B}_{j:}$ for each machine representation ${M}_{j}$. In short, if each ${M}_{j}$ predicts only a single ${G}_{i}$, and with high accuracy, then linear DCI is maximal. The key insight is that the existence of such a linear map implies both disentanglement (**D1**) and monotonicity (**D2**), and therefore also alignment. The main downside is that the converse does not hold; that is, linear DCI cannot account for non-linear monotonic relationships.
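A minimal numerical sketch of the idea, substituting per-pair correlation magnitudes for the ${L}_{1}$-regularized regressor of [40] (so this is an approximation of the recipe, not the metric itself): the dispersion of each row of $B$ is high exactly when each ${M}_{j}$ tracks a single ${G}_{i}$.

```python
import math
import random

random.seed(0)

def corr(xs, ys):
    # Pearson correlation, from scratch.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Toy data: two generative factors, two machine concepts, where each M_j
# is a monotone function of exactly one G_i (the aligned case).
G = [[random.random(), random.random()] for _ in range(2000)]
M = [[g1 ** 2, 3 * g2 + 1] for g1, g2 in G]

# B[j][i] stands in for the magnitude of the regressor coefficient.
B = [[abs(corr([m[j] for m in M], [g[i] for g in G])) for i in range(2)]
     for j in range(2)]

def dispersion(row):
    # 1 - normalized entropy of the row: close to 1 when M_j tracks one G_i.
    total = sum(row)
    p = [b / total for b in row]
    h = -sum(x * math.log(x) for x in p if x > 0)
    return 1 - h / math.log(len(row))

scores = [dispersion(row) for row in B]
print(all(s > 0.5 for s in scores))
```

Replacing `M` with a mixed encoding such as `[g1 + g2, g1 - g2]` spreads each row of $B$ over both factors and drives the dispersion scores down.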

A more general strategy is to measure **D1** and **D2** separately, and to leverage causal notions for the former. **D1** can, for instance, be measured using the interventional robustness score (IRS) [41], an empirical version of $\mathsf{EMPIDA}$ (Definition 2) that, essentially, measures the average effect of interventions on ${\mathbf{G}}_{\mathcal{I}}$ on the machine representation. Alternatives include, for instance, DCI-ES [65], which can better capture the degree by which factors are mixed, and the mutual information gap (MIG) [66]. These metrics allow one to establish an empirical map $\pi $ between the indices of the human and machine representations, using which it is possible to evaluate **D2** separately. One option is to evaluate Spearman’s rank correlation between the distances:

**D1** and **D2** can be adapted from them. The total number of concept combinations, even in the disentangled case (Definition 1), grows exponentially with the number of concepts $k$, which in practice requires bounds for the estimation. This is also a noteworthy problem in the disentanglement literature; see, e.g., [64]. We leave an in-depth study of more generally applicable metrics to future work.

#### 5.3. Consequences for Concept-Based Explainers

#### 5.4. Consequences for Concept-Based Models

#### 5.5. Collecting Human Annotations

## 6. Related Work

#### 6.1. Unsupervised Approaches

#### 6.2. Supervised Strategies

#### 6.3. Disentanglement

#### 6.4. Metrics of Concept Quality

**D1** in the definition of alignment, and therefore alignment itself when paired with a metric for measuring the complexity of $\alpha $ (**D2**). Their properties are extensively discussed in [64].

#### 6.5. Neuro-Symbolic Architectures

## 7. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## Appendix A. Proofs

#### Appendix A.1. Proof of Proposition 1

Assume **D1** holds. Then, the conditional distribution of $\mathbf{M}$ can be written as:

**D1**. Suppose there exists at least one $j\in \mathcal{J}$ such that:

#### Appendix A.2. Proof of Proposition 2

#### Appendix A.3. Proof of Proposition 3

**D1** in Definition 6 entails that the conditional probability of ${\mathbf{M}}_{\mathcal{J}}$ can be written in general as:

**D1** in Definition 8. We make use of this fact to derive a different upper bound for $\mathsf{\Lambda}$. We focus only on the first term of Equation (11); the analysis of the second one does not change.

**D1** reduces to ${p}_{\theta}({\mathbf{m}}_{\mathcal{J}}\mid {\mathbf{g}}_{\mathcal{I}})$, hence the term appearing in the third line. In the fourth line, we denoted ${p}_{\lambda ,\theta}\left(y\right)=\int {q}_{\lambda}(y\mid {\mathbf{m}}_{\mathcal{J}}){p}_{\theta}({\mathbf{m}}_{\mathcal{J}}\mid {\mathbf{g}}_{\mathcal{I}})p\left({\mathbf{g}}_{\mathcal{I}}\right)\,\mathrm{d}{\mathbf{m}}_{\mathcal{J}}\,\mathrm{d}{\mathbf{g}}_{\mathcal{I}}$ and reduced the first integral to $p\left(y\right)$. Finally, we obtain the upper bound for the first term of $\mathsf{\Lambda}$, where the maximum implies having a vanishing $\mathsf{KL}$ term. Therefore, we have that:

#### Appendix A.4. Proof of Corollary 1

#### Appendix A.5. Proof of Proposition 4

By **D1** in Definition 8, it holds that:

By **D1** of Definition 8, it holds that the corresponding probability distribution on ${\mathbf{M}}_{\mathcal{K}}$ can be written as:

## References

- Guidotti, R.; Monreale, A.; Ruggieri, S.; Turini, F.; Giannotti, F.; Pedreschi, D. A survey of methods for explaining black box models. ACM Comput. Surv. (CSUR)
**2018**, 51, 1–42. [Google Scholar] [CrossRef] - Štrumbelj, E.; Kononenko, I. Explaining prediction models and individual predictions with feature contributions. Knowl. Inf. Syst.
**2014**, 41, 647–665. [Google Scholar] [CrossRef] - Ribeiro, M.T.; Singh, S.; Guestrin, C. “Why should I Trust You?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 1135–1144. [Google Scholar]
- Kim, B.; Khanna, R.; Koyejo, O.O. Examples are not enough, learn to criticize! Criticism for interpretability. Adv. Neural Inf. Process. Syst.
**2016**, 29. [Google Scholar] - Koh, P.W.; Liang, P. Understanding black-box predictions via influence functions. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 1885–1894. [Google Scholar]
- Ustun, B.; Rudin, C. Supersparse linear integer models for optimized medical scoring systems. Mach. Learn.
**2016**, 102, 349–391. [Google Scholar] [CrossRef] - Wang, T.; Rudin, C.; Doshi-Velez, F.; Liu, Y.; Klampfl, E.; MacNeille, P. A bayesian framework for learning rule sets for interpretable classification. J. Mach. Learn. Res.
**2017**, 18, 2357–2393. [Google Scholar] - Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell.
**2019**, 1, 206–215. [Google Scholar] [CrossRef] - Teso, S.; Alkan, Ö.; Stammer, W.; Daly, E. Leveraging Explanations in Interactive Machine Learning: An Overview. Front. Artif. Intell.
**2023**, 6, 1066049. [Google Scholar] [CrossRef] - Kambhampati, S.; Sreedharan, S.; Verma, M.; Zha, Y.; Guan, L. Symbols as a lingua franca for bridging human-ai chasm for explainable and advisable ai systems. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 28 February–1 March 2022; Volume 36, pp. 12262–12267. [Google Scholar]
- Kim, B.; Wattenberg, M.; Gilmer, J.; Cai, C.; Wexler, J.; Viegas, F. Interpretability beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV). In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm, Sweden, 10–15 July 2018; pp. 2668–2677. [Google Scholar]
- Fong, R.; Vedaldi, A. Net2vec: Quantifying and explaining how concepts are encoded by filters in deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, Utah, USA, 18–22 June 2018; pp. 8730–8738. [Google Scholar]
- Ghorbani, A.; Abid, A.; Zou, J. Interpretation of neural networks is fragile. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January 27–1 February 2019; Volume 33, pp. 3681–3688. [Google Scholar]
- Zhang, R.; Madumal, P.; Miller, T.; Ehinger, K.A.; Rubinstein, B.I. Invertible concept-based explanations for cnn models with non-negative concept activation vectors. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 11682–11690. [Google Scholar]
- Fel, T.; Picard, A.; Bethune, L.; Boissin, T.; Vigouroux, D.; Colin, J.; Cadène, R.; Serre, T. Craft: Concept recursive activation factorization for explainability. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2711–2721. [Google Scholar]
- Alvarez-Melis, D.; Jaakkola, T.S. Towards robust interpretability with self-explaining neural networks. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 7786–7795. [Google Scholar]
- Chen, C.; Li, O.; Tao, D.; Barnett, A.; Rudin, C.; Su, J.K. This Looks Like That: Deep Learning for Interpretable Image Recognition. Adv. Neural Inf. Process. Syst.
**2019**, 32, 8930–8941. [Google Scholar] - Koh, P.W.; Nguyen, T.; Tang, Y.S.; Mussmann, S.; Pierson, E.; Kim, B.; Liang, P. Concept bottleneck models. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 5338–5348. [Google Scholar]
- Marconato, E.; Passerini, A.; Teso, S. GlanceNets: Interpretabile, Leak-proof Concept-based Models. Adv. Neural Inf. Process. Syst.
**2022**, 35, 21212–21227. [Google Scholar] - Espinosa Zarlenga, M.; Barbiero, P.; Ciravegna, G.; Marra, G.; Giannini, F.; Diligenti, M.; Shams, Z.; Precioso, F.; Melacci, S.; Weller, A.; et al. Concept Embedding Models: Beyond the Accuracy-Explainability Trade-Off. Adv. Neural Inf. Process. Syst.
**2022**, 35, 21400–21413. [Google Scholar] - Lipton, Z.C. The Mythos of Model Interpretability: In machine learning, the concept of interpretability is both important and slippery. Queue
**2018**, 16, 31–57. [Google Scholar] [CrossRef] - Schwalbe, G. Concept embedding analysis: A review. arXiv
**2022**, arXiv:2203.13909. [Google Scholar] - Stammer, W.; Schramowski, P.; Kersting, K. Right for the Right Concept: Revising Neuro-Symbolic Concepts by Interacting with their Explanations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 3619–3629. [Google Scholar]
- Bontempelli, A.; Teso, S.; Giunchiglia, F.; Passerini, A. Concept-level debugging of part-prototype networks. In Proceedings of the International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Hoffmann, A.; Fanconi, C.; Rade, R.; Kohler, J. This Looks Like That… Does it? Shortcomings of Latent Space Prototype Interpretability in Deep Networks. arXiv
**2021**, arXiv:2105.02968. [Google Scholar] - Xu-Darme, R.; Quénot, G.; Chihani, Z.; Rousset, M.C. Sanity Checks for Patch Visualisation in Prototype-Based Image Classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 3690–3695. [Google Scholar]
- Chen, Z.; Bei, Y.; Rudin, C. Concept whitening for interpretable image recognition. Nat. Mach. Intell.
**2020**, 2, 772–782. [Google Scholar] [CrossRef] - Margeloiu, A.; Ashman, M.; Bhatt, U.; Chen, Y.; Jamnik, M.; Weller, A. Do Concept Bottleneck Models Learn as Intended? arXiv
**2021**, arXiv:2105.04289. [Google Scholar] - Mahinpei, A.; Clark, J.; Lage, I.; Doshi-Velez, F.; Pan, W. Promises and pitfalls of black-box concept learning models. In Proceedings of the International Conference on Machine Learning: Workshop on Theoretic Foundation, Criticism, and Application Trend of Explainable AI, Virtual, 8–9 February 2021; Volume 1, pp. 1–13. [Google Scholar]
- Silver, D.L.; Mitchell, T.M. The Roles of Symbols in Neural-based AI: They are Not What You Think! arXiv
**2023**, arXiv:2304.13626. [Google Scholar] - Schölkopf, B.; Locatello, F.; Bauer, S.; Ke, N.R.; Kalchbrenner, N.; Goyal, A.; Bengio, Y. Toward causal representation learning. Proc. IEEE
**2021**, 109, 612–634. [Google Scholar] [CrossRef] - Bengio, Y.; Courville, A.; Vincent, P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell.
**2013**, 35, 1798–1828. [Google Scholar] [CrossRef] - Higgins, I.; Amos, D.; Pfau, D.; Racaniere, S.; Matthey, L.; Rezende, D.; Lerchner, A. Towards a definition of disentangled representations. arXiv
**2018**, arXiv:1812.02230. [Google Scholar] - Beckers, S.; Halpern, J.Y. Abstracting causal models. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 33, pp. 2678–2685. [Google Scholar]
- Beckers, S.; Eberhardt, F.; Halpern, J.Y. Approximate causal abstractions. In Proceedings of the Uncertainty in Artificial Intelligence, PMLR, Online, 3–6 August 2020; pp. 606–615. [Google Scholar]
- Geiger, A.; Wu, Z.; Potts, C.; Icard, T.; Goodman, N.D. Finding alignments between interpretable causal variables and distributed neural representations. arXiv
**2023**, arXiv:2303.02536. [Google Scholar] - Lockhart, J.; Marchesotti, N.; Magazzeni, D.; Veloso, M. Towards learning to explain with concept bottleneck models: Mitigating information leakage. arXiv
**2022**, arXiv:2211.03656. [Google Scholar] - Pearl, J. Causality; Cambridge University Press: Cambridge, UK, 2009. [Google Scholar]
- Peters, J.; Janzing, D.; Schölkopf, B. Elements of Causal Inference: Foundations and Learning Algorithms; MIT Press: Cambridge, MA, USA, 2017. [Google Scholar]
- Eastwood, C.; Williams, C.K. A framework for the quantitative evaluation of disentangled representations. In Proceedings of the International Conference on Learning Representations, Vancouver, BC, Canada, 30 April–3 May 2018. [Google Scholar]
- Suter, R.; Miladinovic, D.; Schölkopf, B.; Bauer, S. Robustly disentangled causal mechanisms: Validating deep representations for interventional robustness. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6056–6065. [Google Scholar]
- Reddy, A.G.; Balasubramanian, V.N. On causally disentangled representations. In Proceedings of the AAAI Conference on Artificial Intelligence, Vancouver, BC, Canada, 28 February–1 March 2022; Volume 36, pp. 8089–8097. [Google Scholar]
- von Kügelgen, J.; Sharma, Y.; Gresele, L.; Brendel, W.; Schölkopf, B.; Besserve, M.; Locatello, F. Self-Supervised Learning with Data Augmentations Provably Isolates Content from Style. In Proceedings of the 35nd International Conference on Neural Information Processing Systems, Online, 6–14 December 2021. [Google Scholar]
- Koller, D.; Friedman, N. Probabilistic Graphical Models: Principles and Techniques; MIT Press: Cambridge, MA, USA, 2009. [Google Scholar]
- Yang, Y.; Panagopoulou, A.; Zhou, S.; Jin, D.; Callison-Burch, C.; Yatskar, M. Language in a bottle: Language model guided concept bottlenecks for interpretable image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 18–22 June 2023; pp. 19187–19197. [Google Scholar]
- Bontempelli, A.; Giunchiglia, F.; Passerini, A.; Teso, S. Toward a Unified Framework for Debugging Gray-box Models. In Proceedings of the The AAAI-22 Workshop on Interactive Machine Learning, Online, 28 February 2022. [Google Scholar]
- Zarlenga, M.E.; Barbiero, P.; Ciravegna, G.; Marra, G.; Giannini, F.; Diligenti, M.; Shams, Z.; Precioso, F.; Melacci, S.; Weller, A.; et al. Concept embedding models: Beyond the accuracy-explainability trade-off. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Needham, MA, USA, 2022; Volume 35, pp. 21400–21413. [Google Scholar]
- Fel, T.; Boutin, V.; Moayeri, M.; Cadène, R.; Béthune, L.; Andéol, L.; Chalvidal, M.; Serre, T. A Holistic Approach to Unifying Automatic Concept Extraction and Concept Importance Estimation. arXiv
**2023**, arXiv:2306.07304. [Google Scholar] - Teso, S. Toward Faithful Explanatory Active Learning with Self-explainable Neural Nets. In Proceedings of the Workshop on Interactive Adaptive Learning (IAL 2019); 2019; pp. 4–16. Available online: https://ceur-ws.org/Vol-2444/ialatecml_paper1.pdf (accessed on 9 September 2023).
- Pfau, J.; Young, A.T.; Wei, J.; Wei, M.L.; Keiser, M.J. Robust semantic interpretability: Revisiting concept activation vectors. arXiv
**2021**, arXiv:2104.02768. [Google Scholar] - Gabbay, A.; Cohen, N.; Hoshen, Y. An image is worth more than a thousand words: Towards disentanglement in the wild. Adv. Neural Inf. Process. Syst.
**2021**, 34, 9216–9228. [Google Scholar] - Matthey, L.; Higgins, I.; Hassabis, D.; Lerchner, A. dSprites: Disentanglement Testing Sprites Dataset. 2017. Available online: https://github.com/deepmind/dsprites-dataset/ (accessed on 9 September 2023).
- Havasi, M.; Parbhoo, S.; Doshi-Velez, F. Addressing Leakage in Concept Bottleneck Models. Adv. Neural Inf. Process. Syst.
**2022**, 35, 23386–23397. [Google Scholar] - Cover, T.M. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 1999. [Google Scholar]
- Montero, M.L.; Ludwig, C.J.; Costa, R.P.; Malhotra, G.; Bowers, J. The role of disentanglement in generalisation. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 30 April 2020. [Google Scholar]
- Montero, M.; Bowers, J.; Ponte Costa, R.; Ludwig, C.; Malhotra, G. Lost in Latent Space: Examining failures of disentangled models at combinatorial generalisation. Adv. Neural Inf. Process. Syst.
**2022**, 35, 10136–10149. [Google Scholar] - Sun, X.; Yang, Z.; Zhang, C.; Ling, K.V.; Peng, G. Conditional gaussian distribution learning for open set recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Washington, DC, USA, 14–19 June 2020; pp. 13480–13489. [Google Scholar]
- Hyvarinen, A.; Morioka, H. Nonlinear ICA of temporally dependent stationary sources. In Proceedings of the Artificial Intelligence and Statistics, PMLR, Ft. Lauderdale, FL, USA, 20–22 April 2017; pp. 460–469. [Google Scholar]
- Khemakhem, I.; Monti, R.P.; Kingma, D.P.; Hyvärinen, A. ICE-BeeM: Identifiable Conditional Energy-Based Deep Models Based on Nonlinear ICA. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS 2020), Online, 6–12 December 2020. [Google Scholar]
- Rubenstein, P.K.; Weichwald, S.; Bongers, S.; Mooij, J.M.; Janzing, D.; Grosse-Wentrup, M.; Schölkopf, B. Causal consistency of structural equation models. arXiv
**2017**, arXiv:1707.00819. [Google Scholar] - Zennaro, F.M. Abstraction between structural causal models: A review of definitions and properties. arXiv
**2022**, arXiv:2207.08603. [Google Scholar] - Geiger, A.; Potts, C.; Icard, T. Causal Abstraction for Faithful Model Interpretation. arXiv
**2023**, arXiv:2301.04709. [Google Scholar] - Marti, L.; Wu, S.; Piantadosi, S.T.; Kidd, C. Latent diversity in human concepts. Open Mind
**2023**, 7, 79–92. [Google Scholar] [CrossRef] - Zaidi, J.; Boilard, J.; Gagnon, G.; Carbonneau, M.A. Measuring disentanglement: A review of metrics. arXiv
**2020**, arXiv:2012.09276. [Google Scholar] - Eastwood, C.; Nicolicioiu, A.L.; Von Kügelgen, J.; Kekić, A.; Träuble, F.; Dittadi, A.; Schölkopf, B. DCI-ES: An Extended Disentanglement Framework with Connections to Identifiability. arXiv
**2022**, arXiv:2210.00364. [Google Scholar] - Chen, R.T.; Li, X.; Grosse, R.; Duvenaud, D. Isolating sources of disentanglement in VAEs. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 2615–2625. [Google Scholar]
- Locatello, F.; Bauer, S.; Lucic, M.; Raetsch, G.; Gelly, S.; Schölkopf, B.; Bachem, O. Challenging common assumptions in the unsupervised learning of disentangled representations. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 4114–4124. [Google Scholar]
- Oikarinen, T.; Das, S.; Nguyen, L.M.; Weng, T.W. Label-free Concept Bottleneck Models. In Proceedings of the ICLR, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
- Lage, I.; Doshi-Velez, F. Learning Interpretable Concept-Based Models with Human Feedback. arXiv
**2020**, arXiv:2012.02898. [Google Scholar] - Chauhan, K.; Tiwari, R.; Freyberg, J.; Shenoy, P.; Dvijotham, K. Interactive concept bottleneck models. In Proceedings of the AAAI, Washington, DC, USA, 7–14 February 2023. [Google Scholar]
- Steinmann, D.; Stammer, W.; Friedrich, F.; Kersting, K. Learning to Intervene on Concept Bottlenecks. arXiv
**2023**, arXiv:2308.13453. [Google Scholar] - Zarlenga, M.E.; Collins, K.M.; Dvijotham, K.; Weller, A.; Shams, Z.; Jamnik, M. Learning to Receive Help: Intervention-Aware Concept Embedding Models. arXiv
**2023**, arXiv:2309.16928. [Google Scholar] - Stammer, W.; Memmel, M.; Schramowski, P.; Kersting, K. Interactive disentanglement: Learning concepts by interacting with their prototype representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10317–10328. [Google Scholar]
- Muggleton, S.; De Raedt, L. Inductive logic programming: Theory and methods. J. Log. Program.
**1994**, 19, 629–679. [Google Scholar] [CrossRef] - De Raedt, L.; Dumancic, S.; Manhaeve, R.; Marra, G. From Statistical Relational to Neuro-Symbolic Artificial Intelligence. In Proceedings of the IJCAI, Yokohama, Japan, 11–17 July 2020. [Google Scholar]
- Holzinger, A.; Saranti, A.; Angerschmid, A.; Finzel, B.; Schmid, U.; Mueller, H. Toward human-level concept learning: Pattern benchmarking for AI algorithms. Patterns
**2023**, 4, 100788. [Google Scholar] [CrossRef] - Miller, T. Explanation in artificial intelligence: Insights from the social sciences. Artif. Intell.
**2019**, 267, 1–38. [Google Scholar] [CrossRef] - Cabitza, F.; Campagner, A.; Malgieri, G.; Natali, C.; Schneeberger, D.; Stoeger, K.; Holzinger, A. Quod erat demonstrandum?—Towards a typology of the concept of explanation for the design of explainable AI. Expert Syst. Appl.
**2023**, 213, 118888. [Google Scholar] [CrossRef] - Ho, M.K.; Abel, D.; Correa, C.G.; Littman, M.L.; Cohen, J.D.; Griffiths, T.L. People construct simplified mental representations to plan. Nature
**2022**, 606, 129–136. [Google Scholar] [CrossRef] - Khemakhem, I.; Kingma, D.; Monti, R.; Hyvarinen, A. Variational autoencoders and nonlinear ica: A unifying framework. In Proceedings of the International Conference on Artificial Intelligence and Statistics, PMLR, Online, 26–28 August 2020; pp. 2207–2217. [Google Scholar]
- Graziani, M.; Nguyen, A.P.; O’Mahony, L.; Müller, H.; Andrearczyk, V. Concept discovery and dataset exploration with singular value decomposition. In Proceedings of the ICLR 2023 Workshop on Pitfalls of Limited Data and Computation for Trustworthy ML, Kigali, Rwanda, 5 May 2023. [Google Scholar]
- Li, O.; Liu, H.; Chen, C.; Rudin, C. Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018. [Google Scholar]
- Rymarczyk, D.; Struski, L.; Tabor, J.; Zieliński, B. ProtoPShare: Prototypical Parts Sharing for Similarity Discovery in Interpretable Image Classification. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021; pp. 1420–1430. [Google Scholar]
- Nauta, M.; van Bree, R.; Seifert, C. Neural Prototype Trees for Interpretable Fine-grained Image Recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 19–25 June 2021; pp. 14933–14943. [Google Scholar]
- Singh, G.; Yow, K.C. These do not look like those: An interpretable deep learning model for image recognition. IEEE Access
**2021**, 9, 41482–41493. [Google Scholar] [CrossRef] - Davoudi, S.O.; Komeili, M. Toward Faithful Case-based Reasoning through Learning Prototypes in a Nearest Neighbor-friendly Space. In Proceedings of the International Conference on Learning Representations, Virtual Event, 3–7 May 2021. [Google Scholar]
- Zhou, B.; Sun, Y.; Bau, D.; Torralba, A. Interpretable basis decomposition for visual explanation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 119–134. [Google Scholar]
- Kazhdan, D.; Dimanov, B.; Jamnik, M.; Liò, P.; Weller, A. Now you see me (CME): Concept-based model extraction. arXiv
**2020**, arXiv:2010.13233. [Google Scholar] - Gu, J.; Tresp, V. Semantics for global and local interpretation of deep neural networks. arXiv
**2019**, arXiv:1910.09085. [Google Scholar] - Esser, P.; Rombach, R.; Ommer, B. A disentangling invertible interpretation network for explaining latent representations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 9223–9232. [Google Scholar]
- Yeh, C.K.; Kim, B.; Arik, S.; Li, C.L.; Pfister, T.; Ravikumar, P. On completeness-aware concept-based explanations in deep neural networks. Adv. Neural Inf. Process. Syst.
**2020**, 33, 20554–20565. [Google Scholar] - Yuksekgonul, M.; Wang, M.; Zou, J. Post-hoc Concept Bottleneck Models. arXiv
**2022**, arXiv:2205.15480. [Google Scholar] - Sawada, Y.; Nakamura, K. Concept Bottleneck Model with Additional Unsupervised Concepts. IEEE Access
**2022**, 10, 41758–41765. [Google Scholar] [CrossRef] - Magister, L.C.; Kazhdan, D.; Singh, V.; Liò, P. Gcexplainer: Human-in-the-loop concept-based explanations for graph neural networks. arXiv
**2021**, arXiv:2107.11889. [Google Scholar] - Finzel, B.; Saranti, A.; Angerschmid, A.; Tafler, D.; Pfeifer, B.; Holzinger, A. Generating explanations for conceptual validation of graph neural networks: An investigation of symbolic predicates learned on relevance-ranked sub-graphs. KI-Künstliche Intell. **2022**, 36, 271–285. [Google Scholar] [CrossRef]
- Erculiani, L.; Bontempelli, A.; Passerini, A.; Giunchiglia, F. Egocentric Hierarchical Visual Semantics. arXiv
**2023**, arXiv:2305.05422. [Google Scholar] - Higgins, I.; Matthey, L.; Pal, A.; Burgess, C.; Glorot, X.; Botvinick, M.; Mohamed, S.; Lerchner, A. β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. In Proceedings of the International Conference on Learning Representations, Toulon, France, 24–26 April 2017. [Google Scholar]
- Kingma, D.P.; Welling, M. Auto-encoding variational bayes. In Proceedings of the International Conference on Learning Representations, Banff, AB, Canada, 14–16 April 2014. [Google Scholar]
- Kim, H.; Mnih, A. Disentangling by factorising. In Proceedings of the International Conference on Machine Learning, PMLR, Stockholm Sweden, 10–15 July 2018; pp. 2649–2658. [Google Scholar]
- Esmaeili, B.; Wu, H.; Jain, S.; Bozkurt, A.; Siddharth, N.; Paige, B.; Brooks, D.H.; Dy, J.; Meent, J.W. Structured disentangled representations. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics, PMLR, Naha, Okinawa, Japan, 16–18 April 2019; pp. 2525–2534. [Google Scholar]
- Rhodes, T.; Lee, D. Local Disentanglement in Variational Auto-Encoders Using Jacobian L_1 Regularization. Adv. Neural Inf. Process. Syst.
**2021**, 34, 22708–22719. [Google Scholar] - Locatello, F.; Tschannen, M.; Bauer, S.; Rätsch, G.; Schölkopf, B.; Bachem, O. Disentangling Factors of Variations Using Few Labels. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 30 April 2020. [Google Scholar]
- Shu, R.; Chen, Y.; Kumar, A.; Ermon, S.; Poole, B. Weakly Supervised Disentanglement with Guarantees. In Proceedings of the International Conference on Learning Representations, Addis Ababa, Ethiopia, 30 April 2020. [Google Scholar]
- Locatello, F.; Poole, B.; Rätsch, G.; Schölkopf, B.; Bachem, O.; Tschannen, M. Weakly-supervised disentanglement without compromises. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 13–18 July 2020; pp. 6348–6359. [Google Scholar]
- Lachapelle, S.; Rodriguez, P.; Sharma, Y.; Everett, K.E.; Le Priol, R.; Lacoste, A.; Lacoste-Julien, S. Disentanglement via mechanism sparsity regularization: A new principle for nonlinear ICA. In Proceedings of the Conference on Causal Learning and Reasoning, PMLR, Eureka, CA, USA, 11–13 April 2022; pp. 428–484. [Google Scholar]
- Horan, D.; Richardson, E.; Weiss, Y. When Is Unsupervised Disentanglement Possible? Adv. Neural Inf. Process. Syst.
**2021**, 34, 5150–5161. [Google Scholar] - Comon, P. Independent component analysis, a new concept? Signal Process.
**1994**, 36, 287–314. [Google Scholar] [CrossRef] - Hyvärinen, A.; Karhunen, J.; Oja, E. Independent Component Analysis, Adaptive and Learning Systems for Signal Processing, Communications, and Control; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2001; Volume 1, pp. 11–14. [Google Scholar]
- Naik, G.R.; Kumar, D.K. An overview of independent component analysis and its applications. Informatica
**2011**, 35, 63–81. [Google Scholar] - Hyvärinen, A.; Pajunen, P. Nonlinear independent component analysis: Existence and uniqueness results. Neural Netw.
**1999**, 12, 429–439. [Google Scholar] [CrossRef] - Buchholz, S.; Besserve, M.; Schölkopf, B. Function classes for identifiable nonlinear independent component analysis. Adv. Neural Inf. Process. Syst.
**2022**, 35, 16946–16961. [Google Scholar] - Zarlenga, M.E.; Barbiero, P.; Shams, Z.; Kazhdan, D.; Bhatt, U.; Weller, A.; Jamnik, M. Towards Robust Metrics for Concept Representation Evaluation. arXiv
**2023**, arXiv:2301.10367. [Google Scholar] - Manhaeve, R.; Dumancic, S.; Kimmig, A.; Demeester, T.; De Raedt, L. DeepProbLog: Neural Probabilistic Logic Programming. Adv. Neural Inf. Process. Syst.
**2018**, 31, 3753–3763. [Google Scholar] [CrossRef] - Donadello, I.; Serafini, L.; Garcez, A.D. Logic tensor networks for semantic image interpretation. arXiv
**2017**, arXiv:1705.08968. [Google Scholar] - Diligenti, M.; Gori, M.; Sacca, C. Semantic-based regularization for learning and inference. Artif. Intell.
**2017**, 244, 143–165. [Google Scholar] [CrossRef] - Fischer, M.; Balunovic, M.; Drachsler-Cohen, D.; Gehr, T.; Zhang, C.; Vechev, M. Dl2: Training and querying neural networks with logic. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 1931–1941. [Google Scholar]
- Giunchiglia, E.; Lukasiewicz, T. Coherent Hierarchical Multi-label Classification Networks. Adv. Neural Inf. Process. Syst.
**2020**, 33, 9662–9673. [Google Scholar] - Yang, Z.; Ishay, A.; Lee, J. NeurASP: Embracing neural networks into answer set programming. In Proceedings of the IJCAI, Yokohama, Japan, 11–17 July 2020. [Google Scholar]
- Huang, J.; Li, Z.; Chen, B.; Samel, K.; Naik, M.; Song, L.; Si, X. Scallop: From Probabilistic Deductive Databases to Scalable Differentiable Reasoning. Adv. Neural Inf. Process. Syst.
**2021**, 34, 25134–25145. [Google Scholar] - Marra, G.; Kuželka, O. Neural markov logic networks. In Proceedings of the Uncertainty in Artificial Intelligence, Online, 27–30 July 2021. [Google Scholar]
- Ahmed, K.; Teso, S.; Chang, K.W.; Van den Broeck, G.; Vergari, A. Semantic Probabilistic Layers for Neuro-Symbolic Learning. Adv. Neural Inf. Process. Syst.
**2022**, 35, 29944–29959. [Google Scholar] - Misino, E.; Marra, G.; Sansone, E. VAEL: Bridging Variational Autoencoders and Probabilistic Logic Programming. Adv. Neural Inf. Process. Syst.
**2022**, 35, 4667–4679. [Google Scholar] - Winters, T.; Marra, G.; Manhaeve, R.; De Raedt, L. DeepStochLog: Neural Stochastic Logic Programming. In Proceedings of the AAAI, Virtually, 22 February–1 March 2022. [Google Scholar]
- van Krieken, E.; Thanapalasingam, T.; Tomczak, J.M.; van Harmelen, F.; Teije, A.T. A-NeSI: A Scalable Approximate Method for Probabilistic Neurosymbolic Inference. arXiv
**2022**, arXiv:2212.12393. [Google Scholar] - Ciravegna, G.; Barbiero, P.; Giannini, F.; Gori, M.; Lió, P.; Maggini, M.; Melacci, S. Logic explained networks. Artif. Intell.
**2023**, 314, 103822. [Google Scholar] [CrossRef] - Marconato, E.; Bontempo, G.; Ficarra, E.; Calderara, S.; Passerini, A.; Teso, S. Neuro-Symbolic Continual Learning: Knowledge, Reasoning Shortcuts and Concept Rehearsal. In Proceedings of the 40th International Conference on Machine Learning (ICML’23), Honolulu, HI, USA, 23–29 July 2023; Volume 202, pp. 23915–23936. [Google Scholar]
- Marconato, E.; Teso, S.; Vergari, A.; Passerini, A. Not All Neuro-Symbolic Concepts Are Created Equal: Analysis and Mitigation of Reasoning Shortcuts. In Proceedings of the Thirty-Seventh Conference on Neural Information Processing Systems, New Orleans, LA, USA, 10–16 December 2023. [Google Scholar]

**Figure 1.**SCMs illustrating two different notions of disentanglement. Left: The variables $\mathbf{G}$ =$\{{G}_{1},\dots ,{G}_{n}\}$ are disentangled. Right: Typical data generation and encoding process used in deep latent variable models. The machine representation $\mathbf{M}$ = $\{{M}_{1},\dots ,{M}_{k}\}$ is disentangled with respect to the generative factors $\mathbf{G}$ if and only if each ${M}_{j}$ encodes information about at most one ${G}_{i}$.

**Figure 2.**

**Left**: following the generative process $p(\mathbf{X}\mid \mathbf{G})$, concept-based models (CBMs) extract a machine representation $\mathbf{M}=({\mathbf{M}}_{\mathcal{J}},{\mathbf{M}}_{-\mathcal{J}})$ via ${p}_{\theta}(\mathbf{M}\mid \mathbf{X})$, of which only ${\mathbf{M}}_{\mathcal{J}}$ is used to predict the label Y. ${\mathbf{M}}_{\mathcal{J}}$ contains all interpretable concepts, and as such it has to be aligned with the user’s concepts $\mathbf{H}$ (Section 4): the corresponding map is shown in red.

**Right**: generative process followed by concept-based explainers (CBEs). Here, the machine representation $\mathbf{M}$ is not required to be interpretable. Rather, the concept-based explainer maps it to extracted concepts $\widehat{\mathbf{H}}$ and then infers how these contribute to the prediction Y. Here, alignment should hold between $\widehat{\mathbf{H}}$ and $\mathbf{H}$.

**Figure 3.**Graphical model of our data generation process. In words, n (correlated) generative factors $\mathbf{G}$ = $({G}_{1},\dots ,{G}_{n})$ exist in the world and cause an observed input $\mathbf{X}$. The machine maps this input to an internal representation $\mathbf{M}$ = $({M}_{1},\dots ,{M}_{k})$, while the human observer maps it to their own internal concept vocabulary $\mathbf{H}$ = $({H}_{1},\dots ,{H}_{\ell})$. Notice that the observer’s concepts $\mathbf{H}$ may, and often do, differ from the ground-truth factors $\mathbf{G}$. The concepts $\mathbf{H}$ are what the human can understand and attach names to, e.g., the “

`color`” and “

`shape`” of an object appearing in $\mathbf{X}$. The association between names and human concepts is denoted by dotted lines. We postulate that communication is possible if the machine and the human representations are aligned according to Definition 6.

**Figure 4.**Simplified generative process with a single observer, adapted from [19]. Here, $\mathbf{C}$ denotes unobserved confounding variables influencing the generative factors $\mathbf{G}$, and $\mathbf{M}$ is the latent representation learned by the machine. The red arrow represents the map $\alpha $.

**Figure 5.**Generative process for Concept Leakage. A predictor observes examples $(\mathbf{X},Y)$ and infers Y from its interpretable representation ${\mathbf{M}}_{\mathcal{J}}$ using a learnable conditional distribution ${q}_{\lambda}(Y\mid {\mathbf{m}}_{\mathcal{J}})$, indicated in orange. Since the label Y depends solely on ${\mathbf{G}}_{-\mathcal{I}}$, we would expect that it cannot be predicted from ${\mathbf{M}}_{\mathcal{J}}$ better than at random: intuitively, if it can, information about ${\mathbf{G}}_{-\mathcal{I}}$ has leaked into the interpretable concepts ${\mathbf{M}}_{\mathcal{J}}$. The intervention $do({\mathbf{G}}_{-\mathcal{I}}\leftarrow {\mathbf{g}}_{-\mathcal{I}})$ on the uninterpretable/unobserved concepts detaches them from $\mathbf{C}$, ensuring that the label truly depends only on ${\mathbf{G}}_{-\mathcal{I}}$.

**Figure 6.**Block-aligned representation when ${\mathfrak{C}}_{\mathbf{G}}$ has causal connections. (

**left**) An intervention on ${G}_{1}$ affects all representations (displayed in blue), since $({M}_{1},{M}_{3})$ are block-aligned to ${G}_{1}$ and ${M}_{2}$ is aligned to ${G}_{2}$. (

**center**) Conversely, an intervention on ${G}_{2}$ only affects ${M}_{2}$, leaving the remaining representations untouched. (

**right**) Intervening on all $\mathbf{G}$ has the effect of isolating the corresponding aligned representations from other interventions. In this case, intervening on ${G}_{2}$ removes the causal connection with ${G}_{1}$, so that ${M}_{2}$ does not depend on the intervention on ${G}_{1}$. Refer to Example 2 for further details.

**Figure 7.**Absence of aligned causal abstraction. (

**left**) The user’s ${\mathfrak{C}}_{\mathbf{H}}$ incorporates a causal connection from ${H}_{1}$ to ${H}_{2}$, while the machine’s ${\mathfrak{C}}_{\mathbf{M}}$ presents no causal connections. (

**right**) The total SCM ${\mathfrak{C}}_{\mathbf{H}\to \mathbf{M}}$ of user’s and machine’s concepts resulting from an

**aligned** map $\beta :\mathbf{H}\to \mathbf{M}$ (in blue). Refer to Example 3 for further discussion.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Marconato, E.; Passerini, A.; Teso, S.
Interpretability Is in the Mind of the Beholder: A Causal Framework for Human-Interpretable Representation Learning. *Entropy* **2023**, *25*, 1574.
https://doi.org/10.3390/e25121574
