Interpretability Is in the Mind of the Beholder: A Causal Framework for Human-Interpretable Representation Learning
Research on Explainable Artificial Intelligence has recently started exploring the idea of producing explanations that, rather than being expressed in terms of low-level features, are encoded in terms of interpretable concepts learned from data. How to reliably acquire such concepts is, however, still fundamentally unclear.
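The abstract below describes concept leakage as undesirable correlations among concepts, captured through an information-theoretic lens, and relates alignment to content-style separation. As a purely illustrative sketch (not the paper's formalization), the following Python snippet estimates the mutual information between a synthetic "learned concept" and a style factor it should not encode; the toy data-generating process, the variable names (`content`, `style`, `leak_strength`), and the histogram-based estimator are all assumptions made for this example.

```python
# Illustrative sketch only: probing "concept leakage" as statistical
# dependence between a learned concept and a generative factor it is
# not supposed to encode. Toy data and names are hypothetical.
import numpy as np


def discrete_mutual_information(x, y, bins=16):
    """Histogram-based estimate of I(X; Y) in nats for two 1-D arrays."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y
    nonzero = pxy > 0
    return float(np.sum(pxy[nonzero] * np.log(pxy[nonzero] / (px @ py)[nonzero])))


rng = np.random.default_rng(0)
n_samples = 10_000

# Two independent ground-truth generative factors: a "content" factor the
# concept should capture, and a "style" factor it should ignore.
content = rng.normal(size=n_samples)
style = rng.normal(size=n_samples)

# A hypothetical learned concept that leaks some style information.
leak_strength = 0.5
concept = content + leak_strength * style + 0.1 * rng.normal(size=n_samples)

print("I(concept; content):", discrete_mutual_information(concept, content))
print("I(concept; style)  :", discrete_mutual_information(concept, style))
```

In this toy setting, a leakage-free concept would make the second estimate close to zero; a clearly positive value signals that style information has leaked into the concept, which is the kind of undesirable dependence the abstract's information-theoretic reformulation is meant to capture.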
Main Authors: | Emanuele Marconato, Andrea Passerini, Stefano Teso |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2023-11-01 |
Series: | Entropy |
Subjects: | explainable AI; causal representation learning; alignment; disentanglement; causal abstractions; concept leakage |
Online Access: | https://www.mdpi.com/1099-4300/25/12/1574 |
_version_ | 1827574850338684928 |
---|---|
author | Emanuele Marconato; Andrea Passerini; Stefano Teso |
author_facet | Emanuele Marconato; Andrea Passerini; Stefano Teso |
author_sort | Emanuele Marconato |
collection | DOAJ |
description | Research on Explainable Artificial Intelligence has recently started exploring the idea of producing explanations that, rather than being expressed in terms of low-level features, are encoded in terms of interpretable concepts learned from data. How to reliably acquire such concepts is, however, still fundamentally unclear. An agreed-upon notion of concept interpretability is missing, with the result that concepts used by both post hoc explainers and concept-based neural networks are acquired through a variety of mutually incompatible strategies. Critically, most of these neglect the human side of the problem: a representation is understandable only insofar as it can be understood by the human at the receiving end. The key challenge in human-interpretable representation learning (HRL) is how to model and operationalize this human element. In this work, we propose a mathematical framework for acquiring interpretable representations suitable for both post hoc explainers and concept-based neural networks. Our formalization of HRL builds on recent advances in causal representation learning and explicitly models a human stakeholder as an external observer. This allows us to derive a principled notion of alignment between the machine’s representation and the vocabulary of concepts understood by the human. In doing so, we link alignment and interpretability through a simple and intuitive name transfer game, and clarify the relationship between alignment and a well-known property of representations, namely disentanglement. We also show that alignment is linked to the issue of undesirable correlations among concepts, also known as concept leakage, and to content-style separation, all through a general information-theoretic reformulation of these properties. Our conceptualization aims to bridge the gap between the human and algorithmic sides of interpretability and establish a stepping stone for new research on human-interpretable representations. |
first_indexed | 2024-03-08T20:47:56Z |
format | Article |
id | doaj.art-9a41acb525494458b58119820e4fe913 |
institution | Directory Open Access Journal |
issn | 1099-4300 |
language | English |
last_indexed | 2024-03-08T20:47:56Z |
publishDate | 2023-11-01 |
publisher | MDPI AG |
record_format | Article |
series | Entropy |
spelling | doaj.art-9a41acb525494458b58119820e4fe913; 2023-12-22T14:07:13Z; eng; MDPI AG; Entropy; ISSN 1099-4300; 2023-11-01; vol. 25, no. 12, article 1574; doi:10.3390/e25121574; Interpretability Is in the Mind of the Beholder: A Causal Framework for Human-Interpretable Representation Learning; Emanuele Marconato, Andrea Passerini, Stefano Teso (all: Dipartimento di Ingegneria e Scienza dell’Informazione, University of Trento, 38123 Trento, Italy) |
spellingShingle | Emanuele Marconato; Andrea Passerini; Stefano Teso; Interpretability Is in the Mind of the Beholder: A Causal Framework for Human-Interpretable Representation Learning; Entropy; explainable AI; causal representation learning; alignment; disentanglement; causal abstractions; concept leakage |
title | Interpretability Is in the Mind of the Beholder: A Causal Framework for Human-Interpretable Representation Learning |
title_full | Interpretability Is in the Mind of the Beholder: A Causal Framework for Human-Interpretable Representation Learning |
title_fullStr | Interpretability Is in the Mind of the Beholder: A Causal Framework for Human-Interpretable Representation Learning |
title_full_unstemmed | Interpretability Is in the Mind of the Beholder: A Causal Framework for Human-Interpretable Representation Learning |
title_short | Interpretability Is in the Mind of the Beholder: A Causal Framework for Human-Interpretable Representation Learning |
title_sort | interpretability is in the mind of the beholder a causal framework for human interpretable representation learning |
topic | explainable AI; causal representation learning; alignment; disentanglement; causal abstractions; concept leakage |
url | https://www.mdpi.com/1099-4300/25/12/1574 |
work_keys_str_mv | AT emanuelemarconato interpretabilityisinthemindofthebeholderacausalframeworkforhumaninterpretablerepresentationlearning AT andreapasserini interpretabilityisinthemindofthebeholderacausalframeworkforhumaninterpretablerepresentationlearning AT stefanoteso interpretabilityisinthemindofthebeholderacausalframeworkforhumaninterpretablerepresentationlearning |