Interpretability Is in the Mind of the Beholder: A Causal Framework for Human-Interpretable Representation Learning

Bibliographic Details
Main Authors: Emanuele Marconato, Andrea Passerini, Stefano Teso
Format: Article
Language: English
Published: MDPI AG 2023-11-01
Series: Entropy
Subjects: explainable AI; causal representation learning; alignment; disentanglement; causal abstractions; concept leakage
Online Access: https://www.mdpi.com/1099-4300/25/12/1574
author Emanuele Marconato
Andrea Passerini
Stefano Teso
collection DOAJ
description Research on Explainable Artificial Intelligence has recently started exploring the idea of producing explanations that, rather than being expressed in terms of low-level features, are encoded in terms of interpretable concepts learned from data. How to reliably acquire such concepts is, however, still fundamentally unclear. An agreed-upon notion of concept interpretability is missing, with the result that concepts used by both post hoc explainers and concept-based neural networks are acquired through a variety of mutually incompatible strategies. Critically, most of these neglect the human side of the problem: a representation is understandable only insofar as it can be understood by the human at the receiving end. The key challenge in human-interpretable representation learning (HRL) is how to model and operationalize this human element. In this work, we propose a mathematical framework for acquiring interpretable representations suitable for both post hoc explainers and concept-based neural networks. Our formalization of HRL builds on recent advances in causal representation learning and explicitly models a human stakeholder as an external observer. This allows us to derive a principled notion of alignment between the machine’s representation and the vocabulary of concepts understood by the human. In doing so, we link alignment and interpretability through a simple and intuitive name transfer game, and clarify the relationship between alignment and a well-known property of representations, namely disentanglement. We also show that alignment is linked to the issue of undesirable correlations among concepts, also known as concept leakage, and to content-style separation, all through a general information-theoretic reformulation of these properties. Our conceptualization aims to bridge the gap between the human and algorithmic sides of interpretability and establish a stepping stone for new research on human-interpretable representations.
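The abstract above alludes to an information-theoretic reformulation of alignment, disentanglement, and concept leakage. As a purely illustrative sketch (not the authors' formalization), the Python snippet below computes a pairwise mutual-information matrix between the dimensions of a toy learned representation and synthetic human-annotated concepts; a near one-to-one pattern would suggest alignment, while sizeable off-diagonal entries are one symptom of concept leakage. The synthetic data, variable names, and the leakage construction are all assumptions made for this example.

```python
# Illustrative sketch only: NOT the paper's formal definition of alignment.
# It assumes discrete, human-annotated ground-truth concepts G and a learned
# machine representation Z, and inspects how much information each learned
# dimension carries about each human concept.

import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)

# Synthetic "human" concepts: n samples, 2 discrete concepts with 3 values each.
n = 5000
G = rng.integers(0, 3, size=(n, 2))

# Hypothetical learned representation:
# dim 0 copies concept 0 cleanly; dim 1 mixes concept 1 with concept 0
# (leakage by construction).
Z = np.stack([
    G[:, 0],
    (G[:, 1] + (G[:, 0] > 1).astype(int)) % 3,
], axis=1)

# Pairwise mutual information (in nats) between learned dims and human concepts.
mi = np.array([
    [mutual_info_score(Z[:, j], G[:, k]) for k in range(G.shape[1])]
    for j in range(Z.shape[1])
])

print("MI matrix (rows: learned dims, cols: human concepts):")
print(np.round(mi, 3))
# A permutation-like pattern (each row informative about exactly one column)
# would hint at alignment; large off-diagonal entries hint at concept leakage.
```

Discrete mutual information via scikit-learn's mutual_info_score is used only because the toy concepts are categorical; continuous learned representations would need a different estimator, and the paper's own definitions should be consulted for the precise quantities.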
first_indexed 2024-03-08T20:47:56Z
format Article
id doaj.art-9a41acb525494458b58119820e4fe913
institution Directory Open Access Journal
issn 1099-4300
language English
last_indexed 2024-03-08T20:47:56Z
publishDate 2023-11-01
publisher MDPI AG
record_format Article
series Entropy
doi 10.3390/e25121574
volume 25
issue 12
article_number 1574
author_affiliation Dipartimento di Ingegneria e Scienza dell’Informazione, University of Trento, 38123 Trento, Italy (Emanuele Marconato, Andrea Passerini, Stefano Teso)
title Interpretability Is in the Mind of the Beholder: A Causal Framework for Human-Interpretable Representation Learning
topic explainable AI
causal representation learning
alignment
disentanglement
causal abstractions
concept leakage
url https://www.mdpi.com/1099-4300/25/12/1574