Interpretability Is in the Mind of the Beholder: A Causal Framework for Human-Interpretable Representation Learning

Bibliographic Details
Main Authors: Emanuele Marconato, Andrea Passerini, Stefano Teso
Format: Article
Language: English
Published: MDPI AG 2023-11-01
Series: Entropy
Subjects: explainable AI; causal representation learning; alignment; disentanglement; causal abstractions; concept leakage
Online Access: https://www.mdpi.com/1099-4300/25/12/1574
author Emanuele Marconato
Andrea Passerini
Stefano Teso
collection DOAJ
description Research on Explainable Artificial Intelligence has recently started exploring the idea of producing explanations that, rather than being expressed in terms of low-level features, are encoded in terms of interpretable concepts learned from data. How to reliably acquire such concepts is, however, still fundamentally unclear. An agreed-upon notion of concept interpretability is missing, with the result that concepts used by both post hoc explainers and concept-based neural networks are acquired through a variety of mutually incompatible strategies. Critically, most of these neglect the human side of the problem: a representation is understandable only insofar as it can be understood by the human at the receiving end. The key challenge in human-interpretable representation learning (HRL) is how to model and operationalize this human element. In this work, we propose a mathematical framework for acquiring interpretable representations suitable for both post hoc explainers and concept-based neural networks. Our formalization of HRL builds on recent advances in causal representation learning and explicitly models a human stakeholder as an external observer. This allows us to derive a principled notion of alignment between the machine’s representation and the vocabulary of concepts understood by the human. In doing so, we link alignment and interpretability through a simple and intuitive name transfer game, and clarify the relationship between alignment and a well-known property of representations, namely disentanglement. We also show that alignment is linked to the issue of undesirable correlations among concepts, also known as concept leakage, and to content-style separation, all through a general information-theoretic reformulation of these properties. Our conceptualization aims to bridge the gap between the human and algorithmic sides of interpretability and establish a stepping stone for new research on human-interpretable representations.
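The abstract above alludes to an information-theoretic reformulation of alignment, disentanglement, and concept leakage. As a purely illustrative sketch (not the authors' formalization), the Python snippet below computes a pairwise mutual-information matrix between the dimensions of a toy learned representation and synthetic human-annotated concepts; a near one-to-one pattern would suggest alignment, while sizeable off-diagonal entries are one symptom of concept leakage. The synthetic data, variable names, and the leakage construction are all assumptions made for this example.

```python
# Illustrative sketch only: NOT the paper's formal definition of alignment.
# It assumes discrete, human-annotated ground-truth concepts G and a learned
# machine representation Z, and inspects how much information each learned
# dimension carries about each human concept.

import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)

# Synthetic "human" concepts: n samples, 2 discrete concepts with 3 values each.
n = 5000
G = rng.integers(0, 3, size=(n, 2))

# Hypothetical learned representation:
# dim 0 copies concept 0 cleanly; dim 1 mixes concept 1 with concept 0
# (leakage by construction).
Z = np.stack([
    G[:, 0],
    (G[:, 1] + (G[:, 0] > 1).astype(int)) % 3,
], axis=1)

# Pairwise mutual information (in nats) between learned dims and human concepts.
mi = np.array([
    [mutual_info_score(Z[:, j], G[:, k]) for k in range(G.shape[1])]
    for j in range(Z.shape[1])
])

print("MI matrix (rows: learned dims, cols: human concepts):")
print(np.round(mi, 3))
# A permutation-like pattern (each row informative about exactly one column)
# would hint at alignment; large off-diagonal entries hint at concept leakage.
```

Discrete mutual information via scikit-learn's mutual_info_score is used only because the toy concepts are categorical; continuous learned representations would need a different estimator, and the paper's own definitions should be consulted for the precise quantities.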
first_indexed 2024-03-08T20:47:56Z
format Article
id doaj.art-9a41acb525494458b58119820e4fe913
institution Directory Open Access Journal
issn 1099-4300
language English
last_indexed 2024-03-08T20:47:56Z
publishDate 2023-11-01
publisher MDPI AG
record_format Article
series Entropy
doi 10.3390/e25121574
volume 25
issue 12
article_number 1574
author_affiliation Dipartimento di Ingegneria e Scienza dell’Informazione, University of Trento, 38123 Trento, Italy (Emanuele Marconato, Andrea Passerini, Stefano Teso)
title Interpretability Is in the Mind of the Beholder: A Causal Framework for Human-Interpretable Representation Learning
topic explainable AI
causal representation learning
alignment
disentanglement
causal abstractions
concept leakage
url https://www.mdpi.com/1099-4300/25/12/1574