Topological Information Data Analysis

This paper presents methods that quantify the structure of statistical interactions within a given data set; these methods were applied in a previous article. It establishes new results on the k-multivariate mutual information (I_k)...


Bibliographic Details
Main Authors: Pierre Baudot, Monica Tapia, Daniel Bennequin, Jean-Marc Goaillard
Format: Article
Language: English
Published: MDPI AG 2019-09-01
Series: Entropy
Subjects:
Online Access: https://www.mdpi.com/1099-4300/21/9/869
_version_ 1798038537466019840
author Pierre Baudot
Monica Tapia
Daniel Bennequin
Jean-Marc Goaillard
author_facet Pierre Baudot
Monica Tapia
Daniel Bennequin
Jean-Marc Goaillard
author_sort Pierre Baudot
collection DOAJ
description This paper presents methods that quantify the structure of statistical interactions within a given data set; these methods were applied in a previous article. It establishes new results on the k-multivariate mutual information (I_k) inspired by the topological formulation of information introduced in a series of studies. In particular, we show that the vanishing of all I_k for 2 ≤ k ≤ n of n random variables is equivalent to their statistical independence. Pursuing the work of Hu Kuo Ting and Te Sun Han, we show that information functions provide coordinates for binary variables, and that they are analytically independent from the probability simplex for any set of finite variables. The maximal positive I_k identifies the variables that co-vary the most in the population, whereas the minimal negative I_k identifies synergistic clusters and the variables that differentiate and segregate the most in the population. Finite data size effects and estimation biases severely constrain the effective computation of the information topology on data, and we provide simple statistical tests for the undersampling bias and the k-dependences. We give an example of application of these methods to genetic expression and unsupervised cell-type classification. The methods unravel biologically relevant subtypes with few errors, using a sample size of 41 genes. This establishes generic methods to quantify epigenetic information storage and a unified formalism for epigenetic unsupervised learning. We propose that higher-order statistical interactions and non-identically distributed variables are constitutive characteristics of biological systems that should be estimated in order to unravel their significant statistical structure and diversity. The topological information data analysis presented here allows this higher-order structure, characteristic of biological systems, to be estimated precisely.
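The k-multivariate mutual information I_k referred to in the description follows, in the line of Hu Kuo Ting's work cited there, the standard inclusion-exclusion form: I_k(X_1; ...; X_k) is the alternating sum over all non-empty subsets S of the k variables of (-1)^(|S|+1) H(X_S), so that I_2 is the usual Shannon mutual information and a negative I_3 signals synergy. The Python sketch below is an illustrative plug-in estimator only, not the authors' implementation; the function names and the XOR toy data are assumptions introduced here for clarity.

# Minimal sketch of the k-multivariate mutual information I_k via the
# inclusion-exclusion sum of plug-in joint entropies (illustrative only).
from collections import Counter
from itertools import combinations
import math

def joint_entropy(samples, idx):
    # Plug-in Shannon entropy (in bits) of the variables at positions idx.
    counts = Counter(tuple(row[i] for i in idx) for row in samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def multivariate_mi(samples, idx):
    # I_k = sum over non-empty subsets S of idx of (-1)^(|S|+1) * H(S).
    total = 0.0
    for size in range(1, len(idx) + 1):
        sign = (-1) ** (size + 1)
        for subset in combinations(idx, size):
            total += sign * joint_entropy(samples, subset)
    return total

# Toy example (hypothetical data): Z = X XOR Y with X, Y independent, uniform.
data = [(x, y, x ^ y) for x in (0, 1) for y in (0, 1) for _ in range(10)]
print(multivariate_mi(data, (0, 1)))     # ~0 bits: X and Y pairwise independent
print(multivariate_mi(data, (0, 1, 2)))  # ~-1 bit: negative I_3, i.e., synergy

With this convention, maximal positive I_k flags the most strongly co-varying tuples, while strongly negative I_k flags synergistic clusters, as described in the abstract.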
first_indexed 2024-04-11T21:41:33Z
format Article
id doaj.art-e55e50d7460c4767a5d8f2d4fee2468f
institution Directory Open Access Journal
issn 1099-4300
language English
last_indexed 2024-04-11T21:41:33Z
publishDate 2019-09-01
publisher MDPI AG
record_format Article
series Entropy
spelling doaj.art-e55e50d7460c4767a5d8f2d4fee2468f (indexed 2022-12-22T04:01:35Z)
Journal: Entropy (MDPI AG), ISSN 1099-4300, Vol. 21, No. 9, Article 869, published 2019-09-01
DOI: 10.3390/e21090869
Authors and affiliations:
Pierre Baudot: Inserm UNIS UMR1072—Université Aix-Marseille, 13015 Marseille, France
Monica Tapia: Inserm UNIS UMR1072—Université Aix-Marseille, 13015 Marseille, France
Daniel Bennequin: Institut de Mathématiques de Jussieu—Paris Rive Gauche (IMJ-PRG), 75013 Paris, France
Jean-Marc Goaillard: Inserm UNIS UMR1072—Université Aix-Marseille, 13015 Marseille, France
Online access: https://www.mdpi.com/1099-4300/21/9/869
spellingShingle Pierre Baudot
Monica Tapia
Daniel Bennequin
Jean-Marc Goaillard
Topological Information Data Analysis
Entropy
information theory
cohomology
information category
topological data analysis
genetic expression
epigenetics
multivariate mutual-information
synergy
statistical independence
title Topological Information Data Analysis
title_full Topological Information Data Analysis
title_fullStr Topological Information Data Analysis
title_full_unstemmed Topological Information Data Analysis
title_short Topological Information Data Analysis
title_sort topological information data analysis
topic information theory
cohomology
information category
topological data analysis
genetic expression
epigenetics
multivariate mutual-information
synergy
statistical independence
url https://www.mdpi.com/1099-4300/21/9/869
work_keys_str_mv AT pierrebaudot topologicalinformationdataanalysis
AT monicatapia topologicalinformationdataanalysis
AT danielbennequin topologicalinformationdataanalysis
AT jeanmarcgoaillard topologicalinformationdataanalysis