Multivariate Surprisal Analysis of Gene Expression Levels

We consider here multivariate data which we understand as the problem where each data point i is measured for two or more distinct variables. In a typical situation there are many data points i while the range of the different variables is more limited. If there is only one variable then the data ca...

Bibliographic Details
Main Authors:	Francoise Remacle, Andrew S. Goldstein, Raphael D. Levine
Format:	Article
Language:	English
Published:	MDPI AG 2016-12-01
Series:	Entropy
Subjects:	multivariate analysis, maximal entropy, prostate cancer markers, personalized diagnostics, transcriptomics, high order SVD, tensor data format, ensemble phenotypes
Online Access:	http://www.mdpi.com/1099-4300/18/12/445

collection	DOAJ
description	We consider here multivariate data which we understand as the problem where each data point i is measured for two or more distinct variables. In a typical situation there are many data points i while the range of the different variables is more limited. If there is only one variable then the data can be arranged as a rectangular matrix where i is the index of the rows while the values of the variable label the columns. We begin here with this case, but then proceed to the more general case with special emphasis on two variables when the data can be organized as a tensor. An analysis of such multivariate data by a maximal entropy approach is discussed and illustrated for gene expressions in four different cell types of six different patients. The different genes are indexed by i, and there are 24 (4 by 6) entries for each i. We used an unbiased thermodynamic maximal-entropy-based approach (surprisal analysis) to analyze the multivariate transcriptional profiles. The measured microarray experimental data is organized as a tensor array where the two minor orthogonal directions are the different patients and the different cell types. The entries are the transcription levels on a logarithmic scale. We identify a disease signature of prostate cancer and determine the degree of variability between individual patients. Surprisal analysis determined a baseline expression level common for all cells and patients. We identify the transcripts in the baseline as the “housekeeping” genes that ensure cell stability. The baseline and two surprisal patterns satisfactorily recover (99.8%) the multivariate data. The two patterns characterize the individuality of the patients and, to a lesser extent, the commonality of the disease. The immune response was identified as the most significant pathway contributing to the cancer disease pattern. Delineating patient variability is a central issue in personalized diagnostics and it remains to be seen if additional data will confirm the power of multivariate analysis to address this key point. The collapsed limits where the data is compacted into two-dimensional arrays are contained within the proposed formalism.
first_indexed	2024-04-13T08:56:06Z
format	Article
id	doaj.art-f872bf03af4f42e487269a83dbbcab3e
institution	Directory Open Access Journal
issn	1099-4300
language	English
last_indexed	2024-04-13T08:56:06Z
publishDate	2016-12-01
publisher	MDPI AG
record_format	Article
series	Entropy
spelling	doaj.art-f872bf03af4f42e487269a83dbbcab3e; 2022-12-22T02:53:18Z; eng; MDPI AG; Entropy; ISSN 1099-4300; 2016-12-01; volume 18, issue 12, article 445; DOI 10.3390/e18120445; e18120445; Multivariate Surprisal Analysis of Gene Expression Levels; Francoise Remacle (Département de Chimie, B6c, Université de Liège, B4000 Liège, Belgium); Andrew S. Goldstein (Department of Urology, David Geffen School of Medicine and Department of Molecular Cell & Developmental Biology, University of California, Los Angeles, CA 90095, USA); Raphael D. Levine (The Fritz Haber Research Center for Molecular Dynamics, The Institute of Chemistry, The Hebrew University of Jerusalem, Jerusalem 91904, Israel); http://www.mdpi.com/1099-4300/18/12/445
title	Multivariate Surprisal Analysis of Gene Expression Levels
topic	multivariate analysis, maximal entropy, prostate cancer markers, personalized diagnostics, transcriptomics, high order SVD, tensor data format, ensemble phenotypes
url	http://www.mdpi.com/1099-4300/18/12/445
