Mining gene expression data by interpreting principal components


Bibliographic Details
Main Authors: Mortazavi Ali, Trout Diane, King Brandon W, Roden Joseph C, Wold Barbara J, Hart Christopher E
Format: Article
Language: English
Published: BMC 2006-04-01
Series: BMC Bioinformatics
Online Access: http://www.biomedcentral.com/1471-2105/7/194
collection DOAJ
description Abstract

Background: There are many methods for analyzing microarray data that group together genes having similar patterns of expression over all conditions tested. However, in many instances the biologically important goal is to identify relatively small sets of genes that share coherent expression across only some conditions, rather than all or most conditions as required in traditional clustering; e.g. genes that are highly up-regulated and/or down-regulated similarly across only a subset of conditions. Equally important is the need to learn which conditions are the decisive ones in forming such gene sets of interest, and how they relate to diverse conditional covariates, such as disease diagnosis or prognosis.

Results: We present a method for automatically identifying such candidate sets of biologically relevant genes using a combination of principal components analysis and information theoretic metrics. To enable easy use of our methods, we have developed a data analysis package that facilitates visualization and subsequent data mining of the independent sources of significant variation present in gene microarray expression datasets (or in any other similarly structured high-dimensional dataset). We applied these tools to two public datasets, and highlight sets of genes most affected by specific subsets of conditions (e.g. tissues, treatments, samples, etc.). Statistically significant associations for highlighted gene sets were shown via global analysis for Gene Ontology term enrichment. Together with covariate associations, the tool provides a basis for building testable hypotheses about the biological or experimental causes of observed variation.

Conclusion: We provide an unsupervised data mining technique for diverse microarray expression datasets that is distinct from major methods now in routine use. In test uses, this method, based on publicly available gene annotations, appears to identify numerous sets of biologically relevant genes. It has proven especially valuable in instances where there are many diverse conditions (10's to hundreds of different tissues or cell types), a situation in which many clustering and ordering algorithms become problematic. This approach also shows promise in other topic domains such as multi-spectral imaging datasets.
first_indexed 2024-04-13T04:59:30Z
format Article
id doaj.art-0965ac4aed4640f38c319c9a2a15164e
institution Directory Open Access Journal
issn 1471-2105
language English
last_indexed 2024-04-13T04:59:30Z
spelling doaj.art-0965ac4aed4640f38c319c9a2a15164e | 2022-12-22T03:01:23Z | eng | BMC | BMC Bioinformatics | 1471-2105 | 2006-04-01 | 7 | 1 | 194 | 10.1186/1471-2105-7-194 | Mining gene expression data by interpreting principal components | Mortazavi Ali; Trout Diane; King Brandon W; Roden Joseph C; Wold Barbara J; Hart Christopher E | http://www.biomedcentral.com/1471-2105/7/194