Hierarchical Dirichlet Process-Based Models For Discovery of Cross-species Mammalian Gene Expression

An important research problem in computational biology is the identification of expression programs, sets of co-activated genes orchestrating physiological processes, and the characterization of the functional breadth of these programs. The use of mammalian expression data compendia for discovery of su...

Bibliographic Details
Main Authors: Gerber, Georg K., Dowell, Robin D., Jaakkola, Tommi S., Gifford, David K.
Other Authors: Dave Gifford
Published: 2007
Online Access: http://hdl.handle.net/1721.1/37817
author Gerber, Georg K.
Dowell, Robin D.
Jaakkola, Tommi S.
Gifford, David K.
author2 Dave Gifford
collection MIT
description An important research problem in computational biology is the identification of expression programs, sets of co-activated genes orchestrating physiological processes, and the characterization of the functional breadth of these programs. The use of mammalian expression data compendia for discovery of such programs presents several challenges, including: 1) cellular inhomogeneity within samples, 2) genetic and environmental variation across samples, and 3) uncertainty in the numbers of programs and sample populations. We developed GeneProgram, a new unsupervised computational framework that uses expression data to simultaneously organize genes into overlapping programs and tissues into groups to produce maps of inter-species expression programs, which are sorted by generality scores that exploit the automatically learned groupings. Our method addresses each of the above challenges by using a probabilistic model that: 1) allocates mRNA to different expression programs that may be shared across tissues, 2) is hierarchical, treating each tissue as a sample from a population of related tissues, and 3) uses Dirichlet Processes, a non-parametric Bayesian method that provides prior distributions over numbers of sets while penalizing model complexity. Using real gene expression data, we show that GeneProgram outperforms several popular expression analysis methods in recovering biologically interpretable gene sets. From a large compendium of mouse and human expression data, GeneProgram discovers 19 tissue groups and 100 expression programs active in mammalian tissues. Our method automatically constructs a comprehensive, body-wide map of expression programs and characterizes their functional generality. This map can be used for guiding future biological experiments, such as discovery of genes for new drug targets that exhibit minimal "cross-talk" with unintended organs, or genes that maintain general physiological responses that go awry in disease states. Further, our method is general, and can be applied readily to novel compendia of biological data.
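The abstract notes that Dirichlet Processes give a prior over the number of sets while penalizing model complexity. One standard way to see this is the Chinese restaurant process view of a Dirichlet Process, sketched below. This is a generic illustration only, not GeneProgram's model or its inference procedure; the function name `sample_crp_partition` and the concentration parameter `alpha` are assumptions introduced here and do not come from the record.

```python
import random


def sample_crp_partition(n_items, alpha, rng):
    """Draw one partition of n_items from a Chinese restaurant process.

    Hypothetical helper for illustration; `alpha` is the concentration
    parameter of the underlying Dirichlet Process.
    """
    set_sizes = []    # current size of each set ("table")
    assignments = []  # index of the set each item joined
    for _ in range(n_items):
        # An existing set is chosen in proportion to its size,
        # a brand-new set in proportion to alpha.
        weights = set_sizes + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(set_sizes):
            set_sizes.append(1)   # open a new set
        else:
            set_sizes[k] += 1     # join an existing set
        assignments.append(k)
    return assignments, set_sizes


rng = random.Random(0)
for alpha in (0.5, 5.0):
    # Average number of sets over repeated draws of 200 items.
    counts = [len(sample_crp_partition(200, alpha, rng)[1]) for _ in range(50)]
    print(f"alpha={alpha}: mean number of sets = {sum(counts) / len(counts):.1f}")
```

The number of sets is not fixed in advance, yet large sets attract new items more strongly than new sets do, so the expected number of sets grows only slowly with the amount of data. Smaller values of `alpha` favor fewer sets.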
first_indexed 2024-09-23T12:59:18Z
id mit-1721.1/37817
institution Massachusetts Institute of Technology
last_indexed 2024-09-23T12:59:18Z
publishDate 2007
record_format dspace
spelling mit-1721.1/37817 2019-04-12T08:38:28Z Hierarchical Dirichlet Process-Based Models For Discovery of Cross-species Mammalian Gene Expression Gerber, Georg K. Dowell, Robin D. Jaakkola, Tommi S. Gifford, David K. Dave Gifford Computational & Systems Biology 2007-07-09T17:43:48Z 2007-07-09T17:43:48Z 2007-07-06 MIT-CSAIL-TR-2007-037 http://hdl.handle.net/1721.1/37817 Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory 42 p. application/postscript application/pdf
title Hierarchical Dirichlet Process-Based Models For Discovery of Cross-species Mammalian Gene Expression
url http://hdl.handle.net/1721.1/37817