Data mining techniques for large-scale gene expression analysis

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011.

Bibliographic Details
Main Author:	Palmer, Nathan Patrick
Other Authors:	Bonnie Berger.
Format:	Thesis
Language:	eng
Published:	Massachusetts Institute of Technology 2012
Subjects:	Electrical Engineering and Computer Science.
Online Access:	http://hdl.handle.net/1721.1/68493

Summary:	Modern computational biology is awash in large-scale data mining problems. Several high-throughput technologies have been developed that enable us, with relative ease and little expense, to evaluate the coordinated expression levels of tens of thousands of genes, evaluate hundreds of thousands of single-nucleotide polymorphisms, and sequence individual genomes. The data produced by these assays has provided the research and commercial communities with the opportunity to derive improved clinical prognostic indicators, as well as develop an understanding, at the molecular level, of the systemic underpinnings of a variety of diseases.

Aside from the statistical methods used to evaluate these assays, another, more subtle challenge is emerging. Despite the explosive growth in the amount of data being generated and submitted to the various publicly available data repositories, very little attention has been paid to managing the phenotypic characterization of their samples (i.e., managing class labels in a controlled fashion). If sense is to be made of the underlying assay data, the samples' descriptive metadata must first be standardized in a machine-readable format. In this thesis, we explore these issues, specifically within the context of curating and analyzing a large DNA microarray database. We address three main challenges.

First, we acquire a large subset of a publicly available microarray repository and develop a principled method for extracting phenotype information from free-text sample labels, then use that information to generate an index of the samples' medically relevant annotations. The indexing method we develop, Concordia, incorporates pre-existing expert knowledge relating to the hierarchical relationships between medical terms, allowing queries of arbitrary specificity to be efficiently answered.

Second, we describe a highly flexible approach to answering the question: "Given a previously unseen gene expression sample, how can we compute its similarity to all of the labeled samples in our database, and how can we utilize those similarity scores to predict the phenotype of the new sample?"

Third, we describe a method for identifying phenotype-specific transcriptional profiles within the context of this database, and explore a method for measuring the relative strength of those signatures across the rest of the database, allowing us to identify molecular signatures that are shared across various tissues and diseases. These shared fingerprints may form a quantitative basis for optimal therapy selection and drug repositioning for a variety of diseases.

Statement of Responsibility:	by Nathan Patrick Palmer.
Notes:	Cataloged from PDF version of thesis. Includes bibliographical references (p. 238-256).
Physical Description:	256 p., application/pdf
Rights:	M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582

Similar Items