Data mining techniques for large-scale gene expression analysis

Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011.

Bibliographic Details
Main Author: Palmer, Nathan Patrick
Other Authors: Bonnie Berger.
Format: Thesis
Language: English
Published: Massachusetts Institute of Technology 2012
Subjects: Electrical Engineering and Computer Science.
Online Access: http://hdl.handle.net/1721.1/68493
collection MIT
first_indexed 2024-09-23T15:05:23Z
format Thesis
id mit-1721.1/68493
institution Massachusetts Institute of Technology
language eng
last_indexed 2024-09-23T15:05:23Z
publishDate 2012
publisher Massachusetts Institute of Technology
record_format dspace
spelling mit-1721.1/68493 2019-04-10T13:49:20Z Data mining techniques for large-scale gene expression analysis Palmer, Nathan Patrick Bonnie Berger. Massachusetts Institute of Technology. Dept. of Electrical Engineering and Computer Science. Electrical Engineering and Computer Science. Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2011. Cataloged from PDF version of thesis. Includes bibliographical references (p. 238-256).

Modern computational biology is awash in large-scale data mining problems. Several high-throughput technologies have been developed that enable us, with relative ease and little expense, to evaluate the coordinated expression levels of tens of thousands of genes, evaluate hundreds of thousands of single-nucleotide polymorphisms, and sequence individual genomes. The data produced by these assays has provided the research and commercial communities with the opportunity to derive improved clinical prognostic indicators, as well as develop an understanding, at the molecular level, of the systemic underpinnings of a variety of diseases.

Aside from the statistical methods used to evaluate these assays, another, more subtle challenge is emerging. Despite the explosive growth in the amount of data being generated and submitted to the various publicly available data repositories, very little attention has been paid to managing the phenotypic characterization of their samples (i.e., managing class labels in a controlled fashion). If sense is to be made of the underlying assay data, the samples' descriptive metadata must first be standardized in a machine-readable format. In this thesis, we explore these issues, specifically within the context of curating and analyzing a large DNA microarray database. We address three main challenges.

First, we acquire a large subset of a publicly available microarray repository and develop a principled method for extracting phenotype information from free-text sample labels, then use that information to generate an index of the samples' medically relevant annotation. The indexing method we develop, Concordia, incorporates pre-existing expert knowledge relating to the hierarchical relationships between medical terms, allowing queries of arbitrary specificity to be efficiently answered.

Second, we describe a highly flexible approach to answering the question: "Given a previously unseen gene expression sample, how can we compute its similarity to all of the labeled samples in our database, and how can we utilize those similarity scores to predict the phenotype of the new sample?"

Third, we describe a method for identifying phenotype-specific transcriptional profiles within the context of this database, and explore a method for measuring the relative strength of those signatures across the rest of the database, allowing us to identify molecular signatures that are shared across various tissues and diseases. These shared fingerprints may form a quantitative basis for optimal therapy selection and drug repositioning for a variety of diseases.

by Nathan Patrick Palmer. Ph.D.

2012-01-12T19:32:04Z 2012-01-12T19:32:04Z 2011 2011 Thesis http://hdl.handle.net/1721.1/68493 770409532 eng M.I.T. theses are protected by copyright. They may be viewed from this source for any purpose, but reproduction or distribution in any format is prohibited without written permission. See provided URL for inquiries about permission. http://dspace.mit.edu/handle/1721.1/7582 256 p. application/pdf Massachusetts Institute of Technology
title Data mining techniques for large-scale gene expression analysis
topic Electrical Engineering and Computer Science.