Latent variable models for analysing multidimensional gene expression data
Main Author: | Hore, V |
---|---|
Other Authors: | Marchini, J |
Format: | Thesis |
Published: | 2015 |
_version_ | 1826303367718109184 |
---|---|
author | Hore, V |
author2 | Marchini, J |
author_facet | Marchini, J Hore, V |
author_sort | Hore, V |
collection | OXFORD |
description | <p>Multi-tissue gene expression studies give rise to 3D arrays of data. These experiments make it possible to study the tissue-specific nature of gene regulation and also the relationship between genotypes and higher level traits such as disease status. Analysing these multidimensional data sets is a statistical challenge, as they contain high noise levels and missing data. In this thesis I introduce a new approach for analysing multidimensional gene expression data sets called SPIDER (SParse Integrated DEcomposition for RNA-sequencing). SPIDER is a sparse Bayesian tensor decomposition that models the data as a sum of components (or factors). Each component consists of three vectors of scores or loadings that describe modes of variation across individuals, genes and tissues. Sparsity is induced in the components using a spike and slab prior, allowing for recovery of sparse structure in the data. The decomposition is easily extended to jointly decompose several data types, handle missing data and allow for relatedness between individuals, another common problem in genetics. Inference for the model is performed using variational Bayes.</p> <p>SPIDER is compared to existing approaches for decomposing multidimensional data via simulations. Results suggest that SPIDER performs comparably to, or better than, existing approaches and particularly well when the underlying signals are very sparse. Additional simulations designed to contain realistic levels of signal and noise suggest that SPIDER has the power to recover gene networks from gene expression data.</p> <p>I have applied SPIDER to gene expression data measured using RNA-sequencing for 845 individuals in three tissues from the TwinsUK cohort. Estimated components were tested for association with genetic variation genome-wide. Five signals describing gene regulation networks driven by genetic variants are uncovered, building on the current understanding of these pathways. In addition, components uncovering effects of experimental artefacts and covariates were also recovered from the data.</p> |
first_indexed | 2024-03-07T06:01:37Z |
format | Thesis |
id | oxford-uuid:ec62bc11-5c3f-467d-9ff3-f3c4eb29d140 |
institution | University of Oxford |
last_indexed | 2024-03-07T06:01:37Z |
publishDate | 2015 |
record_format | dspace |
title | Latent variable models for analysing multidimensional gene expression data |
work_keys_str_mv | AT horev latentvariablemodelsforanalysingmultidimensionalgeneexpressiondata |
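
The description above says the data are modelled as a sum of components, each made of three vectors (individuals, genes, tissues), with sparsity in the components. As a rough illustration only, the sketch below simulates a tensor of that form with NumPy. It is not the thesis's implementation and does no inference: the function name, argument names, the choice to place sparsity on the gene loadings only, and the Gaussian slab and noise are all assumptions made for this example.

```python
import numpy as np


def simulate_sparse_tensor(n_individuals, n_genes, n_tissues, n_components,
                           gene_sparsity=0.9, noise_sd=0.1, seed=0):
    """Simulate an individuals x genes x tissues array as a sum of components.

    Hypothetical helper for illustration; all names and defaults are assumed.
    """
    rng = np.random.default_rng(seed)

    # One score/loading vector per component for individuals and tissues.
    individual_scores = rng.normal(size=(n_individuals, n_components))
    tissue_loadings = rng.normal(size=(n_tissues, n_components))

    # Spike-and-slab style draw for gene loadings: each entry is exactly zero
    # (the "spike") with probability gene_sparsity, otherwise Gaussian (the "slab").
    slab = rng.normal(size=(n_genes, n_components))
    in_slab = rng.random((n_genes, n_components)) > gene_sparsity
    gene_loadings = slab * in_slab

    # Sum over components c of the outer product of the three vectors.
    signal = np.einsum("ic,gc,tc->igt",
                       individual_scores, gene_loadings, tissue_loadings)
    noisy = signal + rng.normal(scale=noise_sd, size=signal.shape)
    return individual_scores, gene_loadings, tissue_loadings, noisy, signal


if __name__ == "__main__":
    A, B, T, Y, S = simulate_sparse_tensor(8, 50, 3, n_components=2)
    print(Y.shape)
    print("non-zero gene loadings per component:", (B != 0).sum(axis=0))
```

The `einsum` call is just a compact way to write the sum of rank-one terms; fitting such a model to real data (for example by variational Bayes, as the description states) is a separate and much larger task that this sketch does not attempt.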