Latent variable models for analysing multidimensional gene expression data


Bibliographic Details
Main Author: Hore, V
Other Authors: Marchini, J
Format: Thesis
Published: 2015
author Hore, V
author2 Marchini, J
collection OXFORD
description <p>Multi-tissue gene expression studies give rise to 3D arrays of data. These experiments make it possible to study the tissue-specific nature of gene regulation and also the relationship between genotypes and higher level traits such as disease status. Analysing these multidimensional data sets is a statistical challenge, as they contain high noise levels and missing data. In this thesis I introduce a new approach for analysing multidimensional gene expression data sets called SPIDER (SParse Integrated DEcomposition for RNA-sequencing). SPIDER is a sparse Bayesian tensor decomposition that models the data as a sum of components (or factors). Each component consists of three vectors of scores or loadings that describe modes of variation across individuals, genes and tissues. Sparsity is induced in the components using a spike and slab prior, allowing for recovery of sparse structure in the data. The decomposition is easily extended to jointly decompose several data types, handle missing data and allow for relatedness between individuals, another common problem in genetics. Inference for the model is performed using variational Bayes.</p> <p>SPIDER is compared to existing approaches for decomposing multidimensional data via simulations. Results suggest that SPIDER performs comparably to, or better than, existing approaches and particularly well when the underlying signals are very sparse. Additional simulations designed to contain realistic levels of signal and noise suggest that SPIDER has the power to recover gene networks from gene expression data.</p> <p>I have applied SPIDER to gene expression data measured using RNA-sequencing for 845 individuals in three tissues from the TwinsUK cohort. Estimated components were tested for association with genetic variation genome-wide. Five signals describing gene regulation networks driven by genetic variants are uncovered, building on the current understanding of these pathways. 
In addition, components capturing the effects of experimental artefacts and covariates were recovered from the data.</p>
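The abstract describes the data as a sum of components, each made of three vectors over individuals, genes and tissues. The sketch below illustrates that structure only, using NumPy. It is not SPIDER's implementation: the function names, the array sizes, the choice to make only the gene loadings sparse, and the simple Bernoulli-Gaussian draw standing in for a spike and slab prior are all assumptions made for illustration, and the variational Bayes inference is not shown.

```python
import numpy as np


def reconstruct_tensor(A, B, C):
    """Sum of rank-one components.

    Y[i, j, k] = sum_c A[i, c] * B[j, c] * C[k, c], where the columns of
    A, B and C hold the per-component vectors for individuals, genes and
    tissues (hypothetical naming, not taken from the thesis).
    """
    return np.einsum("ic,jc,kc->ijk", A, B, C)


def spike_and_slab_sample(rng, shape, p_nonzero=0.1, slab_sd=1.0):
    """Draw values that are exactly zero (spike) with probability
    1 - p_nonzero, and Gaussian (slab) otherwise. This is an assumed,
    simplified stand-in for a spike and slab prior."""
    mask = rng.random(shape) < p_nonzero
    slab = rng.normal(0.0, slab_sd, size=shape)
    return np.where(mask, slab, 0.0)


rng = np.random.default_rng(0)

# Illustrative sizes only; the real data set is far larger.
n_individuals, n_genes, n_tissues, n_components = 20, 50, 3, 4

A = rng.normal(size=(n_individuals, n_components))       # individual scores
B = spike_and_slab_sample(rng, (n_genes, n_components))  # sparse gene loadings
C = rng.normal(size=(n_tissues, n_components))           # tissue scores

signal = reconstruct_tensor(A, B, C)
Y = signal + rng.normal(scale=0.1, size=signal.shape)    # noisy observed array
```

Here `Y` plays the role of the individuals by genes by tissues array. A decomposition method would start from `Y` alone and try to recover `A`, `B` and `C`.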
first_indexed 2024-03-07T06:01:37Z
format Thesis
id oxford-uuid:ec62bc11-5c3f-467d-9ff3-f3c4eb29d140
institution University of Oxford
last_indexed 2024-03-07T06:01:37Z
publishDate 2015
record_format dspace
title Latent variable models for analysing multidimensional gene expression data