Yhteenveto: | Motivation: The identification of genetic variants influencing gene expression (known as expression quantitative trait loci or eQTLs ) is important in unravelling the genetic basis of complex traits. Detecting multiple eQTLs simultaneously in a population based on paired DNA-seq and RNA-seq assays employs two competing types of models: models which rely on appropriate transformations of RNA-seq data (and are powered by a mature mathematical theory), or count-based models , which represent digital gene expression explicitly, thus rendering such transformations unnecessary. The latter constitutes an immensely popular methodology, which is however plagued by mathematical intractability. Results: We develop tractable count-based models, which are amenable to efficient estimation through the introduction of latent variables and the appropriate application of recent statistical theory in a sparse Bayesian modelling framework. Furthermore, we examine several transformation methods for RNA-seq read counts and we introduce arcsin , logit and Laplace smoothing as preprocessing steps for transformation-based models. Using natural and carefully simulated data from the 1000 Genomes and gEUVADIS projects, we benchmark both approaches under a variety of scenarios, including the presence of noise and violation of basic model assumptions. We demonstrate that an arcsin transformation of Laplace-smoothed data is at least as good as state-of-the-art models, particularly at small samples. Furthermore, we show that an over-dispersed Poisson model is comparable to the celebrated Negative Binomial, but much easier to estimate. These results provide strong support for transformation-based vs. count-based (particularly Negative-Binomial-based) models for eQTL mapping. Availability: All methods are implemented in the free software eQTLseq : https://github.com/dvav/eQTLseq. Contact: dimitris.vavoulis@ndcls.ox.ac.uk. Supplementary information: Supplementary data are available at Bioinformatics online.
|