A field-wide assessment of differential expression profiling by high-throughput sequencing reveals widespread bias.

We assess inferential quality in the field of differential expression profiling by high-throughput sequencing (HT-seq) based on analysis of datasets submitted from 2008 to 2020 to the NCBI GEO data repository. We take advantage of the parallel differential expression testing over thousands of genes,...


Bibliographic Details
Main Authors: Taavi Päll, Hannes Luidalepp, Tanel Tenson, Ülo Maiväli
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2023-03-01
Series: PLoS Biology
Online Access: https://doi.org/10.1371/journal.pbio.3002007
author Taavi Päll
Hannes Luidalepp
Tanel Tenson
Ülo Maiväli
collection DOAJ
description We assess inferential quality in the field of differential expression profiling by high-throughput sequencing (HT-seq) based on analysis of datasets submitted from 2008 to 2020 to the NCBI GEO data repository. We take advantage of the parallel differential expression testing over thousands of genes, whereby each experiment leads to a large set of p-values, the distribution of which can indicate the validity of assumptions behind the test. From a well-behaved p-value set, π0, the fraction of genes that are not differentially expressed, can be estimated. We found that only 25% of experiments resulted in theoretically expected p-value histogram shapes, although there is a marked improvement over time. Uniform p-value histogram shapes, indicative of <100 actual effects, were extremely rare. Furthermore, although many HT-seq workflows assume that most genes are not differentially expressed, 37% of experiments have π0 values below 0.5, as if most genes changed their expression level. Most HT-seq experiments have very small sample sizes and are expected to be underpowered. Nevertheless, the estimated π0 values do not show the expected association with sample size N, suggesting widespread problems with controlling the false discovery rate (FDR) in these experiments. Both the fractions of different p-value histogram types and the π0 values are strongly associated with the differential expression analysis program used by the original authors. While we could double the proportion of theoretically expected p-value distributions by removing low-count features from the analysis, this treatment did not remove the association with the analysis program. Taken together, our results indicate widespread bias in the differential expression profiling field and the unreliability of statistical methods used to analyze HT-seq data.
first_indexed 2024-04-09T18:24:07Z
format Article
id doaj.art-3e3eb0d5906c47d4b3e618cea0789c9e
institution Directory Open Access Journal
issn 1544-9173
1545-7885
language English
last_indexed 2024-04-09T18:24:07Z
publishDate 2023-03-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS Biology
spelling doaj.art-3e3eb0d5906c47d4b3e618cea0789c9e; 2023-04-12T05:30:44Z; eng; Public Library of Science (PLoS); PLoS Biology; 1544-9173; 1545-7885; 2023-03-01; 21(3): e3002007; 10.1371/journal.pbio.3002007; A field-wide assessment of differential expression profiling by high-throughput sequencing reveals widespread bias.; Taavi Päll; Hannes Luidalepp; Tanel Tenson; Ülo Maiväli; https://doi.org/10.1371/journal.pbio.3002007
title A field-wide assessment of differential expression profiling by high-throughput sequencing reveals widespread bias.
url https://doi.org/10.1371/journal.pbio.3002007
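
The description field of this record refers to estimating π0, the fraction of genes that are not differentially expressed, from the set of p-values an experiment produces. As a rough illustration only, the sketch below uses a Storey-type estimator with a fixed threshold; this is an assumption for illustration and not necessarily the estimator used by the authors. The function names (`estimate_pi0`, `p_value_histogram`, `simulate_p_values`), the threshold `lam`, and the simulated alternative distribution are all hypothetical.

```python
import random


def estimate_pi0(p_values, lam=0.5):
    """Storey-type estimate: share of p-values above `lam`, rescaled by 1 / (1 - lam).

    Under the null, p-values are uniform on [0, 1], so the region above `lam`
    is assumed to be populated almost entirely by null features.
    """
    if not p_values:
        raise ValueError("p_values must be non-empty")
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must be in [0, 1)")
    above = sum(1 for p in p_values if p > lam)
    # Cap at 1.0 because sampling noise can push the raw ratio above one.
    return min(1.0, above / (len(p_values) * (1.0 - lam)))


def p_value_histogram(p_values, bins=20):
    """Count p-values in `bins` equal-width bins on [0, 1]; p == 1.0 goes in the last bin."""
    counts = [0] * bins
    for p in p_values:
        counts[min(int(p * bins), bins - 1)] += 1
    return counts


def simulate_p_values(n_genes=10000, pi0=0.8, seed=1):
    """Hypothetical mixture: uniform null p-values plus alternatives piled up near zero."""
    rng = random.Random(seed)
    n_null = round(n_genes * pi0)
    null_p = [rng.random() for _ in range(n_null)]
    # Assumed alternative: u**4 with u uniform, which concentrates mass near zero.
    alt_p = [rng.random() ** 4 for _ in range(n_genes - n_null)]
    return null_p + alt_p


if __name__ == "__main__":
    p = simulate_p_values()
    print("histogram:", p_value_histogram(p, bins=10))
    print("estimated pi0:", round(estimate_pi0(p), 3))
```

On a well-behaved p-value set of this kind, the histogram shows a peak near zero on top of a flat background, and the estimate tends to sit somewhat above the true π0 because some alternative p-values also fall above the threshold. The estimator is only meaningful when the right-hand part of the histogram is roughly flat, which is the condition the description refers to as a theoretically expected shape.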