Large-Scale Quality Analysis of Published ChIP-seq Data

Bibliographic Details
Main Authors:	Kundaje, Anshul; Marinov, Georgi K.; Park, Peter J.; Wold, Barbara J.
Other Authors:	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
Format:	Article
Language:	en_US
Published:	Genetics Society of America 2014
Online Access:	http://hdl.handle.net/1721.1/87581

_version_	1826213299891470336
author	Kundaje, Anshul; Marinov, Georgi K.; Park, Peter J.; Wold, Barbara J.
author2	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory
author_facet	Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory; Kundaje, Anshul; Marinov, Georgi K.; Park, Peter J.; Wold, Barbara J.
author_sort	Kundaje, Anshul
collection	MIT
description	ChIP-seq has become the primary method for identifying in vivo protein–DNA interactions on a genome-wide scale, with nearly 800 publications involving the technique appearing in PubMed as of December 2012. Individually and in aggregate, these data are an important and information-rich resource. However, uncertainties about data quality confound their use by the wider research community. Recently, the Encyclopedia of DNA Elements (ENCODE) project developed and applied metrics to objectively measure ChIP-seq data quality. The ENCODE quality analysis was useful for flagging datasets for closer inspection, eliminating or replacing poor data, and for driving changes in experimental pipelines. There had been no similarly systematic quality analysis of the large and disparate body of published ChIP-seq profiles. Here, we report a uniform analysis of vertebrate transcription factor ChIP-seq datasets in the Gene Expression Omnibus (GEO) repository as of April 1, 2012. The majority (55%) of datasets scored as being highly successful, but a substantial minority (20%) were of apparently poor quality, and another ∼25% were of intermediate quality. We discuss how different uses of ChIP-seq data are affected by specific aspects of data quality, and we highlight exceptional instances for which the metric values should not be taken at face value. Unexpectedly, we discovered that a significant subset of control datasets (i.e., no immunoprecipitation and mock immunoprecipitation samples) display an enrichment structure similar to successful ChIP-seq data. This can, in turn, affect peak calling and data interpretation. Published datasets identified here as high-quality comprise a large group that users can draw on for large-scale integrated analysis. In the future, ChIP-seq quality assessment similar to that used here could guide experimentalists at early stages in a study, provide useful input in the publication process, and be used to stratify ChIP-seq data for different community-wide uses.
first_indexed	2024-09-23T15:46:52Z
format	Article
id	mit-1721.1/87581
institution	Massachusetts Institute of Technology
language	en_US
last_indexed	2024-09-23T15:46:52Z
publishDate	2014
publisher	Genetics Society of America
record_format	dspace
spelling	mit-1721.1/87581 2022-09-29T16:05:59Z Large-Scale Quality Analysis of Published ChIP-seq Data Kundaje, Anshul; Marinov, Georgi K.; Park, Peter J.; Wold, Barbara J. Massachusetts Institute of Technology. Computer Science and Artificial Intelligence Laboratory Kundaje, Anshul 2014-05-30T14:50:19Z 2014-05-30T14:50:19Z 2013-12 2013-09 Article http://purl.org/eprint/type/JournalArticle 2160-1836 http://hdl.handle.net/1721.1/87581 Marinov, G. K., A. Kundaje, P. J. Park, and B. J. Wold. “Large-Scale Quality Analysis of Published ChIP-Seq Data.” G3: Genes-Genomes-Genetics 4, no. 2 (March 13, 2014): 209–223. en_US http://dx.doi.org/10.1534/g3.113.008680 G3: Genes-Genomes-Genetics Creative Commons Attribution http://creativecommons.org/licenses/by/3.0/ application/pdf Genetics Society of America Genetics Society of America
spellingShingle	Kundaje, Anshul; Marinov, Georgi K.; Park, Peter J.; Wold, Barbara J. Large-Scale Quality Analysis of Published ChIP-seq Data
title	Large-Scale Quality Analysis of Published ChIP-seq Data
title_full	Large-Scale Quality Analysis of Published ChIP-seq Data
title_fullStr	Large-Scale Quality Analysis of Published ChIP-seq Data
title_full_unstemmed	Large-Scale Quality Analysis of Published ChIP-seq Data
title_short	Large-Scale Quality Analysis of Published ChIP-seq Data
title_sort	large scale quality analysis of published chip seq data
url	http://hdl.handle.net/1721.1/87581
work_keys_str_mv	AT kundajeanshul largescalequalityanalysisofpublishedchipseqdata AT marinovgeorgik largescalequalityanalysisofpublishedchipseqdata AT parkpeterj largescalequalityanalysisofpublishedchipseqdata AT woldbarbaraj largescalequalityanalysisofpublishedchipseqdata

Similar Items