Information Theory of DNA Shotgun Sequencing

DNA sequencing is the basic workhorse of modern-day biology and medicine. Shotgun sequencing is the dominant technique used: many randomly located short fragments called reads are extracted from the DNA sequence, and these reads are assembled to reconstruct the original sequence. A basic question is...


Bibliographic Details
Main Authors: Motahari, Abolfazl S., Bresler, Guy, Tse, David N. C.
Other Authors: Massachusetts Institute of Technology. Department of Electrical Engineering and Computer Science
Format: Article
Language: en_US
Published: Institute of Electrical and Electronics Engineers (IEEE) 2017
Online Access: http://hdl.handle.net/1721.1/110778
https://orcid.org/0000-0003-1303-582X
collection MIT
description DNA sequencing is the basic workhorse of modern-day biology and medicine. Shotgun sequencing is the dominant technique used: many randomly located short fragments called reads are extracted from the DNA sequence, and these reads are assembled to reconstruct the original sequence. A basic question is: given a sequencing technology and the statistics of the DNA sequence, what is the minimum number of reads required for reliable reconstruction? This number provides a fundamental limit to the performance of any assembly algorithm. For a simple statistical model of the DNA sequence and the read process, we show that the answer admits a critical phenomenon in the asymptotic limit of long DNA sequences: if the read length is below a threshold, reconstruction is impossible no matter how many reads are observed, and if the read length is above the threshold, having enough reads to cover the DNA sequence is sufficient to reconstruct. The threshold is computed in terms of the Rényi entropy rate of the DNA sequence. We also study the impact of noise in the read process on the performance.
first_indexed 2024-09-23T10:41:25Z
format Article
id mit-1721.1/110778
institution Massachusetts Institute of Technology
language en_US
last_indexed 2024-09-23T10:41:25Z
publishDate 2017
publisher Institute of Electrical and Electronics Engineers (IEEE)
record_format dspace
spelling mit-1721.1/110778 2022-09-30T22:18:30Z
2017-07-19T18:40:47Z 2017-07-19T18:40:47Z 2013-10 2013-05
Article http://purl.org/eprint/type/JournalArticle
0018-9448 1557-9654
Motahari, Abolfazl S.; Bresler, Guy and Tse, David N. C. “Information Theory of DNA Shotgun Sequencing.” IEEE Transactions on Information Theory 59, 10 (October 2013): 6273–6289 © 2013 Institute of Electrical and Electronics Engineers (IEEE)
http://dx.doi.org/10.1109/tit.2013.2270273
Creative Commons Attribution-Noncommercial-Share Alike http://creativecommons.org/licenses/by-nc-sa/4.0/
application/pdf
arXiv