Mappability and Read Length

Power-law distributions are the main functional form for the distribution of repeat size and repeat copy number in the human genome. When the genome is broken into fragments for sequencing, the limited size of fragments and reads may prevent a unique alignment of repeat sequences to the reference seq...

Bibliographic Details
Main Authors: Wentian Li, Jan Freudenberg
Format: Article
Language: English
Published: Frontiers Media S.A. 2014-11-01
Series: Frontiers in Genetics
Subjects: Next-generation sequencing; Repeats; Copy Number Variations; power-law distribution; mappability
Online Access: http://journal.frontiersin.org/Journal/10.3389/fgene.2014.00381/full
_version_ 1818196797667934208
author Wentian Li
Jan Freudenberg
collection DOAJ
description Power-law distributions are the main functional form for the distribution of repeat size and repeat copy number in the human genome. When the genome is broken into fragments for sequencing, the limited size of fragments and reads may prevent a unique alignment of repeat sequences to the reference sequence. Repeats in the human genome can be as long as $10^4$ bases, or $10^5$ to $10^6$ bases when allowing for mismatches between repeat units. Sequence reads from these regions are therefore unmappable when the read length is in the range of $10^3$ bases. At a read length of exactly 1000 bases, slightly more than 1% of the assembled genome, and slightly less than 1% of the 1 kb reads, are unmappable, excluding the unassembled portion of the human genome (8% in GRCh37). The slow decay (long tail) of the power-law function implies a diminishing return in converting unmappable regions and reads into mappable ones as the read length increases, although increasing the read length always moves mappability toward 100%.
first_indexed 2024-12-12T01:39:48Z
format Article
id doaj.art-41dc01250a904de4987048ba1f18ba41
institution Directory Open Access Journal
issn 1664-8021
language English
last_indexed 2024-12-12T01:39:48Z
publishDate 2014-11-01
publisher Frontiers Media S.A.
record_format Article
series Frontiers in Genetics
spelling doaj.art-41dc01250a904de4987048ba1f18ba41 | 2022-12-22T00:42:45Z | eng | Frontiers Media S.A. | Frontiers in Genetics | 1664-8021 | 2014-11-01 | 5 | 10.3389/fgene.2014.00381 | 110803 | Mappability and Read Length | Wentian Li; Jan Freudenberg | Feinstein Institute for Medical Research, North Shore LIJ Health System (both authors)
title Mappability and Read Length
topic Next-generation sequencing
Repeats
Copy Number Variations
power-law distribution
mappability
url http://journal.frontiersin.org/Journal/10.3389/fgene.2014.00381/full