Comparison of ONT and CCS sequencing technologies on the polyploid genome of a medicinal plant showed that high error rate of ONT reads are not suitable for self-correction

Abstract Background Many medicinal plants are known for their complex genomes with high ploidy, heterozygosity, and repetitive content, which pose severe challenges for genome sequencing of those species. Long reads from Oxford Nanopore sequencing technology (ONT) or Pacific Biosciences Single Molecu...

Bibliographic Details
Main Authors: Peng Zeng, Zunzhe Tian, Yuwei Han, Weixiong Zhang, Tinggan Zhou, Yingmei Peng, Hao Hu, Jing Cai
Format: Article
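Language: English
Published: BMC 2022-08-01
Series: Chinese Medicine
Subjects: ONT-based assembly; Allotetraploid; Veratrum dahuricum; Low-quality sequences; Homozygous variants
Online Access: https://doi.org/10.1186/s13020-022-00644-1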
author Peng Zeng
Zunzhe Tian
Yuwei Han
Weixiong Zhang
Tinggan Zhou
Yingmei Peng
Hao Hu
Jing Cai
collection DOAJ
description Abstract Background: Many medicinal plants have complex genomes with high ploidy, heterozygosity, and repetitive content, which pose severe challenges for genome sequencing. Long reads from Oxford Nanopore sequencing technology (ONT) or Pacific Biosciences Single Molecule, Real-Time (SMRT) sequencing offer great advantages in de novo genome assembly, especially for complex genomes with high heterozygosity and repetitive content. The genomes of multiple allotetraploid species have now been sequenced with long reads. However, we found that a considerable proportion of these genomes (7.9% on average, maximum 23.7%) could not be covered by next-generation sequencing (NGS) reads. These regions uncovered by NGS reads (UCRs) are either questionable, low-quality sequence or genomic regions that NGS cannot sequence because of sequencing bias. The underlying causes of UCRs in genome assemblies, and solutions to the problem, have not been studied. Methods: We sequenced the tetraploid genome of Veratrum dahuricum (Turcz.) O. Loes (VDL), a Chinese medicinal plant, on the ONT platform and assembled the genome with three strategies in parallel. To explore the cause of the UCRs, we compared the quality, coverage, and heterozygosity of the three ONT assemblies with a previously released assembly of the same individual built from PacBio circular consensus sequencing (CCS) reads. Results: Mapping the NGS reads against the three ONT assemblies and the CCS assembly showed that NGS reads covered 49.15% to 76.31% of the ONT assemblies, much lower than the coverage of the CCS assembly (99.53%). Alignment between the ONT assemblies and the CCS assembly showed that most UCRs can be aligned with the CCS assembly.
We therefore conclude that the UCRs in the ONT assemblies are low-quality sequences whose error rate is too high for short reads to align, rather than genomic regions that NGS cannot sequence. Further comparison among the intermediate versions of the ONT assemblies showed that the most probable origin of those errors is a combination of artificial errors introduced by "self-correction" and initial sequencing errors in the long reads. We also found that polishing the ONT assembly with CCS reads corrects those errors efficiently. Conclusions: By analyzing genome features and read alignments, we found that the high proportion of UCRs in the ONT assembly of VDL is caused by sequencing errors and by additional errors introduced during self-correction. The high error rate of raw ONT reads makes them unsuitable for self-correction prior to allotetraploid genome assembly, as self-correction introduces artificial errors into > 5% of the UCR sequences. We suggest that high-precision CCS reads be used to polish such assemblies and correct those errors effectively for polyploid genomes.
first_indexed 2024-04-12T06:25:56Z
format Article
id doaj.art-5e49c91fd67a450c8165e940aa718865
institution Directory Open Access Journal
issn 1749-8546
language English
last_indexed 2024-04-12T06:25:56Z
publishDate 2022-08-01
publisher BMC
record_format Article
series Chinese Medicine
spelling doaj.art-5e49c91fd67a450c8165e940aa718865 2022-12-22T03:44:09Z eng BMC Chinese Medicine 1749-8546 2022-08-01 17(1):1-12 10.1186/s13020-022-00644-1
Comparison of ONT and CCS sequencing technologies on the polyploid genome of a medicinal plant showed that high error rate of ONT reads are not suitable for self-correction
Peng Zeng (State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau)
Zunzhe Tian (School of Ecology and Environment, Northwestern Polytechnical University)
Yuwei Han (School of Ecology and Environment, Northwestern Polytechnical University)
Weixiong Zhang (State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau)
Tinggan Zhou (School of Ecology and Environment, Northwestern Polytechnical University)
Yingmei Peng (School of Ecology and Environment, Northwestern Polytechnical University)
Hao Hu (State Key Laboratory of Quality Research in Chinese Medicine, Institute of Chinese Medical Sciences, University of Macau)
Jing Cai (School of Ecology and Environment, Northwestern Polytechnical University)
https://doi.org/10.1186/s13020-022-00644-1
ONT-based assembly; Allotetraploid; Veratrum dahuricum; Low-quality sequences; Homozygous variants
title Comparison of ONT and CCS sequencing technologies on the polyploid genome of a medicinal plant showed that high error rate of ONT reads are not suitable for self-correction
topic ONT-based assembly
Allotetraploid
Veratrum dahuricum
Low-quality sequences
Homozygous variants
url https://doi.org/10.1186/s13020-022-00644-1