Benchmarking datasets for assembly-based variant calling using high-fidelity long reads

Bibliographic Details
Main Authors:	Hyunji Lee, Jun Kim, Junho Lee
Format:	Article
Language:	English
Published:	BMC 2023-03-01
Series:	BMC Genomics
Subjects:	Genetic variant; Variant calling; High-fidelity long reads; Long-read sequencing; Benchmark
Online Access:	https://doi.org/10.1186/s12864-023-09255-y

author	Hyunji Lee; Jun Kim; Junho Lee
collection	DOAJ
description	Background: Recent advances in long-read sequencing technologies have enabled accurate identification of all genetic variants in individuals or cells; this procedure is known as variant calling. However, benchmarking studies on variant calling using different long-read sequencing technologies are still lacking. Results: We used two Caenorhabditis elegans strains to measure several variant calling metrics. These two strains shared true-positive genetic variants that were introduced during strain generation. In addition, both strains contained common and distinguishable variants induced by DNA damage, possibly leading to false-positive estimation. We obtained accurate and noisy long reads from both strains using high-fidelity (HiFi) and continuous long-read (CLR) sequencing platforms, and compared the variant calling performance of the two platforms. HiFi identified a 1.65-fold higher number of true-positive variants on average, with 60% fewer false-positive variants, than CLR did. We also compared read-based and assembly-based variant calling methods in combination with subsampling of various sequencing depths and demonstrated that variant calling after genome assembly was particularly effective for detection of large insertions, even with 10× sequencing depth of accurate long-read sequencing data. Conclusions: By directly comparing the two long-read sequencing technologies, we demonstrated that variant calling after genome assembly with 10× or more depth of accurate long-read sequencing data allowed reliable detection of true-positive variants. Considering the high cost of HiFi sequencing, we herein propose appropriate methodologies for performing cost-effective and high-quality variant calling: 10× assembly-based variant calling. The results of the present study may facilitate the development of methods for identifying all genetic variants at the population level.
first_indexed	2024-04-09T19:59:18Z
format	Article
id	doaj.art-1665f5412a4a45879bae88fda3283c40
institution	Directory of Open Access Journals
issn	1471-2164
language	English
last_indexed	2024-04-09T19:59:18Z
publishDate	2023-03-01
publisher	BMC
record_format	Article
series	BMC Genomics
spelling	doaj.art-1665f5412a4a45879bae88fda3283c40; 2023-04-03T05:17:42Z; eng; BMC; BMC Genomics; 1471-2164; 2023-03-01; volume 24, issue 1, pages 1-14; 10.1186/s12864-023-09255-y; Benchmarking datasets for assembly-based variant calling using high-fidelity long reads; Hyunji Lee (Institute of Molecular Biology and Genetics, Seoul National University); Jun Kim (Department of Biological Sciences, Seoul National University); Junho Lee (Institute of Molecular Biology and Genetics, Seoul National University)
title	Benchmarking datasets for assembly-based variant calling using high-fidelity long reads
topic	Genetic variant; Variant calling; High-fidelity long reads; Long-read sequencing; Benchmark
url	https://doi.org/10.1186/s12864-023-09255-y
