Pairwise comparative analysis of six haplotype assembly methods based on users’ experience

Abstract. Background: A haplotype is a set of DNA variants inherited together from one parent or chromosome. Haplotype information is useful for studying genetic variation and disease association. Haplotype assembly (HA) is a process of obtaining haplo...

Bibliographic Details
Main Authors:	Sun, Shuying; Cheng, Flora; Han, Daphne; Wei, Sarah; Zhong, Alice; Massoudian, Sherwin; Johnson, Alison B.
Format:	Article
Language:	English
Published:	BioMed Central 2023
Online Access:	https://hdl.handle.net/1721.1/151073

author	Sun, Shuying; Cheng, Flora; Han, Daphne; Wei, Sarah; Zhong, Alice; Massoudian, Sherwin; Johnson, Alison B.
collection	MIT
description	Abstract. Background: A haplotype is a set of DNA variants inherited together from one parent or chromosome. Haplotype information is useful for studying genetic variation and disease association. Haplotype assembly (HA) is the process of obtaining haplotypes from DNA sequencing data. Many HA methods are currently available, each with its own strengths and weaknesses. This study compared six HA methods or algorithms (HapCUT2, MixSIH, PEATH, WhatsHap, SDhaP, and MAtCHap) using two NA12878 datasets named hg19 and hg38. The six HA algorithms were run on chromosome 10 of these two datasets, each at three filtering levels based on sequencing depth (DP1, DP15, and DP30), and their outputs were then compared. Results: Run time (CPU time) was compared to assess the efficiency of the six HA methods. HapCUT2 was the fastest HA method on all six datasets, with run time consistently under 2 min. WhatsHap was also relatively fast, with a run time of 21 min or less on all six datasets. The run time of the other four HA algorithms varied across datasets and coverage levels. To assess accuracy, pairwise comparisons were conducted for each pair of the six packages by generating their disagreement rates for both haplotype blocks and single nucleotide variants (SNVs). The authors also compared them using switch distance (error), i.e., the number of positions where two chromosomes of a certain phase must be switched to match the known haplotype. HapCUT2, PEATH, MixSIH, and MAtCHap generated output files with similar numbers of blocks and SNVs and performed relatively similarly. WhatsHap generated a much larger number of SNVs in the hg19 DP1 output, which caused it to have high disagreement percentages with the other methods. For the hg38 data, however, WhatsHap performed similarly to the other four algorithms, excluding SDhaP. The comparison showed that SDhaP had a much larger disagreement rate against the other algorithms in all six datasets. Conclusion: The comparative analysis is important because each algorithm is different. The findings of this study provide a deeper understanding of the performance of currently available HA algorithms and useful input for other users.
first_indexed	2024-09-23T09:23:02Z
format	Article
id	mit-1721.1/151073
institution	Massachusetts Institute of Technology
language	English
last_indexed	2024-09-23T09:23:02Z
publishDate	2023
publisher	BioMed Central
record_format	dspace
spelling	mit-1721.1/151073 2023-07-11T03:45:00Z 2023-07-10T19:01:18Z 2023-07-10T19:01:18Z 2023-06-29 2023-07-02T03:11:29Z Article http://purl.org/eprint/type/JournalArticle https://hdl.handle.net/1721.1/151073 BMC Genomic Data. 2023 Jun 29;24(1):35 PUBLISHER_CC en https://doi.org/10.1186/s12863-023-01134-5 Creative Commons Attribution http://creativecommons.org/licenses/by/4.0/ The Author(s) application/pdf BioMed Central
title	Pairwise comparative analysis of six haplotype assembly methods based on users’ experience
url	https://hdl.handle.net/1721.1/151073
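
The two accuracy measures named in the abstract, switch distance and pairwise disagreement rate, can be illustrated with a small sketch. This is not the authors' code: the record does not say how the rates were computed, so the encoding (one haplotype per block as a string or mapping of '0'/'1' alleles at heterozygous sites, with the second haplotype implied as its complement) and the function names `switch_distance` and `disagreement_rate` are assumptions made for illustration only.

```python
def switch_distance(predicted: str, truth: str) -> int:
    """Count the phase switches needed to turn `predicted` into `truth`.

    Both arguments are strings of '0'/'1', one character per heterozygous
    site, describing a single haplotype of the same block (assumed encoding).
    """
    if len(predicted) != len(truth):
        raise ValueError("haplotypes must cover the same sites")
    # True where the predicted allele matches the truth, False where the
    # predicted haplotype carries the complementary allele instead.
    agree = [p == t for p, t in zip(predicted, truth)]
    # A switch occurs wherever the agreement state flips between neighbours.
    return sum(1 for a, b in zip(agree, agree[1:]) if a != b)


def disagreement_rate(hap_a: dict, hap_b: dict) -> float:
    """Fraction of shared SNV positions on which two phasings disagree.

    `hap_a` and `hap_b` map position -> '0'/'1' for one block as reported by
    two different tools. Haplotype labels are arbitrary, so the better of the
    two possible orientations is used (an assumption of this sketch).
    """
    shared = set(hap_a) & set(hap_b)
    if not shared:
        return 0.0
    mismatches = sum(hap_a[pos] != hap_b[pos] for pos in shared)
    return min(mismatches, len(shared) - mismatches) / len(shared)


if __name__ == "__main__":
    print(switch_distance("0011", "0000"))
    print(disagreement_rate({1: "0", 2: "1", 3: "0", 4: "1"},
                            {2: "1", 3: "1", 4: "1", 5: "0"}))
```

Because a fully complemented haplotype describes the same phasing, `switch_distance("1111", "0000")` is zero under this definition. Positions reported by only one tool are ignored by `disagreement_rate`, which may differ from how the study handled the extra SNVs in the WhatsHap hg19 DP1 output.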

Similar Items