GenHap: a novel computational method based on genetic algorithms for haplotype assembly

Abstract Background In order to fully characterize the genome of an individual, the reconstruction of the two distinct copies of each chromosome, called haplotypes, is essential. The computational problem of inferring the full haplotype of a cell starting from read sequencing data is known as haplot...

Full description

Bibliographic Details
Main Authors: Andrea Tangherloni, Simone Spolaor, Leonardo Rundo, Marco S. Nobile, Paolo Cazzaniga, Giancarlo Mauri, Pietro Liò, Ivan Merelli, Daniela Besozzi
Format: Article
Language:English
Published: BMC 2019-04-01
Series:BMC Bioinformatics
Subjects:
Online Access:http://link.springer.com/article/10.1186/s12859-019-2691-y
_version_ 1829102471432110080
author Andrea Tangherloni
Simone Spolaor
Leonardo Rundo
Marco S. Nobile
Paolo Cazzaniga
Giancarlo Mauri
Pietro Liò
Ivan Merelli
Daniela Besozzi
author_facet Andrea Tangherloni
Simone Spolaor
Leonardo Rundo
Marco S. Nobile
Paolo Cazzaniga
Giancarlo Mauri
Pietro Liò
Ivan Merelli
Daniela Besozzi
author_sort Andrea Tangherloni
collection DOAJ
description Abstract Background In order to fully characterize the genome of an individual, the reconstruction of the two distinct copies of each chromosome, called haplotypes, is essential. The computational problem of inferring the full haplotype of a cell starting from read sequencing data is known as haplotype assembly, and consists in assigning all heterozygous Single Nucleotide Polymorphisms (SNPs) to exactly one of the two chromosomes. Indeed, the knowledge of complete haplotypes is generally more informative than analyzing single SNPs and plays a fundamental role in many medical applications. Results To reconstruct the two haplotypes, we addressed the weighted Minimum Error Correction (wMEC) problem, which is a successful approach for haplotype assembly. This NP-hard problem consists in computing the two haplotypes that partition the sequencing reads into two disjoint sub-sets, with the least number of corrections to the SNP values. To this aim, we propose here GenHap, a novel computational method for haplotype assembly based on Genetic Algorithms, yielding optimal solutions by means of a global search process. In order to evaluate the effectiveness of our approach, we run GenHap on two synthetic (yet realistic) datasets, based on the Roche/454 and PacBio RS II sequencing technologies. We compared the performance of GenHap against HapCol, an efficient state-of-the-art algorithm for haplotype phasing. Our results show that GenHap always obtains high accuracy solutions (in terms of haplotype error rate), and is up to 4× faster than HapCol in the case of Roche/454 instances and up to 20× faster when compared on the PacBio RS II dataset. Finally, we assessed the performance of GenHap on two different real datasets. Conclusions Future-generation sequencing technologies, producing longer reads with higher coverage, can highly benefit from GenHap, thanks to its capability of efficiently solving large instances of the haplotype assembly problem. Moreover, the optimization approach proposed in GenHap can be extended to the study of allele-specific genomic features, such as expression, methylation and chromatin conformation, by exploiting multi-objective optimization techniques. The source code and the full documentation are available at the following GitHub repository: https://github.com/andrea-tango/GenHap.
first_indexed 2024-12-10T23:00:32Z
format Article
id doaj.art-3a3fab37dad84b53af785205d792e540
institution Directory Open Access Journal
issn 1471-2105
language English
last_indexed 2024-12-10T23:00:32Z
publishDate 2019-04-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj.art-3a3fab37dad84b53af785205d792e5402022-12-22T01:30:09ZengBMCBMC Bioinformatics1471-21052019-04-0120S411410.1186/s12859-019-2691-yGenHap: a novel computational method based on genetic algorithms for haplotype assemblyAndrea Tangherloni0Simone Spolaor1Leonardo Rundo2Marco S. Nobile3Paolo Cazzaniga4Giancarlo Mauri5Pietro Liò6Ivan Merelli7Daniela Besozzi8Department of Informatics, Systems and Communication (DISCo), University of Milano-BicoccaDepartment of Informatics, Systems and Communication (DISCo), University of Milano-BicoccaDepartment of Informatics, Systems and Communication (DISCo), University of Milano-BicoccaDepartment of Informatics, Systems and Communication (DISCo), University of Milano-BicoccaDepartment of Human and Social Sciences, University of BergamoDepartment of Informatics, Systems and Communication (DISCo), University of Milano-BicoccaComputer Laboratory, University of CambridgeInstitute of Biomedical Technologies, Italian National Research CouncilDepartment of Informatics, Systems and Communication (DISCo), University of Milano-BicoccaAbstract Background In order to fully characterize the genome of an individual, the reconstruction of the two distinct copies of each chromosome, called haplotypes, is essential. The computational problem of inferring the full haplotype of a cell starting from read sequencing data is known as haplotype assembly, and consists in assigning all heterozygous Single Nucleotide Polymorphisms (SNPs) to exactly one of the two chromosomes. Indeed, the knowledge of complete haplotypes is generally more informative than analyzing single SNPs and plays a fundamental role in many medical applications. Results To reconstruct the two haplotypes, we addressed the weighted Minimum Error Correction (wMEC) problem, which is a successful approach for haplotype assembly. This NP-hard problem consists in computing the two haplotypes that partition the sequencing reads into two disjoint sub-sets, with the least number of corrections to the SNP values. To this aim, we propose here GenHap, a novel computational method for haplotype assembly based on Genetic Algorithms, yielding optimal solutions by means of a global search process. In order to evaluate the effectiveness of our approach, we run GenHap on two synthetic (yet realistic) datasets, based on the Roche/454 and PacBio RS II sequencing technologies. We compared the performance of GenHap against HapCol, an efficient state-of-the-art algorithm for haplotype phasing. Our results show that GenHap always obtains high accuracy solutions (in terms of haplotype error rate), and is up to 4× faster than HapCol in the case of Roche/454 instances and up to 20× faster when compared on the PacBio RS II dataset. Finally, we assessed the performance of GenHap on two different real datasets. Conclusions Future-generation sequencing technologies, producing longer reads with higher coverage, can highly benefit from GenHap, thanks to its capability of efficiently solving large instances of the haplotype assembly problem. Moreover, the optimization approach proposed in GenHap can be extended to the study of allele-specific genomic features, such as expression, methylation and chromatin conformation, by exploiting multi-objective optimization techniques. The source code and the full documentation are available at the following GitHub repository: https://github.com/andrea-tango/GenHap.http://link.springer.com/article/10.1186/s12859-019-2691-yHaplotype assemblyFuture-generation sequencingGenetic algorithmsCombinatorial optimizationWeighted minimum error correction problem
spellingShingle Andrea Tangherloni
Simone Spolaor
Leonardo Rundo
Marco S. Nobile
Paolo Cazzaniga
Giancarlo Mauri
Pietro Liò
Ivan Merelli
Daniela Besozzi
GenHap: a novel computational method based on genetic algorithms for haplotype assembly
BMC Bioinformatics
Haplotype assembly
Future-generation sequencing
Genetic algorithms
Combinatorial optimization
Weighted minimum error correction problem
title GenHap: a novel computational method based on genetic algorithms for haplotype assembly
title_full GenHap: a novel computational method based on genetic algorithms for haplotype assembly
title_fullStr GenHap: a novel computational method based on genetic algorithms for haplotype assembly
title_full_unstemmed GenHap: a novel computational method based on genetic algorithms for haplotype assembly
title_short GenHap: a novel computational method based on genetic algorithms for haplotype assembly
title_sort genhap a novel computational method based on genetic algorithms for haplotype assembly
topic Haplotype assembly
Future-generation sequencing
Genetic algorithms
Combinatorial optimization
Weighted minimum error correction problem
url http://link.springer.com/article/10.1186/s12859-019-2691-y
work_keys_str_mv AT andreatangherloni genhapanovelcomputationalmethodbasedongeneticalgorithmsforhaplotypeassembly
AT simonespolaor genhapanovelcomputationalmethodbasedongeneticalgorithmsforhaplotypeassembly
AT leonardorundo genhapanovelcomputationalmethodbasedongeneticalgorithmsforhaplotypeassembly
AT marcosnobile genhapanovelcomputationalmethodbasedongeneticalgorithmsforhaplotypeassembly
AT paolocazzaniga genhapanovelcomputationalmethodbasedongeneticalgorithmsforhaplotypeassembly
AT giancarlomauri genhapanovelcomputationalmethodbasedongeneticalgorithmsforhaplotypeassembly
AT pietrolio genhapanovelcomputationalmethodbasedongeneticalgorithmsforhaplotypeassembly
AT ivanmerelli genhapanovelcomputationalmethodbasedongeneticalgorithmsforhaplotypeassembly
AT danielabesozzi genhapanovelcomputationalmethodbasedongeneticalgorithmsforhaplotypeassembly