PanDelos: a dictionary-based method for pan-genome content discovery

Bibliographic Details
Main Authors: Vincenzo Bonnici, Rosalba Giugno, Vincenzo Manca
Format: Article
Language: English
Published: BMC 2018-11-01
Series: BMC Bioinformatics
Online Access: http://link.springer.com/article/10.1186/s12859-018-2417-6
Collection: DOAJ (Directory of Open Access Journals)
Description:

Background: Pan-genome approaches discover homology relations in a set of genomes by determining how gene families are distributed among them. Retrieving the complete gene distribution among a class of genomes is an NP-hard problem: computational costs increase with the number of analyzed genomes, because all-against-all gene comparisons are required to solve the problem completely. For phylogenetically distant genomes, the variability introduced by gene duplication and transmission makes recognizing homologous genes even more difficult. A challenge in this field is to design fast and adaptive similarity measures that yield a suitable pan-genome structure of homology relations.

Results: We present PanDelos, a standalone tool for the discovery of pan-genome content among phylogenetically distant genomes. The methodology is based on information theory and network analysis. It is parameter-free because thresholds are automatically deduced from the context. PanDelos avoids sequence alignment by introducing a measure based on k-mer multiplicity. The k-mer length is defined according to general arguments rather than empirical considerations. Candidate homology relations are integrated into a global network, and groups of homologous genes are extracted by applying a community detection algorithm.

Conclusions: PanDelos outperforms the existing approaches Roary and EDGAR in terms of running time and quality of content discovery. Tests were run on collections of real genomes previously used in analogous studies, and on synthetic benchmarks that provide a fully trusted ground truth. The software is available at https://github.com/GiugnoLab/PanDelos.
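To make the idea of an alignment-free measure based on k-mer multiplicity concrete, the sketch below compares two sequences through their k-mer multisets using a generalized Jaccard ratio. This is only an illustration under stated assumptions: the record does not give the actual PanDelos formula, its rule for choosing k, or its thresholds, and the function names `kmer_counts` and `generalized_jaccard` are hypothetical.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Return the multiset of overlapping k-mers in seq (k-mer -> multiplicity)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def generalized_jaccard(a: str, b: str, k: int) -> float:
    """Illustrative similarity: sum of min multiplicities over sum of max.

    Assumption: this is one common multiset similarity, not necessarily
    the measure implemented in PanDelos.
    """
    ca, cb = kmer_counts(a, k), kmer_counts(b, k)
    keys = ca.keys() | cb.keys()
    shared = sum(min(ca[x], cb[x]) for x in keys)  # Counter yields 0 if absent
    total = sum(max(ca[x], cb[x]) for x in keys)
    return shared / total if total else 0.0
```

Identical sequences score 1.0 and sequences sharing no k-mer score 0.0; no alignment is computed at any point, which is why such measures scale well to all-against-all gene comparisons.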
ISSN: 1471-2105
Author Affiliation: Department of Computer Science, University of Verona (all three authors)
Volume/Issue/Pages: 19, S15, 47-59
DOI: 10.1186/s12859-018-2417-6
Subjects: Pan-genome; Distant genomes; k-mer dictionary