Comparison of gene clustering criteria reveals intrinsic uncertainty in pangenome analyses

Abstract. Background: A key step for comparative genomics is to group open reading frames into functionally and evolutionarily meaningful gene clusters. Gene clustering is complicated by intraspecific duplications and horizontal gene transfers that are frequent in prokaryotes. In consequence, gene clu...

Bibliographic Details
Main Authors: Saioa Manzano-Morales, Yang Liu, Sara González-Bodí, Jaime Huerta-Cepas, Jaime Iranzo
Format: Article
Language: English
Published: BMC 2023-10-01
Series: Genome Biology
Subjects: Pangenome; Orthology; Comparative genomics; Homology; Core gene; Accessory genome
Online Access: https://doi.org/10.1186/s13059-023-03089-3
author Saioa Manzano-Morales
Yang Liu
Sara González-Bodí
Jaime Huerta-Cepas
Jaime Iranzo
author_sort Saioa Manzano-Morales
collection DOAJ
description Abstract. Background: A key step for comparative genomics is to group open reading frames into functionally and evolutionarily meaningful gene clusters. Gene clustering is complicated by intraspecific duplications and horizontal gene transfers that are frequent in prokaryotes. In consequence, gene clustering methods must deal with a trade-off between identifying vertically transmitted representatives of multicopy gene families, which are recognizable by synteny conservation, and retrieving complete sets of species-level orthologs. We studied the implications of adopting homology, orthology, or synteny conservation as formal criteria for gene clustering by performing comparative analyses of 125 prokaryotic pangenomes.
Results: Clustering criteria affect pangenome functional characterization, core genome inference, and reconstruction of ancestral gene content to different extents. Species-wise estimates of pangenome and core genome sizes change by the same factor when using different clustering criteria, allowing robust cross-species comparisons regardless of the clustering criterion. However, cross-species comparisons of genome plasticity and functional profiles are substantially affected by inconsistencies among clustering criteria. Such inconsistencies are driven not only by mobile genetic elements, but also by genes involved in defense, secondary metabolism, and other accessory functions. In some pangenome features, the variability attributed to methodological inconsistencies can even exceed the effect sizes of ecological and phylogenetic variables.
Conclusions: Choosing an appropriate criterion for gene clustering is critical to conduct unbiased pangenome analyses. We provide practical guidelines to choose the right method depending on the research goals and the quality of genome assemblies, and a benchmarking dataset to assess the robustness and reproducibility of future comparative studies.
first_indexed 2024-03-11T12:41:05Z
format Article
id doaj.art-cae4b6a6b6944cbbacbe4470c052a65c
institution Directory of Open Access Journals
issn 1474-760X
language English
last_indexed 2024-03-11T12:41:05Z
publishDate 2023-10-01
publisher BMC
record_format Article
series Genome Biology
spelling doaj.art-cae4b6a6b6944cbbacbe4470c052a65c
2023-11-05T12:19:42Z
eng
BMC
Genome Biology
1474-760X
2023-10-01
241127
10.1186/s13059-023-03089-3
Comparison of gene clustering criteria reveals intrinsic uncertainty in pangenome analyses
Saioa Manzano-Morales (0), Yang Liu (1), Sara González-Bodí (2), Jaime Huerta-Cepas (3), Jaime Iranzo (4)
Affiliation (0-4): Centro de Biotecnología y Genómica de Plantas, Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA-CSIC)
https://doi.org/10.1186/s13059-023-03089-3
Pangenome; Orthology; Comparative genomics; Homology; Core gene; Accessory genome
title Comparison of gene clustering criteria reveals intrinsic uncertainty in pangenome analyses
topic Pangenome
Orthology
Comparative genomics
Homology
Core gene
Accessory genome
url https://doi.org/10.1186/s13059-023-03089-3