Using networks to analyze and visualize the distribution of overlapping genes in virus genomes.

Bibliographic Details
Main Authors: Laura Muñoz-Baena, Art F Y Poon
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2022-02-01
Series: PLoS Pathogens
Online Access: https://doi.org/10.1371/journal.ppat.1010331
Collection: DOAJ
Description: Gene overlap occurs when two or more genes are encoded by the same nucleotides. This phenomenon is found in all taxonomic domains, but is particularly common in viruses, where it may increase the information content of compact genomes or influence the creation of new genes. Here we report a global comparative study of overlapping open reading frames (OvRFs) of 12,609 virus reference genomes in the NCBI database. We retrieved metadata associated with all annotated open reading frames (ORFs) in each genome record to calculate the number, length, and frameshift of OvRFs. Our results show that while the number of OvRFs increases with genome length, they tend to be shorter in longer genomes. The majority of overlaps involve +2 frameshifts, predominantly found in dsDNA viruses. Antisense overlaps in which one of the ORFs was encoded in the same frame on the opposite strand (-0) tend to be longer. Next, we develop a new graph-based representation of the distribution of overlaps among the ORFs of genomes in a given virus family. In the absence of an unambiguous partition of ORFs by homology at this taxonomic level, we used an alignment-free k-mer based approach to cluster protein coding sequences by similarity. We connect these clusters with two types of directed edges to indicate (1) that constituent ORFs are adjacent in one or more genomes, and (2) that these ORFs overlap. These adjacency graphs not only provide a natural visualization scheme, but also a novel statistical framework for analyzing the effects of gene- and genome-level attributes on the frequencies of overlaps.
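Illustrative sketch (not the authors' code): the description above mentions measuring the length and frameshift of each overlap and linking ORF clusters with directed "adjacent" and "overlap" edges. The Python below shows one way this could look under stated assumptions. The `ORF` record, the cluster labels, the half-open coordinates, and the frame-offset convention (difference of codon boundaries modulo 3, with `-` marking opposite strands) are all hypothetical choices made for illustration; the article's exact definitions may differ. ORF lengths are assumed to be multiples of three.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ORF:
    genome: str   # genome accession (hypothetical identifier)
    cluster: str  # cluster label, assumed to come from a prior k-mer clustering step
    start: int    # 0-based, inclusive, forward-strand coordinate
    end: int      # 0-based, exclusive
    strand: int   # +1 or -1


def overlap_length(a: ORF, b: ORF) -> int:
    """Number of nucleotides shared by two ORFs (0 if they do not overlap)."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def frame_label(a: ORF, b: ORF) -> str:
    """Assumed convention: '+k' for same-strand pairs, '-k' for opposite strands."""
    if a.strand == b.strand:
        return f"+{(b.start - a.start) % 3}"
    # Codon boundaries of a reverse-strand ORF are anchored at its `end`.
    fwd, rev = (a, b) if a.strand == 1 else (b, a)
    return f"-{(rev.end - fwd.start) % 3}"


def build_graph(orfs):
    """Count directed cluster-to-cluster edges of two types across genomes."""
    adjacency, overlaps = Counter(), Counter()
    by_genome = {}
    for orf in orfs:
        by_genome.setdefault(orf.genome, []).append(orf)
    for genome_orfs in by_genome.values():
        ordered = sorted(genome_orfs, key=lambda o: (o.start, o.end))
        # Type 1: consecutive ORFs along the genome are adjacent.
        for left, right in zip(ordered, ordered[1:]):
            adjacency[(left.cluster, right.cluster)] += 1
        # Type 2: any pair of ORFs sharing at least one nucleotide.
        for i, left in enumerate(ordered):
            for right in ordered[i + 1:]:
                if right.start >= left.end:
                    break  # sorted by start, so no later ORF overlaps `left`
                overlaps[(left.cluster, right.cluster)] += 1
    return adjacency, overlaps


# Toy input: two made-up genomes sharing clusters A and B.
example_orfs = [
    ORF("g1", "A", 0, 300, 1),
    ORF("g1", "B", 290, 590, 1),
    ORF("g1", "C", 600, 900, 1),
    ORF("g2", "A", 0, 300, 1),
    ORF("g2", "B", 310, 610, 1),
]
adjacency_edges, overlap_edges = build_graph(example_orfs)
print(dict(adjacency_edges))
print(dict(overlap_edges))
```

Edge counts are kept as plain `Counter` objects keyed by (source cluster, target cluster), so they can be passed to any graph or regression library afterwards.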
Record ID: doaj.art-66d9c00919564226bc190594146b85c4
Institution: Directory of Open Access Journals
ISSN: 1553-7366, 1553-7374
Citation: PLoS Pathogens 18(2): e1010331 (2022-02-01)