Analysis of 3.5 million SARS-CoV-2 sequences reveals unique mutational trends with consistent nucleotide and codon frequencies

Abstract Background Since the onset of the SARS-CoV-2 pandemic, bioinformatic analyses have been performed to understand the nucleotide and synonymous codon usage features and mutational patterns of the virus. However, comparatively few have attempted to perform such analyses on a considerably large...

Bibliographic Details
Main Authors:	Sarah E. Fumagalli, Nigam H. Padhiar, Douglas Meyer, Upendra Katneni, Haim Bar, Michael DiCuccio, Anton A. Komar, Chava Kimchi-Sarfaty
Format:	Article
Language:	English
Published:	BMC 2023-02-01
Series:	Virology Journal
Subjects:	SARS-CoV-2; Nucleotide usage; Codon usage bias; Relative synonymous codon usage; Codon adaptation index and dN/dS
Online Access:	https://doi.org/10.1186/s12985-023-01982-8

_version_	1827985169337810944
author	Sarah E. Fumagalli, Nigam H. Padhiar, Douglas Meyer, Upendra Katneni, Haim Bar, Michael DiCuccio, Anton A. Komar, Chava Kimchi-Sarfaty
author_sort	Sarah E. Fumagalli
collection	DOAJ
description	Abstract
Background: Since the onset of the SARS-CoV-2 pandemic, bioinformatic analyses have been performed to understand the nucleotide and synonymous codon usage features and mutational patterns of the virus. However, comparatively few have attempted to perform such analyses on a considerably large cohort of viral genomes while organizing the plethora of available sequence data for a month-by-month analysis to observe changes over time. Here, we aimed to perform sequence composition and mutation analysis of SARS-CoV-2, separating sequences by gene, clade, and timepoints, and contrast the mutational profile of SARS-CoV-2 to other comparable RNA viruses.
Methods: Using a cleaned, filtered, and pre-aligned dataset of over 3.5 million sequences downloaded from the GISAID database, we computed nucleotide and codon usage statistics, including calculation of relative synonymous codon usage values. We then calculated codon adaptation index (CAI) changes and a nonsynonymous/synonymous mutation ratio (dN/dS) over time for our dataset. Finally, we compiled information on the types of mutations occurring for SARS-CoV-2 and other comparable RNA viruses, and generated heatmaps showing codon and nucleotide composition at high entropy positions along the Spike sequence.
Results: We show that nucleotide and codon usage metrics remain relatively consistent over the 32-month span, though there are significant differences between clades within each gene at various timepoints. CAI and dN/dS values vary substantially between different timepoints and different genes, with Spike gene on average showing both the highest CAI and dN/dS values. Mutational analysis showed that SARS-CoV-2 Spike has a higher proportion of nonsynonymous mutations than analogous genes in other RNA viruses, with nonsynonymous mutations outnumbering synonymous ones by up to 20:1. However, at several specific positions, synonymous mutations were overwhelmingly predominant.
Conclusions: Our multifaceted analysis covering both the composition and mutation signature of SARS-CoV-2 gives valuable insight into the nucleotide frequency and codon usage heterogeneity of SARS-CoV-2 over time, and its unique mutational profile compared to other RNA viruses.
first_indexed	2024-04-09T23:10:36Z
format	Article
id	doaj.art-3ad6147fc4414bfb99dc92dfecc690e9
institution	Directory Open Access Journal
issn	1743-422X
language	English
last_indexed	2024-04-09T23:10:36Z
publishDate	2023-02-01
publisher	BMC
record_format	Article
series	Virology Journal
spelling	doaj.art-3ad6147fc4414bfb99dc92dfecc690e9; 2023-03-22T10:24:57Z; eng; BMC; Virology Journal; 1743-422X; 2023-02-01; 201122; 10.1186/s12985-023-01982-8; Analysis of 3.5 million SARS-CoV-2 sequences reveals unique mutational trends with consistent nucleotide and codon frequencies; Authors: Sarah E. Fumagalli [0], Nigam H. Padhiar [1], Douglas Meyer [2], Upendra Katneni [3], Haim Bar [4], Michael DiCuccio, Anton A. Komar [5], Chava Kimchi-Sarfaty [6]; Affiliations: [0-3, 6] Hemostasis Branch, Division of Plasma Protein Therapeutics, Office of Tissues and Advanced Therapies, Center for Biologics Evaluation and Research, US Food and Drug Administration; [4] Department of Statistics, University of Connecticut; [5] Department of Biological, Geological and Environmental Sciences, Center for Gene Regulation in Health and Disease, Cleveland State University; https://doi.org/10.1186/s12985-023-01982-8; SARS-CoV-2; Nucleotide usage; Codon usage bias; Relative synonymous codon usage; Codon adaptation index and dN/dS
title	Analysis of 3.5 million SARS-CoV-2 sequences reveals unique mutational trends with consistent nucleotide and codon frequencies
topic	SARS-CoV-2; Nucleotide usage; Codon usage bias; Relative synonymous codon usage; Codon adaptation index and dN/dS
url	https://doi.org/10.1186/s12985-023-01982-8

Similar Items