Fast Phylogeny of SARS-CoV-2 by Compression

Bibliographic Details
Main Authors: Rudi L. Cilibrasi, Paul M. B. Vitányi
Format: Article
Language: English
Published: MDPI AG 2022-03-01
Series: Entropy
Subjects: compression, phylogeny, COVID-19 virus
Online Access: https://www.mdpi.com/1099-4300/24/4/439
author Rudi L. Cilibrasi
Paul M. B. Vitányi
collection DOAJ
description The compression method to assess similarity, in the sense of having a small normalized compression distance (NCD), was developed based on algorithmic information theory to quantify the similarity in files ranging from words and languages to genomes and music pieces. It has been validated on objects from different domains always using essentially the same software. We analyze the whole-genome phylogeny and taxonomy of the SARS-CoV-2 virus, which is responsible for causing the COVID-19 disease, using the alignment-free compression method to assess similarity. We compare the SARS-CoV-2 virus with a database of over 6500 viruses. The results suggest that the SARS-CoV-2 virus is closest in that database to the RaTG13 virus and rather close to the bat SARS-like coronaviruses bat-SL-CoVZXC21 and bat-SL-CoVZC45. Over 6500 viruses are identified (given by their registration code) with larger NCDs. The NCDs are compared with the NCDs between the mtDNA of familiar species. We address the question of whether pangolins are involved in the SARS-CoV-2 virus. The compression method is simpler and possibly faster than any other whole-genome method, which makes it the ideal tool to explore phylogeny. Here, we use it for the complex case of determining this similarity between the COVID-19 virus, SARS-CoV-2, and many other viruses. The resulting phylogeny and taxonomy closely resemble earlier results from alignment-based methods and a machine-learning method, providing the most compelling evidence to date for the compression method, showing that one can achieve equivalent results both simply and quickly.
first_indexed 2024-03-09T10:37:38Z
format Article
id doaj.art-4a287526467241708e4a3061a2dead8b
institution Directory Open Access Journal
issn 1099-4300
language English
last_indexed 2024-03-09T10:37:38Z
publishDate 2022-03-01
publisher MDPI AG
record_format Article
series Entropy
spelling doaj.art-4a287526467241708e4a3061a2dead8b
2023-12-01T20:49:45Z
eng
MDPI AG
Entropy
1099-4300
2022-03-01
Volume 24, Issue 4, Article 439
doi:10.3390/e24040439
Fast Phylogeny of SARS-CoV-2 by Compression
Rudi L. Cilibrasi (0)
Paul M. B. Vitányi (1)
(0) Centre for Mathematics & Computer Science CWI, Science Park 123, 1098 XG Amsterdam, The Netherlands
(1) CWI (Centrum Wiskunde & Informatica), Department of Computer Science, Faculteit Natuurwetenschappen, Wiskunde en Informatica, University of Amsterdam, Science Park 904, 1098 XH Amsterdam, The Netherlands
https://www.mdpi.com/1099-4300/24/4/439
compression
phylogeny
COVID-19 virus
title Fast Phylogeny of SARS-CoV-2 by Compression
topic compression
phylogeny
COVID-19 virus
url https://www.mdpi.com/1099-4300/24/4/439
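
The normalized compression distance (NCD) named in the description is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the compressed length of its argument. The following is a minimal sketch of that formula, not the authors' software: the choice of Python's `lzma` module as the compressor, the helper names `compressed_size` and `ncd`, and the synthetic toy sequences are all assumptions made for illustration. The record does not state which compressor the article used.

```python
import lzma
import random


def compressed_size(data: bytes) -> int:
    """Byte length of the LZMA-compressed input; stands in for C(x).

    Assumption: LZMA is used only because it ships with Python and
    handles inputs of this size well; it is not taken from the record.
    """
    return len(lzma.compress(data))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation xy
    return (cxy - min(cx, cy)) / max(cx, cy)


# Synthetic toy sequences (hypothetical data, not real genomes).
rng = random.Random(0)
base = "".join(rng.choice("ACGT") for _ in range(5000))

# A close relative: the same sequence with 50 positions re-drawn.
near_chars = list(base)
for i in rng.sample(range(len(near_chars)), 50):
    near_chars[i] = rng.choice("ACGT")
near = "".join(near_chars)

# An unrelated sequence of the same length.
unrelated = "".join(rng.choice("ACGT") for _ in range(5000))

d_near = ncd(base.encode(), near.encode())
d_far = ncd(base.encode(), unrelated.encode())
print(f"NCD(base, near)      = {d_near:.3f}")
print(f"NCD(base, unrelated) = {d_far:.3f}")
```

In this toy setting the near copy should score a clearly smaller distance than the unrelated sequence, which is the property the description relies on when it ranks viruses by their NCD to SARS-CoV-2. Real compressors only approximate the idealized distance, so values depend on the compressor and its settings.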