LINflow: a computational pipeline that combines an alignment-free with an alignment-based method to accelerate generation of similarity matrices for prokaryotic genomes

Background: Computing genomic similarity between strains is a prerequisite for genome-based prokaryotic classification and identification. Genomic similarity was first computed as Average Nucleotide Identity (ANI) values based on the alignment of genomic fragments. Since this is computationally expensive, faster and computationally cheaper alignment-free methods have been developed to estimate ANI.

Bibliographic Details
Main Authors: Long Tian, Reza Mazloom, Lenwood S. Heath, Boris A. Vinatzer
Format: Article
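Language: English
Published: PeerJ Inc. 2021-03-01
Series: PeerJ
Subjects: Prokaryotes; Genome-based taxonomy; Average nucleotide identity; Genomic similarity; Comparative genomics; Phylogenomics
Online Access: https://peerj.com/articles/10906.pdf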
author Long Tian
Reza Mazloom
Lenwood S. Heath
Boris A. Vinatzer
collection DOAJ
description Background: Computing genomic similarity between strains is a prerequisite for genome-based prokaryotic classification and identification. Genomic similarity was first computed as Average Nucleotide Identity (ANI) values based on the alignment of genomic fragments. Since this is computationally expensive, faster and computationally cheaper alignment-free methods have been developed to estimate ANI. However, these methods do not reach the level of accuracy of alignment-based methods.
Methods: Here we introduce LINflow, a computational pipeline that infers pairwise genomic similarity in a set of genomes. LINflow takes advantage of the speed of the alignment-free sourmash tool to identify the genome in a dataset that is most similar to a query genome and the precision of the alignment-based pyani software to precisely compute ANI between the query genome and the most similar genome identified by sourmash. This is repeated for each new genome that is added to a dataset. The sequentially computed ANI values are stored as Life Identification Numbers (LINs), which are then used to infer all other pairwise ANI values in the set. We tested LINflow on four sets, 484 genomes in total, and compared the needed time and the generated similarity matrices with other tools.
Results: LINflow is up to 150 times faster than pyani and pairwise ANI values generated by LINflow are highly correlated with those computed by pyani. However, because LINflow infers most pairwise ANI values instead of computing them directly, ANI values occasionally depart from the ANI values computed by pyani.
In conclusion, LINflow is a fast and memory-efficient pipeline to infer similarity among a large set of prokaryotic genomes. Its ability to quickly add new genome sequences to an already computed similarity matrix makes LINflow particularly useful for projects when new genome sequences need to be regularly added to an existing dataset.
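The workflow summarized in the description (fast search for the closest genome, one precise ANI computation, LIN assignment, inference of the remaining pairs) can be illustrated with a minimal Python sketch. This is not the LINflow code. The threshold list, the function names, the toy ANI table, and the rule that an inferred value equals the threshold of the last shared LIN position are all assumptions made for illustration, and the two callables merely stand in for sourmash and pyani without calling either tool.

```python
from typing import Callable, Dict, List, Optional, Tuple

LIN = Tuple[int, ...]

# Hypothetical ANI thresholds (percent), one per LIN position. Illustrative only.
THRESHOLDS: List[float] = [70.0, 80.0, 90.0, 95.0, 99.0, 99.9]


def shared_positions(ani: float) -> int:
    """Count the leading LIN positions two genomes share at this ANI."""
    return sum(1 for t in THRESHOLDS if ani >= t)


def assign_lin(ani_to_best: float, best_lin: LIN, existing: List[LIN]) -> LIN:
    """Derive a LIN for a new genome from the LIN of its closest relative."""
    k = shared_positions(ani_to_best)
    if k == len(THRESHOLDS):
        return best_lin  # every threshold is met: reuse the same LIN
    prefix = best_lin[:k]
    # Take the next unused value at position k among LINs with the same prefix.
    used = [lin[k] for lin in existing if lin[:k] == prefix]
    return prefix + (max(used) + 1,) + (0,) * (len(THRESHOLDS) - k - 1)


def add_genome(
    name: str,
    db: Dict[str, LIN],
    fast_similarity: Callable[[str, str], float],  # stand-in for the sourmash step
    precise_ani: Callable[[str, str], float],      # stand-in for the pyani step
) -> None:
    """Add one genome using a single precise comparison."""
    if not db:
        db[name] = (0,) * len(THRESHOLDS)
        return
    best = max(db, key=lambda other: fast_similarity(name, other))
    ani = precise_ani(name, best)
    db[name] = assign_lin(ani, db[best], list(db.values()))


def infer_ani(lin_a: LIN, lin_b: LIN) -> Optional[float]:
    """Infer ANI as the threshold of the last shared position (assumed rule)."""
    k = 0
    for x, y in zip(lin_a, lin_b):
        if x != y:
            break
        k += 1
    return THRESHOLDS[k - 1] if k else None


# Toy data: made-up ANI values for three hypothetical genomes.
TOY_ANI = {frozenset("AB"): 99.5, frozenset("AC"): 92.0, frozenset("BC"): 91.8}
precise_calls: List[Tuple[str, str]] = []


def toy_fast(a: str, b: str) -> float:
    return TOY_ANI[frozenset((a, b))]


def toy_precise(a: str, b: str) -> float:
    precise_calls.append((a, b))  # record each "expensive" comparison
    return TOY_ANI[frozenset((a, b))]


db: Dict[str, LIN] = {}
for genome in "ABC":
    add_genome(genome, db, toy_fast, toy_precise)

print(db)
print(infer_ani(db["B"], db["C"]), len(precise_calls))
```

In this toy run only two of the three pairs are compared precisely. The B-C value is inferred as 90.0 although the table holds 91.8, which mirrors the description's remark that inferred values occasionally depart from directly computed ones.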
first_indexed 2024-03-09T08:07:21Z
format Article
id doaj.art-392887551bd940b3be9d06182d1a6d60
institution Directory of Open Access Journals
issn 2167-8359
language English
last_indexed 2024-03-09T08:07:21Z
publishDate 2021-03-01
publisher PeerJ Inc.
record_format Article
series PeerJ
spelling doaj.art-392887551bd940b3be9d06182d1a6d60 | 2023-12-02T23:46:40Z | eng | PeerJ Inc. | PeerJ | 2167-8359 | 2021-03-01 | 9 | e10906 | 10.7717/peerj.10906
Long Tian (School of Plant and Environmental Sciences, Virginia Tech, Blacksburg, VA, USA)
Reza Mazloom (Department of Computer Science, Virginia Tech, Blacksburg, VA, USA)
Lenwood S. Heath (Department of Computer Science, Virginia Tech, Blacksburg, VA, USA)
Boris A. Vinatzer (School of Plant and Environmental Sciences, Virginia Tech, Blacksburg, VA, USA)
title LINflow: a computational pipeline that combines an alignment-free with an alignment-based method to accelerate generation of similarity matrices for prokaryotic genomes
topic Prokaryotes
Genome-based taxonomy
Average nucleotide identity
Genomic similarity
Comparative genomics
Phylogenomics
url https://peerj.com/articles/10906.pdf