SNPs detection by eBWT positional clustering

Abstract. Background: Sequencing technologies keep getting cheaper and faster, increasing the pressure for data structures that store raw data efficiently and, where possible, support analysis directly on it. Accordingly, there is growing interest in alignment-free and reference-free variants c...


Bibliographic Details
Main Authors: Nicola Prezza, Nadia Pisanti, Marinella Sciortino, Giovanna Rosone
Format: Article
Language: English
Published: BMC 2019-02-01
Series: Algorithms for Molecular Biology
Subjects: BWT, LCP array, SNPs, Reference-free, Assembly-free
Online Access: http://link.springer.com/article/10.1186/s13015-019-0137-8
_version_ 1811274668824330240
author Nicola Prezza
Nadia Pisanti
Marinella Sciortino
Giovanna Rosone
author_facet Nicola Prezza
Nadia Pisanti
Marinella Sciortino
Giovanna Rosone
author_sort Nicola Prezza
collection DOAJ
description Abstract. Background: Sequencing technologies keep getting cheaper and faster, increasing the pressure for data structures that store raw data efficiently and, where possible, support analysis directly on it. Accordingly, there is growing interest in alignment-free and reference-free variant-calling methods that use only (suitably indexed) raw read data. Results: We develop the positional clustering theory, which (i) describes how the extended Burrows–Wheeler Transform (eBWT) of a collection of reads tends to cluster together bases that cover the same genome position, (ii) predicts the size of such clusters, and (iii) provides an elegant and precise LCP-array-based procedure to locate such clusters in the eBWT. Based on this theory, we designed and implemented an alignment-free and reference-free SNP-calling method, and we devised a corresponding SNP-calling pipeline. Experiments on both synthetic and real data show that SNPs can be detected with a simple scan of the eBWT and LCP arrays because, in accordance with our theoretical framework, they lie within clusters in the eBWT of the reads. Finally, our tool intrinsically performs a reference-free evaluation of its accuracy by returning the coverage of each SNP. Conclusions: Based on the results of the experiments on synthetic and real data, we conclude that the positional clustering framework can be effectively used to identify SNPs, and it appears to be a promising approach for calling other types of variants directly on raw sequencing data. Availability: The software ebwt2snp is freely available for academic use at https://github.com/nicolaprezza/ebwt2snp.
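The idea of finding SNPs with "a simple scan of the eBWT and LCP arrays" can be illustrated with a toy sketch. This is not the ebwt2snp implementation: the function names `ebwt_and_lcp` and `candidate_snps`, the parameters `k_min` and `min_cov`, and the cluster rule (a maximal run of LCP values of at least `k_min`) are assumptions made for illustration, and the arrays are built by naive suffix sorting with one distinct end marker per read, which is only practical for tiny inputs.

```python
from collections import Counter


def ebwt_and_lcp(reads):
    """Naively build the eBWT and LCP array of a read collection by sorting suffixes."""
    entries = []
    for i, read in enumerate(reads):
        # Bases are tagged (1, base); the end marker (0, i) sorts before every
        # base and is distinct for each read (an assumption of this sketch).
        symbols = [(1, base) for base in read] + [(0, i)]
        for j in range(len(symbols)):
            # The eBWT symbol is the character preceding the suffix;
            # "$" stands for the end marker when the suffix is the whole read.
            preceding = read[j - 1] if j > 0 else "$"
            entries.append((symbols[j:], preceding))
    entries.sort(key=lambda entry: entry[0])
    suffixes = [suffix for suffix, _ in entries]
    ebwt = [preceding for _, preceding in entries]
    lcp = [0] * len(suffixes)
    for r in range(1, len(suffixes)):
        a, b = suffixes[r - 1], suffixes[r]
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        lcp[r] = n
    return suffixes, ebwt, lcp


def candidate_snps(reads, k_min=4, min_cov=2):
    """Scan eBWT and LCP once; report clusters holding at least two well-covered bases."""
    suffixes, ebwt, lcp = ebwt_and_lcp(reads)
    candidates = []
    r = 1
    while r < len(lcp):
        if lcp[r] < k_min:
            r += 1
            continue
        start = r - 1  # the cluster starts at the suffix just before the run
        while r < len(lcp) and lcp[r] >= k_min:
            r += 1
        end = r  # exclusive end of the cluster
        counts = Counter(b for b in ebwt[start:end] if b != "$")
        alleles = {b: c for b, c in counts.items() if c >= min_cov}
        if len(alleles) >= 2:
            shared = min(lcp[start + 1:end])
            # Right context shared by all suffixes in the cluster.
            context = "".join(base for _, base in suffixes[start][:shared])
            candidates.append((context, dict(sorted(alleles.items()))))
    return candidates


reads = ["GATTACAGGC", "GATTACAGGC", "GATTGCAGGC", "GATTGCAGGC"]
print(candidate_snps(reads))
```

In this toy input the two read variants differ at one base, and the suffixes that share the right context `CAGGC` end up adjacent in sorted order, so their preceding bases (`A` and `G`) form a single eBWT cluster. The per-base counts play the role of the coverage that the abstract says the tool returns for each SNP.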
first_indexed 2024-04-12T23:23:10Z
format Article
id doaj.art-ea5887a560054b21960a87395423e13e
institution Directory Open Access Journal
issn 1748-7188
language English
last_indexed 2024-04-12T23:23:10Z
publishDate 2019-02-01
publisher BMC
record_format Article
series Algorithms for Molecular Biology
spelling doaj.art-ea5887a560054b21960a87395423e13e 2022-12-22T03:12:28Z eng BMC Algorithms for Molecular Biology 1748-7188 2019-02-01 141113 10.1186/s13015-019-0137-8 SNPs detection by eBWT positional clustering
Nicola Prezza (Dipartimento di Informatica, University of Pisa)
Nadia Pisanti (Dipartimento di Informatica, University of Pisa)
Marinella Sciortino (Dipartimento di Matematica e Informatica, University of Palermo)
Giovanna Rosone (Dipartimento di Informatica, University of Pisa)
http://link.springer.com/article/10.1186/s13015-019-0137-8
BWT
LCP array
SNPs
Reference-free
Assembly-free
title SNPs detection by eBWT positional clustering
topic BWT
LCP array
SNPs
Reference-free
Assembly-free
url http://link.springer.com/article/10.1186/s13015-019-0137-8