SNPs detection by eBWT positional clustering
Main Authors: | Nicola Prezza, Nadia Pisanti, Marinella Sciortino, Giovanna Rosone |
---|---|
Format: | Article |
Language: | English |
Published: | BMC, 2019-02-01 |
Series: | Algorithms for Molecular Biology |
Subjects: | BWT, LCP array, SNPs, Reference-free, Assembly-free |
Online Access: | http://link.springer.com/article/10.1186/s13015-019-0137-8 |
_version_ | 1811274668824330240 |
---|---|
author | Nicola Prezza, Nadia Pisanti, Marinella Sciortino, Giovanna Rosone |
author_sort | Nicola Prezza |
collection | DOAJ |
description | Abstract. Background: Sequencing technologies keep getting cheaper and faster, putting growing pressure on data structures designed to store raw data efficiently and, possibly, to perform analysis on it. In this view, there is growing interest in alignment-free and reference-free variant calling methods that use only (suitably indexed) raw read data. Results: We develop the positional clustering theory, which (i) describes how the extended Burrows–Wheeler Transform (eBWT) of a collection of reads tends to cluster together bases that cover the same genome position, (ii) predicts the size of such clusters, and (iii) provides an elegant and precise LCP-array-based procedure to locate such clusters in the eBWT. Based on this theory, we designed and implemented an alignment-free and reference-free SNP calling method, and we devised a corresponding SNP calling pipeline. Experiments on both synthetic and real data show that SNPs can be detected with a simple scan of the eBWT and LCP arrays because, in accordance with our theoretical framework, they lie within clusters in the eBWT of the reads. Finally, our tool intrinsically performs a reference-free evaluation of its accuracy by returning the coverage of each SNP. Conclusions: Based on the results of the experiments on synthetic and real data, we conclude that the positional clustering framework can be used effectively to identify SNPs, and it appears to be a promising approach for calling other types of variants directly on raw sequencing data. Availability: The software ebwt2snp is freely available for academic use at https://github.com/nicolaprezza/ebwt2snp. |
first_indexed | 2024-04-12T23:23:10Z |
format | Article |
id | doaj.art-ea5887a560054b21960a87395423e13e |
institution | Directory Open Access Journal |
issn | 1748-7188 |
language | English |
last_indexed | 2024-04-12T23:23:10Z |
publishDate | 2019-02-01 |
publisher | BMC |
record_format | Article |
series | Algorithms for Molecular Biology |
spelling | doaj.art-ea5887a560054b21960a87395423e13e; 2022-12-22T03:12:28Z; eng; BMC; Algorithms for Molecular Biology; 1748-7188; 2019-02-01; vol. 14, no. 1, pp. 1-13; 10.1186/s13015-019-0137-8; SNPs detection by eBWT positional clustering; Nicola Prezza (Dipartimento di Informatica, University of Pisa), Nadia Pisanti (Dipartimento di Informatica, University of Pisa), Marinella Sciortino (Dipartimento di Matematica e Informatica, University of Palermo), Giovanna Rosone (Dipartimento di Informatica, University of Pisa); http://link.springer.com/article/10.1186/s13015-019-0137-8; BWT; LCP array; SNPs; Reference-free; Assembly-free |
title | SNPs detection by eBWT positional clustering |
topic | BWT; LCP array; SNPs; Reference-free; Assembly-free |
url | http://link.springer.com/article/10.1186/s13015-019-0137-8 |
work_keys_str_mv | AT nicolaprezza snpsdetectionbyebwtpositionalclustering AT nadiapisanti snpsdetectionbyebwtpositionalclustering AT marinellasciortino snpsdetectionbyebwtpositionalclustering AT giovannarosone snpsdetectionbyebwtpositionalclustering |
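
The abstract describes locating SNPs by scanning the eBWT and LCP arrays for clusters of bases that share a right context. The sketch below illustrates that idea on a toy read set. It is not the ebwt2snp implementation: it builds the sorted suffixes naively, and it delimits clusters with a plain LCP threshold `k`, which is a simplification assumed here for brevity. The function names and the parameters `k` and `min_cov` are hypothetical.

```python
from collections import Counter


def sorted_suffixes(reads):
    """Return sorted (suffix, read_index, preceding_char) triples.

    The preceding character of each suffix is its eBWT symbol.
    Naive construction, suitable for toy inputs only.
    """
    entries = []
    for idx, read in enumerate(reads):
        text = read + "$"  # end marker, smaller than A, C, G, T
        for i in range(len(text)):
            # text[i - 1] wraps to "$" for i == 0 (cyclic predecessor).
            entries.append((text[i:], idx, text[i - 1]))
    # Ties between equal suffixes are broken by read index.
    entries.sort(key=lambda e: (e[0], e[1]))
    return entries


def lcp_len(a, b):
    """Length of the common prefix of a and b, not counting end markers."""
    n = 0
    for x, y in zip(a, b):
        if x != y or x == "$":
            break
        n += 1
    return n


def find_snp_clusters(reads, k=4, min_cov=2):
    """Report clusters whose eBWT symbols contain at least two alleles.

    A cluster is a maximal run of adjacent suffixes whose LCP is >= k
    (assumed simplification). An allele counts only if it occurs at
    least min_cov times in the cluster.
    """
    entries = sorted_suffixes(reads)
    clusters = []
    start = 0
    for i in range(1, len(entries) + 1):
        # Close the cluster when the LCP with the previous suffix drops below k.
        if i == len(entries) or lcp_len(entries[i - 1][0], entries[i][0]) < k:
            counts = Counter(e[2] for e in entries[start:i] if e[2] != "$")
            alleles = {c: n for c, n in counts.items() if n >= min_cov}
            if len(alleles) >= 2:
                clusters.append((entries[start][0][:k], alleles))
            start = i
    return clusters


# Two toy haplotypes differing at one position (T vs C), two reads each.
reads = ["GATTACAGGC", "GATTACAGGC", "GATCACAGGC", "GATCACAGGC"]
print(find_snp_clusters(reads))  # [('ACAG', {'T': 2, 'C': 2})]
```

The reported pair gives the first `k` characters of the shared right context and the per-allele counts, which play the role of the per-SNP coverage mentioned in the abstract.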