Generalized enhanced suffix array construction in external memory

Abstract Background Suffix arrays, augmented by additional data structures, allow solving efficiently many string processing problems. The external memory construction of the generalized suffix array for a string collection is a fundamental task when the size of the input collection or the data stru...

Full description

Bibliographic Details
Main Authors: Felipe A. Louza, Guilherme P. Telles, Steve Hoffmann, Cristina D. A. Ciferri
Format: Article
Language:English
Published: BMC 2017-12-01
Series:Algorithms for Molecular Biology
Subjects:
Online Access:http://link.springer.com/article/10.1186/s13015-017-0117-9
_version_ 1819085003984732160
author Felipe A. Louza
Guilherme P. Telles
Steve Hoffmann
Cristina D. A. Ciferri
author_facet Felipe A. Louza
Guilherme P. Telles
Steve Hoffmann
Cristina D. A. Ciferri
author_sort Felipe A. Louza
collection DOAJ
description Abstract Background Suffix arrays, augmented by additional data structures, allow solving efficiently many string processing problems. The external memory construction of the generalized suffix array for a string collection is a fundamental task when the size of the input collection or the data structure exceeds the available internal memory. Results In this article we present and analyze $$\mathsf {eGSA}$$ eGSA [introduced in CPM (External memory generalized suffix and $$\mathsf {LCP}$$ LCP arrays construction. In: Proceedings of CPM. pp 201–10, 2013)], the first external memory algorithm to construct generalized suffix arrays augmented with the longest common prefix array for a string collection. Our algorithm relies on a combination of buffers, induced sorting and a heap to avoid direct string comparisons. We performed experiments that covered different aspects of our algorithm, including running time, efficiency, external memory access, internal phases and the influence of different optimization strategies. On real datasets of size up to 24 GB and using 2 GB of internal memory, $$\mathsf {eGSA}$$ eGSA showed a competitive performance when compared to $$\mathsf {eSAIS}$$ eSAIS and $$\mathsf {SAscan}$$ SAscan , which are efficient algorithms for a single string according to the related literature. We also show the effect of disk caching managed by the operating system on our algorithm. Conclusions The proposed algorithm was validated through performance tests using real datasets from different domains, in various combinations, and showed a competitive performance. Our algorithm can also construct the generalized Burrows-Wheeler transform of a string collection with no additional cost except by the output time.
first_indexed 2024-12-21T20:57:27Z
format Article
id doaj.art-bc8337213dc2403ea213db794ca99cb2
institution Directory Open Access Journal
issn 1748-7188
language English
last_indexed 2024-12-21T20:57:27Z
publishDate 2017-12-01
publisher BMC
record_format Article
series Algorithms for Molecular Biology
spelling doaj.art-bc8337213dc2403ea213db794ca99cb22022-12-21T18:50:33ZengBMCAlgorithms for Molecular Biology1748-71882017-12-0112111610.1186/s13015-017-0117-9Generalized enhanced suffix array construction in external memoryFelipe A. Louza0Guilherme P. Telles1Steve Hoffmann2Cristina D. A. Ciferri3Department of Computing and Mathematics, University of São PauloInstitute of Computing, University of CampinasComputational Biology, Leibniz Institute on Aging - Fritz Lipman Institute and Friedrich Schiller University JenaInstitute of Mathematics and Computer Science, University of São PauloAbstract Background Suffix arrays, augmented by additional data structures, allow solving efficiently many string processing problems. The external memory construction of the generalized suffix array for a string collection is a fundamental task when the size of the input collection or the data structure exceeds the available internal memory. Results In this article we present and analyze $$\mathsf {eGSA}$$ eGSA [introduced in CPM (External memory generalized suffix and $$\mathsf {LCP}$$ LCP arrays construction. In: Proceedings of CPM. pp 201–10, 2013)], the first external memory algorithm to construct generalized suffix arrays augmented with the longest common prefix array for a string collection. Our algorithm relies on a combination of buffers, induced sorting and a heap to avoid direct string comparisons. We performed experiments that covered different aspects of our algorithm, including running time, efficiency, external memory access, internal phases and the influence of different optimization strategies. On real datasets of size up to 24 GB and using 2 GB of internal memory, $$\mathsf {eGSA}$$ eGSA showed a competitive performance when compared to $$\mathsf {eSAIS}$$ eSAIS and $$\mathsf {SAscan}$$ SAscan , which are efficient algorithms for a single string according to the related literature. We also show the effect of disk caching managed by the operating system on our algorithm. Conclusions The proposed algorithm was validated through performance tests using real datasets from different domains, in various combinations, and showed a competitive performance. Our algorithm can also construct the generalized Burrows-Wheeler transform of a string collection with no additional cost except by the output time.http://link.springer.com/article/10.1186/s13015-017-0117-9Suffix arrayLCP arrayBurrows–Wheeler transformExternal memory algorithmsString collections
spellingShingle Felipe A. Louza
Guilherme P. Telles
Steve Hoffmann
Cristina D. A. Ciferri
Generalized enhanced suffix array construction in external memory
Algorithms for Molecular Biology
Suffix array
LCP array
Burrows–Wheeler transform
External memory algorithms
String collections
title Generalized enhanced suffix array construction in external memory
title_full Generalized enhanced suffix array construction in external memory
title_fullStr Generalized enhanced suffix array construction in external memory
title_full_unstemmed Generalized enhanced suffix array construction in external memory
title_short Generalized enhanced suffix array construction in external memory
title_sort generalized enhanced suffix array construction in external memory
topic Suffix array
LCP array
Burrows–Wheeler transform
External memory algorithms
String collections
url http://link.springer.com/article/10.1186/s13015-017-0117-9
work_keys_str_mv AT felipealouza generalizedenhancedsuffixarrayconstructioninexternalmemory
AT guilhermeptelles generalizedenhancedsuffixarrayconstructioninexternalmemory
AT stevehoffmann generalizedenhancedsuffixarrayconstructioninexternalmemory
AT cristinadaciferri generalizedenhancedsuffixarrayconstructioninexternalmemory