A fast algorithm for constructing suffix arrays for DNA alphabets

The continuous improvement of sequencing technologies has been paralleled by the development of efficient algorithms and data structures for sequencing data analysis and processing. Suffix array is one of data structures that are used to construct the Burrows-Wheeler transform (BWT) for long length...

Full description

Bibliographic Details
Main Authors: Zeinab Rabea, Sara El-Metwally, Samir Elmougy, Magdi Zakaria
Format: Article
Language:English
Published: Elsevier 2022-07-01
Series:Journal of King Saud University: Computer and Information Sciences
Subjects:
Online Access:http://www.sciencedirect.com/science/article/pii/S1319157822001434
_version_ 1811329450567008256
author Zeinab Rabea
Sara El-Metwally
Samir Elmougy
Magdi Zakaria
author_facet Zeinab Rabea
Sara El-Metwally
Samir Elmougy
Magdi Zakaria
author_sort Zeinab Rabea
collection DOAJ
description The continuous improvement of sequencing technologies has been paralleled by the development of efficient algorithms and data structures for sequencing data analysis and processing. Suffix array is one of data structures that are used to construct the Burrows-Wheeler transform (BWT) for long length genomes. Building a suffix array itself is an expensive-resource process since the computations are dominant by sorting suffixes in a lexical order. Most of the suffix array construction algorithms consider the general and integer alphabets without utilizing special cases for fixed-size ones such as DNA alphabets. In this paper, we exploit the nature of four-sized DNA alphabets and utilize their predefined lexicographical ordering in order to construct suffix arrays for genomic data correctly and efficiently. The suffix array construction algorithm for DNA alphabets is evaluated using three real data sets with different lengths ranging from small E-coli genome to long length Homo sapiens GRCh38.p13 chromosomes. For long length genomes, their corresponding sequence is divided into parts (i.e. reads) with a minimum overlap length, the suffix array is computed for each part separately, and finally all partially computed arrays are merged together into a single one. We studied the effects of varying the reads/overlap lengths on the running time of the proposed suffix array construction algorithm and conclude that the minimum overlap length should be equal to the average length of the longest common prefix between the adjacent parts.
first_indexed 2024-04-13T15:44:04Z
format Article
id doaj.art-5577105319204f16ba33e039e5b67e2b
institution Directory Open Access Journal
issn 1319-1578
language English
last_indexed 2024-04-13T15:44:04Z
publishDate 2022-07-01
publisher Elsevier
record_format Article
series Journal of King Saud University: Computer and Information Sciences
spelling doaj.art-5577105319204f16ba33e039e5b67e2b2022-12-22T02:41:03ZengElsevierJournal of King Saud University: Computer and Information Sciences1319-15782022-07-0134746594668A fast algorithm for constructing suffix arrays for DNA alphabetsZeinab Rabea0Sara El-Metwally1Samir Elmougy2Magdi Zakaria3Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura 35516, EgyptCorresponding author at: Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt.; Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura 35516, EgyptComputer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura 35516, EgyptComputer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura 35516, EgyptThe continuous improvement of sequencing technologies has been paralleled by the development of efficient algorithms and data structures for sequencing data analysis and processing. Suffix array is one of data structures that are used to construct the Burrows-Wheeler transform (BWT) for long length genomes. Building a suffix array itself is an expensive-resource process since the computations are dominant by sorting suffixes in a lexical order. Most of the suffix array construction algorithms consider the general and integer alphabets without utilizing special cases for fixed-size ones such as DNA alphabets. In this paper, we exploit the nature of four-sized DNA alphabets and utilize their predefined lexicographical ordering in order to construct suffix arrays for genomic data correctly and efficiently. The suffix array construction algorithm for DNA alphabets is evaluated using three real data sets with different lengths ranging from small E-coli genome to long length Homo sapiens GRCh38.p13 chromosomes. For long length genomes, their corresponding sequence is divided into parts (i.e. reads) with a minimum overlap length, the suffix array is computed for each part separately, and finally all partially computed arrays are merged together into a single one. We studied the effects of varying the reads/overlap lengths on the running time of the proposed suffix array construction algorithm and conclude that the minimum overlap length should be equal to the average length of the longest common prefix between the adjacent parts.http://www.sciencedirect.com/science/article/pii/S1319157822001434SuffixesSuffix arraysDNA alphabetsLongest Common PrefixBurrows-Wheeler transform
spellingShingle Zeinab Rabea
Sara El-Metwally
Samir Elmougy
Magdi Zakaria
A fast algorithm for constructing suffix arrays for DNA alphabets
Journal of King Saud University: Computer and Information Sciences
Suffixes
Suffix arrays
DNA alphabets
Longest Common Prefix
Burrows-Wheeler transform
title A fast algorithm for constructing suffix arrays for DNA alphabets
title_full A fast algorithm for constructing suffix arrays for DNA alphabets
title_fullStr A fast algorithm for constructing suffix arrays for DNA alphabets
title_full_unstemmed A fast algorithm for constructing suffix arrays for DNA alphabets
title_short A fast algorithm for constructing suffix arrays for DNA alphabets
title_sort fast algorithm for constructing suffix arrays for dna alphabets
topic Suffixes
Suffix arrays
DNA alphabets
Longest Common Prefix
Burrows-Wheeler transform
url http://www.sciencedirect.com/science/article/pii/S1319157822001434
work_keys_str_mv AT zeinabrabea afastalgorithmforconstructingsuffixarraysfordnaalphabets
AT saraelmetwally afastalgorithmforconstructingsuffixarraysfordnaalphabets
AT samirelmougy afastalgorithmforconstructingsuffixarraysfordnaalphabets
AT magdizakaria afastalgorithmforconstructingsuffixarraysfordnaalphabets
AT zeinabrabea fastalgorithmforconstructingsuffixarraysfordnaalphabets
AT saraelmetwally fastalgorithmforconstructingsuffixarraysfordnaalphabets
AT samirelmougy fastalgorithmforconstructingsuffixarraysfordnaalphabets
AT magdizakaria fastalgorithmforconstructingsuffixarraysfordnaalphabets