A fast algorithm for constructing suffix arrays for DNA alphabets

The continuous improvement of sequencing technologies has been paralleled by the development of efficient algorithms and data structures for sequencing data analysis and processing. Suffix array is one of data structures that are used to construct the Burrows-Wheeler transform (BWT) for long length...

Full description

Bibliographic Details
Main Authors:	Zeinab Rabea, Sara El-Metwally, Samir Elmougy, Magdi Zakaria
Format:	Article
Language:	English
Published:	Elsevier 2022-07-01
Series:	Journal of King Saud University: Computer and Information Sciences
Subjects:	Suffixes Suffix arrays DNA alphabets Longest Common Prefix Burrows-Wheeler transform
Online Access:	http://www.sciencedirect.com/science/article/pii/S1319157822001434

_version_	1811329450567008256
author	Zeinab Rabea Sara El-Metwally Samir Elmougy Magdi Zakaria
author_facet	Zeinab Rabea Sara El-Metwally Samir Elmougy Magdi Zakaria
author_sort	Zeinab Rabea
collection	DOAJ
description	The continuous improvement of sequencing technologies has been paralleled by the development of efficient algorithms and data structures for sequencing data analysis and processing. Suffix array is one of data structures that are used to construct the Burrows-Wheeler transform (BWT) for long length genomes. Building a suffix array itself is an expensive-resource process since the computations are dominant by sorting suffixes in a lexical order. Most of the suffix array construction algorithms consider the general and integer alphabets without utilizing special cases for fixed-size ones such as DNA alphabets. In this paper, we exploit the nature of four-sized DNA alphabets and utilize their predefined lexicographical ordering in order to construct suffix arrays for genomic data correctly and efficiently. The suffix array construction algorithm for DNA alphabets is evaluated using three real data sets with different lengths ranging from small E-coli genome to long length Homo sapiens GRCh38.p13 chromosomes. For long length genomes, their corresponding sequence is divided into parts (i.e. reads) with a minimum overlap length, the suffix array is computed for each part separately, and finally all partially computed arrays are merged together into a single one. We studied the effects of varying the reads/overlap lengths on the running time of the proposed suffix array construction algorithm and conclude that the minimum overlap length should be equal to the average length of the longest common prefix between the adjacent parts.
first_indexed	2024-04-13T15:44:04Z
format	Article
id	doaj.art-5577105319204f16ba33e039e5b67e2b
institution	Directory Open Access Journal
issn	1319-1578
language	English
last_indexed	2024-04-13T15:44:04Z
publishDate	2022-07-01
publisher	Elsevier
record_format	Article
series	Journal of King Saud University: Computer and Information Sciences
spelling	doaj.art-5577105319204f16ba33e039e5b67e2b2022-12-22T02:41:03ZengElsevierJournal of King Saud University: Computer and Information Sciences1319-15782022-07-0134746594668A fast algorithm for constructing suffix arrays for DNA alphabetsZeinab Rabea0Sara El-Metwally1Samir Elmougy2Magdi Zakaria3Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura 35516, EgyptCorresponding author at: Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt.; Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura 35516, EgyptComputer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura 35516, EgyptComputer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura 35516, EgyptThe continuous improvement of sequencing technologies has been paralleled by the development of efficient algorithms and data structures for sequencing data analysis and processing. Suffix array is one of data structures that are used to construct the Burrows-Wheeler transform (BWT) for long length genomes. Building a suffix array itself is an expensive-resource process since the computations are dominant by sorting suffixes in a lexical order. Most of the suffix array construction algorithms consider the general and integer alphabets without utilizing special cases for fixed-size ones such as DNA alphabets. In this paper, we exploit the nature of four-sized DNA alphabets and utilize their predefined lexicographical ordering in order to construct suffix arrays for genomic data correctly and efficiently. The suffix array construction algorithm for DNA alphabets is evaluated using three real data sets with different lengths ranging from small E-coli genome to long length Homo sapiens GRCh38.p13 chromosomes. For long length genomes, their corresponding sequence is divided into parts (i.e. reads) with a minimum overlap length, the suffix array is computed for each part separately, and finally all partially computed arrays are merged together into a single one. We studied the effects of varying the reads/overlap lengths on the running time of the proposed suffix array construction algorithm and conclude that the minimum overlap length should be equal to the average length of the longest common prefix between the adjacent parts.http://www.sciencedirect.com/science/article/pii/S1319157822001434SuffixesSuffix arraysDNA alphabetsLongest Common PrefixBurrows-Wheeler transform
spellingShingle	Zeinab Rabea Sara El-Metwally Samir Elmougy Magdi Zakaria A fast algorithm for constructing suffix arrays for DNA alphabets Journal of King Saud University: Computer and Information Sciences Suffixes Suffix arrays DNA alphabets Longest Common Prefix Burrows-Wheeler transform
title	A fast algorithm for constructing suffix arrays for DNA alphabets
title_full	A fast algorithm for constructing suffix arrays for DNA alphabets
title_fullStr	A fast algorithm for constructing suffix arrays for DNA alphabets
title_full_unstemmed	A fast algorithm for constructing suffix arrays for DNA alphabets
title_short	A fast algorithm for constructing suffix arrays for DNA alphabets
title_sort	fast algorithm for constructing suffix arrays for dna alphabets
topic	Suffixes Suffix arrays DNA alphabets Longest Common Prefix Burrows-Wheeler transform
url	http://www.sciencedirect.com/science/article/pii/S1319157822001434
work_keys_str_mv	AT zeinabrabea afastalgorithmforconstructingsuffixarraysfordnaalphabets AT saraelmetwally afastalgorithmforconstructingsuffixarraysfordnaalphabets AT samirelmougy afastalgorithmforconstructingsuffixarraysfordnaalphabets AT magdizakaria afastalgorithmforconstructingsuffixarraysfordnaalphabets AT zeinabrabea fastalgorithmforconstructingsuffixarraysfordnaalphabets AT saraelmetwally fastalgorithmforconstructingsuffixarraysfordnaalphabets AT samirelmougy fastalgorithmforconstructingsuffixarraysfordnaalphabets AT magdizakaria fastalgorithmforconstructingsuffixarraysfordnaalphabets

A fast algorithm for constructing suffix arrays for DNA alphabets

Similar Items