A fast algorithm for constructing suffix arrays for DNA alphabets
The continuous improvement of sequencing technologies has been paralleled by the development of efficient algorithms and data structures for sequencing data analysis and processing. Suffix array is one of data structures that are used to construct the Burrows-Wheeler transform (BWT) for long length...
Main Authors: | , , , |
---|---|
Format: | Article |
Language: | English |
Published: |
Elsevier
2022-07-01
|
Series: | Journal of King Saud University: Computer and Information Sciences |
Subjects: | |
Online Access: | http://www.sciencedirect.com/science/article/pii/S1319157822001434 |
_version_ | 1811329450567008256 |
---|---|
author | Zeinab Rabea Sara El-Metwally Samir Elmougy Magdi Zakaria |
author_facet | Zeinab Rabea Sara El-Metwally Samir Elmougy Magdi Zakaria |
author_sort | Zeinab Rabea |
collection | DOAJ |
description | The continuous improvement of sequencing technologies has been paralleled by the development of efficient algorithms and data structures for sequencing data analysis and processing. Suffix array is one of data structures that are used to construct the Burrows-Wheeler transform (BWT) for long length genomes. Building a suffix array itself is an expensive-resource process since the computations are dominant by sorting suffixes in a lexical order. Most of the suffix array construction algorithms consider the general and integer alphabets without utilizing special cases for fixed-size ones such as DNA alphabets. In this paper, we exploit the nature of four-sized DNA alphabets and utilize their predefined lexicographical ordering in order to construct suffix arrays for genomic data correctly and efficiently. The suffix array construction algorithm for DNA alphabets is evaluated using three real data sets with different lengths ranging from small E-coli genome to long length Homo sapiens GRCh38.p13 chromosomes. For long length genomes, their corresponding sequence is divided into parts (i.e. reads) with a minimum overlap length, the suffix array is computed for each part separately, and finally all partially computed arrays are merged together into a single one. We studied the effects of varying the reads/overlap lengths on the running time of the proposed suffix array construction algorithm and conclude that the minimum overlap length should be equal to the average length of the longest common prefix between the adjacent parts. |
first_indexed | 2024-04-13T15:44:04Z |
format | Article |
id | doaj.art-5577105319204f16ba33e039e5b67e2b |
institution | Directory Open Access Journal |
issn | 1319-1578 |
language | English |
last_indexed | 2024-04-13T15:44:04Z |
publishDate | 2022-07-01 |
publisher | Elsevier |
record_format | Article |
series | Journal of King Saud University: Computer and Information Sciences |
spelling | doaj.art-5577105319204f16ba33e039e5b67e2b2022-12-22T02:41:03ZengElsevierJournal of King Saud University: Computer and Information Sciences1319-15782022-07-0134746594668A fast algorithm for constructing suffix arrays for DNA alphabetsZeinab Rabea0Sara El-Metwally1Samir Elmougy2Magdi Zakaria3Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura 35516, EgyptCorresponding author at: Department of Computer Science, Faculty of Computers and Information, Mansoura University, P.O. Box: 35516, Mansoura, Egypt.; Computer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura 35516, EgyptComputer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura 35516, EgyptComputer Science Department, Faculty of Computers and Information, Mansoura University, Mansoura 35516, EgyptThe continuous improvement of sequencing technologies has been paralleled by the development of efficient algorithms and data structures for sequencing data analysis and processing. Suffix array is one of data structures that are used to construct the Burrows-Wheeler transform (BWT) for long length genomes. Building a suffix array itself is an expensive-resource process since the computations are dominant by sorting suffixes in a lexical order. Most of the suffix array construction algorithms consider the general and integer alphabets without utilizing special cases for fixed-size ones such as DNA alphabets. In this paper, we exploit the nature of four-sized DNA alphabets and utilize their predefined lexicographical ordering in order to construct suffix arrays for genomic data correctly and efficiently. The suffix array construction algorithm for DNA alphabets is evaluated using three real data sets with different lengths ranging from small E-coli genome to long length Homo sapiens GRCh38.p13 chromosomes. For long length genomes, their corresponding sequence is divided into parts (i.e. reads) with a minimum overlap length, the suffix array is computed for each part separately, and finally all partially computed arrays are merged together into a single one. We studied the effects of varying the reads/overlap lengths on the running time of the proposed suffix array construction algorithm and conclude that the minimum overlap length should be equal to the average length of the longest common prefix between the adjacent parts.http://www.sciencedirect.com/science/article/pii/S1319157822001434SuffixesSuffix arraysDNA alphabetsLongest Common PrefixBurrows-Wheeler transform |
spellingShingle | Zeinab Rabea Sara El-Metwally Samir Elmougy Magdi Zakaria A fast algorithm for constructing suffix arrays for DNA alphabets Journal of King Saud University: Computer and Information Sciences Suffixes Suffix arrays DNA alphabets Longest Common Prefix Burrows-Wheeler transform |
title | A fast algorithm for constructing suffix arrays for DNA alphabets |
title_full | A fast algorithm for constructing suffix arrays for DNA alphabets |
title_fullStr | A fast algorithm for constructing suffix arrays for DNA alphabets |
title_full_unstemmed | A fast algorithm for constructing suffix arrays for DNA alphabets |
title_short | A fast algorithm for constructing suffix arrays for DNA alphabets |
title_sort | fast algorithm for constructing suffix arrays for dna alphabets |
topic | Suffixes Suffix arrays DNA alphabets Longest Common Prefix Burrows-Wheeler transform |
url | http://www.sciencedirect.com/science/article/pii/S1319157822001434 |
work_keys_str_mv | AT zeinabrabea afastalgorithmforconstructingsuffixarraysfordnaalphabets AT saraelmetwally afastalgorithmforconstructingsuffixarraysfordnaalphabets AT samirelmougy afastalgorithmforconstructingsuffixarraysfordnaalphabets AT magdizakaria afastalgorithmforconstructingsuffixarraysfordnaalphabets AT zeinabrabea fastalgorithmforconstructingsuffixarraysfordnaalphabets AT saraelmetwally fastalgorithmforconstructingsuffixarraysfordnaalphabets AT samirelmougy fastalgorithmforconstructingsuffixarraysfordnaalphabets AT magdizakaria fastalgorithmforconstructingsuffixarraysfordnaalphabets |