Accelerated preprocessing in task of searching substrings in a string

Introduction. A rapid development of the systems such as Yandex, Google, etc., has predetermined the relevance of the task of searching substrings in a string, and approaches to its solution are actively investigated. This task is used to create database management systems that support associative s...

Full description

Bibliographic Details
Main Authors: A. V. Mazurenko, N. V. Boldyrikhin
Format: Article
Language:Russian
Published: Don State Technical University 2019-10-01
Series:Advanced Engineering Research
Subjects:
Online Access:https://www.vestnik-donstu.ru/jour/article/view/1537
_version_ 1797881294101676032
author A. V. Mazurenko
N. V. Boldyrikhin
author_facet A. V. Mazurenko
N. V. Boldyrikhin
author_sort A. V. Mazurenko
collection DOAJ
description Introduction. A rapid development of the systems such as Yandex, Google, etc., has predetermined the relevance of the task of searching substrings in a string, and approaches to its solution are actively investigated. This task is used to create database management systems that support associative search. Besides, it is applicable in solving information security issues and creating antivirus programs. Algorithms of searching substring in a string are used in signature-based discovery tasks.Materials and Methods. The solution to the problem is based on the Aho-Corasick algorithm which is a typical technique of searching substrings in a string. At the same time, a new approach regarding preprocessing is employed.Research Results. The possibility of constructing the transition function and suffix references through suffix arrays and special mappings, is shown. The relationship between the prefix tree and suffix arrays was investigated, which provided the development of a fundamentally new method of constructing the transition and error functions. The results obtained enable to substantially shorten the time intervals spent on the preelection processing of a set of pattern strings when using an integer alphabet. The paper lists eight algorithms. The developed algorithms are evaluated. The results obtained are compared to the formerly known. Two theorems and eight lemmas are proved. Two examples illustrating features of the practical application of the developed preprocessing procedure are given.Discussion and Conclusions. The preprocessing procedure proposed in this paper is based on the communication between the suffix array built on the ground of a set of pattern strings and the construction of transition and error functions at the initial stages of the Aho-Corasick algorithm. This approach differs from the traditional one and requires the use of algorithms providing a suffix array in linear time. Thus, the algorithms that enable to significantly reduce the time for preprocessing of a set of pattern strings under the condition of using a certain type of alphabet in comparison to the known approach proposed in the Aho- Corasick algorithm are described. The research results presented in the paper can be used in antivirus programs that apply searching for signatures of malicious data objects in the memory of a computer system. In addition, this approach to solving the problem on searching substrings in a string will significantly speed up the operation of database management systems using associative search.
first_indexed 2024-04-10T03:17:54Z
format Article
id doaj.art-87233a64533a4167bfbb0e109567d924
institution Directory Open Access Journal
issn 2687-1653
language Russian
last_indexed 2024-04-10T03:17:54Z
publishDate 2019-10-01
publisher Don State Technical University
record_format Article
series Advanced Engineering Research
spelling doaj.art-87233a64533a4167bfbb0e109567d9242023-03-13T07:31:28ZrusDon State Technical UniversityAdvanced Engineering Research2687-16532019-10-0119329030010.23947/1992-5980-2019-19-3-290-3001431Accelerated preprocessing in task of searching substrings in a stringA. V. Mazurenko0N. V. Boldyrikhin1ООО «ДДОС-Гвард»Донской государственный технический университетIntroduction. A rapid development of the systems such as Yandex, Google, etc., has predetermined the relevance of the task of searching substrings in a string, and approaches to its solution are actively investigated. This task is used to create database management systems that support associative search. Besides, it is applicable in solving information security issues and creating antivirus programs. Algorithms of searching substring in a string are used in signature-based discovery tasks.Materials and Methods. The solution to the problem is based on the Aho-Corasick algorithm which is a typical technique of searching substrings in a string. At the same time, a new approach regarding preprocessing is employed.Research Results. The possibility of constructing the transition function and suffix references through suffix arrays and special mappings, is shown. The relationship between the prefix tree and suffix arrays was investigated, which provided the development of a fundamentally new method of constructing the transition and error functions. The results obtained enable to substantially shorten the time intervals spent on the preelection processing of a set of pattern strings when using an integer alphabet. The paper lists eight algorithms. The developed algorithms are evaluated. The results obtained are compared to the formerly known. Two theorems and eight lemmas are proved. Two examples illustrating features of the practical application of the developed preprocessing procedure are given.Discussion and Conclusions. The preprocessing procedure proposed in this paper is based on the communication between the suffix array built on the ground of a set of pattern strings and the construction of transition and error functions at the initial stages of the Aho-Corasick algorithm. This approach differs from the traditional one and requires the use of algorithms providing a suffix array in linear time. Thus, the algorithms that enable to significantly reduce the time for preprocessing of a set of pattern strings under the condition of using a certain type of alphabet in comparison to the known approach proposed in the Aho- Corasick algorithm are described. The research results presented in the paper can be used in antivirus programs that apply searching for signatures of malicious data objects in the memory of a computer system. In addition, this approach to solving the problem on searching substrings in a string will significantly speed up the operation of database management systems using associative search.https://www.vestnik-donstu.ru/jour/article/view/1537поиск подстрокиалгоритм ахо — корасикпрефиксное деревосуффиксный массивпоиск информациифункция ошибокфункция перехода
spellingShingle A. V. Mazurenko
N. V. Boldyrikhin
Accelerated preprocessing in task of searching substrings in a string
Advanced Engineering Research
поиск подстроки
алгоритм ахо — корасик
префиксное дерево
суффиксный массив
поиск информации
функция ошибок
функция перехода
title Accelerated preprocessing in task of searching substrings in a string
title_full Accelerated preprocessing in task of searching substrings in a string
title_fullStr Accelerated preprocessing in task of searching substrings in a string
title_full_unstemmed Accelerated preprocessing in task of searching substrings in a string
title_short Accelerated preprocessing in task of searching substrings in a string
title_sort accelerated preprocessing in task of searching substrings in a string
topic поиск подстроки
алгоритм ахо — корасик
префиксное дерево
суффиксный массив
поиск информации
функция ошибок
функция перехода
url https://www.vestnik-donstu.ru/jour/article/view/1537
work_keys_str_mv AT avmazurenko acceleratedpreprocessingintaskofsearchingsubstringsinastring
AT nvboldyrikhin acceleratedpreprocessingintaskofsearchingsubstringsinastring