Development of algorithms and software for classification of nucleotide sequences

Coding and non-coding nucleotide sequences of the human reference genome have been investigated. Seven models of vectorization of nucleotide sequences based on mono-, bi-, trigram nucleotide frequencies, parameters of the category-position-frequency model, the lengths of sequences, nucleotide correl...

Full description

Bibliographic Details
Main Authors: V. R. Zakirava, D. A. Syrakvash, S. V. Hileuski, P. V. Nazarov, M. M. Yatskou
Format: Article
Language:Russian
Published: The United Institute of Informatics Problems of the National Academy of Sciences of Belarus 2019-06-01
Series:Informatika
Subjects:
Online Access:https://inf.grid.by/jour/article/view/471
Description
Summary:Coding and non-coding nucleotide sequences of the human reference genome have been investigated. Seven models of vectorization of nucleotide sequences based on mono-, bi-, trigram nucleotide frequencies, parameters of the category-position-frequency model, the lengths of sequences, nucleotide correlation factors, statistical features of coding and non-coding regions of DNA molecules were developed. The most informative features of vectorization models were determined using feature selection and classification algorithms based on the random forests and support vector machine methods. The difference between coding and non-coding fragments of nucleotide sequences was established. An error of the coding and non-coding sequences classification using the random forests method on a set of the 23 most informative features is 2,93 %.
ISSN:1816-0301