Development of algorithms and software for classification of nucleotide sequences

Coding and non-coding nucleotide sequences of the human reference genome were investigated. Seven models for vectorizing nucleotide sequences based on mono-, bi-, and trigram nucleotide frequencies, parameters of the category-position-frequency model, sequence lengths, nucleotide correl...


Bibliographic Details
Main Authors: V. R. Zakirava, D. A. Syrakvash, S. V. Hileuski, P. V. Nazarov, M. M. Yatskou
Format: Article
Language: Russian
Published: The United Institute of Informatics Problems of the National Academy of Sciences of Belarus 2019-06-01
Series: Informatika
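Subjects: DNA; exon; intron; classification; random forests; support vector machine; feature selection; R programming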
Online Access: https://inf.grid.by/jour/article/view/471
_version_ 1797877251890479104
author V. R. Zakirava
D. A. Syrakvash
S. V. Hileuski
P. V. Nazarov
M. M. Yatskou
collection DOAJ
description Coding and non-coding nucleotide sequences of the human reference genome were investigated. Seven models for vectorizing nucleotide sequences were developed, based on mono-, bi-, and trigram nucleotide frequencies, parameters of the category-position-frequency model, sequence lengths, nucleotide correlation factors, and statistical features of coding and non-coding regions of DNA molecules. The most informative features of the vectorization models were determined using feature selection and classification algorithms based on the random forest and support vector machine methods. A difference between coding and non-coding fragments of nucleotide sequences was established. The classification error for coding and non-coding sequences using the random forest method on the set of the 23 most informative features is 2.93%.
first_indexed 2024-04-10T02:14:03Z
format Article
id doaj.art-812f44bf66d14f03ac45473e97442e31
institution Directory Open Access Journal
issn 1816-0301
language Russian
last_indexed 2024-04-10T02:14:03Z
publishDate 2019-06-01
publisher The United Institute of Informatics Problems of the National Academy of Sciences of Belarus
record_format Article
series Informatika
spelling doaj.art-812f44bf66d14f03ac45473e97442e31 | 2023-03-13T08:32:23Z | rus | The United Institute of Informatics Problems of the National Academy of Sciences of Belarus | Informatika | 1816-0301 | 2019-06-01 | 162109118784 | Development of algorithms and software for classification of nucleotide sequences | V. R. Zakirava [0]; D. A. Syrakvash [1]; S. V. Hileuski [2]; P. V. Nazarov [3]; M. M. Yatskou [4] | Belarusian State University; Belarusian State University; Belarusian State University; Luxembourg Institute of Health; Belarusian State University | Coding and non-coding nucleotide sequences of the human reference genome were investigated. Seven models for vectorizing nucleotide sequences were developed, based on mono-, bi-, and trigram nucleotide frequencies, parameters of the category-position-frequency model, sequence lengths, nucleotide correlation factors, and statistical features of coding and non-coding regions of DNA molecules. The most informative features of the vectorization models were determined using feature selection and classification algorithms based on the random forest and support vector machine methods. A difference between coding and non-coding fragments of nucleotide sequences was established. The classification error for coding and non-coding sequences using the random forest method on the set of the 23 most informative features is 2.93%. | https://inf.grid.by/jour/article/view/471 | dna; exon; intron; classification; random forests; support vector machine; feature selection; r programming
title Development of algorithms and software for classification of nucleotide sequences
topic dna
exon
intron
classification
random forests
support vector machine
feature selection
r programming
url https://inf.grid.by/jour/article/view/471