Development of algorithms and software for classification of nucleotide sequences

Coding and non-coding nucleotide sequences of the human reference genome were investigated. Seven models for vectorizing nucleotide sequences, based on mono-, bi-, and trigram nucleotide frequencies, parameters of the category-position-frequency model, sequence lengths, nucleotide correl...
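
The frequency-based part of such a vectorization can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `kmer_frequencies` and `vectorize`, the use of overlapping n-grams, the A/C/G/T alphabet, and appending the sequence length as the last feature are all assumptions made for illustration, and the record does not give the exact feature definitions.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: only unambiguous bases are counted


def kmer_frequencies(seq: str, k: int) -> dict[str, float]:
    """Relative frequency of every overlapping k-gram over A, C, G, T."""
    seq = seq.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # k-grams containing other symbols (e.g. N) are left out of the total
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


def vectorize(seq: str) -> list[float]:
    """Hypothetical feature vector: mono-, bi-, trigram frequencies + length."""
    features: list[float] = []
    for k in (1, 2, 3):
        freqs = kmer_frequencies(seq, k)
        features.extend(freqs[m] for m in sorted(freqs))
    features.append(float(len(seq)))
    return features
```

Under these assumptions each sequence maps to 4 + 16 + 64 + 1 = 85 numbers, which could then be passed to a feature selection step or a classifier such as random forests.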

Bibliographic Details
Main Authors:	V. R. Zakirava, D. A. Syrakvash, S. V. Hileuski, P. V. Nazarov, M. M. Yatskou
Format:	Article
Language:	Russian
Published:	The United Institute of Informatics Problems of the National Academy of Sciences of Belarus 2019-06-01
Series:	Informatika
Subjects:	DNA; exon; intron; classification; random forests; support vector machine; feature selection; R programming
Online Access:	https://inf.grid.by/jour/article/view/471

collection	DOAJ
description	Coding and non-coding nucleotide sequences of the human reference genome were investigated. Seven models for vectorizing nucleotide sequences were developed, based on mono-, bi-, and trigram nucleotide frequencies, parameters of the category-position-frequency model, sequence lengths, nucleotide correlation factors, and statistical features of coding and non-coding regions of DNA molecules. The most informative features of the vectorization models were determined using feature selection and classification algorithms based on the random forests and support vector machine methods. A difference between coding and non-coding fragments of nucleotide sequences was established. The error in classifying coding and non-coding sequences with the random forests method on the set of the 23 most informative features is 2.93%.
first_indexed	2024-04-10T02:14:03Z
id	doaj.art-812f44bf66d14f03ac45473e97442e31
institution	Directory Open Access Journal
issn	1816-0301
last_indexed	2024-04-10T02:14:03Z
spelling	doaj.art-812f44bf66d14f03ac45473e97442e31 | 2023-03-13T08:32:23Z | rus | Informatika | 1816-0301 | 2019-06-01 | 162109118784 | V. R. Zakirava (Belarusian State University); D. A. Syrakvash (Belarusian State University); S. V. Hileuski (Belarusian State University); P. V. Nazarov (Luxembourg Institute of Health); M. M. Yatskou (Belarusian State University)

Similar Items