Development of algorithms and software for classification of nucleotide sequences
Coding and non-coding nucleotide sequences of the human reference genome have been investigated. Seven models of vectorization of nucleotide sequences based on mono-, bi-, trigram nucleotide frequencies, parameters of the category-position-frequency model, the lengths of sequences, nucleotide correl...
Main Authors: | V. R. Zakirava, D. A. Syrakvash, S. V. Hileuski, P. V. Nazarov, M. M. Yatskou |
---|---|
Format: | Article |
Language: | Russian |
Published: | The United Institute of Informatics Problems of the National Academy of Sciences of Belarus, 2019-06-01 |
Series: | Informatika |
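Subjects: | DNA, exon, intron, classification, random forests, support vector machine, feature selection, R programming |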
Online Access: | https://inf.grid.by/jour/article/view/471 |
_version_ | 1797877251890479104 |
---|---|
author | V. R. Zakirava, D. A. Syrakvash, S. V. Hileuski, P. V. Nazarov, M. M. Yatskou |
author_facet | V. R. Zakirava, D. A. Syrakvash, S. V. Hileuski, P. V. Nazarov, M. M. Yatskou |
author_sort | V. R. Zakirava |
collection | DOAJ |
description | Coding and non-coding nucleotide sequences of the human reference genome were investigated. Seven vectorization models for nucleotide sequences were developed, based on mono-, bi-, and trigram nucleotide frequencies, parameters of the category-position-frequency model, sequence lengths, nucleotide correlation factors, and statistical features of coding and non-coding regions of DNA molecules. The most informative features of the vectorization models were determined using feature selection and classification algorithms based on the random forests and support vector machine methods. A difference between coding and non-coding fragments of nucleotide sequences was established. The classification error for coding and non-coding sequences using the random forests method on the set of the 23 most informative features is 2.93%. |
first_indexed | 2024-04-10T02:14:03Z |
format | Article |
id | doaj.art-812f44bf66d14f03ac45473e97442e31 |
institution | Directory of Open Access Journals |
issn | 1816-0301 |
language | Russian |
last_indexed | 2024-04-10T02:14:03Z |
publishDate | 2019-06-01 |
publisher | The United Institute of Informatics Problems of the National Academy of Sciences of Belarus |
record_format | Article |
series | Informatika |
spelling | doaj.art-812f44bf66d14f03ac45473e97442e31 2023-03-13T08:32:23Z rus The United Institute of Informatics Problems of the National Academy of Sciences of Belarus Informatika 1816-0301 2019-06-01 162109118784 Development of algorithms and software for classification of nucleotide sequences V. R. Zakirava 0 D. A. Syrakvash 1 S. V. Hileuski 2 P. V. Nazarov 3 M. M. Yatskou 4 Belarusian State University Belarusian State University Belarusian State University Luxembourg Institute of Health Belarusian State University Coding and non-coding nucleotide sequences of the human reference genome have been investigated. Seven models of vectorization of nucleotide sequences based on mono-, bi-, trigram nucleotide frequencies, parameters of the category-position-frequency model, the lengths of sequences, nucleotide correlation factors, statistical features of coding and non-coding regions of DNA molecules were developed. The most informative features of vectorization models were determined using feature selection and classification algorithms based on the random forests and support vector machine methods. The difference between coding and non-coding fragments of nucleotide sequences was established. An error of the coding and non-coding sequences classification using the random forests method on a set of the 23 most informative features is 2.93%. https://inf.grid.by/jour/article/view/471 dna exon intron classification random forests support vector machine feature selection r programming |
title | Development of algorithms and software for classification of nucleotide sequences |
topic | dna, exon, intron, classification, random forests, support vector machine, feature selection, r programming |
url | https://inf.grid.by/jour/article/view/471 |
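
The abstract describes vectorizing nucleotide sequences by mono-, bi-, and trigram nucleotide frequencies together with sequence length. A minimal sketch of that idea follows. It is an illustration only, not the authors' implementation: the record's keywords indicate the original software was written in R, while this sketch uses Python, and the function names `kmer_frequencies` and `vectorize`, the restriction to the alphabet `ACGT`, and the feature ordering are all assumptions made here. The remaining features named in the abstract (category-position-frequency parameters, correlation factors, statistical features) are not covered.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumption: only the four canonical nucleotides are counted


def kmer_frequencies(sequence, k):
    """Return relative frequencies of all 4**k k-mers in lexicographic order."""
    seq = sequence.upper()
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Windows containing other symbols (e.g. N) are excluded from the total.
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def vectorize(sequence):
    """Hypothetical feature vector: length, then 1-, 2-, and 3-gram frequencies."""
    features = [float(len(sequence))]
    for k in (1, 2, 3):
        features.extend(kmer_frequencies(sequence, k))
    return features


vec = vectorize("ATGGCCATTGTAATGGGCCGC")
print(len(vec))  # 1 + 4 + 16 + 64 = 85 features
```

Vectors of this kind could then be passed to a feature selection step and a classifier such as random forests or a support vector machine, as the abstract outlines; that stage is omitted here because the record does not specify its parameters.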