Classification of Russian Texts by Genres Based on Modern Embeddings and Rhythm

The article investigates modern vector text models for the genre classification of Russian-language texts. The models include ELMo embeddings, a pre-trained BERT language model, and a set of numerical rhythm features based on lexico-grammatical features. The experiments were ca...

Bibliographic Details
Main Author: Ksenia Vladimirovna Lagutina
Format: Article
Language: English
Published: Yaroslavl State University 2022-12-01
Series: Моделирование и анализ информационных систем (Modeling and Analysis of Information Systems)
Subjects: stylometry; natural language processing; rhythm features; genres; text classification; bert; elmo
Online Access: https://www.mais-journal.ru/jour/article/view/1750
_version_ 1797877836819726336
author Ksenia Vladimirovna Lagutina
collection DOAJ
description The article investigates modern vector text models for the genre classification of Russian-language texts. The models include ELMo embeddings, a pre-trained BERT language model, and a set of numerical rhythm features based on lexico-grammatical features. The experiments were carried out on a corpus of 10,000 texts in five genres: novels, scientific articles, reviews, posts from the social network Vkontakte, and news from OpenCorpora. Visualization and statistical analysis of the rhythm features identified the genres with the most diverse rhythm (novels and reviews) and the genre with the least diverse rhythm (scientific articles). These same genres were subsequently classified best when rhythm features were combined with an LSTM neural network classifier. Clustering and classifying texts by genre using ELMo and BERT embeddings separated one genre from another with a small number of errors. The multiclass classification F-score reached 99%. The study confirms the effectiveness of modern embeddings in computational linguistics tasks and highlights the advantages and limitations of the set of rhythm features as applied to genre classification.
first_indexed 2024-04-10T02:24:23Z
format Article
id doaj.art-e16e980aa07b4ee3b6e024605849d21e
institution Directory of Open Access Journals
issn 1818-1015
2313-5417
language English
last_indexed 2024-04-10T02:24:23Z
publishDate 2022-12-01
publisher Yaroslavl State University
record_format Article
series Моделирование и анализ информационных систем (Modeling and Analysis of Information Systems)
spelling doaj.art-e16e980aa07b4ee3b6e024605849d21e | 2023-03-13T08:07:35Z | eng | Yaroslavl State University | Моделирование и анализ информационных систем (Modeling and Analysis of Information Systems) | 1818-1015 | 2313-5417 | 2022-12-01 | 29 | 4 | 334 | 347 | 10.18255/1818-1015-2022-4-334-347 | 1355 | Classification of Russian Texts by Genres Based on Modern Embeddings and Rhythm | Ksenia Vladimirovna Lagutina | 0 | P. G. Demidov Yaroslavl State University | https://www.mais-journal.ru/jour/article/view/1750 | stylometry | natural language processing | rhythm features | genres | text classification | bert | elmo
title Classification of Russian Texts by Genres Based on Modern Embeddings and Rhythm
topic stylometry
natural language processing
rhythm features
genres
text classification
bert
elmo
url https://www.mais-journal.ru/jour/article/view/1750