Classification of Articles from Mass Media by Categories and Relevance of the Subject Area

This study addresses the classification of news articles about P. G. Demidov Yaroslavl State University (YarSU) into four categories: “society”, “education”, “science and technologies”, and “not relevant”. The proposed approaches use the BERT neural network and machine learning methods:...

Bibliographic Details
Main Authors: Vladislav Dmitrievich Larionov, Ilya Vyacheslavovich Paramonov
Format: Article
Language: English
Published: Yaroslavl State University 2022-09-01
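Series: Моделирование и анализ информационных систем
Subjects: classification by categories; automatic text processing; subject area; Russian language; news articles
Online Access: https://www.mais-journal.ru/jour/article/view/1716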
author Vladislav Dmitrievich Larionov
Ilya Vyacheslavovich Paramonov
collection DOAJ
description This study addresses the classification of news articles about P. G. Demidov Yaroslavl State University (YarSU) into four categories: “society”, “education”, “science and technologies”, and “not relevant”. The proposed approaches use the BERT neural network and machine learning methods (SVM, Logistic Regression, K-Neighbors, Random Forest) in combination with different embedding types: Word2Vec, FastText, TF-IDF, and GPT-3. Text preprocessing approaches are also considered as a way to improve classification quality. The experiments showed that the SVM classifier with TF-IDF embedding, trained on full article texts with titles, achieved the best result: its micro-F-measure and macro-F-measure are 0.8214 and 0.8308, respectively. The BERT neural network trained on fragments of paragraphs with YarSU mentions, from which the first 128 words and the last 384 words were taken, showed comparable results, with a micro-F-measure of 0.8304 and a macro-F-measure of 0.8181. Thus, the paragraphs that mention the target organisation are sufficient to classify texts by category efficiently.
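The fragment selection (first 128 and last 384 words) and the two reported metrics (micro- and macro-averaged F-measure) can be illustrated with a short sketch. This is not the authors' code: the function names `head_tail_words` and `f_measures`, the whitespace-based word splitting, and the choice to score a category with no examples as 0 are assumptions made for illustration only.

```python
from collections import Counter

# Category names as given in the description.
CATEGORIES = ["society", "education", "science and technologies", "not relevant"]


def head_tail_words(text, head=128, tail=384):
    """Keep the first `head` and the last `tail` words of a text.

    Assumption: words are whitespace-separated; the tokenisation used in
    the paper may differ. Texts with at most head + tail words are
    returned unchanged apart from whitespace normalisation.
    """
    words = text.split()
    if len(words) <= head + tail:
        return " ".join(words)
    return " ".join(words[:head] + words[len(words) - tail:])


def f_measures(y_true, y_pred, labels=CATEGORIES):
    """Return (micro_f, macro_f) for single-label multiclass predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        # Assumption: a class with no true or predicted examples scores 0.
        return 2 * t / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F-measure.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute one F-measure per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro
```

Because the macro average weights every category equally, it can be higher or lower than the micro average depending on how errors are spread over rare and frequent categories, which is consistent with the two classifiers in the description ranking differently on the two metrics.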
first_indexed 2024-04-10T02:24:13Z
format Article
id doaj.art-a564ba9e1d6d474b9f0063632cadad0e
institution Directory of Open Access Journals
issn 1818-1015
2313-5417
language English
last_indexed 2025-03-14T08:52:12Z
publishDate 2022-09-01
publisher Yaroslavl State University
record_format Article
series Моделирование и анализ информационных систем
spelling doaj.art-a564ba9e1d6d474b9f0063632cadad0e | 2025-03-02T12:46:59Z | eng | Yaroslavl State University | Моделирование и анализ информационных систем | ISSN 1818-1015, 2313-5417 | 2022-09-01 | vol. 29, no. 3, pp. 266-279 | DOI 10.18255/1818-1015-2022-3-266-279 | 1326 | Classification of Articles from Mass Media by Categories and Relevance of the Subject Area | Vladislav Dmitrievich Larionov, Ilya Vyacheslavovich Paramonov (both P. G. Demidov Yaroslavl State University) | https://www.mais-journal.ru/jour/article/view/1716
title Classification of Articles from Mass Media by Categories and Relevance of the Subject Area
topic classification by categories
automatic text processing
subject area
Russian language
news articles
url https://www.mais-journal.ru/jour/article/view/1716