Classification of Articles from Mass Media by Categories and Relevance of the Subject Area
This study addresses the classification of news articles about P. G. Demidov Yaroslavl State University (YarSU) into 4 categories: “society”, “education”, “science and technologies”, “not relevant”. The proposed approaches are based on the BERT neural network and machine learning methods:...
Main Authors: | Vladislav Dmitrievich Larionov, Ilya Vyacheslavovich Paramonov |
---|---|
Format: | Article |
Language: | English |
Published: | Yaroslavl State University, 2022-09-01 |
Series: | Моделирование и анализ информационных систем |
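Subjects: | classification by categories, automatic text processing, subject area, Russian language, news articles |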
Online Access: | https://www.mais-journal.ru/jour/article/view/1716 |
_version_ | 1826558922841915392 |
---|---|
author | Vladislav Dmitrievich Larionov; Ilya Vyacheslavovich Paramonov |
collection | DOAJ |
description | This study addresses the classification of news articles about P. G. Demidov Yaroslavl State University (YarSU) into 4 categories: “society”, “education”, “science and technologies”, “not relevant”. The proposed approaches are based on the BERT neural network and machine learning methods (SVM, Logistic Regression, K-Neighbors, Random Forest) in combination with different embedding types: Word2Vec, FastText, TF-IDF, GPT-3. Text preprocessing approaches are also considered as a way to improve classification quality. The experiments showed that the SVM classifier with TF-IDF embedding, trained on full article texts with titles, achieved the best result. Its micro-F-measure and macro-F-measure are 0.8214 and 0.8308 respectively. The BERT neural network trained on fragments of paragraphs with YarSU mentions, from which the first 128 words and the last 384 words were taken, showed comparable results. The resulting micro-F-measure and macro-F-measure are 0.8304 and 0.8181 respectively. Thus, paragraphs that mention the target organisation are sufficient to classify texts by category effectively. |
first_indexed | 2024-04-10T02:24:13Z |
format | Article |
id | doaj.art-a564ba9e1d6d474b9f0063632cadad0e |
institution | Directory Open Access Journal |
issn | 1818-1015 2313-5417 |
language | English |
last_indexed | 2025-03-14T08:52:12Z |
publishDate | 2022-09-01 |
publisher | Yaroslavl State University |
record_format | Article |
series | Моделирование и анализ информационных систем |
spelling | doaj.art-a564ba9e1d6d474b9f0063632cadad0e; 2025-03-02T12:46:59Z; eng; Yaroslavl State University; Моделирование и анализ информационных систем; ISSN 1818-1015, 2313-5417; 2022-09-01; vol. 29, no. 3, pp. 266-279; DOI 10.18255/1818-1015-2022-3-266-279; 1326; Classification of Articles from Mass Media by Categories and Relevance of the Subject Area; Vladislav Dmitrievich Larionov (P. G. Demidov Yaroslavl State University); Ilya Vyacheslavovich Paramonov (P. G. Demidov Yaroslavl State University); https://www.mais-journal.ru/jour/article/view/1716; classification by categories, automatic text processing, subject area, Russian language, news articles |
title | Classification of Articles from Mass Media by Categories and Relevance of the Subject Area |
topic | classification by categories; automatic text processing; subject area; Russian language; news articles |
url | https://www.mais-journal.ru/jour/article/view/1716 |
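
The abstract describes feeding BERT with fragments built from paragraphs that mention YarSU, keeping the first 128 words and the last 384 words. The following Python sketch illustrates one way such a fragment could be assembled. It is not the authors' implementation: the function names, the newline-based paragraph splitting, the whitespace word splitting, and the case-insensitive substring match on a marker such as "YarSU" are all assumptions made for illustration.

```python
from typing import List, Sequence


def head_tail_truncate(words: Sequence[str], head: int = 128, tail: int = 384) -> List[str]:
    """Keep the first `head` and the last `tail` words; short inputs are returned unchanged."""
    if len(words) <= head + tail:
        return list(words)
    # Slice from an explicit start index so that tail=0 yields an empty tail.
    return list(words[:head]) + list(words[len(words) - tail:])


def select_mention_paragraphs(text: str, markers: Sequence[str]) -> List[str]:
    """Return the paragraphs that mention any marker (assumed: one paragraph per line)."""
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    lowered = [m.lower() for m in markers]
    return [p for p in paragraphs if any(m in p.lower() for m in lowered)]


def build_fragment(text: str, markers: Sequence[str], head: int = 128, tail: int = 384) -> str:
    """Join the mention paragraphs and apply head+tail truncation (assumed: whitespace words)."""
    words = " ".join(select_mention_paragraphs(text, markers)).split()
    return " ".join(head_tail_truncate(words, head, tail))
```

With the default limits the fragment holds at most 512 words. How those words map onto BERT subword tokens depends on the tokenizer, which the record does not specify.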