Using machine learning for recognition of text patterns of literary sources

Bibliographic Details
Main Authors: V.S. Tomashevskaya, Yu.V. Starichkova, D.A. Yakovlev
Format: Article
Language: English
Published: Penza State University Publishing House 2022-12-01
Series: Известия высших учебных заведений. Поволжский регион: Технические науки
Description
Summary: Background. Natural language processing technologies in artificial intelligence address problems such as machine translation, text sentiment analysis, and text classification. This article considers the application of machine learning and data mining methods to the problem of text pattern recognition. The object of the study is the types of literary sources; the subject is the classification of literary sources using machine learning methods. The purpose of the work is to compare the effectiveness of machine learning methods in the binary classification of literary sources and to identify the distinctive features inherent in each type of source. Materials and methods. Literary sources were classified using the Naive Bayes classifier and logistic regression, with the Bag of Words and TF-IDF methods used to represent the text. Results. A comparative analysis of the resulting models was carried out. The model that combines logistic regression with the Bag of Words method was the most effective. Conclusions. Logistic regression with the Bag of Words method was the most effective when working with text patterns, while stemming and lemmatization did not affect the final model efficiency indicator. The second type of literary source contains text constructions unique to it, such as “[Electronic resource]” or “date of access”, which increase the chance of correct classification.
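The comparison described in the summary can be illustrated with a minimal sketch. It is not the authors' code or data: the six training references and two test references are hypothetical, label 1 is assumed to stand for the source type that carries markers such as "[Electronic resource]", and the Bag of Words, TF-IDF, Naive Bayes and logistic regression routines are small hand-written versions using only the Python standard library. Stemming and lemmatization are left out.

```python
import math
import re
from collections import Counter

# Hypothetical toy references (not the authors' dataset).
# Label 1: entry with electronic-resource markers; label 0: entry without them.
TRAIN = [
    ("Ivanov I. I. Machine learning. Moscow: Nauka, 2019. 320 p.", 0),
    ("Petrov P. P. Text analysis methods. St. Petersburg: Piter, 2020. 210 p.", 0),
    ("Smith J. Pattern recognition. London: Springer, 2018. 415 p.", 0),
    ("Sidorov S. S. NLP basics [Electronic resource]. URL: http://example.org/nlp (date of access: 01.02.2022).", 1),
    ("Documentation [Electronic resource]. URL: http://example.org/docs (date of access: 10.03.2022).", 1),
    ("Lee K. Classifiers overview [Electronic resource]. URL: http://example.org/clf (date of access: 05.05.2021).", 1),
]
TEST = [
    ("Brown A. Data mining. New York: Wiley, 2017. 500 p.", 0),
    ("Guide to stemming [Electronic resource]. URL: http://example.org/stem (date of access: 12.12.2021).", 1),
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(docs, vocab):
    """Raw term counts per document, one column per vocabulary word."""
    index = {word: i for i, word in enumerate(vocab)}
    rows = []
    for doc in docs:
        row = [0.0] * len(vocab)
        for token, count in Counter(tokenize(doc)).items():
            if token in index:  # words unseen in training are ignored
                row[index[token]] = float(count)
        rows.append(row)
    return rows

def fit_idf(count_rows):
    """Smoothed inverse document frequency for each vocabulary column."""
    n = len(count_rows)
    return [math.log((1 + n) / (1 + sum(1 for v in col if v > 0))) + 1
            for col in zip(*count_rows)]

def tf_idf(count_rows, idf):
    return [[c * w for c, w in zip(row, idf)] for row in count_rows]

def fit_naive_bayes(X, y, alpha=1.0):
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        totals = [sum(col) + alpha for col in zip(*rows)]
        norm = sum(totals)
        model[c] = (math.log(len(rows) / len(X)), [math.log(t / norm) for t in totals])
    return model

def predict_naive_bayes(model, x):
    def score(c):
        log_prior, log_probs = model[c]
        return log_prior + sum(v * lp for v, lp in zip(x, log_probs))
    return max(model, key=score)

def fit_logistic(X, y, lr=0.1, epochs=300):
    """Binary logistic regression trained by plain stochastic gradient descent."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            z = max(-30.0, min(30.0, z))  # keep exp() in a safe range
            grad = 1.0 / (1.0 + math.exp(-z)) - target
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b

def predict_logistic(model, x):
    w, b = model
    return int(b + sum(wi * xi for wi, xi in zip(w, x)) > 0)

def accuracy(predict, model, X, y):
    return sum(predict(model, x) == t for x, t in zip(X, y)) / len(y)

def evaluate():
    """Accuracy on the toy test set for each (classifier, features) pair."""
    train_docs, y = zip(*TRAIN)
    test_docs, y_test = zip(*TEST)
    vocab = sorted({t for d in train_docs for t in tokenize(d)})
    counts, test_counts = bag_of_words(train_docs, vocab), bag_of_words(test_docs, vocab)
    idf = fit_idf(counts)
    features = {"bow": (counts, test_counts),
                "tfidf": (tf_idf(counts, idf), tf_idf(test_counts, idf))}
    results = {}
    for name, (X, X_test) in features.items():
        nb, lr = fit_naive_bayes(X, y), fit_logistic(X, y)
        results[("naive_bayes", name)] = accuracy(predict_naive_bayes, nb, X_test, y_test)
        results[("logistic_regression", name)] = accuracy(predict_logistic, lr, X_test, y_test)
    return results

if __name__ == "__main__":
    for key, value in sorted(evaluate().items()):
        print(key, value)
```

On data this small the four scores say nothing about which combination is better. The sketch only shows how the two feature representations and the two classifiers fit together, and why tokens such as "electronic", "resource", "date" and "access" act as strong indicators of one class.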
ISSN: 2072-3059