NATURAL TEXT: MATHEMATICAL METHODS OF ATTRIBUTION

The article proposes two algorithms for filtering substandard texts. The first relies on the fact that n-gram frequencies in a quality text obey Zipf's law, and that the law ceases to hold when the words of the text are rearranged. Comparison of the frequency characterist...

Bibliographic Details
Main Authors:	Vladimir V. Popov, Tatyana V. Shtelmakh
Format:	Article
Language:	English
Published:	Volgograd State University 2019-08-01
Series:	Vestnik Volgogradskogo Gosudarstvennogo Universiteta. Seriâ 2. Âzykoznanie
Subjects:	natural text; pseudo-text; text filtering; Zipf's law; n-grams; the rate of appearance of new words; “bag of words” model of the text; graph model of the text
Online Access:	https://l.jvolsu.com/index.php/en/archive-en/562-science-journal-of-volsu-linguistics-2019-vol-18-no-2/materials-and-reports/1869-popov-v-v-shtelmakh-t-v-natural-text-mathematical-methods-of-attribution
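
The two filters summarized in this record's abstract can be illustrated with a rough Python sketch. It is not the authors' implementation: the function names, the whitespace tokenization, the least-squares slope used as a stand-in for "obeys Zipf's law", and the fixed window size are all assumptions made for illustration only.

```python
import math
import random
from collections import Counter


def ngram_frequencies(words, n=2):
    """Return the n-gram counts of a word list, sorted from most to least frequent."""
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return sorted(counts.values(), reverse=True)


def zipf_slope(frequencies):
    """Least-squares slope of log(frequency) against log(rank).

    Assumption: a clearly negative slope is treated as "Zipf-like", while a
    slope near 0 means the n-grams are all about equally frequent.
    """
    if len(frequencies) < 2:
        return 0.0
    xs = [math.log(rank) for rank in range(1, len(frequencies) + 1)]
    ys = [math.log(freq) for freq in frequencies]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cov_xy / var_x


def slope_before_and_after_shuffle(words, n=2, seed=0):
    """Compare the n-gram slope of a text with that of a word-shuffled copy."""
    shuffled = list(words)
    random.Random(seed).shuffle(shuffled)  # seeded so the result is repeatable
    original_slope = zipf_slope(ngram_frequencies(words, n))
    shuffled_slope = zipf_slope(ngram_frequencies(shuffled, n))
    return original_slope, shuffled_slope


def new_word_counts(words, window=50):
    """Count previously unseen words in each consecutive window of the text."""
    seen = set()
    counts = []
    for start in range(0, len(words), window):
        new_in_window = 0
        for word in words[start:start + window]:
            if word not in seen:
                seen.add(word)
                new_in_window += 1
        counts.append(new_in_window)
    return counts
```

Under these assumptions, a caller would split a text into words, pass the list to `slope_before_and_after_shuffle`, and check whether the two slopes differ noticeably; the spread of the values returned by `new_word_counts` gives a crude measure of how uneven the appearance of new words is. Any decision thresholds would have to be chosen empirically.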

author	Vladimir V. Popov Tatyana V. Shtelmakh
collection	DOAJ
description	The article proposes two algorithms for filtering substandard texts. The first relies on the fact that n-gram frequencies in a quality text obey Zipf's law, and that the law ceases to hold when the words of the text are rearranged. Comparing the frequency characteristics of the source text with those of the text obtained by permuting its words allows conclusions to be drawn about the quality of the source text. The second algorithm calculates and compares the rate at which new words appear in good-quality texts and in randomly generated texts. In a good text this rate is, as a rule, uneven, whereas in randomly generated texts the unevenness is smoothed out, which makes it possible to detect low-quality texts. The methods for filtering substandard texts are statistical and rest on calculating various frequency characteristics of the text. Unlike the “bag of words” model, a graph model of the text, in which the vertices are words or word forms and the edges are pairs of words, and models with higher-order structures, which use the frequency characteristics of n-grams with n > 2, take into account the mutual arrangement of word pairs and word triples within a common part of the text, for example within one sentence or one n-gram.
first_indexed	2024-03-13T02:59:41Z
format	Article
id	doaj.art-6aeae743869243298ec0c5e6b13bd76f
institution	Directory Open Access Journal
issn	1998-9911 2409-1979
language	English
last_indexed	2024-03-13T02:59:41Z
publishDate	2019-08-01
publisher	Volgograd State University
record_format	Article
series	Vestnik Volgogradskogo Gosudarstvennogo Universiteta. Seriâ 2. Âzykoznanie
doi	10.15688/jvolsu2.2019.2.13
volume	18
issue	2
pages	147-158
author_orcid	Vladimir V. Popov https://orcid.org/0000-0003-0419-2874; Tatyana V. Shtelmakh https://orcid.org/0000-0002-5320-7406
author_affiliation	Volgograd State University, Volgograd, Russia
title	NATURAL TEXT: MATHEMATICAL METHODS OF ATTRIBUTION

Similar Items