Algorithm of Key Words Search Based on Graph Model of Linguistic Corpus

One of the problems of computer corpus linguistics is an automatic determination of keywords in the text. The solution is a statistical method based on calculation of various frequency characteristics of the text. In this case, the most commonly used model is a “bag of words”, which does not take int...


Bibliographic Details
Main Authors:	Elena G. Grigoryeva, Vladimir A. Klyachin, Yuriy V. Pomelnikov, Vladimir V. Popov
Format:	Article
Language:	English
Published:	Volgograd State University 2017-07-01
Series:	Vestnik Volgogradskogo Gosudarstvennogo Universiteta. Seriâ 2. Âzykoznanie
Subjects:	graph; text; word; text split; statistic measure tf-idf; key word; base form of word
Online Access:	https://l.jvolsu.com/index.php/en/component/attachments/download/1544

author	Elena G. Grigoryeva, Vladimir A. Klyachin, Yuriy V. Pomelnikov, Vladimir V. Popov
collection	DOAJ
description	One of the problems of computer corpus linguistics is the automatic determination of keywords in a text. The usual solution is a statistical method based on calculating various frequency characteristics of the text. The most commonly used model here is the “bag of words”, which does not take into account the order of words in the text. In this paper, we propose a graph model of the text that allows the frequency characteristics of words to be calculated not only within the “bag of words” model, but also with respect to the location of pairs of words in some common part of the text, for example, in one sentence. To work with this model, a software model is constructed in the form of a database schema intended for storing various statistical information about the text. Based on this data model, the article proposes an algorithm for determining the keywords of a text, implemented in the Python programming language. When analyzing a document d of a linguistic corpus D, the algorithm creates a list of about 40 words with the largest tf-idf measure and chooses from them the 20 words that are used most often in document d. These words are regarded as vertices of a graph G, and the multiplicity of the edge connecting vertices t and t′ is equal to the number of sentences in document d that contain both words. Approximately 10 vertices of the graph with the greatest degree are selected. The words corresponding to these vertices are taken as the keywords of document d.
first_indexed	2024-12-10T15:25:49Z
format	Article
id	doaj.art-38b60a87995a4264a472f11f606e4dc3
institution	Directory Open Access Journal
issn	1998-9911 2409-1979
language	English
last_indexed	2024-12-10T15:25:49Z
publishDate	2017-07-01
publisher	Volgograd State University
record_format	Article
series	Vestnik Volgogradskogo Gosudarstvennogo Universiteta. Seriâ 2. Âzykoznanie
spelling	doaj.art-38b60a87995a4264a472f11f606e4dc3; 2022-12-22T01:43:33Z; eng; Volgograd State University; Vestnik Volgogradskogo Gosudarstvennogo Universiteta. Seriâ 2. Âzykoznanie; 1998-9911; 2409-1979; 2017-07-01; 16; 2; 58; 67; 10.15688/jvolsu2.2017.2.6; Algorithm of Key Words Search Based on Graph Model of Linguistic Corpus; Elena G. Grigoryeva (Volgograd State University); Vladimir A. Klyachin (Volgograd State University); Yuriy V. Pomelnikov (Volgograd State University); Vladimir V. Popov (Volgograd State University); https://l.jvolsu.com/index.php/en/component/attachments/download/1544; graph; text; word; text split; statistic measure tf-idf; key word; base form of word
title	Algorithm of Key Words Search Based on Graph Model of Linguistic Corpus
topic	graph; text; word; text split; statistic measure tf-idf; key word; base form of word
url	https://l.jvolsu.com/index.php/en/component/attachments/download/1544
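
The keyword-selection steps summarized in the abstract (about 40 words by tf-idf, then the 20 most frequent of them, then a sentence co-occurrence graph, then about 10 vertices of greatest degree) can be illustrated with a minimal Python sketch. This is not the authors' implementation, which is not reproduced in this record. The function names, the regex-based tokenizer and sentence splitter, the particular tf-idf variant (relative term frequency times the natural log of N/df), the alphabetical tie-breaking, and the reading of "degree" as the sum of edge multiplicities are all assumptions made for illustration. The sketch also skips reduction of words to their base form and assumes the document belongs to the corpus.

```python
import math
import re
from collections import Counter
from itertools import combinations


def split_sentences(text):
    """Naive splitter (assumption: '.', '!' and '?' end a sentence)."""
    return [s for s in re.split(r"[.!?]+", text) if s.strip()]


def tokenize(text):
    """Lowercased word tokens; no reduction to base forms is attempted."""
    return re.findall(r"\w+", text.lower())


def tf_idf(doc_tokens, corpus_sets):
    """Return {term: tf-idf} for one document.

    Assumed variant: tf = count / document length, idf = ln(N / df).
    The document must be part of the corpus, so df >= 1 for every term.
    """
    counts = Counter(doc_tokens)
    n_docs = len(corpus_sets)
    scores = {}
    for term, count in counts.items():
        df = sum(1 for terms in corpus_sets if term in terms)
        scores[term] = (count / len(doc_tokens)) * math.log(n_docs / df)
    return scores


def extract_keywords(doc, corpus, n_tfidf=40, n_frequent=20, n_keys=10):
    """Hypothetical helper following the four steps named in the abstract."""
    doc_tokens = tokenize(doc)
    corpus_sets = [set(tokenize(d)) for d in corpus]
    scores = tf_idf(doc_tokens, corpus_sets)
    freq = Counter(doc_tokens)

    # Step 1: words with the largest tf-idf (ties broken alphabetically).
    by_tfidf = sorted(scores, key=lambda t: (-scores[t], t))[:n_tfidf]
    # Step 2: of those, the words used most often in the document.
    vertices = sorted(by_tfidf, key=lambda t: (-freq[t], t))[:n_frequent]

    # Step 3: edge multiplicity = number of sentences containing both words.
    vertex_set = set(vertices)
    multiplicity = Counter()
    for sentence in split_sentences(doc):
        present = sorted(vertex_set & set(tokenize(sentence)))
        for pair in combinations(present, 2):
            multiplicity[pair] += 1

    # Step 4: degree = sum of multiplicities of incident edges (assumption).
    degree = Counter({v: 0 for v in vertices})
    for (a, b), m in multiplicity.items():
        degree[a] += m
        degree[b] += m
    return sorted(vertices, key=lambda t: (-degree[t], t))[:n_keys]


if __name__ == "__main__":
    # Toy corpus, invented purely for demonstration.
    corpus = [
        "Graph models use vertices. Graph edges join vertices. "
        "Vertices and edges form a graph.",
        "The cat sat. The dog ran.",
        "A bird sang and a fish swam.",
    ]
    print(extract_keywords(corpus[0], corpus, n_keys=3))
```

The thresholds are exposed as parameters because the abstract gives them only approximately ("about 40", "approximately 10"). On a real corpus the tokenizer would need to be replaced by one that maps each word to its base form, as the record's subject terms suggest.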


Similar Items