Updating the dictionary: Semantic change identification based on change in bigrams over time

We investigate a method of updating a Danish monolingual dictionary with new semantic information on already included lemmas in a systematic way, based on the hypothesis that the variation in bigrams over time in a corpus might indicate changes in the meaning of one of the words. The method combine...

Bibliographic Details
Main Authors:	Sanni Nimb, Nicolai Hartvig Sørensen, Henrik Lorentzen
Format:	Article
Language:	English
Published:	University of Ljubljana Press (Založba Univerze v Ljubljani) 2020-08-01
Series:	Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave
Subjects:	corpus statistics; bigrams; dictionary update; semantic change; Danish
Online Access:	https://journals.uni-lj.si/slovenscina2/article/view/9142

_version_	1828060175838216192
author	Sanni Nimb; Nicolai Hartvig Sørensen; Henrik Lorentzen
author_facet	Sanni Nimb; Nicolai Hartvig Sørensen; Henrik Lorentzen
author_sort	Sanni Nimb
collection	DOAJ
description	We investigate a method of systematically updating a Danish monolingual dictionary with new semantic information on already included lemmas, based on the hypothesis that variation in bigrams over time in a corpus might indicate changes in the meaning of one of the words. The method combines corpus statistics with manual annotations. The first step measures collocational change in a homogeneous newswire corpus with texts from a 14-year time span, 2005 through 2018, by calculating all the statistically significant bigrams. These are then applied to a new version of the corpus that is split into one sub-corpus per year. We then collect all the bigrams that do not appear at all in the first three years but appear at least 20 times in the following 11 years. The output, a dataset of 745 bigrams considered to be potentially new in Danish, is double-annotated and, depending on the annotations and the inter-annotator agreement, either discarded or divided into groups of relevant data for further investigation. We then carry out a more thorough lexicographical study of the bigrams in order to determine the degree to which they support the identification of new senses and lead to revised sense inventories for at least one of the words. Furthermore, we study the relation between the revisions carried out, the annotation values, and the degree of inter-annotator agreement. Finally, we compare the resulting updates of the dictionary with Cook et al. (2013) and discuss whether the method might lead to a more consistent way of revising and updating the dictionary in the future.
first_indexed	2024-04-10T21:51:50Z
format	Article
id	doaj.art-0b3fb88b67d442089e2bdcc6db989676
institution	Directory of Open Access Journals
issn	2335-2736
language	English
last_indexed	2024-04-10T21:51:50Z
publishDate	2020-08-01
publisher	University of Ljubljana Press (Založba Univerze v Ljubljani)
record_format	Article
series	Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave
spelling	doaj.art-0b3fb88b67d442089e2bdcc6db989676 | 2023-01-18T12:32:38Z | eng | University of Ljubljana Press (Založba Univerze v Ljubljani) | Slovenščina 2.0: Empirične, aplikativne in interdisciplinarne raziskave | 2335-2736 | 2020-08-01 | 8 | 2 | 10.4312/slo2.0.2020.2.112-138 | Updating the dictionary: Semantic change identification based on change in bigrams over time | Sanni Nimb; Nicolai Hartvig Sørensen; Henrik Lorentzen (all: Society for Danish Language and Literature, Copenhagen, Denmark) | https://journals.uni-lj.si/slovenscina2/article/view/9142 | corpus statistics; bigrams; dictionary update; semantic change; Danish
title	Updating the dictionary: Semantic change identification based on change in bigrams over time
topic	corpus statistics; bigrams; dictionary update; semantic change; Danish
url	https://journals.uni-lj.si/slovenscina2/article/view/9142
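
The filtering step in the description (bigrams absent from the first three years but occurring at least 20 times in the following 11 years) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name, the data layout (one frequency counter per year), and the toy bigrams and counts are all assumptions made for illustration, and the sketch presumes the statistically significant bigrams have already been selected.

```python
from collections import Counter


def find_emerging_bigrams(yearly_counts, baseline_years=3, min_later_count=20):
    """Return bigrams unseen in the first `baseline_years` years whose
    summed frequency in the remaining years is at least `min_later_count`.

    yearly_counts: dict mapping year -> Counter of bigram -> frequency
    (hypothetical layout; one Counter per yearly sub-corpus).
    """
    ordered = sorted(yearly_counts)
    baseline, later = ordered[:baseline_years], ordered[baseline_years:]

    # Every bigram attested at least once in the baseline period.
    seen_early = set()
    for year in baseline:
        seen_early.update(b for b, n in yearly_counts[year].items() if n > 0)

    # Summed frequency of each bigram over the later period.
    later_totals = Counter()
    for year in later:
        later_totals.update(yearly_counts[year])

    return {
        bigram: total
        for bigram, total in later_totals.items()
        if bigram not in seen_early and total >= min_later_count
    }


# Toy data, invented for illustration only (not taken from the study).
yearly = {year: Counter() for year in range(2005, 2019)}
for year in range(2010, 2015):
    yearly[year][("grøn", "omstilling")] = 5   # 25 later, none early: kept
yearly[2006][("ny", "aftale")] = 1              # attested early: dropped
yearly[2012][("ny", "aftale")] = 30
yearly[2016][("digital", "post")] = 10          # only 10 later: dropped

emerging = find_emerging_bigrams(yearly)
print(emerging)
```

The thresholds are parameters so that a different baseline length or minimum frequency can be tried without changing the logic.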


Similar Items