ShotgunWSD 2.0: An Improved Algorithm for Global Word Sense Disambiguation

ShotgunWSD is a recent unsupervised and knowledge-based algorithm for global word sense disambiguation (WSD). The algorithm is inspired by the Shotgun sequencing technique, a widely used whole-genome sequencing approach. ShotgunWSD performs WSD at the document level in three phases....

Bibliographic Details
Main Authors: Andrei M. Butnaru, Radu Tudor Ionescu
Format: Article
Language: English
Published: IEEE 2019-01-01
Series: IEEE Access
Subjects: Word sense disambiguation; shotgun sequencing; word embeddings; outlier removal
Online Access: https://ieeexplore.ieee.org/document/8817973/
author Andrei M. Butnaru
Radu Tudor Ionescu
collection DOAJ
description ShotgunWSD is a recent unsupervised and knowledge-based algorithm for global word sense disambiguation (WSD). The algorithm is inspired by the Shotgun sequencing technique, a widely used whole-genome sequencing approach. ShotgunWSD performs WSD at the document level in three phases. The first phase applies a brute-force WSD algorithm to short context windows selected from the document, in order to generate a short list of likely sense configurations for each window. The second phase assembles the local sense configurations into longer composite configurations by prefix and suffix matching. In the third phase, the resulting configurations are ranked by their length, and the sense of each word is chosen by a majority voting scheme that considers only the top configurations in which the respective word appears. In this paper, we present an improved version (2.0) of ShotgunWSD, which is based on a different approach for computing the relatedness score between two word senses, a step that lies at the core of building better local sense configurations. For each sense, we collect all the words from the corresponding WordNet synset, gloss and related synsets into a sense bag. We embed the collected words from all the sense bags in the entire document into a vector space using a common word embedding framework. The word vectors are then clustered using k-means to form clusters of semantically related words. At this stage, we consider that clusters with fewer samples than a given threshold represent outliers, and we eliminate these clusters altogether. Words from the eliminated clusters are also removed from every sense bag. Finally, we compute the median of all the remaining word embeddings in a given sense bag to obtain a sense embedding for the corresponding word sense.
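The sense-embedding steps described above (cluster the sense-bag words, drop small clusters as outliers, take the median of what remains) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the farthest-first centroid initialisation, and the component-wise reading of "median" are all assumptions made for illustration, and `word_vectors` stands in for whatever word embedding framework is used.

```python
from statistics import median


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; returns one cluster label per point.

    Assumption: centroids are seeded farthest-first so the sketch is
    deterministic. The abstract does not say how k-means is initialised.
    """
    centroids = [points[0]]
    while len(centroids) < k:
        # Pick the point farthest from all centroids chosen so far.
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    labels = []
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of its members (unchanged if empty).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def sense_embeddings(sense_bags, word_vectors, k, min_cluster_size):
    """sense_bags: {sense_id: [words]}; word_vectors: {word: [floats]}."""
    # Cluster every distinct in-vocabulary word collected across all sense bags.
    vocab = sorted(
        {w for bag in sense_bags.values() for w in bag if w in word_vectors}
    )
    labels = kmeans([word_vectors[w] for w in vocab], k)
    sizes = {c: labels.count(c) for c in set(labels)}
    # Words in clusters smaller than the threshold are treated as outliers.
    kept = {w for w, lab in zip(vocab, labels) if sizes[lab] >= min_cluster_size}
    embeddings = {}
    for sense, bag in sense_bags.items():
        vectors = [word_vectors[w] for w in bag if w in kept]
        if vectors:
            # Assumption: "median" is taken per vector component.
            embeddings[sense] = [median(col) for col in zip(*vectors)]
    return embeddings
```

Calling `sense_embeddings(sense_bags, word_vectors, k=3, min_cluster_size=2)` on a toy document returns one vector per sense, built only from words whose cluster met the size threshold. A sense whose bag is emptied by the outlier filter gets no embedding in this sketch.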
We compare the improved ShotgunWSD algorithm (version 2.0) with its previous version (1.0) as well as several state-of-the-art unsupervised WSD algorithms on six benchmarks: SemEval 2007, Senseval-2, Senseval-3, SemEval 2013, SemEval 2015, and overall (unified). We demonstrate that ShotgunWSD 2.0 yields better performance than ShotgunWSD 1.0 and some other recent unsupervised or knowledge-based approaches. We also performed paired McNemar's significance tests, showing that the improvements of ShotgunWSD 2.0 over ShotgunWSD 1.0 are in most cases statistically significant at a significance level of 0.01.
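For readers unfamiliar with the paired McNemar's test mentioned above, the sketch below shows how such a comparison is typically computed from per-instance correctness of two systems. It is an illustration only. The function name `mcnemar` and the use of the continuity-corrected chi-square form are assumptions, since the abstract does not state which variant of the test was applied.

```python
import math


def mcnemar(correct_a, correct_b):
    """Return (statistic, p_value) for two paired lists of booleans.

    correct_a[i] and correct_b[i] say whether system A and system B
    disambiguated test instance i correctly.
    """
    # Only discordant pairs carry information about the difference.
    b = sum(1 for a, other in zip(correct_a, correct_b) if a and not other)
    c = sum(1 for a, other in zip(correct_a, correct_b) if not a and other)
    if b + c == 0:
        return 0.0, 1.0  # the two systems never disagree
    # Assumption: chi-square form with continuity correction.
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value
```

A difference would be called significant at the 0.01 level when the returned `p_value` is below 0.01. For small discordant counts an exact binomial version of the test is usually preferred over this approximation.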
first_indexed 2024-12-16T20:10:41Z
format Article
id doaj.art-2d2394873b214c4683f6205f5f9427c0
institution Directory Open Access Journal
issn 2169-3536
language English
last_indexed 2024-12-16T20:10:41Z
publishDate 2019-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-2d2394873b214c4683f6205f5f9427c0
2022-12-21T22:18:08Z
eng
IEEE
IEEE Access
2169-3536
2019-01-01
7
120961
120975
10.1109/ACCESS.2019.2938058
8817973
ShotgunWSD 2.0: An Improved Algorithm for Global Word Sense Disambiguation
Andrei M. Butnaru (0)
Radu Tudor Ionescu (1) https://orcid.org/0000-0002-9301-1950
Faculty of Mathematics and Computer Science, University of Bucharest, Bucharest, Romania
Faculty of Mathematics and Computer Science, University of Bucharest, Bucharest, Romania
https://ieeexplore.ieee.org/document/8817973/
Word sense disambiguation
shotgun sequencing
word embeddings
outlier removal
title ShotgunWSD 2.0: An Improved Algorithm for Global Word Sense Disambiguation
topic Word sense disambiguation
shotgun sequencing
word embeddings
outlier removal
url https://ieeexplore.ieee.org/document/8817973/