ShotgunWSD 2.0: An Improved Algorithm for Global Word Sense Disambiguation
ShotgunWSD is a recent unsupervised and knowledge-based algorithm for global word sense disambiguation (WSD). The algorithm is inspired by the Shotgun sequencing technique, which is a broadly used whole-genome sequencing approach. ShotgunWSD performs WSD at the document level based on three phases…
Main Authors: | Andrei M. Butnaru, Radu Tudor Ionescu |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2019-01-01 |
Series: | IEEE Access |
Subjects: | Word sense disambiguation; shotgun sequencing; word embeddings; outlier removal |
Online Access: | https://ieeexplore.ieee.org/document/8817973/ |
_version_ | 1823925570180743168 |
---|---|
author | Andrei M. Butnaru; Radu Tudor Ionescu |
author_facet | Andrei M. Butnaru; Radu Tudor Ionescu |
author_sort | Andrei M. Butnaru |
collection | DOAJ |
description | ShotgunWSD is a recent unsupervised and knowledge-based algorithm for global word sense disambiguation (WSD). The algorithm is inspired by the Shotgun sequencing technique, which is a broadly used whole-genome sequencing approach. ShotgunWSD performs WSD at the document level based on three phases. The first phase consists of applying a brute-force WSD algorithm on short context windows selected from the document, in order to generate a short list of likely sense configurations for each window. The second phase consists of assembling the local sense configurations into longer composite configurations by prefix and suffix matching. In the third phase, the resulting configurations are ranked by their length, and the sense of each word is chosen based on a majority voting scheme that considers only the top configurations in which the respective word appears. In this paper, we present an improved version (2.0) of ShotgunWSD which is based on a different approach for computing the relatedness score between two word senses, a step that lies at the core of building better local sense configurations. For each sense, we collect all the words from the corresponding WordNet synset, gloss, and related synsets into a sense bag. We embed the collected words from all the sense bags in the entire document into a vector space using a common word embedding framework. The word vectors are then clustered using k-means to form clusters of semantically related words. At this stage, we consider that clusters with fewer samples (with respect to a given threshold) represent outliers, and we eliminate these clusters altogether. Words from the eliminated clusters are also removed from every sense bag. Finally, we compute the median of all the remaining word embeddings in a given sense bag to obtain a sense embedding for the corresponding word sense. 
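The sense-embedding step of the description above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `embed` is a hypothetical word-to-vector lookup (e.g. a pretrained word2vec table), the k-means is a plain Lloyd's loop with deterministic initialization, and `min_cluster_size` plays the role of the outlier threshold mentioned in the abstract.

```python
import numpy as np

def sense_embedding(sense_bags, embed, k=3, min_cluster_size=2, iters=20):
    """Sketch of the ShotgunWSD 2.0 sense-embedding step: embed every word
    from all sense bags, cluster the vectors with k-means, discard clusters
    smaller than a threshold as outliers, then take the per-dimension median
    of each bag's surviving vectors as that sense's embedding."""
    words = sorted({w for bag in sense_bags for w in bag})
    X = np.array([embed[w] for w in words], dtype=float)
    # Plain Lloyd's k-means, deterministically initialized on the first k points.
    centers = X[:min(k, len(X))].copy()
    for _ in range(iters):
        dists = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(len(centers)):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    # Clusters with fewer members than the threshold are treated as outliers.
    sizes = np.bincount(labels, minlength=len(centers))
    kept = {w for w, l in zip(words, labels) if sizes[l] >= min_cluster_size}
    # Sense embedding = component-wise median of each bag's remaining vectors.
    out = []
    for bag in sense_bags:
        vecs = [embed[w] for w in bag if w in kept]
        out.append(np.median(np.array(vecs, dtype=float), axis=0) if vecs else None)
    return out
```

With a toy embedding table, a lone word far from every cluster ends up in a singleton cluster and is dropped from all sense bags before the median is taken.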
We compare the improved ShotgunWSD algorithm (version 2.0) with its previous version (1.0), as well as several state-of-the-art unsupervised WSD algorithms, on six benchmarks: SemEval 2007, Senseval-2, Senseval-3, SemEval 2013, SemEval 2015, and overall (unified). We demonstrate that ShotgunWSD 2.0 yields better performance than ShotgunWSD 1.0 and some other recent unsupervised or knowledge-based approaches. We also perform paired McNemar's significance tests, showing that the improvements of ShotgunWSD 2.0 over ShotgunWSD 1.0 are in most cases statistically significant at a significance level of 0.01. |
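The assembly and voting phases (phases two and three of the description) can likewise be illustrated with a small sketch. This is a toy rendering under assumed data shapes, not the published code: a window configuration is taken to be a list of `(word, sense)` picks over consecutive positions, two configurations are spliced when a suffix of one equals a prefix of the other, and the final sense per word is a majority vote over the top-ranked composite configurations.

```python
from collections import Counter

def merge_configs(a, b, min_overlap=1):
    """Splice two window configurations when a suffix of `a` matches a
    prefix of `b` (phase two: prefix/suffix assembly). Returns the longer
    composite configuration, or None when no overlap is found."""
    for n in range(min(len(a), len(b)), min_overlap - 1, -1):
        if a[-n:] == b[:n]:
            return a + b[n:]
    return None

def vote_senses(configs, top=5):
    """Phase three: rank configurations by length and choose each word's
    sense by majority vote over the top configurations containing it."""
    ranked = sorted(configs, key=len, reverse=True)[:top]
    votes = {}
    for cfg in ranked:
        for word, sense in cfg:
            votes.setdefault(word, Counter())[sense] += 1
    return {w: c.most_common(1)[0][0] for w, c in votes.items()}
```

Longer composites dominate the ranking, so a sense pick that survives across overlapping windows tends to win the vote for its word.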
first_indexed | 2024-12-16T20:10:41Z |
format | Article |
id | doaj.art-2d2394873b214c4683f6205f5f9427c0 |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-12-16T20:10:41Z |
publishDate | 2019-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-2d2394873b214c4683f6205f5f9427c0 | 2022-12-21T22:18:08Z | eng | IEEE | IEEE Access | 2169-3536 | 2019-01-01 | vol. 7, pp. 120961-120975 | doi:10.1109/ACCESS.2019.2938058 | IEEE document 8817973 | ShotgunWSD 2.0: An Improved Algorithm for Global Word Sense Disambiguation | Andrei M. Butnaru; Radu Tudor Ionescu (https://orcid.org/0000-0002-9301-1950) | Faculty of Mathematics and Computer Science, University of Bucharest, Bucharest, Romania | https://ieeexplore.ieee.org/document/8817973/ | Word sense disambiguation; shotgun sequencing; word embeddings; outlier removal |
spellingShingle | Andrei M. Butnaru; Radu Tudor Ionescu | ShotgunWSD 2.0: An Improved Algorithm for Global Word Sense Disambiguation | IEEE Access | Word sense disambiguation; shotgun sequencing; word embeddings; outlier removal |
title | ShotgunWSD 2.0: An Improved Algorithm for Global Word Sense Disambiguation |
title_full | ShotgunWSD 2.0: An Improved Algorithm for Global Word Sense Disambiguation |
title_fullStr | ShotgunWSD 2.0: An Improved Algorithm for Global Word Sense Disambiguation |
title_full_unstemmed | ShotgunWSD 2.0: An Improved Algorithm for Global Word Sense Disambiguation |
title_short | ShotgunWSD 2.0: An Improved Algorithm for Global Word Sense Disambiguation |
title_sort | shotgunwsd 2 0 an improved algorithm for global word sense disambiguation |
topic | Word sense disambiguation; shotgun sequencing; word embeddings; outlier removal |
url | https://ieeexplore.ieee.org/document/8817973/ |
work_keys_str_mv | AT andreimbutnaru shotgunwsd20animprovedalgorithmforglobalwordsensedisambiguation AT radutudorionescu shotgunwsd20animprovedalgorithmforglobalwordsensedisambiguation |