Testing word embeddings for Polish

Distributional semantics postulates that word meaning can be represented by numeric vectors that capture the contexts in which words occur in large text data. This paper addresses the problem of constructing such models for the Polish language. The paper...


Bibliographic Details
Main Authors:	Agnieszka Mykowiecka, Małgorzata Marciniak, Piotr Rychlik
Format:	Article
Language:	English
Published:	Institute of Slavic Studies, Polish Academy of Sciences 2017-12-01
Series:	Cognitive Studies | Études cognitives
Subjects:	distributional semantics, word embeddings, model evaluation, synonymy, analogy
Online Access:	https://journals.ispan.edu.pl/index.php/cs-ec/article/view/1468

_version_	1827829045786574848
author	Agnieszka Mykowiecka, Małgorzata Marciniak, Piotr Rychlik
author_facet	Agnieszka Mykowiecka, Małgorzata Marciniak, Piotr Rychlik
author_sort	Agnieszka Mykowiecka
collection	DOAJ
description	Distributional semantics postulates that word meaning can be represented by numeric vectors that capture the contexts in which words occur in large text data. This paper addresses the problem of constructing such models for the Polish language. It compares the effectiveness of models based on lemmas and on word forms, created with the Continuous Bag of Words (CBOW) and skip-gram approaches from different Polish corpora. The comparison uses the results of two typical tasks solved with the help of distributional semantics, i.e. synonymy and analogy recognition. The results show that it is not possible to identify one universal approach to vector creation applicable to various tasks. The most important factor is the quality and size of the data, but different strategy choices can also lead to significantly different results. Polish title: Testowanie wektorowych reprezentacji dystrybucyjnych słów języka polskiego [Testing distributional vector representations of Polish words]. Polish abstract, translated: Distributional semantics rests on the assumption that the meaning of a word is expressed by vectors that represent, directly or indirectly, the contexts in which the word is used in a large collection of texts. This article concerns the evaluation of many such models constructed for Polish. The paper compares the effectiveness of models based on lemmas and on word forms, built with neural networks on data from two different Polish corpora. The evaluation is based on the results of two typical tasks solved with distributional semantics methods, namely recognizing synonymy and analogy between specific pairs of words. The results show that no single universal approach to building distributional models can be identified, since their effectiveness varies with the application. The most important factor affecting model quality is the quality and size of the data, but the choice of network training strategy can also lead to substantially different results.
first_indexed	2024-03-12T03:58:25Z
format	Article
id	doaj.art-1d3dc9f84e04452db0b3c9b96b04a3e7
institution	Directory Open Access Journal
issn	2392-2397
language	English
last_indexed	2024-03-12T03:58:25Z
publishDate	2017-12-01
publisher	Institute of Slavic Studies, Polish Academy of Sciences
record_format	Article
series	Cognitive Studies | Études cognitives
spelling	doaj.art-1d3dc9f84e04452db0b3c9b96b04a3e7; 2023-09-03T11:46:19Z; eng; Institute of Slavic Studies, Polish Academy of Sciences; Cognitive Studies | Études cognitives; 2392-2397; 2017-12-01; 17; 10.11649/cs.1468; Testing word embeddings for Polish; Agnieszka Mykowiecka, Małgorzata Marciniak, Piotr Rychlik (all three: Instytut Podstaw Informatyki Polskiej Akademii Nauk [Institute of Computer Science, Polish Academy of Sciences], Warszawa [Warsaw]); https://journals.ispan.edu.pl/index.php/cs-ec/article/view/1468; distributional semantics, word embeddings, model evaluation, synonymy, analogy
title	Testing word embeddings for Polish
topic	distributional semantics, word embeddings, model evaluation, synonymy, analogy
url	https://journals.ispan.edu.pl/index.php/cs-ec/article/view/1468
