Learning about phraseology from corpora: A linguistically motivated approach for Multiword Expression identification.

Multiword Expressions (MWEs) are idiosyncratic combinations of words which pose important challenges to Natural Language Processing. Some kinds of MWEs, such as verbal ones, are particularly hard to identify in corpora, due to their high degree of morphosyntactic flexibility. This paper describes a...

Bibliographic Details
Main Authors: Uxoa Inurrieta, Itziar Aduriz, Arantza Díaz de Ilarraza, Gorka Labaka, Kepa Sarasola
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2020-01-01
Series: PLoS ONE
Online Access: https://doi.org/10.1371/journal.pone.0237767
author Uxoa Inurrieta
Itziar Aduriz
Arantza Díaz de Ilarraza
Gorka Labaka
Kepa Sarasola
collection DOAJ
description Multiword Expressions (MWEs) are idiosyncratic combinations of words which pose important challenges to Natural Language Processing. Some kinds of MWEs, such as verbal ones, are particularly hard to identify in corpora, due to their high degree of morphosyntactic flexibility. This paper describes a linguistically motivated method to gather detailed information about verb+noun MWEs (VNMWEs) from corpora. Although the main focus of this study is Spanish, the method is easily adaptable to other languages. Monolingual and parallel corpora are used as input, and data about the morphosyntactic variability of VNMWEs is extracted. This information is then tested in an identification task, obtaining an F score of 0.52, which is considerably higher than related work.
first_indexed 2024-12-16T09:30:54Z
format Article
id doaj.art-2bedfbf58c2c4e39a4843234609637af
institution Directory Open Access Journal
issn 1932-6203
language English
last_indexed 2024-12-16T09:30:54Z
publishDate 2020-01-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS ONE
spelling doaj.art-2bedfbf58c2c4e39a4843234609637af | 2022-12-21T22:36:31Z | eng | Public Library of Science (PLoS) | PLoS ONE | 1932-6203 | 2020-01-01 | 15 | 8 | e0237767 | 10.1371/journal.pone.0237767 | Learning about phraseology from corpora: A linguistically motivated approach for Multiword Expression identification. | Uxoa Inurrieta | Itziar Aduriz | Arantza Díaz de Ilarraza | Gorka Labaka | Kepa Sarasola | Multiword Expressions (MWEs) are idiosyncratic combinations of words which pose important challenges to Natural Language Processing. Some kinds of MWEs, such as verbal ones, are particularly hard to identify in corpora, due to their high degree of morphosyntactic flexibility. This paper describes a linguistically motivated method to gather detailed information about verb+noun MWEs (VNMWEs) from corpora. Although the main focus of this study is Spanish, the method is easily adaptable to other languages. Monolingual and parallel corpora are used as input, and data about the morphosyntactic variability of VNMWEs is extracted. This information is then tested in an identification task, obtaining an F score of 0.52, which is considerably higher than related work. | https://doi.org/10.1371/journal.pone.0237767
title Learning about phraseology from corpora: A linguistically motivated approach for Multiword Expression identification.
url https://doi.org/10.1371/journal.pone.0237767