Learning about phraseology from corpora: A linguistically motivated approach for Multiword Expression identification.
Multiword Expressions (MWEs) are idiosyncratic combinations of words which pose important challenges to Natural Language Processing. Some kinds of MWEs, such as verbal ones, are particularly hard to identify in corpora, due to their high degree of morphosyntactic flexibility. This paper describes a...
Main Authors: | Uxoa Inurrieta, Itziar Aduriz, Arantza Díaz de Ilarraza, Gorka Labaka, Kepa Sarasola |
---|---|
Format: | Article |
Language: | English |
Published: | Public Library of Science (PLoS), 2020-01-01 |
Series: | PLoS ONE |
Online Access: | https://doi.org/10.1371/journal.pone.0237767 |
_version_ | 1818588824587993088 |
---|---|
author | Uxoa Inurrieta; Itziar Aduriz; Arantza Díaz de Ilarraza; Gorka Labaka; Kepa Sarasola |
author_sort | Uxoa Inurrieta |
collection | DOAJ |
description | Multiword Expressions (MWEs) are idiosyncratic combinations of words which pose important challenges to Natural Language Processing. Some kinds of MWEs, such as verbal ones, are particularly hard to identify in corpora, due to their high degree of morphosyntactic flexibility. This paper describes a linguistically motivated method to gather detailed information about verb+noun MWEs (VNMWEs) from corpora. Although the main focus of this study is Spanish, the method is easily adaptable to other languages. Monolingual and parallel corpora are used as input, and data about the morphosyntactic variability of VNMWEs is extracted. This information is then tested in an identification task, obtaining an F score of 0.52, which is considerably higher than related work. |
first_indexed | 2024-12-16T09:30:54Z |
format | Article |
id | doaj.art-2bedfbf58c2c4e39a4843234609637af |
institution | Directory Open Access Journal |
issn | 1932-6203 |
language | English |
last_indexed | 2024-12-16T09:30:54Z |
publishDate | 2020-01-01 |
publisher | Public Library of Science (PLoS) |
record_format | Article |
series | PLoS ONE |
spelling | doaj.art-2bedfbf58c2c4e39a4843234609637af; 2022-12-21T22:36:31Z; eng; Public Library of Science (PLoS); PLoS ONE; 1932-6203; 2020-01-01; 15; 8; e0237767; 10.1371/journal.pone.0237767; Learning about phraseology from corpora: A linguistically motivated approach for Multiword Expression identification.; Uxoa Inurrieta; Itziar Aduriz; Arantza Díaz de Ilarraza; Gorka Labaka; Kepa Sarasola; Multiword Expressions (MWEs) are idiosyncratic combinations of words which pose important challenges to Natural Language Processing. Some kinds of MWEs, such as verbal ones, are particularly hard to identify in corpora, due to their high degree of morphosyntactic flexibility. This paper describes a linguistically motivated method to gather detailed information about verb+noun MWEs (VNMWEs) from corpora. Although the main focus of this study is Spanish, the method is easily adaptable to other languages. Monolingual and parallel corpora are used as input, and data about the morphosyntactic variability of VNMWEs is extracted. This information is then tested in an identification task, obtaining an F score of 0.52, which is considerably higher than related work.; https://doi.org/10.1371/journal.pone.0237767 |
title | Learning about phraseology from corpora: A linguistically motivated approach for Multiword Expression identification. |
url | https://doi.org/10.1371/journal.pone.0237767 |