A French corpus annotated for multiword expressions and named entities
We present the enrichment of a French treebank of various genres with a new annotation layer for multiword expressions (MWEs) and named entities (NEs). Our contribution with respect to previous work on NE and MWE annotation is the particular care taken to use formal criteria, organized into decisio...
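Main Authors: | Marie Candito, Mathieu Constant, Carlos Ramisch, Agata Savary, Bruno Guillaume, Yannick Parmentier, Silvio Cordeiro |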
---|---|
Format: | Article |
Language: | English |
Published: | Institute of Computer Science, Polish Academy of Sciences, 2021-02-01 |
Series: | Journal of Language Modelling |
Subjects: | multiword expressions, annotation, corpus, french |
Online Access: | https://jlm.ipipan.waw.pl/index.php/JLM/article/view/265 |
_version_ | 1828427198915149824 |
---|---|
author | Marie Candito, Mathieu Constant, Carlos Ramisch, Agata Savary, Bruno Guillaume, Yannick Parmentier, Silvio Cordeiro |
author_facet | Marie Candito, Mathieu Constant, Carlos Ramisch, Agata Savary, Bruno Guillaume, Yannick Parmentier, Silvio Cordeiro |
author_sort | Marie Candito |
collection | DOAJ |
description | We present the enrichment of a French treebank of various genres with a new annotation layer for multiword expressions (MWEs) and named entities (NEs). Our contribution with respect to previous work on NE and MWE annotation is the particular care taken to use formal criteria, organized into decision flowcharts, shedding some light on the interactions between NEs and MWEs. Moreover, in order to cope with the well-known difficulty of drawing a clear-cut frontier between compositional expressions and MWEs, we chose to use sufficient criteria only. As a result, annotated MWEs satisfy a varying number of sufficient criteria, accounting for the scalar nature of the MWE status. In addition to the span of the elements, annotation includes the subcategory of NEs (e.g., person, location) and one matching sufficient criterion for non-verbal MWEs (e.g., lexical substitution). The 3,099 sentences of the treebank were double-annotated and adjudicated, and we paid attention to cross-type consistency and compatibility with the syntactic layer. Overall inter-annotator agreement on non-verbal MWEs and NEs reached 71.1%. The released corpus contains 3,112 annotated NEs and 3,440 MWEs, and is distributed under an open license. |
first_indexed | 2024-12-10T16:57:52Z |
format | Article |
id | doaj.art-454468d4025e4a1d8817fd401120f23a |
institution | Directory Open Access Journal |
issn | 2299-856X 2299-8470 |
language | English |
last_indexed | 2024-12-10T16:57:52Z |
publishDate | 2021-02-01 |
publisher | Institute of Computer Science, Polish Academy of Sciences |
record_format | Article |
series | Journal of Language Modelling |
spelling | doaj.art-454468d4025e4a1d8817fd401120f23a; 2022-12-22T01:40:39Z; eng; Institute of Computer Science, Polish Academy of Sciences; Journal of Language Modelling; 2299-856X; 2299-8470; 2021-02-01; 8; 2; 415–479; 415–479; 10.15398/jlm.v8i2.265; 205; A French corpus annotated for multiword expressions and named entities; Marie Candito (0), https://orcid.org/0000-0001-8306-4859; Mathieu Constant (1), https://orcid.org/0000-0002-9910-594X; Carlos Ramisch (2), https://orcid.org/0000-0001-7466-9039; Agata Savary (3), https://orcid.org/0000-0002-6473-6477; Bruno Guillaume (4), https://orcid.org/0000-0001-8314-8075; Yannick Parmentier (5), https://orcid.org/0000-0003-1461-5535; Silvio Cordeiro, https://orcid.org/0000-0002-1262-369X; LLF (CNRS and Paris University); ATILF (CNRS and Université de Lorraine); LIS (CNRS and Aix Marseille University); University of Tours; LORIA (CNRS, Université de Lorraine and Inria); LORIA (CNRS, Université de Lorraine and Inria); We present the enrichment of a French treebank of various genres with a new annotation layer for multiword expressions (MWEs) and named entities (NEs). Our contribution with respect to previous work on NE and MWE annotation is the particular care taken to use formal criteria, organized into decision flowcharts, shedding some light on the interactions between NEs and MWEs. Moreover, in order to cope with the well-known difficulty of drawing a clear-cut frontier between compositional expressions and MWEs, we chose to use sufficient criteria only. As a result, annotated MWEs satisfy a varying number of sufficient criteria, accounting for the scalar nature of the MWE status. In addition to the span of the elements, annotation includes the subcategory of NEs (e.g., person, location) and one matching sufficient criterion for non-verbal MWEs (e.g., lexical substitution). The 3,099 sentences of the treebank were double-annotated and adjudicated, and we paid attention to cross-type consistency and compatibility with the syntactic layer. Overall inter-annotator agreement on non-verbal MWEs and NEs reached 71.1%. The released corpus contains 3,112 annotated NEs and 3,440 MWEs, and is distributed under an open license.; https://jlm.ipipan.waw.pl/index.php/JLM/article/view/265; multiword expressions; annotation; corpus; french |
spellingShingle | Marie Candito; Mathieu Constant; Carlos Ramisch; Agata Savary; Bruno Guillaume; Yannick Parmentier; Silvio Cordeiro; A French corpus annotated for multiword expressions and named entities; Journal of Language Modelling; multiword expressions; annotation; corpus; french |
title | A French corpus annotated for multiword expressions and named entities |
title_full | A French corpus annotated for multiword expressions and named entities |
title_fullStr | A French corpus annotated for multiword expressions and named entities |
title_full_unstemmed | A French corpus annotated for multiword expressions and named entities |
title_short | A French corpus annotated for multiword expressions and named entities |
title_sort | french corpus annotated for multiword expressions and named entities |
topic | multiword expressions, annotation, corpus, french |
url | https://jlm.ipipan.waw.pl/index.php/JLM/article/view/265 |