A French corpus annotated for multiword expressions and named entities

We present the enrichment of a French treebank of various genres with a new annotation layer for multiword expressions (MWEs) and named entities (NEs). Our contribution with respect to previous work on NE and MWE annotation is the particular care taken to use formal criteria, organized into decisio...


Bibliographic Details
Main Authors: Marie Candito, Mathieu Constant, Carlos Ramisch, Agata Savary, Bruno Guillaume, Yannick Parmentier, Silvio Cordeiro
Format: Article
Language: English
Published: Institute of Computer Science, Polish Academy of Sciences 2021-02-01
Series: Journal of Language Modelling
Subjects: multiword expressions, annotation, corpus, French
Online Access: https://jlm.ipipan.waw.pl/index.php/JLM/article/view/265
author Marie Candito
Mathieu Constant
Carlos Ramisch
Agata Savary
Bruno Guillaume
Yannick Parmentier
Silvio Cordeiro
collection DOAJ
description We present the enrichment of a French treebank of various genres with a new annotation layer for multiword expressions (MWEs) and named entities (NEs). Our contribution with respect to previous work on NE and MWE annotation is the particular care taken to use formal criteria, organized into decision flowcharts, shedding some light on the interactions between NEs and MWEs. Moreover, in order to cope with the well-known difficulty of drawing a clear-cut frontier between compositional expressions and MWEs, we chose to use sufficient criteria only. As a result, annotated MWEs satisfy a varying number of sufficient criteria, accounting for the scalar nature of the MWE status. In addition to the span of the elements, annotation includes the subcategory of NEs (e.g., person, location) and one matching sufficient criterion for non-verbal MWEs (e.g., lexical substitution). The 3,099 sentences of the treebank were double-annotated and adjudicated, and we paid attention to cross-type consistency and compatibility with the syntactic layer. Overall inter-annotator agreement on non-verbal MWEs and NEs reached 71.1%. The released corpus contains 3,112 annotated NEs and 3,440 MWEs, and is distributed under an open license.
first_indexed 2024-12-10T16:57:52Z
format Article
id doaj.art-454468d4025e4a1d8817fd401120f23a
institution Directory Open Access Journal
issn 2299-856X
2299-8470
language English
last_indexed 2024-12-10T16:57:52Z
publishDate 2021-02-01
publisher Institute of Computer Science, Polish Academy of Sciences
record_format Article
series Journal of Language Modelling
spelling doaj.art-454468d4025e4a1d8817fd401120f23a
2022-12-22T01:40:39Z
eng
Institute of Computer Science, Polish Academy of Sciences
Journal of Language Modelling
2299-856X
2299-8470
2021-02-01
8
2
415–479
10.15398/jlm.v8i2.265
205
A French corpus annotated for multiword expressions and named entities
Marie Candito, https://orcid.org/0000-0001-8306-4859, LLF (CNRS and Paris University)
Mathieu Constant, https://orcid.org/0000-0002-9910-594X, ATILF (CNRS and Université de Lorraine)
Carlos Ramisch, https://orcid.org/0000-0001-7466-9039, LIS (CNRS and Aix Marseille University)
Agata Savary, https://orcid.org/0000-0002-6473-6477, University of Tours
Bruno Guillaume, https://orcid.org/0000-0001-8314-8075, LORIA (CNRS, Université de Lorraine and Inria)
Yannick Parmentier, https://orcid.org/0000-0003-1461-5535, LORIA (CNRS, Université de Lorraine and Inria)
Silvio Cordeiro, https://orcid.org/0000-0002-1262-369X
https://jlm.ipipan.waw.pl/index.php/JLM/article/view/265
multiword expressions
annotation
corpus
french
title A French corpus annotated for multiword expressions and named entities
topic multiword expressions
annotation
corpus
french
url https://jlm.ipipan.waw.pl/index.php/JLM/article/view/265