Expériences sur l’analyse morphosyntaxique des corpus oraux avec l’annotateur multi-niveaux DisMo
Annotating spoken corpora poses unique challenges stemming from the particular characteristics of spontaneous speech and its transcription. Automatic annotation tools need to adapt to these challenges. At the same time, it is desirable to define a “least common denominator” of written and spoken lan...
Main Authors: | George Christodoulides, Giulia Barreca |
---|---|
Format: | Article |
Language: | English |
Published: | Cercle linguistique du Centre et de l'Ouest - CerLICO |
Series: | Corela |
Subjects: | exploitation of oral corpora, multilevel annotation, automatic annotation |
Online Access: | https://journals.openedition.org/corela/4867 |
_version_ | 1797313903699427328 |
---|---|
author | George Christodoulides Giulia Barreca |
author_facet | George Christodoulides Giulia Barreca |
author_sort | George Christodoulides |
collection | DOAJ |
description | Annotating spoken corpora poses unique challenges stemming from the particular characteristics of spontaneous speech and its transcription. Automatic annotation tools need to adapt to these challenges. At the same time, it is desirable to define a “least common denominator” of written and spoken language corpora, to allow for comparisons between these two modalities, and apply an enriched annotation scheme for phenomena specific to spoken language. In this article, we present the approach implemented in the DisMo automatic annotator, which is specifically designed for spoken corpora, and which generates a multi-level annotation, including: part-of-speech tagging, lemmatisation, multi-word unit detection, detection and annotation of disfluencies and discourse markers, and chunking. We present our work on the French corpus of the Phonologie du Français Contemporain (PFC) project; this work allowed us to improve the tool. We discuss the theoretical and practical considerations that informed the choice of levels of annotation, types of phenomena detected, and tag sets, and we present a performance evaluation of the automatic annotation. |
first_indexed | 2024-03-08T02:38:20Z |
format | Article |
id | doaj.art-286be59b62134786a1c0423cf36f33c3 |
institution | Directory Open Access Journal |
issn | 1638-573X |
language | English |
last_indexed | 2024-03-08T02:38:20Z |
publisher | Cercle linguistique du Centre et de l'Ouest - CerLICO |
record_format | Article |
series | Corela |
spelling | doaj.art-286be59b62134786a1c0423cf36f33c3 · 2024-02-13T13:51:47Z · eng · Cercle linguistique du Centre et de l'Ouest - CerLICO · Corela · 1638-573X · 21 · 10.4000/corela.4867 · Expériences sur l’analyse morphosyntaxique des corpus oraux avec l’annotateur multi-niveaux DisMo · George Christodoulides · Giulia Barreca · Annotating spoken corpora poses unique challenges stemming from the particular characteristics of spontaneous speech and its transcription. Automatic annotation tools need to adapt to these challenges. At the same time, it is desirable to define a “least common denominator” of written and spoken language corpora, to allow for comparisons between these two modalities, and apply an enriched annotation scheme for phenomena specific to spoken language. In this article, we present the approach implemented in the DisMo automatic annotator, which is specifically designed for spoken corpora, and which generates a multi-level annotation, including: part-of-speech tagging, lemmatisation, multi-word unit detection, detection and annotation of disfluencies and discourse markers, and chunking. We present our work on the French corpus of the Phonologie du Français Contemporain (PFC) project; this work allowed us to improve the tool. We discuss the theoretical and practical considerations that informed the choice of levels of annotation, types of phenomena detected, and tag sets, and we present a performance evaluation of the automatic annotation. · https://journals.openedition.org/corela/4867 · exploitation of oral corpora · multilevel annotation · automatic annotation |
title | Expériences sur l’analyse morphosyntaxique des corpus oraux avec l’annotateur multi-niveaux DisMo |
topic | exploitation of oral corpora multilevel annotation automatic annotation |
url | https://journals.openedition.org/corela/4867 |