Design and analysis of a lean interface for Sanskrit corpus annotation

Bibliographic Details
Main Authors: Pawan Goyal, Gerard Huet
Format: Article
Language: English
Published: Institute of Computer Science, Polish Academy of Sciences 2016-10-01
Series: Journal of Language Modelling
Subjects: Sanskrit; text segmentation; annotation; linguistic interfaces
Online Access: https://jlm.ipipan.waw.pl/index.php/JLM/article/view/108
author Pawan Goyal
Gerard Huet
collection DOAJ
description We describe an innovative computer interface designed to assist annotators in the efficient selection of segmentation solutions for proper tagging of Sanskrit corpora. The proposed solution uses a compact representation of the shared forest of all segmentations. The main idea is to represent the union of all segmentations, abstracting from the sandhi rules used, and aligning with the input sentence. We show that this representation provides an exponential saving, in both space and time. The segmentation methodology is lexicon-directed. When the lexicon does not have full coverage of the corpus vocabulary, some chunks of the input may fail to be recognized. We designed a lexicon-acquisition facility, which remedies this incompleteness and makes the interface more robust. This interface has been implemented, and is currently being applied to the annotation of the Sanskrit Library corpus. Evaluation over 1,500 sentences from the Pañcatantra text shows the effectiveness of the proposed interface on real corpus data.
first_indexed 2024-12-20T08:47:16Z
format Article
id doaj.art-d5fa2e711c9c493d85049f52975fda95
institution Directory Open Access Journal
issn 2299-856X
2299-8470
language English
last_indexed 2024-12-20T08:47:16Z
publishDate 2016-10-01
publisher Institute of Computer Science, Polish Academy of Sciences
record_format Article
series Journal of Language Modelling
spelling doaj.art-d5fa2e711c9c493d85049f52975fda95 | 2022-12-21T19:46:13Z | eng | Institute of Computer Science, Polish Academy of Sciences | Journal of Language Modelling | 2299-856X | 2299-8470 | 2016-10-01 | 4 | 2 | 10.15398/jlm.v4i2.108 | 53 | Design and analysis of a lean interface for Sanskrit corpus annotation | Pawan Goyal [0] | Gerard Huet [1] | IIT Kharagpur | Inria | https://jlm.ipipan.waw.pl/index.php/JLM/article/view/108 | Sanskrit | text segmentation | annotation | linguistic interfaces
title Design and analysis of a lean interface for Sanskrit corpus annotation
topic Sanskrit
text segmentation
annotation
linguistic interfaces
url https://jlm.ipipan.waw.pl/index.php/JLM/article/view/108