Design and analysis of a lean interface for Sanskrit corpus annotation
We describe an innovative computer interface designed to assist annotators in the efficient selection of segmentation solutions for proper tagging of Sanskrit corpora. The proposed solution uses a compact representation of the shared forest of all segmentations. The main idea is to represent the uni...
Main Authors: | Pawan Goyal, Gerard Huet |
---|---|
Format: | Article |
Language: | English |
Published: | Institute of Computer Science, Polish Academy of Sciences, 2016-10-01 |
Series: | Journal of Language Modelling |
Subjects: | Sanskrit; text segmentation; annotation; linguistic interfaces |
Online Access: | https://jlm.ipipan.waw.pl/index.php/JLM/article/view/108 |
_version_ | 1831686874984349696 |
---|---|
author | Pawan Goyal; Gerard Huet |
author_facet | Pawan Goyal; Gerard Huet |
author_sort | Pawan Goyal |
collection | DOAJ |
description | We describe an innovative computer interface designed to assist annotators in the efficient selection of segmentation solutions for proper tagging of Sanskrit corpora. The proposed solution uses a compact representation of the shared forest of all segmentations. The main idea is to represent the union of all segmentations, abstracting from the sandhi rules used, and aligning with the input sentence. We show that this representation provides an exponential saving, in both space and time.
The segmentation methodology is lexicon-directed. When the lexicon does not have full coverage of the corpus vocabulary, some chunks of the input may fail to be recognized. We designed a lexicon-acquisition facility, which remedies this incompleteness and makes the interface more robust.
This interface has been implemented, and is currently being applied to the annotation of the Sanskrit Library corpus. Evaluation over 1,500 sentences from the Pañcatantra text shows the effectiveness of the proposed interface on real corpus data. |
first_indexed | 2024-12-20T08:47:16Z |
format | Article |
id | doaj.art-d5fa2e711c9c493d85049f52975fda95 |
institution | Directory of Open Access Journals |
issn | 2299-856X 2299-8470 |
language | English |
last_indexed | 2024-12-20T08:47:16Z |
publishDate | 2016-10-01 |
publisher | Institute of Computer Science, Polish Academy of Sciences |
record_format | Article |
series | Journal of Language Modelling |
spelling | doaj.art-d5fa2e711c9c493d85049f52975fda95; 2022-12-21T19:46:13Z; eng; Institute of Computer Science, Polish Academy of Sciences; Journal of Language Modelling; 2299-856X; 2299-8470; 2016-10-01; 4; 2; 10.15398/jlm.v4i2.108; 53; Design and analysis of a lean interface for Sanskrit corpus annotation; Pawan Goyal 0; Gerard Huet 1; IIT Kharagpur; Inria; We describe an innovative computer interface designed to assist annotators in the efficient selection of segmentation solutions for proper tagging of Sanskrit corpora. The proposed solution uses a compact representation of the shared forest of all segmentations. The main idea is to represent the union of all segmentations, abstracting from the sandhi rules used, and aligning with the input sentence. We show that this representation provides an exponential saving, in both space and time. The segmentation methodology is lexicon-directed. When the lexicon does not have full coverage of the corpus vocabulary, some chunks of the input may fail to be recognized. We designed a lexicon-acquisition facility, which remedies this incompleteness and makes the interface more robust. This interface has been implemented, and is currently being applied to the annotation of the Sanskrit Library corpus. Evaluation over 1,500 sentences from the Pañcatantra text shows the effectiveness of the proposed interface on real corpus data.; https://jlm.ipipan.waw.pl/index.php/JLM/article/view/108; Sanskrit; text segmentation; annotation; linguistic interfaces |
title | Design and analysis of a lean interface for Sanskrit corpus annotation |
topic | Sanskrit; text segmentation; annotation; linguistic interfaces |
url | https://jlm.ipipan.waw.pl/index.php/JLM/article/view/108 |