Design and analysis of a lean interface for Sanskrit corpus annotation

We describe an innovative computer interface designed to assist annotators in the efficient selection of segmentation solutions for proper tagging of Sanskrit corpora. The proposed solution uses a compact representation of the shared forest of all segmentations. The main idea is to represent the uni...

Bibliographic Details
Main Authors:	Pawan Goyal, Gerard Huet
Format:	Article
Language:	English
Published:	Institute of Computer Science, Polish Academy of Sciences 2016-10-01
Series:	Journal of Language Modelling
Subjects:	Sanskrit, text segmentation, annotation, linguistic interfaces
Online Access:	https://jlm.ipipan.waw.pl/index.php/JLM/article/view/108

_version_	1831686874984349696
author	Pawan Goyal, Gerard Huet
author_facet	Pawan Goyal, Gerard Huet
author_sort	Pawan Goyal
collection	DOAJ
description	We describe an innovative computer interface designed to assist annotators in the efficient selection of segmentation solutions for proper tagging of Sanskrit corpora. The proposed solution uses a compact representation of the shared forest of all segmentations. The main idea is to represent the union of all segmentations, abstracting from the sandhi rules used, and aligning with the input sentence. We show that this representation provides an exponential saving, in both space and time. The segmentation methodology is lexicon-directed. When the lexicon does not have full coverage of the corpus vocabulary, some chunks of the input may fail to be recognized. We designed a lexicon-acquisition facility, which remedies this incompleteness and makes the interface more robust. This interface has been implemented, and is currently being applied to the annotation of the Sanskrit Library corpus. Evaluation over 1,500 sentences from the Pañcatantra text shows the effectiveness of the proposed interface on real corpus data.
first_indexed	2024-12-20T08:47:16Z
format	Article
id	doaj.art-d5fa2e711c9c493d85049f52975fda95
institution	Directory Open Access Journal
issn	2299-856X 2299-8470
language	English
last_indexed	2024-12-20T08:47:16Z
publishDate	2016-10-01
publisher	Institute of Computer Science, Polish Academy of Sciences
record_format	Article
series	Journal of Language Modelling
spelling	doaj.art-d5fa2e711c9c493d85049f52975fda95 2022-12-21T19:46:13Z eng Institute of Computer Science, Polish Academy of Sciences Journal of Language Modelling 2299-856X 2299-8470 2016-10-01 4210.15398/jlm.v4i2.10853 Design and analysis of a lean interface for Sanskrit corpus annotation Pawan Goyal (IIT Kharagpur), Gerard Huet (Inria) We describe an innovative computer interface designed to assist annotators in the efficient selection of segmentation solutions for proper tagging of Sanskrit corpora. The proposed solution uses a compact representation of the shared forest of all segmentations. The main idea is to represent the union of all segmentations, abstracting from the sandhi rules used, and aligning with the input sentence. We show that this representation provides an exponential saving, in both space and time. The segmentation methodology is lexicon-directed. When the lexicon does not have full coverage of the corpus vocabulary, some chunks of the input may fail to be recognized. We designed a lexicon-acquisition facility, which remedies this incompleteness and makes the interface more robust. This interface has been implemented, and is currently being applied to the annotation of the Sanskrit Library corpus. Evaluation over 1,500 sentences from the Pañcatantra text shows the effectiveness of the proposed interface on real corpus data. https://jlm.ipipan.waw.pl/index.php/JLM/article/view/108 Sanskrit, text segmentation, annotation, linguistic interfaces
title	Design and analysis of a lean interface for Sanskrit corpus annotation
title_full	Design and analysis of a lean interface for Sanskrit corpus annotation
title_fullStr	Design and analysis of a lean interface for Sanskrit corpus annotation
title_full_unstemmed	Design and analysis of a lean interface for Sanskrit corpus annotation
title_short	Design and analysis of a lean interface for Sanskrit corpus annotation
title_sort	design and analysis of a lean interface for sanskrit corpus annotation
topic	Sanskrit, text segmentation, annotation, linguistic interfaces
url	https://jlm.ipipan.waw.pl/index.php/JLM/article/view/108

Similar Items