MiNgMatch—A Fast N-gram Model for Word Segmentation of the Ainu Language

Word segmentation is an essential task in automatic language processing for languages where there are no explicit word boundary markers, or where space-delimited orthographic words are too coarse-grained. In this paper we introduce the MiNgMatch Segmenter—a fast word segmentation algorithm...

Bibliographic Details
Main Authors:	Karol Nowakowski, Michal Ptaszynski, Fumito Masui
Format:	Article
Language:	English
Published:	MDPI AG 2019-10-01
Series:	Information
Subjects:	word segmentation; tokenization; language modelling; n-gram models; Ainu language; endangered languages; under-resourced languages
Online Access:	https://www.mdpi.com/2078-2489/10/10/317

_version_	1811320735468093440
author	Karol Nowakowski; Michal Ptaszynski; Fumito Masui
author_facet	Karol Nowakowski; Michal Ptaszynski; Fumito Masui
author_sort	Karol Nowakowski
collection	DOAJ
description	Word segmentation is an essential task in automatic language processing for languages where there are no explicit word boundary markers, or where space-delimited orthographic words are too coarse-grained. In this paper we introduce the MiNgMatch Segmenter—a fast word segmentation algorithm, which reduces the problem of identifying word boundaries to finding the shortest sequence of lexical n-grams matching the input text. In order to validate our method in a low-resource scenario involving extremely sparse data, we tested it with a small corpus of text in the critically endangered language of the Ainu people living in northern parts of Japan. Furthermore, we performed a series of experiments comparing our algorithm with systems utilizing state-of-the-art lexical n-gram-based language modelling techniques (namely, Stupid Backoff model and a model with modified Kneser-Ney smoothing), as well as a neural model performing word segmentation as character sequence labelling. The experimental results we obtained demonstrate the high performance of our algorithm, comparable with the other best-performing models. Given its low computational cost and competitive results, we believe that the proposed approach could be extended to other languages, and possibly also to other Natural Language Processing tasks, such as speech recognition.
first_indexed	2024-04-13T13:04:39Z
format	Article
id	doaj.art-e42952efb4f0445ea940931bcaa49337
institution	Directory of Open Access Journals
issn	2078-2489
language	English
last_indexed	2024-04-13T13:04:39Z
publishDate	2019-10-01
publisher	MDPI AG
record_format	Article
series	Information
spelling	doaj.art-e42952efb4f0445ea940931bcaa49337; 2022-12-22T02:45:49Z; eng; MDPI AG; Information; 2078-2489; 2019-10-01; volume 10, issue 10, article 317; 10.3390/info10100317; info10100317; MiNgMatch—A Fast N-gram Model for Word Segmentation of the Ainu Language; Karol Nowakowski, Michal Ptaszynski, Fumito Masui (all: Department of Computer Science, Kitami Institute of Technology, 165 Koen-cho, Kitami, Hokkaido 090-8507, Japan); https://www.mdpi.com/2078-2489/10/10/317
spellingShingle	Karol Nowakowski Michal Ptaszynski Fumito Masui MiNgMatch—A Fast N-gram Model for Word Segmentation of the Ainu Language Information word segmentation tokenization language modelling n-gram models ainu language endangered languages under-resourced languages
title	MiNgMatch—A Fast N-gram Model for Word Segmentation of the Ainu Language
title_full	MiNgMatch—A Fast N-gram Model for Word Segmentation of the Ainu Language
title_fullStr	MiNgMatch—A Fast N-gram Model for Word Segmentation of the Ainu Language
title_full_unstemmed	MiNgMatch—A Fast N-gram Model for Word Segmentation of the Ainu Language
title_short	MiNgMatch—A Fast N-gram Model for Word Segmentation of the Ainu Language
title_sort	mingmatch a fast n gram model for word segmentation of the ainu language
topic	word segmentation; tokenization; language modelling; n-gram models; Ainu language; endangered languages; under-resourced languages
url	https://www.mdpi.com/2078-2489/10/10/317
work_keys_str_mv	AT karolnowakowski mingmatchafastngrammodelforwordsegmentationoftheainulanguage AT michalptaszynski mingmatchafastngrammodelforwordsegmentationoftheainulanguage AT fumitomasui mingmatchafastngrammodelforwordsegmentationoftheainulanguage

Similar Items