Summary: | Research in natural language processing has indicated that more data leads to
better accuracy. Processing data at this scale on a single machine has its
own limitations, which can be addressed by processing the data in parallel.
This research applied MapReduce to part-of-speech (POS) tagging.
MapReduce is a programming model developed for processing large-scale data, while
POS tagging is one of the earliest steps in natural language processing. The POS
tagging approach used in this research is a Maximum Entropy model for Bahasa Indonesia.
The MapReduce model is implemented in several parts of the training and tagging
processes: in dictionary, tagtoken, and feature creation,
and also in the parameter calculation using improved iterative scaling (IIS). It was
found that the IIS calculation could not be implemented using the MapReduce model,
because it updates probability parameters that are closely interdependent and therefore
cannot be computed in parallel. The experiments were conducted using 100,000-word and
1,000,000-word training corpora from Pan Localization and the 12,000-word training corpus
used in Wicaksono and Purwarianti's research. The experiments showed that the total
training time with MapReduce is shorter than without it. However,
the time spent reading MapReduce results inside the training process slows down the
total training time.
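To illustrate how a step such as feature creation fits the MapReduce model, the sketch below simulates a mapper that emits one (feature, 1) pair per word-tag observation and a reducer that sums the counts. This is only an illustrative example under assumed names: the feature template, function names, and toy corpus are hypothetical and not taken from the research.

```python
from collections import defaultdict

def map_features(sentence):
    """Mapper: emit a (feature, 1) pair for each word-tag observation.

    `sentence` is a list of (word, tag) tuples; the word-tag feature
    template used here is an illustrative choice only.
    """
    for word, tag in sentence:
        yield (f"word={word}|tag={tag}", 1)

def reduce_counts(pairs):
    """Reducer: sum the counts emitted for each feature key."""
    totals = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)

# Simulate the MapReduce flow on a tiny tagged corpus (hypothetical data).
corpus = [
    [("saya", "PRP"), ("makan", "VBT")],
    [("saya", "PRP"), ("tidur", "VBI")],
]
pairs = [p for sent in corpus for p in map_features(sent)]
feature_counts = reduce_counts(pairs)
```

Because each mapper processes its sentences independently and the reducer only aggregates, these steps parallelize naturally; the IIS weight updates, by contrast, depend on one another across iterations and so resist this decomposition.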
Tagging experiments were conducted using different numbers of map and reduce
processes on corpora of different sizes gathered from various news sites. The
experiments showed that the MapReduce implementation could speed up the tagging
process. The fastest result was obtained by tagging a 1,000,000-word
corpus with 30 map processes.
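Unlike IIS training, tagging is embarrassingly parallel: each map process can tag its own slice of the corpus with no dependence on the other slices, which is why adding map processes can shorten tagging time. A minimal sketch of that partitioning, where the chunking scheme and the toy `tag_word` lookup are illustrative assumptions rather than the research's actual tagger:

```python
def tag_word(word):
    """Stand-in for the Maximum Entropy tagger; here a toy lookup table."""
    toy_model = {"saya": "PRP", "makan": "VBT"}
    return toy_model.get(word, "NN")

def split_into_chunks(words, n_chunks):
    """Partition the corpus so each map process receives one chunk."""
    size = (len(words) + n_chunks - 1) // n_chunks
    return [words[i:i + size] for i in range(0, len(words), size)]

def tag_chunk(chunk):
    """One map task: tag a chunk independently of all other chunks."""
    return [(w, tag_word(w)) for w in chunk]

corpus = ["saya", "makan", "nasi", "goreng"]
chunks = split_into_chunks(corpus, 2)
# Each tag_chunk call could run in a separate map process;
# here they are run sequentially and their outputs concatenated.
tagged = [pair for chunk in chunks for pair in tag_chunk(chunk)]
```

Because the chunks share no state, the overall speedup is limited mainly by the per-process startup and result-collection overhead, consistent with the observed gains from using more map processes.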
|