Chinese words segmentation in user generated content

Bibliographic Details
Main Author: Cai, Xiaoxuan
Other Authors: Sun Aixin
Format: Final Year Project (FYP)
Language: English
Published: 2015
Subjects:
Online Access: http://hdl.handle.net/10356/63803
_version_ 1811685058101116928
collection NTU
description Chinese word segmentation is the first step in Chinese text processing, and its accuracy directly affects the performance of Chinese text processing as a whole. With the increasing popularity of social media in China, informally written Chinese sentences in user-generated content have become very common on the Internet. This project studies Chinese word segmentation in user-generated content. Two existing Chinese word segmentation tools, Jieba [1] and Stanford Word Segmenter [2], are studied; a new tool named Weibo Segmenter, implemented according to [3], is presented; and the three tools are then tested on the same dataset to compare their performance. In the test, Weibo Segmenter achieves an accuracy of 83.3%. Its performance could be further improved by using a more suitable dictionary and some programming techniques.
first_indexed 2024-10-01T04:38:29Z
format Final Year Project (FYP)
id ntu-10356/63803
institution Nanyang Technological University
language English
last_indexed 2024-10-01T04:38:29Z
publishDate 2015
record_format dspace
spelling ntu-10356/63803 2023-03-03T20:24:01Z Chinese words segmentation in user generated content Cai, Xiaoxuan Sun Aixin School of Computer Engineering DRNTU::Engineering::Computer science and engineering Bachelor of Engineering (Computer Science) 2015-05-19T03:54:24Z 2015-05-19T03:54:24Z 2015 2015 Final Year Project (FYP) http://hdl.handle.net/10356/63803 en Nanyang Technological University 42 p. application/pdf
title Chinese words segmentation in user generated content
topic DRNTU::Engineering::Computer science and engineering
url http://hdl.handle.net/10356/63803