Towards preprocessing on criminal chatting corpus.

Bibliographic Details
Main Authors: Marjuni, Siti Hanom, Mahmod, Ramlan, Abd Ghani, Abdul Azim, Md Zain, Abdullah, Sidi, Fatimah
Format: Article
Language: English
Published: MASAUM Network 2009
Online Access: http://psasir.upm.edu.my/id/eprint/17446/1/Towards%20preprocessing%20on%20criminal%20chatting%20corpus.pdf
collection UPM
description The importance of data cleansing becomes apparent when considering the meaning of the words collected during a conversation. In addition, noise, concise expressions and dynamic situations make chat data ill-suited for analysis. Because chat is often informal, especially in its use of short-form language, this paper presents the importance of preprocessing the collected data before proceeding to the next stage of the research. Two data-cleaning processes are required in this research: first, converting short-form words to full English words; second, discarding all toggles found in every utterance of the conversation. The processing aims to make each sentence more meaningful with respect to the suspect's target and expected intention. Results measured by precision, recall and F-measure showed that the corpus needs the conversion to be more meaningful. Furthermore, each word of the suspect's and victim's utterances is analyzed and treated as supporting evidence in criminal court cases. This research considers criminal chat data from Yahoo Messenger (YM), consisting of the suspect's and victim's conversations collected in real time without any editorial changes to the electronic discourse. However, chat messenger conversations are in an unstructured format that always uses short-form language. Chatters may use typical language or their own privately understood language during the conversation. Therefore, we propose a preprocessing phase specifically for chat data mining that involves text messages. The idea of the preprocessing is to prepare cleaned data, called the criminal data corpus; the cleaned data will be used in the next phase for word classification, tokenizing, tagging, ranking and constructing meanings.
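The two cleaning steps named in the description, and the precision, recall and F-measure used to score them, can be illustrated with a minimal Python sketch. Everything specific here is an assumption rather than the authors' method: the SHORT_FORMS dictionary is a hypothetical stand-in for the paper's lexicon, the record does not define "toggles" (the sketch treats them as symbol-only tokens such as ":)" or "^_^"), and the scoring function shows the standard set-based formulas, not necessarily how the paper applied them.

```python
import re

# Hypothetical short-form dictionary; the paper's actual lexicon
# is not given in this record.
SHORT_FORMS = {
    "u": "you",
    "r": "are",
    "2moro": "tomorrow",
    "pls": "please",
    "wat": "what",
}

# Assumption: a "toggle" is a token containing no letters or digits,
# for example ":)", "^_^" or "!!!". The record does not define the term.
TOGGLE = re.compile(r"^[^A-Za-z0-9]+$")


def clean_utterance(utterance):
    """Expand short forms and drop toggle tokens from one chat utterance."""
    cleaned = []
    for token in utterance.split():
        if TOGGLE.match(token):
            continue  # discard symbol-only tokens
        # Split off trailing punctuation so that "u?" still matches "u".
        core = token.rstrip(".,!?")
        tail = token[len(core):]
        cleaned.append(SHORT_FORMS.get(core.lower(), core) + tail)
    return " ".join(cleaned)


def precision_recall_f1(predicted, gold):
    """Score a set of predicted items against a gold-standard set."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    print(clean_utterance("pls call me 2moro :)"))
    print(precision_recall_f1({"a", "b", "c"}, {"b", "c", "d", "e"}))
```

A dictionary lookup keeps the conversion transparent and easy to audit, which matters if the cleaned text is later cited as evidence. Its limitation is that it only expands short forms already listed, and the symbol-only rule misses emoticons that contain letters, such as ":D".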
first_indexed 2024-03-06T07:40:27Z
id upm.eprints-17446
institution Universiti Putra Malaysia
last_indexed 2024-03-06T07:40:27Z
record_format dspace
spelling upm.eprints-17446 2015-10-22T01:26:13Z http://psasir.upm.edu.my/id/eprint/17446/
MASAUM Network 2009-10 Article PeerReviewed application/pdf en
Marjuni, Siti Hanom and Mahmod, Ramlan and Abd Ghani, Abdul Azim and Md Zain, Abdullah and Sidi, Fatimah (2009) Towards preprocessing on criminal chatting corpus. MASAUM Journal of Basic and Applied Sciences, 1 (3). pp. 401-405. ISSN 2076-0841