Projecting named entity tags from a resource rich language to a resource poor language

Named Entities (NE) are the prominent entities appearing in textual documents. Automatic classification of NE in a textual corpus is a vital process in Information Extraction and Information Retrieval research. Named Entity Recognition (NER) is the identification of words in text that correspond to a...

Bibliographic Details
Main Authors: Zamin, Norshuhani, Oxley, Alan, Abu Bakar, Zainab
Format: Article
Language: English
Published: Universiti Utara Malaysia Press 2012
Subjects: QA75 Electronic computers. Computer science
Online Access: https://repo.uum.edu.my/id/eprint/24088/1/J%20ICT%2012%202013%20121%E2%80%93146.pdf
author Zamin, Norshuhani
Oxley, Alan
Abu Bakar, Zainab
author_sort Zamin, Norshuhani
collection UUM
description Named Entities (NE) are the prominent entities appearing in textual documents. Automatic classification of NE in a textual corpus is a vital process in Information Extraction and Information Retrieval research. Named Entity Recognition (NER) is the identification of words in text that correspond to a pre-defined taxonomy such as person, organization, location, date, time, etc. This article focuses on the person (PER), organization (ORG) and location (LOC) entities for a Malay journalistic corpus of terrorism. A projection algorithm, using the Dice Coefficient function and bigram scoring method with domain-specific rules, is suggested to map the NE information from the English corpus to the Malay corpus of terrorism. The English corpus is the translated version of the Malay corpus. Hence, these two corpora are treated as parallel corpora. The method computes the string similarity between the English words and the list of available lexemes in a pre-built lexicon that approximates the best NE mapping. The algorithm has been effectively evaluated using our own terrorism-tagged corpus; it achieved satisfactory results in terms of precision, recall, and F-measure. An evaluation of the selected open source NER tool for English is also presented.
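The description names a Dice Coefficient with bigram scoring as the string-similarity measure used to match English words against lexicon entries. The following is a minimal sketch of that general technique, not the authors' implementation: the function names, the sample lexicon, its tags, and the 0.5 threshold are all hypothetical, and the domain-specific rules mentioned in the description are not modelled.

```python
from collections import Counter


def char_bigrams(word):
    """Return the lowercase character bigrams of a word, in order."""
    w = word.lower()
    return [w[i:i + 2] for i in range(len(w) - 1)]


def dice_coefficient(a, b):
    """Dice similarity 2*|X & Y| / (|X| + |Y|) over bigram multisets."""
    x, y = Counter(char_bigrams(a)), Counter(char_bigrams(b))
    total = sum(x.values()) + sum(y.values())
    if total == 0:
        # Both strings are shorter than two characters: fall back to equality.
        return 1.0 if a.lower() == b.lower() else 0.0
    overlap = sum((x & y).values())  # multiset intersection size
    return 2.0 * overlap / total


def best_match(word, lexicon, threshold=0.5):
    """Return (lexeme, tag, score) for the most similar lexicon entry.

    `lexicon` is a hypothetical dict mapping lexeme -> NE tag.
    Returns None when no entry reaches the (assumed) threshold.
    """
    best = None
    for lexeme, tag in lexicon.items():
        score = dice_coefficient(word, lexeme)
        if best is None or score > best[2]:
            best = (lexeme, tag, score)
    if best is None or best[2] < threshold:
        return None
    return best


# Hypothetical lexicon entries, for illustration only.
sample_lexicon = {"Indonesia": "LOC", "Malaysia": "LOC", "Interpol": "ORG"}
match = best_match("Indonesian", sample_lexicon)
```

Here `match` holds the closest entry for "Indonesian" together with its tag and score. Using multisets (`Counter`) rather than plain sets keeps repeated bigrams from being undercounted; whether the original work counts duplicates this way is not stated in the record.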
first_indexed 2024-07-04T06:25:28Z
format Article
id uum-24088
institution Universiti Utara Malaysia
language English
last_indexed 2024-07-04T06:25:28Z
publishDate 2012
publisher Universiti Utara Malaysia Press
record_format dspace
spelling uum-24088 2018-05-06T23:42:45Z https://repo.uum.edu.my/id/eprint/24088/ Projecting named entity tags from a resource rich language to a resource poor language Zamin, Norshuhani Oxley, Alan Abu Bakar, Zainab QA75 Electronic computers. Computer science Named Entities (NE) are the prominent entities appearing in textual documents. Automatic classification of NE in a textual corpus is a vital process in Information Extraction and Information Retrieval research. Named Entity Recognition (NER) is the identification of words in text that correspond to a pre-defined taxonomy such as person, organization, location, date, time, etc. This article focuses on the person (PER), organization (ORG) and location (LOC) entities for a Malay journalistic corpus of terrorism. A projection algorithm, using the Dice Coefficient function and bigram scoring method with domain-specific rules, is suggested to map the NE information from the English corpus to the Malay corpus of terrorism. The English corpus is the translated version of the Malay corpus. Hence, these two corpora are treated as parallel corpora. The method computes the string similarity between the English words and the list of available lexemes in a pre-built lexicon that approximates the best NE mapping. The algorithm has been effectively evaluated using our own terrorism-tagged corpus; it achieved satisfactory results in terms of precision, recall, and F-measure. An evaluation of the selected open source NER tool for English is also presented. Universiti Utara Malaysia Press 2012 Article PeerReviewed application/pdf en https://repo.uum.edu.my/id/eprint/24088/1/J%20ICT%2012%202013%20121%E2%80%93146.pdf Zamin, Norshuhani and Oxley, Alan and Abu Bakar, Zainab (2012) Projecting named entity tags from a resource rich language to a resource poor language. Journal of Information and Communication Technology, 11. pp. 121-146.
ISSN 2180-3862 http://jict.uum.edu.my/index.php/previous-issues/141-journal-of-information-and-communication-technology-jict-vol-12-2013
spellingShingle QA75 Electronic computers. Computer science
Zamin, Norshuhani
Oxley, Alan
Abu Bakar, Zainab
Projecting named entity tags from a resource rich language to a resource poor language
title Projecting named entity tags from a resource rich language to a resource poor language
topic QA75 Electronic computers. Computer science
url https://repo.uum.edu.my/id/eprint/24088/1/J%20ICT%2012%202013%20121%E2%80%93146.pdf