Named Entity Recognition Using Conditional Random Fields
Named entity recognition (NER) is an important task in natural language processing, as it is widely featured as a key information extraction sub-task with numerous application areas. A plethora of attempts was made for NER detection in Western and Asian languages. However, little effort has been mad...
Main Authors: | Wahab Khan, Ali Daud, Khurram Shahzad, Tehmina Amjad, Ameen Banjar, Heba Fasihuddin |
---|---|
Format: | Article |
Language: | English |
Published: | MDPI AG, 2022-06-01 |
Series: | Applied Sciences |
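Subjects: | natural language processing; information filtering; information extraction; machine learning; classification algorithms; named entity recognition |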
Online Access: | https://www.mdpi.com/2076-3417/12/13/6391 |
_version_ | 1797481048502697984 |
---|---|
author | Wahab Khan; Ali Daud; Khurram Shahzad; Tehmina Amjad; Ameen Banjar; Heba Fasihuddin |
author_facet | Wahab Khan; Ali Daud; Khurram Shahzad; Tehmina Amjad; Ameen Banjar; Heba Fasihuddin |
author_sort | Wahab Khan |
collection | DOAJ |
description | Named entity recognition (NER) is an important task in natural language processing, as it is widely featured as a key information extraction sub-task with numerous application areas. A plethora of attempts was made for NER detection in Western and Asian languages. However, little effort has been made to develop techniques for the Urdu language, which is a prominent South Asian language with hundreds of millions of speakers across the globe. NER in Urdu is considered a hard problem owing to several reasons, including the paucity of large, annotated datasets; an inaccurate tokenizer; and the absence of capitalization in the Urdu language. To this end, this study proposed a conditional-random-field-based technique with both language-dependent and language-independent features, such as part-of-speech tags and context windows of words, respectively. As a second contribution, we developed an Urdu NER dataset (UNER-I) in which a large number of NE types were manually annotated. To evaluate the effectiveness of the proposed approach, as well as the usefulness of the dataset, experiments were performed using the dataset we developed and an existing dataset. The results of the experiments showed that our proposed technique outperformed the baseline technique for both datasets by improving the F1 scores by 1.5% to 3%. Furthermore, the results demonstrated that the enhanced dataset was useful for learning and prediction in a supervised learning approach. |
first_indexed | 2024-03-09T22:08:54Z |
format | Article |
id | doaj.art-e3c871d0711f4e178f9ae3391930e3fd |
institution | Directory of Open Access Journals |
issn | 2076-3417 |
language | English |
last_indexed | 2024-03-09T22:08:54Z |
publishDate | 2022-06-01 |
publisher | MDPI AG |
record_format | Article |
series | Applied Sciences |
spelling | doaj.art-e3c871d0711f4e178f9ae3391930e3fd; 2023-11-23T19:35:49Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2022-06-01; 12; 13; 6391; 10.3390/app12136391; Named Entity Recognition Using Conditional Random Fields; Wahab Khan (Department of Computer Science, University of Science and Technology, Bannu 28100, Pakistan); Ali Daud (Department of Information Systems and Technology, College of Computer Science and Engineering, University of Jeddah, Jeddah 21959, Saudi Arabia); Khurram Shahzad (Department of Data Science, University of the Punjab, Lahore 54000, Pakistan); Tehmina Amjad (Department of Computer Science and Software Engineering, International Islamic University Islamabad, Islamabad 44000, Pakistan); Ameen Banjar (Department of Information Systems and Technology, College of Computer Science and Engineering, University of Jeddah, Jeddah 21959, Saudi Arabia); Heba Fasihuddin (Department of Information Systems and Technology, College of Computer Science and Engineering, University of Jeddah, Jeddah 21959, Saudi Arabia); Named entity recognition (NER) is an important task in natural language processing, as it is widely featured as a key information extraction sub-task with numerous application areas. A plethora of attempts was made for NER detection in Western and Asian languages. However, little effort has been made to develop techniques for the Urdu language, which is a prominent South Asian language with hundreds of millions of speakers across the globe. NER in Urdu is considered a hard problem owing to several reasons, including the paucity of large, annotated datasets; an inaccurate tokenizer; and the absence of capitalization in the Urdu language. To this end, this study proposed a conditional-random-field-based technique with both language-dependent and language-independent features, such as part-of-speech tags and context windows of words, respectively. As a second contribution, we developed an Urdu NER dataset (UNER-I) in which a large number of NE types were manually annotated. To evaluate the effectiveness of the proposed approach, as well as the usefulness of the dataset, experiments were performed using the dataset we developed and an existing dataset. The results of the experiments showed that our proposed technique outperformed the baseline technique for both datasets by improving the F1 scores by 1.5% to 3%. Furthermore, the results demonstrated that the enhanced dataset was useful for learning and prediction in a supervised learning approach.; https://www.mdpi.com/2076-3417/12/13/6391; natural language processing; information filtering; information extraction; machine learning; classification algorithms; named entity recognition |
title | Named Entity Recognition Using Conditional Random Fields |
topic | natural language processing; information filtering; information extraction; machine learning; classification algorithms; named entity recognition |
url | https://www.mdpi.com/2076-3417/12/13/6391 |
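
The description above mentions conditional-random-field features built from part-of-speech tags and context windows of words. The following is a minimal sketch of what such per-token features might look like; it is not the authors' implementation. The function names `token_features` and `sentence_features`, the feature keys, the window size, and the romanized example sentence with its tags are all assumptions made for illustration. The output is a list of feature dictionaries per sentence, a format that CRF toolkits such as sklearn-crfsuite accept as input.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def token_features(sentence: List[Token], i: int, window: int = 2) -> Dict[str, object]:
    """Build features for token i: its word, its POS tag, and its neighbours.

    Hypothetical feature layout, chosen only to illustrate the idea of
    combining POS tags with a context window of surrounding words.
    """
    word, pos = sentence[i]
    feats: Dict[str, object] = {"bias": 1.0, "word": word, "pos": pos}

    # Mark sentence boundaries explicitly.
    if i == 0:
        feats["BOS"] = True
    if i == len(sentence) - 1:
        feats["EOS"] = True

    # Add the word and POS tag of each neighbour inside the window.
    for offset in range(-window, window + 1):
        j = i + offset
        if offset == 0 or j < 0 or j >= len(sentence):
            continue
        ctx_word, ctx_pos = sentence[j]
        feats[f"word[{offset:+d}]"] = ctx_word
        feats[f"pos[{offset:+d}]"] = ctx_pos
    return feats


def sentence_features(sentence: List[Token], window: int = 2) -> List[Dict[str, object]]:
    """Feature dictionaries for every token in one sentence."""
    return [token_features(sentence, i, window) for i in range(len(sentence))]


# Illustrative romanized sentence; the words and tags are assumptions.
example = [("Ali", "NNP"), ("Lahore", "NNP"), ("mein", "PSP"),
           ("rehta", "VB"), ("hai", "AUX")]
features = sentence_features(example, window=1)
```

Because Urdu has no capitalization, a sketch like this relies on neighbouring words and POS tags instead of the casing features commonly used for English NER.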