Named Entity Recognition Using Conditional Random Fields

Named entity recognition (NER) is an important task in natural language processing, as it is widely featured as a key information extraction sub-task with numerous application areas. A plethora of attempts was made for NER detection in Western and Asian languages. However, little effort has been mad...


Bibliographic Details
Main Authors: Wahab Khan, Ali Daud, Khurram Shahzad, Tehmina Amjad, Ameen Banjar, Heba Fasihuddin
Format: Article
Language: English
Published: MDPI AG 2022-06-01
Series: Applied Sciences
Subjects:
Online Access: https://www.mdpi.com/2076-3417/12/13/6391
author Wahab Khan
Ali Daud
Khurram Shahzad
Tehmina Amjad
Ameen Banjar
Heba Fasihuddin
collection DOAJ
description Named entity recognition (NER) is an important task in natural language processing and a key information extraction sub-task with numerous application areas. Many approaches have been proposed for NER in Western and Asian languages. However, little effort has been made to develop techniques for Urdu, a prominent South Asian language with hundreds of millions of speakers across the globe. NER in Urdu is considered a hard problem for several reasons, including the paucity of large annotated datasets, an inaccurate tokenizer, and the absence of capitalization in Urdu. To this end, this study proposed a conditional-random-field-based technique with both language-dependent and language-independent features, such as part-of-speech tags and context windows of words, respectively. As a second contribution, we developed an Urdu NER dataset (UNER-I) in which a large number of NE types were manually annotated. To evaluate the effectiveness of the proposed approach, as well as the usefulness of the dataset, experiments were performed on the dataset we developed and on an existing dataset. The results showed that the proposed technique outperformed the baseline technique on both datasets, improving the F1 scores by 1.5% to 3%. Furthermore, the results demonstrated that the enhanced dataset was useful for learning and prediction in a supervised learning approach.
first_indexed 2024-03-09T22:08:54Z
format Article
id doaj.art-e3c871d0711f4e178f9ae3391930e3fd
institution Directory Open Access Journal
issn 2076-3417
language English
last_indexed 2024-03-09T22:08:54Z
publishDate 2022-06-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj.art-e3c871d0711f4e178f9ae3391930e3fd
2023-11-23T19:35:49Z
eng
MDPI AG
Applied Sciences
2076-3417
2022-06-01
Volume 12, Issue 13, Article 6391
10.3390/app12136391
Named Entity Recognition Using Conditional Random Fields
Wahab Khan (0): Department of Computer Science, University of Science and Technology, Bannu 28100, Pakistan
Ali Daud (1): Department of Information Systems and Technology, College of Computer Science and Engineering, University of Jeddah, Jeddah 21959, Saudi Arabia
Khurram Shahzad (2): Department of Data Science, University of the Punjab, Lahore 54000, Pakistan
Tehmina Amjad (3): Department of Computer Science and Software Engineering, International Islamic University Islamabad, Islamabad 44000, Pakistan
Ameen Banjar (4): Department of Information Systems and Technology, College of Computer Science and Engineering, University of Jeddah, Jeddah 21959, Saudi Arabia
Heba Fasihuddin (5): Department of Information Systems and Technology, College of Computer Science and Engineering, University of Jeddah, Jeddah 21959, Saudi Arabia
title Named Entity Recognition Using Conditional Random Fields
topic natural language processing
information filtering
information extraction
machine learning
classification algorithms
named entity recognition
url https://www.mdpi.com/2076-3417/12/13/6391