DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect

Bibliographic Details
Main Authors: Hanane Nour Moussa, Asmaa Mourhir
Format: Article
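Language: English
Published: Elsevier 2023-06-01
Series: Data in Brief
Subjects: Natural language processing; Text mining; Named entity recognition; Dialectal Arabic; Corpus; BIO
Online Access: http://www.sciencedirect.com/science/article/pii/S2352340923003530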
author Hanane Nour Moussa
Asmaa Mourhir
collection DOAJ
description DarNERcorp is a manually annotated named entity recognition (NER) dataset in the Moroccan dialect, also called Darija. The dataset consists of 65,905 tokens, each tagged according to the BIO scheme. Named entities account for 13.8% of the tokens and span four categories: person, location, organization, and miscellaneous. The data were scraped from the Moroccan Dialect section of Wikipedia, then processed and annotated using open-source libraries and tools. The dataset is useful to the Arabic natural language processing (NLP) community because it addresses the lack of annotated corpora for dialectal Arabic. It can be used to train and evaluate named entity recognition systems for dialectal and mixed Arabic.
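As an illustration of the BIO scheme mentioned in the description (B marks the first token of an entity, I a continuation token, O a token outside any entity), the following is a minimal Python sketch of how such tags could be read and summarized. It rests on assumptions that this record does not confirm: a one-token-per-line "token tag" layout, the tag labels B-PER, I-PER and B-LOC, and the sample tokens, which are invented for the example and not taken from DarNERcorp. The function names are hypothetical.

```python
# Invented sample in an ASSUMED "token tag" layout; not taken from DarNERcorp.
SAMPLE = [
    "Fatima B-PER",
    "Zahra I-PER",
    "sakna O",
    "f O",
    "Rabat B-LOC",
    "",  # a blank line is assumed to separate sentences
]


def read_bio(lines):
    """Parse 'token tag' lines into (token, tag) pairs, skipping blank lines."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last run of whitespace so the tag is always the final field.
        token, tag = line.rsplit(maxsplit=1)
        pairs.append((token, tag))
    return pairs


def entity_token_share(pairs):
    """Fraction of tokens whose tag is not 'O' (i.e. tokens inside an entity)."""
    if not pairs:
        return 0.0
    return sum(1 for _, tag in pairs if tag != "O") / len(pairs)


def extract_entities(pairs):
    """Group a B- tag and the matching I- tags that follow it into one entity."""
    entities = []
    tokens, category = [], None
    for token, tag in pairs:
        if tag.startswith("B-"):
            # A new entity starts; close the previous one if it is still open.
            if tokens:
                entities.append((" ".join(tokens), category))
            tokens, category = [token], tag[2:]
        elif tag.startswith("I-") and tokens and tag[2:] == category:
            tokens.append(token)
        else:
            # 'O' or a stray I- tag: close any open entity and ignore the token.
            if tokens:
                entities.append((" ".join(tokens), category))
            tokens, category = [], None
    if tokens:
        entities.append((" ".join(tokens), category))
    return entities


if __name__ == "__main__":
    pairs = read_bio(SAMPLE)
    print(entity_token_share(pairs))
    print(extract_entities(pairs))
```

Applied to the full dataset, the same share calculation would be expected to give about 0.138, which corresponds to roughly 9,100 of the 65,905 tokens.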
first_indexed 2024-03-13T03:58:00Z
format Article
id doaj.art-6e3b0ccd275045b3bad555dad1ca2625
institution Directory Open Access Journal
issn 2352-3409
language English
last_indexed 2024-03-13T03:58:00Z
publishDate 2023-06-01
publisher Elsevier
record_format Article
series Data in Brief
spelling doaj.art-6e3b0ccd275045b3bad555dad1ca2625; 2023-06-22T05:04:05Z; eng; Elsevier; Data in Brief; 2352-3409; 2023-06-01; 48; 109234
DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect
Hanane Nour Moussa (0), Asmaa Mourhir (1)
(0) Corresponding author.; School of Science and Engineering, Al Akhawayn University in Ifrane, P.O. Box 104, Hassan II Avenue, Ifrane 53000, Morocco
(1) School of Science and Engineering, Al Akhawayn University in Ifrane, P.O. Box 104, Hassan II Avenue, Ifrane 53000, Morocco
http://www.sciencedirect.com/science/article/pii/S2352340923003530
Natural language processing; Text mining; Named entity recognition; Dialectal Arabic; Corpus; BIO
title DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect
topic Natural language processing
Text mining
Named entity recognition
Dialectal Arabic
Corpus
BIO
url http://www.sciencedirect.com/science/article/pii/S2352340923003530