DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect

DarNERcorp is a manually annotated named entity recognition (NER) dataset in the Moroccan dialect, also called Darija. The dataset consists of 65,905 tokens and their corresponding tags according to the BIO scheme. Of these tokens, 13.8% are named entities spanning four categories: person, location, organi...


Bibliographic Details
Main Authors:	Hanane Nour Moussa, Asmaa Mourhir
Format:	Article
Language:	English
Published:	Elsevier 2023-06-01
Series:	Data in Brief
Subjects:	Natural language processing, Text mining, Named entity recognition, Dialectal Arabic, Corpus, BIO
Online Access:	http://www.sciencedirect.com/science/article/pii/S2352340923003530

_version_	1797798008511791104
author	Hanane Nour Moussa, Asmaa Mourhir
collection	DOAJ
description	DarNERcorp is a manually annotated named entity recognition (NER) dataset in the Moroccan dialect, also called Darija. The dataset consists of 65,905 tokens and their corresponding tags according to the BIO scheme. Of these tokens, 13.8% are named entities spanning four categories: person, location, organization, and miscellaneous. The data were scraped from the Moroccan Dialect section of Wikipedia, then processed and annotated using open-source libraries and tools. The data are useful for the Arabic natural language processing (NLP) community because they address the lack of annotated dialectal Arabic corpora. This dataset can be used to train and evaluate named entity recognition systems in dialectal and mixed Arabic.
first_indexed	2024-03-13T03:58:00Z
format	Article
id	doaj.art-6e3b0ccd275045b3bad555dad1ca2625
institution	Directory of Open Access Journals
issn	2352-3409
language	English
last_indexed	2024-03-13T03:58:00Z
publishDate	2023-06-01
publisher	Elsevier
record_format	Article
series	Data in Brief
spelling	doaj.art-6e3b0ccd275045b3bad555dad1ca2625 | 2023-06-22T05:04:05Z | eng | Elsevier | Data in Brief | 2352-3409 | 2023-06-01 | 48 | 109234 | DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect | Hanane Nour Moussa (0), Asmaa Mourhir (1) | (0) Corresponding author. School of Science and Engineering, Al Akhawayn University in Ifrane, P.O. Box 104, Hassan II Avenue, Ifrane 53000, Morocco | (1) School of Science and Engineering, Al Akhawayn University in Ifrane, P.O. Box 104, Hassan II Avenue, Ifrane 53000, Morocco | DarNERcorp is a manually annotated named entity recognition (NER) dataset in the Moroccan dialect, also called Darija. The dataset consists of 65,905 tokens and their corresponding tags according to the BIO scheme. Of these tokens, 13.8% are named entities spanning four categories: person, location, organization, and miscellaneous. The data were scraped from the Moroccan Dialect section of Wikipedia, then processed and annotated using open-source libraries and tools. The data are useful for the Arabic natural language processing (NLP) community because they address the lack of annotated dialectal Arabic corpora. This dataset can be used to train and evaluate named entity recognition systems in dialectal and mixed Arabic. | http://www.sciencedirect.com/science/article/pii/S2352340923003530 | Natural language processing, Text mining, Named entity recognition, Dialectal Arabic, Corpus, BIO
title	DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect
topic	Natural language processing, Text mining, Named entity recognition, Dialectal Arabic, Corpus, BIO
url	http://www.sciencedirect.com/science/article/pii/S2352340923003530
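
The description in this record says each token carries a tag under the BIO scheme and that 13.8% of the 65,905 tokens (roughly 9,100) belong to named entities. As a rough illustration of what that means in practice, the sketch below groups BIO-tagged tokens into entity spans and computes the share of entity tokens. It is not the authors' tooling: the label strings PER, ORG and LOC, the function names, and the English toy sentence are all assumptions made for the example, and the record does not state the dataset's actual file layout or label spelling.

```python
from typing import List, Tuple


def extract_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_tokens: List[str] = []
    current_type = None

    def close_span() -> None:
        # Store the open span, if any, and reset the state.
        nonlocal current_tokens, current_type
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
        current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # "B-" always starts a new entity, closing any open one first.
            close_span()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # "I-" continues the open entity of the same type.
            current_tokens.append(token)
        else:
            # "O", or an "I-" tag with no matching open entity, ends the span.
            close_span()
    close_span()
    return entities


def entity_token_ratio(tags: List[str]) -> float:
    """Fraction of tokens whose tag is anything other than "O"."""
    if not tags:
        return 0.0
    return sum(tag != "O" for tag in tags) / len(tags)


# Made-up English toy sentence; the label names are assumed, not taken from the dataset.
example_tokens = ["Fatima", "visited", "Al", "Akhawayn", "University", "in", "Ifrane"]
example_tags = ["B-PER", "O", "B-ORG", "I-ORG", "I-ORG", "O", "B-LOC"]

if __name__ == "__main__":
    print(extract_entities(example_tokens, example_tags))
    print(f"{entity_token_ratio(example_tags):.1%} of tokens are entity tokens")
```

Run over a full corpus, the same ratio function would give the kind of figure quoted in the description, provided the tags are loaded as one string per token.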


Similar Items