DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect
DarNERcorp is a manually annotated named entity recognition (NER) dataset in the Moroccan dialect, also called Darija. The dataset consists of 65,905 tokens and their corresponding tags according to the BIO scheme. 13.8% of the tokens are named entities spanning four categories: person, location, organi...
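As a rough illustration of what BIO tagging encodes, the sketch below collapses a tag sequence into entity spans and computes the share of tokens that belong to an entity (the kind of figure the 13.8% above reports). The tag names `PER`, `LOC`, and `ORG`, the sample sequence, and both function names are assumptions made for this example; the record does not state the dataset's exact label strings or file layout.

```python
from typing import List, Tuple


def bio_to_spans(tags: List[str]) -> List[Tuple[str, int, int]]:
    """Collapse BIO tags into (label, start, end) spans; end is exclusive."""
    spans: List[Tuple[str, int, int]] = []
    start = None
    label = None
    for i, tag in enumerate(tags):
        # An open span continues only on an "I-" tag carrying the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((label, start, i))
            start, label = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            # A stray "I-" with no open span is treated as a span start.
            start, label = i, tag[2:]
    if start is not None:
        spans.append((label, start, len(tags)))
    return spans


def entity_token_share(tags: List[str]) -> float:
    """Fraction of tokens whose tag is anything other than "O"."""
    if not tags:
        return 0.0
    return sum(tag != "O" for tag in tags) / len(tags)


# Hypothetical tag sequence for a ten-token sentence (labels are assumed).
sample_tags = ["B-PER", "I-PER", "O", "B-LOC", "O", "O",
               "B-ORG", "I-ORG", "I-ORG", "O"]
sample_spans = bio_to_spans(sample_tags)
sample_share = entity_token_share(sample_tags)
```

Here `B-` opens an entity, `I-` continues it, and `O` marks a token outside any entity, so a multi-token name is recovered as a single span.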
Main Authors: | Hanane Nour Moussa, Asmaa Mourhir |
---|---|
Format: | Article |
Language: | English |
Published: | Elsevier 2023-06-01 |
Series: | Data in Brief |
Subjects: | Natural language processing, Text mining, Named entity recognition, Dialectal Arabic, Corpus, BIO |
Online Access: | http://www.sciencedirect.com/science/article/pii/S2352340923003530 |
_version_ | 1797798008511791104 |
---|---|
author | Hanane Nour Moussa; Asmaa Mourhir |
author_facet | Hanane Nour Moussa; Asmaa Mourhir |
author_sort | Hanane Nour Moussa |
collection | DOAJ |
description | DarNERcorp is a manually annotated named entity recognition (NER) dataset in the Moroccan dialect, also called Darija. The dataset consists of 65,905 tokens and their corresponding tags according to the BIO scheme. 13.8% of the tokens are named entities spanning four categories: person, location, organization, and miscellaneous. The data were scraped from the Moroccan Dialect section of Wikipedia and processed and annotated using open-source libraries and tools. The data are useful for the Arabic natural language processing (NLP) community because they address the lack of annotated dialectal Arabic corpora. This dataset can be used to train and evaluate named entity recognition systems in dialectal and mixed Arabic. |
first_indexed | 2024-03-13T03:58:00Z |
format | Article |
id | doaj.art-6e3b0ccd275045b3bad555dad1ca2625 |
institution | Directory Open Access Journal |
issn | 2352-3409 |
language | English |
last_indexed | 2024-03-13T03:58:00Z |
publishDate | 2023-06-01 |
publisher | Elsevier |
record_format | Article |
series | Data in Brief |
spelling | doaj.art-6e3b0ccd275045b3bad555dad1ca2625; 2023-06-22T05:04:05Z; eng; Elsevier; Data in Brief; 2352-3409; 2023-06-01; 48; 109234; DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect; Hanane Nour Moussa (0), Asmaa Mourhir (1); (0) Corresponding author; School of Science and Engineering, Al Akhawayn University in Ifrane, P.O. Box 104, Hassan II Avenue, Ifrane 53000, Morocco; (1) School of Science and Engineering, Al Akhawayn University in Ifrane, P.O. Box 104, Hassan II Avenue, Ifrane 53000, Morocco; DarNERcorp is a manually annotated named entity recognition (NER) dataset in the Moroccan dialect, also called Darija. The dataset consists of 65,905 tokens and their corresponding tags according to the BIO scheme. 13.8% of the tokens are named entities spanning four categories: person, location, organization, and miscellaneous. The data were scraped from the Moroccan Dialect section of Wikipedia and processed and annotated using open-source libraries and tools. The data are useful for the Arabic natural language processing (NLP) community because they address the lack of annotated dialectal Arabic corpora. This dataset can be used to train and evaluate named entity recognition systems in dialectal and mixed Arabic.; http://www.sciencedirect.com/science/article/pii/S2352340923003530; Natural language processing; Text mining; Named entity recognition; Dialectal Arabic; Corpus; BIO |
spellingShingle | Hanane Nour Moussa Asmaa Mourhir DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect Data in Brief Natural language processing Text mining Named entity recognition Dialectal Arabic Corpus BIO |
title | DarNERcorp: An annotated named entity recognition dataset in the Moroccan dialect |
title_sort | darnercorp an annotated named entity recognition dataset in the moroccan dialect |
topic | Natural language processing; Text mining; Named entity recognition; Dialectal Arabic; Corpus; BIO |
url | http://www.sciencedirect.com/science/article/pii/S2352340923003530 |