Question Answering Versus Named Entity Recognition for Extracting Unknown Datasets
Dataset mention extraction is a difficult problem due to the unstructured nature of text, the sparsity of dataset mentions, and the various ways the same dataset can be mentioned. Extracting unknown dataset mentions that are not part of the model's training data is even harder. We address this...
Main Authors: | Yousef Younes, Ansgar Scherp |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2023-01-01 |
Series: | IEEE Access |
Subjects: | Binary text classification; dataset mentions; named entity recognition; question answering |
Online Access: | https://ieeexplore.ieee.org/document/10231147/ |
_version_ | 1827824212897693696 |
---|---|
author | Yousef Younes; Ansgar Scherp
author_facet | Yousef Younes; Ansgar Scherp
author_sort | Yousef Younes |
collection | DOAJ |
description | Dataset mention extraction is a difficult problem due to the unstructured nature of text, the sparsity of dataset mentions, and the various ways the same dataset can be mentioned. Extracting unknown dataset mentions that are not part of the model's training data is even harder. We address this challenge in two ways. First, we consider a two-step approach in which a binary classifier identifies positive contexts, i.e., detects sentences with a dataset mention. We consider multiple transformer-based models and strong baselines for this task. Subsequently, the dataset is extracted from the positive context. Second, we consider a one-step approach that directly aims to detect and extract a possible dataset mention. For the extraction of datasets, we consider transformer models in named entity recognition (NER) mode. We contrast NER with the transformers’ capabilities for question answering (QA). We use the Coleridge Initiative “Show US the Data” dataset, which consists of 14.3k scientific papers with about 35k mentions of datasets. We found that using transformers in QA mode is a better choice than NER for extracting unknown datasets. The rationale is that detecting new datasets is an out-of-vocabulary task, i.e., the dataset name has never been seen during training. Comparing the two-step versus the one-step approach, we found contrasting strengths. A two-step dataset extraction using an MLP for filtering and RoBERTa in QA mode extracts more dataset mentions than a one-step system, but at the cost of a lower F1-score of 62.7%. A one-step extraction with DeBERTa in QA mode achieves the highest F1-score of 92.88% at the cost of missing dataset mentions. We recommend the one-step approach when accuracy is more important, and the two-step approach when there is a postprocessing mechanism for the extracted dataset mentions, e.g., a manual check. The source code is available at https://github.com/yousef-younes/dataset_mention_extraction. (An illustrative sketch of the QA-based extraction follows this record.)
first_indexed | 2024-03-12T02:23:42Z |
format | Article |
id | doaj.art-0621c3cc6bb44898a20e8681e62c103e |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-12T02:23:42Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-0621c3cc6bb44898a20e8681e62c103e; 2023-09-05T23:01:27Z; eng; IEEE; IEEE Access; 2169-3536; 2023-01-01; vol. 11, pp. 92775-92787; DOI 10.1109/ACCESS.2023.3309148; IEEE document 10231147; Question Answering Versus Named Entity Recognition for Extracting Unknown Datasets; Yousef Younes (https://orcid.org/0000-0003-1271-3633), GESIS—Leibniz-Institute for the Social Sciences, Cologne, Germany; Ansgar Scherp (https://orcid.org/0000-0002-2653-9245), Data Science and Big Data Analytics, University of Ulm, Ulm, Germany; https://ieeexplore.ieee.org/document/10231147/; Binary text classification; dataset mentions; named entity recognition; question answering
spellingShingle | Yousef Younes; Ansgar Scherp; Question Answering Versus Named Entity Recognition for Extracting Unknown Datasets; IEEE Access; Binary text classification; dataset mentions; named entity recognition; question answering
title | Question Answering Versus Named Entity Recognition for Extracting Unknown Datasets |
title_full | Question Answering Versus Named Entity Recognition for Extracting Unknown Datasets |
title_fullStr | Question Answering Versus Named Entity Recognition for Extracting Unknown Datasets |
title_full_unstemmed | Question Answering Versus Named Entity Recognition for Extracting Unknown Datasets |
title_short | Question Answering Versus Named Entity Recognition for Extracting Unknown Datasets |
title_sort | question answering versus named entity recognition for extracting unknown datasets |
topic | Binary text classification; dataset mentions; named entity recognition; question answering
url | https://ieeexplore.ieee.org/document/10231147/ |
work_keys_str_mv | AT yousefyounes questionansweringversusnamedentityrecognitionforextractingunknowndatasets AT ansgarscherp questionansweringversusnamedentityrecognitionforextractingunknowndatasets |
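The following is a minimal illustrative sketch, in Python, of the QA-based extraction described in the abstract above: a binary sentence filter that keeps positive contexts, followed by an extractive question-answering model whose predicted answer span is taken as the dataset mention. It is not the authors' released implementation (see the GitHub repository linked in the record); the Hugging Face checkpoints, the question wording, and the thresholds are assumptions made for demonstration, and in practice both models would be fine-tuned on the Coleridge Initiative corpus.

```python
# Illustrative sketch only; the authors' actual code is at
# https://github.com/yousef-younes/dataset_mention_extraction.
# Model names, the question wording, and the thresholds below are assumptions.
from transformers import pipeline

# Step 1 (two-step variant): a binary classifier that keeps "positive contexts",
# i.e., sentences likely to contain a dataset mention. The SST-2 checkpoint is
# only a stand-in; one would fine-tune a classifier on the Coleridge contexts.
context_filter = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# Step 2: an extractive QA model; the predicted answer span is taken as the
# dataset mention. An off-the-shelf SQuAD2 checkpoint serves as a placeholder.
qa_extractor = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2",
)

QUESTION = "Which dataset is mentioned in this text?"


def extract_dataset_mentions(sentences, filter_threshold=0.5, qa_threshold=0.1):
    """Return (sentence, extracted span, QA score) triples for likely mentions."""
    mentions = []
    for sentence in sentences:
        # Dropping this filtering call turns the pipeline into the one-step variant.
        verdict = context_filter(sentence)[0]
        if verdict["label"] != "POSITIVE" or verdict["score"] < filter_threshold:
            continue
        answer = qa_extractor(question=QUESTION, context=sentence)
        if answer["score"] >= qa_threshold and answer["answer"].strip():
            mentions.append((sentence, answer["answer"], answer["score"]))
    return mentions


if __name__ == "__main__":
    examples = [
        "We train our models on the Coleridge Initiative Show US the Data corpus.",
        "The weather was pleasant throughout the conference week.",
    ]
    for sentence, span, score in extract_dataset_mentions(examples):
        print(f"{span!r} (QA score {score:.2f}) from: {sentence}")
```

Skipping the filtering step turns this into the one-step variant; keeping it mirrors the two-step pipeline that the abstract reports as extracting more mentions, albeit at a lower F1-score.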