Question Answering Versus Named Entity Recognition for Extracting Unknown Datasets

Dataset mention extraction is a difficult problem due to the unstructured nature of text, the sparsity of dataset mentions, and the various ways the same dataset can be mentioned. Extracting unknown dataset mentions that are not part of the model's training data is even harder. We address this challenge in two ways. First, we consider a two-step approach in which a binary classifier first selects positive contexts, i.e., sentences that contain a dataset mention. We consider multiple transformer-based models and strong baselines for this task. Subsequently, the dataset is extracted from the positive context. Second, we consider a one-step approach that directly detects and extracts a possible dataset mention. For the extraction of datasets, we consider transformer models in named entity recognition (NER) mode. We contrast NER with the transformers' capabilities for question answering (QA). We use the Coleridge Initiative "Show US the Data" dataset consisting of 14.3k scientific papers with about 35k mentions of datasets. We found that using transformers in QA mode is a better choice than NER for extracting unknown datasets. The rationale is that detecting new datasets is an out-of-vocabulary task, i.e., the dataset name has not been seen even once during training. Comparing the two-step versus the one-step approach, we found contrasting strengths. A two-step dataset extraction using an MLP for filtering and RoBERTa in QA mode extracts more dataset mentions than a one-step system, but at the cost of a lower F1-score of 62.7%. A one-step extraction with DeBERTa in QA achieves the highest F1-score of 92.88% at the cost of missing dataset mentions. We recommend the one-step approach when accuracy is more important, and the two-step approach when a postprocessing mechanism for the extracted dataset mentions is available, e.g., a manual check. The source code is available at https://github.com/yousef-younes/dataset_mention_extraction.
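
To make the described workflow concrete, the following is a minimal sketch of the two-step idea using the Hugging Face transformers library. It is an illustration under assumptions, not the authors' implementation (see their repository for that): the checkpoint names, the question wording, and the label handling are placeholders.

# Minimal sketch of the two-step pipeline described in the abstract, written
# with the Hugging Face `transformers` library. The checkpoints, the question
# wording, and the label handling are placeholder assumptions; they are not
# the models or prompts used in the paper.
from transformers import pipeline

# Step 1 (two-step approach only): a binary classifier selects "positive
# contexts", i.e., sentences likely to contain a dataset mention. The paper
# compares an MLP and several transformer baselines for this filter; a generic
# sentiment checkpoint stands in here purely to show the mechanics.
context_filter = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# Step 2: a transformer in question-answering (QA) mode extracts the dataset
# name as an answer span from the positive context.
extractor = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2",
)

sentence = (
    "We train and evaluate our models on the Coleridge Initiative "
    "'Show US the Data' corpus of scientific papers."
)

# In the one-step variant, the filter is skipped and QA (or NER) is applied
# to every sentence directly.
prediction = context_filter(sentence)[0]
if prediction["label"] == "POSITIVE":  # label set depends on the fine-tuned filter
    answer = extractor(
        question="Which dataset is mentioned in this text?",
        context=sentence,
    )
    print(answer["answer"], round(answer["score"], 3))

In the one-step setting the filter is dropped and the extractor runs over every sentence, which, per the abstract, trades some recall of mentions for a higher F1-score.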

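For comparison, below is a sketch of the NER-mode extraction that the paper contrasts with QA mode, again with placeholder components: a generic NER checkpoint emits labels such as ORG or MISC rather than a dataset tag, so it only illustrates the token-classification mechanics.

# Sketch of the NER-mode alternative that the paper contrasts with QA mode.
# "dslim/bert-base-NER" is a generic named-entity tagger used as a placeholder;
# the paper fine-tunes transformers to tag dataset mentions specifically.
from transformers import pipeline

ner_tagger = pipeline(
    "token-classification",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple",  # merge word pieces into whole entity spans
)

sentence = "We evaluate on the National Education Longitudinal Study (NELS)."

# With aggregation enabled, each result carries 'entity_group', 'word',
# 'score', 'start', and 'end'.
for entity in ner_tagger(sentence):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))

As the abstract argues, a NER tagger must label surface forms it has effectively never seen marked as datasets, whereas QA extracts whichever span best answers the question, which is why QA generalizes better to unknown dataset names.
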
Bibliographic Details
Main Authors: Yousef Younes (GESIS – Leibniz Institute for the Social Sciences, Cologne, Germany; ORCID: 0000-0003-1271-3633), Ansgar Scherp (Data Science and Big Data Analytics, University of Ulm, Ulm, Germany; ORCID: 0000-0002-2653-9245)
Format: Article
Language: English
Published: IEEE, 2023-01-01
Series: IEEE Access, Vol. 11, pp. 92775–92787
DOI: 10.1109/ACCESS.2023.3309148
ISSN: 2169-3536
Subjects: Binary text classification; dataset mentions; named entity recognition; question answering
Online Access: https://ieeexplore.ieee.org/document/10231147/