Question Answering Versus Named Entity Recognition for Extracting Unknown Datasets
Dataset mention extraction is a difficult problem due to the unstructured nature of text, the sparsity of dataset mentions, and the various ways the same dataset can be mentioned. Extracting unknown dataset mentions that are not part of the model's training data is even harder. We address this...
Main Authors: | Yousef Younes, Ansgar Scherp |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2023-01-01 |
Series: | IEEE Access |
Subjects: | Binary text classification; dataset mentions; named entity recognition; question answering |
Online Access: | https://ieeexplore.ieee.org/document/10231147/ |
_version_ | 1827824212897693696 |
---|---|
author | Yousef Younes; Ansgar Scherp
author_facet | Yousef Younes; Ansgar Scherp
author_sort | Yousef Younes |
collection | DOAJ |
description | Dataset mention extraction is a difficult problem due to the unstructured nature of text, the sparsity of dataset mentions, and the various ways the same dataset can be mentioned. Extracting unknown dataset mentions that are not part of the model's training data is even harder. We address this challenge in two ways. First, we consider a two-step approach in which a binary classifier identifies positive contexts, i.e., detects sentences with a dataset mention. We consider multiple transformer-based models and strong baselines for this task. Subsequently, the dataset is extracted from the positive context. Second, we consider a one-step approach that directly aims to detect and extract a possible dataset mention. For the extraction of datasets, we consider transformer models in named entity recognition (NER) mode. We contrast NER with the transformers’ capabilities for question answering (QA). We use the Coleridge Initiative “Show US the Data” dataset, which consists of 14.3k scientific papers with about 35k mentions of datasets. We found that using transformers in QA mode is a better choice than NER for extracting unknown datasets. The rationale is that detecting new datasets is an out-of-vocabulary task, i.e., the dataset name has never been seen during training. Comparing the two-step versus the one-step approach, we found contrasting strengths. A two-step dataset extraction using an MLP for filtering and RoBERTa in QA mode extracts more dataset mentions than a one-step system, but at the cost of a lower F1-score of 62.7%. A one-step extraction with DeBERTa in QA mode achieves the highest F1-score of 92.88% at the cost of missing dataset mentions. We recommend the one-step approach when accuracy is more important, and the two-step approach when there is a postprocessing mechanism for the extracted dataset mentions, e.g., a manual check. The source code is available at https://github.com/yousef-younes/dataset_mention_extraction. (An illustrative sketch of the QA-based extraction follows this record.)
first_indexed | 2024-03-12T02:23:42Z |
format | Article |
id | doaj.art-0621c3cc6bb44898a20e8681e62c103e |
institution | Directory Open Access Journal |
issn | 2169-3536 |
language | English |
last_indexed | 2024-03-12T02:23:42Z |
publishDate | 2023-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj.art-0621c3cc6bb44898a20e8681e62c103e; 2023-09-05T23:01:27Z; eng; IEEE; IEEE Access; 2169-3536; 2023-01-01; vol. 11, pp. 92775-92787; DOI 10.1109/ACCESS.2023.3309148; IEEE document 10231147; Question Answering Versus Named Entity Recognition for Extracting Unknown Datasets; Yousef Younes (https://orcid.org/0000-0003-1271-3633), GESIS—Leibniz-Institute for the Social Sciences, Cologne, Germany; Ansgar Scherp (https://orcid.org/0000-0002-2653-9245), Data Science and Big Data Analytics, University of Ulm, Ulm, Germany; https://ieeexplore.ieee.org/document/10231147/; Binary text classification; dataset mentions; named entity recognition; question answering
spellingShingle | Yousef Younes; Ansgar Scherp; Question Answering Versus Named Entity Recognition for Extracting Unknown Datasets; IEEE Access; Binary text classification; dataset mentions; named entity recognition; question answering
title | Question Answering Versus Named Entity Recognition for Extracting Unknown Datasets |
title_full | Question Answering Versus Named Entity Recognition for Extracting Unknown Datasets |
title_fullStr | Question Answering Versus Named Entity Recognition for Extracting Unknown Datasets |
title_full_unstemmed | Question Answering Versus Named Entity Recognition for Extracting Unknown Datasets |
title_short | Question Answering Versus Named Entity Recognition for Extracting Unknown Datasets |
title_sort | question answering versus named entity recognition for extracting unknown datasets |
topic | Binary text classification; dataset mentions; named entity recognition; question answering
url | https://ieeexplore.ieee.org/document/10231147/ |
work_keys_str_mv | AT yousefyounes questionansweringversusnamedentityrecognitionforextractingunknowndatasets AT ansgarscherp questionansweringversusnamedentityrecognitionforextractingunknowndatasets |
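The following is a minimal illustrative sketch, in Python, of the QA-based extraction described in the abstract above: a binary sentence filter that keeps positive contexts, followed by an extractive question-answering model whose predicted answer span is taken as the dataset mention. It is not the authors' released implementation (see the GitHub repository linked in the record); the Hugging Face checkpoints, the question wording, and the thresholds are assumptions made for demonstration, and in practice both models would be fine-tuned on the Coleridge Initiative corpus.

```python
# Illustrative sketch only; the authors' actual code is at
# https://github.com/yousef-younes/dataset_mention_extraction.
# Model names, the question wording, and the thresholds below are assumptions.
from transformers import pipeline

# Step 1 (two-step variant): a binary classifier that keeps "positive contexts",
# i.e., sentences likely to contain a dataset mention. The SST-2 checkpoint is
# only a stand-in; one would fine-tune a classifier on the Coleridge contexts.
context_filter = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",
)

# Step 2: an extractive QA model; the predicted answer span is taken as the
# dataset mention. An off-the-shelf SQuAD2 checkpoint serves as a placeholder.
qa_extractor = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2",
)

QUESTION = "Which dataset is mentioned in this text?"


def extract_dataset_mentions(sentences, filter_threshold=0.5, qa_threshold=0.1):
    """Return (sentence, extracted span, QA score) triples for likely mentions."""
    mentions = []
    for sentence in sentences:
        # Dropping this filtering call turns the pipeline into the one-step variant.
        verdict = context_filter(sentence)[0]
        if verdict["label"] != "POSITIVE" or verdict["score"] < filter_threshold:
            continue
        answer = qa_extractor(question=QUESTION, context=sentence)
        if answer["score"] >= qa_threshold and answer["answer"].strip():
            mentions.append((sentence, answer["answer"], answer["score"]))
    return mentions


if __name__ == "__main__":
    examples = [
        "We train our models on the Coleridge Initiative Show US the Data corpus.",
        "The weather was pleasant throughout the conference week.",
    ]
    for sentence, span, score in extract_dataset_mentions(examples):
        print(f"{span!r} (QA score {score:.2f}) from: {sentence}")
```

Skipping the filtering step turns this into the one-step variant; keeping it mirrors the two-step pipeline that the abstract reports as extracting more mentions, albeit at a lower F1-score.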