SEOSS-Queries - a software engineering dataset for text-to-SQL and question answering tasks

Stakeholders of software development projects have various information needs for making rational decisions during their daily work. Satisfying these needs requires substantial knowledge of where and how the relevant information is stored and consumes valuable time that is often not available. Easing...

Bibliographic Details
Main Authors:	Mihaela Todorova Tomova, Martin Hofmann, Patrick Mäder
Format:	Article
Language:	English
Published:	Elsevier 2022-06-01
Series:	Data in Brief
Subjects:	Software and systems requirement engineering; Text-to-SQL; Dataset; Question answering; Natural language processing
Online Access:	http://www.sciencedirect.com/science/article/pii/S2352340922004152

author	Mihaela Todorova Tomova, Martin Hofmann, Patrick Mäder
author_sort	Mihaela Todorova Tomova
collection	DOAJ
description	Stakeholders of software development projects have various information needs for making rational decisions during their daily work. Satisfying these needs requires substantial knowledge of where and how the relevant information is stored and consumes valuable time that is often not available. Easing the need for this knowledge is an ideal text-to-SQL benchmark problem, a field where public datasets are scarce and needed. We propose the SEOSS-Queries dataset consisting of natural language utterances and accompanying SQL queries extracted from previous studies, software projects, issue tracking tools, and through expert surveys to cover a large variety of information need perspectives. Our dataset consists of 1,162 English utterances translating into 166 SQL queries; each query has four precise utterances and three more general ones. Furthermore, the dataset contains 393,086 labeled utterances extracted from issue tracker comments. We provide pre-trained SQLNet and RatSQL baseline models for benchmark comparisons, a replication package facilitating a seamless application, and discuss various other tasks that may be solved and evaluated using the dataset. The whole dataset with paraphrased natural language utterances and SQL queries is hosted at figshare.com/s/75ed49ef01ac2f83b3e2.
first_indexed	2024-12-12T12:02:38Z
format	Article
id	doaj.art-45d12762121f4d7e8af0ba71c009a3cc
institution	Directory of Open Access Journals
issn	2352-3409
language	English
last_indexed	2024-12-12T12:02:38Z
publishDate	2022-06-01
publisher	Elsevier
record_format	Article
series	Data in Brief
spelling	doaj.art-45d12762121f4d7e8af0ba71c009a3cc | 2022-12-22T00:25:04Z | eng | Elsevier | Data in Brief | 2352-3409 | 2022-06-01 | 42 | 108211 | SEOSS-Queries - a software engineering dataset for text-to-SQL and question answering tasks | Mihaela Todorova Tomova (corresponding author; Technische Universität Ilmenau, Ilmenau 98693, Germany) | Martin Hofmann (Technische Universität Ilmenau, Ilmenau 98693, Germany) | Patrick Mäder (Technische Universität Ilmenau, Ilmenau 98693, Germany; Faculty of Biological Sciences, Friedrich Schiller University, Jena 07745, Germany) | http://www.sciencedirect.com/science/article/pii/S2352340922004152
title	SEOSS-Queries - a software engineering dataset for text-to-SQL and question answering tasks
topic	Software and systems requirement engineering; Text-to-SQL; Dataset; Question answering; Natural language processing
url	http://www.sciencedirect.com/science/article/pii/S2352340922004152

Similar Items