Source code analysis dataset

The data in this article pair source code with three artifacts from 108,568 projects downloaded from GitHub that have a redistributable license and at least 10 stars. The first set of pairs connects snippets of source code in C, C++, Java, and Python with their corresponding comments, which are extr...

Bibliographic Details
Main Authors:	Ben Gelman, Banjo Obayomi, Jessica Moore, David Slater
Format:	Article
Language:	English
Published:	Elsevier 2019-12-01
Series:	Data in Brief
Online Access:	http://www.sciencedirect.com/science/article/pii/S2352340919310674

_version_	1818583917049937920
author	Ben Gelman Banjo Obayomi Jessica Moore David Slater
author_facet	Ben Gelman Banjo Obayomi Jessica Moore David Slater
author_sort	Ben Gelman
collection	DOAJ
description	The data in this article pair source code with three artifacts from 108,568 projects downloaded from GitHub that have a redistributable license and at least 10 stars. The first set of pairs connects snippets of source code in C, C++, Java, and Python with their corresponding comments, which are extracted using Doxygen. The second set of pairs connects raw C and C++ source code repositories with the build artifacts of that code, which are obtained by running the make command. The last set of pairs connects raw C and C++ source code repositories with potential code vulnerabilities, which are determined by running the Infer static analyzer. The code and comment pairs can be used for tasks such as predicting comments or creating natural language descriptions of code. The code and build artifact pairs can be used for tasks such as reverse engineering or improving intermediate representations of code from decompiled binaries. The code and static analyzer pairs can be used for tasks such as machine learning approaches to vulnerability discovery. Keywords: Source code, Code comments, Bug detection, Static analysis
first_indexed	2024-12-16T08:12:53Z
format	Article
id	doaj.art-44847eea75f948ffa9b130c94ee0490e
institution	Directory Open Access Journal
issn	2352-3409
language	English
last_indexed	2024-12-16T08:12:53Z
publishDate	2019-12-01
publisher	Elsevier
record_format	Article
series	Data in Brief
spelling	doaj.art-44847eea75f948ffa9b130c94ee0490e; 2022-12-21T22:38:19Z; eng; Elsevier; Data in Brief; 2352-3409; 2019-12-01; 27; Source code analysis dataset; Ben Gelman (0); Banjo Obayomi (1); Jessica Moore (2); David Slater (3); (0) Machine Learning Group, Two Six Labs, 901 N. Stuart St, Suite 1000, Arlington, VA, 22203, USA; (1) Machine Learning Group, Two Six Labs, 901 N. Stuart St, Suite 1000, Arlington, VA, 22203, USA; (2) Machine Learning Group, Two Six Labs, 901 N. Stuart St, Suite 1000, Arlington, VA, 22203, USA; (3) Corresponding author.; Machine Learning Group, Two Six Labs, 901 N. Stuart St, Suite 1000, Arlington, VA, 22203, USA; http://www.sciencedirect.com/science/article/pii/S2352340919310674
spellingShingle	Ben Gelman Banjo Obayomi Jessica Moore David Slater Source code analysis dataset Data in Brief
title	Source code analysis dataset
title_full	Source code analysis dataset
title_fullStr	Source code analysis dataset
title_full_unstemmed	Source code analysis dataset
title_short	Source code analysis dataset
title_sort	source code analysis dataset
url	http://www.sciencedirect.com/science/article/pii/S2352340919310674
work_keys_str_mv	AT bengelman sourcecodeanalysisdataset AT banjoobayomi sourcecodeanalysisdataset AT jessicamoore sourcecodeanalysisdataset AT davidslater sourcecodeanalysisdataset

Similar Items