PI-Link: A Ground-Truth Dataset of Links Between Pull-Requests and Issues in GitHub

GitHub hosts Git repositories and provides issue-tracking services to offer a better collaboration environment for software developers. Issues and Pull-Requests are frequently used in GitHub to discuss and review software requirements (new features, bugs, etc.) and software solutions (source...

Bibliographic Details
Main Authors: Zakarea Alshara, Anas Shatnawi, Hamzeh Eyal-Salman, Abdelhak-Djamel Seriai, Maad Shatnawi
Format: Article
Language: English
Published: IEEE 2023-01-01
Series: IEEE Access
Subjects: Android, GitHub, ground-truth dataset, issue, link, pull-request
Online Access: https://ieeexplore.ieee.org/document/10002372/
author Zakarea Alshara
Anas Shatnawi
Hamzeh Eyal-Salman
Abdelhak-Djamel Seriai
Maad Shatnawi
collection DOAJ
description GitHub hosts Git repositories and provides issue-tracking services to offer a better collaboration environment for software developers. Issues and Pull-Requests are frequently used in GitHub to discuss and review software requirements (new features, bugs, etc.) and software solutions (source code, test cases, etc.), respectively. The links between Issues and their corresponding Pull-Requests carry valuable information for tracking current development and for documenting knowledge for future development. Given a large number of links, such information can be used to train machine learning models for several purposes, such as feature location, bug prediction and localization, recommendation systems, and documentation generation. To the best of our knowledge, no dataset has been proposed as a ground truth of links between Issues and Pull-Requests. In this paper, we propose PI-Link, a new, significant, and reliable ground-truth dataset composed of 50,369 links that explicitly connect 34,732 Issues with 50,369 Pull-Requests. These links are automatically extracted from all (907,139) Android projects in GitHub created between January 1, 2011 and January 1, 2021. To better organize and store the collected data, we propose a metamodel based on the concepts of Issues and Pull-Requests. Moreover, we analyze the relationships between Issues and their linked Pull-Requests based on four features related to their titles, bodies, labels, and comments. The selected features are analyzed in terms of their lengths and similarities using three lexical similarity metrics and one semantic similarity metric. The results show promising similarities between Issues and their linked Pull-Requests at the lexical and semantic levels. In addition, some feature similarities are sensitive to text length, whereas others are sensitive to term frequency.
first_indexed 2024-04-11T00:43:25Z
format Article
id doaj.art-45a17e7fb29f4632b136a770baecd18b
institution Directory of Open Access Journals
issn 2169-3536
language English
last_indexed 2024-04-11T00:43:25Z
publishDate 2023-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj.art-45a17e7fb29f4632b136a770baecd18b
2023-01-06T00:00:30Z
eng
IEEE
IEEE Access
2169-3536
2023-01-01
Volume 11, pages 697-710
10.1109/ACCESS.2022.3232982
10002372
PI-Link: A Ground-Truth Dataset of Links Between Pull-Requests and Issues in GitHub
Zakarea Alshara (https://orcid.org/0000-0002-2727-6985), Department of Software Engineering, Jordan University of Science and Technology, Irbid, Jordan
Anas Shatnawi, DRIT, Berger-Levrault, Montpellier, France
Hamzeh Eyal-Salman, Department of Software Engineering, Mutah University, Al-Karak, Jordan
Abdelhak-Djamel Seriai, LIRMM Laboratory, University of Montpellier, Montpellier, France
Maad Shatnawi, Department of Electrical Engineering Technology, Higher Colleges of Technology, Abu Dhabi, United Arab Emirates
https://ieeexplore.ieee.org/document/10002372/
Android; GitHub; ground-truth dataset; issue; link; pull-request
title PI-Link: A Ground-Truth Dataset of Links Between Pull-Requests and Issues in GitHub
topic Android
GitHub
ground-truth dataset
issue
link
pull-request
url https://ieeexplore.ieee.org/document/10002372/