Duplicate Pull Requests: Automated Detection Using S-BERT and Machine Learning

Authors

  • Umar Hayat Khan, Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
  • Ashraf Zia, Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
  • Hashim Ali, Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
  • Asif Rahim, School of Computer and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
  • Umer Tanveer, Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan
  • Kiran Falak Sher, Department of Computer Science, Abdul Wali Khan University Mardan, Pakistan

Abstract

In modern pull-based development platforms such as GitHub, identifying and handling duplicate pull requests (PRs) has become a major challenge for integrators of large-scale open-source projects. With hundreds of PRs submitted daily, reviewing them manually is time-consuming, costly, and error-prone. This work proposes an automated approach to detecting duplicate PRs based on semantic similarity, using the transfer learning model S-BERT to measure the semantic similarity between two pieces of text. Using cosine similarity over S-BERT embeddings with an optimized similarity threshold of 0.40, we achieve an accuracy of 78% and an F1 score of 84%. We also expand the baseline dataset with 2,000 additional PRs and propose an XGBoost model that achieves an accuracy of 80.64%. Finally, the study presents the Duplicate Pull Request Detector (DPD) tool and assesses its significance through a survey among developers.
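
The thresholding step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the choice to treat a score equal to the threshold as a duplicate is an assumption, and `toy_embed` is a bag-of-words stand-in used only so the example runs without downloading a model. In practice the `embed` argument would be replaced by an S-BERT encoder (for example, the `encode` method of a `SentenceTransformer` model from the sentence-transformers library), with the cosine computed over dense vectors instead of sparse ones. Only the 0.40 value comes from the abstract.

```python
import math
from collections import Counter
from typing import Callable, Dict

# Similarity value reported in the abstract, used here as a decision threshold.
THRESHOLD = 0.40

SparseVector = Dict[str, float]


def toy_embed(text: str) -> SparseVector:
    """Hypothetical stand-in for an S-BERT encoder: a sparse word-count vector."""
    return {word: float(n) for word, n in Counter(text.lower().split()).items()}


def cosine_similarity(a: SparseVector, b: SparseVector) -> float:
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(
    pr_a: str,
    pr_b: str,
    embed: Callable[[str], SparseVector] = toy_embed,
    threshold: float = THRESHOLD,
) -> bool:
    """Flag two PR texts as duplicates when their similarity reaches the threshold.

    Assumption: a score equal to the threshold counts as a duplicate.
    """
    return cosine_similarity(embed(pr_a), embed(pr_b)) >= threshold


if __name__ == "__main__":
    first = "Fix null pointer exception in config parser"
    second = "Fix null pointer crash in the config parser"
    print(cosine_similarity(toy_embed(first), toy_embed(second)))
    print(is_duplicate(first, second))
```

With a real sentence encoder the scores reflect meaning and not just word overlap, so the 0.40 threshold would not carry over to this toy embedding; the sketch shows only the shape of the comparison.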

Published

2025-05-28

How to Cite

Umar Hayat Khan, Ashraf Zia, Hashim Ali, Asif Rahim, Umer Tanveer, & Kiran Falak Sher. (2025). Duplicate Pull Requests: Automated Detection Using S-BERT and Machine Learning. Spectrum of Engineering Sciences, 3(5), 991–1010. Retrieved from https://thesesjournal.com/index.php/1/article/view/2240