Duplicate Pull Requests: Automated Detection Using S-BERT and Machine Learning
Abstract
In modern pull-based development platforms such as GitHub, identifying and handling duplicate pull requests (PRs) has become a major challenge for integrators of large-scale open-source software systems. With hundreds of PRs submitted daily, manual identification is time-consuming, costly, and error-prone. This work proposes an automated approach to detecting duplicate PRs based on semantic similarity, using the transfer learning model S-BERT to measure the semantic similarity between two pieces of text. Using cosine similarity over S-BERT embeddings with an optimized similarity threshold of 0.40, we achieve an accuracy of 78% and an F1 score of 84%. We also expand the baseline dataset with 2,000 additional PRs and propose an XGBoost model that achieves an accuracy of 80.64%. Finally, the study presents the Duplicate Pull Request Detector (DPD) tool and assesses its significance through a survey among developers.
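As a rough illustration of the similarity-based decision described in the abstract, the following Python sketch compares two embedding vectors with cosine similarity and flags the pair as a likely duplicate when the score reaches a cutoff. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `cosine_similarity` and `is_duplicate` are hypothetical, the 0.40 value reported in the abstract is treated here as a decision threshold, the "greater than or equal" rule is an assumption, and the three-dimensional vectors are toy stand-ins for the embeddings an S-BERT model would produce from PR text.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(emb_a: Sequence[float], emb_b: Sequence[float],
                 threshold: float = 0.40) -> bool:
    """Flag a PR pair as a likely duplicate (assumed rule: score >= threshold)."""
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy vectors standing in for S-BERT embeddings of PR titles/descriptions.
pr_a = [0.9, 0.1, 0.3]
pr_b = [0.8, 0.2, 0.4]   # points in nearly the same direction as pr_a
pr_c = [-0.2, 0.9, -0.1]  # points in a different direction

print(is_duplicate(pr_a, pr_b))
print(is_duplicate(pr_a, pr_c))
```

In practice the vectors would be much higher-dimensional and would come from encoding each PR's text with an S-BERT model; the comparison step itself stays the same.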
