Automatic Identification of Similar Pull-Requests in GitHub’s Repositories Using Machine Learning
Abstract
1. Introduction
- An automatic approach to cluster similar PRs together using two supervised and unsupervised ML algorithms, considering the number of reviewers or the repository owner's preferences.
- An empirical evaluation of our proposal on twenty popular repositories of different domains and sizes.
2. Background
2.1. Pull-Request Mechanism
2.2. Similar Pull-Requests
3. Related Work
3.1. Detecting Duplicate Pull-Requests
- Pull-request retrieval: retrieving a ranked list of PRs for a given new PR.
- Pull-request classification: assigning a label (duplicated or not duplicated) to a given new PR using an ML algorithm.
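The difference between the two task formulations can be illustrated with a minimal Python sketch. It is not taken from any of the cited works: the bag-of-words cosine similarity, the function names `retrieve` and `classify`, and the fixed `threshold` that stands in for a trained ML classifier are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two texts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(new_pr: str, existing: dict, top_k: int = 3) -> list:
    """Retrieval: rank existing PRs by similarity to the new PR."""
    scored = [(pid, cosine(new_pr, text)) for pid, text in existing.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


def classify(new_pr: str, existing: dict, threshold: float = 0.5) -> str:
    """Classification stand-in: a fixed threshold on the best similarity
    replaces the trained ML model (hypothetical simplification)."""
    best = max((cosine(new_pr, text) for text in existing.values()), default=0.0)
    return "duplicated" if best >= threshold else "not duplicated"


if __name__ == "__main__":
    prs = {1: "fix typo in readme", 2: "add dark mode toggle", 3: "fix typo in docs"}
    print(retrieve("fix typo in readme file", prs, top_k=2))
    print(classify("fix typo in readme file", prs))
```

Retrieval leaves the final decision to the reviewer, who inspects the ranked list, whereas classification commits to a single label per new PR.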
3.2. Detecting Duplicate Bug Reports
3.3. Recommending Code Reviewers for PRs
3.3.1. Heuristic-Based Techniques
3.3.2. Social Network-Based Techniques
3.3.3. Machine Learning-Based Techniques
3.3.4. Hybrid Techniques
4. The Proposed Approach
4.1. Holistic View of the Proposed Approach
4.2. Parsing and Preprocessing Pull-Requests
4.3. Building Term-Document Matrix
4.4. Calculating Similarities among Pull-Requests
4.5. Clustering-Based Agglomeration Hierarchical Algorithm
4.5.1. Building Dendrogram Tree
Algorithm 1 BuildingDendrogramTree
4.5.2. Identifying Candidate PR Clusters
4.6. Clustering-Based K-Means Algorithm
Algorithm 2 Identifying PR Clusters
Algorithm 3 Identifying K PR-Clusters
5. Experimental Results and Evaluation
5.1. Investigation Research Questions and Evaluation Procedure
- RQ1: To what extent does the proposed approach identify relevant PR clusters?
- RQ2: How much effort could the proposed approach save reviewers?
- RQ3: How effective is the proposed approach compared with the most recent and relevant work on the subject?
- We match each identified cluster (i.e., its PRs) against all existing actual PR clusters of the repository of interest. Suppose that Z is an identified cluster. The existing cluster that shares the most PRs with Z is called its actual cluster. Such actual clusters (ACs) serve as the ground-truth clusters for evaluation.
- We use the following equations to compute the precision and recall values for each PR cluster against its actual cluster:
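The evaluation procedure above can be sketched in a few lines of Python. This is a minimal sketch, assuming the usual set-overlap definitions: precision is the share of an identified cluster's PRs that belong to its actual cluster, and recall is the share of the actual cluster's PRs that were recovered. The function names and the PR identifiers are hypothetical.

```python
def match_actual_cluster(identified: set, actual_clusters: list) -> set:
    """Return the actual cluster that shares the most PRs with the identified one."""
    return max(actual_clusters, key=lambda ac: len(identified & ac))


def precision_recall(identified: set, actual: set) -> tuple:
    """Set-overlap precision and recall of one identified cluster (assumed definitions)."""
    shared = len(identified & actual)
    precision = shared / len(identified) if identified else 0.0
    recall = shared / len(actual) if actual else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical PR numbers: Z is an identified cluster, acs the ground truth.
    z = {1, 2, 3, 9}
    acs = [{1, 2, 3}, {4, 5, 6}, {7, 8, 9}]
    ac = match_actual_cluster(z, acs)
    print(ac, precision_recall(z, ac))
```

Per-repository minimum, maximum, average, and standard deviation figures would then be aggregated over the per-cluster values returned by `precision_recall`.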
5.2. Dataset
5.3. Results Analysis
5.3.1. Identifying Relevant PRs Clusters (RQ1)
5.3.2. Saving Reviewing Efforts (RQ2)
5.3.3. The Effectiveness of the Proposed Approach against the Existing Work (RQ3)
5.4. Threats to Validity
5.4.1. Threats to Internal Validity
- Our research contribution in this article is to identify clusters of similar PRs in a given GitHub repository. This contribution has been evaluated only on clusters of duplicate PRs from different repositories, because no benchmark or public case study provides clusters of similar PRs. However, we consider duplication a special case of similarity (100% similarity).
- Our identification process uses descriptive textual information to find similar PRs, so the proposed approach is sensitive to the vocabulary used to describe these PRs. Consequently, our proposal may succeed or fail depending on the vocabulary used. However, this threat is common to all works that use textual matching to find similarities between the artifacts of interest.
- The proposed approach can be used only to identify similar PRs among open PRs.
5.4.2. Threats to External Validity
6. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Conflicts of Interest
Repository Name | #PRs | #Clusters (ACs) | NRs | Cluster Size (Min) | Cluster Size (Max) | Cluster Size (Avg) |
---|---|---|---|---|---|---|
angular/angular.js | 31 | 8 | 15 | 3 | 5 | 3.75 |
facebook/react | 15 | 4 | 2374 | 3 | 6 | 3.75 |
twbs/bootstrap | 47 | 14 | 16 | 3 | 5 | 3.35 |
symfony/symfony | 33 | 9 | 23 | 3 | 7 | 3.66 |
rails/rails | 25 | 8 | 50 | 3 | 4 | 3.13 |
joomla/joomla-cms | 19 | 6 | 24 | 3 | 4 | 3.17 |
ansible/ansible | 18 | 6 | 63 | 3 | 3 | 3 |
nodejs/node | 15 | 5 | 113 | 3 | 3 | 3 |
cocos2d/cocos2d-x | 3 | 1 | 10 | 3 | 3 | 3 |
rust-lang/rust | 9 | 3 | 179 | 3 | 3 | 3 |
ceph/ceph | 9 | 3 | 213 | 3 | 3 | 3 |
zendframework/zf2 | 9 | 3 | 15 | 3 | 3 | 3 |
django/django | 3 | 1 | 50 | 3 | 3 | 3 |
pydata/pandas | 3 | 1 | 49 | 3 | 3 | 3 |
elastic/elasticsearch | 6 | 2 | 1800 | 3 | 3 | 3 |
JuliaLang/julia | 3 | 1 | 98 | 3 | 3 | 3 |
scikit-learn/scikit-learn | 3 | 1 | 34 | 3 | 3 | 3 |
kubernetes/kubernetes | 13 | 4 | 1208 | 3 | 4 | 3.25 |
docker/docker | 7 | 2 | 56 | 3 | 4 | 3.5 |
symfony/symfony-docs | 19 | 5 | 9 | 3 | 5 | 3.8 |
Repository Name | #PRs | #ACs | NRs | Precision (Min) | Precision (Max) | Precision (Avg) | Precision (StDev) | Recall (Min) | Recall (Max) | Recall (Avg) | Recall (StDev) |
---|---|---|---|---|---|---|---|---|---|---|---|
angular/angular.js | 31 | 8 | 15 | 0.50 | 1.0 | 0.91 | 0.18 | ||||
twbs/bootstrap | 47 | 14 | 16 | 0.22 | 1.0 | 0.73 | 0.29 | ||||
symfony/symfony | 33 | 9 | 23 | 0.50 | 1.0 | 0.93 | 0.16 | ||||
symfony/symfony-docs | 19 | 5 | 9 | 0.67 | 1.0 | 0.93 | 0.13 |
Repository Name | #ACs | K | Precision (Min) | Precision (Max) | Precision (Avg) | Precision (StDev) | Recall (Min) | Recall (Max) | Recall (Avg) | Recall (StDev) |
---|---|---|---|---|---|---|---|---|---|---|
angular/angular.js | 8 | 8 | 0.60 | 1.0 | 0.92 | 0.14 | 0.40 | 1.0 | 0.88 | 0.22 |
facebook/react | 4 | 4 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
twbs/bootstrap | 14 | 14 | 0.22 | 1.0 | 0.70 | 0.29 | 0.20 | 1.0 | 0.52 | 0.24 |
symfony/symfony | 9 | 9 | 0.38 | 1.0 | 0.87 | 0.22 | 0.33 | 1.0 | 0.68 | 0.27 |
symfony/symfony-docs | 5 | 5 | 0.67 | 1.0 | 0.87 | 0.16 | 0.67 | 1.0 | 0.87 | 0.16 |
rails/rails | 8 | 8 | 0.50 | 1.0 | 0.80 | 0.21 | 0.25 | 1.0 | 0.75 | 0.23 |
joomla/joomla-cms | 6 | 6 | 0.50 | 1.0 | 0.78 | 0.22 | 0.33 | 1.0 | 0.78 | 0.24 |
ansible/ansible | 6 | 6 | 0.67 | 1.0 | 0.89 | 0.15 | 0.67 | 1.0 | 0.89 | 0.15 |
nodejs/node | 5 | 5 | 0.60 | 1.0 | 0.92 | 0.16 | 0.34 | 1.0 | 0.87 | 0.26 |
cocos2d/cocos2d-x | 1 | 1 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
rust-lang/rust | 3 | 3 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
ceph/ceph | 3 | 3 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
zendframework/zf2 | 3 | 3 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
django/django | 1 | 1 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
pydata/pandas | 1 | 1 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
elastic/elasticsearch | 2 | 2 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
JuliaLang/julia | 1 | 1 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
scikit-learn/scikit-learn | 1 | 1 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
kubernetes/kubernetes | 4 | 4 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
docker/docker | 2 | 2 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
Repository Name | #PRs | #ACs | #ICs | NRs | Precision (Min) | Precision (Max) | Precision (Avg) | Precision (StDev) | Recall (Min) | Recall (Max) | Recall (Avg) | Recall (StDev) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
facebook/react | 15 | 4 | 4 | 2374 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
rails/rails | 25 | 8 | 6 | 50 | 0.50 | 1.0 | 0.70 | 0.24 | 0.50 | 1.0 | 0.87 | 0.20 |
ansible/ansible | 18 | 6 | 4 | 63 | 0.50 | 1.0 | 0.72 | 0.18 | 1.0 | 1.0 | 1.0 | 0.0 |
nodejs/node | 15 | 5 | 4 | 113 | 0.50 | 1.0 | 0.88 | 0.21 | 1.0 | 1.0 | 1.0 | 0.0 |
joomla/joomla-cms | 19 | 6 | 4 | 24 | 0.40 | 1.0 | 0.60 | 0.23 | 0.70 | 1.0 | 0.85 | 0.16 |
cocos2d/cocos2d-x | 3 | 1 | 1 | 10 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
rust-lang/rust | 9 | 3 | 3 | 179 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
ceph/ceph | 9 | 3 | 3 | 213 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
zendframework/zf2 | 9 | 3 | 3 | 15 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
django/django | 3 | 1 | 1 | 50 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
pydata/pandas | 3 | 1 | 1 | 49 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
elastic/elasticsearch | 6 | 2 | 2 | 1800 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
JuliaLang/julia | 3 | 1 | 1 | 98 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
scikit-learn/scikit-learn | 3 | 1 | 1 | 34 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
kubernetes/kubernetes | 13 | 4 | 4 | 1208 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
docker/docker | 7 | 2 | 2 | 56 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 |
Repository Name | #PRs | #ACs | #ICs | Cluster Size (Min) | Cluster Size (Max) | Cluster Size (Avg) | SRE |
---|---|---|---|---|---|---|---|
angular/angular.js | 31 | 8 | 15 | 1 | 4 | 2.0 | 75% |
twbs/bootstrap | 47 | 14 | 16 | 1 | 11 | 2.9 | 91% |
symfony/symfony | 33 | 9 | 23 | 1 | 4 | 1.3 | 75% |
symfony/symfony-docs | 19 | 5 | 9 | 1 | 3 | 2.1 | 67% |
Repository Name | #PRs | #ACs | #ICs | Cluster Size (Min) | Cluster Size (Max) | Cluster Size (Avg) | SRE |
---|---|---|---|---|---|---|---|
facebook/react | 15 | 4 | 4 | 3 | 6 | 3.75 | 83% |
rails/rails | 25 | 8 | 6 | 3 | 4 | 3.6 | 75% |
ansible/ansible | 18 | 6 | 4 | 3 | 6 | 4.3 | 83% |
nodejs/node | 15 | 5 | 4 | 3 | 6 | 4.0 | 83% |
joomla/joomla-cms | 19 | 6 | 4 | 4 | 6 | 4.75 | 83% |
cocos2d/cocos2d-x | 3 | 1 | 1 | 3 | 3 | 3 | 67% |
rust-lang/rust | 9 | 3 | 3 | 3 | 3 | 3 | 67% |
ceph/ceph | 9 | 3 | 3 | 3 | 3 | 3 | 67% |
zendframework/zf2 | 9 | 3 | 3 | 3 | 3 | 3 | 67% |
django/django | 3 | 1 | 1 | 3 | 3 | 3 | 67% |
pydata/pandas | 3 | 1 | 1 | 3 | 3 | 3 | 67% |
elastic/elasticsearch | 6 | 2 | 2 | 3 | 3 | 3 | 67% |
JuliaLang/julia | 3 | 1 | 1 | 3 | 3 | 3 | 67% |
scikit-learn/scikit-learn | 3 | 1 | 1 | 3 | 3 | 3 | 67% |
kubernetes/kubernetes | 13 | 4 | 4 | 3 | 4 | 3.25 | 75% |
docker/docker | 7 | 2 | 2 | 3 | 4 | 3.5 | 75% |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Eyal Salman, H.; Alshara, Z.; Seriai, A.-D. Automatic Identification of Similar Pull-Requests in GitHub’s Repositories Using Machine Learning. Information 2022, 13, 73. https://doi.org/10.3390/info13020073