Article

A Generalized Method for Filtering Noise in Open-Source Project Selection

1 Information College, Wuhan Vocational College of Software and Engineering, Wuhan 430205, China
2 Normal College, Jingchu University of Technology, Jingmen 448000, China
* Author to whom correspondence should be addressed.
Information 2025, 16(9), 774; https://doi.org/10.3390/info16090774
Submission received: 11 July 2025 / Revised: 28 August 2025 / Accepted: 4 September 2025 / Published: 6 September 2025
(This article belongs to the Topic Software Engineering and Applications)

Abstract

GitHub hosts over 10 million repositories, providing researchers with vast opportunities to study diverse software engineering problems. However, as anyone can create a repository for any purpose at no cost, open-source platforms contain many non-cooperative or non-developmental noise projects (e.g., repositories of dotfiles). When selecting open-source projects for analysis, mixing collaborative coding projects (e.g., machine learning frameworks) with noisy projects may bias research findings. To solve this problem, we optimize the Semi-Automatic Decision Tree Method (SADTM), an existing Collaborative Coding Project (CCP) classification method, to improve its generality and accuracy. We evaluate our method on the GHTorrent dataset (2012–2020) and find that it effectively enhances CCP classification in two key ways: (1) it demonstrates greater stability than existing methods, yielding consistent results across different datasets; (2) it achieves high accuracy, with an F-measure ranging from 0.780 to 0.893. Our method outperforms existing techniques in filtering noise and selecting CCPs, enabling researchers to extract high-quality open-source projects from candidate samples with reliable accuracy.

1. Introduction

As the most active open-source platform, GitHub reports having over 100 million developers and more than 420 million repositories (https://en.wikipedia.org/wiki/GitHub accessed on 10 July 2025). At present, most open-source research is focused on this platform. This academic attention stems from GitHub’s rich set of features, designed to enhance collaboration and social interactions around coding projects. By analyzing historical datasets such as GHTorrent [1], researchers can gain insights into how developers leverage GitHub for Collaborative Coding Projects (CCPs) [2].
The findings from Mining Software Repositories (MSRs) can significantly influence the decision-making processes of CCPs and enhance the quality of their development. For instance, by analyzing issue-tracking data from Mozilla and GNOME, Zhou and Mockus discovered that the likelihood of a new contributor becoming a Long-Term Contributor (LTC) is closely tied to their motivation and the project environment [3]. This insight enables project maintainers to identify potential LTCs early, thereby improving strategies for attracting and retaining valuable contributors.
Although GitHub serves as a valuable resource for software engineering studies [2], certain difficulties arise when researchers utilize large-scale datasets sourced from this platform. For example, researchers often aim to use extensive datasets to validate the generality of their findings yet struggle to ensure that the selected CCPs align with their research objectives. If the sampled projects are not rigorously vetted, the resulting conclusions may lack validity. Therefore, to accurately analyze open-source software (OSS) development, it is essential to prioritize sample quality when mining GitHub repositories.
Several researchers have investigated quality-related challenges in the process of mining GitHub repositories [2,4], identifying potential pitfalls in such efforts. Building on these findings, they further analyzed these risks and proposed mitigation strategies to enhance the reliability of project samples. By applying these strategies, researchers can improve the quality of their datasets. For instance, to ensure data integrity, researchers may exclude projects that are not exclusively hosted on GitHub by filtering out repositories with a significant number of committers lacking GitHub accounts or those explicitly labeled as mirrors [2].
While existing approaches address many concerns, two issues remain that necessitate human intervention: (1) a considerable number of projects are unrelated to software development, and (2) many repositories are primarily personal in nature. This means that researchers currently lack convenient methods to help them filter out private-use and non-development projects. Since some researchers aim to analyze project activities (e.g., [5]), they should note that the activity patterns of development projects differ significantly from those of non-development projects. For instance, Hannibal046/Awesome-LLM is a project that primarily collects information about LLM models. Despite its high popularity (e.g., star and fork counts), it may still be included in project samples, even though most of its commits involve minor updates to the README file. For researchers focusing on code contributions—a key topic in GitHub project studies [6]—such projects should be excluded. awesomedata/awesome-public-datasets is another example: a non-coding collaborative project in which multiple contributors have collected a large number of public datasets. Because it contains no code development content, its data are not suitable for studying code development.
Additionally, as many studies investigate social factors in projects (e.g., attracting and retaining contributors) [6], private projects introduce noise, since they are typically personal (e.g., homework assignments) [4] and lack collaborative development. In summary, researchers need an automated method to identify and select CCPs.
In this study, we propose a novel method for the automatic identification of CCPs. We evaluate our approach on multiple datasets comprising more than twenty thousand GitHub repositories, meticulously labeled by PhD and Master’s students in software engineering. Our method demonstrates strong performance in classifying CCPs, achieving consistent results across diverse datasets. The key contributions of this work include the following.
  • Identifying Limitations in Existing Methods: We uncover critical shortcomings (e.g., missing essential keywords) in prior approaches, which hinder their effectiveness across different datasets.
  • Proposing an Automated Solution: We introduce a robust method that overcomes the limitations of existing techniques, achieving an F-measure ranging from 0.780 to 0.893 in CCP classification, as validated on multiple datasets.
The rest of this paper is organized as follows. Related work is presented in Section 2. The definition of CCPs and the process of labeling projects are provided in Section 3. Existing methods for selecting CCPs are analyzed in Section 4, followed by our proposed method in Section 5. The study design and results are elaborated in Section 6, with further discussion in Section 7 and implications in Section 8. Threats to the validity of the results are presented in Section 9, and the work is concluded in Section 10.

2. Related Work

To systematically identify potential challenges in studying open-source software (OSS) projects, we first conducted a comprehensive review of prior research. This examination focused on three key aspects: (1) Which datasets have been employed in GitHub-related research? (2) What are the existing problems in the selection of project samples? (3) Which potential approaches can be adopted to overcome these problems? Through this analysis, we aim to critically evaluate whether these CCP classification approaches introduce any methodological threats or limitations to empirical findings.

2.1. Datasets Used in Studying GitHub Repositories

In recent years, an increasing number of studies have focused on software ecosystems [7] (with most research objects being OSS ecosystems). This trend can likely be attributed to the publicly available historical datasets that OSS ecosystems provide, offering valuable resources for researchers. Notably, GitHub has emerged as a particularly prominent platform, attracting significant research attention. By 2017, more than one hundred studies had already utilized GitHub datasets [6], demonstrating its widespread adoption as a research platform. The extensive use of GitHub in academic research means that numerous hypotheses have been tested within this OSS ecosystem, and researchers can readily access development data from this platform. Consequently, we selected the GitHub ecosystem for our CCP classification study to ensure that our findings contribute meaningfully to the broader open-source research community.
On the GitHub platform, numerous datasets are available that can be leveraged to study GitHub-related phenomena. Cosentino et al. conducted a systematic mapping study on GitHub [6] and identified six primary methods for obtaining GitHub data: (1) GHTorrent [1], (2) GitHub Archive, (3) GitHub APIs, (4) other sources (e.g., BOA [8]), (5) manual collection, and (6) a combination of these approaches. Kalliamvakou et al. noted that GitHub Archive, which began collecting data in 2011, provides an incomplete mirror of GitHub [9]. In contrast, the GHTorrent dataset offers a comprehensive historical record of GitHub activity. Additionally, GHTorrent can be supplemented using GitHub APIs, the most widely adopted method for data retrieval. Therefore, we selected the GHTorrent dataset (2020 version) as the primary data source for our research.

2.2. Risk Avoidance Strategies in Selecting Project Samples

Numerous researchers have examined the evolution of open-source software (OSS) projects from diverse perspectives [6], and many of them have relied on automated methods to select project samples. However, such approaches risk introducing noise into the research sample sets. To address this issue, Kalliamvakou et al. examined a range of challenges related to the use of GitHub repositories in research and proposed ways to mitigate these risks [2].
We summarize these strategies to determine whether researchers can easily apply them to avoid the associated risks. The results are presented in Table 1.
After examining the risk avoidance strategies, we find that, although many automated methods can assist researchers in improving the quality of project samples, the two key risks—namely, that numerous projects fall outside the scope of software development and that many are personal in nature—still require manual resolution by researchers. If researchers aim to exclude private projects or those not intended for development, they must expend significant human effort to manually curate samples. As the dataset scales, this manual effort becomes prohibitively costly. Therefore, an automated approach for identifying CCPs is essential.

2.3. Project Sample Selection Process in Studies on GHTorrent Dataset

As discussed earlier, several strategies can help to mitigate risks when selecting project samples on GitHub. Researchers may incorporate these strategies into their sample selection methods to reduce potential validity threats in their findings. However, the process of selecting CCPs still requires human intervention. To evaluate whether existing project sample selection methods can avoid the two main problems, we analyzed high-quality studies—published in top-tier journals or highly cited—that processed project samples using the GHTorrent dataset. We then identified literature employing automated project sample classification methods (as shown in Table 2) and assessed whether these approaches could successfully circumvent the two problems.
As shown in Table 2, there are three methods to automatically select project samples on the GHTorrent dataset. In the following, we will discuss the principles of these methods and whether they can avoid the two main problems.
Selecting projects that rank highly in certain dimensions or removing those with poor performance in specific aspects is straightforward and fully automated. Since many studies have adopted this approach for project sample selection, we refer to it as the baseline method in the following sections. This method aligns with the needs of researchers aiming to curate high-quality project samples. For example, Gousios et al. chose to study projects that contained more than 200 pull requests to avoid including toy projects in their research sample [10]. However, the baseline method primarily emphasizes the sample volume rather than content. A key limitation of this approach is its inability to filter out non-development projects. For example, Hannibal046/Awesome-LLM—a repository that collects LLM models—has garnered 24.8k stars, yet its commit behavior differs significantly from that of typical CCPs. When analyzing development activities on GitHub, such projects should be excluded.
The method of using a decision tree (in the following sections, we refer to this approach as the Semi-Automatic Decision Tree Method, or SADTM) has been proven to effectively select CCPs [34]. However, the stability of SADTM has not been thoroughly investigated—a known issue in software analytics studies [37]—as the method has only been tested on a limited dataset containing 6715 samples from 2012. The generality of this approach needs to be verified on more extensive datasets spanning multiple years.
The Reaper tool and the PHANTOM approach are designed to identify well-engineered software projects. These approaches specifically target projects that demonstrate robust software engineering practices across key dimensions, including documentation, testing, and project management. While suitable for researchers studying professionally developed software projects, these approaches have limitations. They cannot effectively detect CCPs developed by small groups, nor can they distinguish between popular and unpopular CCPs [1]. Consequently, these methods fail to capture the full spectrum of development projects.
As evidenced in Table 2, existing selection methods cannot guarantee the comprehensive identification of all CCPs.
Summary: From the related works discussed above, we derive three key observations: (1) employing automatic project classification methods to study project samples carries inherent risks that may impact research outcomes; (2) the current risk avoidance strategies in project sample selection remain incomplete, particularly requiring human intervention for the identification of CCPs; (3) existing project selection methods cannot guarantee the comprehensive identification of all CCPs. In summary, we require an automated approach for CCP classification to complement the existing risk avoidance strategies proposed by Kalliamvakou et al. [2].

3. Collaborative Coding Projects

As discussed above, Kalliamvakou et al. proposed several strategies to avoid problems (e.g., most projects have low activity) in mining GitHub [2], but some of these strategies still require human intervention. This implies that we cannot automatically exclude non-collaborative projects and non-coding projects. To address this challenge, we propose a model to assist researchers in automatically selecting project samples based on their specific criteria, eliminating the need to manually examine each individual project.
To facilitate the discussion of our methodology for selecting CCPs, it is essential to first establish a clear understanding of what constitutes a CCP. Accordingly, we formally define a collaborative project in Definition 1 and a coding project in Definition 2.
Definition 1.
A collaborative project, by definition, is not an invisible/personal project.
Invisible projects: This is a category of private projects defined by GitHub. Users can create private repositories that are not visible to other GitHub users. In this study, we define an invisible project as one that utilizes the “private” mechanism provided by GitHub.
Personal projects: This category represents another class of private projects. Many users employ repositories to archive data (such as homework) or to host individual projects, with no intention of collaboration [2]. For example, the repository lnewcomer1-zz/Github-for-Web-Designers is used by its owner for a web design course. It is unlikely that others would contribute to such a project. In this study, we define a personal project as one that satisfies the following two conditions: (a) it is created using the “public” mechanism provided by GitHub, and (b) it either lacks a project description or is clearly intended for personal use, such as data archiving.
Definition 2.
A coding project is organized around software development and may encompass a variety of assets, including games, frameworks, development tools, and add-on repositories [2].
Not all projects on GitHub are built for code development. For instance, Hannibal046/Awesome-LLM has 24.8k stars, yet it merely collects information about LLM models and does not contain development information. Hence, if such noisy projects are not removed from the samples, they will influence the conclusions drawn in MSR studies.

3.1. Labeling Process

To validate the effectiveness of our method, we construct a standardized dataset in which each project is manually labeled as either a CCP or non-CCP. This dataset enables the evaluation of CCP identification results.
Cheng et al. identified 6715 projects established between 1 January 2012 and 15 January 2012 [34]. Their dataset aligns with our research objectives, as they shared the same goal of CCP detection. However, this sample size proves insufficient due to the evolving nature of project types over time. A method that was effective in identifying CCPs in 2012 may not necessarily apply to projects in subsequent years. Consequently, we have chosen to expand the dataset originally compiled by Cheng et al. [34].
To enhance the reliability of our dataset, it was crucial to ensure a sufficiently large sample size. To achieve this, we engaged four PhD students and one Master’s student in software engineering (hereafter referred to as participants) to manually identify CCPs. Given the constraints on our resources, it was infeasible to label all projects on GitHub, which hosts over 10 million repositories. Consequently, we devised a sampling strategy to maximize the diversity of the project types included in our dataset.
According to the available resources, we can only manually inspect 20,000 project samples established across different time periods. To select representative samples for manual verification, there are two potential strategies: (1) selecting all projects created within a specific timeframe or (2) randomly choosing a fixed number of projects. Given that certain periods may introduce instability (e.g., samples could be influenced by seasonal variations or holidays), we opted to randomly select 1/1000 of the candidate projects from all samples spanning 2013 to 2020 [1].
We choose to compare our method to SADTM because they share the same research objectives [34]. Therefore, it is crucial to utilize the same dataset as Cheng et al. (i.e., labeled CCPs established between 1 January 2012 and 15 January 2012). To achieve this, we contacted the SADTM authors via email and obtained their labeled dataset. Subsequently, we selected and labeled additional CCP samples from different years to evaluate the generalizability of both SADTM and our approach.
In order to select appropriate samples, we carried out the following steps: (1) For every two-year period from 2013 to 2020, we randomly selected 1/1000 of all projects as candidate samples; resource constraints prevented us from labeling a larger fraction. (2) We preprocessed the project samples by removing forked projects. Candidate project metadata must undergo preprocessing, as they include numerous fork projects and non-English projects. As highlighted by Kalliamvakou et al., “To analyze a project hosted on GitHub, one must consider the activity in both the base repository and all associated forked repositories” [2]. Consequently, in this study, we consolidated data from fork projects into their corresponding base projects (i.e., our samples consisted solely of base repositories). (3) We excluded non-English projects, as this paper primarily focuses on English projects.
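To make the sampling procedure concrete, the following minimal sketch (in Python) illustrates these three steps under assumed field names (forked_from and language_of_description are hypothetical); it is not the authors' implementation, and the language check stands in for whatever tooling was actually used to exclude non-English projects.

import random

def sample_candidates(projects, fraction=0.001, seed=42):
    """Draw 1/1000 of the projects, keep only base repositories, and drop non-English ones."""
    rng = random.Random(seed)
    candidates = rng.sample(projects, max(1, int(len(projects) * fraction)))
    # Keep base repositories only; fork activity is assumed to have been
    # consolidated into the corresponding base project during preprocessing.
    base_only = [p for p in candidates if p.get("forked_from") is None]
    return [p for p in base_only if p.get("language_of_description") == "en"]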
After selecting samples, we asked the participants to indicate whether a sample project was a CCP. This process could introduce personal biases. Therefore, we adopted the following steps (see Figure 1) to label the CCPs.
(1) Learn annotation standards: We provided the labeled dataset to two participants, identified ambiguous items, and discussed the findings to finalize the label details.
(2) Randomly select samples: Based on the aforementioned discussion, we chose 1/1000 of the items every two years as the item samples.
(3) Label samples: According to Definition 1 and Definition 2, as well as the labeling details from the first step, the two participants labeled the samples separately.
(4) Calculate agreement level: We used Cohen’s Kappa coefficient to measure the agreement between the labels of the two participants (a minimal sketch of this computation is shown after this list).
(5) Divide samples: We computed the sets of samples that they agreed on and disagreed on.
(6) Resolve disagreements: We settled labeling disagreements through discussion with the two participants.
(7) Generate complete labels for the samples: This was achieved by merging the two label sets to form the final labeled dataset.
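For step (4), the agreement computation is standard; the following minimal sketch shows Cohen's Kappa for two annotators producing binary CCP labels (the label lists are assumed to be aligned by project).

def cohens_kappa(labels_a, labels_b):
    # Observed agreement: fraction of projects where both annotators agree.
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent labeling with each annotator's base rate.
    p_a = sum(labels_a) / n
    p_b = sum(labels_b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (observed - expected) / (1 - expected)

# Example: cohens_kappa([1, 1, 0, 1], [1, 0, 0, 1]) returns 0.5.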

3.2. Labeling Results

Following the methodology outlined above, we identified base CCPs established on GitHub between 2013 and 2020. We subsequently excluded projects that were not documented in English or had been deleted from GitHub. During this process, we observed that some projects in the GHTorrent dataset had been removed from GitHub by their maintainers. Consequently, we eliminated these CCPs due to insufficient development information for proper labeling. The final sample classification (comprising both our labeled dataset and the 2012 dataset obtained from Cheng et al. [34]) is presented in Table 3. In subsequent sections, we refer to these final labeled samples as the standard dataset.

4. Analysis of Existing Methods

Before examining our approach, it is essential to comprehensively analyze the strengths and limitations of existing methods. Our investigation primarily focuses on two critical dimensions, namely cost and accuracy, as these factors fundamentally determine both the ease of use and efficiency of sample classification methodologies for CCPs.

4.1. Costs of Existing Methods

Considering the needs of ordinary MSR researchers—who typically have limited human and computational resources yet wish to evaluate their hypotheses on large-scale GitHub datasets (e.g., datasets containing over 100,000 samples)—we first conducted a cost analysis of existing methods (including both automated approaches in Table 2 and manual methods) for selecting CCPs.
(1) Baseline Approach: This method requires basic project metadata such as star counts, the number of watchers, and committer counts. Such information is readily available in public GitHub datasets like GHTorrent [38], making it highly efficient and low-cost for large-scale analysis.
(2) SADTM: In addition to basic metadata, it utilizes textual project descriptions. Although the text data require extra preprocessing, they are still accessible via GHTorrent or the GitHub API without significant computational overhead.
(3) Reaper: It relies on extended features such as issue activity, license information, and source code features. While metadata can be obtained from GHTorrent, source code must be retrieved via the GitHub API—which is subject to hourly rate limits—and stored locally. This process is resource-intensive and not suitable for large-scale applications.
(4) PHANTOM: It depends on time-series features extracted from version control logs. Collecting historical commit data via the API is notably slow and requires considerable time and computational resources.
(5) Manual label approach: Based on practical labeling experience, manually determining whether a project is a CCP takes approximately one minute per project. Therefore, annotating 20,000 projects would require around two weeks of continuous work per annotator, making it highly impractical for large datasets.
From the cost analysis, we can observe the following. (1) In terms of the label cost, Reaper requires additional code information, while PHANTOM needs to retrieve Git logs. For large-scale projects, obtaining data for a single project may require numerous Git requests. In contrast, both SADTM and the baseline method can acquire the relevant data with just one API request. For researchers aiming to annotate a large number of projects, the data collection efficiency of Reaper and PHANTOM is therefore relatively low. (2) In terms of label accuracy, since we had already annotated numerous projects before developing the algorithm (as shown in Table 3), feedback from the annotators indicated that, in the vast majority of cases, they did not need to refer to code or Git logs to determine whether a project was a CCP. Therefore, we believe that using basic data (such as stars and committers) and descriptive data is sufficient to cover most of the information required for classification.

4.2. Accuracy of Existing Methods

As discussed above, the baseline method and SADTM can be readily employed by ordinary MSR researchers. Therefore, a natural question arises: can these two methods accurately select CCPs? Since this problem cannot be addressed through qualitative analysis alone, we employ a quantitative analysis to evaluate the accuracy of existing methods.
Although the baseline method and SADTM do not require many resources, in principle, neither of these methods performs well in selecting CCPs. For example, selecting the top projects (one strategy of the baseline method) may result in the inclusion of some tutorial projects, which are not development projects. SADTM has been tested only on a dataset with sample projects set up in 2012 [1], and the generality of this method requires further testing. Therefore, we decided to test the effects of these methods on the standard dataset.
We implemented SADTM [29] along with the baseline method and subsequently evaluated their effectiveness in accurately identifying CCPs.
SADTM: The implementation process of SADTM consists of three key steps (as illustrated in Figure 2).
(1) Collect labeled project samples from the dataset provided by Cheng et al. and then collect labeled samples from this study.
(2) For each sample, identify the description, URL, and basic features; the specific information utilized is as follows.
Description: A set of terms commonly found in project metadata, including status indicators (e.g., deprecated, moved), content type descriptors (e.g., tutorial, example, plugin), and project-specific identifiers. (Note: Terms include app, backup, blog, clone, collection, collection of, config, copy, course, demo, deprecated, documentation, dot, dotfiles, example, extension, file, first, fork, framework, GitHub, guide, helper, http, https, intro, library, list of, localization, mirror, module, moved, my, null, personal, plugin, practice, resume, sample, school, server, setting, simple, source, storage, system, template, theme, tool, translation, tutorial, university, vim, website.) If a project’s description matches the pattern (e.g., LIKE “%keyword%” in MySQL), the feature “keyword” is set to 1; otherwise, it is set to 0.
URL: Characteristic substrings within repository URLs (e.g., ‘doc’, ‘config’). If the URL of a project matches the pattern (e.g., LIKE “%keyword%” in MySQL), the feature “keyword” is set to 1; otherwise, it is set to 0.
Basic Information: Quantitative project features, including counts of stars, watchers, distinct contributors, the number of community participants, and the primary programming language. These features can be collected using the GitHub API.
(3) Train the J48 model using the dataset obtained in the second step and predict all labeled samples using the fitted model. This procedure aligns with the model generation approach executed by Weka, which is a data mining tool (https://www.cs.waikato.ac.nz/ml/weka accessed on 10 July 2025), with the ConfidenceFactor (this is a parameter that can affect the pruning process of fitting a decision tree model; the smaller the parameter value is, the simpler the model becomes) set to 0.05, consistent with SADTM [29].
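The following minimal sketch (not the authors' code) illustrates how such SADTM-style features can be assembled and fed to a decision tree. scikit-learn's DecisionTreeClassifier stands in for Weka's J48, min_impurity_decrease is only a rough stand-in for ConfidenceFactor-based pruning, and the keyword subset and field names are illustrative.

from sklearn.tree import DecisionTreeClassifier

KEYWORDS = ["tutorial", "example", "plugin", "dotfiles", "mirror"]  # illustrative subset

def sadtm_features(project):
    # project: dict with 'description', 'url', 'stars', 'watchers', 'committers' (assumed names).
    desc = (project.get("description") or "").lower()
    url = (project.get("url") or "").lower()
    features = [int(k in desc) for k in KEYWORDS]   # substring match, as with LIKE "%keyword%"
    features += [int(k in url) for k in KEYWORDS]
    features += [project.get("stars", 0), project.get("watchers", 0),
                 project.get("committers", 0)]
    return features

def train_sadtm(projects, labels):
    X = [sadtm_features(p) for p in projects]
    clf = DecisionTreeClassifier(min_impurity_decrease=1e-3)  # assumed pruning strength
    clf.fit(X, labels)
    return clf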
The Baseline Method: Following the approach proposed by Cheng et al. [29], we select or remove project samples by choosing the top 1%, 2%, 4%, 8%, and 15% of CCPs or removing the bottom 1%, 2%, 4%, 8%, and 15% of CCPs across four dimensions (i.e., stars, watchers, community member count, and committer count). We evaluate different parameters and select the optimal configuration separately for the precision, recall, and F-measure.
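As an illustration of this selection strategy, the sketch below keeps the top p% of candidate projects by a single metric (stars here) and scores the result against manual CCP labels; the metric choice, field names, and threshold are assumptions for the example, not the exact configuration used in the experiments.

def baseline_top_percent(projects, labels, metric="stars", p=0.15):
    # Rank projects by the chosen metric and keep the top p% as predicted CCPs.
    order = sorted(range(len(projects)), key=lambda i: projects[i][metric], reverse=True)
    keep = set(order[: int(len(order) * p)])
    predicted = [i in keep for i in range(len(projects))]
    tp = sum(1 for pr, y in zip(predicted, labels) if pr and y)
    fp = sum(1 for pr, y in zip(predicted, labels) if pr and not y)
    fn = sum(1 for pr, y in zip(predicted, labels) if not pr and y)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_measure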
Results of testing the accuracy of existing methods: As discussed above, we evaluated SADTM and the baseline method on various datasets, recording the optimal parameters for the baseline method to determine its upper performance limit. From the results in Table 4, we draw two key observations: (1) while the baseline method achieves strong performance in certain dimensions (e.g., precision ranging from 0.855 to 0.963), its overall effectiveness (F-measure between 0.544 and 0.781) remains insufficient; (2) the performance of SADTM exhibits instability (F-measure varying from 0.583 to 0.885) across different datasets, indicating that it cannot be directly adopted by researchers without further refinement.

4.3. Weaknesses of Existing Methods

Since SADTM is not a generalized method and the baseline approach lacks accuracy when selecting CCPs, the current methodologies fail to satisfy the needs of typical MSR researchers in project sample selection. Consequently, it becomes imperative to examine the limitations of existing methods to inform the design of our proposed approach.

4.3.1. Weaknesses of the Baseline Method

We first conducted a thorough investigation into the limitations of the baseline method. To achieve this, we performed a manual analysis of 100 false positive samples and 100 false negative samples obtained from the baseline method’s results, as described in Section 4. Our analysis revealed that certain CCPs cannot be effectively identified using basic project information such as the star count.
False Positives: We observed that certain negative projects (i.e., those labeled as “FALSE” in the CCP dimension) cannot be accurately identified using basic metadata alone. For instance, consider the project Hannibal046/Awesome-LLM. This repository is not a CCP, as it collects LLM models. However, it exhibits substantial engagement metrics, including 24.8k stars, 2.1k forks, and 311 commits from diverse contributors [1]. Based solely on these basic indicators, the project would not be classified as negative.
False Negatives: Some positive projects (i.e., those that we labeled as “TRUE” in the CCP dimension) cannot be accurately identified using basic information. For example, jshrake/hearthcards is a library for Hearthstone (a popular PC game) analysis. Although this project has no stars or watchers, it satisfies Definition 1 and Definition 2 in our study. Nevertheless, it remains undetectable through basic information alone.
In summary, the baseline method cannot correctly detect CCPs because the information used in this method is insufficient.

4.3.2. Weaknesses of SADTM

SADTM presents an opportunity to address the limitations of the baseline approach, as it incorporates project descriptions and URL information. However, as shown in Table 4, SADTM demonstrates limited generalizability across different datasets. To enhance SADTM’s performance, we conducted an in-depth analysis of its shortcomings. Specifically, we randomly selected 1000 samples that SADTM failed to predict and manually analyzed the problems of SADTM.
As a result, we identified three key limitations of SADTM.
(1) Imprecise keyword matching technique: SADTM identifies keywords in project descriptions through a basic pattern-matching rule (“%keyword%” in MySQL). Because this method is purely string-based, it cannot distinguish genuine relevance from accidental substring overlap. For example, dbcli/mycli (“A modern command line client for MySQL”) would wrongly be considered a match for the token “my”, even though the context is unrelated.
(2) Limited keyword lexicon: The keyword list used in SADTM does not fully cover terms that carry strong diagnostic signals. Words such as “LLM”, which often appear in domains like LLM app and LLM tool, are useful for identifying CCP-related projects but are not included in the current vocabulary.
(3) Skewed training dataset: The training corpus used by SADTM shows a clear class imbalance. Categories like “library” are heavily represented (211 projects), while others such as “theme” appear only rarely (six projects). Such disproportionate sampling limits the model’s ability to learn patterns from minority classes, making these categories more prone to misclassification.

5. Our Method

According to the discussion in Section 4.3, SADTM has great potential for improvement; thus, we decided to design our method based on SADTM.

5.1. Overcoming the Weaknesses of SADTM

As discussed in Section 4.3, SADTM can be improved in three aspects, for which we designed corresponding solutions.
Enhancing Keyword Matching. With the availability of advanced text-processing tools, this task can be systematically approached using a text-processing pipeline consisting of (1) gathering descriptive text; (2) tokenizing the text and eliminating stop words (this step is implemented using the Apache-maintained Lucene library); and (3) executing keyword matching via regular expressions (REGEXP “keyword”) in MySQL.
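The sketch below illustrates the intended effect of this pipeline, with plain Python standing in for Lucene and MySQL REGEXP (the stop-word list is an illustrative subset): matching whole tokens rather than substrings prevents "my" from firing on a description such as "A modern command line client for MySQL".

import re

STOP_WORDS = {"a", "an", "the", "for", "of", "and", "to", "in"}  # illustrative subset

def tokens(description):
    # Lowercase, split into alphanumeric tokens, and drop stop words.
    words = re.findall(r"[a-z0-9]+", description.lower())
    return {w for w in words if w not in STOP_WORDS}

def keyword_flags(description, keywords):
    toks = tokens(description)
    return {k: int(k in toks) for k in keywords}

# keyword_flags("A modern command line client for MySQL", ["my", "tool"])
# -> {"my": 0, "tool": 0}; a substring rule such as LIKE "%my%" would have matched "my".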
Automating Keyword Extraction. As highlighted in Section 4.3, the SADTM framework can benefit from the expansion of its keyword set (see Section 4.2). This expansion can be facilitated by implementing automated keyword extraction methods [1].
Following the work of Fu and Menzies [37,39], which emphasizes that researchers should weigh simpler and more efficient approaches against more complex and costly ones, we acknowledge that the design of a CCP classification method should carefully balance its information needs with its operational simplicity.
To identify an appropriate approach for automating the keyword selection process, we carefully analyzed both the specific requirements of our task and existing methodologies addressing similar challenges.
(1) We examined the existing keywords employed by SADTM and observed that these terms often reflect the development types of projects (e.g., app, tool, and plugin). This insight motivated us to explore methods of automatically tagging CCPs in an OSS environment.
(2) We examined existing methods of tagging projects. However, traditional tagging approaches primarily generate generalized tags (e.g., Java) for projects [40,41]. These methods require an additional training dataset (e.g., projects with manually assigned tags) and, at the same time, fail to meet our requirement of extracting keywords directly from project descriptions. Therefore, we explored more general text mining techniques. Term Frequency–Inverse Document Frequency (TF-IDF) is widely used as a weighting factor in information retrieval and text mining [42]. Meanwhile, the Latent Dirichlet Allocation (LDA) model is commonly employed to analyze topics in software development [43]. Both methods only require a corpus (in this study, the corpus consists of CCP descriptions) and can generate the keywords that we need.
Based on the above findings, we propose enhancing SADTM by leveraging TF-IDF or LDA to automatically extract keywords from CCP descriptions. Subsequently, we conducted a manual evaluation of the generated keywords’ intelligibility to determine the superior method (TF-IDF or LDA). The key points are summarized below.
(1) The Corpus: The descriptions of environment projects from the sample projects in the standard dataset. (Note that the samples we labeled cover only some of the projects established in each year. The descriptions of the labeled projects alone are not sufficient, because text mining needs a large amount of training data to ensure the stability of the training results. Therefore, we collected all projects established in the current period (e.g., 2013–2014) as the training corpus for LDA or TF-IDF.)
(2) Keyword Selection: Cheng et al. proposed SADTM, which employs dozens of keywords and achieved accurate project classification [34] on the 2012 dataset. This demonstrates that only a limited set of keywords is sufficient for identifying CCPs. Accordingly, we selected 80 keywords—a number comparable to SADTM’s keyword count—to ensure a fair comparison between our method and the existing approaches (SADTM and the baseline method), while avoiding the curse of dimensionality. (This concept refers to the phenomenon whereby, if a dataset contains too many features, fitting a model to it becomes inefficient. For the prediction of CCPs, the more keywords our model includes, the more information it can leverage and the better the results may be. However, an excessive number of features will seriously affect the efficiency of generating the model.)
(3) TF-IDF: We evaluated TF-IDF and IDF on the corpus and observed that ranking the TF-IDF scores from high to low is essentially equivalent to ranking DF (the document frequency, which measures how often a word appears across all documents) from low to high. This is because, in CCP descriptions, a word typically appears only once. Consequently, we applied IDF to the corpus and manually assessed the interpretability of the resulting keywords. For instance, the keyword “framework” is meaningful, as a CCP containing this term is likely a reusable framework for developers. In contrast, the keyword “2015” is uninformative, as it frequently appears in CCPs created that year.
(4) LDA: Our analysis reveals that the keywords in SADTM effectively summarize project types. Consequently, when applying LDA, it is essential to select the most representative term for each topic as the key identifier. In this study, we performed LDA on the corpus and extracted the top words for each topic (i.e., those most semantically relevant) to serve as topic labels. Similarly to TF-IDF, we engaged participants in manually evaluating the interpretability of these keywords. Parameters: number of topics = 80 (this parameter controls the number of generated topics), α = 0.5, and β = 0.1.
In summary, we tested IDF and LDA on the environment projects (2012–2020) and then recorded the number of understandable keywords produced by IDF and LDA for each year.
From the results in Table 5, IDF and LDA have similar accuracy in selecting useful keywords. However, LDA is more challenging to implement and relatively unstable (different runs will generate different keywords). Therefore, we decided to use IDF on descriptions of environment projects to select keywords.
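One plausible reading of this IDF-based selection is sketched below: compute document frequencies over the corpus of descriptions, rank terms by IDF, and keep the top N for manual interpretability review. The min_df cut-off is our own assumption to suppress one-off tokens and is not part of the original procedure.

import math
from collections import Counter

def top_idf_keywords(descriptions, n=80, min_df=5):
    # Document frequency: in how many descriptions does each word occur?
    df = Counter()
    for desc in descriptions:
        df.update(set(desc.lower().split()))
    total = len(descriptions)
    # IDF ranks rarer terms higher; extremely rare tokens are filtered by min_df.
    scored = {w: math.log(total / c) for w, c in df.items() if c >= min_df}
    return sorted(scored, key=scored.get, reverse=True)[:n]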
Constructing a balanced training dataset. We observed that the training dataset employed by Cheng et al. failed to accurately detect certain project categories (e.g., “resume” projects) due to insufficient samples of these types. This deficiency primarily accounts for SADTM’s suboptimal performance on other data (2013–2020). To address this limitation, we aimed to create a more comprehensive dataset encompassing diverse project types. (A lack of keywords is a weakness of SADTM, and, in the future, it may also be a weakness of our method, because the keywords will change over time. Therefore, we need to update our model every year by adding new types of project samples into the training dataset. This process will consume considerable human resources. To reduce the future cost of labeling samples, the number of projects collected for each type should not be too large.) Balancing methodological scalability with detection accuracy, we strategically limited the number of CCP samples per category while ensuring adequate representation. Through empirical analysis, we determined that 100 samples per project type sufficiently enabled SADTM to achieve reliable predictions. Additionally, to enhance the model’s capability in handling keyword-absent cases, we incorporated both keyword-specific and keyword-free CCP samples in our dataset construction.

5.2. Method Design

In Section 5.1, we discussed how to overcome the weaknesses of SADTM. In this section, we describe the details of our method with four steps.
(1) Generate keywords for each year (2012–2020). We first collected environment projects (e.g., 2013–2014 projects on GitHub) and extracted the descriptions from these projects. Then, we preprocessed these descriptions, i.e., separating words and deleting stop words. Next, we applied the IDF method to these descriptions and collected the top 80 words for each year. Finally, we merged these keywords to form a keyword set reflecting the most popular keywords of these years (a total of 127 keywords).
(2) Construct data matrices for sample projects. First, we preprocessed the project descriptions in the standard dataset by performing word segmentation and removing stop words using the Lucene library. Next, we employed MySQL’s REGEXP operator to verify whether a project contained a specific keyword (e.g., “tool”). In addition to keyword-based features, we enriched the dataset by incorporating basic project metadata and URL information, following the same structure as in Section 4.2.
(3) Create a balanced training dataset. First, we randomly selected 100 projects per keyword (there may have been repetitions because some projects contained two or more keywords). Then, we randomly added 3000 projects without any keywords to balance the number of projects with keywords against the number without. Finally, we obtained a balanced training dataset with 9745 samples.
(4) Fit a model. We fitted a J48 model on the balanced training dataset using Weka, with all parameters kept at their default values. A minimal end-to-end sketch of these steps is shown after this list.
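The end-to-end sketch below ties the four steps together under assumed field names (tokens, stars, committers, is_ccp); scikit-learn's DecisionTreeClassifier with default parameters stands in for Weka's J48, and the sampling counts follow the description above.

import random
from sklearn.tree import DecisionTreeClassifier

def build_balanced_training_set(projects, keywords, per_keyword=100, no_keyword=3000):
    selected = []
    for k in keywords:
        # Up to 100 projects per keyword; a project may be drawn for several keywords.
        with_k = [p for p in projects if k in p["tokens"]]
        selected += random.sample(with_k, min(per_keyword, len(with_k)))
    # Add keyword-free projects to cover the keyword-absent case.
    without = [p for p in projects if not (p["tokens"] & set(keywords))]
    selected += random.sample(without, min(no_keyword, len(without)))
    return selected

def fit_ccp_model(training_projects, keywords):
    X = [[int(k in p["tokens"]) for k in keywords] + [p["stars"], p["committers"]]
         for p in training_projects]
    y = [p["is_ccp"] for p in training_projects]
    return DecisionTreeClassifier().fit(X, y)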

6. Study Design and Results

6.1. Research Question

Having designed our method, we need to demonstrate that it outperforms existing methods. Therefore, we formulated a research question (RQ).
RQ: Does our method outperform the baseline method and SADTM?
Rationale: We have tested the baseline method and SADTM on the standard dataset. To confirm that our method is superior to these methods, it is also necessary to test our method on the same datasets.

6.2. Study Design

To answer the RQ, it was necessary to design an experiment to compare our method with the baseline method and SADTM. The details of the data collection and analysis are given as follows.
Data Collection: To support model fitting and testing, we collected two distinct sets of data. (1) Model Fitting Data: As detailed in Section 5.2, we gathered environmental project descriptions (2012–2020) for keyword generation. (Environmental project descriptions refer to the descriptions of projects that were not selected in the corresponding dataset. These descriptions assisted LDA and TF-IDF in selecting project keywords.) Additionally, we extracted the basic information, URL metadata, and keyword data from the standard dataset (introduced in Section 3.2) to train our model. (2) Testing Dataset: Since both the baseline method and SADTM were evaluated on the standard dataset, we adopted the same benchmark for the comparative assessment. Our testing dataset comprised project IDs, CCP labels, and the requisite inputs for the fitted model—namely, basic information, URLs, and keyword metadata.
Data Analysis: This process was divided into two steps. (1) Implementing our method. We implemented our method, proposed in Section 5.2, and obtained a fitted model that could be used to select CCPs. (2) Comparing the results of our method and other methods. In this step, we tested the fitted model obtained from Step (1) on the testing dataset obtained during data collection, and we then compared the precision, recall, and F-measure of our method with those of the baseline method and SADTM.

6.3. Study Results

The fitted model (the result of our method) was tested on the standard dataset. Then, we compared our method to existing methods, as shown in Table 6.
Comparing the results in Table 6, it is clear that our method is more accurate and stable than both the baseline method and SADTM. This means that our method is an improvement over SADTM.
Answer to the RQ: Our method performs better than the baseline method and SADTM.

7. Discussion

7.1. Overview of This Work

In this work, we (1) analyzed existing methods in selecting CCPs; (2) developed some strategies to overcome the weaknesses of SADTM; (3) compared our method with the baseline method and SADTM.
As shown in Table 6, our method demonstrates greater stability and accuracy compared to both SADTM and the baseline approach. From these results, we draw two key conclusions: (1) neither the baseline method nor SADTM can be directly applied to select CCPs; (2) our proposed method achieves high accuracy in CCP classification. In summary, we have successfully enhanced SADTM and developed a superior approach for CCP classification.

7.2. Analysis of Misclassified Projects

Although our approach enhances SADTM, certain projects remain undetectable by our method. Recognizing this limitation as a potential threat to the method’s applicability, we conducted a manual analysis of the misclassified projects to investigate the underlying causes. Based on this analysis, we categorized the reasons for undetectability into three types.
Limited keyword set. We collected 80 keywords to represent the most popular keywords for each year. However, by analyzing misclassified samples, we observed that these keywords are still insufficient. For instance, the CCP ravelsoft/node-krowlr is a node crawler—a typical project that gathers information from the Internet and includes the keyword “crawler”. If our model had incorporated this keyword, this project could have been correctly identified.
Model bias. Since our method does not incorporate semantic analysis, our model cannot accurately assess the true nature of projects. For instance, nikita9604/Blogging-Website-using-Flask is a CCP designed as a blogging website using Python Flask. However, due to the presence of the keyword “blog”, our model misclassified it as a non-development project. In our dataset, the majority of projects containing the term “blog” are personal or static blogs without substantial development, which are not actual development projects. Consequently, our model generalized this pattern and incorrectly labeled all “blog” projects as negative samples.
Inadequate Information. Our method leverages description details, basic information, and URL data. However, manual analysis reveals that the information available to our model is inadequate. For example, Gawssin/ColorfulStarLocal is an official implementation of Parallel Colorful h-star Core Maintenance in Dynamic Graphs, with its details primarily in the README file. This project cannot be classified as a CCP as it lacks a description and exhibits low popularity. Based on the collected data (i.e., description, basic information, and URL), this project does not meet the criteria for CCP classification.
The distribution of the different errors is shown below.
Inadequate information and the limited keyword set can be mitigated by incorporating additional data, such as README files or supplementary keywords. However, model bias remains unresolved, since the decision tree model fails to capture the semantic nuances of projects. Nevertheless, as shown in Table 7, this issue occurs relatively infrequently (approximately 20%). This suggests that, if researchers can gather sufficient keywords (e.g., selecting 120 keywords per year using IDF) and develop methods to analyze projects lacking descriptions (e.g., leveraging README file content), our model can be substantially enhanced to satisfy more research requirements.

7.3. How to Use Our Method

Our method has different usage scenarios for researchers with different purposes.
(1) Apply our model directly. Researchers studying CCPs who adopt our labeling standards can directly use our model to identify CCPs from candidate samples.
(2) Adjust the model based on researchers’ needs. Researchers studying CCPs who disagree with our labeling standards should identify the disputed labels, relabel the corresponding data in our training dataset, and retrain the model on the revised dataset. For instance, we classified most projects containing the keyword “small” as negative, but some researchers may contest this criterion. In such cases, they should locate these projects in the training dataset and reclassify them as positive. Given that our training dataset includes 100 projects with the keyword “small”, this adjustment process is not overly labor-intensive.
(3) Combine our model with existing methods. For researchers studying projects beyond CCPs, our model remains highly useful. For example, some researchers investigate stakeholder behavior in top projects (e.g., [27]); our model enables them to filter out private and non-development projects. We evaluated this on the 2015–2016 dataset: when selecting the top 15% of projects by star count, the initial sample contained 26.3% negative projects. After applying our model, the final result set reduced this proportion to just 8.9%.
(4) Enhance our model through additional information. As discussed in Section 7.2, a significant number of classification errors can be mitigated by incorporating new data sources. Researchers may manually review CCPs lacking keywords or integrate additional information such as README files to substantially improve the model performance. Since a large proportion of misclassifications in our method stem from data limitations rather than model flaws, these issues can be addressed independently.

8. Implications

In OSS development, our results have two implications.
Implication 1: When conducting research on OSS, researchers should carefully consider the quality of project samples. In this study, we labeled projects created in different periods and evaluated the effectiveness of existing methods (SADTM and the baseline method) used for project sample classification on the standard dataset. The results demonstrate that these methods fail to accurately identify CCPs. For researchers aiming to exclude private or non-development projects, the current approaches prove ineffective.
Implication 2: Researchers should consider the generality of their methods when mining OSS repositories. In this study, we tested the generality of SADTM on datasets from 2012 to 2020 and found that SADTM cannot be used to detect CCPs on different datasets. Recently, many studies have focused on mining OSS repositories, and these authors have evaluated their methods on fixed datasets [2]. Based on our findings, some of these methods may not perform well on other datasets, which could limit the generality of their approaches.

9. Threats to Validity

In this section, several threats are identified regarding the validity of the study. Internal validity is not discussed, since we did not investigate causal relationships.

9.1. Construct Validity

Our study focuses exclusively on the binary classification of CCPs, making construct validity contingent upon the accuracy of projects labeled as “TRUE” being genuine CCPs.
The first threat to the construct validity concerns participants’ capacity to accurately distinguish CCPs from non-CCPs, which is fundamentally tied to their domain expertise and familiarity with the project ecosystem. This challenge becomes evident when examining cases like dler-io/Rules: despite appearing to have minimal documentation at first glance, the project’s substantial community engagement—reflected in its 1200 stars and 19 contributors—clearly establishes its legitimacy as an active CCP. Such scenarios create classification disparities, where participants with deeper domain knowledge are more likely to correctly identify these projects, while those with limited exposure may misclassify them.
While these ambiguous cases represent a relatively small portion of our dataset (147 out of 26,742 projects), they illustrate an important methodological consideration. Our classification approach relies exclusively on information accessible through GitHub’s interface, which introduces a systematic limitation: CCPs that primarily document their activities through external platforms or alternative channels may be systematically underrepresented in our sample. Therefore, our findings specifically characterize CCPs that maintain their primary documentation and community engagement within GitHub’s ecosystem, rather than representing the entire universe of community-contributed projects.
The second threat concerns the potential for researchers to apply varying definitional criteria for CCPs. Consider the case of olexpono/weatherwear, a straightforward single-page Flask application that enables users to establish temperature thresholds for wearing sweatshirts. While functionally simple, this project lacks explicit designation as private use, potentially leading to classificatory disagreements among researchers. Our analysis reveals that projects incorporating specific keywords (particularly “small”) present notable labeling challenges and demonstrate distinctive keyword patterns. To mitigate this issue, we implement a three-phase approach: initially detecting contentious projects (such as those featuring the “small” keyword), subsequently conducting relabeling procedures, and finally developing an updated classifier using the refined dataset that corresponds to researchers’ particular CCP definitions.
The third threat stems from the possibility of inconsistent labeling standards among participants. To address this concern, we established precise definitions for both collaborative projects and coding projects. When participants encountered uncertainty regarding project classification, we implemented a protocol requiring assignment to an "undecided" category for subsequent group deliberation. During these collective review sessions, all participants collaboratively examined undecided projects and reached a consensus on the final classifications. This systematic approach effectively minimizes the risk of inconsistent labeling standards.

9.2. External Validity

The external validity of this study depends on whether our method can effectively detect CCPs across all GitHub projects. As shown in Table 6, we have validated our approach on the standard dataset containing samples from different years, demonstrating its generalizability to other GitHub projects. However, based on our analysis in Section 7.2, we note that the model’s accuracy may decline over time due to the fixed nature of our keyword set. Therefore, we recommend that, when applying our method to projects significantly distant from 2020, researchers should update the keyword set using Step (1) in Section 5.2 and relabel projects with new keywords to obtain an updated, retrained model, ensuring continued effectiveness in detecting new types of CCPs.

9.3. Reliability

As highlighted by [44], replicating project classification is crucial as it enhances our understanding of study limitations. The GHTorrent dataset employed in this research, provided by Gousios, is a widely used CCP dataset for the analysis of OSS development behaviors [38]. Consequently, this study can be readily replicated by other researchers.

10. Conclusions and Future Work

In this work, we conducted experiments on the GHTorrent dataset to evaluate whether we can automatically and accurately identify open-source CCPs. Our method was compared with existing approaches, yielding three key findings: (1) current methods exhibit significant limitations—SADTM is not a generalized method, performing well on the dataset it was trained on but poorly on other datasets, while the baseline approach produces inaccurate classification results; (2) our proposed method achieves precise CCP classification across diverse datasets; (3) the approach can be readily extended to accommodate various researcher requirements—for instance, incorporating project README information could further enhance the accuracy of CCP detection.
Our future work will focus on two aspects: (1) testing the effectiveness of our method on other platforms, such as SourceForge and Gitee; (2) introducing new methods for parsing README files to provide more reliable features and further improve classification performance.

Author Contributions

Y.D.: conceptualization, data curation, formal analysis, investigation, methodology, software, validation. Q.F.: writing—original draft, writing—review and editing, supervision, project administration, resources. X.L.: data annotation, validation, writing—review and editing. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Youth Project of Humanities and Social Sciences Research of Provincial Education Department (No. 2024020731) and the Scientific Research Startup Foundation Project of Wuhan Vocational College of Software and Engineering.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The annotated data from this paper have been uploaded to https://github.com/hellocisco/data accessed on 10 July 2025.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Gousios, G.; Spinellis, D. GHTorrent: Github’s data from a firehose. In Proceedings of the 9th Working Conference on Mining Software Repositories (MSR), Zurich, Switzerland, 2–3 June 2012; pp. 12–21. [Google Scholar]
  2. Kalliamvakou, E.; Gousios, G.; Blincoe, K.; Singer, L.; German, D.M.; Damian, D. An in-depth study of the promises and perils of mining GitHub. Empir. Softw. Eng. 2016, 21, 2035–2071. [Google Scholar] [CrossRef]
  3. Zhou, M.; Mockus, A. What make long term contributors: Willingness and opportunity in OSS community. In Proceedings of the 34th International Conference on Software Engineering (ICSE), Zurich, Switzerland, 2–9 June 2012; pp. 518–528. [Google Scholar]
  4. Kalliamvakou, E.; Gousios, G.; Blincoe, K.; Singer, L.; German, D.M.; Damian, D. The promises and perils of mining github. In Proceedings of the 11th Working Conference on Mining Software Repositories (MSR), Hyderabad, India, 31 May–1 June 2014; pp. 92–101. [Google Scholar]
  5. Jing, J.; Li, Z.; Lei, L. Understanding project dissemination on a social coding site. In Proceedings of the 20th Working Conference on Reverse Engineering (WCRE), Koblenz, Germany, 14–17 October 2013; pp. 132–141. [Google Scholar]
  6. Cosentino, V.; Izquierdo, J.L.C.; Cabot, J. A Systematic Mapping Study of Software Development with GitHub. IEEE Access 2017, 5, 7173–7192. [Google Scholar] [CrossRef]
  7. Manikas, K.; Hansen, K.M. Software ecosystems—A systematic literature review. J. Syst. Softw. 2013, 86, 1294–1306. Available online: http://www.sciencedirect.com/science/article/pii/S016412121200338X (accessed on 10 July 2025). [CrossRef]
  8. Yu, Y.; Wang, H.; Yin, G.; Wang, T. Reviewer recommendation for pull-requests in GitHub: What can we learn from code review and bug assignment? Inf. Softw. Technol. 2016, 74, 204–218. [Google Scholar] [CrossRef]
  9. Yu, Y.; Wang, H.; Filkov, V.; Devanbu, P.; Vasilescu, B. Wait for It: Determinants of Pull Request Evaluation Latency on GitHub. In Proceedings of the 12th Working Conference on Mining Software Repositories (MSR), Florence, Italy, 16–17 May 2015; pp. 367–371. [Google Scholar]
  10. Gousios, G.; Pinzger, M.; Deursen, A.V. An exploratory study of the pull-based software development model. In Proceedings of the 36th International Conference on Software Engineering (ICSE), Hyderabad, India, 31 May–7 June 2014; pp. 345–355. [Google Scholar]
  11. Vasilescu, B.; Yu, Y.; Wang, H.; Devanbu, P.; Filkov, V. Quality and productivity outcomes relating to continuous integration in GitHub. In Proceedings of the 10th Joint Meeting on Foundations of Software Engineering (FSE), Bergamo, Italy, 30 August–4 September 2015; pp. 805–816. [Google Scholar]
  12. Murphy, G.C.; Terra, R.; Figueiredo, J.; Serey, D. Do developers discuss design? In Proceedings of the 11th Working Conference on Mining Software Repositories (MSR), Hyderabad, India, 31 May–1 June 2014; pp. 340–343. [Google Scholar]
  13. Kikas, R.; Dumas, M.; Pfahl, D. Using dynamic and contextual features to predict issue lifetime in GitHub projects. In Proceedings of the 13th Working Conference on Mining Software Repositories (MSR), Austin, TX, USA, 14–15 May 2016; pp. 291–302. [Google Scholar]
  14. Constantinou, E.; Mens, T. Socio-technical evolution of the Ruby ecosystem in GitHub. In Proceedings of the 24th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), Klagenfurt, Austria, 20–24 February 2017; pp. 34–44. [Google Scholar]
  15. Bertoncello, M.V.; Pinto, G.; Wiese, I.S.; Steinmacher, I. Pull Requests or Commits? Which Method Should We Use to Study Contributors’ Behavior? In Proceedings of the 27th International Conference on Software Analysis, Evolution and Reengineering (SANER), London, ON, Canada, 18–21 February 2020; pp. 592–601. [Google Scholar]
  16. Tantithamthavorn, C.; Hassan, A.E.; Matsumoto, K. The impact of class rebalancing techniques on the performance and interpretation of defect prediction models. IEEE Trans. Softw. Eng. 2018, 46, 1200–1219. [Google Scholar] [CrossRef]
  17. Pascarella, L.; Palomba, F.; Penta, M.D.; Bacchelli, A. How Is Video Game Development Different from Software Development in Open Source. In Proceedings of the IEEE/ACM International Conference on Mining Software Repositories (MSR), Gothenburg, Sweden, 28–29 May 2018. [Google Scholar]
  18. Goyal, R.; Ferreira, G.; Kästner, C.; Herbsleb, J. Identifying unusual commits on GitHub. J. Softw. Evol. Process 2018, 30, e1893. [Google Scholar] [CrossRef]
  19. Fronchetti, F.; Wiese, I.; Pinto, G.; Steinmacher, I. What attracts newcomers to onboard on oss projects? In Proceedings of the 15th Open Source Systems—IFIP WG 2.13 International Conference (OSS), Montreal, QC, Canada, 26–27 May 2019; pp. 91–103. [Google Scholar]
  20. Zhao, G.; Da Costa, D.A.; Zou, Y. Improving the pull requests review process using learning-to-rank algorithms. Empir. Softw. Eng. 2019, 24, 2140–2170. [Google Scholar] [CrossRef]
  21. Zhou, P.; Liu, J.; Liu, X.; Yang, Z.; Grundy, J. Is Deep Learning Better than Traditional Approaches in Tag Recommendation for Software Information Sites? Inf. Softw. Technol. 2019, 109, 1–13. [Google Scholar] [CrossRef]
  22. Jiarpakdee, J.; Tantithamthavorn, C.K.; Dam, H.K.; Grundy, J. An empirical study of model-agnostic techniques for defect prediction models. IEEE Trans. Softw. Eng. 2020, 48, 166–185. [Google Scholar] [CrossRef]
  23. Vale, G.; Schmid, A.; Santos, A.R.; De Almeida, E.S.; Apel, S. On the relation between Github communication activity and merge conflicts. Empir. Softw. Eng. 2020, 25, 402–433. [Google Scholar] [CrossRef]
  24. Malviya-Thakur, A.; Mockus, A. The Role of Data Filtering in Open Source Software Ranking and Selection. In Proceedings of the 1st IEEE/ACM International Workshop on Methodological Issues with Empirical Studies in Software Engineering, Lisbon, Portugal, 15 April 2024; pp. 7–12. [Google Scholar]
  25. Padhye, R.; Mani, S.; Sinha, V.S. A study of external community contribution to open-source projects on GitHub. In Proceedings of the 11th Working Conference on Mining Software Repositories (MSR), Hyderabad, India, 31 May–1 June 2014; pp. 332–335. [Google Scholar]
  26. Hilton, M.; Tunnell, T.; Huang, K.; Marinov, D.; Dig, D. Usage, costs, and benefits of continuous integration in open-source projects. In Proceedings of the 31st International Conference on Automated Software Engineering (ASE), Singapore, 3–7 September 2016; pp. 426–437. [Google Scholar]
  27. Xavier, J.; Macedo, A.; Maia, M.D.A. Understanding the popularity of reporters and assignees in the Github. In Proceedings of the 26th International Conference on Software Engineering & Knowledge Engineering (SEKE), Vancouver, BC, Canada, 1–3 July 2014; pp. 484–489. [Google Scholar]
  28. Elazhary, O.; Storey, M.-A.; Ernst, N.; Zaidman, A. Do as I Do, Not as I Say: Do Contribution Guidelines Match the GitHub Contribution Process? In Proceedings of the 35th IEEE International Conference on Software Maintenance and Evolution (ICSME), Cleveland, OH, USA, 29 September–4 October 2019; pp. 286–290. [Google Scholar]
  29. Cheng, C.; Li, B.; Li, Z.-Y.; Liang, P.; Yang, X. An in-depth study of the effects of methods on the dataset selection of public development projects. IET Softw. 2021, 16, 146–166. [Google Scholar] [CrossRef]
  30. Montandon, J.E.; Valente, M.T.; Silva, L.L. Mining the Technical Roles of GitHub Users. Inf. Softw. Technol. 2021, 131, 106485. [Google Scholar] [CrossRef]
  31. Baltes, S.; Ralph, P. Sampling in software engineering research: A critical review and guidelines. Empir. Softw. Eng. 2022, 27, 94. [Google Scholar] [CrossRef]
  32. Wang, Y.; Wang, J.; Zhang, H.; Ming, X.; Shi, L.; Wang, Q. Where is your app frustrating users? In Proceedings of the 44th International Conference on Software Engineering (ICSE), Pittsburgh, PA, USA, 21–29 May 2022; pp. 2427–2439. [Google Scholar]
  33. Wang, Y.; Zhang, P.; Sun, M.; Lu, Z.; Yang, Y.; Tang, Y.; Qian, J.; Li, Z.; Zhou, Y. Uncovering bugs in code coverage profilers via control flow constraint solving. IEEE Trans. Softw. Eng. 2023, 49, 4964–4987. [Google Scholar] [CrossRef]
  34. Cheng, C.; Li, B.; Li, Z.-Y.; Liang, P. Automatic Detection of Public Development Projects in Large Open Source Ecosystems: An Exploratory Study on GitHub. In Proceedings of the 30th International Conference on Software Engineering and Knowledge Engineering (SEKE), San Francisco, CA, USA, 1–3 July 2018; pp. 193–198. [Google Scholar]
  35. Munaiah, N.; Kroh, S.; Cabrey, C.; Nagappan, M. Curating GitHub for engineered software projects. Empir. Softw. Eng. 2016, 22, 3219–3253. [Google Scholar] [CrossRef]
  36. Pickerill, P.; Jungen, H.J.; Ochodek, M.; Maćkowiak, M.; Staron, M. PHANTOM: Curating GitHub for Engineered Software Projects Using Time-Series Clustering. Empir. Softw. Eng. 2020, 25, 2897–2929. [Google Scholar] [CrossRef]
  37. Tim Menzies, M.S.; Smith, A. “Bad Smells” in Software Analytics Papers. Inf. Softw. Technol. 2019, 112, 35–47. [Google Scholar] [CrossRef]
  38. Gousios, G. The GHTorent dataset and tool suite. In Proceedings of the 10th Working Conference on Mining Software Repositories (MSR), San Francisco, CA, USA, 18–19 May 2013; pp. 233–236. [Google Scholar]
  39. Fu, W.; Menzies, T. Easy over hard: A case study on deep learning. In Proceedings of the 12th Joint Meeting on Foundations of Software Engineering (FSE), Paderborn, Germany, 4–8 September 2017; pp. 49–60. [Google Scholar]
  40. Saha, A.K.; Saha, R.K.; Schneider, K.A. A discriminative model approach for suggesting tags automatically for Stack Overflow questions. In Proceedings of the 10th Working Conference on Mining Software Repositories (MSR), San Francisco, CA, USA, 18–19 May 2013; pp. 73–76. [Google Scholar]
  41. Wang, T.; Wang, H.; Yin, G.; Ling, C.X.; Li, X.; Zou, P. Tag recommendation for open source software. Front. Comput. Sci. 2014, 8, 69–82. [Google Scholar] [CrossRef]
  42. Beel, J.; Gipp, B.; Langer, S.; Breitinger, C. Research-paper recommender systems: A literature survey. Int. J. Digit. Libr. 2015, 17, 1–34. [Google Scholar] [CrossRef]
  43. Barua, A.; Thomas, S.W.; Hassan, A.E. What are developers talking about? An analysis of topics and trends in Stack Overflow. Empir. Softw. Eng. 2014, 19, 619–654. [Google Scholar] [CrossRef]
  44. Falessi, D.; Smith, W.; Serebrenik, A. STRESS: A Semi-Automated, Fully Replicable Approach for Project Selection. In Proceedings of the 11th International Symposium on Empirical Software Engineering & Measurement (ESEM), Toronto, ON, Canada, 9–10 November 2017; pp. 151–156. [Google Scholar]
Figure 1. Procedure of creating a standard dataset.
Figure 2. Three steps of testing SADTM.
Table 1. Risks of the sample project selection process and corresponding avoidance strategies.
Problem | Strategy | Automated Method
A repository is not necessarily a project | Consider the activity in both the base repository and all associated forked repositories. | Yes
Most projects have low activity | Consider the number of recent commits on a project to select projects with an appropriate activity level. | Yes
Most projects are inactive | Consider the number of recent commits and pull requests. | Yes
Many projects are not software development | Review descriptions and README files to ensure that the projects fit the research needs. | No, this strategy needs researchers to review descriptions and README files of project samples.
Most projects are personal | Consider the number of committers. | No, this strategy cannot effectively remove personal projects because many public projects have only one committer.
Many active projects do not use GitHub exclusively | Avoid projects that have a high number of committers who are not registered GitHub users and projects with descriptions that explicitly state that they are mirrors. | Yes
Few projects use pull requests | Consider the number of pull requests before selecting a project. | Yes
Table 2. Project sample selection methods in literature.
Method | Literature
Remove projects demonstrating poor performance in key metrics (e.g., pull request activity) | [8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24]
Select projects ranking highly in significant dimensions (e.g., star count) | [15,20,25,26,27,28,29,30,31,32,33]
Employ decision tree classification to determine CCP status (using project statistics and description keywords as features) | [29,34]
Apply score-based and random forest classifiers through the Reaper tool to identify engineered software projects | [35]
Propose the PHANTOM approach, which extracts multidimensional time-series features from Git logs and applies k-means clustering to automatically and effectively identify “engineered” projects from a large number of GitHub repositories | [36]
Table 3. Labeling results.
Period | No. of Samples | Cohen’s Kappa Coefficient
1 January 2012–15 January 2012 | 6715 | /
1 January 2013–1 January 2014 | 1385 | 0.857
1 January 2015–1 January 2016 | 3991 | 0.835
1 January 2017–1 January 2018 | 6930 | 0.812
1 January 2019–1 January 2020 | 7721 | 0.885
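For reference, the Cohen’s Kappa coefficients in Table 3 follow the standard two-rater definition, where $p_o$ is the observed agreement between labelers and $p_e$ is the agreement expected by chance:

\kappa = \frac{p_o - p_e}{1 - p_e}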
Table 4. Results of the baseline method and SADTM on the datasets for the years 2013–2020.
Method | Best Parameter for Precision, Recall, or F-Measure | Precision | Recall | F-Measure
2012.1.1–2012.1.16
Baseline method | Select top 2% projects regarding watcher number | 0.880 | 0.027 | 0.052
Baseline method | Remove bottom 1% projects regarding community member number | 0.644 | 0.992 | 0.781
SADTM | J48, ConfidenceFactor = 0.05 | 0.847 | 0.927 | 0.885
2013–2014
Baseline method | Select top 2% projects regarding star number | 0.963 | 0.033 | 0.064
Baseline method | Remove bottom 1% projects regarding watcher number | 0.564 | 0.992 | 0.720
SADTM | J48, ConfidenceFactor = 0.05 | 0.703 | 0.789 | 0.743
2015–2016
Baseline method | Select top 2% projects regarding watcher number | 0.862 | 0.038 | 0.074
Baseline method | Remove bottom 1% projects regarding watcher number | 0.443 | 0.990 | 0.612
SADTM | J48, ConfidenceFactor = 0.05 | 0.463 | 0.790 | 0.583
2017–2018
Baseline method | Select top 1% projects regarding star number | 0.855 | 0.019 | 0.037
Baseline method | Remove bottom 1% projects regarding star number | 0.437 | 0.991 | 0.606
SADTM | J48, ConfidenceFactor = 0.05 | 0.575 | 0.801 | 0.669
2019–2020
Baseline method | Select top 1% projects regarding star number | 0.857 | 0.022 | 0.044
Baseline method | Remove bottom 1% projects regarding star number | 0.374 | 0.992 | 0.544
SADTM | J48, ConfidenceFactor = 0.05 | 0.469 | 0.844 | 0.602
Note. For the precision, recall, and F-measure of each dataset, we selected the best-performing method along with its corresponding values and highlighted them in bold in the table.
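For reference, the F-measure values in Tables 4 and 6 are the harmonic mean of precision and recall (e.g., for SADTM on the 2012 data, $2 \times 0.847 \times 0.927 / (0.847 + 0.927) \approx 0.885$):

\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}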
Table 5. Number of understandable keywords selected by IDF and LDA.
Method | 2012 | 2013–2014 | 2015–2016 | 2017–2018 | 2019–2020
LDA | 60/80 | 61/80 | 60/80 | 60/80 | 59/80
IDF | 61/80 | 61/80 | 62/80 | 59/80 | 60/80
Table 6. Results of different methods on the standard dataset.
Year | Precision (Ours) | Recall (Ours) | F-Measure (Ours) | F-Measure (Baseline) | F-Measure (SADTM)
2012 | 0.845 | 0.946 | 0.893 | 0.781 | 0.885
2013–2014 | 0.849 | 0.885 | 0.866 | 0.720 | 0.743
2015–2016 | 0.752 | 0.848 | 0.797 | 0.612 | 0.583
2017–2018 | 0.813 | 0.800 | 0.806 | 0.606 | 0.669
2019–2020 | 0.740 | 0.825 | 0.780 | 0.544 | 0.602
Note. We compared the F-Measure of our method against the Baseline and SADTM, with the best-performing method and its corresponding values highlighted in bold in the table.
Table 7. The distribution of different errors.
Year | Model Bias | Inadequate Information | Limited Keyword Set
2012 | 25.1% | 67.7% | 7.2%
2013–2014 | 23.4% | 61.9% | 14.7%
2015–2016 | 23.3% | 63.9% | 12.7%
2017–2018 | 20.1% | 69.0% | 10.8%
2019–2020 | 27.1% | 66.2% | 6.7%
Note. This table shows the percentage distribution of different types of errors across different years.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
