What Is Needed for the Sustainable Success of OSS Projects: Efficiency Analysis of Commit Production Process via Git

Abstract: The purpose of this study is to investigate the relative efficiency of open source software projects, and to analyze what is needed for their sustainable success. The success of open source software is known to be attributable to a massive number of contributors engaging in the development process. However, an efficient open source software project is not guaranteed simply by active participation by many; a coordination mechanism is needed to seamlessly manage the multi-party collaboration. On this basis, this study aimed to examine the internal regulatory processes based on Git and GitHub, which serve as such a mechanism, and to redefine the efficiency of open source software projects to fully reflect them. For this purpose, a two-stage data envelopment analysis was used to measure project efficiency reflecting the internal processes. Moreover, this study considered the Kruskal-Wallis test and Tobit regression analysis to examine the effects of participation by many on an open source software project based on the newly defined efficiency. Results show that a simple increase in contributors can be detrimental to the efficiency of open source software projects.


Introduction
Open source software (OSS) does not simply refer to software distributed for free. It is a concept that permits not only the actual usage of software but also free redistribution, meaning anyone can access and modify the original source code to produce derivative products [1]. Linux, MySQL, and Chrome are representative examples of sustainable and successful OSS projects. Moreover, OSS is used by countless companies and government organizations, and both the number of OSS projects and the volume of their source code are growing exponentially worldwide [2].
From a traditional point of view, the sustainable success of such OSS projects is rather unconventional. The online communication and cooperation of random people, voluntarily motivated and without reliance on economic interests, cannot be explained from the conventional economic viewpoint. Raymond [3] asserted that "given enough eyeballs, all bugs are shallow", known as 'Linus's Law', citing "a large enough beta-tester and co-developer base" as the core of the success of OSS. In other words, the massive number of contributors engaged in testing, bug fixing, and adding features leads to a product that surpasses commercial software developed on an exclusive basis.
However, this free and open environment is not without limitations, because a large number of participants tends to reduce group efficiency [4]. The larger the group becomes, the higher the potential for inefficient communication, which makes seamless project management difficult. For example, in high-level decision making, such as adding critical functions, participation by many becomes toxic in nature. In reality, many OSS projects fail without reaching maturity owing to inefficiency. According to Schweik and English [5], only 17% of OSS projects end in success.
Then, what is needed for an OSS project to be efficient? An efficient OSS project is not guaranteed simply by the active participation of many. For efficient projects, a coordination mechanism is needed to seamlessly manage the multi-party cooperation. In other words, the multiple parties collaborating on an OSS project must be regulated to a certain extent by a smaller number of managers with appropriate authority [6]. Git, covered in this study, is one example of such a regulatory device for voluntary collaboration.
Therefore, a reflection of such internal regulatory processes is indispensable for analyzing the efficiency of an OSS project. Simply reviewing the initial input and final output cannot demonstrate the efficiency of the project. Hence, this study aims to move away from the existing perspective to a new one, reflecting a series of processes accompanying regulated voluntary cooperation, and to redefine the efficiency of OSS projects. For this purpose, this study has set a two-stage model of data envelopment analysis (DEA) to reflect internal processes. Moreover, based on the newly defined efficiency, this study examines the effects of the participation by many on an OSS project through the Kruskal-Wallis test and Tobit regression analysis.

Background and Literature Review
Over many years, several scholars have taken an interest in the efficiency of OSS projects and have presented their own interpretations. Among such scholars, Ghapanchi and Aurum [7], Wray and Mathieu [8], and Koch [9] utilized DEA to measure relative efficiency.
First, the research model of Ghapanchi and Aurum [7] is presented in Figure 1. Their study utilized the partial least squares (PLS) method to determine the positive influence of the four competencies of the theory of competency rallying (TCR) on OSS project performance. Within the context of OSS, TCR covers a process of rallying individual developers with the capabilities to respond to new customer needs. The four capabilities of TCR (identification of market needs, marshalling of competencies, development of competencies, and managing cooperative work) are the independent variables of the research model; OSS project performance, composed of developer interest and project efficiency, is the dependent variable. DEA was used in the analysis of project efficiency, the second component of project performance, with the number of developers and the project duration as inputs and the number of released files as output. The subjects of the analysis were projects on the OSS development platform Sourceforge.net; to secure homogeneity, 607 projects in the software development category were selected.
Wray and Mathieu [8] selected 34 security-based OSS projects on Sourceforge.net as decision making units (DMUs) and utilized an input-oriented Banker-Charnes-Cooper (BCC) model. The model considered the number of developers and the number of bug submitters as input variables, and the project rank, number of downloads, and kilobytes per download as output variables. This research is notable in that it included bug submitters alongside developers in the inputs, and qualitative indicators in the outputs. The present study references this aspect and approaches inputs and outputs from a multidimensional perspective.
Koch [9] utilized DEA to understand the influence of cooperation tools on the efficiency of OSS projects. First, two types of DMU groups were selected; the first group was composed of the top 30 projects sorted by Sourceforge.net rankings, and the other group was randomly compiled. The inputs involved the number of developers and the number of years, and the outputs were downloads, web hits, size in bytes, and lines of code. Efficiency was measured using the output-oriented BCC model.
Such existing literature has contributed to newly defining the efficiency of OSS projects, which is clearly distinct from that of traditional closed-source software, but it has limitations in two respects.
The first limitation is that the research subjects were restricted to Sourceforge.net projects, which creates a gap when applying the results to the latest OSS projects. Sourceforge.net has now become a website merely for downloading software and no longer represents OSS platforms. This was largely due to GitHub, which started in 2007 and has rapidly grown to be the world's largest OSS project platform. As GitHub came to dominate the OSS market, other OSS communities, such as Google Code, fell behind and ended service in 2015. This study is based on GitHub in its measurements of OSS project efficiency to derive results that are more appropriate for the present day.
The second limitation is that the internal process of the project was not considered in the analysis. GitHub offers a version control system called Git, which involves a more complex development process. OSS projects on GitHub are characterized by participants writing code and creating commits that undergo a review process before being merged into the master branch through Git. The traditional DEA model is unable to accurately capture the efficiency of OSS projects with such complex processes, because the model regards the process between input and output as a black box [10]. Therefore, it is necessary to apply an expanded DEA model that takes into account the internal processing mechanism of OSS projects. As such, this study utilizes a two-stage DEA method to identify internal inefficiencies, which are difficult to spot using traditional DEA.

Git and GitHub
The subjects of this study were projects on GitHub, the worldwide platform for OSS development. GitHub provides a wide range of advanced functions that support cooperation for free, but at its core is Git, a version control system. A version control system manages the full history of source code changes so that multiple developers can simultaneously make changes to the same source code. In other words, it is a type of autonomous regulatory device for effectively managing participation by many. Git saves each change to the source code as a unit called a commit; a change is not applied immediately, but only after the code has passed a step-by-step examination through a series of mechanisms. OSS projects on GitHub are operated using Git by default.
GitHub explains the process of contributing to an OSS project using Git from the standpoint of a developer in six stages [11]:
1. The first task is to create a branch in the project that the developer wishes to contribute to. The place where the code of the OSS is actually released is called the master branch; branches are used to prevent confusion from all changes being made directly to the master branch. For instance, if one wants to propose a function relating to comments, one can work on a branch called 'comment' if such a branch exists. Branches stem from the master, like branches of a tree, and thus begin from the master's code; however, unless an administrator merges a branch, changes to the code in that branch do not affect the master code. One can also create a fork of a project, that is, a copy of it. This is useful when testing changes to the code, because the copied version can be modified without influencing the original project.
2. The second stage is to create a commit by writing code. This may involve removing errors in the existing code or adding a new function.
3. The third stage involves writing a pull request, which shows the modified code segment with a simple explanation. In other words, a pull request is a message that asks the project administrator to accept, or pull, the changes into the original.
4. Next, the pull request is discussed and reviewed by contributors. Anyone can leave a comment evaluating the pull request.
5. The fifth stage involves the project administrators accepting or rejecting the pull request. When the pull request is rejected, the code in question is not merged and is returned. If the pull request is accepted, the code is created as a commit on the branch.
6. Finally, when the branch is ready for distribution, it is merged into the master branch. Even after the merge, related issues or bugs may be reported. Issues are a communication channel on GitHub, a type of bulletin board where anyone can present opinions about the project.
Through the above process, Git and GitHub help to make the cooperation of an unspecified many more effective. This process can be subdivided into two parts, as shown in Figure 2. The first regulatory device, shown in Figure 2a, limits commits through pull requests; the second, shown in Figure 2b, limits changes from being reflected directly on the master through branches. However, not all commits go through all of these processes in an OSS project. Project administrators have the authority to push commits directly, and they often bypass the evaluation process of pull requests. As such, all commit information in an OSS project can be divided into commits from pull requests that have passed an open evaluation and commits added directly by those with push authority [6].

Data Collection
In discussing the efficiency of the OSS project, this study aimed to reflect the process through which commits are generated and merged into the master branch. To achieve this, data were needed on a diverse range of features in the Git-based development process such as commit, branch, fork, pull request and issue; these can be collected using the GitHub application programming interface (API). GitHub provides a free API service, allowing external parties to access the accumulated data [12].
The collection of DMU information required for this research was done based on the classification of Showcase pages provided by GitHub; GitHub API crawling was used to collect the values corresponding to input, output and environmental variables.
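As an illustration, the repository-level variables can be read off the standard GitHub REST API repository payload (e.g. GET https://api.github.com/repos/{owner}/{repo}); the field names below follow the public API, while the mapping function itself is a hypothetical sketch. Commit, pull request, and issue details come from separate paginated endpoints and are not shown.

```python
import json

def extract_variables(repo: dict) -> dict:
    """Map a GitHub REST API repository payload onto the study's variables.

    Field names follow the GitHub REST API repository object; commit and
    pull-request counts are served by separate endpoints and are omitted.
    """
    return {
        "size_kb": repo["size"],               # repository size in KB
        "stars": repo["stargazers_count"],     # qualitative output
        "watches": repo["subscribers_count"],  # users watching the project
        "forks": repo["forks_count"],
        "open_issues": repo["open_issues_count"],
    }

# A trimmed-down example payload, as the API would return it:
sample = json.loads("""{"size": 2048, "stargazers_count": 150,
  "subscribers_count": 40, "forks_count": 12, "open_issues_count": 7}""")
print(extract_variables(sample)["stars"])  # → 150
```

In practice each Showcase project would be fetched with an authenticated request and the resulting JSON passed through a mapping of this kind.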

DEA
Data envelopment analysis (DEA) is a method used to measure the comparative efficiency of decision making units (DMUs) with multiple inputs and outputs [13]. The efficiency of software projects is usually measured with DEA. DEA is particularly useful for OSS, whose production process is complex and difficult to define, because DEA derives an efficient frontier from empirical data without explicitly assuming a production function and measures the efficiency of each object of evaluation based on its distance from the frontier [14]. Furthermore, DEA provides useful information for performance benchmarking, such as which efficient DMU should be used as a reference and how much improvement is needed in input or output elements.
To conduct efficiency analysis using DEA, an appropriate returns-to-scale assumption must be established. In the case of OSS projects, the assumption of variable returns to scale is applied, because project scales vary and both increasing and decreasing returns to scale exist in the information technology industry [15]. As such, this study selected the Banker-Charnes-Cooper (BCC) model, which assumes variable returns to scale. The form of the model used in this study is shown below:
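In the notation standard to the DEA literature, the output-oriented BCC envelopment program for DMU $o$, with inputs $x_{ij}$, outputs $y_{rj}$, output expansion factor $\eta$, and intensity weights $\lambda_j$, is:

```latex
\begin{aligned}
\max_{\eta,\ \lambda}\quad & \eta \\
\text{s.t.}\quad & \sum_{j=1}^{n} \lambda_j x_{ij} \le x_{io}, && i = 1,\dots,m,\\
& \sum_{j=1}^{n} \lambda_j y_{rj} \ge \eta\, y_{ro}, && r = 1,\dots,s,\\
& \sum_{j=1}^{n} \lambda_j = 1, \qquad \lambda_j \ge 0, && j = 1,\dots,n.
\end{aligned}
```

The convexity constraint $\sum_j \lambda_j = 1$ is what distinguishes the variable-returns-to-scale BCC model from the constant-returns-to-scale CCR model; a DMU is efficient when the optimal $\eta^{*} = 1$.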

Two-Stage DEA
OSS projects on GitHub are accompanied by a very complex development process. A traditional DEA assumes this commit production process as a single black box, and is unable to accurately reflect the efficiency of the GitHub OSS project. Therefore, this study utilizes the two-stage DEA model, which breaks down the internal structure of DMU in two stages rather than one, to analyze efficiency. The application of two-stage DEA can reveal internalized inefficiencies within a DMU which appears efficient in the traditional DEA model. Moreover, this model shows which stage specifically causes inefficiency since it calculates the efficiency of each stage independently; as such, guidelines to improve efficiencies can be specified by stage.
A commit carries a symbolic meaning in GitHub-based OSS development process. As mentioned earlier, a commit refers to an independent data unit that contains records of newly created code, which is created by Git [6]. It would be no exaggeration to refer to GitHub as a service that supports the convenient management and sharing of commits. This is because all code that makes up the OSS is contained within at least one commit. In discussing the efficiency of the OSS project, this study aimed to reflect the process through which the commits are created and merged into the master branch; as such, the mid-level output of the two-stage DEA model was set as the number of commits. On this basis, the merge efficiency of the first stage evaluated how efficiently the commits could be created by a large number of participants, and the project efficiency of the second stage evaluated the qualitative and quantitative outputs from the commits.

Stage 1-Merge Efficiency
All commits in an OSS project are divided into two types: those resulting from pull requests that have passed a staged review process, and those directly added by contributors with push authority. As such, the merge efficiency of the first stage considers the number of pull requests and the number of contributors as inputs, and the number of commits as the output. As OSS projects do not involve monetary costs for hiring human resources, but rather the voluntary participation of an unspecified many, it can be assumed that there is no control over inputs. Therefore, it is appropriate to use the output-oriented model, which seeks the maximization of output for a given level of inputs [9].
Put simply, merge efficiency is lowered for two reasons: lazy authorities or unacceptable pull requests. The former is the case where collaborators with push authority do not produce enough commits; the latter is the case where the majority of pull requests are rejected owing to problems such as the low quality of the suggested code.
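As a sketch of how such an output-oriented BCC score can be computed, the program reduces to one small linear program per DMU; the function below is illustrative (the toy data in the usage note are not the study's dataset):

```python
import numpy as np
from scipy.optimize import linprog

def bcc_output_efficiency(X, Y, j0):
    """Output-oriented BCC (VRS) efficiency of DMU j0.

    X: (m, n) input matrix, Y: (s, n) output matrix, columns are DMUs.
    Returns an efficiency score in (0, 1]; 1 means the DMU is on the frontier.
    """
    m, n = X.shape
    s = Y.shape[0]
    # Decision variables: [eta, lambda_1, ..., lambda_n]; maximize eta.
    c = np.zeros(n + 1)
    c[0] = -1.0                       # linprog minimizes, so minimize -eta
    A_ub = np.zeros((m + s, n + 1))
    b_ub = np.zeros(m + s)
    # Input constraints: sum_j lambda_j * x_ij <= x_i,j0
    A_ub[:m, 1:] = X
    b_ub[:m] = X[:, j0]
    # Output constraints: eta * y_r,j0 - sum_j lambda_j * y_rj <= 0
    A_ub[m:, 0] = Y[:, j0]
    A_ub[m:, 1:] = -Y
    # VRS convexity constraint: sum_j lambda_j = 1
    A_eq = np.ones((1, n + 1))
    A_eq[0, 0] = 0.0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (n + 1))
    return 1.0 / res.x[0]             # report efficiency as 1/eta
```

For merge efficiency, X would hold the numbers of contributors and pull requests and Y the number of commits; the same routine applies unchanged to the second stage, with commits as the input and stars and project size as the outputs.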

Stage 2-Project Efficiency
In the second stage, project efficiency is analyzed with the number of commits as the input and the number of stars and the project size as outputs. Project efficiency is characterized by a relatively fixed input compared to the output and seeks the maximization of outputs given the number of commits; thus, the output-oriented model is again selected.

DMU Selection
To properly apply the DEA model to a study of OSS efficiency, the compared DMUs must be homogeneous. Loss of homogeneity may cause the result to depend on external factors rather than on the objects of analysis [16]. For this reason, Ghapanchi and Aurum [7] narrowed the scope of their samples to projects under the software development category among the variety of categories offered by Sourceforge.net. Likewise, the majority of existing literature limited DMUs to a certain category, such as ERP projects [17], Y2K projects [18], and security-based projects [8].
While GitHub, unlike Sourceforge.net, has no clear division of subject categories, it provides collections of projects on popular subjects through its Showcase page [19]. Among the 52 subjects provided by GitHub, this study secures homogeneity by targeting the web application frameworks group, as it has the largest number of projects that meet the following two conditions:

License
The ownership and distribution rights for an individual OSS are specified under a diverse range of licenses. In this study, only the projects that contained License.md or License.txt files were selected to limit the objects of analysis to projects that officially correspond to OSS.

Activeness
To secure the effectiveness of the research, this study targeted only projects that have been active until recently and fully utilize the features of Git. These were projects in which a commit had been submitted within the last month, at least 20 contributors were participating, and both pull requests and issues existed.
GitHub presents 29 projects under the subject of web application frameworks, which provide frameworks for server-side web development. All 29 projects met the above two conditions and were selected as DMUs. As this number is larger than three times the number of variables used, it is considered a valid number of DMUs [20].

Input and Output Variables
When measuring the efficiency of OSS projects, a common input is the number of developers, which is analogous to labor cost in traditional projects [7][8][9]. Nevertheless, there are logical flaws in considering the number of contributors provided by GitHub as an input. The 'labor' of an OSS project includes those writing code as well as those reporting bugs through an issue or presenting opinions. However, the concept of a contributor as defined by GitHub is limited to those whose code suggestions were actually accepted; bug reporters and those whose pull requests were declined are ignored. This is because the level of contribution is determined by the number of commits on GitHub. Referring to Figure 2, all commits are created through two paths: pull requests and users with push access. Therefore, not only the number of contributors but also the number of pull requests was selected as an input of OSS project efficiency. As shown in Table 1, the correlation between the two variables was low enough, although a potential for multicollinearity still existed. As the number of commits, the mid-level variable, only contains code that was evaluated to be significant and accepted, it differs from the input variables, which also include code that was not accepted. Moreover, commits sometimes contain deletions of wrong or inefficient code as well as additions of new code. Therefore, it is a concept that clearly differs from project size, which measures the quantitative outputs of a project.
In the case of outputs, project size was selected as a quantitative indicator of the output of the OSS project [9]. The GitHub API measures the size of the entire repository, including all of its history, in kilobytes [12]. However, quantitative indicators do not fully reflect the characteristics of OSS and only show one aspect of outputs. As such, Wray and Mathieu [8] proposed the number of downloads and the project rank as qualitative outputs. Although GitHub does not provide a specific ranking of OSS projects, it is possible to sort projects by the number of stars, a bookmarking feature, and the number of watches, a feature for following a certain project. 'Starred' projects are listed in the star section of the menu, whereas notifications about projects a user is 'watching' are shown in the user's dashboard. Both features are appropriate qualitative outputs, as they involve end-users' evaluation of the quality of the OSS. Table 2 shows the correlations between outputs; the number of stars and the number of watches have a correlation of 0.8579, which is rather high. Therefore, along with project size as a quantitative output, this study selected the number of stars, the feature more easily accessible to users, as an output, and excluded the number of watches from the final outputs. Figure 3 shows the resulting research model.

Stage 1-Merge Efficiency
Table 3 shows the results of a DEA performed with the output-oriented BCC model for the merge efficiency of all 29 OSS projects. The inputs are the number of contributors and the number of pull requests; the output of this stage, which is at the same time the mid-level variable of the entire model, is the number of commits (Y1). As this study utilized the output-oriented model, output shortage values, rather than input surpluses, are given in the corresponding columns of Table 5.
Based on variable returns to scale (VRS) efficiency, six of the 29 projects were efficient, with a score of 1; the remaining projects scored below 1. The reference sets refer to the proposed benchmark targets, and reference frequency refers to the number of times a project was referenced as a benchmark by other DMUs. With respect to merge efficiency, 'mojo' and 'meteor' were most frequently used as references, 18 and 12 times respectively, followed by 'cakephp', 'rails', and 'catalyst-runtime'.

Stage 2-Project Efficiency
In the case of project efficiency, as shown in Table 4, seven of the 29 projects were found to be efficient based on VRS. A noticeable finding is that, aside from 'meteor', 'rails', and 'whitestorm', which were efficient in both stages, all four projects with a project efficiency of 1 ('nodal', 'laravel', 'flask', and 'kemal') performed extremely poorly in merge efficiency. This indicates that the merge efficiency of these four projects needs to be improved to maximize their overall efficiency. Similar conclusions can be drawn for 'cakephp', 'mojo', and 'catalyst-runtime', which had high merge efficiency but low project efficiency. This would not have been discovered had the analysis been done with the traditional DEA model rather than the two-stage model, which measures efficiency taking the internal processes into account.

Overall Efficiency
There are two methods for deriving the overall efficiency from the results of each stage in the two-stage DEA model: the additive method and the multiplicative method [21]. In the additive approach, the total value is calculated using a weighted average. However, there are numerous ways to calculate the weights, and it is difficult to apply such calculations appropriately to the characteristics of OSS projects; this study therefore concluded that using weighted averages, which are only estimated values, could reduce the accuracy of the analysis. Consequently, overall efficiency was derived using the multiplicative method introduced by Kao and Hwang [22].
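Under the multiplicative method the aggregation itself is simple: the overall score of each DMU is the product of its two stage scores. The figures below are hypothetical, for illustration only:

```python
# Kao-Hwang multiplicative aggregation: overall = stage-1 score x stage-2 score
merge_eff   = {"express": 0.74, "rails": 1.00}  # hypothetical stage-1 scores
project_eff = {"express": 0.91, "rails": 1.00}  # hypothetical stage-2 scores

overall = {p: merge_eff[p] * project_eff[p] for p in merge_eff}
# A project can only be overall-efficient if it is efficient in both stages.
```

Because each stage score lies in (0, 1], the product can never exceed the smaller of the two, which is why overall efficiency is bounded above by the black box efficiency.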
Furthermore, black box efficiency was calculated using the initial input and final output, treating the middle stage as a black box as under the traditional DEA method. Comparing these figures, the efficiency resulting from the two-stage DEA model was either equal to or lower than the black box efficiency. This is because the two-stage DEA can examine the development process of OSS projects and capture internal inefficiencies. The idea of efficiency decomposition has been shown to have advantages over the traditional one-stage model for systems that involve complex sub-processes [22,23]. One example that reconfirms the limitation of the traditional DEA model is 'express'. Under the traditional model, the efficiency value of 'express' is 1; however, the overall efficiency and the stage efficiencies of the same project reflect internal inefficiencies, as shown in Table 5.

Determinants of Efficiency
"Given enough eyeballs, all bugs are shallow." This quote from Raymond [3] is known as 'Linus's Law'. It identifies the core principle behind successful OSS as "many eyeballs". OSS projects, the voluntary collaboration of a massive number of anonymous developers around the world, often seem idealistic. In fact, OSS development surfaced as a rebellion against the monopoly of commercial software, valuing freedom and equality, and taking place through contribution and sharing by the crowd.
However, participation by many can often be toxic, because the larger the group becomes, the greater the potential for inefficient communication, which makes seamless project management difficult. An increase in elite participants and a production monopoly by the few, rather than a simple increase in average participants, is required for efficient OSS projects [6].
Thus, in order to examine the relationship of the "participation by many" with the value of merge efficiency derived in Stage 1 of the two-stage DEA, this study considered the Kruskal-Wallis test and Tobit regression analysis. First, the Kruskal-Wallis test was conducted on two simple hypotheses. The Kruskal-Wallis test is an expanded model of the Mann-Whitney U test, and allows analysis of three or more groups. The test results are presented below.
According to the first hypothesis (H1 in Table 6), a project is more efficient if the ratio of the number of contributors to the total number of commits is low. A higher number of contributors means that more pull requests were accepted as commits, because additional commits made by a collaborator with direct push access do not increase the number of contributors unless it is his or her very first commit. This indicates that commits made directly by those with push authority, instead of going through pull requests, act positively in terms of efficiency. In other words, closed production by a small number of authorities can raise the efficiency of an OSS project more than the open participation of many. From this observation, it can be concluded that measures to lower the ratio of contributors to commits should be considered for DMUs such as 'nodal', 'laravel', 'flask', and 'kemal', which have high project efficiency but extremely low merge efficiency. This can be achieved through active commit production by the small number of authorities, rather than by anticipating a mere increase in total contributors, that is, an increase in pull requests.

According to the second hypothesis (H2 in Table 6), a higher number of issues does not relate to higher efficiency in OSS projects. This indicates that participation by an excessive number of people may bring difficulty in communication, which negatively affects the project. To ensure that project size was not acting as a confounding variable, a correlation analysis was performed between project size and the number of issues; the resulting correlation of 0.3558 is low enough.
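A test of this kind can be sketched with SciPy's kruskal function; the groups below are synthetic efficiency scores binned by contributor-to-commit ratio, purely to illustrate the mechanics (they are not the study's data):

```python
from scipy.stats import kruskal

# Synthetic merge-efficiency scores, binned by contributors-per-commit ratio
low_ratio  = [0.95, 0.90, 1.00, 0.88]   # few contributors per commit
mid_ratio  = [0.70, 0.65, 0.75]
high_ratio = [0.30, 0.25, 0.40, 0.35]   # many contributors per commit

stat, p = kruskal(low_ratio, mid_ratio, high_ratio)
# A small p-value indicates the efficiency distributions differ across bins.
```

The Kruskal-Wallis statistic is rank-based, so it requires no normality assumption on the efficiency scores, which are bounded in (0, 1] and typically skewed.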
Furthermore, the Tobit regression analysis was conducted to estimate potential effects of different factors on OSS efficiency. The examined factors include the number of forks, heavy contributors, watches, issues, and branches. A heavy contributor was defined as a contributor who had created more than 100 commits. The regression results are shown in Table 7.
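Since standard statistics libraries do not ship a ready-made Tobit estimator, the censored-regression likelihood can be maximized directly. The sketch below assumes right-censoring at an efficiency score of 1 and is a generic illustration of the technique, not the study's exact specification:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def tobit_right_censored(y, x, c=1.0):
    """MLE for a Tobit model with right-censoring at c:
    y* = b0 + b1*x1 + ... + e,  e ~ N(0, sigma^2),  y = min(y*, c).
    Returns (beta, sigma)."""
    X = np.column_stack([np.ones(len(y)), x])
    censored = y >= c

    def nll(params):
        b, log_s = params[:-1], params[-1]
        s = np.exp(log_s)
        mu = X @ b
        # Censored points contribute P(y* >= c); observed points the density.
        ll = np.where(censored,
                      norm.logsf((c - mu) / s),
                      norm.logpdf((y - mu) / s) - log_s)
        return -ll.sum()

    start = np.zeros(X.shape[1] + 1)      # b = 0, log(sigma) = 0
    res = minimize(nll, start, method="BFGS")
    return res.x[:-1], np.exp(res.x[-1])
```

In the study's setting, y would be the efficiency score (censored at 1) and the columns of x the candidate factors such as forks, heavy contributors, watches, issues, and branches; parameterizing sigma through its logarithm keeps the optimization unconstrained.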