1. Introduction
Many studies in the software engineering literature use data envelopment analysis (DEA) to rank software development [1] and maintenance projects [2]. When a software project is evaluated for efficiency under the DEA framework, it can perform a self-assessment by selecting an optimal set of weights to achieve the highest possible efficiency score. Such self-assessments are considered “egoistic” or “narcissistic.” The fundamental issue with these self-assessments is that many software projects—also known as decision-making units (DMUs) in DEA terminology—achieve an optimal efficiency score of 1 [3]. The DEA literature refers to this as a problem of total weight flexibility [4]. The problem with weight flexibility is that it prevents further prioritization among the fully efficient subset of projects. This undermines the purpose of efficiency evaluations because, ideally, only a small number of projects should be deemed efficient. Moreover, this small group of efficient projects should also reflect a high level of organizational consensus about their efficiencies.
DEA cross-efficiency (CE) [4] and game-efficiency (GE) [5] models were introduced to address issues with self-evaluations and weight flexibility. CE methods are valued for providing unbiased peer evaluations that lead to a unique ranking of DMUs [4]. However, these techniques have two main limitations: first, CE evaluations can only be obtained under constant-returns-to-scale assumptions, and second, the optimal weights are not unique. Although CE models have been developed under the variable-returns-to-scale assumption, the resulting models impose additional constraints, and DMU efficiency measured under such constraints has limited interpretation [6]. The non-uniqueness of the optimal CE weights can be addressed by using secondary objective functions, where the DMU under evaluation uniquely selects its optimal weights using either an aggressive or a benevolent criterion [4]. The GE approach assumes that projects compete for a shared pool of resources and that some direct or indirect competition exists between the projects under evaluation. Under these assumptions, the GE approach estimates DMU efficiencies using traditional CE scores as lower bounds for GE scores; typical GE scores fall between CE scores and traditional DEA scores. The GE model provides a unique set of DMU rankings and does not require any secondary objective functions. In the context of software engineering projects, the GE model is suitable only when all evaluated software projects come from the same organization and the same time frame.
Since software project managers now have several frontier efficiency models to choose from—each offering its own benefits and assumptions—it is crucial to evaluate how these models perform on real-world datasets. After obtaining efficiency scores from each model, they can be averaged into a hybrid ensemble efficiency score, or different performance metrics can be employed to identify DMUs where models diverge in their evaluations [7]. This paper addresses two key issues. First, how can the most informative and least biased model be determined when multiple software productivity ranking models are available? Second, DEA models often identify several projects as fully efficient and yield the highest average project efficiency scores, while CE models tend to provide the lowest average scores; when both models are used, how can their scores be combined to identify false positives, that is, high-ranking DEA projects that do not truly merit their rankings? To address the first issue, this paper compares various software productivity ranking models and evaluates them using an entropy-based criterion. The entropy criterion is rooted in the maximum entropy principle, which yields the least biased assessment consistent with the data. This criterion is frequently employed in the software engineering literature, with recent applications including the development of ensemble classifiers [8] and the measurement of software trustworthiness [9]. To tackle the second issue, this paper uses a rule-based approach to identify discrepancies in project rankings across different models and to detect false positives. Projects that all models agree on can be deemed genuinely efficient; projects with differing rankings are then scrutinized for false positives.
The rest of the paper is organized as follows. In Section 2, different DEA frontier models and the entropy criterion for model evaluation are presented. Section 3 describes datasets, experiments, and results. Section 4 concludes the paper with a summary and a discussion of limitations.
2. Overview of DEA Efficiency, Cross Efficiency, and Game Efficiency Models
For single-input, single-output measures of software effort in person-months and software size in function points, software productivity is defined as software size divided by software effort (i.e., function points per person-month) [10]. In this research, multiple inputs are used, with software size metric(s) as output(s) and software cost drivers as inputs. Since managers typically attempt to control for cost drivers [11], input-oriented DEA models are more suitable for software project benchmarking. These input-oriented models are described in this section.
The traditional DEA model can be described as $n$ DMUs, each employing $m$ inputs and $s$ outputs. The input vector $x$ and the output vector $y$ take non-negative values. For a DMU $p \in \{1, \ldots, n\}$, its traditional self-evaluation efficiency score, $E_{pp}$, may be computed by using the optimal weights of the following mathematical programming model:

$$E_{pp} = \max \frac{\sum_{r=1}^{s} u_{rp}\, y_{rp}}{\sum_{i=1}^{m} v_{ip}\, x_{ip}} \tag{1}$$

subject to the following:

$$\frac{\sum_{r=1}^{s} u_{rp}\, y_{rj}}{\sum_{i=1}^{m} v_{ip}\, x_{ij}} \le 1, \quad j = 1, \ldots, n; \qquad u_{rp} \ge 0,\ v_{ip} \ge 0 \ \ \forall r, i.$$
The variables $v_{ip}$ and $u_{rp}$ are the input and output weights selected by DMU $p$ for the respective inputs and outputs. Since model (1) is non-linear and has multiple optimal solutions, it can be linearized as shown below using the Charnes-Cooper transformation. The linearization guarantees that the optimal objective value is unique:

$$E_{pp} = \max \sum_{r=1}^{s} u_{rp}\, y_{rp} \tag{2}$$

subject to the following:

$$\sum_{i=1}^{m} v_{ip}\, x_{ip} = 1,$$
$$\sum_{r=1}^{s} u_{rp}\, y_{rj} - \sum_{i=1}^{m} v_{ip}\, x_{ij} \le 0, \quad j = 1, \ldots, n,$$
$$u_{rp} \ge 0,\ v_{ip} \ge 0 \ \ \forall r, i.$$

The optimal value obtained by solving model (2), $E_{pp}^{*}$, is the traditional DEA efficiency score for DMU $p$.
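As an illustration, the linear program (2) can be solved directly with an off-the-shelf LP solver. The following is a minimal sketch, assuming the inputs are stored row-wise in a NumPy array `X` (n × m) and the outputs in `Y` (n × s); it illustrates the formulation and is not the implementation used in this study.

```python
# Minimal sketch of the linearized input-oriented DEA model (2) using
# scipy.optimize.linprog; X (n x m inputs) and Y (n x s outputs) are assumed.
import numpy as np
from scipy.optimize import linprog

def dea_efficiency(X, Y, p):
    """Traditional self-evaluation DEA efficiency E_pp of DMU p."""
    n, m = X.shape
    s = Y.shape[1]
    # Decision vector z = [v_1..v_m, u_1..u_s]; linprog minimizes, so the
    # objective max sum_r u_r*y_rp becomes min -sum_r u_r*y_rp.
    c = np.concatenate([np.zeros(m), -Y[p]])
    # Normalization constraint: sum_i v_i*x_ip = 1.
    A_eq = np.concatenate([X[p], np.zeros(s)]).reshape(1, -1)
    # Ratio constraints: sum_r u_r*y_rj - sum_i v_i*x_ij <= 0 for every DMU j.
    A_ub = np.hstack([-X, Y])
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (m + s))
    return -res.fun, res.x[:m], res.x[m:]  # E_pp, optimal v, optimal u
```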
If $v_{ip}^{*}$ and $u_{rp}^{*}$ are the optimal solutions chosen by DMU $p$ using model (2), then the cross-efficiency scores for all DMUs are computed by applying DMU $p$'s optimal weights as follows:

$$E_{pj} = \frac{\sum_{r=1}^{s} u_{rp}^{*}\, y_{rj}}{\sum_{i=1}^{m} v_{ip}^{*}\, x_{ij}}, \quad j = 1, \ldots, n. \tag{3}$$

Overall cross-efficiency scores are computed by solving problem (2) for each DMU as an evaluator and by computing the corresponding cross-efficiency scores for all DMUs based on the optimal weights of the evaluating DMUs. Once this process is over, the cross-efficiency score for some DMU $j$ is computed using the following expression:

$$\bar{E}_{j} = \frac{1}{n} \sum_{p=1}^{n} E_{pj}. \tag{4}$$
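A compact sketch of this double loop is given below; it reuses the `dea_efficiency` helper from the previous sketch and the same assumed data layout, and it assumes strictly positive input data so the denominators in (3) are nonzero.

```python
# Sketch of the cross-efficiency computation in (3) and (4): each DMU p
# rates every DMU j with p's optimal weights; column means give average CE.
import numpy as np

def cross_efficiency(X, Y):
    """Return the CE matrix E[p, j] and the average CE scores from (4)."""
    n = X.shape[0]
    E = np.zeros((n, n))
    for p in range(n):
        _, v, u = dea_efficiency(X, Y, p)   # evaluator p's optimal weights
        E[p] = (Y @ u) / (X @ v)            # expression (3) for all DMUs j
    return E, E.mean(axis=0)                # column means implement (4)
```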
A well-known drawback of cross-efficiency scores computed by expression (4) is the presence of alternative optima in the DEA weights [12]. To remove arbitrary optimal weight selections, researchers have suggested secondary objective functions to select a unique set of optimal weights under different secondary criteria. Among these, the popular ones are the aggressive and benevolent weight selection criteria. Under the aggressive criterion, the rating DMU maximizes its simple efficiency score as its primary objective while minimizing the average cross-efficiency score of its peers as its secondary objective. Under the benevolent criterion, the primary objective remains the same, while the secondary objective is to maximize the average cross-efficiency scores of its peers. Assuming that the simple efficiency of DMU $p$ is computed using model (2) and is given as $E_{pp}^{*}$, the secondary aggressive criterion formulation can be written as follows:

$$\min \sum_{r=1}^{s} u_{rp} \left( \sum_{j \ne p} y_{rj} \right) \tag{5}$$

subject to the following:

$$\sum_{i=1}^{m} v_{ip} \left( \sum_{j \ne p} x_{ij} \right) = 1,$$
$$\sum_{r=1}^{s} u_{rp}\, y_{rj} - \sum_{i=1}^{m} v_{ip}\, x_{ij} \le 0, \quad j \ne p,$$
$$\sum_{r=1}^{s} u_{rp}\, y_{rp} - E_{pp}^{*} \sum_{i=1}^{m} v_{ip}\, x_{ip} = 0,$$
$$u_{rp} \ge 0,\ v_{ip} \ge 0 \ \ \forall r, i.$$
The corresponding benevolent formulation can be constructed by changing the minimize objective in model (5) to a maximize objective.
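The sketch below sets up this secondary-goal LP in the Doyle-Green style, where the average-CE objective is represented by the peers' aggregate outputs; this surrogate form is an assumption, and `e_pp` is the optimal score $E_{pp}^{*}$ from model (2). Passing `sense=-1` flips the objective and yields the benevolent variant.

```python
# Hedged sketch of the aggressive/benevolent secondary-goal model (5).
import numpy as np
from scipy.optimize import linprog

def secondary_goal_weights(X, Y, p, e_pp, sense=1):
    """sense=+1: aggressive (minimize peers' output); sense=-1: benevolent."""
    n, m = X.shape
    s = Y.shape[1]
    others = [j for j in range(n) if j != p]
    # Objective: sum_r u_r * (sum_{j != p} y_rj), minimized or maximized.
    c = sense * np.concatenate([np.zeros(m), Y[others].sum(axis=0)])
    A_eq = np.vstack([
        np.concatenate([X[others].sum(axis=0), np.zeros(s)]),  # normalization
        np.concatenate([-e_pp * X[p], Y[p]]),     # hold own efficiency at E_pp*
    ])
    A_ub = np.hstack([-X[others], Y[others]])     # peers' ratios stay <= 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n - 1), A_eq=A_eq,
                  b_eq=[1.0, 0.0], bounds=[(0, None)] * (m + s))
    return res.x[:m], res.x[m:]                   # optimal v and u for DMU p
```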
Liang et al. [5] defined the GE of DMU $j$ with respect to DMU $p$ using the following expression:

$$E_{pj}(\alpha_p) = \frac{\sum_{r=1}^{s} u_{rj}^{p*}\, y_{rj}}{\sum_{i=1}^{m} v_{ij}^{p*}\, x_{ij}}, \tag{6}$$

where the variables $u_{rj}^{p*}$ and $v_{ij}^{p*}$ are the optimal solutions obtained by solving the following Formulation (7) for an $n$-dimensional vector $\alpha = (\alpha_1, \ldots, \alpha_n)$ of known values:

$$\max \sum_{r=1}^{s} u_{rj}^{p}\, y_{rj} \tag{7}$$

subject to the following:

$$\sum_{i=1}^{m} v_{ij}^{p}\, x_{ij} = 1,$$
$$\sum_{r=1}^{s} u_{rj}^{p}\, y_{rl} - \sum_{i=1}^{m} v_{ij}^{p}\, x_{il} \le 0, \quad l = 1, \ldots, n,$$
$$\alpha_p \sum_{i=1}^{m} v_{ij}^{p}\, x_{ip} - \sum_{r=1}^{s} u_{rj}^{p}\, y_{rp} \le 0,$$
$$u_{rj}^{p} \ge 0,\ v_{ij}^{p} \ge 0 \ \ \forall r, i.$$
The vector component $\alpha_p$ is a positive number taking values in the interval [0, 1]. Model (7) maximizes the efficiency of DMU $j$ under the constraint that the cross-efficiency of DMU $p$ is no less than the value $\alpha_p$. Model (7) is solved $n$ times, once for each $j \in \{1, \ldots, n\}$. The average GE for DMU $j$ is defined as follows:

$$E_{j}^{GE} = \frac{1}{n} \sum_{p=1}^{n} E_{pj}(\alpha_p). \tag{8}$$
Given that the value of $E_{j}^{GE}$ is dependent on $E_{pj}(\alpha_p)$, which is a function of $(\alpha_1, \ldots, \alpha_n)$, an iterative algorithm was developed by Liang et al. [5] to obtain the best GE scores at the Nash equilibrium point. This iterative algorithm has the following three steps.
Step 1. Solve (2) and obtain cross-efficiency scores using (3). Set iteration $t = 1$ and $\alpha_p^{1} = \bar{E}_p$, the average cross-efficiency score from (4), $\forall p = 1, \ldots, n$.

Step 2. Solve model (8) and set

$$\alpha_j^{t+1} = \frac{1}{n} \sum_{p=1}^{n} E_{pj}(\alpha_p^{t}),$$

where $E_{pj}(\alpha_p^{t})$ represents an optimal value obtained by solving (7) when $\alpha_p = \alpha_p^{t}$.

Step 3. If $|\alpha_j^{t+1} - \alpha_j^{t}| \ge \varepsilon$ for some $j \in \{1, \ldots, n\}$, where $\varepsilon$ is a small positive tolerance, then set $t = t + 1$ and go to Step 2. Otherwise, stop; $E_j^{GE} = \alpha_j^{t+1}$ are the best GE scores of DMU $j$ for $j = 1, \ldots, n$.
Figure 1 illustrates the procedure’s flowchart.
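The following is a minimal sketch of this iteration, under the same assumed `X`/`Y` layout as the earlier sketches; `eps` and `max_iter` are assumed control parameters, not values from the paper.

```python
# Sketch of the game-efficiency iteration (Steps 1-3).
import numpy as np
from scipy.optimize import linprog

def game_lp(X, Y, j, p, alpha_p):
    """Solve model (7): efficiency of DMU j with DMU p held at alpha_p."""
    n, m = X.shape
    s = Y.shape[1]
    c = np.concatenate([np.zeros(m), -Y[j]])           # maximize u . y_j
    A_eq = np.concatenate([X[j], np.zeros(s)]).reshape(1, -1)  # v . x_j = 1
    A_ub = np.vstack([
        np.hstack([-X, Y]),                            # u.y_l - v.x_l <= 0
        np.concatenate([alpha_p * X[p], -Y[p]])[None]  # u.y_p >= alpha_p*v.x_p
    ])
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n + 1), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (m + s))
    return -res.fun                                    # E_pj(alpha_p)

def game_efficiency(X, Y, avg_ce, eps=1e-4, max_iter=100):
    """Iterate Steps 2-3 starting from the average CE scores (Step 1)."""
    n = X.shape[0]
    alpha = np.asarray(avg_ce, dtype=float)
    for _ in range(max_iter):
        new = np.array([np.mean([game_lp(X, Y, j, p, alpha[p])
                                 for p in range(n)]) for j in range(n)])
        if np.max(np.abs(new - alpha)) < eps:          # Step 3 stopping test
            return new
        alpha = new
    return alpha
```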
The maximum entropy (ME) principle has been applied to test the DEA DMU ranking distribution bias [13], to aggregate weights [14], and to compare different DEA models [15]. Let $E_j^{*}$ denote a generic final optimal efficiency score for DMU $j$ (i.e., any one of the traditional, cross-efficiency, or game-efficiency scores). The entropy score of a given model can be computed by first normalizing the efficiency scores using the following expression:

$$\tilde{E}_j = \frac{E_j^{*}}{\sum_{k=1}^{n} E_k^{*}},$$

and then computing the entropy score using the following expression:

$$\xi = -\sum_{j=1}^{n} \tilde{E}_j \ln \tilde{E}_j.$$
Higher values of ξ mean lower discrimination power of the model because many DMUs are assigned similar scores. The highest value of ξ is achieved when all DMUs are assigned an efficiency score of 1 and the model has zero discrimination power. Models with lower entropy values are generally desirable because they tend to uniquely rank DMUs and provide for better discrimination among ranked DMUs.
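A short sketch of the entropy criterion follows; `scores` stands for any array of final efficiency scores from one of the models.

```python
# Sketch of the entropy criterion: normalize the scores so they sum to one,
# then compute xi = -sum_j e_j * ln(e_j). Lower xi means sharper
# discrimination among DMUs.
import numpy as np

def entropy_score(scores):
    e = np.asarray(scores, dtype=float)
    e = e / e.sum()          # normalization expression above
    e = e[e > 0]             # guard against log(0) for zero scores
    return float(-(e * np.log(e)).sum())
```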
3. Data, Experiments, and Results
Data used for testing the software productivity models were obtained from the International Software Benchmarking Standards Group (ISBSG). The ISBSG dataset is widely used in the software engineering literature and was also employed in the study of production functions [16]. In particular, two software productivity models were tested. The first model was a single-output model in which the software production function produced software size (in function points) as an output, with software effort, project schedule length (in months), and total defects as inputs. The project schedule length was calculated as project elapsed time minus any inactive project time. Total defects were determined as the sum of major and minor defects. Only projects with complete data were included in the analysis; projects with any missing attributes were excluded. As per the ISBSG dataset convention, only projects with data quality ratings A and B were selected for analysis. According to ISBSG, a data quality rating of A indicates that all data are seemingly sound, while a rating of B suggests that most data attributes are sound except for a few that may not be entirely reliable. A total of 86 projects were used to test the first software productivity model.
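A hedged sketch of this project-selection step is shown below; every ISBSG column name used here is hypothetical and would need to be mapped to the columns of the actual ISBSG release.

```python
# Hypothetical sketch of the complete-case, quality-A/B filtering step.
import pandas as pd

df = pd.read_csv("isbsg.csv")                      # assumed CSV export
df = df[df["DataQualityRating"].isin(["A", "B"])]  # keep ratings A and B only
df["Schedule"] = df["ProjectElapsedTime"] - df["ProjectInactiveTime"]
df["TotalDefects"] = df["MajorDefects"] + df["MinorDefects"]
# Complete-case filter: drop projects missing any model variable.
model1 = df.dropna(subset=["FunctionPoints", "SummaryWorkEffort",
                           "Schedule", "TotalDefects"])
```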
Figure 2 shows the traditional DEA scores from (2), cross-efficiency (CE) scores under an arbitrary criterion using (4), cross-efficiency scores with the secondary aggressive (AGG) criterion from (5), and scores with the benevolent (Ben) criterion obtained by maximizing the objective function in (5). In the figure, the X-axis lists the DMU numbers, and the Y-axis displays the efficiency scores.
Two expectations exist regarding the results. First, traditional DEA scores will consistently outperform all other scores (i.e., DEA ≥ max{CE, GE, AGG, Ben}); second, GE scores will always be bounded above by traditional DEA scores and below by CE scores based on an arbitrary objective function (i.e., DEA ≥ GE ≥ CE). The results confirmed both expectations.
Figure 3 displays the efficiency difference plot comparing DEA and GE scores, as well as GE and CE scores. Similarly to Figure 2, the X-axis shows the DMU numbers, and the Y-axis represents the differences in efficiency scores. The plot indicates that the gap between traditional DEA and GE scores is generally larger than the gap between GE and CE scores. The DEA identified two DMUs as the most efficient, both with an efficiency score of 1. All methods ranked DMU #3 as the highest, with its respective efficiency scores listed in the Max row of Table 1. DMU #3 was a small project whose two inputs took low values, with a project schedule of one month and zero total defects, against an overall data average project schedule length of approximately 9 months and average total defects of approximately 10 defects. Similarly, all techniques agreed that DMU #43 was the least efficient, with its scores listed in the Min row of Table 1. For this project, the primary reason for the lower efficiency score was a lower value of the output: its FP value was 16 against a sample average of about 765 function points. The inputs for this DMU had high values of Effort = 465, Schedule = 7, and Defects = 7. Apart from the DEA, all other methods produced unique rankings.
When comparing the Ben and AGG secondary-objective cross-efficiency models, the Ben model mostly had higher scores than the AGG model. Figure 4 illustrates the Ben-AGG efficiency value difference plot. As mentioned earlier, the DEA model efficiency scores were the highest, so no additional comparison between the DEA and secondary-objective cross-efficiency models was necessary.
Table 1 summarizes key statistics for all five techniques. The DEA and GE had the highest average efficiency scores, as well as the highest standard deviations. In terms of entropy scores, GE performed the worst with the highest entropy value, while AGG performed the best with the lowest entropy value. Additionally, the mean and standard deviation values for the AGG model were the lowest among all five models.
Table 2 presents the results of a paired-samples t-test comparing efficiency scores across techniques. The table displays |t|-values. All mean differences were significant at the 0.01 level of statistical significance, both for one tail (critical |t|-value = 2.37, df = 85) and two tails (critical |t|-value = 2.63, df = 85). These findings suggest that no two techniques are statistically similar in their rankings.
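A sketch of this paired-samples comparison is shown below; the two arguments stand for two techniques' efficiency score vectors over the same DMUs.

```python
# Sketch of the paired t-test behind Tables 2 and 4 using scipy.
from scipy.stats import ttest_rel

def compare_techniques(scores_a, scores_b):
    """Paired t-test on two techniques' efficiency vectors (same DMUs)."""
    t_stat, p_value = ttest_rel(scores_a, scores_b)
    return abs(t_stat), p_value   # |t|-value as reported, plus two-tail p
```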
The second software productivity model examined in the research has two output measures—function points and source lines of code (SLOC). It uses two cost-influencing inputs: software effort and project schedule duration. Due to missing data for SLOC and total defects, including total defects as an input was not feasible, which restricted the dataset to only a few projects. For the model with two inputs and two outputs, 35 projects had complete data for all four variables and were included in the analysis.
Figure 5 shows the efficiency scores for these 35 DMUs across all five techniques. The DEA ranked four DMUs as fully efficient with a score of 1. All methods gave the highest score to DMU #10 and the lowest to DMU #31. The highest and lowest scores are listed in Table 3. The high efficiency score for DMU #10 was due to its large size (SLOC = 13,900 and FP = 1106). For DMU #31, its low efficiency score was mainly because of a high effort score (7700) and a lower function points score (182).
Figure 6 shows the efficiency difference plot between DEA and GE, and between GE and CE. Unlike the single-output software productivity model, in which the difference between DEA and GE mostly overshadowed that between GE and CE, the plot in Figure 6 suggests that GE scores are close neither to DEA nor to CE scores. No clear pattern emerges from the plot.
Figure 7 shows the Ben-AGG efficiency difference plot. Similarly to the results from the previous dataset, the Ben model generally had higher efficiency scores than the AGG model, but the differences were smaller in the second dataset.
Table 3 shows key statistics for all five techniques. Similarly to the first experiment, the AGG results had the lowest entropy value and provided the best discrimination between projects and unique DMU rankings. Like the initial experiment, the DEA and GE ranking models showed the highest entropy, mean, and standard deviation values. The Ben ranking model does not seem promising in either single-output or two-output experiments, as its entropy values are not lower than the CE entropy values obtained under the arbitrary criterion. Overall, the AGG results appear consistently the most effective.
Table 4 displays the results of a paired t-test comparing the efficiency scores of different techniques. The table shows |t|-values, and most mean differences were statistically significant at the 0.01 level for both one-tail (critical |t|-value = 2.44, df = 34) and two-tail tests (critical |t|-value = 2.73, df = 34). The difference in means between CE and AGG was only significant at a weaker 0.1 level of significance.
If a manager chooses only one best technique, the experiment results show that AGG is the top choice. It produces a unique set of DMU rankings based on the aggressive secondary criterion, under which a DMU does not sacrifice its own simple efficiency score so that its peers can improve theirs. Additionally, the AGG efficiency score is likely to be acceptable to all other DMUs and not be seen as unfair. When managers do not need to explain their results or the assumptions behind them, they can use an ensemble efficiency score by averaging all techniques' scores. However, if a manager prefers to rely only on a traditional DEA method, other techniques can still help identify and highlight projects that deviate significantly from DEA scores. For projects with large score differences between methods, managers may need to reevaluate the scores to prevent unfair ratings. Fortunately, the experiment results show that all methods agree on the highest- and lowest-ranking DMUs, so major disagreements are unlikely. Still, a formal framework is necessary to detect such differences.
The formal evaluation process must distinguish between efficient DMUs and false positives. False positives often occur with ranking methods that assign high efficiency scores to too many DMUs. The traditional DEA model, with its issue of weight flexibility, frequently overestimates DMU efficiency scores, leading to false positives. While traditional DEA and CE methods are used below to identify false positives, any technique with a high entropy value can be compared to another method with a low entropy value to detect false positives in the high-entropy method. Baker and Talluri [17] proposed the following false-positive index (FPI) metric to measure the false positivity of DMUs. For a DMU $j$, its FPI can be calculated using the following expression:

$$FPI_j = \frac{E_{jj} - \bar{E}_j}{\bar{E}_j},$$

where $E_{jj}$ is the traditional DEA efficiency score computed using (2) and $\bar{E}_j$ is the cross-efficiency score calculated using Equation (4). Once FPIs for all DMUs are available, false positives can be identified by selecting an appropriate threshold value for a parameter $\eta$ and running the following rule:

IF $FPI_j > \eta$ THEN DMU $j$ is a false positive;
ELSE DMU $j$ is not a false positive.
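A minimal sketch of this rule follows, assuming `e_dea` holds the traditional DEA scores from (2) and `ce_avg` the average CE scores from (4), as NumPy arrays over the same DMUs.

```python
# Sketch of the FPI computation and the threshold rule above.
import numpy as np

def false_positives(e_dea, ce_avg, eta=0.5):
    """Return a boolean mask: True where FPI_j > eta."""
    fpi = (np.asarray(e_dea) - np.asarray(ce_avg)) / np.asarray(ce_avg)
    return fpi > eta
```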
Figure 8 and Figure 9 illustrate the total false positives for the first and second software productivity models considered in this research. The threshold parameter $\eta$ is varied from 0 in increments of 0.05 until the total false positives drop to zero. Both figures indicate that total false positives fall precipitously until $\eta$ reaches about 0.5; after this point, the fall in total false positives is more gradual. Thus, setting $\eta = 0.5$ may be a good choice for identifying false-positive DMUs. With $\eta = 0.5$, exactly 18 projects in Model 1 are identified as false positives, and only three projects from Model 2.
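The sweep behind these figures can be sketched as follows; `fpi` is the FPI vector computed as in the previous sketch, and the 0.05 step follows the procedure described above.

```python
# Sketch of the threshold sweep behind Figures 8 and 9: grow eta in steps
# of 0.05 and count false positives until the count reaches zero.
import numpy as np

def sweep_eta(fpi, step=0.05):
    eta, counts = 0.0, []
    while True:
        total = int(np.sum(np.asarray(fpi) > eta))
        counts.append((round(eta, 2), total))
        if total == 0:
            return counts   # list of (eta, total false positives) pairs
        eta += step
```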
Once the set of false-positive DMUs is identified, managers can review them further, as traditional DEA rankings tend to overestimate the efficiency scores of these DMUs. The lower CE scores for false-positive DMUs suggest that high-efficiency scores from traditional DEA rankings might not be accepted by their peers. The DMUs can also be ranked in descending order based on FPI, and managers may choose to focus only on the top 10% for further examination. In such cases, no threshold is needed to identify false positives.
4. Summary, Conclusions, and Limitations
This paper compares several non-parametric frontier efficiency models for ranking software project productivity. The models examined include traditional DEA, cross-efficiency DEA models with arbitrary and secondary objectives, and game efficiency models. Experiments used the ISBSG dataset and entropy criteria to select the best model. The study results show that the cross-efficiency model with the aggressive secondary objective is the most effective for software productivity rankings. These CE models provide unique rankings without favoring any particular software project, making their neutral approach potentially more acceptable to organizations. The GE approach appears to be the least effective because it is the most computationally intensive, with CPU runtimes exceeding an hour and resulting in high-entropy efficiency scores. GE models assume competition among projects for shared organizational resources, but this assumption does not seem to hold well in the software industry.
For managers who want to use the traditional DEA model for project ranking, this study offers additional hybrid tools to identify projects that might be unfairly assigned high-efficiency scores. These unjustified high scores occur because of the traditional DEA model’s total weight flexibility. By removing these unfair high-ranking DMUs, managers can more confidently identify the best-performing projects.
While the DEA method has the benefits of being nonparametric and free of assumptions about the form of the production function, the procedure also has some limitations. The principal limitation comes from the assumption of a common production process. An ideal situation for applying the DEA procedure to software engineering datasets would be when all projects come from the same organization and are carried out simultaneously; this way, all projects have access to similar people, tools, procedures, and resources, which also limits outliers in the analyzed data. Among the different DEA techniques, Game DEA is the most restrictive, as it assumes that software projects compete for common resources in a zero-sum game. The narcissistic nature of the traditional DEA makes it the technique yielding the highest efficiency scores. Peer-ranking CE techniques may be suitable when data come from different organizations, as efficiency scores are averaged across peer rankings. In particular, the benevolent secondary-objective CE model, which maximizes the efficiency scores of individual DMUs and the average efficiency of all peer DMUs, is very attractive because it is consistent with the total factor productivity principle. The AGG secondary-objective model may be suitable when there is general competition among software projects; such competition may occur when several companies bid for software projects within a single industry.
The dataset used in this research comes from several companies across different industries, and the projects were completed at different times; as a result, outliers are expected. Outliers in the dataset will lead to biases in efficiency scores. The entropy- and rule-based procedure proposed in this paper was designed to rank the models that are least biased and to identify DMUs whose efficiency scores can be considered reliable based on model consensus. The entropy method is just one technique that can be used to deal with outliers. The DEA literature has proposed other bootstrap sampling and robust Bayesian DEA techniques to compute efficiency in the presence of outliers [18]. Future researchers may consider some of these techniques as a possible extension to this research.