A Comparison of Ensemble and Dimensionality Reduction DEA Models Based on Entropy Criterion

Abstract: Dimensionality reduction research in data envelopment analysis (DEA) has focused on subjective approaches to reduce dimensionality. Such approaches are less useful or attractive in practice because a subjective selection of variables introduces bias. A competing unbiased approach would be to use ensemble DEA scores. This paper illustrates that in addition to unbiased evaluations, the ensemble DEA scores result in unique rankings that have high entropy. Under restrictive assumptions, it is also shown that the ensemble DEA scores are normally distributed. Ensemble models do not require any new modifications to existing DEA objective functions or constraints, and when ensemble scores are normally distributed, returns-to-scale hypothesis testing can be carried out using traditional parametric statistical techniques.


Introduction
Data envelopment analysis (DEA) is a prominent technique for the non-parametric relative efficiency analysis of a set of decision-making units (DMUs) drawn from a similar production process [1]. DEA models are used in both the operations research and data mining literature [2]. Some of the traditional properties of production functions, such as the monotonicity and convexity of the inputs and outputs, that are fundamental in DEA models are often found to be attractive in some data mining models where datasets are noisy and model resistance to learning noise is necessary [3]. An important aspect of DEA models is the reliability of DMU efficiency scores. It is generally accepted that the DEA efficiency estimates are reliable when the sample size is large [4]. Since the reliability of the DEA scores is dependent on the sample size, Cooper et al. [5] have suggested the following rule for the minimum number (n) of DMUs for reliable DEA analysis (each DMU has m inputs and s outputs):

n ≥ max{m × s, 3(m + s)}. (1a)

For small-size datasets, where violations of the minimum number of DMUs specified by Equation (1a) frequently occur, dimensionality reduction (also known as variable reduction or variable selection) approaches are frequently used to select a subset of variables to satisfy Equation (1a). A variety of variable selection approaches are available in the literature. Among these variable selection approaches are statistical [6], regression [7], efficiency contribution measure [8], bootstrapping [9], hypothesis testing [10], variable aggregation [11] and statistical experiment design [12] approaches. Variable selection approaches are criticized extensively for applying parametric procedures and linear relationship assumptions to select variables for an unknown non-linear and non-parametric efficiency frontier. Nataraja and Johnson [13] provide a good description of some of these procedures and their pros and cons.
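The size rule in Equation (1a) is straightforward to check programmatically. A minimal Python sketch (the function name is illustrative, not from the paper):

```python
def meets_dea_size_rule(n_dmus: int, m_inputs: int, s_outputs: int) -> bool:
    """Check the Cooper et al. heuristic (Equation (1a)):
    n >= max(m * s, 3 * (m + s))."""
    return n_dmus >= max(m_inputs * s_outputs, 3 * (m_inputs + s_outputs))

# The gas-company dataset analyzed later (14 DMUs, 3 inputs, 4 outputs)
# violates the rule: max(3 * 4, 3 * (3 + 4)) = 21 > 14.
print(meets_dea_size_rule(14, 3, 4))  # False
```

Datasets that fail this check are exactly the ones for which variable selection, or the ensemble approach studied here, becomes relevant.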
Using Banker et al.'s [15] variable-returns-to-scale (VRS) DEA BCC model, together with data and models obtained from several studies in the literature, Pendharkar [14] showed that the ensemble DEA model provides a better ranking of DMUs than the models proposed in those studies.
This research investigates the additional properties and statistical distribution of the ensemble DEA model scores. It is shown that there are added benefits of ensemble efficiency scores. First, the ensemble efficiency scores maximize entropy, meaning that the DMU ranking distribution generated by the ensemble efficiency scores has a lower bias than some competing radial and non-radial variable selection models recently reported in the literature; second, the ensemble efficiency scores may be normally distributed under certain restrictive assumptions. The normal distribution of the efficiency scores is particularly attractive because returns-to-scale hypothesis testing may then be conducted using traditional difference-in-means parametric statistical procedures. Both of these features are tested using data and models reported in a published study [16]. The rest of the paper is organized as follows: In Section 2, the basic DEA radial and non-radial models, the ensemble DEA model and the entropy criterion for comparing different DEA models are described. In Section 3, using Iranian gas company data, the results of ensemble DEA models are compared with the results of variable selection models used in Toloo and Babaee's [16] study. Additionally, in Section 3, the properties of the ensemble DEA scores are investigated in terms of the entropy criterion and their statistical distributions. In Section 4, the paper concludes with a summary and directions for future research.

DEA Preliminaries, Ensemble DEA Model, Entropy Criterion for DEA Model Comparisons and Statistical Distribution of Ensemble Scores
The basic DEA model assumes n DMUs, with each DMU consisting of m different inputs that produce s different outputs. The input and output vectors are semi-positive, and for DMU j (j = 1, . . . , n), the input and output vectors satisfy (x_j, y_j) ∈ R₊^(m+s). For a DMU_o, its relative efficiency may be computed using a linear programming model under the constant returns-to-scale assumption. This efficiency is computed by solving the following model:

max θ_o = Σ_{r=1}^{s} u_r y_ro (2a)

subject to:

Σ_{i=1}^{m} v_i x_io = 1, (2b)

Σ_{r=1}^{s} u_r y_rj − Σ_{i=1}^{m} v_i x_ij ≤ 0, j = 1, . . . , n, (2c)

v_i, u_r ≥ ε for all i = 1, . . . , m and r = 1, . . . , s, (2d)

where v_i and u_r are the weights associated with the ith input and rth output, respectively. The constant ε > 0 is infinitesimally non-Archimedean. The model (2a)-(2d) is often called the primary CCR model [1], and its dual is written as follows:

min θ − ε(Σ_{i=1}^{m} s_i⁻ + Σ_{r=1}^{s} s_r⁺) (2e)

subject to:

Σ_{j=1}^{n} λ_j x_ij + s_i⁻ = θ x_io, i = 1, . . . , m, (2f)

Σ_{j=1}^{n} λ_j y_rj − s_r⁺ = y_ro, r = 1, . . . , s, (2g)

λ_j, s_i⁻, s_r⁺ ≥ 0 for all j, i and r. (2h)

The VRS BCC model augments the system (2e)-(2h) by adding the following constraint:

Σ_{j=1}^{n} λ_j = 1.

The aforementioned models are radial DEA models that are criticized for not providing input or output projections (for inefficient DMUs) that satisfy Pareto optimality conditions [17]. Fare and Lovell [18] independently proposed non-radial DEA models that allow for input or output reductions at variable rates. The non-radial version of the CCR model is mathematically represented in the following dual form:

min (1/m) Σ_{i=1}^{m} θ_i

subject to:

Σ_{j=1}^{n} λ_j x_ij ≤ θ_i x_io, i = 1, . . . , m,

Σ_{j=1}^{n} λ_j y_rj ≥ y_ro, r = 1, . . . , s,

θ_i ≤ 1 and λ_j ≥ 0 for all i and j.

Pendharkar [14] proposed an ensemble DEA model based on the popularity of ensemble models in the machine learning literature. The ensemble DEA model requires an exhaustive search procedure using a binary vector z whose components indicate whether an input or output is considered in performing DEA analysis. The dimension of this binary vector is (m + s). Figure 1 illustrates the z vector and exhaustive search tree for two-input-and-one-output datasets. The exhaustive tree is pruned (dotted edges) for models that have either no inputs or no outputs.
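The pruned exhaustive search over z vectors can be sketched as follows; `eligible_models` is a hypothetical helper that assumes the first m components of z flag inputs and the remaining s components flag outputs:

```python
from itertools import product

def eligible_models(m: int, s: int):
    """Enumerate all z vectors of length m + s, pruning (as in Figure 1)
    any model that selects no inputs or no outputs."""
    return [z for z in product([0, 1], repeat=m + s)
            if any(z[:m]) and any(z[m:])]

# Two inputs and one output leave three eligible models:
print(eligible_models(2, 1))       # [(0, 1, 1), (1, 0, 1), (1, 1, 1)]
# Three inputs and four outputs leave (2**3 - 1) * (2**4 - 1) models:
print(len(eligible_models(3, 4)))  # 105
```

The second call shows why 105 unique models arise for the gas-company dataset analyzed in Section 3.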
DEA analysis is then conducted on the remaining models, and the efficiency results of each model for each DMU are averaged and used as ensemble DEA scores. To illustrate the ensemble DEA approach on a two-input-and-one-output dataset, a CCR DEA analysis using partial Cobb-Douglas production function data on US economic growth between 1899 and 1910 [19] is conducted. Table 1 illustrates the results of our DEA analysis and the resulting ensemble scores. The two inputs were labor in person-hours worked per year and the amount of capital invested. The output was the total annual production. The results of the analysis show that the traditional DEA with z = [111] does not provide unique rankings (the years 1901 and 1902 receive the same efficiency score), but the ensemble DEA model provides unique DMU rankings. Pendharkar's [14] study provides a theoretical basis for the reliability of ensemble DEA scores.

The maximum entropy (ME) principle has been applied to DEA DMU ranking distributions [20] and model comparisons [21]. The ME principle measures the DMU ranking bias by using a more general family of distributions [22]. Several statistical distributions can be characterized as ME densities [23]. The ME distributions are the least biased distributions obtained by imposing moment constraints that are inherent in the data [21].
To obtain the ME for a given set of DMUs and their efficiencies, normalized scores are first obtained by computing p_i = θ_i* / Σ_{j=1}^{n} θ_j* for each DMU, and then the ME for a certain model z is computed as follows:

ME_z = −Σ_{i=1}^{n} p_i ln(p_i).

The ME values for the DEA models in Table 1 are ME_[111] = 2.4768, ME_[101] = 2.4775 and ME_[011] = 2.4757. The model with labor as an input and production as an output (z = [101]) has the highest entropy and the least bias, with a maximum difference between DMU efficiencies for the closely ranked DMUs for the years 1901 and 1902. The ensemble entropy is 2.4769, and since it is an average of all z-vector combinations, the comparison benchmark for ensemble entropy is the model with z = [111].
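The entropy computation can be sketched in a few lines of Python (the function name is illustrative; reproducing the Table 1 values would require the underlying efficiency scores, which are not repeated here):

```python
from math import log

def ranking_entropy(scores):
    """Normalize efficiency scores into a distribution and return its
    Shannon entropy, ME = -sum(p_i * ln(p_i))."""
    total = sum(scores)
    probs = [s / total for s in scores]
    return -sum(p * log(p) for p in probs if p > 0)

# A perfectly uniform profile of 12 scores attains the upper bound ln(12):
print(round(ranking_entropy([1.0] * 12), 4))  # 2.4849
```

Any departure from uniformity lowers the entropy, which is why the Table 1 values all fall slightly below ln(12).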

The ensemble entropy is higher than the benchmark. The highest possible entropy value or upper bound (UB) for a model with n DMUs is given by the following expression:

ME_UB = ln(n). (2i)

The ME UB for the data in Table 1 is ln(12) ≈ 2.485, and the ensemble entropy is very close to this maximum value. It is important to note that attaining the maximum value is not always desirable, but it provides a theoretical benchmark for a completely unbiased normalized DMU score distribution.
To compute ensemble efficiency scores, an n × m matrix E of DEA efficiency scores is necessary. The rows of this matrix correspond to the DMUs, and the columns correspond to the eligible models considered in computing ensemble efficiency scores. The number of eligible models has an upper bound N given by Equation (1b):

N = (2^m − 1)(2^s − 1). (1b)

The elements of this matrix are the efficiency scores of each DMU computed under the model identified by the column number. Figure 2 illustrates a five-DMU-and-five-model matrix. The ensemble efficiency score (θ_i^E) for each DMU is computed using the following formula:

θ_i^E = (1/m) Σ_{j=1}^{m} θ_ij*. (2j)

A few observations can be made about any row i ∈ {1, . . . , n} of the ensemble efficiency score matrix. First, all the elements of a given row are independent computations of efficiency scores for the same DMU under different models, each with its unique set of input(s) and output(s). Second, in every element of a given row, the DMU is maximizing its efficiency given its model constraints. Thus, each row represents independent evaluations by a DMU under the maximum decisional efficiency (MDE) principle [24]. The MDE principle was introduced by Troutt [25] to develop a function to aggregate the performance of multiple decision-makers. The underlying assumption of the MDE principle is that all decision-makers seek to maximize their decisional efficiencies. Troutt [26] later used the MDE approach to rank DMUs and showed that DMUs deemed efficient under MDE are also efficient when ranked using DEA. For a linear aggregator function, such as the one used in Equation (2j), Troutt [26] illustrated that the decisional efficiencies θ can be described by the following probability density function (pdf):

f(θ) = c_α e^(αθ), 0 ≤ θ ≤ 1. (2k)

The pdf in (2k) is monotone increasing on its interval, with a mode at θ = 1 (see Figure 5 for an illustration). Using the laws of probability, the normalizing constant is c_α = α(e^α − 1)^(−1).
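The aggregation in Equation (2j) is a plain row average over the matrix E; a minimal sketch with a hypothetical 3 × 4 efficiency matrix (not the paper's data):

```python
def ensemble_scores(E):
    """Average each DMU's efficiency scores across eligible models
    (row-wise mean of the n x m matrix E, per Equation (2j))."""
    return [sum(row) / len(row) for row in E]

# Hypothetical scores for 3 DMUs under 4 models:
E = [[1.00, 0.90, 1.00, 0.80],
     [0.70, 0.75, 0.65, 0.70],
     [1.00, 1.00, 0.95, 0.85]]
print([round(t, 3) for t in ensemble_scores(E)])  # [0.925, 0.7, 0.95]
```

Note that a DMU can score 1.0 under several models yet still receive an ensemble score below 1.0, which is what breaks ties in the rankings.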
Since each element in a given row of the ensemble efficiency score matrix is an independent evaluation by a decision-maker (i.e., a DMU in an ensemble model) trying to maximize its decisional efficiency θ_ij* for j = 1, . . . , m, the probability density function for each row (DMU) can be written as:

f(θ_i1, . . . , θ_im) = Π_{j=1}^{m} c_{α_i} e^(α_i θ_ij) = c_{α_i}^m exp(α_i Σ_{j=1}^{m} θ_ij). (2l)

The central limit theorem states that the cumulative distribution functions (cdfs) of sums of independent, identically distributed random variables asymptotically converge to a Gaussian cdf.
The ensemble efficiency scores are normalized sums of independent efficiency assessments that will be distributed with a pdf given by (2l). These sums can be considered independent and identically distributed if α_1 = α_2 = . . . = α_n. Under the restrictive assumption that α_1 = α_2 = . . . = α_n, the ensemble efficiency scores are guaranteed to asymptotically converge to a normal distribution by the central limit theorem. In practice, however, the ensemble efficiency scores are not entirely random or perfectly identically distributed (due to the likely slight variation of Equation (2l)'s α_i parameters across rows), and each ensemble model does introduce a degree of mild randomization. For mild differences in the row pdf parameters α_i, where α_1 ≈ α_2 ≈ . . . ≈ α_n, the ensemble efficiency scores are likely to be normally distributed. A reader may note that under ideal conditions, where α_1 = α_2 = . . . = α_n and individual DMU scores follow Equation (2l)'s distribution, the entropy of the ensemble scores will be highest and close to the upper bound given in Equation (2i), because the distribution in Equation (2k) has a mode at 1 (see Figure 5). Thus, it may be argued that the likelihood of normality of the ensemble scores increases when the entropy of the ensemble scores is closer to its upper bound given by Equation (2i). It is important to note that an entropy exactly equal to the upper bound given by Equation (2i) is undesirable because at that value, the distribution is a uniform distribution in which all the DMUs are fully efficient under all the models. The entropy of the pdf in Equation (2k) is maximized on the interval [0, 1] when the mean of the distribution is greater than 0.5 [27].
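Under the ideal assumption of a common α, this convergence can be illustrated by simulation. The sampler below inverts the cdf of Equation (2k), F(θ) = (e^(αθ) − 1)/(e^α − 1); the α value and sample sizes are illustrative choices, not from the paper:

```python
import random
from math import exp, log

def sample_mde(alpha: float, rng: random.Random) -> float:
    """Draw from the MDE pdf f(θ) = c_α e^(αθ) on [0, 1] via inverse-cdf sampling."""
    u = rng.random()
    return log(1.0 + u * (exp(alpha) - 1.0)) / alpha

# Ensemble scores as normalized sums of 105 iid MDE draws per DMU:
rng = random.Random(42)
ensembles = [sum(sample_mde(3.0, rng) for _ in range(105)) / 105
             for _ in range(500)]
mean = sum(ensembles) / len(ensembles)
# For α = 3 the MDE mean is about 0.72, so the averages cluster tightly there:
print(0.70 < mean < 0.74)  # True
```

A histogram of `ensembles` would look approximately Gaussian around the MDE mean, which is the bell-shaped behavior the central limit theorem predicts for the row averages.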
Another important aspect of the distribution of the ensemble efficiency scores is that both the rows and columns of the ensemble efficiency score matrix (Figure 2) play a role in its pdf: the rows represent sampling from the MDE distributions, and the columns represent sampling from the distribution of sums of independent variables. Larger sample sizes increase the statistical reliability and robustness of the results.

Comparing Variable Selection Models and Ensemble Model Using Gas Company Data and Entropy Criterion
For small datasets, input or output variables are often aggregated so that the selected variables satisfy the heuristic given in Equation (1a). There are two problems with all the variable selection approaches. First, they use an artificial criterion to select variables for a non-linear and non-parametric approach; any artificial/subjective criterion makes assumptions that are hard to justify. Second, these techniques have several selection parameters and thresholds that often lead to inconsistencies in their application. For example, Toloo and Babaee [16] illustrated three problems with a variable selection approach and suggested an improved approach. By contrast, the ensemble DEA approach does not make any such assumptions, and for small datasets, trying out different input and output combinations and aggregating efficiency scores provides more reliable efficiency estimates than variable selection models. Part of the reason for the stability of ensemble DEA efficiency scores is that, even for small datasets, some DEA models in an ensemble will always satisfy the heuristic given in Equation (1a), which increases the reliability of the ensemble efficiency scores through model averaging. This stability of ensemble efficiency scores is illustrated by comparing ensemble scores with the results of the models from Toloo and Babaee's [16] study using the entropy criterion.
To compare the results, the dataset from Toloo and Babaee's [16] study is used. The dataset consists of three inputs and four outputs from an Iranian gas company. The inputs are budget (x1), staff (x2) and cost (x3). The outputs are customers (y1), the length of the gas network (y2), the volume delivered (y3) and gas sales (y4). Table 2 lists these data. Table 3 lists the efficiency scores of the ensemble DEA with the CCR and BCC models and the models used by Toloo and Babaee [16]. Using formula (1b), a total of 105 unique DEA models were used to compute the DEA ensemble efficiency scores. More importantly, the Ensemble CCR model generates unique rankings for the DMUs, whereas the Non-Radial and Radial models generate a tie for three DMUs. The Ensemble BCC model also generates unique rankings, but the differences occur at the third decimal place: the Ensemble BCC efficiency scores for DMUs 10, 12 and 13 were 0.960, 0.959 and 0.962, respectively. Figures 3 and 4 illustrate the numbers of models (out of 105 total) in which each DMU was fully efficient. These figures are useful for understanding to what extent the assumption α_1 ≈ α_2 ≈ . . . ≈ α_n was satisfied for the theoretical normal distribution of the ensemble efficiency scores. For these parameters to be similar, each DMU should be fully efficient in a similar number of models. Clearly, some DMUs are never fully efficient under any of the 105 models, and the assumption of identical distributions is violated. While the assumption is violated, Figure 4 illustrates that some DMUs, e.g., 1, 8, 10, 12 and 13, are fully efficient in a somewhat similar number of models. The ensemble scores of these DMUs may be considered normalized random sums generated from identical distributions (such as Distribution 1). All of these DMUs have ensemble efficiency scores greater than 0.95.
Similarly, DMUs 5, 6 and 11, in Figure 4, have no fully efficient scores, and these may also be considered as random normalized sums generated from identically distributed pdfs (such as Distribution 2).
The ensemble scores for this dataset appear to be random normalized sums from two or more pdfs of the form given in Equation (2k). Given that these are independent random normalized sums, it can easily be shown that the product of two or more independent MDE pdfs is also an MDE pdf. Figure 5 illustrates two sample MDE pdfs for two different values of α. The entropy of an MDE pdf is maximized when the mean of the distribution is greater than 0.5 [27]. For the ensemble BCC model, from Table 3, this criterion is satisfied. The lowest ensemble BCC score is 0.57, which is greater than the 0.5 mean required to maximize entropy and higher than the lowest efficiency scores for the radial, non-radial and ensemble CCR models. As a result, the ensemble BCC model appears to maximize its entropy slightly better than the ensemble CCR model.
While the ensemble scores show a minor violation of identical distributions for some DMUs, a formal test of the normality of the distribution of the ensemble efficiency scores was conducted. Table 4 illustrates the results of these tests. The Shapiro-Wilk statistic for the Ensemble CCR model is 0.944, and that for the Ensemble BCC model is 0.876; at 14 degrees of freedom, both are non-significant, consistent with the null hypothesis that the efficiency score distribution is normal at the 95% level of statistical significance. A paired-sample t-test for the difference in mean efficiency scores between the Ensemble CCR and Ensemble BCC models gives a |t|-value of 3.524, which is significant at the 99% level of statistical significance (df = 13), indicating that a variable-returns-to-scale relationship exists between inputs and outputs. The normality of the ensemble efficiency score distributions increases the power of parametric statistical tests.
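The paired t statistic reported above can be reproduced with the Python standard library; the scores below are hypothetical stand-ins rather than the gas-company results:

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(x, y):
    """Paired-sample t statistic for H0: mean difference = 0 (df = n - 1)."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return mean(d) / (stdev(d) / sqrt(n)), n - 1

# Hypothetical CCR vs. BCC ensemble scores for five DMUs
# (CCR <= BCC, since the BCC frontier envelops the data more tightly):
ccr = [0.62, 0.75, 0.81, 0.90, 0.70]
bcc = [0.70, 0.80, 0.88, 0.95, 0.78]
t, df = paired_t(ccr, bcc)
print(df, t < 0)  # 4 True
```

A significant negative t under this pairing is evidence against constant returns to scale, which is the logic behind the returns-to-scale test in the paper.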


Summary, Conclusions and Directions for Future Work
A significant amount of research in the DEA literature has focused on dimensionality reduction/variable selection techniques for small datasets. These techniques are often criticized and have their limitations, with no clear way of selecting which technique is the best. A better approach would be to use an ensemble DEA score that does not make any additional assumptions and provides models that have high entropy values and normally distributed scores under restrictive assumptions. Pendharkar [14], in his study, has already provided a theoretical foundation for the reliability of ensemble DEA scores. The added benefit of ensemble DEA scores is that they provide unique DMU rankings.
The normality of ensemble DEA scores is not guaranteed unless the ensemble DEA scores are normalized sums generated from independent, identically distributed MDE pdfs. This assumption may not be strictly satisfied in most real-world datasets, but the current study shows that minor deviations from this assumption may be tolerated because the entropy of all MDE pdfs is maximized when the normalized sums have a value greater than 0.5. This means that, typically, the differences in means between the underlying pdfs (Equation (2l)) for ensemble entropy scores will be less than 0.5, and, while these pdfs may not be identically distributed, their means will be close, resulting in a likely normal distribution of ensemble scores in most real-world cases. The normality of ensemble DEA scores allows for the application of traditional statistical tests for returns-to-scale hypothesis testing. Traditional DEA hypothesis-testing methods are not perfect and are known to be slightly biased [28]. Future research may focus on comparing ensemble DEA-based hypothesis testing with traditional DEA hypothesis testing to identify which method provides more reliable results.