Article

Synthetic Data Generation for Data Envelopment Analysis

College of Information Technologies and Computer Sciences, National University of Science and Technology “MISIS”, 4 Leninsky Ave., Bldg. 1, 119049 Moscow, Russia
Data 2023, 8(10), 146; https://doi.org/10.3390/data8100146
Submission received: 10 July 2023 / Revised: 19 September 2023 / Accepted: 21 September 2023 / Published: 27 September 2023
(This article belongs to the Section Information Systems and Data Management)

Abstract

The paper is devoted to the problem of generating artificial datasets for data envelopment analysis (DEA), which can be used for testing DEA models and methods. In particular, papers that apply DEA to big data often rely on synthetic data generation to obtain large-scale datasets because real datasets of large size, available in the public domain, are extremely rare. This paper proposes an algorithm that takes a real dataset as input and complements it with artificial efficient and inefficient units. The generation process extends the efficient part of the frontier by inserting artificial efficient units while keeping the original efficient frontier unchanged. For this purpose, the algorithm uses the assurance region method and consistently relaxes weight restrictions during the iterations. This approach produces synthetic datasets that are closer to real ones than those produced by algorithms that generate data from scratch. The proposed algorithm is applied to two small real-life datasets. As a result, the datasets were expanded to 50K units. Computational experiments show that the artificially generated DMUs preserve isotonicity and do not increase the collinearity of the original data as a whole.
MSC:
90B30; 90C05
JEL Classification:
C61; C67

1. Introduction

Data envelopment analysis (DEA) is a nonparametric method that is used to measure the relative efficiency of a homogeneous set of decision making units (DMUs) [1]. The DEA approach is widely used in various fields, including manufacturing systems, power industry, governance, finance, supply chain, transportation, etc. [2,3,4,5,6,7,8,9,10,11,12,13,14]. It is assumed that each DMU consumes multiple inputs to produce multiple outputs. The estimation of efficiency scores using the DEA model is based on comparing observations with points on the frontier. The term frontier is used because it is based on best practice observations.
Some DEA papers lack testing of the proposed models on large-scale real data. In those papers, the proposed new model is illustrated by a small dataset with several dozen units and three to five variables. Such an illustrative example serves to demonstrate that the proposed method can give correct results. However, as practice shows, this is not enough to draw a conclusion about the performance of the proposed model or algorithm. A small demo example cannot reveal the computational issues that may arise on medium- and large-scale datasets.
For example, in papers by Sueyoshi and Sekitani [15,16,17] DEA models were proposed for the measurement of returns to scale (RTS) under a simultaneous occurrence of multiple projections and multiple supporting hyperplanes. It was proposed to use the strong complementary slackness conditions (SCSC) of linear programming [18] as constraints in order to find all vertices of a face where RTS is evaluated. The paper [19] shows that the simple example proposed by Sueyoshi and Sekitani failed to detect numerical issues with their model. Computational experiments conducted on medium-sized datasets helped reveal the instability of the SCSC model.
Another example demonstrates how detailed testing on large-scale data allows one to benchmark DEA methods and draw conclusions about their performance. Paper [20] describes and compares some of the best-known methods for estimating returns to scale. The paper shows that well-established theoretical models may encounter numerical instability even in medium-sized datasets. In this particular example, the numerical instability is caused by the inversion of an ill-conditioned matrix during the simplex solution process. However, there are a large number of causes of algorithmic instabilities that are difficult to detect theoretically but easy to identify through numerical experiments.
One of the attempts to collect open DEA data that could be used for scientific and educational purposes is the Data Envelopment Analysis Dataset Repository [21]. Unfortunately, this DEA repository has not functioned for many years.
There exist several DEA datasets distributed with R packages (e.g., Benchmarking, rDEA, npsf) published on the Comprehensive R Archive Network (CRAN). However, these datasets are intended mainly for educational purposes and contain a small number of DMUs. Other existing open data repositories are focused on, e.g., machine learning data [22,23], or statistical data [24,25]. However, such repositories contain data that are not intended to be applied directly to DEA analysis: the variables of a DEA model are not specified; inputs and outputs may be mixed with environmental variables; datasets may contain binary, categorical, or unstructured data (audio, video, images, etc.). Therefore, the use of these repositories for DEA is very limited.
Well-known repositories for sharing scientific data [26,27] contain quite a few DEA datasets that are used in articles. However, the datasets in such repositories are difficult to find because they do not take into account DEA specifics. For example, there is no possibility to specify search details such as the number of DMUs or variables, time periods, presence of undesirable variables, etc. Moreover, many datasets do not contain descriptions, so it is impossible to recognize the variables and number of DMUs without downloading the dataset.
Largely for these reasons, many DEA researchers use artificially generated data, usually referred to as synthetic data, for testing purposes. Synthetic data are increasingly being used in machine learning for two reasons: (a) the lack of high-quality data and (b) the need for privacy protection when sensitive data are used. Recent reviews on synthetic data generation using machine learning (ML) algorithms are presented in [28,29]. Recent studies on the application of ML algorithms for the estimation of DEA technologies are given in [30,31].
On the one hand, large real datasets are rarely seen in the DEA literature, and datasets with more than 10K DMUs are extremely rare. One of the largest datasets in DEA is used in [32]; it represents real data on 30,099 power plants described by 6 variables (two inputs, one good output, and three bad outputs). As a consequence, the existing publicly available datasets for DEA are not sufficient for comprehensive testing.
On the other hand, applications of DEA show that many inefficient units are projected on the inefficient parts of the frontier when efficiency scores are evaluated. However, this fact disagrees with the main concept of the DEA approach because the efficiency score of an inefficient unit has to be measured relative to efficient units. As a consequence, inaccurate efficiency scores may be obtained. This happens because a non-countable (continuous) production possibility set is determined on the basis of a finite number of production units.
One way to improve the frontier is to insert restrictions on the dual multipliers. A number of papers developed the DEA models, which were based on incorporating domination cones into the dual model [33,34,35,36,37,38,39]. Podinovski [40] proved the equivalence between weight restrictions and production trade-offs between inputs and outputs. Computational procedures with weight restrictions and production trade-offs and a discussion of their implementation can be found in [41,42].
Farrell was the first to introduce artificial units in the primal space of inputs and outputs in order to ensure the convexity of the piecewise linear isoquants. Allen and Thanassoulis [43] elaborated further on this idea, focusing on anchor units as points of departure for formulating the coordinates of artificial units. The main purpose was to improve the envelopment by reducing the number of inefficient units not “properly” enveloped, i.e., units whose projections onto the frontier lie on a weakly efficient part of the frontier.
Thanassoulis et al. [44] developed further the super-efficiency method for discovering anchor units in the VRS model [45] and proposed a method for extending the frontier with the help of anchor units. However, their method cannot be used for generating large-scale datasets for two reasons. First, the positions of artificial units are specified by a decision-maker. Second, the procedure does not guarantee full envelopment.
To the best of our knowledge, there are no studies on synthetic data generation in DEA that take a real dataset as input and complement it with artificial efficient units. The present study attempts to address this gap and contributes by proposing an algorithm for synthetic data generation. The proposed algorithm takes a real DEA dataset as input and complements it with artificial DMUs.
Following the ideas of Thanassoulis et al. [44], we developed the algorithm for synthetic data generation based on the principles:
(P1)
Original efficient frontier should not change after adding artificial DMUs, and
(P2)
Artificial efficient DMUs should extend the efficient part of the frontier.
Data generation is organized in such a way that artificial efficient units are generated in the borderline region to upsample data neighborhoods. Inefficient units are generated to follow the underlying distribution of the original data.
Based on the proposed algorithm, datasets with a large number of DMUs have been prepared that can be used for testing the numerical stability and computational performance of existing DEA methods.
The structure of the paper is as follows. Section 2 provides a review of existing approaches for generating artificial datasets in DEA and shows some limitations of current approaches. Section 3 introduces the DEA models that are used for data generation. In Section 4, the algorithm of synthetic data generation for DEA is presented and applied to real-life datasets. Section 5 concludes.

2. Literature Review

Many DEA studies [46,47,48] have noted that generating synthetic datasets for general multi-input/multi-output technologies is a challenging task. Existing methods for generating artificial data are based on the assumption that there is a so-called data generating process (DGP), which relates inputs and outputs, from which the artificial DMUs were generated.
One of the easiest ways to generate data is, obviously, to use a uniform distribution as a DGP. However, this method produces datasets with a very low proportion of efficient DMUs. This is not a flaw in itself, but it does not accurately reflect real data. For example, Khezrimotlagh et al. [32] reported that a dataset with 10 inputs and 10 outputs uniformly distributed on the interval [10, 20] has more than 98.8% dominated DMUs on average in a sample of 20,000 DMUs. Moreover, using a uniform distribution as a DGP gives an unrealistic input-output mix because inputs and outputs are generated independently. Therefore, the benchmarks obtained with such datasets may be biased.
Dulá [47,49] used synthetic large-scale datasets to investigate the computational performance and scale limits of DEA models. This work provides a comprehensive computational study involving DEA problems with up to 100K DMUs. Since there were no typical DEA datasets for this purpose, the author simulates the DGP using a sphere as an efficient frontier. This method is also implemented in FEAR [50]. FEAR is a software package for efficiency analysis with R, a software environment for statistical computing. FEAR’s function genxy.sphere generates n observations with m inputs and r outputs uniformly distributed on the part of a unit sphere located in the positive orthant of the output variables and in the negative orthant of the input variables. The center of the sphere is located at the point $(1, \ldots, 1, 0, \ldots, 0)$ with m ones and r zeros, which places the spherical frontier in the positive orthant. For more information on the simulation method, see Ref. [51].
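To make this DGP concrete, the following sketch generates such data in Python. It is an approximation of the construction described above, with our own function name and parameters, not the code of FEAR's genxy.sphere.

```python
import numpy as np

def gen_sphere(n, m, r, seed=None):
    """Sketch of a sphere-based DGP: n DMUs with m inputs and r outputs lying on the part
    of a unit sphere centered at (1, ..., 1, 0, ..., 0) (m ones, r zeros)."""
    rng = np.random.default_rng(seed)
    d = np.abs(rng.standard_normal((n, m + r)))        # uniform directions in one orthant
    d /= np.linalg.norm(d, axis=1, keepdims=True)      # project directions onto the unit sphere
    d[:, :m] *= -1.0                                   # input components point into the negative orthant
    center = np.concatenate([np.ones(m), np.zeros(r)])
    points = center + d                                # inputs and outputs end up in [0, 1]
    return points[:, :m], points[:, m:]                # (X, Y) with one row per DMU
```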
The described approach is efficient, but it does not allow for generating units of different scales. In real data, there may be DMUs that differ by 5–7 orders of magnitude in some variables. The presence of a significant variation in the values may trigger certain numerical difficulties that can reveal the instability of the algorithm at large-scale problems. Therefore, for a comprehensive computational study, the dataset must contain units at multiple scales; using a sphere as a DGP is not well suited for this task.
Barr and Durchholz [46] were among the first who tested DEA models using large-scale problems. In addition to the real dataset of 8748 US banks, they also used synthetic ones because few large-scale DEA problems were available. The first proposed approach for generating such data is to draw samples from a multivariate distribution. However, as the authors warn, the variables representing inputs and outputs need to be carefully chosen since all outputs should be positively correlated with all the inputs.
The second approach used in [46] employed the Cobb–Douglas production function, which is most widely used in production economics. This function may be written as
$$y = A \prod_{i=1}^{m} x_i^{\alpha_i}, \qquad x_i > 0, \; \alpha_i > 0, \; i = 1, \ldots, m, \qquad (1)$$
where y is the single output, $x_i$, $i = 1, \ldots, m$, are input variables, $\alpha_i$ is the elasticity for input i, and A is usually referred to as total factor productivity. If $\sum_{i=1}^{m} \alpha_i < 1$, then the function displays decreasing returns to scale. For $\sum_{i=1}^{m} \alpha_i = 1$, constant returns to scale exist. If $\sum_{i=1}^{m} \alpha_i > 1$, then the production function is said to exhibit increasing returns.
For a model with a single output, inputs $x_1, \ldots, x_m$ are generated as independent and identically distributed uniform random variables, and the output is obtained directly according to (1) with all $\alpha_i = 0.8/m$. For some DMUs, the A value can be multiplied by a random number between 0 and 1 to simulate inefficiency. By controlling A, we can also regulate the proportion of efficient units.
In the case of multiple outputs, Wilson [50] proposes the following approach implemented in FEAR’s genxy command. Inputs $x_1, \ldots, x_m$ and output y are generated as in the single-output case. Next, $(r-1)$ uniformly distributed numbers $\varphi_1, \ldots, \varphi_{r-1}$ in the interval $(0, \pi/2)$ are generated. The outputs are determined as follows:
$$y_r = \sqrt{\frac{y^2}{\sum_{j=1}^{r-1} \tan^2 \varphi_j + 1}}, \qquad y_j = y_r \tan \varphi_j, \quad j = 1, \ldots, r-1. \qquad (2)$$
Put simply, the outputs are randomly distributed on the part of a sphere located within the positive orthant, centered at the origin, and with radius y.
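The sketch below illustrates this DGP, combining Equation (1) for the aggregate output with the angular split of Equation (2). The input range, the share of inefficient DMUs, and the function name are our assumptions; it is not FEAR's genxy implementation.

```python
import numpy as np

def gen_cobb_douglas(n, m, r, seed=None):
    """Sketch of a Cobb-Douglas DGP with a multi-output split as in Eqs. (1)-(2)."""
    rng = np.random.default_rng(seed)
    alpha = np.full(m, 0.8 / m)                         # elasticities alpha_i = 0.8 / m
    X = rng.uniform(1.0, 10.0, size=(n, m))             # assumed input range
    A = np.ones(n)
    inefficient = rng.random(n) < 0.8                   # assumed share of inefficient DMUs
    A[inefficient] *= rng.random(inefficient.sum())     # shrink A to simulate inefficiency
    y = A * np.prod(X ** alpha, axis=1)                 # aggregate output level, Eq. (1)
    phi = rng.uniform(0.0, np.pi / 2, size=(n, r - 1))  # random angles for the output mix
    y_r = np.sqrt(y ** 2 / (np.sum(np.tan(phi) ** 2, axis=1) + 1.0))   # Eq. (2)
    Y = np.column_stack([y_r[:, None] * np.tan(phi), y_r])
    return X, Y
```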
To show the difference between artificially generated data and real data, we used genxy function mentioned above and generated 100 DMUs with 5 inputs and 3 outputs. The pairwise scatter plots of the variables and Pearson correlation coefficients are shown in Figure 1. The histograms showing the distribution of each variable are presented along the matrix diagonal. Pearson correlation coefficients are shown in the upper triangle. The red color corresponds to a positive correlation coefficient, and the blue color represents a negative correlation. The lower triangle provides the pairwise scatter plots of the variables, where the solid line is an OLS fit.
To be economically sound, each output in the dataset should be positively correlated with all the inputs. However, the figure shows that there are negative correlation coefficients between inputs and outputs, and for some pairs, the correlation coefficients turn out to be close to zero.
The same diagram is shown in Figure 2 for a real dataset taken from the R package Benchmarking, available publicly on CRAN, see [52]. The dataset is from a US federally sponsored program for providing remedial assistance to disadvantaged primary school students [53].
Figure 2 shows that all outputs possess a significant positive correlation with all inputs, unlike the artificial data. It should be emphasized that such a situation is not unique to this particular dataset but applies to most DEA datasets.
Paper [48] uses a modified Cobb–Douglas functional form for generating random data. This approach differs from FEAR’s genxy in that the outputs $\tilde{y}_k$ are generated first as independent and identically distributed uniform random variables between 0.1 and 1. All inputs except $x_1$ are generated from the uniform distribution in the same way as the outputs. Next, the remaining input is determined by the following expression:
$$x_1 = \left( \frac{\Big[ \sum_{k=1}^{r} \beta_k (\tilde{y}_k)^2 \Big]^{1/(2\gamma)}}{\prod_{i=2}^{m} x_i^{\alpha_i}} \right)^{1/\alpha_1}. \qquad (3)$$
Coefficients $\alpha_i$ and $\beta_k$ can be chosen randomly according to the following procedure. The $\alpha_1$ value is taken arbitrarily in $(0, 1)$, e.g., $\alpha_1 = 0.25$. The other coefficients are found as follows:
$$\alpha_i = \frac{\tilde{\alpha}_i}{\sum_{l=2}^{m} \tilde{\alpha}_l} \, (1 - \alpha_1), \quad i = 2, \ldots, m,$$
where parameters $\tilde{\alpha}_2, \ldots, \tilde{\alpha}_m$ are generated randomly from the uniform distribution on the interval $(0, 1)$. The $\beta_k$ values are determined as $\beta_k = \tilde{\beta}_k / \sum_{l=1}^{r} \tilde{\beta}_l$, $k = 1, \ldots, r$, where the $\tilde{\beta}_k$ are uniformly distributed over the interval $(0, 1)$. Thus, the input and output coefficients are normalized, $\sum_{k=1}^{r} \beta_k = 1$ and $\sum_{l=1}^{m} \alpha_l = 1$, to ensure linear homogeneity of the distance functions. Nevertheless, parameter $\gamma \in (0, 1]$ still allows us to select various returns-to-scale degrees.
Inefficient DMUs are generated from the maximal output vector $\tilde{y}$ by
$$y = \tilde{y} \, \exp(-u),$$
where u is a random number from a half-normal distribution, i.e., $u \sim |N(0, \hat{\sigma}_u^2)|$, and $\exp(u)$ is the measure of inefficiency.
Although this method is much more sophisticated, it has the same disadvantages as the Cobb–Douglas approach (2) described above.
In addition to the commonly utilized uniform and Cobb–Douglas approaches, another interesting method of data generation was applied in [54]. The data generating rule was $Y = 2X$, where X and Y are input and output vectors. This approach is as easy to implement as the uniform one; extra-large samples of up to one million DMUs with high density can be produced with little effort. However, this approach also generates unrealistic datasets since they have a 100% correlation between outputs and inputs.
Kohl and Brunner [55] employ a Translog production function for data generation instead of the typically used Cobb–Douglas production function. However, they used just a single output in the DGP and considered only CRS settings. To emulate inefficiency, they utilized a truncated normal distribution with a specified lower bound and an upper bound of 1. The mode of the distribution was chosen below 1 so that the maximum of the probability density does not occur at 1.
Wimmer and Finger [56] used synthetic data generation to replicate the original data because it may be proprietary or confidential. They used the statistical technique proposed by Faisal et al. [57], shown below as Algorithm 1.
Algorithm 1 Generation of synthetic data according to Faisal et al. [57]
1: Take a simple random sample of $x_1^{\mathrm{obs}}$ and set it as $x_1^{\mathrm{syn}}$.
2: for $i = 2, \ldots, m+r$ do
3:      Fit model $f\big(x_i^{\mathrm{obs}} \mid x_1^{\mathrm{obs}}, \ldots, x_{i-1}^{\mathrm{obs}}\big)$.
4:      Draw $x_i^{\mathrm{syn}}$ from $f\big(x_i \mid x_1^{\mathrm{syn}}, \ldots, x_{i-1}^{\mathrm{syn}}\big)$.
5: end for
In Algorithm 1, all variables (inputs and outputs) are designated as x for the sake of simplicity; the index obs stands for the original data, and the index syn for the synthetic data. Different methods can be used for fitting the prediction models. Wimmer and Finger utilize a non-parametric method based on classification and regression trees (CART) and a parametric method using normal linear regressions preserving the marginal distribution (NORMRANK).
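As a concrete illustration of Algorithm 1, the sketch below synthesizes the columns sequentially using a plain linear regression with resampled residuals in place of CART or NORMRANK. This is a simplified stand-in for the fitting step, not the procedure of Faisal et al. [57].

```python
import numpy as np

def sequential_synthesis(data, seed=None):
    """Sketch of Algorithm 1: each column is synthesized conditional on the previously
    synthesized columns; a linear fit with bootstrapped residuals replaces CART/NORMRANK."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)                # rows are DMUs, columns are variables
    n, d = data.shape
    syn = np.empty_like(data)
    syn[:, 0] = rng.choice(data[:, 0], size=n, replace=True)    # step 1: bootstrap x_1
    for i in range(1, d):
        X_obs = np.column_stack([np.ones(n), data[:, :i]])      # fit f(x_i | x_1, ..., x_{i-1})
        beta, *_ = np.linalg.lstsq(X_obs, data[:, i], rcond=None)
        resid = data[:, i] - X_obs @ beta
        X_syn = np.column_stack([np.ones(n), syn[:, :i]])       # condition on synthetic predecessors
        syn[:, i] = X_syn @ beta + rng.choice(resid, size=n, replace=True)
    return syn
```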

3. Materials and Methods

3.1. DEA Background

Consider a set of n observed DMUs. Each DMU j is described by a pair $(X_j, Y_j)$, where $X_j = (x_{1j}, \ldots, x_{mj})^T \ge 0$ is the input vector, and $Y_j = (y_{1j}, \ldots, y_{rj})^T \ge 0$ is the output vector. At least one component of the input vector and one component of the output vector are assumed to be non-zero. The production possibility set T is the set $\{(X, Y) \mid \text{the outputs } Y \ge 0 \text{ can be produced from the inputs } X \ge 0\}$.
In DEA, a production possibility set (PPS) is constructed based on a set of axioms and using the observed DMUs. For DEA models with variable return to scale (VRS), the PPS is written in the following form:
$$T = \left\{ (X, Y) \in \mathbb{R}^{m+r} \;\middle|\; \sum_{j=1}^{n} X_j \lambda_j \le X, \; \sum_{j=1}^{n} Y_j \lambda_j \ge Y, \; \sum_{j=1}^{n} \lambda_j = 1, \; \lambda_j \ge 0, \; j = 1, \ldots, n \right\}. \qquad (5)$$
It was proved in [58] that technology (5) generalizes a wide class of DEA models. Therefore, in this paper, we consider only this type of PPS.
Based on PPS (5), an input-oriented model can be written in the form [45]:
$$\begin{aligned}
\min\ & \theta - \varepsilon \left( \sum_{k=1}^{m} s_k^- + \sum_{i=1}^{r} s_i^+ \right) \\
\text{subject to}\ & \sum_{j=1}^{n} X_j \lambda_j + S^- = \theta X_o, \\
& \sum_{j=1}^{n} Y_j \lambda_j - S^+ = Y_o, \\
& \sum_{j=1}^{n} \lambda_j = 1, \\
& S^- = (s_1^-, \ldots, s_m^-)^T \ge 0, \quad S^+ = (s_1^+, \ldots, s_r^+)^T \ge 0, \\
& \lambda_j \ge 0, \quad j = 1, \ldots, n,
\end{aligned} \qquad (6)$$
where $S^- = (s_1^-, \ldots, s_m^-)^T$ and $S^+ = (s_1^+, \ldots, s_r^+)^T$ are vectors of slack variables, and $\varepsilon$ is a non-Archimedean value. In model (6), the optimal value $\theta^*$ describes the efficiency score of unit $(X_o, Y_o)$, where $(X_o, Y_o)$ is a DMU from the set of observed production units $(X_j, Y_j)$, $j = 1, \ldots, n$.
In this input-oriented model, the possibility of proportional contraction of inputs while keeping outputs constant is sought. Solving model (6) may lead to computational inaccuracies that result in misleading solutions due to the choice of ε  [59,60]. Hence, we do not use an infinitesimal constant explicitly in the DEA models since we suppose that each model is solved in two stages in order to separate efficient and weakly efficient units [1]. At the first stage, model (6) is solved by omitting the slacks by simply putting ε = 0 . In the second stage, θ is replaced by θ * , and the sum of the slacks is maximized. The efficiency score in model (6) is estimated as θ * .
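For illustration, the following sketch solves model (6) in the two stages just described using SciPy's LP solver. The data layout, function name, and solver choice are our own assumptions, not part of the paper.

```python
import numpy as np
from scipy.optimize import linprog

def vrs_input_efficiency(X, Y, o):
    """Two-stage solution of the input-oriented VRS model (6) without an explicit epsilon.
    X: (m, n) inputs, Y: (r, n) outputs, o: index of the evaluated DMU.
    Returns (theta_star, maximal sum of slacks at theta_star)."""
    m, n = X.shape
    r = Y.shape[0]
    nvar = 1 + n + m + r                              # variables: [theta, lambda, s-, s+]
    A_eq = np.zeros((m + r + 1, nvar))
    b_eq = np.zeros(m + r + 1)
    A_eq[:m, 0] = -X[:, o]                            # -theta * X_o
    A_eq[:m, 1:1 + n] = X                             # + sum_j X_j lambda_j
    A_eq[:m, 1 + n:1 + n + m] = np.eye(m)             # + S-
    A_eq[m:m + r, 1:1 + n] = Y                        # sum_j Y_j lambda_j
    A_eq[m:m + r, 1 + n + m:] = -np.eye(r)            # - S+
    b_eq[m:m + r] = Y[:, o]
    A_eq[-1, 1:1 + n] = 1.0                           # convexity: sum lambda_j = 1
    b_eq[-1] = 1.0
    bounds = [(None, None)] + [(0, None)] * (n + m + r)

    c1 = np.zeros(nvar); c1[0] = 1.0                  # stage 1: minimize theta, slacks ignored
    stage1 = linprog(c1, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    theta = stage1.x[0]

    c2 = np.zeros(nvar); c2[1 + n:] = -1.0            # stage 2: maximize the sum of slacks
    bounds2 = [(theta, theta)] + bounds[1:]           # with theta fixed at its optimum
    stage2 = linprog(c2, A_eq=A_eq, b_eq=b_eq, bounds=bounds2, method="highs")
    return theta, -stage2.fun
```

A unit is then classified as efficient when theta equals one and the maximal slack sum is zero, in line with Definition 1 below.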
Definition 1
([1]). DMU $(X_o, Y_o) \in T$ is called efficient with respect to model (6) if and only if any optimal solution satisfies: (a) $\theta^* = 1$; (b) all slacks $s_k^-$, $k = 1, \ldots, m$, $s_i^+$, $i = 1, \ldots, r$, are zero.
If condition (a) in Definition 1 is satisfied, then DMU ( X o , Y o ) is called input weakly efficient with respect to model (6).
In the output-oriented VRS model, the level of output is maximized, keeping levels of the inputs constant:
$$\begin{aligned}
\max\ & \eta \\
\text{subject to}\ & \sum_{j=1}^{n} X_j \lambda_j + S^- = X_o, \\
& \sum_{j=1}^{n} Y_j \lambda_j - S^+ = \eta Y_o, \\
& \sum_{j=1}^{n} \lambda_j = 1, \\
& S^- = (s_1^-, \ldots, s_m^-)^T \ge 0, \quad S^+ = (s_1^+, \ldots, s_r^+)^T \ge 0, \\
& \lambda_j \ge 0, \quad j = 1, \ldots, n.
\end{aligned} \qquad (7)$$
Definition 2
([1]). DMU $(X_o, Y_o) \in T$ is called efficient with respect to model (7) if and only if any optimal solution satisfies: (a) $\eta^* = 1$; (b) all slacks $s_k^-$, $k = 1, \ldots, m$, $s_i^+$, $i = 1, \ldots, r$, are zero.
If condition (a) in Definition 2 holds, then DMU $(X_o, Y_o)$ is called output weakly efficient.
Definition 3
([61]). Efficient DMU $(X_o, Y_o) \in T$ is called extreme efficient in model (6) or (7) if and only if $\lambda_o^* = 1$ and $\lambda_j^* = 0$, $j \ne o$, for all optimal solutions.
To test a DMU ( X o , Y o ) for efficiency, an additive model was proposed by Charnes et al. [62]. This model is written in the following form:
$$\begin{aligned}
\max\ & \sum_{k=1}^{m} s_k^- + \sum_{i=1}^{r} s_i^+ \\
\text{subject to}\ & \sum_{j=1}^{n} X_j \lambda_j + S^- = X_o, \\
& \sum_{j=1}^{n} Y_j \lambda_j - S^+ = Y_o, \\
& \sum_{j=1}^{n} \lambda_j = 1, \\
& S^- = (s_1^-, \ldots, s_m^-)^T \ge 0, \quad S^+ = (s_1^+, \ldots, s_r^+)^T \ge 0, \\
& \lambda_j \ge 0, \quad j = 1, \ldots, n.
\end{aligned} \qquad (8)$$
This model provides sufficient conditions for the classification of DMUs without dealing with non-Archimedean constants.
Definition 4
([1]). DMU $(X_o, Y_o) \in T$ is efficient in model (8) if and only if the optimal value of its objective function is zero.
Theorem 1
([1]). DMU $(X_o, Y_o) \in T$ is efficient in model (8) if and only if it is efficient in the VRS model.
The additive model has advantages over the radial VRS model (6) in finding efficient DMUs because model (6) is solved in two stages. Therefore, to separate efficient units from weakly efficient ones, two optimization problems should be solved. The additive model maximizes slack variables without projecting DMU onto the frontier first. Therefore, only one optimization problem is required to be solved.
Andersen and Petersen [63] developed a super-efficiency model for ranking efficient DMUs.
$$\begin{aligned}
\min\ & \theta \\
\text{subject to}\ & \sum_{\substack{j=1 \\ j \ne o}}^{n} X_j \lambda_j \le \theta X_o, \\
& \sum_{\substack{j=1 \\ j \ne o}}^{n} Y_j \lambda_j \ge Y_o, \\
& \sum_{\substack{j=1 \\ j \ne o}}^{n} \lambda_j = 1, \\
& \lambda_j \ge 0, \quad j = 1, \ldots, n, \; j \ne o.
\end{aligned} \qquad (9)$$
The efficiency score is obtained by eliminating the DMU under evaluation from the PPS. This allows efficiency scores greater than one, which are then used to rank the efficient DMUs.
Model (9) can also be used to find the radial projection of an arbitrary point onto the frontier. If (9) has an optimal solution, then $(\theta^* X_o, Y_o)$ is the projection of point $(X_o, Y_o)$ onto the frontier. If problem (9) is infeasible, then no projection exists.
The output-oriented model can be written as follows:
$$\begin{aligned}
\max\ & \eta \\
\text{subject to}\ & \sum_{\substack{j=1 \\ j \ne o}}^{n} X_j \lambda_j \le X_o, \\
& \sum_{\substack{j=1 \\ j \ne o}}^{n} Y_j \lambda_j \ge \eta Y_o, \\
& \sum_{\substack{j=1 \\ j \ne o}}^{n} \lambda_j = 1, \\
& \lambda_j \ge 0, \quad j = 1, \ldots, n, \; j \ne o.
\end{aligned} \qquad (10)$$
In the output-oriented model, the radial projection is obtained as $(X_o, Y_o) \mapsto (X_o, \eta^* Y_o)$.
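The projection use of models (9) and (10) described above can be sketched as a single LP; the input-oriented variant is shown below under our own naming, with the evaluated point supplied externally so that it is automatically excluded from the reference set.

```python
import numpy as np
from scipy.optimize import linprog

def radial_input_projection(X, Y, x_o, y_o):
    """Sketch of model (9) used as a projection: returns (theta* x_o, y_o) on the VRS
    frontier spanned by the columns of X (m, n) and Y (r, n), or None if infeasible."""
    m, n = X.shape
    r = Y.shape[0]
    c = np.zeros(1 + n); c[0] = 1.0                   # variables: [theta, lambda_1..lambda_n]
    A_ub = np.zeros((m + r, 1 + n))
    b_ub = np.zeros(m + r)
    A_ub[:m, 0] = -x_o                                # sum X_j lambda_j <= theta x_o
    A_ub[:m, 1:] = X
    A_ub[m:, 1:] = -Y                                 # sum Y_j lambda_j >= y_o
    b_ub[m:] = -np.asarray(y_o)
    A_eq = np.zeros((1, 1 + n)); A_eq[0, 1:] = 1.0    # convexity constraint
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(None, None)] + [(0, None)] * n, method="highs")
    if not res.success:
        return None                                   # no radial projection exists
    return res.x[0] * np.asarray(x_o), np.asarray(y_o)
```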

3.2. Assurance Region Method

Weight restrictions are used in DEA in order to incorporate implicit judgments in the dual model for modeling production trade-offs (see, e.g., [64] for a review of weight restriction approaches). For model (6), the dual input-oriented VRS model can be written in the form:
$$\begin{aligned}
\max\ & u^T Y_o + u_0 \\
\text{subject to}\ & -v^T X_j + u^T Y_j + u_0 \le 0, \quad j = 1, \ldots, n, \\
& v^T X_o = 1, \\
& v \ge 0, \quad u \ge 0.
\end{aligned} \qquad (11)$$
The dual model (11) provides another way of looking at problem (6). The dual variables u and v are often called weights in DEA because the efficiency score of DMU $(X_o, Y_o)$ is defined as the ratio of a virtual output (the weighted sum of outputs $\sum_{i=1}^{r} u_i y_{io} + u_0$) to a virtual input (the weighted sum of inputs $\sum_{k=1}^{m} v_k x_{ko}$). In model (11), the dual variables are only required to be non-negative, and therefore some of them may be zero in the optimal solution. This means that some variables are ignored in the efficiency evaluation. In order to overcome this situation, it is proposed to insert weight restrictions into the model.
The assurance region (AR) method proposed by Thompson et al. [35,36] extends a DEA model (11) by adding constraints for pairs of dual variables.
$$l_k \le \frac{v_k}{v_1} \le u_k, \quad k = 2, \ldots, m, \qquad L_i \le \frac{u_i}{u_1} \le U_i, \quad i = 2, \ldots, r. \qquad (12)$$
The VRS-AR model is written as follows.
$$\begin{aligned}
\max\ & u^T Y_o + u_0 \\
\text{subject to}\ & v^T X_o = 1, \\
& -v^T X_j + u^T Y_j + u_0 \le 0, \quad j = 1, \ldots, n, \\
& v^T P \le 0, \quad u^T Q \le 0, \\
& u \ge 0, \quad v \ge 0,
\end{aligned} \qquad (13)$$
where
$$P = \begin{pmatrix} l_2 & -u_2 & l_3 & -u_3 & \cdots \\ -1 & 1 & 0 & 0 & \cdots \\ 0 & 0 & -1 & 1 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix} \quad \text{and} \quad Q = \begin{pmatrix} L_2 & -U_2 & L_3 & -U_3 & \cdots \\ -1 & 1 & 0 & 0 & \cdots \\ 0 & 0 & -1 & 1 & \cdots \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix},
$$
respectively.
The primal VRS-AR model can be written in the following form, which is usually easier to solve and interpret:
$$\begin{aligned}
\min\ & \theta \\
\text{subject to}\ & \sum_{j=1}^{n} X_j \lambda_j - P\pi \le \theta X_o, \\
& \sum_{j=1}^{n} Y_j \lambda_j + Q\tau \ge Y_o, \\
& \sum_{j=1}^{n} \lambda_j = 1, \\
& \lambda_j \ge 0, \; j = 1, \ldots, n, \quad \pi \ge 0, \quad \tau \ge 0,
\end{aligned} \qquad (14)$$
where $\pi = (\pi_1, \ldots, \pi_{2(m-1)})^T$ and $\tau = (\tau_1, \ldots, \tau_{2(r-1)})^T$ are extra variables that appear in the primal model as a result of the constraints (12) imposed on the dual variables.
Output-oriented VRS-AR model is written as follows:
$$\begin{aligned}
\max\ & \eta \\
\text{subject to}\ & \sum_{j=1}^{n} X_j \lambda_j - P\pi \le X_o, \\
& \sum_{j=1}^{n} Y_j \lambda_j + Q\tau \ge \eta Y_o, \\
& \sum_{j=1}^{n} \lambda_j = 1, \\
& \lambda_j \ge 0, \; j = 1, \ldots, n, \quad \pi \ge 0, \quad \tau \ge 0.
\end{aligned} \qquad (15)$$
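To make the structure of P and Q concrete, the sketch below assembles one such matrix from the bounds in (12); the column layout and sign convention reflect our reading of models (13)-(15), so the helper should be treated as illustrative rather than as the paper's code.

```python
import numpy as np

def ar_matrix(lower, upper):
    """Build P (for inputs) or Q (for outputs) from AR bounds: lower[j], upper[j] are the
    bounds l_k, u_k for k = 2..d. Columns come in pairs so that v @ P <= 0 encodes
    l_k <= v_k / v_1 <= u_k."""
    d = len(lower) + 1
    P = np.zeros((d, 2 * (d - 1)))
    for j, (lo, hi) in enumerate(zip(lower, upper)):
        k = j + 1                                      # 0-based position of v_k, k = 2..d
        P[0, 2 * j] = lo;       P[k, 2 * j] = -1.0     # l_k * v_1 - v_k <= 0
        P[0, 2 * j + 1] = -hi;  P[k, 2 * j + 1] = 1.0  # v_k - u_k * v_1 <= 0
    return P

# Example for m = 3 inputs: P = ar_matrix([0.5, 0.2], [2.0, 4.0]) has shape (3, 4).
```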

4. Results

4.1. Idea of the Proposed Approach

The assurance region method puts constraints on the ratio of input and output weights in the form of lower and upper bounds. This approach is mainly used to eliminate the zero weights that often appear in solutions of DEA models. However, weight restrictions (12) have another feature. They reduce the feasible domain of multipliers while the feasible domains of inputs and outputs are expanding. Figure 3a illustrates the incorporation of weight restrictions into the VRS model. As a result, the original production possibility set T VRS expands, and weakly efficient parts of the VRS frontier become inefficient in the VRS-AR model.
The VRS-AR model is used in this paper to expand the existing data by creating artificial observations “at the edge” of the efficient frontier, i.e., where the efficient part of the frontier adjoins the inefficient part. The proposed approach is illustrated in Figure 3b.
Initially, points $Z_1$ and $Z_2$ represent units which are the end-points of two infinite edges of set $T_1$. These edges are marked by dashed lines. After adding weight restrictions to the model, the frontier is transformed and artificial points $Z_1'$ and $Z_2'$ are inserted. The process is repeated, further reducing the feasible domain of the dual multipliers, which leads to the artificial units $Z_1''$ and $Z_2''$. Finally, a synthetic set $T_2$ is produced that includes the DMUs from $T_1$ and all generated artificial DMUs. The frontier of $T_2$ is indicated in Figure 3b by a thick solid line. Inefficient units can be inserted into $T_2$ by generating a random point on the frontier and then reducing outputs or increasing inputs.
The idea of introducing artificial units into the production possibility set is not new. Farrell already used artificial units to prevent weights from being zero in his models [65,66]. Later, in the works of Thanassoulis and Allen [43,44,67], artificial units were used to improve the envelopment of the PPS. They used the observed extreme efficient units (vertices) that lie on the boundary of the efficient part of the frontier as starting points for improving the frontier. In [67], such units are called anchor units. However, the approach of Thanassoulis and Allen has no formalized algorithm for inserting artificial DMUs.
In this paper, we use the concept of the terminal unit that is proposed in [68,69] as a point of departure for introducing artificial observations.
Definition 5.
An extreme efficient unit is called terminal if an infinite edge of the PPS starts at this unit.
This definition is better than the approach of Bougnol and Dulá [70] which determines an excessive number of anchor units and the approach of Thanassoulis et al. [44], which may not identify a sufficient number of units. The algorithm for determining terminal units in the VRS model is described in [68].

4.2. Algorithm for Synthetic Data Generation

According to the principles described in the Introduction, the algorithm for synthetic data generation should not change the original efficient frontier. Therefore, before starting the generation of artificial efficient DMUs, we need to make sure that the initial VRS-AR model does not violate (P1), i.e., that the efficient frontier is preserved after adding weight restrictions to the VRS model. The proposed algorithm for the determination of initial weight restrictions is based on the following propositions.
Proposition 1.
If all efficient units of the VRS model stay efficient in the VRS-AR model, then the efficient frontier of the VRS model belongs to the efficient frontier of the VRS-AR model.
Proof. 
Let E be the set of efficient units in the VRS model. Consider a set of extreme efficient units E * , which is a subset of E. According to Dulá and Thrall [71], the production possibility set is determined by a set of extreme efficient units. Since all units of E are efficient in the VRS-AR model, then all units from set  E * are also efficient in the VRS-AR model. It follows that any point of the original efficient frontier of the VRS model stays efficient in the VRS-AR model. Thus, the efficient frontier of the VRS model belongs to the efficient frontier of VRS-AR model.    □
The following proposition asserts the existence of weight restrictions in the VRS-AR model such that all efficient units of the VRS model stay efficient in the corresponding VRS-AR model.
Proposition 2.
For a set of DMUs, there exist nonzero weight restriction coefficients $l_k > 0$, $u_k > 0$, $k = 2, \ldots, m$, and $L_i > 0$, $U_i > 0$, $i = 2, \ldots, r$, in the VRS-AR model (14) such that the set of efficient units of this model coincides with the set of efficient units of the VRS model.
Proof. 
According to Lemma 4.1 in [1], for every efficient unit there exists a dual optimal solution $(v^*, u^*, u_0^*)$ such that $v^* > 0$ and $u^* > 0$. Let $v_j^* > 0$ and $u_j^* > 0$ be the optimal dual variables for the jth DMU. Choose the weight restriction coefficients as follows:
$$l_k = \min_{1 \le j \le n} \frac{v_{kj}^*}{v_{1j}^*} > 0, \quad u_k = \max_{1 \le j \le n} \frac{v_{kj}^*}{v_{1j}^*} > 0, \quad k = 2, \ldots, m,$$
$$L_i = \min_{1 \le j \le n} \frac{u_{ij}^*}{u_{1j}^*} > 0, \quad U_i = \max_{1 \le j \le n} \frac{u_{ij}^*}{u_{1j}^*} > 0, \quad i = 2, \ldots, r.$$
Then, for each efficient DMU, the optimal solution $(v^*, u^*, u_0^*)$ of the VRS model satisfies the AR constraints. This means that the optimal solution of the VRS model is also optimal in the VRS-AR model (14). Hence, the efficient units of the VRS model stay efficient in the VRS-AR model.    □
From Propositions 1 and 2, we obtain that there exist nonzero weight restriction coefficients in the VRS-AR model that do not change the efficient frontier of the VRS model. However, these statements do not give us an explicit expression for calculating such coefficients. The algorithm proposed below determines the maximal weight restriction coefficients $l_k$, $L_i$ and minimal coefficients $u_k$, $U_i$ that do not change the original efficient frontier. This algorithm is close to the concept of Constrained Facet Analysis, which was originally proposed in [72,73] and clarified in [74]. Since, for our purpose, it is not necessary to determine these weight limits precisely, we use a simple approximating algorithm. Algorithm 2 for the determination of initial weight restrictions that do not change the efficient frontier can be written as follows:
Algorithm 2 Determine initial weight restrictions.
Input: Initial dataset D; small parameter ε; parameter w (0 < w < 1).
Output: Coefficients $l_k$, $u_k$, $k = 2, \ldots, m$, and $L_i$, $U_i$, $i = 2, \ldots, r$.
1:procedure InitWeightRestrictions
    ▹ Initialize weight restrictions
2:    Set $l_k := \varepsilon$, $u_k := 1/\varepsilon$, $k = 2, \ldots, m$, $L_i := \varepsilon$, $U_i := 1/\varepsilon$, $i = 2, \ldots, r$.
    ▹ Correct initial weight restrictions
3:    Determine a set of efficient units E in model (8) for dataset D.
4:    Solve model (14) for dataset D. Find the set of efficient units E A R .
5:    while sets E and E A R are not equal do
6:        Set $l_k := l_k \cdot w$, $u_k := u_k / w$, $k = 2, \ldots, m$, $L_i := L_i \cdot w$, $U_i := U_i / w$, $i = 2, \ldots, r$.
7:        Solve model (14) again and determine the set of efficient units E A R .
8:    end while
    ▹ Adjust input weight restrictions
9:    for each input $k = 2, \ldots, m$  do
10:        Increase $l_k := l_k / w$ while the set of efficient units in model (14) coincides with the set of efficient units in model (8).
11:        Decrease $u_k := u_k \cdot w$ while the set of efficient units in model (14) coincides with the set of efficient units in model (8).
12:    end for
    ▹ Adjust output weight restrictions
13:    for each output $i = 2, \ldots, r$  do
14:        Increase $L_i := L_i / w$ while the set of efficient units in model (14) coincides with the set of efficient units in model (8).
15:        Decrease $U_i := U_i \cdot w$ while the set of efficient units in model (14) coincides with the set of efficient units in model (8).
16:    end for
17:end procedure
In the algorithm, a small parameter  ε is used to initialize the coefficients of the weight restrictions. Parameter w ( 0 < w < 1 ) characterizes the change of the weight coefficients during the iterations.
Lemma 1.
Algorithm 2 converges in a finite number of steps and does not violate (P1).
Proof. 
In step 2 of Algorithm 2, coefficients l k , u k , L i , and U i are initialized with some values controlled by a small parameter  ε .
Next, in steps 3–8, we check that the set of efficient units E in the VRS model and the set of efficient units $E_{AR}$ in the VRS-AR model are equal. This guarantees that (P1) holds according to Proposition 1. If the weight restrictions chosen in step 2 violate (P1), then the lower bounds $l_k$ and $L_i$ are decreased by factor w, while the upper bounds are divided by w, i.e., increased. These steps are repeated until coefficients that meet (P1) are found. Proposition 2 says that such coefficients exist and are positive. Thus, they can be found in a finite number of steps.
In steps 9–16, the lower and upper bounds of the weight restrictions are adjusted as long as the set of efficient units in the VRS model coincides with the set of efficient units in the VRS-AR model, i.e., (P1) is not violated here. The adjustments lead to a consistent increase in the lower bounds and a decrease in the upper bounds. Hence, this process finishes in a finite number of steps because the lower bounds cannot exceed the upper bounds.
This completes the proof.    □
Next, we describe the algorithm for generating efficient units using a three-dimensional VRS model. In Figure 4a, the PPS of the VRS model is determined by the observed production units A–G. According to (P2), artificial efficient DMUs should extend the efficient part of the frontier. Units B and D are efficient, but they cannot be used for extending the efficient frontier because they are located in its interior part. Units A, C, E, F, and G are terminal since infinite edges start at these units. These units are vertices of unbounded inefficient facets where the efficient frontier can be expanded. To explain the process of generating artificial efficient units, consider terminal unit G. Two unbounded facets, $\Gamma_1$ and $\Gamma_2$, contain this unit. In order to identify these facets, several points $R_j$ are randomly generated in the vicinity of unit G such that $\|G - R_j\| \le \delta$.
Variables in DEA models may have various units of measurement, and their values may be of different order. In order to avoid numerical difficulties, we use a weighted infinite norm
$$\|Z\| = \max_{1 \le k \le m+r} \beta_k |z_k|, \qquad Z \in \mathbb{R}^{m+r},$$
to determine the vicinity of the terminal unit. Here, the weights $\beta_k > 0$ are chosen to “normalize” each coordinate k. For the vicinity of unit $G = (g_1, \ldots, g_{m+r})$, its coordinates $g_k$ are used for normalization, i.e., the weights of the norm are taken as $\beta_k = 1/g_k$, $k = 1, \ldots, m+r$.
By choosing the radius of the vicinity $0 < \delta < 1$ and using uniformly distributed random values $w_{kj} \sim U[-\delta, \delta]$, $k = 1, \ldots, m+r$, $j = 1, \ldots, p$, the coordinates of the random points $R_j = (r_{1j}, \ldots, r_{(m+r)j})$ are derived as follows:
$$r_{kj} = g_k (1 + w_{kj}), \quad k = 1, \ldots, m+r, \; j = 1, \ldots, p, \qquad (16)$$
where p represents the number of randomly generated points.
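A small sketch of this sampling step, following Equation (16) with our own function name:

```python
import numpy as np

def vicinity_points(G, delta, p, seed=None):
    """Generate p random points in the delta-vicinity of a terminal unit G (Eq. (16)):
    each coordinate g_k is perturbed by a factor 1 + w, with w uniform on [-delta, delta]."""
    rng = np.random.default_rng(seed)
    G = np.asarray(G, dtype=float)
    W = rng.uniform(-delta, delta, size=(p, G.size))   # w_{kj} ~ U[-delta, delta]
    return G * (1.0 + W)                               # r_{kj} = g_k (1 + w_{kj})
```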
Next, each random point R j is projected onto the frontier by solving models (9) and (10). If all slacks in the optimal solution are zero, then the facet is bounded and is not being used further. Otherwise, using the optimal solution of model (9) or (10), an unbounded facet that contains radial projection can be determined.
Suppose the projection lies in the facet $\Gamma_1$ and the optimal solution contains the nonzero optimal $\lambda$-variables $\lambda_j^* > 0$, $j \in J$, and the optimal slacks $s_k^{-*} > 0$, $k \in I_1$, $s_i^{+*} > 0$, $i \in I_2$. Then the unbounded facet $\Gamma_1$ that contains the radial projection of some random point can be represented in the form:
$$\Gamma_1 = \left\{ Z \;\middle|\; Z = \sum_{j \in J} Z_j \lambda_j + \sum_{k \in I_1} \mu_k e_k - \sum_{i \in I_2} \rho_i e_i, \;\; \sum_{j \in J} \lambda_j = 1, \; \lambda_j \ge 0, \, j \in J, \; \mu_k \ge 0, \, k \in I_1, \; \rho_i \ge 0, \, i \in I_2 \right\}, \qquad (17)$$
where the index set J represents the vertices of $\Gamma_1$, and the index sets $I_1$ and $I_2$ correspond to the rays of $\Gamma_1$.
Once the inefficient facet $\Gamma_1$ is found, it can be used for the construction of artificial efficient units. For this purpose, a point $\tilde{Z}$ belonging to the unbounded facet is generated:
$$\tilde{Z} = \bar{Z} + \sum_{k \in I_1} \alpha \|\bar{Z}\| e_k - \sum_{i \in I_2} \alpha \|\bar{Z}\| e_i, \qquad (18)$$
where $\alpha$ is a parameter that determines the shift from point $\bar{Z}$ along the rays of $\Gamma_1$, and $\bar{Z}$ is the centroid of the vertices of facet $\Gamma_1$ given by
$$\bar{Z} = \sum_{j \in J} Z_j / |J|,$$
where J indexes the vertices of $\Gamma_1$. In accordance with the recommendation of Dulá [47], point $\tilde{Z}$ is constructed with the aim of ensuring the uniformity of the distribution of the artificially generated DMUs.
In Figure 4a, the facet $\Gamma_1$ has two vertices, A and G, and one ray $e_1$ corresponding to the axis $x_1$. Therefore, point $\bar{Z}$ lies in the middle of segment AG, i.e., $\bar{Z} = 0.5A + 0.5G$. According to (18), point $\tilde{Z}$ is obtained as $\tilde{Z} = \bar{Z} + \alpha \|\bar{Z}\| e_1$.
After that, $\tilde{Z}$ is projected radially onto the frontier of the VRS-AR model by solving problem (14) or (15). The projection of $\tilde{Z}$ is the new artificial efficient unit H that is added to the PPS. In Figure 4b, we can see how the PPS has changed after adding unit H. It is worth noting that the original efficient frontier has not changed, while a new efficient facet, AGH, has appeared.
Such operations are repeated for all the facets that contain terminal units. As a result, a number of artificial efficient units are included in the PPS. Next, weight restrictions are slightly relaxed by multiplying coefficients  l k and  L i by factor w and dividing coefficients  u k and  U i by w. Thus, the iterations continue with a new PPS until the number of efficient DMUs reaches the set value.
The pseudocode of the proposed Algorithm 3 is given below. Parameter p in the algorithm represents the number of random units generated for each terminal unit in order to find the unbounded facets that contain this unit. A smaller value of p results in fewer facets being detected. The higher the value of p, the more test points are checked and the more facets the algorithm can potentially detect. However, too high a value of p leads to excessive computations. Therefore, we recommend choosing p in the range of 10 to 30. Parameter w (0 < w < 1) characterizes the strategy of changing the weight restrictions.
Lemma 2.
Algorithm 3 converges in a finite number of steps and does not violate (P1) and (P2).
Proof. 
The algorithm stops when the number of efficient units in S E is equal to or greater than  N E . Hence, to prove the convergence of the algorithm, it is sufficient to show that at every iteration at least one artificial efficient unit is added to the S E .
The frontier of the VRS model always has unbounded facets due to the free disposability of inputs and outputs. Additionally, every PPS has at least one terminal unit and all terminal units can be determined in a finite number of steps using the algorithm described in [68]. For each terminal unit, by Definition 5, there exists an infinite edge starting at this unit. The production possibility set of the VRS model is a convex polyhedral set [1]. Hence, there exist a number of unbounded facets that contain the infinite edge and terminal unit itself. This means that there are points in the vicinity of the terminal unit belonging to unbounded facets or projected onto these facets. Thus, there is a nonzero probability that the projection of the random point will belong to an unbounded facet. With a sufficiently large number of random points, there will be at least one point that is projected onto an unbounded facet. If at least one unbounded facet is found, then it is projected onto the efficient frontier of the VRS-AR model. According to Theorem 5 in [33], this projection always exists and represents an efficient artificial DMU. Thus, at least one artificial efficient unit is generated at every iteration and the algorithm converges.
At every iteration of Algorithm 3, in Step 18, the weight restrictions are relaxed by multiplying the lower bounds $l_k$ and $L_i$ by the parameter w < 1 and by dividing the upper bounds $u_k$ and $U_i$ by the same value. This ensures that the existing efficient frontier does not change after such a correction of the weight restrictions and that (P1) is not violated.
By construction, unit $\tilde{Z}$ belongs to an unbounded facet of the VRS frontier, so it is projected onto an unbounded facet of the VRS-AR frontier. In Step 15, this projection becomes an artificial efficient unit that does not belong to the initial efficient frontier of the VRS model because the unbounded facets of the VRS model are inefficient. Thus, artificial efficient DMUs extend the efficient part of the frontier, and (P2) also holds.
This completes the proof.    □
Algorithm 3 Generation of artificial efficient units.
Input: Initial dataset D; number of efficient units $N_E$; initial weight restrictions $l_k$, $u_k$, $k = 2, \ldots, m$, and $L_i$, $U_i$, $i = 2, \ldots, r$.
Output: Synthetic dataset S E .
1:procedure GenerateEfficientDMUs
2:    Set $S_E := D$. ▹ Initialization
3:    while number of efficient units in S E is less than N E  do
4:        Set F : = , P : = .
5:        Find the set of terminal units $E_t$ in dataset $S_E$ using the algorithm
          described in [68].
6:        for each terminal unit Z t in E t  do
7:           Generate random units R j , j = 1 , , p in the vicinity of unit Z t using (16).
8:           Find projections of units R j onto the boundary of PPS by solving
             models (9) and (10).
9:           Find vertices and direction vectors of unbounded facets,
             where units are projected.
10:           Add facets to the set F, keeping only distinct facets.
11:        end for
12:        for each facet f in F do
13:           Generate artificial unit Z ˜ according to Equation (18).
14:           Project unit Z ˜ onto VRS-AR frontier by solving models (14) and (15).
15:           Append projected unit to the set P.
16:        end for
17:        Set S E : = S E P .
18:        Set $l_k := l_k \cdot w$, $u_k := u_k / w$, $k = 2, \ldots, m$, $L_i := L_i \cdot w$, $U_i := U_i / w$, $i = 2, \ldots, r$.
19:    end while
20:    if the number of efficient units in $S_E$ is greater than $N_E$  then
21:        Remove artificial efficient units from $S_E$ so that the number of efficient units equals $N_E$.
22:    end if
23:    return  S E
24:end procedure
At the next stage, inefficient artificial DMUs are generated. In order to generate artificial units with realistic input-output mixes, we used the convex hull of the DMUs from the original dataset together with the previously generated efficient DMUs because the convex hull contains all proportions between inputs and outputs present in the data. Uniform sampling from the convex hull of a set of points is computationally hard due to the large number of vertices and the high dimension of the space. Hence, we chose a simpler and more computationally efficient method. First, we randomly select m + r + 1 different DMUs. These points form a random simplex that is contained in the convex hull. Next, a random point is chosen by uniform sampling from the derived simplex using the method proposed in [75].
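The sketch below illustrates this sampling scheme: a random (m + r + 1)-point simplex is drawn from the dataset and a point is sampled uniformly inside it using Dirichlet-distributed barycentric weights. This is a common way to obtain such a uniform sample and is given here as an assumption; the exact procedure of [75] may differ in detail.

```python
import numpy as np

def sample_from_hull(points, n_samples, seed=None):
    """Draw points inside the convex hull of `points` (N, d) via random simplices.
    Requires N >= d + 1; returns an (n_samples, d) array."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    N, d = points.shape
    out = np.empty((n_samples, d))
    for s in range(n_samples):
        idx = rng.choice(N, size=d + 1, replace=False)  # vertices of a random simplex
        w = rng.dirichlet(np.ones(d + 1))               # uniform barycentric weights
        out[s] = w @ points[idx]                        # convex combination of the vertices
    return out
```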
After that, to generate inefficient artificial DMUs, random points from the convex hull of the dataset S E produced by Algorithm 3 are projected onto the frontier. Inefficient artificial DMUs are generated relative to those projections by proportionally increasing inputs or proportionally contracting outputs. In the first case, inputs are increased as
$$X_k = \tilde{X}_k \cdot \exp(u_i),$$
where $u_i \sim |N(0, \sigma_u^2)|$, i.e., $u_i$ has a half-normal distribution obtained by taking the absolute value of a normally distributed variable. In the output case, inefficient DMUs are generated according to
$$Y_k = \tilde{Y}_k \cdot \exp(-u_i).$$
In the proposed algorithm, both methods alternate when generating inefficient artificial DMUs.
To ensure that the generated data has a distribution of efficiency scores close to the original, σ u 2 can be estimated from the initial dataset as the sample variance
$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} u_i^2, \qquad (19)$$
where $u_i = \ln(\theta_i)$, and $\theta_i$ is the efficiency score of inefficient DMU i from the initial dataset D.
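A small helper for this estimate, under the assumption stated above that only the inefficient DMUs (with efficiency scores below one) enter the sum:

```python
import numpy as np

def inefficiency_variance(theta):
    """Sample variance of u_i = ln(theta_i) over the inefficient DMUs, as in Eq. (19)."""
    theta = np.asarray(theta, dtype=float)
    u = np.log(theta[theta < 1.0])                     # efficient units (theta = 1) are excluded
    return float(np.mean(u ** 2)) if u.size else 0.0
```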
Combining the steps described above, we obtain the following Algorithm 4.
Algorithm 4 Generating inefficient DMUs.
Input: Synthetic dataset SE; estimate σ ^ v 2 for input efficiencies; estimate σ ^ u 2 for output efficiencies; number of inefficient units N.
Output: Set of inefficient DMUs S I .
1:procedure GenerateInefficientDMUs
2:     $S_I := \emptyset$. ▹ Initialization
3:    Generate N random points ( X i R , Y i R ) , i = 1 , , N in CH ( S E ) .
4:    for  i : = 1 to N do
5:        if  $(i \bmod 2) = 1$  then ▹ Alternate between the input and output models
6:           Solve model (9) on S E for unit ( X i R , Y i R ) .
7:           Find the projection $(\tilde{X}_i^R, \tilde{Y}_i^R) \leftarrow (\theta_i X_i^R, Y_i^R)$.
8:           Generate random $u_i \sim |N(0, \hat{\sigma}_u^2)|$.
9:            $Z_i \leftarrow (\tilde{X}_i^R \cdot \exp(u_i),\ \tilde{Y}_i^R)$.
10:        else
11:           Solve model (10) on S E for unit ( X i R , Y i R ) .
12:           Find the projection $(\tilde{X}_i^R, \tilde{Y}_i^R) \leftarrow (X_i^R, \eta_i Y_i^R)$.
13:           Generate random $v_i \sim |N(0, \hat{\sigma}_v^2)|$.
14:            $Z_i \leftarrow (\tilde{X}_i^R,\ \tilde{Y}_i^R \cdot \exp(-v_i))$.
15:        end if
16:        Append Z i to the set S I .
17:    end for
18:    return  S I
19:end procedure
The full algorithm for synthetic data generation can be summarized as Algorithm 5. In the first step, we determine weight restrictions that are as tight as possible without changing the efficient frontier. Then, we generate efficient units according to Algorithm 3. Finally, we generate inefficient units with Algorithm 4 using $\hat{\sigma}_v^2$ and $\hat{\sigma}_u^2$ determined from the original dataset.
The correctness of the algorithm is confirmed by the following theorem.
Theorem 2.
Algorithm 5 converges in a finite number of steps and does not violate (P1) and (P2).
Algorithm 5 Synthetic data generation.
Input: Initial dataset D; total number of units N; number of efficient units NE.
Output: Synthetic dataset S.
1:Find initial weight restrictions according to Algorithm 2.
2:Generate $N_E$ efficient units using Algorithm 3.
3:Find $\hat{\sigma}_v^2$ and $\hat{\sigma}_u^2$ of dataset D according to Equation (19).
4:Generate $(N - N_E)$ inefficient units using Algorithm 4.
5: $S := S_E \cup S_I$.
6:return S.
Proof. 
According to Lemmas 1 and 2, steps 1 and 2 of the algorithm are executed in a finite number of steps. Steps 3–6 also consist of a limited number of steps. Thus, Algorithm 5 converges.
Inefficient units generated in Algorithm 4 are located inside PPS; they cannot change the efficient frontier. Taking into account the results of Lemmas 1 and 2, it follows that Algorithm 5 does not violate (P1) and (P2).
This completes the proof. □

4.3. Computational Experiments

The work of Algorithm 5 is illustrated using two small datasets. For the first dataset, Case 1, we took data from the financial accounts of Russian banks for 2008. The dataset contains 200 DMUs with 6 variables (3 inputs and 3 outputs), see [76]. For Case 2, we took the data from the paper of Charnes et al. [53]; this dataset is described in Section 2. The distributions of the efficiency scores in Cases 1 and 2 are presented in Figure 5.
The computational experiments were conducted on a PC with Intel Core i3 CPU 3.33 GHz. We use CPLEX [77] version 12.6.2 to solve optimization problems.
In Algorithm 2, we set the small parameter $\varepsilon = 10^{-3}$ used to initialize the coefficients of the weight restrictions. Parameter w, which characterizes the change of the weight coefficients during the iterations, is equal to 0.5. The number of generated random points in Algorithm 3 is regulated by the parameter p, which is set to 20.
Our computations confirmed that the algorithm works correctly, and all initially efficient DMUs remained efficient in the expanded dataset. In other words, the algorithm does not violate the established principles.
Table 1 illustrates the work of Algorithm 3 for Case 1. At the beginning of the first iteration, there were 200 DMUs, of which 28 were efficient and terminal. Then, 229 artificial DMUs were created, and the total number of DMUs became 429. After the second iteration, 1140 artificial DMUs were added to the dataset, and the total number of DMUs reached 1369. In the third iteration, 6452 DMUs were produced, bringing the total number of artificial DMUs to 7821. The number of artificial units increases rapidly with each iteration, so only three iterations were enough to obtain a sufficient number of efficient DMUs.
Before generating inefficient units, the variances $\hat{\sigma}_v^2$ and $\hat{\sigma}_u^2$ should be estimated from the original dataset. Applying (19) to the input and output efficiency scores of the observed DMUs, the following estimates were obtained:
$$\hat{\sigma}_v^2 = 1.2921, \qquad \hat{\sigma}_u^2 = 0.7548.$$
According to Algorithm 4, inefficient units were added so that the total number of DMUs became 50,000. The synthetic dataset contains 7440 efficient DMUs, so the share of efficient units is equal to 14.88%, which is approximately the same as in the original dataset.
The distribution of the efficiency score of the original dataset in comparison with artificially generated dataset for Case 1 is presented in Figure 6.
Figure 7 presents the correlations between the variables in the original dataset and in the synthetic dataset. It can be seen from the figure that after complementing the dataset with artificial DMUs, some coefficients became smaller. However, Algorithm 5 retains the positive correlation between inputs and outputs.
For Case 2, the original dataset contains 70 DMUs, of which 27 (or 38.57%) are efficient. After three iterations of Algorithm 3, the number of efficient DMUs became 19,480. The variance estimates for the input and output efficiency measures in the original dataset are $\hat{\sigma}_v^2 = 0.0091$ and $\hat{\sigma}_u^2 = 0.0094$. Next, 30,477 inefficient DMUs were generated according to Algorithm 4, and the total number of DMUs in the synthetic dataset reached 50,000. The share of efficient units in the produced dataset is equal to 38.96%.
In order to investigate the multicollinearity in the synthetic datasets for Cases 1 and 2, we use the variance inflation factor (VIF). According to the test results presented in Table 2 and Table 3, the proposed algorithm reduces multicollinearity among the variables as a whole. The value of VIF increased only for variable $x_5$ in Case 2.
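The VIF values can be computed directly from the data matrix; a minimal sketch (our own helper, not the software used in the paper) regresses each variable on the remaining ones and reports 1/(1 - R^2):

```python
import numpy as np

def variance_inflation_factors(data):
    """VIF for each column of `data` (n_obs, n_vars) via OLS on the other columns."""
    data = np.asarray(data, dtype=float)
    n, k = data.shape
    vif = np.empty(k)
    for j in range(k):
        y = data[:, j]
        X = np.column_stack([np.ones(n), np.delete(data, j, axis=1)])
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)   # regress x_j on the other variables
        resid = y - X @ beta
        ss_tot = np.sum((y - y.mean()) ** 2)
        r2 = 1.0 - np.sum(resid ** 2) / ss_tot
        vif[j] = 1.0 / (1.0 - r2)
    return vif
```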
Table 4 presents the execution time for Algorithm 5 in Cases 1 and 2. Only the main steps of the algorithm are included in the table because the remaining steps are performed in a very short time, which can be neglected compared to the other steps. The computational experiments show that the total execution time for Case 2 is greater than in Case 1. This is due to the fact that the number of variables of Case 2 is greater, and the initial number of DMUs is almost three times smaller than in Case 1.
Table 4 shows that the running time of Algorithm 5 is rather long. The proposed approach clearly has worse time complexity than other algorithms for synthetic data generation in DEA because it requires solving a large number of LPs. However, compared to other algorithms, Algorithm 5 has certain advantages.
The basic assumption of a production theory is the monotonicity of the production process. This means that as inputs increase, outputs also increase. Using a uniform distribution as a DGP leads to input-output combinations that violate this assumption. Figure 1 shows that the Cobb–Douglas approach also produces datasets where negative correlation coefficients are presented for some pairs of inputs and outputs. At the same time, computational experiments show that our algorithm preserves isotonicity and does not increase the collinearity of the original data as a whole.
Our approach uses the partial synthetic generation of the DEA datasets. This makes it possible to use the properties of real DMUs to generate artificial ones. Furthermore, the proposed algorithm has a number of useful properties. First, it does not change the existing efficient frontier. In the DEA approach, the efficient frontier contains valuable practical information, so it must be preserved when artificial units are added to the dataset. Second, the algorithm extends the efficient frontier in the borderline region, which makes the frontier more versatile.
The approach proposed in this paper differs from the method of Wimmer and Finger [56] in two respects. First, the objective is different: they tried to replicate the original dataset, i.e., to replace the original DMUs with artificial ones, and they did not intend to generate a large dataset. Moreover, Algorithm 1 does not preserve the existing frontier since it is designed to replace the original data with artificial data. Second, in order for the statistical models in Algorithm 1 to be sufficiently accurate, the dataset must be large enough. Our approach does not have such limitations and works well even with small datasets.

5. Conclusions

Some DEA studies need large-scale datasets to test the methods that they propose. Algorithms for applying DEA to big data are most in need of such datasets because there are not enough open datasets with a large number of DMUs. Thus, these studies mainly use synthetic datasets for testing. Existing data generation algorithms in DEA produce datasets from scratch and cannot provide the statistical variability that is inherent in real data. To fill this gap, a new method is proposed for generating synthetic data. This method receives a real dataset as input and complements it with artificial DMUs. Artificial efficient units are generated in the regions of the input-output space that are not covered by the available data. For this purpose, weight restrictions are used, which help to adjust the properties of the efficient frontier in the data neighborhood. Inefficient DMUs are generated taking into account the statistical characteristics of the original dataset. Our computational experiments using two real datasets demonstrate that the proposed algorithm works reliably and can increase the number of DMUs up to 50K.
The main limitation of the study is that the proposed algorithm has worse time complexity than other existing generation algorithms in DEA. However, this disadvantage is not critical. Implementing parallel computations can significantly reduce the calculation time, because many LPs in Algorithm 5 can be solved independently. Moreover, for DEA computations with large datasets, faster algorithms can be applied [32,54]. The author will address this issue in future work.
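As an indication of how such parallelisation might look, the sketch below scores all DMUs concurrently by reusing the illustrative vrs_efficiency function from the earlier sketch; it is not the implementation used in this study.

```python
# Illustrative parallel scoring: the per-DMU LPs are independent, so they can
# be dispatched to separate worker processes. On Windows this call must be
# placed under an `if __name__ == "__main__":` guard.
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def score_all(X, Y, workers=8):
    score = partial(vrs_efficiency, X, Y)        # fix the data, vary the DMU index
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score, range(X.shape[0])))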
In this paper, an algorithm for generating synthetic data is presented only for the VRS model. The generalization to other DEA technologies leaves room for future research. Furthermore, the analysis of the proposed algorithm could be extended with other tests such as rules of thumb and sensitivity analysis.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/data8100146/s1.

Funding

This work was supported by the Russian Science Foundation (project No. 23-11-00197) https://rscf.ru/en/project/23-11-00197/ (accessed on 20 September 2023).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The generated synthetic datasets are available as supplementary material in the online version of the paper.

Acknowledgments

The author thanks the academic editors and anonymous reviewers for their guidance and constructive suggestions.

Conflicts of Interest

The author declares no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
AR      Assurance Region
CRAN    Comprehensive R Archive Network
CRS     Constant Returns to Scale
DEA     Data Envelopment Analysis
DGP     Data Generating Process
DMU     Decision Making Unit
ML      Machine Learning
LP      Linear Programming
PPS     Production Possibility Set
RTS     Returns to Scale
SCSC    Strong Complementary Slackness Conditions
VRS     Variable Returns to Scale

References

  1. Cooper, W.W.; Seiford, L.M.; Tone, K. Data Envelopment Analysis. A Comprehensive Text with Models, Applications, References and DEA-Solver Software, 2nd ed.; Springer Science and Business Media: New York, NY, USA, 2007. [Google Scholar] [CrossRef]
  2. Mozaffari, M.R.; Ostovan, S. Finding projection in the two-stage supply chain in DEA-R with random data using (CRA) model. Big Data Comput. Visions 2021, 1, 146–155. [Google Scholar] [CrossRef]
  3. Fallah, R.; Kouchaki Tajani, M.; Maranjory, M.; Alikhani, R. Comparison of Banks and Ranking of Bank Loans Types on Based of Efficiency with Dea in Iran. Big Data Comput. Visions 2021, 1, 36–51. [Google Scholar] [CrossRef]
  4. Soltani, M.R.; Edalatpanah, S.A.; Sobhani, F.M.; Najafi, S.E. A Novel Two-Stage DEA Model in Fuzzy Environment: Application to Industrial Workshops Performance Measurement. Int. J. Comput. Intell. Syst. 2020, 13, 1134–1152. [Google Scholar] [CrossRef]
  5. Ebrahimzadeh Shermeh, H.; Alavidoost, M.H.; Darvishinia, R. Evaluating the efficiency of power companies using data envelopment analysis based on SBM models: A case study in power industry of Iran. J. Appl. Res. Ind. Eng. 2018, 5, 286–295. [Google Scholar] [CrossRef]
  6. Khodabakhshi, M.; Cheraghali, Z. Ranking of Iranian executive agencies using audit court budget split indexes and data envelopment analysis. J. Appl. Res. Ind. Eng. 2022, 9, 312–322. [Google Scholar] [CrossRef]
  7. Montazeri, F.Z. An overview of data envelopment analysis models in fuzzy stochastic environments. J. Fuzzy Ext. Appl. 2020, 1, 272–278. [Google Scholar] [CrossRef]
  8. Ucal Sari, I.; Ak, U. Machine efficiency measurement in industry 4.0 using fuzzy data envelopment analysis. J. Fuzzy Ext. Appl. 2022, 3, 177–191. [Google Scholar] [CrossRef]
  9. Bazargan, A.; Najafi, S.E.; Hoseinzadeh Lotfi, F.; Fallah, M.; Edalatpanah, S.A. Presenting a productivity analysis model for Iran oil industries using Malmquist network analysis. Decis. Mak. Appl. Manag. Eng. 2023, 6, 251–292. [Google Scholar] [CrossRef]
  10. Ratner, S.V.; Balashova, S.A.; Lychev, A.V. The Efficiency of National Innovation Systems in Post-Soviet Countries: DEA-Based Approach. Mathematics 2022, 10, 3615. [Google Scholar] [CrossRef]
  11. Kassaei, S.; Hosseinzadeh Lotfi, F.; Amirteimoori, A.; Rostamy-Malkhalifeh, M.; Rahmani, B. Identification and evaluation of congestion in two-stage network data envelopment analysis. Int. J. Res. Ind. Eng. 2023, 12, 53–72. [Google Scholar] [CrossRef]
  12. Ghomashi Langroudi, A.; Abbasi, M. A neutral DEA model for cross-efficiency evaluation. Int. J. Res. Ind. Eng. 2022, 11, 411–422. [Google Scholar] [CrossRef]
  13. Sigala, M. Using Data Envelopment Analysis for Measuring and Benchmarking Productivity in the Hotel Sector. J. Travel Tour. Mark. 2004, 16, 39–60. [Google Scholar] [CrossRef]
  14. Maghbouli, M.; Yekta, A.P. Undesirable Input in Production Process: A DEA-Based Approach. J. Oper. Strateg. Anal. 2023, 1, 46–54. [Google Scholar] [CrossRef]
  15. Sueyoshi, T.; Sekitani, K. Measurement of returns to scale using a non-radial DEA model: A range-adjusted measure approach. Eur. J. Oper. Res. 2007, 176, 1918–1946. [Google Scholar] [CrossRef]
  16. Sueyoshi, T.; Sekitani, K. The measurement of returns to scale under a simultaneous occurrence of multiple solutions in a reference set and a supporting hyperplane. Eur. J. Oper. Res. 2007, 181, 549–570. [Google Scholar] [CrossRef]
  17. Sueyoshi, T.; Sekitani, K. An occurrence of multiple projections in DEA-based measurement of technical efficiency: Theoretical comparison among DEA models from desirable properties. Eur. J. Oper. Res. 2009, 196, 764–794. [Google Scholar] [CrossRef]
  18. Dantzig, G.B.; Thapa, M.N. Linear Programming 2: Theory and Extensions; Springer: New York, NY, USA, 2003. [Google Scholar]
  19. Krivonozhko, V.E.; Førsund, F.R.; Lychev, A.V. A note on imposing strong complementary slackness conditions in DEA. Eur. J. Oper. Res. 2012, 220, 716–721. [Google Scholar] [CrossRef]
  20. Krivonozhko, V.E.; Afanasiev, A.P.; Førsund, F.R.; Lychev, A.V. Comparison of Different Methods for Estimation of Returns to Scale in Nonradial Data Envelopment Analysis Models. Autom. Remote Control. 2022, 83, 1136–1148. [Google Scholar] [CrossRef]
  21. Anderson, T.; Rouse, P. Data Envelopment Analysis Dataset Repository. Available online: http://www.etm.pdx.edu/dea/dataset/default.htm (accessed on 21 June 2023).
  22. Kaggle Datasets. Available online: https://www.kaggle.com/datasets (accessed on 21 June 2023).
  23. PASCAL Network. Machine Learning Data Repository. Available online: http://mldata.org/ (accessed on 21 June 2023).
  24. The World Bank Group. World Bank Open Data. Available online: https://data.worldbank.org/indicator (accessed on 21 June 2023).
  25. United Nations. UNdata. Available online: http://data.un.org/ (accessed on 21 June 2023).
  26. Harvard College. Harvard Dataverse Repository. Available online: https://dataverse.harvard.edu/ (accessed on 21 June 2023).
  27. European Organization For Nuclear Research; OpenAIRE. Zenodo. 2013. Available online: https://doi.org/10.25495/7GXK-RD71 (accessed on 22 September 2023).
  28. Figueira, A.; Vaz, B. Survey on Synthetic Data Generation, Evaluation Methods and GANs. Mathematics 2022, 10, 2733. [Google Scholar] [CrossRef]
  29. Lu, Y.; Shen, M.; Wang, H.; Wei, W. Machine Learning for Synthetic Data Generation: A Review. arXiv 2023, arXiv:2302.04062. Available online: https://arxiv.org/abs/2302.04062 (accessed on 21 June 2023).
  30. Zhu, N.; Zhu, C.; Emrouznejad, A. A combined machine learning algorithms and DEA method for measuring and predicting the efficiency of Chinese manufacturing listed companies. J. Manag. Sci. Eng. 2021, 6, 435–448. [Google Scholar] [CrossRef]
  31. Guerrero, N.M.; Aparicio, J.; Valero-Carreras, D. Combining Data Envelopment Analysis and Machine Learning. Mathematics 2022, 10, 909. [Google Scholar] [CrossRef]
  32. Khezrimotlagh, D.; Zhu, J.; Cook, W.D.; Toloo, M. Data envelopment analysis and big data. Eur. J. Oper. Res. 2019, 274, 1047–1054. [Google Scholar] [CrossRef]
  33. Charnes, A.; Cooper, W.W.; Wei, Q.L.; Huang, Z.M. Cone ratio data envelopment analysis and multi-objective programming. Int. J. Syst. Sci. 1989, 20, 1099–1118. [Google Scholar] [CrossRef]
  34. Charnes, A.; Cooper, W.W.; Huang, Z.M.; Sun, D.B. Polyhedral Cone-Ratio DEA Models with an Illustrative Application to Large Commercial Banks. J. Econ. 1990, 46, 73–91. [Google Scholar] [CrossRef]
  35. Thompson, R.G.; Singleton, F.D.; Thrall, R.M.; Smith, B.A. Comparative Site Evaluations for Locating a High-Energy Physics Lab in Texas. Interfaces 1986, 16, 35–49. [Google Scholar] [CrossRef]
  36. Thompson, R.G.; Langemeier, L.N.; Lee, C.T.; Lee, E.; Thrall, R.M. The role of multiplier bounds in efficiency analysis with an application to Kansas farming. J. Econ. 1990, 46, 93–108. [Google Scholar] [CrossRef]
  37. Thompson, R.G.; Dharmapala, P.; Rothenberg, L.J.; Thrall, R.M. DEA/AR efficiency and profitability of 14 major oil companies in U.S. exploration and production. Comput. Oper. Res. 1996, 23, 357–373. [Google Scholar] [CrossRef]
  38. Brockett, P.L.; Charnes, A.; Cooper, W.W.; Huang, Z.M.; Sun, D.B. Data transformations in DEA cone ratio envelopment approaches for monitoring bank performance. Eur. J. Oper. Res. 1997, 98, 250–268. [Google Scholar] [CrossRef]
  39. Wei, Q.; Yan, H.; Xiong, L. A bi-objective generalized data envelopment analysis model and point-to-set mapping projection. Eur. J. Oper. Res. 2008, 190, 855–876. [Google Scholar] [CrossRef]
  40. Podinovski, V.V. Production trade-offs and weight restrictions in data envelopment analysis. J. Oper. Res. Soc. 2004, 55, 1311–1322. [Google Scholar] [CrossRef]
  41. Podinovski, V.V. Improving data envelopment analysis by the use of production trade-offs. J. Oper. Res. Soc. 2007, 58, 1261–1270. [Google Scholar] [CrossRef]
  42. Podinovski, V.V.; Bouzdine-Chameeva, T. Weight Restrictions and Free Production in Data Envelopment Analysis. Oper. Res. 2013, 61, 426–437. [Google Scholar] [CrossRef]
  43. Allen, R.; Thanassoulis, E. Improving envelopment in data envelopment analysis. Eur. J. Oper. Res. 2004, 154, 363–379. [Google Scholar] [CrossRef]
  44. Thanassoulis, E.; Kortelainen, M.; Allen, R. Improving envelopment in Data Envelopment Analysis under variable returns to scale. Eur. J. Oper. Res. 2012, 218, 175–185. [Google Scholar] [CrossRef]
  45. Banker, R.D.; Charnes, A.; Cooper, W.W. Some models for estimating technical and scale inefficiencies in data envelopment analysis. Manag. Sci. 1984, 30, 1078–1092. [Google Scholar] [CrossRef]
  46. Barr, R.S.; Durchholz, M.L. Parallel and hierarchical decomposition approaches for solving large-scale Data Envelopment Analysis models. Ann. Oper. Res. 1997, 73, 339–372. [Google Scholar] [CrossRef]
  47. Dulá, J.H. A computational study of DEA with massive data sets. Comput. Oper. Res. 2008, 35, 1191–1203. [Google Scholar] [CrossRef]
  48. Zelenyuk, V. Aggregation of inputs and outputs prior to Data Envelopment Analysis under big data. Eur. J. Oper. Res. 2020, 282, 172–187. [Google Scholar] [CrossRef]
  49. Dulá, J.H. An Algorithm for Data Envelopment Analysis. INFORMS J. Comput. 2011, 23, 284–296. [Google Scholar] [CrossRef]
  50. Wilson, P.W. FEAR: A software package for frontier efficiency analysis with R. Socio-Econ. Plan. Sci. 2008, 42, 247–254. [Google Scholar] [CrossRef]
  51. Wilson, P.W. Asymptotic Properties of Some Non-Parametric Hyperbolic Efficiency Estimators. In Exploring Research Frontiers in Contemporary Statistics and Econometrics: A Festschrift for Léopold Simar; Van Keilegom, I., Wilson, P.W., Eds.; Physica-Verlag HD: Berlin/Heidelberg, Germany, 2011; pp. 115–150. [Google Scholar] [CrossRef]
  52. Bogetoft, P.; Otto, L. Benchmarking with DEA, SFA, and R; International Series in Operations Research & Management Science; Springer: New York, NY, USA, 2011. [Google Scholar] [CrossRef]
  53. Charnes, A.; Cooper, W.W.; Rhodes, E. Evaluating Program and Managerial Efficiency: An Application of Data Envelopment Analysis to Program Follow Through. Manag. Sci. 1981, 27, 668–697. [Google Scholar] [CrossRef]
  54. Khezrimotlagh, D.; Zhu, J. Data Envelopment Analysis and Big Data: Revisit with a Faster Method. In Data Science and Productivity Analytics; Charles, V., Aparicio, J., Zhu, J., Eds.; Springer International Publishing: Cham, Switzerland, 2020; pp. 1–34. [Google Scholar] [CrossRef]
  55. Kohl, S.; Brunner, J.O. Benchmarking the benchmarks–Comparing the accuracy of Data Envelopment Analysis models in constant returns to scale settings. Eur. J. Oper. Res. 2020, 285, 1042–1057. [Google Scholar] [CrossRef]
  56. Wimmer, S.; Finger, R. A note on synthetic data for replication purposes in agricultural economics. J. Agric. Econ. 2023, 74, 316–323. [Google Scholar] [CrossRef]
  57. Faisal, M.; Hutson, G.; Mohammed, M. Synthetic NEWS Data. 2022. Available online: https://nhs-r-community.github.io/NHSRdatasets/articles/synthetic_news_data.html (accessed on 21 June 2023).
  58. Krivonozhko, V.E.; Utkin, O.B.; Safin, M.M.; Lychev, A.V. On some generalization of the DEA models. J. Oper. Res. Soc. 2009, 60, 1518–1527. [Google Scholar] [CrossRef]
  59. Ali, A.I.; Seiford, L.M. Computational Accuracy and Infinitesimals In Data Envelopment Analysis. INFOR Inf. Syst. Oper. Res. 1993, 31, 290–297. [Google Scholar] [CrossRef]
  60. Podinovski, V.V.; Bouzdine-Chameeva, T. Solving DEA models in a single optimization stage: Can the non-Archimedean infinitesimal be replaced by a small finite epsilon? Eur. J. Oper. Res. 2017, 257, 412–419. [Google Scholar] [CrossRef]
  61. Charnes, A.; Cooper, W.W.; Thrall, R.M. A structure for classifying and characterizing efficiency and inefficiency in Data Envelopment Analysis. Oper. Res. Lett. 1991, 2, 197–237. [Google Scholar] [CrossRef]
  62. Charnes, A.; Cooper, W.; Golany, B.; Seiford, L.; Stutz, J. Foundations of data envelopment analysis for Pareto-Koopmans efficient empirical production functions. J. Econ. 1985, 30, 91–107. [Google Scholar] [CrossRef]
  63. Andersen, P.; Petersen, N.C. A Procedure for Ranking Efficient Units in Data Envelopment Analysis. Manag. Sci. 1993, 39, 1261–1264. [Google Scholar] [CrossRef]
  64. Førsund, F.R. Weight restrictions in DEA: Misplaced emphasis? J. Product. Anal. 2013, 40, 271–283. [Google Scholar] [CrossRef]
  65. Farrell, M.J. The measurement of productive efficiency. J. R. Stat. Soc. 1957, 120, 253–281. [Google Scholar] [CrossRef]
  66. Farrell, M.J.; Fieldhouse, M. Estimating efficient production functions under increasing returns to scale. J. R. Stat. Soc. 1962, 125, 252–267. [Google Scholar] [CrossRef]
  67. Thanassoulis, E.; Allen, R. Simulating weight restrictions in data envelopment analysis by means of unobserved DMUs. Manag. Sci. 1998, 44, 586–594. [Google Scholar] [CrossRef]
  68. Krivonozhko, V.E.; Førsund, F.R.; Lychev, A.V. Terminal units in DEA: Definition and determination. J. Prod. Anal. 2015, 43, 151–164. [Google Scholar] [CrossRef]
  69. Krivonozhko, V.E.; Førsund, F.R.; Lychev, A.V. On comparison of different sets of units used for improving the frontier in DEA models. Ann. Oper. Res. 2017, 250, 5–20. [Google Scholar] [CrossRef]
  70. Bougnol, M.L.; Dulá, J.H. Anchor points in DEA. Eur. J. Oper. Res. 2009, 192, 668–676. [Google Scholar] [CrossRef]
  71. Dulá, J.H.; Thrall, R.M. A Computational Framework for Accelerating DEA. J. Prod. Anal. 2001, 16, 63–78. [Google Scholar] [CrossRef]
  72. Bessent, A.; Bessent, W.; Elam, J.; Clark, T. Efficiency Frontier Determination by Constrained Facet Analysis. Oper. Res. 1988, 36, 785–796. [Google Scholar] [CrossRef]
  73. Lang, P.; Yolalan, O.R.; Kettani, O. Controlled Envelopment by Face Extension in DEA. J. Oper. Res. Soc. 1995, 46, 473–491. [Google Scholar] [CrossRef]
  74. Olesen, O.B.; Petersen, N.C. Indicators of Ill-Conditioned Data Sets and Model Misspecification in Data Envelopment Analysis: An Extended Facet Approach. Manag. Sci. 1996, 42, 205–219. [Google Scholar] [CrossRef]
  75. Rubin, D.B. The Bayesian Bootstrap. Ann. Stat. 1981, 9, 130–134. [Google Scholar] [CrossRef]
  76. Afanasiev, A.P.; Krivonozhko, V.E.; Lychev, A.V.; Sukhoroslov, O.V. Multidimensional frontier visualization based on optimization methods using parallel computations. J. Glob. Optim. 2020, 76, 563–574. [Google Scholar] [CrossRef]
  77. Koch, T.; Berthold, T.; Pedersen, J.; Vanaret, C. Progress in mathematical programming solvers from 2001 to 2020. EURO J. Comput. Optim. 2022, 10, 100031. [Google Scholar] [CrossRef]
Figure 1. Correlation between variables in the synthetic dataset generated with the Cobb–Douglas approach.
Figure 2. Correlation between variables in the real dataset [53].
Figure 3. Generation of artificial efficient units using the assurance region method. (a) Expansion of the production possibility set T_VRS as a result of incorporating weight restrictions; (b) inserting artificial DMUs obtained as projections onto the frontier of T_VRS^AR.
Figure 4. Illustration of generating artificial efficient units in the three-dimensional VRS model. (a) Production possibility set T_VRS before inserting an artificial efficient unit; (b) production possibility set T_VRS after inserting artificial efficient unit H.
Figure 5. Distribution of efficiency scores in the original datasets.
Figure 6. Distribution of efficiency scores in real and synthetic datasets in Case 1.
Figure 7. Correlation between variables in Case 1.
Table 1. Iterations of Algorithm 3 (all entries are numbers of DMUs).

Iteration | Initial | Efficient | Terminal | Generated | Total Artificial
1         | 200     | 28        | 28       | 229       | 229
2         | 429     | 194       | 194      | 1140      | 1369
3         | 1569    | 1122      | 1096     | 6452      | 7821
Table 2. Variance inflation factors of variables in Case 1.

Dataset   | x1     | x2     | x3     | y1    | y2    | y3
Original  | 127.99 | 323.18 | 397.85 | 54.90 | 40.04 | 100.83
Synthetic | 24.22  | 86.09  | 139.04 | 19.16 | 39.14 | 16.48
Table 3. Variance inflation factors of variables in Case 2.

Dataset   | x1    | x2     | x3    | x4    | x5   | y1    | y2    | y3
Original  | 11.77 | 198.11 | 83.28 | 57.58 | 1.73 | 69.32 | 55.32 | 186.80
Synthetic | 8.25  | 22.36  | 18.57 | 17.82 | 5.82 | 17.31 | 11.85 | 11.85
Table 4. Execution time for Algorithm 5 (in seconds).

Stage of Algorithm 5                              | Case 1 | Case 2
Finding initial weight restrictions (Algorithm 2) | 27     | 9
Generating efficient DMUs (Algorithm 3)           | 434    | 994
Generating inefficient DMUs (Algorithm 4)         | 603    | 422
Total                                             | 1064   | 1425