Combining Data Envelopment Analysis and Machine Learning

Abstract: Data Envelopment Analysis (DEA) is one of the most widely used non-parametric techniques for technical efficiency assessment. DEA is exclusively concerned with the minimization of the empirical error while satisfying, at the same time, some shape constraints (convexity and free disposability). Unfortunately, by construction, DEA is a descriptive methodology that is not concerned with preventing overfitting. In this paper, we introduce a new methodology that allows for estimating polyhedral technologies following the Structural Risk Minimization (SRM) principle. This technique is called Data Envelopment Analysis-based Machines (DEAM). Given that the new method controls the generalization error of the model, the corresponding estimate of the technology does not suffer from overfitting. Moreover, the notion of ε-insensitivity is also introduced, generating a new and more robust definition of technical efficiency. Additionally, we show that DEAM can be seen as a machine learning-type extension of DEA, satisfying the same microeconomic postulates except for minimal extrapolation. Finally, the performance of DEAM is evaluated through simulations. We conclude that the frontier estimator derived from DEAM is better than that associated with DEA: the bias and mean squared error obtained for DEAM are smaller in all the scenarios analyzed, regardless of the number of variables and DMUs.


Introduction
One of the most important issues in the field of statistical learning is the reliability of statistical inference methods. In this framework, a sophisticated theory, the so-called Generalization Theory, explains which factors must be controlled to achieve good generalization. Optimal generalization is achieved when the error produced on evaluating new data through a learned inference method is minimized. The Generalization Theory deals with the factors that allow for the minimization of this prediction or generalization error.
In terms of pattern classifiers, the generalization error is the probability of misclassifying a randomly chosen example, a probability that holds with high confidence over randomly chosen training sets; good generalization is achieved when this probability is minimized. This aim is possible if an upper bound of the generalization error is found and the parameters on which it depends are controlled in order to reduce it. These bounds are understood as Probably Approximately Correct (PAC) bounds, which specifically means that the probability of the bound failing is small (Probably) when the bound is achieved through a classifier that has a low error rate (Approximately Correct). The standard PAC learning model implements the idea of finding this classifier: it considers a fixed hypothesis (classifier) class together with a required accuracy and confidence, and takes into account the theory that characterizes when a function from this class can be learned from examples (training sample) in terms of a measure called the Vapnik-Chervonenkis dimension (VC dimension). In this paper, we estimate production technologies following the philosophy of Structural Risk Minimization, by analogy with SVM. Our modeling has several implications: (a) For the first time, a bound of the generalization error is implemented to determine the degree of technical inefficiency of a set of Decision Making Units (DMUs). (b) We implement the minimization of the balance between the generalization error and the empirical error through a quadratic optimization model that will be called Data Envelopment Analysis-based Machines (DEAM), which has DEA as a particular case. (c) Through a computational simulation experience, we show that DEAM outperforms DEA regarding bias and mean squared error. (d) We estimate production technologies using robust regression models that employ the concept of margin.
As a result, the problem of efficiency measurement becomes a classification problem: to be efficient (being located within the margin) or not to be efficient (being located outside the margin).
Finally, we mention that the expected new insights gained by applying our approach (DEAM) are related to the determination of better estimates of production functions in engineering and microeconomics, in terms of bias and mean squared error. Additionally, these gains will also benefit the technical efficiency measures that can be derived from calculating the distance from a given observation to the production function estimate.
The rest of the paper is organized as follows. Section 2 provides the basic background. Section 3 introduces a new PAC bound for the class of piece-wise linear functions. In Section 4, a new approach called Data Envelopment Analysis-based Machines (DEAM) is defined and analyzed. Section 5 shows the main results of a computational experience comparing the new approach with DEA. Section 6 contains a discussion of the main results. Finally, the article ends with the conclusions section.

Background
In this section, we briefly introduce elemental notions of Support Vector Regression, Statistical Learning and Data Envelopment Analysis.

Support Vector Regression (SVR)
Machine learning (ML) is a methodology that studies computer processes that learn from experience and improve automatically. ML works with computer algorithms that, based on a learning sample (training data), can make predictions about the behavior of future data. This behavior is studied in two different scenarios: supervised learning, in which the training data are vectors of predictors and responses, and unsupervised learning, where no responses are considered in the data sample. In the first setting, the objective of learning techniques is to determine the functional relationship between the predictors and the responses. In this case, the nature of the responses, whether they come from a binary variable or are real values, determines the kind of problem to solve: a classification problem or a regression problem, respectively. In the second setting, since there are no responses, the objective is to gain knowledge about the processes lying behind data generation, such as density estimation or clustering. Our paper largely focuses on the regression problem within supervised learning, bearing in mind that our data comprise inputs utilized by firms to produce outputs (real values).

Support Vector Machines (SVM) (Vapnik [1,16]) is a technique that stands out in ML in the world of supervised learning. SVM is an algorithm constructed on the foundations of statistical learning theory and in line with the Structural Risk Minimization (SRM) method. SRM is implemented to construct support vector machines with the objective of controlling both the empirical risk and the VC dimension; the latter enters through the regularization term that appears when the generalization error, rather than only the empirical error, must be minimized (Vapnik [1,16]). In particular, the notion of the VC dimension is defined as follows:

Definition 1 (VC dimension).
Let H be a set of binary-valued functions. A set of points X is shattered by H if, for every binary vector b indexed by X, there is a function f_b ∈ H realizing b on X (i.e., f_b(x) = b_x for all x ∈ X). The VC dimension, VC dim(H), of the set H is the size of the largest shattered set.
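To make Definition 1 concrete, the following sketch (not part of the original exposition) brute-forces which labelings the class of one-dimensional threshold classifiers h_t(x) = 1 if x ≥ t can realize on a point set; the specific points and thresholds are arbitrary illustrative choices.

```python
def shattered(points, thresholds):
    """Check whether `points` is shattered by the threshold classifiers
    h_t(x) = 1 if x >= t else 0, with t ranging over `thresholds`:
    every binary labeling of `points` must be realized by some h_t."""
    realized = {tuple(int(x >= t) for x in points) for t in thresholds}
    return len(realized) == 2 ** len(points)

# In one dimension, thresholds straddling the points suffice for the test.
print(shattered([0.5], [0.0, 0.4, 0.6, 1.0]))       # True: a single point is shattered
print(shattered([0.3, 0.7], [0.0, 0.4, 0.6, 1.0]))  # False: labeling (1, 0) is unrealizable
```

Since every single point is shattered but no two-point set is, the VC dimension of this threshold class is 1; replacing the plain inequality test with a margin condition turns this into the γ-shattering check of Definition 2 below.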
With regard to the classification problem in SVM (for the regression problem, the generalization error is defined in the same way, because the regression problem can be turned into a classification problem, as we explain below), minimizing the generalization error consists of minimizing the probability of incorrectly classifying any new data arising from the unknown distribution that generated the learning sample. This aim is possible if a bound of the generalization error is found and the parameters on which it depends are controlled to reduce the bound. These bounds are understood as Probably Approximately Correct (PAC) bounds, which were first proposed by Valiant [17]. Standard PAC learning implements the idea of finding this classifier: it considers a fixed hypothesis (classifier) class together with a required accuracy and confidence, and takes into account the theory that characterizes when a function from this class can be learned from examples (data). In the case of regression, the exercise involves converting the regression problem (function estimation) into a classification problem, because bounds on the generalization error are based precisely on the VC dimension or, when a margin is considered, on the fat-shattering dimension (effective VC dimension).
Next, we show the definition of the fat-shattering dimension. Notice that bold will be utilized for denoting vectors, and non-bold for scalars.
Definition 2 (fat-shattering dimension). Let F be a set of real-valued functions. A set of points X is γ-shattered by F if there are real numbers r_x indexed by x ∈ X such that, for all binary vectors b indexed by X, there is a function f_b ∈ F satisfying f_b(x) ≥ r_x + γ if b_x = 1 and f_b(x) ≤ r_x − γ if b_x = 0. The fat-shattering dimension of the set F, fat_F, is a function from the positive real numbers to the integers that maps a value γ to the size of the largest γ-shattered set. The VC dimension corresponds to the largest shattered set with γ = 0, which is the concept first used by Vapnik to state a bound for the generalization error. This is the reason why the fat-shattering dimension is also known as the effective VC dimension.
To convert the regression problem into a classification problem, a threshold θ > 0 that marks the limit needs to be set, such that a mistake is considered to have been made whenever the loss function exceeds it when testing the model on new data. The loss function is the function that determines the distance between the real value of the output and the value of said output estimated by the model. Given a margin γ > 0, a training point is considered a mistake if the loss function exceeds the value (θ − γ). Then, γ measures the discrepancy between the two losses: those measured on test data and those measured on training data. Under this re-interpretation of the regression problem, it is possible to use the dimension-free bounds already constructed for the classification case. Here, we focus on the bound obtained by Shawe-Taylor and Cristianini [18], based on the fat-shattering dimension.
Theorem 1 (Shawe-Taylor and Cristianini [18]). Let F be a sturdy class of real-valued functions with range [−a, a] and fat-shattering dimension bounded by fat_F(γ). Fix θ ∈ R with θ > 0, and a scaling of the output range κ ∈ R+. Consider a fixed but unknown probability distribution on the space X × R. Then, with probability 1 − ρ over randomly drawn training sets S of size m, for all γ with θ ≥ γ > 0, the probability that a training-set-filtered function f ∈ F has an error larger than θ on a randomly chosen input is bounded by (1), where c = max{a, D(S, f, γ) + κ}, provided m ≥ 2/ε.
In the statement of Theorem 1, e(f) is the loss function that the analyst selects in order to measure how much f exceeds the error margin (θ − γ). In addition, the theorem introduces the fat-shattering dimension, fat_F(γ), i.e., the generalization of the VC dimension that is sensitive to the size of the margin γ. Theorem 1 is a general result that must be particularized for each function class F: for each class, the fat-shattering dimension is bounded in a different way, and consequently, the same happens with respect to the expected error proposed in (1). In the case of linear function classes, the fat-shattering dimension was bounded by Bartlett and Shawe-Taylor [19].

Theorem 2 (Bartlett and Shawe-Taylor [19]).
Suppose that X is a ball of radius r and center 0_m in R^m, i.e., X = {x ∈ R^m : ‖x‖ ≤ r}, and consider the set F = {x → w · x : ‖w‖ ≤ 1, x ∈ X}. Then, fat_F(γ) ≤ (r/γ)². The most general version of this theorem, in which ‖w‖ is not restricted to be at most 1, bounds the fat-shattering dimension of linear classifiers as fat_F(γ) ≤ (‖w‖r/γ)². The following two previously published lemmas are significant for our purposes throughout this paper.

Lemma 1.
For every input set S γ-shattered by F = {x → w · x : x ∈ X} (the linear hypothesis class) and for every subset S_0 ⊆ S, it holds that γ|S| ≤ ‖w‖ · ‖∑S_0 − ∑(S − S_0)‖. In particular, |S|γ/‖w‖ ≤ √|S| · r, and it is possible to conclude that all sets of inputs S γ-shattered by F are bounded in cardinality. Therefore, the γ-shattered set of highest cardinality is also bounded, and its size is the fat-shattering dimension: fat_F(γ) ≤ (‖w‖r/γ)².

Now, if γ is fixed in such a way that θ ≥ γ > 0, and disregarding the logarithmic factors in (1), the only term to reduce in the expected error is (2). This can be done by minimizing its bound, which in the case of linear functions is (‖w‖r/γ)². This expected error bound meets the SRM objective: the minimization process goes beyond minimizing the empirical risk; it also improves the capacity of the estimation function to provide a suitable prediction when a new observation (out of sample) is introduced, which is captured by the regularization term ‖w‖², the term bounding the fat-shattering dimension (PAC bound). The minimization of this bound corresponds to the objective of the regression problem associated with Support Vector Regression (SVR).

Support Vector Regression, as with any regression approach, attempts to construct a function capable of predicting the behavior of the response variable under study. SVR sets out to predict the value of a continuous response variable y ∈ R+ given a vector of covariables x ∈ R^m_+. Hence, SVR establishes a function f̂ : R^m_+ → R such that, given x, f̂(x) yields the response variable prediction. Under the SVR principle, the linear predictor f̂ can be defined as f̂(x) = w* · x + b*, where w* ∈ R^m and b* ∈ R are optimal solutions of the optimization model below:

min_{w,b,ξ,ξ̂}  ‖w‖² + C ∑_{i=1}^{n} (ξ_i² + ξ̂_i²)
s.t.  y_i − w · x_i − b ≤ ε + ξ_i,  i = 1, . . . , n,
      w · x_i + b − y_i ≤ ε + ξ̂_i,  i = 1, . . . , n,
      ξ_i, ξ̂_i ≥ 0,  i = 1, . . . , n.

In performing this methodology, the values of C ∈ R+ and ε ∈ R+ are obtained by a cross-validation process.
The SVR yields an estimator f̂(x) of the response variable given x, as well as lower and upper 'correcting' surfaces, defined as f̂(x) − ε and f̂(x) + ε, where ε is a margin that endows the SVR estimator with robustness (see Figure 1). Additionally, observations below the surface f̂(x) − ε reveal an associated (empirical) error of ξ̂_i > 0 (with ξ_i = 0), while observations above the surface f̂(x) + ε present an (empirical) error of ξ_i > 0 (with ξ̂_i = 0). Observations between the surfaces f̂(x) − ε and f̂(x) + ε reveal an error of zero (with ξ_i = ξ̂_i = 0). The objective function represents the combination of regression and regularization involved in SVR, combining the empirical error term ∑_{i=1}^{n} (ξ_i² + ξ̂_i²) and the regularization term ‖w‖² through a weight C, thus balancing both components (Vazquez and Walter [20]). Moreover, although hyperplanes are linear in shape, it must be highlighted that SVR is able to generate estimation functions that are not necessarily linear in the original (x, y) space; this can be achieved by using a transformation function φ : R^m_+ → Z arising from the covariable space. Figure 1 shows the solution of the linear estimator achieved by an SVR model, as well as the graphical representation of the residuals (empirical error) for two points and the hyperplanes that define the margins (dashed lines). The next subsection explains how Data Envelopment Analysis (DEA) works.
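As a rough numerical illustration of the SVR formulation above (a sketch only: it uses the squared-slack empirical error described in the text, synthetic data, arbitrary C and ε values, and a generic smooth optimizer instead of a dedicated QP solver):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.uniform(1, 10, size=(30, 1))               # covariables
y = 2.0 * x[:, 0] + 1.0 + rng.normal(0, 0.2, 30)   # noisy linear response

C, eps = 10.0, 0.1                                 # illustrative hyperparameters

def objective(p):
    w, b = p[:-1], p[-1]
    # eps-insensitive slacks: residuals inside the tube cost nothing
    slack = np.maximum(0.0, np.abs(y - x @ w - b) - eps)
    return w @ w + C * np.sum(slack ** 2)          # ||w||^2 + C * sum of squared slacks

res = minimize(objective, x0=np.zeros(2), method="BFGS")
w_star, b_star = res.x[:-1], res.x[-1]             # estimated slope and intercept
```

Observations whose residual stays within the ±ε tube contribute no empirical error, mirroring the margin interpretation in Figure 1.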

Data Envelopment Analysis (DEA)
Let us consider the observation of n Decision Making Units (DMUs). DMU_i consumes amounts x_i ∈ R^m_+ of inputs to produce amounts y_i ∈ R^s_+ of outputs. The relative efficiency of each unit in the sample is evaluated by referring to the so-called production possibility set or technology, which is essentially the set of producible bundles of (x, y). It is generally defined as T = {(x, y) ∈ R^{m+s}_+ : x can produce y}. Under Data Envelopment Analysis (DEA) (Charnes et al. [5], Banker et al. [6], and, more recently, Villa et al. [21], Sahoo et al. [22], and Amirteimoori [23]), T is usually assumed to satisfy free disposability with regard to inputs and outputs; that is, if (x, y) ∈ T, then (x′, y′) ∈ T for all x′ ≥ x and y′ ≤ y. Convexity of T is also generally assumed (see, e.g., Färe and Primont [24]).
Insomuch as the measurement of technical efficiency is concerned, a certain subset of T is of interest. We allude to the weakly efficient set of T, defined as ∂_W(T) = {(x, y) ∈ T : (x′, y′) ∉ T, ∀(x′, y′) with x′ < x, y′ > y}. (For vectors z, t ∈ R^q, z < t means z^(j) < t^(j) for all j = 1, . . . , q.) Some authors (see, for example, Briec and Lesourd [25]) define technical efficiency as the distance from a point in T to the weakly efficient set. When s = 1, this context is confined to the central concept of the production function f. Accordingly, m input variables are used to yield a univariate output, and hence, we can define the technology as T = {(x, y) ∈ R^{m+1}_+ : y ≤ f(x)}. According to the distance selected for measuring technical inefficiency, different DEA models emerge (Cooper et al. [26]). The directional distance function (DDF) is a relevant example. For m inputs and one output, resorting to the directional vector g = (g−, g+), where g− = 1_m and g+ = 1, the DDF problem has the following structure when the efficiency level of DMU_i is assessed, i = 1, . . . , n:

max β
s.t. ∑_{j=1}^{n} λ_j x_j ≤ x_i − β g−,
     ∑_{j=1}^{n} λ_j y_j ≥ y_i + β g+,
     ∑_{j=1}^{n} λ_j = 1, λ_j ≥ 0, j = 1, . . . , n.   (6)

Given that (6) is a linear program, we can equivalently solve its corresponding dual formulation:

min v · x_i − u y_i + ω
s.t. v · x_j − u y_j + ω ≥ 0, j = 1, . . . , n,
     v · g− + u g+ = 1,
     v ≥ 0_m, u ≥ 0.   (7)

DEA models must be solved for each DMU_i, i = 1, . . . , n, in the sample. Figure 2 shows an example of the DDF model with a directional vector g = (g−, g+) = (1_m, 1). Note that DEA generates a piece-wise linear technology (the region below the line), satisfying free disposability in inputs and outputs and convexity. Note also that the DEA estimate envelops all the observations from above. In this case, with g = (1_m, 1), the DDF coincides with a particular distance between the data and ∂_W(T): the l∞-distance (Briec [27] and Briec and Lesourd [25]). In this paper, our purpose is to construct a method that generates piece-wise linear frontiers as in Figure 2 by minimizing the generalization error of the model.
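The DDF program (6) is an ordinary linear program, so it can be computed with any LP solver. A minimal sketch under variable returns to scale (the data, function name, and interface are ours, for illustration only):

```python
import numpy as np
from scipy.optimize import linprog

def ddf_score(X, y, i, g_in=None, g_out=1.0):
    """Directional distance function of DMU i relative to a convex,
    freely disposable technology built from the n observed DMUs.
    Decision variables: (beta, lambda_1, ..., lambda_n); maximize beta."""
    n, m = X.shape
    g_in = np.ones(m) if g_in is None else g_in
    c = np.zeros(n + 1)
    c[0] = -1.0                                   # linprog minimizes, so use -beta
    # Inputs:  X^T lambda + beta * g_in <= x_i
    # Output: -y^T lambda + beta * g_out <= -y_i  (i.e., y^T lambda >= y_i + beta*g_out)
    A_ub = np.vstack([np.column_stack([g_in, X.T]),
                      np.concatenate([[g_out], -y])])
    b_ub = np.concatenate([X[i], [-y[i]]])
    A_eq = np.concatenate([[0.0], np.ones(n)]).reshape(1, -1)  # sum lambda = 1 (VRS)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(None, None)] + [(0, None)] * n)
    return res.x[0]

# Three DMUs, one input, one output: (2,2) and (4,4) are efficient, (6,3) is not.
X = np.array([[2.0], [4.0], [6.0]])
y = np.array([2.0, 4.0, 3.0])
print(ddf_score(X, y, 2))  # 1.0: output can rise by one unit of g+ while input falls
print(ddf_score(X, y, 1))  # 0.0: efficient DMU
```

The model is re-solved once per DMU, as the text notes; with g = (1_m, 1) the optimal β is the l∞-distance to the weakly efficient frontier.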

New PAC Learning with Piece-Wise Linear Hypothesis
This section revolves around two stages in the search for the generalization error bound: the first is the construction of the class of piece-wise linear hypotheses, whose elements are hyperplanes located as close as possible to the data sample through the l∞-distance, and the second is the construction of a bound on the fat-shattering dimension of the class constructed in the first stage. The minimization of the bound of the expected error, using the fat-shattering dimension bound so calculated, gives rise to the Data Envelopment Analysis-based Machines (DEAM) model as a method for estimating piece-wise linear production functions that minimizes the generalization error as well as the empirical error.
To obtain this bound for the class of functions of our interest, we must derive the fat-shattering dimension bound for the hypothesis class with the desired piece-wise structure. Then, the minimization of the generalization error will be implemented through the minimization of the fat-shattering dimension bound. For this task, a previous step must be taken: a class of piece-wise linear hypotheses must be defined. A piece-wise linear hypothesis is defined by a combination of n hyperplanes {H_p}_{p=1,...,n} that are selected to evaluate the data depending on their input values. The hyperplanes are defined, for each input value x ∈ R^m_+, as H_p = {(x, y) ∈ R^{m+1} : w_p · x + β_p − δ_p y = 0}. Then, if we suppose δ_{p_x} > 0, ∀p_x ∈ {1, . . . , n}, each output value estimation through the set of n hyperplanes {H_p}_{p=1,...,n} can be written as a function of the input value vector x: h(x) = (w_{p_x} · x + β_{p_x})/δ_{p_x}, with w_{p_x} ∈ R^m_+ and β_{p_x} ∈ R. The value of p_x ∈ {1, . . . , n}, in our case, is chosen by considering two desired conditions inherited from production theory:

w_{p_x} · x + β_{p_x} ≥ 0   (8)

and

(w_{p_x} · x + β_{p_x})/δ_{p_x} ≤ (w_p · x + β_p)/δ_p, ∀p ∈ {1, . . . , n}.   (9)

Condition (8) ensures that the estimation of the output value associated with an input x ∈ R^m_+ is always non-negative. Additionally, condition (9) guarantees that the estimation h(x) through the hyperplane H_{p_x} is less than or equal to the estimation through any other hyperplane H_p. Condition (9) is the one that imposes concavity on the model. This type of condition was the key to stating concavity in general multiple-regressor modeling in microeconomics (Afriat [28]; Kuosmanen et al. [13]). In particular, if the production function is concave, then the technology defined from this production function is convex.
The function class of piece-wise linear hypotheses can be constructed as F = {h : R^m_+ → R+ : h defined from n hyperplanes {H_p}_{p=1,...,n} satisfying (8) and (9)}. (10) Now, we can proceed with the second step: establishing a bound on the fat-shattering dimension of this function class in order to control the generalization error. Before proving the main theorem of this section, we need to state a necessary technical lemma. In what follows, r ∈ R+ is the radius of the ball centered at 0_m that bounds the input data in the data sample.

Lemma 3.
If an input learning sample S = {x_1, . . . , x_d} is γ-shattered through F defined in (10), then every subset S_0 ⊆ S satisfies the corresponding inequality, where ∑S_0 and ∑(S − S_0) denote the sum of the elements in S_0 and S − S_0, respectively, and |S| denotes the cardinality of the set S.

Proof. See Appendix A.
Next, we prove the main theorem of this section. In particular, we state the bound for the fat-shattering dimension for piece-wise linear hypothesis classes.
Theorem 3. Let X be the ball of radius r and center 0_m in R^m, i.e., X = {x ∈ R^m : ‖x‖ ≤ r}, and let the hypothesis class F be as defined in (10). Proof. See Appendix A.
The next section involves the task of achieving a model that minimizes the established generalization error through the l ∞ -distance.

Data Envelopment Analysis-Based Machines (DEAM)
Data Envelopment Analysis-based Machines (DEAM) can be defined from the idea of minimizing the expected error proposed in (1). If we do not consider the logarithmic factors, we can directly focus on minimizing the term for which a bound on the fat-shattering dimension has been found in the case of the piece-wise linear hypothesis class F defined in (12). Because of the complexity of implementing an optimization model whose objective function minimizes this bound directly, we break up the minimization of the whole bound into different objectives, which are collected in an aggregation function that forms the objective function of the final optimization program associated with DEAM, shown later in this section.
Once the number of different hyperplanes in each hypothesis is set to the number of elements in the learning sample, |S| = n, minimizing the bound of the fat-shattering dimension requires minimizing part A in (14), which is equivalent to maximizing its reciprocal. Regarding this last expression, we must maximize the numerator and minimize the denominator, as follows: (i) The vector of coefficients (slopes) corresponding to the hyperplane H_p is (w_p, δ_p).
We can consider, without loss of generality, that ‖(w_p, δ_p)‖_1 = 1. Finally, a way of implementing (i) and (ii) is to minimize u + v, where ‖w_p‖_1 ≤ u and |β_p| ≤ v, ∀p ∈ {1, . . . , n}. Accordingly, minimizing the bound of the fat-shattering dimension, A + B, leads to minimizing (u + v) + CD², where D² = D(S, f, γ)² = ‖ξ‖²_2 and C is a parameter to be tuned by, for example, a cross-validation process. As a loss function, we use ξ((x, y), f, γ) = max{0, D_{l∞}((x, y), f) − (θ − γ)}. Finally, the objective function has the structure of (15). Accordingly, we introduce the optimization model (16) that defines DEAM. Model (16) determines a maximum of n different hyperplanes. The intersection of the half-spaces defined from these hyperplanes gives rise to the estimator of the underlying (convex) production technology. The number of hyperplanes considered in the implementation of the DEAM model can be seen as a key parameter of our approach, since the results could differ depending on it. However, we suggest using n hyperplanes, coinciding with the number of DMUs. This is due to the experimental evidence found in the simulation study carried out in Section 5: we analyzed 2000 databases, and in all cases, the number of hyperplanes at the optimum was less than the number of DMUs in the corresponding data sample. This situation can be identified because some hyperplanes are repeated at the optimal solution of each problem.
Let us now explain each constraint of model (16) in detail. Constraints (16.1) and (16.2) come from ‖w_p‖_1 ≤ u and |β_p| ≤ v, ∀p = 1, . . . , n, respectively. The l_1 norm is used to be consistent with constraint (16.8); additionally, this type of norm leads to linear constraints, which are easier to solve from a computational point of view. Constraint (16.3) requires that w_p · x_i + β_p − δ_p y_i ≥ 0, ∀i, p = 1, . . . , n, i.e., it ensures that the hyperplanes envelop the data sample from above. Condition (16.4) forces the n hyperplanes to be monotonically non-decreasing and is responsible for the satisfaction of the property of free disposability, as we show later in the text (see Proposition 2 below). Constraints (16.5), (16.6), (16.7), and (16.8) allow for characterizing ξ((x, y), f, γ). The parameter ε (= θ − γ ≥ 0) will be chosen by cross-validation. Let us now interpret the value at the optimum of the decision variable ξ_i ; to this end, let us pay attention to constraint (16.5).
The slack ξ_i is minimized in the objective function. In this way, considering (16.8), (16.3), and ε ≥ 0, ξ_i can be interpreted as the l∞-distance from the observation (x_i, y_i) to the hyperplane H_iε, where H_iε = {(x, y) ∈ R^{m+1} : w_i · x + β_i − δ_i y − ε = 0} (Mangasarian [29]). Finally, constraint (16.9) guarantees that, for each (x_i, y_i) in the data sample, the hyperplane of the piece-wise linear production function associated with that point is the closest one to (x_i, y_i). Note that constraint (16.9), by (16.3) and (16.8), is equivalent to writing D_{l∞}((x_i, y_i), H_i) ≤ D_{l∞}((x_i, y_i), H_p), ∀i, p = 1, . . . , n (see Mangasarian [29]). Figure 3 shows the shape of the function generated by the model as an estimate of the underlying production function. Note that the estimate satisfies monotonicity and concavity, as happens with the DEA estimator; however, the DEAM estimator does not satisfy minimal extrapolation. Additionally, it implements a certain idea of robustness because of the margin notion inherited from SVR. Figure 3 also shows the interpretation of ξ_i as D_{l∞}((x_i, y_i), H_iε); in particular, ξ_i is the 'radius' of the squared ball in the figure.
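The l∞-distance to a hyperplane used here has a closed form: by Mangasarian's result, the p-norm distance from a point to a hyperplane is the absolute residual divided by the dual norm of the coefficient vector, so the l∞ case divides by the l_1 norm. A small sketch (the function name is ours):

```python
import numpy as np

def linf_distance(z, w, b):
    """l_inf distance from point z to the hyperplane {x : w @ x + b = 0}:
    |w @ z + b| / ||w||_1, since the l_1 norm is the dual of the l_inf norm."""
    return abs(w @ z + b) / np.sum(np.abs(w))

# Distance from the origin to the line x1 + x2 = 2, measured in the l_inf norm:
print(linf_distance(np.zeros(2), np.ones(2), -2.0))  # 1.0
```

This is also why the l_1 norm appears in constraints (16.1) and (16.2): it is the dual norm matching the l∞-distance interpretation of the slacks.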
The next propositions state that the derived technology from model (16) satisfies convexity and free disposability.

Proposition 1. T DEAM is a convex set.
Proof. The intersection of half-spaces is a convex set.
Proposition 3. The technology generated by DEA is contained in T_DEAM. Proof. The result holds because the production possibility set generated by DEA and the production possibility set yielded by DEAM both satisfy convexity and free disposability and contain all observations, but only the technology related to DEA meets minimal extrapolation.
In this way, we have that DEAM does not satisfy the minimal extrapolation principle, but its associated estimation of the technology always contains the observations.
As for the measurement of the technical inefficiency of the observations, due to the nature of the technique used and based on the original ideas derived from Support Vector Regression, any (x_i, y_i) located within the margin will be identified as technically efficient (with ξ*_i = 0). Otherwise, i.e., if (x_i, y_i) is located below the margin (see Figure 3), ξ*_i is the l∞-distance from the observation to the (efficient) frontier of a 'robust' technology. This robust technology is defined by translating the original technology T_DEAM downward by the value of the margin ε; we denote this translated technology by T^ε_DEAM ⊆ R^{m+1} (this result can be derived from Aparicio and Pastor [30]). Now, we show the relationship between the Directional Distance Function (DDF) model in DEA, model (7), and the DEAM model (16): the DDF model always yields a feasible solution of the model associated with Data Envelopment Analysis-based Machines.
However, it can be shown that the DDF model (7) does not always yield an optimal solution of model (16).

Computational Experience
This section compares the performance of DEA and DEAM for estimating production functions. For this task, we designed five typical production scenarios, summarized in Table 1.

Table 1 reports, for each scenario, the inputs considered and the Cobb-Douglas production function employed. The simulations implement Cobb-Douglas production functions, which are frequently used in econometrics for establishing the relation between the maximum amount of output that can be produced and a set of inputs. Thereby, scenario I implements a mono-input mono-output case, while the other scenarios represent multi-input mono-output cases. For each scenario, we ran 100 trials (t = 1, . . . , 100) with sample sizes n ∈ {25, 50, 75, 100}. The inputs were drawn randomly from Uni[1, 10]. For simulating inefficiencies, we drew u from an exp(1/3) distribution. Mean squared error (MSE) and bias were the two measures employed to assess the performance of each method.
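A sketch of one simulated trial of the kind just described. The per-scenario Cobb-Douglas coefficients of Table 1 are not reproduced here, so equal exponents summing to one are an assumption of this sketch, as is reading exp(1/3) as an exponential distribution with mean 1/3:

```python
import numpy as np

def simulate_trial(n, m, rng):
    """Generate one trial: n DMUs, m inputs drawn from Uni[1, 10],
    observed output = Cobb-Douglas frontier shrunk by an inefficiency term u."""
    X = rng.uniform(1, 10, size=(n, m))
    frontier = np.prod(X ** (1.0 / m), axis=1)  # assumed equal exponents (sum to 1)
    u = rng.exponential(1 / 3, size=n)          # inefficiency draw, mean 1/3 (assumption)
    y = frontier * np.exp(-u)                   # observed output lies below the frontier
    return X, y, frontier

X, y, f = simulate_trial(50, 2, np.random.default_rng(1))
```

Bias and MSE for each estimator are then computed against the known frontier values f, trial by trial.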
The DEAM model (16), like other machine learning techniques, needs to find the best model through a cross-validation process. For this task, and exclusively for the DEAM model, we implemented a five-fold cross-validation over a certain grid of hyperparameters. This grid was arbitrarily set as C ∈ {1, 10, 50, 100, 10⁶} and ε ∈ {0, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 1}. Note that DEA does not need a cross-validation process; instead, DEA uses the whole dataset to evaluate efficiency scores. Table 2 sums up the results obtained for each scenario when DEA (without cross-validation) and DEAM (with cross-validation) are applied. The first two columns present the type of scenario and the sample size. The following columns show the mean and standard deviation (in brackets) of the MSE obtained by DEA and DEAM. 'Fraction of trials' reports the proportion of trials in which DEAM either improves upon or equals the MSE given by the DEA method, while the next column shows the percentage improvement of DEAM with respect to DEA. The four subsequent columns are analogous, but with regard to bias.
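The five-fold grid search can be sketched generically as follows; `fit` and `score` are hypothetical hooks standing in for the DEAM solver (16) and its held-out error, which are not reproduced here:

```python
import numpy as np
from itertools import product

C_GRID = [1, 10, 50, 100, 10**6]
EPS_GRID = [0, 0.001, 0.01, 0.1, 0.2, 0.3, 0.4, 0.5, 1]

def five_fold_cv(X, y, fit, score, k=5, rng=None):
    """Pick (C, eps) minimizing the mean out-of-fold error.
    `fit(X, y, C, eps)` returns a model; `score(model, X, y)` returns its error."""
    rng = np.random.default_rng(0) if rng is None else rng
    folds = np.array_split(rng.permutation(len(y)), k)
    best, best_err = None, np.inf
    for C, eps in product(C_GRID, EPS_GRID):
        errs = []
        for f in range(k):
            train = np.concatenate([folds[j] for j in range(k) if j != f])
            model = fit(X[train], y[train], C, eps)
            errs.append(score(model, X[folds[f]], y[folds[f]]))
        if np.mean(errs) < best_err:
            best, best_err = (C, eps), float(np.mean(errs))
    return best
```

The winning pair is then used to refit the model on the full sample, which is the usual cross-validation workflow; DEA skips this step entirely, as noted above.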
Regarding the results, the DEAM method performed better than DEA, with improvements ranging from 5% to 45% on average in MSE and from 2% to 28% in bias. These improvements grew as the number of inputs increased. In addition, the results illustrate that the model worked better when the number of DMUs was around 50-75. Scenario I, i.e., the single-input single-output framework, shows small differences between the two methods; nevertheless, DEAM outperformed DEA in more than 95% of the trials. In contrast, the best-performing situation was scenario V (one output and five inputs) with n = 25, showing a 45% reduction in MSE and a 28% reduction in bias, on average. This last result can be interpreted in favor of DEAM as an indication that it also outperforms DEA with respect to the curse of dimensionality (Charles et al. [31]).

Discussion
In this section, we briefly discuss the main results of this paper and how they can be interpreted from the perspective of previous studies, mainly those based on Data Envelopment Analysis. Our findings and their implications are also discussed. Some limitations of our approach are highlighted.
In this paper, we have introduced a new way of estimating production frontiers in engineering and microeconomics, which is based upon the same fundamentals of Support Vector Machines (SVM), which is a well-known machine learning technique. Our numerical results have demonstrated that the frontier estimator derived from the new methodology (DEAM) is better than that associated with Data Envelopment Analysis (DEA), which represents the standard non-parametric technique for determining technical efficiency in the literature. The bias and mean squared error obtained for DEAM are smaller in all the scenarios analyzed, regardless of the number of variables and DMUs.
In comparison with the standard literature, the new methodology is more flexible. It generates production possibility sets that satisfy convexity and free disposability in inputs and outputs, and that contain all the observations, but that do not meet the postulate of minimal extrapolation. In contrast, DEA satisfies all the above properties. In particular, minimal extrapolation is the reason why DEA can be seen as an overfitted model for estimating the underlying Data Generating Process (DGP) behind the data sample. DEAM does not suffer from this overfitting problem. However, it is not evident where the production possibility set estimated by a non-overfitted model should be located in the input-output space to correctly approximate the underlying technology, which, by definition, is unknown to us. In this regard, we have implemented for the first time a strategy based on the idea of Structural Risk Minimization (Vapnik [1]) and cross-validation, introducing a new PAC (Probably Approximately Correct) bound in production theory with the aim of solving the overfitting problem linked to DEA. Other authors have tried to modify the standard DEA technique so that the new approaches work as inferential methods (with the focus on the DGP) rather than as mere descriptive tools. For example, Banker and Maindiratta [8] and Banker [9] associated DEA with maximum likelihood. Simar and Wilson [10][11][12] adapted bootstrapping to DEA. Kuosmanen and Johnson [13,14] introduced the Corrected Concave Nonparametric Least Squares. Unfortunately, despite the importance of machine learning techniques in the current literature, there have been few attempts to adapt DEA to the field of machine learning (see, for example, Esteve et al. [7], or Olesen and Ruggiero [15]). In this sense, DEAM has allowed us to build a new bridge between these two worlds: machine learning and efficiency measurement.
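As a concrete reference point for the standard technique discussed above, the classical input-oriented DEA envelopment model under variable returns to scale (the BCC model) can be solved with an off-the-shelf linear programming routine. This is a minimal sketch of conventional DEA, not of DEAM itself; variable names are illustrative.

```python
import numpy as np
from scipy.optimize import linprog

def dea_bcc_input(X, Y, j0):
    """Input-oriented BCC (VRS) efficiency score of unit j0.

    X: (n, m) array of inputs, Y: (n, s) array of outputs.
    Solves  min theta  s.t.  X'lam <= theta * x_j0,  Y'lam >= y_j0,
    sum(lam) = 1, lam >= 0.  Returns theta in (0, 1].
    """
    n, m = X.shape
    s = Y.shape[1]
    # decision variables: [theta, lambda_1, ..., lambda_n]
    c = np.r_[1.0, np.zeros(n)]
    # input constraints: -x_j0 * theta + X^T lambda <= 0
    A_in = np.hstack([-X[j0].reshape(m, 1), X.T])
    # output constraints: -Y^T lambda <= -y_j0
    A_out = np.hstack([np.zeros((s, 1)), -Y.T])
    A_ub = np.vstack([A_in, A_out])
    b_ub = np.r_[np.zeros(m), -Y[j0]]
    # convexity (VRS): sum of lambdas equals 1
    A_eq = np.r_[0.0, np.ones(n)].reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * (n + 1))
    return float(res.fun)
```

The convexity constraint together with the inclusion of all observations makes the estimated set the smallest one satisfying the axioms (minimal extrapolation), which is exactly the feature DEAM relaxes to avoid overfitting.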
Finally, we would like to highlight a clear limitation associated with the new approach. DEAM is linked to an intensive computational procedure based on cross validation. This feature contrasts sharply with the simplicity of Data Envelopment Analysis.

Conclusions and Future Work
In this paper, for the first time, a bound on the generalization error for a piecewise linear hypothesis has been established in the context of Support Vector Regression (SVR), also considering typical axioms from production theory: convexity and free disposability. This shapes a new nexus between non-parametric frontier analysis and machine learning, in line with the recent work of Esteve et al. [7], Valero-Carreras et al. [32], and Olesen and Ruggiero [15]. The new bound on the generalization error of this kind of hypothesis gives rise to a new way of bounding the whole expected error when a target function is approximated through a piecewise linear function, while also controlling the empirical error. Minimizing this bound led to the definition of a new model, called Data Envelopment Analysis-based Machines (DEAM), which generates production function estimations that seek a balance between the empirical error and the generalization error.
Classical non-parametric techniques, such as DEA, suffer from the overfitting problem because they assume the axiom of minimal extrapolation (Banker et al. [6], Afriat [28], and Farrell [33]). The DEAM model, however, is more flexible when it comes to estimating production frontiers through a cross-validation process, disregarding the minimal extrapolation axiom, as shown by the computational experiments in this paper.
Finally, we mention several lines that pose interesting avenues for further research. The first is the possibility of extending the method to model multi-output situations. This could be useful for dealing with more realistic production settings, exploiting information on the correlation among several outputs. Second, other transformation functions (kernel methods) could be used for the input space, in the same way as in standard Support Vector Regression.