A Predictive Prescription Using Minimum Volume k-Nearest Neighbor Enclosing Ellipsoid and Robust Optimization

Abstract: This paper studies the integration of predictive and prescriptive analytics into a framework for deriving decisions from data. Traditionally, predictive analytics aims to derive predictions of unknown parameters from data using statistics and machine learning, whereas prescriptive analytics aims to derive a decision from known parameters using optimization technology. These have been studied independently, and the effect of the prediction error in predictive analytics on the decision-making in prescriptive analytics has not been clarified. We propose a modeling framework that integrates machine learning and robust optimization. The proposed algorithm uses the k-nearest neighbor model to predict the distribution of uncertain parameters based on the observed auxiliary data. The minimum volume ellipsoid that encloses the k-nearest neighbors of the observed auxiliary data is used to form the uncertainty set for the robust optimization formulation. We illustrate the data-driven decision-making framework and our novel robustness notion on a two-stage linear stochastic programming problem under uncertain parameters. The problem can be reduced to a convex program and can thus be solved to optimality very efficiently by off-the-shelf solvers.


Introduction
The term "analytics" was coined in a research report "Competing on Analytics" by Davenport (2006) [1] and has become widespread ever since. INFORMS defines analytics as "the scientific process of transforming data into insights for the purpose of making better decisions" [2]. With rapid progress in data-gathering technologies such as IoT (Internet of Things) and computation power, there are increasing expectations for the business analytics that will lead to the sophistication and automation of decision making in a rapidly changing and uncertain environment.
Business analytics is commonly viewed from three major perspectives: descriptive, predictive, and prescriptive (Lustig (2010) [3]; Evans (2012) [4]). The primary intent of descriptive analytics is to answer "what happened?" This includes preparing and analyzing historical data and identifying patterns from samples for reporting of trends. The primary intent of predictive analytics is to answer "what could happen?" This includes deriving predictions of unknown parameters from data using techniques such as statistics and supervised machine learning. The primary intent of prescriptive analytics is to answer "what should we do?" This includes deriving decisions from known parameters using techniques such as optimization.
These analytics techniques, however, are applied separately in most cases and may thus end up in a suboptimal decision. In particular, it is not clear how the parameter prediction error in predictive analytics affects the decision-making in prescriptive analytics. For an optimization problem with uncertain parameters, it is reported that a parameter error of only 0.05% leads to a deterioration of the objective function value of 15-20% (Ben-Tal et al. 2009 [5]). Therefore, it is hard to say that the separated approach truly guides decisions from data. Thus, how should we leverage data in decision-making?
Given these research gaps, in this research, we will clarify the methodology for deriving decisions from data by integrating the technologies of the predictive analytics and the prescriptive analytics. In order to achieve this research purpose, this research proposes an integrated framework of prediction algorithms and optimization algorithms as described below.
For the prediction algorithm, from the viewpoint of decision-making automation, it is desirable that the analyst can estimate from the data alone without assuming a prediction model formula. In this study, we applied the k-nearest neighbor method, which is one of the nonparametric regressions that can be estimated from the data without an explicit model assumption. The k-nearest neighbor method is one of the simplest algorithms, and is a method of averaging the closest k training data (k-nearest neighbors) in the feature space to perform prediction and discrimination. When the k-nearest neighbor is used for prediction, the average value of the data in the k-nearest neighbor is usually used. However, in the integrated framework proposed in this study, it is desirable to use all the samples in the k-nearest neighbor instead of a single prediction value in order to consider the robustness against the prediction error. Therefore, we propose a method to use a predicted value set without taking average of the samples in the k-nearest neighbors, as the input value of the optimization model.
For the optimization algorithm, robust optimization is applied in order to account for the prediction error of the data and to make it possible to handle even large-scale data in a realistic time. The application of robust optimization requires the definition of an uncertainty set that indicates the possible range of data. This study proposes a method for finding the minimum volume ellipsoidal set that includes the predicted value set obtained by the k-nearest neighbors. As this problem is a convex programming problem, it is not affected by the so-called curse of dimensionality and can be solved very efficiently.
We call the proposed algorithm "a predictive prescription using minimum volume k-nearest-neighbor enclosing ellipsoid and robust optimization". The novelty of this proposal is that it integrates predictive analytics and prescriptive analytics, accounts for prediction errors that could not be considered in the existing studies, and derives decisions from data. Another novelty is an algorithm that requires few assumptions from analysts, is versatile, and scales to large data. With the proposed technique, it is possible to achieve sophistication and automation of decision-making using large-scale data, which is of great practical importance.
The remainder of the paper is as follows. In Section 2, we review the related research. In Section 3, we outline the modeling framework of the predictive prescription. In Section 4, we describe our proposed algorithm for predictive prescription using enclosing minimum volume ellipsoid for k-nearest neighbors. In Section 5, we illustrate the effectiveness of the proposed method over other predictive prescription approaches, with several numerical examples. In Section 6, we discuss potential extensions and variations.

Literature Review
In this section, we review related research. In Section 2.1, we review stochastic programming and robust optimization, both of which are frameworks for decision-making under uncertainty. In Section 2.2, we review the research on the integration of machine learning and optimization. In Section 2.3, we state our contribution relative to the cited research.

Stochastic Programming and Robust Optimization
In the field of prescriptive analytics, stochastic programming is widely studied as a framework for decision-making under parameter uncertainty. In stochastic programming, given the probability distribution of the unknown parameter, we seek the decision that minimizes the expected cost. In the real world, the probability distribution is unknown and must be inferred from the data. However, there are several difficulties, as described below. First, when optimizing an unknown true objective function estimated from observed data that are subject to random error, even if the value estimates are unbiased, the uncertainty in these estimates coupled with the optimization-based selection process causes the value estimates for the recommended action to be biased high, so that the resulting out-of-sample performance is often disappointing. This is called the optimizer's curse in the field of decision-making (Smith and Winkler 2006 [6]). Second, even if the probability distribution p(u) and the decision x are given, the calculation of the expected value E(x, u) requires multiple integrals, which is #P-hard. Third, it is time-consuming to estimate the probability distribution, because it is necessary to make assumptions about the statistical model and validate them several times. Given the need for quicker decision-making and the shortage of data scientists, autonomous data-driven decision-making has been of great practical interest. From this point of view, Bertsimas and Kallus (2019) [7] state that the probability distribution is an imaginary one derived from human assumptions and does not exist in reality. Only data exists in reality, and establishing a data-driven decision-making framework that does not explicitly assume a probability distribution is very important in today's data-rich world.
Robust optimization has been extensively studied as an alternative approach to optimization under uncertainty. The key idea of robust optimization is to define an uncertainty set as the possible range of the uncertain parameter and to minimize the worst-case objective function within the set (Bertsimas et al. 2018 [8]). Charnes and Cooper [9] first proposed chance constraints. Soyster (1973) [10] proposed the concept of uncertainty sets and solved the worst-case problems. Ben-Tal and Nemirovski [11][12][13], El-Ghaoui et al. (1997) [14], and El-Ghaoui et al. (1998) [15] derived a robust counterpart for a linear programming problem with ellipsoidal uncertainty and constructed a theory of robust optimization. Bertsimas and Sim (2004) [16] proposed the concept of the price of robustness and considered ways to control conservatism. There are extensive review papers; see Ben-Tal [20], Delage and Iancu (2015) [21], and the references therein. However, in these studies, the optimization is performed under a given uncertainty set, and therefore no data-driven mechanism is constructed.
In recent years, distributionally robust optimization (DRO), in which the probability distribution of the parameters is itself uncertain, has also been widely studied. Delage and Ye (2010) [22] proposed the DRO model with a moment-based ambiguity set. Ben-Tal et al. (2011) [23] proposed a robust discriminant analysis when there is uncertainty in the data. Dupacova and Kopa (2012) [24] studied the robustness of stochastic programming using the contamination method. Xu et al. (2012) [25] studied the probabilistic interpretation of robust optimization. They showed the connection between robust optimization and DRO, and showed that the solution of robust optimization can be transformed into the solution of DRO. Zymler et al. (2013) [26] proposed an approximation of DRO using semi-definite programming. Wiesemann et al. (2014) [27] introduced an ambiguity set containing trust regions that can be represented in conic form. Ben-Tal et al. (2013) [28] proposed robust linear optimization problems with uncertainty regions defined by φ-divergences. Esfahani and Kuhn (2018) [29] proposed an ambiguity set derived from the Wasserstein distance. In these studies, the uncertainty region of the probability distribution of the parameter is inferred from realized values of the parameter. However, the methodology for predicting parameters from data has not been clarified.

Integration of Machine Learning and Optimization
As mentioned above, predictive analytics has been studied in the field of statistics and machine learning (such as Melin and Castillo 2014 [30], Pozna and Precup 2014 [31], and Jammalamadaka et al. 2019 [32]), and prescriptive analytics has been studied in the field of mathematical optimization (e.g., Bertsimas et al. 2018 [8] and Esfahani and Kuhn 2018 [29]). In recent years, research on their integration has attracted attention.
Hertog and Postek (2016) [33] propose two opportunities to take advantage of the synergies of predictive analytics and prescriptive analytics. The first is the construction of a methodology for optimization using a predictive model. The second is the construction of a methodology that automates optimization modeling by using predictive models. Elmachtoub and Grigas [34] have proposed a framework called Smart "Predict, then Optimize" (SPO). They proposed a prediction model that minimizes the SPO loss function, which measures the decision error rather than the traditional prediction error. Larsen et al. (2018) [35] proposed a methodology for rapidly predicting solutions to discrete probabilistic optimization problems based on supervised learning. The training dataset consists of a number of deterministic problems that are solved independently and offline. Bertsimas et al. (2018) [36] and Dunn (2018) [37] proposed a tree-based algorithm called the optimal prescription tree (OPT). However, these approaches do not incorporate estimation errors into decision-making.
Recently, studies on the integration of predictive analytics and prescriptive analytics have been emerging. Among them, Bertsimas and Kallus (2019) [7] proposed the concept of predictive prescription. This framework is very powerful, as it has two properties: asymptotic optimality and tractability. Asymptotic optimality ensures that, as the number of samples approaches infinity, the obtained solution approaches the true optimal solution. Tractability ensures that the optimal solution can be computed in polynomial time and oracle calls and that, in many important cases, the problem is solvable using off-the-shelf optimization solvers.
As an extension of the model, Bertsimas (2017) [38] proposed a framework named "bootstrap robust analytics", which integrates distributionally robust optimization and the statistical bootstrap and is designed to produce out-of-sample guarantees by exploiting a confidence region derived from φ-divergence. Despite its attractive bootstrap performance, the size of the inner maximization problem in the bootstrap robust formulation grows with the number of training samples. Thus, finding a robust prescription may become computationally expensive when the training dataset contains a huge number of samples. The authors proposed a dual formulation to ease the dependence on the amount of training data, which, however, is not completely eliminated.

Contribution
In this paper, we consider the research question of how to integrate predictive and prescriptive analytics. To fill the gap in the literature, several requirements should be taken into account. The integration should be robust against the uncertainty of parameters caused by the prediction error. The integration should be distribution-free. The integration should be computationally inexpensive when the training dataset contains a huge number of distinct samples. We propose an effective approach for a class of predictive prescription models that is tailored to uncertainty set construction. The proposed algorithm uses the minimum volume enclosing ellipsoid that contains the k-nearest neighbors of the observed auxiliary data. It relies on a nonparametric prediction model and thus does not need to assume a probability distribution. Its uncertainty set is formed around the k-nearest neighboring samples and is thus robust against the prediction error. It applies robust optimization over ellipsoidal uncertainty, for which efficient algorithms have been extensively studied. For linear programming (LP) under uncertain parameters, the problem can be reduced to a standard second-order cone program (SOCP) and can thus be solved to optimality very efficiently by off-the-shelf solvers. Therefore, the algorithm is computationally tractable.
The main contribution of the paper is as follows.

1. Most of the studies on the integration of machine learning and optimization use a separated approach, i.e., they first predict uncertain parameters from auxiliary data and then optimize with the predicted parameters. This approach neglects the effect of prediction error, which is of critical importance in operations research and management science. We propose a framework that integrates machine learning and robust optimization to safeguard against cases in which the estimation error causes serious trouble.

2. We make the nearest neighbor formulation advanced by Bertsimas and Kallus (2019) [7] resilient against the adverse effects of overfitting by formulating a robust counterpart. To form the robust counterpart, we propose an algorithm that computes the minimum volume ellipsoid covering the k-nearest points. This ellipsoid is used as the uncertainty set in the robust counterpart. We indicate that the resulting robust supervised learning formulations are computationally as tractable as their nominal counterparts.

3. We demonstrate that the worst-case expectation over an ellipsoidal uncertainty set enclosing the k-nearest neighbors can in fact display good performance. We also investigate the out-of-sample performance of the resulting optimal decisions experimentally and analyze its dependence on the number of training samples and nearest neighbors.

Modeling Framework
This section describes the modeling framework. Table 1 presents a summary of the notation. In Section 3.1, the preliminary is explained. In Section 3.2, the predictive prescription is described. In Section 3.3, the alternative approach for the predictive prescription is described.
Table 1. Summary of notation.
M T : index set for training data set
Y(û) : feasible region of y given uncertain data û
ẑ T : certificate (the objective function value of the data-driven solution, i.e., f (x̂ T ))
x̂ T : data-driven solution
ẑ V : out-of-sample performance

Preliminary
In predictive analytics, we seek a predictor h to predict uncertain quantities of interest u (dependent variable) from associated covariates v (feature vector) as in (1), given the training dataset {(u 1 , v 1 ), · · · , (u M , v M )}, where u j is the j-th observation of u and v j is the j-th observation of v.
In prescriptive analytics, we seek an optimal decision x = [x 1 , · · · , x n ] T ∈ R n , constrained to the feasible region X, that minimizes an objective function f (x, u) depending on the decision x and the parameter u.
One possible way to incorporate the auxiliary data v on the associated covariates into the model is to compute a point prediction û = h(v̄) by supervised machine learning after observing v = v̄, and then solve the optimization problem as in (2).
This point-prediction approach, however, does not incorporate the uncertainty of the data, which is of critical importance in business analytics. The traditional approach to decision-making under uncertain data is stochastic programming, which has the form of (3).
If we knew the full joint distribution of u and v, say p(u, v), we could incorporate the uncertainty of u into the model by utilizing the training dataset and the observation v = v̄. However, it is often difficult in practice to assume full knowledge of the joint distribution.
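The displayed forms of (1)-(3) are not reproduced in this text. One reading that is consistent with the surrounding definitions is sketched below; it is an assumption rather than the original typesetting, and v̄ (written \bar{v}) denotes the observed value of v.

```latex
\begin{align}
  u &\approx h(v), \tag{1}\\
  \min_{x \in X}\; & f(x, \hat{u}), \qquad \hat{u} = h(\bar{v}), \tag{2}\\
  \min_{x \in X}\; & \mathbb{E}_{u}\bigl[\, f(x, u) \,\bigr]. \tag{3}
\end{align}
```

Under this reading, (2) replaces the random u by a single predicted value, whereas (3) averages the cost over the distribution of u but makes no use of the observation of v.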

Predictive Prescription
In order to incorporate the auxiliary data v into the decision-making, we consider the predictive prescription model. The predictive prescription problem takes the form of (4).
In this framework, the objective is to minimize the conditional expected cost: on the basis of an observation of the auxiliary covariates v = v̄, a decision x ∈ X is chosen to minimize the expectation of an uncertain cost f (x, u) that depends on a random variable u.
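Since the display of (4) is missing here, the conditional problem described above can be sketched as follows. This is a reconstruction from the verbal description, not the original equation, and \bar{v} stands for the observed value of v.

```latex
\begin{equation}
  \min_{x \in X}\; \mathbb{E}\bigl[\, f(x, u) \,\big|\, v = \bar{v} \,\bigr] \tag{4}
\end{equation}
```

The only difference from an unconditional stochastic program is the conditioning on v = \bar{v}, which is where the auxiliary data enter the decision.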
In practice, the joint distribution p(u, v) is not known and must therefore be inferred from data. This is called the data-driven setting. In the data-driven setting, p(u, v) is only partially observable through a finite set of M independent samples, namely the training dataset M T := {(u 1 , v 1 ), · · · , (u M , v M )}. In the training phase, the decision x̂ T is obtained by solving the training problem (5).
The solution x̂ T of the training problem is called the data-driven solution, and the objective function value ẑ T of the training problem is called the certificate. The goal of a data-driven problem is to minimize the out-of-sample performance of the data-driven solution x̂ T , which is defined as in (6).
As p(u, v) is unknown, however, the exact out-of-sample performance cannot be evaluated in practice; it is therefore estimated on the validation dataset M V = {û 1 , · · · , û N } as in (7).
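The validation step can be illustrated with a minimal sketch: the decision is held fixed and its cost is averaged over held-out samples of u. The newsvendor-style cost function and the numbers below are hypothetical stand-ins chosen for illustration; they are not the cost used in the paper.

```python
def out_of_sample_performance(cost, x_hat, validation_samples):
    """Average cost of the fixed decision x_hat over held-out samples of u."""
    if not validation_samples:
        raise ValueError("validation set must be non-empty")
    total = sum(cost(x_hat, u) for u in validation_samples)
    return total / len(validation_samples)


def newsvendor_cost(x, u):
    """Hypothetical cost: order x, demand u, unit overage cost 1, unit underage cost 2."""
    return 1.0 * max(x - u, 0.0) + 2.0 * max(u - x, 0.0)


# Assumed validation samples of the uncertain parameter u.
validation = [8.0, 10.0, 12.0, 14.0]
z_v = out_of_sample_performance(newsvendor_cost, 11.0, validation)
print(z_v)
```

Because the validation samples play no role in choosing the decision, the average is an unbiased estimate of the expected cost of that decision under the sampling distribution.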
We can extend the problem to a two-stage model, where the decision sequence is as follows. First, an observation of the auxiliary variable v = v̄ is given. Second, the first-stage here-and-now decision is made. Third, the realization of the uncertain data u is given. Finally, the second-stage wait-and-see decision is made. The two-stage predictive prescription model is formulated as in (8), where x is the here-and-now variable and f (x, u) is the optimal value of the second-stage problem

\begin{equation}
\begin{aligned}
\underset{y}{\text{minimize}}\quad & g(y, u)\\
\text{subject to}\quad & y \in Y(u),
\end{aligned}
\tag{9}
\end{equation}

in which y is the wait-and-see variable, g is the objective function of the second-stage problem, and Y(u) is the feasible region of y given the uncertain data u.
In the training phase, the input is the training dataset M T and the output is the decision x̂ T . In the validation phase, the input is the validation data M V and the output is the out-of-sample performance ẑ V .

Alternative Approach
A natural approach to generating data-driven solutions x̂ T is the sample average approximation (SAA), which approximates p with the empirical distribution p N that assigns probability p j = 1/M to every sample j ∈ M T . The SAA formulation with training samples u j for the two-stage problem can be written as one large-scale problem (10). This can be written in an integrated form (11).
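The idea of SAA can be sketched on a toy single-stage problem: every training sample receives weight 1/M and the decision minimizes the resulting average cost. The cost function, the samples, and the brute-force search over a finite candidate set are illustrative assumptions; the paper's formulation is a two-stage linear program solved by an optimization solver.

```python
def saa_decision(cost, candidates, samples):
    """Return (decision, objective) minimizing the equally weighted sample average."""
    best_x, best_val = None, float("inf")
    for x in candidates:
        val = sum(cost(x, u) for u in samples) / len(samples)
        if val < best_val:
            best_x, best_val = x, val
    return best_x, best_val


def newsvendor_cost(x, u):
    """Hypothetical cost: unit overage cost 1, unit underage cost 2."""
    return 1.0 * max(x - u, 0.0) + 2.0 * max(u - x, 0.0)


# Assumed training samples of u; the covariates v are ignored by SAA.
train_u = [8.0, 10.0, 12.0, 14.0]
# For this piecewise linear cost, a minimizer lies at one of the sample values.
x_saa, z_saa = saa_decision(newsvendor_cost, train_u, train_u)
print(x_saa, z_saa)
```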
This formulation, however, does not exploit auxiliary variable v. Another alternative approach is the point-prediction approach (12).
This approach can exploit the auxiliary variable. However, this approach does not consider the robustness against prediction error, as is stated in the single-stage problem. This may lead to the poor out-of-sample performance or the violation of feasibility with respect to y ∈ Y(u).
Bertsimas and Kallus (2019) [7] proposed a k-nearest-neighbor formulation that weights the training samples using kNN as in (13), where N k (v̄) is the set of the k nearest points to the observed auxiliary variable v = v̄. Using these weights, the two-stage predictive prescription can be transformed into the problem (14).
We let ẑ PR and x̂ PR be the optimal value and solution of the two-stage predictive prescription problem, respectively. x̂ PR is proven to converge almost surely to its counterpart in the true problem as M → ∞. However, x̂ PR tends to display poor out-of-sample performance in situations where M is small and the acquisition of additional samples would be costly. Furthermore, the number of constraints grows with k, which is computationally challenging when k is large.
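A minimal sketch of the kNN weighting is given below, assuming the usual form in which each of the k nearest training points receives weight 1/k and all other points receive weight 0. The function names and the toy covariates are hypothetical.

```python
import math


def knn_indices(train_v, v_obs, k):
    """Indices of the k training covariates closest to v_obs in the 2-norm."""
    order = sorted(range(len(train_v)),
                   key=lambda j: math.dist(train_v[j], v_obs))
    return order[:k]


def knn_weights(train_v, v_obs, k):
    """Weight 1/k on each of the k nearest neighbors and 0 elsewhere."""
    neighbors = set(knn_indices(train_v, v_obs, k))
    return [1.0 / k if j in neighbors else 0.0 for j in range(len(train_v))]


# Assumed training covariates and observation.
train_v = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
weights = knn_weights(train_v, (0.2, 0.1), k=2)
print(weights)
```

With these weights, the conditional expectation is approximated by the average of f (x, u j ) over the k neighbors only, which is why the size of the resulting problem grows with k.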

Proposed Algorithm
In this section, we describe the proposed algorithm, which is outlined as follows. First, the k-nearest points N k (v̄) are drawn from the training samples M T . Second, the minimum volume enclosing ellipsoid E knn that contains the finite set {u j | j ∈ N k (v̄)} is computed. Finally, the robust counterpart of the two-stage stochastic problem is solved. Each of these steps is described in detail in the following sections.

k-Nearest-Neighbor
The k-nearest-neighbor algorithm (kNN) is a nonparametric method used for classification and regression. kNN regression is a broadly applied algorithm that is reported to have a small error ratio and a good error distribution (Yiannis and Poulicos 1857 [39]). Because the model is nonparametric, the predictor does not take a predetermined form but is constructed according to information derived from the data.
N k (v̄) contains the indices of the k points among v 1 , · · · , v M that are closest to v̄. The distance can be measured in several ways. In this research, we use the 2-norm (Euclidean distance), which is one of the most standard distance measures. A heuristically good number k of nearest neighbors can be found by cross-validation based on the root mean square error (RMSE).
In the standard kNN regression model, the uncertain parameter u is predicted by averaging the training samples in the k-nearest neighbors as in Equation (15).
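The averaging rule and the RMSE-based choice of k can be sketched as follows. Leave-one-out cross-validation is used here as one simple instance of cross-validation; the text does not specify which variant was used, and the toy data are assumed.

```python
import math


def knn_predict(train_v, train_u, v_obs, k):
    """Average the responses u_j of the k training points nearest to v_obs."""
    order = sorted(range(len(train_v)),
                   key=lambda j: math.dist(train_v[j], v_obs))
    nearest = order[:k]
    dim = len(train_u[0])
    return [sum(train_u[j][d] for j in nearest) / k for d in range(dim)]


def loo_rmse(train_v, train_u, k):
    """Leave-one-out RMSE of kNN regression for a given k."""
    sq_err, count = 0.0, 0
    for i in range(len(train_v)):
        rest_v = train_v[:i] + train_v[i + 1:]
        rest_u = train_u[:i] + train_u[i + 1:]
        pred = knn_predict(rest_v, rest_u, train_v[i], k)
        sq_err += sum((p - t) ** 2 for p, t in zip(pred, train_u[i]))
        count += len(pred)
    return math.sqrt(sq_err / count)


# Toy data (assumed): scalar covariate v and scalar response u = 2 v.
train_v = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,)]
train_u = [(0.0,), (2.0,), (4.0,), (6.0,), (8.0,)]
u_hat = knn_predict(train_v, train_u, (1.4,), k=2)
best_k = min(range(1, 5), key=lambda k: loo_rmse(train_v, train_u, k))
print(u_hat, best_k)
```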
In the proposed algorithm, however, in order to have a safeguard against prediction error in the optimization model, the regression model is not used. Instead, we form the minimum volume ellipsoid enclosing all points in N k (v), hoping that the uncertain parameter u lies in that ellipsoid.

Minimum Volume Ellipsoid Around a Set
In this section, we present an algorithm for computing the minimum-volume ellipsoid that must contain k-nearest neighbors to form the uncertainty set U.
We consider the problem of finding the minimum volume ellipsoid that contains the samples in k-nearest neighbor N k (v). An ellipsoid covers N k (v) if and only if it covers its convex hull, so finding the minimum volume ellipsoid that covers N k (v) is the same as finding the minimum volume ellipsoid containing the polyhedron conv N k (v). We parameterize the ellipsoid as in (16).
We can assume without loss of generality that P is positive definite, in which case the volume of E knn is proportional to det P −1 . The problem of computing the minimum volume ellipsoid containing N k (v̄) can be expressed as

\begin{equation}
\begin{aligned}
\text{minimize}\quad & \log \det P^{-1}\\
\text{subject to}\quad & \lVert P u_j + \rho \rVert_2 \le 1, \quad j \in N_k(\bar{v}),
\end{aligned}
\tag{17}
\end{equation}

where the variables are P and ρ, and there is an implicit constraint that P is positive definite. The objective and constraint functions are both convex in P and ρ, so the problem is convex. See Boyd (2004) [40] for further detail.
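Problem (17) would normally be handed to a conic solver. As a dependency-free illustration, the sketch below instead uses Khachiyan's iterative algorithm, a swapped-in technique that approximates the same minimum volume enclosing ellipsoid. It returns a center c and a shape matrix A with E = {u : (u − c) T A (u − c) ≤ 1}, which corresponds to A = P T P in the parameterization of (17). The tolerance eps, the helper names, and the test points are assumptions, and the points must affinely span the space.

```python
def mat_inverse(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def min_volume_ellipsoid(points, eps=1e-3, max_iter=100000):
    """Khachiyan's algorithm: (center, A) with (u-c)^T A (u-c) <= 1 + eps*(d+1)/d."""
    n, d = len(points), len(points[0])
    lifted = [list(p) + [1.0] for p in points]   # lift points to dimension d + 1
    w = [1.0 / n] * n                            # weights on the points, sum to 1
    for _ in range(max_iter):
        moment = [[sum(w[i] * lifted[i][r] * lifted[i][c] for i in range(n))
                   for c in range(d + 1)] for r in range(d + 1)]
        inv = mat_inverse(moment)
        # m[i] = q_i^T inv q_i; its weighted average always equals d + 1.
        m = [sum(q[r] * inv[r][c] * q[c]
                 for r in range(d + 1) for c in range(d + 1)) for q in lifted]
        j = max(range(n), key=lambda i: m[i])
        if m[j] <= (1.0 + eps) * (d + 1):
            break
        step = (m[j] - d - 1.0) / ((d + 1.0) * (m[j] - 1.0))
        w = [(1.0 - step) * wi for wi in w]
        w[j] += step                             # shift weight to the worst point
    center = [sum(w[i] * points[i][r] for i in range(n)) for r in range(d)]
    cov = [[sum(w[i] * points[i][r] * points[i][c] for i in range(n))
            - center[r] * center[c] for c in range(d)] for r in range(d)]
    shape = [[v / d for v in row] for row in mat_inverse(cov)]
    return center, shape


def ellipsoid_level(p, center, shape):
    """Value of (p - c)^T A (p - c); at most about 1 for enclosed points."""
    diff = [pi - ci for pi, ci in zip(p, center)]
    return sum(diff[r] * shape[r][c] * diff[c]
               for r in range(len(diff)) for c in range(len(diff)))


# Assumed neighbor samples u_j in two dimensions.
pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 1.0), (2.0, 1.0), (1.0, 0.5)]
center, shape = min_volume_ellipsoid(pts)
levels = [ellipsoid_level(p, center, shape) for p in pts]
```

The interior point in the example ends up with negligible weight, which reflects the remark above that only the convex hull of the neighbors matters.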
Once P and ρ are obtained, we transform (P, ρ) into (R, ū) as in (18) and (19), and form the uncertainty set derived from the k-nearest neighbors as in (20), where ū can be interpreted as the nominal value of the uncertain parameter.
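The displays of (18)-(20) do not survive in this text. Under the parameterization implied by the constraint in (17), one consistent reading is the following; it is an assumption, and ξ is a hypothetical auxiliary variable introduced only for this sketch.

```latex
\begin{align}
  R &= P^{-1}, \tag{18}\\
  \bar{u} &= -P^{-1}\rho, \tag{19}\\
  U &= \{\, \bar{u} + R\,\xi \;:\; \lVert \xi \rVert_2 \le 1 \,\}. \tag{20}
\end{align}
```

Substituting u = \bar{u} + R\xi gives P u + \rho = \xi, so this set coincides with \{u : \lVert P u + \rho \rVert_2 \le 1\}, and \bar{u} is the center of the ellipsoid.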

Robust Optimization
Robust optimization is a framework in which random variables are modeled as an uncertain parameter u belonging to a convex uncertainty set U, and the decision-maker protects the system against the worst case within that set. The robust optimization problem takes the form

\begin{equation}
\begin{aligned}
\text{minimize}\quad & \sup_{u \in U} f(x, u)\\
\text{subject to}\quad & x \in X.
\end{aligned}
\tag{21}
\end{equation}
The robust optimization can be transformed into the following inequality form (22).
In the proposed method, the uncertainty set is the minimum volume ellipsoid, presented in the previous section, that covers the k-nearest neighbors of the observation v = v̄. Ellipsoidal uncertainty has been extensively studied for over a decade. One advantage of ellipsoidal uncertainty is that it is found to be not too pessimistic compared to the box uncertainty ||u|| ∞ ≤ 1. Another desirable feature is that the robust counterpart can be derived as a second-order cone program, which can be solved very efficiently.
Consider a linear constraint in inequality form in which u is known to lie in a given ellipsoid and the constraint must be satisfied for all possible values of u. The robust counterpart of the inequality is obtained by taking the supremum of its left-hand side over the ellipsoid, and this supremum can be expressed in closed form with a 2-norm term. Thus, the robust linear constraint can be expressed as a second-order cone inequality.
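As a concrete instance, take the constraint u T x ≤ b with u ranging over an ellipsoid written as {ū + Rξ : ||ξ|| 2 ≤ 1}. This set representation and the numbers below are assumptions made for illustration. The worst case of the left-hand side is then ū T x + ||R T x|| 2 , a standard result, and the sketch checks it against a brute-force search over the boundary of a two-dimensional ellipsoid.

```python
import math


def worst_case_lhs(u_bar, r, x):
    """sup of u^T x over u = u_bar + R xi, ||xi||_2 <= 1: u_bar^T x + ||R^T x||_2."""
    nominal = sum(ub * xi for ub, xi in zip(u_bar, x))
    rt_x = [sum(r[i][j] * x[i] for i in range(len(x))) for j in range(len(r[0]))]
    return nominal + math.sqrt(sum(v * v for v in rt_x))


# Assumed nominal value, shape matrix, and decision.
u_bar = [1.0, 2.0]
r = [[2.0, 0.0], [0.0, 0.5]]
x = [3.0, 4.0]
closed_form = worst_case_lhs(u_bar, r, x)

# Brute-force check over boundary points xi = (cos t, sin t).
angles = [2.0 * math.pi * s / 3600 for s in range(3600)]
sampled = max(
    sum((u_bar[i] + r[i][0] * math.cos(t) + r[i][1] * math.sin(t)) * x[i]
        for i in range(2))
    for t in angles
)
print(closed_form, sampled)
```

The robust constraint is therefore ū T x + ||R T x|| 2 ≤ b, which is linear in x apart from a single norm term and is exactly the second-order cone form mentioned above.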

Overall Algorithm
The overall Algorithm 1 is described as follows.
Algorithm 1 Summary of algorithm.
1. Pick the k-nearest points N k (v̄).
2. Form the minimum volume ellipsoid E knn that covers {u j | j ∈ N k (v̄)}.
3. Solve the robust optimization problem.

The proposed algorithm has desirable properties for data-driven predictive prescription. First, it uses the nonparametric k-nearest neighbor method and thus does not need to assume the joint probability distribution. Second, its uncertainty set is formed around the k-nearest neighboring samples and is thus robust against the prediction error. Third, it applies robust optimization over ellipsoidal uncertainty, for which efficient algorithms have been extensively studied, and is therefore computationally tractable.

Numerical Example
We demonstrate the effectiveness of the proposed method with numerical experiments. We restrict our discussion to a two-stage data-driven linear predictive prescription model. We applied and compared the following approaches: the sample average approximation (SAA), the point-prediction (PP) approach, the predictive prescription (PR) approach, and the proposed approach. First, we apply the proposed framework to a small-sized problem with two variables and two constraints in each of the first and second stages and show how the proposed method is applied. Second, we expand the experiments to larger-sized problems.
The experiments were run on an Intel(R) Core(TM) i7-8700 (3.20 GHz, 3.19 GHz) with 32.0 GB of memory. The program was coded in Julia, with the Gurobi optimizer called from Convex.jl.
We generate training samples M T and test samples M V . The decision-maker does not know the true distribution or test samples. The decision-maker only knows the training samples generated from the true distribution.
We have M training samples and N test samples. We repeat the same experiment, in which the decision-maker sees M samples and solves the problem, 10,000 times. Each time, we record the optimal value ẑ of the optimization problem and obtain an optimal decision x̂ T , which is a random variable because it depends on the random training samples. For each of these decisions, we evaluate the objective of the optimization problem on another N test samples to compute the out-of-sample performance.
Table 2 presents the average out-of-sample performance, where SAA, PP, PR, and RO denote the sample average approximation, the point-prediction approach, the predictive prescription, and the proposed method, respectively. From Table 2, we see that PP yielded the worst out-of-sample performance. This is because the PP approach does not take into account the robustness against the prediction error. Under the assumption that the joint distribution of u and v is multivariate normal, the range that u can take still varies even after v is observed. The point-prediction approach makes decisions without considering this variability, and as a result, the out-of-sample performance was disappointing. The SAA approach was the second worst, which was also disappointing. This is because SAA does not use the information in v, so regardless of the value of v, all samples are used as training data to make decisions. We therefore consider that overfitting worsened the out-of-sample performance, because the possible ranges of the training data and the validation data are different. PR and RO, both of which use the auxiliary data v and consider the robustness against prediction error, achieved much better out-of-sample performance than the other two. Furthermore, the proposed method obtained the best value of all. This is because the PR approach makes decisions using only the samples that appear in the k-nearest neighbors as training data, whereas in RO, the uncertainty set is defined by the minimum volume enclosing ellipsoid. We therefore consider that the better result was obtained because the worst case is taken even over unseen samples.
Table 3 presents the average out-of-sample performance of the proposed algorithm for different numbers k of training samples included in the k-nearest neighbors. From Table 3, it can be seen that the quality of the proposed method changes greatly depending on the value of k. As k increases, the robust optimization approach no longer works well. To examine the reason, Figure 1 shows the minimum volume ellipsoid with k = 10 and k = 50, where the horizontal axis represents u 1 and the vertical axis represents u 2 , the blue dots indicate all training data, the black dots indicate validation data, the red dots indicate samples within the k-nearest neighbors, and the green line indicates the obtained minimum volume ellipsoid. From Figure 1, we see that the distributions of the training data and the validation data are significantly different. At k = 10, the distribution of the samples in the k-nearest neighbors is close to the distribution of the validation data. On the other hand, when k = 50, the distribution of the validation data differs greatly from that of the k-nearest neighbors because the neighborhood is too large. These results indicate that, when k is set properly, the corresponding ellipsoid covers an appropriate amount of uncertainty.

Large-Size Instances
We consider a two-stage stochastic linear programming problem, where $c \in \mathbb{R}^{n_1}$, $A \in \mathbb{R}^{n_1 \times m_1}$, and $b \in \mathbb{R}^{m_1}$ are first-stage parameters and $f(x, u)$ is the optimal value of the second-stage problem

$$\text{minimize} \quad q^T y \quad \text{subject to} \quad s_i x + t_i y \ge w_i, \quad i = 1, \dots, m_2, \tag{33}$$

and $u := [q^T, s_1^T, \dots, s_{m_2}^T, t_1^T, \dots, t_{m_2}^T, w^T]^T$ collects the second-stage parameters $q \in \mathbb{R}^{n_2}$, $S \in \mathbb{R}^{m_2 \times n_1}$, $T \in \mathbb{R}^{m_2 \times n_2}$, and $w \in \mathbb{R}^{m_2}$. The problem has complete recourse, i.e., for every $x$ there exists $y$ satisfying $Sx + Ty \ge w$.
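For a fixed first-stage decision x and a single realization of u, the second-stage value f(x, u) in (33) is an ordinary linear program. A minimal sketch using SciPy's `linprog` follows. The function name `second_stage_value` is hypothetical, and treating y as a free variable is an assumption, since sign restrictions on y are not stated here.

```python
import numpy as np
from scipy.optimize import linprog

def second_stage_value(x, q, S, T, w):
    """Solve min q^T y subject to S x + T y >= w for fixed x; return (value, y)."""
    # linprog expects A_ub @ y <= b_ub, so T y >= w - S x becomes -T y <= S x - w.
    res = linprog(c=q, A_ub=-T, b_ub=S @ x - w, bounds=(None, None))
    if res.status != 0:
        # Infeasible or unbounded second stage for this (x, u) pair.
        raise RuntimeError(res.message)
    return res.fun, res.x

# Tiny example: the constraints reduce to y >= 2 and y >= 1.
x = np.array([1.0])
q = np.array([1.0])
S = np.array([[1.0], [0.0]])
T = np.array([[1.0], [2.0]])
w = np.array([3.0, 2.0])
value, y = second_stage_value(x, q, S, T, w)
```

Under complete recourse the problem is always feasible, but with a free y it can still be unbounded for some q, which is why the status check is kept.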
We assume that u := (q, S, T, w) follows the multivariate normal distribution as in (34).
We also assume that $v \in \mathbb{R}^r$ follows the normal distribution with known mean $\mu_v$ and covariance matrix $\Sigma_v$. We set $n_1 = n_2 = m_1 = m_2 = 100$ and $N = 10{,}000$, and vary $M \in \{10, 100, 1000\}$ and $k \in \{0.1M, 0.5M, 0.9M\}$. Each element of $c$ and $A$ was randomly generated from the uniform distribution U(0, 1). Furthermore, $b$ was set by generating a random solution $x_0$ and setting $b = A^{-1} x_0$. Each element of $\mu_v$ was randomly drawn from the uniform distribution U(0, 5). Each element of $\Sigma_v$ was randomly drawn from U(0, 5), and the matrix was made symmetric by setting $\Sigma_v := (\Sigma_v + \Sigma_v^T)/2$. Each element of $\mu_q$, $\mu_s$, $\mu_t$, $\mu_w$ was randomly drawn from the uniform distribution U(0, 1). Each element of $\Sigma_q$, $\Sigma_s$, $\Sigma_t$, $\Sigma_w$ was randomly drawn from U(0, 1), and each matrix was made symmetric by the same method as $\Sigma_v$.
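The symmetrization step described above can be sketched with NumPy as follows. Symmetrizing a matrix with uniform random entries does not by itself make it positive semidefinite, which a covariance matrix must be. The text does not say how this was handled, so the eigenvalue clipping in `clip_to_psd` is purely an assumption of this sketch; both function names and the dimension r = 4 are hypothetical.

```python
import numpy as np

def random_symmetric(dim, high, rng):
    """Draw entries from U(0, high) and symmetrize as (S + S^T) / 2."""
    S = rng.uniform(0.0, high, size=(dim, dim))
    return (S + S.T) / 2.0

def clip_to_psd(S, floor=1e-8):
    """Assumed extra step: raise eigenvalues below `floor` to `floor`."""
    vals, vecs = np.linalg.eigh(S)
    return (vecs * np.maximum(vals, floor)) @ vecs.T

rng = np.random.default_rng(0)
r = 4                                          # hypothetical dimension of v
mu_v = rng.uniform(0.0, 5.0, size=r)           # entries drawn from U(0, 5)
Sigma_v = clip_to_psd(random_symmetric(r, 5.0, rng))
v = rng.multivariate_normal(mu_v, Sigma_v)     # one draw of the auxiliary data
```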
The out-of-sample performance is summarized in Table 4. From Table 4, it can be seen that RO (k = 1), the proposed method, has the best out-of-sample performance, as in the case of the small-sized instance. We also find that the SAA and PP approaches have very poor out-of-sample performance. This result suggests that utilizing the auxiliary variable v and considering the prediction error improve the out-of-sample performance.
The CPU time required to solve the randomly generated instances is summarized in Table 5. From Table 5, SAA takes a long time when the sample size M is large. This is because it is necessary to solve the second-stage linear programming problem for each sample, i.e., M times. PP is the fastest, regardless of the sample size M. This is because PP solves the second-stage linear programming problem only once regardless of the sample size. PR was faster than SAA and slower than PP. This is because it is necessary to solve the second-stage linear programming problem for each sample in the k-nearest neighborhood, i.e., k times. As 1 ≤ k ≤ M, this explains the ordering of the CPU times for these three approaches. Finally, RO was faster than PR and slower than PP. This is because the proposed method solves only one SOCP regardless of the sample size M.
It is not possible to directly compare the speed with that reported in other papers because the assumptions of the proposed framework are more complex. The closest study is that of Bertsimas and Van Parys (2017), in which the proposed algorithm was tested on the newsvendor problem with one decision variable and the portfolio allocation problem with six decision variables. In this study, the proposed method was tested on a two-stage problem with over 100 decision variables in each stage. These results suggest that the proposed method has better scalability than the existing alternative approaches.
Unfortunately, no result was obtained within an hour for even larger instances, e.g., $M \ge 10^4$ or $n_1 = n_2 \ge 10^3$. This is mainly due to the performance of the commercial solver. The robust counterpart derived in the proposed method is an SOCP and, in theory, can be solved efficiently. However, it is still a nonlinear programming model and is difficult to solve at this scale. Therefore, future research should develop an algorithm that exploits the special structure of the model.
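To see why an ellipsoidal uncertainty set leads to a second-order cone program, consider the simplest case of a linear cost $x^T u$ with $u$ ranging over $\{u : (u - \bar{u})^T A (u - \bar{u}) \le 1\}$. The worst case has the closed form $\bar{u}^T x + \sqrt{x^T A^{-1} x}$, which is a second-order cone expression in $x$. The sketch below evaluates this formula. It illustrates only this single building block, not the robust counterpart of the full two-stage model, and the name `worst_case_linear` is hypothetical.

```python
import numpy as np

def worst_case_linear(x, center, A):
    """Maximize x^T u over the ellipsoid {u : (u-center)^T A (u-center) <= 1}.

    A must be symmetric positive definite and x nonzero.
    Returns the worst-case value and the maximizing u.
    """
    Ainv_x = np.linalg.solve(A, x)
    scale = np.sqrt(x @ Ainv_x)                # sqrt(x^T A^{-1} x)
    return float(center @ x + scale), center + Ainv_x / scale

# Ellipse centered at (1, 1) with semi-axes 2 (first axis) and 1 (second axis).
A = np.diag([0.25, 1.0])
value, u_star = worst_case_linear(np.array([1.0, 0.0]), np.array([1.0, 1.0]), A)
```

In this example the cost picks out the first coordinate, so the worst case is the center coordinate plus the semi-axis length, and the maximizer lies on the boundary of the ellipse.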

Conclusions
Business analytics is more important than ever. In this field, the integration of predictive analytics and prescriptive analytics has enormous potential. However, existing studies have applied them separately and have thus ended up with suboptimal solutions.
In this study, we proposed an alternative approach that integrates machine learning and robust optimization. The proposed method applies a nonparametric k-nearest neighbor prediction model given the observation of the auxiliary covariates. The minimum volume ellipsoid enclosing the k-nearest neighboring samples forms the uncertainty set of the uncertain parameters. Robust optimization is then applied to minimize the worst-case objective function over the obtained uncertainty set.
The proposed algorithm utilizes a nonparametric prediction model and thus does not need to assume a probability distribution. It forms the uncertainty set around the k-nearest neighboring samples and is thus robust against the prediction error. It utilizes robust optimization over an ellipsoidal uncertainty set, for which efficient algorithms have been extensively studied.
In the numerical experiments, we applied the proposed method to a two-stage linear predictive prescription problem. The proposed method outperformed the alternative approaches in terms of out-of-sample performance, and its computation time was shorter than that of SAA and PR, although longer than that of PP.
For future research, we will consider the connection with probability. This can be achieved by applying other nonparametric methods, such as kernel regression. These models have a connection to probability without assuming a probability distribution. We can draw a confidence region of the uncertain parameter, given the observation of the auxiliary variable. The uncertainty set can then be formed as the minimum volume ellipsoid enclosing the training samples within the confidence region. By doing so, we can control the degree of conservatism. Another important issue is the development of a custom solver that can exploit the special structure of the problem. The robust counterpart is a nonlinear SOCP, which can be difficult to solve when the problem size is large; modern convex optimization techniques for large-sized problems may make it tractable.