Data Transformation in the Predict-Then-Optimize Framework: Enhancing Decision Making under Uncertainty

Abstract: Decision making under uncertainty is pivotal in real-world scenarios, such as selecting the shortest transportation route amidst variable traffic conditions or choosing the best investment portfolio during market fluctuations. In today's big data age, while the predict-then-optimize framework has become a standard method for tackling uncertain optimization challenges using machine learning tools, many prediction models overlook data intricacies such as outliers and heteroskedasticity. These oversights can degrade decision-making quality. To enhance predictive accuracy and consequent decision-making quality, we introduce a data transformation technique into the predict-then-optimize framework. Our approach transforms target values in linear regression, decision tree, and random forest models using a power function, aiming to boost their predictive prowess and, in turn, drive better decisions. Empirical validation on several datasets reveals marked improvements in decision tree and random forest models. In contrast, the benefits for linear regression are nuanced. Thus, while data transformation can bolster the predict-then-optimize framework, its efficacy is model-dependent. This research underscores the potential of tailoring transformation techniques for specific models to foster reliable and robust decision-making under uncertainty.


Introduction
Decision making under uncertainty is ubiquitous in real life. Real-world applications often involve making decisions with inherent uncertainty [1,2], such as selecting the shortest transportation route under uncertain and changing traffic conditions [3,4] or choosing the best investment portfolio amidst market volatility [5,6]. These problems typically require selecting the best solution from multiple candidate options under uncertainty. To solve these problems [7-9], we need to consider the impact of the uncertainty on downstream decisions.
Given the complexity of current uncertain decision-making problems and the increasing volume of data, solving these problems requires advanced data analytics methods, such as machine learning tools [10]. A commonly used approach is the predict-then-optimize framework [11,12]. First, a prediction model is constructed based on the mapping between the relevant features and the uncertain parameter in the optimization model. Then, the optimization model makes the decision based on the predictions obtained from the prediction model.
For simplicity, this paper focuses on continuous uncertain parameters, for which regression models are used for prediction. The prediction model [13,14] can be represented by y = f(x_1, x_2, ..., x_k) + ε, where f is the prediction function and (x_1, x_2, ..., x_k) is the input vector of features used to predict the response variable y. The goal is to find parameters of f such that the error ε between the predicted value f(x_1, x_2, ..., x_k) and the true value y is as small as possible.
Various techniques, such as deep learning models [15,16] and support vector regression models [17,18], have been used to predict uncertain parameters in decision-making problems. Decision-making problems in the real world face complex and diverse situations; nevertheless, these traditional studies have often overlooked the impact of data intricacies, outliers [19], and heteroskedasticity [20] on the decision-making process. This oversight may result in sub-optimal decisions when addressing practical problems.
Among these influencing factors, heteroskedasticity, a widespread phenomenon, has a significant impact on predictive performance; it means that the variance of the error term changes with the explanatory variables rather than remaining constant. This phenomenon may cause the prediction model to perform well in some regions of the data and poorly in others, thus reducing the decision quality achieved with the predicted parameters. For example, in the portfolio selection process, we may wish to predict stock prices using economic indicators. However, due to market instability, stock prices may fluctuate very differently during different market phases, so the variance of the error term varies with the state of the market [21]. Consequently, an investor might underestimate the risks associated with certain stocks in turbulent times, resulting in overexposure to volatile assets and potential financial losses. Outliers are observations that differ significantly from the majority of data points; they may be due to measurement errors or chance circumstances. Outliers may interfere with the quality of the model's predictions, thereby reducing the accuracy of the model. For example, in the house selection process, we may wish to predict the price of a house from its size and other characteristics. However, if the dataset contains outliers, such as records of extremely high- or low-priced home transactions, these outliers may cause the model to make large prediction errors [22]. Such distortions could lead to overpayments, missed investment opportunities, or misguided selling prices based on the distorted data.
To compensate for these shortcomings of predictive models and to improve decision-making performance when using predicted parameters, we utilize the data transformation technique to optimize the predictive model within the predict-then-optimize framework. Data transformation can help reduce heteroskedasticity by transforming the response variable so that the variance of the data becomes more homogeneous. At the same time, it can compress extreme values, reducing the impact of outliers on the predictive model. Consequently, based on the characteristics of the data and the attributes of the decision-making problem, an appropriate data transformation method can be selected to improve the accuracy of the predictive model and further enhance the quality of decision making.
Our research specifically concentrates on three common prediction models: the linear regression model, the decision tree model, and the random forest model. We transform the target values y of the three models via a power function. This transformation aims to improve the predictive performance of the machine learning models, leading to better real-world decision making under uncertainty. Via experimental validation on several real-world datasets, we find that for decision tree and random forest models, the data transformation technique can effectively improve both predictive and decision-making quality. For linear regression models, the impact is limited. Therefore, it is feasible to use data transformation techniques to improve the performance of the predict-then-optimize framework, but the data transformation scheme must be selected according to the model.
In this study, we delve into the challenges of decision making under uncertainty using the predict-then-optimize framework. Central to our work is the enhancement of both prediction and decision accuracy via data transformation. Our primary contribution is the integration of data transformation techniques with diverse prediction models to bolster their efficacy in deriving decisions. The essence of our methodology is to augment decision performance by combining various models with data transformation. This approach offers fresh perspectives and methodologies for studies addressing analogous challenges.
In the remainder of this paper, we introduce the background and research significance of decision-making problems and explore the application of prediction models to these problems. Subsequently, we apply data transformation techniques to three common prediction models. Finally, we conduct experiments on several real-world datasets to verify the generalizability of the proposed method.

Related Literature
Decision making is a crucial research area focusing on identifying the optimal solution from multiple options. Its applications are vast: in manufacturing, it helps in choosing raw materials to cut production costs [23]; in logistics, it aids in selecting optimal transportation routes to reduce costs and time [24]; and in the medical sector, it streamlines the creation of treatment programs, enhancing treatment efficacy and patient satisfaction [25].
Among decision problems, the "selection problem" has received a considerable amount of attention, and in this study we focus on this particular class. In real-world decision-making scenarios, we often need to select the best solution among many alternatives. This choice not only impacts resource utilization but also determines the decision-making quality.
When solving uncertain selection problems, most studies adopt a two-stage framework known as the predict-then-optimize framework. This framework involves creating a regression or classification model for prediction and plugging the predictions into the downstream optimization model to decide the final results. Once regression or classification models are constructed, uncertain parameters can be predicted from multiple features to aid the decision-making process. Srivastava et al. [26] draw on machine learning methods to construct mathematical models for heart disease risk prediction and to assist medical staff in diagnosis. Wang et al. [27] predict corporate financial risk with the help of the Light Gradient Boosting Machine to improve the efficiency of corporate finance. Wei et al. [28] present a predictive model for pipeline safety assessment based on the eXtreme Gradient Boosting algorithm, applying grid search to optimize the model. The constructed model can significantly reduce the manpower and physical resources required for non-destructive examination and engineering assessment.
Although many studies have started leveraging machine learning tools for data-driven optimization, they often neglect the complexity of real-world data during the prediction stage. One such oversight is heteroskedasticity during model training. This can cause models to be overly sensitive to the data under specific conditions, leading to unstable decision making. For instance, Lee et al. [29] find that heteroskedasticity affects users' choices of travel modes between cities; models that do not take heteroskedasticity into account fall short in supporting users' travel mode choices. Similarly, Morgan [30] demonstrates that stock returns are heteroskedastic, thus influencing investment choices. Furthermore, outliers may influence the model training process and make the model over-sensitive. Di Bella et al. [31], using data from a refinery's sulfur recovery unit, and Kalisch et al. [32], via experiments on an ensemble of 100 semi-artificial time series, both find that properly addressing data outliers enhances model accuracy. Nevertheless, these studies primarily emphasize predictive analytics without thoroughly examining how predictive performance influences subsequent decision-making outcomes.
In real-world decision-making problems, neglecting issues such as data outliers and heteroskedasticity can compromise model performance, leading to sub-optimal decisions. Recognizing this, this study aims to refine the predictive model by considering the impact of the data structure on decision-making performance. To this end, we introduce the data transformation technique into the prediction model. By changing the distribution of the data via the transformation of the response variable y, we mitigate the effect of heteroskedasticity and outliers. This adaptation not only ensures the prediction model aligns better with data nuances but also bolsters decision-making efficacy.

Methods
In this section, we utilize the response variable transformation technique to optimize three prediction models. For simplicity, this paper mainly focuses on continuous response variables and, therefore, studies regression prediction models. Then, the optimization model is constructed to perform target selection based on the uncertain parameters predicted by the prediction model.

Problem Setting
In the selection problem, the objective of the decision-making process is to minimize a cost function c(z, y) via the choice of the decision variable z ∈ Z ⊂ R^{d_z}, where y ∈ Y ⊂ R^{d_y} is the uncertain parameter. Additionally, contextual information, represented as x ∈ X ⊂ R^{d_x}, is related to the parameter y. Assuming that we obtain a new observation x_0 ∈ X, the optimal decision z*(x_0) for the optimization problem is formulated as follows:

z*(x_0) = argmin_{z ∈ Z} c(z, y(x_0)),    (1)

where y(x_0) denotes the realization of the uncertain parameter under the context x_0. A historical dataset D_H = {(x_i, y_i) : i = 1, ..., |D_H|}, where |D_H| denotes the total number of samples in D_H, is available to address problem (1). With this dataset at hand, the predict-then-optimize framework first constructs a prediction model P to forecast the value of y from the new observation x_0, denoted ŷ_0. Subsequently, the framework adopts the optimization model O, plugging ŷ_0 into problem (1) and solving min_{z ∈ Z} c(z, ŷ_0) to obtain the decision. The pseudocode of the predict-then-optimize algorithm is shown in Algorithm 1.
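As an illustration, the two stages can be sketched in Python; this is a minimal sketch in which the regressor, the candidate set, and the toy cost function are illustrative choices, not those used in the paper:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def predict_then_optimize(X_hist, y_hist, x0, candidates, cost):
    """Two-stage predict-then-optimize: fit P on D_H, then pick the
    candidate decision z minimizing c(z, y_hat_0)."""
    model = LinearRegression()          # prediction model P (any regressor)
    model.fit(X_hist, y_hist)
    y_hat0 = model.predict(x0.reshape(1, -1))[0]  # forecast of uncertain y
    # optimization model O: enumerate the feasible set Z
    return min(candidates, key=lambda z: cost(z, y_hat0))

# toy usage: y = 2x exactly, decide between z = 0 and z = 1,
# with a cost that favors z = 1 when the predicted y is large
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0]
z_star = predict_then_optimize(X, y, np.array([5.0]), [0, 1],
                               cost=lambda z, y_hat: -z * y_hat)
print(z_star)  # y_hat ≈ 10 > 0, so z = 1 minimizes -z * y_hat
```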

Linear Regression Model with the Response Variable Transformation
The linear regression model [33] is a common method in regression analysis that fits a linear relationship between an input feature vector x and a target value y via the least squares criterion. The goal is to find a straight line (or hyperplane in high-dimensional space) such that the predicted value of y is as close to the actual value of y as possible.
In a linear regression model [34], it is assumed that the predicted target value ŷ_i is obtained by linearly transforming the input x_i. For a training dataset D = {(x_i, y_i) : i = 1, ..., |D|}, where x_i is the ith input feature vector, y_i is its target value, and |D| is the number of samples in D, the prediction is calculated as follows:

ŷ_i = w^T x_i + b,    (2)

where w is the coefficient vector to be trained by the model and b is the intercept term, both obtained via the least squares criterion:

(w*, b*) = argmin_{w, b} Σ_{i=1}^{|D|} (y_i − w^T x_i − b)².    (3)

After obtaining w* and b*, we can calculate the predicted value ŷ_0 = (w*)^T x_0 + b* for a new observation x_0. However, in the actual model training process, when a linear regression model is used to fit the target value y, the relationship between the explanatory variable (feature) vector x and the response variable y may not be linear, showing a nonlinear trend. When the response variable y is transformed, the linear regression model can be made more suitable for predicting the target value y. Thus, data transformation can cope with abnormal values or outliers in the dataset, reduce their influence on the prediction model, and improve the robustness of the model. According to the characteristics of the data and the requirements of the model, prediction accuracy can be improved by adopting an appropriate data transformation method.
Considering that the response variable y may contain the value 0, we perform a data transformation of the response variable y using a power function, i.e., y is processed in the form y^{m_1}, where m_1 is the power parameter for the linear regression model. Thus, the goal of the model construction process is to minimize the error between the transformed response variable y^{m_1} and its corresponding transformed prediction (ŷ)^{m_1}. Therefore, the calculation of w* and b* is transformed into (4), shown as follows:

(w*, b*) = argmin_{w, b} Σ_{i=1}^{|D|} ((y_i)^{m_1} − (w^T x_i + b)^{m_1})²,    (4)

where the hyperparameter m_1 is chosen based on the current model requirements. Our objective is to obtain better decision performance based on the predicted values and the characteristics of the dataset D.
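A sketch of this idea in scikit-learn: the snippet below fits the regressor on the transformed targets y^{m_1} and back-transforms the predictions, which is a common simplification of the in-loss formulation above; the synthetic data and the choice m_1 = 2 are illustrative:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

m1 = 2  # illustrative power parameter for the linear model

# fit on y**m1, invert predictions with y**(1/m1); assumes y >= 0
reg = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=lambda y: np.power(y, m1),
    inverse_func=lambda y: np.power(y, 1.0 / m1),
    check_inverse=False,  # power/root pair is exact for non-negative y
)

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 3))
# y itself is nonlinear in X, but y**2 is linear in X
y = np.sqrt(X @ np.array([1.0, 2.0, 3.0]) + 1.0)
reg.fit(X, y)
print(round(reg.score(X, y), 3))  # near-perfect fit after the transformation
```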

Decision Tree Model with the Response Variable Transformation
The decision tree model [35] is a common machine learning method for establishing mapping relationships between input features and output response variables. The decision tree [36] constructs a tree structure by recursively dividing the dataset into subsets, selecting the optimal feature and feature value at each node. For regression tasks, the model forecasts continuous values, generally by computing the average of the target values in the corresponding leaf node.
In the decision tree construction process, we first need to select the evaluation criterion for node splitting, for which the mean squared error (MSE) is the most commonly used metric in regression tasks, calculated as follows:

MSE_D^{f,s} = (1/|D|) Σ_{i=1}^{|D|} (y_i − g_{f,s}(x_i))²,

where D is the current training dataset, |D| is the number of samples in D, g_{f,s}(x_i) is the value predicted for x_i by the decision tree when splitting on feature f at splitting value s, and MSE_D^{f,s} is the MSE value under f and s.
Based on the MSE metric, the decision tree method traverses the current features and their corresponding feature values and selects the feature and feature value that reach the minimum MSE for node splitting.
However, during the construction of the decision tree, issues such as heteroskedasticity and outliers may affect the predictive performance and stability of the model. To better address these problems, we optimize the decision tree model using data transformation. Specifically, we use the power function to convert the response variable y into the form y^{m_2}, where m_2 is the power parameter for the decision tree model. This transformation can mitigate the effect of heteroskedasticity and make the variance more stable, thus improving the predictive performance and stability of the model. In addition, the power transformation may also reduce the impact of outliers on the model to some extent, making it more robust.
Accordingly, the calculation of the MSE is converted into:

MSE_D^{f,s} = (1/|D|) Σ_{i=1}^{|D|} ((y_i)^{m_2} − g_{f,s}(x_i))²,

where the tree is now fit to the transformed targets y^{m_2}.
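In code, the transformed splitting criterion amounts to fitting the tree on the transformed targets and back-transforming its predictions; a minimal sketch, where the synthetic dataset and the choice m_2 = 1/2 are illustrative, and non-negative targets are assumed so that fractional powers are well-defined:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

m2 = 0.5  # illustrative power parameter for the tree

rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(300, 2))
y = (5 * X[:, 0] + X[:, 1]) ** 2 + rng.normal(0, 0.1, 300)  # skewed target
y = np.clip(y, 0, None)                                     # keep y >= 0

tree = DecisionTreeRegressor(max_depth=4, random_state=0)
tree.fit(X, y ** m2)                  # splits now minimize MSE on y**m2
y_pred = tree.predict(X) ** (1 / m2)  # back-transform to the original scale
print(y_pred.shape)
```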

Random Forest Model with the Response Variable Transformation
The random forest algorithm [37] is an ensemble learning method constituted by multiple decision trees. The prediction result is obtained by aggregating the predictions of the individual decision trees [38]. For regression problems, the final predicted value of the random forest for a new observation x_0 is the average of the predicted values of all decision trees, shown as follows:

ŷ_0 = (1/N) Σ_{j=1}^{N} T_j(x_0),

where N is the number of decision trees in the random forest model and T_j(x_0) (j ∈ {1, ..., N}) is the predicted value of the jth decision tree for the new observation. The random forest method adopts bootstrap sampling to draw samples from the original training dataset, forming a new training subset for constructing each decision tree, which increases the diversity of the model and improves the generalization ability of the random forest.
Similar to the optimization of the decision tree model, we optimize the random forest algorithm using the response variable transformation technique, which converts the response variable y into y^{m_3} to improve the model's ability to handle heteroskedasticity in the data as well as to enhance its robustness to outliers, where m_3 is the power parameter for the random forest model. Specifically, because the random forest algorithm is composed of multiple decision trees, when we optimize the random forest with the data transformation we apply the transformed MSE criterion to each decision tree in the ensemble.
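The same recipe carries over directly, since every bootstrapped tree in the ensemble is then fit on the transformed targets; a sketch with an illustrative m_3 = 1/2 and synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

m3 = 0.5  # illustrative power parameter for the forest

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(300, 2))
y = (5 * X[:, 0] + X[:, 1]) ** 2  # skewed, non-negative target

# each of the N bootstrapped trees is trained on y**m3
forest = RandomForestRegressor(n_estimators=50, random_state=0)
forest.fit(X, y ** m3)
y_pred = forest.predict(X) ** (1 / m3)  # average of trees, back-transformed
print(y_pred.shape)
```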

Optimization Model
To examine the impact of predictions on downstream decisions, this paper adopts the selection problem, which is ubiquitous in real life. In the selection problem, we make selections based on the predicted values obtained from the transformed prediction models described above. Suppose we select the largest u values from U predictions; this can be formulated as follows:

max Σ_{i=1}^{U} c_i ŷ_i
s.t. Σ_{i=1}^{U} c_i ≤ u,  c_i ∈ {0, 1}, i ∈ {1, ..., U},

where ŷ_i is the ith (i ∈ {1, ..., U}) predicted value obtained via the regression model and c_i is the binary variable indicating whether the ith item is chosen (c_i = 1) or not (c_i = 0). The goal of the optimization model is to maximize the total predicted value while ensuring that the number of selected items is no more than u.
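Because the feasible region is a single cardinality constraint, this 0-1 program is solved exactly by sorting the predictions and keeping the u largest; a minimal sketch:

```python
import numpy as np

def select_top_u(y_pred, u):
    """Solve max sum(c_i * y_i) s.t. sum(c_i) <= u, c_i binary:
    simply take the indices of the u largest predictions."""
    return np.argsort(y_pred)[::-1][:u]

y_hat = np.array([3.1, 9.4, 0.2, 7.7, 5.0])
chosen = select_top_u(y_hat, 2)
print(sorted(chosen.tolist()))  # indices of the two largest values: [1, 3]
```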

Evaluation
In this section, we conduct computational experiments on several cases to assess the validity and robustness of the models presented in Section 3. Specifically, Section 4.1 describes the experimental setup. Subsequently, we evaluate and compare the decision quality before and after the transformation of the prediction models in Section 4.2.

Experiment Settings
In our experiments, we evaluate the robustness of our proposed models using two datasets. The first dataset (denoted by D_d), derived from the "scikit-learn" library (https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html, accessed on 1 August 2023), is the diabetes dataset and contains 442 records. The dataset D_d includes several physiological indicators of diabetic patients, with the target value y being a quantitative measure of disease progression one year after baseline, used to predict the extent of disease progression. There are ten features: age, sex, body mass index, average blood pressure, and six blood serum measurements. This dataset is utilized extensively to evaluate the performance of regression models. The selection problem here is to select u_d high-risk patients from U_d patients based on the dataset D_d.
The second dataset, denoted as D_s, is derived from port state control records of the port of Hong Kong from January 2015 to December 2019, a total of 3026 entries, sourced from the Asia-Pacific Computerized Information System (https://apcis.tmou.org/public/, accessed on 1 May 2022). Based on the literature [39,40], we consider 13 features closely related to ship conditions: ship age, gross tonnage, length, depth, beam, ship type, the total detention times of the ship, the total number of flag changes, the total number of casualties in the last 5 years, the total number of deficiencies in the last inspection, and flag performance, recognized organization performance, and company performance in the Tokyo MoU. We construct our prediction model using dataset D_s to predict the deficiency counts of ships, and we perform ship selection based on the predicted values. Specifically, we prioritize the inspection of ships with larger predicted deficiency counts, which helps to identify substandard vessels effectively. The selection problem is therefore to select u_s high-risk vessels from U_s vessels based on the dataset D_s.
For prediction model training, we divide the dataset into training and testing sets in the proportion 4:1. To better evaluate our proposed models, we process the test set in batches, dividing it into multiple subsets to simulate the structure of the selection problem. Taking the ship dataset D_s as an example, we divide the training set D_s^T and the test set D_s^t according to this proportion and, for model evaluation, divide the test set D_s^t into multiple subsets, where O(u_s, q) denotes the set of selected ships whose predicted values are among the highest u_s in the ship set q. To compare the prediction accuracy of the regression models before and after using the data transformation method, we report the MSE metric. We also report the coefficient of determination R², which measures the extent to which the regression model explains the variability of the dependent variable: the closer R² is to 1, the better the fit, whereas an R² close to 0 indicates that the model does not explain the variability of the dependent variable well and the fit is poor.
To streamline our experiments, we assume that the diabetic patient selection problem is to choose two patients from a pool of four, namely U_d = 4, u_d = 2, or four patients from a pool of ten, namely U_d = 10, u_d = 4, who would have a more severe health condition one year later. The ship selection problem is to choose three high-risk ships from a pool of ten, namely U_s = 10, u_s = 3, or three high-risk ships from a pool of twenty, namely U_s = 20, u_s = 3. Validation and testing in different application scenarios and at different instance scales on different datasets better demonstrate the effectiveness of our proposed technique and enhance its trustworthiness in practical applications.

Evaluation of Models
In our research, we aim to optimize the performance of regression models, especially regarding the possible issues of heteroskedasticity and outliers, which may reduce the accuracy and stability of the prediction model. To optimize the prediction models and make better decisions from their predictions, we use the data transformation method, in which the target variable is converted from the original y to the form y^m. For the three prediction models presented in Section 3, we adopt cross-validation to find the appropriate parameter m.
First, the value of the hyperparameter m affects the complexity of model construction; considering model complexity, prior experience, and practical considerations, we select the values of m_1, m_2, and m_3 from the set {1/4, 1/3, 1/2, 1, 2, 3}. In linear regression, the coefficients w and intercept b are iteratively adjusted using gradient descent, and the gradient of w in the linear model can be computed as follows:

∇_w loss((ŷ)^{m_1}, y^{m_1}) = (2 m_1 / |D|) Σ_{i=1}^{|D|} ((ŷ_i)^{m_1} − (y_i)^{m_1}) (ŷ_i)^{m_1 − 1} x_i,  with ŷ_i = w^T x_i + b,

where D denotes the current training data used for the linear regression model and loss((ŷ)^{m_1}, y^{m_1}) denotes the MSE between the predicted values (ŷ_i)^{m_1} and the target values (y_i)^{m_1}. Because the predicted values calculated via w^T x_i + b may contain 0, for which fractional powers are ill-defined, the value of m_1 must be greater than or equal to 1. Therefore, the range of values of m_1 in the linear regression model is {1, 2, 3}. Then, cross-validation is used to evaluate the model for each value of m. Specifically, we divide the training dataset D^T into several parts, sequentially use each part as the validation set and the rest as the training set, and compute the decision performance of the model on the validation set as the evaluation index, calculated via Equation (9a). We search for the value of m that optimizes decision performance via cross-validation and then evaluate in detail the performance of these models in terms of the MSE and decision performance (P_{D^t}) metrics on the test dataset D^t.
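A hedged sketch of this hyperparameter search: the decision-performance function below, the realized total value of the selected items, is an illustrative stand-in for the metric of Equation (9a), and the tree model, m grid, and synthetic data are example choices:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

def decision_performance(y_true, y_pred, u=3):
    # realized value of picking the u items with the largest predictions
    chosen = np.argsort(y_pred)[::-1][:u]
    return float(np.sum(y_true[chosen]))

def select_m(X, y, m_grid=(1, 2, 3), n_splits=5):
    # pick the power m maximizing mean decision performance under CV
    best_m, best_score = None, -np.inf
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    for m in m_grid:
        scores = []
        for tr, va in kf.split(X):
            model = DecisionTreeRegressor(max_depth=4, random_state=0)
            model.fit(X[tr], y[tr] ** m)              # fit on transformed y
            y_pred = model.predict(X[va]) ** (1 / m)  # back-transform
            scores.append(decision_performance(y[va], y_pred))
        if np.mean(scores) > best_score:
            best_m, best_score = m, float(np.mean(scores))
    return best_m

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(200, 3))
y = (X @ np.array([2.0, 1.0, 3.0])) ** 2
m_star = select_m(X, y)
print(m_star)
```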
We construct three different regression models in this experiment, the linear regression model, the decision tree model, and the random forest model, and optimize them via the data transformation method. We first train and test the three models on the original data to obtain their benchmark performance. Then, we transform the response variables y of the three models with power functions and train and test them under the optimization method.
The overall results are presented in Tables 1-4, where Tables 1 and 2 show the results for D_d and Tables 3 and 4 show the results for D_s. Via a comparative analysis of the results before and after transforming the response variable, we aim to gain a deeper understanding of the strengths and weaknesses of different models and provide a reliable basis for selecting the optimal regression model. As shown in Tables 1-4, the regression models after data transformation improve their decision performance (P_D), coefficient of determination (R²), and predictive accuracy (MSE and MAE) in most cases, with the exception of linear regression.
Zooming in on Table 1, for the task of selecting high-risk patients (two out of four), the decision tree method exhibits a decrease in MSE of roughly 1.5% and an increase in MAE of about 1.0%. Decision performance improves by 3.4%. Notably, while the initial R² is less than 0 (indicating potential issues with prediction), after data transformation there is an increase in R² of approximately 23.8%, signifying improvement. The random forest model shows a decline in MSE of around 2.7%, a reduction in MAE of roughly 0.8%, a slight improvement in decision performance of 0.1%, and a boost in R² of about 7.3%.
As shown in Table 2, for the task of selecting high-risk patients (four out of ten), the decision tree reduces the MSE by about 4.8% and the MAE by roughly 0.6%, whereas the decision performance improves by 7.0% and R² increases by about 23.9%. The random forest exhibits a decrease in MSE of about 4.9% and reduces the MAE by roughly 0.9%. Decision performance improves by 2.2% and R² increases by approximately 6.9%.
Meanwhile, for the high-risk ship selection problem of three out of ten, shown in Table 3, the decision tree's MSE increases by roughly 0.9% and its MAE by approximately 2.0%, while the decision performance improves by 3.2% and R² increases by approximately 10.0%. For the random forest model, the MSE is improved by about 1.3% and the MAE is reduced by about 1.5%. Decision performance goes up by about 0.3%, and R² increases by about 1.7%.
The results in Table 4 concern selecting three ships from a pool of twenty. The decision tree shows a decline in MSE of roughly 7.6% and reduces the MAE by approximately 2.8%, whereas the decision performance improves by 16.7% and R² increases by about 10.2%. The random forest shows a decline in MSE of roughly 0.2% and an increase in MAE of about 2.3%. Decision performance improves by approximately 0.1% and R² is enhanced by about 1.3%.

Discussion and Analysis
Combining the experimental results from the two selection problems, the linear regression model does not improve its performance after the response variable transformation. This likely stems from the model's inability to adeptly fit the transformed data given the datasets' characteristics. However, the transformation technique is more effective for the decision tree and the random forest, probably because these two models can better capture and construct the nonlinear relationships in the two current datasets.
The linear regression method is highly sensitive to the data distribution and requires checking for linearity in regression tasks. Consequently, its data transformation requirements are more stringent, often necessitating multiple tests to find a suitable transformation method. On the other hand, the decision tree method is less sensitive to the data distribution, allowing the use of data transformation to mitigate the effects of outliers and heteroskedasticity. The random forest, comprising multiple decision trees, shares similarities with the decision tree model. Selecting the appropriate data transformation method involves considering factors such as the data distribution and expert knowledge. The most effective approach is to experiment with different methods and evaluate their performance to determine the optimal transformation for a specific model and application.
In summary, this study shows the impact of response variable transformation on different algorithms and datasets, with results that differ slightly across settings. In general, the data transformation method can effectively improve the model's prediction and decision performance.

Conclusions
In this study, we focus on optimizing prediction models under the predict-then-optimize framework, so as to improve the decision-making performance of the downstream optimization problems. To this end, we adopt the response variable transformation technique. Our research is based on three common prediction models, i.e., the linear regression model, the decision tree model, and the random forest model. After performing data transformation on the three regression models, we obtain some interesting results. For the decision tree and random forest models, both decision performance and prediction quality are improved after optimization, which suggests that data transformation is effective for these models and can improve decision-making performance. However, for the linear model, there is no significant effect. This may be due to the characteristics of the model itself, which may be more suitable for dealing with linear relationships, so the nonlinear features introduced by data transformation do not help model performance. In conclusion, using data transformation to improve prediction models is feasible under the traditional predict-then-optimize framework, but it is necessary to choose appropriate data transformation methods according to the model and application. These findings are instructive for better solving decision-making problems and provide references for future research.

Algorithm 1
The predict-then-optimize framework
1: Input: training dataset D_H, prediction model P, and optimization model O
2: Output: solution z*(x_0)
3: Predicted value ŷ_0 = P(x_0 | D_H)
4: Solution z*(x_0) = argmin_{z ∈ Z} c(z, ŷ_0)
5: Return z*(x_0)

Table 1 .
The prediction quality and decision performance of different prediction methods for D d when U d = 4, u d = 2.

Table 2 .
The prediction quality and decision performance of different prediction methods for D d when U d = 10, u d = 4.

Table 3 .
The prediction quality and decision performance of different prediction methods for D s when U s = 10, u s = 3.

Table 4 .
The prediction quality and decision performance of different prediction methods for D s when U s = 20, u s = 3.