Markowitz Mean-Variance Portfolio Optimization with Predictive Stock Selection Using Machine Learning

Abstract: With the advances in time-series prediction, several recent developments in machine learning have shown that integrating prediction methods into portfolio selection is a promising direction. In this paper, we propose a novel portfolio formation strategy based on a hybrid machine learning model that combines a convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM) with robust input features obtained from Huber's location estimator for stock prediction, together with the Markowitz mean-variance (MV) model for optimal portfolio construction. Specifically, this study first applies the prediction method for stock preselection to ensure high-quality stock inputs for portfolio formation. The predicted results are then integrated into the MV model. To comprehensively demonstrate the superiority of the proposed model, we combined two portfolio models, the MV model and the equal-weight (1/N) model, with LSTM, BiLSTM, and CNN-BiLSTM, and employed the combinations as benchmarks. Historical data from the Stock Exchange of Thailand 50 Index (SET50) between January 2015 and December 2020 were collected for the study. The experiments show that stock preselection improves MV performance, and the proposed method outperforms the comparison models in terms of Sharpe ratio, mean return, and risk.


Introduction
Portfolio management is an analytical process of selecting and allocating a group of investment assets in which the portion of allocated investment is persistently changed to optimize expected return and risk tolerance (Markowitz 1952). The Markowitz mean-variance (MV) model, first developed in 1952, is the foundation of portfolio theory and is extensively used and recognized in portfolio management (Sharpe and Markowitz 1989). However, the classical MV model raises two main concerns for practical application. The first is that the MV model relies on the expected return and risk of the asset inputs to produce optimal portfolios for each level of expected return and risk (Beheshti 2018). As a result, by selecting good assets to feed into the optimization process, the MV model may achieve improved performance (Mitra Thakur et al. 2018). The other issue is that the optimization often assigns a large number of small-scale weights to high-risk assets in the optimal portfolio, which makes the result difficult to implement, particularly for individual investors (Ben Salah et al. 2018; Ortiz et al. 2021; Huang et al. 2021).
In recent years, machine learning has proven advantageous in quantitative finance (Dixon et al. 2020), and portfolio optimization is one of the most interesting problems in this regard. Normally, the MV model relies on historical data to generate the optimal portfolio and can only be as good as its data input. Therefore, a number of researchers have applied machine learning to predict future return and volatility (Henrique et al. 2019). Investors in the financial market must evaluate a variety of factors and perspectives to maximize their investment earnings (Rahiminezhad Galankashi et al. 2020). In this regard, including stock price prediction methods in portfolio optimization would be advantageous and profitable for investors (Kolm et al. 2014). Financial time-series prediction has long been a difficult field of study, since financial market fluctuations are inherently unstable, complex, and dynamic (Paiva et al. 2019). However, several related studies claim that asset price movements in financial time-series data follow patterns and that these patterns may be used to forecast the series to some extent (Wan et al. 2020; Wang et al. 2020).
The main purpose of this study is to develop a portfolio formation approach for individual investors in which a hybrid machine learning model that combines a convolutional neural network and bidirectional long short-term memory with robust input features (R-CNN-BiLSTM) is applied to predict future stock closing prices before the MV model is used to form the optimal portfolio. This study makes two main contributions that fill gaps in the existing literature. First, it proposes a novel approach for portfolio formation that combines R-CNN-BiLSTM and MV (R-CNN-BiLSTM+MV). This method, which is suitable for capturing patterns in financial time-series data, leverages robust inputs instead of direct stock closing prices for machine learning training. Three LSTM-based machine learning models (i.e., LSTM, BiLSTM, and CNN-BiLSTM) are used as comparison models to illustrate the superiority of the proposed method in terms of prediction accuracy. Second, the method includes a stock selection process to ensure the quality of stock inputs, in which stocks with higher potential returns are selected as candidates before optimal portfolios of different sizes are constructed using the MV model to determine the number of stocks that provides the best return and risk for individual investors.
The remainder of the paper is organized as follows. Section 2 reviews existing studies on stock prediction and portfolio optimization, as well as empirical works that employ traditional statistics and machine learning to solve stock prediction and selection problems. Section 3 briefly explains the underlying knowledge used in this study. Section 4 presents the detailed experimental process. Section 5 reports the experimental results. Finally, Section 6 addresses the work's key findings, theoretical implications, and limitations.

Literature Review
Many studies have addressed stock selection and portfolio optimization using various methods. Lozza et al. (2011) proposed an ex-post comparison of asset preselection strategies using the joint Markovian behavior of returns in relation to market stochastic bounds to deal with large-scale portfolio selection, covering approximately 10,000 stocks from 14 different stock markets, and discovered that Markovian strategies outperformed the classical approach based on maximizing the Sharpe ratio. Huang (2012) proposed a stock selection model using support vector regression (SVR) and genetic algorithms (GA). This model applies SVR to predict each stock's future performance, with GA utilized to optimize model parameters and input features. The highest-ranked stocks are then weighted equally to build the portfolio. The experimental results show that the investment performance of the proposed model is better than that of the benchmarks. Nguyen (2014) proposed a risk-measurement method for large-scale datasets that includes a stock-preselection procedure to remove low-diversification stocks before optimization using the Sharpe ratio, the Stutzer performance index, and the Omega measure. The experimental results showed that the preselection process improved the performance and diversification of the proposed portfolio. Rather et al. (2015) proposed a novel robust hybrid model for stock return prediction. The model consists of two linear models, the autoregressive moving average and exponential smoothing models, and a non-linear model, a recurrent neural network (RNN). The proposed model combined the results of these three prediction methods with the objective of improving prediction accuracy. An optimization model was then used to generate the model's ideal weights using GA. The proposed hybrid prediction model outperformed the RNN model in terms of prediction accuracy. Le Caillec et al. (2017) integrated several indicators for stock selection using performance evaluation analysis and a behavioral uncertainty framework of human bias to calculate the cumulative return (CR) of the portfolio. The combined methods, one probabilistic and one possibilistic, focused on discriminating the common use of multiple technical indicators (TIs) to preselect stocks within the probabilistic framework. Experiments showed that the proposed model could raise portfolio performance. Fischer and Krauss (2018) implemented an LSTM neural network to predict the directional movement of the constituent stocks of the S&P 500 from 1992 to 2015. The study found that the LSTM-based portfolio outperformed machine learning models without a memory function, i.e., random forest (RF), deep neural network (DNN), and logistic regression (LR).
These models apply only simple portfolio construction methods that ignore individual stock risk, such as the equal-weight method, which results in portfolios with unbalanced risk and expected return. As a result, they are not suitable for individual investors in practice. Due to these shortcomings, some researchers have adopted the MV model and quantitative methods to improve investment decisions.
Tu and Zhou (2010) incorporated Bayesian priors with economic objective functions in the MV model, in which the priors were imposed on the solution rather than on primitive parameters. The study used monthly returns on the Fama-French 25 size and book-to-market portfolios from January 1965 to December 2004. The results show that portfolio strategies using objective-based priors outperformed standard portfolio allocation. Brown and Smith (2011) studied heuristic trading strategies in portfolio optimization and developed a dual approach to examine the quality of the heuristics. The approach considered several components, i.e., transaction costs, constraint sets, and models of returns. Most heuristic models performed very close to the optimal solution in the experiment, indicating that the heuristics could capture the tradeoff between improving asset positions and reducing transaction costs. Li et al. (2015) proposed a portfolio selection approach incorporating background risk. The study compared a probabilistic portfolio model with background risk to one without background risk. The experiment indicated that, for the same expected return, the variance with background risk is larger than the variance without it. Bodnar et al. (2017) analyzed optimal portfolio weights within a Bayesian framework. This approach enabled investors' beliefs to be incorporated into portfolio selection. The study derived explicit formulas for the posterior distributions of linear combinations of global minimum variance (GMV) weights using different priors for asset returns, specifically non-informative (diffuse) and informative (conjugate and hierarchical) priors. A prior was then suggested directly for the portfolio weights. The numerical results showed that the suggested prior performed well. Katsikis et al. (2021) presented an online approach for time-varying financial problems that removes the limitations of static methods. The study found that time-varying mean-variance portfolio selection with transaction costs and a cardinality constraint (TV-MVPSTC-CC) can be made more realistic by using technical analysis to generate the expected return of a portfolio. Additionally, a beetle antennae search (BAS) was implemented to automatically adjust the parameters, which dramatically improved computational efficiency. The results demonstrated that BAS is more suitable than the FA, GA, and DE algorithms for portfolio configurations on real-world data. Khan et al. (2021) developed a meta-heuristic optimization method called quantum beetle antennae search (QBAS) and incorporated it into portfolio selection to generate the optimal portfolio. The study applied QBAS to real-world data from the Shanghai Stock Exchange 50 Index (SSE50) and compared its performance to conventional algorithms, i.e., particle swarm optimization (PSO), the genetic algorithm (GA), and beetle antennae search (BAS). The experimental results showed that QBAS outperformed the other algorithms in terms of time consumption, especially for extensive data. Although this method is neither computationally expensive nor time-consuming, the optimal portfolio still relies on historical data. Khan et al. (2022) proposed a meta-heuristic algorithm called non-linear activated beetle antennae search (NABAS) and formulated a tax-aware portfolio, which is a non-convex problem in which conventional algorithms can become stuck in local minima. The study used data from 20 companies on the National Association of Securities Dealers Automated Quotations (NASDAQ) exchange to compare its performance with that of BAS, PSO, and GA. The results indicated that the performance of NABAS is comparable to BAS, PSO, and GA for convex problems and clearly better for non-convex problems, as the method is immune to local minima. Khan et al. (2020) proposed a non-convex method for portfolio selection using BAS. The method takes cardinality and transaction costs into consideration to enhance the classical Markowitz model and reformulates it as an unconstrained optimization problem before solving it with BAS. Additionally, the study compared the performance of the proposed method to that of PS, PSO, and GA. The experimental results showed that BAS was six times faster than the others in the worst case and twenty-five times faster in the best case, with comparable accuracy.
These models produce more reasonable and balanced portfolios using the MV model, and some of them are very computationally efficient. However, they are not effective in dealing with complex financial time-series data and cannot accurately predict future outcomes, which complicates asset preselection: applying complex optimization methods without selecting high-quality asset inputs beforehand is not sustainable. In this regard, several scholars have focused on incorporating machine learning and deep learning to capture complex financial time-series data rather than on improving the optimization model. Alizadeh et al. (2010) developed a portfolio optimization model using an adaptive neural fuzzy inference system to predict portfolio returns and an index of variance for risk assessment. The experimental results show that this portfolio optimization model outperforms the MV, neural network, and Sugeno-Yasukawa models. This research indicates that combining artificial intelligence techniques with modern portfolio optimization can produce better performance than any single trading investment model. Paiva et al. (2019) proposed a decision-making model for day-trading stock market investments, developed using a composite approach of SVM and MV models for portfolio selection. The proposed model was compared with two other models, namely SVM+1/N and Random+MV. An experimental evaluation based on Ibovespa stock market assets shows that the proposed model works best. Wang et al. (2020) proposed a portfolio formation method consisting of LSTM networks and an MV model. The LSTM is applied to capture stock price patterns using a variety of technical indicators as input variables, such as the Relative Strength Index (RSI), Momentum (MOM), and True Range (TR). The MV model was used to generate optimal portfolios with different numbers of assets, between five and ten candidates, before benchmarking against other combined machine learning and MV models. The experimental results showed that the proposed method, LSTM+MV, outperformed the other models, especially when the number of stocks in the portfolio was ten. Ta et al. (2020) built portfolios using an LSTM neural network and three portfolio optimization techniques, namely the equal-weight method, Monte Carlo simulation, and the MV model. In addition, they applied linear regression and SVM for comparison in the stock selection process. The test results show that the LSTM neural network outperforms linear regression and SVM in terms of prediction accuracy, and its portfolio outperforms the others. Chen et al. (2021) developed a novel portfolio construction method in which a hybrid machine learning model, eXtreme Gradient Boosting (XGBoost) with an improved firefly algorithm (IFA), was applied to predict future stock prices, and the MV model was then employed to select sets of stocks with higher predicted returns into optimal portfolios of different sizes. Empirical results showed that the proposed model was superior to the benchmark models, especially when the portfolio contained seven stocks.
These models apply different machine learning and deep learning methods for stock selection and then develop portfolio optimization models with the selected stocks for trading investments. These methods point in a promising direction for building portfolio models in practice. Notably, several related studies focus mainly on improving the quality of asset inputs before optimization rather than on improving Markowitz's optimization model. Accordingly, to advance research on portfolio optimization, this study follows this line of reasoning and adapts the original MV model.

Background Knowledge
Mean-Variance Optimization

Markowitz (1952) proposed the mean-variance (MV) model and was awarded the Nobel Prize in Economics in 1990. The MV model makes use of the mean and variance, calculated from historical asset prices, to quantify the expected return and risk of the generated portfolio. The MV model assumes that the investor would like either to maximize the expected return for a given level of risk or to minimize risk for a given return (Kolm et al. 2014). In this study, we only show the optimization with minimum variance. The MV model is described as follows:

\[
\min_{w}\ \sigma^2 = \sum_{i=1}^{N}\sum_{j=1}^{N} w_i w_j C_{ij},
\]
\[
\text{subject to}\quad \sum_{i=1}^{N} w_i E_i \ge \gamma, \qquad \sum_{i=1}^{N} w_i = 1, \qquad w_i \ge 0,
\]

where N represents the total number of assets, which indicates the dimensionality of the optimization; w_i is the weight of asset i in the portfolio to be optimized; σ² stands for the variance of the portfolio, which generally refers to portfolio risk; C_ij is the covariance of returns between assets i and j; γ is the expected or target return; and E_i is the average return on individual asset i.
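A minimal numerical sketch of this minimum-variance program, using synthetic returns and SciPy's SLSQP solver; the asset data and target return here are illustrative, not the paper's dataset:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical inputs: 250 days of daily returns for N = 4 assets.
rng = np.random.default_rng(0)
returns = rng.normal(0.0005, 0.01, size=(250, 4))

E = returns.mean(axis=0)           # expected return E_i per asset
C = np.cov(returns, rowvar=False)  # covariance matrix C_ij
gamma = E.mean()                   # illustrative target return

def variance(w):
    # Portfolio variance sigma^2 = w' C w.
    return w @ C @ w

n = len(E)
constraints = [
    {"type": "eq", "fun": lambda w: w.sum() - 1.0},    # weights sum to 1
    {"type": "ineq", "fun": lambda w: w @ E - gamma},  # return >= target
]
res = minimize(variance, np.full(n, 1.0 / n), bounds=[(0.0, 1.0)] * n,
               constraints=constraints, method="SLSQP")
w_opt = res.x
```

Since the equal-weight portfolio is feasible here, the optimized variance can be no larger than the equal-weight variance.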

CNN
A convolutional neural network (CNN) is a kind of deep learning model for processing grid-pattern data, with applications such as image processing and natural language processing. CNNs can also be applied to predict time-series data (Sadouk 2019). CNNs can significantly improve the quality of learning models by reducing the number of parameters. A CNN is mainly composed of three types of layers: a convolution layer, a pooling layer, and a fully connected layer (Albawi et al. 2017). The first two layers, the convolution layer and the pooling layer, perform feature extraction, while the last layer, the fully connected layer, maps the extracted features to the output (Milošević and Racković 2019).
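As a toy illustration of the two feature-extraction stages just described, the following NumPy snippet (a made-up signal and kernel, not the paper's network) applies a convolution followed by max pooling:

```python
import numpy as np

# A short made-up 1-D signal and a simple smoothing kernel.
signal = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 2.0, 1.0])
kernel = np.array([0.25, 0.5, 0.25])

# Convolution layer: slide the kernel over the signal ("valid" positions only).
conv = np.convolve(signal, kernel, mode="valid")

# Pooling layer: down-sample by taking the maximum of each 2-element region.
pooled = conv[: len(conv) // 2 * 2].reshape(-1, 2).max(axis=1)
```

The convolution shrinks the 8-point signal to 6 feature values, and pooling halves that again, which is how the parameter reduction mentioned above arises.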

LSTM
Long short-term memory (LSTM) was proposed by Hochreiter and Schmidhuber (1997). The model is a class of RNN but has a memory function, which enables LSTM to retain information over a longer period of time than an ordinary RNN (Fischer and Krauss 2018). The LSTM model filters information entering through gate structures composed of an input gate, a forget gate, and an output gate to update and maintain memory cells. LSTM is particularly popular in the field of financial time-series prediction, since the model can effectively handle the redundancy in historical data (Gao et al. 2021). The operation of an LSTM is as follows:

Forget gate:
\[
f_t = \sigma\!\left(W_f x_t + U_f h_{t-1} + b_f\right)
\]
Input gate:
\[
i_t = \sigma\!\left(W_i x_t + U_i h_{t-1} + b_i\right)
\]
Output gate:
\[
o_t = \sigma\!\left(W_o x_t + U_o h_{t-1} + b_o\right)
\]
Cell and hidden states:
\[
c_t = f_t \odot c_{t-1} + i_t \odot \tanh\!\left(W_c x_t + U_c h_{t-1} + b_c\right), \qquad h_t = o_t \odot \tanh(c_t),
\]

where σ is the sigmoid function, x_t is the input at time t, h_t and c_t are the hidden and cell states, and W, U, and b are the weights and biases of the respective gates.
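A single scalar LSTM step following these gate equations can be sketched as follows (all weights and inputs are made-up illustrative numbers, not trained parameters):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    # W, U, b hold parameters for the f, i, o gates and the candidate cell g.
    f = sigmoid(W["f"] * x_t + U["f"] * h_prev + b["f"])  # forget gate
    i = sigmoid(W["i"] * x_t + U["i"] * h_prev + b["i"])  # input gate
    o = sigmoid(W["o"] * x_t + U["o"] * h_prev + b["o"])  # output gate
    g = np.tanh(W["g"] * x_t + U["g"] * h_prev + b["g"])  # candidate cell
    c = f * c_prev + i * g  # new cell state
    h = o * np.tanh(c)      # new hidden state
    return h, c

W = {k: 0.5 for k in "fiog"}
U = {k: 0.1 for k in "fiog"}
b = {k: 0.0 for k in "fiog"}
h, c = lstm_step(x_t=1.0, h_prev=0.0, c_prev=0.0, W=W, U=U, b=b)
```

Because the gates squash their arguments through the sigmoid, the cell and hidden states stay bounded, which is what lets the memory persist across many steps.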

BiLSTM
Bidirectional long short-term memory (BiLSTM) is an improved version of LSTM with the ability to access the input features in both the forward and backward directions (Dong et al. 2014). The key difference between BiLSTM and LSTM is that BiLSTM uses two hidden layers. BiLSTM has been shown to outperform LSTM in time-series prediction (Siami-Namini et al. 2019). The hidden layer output of BiLSTM applies the activation function in both the forward and backward directions. The BiLSTM equations (Yang and Wang 2022) are described as follows:

\[
\overrightarrow{h}_t = \sigma\big(W_{x\overrightarrow{h}}\, x_t + W_{\overrightarrow{h}\overrightarrow{h}}\, \overrightarrow{h}_{t-1} + b_{\overrightarrow{h}}\big),
\]
\[
\overleftarrow{h}_t = \sigma\big(W_{x\overleftarrow{h}}\, x_t + W_{\overleftarrow{h}\overleftarrow{h}}\, \overleftarrow{h}_{t+1} + b_{\overleftarrow{h}}\big),
\]
\[
H_t = W_{\overrightarrow{h}H}\, \overrightarrow{h}_t + W_{\overleftarrow{h}H}\, \overleftarrow{h}_t + b_H,
\]

where σ stands for the activation function of the model; W is a weight matrix, with W_xh the weight from the input (x) to the hidden layer (h); H_t indicates the hidden layer output; and b_x denotes the bias of the respective gate (x). The output is carried out by updating the forward \(\overrightarrow{h}_t\) and backward \(\overleftarrow{h}_t\) structures.

Robust Statistics
In real-world applications, collected data often include atypical observations that deviate from the majority or bulk of the data; such observations are referred to as outliers and are especially common in financial time-series data. For stock prices, for example, outliers deviate from the general pattern of the data, making future prices very difficult to predict. To overcome this limitation, this study uses robust statistics (Maronna et al. 2019) to estimate suitable inputs for the machine learning training process.

The Classical Robust Location Estimator
The sample mean and the sample median are both location estimators of the distribution of the data. The main difference is that the sample mean is not robust to extreme outliers, for example, when the closing price of a stock falls sharply within a week for unexpected reasons. Suppose the historical closing prices of a certain stock are {243, 190, 150, 80, 56, 28, 142}. The sample mean is 127, which is not a good location estimator for these observations, while the sample median is 142, which is a robust location estimator of the data. However, if the distribution of the data is approximately normal, the sample mean is a better estimator than the sample median. A robust location estimator combines these two classical estimators: when there are extreme outliers in the observations, the robust estimator approximates the sample median; otherwise, it approximates the sample mean.
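The two estimates in the example above can be checked directly with Python's standard library:

```python
import statistics

# The seven closing prices from the example above.
prices = [243, 190, 150, 80, 56, 28, 142]

mean = statistics.fmean(prices)     # pulled around by extreme values
median = statistics.median(prices)  # robust to extreme values
```

Sorting the prices gives {28, 56, 80, 142, 150, 190, 243}, so the middle value is 142, while the mean 889/7 = 127 sits below most of the bulk of the data.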
Huber's Location Estimator

Huber (1964) proposed a good combination of the mean and the median, called the robust location estimator or M-estimator of location, which can be described as follows:

\[
\hat{\mu} = \arg\min_{\mu} \sum_{i=1}^{n} \rho(x_i - \mu),
\]

where μ is the robust location estimator of the observations; x_i is observation i; and ρ stands for the error function. The robust location is the parameter μ that minimizes the ρ function, ensuring that the parameter provides the minimum error between the location estimator and all observations. Several methods have been proposed (Maronna et al. 2006) to find the minimum of the ρ function, such as the maximum likelihood estimator (MLE). In this paper, we use a numerical method, the Newton-Raphson method, to find the robust location estimator. The Newton-Raphson method is an iterative method for solving non-linear equations: to solve the equation h(μ) = 0, h is linearized at each iteration. For a location M-estimator, it is necessary to solve h(μ) = 0 for h(μ) = avg{ψ(x − μ)}. The iterations are defined as follows:

\[
\mu_{m+1} = \mu_m + \frac{\operatorname{avg}\{\psi(x - \mu_m)\}}{\operatorname{avg}\{\psi'(x - \mu_m)\}},
\]

where μ_m is the value of the location estimator at iteration m. The ψ function of observation x is defined, with respect to a given positive constant k, as follows:

\[
\psi_k(x) =
\begin{cases}
x, & |x| \le k, \\
k \operatorname{sign}(x), & |x| > k.
\end{cases}
\]

Since ψ_k is bounded, its derivative tends to zero at infinity (Hampel et al. 2011). If k approaches infinity, then ψ_k yields the mean; on the other hand, if k approaches zero, then ψ_k acts as the median. In this paper, we use k = 1.435 in our proposed method, similar to the value used by Fox and Weisberg (2019).
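A compact sketch of this Newton-Raphson scheme for Huber's location (pure Python; starting from the sample median and the convergence tolerance are our own choices, not specified in the text):

```python
import statistics

def psi(x, k):
    # Huber's psi: identity inside [-k, k], clipped to +/- k outside.
    return max(-k, min(k, x))

def psi_prime(x, k):
    # Derivative of psi: 1 inside the threshold, 0 outside.
    return 1.0 if abs(x) <= k else 0.0

def huber_location(xs, k=1.435, tol=1e-8, max_iter=100):
    # Newton-Raphson iteration mu_{m+1} = mu_m + avg(psi) / avg(psi'),
    # started from the (already robust) sample median.
    mu = statistics.median(xs)
    for _ in range(max_iter):
        num = sum(psi(x - mu, k) for x in xs) / len(xs)
        den = sum(psi_prime(x - mu, k) for x in xs) / len(xs)
        if den == 0.0:
            break
        step = num / den
        mu += step
        if abs(step) < tol:
            break
    return mu
```

With a very large k no residual is clipped, so the estimator reduces to the sample mean, matching the limiting behavior described above.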

Data Preparation
One of the greatest challenges in stock prediction is to capture the pattern of financial time-series data between the past and the future (Wang et al. 2020); hence, it is easier to predict stable stocks than volatile stocks. The Stock Exchange of Thailand 50 Index (SET50) consists of the 50 largest-capitalization companies in the Thai stock market, which comprehensively reflect its overall situation. In this study, the historical data of the stocks in the SET50 are considered as the experimental dataset on account of the stability and large scale of the stocks. Related studies have used 21-49 stocks as the experimental dataset: Wang et al. (2020) randomly selected 21 stocks from the FTSE100 as the sample for the machine learning prediction process before optimization; Chen et al. (2021) randomly chose 24 stocks from the SSE50 as candidate assets in the stock prediction process before forming a portfolio; and Ma et al. (2021) employed 49 stocks from the SSE100 as a dataset for stock prediction using machine learning before constructing a portfolio. Additionally, numerous researchers agree on holding around 10 different stocks in the portfolio; for instance, Soeryana et al. (2017) chose five different stocks for the optimal portfolio, and Abrami and Marsoem (2021) constructed an eight-asset portfolio. Therefore, our study randomly selected 25 stocks that traded continuously between 1 January 2015 and 30 December 2020, covering 1462 trading days, from the SET50 index and used their closing prices as the experimental dataset, which is sufficiently large for individual investors to build a portfolio (Zaimovic et al. 2021). The names of these stocks are "Airport of Thailand" (AOT), "Bangkok Dusit Medical Services" (BDMS), "Bangkok Expressway and Metro" (BEM), "Berli Jucker" (BJC), "BTS Group Holdings" (BTS), "CP ALL" (CPALL), "Central Pattana" (CPN), "Delta Electronics Thailand" (DELTA), "Total Access Communication" (DTAC), "Energy Absolute" (EA), "Siam Global House" (GLOBAL), "Intouch Holdings" (INTUCH), "IRPC" (IRPC), "Indorama Ventures" (IVL), "KCE Electronics" (KCE), "Krungthai Card" (KTC), "Land & Houses Public" (LH), "Minor International" (MINT), "Muangthai Capital" (MTC), "Petroleum Authority of Thailand" (PTT), "PTT Exploration and Production" (PTTEP), "PTT Global Chemical" (PTTGC), "Ratch Group" (RATCH), "Srisawad Corporation" (SAWAD), and "The Siam Cement" (SCC).
Table 1 presents summary statistics of the closing prices for the 25 stocks. The stocks with the highest and lowest returns are DELTA and IRPC, while the stocks with the highest and lowest standard deviations are SCC and LH, respectively. This study proposes a hybrid R-CNN-BiLSTM model to improve prediction accuracy. The detailed process of the proposed model is presented in Figure 1. First, the data transformation component converts the stock closing prices into the robust domain, which is a non-noisy version of the data. The direct stock closing price data are not suitable for machine learning training due to their high standard deviations; therefore, we transform the data to make them more suitable for the training process. The stock closing prices are divided into small time-series windows of 4 days, the so-called lag times, where consecutive lag times overlap by 1 day. Huber's location estimator of each lag time is calculated using Equations (14) and (15).
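The windowing step can be sketched as follows (pure Python; we read "overlap by 1 day" as a stride of 3, and the median stands in for Huber's location estimator here, both our own simplifications):

```python
import statistics

def to_robust_series(prices, window=4, overlap=1):
    # Split the series into `window`-day lag times, each sharing `overlap`
    # observations with the next one, then summarize each window with a
    # robust location estimate (median as a stand-in for Huber's estimator).
    stride = window - overlap
    windows = [prices[i:i + window]
               for i in range(0, len(prices) - window + 1, stride)]
    return [statistics.median(w) for w in windows]

# Made-up closing prices with one spike to show the robustness of the summary.
prices = [10, 11, 12, 50, 13, 14, 15, 16, 17, 18]
robust = to_robust_series(prices)
```

The spike at 50 barely perturbs the summarized series, which is the point of feeding robust locations rather than raw prices to the network.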
Second, feature extraction is performed using a CNN. The CNN has the ability to identify important factors in the data, which are called "features". The purpose of this step is to preserve the history in the time-series data and feed it into the BiLSTM. The input data are therefore converted by performing convolutional operations on the time steps of the time-series data using a sequence folding layer. Next, two-dimensional convolutional layers are used to extract the data features. The filter size of the first convolutional layer is 3 × 3, and the stride parameter is set to {a = 1, b = 1}, where a is the vertical step size and b is the horizontal step size. The first convolutional layer is followed by batch normalization (BN), which normalizes a mini-batch of data over all data points to speed up the training of the CNN. In our model, the exponential linear unit (ELU) is used as the activation function; it performs the identity operation on positive inputs and an exponential non-linearity on negative inputs. The filter sizes of the second and third convolutional layers increase to 5 × 5 and 7 × 7, respectively. The next layer is the pooling layer, which performs down-sampling by dividing the data into sub-regions and then taking the maximum of each region. The sequence structure of the input is restored using a sequence unfolding layer. Finally, the spatial dimensions of the data are collapsed with a flatten layer. An overview of the proposed CNN framework is shown in Figure 2.
Finally, price prediction using the BiLSTM is performed. The flattened data from the CNN are used as the input of the BiLSTM. The input size of the first BiLSTM layer is 500, and the number of hidden units is 128. Tanh is used as the state activation function and sigmoid as the gate activation function. To prevent the network from overfitting, we place a dropout layer after each BiLSTM layer; its operation changes the underlying network architecture between iterations by randomly setting input elements to zero with uniform probability. For the second BiLSTM layer, we decrease the input size and the number of hidden units to 256 and 16, respectively. The last two layers, a fully connected layer and a regression layer, are standard in CNN architectures. The BiLSTM framework is shown in Figure 3.

Hyperparameter Setting
The training dataset is passed to the proposed model for training. In this step, the various hyperparameters of the neural network are specified. These include the number of hidden layers, the number of epochs, and the batch size. Finding the optimal hyperparameters is still a major challenge in deep learning. In this study, hyperparameters are set manually by trial and error, selecting the best-performing values from the experiments. The following is a detailed description of the hyperparameters and their settings.

1.
The number of epochs: An epoch is one full pass over the training data. In our experiments, we set the number of epochs to 100 and performed our training. After training, we found that all runs converge within 100 to 120 epochs. Therefore, 100 is selected as the value for this hyperparameter.

2.
The number of hidden layers: This is the number of layers between the input and output layers. For the CNN network, we set the numbers of filters in the three convolutional layers to 100, 100, and 50. In the BiLSTM network, we set the numbers of hidden units to 128 and 16.

3.
Learning rate: This value controls the step size of weight updates and thus the convergence of the model. In our experiment, we set the learning rate to 0.0001. Many researchers recommend using a learning rate lower than 0.01 (Hastie et al. 2017).

4.
Optimizer: This is the optimization function used to minimize the loss. In our work, we use the Adam optimizer, as it works well for LSTM-based networks.

5.

Loss function: Mean squared error (MSE) was used as the loss function. Our implementation was written in MATLAB with GPU computing.
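For clarity, the hyperparameter values listed above can be collected into a single configuration. The following is an illustrative sketch in Python (the paper's implementation is in MATLAB; the key names are our own):

```python
# Hyperparameters as reported in this section (values from the text).
HYPERPARAMS = {
    "epochs": 100,                     # training converged within 100-120 epochs
    "conv_filters": [100, 100, 50],    # three convolutional layers
    "bilstm_hidden_units": [128, 16],  # two BiLSTM layers
    "learning_rate": 1e-4,             # below the 0.01 commonly recommended
    "optimizer": "adam",
    "loss": "mse",
}
```

Keeping the settings in one place like this makes the trial-and-error search reproducible.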

Stock Selection
Once all the stock prices are successfully predicted, high-quality stocks are selected for the optimization process by ranking them in descending order based on the expected (average) return. The predicted stock price is used to calculate the stock return using Equation (17).

Rt = (pt − pt−1) / pt−1 (17)
where Rt is the return of the stock at time t, while pt is the predicted stock price at time t and pt−1 is the predicted stock price at time t − 1.
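Equation (17) is the usual one-period simple return. A minimal Python sketch (illustrative only; the paper's implementation is in MATLAB):

```python
def simple_returns(prices):
    """R_t = (p_t - p_{t-1}) / p_{t-1} for a series of predicted prices."""
    return [(prices[t] - prices[t - 1]) / prices[t - 1]
            for t in range(1, len(prices))]

# Approximately [0.02, -0.02]: a 2% gain followed by a 2% loss.
rets = simple_returns([100.0, 102.0, 99.96])
```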
As a result, we select the top N stocks with the highest potential return according to the ranking order. The selected stocks are qualified for constructing the portfolio in the next stage. The MV model is used in this process to build the optimal portfolio with different proportions of asset allocation based on the qualified stocks. The optimization is performed using the MS Excel solver, in which the minimum variance is set as the objective function and the weight of each asset is the decision variable adjusted by the solver. Consequently, each of the optimal portfolios with the lowest variance is found and used for analysis.
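The preselection step described above, ranking stocks by mean predicted return and keeping the top N, can be sketched as follows. This is an illustrative Python version; the function name and tickers are hypothetical:

```python
def preselect_top_n(predicted_returns, n):
    """Rank stocks by mean predicted return (descending) and keep the top n.

    predicted_returns: dict mapping ticker -> list of predicted daily returns.
    """
    ranked = sorted(
        predicted_returns,
        key=lambda s: sum(predicted_returns[s]) / len(predicted_returns[s]),
        reverse=True,
    )
    return ranked[:n]

# Mean returns: AAA 0.02, BBB 0.01, CCC 0.04 -> top two are CCC then AAA.
top = preselect_top_n(
    {"AAA": [0.01, 0.03], "BBB": [0.02, 0.00], "CCC": [0.04, 0.04]}, 2)
```

The selected tickers would then be passed to the minimum-variance optimization (performed with the Excel solver in the paper).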

Benchmark with Comparison Models
To comprehensively benchmark R-CNN-BiLSTM+MV, three representative machine learning models (LSTM, BiLSTM, and CNN-BiLSTM) and two portfolio models, the MV model and the equal-weight portfolio (1/N) model, were employed. The hyperparameters are set as in the proposed model, and the comparison models are as follows.

Comparison Model 1: R-CNN-BiLSTM+1/N
The prediction process to select stocks for the optimization of this model is the same as in R-CNN-BiLSTM. The only difference is that the weights of the selected stocks are equally distributed. Specifically, in the first stage, R-CNN-BiLSTM is used to predict future stock closing prices, and then the top N stocks with the highest predicted return are selected and weighted equally in the portfolio. The objective of this comparison model is to assess the performance of the MV model when the same set of stocks is chosen.

Comparison Model 2: Machine Learning+MV and Machine Learning+1/N
The objective of this comparison model is to investigate whether different prediction models affect the optimal portfolio. Specifically, the stock closing prices are predicted using LSTM, BiLSTM, and CNN-BiLSTM, and the stocks with the highest predicted return are put into either the MV model or the 1/N model. In addition, the number of stocks N is consistent with R-CNN-BiLSTM+MV. Note that the hyperparameters are set as in the proposed model.

Comparison Model 3: Random+MV and Random+1/N
This comparison model differs completely from the previous models in terms of stock selection. Specifically, no machine learning prediction is performed in the first stage. In the second stage, the stocks to be processed in the portfolio optimization using either the MV model or the 1/N model are instead selected randomly from all the samples. The main objective of this strategy is to examine the necessity of stock selection using machine learning.

Experimental Results
This section first presents the prediction performance of the LSTM, BiLSTM, CNN-BiLSTM, and R-CNN-BiLSTM models. Then, this study constructs different sizes of portfolios using the classical MV model to compare the prediction results of the different machine learning models without a transaction fee.

Machine Learning Metrics
In this section, the predictive accuracy of the machine learning models is evaluated by three criteria, mean absolute error (MAE), mean square error (MSE), and symmetric mean absolute percentage error (SMAPE), as they are extensively used as performance metrics (Jierula et al. 2021; Singh et al. 2021). These measures are described as follows:

MAE = (1/n) Σ |pi − p̂i|
MSE = (1/n) Σ (pi − p̂i)²
SMAPE = (100/n) Σ |pi − p̂i| / ((|pi| + |p̂i|)/2)

where pi refers to the predicted price, p̂i represents the true value, and n indicates the total number of observations used in the experiment.
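The three error metrics can be sketched directly from their definitions. This is illustrative Python; note that SMAPE has several variants in the literature, and the paper's exact formula may differ slightly:

```python
def mae(pred, true):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def mse(pred, true):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred)

def smape(pred, true):
    """Symmetric mean absolute percentage error, in percent."""
    return (100.0 / len(pred)) * sum(
        abs(p - t) / ((abs(p) + abs(t)) / 2) for p, t in zip(pred, true))
```

Lower values are better for all three; SMAPE is scale-free, which makes it comparable across stocks with very different price levels.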

Performance of the Prediction
Tables 2 and 3 present the results of each model according to the performance metrics employed. The two tables show clearly that the R-CNN-BiLSTM model provides most of the best results compared with the other models in the experiment. However, there are still some exceptions in which some comparison models perform better. For example, the prediction errors of stocks BTS, LH, PTTGC, and SAWAD in terms of MAE, MSE, and SMAPE are smaller for BiLSTM than for R-CNN-BiLSTM. Another example is that the MAE, MSE, and SMAPE of stock SCC predicted using CNN-BiLSTM are smaller than those of R-CNN-BiLSTM. Mean absolute error (MAE): As can be seen from Tables 2 and 3, the average MAE of each machine learning model, in descending order, is as follows: 1.7219 for LSTM, 1.5350 for CNN-BiLSTM, 1.5222 for BiLSTM, and 1.4582 for R-CNN-BiLSTM. Stock PTTGC has the highest MAE of 4.7754, which is found for the LSTM model. The lowest MAE, 0.1651, is for stock BEM, predicted using CNN-BiLSTM.
Mean square error (MSE): According to Tables 2 and 3, the average MSE of each machine learning model is as follows: 2.9000 for CNN-BiLSTM, 2.5794 for BiLSTM, 2.5570 for LSTM, and 1.8081 for R-CNN-BiLSTM. The largest MSE is 9.3412, found for stock MTC with CNN-BiLSTM. The lowest MSE, 0.0523, is for stock BEM predicted using R-CNN-BiLSTM.
Symmetric mean absolute percentage error (SMAPE): From Tables 2 and 3, the average values for each machine learning model, from high to low, are as follows: 2.7197 for LSTM, 2.4589 for CNN-BiLSTM, 2.4229 for BiLSTM, and 2.3332 for R-CNN-BiLSTM. The largest SMAPE is 13.487, which is associated with stock DELTA predicted using CNN-BiLSTM. The lowest SMAPE, 0.5713, is found for stock CPALL with the R-CNN-BiLSTM model.
In conclusion, most of the R-CNN-BiLSTM results outperform the LSTM, BiLSTM, and CNN-BiLSTM models for the stock prediction process in terms of MAE, MSE, and SMAPE. Specifically, 14 stocks, BEM, BJC, CPALL, CPN, DELTA, EA, GLOBAL, IVL, KCE, KTC, MINT, MTC, PTT, and PTTEP, which were predicted using R-CNN-BiLSTM, perform the best in terms of all three metrics, followed by BiLSTM and CNN-BiLSTM. In addition, the traditional single machine learning model, LSTM, performs the worst, with several large predictive errors in this experiment. Specifically, only stock RATCH performs the best in terms of MAE and SMAPE for the LSTM model. It can be seen that the proposed model, R-CNN-BiLSTM, which uses robust input features instead of the direct stock closing price in the machine learning training process, achieves better results in the majority of cases than the machine learning models that use the direct stock closing price as input.

Portfolio Metrics
In this section, the performance of the different optimal portfolios is measured and compared using three criteria: the Sharpe ratio, the mean return, and the risk of the portfolio. These metrics are widely used to evaluate and compare the performance of stock portfolios (Lefebvre et al. 2020; Sikalo et al. 2022; Mba et al. 2022). The Sharpe ratio can be described as follows:

Sharpe ratio = (Ep − Rf) / σ

where Ep denotes the expected (average) return or mean return of the portfolio; σ is the standard deviation or risk of the portfolio; and Rf refers to the risk-free rate. In this study, we use a risk-free rate of 0.022, according to the 10-year Thai treasury rate.
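With the 0.022 risk-free rate fixed, the Sharpe ratio computation is a one-liner. An illustrative Python sketch (the input values below are hypothetical, not figures from the paper):

```python
def sharpe_ratio(mean_return, std_dev, risk_free=0.022):
    """Excess mean return per unit of risk (standard deviation)."""
    return (mean_return - risk_free) / std_dev

# Hypothetical portfolio: 10% mean return at 5% risk -> Sharpe ratio 1.56.
sr = sharpe_ratio(0.10, 0.05)
```

A higher Sharpe ratio means more excess return is earned for each unit of risk taken, which is why it is the headline comparison metric in this study.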

Performance of Different-Sized Portfolios
Numerous studies have shown that holding too many different stocks makes a portfolio hard to control and manage, especially for individual investors. Several studies related to portfolio optimization consider building a portfolio with fewer than 10 stocks (Almahdi and Yang 2017). Paiva et al. (2019) found that a portfolio with an average of seven stocks outperforms portfolios with other numbers of stocks. Wang et al. (2020) showed that an optimal portfolio with ten stocks performs better than portfolios with other numbers of stocks. Chen et al. (2021) argued that seven stocks is the most appropriate number for portfolio formation. As a result, this study constructs portfolios with N = 5, 6, 7, 8, 9, and 10 stocks to comprehensively evaluate the performance of the proposed models. The annualized mean return, annualized standard deviation, and annualized Sharpe ratio are employed as indicators.
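A common convention for annualizing daily statistics assumes 252 trading days per year: the mean return scales linearly and the standard deviation scales with the square root of time. The paper does not state its exact annualization convention, so the sketch below is an assumption:

```python
import math

def annualize(daily_mean, daily_std, periods=252):
    """Annualize a daily mean return and daily standard deviation.

    Uses the simple scaling convention: mean * periods, std * sqrt(periods).
    """
    return daily_mean * periods, daily_std * math.sqrt(periods)

# A 0.1% average daily return annualizes to roughly 25.2%.
ann_mean, ann_std = annualize(0.001, 0.01)
```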
According to Figure 4, the annualized performance for the different portfolio sizes N = 5, 6, 7, 8, 9, and 10 is presented in three sub-graphs, in which the Y-axis of each sub-graph indicates the annualized mean return, standard deviation, or Sharpe ratio, while the X-axis represents the different models formed with different numbers of stocks. It can be clearly seen from Figure 4 that when N = 5, R-CNN-BiLSTM+MV performs the best in terms of mean return, standard deviation, and Sharpe ratio. More precisely, when N = 5, R-CNN-BiLSTM+MV has the highest mean return of 0.47, followed by 0.46 for R-CNN-BiLSTM+1/N, 0.43 for CNN-BiLSTM+MV, 0.42 for BiLSTM+MV, 0.38 for LSTM+MV, 0.37 for CNN-BiLSTM+1/N, 0.36 for LSTM+1/N, 0.34 for BiLSTM+1/N, 0.07 for Random+MV, and 0.02 for Random+1/N. Furthermore, when N = 9 and 10, R-CNN-BiLSTM+MV outperforms the other models in terms of standard deviation. Specifically, R-CNN-BiLSTM+MV has the lowest standard deviation of 0.07 for both N = 9 and N = 10. As for the Sharpe ratio, when N = 5 and 8, R-CNN-BiLSTM+MV provides the best Sharpe ratios of 2.62 and 1.99, respectively. However, R-CNN-BiLSTM+1/N has the same Sharpe ratio as R-CNN-BiLSTM+MV when N = 8.
In summary, a clear advantage of R-CNN-BiLSTM+MV is found with a portfolio size of N = 5, for which all three indicators outperform the other models except for the risk of Random+MV and Random+1/N; however, since these two models have low expected returns of 0.07 and 0.02, respectively, it is still reasonable to consider R-CNN-BiLSTM+MV superior. Furthermore, most of the models tend to perform better in terms of annualized expected return, annualized Sharpe ratio, and annualized standard deviation (risk) when the number of stocks in the portfolio is five.

Discussion and Key Findings
This paper aims to extend the existing literature on portfolio optimization with stock selection. The proposed prediction model is developed based on robust statistics theory and the CNN-BiLSTM machine learning model to advance the MV model, incorporating the advantages of machine learning into stock selection. This study has several findings.
First, this paper compares the predictive performance of LSTM, BiLSTM, and CNN-BiLSTM for stock prediction. The experimental results show that BiLSTM is superior to the other models, which indicates that it is more suitable for financial time-series prediction than the other machine learning models applied in this experiment, echoing the study by Wang et al. (2020), which showed that traditional LSTM was superior in terms of prediction performance.
Second, this study improves the predictive accuracy of the CNN-BiLSTM by transforming the stock closing price into a robust input feature that effectively reduces the prediction error before the model predicts the future price. After comparing the outcomes of R-CNN-BiLSTM with LSTM, BiLSTM, and CNN-BiLSTM, it was found that the robust input is an appropriate input feature for the machine learning training process: it captures financial time-series data in a way that outperforms the comparison models that use the direct stock closing price as the input feature.
Finally, the result of the prediction process is incorporated into stock selection for portfolio optimization; the stocks with the highest returns calculated from the predicted prices are chosen to construct the optimal portfolio. The experimental results show that holding five stocks is appropriate and realistic for individual investors, which differs from the results of Wang et al. (2020) and Chen et al. (2021). Additionally, most of the results of R-CNN-BiLSTM+MV, R-CNN-BiLSTM+1/N, CNN-BiLSTM+MV, CNN-BiLSTM+1/N, BiLSTM+MV, BiLSTM+1/N, LSTM+MV, and LSTM+1/N are superior to both Random+MV and Random+1/N in terms of the Sharpe ratio, mean return, and standard deviation, which indicates the significance of selecting high-quality stocks in portfolio optimization. The significance of stock preselection is similar to the conclusions of Wang et al. (2020), Ta et al. (2020), and Chen et al. (2021).

Theoretical Implications
This study enriches the theoretical research on stock price prediction and portfolio optimization. This paper uses four prediction models that can capture financial time-series data to guarantee high-quality assets before commencing portfolio optimization. Specifically, LSTM, BiLSTM, CNN-BiLSTM, and R-CNN-BiLSTM are adopted to predict the daily future closing price of each stock, and the forecasting outcomes of R-CNN-BiLSTM are compared with those of LSTM, BiLSTM, and CNN-BiLSTM to show that R-CNN-BiLSTM, using robust input instead of the direct stock closing price, predicts financial time-series data more accurately.

Limitations and Future Work
Although this study provides useful insights, there are some limitations. First, we only use stock data from Thailand. Due to economic differences between countries, this method might not be suitable for stock markets in other countries. Second, there are several external factors that impact the financial market and could be added as input indicators to improve the method, such as the COVID-19 crisis, interest rates, and politics. Third, this study does not consider time complexity as a constraint when comparing the results. Finally, since this study sets hyperparameters manually based on trial and error, applying hyperparameter optimization algorithms may provide better hyperparameters.
In future research, time complexity should be considered to further demonstrate the applicability of the proposed method.

Figure 1.
Figure 1. The scheme of the proposed model. The proposed model consists of three parts: data transformation, feature extraction, and price prediction. First, the data transformation component converts the stock closing prices into the robust domain, which is a non-noisy version of the data. In this study, the direct stock closing price data are not suitable for machine learning training due to high standard deviations. Therefore, we need to transform the data to make them more suitable for the training process. Stock closing prices are divided into small time-series windows of 4 days, the so-called lag times. The lag times overlap each other by 1 day. Huber's location estimator of each lag time is calculated using Equations (14) and (15). Second, the feature extraction is performed using a CNN network. A CNN has the ability to identify important factors in the data, which are called "features". The purpose of this step is to preserve the historical information in the time-series data and feed it into the BiLSTM.
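The data transformation described in the caption, overlapping 4-day lag windows summarized by Huber's location, can be sketched as follows. This is an illustrative Python version using a standard iteratively reweighted location estimator with tuning constant k = 1.345; the paper's Equations (14) and (15) are not reproduced here, so the exact estimator may differ:

```python
import statistics

def huber_location(x, k=1.345, tol=1e-6, max_iter=100):
    """Iteratively reweighted estimate of Huber's location (illustrative)."""
    m = statistics.median(x)
    # Median absolute deviation as a robust scale (fall back to 1.0 if zero).
    s = statistics.median(abs(v - m) for v in x) or 1.0
    for _ in range(max_iter):
        # Down-weight points further than k*s from the current estimate.
        w = [min(1.0, k * s / abs(v - m)) if v != m else 1.0 for v in x]
        m_new = sum(wi * v for wi, v in zip(w, x)) / sum(w)
        if abs(m_new - m) < tol:
            break
        m = m_new
    return m

def lag_windows(prices, size=4, overlap=1):
    """Split a price series into windows of `size` days overlapping by `overlap` days."""
    step = size - overlap
    return [prices[i:i + size] for i in range(0, len(prices) - size + 1, step)]
```

Each window's Huber location then replaces the raw closing prices as the model input, damping the influence of outlying price moves.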


Figure 2.
Figure 2. The framework of the CNN model.

Figure 3.
Figure 3. The framework of the BiLSTM model.

Process of Training and Testing
One of the most important factors that determine the success of machine learning is the process of training and testing. In this study, we divided the closing price of each chosen stock into training and testing sets according to a ratio of 80:20. Therefore, the first 1201 days of data are used in the training process, and the last 262 days are used as the testing set.
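The chronological 80:20 split can be sketched as follows (illustrative Python; the paper reports 1201 training and 262 testing days for its dataset):

```python
def train_test_split_series(series, train_ratio=0.8):
    """Chronological split: no shuffling, to avoid look-ahead in time series."""
    cut = int(len(series) * train_ratio)
    return series[:cut], series[cut:]

# For a 100-day series, the first 80 days train and the last 20 days test.
train, test = train_test_split_series(list(range(100)))
```

Keeping the split chronological is essential for financial data: shuffling would leak future prices into the training set.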



Figure 4.
Figure 4. Annualized portfolio performance for different sizes of portfolios.

ft = σ(wf · [ht−1, xt] + bf)
it = σ(wi · [ht−1, xt] + bi)
ot = σ(wo · [ht−1, xt] + bo)
c̃t = tanh(wc · [ht−1, xt] + bc)
ct = ft ⊙ ct−1 + it ⊙ c̃t
ht = ot ⊙ tanh(ct)

where ft, it, and ot refer to the forget gate, input gate, and output gate, respectively; w represents the weight matrix; bf, bi, and bo indicate the biases of the forget gate, input gate, and output gate, respectively; σ stands for the sigmoid function; xt and ht denote the input and current output at time t, respectively; ct is the cell state at time t; and the hyperbolic function (tangent) is represented by tanh.

Table 1.
Summary statistics for experimental data (Thai baht).

Table 2.
Comparison of prediction performance between LSTM and BiLSTM.

Table 3.
Comparison of prediction performance between CNN-BiLSTM and R-CNN-BiLSTM.