A Hybrid Framework Using PCA, EMD and LSTM Methods for Stock Market Price Prediction with Sentiment Analysis

Abstract: The aim of investors is to obtain the maximum return when buying or selling stocks in the market. However, stock prices show non-linearity and non-stationarity and are difficult to predict accurately. To address this issue, a hybrid prediction model was formulated combining principal component analysis (PCA), empirical mode decomposition (EMD) and long short-term memory (LSTM), called PCA-EMD-LSTM, to predict the closing price of the stock market in Thailand one step ahead. In this research, news sentiment analysis was also applied to improve the performance of the proposed framework, based on financial and economic news using FinBERT. Experiments with stock market prices in Thailand collected from 2018–2022 were examined and various statistical indicators were used as evaluation criteria. The obtained results showed that the proposed framework yielded the best performance compared to baseline methods for predicting stock market price. In addition, the adoption of news sentiment analysis can help to enhance the performance of the original LSTM model.


Introduction
Predicting stock price behavior is an investor's goal in order to make the correct decision. A stock trader is a type of investor who always attempts to profit from the purchase and sale of stock. Therefore, this sort of investor must predict stock price changes to make the right decision on whether to sell or hold the stock they currently own. To earn money, stock traders must purchase stocks whose prices are expected to grow over the predicted period and sell stocks whose prices are dropping. If stock traders predict trends in stock prices correctly, they will have the potential to make a profit. Thus, predicting stock price trends is very important for stock traders' decision-making. However, the stock market shows highly complex trends. It is influenced by a wide range of economic factors such as Market Capitalization (MAC), general economic conditions, sentiment indices of social media and financial news [1,2]. Therefore, stock market prediction is known as one of the most challenging issues in time series prediction due to noise and volatility characteristics [3].
Previous research on predicting stock prices with effective machine learning models has largely been divided into two main approaches. The first approach aims to propose a prediction model using only historical stock data as input features and the second approach aims to apply related features to create models, including external indicators (e.g., news sentiment and social sentiment) and technical indicators.
For the first approach, there are a vast number of methodologies used to create predicting models. The common techniques include Artificial Neural Networks (ANN), Support Vector Machine (SVM), Auto Regressive Integrated Moving Average (ARIMA), etc. In addition, ANN has various structures for each data type such as Recurrent Neural

Background Theories
This section describes related theories used in this research. It is divided into seven sub-sections. The first two sub-sections deal with processes to create input features, including feature transformation and sentiment analysis. The next two sub-sections cover processes to create a prediction model, including Empirical Mode Decomposition and Long Short-Term Memory. Finally, the last three sub-sections look at statistical methods to check model performance, including time series cross-validation, performance metrics and the Augmented Dickey-Fuller Test.

Feature Transformation
The Curse of Dimensionality basically means that the error increases along with the number of features. In other words, increasing the number of features does not always improve accuracy. Nowadays, this concept is applied in the fields of machine learning. In theory, increasing the dimensions can add more information to the dataset and improve its quality. Nevertheless, it rarely helps improve model performance in practice because real-world data contains more noise and redundancy [23].
The model is likely to underfit when a dataset does not have enough features. On the other hand, it is likely to overfit when the dataset has too many features. Thus, many dimensionality reduction methods have been proposed to overcome this limitation. Dimensionality reduction is a method to eliminate some features of the dataset and create a restricted set of features that contains all the data needed to predict more efficiently and accurately. There are two methods of dimensionality reduction including feature selection and feature transformation. The key difference between them is that feature selection keeps a subset of the original features, whereas feature transformation creates a new feature that catches most of the important data.
Principal Component Analysis (PCA) is one of the most well-known techniques for dimensionality reduction [24]. PCA is a feature transformation method used to reduce the dimension of massive data sets by transforming many variables into fewer ones, while retaining most of the information in the large set. This technique saves resources for running models and increases accuracy [25].
In the field of stock prediction, since technical indicators depend on trend, volatility, volume, momentum and daily returns, they can generalize to various scenarios. PCA can consider a high number of technical indicators as input features without encountering the curse of dimensionality [26]. The advantages of PCA can be applied in various data sources and applications such as tourist behavior analysis [27] and offshore wind turbines selection [28]. In addition, some research indicates that combining machine learning and PCA results in significant model improvement, particularly in comparison to mature dimensionality reduction techniques [29]. The basic steps of PCA are as follows:

• The first step is normalization of the original data to ensure that each feature contributes equally to the analysis. The normalization equation is given by (1), where x_min and x_max denote the minimum and maximum value of a feature, x denotes an original value and x_normalized denotes a new value.

$$x_{normalized} = \frac{x - x_{min}}{x_{max} - x_{min}} \qquad (1)$$

• The second step is establishing a covariance matrix according to the normalized data matrix. Since the dataset is n-dimensional, this will result in an n × n covariance matrix represented as matrix A.
• The third step is to calculate the eigenvectors and eigenvalues of the covariance matrix to identify the principal components. The eigenvalues (λ) of matrix A are found by solving (2), where I denotes the identity matrix of the same dimension as A, as required for the matrix subtraction. For each λ, a corresponding eigenvector (v) can be found by solving (3).

$$\det(A - \lambda I) = 0 \qquad (2)$$

$$(A - \lambda I)\, v = 0 \qquad (3)$$

• The last step is reducing the dimension of the original matrix by sorting the eigenvectors by their corresponding eigenvalues from largest to smallest. The eigenvector with the highest eigenvalue becomes the first principal component of the data. The eigenvectors of the first p eigenvalues are then chosen to reduce the dimensions, and the principal components are obtained by projecting the data onto them.
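The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in this research: the function name `pca_transform`, the synthetic data and the choice of p = 2 are assumptions made for the example, and the data are centered before projection, which the steps above do not state explicitly.

```python
import numpy as np

def pca_transform(X, p):
    """Project X (samples x features) onto its first p principal components."""
    # Step 1: min-max normalization as in (1); assumes no feature is constant.
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    X_norm = (X - x_min) / (x_max - x_min)
    # Step 2: n x n covariance matrix A of the normalized features.
    A = np.cov(X_norm, rowvar=False)
    # Step 3: eigenvalues and eigenvectors of the symmetric matrix A.
    eigvals, eigvecs = np.linalg.eigh(A)
    # Step 4: sort from largest to smallest eigenvalue and keep the first p.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Assumption: the data are centered before projection.
    centered = X_norm - X_norm.mean(axis=0)
    return centered @ eigvecs[:, :p], eigvals

# Synthetic example data (hypothetical, not the stock data of this research).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = 2.0 * X[:, 0] + 0.01 * rng.normal(size=200)  # a redundant feature
components, eigvals = pca_transform(X, p=2)
```

By construction, the sample variance of each projected component equals the corresponding eigenvalue, which is why sorting by eigenvalue ranks the components by the amount of information they retain.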

Sentiment Analysis
Sentiment Analysis is a method for determining whether data are positive, negative, or neutral by using Natural Language Processing (NLP). Sentiment analysis is commonly used on textual data to examine the attitudes, feelings, behaviors, decisions and emotions of the speaker or writer concerning the target topics. The basic task in sentiment analysis is grouping texts at the sentence or document level. The grouping of texts is determined by the opinions of people, which are either positive, negative, or neutral.
Sentiment analysis techniques can be categorized into three approaches: lexicon-based approaches, machine learning-based approaches and hybrid approaches. First, lexicon-based approaches use a lexicon to perform sentiment classification by weighting and counting labeled words. Second, machine learning-based approaches use machine learning techniques, for example, Naive Bayes and Support Vector Machine, which are considered standard machine learning techniques. The input of the model includes lexical features, sentiment lexicon-based features and parts of speech [30]. Last, hybrid approaches combine lexicon-based and machine learning techniques [31]. In addition, sentiment analysis can generate profits for investors because it can support their decisions [32].
Financial Sentiment Analysis with Bidirectional Encoder Representations from Transformers (FinBERT) proposed by [33] is a language model based on Bidirectional Encoder Representations from Transformers (BERT) for financial NLP tasks. The FinBERT model includes two phases: pre-training and fine-tuning. During the pre-training phase, the FinBERT model constructs a large variety of pre-training objectives to help better capture language knowledge and semantic information. This phase trains the BERT language model in the finance domain, using a large financial corpus and a general corpus. During the fine-tuning phase, datasets for financial sentiment classification are labeled. The main sentiment analysis dataset is Financial PhraseBank. Researchers extracted 4845 sentences from the dataset with financial terms. Then, 16 experts and master students with finance backgrounds labeled the data with sentiments including positive, negative and neutral. The FinBERT model will provide a polarity score for a given text and SoftMax outputs for one of three labels: positive, negative, or neutral.
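The final classification step can be illustrated with a short sketch that turns three output logits into SoftMax probabilities, a label and a polarity score. This is a sketch under stated assumptions, not the FinBERT code: the label order in `LABELS` and the definition of the polarity score as the positive probability minus the negative probability are assumptions made here, and the logits are invented values.

```python
import numpy as np

# Assumed label order; a real model defines its own mapping.
LABELS = ("positive", "negative", "neutral")

def softmax(logits):
    """Numerically stable SoftMax over a 1-D vector of logits."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def sentiment_from_logits(logits):
    """Return (label, polarity score, class probabilities)."""
    probs = softmax(logits)
    label = LABELS[int(np.argmax(probs))]
    # Assumption: polarity score = P(positive) - P(negative).
    score = float(probs[0] - probs[1])
    return label, score, probs

# Hypothetical logits for one news headline.
label, score, probs = sentiment_from_logits([2.0, -1.0, 0.1])
```

With this definition the score lies between −1 and 1, so it can be used directly as a numerical input feature.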

Empirical Mode Decomposition (EMD)
EMD, proposed by [11], is used to decompose a signal without leaving the time domain. It can be compared to other analysis methods such as the Fourier transform and wavelet decomposition. EMD is beneficial for analyzing natural signals and is often applied to non-linear and non-stationary signals.
The EMD decomposes the original signal into a series of Intrinsic Mode Functions (IMFs) and a residual. IMFs satisfy the following two conditions:

1. An IMF has only one extremum between successive zero crossings. In other words, the number of extrema and the number of zero crossings differ by at most one.
2. The mean of the IMF wave, taken as the mean of the envelopes defined by the local maxima and the local minima, is zero.

The EMD decomposes the signal into IMFs through a sifting process. As shown in Figure 1, the sifting method can be explained using the following algorithm. Decompose a data set x(t) into IMFs x_n(t) and a residual r(t), so that the signal can be described by (4), where N denotes the number of IMFs.

$$x(t) = \sum_{n=1}^{N} x_n(t) + r(t) \qquad (4)$$
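The sifting idea and the reconstruction in (4) can be illustrated with a simplified sketch. It is not the EMD implementation used in this research: it assumes linear interpolation (`np.interp`) for the envelopes instead of the cubic splines normally used, and a fixed number of sifting passes (`n_sift`) instead of a convergence criterion. All function names and the test signal are hypothetical.

```python
import numpy as np

def local_extrema(x):
    """Indices of strict interior local maxima and minima."""
    d = np.diff(x)
    maxima = np.where((d[:-1] > 0) & (d[1:] < 0))[0] + 1
    minima = np.where((d[:-1] < 0) & (d[1:] > 0))[0] + 1
    return maxima, minima

def sift_once(x, maxima, minima):
    """Subtract the mean of the upper and lower envelopes from x."""
    t = np.arange(len(x))
    upper = np.interp(t, maxima, x[maxima])  # simplified linear envelope
    lower = np.interp(t, minima, x[minima])
    return x - (upper + lower) / 2.0

def emd(x, max_imfs=8, n_sift=10):
    """Return (IMFs, residual) such that IMFs.sum(axis=0) + residual == x."""
    imfs, residual = [], np.asarray(x, dtype=float).copy()
    for _ in range(max_imfs):
        maxima, minima = local_extrema(residual)
        if len(maxima) < 2 or len(minima) < 2:
            break  # too few extrema: what remains is the residual r(t)
        h = residual.copy()
        for _ in range(n_sift):
            maxima, minima = local_extrema(h)
            if len(maxima) < 2 or len(minima) < 2:
                break
            h = sift_once(h, maxima, minima)
        imfs.append(h)
        residual = residual - h
    return np.array(imfs), residual

# Hypothetical test signal: two oscillations plus a trend.
t = np.linspace(0.0, 1.0, 500)
x = np.sin(2 * np.pi * 5 * t) + 0.5 * np.sin(2 * np.pi * 40 * t) + t
imfs, residual = emd(x)
```

Whatever the quality of the envelopes, the decomposition is additive by construction, so summing the IMFs and the residual recovers the original signal as stated in (4).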

Long Short-Term Memory (LSTM)
Deep learning is a type of machine learning that simulates the process of the human brain in terms of data and pattern formation for making decisions. The number of architectures and algorithms used in deep learning is wide and various [34]. Countless deep learning architectures such as Recurrent Neural Network (RNN) have been applied to NLP [35]. RNN is a variant of the Artificial Neural Network (ANN) which is designed to handle tasks with sequence data. The idea of RNNs is to make use of the output from the previous state as an input to the next state. This allows the model to recognize the pattern of the input sequence. RNN has the benefit of using past data to predict future events. As a result, everything that has occurred in the past will have an influence on the future. However, RNN is ineffective for very long-term dependencies. This is due to the exponentially decreasing gradients and the decay of information for long-term dependencies. This problem is called the vanishing gradient problem.
LSTM, proposed by [36], is an improved version of RNN that avoids the vanishing gradient problem. LSTM is specifically designed to manage tasks involving long-term dependencies because it has the capacity to forget irrelevant information or store information for longer periods of time with the support of memory cells. The LSTM has a chain-like structure consisting of several subunits joined together. The unit of the LSTM architecture is a memory block with memory cells. These memory cells have three structures to control information flow: the forget gate layer, the input gate layer and the output gate layer. The forget gate layer determines what information from the previous cell state is passed on to the current cell. The input gate layer determines the relevant information to update the cell state. The output gate layer determines the output value for the next hidden state based on the input and memory of the block [37].
Furthermore, LSTM is appropriate for time series prediction because it can learn and remember long-term patterns such as market movements [38]. Advanced versions of LSTM can be used for various applications such as energy consumption [39], gas field production [40], chatbot message classification [41] and rice export price prediction [42].
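The role of the three gates described above can be made concrete with a single forward step of a standard LSTM cell written in NumPy. This is a minimal sketch of the common textbook formulation, not the network trained in this research; the weight layout (one stacked matrix `W` with rows ordered forget, input, candidate, output), the sizes and the random values are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4 * hidden, hidden + n_inputs)."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[:hidden])                 # forget gate: what to keep from c_prev
    i = sigmoid(z[hidden:2 * hidden])       # input gate: what to write to the cell
    g = np.tanh(z[2 * hidden:3 * hidden])   # candidate values for the cell state
    o = sigmoid(z[3 * hidden:])             # output gate: what to expose as h_t
    c_t = f * c_prev + i * g                # updated cell state
    h_t = o * np.tanh(c_t)                  # next hidden state
    return h_t, c_t

# Hypothetical sizes and random parameters.
rng = np.random.default_rng(1)
hidden, n_inputs = 4, 3
W = rng.normal(scale=0.1, size=(4 * hidden, hidden + n_inputs))
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x_t in rng.normal(size=(5, n_inputs)):  # a sequence of five time steps
    h, c = lstm_step(x_t, h, c, W, b)
```

Because the cell state is updated additively through the forget and input gates, information can be carried across many time steps, which is the property that makes LSTM less prone to vanishing gradients than a plain RNN.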

K-Fold Cross-Validation with Time Series Data
Cross-validation is a data resampling method for estimating the actual prediction performance of models and tuning hyper-parameters. It is used to check overall model performance in order to detect overfitting. In addition, it is used to select appropriate hyper-parameters, such as the batch size and number of epochs in an ANN model. K-fold cross-validation is one such method. The procedure begins by randomly splitting the dataset into k folds of equal size. The model is trained using k − 1 folds, which represent the training set. Then, the trained model is applied to the remaining fold, which represents the testing set, and the performance of the model is evaluated. This procedure is repeated until every fold has been used as a testing set. The final metrics are the average of the errors obtained in each fold [43].
However, K-fold cross-validation cannot be used for time series because the dataset is split randomly, and it is unrealistic to use values from the future to forecast values from the past. K-fold cross-validation with time series data follows a different procedure. The idea is that each observation is first used in a testing set and then added to the training set of the model [44]. The procedure begins by splitting the dataset into k folds of equal size while preserving the time order. In the initial iteration, only the first fold is used as the training set and the next fold is used as the testing set. In the next iteration, the previous training set and testing set are merged and used as the training set. This procedure continues until the last fold has been tested. Each training set therefore contains only observations that occurred before the observations in the corresponding testing set. Hence, no future observations are used to make the prediction [45].
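The expanding-window procedure can be sketched as a small index generator. This is an illustrative sketch under the assumption that the data are split into k consecutive folds, giving k − 1 train/test iterations; the function name `time_series_folds` and the sample sizes are hypothetical.

```python
def time_series_folds(n_samples, k):
    """Yield (train_indices, test_indices) for an expanding-window split.

    The data are cut into k consecutive folds. Iteration i trains on the
    first i folds and tests on fold i + 1, so there are k - 1 iterations.
    """
    fold_size = n_samples // k
    for i in range(1, k):
        train_end = i * fold_size
        # The last test fold absorbs any leftover observations.
        test_end = n_samples if i == k - 1 else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))

# Ten observations in time order, split into five folds.
splits = list(time_series_folds(n_samples=10, k=5))
for train, test in splits:
    print("train:", train, "test:", test)
```

Every training index precedes every testing index in each iteration, so no future observation is used to predict the past.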

Performance Metrics
In this research, the performance metrics are separated into two main parts. The first part evaluates the performance of the financial news sentiment analysis model and the second part evaluates the performance of the stock price prediction model.
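In the first part, a confusion matrix is used to validate the financial news sentiment analysis model and to compare its performance with other models using precision, recall, F1-score and accuracy from (5)-(8), respectively, where TP denotes true positives, TN denotes true negatives, FP denotes false positives, FN denotes false negatives and n denotes the number of observations.

$$Precision = \frac{TP}{TP + FP} \qquad (5)$$

$$Recall = \frac{TP}{TP + FN} \qquad (6)$$

$$F1\text{-}score = \frac{2 \times Precision \times Recall}{Precision + Recall} \qquad (7)$$

$$Accuracy = \frac{TP + TN}{n} \qquad (8)$$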
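As a small worked example, the four classification metrics can be computed directly from the confusion-matrix counts. The counts below are invented for illustration and describe a single class treated as "positive" against the rest; they are not results from this research.

```python
def classification_metrics(tp, tn, fp, fn):
    """Precision, recall, F1-score and accuracy from confusion-matrix counts."""
    n = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / n
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Hypothetical counts for 100 labeled news items.
scores = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

For a three-class problem such as positive, negative and neutral, these quantities are typically computed per class and then averaged.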
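In the second part, to evaluate the performance of the stock price prediction model, the Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), Mean Absolute Error (MAE) and coefficient of determination (R²) statistics are used to compare performance with the other models from (9)-(12), respectively, where y_i denotes the actual value, ŷ_i denotes the predicted value, ȳ denotes the mean of the actual values and n denotes the number of observations.

$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \qquad (9)$$

$$MAPE = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \qquad (10)$$

$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right| \qquad (11)$$

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \qquad (12)$$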
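The four regression metrics can be computed with a few lines of NumPy. The sketch below uses the standard definitions, with MAPE expressed as a percentage; the function name and the sample values are hypothetical and are not results from this research.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """RMSE, MAPE (in percent), MAE and R^2 for one-dimensional arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mape = 100.0 * np.mean(np.abs(err / y_true))  # assumes no zero actual values
    mae = np.mean(np.abs(err))
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"RMSE": rmse, "MAPE": mape, "MAE": mae, "R2": r2}

# Hypothetical actual and predicted closing prices.
metrics = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390])
```

RMSE and MAE are expressed in the units of the price, MAPE is scale-free, and R² approaches 1 as the predictions explain more of the variance of the actual values.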

Augmented Dickey-Fuller (ADF) Test
An ADF test is a statistical significance test for determining whether a time series is stationary or non-stationary. The ADF test is suitable for testing the stationarity of a time series because it belongs to a category of tests called unit root tests. A unit root is characterized by the parameter ρ of the process in (13), where Y_t denotes the value of the time series at time t, X_t denotes exogenous variables, ε_t denotes white noise, and ρ and δ denote the parameters to be estimated.

$$Y_t = \rho Y_{t-1} + X_t' \delta + \varepsilon_t \qquad (13)$$

If |ρ| ≥ 1, Y is a non-stationary series, while if |ρ| < 1, Y is a stationary series. As a result, the stationarity hypothesis can be evaluated by testing whether the absolute value of ρ is strictly smaller than 1.
The ADF test extends the Dickey-Fuller (DF) test equation to include a higher-order autoregressive process in the model. The DF test is a unit root test whose null hypothesis is that the series has a unit root. The standard DF test is carried out by subtracting Y_{t−1} from both sides of (13), which gives (14), where ΔY_t = Y_t − Y_{t−1} and α = ρ − 1.

$$\Delta Y_t = \alpha Y_{t-1} + X_t' \delta + \varepsilon_t \qquad (14)$$

The null and alternative hypotheses are evaluated using the conventional t-ratio for α. The ADF equation, which is the DF equation extended with a higher-order autoregressive process, is given by (15), where β_1, …, β_p denote the coefficients of the lagged differences and p denotes the lag order.

$$\Delta Y_t = \alpha Y_{t-1} + X_t' \delta + \beta_1 \Delta Y_{t-1} + \dots + \beta_p \Delta Y_{t-p} + \varepsilon_t \qquad (15)$$

The t-ratio is then used to test the same null hypothesis as the DF test. The null hypothesis is the presence of a unit root, that is, α = ρ − 1 = 0. If the p-value is greater than the significance level and the test statistic is greater than the critical value, the null hypothesis cannot be rejected. As a result, the series is inferred to be non-stationary [46,47].
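The t-ratio for α in the DF regression (14) can be computed with ordinary least squares, as in the sketch below. This is a simplified illustration, not the procedure used in this research: it assumes that the only exogenous variable is a constant, it omits the lagged differences of (15), and it compares against the commonly tabulated asymptotic 5% critical value of −2.86 for the constant-only case. A library routine such as `adfuller` in Statsmodels also selects the lag order and reports a p-value.

```python
import numpy as np

def dickey_fuller_t(y):
    """t-ratio of alpha in  dY_t = c + alpha * Y_{t-1} + e_t."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    X = np.column_stack([np.ones(len(dy)), y[:-1]])  # constant and Y_{t-1}
    coef, *_ = np.linalg.lstsq(X, dy, rcond=None)
    resid = dy - X @ coef
    sigma2 = resid @ resid / (len(dy) - X.shape[1])  # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)            # OLS coefficient covariance
    return coef[1] / np.sqrt(cov[1, 1])

CRITICAL_5PCT = -2.86  # assumed asymptotic critical value, constant-only case

# Synthetic series: white noise (stationary) and a random walk (unit root).
rng = np.random.default_rng(42)
noise = rng.normal(size=1000)
walk = np.cumsum(rng.normal(size=1000))
t_noise = dickey_fuller_t(noise)
t_walk = dickey_fuller_t(walk)
print("white noise rejects unit root:", t_noise < CRITICAL_5PCT)
print("random walk rejects unit root:", t_walk < CRITICAL_5PCT)
```

A statistic below the critical value rejects the unit-root null hypothesis; a random walk usually yields a statistic above it, so the null hypothesis is not rejected and the series is treated as non-stationary.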

Materials and Methods
This section describes the proposed hybrid framework for predicting stock market price in Thailand. It is divided into two sub-sections: data collection and system architecture.

Data Collection
This research uses two types of data to create the prediction model, financial news data as text data and historical data as numerical data. The details of each type are described.

Financial News Data
In this research, the news was collected from a total of six news agencies. The news content mainly focuses on the fields of finance and economics. The news texts were crawled from the source websites using the web scraping method, resulting in a total of 12,667 articles from 21 February 2013 to 24 February 2022. Because news data were insufficient in some time periods, only the news from 24 February 2018 to 24 February 2022 was retained to guarantee data continuity. After removing the earlier news, the dataset contained a total of 11,386 news texts, an average of 8 news samples per day. In addition, this research used 1500 labeled pieces of Thai financial news during the fine-tuning phase in order to improve the performance of FinBERT on Thai financial news sentiment analysis.

Historical Data
This research used one-step-ahead prediction to test the prediction accuracy of the proposed model on the closing price of the stock market in Thailand. The dataset obtained from investing.com included close, open, high, low and volume. The closing prices range from 24 February 2018 to 24 February 2022. Only the data from trading days were used for research. The data in the selected period are visualized in Figure 2. The statistical analysis of the closing price of the stock market, including the amount of data contained in the closing index and the minimum, maximum, mean, standard deviation and p-value of the ADF test, is shown in Table 1. To calculate the ADF test, the Python module Statsmodels is used in this research [48]. There was a significant difference between the maximum and minimum values; furthermore, the high standard deviation indicates that the closing prices were extremely volatile.
In this proposed framework, the ADF test is used to check whether the time series data are stationary or non-stationary. A p-value of the ADF test (as presented in Table 1) greater than the threshold of 0.05 fails to reject the null hypothesis and indicates that the dataset is non-stationary. This dataset was therefore suitable for the EMD method, which is effective for analyzing non-linear and non-stationary time series.
In addition, this research selected other input features related to the closing price of the stock market, called technical indicators. A total of nine technical indicators are selected. Categories and names are shown in Table 2.

System Architecture
The purpose of this research is to propose a hybrid framework for predicting the closing price of the stock market in Thailand using a combination of PCA, EMD and LSTM. The overall architecture of the proposed system is shown in Figure 3. The system is divided into two parts: the feature engineering part and the prediction model part.

Feature Engineering
The process of developing input features for prediction models is described in this section. In this research, the input feature was a combination of technical indicator components and news sentiment score to create a better predictive model.

• News Sentiment Score: In news sentiment analysis, FinBERT was used to generate the sentiment score. In order to use FinBERT efficiently for Thai news analysis, FinBERT with Thai news fine-tuning was implemented. This method trained the original FinBERT with extra data, which is Thai news in this research. According to Figure 4, there were six steps in the FinBERT with Thai news fine-tuning modeling process. The first step was to collect headlines from news sources. In this research, news in the financial and economic fields was collected from a total of six news agencies by using the web scraping method. After that, the acquired dataset was cleaned and preprocessed, including conversion to lowercase, removal of punctuation and removal of extra spaces. The next step was to manually label the news into three categories: positive, negative and neutral. Then, the dataset was randomly divided into a training set with 80% and a testing set with 20% of the data. The training set was used for supervised fine-tuning of FinBERT on this sentiment analysis task. Finally, the model was tested with the testing set and its performance was measured with F1-score and accuracy.
• Technical Indicator Component: In this research, nine technical indicators are selected and used as input features of the proposed model. In simple terms, the curse of dimensionality means that the more features there are, the higher the risk of overfitting. To solve this issue, PCA is adopted to reduce the feature space to a set of principal features. The steps to create the principal components with PCA are shown in Figure 5. Firstly, the historical data, including open, high, low, close and volume, were obtained from investing.com. After that, the "ta" package from [49] was used to generate technical indicators. Then, the technical indicators were normalized before reducing the dimensions of the data by using PCA. Principal components were retained starting from the first principal component until the sum of the explained variance ratio was greater than 80%. This indicates that more than 80% of the information in the technical indicators can be explained by the first n principal components.
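The rule for choosing the number of principal components can be sketched as follows. This is an illustrative sketch, not the code used in this research: the input is assumed to be already normalized, the nine synthetic columns merely stand in for the nine technical indicators, and the function name is hypothetical.

```python
import numpy as np

def n_components_for_variance(X, threshold=0.80):
    """Smallest n whose cumulative explained variance ratio exceeds threshold."""
    centered = X - X.mean(axis=0)
    # Eigenvalues of the covariance matrix, sorted from largest to smallest.
    eigvals = np.linalg.eigvalsh(np.cov(centered, rowvar=False))[::-1]
    ratio = eigvals / eigvals.sum()          # explained variance ratio
    cumulative = np.cumsum(ratio)
    n = int(np.argmax(cumulative > threshold)) + 1  # first index above threshold
    return n, ratio

# Synthetic stand-in for nine indicators driven by two latent factors.
rng = np.random.default_rng(7)
latent = rng.normal(size=(300, 2))
mixing = rng.normal(size=(2, 9))
X = latent @ mixing + 0.01 * rng.normal(size=(300, 9))
n, ratio = n_components_for_variance(X, threshold=0.80)
```

Because the nine synthetic columns are nearly linear combinations of two factors, at most two components are needed here; strongly correlated technical indicators behave similarly, which is why PCA can shrink the feature space considerably.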

Prediction Model
An integrated prediction model based on the combination of EMD and LSTM is proposed to maximize the prediction effectiveness and minimize the complexity of calculations. The proposed model is shown in Figure 6 and consists of the following four steps:

• Firstly, the EMD algorithm was applied to decompose the original stock closing price time series into several independent IMF components and one residual component.
• Secondly, the news sentiment score and the principal component from the feature engineering part were included as input features to the model.
• Thirdly, the LSTM model was used as the prediction tool for each IMF component. Consequently, the corresponding components acquired the prediction values. The LSTM was trained individually on each IMF; thus, the network parameters, epoch and batch size are specially tuned for each IMF. This is the significant difference that makes a hybrid EMD-LSTM model better than a single LSTM model.
• Finally, after obtaining the predicted results of the IMFs, each predicted IMF was combined using (4) to get the final predicted stock closing price. Then, the results were compared with other models using performance metrics.
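The decompose, predict and recombine structure of these steps can be sketched independently of the predictor. In the sketch below a least-squares AR(1) forecaster stands in for the per-component LSTM, the news sentiment score and principal components are omitted, and the three components are synthetic; all of these are assumptions made only to keep the example short and runnable.

```python
import numpy as np

def ar1_forecast(series):
    """One-step-ahead forecast from a least-squares AR(1) fit with intercept.

    Stand-in for the per-component LSTM; not the model of this research.
    """
    X = np.column_stack([np.ones(len(series) - 1), series[:-1]])
    coef, *_ = np.linalg.lstsq(X, series[1:], rcond=None)
    return coef[0] + coef[1] * series[-1]

def hybrid_forecast(components):
    """Forecast every component separately, then sum the forecasts as in (4)."""
    return sum(ar1_forecast(c) for c in components)

# Synthetic components standing in for the IMFs and the residual.
t = np.arange(50, dtype=float)
components = np.vstack([0.5 * t + 3.0, -0.2 * t + 1.0, np.full(50, 7.0)])
forecast = hybrid_forecast(components)
```

Fitting one model per component lets each predictor specialize in a single frequency scale, and the additive form of (4) means the component forecasts can simply be summed to obtain the forecast of the closing price.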

Experimental Methods and Results
This section has four sub-sections. The decomposition results of EMD are discussed in the first sub-section. In the second sub-section, FinBERT with Thai news fine-tuning is compared with the original FinBERT and other sentiment analysis models to verify its effectiveness. In the third sub-section, different combinations of the PCA, EMD, LSTM and news components are presented to validate the proposed model from several perspectives. In the last sub-section, different advanced versions of EMD are compared to identify the best version.

Experimental Methods and Results of the Decomposition Component by EMD
To construct the prediction model, the historical closing price data were transformed into new data using EMD. Figure 7 shows the decomposition results. Seven IMFs were decomposed from the original closing price sequence, ordered from high to low frequency. The EMD process is repeated until only one global maximum and one global minimum remain, as shown for IMF 7 in Figure 7. The number of IMFs depends on the raw data: it changes if the raw data change, but applying EMD to the same data always yields the same number of IMFs.
The IMFs can be divided into three groups. The first group consists of the high-frequency components of the original data, represented by the first few IMFs with a lot of noise. The second group consists of the middle-frequency components, represented by the center IMFs with a medium amount of noise. The last group consists of the low-frequency components, represented by the last few IMFs with little noise. Moreover, the last IMF is comparable to the trend of the stock. It is common to hypothesize that the LSTM can accurately predict low-frequency IMFs but will struggle with high-frequency IMFs. To maximize the prediction efficiency, the LSTM is trained individually on each IMF. Thus, the hyper-parameters, the number of hidden layers and the weights are different for each IMF. This is the significant difference that makes a hybrid EMD-LSTM model perform better than a single LSTM model, which is applied directly to the original closing price time series with its noise and volatility.
The IMFs were obtained by subtracting from the original closing price, so the sum of all IMFs is identical to the original series. For this reason, the sum of the prediction results of all IMFs can be considered as the prediction result for the original closing price.

Experimental Methods and Results of News Sentiment Analysis
FinBERT with Thai news fine-tuning was used as the financial news sentiment analysis model in the feature engineering part of the proposed model. In order to assess the efficacy of Thai news analysis, FinBERT with Thai news fine-tuning was compared to the original FinBERT, i.e., FinBERT with default settings, and to other popular sentiment analysis models, such as VADER [50] and TextBlob [51].
The financial news dataset was annotated manually. A total of 1500 articles were randomly sampled from the news data and labeled with one of three sentiment classes: positive, negative and neutral. The annotated dataset was used for supervised fine-tuning of FinBERT and for testing model performance. The first 80% of the data was used as the training dataset for supervised fine-tuning and the remaining 20% was used as the testing dataset to evaluate model performance. Moreover, each class has an equal number of examples in the testing set. The other models were evaluated on the same training and testing datasets as FinBERT with Thai news fine-tuning.
Table 3 shows that FinBERT with Thai news fine-tuning has the highest average accuracy and average F1-score of the compared models. When considering the F1-score in each class, FinBERT with Thai news fine-tuning also has the highest value. Both FinBERT models perform similarly well when categorizing news as Class Negative. In classes Positive and Neutral, FinBERT with Thai news fine-tuning has a moderate F1-score value. However, both VADER and TextBlob have very low model performance for this dataset.
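For readers who want to see how such per-class and average scores are obtained, the following is a minimal sketch of accuracy and F1-score for a three-class problem. It assumes the average F1-score is the unweighted (macro) mean over classes, which the text does not state; the function name `evaluate_sentiment` and the six toy labels are hypothetical and are not data from Table 3.

```python
def evaluate_sentiment(y_true, y_pred, labels):
    """Return accuracy, per-class F1 and macro-averaged F1."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    f1_per_class = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_per_class[label] = 2 * precision * recall / denom if denom else 0.0
    # Assumption: "average F1" means the unweighted mean over the classes.
    macro_f1 = sum(f1_per_class.values()) / len(labels)
    return accuracy, f1_per_class, macro_f1

labels = ["positive", "negative", "neutral"]
# Toy, class-balanced test set (hypothetical labels).
y_true = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
y_pred = ["positive", "neutral", "negative", "negative", "neutral", "positive"]

accuracy, f1_per_class, macro_f1 = evaluate_sentiment(y_true, y_pred, labels)
print(accuracy, f1_per_class, macro_f1)
```

With a class-balanced test set, as used in the paper, the macro and support-weighted averages of the F1-score coincide.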

Experimental Methods and Results of the Proposed Framework
The proposed framework was used to predict the closing price of the stock market. The framework contains many processes in both the feature engineering part and the prediction model part. Therefore, this sub-section verifies the efficacy of each process of the proposed model. A set of sensitivity experiments was established using various combinations of the EMD, LSTM, PCA and financial news components to validate the proposed model from several perspectives. The experimental design can be seen in Table 4, and the output data for all experiments are the closing price of the stock market in Thailand.
• Experiment 1 compares the EMD-LSTM model with other prediction models. In this experiment, the models are applied directly to the original closing price without additional input features. The comparison evaluates whether the EMD-LSTM model effectively improves prediction outcomes over state-of-the-art models in stock price time series modeling.
• Experiment 2 compares the effects of adding principal components and technical indicators to EMD-LSTM. This experiment adds an input feature to the proposed model, either the original technical indicators or the principal components from PCA. The experiment aims to show whether the principal components from PCA can improve the prediction of EMD-LSTM. Comparing the principal components with the original technical indicators as input features examines whether applying PCA effectively improves prediction outcomes by mitigating the curse of dimensionality.
• Experiment 3 compares the effects of adding the news sentiment score to the prediction models. This experiment adds an input feature from FinBERT to the proposed model. The experiment identifies whether applying a news sentiment score improves model performance.

Comparisons Result between the EMD-LSTM Model and Other Models
In this experiment, the prediction models are applied directly to the original closing price to estimate the prediction performance. The EMD-LSTM, LSTM and ARIMA models are compared. Table 5 shows that the EMD method has great advantages in predicting the closing price of the stock market, with MAE dropping by 56.13% when compared to LSTM and by 85.67% when compared to the ARIMA model. Moreover, LSTM and ARIMA had a close model performance. This implies that a single model cannot adequately capture the data patterns or make accurate predictions. In addition, Figure 8 shows the predictive results of the LSTM and EMD-LSTM, revealing that the predicted values of the EMD-LSTM series visibly deviate from the original data.
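As a reading aid for the percentage figures above, the sketch below shows how MAE and a relative MAE reduction can be computed. It assumes the reported reductions are measured relative to the baseline model's MAE, which the text does not state explicitly; the function names and the toy numbers are hypothetical and are not values from Table 5.

```python
def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def percent_reduction(baseline_error, model_error):
    """Relative error reduction of a model over a baseline, in percent."""
    return 100.0 * (baseline_error - model_error) / baseline_error

# Toy closing prices and predictions (hypothetical values).
actual = [10.0, 11.0, 12.0, 13.0]
baseline_pred = [9.0, 12.0, 11.0, 14.0]   # e.g., a single model
hybrid_pred = [9.5, 11.5, 12.5, 12.5]     # e.g., a hybrid model

baseline_mae = mae(actual, baseline_pred)
hybrid_mae = mae(actual, hybrid_pred)
reduction = percent_reduction(baseline_mae, hybrid_mae)
print(baseline_mae, hybrid_mae, reduction)  # 1.0 0.5 50.0
```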

Comparisons of the Effects of Adding Principal Component and Technical Indicator to EMD-LSTM
The previous experiment showed that the EMD-LSTM model outperforms the other prediction models. This experiment adds an input feature to the proposed model, either the original technical indicators or the principal components from PCA. To make predictions with the EMD-LSTM model, each IMF was predicted with an LSTM and the additional input feature. After tuning the LSTM model, the optimal hyper-parameters were obtained to achieve the best prediction results for the IMFs, as shown in Table 6. The batch size was between 8 and 32, while the number of epochs was between 150 and 300. In addition, the other settings of the LSTM model included hyperbolic tangent as the activation function, ADAM as the optimizer, mean squared error as the loss function and a learning rate of 0.001.

Table 6. Hyper-parameters of LSTM.
IMF    Batch Size    Epoch
1      8             300
2      8             250
3      32            200
4      16            180
5      16            200
6      16            150
7      32            150

The experimental results of adding the input feature to each IMF are shown in Table 7. The results show that the principal components can improve the efficiency of the model and outperform both the IMF-only input and the technical indicators in the first four IMFs. Nevertheless, from the fifth IMF onward, the model that uses only the IMF value outperforms the models using additional input features, as can be seen in Figure 9. Meanwhile, models with technical indicators as an input feature perform the worst across all IMFs. In addition, Figure 10 shows the prediction results for each IMF of the closing price on the testing set. Due to the high frequency of the components, the prediction values of the first several IMF components explicitly diverge from the original data, whereas the prediction values of the last IMF nearly match the original data.
Next, the prediction results of each IMF are assembled in order to compare the final closing price prediction results. The PCA-EMD-LSTM model uses principal components as an input feature to predict IMF 1 to 4 but uses only the IMF value for IMF 5 to 7. Table 8 shows that PCA-EMD-LSTM achieves the best prediction result, closely followed by the model that uses only the IMF value, whereas models using technical indicators as the input feature have the worst prediction results.
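The per-IMF settings described in this sub-section can be collected in a small configuration structure, sketched below. The batch sizes and epochs are those of Table 6, and the rule of using principal components only for IMF 1 to 4 follows the text; the class name `ImfConfig`, the dictionary keys and the column names such as `imf_1` and `pc_1` are hypothetical and are not taken from the original implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ImfConfig:
    batch_size: int
    epochs: int
    use_principal_components: bool

# (batch size, epochs) per IMF, as listed in Table 6.
TABLE_6 = {1: (8, 300), 2: (8, 250), 3: (32, 200), 4: (16, 180),
           5: (16, 200), 6: (16, 150), 7: (32, 150)}

# Settings shared by all per-IMF LSTM models, as stated in the text
# (the key names are illustrative).
SHARED_SETTINGS = {"activation": "tanh", "optimizer": "adam",
                   "loss": "mse", "learning_rate": 0.001}

# Principal components are used for IMF 1 to 4 only; IMF 5 to 7 use the IMF value alone.
IMF_CONFIGS = {
    imf: ImfConfig(batch, epochs, use_principal_components=(imf <= 4))
    for imf, (batch, epochs) in TABLE_6.items()
}

def input_columns(imf, n_components):
    """List the (hypothetical) input column names for one per-IMF model."""
    columns = [f"imf_{imf}"]
    if IMF_CONFIGS[imf].use_principal_components:
        columns += [f"pc_{k}" for k in range(1, n_components + 1)]
    return columns

print(input_columns(2, 2))  # ['imf_2', 'pc_1', 'pc_2']
print(input_columns(5, 2))  # ['imf_5']
```

Keeping the settings in one table-like structure makes it easy to train one model per IMF in a loop while still honoring the individually tuned values.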

Comparisons of the Effects of Adding News Sentiment Score to Prediction Models
The preliminary models in the previous experiments used only input features from historical data. In this experiment, the news sentiment score was added to the prediction models to identify whether it improves their performance.
Table 9 shows that the news sentiment score has great advantages as an input feature in stock price prediction, with MAE dropping by 20.82% when compared to a single LSTM. On the other hand, the prediction results become worse when the news sentiment score is added to PCA-EMD-LSTM, compared with PCA-EMD-LSTM alone. Evidently, it is better to leave the news sentiment component out of the proposed model, although the news sentiment score can improve the performance of the original LSTM model.

Comparison Results between Different Advanced Versions of EMD
From the previous three experiments, the best architecture of the proposed model was PCA-EMD-LSTM. However, there are advanced versions of EMD such as EEMD and CEEMDAN. Therefore, this experiment replaced the EMD part of the proposed model with EEMD and CEEMDAN, giving models called PCA-EEMD-LSTM and PCA-CEEMDAN-LSTM, respectively. To create the prediction model, the closing price of the stock market in Thailand and the principal components from PCA are used as input features.
Table 10 shows that PCA-CEEMDAN-LSTM had the lowest model performance and PCA-EEMD-LSTM had a moderate model performance. On the other hand, using EMD as the decomposition method is the most effective for prediction with PCA-LSTM. In summary, the PCA-EMD-LSTM architecture had the highest model performance, and the news sentiment score part, shown with a blue background in Figure 3, can be excluded.

Discussion
In order to verify the effectiveness of the proposed hybrid framework, several experiments on various factors were examined. The observation results are as follows:

• The EMD-LSTM model outperforms state-of-the-art benchmark models, indicating that decomposition with EMD decreases the complexity of the sequences and improves prediction performance. EMD decomposes the original signal into minor components based on their frequencies. In order to maximize the prediction effectiveness, the LSTM is trained individually on each component; therefore, the hyper-parameters and weights are different for each component. This is the significant difference that makes a hybrid EMD-LSTM model perform better than a single LSTM.
• The prediction results show that applying the principal components from PCA to the EMD-LSTM can help to reduce prediction errors in the first few IMFs. This indicates that the PCA method creates useful features from technical indicators for improving the prediction of high-frequency IMFs. From the obtained results in Table 10, the PCA-EMD-LSTM achieves the highest prediction performance for the closing price of the stock market. However, based on the experimental results in Table 5, the MAPE value of the EMD-LSTM model is only slightly higher than the obtained MAPE result of the PCA-EMD-LSTM. Therefore, further experiments on different datasets are required to verify the performance improvement of using PCA in the EMD-LSTM model.
• Applying the news sentiment score to the EMD-LSTM does not improve prediction results in every IMF. On the other hand, adding the news sentiment score can improve the original LSTM performance. This means that news sentiment can be used to predict the closing price of the stock market, but it cannot be used to predict the decomposed components of the closing price.
To increase the efficiency of the proposed framework, there are a number of gaps that need further development. For example, IMFs may be adaptively predicted by various traditional or hybrid machine learning models. Recently, a novel approach [52] for selecting an effective combination of machine learning models for time series forecasting was proposed. This approach could be applied to the proposed framework to improve the prediction results of each IMF and the prediction of the closing price of the stock market. In addition, based on a recently published research study [53], an interesting decomposition method, i.e., a hybrid time series decomposition strategy (HTD), could be applied instead of EMD for further improvement of the proposed framework.

Conclusions
In this research, a hybrid framework based on the combination of PCA, EMD and LSTM is proposed to predict the closing price of the stock market one step ahead. Moreover, the proposed model is capable of combining both historical and textual data as input features. The overall design of the proposed system is separated into two parts: the feature engineering part and the prediction model part. The feature engineering part is used to create input features for the prediction model. It has two main processes: computing the finance and economics news sentiment score using FinBERT with Thai news fine-tuning, and computing the principal components of the technical indicators using PCA. The prediction model part is used to predict the closing price of the stock market. Historical data were decomposed into several IMFs via EMD. Next, LSTM was utilized to predict each IMF along with the input features from the previous part. Finally, the prediction values of all IMFs were combined to produce the final stock price prediction. Based on the results of the experiments, the proposed framework using PCA, EMD and LSTM had the best prediction performance for the closing price of the stock market. Moreover, based on the experimental results obtained with the LSTM model, the performance of the original LSTM is improved when applying news sentiment analysis.
Future research can be conducted in order to optimize the model's efficiency. For example, different machine learning algorithms can be adaptively selected for the different IMFs after decomposing the original data. This process may improve the prediction results of each IMF and the prediction of the closing price of the stock market. In addition, the effect of different sets of technical indicators can be explored to find the best set for IMF prediction.
Author Contributions: K.S., Y.L. and T.T. did comprehensive searching for research background. K.S. and Y.L. collected the experimental data. T.T. developed the research framework. K.S. and T.T. designed the research methodology. Y.L. did programming and implementation. All authors evaluated the research framework. K.S. and T.T. performed data analysis and wrote the manuscript. K.S. submitted the manuscript for publication and communicated with the journal editor. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement: The raw data of this work are available from investing.com and six news agencies in Thailand upon request.