Using Deep Learning Techniques in Forecasting Stock Markets by Hybrid Data with Multilingual Sentiment Analysis

Abstract: Electronic word-of-mouth data on social media influence stock trading and the confidence of stock markets. Thus, sentiment analysis of comments related to stock markets becomes crucial in forecasting stock markets. However, current sentiment analysis is mainly conducted in English. Therefore, this study performs multilingual sentiment analysis by translating texts from non-native English-speaking countries into English. This study used unstructured data from social media and structured data, including trading data and technical indicators, to forecast stock markets. Deep learning techniques and machine learning models have emerged as powerful ways of coping with forecasting problems, and parameter determination greatly influences the performance of forecasting models. This study used Long Short-Term Memory (LSTM) models employing the genetic algorithm (GA) to select parameters for predicting stock market indices and prices of company stocks with hybrid data in non-native English-speaking regions. Numerical results revealed that the developed LSTMGA model with hybrid multilingual sentiment data generates more accurate forecasts than the other machine learning models with various data types. Thus, the proposed LSTMGA model with hybrid multilingual sentiment analysis is a feasible and promising way of forecasting the stock market.


Introduction
Many studies have pointed out that when investigating social issues, data on social media behaviors are more reflective of people's real thinking than questionnaire data. In addition, data can be collected from social media in near real time [1]. Numerous studies have shown that considering fundamental analysis data, such as financial web news or posts on social media platforms, can effectively improve the performance of stock price forecasting. Furthermore, roughly one-third to two-thirds of investors used social media in their investment decisions for collecting and learning information about companies of interest. Therefore, social media comments have a certain degree of impact on stock prices [2]. Wu et al. [3] used historical stock data, technical indicators, and non-traditional data, such as stock posts and financial news, to predict stock prices with long short-term memory. The non-traditional data were fed into a convolutional neural network to calculate investors' sentiment index. The experimental results showed that the proposed method provided more accurate values than a single data source. Ko and Chang [4] applied a natural language processing tool to recognize the sentiment of news texts and PTT bulletin board system (BBS) forum discussions. The long short-term memory approach was used to forecast stock prices. Numerical results illustrated that using news and PTT attributes did improve forecasting accuracy. Ren et al. [5] employed support vector machines with financial market data and sentiment indexes extracted from news to forecast stock market movements. The day-of-week effect was considered when encoding the input data so that the BiLSTM could learn more information from the input context. The empirical results pointed out that the developed model can yield excellent performance. Araújo et al. [20] investigated the performance of language-specific sentiment analysis compared with existing sentiment analysis methods for English content. This study pointed out that translating texts expressed in different languages into English and then using existing English sentiment analysis tools can generate better results than employing language-specific techniques directly in evaluating the sentiment of multilingual text.
Deep learning techniques have demonstrated promising capabilities in capturing non-linear characteristics and thus can yield accurate predictions in stock market forecasting. Luo et al. [21] presented a long short-term memory model to forecast stock market profits. The adaptive shuffled frog-leaping algorithm was developed to search for appropriate hyper-parameters. This study illustrated the superiority of the proposed model by performing comparisons with artificial neural networks, support vector machines, gray models, and basic long short-term memory. Kanwal et al. [22] designed a hybrid deep learning model, namely BiCuDNNLSTM-1dCNN, integrating a Bidirectional Cuda Deep Neural Network Long Short-Term Memory and a one-dimensional Convolutional Neural Network to conduct stock price predictions. Two datasets were employed to examine forecasting performances, including individual stock items and the stock market's performance indices. This investigation pointed out that the presented models can generate accurate forecasting results helpful in decision-making in stock market investments. Wang et al. [23] utilized the Transformer model to forecast the stock market indices of the CSI 300, S&P 500, Hang Seng Index, and Nikkei 225. More underlying rules can be described by the encoder-decoder architecture and multi-head attention mechanism. This study indicated that the Transformer outperformed other classic techniques and was useful to investors. Gao et al. [24] employed the evidential rule and the genetic algorithm on recurrent neural networks to predict daily movement directions of the S&P 500 index, Dow Jones Industrial Average index, and NASDAQ 100 index. The numerical results indicated that the designed model effectively improved classification performances. Kumar et al. [25] presented a long short-term memory network and adaptive particle swarm optimization (PSO)-based hybrid deep learning model to forecast the stock prices in Sensex, S&P 500, and Nifty 50.
PSO was used to provide initial weights of the long short-term memory and the fully connected layer. This investigation revealed that the proposed model could generate accurate forecasting results. Aldhyani and Alzahrani [26] developed a hybrid convolutional neural network with long short-term memory (CNN-LSTM) to predict the closing prices of stock markets. Closing stock prices of two corporations, namely Tesla, Inc. and Apple, Inc., were utilized to measure the forecasting performances of the proposed model. This study indicated that the CNN-LSTM model is superior to the basic LSTM model in forecasting accuracy. Ratchagit and Xu [27] presented a two-delay approach for three deep learning techniques, including MLP, CNN, and LSTM. Stock data of three companies, namely Microsoft Corporation, Johnson & Johnson, and Pfizer Inc., were used to investigate forecasting performances. Numerical results illustrated that the proposed two-delay model outperformed other linear combination forecasting techniques. Table 1 lists the recent related literature in 2022 regarding data types, problem types, and stock markets. Most previous studies employed either structured data or unstructured data individually in analyzing stock markets. Therefore, this study attempted to exploit the unique strengths of structured and unstructured data in enhancing the capabilities of LSTMGA models for predicting corporations' stock prices and stock market indices in a regression way. Additionally, in this study, multilingual social media posts were translated into English by Google Translate, and then SentiStrength was used to evaluate the sentiment of the posts in English.
The other four forecasting methods, backpropagation neural networks (BPNN), least squares support vector regression (LSSVR), random forest (RF), and extreme gradient boosting (XGBoost), were employed to conduct forecasting tasks with the same data, and the genetic algorithm was used to determine the parameters of the models. The forecasting performances were measured by the mean absolute percentage error (MAPE) and the root mean square error (RMSE). The rest of this study is organized as follows. Section 2 introduces long short-term memory networks. Section 3 illustrates the architecture of this study for forecasting stock prices with sentiment analysis. Section 4 depicts the experimental results of the proposed models. Conclusions are provided in Section 5.

Long Short-Term Memory Networks
Long short-term memory networks (LSTM), proposed by Hochreiter and Schmidhuber [28], have recently been successfully applied in various forecasting fields, such as stock price movements [11], pandemics [29], rainfall [30], sea levels [31], energy consumption [32], and sales [33]. LSTM copes with the gradient-vanishing problem and the difficulty of capturing long-term dependencies that recurrent neural networks face when processing long sequences. Cell states are added to the long short-term memory to store long-term memory. Hence, essential information can be stored for a long time, and earlier information can be connected with current tasks. Figure 1 illustrates the memory cell of long short-term memory networks [34-36]. The long short-term memory network is composed of the forget gate (xf_t), the input gate (including xi_t and the candidate state c_t), the output gate (xo_t), and the cell state (C_t). The forget gate controls what information will be discarded and how much information will be added to the next cell's memory. The input gate determines whether new information enters the memory. The output gate defines whether the updated information should be transferred to the next layer of the network. The long short-term memory network maps the input sequence {x} to the output sequence {y}, represented by Equations (1) and (2), respectively:

{x} = x_(t−1), x_t, x_(t+1), ..., x_(t+n),   (1)

{y} = y_(t−1), y_t, y_(t+1), ..., y_(t+n).   (2)

In the first step, the long short-term memory regulates whether information should be discarded or stored from the previous cell state C_(t−1). Thus, the forget gate (xf_t) is constructed by Equation (3), calculated from the current input and the hidden state at time t − 1:

xf_t = σ(W_f x_t + U_f h_(t−1) + b_f),   (3)

where σ represents the activation function and maps the variable to values between 0 and 1; x_t is the input vector to the LSTM unit, W is the weight matrix of the input process, U is the weight matrix of the state transitions, and b is the bias vector. Then, the input gate, which includes xi_t and c_t, controls the information added to the network, represented by Equations (4) and (5), respectively:

xi_t = σ(W_i x_t + U_i h_(t−1) + b_i),   (4)

c_t = tanh(W_c x_t + U_c h_(t−1) + b_c),   (5)

where tanh(·) represents the activation function and maps the variable to values between −1 and 1; c_t and xf_t are employed to produce the new state of the memory cell C_t, which can be expressed by Equation (6):

C_t = xf_t ⊙ C_(t−1) + xi_t ⊙ c_t.   (6)

The cell state vector C_t is utilized to calculate the output gate xo_t, as demonstrated in Equation (7):

xo_t = σ(W_o x_t + U_o h_(t−1) + b_o).   (7)
Finally, the output vector h_t of long short-term memory networks is expressed by Equation (8):

h_t = xo_t ⊙ tanh(C_t).   (8)

The cell state and activation vector are used to generate the output in the output gate, and the weights and bias terms are adjusted to minimize the loss of the objective function in the training phase. The vanishing gradient problem has been successfully solved by the long short-term memory network architecture [11,28,37].
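For illustration, the gating computations of Equations (3)-(8) can be sketched as a single forward step in NumPy. This is an illustrative re-implementation, not the code used in this study; the dimensions, random initialization, and helper names are assumptions for the demo.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, U, b):
    """One LSTM step following Equations (3)-(8). W, U, b hold the
    weights/biases for the forget (f), input (i), candidate (c), and
    output (o) transforms."""
    f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])        # forget gate, Eq. (3)
    i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])        # input gate, Eq. (4)
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate state, Eq. (5)
    C = f * C_prev + i * c_tilde                                # cell state update, Eq. (6)
    o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])        # output gate, Eq. (7)
    h = o * np.tanh(C)                                          # hidden output, Eq. (8)
    return h, C

# Tiny demo with random weights (2 inputs, 3 hidden units).
rng = np.random.default_rng(0)
n_in, n_hid = 2, 3
W = {k: rng.standard_normal((n_hid, n_in)) * 0.1 for k in "fico"}
U = {k: rng.standard_normal((n_hid, n_hid)) * 0.1 for k in "fico"}
b = {k: np.zeros(n_hid) for k in "fico"}
h, C = np.zeros(n_hid), np.zeros(n_hid)
for x_t in rng.standard_normal((5, n_in)):  # a length-5 input sequence
    h, C = lstm_step(x_t, h, C, W, U, b)
```

Because the output gate lies in (0, 1) and tanh in (−1, 1), every component of h stays strictly inside (−1, 1), which is one reason the architecture keeps gradients well behaved.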

The Proposed Architecture for Predicting Stock Markets
Influences of social media on stock market prices have been investigated in English contexts, but studies in non-native English-speaking countries have not been explored widely. This study intended to forecast the closing values of stock market indices and stock prices in five non-native English-speaking countries. It has been pointed out that fundamental analysis, including sentiment on social media, and technical analysis, containing technical indices, are essential in forecasting stock prices [38,39]. Figure 2 depicts the architecture of this study. Fundamental analysis influences stock prices. Thus, tweets and posts related to stock market information, collected from the social media platforms Twitter and PTT, were employed as the fundamental data in this study. In addition, technical indicators and trading data served as the technical analysis impacting stock prices. In the data collection phase, fundamental data and technical data were gathered. Thus, two parts comprise the architecture: data collection and preprocessing, and model training and testing. In the data collection and preprocessing phase, the multilingual posts collected from Twitter and PTT were transformed into sentiment scores, the stock trading data from Yahoo Finance were gathered, and technical indicators were generated. In this study, three datasets, dataset A, dataset B, and dataset C, were used for forecasting stock markets individually, and the forecasting performances generated by the three datasets were compared. Dataset A was unstructured data, dataset B was structured data, and dataset C was a hybrid dataset consisting of dataset A and dataset B. Finally, each dataset was divided into a training dataset and a testing dataset with percentages of roughly 80% and 20%.
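The 80/20 partition described above can be sketched as a simple chronological cut; the time-series setting requires that the data are not shuffled, so the test period strictly follows the training period. The sample length below is a stand-in for roughly one year of trading days.

```python
def chronological_split(series, train_ratio=0.8):
    """Split a time-ordered dataset into training and testing parts
    without shuffling, so the test period follows the training period."""
    cut = int(len(series) * train_ratio)
    return series[:cut], series[cut:]

days = list(range(100))  # stand-in for a year of time-ordered observations
train, test = chronological_split(days)
```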


Data Collection and Preprocessing
In this study, both unstructured data and structured data were collected for stock market forecasting. The unstructured data included posts on Twitter and PTT. Selenium and BeautifulSoup were utilized to crawl users' posts from 1 January 2019 to 31 December 2019 in five non-native English-speaking countries, with one stock market and one corporation for each country. One keyword was used for collecting social media data for each stock market or company. Thus, ten keywords in total were utilized to gather multilingual posts; the number of original posts and the number of posts after data processing are illustrated in Table 2. Table 3 illustrates an example of the four main steps of data cleaning. The data of each posting contain the posting UTC time (Coordinated Universal Time) and the content of the post. First, the time of each post was adjusted to align with the stock trading data for each country. Second, posts with the same content, days without any post, and the paragraph notations in each post were removed. Third, Araújo et al. [20] revealed that translating the input text into English and then applying an existing sentiment analysis tool can yield more encouraging results; thus, the GOOGLETRANSLATE function [40-42] for multilingual text translation in Google Sheets was used to translate all posts into English. Finally, the refined posts were scored by SentiStrength [43-45] to generate sentiment scores. Studies have pointed out that SentiStrength is a promising tool for sentiment analysis [43,46,47]. In addition, SentiStrength is a lexicon-based technique and outperforms conventional machine learning techniques in sentiment analysis [48]. Thus, in this study, SentiStrength was employed to evaluate the sentiment of posts. SentiStrength divides posts into positive and negative sentiment polarities with five levels each. The positive polarity is from 1 to 5, and the negative polarity is from −1 to −5. The level of 0 does not exist. Each post has one positive level and one negative level. In this study, daily scores were the accumulations of each post's levels on that day. Table 2 lists the countries, languages, codes, keywords, numbers of posts, and names of the stock markets and corporations for the five countries. The structured data were collected from Yahoo Finance from 1 January 2019 to 31 December 2019. Trading data and technical indicators served as the other 12 independent variables, illustrated in Table 4. Trading data consisted of open, high, low, close, adjusted close, and volume. Technical indicators included K%, D%, William R%, RSI, MACD, PSY, MA, and BIAS, and were computed from the historical stock market data gathered from Yahoo Finance. Trading data and the technical indicators were integrated into dataset B. The trading data and technical indicators were denoted from x_11 to x_14 and from x_15 to x_22, respectively. Finally, the two datasets were combined into dataset C. Tables 4 and 5 show all independent variables and the three datasets, respectively.
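The daily accumulation of SentiStrength levels can be illustrated as follows. The post tuples and dates are hypothetical; a real pipeline would read the translated posts and their SentiStrength outputs.

```python
from collections import defaultdict

def daily_sentiment(posts):
    """Accumulate each post's positive and negative SentiStrength levels
    into one score per day. `posts` is a list of (date, pos, neg) tuples,
    with pos in 1..5 and neg in -5..-1 (a level of 0 never occurs)."""
    scores = defaultdict(int)
    for date, pos, neg in posts:
        scores[date] += pos + neg
    return dict(scores)

posts = [
    ("2019-03-01", 3, -1),  # mildly positive post
    ("2019-03-01", 1, -4),  # strongly negative post
    ("2019-03-02", 5, -1),  # strongly positive post
]
daily_sentiment(posts)  # {"2019-03-01": -1, "2019-03-02": 4}
```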

Table 4. Independent variables (excerpt): x_6–x_10 = sentiment scores +1 to +5; x_16 = D%; x_17 = William R%; x_18 = RSI; x_19 = MACD; x_20 = PSY; x_21 = MA; x_22 = BIAS.

Table 5. Three datasets used in this study.

Dataset   Content of Data                                                      Variables
Data A    Sentiment scores of posts                                            x_1–x_10
Data B    Trading data and technical indicators                                x_11–x_22
Data C    Sentiment scores of posts, trading data, and technical indicators    x_1–x_22
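The composition of the three datasets in Table 5 can be sketched as simple feature lists; the column names are hypothetical placeholders for the variables x_1 to x_22.

```python
# Hypothetical column names matching Table 5's variable ranges.
SENTIMENT = [f"x{i}" for i in range(1, 11)]    # x1-x10: daily sentiment scores
TECHNICAL = [f"x{i}" for i in range(11, 23)]   # x11-x22: trading data + indicators

datasets = {
    "A": SENTIMENT,              # sentiment features only
    "B": TECHNICAL,              # trading/indicator features only
    "C": SENTIMENT + TECHNICAL,  # hybrid dataset
}
```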

The Training and Testing of Models
Three datasets were divided into training data and testing data; the portions of the training and testing data were approximately 80% and 20%. Table 6 lists the training and testing data periods for the stock markets and corporation stocks. The training data were used to train the forecasting models, and the testing data were employed to evaluate the performances of the forecasting models. Trial-and-error methods [49,50] and meta-heuristics [51-53] are two major ways of determining the parameters of machine learning models. Trial-and-error methods rely heavily on users' experience, and as the number of parameters increases, the difficulty of parameter determination rises. Thus, meta-heuristics have been an effective and useful way of determining parameters with a relatively mitigated computational burden. In this study, the genetic algorithm was used to select the parameters of the forecasting models. Developed by John H. Holland [54,55], the genetic algorithm copes with optimization problems in a directed-search way to find a near-optimal or optimal solution. The details of the genetic algorithm are presented in Figure 3. The first step is to generate the initial population randomly. The population includes chromosomes, and each chromosome consists of genes. Then, the fitness is calculated to measure the quality of each chromosome. The chromosomes with higher fitness values are selected and retained. Subsequently, crossover and mutation operations are conducted to reproduce the next generation with newly generated chromosomes. For the long short-term memory model, the genetic algorithm tuned three hyper-parameters: the dropout rate, the learning rate, and the batch size. The genetic algorithm was used with a population size of 20, a maximum of 100 generations, and crossover and mutation ratios of 0.8 and 0.2, respectively.
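The GA loop described above (population 20, 100 generations, crossover ratio 0.8, mutation ratio 0.2) can be sketched as follows. The fitness function here is a stand-in placeholder: in the actual study, fitness would come from training an LSTM with the candidate dropout rate, learning rate, and batch size and evaluating its validation error. The search ranges and helper names are assumptions.

```python
import random

random.seed(42)

# Assumed search ranges for the three LSTM hyper-parameters tuned by the GA.
BOUNDS = {"dropout": (0.0, 0.5), "lr": (1e-4, 1e-2), "batch": (8, 128)}

def random_chromosome():
    return {k: random.uniform(*v) for k, v in BOUNDS.items()}

def fitness(ch):
    """Stand-in for (negated) validation error of an LSTM trained with
    these hyper-parameters; a real run would train the model here."""
    return -((ch["dropout"] - 0.2) ** 2
             + (ch["lr"] - 0.003) ** 2
             + ((ch["batch"] - 64) / 128) ** 2)

def crossover(a, b):
    # Uniform crossover: each gene comes from one of the two parents.
    return {k: random.choice([a[k], b[k]]) for k in BOUNDS}

def mutate(ch, rate=0.2):
    # Each gene is re-drawn from its range with probability `rate`.
    return {k: (random.uniform(*BOUNDS[k]) if random.random() < rate else v)
            for k, v in ch.items()}

def run_ga(pop_size=20, generations=100, cx_rate=0.8):
    pop = [random_chromosome() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]  # selection: retain the fitter half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = crossover(a, b) if random.random() < cx_rate else dict(a)
            children.append(mutate(child))
        pop = parents + children
    return max(pop, key=fitness)

best = run_ga()  # a real run would round best["batch"] to an integer
```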
Thus, multilingual fundamental data collected from social media platforms and technical data, including historical trading data and technical indicators, were employed to train the five forecasting models with the genetic algorithm to predict stock indices and corporations' stock prices.


Numerical Results
This study used the LSTMGA model to forecast stock indices and corporations' closing stock prices. The forecasting results were compared with those of four other methods: BPNNGA, LSSVRGA, RFGA, and XGBoostGA. The long short-term memory network model used in this study included 50 neurons. The time stamp, epoch, and optimizer were set as 1, 100, and Nadam, respectively. Tables 7 and 8 list the parameters of the forecasting models for dataset A, dataset B, and dataset C of the five stock market indices and corporations' stock prices. Two measurements, MAPE and RMSE, were used to demonstrate the performances of the forecasting models, as illustrated in Equations (9) and (10):

MAPE = (100%/N) × Σ_{t=1}^{N} |(A_t − F_t)/A_t|,   (9)

RMSE = √((1/N) × Σ_{t=1}^{N} (A_t − F_t)²),   (10)
where N is the number of forecasting periods, and A_t and F_t are the actual value and the forecast value at time t, respectively. The MAPE value is independent of the data scale and can be used to compare the performances of forecasting models across various data scales. The RMSE is one of the most commonly used measurements for evaluating forecasting accuracy; however, RMSE values are theoretically influenced by data scales [56]. Figures 4 and 5 depict the forecasting results. Table 9 illustrates the MAPE and RMSE values of the forecasting models with the three datasets, and the best results for each dataset in individual cases and on average are in bold. Numerical results indicated that LSTMGA models were superior to the other four machine learning models in forecasting accuracy using hybrid data. MAPE is expressed as a percentage, which makes it straightforward to compare forecasting performance between datasets whose scales vary, whereas RMSE values are easily influenced by data scales. Furthermore, according to Lewis's MAPE forecasting accuracy levels [57], LSTMGA models provided excellent prediction results, with MAPE values of less than 5% when using dataset C for all forecasting cases in this study.
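Equations (9) and (10) translate directly into code; the sample values below are illustrative only, not results from this study.

```python
import math

def mape(actual, forecast):
    """Mean absolute percentage error, Equation (9), in percent."""
    return 100.0 * sum(abs((a - f) / a)
                       for a, f in zip(actual, forecast)) / len(actual)

def rmse(actual, forecast):
    """Root mean square error, Equation (10)."""
    return math.sqrt(sum((a - f) ** 2
                         for a, f in zip(actual, forecast)) / len(actual))

actual = [100.0, 102.0, 101.0, 105.0]     # hypothetical closing prices
forecast = [101.0, 101.0, 102.0, 104.0]   # hypothetical model output
```

Note that MAPE is scale-free (each error is divided by the actual value), while RMSE keeps the units of the underlying series, matching the comparison caveat discussed above.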
Tables 7 and 8 report the parameter values selected by the genetic algorithm for each model and dataset. Notation: ntree = the number of trees to grow; mtry = the number of variables used at each split; nodesize = the minimum size of terminal nodes; samplesize = the sample sizes to draw; maxnodes = the maximum number of terminal nodes that trees in the forest can have; colsample_bytree = the subsample percentage of columns while generating new trees; subsample = the subsample ratio of training cases; max_depth = the maximum depth of the tree; eta = the learning rate; min_child_weight = the minimum sum of weights related to child nodes; lambda = the L2 regularization term of weights; AVG = average; GA = the genetic algorithm.
The nonparametric Wilcoxon signed-rank test [58] was used to compare the performances of LSTMGA models with the three different datasets. Based on the null hypothesis that the median of the differences between dataset C and dataset A or dataset B equals 0, Table 10 illustrates the Wilcoxon signed-rank test results for dataset C versus dataset A and dataset C versus dataset B. The Wilcoxon signed-rank test of LSTMGA models was used to verify the statistical significance between the hybrid data, dataset C, and each single-type dataset, dataset A and dataset B. The testing results show that the z values exceed the critical values and the p values indicate significance at the 0.025 level. This shows that employing dataset C in LSTMGA models can generate statistically significantly more accurate results than using dataset A or dataset B. Boxplots can reveal unusual data, data distributions, and likelihoods of data dispersion [59-61]. Figures 8 and 9 show boxplots of the absolute errors of the forecasting models for stock market indices and corporations' stock prices, respectively. It can be observed that LSTMGA models resulted in smaller and more concentrated absolute errors for the various datasets. Figures 10 and 11 plot point-to-point graphs comparing the actual values and the predicted values of the various forecasting models with the three datasets. The plots indicate that the hybrid data could capture the trends of stock markets and corporations' stock prices better than the individual datasets in most forecasting models. In addition, the LSTMGA model with the hybrid dataset C performed the best in all cases.
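A normal-approximation version of the Wilcoxon signed-rank statistic used above can be sketched as follows (pure Python; a production analysis would more likely call scipy.stats.wilcoxon). The paired error lists are hypothetical and assume at least one nonzero difference.

```python
def wilcoxon_z(errors_c, errors_other):
    """Normal-approximation Wilcoxon signed-rank z statistic for paired
    absolute errors (dataset C vs. another dataset). Ties in |d| receive
    average ranks; zero differences are dropped."""
    d = [a - b for a, b in zip(errors_c, errors_other) if a != b]
    n = len(d)
    ranked = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average ranks over runs of tied |d|
        j = i
        while j + 1 < n and abs(d[ranked[j + 1]]) == abs(d[ranked[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[ranked[k]] = avg
        i = j + 1
    w_plus = sum(r for r, di in zip(ranks, d) if di > 0)  # sum of positive ranks
    mu = n * (n + 1) / 4
    sigma = (n * (n + 1) * (2 * n + 1) / 24) ** 0.5
    return (w_plus - mu) / sigma

# Hypothetical paired absolute errors: dataset C is uniformly better.
errors_c = [0.8, 1.1, 0.9, 1.0, 0.7, 1.2, 0.6, 0.9, 1.0, 0.8]
errors_a = [e + 0.5 for e in errors_c]
z = wilcoxon_z(errors_c, errors_a)  # strongly negative: C's errors are smaller
```

A |z| above 1.96 corresponds to significance at roughly the two-sided 0.05 level under the normal approximation, matching the direction of the comparison reported in Table 10.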

Conclusions
Sentiment data and analysis have been applied in many fields and have yielded promising results when integrated with classical structured data. For stock market analysis and prediction, numerous studies have indicated that investors' sentiments significantly influence stock markets. This study used three types of data, namely social media data, trading data, and hybrid data, to forecast stock market indices and corporations' stock prices. The posts of investors in non-native English-speaking countries were translated into English, and then sentiment analysis was performed. Five forecasting models were employed to forecast stock market indices and corporations' stock prices under a one-step-ahead policy, and the genetic algorithm was utilized to determine the appropriate parameters of all forecasting models. The findings of this study can be summarized as follows. First, more accurate results can be obtained by using hybrid data than by using social media data or trading data individually. Second, the LSTMGA models outperformed the other four forecasting models in terms of forecasting accuracy with hybrid data. Third, the numerical results indicated that the proposed LSTMGA model generated outstanding forecasts in all 10 cases, with MAPE values below 5%, as illustrated in Table 9. Thus, the proposed LSTMGA model with multilingual sentiment analysis provides a feasible, effective, and promising way of forecasting stock market indices and corporations' stock prices.
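For reference, the MAPE criterion behind the "below 5%" claim can be computed as follows; the actual and predicted series here are illustrative values, not figures from the study.

```python
import numpy as np

def mape(actual, predicted):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100.0)

# Illustrative index values and one-step-ahead forecasts.
actual = [100.0, 102.0, 98.0, 101.0]
predicted = [99.0, 103.0, 97.5, 100.0]
print(round(mape(actual, predicted), 2))  # → 0.87, well under the 5% threshold
```

Because MAPE is scale-free, it allows forecast accuracy to be compared across indices and individual stocks whose price levels differ widely.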
For future work, first, data collected from other countries or regions can be included to examine the robustness and feasibility of the developed LSTMGA model. Second, other transformer-based natural language processing models, such as bidirectional encoder representations from transformers (BERT), can be implemented to compare forecasting accuracy. In addition, more deep learning techniques and simpler neural network methods can be applied to the forecasting tasks for performance comparison. Third, real-time forecasting can be performed by embedding the well-trained models into a system that collects real-time data for forecasting stock markets; high-frequency data can then be used to reexamine the performance of the proposed models in intraday trading. Finally, applying sentiment analysis tools directly to English-language contexts and comparing the results with those from non-native English data is a possible direction for future study.