Stock Market Prediction Using Machine Learning Techniques: A Decade Survey on Methodologies, Recent Developments, and Future Directions

Abstract: With the advent of technological marvels like global digitization, the prediction of the stock market has entered a technologically advanced era, revamping the old model of trading. With the ceaseless increase in market capitalization, stock trading has become a center of investment for many financial investors. Many analysts and researchers have developed tools and techniques that predict stock price movements and help investors in proper decision-making. Advanced trading models enable researchers to predict the market using non-traditional textual data from social platforms. The application of advanced machine learning approaches such as text data analytics and ensemble methods has greatly increased the prediction accuracies. Meanwhile, the analysis and prediction of stock markets continue to be one of the most challenging research areas due to dynamic, erratic, and chaotic data. This study explains the systematics of machine learning-based approaches for stock market prediction based on the deployment of a generic framework. Findings from the last decade (2011–2021) were critically analyzed, having been retrieved from online digital libraries and databases like ACM digital library and Scopus. Furthermore, an extensive comparative analysis was carried out to identify the direction of significance. The study would be helpful for emerging researchers to understand the basics and advancements of this emerging area, and thus carry on further research in promising directions.


Introduction
Advances in the fundamental aspects of information technology over the last few decades have altered the route of businesses. As one of the most captivating inventions, financial markets have a pointed effect on the nation's economy [1]. The World Bank reported in 2018 that the stock market capitalization worldwide had surpassed US$68.654 trillion [2]. Over the last few years, stock trading has become a center of attention, which can largely be attributed to technological advances. Investors search for tools and techniques that would increase profit and reduce risk [3]. However, Stock Market Prediction (SMP) is not a simple task due to its non-linear, dynamic, stochastic, and unreliable nature [4]. SMP is an example of time-series forecasting that promptly examines previous data and estimates future data values. Financial market prediction has been a matter of concern for analysts in different disciplines, including economics, mathematics, material science, and computer science. Deriving profits from the trading of stocks is an important motivation for the prediction of the stock market [5]. The stock market is dependent on various parameters, such as the market value of a share, the company's performance, government policies, the country's Gross Domestic Product (GDP), the inflation rate, natural calamities, and so on [6]. The Efficient Market Hypothesis explains that stock market prices are significantly determined by new information, and follow a random walk pattern, such that they cannot be predicted solely based on past information [7]. This was a widely accepted theory in the past.
With the advent of technology, researchers demonstrated that stock market prices could be predicted to a certain extent. Historical market data, combined with the data extracted from social media platforms, can be analyzed to predict the changes in the economic and business sectors [8]. The performance of stock market prediction systems relies heavily on the quality of the features they use [9]. While researchers have used some strategies for enhancing the stock-specific features, more attention needs to be paid to feature extraction and selection mechanisms. Figure 1 presents the outline of this article.

Classical Approaches for SMP
According to [10], there exist two main traditional approaches to the analysis of the stock markets: (1) fundamental analysis and (2) technical analysis.

Fundamental Analysis
Fundamental analysis calculates a genuine value of a sector/company and determines the amount that one share of that company should cost. A supposition is made that, if given sufficient time, the market price will move towards the value agreeing with the prediction. If a sector/company is undervalued, then the market value of that company should rise, and conversely, if a company is overvalued, then the market price should fall [11]. The analysis is performed considering various factors, such as yearly fiscal summaries and reports, balance sheets, a future prospectus, and the company's work environment [12]. Overvalued stocks eventually fall in price [13], e.g., the Dotcom bubble burst in the year 2000 [14]. The two most common metrics used in fundamental analysis to predict long-term (yearly) price movements are (a) the Price to Earnings ratio (P/E) and (b) the Price to Book ratio (P/B). The P/E ratio is used as a predictor: companies with a lower P/E ratio yield higher returns than companies with a high P/E ratio [15]. Financial analysts also use this ratio to support their stock recommendations [16]. Fundamental analysis can use financial ratios to distinguish poor stocks from quality stocks [17]. The P/B ratio compares the company value specified by the market to the company value specified on paper. If the ratio is high, the company may be overvalued, and the company's value might fall with time. Conversely, if the ratio is low, the company may be undervalued, and the price may rise with time. Fundamental analysis is a powerful method, but it has some drawbacks: firstly, it lacks adequate knowledge of the rules governing the workings of the system, and secondly, there is non-linearity in the system [18].
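The two ratios described above can be illustrated with a minimal sketch. The function names and the per-share figures are hypothetical and chosen only for illustration; they are not taken from the surveyed studies.

```python
def price_to_earnings(price_per_share: float, earnings_per_share: float) -> float:
    """P/E ratio: market price of one share divided by earnings per share."""
    if earnings_per_share <= 0:
        raise ValueError("P/E is not meaningful for non-positive earnings")
    return price_per_share / earnings_per_share


def price_to_book(price_per_share: float, book_value_per_share: float) -> float:
    """P/B ratio: market price of one share divided by book value per share."""
    if book_value_per_share <= 0:
        raise ValueError("P/B is not meaningful for non-positive book value")
    return price_per_share / book_value_per_share


# Hypothetical figures, chosen only for illustration.
pe = price_to_earnings(price_per_share=50.0, earnings_per_share=4.0)
pb = price_to_book(price_per_share=50.0, book_value_per_share=25.0)
print(f"P/E = {pe:.2f}, P/B = {pb:.2f}")
```

Whether a given value counts as "high" or "low" depends on the sector and the period, so any threshold applied to these ratios would be a further assumption.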

Technical Analysis
Technical analysis is the study of stock prices to make a profit, or to make better investment decisions [19]. Technical analysis predicts the direction of the future price movements of stocks based on their historical data, and helps to analyze financial time series data using technical indicators to forecast stock prices. It assumes that the price moves in a trend and has momentum [20]. Technical analysis uses price charts and certain formulae, and studies patterns to predict future stock prices; it is mainly used by short-term investors. The price considered may be the high, low, open, or closing price of the stock, and the time points may be daily, weekly, monthly, or yearly. Dow theory puts forward the main principles for technical analysis, which are that the market price discounts everything, prices move in trends, and historic trends usually repeat the same patterns [21]. There are several technical indicators, such as the Moving Average (MA), Moving Average Convergence/Divergence (MACD), the Aroon indicator, and the money flow index. The evident flaws of technical analysis as per [18] are that the rules are defined by experts' opinions, which are fixed and resistant to change, and that various parameters that affect stock prices are ignored.
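As an illustration of the indicators named above, the following sketch computes a simple moving average and a MACD line from a list of closing prices. The 12/26/9 spans are the commonly quoted defaults and are an assumption here, as are the function names and the sample prices.

```python
def simple_moving_average(prices, window):
    """Return the SMA series; the first window-1 entries are None."""
    if window <= 0:
        raise ValueError("window must be positive")
    out = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(prices[i + 1 - window:i + 1]) / window)
    return out


def exponential_moving_average(prices, span):
    """EMA with smoothing factor 2 / (span + 1), seeded with the first price."""
    alpha = 2.0 / (span + 1)
    ema = [prices[0]]
    for p in prices[1:]:
        ema.append(alpha * p + (1 - alpha) * ema[-1])
    return ema


def macd(prices, fast=12, slow=26, signal=9):
    """MACD line (fast EMA minus slow EMA) and its signal line (EMA of MACD)."""
    fast_ema = exponential_moving_average(prices, fast)
    slow_ema = exponential_moving_average(prices, slow)
    macd_line = [f - s for f, s in zip(fast_ema, slow_ema)]
    signal_line = exponential_moving_average(macd_line, signal)
    return macd_line, signal_line


closes = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]  # hypothetical closing prices
sma3 = simple_moving_average(closes, 3)
macd_line, signal_line = macd(closes)
```

On a steadily rising series such as this one, the fast average stays above the slow average, so the MACD line is positive after the first point.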
The prerequisite is to overcome the deficiencies of fundamental and technical analysis, and the evident advancement in the modelling techniques has motivated various researchers to study new methods for stock price prediction. A new form of collective intelligence has emerged, and new innovative methods are being employed for stock value forecasting. The methodologies incorporate the work of machine learning algorithms for stock market analysis and prediction.

Modern Approaches for SMP
There are some modern approaches that can be functional and fruitful for SMP and that would enhance prediction accuracies. In this review, we will highlight some modern functional approaches. Because of global digitization, SMP has entered a technological era. Machine learning in stock price prediction is used to discover patterns in data [22]. Usually, a tremendous amount of structured and unstructured heterogeneous data is generated from stock markets. Using machine learning algorithms, it is possible to quickly analyze more complex heterogeneous data and generate more accurate results. Various machine learning methods have been used for SMP [23]. The machine learning approaches are mainly categorized into supervised and unsupervised approaches. In the supervised learning approach, labeled input data and the desired output are given to the learning algorithms. Meanwhile, in the unsupervised learning approach, unlabeled input data is provided to the learning algorithm, and the algorithm identifies the patterns and generates the output accordingly. Furthermore, different algorithmic approaches have been used in SMP, such as the Support Vector Machine (SVM), k Nearest Neighbors (kNN), Artificial Neural Networks (ANN), Decision Trees, Fuzzy Time-Series, and Evolutionary Algorithms. The SVM is a supervised machine learning technique that limits error and augments geometric margins, and is a pattern classification algorithm [24]. In terms of accuracy, the SVM is an important machine learning algorithm compared to the other classifiers [25]. In the kNN, stock prediction is mapped into a classification based on closeness. Using Euclidean distance, the kNN finds the "k" nearest neighbors of a query point in the training set and assigns the class that is most common among them. The ANN is a nonlinear computational structure for various machine learning algorithms to analyze and process complex input data together.
The FIS (Fuzzy Inference Systems) apply rules to fuzzy sets and then apply de-fuzzification to give crisp outputs for decision making [26]. The evolutionary algorithms include gene-inspired neuro-fuzzy and neuro-genetic algorithms, mimic the natural selection theory of species, and can give an optimal output.
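The kNN idea mentioned in this section can be sketched in a few lines. This is a minimal illustration, not a method from the cited studies; the two features and all sample values are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training points."""
    ranked = sorted(zip(train_x, train_y), key=lambda pair: euclidean(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical features: (previous-day return, 5-day momentum).
# Labels are the next-day direction.
train_x = [(0.02, 0.05), (0.01, 0.03), (0.03, 0.04),
           (-0.02, -0.04), (-0.01, -0.03), (-0.03, -0.05)]
train_y = ["up", "up", "up", "down", "down", "down"]

prediction = knn_predict(train_x, train_y, query=(0.015, 0.02), k=3)
```

In practice the features would usually be scaled before computing distances, since Euclidean distance is sensitive to the units of each feature.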

Sentiment Analysis Approach
One of the phenomena of current times that is changing the world is the global availability of the internet. The most-used platforms on the internet are social media. It is estimated that social media users all over the world will number around 3.07 billion [27]. There is a high association between stock prices and events related to stocks on the web. When event information is extracted from the internet to predict stock prices, the approach is known as event-driven stock prediction [28]. Through social networks, people generate tremendous amounts of data that is filled with emotions. Much of this data is related to user perceptions and concerns [29]. Sentiment analysis is a field of study that deals with people's concerns, beliefs, emotions, perceptions, and sentiments towards some entity [30,31]. For SMP, it is the process of analyzing text corpora, e.g., news feeds or stock market-specific tweets, for stock trend prediction. StockTwits, Twitter, Yahoo Finance, and so on are well-known platforms used for the extraction of sentiments. Sentiment data is of significant importance for enhancing the prediction of volatility in the stock market. The 'Wisdom of Crowds' and sentiment analysis generate more insights that can be used to increase the performance in various fields, such as box office sales, election outcomes, SMP, and so on [32]. This suggests that a good decision can be made by taking the opinions and insights of large groups of people with varied types of information [33]. The information generated through social media allows us to explore vast and diverse opinions. Exploring sentiments from social media in addition to numeric time-series stock data would enhance the accuracy of the prediction.
Different approaches and techniques have been proposed over time to anticipate stock prices through numerous methodologies, thanks to the dynamic and challenging panorama of stock markets [34].

Research Methodology
This section explains the overall process of the literature collection on SMP using machine learning. Initially, the phrase "stock market prediction using machine learning" was entered into various search engines, digital libraries and databases, including 'Google Scholar', 'ResearchGate', 'ACM Digital Library', 'IEEE Xplore', 'Scopus', and so on. During the process of literature collection, various phrases like "stock market prediction methods", "impact of sentiments on stock market prediction", and "machine learning-based approach for stock market prediction" were also used. The OR and AND operators were used for the keyword searches in single and multiple classes, respectively. As a result, some of the fundamental papers in the field of stock market prediction were retrieved. By the careful analysis of a few basic papers, a primary insight into the domain was obtained. The search criteria were further modified to collect the literature of the last decade, in order to enhance and improve the coverage of the domain. In addition, the literature selected was screened by applying quality criteria, where metrics such as indexing, quartiles, impact factors and publishers were observed. Figure 2 presents the steps followed in the literature collection.
Figure 3 describes the generic process involved in SMP. The process starts with the collection of the data, and then pre-processing that data so that it can be fed to a machine learning model. The prediction models generally use two types of data: market and textual data. The literature of both types is discussed in the following section. The next section classifies the previous studies based on the type of data used. Furthermore, the next section surveys the previous studies based on the various data-preprocessing approaches applied. Moreover, the literature is further surveyed based on the machine learning algorithms used by different systems.

Types of Data
SMP systems can be classified according to the type of data they use as the input. Most of the studies used market data for their analysis. Recent studies have considered textual data from online sources as well. In this section, the studies are classified based on the type of data they use for prediction purposes. At the end of this section, Table 1 points out the comparison of the data sources, type of input and prediction duration used in the studies so far.

Market Data
Market data are the temporal historical price-related numerical data of financial markets. Analysts and traders use the data to analyze the historical trend and the latest stock prices in the market. They reflect the information needed for the understanding of market behavior. The market data are usually free, and can be directly downloaded from the market websites. Various researchers have used this data for the prediction of price movements using machine learning algorithms. The previous studies have focused on two types of predictions. Some studies have used stock index predictions like the Dow Jones Industrial Average (DJIA) [35], Nifty [36], Standard and Poor's (S&P) 500 [37], National Association of Securities Dealers Automated Quotations (NASDAQ) [38], the Deutscher Aktien Index (DAX) index [39], and multiple indices [40,41]. Other studies have used individual stock prediction based on some specific companies like Apple [42], Google [43], or groups of companies [12,44].
Furthermore, the studies focused on time-specific predictions like intraday [45], daily [20], weekly [46], and monthly predictions [47], and so on. Moreover, most of the previous research is based on categorical prediction, where predictions are categorized into discrete classes like up, down, positive, or negative [32,48]. Technical indicators have been widely used for SMP due to their summative representation of trends in time series data. Some studies considered different types of technical indicators, e.g., trend indicators, momentum indicators, volatility indicators and volume indicators [32,49,50]. Furthermore, numerous studies have used an amalgam of different types of technical indicators for SMP [42,51].
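The categorical (up/down) targets described above are typically derived from the price series itself. A minimal sketch, assuming a list of daily closing prices and a hypothetical helper name:

```python
def direction_labels(closes, horizon=1):
    """Label each day 'up' if the close `horizon` days later is higher, else 'down'."""
    labels = []
    for today, future in zip(closes, closes[horizon:]):
        labels.append("up" if future > today else "down")
    return labels


daily_closes = [100.0, 101.5, 101.0, 103.2, 103.2, 102.0]  # hypothetical prices
labels = direction_labels(daily_closes)
```

Treating an unchanged price as "down" is an arbitrary choice made here for brevity; a study could equally add a third "flat" class. Changing `horizon` gives weekly or monthly targets from the same series.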

Textual Data
Textual data is used to analyze the effect of sentiments on the stock market. Public sentiments have been proven to affect the market considerably. The most challenging part is to convert the textual information into numerical values so that it can be fed to a prediction model. Furthermore, the extraction of textual data is a challenging task. The textual data has many sources, such as financial news websites, general news, and social platforms [52]. Most of the studies carried out on textual data try to predict whether the sentiment towards a particular stock is positive or negative. The previous studies considered several textual sources for SMP, such as the Wall Street Journal [53], Bloomberg [22], CNBC and Reuters [54], Google Finance [55], and Yahoo Finance [56]. The extracted news may be either generalized news or some specific financial news, but the majority of the researchers use financial news, as it is deemed to be less susceptible to noise [57]. Some researchers have used less formal textual data, such as message boards [58,59].
Meanwhile, the textual data from microblogging websites and social networking websites are comparably less explored than other textual data forms for SMP. Besides this, one challenge faced in the processing of the textual data is that the information generated on these platforms is enormous, increasing the computational complexities [60,61]. For example, the researchers in [62] processed 100,000 tweets, and the researchers in [63] processed around 2,500,000 tweets, which was a complex task. Moreover, no proper standard format is followed while posting on social media, which increases the processing complexities. In addition, the detection of shorthand spellings, emoticons and sarcastic statements is yet another challenge. Machine learning algorithms come to the forefront to deal with the challenges faced while processing textual data. Previous studies have mostly considered the sentiment of textual data as positive or negative [35,48,64]. A few studies have considered mood words to determine the mood of a tweet, such as [8,58,65].

Data Pre-Processing
Once the data is available, it needs some pre-processing so that it can be fed to a machine learning model. The significance of the output depends on the pre-processing of the data [66,67]. The textual data must be transformed into a structured format that can be used in a machine learning model. The previous studies revealed that there are three significant pre-processing steps, i.e., feature selection, order reduction and the representation of features. Table 1 presents the comparison of the data sources, type of input and prediction duration. Table 2 presents the comparison of the data pre-processing techniques used in the studies so far.

Feature Selection
Feature selection is a crucial step in textual data processing. Most of the studies on SMP have used basic feature extraction techniques such as Bag of Words, where the text is broken into words and each word is converted into a numeric feature. The feature selection depends on the number of occurrences of a word. Table 2 points out the various feature selection methods used in the literature so far.
As in [67,70], most of the previous literature used feature selection techniques where the order of words is discarded, causing the loss of context. Another feature selection method is Word2Vec, as proposed by [32,81]. It is a word embedding technique based on a multi-layer perceptron. This technique takes into consideration the order and co-occurrence of words, and hence retains the context. Word2Vec has been used in some of the works, such as [63,77].
Moreover, a few studies have used Latent Dirichlet Allocation (LDA), where each document is viewed as a probabilistic mixture of concepts (topics), and the concepts are used as features [82–84]. Some works [44,76,79] used the N-grams technique. An N-gram is a contiguous sequence of N words from a given sequence of text. Other methods like genetic algorithms and particle swarm optimization have also been used for feature selection, as in [80,85].
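The difference between Bag of Words and N-grams can be seen in a short sketch. The tokenizer is deliberately naive and the headline is hypothetical; real pipelines would add cleaning steps that are not shown here.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; real pipelines would do more cleaning."""
    return text.lower().split()


def bag_of_words(text):
    """Map each word to its number of occurrences; word order is discarded."""
    return Counter(tokenize(text))


def ngrams(text, n=2):
    """Return the contiguous n-word sequences of the text, in order."""
    tokens = tokenize(text)
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


headline = "shares rise as profit forecasts rise"  # hypothetical headline
bow = bag_of_words(headline)
bigrams = ngrams(headline, n=2)
```

The bag-of-words counts cannot tell "shares rise" from "rise shares", whereas the bigrams keep each word's immediate neighbor and so retain a little local context.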

Order Reduction
The feature selection process for the textual data leads to an increase in the number of features. High dimensional features are extremely difficult to process, and lead to the poor efficiency of most of the learning algorithms [86]. This phenomenon is known as the curse of dimensionality [87]. A lower number of features decreases the training complexity of the algorithms. Table 2 points out the order reduction techniques used in previous studies. A well-known form of multivariate analysis, Principal Component Analysis (PCA), is used to select the most relevant features, reducing the dimensionality of the features. In [68], the daily direction of the S&P 500 index is predicted using 60 features. The authors used three variants of PCA, and concluded that the inclusion of PCA not only reduced the overall training complexity but also increased the accuracy of the predictions. In [78], numerous feature reduction techniques, e.g., PCA, Factor Analysis (FA), Genetic Algorithms (GA), and Firefly Optimization (FO), were used to reduce the data complexity.
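To make the idea of dimensionality reduction concrete, the sketch below finds only the first principal component of a small dataset using power iteration on the covariance matrix. This is a simplified illustration under that assumption, not the PCA variants used in [68]; the data and function name are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, projections) for the leading principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]

    # Sample covariance matrix of the centered features.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]

    # Power iteration: repeatedly apply the covariance matrix and renormalize.
    vec = [1.0] * d
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm == 0.0:
            break
        vec = [v / norm for v in nxt]

    # Project each centered row onto the component: one number per row.
    projections = [sum(c[j] * vec[j] for j in range(d)) for c in centered]
    return vec, projections


# Hypothetical data: two strongly correlated features and one constant feature.
data = [[1.0, 2.0, 5.0], [2.0, 4.1, 5.0], [3.0, 5.9, 5.0], [4.0, 8.0, 5.0]]
direction, scores = first_principal_component(data)
```

Here three features are summarized by a single score per row. The constant feature receives zero weight because it carries no variance, and the two correlated features are combined into one direction.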

Feature Representation
Feature representation is one of the important factors for the efficient training of machine learning algorithms. Once the number of required features is determined, the input data is converted to a numeric representation so that machine learning algorithms can readily process it. Table 2 presents the type of representation or the weighting used in the literature so far. Boolean representation is one of the most basic techniques of feature representation, in which the presence and absence of the feature (word) are represented by 1 and 0, respectively, for Bag of Words [56]. Another technique, Term Frequency-Inverse Document Frequency (TF-IDF), has been used in numerous studies [44,69,73]. Generally, the text pre-processing phase is considered to be a crucial phase, and may significantly impact the model's accuracy [88,89].
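The two representations can be compared on a toy corpus. The sketch assumes one common unsmoothed TF-IDF variant (term frequency as count divided by document length, inverse document frequency as log(N / df)); libraries often use smoothed variants, so exact values may differ. The documents are hypothetical.

```python
import math


def boolean_vector(tokens, vocabulary):
    """1 if the vocabulary word is present in the document, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


def tf_idf_vectors(documents):
    """Return (vocabulary, vectors) using tf = count / len and idf = log(N / df)."""
    vocabulary = sorted({word for doc in documents for word in doc})
    n_docs = len(documents)
    doc_freq = {w: sum(1 for doc in documents if w in doc) for w in vocabulary}
    vectors = []
    for doc in documents:
        row = []
        for w in vocabulary:
            tf = doc.count(w) / len(doc)
            idf = math.log(n_docs / doc_freq[w])
            row.append(tf * idf)
        vectors.append(row)
    return vocabulary, vectors


# Hypothetical tokenized headlines.
docs = [["stock", "prices", "rise"], ["stock", "prices", "fall"], ["stock", "rally"]]
vocab, tfidf = tf_idf_vectors(docs)
flags = boolean_vector(docs[0], vocab)
```

The word "stock" appears in every document, so its TF-IDF weight is zero everywhere, while a distinctive word such as "rise" keeps a positive weight. The Boolean vector, by contrast, treats both words identically.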

Machine Learning Methods
This section attempts to summarize the machine learning models used in previous studies for stock prediction and forecasting. After the data is pre-processed and transformed to a standard representation, it is fed to machine learning models for further processing. Table 3 presents the distribution of various machine learning techniques used in the literature so far. The following subsections briefly summarize the different machine learning approaches presented, including Hybrid Approaches (HA).

Artificial Neural Networks (ANN)
The ANN is a biological brain-inspired technique in which a large number of artificial neurons are strongly interconnected in order to solve complex problems [90]. These models understand the context of a problem by applying multiple transformations to the feature space, each followed by a non-linearity, to create simplified representations of it [91]. Numerous studies have employed ANN models for SMP [38,40,92–95]. For example, the authors in [68] employed an ANN for daily trend prediction of the S&P 500 index. Three dimensionality reduction techniques, i.e., PCA, Fuzzy Robust Principal Component Analysis (FRPCA), and Kernel-based Principal Component Analysis (KPCA), were applied to streamline the dataset. The results suggested that combining the ANNs with PCA is more efficient. Furthermore, the selection of an appropriate kernel function directly affects the performance of KPCA [68].
Multilayer perceptron (MLP) is a frequently used technique for SMP [42,43,96,97]. MLP is an ANN with one input and output layer, and one or more intermediate layers.
Generally, the MLP uses the backpropagation method for training, in which predicted errors are back propagated from the output layer to the input layer to minimize the errors [98,99].
A study compared three ANN models, i.e., MLP, dynamic artificial neural network (DAN2), and a hybrid based on generalized autoregressive conditional heteroscedasticity (GARCH-MLP), for NASDAQ price prediction [38]. All three models were evaluated using the Mean Absolute Deviation (MAD) and Mean Square Error (MSE). The results demonstrated that the MLP outperformed DAN2 and GARCH-MLP. Furthermore, the study suggests that future research examine whether GARCH, or other correlated variables, have a remedying impact on forecasts.
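The two error measures named above are straightforward to compute. The sketch below uses hypothetical actual and predicted values, not figures from [38].

```python
def mean_absolute_deviation(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted values."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [100.0, 102.0, 101.0, 105.0]      # hypothetical index levels
predicted = [101.0, 101.0, 103.0, 105.0]   # hypothetical model output
mad = mean_absolute_deviation(actual, predicted)
mse = mean_squared_error(actual, predicted)
```

Because MSE squares each error, the single miss of two points contributes four times as much as each one-point miss, whereas MAD weights it only twice as much.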
Another study used Generalized Feed Forward (GFF) and MLP models for the prediction of the Istanbul Stock Exchange (ISE) market index [100], where the data were taken from the Central Bank of Turkey. A total of eight sets of predictions were performed: six with ANNs, obtained by changing the number of hidden layers, and two based on MAs. The accuracy of the prediction was calculated using the coefficient of determination, and the highest accuracy for both MLP and GFF was achieved using one hidden layer.
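The coefficient of determination used as the accuracy measure above can be sketched as follows. This is the standard definition, 1 minus the ratio of residual to total sum of squares; the sample values are hypothetical.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the actual values are constant")
    return 1.0 - ss_res / ss_tot


observed = [1.0, 2.0, 3.0, 4.0]                       # hypothetical values
perfect = r_squared(observed, observed)               # predictions match exactly
baseline = r_squared(observed, [2.5, 2.5, 2.5, 2.5])  # always predict the mean
```

A value of 1 means the predictions reproduce the data exactly, and a value of 0 means the model does no better than always predicting the mean.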
Another often-used ANN technique for SMP is the Radial Basis Function (RBF) network. It is a layered network in which the hidden layers use a radial activation function [101,102]. For example, the authors in [91] used the RBF neural network to predict the Shanghai and NASDAQ indices, using an extension of Locality Preserving Projection (LPP) known as two-dimensional LPP to select the most relevant features for the prediction. The proposed method performed well on both of the market indices.

Support Vector Machine (SVM)
The SVM is a supervised machine learning technique that limits error and augments geometric margins. It is a pattern classification and regression algorithm that was introduced in [24]. In terms of accuracy, the SVM is an important linear separation algorithm compared to other classifiers [25]. As presented in Table 4, it is the most popular method used for SMP [39–41,44,50,51,72,103,104].
The authors in [47] developed a daily and monthly SMP model using historical and sentiment data for the bank, mining, and oil sectors. The historical prices were obtained from Yahoo Finance, and a sentiment dataset was created by using news and tweets for one year. PCA with multiple factors was applied to the sparse dataset considered for the sentiment analysis. In this study, three algorithms, i.e., Decision-Boosted Tree, SVM, and Logistic Regression, were compared, and accuracy was used as the performance metric. The Decision-Boosted Tree outperformed the Logistic Regression and SVM. The Decision-Boosted Tree achieved accuracies of 54.8%, 76%, and 76.9% for the bank, mining, and oil sectors, respectively. The Logistic Regression attained accuracies of 65.4%, 61%, and 44.2%, respectively, and the SVM achieved accuracies of 51%, 59%, and 44.2% for the respective sectors. The study finally suggested considering the impact of intra-day price movement on the next-day stock price to improve the accuracy.

Naïve Bayes (NB)
NB is a classification method that classifies the data points based on Bayes' theorem of probability. This classification method is extremely fast, and can scale over large datasets. This classification approach has been used widely for SMP [49,69,103,105–107]. For example, the authors in [108] employed the Naïve Bayes algorithm for the sentiment analysis of textual data from multiple sources. The authors compared the effect of conventional and social media data sources on different companies and their interrelatedness.
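A minimal sketch of Naïve Bayes applied to sentiment labels is given below. It assumes the multinomial variant with add-one (Laplace) smoothing, which is one common choice and not necessarily the one used in [108]; the class name and training snippets are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesText:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, documents, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for tokens, label in zip(documents, labels):
            self.word_counts[label].update(tokens)
        self.vocabulary = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of log likelihoods (Bayes' theorem in log space).
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in tokens:
                if w not in self.vocabulary:
                    continue  # ignore words never seen in training
                count = self.word_counts[label][w] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labeled snippets, tokenized by hand.
train_docs = [["profit", "beats", "forecast"], ["shares", "surge", "profit"],
              ["loss", "widens", "shares", "drop"], ["weak", "forecast", "loss"]]
train_labels = ["positive", "positive", "negative", "negative"]

model = NaiveBayesText().fit(train_docs, train_labels)
label = model.predict(["profit", "surge"])
```

Training only counts words per class, which is why the method is fast and scales to large datasets; the "naive" part is the assumption that words are independent given the class.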

Genetic Algorithms (GA)
GAs are heuristic problem-solving approaches that mimic the process of natural evolution. The algorithms apply the concept of natural selection to search for the best possible solution. In SMP, GAs are used to fine-tune the parameters for the generation of the best trading rule. Numerous studies have used GAs to enhance SMP accuracies [4,26,78,80,97,109–111]. For example, the authors in [112] developed an intelligent decision support system for stock trading. This study employed rough sets and GA on non-linear and complex stock data to find the features that can be used to generate the optimal trading rules. These rules are applied to generate optimal buy or sell strategies.
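The parameter-tuning role of a GA can be sketched as follows. The example evolves the two window lengths of a moving-average crossover rule so that the rule's profit on a price series is maximized. The crossover rule, the profit-based fitness, and all parameter values are hypothetical choices for illustration and do not reproduce the system in [112].

```python
import random


def moving_average(prices, window, end):
    """Mean of the `window` prices ending at index `end` (inclusive)."""
    return sum(prices[end - window + 1:end + 1]) / window


def rule_profit(prices, short_w, long_w):
    """Toy fitness: profit of a long-only moving-average crossover rule."""
    if short_w >= long_w:
        return float("-inf")  # invalid rule
    cash, holding = 0.0, False
    for t in range(long_w - 1, len(prices)):
        short_above = (moving_average(prices, short_w, t)
                       > moving_average(prices, long_w, t))
        if short_above and not holding:
            cash -= prices[t]  # buy one share
            holding = True
        elif not short_above and holding:
            cash += prices[t]  # sell it
            holding = False
    if holding:
        cash += prices[-1]  # close the open position at the last price
    return cash


def evolve_rule(prices, generations=20, pop_size=12, seed=0):
    """Return the best (short_w, long_w) found and the best fitness per step."""
    rng = random.Random(seed)

    def fitness(rule):
        return rule_profit(prices, rule[0], rule[1])

    population = []
    for _ in range(pop_size):
        short_w = rng.randint(2, 10)
        population.append((short_w, rng.randint(short_w + 1, 30)))
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[:pop_size // 2]  # selection: keep the fitter half
        children = [population[0]]            # elitism: best rule survives
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = (a[0], b[1])              # crossover: mix parent genes
            if rng.random() < 0.3:            # mutation: small random change
                child = (max(2, child[0] + rng.choice([-1, 1])),
                         max(3, child[1] + rng.choice([-2, 2])))
            children.append(child)
        population = children
    best = max(population, key=fitness)
    history.append(fitness(best))
    return best, history
```

Because the best rule of each generation is copied unchanged into the next one, the recorded best fitness never decreases. A rule tuned this way on historical prices can still overfit, so it should be evaluated on unseen data.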

Fuzzy Algorithms (FA)
Fuzzy logic is a human reasoning-based method where all of the intermediate possibilities between 0 and 1 are used for decision making. It is a powerful approach in which the degree of belongingness to a certain category is considered. The adaptive neuro-fuzzy inference system (ANFIS) is the most popular fuzzy algorithm that is used for SMP. Some example studies in which ANFIS was employed for SMP include [72,113,114]. The authors in [29] developed a fuzzy logic approach to analyze the sentiments on social media for SMP. Furthermore, several studies have used hybrid fuzzy approaches for SMP [115–117].
As an example of a hybrid fuzzy approach, Sedighi et al. (2019) proposed a novel prediction model using an Artificial Bee Colony (ABC), SVM, and ANFIS. In this study, data from 50 companies were taken from the US Stock Exchange for the period from 2008 to 2018. The model used 20 technical indicators as the input. Performance was measured in terms of accuracy and quality, and the model achieved a higher forecasting accuracy than the other models it was compared with.
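The notion of a degree of belongingness can be illustrated with triangular membership functions. The sketch below covers only the fuzzification step, not a full ANFIS; the three categories and their breakpoints (in percent daily change) are hypothetical values chosen for this example.

```python
def triangular(x, left, peak, right):
    """Triangular membership: 0 outside (left, right), 1 at the peak."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


def fuzzify_return(pct_change):
    """Map a daily percentage change to degrees of membership in three sets."""
    return {
        "fall": triangular(pct_change, -4.0, -2.0, 0.0),
        "flat": triangular(pct_change, -2.0, 0.0, 2.0),
        "rise": triangular(pct_change, 0.0, 2.0, 4.0),
    }
```

For instance, `fuzzify_return(1.0)` assigns a membership of 0.5 to both "flat" and "rise" and 0.0 to "fall", so a 1% gain is treated as partly flat and partly rising rather than being forced into a single category.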

Deep Neural Networks (DNN)
DNNs are an improvement over conventional neural networks in which more hidden layers and neurons are employed for automatic feature extraction and transformation. The increase in the number of hidden layers with non-linear processing units improves the efficiency of learning from raw data [118]. DNNs have been used frequently for financial predictions using textual and numeric data [74,119]. Different studies have used DNN algorithms such as Convolutional Neural Networks (CNNs) [120–122], Long Short-Term Memory (LSTM) networks [123,124], and Deep Belief Networks (DBNs) [106,125–127]. For example, a recent study [128] compared four models for stock market price prediction: the Auto-Regressive Integrated Moving Average (ARIMA), Vector Auto-Regression (VAR), LSTM, and Nonlinear Auto-Regressive with exogenous inputs (NARX). The model performance was evaluated using an accuracy metric, and the data used for the analysis were the closing prices of the NASDAQ. The results revealed that NARX made accurate predictions for the short term but failed in long-term predictions. The study also concluded that models that integrate machine learning and technical indicators could predict more accurately. LSTM networks are able to learn long-term dependencies, which makes them well suited to time-series prediction. Moreover, the authors in [43] compared three Recurrent Neural Network (RNN) models on Google stock price data, namely the basic RNN, the Gated Recurrent Unit (GRU), and the LSTM. The results revealed that the LSTM outperformed the other techniques and achieved an accuracy of 72% on a 5-day horizon. Furthermore, the authors in [129] applied a dynamic LSTM network to predict Nifty prices using the Open, High, Low, and Close prices as features, and achieved a Root Mean Square Error (RMSE) of 0.00859 in terms of daily percentage changes.

Regression Algorithms (RA)
Regression is a predictive approach that models the relationship between a dependent variable and independent variables [130]. Different regression approaches have been used in previous studies: simple linear regression [131–133], multiple regression [134,135], decision tree regression [17,136], logistic regression [137], support vector regression (SVR) [56,138], and ensemble regression [41,69,139]. For example, the authors in [140] developed a model that predicts the stock price of a user-specified company a few days ahead. Regression analysis and candlestick pattern detection were applied to the data, which were collected from multiple sources. The model predicted the market movement to a satisfactory level of efficiency. Furthermore, different machine learning algorithms were used, and an improved accuracy of 85% was achieved.
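The simplest case, a simple linear regression with one independent variable, can be written with the closed-form least-squares estimates. The sketch below is illustrative only; using the day index as the independent variable and the closing price as the dependent variable is an assumption for this example, not the setup of [140].

```python
def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_line(slope, intercept, x_new):
    """Evaluate the fitted line at a new value of x."""
    return slope * x_new + intercept
```

Fitting the line to closing prices indexed by day and evaluating it at the next index gives a naive one-step-ahead forecast, which can serve as a baseline for the more elaborate regression variants listed above.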

Hybrid Approaches (HA)
The hybrid approach is an amalgam of various techniques used to enhance the performance of prediction models. Hybrid algorithms increase the efficiency of prediction models, as suggested by [141]. A few studies have used various hybrid approaches [9,59,71,72,114,142–144]. For example, the authors in [145] proposed a novel, intelligent, hybrid model for stock prediction by combining the predictions of linear and non-linear models. The authors used an exponential smoothing model as the linear model and an autoregressive moving reference neural network as the non-linear model. The initial predictions were performed by the linear model, and the prediction errors were calculated and then fed to the autoregressive moving reference neural network. This model minimized the errors due to non-linear processing. Summation and multiplication methods were used for the generation of predictions from the prediction errors. NSE data were used for the testing of the model, and the results indicated that the model could be a promising new approach for the prediction of stock returns. In terms of price predictions, the model outperformed the RNN and achieved a lower MSE and Mean Absolute Error (MAE) compared to the constituent models.
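The two-stage structure described above (a linear forecast followed by a model of its errors, combined by summation) can be sketched as follows. Note the substitution: a one-parameter autoregressive model of the errors stands in for the neural network used in [145], purely to keep the example short. The function names and the smoothing factor `alpha` are likewise assumptions.

```python
def exponential_smoothing(series, alpha):
    """One-step-ahead forecasts; forecasts[t] is the prediction for series[t]."""
    forecasts = [series[0]]  # initialise the level with the first observation
    for value in series[:-1]:
        forecasts.append(alpha * value + (1 - alpha) * forecasts[-1])
    return forecasts


def fit_ar1(errors):
    """Least-squares phi for the error model errors[t] ~ phi * errors[t - 1]."""
    num = sum(prev * cur for prev, cur in zip(errors, errors[1:]))
    den = sum(prev ** 2 for prev in errors[:-1])
    return num / den if den else 0.0


def hybrid_forecast(series, alpha=0.5):
    """Return (hybrid, linear_only) forecasts for the next, unseen step."""
    forecasts = exponential_smoothing(series, alpha)
    errors = [actual - f for actual, f in zip(series, forecasts)]
    phi = fit_ar1(errors)
    linear_next = alpha * series[-1] + (1 - alpha) * forecasts[-1]
    error_next = phi * errors[-1]  # predicted error of the linear model
    return linear_next + error_next, linear_next  # summation method
```

On a steadily trending series, exponential smoothing lags behind the trend, and the error model recovers part of that lag, so the summed forecast lies closer to the next value than the linear forecast alone.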

Evaluation Metrics
Generally, two approaches are used for SMP: classification and regression. The former approach classifies the market trend into categories like Up and Down. In the latter, the output is a numerical value that predicts the rise or fall of the price. Figure 4 presents the taxonomy of the evaluation metrics used in the studies so far. Table 4 points out the different evaluation parameters used in the reviewed studies, as well as the time frame of the prediction. For the most part, the studies used accuracy as an evaluation metric, which is the percentage ratio of correct predictions over the total number of test instances [146].
The accuracy metric is easy to use and has a lower computational complexity than the other metrics. However, it does not consider the importance of Type 1 and Type 2 errors in the case of skewed data distributions [147]. The MSE measures the mean squared difference between the predicted and actual output. It is an important metric in regression analysis because it measures how close the predicted value is to the actual value. The Area Under the Curve (AUC) measures the degree of separability between classes. It is an important metric for classification problems. The higher the AUC, the better the model's predictability [148]. Precision is the ratio of correctly predicted positives to the total positives predicted [42]. Recall is the ratio of correctly predicted positives to the total number of positives [73]. The F-measure is the harmonic mean of the precision and recall, and indicates the importance of false positives and false negatives in the confusion matrix [42,49,69]. The R2, or the coefficient of determination, is the measure of the closeness of the data to the predicted regression line [41,133]. The MAE measures the average absolute difference between the predicted and actual data [133].
Table 4 (excerpt).
Study | Evaluation metrics | Time frame | Result
[63] | Accuracy and correlation | Daily | Accuracy of around 70%
[51] | Accuracy, RMSE | Long-term | 99% accuracy for Yahoo data (XGBoost)
[49] | Error rate, F-measure | Next month, next week | 0.85
[42] | Accuracy, F-measure, precision, AUC | One day ahead | 85%
[43] | Log loss and accuracy | Daily, weekly | 72% accuracy (LSTM)
[68] | Accuracy, return | Daily | 58.1%
Moreover, the Mean Absolute Percentage Error (MAPE), which measures the mean of the absolute percentage errors of the predictions, is used as a performance indicator in a few studies [38,106]. Furthermore, a few of the reviewed studies used trading return or return on investment (ROI) as an evaluation metric, where the trading technique was tested to measure the profitability of predictions [56,68]. Other studies have used Prediction of Change in Direction (POCID) [149] and hit ratios [144].
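The metrics discussed in this section follow directly from their definitions, as the sketch below shows. The function names are illustrative, the class labelled 1 is assumed to be the positive class (for example, Up), and the inputs are assumed to be equal-length lists with non-zero actual values for the MAPE.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and their harmonic mean for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(mse(y_true, y_pred))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (actual values non-zero)."""
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100 * total / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

The first two functions apply to classification outputs (Up/Down labels), whereas the remaining ones apply to regression outputs such as predicted prices.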

Overfitting
One of the most well-known and challenging issues in machine learning models is overfitting. In this phenomenon, the model fits the training data too closely: it picks up on noise or random fluctuations in the training data and learns them as if they were genuine patterns. These patterns do not apply to the new data that are to be predicted, thereby resulting in poor model generalization. Because stock market data are highly stochastic, it is imperative to explain the methods used to resolve this issue. The most common approach to mitigate overfitting is cross-validation, which a few studies have applied [38,42,150–152]. In a typical k-fold cross-validation, the data are partitioned into k subsets, or folds. The model is trained iteratively on k-1 folds, and the remaining fold (also known as the hold-out fold) is treated as a test set. Numerous studies have used the early stopping method to overcome overfitting [153]. Another method is to remove irrelevant features and noise from the data, which greatly increases the model's generalizability. A few studies have implemented these procedures to avoid overfitting, such as [42,44,68,121]. The most important preventive measure against overfitting is regularization. This technique adds a penalty on large model weights to the training objective, shrinking them toward zero. It discourages the learning of models that are complex or more flexible, hence reducing the risk of overfitting. The majority of the reviewed studies applied regularization approaches to prevent overfitting [23,25,83]. A few recent studies applied data augmentation to prevent overfitting [154,155].
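The k-fold partitioning described above can be sketched as an index generator. The function name is hypothetical, and the folds are kept contiguous and unshuffled, which is an assumption made here because stock data are time-ordered.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))  # the hold-out fold
        train_idx = (list(range(0, start))
                     + list(range(start + size, n_samples)))
        yield train_idx, test_idx
        start += size
```

Each sample appears in exactly one hold-out fold, and the k test scores are typically averaged. Note that plain k-fold trains on observations that come after the hold-out fold; for price series, a forward-chaining split that only trains on earlier data is a common alternative.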

Comparative Analysis
The distribution of the number of papers published in recent years is presented in Figure 5. The number of publications increased from 2009 and peaked in 2019, but was lower over the last two years. The distribution of machine learning algorithms used for SMP is shown in Figure 6, where the SVM was the most popular technique used. However, the ANN and DNN have attracted the research community's attention for the last few years. Traditional neural network approaches may not make accurate predictions because their randomly initialized weights may become trapped in local optima, which results in incorrect predictions [123]. The deep learning approaches are used to analyze complicated patterns in the stock data, and provide much faster results. Furthermore, no single technique can promise to give the optimum results. The comparative analysis between the type of data used and the performance of the models is represented in Figure 7. Models that use social media data alone do not perform better than those that use market data and technical indicators. However, when data from textual sources are combined with market data and technical indicators, the model performance increases.

Challenges and Open Issues
Financial market analysis and prediction continues to be a fascinating and challenging problem. Nowadays, data access is becoming easier, but difficulties are increasing in the acquisition and processing of data to extract valuable insights and analyze their impact on stock prices. Feature extraction from financial data is a challenging task, as it is essential to observe the diversity of the variables that are used for the prediction. Financial datasets are usually noisy [156], and the quality of the data significantly affects SMPs.
Most of the literature on stock prediction affirms that the proposed methodologies can be used in real time. However, these methods may work only in controlled circumstances, and live testing remains a big challenge. Live testing introduces challenging factors, such as variations in prices, noise, and unpredicted events. One such example is the Knight Capital incident, in which the company lost 440 million dollars [157].
Market volatility is the degree to which the market price of an investment fluctuates. The main reasons for volatility are uncertainty and inflation, and the risk increases when the market is volatile. Volatility also has a persistent influence on investors' emotions. The prediction of stock prices is challenging when the market is volatile. One of the reasons for market volatility is algorithmic trading. One such example is the flash crash, which wiped out $860 billion from US stock markets within 30 min [158]. International politics also plays a dramatic role in stock market volatility [159].
Events in which panic selling is triggered are nowadays becoming more common, and they result in market overreaction. Panic selling is the reaction to fear and loss, which leads to the wide-scale selling of investments. The leading causes which result in panic selling are high speculation in the market, political issues, and economic instability [46,160]. It becomes more difficult for a researcher to evaluate market behavior in such situations.
New algorithms continue to enter the markets at a steady pace, and it is challenging to compare their adequacy and accuracy. A fascinating aspect of this research area is its self-defeating nature. In simple terms, sharing the methodologies that generate high profits with market competitors will render those methodologies useless. As a result, the best-performing trading algorithms in the markets remain restricted and private, and the procedures or strategies behind them are never published.
The data on social media platforms can be generated either by humans or by bots. The sentiments of bots can sometimes result in inaccurate predictions. As such, there arises a need for social bot detection to obtain better predictions [161]. Investigators, analysts, and researchers are continuously reporting the potential dangers brought about by social bots. Market investors actively participate in and react to social media sentiments. As such, it can be said that the data from social platforms play a significant role in stock prediction. One example is from 23 April 2013, when the Syrian Electronic Army hacked the Twitter account of the Associated Press and posted fake news of a terror attack on the White House in which President Obama was allegedly injured. This provoked an immediate crash in the stock markets [162–164]. Due to the rising impact of online networking on numerous aspects of our lives, more attention is paid to sentiment analysis based on data generated from social media. These data can be unreliable and hard to process due to various factors, such as fake news and the bot data published on the web by numerous sources. It is challenging to identify the quality data and draw valuable insights from them. A useful alternative or additional resource for the prediction of stocks is the quarterly or yearly reports filed by organizations. These records, when interpreted accurately, give significant knowledge of an organization's status, which can help with the understanding of the future stock trend.

Conclusions
Financial markets provide an excellent platform for investors and traders, who can trade from any device that connects to the internet. Over the last few years, people have become more attracted to stock trading. Like any other walk of life, the stock market has also changed due to the advent of technology. Online trading has changed the way individuals buy and sell stocks and grow their investments. The financial markets have advanced rapidly, and have formed an interconnected global marketplace. These advancements pave the way to new opportunities.
In contrast to conventional frameworks, SMP is currently performed using machine learning, big data analytics, and deep learning, which provide more optimal decision making. Stock markets, nowadays, are vulnerable to social media sentiments and cyberattacks. Researchers can play a significant role and flourish in these areas by developing the frameworks for better and more secure trading.
This article reviewed studies based on a generic framework of SMP, as presented in Figure 2. It mainly focused on the studies from the last decade (2011–2021). The studies were analyzed and compared based on the type of data used as the input, the data pre-processing approaches, and the machine learning techniques used for the predictions. Furthermore, it reviewed the different evaluation metrics used for performance measurement by different studies, as presented in Section 7. Moreover, an extensive comparative analysis was performed, and it was concluded that SVM is the most popular technique used for SMP. However, techniques like ANN and DNN are increasingly used, as they provide more accurate and faster predictions. Furthermore, the inclusion of both market data and textual data from online sources improves the prediction accuracies. Section 9 discussed the generic challenges and open issues in SMP systems.

Data Availability Statement:
No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest:
The authors declare no conflict of interest.