Entropy
  • Article
  • Open Access

23 January 2023

Investigating Deep Stock Market Forecasting with Sentiment Analysis

Department of Mathematics, University of Patras, 26504 Patras, Greece
This article belongs to the Special Issue Recent Trends and Developments in Econophysics

Abstract

When forecasting financial time series, incorporating relevant sentiment analysis data into the feature space is commonly assumed to increase a model's predictive capacity. In addition, deep learning architectures and state-of-the-art schemes are increasingly used due to their efficiency. This work compares state-of-the-art methods for financial time series forecasting that incorporate sentiment analysis. Through an extensive experimental process, 67 different feature setups consisting of stock closing prices and sentiment scores were tested on a variety of datasets and metrics. In total, 30 state-of-the-art algorithmic schemes were used over two case studies: one comparing methods and one comparing input feature setups. The aggregated results indicate, on the one hand, the predominance of one particular method and, on the other, a conditional improvement in model efficiency after the incorporation of sentiment setups in certain forecast time frames.

1. Introduction

Somewhere in the course of history, the human species' need for knowledge of the possible future outcomes of various events emerged. Associative norms were thus constructed between decision-making and observed data, shaped by theoretical biases that had been inductively established on the basis of such observations. Protoscience was formed. Or not?
Even if this hypothetical description of human initiation into scientific capacities is naive or even unfounded, the bottom line is that the human species partly operates on the basis of predictions. Observing time-evolving phenomena and questioning their structure, with the aim of an understanding from which predictions about their future behavior can be derived, constitutes an inherent part of post-primitive human history. In response to this self-referential demand, and assuming that the authors are post-primitive individuals, the core of the present work is about predicting sequential and time-dependent phenomena. This domain is called time series forecasting. Time series forecasting is, in broad terms, the process of using a model to predict future values of variables that characterize a phenomenon based on historical data. A time series is a set of time-dependent observations sampled at specific points in time, with a sampling rate that depends on the nature of the problem. Moreover, depending on the number of variables describing the sequentially recorded observations, a distinction is made between univariate and multivariate time series. Since there is a wide range of time-evolving problems, the field is quite relevant in modern times, with an increasing demand for model accuracy and robustness.
In addition, there are phenomena whose mathematical formalization yields time series whose values are partly determined by the composition of a given society of individuals. This means that the attitudes of such individuals, as they form within the whole, are somewhat informative about aspects of the phenomenon in question. Given human nature and its consequent conceptual treatment of the world, it is natural that these attitudes are articulated somewhere linguistically. Therefore, if the attitudes that such linguistic expressions signify can be validly quantified mathematically, these quantifications could provide a framework for improving the modeling of the phenomena in question. For example, specific economic figures can be points in a context whose elements are partially shaped by what is said about them. Accordingly, it can be argued that a line of research investigating whether the future fluctuations of stock closing prices can be modeled using relevant linguistic data collected from social networks is valid.
Thus, in this work, the incorporation of sentiment analysis into stock market forecasting is investigated. In particular, a large number of state-of-the-art methods are put under an experimental framework that includes multiple configurations of input features incorporating quantified sentiment attitudes in the form of time series. These time series consist of sentiment scores extracted from Twitter using three different sentiment analysis methods. Regarding prediction methods, schemes exist in both statistics and machine learning; within the machine learning domain, deep learning and other state-of-the-art methods currently dominate research. Here, a large number of such widely used state-of-the-art models were benchmarked in terms of performance, and various sentiment setups of input features were tested. Two distinct case studies were investigated. In the first case study, the evaluations were organized according to methods, and the subsequent comparisons followed this grouping. In the second case study, the comparisons concerned the feature setups used as inputs, testing whether sentiment scores improve the predictive capacities of the various models used. All comparisons drew on an extended experimental procedure involving a wide range of multivariate setups that included various sentiment time series. Multiple evaluation metrics and three different time frames were used to derive multiple-view results. Below, first, a brief presentation of the related literature is given. Then, the experimental procedure is thoroughly presented, followed by the results. Finally, Section 5 lists the extracted conclusions.

3. Experimental Procedure

Information regarding the stages of the experimental procedure will now be presented. This presentation will be as detailed as the necessary space constraints and content commitments allow, so as not to disrupt the expository flow of the paper.
It has already been mentioned that, to some extent, the "core" of the present work is an experimental procedure that aims, in its most abstract scope, to check the efficiency, on the one hand, of a number of state-of-the-art algorithms and, on the other, of incorporating sentiment analysis into predictive schemas. Thus, a total of 16 datasets × 67 combinations × 30 algorithms × 3 time shifts = 96,480 experiments were conducted. The data consisted of time series containing the daily closing values of various stocks along with 67 different sentiment score setups. Specifically, 16 datasets of stocks containing such closing price values were used over a three-year period, beginning on 2 January 2018 and ending on 24 December 2020. Sentiment scores were generated from relevant textual data extracted from the Twitter microblogging platform using three different sentiment analysis methods. The sentiment score time series and the closing values were subjected to 7-day and 14-day rolling mean strategies, yielding a total of 12 distinct features. Various combinations of the created features resulted in a total of 67 distinct input setups per algorithm. The calculated sentiment scores along with the closing values were then tested under both univariate and multivariate forecasting schemes. Lastly, 30 state-of-the-art methods were investigated. Below, a more thorough presentation of the aforementioned experimental setting follows.

3.1. Datasets

Starting with data, the process of collecting and creating the sets used will now be addressed.

3.1.1. Overview

To begin with, Table 1 contains the names of the aforementioned datasets along with their corresponding abbreviations. These initial data comprised time series containing closing values for 16 well-known listed companies. All sets cover a three-year period, from 2 January 2018 to 24 December 2020.
Table 1. Stock datasets.
Essentially, the initial features were four: the closing prices of each stock and three additional time series containing the corresponding sentiment scores for the given period. Subsequently, after applying 7- and 14-day rolling averages, a total of 12 features were extracted. Thus, for each share, the final input settings were composed by introducing altered features derived from stock values and from a sentiment analysis process applied to an extended corpus of tweets. Figure 1 depicts a rather abstract snapshot of the whole process, from data collection to the creation of the final input setups.
Figure 1. Feature setups: creation pipeline.
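To make the feature-construction step concrete, the following minimal pandas sketch (with synthetic data and hypothetical column names, not the actual study data) derives the 12 base and smoothed features listed in Table A4:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
dates = pd.date_range("2018-01-02", "2020-12-24", freq="B")

# Synthetic stand-ins for the four base series: closing prices plus one
# daily-mean score per sentiment method (column names are hypothetical).
df = pd.DataFrame({
    "close":   100 + rng.standard_normal(len(dates)).cumsum(),
    "blob":    rng.uniform(-1, 1, len(dates)),
    "vader":   rng.uniform(-1, 1, len(dates)),
    "finbert": rng.uniform(-1, 1, len(dates)),
}, index=dates)

# 7- and 14-day rolling means of every base series:
# 4 base + 8 smoothed = 12 candidate features (cf. Table A4).
for col in ["close", "blob", "vader", "finbert"]:
    for w in (7, 14):
        df[f"rm{w}_{col}"] = df[col].rolling(w).mean()

print(df.shape[1])  # 12
```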

3.1.2. Tweets and Preprocessing

A large part of the process involved deriving sentiment scores related to stocks. Using the Twitter Intelligence Tool (TWINT) [53], a large number of stock-related posts written in English were downloaded from Twitter and grouped by day. TWINT is an easy-to-use yet sophisticated Python-based Twitter scraping tool. After a comprehensive search for stock-related remarks that were either directly or indirectly linked to the shares under consideration, a sizable collection of text data containing daily attitudes toward stocks was created. The collected textual sets then underwent the preprocessing procedures necessary for them to be passed on to the classification modules that extract their respective sentiment scores.
Regarding tweet preprocessing, irrelevant hyperlinks and URLs were first removed using the Re Python library [54]. Each tweet was then converted to lowercase and split into words. Unwanted phrases from a manually produced list, as well as various numerical strings, were also removed. After performing the necessary joins to restore each text to its original structure, each tweet was tokenized into sentences using the NLTK library [55,56]. Lastly, punctuation was removed using the String module [57]. The whole text-preprocessing step is schematically presented in Figure 2.
Figure 2. Preprocessing.
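A minimal sketch of this preprocessing pipeline is given below. The unwanted-token list is a hypothetical placeholder, and NLTK's sentence tokenizer assumes the punkt resource has been downloaded:

```python
import re
import string
from nltk.tokenize import sent_tokenize  # needs: nltk.download("punkt")

UNWANTED = {"rt", "via"}  # hypothetical manually curated list

def preprocess(tweet: str) -> str:
    # 1. Strip hyperlinks and URLs.
    text = re.sub(r"http\S+|www\.\S+", "", tweet)
    # 2. Lowercase, split into words, drop unwanted tokens and numerics.
    words = [w for w in text.lower().split()
             if w not in UNWANTED and not any(c.isdigit() for c in w)]
    # 3. Re-join, tokenize into sentences, and strip punctuation.
    table = str.maketrans("", "", string.punctuation)
    sentences = sent_tokenize(" ".join(words))
    return " ".join(s.translate(table) for s in sentences)

print(preprocess("RT Check $AAPL! https://t.co/xyz up 3% today."))
# -> "check aapl up today"
```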

3.1.3. Sentiment Analysis

The subsequent process involved extracting sentiment scores from the gathered and cleaned tweets. To perform the sentiment quantification step, three different sentiment analysis methods were utilized.
Specifically, the procedure included extracting sentiment scores from TextBlob [58], using the Vader sentiment analysis tool [59], and incorporating FinBERT [60]. FinBERT is a financial fine-tuning of the BERT [61] language representation model. Using each of the above methods, daily sentiment scores were extracted for each stock. The daily mean was then computed, forming the sentiment-valued time series of every corresponding method. Then, 7- and 14-day moving averages were applied to the previously extracted sentiment score time series. This resulted in nine sentiment time series, which, together with the application of the same procedure to the closing price time series, led to the final number of 12 generated time series used as features. Various combinations of the above features, along with the univariate scenario, resulted in 67 different study cases. These constituted the distinct experimental setups that were run for every algorithm. The use of three different sentiment analysis methods has already been mentioned. Below, a rough description of these methods is given, followed by a code sketch of the daily scoring step; for further information, the reader is advised to refer to the respective papers.
  • TextBlob: The TextBlob module is a Python-based library for performing a wide range of manipulations over text data. The specific TextBlob method used in this work is a rule-based sentiment analysis scheme; that is, it works by applying manually created rules, which determine the value attributed to the corresponding sentiment score. An example of such a rule would be counting the number of times a term of interest appears within a given section and adjusting the projected sentiment score according to the way the phrase is assessed. Here, by exploiting TextBlob's sentiment property, a real number within the [−1, 1] interval representing the sentiment polarity score was generated for each tweet. The individual scores of each day's tweets were then averaged to obtain a single sentiment value representing the users' daily attitudes;
  • Vader: Vader is also a straightforward rule-based approach for general sentiment analysis. In the context of this work, the Vader sentiment analysis tool was used to extract a compound score produced by normalizing the sentiment values that the algorithm calculates. Specifically, given a string, the procedure outputs four values: negative, neutral, and positive sentiment values, as well as the aforementioned compound score, which was the one used. A daily average of all compound scores was generated in the usual way. The resulting time series contained daily sentiment scores ranging within the [−1, 1] interval;
  • FinBERT: Regarding FinBERT, the implementation contained in [62] was utilized; specifically, the model trained on the PhraseBank presented in [63] was used. Again, daily sentiment scores were first extracted to eventually form a daily-average time series. Generally, the method is a pre-trained natural language processing (NLP) model for sentiment analysis, produced by fine-tuning the pre-trained BERT model on financial textual data. BERT, short for bidirectional encoder representations from transformers, is an implementation of the transformer architecture for natural language processing problems. The technique is basically a pre-trained representational model based on transfer learning principles. Given textual data, multi-layer deep representations are trained with a bidirectional attention strategy so that the various contexts of each linguistic token shape the content of the token's embedding. Regardless of the domain of the data—here financial—the model can be fine-tuned by adding only a single task-specific layer.
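The following sketch illustrates the daily scoring step for the two rule-based tools, assuming the textblob and vaderSentiment packages; the tweets are invented placeholders, and the FinBERT scores would be produced analogously with the fine-tuned model of [62,63]:

```python
import pandas as pd
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Hypothetical input: cleaned tweets grouped by calendar day.
tweets_by_day = {
    "2018-01-02": ["great earnings very bullish", "not convinced by the guidance"],
    "2018-01-03": ["terrible quarter selling everything"],
}

rows = []
for day, tweets in tweets_by_day.items():
    rows.append({
        "date": day,
        # TextBlob polarity in [-1, 1], averaged over the day's tweets.
        "blob": sum(TextBlob(t).sentiment.polarity for t in tweets) / len(tweets),
        # Vader normalized compound score in [-1, 1], likewise averaged.
        "vader": sum(analyzer.polarity_scores(t)["compound"] for t in tweets) / len(tweets),
    })

scores = pd.DataFrame(rows).set_index("date")
print(scores)
```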

3.2. Algorithms

In this section, the methods, algorithmic schemes, and architectures employed in the experiments are listed. Additional details are given on the implementation framework and the tools used.
Regarding the algorithms used, a total of 30 different state-of-the-art methods and method variations were compared. This number results from supplementing the set of well-known core methods with their variations; further details can be found in the documentation of the tsAI library [64], with which the implementation was carried out. This multitude of methods makes a detailed presentation impractical here; the reader is urged to consult the cited papers. Table 2 contains the main algorithms utilized during the experimental procedure along with a corresponding citation. There, among others, one can notice that in addition to a multitude of state-of-the-art methods, implementations involving combinations of the individual architectures were also used. Note that, in addition to the corresponding papers, information regarding the variations of the basic algorithms employed can be found, inter alia, in the notebook files of the library implementation.
In order to carry out the experiments, the Python library tsAI [64] was used. The tsAI module is "an open-source deep learning package built on top of Pytorch and Fastai focused on state-of-the-art techniques for time series tasks like classification, regression, forecasting" [64], and others. Here, the forecasting procedure was essentially treated as a predictive regression problem. In the experiments, the default parameters of the respective methods from the library were preserved, with the implementation environment kept fixed for all algorithmic schemes. Thus, all algorithms compared were used in their most basic configuration. That way, one also gains insight into applying high-level yet low-code programming and data analysis to real-world tasks. Of the data, 20% were used as the test set. Regarding prediction time horizons, three forecast scenarios were implemented: one single-step and two multi-step. In particular, the multi-step forecasts provided estimates for a seven-day window on the one hand and a fourteen-day window on the other. The results were evaluated according to the metrics presented in the following paragraphs.
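Treating forecasting as a sliding-window regression problem can be sketched as follows. This is a simplified numpy illustration of the framing and the chronological 80/20 split, not the actual tsAI data preparation, and the 30-step lookback is an assumption:

```python
import numpy as np

def make_windows(series, lookback, horizon):
    """Frame forecasting as regression: map a window of past values
    to the next `horizon` values (1, 7, or 14 in the experiments)."""
    X, y = [], []
    for i in range(len(series) - lookback - horizon + 1):
        X.append(series[i:i + lookback])
        y.append(series[i + lookback:i + lookback + horizon])
    return np.array(X), np.array(y)

# Synthetic price series standing in for one stock's closing values.
prices = 100 + np.random.default_rng(0).standard_normal(750).cumsum()
X, y = make_windows(prices, lookback=30, horizon=7)

# Chronological split: the last 20% of windows form the test set.
cut = int(len(X) * 0.8)
X_train, y_train = X[:cut], y[:cut]
X_test, y_test = X[cut:], y[cut:]
print(X_train.shape, y_test.shape)  # (571, 30) (143, 7)
```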
Table 2. Algorithms.
No. | Abbreviation | Algorithm ¹
1 | FCN | Fully Convolutional Network [65]
2 | FCNPlus | Fully Convolutional Network Plus [66]
3 | IT | Inception Time [67]
4 | ITPlus | Inception Time Plus [68]
5 | MLP | Multilayer Perceptron [65]
6 | RNN | Recurrent Neural Network [69]
7 | LSTM | Long Short-Term Memory [70]
8 | GRU | Gated Recurrent Unit [71]
9 | RNNPlus | Recurrent Neural Network Plus [69]
10 | LSTMPlus | Long Short-Term Memory Plus [69]
11 | GRUPlus | Gated Recurrent Unit Plus [69]
12 | RNN_FCN | Recurrent Neural—Fully Convolutional Network [72]
13 | LSTM_FCN | Long Short-Term Memory—Fully Convolutional Network [73]
14 | GRU_FCN | Gated Recurrent Unit—Fully Convolutional Network [74]
15 | RNN_FCNPlus | Recurrent Neural—Fully Convolutional Network Plus [75]
16 | LSTM_FCNPlus | Long Short-Term Memory—Fully Convolutional Network Plus [75]
17 | GRU_FCNPlus | Gated Recurrent Unit—Fully Convolutional Network Plus [75]
18 | ResCNN | Residual—Convolutional Neural Network [76]
19 | ResNet | Residual Network [65]
20 | ResNetPlus | Residual Network Plus [77]
21 | TCN | Temporal Convolutional Network [78]
22 | TST | Time Series Transformer [79]
23 | TSTPlus | Time Series Transformer Plus [80]
24 | TSiTPlus | Time Series Vision Transformer Plus [81]
25 | Transformer | Transformer Model [82]
26 | XCM | Explainable Convolutional Neural Network [83]
27 | XCMPlus | Explainable Convolutional Neural Network Plus [84]
28 | XceptionTime | Xception Time Model [85]
29 | XceptionTimePlus | Xception Time Plus [86]
30 | OmniScaleCNN | Omni-Scale 1D-Convolutional Neural Network [87]
¹ Methods and method variations used.

3.3. Metrics

Regarding performance evaluation, six metrics were used. Using several metrics serves not only to present the conclusions of a large comparison of methods and feature/sentiment setups but also to provide diverse evaluation perspectives that can be used in future research. Each metric exposes the results from a different angle, and an investigation would be incomplete if it focused on just one of them. Likewise, each of the six performance indicators utilized has its own advantages and disadvantages. The metrics used are:
  • the Mean Absolute Error (MAE);
  • the Mean Absolute Percentage Error (MAPE);
  • the Mean Squared Error (MSE);
  • the Root Mean Squared Error (RMSE);
  • the Root Mean Squared Logarithmic Error (RMSLE);
  • the Coefficient of Determination (R²).
In what follows, a rather detailed description of these well-known evaluation metrics is given. The presentation aspires to provide details and some insight regarding their interpretation. Below, the actual values are denoted by $y_{a_i}$ and the forecasts by $y_{p_i}$.

3.3.1. MAE

First is MAE:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_{p_i} - y_{a_i}\right|$$
MAE stands for the arithmetic mean of the absolute errors; it is a very straightforward metric that is easy to calculate. By default, all differences between prediction and observation share the same weight. The absence of exponents in the analytic form ensures well-behaved output even when outliers are present. Results are expressed in the unit of measurement of the target variable. MAE is a scale-dependent error metric; that is, the scale of the observations is crucial. This means that it can only be used to compare methods in scenarios where every scheme addresses the same target variable rather than different ones.

3.3.2. MAPE

Next is MAPE:
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{y_{p_i} - y_{a_i}}{y_{a_i}}\right|$$
MAPE is the mean absolute percentage error. It is a relative rather than an absolute error measure and is common when evaluating forecast accuracy. It is the average of the absolute differences between predictions and observations divided by the absolute value of the observation; multiplying by 100 converts the output to a percentage. The error cannot be calculated when an actual value is zero. Rather than being bounded like a percentage, in practice it can take values in [0, ∞). Specifically, when the predictions contain values much larger than the observations, the MAPE output can exceed 100%. Conversely, when the actual values are small, even small absolute errors can inflate the output of the metric. This, in turn, can lead to a misjudgment of the model's predictive capabilities, believing them to be limited when, in fact, the errors are low. MAPE attributes more weight to cases where the predicted value is higher than the actual one, since these cases produce larger percentage errors; hence, the metric is best suited to methods with low prediction values. Lastly, MAPE, not being scale-dependent, can be used to evaluate comparisons across a variety of different time series and variables.

3.3.3. MSE

The next metric is MSE:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_{p_i} - y_{a_i}\right)^2$$
MSE stands for mean squared error and is a common forecast evaluation metric. It is the average of the squares of the differences between the actual and predicted values, and its unit of measurement is the square of the unit of the variable of interest. In the analytical form, squaring the differences ensures the non-negativity of the error while still retaining information about minor errors. At the same time, larger deviations obviously entail larger penalties, i.e., a higher MSE. Thus, outliers strongly influence the output; the existence of such extreme values has a significant impact on the measurements and, consequently, the evaluation. Furthermore, and in a sense the other way around, when differences are less than 1, there is a risk of overestimating the predictive capabilities of the model. Finally, being differentiable, the error can easily be optimized.

3.3.4. RMSE

Moving on to RMSE:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_{p_i} - y_{a_i}\right)^2}$$
RMSE stands for root mean squared error. It is a common metric for evaluating differences between estimated values and observations and is computed simply as the square root of the mean squared error. The metric can be thought of as a kind of average distance between the actual values and the predictions: ignoring the 1/n factor, the formula is the Euclidean distance between the vectors of observations and forecasts, and the division by the number of observations yields the interpretation of a normalized distance. Here also, outliers have a significant impact on the output. In terms of interpretation, the RMSE is expressed in the same units as the target variable, and not in their square as the MSE is, making its use straightforward. Finally, the metric is scale-dependent; hence, one can only use it to compare models or model variations on a particular fixed variable.

3.3.5. RMSLE

The next metric is also an error. The formula for RMSLE is as follows:
$$\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(y_{p_i}+1) - \log(y_{a_i}+1)\right)^2}$$
RMSLE stands for root mean squared logarithmic error. The RMSLE is essentially a modified version of the RMSE that is preferred when predictions display significant deviations. It uses logarithms of both the observations and the predicted values, ensuring non-zero arguments for the logarithms through the simple unit additions appearing in the formula. This modified version is resistant to outliers and noise, and it smooths the penalty that the squared-error metrics impose when predictions deviate significantly from observations. The metric cannot be used when negative values are present. RMSLE can be interpreted as a relative error between observations and forecasts, which becomes evident by applying the following property to the term under the square root:
$$\log(y_{p_i}+1) - \log(y_{a_i}+1) = \log\frac{y_{p_i}+1}{y_{a_i}+1}$$
Since RMSLE gives more weight to cases where the predicted value is lower than the actual value, it is a particularly useful metric in real-world applications where underestimates are more problematic than overestimates.

3.3.6. R²

The last metric is the coefficient of determination R²:
$$R^2 = 1 - \frac{SS_{\mathrm{RES}}}{SS_{\mathrm{TOT}}} = 1 - \frac{\sum_{i=1}^{n}\left(y_{p_i} - y_{a_i}\right)^2}{\sum_{i=1}^{n}\left(y_{a_i} - \bar{y}\right)^2}$$
The coefficient of determination R² is not an error metric but the ratio depicted in the above equation. It is essentially not a measure of model reliability; rather, R² is a measure of goodness of fit: a quantification of how well a model fits the data. Its values typically range from 0 to 1. A simple interpretation is the following: the closer the value of the metric is to 1, the better the model fits the observations, i.e., the closer the predictions are to the observed values. Thus, the value 0 corresponds to cases where the explanatory variables do not explain the variance of the dependent variable at all, while the value 1 corresponds to cases where they explain it fully. However, this interval is not strictly the set of values of the metric: there are conditions under which R² takes negative values, as the formula permits. In such cases, the model fits the data worse than a simple horizontal line, essentially being unable to follow the trend. Values outside the usual range therefore indicate either an inadequate model or other flaws in its implementation.
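For reference, a compact numpy implementation of all six metrics, applied to a small invented example, might look as follows:

```python
import numpy as np

def evaluate(y_a, y_p):
    """The six metrics of Section 3.3 (y_a: actuals, y_p: forecasts)."""
    err = y_p - y_a
    mse = np.mean(err ** 2)
    return {
        "MAE":   np.mean(np.abs(err)),
        "MAPE":  np.mean(np.abs(err / y_a)),        # undefined if any y_a == 0
        "MSE":   mse,
        "RMSE":  np.sqrt(mse),
        # log1p(x) = log(x + 1); requires values greater than -1.
        "RMSLE": np.sqrt(np.mean((np.log1p(y_p) - np.log1p(y_a)) ** 2)),
        "R2":    1 - np.sum(err ** 2) / np.sum((y_a - y_a.mean()) ** 2),
    }

y_actual = np.array([100.0, 102.0, 105.0, 103.0])
y_pred   = np.array([101.0, 101.5, 106.0, 104.0])
print(evaluate(y_actual, y_pred))
```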

4. Results

Returning to the dual objective of this work, the two case studies whose results are presented in this section were:
  • On the one hand, the comparison of a large number of contemporary time series forecasting algorithms;
  • On the other hand, the investigation of whether knowledge of public opinion, as reflected in social networks and quantified using three different sentiment analysis methods, can improve the derived predictions.
Accordingly, the presentation of the results of the experimental process is split into two distinct parts, and various statistical analysis and visualization methods are incorporated. It should be noted, however, that the number of comparisons performed yielded quite a large volume of results. Specifically, as already pointed out, in each case, the performance of the 30 predictive schemes and the 67 different feature setups was investigated over three different time frames (1-, 7-, and 14-day shifts). Note that these three time-shifting options have no (or at least no intended) financial significance. The primary goal in designing the framework was to forecast the stock market over short time frames, such as a few days; an expansion was then made to investigate the performance of both methods and feature setups over longer periods. Each of these schemas was evaluated with six different metrics, and the process was repeated for each of the datasets. Consequently, complete tables of numerical results cannot contribute satisfactorily to the understanding of the conclusions drawn. Below, following a necessarily brief reminder of the process, the results are presented.
As has already been mentioned, for each of the stocks the following strategy was followed: each of the thirty algorithms to be compared was run 67 times, each time accepting as input one of the different feature setups. This was repeated three times, once for each of the three forecast time frames, and in each run the six evaluation metrics were calculated. The comparison of the algorithms was performed using Friedman statistical tests over feature setups for each of the time shifts. Thus, given setups and stocks, the ranking of the methods per evaluation metric was extracted using the Friedman test [88]; for this case study, a total of 67 × 6 × 3 = 1206 statistical tests were executed. In a similar way, the Friedman rankings of the input feature setups were estimated per metric and time shift, given the various algorithms and stocks; here, a total of 30 × 6 × 3 = 540 statistical tests were performed. An additional abstraction of the results was derived as follows: for each of the 30 methods, the average rank achieved over feature setups and shares was calculated, so that, for each metric and each of the three time frames, a more comprehensive display of the information was obtained. In an identical way, when checking the effectiveness of features, the average over the 30 algorithms was calculated for each of the 67 input setups. In both cases, the ranking was based on the positions produced by the Friedman test, while the Nemenyi post hoc test [89] that followed checked every pair of schemas for significant differences. The results of the Nemenyi post hoc tests are shown in the corresponding critical difference diagrams (CD diagrams), in which methods that are not significantly different are joined by black horizontal lines; two methods are considered not significantly different when the difference between their mean ranks is less than the CD value.
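Under the assumption that scipy and the scikit-posthocs package are available, this statistical testing step can be sketched as follows; the score matrix here is random placeholder data standing in for the per-setup errors:

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp  # assumed extra dependency for the Nemenyi test

rng = np.random.default_rng(0)
# Hypothetical block design: rows = 67 feature setups (blocks),
# columns = three competing methods, entries = an error metric (e.g., MAE).
errors = rng.uniform(0.1, 1.0, size=(67, 3))

stat, p = friedmanchisquare(errors[:, 0], errors[:, 1], errors[:, 2])
print(f"Friedman chi-square = {stat:.3f}, p = {p:.4f}")

# Nemenyi post hoc: pairwise p-values; in a CD diagram, methods whose
# mean ranks differ by less than the critical difference are connected.
print(sp.posthoc_nemenyi_friedman(errors))
```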
Next, organized in both cases based on time frames, the results concerning the comparison of the forecast algorithms are presented, which are followed by those regarding the feature setups.

4.1. Method Comparison

The presentation begins with results concerning the investigation of methods, presented per forecast time shift. In each case, the Friedman ranking results for all six metrics are listed. To save space, only the methods occupying the top ten positions of the ranking are listed; full tables are available at shorturl.at/FTU06 (accessed on 15 January 2023). The CD diagrams follow, where one can visually observe which methods exhibit similar behavior and which differ significantly. Finally, box plots of the results per metric are presented, again for the best 10 methods. The box plots concisely summarize the distribution of the data, in our case the average values of the sentiment setups per algorithm over all stocks. In particular, one can read off the maximum and minimum values, the median, and the 1st and 3rd quartiles, below which 25% and 75% of the observations fall, respectively.

4.1.1. Time Shift 1

With respect to the one-day forecasts, Table A1 lists the Friedman ranking results for the top 10 scoring methods per metric. Although no single method dominates all metrics and significant reorderings are observed across table positions, the TCN method achieves the best ranking in three out of six metrics (MAPE, R², and RMSLE) and is always in the top four. Furthermore, from the box plots, it is evident that TCN has by far the smallest range of values.
Apart from this, GRU_FCN is always in the top five across all metrics. It is also observed that LSTM_FCN and LSTMPlus behave equally well. LSTMPlus ranks first in two metrics, namely MAE and RMSE, second in R² and RMSLE, and third in MSE; according to MAPE, however, the method is not even in the top ten. Thus, as will be seen in what follows, TCN is the consistent choice.
The results produced by Friedman’s statistical test, in terms of the six metrics, are presented in Table A1, while the corresponding CD diagrams and box plots are depicted in Figure 3 and Figure 4.
Figure 3. Box Plots: Methods—Shift 1.
Figure 4. CD Diagrams: Methods—Shift 1.

4.1.2. Time Shift 7

At the one-week forecast time frame, the algorithms occupying the top positions in the ranking produced by the statistical testing appear to have stabilized. The corresponding ranking produced by the Friedman statistical test for the ten best methods with respect to the six metrics is presented in Table A2. In all metrics, the TCN method ranks first. From the CD diagrams, it can be seen that in all metrics except R², this superiority is also validated by the method differing significantly from the others. The box plots show the method also having the smallest range around the median. Figure 5 and Figure 6 contain the relevant results in the form of box plots and CD diagrams.
Figure 5. Box Plots: Methods—Shift 7.
Figure 6. CD Diagrams: Methods—Shift 7.
Other methods that clearly show some dominance over the rest in the given performance ratings are, on the one hand, TSTPlus, which ranks second in all metrics except MAPE, and, on the other, XCMPlus and XCM, which are mostly found in the top five. In general, the same methods appear in similar positions across all metrics, with minor rank variations. In addition, the statistical relationships between the methods are shown in the CD diagrams.

4.1.3. Time Shift 14

In the forecast results with a two-week shift, a relative agreement can be seen in the top-ranking algorithms with those of the one-week frames. The ranking produced by the Friedman statistical test for the ten best methods with respect to the six metrics is presented in Table A3.
Once more, TCN ranks first in all metrics. TSTPlus again ranks second in all metrics except R², where it ranks third. In almost all cases, XCMPlus and RNNPlus appear in the top five. As in the previous time shift, there is relative agreement in the methods appearing in the corresponding positions across all metrics. According to the above, the general superiority of the TCN method in this scenario is clearly established. The corresponding CD diagrams and box plots for the 10 best performing algorithms are seen in Figure 7 and Figure 8.
Figure 7. Box Plots: Methods—Shift 14.
Figure 8. CD Diagrams: Methods—Shift 14.

4.2. Feature Setup Comparison

We now move on to the findings of the second case study, which concern, on the one hand, whether the use of sentiment analysis contributes to improving the extracted predictions and, on the other, the identification of specific feature setups whose use improves a model's predictive ability.
Again, the results of the experimental procedure will be presented separately for the three forecast time frames. Likewise, due to the volume of results, only the 10 most promising feature setups will be listed. These were again derived from the Friedman ranking of the averages calculated for each setup over the 30 forecasting methods used. The full rankings of all 67 setups can be found at shorturl.at/alqwx (accessed on 13 December 2022). For the presentation below, the corresponding CD diagrams and box plots are again used.

4.2.1. Time Shift 1

Starting with the results concerning one-day-ahead forecasting, one notices that the univariate version, in which forecasts are based only on the stock prices of previous days, ranks first only for the R² metric. In fact, in three metrics, the univariate version is not even in the top twenty of the ranking (see Figure 9 and Figure 10).
Figure 9. Box Plots: Features—Shift 1.
Figure 10. CD Diagrams: Features—Shift 1.
Another interesting observation is that, even though the sentiment setups are reranked across the six metrics, the Blob_RM_7_Blob setup (B_RM7B in the appendix tables), i.e., the setup incorporating Blob and Rolling Mean 7 Blob along with the closing-values time series, ranks first in four metrics (MAE, MSE, RMSE, and RMSLE) and second in MAPE, despite not scoring well with respect to R². Moreover, the results make it evident that an argument in favor of using sentiment analysis in multivariate time series layouts is, at the least, relevant even for one-day-ahead forecasts. At the same time, using smoothed versions of both the sentiment time series and the closing-price time series appears to be generally beneficial.

4.2.2. Time Shift 7

Regarding the time frame of one week, the univariate version ranks marginally first in three metrics, namely R², RMSE, and RMSLE, while in the MAPE and MSE metrics the plain Vader setup appears superior, while also being second in MAE and RMSE and fifth in RMSLE (Figure 11 and Figure 12).
Figure 11. Box Plots: Features—Shift 7.
Figure 12. CD Diagrams: Features—Shift 7.
It is also notable that Blob_RM_7_Blob, which appeared to perform particularly well during the one-day shift, remains in the top three rankings in five of the six metrics. More generally, once again, one notices that there are rearrangements, especially in the central positions of the table. However, given the small differences in performance between the different setups, this should not be considered unreasonable. Overall, the picture still points in favor of using multivariate inputs containing sentiment data.

4.2.3. Time Shift 14

Finally, regarding the two-week time frame, a first observation is that, with respect to R², a feature setup that does not contain sentiment data dominates; this pattern is also present in the previous time shifts (see Figure 13 and Figure 14).
Figure 13. Box Plots: Features—Shift 14.
Figure 14. CD Diagrams: Features—Shift 14.
In addition, although there are metrics in which the univariate version is in the top ten, in these cases the difference in its performance from the top positions is quite significant. This is easily seen from the CD diagrams: there are no connections with the setups appearing in the top positions. At the seven-day time lag, the univariate version prevailed in three cases; at the 14-day time shift, however, the superiority of setups that use sentiment data is reinforced.
At the same time, combinations containing the closing price appear in the first positions of the table more often than in the previous two time frames. Furthermore, the setup that dominates four of the six metrics (MAE, MAPE, MSE, and RMSE) is RM_7_Close_Blob (RM7C_B in the appendix tables), the feature setup that incorporates both a smoothed version of the closing values and sentiment scores. Thus, using rolling means of the original time series together with sentiment scores is shown to be mostly optimal, regardless of the individual choice of a specific layout; methodologically, the utilization of both has an improving effect.

5. Conclusions

Some general conclusions drawn from the whole experimental procedure will now be addressed. The discussion will follow the binary separation of the preceding case studies.

5.1. Methods

The first case study of the paper consisted of a comparison of 30 methods for time series forecasting. Within the experimental context discussed above, the extracted results safely allow a conclusion regarding the superiority of the TCN method over the rest: in the vast majority of comparisons it excels, being, for the most part, at the top of the Friedman ranking. In particular, the only cases where it does not outperform all the rest are found in the single-day time frame predictions. In fact, the CD diagrams additionally show that, in many cases, the superiority of the method is marked by a significant difference. Beyond TCN, other methods with significant predictive capacities were identified: TSTPlus, which produces significant results particularly over longer time horizons, and XCMPlus.
In Figure 15, one can see the relative rankings of these three methods per time shift; the values correspond to those of Table A1, Table A2 and Table A3. Regarding the one-day forecast window, LSTMPlus is an additional option, as is the combination of GRU and FCN, although there the significance of the individual method differences is less clear. Conversely, conclusions can also be drawn regarding methods whose behavior was not, on average, satisfactory. In particular, TSiTPlus ranks last in all three scenarios across all metrics, and methods such as the Transformer Model, XceptionTime, and XceptionTimePlus are at the bottom of the table in the vast majority of cases. In conclusion, given the limitations and further prerequisites developed throughout this paper, TCN can be readily recommended.
Figure 15. TCN, TSTPlus and XCMPlus relative rankings.

5.2. Feature and Sentiment Setups

In relation to the second case study, the results also point in some important directions. The main conclusion is that using information derived from both smoothed versions of the initial time series and sentiment analysis has, in most cases, a beneficial effect on the derived forecasts. Sentiment-free input setups dominate the rest only in a small number of cases, and, as the CD diagrams confirm, in only two of them is the difference significant.
Moreover, regarding whether the use of sentiment setups specifically leads to more accurate forecasts, the individual layouts of the weighted results suggest that, in general, sentiment analysis improves forecasts. Of course, it is also reasonable to ask whether a specific sentiment setup outperforms the rest, which would also yield an assessment of the performance of the three sentiment analysis methods used. This question needs further investigation, and even with further inquiries within the experimental framework presented here, it is not certain that firm conclusions would be drawn: while such setups can be found for each time horizon, there is not one that dominates all three.
In order, however, to illustrate a relative ranking of the three sentiment analysis methodologies used, regardless of the particular variation involved, an additional table was created. All variations of each method were placed under a corresponding class, and the Friedman aligned ranks [90] were then calculated. Hence, to draw a clearer picture of how the three employed approaches to sentiment analysis performed, three sentiment classes were formed, one matching each of the previously described methods. The class value for each metric is the arithmetic mean over all sentiment setups that contain variations of only that one sentiment analysis algorithm. In other words, each class represents a sentiment analysis method and corresponds to six sentiment setups containing variations exclusively of the technique in question: method, RM7method, RM14method, method + RM7method, method + RM14method, and RM7method + RM14method. The sum is divided by six, the number of such setups, and this result is the value depicted. In this way, setups produced either by combining the various sentiment analysis methods or by using the target variable in rolling-mean variants are excluded, so that only the relative performances of the three individual techniques and their variations are compared.
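As a small illustration of this aggregation, using invented rank values:

```python
# Hypothetical Friedman-aligned ranks for the six Blob-only setups under
# one metric and one time shift; the class value is their arithmetic mean.
blob_only_setups = {
    "B": 28.4, "RM7B": 31.2, "RM14B": 27.9,
    "B_RM7B": 19.7, "B_RM14B": 30.0, "RM7B_RM14B": 29.1,
}
blob_class_value = sum(blob_only_setups.values()) / len(blob_only_setups)
print(round(blob_class_value, 2))  # representative value of the Blob class
```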
Figure 16 illustrates these relative rankings of the three sentiment analysis methods per time shift. One can observe the relative performances in terms of individual wins per metric and time shift: the Blob and Vader classes top the ranking seven times each, while the Finbert class has only four wins. Again, no obvious general superiority of a specific algorithm appears. Nevertheless, identifying groups of such setups, even at the level of a specific time frame, can be particularly useful, with the methodology for selecting individual setups needing more investigation.
Figure 16. Sentiment rankings.

Author Contributions

Conceptualization, C.M.L. and S.K.; methodology, C.M.L.; software, C.M.L.; validation, C.M.L., A.K. and S.K.; formal analysis, C.M.L. and A.K.; investigation, C.M.L. and A.K.; resources, S.K.; data curation, A.K.; writing—original draft preparation, C.M.L. and A.K.; writing—review and editing, C.M.L.; visualization, A.K.; supervision, S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

URLs of the full Friedman Ranking results. (i) Methods rankings: shorturl.at/FTU06 (accessed on 15 January 2023). (ii) Feature setup rankings: shorturl.at/alqwx (accessed on 15 January 2023).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Table A1. Friedman results: Algorithms—Shift 1. Cells show: method (Friedman score).

Rank | MAE | MAPE | R²
1st | LSTMPlus (8.266667) | RNN_FCN (10.4) | TCN (22.4)
2nd | LSTM (8.533333) | GRU_FCN (10.53333) | LSTMPlus (21.13333)
3rd | TCN (9.466667) | TCN (11) | LSTM (20.6)
4th | GRU_FCN (10) | LSTM_FCN (11.2) | GRU_FCN (20.53333)
5th | LSTM_FCN (10.2) | GRU_FCNPlus (11.6) | LSTM_FCN (19.73333)
6th | RNN (10.73333) | RNN_FCNPlus (11.73333) | LSTM_FCNPlus (19.13333)
7th | RNN_FCN (11.13333) | RNN (11.93333) | GRU_FCNPlus (18.93333)
8th | GRU_FCNPlus (11.33333) | ResCNN (12.13333) | RNN_FCN (18.93333)
9th | XCM (11.33333) | LSTM_FCNPlus (12.13333) | RNN (18.93333)
10th | LSTM_FCNPlus (11.4) | FCNPlus (12.46667) | XCMPlus (18.66667)

Rank | MSE | RMSE | RMSLE
1st | TCN (9) | LSTMPlus (9.066667) | TCN (7.733333)
2nd | GRU_FCN (9.266667) | LSTM (9.6) | LSTMPlus (9.333333)
3rd | LSTMPlus (9.6) | GRU_FCN (9.6) | LSTM (9.8)
4th | LSTM_FCN (9.8) | TCN (9.733333) | GRU_FCN (10)
5th | LSTM (9.933333) | LSTM_FCN (10.06667) | LSTM_FCN (10.2)
6th | RNN_FCN (10.33333) | RNN_FCN (10.93333) | RNN (11.13333)
7th | LSTM_FCNPlus (10.46667) | LSTM_FCNPlus (11.06667) | GRU_FCNPlus (11.26667)
8th | GRU_FCNPlus (10.8) | RNN (11.13333) | RNN_FCN (11.26667)
9th | RNN_FCNPlus (11.33333) | GRU_FCNPlus (11.2) | LSTM_FCNPlus (11.26667)
10th | FCNPlus (11.46667) | RNN_FCNPlus (11.86667) | GRU (12)
Table A2. Friedman results: Algorithms—Shift 7.

Rank | MAE | MAPE | R²
1st | TCN (3.733333) | TCN (6.133333) | TCN (25.86667)
2nd | TSTPlus (8.266667) | XCMPlus (9.866667) | TSTPlus (25.8)
3rd | XCMPlus (8.866667) | RNNPlus (10.93333) | XceptionTimePlus (19.7)
4th | XCM (10.53333) | TSTPlus (11) | XCMPlus (19.66667)
5th | RNN_FCNPlus (12.06667) | RNN (11.06667) | XceptionTime (19.53333)
6th | GRU_FCNPlus (12.13333) | XCM (11.26667) | XCM (18.53333)
7th | RNN_FCN (12.26667) | LSTMPlus (13) | RNN_FCN (16.9)
8th | GRU_FCN (13.2) | GRU (13.66667) | GRU_FCNPlus (16.66667)
9th | RNN (13.53333) | ResCNN (13.86667) | InceptionTime (16.5)
10th | LSTM_FCNPlus (13.53333) | LSTM (14.06667) | RNN_FCNPlus (16.16667)

Rank | MSE | RMSE | RMSLE
1st | TCN (3.666666667) | TCN (3.933333333) | TCN (3.8)
2nd | TSTPlus (8.733333333) | TSTPlus (8.066666667) | TSTPlus (8.066666667)
3rd | XCMPlus (8.933333333) | XCMPlus (9.066666667) | XCMPlus (9.4)
4th | XCM (11.93333333) | XCM (10.8) | XCM (10.06666667)
5th | RNN_FCNPlus (12.06666667) | RNN_FCN (12.26666667) | RNN (12.13333333)
6th | RNN_FCN (12.2) | RNNPlus (12.8) | RNNPlus (12.46666667)
7th | GRU_FCNPlus (12.6) | RNN_FCNPlus (12.86666667) | RNN_FCN (12.66666667)
8th | LSTM_FCNPlus (13) | GRU_FCNPlus (13) | GRU_FCNPlus (12.86666667)
9th | RNN (13.26666667) | RNN (13.06666667) | RNN_FCNPlus (13.06666667)
10th | FCN (13.4) | LSTMPlus (13.66666667) | GRU_FCN (14.06666667)
Table A3. Friedman results: Algorithms—Shift 14.

Rank | MAE | MAPE | R²
1st | TCN (6) | TCN (8.2) | TCN (25.33333333)
2nd | TSTPlus (8) | TSTPlus (9.466666667) | TST (21.6)
3rd | XCMPlus (10.4) | RNN (9.533333333) | TSTPlus (20.8)
4th | XCM (11.33333333) | RNNPlus (9.666666667) | XceptionTime (18.9)
5th | RNNPlus (11.86666667) | XCMPlus (10.6) | XCMPlus (17.96666667)
6th | LSTMPlus (11.93333333) | LSTM (10.86666667) | XceptionTimePlus (17.7)
7th | LSTM (12.13333333) | XCM (11) | RNNPlus (17.53333333)
8th | RNN (12.46666667) | LSTMPlus (11.73333333) | OmniScaleCNN (17.13333333)
9th | LSTM_FCNPlus (13.46666667) | GRUPlus (13.13333333) | LSTM (17.03333333)
10th | GRU_FCN (14.46666667) | TST (13.8) | RNN (16.93333333)

Rank | MSE | RMSE | RMSLE
1st | TCN (7.8) | TCN (7.133333333) | TCN (4.133333333)
2nd | TSTPlus (7.8) | TSTPlus (7.6) | TSTPlus (7.466666667)
3rd | XCM (10.13333333) | XCM (10.46666667) | XCMPlus (10)
4th | XCMPlus (10.6) | XCMPlus (10.86666667) | XCM (10.2)
5th | RNNPlus (10.86666667) | RNNPlus (10.86666667) | RNNPlus (10.73333333)
6th | LSTM (11.6) | LSTMPlus (11.93333333) | RNN (11.73333333)
7th | LSTMPlus (11.86666667) | LSTM (12.06666667) | LSTM (13.26666667)
8th | RNN (12.53333333) | RNN (12.4) | LSTMPlus (13.33333333)
9th | LSTM_FCNPlus (13.53333333) | LSTM_FCNPlus (13.46666667) | LSTM_FCNPlus (13.8)
10th | FCN (14.93333333) | TST (14.73333333) | InceptionTime (14.13333333)

Appendix B

Appendix B.1

Please use the abbreviation table below to read the corresponding results of the Friedman Ranks.
Table A4. Feature Setups and Abbreviations.

No. | Abbreviation | Feature Setup
1 | U | Univariate
2 | B | Blob
3 | V | Vader
4 | F | Finbert
5 | RM7C | Rolling Mean 7 Closing Value
6 | RM14C | Rolling Mean 14 Closing Value
7 | RM7B | Rolling Mean 7 Blob
8 | RM14B | Rolling Mean 14 Blob
9 | RM7V | Rolling Mean 7 Vader
10 | RM14V | Rolling Mean 14 Vader
11 | RM7F | Rolling Mean 7 Finbert
12 | RM14F | Rolling Mean 14 Finbert

Appendix B.2

Table A5. Friedman results: feature setups—Shift 1. Cells show: feature setup (Friedman score).

Rank | MAE | MAPE | R²
1st | B_RM7B (19.73333) | V_F (19.2) | U (54.93333)
2nd | RM7C_F (24.13333) | B_RM7B (20.06667) | RM7F (50.53333)
3rd | RM7F (24.53333) | B_V (21.8) | RM14C (47.8)
4th | V_F (25.33333) | RM7F (24.8) | RM7C_RM7F (47.73333)
5th | RM7C_B (26.8) | RM7C_F (26.93333) | RM7C (47.13333)
6th | B_V (27.26667) | F_RM14V (27.2) | RM14F (46.4)
7th | B (28.4) | RM7B_RM14V (28.13333) | RM7C_B (45.86667)
8th | RM7F_RM14F (28.4) | RM7B_RM14F (28.2) | RM7C_F (43.6)
9th | RM14F (30) | RM7C_RM14B (28.26667) | B (43.4)
10th | B_RM14V (30) | B_RM14V (29.2) | RM7C_RM14C (43.13333)

Rank | MSE | RMSE | RMSLE
1st | B_RM7B (21) | B_RM7B (20.6) | B_RM7B (20.93333)
2nd | V_F (21.06667) | RM7F (24.13333) | RM7C_F (21.86667)
3rd | B_V (22.66667) | RM7C_F (24.2) | B (24.8)
4th | RM7C_F (24.6) | RM7C_B (26.13333) | V_F (25.66667)
5th | RM7F (25.4) | V_F (26.26667) | RM7C_B (26.26667)
6th | B_RM14V (27.13333) | B_V (28.26667) | RM7C_RM14B (26.4)
7th | F_RM14V (27.73333) | B (28.6) | U (26.4)
8th | RM7B_RM14V (28.2) | RM7F_RM14F (28.86667) | RM7F (26.8)
9th | B_RM7F (28.4) | RM7C_RM7B (29.06667) | RM7C (27.53333)
10th | V_RM7V (28.6) | RM7C_RM14B (29.13333) | B_V (29)
Table A6. Friedman results: feature setups—Shift 7.

Rank | MAE | MAPE | R²
1st | B_RM7B (21.53333) | V (22.33333) | U (55.06667)
2nd | V (22.8) | RM14F (24.8) | RM14C (54.93333)
3rd | RM7B (24.46667) | B_RM7B (25.2) | RM7C (54)
4th | U (24.86667) | V_RM7V (25.93333) | RM7C_RM7V (49.33333)
5th | RM7V (25.33333) | RM7B (26.33333) | RM7C_RM14C (47.23333)
6th | RM14F (25.53333) | RM7C_B (26.66667) | RM14C_RM14F (45.46667)
7th | RM7C_B (25.86667) | RM7C_RM14F (26.8) | RM14C_B (45.33333)
8th | RM7C_RM7F (26.66667) | RM7F_RM14F (27.53333) | RM7C_RM14F (45.26667)
9th | RM7F (26.86667) | U (27.73333) | RM14F (45.13333)
10th | B (27.13333) | V_RM14V (27.93333) | RM14C_RM7V (44.93333)

Rank | MSE | RMSE | RMSLE
1st | V (21.93333) | U (22.73333) | U (17.6)
2nd | B_RM7B (23.13333) | V (23.13333) | RM7C_RM14F (21.13333)
3rd | RM7V (24) | B_RM7B (23.66667) | B_RM7B (22.53333)
4th | V_RM7V (24.8) | RM7B (24.46667) | RM14F (22.73333)
5th | RM7B (25.53333) | RM14F (24.93333) | V (23.2)
6th | RM14F (26.73333) | RM7V (25.4) | RM7F (24.46667)
7th | U (27.2) | RM7C_RM7F (25.8) | RM7C_RM7F (25.73333)
8th | RM7C_RM7F (27.73333) | RM7C_B (25.93333) | RM7C_B (26.4)
9th | B (27.86667) | RM7C_RM14F (26.6) | RM14C_RM7B (26.73333)
10th | RM7C_B (27.93333) | RM7F (27.2) | B (26.86667)
Table A7. Friedman results: feature setups—Shift 14.

Rank | MAE | MAPE | R²
1st | RM7C_B (17.46667) | RM7C_B (18.53333) | RM14C (50.5)
2nd | RM7C_V (21.53333) | RM14C_B (22.2) | RM7C (48.9)
3rd | RM14C_B (22.26667) | RM7C_V (22.93333) | U (48.76667)
4th | RM7C (22.26667) | RM7F_RM14F (23.73333) | RM14C_RM7F (48.16667)
5th | U (23.73333) | B_RM7V (24.2) | RM14C_RM7B (46.83333)
6th | B_RM7V (23.8) | V (25.6) | RM7C_RM7B (46.06667)
7th | V (24.46667) | RM7C_F (26.33333) | RM7C_RM7F (45.76667)
8th | RM7C_F (24.86667) | V_RM14F (27.4) | RM7C_RM14C (45.56667)
9th | RM7B (25.86667) | RM7C (27.66667) | RM7F (43.83333)
10th | RM14B (27.33333) | RM7B (28.13333) | RM14C_F (43.36667)

Rank | MSE | RMSE | RMSLE
1st | RM7C_B (18.26667) | RM7C_B (15.86667) | RM7C (13.8)
2nd | RM7C_V (21.26667) | RM7C (20.33333) | RM7C_B (16)
3rd | B_RM7V (21.6) | RM14C_B (21.26667) | RM14C_B (21.06667)
4th | RM14C_B (23.86667) | RM7C_V (21.4) | U (22.33333)
5th | V (25.06667) | U (22.8) | RM7C_F (24.13333)
6th | RM7B (25.86667) | RM7C_F (24) | RM14B (24.33333)
7th | RM7C (26.2) | V (24.06667) | B_RM7V (25.06667)
8th | RM7C_F (26.26667) | B_RM7V (24.46667) | RM7C_V (26.06667)
9th | RM7B_RM7F (26.33333) | RM7B (26) | V (26.26667)
10th | RM7B_RM7V (26.66667) | RM7C_RM14C (26.2) | RM7B (26.46667)

  29. Lara-Benítez, P.; Carranza-García, M.; Santos, J.C.R. An Experimental Review on Deep Learning Architectures for Time Series Forecasting. Int. J. Neural Syst. 2021, 31, 2130001. [Google Scholar] [CrossRef] [PubMed]
  30. Karanikola, A.; Liapis, C.M.; Kotsiantis, S. A Comparison of Contemporary Methods on Univariate Time Series Forecasting. In Advances in Machine Learning/Deep Learning-Based Technologies: Selected Papers in Honour of Professor Nikolaos G. Bourbakis—Volume 2; Tsihrintzis, G.A., Virvou, M., Jain, L.C., Eds.; Springer International Publishing: Cham, Switzerland, 2022; pp. 143–168. [Google Scholar] [CrossRef]
  31. Wang, K.; Qi, X.; Liu, H. A comparison of day-ahead photovoltaic power forecasting models based on deep learning neural network. Appl. Energy 2019, 251, 113315. [Google Scholar]
  32. Rao, T.; Srivastava, S. Analyzing Stock Market Movements Using Twitter Sentiment Analysis. In Proceedings of the International Conference on Advances in Social Networks Analysis and Mining, Sydney, Australia, 31 July–3 August 2012. [Google Scholar]
  33. Nguyen, T.H.; Shirai, K.; Velcin, J. Sentiment analysis on social media for stock movement prediction. Expert Syst. Appl. 2015, 42, 9603–9611. [Google Scholar]
  34. Kalyani, J.; Bharathi, H.N.; Jyothi, R. Stock trend prediction using news sentiment analysis. arXiv 2016, arXiv:abs/1607.01958. [Google Scholar]
  35. Shah, D.; Isah, H.; Zulkernine, F.H. Predicting the Effects of News Sentiments on the Stock Market. In Proceedings of the 2018 IEEE International Conference on Big Data (Big Data), Seattle, WA, USA, 10–13 December 2018; pp. 4705–4708. [Google Scholar]
  36. Souma, W.; Vodenska, I.; Aoyama, H. Enhanced news sentiment analysis using deep learning methods. J. Comput. Soc. Sci. 2019, 2, 33–46. [Google Scholar]
  37. Valle-Cruz, D.; Fernandez-Cortez, V.; Chau, A.L.; Sandoval-Almazán, R. Does Twitter Affect Stock Market Decisions? Financial Sentiment Analysis During Pandemics: A Comparative Study of the H1N1 and the COVID-19 Periods. Cogn. Comput. 2021, 14, 372–387. [Google Scholar]
  38. Sharma, V.; Khemnar, R.K.; Kumari, R.A.; Mohan, B.R. Time Series with Sentiment Analysis for Stock Price Prediction. In Proceedings of the 2019 2nd International Conference on Intelligent Communication and Computational Techniques (ICCT), Jaipur, India, 28–29 September 2019; pp. 178–181. [Google Scholar]
  39. Pai, P.F.; Liu, C. Predicting Vehicle Sales by Sentiment Analysis of Twitter Data and Stock Market Values. IEEE Access 2018, 6, 57655–57662. [Google Scholar]
  40. Mohan, S.; Mullapudi, S.; Sammeta, S.; Vijayvergia, P.; Anastasiu, D. Stock Price Prediction Using News Sentiment Analysis. In Proceedings of the 2019 IEEE Fifth International Conference on Big Data Computing Service and Applications (BigDataService), Newark, CA, USA, 4–9 April 2019; pp. 205–208. [Google Scholar]
  41. Mehta, P.; Pandya, S.; Kotecha, K. Harvesting social media sentiment analysis to enhance stock market prediction using deep learning. PeerJ Comput. Sci. 2021, 7, e476. [Google Scholar] [CrossRef]
  42. Jin, Z.; Yang, Y.; Liu, Y. Stock closing price prediction based on sentiment analysis and LSTM. Neural Comput. Appl. 2019, 32, 9713–9729. [Google Scholar] [CrossRef]
  43. Wu, S.H.; Liu, Y.; Zou, Z.; Weng, T.H. S_I_LSTM: Stock price prediction based on multiple data sources and sentiment analysis. Connect. Sci. 2021, 34, 44–62. [Google Scholar]
  44. Jing, N.; Wu, Z.; Wang, H. A hybrid model integrating deep learning with investor sentiment analysis for stock price prediction. Expert Syst. Appl. 2021, 178, 115019. [Google Scholar]
  45. Smailovic, J.; Grcar, M.; Lavra, N.; Znidarsic, M. Stream-based active learning for sentiment analysis in the financial domain. Inf. Sci. 2014, 285, 181–203. [Google Scholar]
  46. Raju, S.M.; Tarif, A.M. Real-Time Prediction of BITCOIN Price using Machine Learning Techniques and Public Sentiment Analysis. arXiv 2020, arXiv:abs/2006.14473. [Google Scholar]
  47. Abraham, J.; Higdon, D.W.; Nelson, J.; Ibarra, J. Cryptocurrency Price Prediction Using Tweet Volumes and Sentiment Analysis. SMU Data Sci. Rev. 2018, 1, 1. [Google Scholar]
  48. Valencia, F.; Gómez-Espinosa, A.; Valdés-Aguirre, B. Price Movement Prediction of Cryptocurrencies Using Sentiment Analysis and Machine Learning. Entropy 2019, 21, 589. [Google Scholar] [PubMed]
  49. Deb, A.; Lerman, K.; Ferrara, E. Predicting Cyber Events by Leveraging Hacker Sentiment. Information 2018, 9, 280. [Google Scholar] [CrossRef]
  50. Masri, S.; Jia, J.; Li, C.; Zhou, G.; Lee, M.C.; Yan, G.; Wu, J. Use of Twitter data to improve Zika virus surveillance in the United States during the 2016 epidemic. BMC Public Health 2019, 19, 761. [Google Scholar]
  51. Chauhan, P.; Sharma, N.; Sikka, G. The emergence of social media data and sentiment analysis in election prediction. J. Ambient. Intell. Humaniz. Comput. 2021, 12, 2601–2627. [Google Scholar] [CrossRef]
  52. Tseng, K.K.; Lin, R.F.Y.; Zhou, H.; Kurniajaya, K.J.; Li, Q. Price prediction of e-commerce products through Internet sentiment analysis. Electron. Commer. Res. 2018, 18, 65–88. [Google Scholar] [CrossRef]
  53. Twintproject. Twintproject/Twint: An Advanced Twitter Scraping & OSINT Tool. Available online: https://github.com/twintproject/twint (accessed on 7 October 2021).
  54. Van Rossum, G. The Python Library Reference, Release 3.8.2; Python Software Foundation: Wolfeboro Falls, NH, USA, 2020. [Google Scholar]
  55. Bird, S. NLTK: The Natural Language Toolkit. arXiv 2004, arXiv:cs.CL/0205028. [Google Scholar]
  56. Bird, S.; Klein, E.; Loper, E. Natural Language Processing with Python; Packt Publishing Ltd.: Birmingham, UK, 2009. [Google Scholar]
  57. String—Common String Operations. Available online: https://docs.python.org/3/library/string.html (accessed on 7 October 2021).
  58. Simplified Text Processing. Available online: https://textblob.readthedocs.io/en/dev/ (accessed on 7 October 2021).
  59. Hutto, C.J.; Gilbert, E. VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text. In Proceedings of the International AAAI Conference on Web and Social Media, Ann Arbor, MI, USA, 1–4 June 2014. [Google Scholar]
  60. Araci, D. FinBERT: Financial Sentiment Analysis with Pre-trained Language Models. arXiv 2019, arXiv:abs/1908.10063. [Google Scholar]
  61. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:abs/1810.04805. [Google Scholar]
  62. ProsusAI. ProsusAI/finBERT: Financial Sentiment Analysis with Bert. Available online: https://github.com/ProsusAI/finBERT (accessed on 7 October 2021).
  63. Malo, P.; Sinha, A.; Korhonen, P.J.; Wallenius, J.; Takala, P. Good debt or bad debt: Detecting semantic orientations in economic texts. J. Assoc. Inf. Sci. Technol. 2014, 65, 782–796. [Google Scholar] [CrossRef]
  64. timeseriesAI. Timeseriesai/Tsai: Time Series Timeseries Deep Learning Machine Learning Pytorch FASTAI: State-of-the-Art Deep Learning Library for Time Series and Sequences in Pytorch/Fastai. Available online: https://github.com/timeseriesAI/tsai (accessed on 7 October 2021).
  65. Wang, Z.; Yan, W.; Oates, T. Time series classification from scratch with deep neural networks: A strong baseline. In Proceedings of the 2017 International Joint Conference on Neural Networks (IJCNN), Anchorage, AK, USA, 14–19 May 2017; pp. 1578–1585. [Google Scholar]
  66. Oguiza, I. tsAI Models: FCNPlus. Available online: https://timeseriesai.github.io/tsai/models.fcnplus.html (accessed on 7 October 2021).
  67. Fawaz, H.I.; Lucas, B.; Forestier, G.; Pelletier, C.; Schmidt, D.F.; Weber, J.; Webb, G.I.; Idoumghar, L.; Muller, P.A.; Petitjean, F. InceptionTime: Finding AlexNet for Time Series Classification. arXiv 2020, arXiv:abs/1909.04939. [Google Scholar]
  68. Oguiza, I. tsAI Models: InceptionTimePlus. Available online: https://timeseriesai.github.io/tsai/models.inceptiontimeplus.html (accessed on 7 October 2021).
  69. Oguiza, I. tsAI Models: RNNS. Available online: https://timeseriesai.github.io/tsai/models.rnn.html (accessed on 7 November 2022).
  70. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  71. Chung, J.; Gülçehre, C.; Cho, K.; Bengio, Y. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. arXiv 2014, arXiv:abs/1412.3555. [Google Scholar]
  72. Oguiza, I. tsAI Models: RNN_FCN. Available online: https://timeseriesai.github.io/tsai/models.rnn_fcn.html (accessed on 7 November 2022).
  73. Karim, F.; Majumdar, S.; Darabi, H.; Chen, S. LSTM Fully Convolutional Networks for Time Series Classification. IEEE Access 2018, 6, 1662–1669. [Google Scholar] [CrossRef]
  74. Elsayed, N.; Maida, A.; Bayoumi, M.A. Deep Gated Recurrent and Convolutional Network Hybrid Model for Univariate Time Series Classification. arXiv 2019, arXiv:abs/1812.07683. [Google Scholar] [CrossRef]
  75. Oguiza, I. tsAI Models: RNN_FCNPlus. Available online: https://timeseriesai.github.io/tsai/models.rnn_fcnplus.html (accessed on 7 November 2022).
  76. Zou, X.; Wang, Z.; Li, Q.; Sheng, W. Integration of residual network and convolutional neural network along with various activation functions and global pooling for time series classification. Neurocomputing 2019, 367, 39–45. [Google Scholar] [CrossRef]
  77. Oguiza, I. tsAI Models: ResNetPlus. Available online: https://timeseriesai.github.io/tsai/models.resnetplus.html (accessed on 7 November 2022).
  78. Bai, S.; Kolter, J.Z.; Koltun, V. An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling. arXiv 2018, arXiv:abs/1803.01271. [Google Scholar]
  79. Zerveas, G.; Jayaraman, S.; Patel, D.; Bhamidipaty, A.; Eickhoff, C. A Transformer-based Framework for Multivariate Time Series Representation Learning. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, Singapore, 14–18 August 2021. [Google Scholar]
  80. Oguiza, I. tsAI Models: TSTPlus. Available online: https://timeseriesai.github.io/tsai/models.tstplus.html (accessed on 7 November 2022).
  81. Oguiza, I. tsAI Models: TSIT. Available online: https://timeseriesai.github.io/tsai/models.tsitplus.html (accessed on 7 November 2022).
  82. Oguiza, I. tsAI Models: Transformermodel. Available online: https://timeseriesai.github.io/tsai/models.transformermodel.html (accessed on 7 November 2022).
  83. Fauvel, K.; Lin, T.; Masson, V.; Fromont, E.; Termier, A. XCM: An Explainable Convolutional Neural Network for Multivariate Time Series Classification. arXiv 2021, arXiv:abs/2009.04796. [Google Scholar]
  84. Oguiza, I. tsAI Models: XCMPlus. Available online: https://timeseriesai.github.io/tsai/models.xcmplus.html (accessed on 7 November 2022).
  85. Rahimian, E.; Zabihi, S.; Atashzar, S.F.; Asif, A.; Mohammadi, A. XceptionTime: A Novel Deep Architecture based on Depthwise Separable Convolutions for Hand Gesture Classification. arXiv 2019, arXiv:abs/1911.03803. [Google Scholar]
  86. Oguiza, I. tsAI Models: XceptionTimePlus. Available online: https://timeseriesai.github.io/tsai/models.xceptiontimeplus.html (accessed on 7 November 2022).
  87. Tang, W.; Long, G.; Liu, L.; Zhou, T.; Blumenstein, M.; Jiang, J. Omni-Scale CNNs: A simple and effective kernel size configuration for time series classification. arXiv 2022, arXiv:2002.10061. [Google Scholar]
  88. Friedman, M. The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance. J. Am. Stat. Assoc. 1937, 32, 675–701. [Google Scholar] [CrossRef]
  89. Dunn, O.J. Multiple Comparisons among Means. J. Am. Stat. Assoc. 1961, 56, 52–64. [Google Scholar]
  90. Hodges, J.L.; Lehmann, E.L. Rank Methods for Combination of Independent Experiments in Analysis of Variance. Ann. Math. Stat. 1962, 33, 403–418. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
