A Stacking Ensemble Deep Learning Model for Bitcoin Price Prediction Using Twitter Comments on Bitcoin

Abstract: Cryptocurrencies can be considered as mathematical money. Bitcoin is the most famous cryptocurrency, and forecasting its price is one of the popular modeling problems in financial technology because of its large price fluctuations and complexity. This paper proposes a novel ensemble deep learning model to predict Bitcoin’s next 30 min prices by using price data, technical indicators and sentiment indexes. The model integrates two kinds of neural networks, long short-term memory (LSTM) and gated recurrent unit (GRU), with a stacking ensemble technique to improve the accuracy of decision. Because comments on social media are updated in real time, this paper uses social media texts instead of news websites as the source data of public opinion. The texts are processed by a linguistic statistical method to form the sentiment indexes. Meanwhile, as a financial market forecasting model, the model selects technical indicators as input as well. Real data from September 2017 to January 2021 is used to train and evaluate the model. The experimental results show that the near-real time prediction has a better performance, with a mean absolute error (MAE) 88.74% better than the daily prediction. The purpose of this work is to explain our solution and show that the ensemble method has better performance and can better help investors in making the right investment decision than other traditional models.


Introduction
Bitcoin is the first and the most important cryptocurrency. It is a ledger application based on blockchain, cryptography and peer-to-peer technology. In the field of financial technology, many mathematical models are developed to forecast Bitcoin's future price. These models can provide investment advice for quantitative investors.
Similar to other assets, such as stocks [1,2] and commodities, Bitcoin price forecasts are a series of continuous predictions because Bitcoin prices also change over time. One major difference between Bitcoin and a stock is that stocks trade only at certain times on weekdays, whereas the Bitcoin market operates around the clock and investors can buy or sell Bitcoin all day, which may result in Bitcoin price fluctuations at unpredictable times. Nevertheless, stock price prediction methods can be adapted to predict the price of Bitcoin. To address the time series problem of Bitcoin prices, two types of models have mainly been used in previous works. The first is traditional time series models, such as autoregressive integrated moving average (ARIMA) [3] and generalized autoregressive conditional heteroskedasticity (GARCH) [4]. The second is machine learning models, such as random forest (RF), and deep learning networks, such as recurrent neural networks (RNN), long short-term memory (LSTM), and gated recurrent units (GRU) [5].
According to a study by the American Institute of Economic Research (AIER), globally influential news and sentiment can drive large fluctuations in the price of Bitcoin [6]. Some research uses sentiment analysis based on Twitter data to predict the price of Bitcoin [5,7]. It is effective to explore people's reactions to Bitcoin from tweets since Twitter is an incredibly rich source of information about how people are feeling about a given topic. Previous research methods of sentiment analysis based on Bitcoin-related comments can be divided into two types: dictionary-based methods, such as valence aware dictionary and sentiment reasoner (VADER) [8], and machine learning-based methods, such as RF [7], hard voting classifiers [5], deep learning-based classifiers [9], and other specific analyzers [10].
However, the current research still has some limitations. Firstly, in most previous works, only historical price data are used as the input of the prediction model, which ignores that prices are also affected by unexpected factors. Secondly, sentiment analysis simply categorizes every tweet or comment as positive, neutral or negative and then creates a simple statistic, which loses much emotional detail and is not conducive to learning how different levels of sentiment affect prices. Thirdly, a single model, such as ARIMA, LSTM, or GRU, is employed by most previous methods. To address these limitations, this paper makes the following contributions. Firstly, considering the financial nature of Bitcoin, we add the most commonly used technical indicators in traditional finance as prediction inputs. Secondly, instead of using a simple statistical method to categorize the mood trend of tweets, we use a linguistic method to process tweets about Bitcoin, which proved to bring higher accuracy. Thirdly, to improve the prediction results, a stacking ensemble deep learning model combining LSTM and GRU is trained to forecast the price of the next time interval based on the sentiment trend of tweets and technical indicators. The ensemble combines multiple models, which adds a bias to the final prediction result that is offset by the reduced variance of the neural networks, making the prediction of the model less sensitive to the details of the training data.
The rest of this paper is organized as follows: Section 2 shows the previous related work; Section 3 shows the whole methodology of this paper, including the data acquisition step, data preprocessing step and stacking ensemble prediction model; Section 4 lists all the experimental results and compares our method with common methods; Section 5 draws the conclusion of this paper.

Related Work
Previous studies can mainly be grouped by three main models and four main data categories. The three models are: (1) statistical methods; (2) machine learning; (3) ensemble learning. The four main data types are as follows: (1) price data, including opening, highest, lowest and closing prices, trading volume, number of trades, quote asset volume and other data; (2) technical indicators, which are derived from price data and market technical statistics, such as moving average convergence divergence (MACD), relative strength index (RSI) and on-balance volume (OBV); (3) sentiment indicators, which are calculated after natural language processing of text data from social media during a certain time period; (4) other related data, such as blockchain hashrate, number of online nodes, active addresses, Google Trends and other financial indexes.
Early research into the price prediction of Bitcoin was mostly based on statistical methods. P. Katsiampa et al. [11] used price data and several types of GARCH models to model the daily closing prices between 18 July 2010 and 1 October 2016, and found AR-CGARCH to be the best model. S. Roy et al. [4] used price data and applied ARIMA, autoregressive (AR), and moving average (MA) models to the time series dataset. The ARIMA model predicted the price of Bitcoin with an accuracy rate of 90.31%, the best result among the three. Ayaz et al. [12] used price data and only the ARIMA algorithm to predict the price of Bitcoin. To find the lowest mean square error (MSE), the researchers used different fitting functions in the ARIMA algorithm and found that the lowest MSE was 170,962.195. Because it avoids the use of scaling functions, this result is different from those of other studies. A recent paper [13] proposed a general method of user behavior analysis and knowledge pattern extraction based on social network analysis. This method extracts relevant information from the blockchain transaction data in a specified period, computes statistics and builds an ego network, and extracts important information such as active transaction addresses and different user groups. Using Ethereum blockchain data from 2017-2018, the method was shown to be able to identify bubble speculators. In 2021, R. K. Jana et al. [14] proposed a regression framework based on differential evolution to predict Bitcoin prices. They first decomposed the original sequence into granular linear and nonlinear components using the maximal overlap discrete wavelet transform, and then fitted polynomial regression with interaction (PRI) and support vector regression (SVR) on both linear and nonlinear components to obtain the componentwise projections. Apart from the previously introduced statistical methods, Jong-Min Kim et al. [15] proposed to use linear and nonlinear error correction models to predict Bitcoin log returns, and compared them with neural networks, ARIMA and other methods. The experiment was verified with the price data from 1 January 2019 to 27 August 2021. The results showed that the error correction model was the best in all evaluation indexes, with an MAE as low as 1.84, while the comparison models were all above 3.2. They also ran a Granger causality test on 14 cryptocurrencies.
Over the past few decades, major advances in machine learning have allowed more accurate methods to spread across the field of quantitative finance. A Bayesian neural network model that uses blockchain information to predict the price of Bitcoin was proposed by Jang et al. in 2017 [16]. Specifically, they use price data, blockchain data, economic indices, currency exchange rates and more. In [17], four methods were trained for price prediction using price data: logistic regression, support vector machine, RNN and ARIMA. As far as the prediction accuracy of these four methods is concerned, ARIMA only has a 53% return on the next day's price prediction, and its long-term performance is poor, for example when the price predictions of the last few days are used to predict the prices of the next 5-7 days. The RNN consistently obtains an approximate accuracy of 50% for up to 6 days. The logistic regression-based model, which can classify accurately only when a separating hyperplane exists, obtains an accuracy of 47%. The support vector machine has an accuracy rate of 48%. Shen et al. [18] used price data for training the GARCH, simple moving average (SMA) and RNN (GRU) models. The GRU model performs better than the SMA model, with the lowest root MSE (RMSE) and mean absolute error (MAE). Some researchers used price data, technical indicators and a hybrid neural network called CNN-LSTM [19]. Compared with a single CNN and a single LSTM model, the results are slightly improved, with the MAE reaching 209.89 and the RMSE reaching 258.31. A stochastic neural network model has also been used to predict the price of cryptocurrency [20]. The model introduces layer-wise randomness into the observed neural network feature activations to simulate market fluctuations. It used market transaction data, blockchain data, and Twitter and Google Trends data.
Recent research on cryptocurrencies by Wołk [21] used Google Trends and Twitter to predict the price of cryptocurrencies with a distinctive multimodal scheme. However, that work used textual data mechanically, unlike our article, which considers linguistic approaches to textual data. In 2021, Jagannath et al. [22] proposed a Bitcoin price prediction method using data features of users, miners, and exchanges. They also proposed a jSO adaptive deep neural network optimization algorithm to speed up the training process. The model uses Bitcoin data from 2016 to 2020 for training and testing. The MAE of LSTM is 2.90, while the MAE of this method is 1.89, an effective reduction. A novel price prediction model, WT-CATCN, was proposed in 2021 by Haizhou Guo et al. [23]. It utilizes the Wavelet Transform (WT) and a Causal Multi-Head Attention (CA) Temporal Convolutional Network (TCN) to predict cryptocurrency prices. The data input of the model is divided into three categories: blockchain transaction information, exchange information, and Google Trends.
Considering how widespread cryptocurrency information has become, Loginova proposed a Bitcoin price direction prediction method in 2021 that combined the sentiment analysis models JST and TS-LDA [24]. They used market trading data as well as text data from Reddit, CryptoCompare and Bitcointalk. The model was verified by using the data from 20 February 2017 to 6 April 2019. The accuracy of the model using JST and TS-LDA was 57%, an improvement over the same model without them. For Dogecoin, which has a huge market cap, Sashank Sridhar et al. proposed a multi-head attention-based encoder-decoder transformer model to predict its price [25]. It was verified using real DOGE hourly transaction data from 5 July 2019 to 28 April 2021, with an R-squared value of 0.8616 for the model. A more complex hybrid framework, DL-GuesS, was proposed by Raj Parekh et al. for cryptocurrency price prediction [26]. This framework takes into account the interdependence with other cryptocurrencies and market sentiment. The model uses transaction data from different cryptocurrencies as input, along with Twitter text. The model was validated using Bitcoin Cash data from March 2021 to April 2021, and the model MSE value was as low as 0.0011.
Ensemble learning is also a popular method for forecasting. Using this approach, researchers have been able to improve the accuracy and stability of predictions. Ahmed Ibrahim [27] used price and sentiment data to predict Bitcoin prices by constructing an XGBoost-Composite ensemble model. A 2020 paper used price data to compare different ensemble models, including averaging, bagging, and stacking [28]. Among them, stacking had the best performance, but the blending ensemble was not considered in that paper. Other researchers used price data and combined LSTM models trained on different time intervals (days, hours, and minutes) to obtain an ensemble model that was superior to each individual model [29].
Mainly inspired by Li and Pan [1], whose workflow is shown in Figure 1, this paper designs a series of methods to avoid these current limitations: (1) more data sources are used as input; (2) linguistic methods are used for sentiment analysis to replace the simple statistical methods used in most papers; (3) one kind of ensemble model is used for training and prediction. However, due to different data sources, the methods proposed in this paper are somewhat different from those proposed by Li and Pan [1]. The specific differences are as follows:
1. There is less news coverage of digital currencies than of stocks, which is not enough to support our real-time prediction, so we chose social media.
2. Digital currencies are traded 24 h a day and comments on Twitter are live 24 h a day, so real-time comments on Twitter can be very effective for price forecasting.
3. Li and Pan's work uses two data sources, price and news, to predict price. Considering the financial properties of digital currency, we use price, comments on Twitter and technical indicators to predict price.
4. Data preprocessing methods are also different: the text data used by Li and Pan [1], namely news data, does not need to be cleaned and can be scored directly by VADER. In contrast, the Twitter data we obtain from crawlers is very dirty, containing pictures, links, etc., and needs to be cleaned.

Methodology
In this paper, sentiment indicators are combined with Bitcoin price data to predict the future price. The proposed model workflow is shown in Figure 2. In step 1, Twitter data are collected and processed to form structured Twitter data in CSV format. In step 2, the structured Twitter data are sent to the sentiment calculation program, where the SGSBI and SGSDI are calculated and attached to the market sentiment indicator data. In step 3, Bitcoin price data are collected and processed with TA-Lib to generate price data with technical indicators. In step 4, the two parts of the data are merged by time indexes to evaluate the models.
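The merge in step 4 can be illustrated with a minimal sketch. It assumes that both data sources have already been reduced to one record per interval and that each record carries a `time` field; the field names, the helper names `floor_to_interval` and `merge_by_time`, and the example values are hypothetical and not taken from this paper.

```python
from datetime import datetime, timedelta


def floor_to_interval(ts, minutes=30):
    """Round a timestamp down to the start of its interval (minutes must divide 60)."""
    return ts - timedelta(minutes=ts.minute % minutes,
                          seconds=ts.second,
                          microseconds=ts.microsecond)


def merge_by_time(price_rows, sentiment_rows, minutes=30):
    """Inner-join price rows and sentiment rows on their interval start time."""
    sentiment_by_time = {
        floor_to_interval(row["time"], minutes): row for row in sentiment_rows
    }
    merged = []
    for row in price_rows:
        key = floor_to_interval(row["time"], minutes)
        sentiment = sentiment_by_time.get(key)
        if sentiment is None:
            continue  # no sentiment record for this interval: drop the row
        combined = {**row, **sentiment}
        combined["time"] = key  # keep the interval start as the shared index
        merged.append(combined)
    return merged


# Hypothetical example values, for illustration only.
prices = [
    {"time": datetime(2020, 4, 12, 0, 0), "close": 7000.0},
    {"time": datetime(2020, 4, 12, 0, 30), "close": 7010.0},
]
sentiments = [{"time": datetime(2020, 4, 12, 0, 12), "SGSBI": 0.2, "SGSDI": 0.1}]
merged = merge_by_time(prices, sentiments)
```

An inner join is used here so that every merged row has both price and sentiment features; whether intervals without tweets should instead be kept and filled is a design choice this sketch does not settle.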

Bitcoin Price Data
Bitcoin price data is provided by Binance.com. To help Bitcoin researchers, Binance collects and processes all its trading data and provides it at http://data.binance.vision/, accessed on 2 November 2021. The data is stored in CSV format. In this paper, the data from September 2017 to January 2021 are selected as the data for model learning and prediction in most cases.

Data Collection
Twint is used to collect tweets from Twitter in this paper. Twint, which is the abbreviation for the Twitter Intelligence Tool, is an open source Twitter scraper that searches and scrapes tweets; it is different from the Twitter Search API. Since no authentication is needed, Twint is an out-of-the-box tool for anyone who needs to scrape tweets. Additionally, Twint has no rate limitations, while the Twitter Search API limits a search to the last 3200 tweets. Certainly, Twint supports almost all the functions of the Twitter Search API, which allows users to request specific queries and allows filtering based on language, region, geographic location, and time range. CSV, JSON, and txt are supported output file formats.
BTC and Bitcoin are the keywords used to search for related tweets. Instead of the hashtag symbol #, the symbol $ is used as the keyword prefix to avoid a very large number of unwanted tweets. From September 2017 to January 2021, more than 7 million tweets were collected.
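Because crawled tweets contain links and picture URLs, they need to be cleaned before sentiment scoring. The following is a minimal sketch of such a cleaning step. The regular expressions and the function name `clean_tweet` are illustrative assumptions, not the rules used in this paper, and removing user mentions is an additional assumption.

```python
import re

# Links, including Twitter picture links (assumed patterns).
URL_RE = re.compile(r"https?://\S+|www\.\S+|pic\.twitter\.com/\S+")
# User mentions such as @someone (assumed to carry no sentiment).
MENTION_RE = re.compile(r"@\w+")
SPACE_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Remove links and mentions, then collapse repeated whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


cleaned = clean_tweet("$BTC to the moon! https://t.co/abc pic.twitter.com/xyz @user")
```

Punctuation and capitalization are deliberately left untouched, since VADER's rules use them as intensity cues.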

Sentiment Score Calculation
This paper uses VADER for the basic sentiment score calculation. VADER is an open source Python library for sentiment analysis based on dictionaries and rules. The library works out of the box and does not need text data for training. Compared with traditional sentiment analysis methods, VADER has many advantages: (1) it is suitable for multiple text types, such as social media; (2) training data are not required; and (3) it is fast enough to be used online on streaming data.
VADER not only calculates the positive, neutral and negative scores of the input statement but also provides a compound score, which is a numeric value between −1 and +1. In general, a compound score from −1 to −0.05 is considered negative, a score from 0.05 to 1 is considered positive, and the rest is considered neutral. However, in this way, the information carried by the numeric score is filtered out. For example, the compound scores 0.12 and 0.86 are both considered positive emotions, but the degree of positive emotion expressed is not the same.
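The information loss described above can be made concrete with a short sketch of the threshold rule. The function name `label_from_compound` is hypothetical; the ±0.05 thresholds are the ones stated in the text.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a coarse sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Two clearly different intensities collapse into the same label.
weak = label_from_compound(0.12)
strong = label_from_compound(0.86)
```

Both calls return the same label, which is why a finer-grained indicator that keeps the numeric score is preferable to counting labels.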

Small Granularity Sentiment Indicators
According to previous work [30], the sentiment indexes constructed by Antweiler and Frank have been revised. Specifically, this work took advantage of VADER and the work of Antweiler and Frank and then proposed small granular sentiment indicators, as shown in Equations (1)-(3).

Technical Indicator Calculation
The technical indicators in Table 1, including MACD, SMA, OBV, RSI and MFI, are calculated based on the raw price data through a Python library called TA-Lib. The input data to the TA-Lib function are transferred to the ndarray type by numpy in advance. These technical indicators are chosen because of their popularity in the field of traditional financial market price forecasting.
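To make two of these indicators concrete, the following is a minimal pure-Python sketch of the SMA and of the MACD line (12-period EMA minus 26-period EMA). It does not call TA-Lib, and its EMA is seeded with the first price, which is an assumption; TA-Lib's own warm-up handling differs, so the first values will not match its output. The function names are hypothetical.

```python
def sma(values, period):
    """Simple moving average; returns one value per full window."""
    if period <= 0:
        raise ValueError("period must be positive")
    return [
        sum(values[i - period + 1:i + 1]) / period
        for i in range(period - 1, len(values))
    ]


def ema(values, period):
    """Exponential moving average seeded with the first value (assumed seeding)."""
    alpha = 2.0 / (period + 1)
    result = [values[0]]  # requires a non-empty input
    for value in values[1:]:
        result.append(alpha * value + (1.0 - alpha) * result[-1])
    return result


def macd_line(values, fast=12, slow=26):
    """MACD line: fast EMA minus slow EMA, element by element."""
    return [f - s for f, s in zip(ema(values, fast), ema(values, slow))]


sma_demo = sma([1, 2, 3, 4, 5], 3)
macd_flat = macd_line([100.0] * 40)
macd_rising = macd_line([float(i) for i in range(1, 61)])
```

On a flat price series the MACD line stays at zero, and on a steadily rising series it is positive because the fast EMA tracks the price more closely than the slow EMA.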
The simple moving average (SMA) is a simple technical analysis tool that smooths out price data by creating a constantly updated average price. A simple moving average helps cut down the amount of noise on a price chart. The stop and reverse (SAR) indicator is used by traders to determine trend direction and potential reversals in price. Moving average convergence divergence (MACD) is a trend-following momentum indicator that shows the relationship between two moving averages of a security's price. The MACD is calculated by subtracting the 26-period exponential moving average (EMA) from the 12-period EMA. The relative strength index (RSI) is a momentum indicator used in technical analysis that measures the magnitude of recent price changes to evaluate overbought or oversold conditions in the price of a stock or other asset. The Money Flow Index (MFI) is a technical oscillator that uses price and volume data for identifying overbought or oversold signals in an asset. On-balance volume (OBV) is a technical trading momentum indicator that uses volume flow to predict changes in stock price.

Long Short-Term Memory
Long short-term memory (LSTM) is a neural network with the ability to remember long-term and short-term information. It was first proposed by Hochreiter and Schmidhuber [31] in 1997 and came into wide use with the rise of deep learning around 2012. After undergoing several generations of development, a relatively systematic and complete framework has been formed for the LSTM model.
LSTM is a special kind of RNN model that is designed to solve the vanishing gradient problem of the RNN model. In traditional RNNs, back propagation through time (BPTT) is used in the training algorithm. When the time span is relatively long, the error signal that needs to be propagated back decreases exponentially, which leads to slow network weight updating; hence, the network cannot reflect the long-term memory effect of RNNs [32]. Therefore, a storage unit is needed to store memory, and the architecture of the LSTM model alleviates the problem of long-term dependence.
In an ordinary RNN, which is shown in Figure 3, the structure of the repeating module is very simple; for example, there is only one tanh layer. LSTM also has a kind of chain structure, which is shown in Figure 4, but its repeating module structure is different. There are four neural network layers in the repeating module of LSTM, and the interactions between them are very special.
The LSTM model can store important past information in the cell state and forget unimportant information. Its memory cell consists of three parts: the forget gate, the input gate, and the output gate. The first step of LSTM is to decide what information will be discarded from the cell state. The decision is controlled by a sigmoid layer called the "forget gate". The forget gate f_t observes h_{t−1} (the previous output vector) and x_t (the input vector) and outputs a number between 0 and 1 for each element in the cell state C_{t−1}, where 1 means "keep this information completely" and 0 means "discard this information completely".
The next step is to decide which new information will be stored in the cell state. First, a sigmoid layer called the "input gate" i_t determines what information should be updated. Next, a tanh layer creates a new candidate value C̃_t, which may be added to the cell state.
Then, the old cell state C_{t−1} is updated to the new cell state C_t.
In the end, the final output h_t is decided based on the current cell state after some filtering. First, a sigmoid layer called the output gate o_t determines which parts of the cell state will be output. Then, the cell state is passed through a tanh layer, so that its values lie between −1 and 1, and multiplied by the output gate.
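The four steps above correspond to the standard LSTM update equations, given here as a reference sketch in the common textbook notation. The weight matrices W and bias vectors b are notation assumed for this sketch, σ is the sigmoid function, ⊙ is element-wise multiplication, and [h_{t−1}, x_t] is the concatenation of the previous output and the current input.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) \\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) \\
h_t &= o_t \odot \tanh\left(C_t\right)
\end{aligned}
```

The fourth line shows why gradients can survive over long spans: the cell state is carried forward additively, scaled only by the forget gate.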

Gated Recurrent Unit
The gated recurrent unit (GRU), another special kind of RNN, was proposed by Cho et al. in 2014 [34] to solve the vanishing gradient problem of RNNs through an update gate and a reset gate. In addition to alleviating the vanishing gradient problem, the two gates control which relevant information is retained in the hidden state and passed to the next steps of the network. The performances of LSTM and GRU are comparable under different test conditions. However, there are some differences between GRU and LSTM: first, GRU does not have a separate memory cell; second, GRU is computationally more efficient than LSTM because of the lack of memory units; and third, GRU is more suitable when dealing with small datasets.
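A single GRU time step can be sketched with NumPy as follows. This is an illustrative forward pass only, not the implementation used in this paper: the parameter names, the random initialization, and the helper `init_gru_params` are assumptions, and the sketch uses the convention in which the update gate weights the new candidate state (some references swap the roles of z_t and 1 − z_t).

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_gru_params(input_dim, hidden_dim, seed=0):
    """Small random weights and zero biases (hypothetical initialization)."""
    rng = np.random.default_rng(seed)
    shape = (hidden_dim, hidden_dim + input_dim)
    params = {}
    for gate in ("z", "r", "h"):
        params["W_" + gate] = rng.normal(0.0, 0.1, shape)
        params["b_" + gate] = np.zeros(hidden_dim)
    return params


def gru_step(x_t, h_prev, params):
    """One GRU step: returns the new hidden state h_t."""
    concat = np.concatenate([h_prev, x_t])
    z_t = sigmoid(params["W_z"] @ concat + params["b_z"])  # update gate
    r_t = sigmoid(params["W_r"] @ concat + params["b_r"])  # reset gate
    # The reset gate scales how much of the old state enters the candidate.
    concat_reset = np.concatenate([r_t * h_prev, x_t])
    h_tilde = np.tanh(params["W_h"] @ concat_reset + params["b_h"])
    # No separate cell state: the hidden state itself is interpolated.
    return (1.0 - z_t) * h_prev + z_t * h_tilde


params = init_gru_params(input_dim=3, hidden_dim=4)
h = np.zeros(4)
for x_t in np.ones((5, 3)):  # a toy sequence of five identical inputs
    h = gru_step(x_t, h, params)
```

The step uses three weight matrices where an LSTM step needs four, which is the source of the efficiency difference mentioned above.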

Stacking Ensemble
As a primary paradigm of machine learning, ensemble learning has achieved notable success in a vast range of real-world applications. One model that fits an entire training dataset may not be enough to meet all expectations. Many previous studies have shown that ensemble learning, which combines multiple individual learning algorithms, outperforms a single learning algorithm in both accuracy and robustness [35].
Thomas G. Dietterich pointed out the reasons for the better performance of ensemble learning from statistical, computational, and representational aspects [36]. There are various types of ensemble learning models, such as bagging, boosting, stacking, and blending [36]. A deep learning network, a special kind of artificial neural network, consists of multiple processing layers. With the ability to mine information from the plethora of historical data and effectively use that data for future predictions, deep learning has become a popular choice for problem solving [37]. However, deep learning methods have one obvious disadvantage: deep learning models are very sensitive to initial conditions. According to [38], it is computationally expensive to train deep learning neural networks, and even if a vast amount of time is spent to train a model, the trained network with the best performance on validation sets may not perform best on new test data. Generally, we could regard deep learning neural networks as models with low bias but high variances. Combining the advantages of both deep learning and ensemble learning, ensemble deep models have been proposed [39]. Specifically, ensemble deep models combine the predictions from multiple good but different deep learning models. Good means that the performance of each deep learning neural network used is relatively good. Different means that each of the deep learning neural networks has different prediction errors. As stated in [40], different models usually have different errors on a test set, and this has resulted in studies on model averaging. The combination of ensemble models and deep learning models adds bias that in turn cancels out the variance in a single training neural network model. The bias-variance tradeoff is illustrated in the graph in Figure 5. 
In addition to reducing the variance in the prediction, an ensemble deep model can also produce better predictions than any single best model according to the ensemble model properties described above.
Our model consists of two levels, shown in Figure 6: level 1 contains five LSTM and five GRU, which are called sub-models; and level 2 is a single-layer model called the meta-model. We choose LSTM and GRU as sub-models due to their good performance in the field of price prediction. Based on a large number of experiments, we set the number of sub-models in the first layer to five in order to achieve a balance between accuracy and computation. The steps of the model are as follows:

1. Data split: Divide the data into a training set and a test set, as shown in step (1).
2. Sub-model training: Further divide the training set into five subsets, defined as train1 to train5. Then define the five LSTM instances as LSTM1 to LSTM5, and the five GRU instances as GRU1 to GRU5.
• Train sub-models: Train LSTM1 on train1 to train4, and then predict on the data subset train5 to obtain Prediction1. Train LSTM2 on train1 to train3 and train5, and then predict on the data subset train4 to obtain Prediction2, and so on. The same procedure is repeated for the five GRU instances, as shown in step (2);
• Generate training features for the meta-model: Combine Prediction1 to Prediction5 of LSTM successively to obtain the feature meta-train1 for training the meta-model. The same procedure is repeated for GRU to obtain the feature meta-train2, as shown in step (3);
• Create new prediction features for level 2: Use each of LSTM1 to LSTM5 to make predictions on the test set, and average the five results to yield the feature meta-test1 for prediction. The same procedure is repeated for GRU to obtain the feature meta-test2, as shown in step (4).
3. Meta-model training and predicting: Concatenate meta-train1 and meta-train2 to train the meta-model. Then use the meta-model to predict the final result from the concatenation of meta-test1 and meta-test2, as shown in step (5).
Let n be the sequence length and d be the representation dimension, where each LSTM/GRU of this model has a single layer. The time complexity of the stacking ensemble is estimated to be O(n · d^2).
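The stacking procedure described in this section can be sketched in a few lines of NumPy. To keep the sketch short and runnable, the LSTM and GRU sub-models are replaced by two closed-form ridge regressions with different regularization strengths, and the meta-model is also a ridge regression. These stand-ins, the function names, and the regularization values are assumptions for illustration only and are not the models used in this paper; only the out-of-fold feature construction mirrors the procedure.

```python
import numpy as np


def fit_ridge(X, y, alpha):
    """Closed-form ridge regression with an intercept column (stand-in model)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    A = Xb.T @ Xb + alpha * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)


def predict_linear(w, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ w


def out_of_fold_features(X_train, y_train, X_test, alpha, n_folds=5):
    """Build one meta-train feature and one meta-test feature for a sub-model type."""
    all_idx = np.arange(len(X_train))
    folds = np.array_split(all_idx, n_folds)  # contiguous blocks train1..train5
    meta_train = np.empty(len(X_train))
    test_preds = []
    for held_out in folds:
        fit_idx = np.setdiff1d(all_idx, held_out)
        w = fit_ridge(X_train[fit_idx], y_train[fit_idx], alpha)
        # Predict only on the subset this instance was not trained on.
        meta_train[held_out] = predict_linear(w, X_train[held_out])
        test_preds.append(predict_linear(w, X_test))
    # Average the five instances' test predictions into one feature.
    meta_test = np.mean(test_preds, axis=0)
    return meta_train, meta_test


def stacking_predict(X_train, y_train, X_test, alphas=(0.1, 10.0)):
    """Level 1: one feature per sub-model type. Level 2: meta-model on those features."""
    features = [out_of_fold_features(X_train, y_train, X_test, a) for a in alphas]
    meta_train = np.column_stack([f[0] for f in features])
    meta_test = np.column_stack([f[1] for f in features])
    w_meta = fit_ridge(meta_train, y_train, alpha=1e-3)
    return predict_linear(w_meta, meta_test)
```

Because every meta-train value is predicted by an instance that never saw that sample, the meta-model is trained on features that resemble what it will receive at test time, which is the main reason for the five-way split.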

Evaluation Metrics
Many metrics have been used to compare the performance of price trend and price movement direction predictions of different models. To comprehensively evaluate the performance of the models, four widely used indicators are adopted in the experiments: the MSE, the MAE, the mean absolute percentage error (MAPE), and the symmetric MAPE (sMAPE). In their definitions, N is the number of predictions, y is the actual value and ŷ is the predicted value of the model.
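The four metrics can be sketched in plain Python using their common textbook definitions. Several sMAPE variants exist; the one below divides by the mean of |y| and |ŷ| and reports a percentage, which is an assumption and may differ from the exact form used in this paper. The function names and example values are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    return 100.0 * sum(abs(y - p) / abs(y) for y, p in zip(y_true, y_pred)) / len(y_true)


def smape(y_true, y_pred):
    """Symmetric MAPE, in percent (assumed variant: mean of |y| and |y_hat| in the denominator)."""
    return 100.0 * sum(
        abs(y - p) / ((abs(y) + abs(p)) / 2.0) for y, p in zip(y_true, y_pred)
    ) / len(y_true)


# Hypothetical values, for illustration only.
actual = [100.0, 200.0]
predicted = [110.0, 180.0]
scores = (mse(actual, predicted), mae(actual, predicted),
          mape(actual, predicted), smape(actual, predicted))
```

MSE and MAE are expressed in price units, so they are not comparable across periods with very different price levels, whereas MAPE and sMAPE are scale-free.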

Result Evaluation
In this section, the proposed method is used to forecast the Bitcoin closing price. We implement the proposed method using the TensorFlow deep learning framework on TITAN RTXs through the Python programming language. Many trials of simulation experiments are conducted to determine the parameters of the model.
The comparative experiments in this paper are divided into two categories: the first is to compare the performance of different models; the other is to compare the performance of different categories of data combinations in the forecast.
As shown in Figure 7, the whole data is divided into two parts: training data, and testing data. The training data is from 24 September 2017 to 11 April 2020, which is used to train the weak learners in level 1; the testing data is from 12 April 2020 to 30 November 2020, which is used to make the final prediction.
A rolling window with 5 steps is used on these financial time series data, as shown in Figure 8. In addition, technical indicators and sentiment indicators are calculated as data sources. Table 2 lists the input features for Bitcoin price prediction from the price data sector, technical indicator sector, and sentiment indicator sector. The training durations of the models are shown in Table 3. Thanks to GPU acceleration, training the stacking ensemble model on 30 min interval data takes only about 27 min.
The first part consists of the experiments that compare the different models. The compared models include not only single neural network models, such as LSTM and GRU, but also the average ensemble (AE) and blending ensemble (BE). Both LSTM and GRU are single models that can be used for prediction, and they are essential components of our ensemble models. The average ensemble model takes the average of the LSTM and GRU results as the final result. The MAE, MSE, MAPE, sMAPE, and mean directional accuracy (MDA) are used to evaluate the performance of the proposed method and the other models. All our results are shown in Tables 4 and 5. As shown in Table 6, the proposed stacking ensemble model has strong performance in the MAE, MSE, and MDA evaluation categories. In the MAPE evaluation, the proposed stacking ensemble model is the best compared with the other models on the 30 min time interval, but on the 1-day time interval, the blending ensemble obtains the best MDA score. In general, the proposed stacking ensemble model outperforms the other models in most cases. Figure 9 shows the results of the different models on the testing data. Figure 10 is an enlarged part of Figure 9, in which the result of the stacking ensemble model is marked 'X' and the actual value is marked '+' to illustrate the performance of the models. The graph visually illustrates that the prediction results of the stacking ensemble model are closer to the actual closing price, and the shape of the prediction line is more similar to the shape of the actual line.
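The 5-step rolling window used to build the model inputs can be sketched as follows. The sketch assumes that each sample consists of the feature rows of five consecutive intervals and that the target is the closing price of the interval that follows; the function name `make_windows` and the toy data are hypothetical.

```python
def make_windows(rows, targets, window=5):
    """Pair each block of `window` consecutive rows with the next interval's target."""
    inputs, outputs = [], []
    for end in range(window, len(rows)):
        inputs.append(rows[end - window:end])  # rows end-window .. end-1
        outputs.append(targets[end])           # value to predict at step `end`
    return inputs, outputs


# Toy series: one feature per interval, target is ten times the index.
toy_rows = [[float(i)] for i in range(8)]
toy_targets = [10.0 * i for i in range(8)]
windows, next_values = make_windows(toy_rows, toy_targets)
```

With eight intervals and a window of five, three samples are produced, and no sample ever contains information from the interval it is asked to predict.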
The second part is the comparative experiments with different data combinations. It is shown in Table 7 that, for different time intervals, the data combinations that produce optimal performance are not necessarily the same. Specifically, when the data interval is one day, the combination of price data and technical indicators has better prediction performance than other data combinations since it obtains the best value of 492.90 among all the 1-day interval data combinations. The combination of price data, technical indicators, and sentiment indicators outperforms the other combinations for time intervals of 30 min, since it obtains the best value of 88.74 among all data combinations for 30-min intervals. Experiments show that, in most cases, the combination of price data, technical indicators and sentiment indicators outperforms the data combination in previous articles. We can conclude that the richness of the input data used in the prediction can improve the accuracy of the prediction. Furthermore, other metrics are shown in Figure 11. The better the prediction obtained with the data combination, the redder the values are; the worse the prediction obtained with the data combination, the whiter its values are. The combination of price data and technical indicators achieves the best performance for 1-day intervals, and the combination of price data, technical indicators and sentiment indicators achieves the best performance for 30 min intervals. From our experiments, we found that price data with technical indicators are better for short-term predictions, such as predicting the next-day prices; however, price data with sentiment indicators are better for extra-short-term predictions, such as predicting the prices in the next 30 min.  Figure 12 shows the testing data with different data combinations. Figure 13 is part of Figure 12, the result of using all data is marked 'X' and the actual value is marked '+' to illustrate performance of data combinations. 
The graph visually illustrates that, for the stacking ensemble model, the accuracy of the prediction results depends on the data combination and on whether the model is used for short-term or extra-short-term prediction. Generally, the combination of price data and technical indicators is better for short-term prediction, and the combination of price data, technical indicators and sentiment indicators is better for extra-short-term prediction.
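The rolling window of 5 steps and the data combinations compared above can be sketched as follows. This is only an illustration under stated assumptions: the feature names in `FEATURE_GROUPS` are hypothetical placeholders (the actual features are those listed in Table 2), and the target is assumed to be the closing price of the step that follows each window.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, float]

# Hypothetical feature groups; the real feature list is the one in Table 2.
FEATURE_GROUPS: Dict[str, List[str]] = {
    "price": ["close", "volume"],
    "technical": ["rsi"],
    "sentiment": ["sentiment_score"],
}


def select_features(groups: Sequence[str]) -> List[str]:
    """Flatten the chosen feature groups into one ordered list of column names."""
    return [name for group in groups for name in FEATURE_GROUPS[group]]


def make_windows(
    rows: Sequence[Row],
    groups: Sequence[str],
    window: int = 5,
    target: str = "close",
) -> Tuple[List[List[List[float]]], List[float]]:
    """Build (window x features) inputs and next-step targets from time-ordered rows."""
    columns = select_features(groups)
    inputs: List[List[List[float]]] = []
    targets: List[float] = []
    for end in range(window, len(rows)):
        # Each sample holds the `window` rows that precede the target step.
        inputs.append([[row[c] for c in columns] for row in rows[end - window:end]])
        targets.append(rows[end][target])
    return inputs, targets
```

Calling `make_windows(rows, ["price", "technical"])` and `make_windows(rows, ["price", "technical", "sentiment"])` on the same rows yields the inputs for two of the compared data combinations, so the model and the evaluation can stay unchanged while only the feature groups vary.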
At present, in the research field of Bitcoin price prediction, several difficulties limit a fair comparison between the newly proposed method and previous methods: (1) the data formats are diverse and difficult to unify; (2) the data acquisition methods differ, and so do the data versions; (3) some implementation details are not mentioned in the papers of previous studies; (4) the source code is hard to obtain and to run in new environments. Therefore, we only briefly compare the results of previous related work with our newly proposed method in Table 8. Specifically, the data combination of price and sentiment indicators under the 1-day time interval can be considered a variant of Li and Pan's proposed method [1] in our experiments. In this way, it is shown that our proposed method improves on Li and Pan's method.
Bitcoin price data and social media text data are presented in different formats due to different providers or acquisition tools. Most of the methods in this paper only read data in one of the formats. For data formats other than the specified format, additional processing work is required.
As there are no standard open data for Bitcoin price prediction, all researchers collect data on their own. At present, several major trading platforms each provide their own transaction data as Bitcoin price data. The version differences among Bitcoin-related social media texts, such as those on Twitter or Reddit, are even more serious because the collection tools and the collection times differ. For example, a tweet published yesterday may be deleted by its author today, so the data version collected today is not the same as the one collected yesterday.
There are many parameters and implementation details in modeling and model training. In a deep neural network, the structure of each layer has many parameters. However, often for good reasons, not all of these parameters are reported in the original papers. Moreover, modeling involves many details, such as the split of training and test data and shuffle operations to prevent overfitting of the model. These details can also be missing because of the length limits and the focus of the papers. The lack of this information makes it difficult to reproduce previous methods from the papers alone.
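One possible way to make the data version and the split details discussed above explicit is to record a fingerprint of the collected data together with the split rule. The sketch below is illustrative only: the record layout (a dictionary with an `id` field), the 80/20 ratio, and the choice of an unshuffled chronological split are assumptions, not settings reported by any of the cited studies.

```python
import hashlib
import json
from typing import Dict, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def dataset_fingerprint(records: Sequence[Dict[str, str]]) -> str:
    """Return a SHA-256 hex digest that changes whenever any record changes."""
    # Sort by a stable key so that the collection order does not affect the digest.
    canonical = sorted(records, key=lambda record: record["id"])
    payload = json.dumps(canonical, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def chronological_split(
    rows: Sequence[T], train_fraction: float = 0.8
) -> Tuple[List[T], List[T]]:
    """Split time-ordered rows without shuffling; the test rows follow the training rows."""
    cut = int(len(rows) * train_fraction)
    return list(rows[:cut]), list(rows[cut:])
```

If a tweet is deleted between two collection runs, the two fingerprints differ, which makes the version mismatch visible when results are compared.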
If one is fortunate enough to obtain the source code with the author's consent, there will still be environmental and operational difficulties. We know that many machine learning and statistical toolkits are updated very frequently. A piece of code can run under the package version used by the author at the time, but it may not be able to run smoothly under a new version. In addition, it is also possible that the running result is different from the author's result due to the inability to obtain the same running environment as the author.
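As a hedged illustration of how the running environment could be recorded alongside released code, the sketch below collects the Python version and the installed versions of a list of packages using the standard library. The function name `record_environment` is hypothetical, and which packages to list depends on the project.

```python
import platform
from importlib import metadata
from typing import Dict, Iterable, Optional


def record_environment(packages: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map 'python' and each package name to its installed version (None if absent)."""
    env: Dict[str, Optional[str]] = {"python": platform.python_version()}
    for name in packages:
        try:
            env[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in the current environment.
            env[name] = None
    return env
```

Storing the returned dictionary next to the experimental results lets later readers see which package versions produced them, even if they cannot recreate the same environment.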

Conclusions
The price of Bitcoin often fluctuates wildly. Inspired by the work of Li and Pan [1], we propose an ensemble deep learning method, which combines two RNNs, to predict the future price and price movement of Bitcoin based on the combination of historical transaction data, tweet sentiment indicators and technical indicators. It is worth noting that we crawled two datasets at different time intervals: 1 day and 30 min. Given the financial nature of cryptocurrency, four evaluation indicators, the MSE, the MAE, the MAPE, and the sMAPE, are used to measure the price prediction performance, and the movement direction accuracy (MDA) is used to measure the price movement prediction. Two types of comparative experiments are conducted in this research: experiments that compare different models and experiments that compare the impact of different data combinations on forecast prices. The results show that, under the same conditions, the stacking ensemble achieves better performance with fewer training resources, and that social media sentiment analysis contributes more to extra-short-term price prediction than to short-term price prediction.
Both the prediction models and the input data sources leave considerable room for improvement. First, the model can be optimized in three respects, namely the model framework, the model size and the optimization process, to improve prediction performance [44]. For the model framework, we can consider changing the model types and the activation function. For the model size, the width and the number of hidden layers are two values that can be adjusted. For optimization, the proper setting of the hyperparameters is essential. Second, the inclusion of other data sources may improve the existing forecasting accuracy. In this research, we consider the historical transaction data, sentiment trends of Twitter, and technical indicators. However, there may be other potential factors, including regulatory and legal matters, competition between Bitcoin and other cryptocurrencies, and the supply and demand of Bitcoin. In addition, the microexpressions of cryptocurrency investors during trading can also be considered potential factors affecting cryptocurrency prices. Third, we can dynamically change the size of the window according to the data type. For example, news is not published as quickly as social media comments, such as tweets. Therefore, we can set different window sizes for data with different update frequencies and study their long-term or short-term influences on prices. Experiments based on the proposed model can also be extended to the price prediction of other cryptocurrencies. The Bitcoin price prediction model proposed here provides a reference for practitioners seeking to avoid potential risks in trading. In addition, researchers can develop better regulatory measures and laws by studying the relationship between opinion analysis on social media and price movements of cryptocurrencies.
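The third direction above, using different window sizes for data with different update frequencies, could be prototyped along the following lines. This is a speculative sketch of future work rather than part of the proposed model; the source names and the window sizes in `WINDOW_SIZES` are hypothetical.

```python
from typing import Dict, List, Sequence

# Hypothetical window sizes: number of most recent observations kept per source.
WINDOW_SIZES: Dict[str, int] = {"price": 5, "tweets": 5, "news": 2}


def latest_windows(
    series: Dict[str, Sequence[float]],
    sizes: Dict[str, int] = WINDOW_SIZES,
) -> Dict[str, List[float]]:
    """Keep the last sizes[name] observations of each named series."""
    windows: Dict[str, List[float]] = {}
    for name, values in series.items():
        size = sizes[name]
        if size <= 0:
            raise ValueError(f"window size for {name!r} must be positive")
        # A series shorter than its window is returned in full.
        windows[name] = list(values[-size:])
    return windows
```

Varying the entries of `WINDOW_SIZES` per source would then allow the long-term and short-term influences of each data type on prices to be studied separately.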