A Comparative Study of Bitcoin Price Prediction Using Deep Learning

: Bitcoin has recently received a lot of attention from the media and the public due to its recent price surge and crash. Correspondingly, many researchers have investigated various factors that affect the Bitcoin price and the patterns behind its ﬂuctuations, in particular, using various machine learning methods. In this paper, we study and compare various state-of-the-art deep learning methods such as a deep neural network (DNN), a long short-term memory (LSTM) model, a convolutional neural network, a deep residual network, and their combinations for Bitcoin price prediction. Experimental results showed that although LSTM-based prediction models slightly outperformed the other prediction models for Bitcoin price prediction (regression), DNN-based models performed the best for price ups and downs prediction (classiﬁcation). In addition, a simple proﬁtability analysis showed that classiﬁcation models were more effective than regression models for algorithmic trading. Overall, the performances of the proposed deep learning-based prediction models were comparable.


Introduction
Bitcoin [1] has recently received a lot of attention from the media and the public due to its recent price surge and crash. Figure 1 shows the Bitcoin daily prices from 29 November 2011 to 31 December 2018 on Bitstamp (https://www.bitstamp.net/), which is the longest-running cryptocurrency exchange. On Bitstamp, the Bitcoin price reached the highest price (19,187.78 USD) on 16 December 2017 and has fallen up to 3179.54 USD (16.57% of the highest price) on 15 December 2018. Then, it has again increased with some fluctuations since April 2019. Although the Bitcoin price seems to follow a random walk [2], some recurring patterns seem to exist in the price fluctuations when considering the log value of the Bitcoin price, as shown in Figure 1.
As Bitcoin has been considered to be a financial asset and is traded through many cryptocurrency exchanges like a stock market, many researchers have investigated various factors that affect the Bitcoin price and the patterns behind its fluctuations using various analytical and experimental methods; for example, see the works by the authors of [3,4] and references therein. In particular, due to the recent advances in machine learning, many deep learning-based prediction models for the Bitcoin price have been proposed [5][6][7][8][9][10][11].
Although so far several deep learning methods were studied and compared for the Bitcoin price prediction, most previous work considered only a few deep learning methods, mostly based on a deep neural network (DNN) or a recurrent neural network (RNN) [12]. For example, a convolutional neural network (CNN) [13,14] and its variants, such as a deep residual network (ResNet) [15], have gained little attention for the Bitcoin price prediction, even though they were shown to be very effective for many applications, including long sequence data analysis [16]. Moreover, most previous work addressed only a regression problem, where the prediction model predicts the next Bitcoin price based on the previous prices, but not a classification problem, where the prediction model predicts if the next price will go up or down with respect to the previous prices. More precisely, for a regression problem, the performance of prediction models is often measured in terms of the root-mean-square error (RMSE) or the mean absolute percentage error (MAPE) between the predicted values and the actual values, but a low RMSE or MAPE value does not necessarily mean that the prediction model is indeed effective; for instance, for Bitcoin trading, as it might not perform well for a classification problem, as shown in Section 4.
In this paper, we study and compare various state-of-the-art deep learning methods, such as DNNs, long short-term memory (LSTM) models [17], CNNs, ResNets, a combination of CNNs and RNNs (CRNN) [18], and their ensemble models for Bitcoin price prediction. In particular, we developed both regression and classification models by exploiting the Bitcoin blockchain information and compared their prediction performance under various settings. Experimental results showed that although LSTM-based prediction models slightly outperformed the other prediction models for regression problems, DNN-based prediction models performed the best for classification problems. In addition, to determine the applicability of the proposed prediction models to algorithmic trading, we compared the profitability of the proposed models by using a simple trading strategy. More specifically, for regression models, if the predicted price is higher than or equal to the current price, then we buy Bitcoin with all funds or hold it if we already spent all funds. Otherwise, we sell all Bitcoin or wait if we did not buy Bitcoin yet. Similarly, for classification models, we buy or hold if the prediction model predicts a price rise and otherwise sell or wait. The analysis result showed that classification models were more effective than regression models. Overall, the performance of the deep learning-based prediction models studied in this paper was comparable to each other. The upper line shows log prices, whereas the lower line shows plain prices. Some recurring patterns seem to exist when considering the log value of the Bitcoin price.

Related Work
This section briefly reviews previous studies on Bitcoin price prediction using deep learning. For other statistical analysis, we refer the reader to the works by the authors of [3,4] and references therein. McNally et al. [5] proposed two prediction models based on recurrent neural networks (RNNs) and long short-term memory (LSTM), and compared them with an autoregressive integrated moving average (ARIMA) model [19], which is a traditionally widely used time series forecasting model. They developed classification models using the Bitcoin price information, which predict if the next Bitcoin price will go up or down based on the previous prices. In the work by the authors of [5], the RNN and LSTM models were shown to be better than the ARIMA model. In addition to the price information, Saad and Mohaisen [6] analyzed the Bitcoin blockchain information, such as the number of Bitcoin wallets and unique addresses, block mining difficulty, hash rate, etc., and used those features that are highly correlated to the Bitcoin price to build prediction models. They studied several regression models based on linear regression, random forests [20], gradient boosting [21], and neural networks. In addition to the blockchain information, Jang and Lee [7] further considered macroeconomic factors such as S&P 500, Euro stoxx 50, DOW 30, and NASDAQ, and exchange rates between major fiat currencies. They studied three prediction models based on a Bayesian neural network (BNN) ( [22], Chapter 5.7), linear regression, and support vector regression (SVR) [23], and showed that the BNN model outperformed the other two prediction models. In their follow-up study, Jang et al. [10] proposed a rolling window LSTM model and showed that it outperformed the prediction models based on linear regression, SVR, neural networks, and LSTM. Similarly, Shintate and Pichl [11] proposed a deep learning-based random sampling model and showed that it outperformed LSTM-based models. Meanwhile, Kim et al. [24] and Li et al. [25] considered social data to predict the Bitcoin price fluctuations. However, none of the work mentioned above considered CNN-based prediction models and various combinations of deep learning models. Moreover, unlike the previous studies, we provide in-depth comparisons between various deep learning models both for regression and classification problems, and show that the best prediction model for regression is not necessarily the best for classification and vice versa.

Organization of the Paper
Section 2 analyzes various features of the Bitcoin blockchain using the Spearman rank correlation analysis [26], and discusses those features that we used to build prediction models. Section 3 presents various prediction models based on deep neural networks (DNNs), LSTMs, CNNs, ResNets, CRNNs, and their ensemble models for Bitcoin price prediction. Section 4 presents experimental results under various settings. In particular, both regression and classification results of the proposed deep learning models are reported and analyzed. Finally, Section 5 concludes and discusses future work.

Datasets and Feature Engineering
In this study, we used the Bitstamp Bitcoin market (USD) time series data, which can be obtained from https://bitcoincharts.com/charts/. In particular, we used Bitcoin's prices from 29 November 2011 to 31 December 2018 (for a total of 2590 days), which are shown in Figure 1. During this period, the lowest price was 2.72 USD (29 November 2011) and the highest price was 19,187.78 USD (16 December 2017), which is 7054.33 times higher than the lowest price. Moreover, the Bitcoin blockchain information and its various statistics were also exploited, which can be obtained from https://www.blockchain.com/charts. Among various features, we considered the 29 features of the Bitcoin blockchain, listed in Table 1. The values of each feature were measured every day or converted to daily average values for this study.
Among the features listed in Table 1, we did not consider the features related to the Bitcoin memory pool such as mempool-count, mempool-growth, mempool-size, and trans-per-sec because their information is available only from 25 April 2016. Then, to find the most useful features, a correlation analysis between the rest of the features and the Bitstamp Bitcoin price was performed. In particular, we calculated the Spearman rank correlation coefficients [26] as shown in Figure 2 and Table 2. The Spearman rank correlation coefficient is a nonparametric measurement correlation, which can be used to evaluate the monotonic relationship between two variables. The Spearman correlation between two variables is the same as the Pearson correlation [27], i.e., linear correlation, between the ranked values of the two variables. A positive Spearman correlation coefficient between two variables, X and Y, indicates that when X increases (decreases), Y also tends to increase (decrease). In contrast, a negative coefficient indicates that when X increases (decreases), Y tends to decrease (increase). A Spearman correlation of zero indicates that there is no correlation between X and Y.
In this study, we excluded the features with a correlation coefficient less than 0.75, such as cost-per-trans-pct, est-trans-vol, med-cfm-time, output-vol, and trans-fees, as they are not so much related to the Bitcoin price. We also excluded the features with a correlation coefficient greater than 0.95, such as market-cap, market-price, and miners-revenue. Indeed, the trend in market-cap and market-price is almost the same as in the Bitstamp Bitcoin price. Note that market-price is the average Bitcoin USD market price across major Bitcoin exchanges, and market-cap is defined as market-price times the amount of Bitcoin supply in circulation. In addition, miners-revenue is directly proportional to the Bitcoin price and should be considered as a lagging indicator. Figure 3 shows the trends of some features which are highly correlated or not correlated with the Bitcoin price. For example, the trend of trade-vol shown in Figure 3b (having a Spearman correlation of 0.8977) is very similar to that of the Bitstamp Bitcoin price shown in Figure 3a. In contrast, the trend of est-trans-vol shown in Figure 3e (having a Spearman correlation of −0.0595) has nothing to do with the Bitcoin price. In total, 18 features (including the Bitcoin price) were used to develop prediction models and to predict the next Bitcoin price. Among 18 features, the range of the values of difficulty, est-trans-vol-usd, hash-rate, my-wallets, trade-vol, and trans-fees-usd is very large. For these features, the difference between the minimum and the maximum values is from 2700 times to 8,000,000 times. Therefore, we used the log values of these features in the experiment. The total USD value of Bitcoin supply in circulation. market-price The average USD market price across major Bitcoin exchanges. med-cfm-time The median time for a transaction to be accepted into a mined block. mempool-count The number of transactions waiting to be confirmed. mempool-growth The rate of the memory pool (mempool) growth per second. mempool-size The aggregate size of transactions waiting to be confirmed. miners-revenue The total value of Coinbase block rewards and transaction fees paid to miners. my-wallets The total number of blockchain wallets created. n-trans The number of daily confirmed Bitcoin transactions. n-trans-excl-100 The total number of transactions per day excluding the chains longer than 100. n-trans-excl-popular The total number of transactions, excluding those involving any of the network's 100 most popular addresses. n-trans-per-block The average number of transactions per block. n-trans-total The total number of transactions. n-unique-addr The total number of unique addresses used on the Bitcoin blockchain. output-val The total value of all transaction outputs per day. total-bitcoins The total number of Bitcoins that have already been mined. trade-vol The total USD value of trading volume on major Bitcoin exchanges. trans-fees The total BTC value of all transaction fees paid to miners. trans-fees-usd The total USD value of all transaction fees paid to miners. trans-per-sec The number of Bitcoin transactions added to the mempool per second. utxo-count The number of unspent Bitcoin transactions outputs.

Methods
This section describes data preprocessing and proposes various deep learning-based prediction models for Bitcoin price regression and classification problems. Table 3 summarizes the notation used in this paper. Table 3. List of notations.

Notation Definition
The value of the j-th dimension of the i-th data point in X

Data Preparation
In our study, the raw dataset S consisted of 2590 days Bitcoin data (from 29 November 2011 to 31 December 2018). Each data point, p, is an 18-dimensional real-valued vector where each dimension stores a daily value of one of the Bitcoin blockchain features discussed in Section 2. To predict Bitcoin prices, sequences of previous data were exploited. When the size of each sequence was m, a total of 2590 − m + 1 sequence data were used to train and test various deep learning-based prediction models, that is, from S[1 : m] to S[2590 − m + 1 : 2590], as shown in Figure 4. The sequence size was experimentally determined as explained in Section 4, and the first 80% of the sequence data were used as the training data and the rest as the test data. Given a sequence S[i : i + m − 1], for a regression problem, we predicted the Bitcoin price of the (i + m)-th day, and for a classification problem, we predicted whether the price would go up or down in the (i + m)-th day with respect to the (i + m − 1)-th day. In other words, by analyzing previous Bitcoin prices including today's price, we predict tomorrow's price for the regression case and predict if tomorrow's price will go up or down with respect to today's price for the classification case, as it is what most investors or traders would be interested in. Each sequence X was normalized in the following way. For all i and j, Then, the normalized sequences Ys were used instead of the original sequences Xs to build and test the prediction models. This normalization method is more effective in capturing local trends than the usual minmax normalization method. Assuming that max j and min j are, respectively, the maximum and the minimum value of the j-th dimension of the raw data points, the "minmax normalization" method translates each data point as follows, That is, every value is normalized into a value between 0 and 1. As the highest price is 7054.33 times higher than the lowest price in our dataset, if the usual minmax normalization method is used, the Bitcoin prices hardly change in most sequence data, and thus the prediction accuracy is degraded. Experimental results also confirmed that the first value-based normalization was more effective than the minmax normalization.

Deep Neural Networks
The first prediction model was based on a deep neural network (DNN). A DNN usually consists of an input layer, several hidden layers, and an output layer as shown in Figure 5. As we use 18 features to predict the Bitcoin price, the input layer consists of 18 nodes, i.e., each input node corresponds to each feature. Moreover, if we analyze m days data at once, then each input node takes a column vector of size m as input. In addition, in our DNN prediction model, every node in one layer is connected to every node in the next layer, i.e., fully-connected. That is, each node in the hidden layer takes input vectors x 1 , . . ., x n from the previous layer, where n is the number of nodes in the previous layer and the size of each vector is m. Its output h is then defined as where the weight vectors w 1 , . . ., w n and the bias b are parameters to be learned using training data. In addition, is an element-wise multiplication (Hadamard product) and σ h is a nonlinear function, such as the logistic sigmoid, defined by the activation function and the rectified linear unit (ReLU) [28], defined as g(x) = max(0, x). σ h is applied element-wise, and thus the size of h is also m.
In the output layer, the set of input vectors are flattened into a single vector and then translated into a single value y using a dot product between the flattened vector and a weight vector. For a regression problem, the final value y is used as the predicted Bitcoin price. For a classification problem, y is further translated into a value between 0 and 1 using a sigmoid function and then rounded off. In particular, 1 means the rise of the price and 0 means the fall or no change.
If we represent the whole DNN model as a function f , the parameters of the model are optimized using the back-propagation algorithm [12] and training data so that the following mean squared error can be minimized: where f (X i ) is a predicted value based on an input sequence X i and y true i is the true Bitcoin price at the day after the day corresponding to X i [m] for regression. For classification, y true i is 1 if the true price is higher than the price at X i [m] and otherwise 0.

Recurrent Neural Networks and Long Short-Term Memory
For time series data, it is well known that a recurrent neural network (RNN) [12] model and a long short-term memory (LSTM) [17] model are effective. Although a DNN model can be used to predict time series data, it does not exploit the temporal relationship between the data points in the input sequence. To exploit the temporal relationship, both RNN and LSTM models use an internal state (memory). Figure 6 shows example RNN and LSTM models for the Bitcoin price prediction. To illustrate, suppose that we use three days data to predict the price. Then, the input layer of an RNN model consists of three nodes each of which takes the whole data of a single day, i.e., a vector of 18 features. Given x t as an input, each hidden state h t is then computed as follows: where W h , U h , and b h are parameters to be learned and tanh is a hyperbolic tangent function. The main difference from DNNs is that RNNs use not only the input x t , but also the previous hidden state h t−1 , to compute the current hidden state h t , thus exploiting the temporal dependency between the input sequence data. Assuming that we use three days sequence data, the final output y for regression is computed as follows, where w y and b y are also parameters to be learned. One drawback of RNNs is that when the length of the input sequence is large, the long-term information is rarely considered due to the so-called vanishing gradient problem. LSTM avoids the vanishing gradient problem mainly by using several gate structures, namely input, output, and forget gates, together with a cell unit (the memory part of an LSTM unit), and additive connections between cell states. For an LSTM unit at time t, its hidden state h t , i.e., the output vector of the LSTM unit, is computed as follows, where W, U, and b are parameters to be learned and σ is a sigmoid function. f t , i t , and o t , respectively, correspond to the forget, input, and output gates, and c t is the cell state vector. Thanks to the cell state, the long-term dependencies between the data points in the input sequence is maintained well and LSTMs can be applied to long sequence data. In the experiment, only LSTM-based prediction models were considered.

Convolutional Neural Networks
A convolutional neural network (CNN) [13,14] has many applications in image analysis and classification problems. Recently, it is shown to be also effective for sequence data analysis [16], and thus we also developed a CNN-based prediction model. Normally, a CNN consists of a series of convolution layers, ReLU layers, pooling layers, and fully-connected layers, where a convolution layer convolves the input with a dot product. In this work, we developed a simple CNN model consisting of a single 2D convolution layer (Conv2D) as shown in Figure 7. (A more sophisticated CNN model is discussed in the next section.) The input is an m × 18 matrix, where m is the number of days to be consulted for the prediction and 18 is the number of the features. A total of 36 2D convolution filters of size 3 × 18 are used for convolution, where single real values are extracted from consecutive three days data through each filter. That is, 3 × 18 feature values are translated into a single value and thus each filter produces a real-valued vector of size m − 2, which is then applied to an element-wise ReLU activation function. Then, the 36 output vectors produced by the convolution filters are flattened into a single vector of size (m − 2) × 36, which is then translated into a single prediction value through a fully connected layer. The number of filters was determined experimentally. Moreover, unlike other image analysis applications, simply adding more convolution layers together with pooling layers did not improve the performance of CNN models for Bitcoin price prediction, and therefore we present only a simple CNN model.

Deep Residual Networks
Although stacking many layers allows a CNN model to encode a very complex function from the input to the output, it is difficult to train such a model due to the vanishing gradient problem as in RNNs. A deep residual network (ResNet) [15] resolves this problem by introducing a so-called "identity shortcut connection" that skips one or more layers. Figure 8a shows stacked nonlinear layers without/with a shortcut connection. Assuming that H(x) is the desired underlying mapping, residual learning lets the stacked nonlinear layers fit a residual mapping of H(x) − x. The original mapping is then recast to F(x) + x. In the work by the authors of [15], it is shown that the residual mapping is easier to optimize than the original. By simply adding identity shortcut connections, deep convolutional networks can easily be optimized and produce a good result. Figure 8b shows our ResNet model used in the experiment. It consists of an input layer, one 2D convolution layer, four residual blocks, one average pooling layer, and an output layer. A residual block implements a shortcut connection between two convolution layers. For ResNet, we use convolution layers different from the one used for CNN. More precisely, the first Conv2D for input data uses 36 convolution filters of size 4 × 4. Then, each Conv2D in the four residual blocks uses filters of size 2 × 2. The number of filters used in each Conv2D of the first, second, third, and last residual blocks is 36, 72, 144, and 288, respectively. Although it is not shown in the figure, we apply batch normalization [29] and the ReLU activation function after each Conv2D in each residual block. After applying 2D average pooling of size 3 × 3, we flatten the resultant 288 vectors of size 1 × 1 into a one-dimensional vector. It is then translated into a single prediction value through a fully connected layer as in the CNN model.

Combinations of CNNs and RNNs
As CNNs are effective for structural analysis, whereas RNNs are effective for temporal analysis, we developed a prediction model called CRNN by combining CNN and LSTM models so as to integrate their strengths [18]. Figure 9 shows the structure of our CRNN model. The same input data are fed into both the CNN and LSTM models discussed in the previous sections. In particular, for the CNN part, after applying 2D convolution, we also apply batch normalization and flattening. As for the LSTM part, we additionally apply dropout [30]. The results of CNN and LSTM are then combined into a single vector, which is then translated into a single prediction value through a fully connected layer. As a combining operator, we tested element-wise addition, subtraction, multiplication, average, maximum and minimum operators, and a concatenation operator. In the experiment, the subtraction operator was used for regression problems while the concatenation operator was used for classification problems because they showed better performance.

Ensemble Models
The last model we considered is a stacking ensemble model [31]. The main idea behind ensemble learning is that by combining multiple weak models, one may produce a strong model with better predictive performance. Our stacking model consists of two levels as shown in Figure 10. The first level consists of three base learners, namely, DNN, LSTM, and CNN, which are already discussed in the previous sections. The second level consists of a single DNN model as a meta learner, whose main purpose is to find an optimal combination of the base learners. To train the stacking model, we divided the training data into two disjoint sets. The first dataset was then used to train the base learners. Next, each base learner was tested on the second dataset to make predictions. Lastly, the meta learner was trained to produce final outputs based on the predicted values of the base learners as inputs.

Experimental Results
This section experimentally evaluates the performance of the DNN, LSTM, CNN, ResNet, CRNN, and Ensemble models. As a baseline method, we also compared support vector machine (SVM) prediction models [23]. We used the SVR class for regression and the SVC class for classification, which are provided in the Python Scikit-learn machine learning library (https://scikit-learn.org/stable/). Three kernel options, namely, "linear", "poly (polynomial)", and "rbf (radial basis function)", were tested. The performance of SVMs with the polynomial kernel was not so good while the performance with the linear kernel and the RBF kernel was comparable. Therefore, in this section, we report only the performance of SVMs with the RBF kernel. In addition, we also implemented linear and logistic regression models and gated recurrent unit (GRU) models. GRU is a simpler variation of LSTM that uses fewer parameters but exhibits better performance than LSTM on certain small datasets [32]. In the experiments, however, the performance of linear and logistic regression models was similar to or worse than that of SVM models, whereas the performance of GRU models was comparable to that of LSTM models. Therefore, we omitted the experimental results of prediction models based on linear and logistic regression and GRUs.
Every model was trained using the training data, and their performance was tested using the test data. In particular, we compared their prediction performance with respect to the mean absolute percentage error (MAPE) for regression analysis, which is defined by where N is the total number of predictions, and y true i and y pred i are the actual value and the predicted value, respectively. For classification analysis, we compared the accuracy of the proposed models, which is defined by the number of correct predictions divided by the total number of predictions. The seven prediction models were compared under various conditions as follows. • The first set of experiments varied the size, m, of the input sequence. We used m = 5, 10, 20, 50, and 100. For other experiments below, m was set to 20 for regression problems and 50 for classification problems.

•
In the second set of experiments, for six blockchain features such as difficulty, est-trans-vol-usd, hash-rate, my-wallets, trade-vol, and trans-fees-usd, where the difference between the minimum and the maximum values is very large, we compared the performance of using their log values with the performance of using their plain values. For other experiments, the default experimental setting was to use their log values. • Different data split methods were also compared. The default split method was to use the first 80% of the entire data as training data and the rest as test data (i.e., sequential partitioning).

•
In addition, different normalization methods were compared. The default normalization method was the first value-based normalization discussed in Section 3.1.

•
Finally, we compared the profitability of the proposed models using a simple trading strategy: buy Bitcoin with all funds or hold it if the model predicts a price rise or otherwise sell all Bitcoin or wait.
All experiments were conducted on a workstation equipped with an Intel R Core TM i7-6700 3.4 GHz CPU, 8 GB of main memory, and an NVIDIA GeForce GTX 1080 GPU with 8 GB memory. The host operating system was the Windows 10 and all prediction models were implemented using Python 3 and the Keras 2.1.6 deep learning library (https://keras.io). For every deep learning-based prediction model, the number of epochs was 200 and the batch size was set to 64. All measurements were averaged over 10 sample runs.

Effect of the Sequence Size on Regression
We experimentally determined the input sequence size, m, i.e., the number of previous data consulted to predict the Bitcoin price. Table 4 shows the effects of m on the regression performance of each prediction model with respect to the MAPE. In the table, the underlined numbers denote the best value for the given window size and the boldface numbers denote the best value for the given prediction model. All prediction models exhibited the best performance when m = 5 (the lower the MAPE, the better the performance). Indeed, for every model, the MAPE increases with m. The reason is that when the input sequence size is small, most prediction models were trained to predict a price that is very close to the previous day's price, thus minimizing the MAPE. Interestingly, in terms of the MAPE, a simple shift model is the best, which simply uses the previous day's price for the prediction. In the entire data, the rate of price change of more than 5% is 17.72%, and thus it is indeed an effective policy to predict a price closed to the previous price for regression problems. However, this is not the case for classification problems, as explained in Section 4.1.2.
In this experiment, when m = 5, DNN showed the best performance, whereas when m = 10, 20, LSTM was the best. Meanwhile, when m = 50, 100, SVM showed the best performance. Moreover, SVM was the least sensitive to the sequence size. When m = 5, 10, 20, the performance of all prediction models was comparable to each other. However, when m was large, CNN did not seem to work very well, and consequently the performance of ResNet, CRNN, and Ensemble decreased greatly. For the rest of experiments, m was set to 20 for regression problems unless otherwise specified. Although all prediction models performed the best when m = 5, we chose to use m = 20, because this setting clearly contrasted the performance of the proposed models and all models were relatively well-trained under this setting. Another reason not to use m = 5 is that when m = 5, most prediction models behaved like a shift model and thus their distinct characteristics seemed to disappear.  Figure 11 shows the prediction results of DNN with m = 5, 10, 20, 50, 100. Note that when m = 100, the predicted prices are totally different from the actual prices. In contrast, when m = 5, 10, 20, the predicted prices are quite close to the actual prices. In other words, if the MAPE value is less than 5, then it means that the model produced a good prediction result. Figure 12 shows the prediction results of all prediction models with m = 20. Although there were some deviations from the actual prices (in particular, when the Bitcoin price was the highest), all prediction models quite successfully predicted the Bitcoin prices.     Table 5 shows the effect of the input sequence size m on the classification performance of each prediction model with respect to accuracy. For classification problems, we also tested two baseline prediction methods, namely, Base and Random. The Base model always predicts the rise of the price, and thus its accuracy is the price rise rate in the whole test data. As the total number of sequence data slightly varies depending on the size of the sequence, the accuracy of Base also slightly varies. The Random model randomly chooses between the rise and the fall of the price. In clear contrast to the regression result given in Table 4, most deep learning-based prediction models exhibited poor performance when m = 5, 10, 20; in these settings, Base or Random rather showed a better result. Note that for regression, the smaller the sequence size, the better the performance of the deep learning models. In contrast, for classification, it seems that more data are needed for better performance. This is partly because when m is small, every prediction model tends to behave like a shift model, but it means that when the price falls after rises (or vice versa), the model tends to make a wrong prediction. Another difference from the regression results was that the performance of LSTM was not particularly good and mostly similar to the performance of other models. When m = 50, where most prediction models were relatively well trained, DNN, CNN, CRNN, and Ensemble outperformed LSTM. Meanwhile, in contrast to the regression results, SVM was not trained very well for all cases and behaved like Base. As all deep learning models except ResNet exhibited the best performance when m = 50, for the rest of experiments, m was set to 50 for classification problems.  Table 6 compares the proposed prediction models using various measures such as precision, recall, specificity, and F1 score when m = 50. Specifically, the precision is the percentage of the correctly predicted price ups out of the total price ups predicted by the model. The recall is the percentage of the correctly predicted price ups out of the total actual price ups, and thus the recall of the Base model is 100%. The specificity is the percentage of the correctly predicted price downs out of the total actual price downs, and therefore the specificity of Base is 0%. Finally, the F1 score is the harmonic mean of the precision and the recall. In terms of the accuracy, precision, and specificity, DNN showed the best result, whereas in terms of the recall and the F1 score, DNN and Ensemble showed a relatively good result. Overall, although DNN was slightly better than other models, the performance of all proposed models except for ResNet was comparable to each other and there was no clear winner. One reason for the relatively poor performance of ResNet is the lack of data for training. As ResNet uses many layers, it requires a huge amount of data, but the amount of data available on the day scale is insufficient for training ResNet.

Effect of Using Log Values
This section presents the effect of using the log values of the six blockchain features: difficulty, est-trans-vol-usd, hash-rate, my-wallets, trade-vol, and trans-fees-usd. Tables 7 and 8 show the results on regression and classification, respectively. In both cases, for all prediction models except for ResNet and CRNN, using the log values was better than using the plain values. In particular, in Table 8, when the plain values were used, most prediction models were not trained well, and thus exhibited poor performance. As the difference between the maximum and the minimum plain values for the aforementioned six features is too high, e.g., up to eight million times, it seems that the prediction models were trained to be more sensitive to the changes of the six features than the rest 12 features. By using their log values, the prediction models were trained to consider other features as important factors as well, thus improving the prediction accuracy.

Effect of Data Split Methods
This section presents the effect of data split methods on the performance of the proposed models. We compared three methods: sequential and random partitioning, and 5-fold cross-validation (CV). The usual sequential partitioning is the default split method used in our experiments, and it uses the first 80% data as training data and the rest 20% data as test data. In this case, the training data contain only the information up to 31 July 2017, and thus the price surge and collapse and abrupt changes from November 2017 to February 2018 are not reflected in the prediction models. In contrast, the random partitioning randomly chooses the training data from the entire sequence data, and thus the recent large fluctuations are also partly reflected in the prediction models. Lastly, 5-fold CV divides the whole dataset into five equal-sized subsets. Then, for each prediction method, five models are trained and tested using five distinct pairs of training and test data, where the former consists of four subsets and the latter the remaining subset. The prediction performance is then averaged. By using 5-fold CV, we may avoid overfitting and sampling bias.
Tables 9 shows the regression performance, where all the proposed models performed better when using the random partitioning. Note that LSTM showed the best result for all cases. The results of 5-fold CV are similar to those of using the sequential partitioning. In contrast to the regression result, the random partitioning is not necessarily a better choice for classification as shown in Table 10. For classification, what matters is having a sufficient amount of patterns of price ups and downs in the training data. As both sequential and random partitioning generated a similar amount of ups and downs patterns in the training data, the classification performance under the two methods was comparable. Meanwhile, in the results of 5-fold CV, Base and SVM significantly outperformed the other deep learning models except for DNN. The reason is that the percentage of price rises during the period from 20 April 2016 to 31 July 2017 (the fourth subset of the five CV subsets) is 60.47%, which was the prediction accuracy of Base and SVM, whereas most deep learning models achieved a much lower accuracy. If we exclude that period, the prediction accuracy of Base and SVM becomes 52.39% and 52.08%, respectively, which is lower than the prediction accuracy 53.57% of DNN under the same condition.

Effect of Normalization Methods
This section presents the effect of normalization methods on the performance of the proposed models. We compared two normalization methods using the sequential partitioning: the first value-based normalization and the usual minmax normalization discussed in Section 3.1. Tables 11 and 12 show the results on regression and classification, respectively. In Table 11, for all prediction models, using the first value-based normalization was better than using the minmax normalization. By using the first value-based normalization, local fluctuations were better reflected in the prediction models, and thus the prediction accuracy was improved. In contrast, when using the minmax normalization, the Bitcoin price fluctuations in the training data (until 31 July 2017) were relatively small, and thus the prediction accuracy was not good for the test data containing large price fluctuations. However, for classification, it was not always better to use the first value-based normalization as shown in Table 12. The reason is along the same lines as Table 10. Figure 14 shows the prediction results of DNN with two normalization methods. Although the predicted prices under the first value-based normalization were very close to the actual prices, the predicted prices under the minmax normalization were quite deviated from the actual prices.

Profitability of the Proposed Models
This section presents the results of a profitability analysis using a simple trading strategy. More specifically, for each regression model, we bought Bitcoin with all funds if the predicted price was higher than the current price and sold all Bitcoin if the predicted price was lower than the current price. Similarly, for each classification model, we bought Bitcoin if the prediction model predicted a price rise and sold it otherwise. Table 13 shows the investment results during the period from 20 September 2017 to 31 December 2018, with the initial budget of 10,000 USD. For simplicity, we did not consider trading fees. In the beginning, for classification models, the previous 50-days information from 1 August 2017 to 19 September 2017, which was not used during the model training, was used to predict the price on 20 September 2017. As for regression models, the previous 20-days information from 31 August 2017 to 19 September 2017 was used to predict the price on 20 September 2017. The Bitcoin price was 3874.46 USD on 20 September 2017 and 3693.30 USD on 31 December 2018. Interestingly, every deep learning-based classification model made a little profit, whereas every regression model made a negative return. Moreover, all deep learning-based regression models even underperformed Base, which corresponds to a simple buy-and-hold strategy, and Random, which corresponds to a strategy that randomly buys and sells. Meanwhile, as in the previous results, DNN was the best for classification-based trading, whereas LSTM was the best for regression-based trading among deep learning methods. In addition, although SVM behaved like Base for classification-based trading, for regression-based trading, it outperformed every deep learning models. Overall, classification models were more effective than regression models for algorithmic trading, but it is still premature to apply such models in practice. Table 13. Results of a profitability analysis. m = 20 was used for regression models and m = 50 for classification models. The log values of the major features, the sequential partitioning, and the first value-based normalization were used. For every model, the initial budget was 10,000 USD and the final value of investment is shown in the

Discussion and Conclusions
In this study, we have developed and compared various deep learning-based Bitcoin price prediction models using Bitcoin blockchain information. More specifically, we tested the state-of-the-art deep learning models such as deep neural networks (DNN), long short-term memory (LSTM) models, convolutional neural networks (CNN), deep residual networks (ResNet), and their combinations. We addressed both regression and classification problems, where the former predicts the future Bitcoin price, and the latter predicts whether or not the future price will go up or down. For regression problems, LSTM slightly outperformed the other models, whereas for classification problems, DNN slightly outperformed the other models unlike the previous literature on Bitcoin price prediction. Although CNN and ResNet are known to be very effective in many applications, including sequence data analysis, their performance was not particularly good for Bitcoin price prediction. Overall, there was no clear winner and the performance of all deep learning models studied in this work was comparable to each other. In addition, although deep learning models seem to predict the Bitcoin price very well in terms of the regression analysis, it is still premature to solely use such models for algorithmic Bitcoin trading.
There are several research directions for future work. First, it would be interesting to consider other recent deep learning models, such as transformer networks [33] and few-shot/one-shot learning [34], to build prediction models. Moreover, to improve the accuracy of the prediction models, it would also be interesting to consider other factors such as stock market indices and major currency exchange rates as in the works by the authors of [7,10]. We may also consider price fluctuations of major Altcoins such as Ethereum, Ripple, Litecoin, Bitcoin Cash, or EOS as in the work by the authors of [9]. Lastly, news and social media sentiment analysis could also be incorporated to improve the prediction accuracy as in the works by the authors of [24,25].