What Drives Bitcoin? An Approach from Continuous Local Transfer Entropy and Deep Learning Classification Models

Bitcoin has attracted attention from different market participants due to unpredictable price patterns. Sometimes, the price has exhibited big jumps. Bitcoin prices have also had extreme, unexpected crashes. We test the predictive power of a wide range of determinants on bitcoins’ price direction under the continuous transfer entropy approach as a feature selection criterion. Accordingly, the statistically significant assets in the sense of permutation test on the nearest neighbour estimation of local transfer entropy are used as features or explanatory variables in a deep learning classification model to predict the price direction of bitcoin. The proposed variable selection do not find significative the explanatory power of NASDAQ and Tesla. Under different scenarios and metrics, the best results are obtained using the significant drivers during the pandemic as validation. In the test, the accuracy increased in the post-pandemic scenario of July 2020 to January 2021 without drivers. In other words, our results indicate that in times of high volatility, Bitcoin seems to self-regulate and does not need additional drivers to improve the accuracy of the price direction.


Introduction
Currently, there is tremendous interest in determining the dynamics and direction of the price of Bitcoin due to its unique characteristics, such as its decentralization, transparency, anonymity, and speed in carrying out international transactions. Recently, these characteristics have attracted the attention of both institutional and retail investors. Thanks to technological developments, investor trading strategies are benefited by digital platforms; therefore, market participants are more likely to digest and create information for this market. Of special interest is its decentralized character, since its value is not determined by a central bank but, essentially, only by supply and demand, recovering the ideal of a free market economy. At the same time, it is accessible to all sectors of society, which breaks down geographic and particular barriers for investors. The fact that there are a finite number of coins and the cost of mining new coins grows exponentially has suggested to some specialists that it may be a good instrument for preserving value. That is, unlike fiat money, Bitcoin cannot be arbitrarily issued, so its value is not affected by the excessive issuance of currency that central banks currently follow, or by low interest rates as a strategy to control inflation. In other words, it has been recently suggested that bitcoin is a safe-haven asset or store of value, having a role similar to that once played by gold and other metals.
The study of cryptocurrencies and bitcoin has been approached from different perspectives and research areas. It has been addressed from the point of view of financial economics, econometrics, data science, and more recently by econophysics. In these approaches, various methodologies and mathematical techniques have been utilised to understand different aspects of these new financial instruments. These topics range from systemic risk, the spillover effect, autoscaling properties, collective patterns, price formation, and forecasting in general. Remarkable work in the line of multiscale analysis of cryptocurrency markets can be found in [1]. However, this paper is motivated by using the econphysics approach, incorporated with rigorous control variables to predict Bitcoin price patterns. We would like to offer a comprehensive review of the determinants of Bitcoin prices. The first pillar can be defined as sentiment and social media content. While Bitcoin is widely considered a digital financial asset, investors pay attention to this largest market capitalization by searching its name. Therefore, the strand of literature on Google search volume has become popular for capturing investor attention [2]. Concomitantly, not only peer-to-peer sentiment (individual Twitter accounts or fear from investors) [3,4] but also influential accounts (the U.S. President, media companies) [5][6][7] significantly contribute to Bitcoin price movement. Given the greatest debate on whether Bitcoin should act as a hedging, diversifying or safe-haven instrument, Bitcoin exhibits a mixture of investing features. More interestingly, uncertain shocks might cause changes in both supply and demand in Bitcoin circulation, implying a change in its prices [8]. Thus, the diverse stylized facts of Bitcoin, including heteroskedasticity and long memory, require uncertainty to be controlled in the model. While uncertainties represent the amount of risk (compensated by the Bitcoin returns) [9], our model also includes the price of risk, named the 'risk aversion index' [10]. These two concepts (amount of risk and the price of risk) demonstrate discount rate factors in the time variation of any financial market [11]. In summary, the appearance of these determinants could capture the dynamics of the cryptocurrency market. Since cryptocurrency is a newly emerging market, the level of dependence in the market structure is likely higher than that in other markets [12]. Furthermore, the contagion risk and the connectedness among these cryptocurrencies could be considered the risk premium for expected returns [13,14]. More importantly, this market can be driven by small market capitalization, implying vulnerability of the market [15]. Hence, our model should contain alternative coins (altcoins) to capture their movements in the context of Bitcoin price changes. Finally, investors might consider the list of these following assets as alternative investment, precious metals being the first named. They are not only substitute assets [16] but also predictive factors (for instance, gold and platinum) [17], which additionally include commodity markets (such as crude oil [18,19], exchange rate [20], equity market [21]), and Tesla's owner [22]). In summary, there are voluminous determinants of Bitcoin prices. In the scope of this study, we focus on the predictability of our model, especially the inclusion of social media content, representing the high popularity of information, on the Bitcoin market. However, the more control variables there are, the higher the accuracy of prediction. Our model thus may be a useful tool by combining the huge predictive factors for training and forecasting the response dynamics of Bitcoin to other relevant information.
This study approaches Bitcoin from the framework of behavioural and financial economics using an approach from econophysics and data science. In this sense, it seeks to understand the speculative character and the possibilities of arbitrage through a model that includes investor attention and the effect of the news, among other factors. For this, we will use a causality method originally proposed by Schreiber [23], and we will use the information as characteristics of a deep learning model. The current literature only focuses on specific sentiment indicators (such as Twitter users [3] or the number of tweets [24,25]), and our study crawled the original text from influential Twitter social media users (such as the President of United States, CEO of Tesla, and well-known organizations such as the United Nations and BBC Breaking News). Then, we processed language analyses to construct the predictive factor for Bitcoin prices. Therefore, our model incorporates a new perspective on Bitcoin's drivers.
In this direction, the work of [26] uses the effective transfer entropy as an additional feature to predict the direction of U.S. stock prices under different machine learning approaches. However, the approximation is discrete and based on averages. Furthermore, the employed metrics are not exhaustive to determine the predictive power of the models. In a similar vein, the authors of [27] perform a comparative analysis of machine learning methods for the problem of measuring asset risk premiums. Nevertheless, they do not take into account recurrent neural network models or additional nontraditional features. Furthermore, an alternative approach to study the main drivers of Bitcoin is discussed in [28], where the author explores wavelet coherence to examine the time and frequency domains between short-and long-term interactions. In the same vein, the recent studies employed the correlation networks and vector error correction models to explain the price prediction and exchange spillovers [29,30]. Of course, Bitcoin prediction is more likely to have sentimental and 'noise' factors differing from stock prediction.
On the other hand, there are methodologies to explain machine learning results known as eXplainable Artificial Intelligence (XAI). Among these, two of the most popular are Local Interpretable Model Agnostic [31] and Shapley Additive Explanation (SHAP) [32]. Both techniques are based on disturbing the model locally. The former assumes a linear model to obtain the score of the characteristics in terms of the importance of making predictions; the latter uses game theory concepts to find the best feature fitting in terms of predictive gain. In [33] these techniques are extended to include temporal dependencies and demonstrate the need to develop XAI techniques applicable to time series. In [34,35] is proposed an XAI method applicable to credit risk. In a similar vein, the authors of [36] mention the difficulty of estimating out-of-sample behavior in stress scenarios. An interesting work is [37], where it is considered a gradient boosting decision trees approximation to predict the drops of the S&P 500 markets using a large number of characteristics. The authors claim that retaining a small and carefully selected amount of features improves the learning model results.
However, as mentioned in the cornerstone work [31] it is not possible to explain a highly non-linear model through local perturbations. That is, there is a high instability derived from the characteristics of the inherent dynamical system. In addition, the examples of the articles mentioned above run in most cases in seconds or minutes. Therefore, the LIME and SHAP methods are appropriate mainly for machine learning models or simple deep learning scenarios [38]. In this spirit, it is not practical to follow the traditional XAI approach, given the computational demand derived from the number of hyperparameters and configurations to be implemented. However, our proposal to use transfer entropy in the variable selection process can be considered an alternative strategy to XAI. In particular, of interest for highly non-linear dependency conditions, such as bitcoin dynamics.
Our study embodied a wide range of Bitcoin's drivers from alternative investment, economic policy uncertainty, investor attention, and so on. However, social media is our main contribution to predictive factors. Specifically, we study the effect that a set of Twitter accounts belonging to politicians and millionaires has on the behaviour of Bitcoin's price direction. In this work, the statistically significant drivers of Bitcoin are detected in the sense of the continuous estimation of local transfer entropy (local TE) through nearest neighbours and permutation tests. The proposed methodology deals with non-Gaussian data and nonlinear dependencies in the problem of variable selection and forecasting. One main aim is to quantify the effects of investor attention and social media on Bitcoin in the context of behavioural finance. Another aim is to apply classification metrics to indicate the effects of including or not the statistically significant features in an LSTM's classification problem.
The next Section 2 introduce the local transfer entropy, the nearest neighbour estimation technique, the deep learning forecasting models, and the classification metrics. Section 3 describes the data and their main descriptive characteristics. Section 4 presents and highlights the main results. Finally, Section 5 highlights the implications of the results, and future work is proposed.

Transfer Entropy
Transfer Entropy (TE) [23] measures the flow of information from system Y over system X in a nonsymmetric way. Denote the sequences of states of systems X, Y in the following way: x i = x(i) and y i = y(i), i = 1, . . . , N. The idea is to model the signals or time series as Markov systems and incorporate the temporal dependencies by considering the states x i and y i to predict the next state x i+1 . If there is no deviation from the generalized Markov property p(x i+1 |x i , y i ) = p(x i+1 |x i ), then Y has no influence on X. Hence, TE is derived using the last idea and defined as . TE can be thought of as a global average or expected value of a local transfer entropy at each observation [39] The main characteristic of the local version of TE is to be measured at each time n for each destination element X in the system and each causal information source Y of the destination. It can be either positive or negative for a specific event set ( which gives the opportunity to have a measure of informativeness or noninformativeness at each point of a pair of time series. On the other hand, there exist several approximations to estimate the probability transition distributions involved in TE expression. Nevertheless, there is not a perfect estimator. It is generally impossible to minimize both the variance and the bias at the same time. Then, it is important to choose the one that best suits the characteristics of the data under study. That is the reason finding good estimators is an open research area [40]. This study followed the Kraskov-Stögbauer-Grassberger) KSG estimator [41], which focused on small samples for continuous distributions. Their approach is based on nearest neighbours. Although obtaining insight into this estimator is not easy, we will try it in the following.
Let X = (x 1 , x 2 , . . . , x d ) now denote a d-dimensional continuous random variable whose probability density function is defined as p : R d → R. The continuous or differential Shannon entropy is defined as The KSG estimator aims to use similar length scales for K-nearest-neighbour distance in different spaces, as in the joint space to reduce the bias [42].
To obtain the explicit expression of the differential entropy under the KSG estimator, Beneath the assumption that i,K is twice the (maximum norm) distance to the k-th nearest neighbour of X(i), the differential entropy can be estimated aŝ where ψ is known as the digamma function and can be defined as the derivative of the logarithm of the gamma function Γ(x) The parameter K defines the size of the neighbourhood to use in the local density estimation. It is a free parameter, but there exists a trade-off between using a smaller or larger value of K. The former approach should be more accurate, but the latter reduces the variance of the estimate. For further intuition, Figure 1 graphically shows the mechanism for choosing the nearest neighbours at K = 3. The KSG estimator of TE can be derived based on the previous estimation of the differential entropy. Yet, in most cases, as analysed in this work, no analytic distribution is known. Hence, the distribution of T Y s →X (k, l) must be computed empirically, where Y s denotes the surrogate time series of Y. This is done by a resampling method, creating a large number of surrogate time-series pairs {Y s , X} by shuffling (for permutations or redrawing for bootstrapping) the samples of Y. In particular, the distribution of T Y s →X (k, l) is computed by permutation, under which surrogates must preserve p(x n+1 |x n ) but not p(x n+1 |x n , y n ).

Deep Learning Models
We can think of artificial neural networks (ANNs) as a mathematical model whose operation is inspired by the activity and interactions between neuronal cells due to their electrochemical signals. The main advantages of ANNs are their non-parametric and nonlinear characteristics. The essential ingredients of an ANN are the neurons that receive an input vector x i , and through the point product with a vector of weights w, generate an output via the activation function g(·): where b is a trend to be estimated during the training process. The basic procedure is the following. The first layer of neurons or input layer receives each of the elements of the input vector x i and transmits them to the second (hidden) layer. The next hidden layers calculate their output values or signals and transmit them as an input vector to the next layer until reaching the last layer or output layer, which generates an estimation for an output vector. Further developments of ANNs have brought recurrent neural networks (RNNs), which have connections in the neurons or units of the hidden layers to themselves and are more appropriate to capture temporal dependencies and therefore are better models for time series forecasting problems. Instead of neurons, the composition of an RNN includes a unit, an input vector x t , and an output signal or value h t . The unit is designed with a recurring connection. This property induces a feedback loop, which sends a recurrent signal to the unit as the observations in the training data set are analysed. In the internal process, backpropagation is performed to obtain the optimal weights. Unfortunately, backpropagation is sensitive to long-range dependencies. The involved gradients face the problem of vanishing or exploding. Long-short-term memory (LSTM) models were introduced by Hochreiter and Schmidhuber [43] to avoid these problems. The fundamental difference is that LSTM units are provided with memory cells and gates to store and forget unnecessary information.
The final ANNs we need to discuss are convolutional neural networks (CNNs). They can be thought of as a kind of ANN that uses a high number of identical copies of the same neuron. This allows the network to express computationally large models while keeping the number of parameters small. Usually, in the construction of these types of ANNs, a max-pooling layer is included to capture the largest value over small blocks or patches in each feature map of previous layers. It is common that CNN and pooling layers are followed by a dense fully connected layer that interprets the extracted features. Then, the standard approach is to use a flattened layer between the CNN layers and the dense layer to reduce the feature maps to a single one-dimensional vector [44].

Classification Metrics
In classification problems, we have the predicted class and the actual class. • F1 score = 2 PPV×TPR PPV+TPR . The most complex measure is the area under the curve (AUC) of the receiver operating characteristic (ROC), where it expresses the pair (TPR τ , 1 − TNR τ ) for different thresholds τ. Contrary to the other metrics, the AUC of the ROC is a quality measure that evaluates all the operational points of the model. A model with the aforementioned metric equal to 0.5 is considered a random model. Then, a value significantly higher than 0.5 is considered a model with predictive power, with a value of 1 the upper bound of this quantity.

Data
An important part of the work is the acquisition and preprocessing of data. We focus on the period of time from 1 January 2017 to 9 January 2021 at a daily frequency for a total of n = 1470 observations. As a priority, we consider the variables listed in Table 1 as potential drivers of the price direction of Bitcoin (BTC). Investor attention is considered Google Trends with the query = "Bitcoin". Additionally, the number of mentions is properly scaled to make comparisons between days of different months because by default, Google Trends weighs the values by a monthly factor. Then, the log return of the resulting time series is calculated. The social media data are collected from the Twitter API (https://developer.twitter. com/en/docs/twitter-api, accessed on 15 January 2021). Nevertheless, the API of Twitter only enables downloading the latest 3200 tweets of a public profile, which generally was not enough to cover the period of study. Then, the dataset has been completed with the freely available repository of https://polititweet.org/ (accessed on 15 January 2021). In this way, the collected number of tweets was 21,336, 22,808, 24,702, 11,140, and 26,169 for each of the profiles listed on Table 1 in the social media type, respectively. The textual data of each tweet in the collected dataset are transformed to a sentiment polarity score through the VADER lexicon [45]. Then, the scores are aggregated daily for each profile. The resulting daily time series have missing values due to the inactivity of the users, and then a third-order spline is considered before calculating their differences. The last is to stationarize the polarity time series. It is important to remember that Donald Trump's account was blocked on 8 January 2021, so it was also necessary to impute the last value to have series of the same length.
The economic policy uncertainty index is a Twitter-based uncertainty index (Twitter-EPU). The creators of the index used the Twitter API to extract tweets containing keywords related to uncertainty ("uncertain", "uncertainly", "uncertainties", "uncertainty") and econ-omy ("economic", "economical", "economically", "economics", "economies", "economist", "economists", "economy"). Then, we use the index consisting of the total number of daily tweets containing inflections of the words uncertainty and economy (Please consult https://www.policyuncertainty.com/twitter_uncert.html for further details of the index, accessed on 15 January 2021). The risk aversion category considers the financial proxy to risk aversion and economic uncertainty proposed as a utility-based aversion coefficient [10]. A remarkable feature of the index is that in early 2020, it reacted more strongly to the new COVID-19 infectious cases than did a standard uncertainty proxy.
As complementary drivers, it includes a set of highly capitalized cryptocurrencies and a heterogeneous portfolio of financial indices. Specifically, Ethereum (ETH), Litecoin (LTC), Ripple (XRP), Dogecoin (DOGE), and the stable coin TETHER are included from yahoo finance (https://finance.yahoo.com/, accessed on 15 January 2021). The components of the heterogeneous portfolio are listed in Table 1, which takes into account the Chicago Board Options Exchange's CBOE Volatility Index (VIX). This last information was extracted from Bloomberg (https://www.bloomberg.com/, accessed on 15 January 2021). It is important to point out that risk aversion and the financial indices do not have information that corresponds to weekends. The imputation method to obtain a complete database consisted of repeated Friday values as a proxy for Saturday and Sunday. Then, the log return of the resulting time series is calculated. This last transformation was also made for Twitter-EPU and cryptocurrencies. The complete dataset can be found in the Supplementary Material.
Usually, the econophysics and data science approaches share the perspective of observing data first and then modelling the phenomena of interest. In this spirit, and with the intention of gaining intuition on the problem, the standardized time series (target and potential drivers), as well as the cumulative return of the selected cryptocurrencies and financial assets are plotted in Figures 2 and 3. The former figure shows high volatility in almost all the studied time series around March 2020, which might be due to the declaration of the pandemic by the World Health Organization (WHO) and the consequent fall of the worlds main stock markets. The latter figure exhibits the overall best cumulative gains for BTC, ETH, LTC, XRP, DOGE, and Tesla. It is worth noting that the only asset with a comparable profit to that of the cryptocurrencies is Tesla, which reaches high cumulative returns starting at the end of 2019 and increases its uptrend immediately after the announcement of the worldwide health emergency.  Furthermore, Figure 4 shows the heatmap of the correlation matrix of the preprocessed dataset. We can observe the formation of certain clusters, such as cryptocurrencies, metals, energy, and financial indices, which tells us about the heterogeneity of the data. It should also be noted that the VIX volatility index is anti-correlated with most of the variables. Additionally, the main statistical descriptors of the data are presented in Table 2. The first column is the variable's names or tickers. The subsequent columns represent the mean, standard deviation, skewness, kurtosis, Jarque Bera test (JB), and the associated p value of the test for each variable, i.e., target, and potential drivers. Basically, none of the time series passes the test of normality distribution, and most of them present a high kurtosis, which is indicative of heavy tail behaviour. Finally, stationarity was checked in the sense of Dickey-Fuller and the Phillips-Perron unit root tests, where all variables pass both tests.

Variable Selection
The observed characteristics of the data in the previous section justify the use of a nonparametric approach to determine the explainable features to be employed in the predictive classification model. Therefore, the variable selection procedure consisted of applying the continuous transfer entropy from each driver to Bitcoin using the KSG estimation. Figure 5 shows the average transfer entropy when varying the Markov order k, l and neighbour parameter K from one to ten for a total of 1000 different estimations by each driver. The higher the intensity of the colour, the higher the average transfer entropy (measured in nats). The grey cases do not transfer information to BTC. In other words, these cases do not show a statistically significant flow of information, where the permutation test is applied to construct 100 surrogate measurements under the null hypothesis of no directed relationship between the given variables. Figure 5. Average transfer entropy from each potential driver to BTC. The y-axis indicates the driver, and the x-axis indicates the Markov order pair k, l of the source and target. From (a) to (j), nearest neighbours K run from one to ten, respectively.
The tuple of parameters {k, l, K} that give the highest average transfer entropy from each potential driver to BTC are considered optimal, and the associated local TE is kept as a feature in the classification model of Bitcoin's price direction. Figure 6 shows the local TE from each statistically significant driver to BTC at the optimal parameter tuple {k, l, K}. Note that the set of local TE time series is limited to 23 features. Consequently, the set of originally proposed potential drivers is reduced from 25 to 23. Surprisingly, NASDAQ and Tesla do not send significative information to BTC for any value of {k, l, K} in the grid of the 1000 different configurations. The variations are smooth on K, but not on the Markov order k, l. It is also notorious the negligible amounts of the flow of information at k = l = 1.

Bitcoin's Price Direction
The task of detecting Bitcoin's price direction was done through a deep learning approach. The first step consisted of splitting the data into training, validation, and test datasets. The chosen training period runs from 1 January 2017 to 4 January 2020, or 75% of the original entire period of time, and is characterized as a prepandemic scenario. The validation dataset is restricted to the period from 5 January 2020 to 11 July 2020, or 13% of the original data, and is considered the pandemic scenario. The test dataset involves the postpandemic scenario from 12 July 2020 to 9 January 2021 and contains 12% of the complete dataset. Deep learning forecasting requires transforming the original data into a supervised data set. Here, samples of 74 historical days and a one-step prediction horizon are given to the model to obtain a supervised training dataset, with the first dimension being a power of two, which is important for the hyperparameter selection of the batch dimension. Specifically, the sample dimensions are 1024, 114, and 107 for training, validation, and testing, respectively. Because we are interested in predicting the direction of BTC, the time series are not demeaned and instead are only scaled by their variance when feeding the deep learning models. An important piece in a deep learning model is the selection of the activation function. In this work, the rectified linear unit (ReLU) was selected for the hidden layers. Then, for the output layer, the sigmoid function is chosen because we are dealing with a classification problem. In addition, an essential ingredient is the selection of the stochastic gradient descent method. Here, Adam optimization is selected based on adaptive estimation of the first-and second-order moments. In particular, we used version [46] to search for the long-term memory of past gradients to improve the convergence of the optimizer.
There exist several hyperparameters to take into account when modelling a classification problem under a deep learning approach. These hyperparameters must be calibrated on the training and validation datasets to obtain reliable results on the test dataset. The usual procedure to set them is via a grid search. Nevertheless, deeper networks with more computational power are necessary to obtain the optimal values in a reasonable amount of time. To avoid excessive time demands, we vary the most crucial parameters in a small grid and apply some heuristics when required. The number of epochs, is selected under the early stopping procedure. Another crucial hyperparameter is the batch, or the number of samples to work through before updating the internal weight of the model. For this parameter the selected grid was {32, 64, 128, 256}. Additionally, we consider the initial learning rates at which the optimizer starts the algorithm, which were {0.001, 0.0001}. As an additional method of regularization, the effect of dropping between consecutive layers is added. This value can take values from 0 to 1. Our grid for this hyperparameter is {0.3, 0.5, 0.7}. Finally, because of the stochastic nature of the deep learning models, it is necessary to run several realizations and work with averages. We repeat the hyperparameter selection with ten different random seeds for robustness. The covered scenarios are the following: univariate (S1), where bitcoin is self-driven; all features (S2), where all the potential drivers listed in Table 1 are included as features of the model; significative features (S3), only statistically significant drivers under the KSG transfer entropy approach are considered as features; local TE, only the local TE of the statistically significant drivers are included as a feature; and finally the significative features + local TE (S5) scenario, which combines scenarios (S3) and (S4). Finally, five different designs have been proposed for the architectures of the neural networks, which are denoted as deep LSTM (D1), wide LSTM (D2), deep bidirectional LSTM (D3), wide bidirectional LSTM (D4), and CNN (D5). The specific designs and diagrams of these architectures are displayed in Figure 7. In total, 6000 configurations or models were executed, which included the grid search for the optimal hyperparameters, the different scenarios and architectures, and the realizations on different seeds to avoid biases due to the stochastic nature of the considered machine learning models.
The computation was done in a workstation with the following characteristics: Alienware Aurora R7, Ubuntu 20.10, Processor i9-9900X 8 cores, 16 logic, 64 GB RAM, Dual NVIDIA RTX 2080 ti, 3TB HHD. On this equipment, the computational demand extends the execution to nearly 60 h of computation. Tables 3 and 4 present the main results for the validation and test datasets, respectively. Table 2 explicitly states the best value for the dropout, learning rate (LR), and batch hyperparameters. In both tables, the hashtag (#) column indicates the number of times the specific scenario gives the best score for the different metrics considered so far. Hence, the architecture design D3 for case S3 yields the highest number of metrics with the best scores in the validation dataset. In contrast, in the test dataset, the highest number of metrics with the best scores correspond to design D2 for case S1. Nevertheless, design D5 from case S5 is close in the sense of the # value, where it presents the best AUC and PPV scores. An important point to keep in mind is that only during the validation stage we find models with an AUC greater than 0.6, so this metric does not give evidence of predictive power in the testing stage.  In a robustness discussion, we would like to compare our predictive feature with the existing approaches. While the current studies look at the conventional approach of econometrics [29,30], our study sheds light on the deep learning method. Accordingly, we had two samples (training sample and test group). Therefore, it allows us to validate our findings with different periods. The unique, comparable study that we have found in the area of learning models is due to [26]. However, they only show the results for two accuracy metrics when predicting the direction of the US markets. Even so, barely the metrics exceed the value of 0.6, and it is not clear if they are considering a test set.

Discussion
We start from descriptive statistics as a first approach to intuitively grasp the complex nature of Bitcoin, as well as its proposed heterogeneous drivers. As expected, the variables did not satisfy the normality assumption and presented high kurtosis, highlighting the need to use non-parametric and nonlinear analyses.
The KSG estimation of TE found a consistent flow of information from the potential drivers to Bitcoin through the considered range of K nearest neighbours. Even when, in principle, the variance of the estimate decreases with K, the results obtained with K = 1 do not change abruptly for larger values. In fact, the variation in the structure of the TE matrix for different Markov orders k, l is more notorious. Additionally, attention must be paid to the evidence about the order k = l = 1 through values near zero. Practitioners usually assume this scenario under Gaussian estimations. A precaution must then be made about the memory parameters of Markov, at least when working with the KSG estimation. The associated local TE does not show any particular pattern beyond high volatility, reaching values of four nats when the average is below 0.1. Thus, volatility might be a better proxy for price fluctuations in future studies.
In terms of intuitive explanations, we found that the drivers of Bitcoin might not truly capture its returns in distressed periods. Although we expected to witness that the predictive power of these determinants might play an important role across time horizons, it turns out that the prediction model of Bitcoin relies on a choice of a specific period. Thus, our findings also confirm the momentum effect that exists in this market [47]. Due to the momentum effect, the timing of market booms could not truly be supported much for further analysis by our models. In regard to our main social media hypothesis, the popularity of Bitcoin content still exists as the predictive component in the model. More noticeably, our study highlights that Bitcoin prices can be driven by momentum on social media [24]. However, the selection of training and testing periods should be cautious with the boom and burst of this cryptocurrency. Apparently, while the fundamental value of Bitcoin is still debatable [48], using behavioural determinants could have some merits in predicting Bitcoin. Thus, we believe that media content would support the predictability of Bitcoin prices alongside other financial indicators. Concomitantly, after clustering these factors, we found that the results seem better able to provide insights into Bitcoin's drivers.
On the other hand, the forecasting of Bitcoin's price direction improves in the validation set but not for all metrics in the test dataset when including significant drivers or local TE as a feature. Nonetheless, the last assertion relies on the number of metrics with the best scores. Although the test dataset having the best performance corresponds to the deep bidirectional LSTM (D3) for the scenario univariate (S3), this case only beat three of the eight metrics. The other five metrics are outperformed by scenarios including significative features (S3) and significative features + local TE (S5). Furthermore, the second-best performances are tied with two of the eight metrics with leading values. Interestingly, the last case shows the best predictive power on the CNN model using significant features as well as local TE indicators (D5-S5). In particular, it outperforms the AUC and PPV overall, yet AUC is in the border of a random model. To delve into the explainable aspect, a future work will seek to apply the Shapley-Lorentz decomposition proposed in [49,50]. There the authors develop a global methodology, which can be associated with a generalization of AUC-ROC.
Moreover, it is important to note that the selected test period is atypical in the sense of a bull period for Bitcoin as a result of the turbulence generated by the COVID-19 public health emergency; this might induce safe haven behaviour related to this asset and increase its price and capitalization. This atypical behaviour opens the door to propose future work to model Bitcoin by the self-exciting process of the Hawkes model during times of great turbulence.
We would like to end by emphasizing that we were not exhaustive in modelling classification forecasting. In contrast, our intention was to exemplify the effect of including the significant features and local TE indicators under different configurations of a deep learning model through a variety of classification metrics. Two methodological contributions to highlight are the use of nontraditional indicators such as market sentiment, as well as a continuous estimation of the local TE as a tool to determine additional drivers in the classification model. Finally, the models presented here are easily adaptable to high-frequency data because they are non-parametric and nonlinear in nature.