Cryptocurrency Trading Using Machine Learning

We present a model for active trading based on reinforcement machine learning and apply it to five major cryptocurrencies in circulation. Relative to a buy-and-hold approach, we demonstrate that this model yields enhanced risk-adjusted returns and serves to reduce downside risk. These findings hold when accounting for actual transaction costs. We conclude that real-world portfolio management application of the model is viable; however, performance can vary depending on how the model is calibrated in test samples.


Introduction
Is active trading viable in cryptocurrency markets and can it yield superior performance relative to a buy-and-hold approach? Using a direct reinforcement (DR) model we demonstrate that, yes, active trading is both viable and profitable in such markets and can yield superior risk-adjusted performance relative to a passive buy-and-hold approach. We show that despite the high volatility risks which cryptocurrency traders face, our DR approach can serve to actively reduce downside risks in their portfolios.
The motivation for our DR model stems from two streams of literature. The first is the growing interest in explaining or forecasting the price behavior of cryptocurrencies and in forming trading strategies in such markets (Akcora et al. 2018; Baur and Dimpfl 2018; Brauneis and Mestel 2018; Catania et al. 2019; Phillip et al. 2018). This stream of literature shows how these blockchain-based digital assets have distinct technological and network attributes that set them apart from traditional assets. It also demonstrates the growing need to understand whether active trading in such markets is viable. For example, Tzouvanas et al. (2020) show momentum trading may be profitable but only in the short run, while Grobys and Sapkota (2019) provide evidence that momentum trading is not profitable when examining a large cross-section of cryptocurrencies. Makarov and Schoar (2020) maintain "...while significant attention has been paid to the dramatic ups and downs in the volume and price of cryptocurrencies, there has not been a systematic analysis of the trading and efficiency of cryptocurrency markets..." (p. 1). In addition, and as is shown in Koutmos and Payne (2020), many heterogeneous traders are present in bitcoin exchanges and carry different expectations about the future direction and volatility of bitcoin. Thus, there are many challenges associated with modeling bitcoin prices accurately and determining optimal trading strategies. Our DR approach overcomes such obstacles specifically because it eliminates the need to search for and select explanatory factors for cryptocurrencies, or to account for the trading behavior of different types of investors. This strength of our approach is especially important considering how cryptocurrency prices can undergo regime shifts across time (Koutmos 2018; Li and Wang 2017).
The second stream of literature that motivates our DR model is the growing interest in machine learning methods in economic applications and in the development of trading systems that can be automated (Acosta 2018; Henrique et al. 2019; Saltzman and Yung 2018). In fact, there is a growing number of patents and patent filings that devise machine learning methods for either stock selection or trading automation (Kumar and Raychaudhuri 2014). Nevertheless, Huck (2019) indicates that, first, machine learning approaches can yield mixed results depending on the number of predictor variables and, second, they may not provide a competitive edge for investors given how readily available these models now are to all investors in the market. Notwithstanding these limitations, there is a budding strand of recent literature that tests the ability of machine learning methods to be used effectively in portfolio allocation decisions, exchange rate forecasting and even bankruptcy prediction (Jiang et al. 2020; Kristóf and Virág 2020; Zhang and Hamori 2020). Given the increasing use of cryptocurrencies by financial institutions and the investing public, it is important to test whether such methods can also be applied profitably or used for risk management purposes across cryptocurrency exchanges. As mentioned by Koutmos (2019), macroeconomic variables and models that successfully explain price variations in traditional asset classes may not be effective in modeling cryptocurrency price changes.
As is discussed further below, and apart from the enhanced performance it yields relative to a buy-and-hold approach, our DR model is advantageous for the following reasons. First, it eliminates the need to build a forecasting model for price movements. This is important because it avoids Bellman's curse of dimensionality and the need to somehow aggregate high-dimensional data into a parsimonious model. Second, it uses an adaptive algorithm that learns trading strategies which balance returns against a specified measure of risk.
The remainder of our study proceeds as follows. Section 2 describes our DR model. Section 3 discusses our cryptocurrency sample data. Section 4 presents results from experimental trials of our model. In this section, we show how our DR model provides superior risk-adjusted performance relative to a buy-and-hold strategy. Finally, Section 5 concludes and makes suggestions for future extensions of our analysis.

Direct Reinforcement Learning
A trading system's ultimate objective is to optimize some measure of risk-adjusted returns across time while also considering transaction costs. As in Moody and Saffell (2001), our direct reinforcement (DR) approach contrasts with methods such as dynamic programming, temporal difference learning or q-learning. 1 Such models seek to learn a value function and lend themselves to problems (such as checkers, backgammon or tic-tac-toe) where immediate performance feedback is not promptly available at each time interval. On the other hand, and since trading decisions are readily measurable at each time interval, our DR model can learn to optimize risk-adjusted returns because it is receiving feedback at each time interval (each trading day in our case here).
In our DR model, we seek to estimate the parameters of a nonlinear autoregressive model that achieve the greatest risk-adjusted returns. Our trader can take long or short positions in each of the cryptocurrency markets that we sample, F_t ∈ {1, −1}, at each time interval t. 2 F_t values of 1 and −1 correspond to an entirely long and an entirely short position, respectively. The cryptocurrency price series being traded is denoted as z_t. The position, F_t, is established or maintained at the end of time interval t and re-evaluated at the end of t + 1. While trading is possible at each time interval, transaction fees discourage excessive trading. 3

1 See (Bellman 1957; Sutton 1988; Watkins and Dayan 1992). Direct reinforcement learning, or recurrent reinforcement learning, as it is often synonymously referred to, denotes a class of models that do not have to learn a value function (Carapuço et al. 2018).
The traded cryptocurrency price series is denoted as z_t and returns are r_t = z_t / z_{t−1} − 1. F_t can be represented as a function of (learned) system parameters θ:

F_t = sign(θ · x_t),  x_t = [1, r_t, r_{t−1}, ..., r_{t−m+1}, F_{t−1}],  (1)

whereby m is the number of autoregressive returns to be used. In this way, the model "decides" its next position based on the last m returns of the asset. During training, the discrete sign function is replaced with the differentiable tanh function, under which partial positions −1 < F_t < 1 can be held. Trading returns R_t can then be defined as:

R_t = F_{t−1} r_t − δ |F_t − F_{t−1}|,  (2)

with δ as the transaction cost. The performance measure we use to provide feedback to (1) and (2) is the Sortino ratio:

S = E[R_t] / sqrt(E[min(R_t, 0)^2]),  (3)

where the denominator is the downside deviation of trading returns. As Moody and Saffell (2001) indicate, several measures and variations thereof can be used to provide performance feedback. In untabulated results (available on request), we show that the Sharpe ratio discourages the model from making trades that produce large positive returns. This is because large positive returns have the same symmetric impact on volatility as large negative returns of equal magnitude (and volatility 'penalizes' the Sharpe ratio, regardless of the sign of returns). 4 The Sortino ratio in (3) is thus chosen to be the reward function S, whereby parameters θ can be optimized via gradient ascent for each epoch:

θ ← θ + η dS(θ)/dθ,  (4)

where η is the learning rate. This is repeated until convergence, whereby the derivative dS(θ)/dθ is calculated with a dynamic computational graph using PyTorch (Paszke et al. 2017). After optimal parameters are found over a training set, simulated trading results are generated for the subsequent time intervals, as we discuss further in Section 4. The full training process is detailed in Appendix A.
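The training loop described by (1)-(4) can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's implementation: the tanh position function over the last m returns plus the previous position, the small epsilon inside the downside deviation, and the synthetic return series are all our assumptions.

```python
import torch

def positions(theta, r, m):
    """Differentiable positions F_t = tanh(theta . [r_{t-m..t-1}, F_{t-1}, 1])."""
    F_prev = torch.zeros(())
    out = []
    for t in range(m, len(r)):
        # Feature vector: last m returns, previous position, and a bias term.
        x = torch.cat([r[t - m:t], F_prev.reshape(1), torch.ones(1)])
        F_prev = torch.tanh(theta @ x)
        out.append(F_prev)
    return torch.stack(out)

def sortino(theta, r, m, delta=0.001):
    F = positions(theta, r, m)
    # Trading returns (2): R_t = F_{t-1} * r_t - delta * |F_t - F_{t-1}|
    R = F[:-1] * r[m + 1:] - delta * (F[1:] - F[:-1]).abs()
    downside = R.clamp(max=0.0)  # only negative returns enter the denominator
    return R.mean() / (downside.pow(2).mean() + 1e-8).sqrt()

torch.manual_seed(0)
r = 0.02 * torch.randn(300)       # synthetic daily returns (illustrative only)
theta = 0.1 * torch.randn(12)     # m=10 lags + previous position + bias
theta.requires_grad_(True)
optimizer = torch.optim.SGD([theta], lr=0.05)
for epoch in range(50):
    optimizer.zero_grad()
    loss = -sortino(theta, r, 10)  # gradient ascent on S = descent on -S
    loss.backward()                # dS/dtheta via the dynamic computational graph
    optimizer.step()
```

The recurrent dependence of F_t on F_{t−1} is kept inside the computational graph, so backpropagation differentiates through the whole position history, which is what distinguishes direct (recurrent) reinforcement from value-function methods.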

Data Sample
We test our DR model using five of the largest cryptocurrencies in circulation (bitcoin, ethereum, litecoin, ripple and monero, respectively). Our sample range is from 26 August 2015 to 12 August 2019 (1447 data points) for all cryptocurrencies (some cryptocurrencies, such as bitcoin, are relatively more established while others, such as ethereum, started in August of 2015).
Presently, bitcoin by itself constitutes over 65% of the total market capitalization of all cryptocurrencies in circulation. Together, our five sampled cryptocurrencies account for about 80% of the present total market capitalization. Likewise, these five presently account for more than 50% of the total trade volume of all cryptocurrencies in circulation. Table 1 shows summary statistics for our sampled cryptocurrencies. Altogether, the average daily trade volume over our sample period exceeded $7 billion. In relation to asset classes investors are more familiar with (such as equities, real estate, and bonds), cryptocurrencies display high levels of price volatility. For example, and for our sample period, bitcoin had a low price of $210 and a high price of $19,000.
3 Applying lower (higher) transaction costs to our model results in more (less) frequent trading. Additional results across a range of transaction costs are not tabulated for brevity but are available on request.
4 Markowitz (1959) shows that downside returns and upside returns should be treated differently when calculating variance-based measures of risk. Results with the Sharpe ratio as the reward function are not tabulated for brevity but are available on request.

Discussion of Findings
Following Moody and Saffell (2001), the trading system is trained on the first 1000 days, and simulated trading signals for the following 100 days are produced using the learned parameters. The training window is then shifted to include this tested-on data, and the model is retrained and tested on the following 100 days. 5 In this way, the trading data are completely out-of-sample. This process is repeated until trading signals are obtained for the whole 447-day test window, using independent models for each cryptocurrency. Table 2 shows portfolio metrics of our DR model in relation to a buy-and-hold approach. In terms of cumulative returns and risk-adjusted returns (as shown by the Sharpe and Sortino ratios), our DR model outperforms a buy-and-hold approach for all the sampled cryptocurrencies with the exception of ethereum. Our DR model performs especially well with bitcoin where, as we show in Figure 1, it more than triples the value of 1 bitcoin.
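The walk-forward procedure above can be sketched as follows. The fit and predict callables are hypothetical placeholders for the training and signal-generation steps, and we assume the 1000-day training window slides forward by the 100-day test block each iteration:

```python
def walk_forward(prices, fit, predict, train_len=1000, test_len=100):
    """Produce fully out-of-sample signals by sliding a train/test window."""
    signals = []
    start = 0
    while start + train_len < len(prices):
        train = prices[start:start + train_len]
        test = prices[start + train_len:start + train_len + test_len]
        model = fit(train)                    # retrain on the most recent window
        signals.extend(predict(model, test))  # trade the next, unseen block
        start += test_len                     # shift window past the tested data
    return signals
```

With 1447 daily observations, this yields exactly the 447 out-of-sample signal days per cryptocurrency that the test window covers.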
In terms of reducing risk (as measured by maximum drawdown and value-at-risk), our DR model outperforms buy-and-hold in all cases except for monero (when measured using maximum drawdown) and ethereum (when measured using value-at-risk, although by a negligible amount).
Our experimental results reported in Table 2 and Figure 1 account for transaction fees of 0.10% (see Footnote 3), which are applied when a trade is initiated. This table shows performance metrics between a buy-and-hold (BH) approach and our direct reinforcement (DR) model. For each performance metric (cumulative returns, Sharpe ratio, Sortino ratio, maximum drawdown and value-at-risk, respectively), we highlight in bold the metric that enhances the portfolio for our out-of-sample test period. Cumulative returns are expressed in decimal form (e.g., 1.352 corresponds to 135%). Value-at-risk is also expressed in decimal form (e.g., 0.059 corresponds to 5.9%).
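The reported portfolio metrics can be computed from a daily return series along these lines. This is a sketch: the annualization factor of 365 (cryptocurrencies trade every calendar day) and the 95% historical value-at-risk convention are our assumptions, not taken from the paper.

```python
import numpy as np

def performance_metrics(returns, periods_per_year=365):
    r = np.asarray(returns, dtype=float)
    cumulative = np.prod(1.0 + r) - 1.0                # total return, decimal form
    sharpe = np.sqrt(periods_per_year) * r.mean() / r.std()
    downside = np.minimum(r, 0.0)                      # only losses penalize Sortino
    sortino = np.sqrt(periods_per_year) * r.mean() / np.sqrt((downside ** 2).mean())
    wealth = np.cumprod(1.0 + r)                       # growth of 1 unit of capital
    max_drawdown = np.max(1.0 - wealth / np.maximum.accumulate(wealth))
    var_95 = -np.percentile(r, 5)                      # 95% historical value-at-risk
    return {"cumulative": cumulative, "sharpe": sharpe, "sortino": sortino,
            "max_drawdown": max_drawdown, "var_95": var_95}
```

The Sharpe/Sortino contrast in the code mirrors the discussion in Section 2: the Sortino denominator ignores positive returns, so large gains are not penalized as "risk".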

5 Many other time window combinations were entertained (results available on request), but we find that these windows optimize risk-adjusted performance. Re-training every 100 days allows the model to adapt to shifts in cryptocurrency market behaviors and conditions. For our system parameters in (1), we use a 10-day window of autoregressive returns (m = 10).
Our results (untabulated but available on request) show that our DR model does not behave like a momentum- or mean-reversion-type strategy (because lags greater than 1 day are significant in driving the strategy). Thus, our DR model accounts for week-to-week shifts in market behaviors and conditions in determining its long/short decisions.

Conclusions
We show how reinforcement machine learning can make cryptocurrency trading decisions that optimize actively managed portfolios. Our results compare favorably to a naive buy-and-hold approach. Theoretically, these results provide preliminary evidence that cryptocurrency prices may not follow a purely random walk process, since our active trading strategy can yield superior risk-adjusted returns for investors and even reduce portfolio downside risk across time, provided the model is trained adequately. Thus, devising novel active trading strategies, based on some quantitative or theoretical framework, may be useful in cryptocurrency markets instead of relying on naive buy-and-hold approaches. Our findings also agree theoretically with Lahmiri and Bekiros (2019), who show that cryptocurrency prices may not follow a random walk process. Chen and Hafner (2019) likewise report only weak evidence that cryptocurrencies follow random walks consistently across time; instead, their prices appear to be driven by sentiment. From a practical perspective, the direct reinforcement methodology we describe can be replicated and tested across a range of cryptocurrencies or other traditional assets, such as equities or index funds. Finally, it is important to emphasize that our approach can be successfully applied in any market where both buying and shorting are possible.
In light of these favorable findings, however, we echo the sentiments of Gold (2003) regarding finding optimal parameters in a machine learning model: "...optimal values in one market may be sub-optimal for another..." (p. 368). This claim especially highlights some of the challenges with integrating machine learning for economic decision-making. Namely, machine learning takes an atheoretical 'black box' approach. Nonetheless, machine learning poses countless opportunities.
With regards to our DR approach, it remains an open question whether integrating microstructure variables (such as the number of cryptocurrency users or trading activity) in addition to returns can yield superior portfolio performance across a larger subset of cryptocurrencies and across market regimes.
Funding: This research received no external funding.

Conflicts of Interest:
The authors declare no conflict of interest.