Contrastive Learning Framework for Bitcoin Crash Prediction
Abstract
1. Introduction
- We propose a deep learning framework based on contrastive learning that extracts latent representations from time series data and tackles the downstream time series classification task.
- We apply the proposed framework to predict Bitcoin crashes. To the best of our knowledge, we are the first to employ contrastive learning in analyzing cryptocurrencies.
- We investigate the performance of multiple time series augmentation methods in the contrastive learning based model, as well as the effect of selected augmentation parameters.
- We compare our proposed framework against six state-of-the-art classification models, including both machine learning (ML) and deep learning (DL) models.
- We explore the efficacy of ensemble techniques in an effort to improve overall classification performance.
- We compare and combine the CL framework with the Log-Periodic Power Law Singularity (LPPLS) model, which is also unprecedented, to the best of our knowledge.
2. Related Work and Background
2.1. Financial Crashes
2.2. Contrastive Learning
2.3. Contrastive Learning for Time Series
3. Methods
3.1. Problem Definition
3.2. The Epsilon Drawdown Method
3.3. Model Architecture
- for sampling, we apply augmentations that are not only commonly used for images but also reasonable for time series sequences, maintaining the overall trend of the sequence so as not to alter the original input too much;
- we retain the projection head for the downstream task.
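The kind of trend-preserving augmentations referred to above can be sketched as follows (a minimal illustration; the function names, noise levels and window length are our own choices, not the paper's exact settings):

```python
# Two common time series augmentations of the kind compared later in the
# paper (jittering and scaling); all parameter values are illustrative.
import numpy as np

def jitter(x: np.ndarray, sigma: float = 0.03, seed: int = 0) -> np.ndarray:
    """Add small Gaussian noise to every time step."""
    rng = np.random.default_rng(seed)
    return x + rng.normal(0.0, sigma, size=x.shape)

def scale(x: np.ndarray, sigma: float = 0.1, seed: int = 0) -> np.ndarray:
    """Multiply the whole sequence by a single random factor near 1,
    which rescales magnitude while keeping the overall trend."""
    rng = np.random.default_rng(seed)
    return x * rng.normal(1.0, sigma)

x = np.sin(np.linspace(0, 6, 90))      # toy 90-day window
view1, view2 = jitter(x), scale(x)     # two views forming a positive pair
```

Both transforms leave the shape of the sequence intact, which is what makes them reasonable positives for contrastive training.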
3.4. Data Augmentation
4. Experiments
4.1. Data
4.2. Contrastive Model Setup
4.3. Baseline Models
- Machine learning models include Random Forests (RF) [47], Support Vector Machine (SVM) [48], Gradient Boosting Machine (GBM) [49] and XGBoost [50], implemented with the scikit-learn and XGBoost Python libraries. These models are trained on features extracted from each window rather than on the raw time series sequence. Specifically, for each window, the average and standard deviation (std) of the daily price/return/volatility over the last 5, 10, 15, …, 30 days are calculated. In addition, the tsfeatures Python library is applied to extract more advanced time series properties (see Table 4). In the training process, grid search and 5-fold time-split cross-validation are employed. The grid-search configuration for each classification model is shown in Table 5.
- Deep learning models comprise LSTM [51] and the same TCN architecture used for contrastive learning, but trained in a fully supervised manner. Both baseline models take the same input as the downstream classifier of the proposed contrastive learning framework. The LSTM model has two layers of an LSTM-ReLU-BatchNorm combination, followed by the same projection head as in CL and a dense layer with a sigmoid activation function. For the TCN baseline, two stacks of TCN followed by the projection head and a dense-layer classifier are trained to directly predict the two desired classes for each window. We use a batch size of 32, 200 epochs and class weights of {0: 0.2, 1: 0.8} to train the two DL models, implemented with the TensorFlow Python library.
- The contrastive learning competitor uses the same architecture as the proposed framework but replaces the TCN encoder with an LSTM consisting of two layers of an LSTM-ReLU-BatchNorm combination. We denote this competitor as CL-LSTM. Its encoder is likewise trained in a self-supervised manner, using the same input and the same experimental setup as CL-TCN.
- The LPPLS Confidence Indicator is constructed using shrinking windows whose length decreases from 360 days to 30 days. For each fixed end point, we shift the start point in steps of 30 days, so 12 fittings are generated in total. To estimate parameters and filter fittings, we apply the search space and filtering conditions in Table 6.
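The feature extraction and grid-search procedure for the ML baselines above can be sketched as follows (synthetic data and a subset of the Table 5 RF grid; the feature names and helper function are our own illustrative choices):

```python
# Minimal sketch of the baseline ML setup: per-window summary features,
# then grid search with 5-fold time-split cross-validation.
# Prices and labels here are synthetic; only the RF grid values
# correspond to (a subset of) Table 5.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
prices = pd.Series(100 * np.exp(np.cumsum(rng.normal(0, 0.02, 400))))

def window_features(window: pd.Series) -> dict:
    """Mean/std of daily returns over the last 5, 10, ..., 30 days."""
    ret = window.pct_change().dropna()
    feats = {}
    for d in (5, 10, 15, 20, 25, 30):
        feats[f"ret_mean_{d}"] = ret.tail(d).mean()
        feats[f"ret_std_{d}"] = ret.tail(d).std()
    return feats

win = 30
X = pd.DataFrame([window_features(prices.iloc[i - win:i])
                  for i in range(win, len(prices))])
y = rng.integers(0, 2, size=len(X))        # stand-in for crash labels

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    {"n_estimators": [100, 150], "max_depth": [5, 6]},  # subset of Table 5
    cv=TimeSeriesSplit(n_splits=5),        # time-ordered folds, no shuffling
    scoring="f1",
)
search.fit(X, y)
```

`TimeSeriesSplit` keeps each validation fold strictly after its training fold, which is what "time-split cross-validation" requires for this kind of data.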
4.4. Augmentation Comparison and Analysis
4.5. Fine-Tuning and Sensitivity Analysis
- Window size is the length of the original time series input. Different window sizes carry different information: the window should be large enough to contain the most useful information for prediction, but not so long that the model struggles to capture the key properties of the sequence. We select window sizes of 30, 90, 180 and 360 days for the experiment.
- Batch size specifies the number of positive pairs (K) of augmented windows that are processed before the model is updated, so in our case there are 2K training samples in each batch. In the experiment, the batch size is chosen from 16, 32, 64 and 128.
- Code size indicates the length of the embedding vector (representation) extracted from the encoder network. The code size is tested in the range from 8 to 24.
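The role of the batch size K can be made concrete with a minimal NT-Xent-style loss sketch (the SimCLR-family contrastive loss [7]; the function name, temperature 0.5 and sizes below are our illustrative choices, not the paper's exact settings):

```python
# Minimal NT-Xent-style loss for a batch of K positive pairs
# (2K embeddings total), as in SimCLR-style contrastive learning.
import numpy as np

def nt_xent(z: np.ndarray, tau: float = 0.5) -> float:
    """z: (2K, d) embeddings; rows 2i and 2i+1 are a positive pair."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # L2-normalise
    sim = z @ z.T / tau                    # cosine similarity / temperature
    np.fill_diagonal(sim, -np.inf)         # exclude self-similarity
    pos = np.arange(len(z)) ^ 1            # index of each row's partner
    log_prob = sim[np.arange(len(z)), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

K, d = 16, 12                              # K positive pairs -> 2K samples
rng = np.random.default_rng(0)
loss = nt_xent(rng.normal(size=(2 * K, d)))
```

A larger K supplies more negatives per positive pair in the denominator, which is why batch size is a meaningful hyperparameter here.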
4.6. Baseline Comparison
4.7. Ensemble Model
- Hard Vote: In hard voting (also known as majority voting), every individual classifier votes for a class and the majority wins; statistically speaking, the ensemble's predicted target label is the mode of the individual predictions. In our case, the three selected base models generate binary predictions separately: if at least 2 of the 3 models output positive, the ensemble output is marked as positive; otherwise it is marked as negative.
- Soft Vote (Simple Average and Weighted Average): In soft voting, every individual classifier provides a probability that a given data point belongs to a particular target class. A simple average strategy is applied first, with the averaged probabilities converted to binary predictions for evaluation. For the weighted average, each base classifier's evaluation metric value is used as its weight before averaging the predictions. As multiple metrics are adopted in the evaluation above (F2, GM, BA, FM), the weights can be computed by looping over them, so a different weighted-average output, and hence a different final binary label, is obtained per metric.
- Logistic Regression: Logistic regression is fitted on the training set using the predicted probabilities of the 3 selected models together with the LPPLS indicator as 4 features, and is then used to predict on the test set.
- CL Indicator: As in the construction of the LPPLS indicator, we recursively train CL-TCN on windows shrinking from 360 days to 30 days with a step size of 30 days, obtaining 12 CL models. For each time point, we use the 12 models to make predictions and then calculate the fraction of models that classify it as class 1.
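The three voting rules above can be sketched with toy probabilities (every number and weight below is invented purely for illustration):

```python
# Toy illustration of the ensemble rules: three base classifiers emit
# probabilities for each test window; all values here are made up.
import numpy as np

probs = np.array([[0.7, 0.2, 0.9],
                  [0.6, 0.4, 0.3],
                  [0.1, 0.8, 0.2]])       # shape (n_models, n_windows)
votes = (probs >= 0.5).astype(int)

# Hard vote: positive when a majority (at least 2 of 3) votes positive.
hard = (votes.sum(axis=0) >= 2).astype(int)

# Soft vote, simple average of probabilities, thresholded at 0.5.
simple = (probs.mean(axis=0) >= 0.5).astype(int)

# Soft vote, weighted average: each model weighted by its validation
# metric value (F2, GM, BA or FM); the weights here are illustrative.
w = np.array([0.54, 0.50, 0.46])
weighted = ((w @ probs) / w.sum() >= 0.5).astype(int)
```

Note that hard and soft voting can disagree: the first window wins a 2-of-3 hard vote even though its average probability falls below 0.5.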
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
Appendix A.1. Features from Tsfeatures
Appendix A.2. Figures
Appendix A.3. Shuffle Experiment
| | F2 | GM | BA | FM |
|---|---|---|---|---|
| original | 0.54 ± 0.06 | 0.71 ± 0.04 | 0.72 ± 0.04 | 0.44 ± 0.04 |
| shuffle | 0.31 ± 0.18 | 0.40 ± 0.18 | 0.47 ± 0.16 | 0.26 ± 0.14 |
Appendix A.4. Drawdowns in Bitcoin Hourly Price Data
Peak Time | Peak Price | End Time | End Price | Drop | Duration |
---|---|---|---|---|---|
4/21/22 12:00 | 42,709.7 | 4/26/22 20:00 | 38,084.5 | 10.8% | 128 |
4/28/22 17:00 | 40,187.5 | 5/5/22 15:00 | 36,914.7 | 8.1% | 166 |
5/5/22 16:00 | 37,056.9 | 5/12/22 6:00 | 26,759.8 | 27.8% | 158 |
5/12/22 16:00 | 29,695.5 | 5/12/22 18:00 | 28,267.1 | 4.8% | 2 |
5/13/22 14:00 | 30,759.5 | 5/14/22 15:00 | 28,796.2 | 6.4% | 25 |
5/15/22 23:00 | 31,308.2 | 5/18/22 23:00 | 28,734.2 | 8.2% | 72 |
5/20/22 11:00 | 30,453.1 | 5/20/22 16:00 | 28,884.4 | 5.2% | 5 |
5/23/22 6:00 | 30,483.2 | 5/27/22 17:00 | 28,380.7 | 6.9% | 107 |
5/31/22 16:00 | 32,194.6 | 6/7/22 2:00 | 29,395.2 | 8.7% | 154 |
6/7/22 21:00 | 31,424.5 | 6/14/22 1:00 | 21,049.5 | 33.0% | 148 |
6/14/22 7:00 | 22,881.0 | 6/18/22 20:00 | 17,744.9 | 22.4% | 109 |
6/18/22 22:00 | 19,072.5 | 6/19/22 5:00 | 18,128.8 | 4.9% | 7 |
6/19/22 22:00 | 20,638.7 | 6/20/22 1:00 | 19,730.8 | 4.4% | 3 |
6/20/22 10:00 | 20,843.5 | 6/20/22 18:00 | 20,018.0 | 4.0% | 8 |
6/21/22 14:00 | 21,569.9 | 6/22/22 20:00 | 19,985.2 | 7.3% | 30 |
6/26/22 11:00 | 21,700.1 | 6/30/22 20:00 | 18,741.2 | 13.6% | 105 |
7/1/22 1:00 | 20,385.3 | 7/3/22 13:00 | 18,986.6 | 6.9% | 60 |
7/5/22 0:00 | 20,398.1 | 7/5/22 13:00 | 19,401.6 | 4.9% | 13 |
7/8/22 3:00 | 22,097.0 | 7/13/22 12:00 | 19,134.0 | 13.4% | 129 |
7/17/22 6:00 | 21,543.6 | 7/17/22 23:00 | 20,788.4 | 3.5% | 17 |
7/18/22 15:00 | 22,533.7 | 7/18/22 20:00 | 21,542.2 | 4.4% | 5 |
7/18/22 23:00 | 22,441.3 | 7/19/22 6:00 | 21,723.6 | 3.2% | 7 |
7/20/22 15:00 | 24,169.7 | 7/26/22 15:00 | 20,787.2 | 14.0% | 144 |
7/30/22 15:00 | 24,562.4 | 8/4/22 19:00 | 22,499.1 | 8.4% | 124 |
8/8/22 13:00 | 24,174.5 | 8/10/22 0:00 | 22,798.8 | 5.7% | 35 |
8/11/22 12:00 | 24,768.6 | 8/12/22 11:00 | 23,695.4 | 4.3% | 23 |
8/15/22 2:00 | 24,887.2 | 8/19/22 23:00 | 20,875.0 | 16.1% | 117 |
8/26/22 13:00 | 21,730.7 | 8/28/22 23:00 | 19,628.0 | 9.7% | 58 |
8/30/22 4:00 | 20,463.3 | 8/30/22 16:00 | 19,649.4 | 4.0% | 12 |
8/31/22 3:00 | 20,390.8 | 9/7/22 2:00 | 18,661.6 | 8.5% | 167 |
9/13/22 11:00 | 22,539.4 | 9/19/22 8:00 | 18,454.6 | 18.1% | 141 |
9/19/22 22:00 | 19,633.8 | 9/20/22 17:00 | 18,852.4 | 4.0% | 19 |
9/21/22 18:00 | 19,644.2 | 9/22/22 0:00 | 18,418.0 | 6.2% | 6 |
9/23/22 4:00 | 19,456.3 | 9/23/22 18:00 | 18,672.0 | 4.0% | 14 |
9/27/22 12:00 | 20,296.4 | 9/28/22 2:00 | 18,626.5 | 8.2% | 14 |
9/30/22 14:00 | 19,956.7 | 10/2/22 23:00 | 19,042.9 | 4.6% | 57 |
10/6/22 4:00 | 20,376.8 | 10/11/22 18:00 | 18,976.9 | 6.9% | 134 |
10/12/22 21:00 | 19,178.1 | 10/13/22 12:00 | 18,403.6 | 4.0% | 15 |
10/14/22 1:00 | 19,885.4 | 10/20/22 0:00 | 18,993.0 | 4.5% | 143 |
10/26/22 15:00 | 20,840.7 | 10/28/22 7:00 | 20,099.4 | 3.6% | 40 |
10/29/22 9:00 | 20,934.5 | 11/2/22 22:00 | 20,102.9 | 4.0% | 109 |
11/5/22 3:00 | 21,427.9 | 11/9/22 21:00 | 15,841.8 | 26.1% | 114 |
11/10/22 16:00 | 17,797.3 | 11/10/22 17:00 | 17,202.9 | 3.3% | 1 |
11/10/22 20:00 | 17,990.0 | 11/14/22 5:00 | 15,876.6 | 11.7% | 81 |
11/14/22 12:00 | 16,791.8 | 11/14/22 20:00 | 16,270.6 | 3.1% | 8 |
11/15/22 17:00 | 16,996.9 | 11/21/22 21:00 | 15,652.8 | 7.9% | 148 |
11/24/22 1:00 | 16,748.6 | 11/28/22 2:00 | 16,133.0 | 3.7% | 97 |
12/5/22 8:00 | 17,361.8 | 12/7/22 7:00 | 16,774.4 | 3.4% | 47 |
12/14/22 18:00 | 18,314.6 | 12/19/22 22:00 | 16,403.1 | 10.4% | 124 |
1/18/23 14:00 | 21,422.4 | 1/18/23 16:00 | 20,619.5 | 3.7% | 2 |
1/21/23 19:00 | 23,268.6 | 1/22/23 20:00 | 22,459.9 | 3.5% | 25 |
1/25/23 21:00 | 23,569.8 | 1/27/23 1:00 | 22,672.6 | 3.8% | 28 |
1/29/23 19:00 | 23,900.7 | 1/30/23 20:00 | 22,723.1 | 4.9% | 25 |
2/2/23 0:00 | 24,158.6 | 2/6/23 23:00 | 22,759.4 | 5.8% | 119 |
2/8/23 0:00 | 23,320.0 | 2/13/23 17:00 | 21,471.7 | 7.9% | 137 |
2/16/23 16:00 | 24,967.3 | 2/16/23 23:00 | 23,611.8 | 5.4% | 7 |
2/21/23 6:00 | 25,014.8 | 2/25/23 21:00 | 22,916.1 | 8.4% | 111 |
3/1/23 8:00 | 23,875.8 | 3/8/23 5:00 | 21,988.6 | 7.9% | 165 |
3/8/23 15:00 | 22,157.6 | 3/10/23 10:00 | 19,673.7 | 11.2% | 43 |
3/11/23 5:00 | 20,792.5 | 3/11/23 12:00 | 20,069.2 | 3.5% | 7 |
3/14/23 16:00 | 25,954.7 | 3/15/23 16:00 | 24,157.1 | 6.9% | 24 |
3/19/23 18:00 | 28,338.2 | 3/20/23 3:00 | 27,350.7 | 3.5% | 9 |
3/22/23 16:00 | 28,680.7 | 3/28/23 10:00 | 26,735.9 | 6.8% | 138 |
3/30/23 2:00 | 28,989.1 | 4/3/23 20:00 | 27,610.0 | 4.8% | 114 |
4/14/23 6:00 | 30,962.3 | 4/20/23 19:00 | 28,124.3 | 9.2% | 157 |
4/21/23 3:00 | 28,331.1 | 4/24/23 16:00 | 27,156.0 | 4.1% | 85 |
4/26/23 12:00 | 29,995.8 | 5/1/23 20:00 | 27,680.8 | 7.7% | 128 |
5/6/23 0:00 | 29,695.9 | 5/12/23 6:00 | 26,294.3 | 11.5% | 150 |
5/15/23 17:00 | 27,512.6 | 5/18/23 17:00 | 26,469.6 | 3.8% | 72 |
5/23/23 5:00 | 27,407.9 | 5/25/23 1:00 | 26,096.4 | 4.8% | 44 |
5/29/23 0:00 | 28,232.0 | 6/1/23 2:00 | 26,721.2 | 5.4% | 74 |
6/3/23 15:00 | 27,309.6 | 6/6/23 12:00 | 25,513.6 | 6.6% | 69 |
6/6/23 23:00 | 27,241.0 | 6/10/23 5:00 | 25,520.5 | 6.3% | 78 |
6/13/23 11:00 | 26,186.1 | 6/15/23 8:00 | 24,865.1 | 5.0% | 45 |
6/23/23 17:00 | 31,256.1 | 6/30/23 13:00 | 30,002.4 | 4.0% | 164 |
7/6/23 8:00 | 31,303.9 | 7/6/23 23:00 | 29,931.9 | 4.4% | 15 |
7/13/23 19:00 | 31,663.5 | 7/20/23 18:00 | 29,687.3 | 6.2% | 167 |
7/23/23 18:00 | 30,259.9 | 7/24/23 17:00 | 29,033.6 | 4.1% | 23 |
8/2/23 1:00 | 29,855.1 | 8/7/23 15:00 | 28,820.4 | 3.5% | 134 |
8/9/23 12:00 | 30,043.5 | 8/16/23 5:00 | 29,110.5 | 3.1% | 161 |
8/16/23 7:00 | 29,167.8 | 8/22/23 21:00 | 25,620.4 | 12.2% | 158 |
8/23/23 19:00 | 26,639.8 | 8/25/23 14:00 | 25,831.1 | 3.0% | 43 |
8/29/23 18:00 | 27,975.1 | 9/1/23 17:00 | 25,452.6 | 9.0% | 71 |
9/8/23 0:00 | 26,392.0 | 9/11/23 19:00 | 25,001.4 | 5.3% | 91 |
| | Training | Test |
|---|---|---|
| Period | 2022-04-27 9:00:00 ∼ 2023-06-10 00:00:00 | 2023-06-10 1:00:00 ∼ 2023-09-19 23:00:00 |
| # Windows | 9787 | 2447 |
| # Label = 1 | 1677 | 264 |
| | F2 | GM | BA | FM |
|---|---|---|---|---|
| RF | | | | |
| SVM | | | | |
| GBM | | | | |
| XGB | | | | |
| LSTM | | | | |
| TCN | | | | |
| CL-LSTM | | | | |
| CL-TCN | | | | |
References
- Bouri, E.; Gil-Alana, L.A.; Gupta, R.; Roubaud, D. Modelling long memory volatility in the Bitcoin market: Evidence of persistence and structural breaks. Int. J. Financ. Econ. 2019, 24, 412–426. [Google Scholar] [CrossRef]
- Gradojevic, N.; Kukolj, D.; Adcock, R.; Djakovic, V. Forecasting Bitcoin with technical analysis: A not-so-random forest? Int. J. Forecast. 2021, 39, 1–17. [Google Scholar] [CrossRef]
- Wheatley, S.; Sornette, D.; Huber, T.; Reppen, M.; Gantner, R.N. Are Bitcoin bubbles predictable? Combining a generalized Metcalfe’s Law and the Log-Periodic Power Law Singularity model. R. Soc. Open Sci. 2019, 6, 180538. [Google Scholar] [CrossRef]
- Geuder, J.; Kinateder, H.; Wagner, N.F. Cryptocurrencies as financial bubbles: The case of Bitcoin. Financ. Res. Lett. 2019, 31, S1544612318306846. [Google Scholar] [CrossRef]
- Shu, M.; Zhu, W. Real-time prediction of Bitcoin bubble crashes. Phys. A Stat. Mech. Its Appl. 2020, 548, 124477. [Google Scholar] [CrossRef]
- Sezer, O.B.; Gudelek, M.U.; Ozbayoglu, A.M. Financial time series forecasting with deep learning: A systematic literature review: 2005–2019. Appl. Soft Comput. 2020, 90, 106181. [Google Scholar] [CrossRef]
- Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. arXiv 2020, arXiv:2002.05709. [Google Scholar]
- Jing, L.; Tian, Y. Self-Supervised Visual Feature Learning With Deep Neural Networks: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 4037–4058. [Google Scholar] [CrossRef] [PubMed]
- Laptev, N.; Yosinski, J.; Li, L.E.; Smyl, S. Time-series Extreme Event Forecasting with Neural Networks at Uber. In Proceedings of the ICML 2017 TimeSeries Workshop, Sydney, Australia, 6–11 August 2017. [Google Scholar]
- He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar] [CrossRef]
- Grill, J.B.; Strub, F.; Altché, F.; Tallec, C.; Richemond, P.H.; Buchatskaya, E.; Doersch, C.; Pires, B.A.; Guo, Z.D.; Azar, M.G.; et al. Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning. NeurIPS 2020, 33, 21271–21284. [Google Scholar]
- Johansen, A.; Sornette, D. Stock market crashes are outliers. Eur. Phys. J. B 1998, 1, 141–143. [Google Scholar] [CrossRef]
- Johansen, A.; Sornette, D. Large Stock Market Price Drawdowns Are Outliers. J. Risk 2001, 4, 69–110. [Google Scholar] [CrossRef]
- Stiglitz, J.E. The Lessons of the North Atlantic Crisis for Economic Theory and Policy; MIT Press: 2014; pp. 335–347. [CrossRef]
- Focardi, S.M.; Fabozzi, F.J. Can We Predict Stock Market Crashes. J. Portf. Manag. 2014, 40, 183–195. [Google Scholar] [CrossRef]
- Sornette, D. Why Stock Markets Crash: Critical Events in Complex Financial Systems. J. Risk Insur. 2002, 72, 190. [Google Scholar]
- Galbraith, J.K. The Great Crash 1929; Harper Business: New York, NY, USA, 2009. [Google Scholar]
- Becker, S.; Hinton, G.E. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature 1992, 355, 161–163. [Google Scholar] [CrossRef] [PubMed]
- Bromley, J.; Bentz, J.W.; Bottou, L.; Guyon, I.; LeCun, Y.; Moore, C.; Sackinger, E.; Shah, R. Signature verification using a “Siamese” time delay neural network. Int. J. Pattern Recognit. Artif. Intell. 1993, 7, 669–688. [Google Scholar] [CrossRef]
- Le-Khac, P.H.; Healy, G.; Smeaton, A.F. Contrastive Representation Learning: A Framework and Review. IEEE Access 2020, 8, 193907–193934. [Google Scholar] [CrossRef]
- Chen, X.; He, K. Exploring Simple Siamese Representation Learning. arXiv 2020. [Google Scholar] [CrossRef]
- Oord, A.v.d.; Li, Y.; Vinyals, O. Representation Learning with Contrastive Predictive Coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
- Mohsenvand, M.; Izadi, M.; Maes, P. Contrastive Representation Learning for Electroencephalogram Classification. In Proceedings of the Machine Learning for Health NeurIPS Workshop, Virtual, 11 December 2020; pp. 238–253. Available online: https://proceedings.mlr.press/v136/mohsenvand20a.html (accessed on 13 January 2024).
- Franceschi, J.Y.; Dieuleveut, A.; Jaggi, M. Unsupervised Scalable Representation Learning for Multivariate Time Series. arXiv 2019, arXiv:1901.10738. [Google Scholar]
- Eldele, E.; Ragab, M.; Chen, Z.; Wu, M.; Kwoh, C.K.; Li, X.; Guan, C. Time-Series Representation Learning via Temporal and Contextual Contrasting. arXiv 2022, arXiv:2208.06616. [Google Scholar] [CrossRef] [PubMed]
- Yue, Z.; Wang, Y.; Duan, J.; Yang, T.; Huang, C.; Tong, Y.; Xu, B. TS2Vec: Towards Universal Representation of Time Series. arXiv 2022, arXiv:2106.10466. [Google Scholar] [CrossRef]
- Pöppelbaum, J.; Chadha, G.S.; Schwung, A. Contrastive learning based self-supervised time-series analysis. Appl. Soft Comput. 2022, 117, 108397. [Google Scholar] [CrossRef]
- Deldari, S.; Smith, D.V.; Xue, H.; Salim, F.D. Time Series Change Point Detection with Self-Supervised Contrastive Predictive Coding. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; pp. 3124–3135. [Google Scholar] [CrossRef]
- Hou, M.; Xu, C.; Liu, Y.; Liu, W.; Bian, J.; Wu, L.; Li, Z.; Chen, E.; Liu, T.Y. Stock Trend Prediction with Multi-granularity Data: A Contrastive Learning Approach with Adaptive Fusion. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management, Virtual Event, QLD, Australia, 1–5 November 2021; pp. 700–709. [Google Scholar] [CrossRef]
- Wang, G. Coupling Macro-Sector-Micro Financial Indicators for Learning Stock Representations with Less Uncertainty. Proc. AAAI Conf. Artif. Intell. 2021, 35, 4418–4426. [Google Scholar]
- Feng, W.; Ma, X.; Li, X.; Zhang, C. A Representation Learning Framework for Stock Movement Prediction. SSRN Electron. J. 2022, 144, 110409. [Google Scholar] [CrossRef]
- Zhan, D.; Dai, Y.; Dong, Y.; He, J.; Wang, Z.; Anderson, J. Meta-Adaptive Stock Movement Prediction with Two-Stage Representation Learning. In Proceedings of the NeurIPS 2022 Workshop on Distribution Shifts: Connecting Methods and Applications, New Orleans, LA, USA, 3 December 2022. [Google Scholar]
- Zhang, X.; Zhao, Z.; Tsiligkaridis, T.; Zitnik, M. Self-Supervised Contrastive Pre-Training For Time Series via Time-Frequency Consistency. arXiv 2022, arXiv:2206.08496. [Google Scholar] [CrossRef]
- Johansen, A.; Sornette, D. Shocks, Crashes and Bubbles in Financial Markets. Bruss. Econ. Rev. 2010, 53, 201–253. [Google Scholar]
- Filimonov, V.; Sornette, D. Power Law Scaling and ’Dragon-Kings’ in Distributions of Intraday Financial Drawdowns. Chaos Solitons Fractals 2015, 74, 27–45. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar]
- Oord, A.v.d.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. arXiv 2016, arXiv:1609.03499. [Google Scholar]
- Iwana, B.K.; Uchida, S. An empirical survey of data augmentation for time series classification with neural networks. PLoS ONE 2021, 16, e0254841. [Google Scholar] [CrossRef] [PubMed]
- Nan, M.; Trascau, M.; Florea, A.M.; Iacob, C.C. Comparison between Recurrent Networks and Temporal Convolutional Networks Approaches for Skeleton-Based Action Recognition. Sensors 2021, 21, 2051. [Google Scholar] [CrossRef] [PubMed]
- Lara-Benítez, P.; Carranza-García, M.; Luna-Romera, J.M.; Riquelme, J.C. Temporal Convolutional Networks Applied to Energy-Related Time Series Forecasting. Appl. Sci. 2020, 10, 2322. [Google Scholar] [CrossRef]
- Hewage, P.R.P.G.; Behera, A.; Trovati, M.; Pereira, E.; Ghahremani, M.; Palmieri, F.; Liu, Y. Temporal convolutional neural (TCN) network for an effective weather forecasting using time-series data from the local weather station. Soft Comput. 2020, 24, 16453–16482. [Google Scholar] [CrossRef]
- Gopali, S.; Abri, F.; Siami-Namini, S. A Comparison of TCN and LSTM Models in Detecting Anomalies in Time Series Data. In Proceedings of the 2021 IEEE International Conference on Big Data (Big Data), Orlando, FL, USA, 15–18 December 2021; pp. 2415–2420. [Google Scholar] [CrossRef]
- Sohn, K. Improved Deep Metric Learning with Multi-Class N-Pair Loss Objective; Curran Associates, Inc.: Red Hook, NY, USA, 2016; Volume 29, pp. 1857–1865. [Google Scholar]
- Rashid, K.M.; Louis, J. Times-series data augmentation and deep learning for construction equipment activity recognition. Adv. Eng. Inform. 2019, 42, 100944. [Google Scholar] [CrossRef]
- Abadi, M.; Barham, P.; Chen, J.; Chen, Z.; Davis, A.; Dean, J.; Devin, M.; Ghemawat, S.; Irving, G.; Isard, M.; et al. TensorFlow: A system for large-scale machine learning. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, Savannah, GA, USA, 2–4 November 2016; pp. 265–283. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980. [Google Scholar]
- Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
- Vapnik, V. Statistical Learning Theory; Wiley-Interscience: Hoboken, NJ, USA, 1998. [Google Scholar]
- Friedman, J.H. Greedy function approximation: A gradient boosting machine. Ann. Stat. 2001, 29, 1189–1232. [Google Scholar] [CrossRef]
- Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the KDD’16: The 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar] [CrossRef]
- Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef]
- Teräsvirta, T.; Lin, C.F.; Granger, C.W.J. Power of the neural network linearity test. J. Time Ser. Anal. 1993, 14, 209–220. [Google Scholar] [CrossRef]
| Paper | Framework | Data | Task |
|---|---|---|---|
| Mohsenvand et al., 2020 [23] | SimCLR | non-financial | Classification |
| Deldari et al., 2021 [28] | CPC | non-financial | Change point detection |
| Hou et al., 2021 [29] | CPC | CSI-300, CSI-800, NASDAQ100 | Stock Trend Prediction |
| Wang et al., 2021 [30] | CPC | ACL18, KDD17 | Stock Trend Prediction |
| Eldele et al., 2022 [25] | | non-financial | Classification |
| Poppelbaum et al., 2022 [27] | SimCLR | non-financial | Classification |
| Feng et al., 2022 [31] | SimCLR | CSI-500 | Stock Trend Prediction |
| Yue et al., 2022 [26] | | non-financial | Classification |
| Zhan et al., 2022 [32] | SimCLR | ACL18, KDD17 | Stock Trend Prediction |
| Zhang et al., 2022 [33] | | non-financial | Classification |
Peak Day | Peak Price | End Day | End Price | Drop | Duration |
---|---|---|---|---|---|
2014-11-12 | 423.6 | 2015-01-14 | 178.1 | 58.0% | 63 |
2015-03-11 | 296.4 | 2015-04-14 | 219.2 | 26.1% | 34 |
2015-07-12 | 310.9 | 2015-08-24 | 210.5 | 32.3% | 43 |
2015-12-15 | 465.3 | 2016-01-15 | 364.3 | 21.7% | 31 |
2016-06-16 | 766.3 | 2016-08-02 | 547.5 | 28.6% | 47 |
2017-01-04 | 1154.7 | 2017-01-11 | 777.8 | 32.6% | 7 |
2017-03-03 | 1275.0 | 2017-03-24 | 937.5 | 26.5% | 21 |
2017-06-11 | 2958.1 | 2017-07-16 | 1929.8 | 34.8% | 35 |
2017-09-01 | 4892.0 | 2017-09-14 | 3154.9 | 35.5% | 13 |
2017-12-16 | 19,497.4 | 2018-02-05 | 6955.3 | 64.3% | 51 |
2018-03-05 | 11,573.3 | 2018-04-06 | 6636.3 | 42.7% | 32 |
2018-05-05 | 9858.2 | 2018-06-28 | 5903.4 | 40.1% | 54 |
2018-07-24 | 8424.3 | 2018-08-10 | 6184.7 | 26.6% | 17 |
2019-06-26 | 13,016.2 | 2019-07-16 | 9477.6 | 27.2% | 20 |
2019-10-27 | 9551.7 | 2019-12-17 | 6640.5 | 30.5% | 51 |
2020-02-12 | 10,326.1 | 2020-03-12 | 4970.8 | 51.9% | 29 |
2021-04-13 | 63,503.5 | 2021-04-25 | 49,004.3 | 22.8% | 12 |
2021-09-06 | 52,633.5 | 2021-09-21 | 40,693.7 | 22.7% | 15 |
2021-11-08 | 67,566.8 | 2022-01-22 | 35,030.3 | 48.2% | 75 |
2022-03-29 | 47,465.7 | 2022-06-18 | 19,017.6 | 59.9% | 81 |
2022-08-13 | 24,424.1 | 2022-09-06 | 18,837.7 | 22.9% | 24 |
| | Training | Test |
|---|---|---|
| period | 2014-10-30∼2021-02-18 | 2021-02-19∼2022-09-17 |
| # windows | 2304 | 576 |
| # label = 1 | 224 | 70 |
Category | #Features | Features |
---|---|---|
General | 5 | median, min, max, skewness, kurtosis |
ACF | 6 | acf1, acf10, dacf1, dacf10, d2acf1, d2acf10 |
PACF | 3 | pacf5, dpacf5, d2pacf5 |
STL | 6 | trend, spike, linearity, curvature, stl1, stl10 |
Other | 7 | nonlinearity, entropy, lumpiness, stability, max_level_shift, max_var_shift, max_kl_shift |
| Model | Parameters | Values |
|---|---|---|
| RF | n_estimators | 100, 150, 200, 250 |
| | max_depth | 5, 6, 7, 8 |
| SVM | C | 0.1, 1, 10, 100 |
| | gamma | 0.001, 0.01, 0.1, 1 |
| | kernel | sigmoid, rbf, poly |
| GBM | learning_rate | 0.01, 0.03, 0.1, 0.3 |
| | n_estimators | 100, 150, 200, 250 |
| | subsample | 0.6, 0.8, 1 |
| | max_depth | 5, 6, 7, 8 |
| XGB | learning_rate | 0.01, 0.03, 0.1, 0.3 |
| | n_estimators | 100, 150, 200, 250 |
| | subsample | 0.6, 0.8, 1 |
| | max_depth | 5, 6, 7, 8 |
Item | Notation | Search Space | Filtering Condition |
---|---|---|---|
3 nonlinear parameters | m | ||
Number of oscillations | — | ||
Damping | — | ||
Relative error | — |
| Augmentation | F2 | GM | BA | FM |
|---|---|---|---|---|
| None | | | | |
| Jittering | | | | |
| Scaling | | | | |
| Magnitude-warping | | | | |
| Time-warping | | | | |
| Crop-and-resize | | | | |
| Smoothing | | | | |
| Augmentation | F2 | GM | BA | FM |
|---|---|---|---|---|
| None, None | | | | |
| Jittering, Time-warping | | | | |
| Jittering, Magnitude-warping | | | | |
| Time-warping, Magnitude-warping | | | | |
| Augmentation | F2 | GM | BA | FM |
|---|---|---|---|---|
| None, None | | | | |
| Jittering, Time-warping | | | | |
| Jittering, Magnitude-warping | | | | |
| Time-warping, Magnitude-warping | | | | |
| | F2 | GM | BA | FM |
|---|---|---|---|---|
| RF | | | | |
| SVM | | | | |
| GBM | | | | |
| XGB | 0.46 ± 0.004 | 0.66 ± 0.003 | 0.66 ± 0.003 | 0.37 ± 0.003 |
| LSTM | | | | |
| TCN | 0.50 ± 0.07 | 0.68 ± 0.04 | 0.69 ± 0.05 | 0.40 ± 0.06 |
| CL-LSTM | | | | |
| CL-TCN | 0.54 ± 0.06 | 0.71 ± 0.04 | 0.72 ± 0.04 | 0.44 ± 0.04 |
| | Simple Average | Weighted Average | Hard Vote | Logistic | CL-TCN |
|---|---|---|---|---|---|
| F2 | 0.534 | 0.495 | 0.516 | 0.311 | 0.535 |
| GM | 0.710 | 0.687 | 0.700 | 0.538 | 0.711 |
| BA | 0.715 | 0.687 | 0.700 | 0.575 | 0.722 |
| FM | 0.430 | 0.400 | 0.416 | 0.265 | 0.436 |
| | LPPLS-Indicator | CL-Indicator |
|---|---|---|
| F2 | 0.075 | 0.423 |
| GM | 0.254 | 0.634 |
| BA | 0.487 | 0.642 |
| FM | 0.081 | 0.349 |
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Liu, Z.; Shu, M.; Zhu, W. Contrastive Learning Framework for Bitcoin Crash Prediction. Stats 2024, 7, 402-433. https://doi.org/10.3390/stats7020025