Contrastive Learning Framework for Bitcoin Crash Prediction

Due to spectacular gains during periods of rapid price increase and unpredictably large drops, Bitcoin has become a popular emergent asset class over the past few years. In this paper, we are interested in predicting crashes in the Bitcoin market. To tackle this task, we propose a framework for deep learning time series classification based on contrastive learning. The proposed framework is evaluated against six machine learning (ML) and deep learning (DL) baseline models, and outperforms them by 15.8% in balanced accuracy. Thus, we conclude that the contrastive learning strategy significantly enhances the model's ability to extract informative representations, and that our proposed framework performs well in predicting Bitcoin crashes.


Introduction
Digital coins known as cryptocurrencies have gained popularity over the past ten years due to their enormous return potential during periods of rapid price increase and unanticipated sharp falls. In contrast to centralized digital money and central banking systems, cryptocurrencies are decentralized using a distributed ledger technology known as blockchain, which serves as a public record of financial transactions. The 2008 global financial crisis and the 2010-2013 European sovereign debt crisis were caused by failures of governments and central banks, which led to a surge of interest in cryptocurrencies from investors. The most well-known cryptocurrency is Bitcoin, and many private investors' "fear of losing out" may have contributed to its price climb to about USD 20,000 per coin in December 2017. This raises the question of whether the price behavior of Bitcoin has the characteristics of financial bubbles. Hence, there is a rapidly expanding body of research on Bitcoin bubbles [1-5], but the topic of bubble behavior is still far from being fully understood.
In the last few years, Deep Learning (DL) has strongly emerged as the best performing predictor class within the Machine Learning (ML) field across various implementation areas. Financial time series forecasting is no exception, and as such, an increasing number of prediction models based on various DL techniques have been introduced in the appropriate conferences and journals in recent years. Sezer et al. [6] systematically reviewed the literature on financial forecasting with deep learning during the period 2005-2019 and concluded that models based on recurrent neural networks (RNNs), such as Long Short-Term Memory (LSTM) and gated recurrent unit (GRU) networks, are the most commonly accepted models, because they dominate in price and trend predictions and are easily adapted to a variety of forecasting problems. However, DL models generally require large-scale labeled datasets to achieve such remarkable performance. It is thus very challenging to apply them to time series data: time series generally lack human-recognizable patterns, which makes them much harder to label than images, and consequently little time series data has been labeled for practical applications.
To overcome the challenge of limited labeled data, self-supervised learning has recently gained attention as a way to extract effective representations for downstream tasks. Compared with models trained on fully labeled data (i.e., supervised models), self-supervised pre-trained models can achieve comparable performance with limited labeled data [7]. Various self-supervised approaches rely on different pretext tasks [8] to train the models and learn representations from unlabeled data. A common pretext task for time series analysis is to train an autoencoder for reconstruction and use the final hidden representation as input for the downstream task of classification or regression, as shown in [9]. As a self-supervised training technique, Contrastive Learning (CL) has recently shown powerful representation learning capability because of its ability to learn invariant representations from augmented data [7,10,11]. However, most of the works and applications of contrastive learning are in the domain of computer vision, and few are applied to time series data, especially financial data.
In this work, we propose a framework for time series classification based on contrastive learning via SimCLR [7]. In order to extract better representations for time series data, we make some modifications: our framework employs simple yet efficient data augmentations that are suitable for time series data, and we replace the original encoder with a temporal convolutional network (TCN). For the downstream crash detection task, we employ the Epsilon Drawdown Method [12,13] to label the data, and make predictions by using the extracted representations as input and a multilayer perceptron (MLP) as classifier. We test the proposed framework on Bitcoin daily price data, and compare its performance with various ML/DL baseline models. The main contributions of this work are as follows:
• We propose a deep learning framework based on contrastive learning for extracting latent representations of time series data and tackling the downstream time series classification task.
• We apply the proposed framework to predict crashes of Bitcoin. To the best of our knowledge, we are the first to employ contrastive learning in analyzing cryptocurrencies.
• We investigate the performance of multiple time series augmentation methods in a contrastive learning based model, and the effect of some augmentation parameters.
• We compare our proposed framework against six state-of-the-art classification models, including both ML and DL models.
• We explore the efficacy of ensemble techniques in an effort to improve the overall classification performance.
• We compare and combine the CL framework with the Log Periodic Power Law Singularity (LPPLS) model, which is also unprecedented, to the best of our knowledge.
The paper is structured as follows. Section 2 reviews several research areas closely related to this paper, providing background information and highlighting Contrastive Learning-based models and their extensions on time series data. Section 3 presents the architecture of our framework and the detailed methodologies and procedures employed. Section 4 provides the data used in the experiments, the model implementation, and the performance evaluation. Section 5 concludes the paper by summarizing the key findings and contributions of our study.

Related Work and Background
In this section, we discuss several research areas that are closely related to our paper, covering not only the background of the fields but also recent works that have made great achievements. These include the commonly used tests for detecting financial bubbles, contrastive learning based DL models, and their extensions to time series data.

Financial Crashes
Crashes in financial markets are of extreme importance, as they may have a great impact, directly or indirectly, on the lives and livelihoods of most people all over the world. There have been seven distinct crashes in the U.S. financial markets over the past 30 years, including the 1987 stock market crash, the collapse of Long-Term Capital Management following the Russian debt crisis and the subsequent market crash of 1998, the bursting of the dotcom bubble during the period 1999-2001, the financial meltdown following the subprime mortgage crisis from 2007 to 2009, and the 2020 stock market crash. In addition, over the past 30 years there have been approximately 100 financial crises worldwide [14]. Each burst bubble resulted in permanent impairment of wealth, at least for some investors [15]. Because of their great impact, researchers and economists have studied financial bubbles for many decades, investigating the underlying mechanisms and looking for better ways to predict them.
In the broadest sense, financial bubbles are generally defined as the abnormal accelerating ascent of an asset price above the fundamental value of the asset [16]. When a bubble bursts, i.e., a crash occurs, investors with very little experience of managing risks usually hold the asset in the late phase and are thus damaged excessively by the bubble [17]. This intuitive description is, however, quite hard to conceptualize and full of traps, as it requires the implicit definition of both 'abnormal price growth' and 'crash'. To measure abnormal price increases, a reference frame or process against which deviations can be gauged must be defined. However, when employing such a reference process, a bubble may be incorrectly diagnosed due to an inaccurate underlying benchmark model, an issue that makes the diagnosis of a bubble a joint-hypothesis problem. Similarly, a crash is also not easy to define. It can be vaguely described as a large loss over some relatively short duration that seems abnormal compared to regular asset price movements.

Contrastive Learning
The basic concept for contrastive learning was proposed in [18,19], which looked at the comparison of distinct samples without labels. There have been a number of studies since then using a similar principle, which have been surveyed in [20]. In recent years, contrastive learning has risen to prominence as a technique for learning representations from augmented data. Specifically, it uses a set of training instances made up of positive sample pairs (samples that are similar in some way) and negative sample pairs (samples that are different). Within the embedding space, a representation is learned to bring positive sample pairs closer together while also separating negative sample pairs. For instance, MoCo [10] utilized a momentum encoder to learn representations of negative pairs obtained from a memory bank. SimCLR [7] replaced the momentum encoder by using a larger batch of negative pairs. Also, BYOL [11] learned representations by bootstrapping representations even without using negative samples. Last, SimSiam [21] supported the idea of neglecting the negative samples, and relied only on a Siamese network and a stop-gradient operation to achieve state-of-the-art performance. While all of these methods have enhanced representation learning for visual data, they may not function as well on time series data with unique traits such as temporal dependency.

Contrastive Learning for Time Series
Representation learning for time series is becoming increasingly popular, and a few works have recently leveraged contrastive learning for time series data. For example, the CPC model [22] learned representations by predicting the future in latent space and showed great advances in various speech recognition tasks. Also, Ref. [23] designed electroencephalogram (EEG) related augmentations and extended the SimCLR model to EEG data. In addition, other works [24-28] applied contrastive learning to improve the performance of deep learning models on time series forecasting, classification, or change point detection tasks. They are summarized in Table 1.
From Table 1, we can see that very few studies have applied contrastive learning to financial time series data. One of the first is [29], which exploited the different efficacy of heterogeneous multigranularity information to construct non-end-to-end multigranularity models for stock price prediction, which can adaptively fuse multigranularity features at each time step and maximize the mutual information between local segments and their global contexts (Context-Instance). Wang et al. [30] applied a copula-based CPC architecture to learn a better stock representation with less uncertainty by considering hierarchical couplings from the macro-level to the sector- and micro-level, and used the proposed model for stock movement prediction. Feng et al. [31] utilized contrastive learning to exploit the correlation between intra-day data and enhance stock representations in order to improve the accuracy of stock movement prediction. However, to the best of our knowledge, no study has taken advantage of CL to predict bubbles and crashes in asset prices. Forecasting financial bubbles and crashes has always been a tough but attractive task for researchers in machine learning and data mining. We believe it is worth investigating CL on various challenging tasks, including bubble prediction, because it will give researchers a deeper understanding and a wider view of this emerging technique, thus providing chances to improve it and make better use of it.

Methods
In this section, we define the problem we aim to address in machine learning terms, outline the methodology for generating the target labels, and provide a detailed overview of the proposed framework.

Problem Definition
For forecasting crashes, we aim to predict whether there will be a large crash in the next few days. In the machine learning field, this can be categorized as a binary classification problem. Formally, the prediction model learns a function ŷ_t = F_θ(X_t), which maps a sequence of historical prices to the label space. Specifically, for current time t, X_t = [x_{t−T+1}, ..., x_t] ∈ R^T represents the sequence of prices over the past T time steps. The target label y_t has two classes: there is a crash in the next M days, or there is not. To generate the target labels, we apply the Epsilon Drawdown Method to identify crashes in the entire time series.
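The windowing and labeling step described above can be sketched as follows. This is an illustrative helper, not the authors' code: `make_windows` is a hypothetical name, and `peak_flags` is assumed to be a 0/1 array marking crash-peak days produced by some crash-identification step such as the Epsilon Drawdown Method.

```python
import numpy as np

def make_windows(prices, peak_flags, T=30, M=14):
    """Slice a price series into lag windows X_t and binary labels y_t.

    peak_flags[t] == 1 marks a crash-peak day; y_t = 1 if any peak
    falls within the next M days after time t.
    """
    prices = np.asarray(prices, dtype=float)
    peak_flags = np.asarray(peak_flags)
    X, y = [], []
    # t must leave T-1 lags behind it and M days of look-ahead in front
    for t in range(T - 1, len(prices) - M):
        X.append(prices[t - T + 1 : t + 1])
        y.append(int(peak_flags[t + 1 : t + 1 + M].any()))
    return np.array(X), np.array(y)
```

With T = 30 and M = 14 as in the paper, each sample is a 30-day price window and its label says whether a crash peak occurs in the following 14 days.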

The Epsilon Drawdown Method
The Epsilon Drawdown Method is a peak detection algorithm developed by Johansen and Sornette [12,13] and further used in [34,35]. Its main goal is to systematically segment a price trajectory into a series of alternating and consecutive price drawup and drawdown phases that are subsequently translated into bubbles and crashes. A drawdown (drawup) is defined as a succession of negative (positive) returns ((p_t − p_{t−1})/p_{t−1}) that may only be interrupted by a positive (negative) return which is larger than the pre-specified tolerance level ϵ. Consequently, the start time of a drawdown is called the peak time: the moment when a succession of positive returns is followed by a negative return whose amplitude exceeds ϵ. The parameter ϵ controls the degree to which counter-movements within a drawup or drawdown phase are tolerated.
Let the parameter ϵ be a function of time t, ϵ(t) = ε₀ σ_t(ω), where σ_t(ω) denotes the standard deviation of returns over the past ω days from time t and ε₀ is a constant multiplier. Thus, the counter-movement tolerance changes dynamically over time, instead of being fixed. Once ε₀ and ω are given, the set of start times of drawdowns (peak times) in the time series data can be obtained.
To collect a robust set of peak times, it is better to run the algorithm with numerous pairs of (ε₀, ω) and then select the time points that are most frequently identified as peak times across the different parameter pairs. The process can be expressed as follows. Given a set of parameter pairs {(ε₀, ω)_i | i = 1, ..., N_ε}, for each pair (ε₀, ω)_i a set of peak times in the time series (of length T) is obtained by the algorithm and denoted as P_i = {t | I_{i,t} = 1}, where I_{i,t} is an indicator function equal to 1 if time t is identified as a peak time under (ε₀, ω)_i and 0 otherwise. The frequency of time t being selected as a peak time over all pairs of parameters is N_t = Σ_{i=1}^{N_ε} I_{i,t}, and the fraction of occurrences of time t with respect to the total number of trials is p_t = N_t / N_ε. This fraction is finally used to filter peaks.
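A minimal sketch of the core peak-detection loop for a single (ε₀, ω) pair may look as follows. The function name and the simplified phase-switching rule are illustrative assumptions, not the authors' exact implementation; in particular, the full method also handles the symmetric drawdown-to-drawup transition and the multi-pair voting described above.

```python
import numpy as np

def epsilon_drawdown_peaks(prices, eps0=2.0, omega=30):
    """Simplified Epsilon Drawdown peak detection for one (eps0, omega) pair.

    A drawup ends (a peak is recorded at the running price maximum) when a
    negative return exceeds eps_t = eps0 * std(returns over past omega days)
    in magnitude; a new drawup starts when a positive return exceeds eps_t.
    """
    prices = np.asarray(prices, dtype=float)
    r = np.diff(prices) / prices[:-1]          # simple daily returns
    peaks = []
    in_drawup, high_idx = True, 0
    for t in range(omega, len(r)):
        eps_t = eps0 * r[t - omega:t].std()    # time-varying tolerance
        if in_drawup:
            if prices[t] > prices[high_idx]:
                high_idx = t                   # track the running maximum
            if r[t] < -eps_t:                  # tolerance exceeded: peak
                peaks.append(high_idx)
                in_drawup = False
        elif r[t] > eps_t:                     # drawdown ends, new drawup
            in_drawup = True
            high_idx = t
    return peaks
```

On a synthetic ramp-up followed by a persistent decline, the sketch flags the turning point as the single peak.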

Model Architecture
We adopt an approach similar to the SimCLR model [7] to extract compact latent representations that maximize the similarity between positive pairs; however, there are variations because we deal with time series data instead of images:
1. for sampling, we apply augmentations that are not only commonly used for images but also reasonable for time series sequences, maintaining the overall trend of the sequence so as not to change the original input too much;
2. instead of a ResNet-50 [36] architecture, we use a temporal convolutional network (TCN) [37], which is more suitable for time series data;
3. we retain the projection head for the downstream task.
The architecture of our approach is depicted in Figure 1. It starts from an input time series window of length T, denoted as X_i = [x_i, ..., x_{i+T−1}] ∈ R^T, from which a positive pair is composed of two equal-length sequences (I_h and I_f, as shown in Figure 1) that are transformed by augmentation methods from the same original window. The augmentation methods we take into consideration are transformation functions commonly used for time series [38], including jittering, scaling, magnitude-warping, time-warping, crop-and-resize and Gaussian smoothing. Details of these methods are given in Section 3.4. Thus, the two augmented sequences of the input window X_i can be denoted as h_i = Aug_h(X_i) and f_i = Aug_f(X_i), where Aug_h and Aug_f denote the two augmentation methods; therefore, h_i and f_i form a positive pair of samples. In the training process, each batch of samples contains K randomly selected positive pairs of windows {(h_i, f_i) | i = 1, 2, ..., K}. To construct negative pairs for the batch, for each anchor sample h_i we treat the remaining 2K − 2 windows as its negative samples, so all the negative pairs can be written as {(h_i, v) | v ∈ B \ {h_i, f_i}}, where B denotes the set of all 2K augmented windows in the batch. The intuition behind this sampling procedure is that time series are typically non-stationary; hence, samples augmented from temporally separated windows are likely to have lower statistical dependence than those from the same window. Figure 2 visualizes the construction of a batch in our model. A red two-way arrow represents a positive pair of samples, and a blue two-way arrow represents a negative pair of samples.
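The batch construction above can be sketched as follows. `build_pair_batch` is a hypothetical helper: it samples K windows, produces the 2K augmented views, and records each view's positive partner, with every other within-batch pairing implicitly negative.

```python
import numpy as np

def build_pair_batch(windows, aug_h, aug_f, K, rng=None):
    """Sample K windows and build the 2K augmented views of one batch.

    Views [0..K) come from aug_h, views [K..2K) from aug_f; (i, i + K)
    are the positive pairs, and every other pairing within the batch is
    treated as a negative pair.
    """
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(windows), size=K, replace=False)
    h = np.stack([aug_h(windows[i]) for i in idx])
    f = np.stack([aug_f(windows[i]) for i in idx])
    views = np.concatenate([h, f])              # shape (2K, T)
    pos = np.concatenate([np.arange(K, 2 * K),  # positive partner index
                          np.arange(0, K)])     # of each of the 2K views
    return views, pos
```

Keeping the positive-partner index array alongside the views makes the contrastive loss straightforward to evaluate later.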

Given that it has been shown in several works [39-42] that TCN can typically outperform the Recurrent Neural Network (RNN)-based Long Short-Term Memory (LSTM) on temporal data across a vast range of tasks and is generally easier and faster to train, we employ a TCN as the encoder to compress time windows into embedding representations.
Figure 3 illustrates the encoder architecture we use. It consists of two stacks of TCN (we acknowledge the main illustration and TCN implementation: https://github.com/philipperemy/keras-tcn, accessed on 13 January 2024), each containing 4 residual blocks with respective dilation rates of 1, 2, 4 and 8. Each residual block consists of two dilated causal convolutions with 64 kernel filters of size 4. Batch normalization and the Rectified Linear Unit (ReLU) activation function are applied for every convolution layer. According to this architecture, the shape of the input data is (N, T, c), where N is the number of windows, T is the length of a window, and c denotes the number of channels. Through the encoder, the intermediate data has shape (N, T, 64): the number of channels is increased from 1 to 64, because each dilated convolution layer has 64 filters, and each filter generates a single output tensor. In order to match the output shape of the encoder to the input of the following projection head, a flatten layer is added between the two parts. The projection head is used to reduce the encoder's output dimension to the prespecified size of the desired representation (code_size). To achieve this, the projection head is set up as a simple three-layer multilayer perceptron (MLP) with ReLU activation functions and output shapes of T, T/2 and code_size, respectively. The architecture of the projection head is shown in Figure 4.
After passing the augmented tensors through the encoder and the projection head, each positive pair (h_i, f_i) is converted to its final representation (h̃_i, f̃_i), and their similarity is calculated by cosine similarity, sim(h̃_i, f̃_i) = h̃_i · f̃_i / (∥h̃_i∥∥f̃_i∥). To maximize the similarity of positive pairs and simultaneously minimize the similarity of negative pairs, the Normalized Temperature-scaled Cross Entropy Loss (NT-Xent) [43] is used in the contrastive learning process. For an anchor h̃_i, the loss is ℓ_i = −log[exp(sim(h̃_i, f̃_i)/τ) / Σ_{v ≠ h̃_i} exp(sim(h̃_i, v)/τ)], where the sum runs over the other 2K − 1 representations in the batch and τ is a temperature parameter; the final loss is the average of ℓ_i over all anchors. After training the CL model, the downstream classification task proceeds. We use a simple linear layer with the sigmoid activation function as the downstream classifier, which is trained in a fully supervised manner. Specifically, the raw time series windows in the training set are input into the pre-trained CL model, which outputs the representations of the windows; these representations and the window labels are then used to train the classifier.
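The NT-Xent loss can be written compactly in NumPy. This is a reference sketch of the standard loss from [43] (views ordered so that rows i and i + K form a positive pair, matching the batch layout above), not the authors' TensorFlow code.

```python
import numpy as np

def nt_xent_loss(z, tau=0.1):
    """NT-Xent loss over 2K projected views, rows (i, i+K) positive pairs.

    Each anchor's loss is the negative log-softmax of its positive
    partner's cosine similarity against all 2K-1 other views, scaled by
    the temperature tau.
    """
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # L2-normalise rows
    sim = z @ z.T / tau                                # cosine / tau
    n = z.shape[0]
    np.fill_diagonal(sim, -np.inf)                     # drop self-similarity
    K = n // 2
    pos = np.concatenate([np.arange(K, n), np.arange(0, K)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(n), pos].mean()
```

When positive pairs are identical and pairs are mutually orthogonal, the loss approaches zero; misaligned pairs drive it up.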

Data Augmentation
The performance of various data augmentation methods for time series has been investigated in several works [25,27,38,44], but not all of them make sense for financial time series data. Therefore, we selected six different methods that are both commonly used and reasonable for our data. An example of the six augmentations is shown in Figure 5.
Jittering applies independent noise to each data point. To implement it, we add to each window of the time series a window-length vector ϵ with random values independently sampled from a normal distribution N(0, σ²). The default value of the standard deviation σ is set to 5. This yields X̃ = X + ϵ, where X represents the original window of data and X̃ the augmented data. Scaling can be considered as multiplying each window by a factor λ, which is the absolute value of a random number sampled from a normal distribution N(1, σ²). We set 0.6 as the default value of the standard deviation σ. This can be written as X̃ = λX, where X represents the original window of data and X̃ the augmented data.
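The two simplest augmentations can be sketched in a few lines; the function names and the `rng` seeding argument are illustrative conveniences, and the defaults follow the values stated above.

```python
import numpy as np

def jitter(x, sigma=5.0, rng=None):
    """Jittering: add i.i.d. Gaussian noise N(0, sigma^2) to each point."""
    rng = np.random.default_rng(rng)
    return x + rng.normal(0.0, sigma, size=np.shape(x))

def scale(x, sigma=0.6, rng=None):
    """Scaling: multiply the window by |lambda|, lambda ~ N(1, sigma^2)."""
    rng = np.random.default_rng(rng)
    return x * abs(rng.normal(1.0, sigma))
```

Note that jittering perturbs each point independently, while scaling applies one common factor to the whole window.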
Magnitude-warping also scales data along the y-axis, but instead of applying a uniform factor, it multiplies different data points in a window by different factors. In addition, magnitude-warping requires the factors to be smoothly varying. Specifically, we take k + 2 time points {0, (T − 1)/(k + 1), 2(T − 1)/(k + 1), ..., k(T − 1)/(k + 1), T − 1} to divide the window of length T into k + 1 equal parts, and randomly draw k + 2 values {y_0, y_1, ..., y_{k+1}} from a normal distribution N(1, σ²). Through the k + 2 knots {(i(T − 1)/(k + 1), y_i) | i = 0, 1, ..., k + 1}, a cubic spline S can be fitted, as shown in Figure 6. The scaling factor corresponding to each time point is read off the curve, and the data are then multiplied by these factors. The whole process can be expressed as X̃ = λ ∘ X, λ = [S(0), S(1), S(2), ..., S(T − 1)] ∈ R^T (11), where X represents the original window of data, X̃ the augmented data, and S is the cubic spline. The default value of k is set to 10, and σ is set to 0.6. Time-warping is similar to magnitude-warping, but it scales data points along the time direction (x-axis). Specifically, we first generate a vector of factors λ = [λ_0, λ_1, ..., λ_{T−1}] in the same way as for magnitude-warping, then calculate the scaled cumulative sums of the factors to obtain a vector λ′ = [sλ_0, s(λ_0 + λ_1), s(λ_0 + λ_1 + λ_2), ..., s(λ_0 + λ_1 + ... + λ_{T−1})], where s = (T − 1)/(λ_0 + λ_1 + ... + λ_{T−1}). Using λ′ (as timestamps) and the original prices, new (horizontally scaled) data points are generated, as shown in red in Figure 7. The final augmented prices are evaluated by applying linear interpolation at the original timestamps [0, 1, ..., T − 1]. This procedure is formulated as X̃ = [L(0), L(1), ..., L(T − 1)], where X̃ represents the augmented data and L is the linear interpolation function formed by the scaled timestamps λ′ and the original data X = [x_0, x_1, ..., x_{T−1}]. The crop-and-resize method cuts the original window to a shorter one and then restores its original length by interpolating new data points. To implement this, we first linearly interpolate between every pair of adjacent data points in the original window of length T, resulting in T − 1 intermediate values and a new window of length 2T − 1. Then we randomly select a sub-sequence of length T from the new window and use it as the augmented data. This yields X̃ = [w_i, w_{i+1}, ..., w_{i+T−1}], i ∼ U(0, T − 1), where X̃ represents the augmented data and w_i indicates the ith data point in the new window.
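The two warping augmentations can be sketched with SciPy's cubic spline. This is an illustrative implementation under stated assumptions: the helper names are hypothetical, and the clipping guard in `time_warp` (which the description does not mention) is added so that the warped timestamps stay strictly increasing, as linear interpolation requires a monotone axis.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def _spline_factors(T, k, sigma, rng):
    """Cubic spline through k+2 equally spaced knots with values ~ N(1, sigma^2)."""
    knots_t = np.linspace(0, T - 1, k + 2)
    knots_y = rng.normal(1.0, sigma, size=k + 2)
    return CubicSpline(knots_t, knots_y)(np.arange(T))

def magnitude_warp(x, k=10, sigma=0.6, rng=None):
    """Multiply each point by a smoothly varying spline factor."""
    rng = np.random.default_rng(rng)
    return x * _spline_factors(len(x), k, sigma, rng)

def time_warp(x, k=10, sigma=0.6, rng=None):
    """Warp the time axis: cumulative-sum the factors, rescale so the last
    timestamp equals T-1, then linearly interpolate back onto 0..T-1."""
    rng = np.random.default_rng(rng)
    T = len(x)
    lam = _spline_factors(T, k, sigma, rng)
    lam = np.clip(lam, 1e-3, None)          # guard: keep timestamps increasing
    warped_t = np.cumsum(lam)
    warped_t *= (T - 1) / warped_t[-1]      # the scale factor s
    return np.interp(np.arange(T), warped_t, x)
```

A constant series is invariant under time-warping, which is a convenient sanity check.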
Smoothing is a method to remove noise and reveal trends in the data. For each point in a window, we compute the Gaussian kernel values for the point and its two neighbors on each side, and then take the weighted average of those data points. The process can be described by X̃ = [z_0, z_1, ..., z_{T−1}], where X̃ represents the augmented data and z_i is the Gaussian-kernel weighted average of the points x_{i−2}, ..., x_{i+2} of the original window.
When calculating z's, if there are not enough neighbors at left (or right), then we extend the input by replicating the existing leftmost (or rightmost) neighbor.
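This smoothing, including the edge replication, can be sketched as a short convolution; the function name and the kernel width parameter `sigma` are illustrative assumptions.

```python
import numpy as np

def gaussian_smooth(x, sigma=1.0):
    """Gaussian smoothing over each point and its two neighbours per side;
    window edges are padded by replicating the end values."""
    w = np.exp(-0.5 * (np.arange(-2, 3) / sigma) ** 2)
    w /= w.sum()                        # normalise kernel weights
    padded = np.pad(x, 2, mode="edge")  # replicate leftmost/rightmost values
    return np.convolve(padded, w, mode="valid")
```

Because the kernel is symmetric and normalized, interior points of a linear trend pass through unchanged, while noise is averaged out.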

Experiments
In this section, we provide the details of data collection and model implementation, evaluate our proposed model on Bitcoin bubble forecasting, and compare its performance with other machine learning and deep learning models.

Data
The data contain Bitcoin daily closing prices covering the period from 17 September 2014 to 17 September 2022, comprising 2923 time points in total, downloaded from Yahoo Finance. The target labels of drawdowns are determined by applying the Epsilon Drawdown Method with the expression ϵ = ϵ₀σ(ω), where ϵ₀ controls the number of standard deviations up to which counter-movements are tolerated, and ω determines the time scale on which the standard deviation is estimated. We perform a grid search over a predefined search space of (ϵ₀, ω)-pairs, scanning ϵ₀ with a step size of 0.1 from 0.1 to 5 and ω with a step size of 5 days from 10 to 60 days. Thus, there are in total 550 grid points. For each (ϵ₀, ω)-pair, a set of peak dates t_p is recorded. After obtaining the 550 sets of peak dates, we count the number of times each date was identified as a peak day, and divide the count by 550 to obtain a fraction p_t. By filtering peaks with p_t > 0.65 and drop percentage > 20%, we obtain a total of 21 peaks, summarized in Table 2. The historical Bitcoin prices of the whole selected period are visualized in Figures A1-A4 in Appendix A. We set the size of each window to T = 30 and the prediction horizon to M = 14, and split the data by time, forming the training set (80%) and the test set (20%). The details of the two sets are listed in Table 3.

Contrastive Model Setup
For our experiments, we use the TensorFlow [45] deep learning framework. During the training of the CL models, we employ a sliding window with a size of 30 and a moving step of 5. This yields 460 windows for training the CL encoder, which can be denoted as {X_i}, i ∈ {1, 6, 11, ..., 2296}, where X_i = [x_i, ..., x_{i+30−1}]. The dimension of the output space of the projection head is specified as code_size = 12. The training process uses a batch size of 16, 200 epochs, and the ADAM optimizer [46] with a learning rate of 1 × 10⁻⁴. For the contrastive loss calculation, the temperature parameter τ is set to 0.1. The downstream classifier is a linear layer with the sigmoid activation function, taking the representations extracted from the CL encoder as input. To train the classifier, all windows in the training set are used, with a batch size of 32, 200 epochs, and class weights of {0:0.2, 1:0.8}. Models are trained 5 times with different random seeds, and the averaged evaluation metrics are reported. Hardware configuration: CPU of 2.6 GHz 6-Core Intel Core i7, RAM of 16 GB.
During the experiments, we calculate several metrics to evaluate the performance of the models on the binary classification task; these are the following. Balanced Accuracy: The balanced accuracy (BA) is the average of sensitivity and specificity, BA = (sensitivity + specificity)/2. It is a performance metric that takes into equal consideration the accuracy obtained by the evaluated model on both the majority class and the minority class. F2-score: The F-score measures the accuracy of a model using recall and precision. The F2-score weights recall higher than precision, which is suitable for situations where the tolerance for false negatives is low. It is defined as F2 = 5 · precision · recall / (4 · precision + recall). G-mean: The geometric mean (GM) is the square root of the product of sensitivity and specificity. It measures the balance between classification performance on the majority and minority classes. It holds that GM = √(sensitivity · specificity) (17). Under this metric, poor performance in predicting the positive cases leads to a low G-mean value, even if the negative cases are correctly classified by the evaluated algorithm. FM: The Fowlkes-Mallows (FM) index is the square root of the product of precision and recall, FM = √(precision · recall). It measures the similarity between the predicted clustering and the true clustering. The value range of FM is [0, 1], and a higher value means better classification.
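The four metrics can be computed directly from the confusion-matrix counts; `crash_metrics` is a hypothetical helper name, and the sketch assumes both classes are present (no zero-division guards).

```python
import numpy as np

def crash_metrics(y_true, y_pred):
    """Balanced accuracy, F2, G-mean and Fowlkes-Mallows for binary labels."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sens = tp / (tp + fn)               # sensitivity / recall
    spec = tn / (tn + fp)               # specificity
    prec = tp / (tp + fp)               # precision
    return {
        "balanced_accuracy": (sens + spec) / 2,
        "f2": 5 * prec * sens / (4 * prec + sens),
        "g_mean": np.sqrt(sens * spec),
        "fm": np.sqrt(prec * sens),
    }
```

A perfect classifier scores 1.0 on all four metrics, while missing positives pulls every metric down.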

Baseline Models
The proposed contrastive model is compared with the following three categories of models:

•
Machine learning models include Random Forests (RF) [47], Support Vector Machine (SVM) [48], Gradient Boosting Machine [49] and XGBoost [50]. We used the scikit-learn and XGBoost Python libraries to implement these models. The models are trained using features extracted from each window, instead of the raw time series sequence. Specifically, for each window, the average/standard deviation (std) of the daily price (p_i)/return (r_i = (p_i − p_{i−1})/p_{i−1})/volatility (v_i = std([r_{i−9}, r_{i−8}, ..., r_{i−1}, r_i])) over the last 5 days, 10 days, 15 days, ..., 30 days are calculated. In addition, the tsfeatures Python library is applied to extract more advanced time series properties (see Table 4). In the training process, grid search and 5-fold time-split cross-validation are employed. The grid-search configuration parameters for each classification model are shown in Table 5.
• Deep learning models contain LSTM [51] and the same TCN architecture used for contrastive learning but trained in a fully supervised manner. Both baseline models take the same input as the downstream classifier of the proposed contrastive learning framework. The LSTM model has two layers of an LSTM-ReLU-BatchNorm combination, followed by the same projection head as the CL model and a dense layer with sigmoid activation function. For the TCN baseline model, two stacks of TCN followed by the projection head and a dense-layer classifier are trained to directly predict the two desired classes for the windows. We use a batch size of 32, 200 epochs and class weights of {0:0.2, 1:0.8} to train the two DL models. The TensorFlow Python library is also employed to implement these models.

•
Contrastive learning model uses the same architecture as the proposed framework but replaces the TCN encoder with an LSTM, which consists of two layers of an LSTM-ReLU-BatchNorm combination. We denote this competitor as CL-LSTM. Similarly, the encoder is trained in a self-supervised manner, using the same input and the same experimental setup as CL-TCN.
• LPPLS Confidence Indicator is constructed by using shrinking windows whose length decreases from 360 days to 30 days. For each fixed end point t₂, we shift the start point t₁ in steps of 30 days, hence 12 fittings in total are generated. To estimate parameters and filter the fittings, we apply the search space and filtering conditions given in Table 6.
Table 6. Search space and filter conditions for the qualification of valid LPPLS fits used in this study.

Augmentation Comparison and Analysis
First, to investigate how the values of the augmentation parameters affect model performance, we conduct a hyperparameter sweep over a parameter space for four augmentation approaches: jittering, scaling, magnitude-warping and time-warping. For jittering and scaling, we test their scale parameter σ, while for magnitude-warping and time-warping, both the knot parameter k and the scale parameter σ are investigated. The results are visualized as bar plots in Figures A5-A10 in Appendix A. We find that the performance of most augmentation approaches is noticeably affected by adjusting the parameters within a certain range. Therefore, in the following experiments, each augmentation method is applied with the parameters that have the best performance across all the evaluation metrics. Specifically, for jittering, σ = 1.0; for scaling, σ = 1.2; for time-warping, σ = 0.6, knot = 8; and for magnitude-warping, σ = 0.6, knot = 10.
Then, we conduct a one-head experiment: throughout the whole training loop, we apply one fixed augmentation on one head while the other head executes an identity function, and compare model performance across augmentations in order to identify the most promising augmentations from the introduced function pool while excluding the least promising ones. The results are summarized via the evaluation metrics in Table 7. As a result, augmentations like crop-and-resize and smoothing can be discarded, because they perform worse than the baseline under all metrics and thus appear to hinder rather than improve the training process. Scaling performs worse when evaluated by G-mean, so we do not consider it in later experiments either. In contrast, time-warping, jittering and magnitude-warping appear to be the front-runners, yielding significant increases in all measured metrics compared to the baseline without augmentation. Furthermore, considering only the three best methods from the one-head experiment, we test the effect of a two-head augmentation strategy. Specifically, we select two augmentation methods and apply one on the first head and the other on the second head, with default augmentation parameters. The results are shown in Table 8. We find that only the combination of jittering and magnitude-warping achieves significant increases in all evaluation metrics compared to applying no augmentation at all, while the other two combinations show no strong effect on model performance. Compared with the results of the one-head strategy (Table 7), jittering clearly improves when combined with other augmentations, but both time-warping and magnitude-warping perform worse in combination than alone. To verify whether using only one augmented head is better than applying augmentations on two heads, we run experiments for augmentation combinations of the three front-runners. Specifically, we apply one method followed by another on the first head and identity on the second head. The performance of the different combinations is summarized in Table 9. It shows that the combination of time-warping and magnitude-warping far outperforms not only the baseline but the other combinations as well. Comparing with the two-head experiment (Table 8), we find that time-warping tends to perform better in the one-head setting.
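The two head strategies compared above can be expressed as simple compositions; the function names here are illustrative.

```python
def two_head(x, aug_a, aug_b):
    """Two-head strategy: a different augmentation on each head."""
    return aug_a(x), aug_b(x)

def one_head_combo(x, aug_a, aug_b):
    """One-head combination: aug_a followed by aug_b on the first head,
    identity on the second head."""
    return aug_b(aug_a(x)), x

# e.g. pairing time-warping with magnitude-warping on one head:
# view1, view2 = one_head_combo(window, time_warp, magnitude_warp)
```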

Fine-Tuning and Sensitivity Analysis
To investigate the sensitivity of the proposed model, we conduct additional experiments by adjusting the following parameters:

•
Window size is the length of the original time series input. Different window sizes contain different information, so the window should be large enough to contain the most useful information for prediction but not so long that the model struggles to capture the key properties of the sequence. We select window sizes of 30, 90, 180 and 360 days for the experiment.

•
Batch size specifies the number of positive pairs (K) of augmented windows that are processed before the model is updated, so in our case there are 2K training samples in each batch. In the experiment, the batch size is chosen to be 16, 32, 64 or 128.

•
Code size indicates the length of the embedding vector (the representation) extracted from the encoder network. The code size is tested in the range from 8 to 24.
Figure 8 visualizes the performance of the proposed framework (with respect to BA) across the different parameter settings. It demonstrates that, averaged over code size and batch size, the model performs similarly across window sizes, though performance tends to be slightly more variable when training with smaller windows. As expected, 30 days is an effective window length, and overly long windows may make it difficult for the model to extract additional predictive information. For batch sizes in {16, 32, 64, 128}, the proposed model shows robust performance when averaged over code size and window size (Figure 9). With window sizes of 30 and 90, the models with batch size 16 clearly outperform those with other batch sizes. We believe this occurs because a small batch size makes it less likely that a sample is similar to other samples in the same batch, which enables the model to better distinguish negative pairs. As the window size increases, the probability that a negative pair of samples is very similar decreases, so changing the batch size no longer has a large impact on performance. As shown in Figure 10, at window sizes of 30, 90 and 180, the model performance, averaged over batch size, reaches the best BA at a code size of 12, while for the longest window of 360 days, a larger code size is needed to encode enough information in the representations.

Baseline Comparison
To demonstrate the effectiveness of our model, we compare its performance to the seven baseline ML/DL models. Table 10 gives an overview of the comparison results. For each metric, the highest value is highlighted in bold, the second highest is underlined, and the third highest is in italics. Our proposed framework achieves the best performance on all selected evaluation metrics: the higher BA and GM indicate better accuracy on both the majority class (non-bubble periods) and the minority class (bubble periods), while the larger F2-score and FM index imply that it is less likely to incorrectly predict bubbles as non-bubbles. In addition, TCN and XGB, the second- and third-best models, outperform the others by a clear margin. Visualizations of the eight models' predictions on the test set are shown in Appendix A, Figures A11-A18.
The comparison between LSTM and TCN, as well as between CL-LSTM and CL-TCN, shows that TCN outperforms LSTM at extracting effective representations for time series. This result aligns with the conclusions made in [39-42] and provides strong support for our choice of TCN rather than LSTM as the encoder.
The pre-training loss and similarity trajectories of CL-TCN and CL-LSTM are shown in Figures 11 and 12. The TCN encoder learns to distinguish similar samples from dissimilar ones within very few epochs, keeping the similarity between positive pairs extremely high while pushing the similarity between negative pairs close to zero. In contrast, over the first 50 epochs of CL-LSTM, the epoch-wise average similarity of positive pairs stays close to that of negative pairs, indicating that CL-LSTM has difficulty distinguishing samples. Although CL-LSTM eventually pushes the similarity of negative pairs close to zero, the similarity of positive pairs remains too low.
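For reference, the SimCLR-style NT-Xent objective that drives these similarity trajectories can be sketched in NumPy. This is a simplified illustration (with an assumed temperature of 0.5), not the TensorFlow training code.

```python
import numpy as np

def nt_xent(z, temperature=0.5):
    """SimCLR-style NT-Xent loss for a batch of 2K embeddings where
    rows 2i and 2i+1 are the two augmented views of window i."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)   # cosine similarity
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                     # exclude self-similarity
    pos = np.arange(len(z)) ^ 1                        # partner index: (0,1), (2,3), ...
    log_prob = sim[np.arange(len(z)), pos] - np.log(np.exp(sim).sum(axis=1))
    return float(-log_prob.mean())
```

Minimizing this loss raises the similarity of positive pairs relative to all negatives in the batch, which is exactly the separation visible for CL-TCN and lacking for CL-LSTM.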

Ensemble Model
Due to the inherently challenging nature of forecasting market crisis events at the global scale, it is to be expected that all postulated models are rather weak learners. Some may perform better under specific observed patterns, others under different ones, but we do not expect any single model to best capture all the existing latent patterns. Hence, exploiting forecast combinations that assign different weights to each of the obtained predictions is expected to boost the overall attainable predictive performance.
Specifically, ensemble modeling is the act of executing two or more related but distinct analytical models and then combining the results into a single score or spread. Empirically, ensembles tend to yield better results when there is significant diversity among the models. For the ML/DL models, we combine the three best performers (CL-TCN, TCN, XGB) into an ensemble model in three ways: hard vote, soft vote with simple average, and soft vote with weighted average. Since the LPPLS indicator cannot be regarded as a predicted probability, logistic regression is conducted with the indicator as an input feature. Also, to construct an ensemble CL that is more comparable to the LPPLS indicator, we follow the construction logic of the LPPLS indicator to implement a CL-indicator.

Hard Vote
In hard voting (also known as majority voting), every individual classifier votes for a class, and the majority wins. Statistically speaking, the ensemble's predicted target label is the mode of the distribution of the individual predictions. In this case, the three selected base models generate binary predictions separately. If at least two of the three models infer a positive output, the ensemble output is marked as positive; otherwise it is marked as negative.
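The hard-voting rule can be sketched as follows (a minimal illustration; the function name is ours):

```python
import numpy as np

def hard_vote(preds):
    """Majority vote over binary predictions of shape (n_models, n_samples):
    the ensemble outputs 1 when a strict majority of models vote 1."""
    preds = np.asarray(preds)
    majority = preds.shape[0] // 2 + 1          # 2 of 3 models in our case
    return (preds.sum(axis=0) >= majority).astype(int)
```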

•
Soft Vote (Simple Average and Weighted Average)
In soft voting, every individual classifier provides a probability that a specific data point belongs to a particular target class. A simple average strategy is applied first and converted to binary predictions for evaluation. For the weighted average, the evaluation metric value of each base classifier is used to weight and then average the predictions. Since multiple metrics are adopted in the evaluation above (F2, GM, BA, FM), the weights are computed by looping through them; a different weighted-average output is generated for each metric, and the final binary label per metric is gathered.
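Both soft-voting variants can be sketched in one function (an illustrative sketch; the metric-based weights are passed in precomputed):

```python
import numpy as np

def soft_vote(probas, weights=None, threshold=0.5):
    """Soft voting over predicted probabilities of shape (n_models, n_samples).
    weights=None gives the simple average; otherwise a weighted average,
    e.g. weighting each base model by one evaluation metric (F2, GM, BA or FM)."""
    probas = np.asarray(probas, dtype=float)
    if weights is None:
        avg = probas.mean(axis=0)               # simple average
    else:
        w = np.asarray(weights, dtype=float)    # metric value per base model
        avg = (w[:, None] * probas).sum(axis=0) / w.sum()
    return (avg >= threshold).astype(int)
```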

• Logistic Regression
Logistic regression is conducted using the predicted probabilities of the three selected models and the LPPLS indicator as four features, fitted on the training set and then used to predict on the test set.

•
CL Indicator
As in the construction of the LPPLS indicator, we recursively train CL-TCN with windows shrinking from 360 days to 30 days in steps of 30 days, obtaining 12 CL models. For each time point, we use the 12 models to make predictions and then calculate the fraction of models that classify it as class 1.
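The CL-indicator construction reduces to a per-time-point fraction, later binarized at a threshold; a minimal sketch (function names are ours):

```python
import numpy as np

def cl_indicator(model_preds):
    """Fraction of the 12 shrinking-window CL models flagging each time
    point as class 1; `model_preds` has shape (n_models, n_timepoints)."""
    return np.asarray(model_preds, dtype=float).mean(axis=0)

def indicator_to_label(indicator, threshold=0.0):
    """Binary alert: positive whenever the indicator exceeds the threshold."""
    return (np.asarray(indicator) > threshold).astype(int)
```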
The performance comparison between the ensemble models and CL-TCN is shown in Table 11. The Simple Average Soft Vote outperforms the other ensemble models and performs comparably to CL-TCN. Logistic regression shows a significant decrease in all evaluation metrics. The comparison of the LPPLS-indicator and the CL-indicator is summarized in Table 12 and visualized in Figures 13 and 14. In Figures 13 and 14, the indicator results (in green) on the test set are shown in the lower panel, with the corresponding y-axis on the right side of the panel. Since the indicator is 0 at most time points, we set the threshold to 0 for converting it into binary classes. The LPPLS indicator raises alerts for all the drawdowns on the test set without too many false positives, although for some drawdowns the alert is very weak. In comparison, the CL-indicator raises long-lasting alerts for every drawdown and also correctly predicts other small drawdowns that were not selected as true labels.

Conclusions
In conclusion, our study introduces a contrastive learning-based classification framework for predicting financial bubbles and crashes, with a focus on Bitcoin price data. By leveraging SimCLR with modifications, we achieve improvements in prediction performance compared to various ML/DL models. Moreover, we introduce a CL-indicator and compare it with the LPPLS-indicator, which is, to the best of our knowledge, unprecedented. Our results demonstrate the effectiveness of CL in extracting meaningful representations from time series data, paving the way for better understanding and forecasting of financial market dynamics.
The success of our approach suggests its potential applicability to other financial indicators and markets, opening up opportunities for further research in diverse areas such as stock price prediction, regime change detection, and price movement forecasting. Additionally, the scalability of our framework allows for the incorporation of macroeconomic variables, enhancing its predictive power and practical utility in real-world applications.
In future work, we plan to explore the integration of additional features and advanced data augmentation techniques to further improve the performance of our model. Moreover, we aim to extend our analysis to other time series datasets and tasks, while also exploring the use of more sophisticated classifiers to enhance the robustness and generalizability of our framework. Overall, our study contributes to the growing body of research on predictive analytics in finance and highlights the potential of CL as a valuable tool for financial forecasting and risk management.
or the amount of regularity. Lumpiness and stability features measure the variance of the means and the variances on nonoverlapping windows, which provide information on how free a window is of trends, outliers, and shifts. Finally, the last three features, max_level_shift, max_var_shift, and max_kl_shift, denote the largest shifts in mean, variance, and Kullback-Leibler divergence of a window based on overlapping windows, respectively. These features may capture valuable structure in windows with jumps.

Appendix A.3. Shuffle Experiment
For the original Bitcoin daily price time series, we randomly shuffle it to create 10 new time series sequences. Then, we apply the same Epsilon Drawdown Method and grid-search setting introduced in Section 4.1 to obtain target labels for each new sequence. Under the same pre-processing steps and model configurations, we train our model, make predictions and compute the corresponding evaluation metrics. The results obtained with the original time series and with the shuffled time series are summarized below. Since our model relies on the assumption that the sequential order of observations contains valuable predictive information, the degradation in performance on shuffled data suggests that this assumption is valid. It highlights the complexity of the temporal dynamics underlying Bitcoin price movements, and indicates that simply shuffling the order of observations disrupts the underlying patterns and dependencies in the data, making it more challenging for our framework to capture and exploit meaningful information for drawdown prediction.
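Generating the shuffled control series is straightforward (a sketch; the seed and function name are illustrative):

```python
import numpy as np

def shuffled_controls(prices, n_series=10, seed=0):
    """Create shuffled control series: the same observations in random
    temporal order, destroying sequential structure while preserving
    the marginal distribution of the data."""
    rng = np.random.default_rng(seed)
    return [rng.permutation(prices) for _ in range(n_series)]
```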

Appendix A.4. Drawdowns in Bitcoin Hourly Price Data
To make a more comprehensive comparison between our proposed framework and the baseline models, we incorporate hourly prices of Bitcoin from 10 April 2022, 10 a.m. UTC, to 19 September 2023, 11 p.m. UTC, which comprises 12,401 time points. The drawdown target labels are again generated by applying the Epsilon Drawdown Method, scanning ϵ_0 with a step size of 0.1 from 0.1 to 5 and ω with a step size of 12 h from 24 to 168 h. By filtering peaks with p_t > 0.5 and a drop percentage > 2%, we obtain a total of 84 peaks, summarized in Table A2. We set the size of each window to T = 168 (hours) and the prediction horizon to M = 24 (hours), and split the data by time into a training set (80%) and a test set (20%). The details of the two sets are listed in Table A3. We compare the performance of our proposed CL framework to the seven baseline ML/DL models. Table A4 gives an overview of the comparison results.
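The windowing and chronological split above can be sketched as follows. This is an illustrative sketch: the exact alignment of each window to its drawdown label is an assumption (we pair a T-hour window with the label M hours after its end), and label construction itself is done beforehand.

```python
import numpy as np

def make_windows(series, labels, window=168, horizon=24):
    """Slice an hourly series into inputs of length T=168 h, each paired
    with the label M=24 h ahead of the window's end (alignment assumed)."""
    X, y = [], []
    for start in range(len(series) - window - horizon + 1):
        X.append(series[start:start + window])
        y.append(labels[start + window + horizon - 1])
    return np.asarray(X), np.asarray(y)

def time_split(X, y, train_frac=0.8):
    """Chronological split: first 80% for training, last 20% for testing."""
    cut = int(len(X) * train_frac)
    return (X[:cut], y[:cut]), (X[cut:], y[cut:])
```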

Figure 1. Illustration of the proposed model architecture.

Figure 5. Visualization of the used data augmentation methods. In each panel, the original time series data is in orange and the augmented data is in blue.

Figure 6. An example of the knots and the cubic spline used in magnitude-warping.

Figure 7. An example of the horizontally scaled data points used in time-warping.

Figure 8. Experiment results on the sensitivity of window size, batch size and code size.

Figure 9. Effect of batch size across different window sizes.

Figure 10. Effect of code size across different window sizes.

Figure 13. LPPLS confidence indicator result on the test set.

Figure 14. CL indicator result on the test set.

Figure A5. Effect of varying the scale parameter on jittering.

Figure A6. Effect of varying the scale parameter on scaling.

Figure A7. Effect of varying the scale parameter on magnitude-warping.

Figure A8. Effect of varying the knot parameter on magnitude-warping.

Figure A9. Effect of varying the scale parameter on time-warping.

Figure A10. Effect of varying the knot parameter on time-warping.

Figure A11. Visualization of the prediction of CL-TCN on the test set.

Figure A12. Visualization of the prediction of TCN on the test set.

Figure A13. Visualization of the prediction of XGB on the test set.

Figure A14. Visualization of the prediction of GBM on the test set.

Figure A15. Visualization of the prediction of CL-LSTM on the test set.

Figure A16. Visualization of the prediction of RF on the test set.

Figure A17. Visualization of the prediction of LSTM on the test set.

Figure A18. Visualization of the prediction of SVM on the test set.

Table 1. Literature summary of contrastive learning based on time series data.

Table 2. Summary of all selected peaks.

Table 3. Details of the training set and test set.

Table 4. Time series features used in ML models, including those based on sample ACFs, PACFs, and other time series features.

Table 5. Configuration parameters for grid search of ML classifiers.

Table 7. Performance on drawdown prediction of CL with one head.

Table 8. Performance on drawdown prediction of CL with two heads.

Table 9. Performance on drawdown prediction of CL with augmentation combinations on one head.

Table 10. Comparison of model performance on drawdown prediction for daily data.

Table 11. Performance on drawdown prediction of ensemble models, compared with CL-TCN.

Table 12. Performance on drawdown prediction of the LPPLS indicator and the CL-TCN indicator.

Table A1. Comparison of model performance between original Bitcoin daily price data and shuffled data.

Table A2. Summary of all selected peaks.

Table A3. Details of the training set and test set for daily price data.

Table A4. Comparison of model performance on drawdown prediction for hourly data.