Article
Peer-Review Record

Contrastive Learning Framework for Bitcoin Crash Prediction

Stats 2024, 7(2), 402-433; https://doi.org/10.3390/stats7020025
by Zhaoyan Liu 1,*, Min Shu 2 and Wei Zhu 1
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 14 January 2024 / Revised: 15 April 2024 / Accepted: 30 April 2024 / Published: 8 May 2024

Round 1

Reviewer 1 Report

Comments and Suggestions for Authors

This submission adapts contrastive learning to the prediction of large future drawdowns with daily Bitcoin data.

The epsilon drawdown measure is not adequately described.

While the methodological merits from a DL point of view are clear, they are dubious at best regarding financial time series prediction. In my experience, the DL literature is way over-optimistic about its successes in that domain as it often blissfully ignores how not to overfit.

I find it hard to believe that the authors can train such a complex network with so little data. Data augmentation, given the amount of noise in financial data, is probably useless. In addition, it is really weird to use raw prices to train a network, as the base currency (and the scale of asset prices) is arbitrary.

Unless the authors share the code, I cannot as a referee check that this study has no look-ahead bias. In fact, given the discussion on hyperoptimisation, it is highly likely that overfitting is present.

All the loss measures are reported with 3 significant digits, but no confidence intervals (a common problem in this field). Statistics tells us that this makes them hard or impossible to interpret.

Finally, why stop at a single currency?

 

Comments on the Quality of English Language

Random typos, including missing capitals at the beginning of sentences.

Author Response

Dear reviewer,

We would like to express our sincere gratitude to you for your valuable feedback and comments on our manuscript. We have carefully revised the manuscript based on your comments. Please find the detailed responses below.

 

Comment 1: “The epsilon drawdown measure is not adequately described.”

We have revised this subsection to describe the epsilon drawdown measure more clearly; a simplified sketch of the general idea is given below.
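For context, here is a minimal sketch of the general epsilon-drawdown idea (a decline phase ends only when a counter-movement exceeds a threshold epsilon). This is an illustration under our own simplifying assumptions, not the exact procedure in the revised subsection:

```python
import numpy as np

def epsilon_drawdowns(prices, eps):
    """Simplified epsilon-drawdown scan: a drawdown keeps running until
    the log-price recovers by more than eps above its running minimum;
    each detected phase is returned as (start_index, end_index, depth)."""
    log_p = np.log(np.asarray(prices, dtype=float))
    phases, start, run_min = [], 0, log_p[0]
    for t in range(1, len(log_p)):
        if log_p[t] <= run_min:
            run_min = log_p[t]                      # the drawdown deepens
        elif log_p[t] - run_min > eps:              # counter-move larger than epsilon
            if log_p[start] > run_min:              # keep only genuine declines
                phases.append((start, t, log_p[start] - run_min))
            start, run_min = t, log_p[t]            # open a new phase
    return phases
```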

 

Comment 2: “While the methodological merits from a DL point of view are clear, they are dubious at best regarding financial time series prediction. In my experience, the DL literature is way over-optimistic about its successes in that domain as it often blissfully ignores how not to overfit.”

Overfitting is always a concern when training and evaluating ML/DL models. We applied several methods to reduce its impact so that we can fairly evaluate model performance:

1) We tried a time-series cross-validation process. Similar to standard CV, we split the dataset into multiple folds in time order, iterated the training/testing process, and averaged the metrics (the split diagram is shown below; see also the code sketch after this list). However, this method failed in our experiments because drawdowns are rare, which makes the final binary target label highly imbalanced. The early folds contain even more imbalanced data, exposing them further to overfitting. Thus, we used the full dataset for a single train/test split.

(Figure from: https://medium.com/@soumyachess1496/cross-validation-in-time-series-566ae4981ce4)

2) We adopted a sensitivity analysis in Section 4.5. It helps identify overfitting indirectly by observing how the model's performance changes in response to variations in the training data or model parameters. According to the results, the model performs similarly across different sets of parameter inputs.

3) In model evaluation, we not only tested single models (Section 4.6, pages 17–19) but also compared ensemble models (Section 4.7, pages 19–21) built from high-performing single models. Ensembling can be an effective way to mitigate overfitting. We tested several ensembling schemes, and all of them outperform single models without CL, except the logistic ensemble that also incorporates the LPPLS model.
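As a minimal sketch of the time-ordered splitting described in point 1 (using scikit-learn's TimeSeriesSplit; the placeholder data and label generation below are illustrative, not our dataset), the early folds end up with little training data and very few positive labels:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(100, dtype=float).reshape(-1, 1)                 # placeholder features in time order
y = (np.random.default_rng(0).random(100) < 0.05).astype(int)  # rare positives, like drawdowns

for fold, (tr, te) in enumerate(TimeSeriesSplit(n_splits=5).split(X)):
    # early folds: small training sets with few (possibly zero) positive labels
    print(f"fold {fold}: train={len(tr)}, positives={y[tr].sum()}")
```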

 

Comment 3: “I find it hard to believe that the authors can train such a complex network with so little data. Data augmentation, given the amount of noise in financial data, is probably useless. In addition, it is really weird to use raw prices to train a network, as the base currency (and the scale of asset prices) is arbitrary.”

The Contrastive Learning network works as a pre-trained data encoder, which reduces the data dimension by extracting latent representations and filtering out noise from the original time series. For a limited amount of data, such dimension reduction can help the downstream classifier achieve better performance. In addition, when training the encoder, as shown in the illustration of batch construction (Figure 2 on page 7), although the direct input has only K time-series sequences, K augmented sequences are generated, and each original sequence is contrasted against the other 2K-1 sequences so that the encoder learns patterns and converts them to representations. Thus, it is roughly equivalent to expanding the data size from K to K(2K-1); a sketch of this batch construction is given below.
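To make the K versus K(2K-1) counting concrete, here is a minimal SimCLR-style sketch (function and variable names are hypothetical; the paper's exact loss may differ) in which each of the 2K embeddings in a batch is contrasted against the remaining 2K-1:

```python
import torch
import torch.nn.functional as F

def nt_xent(z_orig, z_aug, temperature=0.5):
    """NT-Xent-style contrastive loss over K original and K augmented
    embeddings; every embedding is compared against the other 2K-1."""
    K = z_orig.shape[0]
    z = F.normalize(torch.cat([z_orig, z_aug], dim=0), dim=1)   # (2K, d)
    sim = z @ z.T / temperature                                 # pairwise similarities
    sim.fill_diagonal_(float('-inf'))                           # drop self-similarity
    # positive pairs: original i <-> augmented i (offset by K)
    targets = torch.cat([torch.arange(K) + K, torch.arange(K)])
    return F.cross_entropy(sim, targets)
```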

Data augmentation is only applied when training the encoder, as a way to make it harder for the encoder to learn the trends/patterns in the time series; the augmented data is not used in a supervised manner. Also, as shown in Section 4.4 (pages 14–16), we compared the encoder's performance with and without data augmentation; some augmentation methods improved performance, and only those are applied in the final framework (two typical augmentations are sketched below).
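For illustration, here are two augmentations commonly used for time series; these two are assumptions for the sketch, and the paper's actual set is detailed in Section 4.4:

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter(x, sigma=0.03):
    # add small Gaussian noise to every time step
    return x + rng.normal(0.0, sigma, size=x.shape)

def magnitude_scale(x, sigma=0.1):
    # multiply the whole sequence by one random factor
    return x * rng.normal(1.0, sigma)
```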

The original data used to train the models is the adjusted closing price. The CL/DL networks contain normalization layers (Figures 3–4 on pages 7–8), which adaptively scale and shift the input within the network itself during training, so no pre-scaling is applied. For the ML models, a Robust Scaler is applied, which removes the median and scales the data according to the quantile range, as sketched below.
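A minimal sketch of the Robust Scaler step for the ML models (the price values here are made up):

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

prices = np.array([[9500.0], [10200.0], [30000.0], [48000.0], [61000.0]])
scaler = RobustScaler()        # removes the median, scales by the 25th-75th quantile range by default
scaled = scaler.fit_transform(prices)
```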

 

Comment 4: “Unless the authors share the code, I cannot as a referee check that this study has no look-ahead bias. In fact, given the discussion on hyperoptimisation, it is highly likely that overfitting is present.”

We are still reorganizing the code and will upload it to GitHub for sharing later. As shown in Section 4.1 on page 11, the training and test data are split at the very beginning, and the input to the training process is always the training set; a minimal illustration of this chronological split follows.
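A minimal sketch of a chronological split that rules out look-ahead (the 80/20 ratio and placeholder series are illustrative, not necessarily the paper's):

```python
import numpy as np

series = np.arange(1000, dtype=float)     # placeholder time-ordered observations
cut = int(len(series) * 0.8)              # illustrative 80/20 split point
train, test = series[:cut], series[cut:]  # test data lies strictly after training data
```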

 

Comment 5: “All the loss measures are reported with 3 significant digits, but no confidence intervals (a common problem in this field). Statistics tells us that this makes them hard or impossible to interpret.”

We reported only the averaged values of the evaluation metrics because this presents the performance differences between models more intuitively.

 

Comment 6: “Finally, why stop at a single currency?”

One important part of this study is to compare the performance of CL/DL/ML models with the LPPLS model, building on the authors' previous study (https://arxiv.org/abs/1905.09647) on predicting Bitcoin bubbles. In addition, we think Bitcoin is an ideal starting point for the current study: its price is much more volatile, which makes the target label much less imbalanced and sparse than for other assets in a classification setting. In future work, we will extend to more currencies in the financial market and try to find generic models.

 

Thank you once again for your time and insightful comments!

Reviewer 2 Report

Comments and Suggestions for Authors

See the report attached.

Comments for author File: Comments.pdf

Author Response

Dear reviewer,

We would like to express our sincere gratitude to you for your valuable feedback and comments on our manuscript. We have carefully revised the manuscript based on your comments. Please find the detailed responses below.

Comment 1: "My major concern is whether the authors will be able to prove, at least using a simulation-based model, that their CL-TCN outperforms the other models for different types of augmented time series."

Regarding the potential applicability of our model to different types of time series data: this is indeed an important aspect in evaluating the robustness and generalizability of our framework. Currently, our study focuses on modeling financial time series, so we did not include data from other domains. However, we reference past works that model other types of time series with Contrastive Learning (CL) in Section 2.3, and some of them demonstrate that CL can outperform other models in non-financial fields.

Comment 2: "can the authors explained why their method still not showing good performance using F2 and FM? "

Our results demonstrate that our CL-TCN framework consistently outperformed other models in terms of the F2 and FM metrics, as evidenced in Section 4.6, especially the results in Table 10. We are not clear what specific aspects of our results led to this comment, as they indicate otherwise. It may be a misunderstanding, so we kindly request clarification from the reviewer, and we want to make sure that none of the results could mislead the reader.

Comment 3: "The loss function (9) should be rewritten in the right format. "

We have corrected it.

Comment 4: "The moving size 5 is not respected in the indices of Xi in page 328. "

The manuscript has no page 328, so we assume the reviewer is referring to line 328; we have corrected the indices of Xi in that line.

Comment 5: "Mistake in the title. "

We checked the title but did not notice any typos or mistakes. We kindly ask for clarification on what mistake the reviewer is indicating.

Comment 6: "Appendix should not be part of the paper and can be added as an extra document. "

We have removed the appendix from the manuscript.

 

Thank you once again for your time and insightful comments!

Round 2

Reviewer 1 Report

Comments and Suggestions for Authors

The reply of the authors is not satisfactory. I think that this submission suffers from a typical ML bias that underestimates by far the amount of noise in financial data. I maintain that these results are impossible given how little data there is to train a complex neural network, and that the authors are not aware of that. Note that I certainly wish to be proven wrong. Making the code available will certainly help decide either way.

Contrastive learning (and any kind of learning) requires data that is not too noisy, which is not at all the case of price time series. Data augmentation is just noise shifting here. There is nothing magic about any kind of neural architecture when applied to price prediction (and I have 15 years of practical experience with real money).

Finally, I suggest that the authors really spellcheck their paper. The first word of the title has a typo.

I thus recommend a major revision that must include three improvements:

1. Shuffle the index of price returns, compute a new time series, and apply your method, 50 times. Show that the results obtained with the original time series are significantly different from those of the shuffled time series.

2. Make your code available.

3. Report confidence intervals. They really are useful to assess the value of any statistics. Digits are meaningless without c.i.

 

Comments on the Quality of English Language

See review.

 

Author Response

Dear reviewer:

We are grateful for your constructive comments and suggestions, which have greatly contributed to the improvement of our manuscript. We have added the experiment and results you suggested, and carefully revised the manuscript based on your comments. Please find the detailed responses below.

Comment 1: “Shuffle the index of price returns, compute a new time series, and apply your method, 50 times. Show that the results obtained with the original time series are significantly different from those of the shuffled time series.”

As the reviewer suggested, we ran the shuffling experiment, and the results are shown in Section 4 of the Appendix, although due to time limitations we conducted only 10 runs. In addition, to assess the performance of our proposed framework on a larger dataset, we added an experiment applying the CL model and the other baseline models to Bitcoin hourly price data. The data is introduced and the results are shown in Section 1 of the Appendix. A sketch of the shuffling procedure follows.
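A minimal sketch of the shuffling procedure the reviewer describes (shuffle the returns, rebuild a surrogate price series; the toy price path and variable names are illustrative):

```python
import numpy as np

def surrogate_prices(prices, rng):
    """Permute the log returns and rebuild a price path: the marginal
    return distribution is preserved, the temporal structure is destroyed."""
    log_ret = np.diff(np.log(prices))
    shuffled = rng.permutation(log_ret)
    return prices[0] * np.exp(np.concatenate(([0.0], np.cumsum(shuffled))))

rng = np.random.default_rng(0)
prices = np.cumsum(rng.normal(10, 50, size=500)) + 20000.0       # toy positive price path
surrogates = [surrogate_prices(prices, rng) for _ in range(10)]  # 10 runs, as in the response
```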

 

Comment 2: “Make your code available.”

We have made our code public on GitHub: https://github.com/Zhaoyan030/Contrastive-Learning-Framework-for-Bitcoin-Crash-Prediction

 

Comment 3: “Report confidence intervals. They really are useful to assess the value of any statistics. Digits are meaningless without c.i.”

We added standard deviations for all evaluation metrics (Tables 7–10); a minimal example of this reporting is sketched below.
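For clarity, a minimal sketch of how a metric is summarized across repeated runs (the run values below are made up):

```python
import numpy as np

runs = np.array([0.712, 0.698, 0.705, 0.721, 0.693])  # hypothetical metric over 5 repeated runs
print(f"{runs.mean():.3f} ({runs.std(ddof=1):.3f})")  # mean with sample standard deviation
```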

 

Thank you once again for your time and invaluable feedback.

Reviewer 2 Report

Comments and Suggestions for Authors

Dear authors,

Thank you for trying to answer my previous comments to improve the paper. First of all, in the title, it should be Contrastive and not Constrsative.

I have the following remarks that need to be addressed in order to make the paper acceptable:

1. I am still convinced that we can't make such decisions about the preferred model unless some model-model simulation is applied in this context. Also, I am convinced that it will not take that much time to simulate at least the 2 models you compared at the end and show some consistency for the mixed model in the two conditions. 

2. When we say significantly improve the accuracy, an inference analysis should be provided, which is missing in this paper. 

3. What if you increased the sample size? Is it possible to keep the difference in accuracy, in percentage, stable?

4. Is there any explanation why GM and BA are given the same accuracy for the 3 augmentation combinations (see Table 9)? 

5. Again, any explanation why the metrics F2 (0.535) and FM (0.436) produce much lower efficiencies than BA (0.711) and GM (0.722)? 

6. I think you should share the code to check the results and perhaps we can have our answers to the question.

 

Comments on the Quality of English Language

No comment. 

Author Response

Dear reviewer:

We are grateful for your constructive comments and suggestions, which have greatly contributed to the improvement of our manuscript. We have carefully revised the manuscript based on your comments and made the detailed responses below.

Comment 1: “I am still convinced that we can't make such decisions about the preferred model unless some model-model simulation is applied in this context. Also, I am convinced that it will not take that much time to simulate at least the 2 models you compared at the end and show some consistency for the mixed model in the two conditions. “

We searched online but are still not very clear about what a “model-model simulation” is; we would appreciate further clarification. Instead, we conducted a permutation test by shuffling the original Bitcoin price time series multiple times to generate surrogate price sequences and then applying our proposed framework to those new sequences. The results can be found in Section 4 of the Appendix.

 

Comment 2: “When we say significantly improve the accuracy, an inference analysis should be provided, which is missing in this paper. “

We added standard deviations for all evaluation metrics (Tables 7–10).

 

Comment 3: “What if you increased the sample size? Is it possible to keep the difference in accuracy, in percentage, stable?”

To increase the sample size, we used Bitcoin hourly price data and added an experiment applying the CL model and the other baseline models to this new dataset. The dataset is introduced and the performance results are shown in Section 1 of the Appendix.

 

Comment 4: “Is there any explanation why GM and BA are given the same accuracy for the 3 augmentation combinations (see Table 9)? “

There was a mistake when generating the table; it is now corrected.

 

Comment 5: “Again, any explanation why the metrics F2 (0.535) and FM (0.436) produce much lower efficiencies than BA (0.711) and GM (0.722)? “

The differences in performance metrics can be attributed to their sensitivity to different aspects of model predictions and to the underlying class distribution. F2 and FM are particularly sensitive to class imbalance: they give more weight to correctly classifying the minority (positive) class than the majority (negative) class. Since our dataset has a class imbalance, F2 and FM can be lower than metrics like BA and GM; the toy example below illustrates this.
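As a toy illustration of this effect (the labels and predictions below are made up), F2 and the Fowlkes-Mallows index come out lower than BA and GM on an imbalanced sample:

```python
import numpy as np
from sklearn.metrics import fbeta_score, precision_score, recall_score, balanced_accuracy_score

y_true = np.array([0] * 90 + [1] * 10)                     # 10% positive class
y_pred = np.array([0] * 85 + [1] * 5 + [0] * 4 + [1] * 6)  # TN=85, FP=5, FN=4, TP=6

p, r = precision_score(y_true, y_pred), recall_score(y_true, y_pred)
f2 = fbeta_score(y_true, y_pred, beta=2)  # recall-weighted F-score
fm = np.sqrt(p * r)                       # Fowlkes-Mallows index
ba = balanced_accuracy_score(y_true, y_pred)
gm = np.sqrt(r * (85 / 90))               # geometric mean of sensitivity and specificity
print(f"F2={f2:.3f} FM={fm:.3f} BA={ba:.3f} GM={gm:.3f}")
# F2 and FM depend on precision, which the many negatives drag down; BA and GM do not.
```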

 

Comment 6: “I think you should share the code to check the results and perhaps we can have our answers to the question.”

We have made our code public on GitHub: https://github.com/Zhaoyan030/Contrastive-Learning-Framework-for-Bitcoin-Crash-Prediction

 

Thank you for your time and invaluable feedback!

Round 3

Reviewer 2 Report

Comments and Suggestions for Authors

Dear Authors,

In my first and second reviews, I meant a simulation-based model. That is, you should define the models under different parameters and conditions and show, through simulation, that the proposed models are much better than the existing ones. Anyway, you should remove the ± sign from all tables and keep only the standard deviation in parentheses.

Comments on the Quality of English Language

No comment
