1. Introduction
The rapid expansion of digital asset markets has intensified the demand for accurate and reliable cryptocurrency price forecasting models. Unlike traditional financial instruments, cryptocurrencies exhibit extreme volatility, structural breaks, and nonlinear dynamics, which make both short- and long-horizon prediction particularly challenging [1,2]. Their price behavior is influenced not only by market microstructure but also by macroeconomic indicators, global liquidity conditions, investor sentiment, and network-level activity [3]. These characteristics increase forecasting uncertainty and highlight the need for models capable of capturing long-range temporal dependencies, complex feature interactions, and abrupt regime shifts.
In recent years, deep learning algorithms have become the dominant approach for financial time-series modeling due to their ability to capture nonlinear patterns and learn hierarchical representations [4]. Among these methods, recurrent neural networks (RNNs), particularly Long Short-Term Memory (LSTM) models, have been widely applied to cryptocurrency forecasting because of their ability to retain historical dependencies over time [5]. Nevertheless, RNN-based models face inherent limitations related to sequential computation and long-horizon dependency modeling, especially in multivariate settings.
Transformer-based architectures have emerged as a strong alternative owing to their self-attention mechanisms, which enable parallel computation and effective modeling of long-range dependencies [6]. These models have demonstrated strong performance across a variety of sequence modeling tasks, including financial forecasting and related temporal applications [7,8]. This success has encouraged the adaptation of generative transformer models, such as GPT-style decoders, for numerical time-series forecasting, particularly in capturing multivariate contextual relationships [9,10]. In addition, specialized temporal variants—including Temporal Fusion Transformer (TFT), Autoformer, and Informer—have shown effectiveness in modeling both short-term dynamics and long-term trends [7,9].
Despite growing interest, comparative studies that jointly evaluate recurrent and transformer-based architectures for cryptocurrency forecasting remain limited. Most existing research focuses on a single model family or univariate series and often omits rich technical indicator sets. Moreover, the performance of GPT-style decoder-only architectures in multivariate cryptocurrency forecasting has not been systematically investigated. These limitations highlight the need for a unified and consistent evaluation framework.
The novelty of this study lies in its unified and rigorously controlled evaluation framework, which compares both deep learning architectures and classical econometric baselines under strictly identical preprocessing, feature engineering, and training configurations. Unlike prior work, this study establishes a multivariate, technical–indicator–rich environment and performs a comprehensive head-to-head analysis across five high-liquidity cryptocurrencies. The framework simultaneously benchmarks six state-of-the-art deep learning models—LSTM, GPT-2, Informer, Autoformer, Temporal Fusion Transformer (TFT), and the Vanilla Transformer—together with four widely used econometric approaches (ARIMA, VAR, GARCH, and the Random Walk baseline). GPT-2 is systematically adapted for numerical multivariate forecasting, offering insights not previously reported in the literature. Moreover, all models are evaluated using aligned temporal partitions and repeated rolling-window experiments, yielding statistically robust performance estimates. This experimentally balanced setup provides a clearer, fairer, and more generalizable comparison than is available in existing studies.
While the prior literature frequently discusses characteristics such as volatility clustering, nonlinear patterns, or occasional regime-like transitions in cryptocurrency markets, the present study does not perform formal econometric tests to validate these phenomena. Accordingly, any reference to such behaviors in this paper should be interpreted solely as high-level motivation for why flexible sequence models may be useful, rather than as empirically verified properties of the dataset examined here. The comparisons presented in this study focus exclusively on forecasting accuracy obtained under the unified experimental framework.
The present study addresses these limitations by conducting a comprehensive comparison of recurrent and transformer-based deep learning architectures for multivariate cryptocurrency price forecasting. Five high-liquidity cryptocurrencies—Bitcoin (BTC), Ethereum (ETH), Ripple (XRP), Stellar (XLM), and Solana (SOL)—are modeled using an extensive feature set derived from trend-, momentum-, volatility-, and volume-based technical indicators. Six architectures are examined under identical preprocessing, training, and evaluation procedures: LSTM, GPT-2, Informer, Autoformer, Temporal Fusion Transformer (TFT), and the Vanilla Transformer. By employing a unified methodological framework and multiple evaluation metrics, this study provides a detailed assessment of each architecture’s strengths, limitations, and behavior under complex multivariate financial conditions.
To achieve this objective, the study makes the following key contributions:
A unified multivariate forecasting framework integrating numerous technical indicators derived from pandas_ta, enabling a richer and more representative feature space for cryptocurrency prediction.
A systematic comparison of six deep learning architectures, covering both recurrent (LSTM) and transformer-based models (GPT-2, Informer, Autoformer, TFT, and Vanilla Transformer) under identical experimental conditions.
An adaptation of GPT-2 for numerical time-series forecasting, demonstrating its potential beyond natural language processing tasks.
An extensive evaluation of long-range dependency modeling using advanced transformer variants specifically designed for temporal sequences.
A rigorous performance assessment using MSE, MAE, RMSE, MAPE, and R2, providing a multi-perspective view of forecasting accuracy.
A detailed analysis of practical challenges related to data quality, normalization, missing-value handling, hyperparameter sensitivity, and computational complexity, offering guidance for future research.
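The five evaluation metrics listed above have standard closed forms. As a reference sketch (the function name and return structure are ours, not taken from the paper's code), they can be computed as follows:

```python
import numpy as np

def forecast_metrics(y_true, y_pred):
    """Compute MSE, MAE, RMSE, MAPE (%), and R^2 for a forecast."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(mse)
    mape = np.mean(np.abs(err / y_true)) * 100.0   # assumes y_true has no zeros
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"MSE": mse, "MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2}
```

Reporting all five jointly is useful because MAPE is scale-free across assets of very different price levels, while RMSE and R² remain sensitive to large absolute errors.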
The remainder of this paper is organized as follows. Section 2 reviews the related literature on deep-learning-based time-series forecasting and existing approaches in cryptocurrency prediction. Section 3 describes the dataset, technical indicators, preprocessing steps, and the architectures of the evaluated models. Section 4 presents the experimental setup and performance metrics. Section 5 reports and discusses the empirical results. Finally, Section 6 summarizes the conclusions and outlines potential directions for future research.
2. Related Work
Research on cryptocurrency price forecasting has developed significantly as digital asset markets have matured and the availability of high-frequency data has increased. Early studies relied on statistical models such as ARIMA, GARCH, and their extensions [1]. While these approaches were effective in modeling short-term volatility clustering, their linear structure proved insufficient for representing the nonlinear and highly dynamic behavior characteristic of cryptocurrency markets. This limitation motivated researchers to explore more flexible learning frameworks.
Recurrent neural networks, particularly LSTM and GRU architectures, have been applied extensively to cryptocurrency forecasting tasks. Prior studies demonstrated their ability to capture medium- and long-range temporal dependencies in assets such as Bitcoin and Ethereum [11,12]. However, gated recurrent models often face difficulties as sequence lengths increase or when the feature space incorporates a large number of technical indicators, leading to instability and slower convergence [13]. To mitigate these issues, hybrid architectures combining recurrent units with convolutional layers or attention mechanisms have been proposed, with varying degrees of success [14].
Transformer-based deep learning architectures marked a major shift in time-series forecasting. Their self-attention mechanism enables efficient modeling of long-distance temporal relationships while avoiding the sequential bottlenecks inherent in RNNs [6]. Several transformer variants have been specifically tailored for forecasting tasks. Informer introduced ProbSparse attention to reduce computational costs for long sequences [7], while Autoformer incorporated decomposition-based auto-correlation to better capture trend and seasonal components [8]. The Temporal Fusion Transformer (TFT) further extended the transformer paradigm by integrating static covariates, variable selection, and interpretable temporal attention [9]. Additional variants, such as FEDformer [15] and Pyraformer [16], employed frequency-domain decomposition and hierarchical receptive fields to improve scalability and robustness.
As transformers became more established in forecasting, researchers began exploring their applicability to financial time-series analysis, including cryptocurrency trend prediction, multi-horizon forecasting, and cross-asset modeling [17,18]. More recent studies examined multi-market transformer architectures capable of learning shared temporal structures across groups of cryptocurrencies [19]. Other works combined transformer layers with graph neural networks to capture relational dependencies among digital assets, such as co-movement patterns and market-wide contagion effects [20].
In parallel with these developments, generative transformer models—particularly decoder-only architectures inspired by GPT—have been adapted for numerical forecasting tasks. Although originally developed for natural language processing, autoregressive attention mechanisms have been effectively repurposed for multivariate time-series modeling [21]. These models have shown promise in predicting market microstructure dynamics, limit order book sequences, and short-term cryptocurrency movements [22,23]. Nevertheless, studies that directly compare GPT-style models with specialized temporal transformers under standardized experimental conditions remain limited.
Feature engineering continues to play a critical role in cryptocurrency forecasting. Numerous studies have emphasized the contribution of technical indicators—including momentum oscillators, trend averages, volatility measures, and volume-based metrics—to improving predictive performance across both classical and deep learning models [24,25]. More recent work has explored the integration of technical indicators with blockchain-level features, such as hash rate, on-chain volume, and network difficulty, to capture fundamental aspects of cryptocurrency ecosystems [26,27]. However, inconsistent preprocessing pipelines and heterogeneous experimental setups have limited the comparability of results across existing studies.
Recent studies have continued to advance transformer-based and hybrid architectures for cryptocurrency forecasting. Kehinde et al. introduced Helformer, an attention-enhanced forecasting model that demonstrated competitive performance across multiple digital assets [17]. Izadi proposed HSIF, a multimodal transformer architecture that fuses market data with sentiment signals through cross-attention, highlighting the growing interest in integrating heterogeneous information sources for cryptocurrency prediction [28]. Wu et al. presented a comprehensive 2024 review evaluating the implementation quality and empirical performance of deep learning models for cryptocurrency price prediction, emphasizing the need for standardized experimental pipelines [29]. Furthermore, Smyl et al. developed Contextual ES-adRNN, a hybrid exponential-smoothing recurrent architecture that incorporates exogenous variables and has shown strong performance in forecasting highly volatile cryptocurrency series [30]. These recent contributions underscore the rapid evolution of deep learning methods in digital asset forecasting while reinforcing the need for unified and rigorously controlled comparative frameworks—an issue directly addressed by the present study.
In summary, although substantial progress has been made in applying deep learning techniques to cryptocurrency forecasting, several gaps remain: the limited number of studies comparing recurrent, generative, and transformer-based approaches within a unified framework; the underrepresentation of GPT-style architectures in multivariate forecasting tasks; and the lack of systematic evaluations incorporating extensive technical indicator sets. The present study addresses these gaps by providing a harmonized comparison of LSTM, GPT-2, Informer, Autoformer, TFT, and Vanilla Transformer architectures for multivariate cryptocurrency price forecasting.
3. Materials and Methods
This section describes the data acquisition pipeline, feature construction process, preprocessing strategy, model architectures, and training procedures. All forecasting models are trained under a unified experimental protocol to ensure a fair and consistent comparison across architectures.
3.1. Dataset Description
The empirical analysis focuses on five high-liquidity cryptocurrencies that are widely regarded as benchmarks in the digital asset ecosystem: Bitcoin (BTC), Ethereum (ETH), Ripple (XRP), Stellar (XLM), and Solana (SOL). Historical OHLCV (open, high, low, close, volume) data for each asset are retrieved directly from the Binance spot market using the official REST API through the python-binance client. All data are collected via the get_klines endpoint with exchange-generated millisecond timestamps, ensuring strict UTC-based temporal consistency across assets.
The dataset spans a backward-looking window of approximately 15 years from January 2025. Since asset listing dates differ, each time series begins at its earliest available Binance record; however, after temporal alignment, all assets share an identical sequence of weekly timestamps. Raw OHLCV values are initially downloaded at daily resolution and subsequently aggregated into weekly bars. The weekly open is defined as the first daily open, the weekly close is defined as the final daily close, and the high, low, and volume values are aggregated over all days within each week. Weekly aggregation is adopted because it attenuates microstructure noise, reduces the impact of extreme intraday volatility, and produces a smoother temporal signal that is more suitable for transformer-based sequence modeling. Importantly, all aggregation operations are performed strictly forward in time, without incorporating future information.
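The daily-to-weekly aggregation rule described above can be expressed compactly with pandas resampling. The function and column names below are illustrative (the paper does not publish its code), but the aggregation semantics follow the text exactly:

```python
import pandas as pd

def to_weekly(daily: pd.DataFrame) -> pd.DataFrame:
    """Aggregate daily OHLCV bars into weekly bars, strictly forward in time.

    `daily` is assumed to be indexed by a UTC DatetimeIndex with the usual
    lowercase OHLCV column names.
    """
    rules = {
        "open": "first",    # weekly open  = first daily open
        "high": "max",      # weekly high  = max of daily highs
        "low": "min",       # weekly low   = min of daily lows
        "close": "last",    # weekly close = final daily close
        "volume": "sum",    # weekly volume = sum over all days in the week
    }
    return daily.resample("W").agg(rules).dropna()
```

Because `resample` only groups past observations into their own calendar week, no future information enters any weekly bar.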
To maintain rigorous multivariate synchronization, only weeks for which complete OHLCV data are available for all five assets are retained. After alignment, each asset contains 782 weekly observations covering the period 2010–2024. Following the application of a sliding-window mechanism with look-back length $L$, a total of 722 supervised learning samples are obtained per asset. These values represent the exact number of usable training instances and ensure that all models are evaluated on an identically sized multivariate dataset.
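The "complete weeks only" retention rule amounts to an inner join of the per-asset weekly frames on their timestamps. A minimal sketch (function name and frame layout are illustrative assumptions):

```python
import pandas as pd

def align_assets(frames: dict) -> pd.DataFrame:
    """Keep only the weekly timestamps present for every asset.

    `frames` maps an asset symbol to its weekly OHLCV DataFrame; the inner
    join drops any week that is missing for at least one asset, and the
    result carries hierarchical (asset, field) columns.
    """
    merged = pd.concat(frames, axis=1, join="inner")
    return merged.dropna()
```

After this step all five assets share an identical sequence of weekly timestamps, as required for the multivariate models.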
The resulting dataset exhibits several characteristics commonly observed in cryptocurrency markets, including nonlinear dependencies, volatility clustering, asymmetric fluctuation patterns, and cross-asset co-movements. These properties render the forecasting task particularly challenging and provide a robust testbed for evaluating recurrent and transformer-based deep learning architectures within a unified methodological framework.
Weekly aggregation is employed to suppress microstructure noise inherent in daily cryptocurrency data, stabilize the behavior of multi-day technical indicators, and align the input horizon with the medium-term forecasting objectives of this study. This choice preserves essential volatility regimes and cross-asset dynamics while preventing deep learning architectures from overfitting to high-frequency noise. Consequently, the forecasting task remains nontrivial despite the reduced sampling frequency.
Weekly aggregation does not eliminate the fundamental volatility structure of cryptocurrency markets; rather, it mitigates microstructure noise arising from intraday price jumps, irregular trading activity, and exchange-specific artifacts. Importantly, the aggregated series continues to exhibit well-documented stylized facts, such as volatility clustering, heavy-tailed distributions, asymmetric shocks, and abrupt regime transitions. These characteristics ensure that the forecasting task remains challenging and is not artificially simplified through downsampling.
Moreover, weekly sampling aligns with the design of many widely used technical indicators—such as MACD, RSI, ATR, and moving-average-based filters—which rely on multi-day look-back windows and are known to produce unstable or noisy signals at higher frequencies. Weekly aggregation, therefore, enhances the statistical reliability of these indicators without obscuring meaningful market dynamics.
Finally, the low MAPE values reported in this study do not arise from the reduced sampling frequency itself but from the combined effects of (i) a rich multivariate feature space, (ii) a unified preprocessing pipeline, and (iii) the strong sequence-modeling capacity of transformer-based architectures. This claim is supported by the newly added benchmarks, in which classical econometric models, trained on the same weekly data, yield substantially higher forecasting errors. Accordingly, weekly aggregation is methodologically justified and does not compromise the integrity or difficulty of the forecasting problem.
3.2. Feature Engineering and Data Preprocessing
To enhance the predictive content of the raw OHLCV data, a comprehensive set of technical indicators is computed using the pandas_ta library. Unless stated otherwise, indicator parameters follow the library defaults. Table 1 summarizes all incorporated indicators using their standard abbreviations.
The multivariate input space consists of 48 numerical features derived from price, volume, and technical analysis indicators. These features are generated through a structured and fully deterministic feature-engineering pipeline. For trend-based indicators, multiple window lengths are employed to capture short-, medium-, and long-horizon dynamics: moving averages (MAs) are computed using windows of 7, 14, 25, 50, and 100 days, while exponential moving averages (EMAs) use windows of 7, 14, 25, 50, 100, and 200 days. Momentum indicators—including RSI, the Stochastic Oscillator (%K and %D), Williams %R, ROC, and Momentum—are calculated using standard horizons ranging from 10 to 28 days to represent velocity and reversal patterns in weekly price movements. Volatility-related indicators (Bollinger Bands, ATR, and CCI) contribute additional components such as upper and lower bands, bandwidth, range-based volatility, and normalized deviations. Volume–price interaction indicators, including OBV, CMF, and the Accumulation/Distribution index, further enrich the feature set by incorporating liquidity-driven market pressure. Finally, auxiliary variables such as log-returns, normalized volume change, and intra-week volatility (close–open) are added to strengthen the representation of weekly dynamics. All indicators are computed in a strictly causal manner within each training fold, ensuring that no information from future observations leaks into the feature-generation process.
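For intuition, a few of the indicator families above can be re-derived directly in pandas. This is a minimal sketch, not the paper's pipeline: the study uses pandas_ta with library defaults, and the Wilder-style RSI below only approximates that default.

```python
import pandas as pd

def sma(close: pd.Series, length: int) -> pd.Series:
    """Simple moving average (trend indicator)."""
    return close.rolling(length).mean()

def ema(close: pd.Series, length: int) -> pd.Series:
    """Exponential moving average (trend indicator)."""
    return close.ewm(span=length, adjust=False).mean()

def rsi(close: pd.Series, length: int = 14) -> pd.Series:
    """Relative Strength Index with Wilder smoothing (momentum indicator)."""
    delta = close.diff()
    gain = delta.clip(lower=0.0)
    loss = (-delta).clip(lower=0.0)
    avg_gain = gain.ewm(alpha=1.0 / length, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1.0 / length, adjust=False).mean()
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)
```

Because every operation here is a rolling or exponentially weighted statistic over past values only, the indicators are causal by construction, matching the leakage constraint stated above.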
The preprocessing pipeline consists of four steps, applied uniformly across all assets:
(i) Structuring and cleaning. Raw API responses are parsed into pandas DataFrames, and non-numeric fields are coerced into numeric format. Missing values arising from API gaps or indicator warm-up periods are imputed using a combination of forward-fill and backward-fill operations to preserve chronological continuity.
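Step (i) can be sketched in a few lines of pandas (function name is ours; the coercion and fill order follow the text):

```python
import pandas as pd

def clean_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce fields to numeric and impute gaps chronologically."""
    out = df.apply(pd.to_numeric, errors="coerce")  # non-numeric values -> NaN
    return out.ffill().bfill()                      # forward- then backward-fill
```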
(ii) Feature scaling. All numerical features are normalized using Min–Max scaling,
$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$
where the scaling parameters $x_{\min}$ and $x_{\max}$ are estimated exclusively from the training set.
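The train-only estimation of the scaling parameters can be made explicit with a small helper (a sketch; the paper does not specify its implementation):

```python
import numpy as np

class TrainOnlyMinMax:
    """Min-Max scaler whose parameters come from the training split only."""

    def fit(self, x_train: np.ndarray) -> "TrainOnlyMinMax":
        self.min_ = x_train.min(axis=0)
        self.range_ = x_train.max(axis=0) - self.min_
        self.range_[self.range_ == 0] = 1.0  # guard against constant features
        return self

    def transform(self, x: np.ndarray) -> np.ndarray:
        return (x - self.min_) / self.range_
```

Note that test-set values may legitimately fall outside $[0, 1]$ when the test period contains new extrema; recalibrating on the test set to avoid this would leak future information.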
(iii) Sliding-window formation. Supervised learning samples are constructed using input sequences of length $L$, yielding the following input–output pairs:
$\left( \mathbf{x}_{t-L+1}, \ldots, \mathbf{x}_{t} \right) \longmapsto y_{t+1},$
where $\mathbf{x}_t$ denotes the multivariate feature vector at time $t$ and $y_{t+1}$ is the next-week closing price.
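The window construction can be sketched as follows (a minimal version; for $T$ weekly observations it yields $T - L$ supervised samples):

```python
import numpy as np

def make_windows(features: np.ndarray, target: np.ndarray, L: int):
    """Build (X, y) pairs from a (T, F) feature matrix and (T,) target series."""
    X, y = [], []
    for t in range(L, len(features)):
        X.append(features[t - L:t])  # feature vectors for weeks t-L .. t-1
        y.append(target[t])          # next-week target
    return np.stack(X), np.array(y)
```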
To prevent any form of temporal information leakage, all preprocessing steps are performed strictly within each training window. Missing-value imputation (forward and backward filling) is applied only to the training segment of each rolling window prior to constructing the corresponding validation and test sets. Min–Max normalization is likewise fitted exclusively on the training data within each window, and the learned scaling parameters are subsequently applied to the validation and test sets without recalibration. This window-wise, train-only preprocessing strategy ensures that no future information—such as unseen extrema or forward-filled values—can influence model training, thereby preserving the temporal causality required for valid forecasting evaluation.
(iv) Robust temporal splitting. A chronological 20–80% split is adopted for training and testing. To reduce sensitivity to a single split point, this procedure is repeated ten times by sliding the split boundary forward in time. All models are retrained from scratch for each repetition, and the reported results correspond to the average performance across all runs.
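The repeated-split procedure can be sketched as a generator of chronological boundaries. The stride rule below is an assumption for illustration: the paper states only that the boundary is advanced ten times and models are retrained for each run.

```python
import numpy as np

def rolling_split_points(n: int, train_frac: float, n_repeats: int):
    """Yield (train_end, n) index pairs with the split boundary sliding forward.

    Train on [0, train_end), test on [train_end, n); the stride between
    successive boundaries is a hypothetical, evenly spaced choice.
    """
    base = int(n * train_frac)
    stride = max(1, (n - base) // (n_repeats + 1))
    for k in range(n_repeats):
        yield base + k * stride, n
```

Averaging metrics over the resulting runs reduces sensitivity to any single split point, as described above.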
To ensure a fully causally valid forecasting setup, all models are trained using sliding-window sequences in which observations from weeks $t-L+1$ through $t$ are used to predict week $t+1$. Technical indicators are computed strictly from historical data within the input window, and no information from future timestamps is incorporated. Feature normalization is performed exclusively on the training split of each rolling window, and the resulting parameters are reused for the validation and test sets. This approach prevents leakage arising from global scaling or indicator computation outside the training horizon. The inclusion of econometric baselines further confirms that the forecasting task remains nontrivial, as these classical models consistently exhibit inferior performance compared to the transformer-based architectures.
3.3. Model Architectures and Training Setup
Six architectures are evaluated in this study: LSTM, GPT-2, Informer, Autoformer, Temporal Fusion Transformer (TFT), and a Vanilla Transformer encoder. All models share comparable hyperparameter settings—such as sequence length, batch size, and optimizer—to isolate architectural effects and ensure a fair comparison. To ensure fair and architecture-appropriate optimization, all models are tuned through a structured grid-search procedure prior to the main experiments. Learning rates are explored over model-specific search spaces: one range for the LSTM and another for the transformer-based architectures (GPT-2, Informer, Autoformer, TFT, and Vanilla Transformer). The best-performing learning rate for each architecture is selected according to validation loss, with the LSTM and TFT sharing one optimal rate and the remaining transformer-based models another. The batch size is fixed at 64 to preserve comparability, as sensitivity analyses confirm that alternative batch sizes do not materially affect model rankings. Early stopping is applied to all models, allowing both recurrent and transformer architectures to converge at their naturally optimal pace. This tuning strategy ensures methodological fairness while preventing overfitting to individual temporal splits.
3.3.1. LSTM
Long Short-Term Memory (LSTM) networks [31] extend recurrent neural networks by incorporating input, forget, and output gates, which enable selective memory retention and mitigate the vanishing-gradient problem. The internal memory cell facilitates the modeling of long-range temporal dependencies, making LSTMs well suited for financial time-series forecasting. The model receives an input tensor of shape $(B, L, F)$, where $B$ denotes the batch size, $L$ the look-back window length, and $F$ the number of engineered features. The architecture consists of two stacked LSTM layers, each with 64 hidden units, with a dropout rate of 0.2 applied to recurrent outputs. The final hidden state is passed through a fully connected layer to produce the next-week closing price forecast. Training employs the Adam optimizer with the learning rate selected by the grid search described above, batch size 64, and a maximum of 100 epochs, with early stopping applied.
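The gating mechanism that underlies this architecture can be illustrated with a single NumPy LSTM step. This is a didactic sketch only (weight shapes and the i, f, o, g gate ordering are conventions we assume; deep learning frameworks implement this internally):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W: (4H, F), U: (4H, H), b: (4H,); gates ordered i, f, o, g."""
    z = W @ x + U @ h + b
    H = h.shape[0]
    i = sigmoid(z[0:H])          # input gate: how much new information to admit
    f = sigmoid(z[H:2 * H])      # forget gate: how much old memory to keep
    o = sigmoid(z[2 * H:3 * H])  # output gate: how much memory to expose
    g = np.tanh(z[3 * H:4 * H])  # candidate cell update
    c_new = f * c + i * g        # selective memory retention
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

The additive cell update `f * c + i * g` is what mitigates vanishing gradients relative to plain RNNs.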
3.3.2. GPT-2
GPT-2 [32] is a decoder-only transformer architecture originally developed for autoregressive sequence modeling in natural language processing. Its multi-head self-attention mechanism enables each time step to attend selectively to all preceding steps, providing substantially greater capacity for long-range temporal dependency modeling than recurrent architectures, which propagate information solely through a single hidden state.
To adapt GPT-2 for multivariate numerical forecasting, each time step in the weekly time series is treated as a structured token comprising $F$ features. Rather than using a simple linear projection, the model employs a learned feature embedding layer that maps each multivariate input vector into the transformer model dimension $d_{\text{model}}$. This embedding mechanism serves as a continuous analogue of token embeddings in language modeling. Sinusoidal positional encodings are subsequently added to preserve the temporal ordering of observations.
Autoregressive input sequences of length L are constructed, and causal masks ensure that each position attends only to previous time steps, thereby maintaining forecasting validity. The decoder comprises four transformer blocks with four attention heads and a feed-forward dimensionality of 256. Unlike standard GPT-2, the vocabulary projection head is removed and replaced with a regression head that maps the hidden representation of the final token to the predicted next-week closing price.
A key advantage of GPT-2 in financial forecasting lies in its ability to form direct connections between temporally distant observations. Whereas recurrent models are constrained by vanishing gradients, the self-attention mechanism computes
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$
allowing the model to assign learnable relevance weights to long-range patterns such as volatility cycles, structural breaks, and asymmetric shock responses that are characteristic of cryptocurrency markets.
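The causally masked self-attention used by the decoder can be sketched in NumPy for a single head (a didactic, unbatched version):

```python
import numpy as np

def causal_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V with a causal (lower-triangular) mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (L, L) relevance scores
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)             # block attention to future steps
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                # row-wise softmax
    return w @ V
```

The upper-triangular mask is exactly what "each position attends only to previous time steps" means: position 0 can only attend to itself, position 1 to steps 0–1, and so on.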
During inference, forecasting proceeds autoregressively. After predicting the next-week value $\hat{y}_{t+1}$, this prediction is appended to the input sequence and fed back into the model to generate $\hat{y}_{t+2}$, and so forth. This free-running rollout mirrors GPT-style generative decoding and ensures consistency between training (teacher forcing on observed sequences) and testing (recursive prediction), enabling the model to capture evolving temporal dynamics across multiple forecasting steps.
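The rollout loop can be sketched generically; `model` below is any callable mapping a look-back window to the next value (the persistence model in the test is purely illustrative, not the paper's forecaster):

```python
import numpy as np

def rollout(model, history: np.ndarray, steps: int) -> np.ndarray:
    """Free-running multi-step forecast: each prediction is fed back as input."""
    window = list(history)
    preds = []
    for _ in range(steps):
        y_hat = model(np.asarray(window))
        preds.append(y_hat)
        window.append(y_hat)   # append the prediction, GPT-style
        window.pop(0)          # keep the look-back length fixed
    return np.array(preds)
```

Note that forecast errors compound under this scheme, which is why consistency between the training objective and the recursive test-time procedure matters.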
Training is performed using the Adam optimizer with the learning rate selected by the grid search described above and a dropout rate of 0.1. The model is optimized using mean squared error (MSE) rather than cross-entropy, aligning the learning objective with continuous-valued financial forecasting. Collectively, these adaptations enable GPT-2 to capture cross-feature interactions, long-range dependencies, and temporal asymmetries in multivariate cryptocurrency markets.
3.3.3. Informer
Informer [7] improves scalability for long input sequences by introducing ProbSparse attention, which prioritizes the most informative query–key interactions instead of uniformly attending to all time steps. This reduces computational complexity from $\mathcal{O}(L^2)$ to approximately $\mathcal{O}(L \log L)$, making the model particularly suitable for long-horizon forecasting in high-dimensional settings. A distillation mechanism further compresses the sequence hierarchically, enhancing efficiency when processing extended historical contexts.
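The core of ProbSparse attention is a sparsity measure that ranks queries by how far their maximum attention score departs from their mean score; only the top-$u$ queries receive full attention. A simplified sketch (the actual Informer implementation additionally subsamples keys before scoring):

```python
import numpy as np

def top_u_queries(Q, K, u):
    """Rank queries by the ProbSparse sparsity measure and keep the top u.

    A query with a 'spiky' score distribution (max >> mean) carries more
    information than one whose attention is nearly uniform.
    """
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (L_q, L_k) raw scores
    sparsity = scores.max(axis=1) - scores.mean(axis=1)
    return np.argsort(sparsity)[::-1][:u]            # indices of dominant queries
```

Queries not selected are handled cheaply (e.g., with a mean-pooled value), which is where the complexity reduction comes from.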
In cryptocurrency markets, price dynamics often exhibit short but pronounced volatility bursts rather than uniformly distributed fluctuations. ProbSparse attention naturally aligns with this structure by selectively emphasizing temporally dominant movements while suppressing noise. This property is consistent with Informer’s strong performance on XRP and XLM in our experiments—assets known for irregular liquidity shocks and sporadic volatility. Its ability to stabilize long-range representations under noisy conditions provides an advantage over dense-attention transformers.
The implementation employs a two-layer encoder, a single-layer decoder, model dimension 128, four attention heads, and a feed-forward dimension of 256. Dropout is set to 0.1, and optimization is performed using Adam with the learning rate selected by the grid search described above.
3.3.4. Autoformer
Autoformer [8] introduces decomposition blocks that explicitly separate each input sequence into trend and seasonal components prior to attention-based processing. This structure enables the model to capture periodicity, cyclical financial behavior, and slowly evolving long-term trajectories more effectively than standard transformers. Autoformer further replaces dot-product attention with auto-correlation attention, which identifies repeated temporal patterns at a reduced computational cost.
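The decomposition idea can be sketched as a moving-average split (a simplification: Autoformer applies padded average-pooling blocks inside the network, and the kernel size here is illustrative):

```python
import numpy as np

def series_decomp(x: np.ndarray, kernel: int):
    """Split a series into trend (moving average) and seasonal (residual) parts."""
    pad = kernel // 2
    # Edge-pad so the moving average is defined over the full length.
    padded = np.concatenate([np.repeat(x[0], pad), x, np.repeat(x[-1], pad)])
    trend = np.convolve(padded, np.ones(kernel) / kernel, mode="valid")[:len(x)]
    seasonal = x - trend
    return trend, seasonal
```

By construction `trend + seasonal` reconstructs the input, so the two branches can be modeled separately and recombined without losing information.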
This decomposition-based inductive bias is particularly beneficial for assets such as Ethereum (ETH), where smoother oscillatory patterns and identifiable long-horizon structures are more prominent than sharp volatility spikes. The empirical results confirm this alignment: Autoformer achieves the lowest MAPE on ETH, indicating that its trend–seasonality separation effectively captures ETH’s macro-level temporal organization. Conversely, for assets dominated by rapid regime shifts (e.g., SOL), the same inductive bias becomes restrictive, leading to weaker performance.
The architecture used in this study includes a two-layer encoder, a single-layer decoder, model dimension 128, four attention heads, and a feed-forward dimension of 256. Decomposition blocks operate at every layer, and dropout is set to 0.1. Training employs the Adam optimizer with the learning rate selected by the grid search described above.
3.3.5. Temporal Fusion Transformer
The Temporal Fusion Transformer (TFT) [9] integrates recurrent encoders, gating mechanisms, and interpretable attention layers to dynamically adjust the contribution of multivariate temporal features to forecasting. Variable selection networks (VSNs) automatically identify the most relevant technical indicators at each time step, while gated residual networks (GRNs) regulate information flow to enhance robustness and prevent overfitting. Temporal attention layers further provide interpretability by highlighting historical intervals with the strongest predictive influence.
These design components render TFT particularly effective in nonstationary environments characterized by abrupt behavioral shifts. This observation is consistent with our findings: TFT achieves the best performance on Solana (SOL), whose price dynamics are strongly influenced by speculative cycles, structural breaks, and rapid liquidity changes. TFT’s dynamic feature-weighting capability enables the model to reallocate importance across indicators as market conditions evolve, offering an advantage over architectures that assume stationarity or periodicity.
The model is configured with a hidden size of 64 in recurrent and gating components, four attention heads, and a dropout rate of 0.2. Training uses the Adam optimizer with the learning rate selected by the grid search described above.
3.3.6. Vanilla Transformer
The Vanilla Transformer follows the original encoder-style architecture introduced by Vaswani et al. [
6]. The model operates on an input tensor of shape $(L, F)$, where $L$ denotes the historical look-back window and $F$ represents the multivariate feature dimension. A learnable linear projection maps each feature vector to the model dimension $d_{\text{model}}$, and sinusoidal positional encodings are added to preserve temporal order. The encoder consists of two stacked transformer blocks, each containing a multi-head self-attention layer with four heads, followed by a position-wise feed-forward network with hidden dimension 256. Residual connections and layer normalization are applied after each sublayer, and a dropout rate of 0.1 is used throughout. The final encoder output corresponding to the last time step is passed through a regression head to predict the next-week closing price. Training is performed using the Adam optimizer.
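As an illustrative sketch of one component of this encoder, the sinusoidal positional encoding added to the projected input can be computed as follows. The window length of 52 weeks, the function name, and the random input are our own assumptions for demonstration, not part of the study's configuration:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Standard sinusoidal encoding from Vaswani et al. (2017)."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

# Add encodings to a projected input window of shape (L, d_model)
L_window, d_model = 52, 128                        # assumed 52-week look-back
x = np.random.randn(L_window, d_model)             # stand-in for the linear projection output
x_encoded = x + sinusoidal_positional_encoding(L_window, d_model)
```

Because the encoding depends only on position and dimension, it can be precomputed once and reused for every training window.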
3.4. Forecasting Pipeline
The overall forecasting pipeline—from Binance data acquisition and weekly aggregation to technical indicator construction, normalization, sliding-window sequence generation, repeated train–test splits, and model training across all six architectures—is summarized in the workflow diagram presented in
Figure 1. Each stage contributes a distinct component to the proposed framework, which is designed to ensure methodological consistency across all assets and models.
The workflow begins with data collection, in which weekly cryptocurrency market data are obtained from the Binance REST API using the
python-binance client. Daily OHLCV (open, high, low, close, volume) records are retrieved for Bitcoin, Ethereum, Ripple, Stellar, and Solana. These raw daily series form the basis for all subsequent processing steps and are aligned using the standardized exchange-generated timestamps. Following data acquisition, the pipeline proceeds to feature engineering. A comprehensive set of technical indicators is computed using the
pandas_ta library, encompassing trend-following, momentum-based, volatility-sensitive, and volume-derived measures. Such indicators are widely adopted in quantitative finance and have been shown to enhance the predictive performance of machine learning models when modeling nonlinear price dynamics [
24,
25]. The resulting feature matrix captures diverse aspects of market behavior, including trend shifts, cyclical regimes, and volume–price interactions.
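For illustration, two of the indicator families mentioned above (trend-following and momentum) can be re-implemented in plain pandas. The function names, window lengths, and toy price series below are our own choices; the study itself computes its indicators with the pandas_ta library rather than this sketch:

```python
import pandas as pd

def sma(close: pd.Series, window: int = 20) -> pd.Series:
    """Simple moving average: a basic trend-following indicator."""
    return close.rolling(window).mean()

def rsi(close: pd.Series, window: int = 14) -> pd.Series:
    """Relative Strength Index: a momentum indicator with Wilder-style smoothing."""
    delta = close.diff()
    gain = delta.clip(lower=0).ewm(alpha=1 / window, adjust=False).mean()
    loss = (-delta.clip(upper=0)).ewm(alpha=1 / window, adjust=False).mean()
    rs = gain / loss
    return 100 - 100 / (1 + rs)

# Toy closing-price series (illustrative data only)
close = pd.Series([100, 101, 103, 102, 105, 107, 106, 108, 110, 109,
                   111, 113, 112, 115, 117, 116, 118, 120, 119, 121], dtype=float)
features = pd.DataFrame({"close": close,
                         "sma_5": sma(close, 5),
                         "rsi_14": rsi(close, 14)})
```

Rolling-window indicators such as these produce NaN values during their warm-up period, which is why the pipeline's imputation step described below is needed.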
Subsequently, the workflow advances to preprocessing, where daily observations are aggregated into weekly intervals to ensure temporal consistency across assets. Weekly bars are constructed using the first daily open, the final daily close, the weekly extrema (high and low), and the cumulative weekly trading volume. Missing observations arising from exchange outages, indicator warm-up periods, or listing inconsistencies are imputed using a combination of forward- and backward-filling. All input features are then rescaled using Min–Max normalization, ensuring uniform numerical ranges and improving optimization stability across heterogeneous model architectures, consistent with standard practice in financial time-series modeling [
4].
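A minimal sketch of the weekly aggregation, gap imputation, and Min–Max normalization steps, assuming daily OHLCV data in a date-indexed pandas DataFrame. The synthetic input and variable names are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical daily OHLCV frame (28 days of synthetic data)
idx = pd.date_range("2024-01-01", periods=28, freq="D")
rng = np.random.default_rng(0)
daily = pd.DataFrame({
    "open": 100 + rng.standard_normal(28).cumsum(),
    "close": 100 + rng.standard_normal(28).cumsum(),
    "volume": rng.uniform(1e3, 1e4, 28),
}, index=idx)
daily["high"] = daily[["open", "close"]].max(axis=1) + 1
daily["low"] = daily[["open", "close"]].min(axis=1) - 1

# Weekly bars: first open, weekly extrema, last close, summed volume
weekly = daily.resample("W").agg(
    {"open": "first", "high": "max", "low": "min", "close": "last", "volume": "sum"}
)

# Forward- then backward-fill gaps, then Min-Max scale each column to [0, 1]
weekly = weekly.ffill().bfill()
scaled = (weekly - weekly.min()) / (weekly.max() - weekly.min())
```

In a production pipeline the scaler statistics would be fitted on the training window only and reused on the test window to avoid look-ahead leakage.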
After preprocessing, the pipeline advances to the modeling stage. A unified sliding-window mechanism is applied to generate supervised learning samples of length L, with each input window paired with the subsequent week's closing price as the prediction target. The evaluated models include a stacked LSTM network [
31], an adapted GPT-2 architecture for autoregressive numerical sequences [
32], the Informer model with ProbSparse attention for long-range forecasting [
7], the Autoformer architecture that explicitly decomposes time series into trend and seasonal components [
8], the Temporal Fusion Transformer (TFT) incorporating variable selection and temporal attention mechanisms [
9], and a standard Vanilla Transformer encoder based on the original attention formulation [
6]. All models are trained independently under identical optimization settings, including the Adam optimizer, a shared batch size, and early stopping based on validation loss. Although the maximum epoch limit is fixed at 100, recurrent architectures typically converge earlier, whereas transformer-based models generally require longer training horizons.
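The sliding-window sample generation described above can be sketched as follows; the array sizes, window length, and function name are illustrative assumptions:

```python
import numpy as np

def make_windows(features: np.ndarray, target: np.ndarray, L: int):
    """Pair each length-L window of features with the next step's target value."""
    X, y = [], []
    for t in range(len(features) - L):
        X.append(features[t:t + L])   # input window of L consecutive weeks
        y.append(target[t + L])       # the following week's closing price
    return np.stack(X), np.array(y)

T, F = 100, 8                         # 100 weeks, 8 features (illustrative sizes)
feats = np.random.randn(T, F)
close = np.random.randn(T)
X, y = make_windows(feats, close, L=12)
# X has shape (T - L, L, F) and y has shape (T - L,)
```

Because every model consumes tensors of the same shape, this single windowing routine keeps the comparison across all six architectures methodologically symmetric.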
The workflow concludes with the evaluation phase, during which each trained model produces weekly closing-price predictions that are compared against ground-truth values using standard forecasting error metrics. To enhance robustness and reduce sensitivity to a single temporal split, the entire pipeline is repeated across ten distinct rolling train–test partitions. Performance metrics are then averaged over these repetitions to obtain stable and reliable generalization estimates.
4. Experimental Results
This section presents the empirical performance of six deep learning architectures—LSTM, GPT-2, Informer, Autoformer, Temporal Fusion Transformer (TFT), and a Vanilla Transformer—evaluated across five major cryptocurrencies (BTC, ETH, XRP, XLM, and SOL). Each model predicts the next-week closing price using the multivariate feature set described earlier. All evaluations follow the repeated chronological splitting scheme, in which each experiment is conducted ten times with shifted train–test boundaries; the reported metrics correspond to the average performance across these repetitions. Due to asset-specific preprocessing and feature construction, the number of valid test samples may differ across cryptocurrencies; accordingly, all models are evaluated on their respective test sequences.
Model accuracy is assessed using five standard forecasting metrics: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and the coefficient of determination ($R^2$). Together, these metrics capture absolute error magnitude, percentage deviation, and the explanatory power of the model. The metrics are computed as
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\lvert y_i - \hat{y}_i\rvert,$$
$$\mathrm{MAPE} = \frac{1}{N}\sum_{i=1}^{N}\left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert, \qquad R^2 = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2}.$$
In the expressions above, $y_i$ denotes the observed closing price at time $i$, while $\hat{y}_i$ denotes the corresponding model prediction. The term $\bar{y}$ denotes the mean of the observed target values and serves as a baseline for computing the explained variance in $R^2$. The variable $N$ indicates the total number of samples in the evaluation set. Collectively, these metrics quantify both absolute and relative deviations between predictions and ground truth, enabling a comprehensive assessment of model accuracy, stability, and explanatory capability. Lower MSE, RMSE, MAE, and MAPE values indicate better forecasting accuracy, whereas higher $R^2$ values indicate stronger explanatory capability.
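The five metrics can be computed directly from predictions and ground truth, as in the following generic sketch (the function name and toy values are illustrative, not the study's evaluation code):

```python
import numpy as np

def forecast_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Compute MSE, RMSE, MAE, MAPE, and R^2 for a set of point forecasts."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err) / np.abs(y_true))   # fraction; multiply by 100 for percent
    r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2}

# Toy example: four weekly closing prices and their forecasts
y_true = np.array([100.0, 102.0, 105.0, 103.0])
y_pred = np.array([101.0, 101.0, 104.0, 104.0])
m = forecast_metrics(y_true, y_pred)
```

Note that MAPE is undefined when an observed value is zero; this is not an issue for closing prices, which are strictly positive.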
To provide statistical rigor beyond point estimates, all models were evaluated across ten rolling windows, yielding ten independent error measurements per asset and per architecture. Furthermore, to assess whether differences in forecasting accuracy are statistically significant, Wilcoxon signed-rank tests were conducted on paired MAPE distributions between competing models. Transformer-based architectures demonstrated statistically significant improvements over the LSTM baseline across all assets, while the best-performing transformer variants achieved significant gains over other deep models in four out of five assets. This analysis confirms that the observed performance differences reflect genuine modeling advantages rather than random variation.
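A paired Wilcoxon signed-rank test on per-window MAPE values can be run as in the sketch below. The two MAPE arrays are fabricated placeholders for illustration only, not the study's measurements, and SciPy is assumed to be available:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-window MAPE values for two competing models (10 rolling splits)
mape_lstm        = np.array([0.071, 0.068, 0.074, 0.070, 0.069,
                             0.073, 0.072, 0.075, 0.066, 0.070])
mape_transformer = np.array([0.031, 0.029, 0.034, 0.030, 0.028,
                             0.033, 0.032, 0.035, 0.027, 0.031])

# Paired, one-sided test: does the transformer yield lower MAPE than the LSTM?
stat, p_value = wilcoxon(mape_lstm, mape_transformer, alternative="greater")
```

The signed-rank test is appropriate here because the two error series are paired by rolling window and no distributional assumption is made about the MAPE differences.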
The comparative evaluation of forecasting models presented in
Table 2 now incorporates both classical econometric baselines (GARCH, VAR, ARIMA, and Random Walk) and modern deep learning architectures. This extended comparison enables a clearer assessment of whether advanced neural models provide tangible improvements beyond standard financial forecasting techniques. Because the Mean Absolute Percentage Error (MAPE) is scale-independent and facilitates consistent evaluation across assets with vastly different price levels, the discussion below focuses primarily on MAPE.
Across all five cryptocurrencies, the classical models exhibit substantially higher MAPE values than the deep learning architectures. Even the strongest classical baseline—GARCH—performs notably worse than the best transformer-based models. This suggests that, under a unified multivariate feature space enriched with technical indicators, deep learning models can exploit complex nonlinear dependencies that classical linear or variance-based models cannot fully capture.
A closer inspection of the asset-level results supports this conclusion. For Bitcoin (BTC), the best MAPE among classical methods is achieved by GARCH (0.0544), which remains nearly twice that of the best deep learning model, GPT-2 (0.0289). This margin indicates that BTC’s long-range temporal structure is better captured by autoregressive attention mechanisms than by conditional variance dynamics alone. For Ethereum (ETH), Autoformer yields the strongest performance (MAPE = 0.0198), markedly outperforming GARCH (0.0410), suggesting that ETH’s smoother cyclical patterns align well with decomposition-based transformer architectures. For XRP and XLM—both characterized by abrupt microstructural fluctuations—Informer achieves the lowest MAPE values (0.0418 and 0.0469, respectively), leveraging ProbSparse attention to focus selectively on informative temporal segments, an ability not available to classical models. For Solana (SOL), TFT provides the best forecasts (MAPE = 0.0578), highlighting the value of dynamic variable selection in capturing frequent regime shifts. In contrast, classical models, particularly ARIMA and Random Walk, perform substantially worse on SOL, yielding MAPE values above 0.12 and 0.17, respectively.
When comparing deep learning models across assets, no single architecture dominates universally; instead, each excels under particular volatility patterns and temporal characteristics. GPT-2 performs best when long-term contextual dependence is dominant, whereas Informer excels in high-volatility assets that benefit from selective attention. Autoformer is particularly effective when trend–seasonal decomposability plays a significant role. TFT outperforms others in markets with rapid structural changes, where adaptive gating and feature selection are advantageous. The Vanilla Transformer performs reasonably well but lacks the specialized mechanisms that enable state-of-the-art transformers to capture finer temporal properties. LSTM consistently underperforms transformer-based models, primarily due to its limited capacity to model very long-range dependencies.
In contrast, classical econometric models exhibit consistent limitations. Their performance is strongest when volatility follows relatively stable conditional variance structures (e.g., BTC or ETH under GARCH); however, they remain unable to model higher-order nonlinearities, cross-feature interactions, or long-distance dependencies captured by transformer architectures. Furthermore, Random Walk—a widely used naive baseline in financial forecasting—consistently yields the weakest performance across assets, reaffirming the need for more expressive predictive frameworks.
In summary, the key findings derived from the extended comparative analysis—including both classical econometric and deep learning approaches—are as follows:
Deep learning architectures substantially outperform classical baselines (GARCH, VAR, ARIMA, and Random Walk) across all assets and all metrics.
Among classical models, GARCH performs best but remains significantly inferior to transformer-based architectures.
No single deep learning model is universally optimal; transformer variants excel under different volatility structures and temporal behaviors.
Informer is most effective for high-volatility assets due to its sparse-attention mechanism.
Autoformer performs best when the asset exhibits stable cyclical trends.
GPT-2 provides strong results in assets with rich long-term autoregressive structures.
TFT is most effective under frequent regime shifts that require dynamic feature weighting.
LSTM consistently underperforms transformer-based models due to its limited horizon for temporal dependency modeling.
Although
Figure 2 and
Figure 3 serve a visual role, they also provide statistically meaningful insights that complement
Table 2. The loss curves reveal optimization stability, convergence behavior, and overfitting risk—properties that cannot be inferred from scalar error metrics alone. Likewise, the actual vs. predicted plots expose systematic prediction biases, lag in directional changes, volatility underestimation, and residual structure over time. These characteristics are essential for evaluating the statistical reliability and temporal generalization of forecasting models; therefore, the figures fulfill an analytical rather than merely descriptive purpose.
The training and validation loss curves presented in
Figure 2 provide a detailed depiction of optimization dynamics across all five cryptocurrencies. In each panel, both loss trajectories exhibit a smooth and largely monotonic decline, with no sustained divergence between training and validation curves. This behavior indicates that the unified preprocessing and model configuration yield stable convergence, without optimization anomalies such as gradient explosions or oscillatory updates. The close alignment of training and validation losses throughout training also suggests the absence of overfitting, indicating that early stopping effectively prevents models from fitting noise or idiosyncratic fluctuations in the training data.
A cross-asset inspection reveals differences in convergence speed and the sharpness of loss decay. BTC and ETH display the steepest initial decreases, reflecting stronger signal-to-noise characteristics in high-liquidity markets, where model parameters adapt rapidly to dominant temporal patterns. XRP and XLM also converge smoothly but with more gradual loss reduction, consistent with their higher susceptibility to microstructural noise and episodic volatility. SOL follows a similar trend, although its tail-end plateau occurs marginally later due to more irregular historical dynamics. Despite these differences, all assets ultimately reach low and stable validation losses, confirming that the selected architectures generalize effectively under the unified weekly forecasting setting.
The overall consistency of the curves indicates that the feature engineering pipeline—particularly the integration of price, volume, and technical indicators—produces sufficiently informative representations to support model training across heterogeneous assets. Moreover, the absence of widening gaps between training and validation losses across panels suggests that none of the models exhibits instability or overfitting, reinforcing the robustness of the experimental protocol. These observations justify the reliability of the subsequent comparative performance analysis and confirm that differences in forecasting accuracy across architectures stem from intrinsic modeling capabilities rather than optimization artifacts.
Figure 3 compares predicted and observed weekly closing prices for all five cryptocurrencies in the test sets. Across assets, the predicted trajectories closely follow the corresponding ground-truth series, indicating that the models capture both long-term directional movements and short-term fluctuations. For BTC, the model traces the upward trend and local oscillations, with larger deviations primarily during abrupt surges associated with intensified volatility. ETH exhibits a similarly strong correspondence, particularly in the mid-range of the test window, suggesting robust learning of medium-term momentum dynamics. The XRP and XLM series show strong predictive alignment, capturing both the early downward drift and subsequent recovery phases; turning points are followed with limited lag, indicating effective generalization under noise-dominant structures. SOL is the most volatile asset; nevertheless, the model maintains consistent tracking, with mild underestimation during rapid upward movements while preserving the overall trend direction and volatility range. These observations are consistent with the loss-curve behavior and further support the stability of the experimental setup.
Overall, the close correspondence between actual and predicted trajectories across assets is consistent with the loss patterns and suggests effective generalization to unseen data. The models reproduce asset-specific temporal dynamics under varying market conditions.
Figure 4,
Figure 5,
Figure 6,
Figure 7 and
Figure 8 present a model-wise comparison of forecasting accuracy in terms of MAPE for each cryptocurrency. The bar charts reveal pronounced asset-dependent performance patterns, indicating that no single architecture universally dominates across markets. For BTC and ETH, models with strong long-range dependency modeling capabilities—particularly GPT-2 and Autoformer—achieve the lowest errors, reflecting the relatively stable and trend-driven dynamics of high-liquidity assets. In contrast, for XRP and XLM, Informer consistently outperforms other architectures, suggesting that its sparse-attention mechanism is better suited to assets characterized by abrupt volatility bursts and short-lived informational cycles. SOL exhibits distinct behavior, with TFT achieving the best performance, highlighting the advantage of dynamic feature selection and gating mechanisms under frequent regime changes and speculative market conditions. Classical econometric baselines perform substantially worse across all assets, reinforcing that the observed accuracy gains are driven by deep learning architectures rather than asset-specific predictability. Overall, these asset-wise bar charts provide complementary evidence to
Table 2, demonstrating that model effectiveness is tightly coupled with the volatility structure and temporal characteristics of each cryptocurrency.
Figure 9 provides a compact, global view of forecasting performance by visualizing MAPE values across all model–asset combinations. Unlike the asset-wise bar charts, the heatmap highlights relative performance contrasts and systematic patterns that emerge across architectures and markets. Darker regions indicate higher forecasting errors, whereas lighter regions correspond to superior accuracy.
Several structured trends are immediately observable. Transformer-based models dominate the low-error regions of the heatmap, while classical econometric approaches consistently occupy high-error zones across assets, confirming their limited capacity to capture nonlinear and multivariate dependencies. Among deep learning models, GPT-2 exhibits uniformly low MAPE values for BTC and ETH, reflecting its strength in modeling long-range autoregressive structures. Informer forms a distinct low-error band for XRP and XLM, indicating that sparse attention mechanisms are particularly effective under high-volatility and noise-dominant market conditions. TFT achieves the lowest error concentration for SOL, reinforcing the importance of dynamic feature selection and gating under frequent regime changes.
The heatmap further shows that performance differences across models are asset-dependent rather than uniform. No architecture achieves consistently minimal error across all cryptocurrencies, underscoring that forecasting accuracy is governed by the interaction between model inductive biases and asset-specific volatility structures. This visualization complements
Table 2 by emphasizing cross-asset heterogeneity and clarifying why different architectures emerge as optimal under different market conditions.
5. Discussion
The empirical findings reveal clear distinctions in forecasting behavior across the evaluated models and cryptocurrencies. Transformer-based architectures consistently outperform the recurrent baseline, highlighting their capacity to capture nonlinear dynamics and long-range temporal dependencies that commonly arise in financial time series. GPT-2, in particular, demonstrates stable and strong performance across most assets, plausibly due to its autoregressive attention mechanism, which can represent extended directional movements observed in high-liquidity markets such as Bitcoin and Ethereum.
Informer achieves superior accuracy for assets with more irregular volatility patterns, such as XRP and XLM. Its sparse-attention formulation and sequence-distillation mechanism allow the model to focus on informative temporal segments and to respond effectively to rapid oscillations and short-lived micro-pattern variations. By contrast, Autoformer and TFT do not achieve comparable accuracy under weekly aggregation, suggesting that decomposition-based or hybrid architectures may be better suited to higher-frequency data (e.g., daily) or explicitly multi-horizon settings rather than coarser temporal resolutions.
The Vanilla Transformer yields competitive results but does not surpass the more specialized variants, indicating that architectural refinements—such as sparse attention, decomposition mechanisms, or autoregressive decoding—provide meaningful advantages for financial forecasting. In addition, the loss curves across all assets exhibit stable convergence, with training and validation losses remaining closely aligned. This behavior supports the effectiveness of early stopping and strengthens confidence in the reliability of the modeling pipeline. The predicted–actual plots further reinforce these observations, showing that the models capture both local fluctuations and broader market trends with high fidelity.
It should be noted that references to concepts such as volatility structure or regime-like shifts are made purely in a descriptive, qualitative sense based on observed forecasting behavior. The study does not perform formal statistical tests for structural breaks, asymmetry, or regime transitions; therefore, these terms should not be interpreted as econometric claims but rather as conceptual explanations of why certain architectures exhibit stronger performance.
Beyond the predictive comparisons, it is important to contextualize the findings within the well-documented econometric properties of cryptocurrency markets. Prior research has reported that major digital assets exhibit persistent and asymmetric volatility, long-memory behavior, and regime-dependent dynamics that can differ substantially across market phases such as bull, bear, and consolidation periods [
33,
34,
35]. These characteristics help interpret the heterogeneous performance patterns observed across architectures. For example, GPT-2’s strength on BTC and ETH is consistent with evidence that highly capitalized assets often display stronger long-range dependence, whereas Informer’s superior accuracy on XRP and XLM aligns with findings that smaller-cap assets can experience sharper volatility bursts and shorter-lived informational cycles. Similarly, the weaker performance of decomposition-based models such as Autoformer on SOL is consistent with the asset’s more irregular dynamics, which can limit the utility of trend–seasonal separation. Although the present study does not conduct formal econometric tests for structural breaks or volatility asymmetry, the forecasting behavior of the evaluated architectures remains broadly consistent with these documented stylized facts, providing an interpretable link between model mechanisms and underlying market conditions.
6. Conclusions
This study provides a unified and methodologically symmetric comparison of six deep learning architectures—LSTM, GPT-2, Informer, Autoformer, TFT, and a Vanilla Transformer—together with four classical econometric baselines for multivariate cryptocurrency forecasting. Using fifteen years of OHLCV data enriched with a comprehensive set of technical indicators, the results show that transformer-based architectures consistently outperform both recurrent models and traditional financial forecasting methods. These gains are attributed to their ability to model long-range dependencies, adapt to nonlinear volatility regimes, and exploit multivariate feature interactions more effectively than recurrent or variance-based approaches.
GPT-2 delivers the most consistent performance for BTC, ETH, and SOL, whereas Informer demonstrates clear advantages for XRP and XLM—assets characterized by sharper volatility bursts. This pattern suggests that different attention mechanisms align with distinct market behaviors. The inclusion of econometric baselines further strengthens the conclusions: ARIMA, VAR, GARCH, and the Random Walk model yield substantially higher forecast errors, confirming that the observed improvements are attributable to the modeling capacity of transformer-based architectures rather than to the weekly sampling frequency or the preprocessing strategy.
The convergence behavior observed in the loss curves and the close agreement between predicted and actual trajectories indicate that the models generalize well without evident overfitting. The findings also suggest that weekly forecasting benefits from architectures that balance contextual expressiveness with computational efficiency, while the multivariate technical-indicator feature space increases predictive signal density across all evaluated assets.
Overall, the results establish the proposed experimental framework as a robust benchmark for future research in cryptocurrency forecasting. By enforcing strict methodological symmetry across models, providing a harmonized multivariate feature space, and evaluating architectures under identical temporal conditions, this work offers a reproducible foundation for assessing emerging transformer variants, multimodal fusion approaches, and uncertainty-aware forecasting methods in future studies.
6.1. Research Limitations
Although the empirical findings remain stable across multiple temporal splits, several limitations should be acknowledged. First, cryptocurrency time series are inherently nonstationary, and their distributional properties evolve across market cycles. The evaluated models implicitly assume locally stable dynamics within each training window; however, parameter drift and time-varying volatility may reduce generalizability during periods of structural change.
Second, the analysis does not explicitly model structural breaks such as abrupt market crashes, liquidity shocks, regulatory announcements, or network-driven events (e.g., Bitcoin halving cycles). Such events may alter the underlying data-generating process in ways that deep learning architectures can only partially accommodate without dedicated econometric testing or regime-switching formulations.
Third, while weekly aggregation attenuates noise, it does not fully remove microstructure-related effects, including bid–ask bounce, order-book thinness, and intraday volatility clustering. These phenomena may still influence return dynamics after aggregation and can affect model calibration and error characteristics.
Fourth, the feature set is limited to technical indicators derived from OHLCV data. Incorporating additional modalities—such as on-chain activity, macroeconomic variables, or sentiment signals—may improve robustness, particularly during periods of market dislocation and rapidly changing regimes.
Fifth, hyperparameter search is standardized for fairness, but model-specific tuning may yield additional gains. The current configuration prioritizes controlled comparability over exhaustive optimization.
Finally, the study focuses on point forecasts rather than probabilistic forecasts or explicit uncertainty quantification. Future work may benefit from Bayesian formulations, quantile-based objectives, or distributional forecasting methods to better characterize predictive uncertainty under volatile market conditions.
6.2. Potential Future Research
Future research may extend this framework in several directions. Integrating multimodal information—such as blockchain transaction flows, derivatives-market variables, or sentiment embeddings—may improve sensitivity to evolving market conditions. More advanced architectures, including hybrid transformer–diffusion models or reinforcement-learning-driven decision systems, may further enhance temporal reasoning and downstream utility. Cross-asset or hierarchical attention mechanisms also represent a promising direction for modeling inter-cryptocurrency dependencies. Incorporating probabilistic forecasting and risk-aware objectives may increase practical relevance for portfolio management and algorithmic trading. Finally, real-time or intraday implementations may broaden the applicability of transformer-based forecasting systems in operational environments.