1. Introduction
The rapid expansion of digital asset markets has intensified the demand for accurate and reliable cryptocurrency price forecasting models. Unlike traditional financial instruments, cryptocurrencies exhibit extreme volatility, structural breaks, and nonlinear dynamics, which make both short- and long-horizon prediction particularly challenging [1,2]. Their price behavior is influenced not only by market microstructure but also by macroeconomic indicators, global liquidity conditions, investor sentiment, and network-level activity [3]. These characteristics increase forecasting uncertainty and highlight the need for models capable of capturing long-range temporal dependencies, complex feature interactions, and abrupt regime shifts.
In recent years, deep learning algorithms have become the dominant approach for financial time-series modeling due to their ability to capture nonlinear patterns and learn hierarchical representations [4]. Among these methods, recurrent neural networks (RNNs), particularly Long Short-Term Memory (LSTM) models, have been widely applied to cryptocurrency forecasting because of their ability to retain historical dependencies over time [5]. Nevertheless, RNN-based models face inherent limitations related to sequential computation and long-horizon dependency modeling, especially in multivariate settings.
Transformer-based architectures have emerged as a strong alternative owing to their self-attention mechanisms, which enable parallel computation and effective modeling of long-range dependencies [6]. These models have demonstrated strong performance across a variety of sequence modeling tasks, including financial forecasting and related temporal applications [7,8]. This success has encouraged the adaptation of generative transformer models, such as GPT-style decoders, for numerical time-series forecasting, particularly in capturing multivariate contextual relationships [9,10]. In addition, specialized temporal variants—including Temporal Fusion Transformer (TFT), Autoformer, and Informer—have shown effectiveness in modeling both short-term dynamics and long-term trends [7,9].
Despite growing interest, comparative studies that jointly evaluate recurrent and transformer-based architectures for cryptocurrency forecasting remain limited. Most existing research focuses on a single model family or univariate series and often omits rich technical indicator sets. Moreover, the performance of GPT-style decoder-only architectures in multivariate cryptocurrency forecasting has not been systematically investigated. These limitations highlight the need for a unified and consistent evaluation framework.
The novelty of this study lies in its unified and rigorously controlled evaluation framework, which compares both deep learning architectures and classical econometric baselines under strictly identical preprocessing, feature engineering, and training configurations. Unlike prior work, this study establishes a multivariate, technical–indicator–rich environment and performs a comprehensive head-to-head analysis across five high-liquidity cryptocurrencies. The framework simultaneously benchmarks six state-of-the-art deep learning models—LSTM, GPT-2, Informer, Autoformer, Temporal Fusion Transformer (TFT), and the Vanilla Transformer—together with four widely used econometric approaches (ARIMA, VAR, GARCH, and the Random Walk baseline). GPT-2 is systematically adapted for numerical multivariate forecasting, offering insights not previously reported in the literature. Moreover, all models are evaluated using aligned temporal partitions and repeated rolling-window experiments, yielding statistically robust performance estimates. This experimentally balanced setup provides a clearer, fairer, and more generalizable comparison than is available in existing studies.
While the prior literature frequently discusses characteristics such as volatility clustering, nonlinear patterns, or occasional regime-like transitions in cryptocurrency markets, the present study does not perform formal econometric tests to validate these phenomena. Accordingly, any reference to such behaviors in this paper should be interpreted solely as high-level motivation for why flexible sequence models may be useful, rather than as empirically verified properties of the dataset examined here. The comparisons presented in this study focus exclusively on forecasting accuracy obtained under the unified experimental framework.
The present study addresses these limitations by conducting a comprehensive comparison of recurrent and transformer-based deep learning architectures for multivariate cryptocurrency price forecasting. Five high-liquidity cryptocurrencies—Bitcoin (BTC), Ethereum (ETH), Ripple (XRP), Stellar (XLM), and Solana (SOL)—are modeled using an extensive feature set derived from trend-, momentum-, volatility-, and volume-based technical indicators. Six architectures are examined under identical preprocessing, training, and evaluation procedures: LSTM, GPT-2, Informer, Autoformer, Temporal Fusion Transformer (TFT), and the Vanilla Transformer. By employing a unified methodological framework and multiple evaluation metrics, this study provides a detailed assessment of each architecture’s strengths, limitations, and behavior under complex multivariate financial conditions.
To achieve this objective, the study makes the following key contributions:
A unified multivariate forecasting framework integrating numerous technical indicators derived from pandas_ta, enabling a richer and more representative feature space for cryptocurrency prediction.
A systematic comparison of six deep learning architectures, covering both recurrent (LSTM) and transformer-based models (GPT-2, Informer, Autoformer, TFT, and Vanilla Transformer) under identical experimental conditions.
An adaptation of GPT-2 for numerical time-series forecasting, demonstrating its potential beyond natural language processing tasks.
An extensive evaluation of long-range dependency modeling using advanced transformer variants specifically designed for temporal sequences.
A rigorous performance assessment using MSE, MAE, RMSE, MAPE, and R2, providing a multi-perspective view of forecasting accuracy.
A detailed analysis of practical challenges related to data quality, normalization, missing-value handling, hyperparameter sensitivity, and computational complexity, offering guidance for future research.
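The five evaluation metrics listed above have standard closed forms. As a reference sketch (the function name and return structure are ours, not taken from the paper's code), they can be computed as follows:

```python
import numpy as np

def forecast_metrics(y_true, y_pred):
    """Compute MSE, MAE, RMSE, MAPE (%), and R^2 for a forecast."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(mse)
    mape = np.mean(np.abs(err / y_true)) * 100.0   # assumes y_true has no zeros
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return {"MSE": mse, "MAE": mae, "RMSE": rmse, "MAPE": mape, "R2": r2}
```

Reporting all five jointly is useful because MAPE is scale-free across assets of very different price levels, while RMSE and R² remain sensitive to large absolute errors.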
The remainder of this paper is organized as follows. Section 2 reviews the related literature on deep-learning-based time-series forecasting and existing approaches in cryptocurrency prediction. Section 3 describes the dataset, technical indicators, preprocessing steps, and the architectures of the evaluated models. Section 4 presents the experimental setup and performance metrics. Section 5 reports and discusses the empirical results. Finally, Section 6 summarizes the conclusions and outlines potential directions for future research.
2. Related Work
Research on cryptocurrency price forecasting has developed significantly as digital asset markets have matured and the availability of high-frequency data has increased. Early studies relied on statistical models such as ARIMA, GARCH, and their extensions [1]. While these approaches were effective in modeling short-term volatility clustering, their linear structure proved insufficient for representing the nonlinear and highly dynamic behavior characteristic of cryptocurrency markets. This limitation motivated researchers to explore more flexible learning frameworks.
Recurrent neural networks, particularly LSTM and GRU architectures, have been applied extensively to cryptocurrency forecasting tasks. Prior studies demonstrated their ability to capture medium- and long-range temporal dependencies in assets such as Bitcoin and Ethereum [11,12]. However, gated recurrent models often face difficulties as sequence lengths increase or when the feature space incorporates a large number of technical indicators, leading to instability and slower convergence [13]. To mitigate these issues, hybrid architectures combining recurrent units with convolutional layers or attention mechanisms have been proposed, with varying degrees of success [14].
Transformer-based deep learning architectures marked a major shift in time-series forecasting. Their self-attention mechanism enables efficient modeling of long-distance temporal relationships while avoiding the sequential bottlenecks inherent in RNNs [6]. Several transformer variants have been specifically tailored for forecasting tasks. Informer introduced ProbSparse attention to reduce computational costs for long sequences [7], while Autoformer incorporated decomposition-based auto-correlation to better capture trend and seasonal components [8]. The Temporal Fusion Transformer (TFT) further extended the transformer paradigm by integrating static covariates, variable selection, and interpretable temporal attention [9]. Additional variants, such as FEDformer [15] and Pyraformer [16], employed frequency-domain decomposition and hierarchical receptive fields to improve scalability and robustness.
As transformers became more established in forecasting, researchers began exploring their applicability to financial time-series analysis, including cryptocurrency trend prediction, multi-horizon forecasting, and cross-asset modeling [17,18]. More recent studies examined multi-market transformer architectures capable of learning shared temporal structures across groups of cryptocurrencies [19]. Other works combined transformer layers with graph neural networks to capture relational dependencies among digital assets, such as co-movement patterns and market-wide contagion effects [20].
In parallel with these developments, generative transformer models—particularly decoder-only architectures inspired by GPT—have been adapted for numerical forecasting tasks. Although originally developed for natural language processing, autoregressive attention mechanisms have been effectively repurposed for multivariate time-series modeling [21]. These models have shown promise in predicting market microstructure dynamics, limit order book sequences, and short-term cryptocurrency movements [22,23]. Nevertheless, studies that directly compare GPT-style models with specialized temporal transformers under standardized experimental conditions remain limited.
Feature engineering continues to play a critical role in cryptocurrency forecasting. Numerous studies have emphasized the contribution of technical indicators—including momentum oscillators, trend averages, volatility measures, and volume-based metrics—to improving predictive performance across both classical and deep learning models [24,25]. More recent work has explored the integration of technical indicators with blockchain-level features, such as hash rate, on-chain volume, and network difficulty, to capture fundamental aspects of cryptocurrency ecosystems [26,27]. However, inconsistent preprocessing pipelines and heterogeneous experimental setups have limited the comparability of results across existing studies.
Recent studies have continued to advance transformer-based and hybrid architectures for cryptocurrency forecasting. Kehinde et al. introduced Helformer, an attention-enhanced forecasting model that demonstrated competitive performance across multiple digital assets [17]. Izadi proposed HSIF, a multimodal transformer architecture that fuses market data with sentiment signals through cross-attention, highlighting the growing interest in integrating heterogeneous information sources for cryptocurrency prediction [28]. Wu et al. presented a comprehensive 2024 review evaluating the implementation quality and empirical performance of deep learning models for cryptocurrency price prediction, emphasizing the need for standardized experimental pipelines [29]. Furthermore, Smyl et al. developed Contextual ES-adRNN, a hybrid exponential-smoothing recurrent architecture that incorporates exogenous variables and has shown strong performance in forecasting highly volatile cryptocurrency series [30]. These recent contributions underscore the rapid evolution of deep learning methods in digital asset forecasting while reinforcing the need for unified and rigorously controlled comparative frameworks—an issue directly addressed by the present study.
In summary, although substantial progress has been made in applying deep learning techniques to cryptocurrency forecasting, several gaps remain: the limited number of studies comparing recurrent, generative, and transformer-based approaches within a unified framework; the underrepresentation of GPT-style architectures in multivariate forecasting tasks; and the lack of systematic evaluations incorporating extensive technical indicator sets. The present study addresses these gaps by providing a harmonized comparison of LSTM, GPT-2, Informer, Autoformer, TFT, and Vanilla Transformer architectures for multivariate cryptocurrency price forecasting.
3. Materials and Methods
This section describes the data acquisition pipeline, feature construction process, preprocessing strategy, model architectures, and training procedures. All forecasting models are trained under a unified experimental protocol to ensure a fair and consistent comparison across architectures.
3.1. Dataset Description
The empirical analysis focuses on five high-liquidity cryptocurrencies that are widely regarded as benchmarks in the digital asset ecosystem: Bitcoin (BTC), Ethereum (ETH), Ripple (XRP), Stellar (XLM), and Solana (SOL). Historical OHLCV (open, high, low, close, volume) data for each asset are retrieved directly from the Binance spot market using the official REST API through the python-binance client. All data are collected via the get_klines endpoint with exchange-generated millisecond timestamps, ensuring strict UTC-based temporal consistency across assets.
The dataset spans a backward-looking window of approximately 15 years from January 2025. Since asset listing dates differ, each time series begins at its earliest available Binance record; however, after temporal alignment, all assets share an identical sequence of weekly timestamps. Raw OHLCV values are initially downloaded at daily resolution and subsequently aggregated into weekly bars. The weekly open is defined as the first daily open, the weekly close is defined as the final daily close, and the high, low, and volume values are aggregated over all days within each week. Weekly aggregation is adopted because it attenuates microstructure noise, reduces the impact of extreme intraday volatility, and produces a smoother temporal signal that is more suitable for transformer-based sequence modeling. Importantly, all aggregation operations are performed strictly forward in time, without incorporating future information.
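The daily-to-weekly aggregation rule described above can be expressed compactly with pandas resampling. The function and column names below are illustrative (the paper does not publish its code), but the aggregation semantics follow the text exactly:

```python
import pandas as pd

def to_weekly(daily: pd.DataFrame) -> pd.DataFrame:
    """Aggregate daily OHLCV bars into weekly bars, strictly forward in time.

    `daily` is assumed to be indexed by a UTC DatetimeIndex with the usual
    lowercase OHLCV column names.
    """
    rules = {
        "open": "first",    # weekly open  = first daily open
        "high": "max",      # weekly high  = max of daily highs
        "low": "min",       # weekly low   = min of daily lows
        "close": "last",    # weekly close = final daily close
        "volume": "sum",    # weekly volume = sum over all days in the week
    }
    return daily.resample("W").agg(rules).dropna()
```

Because `resample` only groups past observations into their own calendar week, no future information enters any weekly bar.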
To maintain rigorous multivariate synchronization, only weeks for which complete OHLCV data are available for all five assets are retained. After alignment, each asset contains 782 weekly observations covering the period 2010–2024. Following the application of a sliding-window mechanism with look-back length $L$, a total of 722 supervised learning samples are obtained per asset. These values represent the exact number of usable training instances and ensure that all models are evaluated on an identically sized multivariate dataset.
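The "complete weeks only" retention rule amounts to an inner join of the per-asset weekly frames on their timestamps. A minimal sketch (function name and frame layout are illustrative assumptions):

```python
import pandas as pd

def align_assets(frames: dict) -> pd.DataFrame:
    """Keep only the weekly timestamps present for every asset.

    `frames` maps an asset symbol to its weekly OHLCV DataFrame; the inner
    join drops any week that is missing for at least one asset, and the
    result carries hierarchical (asset, field) columns.
    """
    merged = pd.concat(frames, axis=1, join="inner")
    return merged.dropna()
```

After this step all five assets share an identical sequence of weekly timestamps, as required for the multivariate models.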
The resulting dataset exhibits several characteristics commonly observed in cryptocurrency markets, including nonlinear dependencies, volatility clustering, asymmetric fluctuation patterns, and cross-asset co-movements. These properties render the forecasting task particularly challenging and provide a robust testbed for evaluating recurrent and transformer-based deep learning architectures within a unified methodological framework.
Weekly aggregation is employed to suppress microstructure noise inherent in daily cryptocurrency data, stabilize the behavior of multi-day technical indicators, and align the input horizon with the medium-term forecasting objectives of this study. This choice preserves essential volatility regimes and cross-asset dynamics while preventing deep learning architectures from overfitting to high-frequency noise. Consequently, the forecasting task remains nontrivial despite the reduced sampling frequency.
Weekly aggregation does not eliminate the fundamental volatility structure of cryptocurrency markets; rather, it mitigates microstructure noise arising from intraday price jumps, irregular trading activity, and exchange-specific artifacts. Importantly, the aggregated series continues to exhibit well-documented stylized facts, such as volatility clustering, heavy-tailed distributions, asymmetric shocks, and abrupt regime transitions. These characteristics ensure that the forecasting task remains challenging and is not artificially simplified through downsampling.
Moreover, weekly sampling aligns with the design of many widely used technical indicators—such as MACD, RSI, ATR, and moving-average-based filters—which rely on multi-day look-back windows and are known to produce unstable or noisy signals at higher frequencies. Weekly aggregation, therefore, enhances the statistical reliability of these indicators without obscuring meaningful market dynamics.
Finally, the low MAPE values reported in this study do not arise from the reduced sampling frequency itself but from the combined effects of (i) a rich multivariate feature space, (ii) a unified preprocessing pipeline, and (iii) the strong sequence-modeling capacity of transformer-based architectures. This claim is supported by the newly added benchmarks, in which classical econometric models, trained on the same weekly data, yield substantially higher forecasting errors. Accordingly, weekly aggregation is methodologically justified and does not compromise the integrity or difficulty of the forecasting problem.
3.2. Feature Engineering and Data Preprocessing
To enhance the predictive content of the raw OHLCV data, a comprehensive set of technical indicators is computed using the pandas_ta library. Unless stated otherwise, indicator parameters follow the library defaults. Table 1 summarizes all incorporated indicators using their standard abbreviations.
The multivariate input space consists of 48 numerical features derived from price, volume, and technical analysis indicators. These features are generated through a structured and fully deterministic feature-engineering pipeline. For trend-based indicators, multiple window lengths are employed to capture short-, medium-, and long-horizon dynamics: moving averages (MAs) are computed using windows of 7, 14, 25, 50, and 100 days, while exponential moving averages (EMAs) use windows of 7, 14, 25, 50, 100, and 200 days. Momentum indicators—including RSI, the Stochastic Oscillator (%K and %D), Williams %R, ROC, and Momentum—are calculated using standard horizons ranging from 10 to 28 days to represent velocity and reversal patterns in weekly price movements. Volatility-related indicators (Bollinger Bands, ATR, and CCI) contribute additional components such as upper and lower bands, bandwidth, range-based volatility, and normalized deviations. Volume–price interaction indicators, including OBV, CMF, and the Accumulation/Distribution index, further enrich the feature set by incorporating liquidity-driven market pressure. Finally, auxiliary variables such as log-returns, normalized volume change, and intra-week volatility (close–open) are added to strengthen the representation of weekly dynamics. All indicators are computed in a strictly causal manner within each training fold, ensuring that no information from future observations leaks into the feature-generation process.
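For intuition, a few of the indicator families above can be re-derived directly in pandas. This is a minimal sketch, not the paper's pipeline: the study uses pandas_ta with library defaults, and the Wilder-style RSI below only approximates that default.

```python
import pandas as pd

def sma(close: pd.Series, length: int) -> pd.Series:
    """Simple moving average (trend indicator)."""
    return close.rolling(length).mean()

def ema(close: pd.Series, length: int) -> pd.Series:
    """Exponential moving average (trend indicator)."""
    return close.ewm(span=length, adjust=False).mean()

def rsi(close: pd.Series, length: int = 14) -> pd.Series:
    """Relative Strength Index with Wilder smoothing (momentum indicator)."""
    delta = close.diff()
    gain = delta.clip(lower=0.0)
    loss = (-delta).clip(lower=0.0)
    avg_gain = gain.ewm(alpha=1.0 / length, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1.0 / length, adjust=False).mean()
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)
```

Because every operation here is a rolling or exponentially weighted statistic over past values only, the indicators are causal by construction, matching the leakage constraint stated above.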
The preprocessing pipeline consists of four steps, applied uniformly across all assets:
(i) Structuring and cleaning. Raw API responses are parsed into pandas DataFrames, and non-numeric fields are coerced into numeric format. Missing values arising from API gaps or indicator warm-up periods are imputed using a combination of forward-fill and backward-fill operations to preserve chronological continuity.
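Step (i) can be sketched in a few lines of pandas (function name is ours; the coercion and fill order follow the text):

```python
import pandas as pd

def clean_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce fields to numeric and impute gaps chronologically."""
    out = df.apply(pd.to_numeric, errors="coerce")  # non-numeric values -> NaN
    return out.ffill().bfill()                      # forward- then backward-fill
```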
(ii) Feature scaling. All numerical features are normalized using Min–Max scaling,
$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}},$
where the scaling parameters $x_{\min}$ and $x_{\max}$ are estimated exclusively from the training set.
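The train-only estimation of the scaling parameters can be made explicit with a small helper (a sketch; the paper does not specify its implementation):

```python
import numpy as np

class TrainOnlyMinMax:
    """Min-Max scaler whose parameters come from the training split only."""

    def fit(self, x_train: np.ndarray) -> "TrainOnlyMinMax":
        self.min_ = x_train.min(axis=0)
        self.range_ = x_train.max(axis=0) - self.min_
        self.range_[self.range_ == 0] = 1.0  # guard against constant features
        return self

    def transform(self, x: np.ndarray) -> np.ndarray:
        return (x - self.min_) / self.range_
```

Note that test-set values may legitimately fall outside $[0, 1]$ when the test period contains new extrema; recalibrating on the test set to avoid this would leak future information.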
(iii) Sliding-window formation. Supervised learning samples are constructed using input sequences of length $L$, yielding the following input–output pairs:
$\left( \mathbf{x}_{t-L+1}, \ldots, \mathbf{x}_{t} \right) \longmapsto y_{t+1},$
where $\mathbf{x}_t$ denotes the multivariate feature vector at time $t$ and $y_{t+1}$ is the next-week closing price.
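The window construction can be sketched as follows (a minimal version; for $T$ weekly observations it yields $T - L$ supervised samples):

```python
import numpy as np

def make_windows(features: np.ndarray, target: np.ndarray, L: int):
    """Build (X, y) pairs from a (T, F) feature matrix and (T,) target series."""
    X, y = [], []
    for t in range(L, len(features)):
        X.append(features[t - L:t])  # feature vectors for weeks t-L .. t-1
        y.append(target[t])          # next-week target
    return np.stack(X), np.array(y)
```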
To prevent any form of temporal information leakage, all preprocessing steps are performed strictly within each training window. Missing-value imputation (forward and backward filling) is applied only to the training segment of each rolling window prior to constructing the corresponding validation and test sets. Min–Max normalization is likewise fitted exclusively on the training data within each window, and the learned scaling parameters are subsequently applied to the validation and test sets without recalibration. This window-wise, train-only preprocessing strategy ensures that no future information—such as unseen extrema or forward-filled values—can influence model training, thereby preserving the temporal causality required for valid forecasting evaluation.
(iv) Robust temporal splitting. A chronological 20–80% split is adopted for training and testing. To reduce sensitivity to a single split point, this procedure is repeated ten times by sliding the split boundary forward in time. All models are retrained from scratch for each repetition, and the reported results correspond to the average performance across all runs.
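The repeated-split procedure can be sketched as a generator of chronological boundaries. The stride rule below is an assumption for illustration: the paper states only that the boundary is advanced ten times and models are retrained for each run.

```python
import numpy as np

def rolling_split_points(n: int, train_frac: float, n_repeats: int):
    """Yield (train_end, n) index pairs with the split boundary sliding forward.

    Train on [0, train_end), test on [train_end, n); the stride between
    successive boundaries is a hypothetical, evenly spaced choice.
    """
    base = int(n * train_frac)
    stride = max(1, (n - base) // (n_repeats + 1))
    for k in range(n_repeats):
        yield base + k * stride, n
```

Averaging metrics over the resulting runs reduces sensitivity to any single split point, as described above.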
To ensure a fully causally valid forecasting setup, all models are trained using sliding-window sequences in which observations from weeks $t-L+1$ through $t$ are used to predict week $t+1$. Technical indicators are computed strictly from historical data within the input window, and no information from future timestamps is incorporated. Feature normalization is performed exclusively on the training split of each rolling window, and the resulting parameters are reused for the validation and test sets. This approach prevents leakage arising from global scaling or indicator computation outside the training horizon. The inclusion of econometric baselines further confirms that the forecasting task remains nontrivial, as these classical models consistently exhibit inferior performance compared to the transformer-based architectures.
3.3. Model Architectures and Training Setup
Six architectures are evaluated in this study: LSTM, GPT-2, Informer, Autoformer, Temporal Fusion Transformer (TFT), and a Vanilla Transformer encoder. All models share comparable hyperparameter settings—such as sequence length, batch size, and optimizer—to isolate architectural effects and ensure a fair comparison. To ensure fair and architecture-appropriate optimization, all models are tuned through a structured grid-search procedure prior to the main experiments. Learning rates are explored over model-specific search spaces: one range for the LSTM and another for the transformer-based architectures (GPT-2, Informer, Autoformer, TFT, and Vanilla Transformer). The best-performing learning rate for each architecture is selected according to validation loss, with the LSTM and TFT sharing one optimal rate and the remaining transformer-based models another. The batch size is fixed at 64 to preserve comparability, as sensitivity analyses confirm that alternative batch sizes do not materially affect model rankings. Early stopping is applied to all models, allowing both recurrent and transformer architectures to converge at their naturally optimal pace. This tuning strategy ensures methodological fairness while preventing overfitting to individual temporal splits.
3.3.1. LSTM
Long Short-Term Memory (LSTM) networks [31] extend recurrent neural networks by incorporating input, forget, and output gates, which enable selective memory retention and mitigate the vanishing-gradient problem. The internal memory cell facilitates the modeling of long-range temporal dependencies, making LSTMs well suited for financial time-series forecasting. The model receives an input tensor of shape $(B, L, F)$, where $B$ denotes the batch size, $L$ the look-back window length, and $F$ the number of engineered features. The architecture consists of two stacked LSTM layers, each with 64 hidden units, with a dropout rate of 0.2 applied to recurrent outputs. The final hidden state is passed through a fully connected layer to produce the next-week closing price forecast. Training employs the Adam optimizer with the learning rate selected by the grid search described above, batch size 64, and a maximum of 100 epochs, with early stopping applied.
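The gating mechanism that underlies this architecture can be illustrated with a single NumPy LSTM step. This is a didactic sketch only (weight shapes and the i, f, o, g gate ordering are conventions we assume; deep learning frameworks implement this internally):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W: (4H, F), U: (4H, H), b: (4H,); gates ordered i, f, o, g."""
    z = W @ x + U @ h + b
    H = h.shape[0]
    i = sigmoid(z[0:H])          # input gate: how much new information to admit
    f = sigmoid(z[H:2 * H])      # forget gate: how much old memory to keep
    o = sigmoid(z[2 * H:3 * H])  # output gate: how much memory to expose
    g = np.tanh(z[3 * H:4 * H])  # candidate cell update
    c_new = f * c + i * g        # selective memory retention
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

The additive cell update `f * c + i * g` is what mitigates vanishing gradients relative to plain RNNs.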
3.3.2. GPT-2
GPT-2 [32] is a decoder-only transformer architecture originally developed for autoregressive sequence modeling in natural language processing. Its multi-head self-attention mechanism enables each time step to attend selectively to all preceding steps, providing substantially greater capacity for long-range temporal dependency modeling than recurrent architectures, which propagate information solely through a single hidden state.
To adapt GPT-2 for multivariate numerical forecasting, each time step in the weekly time series is treated as a structured token comprising $F$ features. Rather than using a simple linear projection, the model employs a learned feature embedding layer that maps each multivariate input vector into the transformer model dimension $d_{\text{model}}$. This embedding mechanism serves as a continuous analogue of token embeddings in language modeling. Sinusoidal positional encodings are subsequently added to preserve the temporal ordering of observations.
Autoregressive input sequences of length L are constructed, and causal masks ensure that each position attends only to previous time steps, thereby maintaining forecasting validity. The decoder comprises four transformer blocks with four attention heads and a feed-forward dimensionality of 256. Unlike standard GPT-2, the vocabulary projection head is removed and replaced with a regression head that maps the hidden representation of the final token to the predicted next-week closing price.
A key advantage of GPT-2 in financial forecasting lies in its ability to form direct connections between temporally distant observations. Whereas recurrent models are constrained by vanishing gradients, the self-attention mechanism computes
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V,$
allowing the model to assign learnable relevance weights to long-range patterns such as volatility cycles, structural breaks, and asymmetric shock responses that are characteristic of cryptocurrency markets.
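The causally masked self-attention used by the decoder can be sketched in NumPy for a single head (a didactic, unbatched version):

```python
import numpy as np

def causal_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V with a causal (lower-triangular) mask."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (L, L) relevance scores
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)             # block attention to future steps
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)                # row-wise softmax
    return w @ V
```

The upper-triangular mask is exactly what "each position attends only to previous time steps" means: position 0 can only attend to itself, position 1 to steps 0–1, and so on.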
During inference, forecasting proceeds autoregressively. After predicting the next-week value $\hat{y}_{t+1}$, this prediction is appended to the input sequence and fed back into the model to generate $\hat{y}_{t+2}$, and so forth. This free-running rollout mirrors GPT-style generative decoding and ensures consistency between training (teacher forcing on observed sequences) and testing (recursive prediction), enabling the model to capture evolving temporal dynamics across multiple forecasting steps.
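The rollout loop can be sketched generically; `model` below is any callable mapping a look-back window to the next value (the persistence model in the test is purely illustrative, not the paper's forecaster):

```python
import numpy as np

def rollout(model, history: np.ndarray, steps: int) -> np.ndarray:
    """Free-running multi-step forecast: each prediction is fed back as input."""
    window = list(history)
    preds = []
    for _ in range(steps):
        y_hat = model(np.asarray(window))
        preds.append(y_hat)
        window.append(y_hat)   # append the prediction, GPT-style
        window.pop(0)          # keep the look-back length fixed
    return np.array(preds)
```

Note that forecast errors compound under this scheme, which is why consistency between the training objective and the recursive test-time procedure matters.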
Training is performed using the Adam optimizer with the learning rate selected by the grid search described above and a dropout rate of 0.1. The model is optimized using mean squared error (MSE) rather than cross-entropy, aligning the learning objective with continuous-valued financial forecasting. Collectively, these adaptations enable GPT-2 to capture cross-feature interactions, long-range dependencies, and temporal asymmetries in multivariate cryptocurrency markets.
3.3.3. Informer
Informer [7] improves scalability for long input sequences by introducing ProbSparse attention, which prioritizes the most informative query–key interactions instead of uniformly attending to all time steps. This reduces computational complexity from $\mathcal{O}(L^2)$ to approximately $\mathcal{O}(L \log L)$, making the model particularly suitable for long-horizon forecasting in high-dimensional settings. A distillation mechanism further compresses the sequence hierarchically, enhancing efficiency when processing extended historical contexts.
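The core of ProbSparse attention is a sparsity measure that ranks queries by how far their maximum attention score departs from their mean score; only the top-$u$ queries receive full attention. A simplified sketch (the actual Informer implementation additionally subsamples keys before scoring):

```python
import numpy as np

def top_u_queries(Q, K, u):
    """Rank queries by the ProbSparse sparsity measure and keep the top u.

    A query with a 'spiky' score distribution (max >> mean) carries more
    information than one whose attention is nearly uniform.
    """
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # (L_q, L_k) raw scores
    sparsity = scores.max(axis=1) - scores.mean(axis=1)
    return np.argsort(sparsity)[::-1][:u]            # indices of dominant queries
```

Queries not selected are handled cheaply (e.g., with a mean-pooled value), which is where the complexity reduction comes from.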
In cryptocurrency markets, price dynamics often exhibit short but pronounced volatility bursts rather than uniformly distributed fluctuations. ProbSparse attention naturally aligns with this structure by selectively emphasizing temporally dominant movements while suppressing noise. This property is consistent with Informer’s strong performance on XRP and XLM in our experiments—assets known for irregular liquidity shocks and sporadic volatility. Its ability to stabilize long-range representations under noisy conditions provides an advantage over dense-attention transformers.
The implementation employs a two-layer encoder, a single-layer decoder, model dimension 128, four attention heads, and a feed-forward dimension of 256. Dropout is set to 0.1, and optimization is performed using Adam with the learning rate selected by the grid search described above.
3.3.4. Autoformer
Autoformer [8] introduces decomposition blocks that explicitly separate each input sequence into trend and seasonal components prior to attention-based processing. This structure enables the model to capture periodicity, cyclical financial behavior, and slowly evolving long-term trajectories more effectively than standard transformers. Autoformer further replaces dot-product attention with auto-correlation attention, which identifies repeated temporal patterns at a reduced computational cost.
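The decomposition idea can be sketched as a moving-average split (a simplification: Autoformer applies padded average-pooling blocks inside the network, and the kernel size here is illustrative):

```python
import numpy as np

def series_decomp(x: np.ndarray, kernel: int):
    """Split a series into trend (moving average) and seasonal (residual) parts."""
    pad = kernel // 2
    # Edge-pad so the moving average is defined over the full length.
    padded = np.concatenate([np.repeat(x[0], pad), x, np.repeat(x[-1], pad)])
    trend = np.convolve(padded, np.ones(kernel) / kernel, mode="valid")[:len(x)]
    seasonal = x - trend
    return trend, seasonal
```

By construction `trend + seasonal` reconstructs the input, so the two branches can be modeled separately and recombined without losing information.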
This decomposition-based inductive bias is particularly beneficial for assets such as Ethereum (ETH), where smoother oscillatory patterns and identifiable long-horizon structures are more prominent than sharp volatility spikes. The empirical results confirm this alignment: Autoformer achieves the lowest MAPE on ETH, indicating that its trend–seasonality separation effectively captures ETH’s macro-level temporal organization. Conversely, for assets dominated by rapid regime shifts (e.g., SOL), the same inductive bias becomes restrictive, leading to weaker performance.
The architecture used in this study includes a two-layer encoder, a single-layer decoder, model dimension 128, four attention heads, and a feed-forward dimension of 256. Decomposition blocks operate at every layer, and dropout is set to 0.1. Training employs the Adam optimizer with the learning rate selected by the grid search described above.
3.3.5. Temporal Fusion Transformer
The Temporal Fusion Transformer (TFT) [9] integrates recurrent encoders, gating mechanisms, and interpretable attention layers to dynamically adjust the contribution of multivariate temporal features to forecasting. Variable selection networks (VSNs) automatically identify the most relevant technical indicators at each time step, while gated residual networks (GRNs) regulate information flow to enhance robustness and prevent overfitting. Temporal attention layers further provide interpretability by highlighting historical intervals with the strongest predictive influence.
These design components render TFT particularly effective in nonstationary environments characterized by abrupt behavioral shifts. This observation is consistent with our findings: TFT achieves the best performance on Solana (SOL), whose price dynamics are strongly influenced by speculative cycles, structural breaks, and rapid liquidity changes. TFT’s dynamic feature-weighting capability enables the model to reallocate importance across indicators as market conditions evolve, offering an advantage over architectures that assume stationarity or periodicity.
The model is configured with a hidden size of 64 in recurrent and gating components, four attention heads, and a dropout rate of 0.2. Training uses the Adam optimizer with the learning rate selected by the grid search described above.
3.3.6. Vanilla Transformer
The Vanilla Transformer follows the original encoder-style architecture introduced by Vaswani et al. [
6]. The model operates on an input tensor of shape $(L, F)$, where $L$ denotes the historical look-back window and $F$ represents the multivariate feature dimension. A learnable linear projection maps each feature vector to the model dimension $d_{\text{model}}$, and sinusoidal positional encodings are added to preserve temporal order. The encoder consists of two stacked transformer blocks, each containing a multi-head self-attention layer with four heads, followed by a position-wise feed-forward network with hidden dimension 256. Residual connections and layer normalization are applied after each sublayer, and a dropout rate of 0.1 is used throughout. The final encoder output corresponding to the last time step is passed through a regression head to predict the next-week closing price. Training is performed using the Adam optimizer.
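As an illustrative sketch of one component of this encoder, the sinusoidal positional encoding added to the projected input can be computed as follows. The window length of 52 weeks, the function name, and the random input are our own assumptions for demonstration, not part of the study's configuration:

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Standard sinusoidal encoding from Vaswani et al. (2017)."""
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dimensions
    pe[:, 1::2] = np.cos(angles)                   # odd dimensions
    return pe

# Add encodings to a projected input window of shape (L, d_model)
L_window, d_model = 52, 128                        # assumed 52-week look-back
x = np.random.randn(L_window, d_model)             # stand-in for the linear projection output
x_encoded = x + sinusoidal_positional_encoding(L_window, d_model)
```

Because the encoding depends only on position and dimension, it can be precomputed once and reused for every training window.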
3.4. Forecasting Pipeline
The overall forecasting pipeline—from Binance data acquisition and weekly aggregation to technical indicator construction, normalization, sliding-window sequence generation, repeated train–test splits, and model training across all six architectures—is summarized in the workflow diagram presented in
Figure 1. Each stage contributes a distinct component to the proposed framework, which is designed to ensure methodological consistency across all assets and models.
The workflow begins with data collection, in which weekly cryptocurrency market data are obtained from the Binance REST API using the
python-binance client. Daily OHLCV (open, high, low, close, volume) records are retrieved for Bitcoin, Ethereum, Ripple, Stellar, and Solana. These raw daily series form the basis for all subsequent processing steps and are aligned using the standardized exchange-generated timestamps. Following data acquisition, the pipeline proceeds to feature engineering. A comprehensive set of technical indicators is computed using the
pandas_ta library, encompassing trend-following, momentum-based, volatility-sensitive, and volume-derived measures. Such indicators are widely adopted in quantitative finance and have been shown to enhance the predictive performance of machine learning models when modeling nonlinear price dynamics [
24,
25]. The resulting feature matrix captures diverse aspects of market behavior, including trend shifts, cyclical regimes, and volume–price interactions.
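For illustration, two of the indicator families mentioned above (trend-following and momentum) can be re-implemented in plain pandas. The function names, window lengths, and toy price series below are our own choices; the study itself computes its indicators with the pandas_ta library rather than this sketch:

```python
import pandas as pd

def sma(close: pd.Series, window: int = 20) -> pd.Series:
    """Simple moving average: a basic trend-following indicator."""
    return close.rolling(window).mean()

def rsi(close: pd.Series, window: int = 14) -> pd.Series:
    """Relative Strength Index: a momentum indicator with Wilder-style smoothing."""
    delta = close.diff()
    gain = delta.clip(lower=0).ewm(alpha=1 / window, adjust=False).mean()
    loss = (-delta.clip(upper=0)).ewm(alpha=1 / window, adjust=False).mean()
    rs = gain / loss
    return 100 - 100 / (1 + rs)

# Toy closing-price series (illustrative data only)
close = pd.Series([100, 101, 103, 102, 105, 107, 106, 108, 110, 109,
                   111, 113, 112, 115, 117, 116, 118, 120, 119, 121], dtype=float)
features = pd.DataFrame({"close": close,
                         "sma_5": sma(close, 5),
                         "rsi_14": rsi(close, 14)})
```

Rolling-window indicators such as these produce NaN values during their warm-up period, which is why the pipeline's imputation step described below is needed.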
Subsequently, the workflow advances to preprocessing, where daily observations are aggregated into weekly intervals to ensure temporal consistency across assets. Weekly bars are constructed using the first daily open, the final daily close, the weekly extrema (high and low), and the cumulative weekly trading volume. Missing observations arising from exchange outages, indicator warm-up periods, or listing inconsistencies are imputed using a combination of forward- and backward-filling. All input features are then rescaled using Min–Max normalization, ensuring uniform numerical ranges and improving optimization stability across heterogeneous model architectures, consistent with standard practice in financial time-series modeling [
4].
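A minimal sketch of the weekly aggregation, gap imputation, and Min–Max normalization steps, assuming daily OHLCV data in a date-indexed pandas DataFrame. The synthetic input and variable names are illustrative:

```python
import numpy as np
import pandas as pd

# Hypothetical daily OHLCV frame (28 days of synthetic data)
idx = pd.date_range("2024-01-01", periods=28, freq="D")
rng = np.random.default_rng(0)
daily = pd.DataFrame({
    "open": 100 + rng.standard_normal(28).cumsum(),
    "close": 100 + rng.standard_normal(28).cumsum(),
    "volume": rng.uniform(1e3, 1e4, 28),
}, index=idx)
daily["high"] = daily[["open", "close"]].max(axis=1) + 1
daily["low"] = daily[["open", "close"]].min(axis=1) - 1

# Weekly bars: first open, weekly extrema, last close, summed volume
weekly = daily.resample("W").agg(
    {"open": "first", "high": "max", "low": "min", "close": "last", "volume": "sum"}
)

# Forward- then backward-fill gaps, then Min-Max scale each column to [0, 1]
weekly = weekly.ffill().bfill()
scaled = (weekly - weekly.min()) / (weekly.max() - weekly.min())
```

In a production pipeline the scaler statistics would be fitted on the training window only and reused on the test window to avoid look-ahead leakage.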
After preprocessing, the pipeline advances to the modeling stage. A unified sliding-window mechanism is applied to generate supervised learning samples of length L, with each input window paired with the subsequent week's closing price as the prediction target. The evaluated models include a stacked LSTM network [
31], an adapted GPT-2 architecture for autoregressive numerical sequences [
32], the Informer model with ProbSparse attention for long-range forecasting [
7], the Autoformer architecture that explicitly decomposes time series into trend and seasonal components [
8], the Temporal Fusion Transformer (TFT) incorporating variable selection and temporal attention mechanisms [
9], and a standard Vanilla Transformer encoder based on the original attention formulation [
6]. All models are trained independently under identical optimization settings, including the Adam optimizer, a shared batch size, and early stopping based on validation loss. Although the maximum epoch limit is fixed at 100, recurrent architectures typically converge earlier, whereas transformer-based models generally require longer training horizons.
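The sliding-window sample generation described above can be sketched as follows; the array sizes, window length, and function name are illustrative assumptions:

```python
import numpy as np

def make_windows(features: np.ndarray, target: np.ndarray, L: int):
    """Pair each length-L window of features with the next step's target value."""
    X, y = [], []
    for t in range(len(features) - L):
        X.append(features[t:t + L])   # input window of L consecutive weeks
        y.append(target[t + L])       # the following week's closing price
    return np.stack(X), np.array(y)

T, F = 100, 8                         # 100 weeks, 8 features (illustrative sizes)
feats = np.random.randn(T, F)
close = np.random.randn(T)
X, y = make_windows(feats, close, L=12)
# X has shape (T - L, L, F) and y has shape (T - L,)
```

Because every model consumes tensors of the same shape, this single windowing routine keeps the comparison across all six architectures methodologically symmetric.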
The workflow concludes with the evaluation phase, during which each trained model produces weekly closing-price predictions that are compared against ground-truth values using standard forecasting error metrics. To enhance robustness and reduce sensitivity to a single temporal split, the entire pipeline is repeated across ten distinct rolling train–test partitions. Performance metrics are then averaged over these repetitions to obtain stable and reliable generalization estimates.
4. Experimental Results
This section presents the empirical performance of six deep learning architectures—LSTM, GPT-2, Informer, Autoformer, Temporal Fusion Transformer (TFT), and a Vanilla Transformer—evaluated across five major cryptocurrencies (BTC, ETH, XRP, XLM, and SOL). Each model predicts the next-week closing price using the multivariate feature set described earlier. All evaluations follow the repeated chronological splitting scheme, in which each experiment is conducted ten times with shifted train–test boundaries; the reported metrics correspond to the average performance across these repetitions. Due to asset-specific preprocessing and feature construction, the number of valid test samples may differ across cryptocurrencies; accordingly, all models are evaluated on their respective test sequences.
Model accuracy is assessed using five standard forecasting metrics: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), Mean Absolute Percentage Error (MAPE), and the coefficient of determination ($R^2$). Together, these metrics capture absolute error magnitude, percentage deviation, and the explanatory power of the model. The metrics are computed as
$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(y_i - \hat{y}_i)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad \mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\lvert y_i - \hat{y}_i\rvert,$$
$$\mathrm{MAPE} = \frac{1}{N}\sum_{i=1}^{N}\left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert, \qquad R^2 = 1 - \frac{\sum_{i=1}^{N}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{N}(y_i - \bar{y})^2}.$$
In the expressions above, $y_i$ denotes the observed closing price at time $i$, while $\hat{y}_i$ denotes the corresponding model prediction. The term $\bar{y}$ denotes the mean of the observed target values and serves as a baseline for computing the explained variance in $R^2$. The variable $N$ indicates the total number of samples in the evaluation set. Collectively, these metrics quantify both absolute and relative deviations between predictions and ground truth, enabling a comprehensive assessment of model accuracy, stability, and explanatory capability. Lower MSE, RMSE, MAE, and MAPE values indicate better forecasting accuracy, whereas higher $R^2$ values indicate stronger explanatory capability.
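The five metrics can be computed directly from predictions and ground truth, as in the following generic sketch (the function name and toy values are illustrative, not the study's evaluation code):

```python
import numpy as np

def forecast_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Compute MSE, RMSE, MAE, MAPE, and R^2 for a set of point forecasts."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err) / np.abs(y_true))   # fraction; multiply by 100 for percent
    r2 = 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"MSE": mse, "RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2}

# Toy example: four weekly closing prices and their forecasts
y_true = np.array([100.0, 102.0, 105.0, 103.0])
y_pred = np.array([101.0, 101.0, 104.0, 104.0])
m = forecast_metrics(y_true, y_pred)
```

Note that MAPE is undefined when an observed value is zero; this is not an issue for closing prices, which are strictly positive.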
To provide statistical rigor beyond point estimates, all models were evaluated across ten rolling windows, yielding ten independent error measurements per asset and per architecture. Furthermore, to assess whether differences in forecasting accuracy are statistically significant, Wilcoxon signed-rank tests were conducted on paired MAPE distributions between competing models. Transformer-based architectures demonstrated statistically significant improvements over the LSTM baseline across all assets, while the best-performing transformer variants achieved significant gains over other deep models in four out of five assets. This analysis confirms that the observed performance differences reflect genuine modeling advantages rather than random variation.
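A paired Wilcoxon signed-rank test on per-window MAPE values can be run as in the sketch below. The two MAPE arrays are fabricated placeholders for illustration only, not the study's measurements, and SciPy is assumed to be available:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical per-window MAPE values for two competing models (10 rolling splits)
mape_lstm        = np.array([0.071, 0.068, 0.074, 0.070, 0.069,
                             0.073, 0.072, 0.075, 0.066, 0.070])
mape_transformer = np.array([0.031, 0.029, 0.034, 0.030, 0.028,
                             0.033, 0.032, 0.035, 0.027, 0.031])

# Paired, one-sided test: does the transformer yield lower MAPE than the LSTM?
stat, p_value = wilcoxon(mape_lstm, mape_transformer, alternative="greater")
```

The signed-rank test is appropriate here because the two error series are paired by rolling window and no distributional assumption is made about the MAPE differences.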
The comparative evaluation of forecasting models presented in
Table 2 now incorporates both classical econometric baselines (GARCH, VAR, ARIMA, and Random Walk) and modern deep learning architectures. This extended comparison enables a clearer assessment of whether advanced neural models provide tangible improvements beyond standard financial forecasting techniques. Because the Mean Absolute Percentage Error (MAPE) is scale-independent and facilitates consistent evaluation across assets with vastly different price levels, the discussion below focuses primarily on MAPE.
Across all five cryptocurrencies, the classical models exhibit substantially higher MAPE values than the deep learning architectures. Even the strongest classical baseline—GARCH—performs notably worse than the best transformer-based models. This suggests that, under a unified multivariate feature space enriched with technical indicators, deep learning models can exploit complex nonlinear dependencies that classical linear or variance-based models cannot fully capture.
A closer inspection of the asset-level results supports this conclusion. For Bitcoin (BTC), the best MAPE among classical methods is achieved by GARCH (0.0544), which remains nearly twice that of the best deep learning model, GPT-2 (0.0289). This margin indicates that BTC’s long-range temporal structure is better captured by autoregressive attention mechanisms than by conditional variance dynamics alone. For Ethereum (ETH), Autoformer yields the strongest performance (MAPE = 0.0198), markedly outperforming GARCH (0.0410), suggesting that ETH’s smoother cyclical patterns align well with decomposition-based transformer architectures. For XRP and XLM—both characterized by abrupt microstructural fluctuations—Informer achieves the lowest MAPE values (0.0418 and 0.0469, respectively), leveraging ProbSparse attention to focus selectively on informative temporal segments, an ability not available to classical models. For Solana (SOL), TFT provides the best forecasts (MAPE = 0.0578), highlighting the value of dynamic variable selection in capturing frequent regime shifts. In contrast, classical models, particularly ARIMA and Random Walk, perform substantially worse on SOL, yielding MAPE values above 0.12 and 0.17, respectively.
When comparing deep learning models across assets, no single architecture dominates universally; instead, each excels under particular volatility patterns and temporal characteristics. GPT-2 performs best when long-term contextual dependence is dominant, whereas Informer excels in high-volatility assets that benefit from selective attention. Autoformer is particularly effective when trend–seasonal decomposability plays a significant role. TFT outperforms others in markets with rapid structural changes, where adaptive gating and feature selection are advantageous. The Vanilla Transformer performs reasonably well but lacks the specialized mechanisms that enable state-of-the-art transformers to capture finer temporal properties. LSTM consistently underperforms transformer-based models, primarily due to its limited capacity to model very long-range dependencies.
In contrast, classical econometric models exhibit consistent limitations. Their performance is strongest when volatility follows relatively stable conditional variance structures (e.g., BTC or ETH under GARCH); however, they remain unable to model higher-order nonlinearities, cross-feature interactions, or long-distance dependencies captured by transformer architectures. Furthermore, Random Walk—a widely used naive baseline in financial forecasting—consistently yields the weakest performance across assets, reaffirming the need for more expressive predictive frameworks.
In summary, the key findings derived from the extended comparative analysis—including both classical econometric and deep learning approaches—are as follows:
Deep learning architectures substantially outperform classical baselines (GARCH, VAR, ARIMA, and Random Walk) across all assets and all metrics.
Among classical models, GARCH performs best but remains significantly inferior to transformer-based architectures.
No single deep learning model is universally optimal; transformer variants excel under different volatility structures and temporal behaviors.
Informer is most effective for high-volatility assets due to its sparse-attention mechanism.
Autoformer performs best when the asset exhibits stable cyclical trends.
GPT-2 provides strong results in assets with rich long-term autoregressive structures.
TFT is most effective under frequent regime shifts that require dynamic feature weighting.
LSTM consistently underperforms transformer-based models due to its limited horizon for temporal dependency modeling.
Although
Figure 2 and
Figure 3 serve a visual role, they also provide statistically meaningful insights that complement
Table 2. The loss curves reveal optimization stability, convergence behavior, and overfitting risk—properties that cannot be inferred from scalar error metrics alone. Likewise, the actual vs. predicted plots expose systematic prediction biases, lag in directional changes, volatility underestimation, and residual structure over time. These characteristics are essential for evaluating the statistical reliability and temporal generalization of forecasting models; therefore, the figures fulfill an analytical rather than merely descriptive purpose.
The training and validation loss curves presented in
Figure 2 provide a detailed depiction of optimization dynamics across all five cryptocurrencies. In each panel, both loss trajectories exhibit a smooth and largely monotonic decline, with no sustained divergence between training and validation curves. This behavior indicates that the unified preprocessing and model configuration yield stable convergence, without optimization anomalies such as gradient explosions or oscillatory updates. The close alignment of training and validation losses throughout training also suggests the absence of overfitting, indicating that early stopping effectively prevents models from fitting noise or idiosyncratic fluctuations in the training data.
A cross-asset inspection reveals differences in convergence speed and the sharpness of loss decay. BTC and ETH display the steepest initial decreases, reflecting stronger signal-to-noise characteristics in high-liquidity markets, where model parameters adapt rapidly to dominant temporal patterns. XRP and XLM also converge smoothly but with more gradual loss reduction, consistent with their higher susceptibility to microstructural noise and episodic volatility. SOL follows a similar trend, although its tail-end plateau occurs marginally later due to more irregular historical dynamics. Despite these differences, all assets ultimately reach low and stable validation losses, confirming that the selected architectures generalize effectively under the unified weekly forecasting setting.
The overall consistency of the curves indicates that the feature engineering pipeline—particularly the integration of price, volume, and technical indicators—produces sufficiently informative representations to support model training across heterogeneous assets. Moreover, the absence of widening gaps between training and validation losses across panels suggests that none of the models exhibits instability or overfitting, reinforcing the robustness of the experimental protocol. These observations justify the reliability of the subsequent comparative performance analysis and confirm that differences in forecasting accuracy across architectures stem from intrinsic modeling capabilities rather than optimization artifacts.
Figure 3 compares predicted and observed weekly closing prices for all five cryptocurrencies in the test sets. Across assets, the predicted trajectories closely follow the corresponding ground-truth series, indicating that the models capture both long-term directional movements and short-term fluctuations. For BTC, the model traces the upward trend and local oscillations, with larger deviations primarily during abrupt surges associated with intensified volatility. ETH exhibits a similarly strong correspondence, particularly in the mid-range of the test window, suggesting robust learning of medium-term momentum dynamics. The XRP and XLM series show strong predictive alignment, capturing both the early downward drift and subsequent recovery phases; turning points are followed with limited lag, indicating effective generalization under noise-dominant structures. SOL is the most volatile asset; nevertheless, the model maintains consistent tracking, with mild underestimation during rapid upward movements while preserving the overall trend direction and volatility range. These observations are consistent with the loss-curve behavior and further support the stability of the experimental setup.
Overall, the close correspondence between actual and predicted trajectories across assets is consistent with the loss patterns and suggests effective generalization to unseen data. The models reproduce asset-specific temporal dynamics under varying market conditions.
Figure 4,
Figure 5,
Figure 6,
Figure 7 and
Figure 8 present a model-wise comparison of forecasting accuracy in terms of MAPE for each cryptocurrency. The bar charts reveal pronounced asset-dependent performance patterns, indicating that no single architecture universally dominates across markets. For BTC and ETH, models with strong long-range dependency modeling capabilities—particularly GPT-2 and Autoformer—achieve the lowest errors, reflecting the relatively stable and trend-driven dynamics of high-liquidity assets. In contrast, for XRP and XLM, Informer consistently outperforms other architectures, suggesting that its sparse-attention mechanism is better suited to assets characterized by abrupt volatility bursts and short-lived informational cycles. SOL exhibits distinct behavior, with TFT achieving the best performance, highlighting the advantage of dynamic feature selection and gating mechanisms under frequent regime changes and speculative market conditions. Classical econometric baselines perform substantially worse across all assets, reinforcing that the observed accuracy gains are driven by deep learning architectures rather than asset-specific predictability. Overall, these asset-wise bar charts provide complementary evidence to
Table 2, demonstrating that model effectiveness is tightly coupled with the volatility structure and temporal characteristics of each cryptocurrency.
Figure 9 provides a compact, global view of forecasting performance by visualizing MAPE values across all model–asset combinations. Unlike the asset-wise bar charts, the heatmap highlights relative performance contrasts and systematic patterns that emerge across architectures and markets. Darker regions indicate higher forecasting errors, whereas lighter regions correspond to superior accuracy.
Several structured trends are immediately observable. Transformer-based models dominate the low-error regions of the heatmap, while classical econometric approaches consistently occupy high-error zones across assets, confirming their limited capacity to capture nonlinear and multivariate dependencies. Among deep learning models, GPT-2 exhibits uniformly low MAPE values for BTC and ETH, reflecting its strength in modeling long-range autoregressive structures. Informer forms a distinct low-error band for XRP and XLM, indicating that sparse attention mechanisms are particularly effective under high-volatility and noise-dominant market conditions. TFT achieves the lowest error concentration for SOL, reinforcing the importance of dynamic feature selection and gating under frequent regime changes.
The heatmap further shows that performance differences across models are asset-dependent rather than uniform. No architecture achieves consistently minimal error across all cryptocurrencies, underscoring that forecasting accuracy is governed by the interaction between model inductive biases and asset-specific volatility structures. This visualization complements
Table 2 by emphasizing cross-asset heterogeneity and clarifying why different architectures emerge as optimal under different market conditions.
5. Discussion
The empirical findings reveal clear distinctions in forecasting behavior across the evaluated models and cryptocurrencies. Transformer-based architectures consistently outperform the recurrent baseline, highlighting their capacity to capture nonlinear dynamics and long-range temporal dependencies that commonly arise in financial time series. GPT-2, in particular, demonstrates stable and strong performance across most assets, plausibly due to its autoregressive attention mechanism, which can represent extended directional movements observed in high-liquidity markets such as Bitcoin and Ethereum.
Informer achieves superior accuracy for assets with more irregular volatility patterns, such as XRP and XLM. Its sparse-attention formulation and sequence-distillation mechanism allow the model to focus on informative temporal segments and to respond effectively to rapid oscillations and short-lived micro-pattern variations. By contrast, Autoformer and TFT do not achieve comparable accuracy under weekly aggregation, suggesting that decomposition-based or hybrid architectures may be better suited to higher-frequency data (e.g., daily) or explicitly multi-horizon settings rather than coarser temporal resolutions.
The Vanilla Transformer yields competitive results but does not surpass the more specialized variants, indicating that architectural refinements—such as sparse attention, decomposition mechanisms, or autoregressive decoding—provide meaningful advantages for financial forecasting. In addition, the loss curves across all assets exhibit stable convergence, with training and validation losses remaining closely aligned. This behavior supports the effectiveness of early stopping and strengthens confidence in the reliability of the modeling pipeline. The predicted–actual plots further reinforce these observations, showing that the models capture both local fluctuations and broader market trends with high fidelity.
It should be noted that references to concepts such as volatility structure or regime-like shifts are made purely in a descriptive, qualitative sense based on observed forecasting behavior. The study does not perform formal statistical tests for structural breaks, asymmetry, or regime transitions; therefore, these terms should not be interpreted as econometric claims but rather as conceptual explanations of why certain architectures exhibit stronger performance.
Beyond the predictive comparisons, it is important to contextualize the findings within the well-documented econometric properties of cryptocurrency markets. Prior research has reported that major digital assets exhibit persistent and asymmetric volatility, long-memory behavior, and regime-dependent dynamics that can differ substantially across market phases such as bull, bear, and consolidation periods [
33,
34,
35]. These characteristics help interpret the heterogeneous performance patterns observed across architectures. For example, GPT-2’s strength on BTC and ETH is consistent with evidence that highly capitalized assets often display stronger long-range dependence, whereas Informer’s superior accuracy on XRP and XLM aligns with findings that smaller-cap assets can experience sharper volatility bursts and shorter-lived informational cycles. Similarly, the weaker performance of decomposition-based models such as Autoformer on SOL is consistent with the asset’s more irregular dynamics, which can limit the utility of trend–seasonal separation. Although the present study does not conduct formal econometric tests for structural breaks or volatility asymmetry, the forecasting behavior of the evaluated architectures remains broadly consistent with these documented stylized facts, providing an interpretable link between model mechanisms and underlying market conditions.
6. Conclusions
This study provides a unified and methodologically symmetric comparison of six deep learning architectures—LSTM, GPT-2, Informer, Autoformer, TFT, and a Vanilla Transformer—together with four classical econometric baselines for multivariate cryptocurrency forecasting. Using fifteen years of OHLCV data enriched with a comprehensive set of technical indicators, the results show that transformer-based architectures consistently outperform both recurrent models and traditional financial forecasting methods. These gains are attributed to their ability to model long-range dependencies, adapt to nonlinear volatility regimes, and exploit multivariate feature interactions more effectively than recurrent or variance-based approaches.
GPT-2 delivers the most consistent performance for BTC, ETH, and SOL, whereas Informer demonstrates clear advantages for XRP and XLM—assets characterized by sharper volatility bursts. This pattern suggests that different attention mechanisms align with distinct market behaviors. The inclusion of econometric baselines further strengthens the conclusions: ARIMA, VAR, GARCH, and the Random Walk model yield substantially higher forecast errors, confirming that the observed improvements are attributable to the modeling capacity of transformer-based architectures rather than to the weekly sampling frequency or the preprocessing strategy.
The convergence behavior observed in the loss curves and the close agreement between predicted and actual trajectories indicate that the models generalize well without evident overfitting. The findings also suggest that weekly forecasting benefits from architectures that balance contextual expressiveness with computational efficiency, while the multivariate technical-indicator feature space increases predictive signal density across all evaluated assets.
Overall, the results establish the proposed experimental framework as a robust benchmark for future research in cryptocurrency forecasting. By enforcing strict methodological symmetry across models, providing a harmonized multivariate feature space, and evaluating architectures under identical temporal conditions, this work offers a reproducible foundation for assessing emerging transformer variants, multimodal fusion approaches, and uncertainty-aware forecasting methods in future studies.
6.1. Research Limitations
Although the empirical findings remain stable across multiple temporal splits, several limitations should be acknowledged. First, cryptocurrency time series are inherently nonstationary, and their distributional properties evolve across market cycles. The evaluated models implicitly assume locally stable dynamics within each training window; however, parameter drift and time-varying volatility may reduce generalizability during periods of structural change.
Second, the analysis does not explicitly model structural breaks such as abrupt market crashes, liquidity shocks, regulatory announcements, or network-driven events (e.g., Bitcoin halving cycles). Such events may alter the underlying data-generating process in ways that deep learning architectures can only partially accommodate without dedicated econometric testing or regime-switching formulations.
Third, while weekly aggregation attenuates noise, it does not fully remove microstructure-related effects, including bid–ask bounce, order-book thinness, and intraday volatility clustering. These phenomena may still influence return dynamics after aggregation and can affect model calibration and error characteristics.
Fourth, the feature set is limited to technical indicators derived from OHLCV data. Incorporating additional modalities—such as on-chain activity, macroeconomic variables, or sentiment signals—may improve robustness, particularly during periods of market dislocation and rapidly changing regimes.
Fifth, hyperparameter search is standardized for fairness, but model-specific tuning may yield additional gains. The current configuration prioritizes controlled comparability over exhaustive optimization.
Finally, the study focuses on point forecasts rather than probabilistic forecasts or explicit uncertainty quantification. Future work may benefit from Bayesian formulations, quantile-based objectives, or distributional forecasting methods to better characterize predictive uncertainty under volatile market conditions.
6.2. Potential Future Research
Future research may extend this framework in several directions. Integrating multimodal information—such as blockchain transaction flows, derivatives-market variables, or sentiment embeddings—may improve sensitivity to evolving market conditions. More advanced architectures, including hybrid transformer–diffusion models or reinforcement-learning-driven decision systems, may further enhance temporal reasoning and downstream utility. Cross-asset or hierarchical attention mechanisms also represent a promising direction for modeling inter-cryptocurrency dependencies. Incorporating probabilistic forecasting and risk-aware objectives may increase practical relevance for portfolio management and algorithmic trading. Finally, real-time or intraday implementations may broaden the applicability of transformer-based forecasting systems in operational environments.