Article

Integrating Vision Transformer and Time–Frequency Analysis for Stock Volatility Prediction

School of Computing, Gachon University, Seongnam 13120, Republic of Korea
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(23), 3787; https://doi.org/10.3390/math13233787
Submission received: 21 October 2025 / Revised: 22 November 2025 / Accepted: 24 November 2025 / Published: 25 November 2025
(This article belongs to the Special Issue Advances in Machine Learning Applied to Financial Economics)

Abstract

Financial market volatility prediction remains challenging due to data nonlinearity and non-stationarity. Existing quantitative approaches struggle to capture the multi-scale information embedded in time series, and convolutional neural network (CNN)-based image approaches primarily emphasize local feature extraction, whereas Vision Transformers (ViTs) capture global dependencies directly through self-attention. To address these limitations, we propose TF-ViTNet, a dual-path hybrid model that integrates time–frequency scalograms generated via the Continuous Wavelet Transform (CWT) with ViTs for volatility prediction. While time–frequency analysis has been widely adopted in prior studies, applying ViTs to CWT-based scalograms within a parallel architecture provides a new perspective for capturing global spatiotemporal structures in financial volatility. The model employs a parallel architecture in which a Vision Transformer pathway learns global spatiotemporal patterns from scalograms while a Long Short-Term Memory (LSTM) pathway captures temporal characteristics from technical indicators, with both streams integrated at the final stage for volatility prediction. Empirical analysis using NASDAQ and S&P 500 index data from 2010 to 2024 demonstrates that TF-ViTNet consistently outperforms LSTM models using numerical data alone, as well as existing benchmarks. Within parallel architectures, Vision Transformers capture global patterns in scalograms more effectively than CNNs, achieving significant performance improvements, particularly for NASDAQ. The model maintains stable predictive power even during high-volatility regimes, demonstrating strong potential as a risk management tool. Data augmentation improves performance for the stable S&P 500 market but degrades results for the volatile NASDAQ market, underscoring the need for market-specific augmentation strategies tailored to underlying signal-to-noise characteristics.

1. Introduction

Volatility prediction in financial markets is essential for portfolio optimization, risk management, and option pricing. Accurate volatility predictions enable investors to make informed decisions, hedge against adverse market movements, and optimize capital allocation. Traditional statistical time series models, such as Generalized Autoregressive Conditional Heteroskedasticity (GARCH) models, have been predominantly used for volatility estimation and forecasting [1]. While GARCH-type models effectively capture volatility clustering and persistence characteristics of financial markets, they are inherently limited by their linear structure and reliance on distributional assumptions, making them less effective at capturing the complex, nonlinear dynamics and regime shifts prevalent in modern financial markets [2].
Deep learning has substantially advanced financial time series forecasting by enabling automated feature extraction from raw data. Long Short-Term Memory (LSTM) networks capture long-term dependencies effectively [3], while Transformers model global temporal relations through self-attention without sequential constraints [4]. Hybrid designs combining Convolutional Neural Networks (CNNs) for spatial features and LSTMs or Transformers for temporal learning have shown strong performance. Nevertheless, most models process time series in a one-dimensional form, potentially overlooking rich multi-scale temporal-frequency structures inherent in financial volatility.
To address the limitation of one-dimensional representations, researchers have adopted signal processing techniques that extract multi-scale information from financial data. Time–frequency analysis methods capture both temporal and spectral characteristics, effectively revealing market dynamics. Among them, the Wavelet Transform decomposes signals into time-localized frequency components, identifying short-term volatility spikes and long-term trends [5]. Transforming time series into two-dimensional representations, such as scalograms or spectrograms, exposes hidden structures missed in raw sequences. Integrating these representations with deep learning has improved forecasting accuracy by providing richer multi-scale features. Beyond wavelets, decomposition methods such as Empirical Mode Decomposition (EMD) and Variational Mode Decomposition (VMD) also enhance performance. However, most prior studies still rely on CNNs or recurrent neural networks (RNNs) to extract features from such representations, limiting their ability to capture global dependencies.
In parallel with signal processing advances, vision-based approaches have emerged as a new paradigm by transforming financial time series into two-dimensional images. This enables the use of computer vision models for market prediction. Early studies converted candlestick charts and technical indicator plots into images analyzed by CNNs to capture spatial patterns unseen in raw sequences. Later methods, such as the Gramian Angular Field and its quantum variants [6], further encoded temporal dependencies into image representations. Pre-trained CNNs have achieved strong generalization across markets, but their focus on local features limits global understanding. Vision Transformers (ViTs), leveraging self-attention to model long-range dependencies, offer greater potential to capture both local and global structures in time–frequency representations.
Accurate volatility estimation is essential for building reliable prediction models. Traditional close-to-close estimators lose valuable information by ignoring intraday price movements. To address this, Parkinson [7] proposed using the daily high–low range, providing more efficient and robust variance estimates. Recent research has also explored integrating multiple information sources for improved forecasting. Combining realized volatility measures with deep learning architectures has shown that intraday information enhances accuracy [8]. Moreover, volatility exhibits dependencies across multiple time scales, motivating models that jointly capture short-term fluctuations and long-term persistence [9]. Our approach addresses this challenge through time–frequency analysis and a parallel architecture that effectively integrates multi-scale volatility features.
Building upon these advances, this study proposes TF-ViTNet, a novel dual-path hybrid model that addresses three key limitations in existing work. First, while most deep learning models process time series in one-dimensional sequential form, we transform volatility time series into two-dimensional scalograms via Continuous Wavelet Transform (CWT) to capture multi-scale temporal-frequency structures. Second, whereas signal processing techniques are typically combined with traditional sequential models, we employ a ViT to analyze scalogram images, enabling the capture of both local and global patterns through self-attention mechanisms. Third, while existing vision-based approaches predominantly rely on CNNs that excel at local feature extraction, ViTs capture broader global dependencies through self-attention, potentially leveraging time–frequency representations more effectively. Our approach employs a parallel architecture where a ViT pathway analyzes scalograms and an LSTM pathway processes numerical technical indicators, with both streams independently processed and integrated at the final stage. This design allows each modality to develop specialized representations before fusion, distinguishing our approach from integrated models that combine features at each time step. We evaluate TF-ViTNet on NASDAQ and S&P 500 indices from 2010 to 2024, demonstrating superior performance compared to baseline LSTM and various hybrid architectures across different volatility regimes. In this study, we focus exclusively on one-step-ahead volatility prediction, where the model estimates the next day's volatility using information available up to the current day. Multi-step forecasting is not considered in this work and represents an important direction for future extension.
The main contributions of this study are threefold. First, we introduce the novel combination of CWT-based time–frequency scalograms with a Vision Transformer architecture for volatility prediction. While CWT and time–frequency analysis have been widely used in prior studies, their integration with ViTs within a parallel dual-path architecture has not yet been explored. Second, we apply ViTs to financial volatility prediction, exploiting their ability to capture global spatiotemporal patterns embedded in scalogram images through self-attention mechanisms. Third, we design a novel dual-path parallel architecture that effectively fuses heterogeneous data representations by combining a ViT pathway for scalograms with an LSTM pathway for numerical technical indicators, allowing each modality to develop specialized features independently before integration.
The remainder of this paper is structured as follows: Section 2 provides a comprehensive review of the literature across deep learning for financial forecasting, signal processing methods, vision-based approaches, and volatility modeling. Section 3 describes our methodology, including data preprocessing, the proposed TF-ViTNet architecture, and experimental design. Section 4 outlines the experimental methodology and presents statistical analyses of financial time-series data. Section 5 presents comprehensive experimental results and analysis. Section 6 discusses the findings and concludes with limitations and future research directions.

2. Literature Review

2.1. Deep Learning for Financial Time Series Forecasting

Deep learning has fundamentally transformed financial time series forecasting by enabling automated feature learning from raw data. LSTM networks have become predominant architectures for sequential financial data modeling due to their ability to capture long-term dependencies [3]. The integration of Transformer architectures has marked significant advancement, with Lim et al. [10] proposing the Temporal Fusion Transformer (TFT) that effectively integrates static covariates, known future inputs, and historical observations for multi-horizon forecasting. Muhammad et al. [11] applied Transformers with time2vec encoding to stock market prediction, while Srijiranon et al. [12] developed the LSTM-mTrans-MLP hybrid model that achieved state-of-the-art performance across seven diverse financial datasets and demonstrated robustness during extreme volatility periods such as the COVID-19 pandemic. Xie et al. [13] proposed a Deep Convolutional Transformer (DCT) that combines CNN’s local pattern extraction with Transformer’s global context modeling, achieving the lowest prediction errors across NASDAQ, Hang Seng, and Shanghai indices. Sattarov and Makhmudov [14] proposed a Deep Q-Network (DQN) model for Bitcoin price prediction, incorporating multi-step state representations and volatility-adjusted rewards to enhance short-term forecasting accuracy. Kim et al. [15] proposed FINGAN-BiLSTM, combining Financial Generative Adversarial Networks (GANs) with Bidirectional LSTM to capture stylized facts in foreign exchange markets.
Hybrid ensemble approaches that combine multiple modeling paradigms have demonstrated superior performance by capturing both linear and nonlinear components. He et al. [16] proposed the ARMA-CNN-LSTM ensemble model that integrates the autoregressive moving-average (ARMA) model for linear autocorrelation with CNN-LSTM for spatiotemporal pattern learning, achieving lower prediction errors across three financial datasets. Wang et al. [17] developed a hybrid model that applies Complete Ensemble EMD with Adaptive Noise (CEEMDAN) to separate high-frequency and low-frequency components, then processes them with ARIMA and CNN-LSTM, respectively, leading to superior accuracy for gold futures prediction. Zhang et al. [18] proposed a CNN-BiLSTM-Attention model that extracts local features through CNN layers, processes them bidirectionally with LSTM, and applies attention mechanisms to focus on the most informative temporal segments, achieving consistently superior performance across 12 domestic and international stock indices. Liu et al. [19] integrated mixed-frequency data by using GARCH-MIDAS to transform low-frequency macroeconomic indicators into high-frequency information, which was then processed by a Transformer model for CSI300 volatility prediction, demonstrating that incorporating multi-scale temporal information improves forecasting accuracy.

2.2. Time–Frequency Analysis for Financial Time Series Forecasting

Time–frequency analysis techniques have been widely adopted in financial forecasting for their ability to capture both temporal and spectral characteristics of nonstationary signals. Wavelet-based approaches, in particular, have demonstrated strong performance. Zhang [20] combined Wavelet Transform denoising with an LSTM optimized by Whale Optimization Algorithm for Hang Seng Index forecasting, while Huang et al. [21] introduced Wavelet Packet Decomposition with SW-LSTM to capture localized multi-scale features in crude oil and gasoline prices. Armah et al. [22] further applied wavelet-based analysis to investigate dynamic correlations between financial stress and commodity prices, revealing time-varying dependencies intensified during crises. Umar et al. [23] used time–frequency analysis to show that the Russia–Ukraine war had time-varying effects on short-selling stocks.
The Hilbert–Huang Transform (HHT), which integrates EMD with Hilbert spectral analysis, has also been effective in financial contexts. Dezhkam and Manzuri [24] showed that HHT-derived features enhanced classification and returns in S&P 500 portfolio selection using XGBoost. Rai et al. [25] used HHT to analyze sectoral recovery patterns in India during COVID-19, and Li and Qian [26] combined CEEMDAN with GRU–Transformer models for limit order book analysis, improving predictive performance. Beyond wavelet and HHT methods, several decomposition techniques have been explored. Kouzmanidis et al. [27] employed EMD with Gaussian Mixture Models for GameStop, Tesla, and XRP forecasting, and Xu et al. [28] applied CEEMDAN for causal analysis between stock prices and volumes across multiple scales. Kim et al. [29] proposed a forecasting framework combining VMD with an Attention-enhanced Cascaded LSTM network for Volatility Index (VIX) prediction. Collectively, these studies demonstrate that multi-scale decomposition—via Wavelet, HHT, or related methods—consistently enhances modeling accuracy by enabling finer temporal–frequency specialization.

2.3. Volatility Modeling and Prediction

Parkinson [7] introduced the Extreme Value Method, which estimates volatility more efficiently by using intraday high and low prices rather than closing prices. Glosten et al. [1] proposed the GJR-GARCH model to capture asymmetric responses of volatility to positive and negative shocks, reflecting the leverage effect in stock returns. Christensen and Prabhala [30] examined the relationship between implied and realized volatility using S&P 500 data, showing that implied volatility provides better short-term forecasts while realized volatility performs better over longer horizons. Corsi [9] developed the HAR-RV model, which reproduces the long-memory property of realized volatility in a simple autoregressive structure and outperforms ARMA models. Hansen et al. [8] extended this approach through the Realized GARCH model, integrating realized measures directly into the GARCH framework and achieving improved in-sample and out-of-sample performance.
With the rise of deep learning, subsequent research expanded volatility prediction beyond traditional econometric methods. Bao et al. [31] proposed a hybrid deep learning framework that applies Wavelet Transform for denoising, Stacked Autoencoders for feature extraction, and LSTM for temporal modeling. Shi et al. [32] introduced a Graph Transformer Network combining graph structures for inter-stock relationships with transformer-based temporal learning, achieving lower RMSPE on S&P 500 data. Cai et al. [33] utilized a Temporal Graph Model incorporating value chain data from Thomson Reuters Refinitiv, combining Graph Convolutional Networks (GCN) for spatial dependencies and LSTM for sequential learning, outperforming ARIMA, LSTM, and GCN baselines. Zhang et al. [34] enhanced volatility prediction by pooling data across multiple stocks to capture intraday commonality, improving generalization to unseen data. Pérez et al. [35] proposed a hybrid framework integrating GARCH-type models and moving average with Support Vector Machine (SVM) and LSTM for volatility and risk forecasting, achieving superior VaR and expected shortfall backtesting performance.

2.4. Vision-Based Approaches for Market Prediction

Vision-based approaches have gained attention by transforming time series data into two-dimensional images for analysis using computer vision models. Early studies demonstrated that converting candlestick charts into images enables effective learning: Kusuma et al. [36] achieved over 92% accuracy with CNNs for Taiwanese and Indonesian markets, and Lee et al. [37] trained a Deep Q-Network on U.S. market data that generalized to 31 countries, suggesting universal chart patterns. Technical indicators have also been visualized, as shown by Khalid et al. [38] and Pongsena et al. [39], both reporting improved accuracy and profitability over traditional models.
More advanced transformations have further enhanced temporal representation. Ramadhan et al. [6] used the Gramian Angular Field (GAF) to encode candlestick sequences for CNN-LSTM, achieving 82.7% accuracy, while Xu [40] proposed Quantum GAF, reducing MAE and MSE substantially. Pre-trained CNN models have been applied effectively: Altuntaş et al. [41] fine-tuned CNNs with RSI filters for silver trading, achieving 115% cumulative returns, and Hu et al. [42] used a VGG16-based autoencoder for portfolio construction, outperforming the FTSE 100. Multi-scale models such as the Multi-Scale CNN of Pei et al. [43], which jointly processes decomposed time series and OHLCT images, have shown strong results in Chinese A-share markets. While CNNs are powerful, their limited receptive fields hinder global context capture. ViTs overcome this through self-attention across image patches, effectively modeling both local and global patterns—an approach promising for time–frequency scalograms but still rare in finance.

3. Methodology

3.1. Time–Frequency Analysis

Financial time series exhibit complex non-stationary characteristics and abrupt short-term fluctuations that cannot be adequately captured by traditional time series analysis alone. To effectively analyze these dynamic properties, time–frequency analysis techniques that represent signals simultaneously in the temporal and spectral domains have emerged as powerful tools for financial signal processing. The conventional Fourier transform provides global frequency information averaged over the entire signal duration, which limits its ability to capture transient phenomena and time-varying spectral characteristics that are prevalent in financial markets. To address this limitation, the Short-Time Fourier Transform (STFT) was developed to provide localized frequency information by partitioning the signal into overlapping time windows and applying the Fourier transform to each segment [44]. The STFT of a signal $x(t)$ is defined as shown in Equation (1).
$$\mathrm{STFT}_x(\tau, f) = \int_{-\infty}^{\infty} x(t)\, w(t-\tau)\, e^{-j 2\pi f t}\, dt,$$
where $x(t)$, $w(t-\tau)$, $f$, $j$, and $\tau$ denote the original signal, a window function centered at time $\tau$ that localizes the analysis, the frequency, the imaginary unit, and the temporal position of the analysis window, respectively. While the STFT provides time-localized frequency information, it suffers from a fundamental trade-off between temporal and frequency resolution due to its fixed window length. This limitation becomes particularly pronounced when analyzing financial time series that contain both rapid transient events and slower trend components.
CWT offers a more flexible approach by employing basis functions that are naturally localized in both time and frequency domains. Unlike STFT, CWT uses wavelets with varying time–frequency resolution, providing fine temporal resolution for high-frequency components and fine frequency resolution for low-frequency components. For financial signal analysis, the Morlet wavelet is commonly employed due to its optimal time–frequency localization properties. The Morlet wavelet is defined as shown in Equation (2).
$$\psi(t) = \exp\!\left(j\omega_0 t - \frac{t^2}{2}\right)$$
where $\omega_0$ denotes the central frequency parameter. Using the wavelet function, the CWT is defined as shown in Equation (3).
$$W_x(a, b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} x(t)\, \psi^{*}\!\left(\frac{t-b}{a}\right) dt$$
where $a$, $b$, and $\psi^{*}(t)$ denote the scale parameter, which controls the dilation of the wavelet and is inversely related to frequency; the translation parameter, representing temporal localization; and the complex conjugate of the wavelet, respectively.
The squared magnitude of the wavelet coefficients $|W_x(a, b)|^2$ represents the signal energy at each time position $b$ and scale $a$, which can be visualized as a two-dimensional scalogram image. In this representation, rows correspond to frequency bands and columns to time, with higher frequencies in the upper region and lower frequencies in the lower region. The intensity of each pixel indicates the energy concentration at the corresponding time–frequency location.
In this study, we apply CWT directly to 60-day sequences of Parkinson’s volatility series to generate scalogram images that capture the energy distribution across time and frequency domains. These generated scalograms serve as image inputs to our proposed ViT architecture, enabling the model to learn complex spatiotemporal patterns that are not readily apparent in the original time series representation. The transformation from time series to scalogram representation offers several advantages for volatility prediction: different frequency components can reveal various market dynamics by capturing distinct patterns across frequency bands, complex temporal dependencies can be transformed into spatial patterns naturally suited for computer vision techniques, the time–frequency representation can help distinguish signal from noise components across different scales, and hidden periodic patterns or regime changes in volatility may become more apparent in the scalogram.

3.2. Vision Transformer

The Transformer model, originally proposed in [4] for natural language processing, achieved remarkable success by overcoming the limitations of RNNs. The core self-attention mechanism of Transformers models relationships between all elements of an input sequence in parallel, effectively learning long-range dependencies and global context. This architectural success led to its extension into computer vision, resulting in the Vision Transformer.
Unlike traditional CNNs that progressively integrate local features to understand entire images, ViT introduced a novel approach by dividing images into multiple fixed-size patches and processing them as sequence data [45]. As illustrated in Figure 1, each patch is converted into a vector through embedding, positional encoding containing spatial information is added, and then fed into a standard Transformer encoder. This structure enables ViT to learn global relationships between patches across the entire image from the initial layers of the model.
Specifically, an input image $x \in \mathbb{R}^{H \times W \times C}$ is divided into $N$ patches of size $P \times P$. Each patch is transformed into a $D$-dimensional embedding vector through a linear transformation $\mathbf{E}$, and learnable positional embeddings $\mathbf{E}_{\mathrm{pos}}$ are added to preserve spatial location information. Additionally, a class token $x_{\mathrm{class}}$ is prepended to the sequence for classification purposes, and the final input sequence $z_0$ to the encoder is defined as shown in Equation (4).
$$z_0 = [\,x_{\mathrm{class}};\; x_p^{1}\mathbf{E};\; x_p^{2}\mathbf{E};\; \ldots;\; x_p^{N}\mathbf{E}\,] + \mathbf{E}_{\mathrm{pos}}, \qquad \mathbf{E} \in \mathbb{R}^{(P^2 \cdot C) \times D}, \quad \mathbf{E}_{\mathrm{pos}} \in \mathbb{R}^{(N+1) \times D}$$
where $z_0$, $x_{\mathrm{class}}$, $x_p^{i}$, $\mathbf{E}$, $\mathbf{E}_{\mathrm{pos}}$, $N$, $P$, $C$, and $D$ denote the input sequence to the Transformer encoder, the learnable class token for classification, the $i$-th image patch, the linear embedding matrix that maps patches to $D$ dimensions, the learnable positional embeddings containing spatial information, the total number of patches, the patch size, the number of image channels, and the embedding dimension, respectively.
These patch embeddings are then fed into the Transformer encoder, where the self-attention mechanism evaluates interactions between patches and learns global features. The core operation of self-attention is scaled dot-product attention using query Q, key K, and value V vectors, and is defined as shown in Equation (5).
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where $d_k$ denotes the dimension of the key vectors. The scaling factor prevents the dot products from becoming excessively large, facilitating stable gradient propagation. ViT employs multi-head attention, which performs this attention operation multiple times in parallel, allowing the model to attend to information from different representation subspaces. Each attention head is computed independently as shown in Equation (6), and the outputs from all heads are then concatenated and projected. Multi-head attention with $h$ heads is defined as shown in Equation (7).
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V})$$
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$$
where $W_i^{Q} \in \mathbb{R}^{D \times d_q}$, $W_i^{K} \in \mathbb{R}^{D \times d_k}$, and $W_i^{V} \in \mathbb{R}^{D \times d_v}$ are learnable projection matrices for each head, and $W^{O} \in \mathbb{R}^{h \cdot d_v \times D}$ is the projection matrix that combines the outputs from all heads. Each Transformer encoder block consists of multi-head self-attention and a feed-forward network, incorporating layer normalization and residual connections. The computation of the intermediate representation $z'_{\ell}$ is shown in Equation (8), and the final output of the $\ell$-th encoder block $z_{\ell}$ is defined as shown in Equation (9).
$$z'_{\ell} = \mathrm{MSA}(\mathrm{LN}(z_{\ell-1})) + z_{\ell-1}$$
$$z_{\ell} = \mathrm{MLP}(\mathrm{LN}(z'_{\ell})) + z'_{\ell}$$
where MSA, LN, MLP, and $z_{\ell}$ denote multi-head self-attention, layer normalization, multi-layer perceptron, and the output of the $\ell$-th block, respectively.
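For concreteness, the sketch below implements one encoder block following Equations (5)–(9), assuming PyTorch; the dimensions match the configuration described in the next paragraph (embedding dimension 384, 6 heads, MLP dimension 1536), and the class name is our own. It is an illustration of the math, not the timm source code.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    def __init__(self, dim=384, num_heads=6, mlp_dim=1536):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        # nn.MultiheadAttention realizes Equations (5)-(7): scaled dot-product
        # attention computed in parallel over num_heads subspaces, then projected.
        self.msa = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim),
        )

    def forward(self, z):
        # Equation (8): pre-norm multi-head self-attention with residual connection
        h = self.ln1(z)
        z = z + self.msa(h, h, h, need_weights=False)[0]
        # Equation (9): pre-norm MLP with residual connection
        return z + self.mlp(self.ln2(z))

# e.g., 197 tokens (196 patches + class token) with embedding dimension 384
tokens = torch.randn(1, 197, 384)
out = EncoderBlock()(tokens)   # shape (1, 197, 384)
```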
In this study, we utilize scalograms transformed through CWT as input images for ViT. Financial time series data contains complex time–frequency information beyond simple time series values, and scalograms can quantitatively and intuitively represent this information in two-dimensional image format. Since ViT demonstrates strength in capturing global contextual information, it can effectively learn relationships between visual characteristics conveyed through scalograms and complex volatility patterns in financial data.
For implementation, we employed the vit_tiny_patch16_224 pre-trained model from the timm library as a feature extractor [46]. Input images have a size of 224 × 224 pixels and are divided into a 14 × 14 grid of patches with a patch size of 16 × 16. Each patch is linearly transformed into a 384-dimensional embedding vector, and learnable 384-dimensional positional encoding is added. The Transformer encoder consists of 12 encoder blocks, each with 6 multi-head attention heads, and the MLP internal dimension is set to 1536, four times the embedding dimension. The classification head is removed so that token features are extracted directly. The extracted patch token sequence is then passed to an LSTM module to learn temporal dependencies.
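A minimal sketch of this feature-extractor setup using the timm library: passing num_classes=0 removes the classification head, and forward_features returns the full token sequence (class token plus 196 patch tokens). The shapes shown are illustrative; D denotes the model's embedding dimension.

```python
import timm
import torch

# pre-trained ViT with the classification head removed
vit = timm.create_model("vit_tiny_patch16_224", pretrained=True, num_classes=0)
vit.eval()

img = torch.randn(1, 3, 224, 224)        # one 224x224 RGB scalogram
with torch.no_grad():
    tokens = vit.forward_features(img)   # (1, 197, D) token sequence
patch_tokens = tokens[:, 1:, :]          # drop the class token; fed to the LSTM
```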

3.3. Proposed TF-ViTNet for Volatility Prediction

This study proposes the Time–Frequency Vision Transformer Network (TF-ViTNet), a dual-path hybrid model for volatility prediction. TF-ViTNet simultaneously learns visual features from scalogram images and temporal features from numerical data to capture complex patterns inherent in financial time series from multiple perspectives. The proposed model consists of two parallel pathways: a ViT-based refinement path that analyzes image-based visual patterns and an LSTM path that analyzes numerical temporal trends. The scalogram feature vector extracted by ViT and the time-series feature vector extracted by the LSTM path exist in different feature spaces. The refinement path performs a feature-space transformation, non-linearly reshaping the ViT-derived feature vector through the LSTM’s gate mechanism. Through this refinement process, the ViT feature vector is refined and aligned into an optimal form for fusion with the numerical feature vector from the LSTM path. Features independently extracted from each path are integrated in the final stage to predict Parkinson’s volatility for the next trading day.
The first path receives scalogram images generated through CWT as input. Data augmentation techniques such as masking and affine transformation are applied to the generated scalograms to improve the model's generalization performance. The core of this path is the use of a pre-trained ViT as a feature extractor. ViT effectively captures global spatial patterns and interrelationships across the entire image, and through transfer learning, applies visual feature understanding learned from large-scale image datasets to financial scalogram analysis. High-dimensional feature vectors extracted through ViT are sequentially passed to LSTM layers to enhance temporal context. The LSTM module is not intended to recover temporal information already encoded in scalograms, but rather to refine the sequential structure of ViT patch embeddings. Since ViT processes patches as an unordered token sequence with position embeddings, the additional LSTM layer allows the model to learn smoother temporal transitions across patches and align the extracted visual features with the temporal dynamics learned in the numerical path.
The second path directly receives multivariate time series data, including OHLCV data, Parkinson’s volatility from previous time points, and traditional technical indicators such as RSI and MACD. This path uses a standard LSTM network to effectively handle the non-stationarity and long-term dependencies inherent in financial data. LSTM learns long-term trends and temporal dependencies inherent in numerical data by selectively remembering important information from the past and reflecting it in current predictions.
The two feature vectors generated at the final stage of each path are combined into a single high-dimensional vector through concatenation. This integrated vector contains rich information implying both visual patterns and quantitative trends of the market, and is ultimately passed to a multi-layer perceptron that performs regression. This multi-layer perceptron is a fully connected neural network containing one or more hidden layers, and performs the role of mapping nonlinear relationships from complex features integrated from both paths to the final prediction target of Parkinson’s volatility for the next trading day. The overall structure of the model is shown in Figure 2.
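A condensed sketch of this dual-path fusion, assuming PyTorch and a timm ViT with its head removed (as above). The 256-dimensional per-path embeddings and the concatenation follow the text; the regression-head layout and activation are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TFViTNetSketch(nn.Module):
    def __init__(self, vit, num_features=23, hidden=256):
        super().__init__()
        self.vit = vit                                    # pre-trained ViT, head removed
        self.lstm_img = nn.LSTM(vit.num_features, hidden, batch_first=True)
        self.lstm_num = nn.LSTM(num_features, hidden, batch_first=True)
        self.head = nn.Sequential(                        # regression MLP (layout assumed)
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1),
        )

    def forward(self, img, seq):
        tokens = self.vit.forward_features(img)           # (B, N+1, D) token sequence
        _, (h_img, _) = self.lstm_img(tokens)             # refinement path over ViT tokens
        _, (h_num, _) = self.lstm_num(seq)                # 60-day indicator sequence
        fused = torch.cat([h_img[-1], h_num[-1]], dim=1)  # (B, 512) integrated vector
        return self.head(fused)                           # next-day Parkinson volatility
```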
The overall data processing, training, and evaluation pipeline of the volatility prediction model proposed in this study is presented in Algorithm 1. This process consists of four stages: (1) a preprocessing stage that generates numerical sequences and scalogram images from raw time series data, (2) an optimal hyperparameter search stage using Optuna [47], (3) a training stage for the dual-path model with optimized hyperparameters, and (4) a stage that evaluates the performance of the final model.
The main notation used in the algorithm is as follows: $S_{raw}$, $L$, $S_{num}$, $I$, $y$, $\theta$, and $\mathcal{H}$ denote the raw time series data, sequence length, technical indicator sequences, CWT scalogram images, target volatility values, model parameters, and hyperparameter set, respectively.
Algorithm 1 Proposed Volatility Prediction Algorithm
Require: Raw time series data $S_{raw}$, hyperparameter search space $\mathcal{H}$, sequence length $L$
Ensure: Final model parameters $\theta^{*}$, performance metrics $M$
—Phase 1: Data Preparation—
1: $X_{num} \leftarrow$ CalcIndicators($S_{raw}$); $\sigma_P \leftarrow$ CalcParkinsonVolatility($S_{raw}$)
2: $D \leftarrow \emptyset$
3: for each day $t$ in the available period do
4:   $S_{num}(t) \leftarrow \{X_{num,\tau}\}_{\tau=t-L+1}^{t}$; $\sigma_{P,seq}(t) \leftarrow \{\sigma_{P,\tau}\}_{\tau=t-L+1}^{t}$
5:   $I(t) \leftarrow$ ContinuousWaveletTransform($\sigma_{P,seq}(t)$); $y(t) \leftarrow \sigma_{P,t+1}$
6:   $D \leftarrow D \cup \{(S_{num}(t), I(t), y(t))\}$
7: end for
8: $D_{train}, D_{val}, D_{test} \leftarrow$ SplitAndStandardize($D$)
—Phase 2: Hyperparameter Optimization—
9: $H^{*} \leftarrow \arg\min_{H \in \mathcal{H}}$ ValidationLoss($H, D_{train}, D_{val}$)   ▹ using Optuna (TPE sampler)
—Phase 3: Final Model Training—
10: Initialize model parameters $\theta_0$ with hyperparameters $H^{*}$
11: for $e = 1$ to $E_{max}$ do
12:   for each mini-batch $(S_{num,b}, I_b, y_b) \in D_{train}$ do   ▹ dual-path forward pass
13:     $z_{ViT,b} \leftarrow$ ViT_FeatureExtractor($I_b, \theta_{ViT}$)
14:     $h_{img,b} \leftarrow$ LSTM$_{img}$($z_{ViT,b}, \theta_{LSTM_{img}}$); $h_{num,b} \leftarrow$ LSTM$_{num}$($S_{num,b}, \theta_{LSTM_{num}}$)
15:     $h_{comb,b} \leftarrow$ Concatenate($h_{img,b}, h_{num,b}$)
16:     $\hat{y}_b \leftarrow$ FullyConnectedLayer($h_{comb,b}, \theta_{FC}$)
17:     $L_b \leftarrow \mathcal{L}(\hat{y}_b, y_b)$; $\theta_e \leftarrow$ AdamOptimizer($\theta_{e-1}, \nabla_{\theta} L_b$)
18:   end for
19:   if EarlyStopping on $D_{val}$ then $\theta^{*} \leftarrow$ best model; break
20:   end if
21: end for
—Phase 4: Evaluation—
22: $\{\hat{y}_i\} \leftarrow$ InverseScale(Predict($f_{\theta^{*}}, D_{test}$))
23: $M \leftarrow$ CalculateMetrics($\{\hat{y}_i\}, \{y_i\}_{test}$)   ▹ MSE, RMSE, $R^2$
24: return $\theta^{*}, M$
To fully exploit the potential of this complex structure, various hyperparameters related to the model’s learning process, structural complexity, and generalization performance must be carefully optimized. In this study, we aimed to efficiently discover the optimal combination using Optuna, an automated hyperparameter search tool based on Bayesian optimization.
To enhance the learning stability and robustness of the model, we adopted Huber Loss as the loss function [48]. Huber Loss is a hybrid loss function that behaves like mean squared error when prediction errors are smaller than a certain threshold and like mean absolute error when errors exceed the threshold. Due to this characteristic, it converges stably through the smooth gradient of mean squared error in typical situations. However, it reacts less sensitively to large prediction errors caused by sudden market volatility, similar to mean absolute error, preventing the model from being excessively influenced by outliers. In particular, instability in the training process of Vision Transformer-based models is known as a major problem that impairs accuracy, and this instability can be difficult to detect as it manifests as subtle performance degradation rather than complete training failure [49]. Therefore, adopting outlier-robust Huber Loss is an effective strategy for stable model training. As the optimization algorithm, we used the Adam optimizer, which induces fast and stable convergence through adaptive learning rates, and is a commonly used choice for training ViT architectures [49,50].
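For reference, the Huber loss has the standard piecewise form, with residual $a = y - \hat{y}$ and threshold $\delta$:
$$L_{\delta}(a) = \begin{cases} \tfrac{1}{2}a^{2}, & |a| \le \delta, \\ \delta\left(|a| - \tfrac{1}{2}\delta\right), & |a| > \delta, \end{cases}$$
which is precisely the quadratic (MSE-like) regime for small errors and the linear (MAE-like) regime for large errors described above.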
The main hyperparameters subject to search are directly related to the model’s learning process, structure, expressiveness, and generalization performance. To control the learning process, we adjusted the optimizer’s learning rate and mini-batch size. As variables determining the model’s structure, we searched for the Vision Transformer’s feature vector dimension, the hidden layer size and number of layers of the LSTM processing image features. Finally, to suppress overfitting and improve generalization performance, we optimized the dropout ratio and weight decay coefficient. The specific search space for each hyperparameter is summarized in Section 4.1.
The hyperparameter search space in this study was designed to precisely explore the balance between the model’s learning stability, expressiveness, and generalization performance. To achieve this, we included not only general learning-related variables such as learning rate and batch size, but also structural elements such as ViT Output Dimension in the search space. This is to directly explore the impact of model structure on task-specific performance and find the most suitable model capacity for the data [51]. The search ranges for core hyperparameters, learning rate and batch size, were set based on empirically proven ranges from previous studies on ViT model training [52]. The wide learning rate range in log scale helps efficiently explore stable convergence points, and the batch size was configured to balance the accuracy of gradient estimation and computational efficiency. Additionally, we included weight decay and dropout as search variables to control overfitting. Weight decay limits the magnitude of weights, and dropout prevents unnecessary synchronization between neurons to enhance model robustness [53]. Since the optimal regularization strength in modern architectures such as ViT heavily depends on data and model complexity [54], this study aimed to identify the optimal combination through automated search rather than simply adopting universal values [55].
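A minimal Optuna search loop consistent with this setup is sketched below. The ranges shown are illustrative placeholders (the actual search spaces are listed in Table 2), and train_and_validate is a hypothetical stand-in for the real training routine.

```python
import optuna

def train_and_validate(params):
    # Hypothetical stand-in: would train the model with `params` on D_train
    # and return the Huber validation loss on D_val.
    return 0.0

def objective(trial):
    params = {
        "lr": trial.suggest_float("lr", 1e-5, 1e-2, log=True),          # log-scale range
        "batch_size": trial.suggest_categorical("batch_size", [32, 64, 128]),
        "dropout": trial.suggest_float("dropout", 0.0, 0.5),
        "weight_decay": trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True),
        "lstm_hidden": trial.suggest_categorical("lstm_hidden", [128, 256, 512]),
    }
    return train_and_validate(params)

study = optuna.create_study(direction="minimize",
                            sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=50)
best_params = study.best_params
```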

3.4. Evaluation Metrics

To comprehensively evaluate the regression performance of the model, we adopt four metrics: Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Percentage Error (MAPE), and the coefficient of determination ($R^2$), defined in Equations (10a)–(10d), where $y_i$, $\hat{y}_i$, $\bar{y}$, and $n$ denote the actual value, predicted value, mean of the actual values, and total number of samples, respectively.
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
$$\mathrm{MAPE} = \frac{100}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
where MSE calculates the average of squared differences between actual and predicted values, imposing larger penalties for greater errors. RMSE, computed as the square root of MSE, shares the same unit as the actual values, enabling intuitive interpretation of prediction errors. MAPE expresses prediction accuracy as the average percentage difference between actual and predicted values, allowing for scale-independent comparison across datasets. The coefficient of determination $R^2$ indicates the proportion of variance in the actual values explained by the model's predictions relative to the mean.
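These four metrics translate directly into NumPy; a minimal sketch of Equations (10a)–(10d):

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    mse = np.mean((y_true - y_pred) ** 2)                        # Equation (10a)
    rmse = np.sqrt(mse)                                          # Equation (10b)
    mape = 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))   # Equation (10c)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1.0 - ss_res / ss_tot                                   # Equation (10d)
    return {"MSE": mse, "RMSE": rmse, "MAPE": mape, "R2": r2}
```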
These evaluation metrics serve not only to measure model performance but also to validate the practical improvement in predictive power of the proposed model compared to benchmark models using previous-day volatility as predictions. Additionally, these metrics were utilized as criteria for finding the optimal parameter combination during hyperparameter optimization with Optuna and as the basis for early stopping to prevent overfitting and select the optimal model.

4. Experiments and Data

This section empirically validates the volatility prediction performance of the proposed TF-ViTNet model and quantitatively analyzes its superiority through comparison with various benchmark models. The experiments are conducted using S&P 500 and NASDAQ index data from 2010 to 2024 in a walk-forward manner. Specifically, for testing year t, the model is trained and validated on the ten years preceding t: the first eight years are used for training (e.g., 2000–2007 when testing 2010) and the subsequent two years for validation (e.g., 2008–2009), after which the model is tested on year t. This rolling evaluation continues annually through 2024, ensuring consistent and non-overlapping training and testing periods. This approach strictly evaluates the model's generalization performance through sequential validation that predicts the future from past data, mirroring real financial market environments. All benchmark models undergo hyperparameter optimization using Optuna for fair evaluation. Prediction performance is comprehensively assessed based on MSE, RMSE, and $R^2$ metrics. The following subsections detail the configuration of the benchmark models and provide a comparative analysis of prediction results over the entire period.
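The window generation is simple enough to state as code; a minimal sketch of the walk-forward scheme under the split just described (function and variable names are our own):

```python
# For each test year t, the preceding ten years split into eight training
# years and two validation years.
def walk_forward_windows(first_test_year=2010, last_test_year=2024):
    for t in range(first_test_year, last_test_year + 1):
        yield {
            "train": (t - 10, t - 3),  # e.g., 2000-2007 when t = 2010
            "val":   (t - 2, t - 1),   # e.g., 2008-2009
            "test":  (t, t),           # e.g., 2010
        }

for window in walk_forward_windows():
    # slice the dataset by these year ranges, fit on train,
    # tune on val, and report metrics on the test year
    pass
```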

4.1. Experiments

To objectively evaluate the prediction performance of the proposed TF-ViTNet model, we conduct comparative analysis with various benchmark models. The benchmark models range from simple baselines to sophisticated hybrid architectures that utilize both image and numerical data. Table 1 summarizes the configuration of each benchmark model.
To ensure a fair comparison, all deep learning models (the proposed TF-ViTNet and all baseline models) were trained and optimized under a unified hyperparameter configuration, as summarized in Table 2. All models were trained for up to 200 epochs using the Adam optimizer and HuberLoss. We employed an Early Stopping patience of 10 and Gradient Clipping with a max_norm of 1.0 to ensure stable convergence. A ReduceLROnPlateau scheduler was also employed for adaptive learning rate adjustment. Key hyperparameters such as Batch Size, Learning Rate, Dropout Rate, and Weight Decay, along with specific architectural parameters (e.g., hidden sizes), were tuned using Optuna, with the search spaces defined in Table 2.
Baseline Models
To establish the lower bound of prediction performance, we introduce a Naive Benchmark that assumes the previous day's volatility persists to the current day. This persistence model ($\hat{y}_t = y_{t-1}$) reflects the fundamental characteristic of volatility clustering, where high volatility tends to be followed by high volatility in financial markets.
In addition to the Naive Benchmark, we include a Random Walk model, a widely used reference model in volatility studies [56]. The Random Walk assumes that daily volatility follows the previous value plus Gaussian innovation and is defined as shown in Equation (11).
$$\hat{y}_t = y_{t-1} + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \sigma_{\Delta y}^{2}),$$
where $\sigma_{\Delta y}$ is the empirical standard deviation of daily volatility changes. This baseline captures both the volatility persistence and the inherent randomness observed in financial data, and serves as a lower bound that forecasting models are expected to outperform.
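Both baselines reduce to a few lines; a minimal sketch assuming `vol` is the daily Parkinson volatility series as a NumPy array:

```python
import numpy as np

def naive_forecast(vol):
    return vol[:-1]                        # y_hat_t = y_{t-1}

def random_walk_forecast(vol, seed=0):
    rng = np.random.default_rng(seed)
    sigma = np.std(np.diff(vol))           # empirical std of daily volatility changes
    return vol[:-1] + rng.normal(0.0, sigma, size=len(vol) - 1)   # Equation (11)
```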
We also establish an LSTM-only model using 23 technical indicators to assess how well purely quantitative approaches without scalogram images can predict volatility. The model receives a 60-day sequence of technical indicators with shape (batch, 60, 23) as input. A single LSTM layer with 256 hidden units processes this sequence and outputs the hidden-state vector (batch, 256) at the final time step. This vector passes through a 256-dimensional fully connected layer and then through a single-neuron regression head to predict next-day volatility.
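A sketch of this baseline in PyTorch, dimensioned exactly as stated above; the activation between the fully connected layers is an assumption.

```python
import torch
import torch.nn as nn

class LSTMBaseline(nn.Module):
    def __init__(self, num_features=23, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(num_features, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, hidden)   # 256-dim fully connected layer
        self.out = nn.Linear(hidden, 1)       # single-neuron regression head

    def forward(self, x):                     # x: (batch, 60, 23)
        _, (h, _) = self.lstm(x)              # h[-1]: hidden state at final step
        return self.out(torch.relu(self.fc(h[-1])))
```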
Hybrid Models with Integrated Inputs
These models integrate image features at each time step of the time series by extracting a single feature vector from the (224, 224, 3) scalogram image and repeatedly combining it with the numerical data at all 60 time steps.
The CNN feature extractor follows a standard VGGNet-style CNN architecture [57]. The input image passes through four consecutive convolutional blocks, each consisting of two Conv2d layers with 3 × 3 kernels, ReLU activation functions, and a 2 × 2 MaxPool2d layer for dimensionality reduction. The number of channels progressively increases as 32, 64, 128, and 256, effectively extracting hierarchical visual features from the image. The feature map is then standardized to 7 × 7 through AdaptiveAvgPool2d and flattened into a one-dimensional vector. This vector passes through two fully connected layers (1024 and 256 dimensions), with 50% Dropout applied between them to prevent overfitting.
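A sketch of this extractor in PyTorch, following the channel widths, pooling, and dropout stated above; the placement of the dropout between the two fully connected layers follows the text.

```python
import torch.nn as nn

def conv_block(c_in, c_out):
    # two 3x3 convolutions with ReLU, followed by 2x2 max pooling
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

cnn_extractor = nn.Sequential(
    conv_block(3, 32), conv_block(32, 64), conv_block(64, 128), conv_block(128, 256),
    nn.AdaptiveAvgPool2d((7, 7)),            # standardize the feature map to 7x7
    nn.Flatten(),                            # (B, 256 * 7 * 7)
    nn.Linear(256 * 7 * 7, 1024), nn.ReLU(inplace=True),
    nn.Dropout(0.5),                         # 50% dropout between the FC layers
    nn.Linear(1024, 256),                    # final 256-dim image feature vector
)
```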
The ViT feature extractor is based on the pre-trained vit_tiny_patch16_224 model from the timm library [46]. We remove the classification head and pass the CLS token through a Projection layer to generate the final image feature vector. The extracted feature vector is then combined with all 60 time steps of the numerical data sequence, creating an expanded sequence that is fed into the LSTM network. This allows image information to continuously influence the model’s learning of temporal patterns.
Hybrid Model with Parallel Processing
In contrast to the previous approach, these models process image and numerical data through independent pathways and then combine the results. This dual-path architecture generates embeddings optimized for each data type before final integration.
The proposed TF-ViTNet and its direct comparison target TF-CNet are both based on this parallel processing structure. Both models share the same LSTM structure in the numerical path but differ in the image path feature extractor. TF-ViTNet uses Vision Transformer to learn global patterns of scalograms, while TF-CNet uses CNN to extract local features. Performance comparison of these two models reveals how the feature extraction methods of ViT and CNN affect volatility prediction within the same parallel architecture.
The image path generates a single feature vector through CNN or ViT, treats it as input with sequence length 1, and passes it through a dedicated LSTM layer to capture temporal context. The numerical path processes 60 days of numerical data through the same structure as the Baseline LSTM model to generate a 256-dimensional embedding. This 60-day length was chosen based on a medium-term perspective; 60 trading days, corresponding to approximately three months or one financial quarter, is a period widely used to filter out short-term daily noise and capture significant medium-term momentum and volatility clustering phenomena. In the final stage, both embeddings (256 dimensions each) are concatenated into a 512-dimensional integrated vector, which is fed into a single-neuron fully connected layer for final prediction. This dual-path structure was originally proposed for video analysis to separately learn spatial and temporal information [58], but is effectively applied in this study to process data with different characteristics.
Econometric Grounding
We incorporate three standard volatility forecasting models for comparative analysis: the GARCH(1,1) model [59], the Heterogeneous Autoregressive model of Realized Volatility (HAR-RV) [9], and the Realized GARCH model [8]. These models are specifically chosen to benchmark our proposed model (TF-ViTNet) and other deep learning baselines against established econometric methods.
To ensure a fair and direct comparison with our proposed model, all econometric baselines were evaluated using the identical walk-forward validation scheme: a rolling 10-year training window was used to forecast the next single day's volatility. All models share the same prediction target, Parkinson's volatility ($\sigma_P$). The implementation details are as follows:
  • GARCH(1,1): This model captures volatility persistence using lagged squared log-returns ($r_{t-1}^{2}$) and past variance ($\sigma_{t-1}^{2}$). As its natural output (conditional variance) is not on the same scale as $\sigma_P$, the one-step-ahead prediction is linearly scaled to match the mean and standard deviation of the training set's $\sigma_P$ series.
  • HAR-RV: This model is specifically designed to capture the long-memory property of realized volatility. Following [9], it regresses the current $\sigma_P$ directly on the average $\sigma_P$ of the previous day, week (5 days), and month (22 days); a minimal regression sketch follows this list. As it is fitted directly to the target series, no scaling is required.
  • Realized GARCH: This model [8] extends the GARCH framework by incorporating the realized volatility measure ($\sigma_P$) directly as an external regressor. Similar to GARCH(1,1), its variance prediction is linearly scaled to match the target $\sigma_P$ series for accurate comparison.
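The HAR-RV sketch referenced in the list above, assuming `vol` is the Parkinson volatility series as a pandas Series and using statsmodels for the OLS fit (the library choice is ours; the regressors follow [9]):

```python
import pandas as pd
import statsmodels.api as sm

def fit_har_rv(vol: pd.Series):
    X = pd.DataFrame({
        "daily":   vol.shift(1),                      # previous day
        "weekly":  vol.shift(1).rolling(5).mean(),    # previous-week average
        "monthly": vol.shift(1).rolling(22).mean(),   # previous-month average
    })
    data = pd.concat([vol.rename("target"), X], axis=1).dropna()
    model = sm.OLS(data["target"],
                   sm.add_constant(data[["daily", "weekly", "monthly"]]))
    return model.fit()   # one-step-ahead forecasts come from the fitted coefficients
```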
The performance of these econometric models is presented alongside our deep learning models in Section 5.1, providing a comprehensive benchmark for evaluating the contributions of the proposed architectures.

4.2. Data

This study uses daily market data from the NASDAQ Composite and S&P 500 indices spanning from 2000 to 2024. The dataset includes open, high, low, close prices (OHLC) and trading volume. We select these two major U.S. indices to capture different market characteristics: NASDAQ represents technology-focused stocks with higher volatility, while S&P 500 represents a diversified portfolio across various sectors with more stable movements. Although the evaluation period begins in 2010, data from 2000 are used to ensure sufficient historical context for model training in the walk-forward design. The data undergoes preprocessing including missing value handling, outlier removal, and normalization before model training. Financial time series contain irregular noise and nonlinear characteristics that make it difficult to clearly identify market trends from raw data alone. To address this limitation, technical indicators are widely used to quantify latent market patterns through statistical processing of historical data [60]. These indicators transform the original time series into multidimensional feature vectors, enabling learning models to more effectively capture complex relationships inherent in the data. We categorize indicators into four groups to comprehensively capture various market aspects: trend, momentum, volatility, and volume.
Momentum indicators measure the speed and strength of price changes to diagnose overbought or oversold market conditions. The Relative Strength Index (RSI) predicts major market turning points based on the relative strength of price gains versus losses [61]. The Stochastic Oscillator measures momentum by evaluating the relative position of the current close within the price range over a specific period [62]. Williams %R operates on a similar principle. The Moving Average Convergence Divergence (MACD) is a representative trend-following momentum indicator that simultaneously shows trend strength and direction through the relationship between two exponential moving averages [63]. Volatility indicators measure the amplitude or degree of dispersion in asset prices. Bollinger Bands form standard deviation bands around a moving average to visually represent relative levels of price volatility [64]. Average True Range (ATR) quantifies market volatility by measuring the average range of daily price movements [61]. Parkinson’s volatility, the core prediction target of this study, is also an important volatility measure that utilizes intraday high and low prices. Volume indicators combine volume information with price to assess trend reliability. Volume-Weighted Average Price (VWAP) calculates the average execution price over a specific period using volume as weights, reflecting market energy. All calculated indicators constitute a multivariate time series dataset used as input to the LSTM network. Table 3 presents the formulas for all technical indicators grouped by category, and Figure 3 shows the time series evolution of closing prices and Parkinson’s volatility for both indices.
Analysis of NASDAQ and S&P 500 index data from 2000 to 2024 reveals that both markets recorded steady long-term growth but with distinct characteristics. Figure 3 visualizes the temporal evolution of closing prices and Parkinson’s volatility for both indices. Both markets show RSI averages of 54.27 and 54.06, indicating mild buying pressure above 50, while ADX averages of 23.18 and 22.74 suggest clear trending behavior. However, the technology-focused NASDAQ exhibits significantly higher volatility with a standard deviation to average closing price ratio of 0.84, recording higher long-term returns. In contrast, the diversified S&P 500 shows a ratio of 0.60, indicating relatively lower volatility and more stable movement with gradual upward momentum. Table 4 and Table 5 provide comprehensive descriptive statistics for all indicators, including measures of central tendency, dispersion, distribution shape, and stationarity tests.
This study adopts walk-forward cross-validation to preserve temporal dependencies in time series data. Unlike standard k-fold cross-validation, this approach ensures that training data always temporally precedes validation and test data, preventing look-ahead bias. Specifically, we train and validate the model on 10 years of data, then evaluate prediction performance on the following 1-year period, repeating this process across the entire period. This approach is a standard methodology for realistically estimating model performance in actual market environments [65]. To ensure technical indicators can be accurately calculated at each training start point, we acquire data from 2000, two years before the analysis period begins.
Close-based volatility has an obvious limitation in that it fails to reflect intraday price movements. To address this, we adopt Parkinson’s volatility, which utilizes daily high and low prices, as the target time series signal for analysis [7]. Parkinson’s volatility explicitly incorporates intraday fluctuation ranges in its calculation, providing statistically more efficient and accurate estimation of actual volatility compared to close-based volatility. This richer capture of intraday noise and price movements inherent in financial time series provides meaningful information to the model.
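The per-day Parkinson estimator [7] is a one-line computation from daily highs and lows; a minimal sketch assuming `high` and `low` are aligned NumPy arrays of daily prices:

```python
import numpy as np

def parkinson_volatility(high, low):
    # sigma_P = sqrt( ln(H/L)^2 / (4 ln 2) ), computed per trading day
    return np.sqrt(np.log(high / low) ** 2 / (4.0 * np.log(2.0)))
```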
To capture the complex dynamic characteristics of financial time series, we transform the Parkinson's volatility signal into a two-dimensional scalogram image in the time–frequency domain using CWT. The process of generating a single input image proceeds as follows. First, we extract the Parkinson's volatility time series for the past 60 trading days from the analysis point, then normalize the signal to zero mean and unit standard deviation. We apply a complex Morlet wavelet (cmor1.5-1.0) with bandwidth 1.5 and center frequency 1.0 to the normalized signal. If the bandwidth is lower (e.g., B = 0.5), the image becomes too smeared along the time axis, making it difficult to identify the precise timing of volatility bursts. Conversely, if the bandwidth is higher (e.g., B = 2.5), information along the frequency axis is compressed, making it difficult to distinguish between the diverse shapes of volatility patterns. A center frequency (C) of 1.0 is used because it effectively captures the main energy of the volatility signal near the center of the scalogram. If this value is too low (e.g., C = 0.5), the image is concentrated only in the low-frequency band (bottom), and if it is too high (e.g., C = 1.5), seemingly unnecessary high-frequency noise (top) is emphasized. For multi-resolution analysis, the scale range is set from 4 to 31. The minimum scale of 4 effectively filters out high-frequency daily noise, while the maximum scale of 31 fully encompasses the crucial weekly and monthly market cycles. The scalogram is constructed by taking the absolute values of the complex coefficient matrix computed through CWT and is visualized with a jet colormap. This transformation was carried out using Python 3.12.12 and the PyWavelets 1.9.0 open-source library [66]. Unlike the STFT, which uses fixed windows, the CWT-based approach effectively represents both low-frequency long-term trends and high-frequency short-term volatility simultaneously. The resulting scalogram image compactly encodes the dynamic characteristics of the original time series and is used as input for the ViT model to learn both microscopic and macroscopic patterns. Figure 4 demonstrates the transformation process from raw price data to scalogram representation. The top panel shows the daily high and low prices, the middle panel displays the calculated Parkinson's volatility time series with its normalized version, and the bottom panel presents the resulting CWT scalogram that serves as input to the Vision Transformer.
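A sketch of this generation step with PyWavelets and Matplotlib, using the cmor1.5-1.0 wavelet, scales 4–31, and a jet colormap as described above; the figure-sizing details are assumptions that render an approximately 224 × 224 image.

```python
import numpy as np
import pywt
import matplotlib.pyplot as plt

def volatility_to_scalogram(window, path):
    # normalize the 60-day volatility window to zero mean and unit std
    x = (window - window.mean()) / window.std()
    scales = np.arange(4, 32)                        # scale range 4-31
    coeffs, _ = pywt.cwt(x, scales, "cmor1.5-1.0")   # complex CWT coefficients
    power = np.abs(coeffs)                           # |W_x(a, b)|
    fig = plt.figure(figsize=(2.24, 2.24), dpi=100)  # ~224 x 224 pixels
    plt.imshow(power, aspect="auto", cmap="jet")
    plt.axis("off")
    plt.savefig(path, bbox_inches="tight", pad_inches=0)
    plt.close(fig)
```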

5. Results

5.1. Overall Forecasting Performance

We evaluate the prediction performance of TF-ViTNet against eight benchmark models across two major U.S. indices over a 15-year testing period from 2010 to 2024. The benchmark models include standard econometric models (GARCH, HAR-RV, and Realized-GARCH), a naive benchmark model, an LSTM-only model using numerical indicators, two integrated hybrid models (CNN-LSTM and ViT-LSTM) that combine image features at each time step, and a parallel hybrid model (TF-CNet) that independently processes image and numerical data. Table 6 summarizes the overall performance metrics, including R², MSE, RMSE, and MAPE, for all models.
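For completeness, the four metrics in Table 6 follow their standard definitions, sketched below; the authors' exact implementation is not shown in the paper.

```python
# Standard definitions of the evaluation metrics reported in Table 6.
import numpy as np

def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    mape = 100.0 * np.mean(np.abs(err / y_true))  # in percent
    return {"R2": r2, "MSE": mse, "RMSE": np.sqrt(mse), "MAPE": mape}
```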
The proposed TF-ViTNet achieves the highest R² values of 0.387 for NASDAQ and 0.436 for S&P 500, demonstrating superior predictive accuracy in terms of explained variance compared to all deep learning and econometric benchmark models in both markets. In terms of MSE, TF-ViTNet achieves 1.66 × 10⁻⁵ for NASDAQ, the lowest of all models. For S&P 500, its MSE (1.30 × 10⁻⁵) is second only to the HAR-RV model (1.08 × 10⁻⁵), though it significantly outperforms HAR-RV in R². The results for MAPE are more nuanced; while TF-ViTNet outperforms the LSTM and naive baselines, the econometric models (particularly HAR-RV) achieve lower MAPE scores. This highlights the different optimization goals of each metric: TF-ViTNet excels at capturing overall variance (R²), while HAR-RV is competitive in average percentage error. The performance gap is more pronounced in the more volatile NASDAQ market, where TF-ViTNet substantially outperforms TF-CNet, which records a negative R² of −0.095. In contrast, both parallel architectures (TF-ViTNet and TF-CNet) perform competitively in the more stable S&P 500 market.
First, the results demonstrate that the proposed hybrid approach consistently outperforms both traditional econometric models and simpler deep learning baselines in terms of R 2 , MSE, and RMSE, although its performance in MAPE is slightly less competitive. For the NASDAQ market, TF-ViTNet significantly outperforms the best-performing econometric model, HAR-RV, and the LSTM-only baseline. This advantage is even more pronounced in the S&P 500 market, where TF-ViTNet again surpasses HAR-RV and the LSTM-only model. This suggests that the time–frequency features captured by the ViT provide substantial predictive value beyond the autoregressive components modeled by GARCH/HAR-RV and the sequential information from numerical indicators alone.
Second, hybrid models incorporating scalogram images generally show strong performance improvements over the LSTM-only baseline. For S&P 500, most hybrid models outperform the LSTM-only model ( R 2 of 0.235). For NASDAQ, most hybrid models like CNN-LSTM (0.308) and ViT-LSTM (0.265) also outperform the LSTM baseline (0.223). These results suggest that transforming time series data into time–frequency domain images can provide meaningful information for volatility prediction, though performance depends critically on appropriate architecture design.
Third, the choice of image encoder and architecture structure significantly impacts performance. For the parallel structure, TF-ViTNet using ViT substantially outperforms the CNN-based TF-CNet. Specifically, TF-ViTNet achieves R 2 of 0.387 compared to −0.095 for TF-CNet on NASDAQ, and 0.436 compared to 0.375 on S&P 500. In contrast, for the integrated structure, CNN-LSTM consistently outperforms ViT-LSTM in both markets. This suggests that the parallel structure, which combines two independent information streams at the final stage, creates positive synergy with ViT’s ability to capture global context in scalograms.

5.2. Performance Analysis by Period

To assess the temporal stability and robustness of model performance, we examine annual results across the 15-year testing period from 2010 to 2024. This period-by-period analysis is crucial for understanding how models perform under varying market conditions, including periods of high volatility such as the 2011 European debt crisis and the 2020 COVID-19 pandemic, as well as more stable bull market periods. Unlike aggregate metrics that may mask year-to-year fluctuations, annual performance reveals whether a model’s success stems from consistent predictive power or from exceptional performance in specific years.
Figure 5 and Figure 6 visualize the TF-ViTNet model’s prediction performance over the entire testing period for NASDAQ and S&P 500, respectively. The time series plots show actual volatility in blue, predicted volatility in red, and naive benchmark predictions in green. Both figures demonstrate that TF-ViTNet successfully captures the general volatility dynamics, tracking major volatility spikes during crisis periods such as 2011, 2015–2016, and particularly the dramatic surge during the 2020 COVID-19 pandemic. The model shows strong alignment with actual volatility patterns during both calm and turbulent periods, though some divergence is visible during extreme volatility events where all models struggle. Notably, the predicted volatility follows the actual trend more closely than the naive benchmark across most periods, validating the effectiveness of our approach.
Table 7 and Table 8 present the annual R² values for all models on NASDAQ and S&P 500, respectively. These results are obtained without applying data augmentation to the scalogram images, isolating the effect of architecture design. The ‘Overall’ row reports a single R² computed on the pooled data, combining all actual volatility values and model predictions accumulated over the entire test period from 2010 to 2024. The ‘Overall’ R² therefore serves as a comprehensive measure of long-term, cumulative predictive performance that is not excessively biased by anomalous results or slumps in specific years.
The annual results reveal considerable performance variation across different market conditions and notable differences in model behavior between the two indices. For NASDAQ, TF-ViTNet achieves the highest R 2 in five years: 2011, 2014, 2015, 2019, and 2024. The most impressive result occurs in 2011, a period of heightened volatility associated with the European debt crisis, where TF-ViTNet records an R 2 of 0.434. However, the model also experiences challenging years, with 2012, 2013, and 2017 showing negative or near-zero R 2 values.
For S&P 500, the competitive landscape differs markedly. While TF-ViTNet achieves the best overall performance as shown in Table 6, it wins in only two individual years: 2012 and 2023. CNN-LSTM dominates across multiple years, particularly during the stable growth period from 2014 to 2016 and again in 2018 and 2021, achieving R 2 values consistently above 0.35. This suggests that CNN-based integrated architectures may have advantages in capturing volatility patterns during prolonged stable market regimes. TF-ViTNet’s strength lies in its more balanced performance across diverse conditions, avoiding the severe failures seen in some models during difficult periods.
A consistent pattern across both markets is the difficulty all models face during certain transitional or highly uncertain periods. Years 2012, 2013, 2017, and 2022 show widespread negative or low R 2 values across multiple models, suggesting fundamental challenges in volatility prediction during specific market regimes. The naive benchmark occasionally outperforms sophisticated models in these difficult years, particularly in 2020 for both indices, highlighting that complex architectures do not guarantee superior performance under all conditions. TF-ViTNet’s overall advantage stems from achieving competitive performance during favorable periods while maintaining relative stability during challenging years, resulting in the highest cumulative performance over the full 15-year period.
TF-ViTNet consistently demonstrates statistically significant predictive advantages over the benchmark in annual performance. This superiority is validated by a paired t-test on loss differentials (** denotes 1% and * denotes 5% significance levels). TF-ViTNet achieved high R 2 values across both NASDAQ (e.g., 2011, 2014, 2019, 2024) and S&P 500 (e.g., 2012, 2023), recording the highest overall R 2 for the S&P 500 over the 15-year period. These results confirm that TF-ViTNet captures complex volatility patterns more effectively than traditional benchmarks and other models, proving its practical value in financial forecasting tasks.
However, the p-values indicate that while TF-ViTNet significantly outperforms most benchmarks, its comparison with the CNN-LSTM model yields more ambiguous results. Therefore, a detailed comparison between TF-ViTNet and CNN-LSTM was conducted over 70 individual stocks, and the results of this analysis are presented in Section 5.7.

5.3. Impact of Data Augmentation

Data augmentation is a regularization technique that artificially increases training data diversity to prevent overfitting [67]. To evaluate its impact on volatility prediction performance, we compare three augmentation strategies applied to the TF-ViTNet model’s scalogram images. The ‘Affine’ strategy combines random horizontal flipping with 50% probability [67] and Color Jitter [68], which randomly adjusts brightness, contrast, and saturation within a range of 0.2 and hue within 0.1. The ‘Masking’ strategy randomly masks portions of the scalogram in frequency or time domains, inspired by SpecAugment [69]. The ‘All’ strategy combines both Affine and Masking techniques. Table 9 and Table 10 present the annual R 2 values for each augmentation strategy alongside the baseline without augmentation for NASDAQ and S&P 500, respectively.
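An illustrative torchvision composition of these strategies is sketched below. The flip probability and jitter ranges follow the text; the masking step is approximated with RandomErasing as a SpecAugment-style stand-in, since the paper's exact masking routine is not reproduced here.

```python
# Hedged sketch of the 'Affine', 'Masking', and 'All' augmentation strategies.
from torchvision import transforms

affine = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),            # 'Affine': random flip
    transforms.ColorJitter(brightness=0.2, contrast=0.2,
                           saturation=0.2, hue=0.1),   # 'Affine': color jitter
])

all_strategy = transforms.Compose([
    affine,
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5, scale=(0.02, 0.1)),  # 'Masking' approximation
])
```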
The impact of data augmentation reveals markedly different patterns between the two markets. For NASDAQ, the model without augmentation achieves the highest overall R² of 0.387, outperforming all augmentation strategies, with Affine at 0.364, Masking at 0.353, and All at 0.355. This counter-intuitive result demonstrates that data augmentation does not universally improve performance and can degrade it in certain contexts [70]. The annual breakdown shows that no augmentation wins in 7 out of 15 years, particularly during highly volatile periods such as 2011, with an R² of 0.434, and 2019–2021, where it consistently achieves the best results. Notably, however, the All strategy demonstrates strong performance in specific market conditions, winning in 6 years: 2010, 2012, 2014, 2015, 2016, and 2018. This high frequency of year-level victories despite lower overall performance suggests that aggressive augmentation can be highly effective during certain market regimes but may introduce excessive regularization that degrades performance in others. The trade-off is particularly evident in 2016, where All achieves 0.347 compared to 0.200 for the baseline, versus 2020–2021, where it substantially underperforms with 0.113 and 0.075 compared to 0.354 and 0.236 without augmentation.
For S&P 500, the results strongly favor augmentation. The Affine strategy achieves the highest overall R² of 0.436, substantially surpassing the no-augmentation baseline of 0.404; Masking records 0.380 and All achieves 0.391. The superior performance of Affine is driven by consistent advantages across multiple years, achieving the best results in 2012, 2015, 2018, and 2020, and tying for best in 2023. Particularly impressive is the 2020 result, where Affine achieves an R² of 0.583 compared to 0.399 without augmentation, demonstrating substantial benefits during the volatile COVID-19 period. Interestingly, although no augmentation wins in 7 years (2011, 2013, 2014, 2021, 2022, 2023, and 2024), the Affine strategy’s stronger performance in key volatile years drives its overall advantage.
These contrasting results can be attributed to fundamental differences in market characteristics. The technology-focused NASDAQ exhibits higher volatility with a standard deviation to mean ratio of 0.84, while the diversified S&P 500 shows a more stable ratio of 0.60. In volatile markets like NASDAQ, volatility patterns themselves are noisy and irregular, making it difficult for the model to learn stable representations. Adding data augmentation on top of this inherent noise creates a problem where augmentation-induced variations compound the market’s natural volatility, overwhelming the learning signal. Conversely, in the more stable S&P 500 market, cleaner underlying patterns provide a solid foundation that can withstand the additional variation introduced by augmentation. Here, augmentation acts as intended regularization, helping the model generalize without corrupting the base signal. These findings emphasize that augmentation effectiveness depends critically on the signal-to-noise ratio of the target data, with noisy data suffering from compounded distortion while clean data benefits from improved generalization.

5.4. Performance Analysis with STFT Spectrograms

This section presents a comparative analysis of the TF-ViTNet model’s performance under different input transformations for volatility prediction. Specifically, we evaluate the model when its input images are generated as Short-Time Fourier Transform (STFT) spectrograms versus Continuous Wavelet Transform (CWT) scalograms.
Both STFT and CWT are widely utilized in financial time series for their ability to extract meaningful features from non-stationary and complex signals. STFT provides a fixed-resolution time–frequency representation that is computationally efficient, whereas CWT offers a flexible multi-resolution analysis better suited for capturing transient patterns, albeit with increased computational demands.
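The fixed-window alternative can be sketched as follows; the window length (nperseg = 16) is an illustrative assumption, not a parameter reported in the paper.

```python
# Hedged sketch of the STFT input variant: a fixed window yields the
# fixed time-frequency resolution contrasted with CWT above.
import numpy as np
from scipy.signal import stft

def volatility_to_spectrogram(vol_window: np.ndarray) -> np.ndarray:
    signal = (vol_window - vol_window.mean()) / vol_window.std()
    _, _, Z = stft(signal, nperseg=16)  # fixed-resolution time-frequency grid
    return np.abs(Z)
```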
The evaluation was conducted by calculating annual R 2 metrics for the NASDAQ and S&P 500 indices under identical model settings. As shown in Table 11, the CWT-based representations generally resulted in higher R 2 values than STFT across both markets, indicating superior predictive accuracy. However, STFT retained advantages in computational speed, suggesting a trade-off between accuracy and efficiency. These findings emphasize the importance of selecting appropriate time–frequency transformation methods in designing deep learning models for financial volatility forecasting.

5.5. Performance by Market Regime

To formally analyze regime-specific performance, we partition the test data into “Bull” and “Bear” market regimes. Following the classical methodology of [71], a Bear market is declared after a 20% decline from a previous market peak, and a Bull market is declared after a 20% rise from a previous market trough. This allows us to evaluate whether the model’s predictive power is consistent across different long-term market trends. Table 12 presents the R 2 performance of all models, enabling a direct comparison between the proposed TF-ViTNet and multiple benchmark architectures.
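A reduced sketch of this labeling rule is given below; it applies only the 20% threshold logic and assumes the sample starts in a Bull state, whereas the full dating algorithm of [71] includes additional censoring rules.

```python
# Simplified Bull/Bear labeling: a Bear market begins after a 20% decline from
# the running peak, and a Bull market after a 20% rise from the running trough.
import pandas as pd

def label_regimes(close: pd.Series, threshold: float = 0.20) -> pd.Series:
    regime, peak, trough = "Bull", close.iloc[0], close.iloc[0]
    labels = []
    for price in close:
        if regime == "Bull":
            peak = max(peak, price)
            if price <= peak * (1 - threshold):    # 20% decline from peak
                regime, trough = "Bear", price
        else:
            trough = min(trough, price)
            if price >= trough * (1 + threshold):  # 20% rise from trough
                regime, peak = "Bull", price
        labels.append(regime)
    return pd.Series(labels, index=close.index)
```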
For the NASDAQ index, TF-ViTNet demonstrates substantially higher predictive ability during Bear markets ( R 2 = 0.4190 ) compared with Bull markets ( R 2 = 0.2980 ). A similar pattern is observed for several benchmark models, although TF-ViTNet generally achieves the highest performance across both regimes. This suggests that the time–frequency representations learned by TF-ViTNet are particularly effective at capturing the complex, nonlinear volatility structures that intensify during downturns.
For the S&P 500 index, TF-ViTNet again exhibits stronger performance in Bear markets ( R 2 = 0.5392 ), whereas CNN-LSTM attains slightly higher performance during Bull markets. This comparison highlights complementary strengths across models, while still showing that TF-ViTNet maintains a competitive advantage when market conditions become more turbulent.
Overall, the expanded regime-level evaluation—now including all benchmark models—offers a clearer understanding of where TF-ViTNet delivers marginal improvements and how model performance shifts across different market environments.

5.6. Forecasting Performance by Volatility Regime

To analyze model robustness across different volatility conditions, we partition the test data into quintiles based on actual Parkinson’s volatility values and measure Mean Absolute Error (MAE) for each quintile. This quintile-based analysis reveals how models perform in low versus high volatility regimes, which is critical for practical risk management applications where performance during extreme market conditions is particularly important. Table 13 presents the MAE values multiplied by 1000 for all models across the five volatility quintiles for both NASDAQ and S&P 500.
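The quintile computation follows directly from this description; a minimal sketch (function name ours) is shown below.

```python
# Sketch of the quintile-based MAE in Table 13: test days are binned into
# quintiles of actual Parkinson's volatility, and MAE (x1000) is computed per bin.
import pandas as pd

def mae_by_quintile(y_true: pd.Series, y_pred: pd.Series) -> pd.Series:
    quintile = pd.qcut(y_true, q=5, labels=["Q1", "Q2", "Q3", "Q4", "Q5"])
    abs_err = (y_true - y_pred).abs() * 1000.0  # scaled as in Table 13
    return abs_err.groupby(quintile).mean()
```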
The quintile analysis reveals distinct patterns in model performance across volatility regimes. A consistent finding across both markets is that prediction errors increase substantially in the highest volatility quintile (Q5) for all models. For NASDAQ, MAE values in Q5 range from 5.817 for TF-ViTNet to 8.394 for TF-CNet, representing roughly 3–5 times higher errors compared to mid-range quintiles. Similarly, for S&P 500, Q5 errors range from 5.118 for the naive benchmark to 13.915 for LSTM, demonstrating the fundamental challenge of predicting extreme volatility events.
Despite the overall difficulty in high volatility regimes, TF-ViTNet demonstrates notable advantages. For NASDAQ Q5, TF-ViTNet achieves the lowest MAE of 5.817, outperforming all other models, including the naive benchmark at 5.933. This represents a significant advantage in the most critical regime, where accurate predictions are most valuable for risk management. The model also maintains competitive performance in mid-range volatility quintiles, achieving near-optimal results in Q3 and Q4. However, in low-volatility quintiles (Q1–Q2), simpler models like ViT-LSTM perform better, suggesting that the sophisticated parallel architecture may introduce unnecessary complexity when volatility patterns are straightforward.
For S&P 500, the results show a different pattern. In the critical highest volatility quintile (Q5), while the naive benchmark achieves the best performance at 5.118, TF-ViTNet records 5.325, outperforming all other deep learning models. This contrasts with NASDAQ where TF-ViTNet dominated even the benchmark in extreme conditions, and the difference can be attributed to the S&P 500’s more stable characteristics, where simple persistence forecasts remain effective even during high volatility periods. In lower volatility regimes (Q1–Q4), TF-ViTNet maintains competitive performance with MAE values ranging from 1.464 to 1.998, consistently ranking among the top performers alongside ViT-LSTM and CNN-LSTM. The consistent superiority of TF-ViTNet over competing deep learning approaches across all quintiles, particularly in the critical high volatility regime, demonstrates the robustness of our parallel architecture design.
These findings underscore that TF-ViTNet’s primary strength lies in maintaining stability during extreme market conditions, particularly in volatile markets like NASDAQ. While the model may not always achieve the lowest errors in calm periods, its ability to remain robust when volatility spikes makes it particularly valuable for risk management applications where tail risk prediction is paramount.

5.7. Performance for Individual Stocks: S&P 500 Top-50 and Russell 3000 Low-Liquidity Equities

To further assess the generalizability and robustness of the proposed models, we extended our analysis beyond index-level prediction to individual equities. Specifically, Parkinson’s volatility was predicted for (i) the 50 largest constituents of the S&P 500 by market capitalization and (ii) an additional set of 20 low-liquidity stocks sampled from the Russell 3000 index. For each equity, both TF-ViTNet and CNN-LSTM were independently trained and tested following the same training–validation–test procedure used in the index experiments, focusing on the most recent year, 2024.
To examine how model performance varies across different volatility environments, we performed a volatility-tier analysis for each stock. Specifically, the 250 daily observations in 2024 were sorted by the actual Parkinson volatility and partitioned into three equally sized groups (Low/Medium/High volatility). For each tier, R 2 was computed separately using only the observations belonging to that volatility level. This procedure yields, for every individual equity, three volatility-specific R 2 scores that reveal how predictive accuracy changes across calm, moderate, and turbulent market conditions.
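A minimal sketch of this tier construction is given below, assuming `y_true` and `y_pred` hold the 2024 daily actual and predicted volatilities for one stock; ranking before binning guarantees three equally sized groups.

```python
# Per-stock volatility-tier analysis: split the ~250 daily observations into
# three equal-sized groups by actual Parkinson's volatility and compute R^2
# within each tier (values can be negative, as in Tables A2 and A3).
import pandas as pd

def tier_r2(y_true: pd.Series, y_pred: pd.Series) -> dict:
    tiers = pd.qcut(y_true.rank(method="first"), q=3,
                    labels=["Low", "Medium", "High"])
    out = {}
    for tier in ["Low", "Medium", "High"]:
        yt, yp = y_true[tiers == tier], y_pred[tiers == tier]
        ss_res = ((yt - yp) ** 2).sum()
        ss_tot = ((yt - yt.mean()) ** 2).sum()
        out[tier] = 1.0 - ss_res / ss_tot
    return out
```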
Figure 7 summarizes the cross-sectional distribution of these tier-specific R² scores across all 70 equities. In all three volatility tiers (Low, Medium, and High), the distribution of TF-ViTNet is consistently shifted to the right relative to CNN-LSTM, indicating higher average R² and more stable predictive performance across volatility regimes. Although both models exhibit weaker accuracy under high-volatility conditions, TF-ViTNet maintains noticeably stronger tail behavior with fewer severely negative R² outliers. A detailed equity-level breakdown is provided in Appendix A and Appendix B.

6. Conclusions

This study proposes TF-ViTNet, a novel dual-path hybrid model that integrates time–frequency analysis with ViTs for financial market volatility prediction. The model transforms Parkinson’s volatility time series into scalogram images via CWT and analyzes them using a ViT to extract spatiotemporal patterns. Simultaneously, a separate LSTM pathway learns temporal features from technical indicators, and the two streams are integrated at the final stage to predict future volatility.
Empirical analysis using NASDAQ and S&P 500 index data over a 15-year testing period from 2010 to 2024 yields five important findings. First, regarding the effectiveness of image-based representations, the hybrid approach combining scalogram images with technical indicators consistently outperforms models using numerical data alone. TF-ViTNet achieves substantial improvements over the baseline LSTM model, with particularly strong results for S&P 500. This demonstrates that two-dimensional time–frequency representations provide meaningful complementary information for volatility prediction beyond what can be captured by one-dimensional numerical sequences alone.
Second, concerning parallel versus integrated architecture, the parallel structure, where image and numerical pathways are processed independently before final fusion, demonstrates clear advantages over integrated architectures that combine features at each time step. The parallel TF-ViTNet consistently outperforms the integrated ViT-LSTM across both indices. This suggests that allowing each modality to develop specialized representations independently before integration enables more effective feature learning compared to early fusion strategies.
Third, regarding ViT versus CNN for image encoding, the choice of image encoder significantly impacts performance, with effects varying by architecture type. In parallel architectures, the ViT-based TF-ViTNet substantially outperforms the CNN-based TF-CNet, especially for the volatile NASDAQ market, where TF-CNet shows negative performance. This superiority stems from ViT’s ability to capture global patterns in scalograms through self-attention mechanisms. Conversely, in integrated architectures, CNN-LSTM outperforms ViT-LSTM in both markets, suggesting that CNNs’ inductive biases for local pattern extraction align better with sequential feature combination strategies.
Fourth, based on the impact of data augmentation, effectiveness varies dramatically by market characteristics. For the more stable S&P 500 market, Affine augmentation improves overall performance, demonstrating that regularization benefits markets with cleaner underlying patterns where the stronger signal can tolerate additional variation without corruption. Conversely, for the highly volatile NASDAQ market, the model without augmentation achieves the best performance, suggesting that augmentation introduces unwanted noise that compounds the market’s inherent volatility. When underlying patterns are already noisy and irregular, augmentation-induced variations interfere with the already-weak signal. This finding emphasizes that augmentation strategies must be tailored to specific market signal-to-noise characteristics rather than applied universally.
Fifth, regarding performance across volatility regimes, TF-ViTNet demonstrates differential effectiveness depending on volatility levels. In extreme high volatility conditions, the model excels particularly for NASDAQ, achieving the lowest prediction errors among all models including the naive benchmark. For S&P 500, TF-ViTNet outperforms all deep learning models in high volatility, though it is slightly behind the naive benchmark. In mid-range volatility, TF-ViTNet maintains competitive performance, while in low-volatility regimes, simpler models occasionally achieve lower errors. This pattern indicates that TF-ViTNet’s primary strength lies in maintaining stability during market stress when accurate predictions are most critical for risk management, even if it does not always achieve the absolute lowest errors during calm periods.
Our results can be contextualized within the broader literature on volatility prediction. Cho and Lee [72] employed deep learning models to predict absolute returns as a proxy for stock volatility using S&P 500 data, achieving annual R 2 values of 0.223 in 2018, 0.261 in 2019, and 0.502 in 2020. In comparison, TF-ViTNet achieves R 2 values of 0.185, 0.307, and 0.583 for the same years on the same index. While our model underperforms slightly in 2018, it demonstrates superior performance in 2019 and particularly in 2020 during the COVID-19 volatility surge, suggesting that our time–frequency approach with ViTs may be especially effective in capturing extreme volatility patterns. Moreover, our use of Parkinson’s volatility, which incorporates intraday high-low range information, likely provides a more statistically efficient volatility measure compared to absolute returns, contributing to improved predictive accuracy.
Despite these promising results, several limitations warrant consideration. The model’s architectural sophistication comes at the cost of substantial computational requirements and sensitivity to hyperparameter choices, which may pose challenges for resource-constrained environments. Our validation focuses on two major U.S. equity indices, and generalizability to other asset classes with different volatility characteristics remains to be validated. Furthermore, like many deep learning models, TF-ViTNet operates largely as a black box, and understanding which specific time–frequency patterns drive predictions remains limited. Future research could address these limitations by developing attention visualization techniques to interpret model behavior and extending the framework to multi-asset settings. More sophisticated fusion mechanisms, such as cross-modal attention, may further improve performance, while rigorous testing in real-time trading simulations would provide crucial insights into practical deployment feasibility under realistic market constraints.
In conclusion, this study advances financial volatility prediction by demonstrating that parallel integration of time–frequency representations analyzed through ViTs with numerical technical indicators can achieve superior and more robust predictions compared to traditional approaches. The findings contribute to both the theoretical understanding of integrating heterogeneous data representations for financial forecasting and the practical development of more reliable risk management tools.
Beyond methodological contributions, our results carry direct implications for risk management and investment decision-making. The model’s strong performance during high-volatility regimes—when accurate forecasts are most critical—suggests that TF-ViTNet can serve as an effective early-warning mechanism for market turbulence, supporting timely adjustments to portfolio exposure, leverage, and hedging strategies. Conversely, the model’s comparatively moderate advantage in low-volatility periods highlights its suitability for adaptive volatility targeting frameworks, allowing investors to recalibrate position sizes based on regime-dependent prediction accuracy. These insights indicate that TF-ViTNet is particularly valuable for practitioners seeking robust volatility predictions to inform dynamic risk control, stress-scenario preparation, and strategic asset allocation under varying market conditions.

Author Contributions

Conceptualization, P.C.; methodology, M.W.; software, M.W.; validation, M.W.; formal analysis, P.C.; investigation, P.C.; resources, P.C.; data curation, M.W.; writing—original draft preparation, P.C. and M.W.; writing—review and editing, P.C. and M.W.; visualization, M.W.; supervision, P.C.; project administration, P.C.; funding acquisition, P.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LSTM: long short-term memory
CNN: convolutional neural network
RNN: recurrent neural network
GARCH: generalized autoregressive conditional heteroskedasticity
EMD: empirical mode decomposition
VMD: variational mode decomposition
ViT: vision transformer
STFT: short-time Fourier transform
CWT: continuous wavelet transform
HHT: Hilbert–Huang transform

Appendix A. Detailed Results

The following tickers indicate the assets for which the TF-ViTNet architecture achieved a higher coefficient of determination R 2 than the CNN-LSTM baseline. This result highlights the superior predictive capability of vision transformer-based models in these selected equities.
Relevant Tickers
AGM, ALG, AMT, APH, AVGO, AXP, BAC, BANF, BKNG, CME, CPK, DDS, DE, DORM, DUK, FORR, GOOG, GRC, IBM, JPM, KKR, LLY, MCK, MDT, MSEX, MSFT, NEE, NEM, NHI, PEE, PEP, PH, PRLB, RTX, STRA, TMO, THRM, TR, WEC, WTS
For these series, TF-ViTNet consistently demonstrated an improvement in R 2 over the CNN-LSTM comparison, supporting the robustness of transformer-based volatility modeling on large-cap US stocks.
The full list of tickers utilized in this study is summarized by sector in Table A1 below.
Table A1. Full Ticker List Grouped by Sector.
Sector | Tickers
Communication Services | CMCSA, CPK, MSEX, VZ
Consumer Cyclical | AMZN, BKNG, CRMT, DDS, DIS, DORM, SMP, STRA, TSLA
Consumer Defensive | MZTI, PEP, TR
Financials | AXP, BAC, BLK, CME, GS, JPM, KKR, MS, WFC
Healthcare | ISRG, LLY, MCK, MDT, MRK, PFE, THRM, TMO
Industrials | AMG, BANF, DE, ETN, HON, LKFN, PH, RTX
Materials | NEM
Real Estate | NHI, TRC
Technology | AAPL, ADI, AMAT, AMD, APH, AVGO, CRM, GOOG, IBM, INTC, KLAC, LRCX, META, MSFT, MU, NVDA, ORCL, QCOM, TXN
Utilities | ALG, DUK, FORR, GRC, NEE, PRLB, WTS

Appendix B. Individual Stock Prediction

The analysis of volatility-based prediction performance revealed that the TF-ViTNet model attained the highest number of maximum R² values overall (99 counts), establishing it as the superior model in terms of comprehensive predictive capability. Specifically, TF-ViTNet demonstrated an overwhelming advantage in the low-volatility segment. Conversely, the CNN-LSTM model achieved the highest frequency of best R² values in both the medium- and high-volatility segments, suggesting its efficacy in capturing complex patterns within challenging, higher-volatility environments. In stark contrast, the benchmark recorded only one maximum R² count in both the medium- and high-volatility segments, confirming its significant performance degradation relative to the two deep learning models as market volatility increases.
In summary, while CNN-LSTM exhibited a comparative edge in higher volatility markets, TF-ViTNet’s superior overall frequency of top performance metrics designates it as the most robust predictor across the entire dataset.
Table A2. Low, Medium, and High Volatility R² for each ticker (Part 1 of 2). Each cell reports Low Vol / Medium Vol / High Vol.
Ticker | CNN-LSTM | TF-ViTNet | Benchmark

Healthcare
LLY | 0.9327 / 0.9635 / −1.3759 | 0.8556 / 0.9730 / −1.0172 | 0.8190 / 0.5094 / −1.8501
MRK | 0.7767 / 0.8794 / −0.4313 | 0.8102 / 0.9090 / −0.5046 | 0.7512 / 0.6035 / −1.0985
PFE | 0.8713 / 0.9383 / −2.1172 | 0.9028 / 0.9478 / −2.5285 | 0.6733 / 0.6279 / −3.3271
TMO | 0.7139 / 0.8141 / −0.5024 | 0.6787 / 0.7518 / −0.4468 | 0.7906 / 0.8755 / −1.2984
ISRG | 0.8416 / 0.9036 / −1.3685 | 0.7325 / 0.8084 / −0.8324 | 0.7899 / 0.2515 / −2.0741
MDT | 0.7453 / 0.8527 / −0.8326 | 0.7102 / 0.8219 / −0.7194 | 0.7916 / 0.5721 / −2.1178
MCK | 0.7044 / 0.8756 / −0.3048 | 0.5072 / 0.6926 / −0.1569 | 0.8034 / 0.7398 / −1.2148
THRM | 0.8432 / 0.9425 / −1.2476 | 0.8698 / 0.9317 / −1.4008 | 0.6632 / 0.0973 / −1.6179

Technology
NVDA | 0.8849 / 0.9687 / −1.4912 | 0.7552 / 0.8656 / −0.5929 | 0.8344 / 0.4980 / −1.8405
AAPL | 0.6404 / 0.7499 / −0.1088 | 0.6919 / 0.8023 / −0.1735 | 0.7905 / 0.6399 / −0.6153
MSFT | 0.5582 / 0.5601 / −0.4893 | 0.7206 / 0.7315 / −0.9547 | 0.7273 / 0.3668 / −2.5496
GOOG | 0.6554 / 0.7914 / −0.7299 | 0.6264 / 0.7532 / −0.6254 | 0.7025 / 0.2796 / −2.2507
META | 0.7753 / 0.8485 / −0.7702 | 0.7451 / 0.7977 / −0.6254 | 0.7647 / 0.6242 / −2.0838
AVGO | 0.9297 / 0.8978 / −1.6178 | 0.9625 / 0.8817 / −2.0866 | 0.7745 / 0.5272 / −1.6092
ORCL | 0.7846 / 0.9097 / −0.8016 | 0.7945 / 0.9337 / −0.8888 | 0.7928 / 0.4737 / −1.5646
CRM | 0.7267 / 0.8069 / −0.4768 | 0.7691 / 0.8630 / −0.7222 | 0.7597 / 0.6398 / −2.0573
AMD | 0.7878 / 0.9140 / −1.0348 | 0.7964 / 0.8898 / −0.9309 | 0.7828 / 0.4635 / −2.5717
QCOM | 0.7645 / 0.7955 / −0.6531 | 0.8488 / 0.8776 / −1.0677 | 0.8338 / 0.3849 / −2.1320
INTC | 0.9078 / 0.9080 / −1.6702 | 0.9322 / 0.8896 / −1.9235 | 0.8160 / 0.2863 / −1.9645
IBM | 0.9391 / 0.9613 / −1.9347 | 0.8149 / 0.9078 / −1.1427 | 0.7934 / 0.4716 / −2.1866
TXN | 0.7404 / 0.8788 / −0.9481 | 0.7534 / 0.8508 / −0.9250 | 0.7163 / 0.2747 / −1.9575
AMAT | 0.8011 / 0.8985 / −0.6630 | 0.7144 / 0.8362 / −0.4388 | 0.7825 / 0.6130 / −1.7053
MU | 0.8669 / 0.9307 / −1.3497 | 0.8817 / 0.9353 / −1.4803 | 0.8190 / 0.4159 / −2.1167
ADI | 0.7783 / 0.8527 / −0.7760 | 0.8586 / 0.9103 / −1.1629 | 0.7300 / 0.5121 / −2.0587
LRCX | 0.8458 / 0.9055 / −0.9791 | 0.7651 / 0.8172 / −0.5710 | 0.7661 / 0.6465 / −1.9947
KLAC | 0.8989 / 0.9393 / −1.2871 | 0.9602 / 0.9602 / −0.9211 | 0.8220 / 0.6322 / −1.4051
APH | 0.9290 / 0.8709 / −0.6605 | 0.9772 / 0.8420 / −1.2360 | 0.8651 / 0.5871 / −1.1683

Consumer Cyclical
AMZN | 0.6707 / 0.8414 / −0.4127 | 0.7275 / 0.8999 / −0.5450 | 0.6577 / 0.5518 / −1.4143
TSLA | 0.8779 / 0.9397 / −1.5767 | 0.8036 / 0.9405 / −1.1342 | 0.7173 / 0.6438 / −1.8295
BKNG | 0.8646 / 0.8843 / −1.2293 | 0.8829 / 0.9208 / −1.3690 | 0.6762 / 0.5937 / −1.5694
DIS | 0.8082 / 0.9103 / −1.0188 | 0.8195 / 0.9014 / −1.1249 | 0.7792 / 0.6299 / −1.5457
DORM | 0.6899 / 0.8438 / −0.3535 | 0.6611 / 0.7950 / −0.2859 | 0.7131 / 0.3072 / −0.9585
DDS | 0.4591 / 0.6429 / −0.2991 | 0.3321 / 0.4661 / −0.1930 | 0.4550 / 0.5671 / −1.9046
CRMT | 0.7614 / 0.8662 / −0.8653 | 0.7902 / 0.9021 / −0.9474 | 0.8440 / 0.5299 / −2.1658
SMP | 0.8655 / 0.9519 / −0.8593 | 0.8642 / 0.9228 / −0.8185 | 0.7882 / 0.2145 / −1.2795
STRA | 0.6966 / 0.8269 / −0.1447 | 0.6845 / 0.7989 / −0.1822 | 0.7675 / 0.6574 / −0.8158
Table A3. Low, Medium, and High Volatility R² for each ticker (Part 2 of 2). Each cell reports Low Vol / Medium Vol / High Vol.
Ticker | CNN-LSTM | TF-ViTNet | Benchmark

Consumer Defensive
PEP | 0.8077 / 0.8865 / −0.7588 | 0.7113 / 0.7884 / −0.5552 | 0.7329 / 0.5348 / −1.3735
TR | 0.7295 / 0.9202 / −0.4095 | 0.6919 / 0.8968 / −0.3527 | 0.7658 / 0.6163 / −1.1663
MZTI | 0.8440 / 0.9535 / −0.4502 | 0.8867 / 0.9570 / −0.4909 | 0.8602 / 0.6914 / −1.1593

Financials
JPM | 0.9051 / 0.9244 / −0.9977 | 0.9406 / 0.9323 / −1.2262 | 0.8755 / 0.4175 / −1.5410
BAC | 0.7296 / 0.8943 / −0.8026 | 0.7038 / 0.8773 / −0.6860 | 0.6874 / 0.5314 / −2.0496
WFC | 0.8590 / 0.9334 / −0.8263 | 0.9090 / 0.9297 / −1.0595 | 0.6497 / 0.3564 / −1.3790
GS | 0.8389 / 0.9326 / −0.7934 | 0.8604 / 0.9586 / −0.9564 | 0.8391 / 0.4367 / −1.3698
MS | 0.7627 / 0.8392 / −0.5928 | 0.8104 / 0.8661 / −0.7614 | 0.8179 / 0.5153 / −1.5464
AXP | 0.3800 / 0.4155 / −0.1126 | 0.8805 / 0.9391 / −1.0655 | 0.8246 / 0.4760 / −1.6249
BLK | 0.7445 / 0.8400 / −0.9253 | 0.8169 / 0.8898 / −1.2872 | 0.6208 / 0.6267 / −1.8374
CME | 0.6245 / 0.6927 / −0.6747 | 0.5985 / 0.6605 / −0.6055 | 0.6422 / 0.5659 / −2.1066
KKR | 0.8317 / 0.9321 / −1.0225 | 0.6922 / 0.8261 / −0.5543 | 0.7938 / 0.6715 / −1.3396
AMG | 0.8607 / 0.9635 / −0.8009 | 0.8433 / 0.9575 / −0.7274 | 0.6044 / 0.5278 / −1.6789
BANF | 0.7867 / 0.8651 / −0.6032 | 0.8704 / 0.9426 / −0.9913 | 0.7956 / 0.3842 / −1.5647
LKFN | 0.8742 / 0.9569 / −1.5422 | 0.8549 / 0.9530 / −1.3931 | 0.7205 / 0.4754 / −2.1081

Industrials
HON | 0.7481 / 0.8927 / −0.6584 | 0.7197 / 0.8739 / −0.5628 | 0.8111 / 0.6219 / −1.5219
RTX | 0.5512 / 0.6908 / −0.2188 | 0.6031 / 0.7724 / −0.3177 | 0.8057 / −0.1365 / −1.5227
DE | 0.5771 / 0.6973 / −0.2531 | 0.5737 / 0.6599 / −0.2777 | 0.7559 / 0.2319 / −1.4887
ETN | 0.8618 / 0.8822 / −0.6707 | 0.9605 / 0.9159 / −1.5497 | 0.8460 / 0.5364 / −1.6265
PH | 0.8592 / 0.9448 / −0.8948 | 0.8885 / 0.9475 / −1.0991 | 0.6925 / 0.6577 / −1.1770
PRLB | 0.8765 / 0.9658 / −0.3770 | 0.8460 / 0.9517 / −0.3405 | 0.7330 / 0.7644 / −0.6057
FORR | 0.7939 / 0.9325 / −0.8233 | 0.6501 / 0.7758 / −0.3833 | 0.7656 / 0.1375 / −1.5166
GRC | 0.7879 / 0.9260 / −0.4771 | 0.7066 / 0.8909 / −0.4580 | 0.7179 / 0.2932 / −1.0958
ALG | 0.8433 / 0.9575 / −0.7274 | 0.8608 / 0.9635 / −0.8009 | 0.6044 / 0.5277 / −1.6789
WTS | 0.6582 / 0.7976 / −0.4183 | 0.6294 / 0.7826 / −0.3573 | 0.6679 / 0.5509 / −1.3867

Utilities
NEE | 0.7943 / 0.8091 / −1.4741 | 0.8534 / 0.8543 / −1.9763 | 0.7397 / 0.2498 / −2.4478
DUK | 0.5986 / 0.7655 / −0.5681 | 0.6366 / 0.8125 / −0.6738 | 0.6094 / 0.4710 / −2.0258
MSEX | 0.8581 / 0.9224 / −1.5884 | 0.8972 / 0.9515 / −2.2642 | 0.7047 / 0.4757 / −2.3452
CPK | 0.6716 / 0.8330 / −0.7116 | 0.6596 / 0.8048 / −0.6983 | 0.6828 / 0.4381 / −2.2010

Communication Services
CMCSA | 0.7759 / 0.9069 / −0.5151 | 0.7906 / 0.9068 / −0.5072 | 0.7373 / 0.6514 / −1.4621
VZ | 0.8989 / 0.9394 / −1.2872 | 0.9027 / 0.9355 / −1.3187 | 0.7924 / 0.3971 / −1.6940

Real Estate
NHI | 0.7915 / 0.8672 / −0.4129 | 0.9296 / 0.8306 / −1.4731 | 0.7310 / 0.6599 / −1.4023
TRC | 0.8164 / 0.9360 / −1.4525 | 0.8359 / 0.9362 / −1.5356 | 0.7849 / 0.6719 / −2.2485

Materials
NEM | 0.8366 / 0.9211 / −0.5225 | 0.8554 / 0.9272 / −0.5111 | 0.8401 / 0.3485 / −1.1744

References

  1. Glosten, L.R.; Jagannathan, R.; Runkle, D.E. On the Relation between the Expected Value and the Volatility of the Nominal Excess Return on Stocks. J. Financ. 1993, 48, 1779–1801. [Google Scholar] [CrossRef]
  2. Engle, R.F. Autoregressive Conditional Heteroscedasticity with Estimates of the Variance of United Kingdom Inflation. Econometrica 1982, 50, 987–1007. [Google Scholar] [CrossRef]
  3. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  4. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. arXiv 2017, arXiv:1706.03762. [Google Scholar] [CrossRef]
  5. Huang, N.E.; Shen, Z.; Long, S.R.; Wu, M.C.; Shih, H.H.; Zheng, Q.; Yen, N.C.; Tung, C.C.; Liu, H.H. The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis. Proc. R. Soc. London Ser. A Math. Phys. Eng. Sci. 1998, 454, 903–995. [Google Scholar] [CrossRef]
  6. Ramadhan, A.; Palupi, I.; Wahyudi, B. Candlestick patterns recognition using CNN-LSTM model to predict financial trading position in stock market. J. Comput. Syst. Inform. 2022, 3, 339–347. [Google Scholar] [CrossRef]
  7. Parkinson, M. The extreme value method for estimating the variance of the rate of return. J. Bus. 1980, 53, 61–65. [Google Scholar] [CrossRef]
  8. Hansen, P.R.; Huang, Z.; Shek, H.H. Realized GARCH: A complete model of returns and realized measures of volatility. J. Appl. Econom. 2012, 27, 877–906. [Google Scholar] [CrossRef]
  9. Corsi, F. A simple approximate long-memory model of realized volatility. J. Financ. Econom. 2009, 7, 174–196. [Google Scholar] [CrossRef]
  10. Lim, B.; Arık, S.O.; Loeff, N.; Pfister, T. Temporal fusion transformers for interpretable multi-horizon time series forecasting. Int. J. Forecast. 2021, 37, 1748–1764. [Google Scholar] [CrossRef]
  11. Muhammad, T.; Aftab, A.B.; Ibrahim, M.; Ahsan, M.M.; Muhu, M.M.; Khan, S.I.; Alam, M.S. Transformer-based deep learning model for stock price prediction: A case study on bangladesh stock market. Int. J. Comput. Intell. Appl. 2023, 22, 2350013. [Google Scholar] [CrossRef]
  12. Kabir, M.R.; Bhadra, D.; Ridoy, M.; Milanova, M. LSTM–Transformer-based robust hybrid deep learning model for financial time series forecasting. Sci 2025, 7, 7. [Google Scholar] [CrossRef]
  13. Xie, L.; Chen, Z.; Yu, S. Deep convolutional transformer network for stock movement prediction. Electronics 2024, 13, 4225. [Google Scholar] [CrossRef]
  14. Sattarov, O.; Makhmudov, F. Risk-Aware Crypto Price Prediction Using DQN with Volatility-Adjusted Rewards Across Multi-Period State Representations. Mathematics 2025, 13, 3012. [Google Scholar] [CrossRef]
  15. Kim, D.J.; Kim, D.H.; Choi, S.Y. Modeling Stylized Facts in FX Markets with FINGAN-BiLSTM: A Deep Learning Approach to Financial Time Series. Entropy 2025, 27, 635. [Google Scholar] [CrossRef]
  16. He, K.; Yang, Q.; Ji, L.; Pan, J.; Zou, Y. Financial Time Series Forecasting with the Deep Learning Ensemble Model. Mathematics 2023, 11, 1054. [Google Scholar] [CrossRef]
  17. Mutinda, J.K.; Geletu, A. Stock market index prediction using CEEMDAN-LSTM-BPNN-Decomposition ensemble model. J. Appl. Math. 2025, 1, 7706431. [Google Scholar] [CrossRef]
  18. Zhang, J.; Ye, L.; Lai, Y. Stock Price Prediction Using CNN-BiLSTM-Attention Model. Mathematics 2023, 11, 1985. [Google Scholar] [CrossRef]
  19. Liu, W.; Gui, Z.; Jiang, G.; Tang, L.; Zhou, L.; Leng, W.; Zhang, X.; Liu, Y. Stock volatility prediction based on transformer model using mixed-frequency data. In Proceedings of the Asia-Pacific Web (APWeb) and Web-Age Information Management (WAIM) Joint International Conference on Web and Big Data, Wuhan, China, 6–8 October 2023; pp. 74–88. [Google Scholar] [CrossRef]
  20. Zhang, X. Financial Time Series Forecasting Based on LSTM Neural Network optimized by Wavelet Denoising and Whale Optimization Algorithm. Acad. J. Comput. Inf. Sci. 2022, 5, 1–9. [Google Scholar] [CrossRef]
  21. Huang, Y.; Deng, Y. A new crude oil price forecasting model based on variational mode decomposition. Knowl.-Based Syst. 2021, 213, 106669. [Google Scholar] [CrossRef]
  22. Armah, M.; Amewu, G.; Bossman, A. Time-frequency analysis of financial stress and global commodities prices: Insights from wavelet-based approaches. Cogent Econ. Financ. 2022, 10, 2114161. [Google Scholar] [CrossRef]
  23. Umar, Z.; Bossman, A.; Choi, S.Y.; Vo, X.V. Are short stocks susceptible to geopolitical shocks? Time-Frequency evidence from the Russian-Ukrainian conflict. Financ. Res. Lett. 2023, 52, 103388. [Google Scholar] [CrossRef]
  24. Dezhkam, A.; Manzuri, M.T. Forecasting stock market for an efficient portfolio by combining XGBoost and Hilbert–Huang transform. Eng. Appl. Artif. Intell. 2023, 118, 105626. [Google Scholar] [CrossRef]
  25. Rai, A.; Mahata, A.; Nurujjaman, M.; Majhi, S.; Debnath, K. A sentiment-based modeling and analysis of stock price during the COVID-19: U- and Swoosh-shaped recovery. Phys. A Stat. Mech. Appl. 2022, 592, 126810. [Google Scholar] [CrossRef] [PubMed]
  26. Li, C.; Qian, G. Stock Price Prediction Using a Frequency Decomposition Based GRU Transformer Neural Network. Appl. Sci. 2023, 13, 222. [Google Scholar] [CrossRef]
  27. Palma, G.R.; Skoczeń, M.; Maguire, P. Asset Price Movement Prediction Using Empirical Mode Decomposition and Gaussian Mixture Models. arXiv 2025, arXiv:2025.20678. [Google Scholar] [CrossRef]
  28. Xu, C.; Zhao, X.; Wang, Y. Causal decomposition on multiple time scales: Evidence from stock price-volume time series. Chaos Solitons Fractals 2022, 159, 112137. [Google Scholar] [CrossRef]
  29. Kim, D.H.; Kim, D.J.; Choi, S.Y. A Variational-Mode-Decomposition-Cascaded Long Short-Term Memory with Attention Model for VIX Prediction. Appl. Sci. 2025, 15, 5630. [Google Scholar] [CrossRef]
  30. Christensen, B.; Prabhala, N. The relation between implied and realized volatility. J. Financ. Econ. 1998, 50, 125–150. [Google Scholar] [CrossRef]
  31. Bao, W.; Yue, J.; Rao, Y. A deep learning framework for financial time series using stacked autoencoders and long-short term memory. PLoS ONE 2017, 12, e0180944. [Google Scholar] [CrossRef]
  32. Shi, Y.; Tan, Y.; Li, H.; Rangwala, H.; Wang, L. Multivariate realized volatility forecasting with graph neural network. In Proceedings of the 2021 International Conference on Management of Data, Virtual, 20–25 June 2021; pp. 2849–2852. [Google Scholar] [CrossRef]
  33. Liu, C.; Paterlini, S. Stock price prediction using temporal graph model with value chain data. arXiv 2023, arXiv:2303.09406. [Google Scholar] [CrossRef]
  34. Zhang, C.; Zhang, Y.; Cucuringu, M.; Qian, Z. Volatility forecasting with machine learning and intraday commonality. J. Financ. Econom. 2023, 22, 492–530. [Google Scholar] [CrossRef]
  35. Pérez-Hernández, F.; de Pablos, A.A.; del Mar Camacho-Miñano, M. A hybrid model integrating artificial neural network with multiple GARCH-type models and EWMA for performing the optimal volatility forecasting of market risk factors. Expert Syst. Appl. 2024, 243, 122896. [Google Scholar] [CrossRef]
  36. Kusuma, R.M.I.; Ho, T.T.; Kao, W.C.; Ou, Y.Y.; Hua, K.L. Using deep learning neural networks and candlestick chart representation to predict stock market. arXiv 2019, arXiv:1903.12258. [Google Scholar] [CrossRef]
  37. Lee, J.; Kim, J.; Lee, J.M. Global stock market prediction based on stock chart images using deep q-network. IEEE Access 2018, 7, 167260–167277. [Google Scholar] [CrossRef]
  38. Khalid, T.; Rida, S.M.; Taher, Z. From time series to images: Revolutionizing stock market predictions with convolutional deep neural networks. Int. J. Adv. Comput. Sci. Appl. 2024, 15, 0150144. [Google Scholar] [CrossRef]
  39. Pongsena, W.; Ditsayabut, P.; Kerdprasop, N.; Kerdprasop, K. Deep learning for financial time-series data analytics: An image processing based approach. Int. J. Mach. Learn. Comput. 2020, 10, 51–56. [Google Scholar] [CrossRef]
  40. Xu, Z.; Wang, Y.; Feng, X.; Wang, Y.; Li, Y.; Lin, H. Quantum-Enhanced Forecasting: Leveraging Quantum Gramian Angular Field and CNNs for Stock Return Predictions. Financ. Res. Lett. 2024, 67, 105840. [Google Scholar] [CrossRef]
  41. Altuntaş, Y.; Okumuş, F.; Kocamaz, A.F. Algorithmic Silver Trading via Fine-Tuned CNN-Based Image Classification and Relative Strength Index-Guided Price Direction Prediction. Symmetry 2025, 17, 1338. [Google Scholar] [CrossRef]
  42. Hu, G.; Hu, Y.; Yang, K.; Yu, Z.; Sung, F.; Zhang, Z.; Xie, F.; Liu, J.; Robertson, N.; Hospedales, T.; et al. Deep stock representation learning: From candlestick charts to investment decisions. arXiv 2018, arXiv:1709.03803. [Google Scholar] [CrossRef]
  43. Pei, Z.; Yan, J.; Yan, J.; Yang, B.; Li, Z.; Zhang, L.; Liu, X.; Zhang, Y. A stock price prediction approach based on time series decomposition and multi-scale CNN using OHLCT images. arXiv 2024, arXiv:2410.19291. [Google Scholar] [CrossRef]
  44. Allen, J.; Rabiner, L. A unified approach to short-time Fourier analysis and synthesis. Proc. IEEE 1977, 65, 1558–1564. [Google Scholar] [CrossRef]
  45. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar] [CrossRef]
  46. Wightman, R. PyTorch Image Models. 2019. Available online: https://github.com/rwightman/pytorch-image-models (accessed on 1 November 2025).
  47. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, New York, NY, USA, 4–8 August 2019; pp. 2623–2631. [Google Scholar] [CrossRef]
  48. Huber, P.J. Robust Estimation of a Location Parameter. Ann. Math. Stat. 1964, 35, 73–101. [Google Scholar] [CrossRef]
  49. Chen, X.; Xie, S.; He, K. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 9640–9649. [Google Scholar] [CrossRef]
  50. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar] [CrossRef]
  51. Chen, M.; Peng, H.; Fu, J.; Ling, H. Autoformer: Searching transformers for visual recognition. arXiv 2021, arXiv:2107.00651. [Google Scholar] [CrossRef]
  52. Himel, G.M.S.; Islam, M.M.; Al-Aff, K.A.; Karim, S.I.; Sikder, M.K.U. Skin Cancer Segmentation and Classification Using Vision Transformer for Automatic Analysis in Dermatoscopy-based Non-invasive Digital System. arXiv 2024, arXiv:2401.04746. [Google Scholar] [CrossRef]
  53. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  54. Steiner, A.; Kolesnikov, A.; Zhai, X.; Wightman, R.; Uszkoreit, J.; Beyer, L. How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers. arXiv 2022, arXiv:2106.10270. [Google Scholar] [CrossRef]
  55. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016. [Google Scholar]
  56. Fama, E.F. The Behavior of Stock-Market Prices. J. Bus. 1965, 38, 34–105. [Google Scholar] [CrossRef]
  57. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar] [CrossRef]
  58. Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Proceedings of the 2014 Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Volume 27. [Google Scholar] [CrossRef]
  59. Bollerslev, T. Generalized autoregressive conditional heteroskedasticity. J. Econom. 1986, 31, 307–327. [Google Scholar] [CrossRef]
  60. Murphy, J.J. Technical Analysis of the Financial Markets: A Comprehensive Guide to Trading Methods and Applications; Penguin: London, UK, 1999. [Google Scholar]
  61. Wilder, J.W., Jr. New Concepts in Technical Trading Systems; Trend Research: Edmonton, AB, Canada, 1978. [Google Scholar]
  62. Lane, G.C. Lane’s stochastics. Tech. Anal. Stock. Commod. 1984, 2, 87–90. [Google Scholar]
  63. Appel, G. Technical Analysis: Power Tools for Active Investors; Prentice Hall Press: Saddle River, NJ, USA, 2005. [Google Scholar]
  64. Bollinger, J. Bollinger on Bollinger Bands; McGraw-Hill: Columbus, OH, USA, 2001. [Google Scholar]
  65. López de Prado, M. Advances in Financial Machine Learning; John Wiley & Sons: Hoboken, NJ, USA, 2018. [Google Scholar]
  66. Lee, G.R.; Gommers, R.; Waselewski, F.; Wohlfahrt, K.; O’Leary, A. PyWavelets: A Python package for wavelet analysis. J. Open Source Softw. 2019, 4, 1237. [Google Scholar] [CrossRef]
  67. Shorten, C.; Khoshgoftaar, T.M. A survey on image data augmentation for deep learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
  68. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems; ACM: New York, NY, USA, 2012; Volume 25. [Google Scholar]
  69. Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.C.; Zoph, B.; Cubuk, E.D.; Le, Q.V. Specaugment: A simple data augmentation method for automatic speech recognition. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 2613–2617. [Google Scholar]
  70. Taylor, L.; Nitschke, G. Improving deep learning with generic data augmentation. In Proceedings of the 2018 IEEE Symposium Series on Computational Intelligence (SSCI); IEEE: New York, NY, USA, 2018; pp. 1542–1547. [Google Scholar] [CrossRef]
  71. Pagan, A.R.; Sossounov, K.A. A Simple Framework for Analyzing Bull and Bear Markets. J. Appl. Econom. 2003, 18, 23–46. [Google Scholar] [CrossRef]
  72. Cho, P.; Lee, M. Forecasting the Volatility of the Stock Index with Deep Learning Using Asymmetric Hurst Exponents. Fractal Fract. 2022, 6, 394. [Google Scholar] [CrossRef]
Figure 1. Vision Transformer Architecture.
Figure 2. Hybrid TF-ViTNet Architecture.
Figure 3. Time series plots of daily Close Price (top row) and Parkinson’s Volatility (bottom row) for the NASDAQ Composite and S&P 500 indices from 2000 to 2024.
Figure 4. Sample Scalogram Images Generated for TF-ViTNet Input.
Figure 5. Performance Evaluation of the Parkinson’s Volatility Prediction Model for the NASDAQ Index.
Figure 6. Performance Evaluation of the Parkinson’s Volatility Prediction Model for the S&P 500 Index.
Figure 7. Cross-Sectional R² Distribution for Parkinson’s Volatility Prediction Across S&P 500 Top-50 and Russell 3000 Low-Liquidity Stocks.
Table 1. Benchmark architecture.
Model | Input Data | Architecture Type | Symbolic Representation
LSTM | Numerical Only | Single-Stream | $f_{FC}(f_{LSTM}(X_{num}))$
CNN-LSTM | Image + Numerical | Integrated Two-Stream | $f_{FC}(f_{LSTM}([X_{num}; f_{CNN}(I_{img})]))$
ViT-LSTM | Image + Numerical | Integrated Two-Stream | $f_{FC}(f_{LSTM}([X_{num}; f_{ViT}(I_{img})]))$
TF-CNet | Image + Numerical | Parallel Two-Stream | $f_{FC}([f_{LSTM\_img}(f_{CNN}(I_{img})); f_{LSTM\_num}(X_{num})])$
TF-ViTNet | Image + Numerical | Parallel Two-Stream | $f_{FC}([f_{LSTM\_img}(f_{ViT}(I_{img})); f_{LSTM\_num}(X_{num})])$
Table 2. Consolidated Summary of Hyperparameters for All Models.
Parameter | Value/Setting | Setting Method

Data
Sequence Length | 60 | Fixed
Numeric Features | 23 | Fixed
Image Size | (224, 224) | Fixed (image-based models)

Model Architecture (LSTM Baseline)
LSTM Hidden Size | {64, 128, 256} | Tuned by Optuna
LSTM Num Layers | {1, 2} | Tuned by Optuna

Model Architecture (CNN-based Baselines: TF-CNet, CNN-LSTM)
CNN Backbone | VGG-style CNN | Fixed
Numeric LSTM Hidden Size | 256 | Fixed
CNN Output Dimension | {32, 64, 128, 256} | Tuned by Optuna
CNN-LSTM Hidden Size | {32, 64, 128} | Tuned by Optuna
CNN-LSTM Num Layers | {1, 2} | Tuned by Optuna

Model Architecture (Proposed Model: TF-ViTNet)
ViT Backbone | vit_tiny_patch16_224 | Fixed
Numeric LSTM Hidden Size | 256 | Fixed
ViT Output Dimension | {64, 128, 256, 384} | Tuned by Optuna
ViT-LSTM Hidden Size | {64, 128, 256} | Tuned by Optuna
ViT-LSTM Num Layers | {1, 2} | Tuned by Optuna

Common Training & Optimization (All Deep Learning Models)
Loss Function | HuberLoss | Fixed
Max Epochs | 200 | Fixed
Early Stopping Patience | 10 | Fixed
Gradient Clipping | max_norm = 1.0 | Fixed
Optimizer | Adam | Fixed
LR Scheduler | ReduceLROnPlateau | Fixed
Batch Size | {16, 32, 64} | Tuned by Optuna
Learning Rate | [10⁻⁶, 5 × 10⁻³] (log scale) | Tuned by Optuna
Dropout Rate | [0.1, 0.3] | Tuned by Optuna
Weight Decay | [10⁻⁶, 10⁻⁴] (log scale) | Tuned by Optuna
Table 3. Technical Indicators Grouped by Category.
Indicator Name | Formula

Trend Indicators
Simple Moving Average | $\mathrm{SMA}_t(n) = \frac{1}{n}\sum_{i=0}^{n-1} cp_{t-i}$
Exponential Moving Average | $\mathrm{EMA}_t(n) = \alpha \cdot cp_t + (1-\alpha) \cdot \mathrm{EMA}_{t-1}(n)$

Momentum Indicators
Relative Strength Index | $\mathrm{RSI}_t = 100 - \frac{100}{1 + RS_t}$
Moving Average Convergence Divergence | $\mathrm{MACD}_t = \mathrm{EMA}_{12}(cp_t) - \mathrm{EMA}_{26}(cp_t)$
Momentum | $\mathrm{MOM}_t(n) = cp_t - cp_{t-n}$
Stochastic Oscillator (%K & %D) | $\%K_t = 100 \cdot \frac{cp_t - lp_n}{hp_n - lp_n}$
Williams %R | $\%R_t = -100 \cdot \frac{hp_n - cp_t}{hp_n - lp_n}$
Commodity Channel Index | $\mathrm{CCI}_t = \frac{TP_t - \mathrm{SMA}(TP, n)_t}{0.015 \cdot \mathrm{MAD}(TP, n)_t}$

Volatility Indicators
Bollinger Bands (Upper/Lower) | $\mathrm{BB}_t = \mathrm{SMA}_t(n) \pm k \cdot \sigma_t(n)$
Average True Range | $\mathrm{ATR}_t(n) = \frac{1}{n}\sum_{i=0}^{n-1} \mathrm{TR}_{t-i}$
Parkinson’s Volatility | $\sigma_{P,t} = \frac{1}{2\sqrt{\ln 2}} \ln\!\left(\frac{hp_t}{lp_t}\right)$
Log Return | $\ln(cp_t / cp_{t-1})$

Volume Indicator
Volume-Weighted Average Price | $\mathrm{VWAP}_t = \frac{\sum_{i=1}^{t} cp_i \cdot v_i}{\sum_{i=1}^{t} v_i}$

Here, $cp_t$, $hp_t$, $lp_t$, and $v_t$ are the closing price, highest price, lowest price, and volume at time $t$. The parameters $n$, $m$, and $k$ represent the time period, smoothing factor, and standard deviation multiplier. For indicators with a lookback period, $hp_n$ and $lp_n$ are the highest high and lowest low over the last $n$ periods. The EMA smoothing constant is $\alpha = 2/(n+1)$, and $\sigma_t(n)$ is the standard deviation of closing prices over $n$ periods. Intermediate terms include the Typical Price $TP_t = (hp_t + lp_t + cp_t)/3$, the Mean Absolute Deviation $\mathrm{MAD}(TP, n)_t$, the Relative Strength $RS_t$ (the ratio of EMAs of upward and downward price movements), and the True Range $\mathrm{TR}_t = \max(hp_t - lp_t,\ |hp_t - cp_{t-1}|,\ |lp_t - cp_{t-1}|)$.
Table 4. NASDAQ Descriptive Statistics (January 2000–December 2024).

| Indicator | Mean | Std. | Min | Max | Skew | Kurtosis | JB | ADF |
|---|---|---|---|---|---|---|---|---|
| Close | 5454.34 | 4480.27 | 1114.11 | 20,173.89 | 1.30 | 0.60 | 1874.40 | 2.59 |
| High | 5492.46 | 4508.94 | 1135.89 | 20,204.58 | 1.30 | 0.59 | 1870.17 | 2.65 |
| Low | 5411.63 | 4447.23 | 1108.49 | 20,004.73 | 1.31 | 0.60 | 1879.84 | 2.50 |
| Open | 5454.28 | 4479.68 | 1116.76 | 20,114.98 | 1.30 | 0.60 | 1876.22 | 2.41 |
| Volume | 2.54 × 10⁹ | 1.38 × 10⁹ | 2.21 × 10⁸ | 1.19 × 10¹⁰ | 1.94 | 3.99 | 8096.62 | −0.76 |
| RSI | 53.96 | 11.67 | 15.43 | 86.70 | −0.21 | −0.60 | 138.65 | −16.16 |
| MACD | 17.67 | 98.83 | −603.73 | 379.73 | −0.55 | 5.23 | 7457.30 | −9.39 |
| ADX | 22.65 | 7.74 | 7.28 | 51.09 | 0.70 | 0.03 | 511.48 | −13.43 |
| EMA | 5430.38 | 4442.06 | 1192.44 | 19,665.47 | 1.30 | 0.56 | 1842.79 | 2.84 |
| Stochastic K | 60.40 | 32.13 | 0.00 | 100.00 | −0.42 | −1.19 | 559.89 | −15.09 |
| Stochastic D | 60.41 | 29.85 | 0.30 | 99.97 | −0.39 | −1.26 | 574.97 | −11.46 |
| Williams R | −39.60 | 32.13 | −100.00 | 0.00 | −0.42 | −1.19 | 559.89 | −15.09 |
| CCI | 23.53 | 108.17 | −409.29 | 283.95 | −0.44 | −0.65 | 314.69 | −20.36 |
| VWAP | 3078.51 | 1278.24 | 2209.98 | 7363.22 | 1.78 | 2.03 | 4412.29 | 5.14 |
| Bollinger Upper | 5665.84 | 4632.46 | 1273.28 | 20,359.02 | 1.29 | 0.53 | 1828.48 | 3.02 |
| Bollinger Lower | 5194.82 | 4258.61 | 1078.95 | 19,257.61 | 1.30 | 0.61 | 1874.19 | 2.64 |
| Momentum | 25.16 | 282.90 | −2047.58 | 1773.68 | −0.61 | 7.46 | 14,958.18 | −12.53 |
| SMA 50 | 5392.55 | 4387.53 | 1260.62 | 19,182.89 | 1.29 | 0.52 | 1812.26 | 2.92 |
| SMA 200 | 5212.99 | 4150.27 | 1336.56 | 17,663.80 | 1.27 | 0.40 | 1740.70 | 1.11 |
| Log Return | 0.0002 | 0.0157 | −0.1315 | 0.1325 | −0.14 | 6.16 | 9956.67 | −18.46 |
| Parkinson Vol. | 0.0096 | 0.0070 | 0.0012 | 0.0963 | 2.71 | 13.85 | 57,846.18 | −5.47 |
Table 5. S&P 500 Descriptive Statistics (January 2000–December 2024).

| Indicator | Mean | Std. | Min | Max | Skew | Kurtosis | JB | ADF |
|---|---|---|---|---|---|---|---|---|
| Close | 2111.41 | 1242.50 | 676.53 | 6090.27 | 1.23 | 0.51 | 1663.02 | −1.61 |
| High | 2123.24 | 1247.90 | 695.27 | 6099.97 | 1.23 | 0.50 | 1656.88 | −1.56 |
| Low | 2098.10 | 1236.33 | 666.79 | 6079.98 | 1.23 | 0.52 | 1667.63 | −1.69 |
| Open | 2111.12 | 1242.27 | 679.28 | 6089.03 | 1.23 | 0.51 | 1664.12 | −1.61 |
| Volume | 3.35 × 10⁹ | 1.49 × 10⁹ | 2.47 × 10⁸ | 1.15 × 10¹⁰ | 0.47 | 0.77 | 329.74 | −17.42 |
| RSI | 53.94 | 11.27 | 13.64 | 86.69 | −0.28 | −0.39 | 54.33 | −16.48 |
| MACD | 5.08 | 26.02 | −237.02 | 92.58 | −1.41 | 9.90 | 22,961.42 | −11.14 |
| ADX | 22.21 | 7.71 | 7.89 | 53.10 | 0.87 | 0.59 | 602.83 | −14.39 |
| EMA | 2104.54 | 1232.44 | 743.37 | 6023.63 | 1.22 | 0.48 | 1632.75 | −1.50 |
| Stochastic K | 61.27 | 31.51 | 0.00 | 100.00 | −0.47 | −1.14 | 300.75 | −15.44 |
| Stochastic D | 61.28 | 29.13 | 0.87 | 99.93 | −0.43 | −1.21 | 270.07 | −15.42 |
| Williams R | −38.73 | 31.51 | −100.00 | 0.00 | 0.47 | −1.14 | 300.75 | −21.65 |
| CCI | 23.45 | 108.79 | −401.40 | 286.42 | −0.54 | −0.45 | 143.26 | −16.03 |
| VWAP | 1395.30 | 302.83 | 1137.04 | 2273.29 | 1.38 | 0.72 | 1076.01 | −2.06 |
| Bollinger Upper | 2172.63 | 1271.39 | 808.07 | 6159.78 | 1.22 | 0.44 | 1621.75 | −1.46 |
| Bollinger Lower | 2036.45 | 1195.95 | 646.04 | 5919.61 | 1.22 | 0.53 | 1664.06 | −1.74 |
| Momentum | 7.20 | 76.05 | −732.02 | 426.28 | −1.09 | 10.40 | 25,217.40 | −13.04 |
| SMA 50 | 2093.64 | 1217.48 | 788.96 | 5942.13 | 1.21 | 0.44 | 1591.75 | −1.69 |
| SMA 200 | 2042.52 | 1150.67 | 870.57 | 5547.79 | 1.18 | 0.26 | 1461.34 | −1.91 |
| Log Return | 0.0002 | 0.0122 | −0.1277 | 0.1096 | −0.39 | 10.40 | 25,227.46 | −20.66 |
| Parkinson Vol. | 0.0078 | 0.0059 | 0.0009 | 0.0655 | 3.05 | 16.23 | 71,207.13 | −14.65 |
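The summary columns of Tables 4 and 5 can be reproduced per series with scipy and statsmodels, as in the following sketch. Note that kurtosis is reported as excess kurtosis (consistent with the negative values above) and that only the ADF test statistic, not its p-value, is tabulated; more negative ADF values indicate stronger evidence of stationarity.

```python
import pandas as pd
from scipy import stats
from statsmodels.tsa.stattools import adfuller

def describe_series(s: pd.Series) -> dict:
    """One row of Tables 4-5 for a single indicator series."""
    s = s.dropna()
    jb_stat, _ = stats.jarque_bera(s)   # Jarque-Bera normality statistic
    adf_stat = adfuller(s)[0]           # ADF test statistic only
    return {
        "Mean": s.mean(), "Std.": s.std(), "Min": s.min(), "Max": s.max(),
        "Skew": s.skew(), "Kurtosis": s.kurt(),  # excess kurtosis
        "JB": jb_stat, "ADF": adf_stat,
    }
```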
Table 6. Overall prediction performance summary.

| Data | Model | R² | MSE (×10⁻⁵) | RMSE (×10⁻³) | MAPE (%) |
|---|---|---|---|---|---|
| Nasdaq | TF-ViTNet | 0.387 | 1.66 | 4.08 | 43.2 |
| | ViT-LSTM | 0.265 | 2.00 | 4.47 | 44.7 |
| | TF-CNet | −0.095 | 2.97 | 5.45 | 62.4 |
| | CNN-LSTM | 0.308 | 1.88 | 4.34 | 44.8 |
| | LSTM | 0.223 | 2.18 | 4.67 | 50.2 |
| | GARCH (1,1) | 0.197 | 2.13 | 4.62 | 37.2 |
| | HAR-RV | 0.338 | 1.76 | 4.19 | 35.8 |
| | Realized-GARCH | 0.194 | 2.14 | 4.62 | 42.2 |
| | Random Walk | 0.239 | 2.00 | 4.55 | 44.9 |
| | Benchmark | 0.246 | 2.05 | 4.52 | 44.2 |
| S&P 500 | TF-ViTNet | 0.436 | 1.30 | 3.61 | 42.9 |
| | ViT-LSTM | 0.291 | 1.64 | 4.05 | 42.0 |
| | TF-CNet | 0.375 | 1.45 | 3.80 | 41.8 |
| | CNN-LSTM | 0.415 | 1.35 | 3.68 | 43.0 |
| | LSTM | 0.235 | 2.08 | 4.56 | 51.3 |
| | GARCH (1,1) | 0.246 | 1.30 | 3.61 | 43.5 |
| | HAR-RV | 0.371 | 1.08 | 3.29 | 39.2 |
| | Realized-GARCH | 0.219 | 1.35 | 3.67 | 43.7 |
| | Random Walk | 0.311 | 2.00 | 3.99 | 46.7 |
| | Benchmark | 0.324 | 1.56 | 3.95 | 46.0 |
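The four metrics in Table 6 follow their standard definitions; a minimal sketch is given below, with the ×10⁻⁵ and ×10⁻³ scalings applied so the outputs are directly comparable to the tabulated values. The array names `y_true` and `y_pred` (realized and predicted volatility) are assumptions.

```python
import numpy as np

def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """R^2, scaled MSE/RMSE, and MAPE as reported in Table 6."""
    err = y_true - y_pred
    mse = np.mean(err ** 2)
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "R2": 1 - ss_res / ss_tot,
        "MSE (x1e-5)": mse * 1e5,
        "RMSE (x1e-3)": np.sqrt(mse) * 1e3,
        "MAPE (%)": 100 * np.mean(np.abs(err / y_true)),
    }
```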
Table 7. Annual R² performance on NASDAQ.

| Year | TF-ViTNet | ViT-LSTM | TF-CNet | CNN-LSTM | LSTM | Benchmark |
|---|---|---|---|---|---|---|
| 2010 | 0.199 ** | 0.235 * | −0.146 | 0.262 ** | 0.082 ** | −0.206 |
| 2011 | 0.434 | 0.305 | −0.014 | 0.345 | 0.366 | 0.343 |
| 2012 | −0.204 ** | 0.161 ** | −0.992 | 0.158 ** | −0.489 ** | −0.598 |
| 2013 | −0.018 ** | −0.011 ** | −1.033 | 0.056 ** | −0.055 | −0.570 |
| 2014 | 0.222 * | 0.093 | −0.212 | 0.196 | −0.205 | −0.015 |
| 2015 | 0.205 | 0.064 | −0.162 | 0.170 | 0.085 | −0.025 |
| 2016 | 0.200 | −0.116 * | −0.045 | −0.123 | −0.049 | 0.203 |
| 2017 | −0.227 | −0.241 ** | −1.341 | −0.023 ** | −0.106 | −0.556 |
| 2018 | 0.182 | −0.210 | −0.045 | 0.167 | 0.026 | 0.191 |
| 2019 | 0.267 ** | 0.264 | −0.057 | 0.282 ** | 0.067 * | −0.126 |
| 2020 | 0.354 | 0.018 | −0.438 | 0.030 | 0.190 | 0.588 |
| 2021 | 0.236 | 0.300 | −0.102 | 0.190 | −0.035 | 0.121 |
| 2022 | 0.036 ** | −0.069 | −1.749 | 0.065 ** | 0.041 ** | −0.632 |
| 2023 | 0.138 ** | 0.159 ** | −0.166 * | −0.268 | −0.359 | −0.466 |
| 2024 | 0.199 ** | 0.197 ** | −0.014 ** | 0.132 ** | 0.164 ** | −0.486 |
| Overall | 0.387 | 0.265 | −0.095 | 0.308 | 0.223 | 0.246 |
Note: ** and * indicate that the model’s outperformance against the Benchmark is statistically significant at the 1% and 5% levels, respectively, based on a paired t-test of loss differentials.
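One way to implement the paired t-test of loss differentials referenced in the notes to Tables 7 and 8 is sketched below. It assumes a squared-error loss per day (the notes do not specify the loss function) and reports a one-sided p-value for outperformance of the model over the Benchmark.

```python
import numpy as np
from scipy import stats

def loss_differential_test(y_true, y_model, y_bench):
    """Paired t-test on per-day loss differentials (squared-error assumed)."""
    d = (y_true - y_bench) ** 2 - (y_true - y_model) ** 2  # > 0 when model wins
    t_stat, p_two_sided = stats.ttest_1samp(d, popmean=0.0)
    # One-sided p-value for H1: model loss < benchmark loss
    p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
    return t_stat, p_one_sided
```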
Table 8. Annual R² performance on S&P 500.

| Year | TF-ViTNet | ViT-LSTM | TF-CNet | CNN-LSTM | LSTM | Benchmark |
|---|---|---|---|---|---|---|
| 2010 | 0.203 ** | 0.171 * | 0.156 * | 0.157 * | 0.260 | −0.191 |
| 2011 | 0.313 ** | 0.299 | 0.278 | 0.407 * | 0.354 | 0.184 |
| 2012 | 0.146 ** | 0.143 ** | 0.115 ** | 0.079 ** | 0.066 | −0.839 |
| 2013 | −0.063 ** | 0.045 ** | 0.151 ** | −0.063 ** | −0.542 | −0.535 |
| 2014 | 0.251 ** | 0.226 | 0.205 | 0.389 ** | 0.103 | 0.111 |
| 2015 | 0.366 * | 0.278 | 0.392 ** | 0.401 ** | 0.201 | 0.134 |
| 2016 | 0.273 | 0.376 | 0.378 | 0.381 | 0.297 | 0.257 |
| 2017 | −1.120 | −0.941 | −0.716 | −0.837 | −0.330 | −0.567 |
| 2018 | 0.185 | 0.126 | 0.258 | 0.359 | 0.092 | 0.316 |
| 2019 | 0.307 ** | 0.305 ** | 0.325 ** | 0.316 ** | 0.192 | −0.026 |
| 2020 | 0.583 | 0.073 | 0.250 | 0.287 | −0.171 | 0.704 |
| 2021 | 0.210 | 0.355 * | 0.243 | 0.367 ** | −0.258 | 0.118 |
| 2022 | 0.016 ** | −0.505 | 0.103 ** | 0.100 ** | 0.038 | −0.465 |
| 2023 | 0.183 ** | 0.060 ** | 0.003 ** | 0.153 ** | −0.654 | −0.334 |
| 2024 | 0.114 ** | 0.195 ** | 0.169 ** | 0.181 ** | 0.194 | −0.343 |
| Overall | 0.436 | 0.291 | 0.375 | 0.415 | 0.235 | 0.324 |
Note: ** and * indicate that the model’s outperformance against the Benchmark is statistically significant at the 1% and 5% levels, respectively, based on a paired t-test of loss differentials.
Table 9. Annual R² performance on NASDAQ with data augmentation.

| Year | No Aug | Affine | Masking | All |
|---|---|---|---|---|
| 2010 | 0.199 | 0.263 | 0.203 | 0.265 |
| 2011 | 0.434 | 0.285 | 0.434 | 0.405 |
| 2012 | −0.204 | 0.053 | −0.209 | 0.152 |
| 2013 | −0.018 | −0.162 | −0.018 | −0.214 |
| 2014 | 0.222 | 0.231 | 0.222 | 0.241 |
| 2015 | 0.205 | 0.270 | 0.232 | 0.285 |
| 2016 | 0.200 | −0.001 | 0.275 | 0.347 |
| 2017 | −0.227 | −0.027 | −0.167 | −0.041 |
| 2018 | 0.182 | 0.168 | 0.136 | 0.274 |
| 2019 | 0.267 | 0.262 | 0.262 | 0.223 |
| 2020 | 0.354 | 0.309 | 0.216 | 0.113 |
| 2021 | 0.236 | 0.201 | 0.138 | 0.075 |
| 2022 | 0.036 | 0.045 | 0.044 | −0.032 |
| 2023 | 0.138 | −0.091 | −0.066 | 0.055 |
| 2024 | 0.199 | 0.179 | 0.154 | 0.192 |
| Overall | 0.387 | 0.364 | 0.353 | 0.355 |
Table 10. Annual R² performance on S&P 500 with data augmentation.

| Year | No Aug | Affine | Masking | All |
|---|---|---|---|---|
| 2010 | 0.201 | 0.203 | 0.166 | 0.212 |
| 2011 | 0.379 | 0.313 | 0.295 | 0.333 |
| 2012 | 0.133 | 0.146 | 0.130 | 0.085 |
| 2013 | 0.143 | −0.063 | 0.113 | 0.117 |
| 2014 | 0.360 | 0.251 | 0.323 | 0.250 |
| 2015 | 0.272 | 0.366 | 0.371 | 0.375 |
| 2016 | 0.329 | 0.273 | 0.359 | 0.254 |
| 2017 | −0.845 | −1.120 | −1.029 | −0.395 |
| 2018 | −0.033 | 0.185 | 0.106 | −0.035 |
| 2019 | 0.240 | 0.307 | 0.321 | 0.280 |
| 2020 | 0.399 | 0.583 | 0.323 | 0.404 |
| 2021 | 0.367 | 0.210 | 0.216 | 0.332 |
| 2022 | 0.138 | 0.016 | 0.048 | −0.011 |
| 2023 | 0.183 | 0.183 | 0.165 | 0.140 |
| 2024 | 0.229 | 0.114 | 0.137 | 0.229 |
| Overall | 0.404 | 0.436 | 0.380 | 0.391 |
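The two augmentation families compared in Tables 9 and 10 can be realized on scalogram images as small geometric (affine) jitter and random masking. The sketch below is one plausible implementation using torchvision: the magnitudes, and the choice to mask time columns specifically, are illustrative assumptions rather than the paper's exact settings, and rotation is disabled because it would not preserve the time and frequency axes.

```python
import torch
from torchvision import transforms

# Affine jitter: small shifts/rescaling only (degrees=0 disables rotation)
affine = transforms.RandomAffine(degrees=0, translate=(0.05, 0.05),
                                 scale=(0.95, 1.05))

def time_masking(img: torch.Tensor, max_width: int = 16) -> torch.Tensor:
    """Zero out a random block of time columns in a (C, H, W) scalogram."""
    w = img.shape[-1]
    width = int(torch.randint(1, max_width + 1, (1,)))
    start = int(torch.randint(0, w - width, (1,)))
    out = img.clone()
    out[..., start:start + width] = 0.0
    return out

# Example: apply both to a dummy 224x224 scalogram tensor
aug = time_masking(affine(torch.randn(3, 224, 224)))
```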
Table 11. TF-ViTNet Overall R² Performance by Time–Frequency Transformation.

| Data | Transformation | R² | MSE (×10⁻⁵) | RMSE (×10⁻³) | MAPE (%) |
|---|---|---|---|---|---|
| Nasdaq | STFT | 0.324 | 1.83 | 4.12 | 44.7 |
| | CWT | 0.387 | 1.66 | 4.08 | 43.2 |
| S&P 500 | STFT | 0.330 | 1.54 | 3.93 | 43.5 |
| | CWT | 0.436 | 1.30 | 3.61 | 42.9 |
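The two transformations compared in Table 11 can be produced with PyWavelets and SciPy as sketched below. The Morlet wavelet, the scale grid, and the STFT window length are assumptions; the resulting magnitude arrays would then be rendered and resized to 224 × 224 images before entering the ViT.

```python
import numpy as np
import pywt
from scipy import signal

def cwt_scalogram(x: np.ndarray, num_scales: int = 64) -> np.ndarray:
    """Magnitude scalogram via the Continuous Wavelet Transform."""
    scales = np.arange(1, num_scales + 1)         # scale grid (assumed)
    coeffs, _ = pywt.cwt(x, scales, "morl")       # Morlet wavelet (assumed)
    return np.abs(coeffs)                         # (num_scales, len(x))

def stft_spectrogram(x: np.ndarray, nperseg: int = 16) -> np.ndarray:
    """Magnitude spectrogram via the Short-Time Fourier Transform."""
    _, _, z = signal.stft(x, nperseg=nperseg)     # window length (assumed)
    return np.abs(z)
```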
Table 12. Analysis of TF-ViTNet and Other Models' Performance by Market Regime.

| Data | Regime | TF-ViTNet | ViT-LSTM | TF-CNet | CNN-LSTM | LSTM | Benchmark |
|---|---|---|---|---|---|---|---|
| Nasdaq | Bull | 0.2980 | 0.1428 | −0.0756 | 0.2274 | 0.1500 | 0.1351 |
| | Bear | 0.4190 | 0.2213 | −0.7894 | 0.2568 | 0.1619 | 0.2901 |
| S&P 500 | Bull | 0.3768 | 0.2969 | 0.3393 | 0.3798 | 0.1777 | 0.2405 |
| | Bear | 0.5392 | 0.0715 | 0.3580 | 0.4073 | 0.2130 | 0.5043 |

Note: A Bull Market is defined as the market index having risen 20% or more from its most recent trough (lowest point), and a Bear Market as the index having fallen 20% or more from its most recent peak (highest point).
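A simple implementation of this ±20% peak/trough rule is sketched below; the alternating peak-trough scan is one reasonable reading of the definition, not necessarily the authors' exact procedure.

```python
import pandas as pd

def label_regimes(close: pd.Series) -> pd.Series:
    """Label each day 'bull' or 'bear' using the +/-20% peak/trough rule."""
    regime = pd.Series("bull", index=close.index)
    peak = trough = close.iloc[0]
    state = "bull"
    for t, p in close.items():
        if state == "bull":
            peak = max(peak, p)
            if p <= 0.8 * peak:          # fell 20% from peak -> bear
                state, trough = "bear", p
        else:
            trough = min(trough, p)
            if p >= 1.2 * trough:        # rose 20% from trough -> bull
                state, peak = "bull", p
        regime.loc[t] = state
    return regime
```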
Table 13. Model Performance by Volatility Regime (MAE ×10⁻³).

| Data | Quintile | TF-ViTNet | ViT-LSTM | TF-CNet | CNN-LSTM | LSTM | Benchmark |
|---|---|---|---|---|---|---|---|
| Nasdaq | Q1 | 2.6771 | 1.5236 | 4.4981 | 2.7109 | 3.1843 | 1.9150 |
| | Q2 | 1.8587 | 1.0700 | 3.0633 | 1.9425 | 2.2634 | 2.1127 |
| | Q3 | 1.5951 | 1.6354 | 1.4759 | 1.8209 | 1.9050 | 2.5211 |
| | Q4 | 2.1741 | 3.0566 | 1.7092 | 2.3145 | 2.2965 | 3.3831 |
| | Q5 | 5.8170 | 8.3872 | 8.3937 | 6.0525 | 6.3027 | 5.9328 |
| S&P 500 | Q1 | 1.9915 | 1.9185 | 1.9372 | 2.0971 | 2.4414 | 1.5702 |
| | Q2 | 1.4635 | 1.2820 | 1.3849 | 1.4689 | 3.8897 | 1.8005 |
| | Q3 | 1.4925 | 1.2826 | 1.3934 | 1.3632 | 5.4022 | 2.3289 |
| | Q4 | 1.9983 | 2.0121 | 1.9459 | 1.8272 | 7.5367 | 2.9383 |
| | Q5 | 5.3253 | 6.3568 | 5.6962 | 5.4390 | 13.9145 | 5.1182 |
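The quintile breakdown in Table 13 amounts to bucketing test days by realized volatility and averaging the absolute errors within each bucket, with the ×10³ scaling matching the table's units; a minimal sketch with assumed series names:

```python
import pandas as pd

def mae_by_volatility_quintile(y_true: pd.Series, y_pred: pd.Series) -> pd.Series:
    """Per-quintile MAE (x10^3), with Q5 the highest-volatility days."""
    quintile = pd.qcut(y_true, q=5, labels=["Q1", "Q2", "Q3", "Q4", "Q5"])
    abs_err = (y_true - y_pred).abs() * 1e3
    return abs_err.groupby(quintile).mean()
```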