Article

An Extension of Laor Weight Initialization for Deep Time-Series Forecasting: Evidence from Thai Equity Risk Prediction

by Katsamapol Petchpol 1,* and Laor Boongasame 2,3
1 Department of Computer Science, Faculty of Science, King Mongkut’s Institute of Technology Ladkrabang, Bangkok 10520, Thailand
2 Department of Mathematics, Faculty of Science, King Mongkut’s Institute of Technology Ladkrabang, Bangkok 10520, Thailand
3 Business Innovation and Investment Laboratory: B2I-Lab, Faculty of Science, King Mongkut’s Institute of Technology Ladkrabang, Bangkok 10520, Thailand
* Author to whom correspondence should be addressed.
Forecasting 2025, 7(3), 47; https://doi.org/10.3390/forecast7030047
Submission received: 16 July 2025 / Revised: 27 August 2025 / Accepted: 1 September 2025 / Published: 2 September 2025

Abstract

This study presents a gradient-informed proxy initialization framework designed to improve training efficiency and predictive performance in deep learning models for time-series forecasting. The method extends the Laor Initialization approach by introducing backward gradient norm clustering as a selection criterion for input-layer weights, evaluated through a lightweight, architecture-agnostic proxy model. Only the numerical input layer adopts the selected initialization, while internal components retain standard schemes such as Xavier, Kaiming, or Orthogonal, maintaining compatibility and reducing overhead. The framework is evaluated on a real-world financial forecasting task: identifying high-risk equities from the Thai Market Surveillance Measure List, a domain characterized by label imbalance, non-stationarity, and limited data volume. Experiments across five architectures, including Transformer, ConvTran, and MMAGRU-FCN, show that the proposed strategy improves convergence speed and classification accuracy, particularly in deeper and hybrid models. Results in recurrent-based models are competitive but less pronounced. These findings support the method’s practical utility and generalizability for forecasting tasks under real-world constraints.

1. Introduction

Deep learning methods have achieved substantial success in time-series forecasting tasks across various domains, including retail sales [1], sports outcomes [2], and urban traffic dynamics [3], due to their capacity to model nonlinear dependencies and multiscale temporal patterns. Despite their capabilities, deep neural networks remain sensitive to initial weight configurations, which can significantly influence optimization behavior, predictive stability, and convergence rates, particularly in deep or resource-constrained forecasting pipelines [4].
Widely adopted initialization schemes such as Xavier [5], Kaiming [6], and Orthogonal initialization [7] are valued for their variance-preserving properties but remain inherently data-agnostic. In non-stationary or class-imbalanced settings, these approaches often require prolonged optimization phases to achieve task-relevant representations [8]. Data-driven alternatives, such as Layer-Sequential Unit Variance (LSUV) [9] and Laor Initialization [10], introduce empirical feedback to guide weight selection, often resulting in improved early convergence. However, these methods typically increase computational complexity and are sensitive to model-specific engineering constraints. This limits their scalability across diverse forecasting architectures and operational settings.
Data-driven methods such as LSUV and Laor thus trade improved convergence for added computational burden and reduced flexibility across architectures. Yet in real-world financial forecasting scenarios, particularly those involving regulatory monitoring, even modest gains in training efficiency or classification accuracy can be consequential. This is especially true in emerging markets like Thailand, where timely detection of high-risk securities directly affects investor protection and market integrity [11,12,13,14,15].
To address these limitations, this study proposes a selective, proxy-assisted initialization framework that extends Laor Initialization using backward gradient norm clustering. The approach applies custom initialization only at the numerical input layer—where parameter sensitivity is highest—while retaining conventional methods for internal layers. A lightweight linear proxy model is used to evaluate candidate weight sets, enabling gradient-informed selection without modifying the forecasting architecture.
The method is empirically evaluated on five deep learning models, Transformer [16], ConvTran [17], MLSTM-FCN [18], MALSTM-FCN, and MMAGRU-FCN [19], using a real-world forecasting task: identifying at-risk stocks in the Stock Exchange of Thailand (SET) based on the Market Surveillance Measure List (MSML). This task exemplifies structural challenges such as class imbalance, temporal drift, and high retraining frequency, making it a representative benchmark for evaluating initialization methods in practical forecasting pipelines.
The remainder of this paper is organized as follows. Section 2 reviews forecasting models, conventional weight initialization schemes, and data-driven approaches, highlighting their limitations in real-world pipelines. Section 3 presents the proposed proxy-assisted, gradient-informed initialization framework, together with the evaluation process and experimental setup. Section 4 reports empirical results across five deep learning architectures, analyzing convergence, efficiency, and computational overhead. Section 5 provides a combined discussion and conclusion, outlining key findings, practical implications, and directions for future research.

2. Background and Related Works

2.1. Forecasting Models and Their Challenges

Forecasting problems have traditionally been addressed with statistical models such as ARIMA [20] and GARCH [21], which provide interpretable predictions but struggle with nonlinear and multivariate dynamics. Classical machine learning models such as Support Vector Machines [22] and Random Forests [23] reduce these limits but require heavy feature engineering and cannot easily model sequential dependencies.
Recent advances in deep learning have greatly improved time-series forecasting. Foundational models such as Recurrent Neural Networks (RNNs) [24], along with Long Short-Term Memory (LSTM) networks [25], and Gated Recurrent Units (GRUs) [26] capture sequential dependencies directly from raw or minimally processed data. Convolutional Neural Networks (CNNs) [27] can also extract local temporal features. More recently, Transformer-based architectures [16] enable parallel processing and long-range dependencies through attention mechanisms.
While RNNs and their gated extensions laid the foundation for sequence modeling, Transformers improved scalability and expressiveness. Hybrid architectures (e.g., MLSTM-FCN, ConvTran) combine convolutional, recurrent, or attention components to balance local feature extraction with long-term dependency learning. Despite their effectiveness, performance remains sensitive to depth, regularization, and weight initialization.
This study does not aim to solve domain challenges such as non-stationarity [28], class imbalance [29], or cold start [30]. Instead, it builds on the validated forecasting pipeline proposed by Petchpol and Boongasame (2025) [31], which addresses these through principled design: preventing data leakage with strict sequence inputs [20]; mitigating non-stationarity via rolling window [32]; handling cold starts with implicit transfer learning [31]; and correcting class imbalance with SMOTE [33].
The pipeline benchmarks classical baselines (Random Walk [34], Random Forest, LSTM, GRU), which proved insufficient under real-world constraints. Final evaluation therefore centers on five deep learning architectures—Transformer, ConvTran, MLSTM-FCN, MALSTM-FCN, and MMAGRU-FCN—demonstrating superior robustness and predictive performance. This diverse set of architectures establishes a reliable foundation for isolating the impact of weight initialization, a sensitive hyperparameter that strongly influences convergence speed, training stability, and forecasting accuracy. The following sections examine its role and review existing strategies in support of a more effective initialization approach.

2.2. Conventional Weight Initialization Methods

Weight initialization plays a foundational role in the training dynamics of deep neural networks [35]. Conventional strategies such as Xavier initialization, Kaiming initialization, and Orthogonal initialization are widely adopted for their theoretical soundness and empirical effectiveness across a range of architectures [8]. Xavier initialization preserves variance across layers under symmetric activation functions such as tanh and sigmoid [5]. Kaiming initialization adjusts the variance for rectified linear units (ReLU), mitigating vanishing gradients in deeper networks [6]. Orthogonal initialization, particularly useful in recurrent structures, maintains signal norms across time steps, supporting stable temporal learning [7].
These methods are computationally efficient and architecture-compatible, which contributes to their widespread use in both research and production systems. Modern deep forecasting models, including LSTMs and Transformers, continue to rely on these initialization schemes, as earlier RNNs and CNNs did, due to their compatibility with common training pipelines.
However, these strategies are inherently data-agnostic and do not incorporate information from the input distribution or task objective. As a result, they require additional optimization effort before network parameters converge toward task-relevant representations, which can prolong training time and, in some cases, lead to suboptimal convergence, particularly in forecasting tasks with limited data, non-stationarity, or shifting distributions.

2.3. Data-Driven Initialization Approaches and Limitations

To overcome the limitations of traditional data-agnostic schemes, several data-driven initialization methods have been proposed to incorporate distributional and task-specific information from the training data. Among these, Layer-Sequential Unit-Variance (LSUV) initialization improves training stability by performing forward passes to normalize the output variance of each layer sequentially [9]. LSUV requires minimal tuning and offers improved convergence, particularly in deep architectures.
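The variance-normalization idea behind LSUV can be illustrated with a minimal NumPy sketch for a single linear layer. This is not the reference implementation from [9]: the function name `lsuv_linear`, the tolerance, and the iteration cap are assumptions made for illustration, and a full LSUV pass would repeat the same step layer by layer through the network.

```python
import numpy as np

def lsuv_linear(X, d_out, tol=0.01, max_iter=10, seed=0):
    """LSUV-style init for one linear layer (illustrative sketch).

    Start from an orthonormal matrix, then rescale the weights until the
    variance of the layer output on the batch X is close to 1.
    """
    rng = np.random.default_rng(seed)
    d_in = X.shape[1]
    # Orthonormal starting point from the QR decomposition of a Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((max(d_in, d_out), min(d_in, d_out))))
    W = q if d_out >= d_in else q.T          # shape (d_out, d_in)
    for _ in range(max_iter):
        var = (X @ W.T).var()                # forward pass on the data batch
        if abs(var - 1.0) < tol:
            break
        W = W / np.sqrt(var)                 # data-dependent rescaling step
    return W
```

For a purely linear layer a single rescaling is enough; the loop only matters once nonlinearities or several stacked layers are involved, which is also where the extra forward passes discussed in this section come from.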
Laor Initialization introduces a distinct data-driven approach that evaluates multiple randomly sampled weight candidates at the input layer using forward-pass loss as a proxy for optimization readiness [10]. The method clusters the resulting loss scores and selects the candidate associated with the most promising cluster, aiming to improve early convergence and final predictive performance. Unlike LSUV, which normalizes layer activations, Laor Initialization emphasizes loss-informed clustering to guide initialization. This design is particularly suited for tasks characterized by temporal variability and label imbalance, where early training stability can significantly affect downstream performance.
Despite their conceptual strengths, both LSUV and Laor Initialization face practical limitations. Because they rely on multiple forward passes for either activation normalization or loss-based candidate evaluation, they introduce additional computational overhead at the initialization stage. Furthermore, both methods are tightly coupled with specific model architectures. Adapting them to newer or more complex designs, such as Transformers, hybrid attention networks, or modular architectures, often requires extensive code customization and access to internal layers. This combination of computational and engineering overhead reduces their usability in heterogeneous forecasting pipelines, where efficiency and model flexibility are essential. These limitations motivate the development of more architecture-agnostic, reusable, and lightweight initialization strategies.

2.4. Toward Optimization-Aware and Architecture-Agnostic Initialization

While data-driven initialization methods offer task-informed starting points, applying them across all layers of a deep model introduces significant computational and engineering overhead. Each layer typically requires specialized access for forward or gradient evaluation, complicating their integration into diverse or modular architectures. In practice, the input layer, where raw temporal signals first interact with the model, is often the most accessible and initialization-sensitive component.
These considerations highlight the value of a selective hybrid strategy: applying data-informed initialization where it is most impactful—which, in our case, is the input layer—while retaining conventional, efficient initializers for internal components such as attention mechanisms or recurrent blocks. As discussed in the previous section, most existing data-driven methods rely on forward-pass loss to select among candidate weight configurations and are often tightly coupled to specific architectural designs. This limits their usability in dynamic forecasting pipelines that demand both architectural flexibility and computational efficiency.
Consequently, there is a clear need for an initialization strategy that is both optimization-aware and architecture-agnostic, one that can evaluate candidate weights in a task-relevant manner without requiring deep structural access or significant engineering integration. The present study addresses this need by introducing a novel proxy-assisted gradient-informed approach, which is formally described in Section 3.

2.5. Relevance to Forecasting: A Case Study in the Thai Equity Market

Financial time-series forecasting is a technically demanding task characterized by non-stationarity, class imbalance, limited historical data, and dynamic structural shifts. These characteristics increase sensitivity to model configuration choices such as architecture depth, training frequency, and weight initialization. The Thai equity market provides a particularly suitable testbed. In this setting, the Stock Exchange of Thailand (SET) implements a Market Surveillance Measure List (MSML) to flag stocks with abnormal price and volume behavior [11], aiming to protect investors from market manipulation or excessive speculation [11,12,13,14,15].
Forecasting which stocks are likely to be flagged in advance poses a complex classification problem under regulatory, temporal, and data-driven constraints. The problem is further challenged by the infrequent and imbalanced nature of MSML events, the diversity of financial instruments, and the requirement of retraining across evolving time windows. To support experimentation, this study adopts the forecasting pipeline of Petchpol and Boongasame (2025) [31], which handles data leakage, concept drift, class imbalance, and cold start conditions.
While the focus of this work is on financial risk prediction, the challenges of non-stationarity, label imbalance, heterogeneous architectures, and retraining are also common in energy demand forecasting [36], epidemiological trend detection [37], and environmental monitoring [38], where similar pressures on model reliability and efficiency exist. Thus, this case study offers broader insight into the behavior of deep forecasting models under real-world operational constraints.

3. A Hybrid Weight Initialization Framework

3.1. Proxy Model

The proposed framework introduces a selective weight initialization strategy that targets only the numerical input layer, where initialization sensitivity is most pronounced in deep time-series forecasting models. Rather than applying complex initialization schemes across the entire architecture, this design seeks to balance optimization-awareness with architectural simplicity and computational efficiency.
To achieve this, the framework employs a lightweight proxy model that evaluates multiple candidate weight configurations prior to training. The proxy serves as a surrogate for the full forecasting architecture, enabling architecture-agnostic evaluation while keeping computational cost low. This design decouples initialization assessment from internal model structures, making it fully generalizable and applicable across a wide range of deep learning architectures.
The proxy itself is implemented as a single-layer linear transformation, followed by a task-specific output head. For classification tasks such as Market Surveillance Measure List (MSML) prediction, the output head is a softmax classifier trained using cross-entropy loss defined as:
$$L = -\sum_{i=1}^{C} y_i \log \hat{y}_i, \qquad (1)$$
where $C$ is the number of classes, $y_i$ represents the true label, and $\hat{y}_i$ is the predicted probability for class $i$. In our MSML task, which is binary (flagged vs. not flagged), this categorical cross-entropy naturally reduces to the binary cross-entropy where $C = 2$, with $y_1$ representing the probability of ‘not flagged’ and $y_2$ representing the probability of ‘flagged’. For regression settings, it is replaced with a linear output trained via mean squared error:
$$L = \sum_{j=1}^{S} \left( y_j - \hat{y}_j \right)^2, \qquad (2)$$
where $S$ is the number of samples, $y_j$ is the true target value for sample $j$, and $\hat{y}_j$ is the predicted value.
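The reduction of the categorical cross-entropy to the binary case can be checked numerically. The sketch below uses only the Python standard library; the predicted probability `p_flagged` is a made-up value chosen for illustration.

```python
import math

def categorical_ce(y_true, y_prob):
    """Cross-entropy over C classes for one sample (one-hot y_true)."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_prob))

def binary_ce(y, p):
    """Binary cross-entropy for one sample with label y and P(flagged) = p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

p_flagged = 0.2                      # hypothetical predicted probability of 'flagged'
# Class order: index 0 = 'not flagged', index 1 = 'flagged'; the sample is truly flagged.
ce = categorical_ce([0, 1], [1 - p_flagged, p_flagged])
bce = binary_ce(1, p_flagged)
print(ce, bce)                       # both equal -log(0.2)
```

With $C = 2$ and a one-hot label, only the term for the true class survives in the sum, which is exactly the binary cross-entropy.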
As illustrated in Figure 1, categorical inputs are processed separately using a standard Xavier-initialized embedding layer, while the proxy-based initialization is applied exclusively to the numerical input layer. The remainder of the forecasting architecture, including models such as Transformer, ConvTran, MLSTM-FCN, MALSTM-FCN, and MMAGRU-FCN, is adopted from the validated pipeline of Petchpol and Boongasame (2025) and used without modification [31]. These internal layers retain a standardized initialization policy based on each layer’s structure and activation behavior (summarized in Table 1), allowing the effect of the proposed method to be evaluated in isolation.
Importantly, the proxy is not discarded after evaluation. The selected weight configuration is retained and used to initialize the input layer of the actual forecasting model. This ensures that initialization reflects real data flow characteristics while avoiding overhead from repeated or full-model training. The following section introduces a data-driven evaluation procedure designed to guide this selection process, using empirical feedback from the proxy to identify initialization configurations that are more responsive to the forecasting task at hand.

3.2. Gradient-Informed Evaluation

The evaluation mechanism employed by the proxy builds on Laor Initialization (Laor-ori), which scores candidate weight configurations based on forward-pass classification loss. A pool of N randomly generated candidate weight matrices is created for the numerical input layer. Each candidate is assigned to the proxy model and evaluated using the task-specific loss function, resulting in scalar error values (Algorithm 1). The proxy model applies these loss functions systematically across all candidate weight configurations. For classification tasks, Equation (1) is used, while regression applications would employ Equation (2).
Algorithm 1 Laor-ori (error-driven via proxy)
  • Input: Model $M$, proxy layer $P$, dataset batch $(X, y)$, number of candidates $N$, number of clusters $k$, loss function $L$ (e.g., Cross-Entropy).
  • Output: Model $M$ with proxy layer $P$.
  • Begin:
  • Define a standalone proxy layer $W_i \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ with dimensions matching $P$, where $d_{\text{in}}$ and $d_{\text{out}}$ denote the input and output dimensionalities of the layer under consideration.
  • Randomly generate $N$ normally distributed candidate weight matrices:
                       $W_1, W_2, \ldots, W_N$.
  • For each candidate $W_i$:
  •      a. Assign $W_i$ to $P$.
  •      b. Perform forward pass: $\hat{y}_i = P(X)$.
  •      c. Compute scalar loss/error: $\ell_i = L(\hat{y}_i, y)$.
  • Form the error set: $E = \{\ell_1, \ell_2, \ldots, \ell_N\}$.
  • Cluster $E$ into $k$ clusters using k-means.
  • Identify cluster $C^*$ with the lowest mean error.
  • Average weights from all candidates in $C^*$:
                   $W_{\text{final}} = \frac{1}{|C^*|} \sum_{W_i \in C^*} W_i$.
  • Assign $W_{\text{final}}$ to $P$.
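The steps of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative reading rather than the authors' code: the names `softmax_ce`, `kmeans_1d`, and `laor_ori` are hypothetical, the proxy layer's outputs are treated directly as class logits (so `d_out` equals the number of classes), and a plain Lloyd-style k-means on scalars stands in for whatever clustering library was actually used.

```python
import numpy as np

def softmax_ce(logits, y):
    """Mean cross-entropy of integer labels y under softmax(logits)."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_p[np.arange(len(y)), y].mean()

def kmeans_1d(values, k, n_iter=50):
    """Plain Lloyd's k-means on scalars; returns one cluster label per value."""
    values = np.asarray(values, dtype=float)
    centers = np.quantile(values, np.linspace(0, 1, k))   # spread the initial centers
    for _ in range(n_iter):
        labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    return labels

def laor_ori(X, y, d_out, n_candidates=10, k=2, seed=0):
    """Error-driven selection: average the candidates in the lowest-loss cluster."""
    rng = np.random.default_rng(seed)
    cands = [rng.standard_normal((d_out, X.shape[1])) for _ in range(n_candidates)]
    # Forward pass only: one scalar loss per candidate.
    losses = np.array([softmax_ce(X @ W.T, y) for W in cands])
    labels = kmeans_1d(losses, k)
    means = {j: losses[labels == j].mean() for j in np.unique(labels)}
    best = min(means, key=means.get)                      # cluster with lowest mean error
    W_final = np.mean([W for W, lab in zip(cands, labels) if lab == best], axis=0)
    return W_final, losses, labels
```

Returning the per-candidate losses and labels alongside the averaged weights makes it easy to inspect which candidates contributed to the final matrix.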
The proposed gradient-informed Laor variant (Laor) improves upon this by replacing the loss-based evaluation with the $\ell_2$-norm of the backward gradients of each candidate (Algorithm 2). The gradient norm for candidate $i$ is computed as:
$$g_i = \left\lVert \nabla_{W_i} \ell_i \right\rVert_2, \qquad (3)$$
where $W_i$ denotes the candidate weight matrix in the proxy layer and $\ell_i$ is the corresponding scalar loss. This measure provides a direct and actionable indicator of optimization potential, unlike the loss value alone.
Algorithm 2 Laor (gradient-informed via proxy)
  • Input: Model $M$, proxy layer $P$, optimizer $O$ (e.g., SGD, Adam), dataset batch $(X, y)$, number of candidates $N$, number of clusters $k$, loss function $L$ (e.g., Cross-Entropy), randomizer $R$: a base weight initializer (e.g., Random, Kaiming-Normal).
  • Output: Model $M$ with proxy layer $P$.
  • Begin:
  • Define a standalone proxy layer $W_i \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$ with dimensions matching $P$, where $d_{\text{in}}$ and $d_{\text{out}}$ denote the input and output dimensionalities of the layer under consideration.
  • Generate $N$ candidate weight matrices using randomizer $R$:
                 $W_1, W_2, \ldots, W_N, \quad W_i \sim R$.
  • For each candidate $W_i$:
  •      a. Assign $W_i$ to $P$.
  •      b. Zero gradients in optimizer $O$.
  •      c. Perform forward pass: $\hat{y}_i = P(X)$.
  •      d. Compute scalar loss value: $\ell_i = L(\hat{y}_i, y)$.
  •      e. Perform backward pass and compute gradient $\ell_2$-norm:
                     $g_i = \left\lVert \nabla_{W_i} \ell_i \right\rVert_2$.
  • Form the gradient norm set: $G = \{g_1, g_2, \ldots, g_N\}$.
  • Cluster $G$ into $k$ clusters using k-means.
  • Identify cluster $C^*$ with the highest mean gradient norm.
  • Average weights from all candidates in $C^*$:
                   $W_{\text{final}} = \frac{1}{|C^*|} \sum_{W_i \in C^*} W_i$.
  • Assign $W_{\text{final}}$ to $P$.
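A compact NumPy sketch of Algorithm 2 is given below under simplifying assumptions: the proxy is a single linear map whose outputs serve as class logits, so the gradient of the softmax cross-entropy with respect to the weights has a closed form and no autograd or optimizer object is needed. The names `ce_and_grad`, `kmeans_1d`, `laor_grad`, and the `sampler` argument (standing in for the randomizer R) are hypothetical.

```python
import numpy as np

def ce_and_grad(W, X, y):
    """Mean softmax cross-entropy of the linear proxy X @ W.T and its gradient w.r.t. W."""
    logits = X @ W.T
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    n = len(y)
    loss = -np.log(p[np.arange(n), y]).mean()
    p[np.arange(n), y] -= 1.0                # d(loss)/d(logits) = (softmax - one_hot) / n
    grad = p.T @ X / n                       # shape (d_out, d_in), same as W
    return loss, grad

def kmeans_1d(values, k, n_iter=50):
    """Plain Lloyd's k-means on scalars; returns one cluster label per value."""
    values = np.asarray(values, dtype=float)
    centers = np.quantile(values, np.linspace(0, 1, k))
    for _ in range(n_iter):
        labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    return labels

def laor_grad(X, y, d_out, sampler=None, n_candidates=10, k=2, seed=0):
    """Gradient-informed selection: average the candidates in the highest-norm cluster."""
    rng = np.random.default_rng(seed)
    if sampler is None:                      # randomizer R; plain N(0, 1) assumed here
        sampler = lambda shape: rng.standard_normal(shape)
    shape = (d_out, X.shape[1])
    cands = [sampler(shape) for _ in range(n_candidates)]
    # Backward pass per candidate: l2 (Frobenius) norm of the weight gradient.
    norms = np.array([np.linalg.norm(ce_and_grad(W, X, y)[1]) for W in cands])
    labels = kmeans_1d(norms, k)
    means = {j: norms[labels == j].mean() for j in np.unique(labels)}
    best = max(means, key=means.get)         # cluster with highest mean gradient norm
    W_final = np.mean([W for W, lab in zip(cands, labels) if lab == best], axis=0)
    return W_final, norms, labels
```

The only differences from the error-driven procedure are the scored quantity (gradient norm instead of loss) and the direction of the selection (highest instead of lowest cluster mean).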
To demonstrate this advantage, consider the gradient descent update rule in Equation (4). If one were to take a step of size $\eta$, the expected first-order decrease for candidate $i$ would be
$$\Delta \ell_i \approx -\eta \, g_i^2. \qquad (4)$$
The key distinction between approaches lies in their selection criteria:
  • Laor-ori (Algorithm 1). Selects candidates by $\ell_i$. Smaller loss values indicate lower initial misfit but do not guarantee faster optimization.
  • Laor (Algorithm 2). Selects candidates by $g_i$. Larger gradient norms imply greater reduction potential per update step, aligning initialization with optimization dynamics rather than only initial fit quality.
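The first-order estimate in Equation (4) can be checked numerically. The sketch below uses a toy least-squares loss on synthetic data rather than the paper's classification loss; the data, the step size, and the function names are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4))     # synthetic inputs
y = rng.standard_normal(200)          # synthetic targets
w = rng.standard_normal(4)            # one candidate weight vector

def loss(w):
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad(w):
    return X.T @ (X @ w - y) / len(y)

eta = 1e-3                            # small step so the first-order term dominates
g = np.linalg.norm(grad(w))
actual = loss(w - eta * grad(w)) - loss(w)   # true change after one gradient step
predicted = -eta * g ** 2                    # first-order estimate from Equation (4)
print(actual, predicted)
```

The two values agree up to a second-order term proportional to $\eta^2$, so the approximation holds for small steps and says nothing about behavior after many updates or with large learning rates.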
Both methods apply k-means clustering followed by averaging within the best cluster. This design, inherited from the original Laor method, improves reproducibility and reduces sensitivity to noise or outliers, ensuring that initialization is not determined by a single potentially unstable candidate but instead reflects a more robust consensus of promising weights.
Finally, to evaluate the robustness and variability of the proposed method, multiple gradient-informed Laor variants are explored, each differing in their random initialization schemes:
  • Laor-n/Laor-u: Laor with normal/uniform distribution sampling;
  • Laor-kn/Laor-ku: Laor with Kaiming Normal/Kaiming Uniform sampling;
  • Laor-xn/Laor-xu: Laor with Xavier Normal/Xavier Uniform sampling;
  • Laor-o: Laor with orthogonal matrix sampling.
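The seven sampling schemes can be sketched as interchangeable candidate generators. The Kaiming and Xavier scales below follow the standard formulas (ReLU gain for Kaiming); the unit scale used for the plain normal and uniform samplers, and the helper name `make_samplers`, are assumptions, since the exact ranges are not stated in this section.

```python
import numpy as np

def make_samplers(d_out, d_in, rng):
    """Return a dict of zero-argument functions, each drawing one (d_out, d_in) matrix."""
    fan_in, fan_out = d_in, d_out
    shape = (d_out, d_in)
    k_bound = np.sqrt(6.0 / fan_in)                 # Kaiming uniform bound (ReLU gain)
    x_bound = np.sqrt(6.0 / (fan_in + fan_out))     # Xavier uniform bound

    def orthogonal():
        a = rng.standard_normal((max(shape), min(shape)))
        q, r = np.linalg.qr(a)
        q = q * np.sign(np.diag(r))                 # sign correction makes the QR factor unique
        return q if d_out >= d_in else q.T

    return {
        "n":  lambda: rng.standard_normal(shape),                       # assumed unit variance
        "u":  lambda: rng.uniform(-1.0, 1.0, shape),                    # assumed range [-1, 1]
        "kn": lambda: rng.normal(0.0, np.sqrt(2.0 / fan_in), shape),
        "ku": lambda: rng.uniform(-k_bound, k_bound, shape),
        "xn": lambda: rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), shape),
        "xu": lambda: rng.uniform(-x_bound, x_bound, shape),
        "o":  orthogonal,
    }
```

Each entry plays the role of the randomizer R in Algorithm 2, so the variants differ only in which function produces the N candidates.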

3.3. Experimental Setup

This section describes the experimental setup used to evaluate the proposed proxy-assisted, gradient-informed initialization framework. The experiments are conducted within a multivariate time-series classification (MTSC) setting and build directly on a previously validated pipeline developed by Petchpol and Boongasame (2025) [31]. All model configurations, training routines, and evaluation metrics are held constant across experiments to isolate the impact of the weight initialization strategy.

3.3.1. Inherited Setup from Prior Work

  • Dataset: Daily end-of-day (EOD) stock trading data and Market Surveillance Measure List (MSML) labels were obtained from the SET Market Analysis and Reporting Tool (SETSMART) platform [39], covering the period from 16 November 2012 to 15 June 2024. The dataset includes adjusted closing prices, trading volumes, valuation ratios (P/E, P/BV), and derived technical and sentiment indicators. Each record is associated with a binary label indicating whether the corresponding equity appears on the MSML;
  • Preprocessing Steps: Temporal alignment was strictly maintained by using predictors at time $t$ exclusively derived from historical observations up to $t - 1$. Forecasting labels were forward-shifted according to the forecasting horizon. Numerical features underwent standardization and precision rounding to six decimal places to enhance training stability. Missing trading values for suspended stocks were explicitly imputed with zeros, indicating absence of trading activity;
  • Rolling Window Training Strategy: A rolling window methodology was used to handle temporal non-stationarity, employing a training window of 1260 trading days (~5 years), a validation window of 60 trading days (~1 quarter), and a rolling step size of 60 days. This strategy resulted in 26 training-validation cycles for each experimental iteration;
  • Class Imbalance Handling: Given the significant minority class imbalance (~3.6% positive samples), Synthetic Minority Oversampling Technique (SMOTE) was applied within each rolling training window, doubling minority instances via interpolation with 5-nearest neighbors;
  • Computational Environment: Experiments were executed using PyTorch version 2.3.0 on an NVIDIA RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA), Intel i7-13700KF CPU (Intel Corporation, Santa Clara, CA, USA), and 32 GB RAM, with CUDA version 12.3, ensuring reproducibility and computational efficiency;
  • Evaluation Metrics: The primary evaluation metric was Matthews Correlation Coefficient (MCC), complemented by supplementary metrics including F1 score, Recall, Precision, Accuracy, and Cross-Entropy Loss. Computational efficiency metrics encompassed training time per epoch, total wall-clock time, and epochs to convergence;
  • Model Architectures: Five deep learning models are used, spanning a range of structural patterns from fully attention-based to recurrent-convolutional hybrids, as listed in Table 2;
  • Hyperparameters: Training configurations were inherited directly from prior work and remained fixed across all initialization experiments. These hyperparameters, summarized in Table 3, were tuned previously and reused to ensure a controlled evaluation setting.
For further details regarding data properties, exploratory statistics, and baseline model performance, please refer to Sections 3, 4, and 5 of the original study [31].
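The rolling-window schedule described above can be sketched as a small index generator. The function name and the total of 2830 trading days are hypothetical: the exact number of trading days in the sample is not stated here, and the value was chosen only because it falls in the range (2820 to 2879) for which this scheme yields the reported 26 cycles.

```python
def rolling_windows(n_days, train=1260, val=60, step=60):
    """Yield (train_start, train_end, val_end) index triples; end indices are exclusive."""
    start = 0
    while start + train + val <= n_days:
        # Training covers [start, start + train); validation follows immediately after.
        yield start, start + train, start + train + val
        start += step

n_days = 2830                         # hypothetical count of trading days in the sample
cycles = list(rolling_windows(n_days))
print(len(cycles))                    # 26
print(cycles[0], cycles[-1])          # (0, 1260, 1320) (1500, 2760, 2820)
```

Because the step equals the validation length, consecutive validation windows tile the timeline without overlap, and each training window ends exactly where its validation window begins.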

3.3.2. Novel Experimental Components

This study introduces a systematic evaluation of weight initialization strategies, with particular emphasis on the proposed proxy-assisted gradient-informed Laor initialization. The following components were developed to extend the baseline forecasting framework and isolate the contribution of the initialization method.
  • Initialization Strategies Compared: Seventeen initialization strategies were evaluated at the numerical input layer, grouped into four categories (a comparative summary is provided in Table 4):
  • Traditional data-agnostic methods: Xavier (Uniform, Normal), Kaiming (Uniform, Normal), Random (Uniform, Normal), and Orthogonal.
  • Variance-based method: Layer-Sequential Unit Variance (LSUV).
  • Error-driven method: Original Laor Initialization, using clustering on forward-pass loss values.
  • Gradient-informed variants (proposed): Laor, Laor-Normal (Laor-n), Laor-Uniform (Laor-u), Laor-Kaiming-Normal (Laor-kn), Laor-Kaiming-Uniform (Laor-ku), Laor-Xavier-Normal (Laor-xn), Laor-Xavier-Uniform (Laor-xu), Laor-Orthogonal (Laor-o). All use the same gradient-norm clustering scheme and differ only in the randomization method used to generate candidate weights.
These strategies are applied solely at the input layer; internal layers follow fixed initializers based on their architecture and activation type (as described in Table 1). This separation isolates the impact of input-layer initialization while maintaining compatibility with diverse architectures.
  • Proxy-Layer and Clustering Procedure: Each strategy is evaluated through the same proxy model described in Sections 3.1 and 3.2. For gradient-informed variants, $N = 10$ weight candidates are generated using the assigned randomizer, evaluated by gradient norm, and clustered using k-means with $k = 2$. The choice of $N = 10$ follows the established methodology from the original Laor initialization paper, ensuring methodological consistency and comparability with prior work. The parameter $k = 2$ aligns with the binary classification structure of our forecasting task (flagged vs. not flagged stocks), providing a natural clustering framework for candidate selection. The final weight is computed by averaging the members of the cluster with the highest mean gradient norm, as specified in Algorithm 2. The same clustering process is used for the error-driven variant, except that it is based on forward-pass loss and selects the cluster with the lowest mean loss. All candidates are evaluated under identical proxy conditions to ensure fair comparison.
  • Statistical Robustness Checks: To assess the robustness of each strategy, every experimental configuration was repeated 20 times with independent random seeds. Aggregate statistics, including mean performance, standard deviation, and training time, were reported for each model-initializer combination, allowing for comprehensive comparison of predictive accuracy and convergence efficiency.

3.4. Experiment Summary

All experiments in this study are conducted within a previously validated deep time-series forecasting pipeline, incorporating safeguards against data leakage, concept drift, class imbalance, and cold-start scenarios. These aspects are addressed through sequence-based input design, rolling-window training, SMOTE-based oversampling, and implicit transfer learning.
Five deep learning architectures, Transformer, ConvTran, MLSTM-FCN, MALSTM-FCN, and MMAGRU-FCN, are retained from prior work, in which they demonstrated strong and consistent performance for multivariate time-series classification. These configurations were previously identified as the most effective under real-world forecasting constraints.
No architectural or training modifications are introduced in this study. All preprocessing steps, hyperparameters, and training routines are preserved to ensure consistency with the original pipeline, thereby isolating the impact of weight initialization.
Only the initialization strategy at the numerical input layer is varied. This setup ensures that any observed variation in performance or training efficiency, as reported in Section 4, can be attributed solely to the effect of weight initialization.

4. Results

This section presents empirical results evaluating the proposed gradient-informed initialization strategies against traditional, variance-based, and error-driven methods. All evaluations are conducted within the controlled forecasting pipeline described in Section 3, using five deep learning architectures selected from prior work. To ensure statistical robustness, each model–initializer configuration is trained over 20 independent runs. Evaluation focuses on three key dimensions: (1) baseline performance comparison on the Transformer model, (2) generalizability across diverse architectural types, and (3) convergence speed and computational efficiency. Full classification and convergence metrics are provided in Appendix A (Tables A1–A10).

4.1. Baseline Performance Comparison

This section presents classification and convergence results for all weight initialization strategies evaluated on the Transformer model, serving as a representative case for baseline comparison. Initializers include traditional data-agnostic methods (e.g., Xavier, Kaiming, Orthogonal), a variance-balancing strategy (LSUV), an error-driven method (Laor-ori), and eight gradient-informed variants proposed in this study. Evaluation focuses on Matthews Correlation Coefficient (MCC), which is appropriate for imbalanced classification tasks. Training time and initialization overhead are also reported.
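To illustrate why MCC is preferred over accuracy under class imbalance, the following minimal Python sketch computes MCC from binary confusion-matrix counts. The function name and the 950/50 class split are hypothetical and are not taken from the study's data; returning 0 when the denominator vanishes is a common convention, assumed here.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC from binary confusion-matrix counts; 0.0 when the denominator is zero."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


# Hypothetical imbalanced test set: 950 negatives and 50 positives.
# A classifier that always predicts "negative" looks accurate but carries no signal.
always_negative = {"tp": 0, "tn": 950, "fp": 0, "fn": 50}
accuracy = (always_negative["tp"] + always_negative["tn"]) / 1000

print(f"accuracy = {accuracy:.2f}")
print(f"MCC      = {matthews_corrcoef(**always_negative):.2f}")
```

In this toy case the trivial classifier reaches 95% accuracy while its MCC is zero, which is the behavior that motivates reporting MCC as the primary metric.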
As summarized in Table 5, baseline methods such as Normal and LSUV achieved high MCC scores (70.51 ± 1.02% and 70.47 ± 0.80%, respectively), but incurred the longest training times (776.0 s and 805.2 s). Laor-ori yielded similar training durations (748.2 s) but lower MCC (68.53 ± 1.07%), suggesting limited benefit from forward-loss clustering in the proxy-assisted setup.
Gradient-informed Laor variants demonstrated more favorable trade-offs. Laor reached 69.54 ± 1.31% MCC with significantly reduced training time (592.7 s), while Laor-n achieved the highest MCC among the gradient-informed variants (70.16 ± 0.90%), within 0.35 percentage points of Normal, and the shortest training time of all methods (590.7 s). These results suggest that backward-gradient clustering supports efficient convergence while preserving classification performance.
Figure 2 provides a visual comparison of test MCC against total training time for all initialization strategies. Highlighted markers identify the top three performers in terms of predictive accuracy and convergence efficiency.
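One way to read the accuracy versus training-time trade-off is through Pareto dominance. This framing is not used in the study itself and is offered here only as an illustration. The sketch below takes a subset of the Transformer results (mean test MCC and mean total training time from Tables A1 and A2) and keeps the initializers that no other listed initializer beats on both axes. The `dominates` helper is a hypothetical name, and standard deviations are ignored.

```python
# (mean test MCC in %, mean total training time in s) for the Transformer,
# copied from Tables A1 and A2; only a subset of initializers is listed.
results = {
    "Kaiming-Normal": (68.82, 592.965),
    "Normal": (70.51, 776.008),
    "LSUV": (70.47, 805.164),
    "Laor-ori": (68.53, 748.190),
    "Laor": (69.54, 592.715),
    "Laor-n": (70.16, 590.652),
}


def dominates(a, b):
    """True if a is at least as good as b on both axes and strictly better on one."""
    (mcc_a, time_a), (mcc_b, time_b) = a, b
    no_worse = mcc_a >= mcc_b and time_a <= time_b
    strictly_better = mcc_a > mcc_b or time_a < time_b
    return no_worse and strictly_better


# An initializer is on the front if no other listed initializer dominates it.
pareto_front = sorted(
    name
    for name, point in results.items()
    if not any(dominates(other, point) for other in results.values())
)
print(pareto_front)
```

On this subset, only Normal (highest MCC) and Laor-n (shortest training time) remain undominated when mean values alone are compared.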
Since this study reuses the previously validated experimental pipeline by Petchpol and Boongasame (2025) [31], standard baselines such as the Random Walk and the Random Forest are already incorporated, providing a transparent floor against which the benefits of more advanced initialization schemes can be assessed.

4.2. Generalizability of Weight Initializers Across Architectures

To evaluate the generalizability of weight initialization strategies across architectures with differing inductive biases, seventeen initializers were assessed on five deep learning models: Transformer, ConvTran, MLSTM-FCN, MALSTM-FCN, and MMAGRU-FCN. These models range from attention-only architectures to hybrids incorporating convolutional and recurrent modules. The tested initializers include traditional data-agnostic schemes (e.g., Xavier, Kaiming, Normal, Orthogonal), the variance-balancing LSUV method, the error-driven Laor-ori, and eight gradient-informed Laor variants. Detailed classification metrics, including MCC, Accuracy, Precision, Recall, and F1 Score, are provided in Appendix A, Tables A1, A3, A5, A7, and A9, while convergence-related statistics appear in Appendix A, Tables A2, A4, A6, A8, and A10. Comparative visualizations of performance and training time are shown in Figures 2–6.
On the Transformer model (Figure 2), Normal and LSUV achieved the highest MCC scores (70.51 ± 1.02% and 70.47 ± 0.80%, respectively), though they incurred longer training times (>750 s). Laor-n performed comparably (70.16 ± 0.90%) while reducing total training time to 590.7 s, the fastest among all methods.
On the ConvTran architecture (Figure 3), gradient-informed variants achieved the best results. Both Laor-ku and Laor-xu achieved the highest MCC of 72.62 ± 1.95%, with Laor-ku also yielding the fastest training duration (998.282 s). Laor-n trained almost as quickly (998.713 s), although Table A3 reports a lower MCC for it (69.40 ± 2.10%), while Laor-u reached 72.02 ± 1.74% at a much longer training time (1315.229 s). Laor-kn and Laor-xn also performed competitively, both with MCC scores of 71.88 ± 2.23%. In contrast, traditional and variance-based initializers, including Orthogonal (71.66 ± 1.59%, 1298.746 s) and LSUV (69.85 ± 2.03%, 1299.713 s), lagged behind the top variants in both predictive accuracy and convergence efficiency.
In the MLSTM-FCN architecture (Figure 4), Normal (72.44 ± 1.94%) and LSUV (72.30 ± 2.20%) yielded the highest MCC scores. Among data-driven strategies, Laor-kn achieved 70.65 ± 2.58% while training faster (532.368 s) than LSUV (655.2 s), trading a moderate gain in speed for a slight reduction in accuracy. Notably, Kaiming-Normal (71.69 ± 2.04%, 520.549 s) surpassed Laor-kn in both accuracy and efficiency.
For the MALSTM-FCN architecture (Figure 5), LSUV achieved the highest MCC (72.54 ± 2.29%) but incurred a longer training time (621 s). Among gradient-informed initializers, Laor-o followed with 70.56 ± 4.70%, while Laor-kn and Laor-ku offered lower training times (~528–529 s) with MCC scores close to 70% (70.20 ± 3.13% and 69.78 ± 3.89%, respectively). Notably, Kaiming-Normal (70.00 ± 3.89%, ~528 s) and Kaiming-Uniform (70.39 ± 3.58%, ~531 s) performed on par with these Laor variants in both predictive accuracy and convergence time.
On the MMAGRU-FCN model (Figure 6), which integrates attention, convolution, and gated recurrence, gradient-informed variants showed strong scalability. Laor-o (72.87 ± 0.65%) and Laor-n (72.83 ± 0.86%) closely matched the performance of LSUV (73.74 ± 0.69%) while reducing training time by more than 160 s on average (1039.5 s vs. 1201.8 s).
Across all architectures, the original Laor-ori method generally trailed the best gradient-informed variants, indicating that error-driven clustering based on forward loss may be less effective at capturing early-stage optimization dynamics when applied through a proxy model. In contrast, gradient-informed Laor variants demonstrated notable adaptability in deeper and hybrid models such as Transformer, ConvTran, and MMAGRU-FCN. However, in simpler recurrent architectures like MLSTM-FCN and MALSTM-FCN, traditional initializers, particularly Kaiming-Normal and Kaiming-Uniform, achieved comparable or superior results in both predictive accuracy and convergence time. These findings suggest that while gradient-informed initialization offers benefits in complex model settings, its advantage may diminish in recurrent structures where variance-preserving schemes remain effective.

4.3. Cross-Architecture Comparison and Performance-Efficiency Trade-Offs

The preceding analysis reveals architecture-specific strengths of various initialization strategies, but broader patterns also emerge when comparing their behavior across forecasting models. This section synthesizes those results, highlighting general trade-offs between classification performance and convergence efficiency across all evaluated architectures.
Figure 7 visualizes this trade-off, marking initializers that achieved top three performance in MCC, total training time, or overall efficiency. These results illustrate how initialization impacts both accuracy and runtime behavior in shallow and deep forecasting architectures.
Gradient-informed Laor variants, particularly Laor-n, Laor-o, and Laor-ku, frequently ranked among the top three initializers in both predictive accuracy and training efficiency. For instance, Laor-n achieved the highest MCC among gradient-informed variants on Transformer (70.16 ± 0.90%, third overall behind Normal and LSUV) and remained competitive on MMAGRU-FCN (72.83 ± 0.86%) while exhibiting low total training times on both (590.7 s and 1039.5 s, respectively). Similarly, Laor-ku and Laor-o demonstrated favorable performance across ConvTran, MALSTM-FCN, and MMAGRU-FCN, underscoring their robustness in hybrid and deep configurations.
Traditional data-agnostic initializers such as Xavier and Kaiming performed adequately in recurrent-heavy architectures, including MLSTM-FCN and MALSTM-FCN, but struggled to match the performance or efficiency of gradient-informed methods in attention-based or multi-component networks. For example, Kaiming-Normal yielded 71.69 ± 2.04% MCC on MLSTM-FCN, but fell below 69% MCC on Transformer (68.82 ± 1.09%) and trailed the best gradient-informed variants on ConvTran (71.06 ± 2.81% vs. 72.62 ± 1.95%), where model depth and architectural heterogeneity increased initialization sensitivity.
The LSUV method delivered high classification performance in most settings, often ranking in the top three MCC scores (e.g., 72.54 ± 2.29% on MALSTM-FCN, 73.74 ± 0.69% on MMAGRU-FCN), but incurred consistently high convergence costs. Its average total training time ranged from about 621 s on MALSTM-FCN to about 1300 s on ConvTran and exceeded 800 s on Transformer, ConvTran, and MMAGRU-FCN, limiting its utility in computationally constrained deployments. In contrast, Laor-n and Laor-ku offered comparable or better accuracy in several settings with substantially shorter training times.
Laor-ori, the error-driven initializer, generally trailed the best gradient-informed variants. It exhibited middling MCC scores (e.g., 68.53 ± 1.07% on Transformer, 70.19 ± 2.72% on MALSTM-FCN) and longer training times, suggesting that forward-pass loss clustering, when implemented through a proxy model, may be insufficient for robust early-stage optimization.
Collectively, these results demonstrate that gradient-informed, proxy-assisted initialization achieves a favorable balance between accuracy and efficiency across architectures. This generalizability is particularly valuable in real-world forecasting pipelines, where frequent retraining, class imbalance, and evolving data distributions demand consistent and scalable learning behavior.
These cross-architecture results highlight the practical relevance of initialization strategies that balance predictive accuracy with convergence efficiency. In particular, deeper and hybrid models consistently benefited from optimization-aware initialization, while simpler recurrent models remained well-served by traditional schemes. These findings establish a foundation for the broader interpretation, practical implications, and limitations discussed in Section 5.

4.4. Computational Overhead Analysis

This section quantifies the computational impact of different initialization strategies, contrasting weight initialization time against total training cost and forecasting accuracy.
Figure 8 presents side-by-side boxplots of initialization time, total training time, and test MCC performance. All initialization methods, including data-driven proxy-based variants, feature consistently low initialization times (~0.06–0.18 s per experiment). Standard schemes (Kaiming, Xavier, Orthogonal) fall within the same range, and even the slowest Laor variant (Laor-xn) remains only fractionally higher.
When compared to full training durations (520–1315 s), initialization overhead is always below 0.04% (e.g., 0.18 s/520 s ≈ 0.035%). Thus, the cost of proxy-based initialization is negligible in both large-scale and constrained environments. Importantly, the middle panel shows that gradient-informed variants (Laor, Laor-n, Laor-ku, Laor-kn) also rank among the shortest overall training times, rivaling or surpassing the fastest canonical schemes. The bottom panel further demonstrates that top-performing strategies in terms of MCC (up to 73.74% for LSUV) do not incur any meaningful initialization cost penalty. For context, computing gradients directly on full models would require 7–17 s per candidate. With N = 10 candidates, initialization would instead take 70–172 s, roughly 400–1000 times slower than the proxy method and amounting to 15–25% of total training time. By contrast, the proxy approach achieves the same gradient-informed benefit at less than 0.04% overhead, making it about 3000 times more efficient.
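The overhead arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that uses only the rounded figures quoted in the text (0.18 s worst-case initialization, 520 s shortest training run, 7–17 s per full-model candidate, N = 10); the function and variable names are hypothetical. With the rounded per-candidate bounds the full-model cost comes out at 70–170 s, close to the quoted range.

```python
def overhead_percent(init_seconds: float, total_seconds: float) -> float:
    """Initialization time expressed as a percentage of total training time."""
    return 100.0 * init_seconds / total_seconds


# Worst case quoted in the text: slowest initialization, shortest training run.
proxy_init_seconds = 0.18
worst_case = overhead_percent(proxy_init_seconds, 520.0)

# Full-model alternative quoted in the text: 7-17 s per candidate, N = 10 candidates.
n_candidates = 10
full_model_low = 7.0 * n_candidates
full_model_high = 17.0 * n_candidates

print(f"proxy overhead (worst case): {worst_case:.3f}%")
print(f"full-model gradient cost:    {full_model_low:.0f}-{full_model_high:.0f} s")
print(
    "slowdown vs. proxy:          "
    f"{full_model_low / proxy_init_seconds:.0f}x-"
    f"{full_model_high / proxy_init_seconds:.0f}x"
)
```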
In summary, proxy-based gradient-informed initialization delivers a compelling balance: negligible overhead, reduced total training time, and competitive or superior predictive accuracy. This efficiency advantage makes the approach particularly suitable for resource-constrained forecasting pipelines requiring frequent retraining.

5. Discussion and Conclusions

This study introduced a proxy-assisted, gradient-informed weight initialization strategy for deep time-series forecasting models. By clustering backward gradient norms from a lightweight proxy at the numerical input layer, the method provides optimization-aware yet architecture-agnostic initialization. Compared with traditional variance-preserving and error-driven strategies, it consistently improved convergence speed and predictive accuracy, particularly in deeper and hybrid architectures such as Transformer, ConvTran, and MMAGRU-FCN.
These findings highlight that initialization plays a nontrivial role in forecasting performance, especially under practical constraints of non-stationary data, class imbalance, and limited resources. Gradient-informed clustering at the input layer alone was sufficient to guide early-stage optimization, and the fixed proxy design enabled systematic evaluation across architectures with negligible overhead (less than 0.04% of total training time). The robustness of Laor-n, Laor-o, and Laor-ku across diverse models underscores the value of combining gradient-based feedback with lightweight proxy evaluation.
At the same time, limitations remain. The method showed reduced advantage in simpler recurrent architectures (e.g., MLSTM-FCN and MALSTM-FCN), where variance-based schemes such as LSUV or Kaiming initialization remained competitive. The current linear proxy and reliance on k-means clustering represent conservative design choices that trade fidelity for efficiency. More adaptive clustering algorithms or nonlinear proxy layers may better capture complex gradient landscapes. In addition, validation was confined to financial risk prediction; extending evaluation to domains such as energy demand, epidemiology, and environmental monitoring will be important for establishing broader generalizability.
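For readers who want a concrete picture of the linear proxy and k-means design discussed above, the following NumPy sketch shows one possible reading of it. It is not the authors' implementation. The assumptions are: candidates are drawn from a zero-mean normal distribution, the proxy is a single linear layer trained with MSE loss, k-means with k = 2 runs on the scalar gradient norms, and the selection rule (take the lower-norm cluster and return the member closest to its centroid) is chosen purely for illustration. All function names are hypothetical.

```python
import numpy as np


def gradient_norm(weights, x, y):
    """Frobenius norm of dL/dW for a linear proxy with L = mean((xW - y)^2)."""
    residual = x @ weights - y
    grad = 2.0 * x.T @ residual / residual.size
    return float(np.linalg.norm(grad))


def two_means_1d(values, n_iter=50):
    """Plain k-means with k = 2 on scalars, seeded at the min and max value."""
    centroids = np.array([values.min(), values.max()], dtype=float)
    labels = np.zeros(len(values), dtype=int)
    for _ in range(n_iter):
        # Assign each value to its nearest centroid (ties go to cluster 0).
        labels = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        new_centroids = np.array([
            values[labels == c].mean() if np.any(labels == c) else centroids[c]
            for c in range(2)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


def select_candidate(x, y, n_candidates=10, scale=0.1, seed=0):
    """Draw candidate input-layer weights and pick one by gradient-norm clustering."""
    rng = np.random.default_rng(seed)
    shape = (x.shape[1], y.shape[1])
    candidates = [rng.normal(0.0, scale, size=shape) for _ in range(n_candidates)]
    norms = np.array([gradient_norm(w, x, y) for w in candidates])
    labels, centroids = two_means_1d(norms)
    # ASSUMPTION (illustrative only): prefer the cluster with the smaller mean
    # gradient norm and return its member closest to that cluster's centroid.
    target = int(np.argmin(centroids))
    members = np.flatnonzero(labels == target)
    best = members[np.abs(norms[members] - centroids[target]).argmin()]
    return candidates[best], norms, labels


# Hypothetical toy data: 64 samples, 8 numerical features, one output.
demo_rng = np.random.default_rng(42)
x_demo = demo_rng.normal(size=(64, 8))
y_demo = demo_rng.normal(size=(64, 1))
w_init, norms, labels = select_candidate(x_demo, y_demo)
print(w_init.shape, np.round(norms, 3), labels)
```

Because only one backward pass through a linear layer is needed per candidate, the cost grows linearly with the number of candidates, which is consistent with the sub-second initialization times reported in Section 4.4.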
Future research should also investigate sensitivity to hyperparameters such as the number of candidates and clusters. In the present study, N = 10 was selected in accordance with the original Laor initialization methodology to ensure comparability, while k = 2 was aligned with the binary classification structure of our forecasting task (flagged vs. not flagged stocks). Although these settings provided a solid foundation, a systematic ablation across broader ranges of N and k may yield practical guidance for different architectures and tasks. Another promising direction is calibration: assessing how initialization strategies affect probability reliability through Brier score, Expected Calibration Error, and reliability diagrams. Such work is particularly relevant for risk-sensitive domains like finance, where well-calibrated probabilities matter as much as raw accuracy.
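As a pointer for the calibration analysis suggested above, the sketch below computes the Brier score and one common variant of Expected Calibration Error for a binary task: predictions are binned by the predicted probability of the positive class (equal-width bins), and the gap between mean predicted probability and observed positive rate is averaged with bin-size weights. The bin count, function names, and toy inputs are assumptions for illustration and do not come from the study.

```python
import numpy as np


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return float(np.mean((p_pred - y_true) ** 2))


def expected_calibration_error(y_true, p_pred, n_bins=10):
    """Bin-size-weighted mean of |mean predicted probability - observed positive rate|."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each probability to a bin index in 0 .. n_bins - 1.
    bin_ids = np.digitize(p_pred, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(p_pred[mask].mean() - y_true[mask].mean())
            ece += mask.mean() * gap
    return float(ece)


# Hypothetical predictions for four stocks (1 = flagged, 0 = not flagged).
y_toy = np.array([0, 0, 1, 1])
p_toy = np.array([0.05, 0.15, 0.85, 0.95])
print(brier_score(y_toy, p_toy), expected_calibration_error(y_toy, p_toy))
```

Comparing these two numbers across initializers, alongside MCC, would show whether a faster-converging initialization also yields more reliable probabilities.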
In summary, the proposed gradient-informed initialization strategy offers a practical means of improving training efficiency and stability in real-world forecasting pipelines. Its negligible computational cost, architecture-agnostic design, and compatibility with established workflows make it especially suited for frequent retraining under resource constraints. While limitations remain in recurrent settings and domain coverage, the framework provides a strong foundation for future extensions toward adaptive, calibration-aware, and cross-domain initialization strategies.

Author Contributions

Conceptualization, K.P. and L.B.; methodology, K.P.; software, K.P.; validation, K.P.; formal analysis, K.P. and L.B.; investigation, K.P.; resources, K.P.; data curation, K.P.; writing—original draft preparation, K.P.; writing—review and editing, K.P. and L.B.; visualization, K.P.; supervision, L.B.; project administration, K.P.; funding acquisition, L.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was conducted as part of the author’s studies at King Mongkut’s Institute of Technology Ladkrabang (KMITL). It was supported by the National Science, Research and Innovation Fund (NSRF) through KMITL under grant number RE-KRIS/FF68/45. The Stock Exchange of Thailand provided non-financial support through data access.

Data Availability Statement

The datasets used in this study, including End-of-Day stock data and Market Surveillance Measure List data, were obtained from the Stock Exchange of Thailand (SET). Due to licensing restrictions, the data is not publicly available. However, researchers may request access to the data through the SET research department. The data schema and preprocessing details are provided in our previously published work (Petchpol & Boongasame, 2025) [31], which is cited within the manuscript.

Conflicts of Interest

Katsamapol Petchpol is an employee of The Stock Exchange of Thailand (SET) while pursuing a Doctoral Degree at King Mongkut’s Institute of Technology Ladkrabang. The research utilized data provided by SET, with the request for data made under a student identity.

Abbreviations

The following abbreviations are used in this manuscript:
CNNs: Convolutional Neural Networks
EOD: End-of-Day
GRUs: Gated Recurrent Units
Kaiming-n: Kaiming Normal initialization
Kaiming-u: Kaiming Uniform initialization
KMITL: King Mongkut’s Institute of Technology Ladkrabang
Laor: Gradient-Informed Laor Initialization
Laor-kn: Gradient-Informed Laor Initialization with Kaiming Normal initialization
Laor-ku: Gradient-Informed Laor Initialization with Kaiming Uniform initialization
Laor-n: Gradient-Informed Laor Initialization with Normal initialization
Laor-o: Gradient-Informed Laor Initialization with Orthogonal initialization
Laor-ori: Original Laor Initialization
Laor-u: Gradient-Informed Laor Initialization with Uniform initialization
Laor-xn: Gradient-Informed Laor Initialization with Xavier Normal initialization
Laor-xu: Gradient-Informed Laor Initialization with Xavier Uniform initialization
LSTM: Long Short-Term Memory
LSUV: Layer-Sequential Unit-Variance
MCC: Matthews Correlation Coefficient
MSE: Mean Squared Error
MSML: Market Surveillance Measure List
MTSC: Multivariate Time-Series Classification
NSRF: National Science, Research and Innovation Fund
ReLU: Rectified Linear Units
RNNs: Recurrent Neural Networks
SET: The Stock Exchange of Thailand
SETSMART: SET Market Analysis and Reporting Tool
SMOTE: Synthetic Minority Over-sampling Technique
Xavier-n: Xavier Normal initialization
Xavier-u: Xavier Uniform initialization

Appendix A

Table A1. Classification metrics for weight initializers on the Transformer model. Metrics include Accuracy, Precision, Recall, F1 Score, and MCC on the test set.
| Group | Weight Initializer | Test Accuracy (%) | Test Precision (%) | Test Recall (%) | Test F1 Score (%) | Test MCC (%) |
|---|---|---|---|---|---|---|
| Traditional (data-agnostic baseline) | Kaiming-Normal | 84.30 ± 0.55 | 81.84 ± 0.82 | 88.39 ± 0.97 | 84.98 ± 0.52 | 68.82 ± 1.09 |
| | Kaiming-Uniform | 84.23 ± 0.69 | 81.77 ± 0.90 | 88.33 ± 1.11 | 84.92 ± 0.66 | 68.69 ± 1.39 |
| | Xavier-Normal | 84.30 ± 0.55 | 81.84 ± 0.82 | 88.39 ± 0.97 | 84.98 ± 0.52 | 68.82 ± 1.09 |
| | Xavier-Uniform | 84.23 ± 0.69 | 81.77 ± 0.90 | 88.33 ± 1.11 | 84.92 ± 0.66 | 68.69 ± 1.39 |
| | Normal | 85.22 ± 0.50 | 84.02 ± 0.58 | 87.19 ± 1.28 | 85.57 ± 0.57 | 70.51 ± 1.02 |
| | Uniform | 84.33 ± 0.45 | 82.61 ± 0.70 | 87.19 ± 1.31 | 84.83 ± 0.51 | 68.77 ± 0.94 |
| | Orthogonal | 84.25 ± 0.58 | 81.70 ± 1.06 | 88.52 ± 0.95 | 84.96 ± 0.48 | 68.75 ± 1.08 |
| Data-driven: variance-based | LSUV | 85.20 ± 0.39 | 84.11 ± 0.90 | 87.02 ± 1.44 | 85.53 ± 0.45 | 70.47 ± 0.80 |
| Data-driven: error-driven (original Laor) | Laor-ori | 84.14 ± 0.55 | 81.49 ± 0.86 | 88.56 ± 0.92 | 84.88 ± 0.50 | 68.53 ± 1.07 |
| Data-driven: gradient-informed (Laor variants) | Laor | 84.71 ± 0.69 | 83.69 ± 1.77 | 86.54 ± 2.33 | 85.05 ± 0.69 | 69.54 ± 1.31 |
| | Laor-kn | 84.04 ± 0.58 | 81.96 ± 1.27 | 87.55 ± 1.88 | 84.64 ± 0.61 | 68.27 ± 1.16 |
| | Laor-ku | 83.97 ± 0.62 | 81.52 ± 0.68 | 88.06 ± 0.95 | 84.66 ± 0.61 | 68.15 ± 1.26 |
| | Laor-xn | 84.04 ± 0.58 | 81.96 ± 1.27 | 87.55 ± 1.88 | 84.64 ± 0.61 | 68.27 ± 1.16 |
| | Laor-xu | 83.97 ± 0.62 | 81.52 ± 0.68 | 88.06 ± 0.95 | 84.66 ± 0.61 | 68.15 ± 1.26 |
| | Laor-n | 85.05 ± 0.44 | 84.22 ± 1.09 | 86.49 ± 1.95 | 85.32 ± 0.58 | 70.16 ± 0.90 |
| | Laor-u | 84.28 ± 0.44 | 82.70 ± 0.41 | 86.91 ± 1.01 | 84.75 ± 0.49 | 68.66 ± 0.91 |
| | Laor-o | 84.29 ± 0.48 | 81.95 ± 0.99 | 88.18 ± 1.24 | 84.94 ± 0.44 | 68.79 ± 0.93 |
Note: Boldface indicates the top three scores for each metric. Higher values reflect better classification performance. Results are averaged over 20 runs for robustness.
Table A2. Convergence and initialization metrics for weight initializers on the Transformer model. Metrics include time per epoch, number of epochs, initialization duration, and total training time.
| Group | Weight Initializer | Training Time per Epoch (s) | Number of Epochs | Weight Initialization Time (s) | Total Training Time (s) |
|---|---|---|---|---|---|
| Traditional (data-agnostic baseline) | Kaiming-Normal | 7.505 ± 0.044 | 79.0 ± 5.7 | 0.101 ± 0.012 | 592.965 ± 0.250 |
| | Kaiming-Uniform | 7.508 ± 0.038 | 80.8 ± 8.1 | 0.115 ± 0.012 | 606.364 ± 0.312 |
| | Xavier-Normal | 9.513 ± 0.175 | 79.0 ± 5.7 | 0.097 ± 0.016 | 751.601 ± 0.990 |
| | Xavier-Uniform | 7.514 ± 0.039 | 80.8 ± 8.1 | 0.098 ± 0.013 | 606.819 ± 0.313 |
| | Normal | 9.521 ± 0.175 | 81.5 ± 6.6 | 0.080 ± 0.016 | 776.008 ± 1.158 |
| | Uniform | 9.529 ± 0.186 | 77.8 ± 6.7 | 0.105 ± 0.013 | 741.499 ± 1.240 |
| | Orthogonal | 9.581 ± 0.164 | 81.4 ± 8.0 | 0.080 ± 0.013 | 779.953 ± 1.305 |
| Data-driven: variance-based | LSUV | 9.504 ± 0.159 | 84.7 ± 7.1 | 0.134 ± 0.014 | 805.164 ± 1.130 |
| Data-driven: error-driven (original Laor) | Laor-ori | 9.536 ± 0.175 | 78.5 ± 5.7 | 0.120 ± 0.005 | 748.190 ± 0.994 |
| Data-driven: gradient-informed (Laor variants) | Laor | 7.463 ± 0.074 | 79.4 ± 8.4 | 0.121 ± 0.007 | 592.715 ± 0.624 |
| | Laor-kn | 7.508 ± 0.041 | 81.4 ± 5.3 | 0.124 ± 0.010 | 611.257 ± 0.218 |
| | Laor-ku | 7.515 ± 0.033 | 80.2 ± 4.3 | 0.118 ± 0.014 | 602.799 ± 0.144 |
| | Laor-xn | 9.533 ± 0.191 | 81.4 ± 5.3 | 0.120 ± 0.014 | 776.073 ± 1.010 |
| | Laor-xu | 9.490 ± 0.209 | 80.2 ± 4.3 | 0.134 ± 0.012 | 761.259 ± 0.897 |
| | Laor-n | 7.513 ± 0.037 | 78.6 ± 9.6 | 0.121 ± 0.016 | 590.652 ± 0.353 |
| | Laor-u | 9.510 ± 0.152 | 78.9 ± 7.2 | 0.123 ± 0.013 | 749.999 ± 1.095 |
| | Laor-o | 9.530 ± 0.163 | 81.3 ± 7.3 | 0.150 ± 0.014 | 774.926 ± 1.182 |
Note: Boldface highlights the top three results per metric. Lower values are preferred, as they indicate reduced computational cost. All results are averaged over 20 independent runs.
Table A3. Classification metrics for weight initializers on the ConvTran model. Metrics include Accuracy, Precision, Recall, F1 Score, and MCC on the test set.
| Group | Weight Initializer | Test Accuracy (%) | Test Precision (%) | Test Recall (%) | Test F1 Score (%) | Test MCC (%) |
|---|---|---|---|---|---|---|
| Traditional (data-agnostic baseline) | Kaiming-Normal | 85.43 ± 1.52 | 86.42 ± 1.72 | 84.57 ± 4.91 | 85.37 ± 2.04 | 71.06 ± 2.81 |
| | Kaiming-Uniform | 85.68 ± 1.73 | 86.47 ± 1.67 | 85.06 ± 4.94 | 85.65 ± 2.22 | 71.53 ± 3.29 |
| | Xavier-Normal | 85.43 ± 1.52 | 86.42 ± 1.72 | 84.57 ± 4.91 | 85.37 ± 2.04 | 71.06 ± 2.81 |
| | Xavier-Uniform | 85.68 ± 1.73 | 86.47 ± 1.67 | 85.06 ± 4.94 | 85.65 ± 2.22 | 71.53 ± 3.29 |
| | Normal | 84.49 ± 0.86 | 85.08 ± 2.43 | 84.29 ± 4.65 | 84.55 ± 1.32 | 69.20 ± 1.69 |
| | Uniform | 85.23 ± 1.74 | 86.14 ± 2.07 | 84.54 ± 5.87 | 85.17 ± 2.41 | 70.73 ± 3.15 |
| | Orthogonal | 85.76 ± 0.84 | 86.81 ± 1.71 | 84.81 ± 3.48 | 85.73 ± 1.17 | 71.66 ± 1.59 |
| Data-driven: variance-based | LSUV | 84.80 ± 1.08 | 86.30 ± 2.20 | 83.32 ± 4.60 | 84.66 ± 1.56 | 69.85 ± 2.03 |
| Data-driven: error-driven (original Laor) | Laor-ori | 85.21 ± 1.18 | 86.82 ± 1.55 | 83.47 ± 3.80 | 85.04 ± 1.54 | 70.59 ± 2.24 |
| Data-driven: gradient-informed (Laor variants) | Laor | 85.21 ± 1.37 | 85.01 ± 2.01 | 85.98 ± 3.43 | 85.43 ± 1.54 | 70.53 ± 2.70 |
| | Laor-kn | 85.88 ± 1.21 | 85.83 ± 1.57 | 86.41 ± 3.80 | 86.05 ± 1.58 | 71.88 ± 2.23 |
| | Laor-ku | 86.27 ± 0.99 | 85.92 ± 1.38 | 87.17 ± 3.28 | 86.48 ± 1.26 | 72.62 ± 1.95 |
| | Laor-xn | 85.88 ± 1.21 | 85.83 ± 1.57 | 86.41 ± 3.80 | 86.05 ± 1.58 | 71.88 ± 2.23 |
| | Laor-xu | 86.27 ± 0.99 | 85.92 ± 1.38 | 87.17 ± 3.28 | 86.48 ± 1.26 | 72.62 ± 1.95 |
| | Laor-n | 84.64 ± 1.06 | 84.45 ± 1.93 | 85.44 ± 3.37 | 84.87 ± 1.27 | 69.40 ± 2.10 |
| | Laor-u | 85.92 ± 0.94 | 85.57 ± 2.17 | 86.96 ± 4.08 | 86.16 ± 1.30 | 72.02 ± 1.74 |
| | Laor-o | 85.34 ± 1.49 | 86.95 ± 1.43 | 83.63 ± 4.72 | 85.16 ± 2.00 | 70.89 ± 2.77 |
Note: Boldface indicates the top three scores for each metric. Higher values reflect better classification performance. Results are averaged over 20 runs for robustness.
Table A4. Convergence and initialization metrics for weight initializers on the ConvTran model. Metrics include time per epoch, number of epochs, initialization duration, and total training time.
| Group | Weight Initializer | Training Time per Epoch (s) | Number of Epochs | Weight Initialization Time (s) | Total Training Time (s) |
|---|---|---|---|---|---|
| Traditional (data-agnostic baseline) | Kaiming-Normal | 13.747 ± 0.065 | 75.8 ± 10.2 | 0.147 ± 0.017 | 1042.183 ± 0.663 |
| | Kaiming-Uniform | 13.744 ± 0.050 | 76.7 ± 7.6 | 0.117 ± 0.014 | 1053.605 ± 0.379 |
| | Xavier-Normal | 17.198 ± 0.131 | 75.8 ± 10.2 | 0.114 ± 0.014 | 1303.742 ± 1.334 |
| | Xavier-Uniform | 13.747 ± 0.051 | 76.7 ± 7.6 | 0.116 ± 0.009 | 1053.805 ± 0.385 |
| | Normal | 17.186 ± 0.135 | 75.1 ± 6.2 | 0.098 ± 0.014 | 1290.779 ± 0.841 |
| | Uniform | 17.169 ± 0.125 | 76.0 ± 7.1 | 0.097 ± 0.014 | 1304.916 ± 0.887 |
| | Orthogonal | 17.144 ± 0.060 | 75.8 ± 10.8 | 0.101 ± 0.017 | 1298.746 ± 0.650 |
| Data-driven: variance-based | LSUV | 17.134 ± 0.102 | 75.9 ± 7.1 | 0.112 ± 0.016 | 1299.713 ± 0.724 |
| Data-driven: error-driven (original Laor) | Laor-ori | 17.161 ± 0.077 | 75.6 ± 9.2 | 0.135 ± 0.013 | 1296.674 ± 0.710 |
| Data-driven: gradient-informed (Laor variants) | Laor | 13.743 ± 0.047 | 73.7 ± 7.7 | 0.152 ± 0.013 | 1012.328 ± 0.363 |
| | Laor-kn | 13.766 ± 0.051 | 74.7 ± 9.8 | 0.134 ± 0.012 | 1028.431 ± 0.500 |
| | Laor-ku | 13.758 ± 0.043 | 72.6 ± 7.8 | 0.135 ± 0.012 | 998.282 ± 0.335 |
| | Laor-xn | 17.173 ± 0.100 | 74.7 ± 9.8 | 0.176 ± 0.012 | 1283.035 ± 0.971 |
| | Laor-xu | 17.214 ± 0.128 | 72.6 ± 7.8 | 0.136 ± 0.013 | 1249.016 ± 1.003 |
| | Laor-n | 13.764 ± 0.048 | 72.6 ± 5.9 | 0.139 ± 0.012 | 998.713 ± 0.285 |
| | Laor-u | 17.180 ± 0.112 | 76.6 ± 8.1 | 0.137 ± 0.012 | 1315.229 ± 0.908 |
| | Laor-o | 17.174 ± 0.085 | 74.0 ± 5.9 | 0.132 ± 0.011 | 1270.980 ± 0.500 |
Note: Boldface highlights the top three results per metric. Lower values are preferred, as they indicate reduced computational cost. All results are averaged over 20 independent runs.
Table A5. Classification metrics for weight initializers on the MLSTM-FCN model. Metrics include Accuracy, Precision, Recall, F1 Score, and MCC on the test set.
| Group | Weight Initializer | Test Accuracy (%) | Test Precision (%) | Test Recall (%) | Test F1 Score (%) | Test MCC (%) |
|---|---|---|---|---|---|---|
| Traditional (data-agnostic baseline) | Kaiming-Normal | 85.83 ± 1.03 | 86.20 ± 0.73 | 85.42 ± 2.02 | 85.80 ± 1.16 | 71.69 ± 2.04 |
| | Kaiming-Uniform | 85.35 ± 1.46 | 86.53 ± 1.02 | 83.88 ± 4.22 | 85.11 ± 1.92 | 70.84 ± 2.75 |
| | Xavier-Normal | 85.83 ± 1.03 | 86.20 ± 0.73 | 85.42 ± 2.02 | 85.80 ± 1.16 | 71.69 ± 2.04 |
| | Xavier-Uniform | 85.35 ± 1.46 | 86.53 ± 1.02 | 83.88 ± 4.22 | 85.11 ± 1.92 | 70.84 ± 2.75 |
| | Normal | 86.16 ± 1.03 | 87.81 ± 1.05 | 84.10 ± 2.80 | 85.88 ± 1.29 | 72.44 ± 1.94 |
| | Uniform | 85.06 ± 1.43 | 87.81 ± 0.97 | 81.57 ± 4.15 | 84.50 ± 1.90 | 70.40 ± 2.63 |
| | Orthogonal | 85.97 ± 1.19 | 86.46 ± 1.04 | 85.40 ± 2.41 | 85.90 ± 1.35 | 71.97 ± 2.36 |
| Data-driven: variance-based | LSUV | 86.10 ± 1.14 | 87.61 ± 1.19 | 84.21 ± 2.66 | 85.85 ± 1.35 | 72.30 ± 2.20 |
| Data-driven: error-driven (original Laor) | Laor-ori | 85.20 ± 1.49 | 86.26 ± 1.21 | 83.89 ± 3.89 | 85.00 ± 1.85 | 70.52 ± 2.87 |
| Data-driven: gradient-informed (Laor variants) | Laor | 85.22 ± 1.22 | 87.76 ± 1.24 | 82.00 ± 3.62 | 84.73 ± 1.61 | 70.68 ± 2.25 |
| | Laor-kn | 85.26 ± 1.35 | 86.71 ± 1.07 | 83.41 ± 3.47 | 84.99 ± 1.68 | 70.65 ± 2.58 |
| | Laor-ku | 85.13 ± 1.12 | 86.50 ± 1.05 | 83.41 ± 3.25 | 84.88 ± 1.46 | 70.38 ± 2.09 |
| | Laor-xn | 85.26 ± 1.35 | 86.71 ± 1.07 | 83.41 ± 3.47 | 84.99 ± 1.68 | 70.65 ± 2.58 |
| | Laor-xu | 85.13 ± 1.12 | 86.50 ± 1.05 | 83.41 ± 3.25 | 84.88 ± 1.46 | 70.38 ± 2.09 |
| | Laor-n | 84.74 ± 1.90 | 88.18 ± 1.20 | 80.39 ± 4.98 | 84.00 ± 2.56 | 69.88 ± 3.39 |
| | Laor-u | 85.37 ± 1.35 | 87.47 ± 0.77 | 82.69 ± 3.66 | 84.96 ± 1.74 | 70.92 ± 2.53 |
| | Laor-o | 85.72 ± 1.23 | 86.62 ± 0.99 | 84.62 ± 3.71 | 85.55 ± 1.60 | 71.54 ± 2.36 |
Note: Boldface indicates the top three scores for each metric. Higher values reflect better classification performance. Results are averaged over 20 runs for robustness.
Table A6. Convergence and initialization metrics for weight initializers on the MLSTM-FCN model. Metrics include time per epoch, number of epochs, initialization duration, and total training time.
| Group | Weight Initializer | Training Time per Epoch (s) | Number of Epochs | Weight Initialization Time (s) | Total Training Time (s) |
|---|---|---|---|---|---|
| Traditional (data-agnostic baseline) | Kaiming-Normal | 7.033 ± 0.045 | 74.0 ± 6.1 | 0.101 ± 0.013 | 520.549 ± 0.276 |
| | Kaiming-Uniform | 7.042 ± 0.054 | 76.2 ± 9.9 | 0.129 ± 0.011 | 536.354 ± 0.533 |
| | Xavier-Normal | 8.003 ± 0.172 | 74.0 ± 6.1 | 0.110 ± 0.009 | 592.341 ± 1.052 |
| | Xavier-Uniform | 7.041 ± 0.052 | 76.2 ± 9.9 | 0.098 ± 0.010 | 536.285 ± 0.516 |
| | Normal | 7.915 ± 0.127 | 90.8 ± 6.0 | 0.096 ± 0.014 | 718.800 ± 0.771 |
| | Uniform | 7.957 ± 0.161 | 82.0 ± 9.6 | 0.096 ± 0.014 | 652.558 ± 1.555 |
| | Orthogonal | 8.009 ± 0.168 | 74.6 ± 7.3 | 0.099 ± 0.015 | 597.152 ± 1.223 |
| Data-driven: variance-based | LSUV | 7.916 ± 0.137 | 82.8 ± 7.8 | 0.086 ± 0.009 | 655.152 ± 1.072 |
| Data-driven: error-driven (original Laor) | Laor-ori | 8.053 ± 0.163 | 77.0 ± 6.9 | 0.122 ± 0.011 | 620.208 ± 1.115 |
| Data-driven: gradient-informed (Laor variants) | Laor | 6.960 ± 0.053 | 89.2 ± 12.4 | 0.111 ± 0.014 | 620.899 ± 0.657 |
| | Laor-kn | 7.036 ± 0.052 | 75.7 ± 8.1 | 0.120 ± 0.012 | 532.368 ± 0.419 |
| | Laor-ku | 7.017 ± 0.053 | 76.6 ± 7.6 | 0.149 ± 0.015 | 537.645 ± 0.408 |
| | Laor-xn | 8.017 ± 0.135 | 75.7 ± 8.1 | 0.119 ± 0.014 | 606.597 ± 1.093 |
| | Laor-xu | 8.033 ± 0.160 | 76.6 ± 7.6 | 0.123 ± 0.012 | 615.483 ± 1.217 |
| | Laor-n | 6.976 ± 0.064 | 95.7 ± 7.2 | 0.148 ± 0.014 | 667.793 ± 0.462 |
| | Laor-u | 7.948 ± 0.147 | 82.1 ± 6.8 | 0.118 ± 0.013 | 652.262 ± 0.996 |
| | Laor-o | 8.053 ± 0.150 | 76.5 ± 7.6 | 0.137 ± 0.013 | 616.200 ± 1.146 |
Note: Boldface highlights the top three results per metric. Lower values are preferred, as they indicate reduced computational cost. All results are averaged over 20 independent runs.
Table A7. Classification metrics for weight initializers on the MALSTM-FCN model. Metrics include Accuracy, Precision, Recall, F1 Score, and MCC on the test set.
| Group | Weight Initializer | Test Accuracy (%) | Test Precision (%) | Test Recall (%) | Test F1 Score (%) | Test MCC (%) |
|---|---|---|---|---|---|---|
| Traditional (data-agnostic baseline) | Kaiming-Normal | 84.88 ± 2.26 | 84.74 ± 1.64 | 85.34 ± 6.55 | 84.86 ± 3.35 | 70.00 ± 3.89 |
| | Kaiming-Uniform | 85.08 ± 2.10 | 84.60 ± 1.76 | 86.05 ± 6.24 | 85.14 ± 3.11 | 70.39 ± 3.58 |
| | Xavier-Normal | 84.88 ± 2.26 | 84.74 ± 1.64 | 85.34 ± 6.55 | 84.86 ± 3.35 | 70.00 ± 3.89 |
| | Xavier-Uniform | 85.08 ± 2.10 | 84.60 ± 1.76 | 86.05 ± 6.24 | 85.14 ± 3.11 | 70.39 ± 3.58 |
| | Normal | 84.97 ± 1.65 | 87.59 ± 1.70 | 81.69 ± 4.94 | 84.43 ± 2.23 | 70.25 ± 2.99 |
| | Uniform | 85.36 ± 1.85 | 86.66 ± 1.29 | 83.80 ± 5.70 | 85.07 ± 2.59 | 70.95 ± 3.31 |
| | Orthogonal | 84.99 ± 1.84 | 84.72 ± 1.45 | 85.60 ± 5.55 | 85.02 ± 2.67 | 70.17 ± 3.23 |
| Data-driven: variance-based | LSUV | 86.19 ± 1.23 | 86.06 ± 1.61 | 86.59 ± 4.50 | 86.23 ± 1.73 | 72.54 ± 2.29 |
| Data-driven: error-driven (original Laor) | Laor-ori | 85.02 ± 1.47 | 84.62 ± 1.73 | 85.82 ± 4.50 | 85.12 ± 1.98 | 70.19 ± 2.72 |
| Data-driven: gradient-informed (Laor variants) | Laor | 84.60 ± 1.52 | 86.60 ± 2.03 | 82.16 ± 5.66 | 84.16 ± 2.23 | 69.51 ± 2.72 |
| | Laor-kn | 85.01 ± 1.70 | 84.33 ± 1.65 | 86.22 ± 5.06 | 85.15 ± 2.33 | 70.20 ± 3.13 |
| | Laor-ku | 84.76 ± 2.21 | 84.53 ± 1.70 | 85.38 ± 6.61 | 84.76 ± 3.23 | 69.78 ± 3.89 |
| | Laor-xn | 85.01 ± 1.70 | 84.33 ± 1.65 | 86.22 ± 5.06 | 85.15 ± 2.33 | 70.20 ± 3.13 |
| | Laor-xu | 84.76 ± 2.21 | 84.53 ± 1.70 | 85.38 ± 6.61 | 84.76 ± 3.23 | 69.78 ± 3.89 |
| | Laor-n | 84.41 ± 1.80 | 87.34 ± 1.42 | 80.69 ± 5.43 | 83.76 ± 2.47 | 69.19 ± 3.25 |
| | Laor-u | 84.43 ± 2.73 | 87.06 ± 1.53 | 81.14 ± 7.75 | 83.74 ± 3.88 | 69.31 ± 4.76 |
| | Laor-o | 85.12 ± 2.73 | 84.26 ± 1.51 | 86.63 ± 7.72 | 85.18 ± 4.12 | 70.56 ± 4.70 |
Note: Boldface indicates the top three scores for each metric. Higher values reflect better classification performance. Results are averaged over 20 runs for robustness.
Table A8. Convergence and initialization metrics for all initializers on the MALSTM-FCN model. Metrics include time per epoch, number of epochs, initialization duration, and total training time.
| Group | Weight Initializer | Training Time per Epoch (s) | Number of Epochs | Weight Initialization Time (s) | Total Training Time (s) |
|---|---|---|---|---|---|
| Traditional (data-agnostic baseline) | Kaiming-Normal | 7.363 ± 0.095 | 71.7 ± 6.5 | 0.104 ± 0.013 | 528.049 ± 0.623 |
| | Kaiming-Uniform | 7.398 ± 0.081 | 71.8 ± 6.8 | 0.097 ± 0.009 | 530.897 ± 0.555 |
| | Xavier-Normal | 8.318 ± 0.192 | 71.7 ± 6.5 | 0.099 ± 0.017 | 596.533 ± 1.258 |
| | Xavier-Uniform | 7.400 ± 0.082 | 71.8 ± 6.8 | 0.111 ± 0.011 | 531.044 ± 0.557 |
| | Normal | 8.377 ± 0.181 | 89.4 ± 7.8 | 0.096 ± 0.011 | 748.587 ± 1.408 |
| | Uniform | 8.265 ± 0.223 | 79.5 ± 13.7 | 0.095 ± 0.010 | 656.733 ± 3.060 |
| | Orthogonal | 8.316 ± 0.197 | 72.0 ± 6.2 | 0.126 ± 0.011 | 598.442 ± 1.210 |
| Data-driven: variance-based | LSUV | 8.271 ± 0.194 | 75.1 ± 9.5 | 0.104 ± 0.012 | 621.266 ± 1.838 |
| Data-driven: error-driven (original Laor) | Laor-ori | 8.293 ± 0.225 | 72.8 ± 8.0 | 0.134 ± 0.010 | 603.850 ± 1.803 |
| Data-driven: gradient-informed (Laor variants) | Laor | 7.360 ± 0.077 | 89.4 ± 16.5 | 0.120 ± 0.016 | 658.065 ± 1.266 |
| | Laor-kn | 7.360 ± 0.067 | 71.9 ± 7.6 | 0.121 ± 0.011 | 529.312 ± 0.510 |
| | Laor-ku | 7.353 ± 0.076 | 71.8 ± 7.7 | 0.119 ± 0.010 | 528.090 ± 0.581 |
| | Laor-xn | 8.344 ± 0.154 | 71.9 ± 7.6 | 0.149 ± 0.015 | 600.093 ± 1.173 |
| | Laor-xu | 8.370 ± 0.187 | 71.8 ± 7.7 | 0.117 ± 0.014 | 601.082 ± 1.430 |
| | Laor-n | 7.361 ± 0.074 | 92.0 ± 13.9 | 0.120 ± 0.015 | 677.343 ± 1.033 |
| | Laor-u | 8.300 ± 0.197 | 80.9 ± 14.5 | 0.147 ± 0.012 | 671.585 ± 2.866 |
| | Laor-o | 8.387 ± 0.188 | 70.7 ± 7.5 | 0.125 ± 0.007 | 593.112 ± 1.408 |
Note: Boldface highlights the top three results per metric. Lower values are preferred, as they indicate reduced computational cost. All results are averaged over 20 independent runs.
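As a rough consistency check on Table A8, the reported total training time can be compared with the mean time per epoch multiplied by the mean number of epochs, plus the initialization time. This relationship is an assumption made here for illustration (the table does not state how the total was computed), and a product of means need not equal a mean of products, so only approximate agreement should be expected. The three rows below are copied from the table; the function names are hypothetical.

```python
# (time per epoch s, epochs, init time s, reported total s): means from Table A8
ROWS = {
    "Kaiming-Normal": (7.363, 71.7, 0.104, 528.049),
    "Normal": (8.377, 89.4, 0.096, 748.587),
    "Laor": (7.360, 89.4, 0.120, 658.065),
}


def estimated_total(epoch_s, epochs, init_s):
    """Assumed relationship: total = epochs * time per epoch + init time."""
    return epoch_s * epochs + init_s


def relative_gap(name):
    """Relative difference between the estimate and the reported total."""
    epoch_s, epochs, init_s, reported = ROWS[name]
    return abs(estimated_total(epoch_s, epochs, init_s) - reported) / reported


for name in ROWS:
    print(f"{name}: relative gap {relative_gap(name):.4%}")
```

For these rows the estimate and the reported total differ by well under 1%, which suggests the differences in total time are driven mainly by the number of epochs and the per-epoch cost rather than by initialization overhead.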
Table A9. Classification metrics for weight initializers on the MMAGRU-FCN model. Metrics include Accuracy, Precision, Recall, F1 Score, and MCC on the test set.
| Group | Weight Initializer | Test Accuracy (%) | Test Precision (%) | Test Recall (%) | Test F1 Score (%) | Test MCC (%) |
|---|---|---|---|---|---|---|
| Traditional (Data-agnostic baseline) | Kaiming-Normal | 85.81 ± 0.41 | 86.57 ± 0.63 | 84.84 ± 1.46 | 85.68 ± 0.54 | 71.66 ± 0.81 |
| | Kaiming-Uniform | 86.07 ± 0.46 | 86.46 ± 0.60 | 85.58 ± 1.14 | 86.01 ± 0.53 | 72.15 ± 0.90 |
| | Xavier-Normal | 85.81 ± 0.41 | 86.57 ± 0.63 | 84.84 ± 1.46 | 85.68 ± 0.54 | 71.66 ± 0.81 |
| | Xavier-Uniform | 86.07 ± 0.46 | 86.46 ± 0.60 | 85.58 ± 1.14 | 86.01 ± 0.53 | 72.15 ± 0.90 |
| | Normal | 86.19 ± 0.43 | 86.84 ± 0.52 | 85.35 ± 1.30 | 86.08 ± 0.54 | 72.39 ± 0.84 |
| | Uniform | 85.93 ± 0.53 | 86.82 ± 0.55 | 84.77 ± 1.43 | 85.77 ± 0.65 | 71.89 ± 1.03 |
| | Orthogonal | 86.44 ± 0.33 | 86.73 ± 0.52 | 86.10 ± 1.16 | 86.40 ± 0.43 | 72.89 ± 0.65 |
| Data-driven method: Variance-based | LSUV | 86.86 ± 0.35 | 86.93 ± 0.40 | 86.82 ± 1.10 | 86.87 ± 0.43 | 73.74 ± 0.69 |
| Data-driven method: Error-driven (original Laor) | Laor-ori | 86.05 ± 0.43 | 86.44 ± 0.55 | 85.57 ± 1.23 | 86.00 ± 0.52 | 72.12 ± 0.84 |
| Data-driven method: Gradient-informed (Laor variants) | Laor | 86.24 ± 0.49 | 86.80 ± 0.59 | 85.53 ± 1.40 | 86.15 ± 0.60 | 72.50 ± 0.96 |
| | Laor-kn | 86.04 ± 0.47 | 86.52 ± 0.64 | 85.42 ± 1.47 | 85.96 ± 0.59 | 72.09 ± 0.92 |
| | Laor-ku | 85.85 ± 0.55 | 86.63 ± 0.69 | 84.84 ± 1.82 | 85.71 ± 0.73 | 71.74 ± 1.04 |
| | Laor-xn | 86.04 ± 0.47 | 86.52 ± 0.64 | 85.42 ± 1.47 | 85.96 ± 0.59 | 72.09 ± 0.92 |
| | Laor-xu | 85.85 ± 0.55 | 86.63 ± 0.69 | 84.84 ± 1.82 | 85.71 ± 0.73 | 71.74 ± 1.04 |
| | Laor-n | 86.40 ± 0.44 | 87.08 ± 0.57 | 85.53 ± 1.19 | 86.29 ± 0.52 | 72.83 ± 0.86 |
| | Laor-u | 85.72 ± 0.65 | 86.92 ± 0.54 | 84.14 ± 1.86 | 85.49 ± 0.84 | 71.49 ± 1.23 |
| | Laor-o | 86.43 ± 0.33 | 86.78 ± 0.48 | 86.00 ± 1.08 | 86.38 ± 0.41 | 72.87 ± 0.65 |
Note: Boldface indicates the top three scores for each metric. Higher values reflect better classification performance. Results are averaged over 20 runs for robustness.
Table A10. Convergence and initialization metrics for all initializers on the MMAGRU-FCN model. Metrics include time per epoch, number of epochs, initialization duration, and total training time.
| Group | Weight Initializer | Training Time per Epoch (s) | Number of Epochs | Weight Initial Time (s) | Total Training Time (s) |
|---|---|---|---|---|---|
| Traditional (Data-agnostic baseline) | Kaiming-Normal | 14.691 ± 0.113 | 74.0 ± 6.4 | 0.095 ± 0.012 | 1087.222 ± 0.724 |
| | Kaiming-Uniform | 14.681 ± 0.147 | 74.4 ± 6.1 | 0.090 ± 0.013 | 1091.627 ± 0.896 |
| | Xavier-Normal | 16.319 ± 0.197 | 74.0 ± 6.4 | 0.082 ± 0.008 | 1207.720 ± 1.260 |
| | Xavier-Uniform | 14.695 ± 0.147 | 74.4 ± 6.1 | 0.084 ± 0.010 | 1092.692 ± 0.895 |
| | Normal | 16.224 ± 0.200 | 69.2 ± 4.5 | 0.127 ± 0.016 | 1122.856 ± 0.893 |
| | Uniform | 16.281 ± 0.203 | 70.7 ± 6.1 | 0.084 ± 0.010 | 1150.324 ± 1.245 |
| | Orthogonal | 16.352 ± 0.224 | 73.2 ± 5.3 | 0.079 ± 0.008 | 1196.220 ± 1.195 |
| Data-driven method: Variance-based | LSUV | 16.262 ± 0.210 | 73.9 ± 6.3 | 0.064 ± 0.013 | 1201.796 ± 1.316 |
| Data-driven method: Error-driven (original Laor) | Laor-ori | 16.312 ± 0.221 | 72.4 ± 6.6 | 0.100 ± 0.009 | 1181.097 ± 1.447 |
| Data-driven method: Gradient-informed (Laor variants) | Laor | 14.564 ± 0.140 | 72.7 ± 4.9 | 0.102 ± 0.015 | 1058.180 ± 0.693 |
| | Laor-kn | 14.712 ± 0.118 | 74.5 ± 5.6 | 0.127 ± 0.012 | 1095.417 ± 0.665 |
| | Laor-ku | 14.684 ± 0.137 | 72.1 ± 5.4 | 0.103 ± 0.012 | 1058.094 ± 0.742 |
| | Laor-xn | 16.359 ± 0.226 | 74.5 ± 5.6 | 0.108 ± 0.011 | 1218.062 ± 1.271 |
| | Laor-xu | 16.363 ± 0.214 | 72.1 ± 5.4 | 0.124 ± 0.008 | 1179.093 ± 1.164 |
| | Laor-n | 14.629 ± 0.146 | 71.1 ± 5.3 | 0.105 ± 0.013 | 1039.461 ± 0.766 |
| | Laor-u | 16.302 ± 0.231 | 70.1 ± 5.0 | 0.102 ± 0.008 | 1142.087 ± 1.147 |
| | Laor-o | 16.379 ± 0.235 | 72.6 ± 6.5 | 0.135 ± 0.009 | 1188.419 ± 1.537 |
Note: Boldface highlights the top three results per metric. Lower values are preferred, as they indicate reduced computational cost. All results are averaged over 20 independent runs.

References

  1. Ensafi, Y.; Amin, S.H.; Zhang, G.; Shah, B. Time-Series Forecasting of Seasonal Items Sales Using Machine Learning—A Comparative Analysis. Int. J. Inf. Manag. Data Insights 2022, 2, 100058.
  2. Luiz, L.E.; Fialho, G.; Teixeira, J.P. Is Football Unpredictable? Predicting Matches Using Neural Networks. Forecasting 2024, 6, 1152–1168.
  3. Wei, L.; Yu, Z.; Jin, Z.; Xie, L.; Huang, J.; Cai, D.; He, X.; Hua, X.S. Dual Graph for Traffic Forecasting. IEEE Access 2024, 13, 122285–122293.
  4. Skorski, M.; Temperoni, A.; Theobald, M. Revisiting Weight Initialization of Deep Neural Networks. In Proceedings of the 13th Asian Conference on Machine Learning, Virtual, 17–19 November 2021; Balasubramanian, V.N., Tsang, I., Eds.; PMLR: Cambridge, MA, USA, 2021; Volume 157, pp. 1192–1207.
  5. Glorot, X.; Bengio, Y. Understanding the Difficulty of Training Deep Feedforward Neural Networks. J. Mach. Learn. Res. 2010, 9, 249–256.
  6. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016.
  7. Saxe, A.M.; McClelland, J.L.; Ganguli, S. Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks. In Proceedings of the 2nd International Conference on Learning Representations, ICLR 2014—Conference Track Proceedings, Banff, AB, Canada, 14–16 April 2014.
  8. Narkhede, M.V.; Bartakke, P.P.; Sutaone, M.S. A Review on Weight Initialization Strategies for Neural Networks. Artif. Intell. Rev. 2022, 55, 291–322.
  9. Mishkin, D.; Matas, J. All You Need Is a Good Init. In Proceedings of the 4th International Conference on Learning Representations, ICLR 2016—Conference Track Proceedings, San Juan, Puerto Rico, 2–4 May 2016.
  10. Boongasame, L.; Muangprathub, J.; Thammarak, K. Laor Initialization: A New Weight Initialization Method for the Backpropagation of Deep Learning. Big Data Cogn. Comput. 2025, 9, 181.
  11. Securities Met Market Surveillance Criteria. Available online: https://www.set.or.th/en/market/news-and-alert/surveillance-c-sign-temporary-trading/market-surveillance-measure-list (accessed on 5 July 2025).
  12. SEC Fines IFEC Execs for Insider Trades. Available online: https://www.bangkokpost.com/business/general/1527334/sec-fines-ifec-execs-for-insider-trades (accessed on 18 September 2024).
  13. Court Accepts Class-Action Suit Against Stark Auditors. Available online: https://www.bangkokpost.com/business/general/2839118/court-accepts-class-action-suit-against-stark-auditors (accessed on 18 September 2024).
  14. JKN Founder Admits to “Forced Selling” of Shares. Available online: https://www.bangkokpost.com/business/general/2643514/jkn-founder-admits-to-forced-selling-of-shares (accessed on 18 September 2024).
  15. An Expensive Lesson. Available online: https://www.bangkokpost.com/business/general/1872004/an-expensive-lesson (accessed on 18 September 2024).
  16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017.
  17. Foumani, N.M.; Tan, C.W.; Webb, G.I.; Salehi, M. Improving Position Encoding of Transformers for Multivariate Time Series Classification. Data Min. Knowl. Discov. 2024, 38, 22–48.
  18. Karim, F.; Majumdar, S.; Darabi, H.; Harford, S. Multivariate LSTM-FCNs for Time Series Classification. Neural Netw. 2019, 116, 237–245.
  19. Yuan, J.; Wu, F.; Wu, H. Multivariate Time-Series Classification Using Memory and Attention for Long and Short-Term Dependence. Appl. Intell. 2023, 53, 29677–29692.
  20. Box, G.E.P.; Jenkins, G.M. Time Series Analysis: Forecasting and Control; Holden-Day: San Francisco, CA, USA, 1976.
  21. Bollerslev, T. Generalized Autoregressive Conditional Heteroskedasticity. J. Econom. 1986, 31, 307–327.
  22. Cortes, C.; Vapnik, V. Support-Vector Networks. Mach. Learn. 1995, 20, 273–297.
  23. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
  24. Rumelhart, D.E.; Hinton, G.E.; Williams, R.J. Learning Representations by Back-Propagating Errors. Nature 1986, 323, 533–536.
  25. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
  26. Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Gated Recurrent Units. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); Alessandro Moschitti, Q.C.R.I., Bo Pang, G., Walter Daelemans, U.A., Eds.; Association for Computational Linguistics: Washington, CO, USA, 2014; pp. 1724–1734.
  27. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE 1998, 86, 2278–2324.
  28. Zivot, E.; Wang, J. Modeling Financial Time Series with S-PLUS®; Springer: Berlin/Heidelberg, Germany, 2006.
  29. He, H.; Garcia, E.A. Learning from Imbalanced Data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
  30. Chauhan, A.; Prasad, A.; Gupta, P.; Prashanth Reddy, A.; Kumar Saini, S. Time Series Forecasting for Cold-Start Items by Learning from Related Items Using Memory Networks. In Proceedings of the Web Conference 2020—Companion of the World Wide Web Conference, WWW 2020, Taipei, China, 20–24 April 2020.
  31. Petchpol, K.; Boongasame, L. Enhancing Predictive Capabilities for Identifying At-Risk Stocks Using Multivariate Time-Series Classification: A Case Study of the Thai Stock Market. Appl. Comput. Intell. Soft Comput. 2025, 2025, 3874667.
  32. Zivot, E.; Wang, J. Rolling Analysis of Time Series. In Modeling Financial Time Series with S-Plus®; Springer: New York, NY, USA, 2003.
  33. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority over-Sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
  34. Hyndman, R.J.; Athanasopoulos, G. Forecasting: Principles and Practice, 3rd ed.; OTexts: Melbourne, Australia, 2015.
  35. Nguyen, D.; Widrow, B. Improving the Learning Speed of 2-Layer Neural Networks by Choosing Initial Values of the Adaptive Weights. In Proceedings of the IJCNN International Joint Conference on Neural Networks, San Diego, CA, USA, 17–21 June 1990.
  36. Sreekumar, G.; Martin, J.P.; Raghavan, S.; Joseph, C.T.; Raja, S.P. Transformer-Based Forecasting for Sustainable Energy Consumption Toward Improving Socioeconomic Living: AI-Enabled Energy Consumption Forecasting. IEEE Syst. Man Cybern. Mag. 2024, 10, 52–60.
  37. La Gatta, V.; Moscato, V.; Postiglione, M.; Sperlí, G. An Epidemiological Neural Network Exploiting Dynamic Graph Structured Data Applied to the COVID-19 Outbreak. IEEE Trans. Big Data 2021, 7, 45–55.
  38. Yuan, Q.; Shen, H.; Li, T.; Li, Z.; Li, S.; Jiang, Y.; Xu, H.; Tan, W.; Yang, Q.; Wang, J.; et al. Deep Learning in Environmental Remote Sensing: Achievements and Challenges. Remote Sens. Environ. 2020, 241, 111716.
  39. SETSMART. Available online: https://www.set.or.th/en/services/connectivity-and-data/data/web-based (accessed on 7 March 2025).
Figure 1. Proxy-assisted initialization framework. A proxy model evaluates candidate initializers for the numerical input layer; the selected configuration is then applied to the target model. Categorical embeddings and all internal layers retain standard initialization strategies.
Figure 2. Trade-off between classification performance (MCC) and total training time for all initialization strategies on the Transformer model. Top three initializers in terms of MCC and convergence efficiency are marked, illustrating the balance between accuracy and computational cost.
Figure 3. Trade-off between classification performance (MCC) and total training time for all initialization strategies on the ConvTran model. Top three initializers in terms of MCC and convergence efficiency are marked, illustrating the balance between accuracy and computational cost.
Figure 4. Trade-off between classification performance (MCC) and total training time for all initialization strategies on the MLSTM-FCN model. Top three initializers in terms of MCC and convergence efficiency are marked, illustrating the balance between accuracy and computational cost.
Figure 5. Trade-off between classification performance (MCC) and total training time for all initialization strategies on the MALSTM-FCN model. Top three initializers in terms of MCC and convergence efficiency are marked, illustrating the balance between accuracy and computational cost.
Figure 6. Trade-off between classification performance (MCC) and total training time for all initialization strategies on the MMAGRU-FCN model. Top three initializers in terms of MCC and convergence efficiency are marked, illustrating the balance between accuracy and computational cost.
Figure 7. Test MCC (%) versus total training time (seconds) across five deep learning architectures: (a) Transformer, (b) ConvTran, (c) MLSTM-FCN, (d) MALSTM-FCN, and (e) MMAGRU-FCN. Each point represents a specific weight initialization strategy. Red circles indicate the top three most efficient configurations (in terms of time and MCC), gold stars denote the top three highest MCC values, and green triangles mark the top three fastest initializers. LSUV and gradient-informed Laor variants (e.g., Laor-n, Laor-ku) consistently rank among the top performers in complex models, while traditional schemes such as Kaiming-Normal and Orthogonal remain competitive in recurrent architectures. Results highlight trade-offs between convergence efficiency and predictive performance across architectures and initialization strategies.
Figure 8. Cost–benefit analysis of weight initialization strategies aggregated across five forecasting architectures. Top: Weight initialization time ranges from 0.064 s to 0.176 s, always less than 0.04% of total training time. Middle: Total training time ranges from 520 s to 1315 s, with several gradient-informed variants (e.g., Laor-n, Laor-ku) among the fastest. Bottom: Test MCC scores range from 68.15% to 73.74%, showing efficient initializers do not sacrifice predictive accuracy. Box colors indicate top three performers for initialization overhead (blue), training time (green), and MCC (orange), demonstrating that high computational efficiency and high accuracy are simultaneously attainable.
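The overhead claim in the Figure 8 caption can be checked with a conservative bound: pair the largest reported initialization time with the shortest reported total training time. This pairing is an assumption chosen to give a worst case and may not correspond to any single configuration; the variable names are illustrative.

```python
# Extremes quoted in the Figure 8 caption
max_init_s = 0.176    # largest weight initialization time (s)
min_total_s = 520.0   # shortest total training time (s)

# Worst-case share of training time spent on initialization, in percent
worst_case_pct = 100.0 * max_init_s / min_total_s
print(f"worst-case initialization overhead: {worst_case_pct:.4f}%")
```

Even under this pessimistic pairing the overhead is about 0.034%, which stays below the 0.04% stated in the caption.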
Table 1. Initialization policy by layer type and functional role across all architectures.
| Layer Type | Typical Usage | Initialization Method | Rationale |
|---|---|---|---|
| Numerical input layer | Input embedding for continuous features | Gradient-Informed (Proposed) | Sensitive to input scale; proxy-based optimization-aware selection |
| Category embedding layer | Discrete feature embedding (e.g., symbols) | Xavier | Symmetric activation; standard for embedding weights |
| Linear layers | Attention, classifiers, residual paths | Xavier or Kaiming (depending on activation) | Preserves variance across layers |
| ReLU-activated convolutional layers | Temporal/spatial convolutions | Kaiming | Designed for ReLU; prevents vanishing gradients |
| GRU/LSTM input weights | Input-to-hidden recurrence | Xavier | Balanced signal propagation |
| GRU/LSTM recurrent weights | Hidden-to-hidden recurrence | Orthogonal | Preserves long-term dependencies across time steps |
| Bias terms | All layers with bias | Zeros | Ensures no unintended initial activation bias |
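The policy in Table 1 can be read as a lookup from layer role to initializer family. The sketch below is a minimal illustration in plain Python, not the authors' code: the role labels and function names are hypothetical, and the rule that ReLU-activated linear layers take Kaiming while other linear layers take Xavier is an assumption about what "depending on activation" means. The two helper formulas are the standard Xavier-Normal and Kaiming-Normal (fan-in, ReLU gain) standard deviations.

```python
import math

# Layer roles and initializer families as listed in Table 1.
# The role keys are hypothetical labels chosen for this sketch.
POLICY = {
    "numerical_input": "gradient-informed",  # proposed, proxy-selected
    "category_embedding": "xavier",
    "conv_relu": "kaiming",
    "rnn_input": "xavier",
    "rnn_recurrent": "orthogonal",
    "bias": "zeros",
}


def pick_initializer(role, activation="none"):
    """Return the initializer family for a layer role."""
    if role == "linear":
        # Assumption: Kaiming for ReLU-activated linear layers, Xavier otherwise.
        return "kaiming" if activation == "relu" else "xavier"
    try:
        return POLICY[role]
    except KeyError:
        raise ValueError(f"unknown layer role: {role!r}") from None


def xavier_normal_std(fan_in, fan_out):
    """Standard deviation of Xavier-Normal: sqrt(2 / (fan_in + fan_out))."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def kaiming_normal_std(fan_in):
    """Standard deviation of Kaiming-Normal with ReLU gain, fan-in mode."""
    return math.sqrt(2.0 / fan_in)


print(pick_initializer("rnn_recurrent"))
print(pick_initializer("linear", activation="relu"))
print(xavier_normal_std(64, 64), kaiming_normal_std(32))
```

Only the `numerical_input` role departs from standard practice, which is why the proxy-based selection adds little overhead to the rest of the network.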
Table 2. Forecasting architectures used in the evaluation and their structural characteristics.
| Model | Description |
|---|---|
| Transformer | Attention-only architecture [16] |
| ConvTran | CNN-Transformer hybrid [17] |
| MLSTM-FCN | Multivariate LSTM with fully convolutional classifier [18] |
| MALSTM-FCN | Attention-augmented MLSTM-FCN |
| MMAGRU-FCN | Memory–attention–GRU hybrid with CNN feature extractor [19] |
Note: Architectures and descriptions are based on prior configurations from Petchpol and Boongasame (2025) without modification [31].
Table 3. Model-specific hyperparameter configurations.
Model | Learning Rate | Dropout | Weight Decay | Sequence Length | Hidden Dim | FF Dim | Heads | Emb. Dim | Layers | CNN Dim
Transformer0.00010.50.012832241
ConvTran0.00010.70.0015642418
MLSTM-FCN0.00010.80.001143218
MALSTM-FCN0.00010.80.011432116
MMAGRU-FCN0.000010.50.001764364
Note: Hyperparameter values are retained from Petchpol and Boongasame (2025) without modification [31]. Definitions—FF Dim: hidden dimension of the feed-forward network in Transformer/ConvTran blocks; Emb. Dim: embedding dimension of input representations; CNN Dim: number of convolutional filters in convolutional modules.
Table 4. Comparative summary of weight initializer candidates. Baseline methods (Xavier, Kaiming, Normal, Uniform, Orthogonal, LSUV, Laor-ori) represent established approaches from the literature, while the gradient-informed Laor variants (Laor, Laor-n, Laor-u, Laor-kn, Laor-ku, Laor-xn, Laor-xu, Laor-o) represent the novel contribution of this study.
| Group | Initializers | Description |
|---|---|---|
| Data-agnostic baseline | Xavier-Normal/Uniform, Kaiming-Normal/Uniform, Normal, Uniform, Orthogonal | Commonly used baselines relying only on fan-in/out and variance scaling without data or label information |
| Data-driven method: Variance-based | LSUV | Initialization based on layer-sequential unit variance using activation statistics |
| Data-driven method: Error-driven (original Laor) | Laor-ori | Reimplementation of the original Laor method using k-means clustering on forward-pass loss values |
| Data-driven method: Gradient-informed (Laor variants) | Laor, Laor-n, Laor-u, Laor-kn, Laor-ku, Laor-xn, Laor-xu, Laor-o | Variants of Laor Initialization using gradient norms during clustering, evaluated through a proxy layer and fixed across training |
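To make the gradient-informed row concrete, the following sketch clusters candidate input-layer weights by the norm of their backward gradient on a small proxy. It is an illustration under stated assumptions rather than the authors' procedure: the proxy (a single linear unit with mean squared error), the toy data, the number of clusters, and the selection rule (the candidate closest to the centre of the most populated cluster) are all hypothetical choices, since the table only says that gradient norms are clustered and evaluated through a proxy layer.

```python
import math
import random


def grad_norm(w, xs, ys):
    """L2 norm of the MSE gradient for a linear proxy y_hat = w . x."""
    n = len(xs)
    grad = [0.0] * len(w)
    for x, y in zip(xs, ys):
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += 2.0 * err * xj / n
    return math.sqrt(sum(g * g for g in grad))


def assign(values, centroids):
    """Index of the nearest centroid for each scalar value."""
    return [min(range(len(centroids)), key=lambda c: abs(v - centroids[c]))
            for v in values]


def kmeans_1d(values, k, iters=50):
    """Plain Lloyd's algorithm on scalars; returns (centroids, labels)."""
    srt = sorted(values)
    # Deterministic start: centroids spread evenly over the sorted values.
    centroids = [srt[(2 * i + 1) * len(srt) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        labels = assign(values, centroids)
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            updated.append(sum(members) / len(members) if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(values, centroids)


def select_candidate(candidates, xs, ys, k=3):
    """Assumed rule: pick the candidate nearest the largest cluster's centre."""
    norms = [grad_norm(w, xs, ys) for w in candidates]
    centroids, labels = kmeans_1d(norms, k)
    biggest = max(range(k), key=labels.count)
    members = [i for i, lab in enumerate(labels) if lab == biggest]
    return min(members, key=lambda i: abs(norms[i] - centroids[biggest]))


rng = random.Random(0)
xs = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(32)]   # toy inputs
ys = [rng.gauss(0, 1) for _ in range(32)]                       # toy targets
candidates = [[rng.gauss(0, s) for _ in range(4)]
              for s in (0.01, 0.1, 1.0) for _ in range(4)]
best = select_candidate(candidates, xs, ys)
print("selected candidate index:", best)
```

In the framework described above, the selected weights would be applied only to the numerical input layer of the target model and kept fixed as its starting point, while all other layers use their standard schemes.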
Table 5. Performance and convergence comparison of weight initialization strategies on the Transformer model. Baseline methods (Xavier, Kaiming, Normal, Uniform, Orthogonal, LSUV, Laor-ori) represent established approaches from the literature, while the gradient-informed Laor variants (Laor, Laor-n, Laor-u, Laor-kn, Laor-ku, Laor-xn, Laor-xu, Laor-o) represent the novel contribution of this study.
| Group | Weight Initializer | Test MCC (%) | Weight Initial Time (s) | Total Training Time (s) |
|---|---|---|---|---|
| Traditional (Data-agnostic baseline) | Kaiming-Normal | 68.82 ± 1.09 | 0.101 ± 0.012 | 592.965 ± 0.250 |
| | Kaiming-Uniform | 68.69 ± 1.39 | 0.115 ± 0.012 | 606.364 ± 0.312 |
| | Xavier-Normal | 68.82 ± 1.09 | 0.097 ± 0.016 | 751.601 ± 0.990 |
| | Xavier-Uniform | 68.69 ± 1.39 | 0.098 ± 0.013 | 606.819 ± 0.313 |
| | Normal | 70.51 ± 1.02 | 0.080 ± 0.016 | 776.008 ± 1.158 |
| | Uniform | 68.77 ± 0.94 | 0.105 ± 0.013 | 741.499 ± 1.240 |
| | Orthogonal | 68.75 ± 1.08 | 0.080 ± 0.013 | 779.953 ± 1.305 |
| Data-driven method: Variance-based | LSUV | 70.47 ± 0.80 | 0.134 ± 0.014 | 805.164 ± 1.130 |
| Data-driven method: Error-driven (original Laor) | Laor-ori | 68.53 ± 1.07 | 0.120 ± 0.005 | 748.190 ± 0.994 |
| Data-driven method: Gradient-informed (Laor variants) | Laor | 69.54 ± 1.31 | 0.121 ± 0.007 | 592.715 ± 0.624 |
| | Laor-kn | 68.27 ± 1.16 | 0.124 ± 0.010 | 611.257 ± 0.218 |
| | Laor-ku | 68.15 ± 1.26 | 0.118 ± 0.014 | 602.799 ± 0.144 |
| | Laor-xn | 68.27 ± 1.16 | 0.120 ± 0.014 | 776.073 ± 1.010 |
| | Laor-xu | 68.15 ± 1.26 | 0.134 ± 0.012 | 761.259 ± 0.897 |
| | Laor-n | 70.16 ± 0.90 | 0.121 ± 0.016 | 590.652 ± 0.353 |
| | Laor-u | 68.66 ± 0.91 | 0.123 ± 0.013 | 749.999 ± 1.095 |
| | Laor-o | 68.79 ± 0.93 | 0.150 ± 0.014 | 774.926 ± 1.182 |
Note: Boldface highlights the top three scores for each metric across all initializers. Lower values are preferred for time-based metrics (initialization and total training time), whereas higher values reflect better classification performance (MCC). Additional metrics are reported in Appendix A (Table A1 and Table A2).
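The top-three ranking referred to in the note can be recomputed from the reported means. The snippet below is a small sketch using the mean test MCC and mean total training time copied from Table 5; it ranks on means only and ignores the standard deviations, and the helper name is hypothetical.

```python
# (mean test MCC %, mean total training time s) copied from Table 5
TABLE5 = {
    "Kaiming-Normal": (68.82, 592.965),
    "Kaiming-Uniform": (68.69, 606.364),
    "Xavier-Normal": (68.82, 751.601),
    "Xavier-Uniform": (68.69, 606.819),
    "Normal": (70.51, 776.008),
    "Uniform": (68.77, 741.499),
    "Orthogonal": (68.75, 779.953),
    "LSUV": (70.47, 805.164),
    "Laor-ori": (68.53, 748.190),
    "Laor": (69.54, 592.715),
    "Laor-kn": (68.27, 611.257),
    "Laor-ku": (68.15, 602.799),
    "Laor-xn": (68.27, 776.073),
    "Laor-xu": (68.15, 761.259),
    "Laor-n": (70.16, 590.652),
    "Laor-u": (68.66, 749.999),
    "Laor-o": (68.79, 774.926),
}


def top_three(metric_index, higher_is_better):
    """Names of the three best initializers for one metric column."""
    ranked = sorted(TABLE5, key=lambda name: TABLE5[name][metric_index],
                    reverse=higher_is_better)
    return ranked[:3]


best_mcc = top_three(0, higher_is_better=True)
fastest = top_three(1, higher_is_better=False)
print("Top MCC:", best_mcc)
print("Fastest:", fastest)
```

On these means, Normal, LSUV, and Laor-n lead on MCC, while Laor-n, Laor, and Kaiming-Normal have the shortest total training time, so Laor-n is the only initializer in both top-three lists for the Transformer model.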
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Petchpol, K.; Boongasame, L. An Extension of Laor Weight Initialization for Deep Time-Series Forecasting: Evidence from Thai Equity Risk Prediction. Forecasting 2025, 7, 47. https://doi.org/10.3390/forecast7030047
