An Extension of Laor Weight Initialization for Deep Time-Series Forecasting: Evidence from Thai Equity Risk Prediction

Petchpol, Katsamapol; Boongasame, Laor

doi:10.3390/forecast7030047


by Katsamapol Petchpol ¹,* and Laor Boongasame ²,³

¹ Department of Computer Science, Faculty of Science, King Mongkut’s Institute of Technology Ladkrabang, Bangkok 10520, Thailand
² Department of Mathematics, Faculty of Science, King Mongkut’s Institute of Technology Ladkrabang, Bangkok 10520, Thailand
³ Business Innovation and Investment Laboratory: B2I-Lab, Faculty of Science, King Mongkut’s Institute of Technology Ladkrabang, Bangkok 10520, Thailand
* Author to whom correspondence should be addressed.

Forecasting 2025, 7(3), 47; https://doi.org/10.3390/forecast7030047

Submission received: 16 July 2025 / Revised: 27 August 2025 / Accepted: 1 September 2025 / Published: 2 September 2025


Abstract

This study presents a gradient-informed proxy initialization framework designed to improve training efficiency and predictive performance in deep learning models for time-series forecasting. The method extends the Laor Initialization approach by introducing backward gradient norm clustering as a selection criterion for input-layer weights, evaluated through a lightweight, architecture-agnostic proxy model. Only the numerical input layer adopts the selected initialization, while internal components retain standard schemes such as Xavier, Kaiming, or Orthogonal, maintaining compatibility and reducing overhead. The framework is evaluated on a real-world financial forecasting task: identifying high-risk equities from the Thai Market Surveillance Measure List, a domain characterized by label imbalance, non-stationarity, and limited data volume. Experiments across five architectures, including Transformer, ConvTran, and MMAGRU-FCN, show that the proposed strategy improves convergence speed and classification accuracy, particularly in deeper and hybrid models. Results in recurrent-based models are competitive but less pronounced. These findings support the method’s practical utility and generalizability for forecasting tasks under real-world constraints.

Keywords: deep learning; financial risk prediction; time-series forecasting; Thai stock market; weight initialization

1. Introduction

Deep learning methods have achieved substantial success in time-series forecasting tasks across various domains, including retail sales [1], sports outcomes [2], and urban traffic dynamics [3], due to their capacity to model nonlinear dependencies and multiscale temporal patterns. Despite their capabilities, deep neural networks remain sensitive to initial weight configurations, which can significantly influence optimization behavior, predictive stability, and convergence rates, particularly in deep or resource-constrained forecasting pipelines [4].

Widely adopted initialization schemes such as Xavier [5], Kaiming [6], and Orthogonal initialization [7] are valued for their variance-preserving properties but remain inherently data-agnostic. In non-stationary or class-imbalanced settings, these approaches often require prolonged optimization phases to achieve task-relevant representations [8]. Data-driven alternatives, such as Layer-Sequential Unit Variance (LSUV) [9] and Laor Initialization [10], introduce empirical feedback to guide weight selection, often resulting in improved early convergence. However, these methods typically increase computational complexity and are sensitive to model-specific engineering constraints. This limits their scalability across diverse forecasting architectures and operational settings.

Data-driven methods such as LSUV and Laor therefore trade improved convergence behavior for a higher computational burden and reduced flexibility across architectures. Yet in real-world financial forecasting scenarios, particularly those involving regulatory monitoring, even modest gains in training efficiency or classification accuracy can be consequential. This is especially true in emerging markets like Thailand, where timely detection of high-risk securities directly affects investor protection and market integrity [11,12,13,14,15].

To address these limitations, this study proposes a selective, proxy-assisted initialization framework that extends Laor Initialization using backward gradient norm clustering. The approach applies custom initialization only at the numerical input layer—where parameter sensitivity is highest—while retaining conventional methods for internal layers. A lightweight linear proxy model is used to evaluate candidate weight sets, enabling gradient-informed selection without modifying the forecasting architecture.

The method is empirically evaluated on five deep learning models, Transformer [16], ConvTran [17], MLSTM-FCN [18], MALSTM-FCN, and MMAGRU-FCN [19], using a real-world forecasting task: identifying at-risk stocks in the Stock Exchange of Thailand (SET) based on the Market Surveillance Measure List (MSML). This task exemplifies structural challenges such as class imbalance, temporal drift, and high retraining frequency, making it a representative benchmark for evaluating initialization methods in practical forecasting pipelines.

The remainder of this paper is organized as follows. Section 2 reviews forecasting models, conventional weight initialization schemes, and data-driven approaches, highlighting their limitations in real-world pipelines. Section 3 presents the proposed proxy-assisted, gradient-informed initialization framework, together with the evaluation process and experimental setup. Section 4 reports empirical results across five deep learning architectures, analyzing convergence, efficiency, and computational overhead. Section 5 provides a combined discussion and conclusion, outlining key findings, practical implications, and directions for future research.

2. Background and Related Works

2.1. Forecasting Models and Their Challenges

Forecasting problems have traditionally been addressed with statistical models such as ARIMA [20] and GARCH [21], which provide interpretable predictions but struggle with nonlinear and multivariate dynamics. Classical machine learning models such as Support Vector Machines [22] and Random Forests [23] reduce these limits but require heavy feature engineering and cannot easily model sequential dependencies.

Recent advances in deep learning have greatly improved time-series forecasting. Foundational models such as Recurrent Neural Networks (RNNs) [24], Long Short-Term Memory (LSTM) networks [25], and Gated Recurrent Units (GRUs) [26] capture sequential dependencies directly from raw or minimally processed data. Convolutional Neural Networks (CNNs) [27] can also extract local temporal features. More recently, Transformer-based architectures [16] enable parallel processing and the modeling of long-range dependencies through attention mechanisms.

While RNNs and their gated extensions laid the foundation for sequence modeling, Transformers improved scalability and expressiveness. Hybrid architectures (e.g., MLSTM-FCN, ConvTran) combine convolutional, recurrent, or attention components to balance local feature extraction with long-term dependency learning. Despite their effectiveness, performance remains sensitive to depth, regularization, and weight initialization.

This study does not aim to solve domain challenges such as non-stationarity [28], class imbalance [29], or cold start [30]. Instead, it builds on the validated forecasting pipeline proposed by Petchpol and Boongasame (2025) [31], which addresses these through principled design: preventing data leakage with strict sequence inputs [20]; mitigating non-stationarity via rolling window [32]; handling cold starts with implicit transfer learning [31]; and correcting class imbalance with SMOTE [33].

The pipeline benchmarks classical baselines (Random Walk [34], Random Forest, LSTM, GRU), which proved insufficient under real-world constraints. Final evaluation therefore centers on five deep learning architectures—Transformer, ConvTran, MLSTM-FCN, MALSTM-FCN, and MMAGRU-FCN—demonstrating superior robustness and predictive performance. This diverse set of architectures establishes a reliable foundation for isolating the impact of weight initialization, a sensitive hyperparameter that strongly influences convergence speed, training stability, and forecasting accuracy. The following sections examine its role and review existing strategies in support of a more effective initialization approach.

2.2. Conventional Weight Initialization Methods

Weight initialization plays a foundational role in the training dynamics of deep neural networks [35]. Conventional strategies such as Xavier initialization, Kaiming initialization, and Orthogonal initialization are widely adopted for their theoretical soundness and empirical effectiveness across a range of architectures [8]. Xavier initialization preserves variance across layers under symmetric activation functions such as tanh and sigmoid [5]. Kaiming initialization adjusts the variance for rectified linear units (ReLU), mitigating vanishing gradients in deeper networks [6]. Orthogonal initialization, particularly useful in recurrent structures, maintains signal norms across time steps, supporting stable temporal learning [7].
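
For reference, the variance conditions commonly associated with these three schemes can be written compactly. The notation below is an assumption introduced here for illustration: $n_{in}$ and $n_{out}$ denote the fan-in and fan-out of a layer with weight matrix $W$, and the Kaiming expression uses the usual ReLU gain.

```latex
\begin{aligned}
\text{Xavier:}\quad & \mathrm{Var}(W_{ij}) = \frac{2}{n_{in} + n_{out}}, \\
\text{Kaiming (ReLU):}\quad & \mathrm{Var}(W_{ij}) = \frac{2}{n_{in}}, \\
\text{Orthogonal:}\quad & W^{\top} W = I \quad \text{(square or tall } W\text{)}.
\end{aligned}
```

None of these conditions depends on the training data, which is the sense in which the schemes are described as data-agnostic.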

These methods are computationally efficient and architecture-compatible, which contributes to their widespread use in both research and production systems. Modern deep forecasting models, including LSTM and Transformers, like their predecessors RNNs and CNNs, continue to rely on these initialization schemes due to their compatibility with common training pipelines.

However, these strategies are inherently data-agnostic and do not incorporate information from the input distribution or task objective. As a result, they require additional optimization effort before network parameters converge toward task-relevant representations, which can prolong training time and, in some cases, lead to suboptimal convergence, particularly in forecasting tasks with limited data, non-stationarity, or shifting distributions.

2.3. Data-Driven Initialization Approaches and Limitations

To overcome the limitations of traditional data-agnostic schemes, several data-driven initialization methods have been proposed to incorporate distributional and task-specific information from the training data. Among these, Layer-Sequential Unit-Variance (LSUV) initialization improves training stability by performing forward passes to normalize the output variance of each layer sequentially [9]. LSUV requires minimal tuning and offers improved convergence, particularly in deep architectures.

Laor Initialization introduces a distinct data-driven approach that evaluates multiple randomly sampled weight candidates at the input layer using forward-pass loss as a proxy for optimization readiness [10]. The method clusters the resulting loss scores and selects the candidate associated with the most promising cluster, aiming to improve early convergence and final predictive performance. Unlike LSUV, which normalizes layer activations, Laor Initialization emphasizes loss-informed clustering to guide initialization. This design is particularly suited for tasks characterized by temporal variability and label imbalance, where early training stability can significantly affect downstream performance.

Despite their conceptual strengths, both LSUV and Laor Initialization face practical limitations. Because they rely on multiple forward passes for either activation normalization or loss-based candidate evaluation, they introduce additional computational overhead at the initialization stage. Furthermore, both methods are tightly coupled with specific model architectures. Adapting them to newer or more complex designs, such as Transformers, hybrid attention networks, or modular architectures, often requires extensive code customization and access to internal layers. This combination of computational and engineering overhead reduces their usability in heterogeneous forecasting pipelines, where efficiency and model flexibility are essential. These limitations motivate the development of more architecture-agnostic, reusable, and lightweight initialization strategies.

2.4. Toward Optimization-Aware and Architecture-Agnostic Initialization

While data-driven initialization methods offer task-informed starting points, applying them across all layers of a deep model introduces significant computational and engineering overhead. Each layer typically requires specialized access for forward or gradient evaluation, complicating their integration into diverse or modular architectures. In practice, the input layer, where raw temporal signals first interact with the model, is often the most accessible and initialization-sensitive component.

These considerations highlight the value of a selective hybrid strategy: applying data-informed initialization where it is most impactful—which, in our case, is the input layer—while retaining conventional, efficient initializers for internal components such as attention mechanisms or recurrent blocks. As discussed in the previous section, most existing data-driven methods rely on forward-pass loss to select among candidate weight configurations and are often tightly coupled to specific architectural designs. This limits their usability in dynamic forecasting pipelines that demand both architectural flexibility and computational efficiency.

Consequently, there is a clear need for an initialization strategy that is both optimization-aware and architecture-agnostic, one that can evaluate candidate weights in a task-relevant manner without requiring deep structural access or significant engineering integration. The present study addresses this need by introducing a novel proxy-assisted gradient-informed approach, which is formally described in Section 3.

2.5. Relevance to Forecasting: A Case Study in the Thai Equity Market

Financial time-series forecasting is a technically demanding task characterized by non-stationarity, class imbalance, limited historical data, and dynamic structural shifts. These characteristics increase sensitivity to model configuration choices such as architecture depth, training frequency, and weight initialization. The Thai equity market provides a particularly suitable testbed. In this setting, the Stock Exchange of Thailand (SET) implements a Market Surveillance Measure List (MSML) to flag stocks with abnormal price and volume behavior [11], aiming to protect investors from market manipulation or excessive speculation [11,12,13,14,15].

Forecasting which stocks are likely to be flagged in advance poses a complex classification problem under regulatory, temporal, and data-driven constraints. The problem is further challenged by the infrequent and imbalanced nature of MSML events, the diversity of financial instruments, and the requirement of retraining across evolving time windows. To support experimentation, this study adopts the forecasting pipeline of Petchpol and Boongasame (2025) [31], which handles data leakage, concept drift, class imbalance, and cold start conditions.

While the focus of this work is on financial risk prediction, the challenges of non-stationarity, label imbalance, heterogeneous architectures, and retraining are also common in energy demand forecasting [36], epidemiological trend detection [37], and environmental monitoring [38], where similar pressures on model reliability and efficiency exist. Thus, this case study offers broader insight into the behavior of deep forecasting models under real-world operational constraints.

3. A Hybrid Weight Initialization Framework

3.1. Proxy Model

The proposed framework introduces a selective weight initialization strategy that targets only the numerical input layer, where initialization sensitivity is most pronounced in deep time-series forecasting models. Rather than applying complex initialization schemes across the entire architecture, this design seeks to balance optimization-awareness with architectural simplicity and computational efficiency.

To achieve this, the framework employs a lightweight proxy model that evaluates multiple candidate weight configurations prior to training. The proxy serves as a surrogate for the full forecasting architecture, enabling architecture-agnostic evaluation while keeping computational cost low. This design decouples initialization assessment from internal model structures, making it fully generalizable and applicable across a wide range of deep learning architectures.

The proxy itself is implemented as a single-layer linear transformation followed by a task-specific output head. For classification tasks such as Market Surveillance Measure List (MSML) prediction, the output head is a softmax classifier trained using the cross-entropy loss, defined as:

$$L = - \sum_{i = 1}^{C} y_{i} \log ({\hat{y}}_{i}), \tag{1}$$

where $C$ is the number of classes, $y_{i}$ represents the true label, and ${\hat{y}}_{i}$ is the predicted probability for class $i$. In our MSML task, which is binary (flagged vs. not flagged), this categorical cross-entropy reduces to the binary cross-entropy with $C = 2$, where $y_{1}$ is the true-label indicator for ‘not flagged’ and $y_{2}$ is the true-label indicator for ‘flagged’. For regression settings, the output head is replaced with a linear output trained via mean squared error (written here without the $1/S$ normalization):

$$L = \sum_{j = 1}^{S} {(y_{j} - {\hat{y}}_{j})}^{2}, \tag{2}$$

where $S$ is the number of samples, $y_{j}$ is the true target value for sample $j$, and ${\hat{y}}_{j}$ is the predicted value.
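
The reduction from Equation (1) to the binary case can be checked numerically. The sketch below is illustrative only: the function names and the predicted probability of 0.2 are hypothetical, and the one-hot encoding of the label is an assumption about how $y_{1}$ and $y_{2}$ are represented.

```python
import math

def categorical_cross_entropy(y_true, y_prob):
    """Equation (1) for one sample: L = -sum_i y_i * log(y_hat_i)."""
    return -sum(y * math.log(p) for y, p in zip(y_true, y_prob) if y > 0)

def binary_cross_entropy(label, p_flagged):
    """Binary form: label is 1 for 'flagged' and 0 for 'not flagged'."""
    return -(label * math.log(p_flagged) + (1 - label) * math.log(1.0 - p_flagged))

# Hypothetical prediction: the model assigns probability 0.2 to 'flagged'.
p_flagged = 0.2
probs = [1.0 - p_flagged, p_flagged]      # [not flagged, flagged]

for label in (0, 1):
    one_hot = [1 - label, label]          # assumed one-hot encoding of the true label
    cce = categorical_cross_entropy(one_hot, probs)
    bce = binary_cross_entropy(label, p_flagged)
    print(label, round(cce, 6), round(bce, 6))   # the two losses coincide
```

With $C = 2$ the two predicted probabilities sum to one, so a single probability is enough to evaluate either form.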

As illustrated in Figure 1, categorical inputs are processed separately using a standard Xavier-initialized embedding layer, while the proxy-based initialization is applied exclusively to the numerical input layer. The remainder of the forecasting architecture, including models such as Transformer, ConvTran, MLSTM-FCN, MALSTM-FCN, and MMAGRU-FCN, is adopted from the validated pipeline of Petchpol and Boongasame (2025) and used without modification [31]. These internal layers retain a standardized initialization policy based on each layer’s structure and activation behavior (summarized in Table 1), allowing the effect of the proposed method to be evaluated in isolation.

Importantly, the proxy is not discarded after evaluation. The selected weight configuration is retained and used to initialize the input layer of the actual forecasting model. This ensures that initialization reflects real data flow characteristics while avoiding overhead from repeated or full-model training. The following section introduces a data-driven evaluation procedure designed to guide this selection process, using empirical feedback from the proxy to identify initialization configurations that are more responsive to the forecasting task at hand.

3.2. Gradient-Informed Evaluation

The evaluation mechanism employed by the proxy builds on Laor Initialization (Laor-ori), which scores candidate weight configurations based on forward-pass classification loss. A pool of $N$ randomly generated candidate weight matrices is created for the numerical input layer. Each candidate is assigned to the proxy model and evaluated using the task-specific loss function, resulting in scalar error values (Algorithm 1). The proxy model applies these loss functions systematically across all candidate weight configurations. For classification tasks, Equation (1) is used, while regression applications would employ Equation (2).

Algorithm 1 Laor-ori (error-driven via proxy)

Input: Model $M$, proxy layer $P$, dataset batch $(X, y)$, number of candidates $N$, number of clusters $k$, loss function $L$ (e.g., Cross-Entropy).
Output: Model $M^{'}$ with proxy layer $P^{'}$.
Begin:
1. Define a standalone proxy layer $W_{i} \in \mathbb{R}^{d_{out} \times d_{in}}$ with dimensions matching $P$, where $d_{in}$ and $d_{out}$ denote the input and output dimensionalities of the layer under consideration.
2. Randomly generate $N$ normally distributed candidate weight matrices: $\{W_{1}, W_{2}, \dots, W_{N}\}$.
3. For each candidate $W_{i}$:
   a. Assign $W_{i}$ to $P$.
   b. Perform forward pass: $\hat{y}_{i} = P (X)$.
   c. Compute scalar loss/error: $\ell_{i} = L (\hat{y}_{i}, y)$.
4. Form the error set: $E = \{\ell_{1}, \ell_{2}, \dots, \ell_{N}\}$.
5. Cluster $E$ into $k$ clusters using k-means.
6. Identify cluster $C^{*}$ with the lowest mean error.
7. Average weights from all candidates in $C^{*}$: $W_{final} = \frac{1}{|C^{*}|} \sum_{W_{i} \in C^{*}} W_{i}$.
8. Assign $W_{final}$ to $P$.
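
The steps of Algorithm 1 can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the proxy is reduced to a linear layer with a softmax cross-entropy loss, candidates are drawn from a unit normal distribution, the k-means routine is a simple one-dimensional Lloyd iteration with a deterministic start, and the toy batch and all function names (`proxy_loss`, `kmeans_1d`, `laor_ori`) are hypothetical.

```python
import math
import random

def proxy_loss(W, X, y):
    """Mean cross-entropy of a linear softmax proxy with weights W (d_out x d_in)."""
    total = 0.0
    for x, label in zip(X, y):
        logits = [sum(w * v for w, v in zip(row, x)) for row in W]
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_norm - logits[label]          # -log softmax(logits)[label]
    return total / len(X)

def kmeans_1d(values, k=2, iters=50):
    """Lloyd's k-means on scalars (k >= 2); returns one cluster index per value."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers over the sorted values.
    centers = [ordered[(len(ordered) - 1) * j // (k - 1)] for j in range(k)]
    assign = [0] * len(values)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            if members:
                centers[c] = sum(members) / len(members)
    return assign

def laor_ori(X, y, d_in, d_out, n_candidates=10, k=2, seed=0):
    rng = random.Random(seed)
    # Step 2: N normally distributed candidates (unit variance is an assumption).
    candidates = [[[rng.gauss(0.0, 1.0) for _ in range(d_in)] for _ in range(d_out)]
                  for _ in range(n_candidates)]
    # Steps 3-4: forward-pass loss of every candidate.
    errors = [proxy_loss(W, X, y) for W in candidates]
    # Steps 5-6: cluster the losses and pick the non-empty cluster with the lowest mean.
    assign = kmeans_1d(errors, k)
    cluster_mean = {}
    for c in set(assign):
        members = [e for e, a in zip(errors, assign) if a == c]
        cluster_mean[c] = sum(members) / len(members)
    best = min(cluster_mean, key=cluster_mean.get)
    # Step 7: element-wise average of the candidates in the selected cluster.
    chosen = [W for W, a in zip(candidates, assign) if a == best]
    w_final = [[sum(W[r][c] for W in chosen) / len(chosen) for c in range(d_in)]
               for r in range(d_out)]
    return w_final, errors, cluster_mean[best]

# Toy batch (hypothetical): 32 samples, 4 numerical features, binary labels.
data_rng = random.Random(1)
X = [[data_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(32)]
y = [1 if x[0] + x[1] > 1.0 else 0 for x in X]
w_final, errors, best_mean = laor_ori(X, y, d_in=4, d_out=2)
print(len(w_final), len(w_final[0]), round(best_mean, 4))
```

In a real pipeline, step 8 would copy `w_final` into the numerical input layer of the forecasting model.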

The proposed gradient-informed Laor variant (Laor) improves upon this by replacing the loss-based evaluation with the $l_{2}$-norm of the backward gradients of each candidate (Algorithm 2). The gradient norm for candidate $i$ is computed as:

$$g_{i} = {\| \nabla_{W_{i}} \ell_{i} \|}_{2}, \tag{3}$$

where $W_{i}$ denotes the candidate weight matrix in the proxy layer and $\ell_{i}$ is the corresponding scalar loss. This measure provides a direct and actionable indicator of optimization potential, unlike the loss value alone.

Algorithm 2 Laor (gradient-informed via proxy)

Input: Model $M$, proxy layer $P$, optimizer $O$ (e.g., SGD, Adam), dataset batch $(X, y)$, number of candidates $N$, number of clusters $k$, loss function $L$ (e.g., Cross-Entropy), randomizer $R$: a base weight initializer (e.g., Random, Kaiming-Normal).
Output: Model $M^{'}$ with proxy layer $P^{'}$.
Begin:
1. Define a standalone proxy layer $W_{i} \in \mathbb{R}^{d_{out} \times d_{in}}$ with dimensions matching $P$, where $d_{in}$ and $d_{out}$ denote the input and output dimensionalities of the layer under consideration.
2. Generate $N$ candidate weight matrices using randomizer $R$: $\{W_{1}, W_{2}, \dots, W_{N}\}, W_{i} \sim R$.
3. For each candidate $W_{i}$:
   a. Assign $W_{i}$ to $P$.
   b. Zero gradients in optimizer $O$.
   c. Perform forward pass: $\hat{y}_{i} = P (X)$.
   d. Compute scalar loss value: $\ell_{i} = L (\hat{y}_{i}, y)$.
   e. Perform backward pass and compute gradient $l_{2}$-norm: $g_{i} = {\| \nabla_{W_{i}} \ell_{i} \|}_{2}$.
4. Form the gradient norm set: $G = \{g_{1}, g_{2}, \dots, g_{N}\}$.
5. Cluster $G$ into $k$ clusters using k-means.
6. Identify cluster $C^{*}$ with the highest mean gradient norm.
7. Average weights from all candidates in $C^{*}$: $W_{final} = \frac{1}{|C^{*}|} \sum_{W_{i} \in C^{*}} W_{i}$.
8. Assign $W_{final}$ to $P$.
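
A compact way to see what Algorithm 2 computes is to write the gradient of the proxy loss in closed form. The sketch below makes several assumptions that are not taken from the paper: the proxy is a bare linear softmax layer, so the backward pass reduces to the analytic expression $(\hat{y} - y)\,x^{\top}$ averaged over the batch; the two-cluster split is found by exhaustively minimizing the one-dimensional k-means objective rather than by running an iterative k-means library; and the toy data and function names are hypothetical.

```python
import math
import random

def loss_and_grad_norm(W, X, y):
    """Mean cross-entropy of a linear softmax proxy and the l2-norm of dL/dW."""
    d_out, d_in = len(W), len(W[0])
    grad = [[0.0] * d_in for _ in range(d_out)]
    loss = 0.0
    for x, label in zip(X, y):
        logits = [sum(w * v for w, v in zip(row, x)) for row in W]
        peak = max(logits)
        exps = [math.exp(z - peak) for z in logits]
        norm = sum(exps)
        loss += peak + math.log(norm) - logits[label]
        for r in range(d_out):
            delta = exps[r] / norm - (1.0 if r == label else 0.0)   # p_r - y_r
            for c in range(d_in):
                grad[r][c] += delta * x[c]
    n = len(X)
    g = math.sqrt(sum((v / n) ** 2 for row in grad for v in row))   # Equation (3)
    return loss / n, g

def select_by_gradient(candidates, X, y):
    """Split the gradient norms into two clusters and average the high-norm cluster."""
    norms = [loss_and_grad_norm(W, X, y)[1] for W in candidates]
    order = sorted(range(len(norms)), key=lambda i: norms[i])

    def spread(idx):
        mean = sum(norms[i] for i in idx) / len(idx)
        return sum((norms[i] - mean) ** 2 for i in idx)

    # Exact 1-D two-cluster split: the cut that minimizes within-cluster variance.
    cut = min(range(1, len(order)), key=lambda c: spread(order[:c]) + spread(order[c:]))
    top = order[cut:]                      # cluster with the highest mean gradient norm
    d_out, d_in = len(candidates[0]), len(candidates[0][0])
    w_final = [[sum(candidates[i][r][c] for i in top) / len(top) for c in range(d_in)]
               for r in range(d_out)]
    return w_final, norms, top

# Toy batch and candidate pool (hypothetical values).
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(32)]
y = [1 if x[0] + x[1] > 1.0 else 0 for x in X]
candidates = [[[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(2)] for _ in range(10)]
w_final, norms, top = select_by_gradient(candidates, X, y)
print(sorted(top), [round(g, 3) for g in norms])
```

With an autograd framework, the inner loops would be replaced by a forward pass, a call to the backward pass, and a norm over the proxy layer's gradient tensor; the selection logic stays the same.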

To demonstrate this advantage, consider the gradient descent update rule in Equation (4). If one were to take a step of size $\eta$, the expected first-order decrease for candidate $i$ would be

$$\Delta \ell_{i} \approx - \eta g_{i}^{2}. \tag{4}$$
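
The estimate in Equation (4) follows from a first-order Taylor expansion of the loss around the candidate weights. A short derivation, using only the symbols defined above and writing $W_{i}'$ (a notation introduced here) for the weights after one gradient descent step, is:

```latex
\begin{aligned}
W_{i}' &= W_{i} - \eta \, \nabla_{W_{i}} \ell_{i}, \\
\ell(W_{i}') &\approx \ell(W_{i}) + \left\langle \nabla_{W_{i}} \ell_{i},\; W_{i}' - W_{i} \right\rangle
             = \ell_{i} - \eta \, {\| \nabla_{W_{i}} \ell_{i} \|}_{2}^{2}, \\
\Delta \ell_{i} &= \ell(W_{i}') - \ell_{i} \approx - \eta \, g_{i}^{2}.
\end{aligned}
```

The approximation drops terms of order $\eta^{2}$, so it is informative only for sufficiently small step sizes.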

The key distinction between approaches lies in their selection criteria:
Laor-ori (Algorithm 1). Selects candidates by $\ell_{i}$. Smaller loss values indicate lower initial misfit but do not guarantee faster optimization.
Laor (Algorithm 2). Selects candidates by $g_{i}$. Larger gradient norms imply greater reduction potential per update step, aligning initialization with optimization dynamics rather than only initial fit quality.

Both methods apply k-means clustering followed by averaging within the best cluster. This design, inherited from the original Laor method, improves reproducibility and reduces sensitivity to noise or outliers, ensuring that initialization is not determined by a single potentially unstable candidate but instead reflects a more robust consensus of promising weights.

Finally, to evaluate the robustness and variability of the proposed method, multiple gradient-informed Laor variants are explored, each differing in their random initialization schemes:

Laor-n/Laor-u: Laor with normal/uniform distribution sampling;
Laor-kn/Laor-ku: Laor with Kaiming Normal/Kaiming Uniform sampling;
Laor-xn/Laor-xu: Laor with Xavier Normal/Xavier Uniform sampling;
Laor-o: Laor with orthogonal matrix sampling.
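
The variants above differ only in how the candidate pool is drawn. The sketch below illustrates the standard scale formulas behind six of the seven samplers; it is not the authors' code. The unit variance for Laor-n, the range $[-1, 1]$ for Laor-u, and the ReLU gain for the Kaiming variants are assumptions, the function names are hypothetical, and the orthogonal sampler (Laor-o) is omitted because it needs a matrix factorization that the standard library does not provide.

```python
import math
import random

def candidate_sampler(scheme, d_out, d_in):
    """Return (kind, scale): std for normal samplers, bound for uniform samplers."""
    fan_in, fan_out = d_in, d_out
    table = {
        "n":  ("normal", 1.0),                                    # assumed unit variance
        "u":  ("uniform", 1.0),                                   # assumed range [-1, 1]
        "kn": ("normal", math.sqrt(2.0 / fan_in)),                # Kaiming Normal, ReLU gain
        "ku": ("uniform", math.sqrt(6.0 / fan_in)),               # Kaiming Uniform, ReLU gain
        "xn": ("normal", math.sqrt(2.0 / (fan_in + fan_out))),    # Xavier Normal
        "xu": ("uniform", math.sqrt(6.0 / (fan_in + fan_out))),   # Xavier Uniform
    }
    return table[scheme]

def sample_candidates(scheme, d_out, d_in, n=10, seed=0):
    """Draw n candidate matrices of shape d_out x d_in for the given variant suffix."""
    kind, scale = candidate_sampler(scheme, d_out, d_in)
    rng = random.Random(seed)
    if kind == "normal":
        draw = lambda: rng.gauss(0.0, scale)
    else:
        draw = lambda: rng.uniform(-scale, scale)
    return [[[draw() for _ in range(d_in)] for _ in range(d_out)] for _ in range(n)]

# Example: the Laor-xu pool for a hypothetical 16-feature input and 8 output units.
pool = sample_candidates("xu", d_out=8, d_in=16)
print(len(pool), len(pool[0]), len(pool[0][0]), candidate_sampler("xu", 8, 16)[1])
```

Each pool would then be scored and clustered by the same gradient-norm procedure, so any performance difference between variants traces back to the sampling distribution alone.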

3.3. Experimental Setup

This section describes the experimental setup used to evaluate the proposed proxy-assisted, gradient-informed initialization framework. The experiments are conducted within a multivariate time-series classification (MTSC) setting and build directly on a previously validated pipeline developed by Petchpol and Boongasame (2025) [31]. All model configurations, training routines, and evaluation metrics are held constant across experiments to isolate the impact of the weight initialization strategy.

3.3.1. Inherited Setup from Prior Work

Dataset: Daily end-of-day (EOD) stock trading data and Market Surveillance Measure List (MSML) labels were obtained from the SET Market Analysis and Reporting Tool (SETSMART) platform [39], covering the period from 16 November 2012 to 15 June 2024. The dataset includes adjusted closing prices, trading volumes, valuation ratios (P/E, P/BV), and derived technical and sentiment indicators. Each record is associated with a binary label indicating whether the corresponding equity appears on the MSML;
Preprocessing Steps: Temporal alignment was strictly maintained by using predictors at time $t$ exclusively derived from historical observations up to $t - 1$ . Forecasting labels were forward-shifted according to the forecasting horizon. Numerical features underwent standardization and precision rounding to six decimal places to enhance training stability. Missing trading values for suspended stocks were explicitly imputed with zeros, indicating absence of trading activity;
Rolling Window Training Strategy: A rolling window methodology was used to handle temporal non-stationarity, employing a training window of 1260 trading days (~5 years), a validation window of 60 trading days (~1 quarter), and a rolling step size of 60 days. This strategy resulted in 26 training-validation cycles for each experimental iteration;
Class Imbalance Handling: Given the significant minority class imbalance (~3.6% positive samples), Synthetic Minority Oversampling Technique (SMOTE) was applied within each rolling training window, doubling minority instances via interpolation with 5-nearest neighbors;
Computational Environment: Experiments were executed using PyTorch version 2.3.0 on an NVIDIA RTX 3090 GPU (NVIDIA Corporation, Santa Clara, CA, USA), Intel i7-13700KF CPU (Intel Corporation, Santa Clara, CA, USA), and 32 GB RAM, with CUDA version 12.3, ensuring reproducibility and computational efficiency;
Evaluation Metrics: The primary evaluation metric was Matthews Correlation Coefficient (MCC), complemented by supplementary metrics including F1 score, Recall, Precision, Accuracy, and Cross-Entropy Loss. Computational efficiency metrics encompassed training time per epoch, total wall-clock time, and epochs to convergence;
Model Architectures: Five deep learning models are used, spanning a range of structural patterns from fully attention-based to recurrent-convolutional hybrids, as listed in Table 2;
Hyperparameters: Training configurations were inherited directly from prior work and remained fixed across all initialization experiments. These hyperparameters, summarized in Table 3, were tuned previously and reused to ensure a controlled evaluation setting.
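
The rolling-window arithmetic in the list above can be made concrete with a small index generator. The window sizes (1260, 60, 60) come from the setup; the total of 2820 trading days is a hypothetical value chosen here because it is the length for which these settings yield exactly the 26 cycles reported, and the function name is illustrative.

```python
def rolling_windows(n_days, train=1260, val=60, step=60):
    """Yield (train_start, train_end, val_end) day indices, end-exclusive."""
    start = 0
    while start + train + val <= n_days:
        # Validation days begin where the training window ends, so no day is shared.
        yield start, start + train, start + train + val
        start += step

# Hypothetical series length: 1260 + 60 + 25 * 60 = 2820 trading days.
windows = list(rolling_windows(2820))
print(len(windows))          # 26
print(windows[0])            # (0, 1260, 1320)
print(windows[-1])           # (1500, 2760, 2820)
```

Under this layout, oversampling and standardization statistics would be computed inside each training slice only, which matches the leakage safeguards described above.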

For further details regarding data properties, exploratory statistics, and baseline model performance, please refer to Sections 3–5 of the original study [31].

3.3.2. Novel Experimental Components

This study introduces a systematic evaluation of weight initialization strategies, with particular emphasis on the proposed proxy-assisted gradient-informed Laor initialization. The following components were developed to extend the baseline forecasting framework and isolate the contribution of the initialization method.

Initialization Strategies Compared: Seventeen initialization strategies were evaluated at the numerical input layer, grouped into four categories (a comparative summary is provided in Table 4):

Traditional data-agnostic methods: Xavier (Uniform, Normal), Kaiming (Uniform, Normal), Random (Uniform, Normal), and Orthogonal.
Variance-based method: Layer-Sequential Unit Variance (LSUV).
Error-driven method: Original Laor Initialization, using clustering on forward-pass loss values.
Gradient-informed variants (proposed): Laor, Laor-Normal (Laor-n), Laor-Uniform (Laor-u), Laor-Kaiming-Normal (Laor-kn), Laor-Kaiming-Uniform (Laor-ku), Laor-Xavier-Normal (Laor-xn), Laor-Xavier-Uniform (Laor-xu), Laor-Orthogonal (Laor-o). All use the same gradient-norm clustering scheme and differ only in the randomization method used to generate candidate weights.

These strategies are applied solely at the input layer; internal layers follow fixed initializers based on their architecture and activation type (as described in Table 1). This separation isolates the impact of input-layer initialization while maintaining compatibility with diverse architectures.

Proxy-Layer and Clustering Procedure: Each strategy is evaluated through the same proxy model described in Sections 3.1 and 3.2. For gradient-informed variants, $N = 10$ weight candidates are generated using the assigned randomizer, evaluated by gradient norm, and clustered using k-means with $k = 2$. The choice of $N = 10$ follows the established methodology from the original Laor initialization paper, ensuring methodological consistency and comparability with prior work. The parameter $k = 2$ aligns with the binary classification structure of our forecasting task (flagged vs. not flagged stocks), providing a natural clustering framework for candidate selection. The final weight is computed by averaging the members of the cluster with the highest mean gradient norm, as specified in Algorithm 2. The same clustering process is used for the error-driven variant, except that it is based on forward-pass loss and selects the cluster with the lowest mean loss (Algorithm 1). All candidates are evaluated under identical proxy conditions to ensure fair comparison.
Statistical Robustness Checks: To assess the robustness of each strategy, every experimental configuration was repeated 20 times with independent random seeds. Aggregate statistics, including mean performance, standard deviation, and training time, were reported for each model-initializer combination, allowing for comprehensive comparison of predictive accuracy and convergence efficiency.

3.4. Experiment Summary

All experiments in this study are conducted within a previously validated deep time-series forecasting pipeline, incorporating safeguards against data leakage, concept drift, class imbalance, and cold-start scenarios. These aspects are addressed through sequence-based input design, rolling-window training, SMOTE-based oversampling, and implicit transfer learning.

Five deep learning architectures, Transformer, ConvTran, MLSTM-FCN, MALSTM-FCN, and MMAGRU-FCN, are retained from prior work, in which they demonstrated strong and consistent performance for multivariate time-series classification. These configurations were previously identified as the most effective under real-world forecasting constraints.

No architectural or training modifications are introduced in this study. All preprocessing steps, hyperparameters, and training routines are preserved to ensure consistency with the original pipeline, thereby isolating the impact of weight initialization.

Only the initialization strategy at the numerical input layer is varied. This setup ensures that any observed variation in performance or training efficiency, as reported in Section 4, can be attributed solely to the effect of weight initialization.

4. Results

This section presents empirical results evaluating the proposed gradient-informed initialization strategies against traditional, variance-based, and error-driven methods. All evaluations are conducted within the controlled forecasting pipeline described in Section 3, using five deep learning architectures selected from prior work. To ensure statistical robustness, each model–initializer configuration is trained over 20 independent runs. Evaluation focuses on three key dimensions: (1) baseline performance comparison on the Transformer model, (2) generalizability across diverse architectural types, and (3) convergence speed and computational efficiency. Full classification and convergence metrics are provided in Appendix A (Tables A1–A10).

4.1. Baseline Performance Comparison

This section presents classification and convergence results for all weight initialization strategies evaluated on the Transformer model, serving as a representative case for baseline comparison. Initializers include traditional data-agnostic methods (e.g., Xavier, Kaiming, Orthogonal), a variance-balancing strategy (LSUV), an error-driven method (Laor-ori), and eight gradient-informed variants proposed in this study. Evaluation focuses on Matthews Correlation Coefficient (MCC), which is appropriate for imbalanced classification tasks. Training time and initialization overhead are also reported.
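To illustrate why MCC suits imbalanced tasks, the following sketch computes it from confusion-matrix counts using the standard definition. The counts are hypothetical and are not drawn from the dataset used in this study.

```python
import math

def matthews_corrcoef(tp, tn, fp, fn):
    """MCC from binary confusion-matrix counts; 0.0 when undefined."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0

def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical imbalanced test set: 50 positives, 950 negatives.
# A classifier that labels every sample negative looks accurate but has MCC 0.
print(accuracy(tp=0, tn=950, fp=0, fn=50))           # 0.95
print(matthews_corrcoef(tp=0, tn=950, fp=0, fn=50))  # 0.0

# A classifier that recovers 40 of 50 positives with 30 false alarms.
print(round(matthews_corrcoef(tp=40, tn=920, fp=30, fn=10), 3))
```

Accuracy rewards the trivial all-negative predictor, whereas MCC only rises when both classes are predicted well.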

As summarized in Table 5, baseline methods such as Normal and LSUV achieved high MCC scores (70.51 ± 1.02% and 70.47 ± 0.80%, respectively), but incurred the longest training times (776.0 s and 805.2 s). Laor-ori yielded similar training durations (748.2 s) but lower MCC (68.53 ± 1.07%), suggesting limited benefit from forward-loss clustering in the proxy-assisted setup.

Gradient-informed Laor variants demonstrated more favorable trade-offs. Laor reached 69.54 ± 1.31% MCC with substantially reduced training time (592.7 s), while Laor-n achieved the highest MCC among the gradient-informed variants (70.16 ± 0.90%), within 0.4 percentage points of Normal and LSUV, together with the shortest training time of all methods (590.7 s). These results suggest that backward-gradient clustering supports efficient convergence while preserving classification performance.

Figure 2 provides a visual comparison of test MCC against total training time for all initialization strategies. Highlighted markers identify the top three performers in terms of predictive accuracy and convergence efficiency.
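The top-three ranking by predictive accuracy can be reproduced from the mean test MCC values in Table A1. The sketch below is illustrative only: the helper name `top_k` is hypothetical, and it ranks by mean MCC without accounting for the reported standard deviations.

```python
# Mean test MCC (%) on the Transformer model, copied from Table A1.
transformer_mcc = {
    "Kaiming-Normal": 68.82, "Kaiming-Uniform": 68.69,
    "Xavier-Normal": 68.82, "Xavier-Uniform": 68.69,
    "Normal": 70.51, "Uniform": 68.77, "Orthogonal": 68.75,
    "LSUV": 70.47, "Laor-ori": 68.53, "Laor": 69.54,
    "Laor-kn": 68.27, "Laor-ku": 68.15, "Laor-xn": 68.27,
    "Laor-xu": 68.15, "Laor-n": 70.16, "Laor-u": 68.66, "Laor-o": 68.79,
}

def top_k(scores, k=3, higher_is_better=True):
    """Return the k best (name, score) pairs."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=higher_is_better)
    return ranked[:k]

print(top_k(transformer_mcc))
# [('Normal', 70.51), ('LSUV', 70.47), ('Laor-n', 70.16)]
```

Passing `higher_is_better=False` would rank a dictionary of training times in the same way.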

Since this study reuses the previously validated experimental pipeline by Petchpol and Boongasame (2025) [31], standard baselines such as the Random Walk and the Random Forest are already incorporated, providing a transparent floor against which the benefits of more advanced initialization schemes can be assessed.

4.2. Generalizability of Weight Initializers Across Architectures

To evaluate the generalizability of weight initialization strategies across architectures with differing inductive biases, seventeen initializers were assessed on five deep learning models: Transformer, ConvTran, MLSTM-FCN, MALSTM-FCN, and MMAGRU-FCN. These models range from attention-only architectures to hybrids incorporating convolutional and recurrent modules. The tested initializers include traditional data-agnostic schemes (e.g., Xavier, Kaiming, Normal, Orthogonal), the variance-balancing LSUV method, the error-driven Laor-ori, and eight gradient-informed Laor variants. Detailed classification metrics, including MCC, Accuracy, Precision, Recall, and F1 Score, are provided in Appendix A, Tables A1, A3, A5, A7, and A9, while convergence-related statistics appear in Tables A2, A4, A6, A8, and A10. Comparative visualizations of performance and training time are shown in Figures 2–6.

On the Transformer model (Figure 2), Normal and LSUV achieved the highest MCC scores (70.51 ± 1.02% and 70.47 ± 0.80%, respectively), though they incurred longer training times (>750 s). Laor-n performed comparably (70.16 ± 0.90%) while reducing total training time to 590.7 s, the fastest among all methods.

On the ConvTran architecture (Figure 3), gradient-informed variants consistently outperformed other initialization strategies. Both Laor-ku and Laor-xu achieved the highest MCC of 72.62 ± 1.95%, with Laor-ku also yielding the fastest training duration (998.282 s), closely followed by Laor-n (72.02 ± 1.78%, 998.713 s). Other gradient-informed variants, such as Laor-kn and Laor-xn, also performed competitively with MCC scores of 71.88 ± 2.23%. In contrast, traditional and variance-based initializers, including Orthogonal (71.66 ± 1.59%, 1298.746 s) and LSUV (69.85 ± 2.03%, 1299.713 s), lagged behind in both predictive accuracy and convergence efficiency.

In the MLSTM-FCN architecture (Figure 4), Normal (72.44 ± 1.94%) and LSUV (72.30 ± 2.20%) yielded the highest MCC scores. Among data-informed strategies, Laor-kn achieved 70.65 ± 2.58%, offering faster training (532.368 s) than LSUV (655.2 s), suggesting a trade-off between moderate gains in speed and slight reductions in accuracy. Notably, Kaiming-Normal also performed competitively (71.69 ± 2.04%, 520.549 s), surpassing Laor-kn in both accuracy and efficiency.

For the MALSTM-FCN architecture (Figure 5), LSUV achieved the highest MCC (72.54 ± 2.29%) but incurred longer training times (621 s). Among gradient-informed initializers, Laor-o followed with 70.56 ± 4.70%, while Laor-kn and Laor-ku offered lower training times (~528 s) with MCC scores slightly below 70%. Notably, Kaiming-Normal (70.00 ± 3.89%, ~528 s) and Kaiming-Uniform (70.39 ± 3.58%, ~531 s) matched the training times of the fastest Laor variants while achieving slightly higher MCC than Laor-kn and Laor-ku, although both remained below Laor-o.

On the MMAGRU-FCN model (Figure 6), which integrates attention, convolution, and gated recurrence, gradient-informed variants showed strong scalability. Laor-o (72.87 ± 0.65%) and Laor-n (72.83 ± 0.86%) closely matched the performance of LSUV (73.74 ± 0.69%) while reducing training time by more than 160 s on average (1039.5 s vs. 1201.8 s).

Across all architectures, the original Laor-ori method consistently underperformed relative to its gradient-informed counterparts, indicating that error-driven clustering based on forward loss may be less effective at capturing early-stage optimization dynamics when applied through a proxy model. In contrast, gradient-informed Laor variants demonstrated notable adaptability in deeper and hybrid models such as Transformer, ConvTran, and MMAGRU-FCN. However, in simpler recurrent architectures like MLSTM-FCN and MALSTM-FCN, traditional initializers, particularly Kaiming-Normal and Kaiming-Uniform, achieved comparable or superior results in both predictive accuracy and convergence time. These findings suggest that while gradient-informed initialization offers benefits in complex model settings, its advantage may diminish in recurrent structures where variance-preserving schemes remain effective.

4.3. Cross-Architecture Comparison and Performance-Efficiency Trade-Offs

The preceding analysis reveals architecture-specific strengths of various initialization strategies, but broader patterns also emerge when comparing their behavior across forecasting models. This section synthesizes those results, highlighting general trade-offs between classification performance and convergence efficiency across all evaluated architectures.

Figure 7 visualizes this trade-off, marking initializers that achieved top three performance in MCC, total training time, or overall efficiency. These results illustrate how initialization impacts both accuracy and runtime behavior in shallow and deep forecasting architectures.

Gradient-informed Laor variants, particularly Laor-n, Laor-o, and Laor-ku, frequently ranked among the top three initializers in both predictive accuracy and training efficiency. For instance, Laor-n achieved the third-highest MCC on Transformer (70.16 ± 0.90%), behind only Normal and LSUV, and remained competitive on MMAGRU-FCN (72.83 ± 0.86%) while consistently exhibiting low total training times (590.7 s and 1039.5 s, respectively). Similarly, Laor-ku and Laor-o demonstrated favorable performance across ConvTran, MALSTM-FCN, and MMAGRU-FCN, underscoring their robustness in hybrid and deep configurations.

Traditional data-agnostic initializers such as Xavier and Kaiming performed adequately in recurrent-heavy architectures, including MLSTM-FCN and MALSTM-FCN, but struggled to match the performance or efficiency of gradient-informed methods in attention-based or multi-component networks. For example, Kaiming-Normal yielded 71.69 ± 2.04% MCC on MLSTM-FCN, but fell below 69% MCC in Transformer and ConvTran, where model depth and architectural heterogeneity increased initialization sensitivity.

The LSUV method delivered high classification performance in most settings, often ranking in the top three MCC scores (e.g., 72.54 ± 2.29% on MALSTM-FCN, 73.74 ± 0.69% on MMAGRU-FCN), but incurred consistently high convergence costs. Its average total training time ranged from about 621 s on MALSTM-FCN to about 1300 s on ConvTran, and exceeded 800 s on Transformer, limiting its utility in computationally constrained deployments. In contrast, Laor-n and Laor-ku offered comparable or better accuracy with substantially reduced training overhead.

Laor-ori, the error-driven initializer, consistently underperformed relative to its gradient-informed successors. It exhibited middling MCC scores (e.g., 68.53 ± 1.07% on Transformer, 70.19 ± 2.72% on MALSTM-FCN) and longer training times, suggesting that forward-pass loss clustering—when implemented through a proxy model—may be insufficient for robust early-stage optimization.

Collectively, these results demonstrate that gradient-informed, proxy-assisted initialization achieves a favorable balance between accuracy and efficiency across architectures. This generalizability is particularly valuable in real-world forecasting pipelines, where frequent retraining, class imbalance, and evolving data distributions demand consistent and scalable learning behavior.

These cross-architecture results highlight the practical relevance of initialization strategies that balance predictive accuracy with convergence efficiency. In particular, deeper and hybrid models consistently benefited from optimization-aware initialization, while simpler recurrent models remained well-served by traditional schemes. These findings establish a foundation for the broader interpretation, practical implications, and limitations discussed in the next section.

4.4. Computational Overhead Analysis

This section quantifies the computational impact of different initialization strategies, contrasting weight initialization time against total training cost and forecasting accuracy.

Figure 8 presents side-by-side boxplots of initialization time, total training time, and test MCC performance. All initialization methods, including data-driven proxy-based variants, feature consistently low initialization times (~0.06–0.18 s per experiment). Standard schemes (Kaiming, Xavier, Orthogonal) fall within the same range, and even the slowest Laor variant (Laor-xn) remains only fractionally higher.

When compared to full training durations (520–1315 s), initialization overhead is always below 0.04% (e.g., 0.18 s/520 s ≈ 0.035%). Thus, the cost of proxy-based initialization is negligible in both large-scale and constrained environments. Importantly, the middle panel shows that gradient-informed variants (Laor, Laor-n, Laor-ku, Laor-kn) also rank among the shortest overall training times, rivaling or surpassing the fastest canonical schemes. The bottom panel further demonstrates that top-performing strategies in terms of MCC (up to 73.74% for LSUV) do not incur any meaningful cost penalty at initialization. For context, computing gradients directly on full models would require 7–17 s per candidate. With N = 10 candidates, initialization would instead take 70–172 s, roughly 400–1000 times slower than the proxy method and amounting to 15–25% of total training time. By contrast, the proxy approach achieves the same gradient-informed benefit at less than 0.04% overhead, making it up to about 3000 times more efficient.
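The overhead figures quoted above can be re-derived with a few lines of arithmetic. The sketch below uses only the ranges stated in the text; because the per-candidate cost is given as a rounded 7–17 s, the upper bound evaluates to 170 s rather than the 172 s quoted.

```python
# Ranges as stated in the text (seconds).
init_time_s = (0.06, 0.18)                # proxy-based initialization
train_time_s = (520.0, 1315.0)            # total training time
full_model_per_candidate_s = (7.0, 17.0)  # gradient pass on the full model
n_candidates = 10

# Worst case for the proxy: slowest initialization over shortest training run.
worst_case_overhead = init_time_s[1] / train_time_s[0]
print(f"{worst_case_overhead:.4%}")  # 0.0346%

# Cost of evaluating every candidate on the full model instead.
full_low = full_model_per_candidate_s[0] * n_candidates
full_high = full_model_per_candidate_s[1] * n_candidates

# Slowdown relative to the slowest and the fastest proxy initialization.
slowdown_vs_slowest_proxy = (full_low / init_time_s[1], full_high / init_time_s[1])
slowdown_vs_fastest_proxy = (full_low / init_time_s[0], full_high / init_time_s[0])
print([round(x) for x in slowdown_vs_slowest_proxy])
print([round(x) for x in slowdown_vs_fastest_proxy])
```

The first ratio pair corresponds to the "400–1000 times" range, and the second shows where a factor approaching 3000 arises when the fastest proxy time is used as the reference.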

In summary, proxy-based gradient-informed initialization delivers a compelling balance: negligible overhead, reduced total training time, and competitive or superior predictive accuracy. This efficiency advantage makes the approach particularly suitable for resource-constrained forecasting pipelines requiring frequent retraining.

5. Discussion and Conclusions

This study introduced a proxy-assisted, gradient-informed weight initialization strategy for deep time-series forecasting models. By clustering backward gradient norms from a lightweight proxy at the numerical input layer, the method provides optimization-aware yet architecture-agnostic initialization. Compared with traditional variance-preserving and error-driven strategies, it consistently improved convergence speed and predictive accuracy, particularly in deeper and hybrid architectures such as Transformer, ConvTran, and MMAGRU-FCN.
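The general idea can be sketched in a few lines, under loudly stated assumptions. The snippet below is not the authors' implementation: the one-layer logistic proxy, the synthetic data, and in particular the selection rule (take the larger of the two clusters and pick the candidate nearest its centroid) are hypothetical choices made only to show how candidate weights, backward gradient norms, and k-means with k = 2 fit together.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def proxy_grad_norm(weights, features, targets):
    """L2 norm of the logistic-loss gradient for a one-layer linear proxy."""
    n = len(features)
    grad = [0.0] * len(weights)
    for x, y in zip(features, targets):
        err = sigmoid(sum(w * xi for w, xi in zip(weights, x))) - y
        for j, xi in enumerate(x):
            grad[j] += err * xi / n
    return math.sqrt(sum(g * g for g in grad))

def two_means_1d(values, iters=50):
    """Plain k-means with k = 2 on scalars; returns (assignments, centroids)."""
    centroids = [min(values), max(values)]
    assign = [0] * len(values)
    for _ in range(iters):
        assign = [0 if abs(v - centroids[0]) <= abs(v - centroids[1]) else 1
                  for v in values]
        for k in (0, 1):
            members = [v for v, a in zip(values, assign) if a == k]
            if members:
                centroids[k] = sum(members) / len(members)
    return assign, centroids

def select_candidate(candidates, features, targets):
    """Pick one candidate weight vector from its gradient-norm cluster."""
    norms = [proxy_grad_norm(w, features, targets) for w in candidates]
    assign, centroids = two_means_1d(norms)
    # ASSUMPTION (not from the paper): use the larger cluster and return the
    # candidate whose gradient norm is closest to that cluster's centroid.
    target = 0 if assign.count(0) >= assign.count(1) else 1
    members = [i for i, a in enumerate(assign) if a == target]
    best = min(members, key=lambda i: abs(norms[i] - centroids[target]))
    return best, norms

# Synthetic stand-in data; N = 10 candidates drawn from a Normal initializer.
rng = random.Random(0)
n_features, n_samples, n_candidates = 8, 64, 10
features = [[rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
targets = [1 if x[0] + 0.5 * x[1] > 0 else 0 for x in features]
candidates = [[rng.gauss(0, 0.1) for _ in range(n_features)]
              for _ in range(n_candidates)]

best, norms = select_candidate(candidates, features, targets)
print("selected candidate:", best)
```

Because only the proxy is differentiated, the cost of scoring all candidates is independent of the downstream architecture, which is the property the overhead analysis relies on.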

These findings highlight that initialization plays a nontrivial role in forecasting performance, especially under practical constraints of non-stationary data, class imbalance, and limited resources. Gradient-informed clustering at the input layer alone was sufficient to guide early-stage optimization, and the fixed proxy design enabled systematic evaluation across architectures with negligible overhead (less than 0.04% of total training time). The robustness of Laor-n, Laor-o, and Laor-ku across diverse models underscores the value of combining gradient-based feedback with lightweight proxy evaluation.

At the same time, limitations remain. The method showed reduced advantage in simpler recurrent architectures (e.g., MLSTM-FCN and MALSTM-FCN), where variance-based schemes such as LSUV or Kaiming initialization remained competitive. The current linear proxy and reliance on k-means clustering represent conservative design choices that trade fidelity for efficiency. More adaptive clustering algorithms or nonlinear proxy layers may better capture complex gradient landscapes. In addition, validation was confined to financial risk prediction; extending evaluation to domains such as energy demand, epidemiology, and environmental monitoring will be important for establishing broader generalizability.

Future research should also investigate sensitivity to hyperparameters such as the number of candidates and clusters. In the present study, N = 10 was selected in accordance with the original Laor initialization methodology to ensure comparability, while k = 2 was aligned with the binary classification structure of our forecasting task (flagged vs. not flagged stocks). Although these settings provided a solid foundation, a systematic ablation across broader ranges of N and k may yield practical guidance for different architectures and tasks. Another promising direction is calibration: assessing how initialization strategies affect probability reliability through Brier score, Expected Calibration Error, and reliability diagrams. Such work is particularly relevant for risk-sensitive domains like finance, where well-calibrated probabilities matter as much as raw accuracy.
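For readers unfamiliar with the calibration metrics mentioned, the sketch below computes the Brier score and a binned Expected Calibration Error for binary predictions. It assumes equal-width bins over the predicted positive-class probability, which is one common ECE variant among several; the inputs are toy values, not outputs of the models in this study.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE over equal-width bins of the predicted positive-class probability."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_prob = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(avg_prob - frac_pos)
    return ece

# Toy predictions: an uninformative but perfectly calibrated forecaster.
print(brier_score([0.5, 0.5], [0, 1]))                 # 0.25
print(expected_calibration_error([0.5, 0.5], [0, 1]))  # 0.0

# Overconfident forecaster: predicts 1.0 but is right only half the time.
print(expected_calibration_error([1.0, 1.0], [0, 1]))  # 0.5
```

The per-bin pairs of average probability and observed positive rate are also the points that a reliability diagram would plot.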

In summary, the proposed gradient-informed initialization strategy offers a practical means of improving training efficiency and stability in real-world forecasting pipelines. Its negligible computational cost, architecture-agnostic design, and compatibility with established workflows make it especially suited for frequent retraining under resource constraints. While limitations remain in recurrent settings and domain coverage, the framework provides a strong foundation for future extensions toward adaptive, calibration-aware, and cross-domain initialization strategies.

Author Contributions

Conceptualization, K.P. and L.B.; methodology, K.P.; software, K.P.; validation, K.P.; formal analysis, K.P. and L.B.; investigation, K.P.; resources, K.P.; data curation, K.P.; writing—original draft preparation, K.P.; writing—review and editing, K.P. and L.B.; visualization, K.P.; supervision, L.B.; project administration, K.P.; funding acquisition, L.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was conducted as part of the author’s studies at King Mongkut’s Institute of Technology Ladkrabang (KMITL). It was supported by the National Science, Research and Innovation Fund (NSRF) through KMITL under grant number RE-KRIS/FF68/45. The Stock Exchange of Thailand provided non-financial support through data access.

Data Availability Statement

The datasets used in this study, including End-of-Day stock data and Market Surveillance Measure List data, were obtained from the Stock Exchange of Thailand (SET). Due to licensing restrictions, the data is not publicly available. However, researchers may request access to the data through the SET research department. The data schema and preprocessing details are provided in our previously published work (Petchpol & Boongasame, 2025) [31], which is cited within the manuscript.

Conflicts of Interest

Katsamapol Petchpol is an employee of The Stock Exchange of Thailand (SET) while pursuing a Doctoral Degree at King Mongkut’s Institute of Technology Ladkrabang. The research utilized data provided by SET, with the request for data made under a student identity.

Abbreviations

The following abbreviations are used in this manuscript:

CNNs	Convolutional Neural Networks
EOD	End-of-Day
GRUs	Gated Recurrent Units
Kaiming-n	Kaiming Normal initialization
Kaiming-u	Kaiming Uniform initialization
KMITL	King Mongkut’s Institute of Technology Ladkrabang
Laor	Gradient-Informed Laor Initialization
Laor-kn	Gradient-Informed Laor Initialization with Kaiming Normal initialization
Laor-ku	Gradient-Informed Laor Initialization with Kaiming Uniform initialization
Laor-n	Gradient-Informed Laor Initialization with Normal initialization
Laor-o	Gradient-Informed Laor Initialization with Orthogonal initialization
Laor-ori	Original Laor Initialization
Laor-u	Gradient-Informed Laor Initialization with Uniform initialization
Laor-xn	Gradient-Informed Laor Initialization with Xavier Normal initialization
Laor-xu	Gradient-Informed Laor Initialization with Xavier Uniform initialization
LSTM	Long Short-Term Memory
LSUV	Layer-Sequential Unit-Variance
MCC	Matthews Correlation Coefficient
MSE	Mean Squared Error
MSML	Market Surveillance Measure List
MTSC	Multivariate Time-Series Classification
NSRF	National Science, Research and Innovation Fund
ReLU	Rectified Linear Units
RNNs	Recurrent Neural Networks
SET	The Stock Exchange of Thailand
SETSMART	SET Market Analysis and Reporting Tool
SMOTE	Synthetic Minority Over-sampling Technique
Xavier-n	Xavier Normal initialization
Xavier-u	Xavier Uniform initialization

Appendix A

Table A1. Classification metrics for weight initializers on the Transformer model. Metrics include Accuracy, Precision, Recall, F1 Score, and MCC on the test set.

Group	Weight Initializer	Test Accuracy (%)	Test Precision (%)	Test Recall (%)	Test F1 Score (%)	Test MCC (%)
Traditional (data-agnostic baseline)	Kaiming-Normal	84.30 ± 0.55	81.84 ± 0.82	88.39 ± 0.97	84.98 ± 0.52	68.82 ± 1.09
	Kaiming-Uniform	84.23 ± 0.69	81.77 ± 0.90	88.33 ± 1.11	84.92 ± 0.66	68.69 ± 1.39
	Xavier-Normal	84.30 ± 0.55	81.84 ± 0.82	88.39 ± 0.97	84.98 ± 0.52	68.82 ± 1.09
	Xavier-Uniform	84.23 ± 0.69	81.77 ± 0.90	88.33 ± 1.11	84.92 ± 0.66	68.69 ± 1.39
	Normal	85.22 ± 0.50	84.02 ± 0.58	87.19 ± 1.28	85.57 ± 0.57	70.51 ± 1.02
	Uniform	84.33 ± 0.45	82.61 ± 0.70	87.19 ± 1.31	84.83 ± 0.51	68.77 ± 0.94
	Orthogonal	84.25 ± 0.58	81.70 ± 1.06	88.52 ± 0.95	84.96 ± 0.48	68.75 ± 1.08
Data-driven method
Variance-based	LSUV	85.20 ± 0.39	84.11 ± 0.90	87.02 ± 1.44	85.53 ± 0.45	70.47 ± 0.80
Error-driven (original Laor)	Laor-ori	84.14 ± 0.55	81.49 ± 0.86	88.56 ± 0.92	84.88 ± 0.50	68.53 ± 1.07
Gradient-informed (Laor variants)	Laor	84.71 ± 0.69	83.69 ± 1.77	86.54 ± 2.33	85.05 ± 0.69	69.54 ± 1.31
	Laor-kn	84.04 ± 0.58	81.96 ± 1.27	87.55 ± 1.88	84.64 ± 0.61	68.27 ± 1.16
	Laor-ku	83.97 ± 0.62	81.52 ± 0.68	88.06 ± 0.95	84.66 ± 0.61	68.15 ± 1.26
	Laor-xn	84.04 ± 0.58	81.96 ± 1.27	87.55 ± 1.88	84.64 ± 0.61	68.27 ± 1.16
	Laor-xu	83.97 ± 0.62	81.52 ± 0.68	88.06 ± 0.95	84.66 ± 0.61	68.15 ± 1.26
	Laor-n	85.05 ± 0.44	84.22 ± 1.09	86.49 ± 1.95	85.32 ± 0.58	70.16 ± 0.90
	Laor-u	84.28 ± 0.44	82.70 ± 0.41	86.91 ± 1.01	84.75 ± 0.49	68.66 ± 0.91
	Laor-o	84.29 ± 0.48	81.95 ± 0.99	88.18 ± 1.24	84.94 ± 0.44	68.79 ± 0.93