5.1.1. Hyperband Algorithm Application
The number of neurons in the hidden layers of a neural network significantly affects model performance: a proper choice helps the model fit the data well, while an improper choice can lead to overfitting or underfitting. To verify the effectiveness of the Hyperband hyperparameter optimization method, this study also selected two other common optimization strategies for comparison, namely random search and Bayesian optimization. All three methods performed the hyperparameter search under the same neural network structure, number of training epochs (epochs = 30), training/validation split strategy (validation_split = 0.2), and early stopping mechanism, to ensure a fair comparison. We recorded each method's tuning time and final accuracy on the test set. The search space for the three parameter-tuning methods covers key parameters such as the number of convolutional layer channels, LSTM layer units, the dropout rate, and the number of fully connected layer neurons.
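To make the shared search space concrete, the sketch below expresses it as a Keras Tuner model-building function. This is a minimal illustration in TensorFlow/Keras: the layer arrangement and the value ranges for conv_channels, lstm_units, dropout, and dense_units are placeholders, not the exact ranges used in the study (those are listed in Table 6).

```python
import tensorflow as tf

# hp is a keras_tuner.HyperParameters object supplied by the tuner.
def build_model(hp):
    model = tf.keras.Sequential([
        # Convolutional layer channels
        tf.keras.layers.Conv1D(filters=hp.Int("conv_channels", 16, 128, step=16),
                               kernel_size=3, padding="same", activation="relu"),
        # (Bi)LSTM layer units
        tf.keras.layers.Bidirectional(
            tf.keras.layers.LSTM(hp.Int("lstm_units", 32, 256, step=32))),
        # Dropout rate
        tf.keras.layers.Dropout(hp.Float("dropout", 0.1, 0.5, step=0.1)),
        # Fully connected layer neurons
        tf.keras.layers.Dense(hp.Int("dense_units", 16, 128, step=16),
                              activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```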
The optimal model from each method was evaluated on the test set; their accuracy and time consumption are shown in Table 5. The results show that, although Hyperband took the longest time, it achieved the highest model accuracy (0.9876), indicating that its adaptive resource-allocation strategy allows it to explore high-performing configurations more thoroughly. Random search and Bayesian optimization required less tuning time, but their test accuracies were slightly lower, suggesting that Hyperband is more effective at discovering superior hyperparameter combinations under a fixed experiment budget.
In this study, we mainly used the Hyperband algorithm to select the appropriate number of neurons and other hyperparameter configurations. We first conducted preliminary experiments with Hyperband, random search, and Bayesian optimization in a small hyperparameter search space to determine appropriate parameter boundaries. On this basis, we expanded the search scope and used the Hyperband method for a more comprehensive search, to fully explore potential high-performance configurations. The hyperparameter search ranges and the optimal configuration found are shown in Table 6, where the Hyperband algorithm was configured with a maximum number of training epochs of 50, a reduction factor of 3, and 2 Hyperband iterations, with accuracy as the optimization objective. The search was run on the training set, using 20% of the data for validation, with early stopping based on validation loss to prevent overfitting.
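Under these settings, the Hyperband search could be configured as in the sketch below, assuming Keras Tuner's kt.Hyperband and the build_model function sketched earlier in this section. Here x_train and y_train stand in for the training inputs and labels, and the mapping of the paper's settings onto max_epochs, factor, and hyperband_iterations is our reading of the configuration.

```python
import keras_tuner as kt
import tensorflow as tf

tuner = kt.Hyperband(
    build_model,                 # model-building function from the sketch above
    objective="val_accuracy",    # accuracy as the optimization objective
    max_epochs=50,               # maximum number of training epochs
    factor=3,                    # reduction factor (eta)
    hyperband_iterations=2,      # number of Hyperband iterations
)

# Early stopping on validation loss; 20% of the training data held out for validation.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                              restore_best_weights=True)
tuner.search(x_train, y_train, validation_split=0.2, callbacks=[early_stop])
best_hp = tuner.get_best_hyperparameters(num_trials=1)[0]
```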
5.1.2. Comparative Analysis of CNN-BiLSTM-AT Model Structure
A series of experiments was designed to compare the effects of different model structures, batch sizes, and time steps, using the same dataset for training and evaluation: m1 denotes 1 CNN layer + 1 BiLSTM layer + 1 AT layer, m2 denotes 1 CNN layer + 2 BiLSTM layers + 1 AT layer, m3 denotes 2 CNN layers + 1 BiLSTM layer + 1 AT layer, and m4 denotes 2 CNN layers + 2 BiLSTM layers + 1 AT layer, as shown in Table 7.
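For reference, the sketch below shows one way the m1 structure (1 CNN + 1 BiLSTM + 1 AT) could be assembled in Keras. The additive attention layer and the default layer sizes are our assumptions, since Table 7 specifies only the layer counts.

```python
import tensorflow as tf

def build_m1(w, n_features, conv_channels=64, lstm_units=64):
    inputs = tf.keras.Input(shape=(w, n_features))        # w time steps of indicators
    x = tf.keras.layers.Conv1D(conv_channels, 3, padding="same",
                               activation="relu")(inputs)             # 1 CNN layer
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(lstm_units, return_sequences=True))(x)   # 1 BiLSTM layer
    # 1 AT layer: simple additive attention over the w time steps
    scores = tf.keras.layers.Dense(1, activation="tanh")(x)           # (batch, w, 1)
    weights = tf.keras.layers.Softmax(axis=1)(scores)
    context = tf.keras.layers.Lambda(
        lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([x, weights])   # weighted sum
    outputs = tf.keras.layers.Dense(1, activation="sigmoid")(context)
    return tf.keras.Model(inputs, outputs)
```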
(1) Analysis of Results from Different Model Structures
Table 8 shows the accuracy of the four models (m1 to m4) under different time step (w) and batch size settings. Overall, as the time step increased, model accuracy generally first increased and then decreased, reaching high levels in the range w = 6 to w = 10. The m1 model achieved the highest accuracy, 99.4%, at a time step of w = 8 and a batch size of 32.
From the perspective of model structure, m1 and m2 achieved high accuracy across all time step and batch size settings, and were especially stable at the larger time steps (w = 6 to w = 10), demonstrating good learning and generalization abilities. In contrast, m3 and m4 showed noticeable fluctuations at certain time steps and slightly lower overall accuracy, suggesting that their structures are relatively weaker at processing financial indicator series data.
From the perspective of batch size, there was no significant difference in model performance between the two batch size settings in most cases. However, in certain cases (m4 model from w = 5 to w = 12), the accuracy of models with a batch size of 32 was generally slightly higher.
From the perspective of time steps, models with smaller time steps (w = 4 to w = 5) exhibited relatively low accuracy, while larger time steps (w = 6 to w = 10) yielded high accuracy. However, performance declined sharply at w = 12, indicating that excessively long time steps bias the dataset and make it difficult for the models to learn effectively.
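For clarity, the sketch below shows how samples with a time step w can be built from a company's indicator series. The shape conventions (one row per period, the label aligned with the window's last period) are illustrative assumptions, not the study's exact preprocessing.

```python
import numpy as np

def make_windows(series, labels, w):
    """series: (T, n_features) array for one company; labels: (T,) array.
    Returns windows of length w and the label of each window's last period."""
    X, y = [], []
    for t in range(len(series) - w + 1):
        X.append(series[t:t + w])
        y.append(labels[t + w - 1])
    return np.array(X), np.array(y)
```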
To ensure the reliability of the significance test results and verify that the observed differences in model accuracy are not due to random chance, we did not fix the random seed during model training. Instead, each model was independently run 15 times with different random initializations.
Figure 5 shows the accuracy distribution of the four models (m1, m2, m3, m4) over the 15 experiments. The accuracy of m1 and m4 is relatively stable, while m2 and m3 exhibit larger fluctuations.
The average performance across these runs was used to conduct the Friedman test [23], a non-parametric statistical test suitable for comparing multiple models across multiple datasets. As shown in Table 9, the resulting Friedman statistic was 34.3061 with a p-value of 0.000 (p < 0.001), indicating statistically significant differences among the models.
To further determine which models differed significantly, we applied the Bonferroni–Dunn post hoc test [24]. As shown in Table 10, model m1 performed significantly better than m2, m3, and m4, with Bonferroni-corrected p-values of 0.0349, 0.0000, and 0.0000, respectively, whereas the differences among m2, m3, and m4 were not significant (p > 0.05). These findings confirm that the performance improvement of m1 is statistically significant rather than due to random variation.
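A sketch of this testing procedure is shown below, assuming the 15 accuracy values per model are collected column-wise. scipy provides the Friedman test, and scikit-posthocs' Dunn test with Bonferroni adjustment is used here as a stand-in for the Bonferroni–Dunn procedure.

```python
import numpy as np
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

acc = np.random.rand(15, 4)  # placeholder: 15 runs x 4 models (m1..m4)

# Friedman test across the four models
stat, p = friedmanchisquare(acc[:, 0], acc[:, 1], acc[:, 2], acc[:, 3])
print(f"Friedman statistic = {stat:.4f}, p = {p:.4f}")

# Post hoc pairwise Dunn test with Bonferroni-adjusted p-values
pvals = sp.posthoc_dunn(acc.T.tolist(), p_adjust="bonferroni")
print(pvals)  # rows/columns correspond to m1..m4
```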
(2) Results Analysis of the Best CNN-BiLSTM-AT Model
After conducting experiments on the different CNN-BiLSTM-AT model structures, we found that the m1 model (1CNN-1BiLSTM-AT) achieved the highest validation accuracy (0.994) at a time step of 8, a batch size of 32, and 180 training epochs. The loss and accuracy curves in Figure 6 show that the model converges well and remains stable during training and validation. This indicates that, under the optimal configuration found by the Hyperband algorithm, the m1 model has strong generalization ability and excellent performance, making it an ideal model structure.
According to the performance evaluation for the best CNN-BiLSTM-AT model (Table 11), the overall accuracy was 99.4%, indicating strong performance. For the non-ST class, 99.5% of samples predicted as non-ST were truly non-ST (precision), and 99.8% of actual non-ST samples were correctly identified (recall). For the ST class, the precision was 99.1% and the recall was 96.6%. In addition, the model trained without SMOTE achieved an accuracy of only 97.9%, with all evaluation indicators lower than those of the SMOTE-based model, confirming the effectiveness of the SMOTE method.
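These per-class figures correspond to the precision and recall reported by standard tooling; a minimal sketch with scikit-learn follows, where y_test and y_pred are small placeholders for the true and predicted labels.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Placeholders for the true and predicted labels (0 = non-ST, 1 = ST).
y_test = [0, 0, 0, 1, 1, 0, 1, 0]
y_pred = [0, 0, 0, 1, 0, 0, 1, 0]

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred,
                            target_names=["non-ST", "ST"], digits=3))
```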
(3) Feature Analysis of the Best CNN-BiLSTM-AT Model
① Key Features Highlighted by the Attention Mechanism
To interpret how the 1CNN-1BiLSTM-AT model identifies financial risks, we extracted and visualized the attention weights learned by the model. Specifically, the attention scores assigned to each financial indicator were averaged across all samples and time steps to obtain a global importance score. These scores were visualized in a horizontal bar chart (see Figure 7), where the top 10 features are ranked by their average attention weights. The results show that the model places the highest attention on the current asset turnover, working capital ratio, current assets ratio, inventory turnover, operating profit margin, current liabilities ratio, net profit margin on fixed assets, earnings before interest and taxes per share, operating profit rate, and net profit margin on total assets.
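A sketch of this computation is given below. The attention-score array att (samples × time steps × features) and the indicator names are placeholders, since the exact extraction path depends on the attention implementation.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholders: per-feature attention scores extracted from the trained model.
att = np.random.rand(500, 8, 20)                  # (samples, time steps, features)
feature_names = [f"indicator_{i}" for i in range(att.shape[-1])]

global_importance = att.mean(axis=(0, 1))         # average over samples and time steps
top = np.argsort(global_importance)[-10:]         # indices of the top 10 features

plt.barh([feature_names[i] for i in top], global_importance[top])
plt.xlabel("Average attention weight")
plt.tight_layout()
plt.show()
```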
② Features Identified Through Ablation Study
This study used the feature-ablation method to evaluate which features significantly impact the performance of the CNN-BiLSTM-AT model. The method trains a benchmark model with all features and records its performance metrics; it then removes each feature in turn, retrains the model, and measures the change in performance. A significant performance decrease indicates that the removed feature is important to the model, and the magnitude of the decrease determines the feature ranking, as sketched below.
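A minimal sketch of the ablation loop, where train_and_evaluate is a hypothetical helper that trains the model on the given data and returns test-set accuracy, and feature_names, X_train, y_train, X_test, and y_test are assumed from the preprocessing step:

```python
import numpy as np

# Hypothetical helper: trains the CNN-BiLSTM-AT model and returns test accuracy.
# def train_and_evaluate(X_train, y_train, X_test, y_test) -> float: ...

baseline_acc = train_and_evaluate(X_train, y_train, X_test, y_test)

drops = {}
for j, name in enumerate(feature_names):
    X_tr = np.delete(X_train, j, axis=-1)   # drop feature j along the feature axis
    X_te = np.delete(X_test, j, axis=-1)
    drops[name] = baseline_acc - train_and_evaluate(X_tr, y_train, X_te, y_test)

# A larger accuracy drop indicates a more important feature.
ranking = sorted(drops.items(), key=lambda kv: kv[1], reverse=True)
```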
Figure 8 shows that the top 10 features had a significant impact on model performance: return on assets, total asset turnover, working capital ratio, net profit margin on total assets, cash assets ratio, fixed assets ratio, quick ratio, current liabilities ratio, earnings before interest and taxes per share, and debt-to-asset ratio.
The analysis combines the results of the attention mechanism and the ablation study to understand better how the model identifies key financial indicators. Several indicators, such as the working capital ratio, current liabilities ratio, net profit margin on total assets, and earnings before interest and taxes per share, are important in both methods, suggesting they play a consistent and critical role in financial distress prediction.