In this section, we apply our proposed algorithm to improve the training of deep learning models by reformulating their training tasks as structured convex optimization problems. Our approach is based on fixed-point theory, which provides strong theoretical guarantees for convergence and solution reliability. This makes the training process more stable, efficient, and robust, especially in the presence of noise or ill-conditioned data.
We focus on a class of models called Extreme Learning Machines (ELM) and their deeper extensions, Two-Hidden-Layer ELM (TELM). These models are known for their fast training and competitive accuracy. Unlike traditional neural networks, ELMs randomly assign hidden layer weights and only compute output weights, typically by solving a least-squares problem.
However, when the hidden layer output matrix is ill-conditioned or the data is noisy, direct pseudoinverse computations become unstable and prone to overfitting. To address this, we reformulate the training process as a convex minimization problem with regularization. This structure naturally fits into the framework of fixed-point problems, allowing us to apply our algorithm without relying on explicit matrix inversion.
4.1. Application to ELM
ELM is a neural network model initially proposed by Huang et al. [17]. ELM is well known for its rapid training capability and strong generalization performance. By integrating our algorithm into the ELM framework, we aim to boost both optimization efficiency and predictive accuracy.
Let us define the training dataset as $\{(x_i, t_i)\}_{i=1}^{s}$, consisting of $s$ input–target pairs, where $x_i$ denotes the input vector and $t_i$ denotes the associated target output.
ELM is designed for Single-Layer Feedforward Networks (SLFNs) and operates based on the following functional form:
\[
o_i = \sum_{j=1}^{h} \beta_j \, G(w_j \cdot x_i + b_j), \qquad i = 1, \dots, s,
\]
where $o_i$ is the predicted output, $h$ denotes the number of hidden neurons, $G$ is the activation function, $w_j$ and $\beta_j$ are the weight vectors for the input and output connections of the $j$-th hidden node, and $b_j$ is the corresponding bias term.
Let the hidden layer output matrix $H$ be defined as
\[
H = \begin{bmatrix}
G(w_1 \cdot x_1 + b_1) & \cdots & G(w_h \cdot x_1 + b_h) \\
\vdots & \ddots & \vdots \\
G(w_1 \cdot x_s + b_1) & \cdots & G(w_h \cdot x_s + b_h)
\end{bmatrix}_{s \times h}.
\]
The training objective is to find a solution that best approximates the output target:
\[
\sum_{j=1}^{h} \beta_j \, G(w_j \cdot x_i + b_j) = t_i, \qquad i = 1, \dots, s,
\]
which can be compactly written in matrix form as
\[
H\beta = T,
\]
where $\beta$ is the output weight vector and $T$ is the desired output matrix.
To enhance generalization and reduce overfitting, a LASSO regularization term is introduced. The resulting optimization problem becomes
\[
\min_{\beta} \; \|H\beta - T\|_2^2 + \lambda \|\beta\|_1,
\]
where $\|\cdot\|_1$ denotes the $\ell_1$-norm and $\lambda > 0$ is a regularization coefficient that controls sparsity.
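For concreteness, the sketch below solves the regularized problem above with plain proximal-gradient (ISTA) iterations built on the soft-thresholding operator. It is only a minimal stand-in for the proposed fixed-point algorithm (Algorithms 7 and 8 are not reproduced here); the sigmoid feature map, the synthetic data, and the step size $1/L$ with $L = 2\|H\|_2^2$ are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid activation G used to build the hidden layer features."""
    return 1.0 / (1.0 + np.exp(-z))

def elm_hidden_matrix(X, W, b):
    """Hidden layer output matrix H = G(XW + b) with random input weights W and biases b."""
    return sigmoid(X @ W + b)

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (componentwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(H, T, lam, n_iter=1000):
    """Solve min_beta ||H beta - T||_2^2 + lam * ||beta||_1 by proximal-gradient iterations
    (a stand-in for the paper's fixed-point scheme, Algorithms 7 and 8)."""
    L = max(2.0 * np.linalg.norm(H, 2) ** 2, 1e-12)   # Lipschitz constant of the smooth part
    step = 1.0 / L
    beta = np.zeros((H.shape[1], T.shape[1]))
    for _ in range(n_iter):
        grad = 2.0 * H.T @ (H @ beta - T)             # gradient of the least-squares term
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Illustrative run on synthetic data (shapes only; not the datasets used in the experiments)
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))                    # s = 200 samples, 10 features
T = rng.integers(0, 2, size=(200, 1)).astype(float)   # binary targets
W = rng.standard_normal((10, 100))                    # random input weights, h = 100 hidden nodes
b = rng.standard_normal((1, 100))                     # random biases
beta = lasso_ista(elm_hidden_matrix(X, W, b), T, lam=1e-3)
```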
4.2. Application to TELM
TELM is an extension of the traditional ELM that improves learning capacity by incorporating two hidden layers. Unlike conventional backpropagation-based multi-layer networks, TELM retains the fast training characteristics of ELM by leveraging analytic solutions in both stages. It is particularly suitable for modeling complex nonlinear relationships in high-dimensional data while avoiding the computational cost of iterative optimization.
A work by Janngam et al. [18] demonstrated that TELM, when trained using their proposed algorithm, not only converges significantly faster than standard ELM but also achieves higher classification accuracy on various medical and benchmark datasets. Additionally, earlier work by Qu et al. [19] showed that TELM consistently outperforms traditional ELM, especially in nonlinear and high-dimensional settings, by yielding better average accuracy with fewer hidden neurons.
These cumulative findings reinforce the choice of TELM as the core learning model for our study, particularly when enhanced with the proposed algorithm.
Let the training set be defined as $\{(x_i, t_i)\}_{i=1}^{s}$, where $x_i$ is the input vector and $t_i$ is the corresponding target output.
- Stage 1: Initial Feature Transformation and Output Weights.
To simplify the initialization process, TELM begins by temporarily combining the two hidden layers into a single equivalent hidden layer. The combined hidden layer matrix $H$ is defined as
\[
H = G(XW + B),
\]
where $X$ is the input matrix, $W$ is the randomly initialized weight matrix for the first hidden layer, $B$ is the bias matrix, and $G$ is the activation function.
The output weights $u$ connecting the hidden layer to the output layer are determined based on the linear system
\[
Hu = T,
\]
where $T$ is the target matrix. We find the optimal weight $u$ using Algorithms 7 and 8 for solving the convex optimization problem with LASSO regularization as follows:
\[
\min_{u} \; \|Hu - T\|_2^2 + \lambda \|u\|_1,
\]
where $\lambda > 0$ is the regularization parameter that controls model complexity and prevents overfitting. (A compact end-to-end sketch of all three stages is given after this list.)
- Stage 2: Separation and Refinement of Hidden Layers.
After computing the initial output weights $u$ from the first stage using (52), the two hidden layers are separated to allow independent refinement. To estimate the expected output of the second hidden layer, denoted as $H_2$, we express that it satisfies the following equation:
\[
H_2 u = T.
\]
However, rather than computing $H_2$ directly from matrix inversion, we apply our proposed algorithm to solve the following convex optimization problem with LASSO regularization:
\[
\min_{H_2} \; \|H_2 u - T\|_2^2 + \lambda \|H_2\|_1,
\]
where $\lambda > 0$ is the regularization parameter.
Next, TELM updates the weights and bias between the first and second hidden layers, denoted as $W_2$ and $B_2$, respectively, using the expected output $H_2$ from (57). Ideally, the following equation describes the connection between layers:
\[
G(HW_2 + B_2) = H_2.
\]
However, since both $W_2$ and $B_2$ are unknown, solving (55) directly is not feasible. To address this, we reformulate the equation as
\[
G(H_E W_{HE}) = H_2,
\]
where $H_E = [\,\mathbf{1} \;\; H\,]$ is the extended input matrix and $W_{HE} = \begin{bmatrix} B_2 \\ W_2 \end{bmatrix}$ combines the weights and biases into a single matrix.
To estimate $W_{HE}$, we solve the following convex optimization problem with LASSO regularization:
\[
\min_{W_{HE}} \; \|H_E W_{HE} - G^{-1}(H_2)\|_2^2 + \lambda \|W_{HE}\|_1,
\]
where $G^{-1}$ denotes the inverse of the activation function $G$, and $\lambda > 0$ is the regularization parameter. Finally, using the estimated $W_{HE}$ from (57), the refined output of the second hidden layer, $H_2^{*}$, is computed as
\[
H_2^{*} = G(H_E W_{HE}),
\]
where $H_2^{*}$ represents the updated output of the second hidden layer after adjusting the weights and biases.
- Final Stage: Output Layer Update.
Finally, TELM updates the output weight matrix $u_{\mathrm{new}}$, which connects the second hidden layer to the output layer, by solving
\[
H_2^{*} u_{\mathrm{new}} = T.
\]
To obtain $u_{\mathrm{new}}$, we solve the following convex optimization problem using the LASSO technique:
\[
\min_{u_{\mathrm{new}}} \; \|H_2^{*} u_{\mathrm{new}} - T\|_2^2 + \lambda \|u_{\mathrm{new}}\|_1,
\]
where $\lambda > 0$ is the regularization parameter. Once $u_{\mathrm{new}}$ is obtained, the predicted output matrix $Y$ is computed as
\[
Y = H_2^{*} u_{\mathrm{new}}.
\]
This approach enhances numerical stability and improves the model's ability to handle high-dimensional or noisy real-world data.
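The following compact sketch strings the three TELM stages together under the notation above. It reuses the sigmoid, soft_threshold, and lasso_ista helpers from the Section 4.1 sketch as a stand-in for Algorithms 7 and 8; the samples-as-rows shape convention, the clipped inverse sigmoid, and the transpose trick used to solve for $H_2$ are assumptions made for illustration only.

```python
import numpy as np
# Assumes sigmoid, soft_threshold and lasso_ista from the Section 4.1 sketch are in scope.

def logit(p, eps=1e-6):
    """Inverse of the sigmoid activation G, clipped for numerical safety."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p / (1.0 - p))

def telm_stage1(X, T, h, lam, rng):
    """Stage 1: random weights W and biases B, combined hidden matrix H = G(XW + B),
    and initial output weights u from min ||Hu - T||^2 + lam * ||u||_1."""
    W = rng.standard_normal((X.shape[1], h))
    B = rng.standard_normal((1, h))                    # broadcast over the s samples
    H = sigmoid(X @ W + B)
    u = lasso_ista(H, T, lam)                          # stand-in for Algorithms 7 and 8
    return H, u

def telm_stage2(H, u, T, lam):
    """Stage 2: expected second-hidden-layer output H2, stacked parameters W_HE,
    and the refined output H2_star = G(H_E W_HE)."""
    # H2 u = T solved as a LASSO problem in H2 via ||H2 u - T|| = ||u^T H2^T - T^T||
    H2 = lasso_ista(u.T, T.T, lam).T
    H_E = np.hstack([np.ones((H.shape[0], 1)), H])     # extended input matrix [1  H]
    W_HE = lasso_ista(H_E, logit(H2), lam)             # G(H_E W_HE) ~ H2  =>  H_E W_HE ~ G^{-1}(H2)
    H2_star = sigmoid(H_E @ W_HE)
    return H2_star, W_HE

def telm_final(H2_star, T, lam):
    """Final stage: new output weights u_new and predicted outputs Y = H2* u_new."""
    u_new = lasso_ista(H2_star, T, lam)
    return u_new, H2_star @ u_new

# Illustrative end-to-end run on synthetic binary-classification data
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
T = rng.integers(0, 2, size=(200, 1)).astype(float)
H, u = telm_stage1(X, T, h=100, lam=1e-3, rng=rng)
H2_star, W_HE = telm_stage2(H, u, T, lam=1e-3)
u_new, Y = telm_final(H2_star, T, lam=1e-3)
labels = (Y > 0.5).astype(int)                         # threshold for binary classification
```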
4.2.1. Experiments: Data Classification for Minimization Problems
Data classification is a fundamental task in machine learning, where the objective is to assign each input sample to one of several predefined categories. Common applications include medical diagnosis, object recognition, and fraud detection. In this work, we apply our proposed algorithm to train TELM for practical classification tasks.
To evaluate classification performance, we conducted experiments on three benchmark datasets and one real-world medical dataset. Each dataset was divided into 70% training and 30% testing sets. The details of the datasets are summarized in Table 1.
Breast Cancer Dataset: A widely used dataset containing features extracted from digitized images of breast masses, used to classify tumors as benign or malignant.
Heart Disease Dataset: A standard dataset used to predict the presence of heart disease based on clinical attributes.
Diabetes Dataset: Contains diagnostic data for predicting the onset of diabetes in patients.
Hypertension Dataset: A real-world dataset collected by Sripat Medical Center, Faculty of Medicine, Chiang Mai University.
Table 2 summarizes the parameter settings for each algorithm compared in our experiments.
In addition, the following settings were consistently applied across all experimental setups:
Regularization parameter: .
Activation function: Sigmoid, $G(x) = \dfrac{1}{1 + e^{-x}}$.
Number of hidden nodes: .
Contraction mapping: .
In Algorithm 6,
is defined by
To assess and compare the classification performance of each algorithm, we employed four widely used evaluation metrics: accuracy, precision, recall, and F1-score.
Accuracy measures the proportion of correctly classified samples, both positive and negative, relative to the total number of samples. It is computed as
\[
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN},
\]
where $TP$ and $TN$ are the true positives and true negatives, respectively; $FP$ is the number of false positives (incorrectly predicting a patient as diseased); and $FN$ is the number of false negatives (failing to detect a diseased patient).
Precision reflects the proportion of true positives among all instances predicted as positive:
\[
\mathrm{Precision} = \frac{TP}{TP + FP}.
\]
Recall, or sensitivity, represents the proportion of actual positive cases that are correctly identified:
\[
\mathrm{Recall} = \frac{TP}{TP + FN}.
\]
F1-score is the harmonic mean of precision and recall, providing a balanced measure of model performance, particularly in imbalanced datasets:
\[
\mathrm{F1\text{-}score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.
\]
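A small self-contained helper that computes these four metrics from binary predictions may clarify the definitions; the dictionary return format and the zero-division conventions are implementation choices, not part of the paper.

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1-score for binary labels (1 = positive class)."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))    # true positives
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))    # true negatives
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))    # false positives
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))    # false negatives
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) > 0 else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Example with hand-made predictions
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])
print(classification_metrics(y_true, y_pred))
```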
The performance of each algorithm is analyzed at the 1000th iteration, as presented in Table 3. Four datasets (breast cancer, heart disease, diabetes, and hypertension) were utilized to evaluate and compare the effectiveness of Algorithms 6 and 7 using standard classification metrics, including accuracy, precision, recall, and F1-score, on both training and testing data.
The results indicate that Algorithm 7 consistently performs well across all datasets. In particular, in the hypertension dataset, which reflects real-world conditions, Algorithm 7 achieves high accuracy and balanced precision–recall performance. This demonstrates its strong generalization capability and suitability for real-world medical applications that require reliable predictions and low error sensitivity.
To evaluate model performance with respect to both goodness-of-fit and model complexity, we utilize the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). These criteria are defined as follows:
Akaike Information Criterion (AIC):
\[
\mathrm{AIC} = 2k - 2\ln(\hat{L}),
\]
where $k$ is the number of estimated parameters in the model and $\hat{L}$ is the maximum value of the likelihood function.
Bayesian Information Criterion (BIC):
\[
\mathrm{BIC} = k\ln(n) - 2\ln(\hat{L}),
\]
where $n$ is the number of observations, $k$ is the number of parameters, and $\hat{L}$ is the maximum likelihood of the model.
Lower AIC and BIC values indicate better models in terms of balancing accuracy and simplicity.
To assess the consistency of the model's performance across multiple trials or datasets, we compute the mean and standard deviation (std) of the AIC and BIC values.
Standard Deviation of AIC and BIC:
\[
\mathrm{std}(\mathrm{AIC}) = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(\mathrm{AIC}_i - \overline{\mathrm{AIC}}\right)^2}, \qquad \overline{\mathrm{AIC}} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{AIC}_i,
\]
and analogously for BIC, where $N$ is the number of experimental runs.
These statistics indicate the central tendency and dispersion of the AIC and BIC scores, where smaller standard deviations imply more stable model performance across different experiments.
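As a sketch of how these statistics can be computed, the snippet below recovers the maximized log-likelihood from the residual sum of squares under a Gaussian error model; this Gaussian assumption, the placeholder residuals, and the sample (ddof = 1) standard deviation are assumptions, since the paper does not spell out how $\hat{L}$ is evaluated.

```python
import numpy as np

def aic_bic(residuals, k):
    """AIC and BIC from model residuals, using a Gaussian error model so that the
    maximized log-likelihood is recovered from the residual sum of squares
    (an assumption made for this sketch)."""
    n = residuals.size
    rss = float(np.sum(residuals ** 2))
    log_lik = -0.5 * n * (np.log(2.0 * np.pi * rss / n) + 1.0)  # maximized Gaussian log-likelihood
    aic = 2 * k - 2 * log_lik
    bic = k * np.log(n) - 2 * log_lik
    return aic, bic

# Mean and (sample) standard deviation of AIC over repeated trials with placeholder residuals
rng = np.random.default_rng(1)
residual_runs = [rng.standard_normal(300) for _ in range(5)]
aic_values = np.array([aic_bic(r, k=100)[0] for r in residual_runs])
print(aic_values.mean(), aic_values.std(ddof=1))
```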
To understand how well each algorithm fits the data without too much complexity, we compare their AIC and BIC values, as shown in Table 4. Both AIC and BIC are commonly used to measure how good a model is; lower values mean that the model is more efficient and avoids overfitting.
The results show that Algorithm 7 gives lower AIC and BIC values than Algorithm 6 for all datasets, meaning that Algorithm 7 achieves a better trade-off between goodness-of-fit and model complexity. The difference is most noticeable in the hypertension dataset, which comes from real-world health data. These results confirm that Algorithm 7 is a strong choice for real-world applications, where the model needs to be both accurate and not overly complex.
4.2.2. Application to Convex Bilevel Optimization Problems
The TELM model can also be formulated within the framework of convex bilevel optimization to better capture hierarchical learning structures. In this setting, we interpret the output weight learning (final step of TELM) as the solution to a lower-level convex problem, and the optimization of the hidden transformation weights (e.g., $W_{HE}$) as the upper-level objective.
In our TELM-based learning problem, this bilevel formulation arises naturally:
The inner problem corresponds to learning the output weights $u$ given the fixed transformation $W_{HE}$, and can be cast as a LASSO-type convex minimization:
\[
u^{*} \in \arg\min_{u} \; \|H_2^{*} u - T\|_2^2 + \lambda \|u\|_1,
\]
where $H_2^{*}$ is the second hidden layer output and $T$ is the target.
The outer problem focuses on optimizing the hidden transformation weights $W_{HE}$ based on the optimal solution $u^{*}$ from the inner problem. The upper-level loss is given by
\[
\min_{W_{HE}} \; \frac{1}{2}\,\|G(H_E W_{HE})\, u^{*} - T\|_2^2.
\]
Solving this bilevel problem directly is challenging due to the implicit constraint that $u^{*}$ must solve the inner problem. However, by leveraging our proposed algorithm and proximal operator techniques, we can solve both levels efficiently and with guaranteed convergence under mild assumptions. This makes TELM highly suitable for structured learning tasks where the learning objectives are nested and interdependent.
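The nested structure can be made explicit with two small functions: one returning the inner (lower-level) LASSO solution for a fixed transformation, and one evaluating the upper-level loss at that solution. This is only a structural sketch reusing the helpers from the Section 4.1 sketch; the quadratic upper-level loss and the function names are assumptions, and the actual bilevel solve in the experiments is carried out by Algorithm 8.

```python
import numpy as np
# Assumes sigmoid and lasso_ista from the Section 4.1 sketch are in scope.

def inner_solution(W_HE, H_E, T, lam):
    """Lower level: output weights u* for a fixed hidden transformation W_HE,
    obtained from the LASSO problem min_u ||H2* u - T||^2 + lam * ||u||_1."""
    H2_star = sigmoid(H_E @ W_HE)
    return lasso_ista(H2_star, T, lam)

def outer_loss(W_HE, H_E, T, lam):
    """Upper level: loss of the hidden transformation evaluated at the inner
    solution u*(W_HE); the quadratic form used here is an illustrative assumption."""
    u_star = inner_solution(W_HE, H_E, T, lam)
    return 0.5 * np.linalg.norm(sigmoid(H_E @ W_HE) @ u_star - T) ** 2
```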
To assess the performance of Algorithm 8 in solving convex bilevel optimization problems, we conducted experiments on the same datasets used in the convex optimization setting (see Section 4.2.1). These include the breast cancer, heart disease, diabetes, and hypertension datasets, with a 70%/30% split for training and testing, respectively.
We evaluated classification performance using the same metrics—accuracy, precision, recall, and F1-score—to ensure consistency across experiments.
In this bilevel setting, we compared our method against Algorithm 1 (BiGSAM), Algorithm 2 (iBiGSAM), Algorithm 3 (aiBiGSAM), Algorithm 4 (miBiGSAM), and Algorithm 5 (amiBiGSAM).
All algorithms were configured according to the parameter settings summarized in Table 5, ensuring fair and reproducible evaluation across all methods.
In addition, the following settings were consistently applied across all experimental setups:
Regularization parameter: .
Activation function: Sigmoid, $G(x) = \dfrac{1}{1 + e^{-x}}$.
Number of hidden nodes: .
To evaluate the effectiveness of the proposed algorithm (Algorithm 8), we conducted experiments on four datasets. Each algorithm was trained for 1000 iterations, and the performance was measured in terms of accuracy, precision, recall, and F1-score for both training and testing phases. The comparative results of all algorithms are summarized in Table 6.
As shown in Table 6, the proposed algorithm (Algorithm 8) consistently outperforms other methods across all datasets in both training and testing phases. In particular, for the breast cancer and diabetes datasets, Algorithm 8 achieves the highest test accuracy and F1-scores, demonstrating its strong generalization capability and classification performance.
Notably, in the hypertension dataset, which represents real-world medical data with high variability and complexity, the proposed method maintains superior accuracy and F1-score compared to baseline algorithms. This highlights the robustness and practical applicability of Algorithm 8 in real-world clinical settings.
Overall, the results support the effectiveness and stability of the proposed algorithm, making it a promising approach for medical classification tasks across diverse domains.
To statistically evaluate the performance of each algorithm, we computed the AIC and BIC values, including their mean and standard deviation, for both the training and testing phases. The experiments were conducted on four datasets: breast cancer, heart disease, diabetes, and hypertension. The summarized results presented in Table 7 serve to compare the statistical efficiency of each algorithm.
According to the results in Table 7, the proposed algorithm (Algorithm 8) consistently shows lower AIC and BIC values across several datasets. This means that the model fits the data well and is less likely to overfit. In particular, in the hypertension dataset, which contains real and complex medical data, Algorithm 8 achieves the lowest and most consistent scores. This shows that the algorithm can handle real-world situations effectively and gives reliable results.
From Table 6 and Table 7, it is evident that Algorithm 8 consistently outperforms all variants of BiG-SAM, including the improved versions (Algorithms 2–5). In all datasets considered in this work (see Table 6 and Table 7), Algorithm 8 achieves the highest classification performance and also yields the lowest AIC and BIC scores, suggesting a better model fit with lower complexity. Moreover, its standard deviations are relatively small, indicating robustness and stability across different runs. Therefore, Algorithm 8 can be considered the most effective and reliable algorithm among those evaluated.