Article

Training of an Extreme Learning Machine Autoencoder Based on an Iterative Shrinkage-Thresholding Optimization Algorithm

1 Doctorado en Modelamiento Matemático Aplicado, Universidad Católica del Maule, Talca 3480112, Chile
2 Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca 3480112, Chile
3 Departamento de Ciencias de la Computación e Industrias, Universidad Católica del Maule, Talca 3480112, Chile
4 Departamento de Matemáticas, Física y Estadística, Universidad Católica del Maule, Talca 3480112, Chile
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Appl. Sci. 2022, 12(18), 9021; https://doi.org/10.3390/app12189021
Submission received: 3 August 2022 / Revised: 4 September 2022 / Accepted: 5 September 2022 / Published: 8 September 2022
(This article belongs to the Topic Advances in Artificial Neural Networks)

Abstract

Orthogonal transformations, proper decomposition, and the Moore–Penrose inverse are traditional methods of obtaining the output layer weights for an extreme learning machine autoencoder. However, an increase in the number of hidden neurons causes higher convergence times and computational complexity, whereas the generalization capability is low when the number of neurons is small. One way to address this issue is to use the fast iterative shrinkage-thresholding algorithm (FISTA) to minimize the output weights of the extreme learning machine. In this work, we aim to improve the convergence speed of FISTA by using two fast algorithms of the shrinkage-thresholding class, called greedy FISTA (G-FISTA) and linearly convergent FISTA (LC-FISTA). Our method is an exciting proposal for decision-making involving the resolution of many application problems, especially those requiring longer computational times. In our experiments, we adopt six public datasets that are frequently used in machine learning: MNIST, NORB, CIFAR10, UMist, Caltech256, and Stanford Cars. We apply several metrics to evaluate the performance of our method, and the object of comparison is the FISTA algorithm due to its popularity for neural network training. The experimental results show that G-FISTA and LC-FISTA achieve higher convergence speeds in the autoencoder training process; for example, in the Stanford Cars dataset, G-FISTA and LC-FISTA are faster than FISTA by 48.42% and 47.32%, respectively. Overall, all three algorithms maintain good values of the performance metrics on all databases.

1. Introduction

In pattern recognition systems, efficient methods of feature selection and extraction can reduce the dimensionality problem, thus reducing both the computation time and the memory requirements of the training algorithms [1]. An autoencoder is a feed-forward neural network that builds a compact representation of the input data, and is mainly used for unsupervised learning [2]. It is composed of an encoder and a decoder: the encoder reads the input data and maps them to a lower-dimensionality space, while the decoder reads the compact representation and reconstructs the neural network input. In the same way as for all supervised learning neural networks [3,4], the core aspect of the training of an autoencoder is the backpropagation algorithm. This algorithm iteratively tunes the weights and biases of the neural network by applying the gradient descent method. Autoencoder networks are in great demand in multiple applications in modern society, for example, for dimensionality reduction, image retrieval, denoising, and data augmentation [2].
Several works in the literature have developed feature extraction methods using autoencoders and backpropagation. A comparative study between the performance of a traditional autoencoder and a denoising sparse autoencoder is presented in [5]. Another widely used architecture is the stacked autoencoder, which includes the stacked denoising sparse autoencoder [6] and the symmetric stacked autoencoder with shared weights [7]. Semi-supervised learning algorithms have also been used as autoencoder training algorithms [8,9]. However, the backpropagation training technique limits the generalization ability of the autoencoder, which results in low computational efficiency and the presence of local optima.
An extreme learning machine (ELM) is an emerging training paradigm for neural networks which includes supervised [10], semi-supervised [11], and unsupervised training [12]. The fundamental principle of this architecture is the random assignment of weights and biases in the hidden layer and the analytical determination of the weights of the output layer, in which the least squares method is applied to a system of linear equations [13,14]. The simplicity of the model, the small number of training parameters, the fast convergence, and the high generalization capacity give ELM neural networks advantages for classification, regression, and clustering problems. Variants such as the kernel-ELM (K-ELM) [15], which significantly reduces the number of parameters and increases the training speed, and the online sequential ELM [16], which can process batches of data, can even deal efficiently with problems involving large volumes of data. In the field of rolling element fault diagnosis, ELM can also solve the problem of weak signals caused by long transmission paths [17]; for example, the novel method presented in [18] can help in frame feature selection for ELM. In computer vision, the researchers of [19] proposed a visual object tracking scheme with promising results.
On the other hand, to handle the problems that local minima can cause, some studies have used heuristic optimization techniques to estimate the network parameters, such as differential evolution, particle swarm optimization (PSO), and genetic algorithms [20]. These heuristic algorithms search for the global minimum of the objective function and are efficient in several research fields, as is the case for the PSO algorithm that adjusts the parameters of an underactuated surface search model [21]. An improved scheme of this method integrates PSO with global optimization capability, a 3-Opt algorithm with local search capability, and a fuzzy system with fuzzy reasoning ability [22]. The probabilistic ant colony algorithm is another optimization technique integrated into the architecture of an ELM network [23]. The K-ELM method improves the hyperspectral image classification capacity by coupling principal component analysis, the local binary pattern, and the gray wolf optimization algorithm with global search capability [24].
Due to the advantages of ELM over backpropagation, the authors of [25] proposed to train autoencoder networks using an extreme learning machine (ELM-AE). A sparse ELM-AE was presented in [26] by adding the $L_1$-norm to the quadratic term to give a more significant representation of data. Another approach for unlabeled feature extraction consisted of a generalized ELM-AE in which a manifold regularization factor was added to the objective function of the ELM-AE [27]. An unsupervised learning ELM-AE was proposed in [28], inspired by the use of embedding graphs to capture the structure of the input patterns, and was constructed using local Fisher discrimination analysis. A dense connection autoencoder was introduced into multi-layer learning, in which the output features of the previous layers were used by the following layers [29]. A double random hidden layer ELM autoencoder can achieve efficient extraction of features with dimension reduction capability for deep learning [30]. The $L_{2,1}$-norm regularization has been used for the ELM-AE optimization problem, creating a new unsupervised learning framework that included a sparse representation despite the influence of atypical values and noisy data [31]. In addition, the correntropy-based ELM autoencoder presented in [32] was developed to extract features from noisy input patterns. To lower the number of parameters to be tuned and for the efficient calculation of the pseudo-inverse matrix, an ELM-AE with a kernel function was proposed in [33,34]. From a different perspective, the need for unsupervised batch learning means that an online sequential ELM-AE [35,36] is an exciting approach to solving large-scale applications.
From the works described above, we see that the backpropagation method and the Moore–Penrose inverse are two classical methods of estimating the weights of an autoencoder network. However, using these methods may cause overfitting when the number of unknown variables is larger than the number of training data. In addition, the accuracy is low when the number of hidden neurons is small, and the training time increases significantly when solving higher-dimensional applications. To overcome these drawbacks, researchers have proposed the use of sparse regularization techniques [37], with FISTA being the most widely used algorithm [38]. Although its origin dates back to 2009, this algorithm is still under study, thanks to its convergence speed, accuracy, and straightforward implementation. Several state-of-the-art studies have demonstrated the successful use of this optimization tool in machine learning. For example, in multilayer network design, the minimization problem defined in the autoencoder uses shrinkage-thresholding optimization techniques for dimension reduction and sparse feature representation [26,32,39]. For noisy image and video processing, a mixed scheme was presented in [40], in which sparse coding was adopted with deep learning to solve supervised and unsupervised tasks. A visual dictionary classification scheme was addressed in [41] using a simple (deep) convolutional autoencoder, for which the source of inspiration was sparse optimization and the iterative shrinkage-thresholding algorithm [38]. More recent studies [42,43] have demonstrated this algorithm’s development, continuity, and importance for solving convex minimization problems. In particular, they estimated the output weights of ELM and validated their results on classification problems.
Due to the importance of sparse regularization in the mathematical modeling of many real situations and the ability of shrinkage-thresholding algorithms to minimize these optimization problems, our work aims to study these optimization techniques in the architecture of an ELM-AE since there are few studies in the literature. Specifically, this paper proposes the use of fast algorithms from the shrinkage-thresholding family to improve the convergence speed achieved by FISTA during the training of an ELM-AE. In particular, two representative models of this class are considered: G-FISTA [44] and LC-FISTA [45,46]. To evaluate the efficiency of our scheme, six datasets that are widely used in the validation of ELM algorithms are adopted: MNIST, NORB, CIFAR10, UMist, Caltech256, and Stanford Cars. The main contributions of our article are as follows:
  • We present a novel method for computing the output layer weights of the ELM-AE network that avoids the need for the Moore–Penrose inverse. Instead of solving the linear system analytically, we use first-order iterative algorithms from the class of shrinkage-thresholding algorithms to minimize the output weights of the ELM-AE.
  • We demonstrate experimentally that the proposed method can effectively improve the computational time of the FISTA algorithm while maintaining its generalization capability. According to the theory presented in [47], the ELM-AE experiences a better performance when it achieves the lowest training error.
  • Compared to FISTA, the G-FISTA and LC-FISTA algorithms show a considerable improvement in the training time of the ELM-AE for all of the databases, together with a very competitive reconstruction capability. Thus, applying this mathematical tool in other contexts would constitute a significant scientific contribution.
The rest of this work is structured as follows: in Section 2, we briefly explain the ELM-AE, FISTA, G-FISTA, and LC-FISTA algorithms. Section 3 describes the method proposed to train the ELM-AEs. The datasets and the results are presented in Section 4. The discussion of results is presented in Section 5. Finally, Section 6 gives the conclusions of this work.

2. ELM-AE and Shrinkage-Thresholding Algorithms

2.1. Extreme Learning Machine Autoencoder

The ELM-AE [25] is trained with an ELM algorithm [10] in an unsupervised manner. It corresponds to a single hidden layer feedforward neural network, in which the outputs match the inputs. The ELM-AE has both a high training speed and a strong generalization ability due to the pseudo-random creation of neurons in the hidden layer and the application of a pseudoinverse matrix to calculate the weights of the output layer. To improve the performance of this method, the authors of [25] assigned random orthogonal weights and biases to the hidden layer. This approach was shown to have good generalization, and to minimize $\|H\beta - X\|_2^2$ and the squared norm of the coefficients $\|\beta\|_2^2$. Given a set of $N$ training samples $\{x_i \mid i \le N\}$, where $x_i \in \mathbb{R}^m$ are both the input and output data, the output with $L$ hidden neurons can be expressed as follows:
$$f_L(x_j) = \sum_{i=1}^{L} \beta_i \, g(w_i \cdot x_j + b_i), \quad j = 1, \ldots, N, \tag{1}$$
where $w_i$ and $b_i$ represent the pseudorandom weights and biases of the hidden layer, $g$ is an activation function, and $\beta_i$ represents the weights of the output layer. The previous overdetermined linear system can be compactly described as follows:
$$H\beta = X, \tag{2}$$
where $H \in \mathbb{R}^{N \times L}$ is the hidden layer output matrix, $\beta \in \mathbb{R}^{L \times m}$ is the output layer weight matrix, and $X \in \mathbb{R}^{N \times m}$ is the input data matrix.
Indeed, $\beta = H^{\dagger} X$ is the standard solution to the overdetermined problem in (2), where $H^{\dagger} = (H^T H)^{-1} H^T$ is the Moore–Penrose inverse of matrix $H$. To improve the performance of an ELM-AE, the authors of [25] determined the output weights $\beta$ by means of the equation $\beta = (\lambda I + H^T H)^{-1} H^T X$, which is the result of solving the following regularized optimization problem:
$$\beta = \operatorname*{argmin}_{\beta} \; \|H\beta - X\|_2^2 + \lambda \|\beta\|_2^2, \tag{3}$$
where $\lambda$ is a tuning regularization parameter.
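As an illustration of the closed-form training described above, the following Python/NumPy sketch computes ridge-regularized output weights from the normal equations obtained by setting the gradient of the regularized least squares objective to zero, that is, $(H^T H + \lambda I)\beta = H^T X$. The function name `elm_ae_ridge`, the sampling ranges, and the toy data are assumptions made for this example (orthogonalization of the hidden weights is omitted for brevity); it is not the authors' MATLAB implementation.

```python
import numpy as np

def sigmoid(a):
    """Logistic activation g."""
    return 1.0 / (1.0 + np.exp(-a))

def elm_ae_ridge(X, L, lam, seed=0):
    """Ridge-regularized ELM-AE output weights (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N, m = X.shape
    W = rng.uniform(-1.0, 1.0, size=(m, L))   # random hidden weights (assumed range)
    b = rng.uniform(0.0, 1.0, size=L)         # random hidden biases (assumed range)
    H = sigmoid(X @ W + b)                    # hidden layer output, shape (N, L)
    # Normal equations: (H^T H + lam * I) beta = H^T X
    beta = np.linalg.solve(H.T @ H + lam * np.eye(L), H.T @ X)
    return H, beta

# Toy data standing in for flattened images scaled to [0, 1].
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(50, 8))
H, beta = elm_ae_ridge(X, L=5, lam=0.1)
print(beta.shape)  # (5, 8)
print(np.linalg.norm(H @ beta - X) / np.linalg.norm(X))  # relative reconstruction error
```

Solving the linear system with `np.linalg.solve` avoids forming an explicit inverse, which is usually preferable numerically.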

2.2. The Shrinkage-Thresholding Class of Optimization Algorithms

A large number of applications can be formulated as a convex optimization problem [48], as shown by the following expression:
$$\min_{\tau \in \mathbb{R}^n} \; h(\tau) = f(\tau) + \phi(\tau), \tag{4}$$
where $f, \phi : \mathbb{R}^n \to \mathbb{R}$ are two convex functions, $f$ is differentiable with an $L_f$-Lipschitz continuous gradient, and $\phi$ is a non-differentiable function that cannot be easily minimized in several machine learning applications. The non-smooth term $\phi$ is handled through its proximal operator [48], as shown in the following equation:
$$\mathrm{prox}_{\rho\phi}(z) = \operatorname*{argmin}_{\tau} \left\{ \frac{1}{2}\|\tau - z\|_2^2 + \rho\,\phi(\tau) \right\}, \tag{5}$$
where $\rho$ is the step size and $z$ is the initial estimate of the solution. As a consequence, the optimization of the function $h$ uses a two-phase method: (i) a forward gradient descent step on the smooth function $f$; and (ii) an application of the proximal operator, that is, a backward step on the non-smooth function $\phi$.
In the following, a brief description of the FISTA, G-FISTA, and LC-FISTA algorithms is presented. These algorithms can be used to train ELM-AEs.

2.2.1. Fast Iterative Shrinkage Thresholding Algorithm

FISTA [38] is characterized by its convergence speed and high level of efficiency. When minimizing the convex problem in (4), the algorithm has a convergence rate on the order of $O(1/k^2)$. The sequence of steps of the FISTA algorithm is as follows:
(1) Update: $\tau_k = \mathrm{prox}_{\rho\phi}(z_k - \rho\nabla f(z_k))$.
(2) Execute the intermediate step: $t_{k+1} = \frac{1 + (1 + 4t_k^2)^{1/2}}{2}$.
(3) Update: $z_{k+1} = \tau_k + \gamma_k(\tau_k - \tau_{k-1})$.
The term $\rho \in (0, \frac{1}{L_f}]$ represents the step size, and $\gamma_k = \frac{t_k - 1}{t_{k+1}}$ represents the impulse factor. FISTA's convergence rate depends mainly on its proximal operator and the use of a weighted step $z_k$, which is expressed as a combination of both $\tau_{k-1}$ and $\tau_k$. In addition, the configuration of the Lipschitz constant $L_f$ of the gradient $\nabla f$ plays an essential part in the convergence of the algorithm.
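The three FISTA steps can be sketched in Python/NumPy as follows. To keep the example concrete, it assumes an $\ell_1$-penalized least squares instance, so the proximal step reduces to soft-thresholding. The smooth term is taken as $f(\beta) = \frac{1}{2}\|H\beta - X\|_F^2$, a scaling assumed here so that the largest eigenvalue of $H^T H$ is exactly the Lipschitz constant of $\nabla f$. The names `fista`, `shrink`, and `objective` are hypothetical.

```python
import numpy as np

def shrink(z, thr):
    """Soft-thresholding: proximal operator of thr * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def objective(H, X, beta, lam):
    """h(beta) = 0.5 * ||H beta - X||_F^2 + lam * ||beta||_1 (assumed scaling)."""
    return 0.5 * np.sum((H @ beta - X) ** 2) + lam * np.sum(np.abs(beta))

def fista(H, X, lam, n_iter=200):
    """FISTA for the l1-regularized least squares problem (illustrative sketch)."""
    L_f = np.linalg.eigvalsh(H.T @ H)[-1]      # largest eigenvalue of H^T H
    rho = 1.0 / L_f                            # step size in (0, 1/L_f]
    tau_prev = np.zeros((H.shape[1], X.shape[1]))
    z = tau_prev.copy()
    t = 1.0
    for _ in range(n_iter):
        grad = H.T @ (H @ z - X)                              # gradient of f at z_k
        tau = shrink(z - rho * grad, lam * rho)               # step (1)
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0     # step (2)
        gamma = (t - 1.0) / t_next                            # impulse factor
        z = tau + gamma * (tau - tau_prev)                    # step (3)
        tau_prev, t = tau, t_next
    return tau_prev

rng = np.random.default_rng(0)
H = rng.normal(size=(40, 10))
X = rng.normal(size=(40, 3))
beta = fista(H, X, lam=0.5)
print(objective(H, X, np.zeros_like(beta), 0.5), objective(H, X, beta, 0.5))
```

If $f$ is used without the factor $\frac{1}{2}$, both the gradient and its Lipschitz constant double, so the step size would need to be halved accordingly.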

2.2.2. Greedy FISTA

In general terms, G-FISTA [44] is a variant of FISTA that is characterized by a higher speed. Given the conditions imposed by G-FISTA, its architecture can restart the FISTA algorithm with the fixed impulse factor $\gamma_k = 1$. This algorithm updates the solution to the optimization problem in (4), setting $\rho \in [\frac{1}{L_f}, \frac{2}{L_f}]$, $\xi < 1$ and $S > 1$, until the maximum number of iterations is reached. The updating scheme is as follows:
(1) Update: $\tau_k = \mathrm{prox}_{\rho\phi}(z_k - \rho\nabla f(z_k))$.
(2) Update: $z_{k+1} = \tau_k + (\tau_k - \tau_{k-1})$.
(3) Restart: if $(z_k - \tau_k)^T(\tau_k - \tau_{k-1}) \geq 0$, then $z_{k+1} = \tau_k$.
(4) Safeguard: if $\|\tau_k - \tau_{k-1}\| \geq S\|\tau_1 - \tau_0\|$, then $\rho = \max\{\xi\rho, \frac{1}{L_f}\}$.
The G-FISTA algorithm improves the convergence rate and the reconstruction capability of FISTA by using a self-adaptive restart and adjustment scheme to obtain an additional acceleration and alleviate the oscillation problem in the reconstruction process. Based on the results discussed in [44,49], G-FISTA has efficient performance when $\rho \in [\frac{1}{L_f}, \frac{1.3}{L_f}]$. For this reason, all experiments performed in this work were configured with $\rho = 1.3/L_f$, $\xi = 0.96$, and $S = 1.1$.
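A minimal Python/NumPy sketch of the G-FISTA update scheme is given below, using the parameter values quoted in the text ($\rho = 1.3/L_f$, $\xi = 0.96$, $S = 1.1$). As assumptions for illustration, the problem is an $\ell_1$-penalized least squares instance with $f(\beta) = \frac{1}{2}\|H\beta - X\|_F^2$, the inner product in the restart test is the Frobenius inner product, and the reference step $\|\tau_1 - \tau_0\|$ is read as the distance between the first iterate and the zero initial point. The function names are hypothetical.

```python
import numpy as np

def shrink(z, thr):
    """Soft-thresholding: proximal operator of thr * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def objective(H, X, beta, lam):
    """h(beta) = 0.5 * ||H beta - X||_F^2 + lam * ||beta||_1 (assumed scaling)."""
    return 0.5 * np.sum((H @ beta - X) ** 2) + lam * np.sum(np.abs(beta))

def g_fista(H, X, lam, n_iter=300, xi=0.96, S=1.1):
    """Greedy FISTA with restart and safeguard (illustrative sketch)."""
    L_f = np.linalg.eigvalsh(H.T @ H)[-1]
    rho = 1.3 / L_f
    tau_prev = np.zeros((H.shape[1], X.shape[1]))
    z = tau_prev.copy()
    first_step = None                                  # stores ||tau_1 - tau_0||
    for _ in range(n_iter):
        grad = H.T @ (H @ z - X)
        tau = shrink(z - rho * grad, lam * rho)        # step (1)
        diff = tau - tau_prev
        if np.sum((z - tau) * diff) >= 0.0:            # step (3): restart
            z_next = tau
        else:
            z_next = tau + diff                        # step (2): impulse factor 1
        step = np.linalg.norm(diff)
        if first_step is None:
            first_step = step
        elif step >= S * first_step:                   # step (4): safeguard
            rho = max(xi * rho, 1.0 / L_f)
        tau_prev, z = tau, z_next
    return tau_prev

rng = np.random.default_rng(0)
H = rng.normal(size=(40, 10))
X = rng.normal(size=(40, 3))
beta = g_fista(H, X, lam=0.5)
print(objective(H, X, np.zeros_like(beta), 0.5), objective(H, X, beta, 0.5))
```

The restart test discards the extrapolation whenever it points against the direction of the last proximal step, which is what damps the oscillations mentioned above.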

2.2.3. Linearly Convergent FISTA

LC-FISTA is an accelerated version of FISTA derived from the research in [45,46]. Under specific error conditions, LC-FISTA works with global linear convergence for a large group of real applications that can be formulated as the convex problem in (4). The iterative steps of the algorithm are as follows:
(1) Set $\tau_0 = z_0 = 0$, and choose $\alpha$, $\theta$, $\mu$, and $L_f$.
(2) Update: $y_k = \frac{1}{1+\theta}\tau_k + \frac{\theta}{1+\theta}z_k$.
(3) Update: $\tau_{k+1} = \mathrm{prox}_{\rho\phi}(y_k - \rho\nabla f(y_k))$.
(4) Update: $z_{k+1} = (1-\theta)z_k + \theta y_k + \alpha(\tau_{k+1} - y_k)$.
With a mathematical analysis analogous to that presented in [45,46], the sequence $\{\tau_k\}_{k \in \mathbb{N}}$ generated by the LC-FISTA scheme exhibits a linear convergence rate for fixed values of $\alpha = (L_f/\mu)^{1/2}$ and $\theta = (\mu/L_f)^{1/2}$. The practical use of LC-FISTA requires tuning parameters such as the strong convexity parameter $\mu$ for the differentiable function $f$. In addition, we need to set $\tau_0 = z_0$ to obtain convergence rates similar to the accelerated variant of FISTA presented in [50]. It is common to set $\tau_0 = z_0 = 0$, since the location of the global minimum of the objective function $h$ is not known.
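The LC-FISTA iteration can be sketched in Python/NumPy as follows, reading the fixed parameters as $\theta = (\mu/L_f)^{1/2}$ and $\alpha = (L_f/\mu)^{1/2}$ with step size $\rho = 1/L_f$. As assumptions for illustration, the problem is an $\ell_1$-penalized least squares instance with $f(\beta) = \frac{1}{2}\|H\beta - X\|_F^2$, and $\mu$ and $L_f$ are the extreme eigenvalues of $H^T H$. The sketch refuses to run when $\mu \le 0$ (for example, when $H$ has more columns than rows), since $\alpha$ is then undefined. The function names are hypothetical.

```python
import numpy as np

def shrink(z, thr):
    """Soft-thresholding: proximal operator of thr * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def objective(H, X, beta, lam):
    """h(beta) = 0.5 * ||H beta - X||_F^2 + lam * ||beta||_1 (assumed scaling)."""
    return 0.5 * np.sum((H @ beta - X) ** 2) + lam * np.sum(np.abs(beta))

def lc_fista(H, X, lam, n_iter=300):
    """Linearly convergent FISTA (illustrative sketch)."""
    eig = np.linalg.eigvalsh(H.T @ H)          # eigenvalues in ascending order
    mu, L_f = eig[0], eig[-1]
    if mu <= 0.0:
        raise ValueError("this sketch needs mu > 0, i.e. H with full column rank")
    rho = 1.0 / L_f
    theta = np.sqrt(mu / L_f)
    alpha = np.sqrt(L_f / mu)
    tau = np.zeros((H.shape[1], X.shape[1]))   # step (1): tau_0 = z_0 = 0
    z = tau.copy()
    for _ in range(n_iter):
        y = (tau + theta * z) / (1.0 + theta)                        # step (2)
        grad = H.T @ (H @ y - X)
        tau_next = shrink(y - rho * grad, lam * rho)                 # step (3)
        z = (1.0 - theta) * z + theta * y + alpha * (tau_next - y)   # step (4)
        tau = tau_next
    return tau

rng = np.random.default_rng(0)
H = rng.normal(size=(40, 10))
X = rng.normal(size=(40, 3))
beta = lc_fista(H, X, lam=0.5)
print(objective(H, X, np.zeros_like(beta), 0.5), objective(H, X, beta, 0.5))
```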
Remark 1.
The proximal operator $\mathrm{prox}_{\rho\phi}(u_k)$ is the step that is common to FISTA, G-FISTA, and LC-FISTA. The value $\tau_{k+1} = \mathrm{prox}_{\rho\phi}(u_k)$ is closer to the global minimum of $h$ than the maximum descent step $u_k := z_k - \rho\nabla f(z_k)$. Since the impulse factor $\gamma_k$ causes oscillations in the FISTA scheme when it approaches a value of one, G-FISTA shortens the interval between two restarts by setting $\gamma_k = 1$, which is an essential condition for G-FISTA to converge with fewer iterations. In LC-FISTA, the $\theta$ and $\alpha$ momentum terms incorporated in the two additional equations (see steps (2) and (4) of LC-FISTA) improve both the accuracy of the gradient and the speed of convergence of LC-FISTA via the proximal operator.

3. Proposed Method of Training ELM-AEs Based on G-FISTA and LC-FISTA

ELM-AE has an extremely fast training speed and a pattern reconstruction capability that is better than that of backpropagation-based learning. The formulation of this model can be expressed as a convex optimization problem which involves minimizing over the output layer weights $\beta$, as shown in the following equation:
$$\min_{\beta} \; \|H\beta - X\|_u^{\kappa_1} + \lambda\|\beta\|_v^{\kappa_2}, \tag{6}$$
where $\kappa_1, \kappa_2 > 0$, $u, v = 0, \frac{1}{2}, 1, 2$, and $\lambda$ is a control parameter between the training error and the generalization ability.
Several methods exist for computing the solution to (6), such as the Moore–Penrose inverse, orthogonal projection, and proper decomposition [51]. However, in a real situation, the number of independent variables $L$ in the linear system $H\beta = X$ can be much larger than the number of data points $N$, which may cause overfitting. On the other hand, the reconstruction capacity is low whenever the number of hidden neurons $L$ is small. As the training dataset grows, the cost of computing the pseudoinverse of the matrix $H$ increases significantly. Several regularization methods have been introduced in the literature to address the above problems. The two most commonly used classical techniques are ridge regression [52] and sparse regularization [37]. Sparse regularization plays an important role in feature selection and has shown successful results in machine learning. This study presents an unsupervised feature selection framework, which integrates sparse optimization and signal reconstruction for pattern recognition models. The minimization problem takes the following form:
$$\min_{\beta} \; \|H\beta - X\|_2^2 + \lambda\|\beta\|_1, \tag{7}$$
which is a special case of (4), in which we set $f(\beta) = \|H\beta - X\|_2^2$ and $\phi(\beta) = \lambda\|\beta\|_1$.
From the description given above, we see that the estimation of $\beta$ and the time required are the main challenges for training the ELM-AE. Our study proposes two novel iterative schemes for autoencoder training, G-FISTA and LC-FISTA, which reduce the computational time of FISTA while maintaining its accuracy, with FISTA serving as the classical baseline algorithm for training neural networks. This optimization approach is also interesting because it can be extended to other regularization terms, which control variable selection by computing a derivative in the weak sense through convex envelopes. In addition, the closed form of the proximal operator $\mathrm{prox}_{\rho\phi}(\cdot)$ of $\phi$ is computed by the shrinkage operator $\mathrm{shrink}(z, \lambda\rho)$ [38], whose $i$-th element is calculated as follows:
$$\mathrm{shrink}(z, \lambda\rho)_i = \mathrm{shrink}(z_i, \lambda\rho) = \mathrm{sign}(z_i)\max\{|z_i| - \lambda\rho, 0\}, \tag{8}$$
where $z$ is the initial estimate of the output layer weights $\beta$.
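The shrinkage operator is simple enough to state directly in code. The following Python/NumPy sketch applies it element-wise; the function name `shrink` and the argument `thr` (which plays the role of the product $\lambda\rho$) are naming choices made for this example.

```python
import numpy as np

def shrink(z, thr):
    """Element-wise soft-thresholding: sign(z_i) * max(|z_i| - thr, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

z = np.array([3.0, -0.5, -2.0, 0.2])
print(shrink(z, 1.0))  # entries with |z_i| <= 1 are set to zero, the rest move toward zero by 1
```

Entries whose magnitude does not exceed the threshold are set exactly to zero, which is the mechanism that produces sparse output weights.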
The speed that can be achieved by the use of G-FISTA and LC-FISTA in the training of the ELM-AE is significantly higher than that obtained using FISTA. In addition, for poorly conditioned matrices, the two algorithms have the same ability as FISTA to reconstruct the input signals. This is realized by the omission of several steps of the gradient descent, and correction of the solution by means of the proximal gradient. Although the iterative schemes used by G-FISTA and LC-FISTA have additional terms compared to FISTA, the computational cost can be similar, since they converge with fewer iterations. Algorithm 1 presents the steps that are followed in the training of the ELM-AE, where the parameters to be configured are the number of iterations $I$, the number of neurons $L$, the stopping criterion $P$, and the regularization parameter $\lambda$. The variables represent features associated with the pixels of the images, ordered according to the columns of the matrix $X$. In the reconstruction and variable selection process, the input data, expected target, network weights, and biases are matrix arrays that facilitate the computation of many vector operations.

4. Experimental Evaluation

This section presents the databases and evaluation metrics selected to evaluate the performance of the proposed method.

4.1. Hardware, Software and Databases

For this work, a server from the cluster of the Laboratory of Technological Research in Pattern Recognition (LITRP) of the Universidad Católica del Maule was used. The server was equipped with two CPUs, an Intel Xeon Gold 6238 252 CPU @ 2.20–4.00 GHz (56 physical cores), 126 GB RAM and a GPU NVIDIA Titan RTX. The algorithms were implemented in the MATLAB R2020a programming language.
To examine the performance of Algorithm 1, we used the CIFAR10, UMist, MNIST, NORB, Stanford Cars and Caltech256 databases, which were obtained from different free online repositories. These datasets are briefly described below.
(1) MNIST. This dataset contains 70,000 images of handwritten digits from zero to nine. These are grayscale images of size 28 × 28 pixels which generate a vector of 784 components.
(2) NORB. This consists of 48,600 grayscale stereo images of 50 toys from five generic classes: cars, four-legged animals, human figures, airplanes, and trucks. The size of the images is 96 × 96 pixels, and the output is a vector of 96 × 96 × 2 = 18,432 components.
(3) CIFAR10. This dataset consists of 60,000 color images distributed into 10 classes: airplanes, birds, automobiles, cats, dogs, frogs, deer, ships, horses, and trucks. The number of images per class is 6000, with 5000 used for training and the remainder for testing. In this experiment, the color images were converted to grayscale and had a size of 32 × 32 pixels.
(4) UMist. This contains 575 grayscale images of 20 people, showing individuals in different positions. The images are rectangular, with a size of 92 × 112 pixels.
(5) Caltech256. This dataset contains 30,607 color images grouped into 257 categories, each of which contains at least 80 images. For the purposes of this study, the images were converted to grayscale and had a size of 100 × 100 pixels.
(6) Stanford Cars (SCars). The Stanford Cars dataset contains 16,185 images of 196 categories of automobiles. For training, the color images were converted to grayscale images of size 96 × 97 pixels.
Algorithm 1 Training of the ELM-AE with FISTA, G-FISTA, and LC-FISTA.
Require: training set $\{x_i\}_{i=1}^{N}$, maximum number of iterations $I$, stopping criterion $P$, number of neurons $L$, regularization parameter $\lambda$.
  1: Random and orthogonal assignment of weights $W$ and biases $b$ in the hidden layer.
  2: Calculation of the matrix $H$ by evaluating $g$ at the terms $(x_i, W, b)$.
  3: while $P$ does not occur do
  4:   Calculate the optimal value of (4) using the G-FISTA or LC-FISTA algorithms for $f(\beta) = \|H\beta - X\|_2^2$ and $\phi(\beta) = \lambda\|\beta\|_1$.
  5: end while
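The steps of Algorithm 1 can be sketched end to end in Python/NumPy. For brevity, the inner update below is the plain FISTA step; the G-FISTA or LC-FISTA updates could be substituted in the same loop. Several choices are assumptions made for illustration rather than the authors' MATLAB code: orthogonal weights are obtained by a QR factorization of a random matrix, the bias vector is scaled to unit norm, the smooth term is $f(\beta) = \frac{1}{2}\|H\beta - X\|_F^2$, and the stopping criterion $P$ is taken to be a relative residual $\|H\beta_k - X\| / \|X\| < \varepsilon$. All function and parameter names are hypothetical.

```python
import numpy as np

def shrink(z, thr):
    """Soft-thresholding: proximal operator of thr * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - thr, 0.0)

def random_orthogonal(m, L, rng):
    """Random (m, L) matrix with orthonormal columns (m >= L) or rows (m < L)."""
    A = rng.uniform(-1.0, 1.0, size=(max(m, L), min(m, L)))
    Q, _ = np.linalg.qr(A)                    # Q has orthonormal columns
    return Q if m >= L else Q.T

def train_elm_ae(X, L, lam, eps, max_iter=50, seed=0):
    """ELM-AE training loop following Algorithm 1 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N, m = X.shape
    W = random_orthogonal(m, L, rng)                      # step 1: weights
    b = rng.uniform(0.0, 1.0, size=L)
    b /= np.linalg.norm(b)                                # step 1: unit-norm biases (assumed)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))                # step 2: hidden layer output
    rho = 1.0 / np.linalg.eigvalsh(H.T @ H)[-1]           # step size 1 / L_f
    beta = np.zeros((L, m))
    z = beta.copy()
    t = 1.0
    x_norm = np.linalg.norm(X)
    rel_err, n_iter = 1.0, 0                              # residual at beta = 0 equals ||X||
    while n_iter < max_iter and rel_err >= eps:           # step 3: until P or I iterations
        beta_new = shrink(z - rho * (H.T @ (H @ z - X)), lam * rho)   # step 4
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        z = beta_new + ((t - 1.0) / t_new) * (beta_new - beta)
        beta, t = beta_new, t_new
        n_iter += 1
        rel_err = np.linalg.norm(H @ beta - X) / x_norm
    return W, b, beta, rel_err, n_iter

rng = np.random.default_rng(1)
X = rng.uniform(0.0, 1.0, size=(60, 16))                  # toy data in [0, 1]
W, b, beta, rel_err, n_iter = train_elm_ae(X, L=8, lam=1e-3, eps=0.6)
print(beta.shape, n_iter, rel_err)
```

The loop ends either because the relative residual dropped below `eps` or because `max_iter` iterations were spent, which mirrors the roles of $P$ and $I$ in Algorithm 1.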
Table 1 shows the numbers of training and test data, the numbers of categories, and the sizes of the input vectors for CIFAR10, UMist, MNIST, NORB, Stanford Cars, and Caltech256.

4.2. Selection of Evaluation Metrics

To analyze the performance of the proposed algorithm, we used the root mean square error (RMSE), mean absolute error (MAE), $R^2$ score, index of agreement (IA), Theil inequality coefficient (TIC), minimum error (MIN), and maximum error (MAX), as defined in Table 2 [53]. We used the RMSE and MAE metrics to represent the average error between the estimated and actual values. The lower the values of these measures, the higher the efficiency of the autoencoder.
The $R^2$ score is a statistical measure that is defined on the domain $[0, 1]$, where a value of one indicates perfect prediction and zero an inefficient model [54]. The TIC metric also takes a value between zero and one, but it indicates the accuracy of the proposed system in the opposite direction: the closer the value is to zero, the higher the predictive efficiency of the ELM-AE method. Likewise, the IA prediction index can provide external statistical information on the predictive ability of the model. To verify the efficiency of the ELM-AE, each experiment was repeated 10 times, and the average values of the training times and the evaluation criteria represented in Table 2 were recorded.
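For reference, the seven evaluation criteria can be computed as in the Python/NumPy sketch below. The formulas are the commonly used textbook definitions (Willmott's index of agreement and Theil's inequality coefficient in its bounded form), and MIN and MAX are taken as the smallest and largest absolute errors. These definitions are assumed to correspond to Table 2, which is not reproduced here, and the function name is hypothetical.

```python
import numpy as np

def evaluation_metrics(y_true, y_pred):
    """RMSE, MAE, R2, IA, TIC, MIN and MAX under standard definitions (sketch)."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    err = y_true - y_pred
    abs_err = np.abs(err)
    mean_true = y_true.mean()
    ss_res = np.sum(err ** 2)                              # residual sum of squares
    ss_tot = np.sum((y_true - mean_true) ** 2)             # total sum of squares
    rmse = np.sqrt(np.mean(err ** 2))
    ia_den = np.sum((np.abs(y_pred - mean_true) + np.abs(y_true - mean_true)) ** 2)
    tic_den = np.sqrt(np.mean(y_true ** 2)) + np.sqrt(np.mean(y_pred ** 2))
    return {
        "RMSE": rmse,
        "MAE": abs_err.mean(),
        "R2": 1.0 - ss_res / ss_tot,
        "IA": 1.0 - ss_res / ia_den,
        "TIC": rmse / tic_den,
        "MIN": abs_err.min(),
        "MAX": abs_err.max(),
    }

metrics = evaluation_metrics([1.0, 2.0, 3.0, 4.0], [2.0, 3.0, 4.0, 5.0])
for name, value in metrics.items():
    print(name, value)
```

For an autoencoder, `y_true` would be the flattened input matrix $X$ and `y_pred` the flattened reconstruction $H\beta$.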

4.3. Experimental Setup

In each experiment, the training and test data were standardized to the range $[0, 1]$ using the following equation:
$$y_{\mathrm{norm}}^{a} = \frac{y_i^{a} - y_{\min}^{a}}{y_{\max}^{a} - y_{\min}^{a}}, \tag{9}$$
where $y_i^{a}$ is the actual value to be standardized, and $y_{\min}^{a}$ and $y_{\max}^{a}$ are the minimum and maximum of the actual values. In the hidden layer, the weights and biases are random values generated by means of a uniform distribution in the ranges $[-1, 1]$ and $[0, 1]$, respectively, and the sigmoid $g(x) = \frac{1}{1 + e^{-x}}$ was used as the activation function. We chose this activation function due to its universal approximation capability in ELM networks.
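The preprocessing and hidden layer setup described here can be sketched in Python/NumPy as follows. The sketch assumes that the minimum and maximum are taken over the whole array, that weights are drawn uniformly from $[-1, 1]$ and biases from $[0, 1]$, and that the activation is the logistic sigmoid; the function names are hypothetical.

```python
import numpy as np

def min_max_scale(Y):
    """Rescale values to [0, 1] using the global minimum and maximum (assumed scope)."""
    Y = np.asarray(Y, dtype=float)
    y_min, y_max = Y.min(), Y.max()
    if y_max == y_min:                 # constant input: avoid division by zero
        return np.zeros_like(Y)
    return (Y - y_min) / (y_max - y_min)

def hidden_layer(X, L, seed=0):
    """Sigmoid hidden layer with uniformly distributed random weights and biases."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], L))   # weights in [-1, 1]
    b = rng.uniform(0.0, 1.0, size=L)                  # biases in [0, 1]
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

pixels = np.array([[0.0, 128.0], [255.0, 64.0]])       # toy 8-bit intensities
X = min_max_scale(pixels)
H = hidden_layer(X, L=3)
print(X)
print(H.shape)  # (2, 3)
```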
The algorithm presented in Section 3 has the following parameters that need to be configured: the number of neurons $L$, the stopping criterion $P$, the maximum number of iterations $I$, and the regularization term $\lambda$. In the first scenario of the proposed method, the regularization parameter is set to $\lambda = \frac{1}{2}\sqrt{\frac{2\log(L)}{N}}$, the stopping criterion $P$ is expressed as $\frac{\|H\beta_k - X\|}{\|X\|} < \varepsilon$, and the maximum number of iterations is $I = 50$. The threshold $\varepsilon$ is a user-specified value that is applied to interrupt the iteration of the algorithm. For the UMist, CIFAR10, MNIST, NORB, SCars, and Caltech256 databases, we set $\varepsilon = 0.30$, $0.32$, $0.60$, $0.13$, $0.40$, and $0.40$, respectively. Moreover, the practical application of the FISTA, G-FISTA, and LC-FISTA algorithms for training of the ELM-AE requires tuning of the Lipschitz constant $L_f$, and the strong convexity constant $\mu$ must be tuned for the LC-FISTA algorithm. For the convex function $f(\beta) = \|H\beta - X\|_2^2$, we chose values of $L_f = \lambda_{\max}(H^T H)$ and $\mu = \lambda_{\min}(H^T H)$, where $\lambda_{\min}$ and $\lambda_{\max}$ are the minimum and maximum eigenvalues of $H^T H$. The additional control parameters in G-FISTA were described in Section 2.2.2.
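The first-scenario settings can be computed as in the Python/NumPy sketch below. It reads the regularization rule as $\lambda = \frac{1}{2}\sqrt{2\log(L)/N}$ and the stopping criterion as a relative residual, which is an interpretation of the formulas in the text, and it takes $L_f$ and $\mu$ as the largest and smallest eigenvalues of $H^T H$. The function names and the toy matrices are hypothetical.

```python
import numpy as np

def first_scenario_lambda(L, N):
    """Regularization parameter read as 0.5 * sqrt(2 * log(L) / N) (interpretation)."""
    return 0.5 * np.sqrt(2.0 * np.log(L) / N)

def step_constants(H):
    """Return (L_f, mu): largest and smallest eigenvalues of H^T H."""
    eig = np.linalg.eigvalsh(H.T @ H)      # ascending order for symmetric matrices
    return eig[-1], eig[0]

def stopping_criterion(H, beta, X, eps):
    """True when the relative residual ||H beta - X|| / ||X|| falls below eps."""
    return np.linalg.norm(H @ beta - X) / np.linalg.norm(X) < eps

rng = np.random.default_rng(0)
H = rng.uniform(0.0, 1.0, size=(200, 20))   # toy hidden layer output
X = rng.uniform(0.0, 1.0, size=(200, 30))   # toy input data
lam = first_scenario_lambda(L=H.shape[1], N=H.shape[0])
L_f, mu = step_constants(H)
print(lam, L_f, mu)
print(stopping_criterion(H, np.zeros((20, 30)), X, eps=0.30))  # False: residual equals ||X|| at beta = 0
```

When $L > N$, the smallest eigenvalue of $H^T H$ is zero up to rounding, so the value returned for $\mu$ should be inspected before it is used in LC-FISTA.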
In a second scenario, we evaluated the sensitivity of the performance of our method to the regularization parameter $\lambda$ for the UMist, NORB, Stanford Cars, and Caltech256 datasets. This tuning term was chosen using cross-validation for each algorithm and database. Table 3 shows the optimal value of $\lambda$ selected from $\{10^{-5}, 10^{-4}, \ldots, 10^{4}, 10^{5}\}$, obtained by comparing the RMSE between the generated models and choosing the $\lambda$ that gives the lowest RMSE. The experiment only considered the case $L = 2000$ for the selected databases. The other parameters of the FISTA, G-FISTA, and LC-FISTA algorithms follow the same selection criteria as in the first scenario. For example, the step length for FISTA and LC-FISTA is $\rho = \frac{1}{L_f}$, whereas for G-FISTA, $\rho = \frac{1.3}{L_f}$, $\xi = 0.96$, and $S = 1.1$. The Lipschitz constant $L_f$ and the strong convexity parameter $\mu$ of $f$ are taken as the maximum and minimum eigenvalues of the matrix $H^T H$. The mathematical analysis and numerical experiments presented in [42,43,44,45,49] motivated the choice of our parameters.
To avoid the problem of overfitting when the number of neurons is increased, we applied a 10-fold cross-validation technique. This control tool allowed us to restrict the training to a certain number of neurons. We then used the test set specified in Table 1 and the metrics defined in Table 2 to evaluate the performance of each algorithm. Figure 1 shows the values of the RMSE for the training, validation, and test sets for the G-FISTA algorithm. A graphical view is not shown for FISTA and LC-FISTA since all three algorithms had a comparable level of accuracy. According to the above analysis, L = 900 and L = 1300 were the maximum neuron thresholds that could be assigned in ELM-AE for the MNIST and CIFAR10 databases.

4.4. Representation of Features with Different Numbers of Neurons

The computational complexity and execution time of the proposed method are explored in this section. For this purpose, several cases are considered with varying numbers of neurons in the hidden layer of the ELM-AE. In this phase of the experiments, we selected the measures of training time, RMSE, MAE, $R^2$, IA, TIC, MIN, and MAX for comparison on the test set. These evaluation indicators were obtained as a result of training the ELM-AE with the shrinkage-thresholding class of algorithms introduced in Section 2.2.
Table 4 presents the training times and values of RMSE for the ELM-AE on each of the six datasets, for increasing numbers of neurons in the hidden layer. These experimental results show that the training times achieved by G-FISTA and LC-FISTA were lower than that obtained by FISTA. For example, for the CIFAR10 set, with 100 neurons in the hidden layer, the computation time for the ELM trained by FISTA was 21.92 s, while the G-FISTA and LC-FISTA algorithms achieved average values of 12.31 s and 12.91 s. Likewise, when we increased the number of hidden neurons $L$, the training time obtained by G-FISTA and LC-FISTA was approximately half that required by FISTA. At the test stage, the RMSE metric achieved by the G-FISTA and LC-FISTA algorithms proposed in this paper indicated a similar level of error to the current FISTA algorithm. This evaluation metric became smaller as the parameter $L$ was increased. For MNIST and CIFAR10, we obtained evaluation metrics of up to $L = 500$ and $L = 1000$, respectively.
For each dataset, we show the execution time of the proposed method in Figure 2. A bar chart represents this evaluation measure as a function of the number of neurons. The blue, red, and orange bars indicate the training times obtained by FISTA, G-FISTA, and LC-FISTA, respectively. From the results illustrated in Figure 2, we infer that the proposed G-FISTA and LC-FISTA algorithms are faster than FISTA for training the ELM-AE.
In the following, we demonstrate the efficiency of the proposed method using four evaluation criteria: the MAE, $R^2$ score, IA, and TIC. A summary of the MAE and $R^2$ values is given in Table 5. The closer the value of MAE to zero, and the closer the value of $R^2$ to one, the better the performance of the autoencoder. From the values of the metrics given in Table 5, we can see that the order of magnitude is similar, and there is no significant difference between the shrinkage-thresholding methods, meaning that the proposed method maintains the same level of accuracy as the FISTA algorithm. Taking the UMist database as an example, we see that when $L = 1500$, the average MAE values obtained by FISTA, G-FISTA, and LC-FISTA are 0.0172, 0.0181, and 0.0178, while their $R^2$ scores are 0.9861, 0.9846, and 0.9851, respectively. The IA and TIC metrics are also important for evaluating the capability of the proposed learning system. These fit metrics are reported in Table 6. As the average values of the IA and TIC approach one and zero, respectively, the feature representation improves.
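The evaluation criteria of Table 2 can be computed along the following lines. This is a sketch under assumptions: the function name `evaluation_metrics` is hypothetical, all entries of the actual and reconstructed data are flattened into one vector, and IA is taken in the standard Willmott form with a squared denominator.

```python
import numpy as np

def evaluation_metrics(y_actual, y_estimated):
    a = np.asarray(y_actual, dtype=float).ravel()
    e = np.asarray(y_estimated, dtype=float).ravel()
    err = a - e
    abs_err = np.abs(err)
    mean_a = a.mean()
    sse = np.sum(err ** 2)                 # sum of squared errors
    rmse = np.sqrt(np.mean(err ** 2))
    return {
        "RMSE": rmse,
        "MAE": abs_err.mean(),
        "R2": 1.0 - sse / np.sum((a - mean_a) ** 2),
        # Willmott's index of agreement (standard form, assumed).
        "IA": 1.0 - sse / np.sum((np.abs(e - mean_a) + np.abs(a - mean_a)) ** 2),
        # Theil inequality coefficient.
        "TIC": rmse / (np.sqrt(np.mean(a ** 2)) + np.sqrt(np.mean(e ** 2))),
        "MIN": abs_err.min(),
        "MAX": abs_err.max(),
    }

# Toy vectors for illustration only.
demo_metrics = evaluation_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
```

For the toy vectors above, MAE is 0.15 and $R^2$ is 0.98; values of IA near one and TIC near zero indicate a close fit, as stated in the text.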
Table 7 shows the MIN and MAX statistical indices for the proposed model on all the selected databases. This table indicates the minimum and maximum absolute errors for the ELM-AE in the testing stage. From the results of this experimental analysis, we can conclude that G-FISTA and LC-FISTA have a generalization capability similar to that of the FISTA algorithm.
The tuning of the parameter λ in the ELM-AE usually influences its generalization capacity. With the values of λ in Table 3 chosen using cross-validation, we show in Table 8 the performance of our method for the particular case $L = 2000$. The RMSE and the training time are the two indicators selected to compare the efficiency of FISTA, G-FISTA, and LC-FISTA in the ELM-AE. The time shown in Table 8 is the execution time required to select the parameter λ. Note that G-FISTA and LC-FISTA maintain the reconstruction capability of FISTA but perform better in terms of computational time. In addition, the RMSE values in Table 4 and Table 8, for the first and second scenarios respectively, are comparable. Therefore, the λ chosen in the first scenario and the additional parameters may be adequate in large-scale application problems, since cross-validation is then not required.

5. Discussion

The main goal of an autoencoder is to learn a compressed representation of the input and then reconstruct it within a reasonable computation time. This study proposes a robust sparse optimization-based method for training ELM-AE networks. With the control parameters specified in the results section, we see that the training times for G-FISTA and LC-FISTA are lower than for FISTA. Table 9 shows these results in terms of percentages. For the CIFAR10, UMist, MNIST, NORB, SCars, and Caltech256 databases, G-FISTA reduces the average computation time relative to FISTA by 44.48%, 14.42%, 35.87%, 42.97%, 48.42%, and 39.43%, respectively. The LC-FISTA algorithm achieved reductions of 42.41%, 18.71%, 27.70%, 47.93%, 47.32%, and 43.90%. In addition, the indicators recorded in Tables 4 to 8 are coherent and show comparable performance for all databases. The proposed training scheme is a useful machine learning method since it can help make accurate decisions in shorter periods. Consequently, incorporating fast optimization techniques into the autoencoder architecture can help improve the performance of current models, particularly those trained with first-order methods of the shrinkage-thresholding class.
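The note to Table 9 gives the formula % = (1 − A1/A2) × 100 for these percentages. As a small worked example, the snippet below applies it to the CIFAR10 training times at L = 100 quoted from Table 4; the function name `time_reduction_percent` is a hypothetical helper.

```python
def time_reduction_percent(t_proposed, t_baseline):
    # Percentage by which t_proposed is shorter than t_baseline.
    return (1.0 - t_proposed / t_baseline) * 100.0

# CIFAR10, L = 100 (training times in seconds, from Table 4).
fista_s, g_fista_s, lc_fista_s = 21.92, 12.31, 12.91

g_reduction = time_reduction_percent(g_fista_s, fista_s)
lc_reduction = time_reduction_percent(lc_fista_s, fista_s)
print(f"G-FISTA: {g_reduction:.2f}%  LC-FISTA: {lc_reduction:.2f}%")
# prints "G-FISTA: 43.84%  LC-FISTA: 41.10%"
```

These values agree with the corresponding entries of Table 9. Note that the percentage measures the reduction in training time, so a value of 50% would correspond to halving the time.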
The proximal step of FISTA requires the main computational effort, and this also holds true for G-FISTA and LC-FISTA. The additional computation required in G-FISTA and LC-FISTA increases the convergence speed, but raises the cost of each iteration. However, this overhead is offset in practice, since G-FISTA and LC-FISTA converge in fewer iterations. In the experimental evaluation described in Section 4, we raised the complexity of the problem by increasing the number of neurons in the hidden layer to get a clearer idea of the performance when solving applications involving higher-dimensional datasets. Based on this evaluation criterion, we see that the proposed method can be configured to solve large-scale problems, since the vector operations involved in the ELM-AE architecture and the proposed mechanism are block-separable.
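To make the trade-off between per-iteration cost and iteration count concrete, the sketch below adds a gradient-based restart test and a step-size safeguard to a proximal-gradient loop with constant momentum, loosely in the spirit of greedy FISTA. It is an assumption-laden illustration: the constants (`step_factor=1.3`, `shrink=0.96`, `safeguard=1.0`), the function name, and the objective 0.5‖Hβ − X‖² + λ‖β‖₁ are not taken from this article.

```python
import numpy as np

def soft_threshold(z, tau):
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def objective(H, X, beta, lam):
    return 0.5 * np.sum((H @ beta - X) ** 2) + lam * np.sum(np.abs(beta))

def restarted_prox_gradient(H, X, lam, n_iter=300,
                            step_factor=1.3, shrink=0.96, safeguard=1.0):
    lip = np.linalg.norm(H, 2) ** 2        # Lipschitz constant of the gradient
    gamma_min = 1.0 / lip
    gamma = step_factor / lip              # step larger than 1/lip (assumed)
    beta = np.zeros((H.shape[1], X.shape[1]))
    y = beta.copy()
    first_move = None
    n_restarts = 0
    for _ in range(n_iter):
        beta_prev, y_prev = beta, y
        grad = H.T @ (H @ y_prev - X)
        # Proximal step: the dominant cost, shared with plain FISTA.
        beta = soft_threshold(y_prev - gamma * grad, gamma * lam)
        y = beta + (beta - beta_prev)      # constant momentum
        # Extra work 1: restart test (one inner product per iteration).
        if np.sum((y_prev - beta) * (beta - beta_prev)) >= 0.0:
            y = beta.copy()
            n_restarts += 1
        # Extra work 2: safeguard that shrinks the step if iterates move too far.
        move = np.linalg.norm(beta - beta_prev)
        if first_move is None:
            first_move = move
        elif move >= safeguard * first_move:
            gamma = max(gamma_min, shrink * gamma)
    return beta, n_restarts

# Synthetic hidden-layer outputs and targets, for illustration only.
demo_rng = np.random.default_rng(0)
H_demo = demo_rng.random((80, 15))
X_demo = demo_rng.random((80, 6))
beta_demo, restarts_demo = restarted_prox_gradient(H_demo, X_demo, lam=1e-2)
```

The restart test and the safeguard each add only an inner product or a norm per iteration, which is small next to the matrix products in the gradient and is consistent with the observation that the extra work pays off when fewer iterations are needed.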

6. Conclusions

This paper has proposed the G-FISTA and LC-FISTA algorithms to reduce the training time of ELM-AEs. The performance of our methods was compared to FISTA, an algorithm of the shrinkage-thresholding class, which is a state-of-the-art method of training ELM-AEs.
Experiments were conducted on six databases: CIFAR10, UMist, MNIST, NORB, Stanford Cars, and Caltech256. The results showed that G-FISTA and LC-FISTA achieved similar computational times for the training of the autoencoder, both of which were significantly lower than that of FISTA. In numerical terms, G-FISTA and LC-FISTA were 37.60% and 37.99% faster than FISTA in training the ELM-AE (average value obtained from Table 9). Consequently, the main advantage of our approach is that it maintains the generalization capability of FISTA while improving the computational speed. A possible disadvantage may be the hyperparameter settings, but many researchers set these values according to existing mathematical analysis in the literature to control runtime. Other advantages and disadvantages will be discussed in future work.
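As a check on these overall figures, the six per-dataset averages in the last two rows of Table 9 can be averaged directly:

```latex
\[
\text{G-FISTA: } \frac{44.48 + 14.42 + 35.87 + 42.97 + 48.42 + 39.43}{6}
  = \frac{225.59}{6} \approx 37.60,
\]
\[
\text{LC-FISTA: } \frac{42.41 + 18.71 + 27.70 + 47.93 + 47.32 + 43.90}{6}
  = \frac{227.97}{6} = 37.995 .
\]
```

The second value agrees with the reported 37.99% up to rounding in the last digit.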
Motivated by technological advances and the speed of convergence of shrinkage-thresholding algorithms, we intend to extend our methodology in further research to a parallel and distributed architecture. This approach will be of great interest for solving current large-scale applications in computer clusters and clouds. In addition, we aim to incorporate iterative schemes of the shrinkage-thresholding class into the architecture of a variational, convolutional deep autoencoder network to solve optimization problems with large datasets.

Author Contributions

Conceptualization, J.A.V.-C., M.M. and K.V.; methodology, J.A.V.-C. and M.M.; software, J.A.V.-C.; experimental execution and validation, J.A.V.-C.; research, J.A.V.-C., M.M. and K.V.; resources, M.M.; writing—review and editing, J.A.V.-C., M.M. and K.V.; scientific project coordination, M.M. and K.V.; funding acquisition, M.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by Universidad Católica del Maule Doctoral Studies Scholarship 2019 (Beca Doctoral Universidad Católica del Maule 2019).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

This paper is one of the scientific results of the project FONDECYT REGULAR 2020 No. 1200810, "Very Large Fingerprint Classification based on a Fast and Distributed Extreme Learning Machine", National Research and Development Agency, Ministry of Science, Technology, Knowledge and Innovation, Government of Chile.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
ELM: Extreme learning machine
ELM-AE: Extreme learning machine autoencoder
K-ELM: Kernel-ELM
FISTA: Fast iterative shrinkage-thresholding algorithm
G-FISTA: Greedy FISTA
LC-FISTA: Linearly convergent FISTA
RMSE: Root mean square error
MAE: Mean absolute error
$R^2$: $R^2$ score
IA: Index of agreement
TIC: Theil inequality coefficient
MIN: Minimum error
MAX: Maximum error
PSO: Particle swarm optimization
CPU: Central processing unit
GPU: Graphics processing unit

References

  1. He, X.; Ji, M.; Zhang, C.; Bao, H. A variance minimization criterion to feature selection using laplacian regularization. IEEE Trans. Pattern Anal. Mach. Intell. 2011, 33, 2013–2025. [Google Scholar] [PubMed]
  2. Dong, G.; Liao, G.; Liu, H.; Kuang, G. A review of the autoencoder and its variants: A comparative perspective from target recognition in synthetic-aperture radar images. IEEE Geosci. Remote Sens. Mag. 2018, 6, 44–68. [Google Scholar] [CrossRef]
  3. Leijnen, S.; Veen, F.V. The Neural Network Zoo. Proceedings 2020, 47, 9. [Google Scholar]
  4. Baldi, P.; Sadowski, P.; Lu, Z. Learning in the machine: Random backpropagation and the deep learning channel. Artif. Intell. 2018, 260, 1–35. [Google Scholar] [CrossRef]
  5. Meng, L.; Ding, S.; Xue, Y. Research on denoising sparse autoencoder. Int. J. Mach. Learn. Cybern. 2017, 8, 1719–1729. [Google Scholar] [CrossRef]
  6. Meng, L.; Ding, S.; Zhang, N.; Zhang, J. Research of stacked denoising sparse autoencoder. Neural Comput. Appl. 2018, 30, 2083–2100. [Google Scholar] [CrossRef]
  7. Li, R.; Li, S.; Xu, K.; Li, X.; Lu, J.; Zeng, M. A Novel Symmetric Stacked Autoencoder for Adversarial Domain Adaptation Under Variable Speed. IEEE Access 2022, 10, 24678–24689. [Google Scholar] [CrossRef]
  8. Luo, X.; Li, X.; Wang, Z.; Liang, J. Discriminant autoencoder for feature extraction in fault diagnosis. Chemom. Intell. Lab. Syst. 2019, 192, 103814. [Google Scholar] [CrossRef]
  9. Soydaner, D. Hyper Autoencoders. Neural Process. Lett. 2020, 52, 1395–1413. [Google Scholar] [CrossRef]
  10. Huang, G.B.; Zhu, Q.Y.; Siew, C.K. Extreme learning machine: Theory and applications. Neurocomputing 2006, 70, 489–501. [Google Scholar] [CrossRef]
  11. Huang, G.; Song, S.; Gupta, J.N.D.; Wu, C. Semi-Supervised and Unsupervised Extreme Learning Machines. IEEE Trans. Cybern. 2014, 44, 2405–2417. [Google Scholar] [CrossRef] [PubMed]
  12. Chen, J.; Zeng, Y.; Li, Y.; Huang, G.B. Unsupervised feature selection based extreme learning machine for clustering. Neurocomputing 2020, 386, 198–207. [Google Scholar] [CrossRef]
  13. Huang, G.; Huang, G.B.; Song, S.; You, K. Trends in extreme learning machines: A review. Neural Netw. 2015, 61, 32–48. [Google Scholar] [CrossRef] [PubMed]
  14. Huang, G.B.; Zhou, H.; Ding, X.; Zhang, R. Extreme learning machine for regression and multiclass classification. IEEE Trans. Syst. Man Cybern. Part B-Cybern. 2012, 42, 513–529. [Google Scholar] [CrossRef]
  15. Bai, Z.; Huang, G.B.; Wang, D.; Wang, H.; Westover, M.B. Sparse extreme learning machine for classification. IEEE Trans. Cybern. 2014, 44, 1858–1870. [Google Scholar] [CrossRef]
  16. Liang, N.Y.; Huang, G.B.; Saratchandran, P.; Sundararajan, N. A fast and accurate online sequential learning algorithm for feedforward networks. IEEE Trans. Neural Netw. 2006, 17, 1411–1423. [Google Scholar] [CrossRef]
  17. Ma, J.; Yu, S.; Cheng, W. Composite Fault Diagnosis of Rolling Bearing Based on Chaotic Honey Badger Algorithm Optimizing VMD and ELM. Machines 2022, 10, 469. [Google Scholar] [CrossRef]
  18. Cui, H.; Guan, Y.; Chen, H. Rolling element fault diagnosis based on VMD and sensitivity MCKD. IEEE Access 2021, 9, 120297–120308. [Google Scholar] [CrossRef]
  19. An, Z.; Wang, X.; Li, B.; Xiang, Z.; Zhang, B. Robust visual tracking for UAVs with dynamic feature weight selection. Appl. Intell. 2022. [Google Scholar] [CrossRef]
  20. Eshtay, M.; Faris, H.; Obeid, N. Metaheuristic-based extreme learning machines: A review of design formulations and applications. Int. J. Mach. Learn. Cybern. 2019, 10, 1543–1561. [Google Scholar] [CrossRef]
  21. Li, G.; Li, Y.; Chen, H.; Deng, W. Fractional-order controller for course-keeping of underactuated surface vessels based on frequency domain specification and improved particle swarm optimization algorithm. Appl. Sci. 2022, 12, 3139. [Google Scholar] [CrossRef]
  22. Zhou, X.; Ma, H.; Gu, J.; Chen, H.; Deng, W. Parameter adaptation-based ant colony optimization with dynamic hybrid mechanism. Eng. Appl. Artif. Intell. 2022, 114, 105–139. [Google Scholar] [CrossRef]
  23. Ali, M.; Deo, R.C.; Xiang, Y.; Prasad, R.; Li, J.; Farooque, A.; Yaseen, Z.M. Coupled online sequential extreme learning machine model with ant colony optimization algorithm for wheat yield prediction. Sci. Rep. 2022, 12, 5488. [Google Scholar] [CrossRef]
  24. Chen, H.; Miao, F.; Chen, Y.; Xiong, Y.; Chen, T. A hyperspectral image classification method using multifeature vectors and optimized KELM. IEEE J. Sel. Top. Appl. Earth Observ. Remote Sens. 2021, 14, 2781–2795. [Google Scholar] [CrossRef]
  25. Chamara, L.; Zhou, H.; Huang, G.B.; Vong, C.M. Representation learning with extreme learning machine for big data. IEEE Intell. Syst. 2013, 28, 31–34. [Google Scholar]
  26. Tang, J.; Deng, C.; Huang, G.B. Extreme learning machine for multilayer perceptron. IEEE Trans. Neural Netw. Learn. Syst. 2016, 27, 809–821. [Google Scholar] [CrossRef]
  27. Sun, K.; Zhang, J.; Zhang, C.; Hu, J. Generalized extreme learning machine autoencoder and a new deep neural network. Neurocomputing 2017, 230, 374–381. [Google Scholar] [CrossRef]
  28. Ge, H.; Sun, W.; Zhao, M.; Yao, Y. Stacked denoising extreme learning machine autoencoder based on graph embedding for feature representation. IEEE Access 2019, 7, 13433–13444. [Google Scholar] [CrossRef]
  29. Wang, J.; Guo, P.; Li, Y. DensePILAE: A feature reuse pseudoinverse learning algorithm for deep stacked autoencoder. Complex Intell. Syst. 2021, 8, 2039–2049. [Google Scholar] [CrossRef]
  30. Li, R.; Wang, X.; Lei, L.; Wu, C. Representation learning by hierarchical ELM auto-encoder with double random hidden layers. IET Comput. Vis. 2019, 13, 411–419. [Google Scholar] [CrossRef]
  31. Li, R.; Wang, X.; Song, Y.; Lei, L. Hierarchical extreme learning machine with L21-norm loss and regularization. Int. J. Mach. Learn. Cybern. 2021, 12, 1297–1310. [Google Scholar] [CrossRef]
  32. Liangjun, C.; Honeine, P.; Hua, Q.; Jihong, Z.; Xia, S. Correntropy-based robust multilayer extreme learning machines. Pattern Recognit. 2018, 84, 357–370. [Google Scholar] [CrossRef] [Green Version]
  33. Wong, C.M.; Vong, C.M.; Wong, P.K.; Cao, J. Kernel-based multilayer extreme learning machines for representation learning. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 757–762. [Google Scholar] [CrossRef] [PubMed]
  34. Vong, C.M.; Chen, C.; Wong, P.K. Empirical kernel map-based multilayer extreme learning machines for representation learning. Neurocomputing 2018, 310, 265–276. [Google Scholar] [CrossRef]
  35. Paul, A.N.; Yan, P.; Yang, Y.; Zhang, H.; Du, S.; Wu, Q. Non-iterative online sequential learning strategy for autoencoder and classifier. Neural Comput. Appl. 2021, 33, 16345–16361. [Google Scholar] [CrossRef]
  36. Mirza, B.; Kok, S.; Dong, F. Multi-layer online sequential extreme learning machine for image classification. In Proceedings of ELM-2015 Volume 1; Springer: Berlin/Heidelberg, Germany, 2016; pp. 39–49. [Google Scholar]
  37. Tibshirani, R. Regression shrinkage and selection via the Lasso. J. R. Stat. Soc. B Methodol. 1996, 58, 267–288. [Google Scholar] [CrossRef]
  38. Beck, A.; Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2009, 2, 183–202. [Google Scholar] [CrossRef]
  39. Jiang, X.; Yan, T.; Zhu, J.; He, B.; Li, W.; Du, H.; Sun, S. Densely connected deep extreme learning machine algorithm. Cogn. Comput. 2020, 12, 979–990. [Google Scholar] [CrossRef]
  40. Zhao, H.; Ding, S.; Li, X.; Huang, H. Deep neural network structured sparse coding for online processing. IEEE Access 2018, 6, 74778–74791. [Google Scholar] [CrossRef]
  41. Liu, D.; Liang, C.; Chen, S.; Tie, Y.; Qi, L. Auto-encoder based structured dictionary learning for visual classification. Neurocomputing 2021, 438, 34–43. [Google Scholar] [CrossRef]
  42. Janngam, K.; Wattanataweekul, R. A New Accelerated Fixed-Point Algorithm for Classification and Convex Minimization Problems in Hilbert Spaces with Directed Graphs. Symmetry 2022, 14, 1059. [Google Scholar] [CrossRef]
  43. Chumpungam, D.; Sarnmeta, P.; Suantai, S. An Accelerated Convex Optimization Algorithm with Line Search and Applications in Machine Learning. Mathematics 2022, 10, 1491. [Google Scholar] [CrossRef]
  44. Liang, J.; Luo, T.; Schönlieb, C.B. Improving “Fast Iterative Shrinkage-Thresholding Algorithm”: Faster, Smarter, and Greedier. SIAM J. Sci. Comput. 2022, 44, A1069–A1091. [Google Scholar] [CrossRef]
  45. Bussaban, L.; Kaewkhao, A.; Suantai, S. Inertial s-iteration forward-backward algorithm for a family of nonexpansive operators with applications to image restoration problems. Filomat 2021, 35, 771–782. [Google Scholar] [CrossRef]
  46. Chambolle, A.; Dossal, C. On the convergence of the iterates of the “fast iterative shrinkage/thresholding algorithm”. J. Optim. Theory Appl. 2015, 166, 968–982. [Google Scholar] [CrossRef]
  47. Bartlett, P. The sample complexity of pattern classification with neural networks: The size of the weights is more important than the size of the network. IEEE Trans. Inf. Theory 1998, 44. [Google Scholar] [CrossRef]
  48. Beck, A.; Teboulle, M. Gradient-based algorithms with applications to signal recovery. In Convex Optimization in Signal Processing and Communications; Cambridge University Press: Cambridge, UK, 2009; pp. 42–88. [Google Scholar]
  49. Chen, L.; Xiao, Y.; Yang, T. Application of the improved fast iterative shrinkage-thresholding algorithms in sound source localization. Appl. Acoust. 2021, 180, 108101. [Google Scholar] [CrossRef]
  50. Calatroni, L.; Chambolle, A. Backtracking strategies for accelerated descent methods with smooth composite objectives. SIAM J. Optim. 2019, 29, 1772–1798. [Google Scholar] [CrossRef]
  51. Sun, P.; Yang, L. Generalized eigenvalue extreme learning machine for classification. Appl. Intell. 2022, 52, 6662–6691. [Google Scholar] [CrossRef]
  52. Tikhonov, A.N.; Arsenin, V.Y. Solutions of Ill-Posed Problems; V.H. Winston: Washington, DC, USA, 1977. [Google Scholar]
  53. Niu, X.; Wang, J.; Zhang, L. Carbon price forecasting system based on error correction and divide-conquer strategies. Appl. Soft. Comput. 2022, 118, 107935. [Google Scholar] [CrossRef]
  54. Hao, Y.; Niu, X.; Wang, J. Impacts of haze pollution on China’s tourism industry: A system of economic loss analysis. J. Environ. Econ. Manag. 2021, 295, 113051. [Google Scholar] [CrossRef] [PubMed]
Figure 1. G-FISTA Algorithm: plot of RMSE for training, validation, and testing sets versus the number of neurons.
Figure 2. Bar graph representing the average training times required by FISTA, G-FISTA, and LC-FISTA for the ELM-AE.

Table 1. Information from the databases to train the ELM-AE.

| Dataset | # Training | # Testing | # Features | # Categories |
|---|---|---|---|---|
| UMist | 400 | 175 | 10,304 | 20 |
| CIFAR10 | 50,000 | 10,000 | 1024 | 10 |
| MNIST | 60,000 | 10,000 | 784 | 10 |
| NORB | 24,300 | 24,300 | 18,432 | 5 |
| SCars | 8144 | 8041 | 9312 | 196 |
| Caltech256 | 21,314 | 9293 | 10,000 | 257 |

Table 2. Seven evaluation criteria to evaluate the performance of the proposed method.

| Metric | Definition | Mathematical Expression |
|---|---|---|
| RMSE | Root mean square error | $RMSE = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i^a - y_i^e)^2}{n}}$ |
| MAE | Mean absolute error | $MAE = \dfrac{\sum_{i=1}^{n} \lvert y_i^a - y_i^e \rvert}{n}$ |
| $R^2$ | $R^2$ score | $R^2 = 1 - \dfrac{\sum_{i=1}^{n} (y_i^a - y_i^e)^2}{\sum_{i=1}^{n} (y_i^a - \bar{y}^a)^2}$ |
| IA | Willmott's index of agreement | $IA = 1 - \dfrac{\sum_{i=1}^{n} (y_i^a - y_i^e)^2}{\sum_{i=1}^{n} \left( \lvert y_i^e - \bar{y}^a \rvert + \lvert y_i^a - \bar{y}^a \rvert \right)^2}$ |
| TIC | Theil inequality coefficient | $TIC = \dfrac{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i^a - y_i^e)^2}}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i^a)^2} + \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i^e)^2}}$ |
| MIN | Minimum error | $MIN = \min \lvert y_i^a - y_i^e \rvert, \; i = 1, \dots, n$ |
| MAX | Maximum error | $MAX = \max \lvert y_i^a - y_i^e \rvert, \; i = 1, \dots, n$ |

Note: $y^a$ and $y^e$ denote the vectors of actual and estimated values, respectively; $\bar{y}^a$ is the average of the actual values; and $n$ is the number of pairs $(y_i^a, y_i^e)$.

Table 3. Choice of λ for each algorithm and selected database.

| Algorithm | λ for UMist | λ for NORB | λ for SCars | λ for Caltech256 |
|---|---|---|---|---|
| FISTA | 0.001 | 10 | 0.001 | 1 |
| G-FISTA | 0.001 | 100 | 0.01 | 0.1 |
| LC-FISTA | 0.01 | 100 | 0.1 | 0.01 |

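Table 4. Training times and RMSE values achieved by FISTA, G-FISTA, and LC-FISTA for training of the ELM-AE. A hyphen marks a configuration that was not evaluated.

| L | Algorithm | RMSE: CIFAR10 | RMSE: UMist | RMSE: MNIST | RMSE: NORB | RMSE: SCars | RMSE: Caltech256 | Time (s): CIFAR10 | Time (s): UMist | Time (s): MNIST | Time (s): NORB | Time (s): SCars | Time (s): Caltech256 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | FISTA | 0.0751 | 0.0583 | 0.0856 | 0.0385 | 0.1284 | 0.1205 | 21.92 | 2.92 | 25.47 | 22.33 | 38.22 | 63.61 |
| 100 | G-FISTA | 0.0753 | 0.0580 | 0.0856 | 0.0385 | 0.1285 | 0.1205 | 12.31 | 2.40 | 16.73 | 12.09 | 18.73 | 35.73 |
| 100 | LC-FISTA | 0.0751 | 0.0579 | 0.0855 | 0.0385 | 0.1285 | 0.1205 | 12.91 | 2.34 | 18.97 | 11.17 | 20.61 | 33.03 |
| 500 | FISTA | 0.0242 | 0.0368 | 0.0096 | 0.0193 | 0.088 | 0.0836 | 48.46 | 11.42 | 51.24 | 48.40 | 83.23 | 122.15 |
| 500 | G-FISTA | 0.0242 | 0.0373 | 0.0096 | 0.0193 | 0.088 | 0.0836 | 26.97 | 10.27 | 32.06 | 26.49 | 42.16 | 75.94 |
| 500 | LC-FISTA | 0.0242 | 0.0370 | 0.0096 | 0.0193 | 0.088 | 0.0837 | 27.61 | 9.49 | 35.92 | 24.26 | 42.52 | 69.93 |
| 1000 | FISTA | 0.0019 | 0.0288 | - | 0.0044 | 0.0718 | 0.0681 | 78.80 | 29.78 | - | 80.18 | 153.88 | 211.51 |
| 1000 | G-FISTA | 0.0019 | 0.0298 | - | 0.0044 | 0.0718 | 0.0681 | 43.14 | 26.82 | - | 46.54 | 80.08 | 125.85 |
| 1000 | LC-FISTA | 0.0019 | 0.0294 | - | 0.0044 | 0.0718 | 0.0681 | 44.83 | 25.07 | - | 42.34 | 78.60 | 118.25 |
| 1500 | FISTA | - | 0.0244 | - | 0.0015 | 0.0617 | 0.0589 | - | 57.79 | - | 116.54 | 221.72 | 285.91 |
| 1500 | G-FISTA | - | 0.0257 | - | 0.0015 | 0.0617 | 0.0589 | - | 48.39 | - | 69.18 | 117.10 | 176.85 |
| 1500 | LC-FISTA | - | 0.0253 | - | 0.0015 | 0.0617 | 0.0589 | - | 46.48 | - | 63.09 | 116.60 | 162.03 |
| 2000 | FISTA | - | 0.0214 | - | 0.0003 | 0.0543 | 0.0522 | - | 96.09 | - | 161.86 | 297.65 | 379.08 |
| 2000 | G-FISTA | - | 0.0229 | - | 0.0003 | 0.0543 | 0.0522 | - | 78.74 | - | 95.27 | 158.85 | 239.25 |
| 2000 | LC-FISTA | - | 0.0225 | - | 0.0003 | 0.0543 | 0.0522 | - | 75.55 | - | 86.17 | 162.87 | 222.65 |
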
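Table 5. MAE and $R^2$ score values achieved by FISTA, G-FISTA, and LC-FISTA for training of the ELM-AE. A hyphen marks a configuration that was not evaluated.

| L | Algorithm | MAE: CIFAR10 | MAE: UMist | MAE: MNIST | MAE: NORB | MAE: SCars | MAE: Caltech256 | $R^2$: CIFAR10 | $R^2$: UMist | $R^2$: MNIST | $R^2$: NORB | $R^2$: SCars | $R^2$: Caltech256 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | FISTA | 0.0546 | 0.0415 | 0.0489 | 0.0184 | 0.0921 | 0.0837 | 0.9009 | 0.9209 | 0.9241 | 0.9533 | 0.7896 | 0.8426 |
| 100 | G-FISTA | 0.0547 | 0.0413 | 0.0488 | 0.0185 | 0.0921 | 0.0837 | 0.9005 | 0.9217 | 0.9240 | 0.9533 | 0.7895 | 0.8426 |
| 100 | LC-FISTA | 0.0547 | 0.0412 | 0.0488 | 0.0185 | 0.0921 | 0.0838 | 0.9005 | 0.9218 | 0.9241 | 0.9532 | 0.7896 | 0.8425 |
| 500 | FISTA | 0.0181 | 0.0256 | 0.0046 | 0.0101 | 0.0625 | 0.0566 | 0.9894 | 0.9684 | 0.9991 | 0.9883 | 0.8994 | 0.9241 |
| 500 | G-FISTA | 0.0181 | 0.0258 | 0.0046 | 0.0101 | 0.0625 | 0.0566 | 0.9894 | 0.9677 | 0.9991 | 0.9883 | 0.8993 | 0.9241 |
| 500 | LC-FISTA | 0.0181 | 0.0256 | 0.0047 | 0.0102 | 0.0625 | 0.0566 | 0.9894 | 0.9681 | 0.9991 | 0.9883 | 0.8994 | 0.9241 |
| 1000 | FISTA | 0.0014 | 0.0201 | - | 0.0018 | 0.0508 | 0.0461 | 0.9999 | 0.9807 | - | 0.9994 | 0.9342 | 0.9497 |
| 1000 | G-FISTA | 0.0014 | 0.0208 | - | 0.0018 | 0.0508 | 0.0461 | 0.9999 | 0.9793 | - | 0.9994 | 0.9342 | 0.9497 |
| 1000 | LC-FISTA | 0.0014 | 0.0205 | - | 0.0018 | 0.0508 | 0.0461 | 0.9999 | 0.9798 | - | 0.9994 | 0.9342 | 0.9497 |
| 1500 | FISTA | - | 0.0172 | - | 0.0005 | 0.0440 | 0.0401 | - | 0.9861 | - | 0.9999 | 0.9515 | 0.9624 |
| 1500 | G-FISTA | - | 0.0181 | - | 0.0005 | 0.0440 | 0.0401 | - | 0.9846 | - | 0.9999 | 0.9515 | 0.9624 |
| 1500 | LC-FISTA | - | 0.0178 | - | 0.0005 | 0.0440 | 0.0401 | - | 0.9851 | - | 0.9999 | 0.9515 | 0.9624 |
| 2000 | FISTA | - | 0.0152 | - | 0.0001 | 0.0391 | 0.0358 | - | 0.9893 | - | 1.0000 | 0.9624 | 0.9704 |
| 2000 | G-FISTA | - | 0.0163 | - | 0.0001 | 0.0391 | 0.0358 | - | 0.9878 | - | 1.0000 | 0.9624 | 0.9704 |
| 2000 | LC-FISTA | - | 0.0160 | - | 0.0001 | 0.0391 | 0.0358 | - | 0.9882 | - | 1.0000 | 0.9624 | 0.9704 |
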
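Table 6. IA and TIC values achieved by FISTA, G-FISTA, and LC-FISTA for training of the ELM-AE. A hyphen marks a configuration that was not evaluated.

| L | Algorithm | IA: CIFAR10 | IA: UMist | IA: MNIST | IA: NORB | IA: SCars | IA: Caltech256 | TIC: CIFAR10 | TIC: UMist | TIC: MNIST | TIC: NORB | TIC: SCars | TIC: Caltech256 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | FISTA | 0.9734 | 0.9790 | 0.9799 | 0.9879 | 0.9386 | 0.9559 | 0.0700 | 0.0712 | 0.1288 | 0.0247 | 0.1207 | 0.0981 |
| 100 | G-FISTA | 0.9732 | 0.9793 | 0.9799 | 0.9879 | 0.9386 | 0.9559 | 0.0701 | 0.0708 | 0.1288 | 0.0247 | 0.1208 | 0.0981 |
| 100 | LC-FISTA | 0.9733 | 0.9793 | 0.9800 | 0.9879 | 0.9386 | 0.9559 | 0.0701 | 0.0708 | 0.1288 | 0.0247 | 0.1208 | 0.0981 |
| 500 | FISTA | 0.9973 | 0.9917 | 0.9998 | 0.9971 | 0.9729 | 0.9800 | 0.0228 | 0.0449 | 0.0121 | 0.0123 | 0.0829 | 0.0678 |
| 500 | G-FISTA | 0.9973 | 0.9917 | 0.9998 | 0.9971 | 0.9729 | 0.9800 | 0.0228 | 0.0454 | 0.0121 | 0.0123 | 0.0829 | 0.0678 |
| 500 | LC-FISTA | 0.9973 | 0.9918 | 0.9998 | 0.9971 | 0.9729 | 0.9800 | 0.0228 | 0.0451 | 0.0121 | 0.0123 | 0.0829 | 0.0678 |
| 1000 | FISTA | 1.0000 | 0.9951 | - | 0.9999 | 0.9827 | 0.9869 | 0.0018 | 0.0350 | - | 0.0028 | 0.0668 | 0.0551 |
| 1000 | G-FISTA | 1.0000 | 0.9948 | - | 0.9999 | 0.9827 | 0.9870 | 0.0018 | 0.0363 | - | 0.0028 | 0.0668 | 0.0551 |
| 1000 | LC-FISTA | 1.0000 | 0.9949 | - | 0.9999 | 0.9827 | 0.9869 | 0.0018 | 0.0358 | - | 0.0028 | 0.0669 | 0.0551 |
| 1500 | FISTA | - | 0.9965 | - | 1.0000 | 0.9874 | 0.9903 | - | 0.0297 | - | $9.81 \times 10^{-4}$ | 0.0573 | 0.0476 |
| 1500 | G-FISTA | - | 0.9961 | - | 1.0000 | 0.9874 | 0.9903 | - | 0.0313 | - | $9.79 \times 10^{-4}$ | 0.0573 | 0.0476 |
| 1500 | LC-FISTA | - | 0.9962 | - | 1.0000 | 0.9874 | 0.9903 | - | 0.0308 | - | $9.81 \times 10^{-4}$ | 0.0573 | 0.0476 |
| 2000 | FISTA | - | 0.9973 | - | 1.0000 | 0.9903 | 0.9924 | - | 0.0261 | - | $2.16 \times 10^{-4}$ | 0.0505 | 0.0422 |
| 2000 | G-FISTA | - | 0.9969 | - | 1.0000 | 0.9903 | 0.9924 | - | 0.0279 | - | $2.15 \times 10^{-4}$ | 0.0505 | 0.0422 |
| 2000 | LC-FISTA | - | 0.9970 | - | 1.0000 | 0.9903 | 0.9924 | - | 0.0273 | - | $2.16 \times 10^{-4}$ | 0.0505 | 0.0422 |
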
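Table 7. MIN and MAX values achieved by FISTA, G-FISTA, and LC-FISTA for training of the ELM-AE. A hyphen marks a configuration that was not evaluated.

| L | Algorithm | MIN: CIFAR10 | MIN: UMist | MIN: MNIST | MIN: NORB | MIN: SCars | MIN: Caltech256 | MAX: CIFAR10 | MAX: UMist | MAX: MNIST | MAX: NORB | MAX: SCars | MAX: Caltech256 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 100 | FISTA | 0.0093 | 0.0251 | 0.022 | 0.066 | 0.0204 | 0.0059 | 0.1225 | 0.0743 | 0.933 | 0.0531 | 0.1843 | 0.2533 |
| 100 | G-FISTA | 0.0095 | 0.0254 | 0.0224 | 0.0065 | 0.0206 | 0.0061 | 0.1211 | 0.0742 | 0.0953 | 0.0530 | 0.1818 | 0.2505 |
| 100 | LC-FISTA | 0.0093 | 0.0250 | 0.0227 | 0.0066 | 0.0206 | 0.0060 | 0.1227 | 0.0742 | 0.0917 | 0.0533 | 0.1809 | 0.2528 |
| 500 | FISTA | 0.0028 | 0.0127 | 0.0010 | 0.0038 | 0.0143 | 0.0040 | 0.0563 | 0.0465 | 0.0277 | 0.0346 | 0.1292 | 0.1814 |
| 500 | G-FISTA | 0.0027 | 0.0114 | 0.0009 | 0.0034 | 0.0143 | 0.0040 | 0.0564 | 0.0471 | 0.075 | 0.0348 | 0.1283 | 0.1811 |
| 500 | LC-FISTA | 0.0028 | 0.0112 | 0.0009 | 0.0033 | 0.0144 | 0.0040 | 0.0563 | 0.0465 | 0.0278 | 0.0349 | 0.1290 | 0.1815 |
| 1000 | FISTA | 0.0006 | 0.0069 | - | 0.0005 | 0.0124 | 0.0028 | 0.0175 | 0.0357 | - | 0.0176 | 0.1141 | 0.1599 |
| 1000 | G-FISTA | 0.0006 | 0.0063 | - | 0.0005 | 0.0123 | 0.0028 | 0.0177 | 0.0373 | - | 0.0178 | 0.1138 | 0.1605 |
| 1000 | LC-FISTA | 0.0006 | 0.0058 | - | 0.0005 | 0.0123 | 0.0028 | 0.0176 | 0.0368 | - | 0.0177 | 0.1136 | 0.1602 |
| 1500 | FISTA | - | 0.0038 | - | $5.3 \times 10^{-5}$ | 0.0105 | 0.0020 | - | 0.0281 | - | 0.0084 | 0.1010 | 0.1382 |
| 1500 | G-FISTA | - | 0.0037 | - | $5.4 \times 10^{-5}$ | 0.0105 | 0.0020 | - | 0.0303 | - | 0.0082 | 0.1010 | 0.1380 |
| 1500 | LC-FISTA | - | 0.0033 | - | $5.4 \times 10^{-5}$ | 0.0105 | 0.0020 | - | 0.0295 | - | 0.0082 | 0.1012 | 0.1375 |
| 2000 | FISTA | - | 0.0028 | - | $1.7 \times 10^{-5}$ | 0.0095 | 0.0017 | - | 0.0252 | - | 0.0047 | 0.0950 | 0.1244 |
| 2000 | G-FISTA | - | 0.0029 | - | $1.7 \times 10^{-5}$ | 0.0095 | 0.0017 | - | 0.0274 | - | 0.0048 | 0.0948 | 0.1249 |
| 2000 | LC-FISTA | - | 0.0025 | - | $1.6 \times 10^{-5}$ | 0.0094 | 0.0017 | - | 0.0267 | - | 0.0047 | 0.0950 | 0.1244 |
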
Table 8. The performance of each algorithm for the λ chosen in Table 3.

| Algorithm | RMSE: UMist | RMSE: NORB | RMSE: SCars | RMSE: Caltech256 | Time (min): UMist | Time (min): NORB | Time (min): SCars | Time (min): Caltech256 |
|---|---|---|---|---|---|---|---|---|
| FISTA | 0.0211 | 0.0001 | 0.0537 | 0.0494 | 49 | 21 | 44 | 139 |
| G-FISTA | 0.0225 | 0.0001 | 0.0537 | 0.0494 | 40 | 12 | 27 | 85 |
| LC-FISTA | 0.0220 | 0.0001 | 0.0537 | 0.0494 | 37 | 11 | 26 | 70 |

Table 9. Average speeds of the G-FISTA and LC-FISTA algorithms compared with FISTA for training of the ELM-AE. Entries are reductions in training time relative to FISTA (%); a hyphen marks a configuration that was not evaluated.

| L | Algorithm | CIFAR10 | UMist | MNIST | NORB | SCars | Caltech256 |
|---|---|---|---|---|---|---|---|
| 100 | G-FISTA | 43.84 | 17.80 | 34.31 | 45.85 | 50.99 | 43.82 |
| 100 | LC-FISTA | 41.10 | 19.86 | 25.52 | 49.97 | 46.07 | 48.07 |
| 500 | G-FISTA | 44.34 | 10.07 | 37.43 | 45.26 | 49.34 | 37.83 |
| 500 | LC-FISTA | 43.02 | 16.90 | 29.89 | 49.87 | 48.91 | 42.75 |
| 1000 | G-FISTA | 45.25 | 9.93 | - | 41.95 | 47.95 | 40.49 |
| 1000 | LC-FISTA | 43.10 | 15.81 | - | 47.19 | 48.92 | 44.09 |
| 1500 | G-FISTA | - | 16.26 | - | 40.63 | 47.18 | 38.14 |
| 1500 | LC-FISTA | - | 19.57 | - | 45.86 | 47.41 | 43.32 |
| 2000 | G-FISTA | - | 18.05 | - | 41.14 | 46.63 | 36.88 |
| 2000 | LC-FISTA | - | 21.37 | - | 46.76 | 45.28 | 41.26 |
| Av. time | G-FISTA | 44.48% | 14.42% | 35.87% | 42.97% | 48.42% | 39.43% |
| Av. time | LC-FISTA | 42.41% | 18.71% | 27.70% | 47.93% | 47.32% | 43.90% |

Note: The formula % = (1 − A1/A2) × 100 compares the computation time of algorithms A1 and A2.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Vásquez-Coronel, J.A.; Mora, M.; Vilches, K. Training of an Extreme Learning Machine Autoencoder Based on an Iterative Shrinkage-Thresholding Optimization Algorithm. Appl. Sci. 2022, 12, 9021. https://doi.org/10.3390/app12189021
