A Novel Orthogonal Extreme Learning Machine for Regression and Classification Problems

Abstract: An extreme learning machine (ELM) is an innovative algorithm for single hidden layer feed-forward neural networks that, in essence, only seeks the optimal output weights minimizing the output error of the least-squares regression from the hidden layer to the output layer. Focusing on the output weights, we introduce an orthogonal constraint on the output weight matrix and propose a novel orthogonal extreme learning machine (NOELM) based on column-by-column optimization, whose main characteristic is that the optimization of the complete output weight matrix is decomposed into the optimization of its individual column vectors. The complex unbalanced orthogonal procrustes problem is thereby transformed into a simple least-squares regression with an orthogonal constraint, which preserves more information from the ELM feature space in the output subspace and gives NOELM stronger regression and discrimination ability. Experiments show that NOELM outperforms ELM and OELM in training time, testing time and accuracy.


Introduction
An extreme learning machine (ELM) is an innovative learning algorithm for single hidden layer feed-forward neural networks (SLFNs for short), proposed by Huang et al. [1], characterized by internal parameters that are generated randomly and require no tuning. In essence, the ELM is a special artificial neural network model whose input weights are generated randomly and fixed, so that a unique least-squares solution for the output weights can be obtained [1], improving performance [2][3][4]. Conventional models suffer from slow convergence, limited generalization, over-fitting, local minima and laborious parameter adjustment, all of which the ELM avoids, making it superior [1,5]. The learning process of the ELM is relatively simple. First, the internal parameters of the hidden layer, such as the input weights connecting the input layer to the hidden layer and the number of hidden layer neurons, are generated randomly and remain fixed during the whole process. Second, a non-linear mapping function is selected to map the input data to the feature space, and by comparing the real outputs with the expected outputs, the key parameter (the output weights connecting the hidden layer to the output layer) is obtained directly, omitting iterative tuning. Its training speed is therefore considerably faster than that of conventional algorithms [6].
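The two-step process above can be sketched in a few lines of NumPy. This is an illustrative sketch, not code from the paper: the function names, the sigmoid activation and the regularization constant are our own choices.

```python
import numpy as np

def elm_fit(X, Y, L, C=1e-6, seed=0):
    """Minimal ELM sketch: random fixed hidden layer, closed-form output weights."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n, L))   # random input weights, never tuned
    b = rng.uniform(-1.0, 1.0, size=L)        # random biases, never tuned
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))    # sigmoid hidden-layer output
    # regularized least-squares output weights: beta = (H^T H + C I)^{-1} H^T Y
    beta = np.linalg.solve(H.T @ H + C * np.eye(L), H.T @ Y)
    return W, b, beta

def elm_predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta
```

Note that the only learned quantity is `beta`; the random hidden layer is generated once and reused at prediction time.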
Due to its good performance, the ELM is widely used in regression and classification. To meet higher requirements, researchers have optimized and improved the ELM and proposed many better ELM-based algorithms. Di Wang et al. combined the local-weighted jackknife with RELM to propose a novel conformal regressor (LW-JP-RELM), which complements the ELM with interval predictions satisfying a given level of confidence [7]. To improve generalization performance, Ying Yin et al. proposed enhancing the ELM with a Markov boundary-based feature selection, using feature interaction and mutual information to reduce the number of features and thereby construct a more compact network with greatly improved generalization [8]. Ding et al. reformulated an optimization extreme learning machine with a new regularization parameter, which is bounded between 0 and 1 and easier to interpret than the error penalty parameter C, and which achieves better generalization performance [9]. To address the sensitivity of the ELM to ill-conditioned data, Hasan et al. proposed two novel ELM-based algorithms, ridge regression and almost unbiased ridge regression, together with three criteria for selecting the regularization parameter, which greatly improved the generalization and stability of the ELM [10]. In addition, there are further effective ELM-based algorithms, such as Distributed Generalized Regularized ELM (DGR-ELM) [11], Self-Organizing Map ELM (SOM-ELM) [12], Data and Model Parallel ELM (DMP-ELM) [13], Genetic Algorithm ELM (GA-ELM) [14], and Jaya optimization with mutation ELM (MJaya-ELM) [15].
Whether using the simple ELM or more complex ELM-based algorithms, one must essentially find the optimal solution for two key parameters: the number of hidden layer neurons and the output weights. From the input layer to the output layer, the ELM essentially learns the output weights by least-squares regression analysis [16]. Therefore, many algorithms beyond those mentioned above are still based on least-squares regression, and their main task is to find an optimal transformation matrix that minimizes the sum-of-squares error. Among these strategies [17,18], introducing an orthogonal constraint into the optimization problem is both necessary and widely employed in classification and subspace learning. Nie et al. showed that least-squares discriminant and regression analysis with an orthogonal constraint performs much better than without one [19,20]. After introducing the orthogonal constraint into the ELM, the optimization problem becomes an unbalanced procrustes problem, which is hard to solve. Yong Peng et al. pointed out that the unbalanced procrustes problem can be transformed into a balanced procrustes problem, which is relatively simple [16]. Motivated by this research, in this paper we focus on the output weights and propose a novel orthogonal optimization method (NOELM) to solve the unbalanced procrustes problem; its main contribution is that the optimization of the complete matrix is decomposed into the optimization of its individual column vectors, reducing the complexity of the algorithm.
The remainder of the paper is organized as follows. Section 2 reviews briefly the basic ELM model. In Section 3, the model formulation and the iterative optimization method are detailed. The convergence and complexity analysis is presented in Section 4. In Section 5, the experiments are conducted to show the performances of NOELM. Finally, Section 6 concludes the paper.

Extreme Learning Machine
Given N training samples $(x_i, y_i)$, where N is the sample number, $x_i \in \mathbb{R}^n$ is the input vector and $y_i = [y_{i1}\ y_{i2}\ \dots\ y_{im}]^T \in \mathbb{R}^m$ is the expected output vector of the i-th sample, $i = 1, 2, 3, \dots, N$. For a selected activation function $g(\cdot)$, if the real output of the SLFNs is the same as the expected output $y_i$, the mathematical representation of the SLFNs is as follows:

$$\sum_{i=1}^{L} \beta_i\, g(\omega_i^T x_j + b_i) = y_j, \quad j = 1, 2, \dots, N, \qquad (1)$$

where $\omega_i = [\omega_{i1}, \dots, \omega_{in}]^T$ is the input weight connecting the input layer to the i-th hidden layer neuron, $b_i$ is the bias of the i-th hidden layer neuron, $\beta_i = [\beta_{i1}, \dots, \beta_{im}]^T$ is the output weight connecting the i-th hidden layer neuron and the output layer, and L is the number of hidden layer neurons, shown in Figure 1.

Equation (1) can be compactly rewritten as

$$H\beta = Y, \qquad (2)$$

where

$$H = \begin{bmatrix} g(\omega_1^T x_1 + b_1) & \cdots & g(\omega_L^T x_1 + b_L) \\ \vdots & \ddots & \vdots \\ g(\omega_1^T x_N + b_1) & \cdots & g(\omega_L^T x_N + b_L) \end{bmatrix}, \quad \beta = [\beta_1, \dots, \beta_L]^T, \quad Y = [y_1, \dots, y_N]^T. \qquad (3)$$

So, based on the theory of the ELM, the optimal solution of Equation (2) is

$$\beta = H^{\dagger} Y,$$

where $H^{\dagger}$ is the Moore-Penrose inverse of the matrix H, $H^{\dagger} = (H^T H)^{-1} H^T$. To further improve model precision, regularization is introduced into the ELM, and the optimization problem becomes

$$\min_{\beta}\ \|H\beta - Y\|^2 + C\|\beta\|^2,$$

where C is the regularization parameter, used to balance the empirical risk and the structural risk. Based on the Karush-Kuhn-Tucker condition, the optimal solution of $\beta$ is obtained:

$$\beta = (H^T H + CI)^{-1} H^T Y.$$
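As a quick numerical sanity check on the closed form above: for a tall matrix H with full column rank, the expression $(H^TH)^{-1}H^T$ coincides with the Moore-Penrose inverse. A small illustrative script (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.standard_normal((20, 5))          # tall matrix, full column rank almost surely
H_dagger = np.linalg.inv(H.T @ H) @ H.T   # (H^T H)^{-1} H^T
assert np.allclose(H_dagger, np.linalg.pinv(H))   # matches the Moore-Penrose inverse
assert np.allclose(H_dagger @ H, np.eye(5))       # left inverse of H
```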

Novel Orthogonal Extreme Learning Machine (NOELM)
The orthogonal constraint is introduced into the ELM shown in Figure 1, and the optimization problem becomes

$$\min_{\beta_{L+1}}\ \|H_{L+1}\beta_{L+1} - Y\|^2, \quad \text{s.t. } \beta_{L+1}^T \beta_{L+1} = I, \qquad (8)$$

where $H_{L+1} \in \mathbb{R}^{N \times (L+1)}$ and $\beta_{L+1} \in \mathbb{R}^{(L+1) \times m}$ are the output matrix and the output weight of the hidden layer, and $Y \in \mathbb{R}^{N \times m}$. Because of the orthogonal constraint, the input samples are mapped into an orthogonal subspace, where their metric structure can be preserved. Since $L > m$, problem (8) is an unbalanced orthogonal procrustes problem, which is difficult to solve directly because of the orthogonal constraint [16]. In this paper, an improved method is proposed to optimize problem (8) based on the following lemma.
Lemma 1. If $\beta_{L+1}^*$ is the optimal solution of problem (8) and its orthogonal complement is $B_{L+1}^*$, then $\beta_{L+1}^{*T} H_{L+1}^T Y$ is positive semi-definite and symmetric. The proof of Lemma 1 is simple and can be found in the literature [21].

Motivated by Lemma 1, a local transformation is applied to Equation (8): we relax the j-th column $\rho_j$ ($j \le L+1$) and fix the others, $\hat{\beta}_j = [\rho_1, \dots, \rho_{j-1}, \rho_{j+1}, \dots, \rho_{L+1}]$; then the problem can be transformed into

$$\min_{\rho_j}\ \|H_{L+1}\rho_j - \hat{y}_j\|^2, \quad \text{s.t. } \|\rho_j\| = 1,\ \rho_j \perp \hat{\beta}_j. \qquad (10)$$

If $\rho_j^*$ is the optimal solution of Equation (10), the approximation $\beta_{L+1}$ is improved after replacing $\rho_j$ by $\rho_j^*$, and obviously the modified $\beta^*$ is still orthogonal. Solving the constrained problem (10) directly is somewhat difficult, so the orthogonal complement $B_{L+1}$ of $\beta_{L+1}$ can be used to simplify it. Set $P_{L+1} = [\rho_j, B_{L+1}]$; since $\rho_j \perp \hat{\beta}_j$, $P_{L+1}$ is the orthogonal complement of $\hat{\beta}_j$. Thus the condition $\rho_j \perp \hat{\beta}_j$ in problem (10) can be represented in another form, $\rho_j = P_{L+1}x$, where $x \in \mathbb{R}^{n-L}$ is a unit vector, and problem (10) is transformed into the following problem with a quadratic equality constraint:

$$\min_{x}\ \|H_{L+1}P_{L+1}x - \hat{y}_j\|^2, \quad \text{s.t. } x^T x = 1. \qquad (11)$$

Clearly, after obtaining the optimal solution $x^*$ of problem (11), the solution of problem (10) is $\rho_j = P_{L+1}x^*$. If the orthogonal complement of $\beta_{L+1}^*$ is $B_{L+1}^*$, then $B_{L+1}^* = P_{L+1}W$, where W is the orthogonal complement of $x^*$; it can be constructed easily using the Householder reflection $I - 2\omega\omega^T$ with $\|\omega\| = 1$.

To solve problem (11), first rewrite Equation (11) in the general form

$$\min_{x}\ \|Ax - y\|^2, \quad \text{s.t. } x^T x = 1, \qquad (13)$$

where $A = H_{L+1}P_{L+1}$ and $y = \hat{y}_j$. Expanding $\|Ax - y\|^2$ gives

$$\|Ax - y\|^2 = x^T A^T A x - 2x^T A^T y + y^T y. \qquad (14)$$

Since the parameters A and y are fixed in Equations (13) and (14), the minimization of J(x) can be transformed approximately into the maximization of the trace $x^T A^T y$, shown by Equation (15), where we denote $W = A^T y$:

$$\max_{x}\ \operatorname{trace}(x^T W), \quad \text{s.t. } x^T x = 1. \qquad (15)$$
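The final reduction can be verified numerically: over unit vectors, $x^T W$ is maximized by $x^* = W/\|W\|$, with maximum value $\|W\|$ (Cauchy-Schwarz). A small illustrative check, with a random vector standing in for $W = A^T y$:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal(8)               # stands in for W = A^T y
x_star = W / np.linalg.norm(W)           # closed-form unit maximizer of x^T W
assert np.isclose(x_star @ W, np.linalg.norm(W))
# no other unit vector scores higher
for _ in range(1000):
    u = rng.standard_normal(8)
    u /= np.linalg.norm(u)
    assert u @ W <= x_star @ W + 1e-12
```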
The solutions of Equation (15) are $x \in \{X : X = \arg\max \operatorname{trace}(X^T W)\}$. By SVD, $W = U\,\mathrm{diag}(\Sigma_k, O_{s-k})V^T$. Since $x \in \mathbb{R}^{n-L}$ is a unit vector, partition $X^* = [x_{ij}]$ accordingly; because $X^*$ is unit and orthogonal, $\|X^*\| = 1$ and $-1 \le x_{ij} \le 1$, so at the maximum it can be deduced that $x_{ij} = 1$ for $i = j$; then $X_{11} = I_k$, $X_{21} \in O_{s-k}$, with $k = 1$. Hence the optimal unit vector is obtained from the leading singular vectors of W.

Based on the analysis above, the novel optimization of objective problem (8) is proposed; its details are as follows (Algorithm 1):

Algorithm 1: Optimization of objective problem (8)
Basic information: training samples $(x_i, y_i)_{i=1}^{N}$, $x_i \in \mathbb{R}^n$, $y_i \in \mathbb{R}^m$.
Initialization: set thresholds τ and η.
S1. Generate the input weight matrix w and bias vector b;
S2. Calculate the output matrix of the hidden layer H based on Equation (3);
S3. Calculate the orthogonal β of span($H^T Y$) and its orthogonal complement $B_{L+1}$, then $r_0 = \|H\beta - Y\|$;
S4. Relax the j-th column $\rho_j$ and $\hat{y}_j$ from the matrices β and Y separately, and fix the rest;
S5. Construct $P = [\rho_j, B_{L+1}]$;
S6. Set $A = HP$, $y = \hat{y}_j$, then $W = A^T y$. By SVD, $W = U\,\mathrm{diag}(\Sigma_k, O_{s-k})V^T$, so as to obtain U and V;
S7. Update $\rho_j$ and β from U and V, and compute $r_i = \|H\beta - Y\|$; if $r_{i-1} - r_i < \tau$, stop; otherwise return to S4.
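The column-by-column scheme above can be sketched as follows. This is a hedged reconstruction under our own assumptions, not the authors' code: the complement basis P is obtained from an SVD null space rather than Householder reflections, the column update uses the closed-form unit maximizer $x = W/\|W\|$, and all names (`noelm_sketch`, etc.) are illustrative.

```python
import numpy as np

def noelm_sketch(H, Y, iters=20, tol=1e-6):
    """Illustrative column-by-column orthogonal fit of min ||H B - Y||, B^T B = I."""
    N, d = H.shape
    m = Y.shape[1]                                # assumes d > m
    # initial orthonormal columns spanning H^T Y (via reduced QR)
    beta, _ = np.linalg.qr(H.T @ Y)               # d x m
    prev = np.linalg.norm(H @ beta - Y)
    for _ in range(iters):
        for j in range(m):
            # relax the j-th column, keep the others fixed
            rest = np.delete(beta, j, axis=1)     # d x (m-1)
            # P spans the orthogonal complement of the fixed columns
            # (null space of rest^T), so P @ x stays orthogonal to them
            _, _, Vt = np.linalg.svd(rest.T)
            P = Vt[m - 1:].T                      # d x (d - m + 1), orthonormal
            # approximate min ||H P x - y_j||, ||x|| = 1 by maximizing x^T (H P)^T y_j
            W = (H @ P).T @ Y[:, j]
            x = W / np.linalg.norm(W)
            beta[:, j] = P @ x                    # update the relaxed column
        err = np.linalg.norm(H @ beta - Y)
        if prev - err < tol:
            break
        prev = err
    return beta
```

By construction each updated column is a unit vector orthogonal to the fixed columns, so the returned matrix keeps orthonormal columns throughout.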

Convergence and Complexity Analysis
Considering the convergence of the algorithm, let $\beta_{i,j}$ be the sequence of $\beta^*$ generated during iteration, which converges to β, so that its orthogonal complement $B_{i,j}$ also converges to B, where i is the iteration number and j indexes the operation of relaxing the j-th column from the original matrix. It follows that $\rho_j^*$ is the optimal solution of Equation (10). If $\|H\rho_j - \hat{y}_j\| - \|H\rho_j^* - \hat{y}_j\| = \eta$, then for i large enough, $P_{i,j}$ and P satisfy the bound of Equation (22). Set $\rho_j^* = Px$ with $\|x\| = 1$, and $\hat{\rho}_j = P_{i,j}x$; then, based on Equation (22) together with Equations (10), (23) and (24), it can be deduced that $f(\beta_{1,j}) > f(\beta_{2,j}) > \dots > f(\beta_{n,j})$. By the same method and analysis, it can also be obtained that $\dots > f(\beta_{3,1}) > \dots$, so the sequence $f(\beta_{i,j})$ is monotonically decreasing, and when $i \to \infty$, $f(\beta_{i,j}) - f(\beta_{i+1,j}) \to 0$. In a word, the novel algorithm monotonically decreases the objective shown in Equation (8).
It is known that the complexity of the ELM derives from the calculation of the output weights β, or rather, it is mainly spent calculating the inverse of the matrix $H^T H + CI$. In most cases, the number of hidden layer neurons L is much smaller than the training sample size N, $L \ll N$; thus the complexity is less than that of the least-squares support vector machine (LS-SVM) and the proximal support vector machine (PSVM), which need to calculate the inverse of an N × N matrix [16]. As we know, the complexities of ELM and OELM are $O(L^3)$ and $O(t(NL^2 + L^3))$, respectively. As for the complexity of the novel algorithm proposed in this paper, its main cost comes from the loop. In each iteration it needs to find the optimal solution for one column relaxed from β, and for this it performs an SVD decomposition of the m × 1 matrix $A^T y$, whose complexity is $O(m^2)$; the complexity of updating β once is then $O(m)$. So the complexity of the proposed algorithm is $O(tm^3)$, where t is the number of updates of β. In real applications, for both classification and regression, the output dimension is much smaller than the number of hidden layer neurons and the training sample size.
As we know, $(H\beta)^T = Y^T$, so $\beta^T h(x_i)^T = y_i^T$. Considering the Euclidean distance between any two data points $y_i$ and $y_j$, because of the orthogonal constraint $\beta^T\beta = I$ we have $\|y_i - y_j\| = \|h(x_i) - h(x_j)\|$. Here $h(x_i)$ is a point in the ELM feature space, $\|h(x_i) - h(x_j)\|$ is the distance in the ELM feature space, and $\|y_i - y_j\|$ is the distance in the output subspace. From this analysis, the novel ELM with orthogonal constraints is superior in maintaining the metric structure throughout.

Performance Evaluation
To test the performance of the novel algorithm proposed in this paper, it is compared with other learning algorithms on classification problems (EMG for Gestures, Avila and Ultrasonic Flowmeter) and regression problems (Auto price, Breast cancer, Boston housing, etc.) from the University of California Irvine (UCI) machine learning repository [22], shown in Table 1. These learning algorithms include ELM [1], OELM [16] and I-ELM [23,24]; their activation function is the sigmoid function, and the number of hidden layer neurons is set to three times the input dimension. For I-ELM, the initial number of hidden layer neurons is set to zero. In the experiments, key parameters such as the input weights and biases are generated randomly from [−1, 1]; all samples are normalized into [−1, 1], and the outputs of the regression problems are normalized into [0, 1] [25]. All simulations are run in the Matlab R2016a environment. For the classification problems, ELM and OELM are selected for comparison with NOELM. The experimental results are shown in Figures 2 and 3. Figure 2 shows the convergence property of NOELM. At first the convergence rate is high and the objective value falls rapidly; when it reaches about 0.8 it falls slowly until stable. During the whole process the number of iterations does not vary significantly: the maximum is not more than 20 and the minimum is only about 5, so the novel algorithm is quite efficient. Figure 3 compares the training time and classification rate. As follows from the complexity analysis above, the traditional ELM has the lowest complexity and its training time is shortest. The complexity of NOELM is less than that of OELM, so its training time is shorter than OELM's, and longer than ELM's because of the iterations, but the difference is not larger than 0.05.
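The per-column normalization used above (inputs into [−1, 1], regression targets into [0, 1]) can be written as a small min-max helper; `scale_to` is an illustrative name, not a function from the paper.

```python
import numpy as np

def scale_to(X, lo, hi):
    """Min-max scale each column of X into [lo, hi]."""
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    return lo + (hi - lo) * (X - xmin) / (xmax - xmin)
```

For example, `scale_to(X, -1.0, 1.0)` maps each feature column onto [−1, 1], and `scale_to(Y, 0.0, 1.0)` maps regression targets onto [0, 1].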
Although NOELM is not the best in terms of training time, its classification rate is better than the other two; the largest rate can reach 0.9. For the regression problems, ELM, OELM and I-ELM are selected for comparison with NOELM; the experimental results are shown in Tables 2-4. As mentioned above, the number of hidden layer neurons is determined by the input dimension, so the hidden layer neurons of ELM, OELM and NOELM are fixed, while I-ELM dynamically increases its hidden layer neurons.
Analyzing the information in Table 2, compared with I-ELM the network complexity of NOELM is a little lower and its structure is more compact, although it is a little worse than I-ELM on some datasets; the difference is not large and fully acceptable. As for the training and testing accuracy in Tables 3 and 4, compared with ELM and OELM the performance of NOELM is better, and it has better stability. Owing to its characteristics, I-ELM constructs a more compact network and is slightly superior in training and testing accuracy on some datasets, and this is precisely the weak point of NOELM and related algorithms. However, by introducing the orthogonal constraints and improving the algorithm, NOELM greatly narrows this gap, and its performance is also acceptable.

Conclusions
In this paper, following the idea of OELM, the orthogonal constraint is introduced into the ELM and a novel orthogonal ELM (NOELM) is proposed, which is theoretically a special supervised learning algorithm. In contrast with OELM, the main characteristic and contribution is to transform the complex unbalanced orthogonal procrustes problem into a simple least-squares problem with an orthogonal constraint on a single vector, optimizing the output weight matrix column by column so as to obtain the optimal solution of the whole matrix. Compared with ELM and OELM, NOELM achieves a much better neural network, with a fast convergence rate and higher training and testing accuracy. Although NOELM is a little weaker than I-ELM in some respects, the gap is very narrow and the result is still acceptable.