Semi-Supervised Classification via Hypergraph Convolutional Extreme Learning Machine

Abstract: Extreme Learning Machine (ELM) is characterized by simplicity, generalization ability, and computational efficiency. However, previous ELMs fail to consider the inherent high-order relationships among data points, resulting in poor performance on structured data and weak robustness to noisy data. This paper presents a novel semi-supervised ELM, termed Hypergraph Convolutional ELM (HGCELM), which uses hypergraph convolution to extend ELM into the non-Euclidean domain. The method inherits all the advantages of ELM and consists of a random hypergraph convolutional layer followed by a hypergraph convolutional regression layer, enabling it to model complex intraclass variations. We show that the traditional ELM is a special case of the HGCELM model in the regular Euclidean domain. Extensive experimental results show that HGCELM remarkably outperforms eight competitive methods on 26 classification benchmarks.


Introduction
Extreme Learning Machine (ELM) [1,2] was developed as a simple but effective learning model for classification and regression problems. As a special form of the random vector functional-link network (RVFL) [3], ELM suggests that the hidden layer parameters of a neural network play an important role but do not need to be updated during training [4,5]. Inspired by this, a large number of ELM variants have been proposed and widely applied to biomedical data analysis [6], computer vision [7], system modeling and prediction [8,9], and so on.
The key to the classic ELM is random mapping [10,11]. Despite being helpful, the random mapping often suffers from poor robustness due to its randomness. To remedy this drawback, a number of works have been devoted to seeking optimal hidden parameters. Wu et al. [12] presented a multi-objective evolutionary ELM to jointly optimize the structural risk and empirical risk of ELM. Similarly, many popular heuristic search methods, including differential evolution [13], have been adopted for this purpose. However, heuristic search is often time-consuming. Alternatively, kernel ELM (KELM) [4,14] implicitly implements the ELM hidden mapping in a reproducing kernel Hilbert space, which usually results in more stable performance. Since random mapping also provides good diversity, it allows us to design ensembles of ELMs [5,15], which have proven useful for improving the robustness of a single ELM. Another trend in enhancing ELM is to make it deeper [16,17]. However, compared with popular deep learning models, e.g., Convolutional Neural Networks (CNNs) [18], deep ELMs lack the capacity to capture deep semantics from large-scale complex data.
The above-mentioned ELMs typically belong to supervised ELM, which requires adequate labeled samples for model training. In most practical situations, however, labeled samples are scarce while unlabeled data are abundant, which motivates semi-supervised ELMs. The main contributions of this work are summarized as follows:
1. We propose a simple but effective hypergraph convolutional ELM, i.e., HGCELM, for semi-supervised classification. HGCELM not only inherits all the advantages of ELM but also enables ELM to model the high-order relationships of data. This successful attempt signifies that structured information among data, especially high-order relationships, is important for ELM, offering an alternative direction for ELM representation learning.

2. We show that traditional ELMs are special cases of HGCELM on Euclidean data. We conduct extensive experiments on 26 popular datasets for the semi-supervised classification task. Comparisons with state-of-the-art methods demonstrate that the proposed HGCELM achieves superior performance.
The rest of the paper is organized as follows. In Section 2, we briefly review hypergraph learning, ELMs, and graph neural networks. In Section 3, we systematically introduce the framework, formulation, and implementation of the proposed method. Experimental evaluations and comparisons are presented in Section 4, followed by the conclusions and future work given in Section 5.

Notations
Throughout this paper, vectors are denoted by boldface lowercase italics (e.g., x), matrices by boldface uppercase roman letters (e.g., X), and scalars by italics (e.g., x_ij). Let X = {(x_i ∈ R^m, y_i ∈ R^C)}_{i=1}^N be the sample set consisting of N m-dimensional data points from C distinct classes, where x_i and y_i denote the i-th data point and its one-hot target vector, respectively. For clarity, we denote labeled and unlabeled data with subscripts L and U; e.g., X_L and X_U are the feature matrices for training and testing, respectively. I_N signifies an identity matrix of size N × N. The matrix Frobenius norm is defined as ‖X‖_F = (∑_ij x_ij²)^{1/2}. The main notations used in this paper and their definitions are given in Table 1.
Table 1. Important notations used in this paper and their definitions.

N       The number of data points.
m       The number of features.
C       The number of classes.
X       The feature matrix of the dataset, R^{N×m}.
Y       The label matrix with one-hot encoding, R^{N×C}.
N_T     The number of labeled samples.
T       The labeled set.
U       The unlabeled set.
Θ       The hidden layer parameter matrix, R^{m×L}.
β       The output layer parameter matrix, R^{L×C}.
Z       The matrix of latent representations, R^{N×L}.
Z†      The Moore-Penrose generalized inverse of the matrix Z.
L       The number of hidden neurons.
V       The set of vertices in the hypergraph.
E       The set of hyperedges in the hypergraph.
W       The diagonal matrix of the hyperedge weights, R^{|E|×|E|}.
d(v)    The degree of the vertex v.
δ(e)    The degree of the hyperedge e.
D_v     The diagonal matrix of the vertex degrees, R^{N×N}.
D_e     The diagonal matrix of the hyperedge degrees, R^{|E|×|E|}.
L       The hypergraph Laplacian matrix, R^{N×N}.

Hypergraph Preliminary
A hypergraph is a generalization of the simple graph in which an edge can join any number of vertices. We refer to the edges of a hypergraph as hyperedges. By contrast, we denote a regular graph, in which each edge connects only two vertices, as a simple graph. A visual comparison between the simple graph and the hypergraph is illustrated in Figure 1. As can be seen, a hypergraph can reveal more complex data relationships than a simple graph. For a learning task, a hypergraph is usually used to represent the high-order intraclass variations among data points.
Formally, let G = (V, E, W) be a hypergraph composed of a vertex set V of size N, a hyperedge set E of size |E|, and a set of hyperedge weights W, where the weight of hyperedge e is denoted w(e). A hypergraph is often described by an incidence matrix H ∈ R^{N×|E|} whose elements indicate whether a vertex joins a corresponding hyperedge (e.g., Figure 1b). Mathematically, the incidence matrix is defined by

h(v, e) = 1 if v ∈ e, and h(v, e) = 0 otherwise.

The degree of a vertex v and the degree of a hyperedge e are given as follows, respectively:

d(v) = ∑_{e∈E} w(e) h(v, e),   δ(e) = ∑_{v∈V} h(v, e).

By analogy with the simple graph, spectral analysis can be used as an efficient tool for the analysis of hypergraphs. The normalized hypergraph Laplacian matrix [33,34] is calculated by

L = I_N − D_v^{−1/2} H W D_e^{−1} H^⊤ D_v^{−1/2}.

For a semi-supervised learning task, the hypergraph is usually incorporated together with an empirical error term [35], as follows:

arg min_F  R_emp(F) + λ tr(F^⊤ L F),

where R_emp(F) denotes the empirical error term over a problem-dependent prediction F.
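For concreteness, the following NumPy sketch computes these quantities for a toy incidence matrix; the values of H and the equal weights are illustrative choices, not data from the paper.

```python
import numpy as np

# Toy incidence matrix: N = 4 vertices, |E| = 3 hyperedges (illustrative values).
H = np.array([[1, 0, 1],
              [1, 1, 0],
              [0, 1, 1],
              [0, 1, 1]], dtype=float)
w = np.ones(H.shape[1])                   # equal hyperedge weights w(e)

W = np.diag(w)                            # hyperedge weight matrix, R^{|E| x |E|}
Dv = np.diag(H @ w)                       # vertex degrees d(v) = sum_e w(e) h(v, e)
De = np.diag(H.sum(axis=0))               # hyperedge degrees delta(e) = sum_v h(v, e)

Dv_isqrt = np.diag(1.0 / np.sqrt(H @ w))
# Normalized hypergraph Laplacian: L = I_N - Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}
L = np.eye(H.shape[0]) - Dv_isqrt @ H @ W @ np.linalg.inv(De) @ H.T @ Dv_isqrt
print(L.round(3))
```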

ELMs
The basic ELM can be interpreted as two components, i.e., a random hidden mapping and a ridge regression classifier. Formally, ELM's hidden layer can be expressed as

Z = σ(X Θ + b),

where Z is the hidden layer output matrix parameterized by the hidden weight matrix Θ and bias vector b, and σ denotes a nonlinear activation function such as the sigmoid. In the second stage, ELM computes predictions by

Y = Z β.    (7)

Here, β is the output weight matrix. Since Z is known to the output layer, Equation (7) is essentially a least-squares optimization problem and can be solved as

β* = Z† Y,

where Z† is the Moore-Penrose generalized inverse of Z. ELM avoids iterative parameter tuning and is thus significantly faster than gradient descent-based neural networks. SS-ELM is the semi-supervised version of ELM obtained by introducing a graph Laplacian regularization term [21]. Its formulation is given by

arg min_β ‖Z_L β − Y_L‖_F² + λ‖β‖_F² + γ tr((Zβ)^⊤ L Zβ),

where L here denotes the graph Laplacian. SS-ELM also has a closed-form solution. It should be noted that although graph structure information is considered in SS-ELM, it essentially works in the regular Euclidean domain.
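To make the two-stage structure concrete, here is a minimal ELM sketch under the formulation above; the regularized least-squares (ridge) variant is used, and the names elm_fit, elm_predict, and lam are our own illustrative choices.

```python
import numpy as np

def elm_fit(X, Y, n_hidden=50, lam=1e-3, seed=0):
    """Random hidden mapping + ridge-regression output layer."""
    rng = np.random.default_rng(seed)
    Theta = rng.standard_normal((X.shape[1], n_hidden))   # random, never updated
    b = rng.standard_normal(n_hidden)
    Z = 1.0 / (1.0 + np.exp(-(X @ Theta + b)))            # sigmoid activation
    # Regularized least squares: beta = (Z^T Z + lam * I)^{-1} Z^T Y
    beta = np.linalg.solve(Z.T @ Z + lam * np.eye(n_hidden), Z.T @ Y)
    return Theta, b, beta

def elm_predict(X, Theta, b, beta):
    Z = 1.0 / (1.0 + np.exp(-(X @ Theta + b)))
    return (Z @ beta).argmax(axis=1)                      # predicted class indices
```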

GCNs
There is an increasing interest in generalizing convolutions to the graph domain [23]. The recent development of GNNs allows us to efficiently approximate convolution on graph-structured data. GNNs can typically be divided into two categories [23,28]: spectral convolutions [36], which perform convolution by transforming node representations into the spectral domain using the graph Fourier transform or its extensions, and spatial convolutions [37], which perform convolution by sampling from neighborhood signals.
In [25], Kipf and Welling developed the Graph Convolutional Network (GCN) by simplifying the spectral convolution with first-order Chebyshev polynomials and setting the largest eigenvalue of the normalized graph Laplacian to 2. Formally, GCN defines the spectral convolution over a graph as follows:

Z = σ(D̃^{−1/2} Ã D̃^{−1/2} X Θ).

Here, Ã = I_N + A is the so-called augmented normalized adjacency, and D̃ is given by D̃_ii = ∑_j Ã_ij. However, current GNNs rely heavily on gradient descent optimizers, which are often time-consuming and prone to locally optimal solutions. To overcome these shortcomings, Zhang et al. [32] proposed a randomization-based GCN (i.e., GCELM) that combines the advantages of ELM with GCN. Instead of updating all trainable parameters, GCELM employs a random graph convolutional layer and keeps it fixed. This allows GCELM to compute a closed-form solution in the training phase, resulting in a faster learning speed.
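The propagation rule can be sketched in a few lines of NumPy; the dense adjacency matrix and ReLU activation here are assumptions for illustration, not claims about the reference implementation.

```python
import numpy as np

def gcn_layer(A, X, Theta):
    """One GCN layer: Z = ReLU(D~^{-1/2} A~ D~^{-1/2} X Theta)."""
    A_tilde = np.eye(A.shape[0]) + A              # A~ = I_N + A
    d = A_tilde.sum(axis=1)                       # D~_ii = sum_j A~_ij
    D_isqrt = np.diag(1.0 / np.sqrt(d))
    S = D_isqrt @ A_tilde @ D_isqrt               # renormalized propagation matrix
    return np.maximum(S @ X @ Theta, 0.0)         # ReLU activation
```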

HGCELM
In Figure 2, we provide an illustration of the proposed HGCELM framework. The framework first constructs a hypergraph from the given dataset, and then feeds it into the HGCELM model consisting of a random hypergraph convolutional layer and a hypergraph convolutional regression layer. The details for HGCELM are introduced as follows.

Hypergraph Construction
We represent the high-order relationships of a dataset by constructing a hypergraph G. For this purpose, each data point x_i is treated as a vertex v; we then take x_i as a centroid vertex and associate it, together with its k nearest neighbors N_k(x_i), with a hyperedge e_i. As a result, each hyperedge connects k + 1 vertices. The incidence matrix of the hypergraph is defined by

h(v, e_i) = 1 if v ∈ {x_i} ∪ N_k(x_i), and 0 otherwise.

The degrees of the vertex set V and of the hyperedge set E can be expressed in diagonal matrix forms, i.e., D_v and D_e. There are various ways to assign weights to hyperedges, such as the sum of similarities of the vertices within a hyperedge [34]. In this paper, we treat all hyperedges as equally weighted, so the weight matrix can be taken as an identity matrix I ∈ R^{|E|×|E|}. Samples belonging to the same class often have a higher probability of being assigned to the same neighborhood N_k. Therefore, it is reasonable to use a hypergraph to describe the intraclass variations of the data.
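A minimal sketch of this kNN-based construction follows, assuming Euclidean distances and a dense incidence matrix; for large N one would substitute a proper nearest-neighbor index for the full distance matrix.

```python
import numpy as np

def build_incidence(X, k=5):
    """One hyperedge per data point: the point plus its k nearest neighbors."""
    N = X.shape[0]
    # Dense pairwise Euclidean distances (fine for small N)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    H = np.zeros((N, N))                      # |E| = N hyperedges, one per center
    for j in range(N):
        nbrs = np.argsort(dist[j])[:k + 1]    # the center itself plus k neighbors
        H[nbrs, j] = 1.0                      # vertices joined by hyperedge e_j
    return H                                  # incidence matrix, R^{N x |E|}
```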

Random Hypergraph Convolution
Inspired by the previous GCELM [32], we propose a novel ELM mapping, called random hypergraph convolution (RGC), to incorporate the high-order relationships among data into the feature mapping. We define the random hypergraph convolution following that proposed in [30,31], except that ours does not need to be updated iteratively. It is expressed as follows:

Z = σ(S X Θ),  with  S = D_v^{−1/2} H W D_e^{−1} H^⊤ D_v^{−1/2},

where S is referred to as the augmented normalized incidence matrix, obtained by imposing a symmetric normalization. It should be noted that S incorporates the structured information of the data and can be precomputed during implementation. Thus, the random hypergraph convolution enables ELM to embed high-order information. Following ELM theory, the filter parameters Θ of the random hypergraph convolution are randomly generated from a specific probability distribution, e.g., the Gaussian distribution Θ_ij ∼ N(0, 1).
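The following sketch assembles S from the incidence matrix and applies the random, fixed filters; the function names and defaults are illustrative.

```python
import numpy as np

def propagation_matrix(H, w=None):
    """S = Dv^{-1/2} H W De^{-1} H^T Dv^{-1/2}; precomputable and typically sparse."""
    w = np.ones(H.shape[1]) if w is None else w
    Dv_isqrt = np.diag(1.0 / np.sqrt(H @ w))      # inverse sqrt of vertex degrees
    De_inv = np.diag(1.0 / H.sum(axis=0))         # inverse of hyperedge degrees
    return Dv_isqrt @ (H * w) @ De_inv @ H.T @ Dv_isqrt

def random_hypergraph_conv(S, X, n_hidden=50, seed=0):
    """Z = sigma(S X Theta) with Theta_ij ~ N(0, 1), drawn once and kept fixed."""
    rng = np.random.default_rng(seed)
    Theta = rng.standard_normal((X.shape[1], n_hidden))
    return 1.0 / (1.0 + np.exp(-(S @ X @ Theta))), Theta
```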

Hypergraph Convolutional Regression
Based on the hidden hypergraph embedding Z, we use a hypergraph convolutional regression layer to predict labels. Formally, the layer can be written as:

Ȳ = S Z β.

To solve for β, the above equation can be rewritten as the following ridge regression problem:

arg min_β ‖S Z β − Y‖_F² + λ‖β‖_F²,    (14)

where Y = [Y_T; Y_U] is an augmented training target matrix. Since Y_U is unavailable during the training stage, it is set to a zero matrix. Let M be a diagonal mask matrix whose first N_T diagonal elements satisfy M_ii = 1, with the rest equal to 0. We further rewrite Equation (14) as

arg min_β ‖M(S Z β − Y)‖_F² + λ‖β‖_F².    (15)

It is easy to prove that Equation (15) has an optimal solution, whose closed form is

β* = ((S Z)^⊤ M S Z + λ I_L)^{−1} (S Z)^⊤ M Y.

The labels of the unlabeled data points can then be determined from the obtained β*, which is given by

ȳ_i = arg max_j Ȳ_ij,  where Ȳ = S Z β*.

We show the overall learning steps of the proposed HGCELM in Algorithm 1. Since HGCELM involves no iteration, training the model is computationally efficient.
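A sketch of this masked closed-form solution, under the reconstruction above; names and defaults are illustrative.

```python
import numpy as np

def fit_output_layer(S, Z, Y, n_labeled, lam=1e-2):
    """Solve beta* = ((SZ)^T M SZ + lam*I)^{-1} (SZ)^T M Y and predict labels."""
    A = S @ Z                                       # hypergraph-convolved hidden output
    m = np.zeros(A.shape[0])
    m[:n_labeled] = 1.0                             # diagonal of the mask matrix M
    beta = np.linalg.solve(A.T @ (m[:, None] * A) + lam * np.eye(A.shape[1]),
                           A.T @ (m[:, None] * Y))
    return beta, (A @ beta).argmax(axis=1)          # labels for all data points
```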

Computation Complexity and Connection to Existing Methods
We give a theoretical analysis of the computational complexity of our method. Ignoring the hypergraph generation procedure and assuming L ≪ N, the time complexity of HGCELM is approximately O(N³). Compared with ELM, HGCELM has a higher time complexity, which is dominated by the hypergraph convolution. Fortunately, the augmented normalized incidence matrix can be precomputed and is sparse; thus, the hypergraph embedding can be implemented efficiently. HGCELM is also more efficient than GCN since it requires no iterative optimization.
Our HGCELM is closely related to the classic ELMs. First, HGCELM maintains the key advantage of ELMs, i.e., a fast learning speed achieved by a closed-form solution. Second, HGCELM degenerates to the classic ELM when S is cast away. Consequently, the classic ELM is a special case of the HGCELM model in the regular Euclidean domain. Furthermore, HGCELM remedies the drawback that traditional ELMs cannot deal with structured data.

Results and Discussion
In this section, we conduct experiments to verify the effectiveness of the proposed HGCELM.

Baseline Methods
We compare our method with eight baselines: the basic ELM [2], KELM [38], SS-ELM [21], Transductive Support Vector Machine (TSVM) [19], Self-training Semi-supervised ELM (ST-ELM) [39], Laplacian Support Vector Machine (LapSVM) [20], GCELM [32], and GCN [25]. For a fair comparison with previous works, we set 50 hidden neurons for the methods that contain a hidden layer, i.e., GCELM, ELM, SS-ELM, ST-ELM, and GCN. The hyperparameter settings of these methods are provided in Table 3. We implement all the methods in Python 3.5 running on an Intel i5-6500 3.20 GHz CPU with 8.00 GB RAM.

Table 4 reports the test accuracy of the different methods over the 26 datasets. Each method is evaluated over 30 independent runs. At the bottom of the table, two summary metrics are given: the arithmetic mean of test accuracy over the 26 datasets and a win/tie/loss (W/T/L) count of HGCELM against each competitor. On the basis of these results, we conclude that the proposed HGCELM consistently outperforms the other baselines. Compared with GCELM, HGCELM achieves a 2.29% improvement in averaged accuracy with a lower standard deviation, demonstrating that the hypergraph is more helpful for semi-supervised learning. This is because the hypergraph convolution embeds intraclass variations, while the graph convolution focuses only on pairwise relationships. There is also a significant improvement of HGCELM over the classic semi-supervised methods (i.e., SS-ELM, TSVM, ST-ELM, and LapSVM); specifically, our method wins on 25, 26, 26, and 24 datasets, respectively. Notably, ELM and KELM are purely supervised, so their classification accuracy is lower than that of most semi-supervised methods due to the limited training samples. Despite being fully optimized, GCN cannot achieve better test accuracy than our HGCELM, which further signifies the effectiveness and reliability of our proposal.

Performance with Varying Training Size
To investigate the performance of HGCELM, we visualize the test accuracy under varying training sizes. In this experiment, we increase the training size from 5% to 50% and present the corresponding results of 30 evaluations with box plots. As shown in Figure 3, the test accuracy of HGCELM tends to increase as the training size grows; meanwhile, HGCELM becomes more robust and stable. When the training size exceeds 25%, HGCELM's accuracy approaches its highest value and converges. For the Iris dataset, HGCELM achieves remarkable accuracy (higher than 90%) using only 5 samples per class (10%). This means that HGCELM is able to explore and make better use of the useful information in unlabeled data.

Analysis on Decision Boundaries
To provide an intuitive understanding of the superiority of our HGCELM, we visualize the decision boundaries of ELM, SS-ELM, GCELM, and HGCELM. In Figure 4, we synthesize three representative data distributions, i.e., linearly separable data (the first row), linearly inseparable data with half circles (the second row), and linearly inseparable data with concentric circles (the third row), each of which contains 100 samples from two classes. The complexity of these datasets gradually increases from top to bottom. In this experiment, we select 10 samples for training and initialize each classifier with 10 hidden neurons. Owing to the simplicity of the first dataset, all four classifiers can correctly find a reliable decision boundary on the linearly separable data. For the first linearly inseparable dataset, the semi-supervised methods (i.e., HGCELM, GCELM, and SS-ELM) obtain more generalized decision boundaries than the supervised method (i.e., ELM); moreover, the spectral convolution-based methods (HGCELM and GCELM) are superior to the graph Laplacian-based method (SS-ELM) in terms of both decision boundary and test accuracy. On the second linearly inseparable dataset, HGCELM accurately classifies the data points with a better decision boundary, while the other three methods show relatively poorer ones. In particular, ELM fails to work on this dataset under the same experimental settings, because ELM cannot use the unlabeled samples and is thus more likely to overfit the limited training set. Although SS-ELM shows the same accuracy as HGCELM, its decision boundary cannot separate the two circles. We can naturally conclude that HGCELM has better generalization ability than the other methods. This ability benefits from the fact that the high-order relationships among all data points can improve the quality of decision making. It should be noted that, due to the geometric properties of the data distributions, the Euclidean-distance-based graph construction strategy can result in an inappropriate graph structure on the linearly inseparable datasets. Therefore, a small neighbor size is desirable when constructing a graph or hypergraph for these datasets.

Parameter Sensitivity Study
Two parameter sensitivity studies are carried out to further explore the robustness of the proposed method. Figure 5 shows the impact of the number of hidden neurons. We compare our method with five baselines that require setting hidden neurons. As seen from Figure 5a-d, all the compared methods achieve better performance when using more hidden neurons on all the datasets. Nevertheless, our proposed HGCELM enjoys a more competitive edge than the other methods, particularly when using more than 20 hidden neurons. Notice that GCN guarantees relatively better accuracy than the other competitors when using few hidden neurons, because the filter parameters of GCN are fully optimized. By contrast, ELMs require more hidden neurons, as repeatedly demonstrated by many previous works. From Figure 5c,d, the classic ELM suffers from overfitting (its accuracy declines as the number of hidden neurons grows) caused by the inadequate training data. This problem is overcome in the other semi-supervised ELMs, including our HGCELM.

Impact of λ and k
In this experiment, we further explore the impact of the other two important hyperparameters, i.e., λ and k. Figure 6a-d shows the results obtained by grid search. Here, we set λ ∈ {10⁻⁷, 10⁻⁶, ..., 10²} and k ∈ {3, 6, ..., 30}, respectively. We can observe two tendencies from the results. First, to obtain better performance, the regularization coefficient λ should be set to less than 1. Theoretically, a larger λ pushes HGCELM toward a more compact model but also increases the risk of under-fitting the training data. Second, a larger neighbor size k generally guarantees better classification accuracy; however, hyperedges will contain noisy samples if k becomes too large. We suggest suitable ranges for the two parameters of λ ∈ [10⁻³, 10⁻¹] and k ∈ [5, 10].

Conclusions
We have proposed a novel ELM, called HGCELM, for semi-supervised classification. The idea behind HGCELM is to combine hypergraph convolution with ELM so as to embed the high-order relationships of data. The resulting model extends ELM into the non-Euclidean domain and endows ELM with the capability of modeling structured data. The proposed HGCELM is characterized by a very light computational burden and good generalization ability, making it easy to implement and apply in practice. Extensive experiments on 26 datasets demonstrate that HGCELM is superior to many existing methods. This successful attempt points to a promising avenue for designing randomized neural networks and graph neural networks.

Conflicts of Interest:
The authors declare no conflict of interest.