Generic Structure Extraction with Bi-Level Optimization for Graph Structure Learning

Most existing Graph Structure Learning (GSL) methods improve the robustness of GNNs merely from a local view: they consider the local information related to each edge and apply the same mechanism indiscriminately across edges, which makes them suffer from the local structure heterogeneity of the graph (i.e., the uneven distribution of inter-class connections over nodes). To overcome these drawbacks, we extract the graph structure as a learnable parameter and jointly learn the structure and the common parameters of the GNN from a global view. Notably, the common parameters contain the global information for node feature mapping, which is also crucial for structure optimization (i.e., optimizing the structure relies on global mapping information). Technically, we apply a generic structure extractor to abstract the graph structure and reformulate GNNs in terms of learning the structure and the common parameters. We then model the learning process as a novel bi-level optimization, Generic Structure Extraction with Bi-level Optimization for Graph Structure Learning (GSEBO): the upper level optimizes the GNN parameters to obtain the global mapping information, and the lower level optimizes the graph structure with the global information learned from the upper level. We instantiate the proposed GSEBO on classical GNNs and compare it with state-of-the-art GSL methods. Extensive experiments on four real-world datasets validate the effectiveness of the proposed GSEBO.


Introduction
Based on the homophily assumption of "like to associate with like" [1,2], the Graph Neural Network (GNN) [3] has become a promising solution for node classification. However, a large portion of edges are inter-class connections [4], and representation propagation over such connections can largely hinder the GNN from obtaining class-separated node representations, hurting the performance.
Existing GSL methods are roughly categorized into attentive mechanisms, noise detection, and probabilistic mechanisms. Attentive mechanisms calculate weights for edges to adjust the contribution of different neighbors during representation propagation [5][6][7][8][9]. These methods can hardly work well in practice for two reasons: (1) the mechanism may not generalize well to all nodes with different local structures (cf. Figure 1a); and (2) the attention cannot be easily trained well due to the limited labeled data [10]. Noise detection incorporates an edge classifier to estimate the probability of inter-class connection for each edge [11][12][13][14][15][16]. Although it can be better trained owing to the supervision of edge labels, the edge classifier also suffers from local structure heterogeneity and lacks consideration of global information. Probabilistic mechanisms model connections from a global view: they assume a prior distribution over edges and estimate GNN parameters with Bayesian optimization to overcome the impact of inter-class edges [17][18][19][20]. Although the edge-specific parameterization largely enhances the model representation ability, it is hard to accurately access the prior distribution.
Figure 1. Examples of (a) the impact of local structure heterogeneity (nodes 1 and 9: the weights between nodes do not provide enough information for class differentiation) and (b) the global view of messaging (the edge between node 3 and node 5: the deleted edge prevents intra-class information transfer).
Despite the achievements of the existing methods, they still share some common drawbacks: (1) edge modeling: the existing methods model edges with a parameter-sharing mechanism, which may suffer from the local structure heterogeneity problem; (2) local optimization: the parameters are optimized with the information of neighboring nodes only, which ignores the impact from the global view. Therefore, we come up with two key considerations for GSL: (1) modeling graph connections in an edge-specific manner instead of with a shared mechanism; and (2) optimizing the corresponding parameters with a global objective of accurately classifying all nodes. The edge-specific modeling can overcome the local structure heterogeneity, i.e., handling nodes with different properties (e.g., node 1 and node 9 in Figure 1a) via different strategies. Besides, blindly removing the inter-class edges will increase the risk of misclassifying the target nodes (in the dashed circle) due to cutting off their connections to labeled neighbors in the same class (e.g., the edge between node 5 and node 9 in Figure 1b). Thus, it is necessary to optimize the graph structure from the global view instead of the local one.
However, it is non-trivial to achieve the targets mentioned above due to the following challenges: (1) the graph structure is embedded into the GNN model and, once updated, affects the procedure of model parameter optimization, requiring a careful design of the training algorithm; (2) the calculation of the ideal global objective is intractable due to the limited labeled nodes, especially in the hard semi-supervised setting.
In this work, we propose Generic Structure Extraction with Bi-level Optimization for Graph Structure Learning (GSEBO), which optimizes the graph structure and learns the node embeddings from the global view. In particular, we first devise a new generic structure extractor, which accounts for the graph structure with both the connectedness between nodes and the strength of connections. In addition to the adjacency matrix, the extractor adopts a learnable matrix to represent the graph structure and adjust the representation propagation. Moreover, we design a bi-level optimization algorithm where the outer optimization updates the structure and the inner optimization updates the common parameters of the base graph convolution (i.e., the feature mapping parameters). In this way, we decompose the hard optimization issue of GSEBO into two easier ones. In addition, we separate the training set into two parts (train_1 and train_2) and set the objective of the outer optimization as the train_2 loss to better approximate the ideal global objective. The proposed generic structure extractor can be extended to most existing graph convolution operators. We instantiate it on four representative GNN models (i.e., GCN [21], GAT [5], GraphSAGE [22], and JK-Net [23]) and compare GSEBO with state-of-the-art GSL methods on four real-world node classification datasets. Extensive experiments justify the rationality, effectiveness, and robustness of the proposed method. In summary, our main contributions are as follows:
• We propose a novel GSEBO with bi-level optimization for edge-specific graph structure learning, which learns the graph structure from a global view by optimizing a global objective of node classification.
• We devise a generic structure extractor, which parameterizes the strength of each edge during representation propagation. Besides, we summarize how classical GNN methods are transferred into the form of a learnable graph structure with the generic structure extractor.
• We evaluate the proposed GSEBO with classical GNNs as backbones and compare it with the state-of-the-art GSL methods. Extensive experiments on four real-world datasets show the superior learning ability of the proposed method compared to the existing methods.

Attentive Mechanism
The attentive mechanism methods adaptively learn the weights of edges and adjust the contributions of neighbor nodes. MAGNA [6] incorporates multi-hop context information into every layer of attention computation. IDGL [8] uses the multi-head self-attention mechanism to reconstruct the graph, which gives it the ability to add new nodes without retraining. HGSL [9] extends graph structure learning to heterogeneous graphs, constructing different feature propagation graphs and fusing them together in an attentive manner. However, these methods suffer from local structure heterogeneity and are difficult to train.

Noise Detection
The noise detection methods leverage an off-the-shelf pre-trained model to induce node embeddings or labels and incorporate an edge classifier to estimate the probability of each edge. NeuralSparse [11] considers the graph sparsification task by removing irrelevant edges. GAUG [12] utilizes a GNN to parameterize the categorical distribution instead of the MLP in NeuralSparse. PTDNet [13] prunes task-irrelevant edges by penalizing the number of edges in the sparsified graph with parameterized networks. Even though the noise detection methods can be well trained with the supervision of edge labels, the edge classifier also suffers from local structure heterogeneity and lacks consideration of global information.

Probabilistic Mechanism
This type of method assumes a prior distribution of the graph or the noise, estimates the parameters from observed values, and then resamples the edges or noise to obtain a new graph. BGCN [17] estimates the parameter distributions of edges and communities by sampling edges from the graph, and resamples new graphs with the estimated parameters for prediction. VGCN [18] trains a graph distribution parameter similar to the original structure through the ELBO, and resamples graphs for prediction. However, both BGCN and VGCN sample from a noisy graph, so the estimated parameters also contain noise. DenNE [19] assumes that the observed graph is composed of real values and noise, and that the prior distributions of the features and the noise are known. With a generative model, the likelihood is used to estimate the representations of nodes. However, this method highly relies on the priors of the features and noise, which are difficult to obtain accurately.

Bi-Level Optimization on GNN
LDS [15] jointly learns the graph structure and GNN parameters by solving a bi-level optimization problem that learns a discrete probability distribution for each edge. According to the learned distributions, LDS generates a new graph structure by sampling. Towards this end, LDS sets the objective of the outer optimization as generating the observed edges, which clearly has a gap to the overall classification objective. Moreover, LDS needs to estimate at least N² distribution parameters, which is hard due to insufficient labels (|E| ≪ N²). On the contrary, our method only activates a small portion of entries in Z, where A_ij = 1, i.e., the number of estimated parameters is the same as the number of edges in graph G (i.e., |E|).

Methodology
We first introduce the essential preliminaries of GNNs, and then elaborate on the graph convolution operator and the bi-level optimization algorithm of the proposed GSEBO.

Preliminary
Let G = (V, E, X) represent a graph with N nodes and M edges, where V = {v_1, v_2, · · · , v_N} and E = {e_1, e_2, · · · , e_M} denote the sets of nodes and edges, respectively, and X denotes the node feature matrix.
This task aims to learn a classifier f(A, X; θ) from a set of labeled nodes to predict the labels of the remaining nodes, where θ denotes the model parameters. We index the nodes from 1 to N without loss of generality. Formally, Y = [y_1, y_2, · · · , y_N] ∈ R^N are the labels of the nodes, where y_i ∈ R is the label of node i. The target is achieved by optimizing the model parameters θ with respect to the labeled nodes, which is formulated as:

min_θ ∑_{i ∈ V_L} l( f(A, X; θ)_i, y_i ) + λ‖θ‖², (1)

where V_L denotes the set of labeled nodes, l(·) is a classification loss, and λ is a hyperparameter to adjust the strength of parameter regularization.
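To make the training objective concrete, the following is a minimal sketch of a cross-entropy loss over labeled nodes plus L2 regularization. The function and argument names are illustrative, not the paper's code:

```python
import numpy as np

def softmax(logits):
    # Row-wise softmax with max-shift for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def objective(logits, labels, theta, lam):
    """Mean cross-entropy over the labeled nodes plus L2 regularization
    on the parameter list theta, as in Equation (1)."""
    probs = softmax(logits)
    n = logits.shape[0]
    ce = -np.log(probs[np.arange(n), labels]).mean()
    reg = lam * sum((w ** 2).sum() for w in theta)
    return ce + reg
```

Here `logits` stands in for the classifier output f(A, X; θ) restricted to the labeled nodes.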

Generic Structure Extraction
To optimize the graph structure, the key consideration lies in (1) decoupling the graph structure from the GNNs to account for the edge-specific modeling and (2) learning the graph structure from the global information in θ.
Towards the first purpose, the core idea is to decompose the graph structure information into connectedness (the edges in the adjacency matrix) and the strength of connection (a latent variable). In general, there are two ways to model the connection strength, depending on whether one relies on the inductive bias of translation invariance. On the one hand, attentive mechanisms and noise detection models are translation invariant: they decode the connection strength of each edge from its local features. However, considering the local structure heterogeneity of most real-world graphs [23], it is risky to rely on the translation-invariance bias. On the other hand, probabilistic mechanisms separately model the connection strength of each edge, where each edge corresponds to a specific distribution. However, it is non-trivial to set a proper prior in practice. According to these advantages and disadvantages, we summarize two considerations for extending the graph convolution: (1) edge-specific modeling; and (2) optimization without a prior.
Towards this end, we first propose the generic structure extractor (GSE) to decouple the graph structure into the connectedness between nodes and the strength of the connections. To model the edges in an edge-specific way, we apply the bi-level optimization method, with the inner optimization updating the common parameters and the outer optimization updating the weight of each edge. By introducing the bi-level optimization, we can learn the structure information from the global parameters, thus avoiding prior assumptions. Specifically, we model the connection strength as a parameter matrix Z with the same size as A. Formally, the generic structure extractor (GSE) is abstracted as:

Ã = σ(Z) ⊙ A,  H^(k) = COM( H^(k−1), AGG( Ã, H^(k−1) ) ),

where ⊙ is the element-wise product and σ(·) is a non-negative activation function, since the value of strength is always positive (in this work, we use the min[max[0, x], 1] function to restrict the value within [0, 1]). COM and AGG are the combination and aggregation functions, respectively. Noteworthily, different from GNNs, GSE decouples the graph structure from the GNN and treats it as a learnable objective. Besides, GSE is a generic extractor, which can be instantiated over most existing graph convolutions. In this way, as long as the learning of the connection strength is set properly, GSEBO can downweight the neighbors with inter-class connections during representation propagation, reducing the impact of inter-class edges with bi-level optimization.
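One admissible instantiation of the GSE abstraction is sketched below, choosing a strength-weighted neighbor sum as AGG and additive self-combination as COM (these choices, and all names, are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def clamp01(x):
    # sigma(.): restrict connection strengths to [0, 1], i.e., min[max[0, x], 1].
    return np.minimum(np.maximum(x, 0.0), 1.0)

def gse_layer(H, A, Z, W):
    """One generic-structure-extractor layer (sketch).

    A: binary adjacency (connectedness); Z: learnable strengths,
    where only entries with A_ij = 1 are active. AGG is a
    strength-weighted sum over neighbors; COM adds the node's own
    representation before the linear map W.
    """
    S = clamp01(Z) * A            # sigma(Z) masked by A: edge-specific strengths
    agg = S @ H                   # AGG: strength-weighted neighbor sum
    return (H + agg) @ W          # COM: combine self and neighbors, then map
```

Because Z is multiplied by the binary A, only entries corresponding to observed edges influence propagation, matching the claim that only |E| strength parameters are active.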

Update Parameters with Bi-Level Optimization
To achieve the second purpose of learning the graph structure from global information, it is essential to carefully design a proper training algorithm to optimize the connection strength matrix Z. Assume that we construct GSEBO with K layers, denoted as f(A, X; W, Z). We have three main considerations for designing the training algorithm:
• Connection strength is a relative value, which changes across different views. As shown in Figure 1b, the connection between node 3 and node 5 is weak from the local view, i.e., the edge is inter-class and should be assigned a low weight. However, this edge is essential for the classification of nodes 9, 10, and 11, and thus deserves a high weight. Therefore, the optimization objective of Z should be the overall performance of node classification.
• Z and W = {W^(k) | k ∈ [1, K]} play different roles, but are closely related. The role of Z is close to that of A, which restricts the model space for the mapping from node features to labels, while W includes the global mapping information for classification, which relieves the cons of local optimization. That is, an update on Z will adjust W and its optimization procedure. Therefore, the optimizations of Z and W are at two different but dependent levels.
• Directly minimizing the objective function of Equation (1) to obtain the parameters Z and W cannot achieve the desired purpose of learning the structure because of over-fitting, as shown in Figure 2.
Towards this end, we propose a bi-level optimization [24][25][26] with inner and outer optimizations to learn W and Z, respectively. Figure 3 shows the overall procedure of the algorithm, where the inner and outer optimization steps are executed iteratively until the stop condition is met.
Figure 3. The framework of GSEBO. The generic structure extractor (GSE) decouples the graph structure from the GNN to account for edge-specific modeling; the inner and outer optimization steps are adopted for optimizing the common parameters and the graph structure, respectively.
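The alternating procedure can be sketched as follows, abstracting the two gradient computations as callbacks; the function and hyperparameter names are illustrative, and real usage would plug in the losses defined below:

```python
def bilevel_train(W, Z, inner_grad, outer_grad, tau=5, outer_steps=10,
                  lr_in=0.01, lr_out=0.01):
    """Alternating bi-level optimization (sketch).

    Each round runs tau inner gradient steps on the common parameters W
    with the structure Z frozen, then one outer step on Z with W frozen.
    inner_grad/outer_grad are placeholders for the gradients of
    l_inner and l_outer with respect to W and Z, respectively.
    """
    for _ in range(outer_steps):
        for _ in range(tau):                 # inner: update W, Z held constant
            W = W - lr_in * inner_grad(W, Z)
        Z = Z - lr_out * outer_grad(W, Z)    # outer: update Z, W held constant
    return W, Z
```

In the paper, the outer gradient is the hypergradient through the inner trajectory; the sketch leaves that computation to the `outer_grad` callback.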

Inner Optimization for Common Parameters
This step is similar to the normal training of a GNN for optimizing W. In particular, we update W with a gradient-descent-based optimizer (e.g., Adam [27]) over the train_1 nodes by minimizing Equation (3) (i.e., setting l_inner = l( f(A, X; W, Z), y ) + λ‖W‖²). When calculating the gradient of W, we treat the connection strength matrix Z as constant.

Outer Optimization for Graph Structure
Similarly, we treat the sequence of W as constant to optimize the graph structure Z. Ideally, the objective should be the overall classification loss of all nodes in the graph. Formally:

min_Z ∑_{i ∈ V} l( f(A, X; W, Z)_i, y_i ).

Apparently, the calculation of this ideal objective is intractable. Similar to the common parameter optimization, we can approximate the ideal objective as the empirical risk over the train_1 nodes. However, this can easily suffer from over-fitting, as shown in Figure 2. We believe the empirical risk over the train_2 nodes is a better approximation of the ideal objective, since it reflects to what extent the parameters generalize. In this light, we set l_outer as the classification loss and optimize Z over the train_2 set V_train2 by minimizing Equation (6). Assuming the inner optimizer performs τ gradient steps

W_t = Φ(W_{t−1}, Z), t = 1, · · · , τ,

we optimize the outer parameter Z by fixing the inner parameters W_1, · · · , W_τ. Formally, the derivative of the outer objective with respect to the hyperparameter Z (the hypergradient) is formulated as:

∇_Z l_outer = ∑_{t=1}^{τ} B_t A_{t+1} · · · A_τ ∇_{W_τ} l_outer,

where, for t = τ down to 1, we define the matrices:

A_t = ∂Φ(W_{t−1}, Z) / ∂W_{t−1},  B_t = ∂Φ(W_{t−1}, Z) / ∂Z,

and the update of Z can be formulated as:

Z ← Z − γ ∇_Z l_outer,

where γ is the outer learning rate.
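A scalar toy example makes the reverse-mode hypergradient concrete: take l_inner = (w − z)²/2, so one inner step is Φ(w, z) = w − η(w − z), with A_t = 1 − η and B_t = η, and take l_outer = (w_τ − y*)²/2. All names here are hypothetical, and the closed form is known, which lets us check the accumulation:

```python
def hypergradient(z, w0, y_star, eta=0.1, tau=10):
    """Reverse-mode hypergradient on a scalar toy problem (sketch).

    Forward: tau inner gradient steps w <- w - eta * (w - z).
    Reverse: accumulate d l_outer / dz = sum_t B_t * A^(tau - t) * g,
    with A_t = 1 - eta and B_t = eta constant for this toy problem.
    """
    # Forward pass through the inner trajectory.
    w = w0
    for _ in range(tau):
        w = w - eta * (w - z)
    # Reverse accumulation from t = tau down to 1.
    g_w = w - y_star                  # d l_outer / d w_tau
    g_z = 0.0
    for _ in range(tau, 0, -1):
        g_z += eta * g_w              # B_t contribution at step t
        g_w *= (1.0 - eta)            # propagate through A_t to step t - 1
    return w, g_z
```

For this problem the hypergradient has the closed form (w_τ − y*)(1 − (1 − η)^τ), which the reverse accumulation reproduces exactly.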

Training Process
The bi-level optimization cannot guarantee the convergence of the model; thus, we set an early stopping condition as an additional requirement for training, which is shown in the second line of Algorithm 1. The early stopping condition is: if the accuracy on the validation set does not improve within 20 epochs, training stops.
The process of updating Z is shown in lines 8-11 of Algorithm 1. With Z updated by the graph structure optimization, we re-optimize W with the updated Z via the common parameter optimization, and repeat this process until early stopping is met. To summarize, Algorithm 1 shows the training procedure of GSEBO.
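The patience-based stopping rule described above can be sketched as a small tracker (the class name and interface are illustrative):

```python
class EarlyStopping:
    """Stop when validation accuracy has not improved for `patience`
    consecutive epochs (the paper uses patience = 20)."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best = -float("inf")
        self.bad_epochs = 0

    def step(self, val_acc):
        # Returns True when training should stop.
        if val_acc > self.best:
            self.best = val_acc
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```

In the training loop, `step` would be called once per epoch with the validation accuracy, and the loop breaks as soon as it returns True.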

Complexity Analysis
The overall framework of GSEBO is illustrated in Algorithm 1. The computational complexity of GSEBO mainly depends on two steps. Given a graph G, let ‖A‖₀ be the number of nonzeros in the adjacency matrix, d the feature dimension, L the number of GCN layers, and |V| the number of nodes. The inner optimization step takes O(τL‖A‖₀d) computational time. The outer optimization step takes O(τ‖A‖₀²) computational time. As a result, the total computational complexity is O(τ‖A‖₀²) if Ld is smaller than ‖A‖₀.

Unifying Various GNN and Beyond
To extend the graph convolution operator to most existing graph convolutions, we reformulate classical GNN models into a unified form.

GCN
The forward propagation of GCN is formulated as follows:

H^(l+1) = σ( D̃^(−1/2) Ã D̃^(−1/2) H^(l) W^(l) ),

where H^(0) = X, Ã = A + I, D̃_nn = ∑_i Ã_ni, W^(l) ∈ R^(D×C) are the trainable parameters, I denotes the identity matrix, and σ is a nonlinear function. For a better understanding of the propagation of GCN in our framework, we rewrite the embedding update in the following form:

H^(l) = σ_1( D̃^(−1/2) ( σ_2(Z) ⊙ Ã ) D̃^(−1/2) H^(l−1) W^(l−1) ),

where W^(l) denotes the trainable parameters of the l-th layer, σ_1 is the activation function, such as ReLU, and σ_2 = min[max[0, x], 1] is the non-negative activation function.
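A minimal numpy sketch of the rewritten GCN layer, assuming ReLU for σ₁ and the clamped σ₂ described above (the function name and dense-matrix form are illustrative; a real implementation would use sparse operations):

```python
import numpy as np

def gcn_z_layer(H, A, Z, W):
    """One GCN layer with learnable connection strengths (sketch).

    Computes sigma1( D~^(-1/2) (sigma2(Z) * A~) D~^(-1/2) H W )
    with sigma1 = ReLU and sigma2 = min[max[0, x], 1].
    """
    A_tilde = A + np.eye(A.shape[0])          # add self-loops: A~ = A + I
    S = np.clip(Z, 0.0, 1.0) * A_tilde        # sigma2(Z) masked by A~
    d = A_tilde.sum(axis=1)                   # degrees D~_nn of A~
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ S @ D_inv_sqrt @ H @ W, 0.0)  # ReLU
```

With Z fixed to all-ones, this reduces to the standard GCN propagation; learning Z reweights individual edges of Ã.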

GAT
GAT assigns different weights to each neighbor node and updates the node embeddings with a weighted average of the neighbors:

h_i^(l) = ‖_{k=1}^{K} σ( ∑_{j ∈ N(i)} σ_2(Z_ij) · Softmax_j( atten( h_i^(l−1), h_j^(l−1) ) ) W_k^(l−1) h_j^(l−1) ),

where Ã_ij = 1 for j ∈ N(i), W_k^(l−1) is the parameter of the k-th attention head on the (l − 1)-th layer, Softmax is the Softmax operation, and atten is the self-attention operation.

GraphSAGE
GraphSAGE is one representative of the spatial approaches, which learns how to aggregate feature information from a node's local neighborhood. The reformulation is as follows:

h_i^(l) = σ( W^(l−1) · CONCAT( h_i^(l−1), AGG( { σ_2(Z_ij) · h_j^(l−1) | j ∈ N(i) } ) ) ).

JK-Net
JK-Net learns the node representation from deep layers by using a jumping network. There exist different ways to aggregate features from different layers; we take concatenation as an example in this work (other aggregation methods are also applicable). The formulation of JK-Net is as follows:

h_i^(final) = FC( Concat( h_i^(1), h_i^(2), · · · , h_i^(K) ) ),

where FC is a fully connected layer and Concat is the concatenation operation.

Experiments
In this section, we conduct experiments on four datasets to answer research questions regarding the effectiveness, robustness, denoising ability, and hyperparameter sensitivity of GSEBO. We select four widely used real-world node classification benchmark datasets: citation networks (Cora and Citeseer [21]), a social network (Terrorist [28]), and an air traffic network (Air-USA [29]). The statistics of the datasets are shown in Table 1.
• Citation networks. Cora and Citeseer are citation networks, where the nodes are papers published in computer science; the features of each publication are described by a 0/1-valued word vector indicating the absence/presence of the corresponding word from the dictionary; the adjacency matrix is binary and undirected, denoting the citation relations between papers; and the labels are the categories of the papers.
• Terrorist attacks. Terrorist describes information related to terrorist attack entities: the attributes of the entities and the links that connect the entities together into a graph structure. This dataset consists of 1293 terrorist attacks, each assigned one of six labels indicating the type of the attack. Each attack is described by a 106-dimensional vector, with a 0/1 value indicating the absence/presence of an entity. The edges are binary and undirected.
• Air traffic network. Air-USA is the airport traffic network of the USA, where the nodes represent airports and the binary undirected edges indicate the existence of commercial flights between airports. The node features are one-hot degree vectors, and the labels are generated based on the activity measured by the people and flights passing through the airports.
All the baselines and our proposed method can be applied to all types of networks. We adopt the same data splits of Cora and Citeseer as [21], and split the other datasets into training, validation, and testing sets with a ratio of 10:20:70 [12].
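The 10:20:70 split can be sketched as a simple random partition of node indices (the function name and seeding are illustrative assumptions):

```python
import numpy as np

def split_nodes(n_nodes, ratios=(0.1, 0.2, 0.7), seed=0):
    """Random training/validation/test split with the 10:20:70 ratio
    used for the Terrorist and Air-USA datasets (sketch)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_nodes)
    n_train = int(n_nodes * ratios[0])
    n_val = int(n_nodes * ratios[1])
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

The Cora and Citeseer splits, by contrast, reuse the fixed public splits of [21] rather than a random partition.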

Implementation Details
In the experiments, the separation ratio of the training data is set to 0.8 for the train_1 and train_2 sets, and we optimize the inner and outer parameters on the train_1 and train_2 sets, respectively. The latent dimension of all the methods is set to 16. The parameters of all baseline methods are initialized as in the corresponding papers and are carefully tuned to achieve optimal performance. The learning rates of the inner and outer optimization are set to 0.01. The hyperparameter λ is set to 5 × 10⁻⁴, and we search for the inner learning depth τ over the range [5, 10, 15, 20, 25]. As the bi-level optimization cannot guarantee convergence, we set the patience of early stopping to 20 to terminate the optimization of GSEBO. To prevent overfitting, the dropout ratio is set to 0.5 for all the methods. The network architectures of all the methods are configured as described in the original papers. Our experiments are conducted with TensorFlow running on GPU machines (NVIDIA 2080Ti). For all the compared methods, we report the average accuracy on the test set over 10 runs. Table 2 shows the performance of GSEBO instantiated with classical GNNs, and Table 3 presents the comparison of GSEBO and state-of-the-art GSL methods. From Tables 2 and 3, we have the following observations:

Improvement over Baselines
GSEBO outperforms the baselines in most cases. Considering the average performance over the four datasets, the improvement of GSEBO over the baselines is in the range of 3.0-12.5%, which validates the effectiveness of the proposed method. In particular:
• Probabilistic mechanisms. The performance gains of BGCN and VGCN over the vanilla GCN are limited, which might be because of the unsatisfied assumption of the prior distribution. This result shows the rationality of relaxing the prior assumption and modeling the connection strength with parameters directly.
• Connection modeling. DropEdge, AdaEdge, GAUG, and PTDNet achieve better performance than the vanilla GNNs by modifying the structure of the graph from different perspectives, such as smoothness [30] and robustness [31]. These results reflect the benefit of connection modeling. However, there is still a clear gap between these methods and GSEBO, which is attributed to the global objective for learning the edge strength.
• Attention mechanism. The performance of MAGNA is inferior, which is consistent with the analysis in [10].

Effects across GNN Architectures
On the four GNN architectures, GSEBO achieves better performance than the vanilla version in all cases. In particular, GSEBO achieves an average improvement across datasets of 4.2%, 4.4%, 4.0%, and 5.74% over the vanilla GCN, GAT, GraphSAGE, and JK-Net, respectively. These results justify the effectiveness of structure learning of the proposed GSEBO. Across the four architectures, GSEBO achieves the largest improvement over JK-Net. We postulate the reason is that the jump connection in JK-Net makes it aggregate more hops of neighbors than the other GNNs. As the hops increase, the homophily ratio of the neighbors will decrease, i.e., more neighbors are in classes different from the target node. Therefore, optimizing the connection strength (i.e., Z) is more beneficial on JK-Net.

Effects across Datasets
From the perspective of the datasets, GSEBO consistently performs better than the vanilla versions on the four datasets. Specifically, the average improvements across the classical GNN architectures achieved by GSEBO are 1.9%, 3.1%, 5.3%, and 8.0% on the four datasets, respectively. Considering that the four datasets come from different scenarios, these results are evidence of the potential of GSEBO to be widely applied in practice. Note that the trend of performance improvement follows the density of the graph: Air-USA is the densest graph and shows the largest performance improvement. As the number of neighbors increases, the percentage of neighbors essential for the classification of the target node decreases. This result reflects the rationality of optimizing the connection strength according to the overall classification objective.
Moreover, the training loss and test accuracy of the GNN methods and the proposed GSEBO are shown in Figure 4. We observe that the loss of GSEBO tends to be stable after 30 epochs and is smaller than that of GCN, GAT, and GraphSAGE, showing the empirical convergence of GSEBO. This is because simultaneously optimizing W and Z makes the loss function fit the labels more easily. However, the loss of JK-Net is smaller than that of GSEBO, which we attribute to JK-Net being more prone to over-fitting.
GNNs achieve promising results under the homophily assumption, while the noise in graphs breaks this assumption and hinders the node representations of GNNs.

Robustness Analysis
We investigate the robustness of GSEBO under different inter-class noise levels. In particular, we follow [13] and construct graphs based on Cora by randomly adding 1000, 3000, 5000, 10,000, and 20,000 inter-class edges, respectively. On the synthetic datasets, we compare GSEBO with the vanilla GNNs, GAUG, DropEdge, and AdaEdge. Figure 5 shows the performance on the GNN architectures. From the figures, we have the following observations:
• The margins between GSEBO and the vanilla GNNs on the synthetic datasets are larger than on the original Cora. For example, when adding 20,000 inter-class edges, GSEBO improves the accuracy by 17.6%, 3.1%, 40.6%, and 4.4% compared to GCN, GAT, GraphSAGE, and JK-Net, respectively. This result indicates the robustness of GSEBO's structure learning ability.
• In most cases, GSEBO outperforms the baselines at different noise levels, which further justifies its robustness.
• GAUG and AdaEdge utilize different strategies to update the structure of the graph and also consistently perform better than the vanilla GNNs. However, their gaps to GSEBO on the synthetic data are larger than on the original Cora. We postulate that the reason is that their objectives are affected by the intentionally added noise.
• DropEdge performs worse than the vanilla GNNs on the synthetic datasets. This comparison shows that randomly dropping edges fails to enhance GNN robustness when the noise level is high.
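The synthetic-noise construction can be sketched as follows: repeatedly sample node pairs and keep those that are inter-class and not yet connected (the function name and rejection-sampling loop are illustrative, not the protocol of [13] verbatim):

```python
import numpy as np

def add_inter_class_edges(A, labels, n_add, seed=0):
    """Add n_add random inter-class edges to a symmetric binary
    adjacency matrix (sketch of the synthetic-noise setup)."""
    A = A.copy()
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    added = 0
    while added < n_add:
        i, j = rng.integers(0, n, size=2)
        # Keep only new edges between nodes of different classes.
        if i != j and labels[i] != labels[j] and A[i, j] == 0:
            A[i, j] = A[j, i] = 1   # keep the graph undirected
            added += 1
    return A
```

Each accepted pair lowers the homophily ratio of the graph, which is exactly the stress the robustness experiment applies.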

Visualization of Denoising (RQ3)
To investigate the effect of the connection strength matrix Z, we visually compare the initialization of Z (i.e., the initial weights) and its value after optimization (i.e., the denoised weights). As both are large and sparse matrices, we visualize a 100 × 100 region of the denoised weights, which is shown in Figure 6. We show the results on Cora and Air-USA, where the improvement of GSEBO is the smallest and largest, respectively. In the figures, we use red pixels and blue pixels to represent intra-class and inter-class edges, respectively, where a deeper color represents a larger value. From Figure 6a,b, we find that the denoised weight matrix mainly decreases the values corresponding to inter-class connections, which can thus downweight the corresponding neighbors during representation propagation. In some cases, the denoised weights increase the weights of intra-class connections, which also facilitates denoising. Although most edges in Air-USA are intra-class, GSEBO still downweights a portion of inter-class edges and increases the intra-class weights, as shown in Figure 6c,d. This is because the denoised weights are learned to reach the overall classification objective.

Hyperparametric Analysis (RQ4)
We then investigate how the hyperparameters of GSEBO affect its effectiveness. As introduced in Section 3, the most critical hyperparameter of GSEBO is the training depth of the inner optimization step (i.e., τ), which is the number of gradient updates in each inner optimization step. For the consideration of computing cost, we cap the value of τ at 25 and change it in steps of 5, i.e., setting τ ∈ {5, 10, 15, 20, 25}. Figure 7 shows the performance of GSEBO applied to the four classical GNN architectures on the four datasets as the value of τ changes. From the figures, we have the following observations: (1) in most cases, the overall performance of the proposed GSEBO on each dataset is relatively stable as the value of τ increases, which indicates the insensitivity of GSEBO to this hyperparameter; (2) while increasing the value of τ can slightly improve the performance of GSEBO, we still suggest setting it to a relatively small value, because enlarging τ increases the memory and computation cost of GSEBO.

Conclusions
In this work, we propose a novel GSEBO, which utilizes a generic structure extractor to extract the graph structure as learnable parameters and learns the structure from the global view. To better optimize the structure and common parameters, we decompose the learning into two mutually constrained optimization objectives, i.e., a bi-level optimization, with the inner and outer optimizations for the common parameters and the graph structure, respectively. Extensive experiments demonstrate the effectiveness and robustness of GSEBO on both benchmark and synthetic datasets.
Even though GSEBO achieves impressive results, there are still some limitations: GSEBO cannot be effectively applied to large graphs, which would require a mini-batch implementation. Besides, we evaluate GSEBO in the transductive setting; when new nodes are added to the graph after training, GSEBO has to retrain the entire model. In future research, we would like to explore solutions to these limitations.