Unifying Node Labels, Features, and Distances for Deep Network Completion

Collected network data are often incomplete, with both missing nodes and missing edges. Thus, network completion, which infers the unobserved part of the network, is essential for downstream tasks. Despite the emerging literature related to network recovery, the side information available for nodes has not been effectively exploited. In this paper, we propose a novel unified deep graph convolutional network that infers missing edges by leveraging node labels, features, and distances. Specifically, we first construct an estimated network topology for the unobserved part using node labels, and then jointly refine the network topology and learn the edge likelihood with node labels, node features, and distances. Extensive experiments using several real-world datasets show the superiority of our method compared with state-of-the-art approaches.


Background
Network structures, such as social networks, web graphs, and communication networks, are important to the functioning of complex systems [1,2]. Usually, a complete network structure is a crucial prerequisite for downstream tasks, including node classification and link prediction [3][4][5]. However, real-world networks tend to be partially observed, with nodes and edges missing due to insufficient resources and privacy protection [3,6,7]. Social networks, such as Twitter and Facebook, have restrictions on crawlers, which makes it impossible for third-party aggregators to collect complete network data. Similarly, when an internet scientist probes the route topology using traceroute, they cannot obtain the structure behind non-cooperative routers [8]. Thus, the collected network structure is often incomplete, which creates difficulties for downstream analysis.
Our work is motivated by learning the structure of communication networks from passive measurements. In a military context, we may wish to analyze a foreign network inconspicuously [9]. One feasible way is to monitor the packet traffic between the target network and our controlled networks. In an internet reconstruction context, we may wish to obtain a map of networks that have connections with our hosted one for routing strategy optimization [10,11]. In both scenarios, we place passive traffic collectors, and we can collect the user profile (e.g., IP address) [12], hop distance (via TTL) [9,13], and label (via a community) [2] through continuous passive monitoring of the communication, starting from the target network [14]. Passive monitoring provides rich information, but there are two limitations: (1) it is often impractical to collect edges or relationships between nodes within the target networks, as their traffic does not pass through our collectors; and (2) there is little control over which targets are measured and, therefore, some data are invariably missing [14].

Model-Based NC: Kim and Leskovec developed KronEM [7], an expectation maximization approach combined with the Kronecker graphs model. They designed a scalable, metropolized Gibbs sampling approach for the estimation of the model parameters, as well as the inference of the missing part Z. KronEM suffers from three problems [16]: (1) only the network topology is considered and the side information of nodes is ignored; (2) not all real-world networks follow the Kronecker model; and (3) its speed and performance are not yet satisfactory.
Node-Similarity-Based NC: Other than the missing edges, the node identification and features in G may be available in real settings. Node-similarity-based network completion methods leverage the similarities between node features to infer Z. Matrix completion with decoupled transduction (MC-DT) [15] decouples the completion from transduction to effectively exploit the similarity information. Furthermore, joint node clustering and similarity learning (JCSL) [23] handles the situation, where the node features may be partially missing by computing the node similarities at the cluster level, then jointly co-factorizing the observed adjacency matrix with the cluster-based similarities. However, MC-DT and JCSL have two shortcomings: (1) choosing an appropriate similarity metric is a prerequisite but challenging in practice [24]; and (2) the similarity matrix is exploited linearly, which cannot reflect the nonlinearity between node features and network structures.
Network Structure Learning: Instead of using a similarity graph based on the initial features, another approach in this category is to learn a network structure. Graph neural networks (GNNs) have become the standard toolkit for learning from networks [25,26]. However, most GNNs are designed for a relatively complete network structure, which makes them unsuitable for the network completion problem. Franceschi et al. sampled graph structures from a learnable fully connected structure and employed a bi-level optimization setup for simultaneously learning the GNN parameters and the structure [27]. Chen et al. proposed an iterative method to search for a hidden graph structure that augments the initial graph structure toward an optimal graph for (semi-)supervised prediction tasks [28]. Yu et al. introduced a GCN-based graph revision module for predicting missing edges and revising edge weights via joint optimization [29]. Hao et al. embedded the graph nodes into latent space, and then computed an embedding vector for the unobserved node, with attributes compared to another node's embedding for link prediction [30]. Shin et al. presented Edgeless-GNN for attributed network embedding with edgeless nodes by utilizing a k-nearest neighbor graph based on the similarity of node attributes. However, the existing network structure learning works ignore node labels and distances.
Deep Generative Graph Models: Recent advances in deep generative graph models have brought further progress in network completion. Inspired by GraphRNN [31], DeepNC infers the missing parts of a network via deep learning for solving NE-NC [16]. DeepNC first learns a likelihood over the edges via an RNN-based generative graph model by using structurally similar graphs as training data, and then infers the missing parts by applying an imputation strategy for the missing data. Similar to KronEM, DeepNC does not take advantage of the side information of nodes. In addition, DeepNC requires structurally similar graphs for training, which are often difficult to collect in reality; for example, route-level networks are hard to obtain due to privacy protection within autonomous systems [32]. Graphite [33] parameterizes variational autoencoders with GNNs and uses an iterative graph-refinement strategy inspired by low-rank approximations for decoding. G-GCN [34] produces a unified generative graph convolutional network that learns node embeddings for all nodes by sampling graph generation sequences constructed from H. Graphite and G-GCN are designed for edgeless NC, but neither of them considers node label or distance information, which leaves room for further improvement.

Present Work
In this work, we address the challenge of network completion using side information. We propose a node-label-, feature-, and distance-based network completion method (LFD-NC), a novel unified deep graph convolutional network that infers the missing edges by leveraging the node label, feature, and distance information. Specifically, we first construct an initial network topology for Z from the node labels using a stochastic block model, and then we adopt a GNN to obtain a refined estimation of Z using node features; after that, distance constraints are used for pruning the refined network. Lastly, we learn a joint distribution over Z by iteratively performing refinement and pruning.
The contributions of this paper are threefold: • We formalize NC with side information as a graph refinement problem; • We propose LFD-NC, a deep graph convolutional-network-based completion method by unifying the observed structure with node label, feature, and distance information; • We validate LFD-NC through extensive experiments on several real-world networks.

Methods
We begin by introducing the network completion problem with side information; then, we propose LFD-NC, a deep graph convolutional-network-based algorithm, to solve the problem.

Problem Formulation
We assume that there is a true undirected and unweighted network G(V, E, X, Y), where V = {v_1, v_2, . . . , v_N} denotes the node set with |V| = N, E ⊂ V × V denotes the edge set, X ∈ R^{N×F} is the feature matrix of the nodes in V, and Y ∈ {1, 2, . . . , C}^N represents the node labels of C classes or communities. We consider the NC task with node side information over G, where only a part of the topology of G is observed. As illustrated in Figure 1, we have complete knowledge of X and Y, as well as an observed induced subgraph O(V_O, E_O); in addition, a set D of shortest-path distances is observed, where each distance is measured from a source node u to its set of observed destination nodes. Let V_Z = V\V_O be the node set of the unobserved topology, and let M_Z represent all possible edges with at least one endpoint in V_Z. The task of NC is to infer the missing edges and non-edges in the unobserved part Z.

Figure 1. The NC problem with node side information. All nodes {v_1, v_2, . . . , v_9}, the node feature matrix X, and the node label vector Y are known. Edges (black solid lines) in O are observed, but edges (black dashed lines) in the unobserved part Z are missing. In addition, some shortest-path distances (red solid lines) between O and Z are known. To solve NC is to infer the missing edges and non-edges in Z, e.g., the edge (v_3, v_6) and the non-edge (v_6, v_9).
It is worth mentioning that the observed distance set D considered here is retrieved from passive measurements; therefore, we cannot control the observed shortest-path destination nodes D_u for a given source node u ∈ V_OD. In this paper, we further assume that, for any source node u ∈ V_OD, the destination node set D_u satisfies D_u ⊂ V_Z; that is, we cannot detect all the shortest-path distances from u. Although D indicates deterministic edges and non-edges in Z (see Section 2.5), we treat them as unobserved for the simplicity of the NC problem definition.
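To make the notation concrete, the candidate edge set M_Z can be enumerated directly from V and V_O. The following minimal Python sketch (function and variable names are ours, not from the paper) lists all unordered pairs with at least one endpoint in V_Z:

```python
from itertools import combinations

def candidate_edges(nodes, observed_nodes):
    """Enumerate M_Z: all unordered node pairs with at least one
    endpoint outside the observed node set V_O."""
    V_O = set(observed_nodes)
    return [(u, v) for u, v in combinations(sorted(nodes), 2)
            if u not in V_O or v not in V_O]
```

For N nodes, |M_Z| = N(N−1)/2 − |V_O|(|V_O|−1)/2, which is the search space that the completion methods below operate on.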
Let us model E_Z through the following probability distribution, which is parameterized by Θ:

P(E_Z | O, X, Y, D; Θ). (1)

Then, the objective of E-NC is to find the most likely configuration of E_Z.

Overview of LFD-NC
LFD-NC is motivated by the fact that network topology, node labels, and node features are correlated. Thus, we aimed to find the optimized network topology E_Z that is most consistent with the observed O, X, Y, and D. As there is no closed form for the posterior in Equation (1), LFD-NC models it by iteratively refining the network topology. Our computing framework comprises four phases: label-based topology initialization, edge probability learning, distance pruning, and topology refinement. The LFD-NC architecture is shown in Figure 2. In the label-based topology initialization phase, LFD-NC computes an estimated topology G_L for Z, using only the node label information Y. Then, in the edge probability learning phase, we take G_L, the node features X, and the node labels Y as the inputs to learn a new edge probability P_X(u, v) for (u, v) ∈ M_Z, which leads to a better topology G_X. Notably, learning P_X is a typical link-prediction task, where standard methods such as GNNs [35,36] or improved methods with label propagation [37][38][39] can be directly applied.
We then prune edges that should not exist on the basis of node distance constraints D. Lastly, we refine the estimated topology G X by iteratively performing edge probability learning and distance pruning.
From a node-embedding perspective, we first map each node in G to a low-dimensional vector in LFD-NC's first two phases, then we learn a pairwise decoder to predict the edges and non-edges in Z in the edge probability learning phase. Lastly, we refine the network topology in the distance pruning phase, and recalculate the node embeddings correspondingly using topology refinement.
Next, we elaborate the details of each step.

Label-Based Topology Initialization
NC suffers from the cold start problem [15,34], i.e., there are no prior connections for the nodes in Z. Therefore, standard GNN methods, such as GCN [35] and GAT [40], cannot be directly applied for NC since the missing edges block message passing and aggregation between O and Z.
We present node-label-based topology initialization to overcome the cold start problem in NC, which is motivated by two insights. Firstly, the community structure is positively correlated with the underlying network topology [21]. Most complex networks show a community structure, i.e., blocks of nodes that have a high density of edges within them, and a lower density of edges between them. Community structure is often detected on the basis of the known underlying network topology. Here, we take the opposite direction and treat the community structure as a good initial estimation of the unobserved network topology. Secondly, the label information can be integrated to improve the performance of node embedding. It has been shown that unifying label propagation and GNNs overcomes the over- or under-smoothing issue of GNNs [37,38,41]. In this paper, we treat the known labels Y as the optimized result of the label propagation procedure, and then find the corresponding topology.
We initialize the edges in M_Z using the stochastic block model (SBM) [3,42,43], which is widely used to model communities in complex networks by modulating the intra- and inter-block connections. Specifically, the edge probability in M_Z is determined by the following:

P_L(u, v) = p_{Y(u),Y(v)}, (2)

where p ∈ [0, 1]^{C×C} is the matrix of edge probabilities between communities. The estimated edge weight in M_Z is calculated as follows:

w_L(u, v) = α_L P_L(u, v), (3)

where α_L is a parameter that controls the strength of P_L in the estimated graph G_L. If we set w_L(u, v) = 0 for u, v ∈ V_O, and denote W_L = [w_L(u, v)] ∈ R^{N×N} as the SBM-estimated matrix, then the adjacency matrix of G_L can be represented by the following:

A_{G_L} = A_O + W_L. (4)

We show that A_{G_L} can be the underlying topology for label propagation. The label propagation procedure in iteration k can be formulated as follows [37]:

Y^{(k+1)} = D^{-1} A_G Y^{(k)}, (5)

where A_G = [a_ij] ∈ [0, 1]^{N×N} is the partially unknown adjacency matrix of G, and D is a diagonal matrix with D_ii = Σ_j A_ij. We have Y^{(k+1)} = Y^{(k)} = Y in our settings; hence, Equation (5) holds when A_G = A_{G_L}, which indicates that A_{G_L} is a valid solution of Equation (5). Therefore, A_{G_L} is consistent with label propagation.
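As an illustration, the initialization phase can be sketched in a few lines of numpy. The function below is our own hedged reading of Equations (2)-(4); the function name and input format are assumptions, and the planted-partition p used in the experiments would be an identity-like block matrix:

```python
import numpy as np

def label_topology_init(A_O, labels, observed, p_block, alpha_L):
    """Label-based topology initialization, sketching Equations (2)-(4).

    A_O:      (N, N) observed adjacency (zeros for unobserved pairs).
    labels:   length-N integer array of community labels in {0..C-1}.
    observed: length-N boolean mask of the observed node set V_O.
    p_block:  (C, C) SBM matrix p of inter-community edge probabilities.
    alpha_L:  strength of the SBM estimate in G_L.
    """
    # P_L(u, v) = p[Y(u), Y(v)] for every node pair  (Equation (2))
    P_L = p_block[np.ix_(labels, labels)]
    W_L = alpha_L * P_L                        # Equation (3)
    np.fill_diagonal(W_L, 0.0)
    # pairs with both endpoints in V_O are not in M_Z: w_L(u, v) = 0
    W_L[np.outer(observed, observed)] = 0.0
    return A_O + W_L                           # Equation (4)
```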

Edge Probability Learning
Edge probability learning aims to obtain a better network topology from G_L, X, and Y, and we treat it as a link-prediction task. After the label-based topology initialization phase, we directly exploit network embedding techniques to find the proper function f as follows:

H = f(A_{G_L}, X, Y), (6)

where H is the node embedding matrix. Many existing methods can be used to obtain H [35][36][37][38][39]. In this paper, we adopt GCN [35] to model f. In GCN, the hidden representations for each layer can be obtained by the following:

H^{(l+1)} = σ(D̃^{-1/2} Ã D̃^{-1/2} H^{(l)} W^{(l)}), (7)

where Ã = A_{G_L} + I_N is the adjacency matrix of G_L with added self-loops, D̃ is the corresponding diagonal degree matrix with D̃_ii = Σ_j Ã_ij, W^{(l)} is a layer-wise learnable weight matrix, σ(·) denotes an activation function such as ReLU [44], and H^{(l)} is the node embedding of layer l with H^{(0)} = [X|Y], where [X|Y] is a concatenation of X and the one-hot label indicators. We consider a two-layer GCN as our forward model, and the final embedding is the following:

H = Â ReLU(Â [X|Y] W^{(0)}) W^{(1)}, with Â = D̃^{-1/2} Ã D̃^{-1/2}. (8)

The weight matrices W^{(0)} and W^{(1)} are calculated by minimizing the cross-entropy error over the labeled edges in O:

L = − Σ_{(u,v)} [a_uv log p(u, v) + (1 − a_uv) log(1 − p(u, v))], (9)

where the sum runs over the observed edges in E_O and an equal number of sampled non-edges. Due to the sparse nature of real-world networks, there are only a small number of edges among all node pairs; thus, we generate |E_O| non-edges via random sampling.
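The two-layer forward pass of Equations (7) and (8) can be sketched with numpy as follows; the weights are passed in untrained here (in the paper they are learned by minimizing Equation (9)), and the function name is our own:

```python
import numpy as np

def gcn_embed(A, X, W0, W1):
    """Two-layer GCN forward pass, sketching Equations (7)-(8).

    A:      (N, N) weighted adjacency of G_L.
    X:      (N, F) input features, i.e., the concatenation [X|Y].
    W0, W1: layer weight matrices (untrained in this sketch).
    """
    A_tilde = A + np.eye(len(A))                  # add self-loops
    d = A_tilde.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt     # symmetric normalization
    H1 = np.maximum(A_hat @ X @ W0, 0.0)          # hidden layer with ReLU
    return A_hat @ H1 @ W1                        # final embedding H
```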
The edge probability in M_Z then takes the following simple form:

p_X(u, v) = σ(H_u H_v^T), (10)

where σ(·) is the sigmoid function and H_u is the embedding of node u. Using Equation (10), we provide the realization of Equation (1) as follows:

P(E_Z | O, X, Y, D; Θ) = Π_{(u,v)∈M_Z} p_X(u, v)^{e_uv} (1 − p_X(u, v))^{1−e_uv}, (11)

where e_uv = 1 if (u, v) ∈ E_Z and e_uv = 0 otherwise. We denote P_X = [p_X(u, v)] ∈ R^{N×N} as the link likelihood matrix, and set p_X(u, v) = 0 for u, v ∈ V_O; then, the adjacency matrix of G_X can be represented by the following:

A_{G_X} = A_O + P_X. (12)
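Assuming the standard inner-product decoder of Equation (10), the link likelihood matrix P_X and its masking on V_O can be sketched as (function names are ours):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def link_likelihood(H, observed):
    """Inner-product decoder, sketching Equation (10) and the V_O mask.

    H:        (N, d) node embedding matrix.
    observed: length-N boolean mask of the observed node set V_O.
    """
    P_X = sigmoid(H @ H.T)                    # p_X(u, v) for all pairs
    np.fill_diagonal(P_X, 0.0)                # no self-loops
    P_X[np.outer(observed, observed)] = 0.0   # p_X(u, v) = 0 inside V_O
    return P_X
```

Adding A_O then gives A_{G_X} = A_O + P_X as in Equation (12).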

Distance Pruning and Topology Refinement
Distance pruning and topology refinement aim to further improve the performance of node embedding. The distance constraint D indicates the existence of edges and non-edges between some node pairs, whereby clamping the edge probability of these node pairs leads to a clearer network topology G D . Then, we take G D instead of G L , and repeat the edge probability learning process to gradually refine the node embedding matrix H.
Given the distance constraint D, we may calculate two deterministic sets: an edge set E_D ⊂ M_Z and a non-edge set E_D^∅ ⊂ M_Z. The calculation is based on Observation 1 and Observation 2 [11].

Observation 1. For any given nodes u, v, and w in an undirected and unweighted graph G(V, E), if |d_uv − d_uw| ≥ 2, then (w, v) ∉ E holds.

Observation 2.
Let L_u^i = {v | d_uv = i, v ∈ V} be the set of nodes at the same distance i from u. For two given nodes v ∈ L_u^i and w ∈ L_u^{i+1}, if (x, w) ∉ E for every node x ∈ L_u^i \ {v}, then (v, w) ∈ E holds.
Note that Observation 2 requires that u has observed distances to all the other nodes in G, which cannot be met under our assumption. Therefore, E_D only contains the observed direct neighbors of the distance monitor nodes V_OD.
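Observation 1 alone already yields the deterministic non-edge set E_D^∅ from passive distance measurements. A sketch follows, where the per-source-node input format and the function name are our own choices:

```python
def nonedges_from_distances(dist_rows):
    """Derive the deterministic non-edge set E_D^∅ via Observation 1.

    dist_rows: dict mapping a source node u to a dict {v: d_uv} of
    observed shortest-path distances from u.
    Returns a set of unordered pairs (v, w) that cannot be edges.
    """
    nonedges = set()
    for u, d in dist_rows.items():
        targets = list(d.items())
        for i, (v, dv) in enumerate(targets):
            for w, dw in targets[i + 1:]:
                # |d_uv - d_uw| >= 2 implies (v, w) is not an edge
                if abs(dv - dw) >= 2:
                    nonedges.add((min(v, w), max(v, w)))
    return nonedges
```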
After the calculation of E_D and E_D^∅, we clamp the probability of the edges in E_D to 1 and the probability of the non-edges in E_D^∅ to 0. The distance pruning process can then be represented as follows:

p_D(u, v) = 1 if (u, v) ∈ E_D; p_D(u, v) = 0 if (u, v) ∈ E_D^∅; p_D(u, v) = p_X(u, v) otherwise. (13)

Then, we assign the adjacency matrix A_{G_D} of G_D as the masked A_{G_X}. We ignore E_D in LFD-NC, as |E_D| ≪ |E_D^∅|. We summarize our algorithm in Algorithm 1. The time complexity of LFD-NC is the same as that of GCN. The complexity of line 1 in Algorithm 1 is O(|M_Z|). In GCN, it is usually satisfied that F > F^(1) ≥ F^(2); thus, the complexity of line 3 is O(N²F + NF²). The complexity of lines 4 and 5 is O(|M_Z|). Since |M_Z| < N², the total complexity of LFD-NC is dominated by GCN.
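The pruning step of Equation (13) then reduces to masking the estimated adjacency. Since E_D is ignored in LFD-NC (|E_D| ≪ |E_D^∅|), only the clamping to zero is sketched here (function name is ours):

```python
import numpy as np

def distance_prune(A_GX, nonedges):
    """Clamp deterministic non-edges, sketching Equation (13).

    A_GX:     (N, N) estimated adjacency A_O + P_X.
    nonedges: iterable of (u, v) pairs in E_D^∅ from the distance
              constraints, known not to be edges.
    """
    A_GD = A_GX.copy()
    for u, v in nonedges:
        A_GD[u, v] = 0.0   # Equation (13): p_D(u, v) = 0 on E_D^∅
        A_GD[v, u] = 0.0   # keep the matrix symmetric
    return A_GD
```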

Algorithm 1. LFD-NC
Input: node features X, non-edge mask M_non-edge, observed graph matrix A_O, SBM-estimated matrix W_L, and topology refinement round R. Output: estimated graph A_{G_D}.
1. A_{G_L} = A_O + W_L //label-based topology initialization by Equation (4)
2. for r = 1 to R do
3. compute H and P_X by GCN //edge probability learning by Equations (8) and (10)
4. A_{G_X} = A_O + P_X //link prediction by Equation (12)
5. A_{G_D} = A_{G_X} ⊙ M_non-edge //distance pruning by Equation (13)
6. A_{G_L} ← A_{G_D} //topology refinement
7. end for
8. return A_{G_D}

Experiments
We conducted a series of experiments. First, we evaluated the performance of LFD-NC and compared it with the state-of-the-art network completion methods. Then, we analyzed the impacts of the four phases.

Datasets
We evaluated the performance of LFD-NC on eight real-world network datasets. The details of the eight datasets are presented below.
Actor is a film actor network [45]. Each node corresponds to an actor, and the edge between two nodes denotes co-occurrence on the same Wikipedia page. Node features correspond to some keywords in the Wikipedia pages. Nodes are classified into five categories in terms of words in the actor's Wikipedia entry.
Cornell, Texas, and Wisconsin are three small local web networks from WebKB [45], which is a webpage dataset collected by Carnegie Mellon University from computer science departments at various universities. Nodes and edges represent web pages and hyperlinks, respectively. The feature of each node is the bag-of-words representation of the corresponding page, and the label of each node is manually classified into five categories: student, project, course, staff, and faculty.
Cora, Citeseer, and PubMed are three classic citation networks [46]. Nodes represent scientific papers, and edges represent citation relationships. Node features correspond to the bag-of-words representation of the paper and the label of each node represents one of the academic topics.
WikiCS is a Wikipedia web network [47]. It consists of nodes corresponding to computer science articles, with edges based on hyperlinks. Node features are derived from the text of the corresponding articles. Nodes are labeled into 10 classes representing different branches of the field.
The dataset statistics are shown in Table 1. We only focused on the largest connected component for each network in this paper.

Baselines and Evaluation Metrics
The baselines were chosen from five different types of network completion algorithms for comparison: • SBM [42] only uses node labels, whereby the symmetric C × C matrix of edge probabilities p_{Y(u),Y(v)} is estimated from the observed O; • KronEM [7] only uses the network graph structure and ignores node features, node labels, and node distances; • MC-DT [15] employs both the pairwise similarity of node features and the network graph structure, while ignoring node labels and node distances; the similarity information is utilized by matrix factorization in a linear way; • MLP-NC [48] considers node features and the network graph structure, while ignoring node labels and node distances; unlike MC-DT, MLP-NC directly learns a non-linear similarity metric; • G-GCN [34] also considers node features and the network graph structure, while ignoring node labels and node distances; unlike MLP-NC, G-GCN adopts a generative graph convolution model.
We treated the prediction of missing edges in Z as a binary classification, and we evaluated the performance of LFD-NC on the basis of two metrics: the area under the ROC curve (AUC) and average precision (AP). We randomly sampled equal numbers of negative and positive edges when evaluating AUC and AP.
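Both metrics can be computed with standard libraries (e.g., scikit-learn's roc_auc_score and average_precision_score); for completeness, a dependency-free sketch over one balanced sample of positive and negative edge scores:

```python
def auc_ap(scores_pos, scores_neg):
    """AUC and average precision for balanced edge/non-edge samples.

    scores_pos: predicted probabilities of true edges.
    scores_neg: predicted probabilities of sampled non-edges.
    """
    pairs = [(s, 1) for s in scores_pos] + [(s, 0) for s in scores_neg]
    pairs.sort(key=lambda x: -x[0])        # rank by descending score
    # AUC: fraction of (positive, negative) pairs ranked correctly
    wins = sum(1 for p in scores_pos for n in scores_neg if p > n)
    ties = sum(1 for p in scores_pos for n in scores_neg if p == n)
    auc = (wins + 0.5 * ties) / (len(scores_pos) * len(scores_neg))
    # AP: mean precision at each rank where a true edge is retrieved
    tp, precisions = 0, []
    for i, (_, label) in enumerate(pairs, start=1):
        if label == 1:
            tp += 1
            precisions.append(tp / i)
    ap = sum(precisions) / len(scores_pos)
    return auc, ap
```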

Implementation Details
We generated O(V O , E O ) by breadth-first search (BFS) traversal from a randomly selected node and split E\E O equally for validation and testing. Before training, node features X were normalized, and then extended by concatenating the one-hot encoding of Y for a fair comparison with MC-DT, MLP-NC and G-GCN.
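The BFS sampling of the observed subgraph O can be sketched as follows; the adjacency-list input and the function name are our own:

```python
from collections import deque

def bfs_observed(adj, start, n_observed):
    """Sample the observed node set V_O by BFS traversal.

    adj:        dict mapping each node to its list of neighbors.
    start:      randomly selected seed node.
    n_observed: number of nodes to keep in V_O.
    """
    visited, order = {start}, [start]
    queue = deque([start])
    while queue and len(order) < n_observed:
        u = queue.popleft()
        for v in adj[u]:
            if v not in visited:
                visited.add(v)
                order.append(v)
                queue.append(v)
                if len(order) == n_observed:
                    break
    return order  # V_O; edges with both endpoints here form E_O
```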
We adopted the implementation of KronEM in SNAP [49] and kept all parameters at their defaults, except for the initial Kronecker matrix, which we set to [0.9, 0.7; 0.7, 0.2] for symmetry. We implemented MC-DT using the SciPy Python library, where the feature similarity matrix was computed with the cosine metric, and we set the number of eigenvectors s to 20. We ran G-GCN using the officially released code and kept all parameters at their defaults, except for F^(1) and F^(2). We implemented MLP-NC and LFD-NC in PyTorch and open-sourced them at https://github.com/weiqianglg/LFD-NC.
For LFD-NC, we constructed D by randomly selecting a fixed number of source nodes for V_OD; then, we randomly split V_Z into |V_OD| disjoint subsets, where each subset corresponded to one source node. The topology refinement round was set to R = 4. We adopted the planted partition model in the SBM initialization, and we fixed p_{Y(u),Y(v)} = 1 if nodes u and v had the same label; otherwise, p_{Y(u),Y(v)} = 0 in Equation (2). For the G-GCN, MLP-NC, and LFD-NC methods, we set the hidden dimension as F^(1) = 64 and the final dimension as F^(2) = 32, and we used the Adam optimizer with a learning rate of 0.01 to train the three deep-learning models for all the datasets. We also applied an early stopping strategy over the AUC of the validation set, with the patience set to 20 epochs.

Tables 2 and 3 summarize the comparison results for the eight real-world network datasets. We kept |V_OD| = 16 and |E_O|/|E| = 0.8, and we set α_L = 10^-4 for Cornell, Cora, PubMed, and WikiCS, α_L = 6 × 10^-2 for Actor and Texas, α_L = 8 × 10^-2 for Wisconsin, and α_L = 1 × 10^-1 for Citeseer. The mean and the confidence interval of AUC and AP were measured over 10 random BFS samplings. LFD-NC outperformed all other comparative models on the eight datasets. These results demonstrate that incorporating node label and distance constraints into GNN models significantly improves the solution of an NC problem. KronEM uses only the observed network topology for completion, and SBM uses only node labels, which led to them performing the worst. MC-DT, MLP-NC, and G-GCN performed similarly to each other in general, but they consider neither node labels nor distance constraints; therefore, LFD-NC outperformed almost all of them. For example, LFD-NC achieved about 21.9-23.2% AUC and 16.1-22.5% AP absolute improvements, compared with the second-best methods in Cornell and Texas.
These improvements were contributed by both the node label information and the distance constraints; in Cora and Citeseer, LFD-NC achieved 5.5-7.3% AUC and 5.2-8.4% AP absolute improvements, compared with the second-best methods. These improvements were mainly contributed by the distance constraints. We systematically studied the impact of node labels and distance constraints, as presented below.

Node labels are integrated into the initial topology by the SBM-estimated matrix, which is controlled by the parameter α_L, as shown in Equations (3) and (4). The SBM estimation enhances the connections for nodes in the same class, but it creates noisy edges that do not actually exist. Therefore, we need to set the parameter α_L properly. Figure 3 shows the impact of the parameter α_L on AUC and AP in Actor and Cora. The importance of α_L clearly varied with the dataset. A properly chosen α_L produced 5% absolute AUC and AP improvements, compared with α_L = 0 in Actor, whereas there was no improvement in Cora. The effectiveness of the label-based topology initialization was affected by the correlation between the node features and edges. When α_L was close to zero, LFD-NC degenerated into an MLP model, which did not use label information; when α_L became large, more noisy edges were added into the initial topology A_{G_L}, which eventually degraded the performance. Therefore, a small α_L produced good results in Cora, where the edge likelihood was mainly determined by the node features; however, in Actor, neither a small nor a large α_L produced the best results.

Figure 4 presents the impacts of the distance constraints on AUC and AP. A higher deterministic non-edge rate |E_D^∅|/|M_Z| achieved about 5.3-14.4% AUC and 5.5-12.6% AP absolute improvements, compared with zero monitors (|E_D^∅|/|M_Z| = 0). Distance pruning restricted the probability of edges in Z and reduced the uncertainty of the estimated topology; therefore, the AUC and AP of LFD-NC increased with |E_D^∅|/|M_Z|.

Ablation Study
LFD-NC solves the NC problem in four sequential phases; here, we performed an ablation study of LFD-NC on the Actor and Texas datasets, as shown in Table 4. Compared with the standard LFD-NC, removing the label-based topology initialization phase decreased AUC and AP by up to 6%; removing the topology refinement phase decreased them by up to 11%; and removing the distance pruning phase decreased them by more than 26%. The experiments show that all four phases are necessary for LFD-NC, and that the distance pruning phase is particularly important.

Conclusions and Future Work
We presented LFD-NC, a unified deep graph convolutional network for network completion. LFD-NC integrates node label, feature, and distance information through a graph refinement framework. Experiments on eight datasets demonstrated that our model outperforms state-of-the-art baseline methods.
Our work can be easily extended to directed graphs and multigraphs. To treat directed graphs, the only necessary change is to perform directed node embedding in the edge probability learning phase. To treat multigraphs, we take each type of relationship individually, and then combine all the inferred results to achieve a final completion.
We note three possible directions for future work. Firstly, our proposed model assumes that node labels Y and node features X are completely known. An interesting direction would be to perform completion when parts of X and Y are missing and noisy. Secondly, LFD-NC has a long training time for large-scale networks; thus, reducing the time complexity is also a possible future research direction. Thirdly, we integrated distance constraints from randomly selected monitors; therefore, an effective monitor placement strategy can be designed for further performance improvement.
Author Contributions: Conceptualization, Q.W.; methodology, Q.W.; software, Q.W.; writing, original draft preparation, Q.W.; writing, review and editing, Q.W. and G.H. All authors have read and agreed to the published version of the manuscript.