ALPINE: Active Link Prediction using Network Embedding

Many real-world problems can be formalized as predicting links in a partially observed network. Examples include Facebook friendship suggestions, consumer-product recommendations, and the identification of hidden interactions between actors in a crime network. Several link prediction algorithms, notably those recently introduced using network embedding, are capable of doing this by just relying on the observed part of the network. Often, the link status of a node pair can be queried, which can be used as additional information by the link prediction algorithm. Unfortunately, such queries can be expensive or time-consuming, mandating the careful consideration of which node pairs to query. In this paper we estimate the improvement in link prediction accuracy after querying any particular node pair, to use in an active learning setup. Specifically, we propose ALPINE (Active Link Prediction usIng Network Embedding), the first method to achieve this for link prediction based on network embedding. To this end, we generalized the notion of V-optimality from experimental design to this setting, as well as more basic active learning heuristics originally developed in standard classification settings. Empirical results on real data show that ALPINE is scalable, and boosts link prediction accuracy with far fewer queries.


Introduction
Applications of Link Prediction (LP) in networks range from predicting social network friendships and consumer-product recommendations to citations in citation networks and protein-protein interactions. Such networks are usually only partially observed: node pairs are either connected (or linked), disconnected (or unlinked), or of unknown status. Indeed, obtaining network connections is usually resource-intensive (e.g., wet lab experiments or questionnaires), so that many of them remain unknown. Moreover, in many real-world networks new nodes are continuously added, with very limited information on their connectivity to the rest of the network. In such Partially Observed Networks (PONs), LP algorithms can be deployed to predict the missing link status information. When no attribute or meta-data is available for the nodes (the situation we focus on in this paper), this must be done relying solely on structural information [Kashima et al., 2009], i.e., the observed part of the PON.
In some cases, a budget is available for querying an oracle for the link status of a limited number of node pairs. For example, wet lab experiments can reveal missing protein-protein interactions, or questionnaires can ask consumers to indicate whether they have seen particular movies before. Unfortunately, such queries can be expensive, and the link status of some node pairs is more informative than that of others, so queries must be chosen wisely. Given a finite budget, an active learning strategy that identifies and prioritizes the most informative queries is thus required for optimal LP accuracy on the unobserved part of the PON.
For LP tasks, Network Embedding (NE) methods have become increasingly popular, owing to their high accuracy as well as their versatility for other downstream tasks. Thus, in this paper we develop ALPINE (Active Link Prediction usIng Network Embedding), the first active learning method for NE-based LP in PONs. For concreteness, we derived ALPINE for Conditional Network Embedding (CNE) [Kang et al., 2019], which achieves state-of-the-art LP accuracy [Kang et al., 2019]. Moreover, as opposed to other popular NE methods (including all those based on random walks), CNE can distinguish disconnected node pairs from those with unknown status. Additionally, its objective function is easy to express analytically, which allows principled mathematical derivations for our active learning query strategies. Yet, it will be clear that ALPINE can also be derived for a wide range of other NE-based LP methods.
Given a PON, ALPINE must thus quantify the usefulness of querying a node pair with unknown status. We introduce different strategies for doing this. First, generalizing the notion of V-optimality from experimental design and exploiting the notion of Fisher information, we derive a principled measure that quantifies the reduction in variance on the link probabilities for node pairs with unknown status. Second, we propose several other heuristic strategies similar to those used in the standard active learning literature.

Example. We illustrate the idea behind ALPINE (with the V-optimality query strategy) on the Harry Potter network [Evans et al., 2014] of 65 Harry Potter characters and 221 ally connections amongst them (see Fig. 1). We take node 'Harry Potter' (pentagon node) and assume its connectivity is not [...]

To the best of our knowledge, NE-based active LP has not been studied before. Indeed, while ample previous work on active learning for graphs exists, this mainly focuses on node classification [Yang et al., 2014; Yang et al., 2016; Bilgic et al., 2010]. The main contributions of this paper are thus:
• We highlight the importance of distinguishing the disconnected versus unknown status of node pairs, a rather obvious but often ignored fact (Sec. 3.1).
• We propose ALPINE, an active learning approach for LP algorithms that are based on NE (Sec. 3.2).
• To identify the most informative node pairs to query, we generalize the notion of V-optimality to this setting. Moreover, we also propose a number of simpler heuristic query strategies inspired by active learning for standard learning settings (Sec. 3.3).
• Qualitative and quantitative evaluations show that ALPINE with the V-optimality query strategy does indeed perform far better than random querying. Moreover, we show that two easy-to-compute heuristics achieve very similar performance, making them good alternatives on large networks (Sec. 4).

Background
Here we briefly survey active learning, (NE-based) link prediction, and some directly related work.

Active Learning and Experimental Design
Active learning is a subfield of machine learning that aims to exploit the situation where learning algorithms are allowed to actively choose the training data from which they learn. It is particularly valuable in domains where training labels are scarce and expensive to acquire [Brinker, 2003; Cai et al., 2017; Settles, 2009]. The success of an active learning strategy depends on how much more effective its choice of training data is when compared to randomly sampled training data. Of particular interest to the current paper is pool-based active learning, where a pool of unlabeled data points is provided, and a subset from this pool can be selected for labeling. In the context of the present paper, the unlabeled 'data points' are the node pairs with unknown link status, and an active learning strategy would aim to query the link status of those node pairs from this pool that are most informative for an LP algorithm when used to predict the link status of the remaining node pairs for which it is unknown.

Active learning is closely related to Optimal Experimental Design (OED) in statistics [Atkinson et al., 2007], which aims to design optimal 'experiments' (i.e., the acquisition of training labels) with respect to a statistical criterion and within a certain cost budget. The objective of OED is usually to minimize a quantity related to the (co)variance matrix of the estimated model parameters, or of the predictions this model makes on the test data points. In models estimated by the maximum likelihood principle, a crucial quantity in OED is the Fisher Information: as the (asymptotic) reciprocal of the estimator variance, it allows quantifying the amount of information a training data point carries about the parameter.
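As a small worked illustration of the relation between Fisher Information and estimator variance (a textbook example, not part of the ALPINE derivation), consider estimating the parameter θ of a Bernoulli distribution from n independent observations y_1, ..., y_n. The symbols below are introduced only for this illustration.

```latex
\begin{align*}
\ell(\theta) &= \sum_{k=1}^{n} \bigl[\, y_k \log\theta + (1 - y_k)\log(1 - \theta) \,\bigr], \\
I(\theta) &= -\,\mathbb{E}\!\left[ \frac{\partial^2 \ell}{\partial \theta^2} \right]
           = \frac{n}{\theta\,(1 - \theta)}, \\
\hat\theta &= \frac{1}{n}\sum_{k=1}^{n} y_k, \qquad
\mathrm{Var}(\hat\theta) = \frac{\theta\,(1 - \theta)}{n} = I(\theta)^{-1}.
\end{align*}
```

In this simple case each additional observation adds 1/(θ(1 − θ)) to the Fisher Information and shrinks the variance of the estimate accordingly, which is the kind of effect a variance-based query criterion tries to maximize.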
Although long studied in statistics, the idea of variance minimization first appeared in the machine learning literature for regression [Cohn et al., 1996], and the Fisher Information was later used to judge the asymptotic value of unlabeled data for classification [Zhang and Oles, 2000]. Yet, despite this related work in active learning, and the rich and mature statistical literature on OED for classification and regression problems, to the best of our knowledge the concept of variance reduction has not yet been applied to LP in networks.

Link Prediction and Network Embedding
LP algorithms can be used in PONs to predict whether node pairs with unknown status are linked or not. LP has been widely applied in friendship recommendations, recommender systems, knowledge graph completion, and more. While there are numerous conventional LP methods based on heuristic statistics [Martínez et al., 2017], recently proposed NE-based methods have been reported to outperform them [Grover and Leskovec, 2016; Kang et al., 2019].
Given a network $G = (V, E)$ with nodes $V$ and links $E \subseteq V \times V$, the goal of NE is to find a mapping $f : V \to \mathbb{R}^d$ from nodes to $d$-dimensional real vectors. An NE is denoted as $X = (x_1, x_2, \ldots, x_n) \in \mathbb{R}^{n \times d}$, where $X^*$ denotes an optimal embedding given the network $G$ with adjacency matrix $A$. NEs can be used for a variety of downstream tasks, including visualization, node classification, and also LP. When used for LP, a function $g : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ evaluated on the vectors $x_i$ and $x_j$ might represent the probability (or other score) for a link to exist between $i$ and $j$. In practice, $g$ can be found by training a classifier (e.g., logistic regression) on a set of linked and unlinked node pairs, while in other cases it follows directly from the embedding model. The method CNE used in this work is of the latter type, and aims to find an embedding that maximizes the probability of the network given the embedding:

$$X^* = \operatorname{argmax}_X P(G \mid X).$$

Most NE methods treat node pairs with unknown status as unconnected. For example, in methods based on skip-gram with negative sampling (e.g., DeepWalk, node2vec), the random walks used to determine similarities between nodes use only known existing links, ignoring the unknown but potentially existing ones. Those methods may therefore be suboptimal when applied to PONs, and they cannot exploit new knowledge that a node pair with unknown status is not connected. As we will show in Sec. 3, however, CNE can trivially be modified to distinguish these two situations, an important factor contributing to our decision to use CNE in this paper.

Related work
Our work sits at the intersection of three topics: active learning, link prediction, and network embedding. Prior work exists on each pair of these topics, but not on all three together. To the best of our knowledge, ALPINE is the first method for actively learning an NE for the purpose of LP.

NE-based LP. Many graph embedding methods have been proposed in the past years. Based on neighborhood information, first-order methods (e.g., CNE [Kang et al., 2019]) and higher-order methods (e.g., DeepWalk [Perozzi et al., 2014], node2vec [Grover and Leskovec, 2016]) have been designed to perform multiple tasks, such as LP and node classification. Recently, Graph Convolutional Neural Networks (GCNNs) [Kipf and Welling, 2017] were also introduced, allowing nodes to recursively aggregate information from their neighbors. However, GCNNs mainly focus on node classification.

Active learning for NE. A few recent works learn an NE in an active manner, but they target node classification [Cai et al., 2017; Chen et al., 2019] instead of LP.

Active learning and link prediction. Work on active learning for graphs has focused on node and graph classification, as well as LP [Settles, 2009; Aggarwal et al., 2014]. The graph classification task considers data samples as graph objects, useful for drug discovery and subgraph mining [Kong et al., 2011], while the node classification task aims to label nodes in graphs [Bilgic et al., 2010; Cesa-Bianchi et al., 2013; Guillory and Bilmes, 2009]. Other active learning research for LP considers different problem settings, e.g., link classification for signed networks [Cesa-Bianchi et al., 2012], learning for graph edge flows [Jia et al., 2019], and training of a neural link predictor [Ostapuk et al., 2019]. Probably the most strongly related method is HALLP [Chen et al., 2014], which uses active learning for LP. However, the method is heuristic, it does not distinguish disconnected from unknown node pairs, and it is based on a simple LP method that classifies node pairs according to a fixed representation of them in terms of engineered features, rather than a learned NE.

Method
Section 3.1 formally defines PONs, and discusses how CNE is naturally suitable for embedding PONs. Section 3.2 describes ALPINE, our NE-based active LP framework. Section 3.3 shows how we generalize the notion of V-optimality from experimental design for ALPINE, as well as more heuristic active learning query strategies.

Link Prediction for PONs
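We formally define PONs as follows:
Definition 1. A Partially Observed Network (PON) is a tuple $G = (V, E, U)$, where $V$ is the set of nodes and $E, U \subseteq V \times V$ are the disjoint sets of node pairs with connected and unknown status respectively. The remaining node pairs, $D = (V \times V) \setminus (E \cup U)$, form the set of node pairs observed to be disconnected. Therefore, $K = E \cup D$ represents the observed part of the PON.
To represent the three types of node pair status, the adjacency matrix $A$ of a PON has entries $a_{ij} \in \{0, 1, \text{null}\}$.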
The task of LP in a PON is to predict the connectivity status of node pairs $(i, j) \in U$, based on the available information in $G$. Remarkably, most NE methods (and LP methods more generally) do not treat disconnected node pairs differently from node pairs with unknown status. This is inevitably true for methods based on random walks (as a random walk cannot transition from node $i$ to $j$ if they are not known to be connected, regardless of whether they are known to be disconnected), and also true for many other methods such as those based on matrix decompositions. CNE, however, can be trivially modified to do so, by maximizing the probability only for the observed part (i.e., $(i, j) \in E \cup D$):

$$P(G \mid X) = \prod_{(i,j) \in E} P(a_{ij} = 1 \mid X) \prod_{(i,j) \in D} P(a_{ij} = 0 \mid X).$$

Furthermore, the link probability in CNE has an analytical form because the embedding is found by solving a Maximum Likelihood Estimation (MLE) problem: $\operatorname{argmax}_X P(G \mid X)$. Next, we will show how this allows us to quantify the informativeness of the node pairs with unknown status in ALPINE.
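To make the distinction between disconnected and unknown node pairs concrete, the following minimal sketch evaluates a log-likelihood over the observed part E ∪ D only. It assumes that unknown entries of the adjacency matrix are encoded as NaN and that a matrix of link probabilities is already available; the function name `pon_log_likelihood` and this encoding are illustrative assumptions, not CNE's actual implementation.

```python
import numpy as np

def pon_log_likelihood(A, P):
    """Log-likelihood of the observed node pairs of a PON.

    A: (n, n) symmetric array with entries 1 (linked), 0 (unlinked)
       or np.nan (unknown status; an encoding assumed for this sketch).
    P: (n, n) array of model link probabilities P(a_ij = 1 | X).
    """
    iu = np.triu_indices_from(A, k=1)   # count each unordered pair once
    a, p = A[iu], P[iu]
    observed = ~np.isnan(a)             # keep only pairs in E or D
    a, p = a[observed], p[observed]
    return float(np.sum(a * np.log(p) + (1 - a) * np.log(1 - p)))

# Toy PON on three nodes: (0, 1) linked, (1, 2) unlinked, (0, 2) unknown.
A = np.array([[np.nan, 1.0, np.nan],
              [1.0, np.nan, 0.0],
              [np.nan, 0.0, np.nan]])
P = np.full((3, 3), 0.5)
ll = pon_log_likelihood(A, P)

# Changing the predicted probability of the unknown pair leaves ll unchanged.
P_alt = P.copy()
P_alt[0, 2] = P_alt[2, 0] = 0.99
ll_alt = pon_log_likelihood(A, P_alt)
```

Because the unknown pair (0, 2) is masked out, the model is neither rewarded nor penalized for its prediction on that pair, unlike methods that treat unknown pairs as disconnected.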

ALPINE
Here, we introduce ALPINE, a pool-based active learning approach [Settles, 2009] for NE with LP as a downstream task. We develop ALPINE for CNE, although we stress that our arguments can be applied in principle to any other NE method whose objective function can be expressed analytically.
ALPINE works by finding an optimal NE for a given PON G = (V, E, U ), selecting one or a few node pairs from U to query, updating the PON with the new knowledge (i.e. node pairs from U found to be connected are moved to E, those unconnected are removed from U ), and re-embedding the updated PON. This process can be iterated until the budget is exhausted or until the model is sufficiently accurate.
We will introduce different strategies for selecting the node pairs to query, relying on different utility functions $u_{A,X} : V \times V \to \mathbb{R}$. Each utility function quantifies how useful knowing the connectivity status of a node pair is estimated to be for increasing the LP accuracy on the node pairs in $U$, when based on the updated NE (see Sec. 3.3). Specifically, each query strategy selects the next query as

$$\operatorname{argmax}_{(i,j) \in U} u_{A,X}(i, j)$$

for an appropriate utility function $u_{A,X}$. In practice, not just the single best node pair (argmax) is selected in each iteration, but the $s$ best ones, where $s$ is further referred to as the 'step size'.
Thus, given a PON G = (V, E, U), an NE model, a query strategy with associated utility function $u_{A,X}$, a step size s, and a budget B (the number of node pairs in U that can be queried), ALPINE works as follows. At iteration it = 0, we initialize the pool of node pairs with unknown status as U(0) = U and the known part as K(0), according to the adjacency matrix A(0) of the given PON G(0) = G. Each iteration then performs the following steps:
1. Compute X*(it) as an optimal embedding of G(it);
2. Find the best query Q(it) ⊆ U(it) as the set of |Q(it)| = min(s, B) elements from U(it) with the largest values of the utility function $u_{A(it), X^*(it)}$;
3. Query the oracle for the connectivity status of the node pairs in Q(it), set U(it + 1) ← U(it) \ Q(it), and add each (i, j) ∈ Q(it) revealed as connected to E(it + 1);
4. Set B ← B − |Q(it)|, and break if B = 0.
If desired, the LP accuracy based on the embedding can be monitored on a hold-out set during these iterations, and one can stop early as soon as the accuracy meets a threshold.
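The iteration above can be sketched as a short pool-based loop. This is a minimal sketch under stated assumptions: `embed`, `utility` and `oracle` are hypothetical callables supplied by the user (standing in for the NE method, the query strategy and the label source), and node pairs are represented as tuples in Python sets.

```python
def alpine(E, D, U, embed, utility, oracle, budget, step):
    """Pool-based active learning loop over a PON (illustrative sketch).

    E, D, U : sets of node pairs that are linked, unlinked, and unknown.
    embed   : callable (E, D, U) -> embedding X (hypothetical interface).
    utility : callable (pair, X) -> float, larger means more useful.
    oracle  : callable pair -> bool, True if the pair is linked.
    budget  : total number of queries allowed.
    step    : number of pairs queried per iteration (the step size s).
    """
    E, D, U = set(E), set(D), set(U)
    while budget > 0 and U:
        X = embed(E, D, U)                       # step 1: (re-)embed the PON
        k = min(step, budget, len(U))
        ranked = sorted(U, key=lambda pair: utility(pair, X), reverse=True)
        Q = ranked[:k]                           # step 2: top-k by utility
        for pair in Q:                           # step 3: query and update
            U.remove(pair)
            (E if oracle(pair) else D).add(pair)
        budget -= len(Q)                         # step 4: spend the budget
    return E, D, U

# Toy run with a dummy embedding and a made-up utility (closer indices first).
truth = {(0, 1), (1, 2)}
E_out, D_out, U_out = alpine(
    E={(0, 1)}, D=set(), U={(0, 2), (1, 2), (0, 3)},
    embed=lambda E, D, U: None,
    utility=lambda pair, X: -abs(pair[0] - pair[1]),
    oracle=lambda pair: pair in truth,
    budget=2, step=1,
)
```

In the toy run the pair (1, 2) is queried first and found to be linked, then (0, 2) is queried and found to be unlinked, leaving (0, 3) in the pool once the budget of two queries is spent.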

Query Strategies for ALPINE
Here we first derive a principled utility function based on the concept of V-optimality from OED. The utility function measures the informativeness of the connectivity of a node pair by identifying to what extent its knowledge is expected to minimize the variance of the predictions for the link status of node pairs in $U$. After that, we also introduce a range of other heuristic query strategies.

V-optimality and Variance Reduction. V-optimality from OED aims to choose the training data points so as to minimize the variance of the predictions of the learned model on the test data points. As the test set $U$ is finite and given in PONs, this aim naturally fits our problem setting. Thus, with $g$ the link prediction function, and with $P^*_{ij} \triangleq g(x^*_i, x^*_j) = P(a_{ij} = 1 \mid X^*)$ the probability of a link between nodes $i$ and $j$ given the CNE embedding, the utility function used in V-optimality is the reduction of $\sum_{(i,j) \in U} \mathrm{Var}(P^*_{ij})$ achieved by querying a particular node pair from $U$.

The challenge to be addressed is thus the computation of the reduction in the variance terms $\mathrm{Var}(P^*_{ij})$. Omitting details, we outline how this can be done. In ALPINE, CNE finds the optimal embedding $X^*$ as the Maximum Likelihood Estimator (MLE) given a PON with adjacency matrix $A$, i.e., $X^*$ maximizes $P(G \mid X)$ w.r.t. $X$. The variance of an MLE can be quantified in terms of the Fisher Information [Lehmann and Casella, 2006]. More precisely, the Cramér-Rao bound [Rao, 1992] bounds the variance of an MLE from below by the inverse of the Fisher Information: $\mathrm{Var}(X^*) \succeq I(X^*)^{-1}$. Although the Fisher Information can often not be computed exactly (as it requires knowledge of the data distribution), it can be effectively approximated by the observed information matrix [Efron and Hinkley, 1978]. For CNE, this observed information matrix for the MLE $x^*_i$ is given by (proof omitted for brevity): [...] where $\gamma$ is a CNE parameter. Thus, we can bound the covariance [...]

Using a first-order analysis (details omitted) to decompose $\mathrm{Var}(P^*_{ij})$ into a contribution from each end point as follows: [...] and using the bound on $\mathrm{Cov}(x^*_i)$ for all $i$, allows bounding the variance on the estimated probabilities $\mathrm{Var}(P^*_{ij})$ by bounding the two terms in the decomposition as follows: [...] and similarly for $\mathrm{Var}_{x^*_j}(P^*_{ij})$.

Querying node pair $(i, j) \in U$ will reduce the covariance matrices $C_i$ and $C_j$, as it creates additional information on their optimal values. For example for $x^*_i$ (and similarly for $x^*_j$, leading to $C_j^i$), the new covariance assuming $(i, j)$ has known status, denoted $C_i^j$, is: [...] (3)

Thus this leads to a reduction of the bounds on $\mathrm{Var}_{x^*_i}(P^*_{ij})$ and $\mathrm{Var}_{x^*_j}(P^*_{ij})$, and thus on $\mathrm{Var}(P^*_{ij})$ due to Eq. (1). Putting things together allows defining the V-optimality utility function, and proves a theorem for computing it:

Definition 2. The V-optimality utility function $u_{A,X^*}$ evaluated at $(i, j)$ quantifies the reduction in the bound on the sum of the variances $\mathrm{Var}(P^*_{kl})$ (see Eqs. (1) and (2)) of all $P^*_{kl}$ with $(k, l) \in U$, achieved by querying node pair $(i, j) \in U$.

Theorem 1. The V-optimality utility function is given by: [...] Applying the Sherman-Morrison formula to Eq. (3) allows rewriting $u_{ik}(i, j)$ as: [...] Thus, unsurprisingly, the variance reduction is always positive.
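The role of the Sherman-Morrison formula can be illustrated generically: when a query adds a rank-one term to an information matrix, the updated covariance bound is available in closed form without a new matrix inversion, and it never exceeds the old one. The sketch below is not the paper's exact expression; the direction `v` and weight `w` are hypothetical placeholders for whatever rank-one information a query contributes.

```python
import numpy as np

def rank_one_updated_cov(C, v, w):
    """Return (C^{-1} + w * v v^T)^{-1} using the Sherman-Morrison formula.

    C: (d, d) symmetric positive definite covariance (bound).
    v: (d,) direction of the added rank-one information (placeholder).
    w: non-negative scalar weight of that information (placeholder).
    """
    Cv = C @ v
    return C - (w / (1.0 + w * (v @ Cv))) * np.outer(Cv, Cv)

rng = np.random.default_rng(0)
M = rng.normal(size=(3, 3))
C = M @ M.T + np.eye(3)          # a symmetric positive definite matrix
v = rng.normal(size=3)
w = 0.7

C_new = rank_one_updated_cov(C, v, w)

# Cross-check against the direct (more expensive) computation.
C_direct = np.linalg.inv(np.linalg.inv(C) + w * np.outer(v, v))

# The reduction C - C_new is positive semi-definite, so the variance
# along any direction z cannot increase after the update.
reduction = C - C_new
z = rng.normal(size=3)
variance_drop = z @ reduction @ z
```

Since the reduction is a non-negative multiple of an outer product of a vector with itself, every quadratic form in it is non-negative, which mirrors the statement that the variance reduction obtained from a query cannot be negative.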
The max-ent. query strategy is a specific variant of the popular uncertainty sampling strategy in active learning, with the entropy as the uncertainty measure. The second and third strategies, max-prob. and min-dis., both tend to query node pairs that are linked with high probability. Indeed, this is true by definition for max-prob., and approximately true for min-dis., as nearby nodes in the embedding are connected with higher probability. The intuition behind these strategies is that links are often sparse in a network, so that queries that result in the discovery of new links are more informative. The last two strategies, max-deg. and page-rank., are degree-related; $PR_i$ in page-rank. is the PageRank score of node $i$, evaluated by treating node pairs with unknown status as disconnected.
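The three embedding-based heuristics can be sketched as simple scoring functions over predicted link probabilities and embedding vectors. The exact functional forms below (binary entropy, the raw probability, and negative Euclidean distance) are assumptions made for illustration, since the precise definitions are not reproduced in the text; the degree-based strategies are left out for the same reason.

```python
import numpy as np

def max_ent(p):
    """Uncertainty sampling: binary entropy of the link probability."""
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def max_prob(p):
    """Prefer node pairs predicted to be linked with high probability."""
    return np.asarray(p, dtype=float)

def min_dis(xi, xj):
    """Prefer node pairs that are close in the embedding space."""
    diff = np.asarray(xi, dtype=float) - np.asarray(xj, dtype=float)
    return -np.linalg.norm(diff, axis=-1)

# Three candidate node pairs with made-up probabilities and embeddings.
p = np.array([0.05, 0.5, 0.9])
xi = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
xj = np.array([[3.0, 4.0], [1.0, 0.0], [0.1, 0.0]])

best_by_entropy = int(np.argmax(max_ent(p)))       # most uncertain pair
best_by_prob = int(np.argmax(max_prob(p)))         # most likely link
best_by_distance = int(np.argmax(min_dis(xi, xj))) # closest pair
```

All three functions need only the current embedding and its predicted probabilities, so scoring the whole pool costs a single pass over the candidate pairs.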

Qualitative evaluation
As illustrated in Fig. 1, we apply ALPINE to the case when a new node is added to a network, and examine the behaviour of the V-optimality query strategy. We take node 39 ('Harry Potter', pentagon node) as a newly arrived node that has only one initial known connection, to node 22 ('Rubeus Hagrid', square node). Thus, the node pairs involving Harry, except the one with Rubeus, all have unknown status.
In the first iteration, the V-optimality query strategy used in ALPINE scores all the node pairs (i, j) ∈ U and suggests querying ('Harry Potter', 'Arthur Weasley'), where 'Arthur Weasley' (the node with the brightest yellow) is the father of the Weasley family. The other members of the Weasley family are also scored highly. This indicates that predicting the relationships of 'Harry Potter' with other characters can be improved by first querying possible connections with members of the Weasley family, who are close allies of 'Harry Potter' and have many connections to other characters.
From the second to the fifth iterations, the top ranked nodes are 'Ginny Weasley', 'Fred Weasley', 'George Weasley', 'Albus Dumbledore'. These early suggestions sketch the relationships of 'Harry Potter' to the entire network, thus allowing it to be well-embedded with just a few queries.

Quantitative evaluation
We quantitatively evaluate ALPINE with different query strategies for LP on node pairs with unknown status in a PON in an iterative manner. All query strategies proposed in Sec. 3.3 are used. Additionally, a baseline query strategy, which samples node pairs uniformly at random from U, is included for comparison. To construct a PON from each of the benchmark networks, information on 20% of the node pairs (both links and non-links) is removed, which forms U. Then we apply ALPINE for varying budget B and a range of step sizes s. [...] and Blog with B = 5M. Hereby, the LP accuracy is quantified as the AUC of all node pairs in U(0) (i.e., U at it = 0) with respect to the ground truth. Results are averaged over several U(0)s, and the PON with each U(0) is initialized with different random embeddings (10 × 10 for the first four, 5 × 5 for the fifth and 4 × 2 for the last). Scores of the random strategy are further averaged over 5 runs.
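For reference, the AUC of a set of scored node pairs can be computed as the fraction of (linked, unlinked) pairs of node pairs that the scores rank correctly, with ties counted as one half. The sketch below uses this pairwise definition, which is only practical for small pools because it builds a positives-by-negatives matrix; the function name `auc` and the toy scores are illustrative and not taken from the paper's code.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparisons.

    scores: predicted link scores for the node pairs in the pool.
    labels: ground-truth link status (True / 1 for linked).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]        # every positive vs. negative
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

# Four node pairs: two linked (scores 0.9 and 0.3), two unlinked (0.8 and 0.1).
example_auc = auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
```

In the example, three of the four (linked, unlinked) comparisons are ranked correctly, so the AUC is 0.75.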
All non-random strategies perform consistently and substantially better than the random query strategy, apparently clustering into two groups within which the accuracy is similar. The four best strategies are all NE-based: v-opt., max-ent., max-prob., and min-dis. With these strategies, ALPINE boosts link prediction accuracy with far fewer queries. One interesting finding is that max-prob., while different in spirit from uncertainty maximization, still performs similarly to max-ent. The reason could be that real-world networks are usually sparse, so that linked node pairs are more informative than unlinked ones. The second group consists of page-rank. and max-deg., both strategies that do not require the NE. Thus, the link status of high-degree nodes is more informative than that of random ones, but as a strategy it is inferior to the NE-based ones.
In practice, however, active learning is particularly useful when the budget is small. Thus, we investigated in greater detail the relative performance of the various query strategies for a small budget B, equal to 10% of |U|. Table 1 shows the increase in percentage points of the LP AUC compared to the AUC before active learning, for three different step sizes. The V-optimality query strategy outperforms the others in most cases, and is a close second or third in a few other cases, although max-prob. and max-ent. are never much worse. As a side result, Table 1 shows that the LP AUC is relatively insensitive to the step size.
We also evaluated ALPINE for predicting the connectivity to a newly added node, using as few queries as possible, with similar conclusions (details will be in an extended version).

Scalability
The runtime analysis (on a server with an Intel Core i5 CPU at 2.30GHz and 8GB RAM) of ALPINE with different query strategies is shown in Table 2. The embedding dimension is set to 8, and the information removed from the original network is 20% of the node pairs. The results are averaged over 10 random runs. Table 2 shows that the computation time per iteration of the V-optimality strategy increases dramatically as the network size grows. Given this, and their competitive performance in terms of LP accuracy, max-prob. and max-ent. are probably the query strategies of choice on larger problems.

Conclusion
Link prediction is an important task in network analysis, tackled increasingly using network embeddings. It is particularly important in partially observed networks, where finding out whether a pair of nodes is linked is time-consuming or costly, such that for a large number of node pairs it is not known if they are connected or not.
We propose to make use of active learning in this setting, and introduce ALPINE, a specific active learning approach for link prediction in such partially observed networks using network embedding. We first derived a principled query strategy that generalizes the notion of V-optimality from optimal experimental design to the current setting, identifying those node pairs which, if queried, will maximally reduce the variance on the link scores for the node pairs with unknown connectivity status. Additionally, several heuristic active learning strategies are also proposed as computationally efficient alternatives. Empirical evaluations show that ALPINE with the V-optimality query strategy performs best overall, albeit at a relatively high computational cost, while two intuitive heuristics achieve similar accuracies and scale to larger networks. All query strategies outperform by a large margin the random query strategy.
As future work, we plan to further improve the scalability of ALPINE, e.g., by using incremental embeddings at each iteration.