Review on Learning and Extracting Graph Features for Link Prediction

Abstract: Link prediction in complex networks has attracted considerable attention from interdisciplinary research communities, due to its ubiquitous applications in biological networks, social networks, transportation networks, telecommunication networks, and, recently, knowledge graphs. Numerous studies have used link prediction approaches to find missing links or to predict the likelihood of future links, and have applied them to network reconstruction, recommender systems, privacy control, etc. This work presents an extensive review of state-of-the-art methods and algorithms proposed on this subject and categorizes them into four main categories: similarity-based methods, probabilistic methods, relational models, and learning-based methods. Additionally, a collection of network data sets that can be used to study link prediction is presented. We conclude this study with a discussion of recent developments and future research directions.


Introduction
Online social networks [1], biological networks such as protein-protein interactions and genetic interactions between organisms [2], ecological systems of species, knowledge graphs [3], citation networks [4], and social relationships of users in personalized recommender systems [5] are all instances of graphs of complex interactions, which are also referred to as complex networks. Because these networks are almost always dynamic in nature, a vital question is how they change over time and, more specifically, what the future associations between entities in a graph under investigation will be. The problem of link prediction in graphs is one of the most interesting and long-standing challenges. Given a graph, which is an abstraction of the relationships among entities of a network, link prediction is the task of anticipating future connections among entities in the graph with respect to its current state. Link prediction models might (i) exploit similarity metrics as input features, (ii) embed the nodes into a low-dimensional vector space while preserving the topological structure of the graph, or (iii) combine the information derived from the two aforementioned points with the node attributes available from the data set.
All of these models rely on the hypothesis that higher similarity between nodes results in a higher probability of connection [6].
Applications of link prediction include analyzing user-user and user-content recommendations in online social networks [5,7,8,9], reconstructing the PPI (protein-protein interaction) network and reducing the noise present in it [10][11][12], hyper-link prediction [13], prediction of transportation networks [14], forecasting the behavior of terrorism campaigns and social bots [15,16], reasoning and sensemaking in knowledge graphs [17], and knowledge graph completion using data augmentation with Bidirectional Encoder Representations from Transformers (BERT) [18,19]. Link prediction in these applications has mostly been investigated through unsupervised graph representation and feature learning methods that are based on node (local) or path (global) similarity metrics that evaluate the neighboring nodes. Common neighbors, preferential attachment, Jaccard, Katz, and Adamic-Adar are some of the most widely used similarity metrics that measure the likelihood of edge associations in graphs. While these methods may seem dated, they are far from obsolete. Although they do not discover the graph attributes, they have remained popular for years due to their simplicity, interpretability, and scalability. Probabilistic models, on the other hand, aim to predict the likelihood of future connections between entities in an evolving dynamical graph with respect to the current state of the graph. Another context in which the problem of link prediction arises is relational data [20][21][22][23]. In this context, given a relational data set in which objects are related to each other, the task of link prediction is to predict the existence and type of links between pairs of objects [24]. In addition, the availability of labeled data allows supervised machine learning algorithms to provide new solutions for the link prediction task, including neural network-based methods [25], which learn a suitable heuristic rather than assuming strong relationships among vertices.
Similar surveys on the topic of link prediction exist, and this survey has benefited from them. The work of [26] provides a comprehensive review of the problem of link prediction within different types of graphs and of the applications of different algorithms. Other related review papers on this topic include the works of [27,28]. The work of [28] reviews the progress of link prediction algorithms from a physical perspective, together with applications and challenges for this line of research. Some of these reviews focus only on a specific set of methodologies proposed for link prediction, such as the work of [27], which presents an extensive review of relational machine learning algorithms specifically designed for knowledge graphs, and some important related methodologies are overlooked in the aforementioned studies. For instance, [28] does not discuss some important graph feature learning and neural network-based techniques that have been developed recently. Our effort has been to provide a review that includes the most recent approaches to the problem of link prediction that demonstrate promising results but are not fully covered by similar, otherwise exceptional, surveys such as the works of [26][27][28]. Thus, we believe that our study provides comprehensive information on the topic of link prediction for large networks, and that it can help readers to find the most relevant link prediction algorithms, which are deliberately categorized into the proposed taxonomy. This study reviews similarity-based methods, including local, global, and quasi-local approaches; probabilistic and relational methods as unsupervised solutions to the link prediction problem; and, finally, learning-based methods, including matrix factorization, path and walk based link prediction models, and neural networks for link prediction.

Background
A graph (complex network), denoted as $G = \langle V, E \rangle$, can be defined as the set of vertices (nodes) $V$ and the interactions among pairs of nodes, called links (edges) $E$, at a particular time $t$. It should be noted that, in this problem setting, self-connections and multiple links between nodes are not allowed and, accordingly, are not taken into account in the majority of link prediction problem settings [28]. The main idea behind applying feature extraction or feature learning-based methods to the link prediction problem is to use the present information regarding the existing edges in order to predict the future or missing links that will emerge at time $t' > t$. Graphs can be classified into two main categories according to the direction of the information flow between interacting nodes: directed and undirected graphs. Although many of the methods discussed in the next sections of this paper can provide solutions to the link prediction problem in directed graphs, the majority of the methods reviewed in this survey address the problem of link prediction for undirected graphs. The difference between the link prediction problem for these two graph categories arises from the additional information that is required for directed graphs. This information refers to the origin of the associated link in directed graphs, in which $\langle v_x, v_y \rangle$ conveys the existence of a directed edge from node $v_x$ to $v_y$ and $\langle v_x, v_y \rangle \neq \langle v_y, v_x \rangle$ [29]. However, edges in undirected graphs have no orientation, and the relations among the node pairs are reciprocal. The set of nodes that are connected to node $v_x \in V$ are known as the "neighbors" of $v_x$, denoted as $\Gamma(v_x) \subseteq V$, and the number of edges that are connected to the node $v_x$ is denoted as $|\Gamma(v_x)|$.
Like other machine learning methods, link prediction algorithms require training and test sets in order to compare model performance. However, one cannot know the future links of a graph at time $t'$ given the current graph structure. Therefore, a fraction of the links in the current graph structure is deleted (Figure 1) and taken as the test set, whereas the remaining fraction of edges in the graph is used for training. A reliable link prediction approach should assign higher probabilities to the edges that belong to the set of true positives than to the set of nonexistent edges [30]. By treating the link prediction task as a binary classification problem, the conventional evaluation metrics of binary classification in machine learning can be applied in order to evaluate the performance of link prediction. Within the context of the confusion matrix, the TP (True Positive), FP (False Positive), TN (True Negative), and FN (False Negative) counts can be used to assess performance. In this context, sensitivity, specificity, precision, and accuracy are computed as follows [31]:
$$\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad \text{Specificity} = \frac{TN}{TN + FP}$$
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
The most common standard metric used to quantify the performance of link prediction algorithms is "the area under the receiver operating characteristic curve (AUC)" [32]. The AUC value represents the probability that a randomly selected missing link between two nodes is given a higher similarity score than a randomly selected pair of unconnected nodes. The algorithmic calculation of AUC is given by:
$$AUC = \frac{n' + 0.5\, n''}{n}$$
where $n$ is the total number of independent comparisons, $n'$ is the number of comparisons in which the missing link has a higher score than the unconnected pair, and $n''$ is the number of comparisons in which they have equal scores.
One of the applications of link prediction is in recommender systems [5,33], which exploit information on users' social interactions in order to find the information that users want according to their interests and preferences. Therefore, within this context, the following evaluation metrics are also used [34][35][36]:
$$\text{Precision at } n = \frac{r}{n}, \qquad \text{Recall at } n = \frac{r}{R}$$
$$\text{MRR} = \frac{1}{Q} \sum_{i=1}^{Q} \frac{1}{rank_i}, \qquad \text{AP} = \sum_{k} P(k)\, \Delta r(k), \qquad \text{MAP} = \frac{1}{Q} \sum_{q=1}^{Q} \text{AP}(q)$$
In the above equations, Precision at $n$ is the fraction of relevant results ($r$) among the top $n$ results or recommendations. In Recall at $n$, $R$ is the total number of relevant results. In order to calculate the Mean Reciprocal Rank (MRR), first, the inverse of the rank of the first correct recommendation is calculated ($\frac{1}{rank_i}$) and, then, an average over the total number of queries ($Q$) is taken. In order to calculate the Average Precision (AP), the precision at a threshold of $k$ in the list, $P(k)$, is multiplied by the change in recall from items $k - 1$ to $k$, $\Delta r(k)$, and this product is summed over all of the positions in the ranked sequence of documents. The Mean Average Precision (MAP) is then the average of all Average Precisions over the total number of queries ($Q$).
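The sampling-based AUC estimate and the ranking metrics described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the list-based inputs, and the default of 10,000 comparisons are assumptions made for the example.

```python
import random

def auc_by_sampling(missing_scores, nonexistent_scores, n=10000, seed=0):
    """Estimate AUC = (n' + 0.5 * n'') / n from n independent comparisons."""
    rng = random.Random(seed)
    higher = equal = 0
    for _ in range(n):
        s_missing = rng.choice(missing_scores)          # score of a held-out (test) link
        s_nonexistent = rng.choice(nonexistent_scores)  # score of an unconnected pair
        if s_missing > s_nonexistent:
            higher += 1   # contributes to n'
        elif s_missing == s_nonexistent:
            equal += 1    # contributes to n''
    return (higher + 0.5 * equal) / n

def precision_at_n(ranked, relevant, n):
    """Fraction of the top-n recommendations that are relevant (r / n)."""
    r = sum(1 for item in ranked[:n] if item in relevant)
    return r / n

def recall_at_n(ranked, relevant, n):
    """Fraction of all relevant items found in the top n (r / R)."""
    r = sum(1 for item in ranked[:n] if item in relevant)
    return r / len(relevant)

def mean_reciprocal_rank(ranked_lists, relevant_sets):
    """Average of 1 / rank of the first correct recommendation per query."""
    total = 0.0
    for ranked, relevant in zip(ranked_lists, relevant_sets):
        for rank, item in enumerate(ranked, start=1):
            if item in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)
```

An AUC of 0.5 corresponds to scores that are no better than chance, which is what the estimator returns when every comparison is a tie.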
In order to provide a few visualization examples for complex networks, Figure 2 shows the network structure of two different hashtag co-occurrence graphs (#askmeanything and #lovemylife) of Instagram posts from 04/01/2020 to 04/08/2020. These two figures clearly demonstrate the variability of the network structure, even within the same field: Figure 2a shows different sub-communities with a sparser structure, while Figure 2b represents a densely connected network.

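Figure 2. Hashtag co-occurrence graphs of Instagram posts (#askmeanything and #lovemylife): (a) a sparser network with distinct sub-communities; (b) a densely connected network.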

Similarity Based Methods
Similarity-based methods, which mainly focus on the topological structure of the graph, are the oldest and most straightforward link prediction metrics. These methods try to identify missing links by assigning a similarity score, $s(v_x, v_y)$, to node pairs ($v_x$ and $v_y$) using the structural properties of the graph. These methods can be divided into three main categories: local, quasi-local, and global approaches.

Local Similarity-Based Approaches
Local similarity-based approaches are based on the assumption that, if node pairs have common neighbor structures, they will probably form a link in the future. Because they only use local topological information based on neighborhood-related structures rather than considering the whole network topology, they are faster than the global similarity-based approaches. Many studies have also shown their superior performance, especially on dynamic networks [37]. However, they cannot compute the similarity of all possible combinations of node pairs, since they only rank the similarity between close nodes that are at a distance of at most two.

Common Neighbors (CN)
CN is one of the most widely used metrics for link prediction tasks, due to its high efficiency despite its simplicity. The idea behind CN is very intuitive: the probability that two nodes will be linked in the future is affected by the number of their common neighboring nodes, i.e., two nodes are more likely to establish a link if they have more shared neighbors. The score of this metric can be defined as follows:
$$s^{CN}(v_x, v_y) = |\Gamma(v_x) \cap \Gamma(v_y)|$$
where $\Gamma(\cdot)$ represents the set of adjacent nodes.
It should be noted that the resulting CN score is not normalized and only shows the relative similarity of different node pairs by considering the shared nodes between them. Newman used CN in order to show that the probability of future collaboration between two scientists can be estimated from their previous common collaborators [38].
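As a minimal sketch, CN can be computed directly from neighbor sets. The toy graph, the dictionary-of-sets representation, and the function names below are assumptions chosen for illustration.

```python
from itertools import combinations

def common_neighbors(adj, x, y):
    """CN score: number of nodes adjacent to both x and y."""
    return len(adj[x] & adj[y])

def rank_candidate_links(adj):
    """Score every unconnected node pair and sort by decreasing CN."""
    scores = {}
    for x, y in combinations(sorted(adj), 2):
        if y not in adj[x]:  # only pairs that are not already linked
            scores[(x, y)] = common_neighbors(adj, x, y)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy undirected graph with edges a-c, a-d, b-c, b-d, b-e.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b"},
    "e": {"b"},
}

for pair, score in rank_candidate_links(adj):
    print(pair, score)
```

Because the score is an unnormalized count, it is meaningful only for ranking pairs within the same graph.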

Jaccard Index (JC)
The metric not only takes the number of common nodes into account, as in CN, but also normalizes it by the total number of shared and non-shared neighbors. The equation of this score, which was proposed by Jaccard [39], is:
$$s^{JC}(v_x, v_y) = \frac{|\Gamma(v_x) \cap \Gamma(v_y)|}{|\Gamma(v_x) \cup \Gamma(v_y)|}$$

Salton Index (SL)
SL is the metric that is also known as cosine similarity. It calculates the cosine of the angle between the two columns of the adjacency matrix, and it is defined as the ratio of the number of shared neighbors of $v_x$ and $v_y$ to the square root of the product of their degrees [40], as follows:
$$s^{SL}(v_x, v_y) = \frac{|\Gamma(v_x) \cap \Gamma(v_y)|}{\sqrt{|\Gamma(v_x)| \cdot |\Gamma(v_y)|}}$$
Wagner & Leydesdorff [41] showed that SL is an efficient metric, especially when the aim is to visualize the constructional pattern of relations in a graph.

Sørensen Index (SI)
The index, which is very similar to JC, was developed to compare different ecological samples [42], such that:
$$s^{SI}(v_x, v_y) = \frac{2\,|\Gamma(v_x) \cap \Gamma(v_y)|}{|\Gamma(v_x)| + |\Gamma(v_y)|}$$
Using the sum of the degrees instead of the size of the union of the neighbors makes SI less sensitive to outliers than JC [43].
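The three normalized variants of CN discussed so far differ only in their denominators, which the following sketch makes explicit. It assumes the standard definitions (intersection over union for JC, intersection over the square root of the degree product for the Salton/cosine index, and twice the intersection over the degree sum for SI); the helper names and the toy graph are illustrative.

```python
import math

def neighbor_sets(edges):
    """Build a dict of neighbor sets from an undirected edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def jaccard(adj, x, y):
    union = adj[x] | adj[y]
    return len(adj[x] & adj[y]) / len(union) if union else 0.0

def salton(adj, x, y):
    denom = math.sqrt(len(adj[x]) * len(adj[y]))
    return len(adj[x] & adj[y]) / denom if denom else 0.0

def sorensen(adj, x, y):
    denom = len(adj[x]) + len(adj[y])
    return 2 * len(adj[x] & adj[y]) / denom if denom else 0.0

edges = [("a", "c"), ("a", "d"), ("b", "c"), ("b", "d"), ("b", "e")]
adj = neighbor_sets(edges)
for name, index in [("JC", jaccard), ("SL", salton), ("SI", sorensen)]:
    print(name, round(index(adj, "a", "b"), 4))
```

For the pair (a, b), which shares two neighbors and has degrees 2 and 3, the three scores are 2/3, 2/√6, and 4/5.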

Preferential Attachment Index (PA)
Motivated by the finding of Barabási & Albert [44] that new nodes joining the network are more likely to connect to highly connected nodes (hubs) than to nodes with lower degrees, PA can be formulated as:
$$s^{PA}(v_x, v_y) = |\Gamma(v_x)| \cdot |\Gamma(v_y)|$$

Adamic-Adar Index (AA)
The metric was introduced by Lada Adamic and Eytan Adar [45] in order to compare two web pages. It gives more weight to common neighbors that themselves have relatively few neighbors, such that:
$$s^{AA}(v_x, v_y) = \sum_{v_z \in \Gamma(v_x) \cap \Gamma(v_y)} \frac{1}{\log |\Gamma(v_z)|}$$
where $v_z$ refers to a common neighbor of nodes $v_x$ and $v_y$ (connected/linked to both).
Although this metric is similar to CN, the vital difference is that the logarithm term penalizes shared neighbors of the two corresponding nodes that have high degrees. It should be noted that, while the metrics introduced so far include only the two nodes ($v_x$ and $v_y$) and/or their degrees in their equations, AA also involves the degrees of their common neighbors ($v_z$).

Resource Allocation Index (RA)
Motivated by the physical process of resource allocation, a metric that is very similar to AA was developed by Zhou et al. [46], which can be formulated as:
$$s^{RA}(v_x, v_y) = \sum_{v_z \in \Gamma(v_x) \cap \Gamma(v_y)} \frac{1}{|\Gamma(v_z)|}$$
Using the degree itself ($|\Gamma(v_z)|$) in the denominator of RA, rather than its logarithm ($\log |\Gamma(v_z)|$) as in AA, penalizes the contribution of high-degree common neighbors more heavily. Many studies show that this discrepancy is insignificant, and that the resulting performances of these two metrics are very similar, when the average degree of the network is low; however, RA is superior when the average degree is high [47].
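The difference between AA and RA can be seen by computing both on the same pair. The sketch below assumes the natural logarithm for AA and a small illustrative graph; the function names are not taken from the cited works.

```python
import math

def adamic_adar(adj, x, y):
    """Sum of 1 / log(degree) over the common neighbors of x and y."""
    # A common neighbor is linked to both x and y, so its degree is at
    # least 2 and the logarithm is never zero.
    return sum(1.0 / math.log(len(adj[z])) for z in adj[x] & adj[y])

def resource_allocation(adj, x, y):
    """Sum of 1 / degree over the common neighbors of x and y."""
    return sum(1.0 / len(adj[z]) for z in adj[x] & adj[y])

# Toy undirected graph with edges a-c, a-d, b-c, b-d, b-e, d-e.
adj = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d"},
}

print("AA:", adamic_adar(adj, "a", "b"))
print("RA:", resource_allocation(adj, "a", "b"))
```

Here the common neighbors c and d have degrees 2 and 3, so d contributes 1/ln 3 to AA but only 1/3 to RA, which illustrates the stronger penalty on higher-degree neighbors.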

Hub Promoted Index (HP)
The index was proposed for assessing the similarity of the substrates in metabolic networks [48], and it can be defined as follows:
$$s^{HP}(v_x, v_y) = \frac{|\Gamma(v_x) \cap \Gamma(v_y)|}{\min(|\Gamma(v_x)|, |\Gamma(v_y)|)}$$
HP is determined by the ratio of the number of common neighbors of both $v_x$ and $v_y$ to the minimum of the degrees of $v_x$ and $v_y$. Here, link formation between lower degree nodes and hubs is promoted, while the formation of connections between hub nodes is demoted [6].

Hub Depressed Index (HD)
The opposite analogy of HP is also considered by Lü and Zhou [28], and it is determined by the ratio of the number of common neighbors of both $v_x$ and $v_y$ to the maximum of the degrees of $v_x$ and $v_y$. Here, link formation between lower degree nodes and link formation between hubs are promoted, whereas connections between hub nodes and lower degree nodes are demoted, such that:
$$s^{HD}(v_x, v_y) = \frac{|\Gamma(v_x) \cap \Gamma(v_y)|}{\max(|\Gamma(v_x)|, |\Gamma(v_y)|)}$$

Leicht-Holme-Newman Index (LHN)
The index, which is very similar to SI, is defined as the ratio of the number of shared neighbors of $v_x$ and $v_y$ to the product of their degrees (the expected value of the number of paths of length two between them) [49]. It can be represented by:
$$s^{LHN}(v_x, v_y) = \frac{|\Gamma(v_x) \cap \Gamma(v_y)|}{|\Gamma(v_x)| \cdot |\Gamma(v_y)|}$$
The only difference in the denominator as compared to SI shows that SI always assigns a higher score than LHN, i.e., |Γ(v

Parameter Dependent Index (PD)
Zhou et al. [50] proposed a new metric in order to improve the prediction accuracy for popular links and unpopular links. PD can be defined as:
$$s^{PD}(v_x, v_y) = \frac{|\Gamma(v_x) \cap \Gamma(v_y)|}{(|\Gamma(v_x)| \cdot |\Gamma(v_y)|)^{\beta}}$$
where $\beta$ is a free parameter that can be tuned to the topology of the graph. One can easily recognize that PD reduces to CN, SL, and LHN when $\beta = 0$, $\beta = 0.5$, and $\beta = 1$, respectively.

Local Affinity Structure Index (LAS)
LAS shows the affinity relationship between a pair of nodes and their common neighbors. The hypothesis is that a higher affinity between two nodes and their common neighbors increases the probability of their getting connected [51], such as:

CAR-Based Index (CAR)
When a node interacts with a neighbor node, the latter belongs to its first-level neighborhood, whereas the interaction between the first-level neighbor node and its own neighbor node forms the second-level neighborhood of the seed node. According to the local community paradigm (LCP) of Cannistraci [52], researchers mostly consider the first-level neighborhood because the second-level neighborhood is noisy; however, the second-level neighborhood carries essential information regarding the topology of the network. Therefore, CAR filters this noise and mostly considers nodes that are interlinked with neighbors. The similarity metric can be calculated as follows:

The Individual Attraction Index (IA)
Dong et al. [53] proposed an index that relates not only to the common neighbors of the nodes individually, but also to the effect of the sub-network created by them. The IA score can be formulated as: Because IA considers the existence of links between all common neighbors, the algorithm is very time-consuming. Therefore, a simpler alternative is also proposed as:

The Mutual Information Index (MI)
This method examines the link prediction problem using information theory, and it measures the conditional self-information of a link when the common neighbors of the node pair are known [54]. It is formulated as: where $v_z \in \Gamma(v_x) \cap \Gamma(v_y)$ and $I(\cdot)$ is the self-information function for a node, which can be calculated by (20). Here, $I(e_{v_x,v_y} | v_z)$ denotes the conditional self-information of the existence of a link between $v_x$ and $v_y$ given their shared set of neighbors. A smaller value of $s^{MI}(v_x, v_y)$ means a higher likelihood of being linked. If all of the links between common neighbors are independent of each other, the self-information of that node pair can be calculated as [6]:

Functional Similarity Weight (FSW)
This index was first used by Chou et al. in order to understand the similarity of the physical or biochemical characteristics of proteins [55]. Their motivation is based on the Czekanowski-Dice distance, which is used in [56] in order to estimate the functional similarity of proteins. This score can be defined as: Here, $\beta$ is used to penalize the nodes with very few common neighbors, and it is defined as: where $|\Gamma_{avg}|$ is the average number of neighbors in the network.

Local Neighbors Link Index (LNL)
Motivated by the cohesion between common neighbors and predicted nodes, both attributes and topological features are examined in [57], as: where $w(v_z)$ is the weight function, which can be measured by: Here, $\delta(a, b)$ is a boolean variable that is equal to 1 if there exists a link between $a$ and $b$; otherwise, it equals 0.

Global Similarity-Based Approaches
Global similarity-based approaches, contrary to local ones, use the whole topology of the network to rank the similarity between node pairs; therefore, they are not restricted to nearby nodes and can also measure the similarity between nodes that are located far away from each other. Although considering the whole topology of the network gives more flexibility in link prediction analysis, it also increases the algorithm's time complexity. Because an ensemble of all paths between node pairs is used, they can also be called path-based methods.

Katz Index (KI)
The metric, which was defined by Katz [58], sums over the collection of all paths between two nodes, exponentially damped by length so that shorter paths count more heavily. This index can be formulated in matrix form:
$$S^{KI} = \sum_{l=1}^{\infty} \beta^l A^l$$
Here, $A$ is the adjacency matrix and $\beta$ is a free parameter ($\beta > 0$) that is also called a "damping factor". When $\beta$ is low enough, the paths with greater lengths contribute less, and the similarity index is simply determined by the shorter paths [28].
In the case of $\beta < 1/\lambda_1^A$, where $\lambda_1^A$ is the largest eigenvalue of the adjacency matrix, the similarity matrix can be written as follows:
$$S^{KI} = (I - \beta A)^{-1} - I$$
where $I$ is the identity matrix.
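The equivalence between the damped path sum and the closed form can be checked numerically. The following is a minimal NumPy sketch, assuming an undirected (symmetric) adjacency matrix; the function names and the path-graph example are illustrative.

```python
import numpy as np

def katz_closed_form(A, beta):
    """Katz similarity matrix (I - beta * A)^-1 - I for a symmetric A."""
    lam_max = np.max(np.abs(np.linalg.eigvalsh(A)))  # largest eigenvalue magnitude
    if beta >= 1.0 / lam_max:
        raise ValueError("beta must be smaller than 1 / largest eigenvalue")
    identity = np.eye(A.shape[0])
    return np.linalg.inv(identity - beta * A) - identity

def katz_truncated(A, beta, max_len):
    """Truncated series: sum over l = 1..max_len of beta^l * A^l."""
    S = np.zeros_like(A, dtype=float)
    power = np.eye(A.shape[0])
    for l in range(1, max_len + 1):
        power = power @ A          # A^l counts walks of length l
        S += beta ** l * power
    return S

# Path graph 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

S = katz_closed_form(A, beta=0.1)
print(np.round(S, 5))
```

With a small damping factor, the score decays quickly with distance, so node 0 is most similar to node 1 and least similar to node 3.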

Global Leicht-Holme-Newman Index (GLHN)
The idea behind GLHN is very similar to that of KI, since it also assigns a high similarity to two nodes if the number of paths between them is high [49], such that: where $\beta_1$ and $\beta_2$ are free parameters, and a smaller value of $\beta_2$ gives higher importance to the shorter paths, and vice versa.

SimRank (SR)
This index computes the similarity starting from the hypothesis that "two objects are similar if they are related to similar objects", and it is defined recursively [59]. SR is equal to 1 when $v_x = v_y$; otherwise:
$$s^{SR}(v_x, v_y) = \gamma \cdot \frac{\sum_{v_{z_1} \in \Gamma(v_x)} \sum_{v_{z_2} \in \Gamma(v_y)} s^{SR}(v_{z_1}, v_{z_2})}{|\Gamma(v_x)| \cdot |\Gamma(v_y)|}$$
where $\gamma \in [0, 1]$ is called the decay factor, and it controls how fast the effect of neighbor node pairs ($v_{z_1}$ and $v_{z_2}$) diminishes as they move away from the original node pair ($v_x$, $v_y$). SR can be explained in terms of a random walk process; that is, $s^{SR}(v_x, v_y)$ measures how soon two random walkers, starting from nodes $v_x$ and $v_y$, are expected to meet at the same node. Its applicability to large networks is constrained by its computational complexity [47,60].
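The recursive definition is usually evaluated by fixed-point iteration. The sketch below is a naive version that assumes an undirected graph stored as neighbor sets, a fixed number of iterations instead of a convergence test, and a score of 0 for pairs involving an isolated node; these choices and the function name are assumptions for illustration.

```python
def simrank(adj, gamma=0.8, iterations=10):
    """Naive SimRank iteration over all node pairs of an undirected graph."""
    nodes = list(adj)
    # Start from the identity: each node is only similar to itself.
    sim = {(u, v): 1.0 if u == v else 0.0 for u in nodes for v in nodes}
    for _ in range(iterations):
        new_sim = {}
        for u in nodes:
            for v in nodes:
                if u == v:
                    new_sim[(u, v)] = 1.0
                elif not adj[u] or not adj[v]:
                    new_sim[(u, v)] = 0.0
                else:
                    # Average similarity of all neighbor pairs, damped by gamma.
                    total = sum(sim[(a, b)] for a in adj[u] for b in adj[v])
                    new_sim[(u, v)] = gamma * total / (len(adj[u]) * len(adj[v]))
        sim = new_sim
    return sim

# Star graph: leaves a and b are both attached to the center c.
adj = {"c": {"a", "b"}, "a": {"c"}, "b": {"c"}}
scores = simrank(adj, gamma=0.8, iterations=5)
print(scores[("a", "b")])
```

Each iteration touches every node pair and every pair of their neighbors, which is the source of the computational cost mentioned above.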

Pseudo-Inverse of the Laplacian Matrix (PLM)
Using the Laplacian matrix $L = D - A$ rather than the adjacency matrix $A$ gives an alternative representation of a graph, where $D$ is the diagonal matrix of vertex degrees [61] ($D_{i,j} = 0$ for $i \neq j$ and $D_{i,i} = \sum_j A_{i,j}$). The Moore-Penrose pseudo-inverse of the Laplacian matrix, represented by $L^+$, can be used in the calculation of proximity measures [62]. Because PLM is calculated as an inner product cosine similarity, it is also called "cosine similarity time" in the literature [47], and it can be calculated as:
$$s^{PLM}(v_x, v_y) = \frac{L^+_{v_x,v_y}}{\sqrt{L^+_{v_x,v_x} \, L^+_{v_y,v_y}}}$$

Hitting Time (HT) and Average Commute Time (ACT)
Motivated by the random walk, as introduced by the mathematician Karl Pearson [63], HT is defined as the average number of steps taken by a random walker starting from $v_x$ to reach node $v_y$. Because HT is not a symmetric metric, one may consider using ACT, which is defined as the average number of steps taken by the random walker starting from $v_x$ to reach node $v_y$ plus that from $v_y$ to reach node $v_x$. HT can be computed by:
$$s^{HT}(v_x, v_y) = 1 + \sum_{v_z \in \Gamma(v_x)} P_{v_x,v_z} \, s^{HT}(v_z, v_y)$$
Here, $P = D^{-1}A$, where $A$ and $D$ are the adjacency matrix and the diagonal matrix of vertex degrees, respectively [47]. Accordingly, ACT can be formulated as:
$$s^{ACT}(v_x, v_y) = s^{HT}(v_x, v_y) + s^{HT}(v_y, v_x)$$
For the sake of computational simplicity, ACT can be computed in a closed form using the pseudo-inverse of the Laplacian matrix of the graph, as follows [62]:
$$s^{ACT}(v_x, v_y) = V_G \left( L^+_{v_x,v_x} + L^+_{v_y,v_y} - 2 L^+_{v_x,v_y} \right)$$
where $V_G = \sum_i D_{i,i}$ is the volume of the graph. One challenge of HT and ACT is that they give very small values when the terminal node has a high stationary probability $\pi_{v_y}$, regardless of the identity of the starting node. This problem can be solved by normalizing the scores as $-s^{HT}(v_x, v_y) \cdot \pi_{v_y}$ and $-(s^{HT}(v_x, v_y) \cdot \pi_{v_y} + s^{HT}(v_y, v_x) \cdot \pi_{v_x})$, respectively [37].
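The relation between hitting time, commute time, and the pseudo-inverse of the Laplacian can be verified on a small example. The sketch below assumes a connected undirected graph and uses the standard identity in which the commute time equals the graph volume (the sum of all degrees) multiplied by $L^+_{xx} + L^+_{yy} - 2L^+_{xy}$; the function names are illustrative.

```python
import numpy as np

def hitting_times_to(A, target):
    """Expected number of steps for a random walk to first reach `target`
    from every node, obtained by solving h = 1 + P h with h[target] = 0."""
    n = A.shape[0]
    P = A / A.sum(axis=1, keepdims=True)      # transition matrix D^-1 A
    others = [i for i in range(n) if i != target]
    P_sub = P[np.ix_(others, others)]
    h_sub = np.linalg.solve(np.eye(n - 1) - P_sub, np.ones(n - 1))
    h = np.zeros(n)
    h[others] = h_sub
    return h

def commute_time_pinv(A, x, y):
    """Commute time from the Moore-Penrose pseudo-inverse of the Laplacian."""
    laplacian = np.diag(A.sum(axis=1)) - A
    L_plus = np.linalg.pinv(laplacian)
    volume = A.sum()                           # sum of degrees = 2 * number of edges
    return volume * (L_plus[x, x] + L_plus[y, y] - 2 * L_plus[x, y])

# Path graph 0 - 1 - 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

ht_0_to_1 = hitting_times_to(A, 1)[0]   # the walker at 0 must step to 1
ht_1_to_0 = hitting_times_to(A, 0)[1]   # the walker at 1 may wander to 2 first
print(ht_0_to_1, ht_1_to_0, commute_time_pinv(A, 0, 1))
```

The two hitting times differ (1 step versus 3 steps), which shows the asymmetry of HT, while their sum matches the closed-form commute time.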

Rooted PageRank (RPR)
PageRank (PR) is the metric that is used by Google Search in order to determine the relative importance of webpages by treating links as votes. Motivated by PR, RPR defines the rank of a node as proportional to the likelihood that it can be reached through a random walk [47], such that: Here, $P = D^{-1}A$, where $A$ is the adjacency matrix and $D$ is the diagonal matrix of vertex degrees. It should be noted that one can calculate the PR by averaging the columns of RPR [7].

Escape Probability (EP)
The metric, which can be derived from RPR, measures the likelihood that a random walk starting from node $v_x$ visits node $v_y$ before coming back to node $v_x$ [64]; the equation of EP can be written as follows [7]:

Random Walk with Restart (RWR)
In a random walk (RW) algorithm, the probability vector of reaching a node starting from the node $v_x$ can be defined as: where $M$ is called the transition probability matrix, with entries $M_{i,j} = A_{i,j} / \sum_k A_{i,k}$, where $A$ is the adjacency matrix [65]. Because RW does not yield a symmetric matrix, the RWR metric, which is very similar to RPR, looks for the probability that a random walker starting from node $v_x$ visits node $v_y$ and comes back to the initial node $v_x$ at the steady state, such that:

Maximal Entropy Random Walk (MERW)
The basic MERW algorithm, which is based on the maximum uncertainty principle, was proposed out of the need to define a uniform path distribution in Monte Carlo simulations [66]. However, its applications to stochastic models are very recent [67]. Li et al. [68] proposed MERW, which maximizes the entropy of a random walk, as follows: Here, $p(A^t_{v_x v_y})$ is the product of the iterative transition matrix entries ($M_{v_x v_z} \cdot M_{v_z v_q} \cdots M_{v_q v_y}$), where $M_{ij}$ can be calculated as follows: where $A$ is the adjacency matrix and $\psi$ is the normalized eigenvector with normalization constant $\lambda$ [6].

The Blondel Index (BI)
The index was proposed by Blondel et al. [69] in order to measure similarity for the automatic extraction of synonyms in a monolingual dictionary. Although BI is used to quantify the similarity between two different graphs, Martinez et al. show that the similarity of two vertices in a single graph can also be evaluated in an iterative manner, as:
$$S(t) = \frac{A\, S(t-1)\, A^T + A^T S(t-1)\, A}{\| A\, S(t-1)\, A^T + A^T S(t-1)\, A \|_F}$$
where $S(t)$ refers to the similarity matrix in iteration $t$ and $S(0) = I$. $\|M\|_F$ is the Frobenius matrix norm, and it can be calculated as follows:
$$\|M\|_F = \sqrt{\sum_{i} \sum_{j} M_{i,j}^2}$$
The similarity metric is obtained when $S(t)$ has converged, such that $s^{BI}(v_x, v_y) = S_{v_x,v_y}(t = c)$, where $t = c$ denotes the steady state level.

Quasi-Local Similarity-Based Approaches
Quasi-local similarity-based methods for link prediction emerged from the trade-off between the richer information that global approaches obtain from the whole network topology and the lower time complexity of local methods. Like local approaches, they are limited in the calculation of the similarity between arbitrary node pairs. However, quasi-local similarity methods make it possible to compute the similarity between a node and the neighbors of its neighbors. Although some of the quasi-local similarity-based methods consider the whole topology of the network, their time complexity is lower than that of global similarity-based approaches.

The Local Path Index (LPI)
The index, which is very similar to the well-known approaches KI and CN, considers local paths from a wider perspective by employing not only the information of the nearest neighbors, but also that of the neighbors within two and three steps [46,70], such that:
$$S^{LP} = A^2 + \beta A^3$$
where $A$ is the adjacency matrix and $\beta$ is a free parameter that adjusts the relative importance of the neighbors within distance $l = 2$ and within distance $l = 3$. The metric can also be extended to higher orders as:
$$S^{LP(n)} = A^2 + \beta A^3 + \beta^2 A^4 + \dots + \beta^{n-2} A^n$$
Neighbors within a distance of three are preferred because of the increasing complexity of the higher orders of LP. One can easily see that this similarity matrix reduces to CN when only $l = 2$ is considered, and that it may produce a result very similar to KI for low $\beta$ values, without the matrix inversion. The similarity between two nodes can be evaluated via $s^{LP}(v_x, v_y) = S^{LP}_{v_x,v_y}$.
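A minimal NumPy sketch of the local path matrix follows, assuming the form $A^2 + \beta A^3$ and its higher-order extension with weights $\beta^{l-2}$; the function name and the `max_len` parameter are illustrative.

```python
import numpy as np

def local_path_matrix(A, beta, max_len=3):
    """Sum of beta^(l-2) * A^l for path lengths l = 2..max_len."""
    S = np.zeros_like(A, dtype=float)
    power = A @ A                      # A^2: common-neighbor counts
    for l in range(2, max_len + 1):
        S += beta ** (l - 2) * power
        power = power @ A              # next power, A^(l+1)
    return S

# Path graph 0 - 1 - 2 - 3.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)

S_lp = local_path_matrix(A, beta=0.01)
print(S_lp[0, 2], S_lp[0, 3])
```

Nodes 0 and 3 share no neighbor, so CN scores them 0, whereas the length-three term gives the pair a small non-zero score.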

Local (LRW) and Superposed Random Walks (SRW)
Although the random walk-based algorithms perform well, sparsity and computational complexity on massive networks are challenging for these algorithms. Thus, Liu and Lü proposed the LRW metric [71], in which the initial resources for the random walker are assigned based on the importance of the nodes in the graph. LRW considers the node degree as an important feature, and it does not concentrate on the stationary state. Instead, the number of iterations is fixed in order to perform a few-step random walk. LRW can be formulated as:
$$s^{LRW}_t(v_x, v_y) = q_{v_x} \pi_{v_x,v_y}(t) + q_{v_y} \pi_{v_y,v_x}(t)$$
where $q_{v_x}$ is the initial resource of node $v_x$, which is determined by its degree, and $\pi_{v_x,v_y}(t)$ is the probability that a walker starting from $v_x$ is located at $v_y$ after $t$ steps. Because superposing all of the random walkers starting from the same nodes may help to prevent the sensitive dependency of LRW on the farther neighboring nodes, SRW is proposed as:
$$s^{SRW}_t(v_x, v_y) = \sum_{\tau=1}^{t} s^{LRW}_{\tau}(v_x, v_y)$$
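A few-step local random walk and its superposed variant can be sketched as follows. The sketch assumes that the initial resource of a node is its degree divided by the sum of all degrees, and that the score of a pair combines the walk from each endpoint to the other; both choices, and the function name, are assumptions made for this example.

```python
import numpy as np

def lrw_and_srw(A, steps):
    """Return the LRW score matrix after `steps` steps and the SRW matrix
    obtained by summing the LRW scores of steps 1..steps."""
    degrees = A.sum(axis=1)
    P = A / degrees[:, None]          # row-stochastic transition matrix
    q = degrees / degrees.sum()       # assumed initial resource per node
    pi = np.eye(A.shape[0])           # row x: distribution of the walker from x
    srw = np.zeros_like(A, dtype=float)
    lrw = srw
    for _ in range(steps):
        pi = pi @ P                   # advance every walker by one step
        weighted = q[:, None] * pi    # q_x * pi_xy(t)
        lrw = weighted + weighted.T   # add q_y * pi_yx(t)
        srw = srw + lrw
    return lrw, srw

# Path graph 0 - 1 - 2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

lrw2, srw2 = lrw_and_srw(A, steps=2)
print(lrw2[0, 1], srw2[0, 1], srw2[0, 2])
```

On this bipartite path, the two-step LRW score of the adjacent pair (0, 1) is zero because the walker cannot be at an odd distance after an even number of steps; the superposed score keeps the contribution of the first step.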

Third-Order Resource Allocation Based on Common Neighbor Interactions (RACN)
Motivated by the RA index, Zhang et al. [72] proposed RACN, in which the resources of nodes are allocated to the neighbors as: where $v_i \in \Gamma(v_x)$ and $v_j \in \Gamma(v_y)$. The superiority of RACN over the original RA has been shown in [25] using varying datasets.

FriendLink Index (FL)
The similarity of two nodes is determined according to the normalized counts of the existing paths of varying length $L$ among the corresponding nodes. The formulation for the FL index is as follows: where $|V|$ is the number of vertices in the graph. The metric is favorable due to its high performance and speed [73].

PropFlow Predictor Index (PFP)
PFP is a metric that is inspired by Rooted PageRank, and it equals the probability that a restricted random walk starting from node $v_x$ terminates at node $v_y$ in no more than $l$ steps [74]. This restricted random walk selects the links based on weights, denoted as $\omega$ [47], such that: The main advantage of PFP is its applicability to directed, undirected, weighted, unweighted, sparse, or dense networks.

Probabilistic Methods
Probabilistic models are supervised models that use Bayes' rule. The most important drawback of some of these models is that they are slow and costly for large networks [24]. In the following, we introduce the five most important probabilistic methods of link prediction.

Hierarchical Structure Model
This model was developed based on the observation that many real networks present a hierarchical topology [75]. This maximum likelihood-based method searches for a set of hierarchical representations of the network and then sorts the probable node pairs by averaging over all of the hierarchical representations explored. The model was first proposed in [76], which develops a hierarchical network model that can be represented by a dendrogram with $|N|$ leaves and $|N| - 1$ internal nodes. Each leaf is a node from the original network and each internal node represents the relationship of the descendant nodes in the dendrogram. A value $p_r$ is also attributed to each internal node $r$, which represents the probability with which a link exists between the branches descending from it. If $D$ is a dendrogram that represents the network, the likelihood of the dendrogram with a set of internal node probabilities $\{p_r\}$ is:

$$\mathcal{L}(D, \{p_r\}) = \prod_{r \in D} p_r^{E_r} (1 - p_r)^{L_r R_r - E_r} \tag{48}$$

In the above equation, $E_r$ is the number of links that connect nodes that have node $r$ as their lowest common ancestor in $D$, and $L_r$ and $R_r$ represent the number of leaves in the left and right subtrees that are rooted at $r$, respectively. Setting $p_r^* = E_r / (L_r R_r)$ maximizes the likelihood function (48). Replacing $p_r$ with $p_r^*$ in the likelihood function, the likelihood of a dendrogram at its maximum can be calculated by:

$$\mathcal{L}(D) = \prod_{r \in D} \left[ (p_r^*)^{p_r^*} (1 - p_r^*)^{1 - p_r^*} \right]^{L_r R_r} \tag{49}$$

These equations are then utilized to perform link prediction. A Markov Chain Monte Carlo method is first used to sample a large number of dendrograms with probabilities proportional to their likelihood. Within a single dendrogram, the connection probability between two nodes $v_i$ and $v_j$ is the value $p_r^*$ at their lowest common ancestor; the final estimate is obtained by averaging this value over all of the sampled dendrograms. Subsequently, the node pairs are sorted based on the corresponding average probabilities. The higher the ranking, the more likely it is that the link between the node pair exists. A major drawback of the hierarchical structure model is its computational cost: it is very slow for a network consisting of a large set of nodes.
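As a small worked illustration of the maximum-likelihood step, the following Python sketch evaluates the logarithm of the maximized dendrogram likelihood from the per-node counts $(E_r, L_r, R_r)$. The function names and the toy dendrogram are assumptions made for illustration; the MCMC sampling of dendrograms is not shown.

```python
import math


def optimal_probability(e_r, l_r, r_r):
    """Maximum-likelihood link probability p*_r = E_r / (L_r * R_r)."""
    return e_r / (l_r * r_r)


def max_log_likelihood(internal_nodes):
    """Log of the maximized dendrogram likelihood.

    internal_nodes is an iterable of (E_r, L_r, R_r) tuples, one per
    internal node of the dendrogram (hypothetical input layout).
    """
    total = 0.0
    for e_r, l_r, r_r in internal_nodes:
        pairs = l_r * r_r
        p = optimal_probability(e_r, l_r, r_r)
        # Terms with p == 0 or p == 1 contribute 0 (convention 0 * log 0 = 0).
        if 0.0 < p < 1.0:
            total += pairs * (p * math.log(p) + (1.0 - p) * math.log(1.0 - p))
    return total


# Toy network with edges a-b, c-d, b-c and dendrogram ((a, b), (c, d)):
# two lower internal nodes with one leaf on each side, and a root whose
# subtrees hold two leaves each and are joined by the single edge b-c.
toy_dendrogram = [(1, 1, 1), (1, 1, 1), (1, 2, 2)]
toy_log_likelihood = max_log_likelihood(toy_dendrogram)
```

Here only the root contributes, with $p_r^* = 1/4$, so the log-likelihood equals $\ln(0.25) + 3\ln(0.75)$.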

Stochastic Blockmodel
Stochastic block models are based on the idea that nodes that are heavily interconnected should form a block or community [77]. In a stochastic block model, nodes are separated into groups and the probability that two nodes are connected to each other depends only on the groups to which they belong [78]. Stochastic block models have been successfully applied to model the structure of complex networks [79,80]. They have also been utilized to predict the behavior in drug interactions [81]. The work of [82] uses a block model in order to predict conflict between team members. Ref. [83] also utilizes a stochastic block model in order to develop a probabilistic recommender system.
As noted above, the probability that two nodes $i$ and $j$ are connected depends on the blocks that they belong to. A block model $M = (P, Q)$ is completely determined by the partition $P$ of nodes into groups and the matrix $Q$ of probabilities of linkage between groups. While numerous partitions (models) can be considered for a network, the likelihood of the observed network $A^O$ under a given model can be calculated by the following [78,84]:

$$\mathcal{L}(A^O \mid P, Q) = \prod_{\alpha \le \beta} Q_{\alpha\beta}^{\, l^O_{\alpha\beta}} \left(1 - Q_{\alpha\beta}\right)^{r_{\alpha\beta} - l^O_{\alpha\beta}} \tag{50}$$

In Equation (50), $l^O_{\alpha\beta}$ is the number of links in $A^O$ between nodes in groups $\alpha$ and $\beta$ of $P$, and $r_{\alpha\beta}$ is the maximum number of links possible, which is $|\alpha||\beta|$ when $\alpha \neq \beta$ and $\binom{|\alpha|}{2}$ when $\alpha = \beta$. Setting $Q^*_{\alpha\beta} = l^O_{\alpha\beta} / r_{\alpha\beta}$ maximizes the likelihood function (50). By applying Bayes' theorem, the probability (reliability) of a link with maximum likelihood can be computed.
Similar to the hierarchical structure model that is discussed in Section 4.1, a significant shortcoming of this method is that it is very time-consuming. While the Metropolis algorithm [85] can be utilized to sample partitions, this approach is still impractical for a large network. An example of blockmodel likelihood calculation is illustrated in Figure 3.
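The likelihood calculation for a single fixed partition can be sketched as follows. This is a hedged illustration only: the function name `sbm_max_log_likelihood`, the edge-list input, and the toy partition are assumptions, and the sampling over partitions with the Metropolis algorithm is not covered.

```python
import math
from collections import Counter


def sbm_max_log_likelihood(edges, group_of):
    """Maximum-likelihood block matrix and log-likelihood for one partition.

    edges is a list of undirected (u, v) pairs and group_of maps each node
    to a sortable group label (both hypothetical input layouts).
    """
    sizes = Counter(group_of.values())
    links = Counter()
    for u, v in edges:
        links[tuple(sorted((group_of[u], group_of[v])))] += 1

    labels = sorted(sizes)
    q_star = {}
    log_likelihood = 0.0
    for i, a in enumerate(labels):
        for b in labels[i:]:
            # Maximum number of links between (or within) the groups.
            if a == b:
                possible = sizes[a] * (sizes[a] - 1) // 2
            else:
                possible = sizes[a] * sizes[b]
            if possible == 0:
                continue
            observed = links[(a, b)]
            q = observed / possible
            q_star[(a, b)] = q
            if 0.0 < q < 1.0:
                log_likelihood += (observed * math.log(q)
                                   + (possible - observed) * math.log(1.0 - q))
    return q_star, log_likelihood


toy_edges = [(1, 2), (3, 4), (2, 3)]
toy_groups = {1: "A", 2: "A", 3: "B", 4: "B"}
toy_q, toy_ll = sbm_max_log_likelihood(toy_edges, toy_groups)
```

For the toy partition, both groups are fully connected internally and one of the four possible links between the groups is present, so the estimated between-group probability is 0.25.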

Network Evolution Model
Ref. [86] proposed a network topology-based model for link prediction. In this model, probabilistic flips of the existence of edges are modeled by a "copy-and-paste" process between the edges [86]. The problem of link prediction is defined as follows: the data domain is represented as a graph $G = (V, s)$, where $V$ is the set of nodes of the network and $s : V \times V \to [0, 1]$ is an edge label function. $s(v_i, v_j)$ indicates the probability that an edge exists between $v_i$ and $v_j$. $s^{(t)}$ denotes the edge label function at time $t$, and it is Markovian, i.e., $s^{(t+1)}$ only depends on $s^{(t)}$. The fundamental idea behind the proposed edge label copy-and-paste mechanism is that, if a node has a strong influence on another node, the second node's associations will be strongly affected by those of the first node. The probability of an edge existing between nodes $v_i$ and $v_j$ at time $t + 1$ is as follows:

$$s^{(t+1)}(v_i, v_j) = \frac{1}{|V| - 1} \sum_{k \neq i, j} \left[ w_{v_k v_j}\, s^{(t)}(v_k, v_i) + w_{v_k v_i}\, s^{(t)}(v_k, v_j) \right] + \left[ 1 - \frac{1}{|V| - 1} \sum_{k \neq i, j} \left( w_{v_k v_j} + w_{v_k v_i} \right) \right] s^{(t)}(v_i, v_j) \tag{51}$$

where $w_{v_k v_j}$ is the probability that an edge label is copied from node $v_k$ to node $v_j$. In Equation (51), the first term represents the probability that the edge label for $(v_i, v_j)$ is changed by copying and pasting, and the second term represents the probability that the same edge label remains unchanged. The linkages are obtained by iteratively updating Equation (51) until convergence. The parameters are set by optimizing an objective function with an expectation maximization-type transductive learning procedure.
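The iterative update can be sketched in Python as follows. This is a sketch under stated assumptions, not the procedure of [86]: the update rule coded here is one reading of the copy-and-paste model (a copy term averaged over the intermediate nodes $v_k$ with a $1/(|V|-1)$ normalization, plus a retention term), and the function names, the dense list-of-lists layout, and the uniform toy weights are illustrative. Learning the weights $w$ is not shown.

```python
def evolve_step(s, w):
    """One copy-and-paste update of the edge label function.

    s[i][j] is the current probability of an edge between nodes i and j and
    w[k][j] is the probability of copying an edge label from k to j
    (assumed layout; the weights are taken as given).
    """
    n = len(s)
    updated = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            copied = 0.0
            copy_mass = 0.0
            for k in range(n):
                if k == i or k == j:
                    continue
                copied += w[k][j] * s[k][i] + w[k][i] * s[k][j]
                copy_mass += w[k][j] + w[k][i]
            # First term: label changed by copying; second term: label kept.
            updated[i][j] = (copied / (n - 1)
                             + (1.0 - copy_mass / (n - 1)) * s[i][j])
    return updated


def run_until_stable(s, w, tol=1e-9, max_iter=1000):
    """Repeat the update until the largest change falls below tol."""
    for _ in range(max_iter):
        updated = evolve_step(s, w)
        change = max(abs(a - b)
                     for row_a, row_b in zip(updated, s)
                     for a, b in zip(row_a, row_b))
        s = updated
        if change < tol:
            break
    return s


toy_s = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
toy_w = [[0.5] * 3 for _ in range(3)]
stationary = run_until_stable(toy_s, toy_w)
```

With uniform weights on three nodes, each update is a convex combination of the current labels, so the values stay in $[0, 1]$ and the single observed edge is gradually spread evenly over all three node pairs.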

Local Probabilistic Model
The work of [87] proposed a local probabilistic model for link prediction, focusing particularly on evolving co-authorship networks. Given a candidate link, e.g., between nodes $v_i$ and $v_j$, the central neighborhood set of $v_i$ and $v_j$ is determined first, which is the set of nodes that are the most relevant to estimating the co-occurrence probability. The central neighborhood sets are chosen from the nodes that lie along the shorter paths between $v_i$ and $v_j$. Ref. [87] proposes the following algorithm in order to determine the central neighborhood set: first, all of the nodes that lie on length-2 simple paths are collected, then those on length-3 simple paths, and so on. The paths are then ordered based on their frequency scores and the ones with the highest scores are chosen [87]. A path length threshold is also considered for the sake of decreasing computational cost ([87] proposes a threshold of 4 for their specific problem). Next, a transaction dataset is built from a chronological set of events (co-authoring articles). Non-derivable itemset mining is performed on this dataset, which results in all non-redundant itemsets along with their frequencies. In the end, a Markov Random Field (MRF) graph model is trained while using the derived dataset. The resulting final model gives the probability of the existence of a link between $v_i$ and $v_j$.

Probabilistic Model of Generalized Clustering Coefficient
This method, which was proposed by [88], focuses on analyzing the predictive power of the clustering coefficient. The generalized clustering coefficient $C(k)$ of degree $k$ is defined as [88]:

$$C(k) = \frac{\text{number of cycles of length } k \text{ in the graph}}{\text{number of paths of length } k} \tag{52}$$

As explained in [88], generalized clustering coefficients describe the correlation between cycles and paths in a network. Therefore, the probability of formation of a particular link is determined by the number of cycles (of different lengths) that will be constructed by adding that link [88]. The concept of the cycle formation model is explained as follows: a cycle formation model of degree $k$ ($k \geq 1$) is governed by $k$ link generation mechanisms, $g(1), g(2), \ldots, g(k)$, which are described by $c_1, c_2, \ldots, c_k$, respectively. If $P^k_{v_i v_j}$ denotes the set of paths from $v_i$ to $v_j$ with length $k$, then $c_k = P\left((v_i, v_j) \in E \mid |P^k_{v_i v_j}| = 1\right)$ (the probability that there is a link between $v_i$ and $v_j$, given that there is one path of length $k$ between them). We know that, if there is more than one path with length $k$ from $v_i$ to $v_j$, then the probability that there is a link between them increases (see Figure 4, for instance). Therefore: Because the total link occurrence probability between $v_i$ and $v_j$ results from the combined effect of multiple mechanisms, the link probability under the cycle formation model of degree $k$ (CF($k$)) is calculated by:

Relational Models
One drawback of the previously mentioned methods is that they do not incorporate vertex and edge attributes to model the joint probability distribution of entities and the links that associate them [24]. Probabilistic Relational Models (PRMs) [21] are an attempt to use the rich logical structure of the underlying data, which is crucial for complicated problems. One major limitation of Bayesian networks is the lack of the concept of an "object" [22]. Bayesian PRMs [20,21] include the concept of an object in the context of Bayesian networks, in which each object can have its own attributes and relations exist between objects and their attributes. Figure 5 is an example of a schema for a simple domain. A relational model consists of a set of classes, $\Upsilon = \{Y_1, Y_2, \ldots, Y_n\}$. In Figure 5, $\Upsilon$ = {Journalist, Newspaper, Reader}.
Each class also contains some descriptive attributes, the set of which is denoted by $\mathcal{A}(Y)$. For example, Journalist has attributes Popularity, Experience, and Writing skills. In order for objects to be able to refer to other objects, each class is also associated with a set of reference slots, denoted by $Y.\rho$. Slot chains also exist, which are references between multiple objects (similar to $f(g(x))$). $Pa(Y.A)$ denotes the set of parents of $Y.A$. For instance, in Figure 5, a journalist's Popularity depends on her Experience and Writing skills. Dependency can also be a result of a slot chain, meaning that some attributes of a class depend on some attributes of another class. The joint probability distribution in a PRM can be calculated as follows [22]:

$$P(I \mid \sigma_r, S, \theta_S) = \prod_{Y_i} \prod_{A \in \mathcal{A}(Y_i)} \prod_{y \in \sigma_r(Y_i)} P\left(I_{y.A} \mid I_{Pa(y.A)}\right) \tag{55}$$

In Equation (55), $I$ denotes an instance of a schema $S$, which specifies, for each class $Y$, the set of objects in the class, a value for each attribute $y.A$, and a value for each reference slot $y.\rho$; $\theta_S$ denotes the parameters of the conditional distributions. Additionally, $\sigma_r$ is a relational skeleton, which denotes a partial specification of an instance of a schema, and it specifies the set of objects for each class and the relations that hold between the objects [22]. The task of link prediction can then be performed by considering the probability of the existence of a link between two objects in the relational model [23]. The work of [89] shows that deriving the distribution of missing descriptive attributes will benefit from the estimation of link existence likelihood. Besides the Relational Bayesian Network, in which the model graph is a directed acyclic graph, the Relational Markov Network has also been proposed [90,91], in which the graph model is an undirected graph, and it can be utilized for the task of link prediction. Relational Markov Networks address two shortcomings of directed models: they do not constrain the graph to be acyclic, which allows for various possible graph representations, and they are well suited for discriminative training [92].
There exist other relational models for the task of link prediction. The DAPER model is a directed acyclic type of probabilistic entity-relationship model [93]. The advantage of the DAPER model is that it is more expressive than the aforementioned models [94]. Other Bayesian relational models in the literature include the stochastic relational model [95], which models the stochastic structure of entity relationships by a tensor of multiple Gaussian processes [28], the relational dependency network [96,97], and the parametric hierarchical Bayesian relational model [98].

Learning-Based Methods
The feature extraction-based methods that are discussed earlier in this paper provide a starting point for the systematic prediction of missing or future associations through learning the effective attributes. Among these effective features for link prediction, the topological attributes that can be extracted from the graph structure are the foundation of all learning-based link prediction algorithms, among which the pair-wise shortest distance is the most common topological feature. Besides the topological attributes, some machine learning models benefit from the node and domain-specific attributes, referred to as the aggregated and proximity features, respectively [99].
The introduction of supervised learning algorithms to the problem of link prediction led to state-of-the-art models that achieve high prediction performance [100]. These models view the problem of link prediction as a classification task. In order to approach the link prediction problem, supervised models must tackle a few challenges, including the unbalanced data classes that result from the sparsity of real networks, and the extraction of the topological, proximity, and aggregated attributes as independent informative features [101]. There is extensive literature on classification models for link prediction, including the application of traditional machine learning methods to this field of research. Support Vector Machines, K-nearest Neighbors, Logistic Regression, Ensemble Learning, Random Forest, Multilayer Perceptron, Radial Basis Function network, and Naive Bayes are just a few of the supervised learning methods that are extensively used in link prediction. A comparison between a few of these supervised methods has been presented in [99], where, surprisingly, SVM with an RBF kernel is reported to be very successful in terms of accuracy and low squared error.
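To make the classification framing concrete, the following sketch turns a small graph into labeled examples with two topological features: the number of common neighbors and the shortest distance. It is an illustration under stated assumptions: the function names, the adjacency-set layout, and the choice to ignore the direct edge when measuring distance (so that the feature does not trivially reveal the label) are illustrative choices. No classifier is trained here.

```python
from collections import deque
from itertools import combinations


def shortest_distance(adj, source, target):
    """Hop distance from source to target, ignoring a direct source-target edge.

    Returns None when target cannot be reached that way.
    """
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbor in adj[node]:
            if node == source and neighbor == target:
                continue  # skip the candidate edge itself
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


def build_examples(adj):
    """Return ((common_neighbors, distance), label) for every node pair."""
    rows = []
    for u, v in combinations(sorted(adj), 2):
        label = 1 if v in adj[u] else 0
        common = len(adj[u] & adj[v])
        rows.append(((common, shortest_distance(adj, u, v)), label))
    return rows


# Path graph 0 - 1 - 2 - 3 with adjacency sets (assumed layout).
path_graph = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
examples = build_examples(path_graph)
positives = sum(label for _, label in examples)
negatives = len(examples) - positives
```

The number of candidate pairs grows quadratically with the number of nodes while the number of links in sparse real networks grows far more slowly, which is the source of the class imbalance mentioned above.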
Although the traditional machine learning models for link prediction rely on user-defined feature encoding, the evolution of these models has led to automatic feature encoders, which remove the need for hand-engineered attributes [101]. These models aim to encode graph, node, and/or domain-related features into a low-dimensional space, and are referred to as representation learning or graph embedding-based models for link prediction. These methods can be trained while using neural networks or dimensionality reduction algorithms [102]. The application of graph analysis and representation learning has led to the development of advanced language models that focus on language understanding, relation discovery, and question answering. Knowledge graphs, which represent sequences of relations between named entities within textual content, are being widely investigated for the tasks of link prediction, relation prediction, and knowledge graph completion [103]. Although many of the reviewed methods in this survey are applicable to different applications and graph types, knowledge graphs and their embedding methods depend on directed relationships. Examples of recent methods for knowledge graphs are Relational Graph Convolutional Neural Networks (R-GCN) [104], which are able to extract features from given data and, accordingly, generate a directed multigraph, label the node types and their relationships in the generated graph, and, finally, generate a latent knowledge-based representation that can be used for node classification as well as link prediction. Other language models, such as Bidirectional Encoder Representations from Transformers (BERT) [18], which use pre-trained language models, and their variations, including Knowledge Graph BERT (KG-BERT) [105] and Knowledge-enabled BERT (K-BERT) [103], can extract node and relation attributes for knowledge graph completion and link prediction [16]. A comprehensive review of embedding methods that are designed for knowledge graphs is available in [3].
The tasks of vertex representation learning and vertex collocation profiling (VCP) for the purpose of topological link analysis and prediction were introduced in [106,107], respectively. Comprehensive information on the surrounding local structure of embedded pairs of vertices $v_x$ and $v_y$, in terms of their common membership in all possible subgraphs of $n$ vertices over a set of $r$ relations, is available from their VCP, written as $\mathrm{VCP}^{n,r}_{x,y}$, and the VCP elements are closely related to isomorphic subgraphs. Thus, this method helps in understanding the link formation mechanism from the node and graph representations.
Mapping the graph to a vector space is also known as encoding. Conversely, the reconstruction of the node neighborhood from the embedded graph is referred to as decoding. Graph representations can be learned via supervised or unsupervised methods while using an appropriate optimization algorithm in order to learn the embeddings [101]. This mapping can be defined for a graph $G = \langle V, E \rangle$ as

$$f : v_x \rightarrow \mathbf{v}_x \in \mathbb{R}^d, \quad \forall x \in [n],$$

where $n$ denotes the total number of vertices, $v_x$ is a sample node that is embedded into the $d$-dimensional vector space, and the embedded node is represented by $\mathbf{v}_x$. Figure 6 illustrates the procedure of node and graph representation. Representation learning algorithms for the task of link prediction can be divided into categories based on their decoder function, their similarity measure for graphs, and their loss function [101]. Therefore, we categorize these methods into (i) Matrix Factorization-Based Models, (ii) Path and Walk-Based Models, and (iii) Deep Neural Network-Based Methods.

Matrix Factorization-Based Methods
These methods are able to extract latent features with additional graph features for link prediction. In these models, the vector representation of the topology-related features produces an $N$-dimensional space, where $N = |V|$ is the number of vertices in the network. The main purpose of matrix factorization-based methods is to reduce the dimensionality while also preserving the nonlinearity and locality of the graph by employing deterministic measures of node similarity in the graph. However, the global structure of the graph topology may generally be lost [108].
SVD is one of the commonly used methods as a result of its feasibility in low-rank approximations [109,110]. Here, the link function $L(\cdot)$ is defined as $G \approx L(U \Lambda U^T)$, where $U \in \mathbb{R}^{|V| \times k}$, $\Lambda \in \mathbb{R}^{k \times k}$, and $k$ denotes the number of latent variables in SVD. The similarity $s(v_x, v_y)$ between the node pair $v_x$ and $v_y$ is defined by $L(u_{v_x}^T \Lambda u_{v_y})$. In [111], a latent feature learning method for link prediction has been proposed by defining a latent vector $\vec{l}_{v_x}$ and a feature vector $\vec{a}_{v_x}$ for each node $v_x$, a weight vector $W_v$ for node features, a weight vector $\vec{w}_e$ for edge features, and a vector of features $\vec{b}_{v_x, v_y}$ for each edge. This model computes the prediction of edge formation as: where $F$ is the scaling factor for each edge.
The inner-product-based embedding models for link prediction embed the graph based on a pairwise inner-product decoder, such that the node relationship probability is proportional to the dot product of node embeddings:

$$\mathrm{DEC}(\mathbf{v}_x, \mathbf{v}_y) = \mathbf{v}_x^T \mathbf{v}_y$$

Graph Factorization (GF) [112], GraRep [113], and HOPE [110] are examples of the inner-product-based methods for link prediction. The graph factorization model partitions the graph by minimizing the number of neighboring nodes, rather than applying edge cuts, as the storage and exchange of parameters for the latent variable models and their inference algorithms are related to nodes. HOPE [110] focuses on the representation and modeling of directed graphs, as directed associations can represent any type of graph. This model preserves the asymmetric transitivity for directed graph embeddings. The asymmetric transitivity property captures the structure of the graph by keeping the correlation between the directed edges, such that the probability of the existence of a directed edge from $v_x$ to $v_y$ is high if a directed path from $v_x$ to $v_y$ already exists, whereas an edge in the opposite direction is not implied. HOPE supports classical similarity measures as proximity measurements in the algorithm, including the Katz Index (KI), Rooted PageRank (RPR), Common Neighbors (CN), and Adamic-Adar (AA).

Path and Walk-Based Methods
The models for link prediction that are designed based on random walk statistics avoid the need for any deterministic similarity measures. In these algorithms, similar embeddings are produced for nodes that co-occur on short random walks over the graph. These algorithms investigate the node features, including node centrality and similarity, via graph exploration and sampling with random walks or search algorithms, such as Breadth First Search (BFS) and Depth First Search (DFS) [114]. The random walk-based models for graphs can be divided into many different categories, according to varying perspectives. One possible division of these models is based on their embedding output, for instance, local structure-preserving methods, global structure-preserving methods, and the combination of the two [115].
Representations with BFS provide information regarding the similarity of nodes in terms of their roles in the network, for instance, representing a hub in the graph [102]. In contrast, random walks with DFS can provide information regarding the communities that nodes belong to. These algorithms have recently been applied along with generative models to introduce edges and nodes directly to the graph [116]. Community aware random walk for network embedding (CARE), as introduced in [117], is another approach for the tasks of link prediction and multi-label classification. This model builds customized paths that are based on the local and global structures of the network, and uses the Skip-gram model to learn the representation vectors of nodes.
In contrast to walk-based methods, link prediction that is based on meta path similarity has been introduced in [118], which performs a similarity search among nodes of the same type. Thus, meta path-based methods extend link prediction to heterogeneous networks with different types of vertices. In this model, a meta path refers to a sequence of relations between object types and defines a new composite relation between its starting type and ending type. The similarity measure between two objects can be defined according to the random walks used in P-PageRank, the pairwise random walk used in SimRank, P-PageRank or SimRank on the extracted sub-graph or, finally, using PathSim, which captures the subtle semantics of similarity among peer objects [118]. PathSim calculates the similarity of two peer objects as:

$$s(v_x, v_y) = \frac{2 \times \left|\{p_{v_x, v_y} : p_{v_x, v_y} \in \mathcal{P}\}\right|}{\left|\{p_{v_x, v_x} : p_{v_x, v_x} \in \mathcal{P}\}\right| + \left|\{p_{v_y, v_y} : p_{v_y, v_y} \in \mathcal{P}\}\right|}$$

where $\mathcal{P}$ refers to the meta path defined on the graph of the network schema, $p_{v_x, v_y}$ is a path instance between $v_x$ and $v_y$, and $p_{v_x, v_x}$ and $p_{v_y, v_y}$ are the same concept for the vertices $v_x$ and $v_y$ themselves. An application of meta paths to link prediction is presented in [119], which predicts drug target interactions (DTI) based on the observed topological features of a semantic network in the context of drug discovery.
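The sampling step that walk-based models share can be sketched as follows: short random walks are generated, and the pairs of nodes that co-occur within a context window become the training pairs for a Skip-gram-style model. The function names, the uniform transition rule, and the window parameter are assumptions for illustration; biased walks (for example, BFS-like or DFS-like exploration) and the embedding training itself are not shown.

```python
import random


def random_walk(adj, start, length, rng):
    """Uniform random walk of at most `length` nodes starting at `start`.

    adj maps each node to a list of neighbors (hypothetical layout) and rng
    is a random.Random instance, so that walks are reproducible.
    """
    walk = [start]
    while len(walk) < length:
        neighbors = adj[walk[-1]]
        if not neighbors:
            break  # dead end: stop the walk early
        walk.append(rng.choice(neighbors))
    return walk


def cooccurrence_pairs(walk, window):
    """(center, context) pairs for nodes at most `window` steps apart."""
    pairs = []
    for i, center in enumerate(walk):
        lo = max(0, i - window)
        hi = min(len(walk), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, walk[j]))
    return pairs


toy_adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
rng = random.Random(0)
walks = [random_walk(toy_adj, node, 5, rng) for node in toy_adj]
training_pairs = [pair for walk in walks for pair in cooccurrence_pairs(walk, 2)]
```

Nodes that frequently appear in each other's windows end up with many shared training pairs, which is what pushes their embeddings close together.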
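The PathSim computation can be illustrated for a symmetric meta path of the form object, intermediate type, object (for example, author, venue, author). In this sketch, the number of path instances between two objects is the dot product of their rows in a count matrix. The function names and the toy publication counts are assumptions made for illustration.

```python
def path_count(counts, x, y):
    """Number of path instances between x and y along the symmetric meta path.

    counts[x] lists how many links x has to each intermediate object
    (hypothetical layout, e.g. papers per venue for each author).
    """
    return sum(a * b for a, b in zip(counts[x], counts[y]))


def pathsim(counts, x, y):
    """PathSim: 2 * paths(x, y) / (paths(x, x) + paths(y, y))."""
    denominator = path_count(counts, x, x) + path_count(counts, y, y)
    if denominator == 0:
        return 0.0
    return 2.0 * path_count(counts, x, y) / denominator


# Toy author-by-venue publication counts (illustrative numbers only).
toy_counts = {
    "mike": [2, 1, 0, 0],
    "jim": [50, 20, 0, 0],
    "mary": [2, 0, 1, 0],
    "bob": [2, 1, 0, 0],
    "ann": [0, 0, 1, 1],
}
similar_to_mike = sorted(
    (name for name in toy_counts if name != "mike"),
    key=lambda name: pathsim(toy_counts, "mike", name),
    reverse=True,
)
```

In this example, "bob" has the same publication profile as "mike" and obtains a score of 1, whereas the far more prolific "jim" scores low despite sharing many path instances, because the normalization favors peers of comparable visibility.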

Neural Network-Based Methods
In order to avoid strong assumptions for every heuristic related to node similarities and edge formation, link prediction algorithms that are based on neural networks have been proposed that automatically learn a suitable heuristic from a given network. In [25], a mapping function from subgraph patterns to link existence is learned by extracting a local subgraph around each target link. Thus, this model automatically learns a "heuristic" that suits the graph. The powerful capabilities and simplicity of neural network-based methods have led to a family of complex encoder-decoder-based representation learning models, such as Graph Neural Networks (GNNs) [120,121] and Graph Convolutional Neural Networks (GCNs) [104,[122][123][124].
Since the general concept of graph neural networks was first presented in [121], many neural network-based algorithms for representation learning and link prediction have been proposed, including SEAL [25], which uses GNNs to learn general graph structure features for link prediction from local enclosing subgraphs. Besides models that consider graph structure features, latent and explicit features are also investigated in the literature for link prediction. Furthermore, efficient strategies for capturing multi-modality in graphs, for instance, node heterogeneity, have originated from neural network-based models. Another extension of graph embedding methods that has become achievable with neural networks is the embedding of subgraphs ($S \subset V$). The attribute aggregation procedure in different neural network architectures may vary according to their connection types and the usage of filters or gates in the propagation step of the models [115].
In order to learn the information on the neighboring nodes, GNNs [121] aim to learn a state embedding $h_{v_x} \in \mathbb{R}^s$ iteratively, where $s$ is the dimension of the vector representation of node $v_x$. By stacking the states for all of the nodes, the constructed vectors $H$ and the output labels $O$ can be represented as:

$$H = F_g(H, X), \qquad O = O_g(H, X_N)$$

where $F_g$ is the global transition function, $O_g$ is the global output function, $X$ refers to the feature vector, and $X_N$ stands for the feature vector for all nodes. The updates per iteration can be defined as:

$$H^{t+1} = F_g(H^t, X)$$

where $t$ denotes the $t$-th iteration. In this algorithm, the learning of the representations can be achieved by a supervised optimization method, such as gradient descent.
The SEAL [25] algorithm, which has been designed for the task of link prediction, extracts enclosing subgraphs for a set of sampled positive (observed) and negative links in order to prepare the training data for a GNN, and uses that information to predict edge formation. The GNN model receives the adjacency matrix ($A$) and the node information matrix ($X$) as input, where each row of $X$ corresponds to a feature vector of a vertex. The matrix $X$ for each enclosing subgraph is assembled from three components: structural node labels based on Double-Radius Node Labeling (DRNL), node embeddings, and node attributes. Another neural network-based model for the task of link prediction is HetGNN [120], which considers heterogeneous networks. This model starts with a random walk with restart strategy and samples a fixed number of correlated heterogeneous neighbors in order to group them based upon node types. Subsequently, a neural network architecture with two modules is used in order to aggregate the feature information of the sampled neighboring vertices. The deep feature interactions of heterogeneous contents are captured by the first module, which generates a content embedding for each vertex. The aggregation of the content embeddings of different neighboring types is done by the second module. HetGNN combines these outputs in order to obtain the final node embedding.
Multi-layer Perceptrons (MLPs) are neural network-based representation learning algorithms that approach graph embedding via message passing, in which information flows from the neighboring nodes with arbitrary depth. Message Passing Neural Networks (MPNNs) [125] further extend GNNs and GCNs by proposing a single framework for variants of general approaches, such as incorporating the edge features in addition to the node features.
Graph Auto-Encoders (GAE) and Variational Graph Auto-Encoders (VGAE) [123] are another category of graph neural networks that aim to learn the node representations in an unsupervised manner. The majority of models based on GAE and its derivations employ Graph Convolutional Networks (GCNs) for the node encoding procedure. Next, these algorithms employ a decoder in order to reconstruct the graph's adjacency matrix $A$. This procedure can be formally represented as:

$$Z = \mathrm{GCN}(X, A), \qquad \hat{A} = \sigma\left(Z Z^T\right)$$

where $Z$ is the convolved attribute matrix, $\hat{A}$ is the reconstructed adjacency matrix, and $\sigma$ is the logistic sigmoid function. GAEs can learn the graph structures while using deep neural network architectures, and reduce the graph dimensionality in accordance with the number of channels of the auto-encoder hidden layers [126]. Additionally, GAE-based models are able to embed the nodes into sequences with diverse lengths. This allows the auto-encoders not only to achieve high performance when testing over unseen node embeddings, but also to aggregate the node attributes in order to improve their prediction accuracy [101]. GC-MC [124] and Adversarially Regularized Graph Auto-Encoders (ARGA) are examples of representation models with auto-encoder architectures [127]. Auto-encoders are also being used without neural network architectures, for instance, LINE [108], DNGR [126], and SDNE [128]. The algorithm in LINE consists of a combination of two encoder-decoder structures to study and optimize the first- and second-order node proximities in the vector space. Both the DNGR [126] and SDNE [128] algorithms embed the node local neighborhood information while using a random surfing method, and approach single embeddings through auto-encoders rather than pairwise transformations.
Although the graph representation learning models that are based on GNNs consider both graph structures and node features to embed the graph, they suffer from computational complexity and inefficiency in the iterative updating of the hidden states. Furthermore, GNNs use the same parameters for all layers, which limits their flexibility. These architectures are usually designed as shallow networks with no more than three layers, and including a higher number of layers is still considered to be a challenge for CNNs [115].
The introduction of neural networks, especially convolutional neural networks, to graph structures has made it possible to extract features from complex graphs flexibly. Graph Convolutional Networks (GCNs) [122] tackle the problem of high computational complexity and shallow architectures by defining a convolution operator for the graph. Furthermore, a rich class of convolutional filter functions can be achieved through stacking many convolution layers. The iterative aggregation of a node's local neighborhood is used in GCNs to obtain graph embeddings, where this aggregation method leads to higher scalability besides learning graph global neighborhoods. The features for these models include the information from the topology of the network aggregated with the node attributes, when the node features are available from the data domain [115]. Additionally, GCNs can be utilized for node embeddings, as well as subgraph embeddings [101]. Various convolutional models have been derived from GCNs that employ different convolutional filters in their architecture. These filters can be designed as either spatial filters or spectral filters. The former type of convolutional filter operates directly on the original graph and its adjacency matrix, whereas the latter type is applied to the spectrum of the graph Laplacian [114].
In [129], the problem of link prediction is studied while using a combination of two convolutional neural networks for the graph network of molecules. The molecules are represented as having a hierarchical structure for their internal and external interactions. The transformation of the graph structure into a low-dimensional vector space is obtained from an internal convolutional layer that is randomly initialized for each node representation and trained by backpropagation. The external convolutional layer receives the embedded nodes as input to learn over the external graph representations. Finally, the link prediction algorithm consists of a multilayer neural network, which accepts the final representations in order to predict the molecule-molecule interactions by a softmax function.
The algorithms that belong to the family of neighborhood aggregation methods are also referred to as convolutional models. An example is GraphSAGE [130], which aggregates the information from local neighborhoods recursively, or iteratively. This iterative characteristic makes the model generalizable to unseen nodes. The node attributes for this model might include simple node statistics, such as node degrees, or even textual data for profile information on online social networks.
Graph convolutional neural networks for relational data analysis are proposed in [104], which introduces Relational Graph Convolutional Networks (R-GCNs) for the tasks of link prediction and node classification. Because relational models refer to directed associations, the node relationships in this model for the graph $G = (V, E, R)$ are represented as $(v_x, r, v_y) \in E$, where $r \in R$ is a relation type for both canonical and inverse directions. This model can be considered to be a special case of a simple differentiable message-passing model. In this model, the forward-pass update for entity $v_x$ in a relational multigraph can be propagated by:

$$h_{v_x}^{(l+1)} = \sigma\left( \sum_{r \in R} \sum_{y \in \Gamma^r_{v_x}} \frac{1}{c_{v_x, r}} W_r^{(l)} h_{v_y}^{(l)} + W_0^{(l)} h_{v_x}^{(l)} \right)$$

where $\sigma(\cdot)$ is an element-wise activation function, $l$ denotes the layer of the neural network, $h_{v_x}$ is the hidden state of node $v_x$, $c_{v_x, r}$ refers to a problem-specific normalization constant, $W$ is the weight matrix (with $W_r$ specific to relation $r$ and $W_0$ applied to the node's own state), and $\Gamma^r_{v_x}$ denotes the set of neighbor indices of vertex $v_x$ under relation $r \in R$. Thus, this model is different from normal GCNs, as the accumulation of the transformed feature vectors of neighboring nodes is relation-specific. For this model, using multi-layer neural networks instead of a simple linear message transformation is also possible. The task of link prediction by this model can be viewed as computing node representations with an R-GCN encoder and using DistMult factorization [131] as the scoring function, which is a known score function for relation representation with a low number of relation parameters. The score of a triple $(s, r, o)$ for (subject, relation, object) is calculated in order to determine the likelihood of possible edges as:

$$f(s, r, o) = e_s^T R_r e_o$$

in which $e_s$ and $e_o$ are the encoded representations of the subject and object, and $R_r \in \mathbb{R}^{d \times d}$ is a diagonal matrix for every relation $r$. The model can be trained with negative sampling by randomly corrupting the subject or object of positive examples.
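The encode-then-decode procedure can be sketched in plain Python. This is a minimal, untrained sketch under stated assumptions: the encoder is a single linear graph convolution with the symmetrically normalized adjacency matrix with self-loops, the decoder is a sigmoid of inner products, and the function names and the fixed weight matrix are illustrative. In practice the weights would be learned by minimizing a reconstruction loss.

```python
import math


def matmul(a, b):
    """Dense matrix product for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(a):
    """D^{-1/2} (A + I) D^{-1/2} for an adjacency matrix given as lists."""
    n = len(a)
    a_hat = [[a[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
            for i in range(n)]


def gcn_encode(a, x, w):
    """One linear graph-convolution layer: Z = A_norm X W (no nonlinearity)."""
    return matmul(matmul(normalized_adjacency(a), x), w)


def inner_product_decode(z):
    """Reconstructed edge probabilities: sigmoid(z_i . z_j) for all pairs."""
    return [[1.0 / (1.0 + math.exp(-sum(p * q for p, q in zip(zi, zj))))
             for zj in z] for zi in z]


# Path graph 0 - 1 - 2 with one-hot node features and an arbitrary weight matrix.
adjacency = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
weights = [[0.8, -0.3], [0.1, 0.9], [-0.5, 0.4]]
embeddings = gcn_encode(adjacency, features, weights)
edge_probabilities = inner_product_decode(embeddings)
```

Because the decoder only uses inner products, the reconstructed matrix is symmetric, which is one reason this decoder suits undirected graphs.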
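The scoring and negative sampling steps can be sketched as follows. The sketch assumes that node embeddings are already available (for example, from an encoder) and stores each diagonal relation matrix as a vector of its diagonal entries; the function names and toy values are illustrative, and no training loop is shown.

```python
import random


def distmult_score(e_s, r_diag, e_o):
    """DistMult score e_s^T R_r e_o with R_r stored as its diagonal."""
    return sum(a * r * b for a, r, b in zip(e_s, r_diag, e_o))


def corrupt(triple, entities, rng):
    """Negative example: replace the subject or the object at random.

    The replacement is drawn uniformly from `entities`, so it can
    occasionally reproduce the original triple; filtering such cases is
    left out of this sketch.
    """
    s, r, o = triple
    replacement = rng.choice(entities)
    if rng.random() < 0.5:
        return (replacement, r, o)
    return (s, r, replacement)


# Toy embeddings and one relation (illustrative values only).
entity_vectors = {"x": [1.0, 2.0], "y": [3.0, 4.0], "z": [0.5, -1.0]}
relation_diagonals = {"cites": [0.5, -1.0]}

positive = ("x", "cites", "y")
rng = random.Random(0)
negative = corrupt(positive, sorted(entity_vectors), rng)


def score(triple):
    s, r, o = triple
    return distmult_score(entity_vectors[s], relation_diagonals[r], entity_vectors[o])


positive_score = score(positive)
negative_score = score(negative)
```

Note that a diagonal relation matrix makes the score symmetric in subject and object, so this scoring function on its own cannot distinguish the direction of a relation.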

Network Data Sets
One of the challenging tasks in network research is the implementation and validation of the proposed methods and models. By convention, the majority of network research relies on a small set of popular data sets: a friendship network of 34 members of a Karate Club and 78 interactions among them [132], the power network of an electrical grid of the western US with 4941 nodes and 6594 edges [133], an internet-based router network with 5022 nodes and 6258 edges [134], a protein-protein interaction network that contains 2617 proteins and 11855 interactions [135], a collaboration network of 1589 authors with 2742 interactions [136], an airline network of 332 nodes and 2126 edges that shows the connections between airports (http://vlado.fmf.uni-lj.si/pub/networks/data/), a social network of 62 dolphins in New Zealand with 159 interactions [137], and a biological network of the cerebral cortex of the Rhesus macaque with 91 nodes and 1401 edges [138].
Data set collection is time-consuming and labor-intensive work. While some studies build their own data sets, researchers mostly prefer to employ an existing one. Some popular collections of network data sets that might be used in link prediction studies are as follows:
• SNAP [139]: a collection of more than 90 network data sets by the Stanford Network Analysis Platform, with the biggest data set consisting of 96 million nodes.
• BioSNAP [140]: more than 30 biological network data sets by the Stanford Network Analysis Platform.
• KONECT [141]: this collection contains more than 250 network data sets of various types, including social networks, authorship networks, interaction networks, etc.
• PAJEK [142]: this collection contains more than 40 data sets of various types.
• Network Repository [143]: a huge collection of more than 5000 network data sets of various types, including social networks.
• Uri ALON [144]: a collection of complex network data sets by the Uri Alon Lab.
• NetWiki [145]: a collection of more than 30 network data sets of various types.
• WOSN 2009 Data Sets [146]: a collection of Facebook data provided by the social computing group.
• Citation Network Data set [147]: a collection of citation network data sets extracted from DBLP, ACM, and other sources.
• Grouplens Research [148]: a movie rating network data set.
• ASU social computing data repository [149]: a collection of 19 network data sets of various types: cheminformatics, economic networks, etc.
• Nexus network repository [150]: a repository of network data sets by iGraph.
• SocioPatterns [151]: a collection of 10 network data sets that were collected by the SocioPatterns interdisciplinary research collaboration.
• Mark Newman [152]: a collection of network data sets by Mark Newman.
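As a rough illustration of how such a data set might be prepared for a link prediction experiment, the sketch below parses a plain-text edge list and holds out a fraction of the edges as the links to be predicted. The file layout (one whitespace-separated node pair per line, with `#` comment lines) is an assumption that should be checked against each collection's own documentation, and the toy data and function names are hypothetical.

```python
import random


def parse_edge_list(text):
    """Parse whitespace-separated node pairs, skipping blank lines
    and lines that start with '#'. Edges are treated as undirected."""
    edges = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        u, v = line.split()[:2]
        if u != v:  # drop self-loops
            edges.add((min(u, v), max(u, v)))  # canonical order removes duplicates
    return sorted(edges)


def holdout_split(edges, test_fraction=0.1, seed=0):
    """Shuffle the edges and hold out a fraction as test (hidden) links."""
    rng = random.Random(seed)
    shuffled = edges[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Toy edge list, made up for illustration.
raw = """# toy undirected graph
1\t2
2\t1
2\t3
3\t4
4\t1
1 3
"""

edges = parse_edge_list(raw)
train_edges, test_edges = holdout_split(edges, test_fraction=0.2)
```

A predictor would then be fitted on `train_edges` only and evaluated on how highly it ranks the pairs in `test_edges` relative to node pairs that are not connected.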

Taxonomy
According to the methods that were explained earlier in this paper, we propose a taxonomy to better categorize the link prediction models. In our proposed taxonomy, the link prediction techniques are mainly categorized under two sections: feature learning and feature extraction techniques (Figure 7).

[Figure 7 diagram; the recoverable node labels are breadth-first search, depth-first search, community-aware random walk, and meta path-based methods.]

Discussion
This paper presents a comprehensive, state-of-the-art literature review of link prediction analysis, in which emerging links or missed associations in a complex network are predicted, organized through a custom taxonomy. We classified link prediction techniques under two main categories. Firstly, feature extraction techniques consist of the methods that start with an initial set of features and build the required resources from these raw features in order to describe structural similarity. We discussed these methods under three different titles according to their strategy for addressing link prediction problems; namely, similarity-based, relational, and probabilistic methods. Among these methods, similarity-based techniques are the simplest and the least computationally intensive. These methods aim to uncover missing links by assigning similarity scores to node pairs using the structural properties of the graph. According to the topological information that they require from a network, these methods are further divided into three subcategories. Global approaches require the complete topological information of the graph; therefore, they provide relatively more accurate results. However, the whole network may not be observable, or the large size of the network may call for less time-consuming methods. In such cases, local approaches, which consider at most second- or third-degree neighborhood relationships rather than the whole network, are suggested instead. This trade-off triggered the emergence of the so-called quasi-local approaches. These methods are generally the most favored and widely applied among similarity-based methods, since their use of additional topological information makes them as effective as global approaches while remaining less time-consuming. Other feature extraction techniques used in link prediction problems and covered in this study are relational and probabilistic methods. The use of maximum likelihood calculations in probabilistic methods makes them relatively time-consuming and expensive to deploy. Another major drawback of these models is the lack of the concept of an object, which is addressed in relational models. Thus, relational models are able to use the logical structure of the underlying data, which is helpful for more complex problems. However, employing relational methods in a link prediction problem requires a massive computation of marginal probability distributions for each node in the network. Although these methods are considered to be powerful, the nonexistence of a compact closed form of these distributions, due to mutual dependencies in correlated networks, makes their utilization challenging [24].
Secondly, feature learning-based techniques consist of methods that allow a system to automatically learn the necessary set of features before building the required resources to address link prediction problems. These high-performance approaches enable the integration of extra information related to the network that might be effective in predicting the existence of links, such as community structure [153], users' behavior [154], common interests [99], etc. Additionally, machine learning models are useful in picking the right combination of features by optimizing an objective function, which in many cases renders these methods preferable to the previously discussed approaches.
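The trade-off between local and quasi-local similarity indices can be illustrated with a small sketch. It contrasts the common-neighbors count (a local index) with a local-path-style score that adds a small weight for walks of length three; the local path form is chosen here only as one representative quasi-local index, and the damping value `epsilon`, the toy graph, and the function names are assumptions made for illustration.

```python
def common_neighbors(adj, x, y):
    """Local index: number of neighbors shared by x and y."""
    return len(adj[x] & adj[y])


def walks_of_length_3(adj, x, y):
    """Count walks x - u - w - y of length three."""
    return sum(1 for u in adj[x] for w in adj[u] if y in adj[w])


def local_path(adj, x, y, epsilon=0.01):
    """Quasi-local index: common neighbors plus a small weight
    on length-three walks (epsilon is an assumed damping value)."""
    return common_neighbors(adj, x, y) + epsilon * walks_of_length_3(adj, x, y)


# Toy graph (made up): a path 1-2-3-4 and a separate edge 5-6.
adj = {
    1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3},
    5: {6}, 6: {5},
}

cn_near = common_neighbors(adj, 1, 4)  # three hops apart
cn_far = common_neighbors(adj, 1, 5)   # different components
lp_near = local_path(adj, 1, 4)
lp_far = local_path(adj, 1, 5)
```

On this toy graph the local index assigns the same score to a pair that is three hops apart and to a pair in different components, whereas the quasi-local score separates them at the cost of one extra neighborhood expansion; a global index would go further and consider paths of every length.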
Getoor and Diehl [155] categorized link prediction problems under four main sections: (i) link existence, in which the likelihood of a connection forming between two nodes in the future is questioned, (ii) link load, in which the weights of the associated links are analyzed, (iii) link cardinality, in which the question of whether more than one link exists between a node pair is inspected, and (iv) link type, in which the role of the link between a node pair is evaluated. Although the methods that are discussed in this survey mainly address the link existence problem in networks, they can be easily extended to the problems of link load and link cardinality, since both require a similar computational approach [156]. Some learning-based methods and probabilistic models are also being deployed for link prediction in temporal and dynamic networks. In contrast, the problem of link type differs, since prediction methods for links among multiple object types may require special attention and the deployment of different methods. For more detailed information regarding the commonly used approaches for the link prediction problem in weighted, bipartite, temporal, dynamic, signed, and heterogeneous networks, see [72,157-161], respectively.
Although the link prediction problem is an established field of research, several problems are yet to be explored in this domain. In general, the studies available in the literature propose new methods or compare existing ones under the assumption that the network is noise-free; however, some links might be missing, substituted, or fake, in which case the network is called noisy. While Zhang et al. [162] compared a small number of similarity-based methods, there is no detailed study that compares the robustness of different approaches. Besides, each network has its own characteristics, i.e., its domain and network problem, and this makes transferring knowledge or generalizing the superiority of link prediction algorithms challenging. Still, only a few works consider the effects of varying topological properties on the performance of different link prediction approaches. Furthermore, most real-world networks are shown to be sparse. The resulting unbalanced data set hinders the handling of link prediction problems, especially when supervised techniques are used. Lastly, only a limited number of studies address the link prediction problem in multiplex/multilayer networks, and these studies are generally restricted to two layers. Further studies may consider this problem on multiplex networks with more than two layers.
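The class imbalance caused by sparsity can be quantified with a short calculation. The sketch below assumes a simple undirected graph (no self-loops or parallel edges), so that the number of candidate node pairs is n(n-1)/2, and applies it to the node and edge counts quoted in the Network Data Sets section for the Karate Club and the western US power grid; the function name is hypothetical.

```python
def link_imbalance(n_nodes, n_edges):
    """Density and negative-to-positive ratio of node pairs,
    assuming a simple undirected graph."""
    possible = n_nodes * (n_nodes - 1) // 2  # unordered node pairs
    negatives = possible - n_edges           # pairs without a link
    return {
        "possible_pairs": possible,
        "density": n_edges / possible,
        "negatives_per_positive": negatives / n_edges,
    }


karate = link_imbalance(34, 78)        # counts quoted for [132]
power_grid = link_imbalance(4941, 6594)  # counts quoted for [133]

print(karate)
print(power_grid)
```

Under this assumption, the small Karate Club network has roughly six unlinked pairs per link, while the power grid has more than 1800, which suggests why a supervised classifier trained on all node pairs of a large sparse network sees almost exclusively negative examples.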

Figure 5. An example of a relational schema for a simple domain. The underlined attributes are reference slots of the class and the arrows show the types of objects to which they are referring.

Figure 6. An example of node and graph representation. Here, the node representation vectors are aggregated to generate a single graph representation.

Figure 7. A taxonomy for the feature extraction techniques and feature learning methods in link prediction literature.