Discovering Correlation Indices for Link Prediction using Differential Evolution

Abstract: Binary correlation indices are crucial for forecasting and modelling tasks in different areas of scientific research. The setting of sound binary correlations and similarity measures is a long and mostly empirical interactive process, in which researchers start from experimental correlations in one domain, which usually prove to be effective in other similar fields, and then progressively evaluate and modify those correlations to adapt their predictive power to the specific characteristics of the domain under examination. In research on link prediction in complex networks, it has been found that no single correlation index always obtains excellent results, even in similar domains. The search for domain-specific correlation indices, or the adaptation of known ones, is therefore a problem of critical concern. This paper presents a solution to the problem of setting new binary correlation indices that achieve efficient performance on specific network domains. The proposed solution is based on Differential Evolution, evolving the coefficient vectors of meta-correlations, structures that describe classes of binary similarity indices and subsume the best-known correlation indices for link prediction. Experiments show that the proposed evolutionary approach always results in improved performance, in some cases significantly, compared to the best correlation indices available in the link prediction literature, effectively exploring the correlation space and exploiting its self-adaptability to the given domain to improve over generations.


Introduction
Link Prediction (LP) is a branch of Complex Networks science that aims at explaining the evolutionary dynamics of a network by looking at the additional connections that may be established between entities (nodes) in the network. A common approach to LP is to introduce a definition of similarity between entities and to compute similarity values accordingly for all pairs of nodes that are not yet connected. In the ranking induced by the similarity scores, the top-ranked pairs represent relationships with a higher likelihood of formation. The snapshot of a network at time t used to compute similarities is called the training network; the information derived from the ranking is tested on the test network, which represents the state of the same network at a future time step t + 1. The concept of similarity is central to the problem; in the literature, various definitions are available, including semantic [1] and topological [2] similarity. The former evaluates similarity according to features of the nodes; intuitively, two nodes are as similar as their feature values are. The latter looks at the position of nodes in the network, either limiting the analysis to a k-depth bounded local neighborhood [3] or considering the whole network at once; see, e.g., [4,5] and the broadly used Jaccard [6] and Adamic-Adar [7] indices. Important characteristics to consider when choosing an approach are its requirements, e.g., the number of training items needed for the learning phase; the possibility of reading and analyzing the process steps as well as the result, i.e., the readability of results; and the result type (boolean, rank, or absolute value, possibly with additional details). The choice of the approach should weigh these requirements and adapt the technique and its settings to the goal.
In this work, we study the class of topological similarities, focusing on measures based on the shared local neighborhood (i.e., common neighbors), given that semantic similarity measures can also be mapped to topological ones [8] and can thus be treated from the same point of view. Similarities of depth 2, e.g., Resource Allocation and Adamic-Adar, have been shown in the literature to predict better than other, more straightforward measures [8]. However, this does not hold in all domains, and simple measures, e.g., Common Neighbours or Jaccard, can often outperform more elaborate ones. No all-purpose neighborhood-based similarity index able to capture the peculiar characteristics of each different domain appears to be available in the literature. Two research questions emerge:
1. How can the contribution of the best-performing indices in the literature on a given domain be exploited together?
2. Is it possible to modify indices to adapt them to any single domain, to reflect its specific link formation mechanisms?
To the best of our knowledge, the only attempt to answer the first research question is a plain linear combination of well-known indices [9], where the weights regulating the contribution of each index are evolved using the covariance matrix adaptation evolution strategy [10] for numerical optimization. This linear combination can be regarded as a preliminary definition of a meta-correlation, but its ability to adapt to different domains is limited. Our approach finds original meta-correlations by evolving basic ones with Differential Evolution (DE) [11]; the added value lies in evolving the whole meta-correlation instead of a plain linear combination of measures. Among the existing approaches to link prediction, we have chosen to build a meta-correlation based on the best indices in the literature and to adapt it to any domain using DE. DE suits our goal because of its readability and differentiation, since we aim at a generalized meta-correlation metric that can be applied to any domain without prerequisites on the knowledge, density and connectivity of the graph. For example, methods based on full knowledge of the graph are very difficult to apply to large graphs, so analyzing the neighborhood of the nodes is more convenient [3]. Among link prediction techniques, each of which can outperform the others in particular contexts, simple measures often give decent results, and the literature offers many variations for enhancing their performance. The Quasi-common proximity approach [3,12] varies the basic measures so that they are evaluated at depth 2 of the graph, and is applicable to any topological similarity measure.
Path-based heuristic approaches, such as the Heuristic Semantic Walk [13], compute the similarity of candidate nodes for link prediction by choosing the direction of graph navigation on the basis of semantic heuristics, adding partial randomization to avoid loops. Recently, some works have combined topological and semantic similarity [1,8,14] to predict links in specific domains, e.g., co-authorship networks, providing techniques that could also be applied in other domains. By adapting the approach to many different similarity measures, very satisfying results are obtained when predicting links on the basis of the sub-graphs [12,14] around the nodes connected by each potential link, especially when semantic features are present, but also when using semantic measures mapped to the graph topology [1]. While topological and semantic approaches exploit the characteristics of the network through its structure, approaches based on deep learning can perform very well but require a very large number of training elements and return results without giving the researcher any possibility to analyze the process, which remains a black box. Some of these approaches, e.g., SEAL [5], ease the computation by randomly sampling potential links, and thus do not provide a complete ranking. Unfortunately, these techniques are not directly comparable, not only because of their different goals and approaches but also because each uses its own evaluation metrics, which do not overlap. The similarity metrics used vary with the graph structure or features and, in any case, domain-specific characteristics do not allow a direct comparison when tests are run on different data sets. The choice of the right approach depends on context and goals, but it is primarily driven by the requirements of each approach.
In real-world applications, e.g., where a company needs a correlation metric to exploit link prediction on any domain without the need for professional resources to set up different learning algorithms for each possible domain, it is useful to build meta-correlations with generalization capabilities.
The paper is structured as follows: in Section 2 a formal definition of meta-correlation indices is given, the related state of the art is presented for the basic correlation metrics, and our proposed novel meta-correlations are presented in detail; Section 3 provides in-depth information on the experimental setting for reproducibility, including network preprocessing and partition, a description of the data sets on which the experiments are run, and the setting of the Differential Evolution pipeline. Section 4 presents the experimental results and discussion; Section 5 concludes the paper.

Meta-Correlation Indices
Correlation indices have been defined by experts with different backgrounds to capture the peculiar properties of specific domains, and only afterward used in other application domains, e.g., biology, sociology and psychology. For example, the Dice (or Gleason, or Sørensen) index was initially applied to ecological population data, Simple Matching has been used to measure the level of agreement between two psychologists, and Tanimoto has been used in Chemoinformatics to analyze interaction fingerprints [15]; a large corpus of indices is available in the literature [16]. In the domain of Link Prediction, various measures have been proposed and applied in previous works. Some measures, e.g., the Adamic-Adar index [7], were purposefully developed for Link Prediction applications, while others were adapted to LP, e.g., the Jaccard coefficient [6], initially used in biology.
Formally, let x_1 and x_2 be two events or objects and F a set of features; most indices define the similarity between x_1 and x_2 as a function of four parameters a, b, c and d, which count the presence or absence of each f ∈ F. More specifically, a (resp. d) is the number of features present in (resp. absent from) both x_1 and x_2, while b and c count the features occurring only in x_1 and only in x_2, respectively. Several indices can be seen as variations of a basic syntactic structure, in which the inputs change in terms of multiplicative coefficients and applied operators, e.g., summation and subtraction.
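As an illustration of the four counts, the following minimal sketch computes a, b, c and d for two objects described by sets of present features, and evaluates three classical indices written in terms of these counts (Jaccard, Dice and Simple Matching, in their standard textbook forms). The feature universe and the two objects are hypothetical toy values chosen only for this example.

```python
def contingency_counts(x1, x2, features):
    """Return (a, b, c, d) for two objects given as sets of present features."""
    a = len(x1 & x2)                 # features present in both objects
    b = len(x1 - x2)                 # features present in x1 only
    c = len(x2 - x1)                 # features present in x2 only
    d = len(features - (x1 | x2))    # features absent from both objects
    return a, b, c, d


def jaccard(a, b, c, d):
    return a / (a + b + c)


def dice(a, b, c, d):
    return 2 * a / (2 * a + b + c)


def simple_matching(a, b, c, d):
    return (a + d) / (a + b + c + d)


# Hypothetical toy data: a universe F of 8 features and two objects.
features = set("pqrstuvw")
x1 = set("pqrs")
x2 = set("rstu")

counts = contingency_counts(x1, x2, features)
print(counts, jaccard(*counts), dice(*counts), simple_matching(*counts))
```

The three functions share the same inputs and differ only in coefficients and operators, which is the structural regularity that a parametric formula can exploit.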
The framework introduced in this paper aims at optimizing the predictive strength of correlation indices by defining binary correlation meta-indices, which exploit structural similarity to create populations of correlation indices. The resulting indices are then evolved using the Differential Evolution (DE) [17] algorithm. Binary correlation meta-indices are parametric formulas that subsume sets of correlation indices, reproducing well-known indices for specific parameter values. A meta-index is fully characterized by its structure and parameters; thus, a parameter assignment defines a specific instance of the selected meta-index. Consider, for instance, a meta-correlation index µ with five coefficients α, β, γ, δ, ε: µ can subsume both the Sokal and Sneath-1 index when α = β = 1, ε = 0 and γ = δ = 2, and the Common Neighbours index when α = ε = 1 and β = γ = δ = 0. Each assignment of values to the coefficient tuple of the meta-index thus represents a valid and unique correlation index, while the meta-index itself represents a class of correlation indices composed of all the possible assignments of the five-tuple (α, β, γ, δ, ε). In the proposed framework, let µ be a meta-index used for Link Prediction on domain D, with n parameters (c_1, ..., c_n):
• the population is composed of a set of m vectors v_1, ..., v_m of length n, each representing a correlation instance of the class subsumed by µ;
• the fitness function is any evaluation metric, e.g., precision, AUC, ROC, determining the capabilities of an individual for the Link Prediction task in the domain D.
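To make the idea of a coefficient vector concrete, the sketch below uses one functional form that is consistent with both parameter assignments quoted above, namely αa / (βa + γb + δc + ε). This form is an assumption made for illustration; other formulas could satisfy the same two cases, and the function name `meta_index` is hypothetical.

```python
def meta_index(coeffs, a, b, c):
    """Evaluate an ASSUMED meta-index form: alpha*a / (beta*a + gamma*b + delta*c + eps).

    The functional form is an illustrative assumption; it is only checked
    against the two parameter assignments named in the text.
    """
    alpha, beta, gamma, delta, eps = coeffs
    return alpha * a / (beta * a + gamma * b + delta * c + eps)


# Coefficient vectors (alpha, beta, gamma, delta, eps) quoted in the text.
SOKAL_SNEATH_1 = (1, 1, 2, 2, 0)      # reduces to a / (a + 2b + 2c)
COMMON_NEIGHBOURS = (1, 0, 0, 0, 1)   # reduces to a

a, b, c = 3, 1, 2                     # hypothetical counts for one node pair
ss1 = meta_index(SOKAL_SNEATH_1, a, b, c)
cn = meta_index(COMMON_NEIGHBOURS, a, b, c)
print(ss1, cn)
```

Under this assumption, every point of the five-dimensional coefficient space is a distinct index, so searching for a good index becomes a numerical optimization problem over real-valued vectors. A practical implementation would also need to guard against a zero denominator, which this sketch does not do.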
A central point of our approach is that we designed two meta-correlations to subsume and incorporate sets of well-known indices, combining the contribution of first-order and second-order features. The goal of the experimental design is then to investigate whether evolving meta-correlation indices can adapt to the peculiar characteristics of the data set on which they evolve.
Let G = (V, E) be a network, where V is the set of nodes and E ⊆ V × V is the set of edges; we define Γ(u), with u ∈ V, as the set of neighbours of node u in G. Given two nodes u and v of G, the first-order features we consider are a = |Γ(u) ∩ Γ(v)|, the number of Common Neighbours of u and v; b = |Γ(u) \ Γ(v)| and c = |Γ(v) \ Γ(u)|, the number of nodes connected only to u (resp. v); and d = |V| − (a + b + c + 2), the number of the other nodes in the network, connected neither to u nor to v. The second-order features, i.e., features that consider properties of nodes at distance 2 from u or v, are the Adamic-Adar similarity score, well known in the Link Prediction literature [7], $AA(u, v) = \sum_{z \in \Gamma(u) \cap \Gamma(v)} \frac{1}{\log |\Gamma(z)|}$, and the Pseudo-Adamic-Adar 1 and Pseudo-Adamic-Adar 2 scores, defined over the neighbours connected only to u and only to v, respectively.
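The first-order counts and the Adamic-Adar score can be computed directly from neighbour sets. The sketch below is a minimal illustration on a hypothetical toy graph stored as a dictionary of neighbour sets; it assumes that u and v are not adjacent (as is the case for candidate links) and uses the standard Adamic-Adar definition. The Pseudo-Adamic-Adar scores are not included.

```python
import math


def link_features(adj, u, v):
    """Return (a, b, c, d, aa) for a non-adjacent node pair (u, v).

    adj maps each node to the set of its neighbours (undirected graph).
    """
    nu, nv = adj[u], adj[v]
    common = nu & nv
    a = len(common)                      # common neighbours
    b = len(nu - nv)                     # neighbours of u only
    c = len(nv - nu)                     # neighbours of v only
    d = len(adj) - (a + b + c + 2)       # remaining nodes, excluding u and v
    # A common neighbour has degree >= 2, so log(degree) is strictly positive.
    aa = sum(1 / math.log(len(adj[z])) for z in common)
    return a, b, c, d, aa


# Hypothetical toy graph with edges u-x, u-y, u-p, v-x, v-y, v-q, y-p, r-s.
adj = {
    "u": {"x", "y", "p"},
    "v": {"x", "y", "q"},
    "x": {"u", "v"},
    "y": {"u", "v", "p"},
    "p": {"u", "y"},
    "q": {"v"},
    "r": {"s"},
    "s": {"r"},
}

a, b, c, d, aa = link_features(adj, "u", "v")
print(a, b, c, d, aa)
```

All quantities depend only on the neighbourhoods of u and v, which is what keeps this family of measures cheap compared with methods that need the whole graph.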
Equations (4) and (5) show the formulas of the two meta-indices, while Tables 1 and 2 list the subsumed indices and the corresponding parameter values.

Experiments
The goal of the experiment is to investigate whether evolving meta-correlations can adapt to the peculiar characteristics of the particular domain in which they evolve.

Network Preprocessing
The usual approach for Link Prediction experiments is to divide a data set into two parts, conventionally called training set and test set; the test set, usually amounting to 10-20% of the data set, is used to evaluate the performance of models built using the knowledge provided in the training set. In this work, the data set is instead split into three parts. First, the data set is split into a training-and-validation set E_TR+V and a test set E_TE, following a 90:10 ratio; then, k folds are generated from E_TR+V, building k (E_TR, E_V) pairs, each of which is used to evolve different correlations. Before this phase, the networks are pre-processed to remove elements such as self-loops and isolated nodes, since neither contributes to similarity scores calculated using local neighborhood-based measures. Directed networks are transformed into undirected networks: when there is a connection in at least one direction between two nodes, they are connected in the pre-processed network.
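The preprocessing and the three-way split described above can be sketched as follows. This is a minimal illustration, assuming an edge-list representation and a standard k-fold partition of the training-and-validation edges; the function names, the random seed and the toy edge list are hypothetical.

```python
import random


def preprocess(edges):
    """Drop self-loops and edge direction.

    Isolated nodes disappear implicitly, because only edges are kept.
    """
    return sorted({tuple(sorted((u, v))) for u, v in edges if u != v})


def split_edges(edges, k, test_ratio=0.1, seed=0):
    """Return (E_TE, [(E_TR, E_V), ...]) with k training/validation pairs."""
    rng = random.Random(seed)
    shuffled = edges[:]
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    e_te, e_trv = shuffled[:n_test], shuffled[n_test:]
    folds = []
    for i in range(k):
        e_v = e_trv[i::k]                      # every k-th edge is held out
        held_out = set(e_v)
        e_tr = [e for e in e_trv if e not in held_out]
        folds.append((e_tr, e_v))
    return e_te, folds


# Hypothetical directed edge list containing self-loops and reciprocal edges.
raw = [(i, j) for i in range(8) for j in range(8) if (i + j) % 3 == 0]
edges = preprocess(raw)
e_te, folds = split_edges(edges, k=3)
print(len(edges), len(e_te), [(len(tr), len(va)) for tr, va in folds])
```

Keeping E_TE out of every fold means the final evaluation uses edges that no evolved individual has been scored on.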

Data Sets
The framework has been tested on four data sets widely used in the link prediction literature. The data sets represent two main domains, with some diversity within each domain. The first domain comprises CA-GrQC [18] and Netscience [19], two diverse co-authorship networks, built respectively from papers published in the General Relativity and Quantum Cosmology categories between 1993 and 2003, and from papers in the area of Network Science. The other two data sets, ia-radoslaw-email [20] and email-eu-core [18,21], are two e-mail exchange networks representing a digital communication domain, the former among employees of a medium-sized company and the latter among members of a European institution. These domains have been considered representative of authors' social networks (i.e., co-authorship) and communication networks, to test the proposed approach on real, similar domains. For each domain, two instances sharing some similarities have been chosen, to show that the metrics used to create the meta-correlation do not themselves give similar results even on similar networks, and to test whether our evolved meta-correlation indices can better forecast link creation both in similar networks (i.e., from the same domain) and in diverse networks (i.e., from different domains).

Settings for Differential Evolution
The population members for Differential Evolution (DE) are evolved using the information available in the training set, and their fitness is calculated on the validation set. Precision, i.e., the proportion of actual edges among the top-k ranked edges, is used as the fitness metric, with k set to |E_V|; a perfect predictor would rank all the positive edges first. Edges in the test set are not available during the evolution process, effectively appearing as non-existent. The number of generations G was experimentally set to 300, as it was observed that further iterations did not provide any improvement. The mutation weighting factor F and the crossover constant CR were set to 0.9 and 0.5, respectively, according to the literature [11]. The core part of the population P is composed of instances p_i of the meta-correlation that correspond to known indices; additional individuals are obtained by applying random noise n, −0.25 < n < +0.25, to the coefficients of such correlations, to explore the correlation space more extensively.
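The fitness metric and the noise-based seeding of the population can be sketched as follows. This is a minimal illustration under stated assumptions: candidate scores are given as a dictionary from node pairs to similarity values, ties are broken by insertion order, and the noise is drawn uniformly. The helper names and the toy data are hypothetical; the two coefficient vectors are the (α, β, γ, δ, ε) assignments quoted earlier for the Sokal and Sneath-1 and Common Neighbours indices.

```python
import random


def precision_at_k(scores, positives, k):
    """Fraction of the k best-scored candidate pairs that are true edges."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return sum(1 for pair in ranked[:k] if pair in positives) / k


def seed_population(known, variations, noise=0.25, seed=0):
    """Known coefficient vectors plus noise-altered copies of each."""
    rng = random.Random(seed)
    population = [list(vec) for vec in known]
    for vec in known:
        for _ in range(variations):
            population.append([x + rng.uniform(-noise, noise) for x in vec])
    return population


# Hypothetical similarity scores for four candidate pairs.
scores = {("a", "b"): 0.9, ("a", "c"): 0.7, ("b", "d"): 0.4, ("c", "d"): 0.1}
e_v = {("a", "b"), ("b", "d")}            # validation edges (positives)
fitness = precision_at_k(scores, e_v, k=len(e_v))

known = [(1, 1, 2, 2, 0), (1, 0, 0, 0, 1)]
population = seed_population(known, variations=2)
print(fitness, len(population))
```

With k equal to the number of validation edges, the score reaches 1 only when every validation edge is ranked above every non-edge.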
A problem that can arise with Differential Evolution is the loss of diversity when there is total, or too high, consensus on one parameter value; this could happen for the proposed meta-correlation instances because the parameters of subsumed indices frequently take the same values. Introducing noise-altered population members overcomes this problem. The population for µ_1 amounts to 27 individuals, of which nine represent known indices and the rest are two noise-altered variations of each; for µ_2, six known correlations are considered, along with three variations each. Two DE variants have been tested in this work, namely RAND/1/EXP and RAND/1/BIN, according to the conventional DE naming scheme. Both select at random the individuals used in the mutation phase and use one difference vector, i.e., one pair of individuals (hence RAND/1); EXP and BIN refer to the adopted crossover schemes, exponential and binomial, respectively [22].
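The two variants differ only in how the mutant vector is mixed with the target vector. The sketch below shows textbook versions of the rand/1 mutation and of the binomial and exponential crossover operators; it is not taken from the implementation used in the experiments, and the function names are hypothetical.

```python
import random


def mutate_rand_1(population, i, F, rng):
    """rand/1 mutation: x_r1 + F * (x_r2 - x_r3), with r1, r2, r3 distinct and != i."""
    r1, r2, r3 = rng.sample([j for j in range(len(population)) if j != i], 3)
    x1, x2, x3 = population[r1], population[r2], population[r3]
    return [p + F * (q - r) for p, q, r in zip(x1, x2, x3)]


def crossover_bin(target, mutant, CR, rng):
    """Binomial crossover: each gene comes from the mutant with probability CR;
    one randomly chosen gene is always taken from the mutant."""
    n = len(target)
    forced = rng.randrange(n)
    return [mutant[j] if (j == forced or rng.random() < CR) else target[j]
            for j in range(n)]


def crossover_exp(target, mutant, CR, rng):
    """Exponential crossover: copy a contiguous (circular) run of mutant genes,
    starting at a random position and continuing while a draw is below CR."""
    n = len(target)
    trial = list(target)
    j = rng.randrange(n)
    copied = 0
    while True:
        trial[j] = mutant[j]
        j = (j + 1) % n
        copied += 1
        if copied >= n or rng.random() >= CR:
            break
    return trial


rng = random.Random(0)
target, mutant = [0.0] * 5, [1.0] * 5
trial_bin = crossover_bin(target, mutant, CR=0.5, rng=rng)
trial_exp = crossover_exp(target, mutant, CR=0.5, rng=rng)
same = mutate_rand_1([[1.0, 2.0]] * 4, 0, F=0.9, rng=rng)
print(trial_bin, trial_exp, same)
```

The last call illustrates the diversity problem mentioned above: when all individuals agree on a value, the difference vector is zero and mutation cannot move away from it.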

Algorithm
After pre-processing the network, the evolutionary phase to derive new correlation indices begins.
For each fold f, the best individual of the population emerging at the end of the DE execution is compared with known correlation indices. The combined knowledge available in E_TR and E_V is used as ground truth to rank probable edges. Since E_TE was excluded from the training process, we can test the performance on E_TE to assess the potential of the correlation in predicting future edges. The pipeline of the framework for Differential Evolution is shown in Algorithm 1.

Algorithm 1: Framework structure for Differential Evolution (DE).
Pre-process the network;
Initialize the population of meta-correlation instances;
for f ← 1 to K do
    for g ← 1 to G do
        for p_i ∈ P do
            y_i ← generate_offspring(p_i);
            Rank potential edges according to y_i using information in E_TR;
            Evaluate the fitness f(y_i) on E_V;
            Keep the fitter of p_i and y_i;
    Save the best individual p_b;
    Test p_b on E_TE using combined information from E_TR and E_V;
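The inner loops of the pipeline can be sketched as a generic DE/rand/1/bin loop with greedy selection. This is a minimal illustration and not the code used in the experiments: in the framework the fitness would be the precision on E_V of the ranking induced by each coefficient vector, whereas here a hypothetical toy fitness is used so that the example is self-contained. The function names and the toy population are assumptions.

```python
import random


def evolve(population, fitness, generations, F=0.9, CR=0.5, seed=0):
    """Maximize `fitness` with DE/rand/1/bin and greedy one-to-one selection."""
    rng = random.Random(seed)
    pop = [list(p) for p in population]
    fit = [fitness(p) for p in pop]
    n = len(pop[0])
    for _ in range(generations):
        new_pop, new_fit = list(pop), list(fit)
        for i in range(len(pop)):
            r1, r2, r3 = rng.sample([j for j in range(len(pop)) if j != i], 3)
            forced = rng.randrange(n)
            # Offspring: mutant genes with probability CR, target genes otherwise.
            trial = [
                pop[r1][j] + F * (pop[r2][j] - pop[r3][j])
                if (j == forced or rng.random() < CR) else pop[i][j]
                for j in range(n)
            ]
            f_trial = fitness(trial)
            if f_trial >= fit[i]:          # keep the fitter of parent and offspring
                new_pop[i], new_fit[i] = trial, f_trial
        pop, fit = new_pop, new_fit
    best = max(range(len(pop)), key=fit.__getitem__)
    return pop[best], fit[best]


def toy_fitness(vec):
    """Hypothetical stand-in for precision on E_V: maximal at (1, 1, 1, 1, 1)."""
    return -sum((x - 1.0) ** 2 for x in vec)


init_rng = random.Random(42)
initial = [[init_rng.uniform(-2, 2) for _ in range(5)] for _ in range(12)]
start_best = max(toy_fitness(p) for p in initial)
best_vec, best_fit = evolve(initial, toy_fitness, generations=50)
print(start_best, best_fit)
```

Because an offspring replaces its parent only when it is at least as fit, the best fitness in the population never decreases across generations. Accepting ties is a design choice made for this sketch, so that individuals can drift across regions where the fitness is constant.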

Experiments Results
In this section, we present the results of the experiments in terms of Precision (see Section 3.3). Other suitable metrics include AUC [23] and SRD [24,25]. In Tables 3 and 4, the precision values of the best individuals across all the folds, evolved following the RAND/1/EXP and RAND/1/BIN strategies respectively, are compared with known correlation indices, namely Common Neighbours (CN), Jaccard and Adamic-Adar (AA), for each data set. The largest improvements are obtained on the CA-GrQC data set, on which the best individual for µ_1 performs noticeably better than the reference measures. Similar behaviour can be observed on the email-eu-core data set, where both µ_1 and µ_2 achieve higher scores than the best-performing index, AA. Slight improvements are also noticeable on the Netscience data set, where µ_1 ranks first. Unlike the other data sets, on ia-radoslaw-email the best-performing meta-correlation is µ_2, which shows appreciable improvements, while µ_1 yields performance comparable to the other measures. On every data set, at least one of µ_1 and µ_2 achieves higher precision than the reference measures.
In Tables 5 and 6, the average precision and variance of the best individuals for each fold on E_TR + E_V are reported for all the data sets. Although the discovered correlations in some cases differ greatly in their coefficient values, all of them achieve better performance than the reference measures, both for µ_1 and µ_2. This probably hints at a correlation space with many separate local maxima of similar value. The intuition about local maxima is reinforced by Figures 1 and 2, which illustrate the dynamics of the evolutionary process on the Netscience and ia-radoslaw-email data sets for both meta-correlations. Charts on the left show the performance improvement from generation 1 to 300 for meta-correlation µ_1, those on the right for µ_2. Each line represents the evolution on one fold, following the DE RAND/1/EXP strategy; for readability, only a subset of folds is shown. The correlation space presents plateaus on which the fitness function returns the same score: this behaviour slows down the evolution process. Similar behaviour is reported in the literature for other combinations of data sets, correlations and strategies [11]. In the vast majority of cases, the performance improvement from the first to the last generation is noticeable, ranging from 3% to more than 10%. In a few cases the system shows a less substantial increment in precision, e.g., in fold 5 for the ia-radoslaw-email data set, where the improvement is visible but minor; this may be due to the k-fold validation split. On every data set, the evolved meta-correlations always improve on the single measures.
Our approach is not directly comparable with other Link Prediction techniques, e.g., the Quasi-common neighbors [3] or the SEAL heuristic-learning technique [5], which are evaluated in the literature on other data sets, using different similarity measures and other evaluation metrics, and which could possibly be adapted to include both topological and semantic similarity. The main focus of our work is not to obtain a better link prediction on a single domain, as happens with the cited approaches, but to discover a meta-correlation (which can also be used within other techniques, such as Quasi-common neighborhood) able to adapt, by evolving, to every domain. Even if a particular measure may perform better on a particular domain, that measure does not have the power to adapt its performance to other domains, even similar ones. Our meta-correlation, built on the link prediction power of any desired measure, can evolve it towards its best performance on general domains, which is our main goal. A general comparison can be made on the knowledge requirements: the more graph knowledge is required (i.e., the broader the analyzed common neighborhood), the more computational power is required for the prediction. Adaptability to the context of each domain is the main enhancement of our proposal.

Conclusions
In this work, we presented a framework based on evolutionary algorithms for Link Prediction. Differential Evolution (DE) is used to evolve the coefficients of parametric meta-correlation formulas to design domain-centered indices. Meta-correlations identify new classes of correlations; each member of a class is identified by a different meta-correlation parameter vector, and well-known indices are subsumed for specific parameter assignments. During the DE evolution of the population of meta-correlation parameter vectors, new correlation indices are discovered, with prediction capabilities tailored to a specific data set, i.e., environment. Experiments show that the system can integrate the contribution of different features to discover new correlation indices that improve precision compared with the link prediction indices existing in the literature. The initial research questions can now be answered: with our method, it is possible to obtain a general meta-correlation that strikes a good balance between being adaptive (more than standard indices) and being less computationally intensive (e.g., than learning-based methods achieving similar or better performance with particular metrics on particular domains), exploiting the contribution of the best-performing indices adapted to the specific link formation mechanism of each domain.
Future work aims at extending the experimental domain to assess the capabilities of the system in other research areas where the discovery and optimization of correlation indices are crucial. For a complete and realistic direct comparison, it would be necessary to re-implement all the link prediction approaches in the literature, or at least those mentioned in the introduction, using the same similarity measures, the same evaluation metrics and the same data sets, independently of the objective and the application context of the developer and the user.