On Investigating Both Effectiveness and Efficiency of Embedding Methods in Task of Similarity Computation of Nodes in Graphs

Abstract: One of the important tasks in a graph is to compute the similarity between two nodes; link-based similarity measures (in short, similarity measures) are well-known and conventional techniques for this task that exploit the relations between nodes (i.e., links) in the graph. Graph embedding methods (in short, embedding methods) convert the nodes in a graph into vectors in a low-dimensional space by preserving the social relations among nodes in the original graph. Instead of applying a similarity measure to the graph to compute the similarity between nodes a and b, we can regard the proximity between the corresponding vectors of a and b obtained by an embedding method as the similarity between a and b. Although embedding methods have been analyzed in a wide range of machine learning tasks such as link prediction and node classification, they have not been investigated in terms of similarity computation of nodes. In this paper, we investigate both the effectiveness and efficiency of embedding methods in the task of similarity computation of nodes by comparing them with those of similarity measures. To the best of our knowledge, this is the first work that examines the application of embedding methods in this special task. Based on the results of our extensive experiments with five well-known and publicly available datasets, we found the following observations for embedding methods: (1) with all datasets except for one, they show less effectiveness than similarity measures, (2) they underperform similarity measures in terms of efficiency with all datasets except for one, (3) they have more parameters than similarity measures, thereby leading to a time-consuming parameter tuning process, and (4) increasing the number of dimensions does not necessarily improve their effectiveness in computing the similarity of nodes.
We conduct extensive experiments with five well-known and publicly available datasets, BlogCatalog [23,27,29], Cora [39,40], DBLP [1,17,41], TREC [1,17], and Wikipedia [26,27,29], which are widely used in the literature. Our experimental results demonstrate that similarity measures are better than embedding methods at computing the similarity of nodes; we found the following observations for embedding methods: (1) with all datasets, they show less effectiveness than similarity measures except for the BlogCatalog dataset; however, they show less efficiency than similarity measures with this dataset, (2) they underperform similarity measures with all datasets in terms of efficiency except for the Wikipedia dataset; however, with this dataset, they show less effectiveness than similarity measures, (3) they have more parameters than similarity measures, thereby leading to a difficult and time-consuming parameter tuning process, and (4) increasing the number of dimensions (i.e., latent features) does not necessarily improve their effectiveness in computing the similarity of nodes. In addition, we observe that, among the embedding methods, DeepWalk and its variants (i.e., Line and NetMF) show better effectiveness.


Introduction
Nowadays, graphs are becoming increasingly important since they are natural representations to encode relational structures in many domains (e.g., apps' function-call diagrams, brain-region functional activities, bio-medical drug molecules, protein interaction networks, citation networks, and social networks), where nodes represent the domain's objects and links represent their pairwise relationships [1-7]. Computing the similarity score between two nodes based on the graph structure is a fundamental task in a wide range of applications such as recommender systems, spam detection, graph clustering [8,9], web page ranking, citation analysis, social network analysis, k-nearest neighbor search [1,9], synonym expansion (i.e., search engines' query rewriting and text simplification), and lexicon extraction (i.e., automatically building bilingual lexicons from text corpora) [10].
Link-based similarity measures (in short, similarity measures) such as SimRank [11] are well-known and conventional techniques to compute the similarity of nodes based only on the graph structure. Recently, SimRank and its variants have attracted growing interest in the areas of data mining and information retrieval [1,8-10,12-14]. The philosophy of SimRank in similarity computation is that "two objects are similar if they are related to similar objects" [11].
The contributions of this paper are summarized as follows:
• Although embedding methods have been analyzed in a wide range of machine learning tasks, they have not been investigated in terms of similarity computation of nodes. We investigate and analyze both the effectiveness and efficiency of embedding methods in the task of similarity computation of nodes in graphs.
• We compare the effectiveness as well as the efficiency of embedding methods with those of link-based similarity measures as the conventional technique to compute the similarity of nodes.
• We conduct extensive experiments with five widely used datasets by employing nine different embedding methods and four different similarity measures, all of which are state-of-the-art in the literature.
The rest of this paper is organized as follows. We discuss link-based similarity measures and graph embedding methods in Sections 2 and 3, respectively. In Section 4, we present and discuss the results of our extensive experiments. In Section 5, we conclude our paper.

Link-Based Similarity Measures
In this section, we briefly explain SimRank [11], a well-known link-based similarity measure (in short, similarity measure), and its state-of-the-art variants, JacSim [1], JPRank [17], and SimRank* [9]; their corresponding mathematical formulations are presented in detail in Appendix A.1.

SimRank: it computes the similarity between two nodes in a graph based on the philosophy that "two objects are similar if they are related to similar objects" [11]. For a node-pair (a, b) in a graph, let I_a and I_b be the two sets of nodes directly pointing to nodes a and b, respectively. In SimRank, the similarity score of (a, b) is iteratively computed as the average of the similarity scores of all possible node-pairs (i, j), where node i belongs to I_a and node j belongs to I_b; this computation manner is called the pairwise normalization paradigm [1]. Consider the sample graph in Figure 1; SimRank considers nodes e and f similar since they are directly pointed to by common node b (each node is maximally similar to itself), and nodes i and j are regarded as similar since they are indirectly pointed to by common nodes c and d.

JacSim: it alleviates the pairwise normalization problem by employing both the Jaccard coefficient and the pairwise normalization paradigm in similarity computation. This problem is a counterintuitive property of SimRank where the SimRank score of a pair of nodes commonly pointed to by a large number of nodes tends to be lower than that of another pair of nodes commonly pointed to by a small number of nodes [1,16,20]. As an example, consider node-pairs (i, h) and (i, j) in the graph of Figure 1, where the similarity scores of some node-pairs are shown as well (we note that all the scores are computed by employing the matrix forms of SimRank, JacSim, and JPRank); nodes i and h are pointed to by a single common node b, while nodes i and j are pointed to by two common nodes c and d.
However, the SimRank score of (i, h) (i.e., 0.0106) is bigger than that of (i, j) (i.e., 0.0071) due to the pairwise normalization problem, while, as shown in the figure, the JacSim score of (i, h) (i.e., 0.0048) is smaller than that of (i, j) (i.e., 0.0076).

SimRank*: it is a variant of SimRank trying to remedy the level-wise computation problem of SimRank; this problem happens since SimRank regards two nodes as similar only if paths of equal length exist from a common node to both of them. As an example, consider the node-pair (e, i) in Figure 1; there are no paths of equal length from any common node, such as b, to them. There is a path of length two from b to i, while there is a path of length one from b to e. Therefore, the SimRank score of (e, i) becomes zero, as shown in the figure, due to the level-wise computation problem; however, their SimRank* score is not zero (i.e., 0.0206).

JPRank: it is another variant of SimRank that solves both the pairwise normalization and in-links consideration problems. The latter problem arises since SimRank considers only in-links to compute the similarity score. As an example, in Figure 1, the SimRank score of node-pair (b, c) is zero since I_b = I_c = ∅. However, both b and c point to a common node f, which means b and c are somehow similar; as shown in the figure, the JPRank score of (b, c) is not zero (i.e., 0.0028).
It is worth noting that, although our sample graph in Figure 1 is directed, all the above similarity measures can be applied to undirected graphs as well. However, the in-links consideration problem is not applicable to an undirected graph and JPRank behaves exactly the same as JacSim in similarity computation [17].
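As an illustration, the iterative pairwise-normalization computation of SimRank described above can be sketched as follows for a toy directed graph; this is a naive, unoptimized sketch (the node names and the graph are hypothetical), not the matrix form used in our experiments.

```python
import itertools

def simrank(in_nbrs, C=0.8, iters=8):
    """Iteratively compute SimRank scores with pairwise normalization.

    in_nbrs: dict mapping each node to the list of nodes pointing to it.
    """
    nodes = list(in_nbrs)
    # s[(a, b)] holds the current similarity estimate; s(a, a) = 1.
    s = {(a, b): 1.0 if a == b else 0.0 for a in nodes for b in nodes}
    for _ in range(iters):
        nxt = {}
        for a, b in itertools.product(nodes, nodes):
            if a == b:
                nxt[(a, b)] = 1.0
            elif in_nbrs[a] and in_nbrs[b]:
                total = sum(s[(i, j)] for i in in_nbrs[a] for j in in_nbrs[b])
                # Average over all in-neighbor pairs (pairwise normalization).
                nxt[(a, b)] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
            else:
                nxt[(a, b)] = 0.0  # a node without in-links gets score zero
        s = nxt
    return s

# Toy graph: b -> e and b -> f, so e and f share the in-neighbor b.
g = {"b": [], "e": ["b"], "f": ["b"]}
scores = simrank(g)
```

In this toy graph, the score of (e, f) converges to C · s(b, b) = 0.8, while any pair involving b stays at zero because b has no in-links, which mirrors the in-links consideration problem discussed above.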

Graph Embedding Methods
In this section, we briefly describe the concept of graph embedding and explain some of the state-of-the-art graph embedding methods (in short, embedding methods) in the literature; their corresponding objective functions are presented in detail in Appendix A.2.
For a given graph G = (V, E), graph embedding aims to learn a function f : V → R^d that maps each node v in the graph into a vector in the d-dimensional space, where d ≪ |V| [23-25,29]. The embedding methods exploit the graph structure to represent nodes as low-dimensional vectors that encode the neighborhood similarity, semantic information, and community structure among nodes in the original graph [22-25]. In the low-dimensional space, each dimension of the vectors can be interpreted as a latent feature [22-24,26]. Figure 2a illustrates Zachary's Karate graph [42], where the clusters found by modularity maximization are shown in different colors. We applied DeepWalk to this graph to obtain its two-dimensional representation, which is shown in Figure 2b; as an example, nodes 1 and 2 in the graph are represented as the vectors <-1.18, 1.15> and <-0.30, 1.10>, respectively. As observed in the figure, there is a commonality between the clusters in the original graph and those in its representation since the vectors encode the social relations and community structure of the graph. Therefore, in order to compute the similarity score of node-pair (a, b) in a given graph, we can calculate the proximity of the corresponding vector representations of a and b, similar to the strategy observed in word analogy detection [25,38].

DeepWalk [23]: inspired by the remarkable achievements in representation learning for natural language processing, such as Skip-gram [38], it exploits a stream of short random walks to extract information from a graph, where a short random walk can be regarded as a neighborhood for a target node v_i consisting of n nodes before and after v_i (i.e., window size W = 2n). DeepWalk tries to learn a model to maximize the probability of any node appearing in v_i's neighborhood without the knowledge of its offset from v_i.
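For instance, the proximity of two representation vectors can be measured with cosine similarity; the following minimal sketch reuses the two-dimensional DeepWalk vectors of nodes 1 and 2 from Figure 2b.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    return dot / (norm_u * norm_v)

# Two-dimensional DeepWalk vectors of nodes 1 and 2 from Figure 2b.
v1 = [-1.18, 1.15]
v2 = [-0.30, 1.10]
sim = cosine(v1, v2)  # this proximity serves as the similarity of the two nodes
```

The resulting score (about 0.86) is treated exactly like a link-based similarity score when ranking nodes against a query.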
Line [25]: it considers the first-order proximity (i.e., direct neighbors of a node) and the second-order proximity to capture the nodes' neighborhoods; the second-order proximity follows the sociology and linguistics theories where two nodes sharing similar neighbors tend to be similar. These proximities are preserved by utilizing two distinct objective functions, where two different models are trained separately and their results are concatenated as the final result.

node2vec [26]: by following the homophily and structural equivalence hypotheses, it considers the community of a node and its structural roles in the graph, respectively, to capture the node neighborhood. node2vec utilizes a biased random walk controlled by two parameters, p (return parameter) and q (in-out parameter). Suppose a random walk has just traversed link (u, v) in a graph; to decide the next step from node v, parameter p controls the probability of revisiting the previous node in the walk (i.e., u), and parameter q controls the probability of visiting a node close to u versus a node that is far from u.

graphGAN [27]: it considers two models, a generator G and a discriminator D, engaged in a minimax game as follows. For a target node v, the generator tries to generate relevant nodes to v's neighborhood (i.e., found by a BFS search rooted at v) and produces fabricated samples to deceive the discriminator, while the discriminator tries to detect whether a node actually belongs to v's neighborhood or is fabricated by the generator. In graphGAN, the objective is to train the two models via a two-player minimax game.

NetMF [29]: it shows that graph embedding methods utilizing the Skip-gram model (e.g., DeepWalk, Line, and node2vec) with negative sampling perform implicit matrix factorization with a closed form.
NetMF draws a theoretical connection between DeepWalk's implicit matrix and graph Laplacians, leading to the construction of a low-rank connectivity matrix for DeepWalk, which is explicitly factorized by the singular value decomposition (SVD) technique [43] to obtain the representation vectors.

ATP [24]: it is a matrix-factorization-based embedding method that tries to preserve the asymmetric transitivity property in community question answering (CQA) graphs (i.e., if question q1 is easier than q2 and q2 is easier than q3, then q1 is easier than q3). ATP constructs graph G′ by removing cycle links from G, and incorporates both the graph reachability (i.e., the transitive closure of G′ [24]) and hierarchy (i.e., the rank of nodes in G′) into a single matrix, which is factorized by a non-negative matrix factorization (NMF) technique [44] to obtain the representation vectors; each node has two corresponding vectors since its role in the graph is regarded as both a source and a target.

BoostNE [30]: it is a matrix-factorization-based embedding method that does not adopt the low-rank assumption for DeepWalk's connectivity matrix M; applying a single NMF to M may yield representations that are insufficient to encode the connectivity patterns among nodes. Inspired by ensemble learning methods [45], BoostNE performs multiple levels of NMF on M to construct the representation vectors.

DWNS [22]: it applies the adversarial training method [46] to DeepWalk in order to improve the robustness and generalization ability of the learning process; it forces the learned classifier to be robust to adversarial examples (i.e., fabricated samples) generated from real ones through small perturbations. The training process is a two-player game where the adversarial samples are generated to maximize the model loss while the embedding vectors are optimized against them by utilizing stochastic gradient descent (SGD) [47].
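The explicit factorization step shared by NetMF and the other matrix-factorization-based methods above can be sketched as follows; the toy matrix M is purely illustrative, standing in for the method-specific connectivity matrix that each approach builds.

```python
import numpy as np

def svd_embed(M, d):
    """Obtain d-dimensional node vectors by explicitly factorizing M with SVD."""
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    # Scale the top-d left singular vectors by the square roots of the
    # singular values so that inner products of the vectors approximate M.
    return U[:, :d] * np.sqrt(S[:d])

# Toy symmetric connectivity matrix (hypothetical); in NetMF, M is instead
# derived from DeepWalk's implicit matrix.
A = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
M = A @ A.T
E = svd_embed(M, 2)  # one 2-dimensional vector per node
```

Because this toy M is symmetric positive semi-definite with rank 2, the two-dimensional vectors reconstruct it exactly; on real graphs the truncation to d dimensions is what produces the low-dimensional representation.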
NERD [28]: it mainly considers two different roles, a source and a target, for each node in the graph and maintains separate embedding spaces for the two distinct roles; based on this consideration, it exploits two distinct neighborhoods for each node and tries to maximize the likelihood of preserving both neighborhoods in their corresponding embedding spaces during the learning process.
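As an illustration of node2vec's biased walk described above, the following sketch samples one step of a second-order walk; the graph and parameter values are hypothetical.

```python
import random

def next_step(graph, prev, cur, p=1.0, q=1.0, rng=random):
    """Sample the next node of a node2vec walk after traversing (prev, cur).

    Unnormalized weights follow node2vec's search bias:
      1/p to return to prev, 1 for common neighbors of prev and cur,
      1/q for nodes farther from prev.
    """
    neighbors = graph[cur]
    weights = []
    for x in neighbors:
        if x == prev:
            weights.append(1.0 / p)   # return parameter p
        elif x in graph[prev]:
            weights.append(1.0)       # node at distance 1 from prev
        else:
            weights.append(1.0 / q)   # in-out parameter q
    return rng.choices(neighbors, weights=weights, k=1)[0]

# Undirected toy graph as an adjacency mapping (hypothetical).
g = {"u": ["v", "w"], "v": ["u", "w", "x"], "w": ["u", "v"], "x": ["v"]}
step = next_step(g, prev="u", cur="v", p=4.0, q=0.5)
```

With p > 1 the walk rarely returns to u, and with q < 1 it prefers nodes far from u, approximating a depth-first exploration; the opposite settings approximate a breadth-first one.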

Experimental Evaluation
In this section, we extensively analyze both the effectiveness (i.e., accuracy) and efficiency (i.e., execution time) of embedding methods (i.e., DeepWalk, Line, node2vec, graphGAN, NetMF, ATP, BoostNE, DWNS, and NERD) in the task of similarity computation of nodes in graphs and compare them with those of similarity measures (i.e., SimRank, JacSim, JPRank, and SimRank*) as the conventional technique for this task. Section 4.1 describes our experimental setup; Section 4.2 presents and analyzes the results.

Experimental Setup
All our experiments are performed on an Intel machine equipped with a six-core 3.60 GHz i5-8600 CPU, 64 GB RAM, and a 64-bit Fedora Core 31 operating system. All required codes are implemented in Python. We employ five well-known and publicly available datasets for our evaluation as follows; Table 1 shows some statistics of our datasets:
• BlogCatalog [23,27,29] is a graph representing social relationships among bloggers. The node labels denote blogger interests inferred through the metadata provided by the bloggers. This graph is fully tagged by 39 different labels.
• Cora [39,40] is a citation graph of academic papers in the area of computer science. The node labels denote the paper's topic (e.g., Networking-Protocols). This graph is fully tagged by 70 different labels.
• TREC [1,17] is a web graph where nodes represent web pages and links represent the hyperlinks between web pages. The node labels denote the web page's topic. This graph is partially tagged (i.e., 127 labeled nodes) by 11 different labels.
• Wikipedia [26,27,29] is a co-occurrence graph of words appearing in the first million bytes of the English Wikipedia dump. The labels represent the inferred Part-of-Speech (POS) tags of words. This graph is fully tagged by 40 different labels.

In the case of undirected graphs, we create two links in both directions. In order to evaluate the effectiveness (i.e., accuracy), we utilize MAP (mean average precision), precision, recall, F-score [37], and PRES [48] as evaluation metrics. In each dataset, we consider the labels as ground truth; for each label l, we use every single node with label l as a query node for a similarity-based search and find those nodes that are considered similar to the query as a result set. If a node in the result set is labeled with l, it is regarded as relevant to the query; otherwise, it is irrelevant. Then, we compute precision, recall, and F-score for that query as follows:

precision = |Rel ∩ Res| / |Res|, recall = |Rel ∩ Res| / |Rel|, F-score = (2 × precision × recall) / (precision + recall),

where Res indicates the query result set and Rel indicates the set of relevant nodes to the query (i.e., the set of all nodes labeled with l); |Rel| and |Res| indicate the sizes of Rel and Res, respectively. In the AP measure, a precision value is computed at each position (rank) in the query result set:

AP = (1 / |Rel|) × Σ_{k=1}^{|Res|} (P@k × Rel(k)),

where P@k indicates the precision at position k, and Rel(k) is set as 1 if the node in position k is regarded as relevant; otherwise, it is set as 0. PRES considers the ranks of the retrieved relevant nodes in a result set and is computed as follows:

PRES = 1 − ((Σ_i r_i / |Rel|) − (|Rel| + 1) / 2) / |Res|,

where r_i is the rank of the i-th relevant node in the result set; for each of those m (i.e., m ≤ |Rel|) nodes that are relevant to the query but not retrieved in the result set, a rank is assigned by starting from the value of (|Res| + |Rel| − x + 1). After computing the AP, precision, recall, F-score, and PRES for all the query nodes with label l, we take their average values to get the MAP, precision, recall, F-score, and PRES values for label l.
Then, we compute the average values of MAP, precision, recall, F-score, and PRES over all labels in the dataset. The aforementioned process is conducted separately for the top-t (t = 5, 10, 20, 30) results; finally, the average accuracy over all values of t (e.g., we calculate the MAP value by taking the average of the four MAP values at top 5, 10, 20, and 30 results) is regarded as the final accuracy for the dataset.
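The per-query metrics above can be sketched as follows; this is a minimal sketch using the standard set-based definitions, with AP normalized by |Rel| (PRES is omitted for brevity), and the example query is hypothetical.

```python
def precision_recall_f(result, relevant):
    """Set-based precision, recall, and F-score for one query result set."""
    hits = sum(1 for n in result if n in relevant)
    p = hits / len(result)
    r = hits / len(relevant)
    f = 0.0 if hits == 0 else 2 * p * r / (p + r)
    return p, r, f

def average_precision(result, relevant):
    """AP: average of P@k over the positions k that hold a relevant node."""
    hits, total = 0, 0.0
    for k, node in enumerate(result, start=1):
        if node in relevant:
            hits += 1
            total += hits / k  # P@k at this relevant position
    return total / len(relevant)

# Hypothetical query: 4 retrieved nodes, 3 relevant nodes overall.
res = ["a", "b", "c", "d"]
rel = {"a", "c", "e"}
p, r, f = precision_recall_f(res, rel)
ap = average_precision(res, rel)
```

Averaging `ap` over all query nodes of a label, and then over all labels, yields the MAP value used in our evaluation.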
We implemented the matrix forms of SimRank, JacSim, JPRank, and SimRank* by applying the default parameter settings suggested by their original work with all datasets.
We set the impact factor C as 0.8 for all four measures. For JacSim, the importance factor α is set as 0.4 by following [1]. For JPRank, both importance factors α1 and α2 are set as 0.4 by following [17]; however, to determine the value of the weighting parameter β, we conducted a very simple and fast experiment by following [17] to find out which of in-links and out-links is more beneficial to similarity computation, as follows. With each dataset, we computed SimRank based on in-links only and on out-links only for four iterations and then compared the accuracy of these two computations; β is set as 0.9 if the similarity computation based on in-links shows better accuracy (i.e., with the DBLP dataset); otherwise, it is set as 0.1 (i.e., with the TREC and Cora datasets). As explained in Section 2, JPRank behaves the same as JacSim with undirected graphs; thus, we do not apply it to the BlogCatalog and Wikipedia datasets.

Results and Analyses
In Section 4.2.1, for each dataset, we find the best iterations on which similarity measures show their highest accuracies. In Section 4.2.2, for each dataset, we find the best values of d (i.e., number of dimensions) for which the embedding methods show their highest accuracies in similarity computation of nodes. In Sections 4.2.3 and 4.2.5, we present an experimental analysis on the effectiveness and efficiency of embedding methods (i.e., based on their best values of d) in comparison with similarity measures (i.e., based on their best iterations), respectively. Section 4.2.4 analyzes the impact of the value of d on the accuracy of embedding methods.

Link-Based Similarity Measures: Best Iterations
We apply the similarity measures to our five datasets on eight iterations; then, for each similarity measure with a dataset, we find the best iteration on which the similarity measure shows its highest accuracy. Figure 3 illustrates the results with our five datasets; as an example, SimRank shows its highest accuracy on iterations 2 and 3 with the BlogCatalog and DBLP datasets, respectively. As already noted in Section 2, JPRank shows exactly the same results as JacSim does with undirected graphs; therefore, we do not apply JPRank to the BlogCatalog and Wikipedia datasets (i.e., 5 × 4 − 2 = 18 experimental cases are conducted). In addition, in the figure, we do not represent the precision metric since the range of its values is higher than those of MAP, recall, PRES, and F-score; it would make the other four metrics plot very close together, thereby decreasing the readability of the figure.

As shown in Figure 3, for all similarity measures, the best iteration is observed before the eighth one in all datasets. Table 2 summarizes the best iterations for all similarity measures with our datasets. Note that, hereafter, when we compare the effectiveness of a similarity measure with those of embedding methods for a dataset, we consider the effectiveness of the similarity measure on its best iteration with that dataset; as an example, in the case of SimRank with the BlogCatalog dataset, we consider its effectiveness on the second iteration according to Table 2.

Graph Embedding Methods: Best Values of d
Now, we apply ATP, BoostNE, DeepWalk, DWNS, graphGAN, Line, NERD, NetMF, and node2vec to our five datasets to obtain the low-dimensional representation vectors of the nodes. Then, to compute the similarity between two nodes in a dataset, we apply Cosine to their corresponding vectors. In order to carefully analyze the impact of the number of dimensions d on the similarity computation of nodes, we set d to different values: 64, 128, 256, and 512.
For each possible combination of methods, datasets, and d values (e.g., ATP with the BlogCatalog dataset when d = 64), we perform the experiment five times and select the best accuracy obtained among these five executions as the final accuracy for that combination. More specifically, we conducted 900 (= 5 × 9 × 5 × 4) different experiments. Finally, similar to the strategy taken in Section 4.2.1, in the case of each embedding method with a dataset, we find the best value of d for which the embedding method shows its highest accuracy in similarity computation of nodes. Figures 4 and 5 illustrate the accuracy of embedding methods with different values of d. The former figure shows the results with the BlogCatalog, Cora, and DBLP datasets, while the latter shows the results with the TREC and Wikipedia datasets; as an example, BoostNE shows its highest accuracy when d is set as 256 and 128 with the BlogCatalog and TREC datasets, respectively. In these figures, we do not represent the precision metric due to the same reason as in Figure 3; the values of this metric with the BlogCatalog, Cora, DBLP, TREC, and Wikipedia datasets for different values of d are represented in Tables A1-A5 in Appendix A, respectively. As already noted in Section 4.1, we cannot apply graphGAN to the Cora and TREC datasets due to their large sizes. Table 3 summarizes the best value of d for all embedding methods with our datasets. Note that, hereafter, when we compare the effectiveness of embedding methods with those of similarity measures for a dataset, we consider the effectiveness of the embedding methods based on their best values of d with that dataset; as an example, in the case of DeepWalk with the BlogCatalog dataset, we consider its effectiveness based on d = 128 according to Table 3.
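The per-query retrieval step described above (ranking the remaining nodes by the cosine proximity of their learned vectors and keeping the top t) can be sketched as follows; the embedding dictionary and node names here are illustrative.

```python
import math

def top_t(embeddings, query, t):
    """Return the t nodes most similar to the query node by cosine proximity."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)

    q = embeddings[query]
    ranked = sorted(
        (n for n in embeddings if n != query),  # exclude the query itself
        key=lambda n: cos(embeddings[n], q),
        reverse=True,
    )
    return ranked[:t]

# Hypothetical 2-dimensional vectors for four nodes.
emb = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
result = top_t(emb, "a", 2)  # the result set evaluated against the labels
```

The returned result set is then scored with the precision, recall, F-score, AP, and PRES metrics of Section 4.1.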

Effectiveness Evaluation
In this section, we analyze the effectiveness (i.e., accuracy) of embedding methods in computing the similarity of nodes and compare it with those of similarity measures with each dataset. As explained in Sections 4.2.1 and 4.2.2, to compare the effectiveness of similarity measures with that of embedding methods for each dataset, we consider their accuracies on their best iterations and best values of d represented in Tables 2 and 3, respectively. Figure 6 illustrates the accuracy of all embedding methods and similarity measures with the BlogCatalog dataset. In this figure, we do not represent the precision metric due to the same reason as in Figure 3; instead, we show the precision values for all methods in Table 4. In addition, for those embedding methods and similarity measures that show comparable accuracies, we write down the values of their corresponding MAP, PRES, recall, and F-score in the figure for a better comparison.

BlogCatalog Dataset
As observed in the figure, with the BlogCatalog dataset, NetMF shows the highest accuracy among all the embedding methods in terms of MAP, precision, recall, PRES, and F-score; however, its accuracy is close to that of DeepWalk, while BoostNE, graphGAN, and NERD show the worst accuracies. SimRank* shows better accuracy than other similarity measures, while SimRank shows the worst accuracy in terms of MAP, precision, recall, PRES, and F-score. Now, by comparing the accuracy of NetMF with that of SimRank*, it is observed that NetMF outperforms SimRank* by 73.44%, 35.68%, 42.89%, 50.00%, and 47.10% in terms of MAP, precision, recall, PRES, and F-score, respectively; Table 5 shows the percentage of improvements in accuracy obtained by NetMF over all other methods with the BlogCatalog dataset.   Figure 7 illustrates the accuracy of all embedding methods and similarity measures with the Cora dataset and Table 6 shows the precision values for all methods.

Cora Dataset
As observed in the figure, with the Cora dataset, although DeepWalk and Line show the best accuracy among all embedding methods, the difference between their accuracies is not tangible; Line outperforms DeepWalk in terms of recall, PRES, and F-score, while DeepWalk outperforms Line in terms of MAP and precision. NetMF and BoostNE show the worst accuracy among embedding methods in terms of all metrics. In the case of similarity measures, JPRank shows the best accuracy, and SimRank again shows the worst one in terms of MAP, precision, recall, PRES, and F-score. Now, by comparing the accuracy of Line with that of JPRank, it is observed that JPRank slightly outperforms Line by 4.05%, 5.73%, 2.28%, 2.98%, and 3.05% in terms of MAP, precision, recall, PRES, and F-score, respectively; Table 7 shows the percentage of improvements in accuracy obtained by JPRank over all other methods with the Cora dataset. Figure 8 illustrates the accuracy of all embedding methods and similarity measures with the DBLP dataset, and Table 8 shows the precision values for all methods.

DBLP Dataset
As observed in the figure, DeepWalk shows the best accuracy among all embedding methods, while NetMF shows the worst one in terms of MAP, precision, recall, PRES, and F-score. Among similarity measures, JPRank shows the best accuracy, and it is close to that of JacSim, while SimRank shows the worst accuracy in terms of all metrics. Now, by comparing the accuracy of DeepWalk with that of JPRank, it is observed that JPRank outperforms DeepWalk by 64.89%, 50.34%, 51.70%, 52.71%, and 51.11% in terms of MAP, precision, recall, PRES, and F-score, respectively; Table 9 shows the percentage of improvements in accuracy obtained by JPRank over all other methods with the DBLP dataset.

Figure 9 illustrates the accuracy of all embedding methods and similarity measures with the TREC dataset, and Table 10 shows the precision values for all methods.

TREC Dataset
As observed in the figure, DeepWalk and NetMF show the best accuracy among all embedding methods in terms of MAP, precision, recall, PRES, and F-score. However, the difference between their accuracies is not tangible; NetMF outperforms DeepWalk in terms of precision, PRES, and F-score, while DeepWalk shows better accuracy in terms of MAP and recall. NERD shows the worst accuracy among all embedding methods. SimRank* shows the best accuracy among all similarity measures, while SimRank shows the worst one in terms of all metrics. Now, by comparing the accuracy of NetMF with that of SimRank*, it is observed that they show very close accuracies; SimRank* outperforms NetMF in terms of precision, recall, and F-score, while NetMF outperforms SimRank* in terms of MAP and PRES. Table 11 shows the percentage of improvements in accuracy obtained by SimRank* over all other methods with the TREC dataset.

Figure 10 illustrates the accuracy of all embedding methods and similarity measures with the Wikipedia dataset, and Table 12 shows the precision values for all methods.

Wikipedia Dataset
As observed in the figure, DeepWalk shows the highest accuracy among embedding methods in terms of MAP, precision, recall, PRES, and F-score; however, its accuracy is close to that of DWNS, while graphGAN shows the worst accuracy. Among similarity measures, JacSim shows the best accuracy, while SimRank shows the worst one in terms of all metrics. Now, by comparing the accuracy of DeepWalk with that of JacSim, it is observed that JacSim outperforms DeepWalk by 140.20%, 34.91%, 74.16%, 83.45%, and 66.55% in terms of MAP, precision, recall, PRES, and F-score, respectively; Table 13 shows the percentage of improvements in accuracy obtained by JacSim over all other methods with the Wikipedia dataset.

Impact of the Number of Dimensions
In this section, we analyze whether increasing the number of dimensions improves the effectiveness of embedding methods in computing the similarity of nodes. In Section 4.2.2, Figure 4 illustrates the accuracy of all embedding methods with the BlogCatalog, Cora, and DBLP datasets for different values of d; in addition, Figure 5 illustrates the results of the same experiments with the TREC and Wikipedia datasets. As observed in these figures, in some cases, such as DWNS and NetMF with the BlogCatalog dataset, increasing the number of dimensions improves the accuracy of the embedding methods (i.e., refer to Figure 4); on the contrary, in some cases, such as Line with the Wikipedia dataset (i.e., refer to Figure 5) and graphGAN with the BlogCatalog dataset (i.e., refer to Figure 4), increasing the number of dimensions adversely affects the accuracy of the embedding methods. In addition, in some cases, such as DeepWalk with the DBLP dataset (i.e., refer to Figure 4) and node2vec with the TREC dataset (i.e., refer to Figure 5), we observe both improvement and reduction in accuracy by increasing the number of dimensions. In summary, increasing the value of d does not necessarily help improve the accuracy of embedding methods in computing the similarity of nodes.
In Section 4.2.2, Table 3 indicates the value of d showing the best accuracy for each embedding method with our datasets.

As represented in Table 3, some embedding methods show their best accuracy when d = 64 or d = 512; we call such a case a suspicious one since it may be possible to improve the accuracy of the embedding method by assigning a lower (i.e., 32) or higher (i.e., 1024) value to d, respectively. However, if the accuracy of a suspicious case is not comparable with that of the best method on the dataset, conducting the aforementioned experiment is not beneficial since our overall observations will not be affected by the new result; for example, although ATP with the BlogCatalog dataset shows its highest accuracy when d = 512 as a suspicious case, its accuracy is considerably lower than that of NetMF as the best method for the same dataset (refer to Figure 6). In Table 3, there are only the four following real suspicious cases that we need to consider: both DeepWalk and Line show their highest accuracies when d = 512 with the Cora dataset, and their accuracies are very close to that of JPRank as the best method with Cora (i.e., refer to Figure 7). In addition, DeepWalk and NetMF show their highest accuracies when d = 64 and d = 512 with the TREC dataset, respectively; their accuracies are very close to that of SimRank* as the best method with TREC (i.e., refer to Figure 9). Therefore, we conduct the following four new experiments:
1. DeepWalk with the Cora dataset and d = 1024
2. Line with the Cora dataset and d = 1024
3. DeepWalk with the TREC dataset and d = 32
4. NetMF with the TREC dataset and d = 1024
Table 14 represents the accuracies of the four new experiments (i.e., in bold face) along with the accuracies of their corresponding suspicious cases. In case 1, DeepWalk shows the same accuracy as it does when d = 512; in addition, in cases 2 and 4, Line and NetMF show accuracies similar to those when d = 512, respectively. In case 3, DeepWalk shows lower accuracy in comparison with d = 64. Therefore, we do not need to apply any changes to our results represented in Table 3.

Efficiency Evaluation
In this section, we carefully analyze the efficiency (i.e., execution time) of embedding methods in computing the similarity of nodes and compare it with that of similarity measures.

Link-Based Similarity Measures
In order to conduct a fair comparison, we implemented the matrix forms of JacSim, JPRank, SimRank*, and SimRank without applying any acceleration techniques such as multi-processing (as the simplest technique), fine-grained memoization [9], partial-sums memoization [49], and backward local push and Monte Carlo sampling [50]. Since the execution time could slightly change depending on system resources such as CPU load, to obtain an accurate execution time, we ran each similarity measure for eight iterations five times with each dataset and regarded the average run time over the five executions as the final execution time of the similarity measure. Note that we consider only the elapsed time to compute the similarity scores as the execution time; the time required to store the results of the similarity computation in a file or a database is not considered. Table 15 shows the execution times (in minutes) of the similarity measures with our five datasets (as already explained in Section 4.2.1, we do not apply JPRank to the undirected graphs BlogCatalog and Wikipedia). With all datasets, SimRank* shows the best efficiency since it requires only one matrix multiplication in Equation (A8). With the undirected datasets (i.e., BlogCatalog and Wikipedia), JacSim shows the worst efficiency since it requires two matrix multiplications and a pairwise normalization paradigm to compute matrix E in Equation (A6). With the directed datasets (i.e., Cora, DBLP, and TREC), JPRank shows the worst efficiency since it requires four matrix multiplications and two pairwise normalization paradigms to compute matrices E and E′ in Equation (A10). However, among all the available cases in Table 15, JacSim with the BlogCatalog dataset shows the worst efficiency, although BlogCatalog has fewer nodes than the Cora, DBLP, and TREC datasets.
The reason is that there are 32,787,165 node-pairs with non-empty common in-link sets in this dataset, which makes the calculation of matrix E expensive; the numbers of such node-pairs in the Cora, DBLP, TREC, and Wikipedia datasets are 229,306, 466,990, 1,391,293, and 11,015,803, respectively.
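The pair count above can be obtained directly from the adjacency matrix: column j of A encodes the in-link set I_j, so the entry (AᵀA)[a, b] equals |I_a ∩ I_b|. The following sketch counts node-pairs with non-empty common in-link sets (the function name and toy graph are illustrative, not from the paper):

```python
import numpy as np
from scipy.sparse import csr_matrix

def count_common_inlink_pairs(A):
    """Count unordered node-pairs (a, b), a != b, whose common
    in-link set I_a ∩ I_b is non-empty. A[i, j] = 1 iff i -> j,
    so (A^T @ A)[a, b] = |I_a ∩ I_b|."""
    overlap = (A.T @ A).tocoo()
    # Keep strictly upper-triangular nonzeros: each unordered pair once.
    return int(np.sum((overlap.row < overlap.col) & (overlap.data > 0)))

# Toy graph: nodes 0 and 1 both point to nodes 2 and 3.
A = csr_matrix(np.array([[0, 0, 1, 1],
                         [0, 0, 1, 1],
                         [0, 0, 0, 0],
                         [0, 0, 0, 0]]))
print(count_common_inlink_pairs(A))  # only pair (2, 3) shares in-links -> 1
```

Since the cost of computing matrix E grows with this count rather than with the number of nodes, a dense graph such as BlogCatalog can be the most expensive case despite having the fewest nodes.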

Graph Embedding Methods
For embedding methods, the execution time is regarded as the sum of a learning time (i.e., the elapsed time to construct the low-dimensional representation vectors) and a similarity computation time (i.e., the elapsed time to compute the similarity scores of all pairs of representation vectors by employing Cosine). With each dataset, the learning time of an embedding method is regarded as the average run time over five executions of the method. We implemented Cosine based on a matrix/vector multiplication technique, which is significantly (i.e., almost 30 times) faster than its conventional implementation. For the similarity computation time, we consider only the elapsed time to compute the similarity scores by applying Cosine; as with the similarity measures, the time required to store the results in a file or a database is not considered. Table 16 represents the learning times (in minutes) of all embedding methods for different values of d with all datasets, where bold-face numbers indicate the best efficiency for each value of d in a dataset. As observed in the table, NetMF shows the best efficiency among all embedding methods with the BlogCatalog and Wikipedia datasets regardless of the value of d, while node2vec shows the best efficiency with the Cora, DBLP, and TREC datasets regardless of the value of d; BoostNE and ATP mostly show the next-best efficiency after NetMF and node2vec with all datasets, and graphGAN shows the worst efficiency among all embedding methods with all datasets regardless of the value of d. Table 17 shows the similarity computation times (in minutes) based on different vector sizes (i.e., d = 64, 128, 256, 512) with our five datasets.
Note that the similarity computation time for a dataset depends on the number of node-pairs and the size of the representation vectors (i.e., the value of d); for example, with the BlogCatalog dataset, the time required to compute Cosine for all node-pairs with vector size 64 obtained by any embedding method (except ATP and NERD) is 4.07 min. In the case of ATP and NERD with any value of d, the similarity computation time in Table 17 is multiplied by two since these methods construct two vectors for each node (i.e., target and source vectors), where we apply Cosine to the corresponding target vectors and source vectors of a node-pair separately; finally, the highest score is regarded as the final similarity score of the node-pair.

In order to easily compare the efficiency of all embedding methods at a glance, Figure 11 illustrates their execution times (i.e., the sum of the learning time and the similarity computation time) with all datasets; we excluded graphGAN since its execution time is considerably larger than those of the other embedding methods. For example, with the BlogCatalog dataset when d = 64, the execution time of ATP is 11.83 as the sum of 3.69 (i.e., the learning time from Table 16) and 2 × 4.07 (i.e., twice the similarity computation time from Table 17; as explained before, for simplicity, we regard the similarity computation time as twice that in Table 17, although for ATP with BlogCatalog when d = 64 the real Cosine calculation time is 8.33 (≃2 × 4.07)).
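The matrix-multiplication implementation of Cosine mentioned above can be sketched as follows: row-normalizing the embedding matrix once reduces all-pairs Cosine to a single matrix product, and the two-role variant mirrors the max-of-two-scores rule applied to ATP and NERD (function names are illustrative):

```python
import numpy as np

def cosine_all_pairs(X):
    """All-pairs Cosine via one matrix multiplication:
    row-normalize X (|V| x d), then similarities = X_hat @ X_hat.T."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    X_hat = X / np.clip(norms, 1e-12, None)   # guard against zero vectors
    return X_hat @ X_hat.T

def cosine_two_role(target, source):
    """For methods producing separate target/source vectors per node
    (e.g., ATP, NERD): score each pair in both spaces, keep the max."""
    return np.maximum(cosine_all_pairs(target), cosine_all_pairs(source))

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
S = cosine_all_pairs(X)
print(np.allclose(np.diag(S), 1.0))  # every vector is most similar to itself
```

This also makes the dependence on d visible: the dominant cost is a |V| × d by d × |V| product, so doubling d roughly doubles the similarity computation time, consistent with Table 17.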

Efficiency Comparison
In order to make a meaningful comparison, for each of our five datasets, we compare the efficiency of the best embedding method with that of the best similarity measure from Section 4.2.3, as follows:
• BlogCatalog: as observed in Figure 6, NetMF with d = 256 (i.e., refer to Table 3) and SimRank* are the best embedding method and similarity measure, respectively, showing the highest accuracy. The execution time of NetMF with d = 256 is 6.76 (i.e., 1.85 from Table 15); SimRank* shows almost 35% better efficiency than NetMF.
• Cora: as observed in Figure 7, Line (i.e., with d = 512) and JPRank are the best embedding method and similarity measure, respectively. The execution time of Line when d = 512 is 60.31 (i.e., 34.62 + 25.69), while the execution time of JPRank with this dataset is 8.22; JPRank is almost 7.3 times more efficient than Line.
• DBLP: as observed in Figure 8, DeepWalk (i.e., with d = 128) and JPRank show the highest accuracies among embedding methods and similarity measures, respectively. The execution time of DeepWalk when d = 128 is 26.33 (i.e., 6.46 + 19.87) and that of JPRank is 8.41, which means that JPRank is 3.1 times more efficient than DeepWalk.
• TREC: as observed in Figure 9, NetMF with d = 512 and SimRank* are the best embedding method and similarity measure, respectively. The execution time of the former is 109.52 (i.e., 26.46 + 83.06) and that of the latter is only 1.45, which means SimRank* is significantly faster than NetMF.
• Wikipedia: as observed in Figure 10, DeepWalk with d = 64 and JacSim show the best accuracies among embedding methods and similarity measures, respectively. The execution time of DeepWalk when d = 64 is 17.10 (i.e., 16.20 + 0.90) and the execution time of JacSim is 55.29, which means DeepWalk is almost 3.2 times faster than JacSim.

Discussion
Based on the results of our extensive experiments with embedding methods and similarity measures in Sections 4.2.3 and 4.2.5, we observe that the latter technique is better than the former one to compute the similarity of nodes in graphs for the following reasons.
• First, similarity measures outperform embedding methods in terms of effectiveness with all datasets except the BlogCatalog dataset, where NetMF (i.e., with d = 256) shows better accuracy than SimRank*; however, its efficiency is 54% lower than that of SimRank* with this dataset.
• Second, similarity measures show better efficiency with all datasets except the Wikipedia dataset, where DeepWalk (i.e., with d = 64) is almost 3.2 times faster than JacSim; however, for this dataset, JacSim shows better effectiveness than DeepWalk and significantly outperforms it in terms of all five evaluation metrics, as observed in Table 13.
• Third, similarity measures have far fewer parameters than embedding methods, thereby leading to a simpler parameter tuning process to possibly obtain better accuracy; for example, JPRank has only three parameters, α1, α2, and β in Equation (A10), while DeepWalk has six parameters: the window size, walk length, number of dimensions, number of walks, size of training data, and learning rate.
In addition to the above findings, we observed that DeepWalk and its variants (i.e., Line and NetMF) show better effectiveness than other embedding methods in the task of similarity computation of nodes in graphs. Furthermore, it is shown that increasing the value of d (i.e., number of dimensions) does not help improve the accuracy of embedding methods in computing the similarity of nodes.

Conclusions
Embedding methods aim to represent each node in a given graph as a low-dimensional vector while preserving the neighborhood similarity, semantic information, and community structure of the nodes in the original graph. The dimensions of the low-dimensional vectors can be interpreted as latent features, and the obtained vectors can be employed to compute the similarity of nodes in the graph. In this paper, we evaluated and compared both the effectiveness and efficiency of embedding methods in the task of computing the similarity of nodes in graphs with those of link-based similarity measures by conducting extensive experiments with five datasets. We observed the following findings based on the results of our experiments. The similarity measures outperform embedding methods in terms of effectiveness with all datasets except the BlogCatalog dataset, where DeepWalk and NetMF show the best accuracy. The similarity measures are more efficient than embedding methods in the similarity computation of nodes with all datasets except the Wikipedia dataset, where DeepWalk shows better efficiency. Overall, similarity measures are the better choice for computing the similarity of nodes in graphs.

Appendix A.1
SimRank [11]: for a given graph G = (V, E), where V represents a set of nodes and E ⊆ (V × V) is a set of links among nodes, the SimRank score of a node-pair (a, b) is defined as follows:

S(a, b) = (C / (|I_a|⋅|I_b|)) ⋅ Σ_{i ∈ I_a} Σ_{j ∈ I_b} S(i, j), (A1)

where I_a is the set of nodes directly pointing to node a, |I_a| is the size of I_a, and C ∈ (0, 1) is a damping factor. If I_a = ∅ or I_b = ∅, S(a, b) = 0. Equation (A1) is a recursive formula initialized by S_0(a, b) = 1 if a = b; S_0(a, b) = 0, otherwise. For k = 1, 2, ..., we have

S_k(a, b) = (C / (|I_a|⋅|I_b|)) ⋅ Σ_{i ∈ I_a} Σ_{j ∈ I_b} S_{k−1}(i, j), (A2)

where on each iteration k, S_k(a, b) is computed based on the similarity scores obtained in the previous iteration k − 1.
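As a concrete illustration of the recursion, the following sketch computes iterative SimRank with dense NumPy matrices; using the column-normalized adjacency matrix reduces each iteration to two matrix multiplications (a simplification without the acceleration techniques cited earlier; names and the toy graph are illustrative):

```python
import numpy as np

def simrank(A, C=0.8, iterations=8):
    """Iterative SimRank: S_k(a, b) = C / (|I_a||I_b|) * sum over
    (i, j) in I_a x I_b of S_{k-1}(i, j), with S(a, a) = 1.
    With the column-normalized adjacency matrix Q, each iteration
    reduces to S <- C * Q^T S Q followed by resetting the diagonal."""
    n = A.shape[0]
    in_deg = A.sum(axis=0)
    # Column j of Q holds 1/|I_j| for each in-neighbor of j (0 if I_j is empty).
    Q = np.divide(A, in_deg, out=np.zeros_like(A, dtype=float),
                  where=in_deg > 0)
    S = np.eye(n)
    for _ in range(iterations):
        S = C * (Q.T @ S @ Q)
        np.fill_diagonal(S, 1.0)  # S(a, a) = 1 by definition
    return S

# Toy graph: node 0 points to both 1 and 2, so I_1 = I_2 = {0}.
A = np.array([[0., 1., 1.],
              [0., 0., 0.],
              [0., 0., 0.]])
S = simrank(A)
print(round(S[1, 2], 4))  # C * S(0, 0) = 0.8
```

Each iteration costs two |V| × |V| matrix products, which is why the closed matrix forms discussed next are preferred over naive per-pair summation.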
The iterative form of SimRank can be transformed into a closed matrix form [51,52], which is considerably faster than the original iterative form. Let S ∈ R^(|V|×|V|) be a similarity matrix whose entry [S]_{a,b} denotes S(a, b); then, SimRank scores are computed as follows:

S = C ⋅ Q^T ⋅ S ⋅ Q + (1 − C) ⋅ I, (A3)

where Q ∈ R^(|V|×|V|) is a column-normalized adjacency matrix whose entry [Q]_{a,b} = 1/|I_b| if a points to b; [Q]_{a,b} = 0, otherwise. Q^T is the transpose of Q, I ∈ R^(|V|×|V|) is an identity matrix, and the term (1 − C) ⋅ I guarantees that the main diagonal entries in S are always maximum. The recursive computation of the matrix form starts with S_0 = I; for k = 1, 2, ..., we have

S_k = C ⋅ Q^T ⋅ S_{k−1} ⋅ Q + (1 − C) ⋅ I. (A4)

JacSim [1]: it has both iterative and matrix forms; we explain its matrix form since it is more efficient than the iterative form while their accuracies are comparable [1]. Let JS be the similarity matrix; then,

JS = α ⋅ J + (1 − α) ⋅ C ⋅ E, (A5)

where α ∈ (0, 1) is an importance factor to control the degree of importance of the Jaccard score and the one computed by the pairwise normalization paradigm, and J ∈ R^(|V|×|V|) is a matrix whose entry [J]_{a,b} denotes the Jaccard score of (a, b). E ∈ R^(|V|×|V|) is a matrix whose entry [E]_{a,b} denotes the summation of the JacSim scores of all node-pairs between (I_a ∩ I_b) and itself, normalized by the value |I_a|⋅|I_b| (i.e., Equation (A6)). The recursive computation starts with JS_0 = I; for k = 1, 2, ..., we have

JS_k = α ⋅ J + (1 − α) ⋅ C ⋅ E_{k−1}, (A7)

where E_{k−1} is computed from JS_{k−1}.

SimRank* [9]: let S* be the SimRank* similarity matrix; then,

S* = (C/2) ⋅ (Q^T ⋅ S* + S* ⋅ Q) + (1 − C) ⋅ I, (A8)

where only one matrix multiplication is required since S* is a symmetric matrix and S* ⋅ Q is identical to the transpose of Q^T ⋅ S*. The recursive computation starts with S*_0 = I; for k = 1, 2, ..., we have

S*_k = (C/2) ⋅ (Q^T ⋅ S*_{k−1} + S*_{k−1} ⋅ Q) + (1 − C) ⋅ I. (A9)

JPRank [17]: it has been proposed in both iterative and matrix forms; here, we explain its matrix form since it is more efficient than the iterative form while their accuracies are comparable [17].
Let JP be the JPRank similarity matrix; then,

JP = β ⋅ (α_1 ⋅ J + (1 − α_1) ⋅ C ⋅ E) + (1 − β) ⋅ (α_2 ⋅ J′ + (1 − α_2) ⋅ C ⋅ E′), (A10)

where β ∈ [0, 1] is a weighting parameter for in-links and out-links, and α_1 and α_2 are used to control the degree of importance of the Jaccard score and the one computed by the pairwise normalization paradigm based on in-links and out-links, respectively; P ∈ R^(|V|×|V|) is a row-normalized adjacency matrix, and the out-link counterparts J′ and E′ are defined analogously to J and E based on the out-link sets, normalized by the values |O_a|⋅|O_b|. The recursive computation of JPRank starts with JP_0 = I for k = 1, 2, ....

Appendix A.2
graphGAN [27]: it trains a generator G and a discriminator D in an adversarial setting, where θ_D and θ_G are the union of all vector representations of nodes u constructed by D and G, respectively. It employs a negative sampling technique where nodes truly connected to v are used as positive samples and some fabricated nodes are used as negative ones; D is implemented by a softmax function, while G is implemented by a graph softmax function [27].
NetMF [29]: it proposes the following low-rank connectivity matrix for DeepWalk:

M = (S / (s ⋅ W)) ⋅ (Σ_{r=1}^{W} P^r) ⋅ D^{−1},

where S denotes the volume of G (i.e., the summation of all entries in A as the adjacency matrix of G), s is the number of negative samples, W is the window size, D is a diagonal matrix containing the values d_1, ..., d_|V| as its diagonal entries, where d_i is the degree of node v_i, and P is defined as D^{−1} ⋅ A. Finally, a non-negative matrix factorization (NMF) technique [44] is applied to matrix M to generate a low-rank approximation M ≈ S ⋅ T (i.e., S ∈ R^(|V|×d) and T ∈ R^(d×|V|)), where row i of matrix S and column i of matrix T contain the representation vectors of node v_i when its role in the graph is regarded as a source and a target, respectively.
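Under the definitions above, the construction of M can be sketched as follows. This is a minimal dense sketch: the element-wise truncated logarithm from the NetMF paper is included, but truncated SVD is used in place of NMF purely to keep the example short, and all names are illustrative:

```python
import numpy as np

def netmf_embedding(A, window=2, neg=1, dim=4):
    """Sketch of NetMF's matrix construction:
    M = (vol(G) / (neg * window)) * (sum_{r=1..window} P^r) * D^{-1},
    with P = D^{-1} A, followed by a truncated logarithm and a
    factorization (SVD here instead of NMF, for brevity)."""
    d = A.sum(axis=1)
    D_inv = np.diag(1.0 / d)          # assumes no isolated nodes
    P = D_inv @ A
    vol = A.sum()
    power_sum = sum(np.linalg.matrix_power(P, r)
                    for r in range(1, window + 1))
    M = (vol / (neg * window)) * power_sum @ D_inv
    log_M = np.log(np.maximum(M, 1.0))  # element-wise truncated logarithm
    U, s, _ = np.linalg.svd(log_M)
    return U[:, :dim] * np.sqrt(s[:dim])

# Toy undirected ring on 4 nodes.
ring = np.array([[0., 1., 0., 1.],
                 [1., 0., 1., 0.],
                 [0., 1., 0., 1.],
                 [1., 0., 1., 0.]])
emb = netmf_embedding(ring)
```

Because the whole pipeline is matrix arithmetic plus one factorization, NetMF avoids sampling random walks, which is consistent with its strong learning-time results in Table 16.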
BoostNE [30]: it performs multiple levels of NMF, with the objective function minimized over U_l, V_l ≥ 0 for l = 1, ..., k, where k denotes the number of levels and U_l and V_l are the non-negative factor matrices of level l.

DWNS [22]: the objective function is as follows:

L(G | Θ) + λ ⋅ L_adv(G | Θ + n_adv), (A19)

n_adv = ε ⋅ g / ∥g∥_2,

where L and L_adv denote the loss function of DeepWalk and the adversarial training regularizer, respectively; λ is a parameter to control the importance of the regularization term, Θ denotes the model parameters, Θ′ denotes the current model parameters, n_adv denotes the adversarial perturbation, ε denotes the adversarial noise level, g denotes the gradient of the loss with respect to Θ′, and ⇀u is the vector representation of node v.
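The adversarial perturbation used by DWNS simply rescales the gradient to a fixed noise level ε. A minimal sketch, with the gradient g assumed to be given (the function name and values are illustrative):

```python
import numpy as np

def adversarial_perturbation(g, eps=0.9):
    """DWNS-style perturbation: scale the gradient g of the loss
    w.r.t. the current parameters so that n_adv = eps * g / ||g||_2."""
    norm = np.linalg.norm(g)
    if norm == 0.0:
        return np.zeros_like(g)   # no direction to perturb along
    return eps * g / norm

g = np.array([3.0, 4.0])          # toy gradient, ||g||_2 = 5
n_adv = adversarial_perturbation(g, eps=1.0)
print(np.linalg.norm(n_adv))      # 1.0
```

Normalizing by ∥g∥₂ decouples the perturbation magnitude from the raw gradient scale, so ε alone controls how aggressively the regularizer perturbs the embeddings.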
NERD [28]: for two nodes u and v with roles r_1 and r_2, respectively, NERD tries to find representations f_{r_1}(u) and f_{r_2}(v) by utilizing ASGD [53] to maximize its objective function, where σ denotes the sigmoid function, P_{n_{r_2}}(v) denotes the in-degree (if r_2 is a target role) or out-degree (if r_2 is a source role) noise distribution, and s is the number of negative examples.

Appendix A.3
As explained in Section 4.2.2, we did not represent the precision metric in