Fair Benchmark for Unsupervised Node Representation Learning

Abstract: Most machine-learning algorithms assume that instances are independent of each other. This does not hold for networked data. Node representation learning (NRL) aims to learn low-dimensional vectors to represent nodes in a network, such that actionable patterns in topological structures and side information can be preserved. The widespread availability of networked data, e.g., social media, biological networks, and traffic networks, along with plentiful applications, facilitates the development of NRL. However, it has become challenging for researchers and practitioners to track the state-of-the-art NRL algorithms, given that they were evaluated using different experimental settings and datasets. To this end, in this paper, we focus on unsupervised NRL and propose a fair and comprehensive evaluation framework to systematically evaluate state-of-the-art unsupervised NRL algorithms. We comprehensively evaluate each algorithm by applying it to three evaluation tasks, i.e., classification fine-tuned via a validation set, link prediction fine-tuned in the first run, and classification fine-tuned via link prediction. In each task and each dataset, all NRL algorithms were fine-tuned using a random search within a fixed amount of time. Based on the results for three tasks and eight datasets, we evaluate and rank thirteen unsupervised NRL algorithms.


Introduction
Networks are widely adopted to represent relations between objects in various disciplines. For instance, we leverage networks to study social ties [1], financial transactions [2], word co-occurrence [3], and protein-protein interactions [4]. When the network is small, traditional algorithms in graph theory can be used to analyze the network. Examples include clustering based on normalized cuts and node classification on the basis of label propagation. However, learning and prediction tasks in real-world networked systems have become increasingly complex, such as recommendations and advertising on large social networks with multifarious user-generated content [5], the completion of knowledge graphs with millions of entities [6], and fraud detection in financial transaction networks [7]. Hence, traditional graph theory cannot directly satisfy the demands of real-world network analysis tasks.
To fill this gap and utilize off-the-shelf machine-learning algorithms in large and sophisticated networks, node representation learning (NRL) [1,8] has been extensively investigated. Machine-learning algorithms often assume by default that instances are independent of each other and that their features exist in Euclidean space. However, these assumptions do not hold for networked data, because networks usually depict node-to-node dependencies and exist in non-Euclidean space. The goal of NRL is to learn a low-dimensional vector to represent each node, such that actionable patterns in the original networks and side information can be well-preserved in the learned vectors [9,10]. Furthermore, these low-dimensional representations can be directly leveraged by common machine-learning algorithms, as feature vectors or hidden layers, to perform a wide variety of tasks, such as node classification [11,12], anomaly detection [13], community detection [14], and link prediction [15,16].
In the early 2000s, as illustrated in Figure 1, NRL was known as graph embedding [17][18][19], which was mainly used to perform dimensionality reduction for feature vectors. Its core idea can be summarized as two steps. First, given a set of instances, such as documents or images of faces, associated with feature vectors, we construct a new graph to connect all instances. Its (i, j)-th edge weight is set as the shortest path distance [18], reconstruction coefficient [20,21], or similarity [17] between nodes i and j in the feature vector space. Second, we apply multidimensional scaling [18] or eigendecomposition to the adjacency matrix or its variation [17,20]. In such a way, the learned node representations can be employed as the latent features of instances. It has been proven that most dimensionality reduction techniques, such as principal component analysis and linear discriminant analysis, can be reformulated as graph-embedding processes with constraints [19]. Meanwhile, efforts have also been devoted to utilizing NRL as an intermediate process to conduct clustering [22][23][24], known as spectral clustering. Its key idea is similar to graph embedding [19]. Based on feature vectors, we construct a new network depicting similarities between instances, and then calculate the eigenvectors of the graph Laplacian matrix of this new network. The learned eigenvectors serve as the latent representations of instances. To predict the clusters, a traditional clustering algorithm such as k-means is applied to the eigenvectors.
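The two-step idea above (build a similarity graph from feature vectors, then eigendecompose its Laplacian and cluster the eigenvectors) can be sketched in a few lines. The snippet below is a minimal illustration, not an implementation from this paper; the Gaussian-kernel similarity, the function name `spectral_clustering`, and the bandwidth parameter `sigma` are our own choices for exposition.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(X, n_clusters, sigma=1.0):
    """Two-step sketch: similarity graph -> Laplacian eigenvectors -> k-means."""
    # Step 1: build a similarity graph over the instances (Gaussian kernel here).
    sqdist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sqdist / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Step 2: eigendecompose the normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)
    # The eigenvectors of the smallest eigenvalues serve as latent representations.
    Z = eigvecs[:, :n_clusters]
    # A traditional clustering algorithm (k-means) is applied to the eigenvectors.
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(Z)
```

On two well-separated groups of feature vectors, the similarity matrix becomes nearly block-diagonal and the leading eigenvectors act as cluster indicators.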
Later, since the mid-2000s, with the rapid development of the web, especially social-networking sites, there has been an increase in the types and numbers of real-world networks available, such as Facebook, Flickr, and Twitter. NRL has been employed as an intermediate step when performing node classification [12,25,26] and visualizations [27] in real-world networks, including online social networks such as Flickr [26] and BlogCatalog [28], academic networks [25], molecule structures [27], and linked web pages [12]. With the boom in web-based networks and prediction tasks performed on them, node representation learning, also known as network embedding, was formally defined [1,10,29] in 2014 and has attracted intensive attention in recent years [30,31]. Since real-world networks are often large, alongside the major goal, i.e., preserving the topological structure information, scalability is another common objective of NRL algorithms [8,32]. The availability of networked data boosts the development of various NRL algorithms [33][34][35], because different practical scenarios call for distinct and appropriate NRL solutions. However, it has become challenging to track the state-of-the-art node representation learning algorithms, given that numerous methods are available in the literature. When an NRL algorithm is proposed, only a few baseline methods and datasets are included in the empirical evaluation. Moreover, different NRL studies may use different experimental settings, such as evaluation tasks, hardware environments, and ways of hyperparameter tuning. Therefore, the lack of fair and comprehensive evaluation of state-of-the-art NRL algorithms is preventing researchers and practitioners from tracking the effective NRL algorithms for specific scenarios.
In this paper, we focus on unsupervised node representation learning. There are three major challenges in developing a fair and comprehensive evaluation framework for unsupervised NRL. First, it is difficult to guarantee fair comparisons. The running times of different NRL algorithms vary significantly. These algorithms also have different numbers of hyper-parameters and search spaces. A fair evaluation framework should take the running time and hyperparameter tuning into consideration. Second, tailored evaluation tasks are needed to comprehensively assess each NRL algorithm. Node classification and link prediction are two widely adopted downstream tasks when evaluating NRL algorithms. However, for each task, there are many ways to tune hyperparameters. Third, as different NRL algorithms are originally implemented in distinct systems, it is not easy to integrate them into a unified environment for a fair comparison.
To bridge this gap, in this paper, we develop a fair and comprehensive evaluation framework for unsupervised node representation learning (FURL). Through the development of FURL, we aim to answer two research questions. (1) How can we fairly evaluate state-of-the-art unsupervised NRL algorithms, given that they have different efficiencies and distinct search spaces for hyperparameters? (2) On the basis of a fair evaluation protocol, how can we comprehensively evaluate each unsupervised NRL algorithm? Our contributions can be summarized as follows.
• We develop a novel evaluation framework-FURL, which can fairly and comprehensively evaluate unsupervised NRL algorithms.
• We enable fair comparisons by performing a random search and allowing a fixed amount of time to fine-tune each algorithm.
• We enable comprehensiveness by using three tailored evaluation tasks, i.e., classification fine-tuned via a validation set, link prediction fine-tuned in the first run, and classification fine-tuned via link prediction.
• We perform empirical studies by utilizing eight datasets and thirteen state-of-the-art unsupervised NRL algorithms. It is straightforward to extend these to more unsupervised NRL algorithms and datasets. Based on the systematic results from three tasks and eight datasets, we assess the overall rank of all thirteen algorithms.

Fair and Comprehensive Evaluation Algorithm-FURL
To help researchers and practitioners track the state-of-the-art unsupervised node representation learning methods, an ideal solution would have two properties. First, fairness. Existing methods are implemented using different experimental settings, including cross-validation settings, hyperparameter tuning methods, and total running time. It is important to design a mechanism to fairly compare these methods under the same setting. Second, comprehensiveness. The learned embedding representations are, in general, designed for different downstream tasks. It is desirable to evaluate all NRL algorithms comprehensively from multiple aspects. Additionally, when integrating these multiple evaluation results into an overall rank, the integration schema should be in line with human prior experience. For example, it is expected that an NRL algorithm with ranks {2, 3, 4} (evaluated from three aspects) should rank higher than an NRL algorithm with ranks {3, 3, 3}. This is because the former NRL algorithm ranks second in the first aspect, indicating significant potential.
We propose a fair and comprehensive evaluation framework-FURL. Its core idea is illustrated in Figure 2. To systematically evaluate an NRL method, we apply it to three tasks. To make the comparisons fair, in each task, the total amount of hyperparameter-tuning time and running time is the same for all methods. We design a tailored integration schema to combine the evaluation results of the three tasks, which assigns a comprehensive rank to each given NRL algorithm. We now introduce the three tasks in detail.
Figure 2. Given a node representation learning method (e.g., Method A), FURL can fairly and comprehensively evaluate it based on its performance on three tasks and many real-world networks/datasets. The three tasks are classification fine-tuned via a validation set, link prediction fine-tuned in the first run, and classification fine-tuned via link prediction. In each task, the hyperparameters of Method A are fine-tuned based on a random search.

Task 1-Classification Fine Tuned via a Validation Set
Node classification has been widely used to evaluate embedding models [36,37]. As illustrated in Figure 2, Task 1 consists of three steps. First, given an NRL method, e.g., Method A, we apply it to embed the entire network, e.g., Net1, and learn embedding representations for all nodes. We use k-fold cross-validation to separate all the nodes into a training set and a test set. Second, we tune the hyperparameters of Method A in the first round (out of k rounds in total, corresponding to the k folds), and then fix the hyperparameters for all k rounds. Specifically, in the first round, we employ 1/10 of the training set as a validation set to tune hyperparameters. We leverage the embedding representations of the remaining training nodes (9/10) and their labels to train support vector machines (SVMs). Then, we use the trained SVMs to predict the classes of the nodes in the validation set. Based on the classification performance on the validation set, we select the best hyperparameters for the embedding model, i.e., Method A. Third, after we have selected the best hyperparameters, we add the validation set back into the training set. Given the embedding representations learned by the fine-tuned model (with the best hyperparameters), we conduct k-fold cross-validation ten times, referred to as ten runs. In each run, there are k rounds. In each round, we use the labels of the training nodes to train SVMs, and use them to predict the classes of the test nodes. The average of all 10k results on the test sets is employed as the final performance of Method A in Task 1.
In summary, we employ a simple process, i.e., using a validation set in the first round, to fine-tune the embedding model, because efficiency is important for hyperparameter searching. After we have identified the best hyperparameters, we evaluate Method A by performing k-fold cross-validation ten times, because comprehensiveness is essential to the final evaluation result. It should be noted that the validation set is generated from the training set in the first round, so no test data is leaked during hyperparameter tuning. Task 1 is not end-to-end.
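The Task 1 protocol above can be sketched as follows, assuming the embeddings for each candidate hyperparameter setting have already been computed. The function name `task1_select_and_evaluate` and its dictionary interface are hypothetical; FURL's actual implementation may differ, e.g., in how folds are seeded.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def task1_select_and_evaluate(embeddings_by_hp, y, k=5, n_runs=10, seed=0):
    """Task 1 sketch: pick hyperparameters on a validation split in the first
    round, then evaluate the chosen embeddings with k-fold cross-validation
    repeated n_runs times. `embeddings_by_hp` maps a hyperparameter setting to
    the node embeddings it produced."""
    # First round: carve 1/10 of the training fold out as a validation set.
    first_fold = StratifiedKFold(n_splits=k, shuffle=True, random_state=seed)
    train_idx, _ = next(iter(first_fold.split(np.zeros(len(y)), y)))
    tr, val = train_test_split(train_idx, test_size=0.1,
                               random_state=seed, stratify=y[train_idx])

    def val_score(Z):
        clf = SVC(kernel="rbf").fit(Z[tr], y[tr])
        return f1_score(y[val], clf.predict(Z[val]), average="micro")

    best_hp = max(embeddings_by_hp, key=lambda hp: val_score(embeddings_by_hp[hp]))
    Z = embeddings_by_hp[best_hp]
    # Final evaluation: k-fold cross-validation repeated ten times (10k results).
    scores = []
    for run in range(n_runs):
        kf = StratifiedKFold(n_splits=k, shuffle=True, random_state=run)
        for tr_i, te_i in kf.split(Z, y):
            clf = SVC(kernel="rbf").fit(Z[tr_i], y[tr_i])
            scores.append(f1_score(y[te_i], clf.predict(Z[te_i]), average="micro"))
    return best_hp, float(np.mean(scores))
```

Note that the validation split is drawn from the first training fold only, mirroring the no-leakage property discussed above.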

Task 2-Link Prediction Fine Tuned in the First Run
Link prediction has also been widely used to evaluate embedding models. In this task, we evaluate the effectiveness of each NRL algorithm in preserving network patterns by applying the learned representations to recover unseen links. As illustrated in Figure 2, Task 2 consists of three steps. First, given a network, e.g., Net1, we randomly select B/2 node pairs that are connected as positive samples, and B/2 node pairs that are not connected as negative samples, where B denotes the total number of node pairs to be predicted. We remove the B/2 edges of the B/2 positive samples from the original network. Then, we apply the given NRL model, e.g., Method A, to embed the new network. We will employ the learned node representations to perform link prediction. Second, in the first run, we tune the hyperparameters of Method A. In particular, we mix the B/2 positive samples and B/2 negative samples and obtain a set with B node pairs. For each pair in the B node pairs, we denote its two nodes as i and j. We compute the inner product of the embedding representations of nodes i and j, and employ the inner product to indicate the probability of having a link between nodes i and j. After obtaining the inner products of all B pairs, we rank them and compute the average precision (AP) and area under the receiver operating characteristic curve (ROC AUC), indicating the performance of link prediction. Based on this link-prediction performance, we select the best hyperparameters for the embedding model, i.e., Method A. Third, we employ the selected best hyperparameters to conduct another ten runs. In each run, we randomly select B/2 positive samples and B/2 negative samples from the original network. We remove the B/2 edges of positive samples from the original network, and obtain a new network. We apply Method A with the selected best hyperparameters to embed the new network. We calculate the inner products of all B pairs, and we compute the AP and ROC AUC. 
The average of the results of the ten runs is employed as the final performance of Method A in Task 2.
In summary, we fine-tune Method A in the first run and obtain the best hyperparameters. We employ the selected hyperparameters to perform another ten runs. In each run, a different set of B samples is utilized to conduct the evaluation, i.e., a different network (after removing B/2 edges) is used for embedding. Both the hyperparameter tuning and the final evaluation are conducted in an unsupervised manner.
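The unsupervised scoring step above can be sketched as follows. `score_link_prediction` is a hypothetical helper; the inner-product scoring and the AP/ROC AUC metrics follow the description of Task 2.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

def score_link_prediction(Z, pos_pairs, neg_pairs):
    """Task 2 sketch: score the B candidate node pairs by the inner product of
    their embeddings, then report ROC AUC and average precision (AP)."""
    pairs = np.array(pos_pairs + neg_pairs)
    labels = np.array([1] * len(pos_pairs) + [0] * len(neg_pairs))
    # Inner product of the two endpoint embeddings as the link score.
    scores = np.einsum("ij,ij->i", Z[pairs[:, 0]], Z[pairs[:, 1]])
    return roc_auc_score(labels, scores), average_precision_score(labels, scores)
```

On toy embeddings where connected nodes point in the same direction and unconnected nodes are orthogonal, both metrics reach 1.0.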

Task 3-Classification Fine Tuned via Link Prediction
One of the core purposes of learning embedding representations is to have low-dimensional representations ready, such that we can directly apply them to downstream tasks if needed. To simulate such a scenario, we propose Task 3, in which we fine-tune the NRL model in an unsupervised manner, and then apply the learned unsupervised embedding representations to perform classification. In this task, the hyperparameter-tuning process is the same as the one in Task 2. The final evaluation process is the same as the one in Task 1. As illustrated in Figure 2, Task 3 consists of four steps. First, given a network, we randomly select B/2 node pairs that are connected as positive samples, and B/2 node pairs that are not connected as negative samples. We remove the B/2 edges of the B/2 positive samples from the original network. We apply the given NRL model, e.g., Method A, to embed the new network. Second, we tune the hyperparameters of Method A based on link prediction. For each node pair in the B node pairs (B/2 negative samples and B/2 positive samples), we denote its two nodes as i and j. We compute the inner product of the embedding representations of nodes i and j, and employ the inner product to indicate the probability of a link between nodes i and j. After obtaining the inner products of all B pairs, we compute the AP and ROC AUC. Based on this link-prediction performance, we select the best hyperparameters for the embedding model, i.e., Method A. We apply the fine-tuned model (with the best hyperparameters) to the original network, and learn embedding representations. Third, we use k-fold cross-validation to separate all nodes into a training set and a test set. We use the representations of the training nodes to train SVMs, and use them to predict the classes of the test nodes. Fourth, we repeat step three ten times, i.e., conducting k-fold cross-validation ten times.
The average of all 10k results of the test sets is employed as the final performance of Method A in Task 3.

Integration Schema for Overall Rank
FURL enables comprehensive evaluations of unsupervised NRL methods. It involves not only two prediction tasks, but also different ways of hyperparameter tuning. Although unsupervised NRL methods are generally trained without labels, the hyperparameter searching can be conducted with or without labeled data. In Task 1, we use labels to tune hyperparameters. In Tasks 2 and 3, we tune hyperparameters without using labels.
We apply each unsupervised NRL algorithm to perform the aforementioned three tasks on all datasets. In each task and each dataset, all algorithms are assigned a fixed amount of time to conduct hyperparameter tuning. They all use random search to tune hyperparameters. Thus, we can perform the evaluation in a fair way.
Let D denote the total number of datasets used in the evaluation. Let C denote the total number of unsupervised NRL algorithms included in the experiments. Let R_ij denote the rank of an algorithm, e.g., Method A, among all C algorithms, based on its performance in the i-th task on the j-th dataset. We collect the ranks of Method A as a matrix R ∈ R^{3×D}. Then, the overall rank of Method A can be calculated by computing the rank product over all datasets and three tasks as follows. The rank product is a widely used approach to combine ranked lists.

OverallRank(A) = f_rank( ∏_{i=1}^{3} f_rank( ∏_{j=1}^{D} R_ij ) ),    (1)

which can be simplified as follows,

OverallRank(A) = f_rank( ∑_{i=1}^{3} log f_rank( ∑_{j=1}^{D} log R_ij ) ),    (2)

where f_rank() takes the result of Method A as input, compares it with the corresponding results of all other C − 1 algorithms, and returns the rank of Method A as output. Equation (1) is simplified into Equation (2) by applying a logarithmic function to the output of the rank products before computing ranks using f_rank(). Since the logarithmic operation is monotonic, this transformation does not affect the final ranks. As illustrated in Equation (2), our schema integrates the rank matrix R in two steps. First, we sum up the logarithms of the ranks of Method A on all D datasets; then, f_rank() finds the rank of Method A on the i-th task. Second, we sum up the logarithms of the ranks of Method A on all three tasks; finally, f_rank() returns the overall rank of Method A. In each step, we follow a map-and-aggregate policy: the ranks are first mapped using a logarithmic function and then aggregated by summing. The mapping function needs to be monotone so that the order of algorithms is maintained after mapping. The simplified rank product indicates that an algorithm that ranks 2nd, 3rd, and 4th in the three tasks has a higher potential than an algorithm that ranks 3rd, 3rd, and 3rd.
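The two-step map-and-aggregate schema can be sketched in NumPy as follows. `overall_ranks` and the inner `f_rank` are illustrative names; ties are broken here by stable sort order, which the paper does not specify.

```python
import numpy as np

def overall_ranks(R):
    """Integration-schema sketch. R has shape (C, 3, D): the rank of each of C
    algorithms in each of the 3 tasks on each of the D datasets. Ranks are
    mapped with a logarithm, summed over datasets, re-ranked per task, then
    summed over tasks and re-ranked to obtain the overall rank (1 = best)."""
    def f_rank(scores):
        # Smaller aggregated score -> better rank; ranks start at 1.
        order = np.argsort(scores, axis=0, kind="stable")
        return np.argsort(order, axis=0, kind="stable") + 1

    per_task = f_rank(np.log(R).sum(axis=2))     # shape (C, 3): rank per task
    return f_rank(np.log(per_task).sum(axis=1))  # shape (C,): overall rank
```

The {2, 3, 4} versus {3, 3, 3} example works because log 2 + log 3 + log 4 = log 24 < log 27 = log 3 + log 3 + log 3, so the former aggregates to a better (smaller) score.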

Unsupervised Node Representation Learning Algorithms and Datasets
We included nine unsupervised NRL algorithms that are designed for pure networks, and four unsupervised NRL algorithms that are designed for attributed networks.

Unsupervised Node Representation Learning Methods Used in FURL
For a fair comparison, we separate the existing unsupervised NRL methods into two categories based on whether associated attribute information is exploited for learning node representations. We carefully select nine methods dedicated to pure networks, and four methods for attributed network embedding. We present a detailed introduction of each method as follows. We also list the hyperparameters and their corresponding search spaces for each method in Table 1.
We roughly categorize the existing NRL algorithms for pure networks into three classes, i.e., negative-sampling-based, matrix-factorization-based, and spectral-embedding-based. First, typical negative-sampling-based NRL algorithms include DeepWalk [1], node2vec [31], and LINE [10]. Second, typical matrix-factorization-based NRL algorithms include NetMF [38] and ProNE [39]. The key idea is to construct a proximity matrix and use singular value decomposition or other matrix-factorization algorithms to obtain the graph embedding. Third, the key idea of spectral embedding [22] is to compute the eigendecomposition of a normalized Laplacian matrix of the network. Spectral-embedding-based NRL algorithms are not scalable, since the time complexity of eigendecomposition is O(N^2), where N denotes the total number of nodes.
• DeepWalk [1]: It first performs random walks on the network to convert it into a series of truncated node sequences. Then, it treats the recorded node sequences as a "corpus", i.e., each node as a word and each sequence as a sentence, and applies a word-embedding technique to embed this "corpus". In this way, we can learn a low-dimensional representation for each word (node) by using efficient word-embedding techniques.
• metapath2vec [40]: It formalizes meta-path-based random walks to construct the heterogeneous neighborhood of a node and then leverages a heterogeneous skip-gram model to learn node embedding representations.
• node2vec [31]: It performs biased random walks to smoothly interpolate between breadth-first sampling and depth-first sampling.
• NetMF [38]: It proves that existing negative-sampling-based NRL algorithms can be considered variations of matrix factorization. Based on this theoretical analysis, it proposes an advanced matrix factorization for graphs.
• ProNE [39]: It first leverages sparse matrix factorization to learn the initial node representations. Then, it uses the high-order Cheeger's inequality to modulate the spectral space of the graph and propagates the initial representations on the adjusted graph.
In this way, it can integrate the local smoothing information and the global clustering information.
• NetSMF [41]: It makes NetMF scalable by applying sparsification.
• HOPE [42]: It proposes to preserve asymmetric transitivity by approximating high-order proximity.
• SDNE [43]: It is a semi-supervised deep model with multiple layers of non-linear functions, aiming to preserve both the global and local network structures.
• LINE [10]: It can scale up to networks with millions of vertices and billions of edges. It carefully designs objective functions that preserve both the first-order and second-order proximities, which are complementary to each other. An efficient and effective edge-sampling method is proposed for model inference, which addresses the limitation of stochastic gradient descent on weighted edges without compromising efficiency.
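As an illustration of the first step shared by DeepWalk and node2vec-style methods, the following sketch generates a truncated random-walk "corpus"; the resulting sequences would then be fed to a skip-gram word-embedding model. The function `generate_walks` and its adjacency-list interface are our own simplification (uniform, first-order walks only, unlike node2vec's biased walks).

```python
import random

def generate_walks(adj, num_walks=10, walk_length=40, seed=0):
    """DeepWalk-style corpus sketch: `adj` maps each node to its neighbor list.
    Returns truncated random-walk node sequences, treated as "sentences"."""
    rng = random.Random(seed)
    walks = []
    nodes = list(adj)
    for _ in range(num_walks):
        rng.shuffle(nodes)  # a fresh pass over all nodes per batch of walks
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adj[walk[-1]]
                if not neighbors:  # dangling node: truncate the walk early
                    break
                walk.append(rng.choice(neighbors))
            walks.append(walk)
    return walks
```

Each walk only ever follows existing edges, so the corpus reflects the network's local connectivity.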

Node Representation Learning for Attributed Networks
Definition 1 (Attributed Network). We refer to the textual, image, numerical, or categorical data associated with each node as node attributes. A network with node attributes is defined as an attributed network. Examples include social networks with user-generated content, paper citation networks with abstracts, and protein-protein interaction networks with property descriptions.
In this paper, we select four typical unsupervised NRL algorithms for attributed networks, as follows.
• GAE [44]: It learns node embedding representations by reconstructing the network structure using autoencoders. A two-layer graph convolutional network is employed to conduct encoding.
• CAN [45]: It learns an embedding representation for each node and for each node attribute category. These two types of representations, i.e., node representations and attribute-category representations, are learned by variational autoencoders, and lie in the same space.
• DGI [36]: It learns node embeddings by maximizing the mutual information between the local patch representation and the global graph representation.
• FeatWalk [46]: It advances random walks by introducing an attribute-enhanced walking strategy. In addition to random walks on the original network, it also allows walking from one node to another through their shared node attribute categories. The joint truncated node sequences thus become more diverse.
To handle networks with node attributes, the field of graph neural networks (GNNs) has expanded rapidly in recent years [47][48][49][50]. Existing GNN models belong to the family of message-passing frameworks [51,52], which update information for each node by recursively aggregating messages from its immediate neighbors in the graph. GNNs are generally classified into two categories, as follows.
The first class is spectral-based approaches [53][54][55][56][57][58][59], which generalize convolution operators on grid-like data to graphs based on spectral graph theory [60]. GCN [56] employs the first-order Chebyshev polynomial approximation for graph convolution. In a follow-up, Xu et al. [59] analyze the expressive power of GNNs according to the Weisfeiler-Lehman test for graph isomorphism. SGC [61] further simplifies GCN by removing the inner transformation matrices and non-linear activation functions. Although effective, these models can struggle with modeling long-range dependencies because of the recursive message-passing restriction. APPNP [58] suggests replacing the original graph convolution matrix with a graph diffusion [62]. MixHop [63], N-GCN [64], and Truncated Krylov [65] attempt to exploit multi-scale information in each layer [66]. The second line of research is spatial-based approaches [9,34,67,68], which stem from the vector domain and focus on using the graph structure directly for message propagation. One characteristic of these is to adopt a learnable aggregation function to aggregate neighbors. GraphSage [9] introduces mean/max/LSTM pooled aggregation functions to integrate neighborhood information. GAT [69] learns to assign different attention weights at each layer based on node features. Attention-based GNN models have achieved state-of-the-art results on several graph learning tasks [68,70,71].
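A single first-order graph-convolution step, as popularized by GCN, can be sketched as follows; `gcn_layer` is an illustrative dense-matrix version (real implementations use sparse operations and trained weight matrices W).

```python
import numpy as np

def gcn_layer(A, X, W):
    """One first-order GCN propagation step (sketch): H = ReLU(A_hat X W),
    where A_hat = D^{-1/2} (A + I) D^{-1/2} is the renormalized adjacency
    matrix with self-loops added."""
    A_tilde = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_tilde.sum(axis=1))
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]
    return np.maximum(A_hat @ X @ W, 0.0)  # ReLU activation
```

Stacking L such layers mixes information from L-hop neighborhoods, which is the recursive restriction on long-range dependencies mentioned above.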
However, most GNN models are dedicated to end-to-end classification. There is an issue when converting GNN models into unsupervised NRL methods. For example, if we directly employ one of the hidden layers as the learned embedding representations, GNN models would perform badly in Tasks 1, 2, and 3.

Real-World Datasets Used in FURL
We included eight publicly available real-world attributed networks in the experiments. Their statistical information is summarized in Table 2. We will include more datasets in the future.
Cora, Citeseer, and Pubmed [72]. These are citation networks. Nodes correspond to papers and edges correspond to citations. Each node has a bag-of-words feature vector derived from the paper abstract. Labels are defined as the academic topics.
Flickr [73]. It is a social-network dataset collected from Flickr. Node attributes denote the tags that reflect the users' interests. Labels represent the groups that users have joined.
BlogCatalog [73]. It is a social network collected from a blog community. Nodes are web users and edges indicate the user interactions. Node attributes are generated from keywords of their blogs. Each user can register his/her blogs into six different predefined classes, which are considered as class labels for node classification.
Chameleon and Squirrel [74]. They are page-page networks collected from Wikipedia (December 2018), based on a given topic (either chameleons or squirrels). Nodes represent articles from the English Wikipedia. Edges reflect mutual links between articles. Node features indicate the presence of particular nouns in the articles and the average monthly traffic from October 2017 to November 2018.
Film [58]. This dataset is the actor-only induced subgraph of the film-director-actor-writer network [75]. Each node corresponds to an actor, and the edge between two nodes denotes co-occurrence on the same Wikipedia page. Node features correspond to keywords in the Wikipedia pages. The nodes are classified into five categories based on the words on the actors' Wikipedia pages.

Results
We now introduce the evaluation results returned by our proposed algorithm-FURL.

Experimental Settings and Environment Configuration
To make the comparisons fair, for each dataset, the total hyperparameter-tuning time is fixed. When there are N nodes and X edges in a network, the total given time is max{αN, βX} seconds, where α and β are coefficients related to the hardware configuration and software environment. We set α = 1.5 and β = 0.1 to balance the efficiency and reliability of the results. By default, we employ a random search to tune hyperparameters. In Tasks 1 and 3, we use a radial basis function (RBF) kernel SVM as the classifier. The performance of classification is evaluated by macro-averaged F1 and micro-averaged F1. In Task 2, the performance is measured by two standard metrics, i.e., the area under the ROC curve (AUC) and the average precision (AP) score. In Tasks 2 and 3, we set B to 25% of the total number of edges. This means that, in each run of Task 2, we randomly remove 12.5% of the edges. We integrate all thirteen unsupervised NRL algorithms into a unified software package. We perform all experiments in the same environment, i.e., a server with Intel Xeon Silver 4214R CPU @2.4 GHz processors. The GPUs are GeForce RTX 3090 with 24 GB of memory.
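The fixed-time random-search protocol can be sketched as follows; `budgeted_random_search` and the `evaluate` callback are hypothetical names, and the budget follows the max{αN, βX} rule above.

```python
import random
import time

def budgeted_random_search(evaluate, search_space, n_nodes, n_edges,
                           alpha=1.5, beta=0.1, seed=0):
    """Tuning-loop sketch: sample hyperparameters uniformly at random and keep
    the best configuration found within max(alpha*N, beta*X) seconds.
    `evaluate` returns a score for one hyperparameter setting (higher is better);
    `search_space` maps each hyperparameter name to its candidate values."""
    budget = max(alpha * n_nodes, beta * n_edges)  # total tuning time (seconds)
    rng = random.Random(seed)
    deadline = time.monotonic() + budget
    best_hp, best_score, rounds = None, float("-inf"), 0
    while time.monotonic() < deadline:
        hp = {name: rng.choice(values) for name, values in search_space.items()}
        score = evaluate(hp)
        rounds += 1
        if score > best_score:
            best_hp, best_score = hp, score
    return best_hp, best_score, rounds
```

Because the wall-clock budget is shared, a faster method simply completes more rounds of the loop, which matches the tuning-count observation reported later.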

Overall Rank of Node Representation Learning Methods for Pure Networks
The overall ranks of the unsupervised NRL methods for pure networks, computed based on Equation (2), are summarized in Figure 3. Node attributes are not used in this evaluation. The results in Tasks 1, 2, and 3 are summarized in Tables 3-5. In Figure 3, we visualize the ranks of each selected pure-network embedding algorithm using a bar chart. The algorithms are sorted by their overall ranks in ascending order, from left to right.

Overall Rank of Node Representation Learning Methods for Attributed Networks
The overall ranks of all four unsupervised NRL methods for attributed networks, computed based on Equation (2), are summarized in Figure 4. Their results in Tasks 1, 2, and 3 are summarized in Tables 6-8.

Research Observations
Based on the results in Figures 3 and 4, and Tables 3-8, we have made five major research observations, as follows.

1.
ProNE [39] ranks first in unsupervised NRL methods for pure networks, while CAN [45] ranks first in unsupervised NRL methods for attributed networks.

2.
The performance of an NRL method generally shows consistency across different tasks. Methods with high overall ranks often rank high in all of the three tasks. For example, ProNE ranks first in Tasks 1 and 3. CAN ranks first in Tasks 2 and 3, and ranks second in Task 1. LINE ranks sixth in Task 1, fifth in Task 2, and seventh in Task 3. However, there is a special case, in which, NetSMF ranks second in Tasks 1 and 3, and seventh in Task 2.

3.
Efficient methods benefit from more rounds of tuning and thus achieve relatively high performance. For each dataset, the total time for tuning hyperparameters is fixed as max{αN, βX} seconds, so a more efficient method completes more rounds of hyperparameter tuning. We summarize the hyperparameter-tuning counts of all methods in Task 1 in Table 9. We observe that efficient methods such as ProNE and DeepWalk tuned hyperparameters for many rounds on most datasets, and ProNE ranks first among methods for pure networks; FeatWalk tuned hyperparameters for 338 rounds on BlogCatalog. In contrast, GAE and LINE are hindered by their inefficiency. Some methods, e.g., SDNE, NetMF, and DGI, are efficient and tuned hyperparameters for many rounds, yet their performance is limited by their lack of effectiveness.

4.
By comparing the results in Tables 3 and 5, we observe that most methods achieve higher performance when hyperparameters are tuned based on labels, i.e., via the validation set in Task 1. This is because, in Task 1, the hyperparameter-tuning task is consistent with the evaluation task, whereas in Task 3 hyperparameters are tuned via link prediction, which differs from the classification task used for evaluation.

5.
SDNE is dedicated to link prediction; it performs poorly in Tasks 1 and 3.

Analysis of Hyperparameter-Tuning Methods
By default, we employ a random search to tune hyperparameters. We now replace it with the Tree-structured Parzen Estimator (TPE) [76] and apply it to several selected methods and datasets. The results of the three tasks when using TPE to tune hyperparameters are summarized in Tables 10-12. We observe that the results achieved by TPE (Tables 10-12) are similar to those achieved by random search (Tables 3-8). Thus, FURL employs random search by default.
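To make the comparison concrete, the core idea of TPE can be sketched for a single continuous hyperparameter. This is a minimal, simplified sketch, not the TPE implementation cited as [76]: past trials are split into a "good" set (top γ fraction by score) and a "bad" set, and the next value is the candidate maximizing the density ratio of the good model over the bad model. The objective below is a hypothetical stand-in for training and evaluation.

```python
import math
import random

def kde(points, x, bandwidth=0.01):
    """Gaussian kernel density estimate over observed hyperparameter values."""
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2)
               for p in points) / (len(points) * bandwidth)

def tpe_suggest(history, low, high, gamma=0.25, n_candidates=24):
    """history: list of (value, score) pairs; higher score is better.
    Split past trials into 'good' (top gamma fraction) and 'bad' sets,
    then return the candidate maximizing the ratio of the good density
    to the bad density."""
    if len(history) < 4:                        # warm up with random draws
        return random.uniform(low, high)
    ranked = sorted(history, key=lambda t: t[1], reverse=True)
    n_good = max(1, int(gamma * len(ranked)))
    good = [v for v, _ in ranked[:n_good]]
    bad = [v for v, _ in ranked[n_good:]]
    candidates = [random.uniform(low, high) for _ in range(n_candidates)]
    return max(candidates, key=lambda x: kde(good, x) / kde(bad, x))

# Hypothetical objective: pretend the best learning rate is near 0.03.
history = []
for _ in range(30):
    lr = tpe_suggest(history, 0.001, 0.1)
    history.append((lr, -abs(lr - 0.03)))       # higher score = closer to 0.03
best_lr = max(history, key=lambda t: t[1])[0]
```

Unlike pure random search, later trials concentrate near regions that scored well earlier, which is why TPE can sometimes find good configurations in fewer trials.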

Discussions
In terms of hardware, it is challenging to ensure absolute fairness because NRL methods may or may not employ the GPU. We roughly categorize NRL methods into two groups, i.e., CPU-only and GPU-involved. For methods in the first group, the same number of threads is allocated to each method, although different methods use them in different ways; a method without multiprocessing or multithreading support occupies only one thread even when more resources are allocated. For methods in the second group, we provide the same resources, though they may not be fully utilized. There is an inherent performance gap between CPU and GPU. However, fully utilizing the allocated resources or accelerating computation with a GPU should be considered an advantage of a method.
Another issue is the hyperparameter search space. For most hyperparameters, the search space is continuous, and it is impractical to fine-tune them precisely within a constrained time. Meanwhile, parameter sensitivity is non-uniform, e.g., the learning rate and the final task performance are not linearly correlated. Therefore, for the most frequently tuned hyperparameters, such as the learning rate and dropout, we restrict the search space to a set of discrete values within the commonly used range. In this way, we inject prior knowledge into the hyperparameter search and significantly improve its efficiency. Moreover, when a method runs out of its total tuning time, we do not interrupt the hyperparameter tuning immediately; instead, we wait until the current trial ends.
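As an illustration, such a restricted discrete search space might look as follows. The grid values here are illustrative placeholders, not the exact grids used in FURL.

```python
import random

# Illustrative discrete grids for commonly tuned hyperparameters; the values
# are restricted to commonly used ranges rather than searched continuously,
# which injects prior knowledge and keeps each trial's budget well spent.
SEARCH_SPACE = {
    "learning_rate": [1e-4, 5e-4, 1e-3, 5e-3, 1e-2, 5e-2],
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.5],
    "embedding_dim": [64, 128, 256],
}

def sample_config(space):
    """Draw one trial configuration from the discrete grids."""
    return {name: random.choice(values) for name, values in space.items()}

config = sample_config(SEARCH_SPACE)
```

Because each hyperparameter now takes only a handful of values, even a modest number of random trials covers a meaningful fraction of the joint grid.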

Conclusions
We propose a fair and comprehensive framework, FURL, to evaluate and rank existing unsupervised node representation learning methods. We design three tasks with different applications and hyperparameter-tuning settings, and propose an integration schema to compute an overall rank based on the results of the three tasks. Currently, FURL covers thirteen unsupervised NRL algorithms and eight datasets.
In the future, we plan to include more NRL methods and datasets. We will extend this work by adding (1) more classical and state-of-the-art unsupervised node representation learning algorithms and (2) larger networks. The extended work will be dedicated not only to comparing task performance but also to considering the efficiency (e.g., parameter efficiency, memory efficiency, learning efficiency) of unsupervised NRL methods. We will also analyze the trend of performance as the graph size scales, which will provide deeper insight into the generalization ability of each algorithm.