An Information-Explainable Random Walk Based Unsupervised Network Representation Learning Framework on Node Classification Tasks

Network representation learning aims to learn low-dimensional, compressible, and distributed representational vectors of nodes in networks. Because obtaining label information for nodes in networks is expensive, many unsupervised network representation learning methods have been proposed, among which random walk strategies are widely utilized. However, existing random walk based methods face several challenges: 1. they cannot sufficiently explain what network knowledge is embedded in the sampled walking paths; 2. the mixture of different kinds of information in networks causes adverse effects; 3. methods with hyper-parameters generalize poorly across different networks. This paper proposes an information-explainable random walk based unsupervised network representation learning framework named Probabilistic Accepted Walk (PAW), which obtains network representations from the perspective of the stationary distributions of networks. In the framework, we design two stationary distributions based on nodes' self-information and local-information to guide our proposed random walk strategy, which learns representational vectors of networks through sampled paths of nodes. Numerous experimental results demonstrate that PAW obtains more expressive representations than six widely used unsupervised network representation learning baselines on four real-world networks in single-label and multi-label node classification tasks.


Introduction
Network Representation Learning (NRL) plays a crucial role in many real-world network analysis applications, such as protein-protein interaction [1], community detection [2], and the evaluation of doctors [3]. NRL aims to learn low-dimensional, compressible, and distributed representational vectors of nodes in networks. Although supervised methods [4,5] adapt better to specific tasks and obtain more expressive representations, the difficulty and expense of obtaining node labels have motivated many unsupervised network representation learning methods [6][7][8][9][10][11]. Among unsupervised network representation learning methods, random walk is one of the most widely used strategies.
The main idea of random walk based methods is to design various transition strategies (transition matrices) to sample walking paths, and then feed those paths into the skip-gram model [12] to obtain representations. In previous studies, random walk based methods package all prior information, such as self-information (nodes' attributes) and local-information (knowledge from neighbors), into a single transition matrix to sample walking paths. This raises two challenges: 1. How to explicitly explain what network knowledge is embedded in the sampled walking paths; 2. How to learn more expressive representational vectors of networks by designing one random walk strategy without hyper-parameters that samples walking paths carrying each specific kind of network information individually.
Since each random walk strategy has one certain transition matrix (in most cases, the matrix is not easily known), an arbitrary random walk obeys only one specific stationary distribution when the sampling is performed on a connected aperiodic network [15]. In other words, a random walk strategy implicitly samples walking paths of nodes under a certain stationary distribution on a connected aperiodic network. (Any unconnected network can be transformed into a connected one by adding edges with minimal transition probabilities between nodes [16], and real-world networks are generally aperiodic; when this paper mentions networks, it means connected aperiodic networks.) This suggests a novel way of designing a random walk based method for NRL from the perspective of stationary distributions: we can first explicitly specify a stationary distribution based on a specific kind of network knowledge, then design a random walk strategy that obeys this distribution to embed that knowledge into the sampled walking paths.
Unlike the implicit sampling in existing random walk based methods, the explicit consideration of stationary distributions brings two advantages. On the one hand, we can choose a specific kind of network knowledge and extract it by sampling walking paths. For example, when a random walk strategy follows a stationary distribution designed from nodes' own information (such as node degree), the crucial nodes (with high degree) will be frequently sampled and often appear in walking paths. When the walking paths are sampled by a random walk obeying a stationary distribution designed from the neighborhood information of nodes (such as the degrees of neighbors), the nodes with more important neighbors (of high degree) are more likely to be sampled into walking paths. In this way, we can explicitly explain what network knowledge is embedded in the sampled walking paths.
On the other hand, the explicit consideration of various stationary distributions can alleviate the adverse effects of mixed information. Many studies have addressed the adverse effects of mixed data in various applications [17][18][19]. Networks with abundant information in nodes or edges can provide various stationary distributions based on different kinds of network knowledge. By sampling a set of walking paths according to each of these distributions individually, each set of walking paths carries a single kind of network knowledge. The representational vectors learned from walking paths carrying a single kind of network knowledge can reduce the adverse effects of mixing network knowledge.
In this paper, we propose an explainable random walk based network representation learning framework to obtain more expressive representational vectors of networks. Specifically, we first explicitly design two stationary distributions to extract specific kinds of network knowledge individually. Second, we propose a general random walk strategy without hyper-parameters, called Probabilistic Accepted Walk (PAW), which can obey an arbitrary stationary distribution to sample walking paths. Third, we utilize PAW to sample walking paths under each given stationary distribution and feed the paths into the skip-gram model [12] to obtain two kinds of representational vectors, each carrying specific network information. After that, Principal Component Analysis (PCA) [20] is used to fuse the two kinds of representational vectors into a unified feature space, from which a low-dimensional latent representation of the network is reselected. Numerous experimental results demonstrate that the proposed framework dominates six unsupervised network representation learning methods on four real-world datasets with two kinds of node classification tasks.
Concretely, our contributions are as follows:

1. We propose a novel random walk based network representation learning framework from the perspective of stationary distributions, explicitly explaining what network knowledge is embedded in sampled walking paths;
2. The individual extraction of different kinds of network knowledge alleviates the adverse effects of mixing network knowledge;
3. We propose a random walk strategy that can obey an arbitrary given stationary distribution;
4. Numerous experimental results demonstrate that the proposed framework obtains more expressive representations than six popular unsupervised network representation learning methods on four real-world datasets with two kinds of node classification tasks.

Related Work
This section briefly introduces the methods related to this paper, roughly divided into two categories. On the one hand, the random walk is an important strategy for unsupervised network representation learning. Random walk based methods design various transition strategies to obtain walking paths, then feed those paths into the skip-gram model [12] to learn representational vectors. The DeepWalk method [6] utilized a uniform random walk strategy to sample walking paths of nodes for learning the representational vectors. node2vec [7] was proposed to sample walking paths of nodes to capture structural information by utilizing two hyper-parameters that balance depth-first search and breadth-first search. In random walk studies, the stationary distribution is an important property utilized in many works [21][22][23]. Random walk strategies designed from various explicit stationary distributions can learn representational vectors from sampled walking paths from the perspectives of specific network information, which helps researchers easily extend random walk strategies and extract as much network information as possible. On the other hand, the second category of methods is based on the similarity between nodes, such as LINE [8], struc2vec [10], SDNE [9], and NetMF [11]. The first three methods utilize the similarity between vertices in the network to learn network representations. NetMF pre-processes the network into an adjacency matrix and then decomposes this matrix to obtain the embedding matrix.

Preliminary
In this section, we will introduce some important concepts mentioned in this paper.

Connected Aperiodic Network:
A connected aperiodic network is a network with the following properties:
• For every pair of nodes (u, v), there exists a path that starts at node u and ends at node v;
• The greatest common divisor of the lengths of all closed walks starting from a node and returning to the same node is 1.
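The two properties above can be checked directly on a small graph. The sketch below is a minimal, hypothetical checker (the helper names `is_connected` and `period_of_node` are ours, not from the paper): connectivity via breadth-first search, and aperiodicity via the gcd of the lengths of closed walks from a node.

```python
from math import gcd
from collections import deque

def is_connected(adj):
    """BFS from node 0; the network is connected if every node is reached."""
    seen = {0}
    queue = deque([0])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return len(seen) == len(adj)

def period_of_node(adj, u, max_len=None):
    """gcd of the lengths of closed walks from u back to u (up to max_len)."""
    if max_len is None:
        max_len = 2 * len(adj)
    frontier = {u}  # nodes reachable from u in exactly t steps
    g = 0
    for t in range(1, max_len + 1):
        frontier = {w for x in frontier for w in adj[x]}
        if u in frontier:
            g = gcd(g, t)
            if g == 1:
                break
    return g

# A triangle with a pendant node: connected, and closed walks of lengths
# 2 and 3 exist from node 0, so gcd = 1 and the network is aperiodic.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(is_connected(adj), period_of_node(adj, 0))  # True 1
```

A two-node network, by contrast, only admits closed walks of even length, so its period is 2 and it is periodic.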

Stationary Distribution:
The stationary distribution is defined as a probability vector π_t satisfying π_t = π_t−1 * P, where P is a transition probability matrix and π_t is the probability vector at step t. In a connected aperiodic network, no matter what the initial distribution π_0 of the random walk is, the probability distribution converges to a unique stationary distribution once a transition probability matrix is given [24].

Detailed Balance Condition:
For a random walk on a connected aperiodic network G(V, E, P) with probabilities P on edges e ∈ E, let p_uv (p_vu) denote the transition probability from node u (v) to node v (u). If a vector π satisfies, for all nodes u and v with e_uv ∈ E,

π_u * p_uv = π_v * p_vu, with ∑_{x∈V} π_x = 1,

then π is the stationary distribution of the random walk. This condition is called the Detailed Balance Condition [24].

Self-distribution:
From the perspective of nodes' own properties, we design the self-distribution to guide the PAW method to sample walking paths carrying those properties. For a node u ∈ V, we define the self-distribution of node u as

π^self_u = deg(u) / (2|E|),

where deg(u) is the degree of node u and |E| is the number of edges. π^self is a distribution, as it satisfies π^self_u ∈ [0, 1] and ∑_{u∈V} π^self_u = 1 (since ∑_{u∈V} deg(u) = 2|E|). Nodes with high degrees have a high probability of being sampled into walking paths, which aligns with the fact that nodes with higher degrees are more accessible.

Neighbor-distribution:
From the perspective of neighborhood properties, we design the neighbor-distribution to guide the PAW method to sample walking paths carrying those properties. The neighbor-distribution of a node u is defined as

π^local_u = ∑_{v∈N(u)} deg(v) / ∑_{x∈V} ∑_{v∈N(x)} deg(v),

where N(u) is the set of neighbors of node u, π^local_u ∈ [0, 1], and ∑_{u∈V} π^local_u = 1. The neighbor-distribution describes the situation of a node's neighbors: nodes with a high neighbor-distribution value will be frequently sampled into walking paths because they have neighbors that are visited with high probability from other nodes.
This is similar to the idea of PageRank [16]: when a node has important neighbors (frequently visited), the node itself should be important (and frequently visited from its neighbors).
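The two distributions can be sketched in a few lines over an adjacency-list graph. This is a hedged reading of the definitions: the self-distribution deg(u)/2|E| follows directly from the text, while the exact normalization of the neighbor-distribution (here: proportional to the total degree of a node's neighbors) is our assumption, not necessarily the paper's exact formula.

```python
def self_distribution(adj):
    """pi_self(u) = deg(u) / (2|E|): stationary law of the simple random walk."""
    two_E = sum(len(nbrs) for nbrs in adj.values())  # sum of degrees = 2|E|
    return {u: len(nbrs) / two_E for u, nbrs in adj.items()}

def neighbor_distribution(adj):
    """pi_local(u) proportional to the total degree of u's neighbors
    (one plausible reading of the paper's neighbor-distribution)."""
    score = {u: sum(len(adj[v]) for v in nbrs) for u, nbrs in adj.items()}
    total = sum(score.values())
    return {u: s / total for u, s in score.items()}

# Triangle plus a pendant node.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
pi_self = self_distribution(adj)
pi_local = neighbor_distribution(adj)
assert abs(sum(pi_self.values()) - 1.0) < 1e-12   # both are distributions
assert abs(sum(pi_local.values()) - 1.0) < 1e-12
```

On this toy graph the hub node 2 gets the largest self-distribution mass (3/8), while the pendant node 3 gets little mass under either distribution.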
Intuitively, these two kinds of network information can be used to distinguish the local statuses of nodes. Four typical local statuses are illustrated in Figure 1. The representations learned from the walking paths sampled under the two distributions, respectively, can depict these local statuses.

Figure 1. Four typical local statuses of nodes: internal structure, internal-outer structure, outer structure, and outer-internal structure.

Methodology
This section first presents each part of our proposed framework in detail. Second, we introduce the Probabilistic Accepted Walk (PAW) strategy and prove that PAW can obey an arbitrary given stationary distribution. Third, we utilize the Principal Component Analysis (PCA) method to fuse the two kinds of representational vectors into a unified space to obtain low-dimensional latent representations of nodes.

Framework
The proposed framework is illustrated in Figure 2. First, two stationary distributions are designed based on nodes' self-properties and neighborhood properties individually. Second, the PAW strategy samples walking paths under each of the proposed stationary distributions individually, then the paths containing different information are fed into two skip-gram models to learn vector-based representations separately. Third, the two representational vectors of nodes are fused into a unified embedding space by the PCA algorithm to reselect low-dimensional latent representational vectors. Concretely, we first propose the two stationary distributions. According to each of them, the PAW strategy starts from a random node (node 5) and uniformly selects one of its neighbors (node 3) as the target. If the transfer is accepted, the walk moves to the neighbor (node 3 is sampled into the path). Otherwise, the walk stays at the current position (node 5) and reselects a neighbor (node 1). The random walk repeats this process until the specified path length is reached. Next, we separately feed the paths into two skip-gram models to obtain vector-based representations. Finally, we utilize the PCA algorithm to fuse the two representational vectors of nodes into a unified space and obtain low-dimensional latent representations.
The pseudo-code of our framework is shown in Algorithm 1. Formally, a network is G = (V, E), where V denotes the set of nodes and E denotes the set of edges. Line 1 initializes the two representations. Line 2 calculates the two stationary distributions. Lines 3-11 show that the PAW strategy obtains different sampled walking paths under the different stationary distributions (the next subsection presents the PAW strategy in detail). Lines 12-13 show that the two sets of walking paths are fed into two skip-gram models separately to obtain vector-based representations; the skip-gram model is described in [6]. Line 14 concatenates Φ^self and Φ^local in sequence and fuses the two representations into a unified feature space by the PCA algorithm.


PAW: Probabilistic Accepted Walk
In this subsection, we introduce the PAW strategy in detail. The strategy is inspired by the probabilistic acceptance of sampling in the Metropolis-Hastings algorithm [25].
We propose an acceptance probability to decide whether to accept the transfer from the current position to a neighbor node. The acceptance probability from node u to node v is defined as

α_uv = min(1, (π_v * p_vu) / (π_u * p_uv)),

where π_u and π_v are the stationary distribution values of nodes u and v, p_uv is the transfer probability from node u to node v with p_uv = 1/deg(u), and p_vu is the transfer probability from node v to node u with p_vu = 1/deg(v). The pseudo-code of the PAW strategy is shown in Algorithm 2. In Line 1, one node is selected as the start node. Lines 3-8 show the mechanism of random jumping with a minimal probability to maintain the connected property of graphs. In Line 9, one neighbor of the current position is chosen uniformly as the target of the transfer. In Lines 10-11, we calculate the acceptance probability and compare it with a random number. If the random number is lower than the acceptance probability, the step is accepted and the walk moves to the neighbor (Lines 13-15). Otherwise, the step is rejected: the walk stays at the current position and reselects a neighbor (Line 17). To sample a node path, the walk repeats the above process until the path reaches the required length.
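The steps above can be sketched as a Metropolis-Hastings style walk. This is a hedged reimplementation, not the authors' code: we assume that on rejection the walker simply keeps its current position without appending it again, and the hypothetical parameter `jump_p` stands in for the paper's minimal random-jump probability.

```python
import random

def paw_walk(adj, pi, length, start=None, jump_p=0.001, rng=random):
    """Propose a uniform neighbor, accept with
    alpha_uv = min(1, (pi_v * p_vu) / (pi_u * p_uv)); on rejection stay put."""
    nodes = list(adj)
    u = start if start is not None else rng.choice(nodes)
    path = [u]
    while len(path) < length:
        if rng.random() < jump_p:       # random jump keeps the chain irreducible
            u = rng.choice(nodes)
            path.append(u)
            continue
        v = rng.choice(adj[u])          # uniform proposal: p_uv = 1 / deg(u)
        p_uv, p_vu = 1 / len(adj[u]), 1 / len(adj[v])
        alpha = min(1.0, (pi[v] * p_vu) / (pi[u] * p_uv))
        if rng.random() < alpha:        # accept: move to the neighbor
            u = v
            path.append(u)
        # reject: stay at u and re-propose on the next iteration
    return path

# Short demo walk on a 3-node path graph under a degree-proportional pi.
demo = paw_walk({0: [1], 1: [0, 2], 2: [1]}, {0: 0.25, 1: 0.5, 2: 0.25},
                10, start=0, rng=random.Random(42))
```

Sampling a long walk under the self-distribution and counting node visits gives empirical frequencies close to π, which is the property Proposition 1 formalizes.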

To show that the PAW strategy obeys an arbitrary given stationary distribution, inspired by [25], we prove the following Proposition 1.

Proposition 1.
Supposing π is an arbitrary distribution, the PAW strategy obeys the stationary distribution π.
Proof. For the distribution π, the value at node u is π_u and the value at node v is π_v, where node v is a neighbor of node u. The transition probability from node u to node v is p_uv, and from node v to node u is p_vu. The acceptance probability from node u to node v is α_uv, and from node v to node u is α_vu. We have

π_u * p_uv * α_uv = π_u * p_uv * min(1, (π_v * p_vu) / (π_u * p_uv)) = min(π_u * p_uv, π_v * p_vu) = π_v * p_vu * min((π_u * p_uv) / (π_v * p_vu), 1) = π_v * p_vu * α_vu. (6)

We consider p′_uv = p_uv * α_uv as the effective transition probability from node u to node v, and p′_vu = p_vu * α_vu as the effective transition probability from node v to node u. We can then rewrite Equation (6) as π_u * p′_uv = π_v * p′_vu.
According to the detailed balance condition, the PAW strategy obeys the stationary distribution π.
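The proposition can also be checked numerically: build the effective transition matrix with entries p_uv * α_uv (the rejected probability mass stays on the diagonal) and verify that π is a left fixed point, π P′ = π. The helper name `effective_transition_matrix` is ours; this is a sketch, not part of the paper's algorithm.

```python
def effective_transition_matrix(adj, pi):
    """P'[u][v] = p_uv * alpha_uv for neighbors; rejection mass on the diagonal."""
    n = len(adj)
    P = [[0.0] * n for _ in range(n)]
    for u, nbrs in adj.items():
        for v in nbrs:
            p_uv, p_vu = 1 / len(adj[u]), 1 / len(adj[v])
            P[u][v] = p_uv * min(1.0, (pi[v] * p_vu) / (pi[u] * p_uv))
        P[u][u] = 1.0 - sum(P[u])  # probability of rejecting and staying at u
    return P

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
pi = [0.25, 0.25, 0.375, 0.125]  # the self-distribution deg(u) / 2|E|
P = effective_transition_matrix(adj, pi)
# pi is stationary: (pi P)_v = sum_u pi_u * P[u][v] equals pi_v for every v
for v in range(4):
    assert abs(sum(pi[u] * P[u][v] for u in range(4)) - pi[v]) < 1e-9
```

The same check passes for any other distribution on the same graph (e.g. the uniform one), which is exactly the "arbitrary given stationary distribution" claim of Proposition 1.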

Fusion of Feature Spaces
Principal Component Analysis (PCA) is widely used for extracting low-dimensional latent features in a unified embedding space. Since PAW samples walking paths under two different stationary distributions, the learned representational vectors belong to different embedding spaces, and their simple sequential concatenation still needs to be fused into a unified embedding space. After concatenating the two kinds of features into one vector per node, we utilize the PCA algorithm to fuse the vectors into one unified embedding space and obtain high-quality low-dimensional latent vectors.
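As a sketch of this fusion step, the following pure-Python power-iteration PCA (a stand-in for a library implementation; `pca_fuse` and its parameters are our own names, and real pipelines would use an optimized PCA routine) concatenates the two per-node embeddings, centers them, and projects onto the top d_merge principal directions.

```python
import random

def pca_fuse(emb_a, emb_b, d_merge, iters=200, seed=0):
    """Concatenate two embeddings per node, center, and project onto the
    top d_merge principal directions found by power iteration with deflation."""
    rng = random.Random(seed)
    X = [a + b for a, b in zip(emb_a, emb_b)]           # row-wise concatenation
    n, d = len(X), len(X[0])
    mean = [sum(row[j] for row in X) / n for j in range(d)]
    X = [[row[j] - mean[j] for j in range(d)] for row in X]
    # covariance matrix C = X^T X / n
    C = [[sum(X[i][j] * X[i][k] for i in range(n)) / n for k in range(d)]
         for j in range(d)]
    components = []
    for _ in range(d_merge):
        v = [rng.gauss(0, 1) for _ in range(d)]
        for _ in range(iters):                           # power iteration
            w = [sum(C[j][k] * v[k] for k in range(d)) for j in range(d)]
            norm = sum(x * x for x in w) ** 0.5 or 1.0
            v = [x / norm for x in w]
        lam = sum(v[j] * sum(C[j][k] * v[k] for k in range(d)) for j in range(d))
        components.append(v)
        # deflate: remove the found component from C before the next one
        C = [[C[j][k] - lam * v[j] * v[k] for k in range(d)] for j in range(d)]
    # project each centered row onto the principal components
    return [[sum(row[j] * comp[j] for j in range(d)) for comp in components]
            for row in X]
```

Applied to two 128-dimensional embeddings per node, this yields the d_merge-dimensional fused representation (128 for PAW128, 256 for PAW256).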

Experiments
In this section, we evaluate the proposed framework on four real-world datasets. We assess our method on both multi-label and single-label classification tasks, and the experimental results demonstrate improvements over the baselines in most cases.

Datasets
We employed four real-world datasets to comprehensively evaluate the proposed method's performance; their detailed statistics are listed in Table 1. In the table, |V| is the number of nodes and |E| is the number of edges; 'Yes' in the Multi-label column means the dataset is for multi-label classification tasks, otherwise it is for single-label classification tasks; 'Labels' is the number of node labels. PPI [27] is a subgraph of the PPI network representing relationships between proteins for Homo Sapiens; its labels, obtained from the hallmark gene sets, represent biological states. Hamilton and Mich [28] are social friendship networks extracted from the Facebook networks of two American institutions, respectively; they consist of nodes and edges representing people and their relationships, and the node labels are the users' majors.

Compared Algorithms
Six widely used methods are utilized as baselines to compare with our framework on the single-label and multi-label classification tasks.
DeepWalk [6] randomly selects one neighbor with equal probability and moves to it, repeating the selection and movement until the path length is met. It treats walking paths as sentences and uses the skip-gram model [12] to learn latent representations. LINE [8] proposes a new definition of similarity, consisting of first-order and second-order similarity, to represent the closeness between vertices; it optimizes a designed objective function that preserves both the global and local graph structures. node2vec [7] extends DeepWalk with a biased random walk that captures homophily and structural equivalence by comprehensively balancing depth-first search and breadth-first search. SDNE [9] proposes a deep structure based model with multiple layers of non-linear functions to capture the highly non-linear graph structure, preserving the network structure by jointly exploiting first-order and second-order proximity. struc2vec [10] measures node similarity at different scales using a hierarchy and encodes structural similarities by constructing a multilayer graph and generating structural contexts for nodes. NetMF [11] regards random walk strategies as various matrix factorization methods; it connects skip-gram based graph embedding algorithms with the theory of the graph Laplacian and presents an algorithm for computing network embeddings.
We propose two variants of our method ('PAW128' and 'PAW256'), both with the number of walks set to 80, the walk length set to 40, the window size set to 10, and d_separate = 128. For PAW128, we set d_merge = 128; for PAW256, we set d_merge = 256. The probability of a random jump to other nodes is set to 0.001. For the baselines, we set the dimension of the vector-based representations to 128, the window size in NetMF to 'Large', and LINE to the 2-step representation; the other parameters are set to the defaults in the corresponding papers.

Experiment Results
To demonstrate that the proposed framework learns more expressive representational vectors on node classification tasks, we compare our method with the baselines on four real-world datasets (two for multi-label classification tasks and two for single-label classification tasks). We utilize one-vs-rest logistic regression [6] to evaluate the quality of the learned representations of all methods. We randomly sample a certain fraction of the nodes as the training set and use the remaining nodes for testing. We repeat this process 10 times and report the average values of the F1 metrics.
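The Micro-F1 and Macro-F1 averages used throughout the evaluation differ in what they weight: Micro-F1 pools true/false positives globally (for single-label tasks it equals accuracy), while Macro-F1 averages per-class F1 scores so that rare classes count equally. A minimal single-label sketch (the multi-label case extends this per label):

```python
from collections import defaultdict

def f1_scores(y_true, y_pred):
    """Per-class true/false positives and false negatives, then
    Micro-F1 (global counts) and Macro-F1 (unweighted mean of per-class F1)."""
    labels = set(y_true) | set(y_pred)
    tp = defaultdict(int); fp = defaultdict(int); fn = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[p] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro

y_true = ["a", "a", "a", "b", "b", "c"]
y_pred = ["a", "a", "b", "b", "b", "b"]
micro, macro = f1_scores(y_true, y_pred)  # micro = accuracy here; macro < micro
```

Because class "c" is never predicted, its F1 is 0 and drags Macro-F1 well below Micro-F1, which is why Macro-F1 highlights performance on rare categories.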

Multi-label Classification
We employ two widely utilized multi-label datasets for our experiments. In these datasets, each node is assigned one or more labels.

BlogCatalog:
In this experiment, we increase the training proportion from 10% to 90%. Our proposed method is significantly stronger than the other baselines on the Micro-F1 metric shown in Figure 3, the Macro-F1 metric shown in Figure 4, and the Weighted-F1 metric shown in Figure 5. In addition, our method is particularly prominent on the Macro-F1 metric, which illustrates its effectiveness in classifying the rare categories of the BlogCatalog dataset. The struc2vec method is not shown because of its poor performance on all metrics from the 10% to the 90% training proportion (Micro-F1: 0.1-0.14; Macro-F1: 0.04-0.05; Weighted-F1: 0.09-0.1) on the BlogCatalog dataset. The gap between the results of PAW128 and PAW256 is not apparent, which means that the dimension of the representational vectors does not have much impact on the results.

PPI:
As shown in Table 2, at high proportions of training data (50% or higher), PAW256 performs best among all methods. At low proportions of training data (lower than 50%), the proposed method loses the top position (it is second only to the NetMF method) on the PPI dataset. This drawback at low training proportions is probably caused by insufficient training of the skip-gram model. When comparing the proposed method with the other random walk based methods (DeepWalk, node2vec), we find that our method performs better. The reason for the disastrous performance of node2vec is that its default hyper-parameters are not adapted to this dataset, which illustrates, from one side, the poor generality of random walk based methods with hyper-parameters.

In the above multi-label classification tasks, the random walk based methods generally perform better than the node-similarity based methods. We conjecture that the node-similarity based methods may need task-specific node similarities to improve their performance on the corresponding tasks.

Single-Label Classification
In this experiment, we randomly select from 10% to 90% of the nodes as the training set and use the rest to evaluate the performance. Our results are shown in Table 3, where we bold the highest result in each column of each dataset.
PAW256 achieves the best performance on both datasets at all proportions of training data. Although the proposed methods (PAW128, PAW256) show only a slight advance over some baselines (especially NetMF), compared with the other random walk based methods (DeepWalk, node2vec) the proposed approaches show apparent improvements. Comparing the different dimensions of the representational vectors, we find that the low-dimensional representation (128) performs slightly weaker than the high-dimensional one (256), which is caused by the low-dimensional representation losing network information.

Limitations
In some cases, as described in the above subsection on multi-label classification tasks, there is not enough training data for the skip-gram model in our proposed method at lower training proportions. In this paper, the skip-gram based version of our framework is compared with the other random walk based methods (DeepWalk, node2vec) to demonstrate the effectiveness of the individual extraction of network knowledge. If we applied a model that requires less training data instead of the skip-gram model, our method might avoid the performance drawbacks at lower training proportions. The methods based on node similarity (LINE, SDNE) did not achieve satisfactory performance in the above classification tasks, which, in our opinion, is caused by their default similarities not being designed for these specific tasks. struc2vec aims to find nodes with similar local structures and places them close to each other in the embedding space even if they are far apart in the original network; it is therefore probable that such methods are not suitable for the classification tasks in this paper.
Overall, the above experimental results demonstrate that the representational vectors learned from walking paths sampled with individual, specific network information are more expressive on four real-world networks with single-label and multi-label node classification tasks. Furthermore, the Probabilistic Accepted Walk strategy (PAW) shows a bright application prospect, as it could replace the random walk component of any random walk based method.

Conclusions
In this paper, we proposed an unsupervised network representation learning framework with an information-explainable random walk strategy from the perspective of stationary distributions. First, two stationary distributions, each based on a single kind of network knowledge, were proposed to guide the random walk in sampling walking paths. The individual extraction of different kinds of network knowledge alleviates the adverse effects of mixing network knowledge, thus improving the expressive ability of the representational vectors. Second, we proposed a random walk strategy named Probabilistic Accepted Walk to adapt to multiple distributions; the strategy makes the random walk obey an arbitrary given stationary distribution. Third, we utilized the principal component analysis algorithm to fuse the two original feature spaces into a unified space and reselect the latent representation.
The proposed framework can explain which specific network knowledge is embedded in the sampled walking paths. Learning representational vectors independently from each kind of network knowledge alleviates the adverse effects of mixed network knowledge. To adapt to various distributions, the Probabilistic Accepted Walk (PAW) algorithm is able to obey an arbitrary given stationary distribution. PAW256 outperformed the other six unsupervised network representation learning methods on four real-world datasets: at 90% training data, it achieved 0.4311 (Micro-F1), 0.2985 (Macro-F1), and 0.4094 (Weighted-F1) on BlogCatalog, 25.29 (Micro-F1) on PPI, 31.58 (Micro-F1) on Hamilton, and 45.13 (Micro-F1) on Mich. In the future, we would like to explore designing various stationary distributions to extract various kinds of specific network information, and to attempt novel learning models instead of the skip-gram model to improve representation performance.