Properties of Vector Embeddings in Social Networks

Embedding social network data into a low-dimensional vector space has shown promising performance for many real-world applications, such as node classification, node clustering, link prediction and network visualization. However, the information contained in these vector embeddings remains abstract and hard to interpret. Methods for inspecting embeddings usually rely on visualization methods, which do not work on a larger scale and do not give concrete interpretations of vector embeddings in terms of preserved network properties (e.g., centrality or betweenness measures). In this paper, we study and investigate network properties preserved by recent random walk-based embedding procedures like node2vec, DeepWalk or LINE. We propose a method that applies learning to rank in order to relate embeddings to network centralities. We evaluate our approach with extensive experiments on real-world and artificial social networks. Experiments show that each embedding method learns different network properties. In addition, we show that our graph embeddings in combination with neural networks provide a computationally efficient way to approximate the Closeness Centrality measure in social networks.


Introduction
Social network analysis has been attracting great attention in recent years. This is in part because social networks form an important class of networks that span a wide variety of media, ranging from social websites, such as Facebook and Twitter, to citation networks of academic papers. Mining and analyzing data from these social network sites has generated interesting insights, for example, into network formation processes (e.g., [1,2]), content distribution processes (e.g., [3,4]) and human (online) behavior (e.g., [5,6]). Furthermore, interesting applications have become possible, such as different types of recommender systems [7,8] or media analysis applications [9].
Data Analysis and Machine Learning techniques play an essential role in mining social network data. However, whenever we use such statistical machine learning techniques on graph analysis tasks, we have to find a suitable vectorial representation for the network at hand. The most straightforward representation consists of an adjacency matrix, where edges between nodes are indicated as entries in a square matrix. Due to its quadratic size and its sparsity, the adjacency matrix is not well suited for traditional machine learning algorithms. As a remedy, low-dimensional vector embeddings have become a promising and powerful tool for analyzing large social networks.
Graph embeddings represent every node in a graph as a low-dimensional, real-valued vector preserving different network properties, like, for example, first- or second-order proximities [10] or topological structures [11]. Typically, graph embeddings have been obtained by computationally intensive eigenvalue decomposition methods. Motivated by the success of deep learning techniques in the natural language processing (NLP) area, several novel graph embedding methods have been proposed. They learn dense vector representations utilizing random walks over the network. For example, DeepWalk [12] samples node sequences and feeds them to a Skip-Gram based model. In this paper, we investigate which network properties such embeddings preserve, and our results suggest that, in combination with neural networks, they allow the Closeness Centrality of nodes to be approximated efficiently in linear time. However, we do not give a formal proof of the runtime complexity, as it goes beyond the current study.
The remainder of the paper is organized as follows. In Section 2, we provide the definitions required to understand the problem and models. In Section 3, we provide a short overview of recent embedding techniques and inspection approaches. In Section 4, we define the problem that we study and describe the proposed method. In Section 5, we then describe our experimental setup and evaluate the proposed approach. Finally, in Section 6, we draw our conclusions and discuss future research directions.

Definitions and Preliminaries
A graph G = (V, E) consists of a set V = {v_1, . . . , v_n} of n nodes and a set E ⊆ V × V of edges between nodes. The adjacency matrix of G is the n × n matrix A with entries A_ij = 1 if (v_i, v_j) ∈ E and A_ij = 0 otherwise. We restrict ourselves to undirected graphs, where (v_i, v_j) ∈ E if and only if (v_j, v_i) ∈ E, so that the adjacency matrix is symmetric. For a node u ∈ V, the ego-network of u is the restriction of G to u and all its neighbors. We denote the set of egos in a social graph by U ⊆ V.
Definition 1 (Graph embedding). Let G be a graph. An embedding of G is a map f : V → R^d that assigns every node a d-dimensional real vector, where d ≪ |V|. Therefore, Y ∈ R^{|V|×d} denotes the embeddings of the graph G and Y_i the ith row of Y.
Definition 2 (First-order Proximity). The first-order proximity in a graph is the local pairwise proximity between two nodes. For each pair of nodes linked by an edge (v i , v j ), the weight on that edge, w ij , indicates the first-order proximity between v i and v j . If no edge is observed between v i and v j , their first-order proximity is 0.
Definition 3 (Second-order Proximity). The second-order proximity between a pair of vertices describes the proximity of the pair's neighborhood structure. Let N_i = {w_{i,1}, . . . , w_{i,|V|}} denote the first-order proximity between v_i and the other vertices. Then, second-order proximity is determined by the similarity of N_i and N_j. Second-order proximity compares the neighborhoods of two nodes and treats them as similar if they have a similar neighborhood.
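As an illustration of Definitions 2 and 3, the following sketch computes both proximities on a toy unweighted graph. All names are ours, and second-order proximity is measured here as the cosine similarity of neighbor sets, which is one common choice for comparing neighborhoods.

```python
# Sketch: first- and second-order proximity on a tiny unweighted graph.
import math

# Undirected graph as an adjacency dict; all edge weights are 1.
adjacency = {
    0: {1, 2},
    1: {0, 2},
    2: {0, 1, 3},
    3: {2},
}

def first_order(u, v):
    """First-order proximity: the edge weight (1 or 0 for unweighted graphs)."""
    return 1.0 if v in adjacency[u] else 0.0

def second_order(u, v):
    """Second-order proximity as cosine similarity of the neighbor sets."""
    nu, nv = adjacency[u], adjacency[v]
    overlap = len(nu & nv)
    denom = math.sqrt(len(nu) * len(nv))
    return overlap / denom if denom else 0.0

# Nodes 0 and 1 are linked and share neighbor 2.
print(first_order(0, 1))   # 1.0
print(second_order(0, 1))  # 0.5
```

Nodes 0 and 1 thus have both a nonzero first-order proximity (they are linked) and a nonzero second-order proximity (they share a neighbor).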

Graph Embedding Techniques
Recently, methods that use distributed representation learning techniques from the NLP domain, like the Skip-Gram algorithm, have gained attention from the research community. These NLP methods have been adapted to calculate graph embeddings that preserve first- and second-order proximities. To obtain the embeddings, these methods create symbolic sequences comparable to natural language text by conducting random walks over the graph. The methods yield lower time complexity compared to eigenvalue decomposition methods [26]. Moreover, they are able to map the nonlinear structure of the network into the embedding space. They are especially useful when one can only partially observe the graph, or when the graph is too large. In this section, we quickly review the different random-walk based embedding techniques:

• DeepWalk [12]: This approach learns d-dimensional feature representations by simulating uniform random walks over the graph. It preserves higher-order proximities by maximizing the probability of observing the previous c nodes and the next c nodes in the random walk centered at v_i. More formally, DeepWalk maximizes

log Pr({v_{i−c}, . . . , v_{i+c}} \ {v_i} | Y_i),

where Y_i is the embedding vector of the node v_i and c is the context size. We denote the mapping function for this method by DeepWalk : V → R^d, where d is the embedding size. Therefore, the DeepWalk embedding of the node v is denoted by DeepWalk(v).

• node2vec [10]: Inspired by DeepWalk, node2vec preserves higher-order proximities by maximizing the probability of occurrence of subsequent nodes in fixed-length random walks. The crucial difference from DeepWalk is that node2vec employs biased random walks that provide a trade-off between BFS and DFS graph searches, and hence produces higher-quality and more informative embeddings than DeepWalk. More specifically, two key hyper-parameters p ∈ R+ and q ∈ R+ control the random walk. Parameter p controls the likelihood of immediately revisiting a node in the walk. Parameter q controls whether the traversal approximates BFS or DFS behavior. For p = q = 1, node2vec is identical to DeepWalk. We denote the node2vec embedding by node2vec : V → R^d, where d is the embedding size. Therefore, the node2vec embedding of the node v is denoted by node2vec(v).

• loc [15]: This approach limits random walks to the neighborhood around egos to create artificial paragraphs. ParagraphVector [16] is then applied to learn local embeddings for egos by optimizing the likelihood objective using stochastic gradient descent with negative sampling [27]. Formally, given an artificial paragraph v_1, v_2, v_3, . . . , v_t, . . . , v_l for ego u_i, the goal is to update representations in order to maximize the average log probability

(1/l) Σ_{t=c}^{l−c} log Pr(v_t | Y_i, v_{t−c}, . . . , v_{t+c}),

where Y_i is the embedding vector of the ego u_i, l is the length of the artificial paragraph, and c is the context size. Therefore, there is a mapping function loc : U → R^d, where d is the embedding size. We denote the loc embedding of the ego u by loc(u).

• LINE [14]: LINE learns two embedding vectors for each node by preserving the first-order and second-order proximity of the network in two phases. In the first phase, it learns d/2 dimensions by BFS-style simulations over the immediate neighbors of nodes. In the second phase, it learns the next d/2 dimensions by sampling nodes strictly at a 2-hop distance from the source nodes. Then, the embedding vectors are concatenated as the final representation for a node. Indeed, LINE defines two joint probability distributions for each pair of nodes, one using the adjacency matrix and the other using the embedding, and minimizes the Kullback-Leibler (KL) divergence [28] of these two distributions. For the first phase, the model distribution is

p_1(v_i, v_j) = 1 / (1 + exp(−Y_i · Y_j)),

the empirical distribution is p̂_1(v_i, v_j) = w_ij / W, with W the sum of all edge weights, and the objective function minimizes the KL divergence between p̂_1 and p_1. Probability distributions and objective function are similarly defined for the second phase. This technique adopts the asynchronous stochastic gradient algorithm (ASGD) [29] for optimization. In each step, the ASGD algorithm samples a mini-batch of nodes and then updates the model parameters. We denote this embedding as a function LINE : V → R^d that maps nodes to the vector space, where d is the embedding size. Therefore, the LINE embedding of the node v is denoted by LINE(v).
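The walk-sampling step shared by these methods can be sketched as follows. This is a minimal illustration of DeepWalk-style uniform walks only; the Skip-Gram training that consumes the walk corpus is omitted, and all names are ours.

```python
# Minimal sketch of DeepWalk-style corpus generation: a fixed number of
# truncated uniform random walks started from every node. The resulting
# sequences play the role of "sentences" for Skip-Gram training.
import random

def random_walks(adjacency, walk_length=5, walks_per_node=2, seed=42):
    rng = random.Random(seed)
    walks = []
    for start in adjacency:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = adjacency[walk[-1]]
                if not neighbors:
                    break  # dead end: truncate the walk
                walk.append(rng.choice(sorted(neighbors)))
            walks.append(walk)
    return walks

graph = {0: {1, 2}, 1: {0}, 2: {0, 3}, 3: {2}}
corpus = random_walks(graph)
print(len(corpus))                        # 8 walks: 2 per node
print(all(len(w) == 5 for w in corpus))   # True: graph has no dead ends
```

node2vec differs only in this sampling step: instead of choosing the next node uniformly, it biases the choice using the hyper-parameters p and q.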

Techniques for Inspecting Embeddings
Graph embeddings usually create vectors of several hundred dimensions per node in the graph. While eigenvalue based decomposition methods give some formal guarantees on the retained network properties, random-walk based methods are stochastic in nature and depend heavily on hyper-parameter settings. Therefore, analyzing the retained graph properties requires either applying the embeddings to particular graph analysis tasks, like node classification, clustering, community detection or link prediction, or visualizing structural relationships. For example, if two nodes u and v are directly connected in the graph, they should appear close to each other in the visualized embedding space. In this section, we review different methods for inspecting graph embeddings in order to motivate the development of our own approach.
• Visualization: To gain insight into binary relationships between objects, the relations are often encoded into a graph, which is then visualized. The visualization is usually split into a layout and a drawing phase. The layout is a mapping of graph elements to points in R^d. The drawing assigns graphical shapes to the graph elements and draws them using the positions computed in the layout [30]. The effectiveness of DeepWalk is illustrated by visualizing Zachary's Karate Club network [12]. The authors of LINE visualized the DataBase systems and Logic Programming (DBLP) co-authorship network and showed that LINE is able to cluster together authors from the same field. Structural Deep Network Embedding (SDNE) [31] was applied to a 20-Newsgroup document similarity network to obtain clusters of documents based on topics.

• Network Compression: The idea in network compression is to reconstruct the graph with a smaller number of edges [23]. Graph embedding can also be interpreted as a compression of the graph. Wang et al. [31] and Ou et al. [32] tested this hypothesis explicitly by reconstructing the original graph from the embedding and evaluating the reconstruction error. They show that a low-dimensional representation for each node suffices to reconstruct the graph with high precision.

• Classification: Often in social networks, a fraction of the nodes is labeled with interests, beliefs, or demographics, while the remaining labels are missing. Missing labels can be inferred from the labeled nodes through the links in the network. The task of predicting these missing labels is also known as node classification. Recent work [10,12,14,31] has evaluated the predictive power of embeddings on various information networks including language, social, biology and collaboration graphs. The authors in [15] predict the social circles for a new node added to the network.

• Clustering: Graph clustering in social networks aims to detect social communities. In [33], the authors evaluated the effectiveness of the embedding representations of DeepWalk and LINE for network clustering. Both approaches showed nearly the same performance.

• Link Prediction: Social networks are constructed from observed interactions between entities, which may be incomplete or inaccurate. The challenge often lies in predicting missing interactions. Link prediction refers to the task of predicting either missing interactions or links that may appear in the future in an evolving network. Link prediction can, for example, be used to predict probable friendships for recommendation, leading to a more satisfactory user experience. Liao et al. [34] used link prediction to evaluate node2vec and LINE; node2vec outperforms LINE in terms of the area under the Receiver Operating Characteristic (ROC) curve.

Problem Statement
Problem 1-Explaining Embedding Relatedness: Let G be a social network graph where every node u ∈ V has an embedding Y_u obtained from one of the random-walk based embedding methods, i.e., DeepWalk(u), loc(u), node2vec(u), or LINE(u). The inner product Y_u · Y_v of two nodes u and v determines a similarity relation between the two nodes. Due to the random-walk based creation of the embeddings, we assume that this similarity can be approximated by similar network properties of the neighborhoods N(u) and N(v) of nodes u and v, respectively. Therefore, we aim to explain the inner product Y_u · Y_v by a weighted linear combination of centrality measures obtained from the neighborhoods N(u) and N(v). Degree centrality DC(u), closeness centrality CC(u), betweenness centrality BC(u) and eigenvector centrality EC(u) have been chosen as centrality measures due to their importance in social network analysis and due to their well-understood properties [35].
Degree centrality of node u is simply the degree of u [36]:

DC(u) = deg(u).

Closeness centrality of a node is the inverse of its average distance to all other nodes. Formally, closeness is defined as

CC(u) = (n − 1) / Σ_{v ∈ V, v ≠ u} d(u, v),

where d(u, v) is the length of the shortest path between u and v [36].
Betweenness centrality counts the fraction of shortest paths going through a node. The betweenness centrality of a node u is formally defined as

BC(u) = Σ_{s ≠ u ≠ t} σ_{s,t}(u) / σ_{s,t},

where σ_{s,t}(u) is the number of shortest paths between nodes s ∈ V and t ∈ V that pass through node u, and σ_{s,t} is the total number of shortest paths between s and t [36]. Eigenvector centrality generalizes degree centrality by incorporating the importance of the neighbors. The eigenvector centrality of u_i is a function of its neighbors' centralities and is proportional to the sum of their centralities:

EC(u_i) = (1/λ) Σ_{u_j ∈ N(u_i)} EC(u_j),

where λ is some fixed constant [36]. Closeness centrality is widely used to study information flow in social networks [37]. Betweenness and eigenvector centrality are commonly used to detect and investigate community structure in social networks [38].
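The degree and closeness definitions above can be sketched directly, using BFS for shortest-path distances on a small unweighted graph. The graph (a path 0-1-2-3) and all names are illustrative.

```python
# Sketch: normalized degree and closeness centrality on an unweighted graph.
from collections import deque

def bfs_distances(adjacency, source):
    """Shortest-path distances from `source` via breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def degree_centrality(adjacency, u):
    # Normalized degree: deg(u) / (n - 1).
    return len(adjacency[u]) / (len(adjacency) - 1)

def closeness_centrality(adjacency, u):
    # CC(u) = (n - 1) / sum of shortest-path distances to all other nodes.
    dist = bfs_distances(adjacency, u)
    total = sum(d for v, d in dist.items() if v != u)
    return (len(adjacency) - 1) / total

path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
print(closeness_centrality(path, 0))  # 3/6 = 0.5
print(closeness_centrality(path, 1))  # 3/4 = 0.75
```

As expected, the interior node 1 is closer on average to the rest of the path than the end node 0, so its closeness is higher.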
Problem 2-Predicting network properties: In order to consider single nodes without a neighborhood, we consider the problem of predicting the centrality values DC(v), CC(v), BC(v) and EC(v) for a particular node v based on its embeddings only. We aim to find a nonlinear mapping between the embedding vector and a single centrality property of its node. By obtaining such an approximately correct mapping, we can conclude that the vector space of the embeddings retains the structural information of the network property.
Overall, both case studies allow us to gain insights on the embedding properties in terms of centrality measures retained and the relatedness of their neighborhoods. In addition, predicting network properties successfully allows us to approximate computationally complex centrality measures more efficiently.

Explaining Embedding Relatedness
We formulate our approach such that the similarity of two embeddings Y_u and Y_v can be approximated by a weighted sum of network properties:

Y_u · Y_v ≈ Σ_{i=1}^{k} w_i p_i(u, v),

where p_i is a function that computes the similarity of the pair (u, v) based on network property i, k is the number of network properties considered, and w_i is the weight of network property i. To estimate the weights w_i for every property, we cast the problem into a learning-to-rank problem.
Learning to rank is an important problem in web page ranking, information retrieval and many other applications [39]. Given a ranking of items according to some query items, learning to rank obtains a function based on the similarity between the query and its ranked items. Several types of machine learning algorithms have been considered for this problem: pointwise methods, pairwise methods, and listwise methods. Most recent works have applied pairwise methods for learning to rank on graphs [40,41]. In pairwise ranking, one is given examples of order relationships among objects, and the goal is to learn from these examples a real-valued ranking function that induces a ranking or ordering over the object space. We consider the problem of learning such a ranking function when the data is represented as a graph, in which nodes correspond to objects and embeddings encode similarities between objects. Among existing approaches for learning to rank, rankSVM [42] is a commonly used method extended from the popular support vector machine (SVM) [43] for data classification. In training an SVM classifier, a weight vector is computed on the training data. In our setting, this weight vector measures how important each centrality is to the classifier, and thus explains which combination of centralities can explain the embeddings.
In learning to rank, we first sort nodes according to their similarities. More formally, for each node u_i, we sort all other nodes u_j ∈ U \ {u_i} according to the inner product similarity Y_i · Y_j. Therefore, each pair of nodes has a rank label that we use as ground truth. We denote the ground-truth vector by t ∈ R^z, where z = m(m − 1) and m = |U|.
Furthermore, we need to compute the similarity in terms of DC, CC, BC and EC between every pair of nodes (u_i, u_j). Given a node u_i ∈ U, we calculate the centrality measures for every node v ∈ N(u_i), where N(u_i) denotes the neighborhood of node u_i. Indeed, in our approach, we consider network properties within a subgraph around a focal node. To compare how similar the properties of two subgraphs are, we first calculate the probability distribution of each centrality by histogramming the centrality measures for every node v ∈ N(u_i). We then measure the similarity between two distributions over the same centrality by the KL divergence:

D_KL(P_{u_i} || Q_{u_j}) = Σ_x P_{u_i}(x) log( P_{u_i}(x) / Q_{u_j}(x) ),

where P_{u_i} and Q_{u_j} are probability distributions over the same centrality measure. A low D_KL means the distributions are very similar, and vice versa. In order to guarantee a certain stability of our approach, we assume that the KL divergence can be estimated in a reasonably large neighborhood and that the graph is connected. Therefore, for each pair of nodes, we have a feature vector x_i ∈ R^4, i = 1, . . . , z. We denote the feature matrix by X ∈ R^{z×4}. The rankSVM [42] model is built by minimizing the objective function

min_w (1/2) ||w||^2 + C Σ_{(i,j): t_i > t_j} ℓ(w · (x_i − x_j)),

where w = (w_DC, w_CC, w_BC, w_EC) is the centrality weight vector: w_DC denotes the weight for degree, w_CC the weight for closeness, w_BC the weight for betweenness, and w_EC the weight for eigenvector centrality. C > 0 is the regularization parameter, providing a trade-off between the misclassification of training examples and the simplicity of the decision surface: a low parameter value makes the decision surface smooth, whereas a high parameter value aims at classifying all training examples correctly by giving the model freedom to select more samples as support vectors. ℓ is a suitable loss function such as ℓ(t) = max(0, 1 − t)^2, and ||w||^2 is the regularization term that avoids overfitting by penalizing large coefficients in the solution vector [42]. The overall goal is to find w that optimizes the approximation of X · w to t.
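The per-pair feature construction, histogramming a centrality over each neighborhood and comparing the distributions with KL divergence, can be sketched as follows. The helper names and the smoothing constant `eps` (which avoids log(0) on empty bins) are ours.

```python
# Sketch: KL divergence between centrality histograms of two neighborhoods.
import math

def histogram(values, bins=4, lo=0.0, hi=1.0):
    """Normalized histogram of centrality values in [lo, hi]."""
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / (hi - lo) * bins), bins - 1)
        counts[idx] += 1
    total = sum(counts)
    return [c / total for c in counts]

def kl_divergence(p, q, eps=1e-9):
    """D_KL(p || q) with additive smoothing inside the log."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Degree-centrality samples from two hypothetical ego neighborhoods.
p = histogram([0.1, 0.2, 0.2, 0.8])
q = histogram([0.1, 0.15, 0.3, 0.9])
print(kl_divergence(p, p))  # 0.0: identical distributions
print(kl_divergence(p, q) >= 0.0)
```

Repeating this for DC, CC, BC and EC yields the four-dimensional feature vector x_i used by rankSVM.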

Predicting Graph Properties
In our second approach, we aim to analyze embedding properties in terms of single-node properties. We formulate the problem as a regression task that attempts to predict centrality values using embeddings as input. In detail, we use a feed-forward neural network model to learn nonlinear relationships between the embedding as input variables and a single centrality measure as the output variable [44]. The architecture of the model that approximates centrality values of v ∈ V is as follows:

• Input layer: The input is given by one of the different embeddings for a single node, namely DeepWalk(v), loc(v), node2vec(v) or LINE(v).

• Hidden layer: The hidden layer consists of a single dense layer with ReLU activation units [45].

• Output layer: The output layer has a single sigmoid unit [46]. We choose the sigmoid unit since normalized centrality values are in the range [0, 1].

• Optimizer: Stochastic gradient descent (SGD) [47], a popular technique for large-scale optimization problems in machine learning.
Since centralities are continuous variables, we need an error criterion that measures, in a probabilistic sense, the error between the desired quantity and our estimate of it. Therefore, we use the mean squared error (MSE), which is a common measure of estimator quality of the fitted values of a dependent variable.
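A minimal version of this regression model, one ReLU hidden layer, a sigmoid output, and full-batch gradient descent on the MSE loss, can be sketched in plain NumPy. The layer sizes and synthetic data below are illustrative, not the paper's actual setup.

```python
# Sketch: tiny feed-forward regressor (ReLU hidden layer, sigmoid output)
# trained with gradient descent on the MSE loss.
import numpy as np

rng = np.random.default_rng(0)

d, h = 8, 16                       # embedding size, hidden units (illustrative)
X = rng.normal(size=(200, d))      # stand-in "embeddings"
y = 1 / (1 + np.exp(-X[:, 0]))     # synthetic target values in [0, 1]

W1 = rng.normal(scale=0.1, size=(d, h)); b1 = np.zeros(h)
W2 = rng.normal(scale=0.1, size=(h, 1)); b2 = np.zeros(1)
lr = 0.5

def forward(X):
    hidden = np.maximum(0, X @ W1 + b1)           # ReLU hidden layer
    out = 1 / (1 + np.exp(-(hidden @ W2 + b2)))   # sigmoid output in [0, 1]
    return hidden, out.ravel()

mse_before = np.mean((forward(X)[1] - y) ** 2)
for _ in range(500):
    hidden, pred = forward(X)
    # Backpropagate the MSE loss through sigmoid, dense, ReLU, dense.
    d_z2 = ((2 / len(y)) * (pred - y) * pred * (1 - pred))[:, None]
    gW2, gb2 = hidden.T @ d_z2, d_z2.sum(axis=0)
    d_h = (d_z2 @ W2.T) * (hidden > 0)
    gW1, gb1 = X.T @ d_h, d_h.sum(axis=0)
    W2 -= lr * gW2; b2 -= lr * gb2
    W1 -= lr * gW1; b1 -= lr * gb1

mse_after = np.mean((forward(X)[1] - y) ** 2)
print(mse_after < mse_before)  # training reduces the MSE
```

In practice, a deep learning framework with mini-batch SGD would be used instead, but the error criterion and architecture are exactly as described above.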

Experiments
In this section, we report on the conducted experiments to evaluate the effectiveness and efficiency of our proposed method. We apply the method to several real-world as well as artificial social networks. We consider normalized values of centralities in our experiments.

Dataset
Since our learning-to-rank method is based on estimating node similarities using distributional properties of their neighborhoods, we focus our experiments on datasets providing such structures. Ego-networks serve as suitable datasets. In ego-networks, every graph contains sub-graphs around a focal node called the ego. The idea is to break the large graph up into smaller, easier-to-manage components and study the properties of the subgraphs. The ego-network model allows small patterns, anomalies, and features to be discovered that would be missed when the entire graph is analyzed [15]. Therefore, we use ego-networks from three major social networking sites: Facebook, Google+, and Twitter, available from Stanford University [17]. Table 1 describes the details of the datasets we used in our experiments. Moreover, we generate an artificial scale-free graph utilizing the Barabási-Albert model [48]. The Barabási-Albert model is an algorithm generating random graphs using a preferential attachment process. The process starts with an initial graph of m nodes. One new node is added to the network at each time step t ∈ N. In more detail, the preferential attachment process works as follows:

• With probability p ∈ [0, 1], the new node connects to m existing nodes chosen uniformly at random.

• With probability 1 − p, the new node connects to m existing nodes with a probability proportional to the degree of the node it will be connected to.
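The preferential attachment process above can be sketched as follows. With probability p the new node attaches uniformly, otherwise proportionally to degree; p = 0 recovers the classic Barabási-Albert model. All names are ours, and the initial graph is taken to be a clique on m nodes, one common choice.

```python
# Sketch: Barabási-Albert-style graph generator with a mixing parameter p.
import random

def barabasi_albert(n, m, p=0.0, seed=7):
    rng = random.Random(seed)
    # Start from a clique on the m initial nodes.
    edges = {(i, j) for i in range(m) for j in range(i + 1, m)}
    degree = {i: m - 1 for i in range(m)}
    for new in range(m, n):
        if rng.random() < p:
            targets = rng.sample(sorted(degree), m)  # uniform attachment
        else:
            # Degree-proportional sampling without replacement.
            targets, pool = [], dict(degree)
            for _ in range(m):
                total = sum(pool.values())
                r, acc = rng.uniform(0, total), 0.0
                for node, deg in pool.items():
                    acc += deg
                    if r <= acc:
                        targets.append(node)
                        del pool[node]
                        break
        for t in targets:
            edges.add((min(new, t), max(new, t)))
            degree[t] += 1
        degree[new] = m
    return edges, degree

edges, degree = barabasi_albert(n=100, m=3)
print(len(degree))  # 100 nodes
print(len(edges))   # 3 + 97 * 3 = 294 edges
```

Repeated preferential attachment produces the heavy-tailed degree distribution characteristic of scale-free networks.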
We simulate the Facebook graph using the Barabási-Albert model, which generates 87,516 edges for 4000 nodes. Similar to the Facebook dataset, we divide the Barabási-Albert graph into 10 ego-networks. The community library in Python [49] is used to cluster the graph into several communities. We consider the node with the highest betweenness as the ego for each subgraph [17]. For further investigation, we compare the degree, closeness, betweenness and eigenvector centrality distributions between one ego-network of the Facebook dataset and one from the artificial graph. Figure 1 shows the centrality distributions of two random egos: ego '686' of the Facebook dataset and ego '25' of the artificial dataset. We then require a metric to measure whether two distributions are identical. The Kolmogorov-Smirnov test [50] is the most popular test for this purpose. Therefore, we consider pairs of egos with similar size (number of edges) and apply the Kolmogorov-Smirnov test. The Kolmogorov-Smirnov test generates two key values: the KS statistic and the p-value. If the KS statistic is small and the p-value is high, we cannot reject the hypothesis that the two samples are drawn from the same distribution. Table 2 describes details for pairs of ego-networks that have nearly the same number of nodes. For all pairs, the distributions are similar, since the KS statistic is low (around 0.1) and the p-value is higher than 0.1 [50].
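The KS statistic itself is simple to illustrate: it is the maximum absolute difference between the two empirical cumulative distribution functions. The sketch below computes only the statistic; in practice a library routine would also supply the p-value.

```python
# Sketch: two-sample Kolmogorov-Smirnov statistic from empirical CDFs.
def ks_statistic(a, b):
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    def ecdf(xs, x):
        # Fraction of samples less than or equal to x.
        return sum(1 for v in xs if v <= x) / len(xs)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)

print(ks_statistic([1, 2, 3, 4], [1, 2, 3, 4]))  # 0.0 for identical samples
print(ks_statistic([0, 0, 0, 0], [1, 1, 1, 1]))  # 1.0 for disjoint samples
```

A KS statistic near 0.1, as observed in Table 2, thus means the two empirical CDFs are close everywhere.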

Parameter Settings
• DeepWalk: Here, we apply the Skip-Gram model [13] on the node sequences generated by truncated uniform random walks. We set the parameters as follows: context size c = 10, embedding size d = 128, length of each node sequence t = 40, and number of node sequences per node γ = 80 [12].

• loc: Here, we apply Paragraph Vector [16] to learn embeddings for limited sequences of nodes, treated the same as paragraphs in text. In the Paragraph Vector Distributed Memory (PV-DM) model, the optimal context size is 8 and the learned vector representations have 400 dimensions for both words and paragraphs [16].

• node2vec: This algorithm operates like DeepWalk, but the hyper-parameters p and q control the walking procedure. With q > 1, the random walk is biased towards nodes close to the start node. Such walks obtain a local view of the underlying graph with respect to the start node and approximate BFS behavior in the sense that samples are comprised of nodes within a small locality. The parameter p controls the likelihood of immediately revisiting a node in the walk. If p is low (< min(q, 1)), the walk is more likely to backtrack a step, which keeps it "local", close to the starting node. The optimal values of p and q depend strongly on the dataset [10]. In our experiments, we consider two settings: node2vec(1) keeps the walk local with p = 2^−8, q = 2^8, while node2vec(2) walks more exploratively with p = 2^8, q = 2^−8.

• LINE: We use LINE with first-order proximity, in which linked nodes have closer representations, and LINE with second-order proximity, in which nodes with similar neighbors have similar representations. In both settings, we use embedding size d = 128, batch size 1000, and learning rate ρ = 0.025.

Quantitative Results
In this section, we conduct two experiments: first, we inspect embeddings to see which network property is learned by each embedding technique; second, we try to predict the centrality values themselves.

Inspecting Embedding Properties
Embeddings, as a low-dimensional representation of the graph, are expected to preserve certain properties of the graph. Table 3 displays the properties that are learned by rankSVM [42] using the different embedding techniques. The following findings can be inferred from the table:

• Overall, we can explain the ranking by combining the betweenness, eigenvector, or degree centralities of the node's neighborhood. Closeness is not important for retaining the ranking. The accuracy of the SVM in all experiments is around 60%, which indicates that some explaining network properties are still missing.

• LINE and DeepWalk, which are able to explore the entire graph, can learn the betweenness and eigenvector centrality of nodes. Betweenness is a global centrality metric that is based on shortest-path enumeration; estimating the betweenness centrality of nodes therefore requires walking over the whole graph. Eigenvector centrality measures the influence of a node by exploiting the idea that connections to high-scoring nodes are more influential; this means that a node is important if it is connected to important neighbors. Therefore, computing eigenvector centrality also requires globally exploring the entire graph. Both LINE and DeepWalk do this in practice, hence they learn the eigenvector and betweenness centrality of nodes with around 60% accuracy.

• node2vec with p < 1 and q > 1 walks locally around the starting node. loc also walks over a limited area of the network. Therefore, neither is able to capture the structure of the entire network to learn betweenness or eigenvector centrality. The only property that is locally available is the degree of nodes, hence it is learned by node2vec(1) and loc.

• node2vec with q < 1 and p > 1 is more inclined to visit nodes that are further away from the starting node. Such behavior is reflective of DFS, which encourages outward exploration. Since node2vec(2) walks deeply through the graph, it can learn the eigenvector centrality.

Approximating Centrality Values
We aim to compute network centralities efficiently. We report the results in terms of mean squared error; hence, a smaller error indicates a better approximation. To approximate each centrality, we randomly selected 70% of the nodes in the Facebook graph as the training set and the rest as the test set. Table 4 reports the average value and standard deviation of the centralities, as well as the errors obtained when feeding the model with the different embeddings. The Root Mean Squared Error (RMSE) is a standard statistical metric to measure model performance.
We also report results in terms of the Normalized Root Mean Squared Error (NRMSE) and the Coefficient of Variation of the RMSE:

NRMSE = RMSE / (y_max − y_min),   CV(RMSE) = RMSE / ȳ,

where ȳ is the average of the target values on the test set, and y_max and y_min are the maximum and minimum of the target values on the test set [44]. It can be seen that, for closeness centrality, the error range is quite low compared to the average value, hence the regression model approximates closeness sufficiently well. Although closeness centrality was not a relevant factor for explaining the ranking of nodes in our previous learning-to-rank experiment, a nonlinear mapping as provided by the neural network can estimate closeness values. A possible interpretation is the existence of local manifolds in the embedding space containing nodes with similar closeness values. Since closeness represents the average distance of a node to all other nodes in the graph, this also hints that graph distances are preserved in the embedding space (in a nonlinear manner). However, for the other centrality measures, the RMSE values are in the range of the average value or even higher. Hence, it seems that embeddings alone are not strong enough features to approximate these centralities.
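The error metrics above can be sketched directly; the small target and prediction vectors below are illustrative.

```python
# Sketch: RMSE, range-normalized RMSE, and coefficient of variation of RMSE.
import math

def rmse(y, y_hat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def nrmse(y, y_hat):
    # RMSE normalized by the target range on the test set.
    return rmse(y, y_hat) / (max(y) - min(y))

def cv_rmse(y, y_hat):
    # RMSE normalized by the mean of the targets on the test set.
    return rmse(y, y_hat) / (sum(y) / len(y))

y = [0.2, 0.4, 0.6, 0.8]
y_hat = [0.25, 0.35, 0.65, 0.75]
print(round(rmse(y, y_hat), 3))   # 0.05
print(round(nrmse(y, y_hat), 3))  # 0.05 / 0.6 ≈ 0.083
```

Both normalized variants make errors comparable across centralities whose raw value ranges differ.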

Conclusions
This work has used graph embeddings to investigate network topological properties such as degree, closeness, betweenness, and eigenvector centrality. Empirical evaluation on real-world social networks indicated that each embedding technique retains a different combination of network properties. We studied and reported on recent methods for inspecting embeddings. We also presented an approach to approximate centrality values using neural networks. Our results revealed that closeness centrality is the only centrality that can be approximated well in this way, and that it can be computed more efficiently through this approximation. For future work, we will analyze the runtime of the closeness centrality approximation precisely. We believe that there are promising research directions in exploiting embeddings to approximate complex measures such as the length of the shortest path between two nodes.