Dynamics-Preserving Graph Embedding for Community Mining and Network Immunization

Abstract: In recent years, the graph embedding approach has drawn a lot of attention in the field of network representation and analytics, the purpose of which is to automatically encode network elements into a low-dimensional vector space by preserving certain structural properties. On this basis, downstream machine learning methods can be implemented to solve static network analytic tasks, for example, node clustering based on community-preserving embeddings. However, by focusing only on structural properties, it would be difficult to characterize and manipulate various dynamics operating on the network. In the field of complex networks, epidemic spreading is one of the most typical dynamics in networks, while network immunization is one of the effective methods to suppress the epidemics. Accordingly, in this paper, we present a dynamics-preserving graph embedding method (EpiEm) to preserve the property of epidemic dynamics on networks, i.e., the infectiousness and vulnerability of network nodes. Specifically, we first generate a set of propagation sequences through simulating the Susceptible-Infectious process on a network. Then, we learn node embeddings from an influence matrix using a singular value decomposition method. Finally, we show that the node embeddings can be used to solve epidemics-related community mining and network immunization problems. The experimental results in real-world networks show that the proposed embedding method outperforms several benchmark methods with respect to both community mining and network immunization. The proposed method offers new insights into the exploration of other collective dynamics in complex networks using the graph embedding approach, such as opinion formation in social networks.


Introduction
Complex networks have been widely used to represent the heterogeneous relationships among interactive elements in many real-world systems, such as social networks [1], neuronal networks [2], protein-protein interaction networks [3], and the World Wide Web [4]. In the past decades, extensive studies have focused on investigating the statistical mechanisms of network structure [5], as well as various dynamics in complex networks [6]. Accordingly, a series of network analytic tasks have been proposed, among which community mining [7][8][9][10][11] and node importance identification [12][13][14] have drawn extensive attention. The purpose of community mining is to identify groups of nodes with relatively dense connections in terms of network structure, while node importance is usually related to specific dynamics on networks [15,16]. For example, identifying influential nodes is essential for network immunization to contain epidemic spreading in complex networks [17][18][19]. In this paper, we focus mainly on investigating how epidemic dynamics on networks can promote community mining and network immunization.

Motivation
In the past decades, community mining has drawn a lot of attention in the field of complex networks. Many metrics have been proposed to guide the process of community mining. For example, the Kernighan-Lin method divides network nodes into smaller subgroups based on normalized cut [20]. There are also many other methods that rely on user-defined heuristics, such as modularity [21] and graph spectrum [22]. Meanwhile, in the field of machine learning, researchers have been focusing on developing various unsupervised clustering algorithms. However, due to the high dimensionality of network structure, most of them cannot be used directly to cluster network elements. Along this line, the graph-embedding approach has been proposed to automatically encode network elements into a low-dimensional vector space such that downstream machine learning algorithms can be used to solve specific network analytic tasks, such as community mining [23][24][25][26][27].
To date, many graph embedding methods have been proposed with the purpose of preserving various structural properties of complex networks (e.g., [28][29][30]), even with heterogeneous node/edge types [31,32]. However, most existing studies focused mainly on representing static network information, such as structural proximity, equivalence, and identity [33][34][35][36]. Little attention has been paid to characterizing the properties of dynamics on networks, such as epidemic/information spreading. The challenge lies in that a dynamic process on networks is usually nonlinear, and its operation is jointly determined by both the network structure and the nature of the dynamics itself. When focusing only on structural properties, it would be difficult to characterize various dynamics operating on the network, let alone manipulate the dynamic processes based on the generated embeddings.
In the field of complex networks, epidemic spreading is one of the most typical dynamics on networks. Existing studies have shown that epidemic spreading on a network is jointly determined by network structure and dynamic characteristics of the spreading [37]. Given an initially infected node (called source node), the next infections depend on the location and status of all infected nodes during the spreading. On the one hand, the neighbors of an infectious node are more likely to be infected. Therefore, the generated embeddings can by nature preserve the proximity of network nodes. On the other hand, by taking into consideration the time of infection in the propagation sequences, the embeddings can also reflect the infectiousness and vulnerability of network nodes in the face of epidemic spreading. In this case, a dynamics-preserving graph embedding method can offer new insights into, as well as new tools for epidemic intervention and control on networks.

Related Work
One type of dimension reduction technique, the graph (or network) embedding approach has been extensively studied in the past decade [23][24][25][26][27]. Several methods have been proposed to exploit the spectral property of network adjacency and its variants, such as IsoMap [38], LLE [39], and Laplacian eigenmaps [40]. These methods try to preserve the local relationships (i.e., the first-order proximity) of each node in the network. The key step lies in how to construct the first-order proximity of each node by finding its k nearest neighbors. Along this line, a great deal of graph embedding methods have been proposed to preserve the structural properties of complex networks. For example, several factorization-based embedding methods have been proposed to preserve the first-order proximity [41], as well as higher-order proximities of networks [42]. The basic idea is to learn embeddings of each node such that the inner product between any two learned vectors approximates certain measures of structural proximity (i.e., community-based embeddings [43]). Besides community-based embedding methods, many researchers have also been focusing on learning latent representations of higher-order structural properties of large-scale networks, such as structural equivalence [34,36,44], and role-based similarity/identity [35,45].
In recent years, many random-walks based embedding methods have also been proposed to preserve various order of structural proximities [30,33,46,47]. Such methods usually take two steps: first, a set of node sequences are sampled via random walks on a network, where densely connected nodes are more likely to be sampled in the same sequence. Then, node embeddings can be obtained by maximizing the co-occurrence probability of nodes appearing nearby in the sampled sequences. In doing so, such co-occurred nodes are more likely to have similar embeddings and thus be clustered into the same community. The difference lies in the neighborhood sampling strategies starting from each target node. For example, the DeepWalk method used the depth-first sampling strategy to sample the set of nodes [30], while the node2vec method further introduced a biased random walk procedure to balance the depth-first and breadth-first sampling strategies [33]. Extensive studies have shown that the random-walks based graph embedding methods can perform well in node clustering and classification tasks. Nevertheless, random walks are artificially designed and cannot depict any real-world dynamic processes on networks.
As another well-known dynamic process on networks, epidemic spreading can also be treated as a sampling strategy to solve graph embedding problems. Similarly to random walks, to represent the epidemic spreading on networks, it would be natural to first simulate the epidemic process on a network and generate a set of propagation sequences [48,49]. However, epidemic spreading is completely different from random walks. First of all, the Markov property holds for random walks on networks: conditional on the present, the future is independent of the past. For epidemic dynamics on networks, things are different. Starting from an infected node, a sequence of nodes can be generated based on the time of infection. Given a set of infectious nodes at any time, the next node that will be infected depends on the locations and states of all the infectious nodes in the network. In this case, the Markov property no longer holds for the sequence of infected nodes. To address this problem, in this paper, we aim to tackle the dynamics-preserving graph-embedding problem to represent the infectiousness and vulnerability of network nodes with respect to the dynamics of epidemic spreading on networks. In doing so, the node representations or embeddings can further be used to identify important nodes for network immunization.
Network immunization is one of the effective methods to suppress the epidemic dynamics on networks [6,50]. Typical network immunization strategies include random immunization [18], target immunization [18], and acquaintance immunization [51]. Max-degree immunization [52] is the first proposed target immunization strategy, in which a proportion of nodes with the highest degree are selected for vaccination before an epidemic spreads. The main idea behind this is that nodes with a higher degree are more likely to spread disease. Since then, different target immunization strategies have been proposed based on node importance and various measures of centrality, such as degree [52], betweenness [53], and eigenvector centrality [15]. However, most existing studies focus mainly on the structural properties of the network and ignore the characteristics of epidemic dynamics on networks. It is expected that the dynamics-preserving node embeddings can help improve the efficiency of network immunization by taking into consideration the characteristics of epidemic dynamics.

Our Contributions
In this paper, we present a dynamics-preserving graph embedding problem that aims to generate node representations by preserving both the structural and dynamic properties of networks. The main contributions of this paper are as follows:

• We develop a dynamics-preserving graph embedding method (EpiEm) to generate node representations that preserve the dynamic characteristics of the epidemic spreading on networks. Specifically, we first generate a set of propagation sequences by simulating the Susceptible-Infectious process on a network, and then learn node representations from an influence matrix using the singular value decomposition method.

• We propose an embedding-based network immunization strategy to immunize network nodes based on the preserved infectiousness and vulnerability in their representations. Such representations embed not only the structural properties of the network, but also the epidemic dynamics on the network.

• By conducting experiments on both synthetic and real-world networks, we demonstrate that the proposed embedding method outperforms the state-of-the-art graph embedding methods in terms of community mining tasks. Moreover, we also show that the embedding-based network immunization strategy outperforms several typical network immunization strategies by considering the dynamic characteristics of epidemic spreading.

The remainder of this paper is organized as follows. In Section 2, we first introduce the dynamics-preserving graph embedding problem. Then, we propose an embedding method of learning node representations based on singular value decomposition. Accordingly, we present an embedding-based network immunization algorithm in the face of epidemic spreading on networks. In Section 3, we evaluate the performance of the proposed methods in terms of node clustering in both synthetic and real-world networks. Moreover, we also evaluate the performance of the embedding-based network immunization algorithm by comparing with several benchmark algorithms. Finally, we conclude this work in Section 4.

EpiEm: A Dynamics-Preserving Graph Embedding Method
In this section, we develop a graph embedding method to preserve the dynamic properties of epidemic spreading on networks. First, we generate a set of propagation sequences by simulating the Susceptible-Infectious (SI) epidemic dynamics on networks using the Gillespie algorithm (Section 2.1). Based on the generated propagation sequences, we then build an asymmetric influence matrix and learn node representations using a singular value decomposition method (Section 2.2). Finally, based on the obtained representations of node infectiousness and vulnerability, we propose an embedding-based network immunization strategy to identify important nodes of a network with respect to epidemic dynamics (Section 2.3).

Generating Propagation Sequences
Without loss of generality, we simulate epidemic dynamics based on the Susceptible-Infectious (SI) model [54]. Under the SI model, the population is divided into two categories, i.e., S and I, to represent the proportion of susceptible and infected individuals respectively. Accordingly, S + I = 1. In a well-mixed population, the model can be formulated as an ordinary differential equation:

$$\frac{dI}{dt} = r S I = r I (1 - I),$$

where r is the transmission rate representing the number of effective contacts per susceptible individual per unit time that are sufficient to spread the disease. Given a network $G = (V, E)$, where $V = \{1, \cdots, n\}$ is the set of nodes and $E = \{e_{ij} \mid i, j \in V\}$ is the set of edges, each propagation sequence is simulated under the SI model based on the Gillespie algorithm [55,56], as follows. At the beginning, one node is chosen to be infected.

1. Calculate the state transition rate of each node. The rate at which a susceptible individual $i$ becomes infected is $\text{trans\_rate}_i(t) = r \times$ (the number of its infected neighbors). Infected individuals remain infected, so their transition rate is zero. The total transition rate at time $t$ is

$$\lambda(t) = \sum_{i \in V} \text{trans\_rate}_i(t).$$

2. After time $\Delta t$, determine the next node to change its state, where $\Delta t$ is sampled from an exponential distribution with mean $1/\lambda(t)$. The node $k$ will change its state if

$$\sum_{i=1}^{k-1} \text{trans\_rate}_i(t) \le v \, \lambda(t) < \sum_{i=1}^{k} \text{trans\_rate}_i(t),$$

where $v$ is a random number generated from the uniform distribution $U[0, 1)$.

Starting from each node $i \in V$, a set of $K$ disease propagation sequences $\{PS_i^k \mid 1 \le k \le K\}$ is sampled. In total, there are $K|V|$ propagation sequences. Each propagation sequence $PS_i^k = \{(i_l, t_l) \mid t_{l-1} \le t_l\}$ consists of node-time pairs $(i_l, t_l)$ ordered by infection time $t_l$. Here it should be emphasized that the propagation sequences are essentially different from random walks. As the node order in a sequence reflects the order of infection, two adjacent nodes in a sequence are not necessarily neighboring nodes in $G$. In doing so, the node embeddings learned from the propagation sequences can preserve the epidemic dynamics well. The pseudocode for generating the propagation sequences is given in Algorithm 1.
Generate ∆t based on λ(t) ; 6 Determine the next node v based on Step (2);
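The sampling procedure described in this subsection can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the function names `simulate_si_sequence` and `generate_sequences`, the adjacency-dictionary representation of the network, and the use of a termination time T as the stopping rule are assumptions made for the example.

```python
import random


def simulate_si_sequence(adj, source, r, T, rng):
    """Return [(node, infection_time), ...] for one SI outbreak seeded at `source`.

    adj: dict mapping each node to the set of its neighbours (assumed format).
    r: transmission rate per infected neighbour; T: termination time.
    """
    infected = {source}
    # pressure[n] = number of infected neighbours of susceptible node n.
    pressure = {}
    for nb in adj[source]:
        pressure[nb] = pressure.get(nb, 0) + 1
    t = 0.0
    sequence = [(source, 0.0)]
    while pressure:
        nodes = list(pressure)
        rates = [r * pressure[n] for n in nodes]  # trans_rate_i(t)
        total = sum(rates)                        # lambda(t)
        t += rng.expovariate(total)               # Delta t, mean 1 / lambda(t)
        if t > T:
            break
        # Choose the node whose cumulative-rate interval contains v * lambda(t).
        threshold = rng.random() * total
        cumulative = 0.0
        chosen = nodes[-1]  # fallback guards against floating-point round-off
        for n, rate in zip(nodes, rates):
            cumulative += rate
            if threshold < cumulative:
                chosen = n
                break
        infected.add(chosen)
        del pressure[chosen]
        sequence.append((chosen, t))
        for nb in adj[chosen]:
            if nb not in infected:
                pressure[nb] = pressure.get(nb, 0) + 1
    return sequence


def generate_sequences(adj, r, T, K, seed=0):
    """Sample K propagation sequences from every node, K * |V| in total."""
    rng = random.Random(seed)
    return {i: [simulate_si_sequence(adj, i, r, T, rng) for _ in range(K)]
            for i in adj}
```

Only susceptible nodes with at least one infected neighbour have a nonzero rate, so the sketch tracks just those nodes instead of summing over all of V.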

Learning Node Representations
Our learning algorithm is inspired by the widely used representation learning method GloVe [57]. The GloVe method uses a local sliding window on the sentences to count adjacent words that co-occur within the window. Based on that, a global word co-occurrence matrix is constructed, in which each element counts the number of co-occurrences of the corresponding two words. Then, the words are embedded as low-dimensional vectors, which are fed into non-linear functions and optimized to fit the co-occurrence matrix, that is, to maximize the probability of each co-occurrence. As a result, words that frequently co-occur have similar representations. However, the order of the words in each word pair is not considered, and the matrix is symmetric. In contrast, the nodes appearing in propagation sequences are naturally ordered by time, so the GloVe method cannot be adopted directly.
Here, we first construct an asymmetric influence matrix. In particular, let $X$ be a matrix of dimension $|V| \times |V|$. Initially, all elements of $X$ are set to 0. Then, we traverse the sampled propagation sequences. In a propagation sequence $PS_i^k$, the co-occurrence of the source node $i$ and a later infected node $j \in PS_i^k$ indicates a case in which node $i$ influenced node $j$, so $X_{ij}$ is incremented by 1. Note that we set a global time threshold, and nodes appearing after the threshold in the propagation sequences are not considered, as the corresponding strength of influence is weak. After traversing all the propagation sequences, we obtain the matrix $X$. We then adopt singular value decomposition (SVD) [58], one of the commonly used matrix decomposition methods, to encode the nodes into a low-dimensional space. The SVD algorithm is an effective mathematical tool for data compression and dimension reduction. The high-dimensional matrix $X$ is decomposed as $X = U \Sigma Z^{T}$, where $U$ and $Z$ are square matrices of dimension $|V|$ whose columns are the left and right singular vectors, respectively. $\Sigma$ is a diagonal matrix whose diagonal entries are the corresponding singular values, sorted in descending order. A singular value measures the importance of its singular vectors, and the dimension reduction is performed accordingly. Suppose the embedding dimension is set to $d$; then matrices $U'$, $Z'$, and $\Sigma'$ are generated by taking the first $d$ columns of $U$ and $Z$, respectively, and the first $d$ dimensions of $\Sigma$. The SVD method has been adopted in recommendation systems [59,60] to encode user and product matrices respectively. In this paper, the product of matrices $U'$ and $\Sigma'$ represents the infectiousness embeddings, denoted as Inf. Moreover, $Z'$ represents the vulnerability embeddings, denoted as Vul. The larger the inner product of the vectors $\mathrm{Inf}_i$ and $\mathrm{Vul}_j$, the greater the impact of node $i$ on node $j$.
Meanwhile, if the vectors Inf i and Inf j are close to each other in the low-dimensional vector space, the nodes i and j should have similar impact on other nodes. The pseudocode for learning infectiousness embeddings and vulnerability embeddings is given in Algorithm 2.

Algorithm 2: The EpiEm Algorithm
Input: Network G, transmission rate r, dimension d, number of propagation sequences per node K, termination time T, influence matrix X
Output: Infectiousness embedding Inf, vulnerability embedding Vul
1 for each node i ∈ V;
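The learning step of Section 2.2 can be illustrated with a short NumPy sketch. It assumes that nodes are labelled 0, ..., |V|-1 and that the propagation sequences are stored as a dictionary from source node to a list of (node, time) sequences; the names `build_influence_matrix` and `epiem_embeddings` are hypothetical and are not taken from the paper.

```python
import numpy as np


def build_influence_matrix(sequences, n, time_threshold):
    """Count how often source i is followed by node j within the time threshold."""
    X = np.zeros((n, n))
    for source, runs in sequences.items():
        for seq in runs:
            for node, t in seq[1:]:        # seq[0] is the source itself
                if t <= time_threshold:    # later infections are ignored
                    X[source, node] += 1
    return X


def epiem_embeddings(X, d):
    """Truncated SVD: Inf = U' Sigma' and Vul = Z', each of shape (n, d)."""
    U, s, Zt = np.linalg.svd(X, full_matrices=False)  # s is in descending order
    inf = U[:, :d] * s[:d]   # scale the first d left singular vectors
    vul = Zt[:d, :].T        # first d right singular vectors
    return inf, vul


# Toy propagation sequences for three nodes (made-up values for illustration).
sequences = {
    0: [[(0, 0.0), (1, 0.3), (2, 0.9)], [(0, 0.0), (1, 0.2)]],
    1: [[(1, 0.0), (0, 0.4), (2, 0.5)]],
    2: [[(2, 0.0), (1, 0.1)]],
}
X = build_influence_matrix(sequences, n=3, time_threshold=0.5)
inf, vul = epiem_embeddings(X, d=2)
```

With this convention, `inf @ vul.T` is the rank-d approximation of X, so the inner product of row i of `inf` and row j of `vul` approximates how often node i influenced node j.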

An Embedding-Based Network Immunization Strategy
We propose a static immunization strategy based on the learned node embeddings. Traditional static target immunization strategies, such as max-degree immunization [52] and eigenvector centrality immunization [15], vaccinate the nodes with the highest scores under the corresponding measure of centrality. For instance, in max-degree immunization, nodes with the highest degree are vaccinated before an epidemic spreads. The main reason behind this is that nodes with a higher degree are assumed to be more likely to infect other nodes, as they are connected to more nodes. In this paper, we argue that the characteristics of epidemic dynamics can be summarized by embeddings learned from past propagation sequences and used to identify important nodes before new epidemic outbreaks.
Given the node representations, we can calculate the influence strength for each node pair accordingly. In particular, we denote an influence strength matrix as $W$, in which each element $w_{ij}$ represents node $i$'s ability to transmit an epidemic to node $j$. The higher the value, the more likely it is that node $j$ is in the propagation sequence triggered by node $i$. Each entry $w_{ij}$ of the influence strength matrix $W$ can be calculated by the inner product of the corresponding node representations, i.e., $w_{ij} = \mathrm{Inf}_i \cdot \mathrm{Vul}_j$. Further, we define $w_i = \sum_{j \in V} w_{ij}$ for each node $i$ as its influence score in terms of all nodes in the network. The higher the score, the more critical the node is in the process of epidemic propagation. We rank the nodes based on $w_i$, and target the nodes with the highest scores for immunization given a limited number of vaccines.
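The ranking rule above can be sketched as follows. Because $w_i = \sum_{j} \mathrm{Inf}_i \cdot \mathrm{Vul}_j = \mathrm{Inf}_i \cdot \sum_{j} \mathrm{Vul}_j$, the scores can be computed without forming the full matrix $W$. The function names and the `budget` parameter (the number of available vaccines) are assumptions made for this example.

```python
import numpy as np


def influence_scores(inf, vul):
    """w_i = sum_j Inf_i . Vul_j, computed as Inf_i . (sum_j Vul_j)."""
    return inf @ vul.sum(axis=0)


def select_for_immunization(inf, vul, budget):
    """Return the indices of the `budget` nodes with the highest influence score."""
    w = influence_scores(inf, vul)
    order = np.argsort(-w, kind="stable")  # descending score, ties by node index
    return order[:budget].tolist()


# Toy embeddings for three nodes (made-up values for illustration).
inf = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
vul = np.array([[1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
scores = influence_scores(inf, vul)
targets = select_for_immunization(inf, vul, budget=2)
```

The shortcut reduces the cost from building a |V| × |V| matrix to a single matrix-vector product, which matters on larger networks.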

Experiments
In this section, we carried out a series of experiments to evaluate the performance of our proposed EpiEm algorithm for learning node representations. Two network analysis tasks were evaluated, including node clustering and network immunization. Specifically, we first visualize the effect of the EpiEm method on community mining based on two small networks and then use cluster indicators on three air-traffic networks for quantitative evaluation. Furthermore, simulations were carried out on three air-traffic networks and one paper citation network for network immunization. The effectiveness of our method was verified via comparison with other classical vaccination methods.

Clustering Visualization on Barbell and Karate Networks
The barbell graph bar(m, n) is a synthetic network consisting of two complete m-node subgraphs (G1 and G2) linked by a path P of length n, and all nodes of the two complete subgraphs are isomorphic. Without loss of generality, we use the barbell graph bar(10, 10) in our experiments (see Figure 1). In this section, we adopt the EpiEm method to generate the nodes' infectiousness embeddings, and achieve node clustering through a k-means algorithm. For visualization purposes, we directly set the dimension of node embeddings in bar(10, 10) as d = 2. As each node in the propagation sequence is associated with an infection time, we carefully set the termination time T = 0.1 to reflect the early stage of an epidemic such that the maximum length of the propagation sequences is 10. We generate K = 80 propagation sequences for each node with transmission rate r = 0.5. Figure 1 visualizes the node representations and clustering results for the bar(10, 10) network obtained by the EpiEm method. The nodes are divided into three clusters by the k-means algorithm, as shown in Figure 1b, in which different clusters are represented with different colors. It can be observed that the nodes in the complete subgraphs G1 and G2 and the nodes on the path P are completely separated. As the neighbors of an infectious node are more likely to be infected, the generated embeddings can by nature preserve the proximity of network nodes. The Euclidean distance between nodes in the embedding space reflects the ability of the nodes to influence each other during the spread of the disease, as shown in Figure 1a. For example, once node 10 is infected, the outbreak may spread rapidly to G1, or it may spread gradually through path P to G2. A similar conclusion applies to G2. Moreover, we can see two special nodes, 10 and 19, in the 2D plane. If there exists an outbreak in subgraph G1 (resp., G2), it must first infect node 10 (resp., node 19) before causing new infections in G2 (resp., G1).

Zachary's karate network is a representative real-world social network, which consists of 34 nodes and 78 edges. Each node represents a member of the karate club, and each edge represents the relationship between members inside the club. Many community mining algorithms are evaluated on this data set, and two clusters can be identified by these algorithms based on the structural information. In this paper, we apply the EpiEm method to generate node representations. The parameters are set as r = 0.5, K = 80, and d = 2, so that the node embedding result can be directly displayed in the 2-D vector space. The nodes are divided into two clusters by the k-means algorithm, and are marked with different colors in Figure 2b. It can be observed that the clustering results are consistent with those of the well-known community mining methods. Furthermore, more interesting findings can be observed from the epidemic dynamics perspective in Figure 2a. For example, nodes 0 and 33 are at the center of their respective clusters. The two center nodes can easily spread epidemics to all the nodes in their cluster, while other nodes are more likely to spread to a limited set of nodes. The center nodes therefore have different infectiousness from the other nodes. As a consequence, their node embeddings are placed in the far left corner. Meanwhile, some nodes bridge the two clusters, such as nodes 2, 8, 13, and 19. They mediate the process of epidemic propagation from one cluster to another, and thus their corresponding embeddings are similar, as shown in the center. Moreover, nodes 4, 5, 6, 10, 11, and 12 are densely connected but far away from other nodes. Epidemics can then spread quickly among them, instead of spreading to other nodes. As a result, their embeddings are close in the vector space. A similar phenomenon can be observed for nodes 14, 15, 18, 20, and 22.

Quantitative Evaluation for Clustering on Real Networks
Since both epidemic spreading and random walks are typical dynamics on networks, we compare our EpiEm method with several random-walks based graph embedding methods. In addition, as the EpiEm method adopts matrix factorization to achieve dimension reduction, we also compare with the spectral clustering method, which reduces dimension using a matrix factorization approach.

1. Spectral Clustering [40]: This is a matrix factorization approach that uses the eigenvectors corresponding to the d smallest eigenvalues of the normalized Laplacian matrix of graph G as the feature representation of nodes.

2. DeepWalk [30]: This approach is one of the first attempts to apply the word2vec approach to network embedding. The neighbor information of the nodes is captured by simulating uniform random walks. (We use the code provided by the author, Source: https://github.com/phanein/deepwalk)

3. Node2vec [33]: This is another random-walks based network embedding method. The random walks are balanced between breadth-first and depth-first sampling strategies with hyperparameters p and q. Unless otherwise specified, we adopt the default values of p and q in the authors' paper. (We use the code provided by the author, Source: https://github.com/aditya-grover/node2vec)

The experiments are carried out on three real-world air-traffic networks, which are widely used for evaluating representation learning methods. The details of the data sets are listed as follows.

1. Brazilian air-traffic network [35]: The network has 131 nodes and 1038 edges. The data counts airport activities by the National Civil Aviation Administration (ANAC) from January to December 2016, which records the total number of landings and takeoffs in 2016. The dataset has four node labels.

2. European air-traffic network [35]: The network has 399 nodes and 5995 edges. The data counts airport activities by the Statistical Office of the European Union (Eurostat) from January to November 2016. The dataset has four node labels.

3. USA air-traffic network [35]: The network has 1190 nodes and 13,599 edges. The data counts airport activity by the Bureau of Transportation Statistics from January to October. The dataset has four node labels.

We quantitatively compare the node clustering performance in terms of two entropy-based clustering indicators.

1. Homogeneity [61]: It measures the percentage of detected clusters containing only a single class label through conditional entropy.

2. Completeness [61]: It measures the percentage of nodes with the same class label allocated to the same cluster through conditional entropy.
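
The two indicators can be computed from label lists alone. The sketch below assumes the standard conditional-entropy definitions, homogeneity = 1 − H(C|K)/H(C) and completeness = 1 − H(K|C)/H(K), where C denotes the class labels and K the cluster assignments; whether the experiments used exactly this normalization is an assumption, and the function names are hypothetical.

```python
import math
from collections import Counter


def _entropy(labels):
    """Shannon entropy H(labels) in nats."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())


def _conditional_entropy(a, b):
    """H(a | b) in nats for two equal-length label lists."""
    n = len(a)
    joint = Counter(zip(a, b))
    b_counts = Counter(b)
    return -sum(c / n * math.log(c / b_counts[y]) for (_, y), c in joint.items())


def homogeneity_completeness(classes, clusters):
    """Return (homogeneity, completeness), each in [0, 1]."""
    h_c, h_k = _entropy(classes), _entropy(clusters)
    hom = 1.0 if h_c == 0 else 1.0 - _conditional_entropy(classes, clusters) / h_c
    com = 1.0 if h_k == 0 else 1.0 - _conditional_entropy(clusters, classes) / h_k
    return hom, com
```

For example, splitting every class into singleton clusters keeps homogeneity at 1 but lowers completeness, while merging all nodes into one cluster does the opposite, which is why the two scores are reported together.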
It is expected that the nodes with the same class labels are clustered into the same cluster. Therefore, the larger the two indicators, the better the embedding method. The parameter settings of the benchmark methods are given in Table 1. Notably, for the node2vec method, we adopt a grid search to determine the best p and q for our experiments. During the experiments, the different graph embedding methods are first used to generate node representations/embeddings, and then the k-means method is used to cluster network nodes into four clusters, which is the same as the number of node labels.

Table 2 shows that for all of these networks, DeepWalk, node2vec, and spectral clustering perform poorly, with scores significantly lower than the EpiEm method. The reason behind this is that the DeepWalk and node2vec methods conduct random walks on the network, so they focus on capturing the adjacency between network nodes through a Markov process. Moreover, the spectral clustering method calculates eigenvectors of the normalized Laplacian matrix derived from the adjacency matrix. Therefore, these methods tend to preserve the structural proximity of the network. Nevertheless, the node labels are not necessarily related to network proximity. For instance, two hubs may not be directly connected. Under the EpiEm method, propagation sequences are sampled based on the epidemic dynamics, and hub nodes tend to affect a large proportion of nodes. Therefore, two hub nodes, especially those with similar neighborhood structures, may affect similar sets of nodes although they are not directly connected. As a result, they have similar representations. Note that these results demonstrate that epidemic models are better choices than proximity-based methods for characterizing the dynamic properties of air-traffic networks; they do not mean that our method is superior to the others on other networks.

Network Immunization
In this section, we evaluate the performance of the proposed immunization strategy based on the EpiEm method. Simulations are carried out on the three air-traffic networks mentioned above and one paper citation network, the Cora network [62]. The Cora data set consists of machine learning papers. It has 2708 nodes and 5429 edges. In this corpus, each paper cites or is cited by at least one other paper. The papers are divided into seven categories. We compare with benchmark static immunization strategies, including random immunization, max-degree immunization, and eigenvector centrality immunization. The benchmark methods are described as follows:

1. Random immunization [18]: A proportion of nodes in the network are randomly selected for vaccination before an epidemic spreads. All nodes are selected with the same probability.

2. Max-degree immunization [52]: A widely used target immunization strategy. A proportion of nodes with the highest degree are selected for vaccination before an epidemic spreads.

3. Eigenvector centrality immunization [15]: Another widely used target immunization strategy. Eigenvector centrality is defined by the principal eigenvector of the network adjacency matrix, in which each element indicates the eigenvector centrality of the corresponding node. The nodes with the largest eigenvector centrality are selected for vaccination before an epidemic spreads.
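
The three benchmark strategies can be sketched on a dense adjacency matrix as follows. This is an illustration under stated assumptions, not the code used in the experiments: the graph is taken to be undirected and connected (so the adjacency matrix is symmetric and its principal eigenvector has entries of a single sign), and the function names and the `budget` parameter are hypothetical.

```python
import numpy as np


def random_targets(n, budget, seed=0):
    """Pick `budget` distinct nodes uniformly at random."""
    rng = np.random.default_rng(seed)
    return rng.choice(n, size=budget, replace=False).tolist()


def max_degree_targets(A, budget):
    """Pick the `budget` nodes with the highest degree."""
    degree = A.sum(axis=1)
    return np.argsort(-degree, kind="stable")[:budget].tolist()


def eigenvector_centrality_targets(A, budget):
    """Pick the `budget` nodes with the largest principal-eigenvector entry."""
    # eigh assumes a symmetric matrix and returns eigenvalues in ascending order.
    _, vecs = np.linalg.eigh(A)
    centrality = np.abs(vecs[:, -1])  # abs() removes the arbitrary overall sign
    return np.argsort(-centrality, kind="stable")[:budget].tolist()


# Star graph: node 0 is connected to nodes 1, 2, and 3.
A = np.zeros((4, 4))
for leaf in (1, 2, 3):
    A[0, leaf] = A[leaf, 0] = 1.0
```

On the star graph both targeted strategies select the hub first, whereas random immunization selects it with probability 1/4 for a single vaccine.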
For this experiment, we performed simulations based on the infectiousness embeddings and vulnerability embeddings obtained, and compare the results with benchmarks. The transmission rate is set as r = 1.0. The vaccine coverage is set as M = 10% for the three air-traffic networks and M = 1% for the Cora network as the last network is large. We performed 10 rounds of simulations, and take the average to obtain final results. Figure 3 shows the fraction of infected nodes as the epidemics spread with time under the four immunization strategies. The number of infected nodes increases with time and remains stable after a certain proportion of nodes infected. Among all the immunization strategies, random immunization performs apparently worst and results in the highest fraction of infected nodes in all the data sets. The result is in line with the fact that random immunization does not take the relative importance of different nodes into consideration. Meanwhile, EpiEm are most effective in all the data sets, slowing down the increase of infections in the early stage of the epidemic outbreak, and controls the epidemic outbreak. In comparison, the max-degree immunization and eigenvector centrality immunization do not specifically consider the critical nodes in terms of the process of epidemic propagation, and are outperformed by EpiEm.  We further investigate the performance of the proposed EpiEm immunization strategy with varying vaccine coverage. We set the vaccine coverage M = 5%, 10%, 15%, 20% for the three air-traffic networks and M = 1%, 3%, 5%, 10% for Cora network as this network is relatively larger, to simulate real situations where limited vaccines can be given. Figure 4 shows that for the three air-traffic networks, when the vaccine coverage reaches 20%, the number of infected nodes increases very slowly. For the Cora network, only 10% of vaccine coverage will slow down the growth of infections.
We also investigated the immunization efficiency of the EpiEm strategy at different transmission rates. The vaccine coverage M was set to 10%, and the epidemic transmission rate was chosen from r ∈ {0.5, 1.0, 1.5, 2.0}. As shown in Figure 5, a higher transmission rate speeds up the growth of infections, which in turn results in a higher fraction of infected nodes. Therefore, the vaccine coverage should be enlarged when the transmission rate is higher.
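The simulation protocol used in these experiments can be sketched as follows. This is a simplified illustration under stated assumptions rather than the code behind Figures 3 to 5: the function names, the adjacency-dictionary format, the uniformly random choice of the initially infected node, and the use of all nodes as the denominator of the infected fraction are hypothetical choices. Vaccinated nodes are treated as removed from the network, and each edge between an infected and a susceptible node transmits at rate r, following the Gillespie scheme.

```python
import random


def simulate_si(adj, vaccinated, seed_node, rate=1.0, t_max=10.0, rng=None):
    """Gillespie simulation of Susceptible-Infectious spreading.

    Returns a list of (time, number_of_infected_nodes) pairs.
    Vaccinated nodes can neither become infected nor transmit.
    """
    rng = rng or random.Random()
    blocked = set(vaccinated)
    if seed_node in blocked:
        raise ValueError("the initially infected node must not be vaccinated")
    infected = {seed_node}
    t = 0.0
    history = [(0.0, 1)]
    while True:
        # Active edges: infected node -> susceptible, unvaccinated neighbour.
        # Recomputed each step for clarity, not for efficiency.
        active = [(u, v) for u in infected for v in adj[u]
                  if v not in infected and v not in blocked]
        if not active:
            break
        # Waiting time to the next event is exponential in the total rate.
        t += rng.expovariate(rate * len(active))
        if t > t_max:
            break
        # Every active edge is equally likely to fire.
        _, newly_infected = rng.choice(active)
        infected.add(newly_infected)
        history.append((t, len(infected)))
    return history


def mean_final_fraction(adj, vaccinated, rate=1.0, t_max=10.0, rounds=10, seed=0):
    """Average final infected fraction over several independent rounds."""
    rng = random.Random(seed)
    candidates = sorted(set(adj) - set(vaccinated))
    total = 0.0
    for _ in range(rounds):
        start = rng.choice(candidates)  # assumed: random outbreak origin
        history = simulate_si(adj, vaccinated, start, rate, t_max, rng)
        total += history[-1][1] / len(adj)
    return total / rounds


if __name__ == "__main__":
    # Hypothetical path network 0-1-2-3-4; vaccinating node 2 splits it.
    path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
    for r in (0.5, 1.0, 1.5, 2.0):
        print(r, mean_final_fraction(path, vaccinated={2}, rate=r))
```

Passing the node set chosen by any immunization strategy as `vaccinated` allows the strategies to be compared under identical transmission rates and vaccine coverage.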

Conclusions
In this paper, we have proposed a dynamics-preserving graph embedding method, EpiEm, which preserves both the network structure and the dynamic properties of epidemic spreading on networks. The learned network embeddings can be applied to node clustering, as well as to network immunization before epidemic outbreaks. Unlike existing random-walk-based embedding methods, EpiEm samples propagation sequences by simulating epidemic dynamics on networks. We have adopted the Gillespie algorithm to simulate the Susceptible-Infectious dynamics on networks. Using a singular value decomposition method, we have encoded the node interactions during epidemic spreading into two sets of node representations, i.e., infectiousness and vulnerability. Experiments on both synthetic and real-world networks have shown that the proposed embedding method outperforms several benchmark methods in terms of node clustering and network immunization. The results and findings offer new insights into, as well as new tools for investigating, more complicated and realistic dynamics in complex networks.
In the future, the proposed method can be extended in the following directions. First, the SI model used in this paper is relatively simple. To solve other complex network analytic tasks, it would be possible to generate node embeddings based on more complicated epidemic dynamics on networks, such as the SIS and SIR epidemic models. Second, the singular value decomposition method used in this paper has high computational complexity. To deal with large-scale networks, it would be necessary to develop more efficient embedding methods. One possible way is to adapt the word2vec algorithm to our EpiEm method. Finally, it is expected that the dynamics-preserving graph embedding approach can also be used to investigate other types of dynamics on networks, such as opinion formation in social networks.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.