Community Detection of Multi-Layer Attributed Networks via Penalized Alternating Factorization

Abstract: Communities are often associated with important structural characteristics of a complex network system; detecting communities is therefore considered a fundamental problem in network analysis. With the development of data collection technologies and platforms, more and more sources of network data are acquired, which makes the form of networks and their related data more complex. For a multi-layer attributed network, which involves multiple network layers together with their attribute data, effectively utilizing the information from the multiple networks and the attributes may greatly enhance the accuracy of community detection. To this end, in this article, we study the integrative community detection problem of a multi-layer attributed network from the perspective of matrix factorization, and propose a penalized alternating factorization (PAF) algorithm to solve the corresponding optimization problem, followed by the convergence analysis of the PAF algorithm. Results of the numerical study, as well as an empirical analysis, demonstrate the advantages of the PAF algorithm in community discovery accuracy and compatibility with multiple types of network-related data.


Introduction
Network science has been one of the most active research fields in recent years [1] and has been successfully applied in many areas, including social science to study social relationships among individuals [2], biology to study interactions among genes and proteins [3], and neuroscience to study the structure and function of the brain [4]. Networks can represent and analyze the relational structure among interacting units of a complex system, and in many cases, the units of a network can be divided into groups with the property that there are many edges between units in the same group, but relatively few edges between units in different groups. Such groups are known as communities, which are often associated with important structural characteristics of a complex system [5,6].
For example, in social networks, communities can correspond to groups with common interests [7,8]. In World Wide Web networks, communities can correspond to webpages with related topics [9]; in brain networks, they can correspond to specialized functional components [10]; and in protein-protein interaction networks, they can correspond to groups of proteins that contribute to the same cellular function [11]. Communities are often useful for understanding the essential functionality and organizational principles of networks. Therefore, community detection is considered a fundamental problem in understanding and analyzing networks [6].
Community detection has been widely studied in many application fields since the 1980s. Various models and algorithms have been developed in different fields, such as machine learning, network science, social science, and statistical physics. Community detection is a computationally

Problem Formulation
In this section, we describe in detail the problem of community detection based on matrix factorization, moving from the single network to the multi-layer attributed network.

Single Network
Let $G = (N, E)$ denote a single network, where $N = \{1, \ldots, n\}$ is the node set that represents the units of the modeled system, and $E \subseteq N \times N$ is the edge set containing all pairs of nodes $(u, v)$ such that nodes $u$ and $v$ share a social, physical, or functional relationship, where $N \times N$ denotes the Cartesian product of $N$ with itself. A network $G$ can be characterized by an $n \times n$ adjacency matrix $A = (A_{ij})$ with each $A_{ij} \in \{0, 1\}$, where $A_{ij} = 1$ means that there exists an edge from node $i$ to node $j$ in network $G$, and $A_{ij} = 0$ otherwise. The purpose of community detection is to identify a partition of $N$ with community structure via the observed adjacency matrix $A$. Because communities can be defined in many ways, there are many approaches to community detection. In view of the simplicity and effectiveness of the matrix factorization approach, in this article we consider community detection within the framework of matrix factorization.
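As a small illustration of the definition above, the following sketch builds the adjacency matrix from an edge list. The helper name `adjacency_matrix` and the four-node edge list are hypothetical and serve only as an example.

```python
import numpy as np

def adjacency_matrix(n, edges, directed=True):
    """Build the n x n 0/1 adjacency matrix from (i, j) pairs, nodes labeled 1..n."""
    A = np.zeros((n, n), dtype=int)
    for i, j in edges:
        A[i - 1, j - 1] = 1        # A_ij = 1: edge from node i to node j
        if not directed:
            A[j - 1, i - 1] = 1    # undirected network: mirror the edge
    return A

# Hypothetical directed network on 4 nodes
edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
A = adjacency_matrix(4, edges)
print(A)
```

For an undirected network, passing `directed=False` yields a symmetric matrix.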
In the framework of matrix factorization, the problem of community detection, given a predetermined number of communities $k^*$, can be formulated as the following optimization problem,

$$\min_{C \in \mathbb{R}^{n \times k^*},\ S \in \mathbb{R}^{k^* \times k^*}} \|A - C S C^\top\|_F^2, \qquad (1)$$

where $C$ is the unknown $n \times k^*$ matrix used to find $k^*$ communities, $S$ is the unknown $k^* \times k^*$ weight matrix, and $\|\cdot\|_F$ denotes the Frobenius norm. This optimization problem is the same as that studied in [28], except that in our optimization problem the non-negative constraints on the elements of $C$ and $S$ are removed to improve computational efficiency. The matrix $C$ can be viewed as the community label matrix of $A$. By treating each row of $C$ as a point in $\mathbb{R}^{k^*}$, we divide these points into $k^*$ clusters via k-means or any other clustering algorithm. Then, we assign the network node $i \in N$ to community $k \in \{1, \ldots, k^*\}$ if and only if row $i$ of matrix $C$ is assigned to cluster $k$.

From a statistical point of view, the optimization problem (1) is closely related to the well-known stochastic block model (SBM) [29]. Specifically, under the $k^*$-community SBM with the $n \times k^*$ ground truth label matrix $C^{(0)}$ and the $k^* \times k^*$ connectivity probability matrix $S^{(0)}$, once the diagonal elements of the adjacency matrix $A$ are also treated as random terms rather than fixed at zero, the conditional expectation of $A$ given $C^{(0)}$ is

$$E(A \mid C^{(0)}) = C^{(0)} S^{(0)} (C^{(0)})^\top. \qquad (2)$$

Similar to the least-squares method, the ground truth labels $C^{(0)}$ can be predicted by minimizing the sum of squared differences between the observations $A_{ij}$ and their conditional expectations $E(A_{ij} \mid C)$:

$$\min_{C,\, S} \sum_{i=1}^{n} \sum_{j=1}^{n} \bigl(A_{ij} - E(A_{ij} \mid C)\bigr)^2 = \min_{C,\, S} \|A - C S C^\top\|_F^2, \qquad (3)$$

subject to $\sum_{j=1}^{k^*} C_{ij} = 1$ for each $i \in \{1, \ldots, n\}$. This minimization problem is very hard to solve, as the range of $C$, $\{C \in \{0, 1\}^{n \times k^*} : \sum_{j=1}^{k^*} C_{ij} = 1\}$, contains $(k^*)^n$ values. Consequently, to make the corresponding calculation feasible, (3) may be relaxed into (1), provided that the accuracy of community recovery can be guaranteed.
Note that in (1), the ranges of $C$ and $S$ are relaxed into the Euclidean spaces $\mathbb{R}^{n \times k^*}$ and $\mathbb{R}^{k^* \times k^*}$, respectively, whereas in other methods, such as the non-negative matrix factorization methods [28], the ranges are relaxed into $\mathbb{R}_+^{n \times k^*}$ and $\mathbb{R}_+^{k^* \times k^*}$. Here, we remove the non-negative constraints to improve the computational efficiency and compatibility of the proposed method, which will be explained in the following section.
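The relaxation can be illustrated with a minimal sketch for the symmetric case. It assumes that the relaxed objective is the squared Frobenius error of approximating $A$ by $C S C^\top$ with real-valued $C$ and $S$; under that assumption and for symmetric $A$, the $k^*$ eigenpairs of largest absolute eigenvalue give a minimizer (Eckart-Young), and the rows of $C$ are then clustered. The function names `factorize_symmetric` and `kmeans_rows` are hypothetical, and the simple Lloyd iteration with farthest-point initialization merely stands in for "k-means or any other clustering algorithm".

```python
import numpy as np

def factorize_symmetric(A, k):
    """Return (C, S) minimizing ||A - C S C^T||_F for a symmetric A."""
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(-np.abs(vals))[:k]      # k largest |eigenvalues|
    return vecs[:, top], np.diag(vals[top])

def kmeans_rows(C, k, n_iter=50):
    """Lloyd's k-means on the rows of C with farthest-point initialization."""
    centers = [C[0]]
    for _ in range(1, k):
        d = np.min([np.sum((C - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(C[np.argmax(d)])      # row farthest from chosen centers
    centers = np.array(centers)
    for _ in range(n_iter):
        dist = ((C[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = C[labels == c].mean(axis=0)
    return labels

# Noise-free toy example: two communities of three nodes each; self-loops are
# kept so that A = Z Z^T has rank exactly 2.
Z = np.repeat(np.eye(2), 3, axis=0)
A = Z @ Z.T
C, S = factorize_symmetric(A, 2)
labels = kmeans_rows(C, 2)
print(labels)
```

Community labels are identified only up to a permutation, so the printed labels should be compared with the truth as a partition rather than entry by entry.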

Multi-Layer Attributed Network
Once the structural information from multiple sources and the attribution information of the network nodes can be collected together, we consider the so-called multi-layer attributed network, denoted by $G^{\mathrm{Att}}_{\mathrm{Mul}} = (N, E^{(1)}, \ldots, E^{(m^*)}, X)$ and characterized by $m^*$ adjacency matrices $\{A^{(1)}, \ldots, A^{(m^*)}\}$ of size $n \times n$ as well as an $n \times p$ attribution matrix $X$. This is a unified framework, which includes the single network, the multi-layer network, and the attributed network as special cases. To achieve community detection of $G^{\mathrm{Att}}_{\mathrm{Mul}}$, we study the integrative matrix factorization problem (4), where $\{\omega_m\}_{m=0}^{m^*}$ with $\sum_{m=0}^{m^*} \omega_m = 1$ are the weight parameters specified beforehand and $V$ is a $p \times k^*$ matrix serving as the right factor in the matrix factorization of $X$. Similarly, to solve (4), we consider the approximate minimization problem (5).

Note that throughout this section, the weights $\{\omega_m\}_{m=0}^{m^*}$ need to be given beforehand by the user according to background knowledge. To determine these weights, the user may have to take into account the importance and scale of the data from each source. If no additional information is available, for simplicity, the weights can be equally distributed.
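To make the role of the weights concrete, the sketch below evaluates one plausible form of the integrative objective. The form is an assumption inferred from the definitions in the text, not a transcription of (4): each layer contributes $\omega_m \|A^{(m)} - C S^{(m)} C^\top\|_F^2$ and the attributes contribute $\omega_0 \|X - C V^\top\|_F^2$. The function name `integrative_objective` and the toy data are hypothetical.

```python
import numpy as np

def integrative_objective(A_layers, X, C, S_layers, V, weights):
    """Weighted factorization loss (ASSUMED form).

    weights[0] scales the attribute term, weights[m] scales layer m = 1..m*.
    """
    loss = weights[0] * np.linalg.norm(X - C @ V.T) ** 2
    for w, A, S in zip(weights[1:], A_layers, S_layers):
        loss += w * np.linalg.norm(A - C @ S @ C.T) ** 2
    return loss

# Toy data that factorizes exactly: 6 nodes, 2 communities, 2 layers, p = 2
C = np.repeat(np.eye(2), 3, axis=0)
S_layers = [np.array([[0.8, 0.1], [0.1, 0.7]]), np.array([[0.6, 0.2], [0.2, 0.9]])]
V = np.eye(2)
A_layers = [C @ S @ C.T for S in S_layers]
X = C @ V.T

m_star = len(A_layers)
weights = [1.0 / (m_star + 1)] * (m_star + 1)   # equal weights summing to 1
print(integrative_objective(A_layers, X, C, S_layers, V, weights))
```

With equal weights, each of the $m^* + 1$ data sources receives weight $1/(m^* + 1)$; rescaling a source changes its contribution quadratically, which is one reason the scale of each source matters when choosing the weights.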

Learning Algorithm
We present a penalized alternating factorization (PAF) scheme to minimize (5). In particular, the objective function is minimized step by step by fixing any $m^* + 2$ matrices in $\{C^{(1)}, C^{(2)}, S^{(1)}, \ldots, S^{(m^*)}, V\}$ and then optimizing the objective function with respect to the remaining one. The algorithm is described in detail as follows.
Here, $\alpha$, $\beta$, and $\nu$ are three small positive numbers used to ensure convergence of the algorithm. Note that all the update formulas in the update step of the above algorithm have explicit expressions. Specifically, given which has the explicit expression in (6). Similarly, we update $C^{(1,t+1)}$ in (7). Then, in (8), given which has the explicit expression in (9).
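The alternating scheme can be sketched as follows. This is not the update set (6)-(9): the objective is an assumed form in which $C^{(1)}$ and $C^{(2)}$ replace the left and right copies of the label matrix, $C^{(1)}$ multiplies $V^\top$, and a penalty $\lambda \|C^{(1)} - C^{(2)}\|_F^2$ ties the two together; the terms involving $\alpha$, $\beta$, and $\nu$ are omitted. The names `objective`, `alternating_sketch`, and `lam`, the random initialization, and the toy data are all hypothetical. Each block update is an exact least-squares solve, so the recorded objective values are non-increasing.

```python
import numpy as np

def objective(A_layers, X, C1, C2, S_layers, V, w, lam):
    """ASSUMED penalized objective; w[0] weights X, w[m] weights layer m."""
    val = w[0] * np.linalg.norm(X - C1 @ V.T) ** 2
    val += lam * np.linalg.norm(C1 - C2) ** 2
    for wm, A, S in zip(w[1:], A_layers, S_layers):
        val += wm * np.linalg.norm(A - C1 @ S @ C2.T) ** 2
    return val

def alternating_sketch(A_layers, X, k, w, lam=1.0, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    C1 = rng.standard_normal((X.shape[0], k))
    C2 = C1.copy()
    history = []
    for _ in range(n_iter):
        # S^(m) and V: ordinary least squares with C1 and C2 fixed
        P1, P2 = np.linalg.pinv(C1), np.linalg.pinv(C2)
        S_layers = [P1 @ A @ P2.T for A in A_layers]
        V = (P1 @ X).T
        # C1: normal equations C1 G = R with C2, S^(m), V fixed
        G = w[0] * (V.T @ V) + lam * np.eye(k)
        R = w[0] * (X @ V) + lam * C2
        for wm, A, S in zip(w[1:], A_layers, S_layers):
            B = S @ C2.T                      # k x n
            G = G + wm * (B @ B.T)
            R = R + wm * (A @ B.T)
        C1 = np.linalg.solve(G, R.T).T        # G is symmetric positive definite
        # C2: normal equations C2 G = R with C1, S^(m) fixed
        G = lam * np.eye(k)
        R = lam * C1
        for wm, A, S in zip(w[1:], A_layers, S_layers):
            D = C1 @ S                        # n x k
            G = G + wm * (D.T @ D)
            R = R + wm * (A.T @ D)
        C2 = np.linalg.solve(G, R.T).T
        history.append(objective(A_layers, X, C1, C2, S_layers, V, w, lam))
    return C1, C2, history

# Hypothetical toy data: 40 nodes, 2 communities, 2 directed layers, p = 2
rng = np.random.default_rng(1)
z = np.repeat([0, 1], 20)
P = np.array([[0.6, 0.1], [0.1, 0.6]])
A_layers = [(rng.random((40, 40)) < P[np.ix_(z, z)]).astype(float) for _ in range(2)]
X = np.eye(2)[z] + 0.3 * rng.standard_normal((40, 2))
C1, C2, hist = alternating_sketch(A_layers, X, k=2, w=[1 / 3, 1 / 3, 1 / 3])
print(hist[0], hist[-1])
```

In practice the rows of the returned `C1` (or `C2`) would be clustered, for example by k-means, to obtain community labels, and a better initial value than random noise would normally be used.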

Theoretical Analysis
Next, we consider the convergence theory of the PAF algorithm. We show that the iteration sequence generated by the PAF algorithm converges to a critical point of (5).
Proof. Please see Appendix A.1.
Proof. Please see Appendix A.2.
Proof. Please see Appendix A.3.

Numerical Study
We now present the results of numerical studies that demonstrate the performance of the PAF algorithm and compare it with three existing methods, abbreviated as SCP, ANMF, and NMF. SCP is spectral clustering with perturbations [33]. ANMF and NMF are the non-negative matrix factorization methods proposed in [28] for directed and undirected networks, respectively. All the network data are generated from the SBM or the multi-layer SBM, and the attribution data are generated from multivariate normal distributions, where the distribution parameters are specified in each of the following settings. We use the normalized mutual information (NMI) to measure the consistency between the predicted labels and the true community labels.
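For reference, NMI can be computed directly from the label counts, as in the following minimal sketch. Several normalizations of mutual information are in use; this sketch assumes normalization by the arithmetic mean of the two label entropies, which may differ from the variant used to produce the reported results. The function names are hypothetical.

```python
from collections import Counter
from math import log

def entropy(labels):
    """Shannon entropy (natural log) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())

def nmi(labels_true, labels_pred):
    """Mutual information divided by the mean of the two entropies."""
    n = len(labels_true)
    joint = Counter(zip(labels_true, labels_pred))
    ca, cb = Counter(labels_true), Counter(labels_pred)
    mi = sum((c / n) * log(c * n / (ca[a] * cb[b])) for (a, b), c in joint.items())
    h = 0.5 * (entropy(labels_true) + entropy(labels_pred))
    return mi / h if h > 0 else 1.0   # convention when both partitions are trivial

same_partition = nmi([0, 0, 1, 1], [1, 1, 0, 0])   # labels swapped, same partition
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])      # partitions carry no shared information
print(same_partition, independent)
```

Because NMI depends only on the partition, it is unaffected by relabeling the communities, which is why it suits the comparison of predicted and true labels.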
First, we consider the following two simulation settings for single networks and attributed networks:
I. The $n \times n$ adjacency matrix $A$ is generated from the undirected SBM with the parameters
Each row of the $n \times 2$ attribution matrix is independently generated from the multivariate normal distribution $N_2(\mu_k, \sigma^2 I_2)$, where the $k$th element of $\mu_k$ is 1 and the remaining element is 0, and $\sigma^2 = 0.15$.
II. The same as Setting I, except that the undirected SBM is replaced by the directed SBM.
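Data of the kind described in Settings I and II can be simulated along the following lines. The network size, the community sizes, and the connectivity matrix `P` below are placeholders chosen for illustration and are not the parameters of the settings; only the attribute distribution $N_2(\mu_k, \sigma^2 I_2)$ with $\sigma^2 = 0.15$ follows the text. Self-loops are excluded here, which is also an assumption. The function names are hypothetical.

```python
import numpy as np

def sample_sbm(labels, P, directed=False, rng=None):
    """Sample a 0/1 adjacency matrix; P[a, b] is the edge probability a -> b."""
    rng = np.random.default_rng() if rng is None else rng
    probs = P[np.ix_(labels, labels)]            # n x n edge probabilities
    A = (rng.random(probs.shape) < probs).astype(int)
    if directed:
        np.fill_diagonal(A, 0)                   # no self-loops (assumption)
    else:
        A = np.triu(A, 1)                        # keep the upper triangle only
        A = A + A.T                              # and mirror it
    return A

def sample_attributes(labels, k, sigma2, rng=None):
    """Row i ~ N_k(mu, sigma2 * I), where mu has a 1 in position labels[i]."""
    rng = np.random.default_rng() if rng is None else rng
    mu = np.eye(k)[labels]
    return mu + np.sqrt(sigma2) * rng.standard_normal(mu.shape)

rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)                   # placeholder: n = 100, two equal communities
P = np.array([[0.30, 0.05], [0.05, 0.30]])       # placeholder connectivity probabilities
A_undirected = sample_sbm(labels, P, rng=rng)
A_directed = sample_sbm(labels, P, directed=True, rng=rng)
X = sample_attributes(labels, k=2, sigma2=0.15, rng=rng)
print(A_undirected.sum() // 2, A_directed.sum(), X.shape)
```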
The simulation results for Settings I and II are summarized in Figure 1, where SCP($A$), NMF($A$), ANMF($A$), and PAF($A$) denote applying SCP, NMF, ANMF, and PAF to $A$, respectively, k-means($X$) denotes applying k-means to $X$, and PAF($A$, $X$) denotes applying PAF to $(A, X)$. The results of SCP($A$), NMF($A$), ANMF($A$), and PAF($A$) in Figure 1 suggest that (1) PAF is a very good alternative to NMF and ANMF in terms of accuracy of community detection, and (2) NMF, ANMF, and PAF outperform SCP in situations where directed networks are studied. On the other hand, the comparison between PAF($A$, $X$) and the other methods in Figure 1 suggests that applying k-means to the attribution data alone fails to achieve community detection; however, once the attribution data and the network data are combined, much better results can be obtained than by using the network data or the attribution data separately.
Next, we consider the following two simulation settings for multi-layer networks and multi-layer attributed networks:
III. The $m^* = 3$ adjacency matrices $\{A^{(1)}, A^{(2)}, A^{(3)}\}$, each of size $n \times n$, are generated independently from the undirected multi-layer SBM with common community labels, where the parameters are set as follows,
Each row of the $n \times 3$ attribution matrix is independently generated from the multivariate normal distribution $N_3(\mu_k, \sigma^2 I_3)$, where the $k$th element of $\mu_k$ is 1 and the remaining elements are 0, and $\sigma^2 = 0.15$.
IV. The same as Setting III, except that the undirected multi-layer SBM is replaced by the directed multi-layer SBM.
The simulation results for Settings III and IV are summarized in Figure 2, where PAF(A (1) , (2) ), and SCP(A (3) ) suggests that (1) integrating community information from the multiple adjacency matrices of the network layers may perform better than using each network layer separately, and (2) using the PAF algorithm to achieve integrative community detection for the multi-layer attributed network can make appropriate use of the network-related data from multiple sources.

Empirical Analysis
In this section, we apply the proposed PAF method to a dataset from a network study of a corporate law partnership, carried out in 1988-1991 at a Northeastern US corporate law firm in New England, referred to as SG&R, and previously studied in [45,56,57]. The dataset includes 71 attorneys of this firm and three network layers (co-work, advice, and friendship), as well as several attributes of the attorneys: status (1 = partner; 2 = associate), gender (1 = man; 2 = woman), office (1 = Boston; 2 = Hartford; 3 = Providence), years with the firm, age, practice (1 = litigation; 2 = corporate), and law school (1 = Harvard, Yale; 2 = UCON; 3 = other). We treat the attribute "status" as the ground truth community label, as in [45]. In fact, after eliminating six isolated nodes, the heatmap plots of the adjacency matrices with nodes sorted by each attribute variable indicate that the partition by "status" presents a strong assortative structure. The remaining six attributes together with the three network layers then form a multi-layer attributed network with $m^* = 3$ network layers and $p = 6$ attribute variables, which falls squarely within the scope of the proposed method.
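The assortative structure that such heatmaps reveal can also be summarized numerically by the edge density within and between groups. The sketch below is illustrative only: the function name `block_density` and the six-node toy network are hypothetical and are not derived from the law firm data.

```python
import numpy as np

def block_density(A, labels):
    """Fraction of ones in each (group a, group b) block of A.

    Diagonal blocks include the zero diagonal of A, which slightly lowers
    their density for small groups.
    """
    labels = np.asarray(labels)
    groups = np.unique(labels)
    D = np.zeros((len(groups), len(groups)))
    for a, ga in enumerate(groups):
        for b, gb in enumerate(groups):
            D[a, b] = A[np.ix_(labels == ga, labels == gb)].mean()
    return D

# Toy network: two triangles joined by a single edge
A = np.zeros((6, 6), dtype=int)
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
labels = [0, 0, 0, 1, 1, 1]
D = block_density(A, labels)
print(D)
```

A partition is assortative in this sense when the diagonal entries of `D` dominate the off-diagonal ones, which corresponds to dense diagonal blocks in a heatmap with nodes sorted by label.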
Intuitively, all three network layers and all six attributes can contribute to the community detection task with the ground truth label "status". Specifically, the descriptive analysis results in Figure 3 show that all six attributes provide useful information for distinguishing the two values of "status"; the top three panels of Figure 4, i.e., the heatmap plots of the three adjacency matrices partitioned by the ground truth labels, partly exhibit a block structure according to the two values of "status".

Figure 3. Panels for "gender", "office", "practice", and "law school"; the right two panels are the box-plots of "status" versus the two count variables "seniority" and "age".

The authors of [45] compared seven methods for community detection on this dataset. Table 1 recalls the NMI results of these methods from [45], together with the NMI result obtained by applying the proposed PAF method to the multi-layer attributed network with $m^* = 3$ network layers and $p = 6$ attribute variables. Table 1 indicates that the NMI performance of the PAF method is almost the same as that of the best existing method. The heatmap plots of the three adjacency matrices partitioned by the labels predicted by the PAF method are given in the bottom three panels of Figure 4 and are quite similar to those partitioned by the ground truth labels. Viewed from another perspective, Figure 5 plots the three network layers colored by the ground truth labels and by the labels predicted by PAF, respectively; both partitions present a strong assortative structure for the three network layers, especially for the friendship layer. These results demonstrate the practicability of the PAF method in community detection of multi-layer attributed networks.

Conclusions
We have proposed PAF, a unified framework and algorithm that is applicable to community detection of multi-layer attributed networks as well as their special cases, such as single networks, attributed networks, and multi-layer networks. The main idea of PAF is to replace the community label matrix at two different positions in the original objective function with two different substitution matrices, penalize the gap between the two substitution matrices, and then alternately optimize each of the substitution matrices as well as the other variable matrices. The results of the simulation study and empirical analysis demonstrate the advantages of the PAF algorithm in community discovery accuracy and compatibility with multiple types of network-related data.
In future work, we will study community detection of multi-layer attributed networks from a statistical perspective, where likelihood functions under statistical models of multi-layer attributed networks will be considered.

Conflicts of Interest:
The authors declare no conflicts of interest.
Appendix A.4. Proof of Theorem 1
To establish the proof of Theorem 1, we need to recall the definition of the Kurdyka-Łojasiewicz (KL) property and prove the following two properties.
Here, $\mathrm{dist}(x, \mathcal{X})$ denotes the shortest distance between the point $x$ and the point set $\mathcal{X}$, i.e., $\mathrm{dist}(x, \mathcal{X}) = \min_{y \in \mathcal{X}} \mathrm{dist}(x, y)$, and $\Phi_\eta$ denotes the class of all concave and continuous functions $\xi : [0, \eta) \to \mathbb{R}_+$, $\eta \in \mathbb{R}_+$, which satisfy: (a) $\xi(0) = 0$; (b) $\xi$ is continuously differentiable on $(0, \eta)$; (c) for all $s \in (0, \eta)$, $\xi'(s) > 0$. Now, we are ready to present the proof of Theorem 1, based on the fact that the function $H$ is a KL function.
Accordingly, from (A18) and (A19), we obtain that for any $\gamma > 0$ and $\epsilon > 0$, there exists an integer $t_u > 0$ such that for all $t > t_u$,
$H(\Theta^{(t)}) < H(\Theta) + \gamma$ and $\mathrm{dist}(\Theta^{(t)}, \Omega) < \epsilon$. (A20)
It is known that the function $H$ is a KL function [59]; then, by using the KL inequality in Definition (A1) and (A20), there exists $\xi$ such that
According to Proposition 3, we have
Besides, $\xi$ is concave and we have that
Then according to Propositions 2 and 3, we have
Summing (A24) over $t$ from 1 to $z$ then yields the following inequality, i.e.,
Taking the limit of the left side of inequality (A26) as $z \to \infty$, we get
It is easy to see that (A27) implies that $\{\Theta^{(t)}\}_{t=1}^{\infty}$ is a Cauchy sequence and thus a convergent sequence. This completes the proof, i.e., the sequence $\{\Theta^{(t)}\}_{t=1}^{\infty}$ converges to a critical point of $H$ in (A17).

Appendix B. Some Additional Numerical Results
In this section, we mainly investigate the computational efficiency of the proposed algorithm for relatively large scale multi-layer attributed networks via some additional numerical results. We consider the following simulation setting for multi-layer networks and multi-layer attributed networks.
V. The $m^* = 3$ adjacency matrices $\{A^{(1)}, A^{(2)}, A^{(3)}\}$, each of size $n \times n$, are generated independently from the undirected and directed multi-layer SBMs, respectively, with common community labels, where the parameters are set as follows,
Each row of the $n \times 3$ attribution matrix is independently generated from the multivariate normal distribution $N_3(\mu_k, \sigma^2 I_3)$, where the $k$th element of $\mu_k$ is 1 and the remaining elements are 0, and $\sigma^2 = 0.08$.
As suggested in Figure A1, the proposed algorithm framework is compatible with a variety of network-related data, and has relatively good accuracy of community discovery and acceptable computational efficiency. Compared with the SCP algorithm, which provides the initial value for the proposed algorithm, the proposed algorithm does not excessively reduce the computational efficiency.

Figure A1. The left two panels present the NMI results for Setting V and the right two panels present the RT (running time in log-second) results for Setting V.