SoftNet: A package for the analysis of complex networks

: Identifying the most important nodes according to speciﬁc centrality indices is an important


Introduction
Let G be a connected, undirected, unweighted graph with a large number of nodes n and a number of edges significantly smaller than n 2 .We assume there are no self-loops or multiple edges in G. Networks represented by such kind of graph occur in many applications, such as epidemiology, genetics, telecommunications, and energy distribution; see [7,13,14,30].It is usual to associate to the graph G a symmetric adjacency matrix A = [A ij ] ∈ R n×n with entries A ij = 1, if nodes i and j are connected by an edge, and A ij = 0, otherwise.
It is often meaningful to extract from a large graph numerical values describing global properties of the graph, such as the ease of traveling between vertices, or the importance of a chosen node.A walk in a network is an ordered list of nodes such that successive entries of the list are connected.A well-known fact in graph theory is that the number of walks of length m ≥ 1 starting at node i and ending at node j is given by [A m ] ij , that is, the entry (i, j) of the m-th power of the adjacency matrix.Let us assume that the coefficients c m in the matrix-valued function are nonnegative and decay fast enough to ensure convergence of the series.Then, the ease of traveling between the nodes i and j can be measured by [ f (A)] ij , with i = j, while the importance of node i can be quantified by [ f (A)] ii .
The following definitions, which are discussed in [13][14][15]18,19,21], are motivated by the discussion above: • the degree of node i, given by [Ae] i = e T i Ae, provides a measure of the importance of node i; • the f -subgraph centrality of node i, defined by furnishes a more sophisticated measure of the importance of node i than its degree; • the f -communicability between nodes i and j, quantifies the ease of traveling between nodes i and j; • the f -starting convenience of node i, given by quantifies the ease of traveling from node i to anywhere in the network.This is the sum of the communicabilities from node i to all other nodes, scaled so that the sum of the quantity over all nodes is one.
When the adjacency matrix A is large, i.e., when the graph G has many nodes, direct evaluation of f (A) generally is not feasible.Benzi and Boito [5] applied pairs of Gauss and Gauss-Radau rules to compute upper and lower bounds for selected entries of f (A).This work is based on the connection between the symmetric Lanczos process, orthogonal polynomials, and Gauss-type quadrature, explored by Golub and his collaborators in many publications; see Golub and Meurant [24] for details and references.A brief review of this technique is provided in Section 2. An application of pairs of block Gauss-type quadrature rules to simultaneously determine approximate upper and lower bounds of expressions of the form (1.7) when u and v are "block vectors", i.e., matrices with many rows and very few columns, is described in [21].
The main drawback of quadrature-based methods is that the computational effort is proportional to the number of desired bounds.Therefore these methods may be expensive to use when bounds for many expressions of the form (1.7) are to be evaluated.This situation arises, for instance, when we would like to determine one or a few nodes with the largest f -subgraph centrality in a large graph, because this requires the computation of upper and lower bounds for all diagonal entries of f (A).
A method to produce upper and lower bounds for quantities of the form (1.7) has been proposed in [20].It is based on that knowledge of a few leading eigenvalue-eigenvector pairs gives bounds for every entry of f (A), with little computational effort in addition to computing the eigenvalue-eigenvector pairs.For example, determining the m most important nodes of a graph, with m much smaller than the number of nodes n, amounts to finding the m nodes with the largest f -subgraph centrality.It is possible to quickly evaluate bounds for all entries [ f (A)] ij if a partial spectral factorization of A is available.Using these bounds we can determine a set of ≥ m nodes containing the m nodes of interest, and compute tighter bounds for the nodes in this set, if necessary, by employing Gauss-type quadrature rules.When n, the complexity of this hybrid algorithm is much smaller than computing upper and lower bounds for all entries [ f (A)] ii , i = 1, . . ., n, by Gauss quadrature.
In this work, we present a MATLAB package for the identification of the m most important nodes according to the centrality/communicability indices discussed above, based on two matrix functions, namely, the exponential (1.2) and the resolvent (1.3).Either the f -subgraph centrality, the f -communicability or the f -starting convenience can be computed.The computation can be performed by using one of three different methods: Gauss quadrature, partial spectral factorization, or the hybrid method; the latter two algorithm have been introduced in [20].This paper is organized as follows.Section 2 recalls how upper and lower bounds for quantities of the form (1.7) can be determined via Gauss quadrature.Approximation via partial spectral factorization of A is discussed in Section 3 and the hybrid method is summarized in Section 4. Section 5 presents the SoftNet package as well as a graphical user interface (GUI) that simplifies its use.A brief description of the code and its use also is provided.Section 6 describes some numerical experiments and Section 7 contains concluding remarks.

Approximation by Gauss quadrature
Let A be a symmetric matrix of order n and suppose that we are interested in computing bounds for bilinear forms where u and v are given vectors and f is a smooth function defined on an interval [a, b] ⊂ R that contains the spectrum of A. Since we can focus on the case u = v.
The matrix A has the spectral decomposition A = QΛQ T .Then we can write i.e., we may regard F u,u (A) as a Stieltjes integral; see [20,24] for further details.We approximate this integral by Gauss-type quadrature rules as follows.Let u be of unit Euclidean norm.Application of k steps of the Lanczos algorithm to A with initial vector u gives the k × k symmetric tridiagonal matrix T k .It can be shown that e T 1 f (T k )e 1 is a k-node Gauss quadrature rule G k for the Stieltjes integral (2.2).A (k + 1)-node Gauss-Radau quadrature formula Ĝk+1 with a fixed node at θ ≥ ρ(A) for approximating the Stieltjes integral also can be defined.This discussion assumes that the Lanczos algorithm does not break down.Breakdown is very rare and allows the computations to be simplified.
Under the assumption that the derivatives of f (x) have constant sign on the convex hull of the support of the measure, which is met by the functions (1.2) and (1.3), and the Radau node θ is suitably chosen, pairs of Gauss and Gauss-Radau rules furnish lower and upper bounds of increasing accuracy for the quadratic form (2.2).For the functions (1.2) and (1.3), and the Radau node θ chosen as described, we have For a user-chosen accuracy τ, we terminate the iterations with the Lanczos algorithm when 3) The default value in the code is τ = 10 −3 .
The matrix functions are applied to the tridiagonal matrices by using their spectral factorization.Thus, let When f is the exponential function, we let µ be the largest eigenvalue of T k and evaluate instead of e T k to avoid overflow.
Regarding the choice of the Radau node θ, we often may let θ = A ∞ .Alternatively, we can use the MATLAB function eigs or the function irbleigs described in [2,3] to determine an estimate of the largest eigenvalue λ 1 of A.

Bounds via partial spectral factorization
This section recalls how to derive bounds for expressions of the form (2.1), with u = v = 1, by using a partial spectral factorization of A. Introduce the spectral factorization A = QΛQ T , where the eigenvector matrix Q = [q 1 , q 2 , . . ., q n ] ∈ R n×n is orthogonal and the eigenvalues in the diagonal matrix where ũk = u T q k and ṽk = v T q k .Let the first N eigenpairs {λ k , v k } N k=1 of A be known.Then F u,v (A) can be approximated by The following results from [20] shows how upper and lower bounds for F u,v (A) can be determined with the aid of the first N eigenpairs of A.
Theorem 1.Let the function f be nondecreasing and nonnegative on the convex hull of the spectrum of A and let F the N largest eigenvalues of A and let q 1 , q 2 , . . ., q N be associated orthonormal eigenvectors.Then we have where and F To determine which nodes have the largest f -subgraph centrality (1.4), we use the inequalities (3.3) and (3.4).The N leading eigenpairs {λ k , q k } N k=1 of A and the bounds (3.2) and (3.3) can be used to determine a subset of nodes that contains the vertices with the largest value of the node metric we are considering.
Let L (N) u,v and U (N) u,v be the lower and upper bounds defined in Theorem 1.Since we seek an approximation of the centrality value for all the nodes of the network, we will either set u = v = e i , as in (1.4), or u = e and v = e i , as in (1.6).So, F (N) u,v will be a quantity depending on an index i = 1, . . ., n.We will write u,v to simplify the notation when we are computing either e,e i .When approximating (1.5), we will fix a value of j and consider m denote the mth largest value of the vector (F contains the indices of the nodes that can be considered important with respect to the desired centrality index. A computational difficulty to overcome is that we do not know in advance how the dimension N of the leading invariant subspace span{v 1 , v 2 , . . ., v N } of A should be chosen in order to obtain useful bounds (3.2) or (3.3).We use the restarted block Lanczos method irbleigs described in [2,3], which computes the leading invariant subspace {λ k , q k } k=1 of A for a user-chosen dimension , and allows the extension of such subspace by successively increasing the value of .Using irbleigs, we compute more and more eigenpairs of where |S| denotes the number of elements of the set S. This stopping criterion is referred to as the strong convergence condition.As shown in [20], the set S (N) m contains the indices of the m nodes with the largest f -subgraph centrality.
The criterion (3.6) for choosing N is useful if the required value of N is not too large.The weak convergence criterion has been introduced to be used for problems for which a large value of N is required in order to satisfy (3.6), and this makes it impractical to compute the associated bounds (3.2).The weak convergence criterion is well suited also for application in the hybrid algorithm described in Section 4. This criterion is designed to stop increasing N when the values F (N) u,v do not increase significantly with N. Specifically, we stop increasing N when the average increment of the values in the vector F (N) u,v is small when the Nth eigenpair {λ N , q N } is included in the bounds.The average contribution of this eigenpair to ), and we stop increasing N when for a user-specified tolerance τ, whose default value in the code is 10 −3 .Note that when this criterion is satisfied, but not (3.6), the nodes with index in S m with many more than m indices.In particular, we may not be willing to compute accurate bounds for a specific node metric by applying the approach of Section 2 to all nodes with index in S (N) m .We therefore describe how to determine a smaller index set J , which is likely to contain the indices of the m most important nodes.We discard from the set S (N) m .Thus, for a user-chosen parameter σ > 0, we include in the set J all indices i ∈ S (N) The default value for σ in the software is 10 −3 .

The hybrid method
We summarize here the algorithm corresponding to the hybrid method.The first step is to compute a partial spectral factorization of the adjacency matrix A. Such a partial factorization makes it possible to determine a set of candidate nodes that contains the most important nodes according to a chosen criterion, e.g., the f -subgraph centrality.The accuracy of upper and lower bounds for the candidate nodes is then improved by a suitable application of Gauss and Gauss-Radau quadrature rules.

The SoftNet software package
The package SoftNet for MATLAB is available at the web page http://bugs.unica.it/cana/software as a compressed archive.Uncompressing it, a directory named SoftNet will be created; in order to use the package the user should add its name to the search path.The package SoftNet consists of 14 MATLAB routines for the identification of the m most important nodes in a network according to different centrality indices.The package also includes the function irbleigs from [2,3] [34] and available at [28].
The package Contest by Taylor and Higham [33] contains different kind of synthetic networks and can be used to generate further numerical tests.We provide a convenient interface to this package.
Table 1 lists the 14 MATLAB routines with a description of their purpose.The first group, "Computational Routines," includes the functions for computing different centralities (subgraph centrality (1.4), communicability (1.5), and starting convenience (1.6)) with respect to two different matrix functions, the exponential (1.2) and the resolvent (1.3).The computations can be performed with three different methods, namely the Gauss quadrature method recalled in Section 2, the low-rank approximation presented in Section 3 and the hybrid method described in Section 4. The section "Auxiliary Routines for the Graphical User Interface" lists some routines required to start and use the graphical user interface.vipnodes Starts the graphical user interface.compute Performs the computations according to the chosen parameters.choose_contest Allow selection of a synthetic network from the Contest Package [33].initialize_gui Initializes the default settings for the graphical user interface.The computational routines are totally independent from the graphical user interface and can be used by the user from the MATLAB command line.For example, the command [vip, vipsgc] = sgcenlr(A,'exp',10); identifies the 10 most important nodes according to the subgraph centrality when the lowrank approximation is used for the computation.vip and vipsgc are vectors containing the indices of the nodes that are candidates to being the most important nodes and the values of their subgraph centrality, respectively.
Identifying the 5 most important nodes of a network whose adjacency matrix is A with respect to the starting convenience can be done by the following lines of code func = 'exp'; % the function to be used nnodes = 5; % the number of nodes to be identified theta = eigs(double(A),1,'LA'); % estimation of the largest eigenvalue opts = struct('gausstolq',1e-5,'gaussmaxn',150,'gaussmu',theta,'show',1) [vip, vipsgc, info, iters, allstconv] = stconvgauss(A,func,nnodes,opts); The third line computes the largest eigenvalue, since its estimation is needed for the computation of the Gauss-Radau rule.The struct opts is initialized on the fifth line, where the tolerance (2.3) for the convergence of Gauss quadrature is chosen, as well as the maximum number of iterations, and the value µ used for the spectrum shift (2.4).Setting the show variable to 1 displays a waitbar during the computations.
The output values are: • vip: indices for the most important nodes; • vipsgc: values of starting convenience for the identified nodes; • info: a vector containing a flag that indicates convergence and shows the number of matrix-vector products; • iters: the number of iterations performed for each node; • allstconv: the values of the starting convenience for each node.
Table 2 reports a subset of the options used for tuning the performance of the package; all the options have a default value.Refer to the second column of the table and to the description of the algorithms in [20] for their meaning.The available options are described in the various functions.All the functions can be used interactively with the vipnodes graphical user interface, located in the main directory of the package.The GUI starts by typing the command vipnodes in the MATLAB Command Window; see Figure 1.
The GUI consists of one input panel, on the left, and an output area, on the right.The former allows the user to set different parameters to perform the computations, the latter shows some information about the loaded network and the results, once the computations are done.A drop-down menu at the top of the window allows the user to perform different tasks as follows: • File.This menu allows the user to load a network in three different ways: load it from a mat file, extract it from the workspace, and create it by the Contest package [33], if the latter is installed.
• Export.This menu allows the user to export the results as a mat file, as a text file, or export them to variables in the workspace.

•
Reset.Reset options and computed results or just the results.

•
Stop.Interrupt the computations if they take too long time.
• Previous results.Display a table with results of the previous computation.
The first step to complete in order to carry out the computations is to load an adjacency matrix through the "File" menu on the top left of the main window.Once this task is done, general information about the network are shown, namely the number of nodes, the number of edges and, if the network contains self-loops, the number of removed edges.The parameters are set to their default values, and can be modified by the user.By pressing the "Find nodes" button the computations start.If the user chooses to show the animation (this possibility is not given if computation via Gauss quadrature is selected), a new window "Animation" will appear.It contains a spy plot of the adjacency matrix associated to the Figure 2 shows a typical Animation window.In this case, the computations aim to identify the 5 most important nodes with respect to the subgraph centrality of the Power network included in the package.The computations were carried out by the low-rank method, with the strong convergence condition.The number of computed eigenpairs is the minimal integer N such that the cardinality of S (N) 5 is 5.
Figure 3 shows the same window after identifying the same number of nodes as in Figure 2 by the hybrid method.In this case, the cardinality of the set S (N) 5 is larger than 5, and the final computation to identify the 5 most important nodes is performed by Gauss quadrature.
Once the computations are done, the main window shows the following information: • the method used to perform the computation (low-rank with strong convergence, hybrid method or Gauss quadrature); • the line of code that has to be written on the command window to perform the same computation without using the graphical user interface; • whether the strong or weak convergence criteria are satisfied (if one of them was selected); • the number of used eigenpairs (if either the low-rank or hybrid methods have been used for the computations); • the number of VIP nodes identified (if either the low-rank or hybrid methods have been used); • the elapsed time; • a table with the index of the identified nodes and the value of the corresponding centrality index.Figure 4 shows the main window once the computations related to Figure 2 have been carried out.Figure 5 shows the same window after the hybrid method has been used.
Note that the lists of nodes produced by the two methods is the same, but the values of the centrality index are slightly different.This happens because the value of the subgraph centrality computed by the low-rank approximation is estimated as an average of the lower and upper bounds computed by the method, while the value computed by Gauss quadrature is more accurate.

Numerical experiments
This section provides some numerical experiments to explore the performance of the centrality indices used in the software, namely the f -subgraph centrality and the f -starting convenience.In particular, we compare them to the following well-known centrality indices: • degree: the number of edges adjacent to a node; • betweeness: the number of shortest paths that pass through the node; • closeness: the reciprocal of the sum of the length of the shortest paths between a node and all other nodes in the graph; • eigenvector: a score is assigned to each node taking into account connections with nodes that have high scores; • pagerank: a variant of the eigenvector centrality.
The computation of the centrality indices listed above has been done by the centrality function included in Matlab.An example of its usage it is the following: centr = 'betweenness'; % the centrality to be used nnodes = 5; % the number of nodes to be identified G = graph(A); % converts the adjacency matrix A in to the graph G values = centrality(G,centr); % computes all the centralities of graph G [∼, node_ind] = sort(values,'descend'); % sorts all the centralities disp(node_ind(1:nnodes)); % displays the index for the nodes with the largest centrality The string centr can be set to degree, betweenness, closeness, eigenvector and pagerank.The first network we analyze is the famous Zachary's karate club network [31].The most important nodes of the network are node 1 and node 34, which stand for the instructor and the club president, respectively.Each column of Table 3 reports the ranking of the 5 most important nodes obtained by using the centrality indices listed above, namely degree, betweenness, closeness, eigenvector, and pagerank centralities, compared with the ranking obtained by the exp-subgraph centrality, the res-subgraph centrality, the exp-starting convenience, and the res-starting convenience.The value used for α in (1.3) is 0.95 • (ρ(A)) −1 .We remark that for this example, the ranking is not very sensitive to the choice of the parameter α.
It is worth noting that all the centrality indices correctly identify nodes 1 and 34 as the most important ones.The list of the 5 most important nodes contains the same indices except for the betweeness centrality, which includes in the list node 32, and the closeness centrality, which determines that node 32 and 9 are among the 5 most important nodes.
The second example we are going to consider is the Facebook network included in the package.The graph has 63731 nodes and 1545686 edges.Neither the exponential nor the resolvent of the adjacency matrix A can be evaluated in a straightforward manner due to the large size of the matrix.We therefore apply the hybrid algorithm described in Section 4 to find the 10 most important nodes in the network.Table 4 reports in each column the ranking of the 10 most important nodes according to the centrality indices described above and computed by the centrality function of Matlab.Table 5 reports the ranking of the 10 most important nodes obtained by the exp-subgraph, are not guaranteed to be the nodes with the largest index value we are searching for.Also the weak convergence criterion (3.7) may yield a set S (N)

Figure 2 .
Figure 2. The animation window after the identification of the 5 most important nodes for the Power network by the low-rank approximation method.

Figure 3 .
Figure 3.The animation window after the identification of the 5 most important nodes for the Power network by the hybrid method.

Figure 4 .
Figure 4. Main window after the identification of the 5 most important nodes for the Power network by the low-rank approximation method.

Figure 5 .
Figure 5. Main window after the identification of the 5 most important nodes for the Power network by the hybrid method.

Table 3 .
Ranking of the 5 most important nodes for the karate network identified by the centrality function of Matlab, the f -subgraph centrality, and f -starting convenience.degree betw clos eigvec pagerank exp-sgc res-sgc exp-stc res-

Table 1 .
Routines and GUI.Identifies the m most important nodes according to fcommunicability of node i through partial SVD; see Section 3. gaussexp Computes the bilinear form u T f (A)v, with f (A) = exp(A) through Gauss quadrature; see Section 2. gaussres Computes the bilinear form u T f (A)v, with f (A) = (I − αA) −1 through Gauss quadrature; see Section 2. sgcengauss Identifies the m most important nodes according to f -subgraph centrality through Gauss quadrature; see Section 2. sgcenhyb Identifies the m most important nodes according to f -subgraph centrality through partial SVD and Gauss quadrature; see Section 4. sgcenlr Identifies the m most important nodes according to f -subgraph centrality through partial SVD; see Section 3.
stconvgauss Identifies the m most important nodes according to f -starting convenience through Gauss quadrature; see Section 2. stconvhyb Identifies the m most important nodes according to f -starting convenience through partial SVD and Gauss quadrature; see Section 4. stconvlr Identifies the m most important nodes according to f -starting convenience through partial SVD; see Section 3. Auxiliary Routines for the Graphical User Interface .

Table 2 .
Problem definition and options.