A Set-Theoretic Approach to Modeling Network Structure

Abstract: Three computer algorithms are presented. One reduces a network N to its interior, I. Another counts all the triangles in a network, and the last randomly generates networks similar to N given just its interior I. However, these algorithms are not the usual numeric programs that manipulate a matrix representation of the network; they are set-based. Union and meet are essential binary operators; contained_in is the basic relational comparator. The interior I is shown to have desirable formal properties and to provide an effective way of revealing "communities" in social networks. A series of networks randomly generated from I is compared with the original network, N.


Introduction
The textbook way of describing network structure is to represent a network, N, as two sets (N, L), where N is a set of nodes and L is a set of unordered pairs {x, y} ⊆ N, called links [1,2]. (In graph theory, these unordered pairs are called "edges". The term seems to derive from the edges of the solid "dodecahedron puzzle" of Sir William Hamilton (1857) and to have been retained through inertia. However, since in social networks they connect individuals, it seems more appropriate to call them "links".) Although textbook network theory is almost always set based, virtually all computer network algorithms are algebraic [3,4]. Any network can be represented by its adjacency matrix, A_{n×n}, where a_{i,j} = 1 if {i, j} is a link and 0 otherwise. There is an abundance of matrix algorithms one can use, such as eigenvector evaluation [3]. In this paper, we supplement these matrix-based algorithms. The common goal is to describe the nature of a network in terms of fundamental properties. A matrix-based approach yields numeric properties; the set-based approach of this paper yields set-theoretic properties.
Unfortunately, there is a dearth of practical set manipulation software. To overcome this problem, we created our own C++ set management system [5]. In this system, sets are strongly typed; for example, there are "sets of nodes" and "sets of links" which are completely distinct. Invoking the subroutines that execute set operations can be awkward and takes time to master; however, one can faithfully duplicate all of the pseudocode presented in this paper. (The C++ source code for all procedures of this paper can be obtained from the author.)
Section 2 is long and somewhat heavy for an algorithm paper. The pseudocode for the set-based procedure ω that reduces any network N to its "interior" I is presented as Pseudocode I. However, first, we must formally develop the notions of neighborhood closure and irreducibility on which this algorithm, ω, is based. Then, it must be shown (Proposition 2) that ω really is a well-defined function mapping N into itself, that is, that the output of ω is unique regardless of the order in which the elements of N are encountered. Finally, it must be shown (Proposition 4) that I can be characterized as a network of "chordless" cycles.
We believe it is worth it. First, the reduction algorithm ω separates N into distinct "communities", a process which is always of interest. Second, I appears to be an excellent, compressed descriptor of the network N . Much of the remaining paper is a justification of this observation.
To support the claim that I is a rather good descriptor of any network N, the paper follows an unusual course. It is shown in Section 4 that from I one can generate a series of networks N_1, N_2, ..., each of which has the same interior I and network properties similar to those of N. Section 3 develops those properties, most of which come from the network literature. A short procedure (Pseudocode II) is presented, primarily to illustrate the flexibility of a set-based approach, since such algorithms already exist in the literature. Section 3.3 is devoted to showing that I preserves important centrality features, including both the "center" and "betweenness centers" of a network. This requires three lemmas and two propositions, which might be skipped on a first reading.
Finally, in Section 4, the interior I of a small network N (Figure 3) is used to generate by expansion (Pseudocode III) three networks N_1, N_2, N_3, together with Table 2 comparing their properties, including their principal eigenvectors, with those of N. We leave it to the reader to decide if these generated networks are similar to N.

The Interior
Let S be a set. An operator τ : 2^S → 2^S is a function which maps subsets of S into subsets of S. We denote operators by Greek letters and use postfix notation, as in Y.τ. An operator ϕ is said to be a closure operator if, for all X, Y ⊆ S: (C1) Y ⊆ Y.ϕ (expansive); (C2) X ⊆ Y implies X.ϕ ⊆ Y.ϕ (monotone); and (C3) Y.ϕ.ϕ = Y.ϕ (idempotent). Closure operators are a staple of topological mathematics.
If we replace axiom C1 with a contractive axiom I1, so that for all X, Y ⊆ S: (I1) Y.ι ⊆ Y; (I2) X ⊆ Y implies X.ι ⊆ Y.ι; and (I3) Y.ι.ι = Y.ι, then ι is said to be an interior operator. We use ι to denote interior operators and ϕ to denote closure operators; they are similar, except that one is contractive while the other is expansive.
If one visualizes S as a polytope, then its closure might be the smallest sphere containing S, while its interior could be the largest inscribed sphere, or ball. Alternatively, if one thinks of S as being a bit of irregular surface terrain with ridges and valleys, then a closure operator fills in the valleys until the terrain is uniformly smooth. An interior operator, in contrast, levels the peaks and ridges until a smooth terrain is obtained.
Let N be a network. For any Y ⊆ N, we say the neighborhood of Y is Y.η = {z | ∃y ∈ Y, {y, z} ∈ L} ∪ Y. (In graph theory, Y.η is sometimes called the "closed neighborhood" of Y, and denoted N[Y], while N(Y) = Y.η − Y is called the "open neighborhood" [1,2].) Finally, since all operators map sets into sets, even when we are talking about the neighborhood of a single node, for example z in (1) below, we express it as {z}.η.
A neighborhood closure operator, ϕ_η, on N can be defined by
Y.ϕ_η = {z | {z}.η ⊆ Y.η}.    (1)
Readily, ϕ_η is expansive and monotone. Moreover, since Y ⊆ Y.ϕ_η ⊆ Y.η, we have Y.ϕ_η.η = Y.η, so Y.ϕ_η must be idempotent, implying ϕ_η is a closure operator. (C3, or idempotency, is normally the most difficult property to prove when establishing a closure, or interior, operator.) The neighborhood closure operator, ϕ_η, is fundamental to the development of the following sections.
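The two operators just defined can be sketched in a few lines. The following Python fragment is only an illustration, not the author's C++ set system; the adjacency-map representation and the names make_graph, nbhd and closure are assumptions introduced here.

```python
def make_graph(links):
    """Build an undirected adjacency map (node -> set of linked nodes).

    The representation is an assumption of this sketch, not the paper's.
    """
    adj = {}
    for x, y in links:
        adj.setdefault(x, set()).add(y)
        adj.setdefault(y, set()).add(x)
    return adj


def nbhd(adj, Y):
    """Closed neighborhood Y.eta: Y plus every node linked to a node of Y."""
    result = set(Y)
    for y in Y:
        result |= adj[y]
    return result


def closure(adj, Y):
    """Neighborhood closure Y.phi_eta = {z | {z}.eta is contained in Y.eta}."""
    y_eta = nbhd(adj, Y)
    # Only nodes of Y.eta can qualify, because z always lies in {z}.eta.
    return {z for z in y_eta if nbhd(adj, {z}) <= y_eta}
```

On the three-node path a, b, c, for instance, the closure of {b} is the whole path, while {a} is already closed.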

The Network Interior
Consider any node y ∈ N, and suppose there exists z ∈ {y}.ϕ_η, z ≠ y, implying {z}.η ⊆ {y}.η. Such a node z, whose "horizon" is contained in that of y, contributes very little to the information content of the network, so its removal from {y}.η will result in little information loss. This node z ∈ {y}.ϕ_η can be reduced. The node y is irreducible if {y}.ϕ_η = {y}. A sub-network, I ⊆ N, of irreducible nodes is called the network's interior.
In the remainder of this section we define an operator, ω, which reduces any network to its irreducible core, and prove that it is almost an interior operator.
If {y} is not closed, only elements z in {y}.η could possibly be in {y}.ϕ_η, so only those need be considered. If {z}.η ⊆ {y}.η, so that z ∈ {y}.ϕ_η, we say z is subsumed by y, or z belongs to y. We can remove z from N, together with all its connections, and add z to {y}.β, the set of all nodes belonging to {y}. This set {y}.β is called its β-set. Of course, y ∈ {y}.β. The cardinality |{y}.β| is called its β-count.
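The subsumption step described above can be turned into a small reduction loop. The sketch below is a hedged Python reading of that description, not the author's reduce procedure: the function name reduce_network, the adjacency-map input (node -> set of linked nodes) and the choice to visit nodes in sorted order are all assumptions made here, and node labels are assumed to be mutually comparable.

```python
def reduce_network(adj):
    """Repeatedly remove subsumed nodes; return (interior, beta_sets).

    adj maps each node to the set of nodes it is linked to (an assumed
    representation). Nodes are visited in sorted order, so which of
    several equivalent nodes survives depends on the labels.
    """
    adj = {v: set(nb) for v, nb in adj.items()}   # work on a copy
    beta = {v: {v} for v in adj}                  # every node belongs to itself
    changed = True
    while changed:                                # outer loop: repeat until stable
        changed = False
        for y in sorted(adj):
            if y not in adj:                      # y was removed earlier in this pass
                continue
            for z in sorted(adj[y]):
                # z is subsumed by y when {z}.eta is contained in {y}.eta
                if (adj[z] | {z}) <= (adj[y] | {y}):
                    for w in adj.pop(z):          # remove z and all its links
                        adj[w].discard(z)
                    beta[y] |= beta.pop(z)        # z, and all it subsumed, joins {y}.beta
                    changed = True
    return adj, beta
```

Under these assumptions a 4-cycle with one pendant node reduces to the 4-cycle, while a triangle or a small tree reduces to a single node whose β-set is the whole network.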
The procedure reduce of Pseudocode I was used to implement a process ω that reduces any network N to its irreducible core, I = N.ω. Applied to N_1, the well-known "Karate" network [6], this reduction code yields the interior depicted by bolder links in Figure 1. In this figure, two nodes of the interior have been suffixed by ":n" to denote their β-count. Only nodes 1 and 33 have non-trivial β-sets, of 12 and 8 elements, respectively, which have been delimited by dotted lines. (The β-set of node 33 might equally well have been the β-set of node 34; however, 33 precedes 34 in the reduction process.)
Proposition 1. The process ω described above is (I1) contractive and (I3) idempotent.

Proof.
Readily, ω is contractive and it is idempotent because, when I = N .ω is irreducible, the loop is not executed, so N .ω.ω = I = N .ω.
One can show that N′ ⊂ N need not imply that N′.ω ⊂ N.ω; that is, ω is not monotone (I2), so ω is not an interior operator, even though we call I = N.ω the "interior".
Proposition 2. Let I and I′ be irreducible sub-networks obtained by applying ω to a finite network N with its nodes encountered in two different orders. Then I and I′ are isomorphic.
Proof. Let y_0 ∈ I, y_0 ∉ I′. Then, y_0 belongs to some point y_1 in I′, and y_1 ∉ I, since otherwise {y_0}.η ⊆ {y_1}.η would imply y_0 ∈ {y_1}.ϕ_η, so I would not be irreducible.
Similarly, since y_1 ∈ I′ and y_1 ∉ I, there exists y_2 ∈ I such that y_1 belongs to y_2. Now, we have two possible cases; either y_2 = y_0, or not.
Suppose y_2 = y_0 (which is often the case); then {y_0}.η ⊆ {y_1}.η and {y_1}.η ⊆ {y_0}.η, so {y_0}.η = {y_1}.η. Hence, i(y_0) = y_1 is part of the desired isometry, i. Now, suppose y_2 ≠ y_0. There exists y_3 ≠ y_1 ∈ I′ such that {y_2}.η ⊆ {y_3}.η, and so forth. Since I is finite, this construction must halt with some y_n. The points {y_0, y_1, y_2, ..., y_n} constitute a complete graph Y_n with {y_i}.η = Y_n.η, for i ∈ [0, n]. In any reduction, all y_i ∈ Y_n reduce to a single point. All possibilities lead to mutually isomorphic maps.
Proposition 2 assures us that, even though which nodes are preserved in I depends completely on the order in which ω visits them, the output must be effectively identical. For example, in Figure 2, assume the nodes x and z are irreducible elements of I. In each case, if y_0 ∈ I, then y_1 or y_2 could be as well. They are the equivalent nodes defining the isometry. Each set of equivalent nodes must be a "complete" subgraph of N. (A graph, or network, K is said to be complete if for all x, y ∈ K, there is a link {x, y}. A complete graph on n nodes is denoted by K_n.)
A sequence ρ̇ = ⟨y_0, ..., y_n⟩ of n + 1 nodes, where {y_{i−1}, y_i} ∈ L, or a set of n links ρ = {{y_0, y_1}, ..., {y_{n−1}, y_n}}, is called a path ρ(y_0, y_n) of length n. It is often easier to describe a path in terms of its nodes, ρ̇, rather than its links, ρ, which is more precise. By |ρ(x, z)|, we mean the length of the path independent of whether we are counting nodes or links.
A cycle Ċ = ⟨y_0, y_1, ..., y_n⟩, where y_n = y_0, of length n ≥ 4 is said to have a bridge if there exists a path ρ̇( … ). If the path consists of a single link, it is called a chord. If Ċ has no such chords, it is said to be a chordless cycle. (Graphs in which every cycle of length ≥ 4 must have a chord are called "chordal graphs" [7].)
Proposition 3. The nodes of a chordless cycle are irreducible.

Proposition 4.
Let N be a finite network with I = N.ω being an irreducible subset. If y ∈ I is not an isolated point, then either: (1) there exists a chordless k-cycle Ċ, k ≥ 4, such that y ∈ Ċ; or (2) there exist chordless k-cycles Ċ_1, Ċ_2, each of length ≥ 4, with x ∈ Ċ_1, z ∈ Ċ_2, and y lies on a path from x to z.
The preceding establishes that any link sequence in I terminates in a cycle of length ≥4. Since N is symmetric, the link sequence could be extended in the opposite direction yielding (2).
The condition that y not be an isolated point is significant. Any tree structured network reduces to a single point, as do many networks consisting of triangles.

Corollary 1. N is connected if and only if I is connected.
A collection of chordless cycles constitutes a cycle system which is itself a matroid [8] with a well defined rank [9]. If the network is projected onto a planar representation, then counting those cycles without a bridge yields the rank. Figure 3 illustrates the interior of a small network on 21 nodes. It is a cycle system of rank 5. Here, the links of the interior have been made bolder and again its nodes have their β-counts appended. The β-sets, such as {e:2}.β, are suggested by dotted lines. Note that this process effectively resolves the question of partitioning networks into disjoint communities [3,10,11], without having to specify the number of communities in advance, although some β-sets would have to be combined.

Reduction Performance
Technically, the ω process of Pseudocode I is O(n²), since it achieves its worst-case performance on the unbalanced network of Figure 4, provided the outer loop of the ω code encounters the nodes in order of their subscripts. Then, it will remove only one node on each iteration. However, in practice, ω appears to offer close to linear performance. With networks of several thousand nodes, ω has never required more than seven iterations of its outer loop. For example, given the well-known Newman co-authorship network [12] of 363 persons with 823 connecting links, three iterations of the outer loop of the ω code of Pseudocode I reduce the network to 65 individuals with 111 links constituting its interior, shown in Figure 5. (A fourth iteration is required to verify that there are no more reducible nodes.) The node Stauffer, in the upper left, has a β-set of 23 elements for which it may be regarded as a surrogate, and the lower left node Barabasi has a β-set of 41 elements. In the case of the Newman co-authorship network, the interior represents a significant reduction in the complexity of the network.
Figure 5. The interior I of N_3, the 363 node co-authorship network of Newman [12].

Network Properties
There are several scalar properties associated with every network N , including n_nodes = |N|, n_links = |L| and density = |L|/|N|. The average node degree over all nodes is 2 · density, since every link has two end nodes [1,2]. These are trivial to calculate given N and L.
The number of triangles [13] embedded in N can be calculated by the procedure count_triangles, whose code is given in Pseudocode II.

Pseudocode II, count_triangles
Here, the k_count of a link {x, z} denotes the number of triangles for which that link is one "side". Since each triangle has three links, it is counted three times in k_total, so n_triangles = k_total/3. The computational cost of {x}.nbhd ∩ {z}.nbhd is essentially constant, so the cost of count_triangles is linear in the number of links, or O(|L|).
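The counting rule just described can be sketched as follows. This Python fragment is an illustration rather than the paper's count_triangles listing; it assumes an adjacency map (node -> set of linked nodes) and takes nbhd to be the open neighborhood, so that the two end nodes of a link are not themselves counted.

```python
def count_triangles(adj):
    """Return (n_triangles, k_count), where k_count maps each link to the
    number of triangles having that link as one side.

    adj[x] is assumed to be the open neighborhood of x (x itself excluded).
    """
    k_count = {}
    k_total = 0
    for x in adj:
        for z in adj[x]:
            link = frozenset((x, z))
            if link in k_count:          # each undirected link is handled once
                continue
            # Common neighbors of x and z each close one triangle on {x, z}.
            k_count[link] = len(adj[x] & adj[z])
            k_total += k_count[link]
    return k_total // 3, k_count         # every triangle was counted on 3 links
```

For the complete graph on four nodes, every one of the six links has a k_count of 2, giving k_total = 12 and four triangles.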
Other scalar properties are dependent on the concept of shortest paths. Let x, z be two nodes in a connected network N. Because N is connected, there exists a path ρ(x, z) of length n. This may, or may not, be the shortest path (of minimal length) between them. We let σ(x, z) denote the (or all) shortest path(s) between x and z. The path length |σ(x, z)| is also known as the distance, d(x, z), between x and z [1,2]. The diameter(N) of the network is the maximal distance, d(x, z), for all x, z ∈ N. The eccentricity of a node x is e(x) = max(d(x, z)) for all z ∈ N. The radius, r(N), of the network is the minimum eccentricity of any node y [2].
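These distance-based quantities can be computed with a breadth-first search from every node. The sketch below is one plain way to do so, assuming a connected network stored as an adjacency map (node -> set of linked nodes); the function names are introduced here for illustration only.

```python
from collections import deque


def distances_from(adj, source):
    """Breadth-first search: d(source, v) for every node v reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def eccentricities(adj):
    """e(x) = max over z of d(x, z); assumes the network is connected."""
    return {x: max(distances_from(adj, x).values()) for x in adj}


def diameter(adj):
    """Maximal distance between any two nodes."""
    return max(eccentricities(adj).values())


def radius(adj):
    """Minimum eccentricity of any node."""
    return min(eccentricities(adj).values())
```

On a path of five nodes the middle node has eccentricity 2, so the radius is 2 and the diameter is 4.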

Communities
Many networks, especially those that represent social connections, are spotted with "clusters" of more densely connected nodes. These clusters of triangular links, which are often called communities, arise from the social phenomenon called triadic closure [14]. It is known that, in many social contexts, if x is connected to y and y is connected to z, then x is likely to be connected to z. Even though triadic closure is not really a closure operator, its principle has been identified on many repeated occasions [11,15]. (As normally encountered, triadic closure is not idempotent. Applied literally, the triadic closure of any network would be the complete graph/network on its n nodes).
However, we know of no formal definition as to what really constitutes a "community". There have been numerous efforts to identify communities in a network [16]. Several work on the principle of "bisection" in which removal of certain links separates the network into n distinct communities [10]. A common problem is that usually n must be designated in advance. Others iteratively partition the network, often using the Fiedler eigenvector [17]. Here, the question is when to stop the iteration.
A portion of the network that is dense with triangles may be regarded as a community. A connected sub-network of triangles is called a k-truss [18]. A connected subset of triangles could be tree-structured, so it is common to specify that a k-truss is a connected collection of links with a k_count > 1, where the k_count of a link {x, z} is |{x}.η ∩ {z}.η| as in Pseudocode II. If k_count = 2, the Karate network of Figure 1 Figure 5.
The larger values of the principal eigenvector of A_{n×n} (the adjacency matrix of the network) can indicate well-connected nodes, and often communities [3]. Nodes 1, 3, 33 and 34 of N_1, the Karate network of Figure 1, dominate its principal eigenvector. The principal eigenvector of N_2, the small network of Figure 3, is given in Table 2. Here, nodes d, e, m, r stand out. Higher values in this eigenvector appear to correlate with higher node degree. The nodes Barabasi, Jeong and Oltvai (in {Jeong}.β) are most prominent in the eigenvector of the Newman network.
All of these methods have been proposed to denote "communities". We would suggest that the β-sets attached to I also denote "communities".

Important Nodes
A fundamental quest in the analysis of many networks is the identification of their "important" nodes. An important node may be a node of high degree in a community, but need not be. In social networks, "importance" may also be defined with respect to the path structure [19,20]. Those nodes, C_d = {y ∈ N}, for which the eccentricity, e(y), or ∑_{x≠y} d(x, y), is minimal have traditionally been called the center of N [1,2]; they are "closest" to all other nodes. It is well known that this subset of nodes must be edge connected. One may assume that these nodes in the "center" of a network are "important" nodes.
Alternatively, one may consider those nodes which "connect" many other nodes, or clusters of nodes, to be the "important" ones. Let nsp_xz(y) denote the number of shortest paths σ(x, z) containing y; then, those nodes y for which nsp_xz(y) is maximal are involved in the most connections. Let C_b = {y ∈ N}, for which nsp_xz(y) is locally maximal. This is sometimes called "betweenness centrality" [19-21]. (In [20], Newman proposed the notion of "random walk betweenness" as an alternative to shortest path betweenness.)
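The shortest-path counts can be obtained from breadth-first searches that also record how many shortest paths reach each node. The following sketch is one straightforward, unoptimized way to do this and is not the improved algorithm of [21]; it assumes an adjacency map (node -> set of linked nodes), sums nsp_xz(y) over all unordered pairs {x, z}, and excludes the end nodes x and z themselves, which is an interpretation made here.

```python
from collections import deque
from itertools import combinations


def bfs_counts(adj, source):
    """Return (dist, sigma): distances from source and the number of
    distinct shortest paths from source to each reachable node."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:   # u precedes v on some shortest path
                sigma[v] += sigma[u]
    return dist, sigma


def shortest_path_counts(adj):
    """nsp[y]: number of shortest paths, over all pairs {x, z}, passing through y."""
    info = {x: bfs_counts(adj, x) for x in adj}
    nsp = {y: 0 for y in adj}
    for x, z in combinations(list(adj), 2):
        dist_x, sigma_x = info[x]
        dist_z, sigma_z = info[z]
        if z not in dist_x:              # x and z are not connected
            continue
        for y in adj:
            if y == x or y == z or y not in dist_x:
                continue
            # y lies on a shortest x-z path iff d(x,y) + d(y,z) = d(x,z)
            if dist_x[y] + dist_z[y] == dist_x[z]:
                nsp[y] += sigma_x[y] * sigma_z[y]
    return nsp
```

Restricting the inner loop over y to the nodes of an interior, when one is known, would shrink the work in the way the text later suggests.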

Network Properties Preserved by the Interior
The next three lemmas, culminating in Proposition 5, help clarify the interaction of β-sets with the nodes of I. In these lemmas, we assume that x_0, y_0 and z_0 ∈ I.
Lemma 1. Let y_k ∈ {y_0}.β. There exists a node sequence y_0, y_1, ..., y_k such that y_i ∈ {y_0}.β, …
Proof. In the reduction process of Pseudocode I, if y_{i+1} is subsumed by y_i, then {y_{i+1}}.β ⊂ {y_i}.β, yielding the chain of nested sets {y_k}.β ⊂ {y_{k−1}}.β ⊂ ... ⊂ {y_0}.β.
Proof. By Lemma 2, we know ∃{x, z_0}, {z, x_0} ∈ L. If {x_0, y_0} ∈ L, we are done. Thus, let us suppose not. By Proposition 4, we can assume ∃y ∈ I (or a sequence y_i) such that {x_0, y}, {y, z_0} ∈ L. We claim {x, y} ∈ L, since otherwise ⟨y, x_0, ..., x, ..., z_0, y⟩ is a chordless cycle of length ≥ 4, and hence by Proposition 3 is irreducible. Similarly, {y, z_0} ∈ L.
In this figure, solid lines denote links that are "known" to exist for one reason or another. The dotted lines that enclose β-sets were established by the reduction process. Each conforms to Lemma 1. Observe that the entire set of nodes, {x_1, x_2, z_1, z_2, z_3}, could constitute either {x_0}.β or {z_0}.β depending solely on the order of node reduction. This is illustrated in N_1, Figure 1, where {33}.β could have been {34}.β. Proposition 2 establishes that the structure of the interior, I, is independent of the order in which nodes are encountered in the ω process; however, the structure of β-sets produced by the code reduce can be very dependent on this order.
While in many networks the β-sets will be separated (as in Figure 3), there may be links between them. It is not hard to imagine a link between a ∈ {e}.β and c ∈ {h}.β. The lemmas establish that either such a link must introduce a new chordless cycle into I, or else there must be an abundance of "triangles" surrounding the network interior.
Proposition 5. Let ρ(x, z) be a path where {x, z} ∉ L (i.e., |ρ(x, z)| ≥ 2) and x ∈ {x_0}.β and z ∉ {x_0}.β. Then, there exists a path ρ′(x, y, z) where y ∈ I and |ρ′(x, y, z)| ≤ |ρ(x, z)|.

Network Centrality
Proposition 6. If N is not unbalanced, then the center C_d (in terms of distance) is an element of (or intersects with) the interior I of N.
Proof. If x and z are in separated β-sets, then σ(x, z) = ⟨x = x_k, x_{k−1}, ..., x_0⟩ ∪ ⟨y_1, ..., y_m⟩ ∪ ⟨z_0, ..., z_n⟩, where y_1 = x_0, y_m = z_0 and y_i ∈ I. Since N is not unbalanced, we may assume k ≈ n, so the center of σ(x, z) is one of the y_1, ..., y_m.
If x and z are in connected β-sets and |ρ(x, z)| ≥ 2, then Proposition 5 establishes the existence of a shortest path through I as well.
If x, z ∈ {x}.β, then no shortest path involves I; however, since N is not unbalanced, these constitute a small number of cases and can be ignored.
In Figure 2b, if y_1 is in the center C, then so are y_0 and y_2, implying C ∩ I ≠ ∅. Proposition 6 requires that N not be too unbalanced. Figure 4 illustrates why. It is not hard to show that y_5 is the center, with maximum distance over all x being d(x, y_5) = 4. Our rule of thumb is that a network is reasonably well-balanced if, given any x ∈ {x_0}.β, the probability that a randomly chosen y is also in {x_0}.β is small, that is, pr(y ∈ {x_0}.β | x ∈ {x_0}.β) < ε, where ε < 0.20.

Proposition 7.
If N is not unbalanced, then any center C b of N (in terms of betweenness) is an element of I.
Proof. This proof follows the line of Proposition 6, in which, unless x and z are in the same β-set, all shortest paths σ(x, z) either involve I or have a path ρ′(x, y, z) of the same length through I. Hence, a node y for which σ_{x,z}(y) is maximal will be an element of I.
That I contains the betweenness center is evident in the Karate network of Figure 1 and the small network of Figure 3. Figure 7 illustrates a somewhat different "unbalanced" network in which x and z ∈ I are betweenness centers. One can calculate that nsp(x) = nsp(z) = 6 * 6 + 4 * 6 * 6 + 4 * 6 = 204 which are locally maximal.
Calculating betweenness centers is computationally expensive, even with improved algorithms (e.g., [21]). Knowing that they must exist in the interior I and restricting the calculation to just those nodes can greatly improve performance, especially when betweenness is employed in other procedures (e.g., [10]). Consequently, dwelling too much on unbalanced networks can be self defeating since the majority, and possibly almost all, networks are well-balanced.

Network Generation by Expansion
The interior, I, of a network N represents its global structure. If the β-count is appended to each node of I, how well does I represent N as a whole? In effect, what is the information content of I, so augmented?
One measure of the information content of any collection of network properties is the ability to construct, or generate, similar networks based on those properties. For example, given a network N = (N, L), one can construct many different networks N′ = (N′, L′) such that |N′| = |N| and |L′| = |L|. However, they need not be at all similar to N. Here, we are using "similar" in its colloquial sense. A formal notion of "similarity" would require it to be an equivalence relation.
One way of determining the nature of networks with a given interior, I, and known β-counts is to randomly generate some. Let I be given. Suppose the β-count of a node y is greater than one. New nodes can be attached to replace those of the original β-set. Let y:n be the node to be expanded (n > 1) and let z denote the new node. Our code generates artificial node names of the form 'A0, B0, ..., Z0, A1, ...'. The last generated node in the expansion of Figure 5 is M11. Besides the link {y, z}, we require {z}.η ⊆ {y}.η. A random number determines how many of the other nodes in {y}.η will be linked to z; which of those nodes, if any, are linked is also chosen randomly.
In the reduction process, ω, nodes with considerable β-sets may be subsequently reduced themselves. In the re-expansion, a portion of the β-count of y may be transferred to the β-count of z. Pseudocode for a procedure expand to implement an operator ε that generates new nodes relative to the interior is given in Pseudocode III. (ε, as shown here, is a round-robin procedure expanding one node in a β-set at a time. An alternative, and slightly faster, process can be found in [22]).
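The expansion step can be sketched as below. This is a simplified Python illustration, not the paper's expand listing: it attaches all new nodes directly to their interior node in round-robin order and omits the transfer of β-count from y to z. The function name, the adjacency-map representation, the seed parameter and the assumption that generated names do not collide with interior node names are all introduced here.

```python
import random


def expand(interior, beta_count, seed=None):
    """Grow a network from an interior; return (network, parent).

    interior:   adjacency map of the interior (node -> set of linked nodes)
    beta_count: node -> beta-count; nodes not listed count as 1
    parent:     maps each generated node z to the interior node y it expands
    """
    rng = random.Random(seed)
    adj = {v: set(nb) for v, nb in interior.items()}       # work on a copy
    remaining = {y: beta_count.get(y, 1) - 1 for y in interior}
    parent = {}
    k = 0
    while any(remaining.values()):
        for y in interior:                                 # round-robin over the interior
            if remaining[y] == 0:
                continue
            z = chr(ord("A") + k % 26) + str(k // 26)      # A0, B0, ..., Z0, A1, ...
            k += 1
            others = sorted(adj[y])                        # the other nodes of {y}.eta
            chosen = rng.sample(others, rng.randint(0, len(others)))
            adj[z] = {y, *chosen}                          # so {z}.eta lies within {y}.eta
            for w in adj[z]:
                adj[w].add(z)
            parent[z] = y
            remaining[y] -= 1
    return adj, parent
```

Because each new node's neighborhood lies inside that of its parent at the moment it is created, the generated nodes can be removed again in reverse order of creation, which is why one would expect such a network to reduce back to a copy of the interior.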
To what extent are the network features of N enumerated in the preceding section preserved in the randomly generated networks, N.ω.ε? Readily, the generation process ε was constrained so that |N.ω.ε| = |N| and N.ω.ε.ω = I = N.ω; thus, the path-based centers of Section 3.4 are preserved. Some other network properties are illustrated in Table 1.
Table 1. Network properties of networks in Figure 8 generated from I = N_2.ω, Figure 3.
Table 2 presents the principal eigenvector associated with the nodes of N_2 in Figure 3 and for the three expansions shown in Figure 8. Note that, except for the ten nodes of I, node values for generated expansions are not comparable with node values of the original N.
This section began with the question "how well does I represent N as a whole?" Figure 8 and Tables 1 and 2 provide abundant evidence that, given just I, with each node augmented with its β-count, a random process can generate new networks whose properties are very similar to those of N. It would seem to be a very good description of N.

Observations
This paper might have been titled "An Operator Approach to . . ." since the operators η, ϕ η , ω and ε play such an important role. This aspect is briefly suggested by Proposition 8, but not enlarged. However, surely, interesting networks are dynamic; they change over time which demands an operator approach. Thus, one might ask: "Is a transformation τ : N → N continuous?" [23] The operators ω and ε are, in fact, "continuous" with respect to ϕ η . Moreover, it appears that N .ν = N −N .ω = N −I is a violator space in the sense of [24]. This could be expanded in the future.
However, computability is such a dominant theme in current network analysis and understanding that we thought focusing on the use of set-theoretic computer procedures such as reduce, count_triangles and expand was more important. Programming with set operators is not widespread. However, these set-theoretic procedures appear to be fast and quite scalable. The reduction, ω, of the Newman co-authorship network to Figure 5 took 0.008 s; the reductions of the smaller networks (Figures 1 and 3) each took less than 0.001 s. Calculation of the eigenvectors of Figure 5 exceeded 5 s. Such anecdotal evidence is suggestive, but far from definitive.
Only standard set-theoretic reasoning was used to develop the reduction process, ω, which leads to the concept of the "interior", I, of a network, N, and its β-sets. It is a powerful concept that effectively captures the essence of many networks, as shown by Section 4, in which very similar networks can be generated from I alone. Moreover, by reducing a network to its interior, one effectively partitions the network into its constituent β-set communities.
However, the reduction process has its limitations. Some networks are nearly irreducible to start with. The sparse network of Norwegian corporate directors [25] is an example. Hierarchical networks reduce to a single node, that is, a single-node interior with a very large β-set. Other networks can be too dense. The complete network K_n also reduces to a single node. However, we believe that the easily computed interior is a most effective network descriptor and possibly should be an automatic first step in network description and understanding.