1. Introduction
The analysis and understanding of the effects of symmetry has been a guiding principle in physics since the beginning of the 20th century [
1]. The investigation of symmetry in network analysis—besides the area of (applied) math—seems not as far developed and depends on the field of research. Especially in chemistry, the recognition of graph symmetry plays a role, as the structure of molecules can be modeled with graphs and the distinction of molecular structures (i.e., checking for (non-) isomorphism) is of great interest [
2]. This has also led to entropy-based measures, not only applicable in chemistry but also in the fields of biology, sociology and psychology [
3,
4].
The study of graph symmetries has been a field of research for many years [
5]. Two graphs are said to be isomorphic if there exists a bijection of the set of nodes so that one graph equals the other under the transformation of the bijection. If bijections from a graph to itself besides the trivial identity bijection exist, a graph is symmetric and this symmetry is captured by the graph’s automorphism group. The formal details will be introduced in
Section 2. Both problems (graph isomorphism—GI, and graph automorphism—GA) are not in
and thus not easy to solve in general. For GI it is even unclear whether it is in
(the complexity class of “nondeterministic polynomial-time complete” problems, often informally referred to as the class of the hardest problems to solve) [
6]. Either way, efficient algorithms exist [
7,
8,
9,
10] and the latest research gathers momentum in lowering the theoretical worst-case complexity [
11,
12] (still under review as of September 2017).
The strict symmetry definition in terms of topological symmetry is not the only way to characterize graph symmetries, as Garlaschelli et al. [
13] show in their review. Nonetheless, we concentrate on strict symmetry for the rest of this paper. Erdős and Rényi [
14] showed that random graphs tend to be asymmetric, that is, there do not exist automorphisms that map nodes onto nodes by preserving the graph’s structure. In contrast, for graphs that are the result of structures (or processes, social relations, etc.) that occur in reality—thus the name real-world graphs—it has been shown that they do not obey the law of pure randomness [
15] but follow other principles, such as the power law probability distribution [
16]. Therefore, the conclusion “symmetries in real-world graphs are unlikely” is false without further evidence. Several empirical investigations on the actual symmetry of complex/real-world networks already exist [
10,
17,
18], with an emphasis on [
19], but all these publications have in common that only a very small selection of graphs (about 25 maximal) was taken into account. A theoretical overview of symmetry in complex networks is given by Garrido [
20]. Our contribution is to undertake a symmetry analysis on a much larger scale, utilizing graph data from the meta-repository
networkrepository.com (
http://www.networkrepository.com, last accessed 28 August 2017) [
21].
The rest of this article is organized as follows: We formally define graph symmetry and other concepts in
Section 2 and will then describe the procedure we carried out (
Section 3). We end by presenting the obtained results (
Section 4) and by giving a short outlook (
Section 6) and conclusion (
Section 5). Our Appendices contain additional details on the analysis (
Appendix A), a theorem and a proof concerning the redundancy measure we present in
Section 3 (
Appendix B), some more result diagrams (
Appendix C), and additional results (
Appendix D).
2. Definition of Graph Symmetry
A graph G is a set of nodes (or vertices) V and a set of edges E that together form a topological structure . We define (w. l. o. g.) and , which is a simple graph (no self-loops, no edge weights, no multiple edges) with at most edges. This definition makes nodes distinguishable by their label . Furthermore, it is sufficient to consider only connected graphs. That is, no two nodes exist that cannot be reached by a sequence of nodes connected by an edge (a path). For brevity, we write edges as where appropriate.
Let us now formally define isomorphisms and automorphisms of graphs. Both are bijections that map nodes onto nodes by preserving topology. These bijections are permutations of the node set and we write them as cycles. Two graphs and are isomorphic iff with and the mapping — the image of applied on u. If , is an automorphism of the graph G. The identity mapping is always a (trivial) automorphism. For any bijection there naturally exists an inverse and the composition of both bijections is . The set of all automorphisms that exist for a graph G form the group that obeys the four group properties:
Identity: .
Inverses: and .
Closure: .
Associativity: .
A subset of automorphisms that yields when applying the above properties is called a generating set; the elements are generators. If contains only , it is said to be trivial and G is said to be asymmetric. The group size is simply the number of elements (permutations) and if , G is symmetric.
The automorphism group induces equivalence classes of the nodes of the graph, which are called orbits. An orbit of a node u is defined as . That is, all nodes that can be mapped onto each other are on the same orbit and the set of all orbits is a non-overlapping partition of V. The number of nodes on an orbit defines its length. A node with an orbit length of one is said to be fixed by .
The definition of automorphisms can easily be extended to non-simple graphs. If E is allowed to be a multiset, multiple edges are possible. If edges are expressed by ordered tuples instead of sets, the graph is directed. In both cases, the above definitions are directly applicable. If we interpret the requirement as a constraint that limits the set of all possible bijections , we can think of further constraints, for example, induced by a function that assigns weights to edges. So for the automorphism of a weighted graph, additionally must hold.
The group containing all bijections of the above form is called the symmetric group of
V and is denoted
. The group
is called a subgroup, denoted
. The more constraints are introduced to define an automorphism of a graph, the smaller the subgroup becomes. Let, for example,
G be a simple graph with the automorphism group
and
be the same graph with an additional edge weight function, then
. This consequence will become important in
Section 3.
3. Description of the Graph Symmetry Analysis Procedure
The analysis procedure can be divided into several independent steps. Each step is automated by implementing a Python script:
Retrieval of graph metadata from networkrepository.com (includes the download link to the actual dataset).
Selection of the datasets to download (by file size).
Retrieval of the actual datasets (compressed zip-archives).
Loading, selection and transformation of the graph data to perform the symmetry computations.
Calculation of statistics and their analysis (discussed in
Section 4).
The crucial part of the analysis is the computation of generators of the automorphism group of each graph in step four. The normalized network redundancy (introduced in
Section 3.5) is based on the number of orbits that can be deduced from the generators.
We will give a brief overview of each of these steps before we present and interpret the results. Additional aspects and details are described in
Appendix A. The analysis was performed in February/March 2017, thus all information we give was up-to-date at that time.
3.1. The Data Repository: networkrepository.com
networkrepository.com [
21] is a platform that advertises to be “[t]he first interactive data and network repository with real-time analytics” (
http://www.networkrepository.com, last accessed 28 August 2017). It lists about 3900 datasets and provides several (sometimes approximate) characteristics for each dataset. The datasets are grouped into classes of origin (for example, “Biological Networks”, “Cheminformatics”, “Social Networks”, etc.) with the smallest class containing less than 10 networks, the largest (“Miscellaneous Networks”) more than 2500. A striking advantage of this repository is that it also contains many datasets which can be found in other sources (for example, the one of GEPHI [
22] (
https://github.com/gephi/gephi/wiki/Datasets, last accessed 29 August 2017), SNAP [
23] or the DIMACS challenges [
24], 10th DIMACS challenge (
http://dimacs.rutgers.edu/Challenges/, last accessed 29 August 2017)). This allowed us to easily retrieve many datasets at once without the need to combine several sources and—potentially—different data formats. A flaw of using this source of data is the presence of datasets that are not actual real-world graphs, as the discussion of the results will show.
3.2. Data Download Selection
We decided to download all datasets with a (compressed) size of about 70 megabytes or less. Even if this sounds to be very far away from “big data”, considering nodes, the largest network in our sample consisted of 11,950,757 nodes and m = 12,711,603 edges. Looking at the edges, the largest graph in the sample has nodes and m = 25,165,784 edges. The obvious reason for this is that large graphs also have, encoded as a list of edges, a fairly low consumption of disk space—mainly depending on the number of edges. Certainly, not the disk space but the algorithmic complexity of obtaining the automorphism group of a graph, which normally increases at least linearly with the size of the graph, was crucial for this decision. In the end, we downloaded a total of 3015 datasets with an overall compressed size of 11.5 gigabytes.
3.3. Data Formats
The downloaded graphs are encoded in either one of two plain-text data formats:
The Matrix Market exchange format [
25].
An edge list format [
26] (Chapter 9).
The former is a matrix format that contains the actual matrix format information in the first line, and each following line encodes a matrix entry by position. The latter also begins with some meta-information in the first lines, followed by a list of edges (one per line).
It is of high importance to understand how the graph data is encoded, to decide which datasets to choose (simple graphs) and which to exclude (all other graphs) from the symmetry analysis.
3.4. Data Analysis Selection
As stated in our introduction, we focus on simple graphs. One of the reasons is that
saucy [
10], which we used to compute generating sets for the automorphism groups, cannot handle weighted graphs.
networkrepository.com also contains datasets of dynamic networks that change over time. These should be excluded as it is not clear which state of the network is the one to analyze (start, end, average, every state from start to end, …?). Some other datasets consist of disconnected components which we also excluded from further processing.
Often, for example, [
19], non-simple graphs are “simplified” by removing multiple edges, loops and edge weights, and replacing edge 2-tuples (directed) by edge 2-sets (undirected). We refrained from doing so as this simplification is likely to “add” symmetry that is actually not present in the original version of the graph (see the example in
Figure 1). This is why we view these properties as constraints in
Section 2 that restrict the set of automorphisms of a graph. Sometimes, some of these simplifications may be appropriate, but as we want to analyze exact symmetry, this one in particular is not. However, additional results for simplified graphs are presented in
Appendix D.
Another issue we had to face was the existence of duplicates, which had several causes: Sometimes the same graph was put into different classes, sometimes a graph existed in the same class but under a slightly different name. In the end, from the 3015 downloaded datasets, a total of 1805 were sorted out (non-simple graphs) and another 277 graphs were disconnected and thus also excluded from the analysis. From the remaining 933 graphs, 31 were identified as duplicates having a different name after manual postprocessing. Most of these duplicates were part of the “DIMACS10” class, which contains datasets from a graph clustering and partitioning challenge held in 2012 (
http://www.cc.gatech.edu/dimacs10/, last accessed 30 August 2017). We found all graphs of this special class to be also contained in one of the other classes. Some other graphs were also duplicated in different classes, but only few were duplicated in the same class.
All in all, we decided to remove all duplicates from our analyses on the complete dataset but to keep duplicates for the analyses on the class level to prevent a bias within classes.
3.5. Data Analysis Procedure
Finally, we describe which characteristics were computed for each graph. As we laid our focus on the analysis of symmetry, it was most important to compute the automorphism group of each graph. To achieve this, we utilized
saucy [
10]. Calling
saucy with a simple graph as input returns the size of the group, the number of permutations that generate the group, and some other information (sum of the permutation’s support, depth and number of nodes of the search tree), which we do not use further. Actually, we did call
saucy from Python and additionally computed the orbit partition
(see the details in
Appendix A).
As the size of an automorphism group grows exponentially fast with the degree of symmetry, the number representing size is split into two values
a and
b so that
. However, the group size as an absolute value is not a very intuitive way to describe how symmetric a graph actually is. Think, for example, of the case of a graph that has 100 nodes and 15 of them can be mapped onto each other in any way, while all others are fixed. The group of this graph is equivalent to
with
(see also the results in the last column of
Table 1).
MacArthur et al. [
19] introduced a relative measure for the degree of symmetry, which they call “network redundancy”:
It is much more robust against
how nodes on the same orbit can be mapped onto each other, because an orbit only contains the information
that nodes can be mapped onto each other. We believe a better measure is
the normalized (network) redundancy. It perfectly captures the asymmetric case (
) as well as the completely symmetric case (
) and better correlates with the name “redundancy”. A completely symmetric graph (a transitive graph) has a normalized redundancy of 1 (100%) as all nodes are equivalent and therefore redundant. Moreover, asymmetric graphs of different size can be compared because the normalized redundancy is 0 and not
(for large
n). The normalized redundancy is of course only defined for graphs with more than one node, but in practice this is no restriction at all.
Furthermore, the number of nodes
n and edges
m are reported and we apply the
RG algorithm [
27] for graph clustering to examine the existence of modular groups in the graph. The popular modularity partition quality measure
Q [
28] is used to quantify the existence of modular groups.
4. Results
The main goal of this article is to provide further evidence of symmetries that exist in real-world graphs.
Table 1 shows a summary of some basic statistics of the datasets. The last two columns contain symmetry information obtained by
saucy. In the end, we analyzed a total of 902 different graphs (duplicates already removed). As discussed previously, the size of the automorphism group is not a good measure for the degree of symmetry. Nonetheless, it can be seen that 75% of the analyzed graphs have a very small group size. In total, 272 of the 902 graphs are asymmetric, which means about 70% contain symmetries.
From the median values (50%) of the first two columns in
Table 1 it can be seen that many analyzed graphs are quite small. However, also relatively larger graphs were part of the analysis, as the maximum values show. Many graphs seem to have a modular structure which is not unusual for real-world graphs [
16] (see Section 3.7; higher values of
Q imply a more modular structure).
In
Figure 2 the distribution of the number of edges is plotted (note the log-scaled x-axis and bin size). Symmetries are found for every size of graphs; only for larger graphs (
) do symmetries seem to be under-represented. An explanation for this is the existence of many graphs that are generated following some random graph model and, therefore, are asymmetric. The details of this are described below. The plot for the number of nodes
n looks similar, so we decided to put it in
Appendix C (
Figure A2). More than half of the graphs contain between 10 and 100 edges. This “bias” emerges from the high number of graphs from the “chem” class, which is very large (over 570 datasets) and consists predominantly of graphs with less than 100 edges.
As for the graphs’ size, modular graphs often do contain symmetries, as
Figure 3 shows. Which modularity values for a partition are interpreted as “good” largely depends on the graph’s size. For example, the Karate network [
29] is relatively small (
,
) and the optimal modularity maximizing partition yields a value of nearly 0.42 [
27].
A bit surprising is the large fraction of asymmetric graphs with very low modularity (<0.05): Normally, maximum modularity decreases when the density
of the network—that is, the number of edges
compared to the maximal possible number of edges
—increases (see
Figure A3). That is because with an increasing number of edges it becomes impossible to partition the set of nodes in a way that nodes within clusters are more connected to each other than to nodes in different clusters (for details see [
28,
30]). On the other hand, a graph is likely to become symmetric if more and more edges are added, which increases its density [
31].
Indeed, all the graphs with
have a very high average density of
(std
). By directly investigating the analyzed datasets we found out that most of these low-modularity low-symmetry graphs come from the
dimacs and
bhoslib classes. The
dimacs class contains data from the second DIMACS challenge, which was about “Maximum Clique, Graph Coloring, and Satisfiability” (
http://dimacs.rutgers.edu/Challenges/, last accessed 29 August 2017). These graphs were generated, thus no actual real-world networks. The
bhoslib class is discussed below. A third group of graphs were part of the
misc class and have names
G1,
G2 and so on. These are also generated as described by Helmberg and Rendl [
32]—most of them following a random model. As a consequence, these findings explain the large amount of highly non-modular and asymmetric graphs in
Figure 3. It is also important to note that very non-modular graphs are in general unlikely to be real-world networks anyway [
16] (Section 3.7).
Until now we only distinguished between asymmetric and symmetric graphs without taking the degree of symmetry into account. In
Figure 4 the histogram of the normalized redundancy is shown. One can see that most of the analyzed graphs have a relatively low normalized redundancy (over 80% of them with
). However, even a low redundancy of say 0.1 implies that about 10–20% of all nodes are affected by the symmetries, as the following example shows.
is true for
. If all nodes affected by
are on exactly one orbit, approximately all
orbits but one are trivial, which means about 10% of the nodes are affected. Conversely, about 20% of all nodes are affected if there are
trivial and
non-trivial orbits, each consisting of exactly two nodes. A further increase of non-trivial orbits is not possible (proof in
Appendix B).
Due to the very diverse class sizes of the
networkrepository.com data, we did not carry out an in-depth comparison of analysis results between different classes. Nonetheless,
Figure 5 presents a glance at the distributions of normalized redundancy per class. In total, 975 graphs (duplicates among different classes are now included) are spread over 17 classes. Note that
networkrepository.com has 21 different classes in total, due to the data selection described earlier, and some classes vanish in our analysis (“Brain Networks”, “Ecology Networks”, “Massive Network Data”, “Dynamic Networks”). However, we want to give some further information for the largest classes of our analysis.
bhoslib (“
Benchmarks with
Hidden
Optimum
Solutions for Graph Problems”,
http://www.nlsde.buaa.edu.cn/~kexu/benchmarks/graph-benchmarks.htm, last accessed 4 September 2017): All these graphs are asymmetric. They all are generated from the “Model RB” [
33], a derived random graph model. As argued earlier, random graphs tend to be asymmetric.
chem: This largest class contains cheminformatics data which has overall a large degree of symmetry. The origin of all those networks is described by Borgwardt et al. [
34]: They extracted proteins from an online enzyme information system and transformed them into undirected graphs. The nodes are labeled and we also took this labeling into account (see our discussion on graph simplification in
Section 3) by additionally providing the partition vector of labels to the
saucy call.
saucy supports initially labeled nodes as starting point for the search procedure.
dimacs/
dimacs10: These datasets come from different DIMACS challenges (
http://dimacs.rutgers.edu/Challenges/, last accessed 4 September 2017). The former, especially, contains a large number of highly symmetric graphs, but also many asymmetric graphs as described above.
misc: This class contains datasets which seem not to fit any of the other classes. Therefore it is hard to make any valid assertions. Certainly one can see some interesting patterns—many graphs exist with and there are five subclasses with —which gives room for further inspections.
socfb: Although these networks seem to be mostly asymmetric, only four of the 29 actually are. All these graphs are quite large—more than 15,000 nodes on average—as they represent parts of the Facebook social network.
It is worth noting that the 11 remaining smaller classes contain nearly only symmetric networks (54 of 55 are symmetric). Overall, these quick inspections show there exist differences between the different classes.
6. Outlook
This section is structured in a description of our current work in graph clustering diagnostics and some educated guesses on potential application fields. We consider the implementation of the graph-analysis procedure for this study as a stepping stone to the extraction of the generators of the automorphism group of a graph. However, given the automorphism group of a graph (or its generators), we currently work on the construction of invariant graph partition-comparison measures. Combined with the standard partition-comparison measures, one can separate stable and unstable areas of graphs and perform a detailed and quantified analysis of the effects of graph symmetries. An early example of this kind of analysis for the optimal solution of the famous Karate graph [
29] can be found in [
4] (pp. 18–19). For the Karate example, our analysis showed that symmetry did not affect the modularity optimal solution. In this example, the originally published solution is both optimal and stable and, in addition, easily interpretable by sociologists.
However, in general, this may not be the case, and without analysis, one does not know. In our own research, we focus on graph clustering, and checking the impact on modularity maximizing graph partitions shall be accomplished in our future research.
This article lays the foundation for further research in terms of possible influences on network analysis methods. Having shown that symmetries are very likely to exist in real-world networks, it is worth inspecting such methods and the outcome of them on how they are affected by graph symmetries. Also, the interpretation of the results must possibly be reconsidered. For example, what is the implication of two clusters of nodes (result of a graph clustering algorithm) where nodes can be exchanged between clusters by an automorphism (single nodes are all nodes at once)? Or, what does it mean if nodes inside a cluster can be mapped onto each other?
The comparison of the within-class distributions of the normalized network redundancy in
Figure 5 shows that graphs from different classes have a different degree of symmetry. We have not performed an in-depth analysis of graphs aggregated by the different classes, but a glance on it showed that there is room for further research. Especially, the connection between other properties of graphs—like density or clustering coefficient [
16]—and the observed symmetry could be of interest. If a high correlation of symmetry with other graph properties or measures exists, this could ease the decision of whether to perform a symmetry analysis if the considered property/measure is cheap to obtain.
Also, the connection between the graph’s underlying degree distribution—which is the base of most graph models, for example, [
35]—and symmetry characteristics could help to better understand complex networks. This is because the degree distribution has a direct impact on the theoretically possible automorphisms, as only nodes having the same degree can be mapped onto each other.
A different area of application is in communication networks. One line of research in this area relates to information cascades in networks [
36], which describe the spread of some information (or disease) between connected individuals. Again, one could ask what is the impact of redundant nodes? Do they speed up or slow down the spread of information? Another line of research deals with the optimal construction of technical communication networks with regard to their computational complexity, the number of components (gates), noise tolerance in communication channels, or deterministic error reduction in randomized polynomial time algorithms. The theoretical base of this is the construction of expander graphs by using the proved Ramanujan conjecture from the theory of automorphic forms, which is nicely summarized in [
37], and a recent survey of applications in theoretical computer science is presented by Hoory et al. [
38]. We conjecture that expander graphs could be a comparison standard for information cascades in complex real networks.