In this section, we present the key methods that underpin our computational approach to the connectomic characterisation of the MAPK network.
2.1. Complex Network Analysis
We start by listing a number of measures, or network metrics, from the study of complex networks [17] that are central to the subsequent topological analysis of the MAPK network.
To compute these measures, we used the publicly available Python library NetworkX [18] with minor modifications, which we open-sourced at https://github.com/UoS-PLCCN/grn-metrics (accessed on 11 May 2022).
All computations were made on a standard laptop with an Intel Core i7 Processor and 32 GB of RAM.
Communities. Community structure is one of the most studied features of networked systems [19]. Communities in a network are usually described as groups of densely connected nodes with sparse connections to the nodes of other groups.
Communities in our study were identified using Clauset–Newman–Moore greedy modularity maximization [20], as implemented in NetworkX’s greedy_modularity_communities() function. Greedy modularity maximization begins with each node in its own community, then repeatedly joins the pair of communities that most increases modularity until no such pair exists.
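As a minimal sketch of this step, the snippet below runs NetworkX's greedy_modularity_communities() on a toy graph. The toy graph (two five-node cliques joined by a single edge) is an assumption chosen for illustration and is not the MAPK network.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy graph (illustrative assumption): two 5-node cliques joined by one edge.
G = nx.barbell_graph(5, 0)

# Clauset-Newman-Moore greedy modularity maximization: start from singleton
# communities and repeatedly merge the pair that most increases modularity.
communities = [set(c) for c in greedy_modularity_communities(G)]

for index, members in enumerate(communities):
    print(f"community {index}: {sorted(members)}")
```

On this toy graph the procedure is expected to recover the two cliques as separate communities.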
Modularity. With the communities identified using greedy modularity maximization, we can additionally compute the modularity of this particular partition of the graph. This is the highest modularity value reached by the greedy procedure, although it is not guaranteed to be the global maximum.
Given a set of $n$ communities $C$, Clauset et al. [20] showed that modularity can be computed as
$$Q = \sum_{c=1}^{n} \left[ \frac{L_c}{m} - \gamma \left( \frac{k_c}{2m} \right)^{2} \right] \quad (1)$$
where $L_c$ is the number of intra-community links of community $c$, $k_c$ the sum of degrees within the community, $m$ the total number of edges in the graph and $\gamma$ the resolution parameter.
A high modularity value implies that the network lends itself naturally to division into modules. We use the pre-computed communities along with NetworkX’s modularity function to calculate it.
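The following sketch evaluates the modularity expression directly, in the form documented for NetworkX's modularity function, $Q = \sum_c [L_c/m - \gamma (k_c/2m)^2]$, and compares the result with the library call. The toy graph, the hand-written partition and the helper name modularity_from_formula are assumptions made for illustration.

```python
import networkx as nx
from networkx.algorithms.community import modularity


def modularity_from_formula(G, communities, gamma=1.0):
    """Evaluate Q = sum_c [L_c / m - gamma * (k_c / (2 m))**2] directly."""
    m = G.number_of_edges()
    q = 0.0
    for community in communities:
        # L_c: links with both endpoints inside the community.
        l_c = G.subgraph(community).number_of_edges()
        # k_c: sum of the whole-graph degrees of the community's nodes.
        k_c = sum(degree for _, degree in G.degree(community))
        q += l_c / m - gamma * (k_c / (2 * m)) ** 2
    return q


# Toy graph (illustrative assumption): two 5-node cliques joined by one edge.
G = nx.barbell_graph(5, 0)
partition = [set(range(5)), set(range(5, 10))]

q_formula = modularity_from_formula(G, partition)
q_networkx = modularity(G, partition)
print(q_formula, q_networkx)
```

For this partition each clique has $L_c = 10$ and $k_c = 21$ with $m = 21$, so both values should equal $20/21 - 1/2$.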
Small-world coefficient. Next, we are interested in the degree of small-worldness of the network under study [21]. A small-world network is more clustered, with a smaller characteristic path length, than degree-preserved random networks. In other words, most nodes can be reached from every other node in a small number of hops or steps. The small-world coefficient, which quantifies how strong this effect is in a network, is the ratio of the clustering coefficient to the characteristic path length, each normalized relative to its value in the random networks.
The small-world coefficient in our study was calculated using a modified version of NetworkX’s sigma function from the smallworld module. Specifically, the original implementation uses a random reference graph, which we deem insufficient; we therefore modified the function to use Erdos–Renyi graphs instead.
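One way such a modification could look is sketched below. It is not the authors' implementation: it assumes the coefficient takes the form $\sigma = (C / C_r) / (L / L_r)$, uses nx.transitivity for $C$ as NetworkX's sigma does, and draws connected $G(n, m)$ random graphs as the Erdos–Renyi reference. The helper names, the number of reference graphs and the retry limit are hypothetical.

```python
import random

import networkx as nx


def connected_gnm(n, m, rng, max_attempts=1000):
    """Draw G(n, m) random graphs until a connected one is found."""
    for _ in range(max_attempts):
        R = nx.gnm_random_graph(n, m, seed=rng.randrange(2**32))
        if nx.is_connected(R):
            return R
    raise RuntimeError("no connected reference graph found")


def sigma_with_er_reference(G, n_random=10, seed=None):
    """Small-world coefficient of a connected graph G against ER references."""
    rng = random.Random(seed)
    n, m = G.number_of_nodes(), G.number_of_edges()
    references = [connected_gnm(n, m, rng) for _ in range(n_random)]

    # Clustering and path length of the network under study.
    c = nx.transitivity(G)
    l = nx.average_shortest_path_length(G)

    # The same quantities averaged over the random reference graphs.
    c_rand = sum(nx.transitivity(R) for R in references) / n_random
    l_rand = sum(nx.average_shortest_path_length(R) for R in references) / n_random
    if c_rand == 0:
        raise ValueError("reference graphs contain no triangles")
    return (c / c_rand) / (l / l_rand)


# Usage on a toy small-world graph (illustrative assumption).
G = nx.connected_watts_strogatz_graph(30, 4, 0.1, seed=1)
sigma = sigma_with_er_reference(G, n_random=5, seed=1)
print(sigma)
```

Values of $\sigma$ above 1 are commonly read as an indication of small-world structure.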
Clustering coefficient. The global clustering coefficient measures the clustering present in the entire network, as opposed to clustering localized around a single node’s neighbourhood.
It is defined as the total number of triangles present in the network divided by the number of all possible triangles that the network could have [22]. Given a network with $n$ vertices and $t$ triangles, the global clustering coefficient can be computed as $t / \binom{n}{3}$. We use NetworkX’s triangles function to calculate $t$.
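A minimal sketch of this computation follows, assuming the definition stated above (triangles present divided by the $\binom{n}{3}$ possible triangles on $n$ vertices). The helper name global_clustering and the toy graphs are illustrative assumptions.

```python
import math

import networkx as nx


def global_clustering(G):
    """Triangles present divided by the number of possible triangles."""
    n = G.number_of_nodes()
    if n < 3:
        raise ValueError("at least three vertices are needed to form a triangle")
    # nx.triangles reports, per node, the triangles that node belongs to,
    # so every triangle is counted once at each of its three corners.
    t = sum(nx.triangles(G).values()) // 3
    return t / math.comb(n, 3)


# A triangle with one pendant node: 1 triangle out of C(4, 3) = 4 possible.
G = nx.Graph([(0, 1), (1, 2), (0, 2), (2, 3)])
print(global_clustering(G))
```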
Characteristic path length. The characteristic path length of a graph $G$ with vertex set $V$ and $n$ vertices, also known as the average path length, is the average distance between all pairs of distinct nodes in the graph, $L(G) = \frac{1}{n(n-1)} \sum_{u, v \in V,\, u \neq v} d(u, v)$, where $d(u, v)$ is the length of the shortest path between $u$ and $v$. A smaller characteristic path length is linked to a higher efficiency in the transfer of information in the graph [23].
We compute the characteristic path length using NetworkX’s average_shortest_path_length function.
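For illustration, the sketch below averages shortest-path distances over all ordered pairs of distinct nodes and checks the result against NetworkX's average_shortest_path_length. The helper name and the toy path graph are assumptions.

```python
import networkx as nx


def characteristic_path_length(G):
    """Average shortest-path distance over ordered pairs of distinct nodes."""
    n = G.number_of_nodes()
    total = 0
    for u in G.nodes:
        distances = nx.single_source_shortest_path_length(G, u)
        # d(u, u) = 0, so including the source itself adds nothing.
        total += sum(distances.values())
    return total / (n * (n - 1))


# Path graph 0-1-2-3: ordered-pair distances sum to 20, so the mean is 20 / 12.
G = nx.path_graph(4)
manual = characteristic_path_length(G)
library = nx.average_shortest_path_length(G)
print(manual, library)
```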
Erdos–Renyi graphs. A $G(n, p)$ Erdos–Renyi graph [24] is a random graph with $n$ vertices, where the presence of each edge $(u, v)$ is independent and identically distributed with probability $p$. By the law of large numbers, as the number of nodes in such a random graph tends to infinity, the number of generated edges approaches $\binom{n}{2} p$. The likelihood of generating a network $G$ of $n$ vertices and $m$ edges is given by
$$P(G) = p^{m} (1 - p)^{\binom{n}{2} - m} \quad (2)$$
We compute Erdos–Renyi graphs according to the procedure outlined in Algorithm 1. As inputs we use $n$ and $m$, the numbers of nodes and edges of the network that is the object of our study, and the edge probability $p$. All other random samples, e.g., during the operations that guarantee that the resulting graph is connected, are drawn from a uniform probability distribution.
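The sketch below shows one possible way to generate a connected random graph with exactly $m$ edges from inputs $n$, $m$ and $p$. It is an assumption-laden illustration and not a reproduction of Algorithm 1: the two-phase structure, the use of nx.bridges to keep the edge count fixed while repairing connectivity, and the function name connected_er_graph are all hypothetical choices.

```python
import itertools
import random

import networkx as nx


def connected_er_graph(n, m, p, seed=None):
    """Random connected graph with n nodes and m edges (illustrative sketch)."""
    if n < 2 or not (n - 1 <= m <= n * (n - 1) // 2) or not (0 < p <= 1):
        raise ValueError("need n >= 2, n - 1 <= m <= C(n, 2) and 0 < p <= 1")
    rng = random.Random(seed)
    G = nx.Graph()
    G.add_nodes_from(range(n))
    pairs = list(itertools.combinations(range(n), 2))

    # Phase 1: draw node pairs uniformly and keep each with probability p
    # until m distinct edges exist (re-adding an existing edge is a no-op).
    while G.number_of_edges() < m:
        u, v = rng.choice(pairs)
        if rng.random() < p:
            G.add_edge(u, v)

    # Phase 2: join two uniformly chosen components, then delete a uniformly
    # chosen non-bridge edge so that the edge count returns to m.
    while not nx.is_connected(G):
        first, second = rng.sample(list(nx.connected_components(G)), 2)
        G.add_edge(rng.choice(sorted(first)), rng.choice(sorted(second)))
        bridges = {frozenset(edge) for edge in nx.bridges(G)}
        removable = [edge for edge in G.edges if frozenset(edge) not in bridges]
        G.remove_edge(*rng.choice(removable))
    return G


G = connected_er_graph(20, 40, 0.5, seed=0)
print(G.number_of_nodes(), G.number_of_edges(), nx.is_connected(G))
```

Removing only non-bridge edges means each repair step reduces the number of components by one, so the second loop terminates whenever $m \geq n - 1$.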
Rich-club coefficient. Afterwards, we turn our attention to the rich-club coefficient of a complex network [25], a feature which describes how well connected a set of hub nodes is to one another and has been shown to influence structural and functional characteristics of networks, including topology, the efficiency of paths and distribution of load [26]. Intuitively, a subnetwork with only rich-club nodes should have more connections than a random network with the same degree and edge distributions.
Algorithm 1. ER graph generation algorithm.
Input: n, m, p
Output: [...]
  [...]    ▹ n nodes
  [...]
  while [...] do
    [...]
    [...]
  end while
  while not connected do
    [...]
    [...]
    [...]    ▹ the function is provided by NetworkX
    [...]
    while [...] do
      [...]
    end while
    [...]
    [...]
  end while
For each degree $k$ in a complex network, given the number of vertices $N_k$ with degree larger than $k$, and the number of edges $E_k$ present between said vertices, the rich-club coefficient is defined as:
$$\phi(k) = \frac{2 E_k}{N_k (N_k - 1)} \quad (3)$$
Intuitively, this is the ratio of the actual number of edges present between nodes with degree larger than $k$ to the number of possible edges between them ($N_k (N_k - 1) / 2$).
In our study, similar to [27], we normalize the above value by the rich-club coefficient of a random network, $\phi_{\mathrm{rand}}(k)$. As our null model, we use random networks with the same number of edges and degree distribution as our original graph, generated according to the Maslov and Sneppen (MS) procedure [28]. Specifically, we use the original network as a starting point, then perform a number of degree-preserving edge swaps determined by $m$, where $m$ is the number of edges in the network.
Unlike [27], however, our $\phi_{\mathrm{rand}}(k)$ is the average of the rich-club coefficients of 1000 random networks. To this end, we modified NetworkX’s rich_club_coefficient implementation.
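A minimal sketch of this averaging scheme is given below. It is not the authors' modified implementation: it combines NetworkX's rich_club_coefficient (with normalized=False) and double_edge_swap, and the function name, the swaps_per_edge parameter and its default value are hypothetical, since the number of swaps used in the study is not reproduced here.

```python
import random

import networkx as nx


def averaged_normalized_rich_club(G, n_random=1000, swaps_per_edge=10, seed=None):
    """Rich-club coefficient divided by its mean over rewired null models."""
    rng = random.Random(seed)
    m = G.number_of_edges()
    phi = nx.rich_club_coefficient(G, normalized=False)
    totals = dict.fromkeys(phi, 0.0)

    for _ in range(n_random):
        # Null model: start from the original network and apply
        # degree-preserving double-edge swaps (swap count is an assumption).
        R = G.copy()
        nswap = swaps_per_edge * m
        nx.double_edge_swap(
            R, nswap=nswap, max_tries=20 * nswap, seed=rng.randrange(2**32)
        )
        phi_random = nx.rich_club_coefficient(R, normalized=False)
        for k in totals:
            totals[k] += phi_random.get(k, 0.0)

    # Skip degrees whose averaged null-model coefficient is zero.
    result = {}
    for k, value in phi.items():
        mean_random = totals[k] / n_random
        if mean_random > 0:
            result[k] = value / mean_random
    return result


# Usage on a toy scale-free graph (illustrative assumption), with few samples.
G = nx.barabasi_albert_graph(40, 3, seed=1)
normalized = averaged_normalized_rich_club(G, n_random=5, swaps_per_edge=2, seed=1)
print(normalized)
```

Because the swaps preserve every node's degree, the null models share the degrees $k$ for which the coefficient is defined with the original graph.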
It is worth noting that, given the denominator in Equation (3), the rich-club coefficient is only defined for degrees $k$ such that $N_k > 1$. In other words, we can only calculate the coefficient for degrees $k$ such that the subgraph with vertices of degree higher than $k$ has at least two nodes. Thus, given the degree distribution in the MAPK network, which will be further discussed in
Section 3.2.1, we can only compute the coefficient for degrees
, since
and
. Specifically, we compute the rich-club coefficient (RCC) for
, as its value for
and
is generally not informative.
2.2. Mapping to Hallmarks of Cancer
Gene ontology (GO) IDs were manually associated with individual hallmarks of cancer (see Table S1). Subsequently, the sets of genes captured by the communities identified using the above-mentioned techniques were input to BiNGO [29], a Cytoscape plugin that infers the over-represented GO biological process terms present in a network or any of its subgraphs. We applied BiNGO to each identified community, configured with a
p-value threshold of
, using hypergeometric statistical testing and the Benjamini and Hochberg false discovery rate (FDR) correction, to compute statistically significant GO IDs.
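To make the statistical step concrete, the sketch below implements a hypergeometric over-representation p-value and the Benjamini and Hochberg adjustment in plain Python. It illustrates the general techniques only and is not BiNGO's code; the function names and the example numbers are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(x, sample_size, annotated, population):
    """P(X >= x): chance of seeing at least x annotated genes in the sample."""
    upper = min(sample_size, annotated)
    total = comb(population, sample_size)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample_size - i)
        for i in range(x, upper + 1)
    )
    return favourable / total


def benjamini_hochberg(p_values):
    """FDR-adjusted p-values, returned in the order of the input."""
    count = len(p_values)
    order = sorted(range(count), key=lambda i: p_values[i])
    adjusted = [0.0] * count
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(count, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * count / rank)
        adjusted[index] = running_min
    return adjusted


# Hypothetical example: 10 genes, 4 annotated, a community of 3 containing 2.
p_single = hypergeom_upper_tail(2, 3, 4, 10)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(p_single, adjusted)
```

A GO term would then be called significant when its adjusted p-value falls below the chosen threshold.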
The top ten GO IDs that were linked to hallmarks of cancer (using Table S1) were then counted. A community was associated with a hallmark if at least two of these GO IDs were linked to the same hallmark. This allowed us to associate each community with at least two hallmarks.
To further quantify how strongly the communities associate with their hallmarks, we computed a measure of prominence (MOP). The MOP of a hallmark for a given community is defined as the ratio of the number of GO IDs linked to that hallmark in that community to the average number of GO IDs linked to that hallmark across all six communities, i.e., it is given by
$$\mathrm{MOP}_{i,j} = \frac{g_{i,j}}{\bar{g}_j} \quad (4)$$
where $g_{i,j}$ is the number of GO IDs in community $i$ linked to hallmark $j$, and $\bar{g}_j = \frac{1}{6} \sum_{i=1}^{6} g_{i,j}$ denotes the average number of GO IDs for hallmark $j$. Hence, the MOP quantifies how much the occurrence of a community’s associated hallmark differs from its average occurrence across all communities.
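The MOP can be sketched in a few lines of Python: each count is divided by the mean count for the same hallmark across communities. The data layout (a nested dictionary), the function name and the example counts are hypothetical and do not come from the study.

```python
def measure_of_prominence(counts):
    """counts[community][hallmark] is the number of linked GO IDs."""
    communities = list(counts)
    hallmarks = sorted({h for row in counts.values() for h in row})
    mop = {community: {} for community in communities}
    for hallmark in hallmarks:
        # Average number of GO IDs for this hallmark across all communities.
        mean = sum(counts[c].get(hallmark, 0) for c in communities) / len(communities)
        if mean == 0:
            continue  # the ratio is undefined when no community has this hallmark
        for community in communities:
            mop[community][hallmark] = counts[community].get(hallmark, 0) / mean
    return mop


# Hypothetical counts for two communities and two hallmarks.
example_counts = {
    "community_1": {"hallmark_A": 4, "hallmark_B": 0},
    "community_2": {"hallmark_A": 2, "hallmark_B": 3},
}
example_mop = measure_of_prominence(example_counts)
print(example_mop)
```

A value above 1 indicates that the hallmark occurs more often in that community than on average.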
To quantify statistical significance, we conducted a one-sided binomial test for each community and hallmark using the MATLAB function “myBinomTest()” implemented by Matthew Nelson [30]. The $p$-values for each community and its dominant hallmarks were calculated using MathWorks MATLAB version R2020b.
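For readers without MATLAB, a one-sided binomial test can be sketched in Python as the upper-tail probability of a binomial distribution. This is a generic illustration and not a port of myBinomTest(): the function name is hypothetical, and the null probability p_null must be supplied by the analyst, since its value in the study is not restated here.

```python
from math import comb


def binomial_upper_tail(successes, trials, p_null):
    """One-sided p-value P(X >= successes) for X ~ Binomial(trials, p_null)."""
    return sum(
        comb(trials, i) * p_null**i * (1 - p_null) ** (trials - i)
        for i in range(successes, trials + 1)
    )


# Hypothetical example: 8 successes in 10 trials under a null probability of 0.5.
p_value = binomial_upper_tail(8, 10, 0.5)
print(p_value)
```

In the hypothetical example the tail contains $\binom{10}{8} + \binom{10}{9} + \binom{10}{10} = 56$ of the $2^{10}$ equally likely outcomes.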