Do signaling networks and whole-transcriptome gene expression profiles orchestrate the same symphony?

Abstract Studying relationships among gene-product expression profiles is a common approach in systems biology. Many studies have generalized this subject to different levels of the central dogma information flow and assumed correlation of transcript and protein expression levels. All these efforts have updated the signaling network models and expanded the current signaling databases, which include interactions among the gene-products extracted based on either the literature or direct and indirect experiments. In fact, due to unavailability or high-cost of the experiments, most of the studies do not look for the direct interactions (gene-protein or protein-protein) and some of them are contradictory. In addition, it is now a standard practice to undertake enrichment analysis on biological annotations especially in omics research to make claims about the potentially implicated biological pathways in disease. Specifically, upon identifying differentially expressed genes, molecular mechanistic insights are proposed based on statistically enriched biological processes for disease etiology and drug discovery. However, it remains to be demonstrated that expression data may be used as a reliable source to infer causal relationships among gene pairs. In this study, using four common and comprehensive databases i.e. GEO, GDSC, KEGG, and OmniPath, we extracted all relevant gene expression data and all relationships among directly linked gene pairs in order to evaluate the rate of coherency or sign consistency. We illustrated that the signaling network was not more consistent or coherent with the measured expression profile compare to random relationships. Finally, we provided the pieces of evidence and concluded that gene-product expression data, especially at the transcript level, are not reliable or at least insufficient to infer biological relationships among genes and in turn describe cellular behavior.

Studying relationships among gene-product expression profiles is a common approach in systems biology. Many studies have generalized this subject to different levels of the central dogma information flow and assumed correlation of transcript and protein expression levels.
All these efforts have updated the signaling network models and expanded the current signaling databases, which include interactions among the gene-products extracted based on either the literature or direct and indirect experiments. In fact, due to unavailability or highcost of the experiments, most of the studies do not look for the direct interactions (geneprotein or protein-protein) and some of them are contradictory. In addition, it is now a standard practice to undertake enrichment analysis on biological annotations especially in omics research to make claims about the potentially implicated biological pathways in disease. Specifically, upon identifying differentially expressed genes, molecular mechanistic insights are proposed based on statistically enriched biological processes for disease etiology and drug discovery. However, it remains to be demonstrated that expression data may be used as a reliable source to infer causal relationships among gene pairs. In this study, using four common and comprehensive databases i.e. GEO, GDSC, KEGG, and OmniPath, we extracted all relevant gene expression data and all relationships among directly linked gene pairs in order to evaluate the rate of coherency or sign consistency. We illustrated that the signaling network was not more consistent or coherent with the measured expression profile compare to random relationships. Finally, we provided the pieces of evidence and concluded that gene-product expression data, especially at the transcript level, are not reliable or at least insufficient to infer biological relationships among genes and in turn describe cellular behavior.
In network biology, defining relationships among nodes is crucial for the downstream analysis (1). The most available high-throughput data to infer molecular relationships are arguably whole-transcriptome expression profiles analyzed with statistical models (2). A main challenge is extrapolating causality in signaling and regulatory mechanisms from a significant correlation between any given gene pair. A lot of spurious correlations among gene pairs may occur without any causal relationship that could happen indirectly or stochastically (3). Reverse engineering algorithms are developed to tackle this challenge and to infer gene networks and regulatory interactions from expression profiles (4).
When considering signaling networks, their main players are proteins whose activity is often regulated by post-translational modifications such as phosphorylation. Hence, inference of signaling networks can be directly inferred from (Phospho)proteomic and protein-protein interaction data (5). This data is hard and expensive to acquire. Given the correlation between protein and gene expression, a common alternative approach is to use gene expression to estimate interactions between proteins. However, in general, the gene expression or transcriptomics discuss about what appears to happen in a biological system, while the signaling network exhaust to what makes it happens and has happened in a complex view of the system (6). This therefore begs the question whether gene expression profiles strengthen the logic i.e. activatory/inhibitory mechanism of signaling circuits.
In this study, we aimed to examine the coherency between expression profiles and the type of relationship, in signaling networks, for all possible gene pairs. Imagine in a gene pair (A, B) where gene A activates gene B. If the expression profiles of both were correlated positively, we infer that expression data strengthen the logic of this signaling relationship and are thus coherent. In contrast, let gene A inhibit gene B. In this case, the coherent gene pairs are negatively correlated. If gene A activates gene B and there is negative correlation between them or if gene A inhibits gene B and there is positive correlation between them, we called an incoherent relationship between the gene pairs. In addition to these simple scenarios, we have also considered more complicated subgraphs in a signaling network (See Table 1) to answer the question raised above. . CC-BY-NC-ND 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 21, 2019. . https://doi.org/10.1101 For this, we used expression datasets in the Gene Expression Omnibus (GEO) (7) and Genomics of Drug Sensitivity in Cancer (GDSC) (8) to extract all relevant gene expression profiles. Two literature-curated databases for signaling pathways, namely the Kyoto Encyclopedia of Genes and Genomes (KEGG) (9) and OmniPath (10) were used to extract the type of relationships among directly linked gene pairs. Therefore, coherency analysis was undertaken independently for all four combinations of databases in parallel (see Fig. 1). The edge lists obtained in the previous step were converted into adjacency matrices using igraph package in R (13). Then, the adjacency matrix was self-multiplied more than the diameter of the network. After that, we randomly selected 1000 unconnected gene pairs for which the corresponding elements in the matrix were zero (gene pairs with no direct immediate and non-immediate interactions). For these gene pairs, that we call unconnected gene pairs (UGPs), the same downstream analyses were implemented to compare significance and sign of correlation for all types of connected gene pairs (Supplementary file 1 section 6, Supplementary files 7).
We extracted specific subgraphs from the signaling networks to investigate any relationship between gene expression profiles and complex structure of gene pairs. DNFBL, DPFBL1, and DPFBL2 are subgraphs of gene pairs which influence each other directly twice (see Table 1). These pairs are readily found by checking the source and target nodes in the edge lists (or upper and lower triangles in adjacency matrices). We then focused on connected gene pairs which also influence each other indirectly by a sequence of intermediate nodes.
Using matrix self-multiplication, the weighted and un-weighted adjacency matrices of the component of eligible edges in signaling network are powered by the network radius magnitude. Considering that the network is directed and the adjacency matrix is not . CC-BY-NC-ND 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 21, 2019. . https://doi.org/10.1101/643866 doi: bioRxiv preprint symmetric, the feed-forward and feed-back loops i.e. MNFBL1-2, MPFBL1-2, MFFL1-2, and MNFFL1-2 are determined (Table 1). For more detailed explanation, see Supplementary   File  1  sections  7  and  8. . CC-BY-NC-ND 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 21, 2019. . https://doi.org/10.1101/643866 doi: bioRxiv preprint

3
. R e s u l t s Table 2 provides overall details of the four parallel coherency analyses presented here including the dimension of the expression matrices generated from whole-transcriptome expression profiles, and the size and diameter of the giant component in each analysis. Of note, the number of unique edge list genes was higher in OmniPath than KEGG. In addition, the ratio of genes common to the edge list and gene expression profiles was also greater in OmniPath.
For correlation analysis between any gene pair, we only considered gene expression datasets which have more than two samples. These gene pairs were considered as eligible edges for downstream statistical analysis.
The ratio of eligible edges to all edges was calculated for all four analyses (see Fig. 2B).
The ratio of eligible edges in the OmniPath edge list was greater than KEGG based on both GDSC and GEO databases. Also, the ratio of eligible edges was greater in GDSC compared with GEO.
. CC-BY-NC-ND 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 21, 2019. . https://doi.org/10.1101/643866 doi: bioRxiv preprint  Samples with expression data for the gene pairs may have come from different datasets and therefore should be separated and analyzed independently. Fig. 2A represents the effect of this preprocessing on a gene pair in our dataset. Briefly, in Fig. 2A   The ratio of coherent edges in OmniPath is totally more than KEGG. Also, the ratio of coherent edges in GDSC database is more than GEO. Fig. 4 shows the FDR-adjusted P-vlaues versus r correlation coefficients of activation and inhibition edges in all four analysis. A symmetric pattern of coefficients is recognizable for both of activation and inhibition edges in all four analyses. This suggests that the correlation between a given gene pair is not largely affected by the sign of the interaction. In the other word, even activation edges illustrate an overrepresentation of strong positively correlated gene pairs in all four analysis, the inhibition edges do not display any enrichment in strong negative side of plots compared to strong positive side. It also shows that the majority of coherent gene pairs are related to activation not inhibition edges. In the next step, we tried to explore more and provide reasoning about the incoherent edges by focusing on more complex subgraphs in signaling network.   Table 3).
Similar to the simple activation or inhibition edges, correlations are computed and categorized considering the correlation sign for each mentioned subgraph and the calculated P-values. These data are available in Supplementary file 2 for all four analyses. The ratio of subgraphs for KEGG/GEO analysis are presented in Table 3.
To statistically compare, the correlation analysis was also implemented on multiple sets of 1,000 randomly unconnected gene pairs (UGP) and Mann-Whitney proportion test was then computed to compare all of the proportions illustrated in Table 3 5). Since the number of edges are very low, we aimed to continue our search for coherency in larger structure of subgraphs. We therefore investigated the large subgraphs which contain more than two edges i.e. MNFBLs, MPFBLs, MFFLs and MNFFLs. However, we did not observe any strong coherent relationship among the gene pairs again, suggesting that, depending on the structure of the subgraph, gene expression profiles do not match the logic of signaling circuits.
Note that the ratios for the analysis based on OmniPath and GDSC is more uniformly distributed hold a candle to others and there is not any kind of dual feedback loop structures i.e. DNFBL and DPFBLs in OmniPath signaling network which can be controversy (Fig. 5).
. CC-BY-NC-ND 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 21, 2019. . https://doi.org/10.1101/643866 doi: bioRxiv preprint  Although there is a general assumption that the expression level could strengthen or weaken the signal to transduce in signaling pathway, but we illustrated that in many instances, there is not a noticeable coherency between the mRNA level of gene pairs and the way (i.e. logic) they manipulate one another (Fig. 5). However, we also showed that there is a sort of association between the structure of the subgraphs and gene pair expression profiles.
. CC-BY-NC-ND 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 21, 2019. . https://doi.org/10.1101/643866 doi: bioRxiv preprint Expression profiles of the unconnected gene pairs were statistically more independent than connected ones. To support this idea, two signaling databases and two gene expression databases were used and the similar results acquired in the analysis of all four combinations.
Based on the correlation results in Fig. 3A, the volcano plots in Fig. 4 which exhibit no significant difference between activation and inhibition edges, the ratios in table 3 and Fig. 5, causal correlation can be inferred poorly at the transcript level at least in a multicellular eukaryotic such as human. Proportional tests in supplementary file 3, suggest that there is a statistical difference between UGP and other subgraphs and this demonstrates that structure of subgraphs affect the coherency. It is also strongly advocated to use information in signaling networks, or define relationships between the genes, assess the gene expression at both transcript and protein level or look for the direct inteartions.
In this study, we aimed to focus on the impact of the relationship logic on the destination of any given stimulated signaling pathway which usually ignored in functional genomic studies. We demonstrated that differentially expressed genes have only a little information of the whole story of associated mechanism. Most of these kinds of altred expression are disappeared gradually and ignored by the whole system of signaling network either stimulated endogenously or exogenously.
The authors would like to thank Dr. Julio Saez-Rodriguez and Bence Szalai for their valuable comments and discussions.
. CC-BY-NC-ND 4.0 International license certified by peer review) is the author/funder. It is made available under a  1  9  .  M  a  t  a  J  ,  M  a  r  g  u  e  r  a  t  S  ,  B  ä  h  l  e  r  J  .  P  o  s  t  -t  r  a  n  s  c  r  i  p  t  i  o  n  a  l  c  o  n  t  r  o  l  o  f  g  e  n  e  e  x  p  r  e  s  s  i  o  n  :  a  g  e  n  o  m  e  -w  i  d  e  p  e  r  s  p  e  c  t  i  v  e  .  T  r  e  n  d  s  i  n  b  i  o  c  h  e  m  i  c  a  l  s  c  i  e  n  c  e  s  .  2  0  0  5  ;  3  0  (  9  )  :  5  0  6  -1  4  .   2  0  .  L  u  P  ,  V  o  g  e  l  C  ,  W  a  n  g  R  ,  Y  a  o  X  ,  M  a  r  c  o  t  t  e  E  M  .  A  b  s  o  l  u  t  e  p  r  o  t  e  i  n  e  x  p  r  e  s  s  i  o  n  p  r  o  f  i  l  i  n  g  e  s  t  i  m  a  t  e  s  t  h  e  r  e  l  a  t  i  v  e  c  o  n  t  r  i  b  u  t  i  o  n  s  o  f  t  r  a  n  s  c  r  i  p  t  i  o  n  a  l  a  n  d  t  r  a  n  s  l  a  t  i  o  n  a  l  r  e  g  u  l  a  t  i  o  n  .  N  a  t  u  r  e  b  i  o  t  e  c  h  n  o  l  o  g  y  .  2  0  0  7  ;  2  5 (  . CC-BY-NC-ND 4.0 International license certified by peer review) is the author/funder. It is made available under a The copyright holder for this preprint (which was not this version posted May 21, 2019. . https://doi.org/10.1101/643866 doi: bioRxiv preprint