DynSig: Modelling Dynamic Signaling Alterations along Gene Pathways for Identifying Differential Pathways

Although a number of methods have been proposed for identifying differentially expressed pathways (DEPs), few efforts consider the dynamic components of pathway networks, i.e., gene links. We here propose a signaling dynamics detection method for identification of DEPs, DynSig, which detects the molecular signaling changes in cancerous cells along pathway topology. Specifically, DynSig relies on gene links, instead of gene nodes, in pathways, and models the dynamic behavior of pathways based on Markov chain model (MCM). By incorporating the dynamics of molecular signaling, DynSig allows for an in-depth characterization of pathway activity. To identify DEPs, a novel statistic of activity alteration of pathways was formulated as an overall signaling perturbation score between sample classes. Experimental results on both simulation and real-world datasets demonstrate the effectiveness and efficiency of the proposed method in identifying differential pathways.


Introduction
With the rapid development of high-throughput technology, including microarrays and deep sequencing, tremendous amounts of various omics data have been generated and accumulated, which provides unprecedented opportunities for understanding molecular mechanisms of cells and disease etiology [1][2][3][4][5]. Relative to basic gene units, pathways are higher-level functional modules in cells, and play more critical roles in various cellular processes, such as tumorigenesis [6]. Identifying differentially expressed genes (DEGs) often suffers from being less statistically reproducible and less biologically interpretable, while identifying differentially expressed pathways (DEPs) provides more consistent and more reliable knowledge about cancer or other diseases through maximizing the potential of omics data [7,8]. Currently, computational methods for identifying DEPs is still underdeveloped theoretically and practically, although a large number of computational methods are available for DEG identification [6].
Current methods for identifying DEPs can be categorized into three generations [9]: overrepresentation analysis, functional class scoring, and pathway topology. Overrepresentation analysis methods only test how significantly a pathway contains DEGs than by chance, and employ Fisher's exact test or geometric distribution-based test to estimate the significance. Functional class scoring methods assume that not only large changes, but also weaker, coordinated changes of genes in

Framework of DynSig for Pathway Analysis
Pathways as gene networks consist of a set of gene nodes and a set of gene links. The gene links reflect regulatory, signaling, or binding relationships, in which the source gene activates or inhibits the destination gene. From a regulatory aspect, genes could be in one of three regulatory states: downregulated, non-significantly regulated, or upregulated. The state transition patterns along gene links are potentially associated with the distinction between two phenotypes of cells. For example, if they are both overexpressed in cancer tissues relative to those in normal tissues, one can say that the gene links are likely related to, or even drive, cancer. We reason that pathways with many such cancer-related gene links should be associated with cancer, and here, present a gene link-based method (DynSig) for identifying DEPs, in which the disparity of pathways in signaling dynamics between classes is explored and utilized. Figure 1 illustrates the framework of DynSig. Specifically, given a pathway, the method first decomposes it into a set of gene links and introduces Markov chain model to model the signaling pattern along each gene link in each class. Then, each link is scored based on an MCM-based classification rule, via a leave-one-out cross validation procedure, which subsequently results in a sample-link score matrix. Based on the score matrix, DynSig finally statistically detects differentially expressed links and DEPs.

Data Preparation
An increasing number of methods have been developed to discretizate continuous gene expression data for analyzing gene expression patterns. Generally speaking, these methods discretizate the continuous gene expression values into three discrete gene regulatory states: downregulated (D), non-regulated (N) and upregulated (U). Based on the discretization, a matrix of continuous gene expression can be transformed into a matrix of regulatory states, i.e., s ij M×N , where s ij ∈ {D, N, U} represents the regulatory states of the i-th gene in the j-th sample, M is the number of genes, and N is the number of all samples. In this paper, the recently developed biology-constrained method [22] is used as the default method to discretize gene continuous expression data.
Without loss of generality, suppose that all the gene expression samples belong to two classes (e.g., control vs. case or cancer vs. normal). Then these samples can be divided into three sets: Two training sets with samples selected from the two classes, respectively, and a test set that comprises all of the rest samples.

Modeling the Dynamics of Gene Links Using Markov Chain Model
In statistics, a Markov chain is defined as a stochastic process with Markov property, in which the next state depends only on the current state but irrelevant with any previous event along the chain [23]. Generally, there are two types of Markov chain models, i.e., stationary and non-stationary, which differ in whether transition matrix is shared along the process or not [23]. Compared with stationary MCM, non-stationary one allows different state transition probability matrices (TPMs) for each transition step along the chain of process, so that complex stochastic processes can be modeled without information loss. Considering the complexity of signaling between genes, we here employ non-stationary MCM to model pathways. Specifically, we here consider a chain consisting of two adjacent genes, i.e., gene link, and establish its MCM, which can be readily extended to signaling along a cascade of three or more genes. Figure 2 illustrates the MCM of gene links (MCMLink). Topologically, a pathway consists of a set of gene links comprised of two adjacent genes. We view a gene link l as a Markov chain of two genes (nodes), which transits biological information from the source gene g 1 to the destination gene g 2 , as shown in Figure 2.  Figure 1. The flowchart of the proposed method DynSig. The raw input data includes an L-link pathway and a continuous expression matrix with sample labels (e.g., Class 1 vs. Class 2). The continuous gene expression values are discretized into three regulatory states based on discretization methods: downregulated (D), non-regulated (N) and upregulated (U). Furthermore, all the samples are divided into three sets: two training sets for Class 1 and Class 2, respectively, and a test set comprising the rest of samples. For the i-th link (i = 1,2, …, L) in the pathway, two Markov chain model links (MCMLinks) are built based on the two training sets, respectively. Then, by using the leave-one-out-cross-validation (LOOCV) procedure, the j-th sample is classified by the i-th link and we have aji = 1 if the classification is right and 0 otherwise. The classification power of the i-th link is calculated by averaging the resulted scores {aji}j=1,2,…,N, i.e., di = (a1i + a2i + … + aNi)/N and the classification power of the pathway is calculated by averaging the classification powers of all links, DEP = (d1 + d2 + … + dL)/L. Permutation test is utilized for evaluating p-values of the resulted classification powers. The raw input data includes an L-link pathway and a continuous expression matrix with sample labels (e.g., Class 1 vs. Class 2). The continuous gene expression values are discretized into three regulatory states based on discretization methods: downregulated (D), non-regulated (N) and upregulated (U). Furthermore, all the samples are divided into three sets: two training sets for Class 1 and Class 2, respectively, and a test set comprising the rest of samples. For the i-th link (i = 1,2, . . . , L) in the pathway, two Markov chain model links (MCMLinks) are built based on the two training sets, respectively. Then, by using the leave-one-out-cross-validation (LOOCV) procedure, the j-th sample is classified by the i-th link and we have a ji = 1 if the classification is right and 0 otherwise. The classification power of the i-th link is calculated by averaging the resulted scores {a ji } j=1,2, . . . ,N , i.e., d i = (a 1i + a 2i + . . . + a Ni )/N and the classification power of the pathway is calculated by averaging the classification powers of all links, DEP = (d 1 + d 2 + . . . + d L )/L. Permutation test is utilized for evaluating p-values of the resulted classification powers. Mathematically, an MCM can be specified by three parameters, i.e., a finite set of discrete states, initial probability distribution of states (P0), and a state transition probability matrix M, as shown in Figure 2. From a viewpoint of biology, we simply specify a tri-state set {D, N, U} for nodes within an MCM. Given continuous gene expression values, we infer the regulatory states of a gene using discretization methods, for example, the recently developed biology-constrained method [22]. Along a gene link, the regulation states transit to carry signaling specific to a particular biological process. The signaling patterns beneath the gene link are encapsulated in the initial state distribution and the state transition probability matrix. We need to learn them from training data as follows.
Given a gene link l from g1 to g2 and a Ntrain-sample training set for class control/case, we denote the initial state distribution as P0(x), x = D, N, or U for the source gene g1 and the TPM as M = {mij, i, j = 1, 2, 3}, where mij represents the regulatory state transition probability from the i-th state on gene g1 to the j-th state on gene g2. Totally, there are nine state transition modes, as shown in Figure 2. The initial distribution P0(x) is estimated as the likelihood of gene g1 being in state x-in other words, the frequency of gene g1 being in state x in the training set, is    Mathematically, an MCM can be specified by three parameters, i.e., a finite set of discrete states, initial probability distribution of states (P 0 ), and a state transition probability matrix M, as shown in Figure 2. From a viewpoint of biology, we simply specify a tri-state set {D, N, U} for nodes within an MCM. Given continuous gene expression values, we infer the regulatory states of a gene using discretization methods, for example, the recently developed biology-constrained method [22]. Along a gene link, the regulation states transit to carry signaling specific to a particular biological process. The signaling patterns beneath the gene link are encapsulated in the initial state distribution and the state transition probability matrix. We need to learn them from training data as follows.
Given a gene link l from g 1 to g 2 and a N train -sample training set for class control/case, we denote the initial state distribution as P 0 (x), x = D, N, or U for the source gene g 1 and the TPM as M = {m ij , i, j = 1, 2, 3}, where m ij represents the regulatory state transition probability from the i-th state on gene g 1 to the j-th state on gene g 2 . Totally, there are nine state transition modes, as shown in Figure 2. The initial distribution P 0 (x) is estimated as the likelihood of gene g 1 being in state x-in other words, the frequency of gene g 1 being in state x in the training set, is where s g 1 ,j represents the state of gene g 1 in the j-th sample in the training set, and I(·,·) is an indicator function, yielding 1 if the two elements are equal and 0 otherwise. Similarly, we estimate m ij in M as follows: Let x and y represent the i-th and j-th states in the set {D, N, U}, respectively. With g 1 as the source gene and g 2 as the destination gene, m ij (i, j = 1, 2, 3) that represents the conditional likelihood of y occurring on x in this N train -sample training set can be estimated as where s g 1 ,i and s g 2 ,i are the states of genes g 1 and g 2 in the i-th sample in the training set.
Based on the MCM of l on the N train -sample training set above, a statistical way to infer the class likelihood of an observed chain sample from the test set is provided. Assume a test sample t, in which the two genes of the gene link l, g 1 and g 2 are in regulatory states x and y , respectively. The likelihood of the test sample t belonging to the same class as the N train -sample training set can be estimated as a joint probability, i.e., where P 0 (x ) and P(y |x ) are the initial probability of x and the conditional probability of state y (gene g 2 ) on state x (gene g 1 ) that can be obtained by Equations (1) and (2), respectively.

Scoring Gene Links for the Disparity of Signaling Dynamics
Biologically, gene links can exhibit different activity levels in different cellular states, and so a non-trivial gene link can distinguish two different phenotypes due to the disparity of signaling dynamics. Given two MCMs of a same gene link on training sets of two classes (e.g., control and case), we can calculate the likelihoods of a test sample t belonging to the two classes according to Equation (3) and subsequently, the test sample t is assigned to the class with the larger likelihood. Based on the classification rules above, we apply leave-one-out cross-validation (LOOCV) to recursively divide the whole N samples into training sets and a test set consisting only one sample. Suppose a ij N×L are the overall classification results with a ij indicating the classification of the i-th sample by the j-th link, we have a ij = 1 for correct classification and 0 otherwise. The overall classification power of the j-th gene link on all the N samples is defined as d j = (a 1j + a 2j + . . . + a Nj )/N, where N is the total number of samples.
To assess the statistical significance of the overall classification power of the j-th link, i.e., d j , a permutation test pipeline is then designed: (1) Randomly shuffling the class labels of samples; (2) applying the LOOCV procedure to the shuffled data; (3) repeating (1-2) B = 1000 times and obtaining B accuracies rd b , b = 1, 2, . . . , B; 4) Calculating the p-value for the observed d j as where rd b is the bth rd and I is an indicator function yielding 1 if true and 0 otherwise.

Identifying Differentially Expressed Pathways
Since a pathway is essentially a set of organized gene links, the activity of it can be estimated based on the activity of the gene link components. Assume a pathway that contains L gene links. Let d j represent the score of the j-th gene link, a novel statistic for measuring the pathway-level differential expression can be calculated as The statistic, DEP, reflects the disparity of signaling dynamics in the pathway between two classes. From Equation (5), it can be seen that the calculation of DEP only involves the association perturbation of two directly connected genes in the pathway, and thus avoids noise or bias from the expression of individual genes.
To estimate the significance of the DEP score of a pathway, permutation test is utilized here. Specifically, Z = 1000 random pathways are generated with the same genes and the same number of links (i.e., L) in the original pathway. Then, the p-value for DEP of the original pathway can be calculated as where rDEP i , i =1, 2, ..., Z, is the DEP value of the ith random pathways. A pipeline of the proposed approach DynSig is shown in Figure 1.

Principal Pattern of Signaling Dynamics Specific to a Cancer Type
Pathways are a directed gene network in which genetic or physiological information flows along gene chains. In a differentially expressed pathway, there may exist principal patterns of signaling dynamics specific to the cancer type. The use of MCM here allows for finding such principal patterns that consist of principal state transitions of consecutive gene links along a cascade gene chain. Considering that each link is modeled individually, we determine the principal pattern of a given cascade chain starting from the ending link that is usually key to the biological role that the chain plays. For the ending link, the principal state transition will be taken as one of the nine modes that has the maximum transition probability. Then, we determine the principal state transition of the penultimate link. Suppose that the destination gene of it is A, which is also the source gene of the ending link. The principal state transition is the mode that has the largest transition probability conditioned on the principal state of A in the ending link. Similarly, the principal state transitions of other subsequent links along the chain are determined. Finally, we obtain the principal pattern of the chain by concatenating the resulting principal state transitions of each link.

Simulation Data Generation
Assume two classes (control vs. case) of equal sizes n = 60 and 200 pathways, of which one half are DEPs that are differentially expressed between the two classes, and the other half are non-DEPs The 200 pathways follow two types of pathway structures: a cascade structure consisting of ten genes, and a complex structure mimicking a real Kyoto Encyclopedia of Genes and Genome (KEGG) pathway (cell cycle), as illustrated in Figure 3. Three regulatory states of genes are assumed, i.e., S = {D, N, U}, where D, N, and U represent downregulated, non-regulated, and upregulated, respectively. For a sample class, genes associated with it are assumed to be predominantly in one of the three regulatory states (major state), and pathways associated with it then exhibit principal state transition patterns consisting of the major states of such genes. Accordingly, each of the 100 DEPs were generated to take two different principal patterns of activity in the case and control classes respectively, while each of the 100 non-DEPs were assumed to be in a same principal pattern between the case and control classes. Let r be the probability of the dominant pattern of a pathway, the major states of the associated genes will be present at a probability of r, and the two other states are preset to occur at an equal odd of (1 − r)/2. From a viewpoint of biology, the probability (r) reflects the variability of cell systems under a particular phenotype. Gene expression values are then simulated as follows: When genes are in major states, their expression values were randomly sampled from normal distributions, N (1, 4), N (3, 4), and N (5, 4), for state D, N, and U, respectively, and when genes are in minor states, the expression values from different gamma distributions, Γ (1, 0.5), Γ (3, 0.5), and Γ (5, 0.5), for state D, N, and U, respectively. Additionally, to examine the effect of DEGs on DEP identification, we also varied the proportion (ρ) of DEGs in DEPs when generating the simulation data. In summary, we totally simulated 2 × 3 × 4 data scenarios by varying r = {50%, 60%, 70%, 80%} and ρ = {30%, 50%, 70%} for the two types of pathway structures, and generated 20 random datasets in each data scenario.

Simulation Data Study
We first applied the proposed method to analyze the simulated data.
where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. For comparison, we also analyzed the simulation data using the previous five methods, Global test [13], LRpath [24], TAPPA [15], Clipper [25], and DEGraph [17]. Global test fits the relationship between gene expression and clinical outcome of samples using a generalized linear model, and calls pathways significantly differentially expressed if any of regression coefficients is statistically non-zero [13]. LRpath relates the odds of a gene belonging to a predefined pathway with the significance of differential expression in a linear function, and then employs a Wald statistic to determine if a pathway is significantly differentially expressed. Compared with Global test and LRpath, the three methods, TAPPA, Clipper, and DEGraph, can make use of the information of pathway topology for the identification of DEPs. For example, similar to our method, TAPPA is based on gene links, but uses a pathway connection index as a differential expression statistic, and Clipper models gene expression distributions of different sample groups with different

Simulation Data Study
We first applied the proposed method to analyze the simulated data.
where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively. For comparison, we also analyzed the simulation data using the previous five methods, Global test [13], LRpath [24], TAPPA [15], Clipper [25], and DEGraph [17]. Global test fits the relationship between gene expression and clinical outcome of samples using a generalized linear model, and calls pathways significantly differentially expressed if any of regression coefficients is statistically non-zero [13]. LRpath relates the odds of a gene belonging to a predefined pathway with the significance of differential expression in a linear function, and then employs a Wald statistic to determine if a pathway is significantly differentially expressed. Compared with Global test and LRpath, the three methods, TAPPA, Clipper, and DEGraph, can make use of the information of pathway topology for the identification of DEPs. For example, similar to our method, TAPPA is based on gene links, but uses a pathway connection index as a differential expression statistic, and Clipper models gene expression distributions of different sample groups with different graph Gaussian models of a pathway graph. Comparatively, DEGraph relies on spectral analysis to extract pathway topological information for pathway analysis.  (Tables S1-S3). Figure S1 illustrates the changes of AUCs with the probability parameter r for different methods in differential scenarios of ρ. From this figure, it can be found that our method achieved the highest AUCs among these methods in all the scenarios. Even when r is lowered to 0.5, i.e., only 50% samples have genes in their major states, the AUCs of our method are still up to more than 80%, whereas small r significantly degrades the performance of the previous methods. This suggests that our method is insensitive to noise (r), but most of the previous methods not. Figure S1 also reveals that all the methods exhibit an increasing pattern of AUCs with r in different ρ scenarios, as expected. Further comparison indicates that our method and TAPPA led to very close changing patterns in all the scenarios, which should be because they both are based on gene links. At the same time, three previous methods, Global test, DEgraph and Clipper, had similar changing patterns, which should be because they all are based on the utility of gene correlation via multivariate analysis. Among all these methods, only LRpath considers neither gene correlations nor pathway topology, which should be the reason for its poorest AUC performance, as shown in Figure S1.

Applications to Real-World Expression Data
To evaluate the proposed method on real data, we collected two benchmark gene expression datasets: liver cancer dataset [26] and acute lymphocytic leukemia (ALL) dataset [27]. For the liver cancer data, all the samples are divided into two groups: One in which patients had early intrahepatic recurrence (n1 = 20, REC) and another in which patients did not (n2 = 40, NREC), and each sample consists of the expression levels of 7129 probes. The ALL dataset characterizes gene expression signatures in acute lymphocytic leukemia cells associated with known genotypic abnormalities for adult patients, and consists of n1 = 37 patients with presence of BCR/ABL gene rearrangement (BCR group) and n2 = 41 patients with absence of BCR/ABL gene rearrangement (NEG group). Each sample in the dataset consists of the expression levels of~11,556 probes. We preprocessed the two datasets as follows: The intensities of multiple probes matching a same Entrez ID were averaged as the expression values of the gene, and non-specific or noise genes were filtered out using a coefficient of variation (CV) filter [28] with a CV cutoff of 0.05. To apply MCMLink, all the genes in the two datasets were discretized into three states, downregulated (−1), non-regulated (0), and upregulated (1), using the biology-constrained discretization method [29].
For the analysis, 220 pathways were downloaded from KEGG database [30] and used as candidate pathways. Since for a given pathway, not all genes in it are present in a particular dataset, we examined the member genes of these pathways with the two datasets. By examination, 213 of the 220 pathways were found to have more than one gene link present in the liver cancer dataset, and 218 have more than one gene link present in the ALL dataset. We considered only these qualified pathways in subsequent analyses on the two datasets.

Identification of Differentially Expressed Pathways
We identified DEPs with DynSig for the two real-world datasets. For each pathway, the obtained p-value was adjusted using SLIM [31] to produce a q-value for controlling false positive rate (FPR). As a result, for the liver cancer data, 48 pathways (Table S4) were called significantly differentially expressed between the two liver cancer classes at an ad hoc q-value cutoff of 0.1, and 102 (Table S5) for the ALL dataset. For comparison, five previous methods, Global test [13], LRpath [24], Clipper [25], TAPPA [15] and DEGraph [17], were also applied to analyze the datasets. Figure S2 shows the cumulative probability distribution (CPD) curves of q-values across all the pathways by each of these methods for the two datasets. From Figure S2, it can be found that the CPD curve of our method has a larger increase of CPD, around 0.05, than those of the previous methods, irrespective of whether the data are liver cancer data or the ALL data, suggesting the superior power of our method in identifying DEPs.
For the liver cancer data, three previous methods, Global test, LRpath, and TAPPA called no differentially expressed pathways at an ad hoc q-value cutoff of 0.1, and another two previous methods, Clipper and DEGraph, only three and one pathways, respectively. By relaxing a p-value cutoff to 0.05, these previous methods still called very few pathways as being significantly differentially expressed: 9 for Global test, 8 for LRpath, 7 for TAPPA, 27 for Clipper, and 14 for DEGraph, respectively. Literature survey shows that most of DEPs identified by our methods were previously reported to be related to liver cancer, for example, p53 signaling pathway (p-value = 0.012), transcriptional misregulation in cancer (p-value = 0.003), and hepatitis B (p-value = 0), which are likely differentially expressed between recurrent and non-recurrent liver cancer. However, these pathways were not called significantly expressed by all the previous methods.
For the ALL data, the five previous methods led to disparate results: three, LRpath (7), TAPPA (9), and DEGraph (60), called relatively few DEPs at an ad hoc q-value cutoff of 0.1, while other two methods, Global test (136) and Clipper (132), called a large number of DEPs. Compared with the previous methods, DynSig called 102 DEPs at an ad hoc q-value cutoff of 0.1, which is moderate and may be more reasonable, statistically. Given the presence of the BCR/ABL chimera, pathways including BCR and/or ABL1 would be biologically affected, and be true DEPs between the two classes of ALL, BCR, and NEG. Of the total 218 pathways, nine were found to be BCR and/or ABL1-involved according to KEGG pathway annotation. Table 2 shows the identification results of the nine pathways by DynSig and the five previous methods. From this table, it can be found that TAPPA missed eight of the nine pathways and LRpath found only four. Martini et al. [25] previously reported that GSEA [12], SPIA [32] and BPA [16] called only 2, 2, 1 of the 9 pathways at a uncorrected p-value cutoff of 0.1, respectively. Compared with these results, our method recognized almost all the nine BCR/ABL1-related pathways (8), which is comparable with the three previous methods, Global test (8), Clipper (7), and DEgraph (7). Table 2. Identification results of the nine BCR/ABL1-involved pathways by our methods and previous methods for the ALL data.

Pathway
Our Method Global Test LRpath TAPPA Clipper DEGraph

Gene Links Play Significant Roles in Pathway Activity
Gene links, as a dynamic element of a pathway, may play crucial roles in pathway activity. We examined how such gene links classifies samples. For the liver cancer data, the resulting 48 significant pathways contain 9949 unique gene links present in the dataset. Figure 4A shows the p-values of these links classifying the dataset. For comparison, we randomly sampled the same number of random gene links from the total genes, and calculated their p-values of classifying the dataset based on DynSig. Figure 4A illustrates the differences between counts of p-values in bins resulted by the true links and randomly generated links on liver data. From this figure, we can clearly see that, compared with the randomly generated links, the true links had more small p-values (e.g., <0.5) and fewer large ones (e.g., >0.5), showing that gene links of the selected DEPs are more discriminative than by chance. Similar results were obtained on the ALL dataset (26,309 true gene links), as shown in Figure 4B. We also compare the cumulative probability distributions (CPD) of p-values between the true and random gene links for the two datasets, as shown in Figure 4C. When a classifier is not discriminative, the p-values will uniformly distribute between 0 and 1, and have a CPD curve along the diagonal line: y = x. From Figure 4C, we clearly see that, on both datasets, the true gene links hold CPD curves farther away from y = x than the those of the random links, especially with p-values < 0.5. These suggest the better classification power by the true gene links (p-value < 2.2×10 -16 according to a t-test). Taken together, these results demonstrate that gene links, as a dynamic element in pathway activity, tend to be discriminative between cancer and normal tissues or between different cancer subtypes. We then overlaid genes involved in the significant gene links (p-value < 0.05) onto the network map of pathways. For the ALL data, one of DEPs called by DynSig is neurotrophin signaling pathway (KEGG ID: hsa04722), which as an ABL1-involved pathway that has been proven to behave distinctly between BCR and NEG ALL patients. Biologically, neurotrophin signaling transmits positive signals like enhanced survival and growth, and interplays with a variety of intracellular signaling cascades, such as MAPK pathway. Figure 5 shows the overlaid neurotrophin signaling pathway downloaded from KEGG database. It can be seen that hub genes tend to be involved in the significant links, as expected, and that most gene links along the paths towards the biological end of cell survival are called significant to the classification of BCR-ABL and NEG tissues. Among the paths, in particular, two involving nuclear factor-κB (NF-κB) and starting from TrkA/B/C and p75NTR, are exclusively recognized, with all involved gene links significant.
Biologically, NF-κB as a family of transcription factors regulates the genes involved in inflammatory responses, proliferation and differentiation [33]. It is well known that BCR-ABL fusion results in the activation of NF-κB that can trigger tumorigenesis [34,35]. We then examined the state transitions of the link IKBA-NFκB in BCR-ABL and NEG tissues. Results reveals that the state transition with maximum probability is N → U in BCR-ABL patients, but N → N in NEG patients, suggesting that the dissociation between IKBA gene and NF-κB is activated in BCR-ABL, but not in NEG. The dissociation of IKBA from NF-κB in BCR-ABL should activate NF-κB and consequently, cell proliferation. We then overlaid genes involved in the significant gene links (p-value < 0.05) onto the network map of pathways. For the ALL data, one of DEPs called by DynSig is neurotrophin signaling pathway (KEGG ID: hsa04722), which as an ABL1-involved pathway that has been proven to behave distinctly between BCR and NEG ALL patients. Biologically, neurotrophin signaling transmits positive signals like enhanced survival and growth, and interplays with a variety of intracellular signaling cascades, such as MAPK pathway. Figure 5 shows the overlaid neurotrophin signaling pathway downloaded from KEGG database. It can be seen that hub genes tend to be involved in the significant links, as expected, and that most gene links along the paths towards the biological end of cell survival are called significant to the classification of BCR-ABL and NEG tissues. Among the paths, in particular, two involving nuclear factor-κB (NF-κB) and starting from TrkA/B/C and p75NTR, are exclusively recognized, with all involved gene links significant.
Biologically, NF-κB as a family of transcription factors regulates the genes involved in inflammatory responses, proliferation and differentiation [33]. It is well known that BCR-ABL fusion results in the activation of NF-κB that can trigger tumorigenesis [34,35]. We then examined the state transitions of the link IKBA-NFκB in BCR-ABL and NEG tissues. Results reveals that the state transition with maximum probability is N → U in BCR-ABL patients, but N → N in NEG patients, suggesting that the dissociation between IKBA gene and NF-κB is activated in BCR-ABL, but not in NEG. The dissociation of IKBA from NF-κB in BCR-ABL should activate NF-κB and consequently, cell proliferation.

Principal Patterns of Pathways Reflect Abnormality of Signaling Dynamics in Cancer
Biological information is necessary to propagate along pathway networks for the mediation and maintenance of the life of cells. Such signaling should be manifested in the state transition patterns of gene chains. We then recognized the principal patterns of pathways specific to each of the two classes for the ALL and liver datasets. For the ALL data, take neurotrophin signaling pathway as example and consider one of the two significant NF-κB-ending cascades, i.e., 4916 → 53358 → 2549 → 5291 → 207 → 4792 → 4790 (Entrez IDs), as shown in Figure 6A. Figure 6C compares the principal patterns of the cascade between the two ALL classes, BCR and NEG. From Figure 6C, we can clearly see that the two classes have two distinct state transition patterns with respective to the link cascade. Especially, in NEG, the consecutive D or N states along the path lead to the low expression of NF-κB, implying the suppression of cellular survival in NEG. We also noticed that signaling protein, GAB1 (Entrez ID: 2549), is mainly upregulated in BCR but not in NEG. This is in accordance with the experimental observation that GAB1 is tyrosine-phosphorylated in response to B cell antigen receptor engagement [36,37]. For the liver cancer data, similar results were also obtained. Figure 6B,D shows the result for one of three-link chain in differentially expressed hepatitis B pathway: 5295 → 208 → 572 → 863 (Entrez IDs), which is involved in the activation cascade of caspases responsible for apoptosis execution. From Figure  6B,D, it can be seen that the principal pattern in NREC, D → N → N → U, leads to overexpression of CASP3 (Entrez ID: 863), and finally, activates the apoptosis of cell. This is consistent with the non-recurrence of liver cancer in NREC patients.

Principal Patterns of Pathways Reflect Abnormality of Signaling Dynamics in Cancer
Biological information is necessary to propagate along pathway networks for the mediation and maintenance of the life of cells. Such signaling should be manifested in the state transition patterns of gene chains. We then recognized the principal patterns of pathways specific to each of the two classes for the ALL and liver datasets. For the ALL data, take neurotrophin signaling pathway as example and consider one of the two significant NF-κB-ending cascades, i.e., 4916 → 53358 → 2549 → 5291 → 207 → 4792 → 4790 (Entrez IDs), as shown in Figure 6A. Figure 6C compares the principal patterns of the cascade between the two ALL classes, BCR and NEG. From Figure 6C, we can clearly see that the two classes have two distinct state transition patterns with respective to the link cascade. Especially, in NEG, the consecutive D or N states along the path lead to the low expression of NF-κB, implying the suppression of cellular survival in NEG. We also noticed that signaling protein, GAB1 (Entrez ID: 2549), is mainly upregulated in BCR but not in NEG. This is in accordance with the experimental observation that GAB1 is tyrosine-phosphorylated in response to B cell antigen receptor engagement [36,37]. For the liver cancer data, similar results were also obtained. Figure 6B,D shows the result for one of three-link chain in differentially expressed hepatitis B pathway: 5295 → 208 → 572 → 863 (Entrez IDs), which is involved in the activation cascade of caspases responsible for apoptosis execution. From Figure 6B,D, it can be seen that the principal pattern in NREC, D → N → N → U, leads to overexpression of CASP3 (Entrez ID: 863), and finally, activates the apoptosis of cell. This is consistent with the non-recurrence of liver cancer in NREC patients.

Discussion
Currently, most of current methods are based on counting DEGs, in which only the static information in pathways is considered. By contrast, the proposed method extracts signaling flows along a pathway by modeling gene links. As a result, the proposed method takes advantage of pathway topology, as well as characterizes dynamics of pathways based on MCM. One of advantages of the proposed method is that it allows for detecting abnormal state transitions along gene links or pathways in cancer for the purpose of DEP identification, as demonstrated in applications to two real-world cancer datasets (Figures 5-7). For example, based on the liver cancer data, we find that a gene chain ending with CASP3, which inactivates CASP3 and suppresses apoptosis of cells, is potentially associated with the recurrence of primary liver cancer patients. These will definitely help to gain deep insight into the molecular mechanisms of cancer.
A comprehensive comparison of the classification power of gene links with those of random gene pairs confirmed the justification of DynSig in capturing the disparity of dynamics of pathways between classes ( Figure 5). In particular, DynSig accurately called eight of the nine BCR/ABL1-involved pathways for the ALL dataset outperforming previous methods (Table 1). These initial results on gene links of length two genes are encouraging, and future work will extend DynSig to gene chains of length three or more genes, that provide a more objective and comprehensive understanding of signaling dynamics along pathways.

Discussion
Currently, most of current methods are based on counting DEGs, in which only the static information in pathways is considered. By contrast, the proposed method extracts signaling flows along a pathway by modeling gene links. As a result, the proposed method takes advantage of pathway topology, as well as characterizes dynamics of pathways based on MCM. One of advantages of the proposed method is that it allows for detecting abnormal state transitions along gene links or pathways in cancer for the purpose of DEP identification, as demonstrated in applications to two real-world cancer datasets (Figures 5 and 6). For example, based on the liver cancer data, we find that a gene chain ending with CASP3, which inactivates CASP3 and suppresses apoptosis of cells, is potentially associated with the recurrence of primary liver cancer patients. These will definitely help to gain deep insight into the molecular mechanisms of cancer.
A comprehensive comparison of the classification power of gene links with those of random gene pairs confirmed the justification of DynSig in capturing the disparity of dynamics of pathways between classes ( Figure 5). In particular, DynSig accurately called eight of the nine BCR/ABL1-involved pathways for the ALL dataset outperforming previous methods (Table 1). These initial results on gene links of length two genes are encouraging, and future work will extend DynSig to gene chains of length three or more genes, that provide a more objective and comprehensive understanding of signaling dynamics along pathways.

Conclusions
We have proposed a signaling dynamics-based approach, DynSig, for identifying differential pathways from high-throughput transcriptomics data analysis. The method takes emphasis on gene links, instead of gene nodes as usual, which facilitate the use of topological information of pathways in pathway analysis. Specifically, we first decompose the pathway networks into a set of gene links and introduce MCM to characterize and capture the dynamics of pathways. Finally, a new signaling dynamics-based statistic was derived to measure the disparity of pathways between different conditions of cells. Experimental results on simulation data and two real-world datasets, liver cancer and ALL datasets, demonstrated the effectiveness and efficiency of the proposed method.
Supplementary Materials: The following are available online at http://www.mdpi.com/2073-4425/9/7/323/s1, Figure S1: Comparison of changing pattern of AUCs among our method and five previous methods on simulation data. , Figure S2: Comparison of CPD curves of q-values among our and previous methods on the liver cancer (a) and ALL (b) data, Table S1: Performance Comparison of different methods on simulation DATA with CASCADE/Complex pathway topology under r = 0.6 and ρ = 0.3, 0.5, 0.7, Table S2: Performance Comparison of different methods on simulation DATA with CASCADE/Complex pathway topology under r = 0.7 and ρ = 0.3, 0.5, 0.7, Table S3: Performance Comparison of different methods on simulation DATA with CASCADE/Complex pathway topology under r = 0.8 and ρ = 0.3, 0.5, 0.7, Table S4: Differentially expressed pathways on liver cancer data, Table S5: Differentially expressed pathways on ALL data.