Systems Biology Approaches to Investigate Genetic and Epigenetic Molecular Progression Mechanisms for Identifying Gene Expression Signatures in Papillary Thyroid Cancer

Thyroid cancer is the most common endocrine cancer. Particularly, papillary thyroid cancer (PTC) accounts for the highest proportion of thyroid cancer. Up to now, there are few researches discussing the pathogenesis and progression mechanisms of PTC from the viewpoint of systems biology approaches. In this study, first we constructed the candidate genetic and epigenetic network (GEN) consisting of candidate protein–protein interaction network (PPIN) and candidate gene regulatory network (GRN) by big database mining. Secondly, system identification and system order detection methods were applied to prune candidate GEN via next-generation sequencing (NGS) and DNA methylation profiles to obtain the real GEN. After that, we extracted core GENs from real GENs by the principal network projection (PNP) method. To investigate the pathogenic and progression mechanisms in each stage of PTC, core GEN was denoted in respect of KEGG pathways. Finally, by comparing two successive core signaling pathways of PTC, we not only shed light on the causes of PTC progression, but also identified essential biomarkers with specific gene expression signature. Moreover, based on the identified gene expression signature, we suggested potential candidate drugs to prevent the progression of PTC with querying Connectivity Map (CMap).


PTC cells
In order to obtain the real GENs in normal thyroid cells, early-stage, and late-stage PTC shown in Figures S1-S3, we used system identification and system order detection method prune false-positives of candidate GEN by their corresponding NGS data.
For the protein interactive systematic model of candidate PPIN in candidate GEN, the protein interactions of the sth protein in thyroid cells for sample n are described by the following protein interactive equations: where ys[n] denotes the expression level of the sth protein and yg[n] denotes the expression level of the gth protein for the nth sample; αsg is the interaction ability between the sth protein and the gth protein; bs represents the basal level of protein s, which is used to model the unknown interactions like acetylation, ubiquitination etc.; vs [n] is the stochastic noise of the sth protein for the sample n, which is due to data noise and model uncertainty; Ss represents the number of proteins interacting with the sth protein; and S represents number of proteins in candidate PPIN; N represents the number of data samples (patients).
For the gene regulatory systematic model of candidate GRN in candidate GENs, the transcriptional regulation of the ith gene of thyroid cells in sample n is given in the  (2) and (3)

Parameter estimation of the systematic models of candidate GENs via system identification method and system order detection
Before the obtainment of real GENs, we constructed the candidate protein-protein interaction network (PPIN) model (1) and gene, miRNA and lncRNA regulatory network models (2), (4), and (5) in candidate GRN, respectively; we use the system identification method and system order detection scheme to identify the protein interactive parameters αsg, bs in protein-protein interaction model and regulatory parameters βis, τit, δiw, ki, λps, μpt, ψpw, ep, γqs, ρpt, ζqw, fq of gene, miRNA and lncRNA regulatory models by NGS data of each thyroid to prune the false positives of candidate PPINs and GRNs to identify the real GENs of every stage of PTC. In order to identify these interactive and regulatory parameters, we can rewrite equations (1), (2), (4), (5) as the following linear regression forms: The above regression equations (6) , , , which could be simply represented as follows, respectively, databases or some experimental data, we apply system order detection method to prune false-positives protein interactive abilities, transcriptional regulatory abilities and posttranscriptional regulatory abilities from candidate GEN to obtain real GENs.
One of system order detection scheme, Akaike Information Criterion (AIC) could detect the system order (the number of regulations) of real GRN through the identification system scheme. The system order detection criteria AICs including the sth protein of protein interaction, the ith gene, the qth lncRNA and the pth miRNA are respectively shown as follows: In (26),

Applying the PNP method to extract core GENs in the real GENs
Real GENs are still very complex. It is still very difficult to investigate the carcinogenic progression mechanisms of different stages of PTC (i.e. from normal stage to early stage and from early stage to late stage). Therefore, we applied the PNP method on the real GRNs to extract the corresponding core GENs in each PTC stage. By the real GENs, the system models of interactions and regulations in PPIN and GRN can be shown in the followings: where the sub-network matrix Zpp= represent the system matrices associated with TFs transcriptional regulatory abilities, lncRNAs transcriptional regulatory abilities and miRNAs post-transcriptional regulatory abilities on miRNAs, respectively. If a protein is not interactive or regulatory with protein, gene, lncRNA and miRNA, the corresponding parameters will disappear in the pruning proceed by AIC and are padded with zero in the system network matrix H.
The Principal network projection method (PNP) was applied to the system network matrix Z to extract core GEN in real GEN of each stage PTC, and the PNP is based on the singular value decomposition method in the following: