Genome-Scale Metabolic Modeling with Protein Expressions of Normal and Cancerous Colorectal Tissues for Oncogene Inference

Although cancer has historically been regarded as a cell proliferation disorder, it has recently been considered a metabolic disease. The first discovery of metabolic alterations in cancer cells refers to Otto Warburg’s observations. Cancer metabolism results in alterations in metabolic fluxes that are evident in cancer cells compared with most normal tissue cells. This study applied protein expressions of normal and cancer cells to reconstruct two tissue-specific genome-scale metabolic models. Both models were employed in a tri-level optimization framework to infer oncogenes. Moreover, this study also introduced enzyme pseudo-coding numbers in the gene association expression to avoid performing posterior decision-making that is necessary for the reaction-based method. Colorectal cancer (CRC) was the topic of this case study, and 20 top-ranked oncogenes were determined. Notably, these dysregulated genes were involved in various metabolic subsystems and compartments. We found that the average similarity ratio for each dysregulation is higher than 98%, and the extent of similarity for flux changes is higher than 93%. On the basis of surveys of PubMed and GeneCards, these oncogenes were also investigated in various carcinomas and diseases. Most dysregulated genes connect to catalase that acts as a hub and connects protein signaling pathways, such as those involving TP53, mTOR, AKT1, MAPK1, EGFR, MYC, CDK8, and RAS family.


Introduction
Recent developments in omics data, such as genomics [1], transcriptomics [2], proteomics [3], metabolomics [4], and fluxomics [5], can be observed by consulting numerous biological databases. Such advancement in high-throughput data acquisition has shifted the focus from data generation to processing and understanding how to integrate the collected information. Because the amount of available omics data is increasing rapidly, advanced computational methods are required to mine the relevant knowledge from these data sources to boost the development of genome-scale metabolic models (GSMMs) of several organisms, including humans. Metabolism is the primary biological mechanism that is linked from genotype to phenotype; it can help us understand cell physiology and certain disease phenotypes caused by metabolic dysregulation [6,7]. GSMMs composed of metabolites and reactions allow the representation of the full set of metabolic processes within a cell to be curated using knowledge of cellular functions from the literature. GSMMs combined with constraint-based approaches lead to the development of methods to simulate cell behavior, such as flux balance analysis (FBA) [8].
involved in the high confidence group ( Figure 1B). Recon 2.2, a human general metabolic network, and data on the dependence of tissue-specific reactions were provided as the input information for the CORDA algorithm to construct a tissue-specific metabolic network ( Figure 1C), which was saved in Systems Biology Makeup Language (SBML) format ( Figure 1D) for model exchange. Both cancer (CA) and healthy (HT) models were built by using the CORDA algorithm and merged into a basal (BL) model. Each tissue-specific metabolic network comprised thousands reactions and species, rendering it difficult to modify and analyze manually. We developed a systems biology program (SBP) platform that supported the SBML file to perform the simulation and analysis of metabolic networks in the general algebraic modeling system (GAMS) environment. The SBP platform can transfer a metabolic network in SBML format to its GAMS model automatically ( Figure 1E). The BL model in Figure 1F was used to investigate how healthy cells undergo metabolic reprogramming to become cancer cells. Finally, we can employ the BL model to compute network topology and conduct a flux variability analysis ( Figure 1G).

Flux Balance Analysis
FBA is a constraint-based modeling approach where the stoichiometry of the underlying biochemical network constrains the solution. The stoichiometric matrix of a typical metabolic system is underdetermined, and infinite solutions are possible. The FBA formulates a metabolic network as a linear programming problem, as shown in Equation (1), wherein the solution of the underdetermined system is a member of the solution space and optimizes an objective function of choice, such as maximal growth and maximal ATP synthesis rate.
where v f /b is the forward/backward flux vector of reversible reactions; N is an m × n stoichiometric matrix where m is the number of metabolites and n is the number of reactions; v LB f /b,i and v UB f /b,i are the positive lower and upper bounds of the i th backward/forward flux, respectively; Ω is the set of forward and backward fluxes. The coefficients (w ATP and w biomass ) are the weighting factors for computing the linear objective function. The objective of FBA is to maximize the biomass formation rate (v biomass ) for the cancer model. However, normal cells may have different objectives depending on growth signals [28]; thus, the maximum ATP synthesis rate (v ATP ) is applied in this situation.
FBA assumes that metabolic networks will reach a steady state constrained by the stoichiometry. Even though the stoichiometric constraints lead to an underdetermined system, the bounded solution space of all feasible fluxes can be identified, that is, a large set of alternative flux distributions with an identical objective value. We minimized the squared sum of all internal fluxes for FBA to ensure efficient channeling of all fluxes through all pathways to eliminate numerous flux distributions in Equation (1). The Euclidean norm problem for minimization is expressed as follows: where obj * is the optimal objective value obtained from Equation (1), and Ω Int is the set of intracellular reactions. The problem expressed in Equation (2) is a quadratic programming problem that has a unique solution.

Oncogene Inference Problem
We introduced a tri-level optimization problem (TLOP) to simulate gene screening procedures in a wet lab to infer oncogenes. The flowchart of the in silico experiment is presented in Figure 2. GSMMs of cancerous and normal cells were reconstructed, as shown in Figure 2A, the procedures of which are discussed in Figure 1. Both models were then applied to compute the flux distribution patterns at the cancer and normal situations ( Figure 2B). The flux template, which acted as a control, was built according to the flux distributions of cancer and normal models ( Figure 2C). Flux alterations were computed from each mutant ( Figure 2D) and then compared with the control ( Figure 2E). The template was incorporated with the TLOP to infer oncogenes, as shown in Figure 2E-I. The outer optimization problem was employed to decide which genes were modulated ( Figure 2F). The mutated genes were provided for the inner optimization problem to compute the flux distribution pattern to evaluate the flux alteration ( Figure 2D-G). The TLOP framework can be formulated by using the procedures from Figure 2C-I, and is expressed as follows: Outer optimization problem: where equal is the fuzzy equal objective function that represent the fuzzy goals. For example, the LFC MUBL m and LFC CABL m should be restored to a state that is as close as possible; N BL is the stoichiometric matrix of basal models; the integer vector z is used to determine mutated enzymes; δ is the regulated strength parameter for the mutants; and Ω MU is the set of mutated reactions.
The hierarchical multiple objectives are considered in the outer optimization problem in Equation (3). The first priority is to employ the fuzzy equal measure to determine that the mutant logarithmic flux changes (LFC MUBL m and LFC MUHT m ) are as close to the templates (LFC CABL m and LFC CAHT m ) as possible. The logarithmic fold change (LFC m ) between the synthesis rates or fluxes of the m th metabolite in cancerous or dysregulated (denoted as deficient) and basal or healthy (denoted as normal) states is computed as follows: where the overall synthesis rate (r m ) of the m th intracellular compound in deficient and normal states is evaluated as follows: Here Ω c is the set of metabolites located in different compartments, and Ω m is the set of metabolites. The expression enclosed by brackets in Equation (5) indicates the synthesis rate of the i th metabolite that summed the influxes of the forward reactions and backward reactions. Each intracellular metabolite may exist in different compartments of the metabolic network, e.g., nine compartments in Recon 2.2, so that the overall synthesis rate of the m th compound is computed by Equation (5).
The second and third objectives are determining that the similarity ratios of the flux reprogramming are maximized. The similarity ratios (SR T,BL/HT and SR T rxn ,BL/HT ) of the flux-sum synthesis for metabolites and ratios of fluxes for all reactions are evaluated as follows: where the similarity indicator (µ T/T rxn m ) for the m th metabolite is defined as follows: where the logarithmic fold change (LFC T/T rxn m ) of the m th metabolite for the templates are provided in advance. The tolerances for increase or decrease are defined as tol + = log 2 (1 + ε) and tol − = log 2 (1 − ε), respectively, and ε is the percentage of flux alteration. A numerical example is provided (Supplementary Materials, Figure S1) to illustrate the computation of flux template, similarity ratio, and logarithmic fold change.

Association of Gene-Protein-Reaction
The TLOP framework must decide which genes are selected for mutation. Most optimal strain design methods [29][30][31][32][33] involve directly applying the reactions of the stoichiometric matrix to determine the optimal modulations. Posterior decision -making must be conducted to determine the corresponding gene that each optimal reaction integrates with the gene association. Such decision-making is tedious and incapable of identifying the reactions that are regulated by isozymes or redundant genes. A regulatory FBA framework [34] has been used to apply gene association incorporated with the stoichiometric matrix to formulate a connective matrix, wherein logical operations "AND" and "OR" were applied to build a gene-protein-reaction (GPR) expression, as shown in Figure 3A, for conceptual description. Nonetheless, such a GPR model is still unable to discriminate redundant genes, and the regulatory FBA becomes a discrete nonlinear optimization problem. This study introduced a strategy for enzyme pseudo-coding numbers to separate the original gene association into two parts, as shown in Figure 3B. The first part can identify the reductant pseudo-enzymes and isozymes, such that the reduced association catalyzes the reactions. The second part represents single gene or complex gene corresponding to an enzyme pseudo-coding number. is a redundant enzyme that is identical to E 1 and catalyzed the same reactions. r 2 is catalyzed by the isozymes (E 1 and E 3 ). E 1 is encoded by G 1 , and E 3 is encoded by a complex of G 3 and G 4 .
The pseudo-enzymes are then used as upper-level variables in Equation (3) to select modulated genes and the value for the regulated bounds is computed using the following equations: where Ω IZ is the set of reactions regulated by isozymes. A bound for down-regulation or knockout is assigned to its original region if the reaction is catalyzed by isozymes.

Nested Hybrid Differential Evolution Algorithm
The candidate genes in Equation (3) are represented by integer variables such that it is a mixed-integer optimization problem that is NP-hard [35]. Classical algorithms for solving bi-level optimization problems have applied duality theory to convert the inner level optimization problem into constraints in the outer-level problem. However, the duality transformation is difficult for multilevel optimization problems, such as TLOP in this study. In this study, we extended the NHDE algorithm to solve the TLOP [36]. The computational concept of NHDE (Supplementary Materials, Figure S2) is based on a hybrid differential evolution (HDE), which was extended from the original differential evolution (DE) algorithm. The basic operations of the NHDE algorithm are similar to those of the DE and HDE algorithms, except for coding representation, selection, and evaluation operations. The NHDE algorithm has been applied to solve metabolic engineering [33] and biomedical problems [36,37]. In the outer optimization problem, the NHDE algorithm is used to determine which pseudo-enzymes are selected to be modulated, and the inner optimization problems (FBA and UFD) are then solved using a linear and quadratic optimization solver. An optimal solution for each candidate individual is achieved when the UFD problem is convergent, and the set of these individual solutions comprises a feasible solution to TLOP.
The TLOP framework in Equation (3) considers the hierarchical multiple objectives. We introduced the equal sum and minimum decision method to evaluate the fitness of individuals in NHDE. The evaluation procedures regarding the fitness η for individuals are expressed as follows: where η CABL and η CAHT denote the fitness of CA model to BL and HT models, respectively, and they are evaluated using the hierarchical objectives as follows: The η CAHT is calculated using a similar procedure. The first priority of the objectives is to employ the fuzzy equal operation to determine that the mutant flux change ratios are as close to the templates as possible. The membership grades, η MUBL and η MUHT , for each fuzzy equal of logarithmic fold change are described in Supplementary Materials ( Figure S3).

Templates of Flux Patterns for Cancer and Normal Cells
The metabolic model of Recon 2.2 consists of 5324 species, 7785 reactions, and 1675 associated genes ( Figure 1A). Until date, Recon 2.2 has not been employed to reconstruct a tissue-specific GSMN model. Overall, 12865 gene expression profiles for normal and tumorous human tissues were acquired from HPA in the first step of the proposed procedures ( Figure 1A). Based on the protein expressions and literature survey ( Figure 1B), 1365 (including 74 reactions from literature) and 721 high confidence reactions for normal and cancerous colorectal tissues, respectively, were retrieved. The CORDA algorithm was applied to reconstruct the HT and CA colorectal models. The HT model comprised 2591 species, 4034 reactions, and 1483 genes, and the CA model comprised 2404 species, 3725 reactions, and 1454 genes. Figure 4 presents the intersections of species, reactions, and genes for both models. Both models were merged into the BL model to investigate how the healthy cell could be metabolically reprogrammed to become cancerous. Thus, the BL model is the union set of HT and CA models and includes 2692 species, 4284 reactions, and 1539 genes. These three models were engaged in the FBA and UFD problems to compute flux-sum distributions. The templates of flux-sum alterations for CA to BL and HT models were computed using the LFC m defined in Equation (4).
Total 565 choke-point metabolites (metabolites involve in only one synthesis or degradation reaction) in the normal and cancerous metabolic networks were determined by topology analysis. Due to that different uptakes can lead to a bias of the computational results, same uptakes (Supplementary Materials, Table S1) for normal and cancerous metabolic networks were used in this study.

Inferred Oncogenes
The NHDE algorithm was applied to solve the TLOP, and the top 20 one-hit oncogenes were determined; their average similarity ratios are presented in Table 1. These genes participated in various metabolic pathways. The average similarity ratio (Ave. SR) for each mutant was greater than 98%, and the extent of similarity for flux change ratio (Ave. CR) was greater than 93%. The inferred results revealed that these mutants have a similar flux pattern to that of cancer cells. Surveys of PubMed and GeneCards indicated the inferred oncogenes are related to various carcinomas and diseases, as shown in Table 1, and among the results, seven oncogenes were observed in CRC. From the survey of GeneCards, we obtained 1360 and 852 genes associated with FAP and HNPCC, respectively, and there are 117 FAP-associated and 49 HNPCC-associated enzyme genes included in the reconstructed colorectal model (Supplementary Materials, Table S2). Three enzyme genes (CAT, G6PD, and CDO1) obtained from the NHDE algorithm are FAP-associated genes. Therefore, the results reveal that the inference optimization framework is a useful computer-aided oncogene detection system.
We also applied RNA-sequencing data of the healthy and cancerous colorectal tissues obtained from the TCGA database [24] on the analysis of the 20 inferred oncogenes. The RNA-sequencing data consist of 41 healthy samples and 478 cancer samples. A two-tailed t-test was used to analyze the RNA-sequencing data, and a p-value less than 0.05 (typically ≤ 0.05) is statistically significant. The result shows that 17 out of the 20 oncogenes are significant as shown in Table 1. The RNA-sequencing data obtained from the TCGA database can also be employed to build GSMNs for oncogene inference in the TLOP problem. The reconstructed GSMNs were used to evaluate the similarity ratio and the extent of similarity of the mutant flux patterns for the 20 oncogenes (Supplementary Materials, Table S3). We found that 19 out of the 20 oncogenes have high similarity level as those in Table 1. Moreover, GSMNs reconstruted by the CORDA and iMAT algorithms based on Recon 2.02 and Recon 3D general models were used to validate the 20 oncogenes. The computation results are shown in Supplementary Materials (Table S3) for comparison and reveals that these 20 oncogenes could also have high similarity ratios in different GSMNs. Non-small cell lung carcinoma [67]. ¶ Average similarity for flux change ratio; § Average similarity ratio of the mutant flux pattern to the templat; † Disease is obtained from GeneCards database and score is accessed from GeneCards; ‡ Brief description of gene function and references from PubMed and cancer databases.
Catalase, encoded by CAT, is an essential antioxidant enzyme that catalyzes the decomposition of hydrogen peroxide (H 2 O 2 ) to water and molecular oxygen. CAT catalyzes three decomposition reactions of H 2 O 2 in the cytoplasm, peroxisome, and mitochondria of the GSMN model. Based on the computation results, we determined that a 78% downregulation of CAT could increase the production of reactive oxygen species (ROS) that exhibits carcinogenic effects [38,39]. We used the STRING database [68] to investigate how CAT is related to apoptosis, proliferation, and cell growth signaling pathways ( Figure 5A). We observed that CAT was strongly related to the TP53, mTOR, AKT1, MAPK1, EGFR, MYC, CDK8, and RAS family; these are well-known carcinogenic genes. Notably, CDK8 is a verified oncogene that induces aberrant activation of the canonical WNT/β-catenin pathway occurring in almost all CRCs and contributs to cell growth, invasion, and survival [69]. We determined that CAT was also linked to CDK8 through TP53 and MYC. Finally, the 20 inferred genes in Table 1 were employed to investigate their protein-protein interactions (PPIs), as shown in Figure 5B. CAT showed all characteristics of a housekeeping gene, in that it acts as a hub to which most dysregulated genes are linked and then connected to the protein signaling pathways [40]. We determined from the STRING survey that CYBRD1, CDO1, and LIPC were disjointed in the PPI. The enzymes connected to CAT primarily participate in the glycolysis, pentose phosphate, amino acid, and lipid pathways. For example, G6PD, H6PD, G6PC3, GPI, GRHPR, and SLC37A4 are associated with the glycolysis pathway. G6PD is a cytoplasmic enzyme that catalyzes the first step of the pentose phosphate pathway, which plays a key role in producing NADPH and the ribose required to synthesize DNA. G6PD deficiency is among the genetic factors hypothesized to protect against colorectal carcinogenesis [48].

Performance of Enzyme Pseudo-Coding
Furthermore, we applied a reaction-based approach to determine oncogenes in order to illustrate the performance of the proposed pseudo-enzyme strategy. Six top-ranked one-hit reactions are presented in Table 2. The reaction PGI is catalyzed by GPI, and the reaction r0161 is regulated by AGXT, such results are identical to that in Table 1. On the basis of the gene association of Recon 2.2, we determined that the gene RPIA catalyzes two reactions (r0249 and RPI). However, the reaction-based approach only determined r0249 with average similarity ratio of 0.981 and extent of flux changes of 0.935. The proposed pseudo-enzyme strategy was then applied to regulate both reactions. An average similarity ratio of 0.955 and average flux change ratio of 0.771 were obtained, which were smaller than that obtained using the reaction-based approach. Following the posterior procedures, we observed that the gene HMGCL could obtain an identical result through the reaction-based approach, but it also catalyzed another reaction HMGLx and included the isozyme HMGCLL1. The gene PRODH2 achieved the same result but catalyzed four reactions. Finally, CAT regulated three reactions (CATp, CATm, and r0010), but each reaction achieved different results.  [38][39][40]. § Average similarity for flux change ratio; † Average similarity ratio of the mutant flux pattern to the template; ‡ Brief description of gene function and references from PubMed and cancer databases.

Flux Variability Analysis
To avoid the solution bias of FBA used in the TLOP problem for evaluation of flux alterations, flux variability analysis (FVA) was applied to compute the minimum and maximum values of each metabolite for the normal model and the mutants. The flux ranges were compared between mutant and normal models, and each flux difference between mutant and normal models was classified into seven categories according to the definition presented in Supplementary Materials ( Figure S4). The number of metabolites in different categories for the template and all mutants are shown in Figure 6. The result shows that about 71% of the synthesis rates of metabolites are complete decrease and about 5.8% complete increase. Such complete flux variances for all mutants are about 97% similar to the template and the average similarity ratios are about 82% as shown in Supplementary Materials ( Figure S5). Furthermore, the choke-point metabolites in the normal and cancerous models were determined by topology analysis, and total 565 choke-point metabolites were determined. The flux variance patterns of the metabolites with gene association for the template and mutants are shown in Figure 7. Such choke-point metabolites can be used as essential candidates for discovering biomarkers, because the single flux of synthesis or degradation reaction for each choke-point metabolite is corresponding to the gene expression measurement. We observed that major deregulations occurred in lipids, nucleotides, carboxylic acids, organic acids, organic oxygen compounds, and organic heterocyclic compounds. Lipids consist of fatty acyls, glycerophospholipids, prenol lipids, sphingolipids, cholestane steroids, and steroidal glycosides. The fluxes of six choke-point metabolites (cardiolipin, 4beta-methylzymosterol-4alpha-carboxylic acid, 14-demethyllanosterol, (S)-2,3-epoxysqualene, desmosterol, and 7-dehydrodesmosterol) among lipids are complete increase, and the others are decrease.

Conclusions
This study introduced a tri-level optimization framework that incorporated protein expressions of normal and tumor tissues for inferring oncogenes. The protein expression data or RAN-sequencing data, input of the optimization framework, was acquired from public databases such as HPA or TCGA. To enable provision of personalized medical treatment, the input of the optimization framework can be replaced with a patient's protein expressio data. The proposed TLOP is an NP-hard problem, and no commercial software programs could be used to solve the problem. The NHDE algorithm with hierarchical objectives was applied to solve the problem. Moreover, this study introduced enzyme pseudo-coding numbers for expressing gene association to avoid the performing posterior decision-making that is necessary for the reaction-based method. CRC was applied as the case study, and the NHDE algorithm inferred 20 top-ranked oncogenes that cause various carcinomas and diseases based on surveys of PubMed and GeneCards. Notably, the evidence illustrated that the inference optimization framework enables oncogene prediction.

Supplementary Materials:
The following are available online at http://www.mdpi.com/2218-1989/10/1/16/s1, Figure S1: A numerical example to illustrate the computation of template, similarity ratio, and LFC, Figure S2: Flowchart of the NHDE algorithm for solving the TLOP, Figure S3: Definition of triangular membership function of fitness, Figure S4: Categories of flux variance between the normal and mutant model, Figure S5: Average similarity ratios computed by FVA, Table S1: Nutrients from RPMI-1640 and DMEM medium, Table S2: Genes associated to FAP and HNPCC from GeneCards database, Table S3: Average change ratios (CR) and similarity ratios (SR) of oncogenes in different GSMNs.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: