Inference of Regulatory System for TAG Biosynthesis in Lipomyces starkeyi

Improving the bioproduction ability of efficient host microorganisms is a central aim in bioengineering. To control biosynthesis in living cells, the regulatory system of the whole biosynthetic pathway should be clearly understood. In this study, we applied our network modeling method to infer the regulatory system for triacylglyceride (TAG) biosynthesis in Lipomyces starkeyi, using factor analyses and structural equation modeling to construct a regulatory network model. By factor analysis, we classified 89 TAG biosynthesis-related genes into nine groups, which were considered different regulatory sub-systems. We constructed two different types of regulatory models. One is the regulatory model for oil productivity, and the other is the whole regulatory model for TAG biosynthesis. From the inferred oil productivity regulatory model, the well characterized genes DGA1 and ACL1 were detected as regulatory factors. Furthermore, we also found unknown feedback controls in oil productivity regulation. These regulation models suggest that the regulatory factor induction targets should be selected carefully. Within the whole regulatory model of TAG biosynthesis, some genes were detected as not related to TAG biosynthesis regulation. Using network modeling, we reveal that the regulatory system is helpful for the new era of bioengineering.


Introduction
The improvement of the bioproduction ability in microorganisms is one of the important themes in bioengineering fields. Several types of empirical breeding approaches, such as constructing random mutant strains [1][2][3] and improving key enzyme activities within the biosynthesis pathways [4,5], have been developed over the years and applied to expand the capabilities of microorganisms. Those approaches have yielded useful host strains, but their development is quite time-consuming and costly. To solve this high cost problem, system approaches combined with usual breeding methods have been applied to produce useful host strains of microorganisms. Among the several host microorganisms used in bioproduction, the oleaginous yeast Lipomyces starkeyi is quite important. This produces edible oil with a fatty acid composition similar to that of palm oils, at a reasonable cost [6][7][8][9]. Thus, improving the oil production by this oleaginous yeast is a fascinating goal, for not only bioengineering but also in industrial use as an alternative to palm oil.
in 256 experiments. Among the 256 experiments, 184 measured 8 time points from 0 to 240 h in several types of strains, 12 measured 4 time points from 48 to 192 h in a wild-type strain or low oil productivity strain, and 60 measured 3 time points from 24 to 72 h in wild-type strains or high oil productivity strains. Together, the data points were considered to clarify when the TAG biosynthesis occurs [31].
First, the expression profiles of 7799 genes were described as the log 2 -value of the raw expression signals, and then transformed into Z-scores. Subsequently, we defined the oil productivity in each condition, as follows: Here, OIL(t) and Cell(t) are the oil production data and the cell concentration data, which were measured as the phenotypic data at every time point (t) in each strain, and oil_productivity(t) was calculated as the TAG productivity per cell. From a biological viewpoint, a time-gap will occur between the defined oil_productivity(t) and the state of gene expression. The effect of gene expression at time t can be detected as the difference in oil productivity between time t + 1 and time t. In this method, since the last time point of each time series data had no oil productivity information, these data were deleted. Finally, we utilized 210 experimental data for calculations.

Gene Selection
To construct the regulatory network of the TAG biosynthesis pathways, we utilized the Kyoto Encyclopedia of Genes and Genomes Database (KEGG https://www.genome.jp/kegg/pathway.html) and JGI Genome Portal Database (https://genome.jgi.doe.gov/portal/) Databases. In the KEGG Database and by prior investigations [32], the lipid biosynthesis pathway in another oleaginous yeast, Yarrowia lipolytica, is available. We obtained the gene information related to the TAG biosynthesis pathways in Y. lipolytica and searched for the homologous genes in L. starkeyi by using the JGI Database. Finally, 88 genes were detected as components of the TAG biosynthesis pathway in L. starkeyi.

Factor Analysis
To clarify the regulatory system of the TAG biosynthesis pathway, we determined the optimal number of subgroups that were regulated by the same system, by a factor analysis. First, the 88 selected genes and defined oil productivity were described as observed variables, and factor analyses were used to find their suitable numbers of regulatory factors.
Factor analysis is a statistical method for describing the variability among observed variables in terms of a potentially lower number of latent variables [33]. The initial assumption is that any observed variables may be related to any latent variables. Let us assume that there are p latent variables and q observed variables x 1 , x 2 , . . . , x q , with means u 1 , u 2 , . . . , u q . Note that the number p of the latent variables is always smaller than the number q of the observed variables. Each observed variable is expressed as linear combinations of p latent variables, as follows: where x i is the vector of the expression levels of gene i, α i,j is the partial regression weight of the latent variable F j , and ε i is an independently distributed error term with zero mean and finite variance. The observed variables expressed by Equation (2) can be summarized in a matrix form: If there are n samples in each of the observed variables, then X and U are the (p × n) matrices composed of the observed data and their means, respectively. The partial regression coefficients of each latent variable are indicated as elements of A, the (q × p) latent interaction matrix. In matrix A, each column corresponds to a factor and each row corresponds to an observed variable, and thus each element of A indicates the strength of the regulation from each protein to each gene. The F matrix is the latent variable matrix, and E is the (q × n) error matrix.
The variance-covariance matrix between the observed variables Σ structurized by parameters is described, as follows: Here, A is the factor loading matrix of latent variables, Φ is the covariance matrix among factors, and Ψ 2 is the covariance matrix of error terms. From this structurized matrix, the values of the partial regression weight matrix A and the variances of the "errors" are estimated.
To detect the suitable number of subgroups in the TAG biosynthesis pathways, exploratory factor analysis (EFA) was performed. EFA is utilized to reveal the latent structure, by assuming that the observed data are a synthetic amount of a lower number of latent variables. In this study, EFA was executed by a principal factor method with promax rotation, which is a general method for fitting rotating factors to a hypothesized structure of latent variables. In this study, we applied EFA for dividing several subgroups of TAG biosynthesis pathway genes, so we utilized Kaiser criterion, which is one of the major criterions, as for estimating the number of factors at first. By Kaiser criterion, the number of factors is known to be overestimated. Thus, we reduced the estimated factor number one by one to confirm the suitable number of factors, which control TAG biosynthesis pathway genes. In this step, we applied a scree plot to avoid underestimating the number of factors.

Stepwise Network Modeling
The 88 genes detected by the factor analysis were divided into two types. One type is the genes classified into the same group with oil productivity, thus reflecting that those genes are controlled by the same transcriptional regulatory system with oil productivity. The other type is the genes classified into other groups from oil productivity, which means that those genes are controlled by a different transcriptional regulatory system than that of oil productivity. To clarify the whole regulatory system of TAG biosynthesis, we applied stepwise network modeling as follows: STEP 1: Initial model assumption of oil productivity group; STEP 2: Model optimization of oil productivity group; STEP 3: Definition of pseudo variables from subgroups; STEP 4: Initial model assumption among pseudo variables; STEP 5: Model optimization of pseudo variables.

Initial Model Assumption
For the SEM calculations, we had to assume the initial model in each step. In this case, only one variable was defined as an objective variable, and the star model can be applied as the initial model. In STEP 1, the oil productivity was assigned as the objective variable. The genes classified into the same group with the oil productivity were assumed to be the effect variables for the oil productivity. We arranged 15 genes as the parent nodes for the oil productivity as a child node in an initial model. The whole regulatory system for TAG biosynthesis was inferred by a pseudo variables network. Given the restrictions of the SEM calculation, it is difficult to construct the optimal network model with the selected 88 genes. We calculated the representative value of each group from the measured data of components within the group and defined one pseudo variable corresponding to one group. In the pseudo variables network, the variable including oil productivity was defined as the objective variable, and the other variables were assumed as the effect variables for the regulator of oil productivity in the initial model. Thus, 8 variables were arranged as the parent node for the one specific variable as a child node. In the initial model, the parent nodes were assumed to be independent, and the relationships between parent nodes were not identified. Thus, the initial model was expressed by Here, f is a vector of effect variables arranged as parent nodes, and g is the data of the objective variable. Since the parent nodes were assumed to be independent, the weights of the relationships between them were defined as O matrices. The matrix Γ is a vector representing the effectiveness of parent nodes to child nodes. The errors that affect the objective variables are denoted by ε.

Network Modeling
To detect the regulatory structure of TAG biosynthesis, we previously applied our developed network modeling method based on SEM calculation [34]. We applied this method to the initial models to obtain the optimized regulatory network model. In this study, all variables in the network model were defined as observed variables, and none were defined as latent variables.
In the inferred network model, the variance-covariance matrix between the arranged n variables Σ(θ) was given by Let I denote the identity matrix, Λ denote the n × n matrices of the inferred parameter matrix, and Φ denote the covariance matrix of the error terms. The real covariance matrix Σ is calculated from the observed data. Each element of the model's variance-covariance matrix Σ(θ) is expressed as a function of the parameter θ, and all parameters in Σ(θ) are calculated to minimize the difference from Σ by the maximum likelihood method: Here, |Σ| is the determinant of matrix Σ, tr(Σ) is the trace of matrix Σ, and q is the number of observed variables.
In the SEM calculation, the similarity between the constructed model and the actual relationships is predicted by comparing Σ(θ) and Σ, and the quantitative similarity can be detected as fitting scores.
To evaluate the inferred model accuracy, we utilized six different fitting scores as criteria: χ 2 values (CMIN), the goodness of fit index (GFI), the adjusted goodness of fit index (AGFI), the comparative fit index (CFI), the root mean square error of approximation (RMSEA) and the Akaike's information criterion (AIC). These criteria indicate the qualities of model adaptation to the measured data, and they have threshold values to determine whether the model is suitable. A CMIN value higher than 0.01 was considered as a well fitted model, and GFI and CFI values above 0.90 are required for a good model fit [35][36][37]. RMSEA is one of the most popular parsimony indexes and is independent of a huge sample number. In the RMSEA criteria, values below 0.05 would represent a good model fit, even though values of 0.10 or more are considered to indicate that the constructed model is far from the actual data [38]. Furthermore, we evaluated the model fitting by AIC [39], to compare the independent model and the saturation model. We used the SEM software package SPSS AMOS 27 (IBM, Armonk, NY, USA).
To infer the optimal network structure, we applied our developed iteration algorithm for model optimization [27]. Recently, we improved our iteration algorithm to escape from a local optimal solution combined with the genetic algorithm [34]. In this algorithm, we utilized the probabilities of all edges in the inferred model, which were calculated by the inverse matrix of the Fisher information matrix of parameters, to detect the significance of each edge. The detected non-significant edges were deleted from the inferred model step by step, and all parameters were re-calculated in each step. All of the edges were detected as significant, and we utilized modification index (MI) scores, calculated Bioengineering 2020, 7, 148 6 of 15 from chi-square statistic, to estimate the possible relationships between variables. The regulatory relationship with the highest MI score at each step was add to the inferred model as a new edge. We applied this iteration algorithm to the observed and latent variables at first, and then the error terms were executed.

Gene Classification by Factor Analysis
Since the TAG biosynthesis pathways are a specific system for survival under nutrient-limited conditions in L. starkeyi, the system was expected to be divided into some subgroups for its regulation; that is, some regulatory factors were expected to be expressed simultaneously in TAG biosynthesis-related genes. To reveal the regulatory subgroups in TAG biosynthesis, we performed exploratory factor analysis (EFA) and then applied confirmatory factor analysis (CFA) by the principal factor method with promax rotation. Since EFA is commonly used for identifying the number of factors with effects on the observed variables, we applied EFA first. After the number of factors was determined, we applied CFA to classify the observed variables strictly.
To detect the genes in the same regulatory system as the oil productivity, we executed factor analysis to 88 TAG biosynthesis-related genes and oil productivity, and thus the compiled expression profiles of 89 variables measured under 210 conditions were classified by their regulatory factors. In EFA, the Kaiser criterion was utilized to estimate the suitable number of factors. The Kaiser criterion asserts that the number of factors is the same as the number of eigen values of the covariance matrix that are greater than one, and 12 factors were extracted as regulatory factors of the 89 variables. Among the 89 variables, 88 have the highest factor loading values to nine factors, and the remaining three factors were considered to be ineffective for these variables. Furthermore, the scree plot of EFA indicated that the suitable number of extracted factors can be nine. The cumulative sum of the squared factor loadings for nine factors was 83.872%, and this means that nine factors were sufficient to explain the 89 variables. Thus, we executed CFA for 89 variables with nine factors, and they were well classified by the nine factors according to their highest factor loadings. The communality, which can clarify the percent of variance in each variable explained by nine factors, and the factor loading of each factor, are displayed in Table 1. Note: The highest value of factor loadings is colored in red.
The communalities of the 88 TAG biosynthesis-related genes were over 0.500, and among them, 63 genes were higher than 0.800. These high communality values reflected the fact that the expression of TAG biosynthesis genes could be explained by these nine factors. On the other hand, the communality of oil productivity was 0.382. This means that the regulatory system for oil productivity was not only TAG biosynthesis genes, and other regulatory systems are involved. By CFA, the 89 variables were well classified into nine subgroups by their factor loadings: Group 1 had 24 genes, Group 2 had 21 genes, Group 3 had 15 genes and oil productivity, Group 4 had 9 genes, Group 5 had 5 genes, Group 6 had 6 genes, Group 7 had one gene, Group 8 had 3 genes, and Group 9 had 4 genes.

Oil Productivy Network: Figures, Tables and Schemes
The confirmatory factor analysis (CFA) classification of oil productivity was included in Group 3, with 15 genes that are known to be related to the reactions for TAG biosynthesis. To infer the regulatory system for oil productivity within the group, the network modeling method was applied. First, we connected the 15 genes to oil productivity as an initial model. In the initial model, the oil productivity was affected by all 15 genes.
The entire architecture of the inferred network model among 15 genes and oil productivity is shown in Figure 1a. Although the initial model has a very simple structure, the inferred model shown in Figure 1a represents a more complicated system. This inferred model included three cyclic regulations: ACL2 -> SLC1 -> PDB1 -> tid_2139 -> ACL1 -> ALC2, ACL2 -> SLC1 -> PDB1 -> ACL2, and SLC1 -> PDB1 -> tid_2139 -> SLC1. The goodness-of-fit scores, which detect the fitting level of the inferred model, are shown in Table 2. All of the indices satisfied the requirements for deciding whether the model was suited to the measured data.   Note: the probability from χ 2 values (CMIN(P)), the goodness of fit index (GFI), the adjusted goodness of fit index (AGFI), the comparative fit index (CFI), the root mean square error of approximation (RMSEA) and the Akaike's information criterion (AIC).
To clarify the regulation of oil productivity, the strong relationships were extracted from the inferred model. The edges, which have high weight absolute values (>0.3), are displayed in Figure 1b. This model was a perfectly hierarchical structure, with no cyclic regulation. From this network, we extracted the oil productivity regulations displayed in Figure 1c. In this figure, DGA2 was reasonable as a regulator of oil productivity, since DGA2 is the gene encoding an enzyme for TAG synthesis. Interestingly, other inferred regulators of oil productivity were palmitoyl-CoA biosynthesis-related genes. In the whole network model, 46 regulatory relationships were estimated among 16 variables. Among the 46 regulatory relationships, 32 relationships were positive regulation and the remaining 14 relationships were negative. The relative strength of each association is shown as a standardized regression weight in Table 3, and all edges within the model were significant (p < 0.01).

Regulatory Network of TAG Biosynthesis
To infer the regulatory system of the TAG biosynthesis pathways, we executed our modeling method to infer the regulatory network model among the nine classified groups. By the restriction of the SEM calculation, the 89 variables should be summarized to lower numbers of variables. We classified the 89 variables into nine groups by factor analysis, and these groups were considered to reflect subgroups of the regulatory system. Thus, the inference of a regulatory network among nine groups is suitable to reveal the regulatory system of TAG biosynthesis in its entirety.
Before the application of the SEM calculation, the pseudo-data of each group should be calculated. We calculated the average values from the expression profiles of the classified genes into each group as pseudo-data. As the initial model, we assigned Group 3 as an objective variable and the other groups as effective variables, since the oil productivity variable was included in Group 3. Directed edges were connected to Group 3 from the other groups in the initial model. The inferred network model among nine groups is displayed in Figure 2a, and the goodness-of-fit scores of each index are shown in Table 4. From these scores, all indices indicated that the inferred model is suitable for the measured data. As shown in Figure 2a, 21 regulatory relationships were estimated among nine variables, and the architecture of this inferred model was hierarchical. Among the 21 regulatory relationships, 16 were positive and 5 were negative.
The regulatory system around Group 3 extracted from the inferred model is displayed in Figure 2b. Only six groups were related to Group 3, and Groups 2 and 9 were considered to have strong regulatory relationships with Group 3. The regulations from Groups 4 and 8 to Group 3 were weak and negative, and the regulation from Group 6 to Group 3 was positive but weak. The relative strength of each association is shown as a standardized regression weight in Table 5, and all edges within the model were significant (p < 0.01). 2b. Only six groups were related to Group 3, and Groups 2 and 9 were considered to have strong regulatory relationships with Group 3. The regulations from Groups 4 and 8 to Group 3 were weak and negative, and the regulation from Group 6 to Group 3 was positive but weak. The relative strength of each association is shown as a standardized regression weight in Table 5, and all edges within the model were significant (p < 0.01).

Discussion
In this study, we used factor analysis to classify TAG biosynthesis genes according to their regulatory system. Group 1 includes many genes related to mitochondria or peroxisomes. Group 2 has many primary metabolism pathway-related genes, and the members of Group 3 include some genes known as regulatory factors of TAG biosynthesis. Almost all of the genes classified into Groups 4 and 5 were related to mitochondria. Group 6 included genes related to TAG biosynthesis. The cholesterol esterase TGL1 was the sole component of Group 7, Group 8 included two triosephosphate isomerase genes, and Group 9 included two pyruvate decarboxylases and one diacylglycerol acyltransferase. From the tendencies of the group members, the groups were divided into three types: the groups that are not related to or reduced TAG biosynthesis (1, 4, and 5), the groups that are related to or induced TAG biosynthesis (2, 3, and 9), and the groups that could not be characterized by their features (6, 7, and 8).
The genes classified into Group 3 included known regulatory factors for TAG biosynthesis. The induction of acyl-CoA is important for TAG biosynthesis [16]. Among the 15 genes classified into Group 3, seven genes were related to acyl-CoA synthesis. Furthermore, among the remaining eight genes, seven genes were related to the reactions from glycerol-3-P to the Kennedy Pathway. Thus, the genes within Group 3 are considered to be reasonable for encoding regulatory factors for the TAG biosynthesis, and we inferred the regulatory network of oil productivity shown in Figure 1. In this inferred network model of oil productivity, a specific feature was detected. The inferred regulatory system for oil productivity includes some cyclic structures involved in negative regulations: ACL2 -> SLC1 -| PDB1 -> tid_2139 -> ACL1 -> ALC2, ACL2 -> SLC1 -| PDB1 -> ACL2, and SLC1 -| PDB1 -> tid_2139 -| SLC1. The regulations from SLC1 to PDB1 and tid_2139 to SLC1 were negative. SLC1 is related to the conversion from lysophosphatidic acid (LPA) to phosphatidic acid (PA) within the Kennedy pathway, and PDB1 is related to the conversion from acetyl-CoA to citrate in mitochondria. Both genes are considered to be important for TAG biosynthesis. These cyclic structures within the inferred model indicate the possible existence of unknown negative feedback in the regulation of TAG biosynthesis.
The extracted regulations of oil productivity, shown in Figure 1b, included known regulators of TAG biosynthesis. In oleaginous yeast L. starkeyi, the expression of ACL1 and DGA2 involved in the acyl-CoA synthesis pathway and the Kennedy pathway, respectively, in the mutants with greatly elevated lipid production was increased compared to that in the wild-type strain [16,31]. Furthermore, the overexpression of ACL1 or DGA2 in L. starkeyi led to an increase in oil productivity (Takaku et al.; unpublished data). That is, ACL1 and DGA2 are considered to be regulatory factors and play a vital role in TAG biosynthesis in L. starkeyi, and our inferred network model fit well with this knowledge. Even though ACC1 is related to the conversion from acetyl-CoA to malonyl-CoA, which is important for acyl-CoA synthesis, ACC1 negatively regulated oil productivity in this model. This feature means that the excessive induction of ACC1 may have a negative effect on TAG biosynthesis. Silverman et al. examined the effect of overexpression of genes involved in TAG synthesis in the oil productivity and reported that the most drastic increase in oil productivity was the overexpression of DGA2 in oleaginous yeast Y. lipolytica [40]. Liu et al. also investigated the transcriptional activity of lipogenic promoters and reported that the transcriptional activity of the ACC1 promoter was decreased in the oil accumulation phase in Y. lipolytica [41]. These reports agree with our inferred network model, and DGA2 and ACC1 may be common regulatory factors in oleaginous yeasts.
To clarify the whole regulatory system of TAG biosynthesis, we inferred network models between the groups classified by factor analysis. Since biosynthetic systems generally have many components that are related by complicated structures, revealing the regulatory system may be useful to detect the target factors for system control. The whole regulatory system in TAG biosynthesis is shown in Figure 2a, in which the system is rather complicated, but some groups were detected as having no relation to oil productivity (Group 3). The groups with no relations to Group 3 were Groups 1 and 5, which were determined to be "the groups which are not related to or reduced TAG biosynthesis" by the factor analysis classification. The genes within the groups were related to TAG biosynthesis from a chemical reaction viewpoint, but not related to TAG biosynthesis from a regulation system viewpoint. The regulatory network model for Group 3 is shown in Figure 2b. In this figure, Groups 2 and 9 were detected as strong positive regulators for TAG biosynthesis and were determined to be "the groups that are related to or induced TAG biosynthesis" by the factor analysis. Interestingly, Group 2 has strong negative regulation to Group 8, which includes the genes related to the conversion between G3P and DHAP. DHAP is the starting compound of glycerol-3-P to the Kennedy pathway, and the conversion from DHAP to G3P is not effective for TAG biosynthesis. Actually, Group 8 has a weak but negative regulation of Group 3 in the inferred network model. Thus, the negative regulation from Group 2 to Group 8 is reasonable for Group 2, as a positive regulator of TAG biosynthesis.

Conclusions
The inferred models in this study effectively reflected the known regulatory relationships for TAG biosynthesis. Furthermore, our inferred network model detected some special features in the regulatory viewpoints, such as a negative feedback system for oil productivity and non-related genes for TAG biosynthesis regulation. This computational modeling method will help us to reveal the mechanisms underlying measured biological data.