Fused Graphical Lasso Recovers Flowering Time Mutation Genes in Arabidopsis thaliana

: Conventional breeding approaches that focus on yield under highly favorable nutrient conditions have resulted in reduced genetic and trait diversity in crops. Under the growing threat from climate change, the mining of novel genes in more resilient varieties can help dramatically improve trait improvement efforts. In this work, we propose the use of the joint graphical lasso for discovering genes responsible for desired phenotypic traits. We prove its efﬁciency by using gene expression data for wild type and delayed ﬂowering mutants for the model plant. Arabidopsis thaliana shows that it recovers the mutation causing genes LNK1 and LNK2 . Some novel interactions of these genes were also predicted. Observing the network level changes between two phenotypes can also help develop meaningful biological hypotheses regarding the novel functions of these genes. Now that this data analysis strategy has been validated in a model plant, it can be extended to crop plants to help identify the key genes for beneﬁcial traits for crop improvement.


Introduction
The changing world climate continues to threaten our ability to keep global food production in line with the growing population [1]. The manifestation of climate change in terms of extreme temperatures, altered rainfall patterns and changing geographical distribution of pests and pathogens will render it difficult to maintain crop yield and quality [1][2][3][4]. The recent 2021 winter storm in Texas caused at least USD 230 million losses to citrus crops and USD 150 million losses to vegetable crops in the state, according to an estimate by Texas A&M Agrilife Extension Service [5]. In addition, latest climate models predict that, for the remainder of the 21st century, the state will face dry conditions comparable to or even exceeding the driest centuries of the last 1000 years [6]. The decreasing water supply against rapidly an increasing state population will pose a major challenge to the agricultural industry.
In the last few centuries, plant breeders successfully used crossing and selection to improve the agronomic character of cultivated crops, such as wheat, maize, rice, barley and others, resulting in dramatic increases in food production [7]. However, this selective breeding among relatively few high yield varieties has resulted in a significant reduction in genetic diversity and trait diversity in major crops [7,8], contributing to the loss of certain genes that are responsible for efficiency or adaptation to stress(es) [8,9]. During the course of evolution, nature has evolved new genes and shuffled and selected these genes in a wide range of environments to produce the diversity evidenced in wild species [8]. In order to enhance the rate of genetic gain [10] and to develop climate resilient varieties, it is important to identify these genes in crop wild relatives (CWR) and carry out their controlled incorporation into novel germplasm [8]. With the recent technological innovations including the development of rapid and cheap sequencing technologies, genomics-assisted breeding is considered to have the greatest potential for overcoming these challenges [2,7]. The discovery of these genes can now be paired with genome editing techniques such as CRISPR/Cas9 to rapidly develop climate change resilient crops, including plants with better biotic and abiotic stress tolerance and enhanced nutritional value [7]. For developing varieties with high yield under changing climate conditions, the following traits have been suggested by biologists for further investigation [11]: (a) increased root/shoot ratio, (b) vernalization periods, (c) maturity, (d) regulation of node formation/internode distance, (e) harvest index variations and (f) allelopathy. This work focuses on the vernalization or flowering related pathways, which are sensitive to temperature, water and photoperiod changes [12][13][14].

Flowering in Plants
Plants control the transition to flowering such that seed production takes place under optimal conditions [15]. By perceiving and processing different environmental and internal signals, the plant makes a binary decision to flower or not to flower [16]. For example, experiments in perilla (mint) and tobacco [17,18] showed that once the leaves detect an increase in day length, which is a cue for the onset of spring and summer, they transmit a signal to the shoot apex to initiate flowering. The decision to flower also depends on favorable internal signals such as plant age and stage of development. Experiments with maize plants suggest that the meristem commits to form flowers only in the presence of at least four to six leaves [18,19]. Similarly, the leaves of the impatiens send a continuous inductive signal to keep the plant from reverting back to vegetative stage [16,20].
Arabidopsis is a preferred model plant for understanding the intricate and complex relationships between different gene pathways contributing to flowering time control [21]. It is one of the most well studied plant species with a vast amount of data available for verification of predicted gene interactions. Additionally, many of the flowering related responses are found to be phylogenetically well conserved among different plant species. Even with rapid growth in plant genomics over the last few decades, the exact functions of most genes remain unknown, even in model plant species [8].
We use circadian RNAseq data from two Arabidopsis thaliana phenotypes: The wild type and delayed flowering time lnk1/lnk2 mutant from [22]. Night light-inducible and clock regulated genes (LNK) control daily and seasonal rhythmic patterns in Arabidopsis by linking photoreceptor genes to the plant endogenous circadian clock. Unlike most light responsive genes, which respond most strongly to light during the subjective day, LNK genes were identified by their strong response to light in the middle of the night. While genes responding strongly to a light pulse in the middle of the day are expected to be involved in maximizing processes such as photosynthesis, those responding strongly to a pulse in the middle of the night are expected to be involved in sensing day break and tracking day length [22]. Further gene knockout experiments in that study confirmed that the lnk1/lnk2 mutant has delayed flowering time under long day conditions [22]. In this paper, we prove the efficacy of fused graphical lasso data analysis approach in identifying LNK1/2 genes as causal genes for substantial flowering phenology (the seasonal timing of reproduction and other life history events) changes using the publicly available gene expression data collected in that study. We also predict novel interactions of these genes and network level changes in wild type and mutant subnetworks for developing meaningful biological hypotheses.

Dataset
The interactions between light and the circadian clock are responsible for the control of several biological activities in plants including flowering time and seedling photomorphogenesis [23][24][25][26]. We use the previously published circadian time series gene expression data from wild type (WT) and light genes lnk1/lnk2 mutant Arabidopsis phenotypes [22]. We test the efficacy of a fused graphical lasso in learning network level changes and identify the trait related genes LNK1 and LNK2 responsible for differences between the two phenotypes.
RNAseq data from the NCBI SRA project SRP018266 (https://www.ncbi.nlm.nih.gov/ bioproject/PRJNA188075, accessed on 18 July 2020) deposited by authors in [22] were used for this analysis. The dataset contains RNA sequencing data for both wild-type (WT) and lnk1/lnk2 mutants grown under long day conditions (16 h light/8 h darkness), where the samples had been harvested every 4 h over a 24 h period beginning 2 h after the onset of the light period. The mapped and processed raw gene counts were obtained using RNAseq-er API [27] from the European Bioinformatics Institute (https://www.ebi.ac.uk/fg/rnaseq/ api/tsv/90/getRunsByStudy/SRP018266 accessed on 18 July 2020). The sample annotation file was constructed using sample information from the NCBI SRA database.

Sample Selection Based on Phenotypic Observations
During germination of the Arabidopsis seed, a radicle first emerges from the plant embryo that penetrates down into the soil and develops into the root. This is followed by the emergence of the hypocotyl, which develops into the stem. One major phenotypic difference between the wild type (WT) and lnk1/lnk2 mutant is the elongation of the hypocotyl, which is the embryonic stem. The lnk1/lnk2 double mutant was found to have significantly longer hypocotyls than WT seedlings under continuous white light conditions [22] and the phenotype was stronger under red light than under blue light. In addition, the lnk1/lnk2 mutant was observed to flower later than WT plants during longer days (16 h light/8 h darkness), with no difference in flowering time during shorter days (8 h light/16 h darkness). Thus, the genes responsible for the trait are expected to track seasonal changes in day length. Since a longer day also means an earlier sunrise and the red/far-red light photoreceptors called phytochromes, which are known to be responsible for stronger hypocotyl length, rhythmically express maximum levels during the day, the differential expression of trait genes is expected to be at the maximum in the early morning.
Since the goal is to investigate light and circadian related gene networks, we focus on data obtained during the subjective day. This is also supported by the fact that most clock genes downregulated in the mutant are found to peak at Zeitgeber time 10 (ZT10) in the wild type [22]. A zeitgeber is any external signal that alters the circadian rhythm. When light is considered as a zeitgeber, zeitgeber time is defined relative to dawn [28] and thus ZT10 would be 10 h after dawn. Taking these phenotypic observations into consideration, we select a subset of circadian time series data from ZT2-ZT10 for further analysis, spanning a total of 8 daytime hours.

Data Processing and Filtering
Three replicates were used for each Zeitgeber time for both mutant and WT, resulting in a total of 18 sample points. Genes with less than 10 mapped gene reads were filtered, leaving 23,380 genes for analysis. In order to account for non-idealities frequently present in RNAseq data, the variance of gene expression was stabilized using the DESeq2 [29] package with the terms 'genotype + time + genotype:time' using all sample points (ZT2-ZT22) before selecting 18 samples corresponding to ZT2-ZT10 for further analysis. A quick two sample t-test from package VIPER [30] found 1773 differentially expressed genes at significance level of 1 × 10 −2 .

Gene Selection for Fused Graphical Lasso
In order to create a list of light responsive and circadian genes, the gene ontology (GO-Slim) terms containing 'light' or 'circadian' were collected from the TAIR database [31]. Genes corresponding to these terms were then obtained using BiomaRt from Ensembl, which is the plants website [32]. Out of these, 918 were found among 23,380 filtered genes with at least 10 mapped RNAseq reads and 84 were found to be differentially expressed between the two phenotypes.

Fused Graphical Lasso Gene Network Inference
Over the past decade many approaches for modeling gene networks have been proposed for studying drought pathways [33] and pathogen resistance in crops [34], aiding the discovery of new biological gene interactions and studying cancer drug design [35][36][37][38]. These approaches include Boolean networks, Bayesian networks, probabilistic Boolean networks and Gaussian graphical models [39][40][41][42][43][44]. Out of these, Gaussian network learning based approaches have been used extensively in the past due to their scalability and wide application relative to microarray based data. However, their applicability to RNAseq data has not been explored. In order to learn the network level changes between wild type and the delayed flowering mutant, we briefly discuss a network learning approach based on Gaussian graphical models with fused lasso penalty. Consider the n × m data matrix x, where n is the number of genes and m is the total number of samples and so each observation is an n dimensional vector x i ∈ R n . In a Gaussian graphical model, these are assumed to be independently and identically normally distributed x i ∼ N (µ, Σ), where µ is an n dimensional mean vector and Σ is a positive definite n × n covariance matrix [45].
The gene connectivity is described by the precision matrix Θ = Σ −1 where genes which are conditionally independent (zero entries) and are assumed to be disconnected. The goal of network learning is to estimate this matrix Θ. In the ideal case where we have large enough number of samples for each node, we can estimate Θ by maximizing the following log likelihood with respect to Θ [45]: where S is the empirical covariance matrix. This yields S −1 as the estimate of Θ. However, under higher dimensional settings, S becomes singular and cannot be inverted.
To overcome this, we can instead optimize the penalized log likelihood driven by [45] the following: where λ is the non-negative tuning parameter. This penalty is also known as lasso penalty and the optimization problem is called the graphical lasso. In effect, a larger value of λ forces entries very close to zero in the co-variance matrix to exactly zero, thus resulting in a sparse network. Since our goal is to discover differential edges between the wild type and mutant subnetworks, we focus on an extension of the graphical lasso called the joint graphical lasso [45]. The joint graphical lasso jointly learns many graphical models which are related but have some distinct gene interactions. Instead of learning a graphical model separately for wild type and mutant subnetworks, joint graphical lasso makes optimum use of the available data by learning common and differential edges resulting in performance gains [45]. We focus on the generalized fused lasso penalty formulation of joint graphical lasso where the log-likelihood function for two classes is provided by [45] the following: where P({Θ}) is given by [45] the following: and Θ (1) and Θ (2) are the positive definite precision matrices for the two classes and m k is the number of samples in the kth class. Highly efficient algorithms based on the alternating directions method of multipliers (ADMM) [46] exist for solving this optimization problem [45]. The penalty P({Θ}) consists of two different penalties: λ 1 is a lasso penalty on off-diagonal elements, while λ 2 penalizes the difference between the corresponding entries in the two precision matrices. Larger values of λ 1 encourage sparser networks in both classes, whereas larger values of λ 2 encourage both networks to be similar to one another. Thus, the network sparsity and similarities are controlled separately [45]. Since we are interested in recovering differential edges for biological hypothesis generation, we will focus on values of λ 1 close to one and values of λ 2 close to zero. As discussed in the earlier subsection, we used time series samples for ZT2-ZT10 and gene expression for TF and circadian/light related genes to create a differential network with parameters λ 1 = 1 and λ 2 = 0.282; the genes were found to be connected with 728 and 92 predicted interactions in WT and lnk1/lnk2 mutant subnetworks, respectively, and 62 common interactions. In order to verify the robustness and consistency of the predicted gene network, the network was recreated and tested multiple times by adding random additional 1000 genes in the input dataset. Although the time series observations are not independent, we were able to empirically verify that under the high edge penalty, the joint graphical lasso is able to reliably recover the gene network.

Network Inference Using MARINa Algorithm
The Master Regulator Inference Algorithm (MARINa) algorithm [47] infers transcription factors (TFs) controlling the transition between the two phenotypes. Since the algorithm assumes that the differences in phenotypes are caused by transcription factors, it cannot recover mutation causing non-TF genes. Therefore, we use the additional information from TF-gene interactions database to learn if LNK1 and LNK2 genes are known to interact with any of the transcription factors identified by MARINa. A list of all potential Arabidopsis regulator genes (transcription factors; TF) was obtained from the Plant Transcription Factor Database [48] (PlantTFDB, version 5.0). ARACNe-AP [49] was used to create genome-wide TF-gene regulatory network based on mutual information (MI) with default parameters using all samples. The resulting 3-column TF-gene-MI file was then used to predict the most influential key regulators (TF) contributing to differences between the two phenotypes for 18 selected samples using MARINa [47].
High confidence predicted and verified interactions between Arabidopsis genes obtained from the Plant Transcription Factor Database (PlantTFDB v5.0) [48] were used for constructing trait related regulatory network. Interactions among top 50 predicted regulators and 84 differentially expressed light and circadian genes were extracted using a unix script. Flourish studio (https://flourish.studio accessed on 19 July 2021) webservice was used to create a network graph using these interactions.

Wild Type and Flowering Mutant Subnetworks
The results of fused lasso graphical modelling between wild type and lnk1/lnk2 mutant are shown in Table 1. As expected, LNK1 and LNK2 genes have no connections in the mutant subnetwork. Next, we compared the top 10 hub genes in WT and mutant subnetworks are shown in Table 2. CCA1 (AT2G46830) and LHY (AT1G01060) are the most connected nodes in both subnetworks. Both of these genes play a crucial role in the regulation of circadian rhythms and act as accelerators of flowering time [50]. FKF1 (AT1G68050) is a clock regulated gene that regulates the transition to flowering in Arabidopsis [51]. Additionally, FKF1 and GIGANTEA (GI) complex are required for the measurement of day-length [52]. COR27 (AT5G42900) is a known positive regulator of flowering [53]. GRP7 (AT2G21660) is an upstream regulator of a flowering repressor gene FLC [54] and grp7-1 mutants are known to flower late [55]. eip6 mutants were found to flower early and showed increased expression of flowering-time and floral organ identity genes [56]. atgstu17 mutants (atgstu17-1 and atgstu17-2) were also observed to flower late [57]. In [58], Arabidopsis plants overexpressing MIPS2 were shown to have delayed flowering. Similarly, CDF6 was recently known to delay flowering [59]. While atnap null mutant and WT plants are developmentally indistinguishable in terms of bolting and flowering times, silique senescence was found to be dramatically delayed in the mutants [60]. ATHB12 is known to negatively regulate the growth of the inflorescence stem [61]. Thus, the hub genes predicted by graphical modeling are biologically meaningful. Finally, we observe the gene interactions of CCA1 specific to the mutant subnetwork to evaluate whether they can explain the late flowering phenotype. Among the six neighbors of CCA1 specific to the mutant subnetwork are the genes COL9 (AT3G07650), B-BOX32 (AT3G21150) and LIR1 (AT3G26740). COL9 and B-BOX32 are known to delay flowering [62,63]. Similarly, Arabidopsis thaliana plants heterologously expressing ryegrass LIR1 show a minor delay in flowering time [64,65].

Predicting Mutation Causing Genes
In order to test if it is possible to infer LNK1 and LNK2 genes as the ones responsible for mutation, we collected genes which are present in the wild type gene subnetwork but not in the mutant subnetwork and that have a degree of at least two. The Interpro domain enrichment terms obtained for these genes using STRING protein-protein interaction database (v.11.0) [66] are shown in Table 3. It can be observed that LNK genes are on the top of the list.

Predicted Interactions of LNK1/2 Genes
Next, we sought to verify the biological relevance of gene interactions predicted by graphical modeling. Tables 4 and 5 list predicted interactions of LNK2 and LNK1 genes, respectively. The verified interactions from literature are shown in bold. LNK genes are also known to contribute to the activation of afternoon genes such as ELF4 (AT2G40080) and FKF1 (AT1G68050) [22], which explains the positive correlation in Tables 4 and 5. Interaction between FKF1 and LNK genes was also predicted in [67]. CCA1 (AT2G46830) is known to interact with both LNK1 and LNK2 [68]. The nature of this interaction is not yet known. However, LNK genes are known to function as transcriptional coactivators of evening-phased genes TOC1 and PRR5 [69], which are known to repress morning-phased gene CCA1. This could explain the negative correlation between CCA1 and LNK genes. In addition to the verified interactions discussed above, we predicted several novel interactions which might be useful in developing new hypotheses for the control of flowering time. It is important to note here that the graphical lasso cannot distinguish between protein-protein interactions and interactions at the gene regulatory level. For example, the interactions discussed above are gene regulatory interactions but the model also predicted interactions between FKF1 and GI genes that are known to interact in a blue light dependent manner at the protein level in the flowering pathway [52,70,71].  Table 4. Predicted interactions of LNK2 gene using a fused graphical lasso. Table 5. Predicted interactions of LNK1 gene using a fused graphical lasso.

Comparison with Existing Methods
We compare the results of the fused graphical lasso in uncovering mutation causing genes with that of another algorithm, which is the Master Regulator Inference Algorithm (MARINa) [47]. Since the MARINa algorithm assumes transcription factors and not genes as the cause of mutations, as expected, the LNK1 and LNK2 genes are not present in the top 10 inferred master regulators, as shown in Table 6. However, the predicted transcription factors are found to be involved in flowering and light related pathways. We then used these results along with the publicly available TF-gene interactions for the Arabidopsis genes to investigate whether the LNK1/2 genes interacts with any of the predicted master regulators. Figure 1 shows the network of high confidence interactions from PlantTFDB [48] between the top 50 master regulators predicted using MARINa and 84 differentially expressed light and circadian related genes. The legends TFlight and TFcirc denote transcription factors involved in circadian and light related pathways, respectively, while light, circ and circlight denote genes involved in circadian, light and both pathways, respectively. Since LNK1 and LNK2 are involved in daylength regulated control of flowering, they are involved in both light and circadian pathways. As it can be observed, LNK1 (AT3G54500) and LNK2 (AT3G54500) are present in the MARINa predicted network. Additionally, the major hub genes in this network, CDF5 (AT1G69570) and FLC (AT5G10140), are well known genes in the circadian and flowering pathways [50]. However, unlike the graphical lasso, the algorithm is unable to rank the genes since there is no information regarding phenotype specific gene subnetworks. The differences in the two subnetworks obtained from using the fused graphical lasso were used for Interpro based enrichment to rank the genes. Additionally, the network created by graphical lasso was based only on the gene expression data without any additional TF-gene regulatory information. This is important for non-model crops such as wheat for which extensive regulatory information might not be publicly available. On the other hand, the Interpro information used for ranking the genes is widely available even for commercial crops.

Discussion
In this paper we proposed the use of a joint graphical lasso with fused penalty for identifying genes responsible for phenotypic differences in two Arabidopsis thaliana flowering phenotypes. Authors in [22] showed that the mutants resulting from knock out genes such as LNK1 and LNK2 in Arabidopsis thaliana have longer hypocotyls and delayed flowering times. By using publicly available circadian gene expression data from wild type and late flowering mutant from that study [22], we showed that the graphical lasso based analysis was able to identify the causal genes LNK1 and LNK2. In addition, network level changes in wild type and flowering mutant were also predicted and identified, which can assist biologists to learn and hypothesize about how mutation at just two genomic locations can result in genome wide transcriptional and regulatory changes. The gene interactions were verified against available biological knowledge and many novel interactions of LNK1/2 genes were predicted. One major disadvantage of this method is that it ignores the lowly expressed transcription factors (TF) since mRNA expression is not an accurate representation of TF activity. While another algorithm, MARINa, was not able to identify the genes LNK1 and LNK2 since it focuses only on TFs; the two algorithms might complement one another for developing stronger biological hypotheses.

Data Availability Statement:
The data used in this study are publicly available at NCBI with the accession number SRP018266.