Modeling of Flowering Time in Vigna radiata with Approximate Bayesian Computation

: Flowering time is an important target for breeders in developing new varieties adapted to changing conditions. A new approach is proposed that uses Approximate Bayesian Computation with Differential Evolution to construct a pool of models for ﬂowering time. The functions for daily progression of the plant from planting to ﬂowering are obtained in analytic form and depend on daily values of climatic factors and genetic information. The resulting pool of models demonstrated high accuracy on the dataset. Day length, solar radiation and temperature had a large impact on the model accuracy, while the impact of precipitation was comparatively small and the impact of maximal temperature has the maximal variation. The model pool was used to investigate the behavior of accessions from the dataset in case of temperature increase by 0.05–6.00 ◦ . The time to ﬂowering changed differently for different accessions. The Pearson correlation coefﬁcient between the SNP value and the change in time to ﬂowering revealed weak but signiﬁcant association of SNP7 with behavior of the accessions in warming climate conditions. The same SNP was found to have a considerable inﬂuence on model prediction with a permutation test. Our approach can help breeding programs harness genotypic and phenotypic diversity to more effectively produce varieties with a desired ﬂowering time.


Introduction
Among the cultivated species in the legume family, mungbean (Vigna radiata (L.) Wilczek), also known as green gram), a short duration crop, is sometimes considered as a "minor" crop, due to limited breeding efforts and restricted planting area that is much less than the range of suitable soil and climatic conditions [1]. Noteworthy, it has become an increasingly important crop across Asia and beyond, showing a steady rise in global production (FAO 2018). Mungbean may provide additional income to farmers as a rotation crop and it adds nitrogen to the soil. It is a valuable source of protein and essential micronutrients such as folate and iron. With few secondary metabolites that generate a beany taste [2], it works well as a plant-based protein. Mungbean is a well-suited model organism among legume plants due to its relatively small genome size, short life-cycle, self-pollination, and close genetic relationship to other important legume crops.
In this work, we built a new dynamical model for time to flowering in 1596 accessions of Vigna radiata phenotyped during six field trials in different years and locations. Developing on our previous studies for chickpea [40][41][42] and mungbean [43], a new approach was proposed that uses Approximate Bayesian Computation with Differential Evolution [44,45] to construct a pool of models for time to flowering. The models in the pool use daily values of climatic factors and GWAS results for genetic information and genotype-by-environment interactions to model a daily progression rate from sowing to flowering.
The pool of models is used to investigate the impacts of different factors on the predicted time to flowering. Further, the performance of accessions in the available dataset was simulated in climate warming conditions using the daily temperature increased on 0.05-6.00 • . The association of variation of time to flowering and genotype information was analyzed using Pearson correlation coefficient.

The Overview
The methodology proposed in this work consists of construction of pool of models with low training and validation scores using Approximate Bayesian Computation, investigation of impacts of different factors to the model using permutation test, forecast of time to flowering using generated daily weather with increased minimal temperature (see Figure 1). These steps are further described below.

Plant Material
The mungbean accessions from a WorldVeg collection described in [5] were phenotyped during several field experiments (see Table 1). Details on the phenotyping experiment and its subsequent analysis will be presented in a future manuscript (Sokolkova et al. in preparation).
The time to flowering for this dataset ranges from 31 to 249 (see Figure 2). To build the simulation model for time to flowering we used 1438 accessions for cross-validation and 158 for control (see Figure 1).  The phenotypic effect of a gene is most often described by an average difference between different genotypes, but a gene can further influence the phenotype, causing differences in variance between genotypes. To investigate this opportunity the varianceheterogeneity GWAS (vGwas) has been proposed [46]. The vGWAS approach facilitates the identification of loci that are involved in the genetic control of environmental variability.
The vGWAS uses a Brown-Forsythe (similar to Levene's test), statistical test for equality of group variances based on an ANOVA of the absolute deviation from the median to identify SNPs that control phenotypic variance. This test is resistant to deviations from the normal phenotypic distribution in GWAS.
The SNP from indels were excluded together with those with gaps in at least one sample and the number of variants greater than two. The minimum frequency of the alternative allele was set to 15%. Using vGWAS approach, 15 SNP were selected (see Table 2) associated with flowering time at a significance level of 0.05 [46].

Prediction of Time to Flowering with a Pool of Models
Let's denote a pool P = {M k , k = 0, ..., K − 1} of individual models M k and a function Y(M, i) that predict time to flowering given a model and an information on a sample i as its arguments. Then, the predicted time to flowering y i of sample i is calculated as (1).
Information on a sample includes the data on climatic factors and genotype. Let's denote for the model M k the fitness threshold that the plant needs to reach to be able to flower as U k and the plant's fitness to start flowering at day t as u k (i, t). Then, Y(M k , i) is defined as (2).
The plant's fitness to start flowering u k (i, t) increases at each day by the value v(i, t) that is calculated as a linear combination of N control functions F n with coefficients β n (3).
Control function can be represented as a tree with elementary functions in the nodes and climatic factors in its leaves (see Figure 3). The list of elementary functions includes arithmetic operations: '+', '-', '*', '/', and expressions: , where X is a name of a climatic factor and X b,± is a constant base value of this factor. These constants are specific to factors, so there are T b,± x base maximal and T b,± n minimal temperature, R b,± precipitation, D b,± day length and S b,± solar radiation. The tree is encoded using Grammatical Evolution (GE) [48,49] as a vector of length L of integer valued codons using breadth-first search (BFS).

Approximate Bayesian Computation
The model M is characterized by parameters that include coefficients β n , n = 0, . . . , JN + N − 1 together with fitness threshold U, the minimal coefficient value β min , L GE codons (integer numbers) and constants T b,± x , T b,± n , R b,± , D b,± and S b,± . Parameters are to be found that minimize the deviation of the model output from data defined by Formula (4): where α and λ are regularization coefficients, τ i is a number of days from planting to flowering measured experimentally. Thus, in addition to parameters, the model M is characterized by hyperparameters: the number of control functions N and the number of codons L and regularization coefficients α and λ. The Approximate Bayesian Computation (ABC) algorithm proposed in [45], in combination with Differential Evolution Entirely Parallel (DEEP) method [50,51], was adapted for construction of a new model. Using a prior distribution for hyperparameters (N, L, α, λ) and a measure of discrepancy between model solution and data Q Equation (4), ABC allows us to construct a chain of hyperparameter values sampled from their posterior distribution.
The algorithm operates on a set B of parameter vectors θ j = (N, L, α, λ). Hyperparameters are initialized from the prior distribution and updated on each iteration using a recombination rule Equation (5).
where W 1,2 are weight coefficients sampled from normal distribution, m, n are indices of members from B, b is an index of the best vector, and ε is a vector of small normally distributed numbers. Hyperparameters are accepted or rejected using a probabilistic criterion that compares chances of new hyperparameters to be sampled from the posterior with a random number. Optimization procedures include cross-validation to avoid overfitting. The 10% of samples from the available dataset are saved as a control set and are excluded from optimization. The remaining cross-validation set of samples is split on each iteration into training I training and test I test samples containing 80% and 20% of samples counting from cross-validation set. For each set of hyperparameters produced by ABC, a model M * is obtained using DEEP method that minimizes Q Equation (4) on a training set, and Q training = Q(M, I training , 0, 0) and Q test = Q(M, I test , 0, 0) are then calculated on a training and test set, respectively. The algorithm is run during the predefined number of iterations G.

Construction of the Pool of Models
The output of the algorithm consists of three sequences: models M g , g = 0, ..., G − 1, training Q training g , g = 0, ..., G − 1 and test Q test g , g = 0, ..., G − 1 scores, respectively. We apply a transform Q = Q/ dim(I) to make training and test scores comparable, where dim(I) is the number of samples in corresponding set.
As with all similar inference methods, the members of these sequences represent posterior distribution only after some burn-in iterations. To determine the index starting from which models are considered to be acceptable we compared the sequences of Q training g and Q test g , g = C, ..., G − 1 with Mann-Whitney criterion for a series of C = 10, 20, 30, ..., G − 11. The minimal C for which p-value of test is greater than 0.05 defines a subsequence of models that passes a cross-validation test that means that the scores on training and test sets are indistinguishable. This subsequence of models contains candidates for the desired pool.
To further restrict the pool size, only those models are included for which the median absolute difference in days between measured time to flowering and model prediction is less than predefined threshold.

Estimation of Impacts of Climatic Factors and Genotype Information to the Model
A permutation test was performed to analyze the contribution of climatic factors T x , T n , P, S, D and SNP values to time to flowering. The essence of the method is to rearrange the data of the factor of interest between dates or samples and record the change in the error. The impact of the climatic factor is then an average error increase over all permutations.

Simulation of Climate Warming
To explore the impact of climate warming, we simulated development of accessions with our model increasing the minimal temperature on 0.00, 0.05, 0.10, 0. 25

A Pool of Models for Time to Flowering
An optimization procedure of Approximate Bayesian Computation described in Section 2.5 was run for 500 iterations with 40 burn-in steps with 8 parameter vectors in a set. DEEP method used 1152 and 286 samples for training and test set, respectively. Using procedure described in Section 2.6 the convergence was found after 360 iterations (see Figure 4).  To verify the validity of the constructed model we compared the mean values of the root mean scores with a Mann-Whitney-Wilcoxon criterion. The difference between training and validation scores, 13.95 and 14.29, respectively, was not statistically significant (p = 0.05262 > 0.05).
As described in Section 2.6, 22 models were selected for the pool from 60 candidates with the lowest root mean square scores for control set for which the median absolute difference in days between measured time to flowering and model prediction is less than 72. The models in the pool accurately describe cross-validation and control data in terms of the Pearson correlation coefficient, median and root mean square difference in the number of days to flowering between model solution and data (see Table 3). The models in the pool consist of different control functions.
The prediction of time to flowering with a pool is made using Formula (1). The Pearson correlation coefficient between the pool solution and data is 0.98, median discrepancy in days is 6.47 and root mean squared error is 11.25. Thus, the accuracy of the pool for the control set is better than any model in the pool.
The pool accurately describes cross-validation and control data (see Figure 5), and the Pearson correlation coefficient is ≈0.98 in both cases.

The Best Model in the Pool
The best model in the pool is № 9 (see Table 3) and has 6 control functions Equation (6), and the Pearson correlation coefficient is 0.98 for the control set.
v(i, t) = 9.83F 0 + 10.00F 1 + 2.32F 2 − 2.92F 3 + 6.32F 4 − 10.00F 5 where F n are climatic control functions. All control functions are non-linear, and all climatic factors except day length D in F 5 have non-linear effect in all functions. This illustrates the complex nature of the effect of climatic factors on time to flowering for accessions in the dataset.

Impacts of Climatic Factors and SNPs to the Model
To assess the impacts of climatic factors to the pool a permutation test was performed with 100 permutations (see Figure 1).
The day length and solar radiation have the maximal impact on the model, the impact of maximal temperature has the maximal variation between permutations (see Figure 6). Maximal temperature impacts the model solution almost twice weaker than day length, while the impact of precipitation is the weakest one (see Figure 6).
To assess the impact of genotype information, a test was performed 100 times in which the SNP values were permuted between accessions and the difference in root mean square error recorded (see Figure 7). The SNPs № 1 and № 7 have the largest effect to the model solution.

Simulation of Climate Warming
The pool was used to predict time to flowering of samples from the dataset for which the prediction error was less than 12 days in the scenario that simulates climate warming. The minimal temperature at each day was increased. Samples from the dataset exhibited different behaviors (see Figure 8). For one group of samples the time to flowering was approximately the same for increments up to 2.1 • degrees and drops by approximately 5 days for further increments (see Figure 8A). For samples from another group the time to flowering fluctuates, but gradually increases by approximately 5 days for 6 • increment ( Figure 8B). Two other groups include samples with variation of time to flowering less than 5 and greater or equal 5 days and less or equal 10 days, respectively ( Figure 8C,D). To characterize the relation between SNPs and the behavior of samples under increased temperature the Pearson correlation coefficient between SNPs and the difference between time to flowering for normal and temperature increased on 6 • C was calculated ( Figure 8A). No strong relationships were found as all coefficients were <0.5. The highest positive value of 0.32 (p-value < 10 −16 ) suggests a weak correlation for snp7 that had a biggest impact to the model in permutation test (see Figure 7).
The weak correlation was obtained between SNP value and the maximum slope of the dependence of time to flowering on temperature increment for snp7 that had the lowest negative Pearson correlation coefficient −0.36 (p-value < 10 −16 ) ( Figure 8B). The same low dependence can be found for snp0, snp3, snp4, snp5, snp6, snp8, snp10, snp14 with coefficients above 0.3 in absolute value.
The correlation between SNPs and maximum rate of change of the slope of time to flowering dependence and standard deviation of time to flowering was almost negligible (<0.2) for all SNPs (see Figure 9C,D).

Discussion
Mungbean (Vigna radiata (L.) Wilczek) is used in several traditional cuisines across Asia as a rich source of proteins and micronutrients. It has been an orphan tropical crop for a long time but could fulfill a role in a range of agroecologies. It fits into crop rotations due to a short duration cycle. One setting where mungbean fits well is in rotation with winter wheat, which is often harvested late in spring, leaving only a short summer season for mungbean; short duration is particularly critical to set seed before cool weather begins in this role [43]. Development of new varieties adapted to different conditions is necessary to meet the needs of growing global demand. The WorldVeg collection is a valuable resource for mungbean breeding [52]. The adaptation of adaptive traits like flowering time to specific environments is blueprinted in genomes [53,54] so that different genotypes responds to local conditions in different ways.
Here we proposed a new modeling approach with a pool of models to predict flowering time in mungbean based on daily values of maximal and minimal temperature, precipitation, day length and solar radiation as climatic factors. The genotype information in the form of SNPs identified by the vGWAS method is used to include genotype-byenvironment interactions. A pool of models is built using Approximate Bayesian Computation technique [45,55] that uses Differential Evolution and Grammatical Evolution to obtain parameters of models that include analytic form of functional dependence of phenotype on different factors. ABC makes it possible to obtain not only one solution, but a sample from posterior distribution of unknown parameters.
We performed model parameterization on a dataset of 1596 accessions phenotyped in 6 different environments. Analysis of association revealed 15 SNPs associated with flowering time. To increase the reliability of model predictions, we constructed a pool of 22 models trained on the dataset of 1438 samples and tested on 158. The advantages of our model are the automatically constructed analytic form of functional dependence and the presence of the genotype-to-climatic factors interactions. In comparison to dynamical models developed earlier [43], the formulae (6) is more complex. That is because the data on accessions with long time to flowering was included into the dataset.
Though the amount of accumulated data is constantly growing the understanding of the role of temperature and day length in adaptation to different agroecological conditions is still incomplete [56]. According to the permutation test day length, solar radiation and temperature had a large impact to the model, while the impact of precipitation was comparatively small.
We used the constructed model pool to forecast the time to flowering in case of climate warming. The temperature was increased by 0.05-6.00 • . Several types of behavior of samples were observed in which the time to flowering decreased or increased. The Pearson correlation coefficient between SNP value and the change in time to flowering revealed a weak but significant association of SNP7 with behavior of the accessions in warming climate conditions. The same SNP was found to have a considerable influence on model prediction in the permutation test. SNP1 was important according to permutation test and has a correlation with the change in time to flowering less than 0.30.
The impact of climate change has been studied mostly for major cereal crops like wheat, maize, barley, etc. [57], but also cowpea (Vigna unguiculata [L.] Walp.) [58,59] and grass species [60]. The considerable variation in crop responses to future climate was described for chickpea in previous studies for South Asia and East Africa [61]. Earlier crop flowering and maturity have been observed and documented in recent decades, and these are often associated with warmer (spring) temperatures [62].

Conclusions
The model pool developed in this work illustrates the approach based on Approximate Bayesian Computation with Differential and Grammatical Evolution for construction of dependencies of plant phenotype on climatic factors and genotype-by-environment interactions. Due to extensive cross-validation the obtained pool of models provide high accuracy on available dataset.
The models in the constructed pool consist of more complex function dependencies in comparison to previous studies due to more diverse dataset. The results of permutation tests showed that the impacts of all climatic factors except precipitation are important to the model solution. The day length has the maximal impact and the impact of maximal temperature has the maximal variation.
Our results confirm the diversity of behaviors of genotypes. The time to flowering is reduced in constructed forecasts for climate warming conditions for some but not all accessions. The SNP № 7 was found to be weakly but significantly correlated with model response to climate change and model accuracy in the permutation test. This genotypic and phenotypic diversity can be used in breeding programs to produce varieties with desired flowering times.