On-Line Optimal Input Design Increases the Efficiency and Accuracy of the Modelling of an Inducible Synthetic Promoter †

Synthetic biology seeks to design biological parts and circuits that implement new functions in cells. Major accomplishments have been reported in this field, yet predicting a priori the in vivo behaviour of synthetic gene circuits is major a challenge. Mathematical models offer a means to address this bottleneck. However, in biology, modelling is perceived as an expensive, time-consuming task. Indeed, the quality of predictions depends on the accuracy of parameters, which are traditionally inferred from poorly informative data. How much can parameter accuracy be improved by using model-based optimal experimental design (MBOED)? To tackle this question, we considered an inducible promoter in the yeast S. cerevisiae. Using in vivo data, we re-fit a dynamic model for this component and then compared the performance of standard (e.g., step inputs) and optimally designed experiments for parameter inference. We found that MBOED improves the quality of model calibration by ∼60%. Results further improve up to 84% when considering on-line optimal experimental design (OED). Our in silico results suggest that MBOED provides a significant advantage in the identification of models of biological parts and should thus be integrated into their characterisation.


Introduction
Synthetic biology, a discipline at the interface of biology, engineering and computer science, seeks to engineer cells with new functionalities. Despite great progress towards this goal [1], the prediction of the in vivo behaviour of synthetic circuits is still a challenge that hinders technological applications. For the field to reach its full potential, the accurate prediction of the dynamics of synthetic circuits needs to be achieved. Mathematical models can serve as a tool to gain a mechanistic understanding of a system and could tackle this issue; however, their adoption in synthetic biology has so far been limited [2]. Indeed, while biophysical and static models have been successfully proposed as tools to guide the automated design of biological circuits [3,4], the use of mathematical models is often confined to the interpretation of experimental data.
The lack of widespread use of mathematical models can be attributed to the limited availability of extensively characterised biological components, the often neglected contextual dependence of the behaviour of circuits [5] and the difficulty of estimating unknown parameters of large multipart systems from input-output data. The latter, usually sparse and noisy, are traditionally acquired using experimental platforms (e.g., microplate readers) which pose constraints towards flexible experimental designs. "Pulses" or "steps" of chemicals are the de facto standard stimuli to probe part behaviour. Unfortunately, such designs often yield poorly-informative data. Indeed, these inputs are low-pass filtered by molecular diffusion, which limits their frequency content. Furthermore, while persistent excitation is proven optimal to identify linear systems [6], there is not such a general result for non-linear systems such as biological networks. The question is then, is it possible to design experiments to identify models that predict the behaviour of biological circuits, and if so, how?
Model-based Optimal Experimental Design (MBOED) allows the design of experiments with maximal information content for parameters inference. Previous works addressed the potential of using MBOED in biological systems. Bandara et al. [7] considered a cell signalling example and showed that optimally designed experiments supported a 60-fold decrease in the variance of parameter estimates when compared to intuition driven designs. Similarly, Ruess et al. showed how optimised dynamic inputs outperformed random inputs in characterising a light-inducible promoter [8].
Despite these promising results, in general, the synthetic biology community has not adopted MBOED. Among the reasons: optimally designed experiments are difficult to implement with traditional experimental platforms and the skills to design them are not widespread in wet laboratories. Technological developments (e.g., microfluidics) and software tools (e.g., AMIGO2 [9]) help to alleviate these problems but they have steep learning curves. Is MBOED worth the extra effort of adoption?
This work addresses this question by identifying a mathematical model of a building block in synthetic biology: an inducible promoter. Many synthetic promoters use DNA sequences from the same organism for which they are engineered, and so potentially suffer from unwanted regulation from other genes in the genome. This additional regulation makes disentangling and modelling promoter activity a challenge. To overcome such issues, we considered an orthogonal promoter [10], i.e., a S. cerevisiae promoter engineered with DNA sequences from E. coli. The promoter, designed by Gnügge et al. [10] (Figure 1), controls the expression of a fluorescent reporter, Citrine, when cells sense the chemical IPTG. IPTG permeates the cell through the heterologous transporter Lac12 and binds the LacI repressor, thereby relieving its downregulation on promoter activity. Binding of the constitutively expressed tTA to the tetO 2 site leads to the expression of Citrine.
As a first step, we refined a mathematical model of the inducible promoter (M PLac ), obtaining M PLac,r using available data [10]. We then proposed a simplified model structure, M 3D , able to mimic the dynamics of M PLac,r (M IP,r ). We compared the performance of optimal and intuition-driven input profiles for the identification of M IP,r using the posterior distributions of the inferred parameters. Finally, we assessed the improvement of on-line MBOED over off-line MBOED as quantified by the accuracy of the estimates and their convergence rate to the actual parameter values. Note that, while in off-line MBOED the optimised input was computed for the entire experiment before this begins, on-line MBOED was characterised by iterative, data-informed refinements of both the model and stimulus profile within a single experiment.
Our results suggest that iterative off-line and on-line MBOED enable the design of more informative experiments for the characterisation of biological parts, e.g., synthetic promoters. Furthermore, we quantify the increase in the confidence of the inferred parameter estimates. The manuscript is organised as follows: Section 2 discusses how we recalibrate a model of the inducible promoter, define a reduced model and use it to compare the information content of different input classes. The section closes with a comparison between off-line and on-line MBOED schemes. Section 3 expands on the importance of MBOED for the design of more informative experiments in synthetic biology. Section 4 details our in silico experiments, the comparison of information content of different input classes and the design of optimal experiments. Finally, Section 5 presents our conclusions and future directions.

Refitting Gnügge et al.'s Model
First, we recovered M PLac , the model published by Gnügge and colleagues [10], and independently assessed its ability to describe the experimental data reported in the original paper [10]. The dataset includes 24 IPTG dose-response curves (five samples, 12 h intervals). Our analysis reveals that M PLac consistently underestimates by 20-30% the steady states at intermediate concentrations ( Figure 2, grey line).
To resolve this discrepanc,y we set out to recalibrate M PLac using enhanced Scatter Search (eSS) [11].

The Dynamics of the Inducible Promoter Is Captured by a Reduced-Order Model
We developed a reduced order model structure (M 3D ) to constrain the number of parameters to be identified as well as the computational cost associated with optimal experimental design. The model structure reads as follows: where R, P f and P m are the concentrations of Citrine mRNA, immature folded protein and matured (fluorescent) protein, respectively. The model encompasses transcription, translation and maturation of the fluorescent reporter and has eight parameters. Transcription depends on the concentration of the inducer IPTG through a Michaelis-Menten kinetics, where v is the maximal induced transcriptional rate; h the Hill coefficient; K r the Michaelis-Menten constant; and α is the basal transcription. Translation occurs at a rate k p and the folded protein matures at rate k f . All biochemical species are subject to linear degradation, occurring at rates γ for mRNA and γ f for protein. The M 3D model structure builds on the assumption of time-scale separation between the expression of the repressor LacI, its dimerisation and subsequent binding to the operator sites and to IPTG, which are considered at quasi steady state, and Citrine expression. Calibrating M 3D to the time-series data in [10], we obtained M IP . We observed that M IP better fits both the steady-states, i.e., the dose-response curve (Figure 2a), and the dynamics of the system (see Figure 2b for an example). Despite its lower order, M IP achieved predictive capabilities comparable to M PLac,r considering the sum of squared errors over the whole set of experimental data ( Figure 2c). It is worth noting that M IP is characterised by a smaller rise time (1.8 h) than both M PLac and M PLac,r (7.9 h) ( Figure 2b). Here, the rise time is defined as the time taken by the output to rise from 10% to 90% of the steady-state. Unfortunately, the low sampling frequency used in the original study [10] impeded any further constraints on the characteristic time-scale of the system.
As we aimed to explore the informative content of different input classes (Section 2.3), we required our reduced model to closely mirror the dynamics of the considered inducible promoter. We therefore used M PLac,r to produce a set of pseudo-experimental data to re-calibrate M IP , thereby obtaining M IP,r . While M IP could have been selected as a nominal model for MBOED, the possibility of expanding the available dataset and further constraining model calibration made us prefer M PLac,r as a reference. The limited complexity of the underlying biological system prompted us to adopt stepwise, pulses, ramp-wise and stepwise random inputs in the generation of the pseudo-data (Section 4.1). The results of M IP,r parameterization show that this model mimics the dynamics of M PLac,r (Figure 3). Step (a); pulse (b); ramp (c); and random (d) inputs (red line) were used to simulate Citrine dynamics in M PLac,r . By sampling the simulated expression profiles and adding 5% Gaussian noise, we obtain pseudo-data (green circles). The green line represents the response of the calibrated M IP,r to these data.

Intuition-Driven Inputs Are Poorly Informative
We next aimed to compare the performance of intuition-driven stimuli (step, pulse and random) for the parameterisation of M 3D . We generate N j = 100 stimulation patterns for each of the three classes (Section 4.2). By simulating the response of M IP,r to each input, we collected pseudo-experimental data to inform the calibration of M 3D . We cast parameter estimation as a non-linear optimisation problem and addressed it using eSS [11]. The use of standard metrics (e.g., z-score) to quantify the statistical significance of the distance between nominal and estimated parameter values was ruled out by the parameter posteriors not following a normal distribution. To overcome this limitation, we defined the where i identifies the i th entry in the parameter vector and j is the index of the input profile yielding the parameter estimate p (j) i . Note that ε (j) i = 0 when the parameter estimate equals its true value, while the absolute value ensures equal treatment of under-and over-estimates.
In Figure 4, the distributions of relative error for the 100 input profiles highlight a differential sensitivity of the output to the parameters (Figure 4b-d). We note that the high dispersion in the estimates of α, v and γ aligns with a preliminary analysis of practical identifiability (results not shown), which suggested a high correlation between these parameters. Practical identifiability issues can indeed hamper the accuracy of the estimates of the affected parameters. Interestingly, the ε i distributions advocate that the intuition-driven inputs convey a similar amount of information (Figure 4b-d). This was further confirmed by the absence of statistically significant difference in the average relative error (ε) (Figure 4f), defined as:ε where N p is the number of parameters in the model structure and N j is the number of input profiles.

Optimal Input Design (OID) Increases the Accuracy of Model Identification
Next, we questioned the improvement in the accuracy of parameter inference enabled by optimally designed experimental schemes. To reflect wet-lab experimental constraints, originating from the assumption of using a microfluidic-based platform, and to compare the informative content of intuition-driven and optimised inputs under similar conditions, we set the duration of the experiment to 3000 min and the sampling intervals to 5 min. Furthermore, we assumed a constant duration of 200 min for the steps in the optimised input (Section 4.4). Consequently, we formulated OID as a constrained optimisation problem that searches for the IPTG concentrations, (i.e., amplitude of the steps of IPTG) that maximise the information content. The optimisation relies here on the D-optimality criterion [12], which seeks to maximise the determinant of the Fisher Information Matrix (F ) [6,13].
We designed N j = 100 optimised stimulation profiles (see Figure 4a for an example), applied them to M IP,r to generate pseudo-data and performed model calibration. In our results, the use of optimised inputs enabled a noticeable reduction in ε compared to intuition-based inputs (Figure 4b-e). The higher confidence of parameter estimates even for the poorly identifiable parameters α, v and γ, translated to a 64% reduction inε for the optimally designed input over the intuition-driven counterparts.

Clustering of the Optimised Inputs
To explore the properties of the optimised stimulation profiles and their effect on the accuracy of parameter estimates, we clustered the 15-dimensional input space through Self-Organising Maps (SOM) [14]. Specifically, the N j = 100 optimally designed inputs were projected on 49 vector prototypes or nodes using the SOM Toolbox [15] (Section 4.5). Hence, the nodes were grouped using agglomerative, hierarchical clustering. Both the silhouette and the Calinksi-Harabasz internal validity index [16] suggested an optimal partitioning of the dataset into five clusters. In Figure 5, the accuracy of parameter inference, as quantified by the average relative error (ε) for the inputs assigned to each group, proved to be cluster-specific ( Figure 5a). As the observed differences could not be ascribed to the correlation betweenε and the value of the objective function attained by the optimiser, we investigated the pattern of IPTG values of the prototype vectors. This analysis revealed that the best performing clusters, C 4 and C 5 , are characterised by a high inducer concentration after 10 h of stimulation (Figure 5b). It is worth noting that, despite the system being initially in an uninduced state, the optimiser selected low IPTG values in the first steps: a counter-intuitive choice for an experimentalist.

On-Line OID Supports the Automation of Model Calibration
We have shown that off-line OED can be used to reliably infer mathematical models. Nevertheless, the necessary intervention of human experts at each identification cycle [8] could increase the cost and weaken the robustness of model calibration. To address this problem, we propose to automate model inference through integrating PE/OED and in vivo experiments in an on-line, closed loop. In off-line model identification (Figure 6a), PE/OED is performed before and after the experiment; in on-line model identification, the experiment consists of a sequence of n sub-experiments or loops. In each sub-experiment i, once the waiting time for updating the model (t s ) has elapsed, parameter estimates are refined using the measured data, and MBOED on the current model, M(p i ), is used to compute the optimal input, u i+1 , for the next loop (Figure 6b).

Off-line OID
On-line OID In on-line OID, the experiment is structured as a sequence of n sub-experiments (e.g., n = 2), each lasting t H . Before the experiment begins, M(p 0 ) is used to optimally design the input, u 1 , to be applied in the first sub-experiment. After a time t S , the model is refined to M(p 1 ) and used to design the input, u 2 , for sub-experiment 2. The procedure is iterated for the duration of the experiment.
We explore the dependency of parameter uncertainty on the trade-off between the duration of the sub-experiments and number of OID loops for n equal to 1 (off-line), 3 and 5 (on-line) using N j = 30 input profiles. The results show that the average relative error,ε, after 50 h of experiment is a decreasing function of the number of loops. Specifically, n = 5 enables a 54% reduction inε over off-line optimally designed inputs (Figure 7a) because of the improvement in accuracy of the estimates of poorly identifiable parameters (α and γ) (Figure 7b-d). Finally, the evaluation ofε every 5 h reveals that on-line OID accelerates the convergence of the average relative error: when n = 5, 20 h of experiment would provide parameter estimates with an higher accuracy than those inferred over 50 h of off-line OID (Figure 7e).

Discussion
Ensuring the accurate and efficient inference of mathematical models represents a route to their widespread use in synthetic biology. In this paper, the quantification of the informative content of different input classes encourages the integration of MBOED in the characterisation pipeline of biological parts. We reached this conclusion by identifying a deterministic model for the orthogonal, inducible promoter designed by Gnügge et al. [10]. We chose to consider a regulated promoter as, in eukaryotes, each node in a synthetic network requires a new promoter. Furthermore, it is generally assumed that the low complexity of synthetic promoters makes the definition of informative, intuitive, experimental schemes amenable. Clearly, our results challenge this belief (Figure 4f).
Aiming to compare the informativeness of the different input classes, we recovered the model structure (M PLac ) proposed in [10]. The noticed gap between the simulated and experimental transition-region of the dose-response curve prompted us to attempt refining the calibration of M PLac . We cast model inference as a multi-experiment fitting problem and address it using cross-validation. Despite finding that M PLac,r gives a 56% improvement in the fitting over M PLac (Figure 2c), we can only speculate on the cause of this enhancement.
We next verified that the dynamics of M PLac,r , described by pseudo-data generated using this model, could be exhaustively captured by the lower-order model M IP,r (Figure 3). This result and the smaller computational cost incurred in MBOED with a simpler model led us to select M IP,r as a reference model of the underlying biological system in the subsequent analysis.
We found that experiments with optimised inputs improve model calibration when compared to intuition-driven inputs (Figure 4f). Interestingly, we could not recover the instinctive difference between the performances of the intuition-driven classes of inputs (Figure 4f). In addition, it is important to note that the lower average error yielded by the optimised inputs does not imply that all parameter estimates improve. For example, pulse inputs allow a narrower ε distribution for k p to be attained (Figure 4b-d). Nevertheless, optimally designed inputs support the decrease in the variability of the estimates of poorly identifiable parameters, e.g., α and γ (Figure 4b-d). We further explored the properties of the optimised input profiles, seeking for features to which the higher accuracy of parameter inference could be ascribed. The cluster analysis by self-organising maps revealed the presence of five groups with differential performance (Figure 5a). Here, the clusters providing smaller average relative error (ε) were denoted by high induction levels at the fourth step. It is worth considering that, despite the system being initialised in absence of induction, the optimiser selected low IPTG concentrations at the beginning of the experiment: an observation which highlights the relevance of MBOED.
Finally, we investigated the improvement of on-line over off-line OID as a first step towards the automation of model calibration. Indeed, on-line OID would promote a standardised, more robust, and cost-effective identification of mathematical models. We found that on-line OID further improves the calibration of poorly identifiable parameters (Figure 6b-d), leading to a monotonic increase in the accuracy of parameter estimates with the number of loops (Figure 6a), although we expect this increase to eventually stop. In fact, once the experiment duration has been defined, the performance of on-line OID depends on the properties t H , the duration of the sub-experiments, and t S , the waiting time for updating the model. t H represents a trade-off between the number and duration of each loop; t S between the amount of data acquired in each iteration and the computational cost of PE/OED. Notably, the faster convergence of the estimates to the true parameter value observed with on-line OID would allow the overall duration of the experiment to be reduced.
Overall, these results suggest that the model-based design of dynamic perturbations would underpin our ability to perform a painstaking inference of mathematical models of biological systems: a requirement if synthetic biology is to advance as an enabling technology founded on engineering principles.

Generating Pseudo Experimental Data for the Identification of M IP,r
To re-calibrate parameter values in M IP and obtain M IP,r , we chose to simulate the response of M PLac,r to step, pulse, ramp and random inputs over 3000-min long experiments. For each of these four input classes, we defined a generating function; we then designed three inputs for each class.
Step inputs are obtained using: As generating function of the ramp input, we used: where a was set to 10 µM, 100 µM ( Figure 3c) and 1000 µM for each of the three inputs generated for this class. It should also be noted that a Zero Order Holder filter with a window of 60, 150 and 250 min was applied to the first, second and third inputs, respectively. Finally, the pseudo-random inputs are defined as: In all simulations, we added a 5% Gaussian noise and assigned the initial conditions of the system to the steady state values derived from a 24 h simulation of M PLac,r with 0 µM IPTG as the input. All experiments were simulated in AMIGO2 [9] and Citrine was sampled every 5 min. For more details on these procedures, we refer the reader to our GitHub repository [17].

Generating Pseudo Experimental Data for the Comparison of Input Classes
The inputs we used to compare the informative content of different stimuli were defined as follows: In all simulations, we added a 5% Gaussian noise and set the initial conditions of the system to the analytical steady-state of M IP,r with IPTG equal to 0 µM; all experiments were simulated in AMIGO2 [9] and Citrine was sampled every 5 min. For more details on these procedures we refer the reader to our GitHub repository here [17].

Parameter Estimation
Parameter estimation was formulated as a non-linear optimisation problem, whose objective is to identify the parameter values that minimise a scalar measure of the distance between model predictions and (pseudo) experimental data. We used the weighted least squares as a cost function, with weights set to the inverse of the experimental noise. This is defined as Gaussian noise with variance equal to 5% of the maximum value reached by the output, simulated with the true parameter values p * , within an experiment. To solve the optimisation problem, we relied on eSS [11]: a hybrid method that combines a global and a local search to speed up convergence to optimal solutions. In the initial phase, eSS explored the space of solutions, and then, as local search, we employed the nonlinear least squares solver.
To strengthen the predictive capabilities of the calibrated models, we used cross-validation in the identification of M PLac,r and M IP,r . In both cases, the available experimental datasets were randomised and split into training (66%) and test (33%) sets. Parameter estimation was run on the training set starting from 100 initial guesses for the parameter vector. The latter were obtained as Latin hypercube samples within the allowed boundaries for the parameters. Among the optimal solutions, the one that minimises the SSE on the test set was selected as the vector of parameter estimates. We note that, when comparing the informative content of different input classes, parameter estimation was not performed using cross-validation. Details on the allowed bounds for the parameters and the scripts used for parameter estimation are provided in the GitHub repository [17].

Off-Line Optimal Experimental Design
To reflect wet-lab experimental constraints, under the hypothesis of using a microfluidic-based platform, we fixed the sampling times (1 every 5 min) and the experiment duration (3000 min). We further set the initial condition to the steady-state in absence of induction. As a result, we restricted the optimisation to identifying the input (IPTG) time profile that maximises the information yield of the experiment. Here, information was quantified as a metric defined on the Fisher Information Matrix (F ) [6,13], whose formulation for the homoscedastic case reads: where y(i) is the value of the observable (Citrine), simulated with the parameter vector p, at the ith sampling instant, σ 2 represents its variance and N is the number of sampling times. Note that σ 2 are independently distributed samples drawn from a normal distribution with mean 0 and variance equal to the 5% of the maximum expression level of Citrine. The F sets a lower bound on the variance of the parameter estimates through the Cramer-Rao inequality: where C is the covariance matrix. Intuitively, as the eigenvalues of the F are related to the inverse of parametric variances, attempting to maximise the determinant of F (D-optimality) corresponds to minimising the product of the parametric variances. To find the most informative input (u * ), we formulated MBOED as an optimal control problem and searched for: where p * is the parameter vector considered for OID. We note that, in each of the N j = 100 designed input profiles, the optimisation started from an initial guess for the parameter vector p * obtained as Latin hypercube sample within the allowed boundaries for the parameters. Based on a comparison between Differential Evolution (DE) [18], a global optimisation method featuring good convergence properties and suitable for parallelisation, and eSS [11] (results not shown), we selected the latter to solve the optimisation problem. The solvers fminsearch and fmincon were employed for the local and final-local search. Details on the allowed bounds for the parameters, the 100 initial guesses for the parameter vector and the true value of the parameters are provided in the GitHub repository [17].

Clustering of the Optimised Inputs
To further explore the properties of the optimally-designed inputs, we performed a cluster analysis by self-organising maps (SOM). This is a two-stage approach, in which the N j = 100 optimally designed inputs are grouped into a reduced number of prototype vectors by the SOM, and then the SOM is clustered. In the first phase, the SOM was linearly initialised as a 7 × 7 rectangular lattice which was trained for 30 iterations in batch mode, using a Gaussian neighbourhood function with constant radius equal to 1 [15]. Linear initialisation and batch training algorithm were selected for their computational efficiency [19]. The remaining parameters, upon exploration of alternative sizes, neighbourhood functions and initial/final radii, were identified as the configuration with the best quantitative measure of mapping quality, i.e., quantization and topological errors. Hence, the SOM nodes were grouped by hierarchical, agglomerative clustering using Euclidean distance (default) and average linkage. The obtained dendrogram was verified for dissimilarity and inconsistency.
Abstraction of the inputs to the prototype vectors was performed using the SOM Toolbox [15]. We used the Statistical and Machine learning MATLAB toolbox to cluster the SOM. Principal component analysis of the prototype vectors was performed in Python.

On-Line Optimal Experimental Design
To compare the performance of on-line OID with the off-line counterpart, we adopted the same sampling times (1 every 5 min), experiment duration (3000 min) and number of steps (15). This resulted in t H = 1000 and 600 min for n = 3 and 5 loops, respectively. We further set the initial condition of each experiment to the steady-state in absence of induction. In each sub-experiment, the optimal input was applied to the system to simulate Citrine dynamics and to obtain pseudo-data. The augmented dataset was used to update the model parameters and compute the initial state of the next sub-experiment. Hence, MBOED was used to design the input for the following loop. We note that, to reduce the computational cost incurred in on-line OID, the F was calculated only over the sampling times of each loop.

Conclusions and Future Work
In summary, we highlight MBOED as an enabling tool for the inference of predictive mathematical models of biological parts in synthetic biology. Focusing on the identification of a mathematical model of a synthetic promoter, whose simplicity would encourage the definition of intuitive experimental schemes, our computational results report a ∼60% increase in the confidence of parameter estimates when optimally designed input profiles are compared to intuition-driven stimuli. While this result confirms evidence in the scientific literature [7], in this work, we highlight for the first time, to the best of our knowledge, that the reduction of the parameter search space allowed by on-line OED further leads to an 84% improvement in the accuracy of parameter inference. Besides the necessary in vivo validation and investigation of the scalability of computational cost for systems of higher complexity, these results motivate efforts towards the development of cyber physical platforms to automate model calibration, in which MBOED and microfluidic experiments are embedded in an identification loop.