Component Characterization in a Growth-Dependent Physiological Context : Optimal Experimental Design

Synthetic biology design challenges have driven the use of mathematical models to characterize genetic components and to explore complex design spaces. Traditional approaches to characterization have largely ignored the effect of strain and growth conditions on the dynamics of synthetic genetic circuits, and have thus confounded intrinsic features of the circuit components with cell-level context effects. We present a model that distinguishes an activated gene’s intrinsic kinetics from its physiological context. We then demonstrate an optimal experimental design approach to identify dynamic induction experiments for efficient estimation of the component’s intrinsic parameters. Maximally informative experiments are chosen by formulating the design as an optimal control problem; direct multiple-shooting is used to identify the optimum. Our numerical results suggest that the intrinsic parameters of a genetic component can be more accurately estimated using optimal experimental designs, and that the choice of growth rates, sampling schedule, and input profile each play an important role. The proposed approach to coupled component?host modelling can support gene circuit design across a range of physiological conditions. Record Type: Published Article Submitted To: LAPSE (Living Archive for Process Systems Engineering) Citation (overall record, always the latest version): LAPSE:2019.0539 Citation (this specific file, latest version): LAPSE:2019.0539-1 Citation (this specific file, this version): LAPSE:2019.0539-1v1 DOI of Published Version: https://doi.org/10.3390/pr7010052 License: Creative Commons Attribution 4.0 International (CC BY 4.0) Powered by TCPDF (www.tcpdf.org)


Introduction
Proposed applications of synthetic biology demand complex synthetic constructs involving dynamic internal regulation.Novel analytic and experimental approaches will be needed to efficiently navigate the corresponding design space [1,2].Model-based design approaches promise to (partially) replace costly experiments with computer simulations.A range of theoretical approaches to automated or computer-assisted design have been published [3][4][5][6][7][8][9][10]. When supported by reliable characterization of genetic regulatory components, automated design algorithms can greatly increase the efficiency of design.As an example, Nielsen et al. demonstrated efficient automated design of very large genetic logic circuits from a carefully designed and characterized regulatory library [11].Systematic characterization of genetic components can include the generation of standardized data sheets [12], the use of standardized relative units [13], and the predictive characterization of circuit dynamics [14].This type of characterization is demanding and resource-intensive; to date it has rarely been accomplished.The resulting knowledge deficit is a major bottleneck to the wider use of model-based design.
A drawback of standard approaches to characterization is that they fail to distinguish host physiology from the intrinsic properties of the construct [15].The physiological state incorporates, among other features, (i) the DNA quantity and gene copy number, (ii) available RNA polymerases (RNAPs) and ribosomes, and (iii) the cell volume, all of which impact gene expression [16,17].Calibration of model parameters without accounting for the cell's physiology results in aggregate parameters that describe lumped effects from both host and component.Such a model cannot be trusted to extrapolate beyond the physiological state in which it was calibrated.Separating host state and component behaviour is a difficult and multivariate problem.Experimentalists cannot directly perturb or even measure many physiological properties.The cell's physiological state can be modulated indirectly by external or internal perturbations including modulation of (i) nutrient sources [17,18], (ii) antibiotics [18], (iii) gene expression burden [18], and (iv) metabolic fluxes [19,20], as well as temperature, pH, and osmolarity.The aggregate effect of each of these perturbations influences growth rate, but predicting host properties from the perturbation or the growth rate is not generally possible.However, for nutrient-limited growth, it has been shown that the exponential growth rate acts as a summary statistic for the physiological state of an E. coli host cell [16,17].Klumpp and Hwa demonstrated how nutrient-limited growth rates can then be used as an aggregate predictor of the physiological effects on gene expression [16].
In [16], Klumpp and Hwa stop short of proposing an explicit dynamic model for coupling physiology to gene expression.They focus primarily on steady-state behaviour.More recent works have developed coupled models of gene expression, host physiology, and growth rates [21][22][23].
Here, we focus specifically on the use of a coarse-grained model to empirically predict physiological properties from an observed exponential growth rate.Our model distinguishes intrinsic parameters of the genetic construct from extrinsic parameters of the host's physiological state.This distinction allows for extrapolation across physiological conditions and for reuse of the estimated component parameters.We have aimed to keep the parameter set small to maximize both identifiability and interpretability.
Noise and nonlinearity make precise estimation of the parameters that characterize biomolecular systems a challenging task [24,25].Optimal experimental design (OED) tools offer a means to improve the efficiency of data collection for model calibration [26,27].Despite its potential to increase experimental efficiency and the precision of parameter estimates, OED has not seen widespread use in laboratory experiments within systems or synthetic biology.Two notable exceptions are reported by Bandara et al. and Ruess et al.; both groups implemented optimal experimental design for efficient calibration of dynamic biological models [28,29].OED techniques are especially promising when coupled with optogenetic or microfluidic techniques, which allow for a broad range of dynamic perturbations in vivo [30].Here, we employ optimal experimental design algorithms originally demonstrated for chemical and bioprocess engineering applications [31][32][33] for the characterization of a genetic component from simulated data.
We develop our physiologically aware model of gene expression for an exponentially growing E. coli population, because this is the only host system for which the necessary data has been collected.We use nutrient quality in exponential phase as a predictable controller of the cell's growth rate and relevant physiological state.We use the model to optimally design dynamic induction experiments across a set of growth rates to simulate efficient estimation of the intrinsic parameters of the genetic component.We adopt an optimal design approach that expresses the experimental design as an optimal control problem, which can be efficiently solved using numerical optimal control methods, such as multiple shooting.We treat the induction profile, growth rate, and sampling schedule as experimental controls.Their selection is simultaneously optimized over multiple sub-experiments.This simultaneous optimization of sampling schedule and multiple experimental perturbations, including growth rates, is an improvement over previously published accounts of OED in synthetic biology (although see [34]).Previously published analyses either have assumed constant sampling rates or have chosen sampling schedules in a secondary optimization step [29,35], both of which are sub-optimal [36].We use numerical simulations to demonstrate that the optimal experiments outperform intuitively designed experiments: the optimal designs improve both parameter estimation accuracy and out-of-sample prediction accuracy.We further demonstrate that the use of multiple growth rates is important for model identifiability over realistic parameter ranges.

Materials and Methods
In this work, we design optimal experiments for calibrating a dynamic model of expression of a genomically integrated gene, induced by an activating transcription factor (TF).For simplicity, we assume that the controlled induction input is the activating transcription factor copy number, u. (More realistically, the controlled input would be some signal that influences the TF abundance.)Our population-averaged model incorporates both gene-specific intrinsic parameters and parameters that characterize aspects of the host physiology.The physiological parameters of the model are dependent on the steady-state exponential growth rate, λ, controlled via nutrient quality.The model describes mRNA copy number, X rna , and protein copy number, X prot , d dt ( The gene's intrinsic characteristics are captured by the intrinsic parameters α, K r , K t , K rt , δ, β, and K M , while expression is also dependent on the physiological parameters: V, g, P a , G, R f , and λ, which are growth-rate-dependent, and η and ξ, which are fixed.The intrinsic parameters characterize each gene's induction behaviour.In contrast, the physiological parameters reflect the state of the host cell (Figure 1).These parameters are summarized in Table 1, along with nominal values and relevant ranges.Justification for these parameter values is provided in the next section.The observed growth rate can be used to predict many physiological parameters of the host, which in turn influence gene expression.(B) By accounting for physiological parameters, we can optimally estimate intrinsic parameters of a genetic construct, which reflect properties of its sequence.These intrinsic parameters can be reused across growth conditions and can be used to guide changes to the construct sequence.

Derivation of the Physiological Gene Expression Model
We develop the model for the case of an exponentially growing E. coli population, where growth rate is controlled by nutrient limitation.While some features of the model may generalize to other cell types, it is only in the case of E. coli that sufficient data has been collected to provide reasonable estimates of functional relations and parameter values.
We employ two standard measures of growth rate.The doubling rate µ is the inverse of the doubling time τ µ .The exponential growth rate λ = ln(2)/τ µ is used to describe population growth as P 0 e λt , and is thus suitable to use as the dilution rate in a differential equation model.We treat the exponential growth rate, λ, as an independent variable determined by experimental conditions.We also assume the growth rate is constant throughout each experiment, which implies that no nutrient shifts occur.

Cell Volume and Mass, DNA Content, and Protein Mass
Prior work in bacterial physiology has established expressions for the physiological parameters V, G, and g in terms of λ. (While each of the parameters is expected to undergo random fluctuations and to vary with the cell cycle, these expressions approximate average values over time.) Cell volume has been shown to scale with growth rate exponentially [37,38]: Here the constant V 0 is the "initiation volume" measured to be V 0 = 0.28 µm 3 by Si et al. [37].The parameters C and D represent the time periods required to replicate the chromosome and to septate, respectively.We take C = 40 min and D = 20 min [38].
An expression for the average number of genome equivalent lengths of DNA, G, is given in terms of C, D, and λ by Cooper and Helmstetter [38]: Individual loci vary in copy number based both on growth rate and on their location relative to the origin of replication.(At faster growth rates there are more copies of genes near the origin because the cell must have multiple rounds of DNA replication underway.)We use the constant l ori to indicate the gene's position relative to the origin: l ori = 0 at the origin; l ori = 1 at the terminus.Bremer and Churchwood [39] provide a simple derivation for the average gene copy number, g, of a specific locus as g = e ((C+D)−l ori C)λ . (4)

Total RNA Polymerase (RNAP)
We first note that the buoyant density of E. coli has been observed as constant across growth rates: Kubitschek et al. found ρ = 1.09 pg µm −3 [40] while Basan et al. report ρ = 0.215 pg µm −3 [41] (average of reported values).We use the dry weight data from [17] with Si et al.'s description of volume [37] to select an intermediate density of ρ = 0.55 pg µm −3 .
The average cell mass for a given growth rate is then This total cell mass M Tot can be partitioned into fractions of protein and other constituents.We fit the protein fraction of the mass, denoted Φ pr , to data from [17] with a linear function (Figure S1): with κ pr = −6.47min and Φ pr0 = 0.65.The total RNAP fraction of the overall protein mass, which we denote Φ p , exhibits an approximately linear dependence on growth rate, which we fit to data from [17] (Figure S2) as with κ p = 0.30 min and Φ p0 = 0.0074.We can then express the total mass of all RNAP protein, M RN AP , as Protein Mass The total number of RNAPs per cell can be determined by dividing M RN AP by the protein mass per RNAP core enzyme, which we denote m rnap and is estimated as 6.3 × 10 −7 pg [42].

Available RNAP
Each RNAP can be classified by its state: freely diffusing, weakly DNA bound at a non-specific site, actively transcribing other genes, paused or non-functioning during transcription, or immature [43][44][45].Below, we describe promoter binding in terms of a thermodynamic model that takes into account the observed rapid equilibrium between non-specifically bound and freely diffusing RNAPs [43].We therefore define the combined pool of free and non-specifically bound RNAPs as the available RNAP pool, with molecular population size P a .Denoting the fraction of RNAP in this available pool by Φ a , we estimate, from Bakshi et al. [43] and Stracy et al. [46] (details in the Supplementary Materials Section 1.3): with κ a = −9.3min and Φ a0 = 0.59.Using this relation we have an expression for the available RNAP:

Transcription Rate
We describe the initiation of transcription via a thermodynamic equilibrium model of promoter occupancy involving available RNAPs, transcription factors (TFs), promoter copies, and non-specific binding sites along the genomic DNA [47,48].At a given growth rate, we assume there are P a available RNAP copies and T a active transcription factor copies diffusing along the genomic DNA, and that the DNA contains N s non-specific binding sites to which these DNA binding proteins may weakly attach and g copies of the regulated promoter of interest.Further, we assume that N s P a , T a , g and that each binding of an RNAP or a TF to a non-specific site or a promoter is characterized by an associated binding energy; rn and rg for RNAP binding to the non-specific sites and promoters, respectively, and tn and tg for transcription factor binding to non-specific sites and promoters, respectively (all are negative [49]).
We use these species and site counts to enumerate the possible arrangements of RNAP and TF across the genome, and we use the binding energies to derive Boltzmann weights for each arrangement [47].We denote the differences between the energy involved in binding the promoter and the background non-specific binding as ∆ t = tg − tn and ∆ r = rg − rn .(Note, tn > tg and rn > rg so that ∆ t and ∆ r are both negative [49].)We denote the Boltzmann weights as K r = e (with k B , Boltzmann's constant, and T, temperature in degrees Kelvin) where rt is the binding energy between RNA polymerase and transcription factor when both are bound to the same promoter.Then, the equilibrium probability of a single promoter being occupied by an RNAP is (further details in the Supplementary Materials Section 1.4) With g promoter copies, and presuming that open complex and promoter escape occurs at a fixed rate α [50] we have the initiation rate as We next establish estimates of the parameter values for transcription.To begin, multiplying G by the genomic density of non-specific binding sites, η = 5 × 10 6 sites/genome, yields an estimate of the total number of non-specific binding sites N s as a function of growth rate [47].
From Heyduk et al., we have the rate of open complex formation α for the phage lambda P R promoter as 19.2 min −1 [50].Assuming P R exemplifies a strong promoter, we estimate a reasonable range based on analysis of constitutive promoters and mutants of the P R sequences as 1 − 30 min −1 [50][51][52][53].
The constants K r and K t can be interpreted as ratios of dissociation constants for the DNA-binding species (RNAP and TF) binding to non-specific DNA versus the promoter sequence [47].This provides a convenient method for constraining their feasible values from reported dissociation rates.Expressing them as such yields; Here K ns rnap and K ns t f are dissociation constants of the TF and RNAP with respect to non-specific DNA binding and K prom rnap and K prom t f are dissociation constants for promoter binding.The non-specific dissociation constant for RNAP, K ns rnap , has been observed to be approximately 10, 000 nM [47].The promoter-specific dissociation constant, K prom rnap , varies from promoter to promoter.For lacP1, it is approximately 550 nM and for T7 it is approximately 3 nM [47].To represent a relatively weak constitutive leak for an inducible promoter, we expect values nearer to the lacP1 dissociation constant would be reasonable, and so presume K prom rnap could range from 250 to 1000 nM.Using the above value for K ns rnap and the range for K prom rnap , we estimate a feasible range for K r to be between 10 and 40 (unitless).We set K prom rnap to be 250 nM, so our nominal value for K r is 40, suggesting a slightly stronger leak than found in uninduced lacP1.(Over the growth rates we consider, we found K r to be practically unidentifiable, with high variability in its estimated value.We therefore fixed it to the nominal value for our experimental design and fitting.) To estimate K t , Stormo suggests a reasonable range for the promoter-specific dissociation constant, K prom t f , of 0.01-1000 nM [54].We assume we are working with a relatively strong activator and that lies in the range 0.01-5 nM.Stormo also suggests that non-specific binding dissociation constants are between three and six orders of magnitude less than the specific binding constants.For simplicity, we follow Bintu et al. and let K ns t f = 10,000 nM [47].This yields a range for K t between 2000 and 10 6 .We chose K prom t f = 0.02 and K t = 5 × 10 5 as nominal values.As noted, the parameter K rt can be expressed as K rt = K r K t exp(− rt /k B T) with rt representing the energy involved in the RNAP-TF interaction at the promoter.Few estimates exist for such values.Bintu et al. use demonstrative values of rt /k B T ranging from −3.5 to −4.5.We allow a wider range from −3 to −5.This allows K rt to range from 4.02 × 10 5 to 1.61 × 10 10 .We take a nominal value of K rt = 1.09 × 10 9 .

mRNA Degradation
The data in [17] indicate that the mRNA decay rate is linear in mRNA copy number, with a growth-rate-independent rate constant.In [16], the authors hypothesize that this is due to the maintenance of a constant concentration of RNase E, the primary ribonuclease involved in mRNA decay initiation in E. coli [55].(RNase E exhibits auto-regulation which appears to keep its concentration constant [56][57][58].Moreover, RNase E appears to be expressed in excess, resulting in insensitivity to small changes in its abundance [58,59].)Under this assumption and using mass action, we have a simple model for mRNA decay as follows mRNA Decay Rate (copy Here E is the copy number of RNase E. Assuming E V is constant across growth rates (and therefore E ∝ V), we can express the mRNA degradation rate as mRNA Decay Rate (copy # per min) = δξX rna . ( Here ξ is the constant concentration of relevant degrading enzymes in the host.The parameter δ represents the susceptibility of the the mRNA transcript to RNase degradation, and acts as a mass action constant.The half-life of mRNAs can range from 1 − 10 min; Chen et al. suggest the mean RNA half life is near to 2.5 min [60][61][62].Based on a constant concentration of RNase E of ξ = 900 µm −3 [63] and taking a nominal half-life of 3 min, we suggest δ ≈ 2.57 × 10 −4 µm −3 min −1 .

Total and Free Ribosome Populations
A linear relation for the fraction, Φ r , of protein mass that is composed of ribosomal protein was derived in [18]: Fitting this model to data from Bremer [17] yields estimates of κ r = 5.48 min and Φ r0 = 0.030, (see Figure S4).The total ribosomal protein mass M Rib is then Each ribosome has a mass of 2.7 MDa of which 35% is protein [64].The individual ribosomal protein mass is therefore m rib = 1.57× 10 −6 pg.We then have the number of ribosomes per cell, R Tot , as Based on results in Dai et al. [65], we have assumed that about 10% of the ribosomes are inactive.Assuming that these are free, we denote this fraction as Φ f = 0.1 and have

Translation Rate
We employ the Michaelis-Menten model of translation initiation proposed by Borkowski et al. [66].In terms of the free ribosome concentration R f , Translation Rate (copy # per min) = β In this model, the mRNA's ribosome binding site (RBS) is characterized by two constants: β, the maximal translation initiation rate per mRNA, and K M , a half-saturating constant specific to the given RBS.A detailed justification for this model is provided in the Supplementary Materials Section 1.6.
To estimate β, we note that translation initiation rates on the order of β = 4 min −1 have been observed for lacA [67].Additional studies of RBS activity suggest a wide variation in expression levels induced by RBS alterations, ranging over orders of magnitude [68,69].We presume that a reasonable range for the maximal translation initiation rate is from 1 min −1 to 10 min −1 .
The values of K M reported by Borkowski et al. [66] are reported in arbitrary units.They observe a range of saturation levels from near linear behaviour to near constant saturation [66] from which we can propose a range of K M values based on the observed free ribosome concentration, which in our model is between 1000 µm −3 and 3000 µm −3 .We let K M range between 750 µm −3 and 1500 µm −3 , with a nominal value of K M = 750 µm −3 , which is approaching saturation (and hence the near constant translation efficiency observed by Klumpp and Bremer [16,17]).

Optimal Experimental Design
To illustrate calibration of the intrinsic parameters of the model, we consider a set of dynamic induction experiments conducted over multiple growth rates, where the experimenter can select (i) the (constant) growth rate, (ii) the time-varying induction profile, and (iii) the sampling schedule.We define an experiment as a set of N = 3 sub-experiments, each with a constant growth rate λ (possibly repeated), organized into vector λ = [λ (1) , λ (2) , λ (3) ].(The exponential growth rates, λ (i) , are reported as doubling rates µ (i) (db/hr) in the results for interpretability).In each sub-experiment, the population begins at rest and responds to a dynamic induction signal in the form of a time varying transcription factor copy number, u(t) = [u (1) (t), u (2) (t), u (3) (t)].(Such an input u (i) (t) could be implemented by, e.g., a calibrated induction system [70] or closed loop control of fluorescently tagged TFs [71].)For computational efficiency, we have restricted each u (i) (t) to be a piecewise constant, with 6 constant control values delivered for 100 min each, for a total duration of t f = 600 min.We have constrained the maximum value of each u (i) (t) to u max = 1000 and the minimum to u min = 0.This wide range was selected to ensure it spans the unsaturated range of the promoter well in excess.For each sub-experiment, we also select sampling schedules for both mRNA and protein species.A single sampling schedule for a specific species is defined as a list of points, s (species) = {p 1 , ...p l , ..., p L }.Each p l = (τ l , c l ) consists of a time τ l ∈ [0, 600] and a positive integer count c l of the number of samples taken at that time.We have restricted the design such that c l ∈ {1, 2, 3}.The total number of samples for each species, c Tot is constrained to be no more than 12.The sampling schedule for each sub-experiment, S (i) , consists of a schedule for each species (prot) }.These are collected into an overall schedule: S = {S (1) , S (2) , S (3) }.
Optimization over the sampling schedules as defined above involves integer programming, which poses significant computational challenges.We avoid these by employing a relaxation approach [33,72,73].We treat the sampling schedule as a continuous sampling density such that (prot) (t)} and the overall sampling schedule is W (t) = [W (1) , W (2) , W (3) ].These sampling densities are restricted to be a piecewise constant over 48 equal intervals (each of length 12.5 min) during each sub-experiment.Each sampling density w (i) (species) (t) is upper bounded by w max = 3  12.5 , and the integral of the sampling distribution, ŵ(species) n is bounded by the maximal number of samples, c Tot : The relaxed schedule must be discretized to arrive at an implementable sampling schedule [33].In practice, the optimal densities are often bang-bang [73] and so can be easily discretized by the sampling interval length (12.5 min) to recover integer sample counts.To account for non-integer values, we applied the Sum-Up-Rounding strategy, a common heuristic in integer programming, to recover integer solutions that approximate the optimal discrete schedule [72].This discretization strategy effectively rounds the continuous density while respecting the sampling constraints [72].After generating the sampling counts, c l , from the sampling densities, we fixed their associated times τ l to the centre of each of the 12.5 min intervals.
To characterize a genetic construct with the induction experiments as defined above, we seek an experimental design that can accurately estimate the intrinsic parameter set θ.As stated in Section 2.1.4,the parameter K r proved to be practically unidentifiable under the model specifications described above.Consequently, we fixed it to its nominal value, and took the intrinsic parameter set to be estimated as θ = [α β K t K rt δ].These parameters characterize the regulated promoter, its regulating transcription factor and the downstream gene sequence, independent of growth context.We denote an estimate of the true parameters as θ.For our objective, we seek to minimize the determinant of the covariance matrix det cov( θ) , in what is known as a D-optimal design [74].The determinant of the covariance matrix is also known as the generalized variance [75]; minimizing it is equivalent to minimizing the volume of the confidence ellipsoid defined by the covariance matrix [76].Because the model is nonlinear and the true parameter vector is, by definition, unknown, it is not possible to estimate the parameter covariance matrix a priori.Instead of estimating cov( θ) directly, we use the Fisher information computed at an initial guess θ o as a proxy.In linear models (or in the limit of large sample sizes or low signal to noise ratio), the inverse of the FIM is asymptotically equivalent to cov( θ) [77]: In this case minimizing det cov( θ) is equivalent to maximizing the determinant of the Fisher information matrix, det (I).In the non-linear finite sample regime in which our experiments are conducted, this relation is admittedly tenuous; the use of an initial guess θ o also introduces potential errors.However, the process of iterating the local approximation and successively updating θ o after each experiment has been numerically demonstrated to yield convergence to the true parameter set in some cases [26,27].We thus define our objective as Θ D (I) = det (I), which is generally known as the D-optimality score [76].(Other options such as A, E, and E-modified optimality minimize other properties of the covariance [74,76,78].) To determine the optimal set of experiments, we follow an optimal control-based procedure described in [31][32][33].(This approach to OED was demonstrated by application to chemical and bioprocess models.It has seen limited use in systems biology [28] and has not previously been applied to component characterization in a synthetic biology context.)To formulate a control problem we must state the objective, the D-optimal score Θ D (I) = det (I), in terms of the model dynamics.We restate the model in the following generic form: Here X = [X rna , X prot ] and F is the right hand side of the model expressed in Equation ( 1).We determine the local sensitivities of X rna and X prot with respect to each parameter θ i , by solving the following system of sensitivity equations Because the parameters are dimensioned, their scalings can vary widely, leading to poor conditioning of the Fisher information matrix and associated computations [31].To rectify this, we scale the sensitivities by the parameter values (i.e., we use logarithmic sensitivities) Applying this change of variables yields The initial conditions for the state variables and their sensitivities are constrained to be at steady state with respect to a zero induction input (u = 0): Because the steady state depends on both the growth rate and the initialized parameter values, these initial conditions are not constants.They therefore appear as nonlinear constraints in the optimization problem.
We denote the sampling variance for observations of species by σ 2 species .We assume normally distributed errors with variances equal to 5% of the species quantity: We further assume no covariance between species measurements.The (scaled) Fisher information matrix I(θ, t) can then be written for the relaxed problem as a differential equation, as per [33] where the matrices Ω(t) and Σ(t) are defined as Here w (species) n (t) are sampling densities, as defined above.Further details on the integration of the Fisher information over continuous sampling densities can be found in [33,73].Because samples are assumed to be independent across time points as well as species, the Fisher information for the experiment is additive across both.(Here we use the Fisher information for homoskedastic error, despite the heteroskedasticity of the model.We found this simpler formulation performed equivalently and was more tractable.)Therefore, we can express the cumulative Fisher information for a given sub-experiment from time 0 to t f as The Fisher information for the total experiment, I Tot , is then the sum over all sub-experiments The overall objective (D-optimality score) is then The matrix is symmetric, so only the diagonal and lower triangular elements need to be computed.The value of Θ D (I Tot ) can vary over many orders of magnitudes for different experiments, and so to improve numerical accuracy we take the logarithm of the determinant as the objective.Finally, because most optimization packages are minimizers, we invert the sign: For numerical stability, we compute the determinant using QR factorization (details in Section 2 of the Supplementary Materials).
In summary, we formulate our OED optimal control problem as Objective: Subject to (∀i ∈ {1, . . ., N}) : With Constraints: This problem formulation matches that used by Telen et al. and Hoang et al. [31,33].We used a multiple shooting algorithm to solve the optimal control problem [32].We implemented this algorithm in CasADi, a rapid prototyping optimal control toolbox, using its MATLAB interface [79].CasADi uses algorithmic differentiation to compute first and second derivatives with respect to both the objective and the constraints.This higher-order information is used for optimization in an interior-point barrier method implemented in the nonlinear programming package IPOPT [80].Further details of the multiple shooting algorithm and optimization settings can be found in Section 2 of the Supplementary Materials.

Comparing Lumped and Physiologically Aware Models
Models of gene expression typically use lumped parameters that confound physiological effects with intrinsic features.Such models are usually calibrated against data collected from cells grown in a particular medium (and so at a single growth rate).To illustrate the consequences of neglecting growth effects, we consider a lumped version of our growth dependent model (Equation ( 1)) where parameters A = αg V , B = are treated as growth-rate-independent constants.(The one exception is the protein decay rate, which is equal to the exponential growth rate.)Of course, this parameter lumping does not impact model behaviour for the growth rate at which the model is calibrated, but accurate extrapolation to other growth rates cannot be assured.This loss of accuracy is illustrated in Figure 2. Panels A and B show predictions of the full growth-dependent model and the lumped model, both calibrated to fast growth rates (3 db/h).As the growth rate drops, significant deviation in the predicted steady state copy number of both protein and mRNA is observed.In Figures 2C,D, we see similar deviation for the steady state concentrations of these species.(We note that the relative deviations in concentration are comparatively small.Protein concentration is important for, e.g., metabolic enzymes.In contrast, copy number is more important for DNA binding proteins, which tend to disperse along the DNA rather than the total cell volume [43,71].)A complementary measure of inaccuracy due to lumping appears in Figure 2E, which shows how the lumped parameter values vary with growth rate when the full model is used to determine their values (in comparison to their values calibrated at the reference fast growth rate of 3 db/h) .The divergence in these parameters sets has a significant effect on the model's dynamic behaviour.Figure 2F shows the root mean squared error (RMSE) between the two models' response to the input profile depicted in Figure 2G.The difference in dynamic behaviour between the lumped and full model under this dynamic induction, at a slow growth rate of 0.5 db/h, is shown in Figure 2G.

Null and Optimal Experimental Designs
We next explore how experimental design can improve the estimation of the intrinsic parameters of the full growth-dependent model (Equation ( 1)).We have selected a null experiment, shown in Table 2, to represent a reasonable non-optimized design.It consists of a set of logarithmically (base 10) distributed pulses, delivered at evenly spaced growth rates, with evenly spaced samples taken every half-hour (mRNA and protein samples are taken at the same time).We propose this as a sensible first experiment for fitting a dynamic model.We designed the null experiments assuming limited information on the dynamics of the specific gene expression system.They were selected to involve a reasonably comprehensive set of perturbations and measurements.For comparison, we examine three variants of the null experiment, also shown in Table 2, each with a perturbation to either the growth rates, induction profile, or sampling rate of the null.The growth variant is identical to the null case, except it is performed over a narrower range of growth rates.The sampling variant differs from the null by redistributing the samples so that their rate is halved but the sampling number at each point is doubled (with the same total number of samples).The induction variant uses linearly (rather than logarithmically) spaced strengths of induction pulses.Shown with each design is its predicted optimality score (the negative log determinant of the scaled Fisher information matrix computed at the true parameter set: Equation ( 35)).As expected, each variant provides less information than the null (as measured by D-optimality).In contrast, Figure 3 depicts an optimal experiment for our model.This design was generated by using the true nominal parameter vector as the algorithm's initialization parameters.This is an artificial scenario (the true parameters are not known initially, by definition), but it serves to demonstrate the difference between intuitive designs and optimal designs selected by our method.In Figure 3, each column describes a sub-experiment, labeled with the corresponding optimal growth rate.The top row, (A-C), depicts the optimal input profiles; the middle row, (D-F), shows the system response in both mRNA and protein copy number.The last row, (G-I), shows the sampling densities (in the shaded regions) as well as the results of the Sum-Up-Rounding discretization scheme (depicted by the stem plots, where the height is an integer representing the sample count).The optimality score of this design is −69.6.(In this case, the optimal design selected the maximum growth rate twice.We suspect this is due to higher sensitivities at faster growth.More complex models typically demand observations over a wider range of growth rates.) Comparing the null and optimal designs we see that, while the null experiments have an intuitive appeal because of their wide, even distribution of experimental choices, none achieves a score similar to the optimized design.Further, while null variants exhibit differences in optimality scores, when compared to the optimized experiment these differences are small.This suggests that manually tuning aggregate design measures is less effective than the holistic optimization provided by the OED approach.An optimal experiment (designed at the true value of θ).Each column depicts one sub-experiment, labelled with optimal doubling rates: µ (1) , µ (2) and µ (3) (corresponding to exponential growth rates λ (1) , λ (2) and λ (3) ).The top row (A-C) depicts the optimal inputs, u (1) (t), u (2) (t) and u (3) (t) (on a log 10 scale).The middle row (D-F) shows the system response (both mRNA and protein copy number) for each induction profile.The last row shows the sampling densities w   prot (depicted by the stem plot).The optimality score of this experiment was −69.6.

Utility of Optimal Designs for Parameter Identification and Prediction
The difference in optimality scores between the null and optimized experiments suggests significant improvements in parameter estimation using the optimized design.However, the statistical theory used to derive the optimality score can only be guaranteed to hold a priori for linear models in the limit of small observation variances and large samples [77].To validate that our optimality scores correspond to improved parameter estimation accuracy, we used simulated experiments to assess the correlation between theoretical and observed variability.To generate simulated data, we simulated the model using the nominal true parameters and added normally distributed observation error with a variance equal to 5% of the corresponding species count.We used a multiple shooting approach for parameter estimation; obvious outliers were removed (details in Section 2 of the Supplementary Materials). Figure 4A shows, on the vertical axis, the logarithm of the generalized variance (determinant of the observed covariance matrix) for the collection of parameter estimates (each corresponding to an experimental design).This measure of fit variability is computed from the collection of parameter fits to independent simulations of the given experiment.From this set of estimates we computed the observed covariance matrix and the generalized variance.The optimality score of the design is shown on the horizontal axis (lower is better), computed at the true parameter value.Figure 4A shows that the optimality score (objective of the OED approach) and the observed parameter covariance (sampled measure of design 'quality') correlate well.In particular, we note that the optimal design achieves both a better optimal score and a smaller generalized parameter variance when compared to the non-optimal designs.This suggests that the objective function (using the homeostatic FIM) is a useful measure of design quality, despite the nonlinearity and heteroskedasticity of the model.We also note that adopting a uniform sampling schedule or reducing the range of growth rates is expected to result in considerably worse performance, as evidenced by comparison between null variants.In addition to accuracy of parameter estimates, accurate predictions of the system behaviour for out-of-sample conditions is also important for model-based design.For linear regression models, D-optimal designs that minimize the generalized variance of the parameter estimate are equivalent to designs that minimize the upper bound on prediction variance (General Equivalence Theorem) [81].This guarantee does not generalize to our dynamic, nonlinear, and heteroskedastic model.To verify if prediction accuracy indeed correlates with D-optimality for our model, we ran a second simulation study.For each of the parameter estimates used in calculating the fit variability for the experimental designs, we simulated the model response for a new dynamic experiment.We chose out-of-sample conditions: growth rates and induction levels not used in fitting.We simulated the model with the true nominal parameter values, and computed the integrated squared error (ISE) between the true and fitted values in each case (thus including error at all time points).Figure 4B shows the generalized parameter variance along the horizontal and the log ISE on the out-of-sample data set on the vertical axis.The plots show that the generalized parameter variance correlates reasonably well with prediction accuracy: the optimal design performs better than any of the null designs.This reflects the linear case, in which the D-optimal design sets an upper-bound on the expected prediction variance [81].
So far, we have compared designs in the idealized (and artificial) scenario in which OED is applied to the model parametrized by its true values.We relaxed that assumption by first generating five intrinsic parameter sets from a uniform distribution spanning the intervals specified in the feasible ranges from Table 1.We then ran the OED algorithm using models parametrized by these 'perturbed" parameter sets.The optimality scores of these designed experiments, when evaluated against simulated data generated by the true parameter set, are plotted in Figures 4A,B, depicted as X's.

Discussion
We have proposed a physiologically aware model of gene expression in E. coli that accounts for the effects of nutrient limitation on the host physiology.The model is more complex than standard models of gene expression, but this comes with several benefits.The model naturally suggests a partitioning of the parameter set into (1) intrinsic parameters that characterize the genetic component and (2) a set of empirically derived parameters characterizing the host's physiological state as a function of nutrient-mediated growth rate.The intrinsic parameters can thus be reused across a range of growth conditions.In particular, because the intrinsic parameters can be linked to sequence properties of the component (e.g., promoter affinities, RBS strength, and mRNA stability), they could provide insight as to which aspect of a component's DNA sequence could be altered to achieve a desired effect across a variety of contexts.Future work could attempt to link such intrinsic parameters to sequence properties directly.In contrast, the extrinsic parameters can be used to predict the host's context based on the observed growth rate in a range of nutrient conditions.
Accurate estimation of the intrinsic parameter set is more difficult than estimation in context-naive models.We have demonstrated that optimal experimental design can mitigate this challenge by identifying maximally informative experiments.We applied a comprehensive OED platform, optimizing over constant growth rate, time-dependent induction profile, and sampling schedule.Many current experimental design approaches in systems biology use general purpose optimization algorithms that scale poorly to such multivariate and non-linear designs [30,74].This often leads to non-optimal selection of certain design variables [36].To achieve computational efficiency in our optimization tasks, we re-interpreted the problem via optimal control, with growth rates, sampling schedules, and induction levels as control variables.This optimal control framing of OED problems has so far been underutilized in systems biology; it allows access to the rich tool-set developed by control and process engineers, including direct multiple-shooting and collocation methods [31][32][33].These methods are ideally suited for use with emerging experimental tools such as optogenetic induction systems and automated culture and microfluidic devices [30].Our numerical results suggest that the multiple shooting algorithm provides an efficient method to optimize experiments, improving both parameter estimation accuracy and prediction accuracy over unoptimized designs.
The optimal experimental design algorithm presented here could be improved in a number of ways.We used the simplest and most tractable form of the Fisher information.More accurate formulations account for the sample variance's parametric dependence [29] and for the constraints imposed by initial conditions [82].Additionally, our approach provides a design that may be only locally optimal.Current iterative designs are based on the assumption that the iterated algorithm does not become trapped in some local minimum.Past numerical studies of iterative applications of OED have shown consistent convergence to the true parameter set (global optimum), but there is scope for more rigorous investigation [26,27].Improved computational efficiency may allow users to address multiple scenarios and larger models, or even implement online experimental design algorithms that can update the design in real-time [83,84].Our algorithm was implemented with CasADi, which provides rapid-prototyping capabilities for control problems, algorithmic differentiation and interfaces to powerful non-linear programming solvers like IPOPT [79,80].This tool facilitates rapid implementation, but requires some mathematical expertise.Further packaging of OED tools for use by experimentalists would no doubt increase adoption in systems and synthetic biology.
The experiments proposed in this work are demanding, requiring sampling of multiple species and control of inputs over extended time periods.In practice, it is not currently possible to precisely control the transcription factor count in vivo.It would suffice to map the experimental input (e.g., light, chemical inducer) to the expected average copy number of TF per cell.This could be implemented through precise calibration experiments, achievable with current techniques [70].Future experiments may be able to implement closed-loop control using real-time measurements to ensure the robust tracking of desired inputs.It also may be possible to modify the OED algorithm to account for the effects of input variability (an errors-in model).In this work, we have assumed time-series measurements of both mRNA and protein quantities.Emerging experimental tools, combined with assays like RNA-seq [85], will likely enable these complex dynamic experiments in the future [30].
For the model considered here, observing only one species results in a structurally unidentifiable parameter set.Any parameter lumping to alleviate unidentifiability leads to a loss of modularity in characterization.Even with observations of both species, the parameter K r , which characterizes promoter leak, had to be excluded from the analysis because it was practically unidentifiable over reasonable parameter ranges.In practice, it may be possible to ignore promoter leakiness in many cases.However, for strong leaks, it is important to identify the parameter K r ; fixing it, as we have, may introduce significant bias in other parameter estimates.In addition to these specific identifiability concerns, the re-usability of parameter estimates is limited by the use of relative units, which are specific to the measurement instrument they are calibrated on.The methods used in this work should ideally be implemented using absolute units, which will allow for comparison between parameter values calibrated on different instruments and in different labs [86][87][88].
Our description of physiological state was restricted to a single case: exponentially growing E. coli host cells, with growth rate determined by nutrient quality.Our model formulation was made possible by significant experimental efforts into characterizing precisely this physiological response [17].Beyond nutrient limitation, several other relevant growth perturbations are of interest to synthetic biologists, including translation-inhibiting antibiotics, gene expression burden, and metabolic knockdowns.Phenomenological growth laws and coarse-grained proteome models have been extended to some of these conditions [18,19,41].Recent results suggest that certain metabolic perturbations may effect proteome partitioning (and possibly RNAP and ribosomal fractions) in a manner similar to nutrient limitation [19,20].Certain physiological properties, such as the ribosome-to-protein ratio, can also be predictably linked to the growth rate controlled by expression burden or antibiotic dosage [18,41].The linear ribosome-to-protein ratios may even generalize across species (Figure S1 of [18]).The details necessary to link these coarse-grained models to the specifics of gene expression, as originally proposed in Klumpp and Hwa [16] (and further elaborated on here) are still lacking.While the nutrient-limited growth theories have taken decades to build, modern techniques can likely accelerate this host characterization process, allowing generalization across strains, growth inhibition conditions, and potentially even microbial species.
Extensions of phenomenological growth theories to heterologous protein expression burden and modification of metabolic fluxes could have significant impact on metabolic engineering.These effects are especially relevant to the emerging sub-discipline of dynamic metabolic engineering, where synthetic gene expression circuits dynamically modulate both fluxes and heterologous enzyme expression [89][90][91].Design of such systems is challenging; the engineered regulatory components modulate expression burden and enzymatic activity, which in turn affects the growth rate, broader cell physiology, and the regulatory component behaviour itself [92][93][94].An extended growth theory combined with the optimal experimental design algorithm and gene expression models presented here could be valuable for guiding future work in this area.
We have proposed the coupling of physiologically aware modelling and OED techniques to address limitations in the accuracy and generalizability of component characterization.Current gene expression models often fail to account for the host or environmental state, and are calibrated with poorly constrained parameter estimates using ad hoc experimental designs.These shortcomings contribute to the limited use of model-based design in synthetic biology.Poor parameter estimates result in lackluster predictive accuracy, and context-naive model predictions cannot generalize, resulting in recalibration for each new design and circumstance, and thus little advantage of model-based design over trial-and-error approaches.Our model has yet to be validated in vivo, but wider use of such coarse-grained, context-aware models and optimal experimental designs will ideally maximize researchers' return on experimental investment and aid efforts to rationally design gene expression circuits.

Figure 1 .
Figure 1.(A)Growth media nutrient quality dictates the growth rate of an E. coli culture.The observed growth rate can be used to predict many physiological parameters of the host, which in turn influence gene expression.(B) By accounting for physiological parameters, we can optimally estimate intrinsic parameters of a genetic construct, which reflect properties of its sequence.These intrinsic parameters can be reused across growth conditions and can be used to guide changes to the construct sequence.

Figure 2 .
Figure 2. (A,B) Steady state protein (A) and mRNA (B) copy number predictions for the lumped and full model across growth rate for two different constant input levels.(C,D) Corresponding protein (C) and mRNA (D) concentrations.(E) Relative difference in the lumped parameter values, fit at 3 db/h, compared with the values predicted by the full growth dependent model.(F) Root mean squared protein concentration error over the dynamic simulation (G) between the full and lumped model.(G) Predicted protein concentrations of the lumped and full model over a dynamic simulation with a growth rate of 0.5 db/h.

Figure 4 .
Figure 4. (A) Results from the true optimal (optimal experiment designed at true parameter value) and null experiments (circles): optimality score on the horizontal axis; observed generalized variance of the parameter estimate on the horizontal axis.(B) The same experiments: generalized parameter variance on the horizontal axis; log integral of the squared error on an out-of-sample prediction experiment (inset) on the vertical axis.Optimal experiments designed with erroneous initial parameter guesses are shown as X's.(Linear trend lines also shown.)