A Modelling Framework Linking Resource-Based Stochastic Translation to the Optimal Design of Synthetic Constructs

Simple Summary In synthetic biology, it is commonplace to design and insert gene expression constructs into cells for the production of useful proteins. In order to maximise production yield, it is useful to predict the performance of these “engineered cells” in advance of conducting experiments. This is typically a complex task, which in recent years has motivated the use of “whole-cell models” (WCMs) that act as computational tools for predicting different aspects of cell growth. Many useful WCMs exist, however a common problem is their over-simplification of ribosome movement on mRNA transcripts during translation. WCMs typically don’t consider that, for constructs with inefficient (“slow”) codons, ribosomes can stall and form “traffic jams”, thereby becoming unavailable for translation of other proteins. To more accurately address these scenarios, we have built a computational framework that combines whole-cell modelling with a detailed account of ribosome movement on mRNA. We show how our framework can be used to link the modular design of a gene expression construct (via its promoter, ribosome binding site and codon composition) to protein yield during continuous cell culture, with a particular focus on how the optimal design can change over time in the presence or absence of “slow” codons. Abstract The effect of gene expression burden on engineered cells has motivated the use of “whole-cell models” (WCMs) that use shared cellular resources to predict how unnatural gene expression affects cell growth. A common problem with many WCMs is their inability to capture translation in sufficient detail to consider the impact of ribosomal queue formation on mRNA transcripts. To address this, we have built a “stochastic cell calculator” (StoCellAtor) that combines a modified TASEP with a stochastic implementation of an existing WCM. We show how our framework can be used to link a synthetic construct’s modular design (promoter, ribosome binding site (RBS) and codon composition) to protein yield during continuous culture, with a particular focus on the effects of low-efficiency codons and their impact on ribosomal queues. Through our analysis, we recover design principles previously established in our work on burden-sensing strategies, namely that changing promoter strength is often a more efficient way to increase protein yield than RBS strength. Importantly, however, we show how these design implications can change depending on both the duration of protein expression, and on the presence of ribosomal queues.


S1 Model equations
The reactions used in our model are a stochastic version of those used in [1]. One of the key differences is our implementation of translation, where we use a multi-step reaction scheme instead of a single-step one. In Table S1, we display both the reactions and rate calculations where appropriate. Symbols for variables and parameters are detailed in Tables S2 and S3, respectively. In what follows we define the following elements and their corresponding sets: ∈ , , , ℎ , ∈ , , , , ∈ all species except mRNA . Table S1. Biochemical reactions and their associated reaction rates.

Process Reaction Rate
Nutrient transport ∅

S1.1 Formulating growth rate to keep cell mass constant
In this section we detail how growth rate is calculated. We begin with a cell in mid-exponential growth.
Over the course of our stochastic simulation steps, new cell mass is continually produced due to protein production. To keep cell mass constant between successive stochastic steps, we find a dilution rate, , that balances the increase in cell mass due to protein production. We determine such dilatation rate as explained hereafter.
We define as the cell mass expressed in terms of average amount of amino acids in the cytoplasm. We use indices to denote what the cell mass is at different stages during a stochastic simulation step. Given this, denotes the cell's mass at the beginning of a simulation step. denotes cell mass following completion of a reaction during the considered stochastic simulation step (which could involve production of one protein molecule). Finally, denotes the mass of the cell following any potential dilution. The change in cell mass during a stochastic simulation step is defined as ∆ = ( − ). To ensure that the cell mass is maintained constant from one stochastic simulation step to the next, we need to impose = . This is achieved by defining = : (S1) Defining this way is equivalent to considering a specific dilution rate intended to keep cell mass constant between stochastic simulation steps. This dilution rate is defined as − , i.e. the negative rate of change of cell mass between stochastic steps, under the constraint = : In our simulations, we have chosen the initial protein values (see Section S2) so that the total amount of amino acids corresponds to the approximate average number of amino acids in an E. coli cell [2], i.e. 10 amino acids. Therefore, to express the dilution rate in terms of the mass of one cell, we divide by this target average number of amino acids per cell:

S2 Variables and initial values
The variables used in our model are shown below, alongside the initial values we used in our simulations. Endogenous initial values were chosen by running preliminary deterministic simulations and using their steady-state output as the initial value for the stochastic simulations. Initial values for the heterologous simulations are taken from the steady-state values of an endogenous simulation with a nutrient quality of 100 (no unit).

S3 Parameter values
A list of the parameters used in our simulations is presented in Table 3. Values taken from literature are indicated in the column titled "Source" with the corresponding bibliographic reference(s) indicated in square brackets. Most parameter values are taken from [1]. When values are modified from those proposed in the literature, this is indicated with one or multiple † symbols and the corresponding explanation for this change is presented in Section S3. The following abbreviations are used in Table 3: aa: amino acid. Rf: ribosomal footprint = 30 nucleotides = 10 aa; the span of one ribosome on an mRNA transcript. Input: different values within the range shown have been used to conduct stochastic simulations. '-' denotes no unit.

S3.1 Explanation for changes in some values compared to those proposed in [6]
The change of some of the values relative to those proposed [1] is founded on intuitive reasoning explained here-in. We note that the values in [1] also include some uncertainty, and so a detailed parameter fit against an extensive experimental data set would be required to further validate the proposed values. This is however beyond the scope of our model.

Nutrient-related parameters ( †)
In [1], parameters and take values of 726 min -1 and 5,800 min -1 respectively, which we reduce by 2000-fold to values of 0.363 min -1 and 2.9 min -1 . This is for two reasons which are multiplicative: firstly, we vary nutrient quality from 10 to 600 in order to reach saturation of growth rate values in our model output ( Figure 4A, main text), which is two to three orders of magnitude higher than in [1]. This would significantly change the dynamics of nutrient transport and metabolism, therefore to maintain the rates of these processes, we first reduce 1000-fold. Secondly, [1] uses two classes to describe 'enzymatic proteins' with identical transcription and translation dynamics. We combine these into one class, , meaning that any calculation involving proteins would be over-represented 2-fold compared to [1]. This requires requires balancing and by a further factor of two, yielding a reduction of 1,000 × 2 = 2,000 in total.

Adjusting half-saturation constants ( † †)
Half-saturation constants (HSCs) are used in Hill equations for reactions involving transcription, translation and nutrient metabolism. In [1], a single parameter is used for each of translation and metabolism, and two are used for transcription due to a separation of ribosomal and non-ribosomal reactions; in our model, we furthermore use transcription HSCs for each protein fraction. Those parameters used for translation and transcription relate to quantities of ('energy units'), where one unit is subsequently used in each translation elongation step. The HSC values used in [1] for translation and /non-transcription are 7.0, 4.4 and 430 energy units per cell respectively; we deem these as too low for our model, as explained below.
In order to utilise the saturating effects from the Hill equations, the HSCs for transcription and translation should be within a similar order of magnitude as the steady-state value of energy units in the cell. While literature is sparse on average quantities of energy units, we can calculate a value from simple approximations. In our simulations, the number of mRNA molecules for each species at steady state is on the order of ~10 3 , and each such molecule requires 30 (for E,Q,H) or 750 (for R) energy units to fully translate, or approximately ~10 2 (see Table 2). We reason that a typical E. coli cell in exponential growth would be adapted to store energy reserves (in the form of ATP) to be able to sustain some translational capacity for short periods without any nutrient intake [7]. For example, in order to translate just 1% of its proteome using existing energy reserves only would require approximately × = 10 3 energy units per cell. This suggests that, in order to be in the energy range where saturation effects from Hill kinetics are applied, the HSCs for transcription and translation should be at least 10 in our model. Given this, we increased the values of HSCs for transcription and translation to be in the range 1,000 to 30,000 energy units. Although the precise values are arbirtary, we chose our values to ensure that (i) the HSC for ribosomal transcription was higher than for the other native fractions (i.e. it gets inhibited more readily), and (ii) the lowest value was the HSC for translation (i.e. it gets inhibited less readily). This is to reflect the fact that, in response to low energy levels, an E. coli cell inhibits ribosomal RNA biogenesis more significantly than other processes [8].

Adjusting Hill coefficients ( † † †)
In [1], the Hill coefficients used for nutrient metabolism and translation elongation were both 1, producing a classic Michaelis-Menten saturation curve. We instead wanted these processes to behave more switch-like to reflect the fact that many cellular processes are tightly controlled. To do this, we increased the Hill coefficients from 1 to 3 for nutrient metabolism (ℎ ) and from 1 to 2 for translation elongation (ℎ TL ), meaning that change in kinetics will now occur over a smaller range of values.

Adjusting parameters to reflect assumptions about E. coli transcript composition ( † † † †)
In order to make the ratio of -transcripts to -transcripts in the cell more realistic, we changed values of and from 4.14 and 948.93 in [1] to 20 and 850 respectively (unit: molecs min -1 cell -1 ). In both our model and in [1], and have the same transcript length and so any differences to their translation rate will be due to the difference in the number of transcribed genes belonging to each fraction. The ratio of approximately 4:950 from [1] implies that -transcripts are 238 times more abundant than -transcripts. Given that a typical E. coli cell contains approximately 1,380 mRNA molecules in exponential growth [9], the ratio from [6] implies there are × 4 ≈ 6 enzymatic transcripts at steady state. Intuitively, this is too small, therefore we adjust this ratio to 20:850, which gives × ≈ 32 enzymatic transcripts, a more reasonable value.
We additionally reduced the ribosome transcription rate, from 930 in [1] to 27 (unit: molecs min -1 cell -1 ). This is because 68% of the cell's RNA polymerases work on ribosomal RNA synthesis [10] and ribosomal mRNA is 30x longer than non-ribosomal mRNA in our framework. It follows that the ribosomal transcription rate is % ≈ 2.3% of the total transcription rate of an E. coli cell. This suggests that E. coli's total mRNA synthesis rate is approximately . % = 1174 min -1 cell -1 , which is close to the sum of our constituent transcription rate parameters ( + + = 897 min -1 cell -1 ). Our value is slightly lower as we wanted to account for the added effect of RNA polymerase traffic jams on ribosomal RNA genes [10].

S3.2 mRNA:protein mass ratio calculations
During our simulations, we calculate the mRNA:protein mass ratio for comparison with bacterial growth laws. We take the average molecular weight of a nucleotide to be 330 g mol -1 and of an amino acid as 118.9 g mol -1 [11]. The total mass of mRNA transcripts is then calculated as the sum of the mass of each transcript type multiplied by its corresponding abundance in the cell. The total protein mass was obtained similarly, giving the quantities needed for the mRNA:protein mass ratio.

S4.1 Proteome fraction changes in response to nutrient quality
In Figure 4 from the main text, we used our model to reproduce a number of 'bacterial growth laws' by showing how different variables changes in response to increasing nutrient quality, . For the endogenous simulations (without inclusion of a synthetic gene construct), we showed that increasing increased the growth rate to the point of saturation while monotonically increasing the mRNA:protein mass ratio. While not central to our main results, it may also be useful to observe what happens to the steady-state values of all the endogeneous proteome components: , and . We display the results below for the seven different values of as used in Figure 4. Figure S1. Steady-state endogenous proteome fractions plotted against nutrient quality.
From Figure S1, it can be seen that and mirror each other's dynamics with R being the dominant species at high values of , while shows a slight relative decrease when increases. This can be explained in terms of how the cell distributes its resources when it has access to a large energy supply: in higher nutrient quality conditions (i.e. at higher values of ), more energy is available to the cell per nutrient molecule absorbed and thus there is no need for the cell to absorb and metabolize as many nutrients as in lower nutrient quality conditions (smaller values of ). As a result, decreases, and the cell has more resources to spend on other fractions. For increasing values of , this causes to increase, due to its large maximal transcription rate. At even higher values of , eventually gives space to , due to the fact that mRNA species of are negatively autoregulated.

S4.2 Changing 'slow codon' parameters
In addition to the parameters mentioned in the main text, we changed three other key parameters to better understand the effect these had on the model's output: i) Slow codon position. In addition to placing a slow codon towards the end of the mRNA transcript, we conducted simulations with a slow codon near the beginning of the transcript (footprint position 5 R f ). In this instance, translation is still slowed down, however only very short queues can form, leading to a lower amount of sequestered ribosomes. ii) Slow codon efficiency. In addition to a slow codon that has a relative efficiency of 0.5%, we ran simulations with a slow codon with a relative efficiency of 3%. This was used to assess how changing the efficiency of a codon affects queue formation. iii) mRNA length. Our standard simulations use mRNAs with a length of 30 R f . In reality, mRNA length can vary significantly. We also ran simulations with mRNAs of length 60 R f to test whether this has an impact on queue formation. In this instance, the slow codons were located at footprint positions 5 R f or 56 R f .
As the impact of these parameters was not the focus of our study, we provide just a basic snapshot of their effects, in the form of the relationship between and at steady state. In general, the more efficient the translation process is, the higher the values of are for equivalent values of . In addition, the linearity of each trend suggests how predictable the dynamics of translation and protein production are, with more linear trends conveying higher predictability. In this light, linear regression fits are provided for each case studied. Slower codons towards the end of the mRNA typically give rise to very inefficient constructs, depending on prom H and RBS H .

Higher efficiency Default
Longer mRNA

S4.3 Absolute values of heterologous protein yield
In Section 3.2.2, we presented an analysis of hypothetical growth scenarios and calculated how much heterologous protein would be produced over time. When presenting this data in Figure 6, we used a normalised form of protein yield ( ( ) norm ) at each time point in order to make comparisons between different construct designs easier to interpret. The absolute values were not included in the core manuscript, however it may still be useful for readers to see these. In Figure S3, we therefore present heat maps of absolute protein yield values for both growth scenarios (uncapped exponential growth, turbidostat growth) and for both of the main codon compositions used (no slow codons, one slow codon towards the end of a transcript). The interpretation of this figure is identical to that in Figure 6b, except we now: • show the full suite of heat maps for both growth scenarios separately. This is because the absolute protein yield increases over time even when protein production dynamics are timeinvariant, such as in a turbidostat operating at steady-state capacity • use a separate scale bar for each heat map. This is in order to preserve visual comparisons between construct designs at a specific time interval For subfigure a, the shading of heat maps over the time intervals is identical to those in Figure 6b. In subigure b, the shading of each heat map remains consistent over the time points because the protein production dynamics of a turbidostat operating at steady-state capacity remain constant over time.   . Absolute values of heterologous protein yield for a variety of construct design and growth conditions. Values from different promoter and RBS strengths are displayed using 3x3 matrices. Values from different codon compositions are shown via differently coloured rows (blue: no slow codon; orange: one slow codon towards the transcript's end). Values from different growth scenarios are split into the two subfigures (a: uncapped exponential; b: turbidostat operating at steady-state capacity). For a more thorough interpretation of the figure's design, see the caption for Figure 6 in the main manuscript.

Simulation convergence
As stated in Section 2.1, a simulation stops when all variables have reached convergence. This is defined as when a variable's quantity does not deviate by more than 1% from its moving mean, where the moving mean is calculated from the last 10% of the simulation time. Below, we show how this is performed for the heterologous protein species ( ).

Ribosome density plots
As explored in Section 3.2.1 and in Figure 5d, we can plot the proportion of ribosomes on mRNA H transcripts that are on each footprint position.

Proteome mass fractions
Finally, we plot the steady-state mass fractions of each proteome class, and represent these as pie charts.