A Language for Modeling And Optimizing Experimental Biological Protocols

Automation is becoming ubiquitous in all laboratory activities, leading towards precisely defined and codified laboratory protocols. However, the integration between laboratory protocols and mathematical models is still lacking. Models describe physical processes, while protocols define the steps carried out during an experiment: neither covers the domain of the other, although they both attempt to characterize the same phenomena. We should ideally start from an integrated description of both the model and the steps carried out to test it, to concurrently analyze uncertainties in model parameters, equipment tolerances, and data collection. To this end, we present a language to model and optimize experimental biochemical protocols that facilitates such an integrated description, and that can be combined with experimental data. We provide a probabilistic semantics for our language, based on a Bayesian interpretation, that formally characterizes the uncertainties in the data collection, the underlying model, and the protocol operations. On a set of case studies we illustrate how the resulting framework allows for automated analysis and optimization of experimental protocols, including Gibson assembly protocols.


Introduction
State of the art in lab automation. Automation is becoming ubiquitous in all laboratory activities: protocols are run under reproducible and auditable software control, data are collected by high-throughput machinery, experiments are automatically analyzed, and further experiments are selected to maximize knowledge acquisition. However, while progress is being made towards the routine integration of sophisticated end-to-end laboratory workflows and towards remote access to laboratory facilities and procedures [1][2][3][4][5], the integration between laboratory protocols and mathematical models is still lacking. Models describe physical processes, either mechanistically or by inference from data, while protocols define the steps carried out during an experiment in order to obtain experimental data. Neither models nor protocols cover the domain of the other, although they both attempt to characterize the same phenomena. As a consequence, it is often hard to attribute causes of experimental failures: whether an experiment failed because of a misconception in the model, or because of a misstep in the protocol. To confront this problem we need an approach that integrates and accounts for all the components, theoretical and practical, of a laboratory experiment. We should ideally start from an integrated description from which we can extract both the model of a phenomenon, for possibly automated mathematical analysis, and the steps carried out to test it, for automated execution by lab equipment. This is essential to enable automated model synthesis and falsification by concurrently taking into account uncertainties in model parameters, equipment tolerances, and data collection.

Figure 1. Processing a unified description for an experimental protocol. A program integrates a biophysical model of the underlying molecular system with the steps of the protocol. In this case the protocol comprises a single instruction, which lets a sample equilibrate for t seconds, where t is a parameter. The initial concentration of the sample is 1 for the first species and 0 for all the others. The value of t is selected as the one that maximizes a cost function, in this case the difference between y5 and y1 after the execution of the protocol. The optimization is performed on a Gaussian process given by the semantics of the program (biophysical model and protocol) integrated with experimental data.
Our approach. We present a language to model and optimize experimental biochemical protocols that provides such an integrated description of the protocol and of the underlying molecular process, and that can be combined with experimental data. From this integrated representation, both the model of a phenomenon and the steps carried out to test it can be separately extracted. Our approach is illustrated in Figure 1. We provide a probabilistic semantics for our language in terms of a Gaussian process (GP) [6], which can be used to characterize uncertainties. Such a semantics arises from a Linear Noise Approximation (LNA) [7,8] of the dynamics of the underlying biochemical system and of the protocol operations, and it corresponds to a Gaussian noise assumption. We show that in a Bayesian framework the resulting semantics can be combined in a principled way with experimental data to build a posterior process that integrates our prior knowledge with new data. We demonstrate that the Gaussian nature of the resulting process allows one to efficiently and automatically optimize the parameters of an experimental protocol in order to maximize the performance of the experiment. On a series of case studies, including a Gibson Assembly protocol [9] and a Split and Mix protocol, we highlight the usefulness of our approach and how it can have an impact on scientific discovery.
Related work and novelty. Several factors contribute to the growing need for a formalization of experimental protocols in biology. First, better record keeping of experimental operations is recognized as a step towards tackling the 'reproducibility crisis' in biology [10]. Second, the emergence of 'cloud labs' creates a need for precise, machine-readable descriptions of the experimental steps to be executed. To address these needs, frameworks allowing protocols to be recorded, shared, and reproduced locally or in a remote lab have been proposed. These frameworks introduce different programming languages for experimental protocols, including BioScript [11,12], BioCoder [2], Autoprotocol [13], and Antha [14]. These languages provide expressive, high-level protocol descriptions but consider each experimental sample as a labelled 'black box'. This makes it challenging to study a protocol together with the biochemical systems it manipulates in a common framework. In contrast, we consider a simpler set of protocol operations but capture the details of experimental samples, enabling us to track properties of chemical solutions (concentrations and temperatures) as the chemicals within them react during the execution of a protocol. This allows us to formalize and verify requirements for the correct execution of a protocol and to optimize various protocol or system parameters to satisfy these specifications.
Our language derives from our previous conference paper [5] by introducing stochasticity in chemical evolution and by providing a Gaussian semantics, particularly for protocol operations, for the effective joint analysis of stochastic chemical kinetics, protocol operations, and experimental data. The language semantics has already been implemented (but not previously formally presented) in a chemical/protocol simulator [15]. We expect that this style of semantics could be adapted to other protocol languages, along the principles we illustrate, to provide a foundation for the analysis of complex stochastic protocols in other settings.
Paper outline. In what follows, we first introduce the syntax and semantics of our language. We then show how we can perform optimization of the resulting Gaussian process integrated with data. Finally, we illustrate the usefulness of our approach on several case studies, highlighting the potential impact of our work on scientific discovery.

Materials and Methods
In this section we introduce the syntax of the language we propose for modelling experimental protocols. A formal semantics of the language, based on denotational semantics [16], is then discussed. The physical process underlying a biological experimental protocol is modeled as a Chemical Reaction System (CRS). As a consequence, before introducing the language for experimental protocols, we first formally introduce Chemical Reaction Networks (CRNs) and Chemical Reaction Systems (CRSs).

Chemical Reaction Network (CRN)
Chemical Reaction Networks (CRNs) are a standard language for modelling biomolecular interactions [17].

Definition 1. (CRN) A chemical reaction network C = (A, R) ∈ CRN = Λ × T is a pair of a finite set A ⊆ Λ of chemical species, of size |A|, and a finite set R ⊆ T of chemical reactions. A reaction τ ∈ R is a triple τ = (r_τ, p_τ, k_τ), where r_τ ∈ N^{|Λ|} is the source complex, p_τ ∈ N^{|Λ|} is the product complex, and k_τ ∈ R_{>0} is the coefficient associated with the rate of the reaction. The quantities r_τ and p_τ are the stoichiometry of reactants and products. The stoichiometric vector associated to τ is defined by υ_τ = p_τ − r_τ. Given a reaction τ_i = ([1, 0, 1], [0, 2, 0], k_i) over species (a, b, c), we refer to it visually as τ_i : a + c →^{k_i} 2b.

Definition 2. (CRN state) A state of a CRN consists of a concentration vector µ (moles per liter), a covariance matrix Σ, a volume V (liters), and a temperature T (degrees Celsius). A CRN together with a (possibly initial) state is called a Chemical Reaction System (CRS), having the form (A, R), (µ, Σ, V, T).
A Gaussian process is a collection of random variables such that every finite linear combination of them is normally distributed. Given a state of a CRN, its time evolution from that state can be described by a Gaussian process indexed by time, whose mean and variance are given by a Linear Noise Approximation (LNA) of the Chemical Master Equation (CME) [7,18], as formalized in the following definitions of CRN Flux and CRN Time Evolution.
We should remark that, if one assumes mass action kinetics, then for a reaction τ the flux F_τ(µ) = k_τ ∏_{i=1}^{|A|} µ_i^{r_τ,i} is the product of the reagent concentrations scaled by k_τ, where k_τ encodes any dependency of the reaction rate on volume and temperature. The Jacobian of the drift F(µ) = Σ_{τ∈R} υ_τ F_τ(µ), evaluated at µ, is then given as in [8].
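As a concrete sketch, the mass-action flux and the Jacobian of the drift can be computed numerically; the following Python fragment (function names are ours) uses the reaction τ_i = ([1, 0, 1], [0, 2, 0], k_i) from Definition 1:

```python
import numpy as np

def flux(r, k, mu):
    """Mass-action flux of one reaction: k times the product of
    reagent concentrations raised to their stoichiometries."""
    return k * np.prod(mu ** r)

def drift(reactions, mu):
    """Right-hand side of the rate equation: sum over reactions of
    the stoichiometric vector (p - r) times the flux."""
    return sum((p - r) * flux(r, k, mu) for r, p, k in reactions)

def jacobian(reactions, mu, eps=1e-8):
    """Forward-difference Jacobian of the drift, evaluated at mu."""
    n = len(mu)
    f0 = drift(reactions, mu)
    J = np.zeros((n, n))
    for j in range(n):
        d = np.zeros(n)
        d[j] = eps
        J[:, j] = (drift(reactions, mu + d) - f0) / eps
    return J

# Reaction tau_i = ([1, 0, 1], [0, 2, 0], k_i), species order (a, b, c)
reactions = [(np.array([1.0, 0.0, 1.0]), np.array([0.0, 2.0, 0.0]), 0.5)]
mu = np.array([0.1, 0.0, 0.2])
f = drift(reactions, mu)  # one a and one c consumed, two b produced
J = jacobian(reactions, mu)
```

Here the drift evaluates to k_i · µ_a · µ_c scaled by the stoichiometric vector [−1, 2, −1], and the Jacobian is the quantity that enters the LNA covariance evolution.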
In the following, we use µ, Σ for concentration mean vectors and covariance matrices respectively, and µ, Σ (boldface) for the corresponding functions of time providing the evolution of means and covariances.
An ill-posed problem may result from technically expressible but anomalous CRS kinetics that does not reflect physical phenomena, such as deterministic trajectories that are not uniquely determined by initial conditions, or trajectories that reach infinite concentrations in finite time.
Consider an initial condition µ_a = 0.1, µ_b = µ_c = 0.001; the time evolution of mean and variance for the GP defined in Definition 4 is then reported in Figure 2.

A Language for Experimental Biological Protocols
In what follows we introduce the syntax of our language for experimental biological protocols.

Definition 5.
(Syntax of a Protocol) Given a set of sample variables x ∈ Var and a set of parameter variables z ∈ Par, the syntax of a protocol P ∈ Prot for a given fixed CRN C = (A, R) is:

P ::= x (a sample variable)
| (p_1 ... p_|A|, r_V, r_T) (a sample with initial concentrations, volume, temperature)
| let x = P_1 in P_2 (introduce a local sample variable x)
| Mix(P_1, P_2) (mix samples)
| let x_1, x_2 = Split(P_1, p) in P_2 (split a sample P_1 by a proportion p in (0..1))
| Equilibrate(P, p) (equilibrate a sample for p seconds)
| Dispose(P) (discard sample)

p ::= z (a parameter variable)
| r (a literal non-negative real number)

Moreover, let-bound variables x, x_1, x_2 must occur (as free variables) exactly once in P_2.
A protocol P manipulates samples (which are CRN states as in Definition 2) through a set of operations, and finally yields a sample as the result. This syntax allows one to create and manipulate new samples using Mix (put together different samples), Split (separate samples) and Dispose (discard samples) operations. Note that the CRN is common to all samples, but different samples may have different initial conditions and hence different active reactions. The single-occurrence (linearity) restriction of sample variables implies that a sample cannot be duplicated or forgotten.

Gaussian Semantics for Protocols
There are two possible approaches to a Gaussian semantics of protocols, that is, to a semantics characterized by keeping track of mean and variance of sample concentrations. They differ in the interpretation of the source of noise that produces the variances. We discuss the two options, and then choose one of the two based on both semantic simplicity and relevance to the well-mixed-solution assumption of chemical kinetics.
The first approach we call extrinsic noise. The protocol operations are deterministic, and the evolution of each sample is also deterministic according to the Rate Equation (that is, Definition 4(1)). Here we imagine running the protocol multiple times over a distribution of initial conditions (i.e., the noise is given extrinsically to the evolution). The outcome is a final distribution that is determined uniquely by the initial distribution and by the deterministic evolution of each run. For example, the sample-split operation here is interpreted as follows. In each run we have a sample whose concentration in general differs from its mean concentration over all runs. The two parts of a split are assigned identical concentrations, which again may differ from the mean concentration. Hence, over all runs, the two parts of the split have identical variance, and their distributions are perfectly correlated, having the same concentration in each run. The subsequent time evolution of the two parts of the split is deterministic and hence identical (for, say, a 50% split). In summary, in the extrinsic noise approach, the variance on the trajectories is due to the distribution of deterministic trajectories for the different initial conditions.
The second approach we call intrinsic noise. The protocol operations are again deterministic, but the evolution in each separate sample is driven by the Chemical Master Equation (i.e., the noise is intrinsic to the evolution). Here we imagine running the protocol many times on the same initial conditions, and the variance of each sample is due ultimately to random thermal noise in each run. This model assumes a Markovian evolution of the underlying stochastic system, implying that no two events may happen at the same time (even in separate samples). Also, as usual, we assume that chemical solutions are well-mixed: the probability of a reaction (a molecular collision) is independent of the initial position of the molecules in the solution. Moreover, the mixing of solutions is driven by uncorrelated thermal noise in separate samples. Given two initially identical but separate samples, the first chemical reaction in each sample (the first "fruitful" collision that follows a usually large number of mixing collisions) is determined by uncorrelated random processes, and their first reactions cannot happen at exactly the same time. Therefore, in contrast to the extrinsic noise approach, a sample-split operation results in two samples that are completely uncorrelated, at least on the time scale of the first chemical reaction in each sample. In summary, in the intrinsic noise approach, the variance on the trajectories is due to the distribution of stochastic trajectories for identical initial conditions.
In both approaches, the variance computations for mix and for split are the same: mix uses the squared-coefficient law to determine the variance of two mixed samples, while split does not change the variance of the sample being split. The only practical difference is that in the intrinsic noise interpretation we do not need to keep track of the correlation between different samples: we take it to be always zero in view of the well-mixed-solution assumption. As a consequence, each sample needs to keep track of its internal correlations represented as a local Σ matrix of size |A| × |A|, but not of its correlations with other samples, which would require a larger global matrix of potentially unbounded size. Therefore, the intrinsic noise interpretation results in a vast simplification of the semantics (Definition 6) that does not require keeping track of correlations of concentrations across separate samples. In the rest of the paper we consider only the intrinsic noise semantics.
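A minimal Python sketch (the State record and function names are ours) of how Mix and Split act on sample states under the intrinsic noise interpretation:

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class State:
    mu: np.ndarray     # mean concentrations
    Sigma: np.ndarray  # local covariance matrix of size |A| x |A|
    V: float           # volume
    T: float           # temperature

def mix(a, b):
    """Volume-weighted mix of two samples; the covariance follows the
    squared-coefficient law for linear combinations of Gaussians."""
    V = a.V + b.V
    p, q = a.V / V, b.V / V
    return State(p * a.mu + q * b.mu,
                 p ** 2 * a.Sigma + q ** 2 * b.Sigma,
                 V,
                 p * a.T + q * b.T)

def split(s, p):
    """Split by proportion p: concentrations and covariances are
    unchanged, and the two parts are treated as uncorrelated."""
    part = lambda vol: State(s.mu.copy(), s.Sigma.copy(), vol, s.T)
    return part(p * s.V), part((1 - p) * s.V)
```

Because cross-sample correlations are taken to be zero, each State carries only its local Σ; no global covariance matrix over all live samples is needed.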

Definition 6. (Gaussian Semantics of a Protocol -Intrinsic Noise)
The intrinsic-noise Gaussian semantics [[·]] : Prot × Env → S of a protocol P ∈ Prot for a CRN C = (A, R), under environment ρ ∈ Env = (Var ∪ Par) → S, for a fixed horizon H with no ill-posed time evolutions, denotes the final CRN state (µ, Σ, V, T) ∈ S (Definition 2) of the protocol, written [[P]]ρ, and is defined inductively on the structure of P. The semantics of Equilibrate derives from Definition 4. The substitution notation ρ{x ← v} represents a function that is identical to ρ except that at x it yields v; this may be extended to the case where x is a vector of distinct variables and v is a vector of equal size. The variables introduced by let are used linearly (i.e., must occur exactly once in their scope), implying that each sample is consumed whenever used.
We say that the components (µ, Σ) ∈ R |A| ≥0 × R |A|×|A| of a CRN state (µ, Σ, V, T) ∈ S form a Gaussian state, and sometimes we say that the protocol semantics produces a Gaussian state (as part of a CRN state).
The semantics of Definition 6 combines the end states of sub-protocols by linear operators, as we show in the examples below. We stress that it does not, however, provide a time-domain GP: just the protocol end state as a Gaussian state (random variable). In particular, a protocol like let x = Equilibrate(P, p) in Dispose(x) introduces a discontinuity at the end time p, where all the concentrations go to zero; other discontinuities can be introduced by Mix. A protocol may have a finite number of such discontinuities corresponding to liquid handling operations, and otherwise proceeds by LNA-derived GPs and by linear combinations of Gaussian states at the discontinuity points.
It is instructive to interpret our protocol operators as linear operators on Gaussian states: this justifies how covariance matrices are handled in Definition 6, and it easily leads to a generalized class of possible protocol operators. We discuss linear protocol operators in the examples below.
Example 2. The Mix operator combines the concentrations, covariances, and temperatures of the two input samples proportionally to their volumes. The resulting covariance matrix, in particular, can be justified as follows. Consider two (input) states A = (µ_A, Σ_A, V_A, T_A) and B = (µ_B, Σ_B, V_B, T_B). Let 0 and I denote null and identity vectors and matrices of size k and k × k, where k = |A|. Consider a third null state C = (0, 0, V_A + V_B, pT_A + qT_B), where p = V_A/(V_A + V_B) and q = V_B/(V_A + V_B), and let µ = (µ_A, µ_B, 0) and Σ = diag(Σ_A, Σ_B, 0) describe the joint distribution of the three states (A and B are uncorrelated under the intrinsic noise interpretation). Define the symmetric hollow linear operator

Mix = [ 0   0   pI
        0   0   qI
        pI  qI  0 ]

The zeros on the diagonal (hollowness) imply that the input states are zeroed after the operation, and hence discarded. Applying this operator to the joint distribution, by the linear combination of normal random variables we obtain a new Gaussian distribution with mean µ' = Mix µ = (0, 0, pµ_A + qµ_B) and covariance Σ' = Mix Σ Mixᵀ, whose only nonzero block is Σ'_C = p²Σ_A + q²Σ_B.

Example 3. The Split operator splits a sample in two parts, preserving the concentration and consequently the covariance of the input sample. The resulting covariance matrix, in particular, can be justified as follows. Consider one (input) state A = (µ_A, Σ_A, V_A, T_A) and two null states B = (0, 0, pV_A, T_A) and C = (0, 0, (1 − p)V_A, T_A). Define the symmetric hollow linear operator

Split = [ 0  I  I
          I  0  0
          I  0  0 ]

Applying this operator to the joint distribution we obtain µ' = (0, µ_A, µ_A) and Σ' with diagonal blocks Σ'_B = Σ'_C = Σ_A. By projecting µ', Σ' on B and C we are left with the two output states (µ_A, Σ_A, pV_A, T_A) and (µ_A, Σ_A, (1 − p)V_A, T_A), as in Definition 6. The correlation between B and C that is present in Σ' is not reflected in these outcomes: it is discarded on the basis of the well-mixed-solution assumption.
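The hollow-operator reading of Mix can be checked numerically. The sketch below (our own illustration, for k = 2 species and two uncorrelated input samples mixed proportionally to their volumes) builds the joint Gaussian state, applies the operator, and recovers the volume-weighted mean and squared-coefficient covariance in the output block:

```python
import numpy as np

k = 2                                  # number of species
VA, VB = 1.0, 3.0                      # input volumes
p, q = VA / (VA + VB), VB / (VA + VB)  # volume proportions
I, Z = np.eye(k), np.zeros((k, k))

# Symmetric hollow operator on the joint state (A, B, C), C initially null
Mix = np.block([[Z, Z, p * I],
                [Z, Z, q * I],
                [p * I, q * I, Z]])

mu = np.concatenate([[1.0, 0.0],       # mu_A
                     [0.0, 2.0],       # mu_B
                     [0.0, 0.0]])      # mu_C (null state)
Sigma = np.zeros((3 * k, 3 * k))
Sigma[0:k, 0:k] = 0.04 * np.eye(k)          # Sigma_A
Sigma[k:2 * k, k:2 * k] = 0.08 * np.eye(k)  # Sigma_B (A, B uncorrelated)

mu2 = Mix @ mu                # input blocks zeroed, C block holds the mix
Sigma2 = Mix @ Sigma @ Mix.T  # C block equals p^2 Sigma_A + q^2 Sigma_B
```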
Example 4. In general, any symmetric hollow linear operator, where the submatrix of the intended output states is also zero, describes a possible protocol operation over samples. As a further example, consider a different splitting operator, let x, y = Osmo(P 1 , p) in P 2 , that intuitively works as follows.
A membrane permeable only to water is placed in the middle of an input sample A = (µ_A, Σ_A, V_A, T_A), initially producing two samples of volume V_A/2 but with unchanged concentrations. Then an osmotic pressure is introduced (by some mechanism) that causes a proportion 0 < p < 1 of the water (and volume) to move from one side to the other, but without transferring any other molecules. As the volumes change, the concentrations increase in one part and decrease in the other. Consider, as in Split, two initial null states B and C and the joint distribution µ, Σ of those three states. Define a symmetric hollow linear operator Osmo that, like Split, copies the concentrations of A into B and C, but scaled by the factors 1/(1 − p) and 1/(1 + p) respectively, reflecting the new volumes (1 − p)V_A/2 and (1 + p)V_A/2. Applying this operator to the joint distribution we obtain the Gaussian state describing the situation after the osmotic pressure is applied, with the covariances scaled by the squares of the same factors. Again we discard the correlation between B and C that is present in Σ on the basis of the well-mixed-solution assumption. The Mix operator can be understood as diluting the two input samples into the resulting larger output volume. We can similarly consider a Dilute operator that operates on a single sample, and we can even generalize it to concentrate a sample into a smaller volume.
Note that, by diluting to a fixed new volume W, we obtain a protocol equivalence that does not mention, in the equivalence itself, the volumes V i resulting from the sub-protocols P i (which of course occur in the semantics). If instead we diluted by a factor (e.g. factor=2 to multiply volume by 2), then it would seem necessary to refer to the V i in the equivalence.
To end this section, we provide an extension of syntax and semantics for the optimization of protocols. An optimize protocol directive identifies a vector of variables within a protocol to be varied in order to minimize an arbitrary cost function that is a function of those variables and of a collection of initial values. The semantics can be given concisely but implicitly in terms of an argmin function: a specific realization of this construct is then the subject of the next section.

Definition 7. (Gaussian Semantics of a Protocol -Optimization)
Let z ∈ Par^N be a vector of (optimization) variables, and k ∈ Par^M be a vector of (initialization) variables with (initial values) r ∈ R^M, where all the variables are distinct and Par is disjoint from Var. Let P ∈ Prot be a protocol for a CRN C = (A, R), with free variables z and k and with any other free variables covered by an environment ρ ∈ Env = (Var ∪ Par) → S. Given a cost function C : R^|A| × R^N → R, and a dataset D ⊆_fin R^M × R^N × R^|A|, the instruction "optimize k=r, z in P with C given D" provides a vector in R^N of optimized values for the z variables. (For simplicity, we omit extending the syntax with new representations for C and D, and we let them represent themselves.) The optimization is based on a stochastic process X : R^{M+N} → R^{|A|}_{≥0} × R^{|A|×|A|} derived from P via Definition 6.
[[optimize k = r, z in P with C given D]]ρ = argmin_{u ∈ R^N} E_{y∼X(r,u|D)}[C(y, u)], where X(r, u) = [[P]]ρ{k←r, z←u} for all r ∈ R^M and u ∈ R^N, and where E_{y∼X(r,u|D)}[·] stands for the expectation with respect to the conditional distribution of X(r, u) given D.
From Definition 6 it follows that, for any (r, u) ∈ R^{M+N}, X(r, u) is a Gaussian random variable. In what follows, for the purpose of optimization, we further assume that X is a Gaussian process defined on the sample space R^{M+N}, i.e., that for any possible (r_1, u_1) and (r_2, u_2), X(r_1, u_1) and X(r_2, u_2) are jointly Gaussian. Such an assumption is standard [6] and natural for our setting (e.g., it is guaranteed to hold for different equilibration times), and it guarantees that we can optimize protocols by employing results from Gaussian process optimization, as described in detail in Section 2.4.

Optimization of Protocols through Gaussian Process Regression
In Definition 7 we introduced the syntax and semantics for the optimization of a protocol from data. In this section, we show how the optimization variables can be selected in practice.
In particular, we leverage existing results for Gaussian processes (GPs) [6]. We start by considering the dataset D = {(r_i, u_i, y_i), i ∈ {1, ..., n_D}} ⊆_fin R^M × R^N × R^|A| comprising n_D executions of the protocol for possibly different initial conditions, i.e., each entry gives the concentration of the species after the execution of the protocol (y_i), where the protocol has been run with k = r_i, z = u_i. Then, we can predict the output of the protocol starting from x = (r, u) by computing the conditional distribution of X (the Gaussian process introduced in Definition 7) given the data in D. In particular, under the assumption that X is Gaussian, it is well known that the resulting posterior model, X(x|D), is still Gaussian, with mean and covariance functions given by

µ_p(x) = µ(x) + Σ_{D,x}ᵀ (Σ_{D,D} + σ²I)⁻¹ (y_D − µ_D)    (3)

Σ_p(x, x') = Σ_{x,x'} − Σ_{D,x}ᵀ (Σ_{D,D} + σ²I)⁻¹ Σ_{D,x'}    (4)

where σ²I is a diagonal covariance modelling i.i.d. Gaussian observation noise with variance σ², µ(x) and Σ_{x,x'} are the prior mean and covariance functions, Σ_{D,x} is the covariance between x and all the points in D, and y_D, µ_D are vectors of dimension R^{|Λ|·n_D} containing, for all (r_i, u_i, y_i) ∈ D, respectively y_i and µ((r_i, u_i)) [6]. Note that, in order to encode our prior information on the protocol, we take µ(x) to be the mean of the GP as defined in Definition 6, while for the variance we can take Σ_{x,x'} to be any standard kernel [6]; in the experiments, as is standard in the literature, we consider the widely used squared exponential kernel, which is expressive enough to approximate any continuous function arbitrarily well [20]. However, we remark that, for any parametric kernel, such as the squared exponential, we can still select the hyper-parameters that best fit the variance given by Definition 6 as well as the data [6]. We should also stress that the resulting X(x|D) is a Gaussian process that merges in a principled way (i.e., via Bayes' rule) our prior knowledge of the model (given by Definition 6) with the new information given in D.
Furthermore, it is worth stressing again that X(x|D) is a GP with input space R^{M+N}, which is substantially different from the GP defined by the LNA (Definition 4), where the GP was defined over time.
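For concreteness, the posterior update described above can be sketched in a few lines of NumPy (kernel hyper-parameters and function names here are illustrative, not those fitted in the experiments):

```python
import numpy as np

def sqexp(X1, X2, ell=1.0, s2=1.0):
    """Squared exponential kernel between two sets of input points."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return s2 * np.exp(-0.5 * d2 / ell ** 2)

def gp_posterior(Xd, yd, Xq, prior_mean, sigma=0.01, ell=1.0, s2=1.0):
    """Posterior mean and covariance at query points Xq, given data
    (Xd, yd), a prior mean function, and i.i.d. observation noise."""
    Kdd = sqexp(Xd, Xd, ell, s2) + sigma ** 2 * np.eye(len(Xd))
    Kqd = sqexp(Xq, Xd, ell, s2)
    mu_p = prior_mean(Xq) + Kqd @ np.linalg.solve(Kdd, yd - prior_mean(Xd))
    Sigma_p = sqexp(Xq, Xq, ell, s2) - Kqd @ np.linalg.solve(Kdd, Kqd.T)
    return mu_p, Sigma_p
```

Taking prior_mean from the protocol semantics (Definition 6) makes the posterior revert to the model prediction far from the data, while near the data it is pulled toward the observations.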
For a given set of initialization parameters r, our goal is to synthesize the optimization variables that optimize the protocol with respect to a given cost specification C : R^|A| × R^N → R, i.e., we want to find

u* = argmin_{u ∈ R^N} E_{y∼N(µ_p(x̄), Σ_p(x̄,x̄))}[C(y, u)],   x̄ = (r, u),    (5)

where E_{y∼N(µ_p(x̄),Σ_p(x̄,x̄))}[·] is the expectation with respect to the GP given by Eqn (3) and (4). Note that r is a known vector of reals (see Definition 7), hence we only need to optimize for the free parameters u. In general, the computation of u* in Eqn (5) requires solving a nonconvex optimization problem that cannot be solved exactly [21]. In this paper we approximate Eqn (5) via gradient-based methods. These methods, such as gradient descent, require the computation of the gradient of the expectation of C with respect to u:

∇_u E_{y∼N(µ_p(x̄), Σ_p(x̄,x̄))}[C(y, u)].    (6)

Unfortunately, direct computation of the gradient in (6) is infeasible in general, as the probability distribution over which the expectation is taken depends on u itself (note that x̄ = (r, u)). However, for the GP case, as shown in Lemma 1, the gradient of interest can be computed directly by reparametrizing the Gaussian distribution induced by Eqn (3) and (4).

Lemma 1. For x̄ = (r, u), let D(x̄) be the matrix such that D(x̄)Dᵀ(x̄) = Σ_p(x̄, x̄). Then, it holds that

∇_u E_{y∼N(µ_p(x̄), Σ_p(x̄,x̄))}[C(y, u)] = E_{ε∼N(0,I)}[∇_u C(µ_p(x̄) + D(x̄)ε, u)].    (7)

In particular, D(x̄) as defined above is guaranteed to exist under the assumption that Σ_p is positive definite, and can be computed via Cholesky decomposition. Furthermore, note that if the output is uni-dimensional then D(x̄) is simply the standard deviation of the posterior GP in x̄.

Consider, as an example, a dataset D of six executions of the protocol where, for instance, in d_1 the initial concentrations for a, b, c are respectively 0.1001, 0.0015, 0.001, the volume and temperature at which the experiment is performed are 1 and 20, and the observed value for b at the end of the protocol for T = 0 is 0.001. We assume an additive Gaussian observation noise (noise in the collection of the data) with standard deviation σ = 0.01. Note that the initial conditions of the protocol in the various data points in general differ from those of P, which motivates having Eqn (3) and (4) depend on both r and u. We consider the prior mean given by Eqn (1) and independent squared exponential kernels for each species, with hyper-parameters optimized through maximum likelihood estimation (MLE) [6]. The resulting posterior GP is reported in Figure 3, where it is compared with the prior mean and the true underlying dynamics for species b (assumed for this example to be deterministic). We note that with just 6 data points in D the posterior GP is able to correctly recover the true behavior of b with relatively low uncertainty.
We would like to maximize the concentration of b after the execution of the protocol. This can be easily encoded with the cost function C(y, T) = −y_b, where y_b is the concentration of b at the end of the protocol. For r = [0.1, 0.001, 0.001, 1, 20] the value obtained is T ∼ 230, which is simply the one that maximizes the posterior mean. In Section 3.1 we will show how, for more complex specifications, the optimization process becomes more challenging because of the need to also account for the variance in order to balance between exploration and exploitation.
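The reparametrized gradient of Lemma 1 can be exercised numerically. In the following toy sketch (µ_p, D, and the cost are stand-ins of our own, not the posterior of this example), the noise ε is drawn once, so the Monte Carlo objective becomes a smooth deterministic function of u whose gradient estimates ∇_u E[C]:

```python
import numpy as np

rng = np.random.default_rng(0)

def mu_p(u):
    """Toy posterior mean as a function of the optimization variable u."""
    return np.array([np.sin(u), np.cos(u)])

def D(u):
    """Cholesky factor of a toy u-dependent posterior covariance."""
    S = np.array([[0.1 + 0.05 * u ** 2, 0.0],
                  [0.0, 0.2]])
    return np.linalg.cholesky(S)

def cost(y, u):
    return (y ** 2).sum() + 0.1 * u ** 2

def expected_cost(u, eps):
    """Reparametrized Monte Carlo estimate: y = mu_p(u) + D(u) @ eps."""
    return np.mean([cost(mu_p(u) + D(u) @ e, u) for e in eps])

eps = rng.standard_normal((2000, 2))  # drawn once, independent of u
h = 1e-5
g = (expected_cost(1.0 + h, eps) - expected_cost(1.0 - h, eps)) / (2 * h)
```

For this toy model the expected cost is 1.3 + 0.15u², so the exact gradient at u = 1 is 0.3; the estimate g approaches it as the number of ε samples grows. In practice one would differentiate through µ_p and D analytically or by automatic differentiation rather than by finite differences.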

Results
We consider two case studies where we illustrate the usefulness of our framework. The first example illustrates how our framework can be employed to optimize a Gibson assembly protocol [9] from experimental data. In the second example we illustrate the flexibility of our language and semantics on a combination of protocol operations. Furthermore, we also use this example to show how our framework can be employed to perform analysis of the protocol parameters while also accounting for the uncertainty in the model dynamics, the protocol parameters, and the data collection.

Figure 3. Left: prior mean (Eqn (1)), µ_p (posterior mean of species b given by Eqn (3)), and the true dynamics of species b, assumed to be a deterministic function for this example, for r_b = 0.001 (initialization variable relative to species b). It is possible to observe how, with just a few data points, the posterior mean reflects correctly the true dynamics. Right: standard deviation of b after training (square root of the solution of Eqn (4)) as a function of r_b and T. The variance is higher for combinations of T and r_b where no training data are available.

Gibson Assembly Protocol
We start by considering a simplified version of the Gibson assembly, a protocol widely used for joining multiple DNA fragments [9]. We consider the following model of Gibson assembly described by Michaelis-Menten kinetics, where two DNA double strands, AB and BA (with shared endings A and B but different middle parts), are enzymatically joined in two symmetric ways according to their common endings, and the resulting strands ABA and BAB are circularized into the same final structure O. The resulting dynamical model is given by the following ODEs (8)-(12), assuming AB is over-abundant with respect to BA:

Figure 4 (Right). Variance of O(T) (Eqn (2)) as a function of x_BA and T. The variance is minimized for x_BA ∼ 1. This is due to the fact that all training data have x_BA = 1.
where k_cat1, k_cat2, K_m1, K_m2 are given parameters. Gibson assembly is a single-step isothermal protocol: once all the ingredients are combined, it can be written as Equilibrate(([1, x_BA, 0, 0, 0], 1, 50), T), where [1, x_BA, 0, 0, 0] is the vector of initial concentrations of the various species, with AB initialized at 1 mM, BA at x_BA mM, and all the other species at 0 mM, for a sample of volume 1 µL at a temperature of 50 degrees Celsius, equilibrated for T seconds. For this example the set of optimization variables is u = [x_BA, T], and the goal is to find the initial condition of BA (x_BA) and the equilibration time (T) such that, at the end of the protocol, species O has a concentration as close as possible to a desired value O_des, while also keeping the equilibration time and the initial concentration of BA close to some reference values. This can be formalized with the cost function

C(y, u) = β (O(T) − O_des)² + λ ((T − T_ref)/T_norm)² + (1 − λ)(x_BA − I_ref)²,

where O(T) is the concentration of strand O after the execution of the protocol. T_ref, T_norm, I_ref, and β are parameters describing, respectively, the reference equilibration time, a normalization term for the equilibration time, the reference initial concentration of BA, and a weight term such that if β > 1 we give more importance to O(T) being close to O_des compared to the equilibration time and initial conditions being close to their reference values. Similarly, λ ∈ [0, 1] is a weight term that balances the importance that the equilibration time and the initial concentration of BA are close to their reference values. In our experiments we fix O_des = 0.6, T_ref = 700, T_norm = 1000, I_ref = 1, and β = 20. The dynamical model considered in Eqn (8)-(12) is obtained by combining multiple enzymatic reactions into single Michaelis-Menten steps, and the values of the reaction rates may not be accurate.
Hence, to obtain a more accurate model, we combine Eqn (8)-(12) with the experimental data from [9] under the assumption that observations are corrupted by Gaussian noise with a relatively high standard deviation of 0.1. In Figure 4 we plot the synthesized (approximately) optimal values of T and x_BA for various values of λ, obtained by gradient descent with the gradient computed as shown in Lemma 1. Note that even for λ = 0 or λ = 1 the cost cannot be made identically zero: the uncertainty in the model leads to a non-zero variance everywhere. Note also that for λ = 1 the synthesized time T is smaller than 700 (the reference equilibration time). This can be explained by looking at Figure 4 (Right), where we plot the variance of O(T) as a function of T and x_BA. Since data are available (reported in Appendix B) only for x_BA ∼ 1, the variance increases as x_BA moves away from 1. As a consequence, the algorithm automatically balances this tradeoff by picking an equilibration time that is close to the target value while keeping x_BA close to 1, in order to keep the variance under control.
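The effect of the training data on the variance of O(T) can be illustrated with a minimal Gaussian process regression sketch. Only the noise standard deviation of 0.1 and the fact that the training inputs cluster around x_BA ≈ 1 come from the text; the RBF kernel and its hyperparameters (ell, sf) are hypothetical choices for illustration.

```python
import numpy as np

def rbf(X1, X2, ell=0.5, sf=1.0):
    """Squared-exponential kernel; ell and sf are hypothetical."""
    d = X1[:, None] - X2[None, :]
    return sf**2 * np.exp(-0.5 * (d / ell) ** 2)

def gp_posterior_var(x_star, x_train, noise_std=0.1):
    """Posterior variance of a GP at x_star, given training inputs."""
    K = rbf(x_train, x_train) + noise_std**2 * np.eye(len(x_train))
    k_star = rbf(np.atleast_1d(x_star), x_train)
    k_ss = rbf(np.atleast_1d(x_star), np.atleast_1d(x_star))
    return (k_ss - k_star @ np.linalg.solve(K, k_star.T)).item()

# Training inputs clustered around x_BA = 1, as in the data of [9]:
x_train = np.array([0.9, 0.95, 1.0, 1.05, 1.1])
```

The posterior variance is small near x_BA = 1 and grows towards the prior variance away from the data, which is precisely the tradeoff the optimizer balances.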

Split and Mix Protocol
Consider the CRN C = ({a, b, c}, R), where R is given by the reactions: Consider also the associated protocol P_split&mix, shown below in the syntax of Definition 5 (with an underscore standing for an unused variable, initial concentration vectors ordered as (a_0, b_0, c_0), and initial states matching the pattern ((a_0, b_0, c_0), volume, temperature)): This protocol interleaves equilibration steps with mixing and splitting. During the first Equilibrate, only the second reaction is active in sample A, due to its initial conditions, yielding monotonic changes in concentrations. During the second Equilibrate, similarly, only the third reaction is active in sample B. During the third Equilibrate, after samples A and B have been manipulated and mixed, all three reactions are active in sample E, yielding an oscillation.
The semantics of Definition 6 can be used to unravel the behavior of P split&mix . In particular, integration of the CRN C in samples A and B yields states S A , S B that, after some split and mix manipulations, determine the initial conditions for both mean and covariance for the integration of C in sample E.
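Under the well-mixed assumption, split and mix manipulations act as linear operators on Gaussian states (means and covariances of concentrations, together with volumes). The sketch below is our simplified rendering, not the exact operators of Definition 6; in particular, it assumes the two mixed samples are independent.

```python
import numpy as np

def mix(mu_a, cov_a, vol_a, mu_b, cov_b, vol_b):
    """Volume-weighted mix of two independent Gaussian samples.

    Concentrations combine as (V_a c_a + V_b c_b) / (V_a + V_b);
    since this map is linear, the result is again Gaussian.
    """
    v = vol_a + vol_b
    wa, wb = vol_a / v, vol_b / v
    mu = wa * mu_a + wb * mu_b
    cov = wa**2 * cov_a + wb**2 * cov_b  # independence assumption
    return mu, cov, v

def split(mu, cov, vol, p):
    """Split a sample into proportions p and 1 - p.

    Under the well-mixed assumption, concentrations (and hence their
    means and covariances) are unchanged; only the volumes change.
    """
    return (mu, cov, p * vol), (mu, cov, (1.0 - p) * vol)
```

Composing such operators with the Equilibrate integrations is what yields the initial mean and covariance for sample E above.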
We can numerically evaluate the protocol following the semantics: this is shown in Figure 5 (Left & Center), where the protocol evaluation results in three simulations implementing the three integration steps, each fed the results of the previous simulations and protocol operations.
To show the usefulness of the integrated semantics, in Figure 5 (Right) we perform a global sensitivity analysis of the whole protocol with respect to the protocol parameters t_1 = 100, t_2 = 100, t_3 = 1000 (the durations of the three Equilibrate steps) and s_1 = 0.5 (the proportion of the Split). We produce density plots of the means and variances of a, b, c at the end of the protocol when t_1, t_2, t_3, s_1 vary jointly and randomly by up to 10%. This is thus a sensitivity analysis of the joint effects of three simulations connected by liquid-handling steps. It is obtained by providing the whole parameterized protocol, not its separate parts, to the harness that varies the parameters and plots the results. We could similarly vary the initial concentrations and reaction rates, or those together with the protocol parameters.
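Such a harness can be sketched in a few lines. Here protocol_end is a hypothetical stand-in for the end state of P_split&mix (the real harness evaluates the full semantics); the point of the sketch is only the joint perturbation of all protocol parameters.

```python
import numpy as np

def sensitivity(protocol_end, nominal, rel=0.10, n=1000, seed=0):
    """Jointly perturb all protocol parameters by up to +/-rel and
    collect the end-of-protocol outputs for density plotting."""
    rng = np.random.default_rng(seed)
    nominal = np.asarray(nominal, dtype=float)
    # Uniform multiplicative perturbation of each parameter:
    factors = rng.uniform(1.0 - rel, 1.0 + rel, size=(n, len(nominal)))
    return np.array([protocol_end(*(nominal * f)) for f in factors])

# Hypothetical stand-in for the end state of P_split&mix; the real
# protocol integrates the CRN C through three Equilibrate steps
# linked by Split and Mix.
def protocol_end(t1, t2, t3, s1):
    return np.array([s1 * np.exp(-t1 / 1000.0),
                     (1 - s1) * np.exp(-t2 / 1000.0),
                     np.exp(-t3 / 1000.0)])

samples = sensitivity(protocol_end, nominal=[100, 100, 1000, 0.5])
```

The resulting array of end states can then be passed to any density-plotting routine, mirroring Figure 5 (Right).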

Discussion
Automation already helps scaling up the production of chemical and biochemical substances, but it is also hoped it will solve general reproducibility problems. In the extreme, even one-off experiments that have no expectation of reproduction or scaling-up should be automated, so that the provenance of the data they produce can be properly recorded. Most pieces of lab equipment are already computer controlled, but automating their interconnection and integration is still a challenge that results in poor record keeping. A large amount of bottom-up development will be necessary to combine all these pieces of equipment into "fabrication lines". But it is also important to start thinking top-down about what the resulting connecting "glue" should look like, as we have attempted to do in this paper. This is because, if automation is the ultimate goal, then some pieces of equipment that are hard to automate should be discarded entirely.
Consider, for example, the Split operation from Definition 6, which separates a sample into two samples. That represents a vast array of laboratory procedures, from simple pipetting to microfluidic separation, each having its own tolerances and error distributions (which are well characterized and could be included into the language and the analysis). Despite all those variations, it is conceivable that a protocol compiler could take an abstract Split instruction, and a known piece of equipment, and automatically tailor the instruction for that equipment. Digital microfluidics is particularly appealing in this respect because of the relative ease and uniformity of such tailoring [22]. Therefore, one could decide to work entirely within digital microfluidics, perhaps helping to make that area more robust, and avoid other kinds of equipment.
Are there drawbacks to this approach? First, as we mentioned, equipment should be selected based on ease of automation: whole new lab procedures may need to be developed in replacement, along with standards for equipment automation. Second, protocol languages are more opaque than either sets of mathematical equations or sequences of laboratory steps. This needs to be compensated for with appropriate user interfaces for common use in laboratories.
Regarding the second point, as a first step, we have separately produced an easily deployed app (Kaemika [15]) that supports the integrated language described here, although pragmatic details of the syntax diverge slightly. We have used it to run simulations and to produce Figures 2 and 5 (see Appendix A). It uses a standard ODE solver for simulating the "Equilibrate" steps of the protocols, i.e., the chemical reaction kinetics. It uses the Linear Noise Approximation (LNA) for stochastic simulations; that is, means and variances of concentrations are computed by an extended set of ODEs and solved with the same ODE solver along with the deterministic rate equations. The liquid handling steps of the protocols are handled following the semantics of Definition 6, noting that in the stochastic case means and variances are propagated into and out of successive Equilibrate and liquid handling steps. The result is an interleaved simulation of Equilibrate and liquid handling steps, which is presented as an interleaved graphical rendering of concentration trajectories and of droplet movements on a simulated microfluidic device. The sequence of protocol steps can be extracted as a graph, omitting the detailed kinetics. The kinetic equations operating at each step of the protocol can be extracted as well.
In this context we should also stress that our proposed Gaussian semantics makes various assumptions: (1) we assume that molecular processes described by CRNs and liquid handling operations can be modelled by a Gaussian process, and (2) we assume that the collected data are corrupted by additive Gaussian noise. The former assumption is justified by the precision of lab equipment performing liquid handling operations, whose uncertainty is generally very small, and by the Central Limit Theorem (CLT). The CLT guarantees that the solution of the Chemical Master Equation converges to a Gaussian process in the limit of a high number of molecules [23], as is common in wet lab experiments. In particular, as illustrated in Experiment 5.1 in [8], molecule counts on the order of hundreds for each species generally already guarantee that a Gaussian approximation is accurate. In experiments with single molecules, however, such an approximation may be inaccurate, and a full treatment of the CME would be more appropriate [24]. The assumption that observations are corrupted by Gaussian additive noise is standard [25]. However, we acknowledge that in certain scenarios non-Gaussian or multiplicative observation noise may be required. In this case we stress that Gaussian process regression can still be performed with success, at the price of introducing approximations in the computation of the posterior mean and variance [6].
As automation standards are developed for laboratory equipment, we will be able to target our language to such standards, and extend or adapt it as needed. In this context, our future work includes extensions to both the syntax and semantics of our language to include common laboratory procedures that we have not investigated here, for example, changing the temperature of a sample (as in a thermal cycler) or its volume (as in evaporation or dilution). These are easy to add to our abstract framework, but each corresponds to a whole family of lab equipment that may need to be modeled and integrated in detail. Furthermore, we plan to extend the semantics to allow for more general stochastic processes that may be needed when modelling single molecules.

Conclusions
We have introduced a probabilistic framework that rigorously describes the joint handling of liquid manipulation steps and chemical kinetics throughout the execution of an experimental protocol, with particular attention to the propagation of uncertainty and the optimization of protocol parameters. A central contribution of this paper is the distinction between the intrinsic and extrinsic approaches to noise, which leads to a much simplified semantics under a chemically justified assumption of well-mixed solutions. The semantics is reduced to operations on Gaussian states, where any modeled or hypothetical protocol operation is a symmetric linear operator on Gaussian states.
The Gaussian process semantics approach is novel to this work, and is novel with respect to the Piecewise Deterministic Markov Process semantics in [5], which treats chemical evolution as deterministic. The semantics in this work likewise comprises a collection of deterministic protocol operations, but note that (1) stochasticity in chemical kinetics is propagated across protocol operations, making the whole protocol stochastic, and (2) we have shown examples of how to easily incorporate new protocol operators as linear operators on Gaussian states, which may include ones that introduce their own additional stochasticity. The Gaussian approach in this paper enables principled integration of a protocol model with experimental data, which in turn enables automated optimization and analysis of experimental biological protocols. The syntax of protocols is chosen for mathematical simplicity, reflecting the one in [5]; a closely related but more pragmatic syntax is now implemented in [15].

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A Simulation Script
This is the script for the case study of Section 3.2 for the protocol P_split&mix, in Kaemika [15]. The protocol begins at 'species {c}' and ends at 'equilibrate E'. For the sensitivity analysis of Figure 5, a function f abstracts the equilibrate time parameters e1, e2, e3 and the split proportion parameter s1, and yields the concentrations of a, b, c at the end of the protocol. A multivariate random variable X, over a uniform multidimensional sample space w, is constructed from f to vary the parameters. Then X is sampled and plotted, and the script is run with LNA enabled.