A Novel Framework for Parameter and State Estimation of Multicellular Systems Using Gaussian Mixture Approximations

Multicellular systems play an important role in many biotechnological processes. Typically, these exhibit cell-to-cell variability, which has to be monitored closely for process control and optimization. However, some properties may not be measurable due to technical and financial restrictions. To improve the monitoring, model-based online estimators can be designed for their reconstruction. The multicellular dynamics is accounted for in the framework of population balance models (PBMs). These models are based on single cell kinetics, and each cellular state translates directly into an additional dimension of the obtained partial differential equations. As multicellular dynamics often require detailed single cell models and feature a high number of cellular components, the resulting population balance equations are often high-dimensional. Therefore, established state estimation concepts for PBMs based on discrete grids are not recommended due to the large computational effort. In this contribution a novel approach is proposed, which is based on the approximation of the underlying number density functions as the weighted sum of Gaussian distributions. Thus, the distribution is described by the characteristic properties of the individual Gaussians, like the mean and covariance. Thereby, the complex infinite dimensional estimation problem can be reduced to a finite dimension. The characteristic properties are estimated in a recursive approach. The method is evaluated for two academic benchmark examples, and the results indicate its potential for model-based online reconstruction for multicellular systems.


Introduction
Multicellular systems are not only found at the core of many fundamental biomedical processes like cell differentiation [1] and wound healing, but also play an essential role in a wide spectrum of biotechnological processes ranging from pharmaceutical manufacturing [2,3] to biopolymer production [4] to biological waste-water treatment [5]. The individual cells do not only interact with each other, but also with their environment. These interactions are affected by individual cellular properties like cell volume and gene/protein expression levels, which underlie a certain cell-to-cell variability, i.e., the properties appear distributed within the cell population. The reasons for such heterogeneities are manifold, including differences in local substrate concentrations [6] or non-synchronous cell-cycles [7], as well as stochastic effects on the genomic and proteomic level [8].
From an evolutionary point of view, this variability is argued to benefit the robustness of the cell population against rapid environmental changes. However, on an industrial scale, such variabilities This motivates the design of a novel state estimation concept for PBEs proposed in this manuscript. The basic idea can be summarized as follows: The full NDF is approximated by a Gaussian mixture distribution (GMD) [16]. This technique is already established for the statistical analysis of heterogeneous cell populations measured with appropriate techniques such as flow and mass cytometry (see, e.g., [17,18] and the references therein). Instead of accounting for the dynamics of the full NDF, one only describes the dynamics of integral quantities of the Gaussians. In contrast to [19], we not only characterize those dynamics in terms of the Gaussians' means alone, but also account for the dynamics of the Gaussians' covariances. Subsequently, estimators are applied to reconstruct means and (co)variances of these Gaussian distributions. Thereby, the approach guarantees high flexibility, as it is not limited to a specific numerical solution method like the above-mentioned discretization-based approach: Depending on the problem, tailored numerical schemes can be implemented to compute the integral quantities of the Gaussians, e.g., moment methods [20]. Moreover, one is not restricted to a specific type of estimator: in principle, any established state estimation or observer design technique is applicable.
This manuscript is structured as follows: At first, the modeling of multicellular systems dynamics using PBEs is briefly introduced. Subsequently, it is explained how the full NDF is approximated with GMDs and how the dynamics of their integral quantities can be derived from the corresponding PBEs. The next section outlines the application of Kalman and particle filters, both representing established estimation techniques [13], for reconstructing the integral properties of the Gaussian distributions in special cases. Results for two numerical case studies are shown and discussed. A summary is given and potential future work outlined.

Modeling of the Multicellular Dynamics
Each cell can be characterized by individual properties, like size/shape or (bio)chemical composition, which are summarized in the cellular state vector z = [z 1 , . . . , z N z ] T . For single cell modeling of their dynamics, complex cellular mechanics on different time and size scales, such as metabolic, proteomic, genomic and regulatoric, have to be taken into account. In the following, it is assumed that the temporal evolution of z can be sufficiently described with a system of stochastic differential equations (SDEs) [21]: Therein, the vector field f z = f z 1 , . . . , f z Nz T characterizes deterministic mechanics, while the remainder of the right-hand side accounts for stochastic effects by a vector Wiener-process W t in the space of the intracellular coordinates. Here, D(z(t), t) is a square matrix of dimension N z characterizing the effects of the Wiener-process on the cellular dynamics. Its elements may depend on z(t) and extracellular components. In the case of negligible stochasticity, Equation (1) reduces to a set of ordinary differential equations (ODEs):ż As already pointed out in the Introduction, several reasons may put restrictions on the available measurements such that only certain cellular properties or functions thereof are accessible, which is modeled by the output equation: In a population of cells, individual properties underlie cell-to-cell variability and thus appear distributed. Dynamics of the distributed properties in terms of the cell number density distribution n(t, z) can be described in the framework of population balance modeling (PBM), giving rise to the general population balance equation [21]: where: and the double dot inner product is defined as: Therein, D i,j (z(t), t) denotes the (i, j)-element of D(z(t), t). Alternatively, Equation (4) can thus be written as: The left-hand sides of Equations (4) and (7) characterize accumulation; while the first terms on the right-hand side describe the change of properties resulting from deterministic cellular kinetics (e.g., growth and biochemical reactions), the second stochastic effects and the third terms contain source processes like cell birth (e.g., budding and division) and death (e.g., apoptosis). In this contribution, rather small time horizons are considered on which the first effect is dominant in comparison to the latter two, such that we take P(n(t, z), z, t) = 0. It has to be mentioned that coupling to other species with distributed parameters (e.g., a second cell population) and non-distributed parameters (e.g., extracellular substrates) has also been neglected in this study. In the case of restrictions on measurements as described earlier, only projections of n(t, z), but not the full NDF itself can be measured: n y (t, y) = H(t, n(t, z), z) .
Here, H(t, n(t, z), z) is the output operator that depends on the available single cell measurements represented by Equation (3). For example, out of two cellular states z = [z 1 , z 2 ], only one may be accessible y z = z 2 (e.g., cell size, but not cell growth) via flow cytometry. As a result, only the marginal distribution: n y (t, z 2 ) = H(t, n(t, z), z) = z 1,max z 1,min n(t, z) dz 1 (9) will be available. The main idea of the proposed approach predicates on approximation of the NDF with basis functions, e.g., a sum of Gaussian normal distributions: It is well known that each distribution can be approximated by such a mixture density, yet a large number of individual Gaussians may be required to obtain a sufficient accuracy of the approximation. Furthermore, it has to be mentioned that there are several approaches to obtain these approximations, each having certain advantages and drawbacks (see, e.g., [16] and the references therein). For the proposed estimation method outlined in the following, the specific method is not important, as long as the approximations are sufficiently accurate. To improve the readability, dependence on t will not be written out explicitly in the following. The dynamics of the full NDF can be characterized by the dynamics of integral quantities of individual elements, i.e., mixture component a (i) (i.e., the zeroth order moment), vector of mean values µ (i) (i.e., the normalized first order moments) and covariance Σ (i) (consists of normalized central second order moments). The corresponding state vectors for individual densities are given as: It is obvious that their temporal evolution and the corresponding measurement vector: depend directly on the population balance Equation (4) and the available cellular measurement information. For the general nonlinear case, a closed set of equations is only found in rare cases, and thus, one relies on approximate closure techniques. These include power series expansions and approximate moment methods [20]. The variables v and w denote random perturbations on the process dynamics and measurements not included in the model formulation. For linear intracellular dynamics: is obtained. Here, A z is an N z -dimensional square matrix. If furthermore, the measurement dynamics are linear: a closed set of equations is found for the dynamics of means and covariances. In case the stochasticity is assumed to be independent from the intracellular properties and time invariant D(z, t) = D, the following equations are obtained:

Observability of Multicellular Systems Dynamics
Prior to the design of any state estimator/observer, it has to be clarified if it is even possible to reconstruct non-measurable quantities from measurable ones with a model-based approach. This issue, also known as observability analysis, has received much attention over the last few decades. Initially, the focus was on finite dimensional systems dynamics described by linear and nonlinear ordinary differential equation systems [22,23]. Additionally, more general graph-based concepts were developed [24,25]. However, the application of these concepts to population balance models is scarce and mostly relies on a prior discretization step that transforms the infinite dimensional system to a high-dimensional system of ODEs (e.g., [26]). Recently, a new concept was presented that allows statements on the observability of ensemble systems [27]. Unfortunately, only sufficient conditions are provided for a limited class of individual (cellular) dynamics, and more general effects like cell death and cell division, as well as interaction with a non-distributed species are neglected.
In the proposed framework, the infinite dimensional system represented by the multi-dimensional PBE in Equation (4) is transformed to a finite dimension using the approximation of the multi-dimensional number density distributions by sums of the Gaussian distribution. This allows a straightforward analysis of observability in multicellular systems dynamics with established methods from linear and nonlinear systems theory. In the case of linear single cell kinetics, the overall dynamic and output equations of individual GMDs are given as: Therein, the vector Σ (i) contains all diagonal and upper-diagonal elements of Σ (i) , and the vector D contains the corresponding values of D. The dynamic system in Equation (17) is observable if the Kalman rank condition [22]: is fulfilled. For more general nonlinear dynamics represented by Equation (12), weak local observability is given under the following (sufficient) condition [23]: Furthermore, the system could also be tested for structural observability [24,26] posing a necessary condition on local observability. It has to be mentioned that these conditions are only valid for individual state vectorsẋ (i) , but not for the whole system, i.e., all GMDs i = 1, . . . , N GMD at once. However, it is immediately clear that the previously presented conditions have to be fulfilled for the special case N GMD = 1. For N GMD > 1, a data association algorithm is necessary that assigns N GMD model outputs y (i) Here, in addition to the rank conditions in Equations (18) and (19), simultaneous convergence of the data association algorithm has to be guaranteed. Derivation of a complete set of conditions for the latter is beyond the scope of this contribution and represents a challenge for future research efforts. Possible solution approaches may be found in the field of multiple target tracking (see [28] and the references therein).

Estimator Design
The second key idea of the approach is to design suitable estimators for the state vectors defined in Equation (11) instead of the full NDF. Depending on the dynamics described in Equation (12), different established concepts may be applied to reconstruct the full state vector x from the available measurements. As alternative to deterministic methods such as (non-)linear Luenberger or sliding-mode observers, recursive Bayesian estimation algorithms can be employed [13]. Therein, individual states and measurements are formulated as probability density functions (PDF). Only for the following explanation, N GMD = 1 will be assumed, and indices (i) will be dropped in the following. Bayes' rule for recursive state estimation reads as: and describes how the posterior density p(x k | y 1:k ) of the state estimate at time t k taking into account all measurements up to t k depends on the density p(y k | x k ), which characterizes the likelihood of the measurement to be in the predicted state p(x k | y 1:k−1 ) at t k . The latter contains information on available measurements according to Equation (13), the dynamics of the state vector defined in Equation (12) and the random perturbations. The density in the denominator is given by: and represents a normalizing constant. Analytic solutions for recursive Bayesian estimation are only found for limited cases. If the posterior PDF is approximated by a Gaussian, random perturbations on states v and measurements w are drawn from zero-mean Gaussian distributions, and furthermore both, f and h are linear functions of the state vector defined in Equation (11): then a Kalman filter represents the optimal solution to the recursive estimation problem [13]. Its time-discrete implementation for the estimation of meanx k and error covariance P k of the states PDF for the time-discrete version of the relations given above, F dis and H dis , contains the following steps: (I) Prediction step, a priori state and error covariance estimates: (II) Computation of the estimator gain: (III) Correction using current measurement y k , posterior estimates: However, as mentioned earlier in connection with the observability analysis, for N GMD > 1, the estimation problem is more complicated, and a single Kalman filter cannot be implemented easily. A possible solution is to use a set of separate Kalman filters for estimation of the state vectorsx (i) , yet the association with the measurement vectors y (j) has to be determined in each time step. For the current study, a rather simple algorithm was implemented where the association of the model to measurements is determined from the cost matrix C with elements given as: for each recursion. The association with the minimal overall cost can be determined using combinatorial optimization techniques, e.g., the Hungarian algorithm [29]. Although more complex algorithms could have been implemented, in this study, we stick to this simple approach, as it showed a sufficient performance.
If the assumptions stated in the Equations (22) do not hold, suboptimal algorithms like extended Kalman filters or particle filters have to be employed. The latter puts a low number of conditions on the process model and the involved random perturbations. A large number of different particle filtering algorithms has been studied in the previous decades (see, e.g., [13] and the references therein for more detailed information). The major steps of the implementation used in this contribution are briefly summarized in the following: (I) Initialization: A set of N P initial estimates (particles) is drawn from the initial state PDF: Subsequently, the following steps are executed recursively for k > 0: (II) Propagation: Each particle is propagated from t k−1 to t k with Equation (12): (III) Weighting: The relative likelihood q j of each particle conditioned on the current measurement is computed by evaluating p(y k,j | x − k,j ) using Equation (13). Furthermore, all weights are normalized.
(IV) Regularization and resampling: The posterior state PDF p(x k | y k ) is approximated by a sum of weighted kernel functions, and N P a posteriori particles are generated from this PDF.
From the posterior PDF, any desired statistical measures, e.g., mean and covariance, can be determined.
As an important advantage over the presented Kalman filter, no special augmentation is necessary for N GMD > 1. Here, the posterior PDF p(x k | y k ) would show multiple modes, each corresponding to one GMD.
For good performance and accurate estimation, certain requirements have to be fulfilled for the measurements: Population snapshots are obtained for example from high-throughput measurements of single cells, e.g., via flow cytometry. The number of measured cells has to be large enough to be a representative sample of the cell population. For flow cytometric measurements, the number of measured cells per sample typically exceeds 10 4 individual cells [10], thereby guaranteeing a representative measurement. Additionally, the samples have to represent the corresponding process dynamics, and hence, a certain sample density with respect to time is required to guarantee convergence of the estimator. Nevertheless, unrepresentative samples can be compensated up to a certain degree with a well-designed and tuned estimator. In general, the performance improves when more cellular components are measured, in particular for high-dimensional problems with a large number of cellular components in the model. As stated in the introduction, there is an upper limit to the number of measurable cellular components due to technical and economic limitations. However, even if a certain cellular component cannot be measured individually, data of the population mean may be available, which can be included in the model formulation to improve the overall estimation accuracy.

Comments on the Computational Implementation
All benchmarks that are presented in the following were implemented in MATLAB 2016b. Artificial measurement data were generated using a large number of simulations of the single cell dynamics defined in Equation (1) with different initial conditions. Stochastic dynamics were solved with the Euler-Maruyama-scheme [30]. At discrete time points, snapshots were generated using Equation (3). Subsequently, those were approximated by a Gaussian mixture distribution using the function fitgmdist in MATLAB 2016b, which employs the expectation maximization algorithm [16], to obtain the measurement vectors given in Equation (13). Parameters for both benchmarks are found in Tables 1 and 2. For the first example, artificial single cell measurements were subjected to white noise N (0, 0.01). This measurement noise appeared as a constant bias in the measured variances of individual Gaussians, but did not affect the measurements of Gaussian means.
Up to this point, constraints on the states have not been addressed. For the elements of the state vector in Equation (11), certain conditions have to be fulfilled to guarantee physical meaningful estimates, e.g., non-negative µ and positive definiteness of Σ. In the Kalman filter context, constrained state estimation algorithms have been studied (see, e.g., [13]). However, in this contribution, this issue was neglected in the estimation algorithm and had to be addressed in a subsequent step. In a particle filter setup, constraints are handled in the initialization and resampling step, respectively. Here, only particles representing valid states fulfilling the constraints were drawn from the current PDF.

Linear Dynamics: Protein Expression
In the first example, heterogeneous gene expression was considered. Cells can be different with respect to individual mRNA and protein levels z 1 and z 2 . Moreover, it was assumed that the cells differed with respect to their individual mRNA-expression rates z 3 . The process also took stochasticity of the mRNA expression into account. The resulting single cell dynamics are given by the following set of SDEs: It was assumed that only the individual protein levels z 2 can be measured via flow cytometry, and thus, the measurement equation reads as: In the context of the approximation procedure presented above, this translates into the following dynamics of individual Gaussians mean vectors and covariance matrices: As Σ (i) = Σ (i)T , the state vector is given by: The corresponding measurement vector consists of individual mixture components, mean values and variances of the one-dimensional distribution n(t, z 2 ) representing the projection of the full NDF n(t, z) on the measurable subspace of z: Parameter values are found in Table 1. Observability can be inspected by application of the Kalman rank criterion on the observability matrix W O using the system matrix A and the output matrix C as defined in Equation (18). The first is derived as: while the latter is given by: It can be shown easily that: and hence, the resulting system is observable. Thereby, the necessary condition for state estimator design has been fulfilled.
In the simulation scenario, it was assumed that the overall cell population consisted of two different subpopulations that exhibited different gene expression degrees. These two subpopulations cannot be distinguished with flow cytometry at the start as both show the same variability with respect to the individual protein levels. Two individual Kalman filters were designed to track the individual Gaussians that corresponded to the two subpopulations. As mentioned earlier, an additional data association scheme augmented the individual estimators by assigning measured and modeled Gaussians: In this study, for cost optimal assignment, the Hungarian method [29] was applied to the cost matrix in each sampling step.
Estimation results are shown in Figure 2. It can be seen that the non-measurable states were reconstructed sufficiently accurately with the possible exception of S 1,1 , characterizing the variability within the cells with respect to cellular mRNA-levels. The main reason for this degraded performance lies in the stochastic dynamics. Despite its simplicity, data assignment with the described method was able to guarantee convergence of the overall estimation in the current study with two subpopulations. However, in the case of more subpopulations, advanced strategies may have to be taken into account.

Nonlinear Dynamics: Intracellular Oscillator Modeled by Lotka-Volterra Dynamics
In the second benchmark, an intracellular oscillator was considered. Examples for such oscillatory behavior are found for example in the modeling of homeostasis and the cell cycle (see [31] and the references therein). It was assumed that the cell population exhibited variability with respect to two unspecified intracellular properties. Their coupled behavior was characterized by a Lotka-Volterra-type dynamics. As in the first benchmark, only the second cellular property was accessible. Thus, the following set of systems equations was obtained: This also means that only projections to z 2 were measurable for the overall cell population number density function. Thus, after applying the proposed GMD approximation of the NDF, the following set of measurements was obtained: In contrast to the previous example, no closed set of equations was found to describe the evolution of Gaussians from one sample point to the next. Therefore, an approximate moment method based on the direct quadrature method of moments [20] was implemented: therein, each Gaussian was represented by a set of random sampling points. These were propagated by the nonlinear intracellular dynamics. The propagated samples themselves were approximated with a new Gaussian. The overall procedure was rather complex, and a direct function f(x) was not determined easily. Hence, it was hardly possible to check for nonlinear observability of the cell population with the methods introduced previously. However, the nonlinear single cell dynamics were (weakly) observable, and the weak observability of individual Gaussians has been conjectured. In the implementation of the resampling step of the particle filter, a rather simple static normal distribution was implemented as the kernel function. All parameter values used in the numerical implementation are reported in Table 2.
Simulation results are depicted in Figure 3. The set of particles represents the PDF of the estimated state vector (depicted in grey scale, darker areas indicating regions of higher particle density). The measurements (and also the cell populations NDF) were approximated by two GMDs. Initially, the particles were evenly distributed (broad PDF), reflecting the uncertainty about the current state of the process. However, after a short period, a bimodal PDF was obtained indicating that the particles had focused within two regions. The PDF modes represent the particle filters' estimates for the state vectors. It can be seen that most states were accurately estimated. In general, the estimates of mixture components and means seemed to be more reliable than the estimates of the covariance. Here, the estimate for S 1,1 , i.e., the variance with respect to non-measurable component z 1 , shows some significant deviations at some time points. The simulation showed that the estimates became worse for periods in which the two measurement GMDs overlapped, yielding a kind of practical non-observability. This was particularly obvious in the period 8 < t < 12 where the uncertainty in the estimates increased (the PDF became broader), and the PDF modes did not represent good estimates for the measurements any longer. However, after a short time, the estimator was able to recover. It is also clear that the particle filter had to converge before a new critical practical non-observability situation arose. It has to be mentioned that the implemented particle filter algorithm was relatively simple and could be improved further by taking into account advanced techniques, e.g., different resampling strategies and improved particle prediction steps [13]. Another point for fine-tuning of the performance was found in the approximate moment method for the prediction of the state vectors: here, a more complex approximate moment closures could be considered.

Conclusions
This contribution proposed a novel approach to model-based online state estimation for multicellular systems from snapshot data. In classical approaches, state estimators are designed for grid-based discrete approximations of the involved PBEs. This approach is not favorable for multicellular systems: here, modeling is based on single cell kinetics commonly involving a high number of cellular states. As each of these states translates directly into one independent coordinate of the corresponding PBE, generally high-dimensional partial differential equations are obtained. Application of the grid-based discrete method would involve out-of-scale computational costs. In contrast, the proposed technique is based on an approximation of the cell number density distribution and snapshot measurements by Gaussian mixture densities. These are characterized by integral quantities, namely the mixture component (zeroth order moment), mean (first order moments) and covariance (centralized second order moments). In an online estimation, those are reconstructed instead of the full NDF. Thereby, the estimation is reduced from an infinite to a low dimensional problem, which enables the application of established estimator techniques, like Kalman and particle filtering. Dynamic equations for the integral quantities have to be derived from the PBE. Here, a closed set of equations is only found under special conditions, and approximate moment methods have to be implemented. The proposed framework also ensures a degree of flexibility with respect to the specific approximate moment method, as well as the specific estimator. Furthermore, well-known observability criteria from systems theory can be employed. Within two benchmark studies, the performance of Kalman and particle filters for the reconstruction of non-measurable properties of the cellular system was analyzed. In the first, a closed set of ODEs was used to compute means and covariances for linear cellular dynamics, while an approximate moment method was used in the second benchmark, which involved nonlinear cellular dynamics. The results indicate that the proposed approach represents a promising tool for online reconstruction of non-measurable cellular properties, which may also be extended to other multicellular processes with non-negligible cell death and cell division. Even though the proposed estimation technique is shown for the approximation of the full NDF with Gaussian mixture densities in this contribution, the general procedure could also be adopted for different mixture densities, e.g., Gamma distributions [32]. As for the second nonlinear benchmark, dynamics of the integral distribution quantities could be determined with an approximate moment method. Furthermore, it has to be mentioned that the introduced technique could also be applied to other particulate processes, e.g., particle formation or crystallization, which will be investigated in the future.