Structural Identifiability and Observability of Microbial Community Models

Biological communities are populations of various species interacting in a common location. Microbial communities, which are formed by microorganisms, are ubiquitous in nature and are increasingly used in biotechnological and biomedical applications. They are nonlinear systems whose dynamics can be accurately described by models of ordinary differential equations (ODEs). A number of ODE models have been proposed to describe microbial communities. However, the structural identifiability and observability of most of them—that is, the theoretical possibility of inferring their parameters and internal states by observing their output—have not been determined yet. It is important to establish whether a model possesses these properties, because, in their absence, the ability of a model to make reliable predictions may be compromised. Hence, in this paper, we analyse these properties for the main families of microbial community models. We consider several dimensions and measurements; overall, we analyse more than a hundred different configurations. We find that some of them are fully identifiable and observable, but a number of cases are structurally unidentifiable and/or unobservable under typical experimental conditions. Our results help in deciding which modelling frameworks may be used for a given purpose in this emerging area, and which ones should be avoided.


Introduction
Computational systems biology relies heavily on dynamic modelling to understand the mechanisms of complex biological processes, with the ultimate goal of controlling and optimising them [1]. To this end, control-theoretic concepts such as structural identifiability and observability are being increasingly applied in this research area. Observability is the possibility of inferring the system's internal state from knowledge of its input and output [2], while identifiability is the possibility of inferring its parameters [3]. Since parameters can be considered as constant states, identifiability can be seen as a particular case of observability. We refer to the joint analysis of structural identifiability and observability as 'SIO'. The concept of observability was introduced by R.E. Kalman in the 1960s and was later extended to nonlinear systems in the 1970s [4]. Since the analysis of SIO can be very challenging in nonlinear models, its application in biological modelling has become widespread only recently, thanks to computational advances [5].
Microbial communities or consortia are a specific class of biological systems that result from the coexistence of different species of microorganisms. Despite their abundance, detailed information on the composition and function of microbial communities (both within humans and Earth-wide) has been obtained only in the last few decades [6], thanks to experimental advances in molecular biology. Likewise, the application of control-theoretic techniques to their analysis is largely unexplored, despite a number of recent examples [7][8][9].
The development of accurate and informative mathematical models is a key tool for understanding, predicting, and controlling the behaviour of microbial consortia [10,11]. Modelling efforts often focus on bacteria, but some models also include bacteriophages, which are viruses (also called phages) that infect bacteria [12]. Nonlinear systems of ordinary differential equations can be used to describe the different types of interactions present in these communities and their emergent dynamics. Different types of models are used to this end. In some of them, such as the generalized Lotka-Volterra (gLV) models, the dynamics arise from species-species interactions, which may be competitive or cooperative. Another approach, that of species-metabolite interaction models, describes the emergent behaviour of the community as a result of the competition for common resources. Several versions of these models have been presented in the literature [13][14][15][16].
The parameters in microbial community models may have different meanings, such as quantifying the degree of cooperation or competition between species, or representing growth, degradation, or dilution rates. Therefore, knowledge of their values can provide valuable insight about community dynamics. However, our ability to build informative models can be greatly influenced by experimental limitations. Species abundances are often inferred solely from high-throughput DNA sequence data. The resulting datasets provide estimates of relative abundances, but typical modelling approaches-including gLV-describe absolute abundances instead. Remien and coworkers [17] analysed the structural identifiability of gLV models with absolute and relative data, as well as the local practical identifiability of a synthetic community. Interestingly, they found that certain interaction parameters were unidentifiable from relative data.
As structural identifiability is a requirement for successful parameter estimation, observability is a requirement for successful state estimation. Since non-identifiability and non-observability are often linked (e.g., due to one or more parameters being correlated with a given state variable), both properties (SIO) need to be analysed jointly. The lack of SIO can sometimes, but not always, be remedied by modifying the experimental setup so as to measure more state variables or functions of them. Even when it is possible to first perform parameter estimation offline, and then use the parameter estimates for the online estimation of the unknown state variables using a state observer, it is crucial to ensure the SIO of both experimental setups (online and offline). Otherwise, errors in the estimates of the unknown parameters would be propagated to the estimates of the unmeasured states. Since the SIO of a model determines the possibility of inferring its parameters and internal state, it is crucial to analyse these properties to guarantee the reliability of the modelling results. Thus, systematic studies on the SIO of the models available for a particular application are useful resources; see, e.g., [18]. However, such analyses of the properties of the most widely used microbial community models are currently lacking. In the present paper, we perform the said study. We analyse three main types of models: species-species interaction models, which include the aforementioned gLV as well as a variant of them devised for compositional data; species-metabolite interaction models, including quadratic interactions and enzymatic kinetics, and a third class of models that are suited for therapeutic applications with phage cocktails.

Definitions
We study models that assume that the populations of each microorganism are sufficiently large to be well described by deterministic equations. Furthermore, we neglect spatial variability. Therefore, we use ordinary differential equation (ODE) models that can be written as where x(t) ∈ R n , u(t) ∈ R q , θ ∈ R p , and y(t) ∈ R m represent the state, input, parameter, and output vectors, respectively. We refer to their elements with subindices, e.g., θ i , x j . The inputs are assumed to be known, while the parameters are, in principle, unknown. The outputs are the measured quantities, which often consist of direct measurements of state variables, although they may also be functions of them. Broadly speaking, we say that a model of the form (1) is observable (respectively, structurally identifiable) if its state vector x(t) (respectively, parameter vector θ) can be determined by measuring its future outputs y(t) and inputs u(t) in a bounded time horizon. We consider local versions of these properties, which hold in a neighbourhood of each variable.
More formally, we say that a parameter θ i of (1) is structurally locally identifiable (SLI) if, for almost all vectors θ * , there is a neighbourhood N (θ * ) in which the following condition holds:θ ∈ N (θ * ) and y(t,θ) = y(t, θ * ) If (2) is not fulfilled, θ i is structurally unidentifiable (SU). If all its parameters are SLI, the model is SLI; otherwise, it is SU.
Similarly, a state variable x i (τ) is observable if it can be distinguished from any other neighbouring states from knowledge of the output y(t) and input u(t) vectors in the We use the acronym FISPO (full input, state, and parameter observability) to refer to a model that is SLI and observable [19].

Analysis Methods
The structural local identifiability and observability (SIO) of a nonlinear model can be determined with a differential geometry approach, applying techniques based on the concepts presented, e.g., in [4]. To this end, the state variables and parameters are included in an augmented state vector,x = [x, θ], and an observability-identifiability matrix O NL I (x) is calculated as follows [20]: where L f h(x) is the Lie derivative of the output, and higher-order derivatives can be obtained as If O NL I (x) has full rank, i.e., n + p, the model is FISPO. Note that the ith column of O NL I (x) contains partial derivatives with respect to the ith element ofx, which is either a parameter or a state variable. Therefore, when rank(O NL I (x)) < n + p, the identifiability of each parameter and the observability of each state variable can be determined by removing the corresponding column and recalculating the matrix rank [2]: if the rank decreases, the corresponding parameter (respectively, state variable) is SLI (respectively, observable). If it remains constant, the parameter is SU (respectively, the state variable is nonobservable).
There is an alternative method that calculates the matrix rank more efficiently. It adopts a differential algebra approach [21] and allows us to compute the set of observable variables in polynomial time. For the purpose of notation, capital letters are used for the initial conditions of a function and its derivatives, i.e., u (r) (0) = U (r) and y (r) (0) = Y (r) for r ≥ 0 and then U = ( j , ... for j = 1, ..., n y , then R U, Y denotes the field adjoining the indeterminates to R. In [21], the transcendence degree is also calculated with the rank of O NL I (x). In order to compute the matrix, a different approach is taken to avoid the expensive calculation of the Lie derivatives. The underlying procedure can be seen in Algorithm 1.
We have implemented both methods in the Matlab toolbox STRIKE-GOLDD [22]. In the analyses reported in this paper, we have used preferentially the second, most efficient algorithm.
Finally, we note that it is possible to extend the approach defined in this section to models with unmeasured inputs; see [19,23,24] for details.

Algorithm 1: Probabilistic algorithm to test local algebraic observability in polynomial time
Preprocesing Construct a straight-line program encoding the variational system ∇P with P =ẋ − f (u(t),x(t)) and the expressions used during its integration. Specialization Specialisation of the parameters, θ * , and the inputs, u * Power Series Solution Computation of the power series solution of ∇P at order nx + 1 with a specialised value for the states Jacobian computation Evaluation of ∇y on the previous results, giving the coefficients of the Jacobian matrix Rank computation Calculation of the matrix rank and transcendence degree if transcendence degree = 0 then System is algebraically observable else Determine which variable or variables are not observable. end

Models
A number of microbial community models have been proposed within the framework defined by (1); we describe them in the remainder of this section. Their diagrams are shown in Figure 1.

Species-Species Interaction Models (SSI)
Species-species interaction models (SSI) assume that the dynamic behaviour can be modelled as the result of direct interactions between species. If we only take into account two-way interactions between species, the dynamics of each species is given bẏ where x i is the abundance of species i, h i its intrinsic growth, and f ij describes the increase or decrease in species i due to the interaction with species j. Typical SSI models include the classic Lotka-Volterra models of predator-prey interactions from community ecology, as well as different versions of them.

Generalized Lotka-Volterra Models (gLV)
The generalized Lotka-Volterra model (gLV) is of the form (6), with the assumption that h i = r i · x i and f ij = β ij · x i · x j . This yieldṡ Thus, gLV models have one r i parameter per state variable, which represents its growth rate, and one β ij parameter for each pair of state variables, which represents their interaction rate. Note that the latter are not necessarily symmetrical, i.e., β ij = β ji .
The three main classes of microbial community models analysed in this paper. Positive interactions (i.e., which increase the abundance of the entity to which they point) are represented by solid arrows, while negative interactions (which decrease the abundance of the target) are represented by dashed lines with a square end. SSI models (left) describe the dynamics as direct competition or cooperation between species (depicted as two different bacteria). SMI models (central panel) describe the dynamics as the result of competition for resources. In the diagram, bacterium X 1 consumes metabolite m 2 , while X 2 consumes m 1 ; furthermore, m 2 can also be transformed into m 1 , which is a by-product detrimental to X 1 . The panel on the right depicts a phage cocktail (PC) in which each phage (P S , P R ) attacks a specific bacterium; the immune response I is also included.

Composite Lotka-Volterra Models (cLV)
The composite Lotka-Volterra framework (cLV) was introduced by [14]. It was motivated by the fact that microbiome datasets are often obtained by high-throughput sequencing and are therefore compositional, i.e., they have an arbitrary total imposed by the instrument [25]. In many practical applications, only relative abundance measurements are available; however, the state variables in gLV models represent absolute abundances. By applying a technique from compositional data analysis, the additive log-ratio transformation [26], absolute abundances are replaced with the logarithms of pairwise ratios of the abundances. This transformation turns gLV models into cLV models, whose variables represent relative abundances.
To derive the cLV equations, let us denote the sum of all species abundances by N = ∑ n i=1 x i , and the relative abundance of each species by π i = x i N . Note that both N and π i are time-dependent. Then, by setting the mean community size to 1, the following equation is obtained: whereĀ ij = A ij − A nj . It is possible to include an additional term in (8) to account for external disturbances: Equation (9) must be rewritten as (1) in order to perform SIO analysis, which leads to

Species-Metabolite Interaction Models (SMI)
It has been argued that the emergent dynamics of certain microbial communities are more faithfully represented as competition among species for a common resource, instead of the predator-prey relationship implied by SSI models [15]. Since the resources are typically metabolites, the resulting models are called species-metabolite interaction models or SMI. They have the following general structure: where each of the n m metabolites is denoted by m j . Several choices of interaction functions f i j and h j il are possible.

Quadratic Species-Metabolite Interaction Models (QSMI)
The simplest SMI models assume quadratic terms. Ref. [15] argued that this assumption yields a faithful representation of a well-mixed system, and included constant dilution terms for metabolites (d i ) and microorganisms (d * j ). The resulting model is of the forṁ where the f j terms allow for a constant influx of metabolite j and φ j il for its production as a by-product of a reaction involving another metabolite. We refer to the model described in (12) as QSMI, a quadratic species-metabolite interaction.

SMI Models with Simple Monod Growth Kinetics (MSMI)
The incorporation of the Monod growth terms leads tȯ We refer to model (13) as MSMI. The parameterization resulting from Monod growth kinetics has more parameters than the QSMI one. A simplified version of this class of models (including only one metabolite) appeared in [16], where a consumer-resource model using Monod kinetics was proposed as a means to explain the diversity of microbial communities having different growth rates.

A Phage Cocktail Model (PC)
A particular type of microbial community includes bacteriophage viruses, also called phages. Thanks to their ability to infect specific bacteria, phages provide an alternative to antibiotics in therapeutic applications. Many context-specific models involving phages have been proposed, including some [27,28] that describe their role in antimicrobial resistance. Here, we consider an illustrative case study: a phage cocktail model presented by Li et al. [29] that includes two bacterial strains, one sensitive (S) and one resistant (R) to therapy, the two phages that target them (P S and P R , respectively), and the immune response I.Ṡ where the phage-bacteria interactions are given by F(P i ) = φ·P i 1+P i /P C for i ∈ {R, S}. Both bacterial strains are killed by the immune response, as well as by the corresponding phage; the kinetics of the immune killing is parameterized by and K D . Both bacteria undergo logistic growth with a maximum capacity given by K C , and sensitive bacteria may mutate into phage-resistant bacteria with a mutation probability µ. The immune response is activated by the sum of both bacterial abundances, with activation rate α and saturation given by K I . As for the phages, it is assumed that they have the same adsorption rate φ, burst size β, and decay rate ω, but different injection rates, ρ S and ρ R . The former three parameters are considered unknown, but the injection rates are assumed known since we are modelling a phage therapy application.
Li et al. [29] also presented a scaled version of (14), which has a state vector given by and the following equations:

Results
We have used the approach described in Section 2 to analyse the SIO of the models described in Section 3. Since SIO depends on the type of measurements available-i.e., on which state variables can be measured-for each model structure, we consider several possible measurement configurations. Furthermore, we consider models of different dimensions (e.g., with two or three bacterial strains, and with one or two metabolites) in order to find general identifiability patterns, i.e., which parameters are always unidentifiable irrespective of the number of species included in the model. Results of the analyses are given in Tables 1-5. The main findings and trends are discussed in the remainder of this section.
An overview of the results of gLV models can be seen in Table 1. The gLV models are FISPO if and only if all their state variables are outputs, i.e., if their absolute abundances are measured. If only some of its state variables are measured, some interaction rates (β) are unidentifiable-specifically, the ones related to the interactions among the unmeasured species and the other species. The interaction rates between two measured species remain identifiable. In contrast, when the measurements consist of relative abundances, all the interaction rates (β) are unidentifiable. Growth rates (r) are always identifiable, both with absolute and relative measurements. As for observability, a state variable is observable if and only if it is measured in absolute terms.
x * An asterisk as a subscript ( * ) refers to the set of all the possible subindexes. We use subscripts i, j, k such that i = j, i = k, j = k. RM (relative measures) means all the possible combinations of the form xi ∑ n j=1 x j . Table 2. cLV. Results for the composite Lotka-Volterra models.
x u y Id Non-Id Obs Non-Obs π i An asterisk as a subscript ( * ) refers to the set of all the possible subindexes. Table 2 shows the results of the cLV models. Their dimensions have been reduced taking into account that x n = 1 − ∑ n−1 i=1 x i . These models were introduced for the purpose of achieving observability, a goal that they achieve for all measurement configurations. Their identifiability is strongly influenced by the existence of (measured) inputs. In the absence of inputs, all the model parameters are unidentifiable. When we introduce an input, the parameters related to the external perturbations (B) become identifiable, regardless of which variables are measured. Tables 3 and 4 summarize the results of the SMI models. Despite their different kinetics, all of their state variables are observable if and only if they are outputs. As for their identifiability, all the dilution terms d and d * are always identifiable. The constant flux of a metabolite ( f ) is identifiable only if the abundance of that metabolite is an output. The identifiability of other parameters depends on the type of kinetics. In QSMI models, k terms are always unidentifiable. If m r is an output, ψ * r (where the * means all the possible indexes) are identifiable. When we measure more than one metabolite, some φ become identifiable: if x i is observed, φ 2 i1 and φ 1 i2 are identifiable. In MSMI models, V * are all always unidentifiable. If x i is an output, V i * are identifiable. If m r is observed, φ r * s are identifiable. Finally, if x i and m r are both outputs, K ir and K * ir are identifiable. Lastly, the original formulation of the PC model is FISPO if and only if the immune response (I) is observed (Table 5). If I is not measured, it becomes unobservable, even if the other variables are outputs. In this case, measuring either bacterial abundances (S, R) or phages (P S , P R ) yields identical results; measuring both sets does not improve the identifiability nor observability with respect to measuring a single one. For its scaled version, of all the possible output combinations, we have considered those that are experimentally more feasible: (x 1 , . In all these cases, the model is FISPO.  I all -all -S, R, P S , P R r, r , m, P C , K C , K D , K N , β, α, φ, ω E, K I S, R, P S , P R I S, R r, r , m, P C , K C , K D , K N , β, α, φ, ω E, K I S, R, P S , P R I P S , P R r, r , m, P C , Any all -all -'Any' means that all combinations listed in Section 4 have given the same result.

Discussion and Conclusions
We have analysed the structural identifiability and observability (SIO) of more than a hundred variants of the most common microbial community models. For each type of model structure, we considered different dimensions, in order to obtain general results, and different output configurations, so as to establish which measurements are more useful for identification purposes.
Notable examples of the insights obtained in this study include the finding that, for generalized Lotka-Volterra models (gLV), all the microbial species present in the community must be measured in order to achieve full observability and identifiability. In contrast, if one is only interested in determining their growth rates, it is sufficient to measure a single species. Another species-species interaction formalism, the cLV, has better observability properties than gLV at the expense of worse identifiability. As for species-metabolite interaction models, an interesting conclusion is that the only observable state variables are the measured ones, for both QSMI and MSMI model classes. Finally, the SIO of the phage cocktail model that we have analysed exhibits a strong dependency on the possibility of measuring the immune response. A scaled version of this model, which has reduced dimensions, achieves full SIO at the cost of losing the biological meaning of its reparameterized variables.
Thus, our results have revealed the limitations of several model variants, and facilitate the tasks of choosing a model that is appropriate for a particular application and designing the corresponding experimental setup. While it would be impossible to analyse all possible model structures that could be used for the study of microbial communities in a short paper, the present work has studied the most relevant ones, and serves as a demonstration of the methodology that can be applied for other models.

Data Availability Statement:
The scripts used to obtain the results presented in this paper can be found in the 'models' folder of STRIKE-GOLDD, https://github.com/afvillaverde/strike-goldd accessed on 13 April 2023.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations
The following abbreviations are used in this manuscript: