An Information-Theoretic Framework for Optimal Design: Analysis of Protocols for Estimating Soft Tissue Parameters in Biaxial Experiments

A new framework for optimal design based on the information-theoretic measures of mutual information, conditional mutual information and their combination is proposed. The framework is tested on the analysis of protocols—a combination of angles along which strain measurements can be acquired—in a biaxial experiment of soft tissues for the estimation of hyperelastic constitutive model parameters. The proposed framework considers the information gain about the parameters from the experiment as the key criterion to be maximised, which can be directly used for optimal design. Information gain is computed through k-nearest neighbour algorithms applied to the joint samples of the parameters and measurements produced by the forward and observation models. For biaxial experiments, the results show that low angles have a relatively low information content compared to high angles. The results also show that a smaller number of angles with suitably chosen combinations can result in higher information gains when compared to a larger number of angles which are poorly combined. Finally, it is shown that the proposed framework is consistent with classical approaches, particularly D-optimal design.


Introduction
Soft tissues exhibit complex biomechanical behaviour, including nonlinearity, anisotropy and heterogeneity [1]. Moreover, the tissues also demonstrate inelastic properties, such as rate-dependence, hysteresis and permanent set. The important link between biomechanics and their physiological function has motivated a large number of ex-vivo studies aimed at characterising their biomechanical properties. Given the complex interplay between the different aspects of their biomechanical properties, the experimental design of ex-vivo soft tissues is extremely challenging and has been a subject of investigation, and a variety of experiments have been proposed [2][3][4][5][6].
Since a variety of soft tissues are thin-e.g., blood vessels, heart valves and skinbiaxial testing is a widely used experimental technique that allows the independent stretching of the tissue in two orthogonal directions and for the corresponding forces to be measured [7,8]. Applying different stretches in two directions allows the characterization of the in-plane anisotropic behavior of a given tissue, while a range of stretches provides us with its nonlinear elastic response. However, even with this relatively simple set of options, tensor F = ∇ X x. The ratio of the volume after deformation to that before deformation is given by J = det(F). Soft tissues are commonly regarded as incompressible due to their high water content; i.e. J is constrained to be unity.
We consider the hyperelastic model proposed by Gasser et al. [24], which defines the strain energy density as where I 1 = tr(F F) is the first invariant of the right Cauchy-Green strain tensor C = F F and I 4 = M · CM is the fourth invariant representing the stretch along fiber direction M.
The resulting Cauchy stress is given by where p acts as the Lagrange multiplier to enforce incompressibility and I is the identity matrix. For this model, the set of unknown parameters can be written as {k 1 , k 2 , κ, µ}, assuming that the fiber direction M is known a priori (based on another experiment; e.g., light scattering [6]). κ represents the dispersion of collagen fibers, which is usually measured from optical experiments. Its value lies between 0 (perfectly anisotropic) and 1/3 (perfectly isotropic). The value of µ corresponds to the shear modulus of the neo-Hookean term in (1), which represents the amorphous and non-fibrous extracellular matrix. Its role in the mechanics of soft tissues is limited to small strains and is largely constant across different tissues. In this paper, in order to simplify the problem, we assume that κ = 0.1 and µ = 1 kPa are known and fixed. Thus, the aim of an ex-vivo biomechanical experiment is to determine parameters k 1 ∈ [5, 100] kPa and k 2 ∈ [5,80] robustly and with high confidence [25,26]. A commonly used experiment called biaxial testing is described bellow.

Biaxial Experiments for Soft-Tissues
Many of the soft tissue types are planar with a small thickness. In a biaxial experiment, a square-shaped tissue sample is mounted via clamps or rakes and stretched along two orthogonal directions aligned with the sample edges ( Figure 1a). If these directions are used as the two coordinate axes and incompressibility is assumed, the stretching results in a diagonal deformation gradient tensor: where λ 1 is the stretch along the first in-plane direction and λ 2 is the stretch along the second in-plane direction. The fiber direction M is generally aligned with the first coordinate axis, which results in only normal stress components. As no force is applied along the thickness of the tissue, σ 33 = 0 is used to determine the Lagrange multiplier p. Thus, we obtain The applied stresses σ 11 , σ 22 are controlled using load cells. The resulting strains, defined as e 1 := λ 1 − 1 and e 2 := λ 2 − 1, are measured from the marker positions (although e 1 and e 2 are not the usual strain measures, we use these as our observations). It is important to note that a homogeneous stress and strain state is assumed in the middle of the sample (Figure 1a). Therefore, an implicit assumption is that the material properties and sample thickness are homogeneous. Moreover, these measurement techniques carry an error due to the limitations in measurement tools and/or the deviation from homogeneity, incompressibliity and material direction.

Protocol Definition
In practice, there are two approaches to the biaxial experiment: (1) displacementcontrolled, where known stretches are imposed and forces are measured; and (2) forcecontrolled, where known forces are applied and stretches are measured. Generally, the force-controlled approach is used as it is easier to implement. Therefore, in the forcecontrolled approach, different values of stresses σ 11 and σ 22 can be applied.
A single-angle biaxial protocol is defined as a straight line in the σ 11 -σ 22 space (Figure 1b). That is, the ratio between the two stresses is kept constant while the applied forces are increased until a maximum value σ max = 200 kPa. Thus, for a chosen angle φ, we apply where σ ∈ [0, σ max ]. For σ, 100 linearly spaced observation points between zero and the maximum stress (σ max = 200 kPa) are used. The resulting strains are calculated by iteratively solving Equation (5) for λ 1,2 and thereby obtaining e 1,2 . In practice, a combination of angles can be successively tested. We refer to this combination as the experimental protocol that needs to be optimally designed. For each angle, it is easy to acquire large numbers of points as the sample is continuously stretched. However, to vary between angles, it is essential to restart the experiment at zero applied force, which further requires the "pre-conditioning" of the sample by cyclically applying small stretches. This makes it practically difficult to apply an arbitrarily large number of angles. Therefore, in practice, usually only five angles are tested.

Information-Theoretic Framework for Optimal Design
The problem of optimal design typically refers to the choice of a design of experiments such that the design is optimal with respect to a pre-determined statistical criterion. We propose that the information-theoretic measures naturally define such statistical criteria.
The central idea is that information gain [27,28] from an experiment or protocol-as quantified by the information-theoretic quantities of mutual information and conditional mutual information-can be directly used as a reasonable statistical criterion for optimal design. These quantities are described next after presenting the framework for optimal design.

Optimal Design Problem
Consider the following general model: where M denotes a forward model that takes θ ∈ R m and outputs y ∈ R n . Note that θ may contain initial and boundary conditions of the model and that y may subsume the output at many time-points in the case of a dynamic system. Subsequently, consider that the measurement model is as follows: where H p represents the observation operator, z ∈ R d represents the measurement vector, and ε represents the vector of measurement error/noise. Note that the the observation operator H p depends on the design of experiments, which specifies which quantities are measured. Given a set of possible H p = {H 1 , H 2 , · · · , H h }, and a statistical criterion S(H p ) to be maximised, the optimal design is given bŷ In the case of the biaxial experiments, the model M represents the model for the force controlled experiment (Sections 2.1.1 and 2.1.2) and H p essentially denotes the experimental protocol (see Section 2.1.2) representing the combination of angles-with each representing a straight line in the σ 11 -σ 22 plane-along which the strain measurements of e 1 and e 2 are acquired. With the possible variation of each angle between 0 and π/2, the set Φ of possible angles φ is constructed through a uniform discretisation of the space between 0 and π/2 into α levels; thus, The possible set of protocols is then given by any combination of elements in Φ with the restriction that the number of elements in a protocol must be limited to C. Thus, if Φ ⊂ Φ is a subset of angles representing a protocol, our set of protocols is given by where | · | represents the number of elements in the set. In other words, we choose at least 1 and up to C elements from Φ, with the total elements in H p being

Information-Theoretic Quantities for Optimal Design
In the framework of Section 2.2.1, we propose that information-theoretic quantities of mutual information and conditional mutual information are a natural choice for the statistical criterion S. Denoting the random variables associated with θ and z as Θ and Z, respectively, the mutual information (MI) between the parameters Θ and the measurements Z is defined as [27] where p X (x) represents the probability density of a random variable X with a realisation X = x and support X X . The mutual information I(Θ; Z) quantifies the amount of information that can be gained on average by one random variable-e.g., Z-knowing about the other-e.g., Θ. Indeed, with this interpretation, MI is a good candidate for the statistical criterion S for optimal design. For an individual parameter, Θ i , or indeed for any combination of parameters {Θ i , Θ j }, the corresponding information gains can be similarly computed through I(Θ i ; Z) and I({Θ i , Θ j }; Z), respectively. Thus, while I(Θ i ; Z) quantifies the information gain individually for the parameter Θ i , the quantity I({Θ i , Θ j }; Z) quantifies information gain for the pair {Θ i , Θ j } jointly. A measure of correlation between the parameters Θ i and Θ j is, however, missing and is provided by conditional mutual information (CMI), defined as The CMI I(Θ i ; Θ j |Z) represents the additional information gained about the parameter Θ i when both Θ j and Z are known (term II) relative to when only the measurements Θ i alone are known (term III). Note that CMI is symmetrical-i.e., I(Θ i ; Θ j |Z) = I(Θ j ; Θ i |Z)and can be interpreted as a measure of dependence between the parameters given the measurements Z. It should also be noted that both MI and CMI are non-negative.
With the above background, many statistical measures can be constructed. For example: 1.
The mutual information for any single parameter may be maximised, giving S = I(Θ i ; Z). This approach only concerns the posterior of the parameter Θ i and ignores all other parameters; 2.
The joint mutual information may be maximised, giving S = I(Θ; Z). In the sense of classical optimal design, this can be interpreted as D-optimal design. This is because D-optimal designs minimise the determinant of the inverse Fisher Information Matrix, and S = I(Θ; Z) measures the information gain in the joint Θ space; 3.
The sum of individual parameter mutual information may be be maximised, giving In the sense of classical optimal design, this can be interpreted as A-optimal design. This is because A-optimal design minimises the trace of the inverse Fisher Information Matrix, and S = ∑ m i=1 I(Θ i ; Z) measures the sum of the information gains for all the parameters; 4.
Alternatively, one may seek to maximise individual parameter information gain while minimising pairwise CMI, thus seeking both small posterior variances and minimising pairwise correlations between the parameters. In this case, the statistical criterion is where τ > 0 is a regularisation parameter. Note that high CMI implies that a large amount of information can be gained only about a combination of the two parameters (for instance, their sum or product), but not for each parameter individually. Thus, we seek to minimise the CMI.
Note that the above list is not exhaustive, and based on the interpretations of MI and CMI, other criteria may be constructed based on the desired sense of optimality.

Estimating Mutual Information
In general, the forward model in Equation (8) is non-linear, and thus even if the observation operator is linear (implying linear combinations of the state are measured), the analytical computation of mutual information is intractable. Thus, the informationtheoretic quantities of MI and CMI must be estimated. A common method is to generate samples of Θ through the specification of an appropriate prior probability density p Θ (θ). Denoting these N s samples as θ (i) , i = {1, 2, · · · N s }, each θ (i) can be propagated through the forward and observation models of Equations (8) and (9) to produce corresponding samples of Z, denoted as z (i) . The samples of θ (i) and z (i) can subsequently be used on non-parametric estimators of MI and CMI. Such non-parametric estimators can broadly be classified into two categories: kernel density estimators (KDE) [29] and k-nearest neighbour (kNN) estimators [30,31]. For an overview of such methods, we refer to [32]. While the estimator proposed by Kraskov et. al. [30] is widely used and performs very well across a range of scenarios, one of its drawbacks is that it suffers from higher errors when extreme correlations are present between the variables and/or when the the data are effectively in a lower-dimensional manifold. Since we are working with models that specify explicit relationships between the variables through the forward and observation model, this is likely to be true for the data set of (θ (i) , z (i) ). Thus, in this study, we employ the local non-uniformity correction (LNC) proposed in [33], which includes a correction term to the original estimator by Kraskov et al. [30]. This term accounts for strong dependencies between the variables through local principle component analysis [33]. The method of [33] is used for the estimation of all MIs, and CMIs are estimated from the difference of two MIs; see Equation (15).

Dimensionality Reduction for the Biaxial Experiment
One of the main difficulties in estimating information-theoretic quantities is related to the data dimension. Non-parametric estimation is particularly challenging whenever the data are close to manifolds embedded in high-dimensional spaces. This is indeed the case when a physical model relates parameters and observable quantities. One of the possible ways to overcome this difficulty, or at least to mitigate it, is (dimension or) model reduction, which aims at discovering the underlying low-dimensional structure of a set of data (a comprehensive review of the topic can be found in [34][35][36][37]). A large spectrum of methods has been proposed in the literature. In the present contribution, we adopt a local reduced-basis method (similar in spirit to the methods proposed in [38,39]). Let the strains computed by the model be e 1,2 (σ; φ; k 1 , k 2 ), where k 1 and k 2 are the model parameters (k 1 , k 2 ) ∈ Ω k ⊂ R 2 , and σ ∈ Ω σ ⊂ R is the variable defined in Section 2.1.2. Let n ∈ N * ; thus, we introduce the following approximation: which is well defined by virtue of the Eckart-Young theorem. First, let us observe that a given protocol consists of a set of known angles Φ. An efficient way to construct the local reduced basis is therefore to introduce a Proper Orthogonal Decomposition (POD) for each of the angles φ j ∈ Φ. This corresponds to the search for an approximation of the form e (j) where r k Ω k = δ ik ( ·, · Ω σ ,withΩ k being the standard L 2 scalar product). The error in the approximation is related to the number n of modes retained: In the present work, a number n = 4 of modes proved to be sufficient in order to obtain errors smaller than 10 −3 in L 2 norm in the solution reconstruction. This means that the set of elements e 1,2 (σ; φ j ; k 1 , k 2 ) was close to the linear subspace spanned by the first n = 4 modes r (j) i . Henceforth, instead of considering the discretised e 1,2 we consider their coordinates in the subspace given by

Validation of Results against Existing Methods
Several methods and criteria to define and reach an optimal design of experiments have been proposed [12]. Among them, D-optimality criterion attempts at maximising the determinant of the information matrix. In the present case, this is equivalent to minimize the determinant of the inverse of the average Hessian of the loss function we would introduce in a classical parameter estimation method. In a noisy setting, and, in particular, when the noise is Gaussian, this cost function is equivalent to minus the logarithm of the likelihood function. Let the misfit function be f (θ) and E Θ denote the expectation operator. The average of the Hessian reads: where θ * is the value of the parameter minimising the loss function.

Overview of Approach for the Biaxial Experiments
In the context of the biaxial experiments, the parameters are k 1 and k 2 , represented as random variables K 1 and K 2 , respectively. The variability in these parameters is considered to be uniform (thus imposing a uniform prior distribution) in the following intervals: k 1 ∈ [5, 100] kPa and k 2 ∈ [5, 80]. For a single value of angle φ, the measurements are the strain values e 1 and e 2 and are measured at 100 points along the line defined by the angle φ. Here, we consider α = 16 discrete values of possible measurement angles φ uniformly distributed between, and including, 0 • and 90 • . For each angle φ, separate reduced bases of four modes for e 1 and e 2 are constructed through POD over 400 values of (K 1 , K 2 ) sampled uniformly in the aforementioned parametric space. Thus, for any angle φ, the dimensionality reduction approach projects e 1 and e 2 measured at 100 points along the line defined by φ to a basis of 4 + 4 modes. For a given protocol consisting of multiple angles, the measurement vector z (with a corresponding random variable Z) is the collection of all the reduced basis representations of e 1 and e 2 along the angles in the protocol. Lastly, the maximum number of angles in a protocol is restricted to C = 5, giving a total of 6884 unique combinations of the α = 16 angles.
For the estimation of MI and CMI, a total of N = 10, 000 values of (K 1 , K 2 ) are uniformly distributed in the parametric space. For each sample (k (i) 2 ), the numerical model of the biaxial experiment is run to produce e (i) 1 and e (i) 2 , which are then projected on to the reduced basis, giving z (i) . The N triplets of (k (i) 1 , k (i) 2 , z (i) ) are subsequently used for the estimation of MI and CMI through the LNC estimator (see Section 2.2.3). In Equation (16), we use τ = 1.

Results and Discussion
For all the 6884 combinations of angles, three statistical criteria are evaluated: (i) I(K 1 ; Z), (ii) I(K 2 ; Z) and (iii) I(K 1 ; Z) + I(K 2 ; Z) − I(K 1 ; K 2 |Z). While the first two criteria aim to maximise the information gain about K 1 and K 2 individually, the third criterion aims to maximise the information gain about K 1 and K 2 simultaneously while minimising the information dependence between them. Figures 2-4 show the variation in these three criteria when grouped by the number of angles in a protocol. In these figures, the values of information criterion when using two approaches to uniformly discretise the angular space within protocols are also presented. Observations from these plots are as follows:

1.
Generally, all the three information criteria increase with the increasing number of angles in the protocol. Intuitively, this is expected, as a higher number of angles implies more measurement data and hence a higher potential for the improved estimation of the parameters. This observation is true for the maximum information gain, minimum information gain and the mean information gain; 2.
Across all the three criteria, it is observed that the uniform discretisation is not necessarily reflective of the best protocol for estimating the parameters. In fact, in most cases, the performance of uniform discretisation is close to the mean information gain observed across all the angle combinations; 3.
From Figures 2 and 3, it is observed that the angular combinations that maximise information gain for K 1 are not identical-and vary significantly when more than two angles are simultaneously used-to those that maximise information gain for K 2 . This further motivates the use of a criterion that balances information gains in both the parameters while minimising their interdependence; 4. Figure 4 shows that the best combinations that maximise a balanced criterion, such as I(K 1 ; Z) + I(K 2 ; Z) − I(K 1 ; K 2 |Z), are a trade-off between the combinations of angles that maximise I(K 1 ; Z) and I(K 2 ; Z) individually. For example, when five angles are considered, the angles that maximise I(K 1 ; Z) are φ a = [66, 72, 78, 84, 90] and those that maximise I(K 2 ; Z) are φ b = [30,36,42,48,54], while the combination that maximises I(K 1 ; Z) + I(K 2 ; Z) − I(K 1 ; K 2 |Z) is [30,36,48,78,90], which has two angles from φ a and three angles from φ b . It should be noted that such a trade-off between maximising individual parameter gains is still significantly different to a uniform discretisation; 5.
Finally, it is observed that the worst combinations are all low angles: [0, 6,12,18,24]. This can be related to the fact that, at low angles, the applied stress is largely aligned along the stiff fibers of the tissue, thus resulting in lower strain values. Thus, the lower angles provide a small range of the observations, while the larger angles provide a larger range (Figure 5a), thereby containing more information about the parameters.   From this point onward, we present results only for the balanced information criterion S = I(K 1 ; Z) + I(K 2 ; Z) − I(K 1 ; K 2 |Z). Figure 6 shows the variation in S across all the combinations (x-axis and in log-scale to capture the spread) grouped by the number of angles in a protocol and sorted according to the increasing order of S within each such group. Within each group, observing the minimum and maximum values of S shows that a better choice of angles can lead to more than a 100% increase in the information gain compared to a poor choice. Furthermore, this shows that good combinations of a lower number of angles can lead to higher information gain compared to a higher number of angles with poor combinations. For example, the maximum S when only one angle is used is higher than many combinations with two to four angles. This emphasises the utility of optimal design and the proposed framework.       Figure 8 shows that even though four combinations are used in the index 26 protocol, it produces a lower S compared to when only a single angle is used in the index 28 protocol. Furthermore, since Figure 8 shows the first 150 out of 6884 combinations of Figure 7 (which is sorted in creasing order of S), all combinations here are relatively low S-producing protocols. Observing the high density of angles in the region φ < 24 • is indicative that lower values of angles-in particular, those less than 24 • -are relatively less informative when compared to higher values of angles. This behaviour is also apparent in Figure 9, which shows S values for protocols that use only one angle, and where a sharp jump can be observed when transitioning from 18 • to 24 • . This peculiar behaviour may be explained by the physics of the biaxial experiment. Looking at the resulting strains e 1 and e 2 of this transition (Figure 5b), we observe that the e 1 changes from positive to negative values. This behavior captures the important coupling between the two normal stresses and strains and is also related to the fiber dispersion in our constitutive model (Equation (1), [24]). It is remarkable and encouraging that the information-theoretic framework captures the physics of the problem without explicitly considering it in the framework. While for simpler lowdimensional models, the association between physics and optimal design may be relatively easy to see, inferring such behaviour is, in general, not trivial for more complex and higher-dimensional models.
Similarly to Figure 9, the results of the information-theoretic optimal design are further analysed for a higher number of angles. When two angles are considered, the S values in increasing order of magnitude and the corresponding angle combinations are shown in Figure 10. This figure re-iterates observations made previously: (i) the choice of combinations significantly affects the information gain, where the best combination gives approximately 20 nats more information compared to the worst combination; and (ii) generally speaking, higher angles are more informative compared to lower angles-in particular, angles below 24 • . While a similar analysis for more than two angle combinations can be easily performed, the efficient visual representation of such results is cumbersome and avoided in this manuscript.   To further illustrate the validity of the information-theoretic approach, a comparison with a classical method (see Section 2.2.5) is presented. For one and two angles in a protocol, Figure 11 shows a comparison between S and the log of the determinant of the inverse Fisher Information Matrix, log |H −1 |. It is encouraging that a high correspondence between the two metrics is observed. In particular, increases in S, implying higher information gains, are accompanied by corresponding decreases in log |H −1 |, implying a smaller volume of the parameter posteriors. A Pearson correlation coefficient of r = −0.76 is observed between S and log |H −1 |, implying a high similarity between the two metrics and validating the information-theoretic approach in part. We note that, when the number of the parameters become large, evaluating the Hessian would imply a non-negligible computational cost. On the contrary, the method used to evaluate the mutual information, as a primarily Monte Carlo-based estimation, is less severely dependent on the number of parameters. Furthermore, the computation of derivatives (either numerically or through adjoint based methods) may be cumbersome for certain types of models. Finally, we note that the effect of noise on information gain, and hence optimal design, can be easily assessed in the proposed framework by adding noise to the samples of Z (see Equation (9)).

Conclusions
A framework for optimal design based on information-theoretic quantities of mutual information and conditional mutual information is proposed. The framework treats information gain as the central criterion for inverse problems and proposes several informationtheoretic frameworks for a desired sense of optimality. The capabilities of this framework are tested on the optimal design problem for biaxial experiments, where the effect of the angle combinations along which the strains are measured is assessed in terms of parameter estimation through information gain. Without including any physics-based reasoning, and purely through the information-theoretic measures, it is found that low angles ≤ 24 • are not very informative regarding the parameters relative to high angles. These observations are then found to be consistent based on physics-based reasoning, thereby showing the efficacy of the proposed framework. Furthermore, it is demonstrated that measurements for a low total number of angles which are carefully chosen can be more informative compared to the case when measurements along a high number of poorly chosen angles are acquired, thus highlighting both the importance of optimal design for biaxial experiments and the utility of the proposed framework in determining good angle combinations. The application of the proposed framework to classical optimal design is performed, and it is shown that the results produced by the new framework are consistent with classical frameworks.

Limitations and Future Work
While the proposed framework is shown to perform well on a two-parameter problem, its performance in higher parameter problems is not assessed. This assessment represents the primary limitation of this work and an area of future assessment. In particular, the problems envisaged are largely related to the performance of the MI and CMI estimators in higher dimensions of both parameters and the measurements. While a dimensionality reduction approach was adopted in this study to minimise the adverse effects of the latter, this may not be possible in many forward and inverse problems. Thus, a large area of future work is related to the development of efficient and robust MI and CMI estimators. Note that several approaches are being proposed by researchers to solve this problem; see for example [40][41][42][43][44][45][46]. Lastly, a thorough comparison against classical optimal design methods