Practical Understanding of Cancer Model Identifiability in Clinical Applications

Mathematical models are a core component in the foundation of cancer theory and have been developed as clinical tools in precision medicine. Modeling studies for clinical applications often assume an individual’s characteristics can be represented as parameters in a model and are used to explain, predict, and optimize treatment outcomes. However, this approach relies on the identifiability of the underlying mathematical models. In this study, we build on the framework of an observing-system simulation experiment to study the identifiability of several models of cancer growth, focusing on the prognostic parameters of each model. Our results demonstrate that the frequency of data collection, the types of data, such as cancer proxy, and the accuracy of measurements all play crucial roles in determining the identifiability of the model. We also found that highly accurate data can allow for reasonably accurate estimates of some parameters, which may be the key to achieving model identifiability in practice. As more complex models required more data for identification, our results support the idea of using models with a clear mechanism that tracks disease progression in clinical settings. For such a model, the subset of model parameters associated with disease progression naturally minimizes the required data for model identifiability.


Introduction
Mathematical models serve an important role in the development of cancer theory and provide a framework to integrate and understand clinical data [1][2][3][4][5][6][7][8][9]. The attractiveness of mathematical models in clinical application comes from their ability to predict possible outcomes of a hypothetical treatment scenario based on a set of mechanisms, which contrasts with the black-box approaches in machine learning. Mathematical modelers assume that the characteristics of an individual relevant to treatment design can be represented by a set of parameters [10][11][12][13]. The influence of these parameters is then expressed in functional responses, which dictate the rate of each reaction within a preset model structure. The particular forms of the functional response and model structure are often borrowed from classic ecology and population studies and tested on data from cohorts of patients, which makes them a shared canvas for all patients [14]. On the other hand, the set of parameters that distinguishes the treatment outcome is unique to each individual [15][16][17]. If this unique set of parameters can be determined for a particular patient, then the model is presumed to be useful in predicting the most appropriate treatment for that patient. Thus, the concept of mathematical modeling coincides with the central idea behind precision medicine, where treatment should be formulated from the characteristics of each individual [18,19].
Mathematical models can be used to fit clinical observations and make real-time predictions of treatment outcomes. Given a sufficient amount of data for a patient and an appropriate model, one can use various statistical techniques to estimate the patient-specific set of parameters. Yet, how much data is "sufficient" depends on model complexity.
Comprehensive models, such as those found in systems biology studies, are more biologically realistic, but the extra layers of complexity often hinder attempts to estimate the patient-specific set of parameters. On the flip side, simpler models with fewer parameters may not be capable of fully capturing the set of possible clinical outcomes qualitatively and quantitatively. Yet, simple models can still sometimes be too complex relative to the available data. Let us consider a thought experiment using a simple differential equation describing the birth and death processes of a population of cancer cells $x$:
$$\frac{dx}{dt} = \beta x - \delta x, \tag{1}$$
where $\beta$ and $\delta$ are the per capita birth and death rates, respectively. We define $m = \beta - \delta$. This is perhaps the simplest growth equation that can be used as a building block for a more complex model. One example of the use of this simple model is by Claret et al. [20].
In clinical settings, estimates of cancer growth are available over time, either by direct methods, such as imaging, or by indirect estimates from tumor proxies [21][22][23]. The clinical data would then come in pairs $(t_n, x_n)$, where $x_n$ is the cancer population measured at time $t_n$. We can assume that the model fits the data perfectly in this thought experiment. Yet, without prior knowledge of $\beta$ or $\delta$, there would be no way to determine a unique value for either parameter in this very simple model. We can only find an infinite number of pairs of $\beta$ and $\delta$ that give the same value for $m$. Note that this is related to, but distinct from, uncertainty or sensitivity analysis, where the uncertainty in each parameter is constrained based on the error in the parameter estimation. The above thought experiment is a toy example meant to demonstrate the concept of model unidentifiability, which heuristically implies the inability to estimate the unique patient-specific set of parameters from the data. In other words, if the model is not identifiable (relative to the available data), then it is possible to find many sets of parameters that fit the data equally well statistically. To see why model unidentifiability represents a great obstacle in realizing the potential of mathematical models in precision medicine, we look at a specific example by Wu et al. [24]. This example comes from a modeling study using data from a clinical trial testing intermittent androgen deprivation therapy for treatment of metastatic prostate cancer [25]. The model used in the study is a mechanistic model developed from the classic Droop cell quota model and tested against several clinical datasets [11,15,26,27]. In fact, most of the model parameters can be determined directly from information in the literature [13,28], yet the model is still unidentifiable with respect to the available clinical data.
Figure 1 shows that five distinct sets of parameters demonstrably capture the data well in the fitting portion, yet only one provides an accurate forecast. This example illustrates why potential issues of model unidentifiability should be addressed completely prior to clinical application of mathematical models.
Figure 1. From [24], with permission, distributed under a Creative Commons Attribution (CC BY) license. The color of the fitted parameters corresponds to the forecast trajectory of the same color. In the fitting portion, five different sets of parameters produce nearly indistinguishable good fits to the data. However, in the forecasting portion, only one set provides accurate forecasting.
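The birth-death thought experiment from the Introduction can be checked numerically. The following is a minimal sketch, not part of the original study: the sampling times, the initial population, and the three $(\beta, \delta)$ pairs are hypothetical values chosen only so that every pair shares the same net rate $m = \beta - \delta$.

```python
import numpy as np

def birth_death(t, x0, beta, delta):
    """Closed-form solution of dx/dt = (beta - delta) * x."""
    return x0 * np.exp((beta - delta) * t)

t = np.linspace(0.0, 30.0, 16)   # hypothetical sampling times (days)
x0 = 1.0                         # hypothetical initial population

# Hypothetical (beta, delta) pairs that all share the net rate m = 0.05
pairs = [(0.10, 0.05), (0.30, 0.25), (1.05, 1.00)]
trajectories = [birth_death(t, x0, b, d) for b, d in pairs]

# The trajectories coincide up to round-off, so perfect data on x(t)
# cannot separate beta from delta; only m is determined.
max_gap = max(np.max(np.abs(tr - trajectories[0])) for tr in trajectories)
print(f"largest difference between trajectories: {max_gap:.2e}")
```

Because the three curves agree to floating-point precision, any fitting routine given these data has no basis for preferring one pair over another.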

Structural and Practical Identifiability
Depending on the type of model identifiability, there are various examples and techniques to address identifiability issues [29][30][31][32][33][34][35][36][37][38][39][40]. Here, we offer our perspective on this issue. Consider a dynamical system of the form:
$$\frac{dx}{dt} = g(x(t), u(t), \theta), \tag{2}$$
$$y(t) = h(x(t), u(t), \theta), \tag{3}$$
where $x(t) \in \mathbb{R}^m$ represents the state variables, $y(t) \in \mathbb{R}^d$ represents the measurable output (e.g., the data), $u(t) \in \mathbb{R}^p$ represents the (control) input vector, for example the administered drug, and $\theta \in \mathbb{R}^q$ represents the set of constant parameters. Note that while $\theta$ can contain time-varying parameters, for a biological system, this complexity can usually be avoided by using a functional representation of the time-varying parameters or by explicitly modeling the underlying processes that drive the temporal changes. Thus, for simplicity, we take $\theta$ to contain only constant parameters. The general definition of model identifiability follows [29,31].
Definition 1 (Model identifiability). The dynamic system given by Equations (2) and (3) is identifiable if θ can be uniquely determined from the given input u(t) and measurable output y(t).
If a system is not identifiable, then it is unidentifiable. Furthermore, if y(t) does not contain error, then the identifiability of the system is referred to as structural identifiability. Otherwise, it is referred to as practical identifiability.
The first example that we gave is an instance of structural unidentifiability and the second is a case of practical unidentifiability. We now expand on what it means for a dynamical model to be identifiable with respect to a measurable output. For this next example, we look at a one-dimensional logistic equation
$$\frac{dx}{dt} = r x \left(1 - \frac{x}{K}\right). \tag{4}$$
Here, $r$ and $K$ are the intrinsic rate of growth and carrying capacity, respectively, for the population denoted by $x$. Let $(r_1, K_1)$ and $(r_2, K_2)$ be two sets of parameters and assume perfect measurement $y(t) = x(t)$. We will also assume that $x(0)$ is known. Then, if $(r_1, K_1)$ and $(r_2, K_2)$ result in the same model dynamics,
$$r_1 x \left(1 - \frac{x}{K_1}\right) = r_2 x \left(1 - \frac{x}{K_2}\right). \tag{5}$$
For the condition above to hold for all $x(t)$, the coefficients of $x$ and $x^2$ on both sides must match, so we must have $r_1 = r_2$ and $r_1/K_1 = r_2/K_2$, which implies $(r_1, K_1) = (r_2, K_2)$ for almost all $x(0)$, except for possibly a set of measure zero. For instance, if $x(0) = 0$ or $K$, then the system is at its steady state, making the aforementioned comparison uninformative. To put it simply, if a model is identifiable, then two sets of parameters that give the same model dynamics must be identical. This means that an identifiable model does not suffer from the issue illustrated in Figure 1. Instead, the uncertainty in the model forecast is solely dependent on the uncertainty in the data.
We remark that the above system does not contain a control term; however, for biological systems, if the identifiability of the system without the control can be studied, then adding the control term afterward usually does not change its structural identifiability, see the example given by Eisenberg and Jain [31]. What we showed here is an example of a direct test of model identifiability, originally used by Denis-Vidal and Joly-Blanchard [41]. While the direct test method is often not used in practice, it serves as an intuitive description for model identifiability from a dynamical system perspective.
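The coefficient comparison behind the direct test can be illustrated with a few lines of code. This is a minimal sketch under stated assumptions: the helper names and the numerical values are hypothetical, and the logistic right-hand side is written as a polynomial $c_1 x + c_2 x^2$ with $c_1 = r$ and $c_2 = -r/K$.

```python
import numpy as np

def logistic_coeffs(r, K):
    """Coefficients (c1, c2) of dx/dt = c1*x + c2*x**2 for the logistic model."""
    return np.array([r, -r / K])

def params_from_coeffs(c):
    """Invert the coefficient map; valid when r != 0."""
    r = c[0]
    K = -r / c[1]
    return r, K

def birth_death_coeffs(beta, delta):
    """Single coefficient c1 of dx/dt = c1*x for the birth-death model."""
    return np.array([beta - delta])

# Logistic model: the map (r, K) -> (c1, c2) can be inverted, so equal
# dynamics force equal parameters (hypothetical values r = 0.3, K = 50).
r_rec, K_rec = params_from_coeffs(logistic_coeffs(0.3, 50.0))

# Birth-death model: two different parameter pairs give the same coefficient,
# so the dynamics alone cannot distinguish them.
same_dynamics = np.allclose(birth_death_coeffs(0.10, 0.05),
                            birth_death_coeffs(0.30, 0.25))
```

The contrast is the point of the sketch: the logistic coefficients can be mapped back to a single $(r, K)$, whereas the birth-death coefficient collapses two parameters into one number.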
To test for structural identifiability, software built on differential algebra theory, such as DAISY, is the gold standard [42]. However, structural identifiability does not guarantee practical identifiability, which is necessary for clinical application. The practical identifiability of a model should be studied with data, in particular using the Fisher information matrix or profile likelihood [24,31]. In these scenarios, the available data dictate the formulation of the model. However, because mathematical models are often developed independently of the data collection, modelers often must sacrifice certain realistic aspects of the model to keep it identifiable relative to the available data. Conversely, if one first builds a set of candidate models and determines the data required for accurate model identification, then it may be possible to obtain these data during the collection process. We consider the latter case an ideal scenario, where mathematical modelers and clinicians can collaborate effectively.

Observing-System Simulation Experiment via Monte Carlo Method
In the ideal scenario mentioned above, Monte Carlo simulation is our tool of choice to determine the data required for a model to be identifiable. First, we introduce the statistical model [43]:
$$y(t_i) = h(x(t_i), u(t_i), \theta) + e_i, \quad i = 1, \dots, N, \tag{6}$$
where $h(x(t_i), u(t_i), \theta)$ is the measurement and $\theta$ is the vector of parameters estimated from the observations $\{y(t_i)\}_{i=1}^{N}$ at times $\{t_1, \dots, t_N\}$. Assuming no model error, the general form of the measurement error is
$$e_i = h(x(t_i), u(t_i), \theta)^{f} \, \epsilon_i, \tag{7}$$
with $f \geq 0$ and $\epsilon_i$ taken to be independent and identically distributed random variables with mean 0 and variance $\sigma_0^2$. For biological applications, it is reasonable to expect the measurement error to be proportional to the measurement itself, so we fix $f = 1$, giving us a relative error. The steps of the Monte Carlo simulation method follow [29,44].

1. Determine the appropriate set of true parameters $\theta_0$ for the simulation.
2. Numerically solve the ODE model to obtain the measurements at the desired time points.
3. Generate $M$ sets of simulated data from the statistical model (6) and (7) with a Gaussian error structure with mean 0 and a chosen standard deviation $\sigma_0$%.
4. Fit the model to each of the $M$ simulated data sets to obtain the parameter estimates $\hat{\theta}_i$, $i = 1, \dots, M$. Here, we take $M$ to be 200 sets.
5. Calculate the average relative estimation error (ARE) for each element of $\theta$ as
$$\mathrm{ARE}(\theta^{(k)}) = 100\% \times \frac{1}{M} \sum_{i=1}^{M} \frac{\left|\theta_0^{(k)} - \hat{\theta}_i^{(k)}\right|}{\left|\theta_0^{(k)}\right|},$$
where $\theta_0^{(k)}$ and $\hat{\theta}_i^{(k)}$ are the $k$-th elements of $\theta_0$ and $\hat{\theta}_i$, respectively.
6. A model is practically identifiable if the ARE is less than the measurement error level $\sigma_0$ (for $\sigma_0 > 0\%$), meaning we want the error in the parameter estimation to be less than the error in the data. When $\sigma_0$ is 0% and the ARE is sufficiently close to 0%, the model is considered to be structurally identifiable.
We borrow the idea from the observing-system simulation experiment, in which we use different hypothetical sets of data to test whether they can help us identify key parameters [45,46]. By continually restricting the amount of information we have from the data, we can approximate the threshold of information required for model identification.
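The Monte Carlo steps above can be sketched in code. This is an illustrative sketch under several assumptions rather than the authors' implementation: a closed-form logistic curve stands in for the numerical ODE solve, SciPy's `least_squares` stands in for the optimizer, and the parameter values, sampling times, noise level, and reduced number of replicates in the demonstration are all hypothetical.

```python
import numpy as np
from scipy.optimize import least_squares

def logistic(t, r, K, x0=1.0):
    """Closed-form logistic solution, used here as a stand-in for an ODE solve."""
    return K / (1.0 + (K / x0 - 1.0) * np.exp(-r * t))

def average_relative_error(theta0, t, sigma0, M=200, seed=0):
    """Monte Carlo ARE (in percent) for each element of theta0 = (r, K)."""
    rng = np.random.default_rng(seed)
    theta0 = np.asarray(theta0, dtype=float)
    clean = logistic(t, *theta0)                 # step 2: noise-free output
    rel_err = np.zeros_like(theta0)
    for _ in range(M):
        # step 3: relative Gaussian noise (f = 1), y = h + h * eps
        y = clean * (1.0 + sigma0 * rng.standard_normal(t.size))
        # initial guess drawn uniformly within 50% of the true value
        guess = theta0 * rng.uniform(0.5, 1.5, size=theta0.size)
        # step 4: least-squares fit to this simulated data set
        fit = least_squares(lambda th: logistic(t, *th) - y, guess,
                            bounds=(1e-8, np.inf))
        rel_err += np.abs(theta0 - fit.x) / np.abs(theta0)
    return 100.0 * rel_err / M                   # step 5: average over M fits

# Hypothetical demonstration with a small M to keep the run short
t = np.linspace(0.0, 60.0, 25)
are_clean = average_relative_error((0.2, 100.0), t, sigma0=0.0, M=20)
are_noisy = average_relative_error((0.2, 100.0), t, sigma0=0.05, M=20)
# step 6: compare each ARE (percent) with the noise level (percent)
identifiable = are_noisy < 100.0 * 0.05
```

Repeating the call with sparser or denser `t`, or with a different `sigma0`, gives a simple version of the observing-system experiment: the question is how far the data can be thinned before the ARE exceeds the noise level.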
The MC simulation approach is not without limitations. One limitation comes from choosing the initial guesses. For example, if we start our initial guess close enough to the true set of parameters, then the effect of the error $\sigma_0$ may be limited. However, if our guesses are far away from the true set of parameters, then the numerical optimization may become trapped in a local minimum away from the true estimate, or the parameters may not be sensitive enough to be estimable. One can start with random samples of initial guesses to have a better chance of reaching the true estimates. However, this sampling approach does not inherently deal with the issue of insensitive parameters. Here, we pick the initial guesses randomly within 50% of the true value. In order to rule out any parameters that are not sensitive with respect to the tolerance of the numerical optimization schemes, we carry out the MC approach for each individual parameter with error-free data ($\sigma_0 = 0\%$). Any parameters that cannot be refitted within a reasonable ARE are eliminated from the pool of free parameters (that is, they become fixed). The remaining parameters are deemed to be sensitive enough for the numerical scheme. As we tighten the tolerance of the numerical scheme, eventually we should be able to fit all parameters when no error is present in the data.

Two Mathematical Models for Prostate Cancer
Many mathematical models for prostate cancer have been developed in the past two decades [1,18,47], with many recent studies focused on immuno- and chemo-treatments of prostate cancer [48][49][50][51][52][53][54][55][56]. Here, we diverge from this trend and instead use two simpler mechanistic models to demonstrate the concept of model identifiability in practice. Both models contain a clear prognostic parameter that keeps track of cancer progression, which greatly simplifies their structure. Using these models, we aim to show that even if the model itself may not be identifiable, having a model-based prognostic parameter allows modelers to focus resources on identifying these key parameters. This would be helpful in practical settings due to limitations in data acquisition.
A cancer stem cell model. Cancer stem cells propel cancer's therapeutic resistance and are thought to be a primary factor in the initiation and progression of prostate cancer [57][58][59][60]. Utilizing the mathematical model below, in conjunction with the stem cell hypothesis, could provide a better understanding of prostate cancer's acquisition of castration resistant cells and their heterogeneity within a mass. Prostate cancer stem cells are thought to express little to no androgen receptors, giving them the ability to multiply their population without a hormone requirement [61]. Resistance is achieved with cancer stem cells' ability to thrive in the absence of androgen, which provides a means for cancer to continue to evolve during and after intervention with intermittent androgen deprivation therapy [12,17,62].
Prostate cancer stem cells continue to rapidly divide after treatment, either asymmetrically to form differentiated cells or symmetrically to form additional stem cells. The production of differentiated cells results in negative feedback of the production of stem cells. However, unlike stem cells, differentiated cells are affected negatively by androgen deprivation therapy. The ability to withstand androgen deprivation is just one of the many contributing factors that give rise to the renewal of stem cells. For instance, mitochondrial fission factor expression plays a role in the evolution and multiplicity of prostate cancer stem cells [63].
Here, we use a novel model built upon this concept for prostate cancer from the studies by Brady-Nicholls et al. [12,17,62]. The model considers three compartments: the cancer stem cells ($S$), the differentiated cancer cells ($D$), and the PSA byproduct ($P$). Although simple in structure, the model has shown promise in its applicability.
The cancer stem cell population $S$ is assumed to divide at a rate $\lambda$ to produce either one stem cell and one cancer cell with probability $p_s$, or two cancer cells. This division has a negative feedback from the differentiated cancer cells, which takes the form $\frac{S}{S+D}$. The cancer cell is killed by the drug at a constant rate $\alpha$, where $T_x$ denotes the application of the drug. PSA is produced by cancer cells at a rate $\rho$ and is cleared from the blood stream at a rate $\psi$.
Since the drug applications for these models, $u$ and $T_x$, are known inputs, we treat them as constant for simplicity. Because they are known, their variation in time should not affect the identification of the other factors. Additionally, in practice, the drug application would be fixed for a certain period of time depending on the specific treatment. We take the following parameter values as the true values for our study:
A cell quota cancer model. Prostate cancer cells require androgen for growth, which is why the effect of androgen is regularly incorporated into prostate cancer models [1,64,65]. However, the quantitative connection between androgen and prostate cancer growth is not well characterized, leading to various functional forms used for this purpose.
Here, we use a cancer model that integrates the effect of androgen based on a stoichiometric modeling framework [15,66,67]. The model was developed in a series of studies that highlight the importance of androgen dynamics in prostate cancer growth [11,13,16,[26][27][28]64,68,69]. In this model, cancer independence from androgen is modeled explicitly as a variable and can be used as an indicator of cancer growth. Meade et al. later expanded on this idea to build a more biologically realistic model of cancer growth for predicting treatment failure [16]. Despite its simplicity, the model is founded on established biological principles and can capture and predict the dynamics of cancer progression.
$$\frac{dP}{dt} = \underbrace{bQ}_{\text{baseline production}} + \underbrace{\sigma x Q}_{\text{production by cancer cells}} - \underbrace{\epsilon P}_{\text{clearance}}$$
The cancer population, denoted by $x$, grows based on the Droop cell-quota model. The death rate consists of an androgen-dependent term, $\nu(t)\frac{R}{Q+R}x$, and a density-dependent term, $\delta x^2$. Here, $\nu(t)$ is the maximal androgen-dependent death rate for the cancer. The authors assume that the cancer cells lose their dependence on androgen at a rate $-d\nu$, which can be interpreted as the "rate of gaining androgen independence". With this interpretation, under androgen deprivation therapy, the treatment would gradually become ineffective. $Q$ and $P$ are the intracellular androgen level and serum PSA, respectively. The dynamics of $Q$ are governed by an influx of serum androgen and the uptake by cancer cells. $\gamma_1$ and $\gamma_2$ represent the rates at which androgen is produced by the testes and the adrenal gland, respectively, with the drug application denoted by $u$. $P$ is assumed to be produced at a baseline level by normal cells, but mainly by cancer cells, and is cleared from the blood stream at a constant rate. We take the following parameter values as the true values for our study: $u = 0.5$ (dimensionless), $\mu_m = 0.009$ day$^{-1}$, $q = 0.4$ nmol day$^{-1}$, $R = 3$ nmol L$^{-1}$, $\delta = 45$ L$^{-1}$ day$^{-1}$, $d = 0.0001$ day$^{-1}$, $\gamma_1 = 0.08$ day$^{-1}$, $\gamma_2 = 0.004$ day$^{-1}$, $Q_m = 30$ nmol L$^{-1}$, $b = 0.0001$ µg nmol$^{-1}$ day$^{-1}$, $\sigma = 0.001$ µg nmol$^{-1}$ L$^{-1}$ day$^{-1}$, and $\epsilon = 0.1$ day$^{-1}$ [13,15].

Parameter Optimization
When the data are of a single type, for example, the cancer stem cell model with PSA data, we use the standard root mean squared error (RMSE) as the objective. When the data are composed of multiple types, for example PSA and androgen, we weigh the error contribution from each source equally. Any deviation from this fitting procedure is noted on a case-by-case basis. Finally, we use the built-in MATLAB function lsqnonlin for our optimization.
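The objective described above can be written compactly. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the numbers are made up for illustration, and "equal weighting" is interpreted as a plain average of the per-source RMSE values without any rescaling between data types.

```python
import numpy as np

def rmse(pred, obs):
    """Root mean squared error between model output and data."""
    pred, obs = np.asarray(pred, dtype=float), np.asarray(obs, dtype=float)
    return np.sqrt(np.mean((pred - obs) ** 2))

def weighted_objective(preds, data, weights=None):
    """Weighted sum of per-source RMSE values; equal weights by default."""
    if weights is None:
        weights = np.full(len(data), 1.0 / len(data))
    return sum(w * rmse(p, d) for w, p, d in zip(weights, preds, data))

# Hypothetical observations and model outputs for two data types
psa_obs, psa_fit = [4.0, 6.0, 9.0], [4.5, 6.0, 8.5]
androgen_obs, androgen_fit = [12.0, 10.0, 3.0], [12.0, 11.0, 3.0]

# Equal weighting of the two sources (weights of 0.5 each)
score = weighted_objective([psa_fit, androgen_fit], [psa_obs, androgen_obs])
```

Passing `weights=[w, 1 - w]` gives an unequal weighting of two sources, which is one way to explore how sensitive the estimates are to the balance between data types.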

The Identifiability of Two Prostate Cancer Models
First, we study the identifiability of the model given the measurements that are usually available directly for parameter estimation. In the case of the cancer stem cell model, this measurement is taken to be PSA. In the case of the cell quota model, the measurements are PSA and androgen. We also note that the spacing between the synthetic data points is kept constant in this section.
The cancer stem cell model. An example of the synthetic data and fitting for the cancer stem cell model is presented in Figure 2. Table 1 shows the results from the sensitivity test for each individual parameter for the cancer stem cell model. Out of the seven parameters and initial conditions, $p_s$ and $D(0)$ are not sensitive enough to be identifiable for larger measurement error. On the other hand, λ, α, ρ, ψ, and $S(0)$ appear to be sufficiently sensitive. Thus, we fix $p_s$ and $D(0)$.
Next, we carry out the MC scheme for all of the remaining parameters at once, namely λ, α, ρ, ψ, and $S(0)$. The results in Table 2 show that only ψ is practically identifiable. To see why the other parameters are not identifiable with PSA data alone, we fix ψ and test the identifiability of all 2-combinations of the remaining parameters (i.e., we fit two parameters at a time while fixing the rest). We find that none of the 2-combinations are practically identifiable. Since each parameter being tested is sensitive enough to be identifiable by itself, this indicates the existence of an unknown relationship among the remaining four parameters (i.e., λ, α, ρ, and $S(0)$). To demonstrate this point, we plot the estimated values of these parameters in 2-combinations and show that an approximate relationship between these parameters can be obtained by a simple regression (Figure 3). Without error ($\sigma_0 = 0\%$), the relationships between λ, α, and ρ are evident from the 2-combination test.
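The idea of regressing one set of estimates against another can be illustrated on a much simpler system. The sketch below is an assumption-laden toy, not the cancer stem cell model: it reuses the birth-death equation with $x(0) = 1$, works on log-transformed data so that the fit is a linear least-squares problem, and takes the minimizer closest to a random starting guess as a stand-in for a local optimizer. All numerical values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 30.0, 16)           # hypothetical sampling times
m_true, sigma0 = 0.05, 0.02              # hypothetical net rate and noise level
A = np.column_stack([t, -t])             # log x(t) = beta*t - delta*t for x(0) = 1

estimates = []
for _ in range(200):
    # synthetic data with relative Gaussian noise
    y = np.exp(m_true * t) * (1.0 + sigma0 * rng.standard_normal(t.size))
    start = rng.uniform(0.0, 1.0, size=2)        # random initial guess (beta, delta)
    # least-squares minimizer closest to the starting guess
    step, *_ = np.linalg.lstsq(A, np.log(y) - A @ start, rcond=None)
    estimates.append(start + step)
beta_hat, delta_hat = np.array(estimates).T

# Simple regression of one estimate on the other exposes the hidden relation
slope, intercept = np.polyfit(delta_hat, beta_hat, 1)
print(f"beta_hat ~ {slope:.3f} * delta_hat + {intercept:.3f}")
```

In this toy case the individual estimates scatter widely, yet they fall on a line with slope close to 1 and intercept close to $m$, which is the kind of relationship a scatter plot of 2-combinations is meant to reveal.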
In Brady et al. [62], the authors find that the prostate cancer stem cell renewal rate $p_s$ is a good indicator of resistance timing. However, our analysis shows that in order to utilize $p_s$ to make clinical predictions, modelers must have a solid grasp on the values of all other parameters and a good understanding of the appropriate value for $p_s$ in the cancer stem cell model. This agrees with the approach taken in Brady et al. to obtain model identifiability for $p_s$ [17,62].
The cell quota cancer model. An example of the synthetic data and fitting for the cell quota cancer model is presented in Figure 4. Similarly, we carry out these tests with respect to the 13 parameters and initial conditions for the cell quota cancer model. Table 3 shows that out of these, only µ, $Q_m$, and $\epsilon$ are sufficiently sensitive when each parameter is fitted individually with PSA and androgen data. When fitting the three sensitive parameters together, we find that only $\epsilon$ is practically identifiable (see Table 4). To see why the remaining two parameters (µ and $Q_m$) are not identifiable, we fix $\epsilon$ and study the correlation between these two parameters. Here, we find a similar linear relationship between the estimates for µ and $Q_m$ with or without error in the data (see Figure 5). This hidden correlation interferes with the estimability of these two parameters.
In Baez and Kuang, the parameter d (or variable ν(t)) is created to keep track of the development of cancer resistance. However, our analysis indicates a similar issue to the cancer stem cell model where the relevant parameter is not identifiable with the available data. If we want to have an accurate estimate of d, we must have a strong grasp on the values of all other variables in the model and a good guess for an appropriate value for d.

Observing-System Simulation Experiment-Identifiability of Treatment Resistance Parameter
Now, we turn our focus to answering the question: how much data is necessary to determine the key model-based prognostic parameter? To address this question, we synthesize candidate sets of data that vary in the type and frequency of data collected. Then, we study the identifiability of these key parameters using each synthetic dataset.
The identifiability of $p_s$ in the cancer stem cell model. Recall that $p_s$ is not identifiable even when estimated by itself with only PSA data (Table 1). Thus, we carry out simulations to determine the amount of data required to sufficiently characterize $p_s$, which is used to predict treatment success and failure in Brady et al. [17,62]. For the experiment, we assume all parameters (except for $p_s$) can be obtained by other means, so they are fixed to the values used to generate the synthetic dataset for these simulation experiments. Table 5 summarizes the main results of the experiments to determine the identifiability of $p_s$. The frequency of data points appears to be the most influential factor in identifying $p_s$, followed by the inclusion of the measurement of cancer stem cells. On the other hand, measurement of the cancer population, optimization tolerance, and (linear) weight contribution from different sources of error have less of an impact on the identifiability of $p_s$. Interestingly, if a measurement of PSA can be taken roughly every 5 h, then we could accurately determine $p_s$ as well (given that it is the only parameter we need to estimate).
While the results in Table 5 suggest the frequency of measurements plays a key role, when fitting all parameters together with pseudo-continuous measurements of PSA and cancer stem cells, the model remains unidentifiable (see Table 6). However, if the measurements are very precise ($\sigma_0 \approx 0\%$), the estimated values of each parameter are within acceptable ranges that may still be useful in making predictions (less than 10% difference from the true value).
The identifiability of $d$ in the cell quota cancer model. Similarly, recall that $d$ is not identifiable in the cell quota cancer model even when estimated by itself with both androgen and PSA data (Table 3). Hence, we carry out simulations to determine the amount of data required to identify $d$. As before, we fix all other parameters for these simulation experiments. Table 7 summarizes the main results of the experiments. We reach a similar conclusion that the frequency of data points appears to be the most influential factor in identifying $d$, followed by the inclusion of the measurement of cancer cells. However, with larger error margins, the measurement of cancer cells loses its effectiveness, which is problematic because accurate estimates of cancer populations are difficult to obtain in practice. Meanwhile, all other factors have a negligible effect on the estimation of $d$. Finally, if a measurement of cancer cells, androgen, and PSA can be taken every 24 h, then we could accurately determine $d$. Interestingly, if a measurement of the cancer population is not available, but we can obtain pseudo-continuous data for $Q$ and $P$, then it is possible to determine the value of $d$ within a reasonable range.
Table 5. The identification of $p_s$ in the cancer stem cell model. The test is carried out for $p_s$ (all other parameters and initial conditions are fixed to their true values). Baseline frequency (data) indicates a measurement is taken every 10 days. Pseudo-continuous data indicates a measurement is taken roughly every 2.4 h. Increased optimization tolerance refers to a one-fold increase in the function tolerance and optimality tolerance of the optimization function. Weight (ω) comes from the minimization objective, $\omega \times \mathrm{RMSE}_P + (1 - \omega) \times \mathrm{RMSE}_S$: ω > 0.5 means higher weight is given to the error in $P$ and ω < 0.5 means higher weight is given to the error in $S$. Asterisk (*) indicates practical identifiability.
Table 7. The identification of $d$ in the cell quota cancer model. The test is carried out for $d$ (all other parameters and initial conditions are fixed to their true values). Baseline frequency (data) indicates a measurement is taken every 10 days. Pseudo-continuous data indicates a measurement is taken roughly every 2.4 h. Increased optimization tolerance refers to a one-fold increase in the function tolerance and optimality tolerance of the optimization function fmincon (MATLAB). Weight 1 = $\omega_1$ and weight 2 = $\omega_2$ are the weights in the minimization objective.
As before, if none of the other parameters are known, then the model is not identifiable even with pseudo-continuous data of the cancer population, androgen, and PSA (see Table 8). Yet, if those measurements can be taken very precisely ($\sigma_0 \approx 0\%$), then the parameters can still be estimated with reasonable accuracy for application.

Discussion
Mathematical models not only contribute to the foundation of cancer theory but can also be integrated to provide a better prognostic tool for clinicians in clinical settings. For example, one may apply mathematical models to better understand cancer progression dynamics and make predictions of treatment outcomes based on a patient's characteristics [18]. Yet, the issue of practical identifiability remains a major obstacle to realizing the clinical potential of mathematical models. In this study, we explore the issue of model identifiability from a clinical perspective. First, we study the general identifiability property of the model. Then, we narrow down the parameter that can be used to predict treatment outcomes and look for the appropriate set of data for its identification using Monte Carlo simulation. Our results provide insights into the type of data acquisition that can enable future incorporation of mathematical models into clinical applications.
The frequency of data collection plays a major role in model identifiability. It is well known that increasing the number of measurements increases the chances of obtaining true estimates of model parameters, assuming that the measurements are perfect and the model is structurally identifiable [70]. Here, we demonstrate in both examples that increasing the frequency of measurements, even in the presence of Gaussian noise, can increase model identifiability. However, simply increasing the number of data points will not overcome structural unidentifiability. The dataset should cover multiple temporal regions of cancer growth, so that the model can be tested more comprehensively to prove its usefulness. This finding suggests that the development of devices or procedures to obtain measurements, such as PSA and androgen, on a regular basis can help to accurately identify the values of prognostic parameters.
Cancer population data can help reduce the uncertainty in model identifiability. We demonstrate that the inclusion of cancer population measurements (or stem cells) can increase the model identifiability. In practice, this can be completed by using imaging data or indirectly measuring circulating cancer cells [71][72][73][74][75]. On the other hand, for models that incorporate mechanisms for cancer growth using androgen, androgen data seem to be necessary for model identifiability. Unfortunately, these measurements are not widely adopted, making it difficult to integrate these ideas effectively. Furthermore, we carried out the same computational experiments on several cancer models with multiple subpopulations (not shown). The results suggest a measurement that helps to distinguish different cancer subpopulations (e.g., the frequency of each cancer subpopulation) may be necessary to obtain model identification.
Highly accurate data may be the key to addressing model identifiability in practice. Perhaps the most intriguing finding is that, with very accurate data, the prognostic parameters and some other model parameters are reasonably identifiable. Unlike increasing the frequency of measurements, improving the accuracy of measurements does not require additional compliance from patients. Continual advances in the techniques and equipment used to measure the relevant biomarkers may therefore make this route increasingly feasible. We also note that certain biological markers, such as androgen, vary significantly throughout the day and with diet [64], so better clinical protocols may need to be implemented to obtain more accurate measurements.
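A deliberately simple calculation shows how accuracy and frequency can trade off. It is not one of the models studied here. Suppose a biomarker grows exponentially and is observed on the log scale at n evenly spaced times over [0, T], with independent Gaussian noise of standard deviation σ. The symbols r, σ, n and T are assumptions introduced only for this illustration. The least-squares estimate of the growth rate r then has variance

```latex
\[
\operatorname{Var}(\hat{r})
  = \frac{\sigma^{2}}{\sum_{i=1}^{n} (t_i - \bar{t})^{2}}
  = \frac{12\,\sigma^{2}\,(n-1)}{T^{2}\, n\,(n+1)}
  \approx \frac{12\,\sigma^{2}}{T^{2}\, n}
  \quad \text{for large } n .
\]
```

Under these assumptions the standard error scales as σ/√n, so halving the measurement noise gives roughly the same precision as quadrupling the number of visits. The nonlinear models considered in this study need not follow this scaling exactly.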
So far, we have only discussed the applicability of model identifiability in terms of key prognostic parameters. However, mathematical analysis can also yield analytical conditions for treatment outcomes in terms of a combination of model parameters. This can provide a deeper understanding of the key factors that drive the progression of cancer and may even shed light on novel treatments. To apply analytical results in practice, however, one needs to assess the interconnection between the parameters and how they may change in response to external factors. If the way these parameters change over time during treatment can be assessed, one can then use the analytical condition to determine the treatment outcome directly. Nevertheless, model identifiability remains a crucial component for this approach to work.
Another aspect of model identifiability is the statistical method used for parameter estimation. Most approaches in the literature fit each individual separately, which limits the data used for parameter estimation to that individual's own measurements. An alternative approach is to use population fitting with mixed effects. This should not be confused with pooling individual data and fitting to the average. Instead, this approach assumes that the value of each parameter varies between individuals but follows some distribution across the whole population. Thus, we can utilize the data of all patients simultaneously to fit the model. A software package often used to implement this approach is Monolix [76]. Examples of this approach can be found in the within-host viral dynamics literature [77,78]; however, it has yet to gain traction in the mathematical and computational oncology literature.
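The idea of borrowing information across patients can be sketched in a linear-Gaussian toy setting. This is not how Monolix fits nonlinear mixed-effects models, and it is not an analysis performed in this study. It assumes that each patient's true growth rate is drawn from a normal population distribution, that each individual fit returns that rate plus Gaussian estimation error of known standard deviation, and it applies a simple empirical-Bayes shrinkage toward the population mean. The function name `compare_individual_and_population` and all numerical values are hypothetical.

```python
import random
import statistics


def compare_individual_and_population(n_patients=200, pop_mean=0.3,
                                      pop_sd=0.05, est_sd=0.1, seed=1):
    """Return (mse_individual, mse_shrunk) for a synthetic cohort.

    Toy assumptions: true rates ~ Normal(pop_mean, pop_sd), and each
    individual estimate equals the true rate plus Normal(0, est_sd)
    error, with est_sd treated as known.
    """
    rng = random.Random(seed)
    true_rates = [rng.gauss(pop_mean, pop_sd) for _ in range(n_patients)]
    individual = [r + rng.gauss(0.0, est_sd) for r in true_rates]

    # Population mean and between-patient variance by method of moments.
    mu_hat = statistics.fmean(individual)
    total_var = statistics.variance(individual)
    between_var = max(total_var - est_sd ** 2, 0.0)

    # Shrink each individual estimate toward the population mean.
    weight = between_var / (between_var + est_sd ** 2)
    shrunk = [mu_hat + weight * (r - mu_hat) for r in individual]

    def mse(estimates):
        return statistics.fmean(
            [(e - r) ** 2 for e, r in zip(estimates, true_rates)]
        )

    return mse(individual), mse(shrunk)


if __name__ == "__main__":
    mse_ind, mse_pop = compare_individual_and_population()
    print("individual fits:", mse_ind)
    print("population-informed:", mse_pop)
```

When individual data are noisy relative to the between-patient variability, the population-informed estimates are closer to the true values on average in this sketch. The benefit for nonlinear cancer models would need to be checked with a dedicated mixed-effects fit.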
In summary, we find that incorporating frequent data measurements, different types of data (especially those related to the cancer population), and high-accuracy measurements will increase the likelihood of practical identification of prognostic parameters. Because more complex models contain more parameters, which makes complete model identification more difficult, our results advocate for the use and development of models with a mechanism that tracks disease progression. By incorporating such a mechanism, a subset of model parameters (those associated with the mechanism) naturally becomes the focus of model identifiability. This mitigates the problem of model non-identifiability and provides a means for making predictions regarding the outcome of treatment. Our study has several major limitations, such as the assumption of a perfect model (no model error). Such issues can perhaps be addressed by continually improving the model or by using a data assimilation approach, such as the Kalman filter [79,80]. We also do not employ patient-specific data in our simulation study. These limitations can be explored in future studies.

Data Availability Statement: All code will be made available upon request.