Article

An Entropy-Based Approach to Model Selection with Application to Single-Cell Time-Stamped Snapshot Data

by William C. L. Stewart 1,*, Ciriyam Jayaprakash 2,* and Jayajit Das 3,4,*

1 GIG Statistical Consulting LLC., 391 E. Livingston Avenue, Columbus, OH 43215, USA
2 Department of Physics, The Ohio State University, Columbus, OH 43210, USA
3 Steve and Cindy Rasmussen Institute for Genomics, Abigail Wexner Research Institute at Nationwide Children’s Hospital, Columbus, OH 43205, USA
4 Department of Pediatrics, College of Medicine, The Ohio State University, Columbus, OH 43210, USA
* Authors to whom correspondence should be addressed.
Entropy 2025, 27(3), 274; https://doi.org/10.3390/e27030274
Submission received: 17 December 2024 / Revised: 14 February 2025 / Accepted: 19 February 2025 / Published: 6 March 2025
(This article belongs to the Section Entropy and Biology)

Abstract

Recent single-cell experiments that measure copy numbers of over 40 proteins in thousands of individual cells at different time points [time-stamped snapshot (TSS) data] exhibit cell-to-cell variability. Because the same cells cannot be tracked over time, TSS data provide key information about the statistical time-evolution of protein abundances in single cells, information that could yield insights into the mechanisms influencing the biochemical signaling kinetics of a cell. However, when multiple candidate models (i.e., mechanistic models applied to initial protein abundances) can potentially explain the same TSS data, selecting the best model (i.e., model selection) is often challenging. For example, popular approaches like Kullback–Leibler divergence and Akaike’s Information Criterion are often difficult to implement, largely because mathematical expressions for the likelihoods of candidate models are typically not available. To perform model selection, we introduce an entropy-based approach that uses split-sample techniques to exploit the availability of large data sets, together with (1) existing generalized method of moments (GMM) software to estimate model parameters, and (2) standard kernel density estimators and a Gaussian copula to estimate candidate models. Using simulated data, we show that our approach can select the “ground truth” from a set of competing mechanistic models. Then, to assess the relative support for a candidate model, we compute model selection probabilities using a bootstrap procedure.

1. Introduction

Ordinary differential equations (ODEs) are commonly used to model the sub-cellular dynamics of proteins and mRNA [1,2,3]. Usually, ODEs describe the deterministic dynamics of average concentrations, which facilitates the estimation of reaction rates (also known as model parameters) from experimental data. This is often an important step towards building mechanistic biological models, but the task of estimating model parameters from experimental data is challenging for a variety of reasons [4,5,6], especially when the number of distinct proteins measured in experiments is smaller than the number of model parameters. Recent developments in single-cell experimental techniques for the longitudinal measurements of transcripts and proteins in an individual cell, such as single-cell RNA-seq [7] or single-cell mass cytometry by time-of-flight (CyTOF) [8,9], which can simultaneously measure over a thousand different RNA sequences or more than thirty different protein species in a single cell, appear to alleviate this problem. Since individual cells are not tracked across time in these experiments, the measurements generate a large collection of time-stamped snapshot (TSS) data. Another challenge stems from the cell-to-cell differences in the copy number of a protein (or abundance), which contains variation present at the pre-stimulus state (also known as extrinsic noise), as well as variation arising from the inherent stochasticity of biochemical reactions over time (also known as intrinsic noise) [10,11]. When the observed protein abundances are large, intrinsic noise can be ignored, extrinsic noise is known to play a significant role [12], and single-cell protein signaling kinetics can be well approximated by ODEs. 
By comparing the distribution of protein abundances in single cells observed at time t to the predictions obtained from an ODE or a stochastic model that evolves single-cell protein abundances seen at an earlier time (e.g., t = 0 , also known as initial conditions), we can estimate the parameters of the candidate model using a generalized method of moments (GMM) approach [13,14]. GMM, which contrasts sample moments and their corresponding expectations, is widely used in econometrics [15]. In this paper, we propose an entropy-based approach to address the larger question of model selection for systems that can be described by deterministic dynamical models (e.g., sets of ODEs) with randomness arising from initial conditions. Specifically, for the models considered here, we show that our cross-entropy [16] approach can find the best ODE model from a set of competing candidate models, where the best model neither over-fits nor under-fits the available data.
The primary goal of model selection is to find the best model relative to some defensible criterion, and two attractive criteria are cross-entropy and Kullback–Leibler (KL) divergence [17]. The latter is a non-negative number that, for a pair of random variables can provide a useful measure of dependence known as mutual information [18]. But more generally, KL divergence measures the “distance” between two probability distributions [13] where the KL divergence vanishes for a pair of identical distributions. Usually, the distribution that gave rise to the protein abundances observed at time t (denoted f) is considered to be the “ground truth”, while the other distribution is most often a candidate distribution (or model) that is presumed to be “close” to f. A defining feature of KL divergence is that it is zero if and only if the candidate model and f are the same. Typically, the best candidate model will strike a balance between under-fitting (i.e., over-estimating the random error) and over-fitting (i.e., under-estimating the random error). Since KL divergence is a difference in expectations taken with respect to f (i.e., cross-entropy minus entropy), and since entropy depends only on f, it suffices (for the purpose of model selection) to find the model with the smallest cross-entropy. Given a finite set of candidate models, the model with the smallest cross-entropy is also the model with the smallest KL divergence, which makes it the best approximating model to f.
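The relationship just described (KL divergence equals cross-entropy minus entropy) is easy to verify numerically. The sketch below uses an arbitrary toy discrete distribution, not data from our simulations:

```python
import numpy as np

# Toy discrete distributions over the same support: f plays the role of the
# "ground truth", h a candidate model (values chosen for illustration only).
f = np.array([0.2, 0.5, 0.3])
h = np.array([0.25, 0.45, 0.30])

entropy = -np.sum(f * np.log(f))        # -E_f[log f]
cross_entropy = -np.sum(f * np.log(h))  # -E_f[log h]
kl = np.sum(f * np.log(f / h))          # KL(f || h)

# KL divergence is exactly cross-entropy minus entropy, so ranking candidate
# models by cross-entropy ranks them by KL divergence as well.
assert np.isclose(kl, cross_entropy - entropy)
assert kl >= 0.0  # KL is non-negative; it vanishes only when h equals f
```

Because the entropy term is identical for every candidate model, minimizing cross-entropy and minimizing KL divergence select the same model.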
There are however two very important challenges when performing model selection from TSS data. First, the multivariate probability density of the observed protein abundances at time t (denoted symbolically as f, and in words as the “ground truth”) is rarely known. Second, while we can estimate the parameters of each candidate model, we cannot evaluate the likelihood of any model. Fortunately, we can deal with the first challenge by minimizing cross-entropy (instead of KL divergence), as cross-entropy only requires realizations from f, not complete knowledge of f. As for the second challenge, which cannot be avoided, we tackle it head-on by estimating the likelihood for each candidate model (see Section 2 and Appendix B for more details) from the time evolution of initial conditions (see Section 3 and Appendix A for more details).
The remaining sections of this paper are organized as follows. In the first subsection of Methods, we give mathematical descriptions of KL divergence and cross-entropy, and we briefly explain how parameters of a candidate model are estimated. Then, in the second subsection of Methods, we outline our model selection approach, leaving technical details (such as the estimation of the Gaussian copula, marginal densities, and marginal cumulative distribution functions) to Appendix A and Appendix B. Furthermore, in the second subsection of Methods, we briefly describe a complementary model selection approach that applies Akaike Information Criterion [19] corrected for small samples (AICc) [20] to differences between the mean protein abundances observed at time t and the mean protein abundances predicted at time t. In Section 3, we describe how synthetic TSS data are generated, and we describe the time-evolution of initial conditions using different candidate models. Finally, we demonstrate the utility of our model selection approach in Section 4, and we give some interesting insights and suggestions for further improvements in Section 5.

2. Methods

2.1. Kullback–Leibler and Cross-Entropy

Consider the following: from a large collection of genetically identical cells [e.g., a clonal population of Natural Killer (NK) cells], a subpopulation is extracted, and the abundances of n distinct proteins are observed in each cell. We denote a random draw from this subpopulation by x[0]. Now, cells obtained from a different subset of the original collection are allowed to evolve over time. For these cells, which are observed for the same set of distinct proteins at time t, let y[t] denote a random draw from this subpopulation. Collectively, the protein abundances in cells observed at time 0 and in cells observed at time t are a simple example of TSS data.
We assume that x[0] ∼ G and that y[t] ∼ F with density f(y[t]), and that y[0] (which is not observed) has the same distribution as x[0], since both represent random draws from the same initial collection of similar cells. Furthermore, in accordance with most real-world applications, we assume that G, F, and f are almost never known exactly (i.e., mathematical expressions are typically not known, but empirical estimates from observed data may be possible). Now, consider modeling the time-evolution of TSS data with different deterministic models d_i for i = 1, 2, and 3, where each d_i is a set of coupled nonlinear ODEs describing the dynamics of single-cell protein abundances over time. Specifically, the ith model d_i has a set of reaction rates (see Figure 1) that depend on freely varying parameters θ_i (see Appendix A for more details). We evolve initial conditions x[0] to time t using ODE model d_i to arrive at predicted abundances x_i[t], which have distribution H_i and density h_i. We refer to h_i, interchangeably, as the ith candidate model. For each candidate model, we assume there exists a unique parameter vector θ_i^* that minimizes
$$ E_f\!\left[\log \frac{f(y[t])}{h_i(y[t]\mid\theta)}\right], \qquad (1) $$
where the expectation in Expression (1) is the Kullback–Leibler (KL) divergence between f(y[t]) and candidate model h_i(y[t] ∣ θ). Note that, when the ith candidate model and the “ground truth” are the same, h_i = f and the KL divergence is zero. Since we have restricted attention to a single point in time, we suppress the dependence of x_i[t] and y[t] on t hereafter.
KL divergence can also be expressed as the difference between cross-entropy and entropy
$$ -E_f\big[\log h_i(y\mid\theta)\big] \;-\; \Big(-E_f\big[\log f(y)\big]\Big). \qquad (2) $$
So, when θ = θ_i^*, Expression (2) is the KL divergence between f and h_i. Since entropy does not depend on h_i, it suffices (for the purpose of model selection) to find the candidate model that minimizes cross-entropy:
$$ CE\big(f(y) \,\big\|\, h_i(y\mid\theta_i^*)\big) \;\equiv\; -E_f\big[\log h_i(y\mid\theta_i^*)\big]. \qquad (3) $$
However, because f is rarely known, the cross-entropy in Equation (3) cannot be computed. Instead, we must approximate the cross-entropy by averaging over N independent realizations of f. To compute the approximation, we also need θ_i^* and h_i, but unfortunately, both are rarely known as well. In the case of θ_i^*, we can replace it with the generalized method of moments (GMM) estimate of θ (denoted θ̃_i). Briefly, as in [13], we define θ̃_i to be the value of θ that minimizes the weighted sum of squared differences between the first and second moments of the observed abundances y, and the first and second moments of the predicted abundances x_i [15]. More generally, if one thinks of the differences between corresponding moments as a difference vector, then the quadratic form of this vector and a symmetric, positive-definite weight matrix (denoted W) defines a norm, and this norm is the cost that θ̃_i minimizes. There are many possible choices for W (see [14,15,21] for more details). Note that for most over-determined systems, θ̃_i is consistent for θ_i^* [22]. In the next subsection, we will describe how h_i can be estimated from initial conditions x[0], coupled ODEs d_i, and the corresponding estimated reaction rates θ̃_i. Our estimate of h_i is denoted ĥ_i(y ∣ θ̃_i).
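As a rough illustration of the GMM cost just described, consider the following sketch; here `evolve` is a hypothetical stand-in for the numerical solution of a candidate ODE model, and the toy data and identity weight matrix are ours for illustration only:

```python
import numpy as np

def gmm_cost(theta, evolve, x0, y_obs, W):
    """Quadratic-form GMM cost: the difference vector of first and second
    sample moments (predicted vs. observed), weighted by a symmetric,
    positive-definite matrix W.  `evolve(x0, theta)` is a hypothetical
    stand-in for evolving initial abundances x0 to time t under the
    candidate ODE model; y_obs holds the observed abundances at time t."""
    x_t = evolve(x0, theta)
    d = np.concatenate([
        x_t.mean(axis=0) - y_obs.mean(axis=0),            # first moments
        (x_t**2).mean(axis=0) - (y_obs**2).mean(axis=0),  # second moments
    ])
    return d @ W @ d  # quadratic form defined by the weight matrix W

# Toy check: with a trivial "evolve" that scales abundances, the cost
# vanishes at the true parameter value.
evolve = lambda x0, th: th * x0
x0 = np.array([[1.0, 2.0], [3.0, 4.0]])
assert gmm_cost(1.0, evolve, x0, x0, np.eye(4)) == 0.0
```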

2.2. Computing Approximate Cross-Entropy

Single-cell experiments such as CyTOF typically measure protein abundances across thousands of cells, and as such, TSS data often contain a wealth of information for estimating marginal densities and marginal distribution functions. Consequently, we decided to leverage Sklar’s theorem [23] to estimate the multivariate density h_i(y ∣ θ̃_i) from its marginal densities, marginal distribution functions, and a copula describing the dependence between the abundances of n distinct proteins in a single cell (see Appendix B for more details). Here, we implement a Gaussian copula, which is computationally fast, mathematically convenient, and (at least in the case of TSS data) quite accurate (see Appendix C).
Another benefit of working with large samples is that split-sample techniques allow us to avoid the bias that typically arises when parameter estimation and model selection are performed on the same dataset [19]. Specifically, we use 20% of the available TSS data for parameter estimation (i.e., to compute θ̃_i), and 80% for model selection (i.e., to compute ĥ_i(y ∣ θ̃_i)). Now, the cross-entropy that we want (see Equation (3)) can be approximated by
$$ -E_f\big[\log \hat h_i(y\mid\tilde\theta_i)\big], \qquad (4) $$
and the expression in (4) can be estimated from the corresponding sample average (denoted ACE):
$$ ACE\big(f(y) \,\big\|\, \hat h_i(y\mid\tilde\theta_i)\big) \;\equiv\; -\frac{1}{N}\sum_{k} \log \hat h_i(y_k\mid\tilde\theta_i), \qquad (5) $$
where N independent cells at times 0 and t are used to compute ĥ_i(y ∣ θ̃_i) and ACE, respectively.
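The split-sample computation of ACE can be sketched as follows; `split_sample` and `log_hhat` are hypothetical names, and in practice the estimated log-density log ĥ_i comes from the construction in Appendix B:

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed

def split_sample(data, frac_estimation=0.2):
    """Randomly split cells: 20% for parameter estimation and 80% for
    model selection, mirroring the split described above."""
    idx = rng.permutation(len(data))
    n_est = int(frac_estimation * len(data))
    return data[idx[:n_est]], data[idx[n_est:]]

def ace(log_hhat, y_select):
    """ACE of Equation (5): minus the average estimated log-likelihood
    over the model-selection split (smaller is better)."""
    return -np.mean([log_hhat(y) for y in y_select])
```

As a sanity check, applying `ace` with the exact standard-normal log-density to a large standard-normal sample recovers the differential entropy 0.5 log(2πe) ≈ 1.419, as expected when the candidate model equals the ground truth.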
Because our proposed model selection approach is based on an approximation, it is useful to have (1) independent confirmation that the selected model actually minimizes KL divergence, and (2) some measure of relative support for the model that ACE selects compared to the other models that could have been chosen. To address the first point, we compute AICc for each candidate model by working with the likelihood for the mean difference. Specifically, we assume that the mean of y minus the mean of x i is multivariate normal with expectation zero, and that its variance–covariance matrix has off-diagonal elements equal to zero [20]. Because (1) the initial protein abundances are independent across cells, (2) we assume initial abundances are also independent within cells for simplicity, and because (3) the sample size is large, departures from normality and departures from independence should be small. Note that, when model selection is based on a transformation of the original TSS data (e.g., the mean difference), it is usually possible to construct several different AICc-like statistics. The one we chose to compute here is easy to implement and natural, but more elaborate constructions are also possible. When the candidate model that minimizes ACE also minimizes approximate AICc, one has (to some degree) additional assurance that the selected model minimizes cross-entropy [see Equation (3)], and therefore minimizes KL divergence [see Equation (2)]. Furthermore, by bootstrapping the observed TSS data, we can estimate the model selection probabilities for each candidate model (i.e., the probability that model h i is selected). Typically, model selection probabilities will help users quantify and interpret the level of relative support for the selected model in ways that are comparable to, or better than, differences in AIC or AICc [24].
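The bootstrap estimate of the model selection probabilities might be sketched like this, assuming (hypothetically) that each candidate model exposes a function mapping a resampled dataset to its ACE value:

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed

def selection_probabilities(y, ace_by_model, n_boot=1000):
    """For each candidate model, the fraction of bootstrap resamples of
    the TSS data y in which that model attains the smallest ACE.
    `ace_by_model` maps a model name to a function computing ACE from a
    resampled dataset (a hypothetical interface; in practice each function
    would re-estimate the candidate density as in Appendix B)."""
    wins = {name: 0 for name in ace_by_model}
    n = len(y)
    for _ in range(n_boot):
        resample = y[rng.integers(0, n, size=n)]  # resample cells with replacement
        scores = {name: fn(resample) for name, fn in ace_by_model.items()}
        wins[min(scores, key=scores.get)] += 1
    return {name: w / n_boot for name, w in wins.items()}
```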

3. Data Description

In most real-world applications that model the time-dependent kinetics of protein abundances, the “ground truth” (denoted f) is rarely known. However, because we are primarily concerned with improving model selection (i.e., increasing accuracy, increasing usability, and developing additional measures of relative support), we chose to mimic observed TSS data y by simulating (without intrinsic noise) protein abundances at time t from three scenarios of interest: SMALL, MEDIUM, and LARGE (see Table A1 for more details). For the SMALL, MEDIUM, and LARGE scenarios, we create ground truth models where we vary one, two, and three parameters of θ . For instance, in the ground truth MEDIUM scenario, candidate model h 2 is expected to explain the TSS data better than h 1 , which is too simplistic, and better than h 3 , which over-fits to noise in the data (see Table 1).
For each “ground truth” scenario, we always know which candidate model should yield the best fit. For example, consider candidate model h_2, which has two freely varying parameters, θ_1 and θ_3, because we set θ_2 = 9 × θ_1. Relative to h_1 and h_3, candidate model h_2 should be the closest to the “ground truth” MEDIUM scenario, provided that under-fitting and over-fitting are accounted for appropriately (see Table 2). Similarly, candidate models h_1 and h_3 are expected to be closest to “ground truth” scenarios SMALL and LARGE, respectively, since in h_1 we set θ_2 = 9 × θ_1 and θ_3 = 2 × θ_1 with only θ_1 varying freely; and since h_3 has all three parameters varying freely. Of course, when implementing ACE (and AICc), we pretend that the “ground truth” is unknown.
Table 1. Under-fitting and over-fitting. The cost (shown in brackets) decreases as the complexity of the candidate model increases, so h 3 has the lowest cost. Yet ACE selects h 2 , the correct candidate model, 95% of the time (see Table 3), which implies that ACE appropriately balances the under-fitting of h 1 and the over-fitting of h 3 . The parameter estimates θ ( θ 1 , θ 2 , θ 3 ) are shown for each candidate model; and as a point of reference, the “ground truth” MEDIUM scenario is shown in bold.
Model      θ_1      θ_2            θ_3            Cost
MEDIUM:    0.10     0.90           0.18           [–NA–]
h_1:       0.093    (9 × 0.093)    (2 × 0.093)    [0.0370]
h_2:       0.100    (9 × 0.100)    0.182          [0.0074]
h_3:       0.097    0.896          0.182          [0.0071]
To generate TSS data, we begin by simulating uncorrelated initial conditions from a multivariate log-normal distribution with parameters μ = (5.25, 7.60, 5.25, 7.60, 5.25, 5.25), and σ 2 = (0.15, 0.06, 0.15, 0.06, 0.15, 0.15). Now, let us consider simulating data for the “ground truth” LARGE scenario. To accomplish this, we take half of the initial conditions (discussed immediately above) and we evolve them to time t using coupled ODEs d 3 and parameters (0.10, 0.95, and 0.18). Computing ACE and model selection probabilities for all three candidate models takes about 2 h on a 2.5 GHz computer, but apart from computational time, there is no limit on the number of candidate models one can consider.
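A minimal sketch of the initial-condition simulation, using the μ and σ² values quoted above (the generator seed and the number of cells are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)  # arbitrary seed

# Log-scale parameters from the Data Description (six protein species,
# uncorrelated at time 0).
mu = np.array([5.25, 7.60, 5.25, 7.60, 5.25, 5.25])
sigma2 = np.array([0.15, 0.06, 0.15, 0.06, 0.15, 0.15])

def simulate_initial_conditions(n_cells):
    """Draw uncorrelated log-normal initial abundances x[0], one row per cell."""
    z = rng.standard_normal((n_cells, len(mu)))
    return np.exp(mu + np.sqrt(sigma2) * z)

x0 = simulate_initial_conditions(4000)  # 4000 cells x 6 proteins
```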

4. Results

We show that our model selection approach works (Table 2) and that approximate AICc often selects the same candidate model as ACE (Table 3). Next, we repeated our model selection approach for each of 1000 bootstrap resamples of the simulated TSS data, and we computed model selection probabilities for each candidate model.
For the “ground truth” LARGE scenario, ACE selected the correct candidate model h 3 (with 3 freely varying parameters) 100% of the time. Similarly, for the “ground truth” MEDIUM and SMALL scenarios, ACE selected the correct candidate models h 2 (with 2 freely varying parameters) and h 1 (with 1 freely varying parameter) 95% and 76% of the time, respectively (see Table 3).

5. Discussion

5.1. New Tools for Model Selection with Single-Cell Data

Mechanistic models based on ODEs describing subcellular kinetics of proteins are widely used in computational biology for gleaning mechanisms and generating predictions. On the other hand, to model random gene expression data, stochastic mechanistic models such as the telegraph model have been used [25,26]. It is common to have multiple candidate mechanistic models that can be set up to probe different hypotheses describing the same biological phenomena, and an important task in model development is to rank order the candidate models according to their ability to describe the measured data. The availability of large, high-dimensional, single-cell datasets allows for estimation of model parameters using mean values and higher-order moments of the measured data; however, rank ordering candidate models from such data may not be straightforward when using standard approaches in model selection (e.g., AIC, AICc, and Kullback–Leibler). Here we propose a model selection approach based on cross-entropy for ODE-based models that are calibrated against means and higher-order moments of the measured data. We show as “proof-of-concept” that our proposed approach successfully rank orders a set of ODE models against synthetic single-cell datasets.

5.2. ACE Complements Approximate AICc

As shown in Figure 2 (Panel C), approximate AICc is virtually independent of ACE, and concordance between ACE and approximate AICc appears to provide additional support for the candidate model selected by ACE. So, any improvement in approximate AICc would likely only benefit ACE. One potential area for improvement might be a recalibration of the penalties used in approximate AICc. In particular, it is not immediately clear what the penalty should be for approximate AICc, as the parameter estimation is based on thousands of cells, whereas the model selection is based on differences in only six means. Moreover, when one combines a split-sample approach with model selection (as we have done here), the penalty terms in AIC and AICc may no longer be needed [27]. Indeed, the so-called penalties are actually corrections for the bias that arises when parameter estimation and model selection are performed on the same dataset.
By design, the additional penalty term in AICc depends on the corresponding sample size (here, AICc uses only six means). As such, the number of distinct proteins must be larger than the number of freely varying parameters; otherwise, the denominator of the additional penalty term will be zero or negative. Also, because the AICc likelihood is based on the differences in means, AICc may have more difficulty than ACE accurately rank ordering two or more candidate models with similar means. Note that ACE (as implemented here, with a split-sample approach) has neither design limitation: (1) ACE penalizes indirectly for complexity and does not require a bias correction, and (2) ACE makes use of the entire multivariate density, not just the means.
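For reference, the correction under discussion has the standard form AICc = −2 log L + 2k + 2k(k + 1)/(n − k − 1). The helper below is our own illustration of the design limitation: with n = 6 means, the penalty denominator forces n > k + 1:

```python
def aicc(log_lik, k, n):
    """AICc = -2 log L + 2k + 2k(k+1)/(n - k - 1).

    Here n is the number of observations entering the likelihood (only six
    mean differences in the construction above), so the correction term
    requires n > k + 1: the number of distinct proteins must exceed the
    number of freely varying parameters."""
    if n - k - 1 <= 0:
        raise ValueError("AICc undefined: need n > k + 1")
    return -2.0 * log_lik + 2.0 * k + 2.0 * k * (k + 1) / (n - k - 1)

# With n = 6 means, the penalty grows quickly with model complexity:
assert aicc(0.0, 1, 6) == 3.0   # 2*1 + 2*1*2/4
assert aicc(0.0, 3, 6) == 18.0  # 2*3 + 2*3*4/2
```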
For the MEDIUM and LARGE “ground truth” scenarios, ACE outperforms approximate AICc, but for the SMALL “ground truth” scenario, approximate AICc does better than ACE. This suggests that the means tend to carry the bulk of the information in the “ground truth” SMALL scenario, so there is very little benefit to estimating the other candidate models (which contain information about higher-order moments and cross-moments). However, when the higher-order moments and cross-moments begin to matter, as is likely the case with the more complex “ground truth” scenarios, the benefit of estimating the candidate models with 2 and 3 freely varying parameters is likely greater.

5.3. Limitations and Future Directions

Presently, our ACE approach to model selection based on TSS data has three main limitations: (1) it is not designed to handle intrinsic noise, (2) there is considerable latitude in terms of multivariate density estimation that we have only scratched the surface of here, and (3) there may be more efficient ways to incorporate split-sample techniques. Extending our ACE approach to include both extrinsic and intrinsic noise may be possible for relatively short evolution times and/or for networks with a relatively small number of interacting proteins. Further, for other applications, users may want to include higher-order moments and/or different copulas or kernel density estimators [28,29].
When researchers are unable to specify the full candidate model, the likelihood is not known and model selection is often challenging. However, when the sample size is large and consistent estimators of the model parameters and candidate models exist, we propose a split-sample entropy-based approach that allows users to find the best approximating model to the “ground truth”. Furthermore, our approach is quite flexible with respect to (1) parameter estimation (e.g., choosing which moments to use—first moments only, first and second moments, etc.), and (2) multivariate density estimation (e.g., choosing a “good” kernel density estimator and/or copula for each candidate model).

6. A Selected Sensitivity Analysis

To examine what could happen with the analysis of TSS data observed at a different time point (e.g., t = 4.5) and at a smaller sample size (e.g., 4000), we show the results below in Table 4 and Table 5, respectively. As expected, with longer evolution times, the candidate models are more differentiated, and so selecting the best model is (in some ways) easier. However, relative to t = 1.5, the dependence between variables at t = 4.5 is increased. As such, the number of effectively independent variables [30,31] is reduced at t = 4.5, and this makes parameter estimation more challenging, especially when 3 reaction rates are allowed to vary freely. As expected, the performance of ACE drops due to the reduction in sample size.

Author Contributions

Conceptualization, W.C.L.S.; methodology, W.C.L.S., C.J. and J.D.; software, W.C.L.S.; validation, W.C.L.S.; formal analysis, W.C.L.S.; investigation, W.C.L.S.; resources, W.C.L.S.; data curation, W.C.L.S.; writing—original draft preparation, W.C.L.S., C.J. and J.D.; writing—review and editing, W.C.L.S., C.J. and J.D.; visualization, W.C.L.S., C.J. and J.D.; supervision, W.C.L.S., C.J. and J.D.; project administration, W.C.L.S., C.J. and J.D.; funding acquisition, J.D. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by NIH grant R01AI146581 to J.D., the Abigail Wexner Research Institute at the Nationwide Children’s Hospital.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The software will be made available to interested researchers upon request (email: minitether@gmail.com).

Acknowledgments

We thank John Wu for developing the C++ code to solve the non-linear ODEs, and GIG Statistical Consulting LLC, which provided computational resources for the simulations and data analyses.

Conflicts of Interest

Author William C. L. Stewart was employed by the company GIG Statistical Consulting LLC. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Appendix A. Evolving Protein Abundances with ODEs

We consider a simplified model of biochemical reactions that describe early-time signaling kinetics in mouse Natural Killer (NK) cells stimulated by ligands cognate to activating CD16 and inhibitory Ly49A receptors. We explicitly model the kinase Syk bound to activating receptor-ligand complex and the phosphatase SHP1 bound to inhibitory receptor-ligand complex. The enzyme Syk phosphorylates a key signaling protein Vav1 where phosphorylated Vav1 (pVav1) induces activation of NK cells. The enzyme SHP1 dephosphorylates pVav1 and thus mediates inhibition of NK cell activation. In our model, Syk and SHP1 react with the substrate protein Vav1 and pVav1, respectively, following the reactions shown in (Figure 1). Activating and inhibitory NK cell receptors are not included explicitly in this model; instead, Syk and SHP1 represent Syk and SHP1 proteins bound to the activating and inhibitory receptors, respectively. The reactions in the model are similar to that of the zero-order ultrasensitivity model proposed by Goldbeter and Koshland [32]. The abundances for the protein species, Syk, Vav1, Syk-Vav1, SHP1, SHP1-pVav1, and pVav1 evolve in time with a set of nonlinear ODEs describing mass-action kinetics for the above reactions.
According to the Law of Mass Action, the signaling kinetics of protein abundances y j ( t ) , which evolve in time through r biochemical signaling reactions, can in general be described by a set of coupled ODEs
$$ \frac{dy_j}{dt} \;=\; \sum_{p} \Lambda_{jp}(y, \theta_p), $$
where j indexes the distinct proteins observed in a single cell at time t. Here, Λ_jp(y, θ_p) is the propensity of the pth reaction affecting the abundance of the jth protein. Consider, for example, the reaction Vav1 + Syk → Vav1-Syk; the Law of Mass Action asserts that the rate of change of the abundance y_{Vav1-Syk} is proportional to the product of the abundances y_{Vav1} and y_{Syk} (the reactants that produce Vav1-Syk), so the propensity is defined to be θ_1 y_{Vav1} y_{Syk}, where θ_1 is the rate of the reaction, typically in s⁻¹. Starting from a more microscopic Master Equation, this corresponds to neglecting correlations, i.e., replacing the value of the simultaneous abundance of products by the factorized product of abundances, thus reducing the dynamical equation to deterministic ODEs.
The time evolution for the abundances of the six protein species is described by the ODEs below.
$$
\begin{aligned}
dy_{Syk}/dt &= -\theta_1\, y_{Syk}\, y_{Vav1} + (0.12 + \theta_2)\, y_{SykVav1} \\
dy_{Vav1}/dt &= -\theta_1\, y_{Syk}\, y_{Vav1} + 0.12\, y_{SykVav1} + \theta_3\, y_{SHP1pVav1} \\
dy_{SykVav1}/dt &= \theta_1\, y_{Syk}\, y_{Vav1} - (0.12 + \theta_2)\, y_{SykVav1} \\
dy_{SHP1}/dt &= -0.14\, y_{SHP1}\, y_{pVav1} + (0.05 + \theta_3)\, y_{SHP1pVav1} \\
dy_{pVav1}/dt &= \theta_2\, y_{SykVav1} - 0.14\, y_{SHP1}\, y_{pVav1} + 0.05\, y_{SHP1pVav1} \\
dy_{SHP1pVav1}/dt &= 0.14\, y_{SHP1}\, y_{pVav1} - (0.05 + \theta_3)\, y_{SHP1pVav1}
\end{aligned}
$$
The ODEs allow for both linear and nonlinear dynamics, and the parameter values for all simulated data are taken (in part) from the parameters used in [33] (see Table A1 for more details).
Table A1. Ground truth scenarios. For each “ground truth” scenario, the corresponding parameter values (i.e., rate constants) are shown.
Ground Truth    θ_1     θ_2     θ_3
LARGE:          0.10    0.95    0.18
MEDIUM:         0.10    0.90    0.18
SMALL:          0.10    0.90    0.20
The above ODEs are solved numerically using a Runge–Kutta Cash–Karp nonlinear solver in C++ at time t = 1.5 s to generate synthetic data y at time t and to evolve initial conditions x [ 0 ] for each candidate model: h 1 , h 2 , and h 3 .
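A sketch of this time evolution in Python, with the signs of the mass-action terms inferred from the reaction scheme above; scipy's adaptive LSODA solver stands in for our Cash–Karp C++ code, and the single-cell initial abundances are illustrative:

```python
import numpy as np
from scipy.integrate import solve_ivp

def rhs(t, y, th1, th2, th3):
    """Mass-action right-hand side for the six species, ordered as
    (Syk, Vav1, SykVav1, SHP1, pVav1, SHP1pVav1)."""
    syk, vav1, sykvav1, shp1, pvav1, shp1pvav1 = y
    return [
        -th1 * syk * vav1 + (0.12 + th2) * sykvav1,
        -th1 * syk * vav1 + 0.12 * sykvav1 + th3 * shp1pvav1,
        th1 * syk * vav1 - (0.12 + th2) * sykvav1,
        -0.14 * shp1 * pvav1 + (0.05 + th3) * shp1pvav1,
        th2 * sykvav1 - 0.14 * shp1 * pvav1 + 0.05 * shp1pvav1,
        0.14 * shp1 * pvav1 - (0.05 + th3) * shp1pvav1,
    ]

# Evolve one cell's (illustrative) initial abundances to t = 1.5 under the
# LARGE parameters (0.10, 0.95, 0.18).
x0 = [190.0, 2000.0, 0.0, 150.0, 0.0, 0.0]
sol = solve_ivp(rhs, (0.0, 1.5), x0, args=(0.10, 0.95, 0.18),
                method="LSODA", rtol=1e-8, atol=1e-8)
x_t = sol.y[:, -1]

# Mass balance: total Syk (free + bound) is conserved by construction.
assert np.isclose(x_t[0] + x_t[2], x0[0] + x0[2], rtol=1e-4)
```

The mass-balance assertions (total Syk, total SHP1, and total Vav1 in all its forms are each conserved) provide a quick check that the signs of the mass-action terms are mutually consistent.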

Appendix B. Estimating Candidate Models

Let i ∈ {1, 2, 3} index the candidate models, and let ĥ_i(y) be the estimate of the i-th candidate model h_i(y). To compute ĥ_i(y), we need (1) an estimate of the marginal cumulative distribution function of each distinct protein (denoted H_ij), (2) estimates of the corresponding marginal densities h_ij, and (3) an estimate of the correlation structure between protein abundances within a cell.
Let x_i denote the evolution of the initial condition x[0] to time t under the set of coupled ODEs with parameters θ̃_i. The n components of x_i = (x_i1, …, x_in) are continuous representations of the abundances of the n distinct proteins. Since x_ij is predicted in m cells indexed by k = 1, …, m, and since m is assumed to be large (say, > 1000), the marginal cumulative distribution function of x_ij is well estimated by the following equation:
Ĥ^m_ij(v) = (1/m) Σ_{k=1}^{m} 1{ x_ijk ≤ v },
where x_ijk is the predicted abundance of the j-th distinct protein in the k-th cell, 0 < v < ∞, and 1{·} is the indicator function, taking the value 1 when the event { x_ijk ≤ v } is true and 0 otherwise. By the strong law of large numbers,
Ĥ^m_ij → H_ij almost surely
as m → ∞. Furthermore, to estimate the corresponding marginal density functions h_ij, we use the default kernel density estimation procedure [34] in the statistical software package R (version 4.2.1). The estimate of h_ij is denoted ĥ_ij.
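A minimal sketch of these two marginal estimates (the empirical CDF and a kernel density estimate) is given below in Python; SciPy's gaussian_kde is only an analogue of R's default density() procedure (Scott's bandwidth rule rather than R's bw.nrd0), and the log-normal sample is hypothetical:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Hypothetical predicted abundances x_ij for one protein across m = 2000 cells
x_ij = rng.lognormal(mean=1.0, sigma=0.5, size=2000)

def ecdf(sample):
    """Empirical CDF: v -> (1/m) * #{ x_k <= v }."""
    s = np.sort(sample)
    return lambda v: np.searchsorted(s, v, side="right") / s.size

H_hat = ecdf(x_ij)          # estimate of the marginal CDF H_ij
h_hat = gaussian_kde(x_ij)  # kernel density estimate of the marginal density h_ij
```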
To estimate the correlation structure, we apply a double-transform to x_ij so that the resulting random vector z_i has a multivariate normal distribution with mean zero and variance–covariance matrix Σ. Specifically, we define u_ijk ≡ Ĥ_ij(x_ijk) and z_ijk ≡ Φ⁻¹(u_ijk), so that (z_i1k, …, z_ink) ~ MVN(0, Σ) for each k. Then, our estimate of the correlation structure (denoted R̂) is simply the standard estimate of Σ, appropriately rescaled. Finally, our estimate of the i-th candidate model evaluated at y is
ĥ_i(y) = C^Gauss_R̂(y) × Π_j ĥ_ij(y_j),
where y ≡ (y1, …, yn) represents the protein abundances observed at time t in a single cell, and
C^Gauss_R̂(y) ≡ (1/√(det R̂)) exp( −(1/2) z (R̂⁻¹ − I) zᵀ ),
with I being the [n × n] identity matrix and z ≡ (Φ⁻¹(Ĥ_i1(y1)), …, Φ⁻¹(Ĥ_in(yn))) the double-transform of y. Hence, our estimate of the i-th log-likelihood of θ̃_i based on y is log ĥ_i(y), and now, the computation of ACE in Equation (5) is straightforward.
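The full construction of ĥ_i — empirical CDFs, the double-transform, the estimated correlation R̂, and the copula-corrected density — can be sketched as follows. The data are hypothetical, and the ECDF is rescaled by m + 1 so that Φ⁻¹ stays finite at the sample maximum, a practical adjustment not spelled out above:

```python
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(1)
m, n = 3000, 3
X = rng.lognormal(mean=0.0, sigma=0.4, size=(m, n))  # hypothetical predicted abundances

# Marginal pieces: ECDFs (rescaled by m + 1 to keep u strictly in (0, 1)) and KDEs
def ecdf_factory(col):
    s = np.sort(col)
    return lambda v: np.searchsorted(s, v, side="right") / (s.size + 1.0)

ecdfs = [ecdf_factory(X[:, j]) for j in range(n)]
kdes = [gaussian_kde(X[:, j]) for j in range(n)]

# Double-transform: u = H_hat(x), z = Phi^{-1}(u); then estimate the correlation of z
Z = np.column_stack([norm.ppf(ecdfs[j](X[:, j])) for j in range(n)])
R_hat = np.corrcoef(Z, rowvar=False)

def log_density(y):
    """log h_hat(y): Gaussian-copula term plus the log product of marginal KDEs."""
    z = np.array([norm.ppf(ecdfs[j](y[j])) for j in range(n)])
    R_inv = np.linalg.inv(R_hat)
    log_copula = -0.5 * np.log(np.linalg.det(R_hat)) - 0.5 * z @ (R_inv - np.eye(n)) @ z
    log_marg = sum(np.log(kdes[j](y[j])[0]) for j in range(n))
    return log_copula + log_marg
```

With independent columns, as here, R̂ should be close to the identity and the copula term close to zero; correlated data would make the correction non-trivial.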

Appendix C. Accuracy of Gaussian Copula

To assess the accuracy of the Gaussian copula, we compared the exact log-likelihood of simulated initial conditions with its copula-based estimate, in a setting where the abundance of the j-th distinct protein is known to be log-normal(μj, σj²) and uncorrelated with the abundances of the other proteins (see Data Description for more details). The log-likelihood differences shown in Table A2 suggest that the relative error in our log-likelihood calculations for ACE is less than 1%.
Table A2. Accuracy of the Gaussian copula: L ^ ( ϕ ) . We generated initial conditions for six distinct proteins across 3000 and 6000 cells, respectively, using a multivariate log-normal distribution indexed by ϕ ( μ , σ 2 ) . The Gaussian copula over-estimated the exact log-likelihood by slightly more than 0.5% when N = 3000 ; and it over-estimated the exact log-likelihood by slightly less than 1% when N = 6000 .
Sample Size | log L(ϕ)   | log L̂(ϕ)
3000        | −5859.07   | −5821.99
6000        | −11,718.14 | −11,611.02
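A simplified version of this check can be sketched as follows: with independent log-normal coordinates the estimated copula correlation is near the identity, so comparing the exact log-likelihood against the product of marginal kernel density estimates isolates the dominant source of error. The parameters μ and σ below are illustrative, not those of the Data Description:

```python
import numpy as np
from scipy.stats import lognorm, gaussian_kde

rng = np.random.default_rng(2)
m, n = 3000, 6
mu, sigma = 1.0, 0.5  # hypothetical (mu_j, sigma_j), shared across the six proteins
X = rng.lognormal(mean=mu, sigma=sigma, size=(m, n))

# Exact log-likelihood under the generating model: independent log-normal coordinates
exact = lognorm.logpdf(X, s=sigma, scale=np.exp(mu)).sum()

# Approximate log-likelihood from the marginal KDEs; for independent columns the
# copula correction is negligible and the product of marginals dominates.
kdes = [gaussian_kde(X[:, j]) for j in range(n)]
approx = sum(np.log(kdes[j](X[:, j])).sum() for j in range(n))
rel_err = abs(approx - exact) / abs(exact)
```

In runs of this sketch the relative error is small, on the order of the roughly 1% discrepancies reported in Table A2.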

References

  1. Rohrs, J.A.; Wang, P.; Finley, S.D. Understanding the dynamics of T-cell activation in health and disease through the lens of computational modeling. JCO Clin. Cancer Inf. 2019, 3, 1–8. [Google Scholar] [CrossRef] [PubMed]
  2. Loskot, P.; Atitey, K.; Mihaylova, L. Comprehensive review of models and methods for inferences in bio-chemical reaction networks. Front. Genet. 2019, 10, 549. [Google Scholar] [CrossRef] [PubMed]
  3. La Manno, G.; Soldatov, R.; Zeisel, A.; Braun, E.; Hochgerner, H.; Petukhov, V.; Lidschreiber, K.; Kastriti, M.E.; Lönnerberg, P.; Furlan, A.; et al. RNA velocity of single cells. Nature 2018, 560, 494–498. [Google Scholar] [CrossRef] [PubMed]
  4. Stewart, W.C. The fundamentals of statistical data analysis. In Systems Immunology; CRC Press: Boca Raton, FL, USA, 2018; pp. 41–50. [Google Scholar]
  5. Ashyraliyev, M.; Fomekong-Nanfack, Y.; Kaandorp, J.A.; Blom, J.G. Systems biology: Parameter estimation for biochemical models. FEBS J. 2009, 276, 886–902. [Google Scholar] [CrossRef]
  6. Raue, A.; Kreutz, C.; Maiwald, T.; Bachmann, J.; Schilling, M.; Klingmüller, U.; Timmer, J. Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. Bioinformatics 2009, 25, 1923–1929. [Google Scholar] [CrossRef]
  7. Saliba, A.E.; Westermann, A.J.; Gorski, S.A.; Vogel, J. Single-cell RNA-seq: Advances and future challenges. Nucleic Acids Res. 2014, 42, 8845–8860. [Google Scholar] [CrossRef] [PubMed]
  8. Mukherjee, S.; Jensen, H.; Stewart, W.; Stewart, D.; Ray, W.C.; Chen, S.Y.; Nolan, G.P.; Lanier, L.L.; Das, J. In silico modeling identifies CD45 as a regulator of IL-2 synergy in the NKG2D-mediated activation of immature human NK cells. Sci. Signal. 2017, 10, eaai9062. [Google Scholar] [CrossRef]
  9. Spitzer, M.H.; Nolan, G.P. Mass cytometry: Single cells, many features. Cell 2016, 165, 780–791. [Google Scholar] [CrossRef]
  10. Swain, P.S.; Elowitz, M.B.; Siggia, E.D. Intrinsic and extrinsic contributions to stochasticity in gene expression. Proc. Natl. Acad. Sci. USA 2002, 99, 12795–12800. [Google Scholar] [CrossRef]
  11. Das, J.; Jayaprakash, C. Systems Immunology: An Introduction to Modeling Methods for Scientists; CRC Press: Boca Raton, FL, USA, 2018. [Google Scholar]
  12. Feinerman, O.; Veiga, J.; Dorfman, J.R.; Germain, R.N.; Altan-Bonnet, G. Variability and robustness in T cell activation from regulated heterogeneity in protein levels. Science 2008, 321, 1081–1084. [Google Scholar] [CrossRef]
  13. Wu, J.; Stewart, W.; Jayaprakash, C.; Das, J. BioNetGMMFit: Estimating parameters of a BioNetGen model from time-stamped snapshots of single cells. NPJ Syst. Biol. Appl. 2023, 9, 46. [Google Scholar] [CrossRef] [PubMed]
  14. Lück, A.; Wolf, V. Generalized method of moments for estimating parameters of stochastic reaction networks. BMC Syst. Biol. 2016, 10, 98. [Google Scholar] [CrossRef] [PubMed]
  15. Hansen, L.P. Large sample properties of generalized method of moments estimators. Econometrica 1982, 50, 1029–1054. [Google Scholar] [CrossRef]
  16. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley: Hoboken, NJ, USA, 2006. [Google Scholar]
  17. Kullback, S.; Leibler, R.A. On information and sufficiency. Ann. Math. Stat. 1951, 22, 79–86. [Google Scholar] [CrossRef]
  18. Wu, C.; Li, S.; Cui, Y. Genetic association studies: An information content perspective. Curr. Genom. 2012, 13, 566–573. [Google Scholar] [CrossRef] [PubMed]
  19. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723. [Google Scholar] [CrossRef]
  20. Cavanaugh, J.E. Unifying the derivations of the Akaike and corrected Akaike information criteria. Stat. Probab. Lett. 1997, 31, 201–208. [Google Scholar] [CrossRef]
  21. Hayashi, F. Econometrics; Princeton University Press: Princeton, NJ, USA, 2000. [Google Scholar]
  22. Hall, A.R.; Inoue, A. The large sample behaviour of the generalized method of moments estimator in misspecified models. J. Econom. 2003, 114, 361–394. [Google Scholar] [CrossRef]
  23. Sklar, A. Fonctions de répartition à n dimensions et leurs marges. Ann. l’ISUP 1959, 8, 229–231. [Google Scholar]
  24. Dajles, A.; Cavanaugh, J. Bootstrap Approximation of Model Selection Probabilities for Multimodel Inference Frameworks. Entropy 2024, 26, 599. [Google Scholar] [CrossRef] [PubMed]
  25. Chen, L.; Zhu, C.; Feng, J. A generalized moment-based method for estimating parameters of stochastic gene transcription. Math. Biosci. 2022, 345, 108780. [Google Scholar] [CrossRef]
  26. Feng, J.; Jing, L.; Ting, L.; Yifeng, Z.; Wenhao, C.; Leonidas, B.; Chen, J. What can we learn when fitting a simple telegraph model to a complex gene expression model? PLoS Comput. Biol. 2024, 20, e1012118. [Google Scholar] [CrossRef]
  27. Zhang, J.; Yang, Y.; Ding, J. Information criteria for model selection. WIREs Comput. Stat. 2023, 15, e1607. [Google Scholar] [CrossRef]
  28. Hastie, T.; Tibshirani, R.; Friedman, J.H. The Elements of Statistical Learning: Data Mining, Inference, and Prediction: With 200 Full-Color illustrations; Springer: New York, NY, USA, 2001. [Google Scholar]
  29. Sheather, S.J.; Jones, M.C. A reliable data-based bandwidth selection method for kernel density estimation. J. R. Stat. Soc. Ser. B 1991, 53, 683–690. [Google Scholar] [CrossRef]
  30. Cheverud, J.M. A simple correction for multiple comparisons in interval mapping genome scans. Heredity 2001, 87, 52–58. [Google Scholar] [CrossRef] [PubMed]
  31. Li, J.; Ji, L. Adjusting multiple testing in multilocus analyses using the eigenvalues of a correlation matrix. Heredity 2005, 95, 221–227. [Google Scholar] [CrossRef] [PubMed]
  32. Goldbeter, A.; Koshland, D.E. An amplified sensitivity arising from covalent modification in biological systems. Proc. Natl. Acad. Sci. USA 1981, 78, 6840–6844. [Google Scholar] [CrossRef] [PubMed]
  33. Das, J. Activation or tolerance of natural killer cells is modulated by ligand quality in a nonmonotonic manner. Biophys. J. 2010, 99, 2028–2037. [Google Scholar] [CrossRef] [PubMed]
  34. Silverman, B.W. Density Estimation for Statistics and Data Analysis; Chapman and Hall: London, UK, 1986. [Google Scholar]
Figure 1. Minimal NK Cell Signaling Model shows the biochemical reactions for the six protein species in the interaction model: Syk, Vav1, Syk-Vav1, SHP1, pVav1, and SHP1-pVav1. The corresponding set of coupled ODEs uses the reactions shown (above) to describe the deterministic mass-action kinetics of this signaling model. To ensure that θ = (θ1, θ2, θ3) is identifiable, the other three reaction rates were fixed at 0.12 s⁻¹, 0.14 s⁻¹, and 0.05 s⁻¹, respectively.
Figure 2. A detailed examination of the “ground truth” MEDIUM scenario. Based on 1000 bootstrap resamples, Panel (A) shows the overall accuracy (black dots) and inaccuracy (grey dots) of ACE for the “ground truth” MEDIUM scenario. Panel (B) shows the distribution of absolute differences between the smallest and second smallest ACE scores when ACE selects the correct model (black) or an incorrect model (grey). The asterisk is the absolute difference between the ACE of h 2 and h 3 for data simulated with the “ground truth” MEDIUM scenario. Panel (C) shows how concordance with AICc provides additional evidence that the model selected by ACE is correct.
Table 2. Model selection based on a single TSS Dataset. For each “Ground Truth” scenario (LARGE, MEDIUM, and SMALL), we generated synthetic time-stamped snapshot (TSS) data for 6 proteins across 8000 cells at time t = 1.5 s. Columns 2 and 3 show the minimum ACE and the second smallest ACE, respectively. Columns 4 and 5 show the minimum AICc and the second smallest AICc, respectively. The minimization is taken over all three candidate models. The selected candidate model, h 1 , h 2 , or h 3 , is shown in brackets. For both ACE and AICc, the correct candidate model (bold) was selected.
Ground Truth | Minimum ACE | 2nd Smallest ACE | Minimum AICc | 2nd Smallest AICc
LARGE        | 23.8 [h3]   | 24.2 [h2]        | 33 [h3]      | 54 [h2]
MEDIUM       | 22.5 [h2]   | 23.6 [h3]        | 20 [h2]      | 32 [h3]
SMALL        | 23.5 [h1]   | 23.6 [h2]        | 10 [h1]      | 21 [h2]
Table 3. Accuracy of ACE model selection. For each “Ground Truth” scenario, 1000 bootstrap resamples were generated. Here, we show the probability of selecting candidate models h 3 , h 2 , and h 1 using ACE when the “Ground Truth” scenario is LARGE, MEDIUM, and SMALL, respectively. Corresponding probabilities for AICc are given in parentheses.
Ground Truth | P_select(h3) | P_select(h2) | P_select(h1)
LARGE        | 1.0 (0.68)   | 0 (0.32)     | 0 (0)
MEDIUM       | 0.05 (0.11)  | 0.95 (0.89)  | 0 (0)
SMALL        | 0 (0.01)     | 0.24 (0.13)  | 0.76 (0.86)
Table 4. Accuracy of ACE model selection (t = 4.5 s). For each “ground truth” scenario with protein abundances observed across 8000 cells, 100 bootstrap resamples were generated, and the probability of selecting candidate models h 3 , h 2 , and h 1 is shown. Model selection probabilities for ACE are bold, and AICc is shown in parentheses.
Ground Truth | P_select(h3) | P_select(h2) | P_select(h1)
LARGE        | 1.0 (1.0)    | 0 (0)        | 0 (0)
MEDIUM       | 0.05 (0.02)  | 0.95 (0.98)  | 0 (0)
SMALL        | 0 (0)        | 0 (0)        | 1.0 (1.0)
Table 5. Accuracy of ACE model selection with 4000 cells. For each “ground truth” scenario observed at t = 1.5 s, 1000 bootstrap resamples were generated, and the probability of selecting candidate models h3, h2, and h1 is shown. Model selection probabilities for ACE are bold, and AICc is shown (without a penalty) in parentheses.
Ground Truth | P_select(h3) | P_select(h2) | P_select(h1)
LARGE        | 1.0 (0.65)   | 0 (0.35)     | 0 (0)
MEDIUM       | 0.25 (0.35)  | 0.75 (0.65)  | 0 (0)
SMALL        | 0 (0)        | 0.42 (0.33)  | 0.58 (0.67)