# Improvement of the k-nn Entropy Estimator with Applications in Systems Biology

## Abstract

## 1. Introduction

## 2. k-nn Estimators of Differential Entropy

**Figure 1.** Performance of the k-nn estimator in Equation (1) for increasing sample size: box plots of estimated entropy values for samples from a uniform distribution on the interval [0,1] (**top**); on the hypercube ${[0,1]}^{5}$ (**center**); and on the hypercube ${[0,1]}^{15}$ (**bottom**). The true entropy value is denoted by the red line.
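The kind of experiment summarized in Figure 1 can be sketched in a few lines. The block below is a minimal, brute-force illustration of a k-nn entropy estimator in the widely used Kozachenko-Leonenko form; it is an assumption that this matches Equation (1) up to constants, since the equation itself is not reproduced here. The function names (`knn_entropy`, `digamma_int`, `log_unit_ball_volume`) are hypothetical, and the Euclidean norm and natural logarithm are assumed.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def digamma_int(n):
    """Digamma at a positive integer: psi(n) = -gamma + sum_{i=1}^{n-1} 1/i."""
    return -EULER_GAMMA + sum(1.0 / i for i in range(1, n))


def log_unit_ball_volume(d):
    """Log-volume of the d-dimensional Euclidean unit ball."""
    return 0.5 * d * math.log(math.pi) - math.lgamma(0.5 * d + 1.0)


def knn_entropy(points, k=1):
    """Kozachenko-Leonenko-style k-nn entropy estimate in nats.

    points: list of equal-length tuples; duplicates are assumed absent,
    because a zero neighbor distance would make the logarithm undefined.
    Brute force, O(N^2 log N); intended for small illustrative samples only.
    """
    n = len(points)
    d = len(points[0])
    log_dist_sum = 0.0
    for i, x in enumerate(points):
        # Distances from x to every other sample point, sorted ascending.
        dists = sorted(math.dist(x, y) for j, y in enumerate(points) if j != i)
        log_dist_sum += math.log(dists[k - 1])  # distance to the k-th neighbor
    return (digamma_int(n) - digamma_int(k)
            + log_unit_ball_volume(d) + d * log_dist_sum / n)


if __name__ == "__main__":
    random.seed(0)
    for d in (1, 5):
        sample = [tuple(random.random() for _ in range(d)) for _ in range(300)]
        # True entropy of the uniform distribution on [0,1]^d is 0.
        print(d, knn_entropy(sample, k=3))
```

Repeating the call over many independent samples and several sample sizes would give box plots of the kind shown in the figure; a tree-based neighbor search would be the natural replacement for the brute-force loop at larger sample sizes.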

## 3. Bias Correction of the k-nn Entropy Estimator

**Theorem 1** (Lebesgue)**.** Let $p\left(x\right)\in {L}_{1}\left({\mathbb{R}}^{d}\right)$; then, for almost all $x\in {\mathbb{R}}^{d}$ and open balls with radius ${r}_{n}\to 0$:
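The conclusion of the theorem, in the standard form of the Lebesgue differentiation theorem, can be written as follows. The symbols $B(x,{r}_{n})$ for the open ball centered at $x$ and $\lambda$ for the Lebesgue measure are notation assumed here and may differ from the original typesetting.

$$
\lim_{n\to \infty}\frac{1}{\lambda\left(B(x,{r}_{n})\right)}\int_{B(x,{r}_{n})}p\left(y\right)\,dy=p\left(x\right)
$$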

**Figure 2.** Comparison of the k-nn entropy bias estimate by Sricharan et al. [8] (blue line) with the real bias based on the k-nn entropy estimation from sampled points (green line) and the k-nn entropy bias estimate obtained by our method (red line).

## 4. k-nn Estimator Performance for Different Distributions

#### 4.1. Independent Marginals Case

**Figure 3.** Bias of the entropy estimator for growing dimension for the original k-nn entropy estimator in Equation (1) (red) and the corrected entropy estimator in Equation (6) (blue), for multivariate random variables with independent marginals sampled from a uniform distribution on the interval [0,1] (**left**) and from the Beta(3,1) distribution (**right**), for two different sample sizes (**top** and **bottom**).

**Figure 4.** Box plots of estimated entropy values for growing sample size, obtained with the k-nn entropy estimator in Equation (1) (**top**) and the corrected entropy estimator in Equation (6) (**center**), for four-dimensional random variables with independent marginals sampled from a uniform distribution on the interval [0,1] (**left**) and from the Beta(3,1) distribution (**right**). The true entropy value is denoted by the red line. The bottom panels show histograms of the marginal distributions.
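The reference entropy values for independent marginals can be checked by hand, because differential entropy is additive over independent components. The short derivation below is a worked sketch that assumes the natural logarithm (values in nats) and writes $h$ for the entropy of a single marginal; neither convention is confirmed by the text above.

$$
\begin{aligned}
h\left(U[0,1]\right) &= -\int_{0}^{1}1\cdot \ln 1\,dx = 0,\\
h\left(\mathrm{Beta}(3,1)\right) &= -\int_{0}^{1}3{x}^{2}\ln\left(3{x}^{2}\right)dx = -\ln 3-6\int_{0}^{1}{x}^{2}\ln x\,dx = \frac{2}{3}-\ln 3\approx -0.432,\\
H\left({X}_{1},\dots ,{X}_{d}\right) &= \sum_{i=1}^{d}h\left({X}_{i}\right)\quad \text{for independent }{X}_{i}.
\end{aligned}
$$

Under these assumptions, the four-dimensional case has true entropy $0$ for uniform marginals and $4\left(\frac{2}{3}-\ln 3\right)\approx -1.728$ for Beta(3,1) marginals.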

#### 4.2. Dependent Marginals Case

**Figure 5.** Bivariate distributions with a Gaussian copula dependence structure with correlation coefficient $\rho =0.5$ and marginals sampled from a uniform distribution on the interval [0,1] (**left**) vs. marginals sampled from the Beta(3,1) distribution (**right**).
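Samples of the kind shown in Figure 5 can be generated with a short script. The following is a minimal sketch, not the authors' code: it draws correlated standard normals, maps them to uniforms through the normal CDF (the Gaussian copula), and then applies the marginal quantile function. All function names are hypothetical. The Beta(3,1) quantile uses the closed form ${u}^{1/3}$, which follows from its CDF $F(x)={x}^{3}$ on [0,1].

```python
import math
import random


def std_normal_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sample_gaussian_copula_pair(rho, rng):
    """One draw (u1, u2) from the bivariate Gaussian copula with correlation rho."""
    z1 = rng.gauss(0.0, 1.0)
    # Two-dimensional Cholesky factor written out explicitly.
    z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
    return std_normal_cdf(z1), std_normal_cdf(z2)


def beta31_quantile(u):
    """Inverse CDF of Beta(3,1): F(x) = x**3 on [0, 1], so F^{-1}(u) = u**(1/3)."""
    return u ** (1.0 / 3.0)


def sample_bivariate(n, rho=0.5, marginal="uniform", seed=0):
    """n points with a Gaussian copula and uniform or Beta(3,1) marginals."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        u1, u2 = sample_gaussian_copula_pair(rho, rng)
        if marginal == "beta31":
            points.append((beta31_quantile(u1), beta31_quantile(u2)))
        else:
            points.append((u1, u2))  # uniform marginals: the copula sample itself
    return points


if __name__ == "__main__":
    for name in ("uniform", "beta31"):
        pts = sample_bivariate(5, rho=0.5, marginal=name, seed=1)
        print(name, pts)
```

Extending the sketch to more than two dimensions would require a general Cholesky factorization of the correlation matrix in place of the explicit two-dimensional formula.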

**Figure 6.** Bias of the entropy estimator for growing dimension for the original k-nn entropy estimator in Equation (1) (red) and the corrected entropy estimator in Equation (6) (blue), for multivariate random variables with dependent marginals sampled from a uniform distribution on the interval [0,1] (**left**) and from the Beta(3,1) distribution (**right**), for two different sample sizes (**top** and **bottom**). The dependence structure is given by the Gaussian copula with correlation coefficient $\rho =0.5$ among the marginals.

## 5. Sensitivity Indices Based on the k-nn Entropy Estimator

**Definition 2.** The mutual information between continuous random variables $X\sim p\left(x\right)$ and $Y\sim p\left(y\right)$ is defined by:
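In its standard textbook form the defining expression reads as follows, where $p\left(x,y\right)$ denotes the joint density of $X$ and $Y$ (a symbol assumed here, since it is not introduced in the text above):

$$
I\left(X;Y\right)=\iint p\left(x,y\right)\log \frac{p\left(x,y\right)}{p\left(x\right)p\left(y\right)}\,dx\,dy
$$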

**Definition 3.** The conditional entropy of a random variable $X\sim p\left(x\right)$ given a random variable $Y\sim p\left(y\right)$ is defined as:
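The standard form of this definition is given below, with $p\left(x,y\right)$ the joint density and $p\left(x\mid y\right)=p\left(x,y\right)/p\left(y\right)$ the conditional density (both symbols are assumed notation):

$$
H\left(X\mid Y\right)=-\iint p\left(x,y\right)\log p\left(x\mid y\right)\,dx\,dy=H\left(X,Y\right)-H\left(Y\right)
$$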

| Information theory | Set theory |
|---|---|
| $H(X,Y)$ | $X\cup Y$ |
| $I(X;Y)$ | $X\cap Y$ |
| $H(X\mid Y)$ | $X\setminus Y$ |

**Definition 4.** Assume that ${X}_{i}$ are the parameters of the model and $Y$ is the model output; then the single sensitivity indices are defined as:

**Definition 5.** If ${X}_{i}$ are the parameters of the model and $Y$ is the model output, then the interaction indices within a pair of parameters are defined by:

**Corollary 6.** The interaction index within the pair of parameters ${X}_{i}$ and ${X}_{j}$ can be expressed as the sum of the single sensitivity indices of parameter ${X}_{i}$ and parameter ${X}_{j}$ with respect to the output variable $Y$, minus the group sensitivity index for the pair of parameters:
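Written out symbolically, the statement of the corollary takes the form below. The notation is assumed for illustration: $I\left({X}_{i};Y\right)$ stands for a single sensitivity index, $I\left({X}_{i},{X}_{j};Y\right)$ for the group sensitivity index of the pair, and $I\left({X}_{i};{X}_{j};Y\right)$ for the interaction index.

$$
I\left({X}_{i};{X}_{j};Y\right)=I\left({X}_{i};Y\right)+I\left({X}_{j};Y\right)-I\left({X}_{i},{X}_{j};Y\right)
$$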

#### 5.1. Case Study: Model of the p53-Mdm2 Feedback Loop

| Initial condition | Species |
|---|---|
| ${Y}_{1}\left({t}_{0}\right)=0$ | p53 protein |
| ${Y}_{2}\left({t}_{0}\right)=0.8$ | Mdm2 ligase |
| ${Y}_{3}\left({t}_{0}\right)=0.1$ | Mdm2 mRNA, precursor of Mdm2 ligase |

| Parameter | Description |
|---|---|
| ${p}_{1}=0.9$ | p53 production |
| ${p}_{2}=1.7$ | Mdm2-dependent p53 degradation |
| ${p}_{3}=1.1$ | p53-dependent Mdm2 production |
| ${p}_{4}=0.8$ | Mdm2 transcription |
| ${p}_{5}=0.8$ | Mdm2 degradation |
| ${p}_{6}=0$ | independent p53 degradation |
| $k=0.0001$ | p53 threshold for degradation by Mdm2 |

**Figure 7.** Graphical scheme of the modelled system (**left**) and the oscillatory behavior for the considered parameter values (**right**).

**Figure 8.** Sensitivity indices based on mutual information (MI) (**left**); local sensitivity analysis based on derivatives of variables w.r.t. parameters, averaged over time (**right**).

**Figure 9.** Sensitivity indices for pairs of parameters (**left**) and the corresponding interactions within pairs of parameters for the model output (**right**).

## 6. Conclusions

## Appendix

## A. Performance of the Corrected k-nn Estimator

**Figure A1.** Box plots of estimated entropy values for growing sample size, obtained with the k-nn entropy estimator in Equation (1) (**top**) and the corrected entropy estimator in Equation (6) (**middle**), for four-dimensional random variables with independent marginals sampled from a standard normal distribution (**left**) and from an Exponential(1) distribution (**right**). The true entropy value is denoted by the red line. The bottom panels show histograms of the marginal distributions.

**Figure A2.** Bias of the entropy estimator for growing dimension for the original k-nn entropy estimator in Equation (1) (red) and the corrected entropy estimator in Equation (6) (blue), for multivariate random variables with independent marginals sampled from a standard normal distribution (**left**) and from an Exponential(1) distribution (**right**), for two different sample sizes (**top** and **bottom**).

**Figure A3.** Gaussian copula probability density function for two marginal variables with correlation coefficient $\rho =0.5$ (**left**) and a two-dimensional copula function with correlation coefficient $\rho =0.5$ (**right**).

## B. Sketch of the Proof of k-nn Estimator Convergence [7]

**Figure B1.** Venn diagrams for information-theoretic measures, including interactions between pairs of parameters.

**Figure B2.** Possible trajectories of the p53-Mdm2 negative feedback loop with perturbed parameters. The vertical axis denotes time; the horizontal axis denotes species concentration. The blue line corresponds to the p53 protein concentration, the green line to the Mdm2 mRNA concentration, and the red line to the Mdm2 ligase concentration.

