# Information-Corrected Estimation: A Generalization Error Reducing Parameter Estimation Method


## Abstract


## 1. Introduction

## 2. Preliminaries

#### Generalization Error in KL Divergence Based Loss Functions

**Remark 1.**

## 3. Information Corrected Estimation (ICE)

**Definition 1.**

**Remark 2.**

**Remark 3.**

**Remark 4.**

**Remark 5.**
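To make the corrected objective concrete, the following is a minimal numerical sketch for a univariate Gaussian model. It assumes the ICE penalty takes the TIC-like trace form $\ell^{\ast}(\mathit{\theta},{x}_{n})=\ell(\mathit{\theta},{x}_{n})+\frac{1}{n}tr(\widehat{I}_{\mathit{\theta}}{\widehat{J}}_{\mathit{\theta}}^{-1})$, the same trace term tabulated in Table 3; the function names and the finite-difference Hessian are illustrative, not the paper's implementation.

```python
import numpy as np

# Hedged sketch: an ICE-style objective for a univariate Gaussian model,
# assuming the penalty is tr(I_hat J_hat^{-1}) / n, where I_hat is the
# empirical covariance of per-sample scores and J_hat is the Hessian of the
# mean negative log-likelihood. The finite-difference Hessian is for
# illustration only; an analytic Hessian would normally be used.

def neg_log_lik(theta, x):
    """Per-sample negative log-likelihood; theta = (mu, log_sigma)."""
    mu, log_sigma = theta
    sigma2 = np.exp(2.0 * log_sigma)
    return 0.5 * np.log(2.0 * np.pi * sigma2) + (x - mu) ** 2 / (2.0 * sigma2)

def scores(theta, x):
    """Per-sample gradients of neg_log_lik w.r.t. (mu, log_sigma)."""
    mu, log_sigma = theta
    sigma2 = np.exp(2.0 * log_sigma)
    return np.stack([-(x - mu) / sigma2, 1.0 - (x - mu) ** 2 / sigma2], axis=1)

def ice_objective(theta, x, eps=1e-5):
    n, p = len(x), len(theta)
    s = scores(theta, x)
    i_hat = s.T @ s / n                 # empirical information matrix
    j_hat = np.zeros((p, p))            # central finite-difference Hessian
    for i in range(p):
        for j in range(p):
            t = np.array(theta, dtype=float)
            t[i] += eps; t[j] += eps; fpp = neg_log_lik(t, x).mean()
            t[j] -= 2 * eps; fpm = neg_log_lik(t, x).mean()
            t[i] -= 2 * eps; fmm = neg_log_lik(t, x).mean()
            t[j] += 2 * eps; fmp = neg_log_lik(t, x).mean()
            j_hat[i, j] = (fpp - fpm + fmm - fmp) / (4.0 * eps ** 2)
    penalty = np.trace(i_hat @ np.linalg.inv(j_hat)) / n
    return neg_log_lik(theta, x).mean() + penalty
```

At the MLE, $\widehat{J}$ is positive definite and $\widehat{I}$ is positive semi-definite, so the penalty is non-negative and the ICE objective sits above the in-sample loss, as a generalization-error correction should.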

#### Local Behavior of the ICE Objective

- ${\mathit{\theta}}_{0}$ is a global minimum of $-\mathcal{L}(\mathit{\theta})$ in the compact space $\mathrm{\Theta}$ defined in A2.
- There exists an $\epsilon >0$ such that $-\mathcal{L}({\mathit{\theta}}_{0})<-\mathcal{L}({\mathit{\theta}}_{1})-\epsilon $ for all other local minima ${\mathit{\theta}}_{1}$.
- For $k=0,1,2,3,4,5$ the derivative ${\partial}_{\mathit{\theta}}^{k}\mathcal{L}(\mathit{\theta})$ exists, is continuous, and bounded on an open set around ${\mathit{\theta}}_{0}$.
- For $k=0,1,2,3,4,5$, the variance $\mathbb{V}[{\partial}_{\mathit{\theta}}^{k}\ell (\mathit{\theta},{X}_{n})]\to 0$ as $n\to \infty $ on an open set around ${\mathit{\theta}}_{0}$.

- For $k=0,1,2,3$ the derivative ${\partial}_{\mathit{\theta}}^{k}{\ell}^{\ast}(\mathit{\theta},{\mathit{x}}_{n})$ exists, is continuous, and bounded on U, almost surely.
- For $k=0,1,2,3$, $\mathbb{V}[{\partial}_{\mathit{\theta}}^{k}{\ell}^{\ast}(\mathit{\theta},{X}_{n})]\to 0$ as $n\to \infty $ on U, almost surely.
- ${\mathit{\theta}}^{\ast}\in U$ as $n\to \infty $ almost surely.
- $\sqrt{n}({\mathit{\theta}}^{\ast}-{\mathit{\theta}}_{0}^{\ast})\to N(0,{({\widehat{J}}_{{\mathit{\theta}}_{0}^{\ast}}^{\ast})}^{-1}{\widehat{I}}_{{\mathit{\theta}}_{0}^{\ast}}^{\ast}{\left({\widehat{J}}_{{\mathit{\theta}}_{0}^{\ast}}^{\ast}\right)}^{-1})$ almost surely.
- $-\mathcal{L}\left(\widehat{{\mathit{\theta}}^{\ast}}\right)=-{\ell}^{\ast}({\mathit{\theta}}^{\ast}({X}_{n}),{X}_{n})+{O}_{p}({n}^{-3/2})$ almost surely.
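The sandwich covariance ${({\widehat{J}}^{\ast})}^{-1}{\widehat{I}}^{\ast}{({\widehat{J}}^{\ast})}^{-1}$ appearing in the normality statement above can be estimated directly from per-sample score vectors. A minimal sketch (the helper name is ours, not the paper's):

```python
import numpy as np

# Sketch of the sandwich covariance J^{-1} I J^{-1} from the asymptotic
# normality statement, estimated from an (n, p) matrix S of per-sample score
# vectors and the (p, p) Hessian J_hat of the mean loss at the optimum.

def sandwich_covariance(S, J_hat):
    n = S.shape[0]
    I_hat = S.T @ S / n            # empirical information matrix
    J_inv = np.linalg.inv(J_hat)
    return J_inv @ I_hat @ J_inv   # estimates n * Cov(theta_hat)
```

For a well-specified model $I=J$ and the sandwich collapses to $J^{-1}$, recovering the usual inverse-information covariance; under misspecification the two factors differ, which is exactly the gap the ICE correction exploits.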

**Remark 6.**

**Remark 7.**

## 4. Direct Computation Results

#### 4.1. Gaussian Error Model

#### Generalization Error Analysis

**Remark 8.**

#### 4.2. Friedman’s Test Case

#### 4.2.1. Impact of $\widehat{J}$ Approximation

#### 4.3. Multivariate Logistic Regression

#### 4.3.1. Data Generation Process

- $c<{\mathbb{E}}_{Z}[p(Y=1|\mathit{x},{\mathit{\theta}}_{0})]<d$
- $-\mathcal{L}({\mathit{\theta}}_{0})>\epsilon $
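These constraints can be enforced by rejection sampling on the parameter draw. The sketch below is illustrative only: the thresholds c and d (and the floor $\epsilon$) are left symbolic in the text, so the values used here are placeholders.

```python
import numpy as np

# Hypothetical sketch of a logistic-regression data-generating process that
# enforces the class-balance constraint c < E[p(Y=1|x, theta_0)] < d by
# rejecting unbalanced draws of theta_0. The thresholds c=0.25, d=0.75 are
# placeholders; the text leaves them symbolic.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def draw_dataset(p, n, c=0.25, d=0.75, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    while True:
        theta0 = rng.normal(size=p)
        X = rng.normal(size=(n, p))
        probs = sigmoid(X @ theta0)
        if c < probs.mean() < d:        # accept only balanced populations
            y = rng.binomial(1, probs)
            return X, y, theta0
```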

#### 4.3.2. Model Performance Comparison

#### 4.3.3. Convergence Analysis for Large n

## 5. Optimized Computation Results

- ${\mathit{\theta}}^{\ast}$: J is computed directly, ICE is implemented as written.
- ${\mathit{\theta}}_{2}^{\ast}$: J is taken to be constant w.r.t. $\mathit{\theta}$: (${J}_{\mathit{\theta}}={J}_{\widehat{\mathit{\theta}}}$).
- ${\mathit{\theta}}_{3}^{\ast}$: J is taken to be diagonal: ($J=D$).
- ${\mathit{\theta}}_{4}^{\ast}$: J is taken to be the identity: ($J=\mathcal{I}$).
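Under the assumption that the correction is the trace penalty $\frac{1}{n}tr(\widehat{I}{J}^{-1})$ tabulated in Table 3, the variants above differ only in which J is inverted. The following sketch uses stand-in matrices, not the paper's code:

```python
import numpy as np

# Sketch of the J approximations above applied to the assumed trace penalty
# tr(I_hat J^{-1}) / n. I_hat and J_hat are stand-in PSD matrices here; in
# practice they are the empirical information matrix and the loss Hessian.

def ice_penalty(I_hat, J, n):
    return np.trace(I_hat @ np.linalg.inv(J)) / n

rng = np.random.default_rng(3)
p, n = 5, 200
A = rng.normal(size=(p, p))
J_hat = A @ A.T + p * np.eye(p)    # well-conditioned stand-in Hessian
B = rng.normal(size=(p, p))
I_hat = B @ B.T / p                # stand-in information matrix

pen_full     = ice_penalty(I_hat, J_hat, n)                    # direct J
pen_diagonal = ice_penalty(I_hat, np.diag(np.diag(J_hat)), n)  # J = D
pen_identity = ice_penalty(I_hat, np.eye(p), n)                # J = identity
```

Freezing J at $\widehat{\mathit{\theta}}$ corresponds to computing `J_hat` once and reusing it across optimizer iterations, which is what removes the repeated Hessian build from the per-iteration costs shown in Table 6.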

**Remark 9.**

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Appendix A. Proofs

#### Appendix A.1. White’s Regularity Conditions

**Definition A1.**

- **A1:** The independent random vectors ${X}_{i},i=1,\cdots ,n$, have common joint distribution function F on Ω, a measurable Euclidean space, with measurable Radon–Nikodym density $f=dF/dx$.
- **A2:** The family of distribution functions $G(x|\mathit{\theta})$ has Radon–Nikodym densities $g(x|\mathit{\theta})=dG(x|\mathit{\theta})/dx$ which are measurable in x for every $\mathit{\theta}\in \mathrm{\Theta}$, a compact subset of p-dimensional Euclidean space, and continuous in $\mathit{\theta}$ for every $x\in \mathrm{\Omega}$.
- **A3:** (a) $\mathbb{E}[\log f(X)]$ exists and $|\log g(x|\mathit{\theta})|\le m(x)$ for all $\mathit{\theta}\in \mathrm{\Theta}$, where m is integrable with respect to F; (b) ${\rho}_{KL}(f,{g}_{\mathit{\theta}})$ has a unique minimum at ${\mathit{\theta}}_{0}\in \mathrm{\Theta}$.
- **A4:** ${\partial}_{\mathit{\theta}}\log g(x|\mathit{\theta})$ are measurable functions of x for each $\mathit{\theta}\in \mathrm{\Theta}$ and continuously differentiable functions of $\mathit{\theta}$ for each $x\in \mathrm{\Omega}$.
- **A5:** $|{\partial}_{\mathit{\theta}}^{2}\log g(x|\mathit{\theta})|$ and $|{\partial}_{\mathit{\theta}}\log g(x|\mathit{\theta})\cdot {\partial}_{\mathit{\theta}}\log g(x|\mathit{\theta})|$ are dominated by functions integrable in x with respect to F for all $x\in \mathrm{\Omega}$ and $\mathit{\theta}\in \mathrm{\Theta}$.
- **A6:** (a) ${\mathit{\theta}}_{0}$ is an interior point of the parameter space; (b) $\mathbb{E}[{\partial}_{\mathit{\theta}}\log g(x|\mathit{\theta})\cdot {\partial}_{\mathit{\theta}}\log g(x|\mathit{\theta})]$ is non-singular; (c) ${\mathit{\theta}}_{0}$ is a regular point of $\mathbb{E}[{\partial}_{\mathit{\theta}}^{2}\log g(x|\mathit{\theta})]$.

#### Appendix A.2. Proof of Finite Variance

**Lemma A1.**

1. $\mathcal{M}$ satisfies White’s regularity conditions A1–A6 (see Appendix A.1 or [22]).
2. ${\mathit{\theta}}_{0}$ is a global minimum of $-\mathcal{L}(\mathit{\theta})$ in the compact space Θ defined in A2.
3. There exists an $\epsilon >0$ such that $-\mathcal{L}({\mathit{\theta}}_{0})<-\mathcal{L}({\mathit{\theta}}_{1})-\epsilon $ for all other local minima ${\mathit{\theta}}_{1}$.
4. For $k=0,1,2,3,4,5$ the derivative ${\partial}_{\mathit{\theta}}^{k}\mathcal{L}(\mathit{\theta})$ exists, is continuous, and bounded on an open set around ${\mathit{\theta}}_{0}$.
5. For $k=0,1,2,3,4,5$, the variance $\mathbb{V}[{\partial}_{\mathit{\theta}}^{k}\ell (\mathit{\theta},{X}_{n})]\to 0$ as $n\to \infty $ on an open set around ${\mathit{\theta}}_{0}$.

1. For $k=0,1,2,3$ the derivative ${\partial}_{\mathit{\theta}}^{k}{\ell}^{\ast}(\mathit{\theta},{\mathit{x}}_{n})$ exists, is continuous, and bounded on U, almost surely.
2. For $k=0,1,2,3$, $\mathbb{V}[{\partial}_{\mathit{\theta}}^{k}{\ell}^{\ast}(\mathit{\theta},{X}_{n})]\to 0$ as $n\to \infty $ on U, almost surely.
3. ${\mathit{\theta}}^{\ast}\in U$ as $n\to \infty $ almost surely.

**Proof.**

#### Appendix A.3. Proof of Asymptotic Normality

**Theorem A1.**

1. For $k=0,1,2,3$ the derivative ${\partial}_{\mathit{\theta}}^{k}{\ell}^{\ast}(\mathit{\theta},{\mathit{x}}_{n})$ exists, is continuous, and bounded on U.
2. For $k=0,1,2,3$, $\mathbb{V}[{\partial}_{\mathit{\theta}}^{k}{\ell}^{\ast}(\mathit{\theta},{\mathit{x}}_{n})]\to 0$ as $n\to \infty $ on U.

**Proof.**

#### Appendix A.4. Proof of Prediction Bias Order under ICE

**Theorem A2.**

**Proof.**

## References

1. Kullback, S.; Leibler, R.A. On Information and Sufficiency. *Ann. Math. Stat.* **1951**, *22*, 79–86.
2. Esponda, I.; Pouzo, D. Berk–Nash Equilibrium: A Framework for Modeling Agents With Misspecified Models. *Econometrica* **2016**, *84*, 1093–1130.
3. Jiang, B. Approximate Bayesian Computation with Kullback-Leibler Divergence as Data Discrepancy. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, Playa Blanca, Spain, 9–11 April 2018; Storkey, A., Perez-Cruz, F., Eds.; PMLR, 2018; Volume 84, pp. 1711–1721.
4. Nguyen, X.; Wainwright, M.J.; Jordan, M.I. Estimating Divergence Functionals and the Likelihood Ratio by Convex Risk Minimization. *IEEE Trans. Inf. Theory* **2010**, *56*, 5847–5861.
5. Pérez-Cruz, F. Kullback-Leibler divergence estimation of continuous distributions. In Proceedings of the 2008 IEEE International Symposium on Information Theory, Toronto, ON, Canada, 6–11 July 2008; pp. 1666–1670.
6. Konishi, S.; Kitagawa, G. Generalised Information Criteria in Model Selection. *Biometrika* **1996**, *83*, 875–890.
7. Roelofs, R. Measuring Generalization and Overfitting in Machine Learning. Ph.D. Thesis, University of California, Berkeley, CA, USA, 2019.
8. Miller, R.G. The Jackknife—A Review. *Biometrika* **1974**, *61*, 1–15.
9. Firth, D. Bias Reduction of Maximum Likelihood Estimates. *Biometrika* **1993**, *80*, 27–38.
10. Kosmidis, I. Bias Reduction in Exponential Family Nonlinear Models. Ph.D. Thesis, University of Warwick, Coventry, UK, 2007.
11. Kosmidis, I.; Firth, D. A generic algorithm for reducing bias in parametric estimation. *Electron. J. Stat.* **2010**, *4*, 1097–1112.
12. Kenne Pagui, E.C.; Salvan, A.; Sartori, N. Median bias reduction of maximum likelihood estimates. *Biometrika* **2017**, *104*, 923–938.
13. Kosmidis, I. Bias in parametric estimation: Reduction and useful side-effects. *Wiley Interdiscip. Rev. Comput. Stat.* **2014**, *6*, 185–196.
14. Heskes, T. Bias/Variance Decompositions for Likelihood-Based Estimators. *Neural Comput.* **1998**, *10*, 1425–1433.
15. Akaike, H. Information Theory and an Extension of the Maximum Likelihood Principle. In Proceedings of the 2nd International Symposium on Information Theory, Tsahkadsor, Armenia, 2–8 September 1971; Petrov, B.N., Csaki, F., Eds.; Akademiai Kiado: Budapest, Hungary, 1973; pp. 267–281.
16. Takeuchi, K. Distribution of information statistics and validity criteria of models. *Math. Sci.* **1976**, *153*, 12–18.
17. Stone, M. An Asymptotic Equivalence of Choice of Model by Cross-Validation and Akaike’s Criterion. *J. R. Stat. Soc. Ser. B Methodol.* **1977**, *39*, 44–47.
18. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning; Springer: New York, NY, USA, 2009.
19. Ross, A.; Doshi-Velez, F. Improving the Adversarial Robustness and Interpretability of Deep Neural Networks by Regularizing Their Input Gradients. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32.
20. Golub, G.H.; Heath, M.; Wahba, G. Generalized Cross-Validation as a Method for Choosing a Good Ridge Parameter. *Technometrics* **1979**, *21*, 215–223.
21. Yanagihara, H.; Tonda, T.; Matsumoto, C. Bias correction of cross-validation criterion based on Kullback–Leibler information under a general condition. *J. Multivar. Anal.* **2006**, *97*, 1965–1975.
22. White, H. Maximum Likelihood Estimation of Misspecified Models. *Econometrica* **1982**, *50*, 1–25.
23. Ward, T. Information Corrected Estimation: A Bias Correcting Parameter Estimation Method, Supplementary Materials. 2021. Available online: https://figshare.com/articles/software/Information_Corrected_Estimation_A_Bias_Correcting_Parameter_Estimation_Method_Supplementary_Materials/14312852/1 (accessed on 2 October 2021).
24. Friedman, J.H. Multivariate adaptive regression splines. *Ann. Stat.* **1991**, *19*, 1–67.
25. Broyden, C.G. A Class of Methods for Solving Nonlinear Simultaneous Equations. *Math. Comput.* **1965**, *19*, 577–593.

**Figure 1.** A comparison of the KL-divergence (y-axis) of various estimation methods against the number of training samples n. Each KL divergence value was divided by the average KL divergence of the MLE estimate for that value of n. The ICE and ${L}_{2}$ series are shown with two-standard-deviation error bars.

**Figure 2.** The error in the estimated ${\widehat{\sigma}}_{ICE}$ and the Z-score of the estimate against the number of training samples n.

**Figure 3.** Comparison of the KL-divergence, averaged across 500 replications, of estimation methods against the number of training samples n. Each KL divergence value was divided by the average KL divergence of the MLE estimate for that value of n. The ICE and ${L}_{2}$ series are shown with two-standard-deviation error bars.

**Table 1.** Comparison of the average KL divergence across 500 replications for several model estimators given a fitting set size of n. For estimators other than $\widehat{\mathit{\theta}}$, the values in parentheses denote the t-statistic of the difference between that estimator and $\widehat{\mathit{\theta}}$, with negative values indicating that the listed estimator has a lower KL divergence.

| n | ${\mathit{\rho}}_{\mathit{KL}}(\mathit{f},{\mathit{g}}_{\widehat{\mathit{\theta}}})$ | ${\mathit{\rho}}_{\mathit{KL}}(\mathit{f},{\mathit{g}}_{{\mathit{\theta}}_{\mathit{L}2}})$ | ${\mathit{\rho}}_{\mathit{KL}}(\mathit{f},{\mathit{g}}_{{\mathit{\theta}}^{\ast}})$ |
|---|---|---|---|
| 16 | $6.19\times {10}^{-1}$ | $8.442\times {10}^{-1}$ (8.40) | $3.02\times {10}^{-1}$ (−13.54) |
| 32 | $1.74\times {10}^{-1}$ | $1.74\times {10}^{-1}$ (0.16) | $1.37\times {10}^{-1}$ (−11.11) |
| 64 | $6.85\times {10}^{-2}$ | $6.85\times {10}^{-2}$ (0.0) | $5.81\times {10}^{-2}$ (−12.65) |
| 128 | $3.82\times {10}^{-2}$ | $3.82\times {10}^{-2}$ (0.0) | $3.53\times {10}^{-2}$ (−11.13) |
| 256 | $2.19\times {10}^{-2}$ | $2.19\times {10}^{-2}$ (0.0) | $2.12\times {10}^{-2}$ (−7.84) |
| 512 | $1.53\times {10}^{-2}$ | $1.53\times {10}^{-2}$ (0.0) | $1.51\times {10}^{-2}$ (−5.66) |
| 1024 | $1.25\times {10}^{-2}$ | $1.25\times {10}^{-2}$ (0.0) | $1.24\times {10}^{-2}$ (−5.03) |

**Table 2.** Comparison of the average KL divergence across 500 replications for ICE estimators with and without approximation of J given a fitting set size of n. For estimators other than $\widehat{\mathit{\theta}}$, the values in parentheses denote the t-statistic of the difference between that estimator and $\widehat{\mathit{\theta}}$, with negative values indicating that the listed estimator has a lower KL divergence.

| n | MLE | ICE ($\widehat{\mathit{J}}$) | ICE (J) |
|---|---|---|---|
| 16 | $6.19\times {10}^{-1}$ | $3.02\times {10}^{-1}$ (−13.54) | $6.21\times {10}^{-1}$ (0.05) |
| 32 | $1.74\times {10}^{-1}$ | $1.37\times {10}^{-1}$ (−11.11) | $1.51\times {10}^{-1}$ (−3.74) |
| 64 | $6.85\times {10}^{-2}$ | $5.81\times {10}^{-2}$ (−12.65) | $4.21\times {10}^{-2}$ (−17.29) |
| 128 | $3.82\times {10}^{-2}$ | $3.53\times {10}^{-2}$ (−11.13) | $2.99\times {10}^{-2}$ (−11.82) |
| 256 | $2.19\times {10}^{-2}$ | $2.12\times {10}^{-2}$ (−7.84) | $1.99\times {10}^{-2}$ (−4.19) |
| 512 | $1.53\times {10}^{-2}$ | $1.51\times {10}^{-2}$ (−5.66) | $1.48\times {10}^{-2}$ (−1.62) |
| 1024 | $1.25\times {10}^{-2}$ | $1.24\times {10}^{-2}$ (−5.03) | $1.24\times {10}^{-2}$ (−0.72) |

**Table 3.** Mean matrix norms of J, its approximations, and differences from these approximations across 500 replications.

| n | $\parallel \mathit{J}\parallel $ | $\parallel \widehat{\mathit{J}}\parallel $ | $\parallel \mathit{D}\parallel $ | $\parallel \widehat{\mathit{J}}-\mathit{J}\parallel $ | $\parallel \mathit{D}-\mathit{J}\parallel $ | $\frac{1}{\mathit{n}}\mathit{tr}({\mathit{IJ}}^{-1})$ | $\frac{1}{\mathit{n}}\mathit{tr}(\widehat{\mathit{I}}{\widehat{\mathit{J}}}^{-1})$ | $\frac{1}{\mathit{n}}\mathit{tr}(\widehat{\mathit{I}}{\mathit{D}}^{-1})$ |
|---|---|---|---|---|---|---|---|---|
| 16 | 1423.29 | 925,489.26 | 1738.30 | 926,165.93 | 2905.64 | 0.1452 | 0.0027 | 6.9484 |
| 32 | 875.03 | 89,414.11 | 96.49 | 89,786.19 | 793.43 | 0.0521 | 0.0256 | 0.0291 |
| 64 | 214.05 | 4820.55 | 89.55 | 4646.51 | 125.02 | 0.0202 | 0.0160 | 0.0162 |
| 128 | 200.86 | 200.68 | 85.00 | 27.65 | 115.86 | 0.0097 | 0.0086 | 0.0086 |
| 256 | 194.36 | 191.81 | 82.70 | 19.12 | 111.67 | 0.0048 | 0.0045 | 0.0045 |
| 512 | 191.20 | 190.10 | 82.76 | 14.18 | 108.44 | 0.0023 | 0.0023 | 0.0023 |
| 1024 | 188.91 | 187.39 | 81.89 | 11.58 | 107.02 | 0.0012 | 0.0012 | 0.0012 |

**Table 4.** Comparison of the KL divergence for the different estimation approaches applied to mis-specified data. The values in parentheses denote the t-statistic relative to MLE. For $p=\{5,10,20\}$ there are $m=\{2,4,8\}$ non-explanatory variables added.

| p | n | ${\mathit{\rho}}_{\mathit{KL}}(\mathit{f},{\mathit{g}}_{\widehat{\mathit{\theta}}})$ | ${\mathit{\rho}}_{\mathit{KL}}(\mathit{f},{\mathit{g}}_{{\mathit{\theta}}_{{\mathit{L}}_{2}}})$ | ${\mathit{\rho}}_{\mathit{KL}}(\mathit{f},{\mathit{g}}_{{\mathit{\theta}}^{\ast}})$ |
|---|---|---|---|---|
| 5 | 500 | $4.79\times {10}^{-3}$ | $3.01\times {10}^{-3}$ (−13.43) | $4.56\times {10}^{-3}$ (−13.53) |
| 5 | 1000 | $2.64\times {10}^{-3}$ | $1.76\times {10}^{-3}$ (−10.94) | $2.57\times {10}^{-3}$ (−12.80) |
| 5 | 2000 | $1.29\times {10}^{-3}$ | $1.09\times {10}^{-3}$ (−6.15) | $1.27\times {10}^{-3}$ (−7.69) |
| 5 | 5000 | $5.09\times {10}^{-4}$ | $4.60\times {10}^{-4}$ (−4.69) | $5.07\times {10}^{-4}$ (−6.19) |
| 10 | 500 | $9.79\times {10}^{-3}$ | $9.85\times {10}^{-3}$ (0.16) | $9.18\times {10}^{-3}$ (−6.27) |
| 10 | 1000 | $5.05\times {10}^{-3}$ | $5.13\times {10}^{-3}$ (0.51) | $4.83\times {10}^{-3}$ (−4.90) |
| 10 | 2000 | $2.50\times {10}^{-3}$ | $3.05\times {10}^{-3}$ (5.99) | $2.56\times {10}^{-3}$ (1.70) |
| 10 | 5000 | $1.06\times {10}^{-3}$ | $1.49\times {10}^{-3}$ (7.72) | $1.04\times {10}^{-3}$ (−0.86) |
| 20 | 500 | $2.18\times {10}^{-2}$ | $2.16\times {10}^{-2}$ (−0.29) | $1.95\times {10}^{-2}$ (−8.71) |
| 20 | 1000 | $1.13\times {10}^{-2}$ | $1.24\times {10}^{-2}$ (3.79) | $1.10\times {10}^{-2}$ (−1.95) |
| 20 | 2000 | $6.86\times {10}^{-3}$ | $7.52\times {10}^{-3}$ (4.47) | $6.72\times {10}^{-3}$ (−1.67) |
| 20 | 5000 | $3.57\times {10}^{-3}$ | $4.24\times {10}^{-3}$ (6.56) | $3.59\times {10}^{-3}$ (0.45) |

**Table 5.** Comparison of the KL divergence under the MLE $\widehat{\mathit{\theta}}$, ${L}_{2}$ regularization, and ICE regularization ${\mathit{\theta}}_{ICE}^{\ast}$ for large sample sizes n, in the case $p=10$ and $m=4$.

| n | $\mathcal{L}({\mathit{\theta}}_{0})$ | ${\mathit{\rho}}_{\mathit{KL}}(\mathit{f},{\mathit{g}}_{\widehat{\mathit{\theta}}})$ | ${\mathit{\rho}}_{\mathit{KL}}(\mathit{f},{\mathit{g}}_{{\mathit{\theta}}_{{\mathit{L}}_{2}}})$ | ${\mathit{\rho}}_{\mathit{KL}}(\mathit{f},{\mathit{g}}_{{\mathit{\theta}}^{\ast}})$ |
|---|---|---|---|---|
| 500 | $0.5439$ | $9.28\times {10}^{-3}$ | $7.92\times {10}^{-3}$ | $7.74\times {10}^{-3}$ |
| 1000 | $0.5439$ | $5.50\times {10}^{-3}$ | $5.86\times {10}^{-3}$ | $4.81\times {10}^{-3}$ |
| 2000 | $0.5439$ | $2.65\times {10}^{-3}$ | $3.67\times {10}^{-3}$ | $2.65\times {10}^{-3}$ |
| 5000 | $0.5439$ | $1.85\times {10}^{-3}$ | $2.72\times {10}^{-3}$ | $1.35\times {10}^{-3}$ |
| 10,000 | $0.5439$ | $5.75\times {10}^{-4}$ | $1.57\times {10}^{-3}$ | $9.32\times {10}^{-4}$ |
| 20,000 | $0.5439$ | $5.84\times {10}^{-4}$ | $8.10\times {10}^{-4}$ | $6.11\times {10}^{-4}$ |
| 50,000 | $0.5439$ | $3.83\times {10}^{-4}$ | $4.09\times {10}^{-4}$ | $3.64\times {10}^{-4}$ |
| 100,000 | $0.5439$ | $1.67\times {10}^{-4}$ | $1.15\times {10}^{-3}$ | $1.86\times {10}^{-4}$ |

**Table 6.** The asymptotic computational cost (per iteration) of various proposed approximations as a function of parameter count p. Cost is amortized when ${J}_{\mathit{\theta}}={J}_{\widehat{\mathit{\theta}}}$ assuming that $n\approx p$. Note that a typical model will cost $O(p)$ in time and space for both the objective function and its gradients.

| Approximation | Objective Cost (Space) | Objective Cost (Time) | Gradient Cost (Space) | Gradient Cost (Time) |
|---|---|---|---|---|
| Direct Computation | $O({p}^{2})$ | $O({p}^{3})$ | $O({p}^{2})$ | $O({p}^{4})$ |
| ${J}_{\mathit{\theta}}={J}_{\widehat{\mathit{\theta}}}$ | $O({p}^{2})$ | $O({p}^{2})$ | $O({p}^{2})$ | $O({p}^{3})$ |
| $J=D$ | $O(p)$ | $O(p)$ | $O(p)$ | $O({p}^{2})$ |
| $J=\mathcal{I}$ | $O(p)$ | $O(p)$ | $O(p)$ | $O({p}^{2})$ |

**Table 7.** Comparison of the average KL divergence across 200 replications for MLE and several variants of ICE given a fitting set size of n. For estimators other than $\widehat{\mathit{\theta}}$, the values in parentheses denote the t-statistic of the difference between that estimator and $\widehat{\mathit{\theta}}$, with negative values indicating that the listed estimator has a lower KL divergence.

| n | ${\mathit{\rho}}_{\mathit{KL}}(\mathit{f},{\mathit{g}}_{\widehat{\mathit{\theta}}})$ | ${\mathit{\rho}}_{\mathit{KL}}(\mathit{f},{\mathit{g}}_{{\mathit{\theta}}^{\ast}})$ | ${\mathit{\rho}}_{\mathit{KL}}(\mathit{f},{\mathit{g}}_{{\mathit{\theta}}_{2}^{\ast}})$ | ${\mathit{\rho}}_{\mathit{KL}}(\mathit{f},{\mathit{g}}_{{\mathit{\theta}}_{3}^{\ast}})$ | ${\mathit{\rho}}_{\mathit{KL}}(\mathit{f},{\mathit{g}}_{{\mathit{\theta}}_{4}^{\ast}})$ |
|---|---|---|---|---|---|
| 8 | $1.22\times {10}^{+1}$ | $4.55\times {10}^{+0}$ (−4.89) | $5.70\times {10}^{+0}$ (−5.26) | $3.83\times {10}^{+0}$ (−5.22) | $1.28\times {10}^{+0}$ (−4.67) |
| 16 | $6.68\times {10}^{-1}$ | $2.99\times {10}^{-1}$ (−8.13) | $3.53\times {10}^{-1}$ (−10.56) | $3.36\times {10}^{-1}$ (−8.30) | $5.47\times {10}^{-1}$ (−2.11) |
| 32 | $1.45\times {10}^{-1}$ | $1.14\times {10}^{-1}$ (−6.90) | $1.04\times {10}^{-1}$ (−8.18) | $1.08\times {10}^{-1}$ (−10.16) | $3.63\times {10}^{-1}$ (19.60) |
| 64 | $5.93\times {10}^{-2}$ | $4.80\times {10}^{-2}$ (−10.42) | $4.70\times {10}^{-2}$ (−6.95) | $4.81\times {10}^{-2}$ (−9.81) | $2.38\times {10}^{-1}$ (39.08) |
| 128 | $2.48\times {10}^{-2}$ | $2.24\times {10}^{-2}$ (−6.38) | $2.33\times {10}^{-2}$ (−2.26) | $2.26\times {10}^{-2}$ (−6.00) | $1.62\times {10}^{-1}$ (61.85) |
| 256 | $1.21\times {10}^{-2}$ | $1.16\times {10}^{-2}$ (−4.28) | $1.20\times {10}^{-2}$ (−0.68) | $1.16\times {10}^{-2}$ (−4.11) | $1.01\times {10}^{-1}$ (68.82) |
| 512 | $6.26\times {10}^{-3}$ | $6.10\times {10}^{-3}$ (−2.41) | $6.16\times {10}^{-3}$ (−0.84) | $6.10\times {10}^{-3}$ (−2.39) | $5.42\times {10}^{-2}$ (63.84) |
| 1024 | $3.05\times {10}^{-3}$ | $3.00\times {10}^{-3}$ (−2.66) | $3.04\times {10}^{-3}$ (−0.37) | $2.99\times {10}^{-3}$ (−2.73) | $2.61\times {10}^{-2}$ (59.78) |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Dixon, M.; Ward, T. Information-Corrected Estimation: A Generalization Error Reducing Parameter Estimation Method. *Entropy* **2021**, *23*, 1419.
https://doi.org/10.3390/e23111419
