Information-Corrected Estimation: A Generalization Error Reducing Parameter Estimation Method

Modern computational models in supervised machine learning are often highly parameterized universal approximators. As such, the value of the parameters is unimportant, and only the out of sample performance is considered. On the other hand much of the literature on model estimation assumes that the parameters themselves have intrinsic value, and thus is concerned with bias and variance of parameter estimates, which may not have any simple relationship to out of sample model performance. Therefore, within supervised machine learning, heavy use is made of ridge regression (i.e., L2 regularization), which requires the the estimation of hyperparameters and can be rendered ineffective by certain model parameterizations. We introduce an objective function which we refer to as Information-Corrected Estimation (ICE) that reduces KL divergence based generalization error for supervised machine learning. ICE attempts to directly maximize a corrected likelihood function as an estimator of the KL divergence. Such an approach is proven, theoretically, to be effective for a wide class of models, with only mild regularity restrictions. Under finite sample sizes, this corrected estimation procedure is shown experimentally to lead to significant reduction in generalization error compared to maximum likelihood estimation and L2 regularization.


Introduction
Kullback and Leibler [1] showed that minimizing a divergence ρ KL (f, g θ ) between the truth, f , and a parametric model density, g θ , is necessary and sufficient for making accurate predictions about data using the model defined by θ.Recent work [2] on Berk-Nash equilibria has shown the central role that KL divergence plays in game theoretic choice models such as multi-armed bandits and stochastic multi-party games.KL divergence thus plays a leading role in machine learning and neuroscience, with several inferential approaches developed in the information theory literature.Such approaches for minimizing KL divergence employ a range of methods, including data partitioning, Bayesian indirect inference and Mestimation [3,4,5].These approaches are quite distinct from the standard penalized loss minimization framework and, as such, are non-trivial to combine with supervised learning methods such as neural networks.
It is well known that maximum likelihood estimation (MLE) introduces an asymptotic bias in the KL divergence minimizer which is problematic for both model estimation and model selection.For many models, where the parameters θ are themselves important, this may be investigated as parameter bias and parameter variance.However, for models common in modern machine learning, the parameters themselves do not have any easily interpreted meaning.For these models, the parameters themselves are irrelevant and only the accuracy (in terms of KL divergence) of the model predictions matter.Within the information theory literature, this has often been referred to simply as bias (e.g., b(G) from [6]).To distinguish it from parameter bias, one might refer to it as "prediction bias" or "generalization error".Generalization error is the more common terminology (see, for example, Equation 1.1.6[7]) and will be used here.
Before the widespread use of machine learning, most models had interpretable parameters, and thus there is a large literature focused on reducing parameter bias.For instance, the jackknife [8] (leave-one-out cross-validation) estimator is an early example.More relevant to this paper is the approach of Firth [9] and later Kosmidis [10,11].More recently, Pagui, Salvan, and Sartori [12] proposed a parameter bias reducing estimation methodology.An extensive review of the literature around this point can be found in [13].Unfortunately, these approaches do not consider the impact on KL divergence-based generalization error and thus are not applicable to the field of machine learning where the parameters themselves are devoid of meaning.Heskes [14] shows that classifiers do have a notion of bias-variance decomposition for generalization error, but it is not computable from parameter bias and parameter variance.Therefore, parameter bias reducing formulations are not useful within machine learning unless it can be shown that they also reduce generalization error.
In fact, to seat the approach taken in this paper to generalization error, we recall much earlier and seminal work at the intersection of statistics and information theory.Akaike [15], and later Takeuchi [16], proposed information criteria (AIC and TIC, respectively) for model selection designed explicitly to reduce generalization error.Konishi and Kitagawa [6] extended the approach of Takeuchi to cases where MLE was not used to fit the underlying model, but still restricted themselves to the question of model selection.Stone [17] proved that Akaike's Information Criterion (AIC) is asymptotically equivalent to jackknifing when the estimator is finite.Takeuchi himself showed that TIC is an extension of AIC with fewer restrictions, and thus it too is equivalent to jackknifing whenever AIC would be valid.
For highly parameterized models, as are common in machine learning, model selection such as this is of limited utility.The parameter count may necessarily be very large, and thus none of the models fit using MLE may be acceptable.Then, merely choosing among them is unlikely to produce acceptable results.Within this field, typically L 2 or similar regularization is used to reduce generalization error.See Section 11.5.2 [18], for a typical example.For a more recent innovation, refer to [19].Note that regularization schemes such as this often increase parameter bias while decreasing generalization error.Golub, Heath, and Wahba [20] showed that L 2 regularization is asymptotically equivalent to cross-validation for linear models, subject to certain assumptions.For nonlinear models, it has long been known that L 2 regularization is not always valid, and it is trivial to construct example models1 where this approach is always harmful in expectation.
Therefore, it is important to develop a method to reduce generalization error in model estimation analogous to the way that L 2 regularization would commonly be used for a highly parameterized model, but having applicability for a wider family of models, especially those for which L 2 regularization is not applicable.It is not the goal of this paper to perform a wide survey of generalization error reducing approaches, but we will rather propose an additional approach, investigate its properties, and show that it has superior performance when compared against L 2 regularization, which is currently the dominant generalization error reducing estimation procedure within the field of machine learning.
To this end, this paper introduces a generalization error reducing estimation approach referred to as Information Corrected Estimation (ICE).This estimator is proven to have a generalization error of only O(n − 3 2 ) instead of O(n −1 ) as is the case for MLE, and is shown to be valid within a neighborhood around the MLE parameter estimate.Optimizing over this ICE objective function instead of the negative log likelihood thus produces parameters with superior out of sample performance.
Takeuchi's TIC and Firth's approach have never seen widespread use due to the computational and numerical issues that arise from the computation of this adjustment [21], and the ICE estimator in its raw form would have similar problems.Therefore, this paper also proposes an efficient approximation of this correction term, and shows through numerical experiments that the approximation is effective at improving model performance across a range of models.

Preliminaries
Let us assume that we have data x n := {x 1 , . . ., x n } generated from an unknown joint density function f (x) of X n := {X 1 , . . ., X n }.Where necessary, we define X n to denote a second sample drawn from f (x), independent of X n , and x n is the observed realization of X n .We consider a model M p given by a parametric family of densities M p := {g(•|θ) | θ ∈ Θ ⊆ R p }, for some compact Euclidean parameter space Θ, which is misspecified and hence excludes the truth f .Henceforth, the distribution over x identified by θ may be referred to as g θ (x) := g(x|θ) where it is notationally convenient to do so.
Suppose that θ 0 is the quasi-true parameter of model M, and θ(X n ) is the random variable representing the MLE of θ 0 fit on a dataset, x n .The negative log-likelihood of X n under the distribution g θ is where − (θ, X n ) is written including a 1 n to make the expectation of this quantity O(1) and asymptotically independent of n.Similarly, the minus sign is incorporated because − (θ, X n ) is a strictly non-negative quantity if g θ (x i ) is a probability.The MLE, θ(X n ), minimizes the negative log likelihood of the data set with respect to the model: The expectation of − (θ, X n ) is the cross entropy between f and g θ : Here, the expectation is a function only of θ and of the distribution f that generated the data X n .As a function of the distribution f , this value is O(1), but could be large for poorly conditioned f .The quasi-true parameter θ 0 is

Generalization Error in KL Divergence Based Loss Functions
Kullback and Leibler [1] viewed "information" as discriminating the sample data drawn from one distribution against another, and defined the KL-divergence ρ KL between distributions in terms of the ability to make predictions about one by knowing the other.Here, This value is in general unknowable, but given a sample X n from f , − (θ, X n ) will converge asymptotically to ρ KL (f, g θ ) plus an additive constant that depends only on f .The convergence relies on White's regularity conditions [22].
A well known result by Stone [17] shows that the MLE is a biased estimator of the minimum KL-divergence: because it is evaluated on the data X n which was used to fit θ.Cross-validation was developed as a model selection technique to select a model from a group that actually minimizes E Xn [ρ KL (g θ0 , g θ(Xn) )] and not merely in the limit of large n.Takeuchi [16] and Akaike [15] explicitly modeled this bias (generalization error) of an estimation procedure θ(X n ) as Our goal is to obtain an estimate, b * , of the generalization error b without using the MLE.We will then add this term to the objective function to develop the estimator θ * (X n ) so as to cancel the lower order terms of the generalization error.This estimator will then minimize E Xn [ρ KL (g θ0 , g θ * (Xn) )] more effectively than MLE, and potentially would in turn produce improved predictions from the model fitted over finite training sets.
Remark We note that under MLE, b = O( 1 n ) [16].Equivalently, one could say that a particular realization of the generalization error ) is used to indicate that the quantity is a random variable with finite variance, whose mean is O( 1 n ).

Information Corrected Estimation (ICE)
We propose the following penalized likelihood function: where J θ is the negative expected Hessian and I θ is the Fisher Information matrix with Îθ , Ĵθ being their estimates over the data.Let θ * denote the minimizer of (8).
The trace term in Equation ( 8) will be familiar from Takeuchi [16].However, Takeuchi showed only that this was the leading order of the bias for the MLE estimate θ, and therefore the proof found there is not sufficient to justify a new estimator that will itself be the target of optimization, and is required to be valid away from θ.As in Takeuchi, because I and J are unknowable, we will substitute their approximations computed from the training data, Îθ and Ĵθ during the actual computation of this objective.The numerical impact of this approximation will be examined in Section 4.2.1.
Remark Though AIC was developed before TIC, it is easily reproduced as a special case of TIC.Subject to certain conditions (guaranteed by the requirements of [15]), at least in expectation, I θ = J θ .Thus, the quantity within the TIC trace term, I θ J −1 θ , is the identity matrix.Therefore, its trace is equal to p, the parameter count of the model, recovering AIC.TIC itself can be derived using a proof that is similar to, though somewhat simpler than, the one we include in (A.4), of which Takeuchi's proof is a special case that is valid only at the MLE estimate θ.
We refer to the estimation of θ * , by minimization of this corrected likelihood function as Information-Corrected Estimation (ICE).As the terminology suggests, we depart from the corrective approach used in Information Criterion, by directly minimizing the bias corrected likelihood function.Note that unlike L 2 regularization, the correction term is parameter-free and thus would not require cross validation to estimate a hyperparameter such as the λ used by L 2 .
General properties of this estimator are proved, and a set of regularity conditions are provided such that the estimator is asymptotically normal, and produces a bias that is O p (n −3/2 ) instead of the usual O p (n −1 ).Though this adds only a half-order to the bias correction, for most problems with reasonably large n, any increase in order is likely to greatly reduce bias.Experimental results demonstrate superior properties of ICE for linear models compared to MLE with and without L 2 regularization.
Remark For models satisfying White's regularity conditions (See [22]), it is known that J θ0 is positive definite (thus non-singular) and continuous, and also that I θ0 is continuous with respect to θ.Therefore, 1  n tr(I θ0 J −1 θ0 ) would always be well defined in an open region around θ 0 .Similarly, the solution θ * would be expected to have the same properties, and hence (for large enough n) the estimate 1  n tr( Îθ * Ĵ−1 θ * ) would be well defined when computed using the estimates Îθ * and Ĵθ * .
2 ), that does not mean that L(θ * ) is reduced by any particular amount relative to L( θ).We expect that using this corrected objective will always (if it can be calculated accurately) generate some improvement by virtue of more accurately representing the true performance of the model out of sample, but there is no proof that this level of improvement has any particular form or asymptotic behavior.
Our approach preserves the linear complexity of training with respect to n.However, the computation of Ĵ−1 θ * at each iteration of the numerical solver requires the inversion of a symmetric positive definite matrix with a complexity of O(p 3 ).Hence the approach is not suitable for high dimensional datasets without adjustment.See Section 5 for optimized approximations that are viable for larger parameter counts.Further exploration of large models based on this approach are beyond the scope of the present work.
Remark It is clear from inspection that if − (θ) is strictly convex, then so too is − * (θ) for large enough n.
We first provide a proof of asymptotic convergence of θ * under certain regularity conditions.With this convergence result in place, we then show that minimizing (8) leads to an O(n −3/2 ) bias term, an improvement over the O( 1n ) term produced by MLE.

Local Behavior of the ICE Objective
Suppose the following conditions hold: 1. M satisfies White's regularity conditions A1-A6 (see Section A.1 or [22]).
2. θ 0 is a global minimum of −L(θ) in the compact space Θ defined in A2.

For
Then for sufficiently large n there exists a compact subset U ⊂ Θ containing θ 0 , θ, such that: is continuous, and bounded on U , almost surely.

For
Items (1-3) follow from Lemma A.2 (see Section A.2).These are additional regularity conditions that are prerequisites for later theorems.
Item (4) follows from Theorem A.3 in Section A.3.This states that the estimate θ * is asymtotically normal in a way that is analogous to classical asymptotic normality results for MLE.It is only true almost surely because results (1-3) upon which it relies are only true almost surely.
Item (5) follows from Theorem A.4 in Section A.4.This item establishes the superior accuracy of the ICE objective compared to the MLE objective function in predicting out of sample errors.Like item (4) this is only true almost surely because intermediate results on which it relies are only true almost surely.
The reduction in generalization error seen arises from the optimization over the superior ICE objective function, analogous to the way that L 2 regularization is used for this purpose.
Remark The regularity conditions described here are only slightly more strict than the conditions described by White [22].In particular, models having three continuous derivatives as required by White, but not 5 as needed here are thought to be very rare.Requirement ( 2) is just the definition of θ 0 , which White labels differently, and requirement (3) excludes a pathological corner case, the further study of which is beyond the scope of this paper.
Remark Note that as − (θ, x n ) is convex in the neighborhood of θ 0 , so too is − * (θ) for large enough n because − * (θ) → − (θ).Thus it can be concluded that the local behavior of − * in the neighborhood of θ 0 is not appreciably worse than the behavior of − if the problem is not too ill conditioned.

Direct Computation Results
The following experiments have been designed to compare MLE, MLE with L 2 regularization, and ICE for regression.Each experiment involves simulation of training and test sets and is implemented in R. See the attached code to run each experiment.
Each of these experiments has been performed using the raw formula for ICE provided in Equation ( 8) with minimal adjustments.All gradients are computed using R's default finite difference approach.This means that for a model with p parameters, the objective function is dominated by the inversion of J, which costs O(p 3 ) time and O(p 2 ) space.The use of finite difference gradients further increases the time complexity to O(p 4 ), compounding the problem.This approach is therefore viable for small models with few parameters, but not realistic for larger models.Optimizations to overcome this limitation will be considered in upcoming Section 5.The use of finite difference derivatives was not found to produce appreciable numerical differences in the final output, so analytic derivatives were not used for this analysis.
The code and results for this section is provided in [23].Throughout this section, the following estimators will be compared.

Gaussian Error Model
We begin by considering the simplest case of univariate linear regression with Gaussian residuals.The advantage of this simple model is that the exact form of the correction term can be derived analytically and aids therefore in building intuition on its behavior.For such a toy model, y ∼ N (µ, σ 2 ) and, for simplicity, the following example will consider µ to be a constant, but it is equally applicable if µ = µ(x).Consider the parameters of the model to therefore be θ := (µ, σ) with their optimal values being θ 0 := (µ 0 , σ 0 ).The the probability density function is It is known a priori that L 2 regularization cannot improve this model, as if µ 0 = 0, any decrease in the magnitude of µ is likely to be systematically harmful.Similarly, a decrease in σ below σ results in a decrease in model distribution entropy, and hence would be generally making overfitting worse, and would generate a correspondingly higher KL-divergence than the MLE estimate.Consequently, we would expect any λ computed through cross-validation to be statistically indistinguishable from zero, and L 2 regularization to be generally harmful whenever λ = 0.

Generalization Error Analysis
The Gaussian model described was generated with µ 0 = 0.2, σ 0 = 0.2, and dy = 0.001.For each of n ∈ {16, 32, 64, 128, 256, 512, 1024}, 500 independent simulations of the data y 1 , . . ., y n were performed, and then the parameters were fit from that data.In each simulation, θ was computed using MLE, MLE with L 2 regularization, and ICE.The λ parameter for L 2 regularization was computed using 2-way cross-validation on the available data, and as expected, none of the computed values of λ were statistically different from zero.
For each estimate of θ, the KL-divergence ρ KL (f, g θ ) was computed (using the known value of θ 0 ), and the results were compared.The ICE parameter estimation method showed statistically significant improvement over MLE at the 5-sigma level out to n = 64, and was improved by just under 1-sigma at n = 1024.
The KL-divergence results graphed against n on a log-log scale are shown in Figure 1.Every value of n is normalized by the average KL divergence of the MLE methodology to improve legibility.The L 2 series is statistically indifferent from the MLE series at 2 standard deviations beyond n = 32, and the two are not materially different for any n.The ICE series is at least 4.5 standard deviations below the MLE series until n = 1024.Remark In addition to the series shown in Figure 1, a series was computed using the true value of J, estimated from a much larger sample n = 1024 from the underlying distribution, and this series was indistinguishable from the series computed using Ĵ for every n, thus it was not graphed.This validates Takeuchi's approach of approximating J with Ĵ in this instance.
As expected, the difference in µ between ICE and MLE is not statistically significant (at three standard deviations) for any n, but the ICE computed value of σ (shown in Figure 2) is considerably larger than the MLE estimate, especially for small values of n.This explains the greatly reduced KL-divergence noted in Figure 1.Note that the difference in estimated σ is always statistically significant when compared to the MLE value.This is because both MLE and ICE are fit on the same data, so ICE would always have a larger σ than MLE regardless of the actual data chosen from the distribution f .This is the cause of the large z-scores shown, always exceeding 200.We know from elementary statistics that correlation between the mean and std.deviation causes the MLE estimate of σ to be systematically low by a factor of n−1 n .Indeed, the ICE estimate of σ * is closely tracking σ 0 whereas σ is closely tracking σ0(n−1) n as expected.This is one example where reducing generalization error also reduces parameter bias as a side effect.

Friedman's Test Case
We now extend the example from Section 4.1 to the case where µ is no longer constants.For this example, we chose a standard regression test set, which is nonlinear in the features, based on Section 4.3 of [24]: where the Friedman model is The random features, X j , are i.i.d.uniform random and the parameter values are fixed.The true parameter set, θ 0 = (10.0,20.0, 0.5, 10.0, 5.0, 1.0), reserves the last parameter (1.0) for the value of σ.
Note that here σ must be treated as an unknown parameter.To do otherwise implies that the modeler knows the amount of noise expected in the data.In the case of a known noise term, overfitting is impossible since overfitting arises when a model reduces the projected noise below its actual value, which can never arise when the noise level is known.
The model probability density g(x, y|θ) of y is given by Recall that in Section 4.1, the value of µ was considered to be a constant.This example is a natural extension of Section 4.1, and was chosen due to the well-explored difficulty of Friedman's problem.
We simulate 500 batches of equally sized training sets of length n ∈ {16, 32, 64, 128, 256, 512, 1024}.The test set is always of length 1024 to ensure accuracy for the smaller values of n.The starting point of the optimization is generated by adding a random perturbation, δθ ∼ N (0, 0.1), to each parameter.As before, the KL-divergence is computed between the distribution represented by the parameters and the true distribution, and these values are compared between estimation methods.
For each test sample, the KL divergence is computed using numerical integration with a dy increment of 0.01 over the interval containing µ ± 10σ for both the true and model distributions.The computed probabilities are verified to numerically sum to unity within an error of ±10 −3 .
In each simulation, θ is computed using MLE, MLE with L 2 regularization, and ICE.The λ parameter for L 2 regularization is computed using 4-way cross validation on each batch of the training data.
As shown in Figure 3 and Table 1, L 2 is not effective for any value of n, and is is completely inactivated for n > 32.Where regularization is used (i.e., λ = 0), it generally underperforms MLE.ICE is effective across the entire data range, outperforming MLE for every n, and always by a statistically significant margin of at least 5 sigma.

Impact of Ĵ Approximation
It was noted previously that Takeuchi used Ĵ (and likewise, Î) in place of the true value of J, and we do so here as well.Though there is no realistic way to avoid this approximation in the real world, and the optimized approach discussed in Section 5 has an entirely different set of approximations, the impact of this approximation will be briefly characterized here.
In Table 2 As can be seen from Table 2, using the true value of J is at most marginally helpful.In fact, for most values of n it displays slightly better average results, but slightly higher std.deviation of those results, and thus reduced T-statistics.Thus, we conclude that the Takeuchi's approximation, replacing J with Ĵ is reasonable.The same conclusion was reached in Section 4.1, see the remark there.We note also that the ICE estimator using Ĵ exhibits substantially better performance for very low sample sizes, but further investigation of this phenomenon is beyond the scope of the current paper.
In Table 3, we show the average matrix norms of J, Ĵ, and also of the diagonal of Ĵ, referred to as the matrix D. The matrix D will be examined further in Section 5, and is included here for completeness.We also show the norms of several matrix differences.
We note that the ICE objective values themselves exhibit much lower variation than the matrix norms show in Table 3.In particular though the matrix D is not actually converging to J as n increases, we see from the correction term it generates that this difference does not appear to have a material impact for larger n.We thus conclude that the major eigenvectors of ( Ĵ − J) and (D − J) are very nearly orthogonal to the gradient vectors used to construct Î for large n.Table 3: Mean matrix norms of J, its approximations, and differences from these approximations across 500 replications.
It is not clear from examining the trace terms in Table 3 that D is a worse approximation of J than Ĵ is, even for small n where the impact of the ICE approach is most significant.A more complete investigation of the spectrum of these matrices is beyond the scope of the present work.

Multivariate Logistic Regression
The previous experiment is based on a well-known test case.In this second experiment, we assess the general performance of ICE under (i) varying dimensionality of the true data distribution, (ii) increasing misspecification, and (iii) increasing training set sizes.To achieve this goal, we generate a more exhaustive set of data from a more complex data generation process.

Data Generation Process
The synthetic data are designed to exhibit a number of characteristics needed to broadly evaluate the efficacy of ICE.First, the regressors should be sufficiently correlated so as to ensure that model selection is representative of typical datasets.However, we avoid multi-collinearity by ensuring the smallest eigenvalue is above a certain threshold.We additionally control the condition number of the covariance matrix Σ by randomly generating a symmetric positive definite covariance matrix Σ ∈ R p using the eigen-decomposition where U is an orthogonal random matrix with elements U ij ∼ N (0, 1) and D is diagonal matrix of positive eigenvalues.The eigenvalues are uniformly distributed over the interval [a, b] so that the condition number of Σ is b/a and the eigenvalues are kept distinct.Here, a is chosen to be 1 × 10 −4 and b is chosen to be 0.1.Using a Cholesky decomposition Σ = ΓΓ T and the random mean vector µ ∼ N (0, 1), we generate correlated gaussian vectors of dimension p with the properties The data (x n , y n ) are generated under a logistic regression A key challenge in assessing the efficacy of bias reduction is to avoid generating excessively low entropy distributions.In such cases, bias reduction will have marginal effect as the parameters are all nearly zero.To avoid such scenarios, the intercept parameter of the true model is adjusted a-posterior until the following conditions are met: where c = 0.35, d = 0.65, and = 0.2.If these conditions can not be met, then the replication is discarded.

Model Performance Comparison
As in prior sections, KL divergence is computed between the estimated model and the true model for each of the estimation methods.The T-statistics of the difference with the corresponding MLE KL divergence are computed, with negative T-statistics showing that an approach is performing better than the MLE approach.For L 2 regularization in this section, the value of λ is computed via cross-validation, using two folds, on the provided fitting set.Table 4 compares the KL divergences ρ KL from the true distribution to the model distributions produced using various estimation approaches applied to misspecified data.Here m denotes the number of regressors that are not predictive, i.e., θ 0 contains m zeros.The experiment is replicated 300 times using the data generation process described above and the test set is fixed at 100,000 observations.We observe that the t-statistic for θ * is most significant for relatively small sample sizes, particularly n = 500.For these small sizes, the improvement over MLE is greater, though noisier.There is uniform decay in improvement over θ as n grows, until for p = 10 and p = 20 the largest sizes are no longer statistically significant.This is expected, as both the MLE and ICE estimates are converging towards the true value of θ 0 , and for large enough sample sizes the ICE correction would be dominated by numerical error, particularly the ill conditioning of J.
The L 2 estimate improves for small values of p, but then becomes progressively worse for large values of p.We observe that for dimensionality above p = 5, the L 2 regularization described here is no longer effective in reducing the KL-divergence.For low values of p the value of θx has comparatively low variance, and thus the logistic function is reasonably locally approximated as linear.For higher p this approximation is less realistic and the performance of L 2 regularization degrades.
For the ICE estimates, larger values of p show fluctuations that are often not statistically significant.It is apparent that larger p is increasing the variance of the ICE divergences, probably due to numerical errors and ill conditioning.Larger values of n reduce the absolute size of the divergence improvement whereas larger values of p seem to increase it.
Note that though the t-statistics are degrading for large n, the absolute magnitude of the differences is asymptotically small.For these sizes, the results are insignificant, but more importantly, immaterial.

Convergence Analysis for Large n
For 10 randomly chosen example problems, under which the model coefficients are now fixed, the convergence behavior for large n, the training set size, is explored.Note that the test set remains fixed at 100,000 observations for each problem.Table 5 compares the KL divergence (averaged over all 10 problems) under MLE ( θ), L 2 regularization, and ICE for progressively larger sample sizes.The divergences ρ KL (f, g θ ) and ρ KL (f, g θ * ICE ) converge to zero as n → ∞, as does ρ KL (f,

Generally the θ *
ICE estimates are seen to converge slightly faster than the θ estimates.The regularization in θ * L2 is observed to be beneficial for very small sample sizes, but then becomes marginally detrimental for large n.

Optimized Computation Results
For any model satisfying White's Regularity Criteria, it is known that the matrix J is positive definite near the MLE optimum θ.This implies that J is diagonally dominated, and indeed considering just its diagonal elements D, it is known that tr(ID −1 ) > 0. Indeed tr(ID −1 ) differs strongly from tr(IJ −1 ) most strongly for models with strong regressor interactions.Therefore, using finite difference gradients, consider the following approximations for the ICE objective function: 1. θ * : J is computed directly, ICE is implemented as written.

θ *
2 : J is taken to be constant w.r.t.θ: (J θ = J θ ). 3. θ * 3 : J is taken to be diagonal: (J = D).4. θ * 4 : J is taken to be the identity: (J = I).Clearly, we expect that θ * 4 above is the least accurate approximation, and items θ * 2 and θ * 3 have varying levels of accuracy depending on the problem at hand.The cost comparison of these approaches is shown in Table 6.

Approximation
Objective Cost (Space) Objective Cost (Time) Gradient Cost (Space) Gradient Cost (Time) Table 6: The asymptotic computational cost (per iteration) of various proposed approximations as a function of parameter count p. Cost is amortized when J θ = J θ assuming that n ≈ p.Note that a typical model will cost O(p) in time and space for both the objective function and its gradients.
Remark When computing gradients for use in a solver, often approximation error will have only a marginal impact on the final result, though it may increase the number of iterations needed for convergence.Broyden's method [25] is a typical example of this approach in action.Efficient approximations of [∂ θ Ĵ] might similarly have only a minor effect on accuracy and iteration count.The construction of approximate analytical derivatives is beyond the scope of the present work.
These approximations were computed and compared for the Friedman (see Section 4.2) problem, and the results are shown in Table 7 below.From Table 7, it is apparent that approach θ * 4 , taking J = I is not effective.This is not surprising as the actual J matrix has dramatic differences in scale between regressors.Approximation θ * 3 , taking J = D is accurate enough that it cannot be statistically distinguished from the direct computation of ICE by the test above.Approximation θ * 2 tends to underperform approximation (3).Therefore, we propose taking J = D as a more numerically stable approximation of the ICE objective.

Conclusions
Takeuchi [16] is believed to be the first to have proposed using an objective function similar to ICE in order to reduce generalization error, though it was applied via model selection.Firth [9] introduced a similar term to reduce parameter bias in model fitting, as opposed to model selection, though he derived it only for exponential model families and did not consider its effect on generalization error.It is not known why this approach did not find widespread use, but one may infer that the O(p 4 ) computational cost and instability was enough to keep it from wider adoption.
In this paper, we reintroduce the objective function of [16] and provide a more general proof of its widespread applicability.We then show that efficient implementations costing only O(p) are possible.Under finite sample sizes, this bias correction term is shown experimentally in several models to lead to significant reduction in bias compared to maximum likelihood estimation with and without L 2 regularization.ICE offers many advantages over L 2 penalized maximum likelihood estimation: (i) it's suitable for most nonlinear models, (ii) it's provably asymptotically convergent; and (iii) does not rely on any parameters which would need to be provided by the operator or deduced through crossvalidation.

For
Proof As the first derivatives of are continuous, the mean value theorem may be applied: θ is between θ * and θ * 0 .Under the assumptions of Lemma A.2, and given its finite variance, Ĵθ is almost surely (in the large n limit) positive definite, and thus invertible as both θ * and θ * 0 are in U , and θ is between them.Therefore, Applying the mean value theorem a second time gives with θ 1 between θ and θ * 0 .If the order of (θ * − θ * 0 ) = O p (δ), where δ := n −1/2 , then As all of the Ĵ * are bounded away from zero in probability, we have with the equality holding in probability.In the large n limit, δ → 0, and thus As ∂ θ * (θ * 0 ) is the sum of n independent vectors, it is asymptotically normally distributed by the central limit theorem, and its mean is 0 by the definition of establishing the result.

A.4 Proof of Prediction Bias Order under ICE
Theorem A.4 (Prediction Bias Estimation under ICE) By minimizing − * instead of − , the first order terms of the prediction bias are cancelled leaving a O p (n −3/2 ) residual term Proof Note that in Takeuchi's proof [16], the use of θ was prescribed.Therefore, this approach could be used only for model selection, and not for model fitting.Now consider the bias under the ICE estimator θ As the second term is zero, this can be simplified to b √ n , and then recall from White [22] that as it is an expectation), and terms of order O p (δ) and higher will be dropped.Therefore, as in Takeuchi's derivation [16], the Taylor expansions below will be truncated at second order in δ, dropping terms of order O p (δ 3 ) or higher.Additionally, we will occasionally drop indications of X n where the meaning is clear and it greatly simplifies the notation.
With that truncation, recalling that θ 0 is a minimum of L(θ) and thus has zero gradient: Recall Theorem A.3, and recall the that for the quadratic form: Therefore, Now note that the second term on the right is a constant, and therefore would take no part in any optimization.Therefore, it can be safely ignored and Addressing the first term, again taking a Taylor expansion, we find that where the constant C is composed of the neglected constant terms from earlier stages However, the last term is O(δ 3 ), and the first is O(δ 2 ), so this may be approximated as Recalling again that (θ * 0 − θ 0 ) = O(δ 2 ), it is clear that C = O(δ 4 ), and may thus be absorbed into the O(δ 3 ) residual term.Therefore, −L( θ) = − (θ * (X n ), X n ) + 1 n tr(I θ J −1 θ ) + O(δ 3 ). (36) Comparing this to the form of * (θ): Thus, by minimizing − * instead of − , the first order terms of the prediction bias are canceled and, in expectation, a more accurate model is produced.

Figure 1 :
Figure 1: A comparison of the KL-divergence (y-axis) of various estimation methods against the number of training samples n.Each KL divergence value was divided by the average KL divergence of the MLE estimate for that value of n.The ICE and L 2 series are shown with 2 standard deviation error bars.

Figure 2 :
Figure 2: The error in the estimated σICE and the Z-score of the estimate against the number of training samples n.

Table 1 :
Comparison of the average KL divergence across 500 replications for several model estimators given a fitting set size of n.For estimators other than θ, the values in parentheses denotes the t-statistic of the difference between this estimator and θ, with negative values indicating that the listed estimator has a lower KL divergence.
Figure 3: Comparison of the KL-divergence, averaged across 500 replications, of estimation methods against the number of training samples n.Each KL divergence value was divided by the average KL divergence of the MLE estimate for that value of n.The ICE and L 2 series are shown with 2 standard deviation error bars.n ρ KL (f, g θ ) ρ KL (f, g θL2 ) ρ KL (f, g θ * ) 16 6.19 × 10 −1 8.442 × 10 −1 (8.40) 3.02 × 10 −1 (−13.

Table 5 :
Comparison of the KL divergence under the MLE θ, L 2 regularization and ICE regularization θ * ICE against a large sample size for the case when p = 10 and m = 4.

Table 7 :
Comparison of the average KL divergence across 200 replications for MLE and several variants of ICE given a fitting set size of n.For estimators other than θ, the values in parentheses denotes the t-statistic of the difference between this estimator and θ, with negative values indicating that the listed estimator has a lower KL divergence.