Improving the Performance and Stability of TIC and ICE

Takeuchi’s Information Criterion (TIC) was introduced as a generalization of Akaike’s Information Criterion (AIC) in 1976. Though TIC avoids many of AIC’s strict requirements and assumptions, it is only rarely used. One of the reasons for this is that the trace term introduced in TIC is numerically unstable and computationally expensive to compute. An extension of TIC called ICE was published in 2021, which allows this trace term to be used for model fitting (where it was primarily compared to L2 regularization) instead of just model selection. That paper also examined numerically stable and computationally efficient approximations that could be applied to TIC or ICE, but these approximations were only examined on small synthetic models. This paper applies and extends these approximations to larger models on real datasets for both TIC and ICE. This work shows that practical models may use TIC and ICE in a numerically stable way to achieve superior results at a reasonable computational cost.


Preliminaries
The ICE methodology that is analyzed by this paper is described in detail in [1]. That paper also contains a great deal of introductory information regarding information criteria and generalization errors. This section contains a brief review to introduce some common notation needed for the topics of this paper.
Consider the model g(x_i | θ) that assigns a probability to the regressors x_i, and is parameterized by the parameters θ. Suppose the actual probability of x_i is f(x_i). Suppose X_n is a random variable corresponding to a sample of size n drawn from f, composed of x_i for 0 < i ≤ n. If the sample size is not specified (or understood to be 1), then we may write X instead. The usual definitions for this sort of analysis are listed below.
Definition 1 (Log Likelihood of θ over X).

$$\ell(\theta, X_n) = \frac{1}{n} \sum_{i=1}^{n} \log g(x_i \mid \theta), \qquad L(\theta) = \mathbb{E}_X[\ell(\theta, X)],$$

where the expectation is taken over the distribution f(x).
Here, I(θ) denotes the covariance of the score, I(θ) = E_f[∂_θ log g(x|θ) ∂_θ log g(x|θ)^T], and J(θ) denotes the expected Hessian of the negative log likelihood, J(θ) = −E_f[∂²_θ log g(x|θ)]. For both I(θ) and J(θ), we define Ĵ(θ, X_n) and Î(θ, X_n) to be their estimators computed from the dataset X_n. We may write Ĵ and Î for simplicity when the meaning is clear.

Information Criteria and Generalization Error
A well known result by Stone [2] shows that the MLE is a biased estimator of the minimum KL-divergence, because it is evaluated on the data X_n, which was used to fit θ̂. Cross-validation was developed as a model selection technique to select a model from a group that actually minimizes E_{X_n}[ρ_KL(g_{θ_0}, g_{θ̂(X_n)})], and not merely E_{X_n}[−ℓ(θ̂(X_n), X_n)], in the limit of large n. Takeuchi [3] and Akaike [4] explicitly modeled this bias (generalization error) of an estimation procedure θ̂(X_n).
Akaike's Information Criterion (AIC) [4] was one of the earliest attempts to correct for this bias. AIC is able to correct for generalization errors when comparing MLE estimates for a restricted class of models. This work was extended by Takeuchi's TIC [3] to expand the class of models, while still requiring that the MLE estimates be used for comparison.
In particular, note that it has long been known (e.g., in [4]) that for the MLE estimate θ̂ of a model with m parameters, the bias b is asymptotically O(m/n) almost surely. So, for instance,

$$-\ell(\hat{\theta}(X_n), X_n) = -L(\hat{\theta}(X_n)) + O(m/n) \quad (8)$$

almost surely. Hence, for the MLE estimate θ̂ of a model with m parameters, we have that

$$b(\hat{\theta}(X_n)) = O(m/n) \quad (9)$$

almost surely. Proofs of this fact are found in both [4], for a restricted subset of models, and [3], for a broader class of models. The goal of AIC, TIC, and ICE is to reduce the generalization error by reducing the order of the O(m/n) term, incorporating a more negative power of n. This does not guarantee superior performance. In the case of TIC particularly, numerical instability can cause this term to have an unexpectedly large constant factor. However, if numerical instability is effectively controlled, it is expected that many problems could benefit from these techniques for moderate sample sizes, as will be shown in later sections.
Takeuchi’s criterion [3] takes the form

$$\mathrm{TIC}(\hat{\theta}, X_n) = -\ell(\hat{\theta}, X_n) + \frac{1}{n}\operatorname{tr}(\hat{I}\hat{J}^{-1}). \quad (10)$$

The second term here may be periodically referred to as the "trace term," as it appears in ICE as well in later sections. This was an extension of AIC, which had previously been developed by Akaike in [4]:

$$\mathrm{AIC}(\hat{\theta}, X_n) = -\ell(\hat{\theta}, X_n) + \frac{m}{n}. \quad (11)$$

Here, for convenience, we use the convention that TIC (and AIC) is O(1). In other work, it is often multiplied by n to produce a result that is O(n).
Takeuchi then showed that AIC is a limiting case of TIC. It was shown by Stone in [2] that AIC model selection and model selection via cross validation are equivalent whenever AIC is valid. By extension, TIC is also equivalent to cross validation under these circumstances.
If two models are to be compared using TIC or AIC, then the model with the lower value of TIC is on average the better model. Given two models, g_1 with TIC_1 and g_2 with TIC_2, the model g_1 is actually the better model with probability

$$p(\rho_{KL}(f, g_1) < \rho_{KL}(f, g_2)) = \frac{e^{-n \cdot \mathrm{TIC}_1}}{e^{-n \cdot \mathrm{TIC}_1} + e^{-n \cdot \mathrm{TIC}_2}}, \quad (12)$$

where ρ_KL(f, g_i) is the KL divergence between the true distribution f and the model generated distribution g_i. This follows directly from the fact that the exponential of a TIC value is a likelihood ratio, and the logic then proceeds in the usual way for likelihood ratio statistics [5].
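Equation (12) is straightforward to evaluate; a small sketch follows, using the algebraically equivalent logistic form to avoid underflow when n·TIC is large (the helper name `p_better` is illustrative):

```python
import math

def p_better(tic_1: float, tic_2: float, n: int) -> float:
    """Probability that model g_1 is the better model, per Equation (12).

    Logistic rewriting of exp(-n*TIC_1) / (exp(-n*TIC_1) + exp(-n*TIC_2)),
    which avoids computing the (possibly underflowing) exponentials directly.
    """
    return 1.0 / (1.0 + math.exp(-n * (tic_2 - tic_1)))

# Equal criteria give even odds; a lower TIC_1 pushes the probability above 1/2.
print(p_better(0.75, 0.75, 1000))        # 0.5
print(p_better(0.74, 0.75, 1000) > 0.5)  # True
```

Note that even a small per-observation TIC difference becomes decisive at large n, which is why Equation (12) demands large criterion differences when many models are compared.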
In this way, TIC (as with any information criterion) can be used to select the better model from a family of fit models. However, it requires that all models be fit using maximum likelihood estimation (MLE).

Additional Information Criteria
Modern machine learning models often have a very large number of parameters, in some cases having more parameters than observations in the fitting set. Recalling Equation (9), it can be seen that using the MLE estimateθ is likely to produce models that generalize poorly. For these models, information criteria have therefore fallen out of favor. Using an information criterion to choose the best model from a small set of models fit using MLE is unlikely to find an accurate model. If the number of models is very large, then Equation (12) dictates that the information criterion differences must be very large in order to reliably find the best model, and again the result is unlikely to perform well. Additionally, each fit of a model such as this may carry considerable expense, so producing a large number of fits to filter with information criteria may be cost prohibitive as well.
Konishi and Kitagawa [6] developed GIC, which extended TIC to no longer require MLE estimation, allowing regularization and similar generalization error reducing approaches. See [7] for an overview of typical regularization techniques that might be paired with GIC in this way. Unfortunately, GIC is not viable as written for modern machine learning models, as it still has a form similar to Equation (10), and, as discussed in [1,8,9], these equations are numerically unstable for large m (roughly m > 20), regardless of n.
Ishiguro et al. developed an alternate approach named the Extended Information Criterion (EIC) in [10]. The main idea is that TIC and AIC use a Taylor expansion of the generalization error, and their correction terms are simply the leading order terms of that expansion. However, the generalization error itself (Equation (7)) actually takes the form of an expectation over the true distribution. This expectation may be computed directly over the empirical distribution, avoiding the need for an expansion.
Additional analysis of EIC was performed by Kitagawa and Konishi in [11]. Their analysis indicates that EIC (in its most basic implementation) adds noise to the objective function proportional to the sample size. This means it may not be appropriate for large datasets without adjustment, and adjustments to reduce this issue are then proposed and analyzed.

ICE
In the discussions of information criteria in the previous sections the models would be fit using MLE, or some other procedure, and then model selection would be performed afterwards using an information criterion. The exact fitting procedure is not specified. These approaches assume that some procedure can be found which will produce models with reasonable levels of accuracy, but that is hardly a given if the model has a very high parameter count m relative to the observation count n.
Though GIC allows the use of regularization (and various other techniques), L2 regularization itself is not always effective. For instance, see Figures 1 and 3 from Section 4 of [1] for examples where L2 regularization is not helpful. Models as simple as estimating the mean and variance of a Gaussian through MLE are always harmed by L2. This gives good cause to believe that cases where L2 is not helpful, or not efficient, are fairly common. Approaches beyond regularization, such as early stopping or dropout, tend to have hyperparameters which can be difficult to estimate, just as regularization does. Additionally, there is little theoretical reason to believe that these approaches reduce generalization error efficiently.
An example of a highly parameterized model is a modern MNIST challenge leader [12] that has 1,514,187 parameters, but was fit on a dataset with only 30,000 observations. A discussion of why this often occurs within the field of machine learning is beyond the scope of this paper, but it is enough to know that this is an important use case for model fitting that is not well served by existing information criteria.
In [1], the ICE objective function is defined (Definition 7 of that paper):

$$-\ell^*(\theta, X_n) = -\ell(\theta, X_n) + \frac{1}{n}\operatorname{tr}\left(I(\theta)\, J(\theta)^{-1}\right).$$

Theorem 1 of [1] establishes the reduced order of the generalization error of the minimizer θ* of this objective.

Proof. See Appendix A of [1].
Theorem 1 would guarantee that b(θ*) = O(m/n^{3/2}) if I and J could be known exactly.
Taking v(θ, x) = −∂_θ ℓ(θ, x), and using the estimates Î and Ĵ in place of their true values, this can be rewritten (approximately) as

$$-\ell^*(\theta, X_n) \approx -\ell(\theta, X_n) + \frac{1}{n}\operatorname{tr}(\hat{I}\hat{J}^{-1}). \quad (14)$$

This substitution was also used by Takeuchi in [3]. Equation (14) (rather than Definition 7) will be the starting point for the analysis in the remainder of the paper. It is expected that, since Î → I and Ĵ → J, this equation would converge to Definition 7 and b(θ*) → O(m/n^{3/2}); however, that will not be proven here since it is orthogonal to the analysis performed.
This paper is primarily concerned with the empirical consequences of approximating Equation (14) through various means. The consequence of using Equation (14) instead of Definition 7 is not directly observable or relevant to that analysis. As the analysis below makes clear, numerical instability would make any approach using the Hessian directly unviable, regardless of whether or not one could gain access to the actual true Hessian.
For background on the consequences of using an empirical approximation to the Hessian or Fisher Information in lieu of the actual unobservable value, see [14].
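As a concrete reference point for later sections, the direct (unapproximated, and, as shown below, unstable) evaluation of the trace term in Equation (14) can be sketched as follows; the function name and array layout are illustrative assumptions:

```python
import numpy as np

def trace_term_direct(grads: np.ndarray, hessians: np.ndarray) -> float:
    """Direct O(m^3) evaluation of (1/n) tr(I-hat J-hat^{-1}) from Equation (14).

    grads:    (n, m) per-observation score vectors v(theta, x_i).
    hessians: (n, m, m) per-observation Hessians of the negative log likelihood.
    """
    n = grads.shape[0]
    I_hat = grads.T @ grads / n    # outer-product estimate of I(theta)
    J_hat = hessians.mean(axis=0)  # empirical estimate of J(theta)
    # pinv is a stabilized SVD-based pseudo-inverse; as discussed later, it
    # does not rescue the computation once J-hat is badly ill conditioned.
    return float(np.trace(I_hat @ np.linalg.pinv(J_hat)) / n)
```

For a correctly specified model the returned value approaches m/n, recovering the AIC penalty.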
Experiments in [1] showed that the neighborhood of validity for this approach is typically large enough to contain θ * . Thus, if some care is taken with the optimization itself (techniques for this are also described in [1]), then this approach is quite widely applicable.
For realistic data set sizes n, reducing the generalization error b from O(m/n) to O(m/n^{3/2}) might effectively eliminate overfitting, without requiring any hyperparameters or cross validation. For an analysis of the scale of leading order bias terms, see [15], where it is seen in numerical simulations that first order corrections such as this can drastically reduce generalization errors.
Notice that for any models fit using ICE, it is sufficient to compare values of −ℓ*(θ) for model selection, as these are also valid approximations of TIC values. Both TIC and ICE approximate the log likelihood that would be computed using cross validation, and if computed at θ̂, the MLE parameter estimate, Equations (14) and (10) are identical.
As ICE is a superset of TIC, most of this paper will focus on ICE with the exception of sections comparing TIC to AIC.
The ICE approach, as with TIC, has a few main drawbacks.

1.
Since J must be positive definite, this is only valid in a neighborhood around the MLE.

4.
Computing derivatives of ℓ* is expensive and potentially unstable.
Several proposals are made in [1] to overcome some of these issues. The purpose of this paper is to further analyze some of those proposals in the context of a large and more realistic problem, while also contributing additional improvements.

The Trace Term
The trace terms from Equations (10) and (14) are identical, and reproduced below:

$$\frac{1}{n}\operatorname{tr}(\hat{I}\hat{J}^{-1}).$$

This quantity is therefore well defined, but totally unsuitable for numerical computation, having two main problems:

1.
For even moderate parameter counts (e.g., 20+), the inverse condition number of Ĵ is typically less than machine precision. Therefore, the direct numerical computation of this quantity will quickly be dominated by numerical errors, even when a stabilized SVD-based pseudo-inversion is used.

2.
The computational cost of the inversion of Ĵ is O(m³), and this inversion must be performed on every iteration within the optimizer during model fitting. This quickly becomes intractable for parameter counts beyond a few hundred. This issue is less severe than the first, but it is still a major impediment to widespread adoption in the highly parameterized models where bias is most problematic.
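The first problem is easy to reproduce. The synthetic Ĵ below, whose eigenvalues span twelve orders of magnitude, stands in for the near-collinear Hessians that arise at these parameter counts; it fails the √ε singularity test used in the numerical stability analysis later in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
m = 25  # a moderate parameter count, per the text

# Synthetic J-hat with a wide eigenvalue spread; real near-collinear
# regressors produce similar spectra at these sizes (an illustrative stand-in).
eigvals = np.logspace(0, -12, m)
Q, _ = np.linalg.qr(rng.normal(size=(m, m)))
J_hat = Q @ np.diag(eigvals) @ Q.T

inv_cond = eigvals.min() / eigvals.max()  # inverse condition number of J-hat
eps = np.finfo(float).eps
print(inv_cond < np.sqrt(eps))  # True: numerically singular by the sqrt(eps) test
```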

Efficient Approximations
In [1], several efficient approximations of Equation (14) have been proposed, and here we consider the following:

$$\frac{1}{n}\operatorname{tr}(\hat{I}\hat{J}^{-1}) \approx \frac{1}{n}\operatorname{tr}(\hat{I}\hat{D}^{-1}) = \frac{1}{n}\sum_{k} \frac{[\hat{I}]_{kk}}{[\hat{D}]_{kk}}, \quad (16)$$

where here D̂ is the diagonal of Ĵ. In the analysis presented in [1], it was shown that using this approximation did not meaningfully impact accuracy, and at times even improved accuracy due to the numerical instability in the direct computation of Ĵ^{-1}. Similar approximations are also used in other well-known numerical algorithms. For instance, the widely used ADAM [16] optimizer uses an even looser approximation by replacing D with the identity matrix.
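The diagonal approximation of Equation (16) can be sketched as follows, assuming the diagonal of Ĵ is available (here passed in directly; the names are illustrative). Because the matrix is diagonal, the pseudo-inversion reduces to safely skipping tiny elements:

```python
import numpy as np

def trace_term_diag(grads: np.ndarray, d_hat: np.ndarray) -> float:
    """Approximate (1/n) tr(I-hat D-hat^{-1}) per Equation (16) in O(nm) time.

    grads: (n, m) per-observation score vectors v(theta, x_i).
    d_hat: (m,)  diagonal of J-hat.
    """
    n = grads.shape[0]
    i_diag = (grads ** 2).mean(axis=0)  # diagonal of I-hat
    # A diagonal matrix is pseudo-inverted safely regardless of condition
    # number: elements below machine precision are simply dropped.
    safe = np.abs(d_hat) > np.finfo(float).eps
    terms = np.where(safe, i_diag / np.where(safe, d_hat, 1.0), 0.0)
    return float(terms.sum() / n)
```

No O(m³) inversion appears anywhere, which is what makes per-iteration use inside an optimizer practical.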

Gradient Computation
To utilize TIC, the approximation in Equation (16) is sufficient; however, efficient computation of ICE also requires an approximate derivative. Since this derivative will be used only in optimizers, it need not be exact, but it is helpful if it generally has a high cosine similarity with the true derivative of Equation (16).
The gradient of the ICE objective function may be derived from Equation (14). Notice that J(θ, x_i)Ĵ^{-1} does not reduce to the identity, since it is a multiplication of the J matrix computed for a single observation against the inverse computed over all observations. Direct computation of this quantity costs O(m⁴) time, so an approximation is needed. Begin by applying the approximation from Equation (16). This is much improved, but still requires the computation of [∂_θ D̂] and J(θ, x_i), both of which cost O(m²) in time and space. Further approximations are available to us here because the optimizers that will use this gradient do not need its exact value. It is enough if the gradient is generally pointing "downhill" with respect to the objective function; it need not be very accurate beyond that. This translates to a requirement that the gradient approximation typically has a positive cosine similarity with respect to the actual value.
If n is not too small, then the trace term is small compared to ℓ(θ), and v(θ, x) is small at θ*, but D̂ is not. Similarly, [∂_θ D̂] need not be especially small or large in the neighborhood of θ*. Therefore, near θ*, the first correction term should be larger than the second, having only one factor of v(θ, x_i) instead of two.
Behavior far from θ* would generally be less important than behavior near this optimum, since it is expected that ℓ(θ) would dominate these gradient terms in that case. Additionally, the numerical stabilization discussed in the next section (Equation (23)) will also tend to reduce the importance of any term containing D̂ outside of the region where D̂ is positive definite by forcing (in the limit) that J = D = I. In this limit, both of the correction terms would become a simple scalar multiple of v(θ, x), which would make them irrelevant to the optimization.
This reduces the equation to a form whose computation is still O(m²), but one more approximation can be applied. Near θ̂, asymptotically, J = I under certain conditions. Making this substitution produces

$$\partial_\theta[-\ell^*(\theta, X_n)] \approx \frac{1}{n}\sum_{i=1}^{n}\left(1 + \frac{1}{n}\, v(\theta, x_i)^T \hat{D}^{-1} v(\theta, x_i)\right) v(\theta, x_i), \quad (21)$$

where now the inner quantity is the original ICE correction for this specific observation, and that is used as a weight for the unadjusted gradient. This quantity can then be stabilized using the techniques in Equation (23). Computationally, this is extremely efficient, requiring only O(m) time and space per observation. The only remaining difficulty is that this computation requires either two passes or O(nm) space, because the diagonal D̂ must be computed first, and then applied element by element using Equation (21). Therefore, the gradients must either be computed twice (since they are used in the computation of D̂), or their values stored. Alternatively, at a minor cost in accuracy, the D̂ from the previous iteration could be used. This approach was used in the mortgage model examined here.
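The weighted-gradient form of Equation (21) can be sketched as follows: each observation's unadjusted gradient is scaled by one plus that observation's own ICE correction. The exact constant on the correction is an assumption of this sketch, and the names are illustrative:

```python
import numpy as np

def ice_gradient(grads: np.ndarray, d_hat: np.ndarray) -> np.ndarray:
    """O(m)-per-observation ICE gradient in the spirit of Equation (21).

    grads: (n, m) per-observation gradients v(theta, x_i).
    d_hat: (m,)  diagonal of J-hat, e.g., carried over from the previous
           optimizer iteration, as described for the mortgage model.
    The 1/n scaling of the per-observation correction is an assumption.
    """
    n = grads.shape[0]
    d = np.where(np.abs(d_hat) > np.finfo(float).eps, d_hat, 1.0)
    corrections = (grads ** 2 / d).sum(axis=1) / n  # per-observation ICE term
    weights = 1.0 + corrections
    return (weights[:, None] * grads).mean(axis=0)
```

Using the previous iteration's `d_hat` turns this into a single pass over the data, at the minor accuracy cost noted above.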
An alternative to the approximation in Equation (21) is to assume instead that the per-observation diagonal D(θ, x_i) may be used in place of the pooled diagonal D̂. This approximation may also be computed in O(m) time and space. However, this computation appears to be less stable than Equation (21), due to the likelihood of non-positive definiteness of D(θ, x_i) for some observations x_i, and it will be seen in later sections that this is indeed the case.
The results section will analyze both Equations (21) and (22) numerically to determine which approach is more effective in this problem space.

Numerical Stabilization
Equations (16), (21) and (22) can all suffer from the potential singularity or ill conditioning of D̂. This is a more severe problem for ICE than for TIC, since ICE must necessarily operate far from the MLE optimum θ̂, where D̂ may be actually singular or not positive definite.
The analysis performed in [1] shows only that the trace term is the leading order generalization error term in a neighborhood U around θ̂, and it need not even be positive outside of that neighborhood. Additional theory around the relationship between θ* and θ̂ is not developed here, but should θ* be close enough to θ̂ that it falls within U, and hence J is positive definite, then it is sufficient to ensure that the optimization over −ℓ*(θ) is able to reach U; within that neighborhood it can then converge to θ*.
First, to improve numerical stability, we truncate to zero any gradient element small enough to be dominated by rounding error, where ε is double precision machine epsilon, approximately 10^{−16}. These terms are too small to change the outcome of a dot product within a machine error: a vector so truncated is indistinguishable via dot products from one that has not been, yet such terms can add numerical instability through rounding errors in the computation of D̂^{-1}. Similarly, for each element [D̂]_k a weight w is computed (Equation (23)). This weight is a continuous function of [D̂]_k, and goes to zero as [D̂]_k becomes small enough that it is dominated by rounding errors. In addition, for negative values of [D̂]_k, when multiplied by the square of the gradient in Equation (16), the term becomes 1.0, thus preventing instability from forming when the optimizer is not near the MLE solution and D̂ is not positive definite.
Geometrically, this means that far outside of U the trace term is approximately the constant m/n for sample size n; thus the optimizer will move towards U if the MLE optimization would have done so. As the optimizer draws closer to U, individual elements of the trace term begin to take on values other than 1.0. Deep within U, the objective −ℓ*(θ) is essentially unchanged from the value it would have in the absence of this correction, and thus the optimizer can freely converge to θ*. Proving that the adjustment from Equation (23) will always allow convergence to θ* when θ* ∈ U is beyond the scope of this paper. Qualitatively, this would be expected to usually work, and this behavior is analyzed numerically in later sections.
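One possible weighting with the properties described for Equation (23) can be sketched as follows. The particular blend used here is a hypothetical illustration, not the paper's exact formula:

```python
import numpy as np

def stabilized_trace_element(i_kk: float, d_k: float) -> float:
    """Illustrative stabilized trace-term element with the stated properties:
    the weight w is continuous in d_k, vanishes for negative d_k, and the
    element decays to 1.0 when d_k is not usable, so that far outside U the
    trace sums to approximately m. This blend is an assumption, not the
    form of Equation (23).
    """
    eps = np.finfo(float).eps
    if d_k <= 0.0:
        return 1.0  # D-hat not positive definite here: fall back to the AIC-like value
    w = d_k / (d_k + np.sqrt(eps))  # -> 1 for well-scaled d_k
    return w * (i_kk / d_k) + (1.0 - w) * 1.0
```

Deep within U (well-scaled positive d_k) the element is essentially the unstabilized ratio [Î]_kk / [D̂]_kk; outside U it collapses to 1.0, as the geometric picture above requires.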

The Mortgage Modeling Problem and Dataset
The goal of this paper is to expand upon the techniques from [1], and apply them to larger datasets with more complicated models than the simplified ones analyzed there.
The real world data chosen for this analysis are a selection of Freddie Mac single family loan level mortgage data. These mortgages represent loans used to purchase or refinance single family homes within the United States. The data are available in [18], under the data_fit.dat and data_test.dat files. These loan records are a random subset originally pulled from the pre-2008 originations available from [19], and enriched to include Home Price Index (HPI) and Interest Rate (IR) data downloaded from FRED [20]. Pre-crisis (pre-2008) data were chosen since those loans have more data available for them (up to 12 years at the time of the data download) than post-crisis mortgages. The data themselves are described in Appendix A.1. These data were chosen because they form an open data set, contain a large amount of data, and represent a problem of economic significance.
Each loan in the dataset is divided into a number of observations, one per month. In a given month, a loan may be in one of several states (i.e., status), and in the following month it may transition to a different status. Loan level mortgage models are classifiers used by major banks and other financial firms in order to evaluate the present value of individual mortgage loans. The behavior of the individual loans is aggregated to produce behavior projections for pools of loans, and the bonds collateralized by these pools. These bonds, combined with the pools themselves, are the primary instruments traded in the mortgage market, though individual underlying loans (known as whole loans) are also traded to a lesser degree. For a broad overview of mortgage finance, see [21].
The mortgage market as a whole encompasses approximately $3 trillion in valuation within the United States alone, making it one of the largest securities markets by value. This market is composed of approximately 60 million mortgages, and the market as a whole produces one new observation per month for each of these loans, so the dataset itself grows by roughly 60 million observations per month.
A complete mortgage modeling system would include one classifier for each loan status. Typically, this would be at least six classifiers, for statuses traditionally called C, 3, 6, 9, F, R, and there are two absorbing states, P and D, that do not need models. The goal of this system is to consider a loan in a given status (e.g., C) and predict its probability of transitioning to each of the other statuses. These transitions represent payments made or missed by the loan, so knowing the transitions produces a valuation for the loan. Typically, the next month's status would be simulated using the calculated probabilities, and then the model would be run again using the simulated starting point for the next month. In this way, it would produce (one path of) the entire future trajectory of the loan, which implies all its cashflows.
For the calculations below, we consider only the most significant classifier, the classifier for loans that are current (status C) in the projected month. This is the set of loans that have made all required payments up until that point. A loan in status C can either make the required monthly payment this month (going to status C for next month), miss a payment (going to status 3) or refinance and liquidate (going to status P). These loans cannot miss multiple payments in a single month (because only one is due), so they cannot go to status 6 or 9, and they cannot be placed in foreclosure, so they cannot go to F, R, or D in the next month.
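The status-C transition structure described above can be illustrated with a toy one-path simulation; the transition probabilities below are invented purely for illustration:

```python
import random

# Allowed one-month transitions from status C, per the text: make the payment
# (stay in C), miss a payment (3), or prepay/liquidate (P). Probabilities
# here are invented placeholders, not estimates from the dataset.
TRANSITIONS_FROM_C = {"C": 0.97, "3": 0.01, "P": 0.02}

def simulate_path(max_months: int = 360, seed: int = 7):
    """Simulate one path of monthly statuses for a current loan.

    Only the C-status classifier is modeled; reaching 3 or P ends this toy
    path (a full system would hand 3 off to the next classifier).
    """
    rng = random.Random(seed)
    path, status = ["C"], "C"
    for _ in range(max_months):
        if status != "C":
            break
        statuses = list(TRANSITIONS_FROM_C)
        weights = list(TRANSITIONS_FROM_C.values())
        status = rng.choices(statuses, weights=weights)[0]
        path.append(status)
    return path

path = simulate_path()
```

Repeating such paths many times and aggregating cashflows is what produces a valuation, as described above.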
Having a highly accurate model for these transitions would allow a financial firm to accurately price mortgages, pools of mortgages, and bonds based on these pools, providing a competitive advantage within the mortgage market.
Two classifiers are analyzed in the results below. The first classifier is an open sourced industrial model called ITEM [17]. This model was chosen because it is highly automated, greatly reducing questions of hyperparameters that might influence the results. It also makes heavy use of AIC, and so is a natural fit for discussions of information criteria. Additionally, the model is very parsimonious with parameters, producing accurate models with only very few parameters. This production of low parameter count models makes the analysis of the direct computation of the trace term in Equation (14) viable. This direct computation will be used as a basis for comparison in the ITEM model sections that follow, but will be omitted from the subsequent neural network section due to unwieldy parameter counts.
The second classifier is a low depth Multilayer-Perceptron classifier [22]. This model is selected because it is very widely used. The parameter counts are too high to expect numerically stable computation of Equation (14), and therefore an approach due to LeCun [23] is used to approximate the diagonal entries of J in order to apply Equation (16).

Results Overview
The numerical results in the following sections are computed based on the mortgage dataset described in Section 2. For these sections, many different approximations and optimizations are used and tested, and it is necessary to show that each previous approximation is valid before moving on to later computations that will then rely on it. Therefore, the calculations here are divided into two sections.
First, the ITEM model is examined in Section 4. This model can accurately represent the problem with parameter counts small enough to enable direct computation of the trace term without approximations, and it is therefore used to compare those approximations to the direct computation. The ITEM section performs the following tests.
3.
Establish that the ICE approximations improve TIC even when used as an information criterion: Section 4.2.

4.
Establish which gradient computation is most accurate and measure its accuracy: Section 4.3.

5.
Establish that ICE is not unreasonably computationally expensive: Section 4.5.
With these analyses finished, it has been shown that the proposed ICE approach is numerically stable, efficient, and effective. Then, the approach is expanded to larger parameter counts (Section 4.6), to ensure that it does not fail as the parameter counts grow to several hundred parameters. This wraps up the section that uses the ITEM model.
Next, a Multilayer-Perceptron is examined in Section 5. This is a far more common model than ITEM, but since it is prone to very large parameter counts, it is necessary to use ITEM to establish the viability of the approach at smaller parameter counts first. Additionally, since this model uses an additional approximation attributed to LeCun [23], it is helpful to examine the accuracy of ICE without that approximation, in the ITEM context first. With that done, the Multilayer-Perceptron with LeCun's approximation is shown to largely track what was found in ITEM in terms of the general utility of the ICE approach. The results section concludes with Section 5.3 establishing the reduction in generalization error produced by ICE for the Multilayer-Perceptron model.

ITEM Model
To show the viability of the ICE approach for real-world computations, a mortgage model was constructed from Freddie Mac single family loan level data. The model chosen here is ITEM, as described in [17]. This particular model was the basis for the US mortgage models at the Royal Bank of Canada starting from 2014 (Tyler Ward was the head of US mortgage modeling at RBC from 2014 to 2016). ITEM is an automated system for producing decompositions of datasets into analytic curves. It is chosen here due to its use in the industry for this problem, and because it uses AIC to automatically produce parsimonious models, and therefore direct computation using Equation (14) is viable as a basis for comparison. The raw output and code used for this section can be found in [18].
The ITEM procedure produces a sequence of models from a random seed, just as a Multilayer-Perceptron would produce different models for different random seeds due to the pseudo-random nature of the weight initialization. In this way, a large number of plausible models of varying parameter counts can be generated. Statistics can be collected describing what a typical operator would be likely to obtain from each of the estimation methods considered.
To implement this in a real-world situation, two main difficulties must be overcome.

1.
The ICE objective must be computed efficiently.

2.
Approximate gradients must be computed efficiently.

Model Construction
In the following sections, the ITEM model generation procedure was allowed to run from various starting seeds. For the parameter estimation approaches used, see Table 1. For each of these approaches, 20 series of ITEM models (100 series total) were generated. Each series produced several models as larger models were iteratively constructed from smaller ones; see [17] for a description of how this is performed. The analysis then considered all intermediate and final models from each series, for a total of 2123 models with parameter counts ranging from 2 to 83.

L2 Regularization
L2 regularization (i.e., ridge regression) was attempted with various values of λ, but all values λ ≠ 0 were found to be harmful or statistically indistinguishable from λ = 0. Consequently, these models were not qualitatively different from the MLE series described above. This was an expected outcome, and these approaches were not considered further in this paper.
For an example of why λ ≠ 0 was not helpful, see the results in Section 4.2 of [1], and note that the ITEM models generated here also use standard deviation-like parameters, and so suffer from the same problems with L2 regularization. Additionally, the scale of the regressors to this model covers more than 9 orders of magnitude (e.g., unpaid balance is of order 10⁷ and incentive is of order 10⁻²), greatly hampering L2 and LASSO-like approaches unless data normalization is applied as a preprocessing step.

Numerical Stability of TIC and ICE
To evaluate the numerical stability of ICE and TIC, we found the MLE parameter estimate (i.e., θ̂) for each of the 2123 models, and then computed the inverse condition numbers of Ĵ and Î at each of these optimum points. For this analysis, the matrix Ĵ is considered numerically singular whenever its inverse condition number is less than √ε, with ε being machine precision, approximately 10⁻¹⁶ for double precision floating point.
Within the literature, proper analysis of the numerical stability of this term is severely lacking. Kitagawa and Konishi performed some simulations in [15], but only for m = 2. Therefore, they did not encounter the difficulty inverting J, which (in the simulations below) begins to rapidly escalate for m > 10. Section 2.3 of Burnham and Anderson [9] gives the numerical instability of TIC when m ≥ 20 as a primary reason for avoiding it, and for its lack of widespread use.
Additional analysis of the numerical stability of TIC and related methods can be found in [8], where Section 3 provides a good overview of recent research in the area, and some numerical results for approximations of J. The computation of an accurate inverse Hessian (J −1 ) is a persistent source of difficulty in machine learning.
In this circumstance, these approaches are being used as information criteria since they are evaluated atθ. For clarity, the approaches are listed here as information criteria in Table 2.
When Ĵ is numerically singular, we expect that Ĵ^{-1} has unstable behavior and that TIC (the only approach directly using Ĵ^{-1}) becomes unstable. The other approaches based on Equation (16) are not expected to be affected by this, since a diagonal matrix may be safely pseudo-inverted regardless of its condition number.
The numerical instability displayed in Table 3 indicates that TIC, when directly computed, will not be numerically stable, in addition to carrying an O(m^3) computational cost. Using Equations (14) and (16), we compute the size of the adjustment term (tr(Î Ĵ^−1)) and compare it to the parameter count (m), which is its asymptotic limit if the model is correctly specified; see [4]. The results are labeled according to the approach as described in Table 2.
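A minimal sketch of the two styles of adjustment computation contrasted above, assuming estimates Î and Ĵ are already in hand. The function names are hypothetical, and the diagonal variant stands in for the Equation (16)-style approaches:

```python
import numpy as np

def tic_adjustment_direct(I_hat: np.ndarray, J_hat: np.ndarray) -> float:
    # Direct TIC adjustment tr(I J^{-1}); requires a full inverse and
    # becomes unstable when J is ill-conditioned.
    return float(np.trace(I_hat @ np.linalg.inv(J_hat)))

def tic_adjustment_diagonal(I_hat: np.ndarray, J_hat: np.ndarray) -> float:
    # Diagonal-only variant: pseudo-invert diag(J), which is safe for
    # any condition number (zero diagonal entries map to zero).
    d = np.diag(J_hat)
    d_pinv = np.divide(1.0, d, out=np.zeros_like(d), where=d != 0)
    return float(np.sum(np.diag(I_hat) * d_pinv))
```

For a correctly specified model, Î ≈ Ĵ and both adjustments approach the parameter count m, matching the asymptotic limit noted above.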
In Table 4, it can be seen that the TIC3 approach is more tightly clustered near 1.0 than either of the other approaches. The direct TIC approach shows signs of instability for higher parameter counts, where the average adjustment turns negative at the MLE estimate θ̂, where Ĵ should be positive definite. This is due to the instability described in Table 3. The approach TIC2 has large swings in value through the parameter space, and between models within each of these groups, indicating that it too suffers from instability, which the TIC3 approach is able to correct using the methodology described in Equation (23). In Table 5, it is shown that TIC and TIC2 have a significant proportion of models with a negative objective function adjustment, even though Ĵ should be positive definite at θ̂. This indicates substantial numerical instability in these approaches. The approach TIC3 does not have any negative adjustments; in fact, the lowest scaled adjustment (in the sense of Table 4) among any of the 2123 models considered here is 0.53. The conclusion that can be drawn here is that TIC is inappropriate for direct use on many models with even moderate parameter counts. However, applying the approximations and numerical stabilization described in [1] corrects many of these issues and might allow for a wider application of this adjusted approach to TIC. The additional numerical stabilization from Equation (23) is necessary in order to achieve reasonable results.
In Table 6, the sum of absolute errors between the test set value of ℓ(θ̂) and the information criterion is shown for each group of models, for AIC, TIC, and TIC3. Curiously, TIC does worst at relatively low parameter counts. Its results have higher variability at higher parameter counts, but its mean is not appreciably different from AIC on average. These results are, however, highly unstable for TIC and thus not appropriate for model selection. Though the approach ICE_A was used to produce approximately 1/5 of the 2123 models considered, it is clear from the analysis above that it is substantially worse than the ICE approach (TIC3 in the information criterion context), and thus it need not be considered further.

Gradient Accuracy
The finite-difference gradients of Equation (16), with stabilization via Equation (23), were computed for the 2123 target models. These were then compared to the approximate gradients computed using Equations (21) and (22). The average cosine similarity was computed, along with the fraction of gradients with positive cosine similarity; the results are summarized in Table 7. This analysis finds good agreement between the finite difference derivatives and the approximation through Equation (21), and somewhat lesser agreement using Equation (22). This indicates that approach ICE from Table 1 may outperform approach ICE_B.
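The comparison above can be sketched generically for any scalar objective; the step size and function names below are illustrative, not the paper's:

```python
import numpy as np

def finite_difference_gradient(f, theta: np.ndarray, h: float = 1e-6) -> np.ndarray:
    """Central finite-difference gradient of a scalar objective f at theta."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        g[i] = (f(theta + e) - f(theta - e)) / (2 * h)
    return g

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two gradient vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Averaging this similarity over many models, as done above, summarizes how well an analytic gradient approximation tracks the true (finite-difference) direction.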

Prediction Accuracy
For the 2123 target models, we refit the parameters of each using each of the target methodologies, and then computed the out-of-sample cross entropy. The results are summarized in Table 8. Here, the approach ICE_A has been eliminated, but ICE_RAW has been retained to represent the numerically unstable approaches and their impact on accuracy. As can be seen in Table 8, the ICE and ICE_B approaches perform similarly. This is expected, since they share the same objective function. Both of these approaches are a small improvement on MLE overall for every parameter group listed here. The approach ICE_RAW performs very badly due to its numerical instability for parameter counts larger than 20, and the performance of ICE_A (not shown) is similar. The individual entropy differences are small, but the outperformance of ICE is highly significant.
As can be seen in Table 9, ICE outperforms MLE at the 4 sigma level throughout most of the parameter range. The ICE_RAW approach has a comparatively large standard deviation due to numerical instability, but it underperforms at the 2 sigma level for most parameter counts.
Table 9. T-statistics of differences between the given methodology and MLE out-of-sample cross entropy (negative indicates the approach is performing better).

Computational Performance
For each of the target models, we recorded how long (in ms) a computer required to perform a parameter fit. The results are summarized in Table 10. The MLE estimates are omitted because the optimization starts at θ̂, making this optimization vacuous. As can be seen in Table 10, the cost of the ICE_RAW methodology appears to be superlinear in the parameter count. The other ICE methodologies are closer to linear, though with considerable noise. Much of this variation is due to the optimizers requiring more or fewer objective function evaluations to find an optimum. Therefore, some approaches could be more expensive due to poor derivatives causing more evaluations to be needed to find the optimum. To correct for this, we measured the time required for a computer to evaluate the objective function for each of these approaches. The time required is divided by the parameter count for each model, and then the results are grouped by parameter count and averaged. The results are summarized in Table 11. As can be seen from Table 11, since ICE and ICE_B share an objective function, they have equivalent costs. In both cases, the cost is highly linear in the parameter count, as expected. For these parameter sizes, the MLE objective cost is dominated by the exponentials and logarithms in the multinomial logistic and entropy calculations, respectively, and the MLE objective grows sub-linearly through this parameter range. The ICE_RAW objective function cost is dominated by the O(m^3) inversion of Ĵ, and in this parameter range its cost grows superlinearly. ICE_RAW and ICE begin at nearly the same cost, but for models with at least 30 parameters ICE_RAW is already more than twice as expensive as ICE. We expect that for larger parameter counts, the cost of ICE_RAW would be prohibitive.
The computational cost of each gradient computation was measured, and the results are summarized in Table 12. Again, each value was divided by the parameter count, as we expect all of these approaches to be O(m). In Table 12, the gradient computations for ICE, ICE_B, and ICE_RAW are almost exactly 4 times the cost of MLE throughout the entire parameter range. The approach ICE_RAW uses the same gradient computation as ICE, and ICE_B should have nearly identical computational complexity. We see here that this is indeed the case. For most model optimization approaches, gradient computation is the limiting factor. For such approaches, the ICE methodologies incur a small constant multiple in fitting costs.
Additional performance improvements are beyond the scope of the present work, but the current implementation does not efficiently share the computations needed to produce v(θ, x_i) with those needed to compute D̂. In a more tuned implementation, the cost of ICE could be reduced by a further small constant factor.

Large Model Accuracy
The most accurate (out-of-sample) model was selected from among the target models. This model had 32 parameters and an out-of-sample entropy of 0.19677. That model was then repeatedly expanded in order to generate a sequence of overfit models.
For each regressor, quantizations having from 3 to 10 buckets were produced. Then, for each such bucket except the first and last of each quantization, a Gaussian and a Logistic curve were added, centered on that bucket. For each such curve, the width of the curve (std. dev. for the Gaussian, inverse slope for the Logistic) was chosen to match the width of the target bucket. This procedure produced a sequence of 83 models having between 32 and 764 parameters. Each such model was fit using MLE, and then re-fit using each of MLE, ICE, and ICE_B to examine the quality of bias reduction in the case of overfit models. Many of these models settled on local minima and thus exhibited considerably worse performance than the baseline model. The results were grouped by parameter count and summarized in Table 13. The T-statistics of the differences between the ICE and ICE_B models and the MLE models are summarized in Table 14.
Table 14. T-statistics of differences between the given methodology and MLE out-of-sample cross entropy (negative indicates the approach is performing better) for large models.
The ICE_B approach may have statistically significant degradation in performance for large models (more than 300 parameters), but the ICE methodology itself does not appear to suffer from this. Some of its most significant results are actually produced at comparatively large parameter counts. Similarly, in Table 13, it can be seen that many of the largest improvements in absolute magnitude for the ICE model are also produced at relatively high parameter counts. None of these values are significant at the 5 sigma level, and only three are significant at the 3 sigma level (all in favor of ICE and ICE_B).
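The curve-generation step of this expansion can be sketched as follows. The function name and signature are hypothetical; only the curve shapes (a Gaussian kernel and a logistic curve, with width matched to the bucket) are taken from the text:

```python
import numpy as np

def expansion_curves(bucket_centers, bucket_widths):
    """Build Gaussian and logistic basis curves for interior buckets.

    Per the procedure above: every bucket except the first and last
    receives one Gaussian and one logistic curve centered on it, with
    the curve width (std. dev. for the Gaussian, inverse slope for the
    logistic) matched to the bucket width.
    """
    curves = []
    for c, w in zip(bucket_centers[1:-1], bucket_widths[1:-1]):
        # Gaussian bump centered on the bucket (std. dev. = bucket width).
        curves.append(lambda x, c=c, w=w: np.exp(-0.5 * ((x - c) / w) ** 2))
        # Logistic curve centered on the bucket (inverse slope = bucket width).
        curves.append(lambda x, c=c, w=w: 1.0 / (1.0 + np.exp(-(x - c) / w)))
    return curves
```

Each generated curve contributes one parameter to the expanded model, which is how the expansion drives the parameter count from 32 up to 764.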

Neural Network Implementation
For implementation within neural networks, it is necessary to be able to compute D̂ using back-propagation. The techniques for performing this computation are described by LeCun in Section 3.2 of [23]. Another description of this approach is given in Section 4.1 of [24].
Consider the neural network with L layers, and cost function C. Assume also that no connections skip layers. Typically, for a classifier, C would be a cross entropy loss, with y being the known labels of the training data.
Here, W_l is the matrix of edge weights for layer l, and it is assumed that the activation function at each layer is uniform, so that the univariate function f_l, when applied to a vector argument, produces a vector output. We may define the activation of layer l as a_l = f_l(W_l f_{l−1}(W_{l−1} · · · f_1(W_1 x))) = f_l(W_l a_{l−1}).
Then, we can rewrite the neural network, ignoring the parameter y, as C(a_L).
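The recursion a_l = f_l(W_l a_{l−1}) above can be sketched as a minimal forward pass, with the implicit bias handled by augmenting each activation vector with a constant 1. Names and conventions here are illustrative:

```python
import numpy as np

def forward(x, weights, activations):
    """Forward pass a_l = f_l(W_l a_{l-1}) with biases folded into W_l.

    Each activation vector is augmented with a constant 1 so the last
    column of each W_l acts as the implicit bias term mentioned above.
    """
    a = x
    for W, f in zip(weights, activations):
        a_aug = np.append(a, 1.0)  # implicit bias input
        a = f(W @ a_aug)
    return a
```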
Note that the weights contain an implicit bias term, so, more explicitly, the activation of the i'th node in layer l would be (a_l)_i = f_l(Σ_k (W_l)_{i,k} (a_{l−1})_k). Then, the second derivative of the objective function C of the network may be constructed by inverting this sum (so that it runs over the nodes i connected to by node k).
where the derivatives f_l′ and f_l″ are taken with respect to the function's sole argument. Note that this equation is only accurate for the case where the off-diagonal elements of the Hessian are neglected. Equation (30) may be rewritten in summation form, and the derivatives with respect to the weights then follow. These formulas are then suitable for a back-propagation implementation.

Back-Propagation Implementation
For modern neural nets, derivatives must be computed using back-propagation for efficiency reasons. This section describes the back-propagation techniques used to compute first and second derivatives.

Back-Propagation Gradient Implementation
Considering the network as previously defined, we may define the auxiliary value δ at the output layer, and then recursively define it for all other layers.
We may then compute the gradient of C from these values.
For future reference, note that ∂C/∂(a_L) is directly computable from the definition of C.
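A standard back-propagation sketch consistent with this recursion, written against pre-activations z_l = W_l a_{l−1} and omitting biases for brevity. This is the textbook formulation rather than the paper's exact notation:

```python
import numpy as np

def backprop_gradient(x, y, weights, f, f_prime, dC_da):
    """Back-propagation of the gradient dC/dW_l.

    delta_L = f'(z_L) * dC/da_L at the output layer, then moving
    backwards delta_{l-1} = f'(z_{l-1}) * (W_l^T delta_l).
    The gradient for layer l is the outer product delta_l a_{l-1}^T.
    """
    # Forward pass, caching pre-activations z_l and activations a_l.
    a, zs, acts = x, [], [x]
    for W in weights:
        z = W @ a
        zs.append(z)
        a = f(z)
        acts.append(a)
    # Backward pass.
    grads = [None] * len(weights)
    delta = f_prime(zs[-1]) * dC_da(acts[-1], y)
    for l in range(len(weights) - 1, -1, -1):
        grads[l] = np.outer(delta, acts[l])
        if l > 0:
            delta = f_prime(zs[l - 1]) * (weights[l].T @ delta)
    return grads
```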

Back-Propagation Hessian Diagonal Implementation
Analogously to the definition of δ, we define an auxiliary value γ using Equation (30). Because we will be computing only the diagonal of the Hessian, it is necessary to write it in summation form.
The value (γ_L)_i = ∂²C/∂(a_L)_i² is computable directly from the definition of C. Additionally, ∂C/∂(a_l)_i may be computed using Equation (38). Then, the diagonal of the Hessian itself is computed using Equation (34).
The combination of Equations (41), (42), and (37) is sufficient to compute the first and (non-mixed) second derivatives of the neural network in a single back-propagation pass.
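The diagonal recursion can be sketched as follows in the spirit of LeCun's approximation: off-diagonal terms are dropped, so second-derivative information is propagated with squared weights. This is a generic formulation under that assumption, not the paper's exact Equations (41) and (42):

```python
import numpy as np

def backprop_hessian_diagonal(x, y, weights, f, f_prime, f_second,
                              dC_da, d2C_da2):
    """LeCun-style diagonal Hessian sketch (off-diagonals dropped).

    gamma holds the approximate d2C/da_l^2; it is propagated backwards
    with squared weights, so only the Hessian diagonal is recovered.
    """
    # Forward pass, caching pre-activations and activations.
    a, zs, acts = x, [], [x]
    for W in weights:
        z = W @ a
        zs.append(z)
        a = f(z)
        acts.append(a)
    hdiag = [None] * len(weights)
    delta = dC_da(acts[-1], y)      # dC/da_L
    gamma = d2C_da2(acts[-1], y)    # approx. d2C/da_L^2
    for l in range(len(weights) - 1, -1, -1):
        z = zs[l]
        # Diagonal second derivative w.r.t. the pre-activation z_l.
        h = f_prime(z) ** 2 * gamma + f_second(z) * delta
        # Diagonal Hessian entries for this layer's weights.
        hdiag[l] = np.outer(h, acts[l] ** 2)
        if l > 0:
            delta = weights[l].T @ (f_prime(z) * delta)
            gamma = (weights[l] ** 2).T @ h
    return hdiag
```

For a single linear layer with squared-error loss, this recursion recovers the exact Hessian diagonal, since there are no off-diagonal terms to drop.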
Recall that Equation (30) is only an approximation. If more accuracy is needed (at the expense of more computation), then the matrix (Γ_l)_{(i,k)} (instead of the vector (γ_l)_i) may be computed and back-propagated using a similar formula, in which case Equation (42) relies instead on (Γ_l)_{(i,i)}, but is otherwise unchanged. That analysis is beyond the scope of the current work.

Derivatives of Cross-Entropy Multinomial Logistic Loss
Suppose the loss function C is cross-entropy loss using a multinomial logistic (i.e., softmax) classifier. Define the vector valued multinomial logistic function as σ(a)_i = e^{a_i} / Σ_j e^{a_j}. Then, the cross entropy loss of a single observation is C(a_L) = −Σ_i y_i log σ(a_L)_i, where y is a one-hot encoding of the classes for the given observation. The derivatives of this loss function with respect to a_L are ∂C/∂(a_L)_i = σ(a_L)_i − y_i and ∂²C/∂(a_L)_i² = σ(a_L)_i (1 − σ(a_L)_i). Note that, traditionally, a Multilayer-Perceptron will use the identity activation function for the last layer, in which case f_L(x) = x.
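These softmax cross-entropy derivatives are standard, and can be checked directly against finite differences. The helper names below are illustrative:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - np.max(a))  # shifted for numerical stability
    return e / e.sum()

def cross_entropy(a, y):
    """Cross-entropy loss of one observation; y is a one-hot vector."""
    return -float(y @ np.log(softmax(a)))

def dC_da(a, y):
    """First derivative w.r.t. the last-layer input: sigma(a) - y."""
    return softmax(a) - y

def d2C_da2_diag(a, y):
    """Diagonal second derivative: sigma(a) * (1 - sigma(a))."""
    p = softmax(a)
    return p * (1.0 - p)
```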

Prediction Accuracy
The ICE estimator was implemented in the Apache Spark MultilayerPerceptronClassifier and compared against a stock MultilayerPerceptronClassifier using Spark version 2.4.5. This implementation was chosen due to the dominant market share of Spark and the ease of implementation and testing within that codebase. Because the Spark MLP model does not provide for regularization or drop-out, ICE could not be compared against those approaches within this codebase.
The computation was performed on the same data set used for the ITEM calculations above, described in Appendix A.1.
For this section, accuracy was tested on four layer configurations. Each model has 11 input regressors and three classification states. The models tested are described in Table 15. The data were standardized before applying the neural network, as is common practice. The objective function used Equation (16) with Equation (23), and used LeCun's approximation (i.e., Equation (42)) to compute D̂, the diagonal of Ĵ. The gradients were computed using Equation (21). With the exception of the addition of LeCun's approximation, this is the same setup as was used for the models labeled ICE above.
Each configuration was fit 10 times on randomly drawn fitting sets of various sizes using both MLE and ICE. The cross-entropy on the testing set was averaged for each series of tests. All optimizations were performed using L-BFGS, which generally produced the better fits across these tests. The results are presented in Figures 1-4.
The models given here are relatively small compared to the huge models often found in machine learning (e.g., see [8]), but they are still much larger than the very small models with a handful of parameters often analyzed in statistical research. Whether or not these approaches generalize to much larger models with thousands or millions of parameters is beyond the scope of this present work, but there is no impediment found here that would prevent wider application to larger models.
In all four configurations, ICE effectively eliminates overfitting for all but the smallest sample sizes, whereas MLE suffers severe overfitting for smaller sample sizes. In all four tests, MLE performs slightly better at very large sample sizes, but the difference is not large. For the [11, 3] configuration shown in Figure 1, ICE shows some overfitting, but much less than MLE. For the other configurations, no material amount of overfitting is present. This is likely due to the specifics of the L-BFGS fitting algorithm, which can generally search the parameter space much more efficiently for a more nearly linear model such as [11, 3] than it can for more complicated configurations. The bias reduction from ICE is asymptotic, so it is not surprising that the approach is weaker at very small sample sizes. For larger models with correspondingly larger sample sizes, ICE is more consistently helpful. Regardless, ICE still greatly outperforms MLE for small sample sizes in even this smaller model.
For all four configurations, ICE fitting time (not shown) was not materially different from the time required to fit with MLE. The computation of the ICE loss and gradient as described here theoretically requires a small constant factor more computation than MLE loss and gradients. For these tests, both costs are swamped by other factors and overheads.
For the neural network model described here, a fuller analysis of the numerical instability of the TIC term cannot be performed due to the reliance on LeCun's approximation, which provides only the diagonal matrix D̂, not the full Hessian Ĵ. Additionally, the inversion of Ĵ would be overly costly at these parameter counts; therefore, those analyses are available only in Section 4.

Conclusions
It was shown in this paper that, for a real-world mortgage model in use within the industry, the incorporation of ICE can substantially improve prediction accuracy at the cost of a small constant-multiple increase in fitting time. Additionally, it was shown that the approach described in [1] can be successfully implemented in a Multilayer-Perceptron, and should be applicable to any back-propagating neural network using the techniques described here.
The numerical stability of the TIC term from [3] was explored on a real-world problem using industrial models. Numerical approximations and stabilization techniques were demonstrated that greatly reduce the effect of numerical instability in this term, which might otherwise prevent the application of TIC to problems with a significant number of parameters.
The diagonal approximation of LeCun was shown to produce good results in networks of reasonable depth, and may serve as the basis for the application of ICE and stabilized TIC techniques to a broad class of neural networks of moderate depth.
as Mark-to-Market and combined Loan-to-Value ratios). All regressors are required to be non-negative, except for the loan-to-value and unpaid balance regressors, which are required to be strictly positive, and incentive, which is required to be between −1.0 and 1.0 to remove a handful of loans with data entry errors. This filtering removes less than 0.5% of the data. The data are randomly split between fitting and testing datasets using probabilities (0.25, 0.75). The fitting set is reduced to 100,000 observations after the split, and the testing set contains 1,471,313 observations from this dataset.