Confidence Distributions for FIC Scores

When using the Focused Information Criterion (FIC) for assessing and ranking candidate models with respect to how well they do for a given estimation task, it is customary to produce a so-called FIC plot. This plot has the different point estimates along the y-axis and the root-FIC scores on the x-axis, these being the estimated root-mean-squared-error scores. In this paper we address the estimation uncertainty involved in each of the points of such a FIC plot. This needs careful assessment of each of the estimators from the candidate models, taking also modelling bias into account, along with the relative precision of the associated estimated mean squared error quantities. We use confidence distributions for these tasks. This leads to fruitful CD–FIC plots, helping the statistician to judge to what extent the seemingly best models really are better than other models, etc. These efforts also lead to two further developments. The first is a new tool for model selection, which we call the quantile-FIC, which helps overcome certain difficulties associated with the usual FIC procedures, related to somewhat arbitrary schemes for handling estimated squared biases. A particular case is the median-FIC. The second development is to form model averaged estimators with weights determined by the relative sizes of the median- and quantile-FIC scores.


Introduction and Summary
Mrs. Jones is pregnant. She is white, 25 years old, a smoker, and of weight 60 kg before pregnancy. What is the chance that her baby-to-come will have birthweight less than 2.50 kg (which would mean a case of neonatal medical worry)? Figure 1 gives a FIC plot, using the Focused Information Criterion (FIC) to display and rank in this case $2^3 = 8$ estimates of this probability, computed via eight logistic regression models, inside the class
$$p = P\{y = 1 \mid x_1, x_2, z_1, z_2, z_3\} = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \gamma_1 z_1 + \gamma_2 z_2 + \gamma_3 z_3)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \gamma_1 z_1 + \gamma_2 z_2 + \gamma_3 z_3)},$$
[...] with $z_2$ on board but without $z_1$ and $z_3$, etc.), associated with point estimates 0.282 and 0.259, whereas submodels 100 and 011 appear to be the worst, with rather less precise point estimates 0.368 and 0.226. Again, 'best' and 'worst' are gauged by the precision of these 8 estimates of the same quantity. Importantly, the FIC machinery, as briefly explained here, with more details in later sections, can be used for each new woman, with different 'best models' for different strata of women, and it may be used for handling different and even complicated focus parameters. In particular, if Mrs. Jones had not been a smoker, so that her $z_1 = 1$ would rather have been a $z_1 = 0$, we may re-run our programmes to produce a FIC table and a FIC plot for her, and learn that the submodel ranking is very different. Then 111 and 101 are the best and 001 and 000 the worst; also, the $\hat p$ estimates of her having a baby with low birthweight are significantly smaller.
Figure 1. Focused Information Criterion (FIC) plot for the $2^3 = 8$ models for estimating the probability of having a child with birthweight below 2.5 kg, for Mrs. Jones (white, age 25, 60 kg, smoker). Here, '101' is the model where $z_1, z_3$ are in and $z_2$ is out, etc.
Table 1. FIC table for Mrs. Jones: there are $2^3 = 8$ submodels, with absence-presence of $z_1, z_2, z_3$ indicated with 0 and 1 in column 2, followed by estimates $\hat p$, estimated standard deviation, estimated absolute bias, the root-FIC score, which is also the Pythagorean combination of the stdev and the bias, and the model rank. The numbers are computed with formulae of Section 2.
The FIC apparatus, initiated and developed in Claeskens and Hjort (2003), Hjort and Claeskens (2003a), and Claeskens and Hjort (2008), has led to quite a rich literature; see comments at the end of this section. FIC analyses have different forms of output, namely FIC tables (listing the best candidate models, along with estimates and root-FIC scores, perhaps supplemented with more information) and FIC plots. The general setup involves a selected quantity of particular interest, say $\mu$, called the focus parameter, and various candidate models, say $S$, leading to a collection of estimators $\hat\mu_S$. These carry root-mean-squared-errors $\mathrm{rmse}_S$, and the root-FIC scores are estimates of these root-risks. The FIC plot displays
$$(\mathrm{FIC}_S^{1/2}, \hat\mu_S) = (\widehat{\mathrm{mse}}_S^{1/2}, \hat\mu_S) \quad \text{for all candidate models } S, \qquad (1)$$
as with Figure 1. The present paper concerns going beyond such FIC plots, investigating the precision of each displayed point. The point estimates $\hat\mu_S$ carry uncertainty, as do the FIC scores. A more elaborate version of the FIC plot can therefore display the uncertainty involved, in both the vertical and horizontal directions. This aids the statistician in seeing whether good models are 'clear winners' or not, and whether the ostensibly best estimates are genuinely more accurate than others. In various concrete examples one also observes that a few candidate models appear to be better than the rest. The methodology of our paper makes it possible to assess to what extent the implied differences in FIC scores are significant.
Such insights lead also to model averaging strategies with weights given precisely to the best models for the given estimation purpose.
Our paper proceeds as follows. In Section 2 we give the required mathematical background, involving both the basic notation necessary and the key theorems about joint convergence of classes of candidate model based estimators. These results also drive the development of confidence distributions for FIC scores, in Section 3. These in turn also inspire a new variant for the FIC, which we call the quantile-FIC, where each root mean squared error quantity (rmse) is naturally estimated using an appropriate quantile in the associated confidence distribution. A special case is the median-FIC; details are given in Section 4. Having established such results, Section 5 then involves constructions of median-FIC driven weights for model averaging operations, where we also give a precise large-sample description of the implied model averaging estimators. In Section 6 we address performance and comparison issues, studying relevant aspects of how well different strategies behave, from post-FIC to model averaging estimators. It is in particular seen that the post-median-FIC estimators have certain advantages over post-AIC schemes. More information concerning performance is brought forward in Section 7, via simulation experiments, in four different setups. To display how our new CD-FIC based methods work in a setup with considerably more candidate models at play than with the $2^3 = 8$ models used for Mrs. Jones above, a multi-regression Poisson setup is worked through in Section 8, involving abundance of bird species for 73 British and Irish islands. Then we sum up various salient points in our discussion, Section 9, and offer a list of concluding remarks, some pointing to further research, in Section 10. In a separate Appendix, Section 11, we give technical details and formulae for required quantities and ingredients for candidate models inside a general regression framework.
We end our introduction section by commenting briefly on other relevant work, first on the FIC front and then on model averaging. Setting up FIC schemes involves finding good approximations to mse quantities, and then constructing estimators for these. This pans out differently in different classes of models, and sometimes requires lengthy separate efforts, depending also on the type of focus parameter. Claeskens and Hjort (2008) cover a broad range of general i.i.d. and regression models, using local neighbourhoods methodology. Later extensions include Claeskens et al. (2007) for time series models, Gueuning and Claeskens (2018) for high-dimensional setups, Hjort and Claeskens (2006) and  for semiparametric and nonparametric survival regression models, Zhang and Liang (2011) for generalised additive models, Zhang et al. (2012) for tobit models, Ko et al. (2019) for copulae with two-stage estimation methods. Recent methodological extensions and advances also include setups centred on a fixed wide model, with large-sample approximations not depending on the local asymptotics methods; see Claeskens et al. (2019); Hjort (2017, 2019), along with Cunen et al. (2020) for linear mixed models. There is a growing list of application domains where FIC is finding practical and context-relevant use, such as finance and economics (Behl et al. 2012;Brownlees and Gallo 2008), peace research and political science (Cunen et al. 2020), sociology (Zhang et al. 2012), marine science (Hermansen et al. 2016), etc. There is similarly a rapidly expanding literature on frequentist model averaging procedures, as partly contrasted with Bayesian versions; perspectives for the latter are summarised in Hoeting et al. (1999). A broad framework for frequentist averaging methods is developed in Claeskens and Hjort (2008); Hjort and Claeskens (2003a), including precise large-sample descriptions for how such schemes actually perform. Wang et al. (2009) give a broad review. 
In econometrics, Hansen (2007) studies model averaging for least squares procedures, and Magnus et al. (2009) compare frequentist and Bayesian averaging methods. Optimal weights are studied in Liang et al. (2011). The book chapter Chan et al. (2020) discusses optimal averaging schemes for forecasting, touching also the phenomenon that simpler weighting methods sometimes perform better than those involving extra layers of estimation to get closer to envisaged optimal weights.

Basic Setup and the FIC
In this section we give the basic theoretical background and main results behind the FIC plot (1). It is convenient to describe the i.i.d. setup first, and to describe a canonical limit experiment with the required basic quantities. In Section 2.2 we then briefly explain how the apparatus can be extended also to general regression models, where it also turns out that the limit experiment is of exactly the same type, only with somewhat more complex mechanisms lying behind the key ingredients. Technical details and explicit formulae for such general regression models, valid also beyond the realm of say generalised linear models, are provided in Section 11. The key results described in this section are behind the FIC plots and the FIC tables, such as Figure 1 and Table 1, and will also be used in later sections to derive confidence distributions for risks.

The I.I.D. Setup
Suppose we have independent and identically distributed observations, say $y_1, \ldots, y_n$. A collection of candidate models is examined, ranging from a well-defined narrow model, parametrised as $f_{\rm narr}(y, \theta)$ with $\theta = (\theta_1, \ldots, \theta_p)$ of dimension $p$, to a wide model, parametrised as $f(y, \theta, \gamma)$, with certain extra parameters $\gamma = (\gamma_1, \ldots, \gamma_q)$, signifying model extensions in different directions. The narrow model is assumed to be an inner point in the wider model, in the sense of $f_{\rm narr}(y, \theta)$ being equal to $f(y, \theta, \gamma_0)$ for an inner parameter point $\gamma_0$. There is consequently a total of $2^q$ candidate models, corresponding to setting $\gamma_j$ parameters equal to or not equal to their null values $\gamma_{0,j}$, for $j = 1, \ldots, q$. In the regression framework studied below this would typically correspond to taking covariates in and out of the wide model.
Other terms could be considered here, like 'full model' for the wide model and 'null model' for the narrow model, but we choose to stick to the 'wide' and 'narrow' labels as these have been used rather consistently in the FIC literature, from Claeskens and Hjort (2003) onwards. Furthermore, the alternative 'null model' term would risk being associated with a suggestion that the point of the setup is to test it, against various alternatives, but this is typically not the aim of the model selection and model averaging framework.
Assume now that a parameter $\mu$ is to be estimated, with a clear statistical interpretation across candidate models. It may in particular be expressed as $\mu = \mu(\theta, \gamma)$ in the wide model. We may then consider $2^q$ different candidate estimators, say $\hat\mu_S$ based on the submodel $S$, with $S$ a subset of $\{1, \ldots, q\}$, corresponding to the model having $\gamma_j$ as a parameter in the model when $j \in S$ but with $\gamma_j$ set to their null values $\gamma_{0,j}$ for $j \notin S$. Carrying out maximum likelihood (ML) estimation in model $S$ means maximising the log-likelihood function
$$\ell_{n,S}(\theta, \gamma_S) = \sum_{i=1}^n \log f(y_i, \theta, \gamma_S, \gamma_{0,S^c}),$$
with $\gamma_S$ notation for the collection of $\gamma_j$ with $j \in S$, and similarly for $\gamma_{0,S^c}$ with the complement set. With $(\hat\theta_S, \hat\gamma_S)$ the ML estimators for submodel $S$, this leads to a collection of candidate estimators
$$\hat\mu_S = \mu(\hat\theta_S, \hat\gamma_S, \gamma_{0,S^c}).$$
In particular we have $\hat\mu_{\rm narr} = \mu(\hat\theta_{\rm narr}, \gamma_0)$ and $\hat\mu_{\rm wide} = \mu(\hat\theta_{\rm wide}, \hat\gamma_{\rm wide})$, with ML estimation carried out in respectively the narrow $p$-dimensional and the wide $(p + q)$-dimensional models.
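To understand the behaviour of all these candidate estimators, and to develop theory and methods for comparing them, we now present a 'master theorem', from Hjort and Claeskens (2003a), Claeskens and Hjort (2008, chp. 5, 6). We work inside a system of local neighbourhoods, where the real data-generating mechanism underlying our observations is
$$f_{\rm true}(y) = f(y, \theta_0, \gamma_0 + \delta/\sqrt{n}), \qquad (2)$$
with some unknown $\delta = \sqrt{n}(\gamma - \gamma_0)$, seen as a local model extension parameter; in particular, the true focus parameter becomes $\mu_{\rm true} = \mu(\theta_0, \gamma_0 + \delta/\sqrt{n})$. Let $J$ be the $(p + q) \times (p + q)$ Fisher information matrix of the wide model, computed at the narrow model, with blocks
$$J = \begin{pmatrix} J_{00} & J_{01} \\ J_{10} & J_{11} \end{pmatrix}, \qquad J^{-1} = \begin{pmatrix} J^{00} & J^{01} \\ J^{10} & J^{11} \end{pmatrix}, \qquad (3)$$
and write $Q = J^{11}$ for the $q \times q$ lower right block of the inverse. We also need
$$\tau_0^2 = \Bigl(\frac{\partial\mu}{\partial\theta}\Bigr)^{\rm t} J_{00}^{-1}\, \frac{\partial\mu}{\partial\theta} \quad\text{and}\quad \omega = J_{10} J_{00}^{-1}\, \frac{\partial\mu}{\partial\theta} - \frac{\partial\mu}{\partial\gamma}, \qquad (4)$$
with partial derivatives evaluated at the narrow model, and with these quantities varying from focus parameter to focus parameter. Finally we need to introduce the $q \times q$ matrices
$$G_S = \pi_S^{\rm t} Q_S \pi_S Q^{-1}, \quad\text{with}\quad Q_S = (\pi_S Q^{-1} \pi_S^{\rm t})^{-1}.$$
Here, $\pi_S$ is the $|S| \times q$ projection matrix of zeroes and ones, such that $\pi_S u = u_S$, taking $u = (u_1, \ldots, u_q)^{\rm t}$ to its subset of those $u_j$ with $j \in S$. We have $G_{\rm narr} = 0$ and $G_{\rm wide} = I$, the $q \times q$ identity matrix, and note that ${\rm Tr}(G_S) = |S|$, the number of elements in $S$.
The master theorems driving much of the FIC and related theory are now as follows. First,
$$D_n = \sqrt{n}(\hat\gamma_{\rm wide} - \gamma_0) \to_d D \sim N_q(\delta, Q), \qquad (5)$$
and, secondly,
$$\sqrt{n}(\hat\mu_S - \mu_{\rm true}) \to_d \Lambda_S = \Lambda_0 + \omega^{\rm t}(\delta - G_S D). \qquad (6)$$
Here, $\Lambda_0 \sim N(0, \tau_0^2)$, for the $\tau_0$ given above, and $\Lambda_0$ and $D$ are independent. This implies that the limit in (6) is normal, and we can read off its bias $\omega^{\rm t}(I - G_S)\delta$ and variance $\tau_0^2 + \omega^{\rm t} G_S Q G_S^{\rm t} \omega$. The risk or mean squared error for this limit distribution is hence
$$\mathrm{mse}_S = \tau_0^2 + \omega^{\rm t} G_S Q G_S^{\rm t} \omega + \{\omega^{\rm t}(I - G_S)\delta\}^2 = \mathrm{var}_S + \mathrm{bsq}_S, \qquad (7)$$
say, in the usual fashion a sum of a variance part $\mathrm{var}_S$ and a squared bias part $\mathrm{bsq}_S$. With a sparse $S$, there are many zeros in $G_S$, leading to small variance but potentially a larger bias; with a bigger subset $S$, $G_S$ becomes closer to the identity matrix $I$, yielding bigger variance but a smaller bias. The essence of the Focused Information Criterion (FIC), developed in Claeskens and Hjort (2003, 2008) and later extended in various directions and to more general contexts and model classes, is to estimate each $\mathrm{mse}_S$ from the data.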
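This leads to a full ranking of all candidate models, from the best (smallest estimate of risk) to the worst (largest estimate of risk). Briefly, we start by putting up FIC formulae for the limit experiment, where all quantities $\tau_0, Q, G_S, \omega$ are known (thanks to consistent estimators for these, see below), but where $\delta$ is not, as we can only rely on the information $D \sim N_q(\delta, Q)$ from (5). Note that ${\rm E}\, DD^{\rm t} = \delta\delta^{\rm t} + Q$, so that using $(c^{\rm t} D)^2$ to estimate a squared linear combination parameter $(c^{\rm t}\delta)^2$ means overshooting with expected amount $c^{\rm t} Q c$. There are therefore two natural versions here, namely
$$\mathrm{FIC}^u_S = \tau_0^2 + \omega^{\rm t} G_S Q G_S^{\rm t} \omega + \{\omega^{\rm t}(I - G_S)D\}^2 - \omega^{\rm t}(I - G_S) Q (I - G_S)^{\rm t} \omega,$$
$$\mathrm{FIC}^t_S = \tau_0^2 + \omega^{\rm t} G_S Q G_S^{\rm t} \omega + \max\bigl[0, \{\omega^{\rm t}(I - G_S)D\}^2 - \omega^{\rm t}(I - G_S) Q (I - G_S)^{\rm t} \omega\bigr]. \qquad (8)$$
These correspond to the natural unbiased estimator and its truncated-to-zero version for the squared bias. That the first estimator for squared bias is negative means that the event
$$\{\omega^{\rm t}(I - G_S)D\}^2 < \omega^{\rm t}(I - G_S) Q (I - G_S)^{\rm t} \omega$$
is taking place, which happens quite frequently if $\delta$ is close to zero, in fact with probability up to $P\{\chi^2_1 \le 1\} = 0.683$, if $\delta = 0$, but is growing less likely when $\delta$ is moving away from zero. For actual data one plugs in consistent estimators $\hat\tau_0, \hat Q, \hat G_S, \hat\omega$ for the relevant quantities, to be given below, and $D_n$ of (5) for $D$. This leads to FIC scores
$$\mathrm{FIC}^u_S = \hat\tau_0^2 + \hat\omega^{\rm t} \hat G_S \hat Q \hat G_S^{\rm t} \hat\omega + \{\hat\omega^{\rm t}(I - \hat G_S)D_n\}^2 - \hat\omega^{\rm t}(I - \hat G_S) \hat Q (I - \hat G_S)^{\rm t} \hat\omega,$$
$$\mathrm{FIC}^t_S = \hat\tau_0^2 + \hat\omega^{\rm t} \hat G_S \hat Q \hat G_S^{\rm t} \hat\omega + \max\bigl[0, \{\hat\omega^{\rm t}(I - \hat G_S)D_n\}^2 - \hat\omega^{\rm t}(I - \hat G_S) \hat Q (I - \hat G_S)^{\rm t} \hat\omega\bigr]. \qquad (9)$$
Note from (6) that these are estimators of the limiting risk, where $\hat\mu_S - \mu_{\rm true}$ has been multiplied with $\sqrt{n}$. Most often it is therefore better, regarding reading of tables and interpretation of FIC plots, to transform the above scores to say
$$\text{root-FIC}_S = \mathrm{FIC}_S^{1/2} / \sqrt{n}. \qquad (10)$$
We consider the truncated version a good default choice, since it avoids having negative estimates of squared biases, and this choice has indeed been used for Mrs. Jones and her FIC plot in Figure 1 and FIC table in Table 1.
The consistent estimators in question are computed as follows. From ML analysis in the wide model, maximising $\ell_{n,{\rm wide}}(\theta, \gamma)$, we compute the normalised Hessian matrix $\hat J_{\rm wide} = -n^{-1}\, \partial^2 \ell_{n,{\rm wide}}(\hat\alpha_{\rm wide}) / \partial\alpha\, \partial\alpha^{\rm t}$ at the ML position, say $\hat\alpha_{\rm wide} = (\hat\theta_{\rm wide}, \hat\gamma_{\rm wide})$, of size $(p + q) \times (p + q)$.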
This is a consistent estimator for $J$ of (3) under the assumed sequence of data-generating mechanisms (2), under mild conditions; see Claeskens and Hjort (2008, chp. 6). Inverting this matrix and reading off its lower right block leads to $\hat Q = \hat J^{11}$, consistent for $Q$. Finally $\hat\omega$ and $\hat\tau_0$ are defined by plugging in relevant blocks of $\hat J_{\rm wide}$ in (4), along with partial derivatives of $\mu(\theta, \gamma)$, computed at the ML position $(\hat\theta_{\rm wide}, \hat\gamma_{\rm wide})$. There are in fact a few alternatives here, regarding estimation of $J$ and $\omega$, but these do not affect the basic asymptotics; see Claeskens and Hjort (2008, chp. 6, 7) for further discussion. For simplicity we have chosen not to overburden the notation here, with one name for FIC in the limit experiment, as in (8), and a different one for FIC with real data, as in (9); it is, in each case, clear from the context what is what.
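The limit-experiment calculations of this subsection can be sketched numerically as follows. This is only an illustration under stated assumptions: the explicit form $G_S = \pi_S^{\rm t} Q_S \pi_S Q^{-1}$ with $Q_S = (\pi_S Q^{-1} \pi_S^{\rm t})^{-1}$ is the one commonly used in the FIC literature and is assumed here, the function names are hypothetical, and the numerical inputs are made up rather than taken from any dataset in the paper.

```python
import itertools
import numpy as np

def g_matrix(S, Q):
    """G_S = pi_S^t Q_S pi_S Q^{-1}, with Q_S = (pi_S Q^{-1} pi_S^t)^{-1} (assumed form)."""
    q = Q.shape[0]
    if len(S) == 0:                      # narrow model: G_narr = 0
        return np.zeros((q, q))
    pi_S = np.eye(q)[list(S), :]         # |S| x q selection matrix of zeroes and ones
    Q_inv = np.linalg.inv(Q)
    Q_S = np.linalg.inv(pi_S @ Q_inv @ pi_S.T)
    return pi_S.T @ Q_S @ pi_S @ Q_inv

def fic_limit_experiment(S, tau0, omega, Q, D):
    """Return (FIC_u, FIC_t) for submodel S, from one observed D ~ N_q(delta, Q)."""
    G = g_matrix(S, Q)
    c = (np.eye(len(omega)) - G).T @ omega       # so that c^t D = omega^t (I - G_S) D
    var_S = tau0 ** 2 + omega @ G @ Q @ G.T @ omega
    bsq_unbiased = (c @ D) ** 2 - c @ Q @ c      # unbiased for the squared bias, may be negative
    return var_S + bsq_unbiased, var_S + max(0.0, bsq_unbiased)

# Hypothetical inputs, for illustration only.
tau0 = 0.5
omega = np.array([1.0, -0.5, 0.8])
Q = np.array([[2.0, 0.5, 0.0], [0.5, 1.0, 0.3], [0.0, 0.3, 1.5]])
D = np.array([0.4, -1.2, 2.1])

# All 2^3 = 8 submodels, indexed by the gamma components they include (0-based).
subsets = [S for k in range(4) for S in itertools.combinations(range(3), k)]
# Rank models by the truncated score, smallest estimated risk first.
ranking = sorted(subsets, key=lambda S: fic_limit_experiment(S, tau0, omega, Q, D)[1])
```

With real data one would replace `tau0`, `omega`, `Q` by their plug-in estimates and `D` by $D_n$, and divide the square root of each score by $\sqrt{n}$ to get root-FIC values.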

Extension to Regression Models
As demonstrated in Claeskens and Hjort (2003, 2008), the theory briefly reviewed above for the i.i.d. setup can with the required extra effort be lifted to the framework of regression models. Data are then of the form $(x_i, y_i)$, with $x_i$ a covariate vector and $y_i$ the response. The natural setup becomes that of a wide regression model with densities $f(y_i \mid x_i, \theta, \gamma)$, featuring a narrow model parameter $\theta$ of size $p$ and an extra $\gamma$ parameter of size $q$, and where a null value $\gamma = \gamma_0$ yields the narrow model. Again using $\gamma = \gamma_0 + \delta/\sqrt{n}$ as the natural framework of local asymptotics, there are under mild Lindeberg conditions clear limiting normality results for all submodel based estimators, etc., parallelling those of (5)-(7), though involving somewhat more complex notation than for the i.i.d. case when it comes to key quantities $Q, \omega, G_S$. Technical details and formulae are provided in Section 11.
It is however simplest to develop our extended CD-FIC theory for the i.i.d. case, which we make our task below. For each method and result reached below there is a natural extension to the case of regression models. This is illustrated in Section 8 for a class of Poisson regression models applied to a study of bird species abundance. Furthermore, our introductory illustration, involving low birthweights, is an application of the general methodology to logistic regression models.

Confidence Distributions for FIC Scores
The FIC scores of (8) are estimators of the $\mathrm{mse}_S$ quantities (7), defined in the limit experiment where $D \sim N_q(\delta, Q)$ and the other key quantities are known. Similarly, the root-FIC scores of (10) are estimating the genuine $\mathrm{rmse}_S$, the root-mse for the estimators $\hat\mu_S$. However, the FIC scores carry their own uncertainty, which we address in this section through constructing confidence distributions for the estimated quantities.
As in Section 2 we start working out matters in the clear limit experiment, and then insert consistent estimators when engaged with real data. A brief prelude to explain what will take place is as follows: Suppose a single $X$ is observed from a $N(\eta, 1)$, and that inference is needed for the parameter $\phi = \eta^2$. Since $X^2$ is a noncentral chi-squared, with 1 degree of freedom and noncentrality parameter $\eta^2$, which we write as $X^2 \sim \chi^2_1(\eta^2)$, we can build the function
$$C(\phi, x_{\rm obs}) = P_\eta\{X^2 \ge x_{\rm obs}^2\} = 1 - \Gamma_1(x_{\rm obs}^2, \phi),$$
with $\Gamma_1(\cdot, \phi)$ the cumulative distribution function for the $\chi^2_1(\phi)$. Here, $x_{\rm obs}$ is the observed value of the random $X$. The $C(\phi, x_{\rm obs})$ is a cumulative distribution function in $\phi$, for the observed $x_{\rm obs}$, with the property that for each $\eta$, when $X$ comes from the data model $N(\eta, 1)$, then $C(\phi, X)$ has the uniform distribution:
$$P_\eta\{C(\phi, X) \le \alpha\} = \alpha \quad \text{for each } \alpha.$$
In other words, $C(\phi, x)$ defines an exact confidence distribution (CD), see Hjort and Schweder (2018); Schweder and Hjort (2016), and confidence intervals can be read off from $\{\phi : C(\phi, x_{\rm obs}) \le \alpha\}$. Note that this CD has a pointmass at zero, $C(0, x_{\rm obs}) = 1 - \Gamma_1(x_{\rm obs}^2)$, involving the standard chi-squared cumulative $\Gamma_1(\cdot) = \Gamma_1(\cdot, 0)$. Thus confidence intervals for $\phi = \eta^2$ could very well start at zero. This CD is the optimal one, in this situation, cf. Schweder and Hjort (2016, chp. 6).
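The prelude can be checked numerically with a short sketch. It relies on the fact that, with one degree of freedom, the noncentral chi-squared distribution function has a closed form in terms of the standard normal, since $(Z + \sqrt{\phi})^2 \le x$ exactly when $-\sqrt{x} - \sqrt{\phi} \le Z \le \sqrt{x} - \sqrt{\phi}$. The function names below are hypothetical.

```python
import math
from statistics import NormalDist

_PHI = NormalDist().cdf  # standard normal cdf

def gamma1(x, phi=0.0):
    """Gamma_1(x, phi): cdf at x of the chi-squared with 1 df and noncentrality phi."""
    if x <= 0.0:
        return 0.0
    r, e = math.sqrt(x), math.sqrt(phi)
    return _PHI(r - e) - _PHI(-r - e)

def cd_eta_squared(phi, x_obs):
    """C(phi, x_obs) = 1 - Gamma_1(x_obs^2, phi), a confidence distribution for phi = eta^2."""
    if phi < 0.0:
        return 0.0
    return 1.0 - gamma1(x_obs ** 2, phi)

# Pointmass at zero when x_obs = 1: 1 - Gamma_1(1), roughly 0.3173.
pointmass = cd_eta_squared(0.0, 1.0)
```

Evaluating `cd_eta_squared` on a grid of `phi` values traces out the CD curve, which starts at the pointmass and increases towards 1.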
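Going back to the $\mathrm{mse}_S$ of (7), write
$$\mathrm{mse}_S = \tau_S^2 + \{\omega^{\rm t}(I - G_S)\delta\}^2, \quad\text{with}\quad \tau_S^2 = \tau_0^2 + \omega^{\rm t} G_S Q G_S^{\rm t} \omega \quad\text{and}\quad \sigma_S^2 = \omega^{\rm t}(I - G_S) Q (I - G_S)^{\rm t} \omega.$$
Here, $\tau_S^2$ is the limiting variance of $\sqrt{n}\,\hat\mu_S$. It is smaller with fewer elements in $S$, and becomes larger with more elements. Furthermore, $\sigma_S^2$ is the variance of $\omega^{\rm t}(I - G_S)D$, i.e., of the estimate of the bias $\omega^{\rm t}(I - G_S)\delta$. Write for clarity $X_S = \omega^{\rm t}(I - G_S)D/\sigma_S$, which has a $N(\eta_S, 1)$ distribution, with $\eta_S = \omega^{\rm t}(I - G_S)\delta/\sigma_S$, so that $\mathrm{mse}_S = \tau_S^2 + \sigma_S^2 \eta_S^2$. Since quantities $\tau_0, \omega, Q, G_S$ are known, in the limit experiment, the arguments above lead to the CD
$$C_S(\mathrm{mse}_S) = 1 - \Gamma_1\Bigl(\frac{\{\omega^{\rm t}(I - G_S)D\}^2}{\sigma_S^2}, \frac{\mathrm{mse}_S - \tau_S^2}{\sigma_S^2}\Bigr) \quad \text{for } \mathrm{mse}_S \ge \tau_S^2. \qquad (11)$$
It starts at position $\tau_S^2$, the minimal possible value for $\mathrm{mse}_S$, with pointmass there of size $C_S(\tau_S^2) = 1 - \Gamma_1(\{\omega^{\rm t}(I - G_S)D\}^2/\sigma_S^2)$. The narrow model, with $S = \emptyset$ and $G_{\rm narr} = 0$, has the smallest $\tau_S$, namely $\tau_0$, but also the largest $\sigma_S$, with $\sigma_{\rm narr}^2 = \omega^{\rm t} Q \omega$. On the other side of the spectrum of candidate models, the widest model has $G_{\rm wide} = I$, the $\mathrm{mse}_{\rm wide}$ is the constant $\tau_0^2 + \omega^{\rm t} Q \omega$ with no additional uncertainty, in this framework of the limit experiment, and the $C_{\rm wide}(\mathrm{mse}_{\rm wide})$ is simply a full pointmass 1 at that position.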
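For a real dataset, we estimate the required quantities consistently, as per Section 2, and with $D_n = \sqrt{n}(\hat\gamma_{\rm wide} - \gamma_0)$ of (5) for $D$. Translating and transforming also to the real root-mse scale of $\rho_S = \mathrm{rmse}_S/\sqrt{n}$, we reach the CD
$$C^*_S(\rho_S) = 1 - \Gamma_1\Bigl(\frac{\{\hat\omega^{\rm t}(I - \hat G_S)D_n\}^2}{\hat\sigma_S^2}, \frac{n\rho_S^2 - \hat\tau_S^2}{\hat\sigma_S^2}\Bigr) \quad \text{for } \rho_S \ge \hat\tau_S/\sqrt{n}. \qquad (12)$$
Here, $\hat\tau_S^2$ and $\hat\sigma_S^2$ are the plug-in estimates of $\tau_S^2$ and $\sigma_S^2$. This CD is large-sample correct, in the sense that for any given position in the parameter space, the distribution of $C^*_S(\rho_S)$ converges to that of the uniform as sample size increases. Thus $\{\rho_S : C^*_S(\rho_S) \le \alpha\}$ defines a confidence interval for $\rho_S$, with coverage converging to $\alpha$.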
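A minimal sketch of how such a CD on the root-mse scale can be evaluated and inverted is given below. It assumes the form $C^*_S(\rho) = 1 - \Gamma_1(\hat b_S^2/\hat\sigma_S^2, (n\rho^2 - \hat\tau_S^2)/\hat\sigma_S^2)$ for $\rho \ge \hat\tau_S/\sqrt{n}$, with $\hat b_S = \hat\omega^{\rm t}(I - \hat G_S)D_n$ the estimated bias; the function names and the numbers in the usage lines are hypothetical, and $\hat\sigma_S > 0$ is assumed (so the wide model, whose CD is a single pointmass, is excluded).

```python
import math
from statistics import NormalDist

_PHI = NormalDist().cdf

def gamma1(x, phi=0.0):
    """Cdf at x of the chi-squared with 1 df and noncentrality phi (closed form)."""
    if x <= 0.0:
        return 0.0
    r, e = math.sqrt(x), math.sqrt(phi)
    return _PHI(r - e) - _PHI(-r - e)

def cd_root_mse(rho, n, tau_S, sigma_S, bias_hat):
    """Approximate CD for rho_S = rmse_S / sqrt(n); zero below the minimal value tau_S / sqrt(n)."""
    if rho < tau_S / math.sqrt(n):
        return 0.0
    phi = max(0.0, n * rho ** 2 - tau_S ** 2) / sigma_S ** 2   # guard against rounding below zero
    return 1.0 - gamma1((bias_hat / sigma_S) ** 2, phi)

def cd_quantile(alpha, n, tau_S, sigma_S, bias_hat):
    """Smallest rho with CD at least alpha, found by bisection."""
    lo = tau_S / math.sqrt(n)
    if cd_root_mse(lo, n, tau_S, sigma_S, bias_hat) >= alpha:
        return lo                        # the pointmass at the minimum already reaches alpha
    hi = 2.0 * lo + 1.0
    while cd_root_mse(hi, n, tau_S, sigma_S, bias_hat) < alpha:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if cd_root_mse(mid, n, tau_S, sigma_S, bias_hat) < alpha:
            lo = mid
        else:
            hi = mid
    return hi

# Hypothetical plug-in values: n = 100, tau_S = 1, sigma_S = 2, estimated bias 3.
upper_80 = cd_quantile(0.80, 100, 1.0, 2.0, 3.0)
```

The interval from $\hat\tau_S/\sqrt{n}$ up to `upper_80` then carries approximate confidence 0.80 for $\rho_S$.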
In Figure 2 confidence distributions are displayed for the eight true root-mse values pertaining to the eight submodels in the Mrs. Jones example of our introduction. Clearly, several of the CDs have pointmasses well above zero. Furthermore, displayed in the figure are three root-FIC scores of different type: the already mentioned $\mathrm{FIC}^u$ and $\mathrm{FIC}^t$, along with the median-FIC which we come to in the next section. The unbiased estimator $\mathrm{FIC}^u$ can for some models be considerably smaller than $\mathrm{FIC}^t$; indeed it has the value zero for the narrow model 000. The models with smaller $\mathrm{FIC}^u$ than $\mathrm{FIC}^t$ have negative squared bias estimates, i.e., the ratio $\{\hat\omega^{\rm t}(I - \hat G_S)D_n\}^2/\hat\sigma_S^2$ inside $\Gamma_1(\cdot)$ will be smaller than 1, which leads to the corresponding CDs starting with a pointmass higher than $0.3173 = 1 - \Gamma_1(1)$.
In our first exposition of the case of Mrs. Jones, Figure 1 gave eight point estimates for the probability of her child-to-come having low birthweight, along with root-FIC scores. From the CDs in Figure 2 we can construct an updated and statistically more informative FIC plot, namely Figure 3, which provides accurate supplementary information regarding how precise these root-FIC scores are. The figure provides confidence intervals for both the root-FIC scores and the focus estimates. In particular, we see that the FIC score for the winning model 000 appears to be very precise, and we may then select this model without many misgivings. The scores of the next best models 010 and 001 appear to be more uncertain, and their intervals indicate that their underlying true rmse values are potentially much larger than what their root-FIC scores indicate.
Figure 3. FIC plot with associated uncertainty for the $2^3 = 8$ models for estimating the probability of having a child with birthweight below 2.5 kg, for Mrs. Jones (white, age 25, 60 kg, smoker). The uncertainty is represented by 80% confidence intervals. The intervals for the root-FIC score are read off from the confidence distributions in Figure 2. The intervals for the focus parameter are based on the ordinary normal approximation with estimated variances taken from the variance part of the FIC calculations (see, e.g., Table 1). Note that the points here are the truncated FIC scores, i.e., with $\mathrm{FIC}^t$ rather than $\mathrm{FIC}^u$ of Formula (9).

Median-FIC and Quantile-FIC
As briefly pointed to in Section 2, there are often two valid variations on the basic FIC, when it comes to estimating the precise $\mathrm{rmse}_S$ quantities, as in (8) and (9). The first uses the unbiased risk estimator, involving the possibility of having negative estimates for squared biases, whereas these are truncated up to zero for the second version.
Since the most natural way of assessing uncertainty of these risk estimators is via CDs, as in Section 3, with confidence pointmasses at the smallest values, etc., a third version suggests itself, namely the median confidence estimators. Generally, these have unbiasedness properties on the median scale, as opposed to on the expectation scale, and are discussed in Schweder and Hjort (2016, chp. 3, 4). Thus consider the median-FIC,
$$\mathrm{FIC}^{0.50}_S = C_S^{-1}(\tfrac12) = \min\{\mathrm{mse}_S : C_S(\mathrm{mse}_S) \ge \tfrac12\}, \qquad (13)$$
defined for the limit experiment, via (11), to be viewed as an alternative to $\mathrm{FIC}^u_S$ and $\mathrm{FIC}^t_S$ of (8). For actual data, having estimated the required background quantities and also transformed to the scale of $\rho_S = \mathrm{rmse}_S/\sqrt{n}$, we use the CD $C^*_S(\rho_S)$ of (12), and infer the median-FIC score
$$\mathrm{FIC}^{0.50}_S = (C^*_S)^{-1}(\tfrac12).$$
See Figure 2 where we display the 0.50 confidence line and read off the corresponding medians.
Considering the limit experiment case (13) first, we know that the CD $C_S(\mathrm{mse}_S)$ starts out at the minimal point $\tau_S^2$ with the pointmass $1 - \Gamma_1(r_S^2)$, where $r_S = |\omega^{\rm t}(I - G_S)D/\sigma_S|$. If this is already at least $\tfrac12$, which inspection shows is equivalent to $r_S \le 0.6745$, then the median-FIC is equal to $\tau_S^2$. If that ratio is above 0.6745, however, then the median-FIC is the numerical solution to $C_S(\mathrm{mse}_S) = \tfrac12$. Similarly, if $\hat r_S = |\hat\omega^{\rm t}(I - \hat G_S)D_n/\hat\sigma_S| \le 0.6745$ for a given dataset, then the median-FIC for $C^*_S(\rho_S)$ is equal to the minimum value $\hat\tau_S/\sqrt{n}$, and otherwise one solves $C^*_S(\rho_S) = \tfrac12$ numerically with a solution to the right of $\hat\tau_S/\sqrt{n}$.
Going back to the limit experiment framework again, with $r_S = |\omega^{\rm t}(I - G_S)D/\sigma_S|$ the relative size of the estimated bias versus its uncertainty, we have the following relations between the three different FIC scores. (i) If $r_S \le 0.6745$, then $\mathrm{FIC}^{0.50}_S = \mathrm{FIC}^t_S = \tau_S^2 > \mathrm{FIC}^u_S$. (ii) If $0.6745 < r_S \le 1$, then $\mathrm{FIC}^{0.50}_S > \mathrm{FIC}^t_S = \tau_S^2 \ge \mathrm{FIC}^u_S$. (iii) If $r_S > 1$, then $\mathrm{FIC}^{0.50}_S > \mathrm{FIC}^t_S = \mathrm{FIC}^u_S$. Since the three types of FIC scores are identical for the wide model, the three strategies can be understood as having increasing preference for selecting the wide model. The unbiased-FIC generally gives smaller FIC scores to all models except the wide model, so it will therefore have a smaller probability of selecting the wide. The median-FIC, on the other hand, typically gives larger FIC scores to the competing models, and is then more likely to select the wide model. The truncated-FIC lies somewhere between these two approaches. We will compare the three strategies in more detail in Section 6, where each strategy is studied also in terms of the risk of the estimator which the FIC score selects.
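The ordering of the three scores can be explored with a small sketch in the limit experiment. It takes $\tau_S^2$, $\sigma_S^2$ and the bias estimate $\omega^{\rm t}(I - G_S)D$ as given inputs, and assumes the CD has the form $1 - \Gamma_1(r_S^2, (\mathrm{mse}_S - \tau_S^2)/\sigma_S^2)$; the function name and the test values are hypothetical.

```python
import math
from statistics import NormalDist

_PHI = NormalDist().cdf

def gamma1(x, phi=0.0):
    """Cdf at x of the chi-squared with 1 df and noncentrality phi (closed form)."""
    if x <= 0.0:
        return 0.0
    r, e = math.sqrt(x), math.sqrt(phi)
    return _PHI(r - e) - _PHI(-r - e)

def three_fic_scores(tau2_S, sigma2_S, bias_hat):
    """Return (FIC_u, FIC_t, median-FIC) on the mse scale of the limit experiment."""
    r2 = bias_hat ** 2 / sigma2_S
    fic_u = tau2_S + sigma2_S * (r2 - 1.0)
    fic_t = tau2_S + sigma2_S * max(0.0, r2 - 1.0)
    if 1.0 - gamma1(r2) >= 0.5:          # pointmass at tau_S^2 already covers the median
        return fic_u, fic_t, tau2_S
    lo, hi = 0.0, 1.0                    # solve Gamma_1(r2, phi) = 1/2 for the noncentrality phi
    while gamma1(r2, hi) > 0.5:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gamma1(r2, mid) > 0.5:        # Gamma_1(r2, phi) decreases in phi
            lo = mid
        else:
            hi = mid
    return fic_u, fic_t, tau2_S + sigma2_S * hi

# Three regimes for r_S = |bias_hat| / sigma_S, with tau_S^2 = sigma_S^2 = 1.
small = three_fic_scores(1.0, 1.0, 0.5)    # r_S <= 0.6745
middle = three_fic_scores(1.0, 1.0, 0.9)   # 0.6745 < r_S <= 1
large = three_fic_scores(1.0, 1.0, 2.0)    # r_S > 1
```

In each regime the median-FIC is at least as large as the truncated score, which in turn is at least as large as the unbiased score.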
In addition to the median confidence estimator associated with the CDs it is also valuable to consider the more general quantile-FIC, which is
$$\mathrm{FIC}^q_S = C_S^{-1}(q) = \min\{\mathrm{mse}_S : C_S(\mathrm{mse}_S) \ge q\}$$
for any given $q \in (0, 1)$. We learn in Section 6 that quantile values smaller than 0.50 may be beneficial for estimating the squared bias parts when these are small to moderate. Similarly to our brief comments about the median-FIC score above, we may work out some of the relations between the previously existing FIC scores and the quantile-FIC score. We may for example study the specific choice of $q = 0.25$. This score, denoted by $\mathrm{FIC}^{0.25}_S$, will be equal to $\tau_S^2$ when $r_S \le 1.1503$. For larger $r_S$ values one needs to find the numerical solution of $C_S(\mathrm{mse}_S) = 0.25$. Naturally, $\mathrm{FIC}^{0.50}_S \ge \mathrm{FIC}^{0.25}_S$, and for $1 < r_S \le 1.1503$ we have $\mathrm{FIC}^{0.25}_S = \tau_S^2 < \mathrm{FIC}^t_S = \mathrm{FIC}^u_S$. The lower-quartile-FIC will thus often be smaller than the previously existing FIC scores, as opposed to the median-FIC which will always be larger or equal, as we saw above. Since all the FIC scores are identical for the wide model, this entails that $\mathrm{FIC}^{0.25}_S$ will exhibit a preference for selecting smaller models. We will come back to these insights in the discussion section.
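The cut-off values 0.6745 and 1.1503 follow from one formula: the pointmass $1 - \Gamma_1(r_S^2) = 1 - P\{|Z| \le r_S\}$ is at least $q$ exactly when $r_S \le \Phi^{-1}(1 - q/2)$, with $\Phi$ the standard normal distribution function. A brief sketch, with a hypothetical function name:

```python
from statistics import NormalDist

def pointmass_threshold(q):
    """Largest r_S for which the quantile-FIC at level q equals the minimal value tau_S^2."""
    return NormalDist().inv_cdf(1.0 - q / 2.0)

median_cutoff = pointmass_threshold(0.50)    # about 0.6745
quartile_cutoff = pointmass_threshold(0.25)  # about 1.1503
```

Lower quantile levels give larger cut-offs, so the score stays at its minimum $\tau_S^2$ for a wider range of estimated biases, in line with the preference for smaller models noted above.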

Model Averaging
Our FIC investigations above also invite new and focused model averaging schemes, where the weights attached to the different candidate models are allowed to depend on the specific focus parameter under consideration. Consider model averaging estimators of the general form
$$\hat\mu^* = \sum_S w_n(S \mid D_n)\, \hat\mu_S,$$
with weights depending on $D_n = \sqrt{n}(\hat\gamma_{\rm wide} - \gamma_0)$ of (5), assumed to sum to 1, and with limits $w_n(S \mid D_n) \to_d w(S \mid D)$. Starting from (6), and utilising the joint limit distribution for the $2^q + 1$ variables involved, a master theorem is reached in Claeskens and Hjort (2008):
$$\sqrt{n}(\hat\mu^* - \mu_{\rm true}) \to_d \Lambda_0 + \omega^{\rm t}\{\delta - \hat\delta(D)\}, \quad\text{with}\quad \hat\delta(D) = \sum_S w(S \mid D)\, G_S D. \qquad (17)$$
This result is generalised to yet larger classes of model averaging strategies, including bagging procedures, in Hjort (2014). In the present context, a natural averaging estimator is as above, with weights of the form
$$w_n(S \mid D_n) = \frac{\exp(-\lambda\, \mathrm{FIC}^{0.50}_S)}{\sum_{S'} \exp(-\lambda\, \mathrm{FIC}^{0.50}_{S'})}, \qquad (18)$$
for a tuning parameter $\lambda$. The master theorem applies, which means we can read off the accurate limit distribution for the median-FIC based model averaging scheme in question. We may also use different tuning parameters for different models, i.e., with weights proportional to $\exp(-\lambda_S \mathrm{FIC}^{0.50}_S)$, with appropriately selected $\lambda_S$. A general avenue is to use the CDs for each model in order to set such model-specific $\lambda_S$ values. One possibility is to evaluate all the CDs at the estimated rmse value of the widest model and then let
$$\lambda_S = 1/C^*_S(\widehat{\mathrm{rmse}}_{\rm wide}); \qquad (19)$$
see (12). For the wide model the $C^*_{\rm wide}(\cdot)$ is a unit point mass at the position $\widehat{\mathrm{rmse}}_{\rm wide}$, and we take $\lambda_{\rm wide} = 1$, but for the other models the $\lambda_S$ will have values above 1; see Figure 2. The intuition is that dividing the FIC score with the confidence, evaluated at this specific point, will give higher weights to models where the FIC scores are more certain. This is the method we have employed for Figure 4, for the model averaging scheme there denoted 'CD-FIC weights'.
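The weighting schemes described above can be sketched in a few lines. The sketch assumes weights proportional to $\exp(-\lambda_S \mathrm{FIC}^{0.50}_S)$, normalised to sum to 1, and model-specific $\lambda_S$ set to one over the confidence $C^*_S$ evaluated at the estimated rmse of the wide model; the function names and all numbers are hypothetical.

```python
import math

def fic_weights(fic_scores, lambdas=1.0):
    """Weights proportional to exp(-lambda_S * FIC_S); lambdas is a number or a dict per model."""
    models = list(fic_scores)
    if isinstance(lambdas, (int, float)):
        lambdas = {S: float(lambdas) for S in models}
    exponents = {S: -lambdas[S] * fic_scores[S] for S in models}
    shift = max(exponents.values())      # subtract the maximum for numerical stability
    raw = {S: math.exp(exponents[S] - shift) for S in models}
    total = sum(raw.values())
    return {S: raw[S] / total for S in models}

def cd_lambdas(confidence_at_wide_rmse):
    """lambda_S = 1 / C*_S(estimated rmse of the wide model); equals 1 for the wide model."""
    return {S: 1.0 / c for S, c in confidence_at_wide_rmse.items()}

# Hypothetical median-FIC scores and CD values for three candidate models.
scores = {"000": 1.2, "010": 1.5, "111": 3.0}
common = fic_weights(scores, 1.0)
specific = fic_weights(scores, cd_lambdas({"000": 0.5, "010": 0.8, "111": 1.0}))
```

The model averaging estimate is then the weighted sum of the submodel estimates, using either `common` or `specific` as weights.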
There are clearly several other model averaging schemes that may be considered based on the CDs for the FIC scores. For example, one may wish to use only models which have a high probability of having a rmse lower than a certain threshold, and then use a similar weighting scheme as above among the models with scores falling below this threshold. Again our master theorem (17) applies.
In Figure 4 we present a brief illustration of different model selection and averaging schemes. The figure displays the limiting distribution densities of $\sqrt{n}(\hat\mu^* - \mu_{\rm true})$, from (17), for five different strategies. The densities are produced not by simulating from some given model with a high sample size, but from the exact limit distributions, by drawing from $\Lambda_0$ and $D$. A sharper density around zero indicates that the strategy produces a more precise estimator than the others. The sharpness around zero may be assessed by computing the limiting mse of each $\sqrt{n}(\hat\mu^* - \mu_{\rm true})$, by simply averaging the squared draws from the limiting distributions. For this illustration we have used $q = 3$, with $2^q = 8$ submodels, $Q$ equal to the identity matrix, $\tau_0$ equal to 0.1357, and the $\delta$ set to $(0.3, -0.1, 1.5)^{\rm t}$. The red line represents the scheme where one always chooses the widest model. In that case the focus estimator is unbiased and its distribution is a perfect normal (as we see). The two blue lines are model selection strategies, where a single model is chosen, either using the classic AIC (light blue), or using our new median-FIC score (dark blue). We see that both strategies induce some bias in the final estimator, and that the distribution of $\hat\mu^*$ is a complicated nonlinear mixture of normals. The two green lines are model averaging strategies. The light green one is the scheme with weights as in (18), with $\lambda = 1$. The dark one is a strategy making use of the confidence distributions for the FIC score, with $\lambda_S$ as in (19).
For this particular position in the δ parameter space the two model averaging strategies produce the most precise estimators, obtaining limiting root-mse values of about 1.26 and 1.58 for the average of median-FIC and average with CD-FIC weights. The limiting root-mse values for the method selecting the best estimator according to the best median-FIC score or best AIC scores are respectively 1.60 and 1.67. The strategy of always selecting the wide model has a limiting rmse of 1.74, and is thus the least precise strategy among the five for this position in the parameter space.

Performance Aspects for the Different Versions of FIC
Our FIC procedures use estimates of root mean squared errors to compare and rank candidate models, and as we have demonstrated also lead to informative FIC plots and CD-FIC plots. There are several issues and aspects regarding performance, including these: (a) How good is the root-FIC score, as an estimator of the rmse? (b) How well-working is the implied FIC scheme for finding the underlying best model, e.g., as a function of increasing sample size? (c) How precise is the final estimator, which would be the after-selection estimator µ final of (16) or more generally the model average estimator µ * of (17)? (d) How well-working are the (approximate) CDs regarding coverage properties; do confidence intervals of the type {rmse S : C * S (rmse S ) ≤ 0.80} contain the true rmse S 80% of the time?
We note that themes (b) and (c) are quite related, even though different specialised questions might be posed and worked with to address particularities. Furthermore, in various contexts, theme (c) is what matters.
Methods to be compared are the unbiased $\mathrm{FIC}_u$, the truncated $\mathrm{FIC}_t$, the median-FIC $\mathrm{FIC}^{0.50}$, and also its more general variant the $\mathrm{FIC}^q$ for other useful quantiles $q$. Themes (a), (b), (c), (d) can of course be studied for finite sample sizes, in different setups and with many variations, and indeed these questions are addressed in Section 7 below. It is again illuminating to work inside the limit experiment setup of Sections 2 and 3, however, where complexities are stripped down to the basics. This involves certain known basis parameters and the crucial relative distance parameter $\delta = \sqrt{n}(\gamma - \gamma_0)$, estimated via a single $D \sim {\rm N}_q(\delta, Q)$. Below we report on relatively brief investigations into themes (a), (c), and return also to (b), (d) in the next section.

FIC for Estimating MSE
The limiting mse expressions are of the form $\tau_S^2 + (a_S\delta)^2$, say, as per (7), with $\tau_S$ and $a_S$ known quantities. The different FIC schemes differ with respect to how the squared bias term is estimated. In the reduced prototype form worked with at the start of Section 3, the comparison boils down to investigating four methods for estimating $\phi = \eta^2$ in the setup with a single $X \sim {\rm N}(\eta, 1)$. The unbiased and truncated FIC are associated with the estimation schemes $\hat\phi_u = X^2 - 1$ and $\hat\phi_t = \max(X^2 - 1, 0)$, and both of these uniformly beat the simpler $X^2$ maximum likelihood estimator (which is hence inadmissible, in the decision theoretic sense). The median-FIC corresponds to setting $\hat\phi_m$ equal to the median of the confidence distribution $C(\phi, x) = 1 - \Gamma_1(x^2, \phi)$. Risk functions ${\rm risk}(\phi) = {\rm E}_\phi(\hat\phi - \phi)^2$ can now be numerically computed and compared, for the different estimators, yielding say ${\rm risk}_u(\phi)$, ${\rm risk}_t(\phi)$, ${\rm risk}_m(\phi)$, ${\rm risk}_q(\phi)$; the first is incidentally equal to $2 + 4\phi$. Figure 5 displays four root-risk functions, i.e., ${\rm risk}(\phi)^{1/2}$. We learn that the two 'usual' FIC based methods, the unbiased and truncated, are rather similar, though the truncated version is uniformly better for this particular task. The quartile-FIC is significantly better for a relatively large window of squared bias values, whereas the median-FIC is better when such values are large.
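The risk comparison in this prototype setup can be reproduced numerically. The sketch below is an illustration under stated assumptions rather than the code behind Figure 5: it uses the fact that a noncentral chi-squared variable with one degree of freedom and noncentrality φ is the square of a N(√φ, 1) variable, finds the CD median by bisection, and integrates the squared error against the normal density with a trapezoid rule. All function names, the grid step, and the integration range are choices made here.

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def ncx2_cdf_1df(y, phi):
    """Gamma_1(y, phi): cdf at y >= 0 of a noncentral chi-squared, 1 df, noncentrality phi."""
    r, s = math.sqrt(y), math.sqrt(phi)
    return norm_cdf(r - s) - norm_cdf(-r - s)

def phi_unbiased(x):
    return x * x - 1.0

def phi_truncated(x):
    return max(x * x - 1.0, 0.0)

def phi_median(x):
    """Median of C(phi, x) = 1 - Gamma_1(x^2, phi); zero if C(0, x) is already >= 0.5."""
    if 1.0 - ncx2_cdf_1df(x * x, 0.0) >= 0.5:
        return 0.0
    lo, hi = 0.0, abs(x) + 10.0  # bisection on s = sqrt(phi); C increases in s
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if 1.0 - ncx2_cdf_1df(x * x, mid * mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return (0.5 * (lo + hi)) ** 2

def risk(estimator, phi, step=0.02, half_width=8.0):
    """E_phi (estimator(X) - phi)^2 for X ~ N(sqrt(phi), 1), by the trapezoid rule."""
    eta = math.sqrt(phi)
    n = int(round(2.0 * half_width / step))
    total = 0.0
    for i in range(n + 1):
        x = eta - half_width + i * step
        weight = 0.5 if i in (0, n) else 1.0
        total += weight * (estimator(x) - phi) ** 2 * norm_pdf(x - eta)
    return total * step

# Root-risk values for a few squared-bias levels (unbiased, truncated, median).
for phi in (0.0, 1.0, 4.0):
    roots = [math.sqrt(risk(est, phi)) for est in (phi_unbiased, phi_truncated, phi_median)]
    print(phi, [round(r, 3) for r in roots])
```

The numerical risk of the unbiased scheme can be checked against the closed form 2 + 4φ quoted in the text, which gives a simple sanity check on the integration.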

Narrow vs. Wide
We now consider a relatively simple setup, where we only wish to choose between two models, the narrow (with p parameters) and the wide (with p + q parameters). The limiting mean squared errors are mse narr = τ 2 0 + (ω t δ) 2 and mse wide = τ 2 0 + ω t Qω, from which it also follows that the narrow model is better than the wide in the infinite band |ω t δ| ≤ (ω t Qω) 1/2 . The FIC in effect attempts to use data to see whether δ is inside this band or not. We have Thus the unbiased FIC u says that the narrow is best if and only if |ω t D| ≤ √ 2(ω t Qω) 1/2 , and a bit of analysis reveals that the truncated FIC t in this case is in full agreement. In the limit experiment of this two-models setup, ψ = ω t δ has the estimators ψ narr = 0 and ψ wide = ω t D, and the final estimator used is Here, This FIC strategy is then to be contrasted with that of the median-FIC. The question is when This means finding when the function 1 − Γ 1 (t(D) 2 , 1) crosses 0.50, and a simple investigation shows that FIC 0.50 prefers the narrow to the wide model if and only if |t(D)| ≤ 1.0505. The limiting risk functions for the three FIC methods for reaching a final estimator µ final are therefore of the form using (20), with cut-off value t 0 = √ 2 for FIC u and FIC t , and with t 0 = 1.0505 for FIC 0.50 . More generally, the quantile-FIC method of (15) can be seen to have such a cut-off value t 0 = Γ −1 1 (1 − q, 1) 1/2 , which is, e.g., t 0 = 1.6859 for q = 0.25. The conservative strategy, choosing the wide model regardless of the observed D, corresponds to cut-off value t 0 = 0.
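The cut-off values quoted above follow from the quantile function of the noncentral chi-squared distribution with one degree of freedom and noncentrality 1. A minimal sketch, assuming only the representation of that distribution as the square of a N(1, 1) variable, and using bisection rather than a library quantile routine (the function names are hypothetical):

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ncx2_cdf_1df(y, nc):
    """Gamma_1(y, nc): cdf of a noncentral chi-squared with 1 df and noncentrality nc."""
    r, s = math.sqrt(y), math.sqrt(nc)
    return norm_cdf(r - s) - norm_cdf(-r - s)

def cutoff(q, nc=1.0):
    """t0 = Gamma_1^{-1}(1 - q, nc)^{1/2}, found by bisection on t0."""
    target = 1.0 - q
    lo, hi = 0.0, 20.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if ncx2_cdf_1df(mid * mid, nc) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

t0_median = cutoff(0.50)    # median-FIC
t0_quartile = cutoff(0.25)  # quantile-FIC with q = 0.25
print(round(t0_median, 4), round(t0_quartile, 4), round(math.sqrt(2.0), 4))
```

The three cut-offs order as median-FIC, usual FIC, quantile-FIC with q = 0.25, so the median-FIC leaves the narrow model most readily and the lower-quartile version holds on to it the longest.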
Let us also briefly point to the classic AIC method, in this setup. As shown in Claeskens and Hjort (2008, chp. 5, 6), in the limit AIC prefers the narrow over the wide model if and only if $D^{\rm t}Q^{-1}D \le 2q$. With notation as in Sections 2 and 3, the limit distribution of the AIC selected estimator becomes When there is only $q = 1$ extra parameter in the wide model, this is the very same as for the two first FIC methods, with cut-off value $t_0 = \sqrt{2}$. Figure 6 displays root-risk functions $R(\eta)^{1/2}$ for the usual FIC (with $t_0 = \sqrt{2}$, full curve), for the median-FIC (with $t_0 = 1.0505$, dotted curve, low max value), and the quantile-FIC with $q = 0.25$ (with $t_0 = 1.6859$, dotdashed curve, high max value). We see that the median-FIC often wins over the standard FIC, and its maximum risk is considerably lower. More precisely, median-FIC has the lowest risk in the parts of the parameter space where the wide model is truly the best model, but where $\eta$ only has moderately large values, i.e., the parts of the parameter space where the true model is at some moderate distance from the narrow model. This fits well with some of our insights from Section 4, where we saw that median-FIC will select the wide model with a higher probability than ordinary FIC. For moderate $\eta$ values, the median-FIC turns out to balance its submodel selection probabilities well, in the sense of securing relatively small risk for the final estimator. For $\eta$ values farther away from zero all strategies always select the wide model and they therefore have identical risk.
Figure 6. For the one-dimensional case $q = 1$, root-risk function curves for estimators coming from three different FIC selection schemes, as functions of $\eta = \omega^{\rm t}\delta/(\omega^{\rm t}Q\omega)^{1/2}$: the usual FIC (black full), here also equivalent to the AIC; the median-FIC (dotted, blue, and with lowest maximum); and the quantile-FIC with $q = 0.25$ (dotdashed, blue, and with highest maximum). Furthermore, shown is the benchmark wide procedure (grey, constant). Inside the two vertical grey lines the narrow model is truly better than the wide.
For η values closer to zero, in the part of the parameter space where the narrow model is truly more precise than the wide, we see that median-FIC has a higher risk than the other strategies and that quantile-FIC with q = 0.25 is the best strategy. Again this is related to our comments in Section 4, with q = 0.25 quantile-FIC tending to give lower FIC scores to the non-wide models, compared to the other strategies. In this scenario, this gives FIC 0.25 a propensity to select the narrow model. This property is advantageous for η values around zero, but gives FIC 0.25 a higher risk for moderately large η values.
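The qualitative behaviour of these risk curves can be checked with a closed-form computation. The sketch below rests on an assumption about normalisation that is made here and not taken from the paper: the problem is put on the standardised scale where Z ∼ N(η, 1), the selected estimator of η is Z when |Z| > t0 and 0 otherwise, and the common τ0² term is left out, so that the wide procedure has constant risk 1. The function names and the η grid are hypothetical.

```python
import math

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def pretest_risk(eta, t0):
    """E (Z * 1{|Z| > t0} - eta)^2 for Z ~ N(eta, 1), in closed form.

    With U = Z - eta, the region |Z| <= t0 is a <= U <= b, and the
    antiderivative of u^2 * pdf(u) is cdf(u) - u * pdf(u).
    """
    a, b = -t0 - eta, t0 - eta
    inside_prob = norm_cdf(b) - norm_cdf(a)
    inside_second_moment = (norm_cdf(b) - b * norm_pdf(b)) - (norm_cdf(a) - a * norm_pdf(a))
    return 1.0 - inside_second_moment + eta * eta * inside_prob

cutoffs = {
    "wide": 0.0,
    "median-FIC": 1.0505,
    "usual FIC / AIC": math.sqrt(2.0),
    "quantile-FIC, q = 0.25": 1.6859,
}
etas = [0.05 * i for i in range(121)]  # eta from 0 to 6; the risk is symmetric in eta

max_risk = {name: max(pretest_risk(eta, t0) for eta in etas) for name, t0 in cutoffs.items()}
risk_at_zero = {name: pretest_risk(0.0, t0) for name, t0 in cutoffs.items()}
print(max_risk)
print(risk_at_zero)
```

Under this normalisation a larger cut-off buys lower risk near η = 0 at the price of a higher maximum risk, which is the trade-off between the three schemes described in the text.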

Three FIC Schemes with q = 2
We continue with the somewhat more complex case where we have q = 2 extra parameters in the wide model, and four submodels under consideration, here denoted by 0, 1, 2, 12. We let Q = diag(κ 2 1 , κ 2 2 ) be diagonal, in order to have simpler expressions than otherwise. This in particular means that γ 1 and γ 2 become independent in the limit. The mse expressions for the four different candidate models are then where τ 0 , ω 1 , ω 2 , κ 1 , κ 2 are considered known parameters, whereas what one can know about δ = (δ 1 , δ 2 ) is limited to the independent observations D 1 ∼ N(δ 1 , κ 2 1 ) and D 2 ∼ N(δ 2 , κ 2 2 ). The FIC scores FIC u , FIC t , FIC 0.50 will depend on these known parameters and on D = (D 1 , D 2 ) t , and the associated limiting risks will be functions of δ = (δ 1 , δ 2 ), for the three different versions of with v 0 (D), v 1 (D), v 2 (D), v 12 (D) the associated indicator functions for where submodels 0, 1, 2, 12 are selected. We can now compute and compare these risk functions in the two-dimensional δ space, for each choice of τ 0 , ω 1 , ω 2 , κ 1 , κ 2 . Since the mse expressions, as well as the risk functions, all have the same τ 2 0 term, we disregard that contribution, and in effect set τ 0 = 0. In Figure 7 we show the results of such an exercise, with ω = (1, 1) t and κ = (1, 1) t . On the left hand side, we see that for this setup median-FIC gives lower risk than the two other strategies for a relatively large part of the parameter space. The right side shows the ratio between the risk of median-FIC and the best competing strategy.
The panels indicate that median-FIC beats the two other strategies for moderate values of both $\delta_1$ and $\delta_2$, but loses when one or both of these quantities are close to zero, and also when both are large in absolute size. This is consistent with our observations in Section 4; the median-FIC has good performance in the parts of the parameter space where the wide model is truly the best. If quantile-FIC with $q = 0.25$ had been included in this comparison, we would have discovered that $\mathrm{FIC}^{0.25}$ beats the other strategies in the areas where the wide model is not the best, particularly in the narrow diagonal band from $(-6, 6)$ to $(6, -6)$.
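To see which submodel is truly best at a given position in the δ plane, one can tabulate the four limiting mse values. The sketch below assumes that, with diagonal Q, the limiting mse of submodel S is τ0² plus the sum of ω_j²κ_j² over the included j, plus the square of the sum of ω_jδ_j over the excluded j. This form is an assumption made here; it agrees with the narrow and wide expressions given in the two-model case, but the function and variable names are hypothetical.

```python
def limiting_mse(subset, delta, omega=(1.0, 1.0), kappa=(1.0, 1.0), tau0=0.0):
    """Assumed limiting mse for submodel `subset` (a tuple of included indices).

    variance part: sum over included j of (omega_j * kappa_j)^2
    bias part:     (sum over excluded j of omega_j * delta_j)^2
    """
    variance = sum((omega[j] * kappa[j]) ** 2 for j in subset)
    bias = sum(omega[j] * delta[j] for j in range(len(delta)) if j not in subset)
    return tau0 ** 2 + variance + bias ** 2

# The four candidate models, labelled as in the text.
SUBMODELS = {"0": (), "1": (0,), "2": (1,), "12": (0, 1)}

def best_submodel(delta, **kwargs):
    """Label of the submodel with the smallest assumed limiting mse at delta."""
    return min(SUBMODELS, key=lambda name: limiting_mse(SUBMODELS[name], delta, **kwargs))

for delta in [(3.0, -3.0), (0.5, 0.2), (3.0, 3.0)]:
    table = {name: limiting_mse(s, delta) for name, s in SUBMODELS.items()}
    print(delta, table, "best:", best_submodel(delta))
```

With ω = (1, 1), points on the diagonal from (−6, 6) to (6, −6) have ω^tδ = 0, so the narrow model has zero bias there, which is why that band is singled out in the text.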

Finite-Sample Performance Evaluations
Complementing the performance analyses of Section 6, in the framework of the limit experiment, we have conducted various investigations of the performance of the FIC scores and of the CDs in finite-sample settings, via simulations. In these experiments we sample data from a known wide model and with a particular choice of focus parameter, for which we then know the true value. We generate a high number of datasets from this model, and compute FIC scores and CDs for each of these. From this we can investigate aspects (a), (b), (c), (d) mentioned in the beginning of the previous section. Do the root-FIC scores succeed in estimating the true rmse? And do the FIC scores provide a correct ranking of the models? Further, we can investigate the coverage properties of our CDs: do the confidence intervals we obtain from the CDs, say the {rmse S : C * S (rmse S ) ≤ 0.80}, cover the true rmse values for approximately 80% of the rounds? We will present the results from four different simulation setups: (1) a linear regression model with relatively few candidate models, (2) a linear regression model with many candidate models, (3) a Poisson regression, and (4) a logistic regression.
Our first two illustrations are for datasets of $n = 100$ observations from a linear normal model with the structure $y_i = x_i^{\rm t}\beta + z_i^{\rm t}\gamma + \varepsilon_i$, with errors being independent from ${\rm N}(0, \sigma^2)$, and with focus parameter of the type $\mu_0 = x_0^{\rm t}\beta + z_0^{\rm t}\gamma$. In the first setup we have an intercept parameter $\beta$ protected (so $p = 1$) and three extra parameters $\gamma_1, \gamma_2, \gamma_3$ associated with three covariates considered for exclusion or inclusion ($q = 3$). The narrow model $M_1$ has only the intercept parameter, and the wide model $M_8$ has the intercept term and all three covariates. The other candidate models correspond to including or excluding the three covariates. We have used $\beta = 0$, $\gamma = (0.5, -0.5, 0.1)^{\rm t}$, and residual standard deviation $\sigma = 1$. The covariates are drawn from a multivariate normal distribution with zero means, unit variances, and intercorrelations chosen to be ${\rm corr}(X_1, X_2) = -0.3$, ${\rm corr}(X_1, X_3) = -0.2$, ${\rm corr}(X_2, X_3) = 0.6$. The focus parameter is $\mu_0 = x_0^{\rm t}\beta + z_0^{\rm t}\gamma$, with $x_0 = 1$ and $z_0 = (1, -1, 3)^{\rm t}$. The red line in the left panel of Figure 8 indicates the true rmse values for the eight models. The grey crosses are the root-median-FIC scores evaluated in $10^3$ datasets. The black dashed line gives the average scores from these $10^3$ datasets. In the right panel, we see the realised coverage of the computed 80% confidence intervals. Note that the realised coverage for the wide model (here $M_8$) will always be zero as our framework does not yield confidence intervals for the widest model, but only a point estimate, by construction. Table 2 reports the percentage of rounds where each model has the lowest FIC score (i.e., is the winning model). In this setup, model $M_5$ had the lowest true rmse (as we see in the figure), and the wide model $M_8$ had the second lowest rmse.
In our second setup, we investigated a linear normal model with a higher number of covariates, and a much higher number of candidate models. Again we have $n = 100$ and an intercept parameter $\beta$ protected (so $p = 1$), but this time we have ten extra parameters $\gamma_1, \ldots, \gamma_{10}$ considered for exclusion or inclusion ($q = 10$). There are then 1024 candidate models. We have used $\beta = 0$, $\gamma = (0.5, -0.5, 0.1, 0.4, -0.1, 0.05, -0.05, -0.5, 0.2, -0.4)^{\rm t}$, and residual standard deviation $\sigma = 1$. The covariates are drawn from a multivariate normal distribution with zero means, variances between 0.9 and 2.2, and correlations ranging from $-0.85$ to $0.85$. Again the focus parameter is of the form $\mu_0 = x_0^{\rm t}\beta + z_0^{\rm t}\gamma$, with $x_0 = 1$ and $z_0 = (1, -1, 3, 2, -1, 0, 0, 0, 0, 0)^{\rm t}$. For the sake of presentation, we have chosen to present the results for 100 among the 1024 candidate models; see Figure 9. The first model is the narrow model, the last is the wide, and the remaining are a random selection among the candidate models. Figure 9 presents the same type of results as Figure 8, but because of the high number of candidate models we have not included this setup in Table 2.
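As a small bookkeeping check on the two linear regression setups, the sketch below enumerates the candidate models as inclusion patterns and evaluates the true focus parameter from the stated coefficients. It covers only this bookkeeping, not the simulation of datasets or the computation of FIC scores and CDs, and the function names are hypothetical.

```python
from itertools import product

def focus(beta, gamma, x0, z0):
    """True focus parameter mu_0 = x0 * beta + z0^t gamma (scalar intercept beta)."""
    return x0 * beta + sum(z * g for z, g in zip(z0, gamma))

def submodels(q):
    """All inclusion patterns for q open covariates, as tuples of 0/1."""
    return list(product((0, 1), repeat=q))

# First setup: q = 3 open covariates.
gamma_1 = (0.5, -0.5, 0.1)
z0_1 = (1, -1, 3)
mu_1 = focus(0.0, gamma_1, 1, z0_1)

# Second setup: q = 10 open covariates.
gamma_2 = (0.5, -0.5, 0.1, 0.4, -0.1, 0.05, -0.05, -0.5, 0.2, -0.4)
z0_2 = (1, -1, 3, 2, -1, 0, 0, 0, 0, 0)
mu_2 = focus(0.0, gamma_2, 1, z0_2)

print(len(submodels(3)), mu_1)
print(len(submodels(10)), mu_2)
```

The ordering of patterns produced here need not match the labelling $M_1, \ldots, M_8$ used in the figures, which the text does not spell out.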
Naturally, the size of the residual standard deviation $\sigma$ is of crucial importance here, as seen also via the exact ${\rm mse}_S$ Formula (21). For small $\sigma$, the bias part dominates, and the ${\rm mse}_S$ is smallest for wider and more elaborate models; for larger $\sigma$, the variance part dominates, with ${\rm mse}_S$ being smallest for simpler models with fewer regression terms. These aspects are also picked up by the FIC. It also follows from our analyses of the CD approximations that the ${\rm rmse}_S$ confidence coverage property is more precise for smaller $\sigma$ than for bigger $\sigma$.
Our third setup is a Poisson regression model where we simulate datasets of the same size and with the same covariates as in our application in Section 8. Here, $n = 73$, $p = 3$, $q = 6$, giving 48 submodels; see further details in the section mentioned. We also use the same focus parameter as we describe there, and simulate data from the fitted wide model, corresponding to parameter values $\beta = (1.630, -0.004, 0.250)^{\rm t}$, $\gamma = (0.105, 0.094, -0.011, 0.032, 0.001, -0.006)^{\rm t}$. In Figure 10, we present the same type of results as for the previous setups, but here we have used $\mathrm{FIC}^{0.25}$ instead of median-FIC. With this criterion the truly best model was correctly identified in 63.5% of the rounds.
In our fourth setup we simulated $n = 300$ binary observations from a logistic regression model of the same form as in Section 1, but with $p = 1$ and $q = 3$. We let $\beta = 0$, $\gamma = (0.5, -0.5, 0.1)^{\rm t}$. Our focus parameter is the probability of an event for a certain vector of covariates, $x_0 = 1$ and $z_0 = (1.0, 0.2, -0.5)^{\rm t}$. The results are presented in Figure 11 and Table 2. In this setup, the truly best model was $M_5$, closely followed by $M_8$.
From the left panels in each of the four figures in this section, we see that average root-FIC scores are generally close to the true rmse values. The FIC score selects the truly best model, in terms of mse, for most of the rounds, but not always. The ability to select the best model depends on the estimation quality of the FIC scores, but also on how close, or different, the rmse values of the models really are. In the first setup, $M_5$ is the truly best model, but the wide model $M_8$ is often preferred by the FIC score. This might appear disappointing, but in fact these two models have almost identical performance (the red line in Figure 8). Similarly, in the logistic regression setup the two best models had very similar performance and were often selected by the FIC machinery. In the Poisson regression setup the correct model was identified surprisingly often given the relatively high number of candidate models.
The right panels in the four figures are possibly even more interesting, since the main contributions of this paper are the CDs for the rmse. The realised coverage of the 80% confidence intervals is generally close to 80%, but for some candidate models it is considerably lower than the nominal level. First, as mentioned already, we do not get confidence intervals for the wide model in this framework, only an unbiased point estimate, so its realised coverage will always be zero. For other candidate models than the wide, the under-coverage phenomenon happens for candidate models which consistently produce very steep CDs. These are candidate models which have a small estimated bias, and also a small variance term related to the estimation of the bias ($\sigma_S$ in our terminology). Reassuringly, for all the cases we have investigated, the candidate models with under-coverage consistently have very small spread in their FIC scores (note for instance $M_5$ in Figure 11). These candidate models thus should really get narrow confidence intervals, but these happen to become too narrow. Ultimately, the observed under-coverage is a consequence of our CDs being constructed based on the approximation to the limit experiment, and there is therefore a layer of uncertainty not accounted for in our construction; see the discussion in Section 9.

Illustration: Birds on 73 British and Irish Islands
Reed (1981) analysed the abundance of landbirds on 73 British and Irish islands. In the dataset, characteristics of each island were recorded: the distance from mainland ($x_1$), the log area ($x_2$), the number of different habitats ($z_1$), an indicator of whether the island is Irish or British ($z_2$), latitude ($z_3$), and longitude ($z_4$). As the notation indicates, we take $x_1, x_2$ as protected covariates, to be included in all candidate models, whereas $z_1, z_2, z_3, z_4$ are open. Based on general ecological theory and study of similar questions we also include two potential interaction terms, viz. $z_5 = x_2 z_1$ and $z_6 = x_1 x_2$. Of the $2^6 = 64$ candidate models, corresponding to inclusion and exclusion of $z_1, \ldots, z_6$, we only allow the interaction term $z_5 = x_2 z_1$ in a model if $z_1$ is also inside; this leaves us with $64 - 16 = 48$ candidate models below.
Suppose we take an interest in predicting the number of species $y_i$ on the Irish island of Cape Clear. In Reed's dataset we have the following information about this island: it is located at 6.44 km from the mainland, at 51.26 degrees north and $-9.37$ degrees east, with an area of 639.11 hectares. At the time of study it had 20 different habitats ($z_1$), and 40 different bird species ($y_i$) were observed. Assume that we know that the number of habitats has decreased to 15. Which model then gives the most precise estimate of the current number of species?
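The count of 48 candidate models for the islands data can be verified by brute-force enumeration. A minimal sketch, in which a model is coded as a tuple of six 0/1 indicators for $z_1, \ldots, z_6$ and the only rule applied is the one stated above, that $z_5$ may enter only together with $z_1$ (the names are hypothetical):

```python
from itertools import product

def allowed(pattern):
    """True unless the interaction z5 is included while z1 is left out."""
    z1_in, z5_in = pattern[0], pattern[4]
    return not (z5_in and not z1_in)

all_patterns = list(product((0, 1), repeat=6))       # 2^6 inclusion patterns
candidates = [p for p in all_patterns if allowed(p)]
excluded = len(all_patterns) - len(candidates)

print(len(all_patterns), excluded, len(candidates))
```

The excluded patterns are exactly those with $z_5$ in and $z_1$ out, with the remaining four open terms free.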
As the required wide model we choose the Poisson regression model, with y i ∼ Pois(λ i ), where The wide model thus has nine parameters to estimate, while the smallest, narrow one only has three. We conduct our FIC analysis, and using our confidence distribution apparatus we obtain our extended FIC plot with uncertainty bands in Figure 12. Some models indicate a clear improvement compared to the wide model, with very low uncertainty around their FIC scores. The winning model is similar to the narrow model, but includes the habitat covariate. Most of the models with low FIC scores contain this covariate, and one or both interaction terms or the longitude covariate (Cape Clear lies quite far west compared to most of the islands in the dataset). The predicted number of species on Cape Clear among the favoured models is around 29, a decrease from the 40 species in the dataset. The point to convey with this application is also that any other focused statistical question of interest can be worked with in the same fashion. Natural focus parameters could be the probability that y falls below a threshold y 0 , given a set of present or envisaged island characteristics, or the mean function E (y | x 1 , x 2 , z 1 , z 2 , z 3 , z 4 ) itself, for a given set of covariate combinations. For each such focused question, a FIC analysis can be run, leading to FIC plots and finessed CD-FIC plots as in Figure 12, perhaps each time with a new model ranking and a new model winner.

Discussion
Our paper has extended and finessed the theory of FIC, through the construction of confidence distributions associated with each point (FIC 1/2 S , µ S ) in the traditional FIC plots and FIC tables. The resulting CD-FIC plots enable the statistician to delve deeper into how well some candidate models compare to others; not only do some parameter estimates have less variance than others, but some estimates of the underlying root-mse quantities, i.e., the root-FIC scores, are more precise than others. The extra programming and computational cost is moderate, if one already has computed the usual FIC scores. Check in this regard the R package fic, which covers classes of traditional regression models; see Jackson and Claeskens (2019).
Differences in AIC scores have well-known limiting distributions, under certain conditions, which helps users to judge whether the AIC scores of two models are sufficiently different as to prefer one over the other. Aided by results of our paper one may similarly address differences in FIC scores, test whether two such scores are significantly different, etc.; see Remark C in the following section.
We trust we have demonstrated the usefulness of our methodology in our paper, but now point to a few issues and mild caveats. Some of these might be addressed in future work; see also Section 10. One concern is that our CDs for root-mse are constructed using a local neighbourhood framework for candidate models, leading to certain mse approximations where the squared bias terms are put on the same general O(1/n) footing as variances. First, this is not always a good operating assumption, since it rests on candidate models not being too far from each other. This points to the necessity of setting up such FIC schemes with care, when it comes to deciding on the narrow and the wide model, e.g., which covariates should be protected and which open in the model selection setup. Second, the mse approximations, of type (7), have led to clear CDs, but where these in essence stem from accurate analysis of estimated squared biases, not taking into account the extra variability associated with variance estimators. There is in other words a certain extra layer of second order variability not directly taken into account in the general CDs constructed in this paper.
For any finite dataset, therefore, our CDs will to some small extent underestimate the true variability present in the root-mse estimation. Still, we have seen in simulation studies that the coverage can be quite accurate with moderate sample sizes, i.e., that intervals of the type {rmse S : C S (rmse S ) ≤ 0.80} have real coverage close to 0.80, etc. Furthermore, it is possible to work out better finite-sample fine-tuned CDs for the important case of linear regression models, starting with the exactly valid mse S Formula (21). This is beyond the scope of the present article, however.
These considerations also imply that the estimated bias associated with submodel S will have a strong influence on the appearance of the CD for submodel S. The CD for rmse S will start at a position corresponding to the estimated variance of that model's focus parameter estimator µ S , but the height of the CD at this point will be determined by the relative size of the bias, viz. the bias estimate squared divided by the variance of the bias. Further, the steepness of the CD will mostly be determined by the variance of the estimated bias, with a steeper CD when the variance of the bias estimate is small. Thus a particular submodel S will obtain a narrow confidence interval around its root-FIC score if it leads to a focus estimator with small relative bias, or small variance in its bias estimate, or both.
This paper also introduces a new version of the FIC score, the quantile-FIC, and its natural special case, the median-FIC. One of the benefits of this latter FIC score is that it falls directly out of the CD, and avoids the need to explicitly decide whether one wants to truncate the squared bias or not. We have also indicated that the quantile-FIC scores can have good performance in large parts of the parameter space. More careful examination reveals that the advantageous performance of median-FIC is primarily found in the parts of the parameter space where the wide model really is the most precise. These are not the most interesting parameter regions when it comes to model selection with FIC, however, because model selection is typically conducted in situations where one hopes to find simpler effective models than the wide one. Our performance investigations reveal that other quantile-FIC versions, e.g., the lower-quartile-FIC with $q = 0.25$, appear to be favourable strategies in the more crucial parts of the parameter space where the wide model is outperformed by smaller models.

Concluding Remarks
We conclude our paper by offering a list of remarks, some pointing to further research.
A. The relative sizes of minimum uncertainty and the model averaging potential. The master theorems underlying the essential descriptions of what can go on, with submodel estimators as well as model averaging estimators, are those of (6) and (17). Thus two key parameters are τ 0 and (ω t Qω) 1/2 , the standard deviations of Λ 0 and ω t (δ − D). In a suitable sense τ 0 measures the unavoidable minimum uncertainty, whereas (ω t Qω) 1/2 represents the total variability level with the extra terms involved, for both model selection and model averaging. With a given dataset, and a set of candidate models, one may estimate these quantities separately, and hence the relative components of variability, say ρ 0 = τ 2 0 /(τ 2 0 + ω t Qω) and ρ 1 = ω t Qω/(τ 2 0 + ω t Qω), before turning to model selection and model averaging. If ρ 0 is big and ρ 1 hence small, there is little scope for carrying out sophisticated additional analyses, as most estimates will be close. Indeed, for two candidate model estimators we have If on the other hand ρ 0 is small and ρ 1 big, there is room for genuine risk improvement with model selection and averaging.
B. More accurate finite-sample FIC scores. We have extended the FIC apparatus to include confidence distributions for the underlying root-mse quantities. Our formulae have been developed via the limit experiment, where there are clear and concise expressions both for the mse parameters and the precision of relevant estimators. For real data there remain of course differences between the actual finite-sample FIC scores, as with (9), and the large-sample approximations, as with (8). As discussed in Section 9 the CDs we construct, based on accurate analysis of limit distributions, miss part of the real-data variability for finite samples. It would hence be useful to develop relevant finite-sample corrections to our CDs. See in this connection also the second-order asymptotics section of Hjort and Claeskens (2003b).
C. Differences and ratios of FIC scores. For two candidate models, say $S$ and $T$ subsets of $\{1, \ldots, q\}$, our CDs give accurate assessment of their associated ${\rm rmse}_S$ and ${\rm rmse}_T$. It would be practical to have tools for also assessing the degree to which these quantities are different. It is not easy to construct a simple test for the hypothesis that ${\rm rmse}_S = {\rm rmse}_T$, but a conservative confidence approach for addressing the mse difference for any fixed pair of candidate models, written $d(\delta)$ when seen as a function of $\delta$, is as follows. For each confidence level $\alpha$ of interest, consider the natural confidence ellipsoid $E_\alpha = \{\delta : (\delta - D)^{\rm t}Q^{-1}(\delta - D) \le \Gamma_q^{-1}(\alpha)\}$, with $\Gamma_q^{-1}$ the quantile function for the $\chi^2_q$. Then sample a high number of $\delta \in E_\alpha$, to read off the range $[l_\alpha, u_\alpha]$ of values attained by $d(\delta)$. Then the confidence of the interval is at least $\alpha$. This may in particular be used to construct a conservative test for $d(\delta) = 0$. Similar reasoning applies to other relevant quantities, like using ratios of FIC scores to build tests and confidence schemes for the underlying ${\rm mse}_T/{\rm mse}_S$ ratios. In Hjort (2020) CDs are constructed for all ${\rm rmse}_{S,n}/{\rm rmse}_{{\rm wide},n}$ ratios, and these are exact for each $n$, for the case of variable selection in linear regression models, leading to new selection criteria.
D. The fixed wide model framework for FIC. The setup of our paper has been that of local neighbourhood models, with these being inside a common $O(1/\sqrt{n})$ distance of each other. This framework, having started with Hjort and Claeskens (2003a) and Claeskens and Hjort (2003), has been demonstrated to be very useful, leading to various FIC procedures in the literature, and now also to the extended and finessed FIC procedures of the present paper. A different and in some situations more satisfactory framework involves starting with a fixed wide model, and with no 'local asymptotics' involved; see the review paper Claeskens et al. (2019) for general regression models and Cunen et al. (2020) for classes of linear mixed models. The key results involve different approximations to mse quantities, along the lines of for each candidate model $M$. Here, $\mu_{\rm true}$ is defined through the real data generating mechanism of the wide model, whereas $\theta_{0,M}$ is the least false parameter in candidate model $M$, and with $\mu_M(\theta_M)$ the focus parameter expressed in terms of that model's parameter vector. It would be very useful to lift the present paper's methodology to such setups. This would entail setting up approximate CDs, say $C_M({\rm rmse}_M)$, for each candidate model. This involves different approximation methods and indeed different CD formulae than those worked out in the present paper.
E. From FIC to AFIC. The FIC machinery is geared towards optimal estimation and performance for each given focus parameter. Sometimes there are several parameters of primary interest, however, as with all high quantiles, or the regression function for a stratum of covariates. The FIC apparatus can with certain efforts be lifted to such cases, where there is a string of focus parameters, along with measures of relative importance; see Claeskens and Hjort (2008, chp. 6) for such average-FIC, or AFIC. The present point is that all methods of this paper can be lifted to the setting of such AFIC scores as well. In Hjort (2020) a connection is built from such AFIC scores to the Mallows C p criterion for linear regression models.
F. Post-selection and post-averaging issues. The distributions of post-selection and post-averaging estimators are complicated, as seen in Section 5, with limits being nonlinear mixtures of normals. Supplementing such estimators with accurate confidence analysis is a challenging affair; see, e.g., Efron (2014); Hjort (2014); Kabaila et al. (2019). Partial solutions are considered in Claeskens and Hjort (2008, chp. 7) and Fletcher et al. (2019).
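The conservative confidence approach of Remark C can be sketched for the simplest case. The code below is an illustration under assumptions made here, not the authors' procedure in general: it takes q = 2 with diagonal Q = diag(κ1², κ2²), so that the χ² quantile with two degrees of freedom has the closed form −2 log(1 − α), and it compares the narrow and the wide model, for which the limiting mse difference is (ω^tδ)² − ω^tQω. The function names, the number of draws, and the seed are hypothetical.

```python
import math
import random

def chi2_quantile_2df(alpha):
    """Quantile function of the chi-squared distribution with 2 df."""
    return -2.0 * math.log(1.0 - alpha)

def sample_ellipse(D, kappa, radius_sq, rng):
    """Uniform draw from {delta : sum_j (delta_j - D_j)^2 / kappa_j^2 <= radius_sq}."""
    r = math.sqrt(radius_sq * rng.random())   # uniform in the disc, then rescaled
    theta = 2.0 * math.pi * rng.random()
    return (D[0] + kappa[0] * r * math.cos(theta),
            D[1] + kappa[1] * r * math.sin(theta))

def mse_difference(delta, omega=(1.0, 1.0), kappa=(1.0, 1.0)):
    """d(delta) = mse_narrow - mse_wide = (omega^t delta)^2 - omega^t Q omega."""
    bias = omega[0] * delta[0] + omega[1] * delta[1]
    return bias ** 2 - ((omega[0] * kappa[0]) ** 2 + (omega[1] * kappa[1]) ** 2)

def conservative_range(D, kappa, alpha, n_draws=20000, seed=1):
    """Range of d(delta) over draws from the level-alpha confidence ellipse."""
    rng = random.Random(seed)
    c = chi2_quantile_2df(alpha)
    values = [mse_difference(sample_ellipse(D, kappa, c, rng), kappa=kappa)
              for _ in range(n_draws)]
    return min(values), max(values)

lower, upper = conservative_range(D=(0.0, 0.0), kappa=(1.0, 1.0), alpha=0.80)
print(lower, upper)  # the difference is not declared nonzero if 0 lies inside
```

Because the draws lie inside the ellipse, the sampled range slightly understates the exact range of d(δ) over the ellipse; a dense sample, or an explicit search along the boundary, brings the two closer.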

FIC and CD-FIC Formulae for General Regression Models
In Section 2 we gave the basic formulae for the key quantities involved in building the various FIC, $\mathrm{FIC}_{0.50}$ and $\mathrm{FIC}_q$ scores, the confidence distribution $C_S(\mathrm{rmse}_S)$, etc., inside the i.i.d. setup. Here, we give the necessary technical details and formulae for similar quantities, for a general regression framework.
For regression applications more care might be needed when setting up both the wide model, under which biases, variances and mean squared errors are to be defined and then approximated and estimated, and the narrow model, in a natural sense the smallest of the candidate models. As with our introductory illustration, it often makes sense to designate some of the covariates as protected and others as open; see Claeskens and Hjort (2008, chp. 5-7) for a wider discussion. Consider therefore a regression setup with $(x_i, z_i, y_i)$, for one-dimensional response variables $y_i$, where $x_i$ is a vector of length say $p$ denoting such protected covariates, to be included in each candidate model, and $z_i = (z_{i,1}, \ldots, z_{i,q})^t$ is of length $q$, with components which might be included or excluded in the various candidate models. There is a wide model of the form $f(y_i \mid x_i, z_i, \theta, \beta, \gamma)$, where $\theta$ of length say $r$ is a set of core parameters, relating to perhaps scale and shape, and with $\beta$ and $\gamma$ of dimensions $p$ and $q$ holding the regression coefficients related to $x_i$ and $z_i$. The framework encompasses the traditional generalised linear models (linear, logistic, Poisson, gamma type regressions) but also wider models, like those called doubly-linear or generalised linear-linear regression models in Schweder and Hjort (2016, chp. 8). Examples of the latter are normal distributions $(\xi_i, \sigma_i^2)$ with linear regression structure on both $\xi_i$ and $\log \sigma_i$, gamma distributions $(a_i, b_i)$ with log-linear structure on both parameters, etc.
The model selection and model averaging setup now takes $f(y_i \mid x_i, z_i, \theta_0, \beta_0, \gamma_0 + \delta/\sqrt{n})$ as the data-generating mechanism, with $\delta/\sqrt{n}$ the relative modelling distance from the narrow model $f(y_i \mid x_i, z_i, \theta_0, \beta_0, \gamma_0)$; in most applications, $\gamma_0$ is simply the zero point, reflecting no influence of the $z_i$ on the response $y_i$. The log-likelihood function for the wide model is
$$\ell_{\rm wide}(\theta, \beta, \gamma) = \sum_{i=1}^n \log f(y_i \mid x_i, z_i, \theta, \beta, \gamma),$$
leading to ML estimators $\hat\alpha_{\rm wide} = (\hat\theta_{\rm wide}, \hat\beta_{\rm wide}, \hat\gamma_{\rm wide})$ for the full $(r + p + q)$-dimensional parameter. For submodel $S$, corresponding to a subset $S$ of $\{1, \ldots, q\}$, the log-likelihood is
$$\ell_S(\theta, \beta, \gamma_S) = \sum_{i=1}^n \log f(y_i \mid x_i, z_i, \theta, \beta, \gamma_S, \gamma_{0,S^c}),$$
with $r + p + |S|$ unknown parameters, and ensuing ML estimator $\hat\alpha_S = (\hat\theta_S, \hat\beta_S, \hat\gamma_S)$. For a general focus parameter $\mu = \mu(\theta, \beta, \gamma)$, a smooth function of the parameters of the wide model, and hence with a clear statistical interpretation across candidate models, the question is how well the different submodel generated estimators $\hat\mu_S = \mu(\hat\theta_S, \hat\beta_S, \gamma_{0,S^c}, \hat\gamma_S)$ succeed in coming close to $\mu_{\rm true} = \mu(\theta_0, \beta_0, \gamma_0 + \delta/\sqrt{n})$. The point is now that essentially all of the theory for the simpler i.i.d. case, covered in Section 2.1, goes through, mutatis mutandis, with the required attention to details, under broadly valid Lindeberg conditions for limiting normality etc. This of course needs properly modified definitions of the key quantities $J$, $Q$, $\omega$, $\tau_0$, $D_n \to_d D$, $G_S$ used in Sections 2 and 3, along with estimators for these. We now give such formulae, pointing also to Claeskens and Hjort (2008, chp. 5-7) for further details and illustrations of related points.
We start by writing $\alpha_0$ for the full parameter vector $(\theta_0, \beta_0, \gamma_0)$, and $J_n$ for the normalised information matrix of the wide model at $\alpha_0$. This information matrix is of size $(r + p + q) \times (r + p + q)$. There is convergence to a well-defined limit matrix $J$, and the natural consistent estimator is $\hat J_n = -n^{-1}\,\partial^2 \ell_{\rm wide}(\hat\alpha_{\rm wide})/\partial\alpha\,\partial\alpha^t$, which is minus the normalised Hessian from the numerical optimisation involved in finding the ML estimators in the wide model.
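When the optimiser used for the wide model does not report a Hessian, the estimator of the information matrix described above can be approximated by finite differences. The following is a minimal sketch of that idea, not the authors' implementation: the function names `numerical_hessian` and `observed_information`, the step size `h`, and the simulated Poisson data are all assumptions made for illustration. The Poisson log-likelihood is used only because its Hessian is known in closed form, which gives something to check against.

```python
import numpy as np

def numerical_hessian(f, a, h=1e-4):
    """Central finite-difference Hessian of a scalar function f at the point a."""
    a = np.asarray(a, dtype=float)
    d = a.size
    H = np.empty((d, d))
    for i in range(d):
        for j in range(i, d):
            ei = np.zeros(d); ei[i] = h
            ej = np.zeros(d); ej[j] = h
            H[i, j] = H[j, i] = (
                f(a + ei + ej) - f(a + ei - ej) - f(a - ei + ej) + f(a - ei - ej)
            ) / (4.0 * h * h)
    return H

def observed_information(loglik, alpha_hat, n):
    """Minus the Hessian of the log-likelihood at alpha_hat, divided by n."""
    return -numerical_hessian(loglik, alpha_hat) / n

# Illustration with simulated Poisson regression data (hypothetical values).
rng = np.random.default_rng(1)
n = 200
W = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])  # columns: (x_i, z_i)
alpha = np.array([0.3, 0.2, 0.1, -0.1])                     # stand-in for the ML estimate
lam = np.exp(W @ alpha)
y = rng.poisson(lam)

# Poisson log-likelihood up to an additive constant not involving alpha.
loglik = lambda a: np.sum(y * (W @ a) - np.exp(W @ a))

J_num = observed_information(loglik, alpha, n)
J_exact = (W * lam[:, None]).T @ W / n   # closed form for the log-link Poisson model
print(np.max(np.abs(J_num - J_exact)))
```

In practice `alpha` would be replaced by the wide-model ML estimate returned by the optimiser; the comparison with `J_exact` is only a sanity check available in this particular model.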
The lower right $q \times q$ submatrix of $\hat J_n^{-1}$, say $\hat Q_n$, is consistent for $Q$, the lower right submatrix of $J^{-1}$. Similarly, there is a crucial $\hat\omega = \hat J_{n,10} \hat J_{n,00}^{-1}\,\partial\mu(\hat\alpha)/\partial(\theta, \beta) - \partial\mu(\hat\alpha)/\partial\gamma$, with $\hat J_{n,00}$ of size $(r + p) \times (r + p)$ corresponding to the protected $(\theta, \beta)$ part of the parameter vector, and with partial derivatives of $\mu(\theta, \beta, \gamma)$ computed at the wide model's ML position. Other quantities from the i.i.d. setup are similarly modified, with the key results paralleling those given attention in Section 2.
Two illustrations of the FIC apparatus and central formulae above are as follows. We first consider Poisson regression, as used in Section 8. Suppose $y_i$ is Poisson with mean parameter $\lambda_i = \exp(x_i^t \beta + z_i^t \gamma)$, with the $x_i$ protected and $z_i$ open, of dimensions say $p$ and $q$. In this situation there are no extra parameters, i.e., no $\theta$, in the notation above, and one finds
$$J_n = n^{-1} \sum_{i=1}^n \lambda_i \begin{pmatrix} x_i x_i^t & x_i z_i^t \\ z_i x_i^t & z_i z_i^t \end{pmatrix},$$
along with $\hat J_n$ obtained by plugging in wide model ML estimators $(\hat\beta_{\rm wide}, \hat\gamma_{\rm wide})$. This leads to the relevant $Q_n$ and $\hat Q_n$, etc. If the focus parameter is as relatively simple as $\mu = x_0^t \beta + z_0^t \gamma$, i.e., a linear combination of the log-mean parameters, one has $\omega_n = J_{n,10} J_{n,00}^{-1} x_0 - z_0$, with corresponding estimator $\hat\omega$, a vector of length $q$. These formulae then lead to all FIC scores, the CDs $C^*_S(\mathrm{mse}_S)$, etc. The setup is fully capable of handling also more complicated focus parameters. Formulae for the case of logistic regression models are similar to those given here for the Poisson case, but involve a differently defined $J_n$ matrix.
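To make the Poisson quantities concrete, here is a minimal sketch of how the information matrix, its lower right inverse block, and the vector ω for the linear focus µ = x_0^t β + z_0^t γ could be computed. It assumes the standard information matrix for log-link Poisson regression, n^{-1} Σ λ_i w_i w_i^t with w_i = (x_i, z_i); the function name `poisson_fic_ingredients` and all numerical values are hypothetical, and the coefficient vectors passed in stand for wide-model ML estimates obtained elsewhere.

```python
import numpy as np

def poisson_fic_ingredients(X, Z, beta, gamma, x0, z0):
    """Return (J, Q, omega) for Poisson regression with protected X and open Z.

    J     : normalised information matrix, blocks ordered as (beta, gamma)
    Q     : lower right q x q block of the inverse of J
    omega : J_10 J_00^{-1} x0 - z0, for the focus mu = x0'beta + z0'gamma
    """
    n, p = X.shape
    W = np.hstack([X, Z])
    lam = np.exp(X @ beta + Z @ gamma)       # Poisson means at the given parameters
    J = (W * lam[:, None]).T @ W / n
    J00, J10 = J[:p, :p], J[p:, :p]
    Q = np.linalg.inv(J)[p:, p:]
    omega = J10 @ np.linalg.solve(J00, x0) - z0
    return J, Q, omega

# Illustration with simulated covariates (hypothetical values throughout).
rng = np.random.default_rng(0)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # protected covariates, p = 2
Z = rng.normal(size=(n, 2))                             # open covariates, q = 2
beta_hat = np.array([0.3, 0.2])      # stand-ins for wide-model ML estimates
gamma_hat = np.array([0.1, -0.1])
x0 = np.array([1.0, 0.5])
z0 = np.array([0.2, -0.3])

J_hat, Q_hat, omega_hat = poisson_fic_ingredients(X, Z, beta_hat, gamma_hat, x0, z0)
print(Q_hat)
print(omega_hat)
```

The block `Q` is taken from the inverse of the full matrix rather than by inverting the lower right block of `J` itself; the two differ unless the protected and open covariates are orthogonal in the weighted sense.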
Our second illustration of the general setup is the important class of linear regressions, with wide model $y_i = x_i^t \beta + z_i^t \gamma + \sigma\varepsilon_i$ in terms of parameters $(\sigma, \beta, \gamma)$, of combined length $1 + p + q$. This is in some ways a simpler regression model than the Poisson one, but there is the extra scale parameter $\sigma$ to include in the calculations. One finds expressions for $J_n$ and its inverse in terms of the four blocks $\Sigma_{n,00}, \Sigma_{n,01}, \Sigma_{n,10}, \Sigma_{n,11}$ of the $(p + q) \times (p + q)$ covariate variance matrix $\Sigma_n$ for the $(x_i, z_i)$, and the corresponding blocks of $\Sigma_n^{-1}$. In particular, $Q_n = \sigma^2 \Sigma_n^{11}$, with $\Sigma_n^{11}$ the lower right $q \times q$ block of $\Sigma_n^{-1}$. There are also $q \times q$ matrices $G_{n,S} = \pi_S^t Q_{n,S} \pi_S Q_n^{-1}$ paralleling those of Section 2, and these are fully observed, since the $\sigma^2$ factor cancels out. Now consider a focus parameter of the mean type $\mu = E(y \mid x_0, z_0) = x_0^t \beta + z_0^t \gamma$, for which we find $\omega_n = \Sigma_{n,10} \Sigma_{n,00}^{-1} x_0 - z_0$. For candidate model $S$, a subset of the $z_{i,1}, \ldots, z_{i,q}$ covariates, the estimator of $\mu$ is $\hat\mu_S = x_0^t \hat\beta_S + z_{0,S}^t \hat\gamma_S$, where $(\hat\beta_S, \hat\gamma_S)$ are the least squares estimators for the submodel with means $x_i^t \beta + z_{i,S}^t \gamma_S$. The parallel to the i.i.d. result (7) for the limiting mse for candidate model $S$ now yields an expression for $\mathrm{mse}_{n,S} = E_{\rm wide}\{\sqrt{n}(\hat\mu_S - \mu_{\rm true})\}^2 = n\,E_{\rm wide}(\hat\mu_S - \mu_{\rm true})^2$, namely
$$\mathrm{mse}_{n,S} = \sigma^2 \{x_0^t \Sigma_{n,00}^{-1} x_0 + \omega_n^t G_{n,S} \Sigma_n^{11} G_{n,S}^t \omega_n\} + n\{\omega_n^t (I - G_{n,S})\gamma\}^2.$$
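The expression for mse_{n,S} can be checked numerically against a direct finite-sample calculation. The sketch below is an illustration under stated assumptions rather than code from the paper: it takes a fixed design, errors with variance σ², Σ_n = n^{-1} Σ w_i w_i^t with w_i = (x_i, z_i), and Q_{n,S} = (π_S Q_n^{-1} π_S^t)^{-1} as in the usual FIC setup. The function names `formula_mse` and `direct_mse` and all numerical values are hypothetical. The direct calculation writes the submodel estimator as a linear form a^t y and computes n times its variance plus squared bias.

```python
import itertools
import numpy as np

def formula_mse(X, Z, x0, z0, gamma, sigma, S):
    """mse_{n,S} from the closed-form expression (variance part plus squared bias)."""
    n, p = X.shape
    q = Z.shape[1]
    idx = np.array(S, dtype=int)
    W = np.hstack([X, Z])
    Sigma = W.T @ W / n
    S00, S01 = Sigma[:p, :p], Sigma[:p, p:]
    S10, S11 = Sigma[p:, :p], Sigma[p:, p:]
    schur = S11 - S10 @ np.linalg.solve(S00, S01)   # inverse of Sigma_n^{11}
    Sig11 = np.linalg.inv(schur)                    # Sigma_n^{11}
    omega = S10 @ np.linalg.solve(S00, x0) - z0
    G = np.zeros((q, q))
    if idx.size > 0:
        pi_S = np.eye(q)[idx, :]
        Q_S = np.linalg.inv(pi_S @ schur @ pi_S.T)  # sigma^2 cancels in G
        G = pi_S.T @ Q_S @ pi_S @ schur
    var_part = sigma**2 * (x0 @ np.linalg.solve(S00, x0) + omega @ G @ Sig11 @ G.T @ omega)
    bias_part = n * (omega @ (np.eye(q) - G) @ gamma) ** 2
    return var_part + bias_part

def direct_mse(X, Z, x0, z0, gamma, sigma, S):
    """n * E(mu_hat_S - mu_true)^2 for least squares in submodel S, fixed design."""
    n = X.shape[0]
    idx = np.array(S, dtype=int)
    W = np.hstack([X, Z[:, idx]])
    c = np.concatenate([x0, z0[idx]])
    a = W @ np.linalg.solve(W.T @ W, c)      # mu_hat_S = a'y
    variance = sigma**2 * (a @ a)
    bias = a @ (Z @ gamma) - z0 @ gamma      # the beta part cancels since a'X = x0'
    return n * (variance + bias**2)

# Illustration with a simulated design (hypothetical values throughout).
rng = np.random.default_rng(0)
n, q = 50, 3
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = rng.normal(size=(n, q))
x0, z0 = np.array([1.0, 0.3]), np.array([0.5, -0.2, 0.1])
gamma, sigma = np.array([0.2, -0.1, 0.3]), 1.5

subsets = [S for k in range(q + 1) for S in itertools.combinations(range(q), k)]
gaps = [abs(formula_mse(X, Z, x0, z0, gamma, sigma, S)
            - direct_mse(X, Z, x0, z0, gamma, sigma, S)) for S in subsets]
print(max(gaps))
```

Here `gamma` is the actual coefficient vector of the open covariates, so no local asymptotics enter the direct calculation; the two functions should agree up to rounding error for every subset.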
The crucial point is that this expression for $\mathrm{mse}_{n,S}$, derived here from a local asymptotics perspective with $\gamma = \delta/\sqrt{n}$, is found to be exactly valid for these linear models.
Funding: This research received no external funding.