1. Introduction
Model assessment, that is, assessing the adequacy of a model and/or the ability to perform model selection, is one of the fundamental components of statistical analysis. For example, in the model adequacy problem one usually begins with a fixed model, and interest centers on measuring the cost of model misspecification. A natural way to create a framework within which we can assess model misspecification is to use statistical distances as loss functions. These constructs measure the distance between the unknown distribution that generated the data and an estimate obtained from the data model. By identifying statistical distances as loss functions, we can begin to understand the role distances play in model fitting and selection, as they become measures of the overall cost of model misspecification. This strategy allows us to investigate the construction of a loss function as the maximum error in a list of model fit questions. Therefore, our fundamental question is the following: how can one design a loss function that is scientifically and statistically meaningful? We would like to be able to attach a specific scientific meaning to the numerical values of the loss, so that a value of the distance equal to 4, for example, has an explicit interpretation in terms of our statistical goals. When we select between models, we would like to measure the quality of the approximation via the model's ability to provide answers to important scientific questions. This presupposes that the meaning of "best fitting model" should depend on the "statistical questions" being asked of the model.
Lindsay [1] discusses a distance-based framework for assessing model adequacy. A fundamental tenet of the framework for model adequacy put forward by Lindsay [1] is that it is possible and reasonable to carry out a model-based scientific inquiry without believing that the model is true, and without assuming that the truth is included in the model. All this, of course, assumes that we have a way to measure the quality of the approximation to the "truth" that is offered by the model. This point of view never assumes the correctness of the model. Of course, it is rather presumptuous to label any distribution as the truth, as any basic modeling assumption generated by the sampling scheme that provided the data is never exactly true. An example of a basic modeling assumption might be "$X_1, X_2, \dots, X_n$ are independent, identically distributed from an unknown distribution $\tau$". This, as any other statistical assumption, is subject to question even in the most idealized of data collection frameworks. However, we believe that well designed experiments can generate data that are similar to data from idealized models; therefore, we operate as if the basic assumption is true. This means that we assume that there is a true distribution $\tau$ that generates the data, which is "knowable" if we can collect an infinite amount of data. Furthermore, we note that the basic modeling assumption will be the global framework for assessment of all more restrictive assumptions about the data generation mechanism. In a sense, it is the "nonparametric" extension of the more restrictive models that might be considered.
We let $\mathcal{P}$ be the class of all distributions consistent with the basic assumptions. Hence $\tau \in \mathcal{P}$, and sets $\mathcal{M} \subset \mathcal{P}$ are called models. We assume that $\tau \notin \mathcal{M}$; hence, there is a permanent model misspecification error. Statistical distances will then provide a measure for the model misspecification error.
One natural way to measure model adequacy is to define a loss function $\rho(\tau, M)$ that describes the loss incurred when the model element $M$ is used instead of the true distribution $\tau$. Such a loss function should, in principle, indicate, in an inferential sense, how far apart the two distributions $\tau$, $M$ are. In the next section, we offer a formal definition of the concept of a statistical distance.
If the statistical questions of interest can be expressed as a list of functionals $T_h$ of the model $M$ that we wish to be uniformly close to the same functionals of the true distribution, then we can turn the set of model fit questions into a distance via
$$\rho(\tau, M) = \sup_h \left| T_h(\tau) - T_h(M) \right|,$$
where the supremum is taken over the class of functionals of interest. Using the supremum of the individual errors is one way of assessing overall error, and this measure has the nice feature that its value gives a bound on all individual errors. The statistical questions of interest may be global, such as: is the normal model correct in every aspect? Or we may be interested in answers on a few key characteristics, such as the mean.
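To make this construction concrete, the following R sketch (ours, not taken from the paper; the two discrete distributions and the short list of functionals are purely illustrative) computes the largest absolute discrepancy over a finite list of functionals.

    # Supremum-over-functionals loss: a small illustrative list of "statistical
    # questions" (mean, variance, P(X <= 1)) asked of a true pmf tau and a model m.
    support <- 0:10
    tau <- dbinom(support, size = 10, prob = 0.3)   # hypothetical "true" distribution
    m   <- dpois(support, lambda = 3)               # hypothetical model element
    m   <- m / sum(m)                               # renormalize after truncating the support

    functionals <- list(
      mean     = function(p) sum(support * p),
      variance = function(p) sum(support^2 * p) - sum(support * p)^2,
      cdf_at_1 = function(p) sum(p[support <= 1])
    )

    errors <- sapply(functionals, function(T_h) abs(T_h(tau) - T_h(m)))
    rho <- max(errors)   # the loss: a bound on every individual error
    errors
    rho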
Lindsay et al. [2] introduced a class of statistical distances, called quadratic distances, and studied their use in the context of goodness-of-fit testing. Furthermore, Markatou et al. [3] discuss extensively the chi-squared distance, a special case of quadratic distance, and its role in robustness. In this paper, we study non-quadratic distances and their role in model assessment. The paper is organized as follows. Section 2 presents the definition of a statistical distance and its associated properties. Section 3, Section 4 and Section 5 discuss in detail three popular distances, the total variation, the mixture index of fit and the Kullback-Leibler distance, with the aim of understanding their role in model assessment problems. The likelihood distance is also briefly discussed in Section 5. Section 6 illustrates computation and applications of the total variation, mixture index of fit and Kullback-Leibler distances. Finally, Section 7 presents discussion and conclusions pertaining to the use of the total variation and mixture index of fit distances.
  2. Statistical Distances and Their Properties
If we adopt the usual convention that loss functions are nonnegative in their arguments, equal to zero if the correct model is used, and larger in value if the two distributions are not very similar, then the loss $\rho(\tau, M)$ can also be viewed as a distance between $\tau$ and $M$. In fact, we will always assume that for any two distributions $F$, $G$ we have $\rho(F, G) \geq 0$. If this holds, we will say that $\rho$ is a statistical distance. Unlike the requirements for a metric, we do not require symmetry. In fact, there is no reason that the loss should be symmetric, as the roles of $\tau$ and $M$ are different. We also do not require $\rho$ to be nonzero when the arguments differ. This zero property will allow us to specify that two distributions are equivalent as far as our statistical purposes are concerned by giving them zero distance.
Furthermore, it is important to note that if $\tau$ is in the model $\mathcal{M}$ and $\mathcal{M} = \{m_\theta : \theta \in \Theta\}$, say $\tau = m_{\theta_0}$, then the distance $\rho$ induces a loss function on the parameter space via
$$L(\theta_0, \theta) = \rho\left(m_{\theta_0}, m_\theta\right).$$
Therefore, if $\tau$ is in the model, the losses defined by $\rho$ are parametric losses.
We begin with the discrete distribution framework. Let $\mathcal{T} = \{0, 1, 2, \dots, T\}$, where $T$ is possibly infinite, be a discrete sample space. On this sample space we define a true probability density $\tau(t)$, as well as a family of densities $\mathcal{M} = \{m_\theta(t) : \theta \in \Theta\}$, where $\Theta$ is the parameter space. Assume we have independent and identically distributed random variables $X_1, \dots, X_n$ producing the realizations $x_1, \dots, x_n$ from $\tau(t)$. We record the data as $d(t) = n(t)/n$, where $n(t)$ is the number of observations in the sample with value equal to $t$. We note here that we use the word "density" in a generic fashion that incorporates both probability mass functions and probability density functions. A rather formal definition of the concept of statistical distance is as follows.
Definition 1. (Markatou et al. [3]) Let τ, m be two probability density functions. Then $\rho(\tau, m)$ is a statistical distance between the corresponding probability distributions if $\rho(\tau, m) \geq 0$, with equality if and only if τ and m are the same for all statistical purposes.
We would require $\rho(\tau, m)$ to indicate the worst mistake that we can make if we use $m$ instead of $\tau$. The precise meaning of this statement is obvious in the case of the total variation distance that we discuss in detail in Section 3 of the paper.
We would also like our statistical distances to be convex in their arguments.
Definition 2. Let τ, m be a pair of probability density functions, with m being represented as $m = \alpha m_1 + (1-\alpha) m_2$, $0 \leq \alpha \leq 1$. We say that the statistical distance $\rho(\tau, m)$ is convex in the right argument if
$$\rho\left(\tau, \alpha m_1 + (1-\alpha) m_2\right) \leq \alpha\, \rho(\tau, m_1) + (1-\alpha)\, \rho(\tau, m_2),$$
where $m_1$, $m_2$ are two probability density functions.
Definition 3. Let τ, m be a pair of probability density functions, and assume $\tau = \alpha \tau_1 + (1-\alpha)\tau_2$, $0 \leq \alpha \leq 1$. Then, we say that $\rho(\tau, m)$ is convex in the left argument if
$$\rho\left(\alpha \tau_1 + (1-\alpha)\tau_2, m\right) \leq \alpha\, \rho(\tau_1, m) + (1-\alpha)\, \rho(\tau_2, m),$$
where $\tau_1$, $\tau_2$ are two densities.
Lindsay et al. [2] define and study quadratic distances as measures of goodness of fit, a form of model assessment. In the next sections, we study non-quadratic distances and their role in the problem of model assessment. We begin with the total variation distance.
  3. Total Variation
In this section, we study the properties of the total variation distance. We offer a loss function interpretation of this distance and discuss sensitivity issues associated with its use. We will begin with the case of discrete probability measures and then move to the case of continuous probability measures. The results presented here are novel and are useful in selecting the distances to be used in any given problem.
The total variation distance is defined as follows.
Definition 4. Let τ, m be two probability distributions. We define the total variation distance between the probability mass functions τ, m to be
$$V(\tau, m) = \frac{1}{2} \sum_t \left| \tau(t) - m(t) \right|.$$
This measure is also known as the $L_1$-distance (without the factor $1/2$) or index of dissimilarity.
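A minimal R sketch of Definition 4 (ours, with illustrative distributions) is given below; it simply halves the L1 distance between two probability mass vectors defined on a common support.

    # Total variation distance between two pmfs on the same support.
    total_variation <- function(tau, m) {
      stopifnot(length(tau) == length(m))
      0.5 * sum(abs(tau - m))
    }

    support <- 0:10
    tau <- dbinom(support, 10, 0.3)
    m   <- dpois(support, 3); m <- m / sum(m)   # truncated and renormalized Poisson
    total_variation(tau, m)                     # a value in [0, 1], cf. Corollary 1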
Corollary 1. The total variation distance takes values in the interval $[0, 1]$.
Proof. By definition, $V(\tau, m) = \frac{1}{2}\sum_t |\tau(t) - m(t)| \geq 0$, with equality if and only if $\tau(t) = m(t)$ for all $t$. Moreover, $|\tau(t) - m(t)| \leq \tau(t) + m(t)$. But $\tau$, $m$ are probability mass functions (or densities), therefore
$$\sum_t |\tau(t) - m(t)| \leq \sum_t \tau(t) + \sum_t m(t) = 2,$$
and hence
$$\frac{1}{2} \sum_t |\tau(t) - m(t)| \leq 1,$$
or, equivalently,
$$V(\tau, m) \leq 1.$$
Therefore $0 \leq V(\tau, m) \leq 1$. ☐
Proposition 1. The total variation distance is a metric.
Proof. By definition, the total variation distance is non-negative. Moreover, it is symmetric because $|\tau(t) - m(t)| = |m(t) - \tau(t)|$, and it satisfies the triangle inequality since, for any probability mass function $g$,
$$\frac{1}{2}\sum_t |\tau(t) - m(t)| \leq \frac{1}{2}\sum_t |\tau(t) - g(t)| + \frac{1}{2}\sum_t |g(t) - m(t)|.$$
Thus, it is a metric. ☐
The following proposition states that the total variation distance is convex in both the left and the right argument.
Proposition 2. Let τ, m be a pair of densities with τ represented as $\tau = \alpha \tau_1 + (1-\alpha)\tau_2$, $0 \leq \alpha \leq 1$. Then
$$V\left(\alpha \tau_1 + (1-\alpha)\tau_2, m\right) \leq \alpha\, V(\tau_1, m) + (1-\alpha)\, V(\tau_2, m).$$
Moreover, if m is represented as $m = \alpha m_1 + (1-\alpha) m_2$, $0 \leq \alpha \leq 1$, then
$$V\left(\tau, \alpha m_1 + (1-\alpha) m_2\right) \leq \alpha\, V(\tau, m_1) + (1-\alpha)\, V(\tau, m_2).$$
Proof. It is a straightforward application of the definition of the total variation distance. ☐
The total variation measure has major implications for prediction probabilities. A statistically useful interpretation of the total variation distance is that it can be thought of as the worst error we can commit in probability when we use the model $m$ instead of the truth $\tau$. The maximum value of this error equals 1 and it occurs when $\tau$, $m$ are mutually singular.
Denote by $P_\tau(A)$ the probability of a set $A$ under the measure $\tau$ and by $P_m(A)$ the probability of a set under the measure $m$.
Proposition 3. Let τ, m be two probability mass functions. Then
$$V(\tau, m) = \sup_A \left| P_\tau(A) - P_m(A) \right|,$$
where A is a subset of the Borel set $\mathcal{B}$.
Proof. Define the sets $A_1 = \{t : \tau(t) > m(t)\}$, $A_2 = \{t : \tau(t) < m(t)\}$, $A_3 = \{t : \tau(t) = m(t)\}$. Notice that
$$2\, V(\tau, m) = \sum_t |\tau(t) - m(t)| = \sum_{t \in A_1} \left[\tau(t) - m(t)\right] + \sum_{t \in A_2} \left[m(t) - \tau(t)\right] + \sum_{t \in A_3} \left|\tau(t) - m(t)\right|.$$
Because on the set $A_3$ the two probability mass functions are equal, the third sum vanishes, and hence
$$2\, V(\tau, m) = \left[ P_\tau(A_1) - P_m(A_1) \right] + \left[ P_m(A_2) - P_\tau(A_2) \right].$$
Note that, because of the nature of the sets $A_1$ and $A_2$, both terms in the last expression are positive. Moreover, since $\tau$ and $m$ both sum to one and agree on $A_3$, the two terms are equal. Therefore
$$V(\tau, m) = P_\tau(A_1) - P_m(A_1) = \sup_A \left| P_\tau(A) - P_m(A) \right|,$$
because the difference $|P_\tau(A) - P_m(A)|$ can only decrease when points of $A_1$ are removed from $A$ or points outside $A_1$ are added to it. ☐
Remark 1. The model misspecification measure $\sup_A \left| P_\tau(A) - P_m(A) \right|$ has a "minimax" expression. This indicates the sense in which the measure assesses the overall risk of using m instead of τ, and then chooses the m that minimizes the aforementioned risk.
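The supremum form in Proposition 3 can be checked by brute force on a small support; the R sketch below (ours) enumerates every subset A and compares the largest probability discrepancy with the total variation distance.

    # Brute-force verification of V(tau, m) = sup_A |P_tau(A) - P_m(A)|.
    support <- 0:5
    tau <- dbinom(support, 5, 0.4)
    m   <- dbinom(support, 5, 0.6)

    tv <- 0.5 * sum(abs(tau - m))

    # Enumerate all 2^6 subsets of the support via logical indicator vectors.
    indicators  <- expand.grid(rep(list(c(FALSE, TRUE)), length(support)))
    subset_gaps <- apply(indicators, 1, function(A) abs(sum(tau[A]) - sum(m[A])))

    max(subset_gaps)   # coincides with tv (up to floating point error)
    tv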
We now offer a testing interpretation of the total variation distance. We establish that the total variation distance can be obtained as the solution to a suitably defined optimization problem: it is the maximum, over test functions, of the difference between the power and the level of a suitably defined testing problem.
Definition 5. A randomized test function for testing a statistical hypothesis $H_0$ versus the alternative $H_1$ is a (measurable) function ϕ defined on the sample space and taking values in the interval $[0, 1]$, with the following interpretation. If x is the observed value of X and $\phi(x) = y$, then a coin whose probability of falling heads is y is tossed and $H_0$ is rejected when a head appears. In the case where y is either 0 or 1 for every x, the test is called non-randomized.
Proposition 4. Let $H_0: \tau = f$ versus $H_1: \tau = g$, where ϕ is a test function and f, g are probability mass functions. Then
$$V(f, g) = \sup_\phi \left\{ E_g[\phi(X)] - E_f[\phi(X)] \right\}.$$
An advantage of the total variation distance is that it is not sensitive to small changes in the density. That is, if $\tau$ is replaced by $\tau^* = (1-\varepsilon)\tau + \varepsilon \xi$, where $\xi$ is an arbitrary density and $\varepsilon$ is small, then
$$\left| V(\tau^*, m) - V(\tau, m) \right| \leq V(\tau^*, \tau) \leq \varepsilon.$$
Therefore, when the changes in the density are small, so is the change in the total variation distance. When describing a population, it is natural to describe it via the proportion of individuals in various subgroups. Having $V(\tau, m)$ small would ensure uniform accuracy for all such descriptions. On the other hand, populations are also described in terms of a variety of other variables, such as means. Having the total variation measure small does not imply that means are close on the scale of standard deviation.
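The insensitivity property is easy to check numerically. In the R sketch below (ours; the ε-contamination by a point mass is an arbitrary illustrative choice), the total variation distance to the model moves by at most ε.

    # A small contamination of tau changes V(tau, m) by at most eps.
    total_variation <- function(p, q) 0.5 * sum(abs(p - q))

    support <- 0:20
    tau <- dbinom(support, 20, 0.30)
    m   <- dbinom(support, 20, 0.35)

    eps <- 0.01
    xi  <- c(rep(0, 20), 1)                 # contaminating point mass at t = 20
    tau_star <- (1 - eps) * tau + eps * xi

    abs(total_variation(tau_star, m) - total_variation(tau, m))   # bounded by eps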
Remark 2. The total variation distance is not differentiable in its arguments. Using $V(d, m_\theta)$ as an inference function, where d denotes the data-based estimate of τ (i.e., $d(t) = n(t)/n$), yields estimators of θ that are not smooth, asymptotically normal estimators when the model is true [4]. This feature is related to the pathologies of the variation distance described by Donoho and Liu [5]. However, if parameter estimation is of interest, one can use alternative divergences that are free of these pathologies.
We now study the total variation distance in continuous probability models.
Definition 6. The total variation distance between two probability density functions τ, m is defined as
$$V(\tau, m) = \frac{1}{2} \int \left| \tau(x) - m(x) \right| dx.$$
The total variation distance has the same interpretation as in the discrete probability model case. That is,
$$V(\tau, m) = \sup_A \left| P_\tau(A) - P_m(A) \right|.$$
One of the important issues in the construction of distances in continuous spaces is the issue of invariance, because the behavior of distance measures under transformations of the data is of interest. Suppose we take a monotone transformation of the observed variable X and use the corresponding model distribution; how does this transformation affect the distance between X and the model?
Invariance seems to be desirable from an inferential point of view, but difficult to achieve without forcing one of the distributions to be continuous and appealing to the probability integral transform for a common scale. In multivariate continuous spaces, the problem of transformation invariance is even more difficult, as there is no longer a natural probability integral transformation to bring data and model on a common scale.
Proposition 5. Let $V(\tau_X, m_X)$ be the total variation distance between the densities $\tau_X$, $m_X$ of a random variable X. If $Y = T(X)$ is a one-to-one transformation of the random variable X, then $V(\tau_Y, m_Y) = V(\tau_X, m_X)$.
Proof. Write
$$V(\tau_Y, m_Y) = \frac{1}{2} \int \left| \tau_Y(y) - m_Y(y) \right| dy = \frac{1}{2} \int \left| \tau_X\!\left(T^{-1}(y)\right) - m_X\!\left(T^{-1}(y)\right) \right| \left| \left(T^{-1}(y)\right)' \right| dy,$$
where $T^{-1}$ is the inverse transformation. Next, we do a change of variable in the integral. Set $x = T^{-1}(y)$, from where we obtain $y = T(x)$ and $dy = T'(x)\, dx$; the prime denotes derivative with respect to the corresponding argument. Then
$$V(\tau_Y, m_Y) = \frac{1}{2} \int \left| \tau_X(x) - m_X(x) \right| \left| \left(T^{-1}\right)'(T(x)) \right| \left| T'(x) \right| dx.$$
Now since $T$ is a one-to-one transformation, $T$ is either increasing or decreasing on different segments of its domain. Thus
$$V(\tau_Y, m_Y) = \frac{1}{2} \int \left| \tau_X(x) - m_X(x) \right| dx = V(\tau_X, m_X),$$
where $\left| \left(T^{-1}\right)'(T(x)) \right| \left| T'(x) \right| = 1$. ☐
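Proposition 5 can be illustrated numerically. The R sketch below (ours) compares the total variation distance between two normal densities with the distance between the densities of the exponentiated variables, which are lognormal; since y = exp(x) is one-to-one, the two values agree.

    # Invariance of the total variation distance under a one-to-one transformation.
    tv_continuous <- function(f, g, lower, upper) {
      integrate(function(x) 0.5 * abs(f(x) - g(x)), lower, upper)$value
    }

    v_original    <- tv_continuous(function(x) dnorm(x, 0, 1),
                                   function(x) dnorm(x, 1, 1), -Inf, Inf)
    v_transformed <- tv_continuous(function(x) dlnorm(x, meanlog = 0, sdlog = 1),
                                   function(x) dlnorm(x, meanlog = 1, sdlog = 1), 0, Inf)
    c(v_original, v_transformed)   # the two values coincide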
A fundamental problem with the total variation distance is that it cannot be used to compute the distance between a discrete distribution and a continuous distribution, because the total variation distance between a continuous measure and a discrete measure is always the maximum possible, that is, 1. This inability of the total variation distance to discriminate between discrete and continuous measures can be interpreted as asking "too many questions" at once, without any prioritization. This limits its use despite its invariance properties.
We now discuss the relationship between the total variation distance and Fisher information. Denote by $m_\theta^{(n)}$ the joint density of n independent and identically distributed random variables. Then we have the following proposition.
Proposition 6. The total variation distance is locally equivalent to the Fisher information number, that is,
$$\lim_{\Delta \to 0} \frac{V\!\left(m_\theta^{(n)}, m_{\theta+\Delta}^{(n)}\right)}{|\Delta|} = \sqrt{\frac{n\, I(\theta)}{2\pi}},$$
where $m_\theta$, $m_{\theta+\Delta}$ are two discrete probability models.
Proof. Now, expand $m_{\theta+\Delta}^{(n)}(x)$ using a Taylor series in the neighborhood of θ to obtain
$$m_{\theta+\Delta}^{(n)}(x) = m_\theta^{(n)}(x) + \Delta \left(m_\theta^{(n)}(x)\right)' + o(\Delta),$$
where the prime denotes derivative with respect to the parameter θ. Further, write
$$\left(m_\theta^{(n)}(x)\right)' = m_\theta^{(n)}(x)\, u_\theta^{(n)}(x), \qquad u_\theta^{(n)}(x) = \sum_{i=1}^{n} \frac{\partial}{\partial \theta} \log m_\theta(x_i),$$
to obtain
$$V\!\left(m_\theta^{(n)}, m_{\theta+\Delta}^{(n)}\right) = \frac{|\Delta|}{2}\, E_\theta\!\left| u_\theta^{(n)} \right| + o(|\Delta|),$$
where
$$E_\theta\!\left| u_\theta^{(n)} \right| = \sqrt{n\, I(\theta)}\; E\!\left| \frac{u_\theta^{(n)}}{\sqrt{n\, I(\theta)}} \right|.$$
Therefore, assuming that $u_\theta^{(n)}/\sqrt{n I(\theta)}$ converges to a standard normal random variable in absolute mean, then
$$\lim_{\Delta \to 0} \frac{V\!\left(m_\theta^{(n)}, m_{\theta+\Delta}^{(n)}\right)}{|\Delta|} = \frac{1}{2} \sqrt{n\, I(\theta)}\; E|Z| = \sqrt{\frac{n\, I(\theta)}{2\pi}},$$
because $E_\theta\!\left[u_\theta^{(n)}\right] = 0$, $\operatorname{Var}_\theta\!\left[u_\theta^{(n)}\right] = n\, I(\theta)$ and $E|Z| = \sqrt{2/\pi}$ when $Z \sim N(0, 1)$. ☐
The total variation is a non-quadratic distance. It is however related to a quadratic distance, the Hellinger distance, defined as
$$H^2(\tau, m) = \sum_t \left( \sqrt{\tau(t)} - \sqrt{m(t)} \right)^2,$$
by the following inequality.
Proposition 7. Let τ, m be two probability mass functions. Then
$$\frac{1}{2}\, H^2(\tau, m) \leq V(\tau, m) \leq H(\tau, m).$$
Proof. Straightforward using the definitions of the distances involved and the Cauchy-Schwarz inequality. Holder's inequality provides the other direction of the bound. ☐
Note that $H^2(\tau, m) = \sum_t \left( \sqrt{\tau(t)} - \sqrt{m(t)} \right)^2 = 2\left(1 - \sum_t \sqrt{\tau(t)\, m(t)}\right)$; the square root of this quantity, that is $H(\tau, m)$, is known as Matusita's distance [6,7]. Further, define the affinity between two probability densities by
$$A(\tau, m) = \sum_t \sqrt{\tau(t)\, m(t)}.$$
Then, it is easy to prove that
$$1 - A(\tau, m) \leq V(\tau, m).$$
The above inequality indicates the relationship between total variation and Matusita's distance.
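The quantities above are straightforward to compute. The R sketch below (ours, with illustrative distributions) returns the total variation distance, the squared Matusita (Hellinger-type) distance, and the affinity, and verifies the identity relating the last two.

    # Total variation, squared Matusita distance and affinity for a pair of pmfs.
    support <- 0:10
    tau <- dbinom(support, 10, 0.3)
    m   <- dpois(support, 3); m <- m / sum(m)

    tv        <- 0.5 * sum(abs(tau - m))
    matusita2 <- sum((sqrt(tau) - sqrt(m))^2)   # squared Matusita distance
    affinity  <- sum(sqrt(tau * m))

    c(tv = tv, matusita2 = matusita2, affinity = affinity,
      identity_check = 2 * (1 - affinity))      # equals matusita2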
  4. Mixture Index of Fit
Rudas, Clogg, and Lindsay [8] proposed a new index of fit approach to evaluating the goodness of fit of contingency tables, based on the mixture model framework. The approach focuses attention on the discrepancy between the model and the data, and allows comparisons across studies. Suppose $\mathcal{M}_0$ is the baseline model. The family of models which is proposed for evaluating goodness of fit is a two-point mixture model given by
$$\mathcal{M}_\pi = \left\{ (1-\pi)\, m + \pi\, e \;:\; m \in \mathcal{M}_0,\ e \text{ an arbitrary probability distribution} \right\}.$$
Here π denotes the mixing proportion, which is interpreted as the proportion of the population outside the model $\mathcal{M}_0$. In the robustness literature the mixing proportion corresponds to the contamination proportion, as explained below. In the contingency table framework $m$, $e$ describe the tables of probabilities for each latent class. The family of models $\mathcal{M}_\pi$ defines a class of nested models as π varies from zero to one. Thus, if the model $\mathcal{M}_0$ does not fit the data well, then by increasing π, the model $\mathcal{M}_\pi$ will be an adequate fit for π sufficiently large.
We can motivate the index of fit by thinking of the population as being composed of two classes with proportions $1 - \pi$ and $\pi$, respectively. The first class is perfectly described by the model $\mathcal{M}_0$, whereas the second class contains the "outliers". The index of fit can then be interpreted as the fraction of the population intrinsically outside $\mathcal{M}_0$, that is, the proportion of outliers in the sample.
We note here that these ideas can be extended beyond the contingency table framework. In our setting, the probability distribution describing the true data generating mechanism may be written as $\tau = (1-\pi)\, m_\theta + \pi\, e$, where $0 \leq \pi \leq 1$ and $e$ is arbitrary. This representation of $\tau$ is not unique, in that we can construct another representation $\tau = (1-\pi')\, m_{\theta'} + \pi'\, e'$. However, there always exists a smallest, unique $\pi$ such that there exists a representation of $\tau$ that puts the maximum proportion in one of the population classes. Next, we define formally the mixture index of fit.
Definition 7. (Rudas, Clogg, and Lindsay [8]) The mixture index of fit π* is defined by
$$\pi^* = \min\left\{ \pi \;:\; \tau = (1-\pi)\, m_\theta + \pi\, e,\ \theta \in \Theta,\ e \text{ an arbitrary probability distribution},\ 0 \leq \pi \leq 1 \right\}.$$
Notice that $\pi^*$ is a distance. This is because if we set $\rho(\tau, m_\theta) = \min\{\pi : \tau = (1-\pi)\, m_\theta + \pi\, e\}$ for a fixed θ, we have $\rho(\tau, m_\theta) \geq 0$ and $\rho(\tau, m_\theta) = 0$ if $\tau = m_\theta$.
Definition 8. Define the statistical distance $\pi^*(\tau, m)$ as follows:
$$\pi^*(\tau, m) = \min\left\{ \pi \;:\; \tau = (1-\pi)\, m + \pi\, e,\ e \text{ a probability distribution},\ 0 \leq \pi \leq 1 \right\}.$$
Remark 3. Note that, to be able to present Proposition 8 below, we have turned arbitrary discrete distributions into vectors. As an example, if the sample space is $\mathcal{T} = \{0, 1, 2\}$ and $\tau$ assigns probabilities $\tau(0)$, $\tau(1)$, $\tau(2)$, we write this discrete distribution as the vector $(\tau(0), \tau(1), \tau(2))$. If, furthermore, we consider the vectors $(1, 0, 0)$, $(0, 1, 0)$, and $(0, 0, 1)$ as degenerate distributions assigning mass 1 at positions 0, 1, 2, then $\tau = \tau(0)(1, 0, 0) + \tau(1)(0, 1, 0) + \tau(2)(0, 0, 1)$. This representation of distributions is used in the proof of Proposition 8.
Proposition 8. For fixed $m$ and $\pi$, the set of vectors $\tau$ satisfying the relationship $\tau = (1-\pi)\, m + \pi\, e$, for some probability vector $e$, is a simplex with extremal points $(1-\pi)\, m + \pi\, \delta_t$, where $\delta_t$ is the vector with 1 at the $t$-th position and 0 everywhere else.
Proof. Given $\pi$ with $0 \leq \pi \leq 1$, there exists a representation of
$$\tau = (1-\pi)\, m + \pi\, e.$$
Write any arbitrary discrete distribution $e$ as follows:
$$e = \sum_t e(t)\, \delta_t,$$
where $e(t) \geq 0$, $\sum_t e(t) = 1$, and $\delta_t$ takes the value 1 at the $t$-th position and the value 0 everywhere else. Then
$$\tau = (1-\pi)\, m + \pi \sum_t e(t)\, \delta_t = \sum_t e(t) \left[ (1-\pi)\, m + \pi\, \delta_t \right],$$
which belongs to a simplex. ☐
Proposition 9. For a fixed probability mass function $m$, the mixture index of fit equals
$$\pi^*(\tau, m) = \sup_t \left( 1 - \frac{\tau(t)}{m(t)} \right).$$
Proof. Then, for any representation $\tau = (1-\pi)\, m + \pi\, e$ with $e \geq 0$, we have $\tau(t) \geq (1-\pi)\, m(t)$, that is,
$$1 - \pi \leq \frac{\tau(t)}{m(t)} \quad \text{for all } t,$$
with equality at some $t$ for the smallest such $\pi$. Let now the error term be
$$e(t) = \frac{\tau(t) - (1-\pi^*)\, m(t)}{\pi^*}.$$
Then $e(t) \geq 0$, $\sum_t e(t) = 1$, and $\pi^*$ cannot be made smaller without making $e$ negative at a point $t_0$. This concludes the proof. ☐
Corollary 2. We have $\pi^*(\tau, m) = 1$ if there exists $t_0$ such that $\tau(t_0) = 0$ and $m(t_0) > 0$.
Proof. By Proposition 9, $\pi^*(\tau, m) = \sup_t \left(1 - \tau(t)/m(t)\right) \leq 1$, but it equals 1 at $t_0$. ☐
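For discrete distributions the fixed-m index is simple to compute, and minimizing it over the model parameter gives a small illustration of the mixture index of fit. The R sketch below is ours (not the pistar package referenced in Section 6); the binomial model family and the contaminated pmf are hypothetical choices.

    # Mixture index of fit for a fixed model pmf m (assumes m(t) > 0 on the support):
    # the smallest pi with tau = (1 - pi) m + pi e for a proper error pmf e.
    pistar_fixed_m <- function(tau, m) max(1 - tau / m)

    support <- 0:10
    tau <- 0.9 * dbinom(support, 10, 0.3) + 0.1 * dbinom(support, 10, 0.8)

    # Model family Binomial(10, p): minimize the fixed-m index over p.
    pistar_of_p <- function(p) pistar_fixed_m(tau, dbinom(support, 10, p))
    fit <- optimize(pistar_of_p, interval = c(0.01, 0.99))
    fit$minimum     # fitted value of p
    fit$objective   # estimated mixture index of fit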
One of the advantages of the mixture index of fit is that it has an intuitive interpretation that does not depend upon the specific nature of the model being assessed. Liu and Lindsay [9] extended the results of Rudas et al. [8] to the Kullback-Leibler distance. Computational aspects of the mixture index of fit are discussed in Xi and Lindsay [4] as well as in Dayton [10] and Ispány and Verdes [11].
Finally, a new interpretation of the mixture index of fit was presented by Ispány and Verdes [11]. Let $\mathcal{P}$ be the set of probability measures and $\tau \in \mathcal{P}$. If $d$ is a distance measure on $\mathcal{P}$ and $\mathcal{M} \subset \mathcal{P}$, then $\pi^*$ is the least non-negative solution of the equation $d(\tau, \mathcal{M}_\pi) = 0$ in $\pi$, where $\mathcal{M}_\pi$ is the two-point mixture family defined above.
Next, we offer some interpretations associated with the mixture index of fit. The statistical interpretations made with this measure are attractive, as any statement based on the model applies to at least $100(1-\pi^*)\%$ of the population involved. However, while the "outlier" model seems interpretable and attractive, the distance itself is not very robust.
In other words, small changes in the probability mass function do not necessarily mean small changes in distance. This is because if $m(t_0) > 0$, then a change of $\varepsilon$ in $\tau(t_0)$ from $\varepsilon$ to 0 causes $\pi^*(\tau, m)$ to go to 1. Moreover, assume that our framework is that of continuous probability measures, and that our model is a normal density. If $\tau$ is a lighter tailed distribution than our normal model $m$, then
$$\frac{\tau(x)}{m(x)} \to 0 \quad \text{as } |x| \to \infty,$$
and therefore
$$\pi^*(\tau, m) = \sup_x \left( 1 - \frac{\tau(x)}{m(x)} \right) = 1.$$
That is, light tailed densities are interpreted as consisting entirely of outliers. Therefore, the mixture index of fit measures error from the model in a "one-sided" way. This is in contrast to the total variation, which measures the size of "holes" as well as of the "outliers" by allowing the distributional errors to be neutral.
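The one-sided behavior can be seen numerically. In the R sketch below (ours), the truth is taken to be a normal density with a smaller variance than the N(0, 1) model; the supremum of 1 − τ(x)/m(x) is essentially 1, while the total variation distance remains moderate.

    # Light-tailed truth versus a N(0,1) model: large mixture index, moderate TV.
    tau <- function(x) dnorm(x, mean = 0, sd = 0.5)   # lighter tails than the model
    m   <- function(x) dnorm(x, mean = 0, sd = 1.0)

    x_grid <- seq(-6, 6, by = 0.01)
    pistar_approx <- max(1 - tau(x_grid) / m(x_grid))                 # close to 1
    tv_value      <- integrate(function(x) 0.5 * abs(tau(x) - m(x)),
                               lower = -Inf, upper = Inf)$value       # moderate

    c(pistar = pistar_approx, total_variation = tv_value)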
In what follows, we show that a mixture representation for the true distribution with a small mixing proportion implies a small total variation distance between the true probability mass function and the assumed model m. Specifically, we have the following.
Proposition 10. Let π* be the mixture index of fit. If $\tau = (1-\pi^*)\, m + \pi^*\, e$, then $V(\tau, m) \leq \pi^*$.
Proof. Write
$$V(\tau, m) = \frac{1}{2} \sum_t \left| \tau(t) - m(t) \right| = \frac{1}{2} \sum_t \left| (1-\pi^*)\, m(t) + \pi^*\, e(t) - m(t) \right|,$$
with $\tau(t) = (1-\pi^*)\, m(t) + \pi^*\, e(t)$. This is because there always exists a smallest unique $\pi^*$ such that $\tau$ can be represented as a mixture model.
Thus, the above relationship can be written as
$$V(\tau, m) = \frac{\pi^*}{2} \sum_t \left| e(t) - m(t) \right| = \pi^*\, V(e, m) \leq \pi^*.$$
☐
There is a mixture representation that connects the total variation with the mixture index of fit. This is presented below.
Proof. Fix π; for any given m, let a solution to Equation (1) be given. Define the two relevant sets and note that, since the corresponding sums balance, the total variation can be expressed over either set. Rewrite now Equation (1) so that, ignoring the constraints, every pair satisfying the rewritten equation also satisfies it for some number. Moreover, such a pair must be chosen so that the nonnegativity constraints are satisfied. Hence, varying this number over its admissible range gives a class of solutions. To determine the extremal solution, add the corresponding relations; the maximum value is obtained at the boundary of the admissible range, and the result follows. ☐
Therefore, for small $\pi^*$ the mixture index of fit and the total variation distance are nearly equal.
  5. Kullback-Leibler Distance
The Kullback-Leibler distance [12] is extensively used in statistics and in particular in model selection. The celebrated AIC model selection criterion [13] is based on this distance. In this section, we present the Kullback-Leibler distance and some of its properties, with particular emphasis on interpretations.
Definition 9. The Kullback-Leibler distance between two densities τ, m is defined as the expectation of the logarithm of the ratio of the two densities, written either as a sum (discrete case) or as an integral (continuous case).
Proposition 12. The Kullback-Leibler distance is nonnegative, that is, $KL(\tau, m) \geq 0$, with equality if and only if $\tau = m$.
Proof. Write the distance as $\sum_t m(t)\, G\!\left(\tau(t)/m(t)\right)$, where $G$ is a convex, non-negative function that equals 0 at 1. Therefore $KL(\tau, m) \geq 0$. ☐
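A minimal R sketch (ours) of the discrete form of the Kullback-Leibler distance is given below; it uses one common convention for the argument order, $KL(\tau, m) = \sum_t \tau(t)\log(\tau(t)/m(t))$, together with the convention $0 \cdot \log 0 = 0$, and it assumes $m(t) > 0$ wherever $\tau(t) > 0$.

    # Kullback-Leibler distance between two pmfs on a common support.
    kl_distance <- function(tau, m) {
      pos <- tau > 0                          # convention: 0 * log(0 / m) = 0
      sum(tau[pos] * log(tau[pos] / m[pos]))
    }

    support <- 0:10
    tau <- dbinom(support, 10, 0.3)
    m   <- dpois(support, 3); m <- m / sum(m)
    kl_distance(tau, m)    # nonnegative; zero only when tau = m (Proposition 12)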
Definition 10. We define the likelihood distance between two densities τ, m as  The intuition behind the above expression of the likelihood distance comes from the fact that the log-likelihood, in the case of discrete random variables taking m discrete values (m being the number of groups), can be written, after appropriate algebraic manipulations, in the above form.
Alternatively, we can write the likelihood distance in an equivalent form and use this relationship to obtain insight into connections of the likelihood distance with the chi-squared measures studied by Markatou et al. [3].
Specifically, if we write Pearson's chi-squared statistic (in its distance form) as $\chi^2(\tau, m) = \sum_t \left(\tau(t) - m(t)\right)^2 / m(t)$, then from the functional relationship between the two measures we obtain that the likelihood distance is bounded above by the chi-squared measure. However, it is also clear from the right tails of the functions that there is no way to bound the likelihood distance below by a multiple of $\chi^2(\tau, m)$. Hence, these measures are not equivalent in the same way that the Hellinger distance and the symmetric chi-squared are (see Lemma 4, Markatou et al. [3]). In particular, knowing that the likelihood distance is small is no guarantee that all Pearson z-statistics are uniformly small.
On the other hand, one can show by the same mechanism that the likelihood distance bounds, up to a constant, the symmetric chi-squared distance $S^2(\tau, m)$, given as
$$S^2(\tau, m) = \sum_t \frac{\left(\tau(t) - m(t)\right)^2}{\left(\tau(t) + m(t)\right)/2}.$$
It is therefore true that a small likelihood distance implies small z-statistics with blended variance estimators. However, the reverse is not true, because the right tail in $r$ for the symmetric chi-squared is of magnitude $r$, as opposed to $r \log r$ for the likelihood distance.
These comparisons provide some feeling for the statistical interpretation of the likelihood distance. Its meaning as a measure of model misspecification is unclear. Furthermore, our impression is that the likelihood distance, like Pearson's chi-squared, is too sensitive to outliers and gross errors in the data. Despite the Kullback-Leibler distance's theoretical and computational advantages, a point of inconvenience in the context of model selection is its lack of symmetry. One can show that reversing the roles of the arguments in the Kullback-Leibler divergence can yield substantially different results. The sum of the Kullback-Leibler distance and the likelihood distance produces the symmetric Kullback-Leibler distance, or J divergence. This measure is symmetric in its arguments, and when used as a model selection measure it is expected to be more sensitive than each of its individual components.
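The lack of symmetry is easy to demonstrate numerically. The R sketch below (ours, using the same argument-order convention as the previous sketch) evaluates the divergence in both orders and adds them to obtain the symmetric (J) divergence.

    # Asymmetry of the Kullback-Leibler distance and the J divergence.
    kl_distance <- function(p, q) { pos <- p > 0; sum(p[pos] * log(p[pos] / q[pos])) }

    support <- 0:10
    tau <- dbinom(support, 10, 0.3)
    m   <- dpois(support, 3); m <- m / sum(m)

    kl_tau_m <- kl_distance(tau, m)    # one argument order
    kl_m_tau <- kl_distance(m, tau)    # reversed order: generally a different value
    c(kl_tau_m = kl_tau_m, kl_m_tau = kl_m_tau, J = kl_tau_m + kl_m_tau)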
6. Computation and Applications of Total Variation, Mixture Index of Fit and Kullback-Leibler Distances
The distances discussed in this paper are used in a number of important applications. Euán et al. [14] use the total variation to detect changes in wave spectra, while Alvarez-Esteban et al. [15] cluster time series data on the basis of the total variation distance. The mixture index of fit has found a number of applications in the area of social sciences. Rudas et al. [8] provided examples of the application of π* to two-way contingency tables. Applications involving differential item functioning and latent class analysis were presented in Rudas and Zwick [16] and Dayton [17], respectively. Formann [18] applied it in regression models involving continuous variables. Finally, Revuelta [19] applied the π* goodness-of-fit statistic to finite mixture item response models that were developed mainly in connection with Rasch models [20,21].
The Kullback-Leibler (KL) distance [12] is fundamental in information theory and its applications. In statistics, the celebrated Akaike Information Criterion (AIC) [13,22], widely used in model selection, is based on the Kullback-Leibler distance. There are numerous additional applications of the KL distance in fields such as fluid mechanics, neuroscience, and machine learning. In economics, Smith, Naik, and Tsai [23] use the KL distance to simultaneously select the number of states and variables associated with Markov-switching regression models that are used in marketing and other business applications. The KL distance is also used in diagnostic testing for ruling in or ruling out disease [24,25], as well as in a variety of other fields [26].
Table 1 presents the software, written in R, that can be used to compute the aforementioned distances. Additionally, Zhang and Dayton [27] present a SAS program to compute the two-point mixture index of fit for two-class latent class analysis models with dichotomous variables. There are a number of different algorithms that can be used to compute the mixture index of fit for contingency tables. Rudas et al. [8] propose to use a standard EM algorithm; Xi and Lindsay [4] use sequential quadratic programming and discuss technical details and numerical issues related to applying nonlinear programming techniques to estimate π*. Dayton [10] discusses explicitly the practical advantages associated with the use of nonlinear programming as well as the limitations, while Pan and Dayton [28] study a variety of additional issues associated with computing π*. Additional algorithms associated with the computation of π* can be found in Verdes [29] and Ispány and Verdes [11].
We now describe a simulation study that aims to illustrate the performance of the total variation, Kullback-Leibler, and mixture index of fit as model selection measures. Data are generated from either an asymmetric contamination model or a symmetric contamination model, where the contamination percentage is the proportion of observations coming from the contaminating component. Specifically, we generate 500 Monte Carlo samples of sample sizes 200, 1000, and 5000 as follows. If the sample has size n and the percentage of contamination is ε, then εn observations are generated from the contaminating normal component (asymmetric or symmetric, respectively) and the remaining (1 − ε)n observations from a N(0, 1) model; specific values of the means and variances are used for the contaminating components. The total variation distance was computed between the simulated data and the N(0, 1) model. The Kullback-Leibler distance was calculated between the data generated from the aforementioned contamination models and a random sample of the same size n from the N(0, 1) model. When computing the mixture index of fit, we specified the component distribution as a normal distribution with initial mean 0 and variance 1. All simulations were carried out on a laptop computer with an Intel Core i7 processor and a 64-bit Windows 7 operating system. The R packages used are presented in Table 1.
Table 2 and Table 3 present means and standard deviations of the total variation and Kullback-Leibler distances as a function of the contamination model and the sample size. To compute the total variation distance we use the R function "TotalVarDist" of the R package "distrEx". It smooths the empirical distribution of the provided data using a normal kernel and computes the distance between the smoothed empirical distribution and the provided continuous distribution (in our case this distribution is N(0, 1)). We note here that the package "distrEx" provides an alternative option to compute the total variation, which relies on discretizing the continuous distribution and then computing the distance between the discretized continuous distribution and the data. We think that smoothing the data to obtain an empirical estimator of the density and then calculating its distance from the continuous density is a more natural way to handle the difference in scale between the discrete data and the continuous model. Lindsay [1] and Markatou et al. [3] discuss this phenomenon and call it discretization robustness. The Kullback-Leibler distance was computed using the function "KLD.matrix" of the R package "bioDist".
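For readers who prefer not to rely on the packages in Table 1, the base-R sketch below (ours; the contamination settings shown are illustrative and not necessarily those of the simulation study) mimics the computation just described: the simulated data are smoothed with a normal kernel and the absolute difference from the N(0, 1) density is integrated on a grid.

    # Base-R analogue: kernel-smoothed data versus the N(0,1) density.
    set.seed(1)
    n   <- 1000
    eps <- 0.10
    x   <- c(rnorm(round((1 - eps) * n), 0, 1),   # main component
             rnorm(round(eps * n), 5, 1))         # illustrative asymmetric contamination

    kde   <- density(x)                           # Gaussian kernel by default
    f_hat <- approxfun(kde$x, kde$y, yleft = 0, yright = 0)

    grid   <- seq(min(kde$x), max(kde$x), length.out = 2000)
    tv_hat <- 0.5 * sum(abs(f_hat(grid) - dnorm(grid))) * (grid[2] - grid[1])
    tv_hat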
We observe from the results of Table 2 and Table 3 that the total variation distance for small percentages of contamination is small, and generally smaller than the Kullback-Leibler distance, for both asymmetric and symmetric contamination models, with a considerably smaller standard deviation. The above behavior of the total variation distance in comparison to the Kullback-Leibler distance manifests itself across all sample sizes used.
Table 4 presents the mixture index of fit computed using the R function "pistar.uv" from the R package "pistar" (https://rdrr.io/github/jmedzihorsky/pistar/man/; accessed on 5 June 2018). Since the fundamental assumption in the definition of the mixture index of fit is that the population on which the index is applied is heterogeneous and expressed via the two-point model, we only used the asymmetric contamination model for various values of the contamination distribution.
We observe that the mixture index of fit generally estimates the mixing proportion well. We also observe (see Table 4) that when the second population has a mean close to zero, the bias associated with estimating the mixing (or contamination) proportion can be substantial. This is expected because such a contaminating population is very close to the N(0, 1) component, creating essentially a unimodal sample. As the means of the two normal components become more separated, the mixture index of fit provides better estimates of the mixing quantity and of the percentage of observations that need to be removed so that the N(0, 1) model provides a good fit to the remaining data points.
  7. Discussion and Conclusions
Divergence measures are widely used in scientific work, and popular examples of these measures include the Kullback-Leibler divergence, the Bregman divergence [30], the power divergence family of Cressie and Read [31], the density power divergence family [32] and many others. Two relatively recent books that discuss various families of divergences are Pardo [33] and Basu et al. [34].
In this paper we discuss specific divergences that do not belong to the family of quadratic divergences, and examine their role in assessing model adequacy. The total variation distance might be preferable as it seems closest to a robust measure, in that if the two probability measures differ only on a set of small probability, such as a few outliers, then the distance must be small. This was clearly exemplified in Table 2 and Table 3 of Section 6. Outliers influence chi-squared measures more. For example, Pearson's chi-squared distance can be made dramatically larger by increasing the amount of data in a cell with small model probability. In fact, if there is data in a cell with model probability zero, the distance is infinite. Note that if data occur in a cell with probability, under the model, equal to zero, then it is possible that the model is not true. Still, even in this case, we might wish to use the model on the premise that it provides a good approximation.
There is a pressing need for the further development of well-tested software for computing the mixture index of fit. This measure is intuitive and has found many applications in the social sciences. Reiczigel et al. [35] discuss bias-corrected point estimates of π*, as well as a bootstrap test and new confidence limits, in the context of contingency tables. Well-developed and tested software will further popularize the dissemination and use of this method.
The mixture index of fit ideas were extended to general model adequacy testing problems by Liu and Lindsay [9]. Recent work by Ghosh and Basu [36] presents a systematic procedure for generating new divergences. Ghosh and Basu [36], building upon the work of Liu and Lindsay [9], generate new divergences through suitable model adequacy tests using existing divergences. Additionally, Dimova et al. [37] use the quadratic divergences introduced in Lindsay et al. [2] and construct a model selection criterion from which AIC and BIC can be obtained as special cases.
In this paper, we discuss non-quadratic distances that are used in many scientific fields where the problem of assessing the fitted models is of importance. In particular, our interest centered around the properties and potential interpretations of these distances, as we think this offers insight into their performance as measures of model misspecification. One important aspect for the dissemination and use of these distances is the existence of well-tested software that facilitates computation. This is an area where further development is required.