Abstract
Pearson residuals aid the task of identifying model misspecification because they compare the model estimated from the data with the model assumed under the null hypothesis. We present different formulations of the Pearson residual system that account for the measurement scale of the data and study their properties. We further concentrate on the case of mixed-scale data, that is, data measured on both categorical and interval scales. We study the asymptotic properties and the robustness of minimum disparity estimators obtained in the case of mixed-scale data and exemplify the performance of the methods via simulation.
1. Introduction
Minimum disparity estimation has been studied extensively in models where the scale of the data is either interval or ratio (Beran [1], Basu and Lindsay [2]). It has also been studied in the discrete-outcomes case. Specifically, when the response variable is discrete and the explanatory variables are continuous, Pardo et al. [3] introduced a general class of distance estimators based on ϕ-divergence measures, the minimum ϕ-divergence estimators, and studied their asymptotic properties. These estimators can be viewed as an extension/generalization of the Maximum Likelihood Estimator (MLE). Pardo et al. [4] used the minimum ϕ-divergence estimator in a ϕ-divergence statistic to perform goodness-of-fit tests in logistic regression models, while Pardo and Pardo [5] extended the previous works to address testing problems in generalized linear models with binary data.
The case where data are measured on a discrete scale (either ordinal or, more generally, categorical) has also attracted the interest of other researchers. For instance, Simpson [6] demonstrated that minimum Hellinger distance estimators fulfill desirable robustness properties and for this reason can be effective in the analysis of count data prone to outliers. Simpson [7] also suggested tests based on the minimum Hellinger distance for parametric inference that are robust, as the density of the (parametric) model can be nonparametrically estimated. Markatou et al. [8] used weighted likelihood equations to obtain efficient and robust estimators in discrete probability models and applied their methods to logistic regression, whereas Basu and Basu [9] considered robust penalized minimum disparity estimators for multinomial models with good small-sample efficiency.
Moreover, Gupta et al. [10], Martín and Pardo [11] and Castilla et al. [12] used the minimum ϕ-divergence estimator to provide solutions to testing problems in polytomous regression models. Working in a similar fashion, Martín and Pardo [13] studied the properties of the family of ϕ-divergence estimators for log-linear models with linear constraints under multinomial sampling, in order to identify potential associations between variables in multi-way contingency tables. Pardo and Martín [14] presented an overview of works associated with contingency tables of symmetric structure on the basis of minimum ϕ-divergence estimators and minimum ϕ-divergence test statistics. Additional works include Pardo and Pardo [15] and Pardo et al. [16]. Alternative power divergence measures have been introduced by Basu et al. [17].
The class of f- or ϕ-divergences was originally introduced by Csiszár [18]. The structural characteristics of this class and their relationship to the concepts of efficiency and robustness were studied, for the case of discrete probability models, by Lindsay [19]. Basu and Lindsay [2] studied the properties of estimators derived by minimizing divergences between continuous models and presented examples showing the robustness of these estimates. We also note that Tamura and Boos [20] studied minimum Hellinger distance estimation for multivariate location and covariance. Additionally, formal robustness results were presented in Markatou et al. [8,21] in connection with the introduction of weighted likelihood estimation.
If $G$ is a real-valued, convex function defined on $[0, \infty)$ and such that $G(u)/u$ converges to $0$ as $u \to \infty$, with the conventions $0 \cdot G(0/0) = 0$ and $0 \cdot G(u/0) = u \lim_{v \to \infty} G(v)/v$, the class of ϕ-divergences is defined as
$$\rho(d, m_\beta) = \sum_{t \in T} m_\beta(t)\, G\!\left(\frac{d(t)}{m_\beta(t)}\right),$$
where $d$, $m_\beta$ are two probability models. Notice that we define $\rho$ on discrete probability models first, where $T$ is a discrete sample space, possibly infinite, and $\beta \in \Theta$, where $\Theta \subseteq \mathbb{R}^p$ is the parameter space. Furthermore, different forms of the function $G$ provide different statistical distances or divergences.
We can change the argument of the function $G$ from $d(t)/m_\beta(t)$ to $\delta(t) = d(t)/m_\beta(t) - 1$. Then, $G$ is a function of the Pearson residual $\delta(t)$, which takes values in $[-1, \infty)$. If the measurement scale is interval/ratio, then the Pearson residuals are modified to reflect and adjust for the discrepancy of scale between the data, which are always discrete, and the assumed continuous probability model (see Basu and Lindsay [2]).
The Pearson residual is used by Lindsay [19], Basu and Lindsay [2] and Markatou et al. [8,21] in investigating the robustness of the minimum disparity and weighted likelihood estimators, respectively. This residual system allows one to identify distributional errors. If, in the equation of the Pearson residual, we replace $d(t)$ with its best nonparametric representative $d_n(t)$, the proportion of observations in a sample of size $n$ with value $t$, then $\delta_n(t) = d_n(t)/m_\beta(t) - 1$. We note that the Pearson residuals are called so because the choice $G(\delta) = \delta^2/2$ returns Pearson's chi-squared distance. Furthermore, these residuals are not symmetric, since they take values in $[-1, \infty)$, and are not standardized to have identical variances.
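As a concrete illustration (our sketch, not code from the paper; the Poisson null and the flagging threshold are assumptions made for the example), the following R snippet computes the residuals $\delta_n(t) = d_n(t)/m_\beta(t) - 1$ for count data against a hypothesized Poisson model and flags cells with very large residuals as surprising observations:

```r
# Sketch: Pearson residuals for count data against a hypothesized Poisson
# model; lambda0 and the flagging threshold are illustrative choices.
set.seed(1)
x <- c(rpois(95, lambda = 2), rep(15, 5))                # 5 gross outliers at t = 15
lambda0 <- 2
tvals <- 0:max(x)
d_n <- tabulate(factor(x, levels = tvals)) / length(x)   # proportions d_n(t)
m_b <- dpois(tvals, lambda0)                             # model probabilities m_beta(t)
delta <- d_n / m_b - 1                                   # Pearson residuals
tvals[delta > 100]                                       # cells with huge residuals flag the outliers
```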
How does robustness fit into this picture? In the robustness literature, there is a denial of the model’s truth. Following this logic, the framework based on disparities starts with goodness-of-fit by identifying a measure that assesses whether the model fits the data adequately. Then, we examine whether this measure of adequacy is robust and in what sense. A fundamental tool that assists in measuring the degree of robustness is the Pearson residual, because it measures model misspecification. That is, Pearson residuals provide information about the degree to which the specified model fits the data. In this context, outliers are defined as those data points that have a low probability of occurrence under the hypothesized model. Such probabilistic outliers are called surprising observations (Lindsay [19]). Furthermore, the robustness of estimators obtained via minimization of the divergence measures we discuss here is indicated by the shape of the associated Residual Adjustment Function (RAF), a concept that is reviewed in Section 2. Of note is that in contingency table analysis, the generalized residual system is used for examination of sources of error in models for contingency tables, see, for example, Haberman [22], Haberman and Sinharay [23]. The concept of generalized residuals in the case of generalized linear models is discussed, for example, in Pierce and Schafer [24].
Many data sets comprise data measured on both categorical (ordinal or nominal) and interval/ratio scales. We can think of these data as realizations of discrete and continuous random variables, respectively. Examples of data sets that include mixed-scale data are electronic health records containing diagnostic codes (discrete) and laboratory measurements (e.g., blood pressure, alanine aminotransferase (ALT) measurements on interval/ratio scale) and marketing data (customer records include income and gender information). Additional examples include data from developmental toxicology (Aerts et al. [25]), where fetal data from laboratory animals include binary, categorical and continuous outcomes. In this context, the joint density of the discrete and continuous random variables is given as $m_\beta(y \mid x)\, \pi_\tau(x)$, where $\beta$, $\tau$ are parameter vectors indexing the conditional (on $x$) probability density function of $y$ and the probability function of $x$, respectively.
Work on the analysis of mixed-scale data is complicated by the fact that it is difficult to identify suitable joint probability distributions that describe both measurement scales of the data, although a number of ad hoc approaches to the analysis of mixed-scale data have been used in applications. Olkin and Tate [26] proposed multivariate correlation models for mixed-scale data. Copulas also provide an attractive approach to modeling the joint distribution of mixed-scale data, though copulas are less straightforward to implement, and there are subtle identifiability issues that complicate the specification of a model (Genest and Nešlehová [27]).
To formulate the joint distribution in the mixed-scale variables case, one can specify the marginal distribution of the discrete variables and the conditional distribution of the continuous variables given the discrete ones. Alternatively, one can specify the marginal distribution of the continuous variables and the conditional distribution of the discrete variables given the continuous variables. Of note here is that the direction of factorization generally yields distinct model interpretations and results. The first approach has received much attention in the literature in the context of the analysis of data with mixtures of categorical and continuous variables. Here, the continuous variables follow different multivariate normal distributions for each possible setting of the categorical variable values; the categorical variables then follow an arbitrary marginal multinomial distribution. This model is known in the literature as the conditional Gaussian distribution model and is central in the discussion of graphical association models with mixed-scale variables (Lauritzen and Wermuth [28]). A very special case of this model is used in our simulations.
In this paper, we develop robust methods for mixed-scale data. Specifically, Section 2 reviews basic concepts in minimum disparity estimation, Section 3 defines Pearson residuals for data measured on discrete, interval/ratio and mixed scales and studies their properties. Section 4 establishes the optimization problem for obtaining estimators of the model parameters, while Section 5 and Section 6 establish the robustness and asymptotic properties of these estimators. Finally, Section 7 presents simulations showing the performance of these methods and Section 8 offers discussion. Appendix A includes proofs of the theoretical results.
2. Concepts in Minimum Disparity Estimation
Beran [1] introduced a robust method to estimate the parameters of a statistical model, called minimum Hellinger distance estimation. The parameter estimator is obtained by minimizing the Hellinger distance between a parametric model density and a nonparametric density estimator. Lindsay [19] extended the aforementioned method to incorporate many other distances, and introduced the concept of the residual adjustment function in the context of minimum disparity estimation. The Minimum Distance Estimators (MDE) of a parameter vector $\beta$ are obtained by minimizing, over $\beta \in \Theta$, the distance (or disparity)
$$\rho(d, m_\beta) = \sum_{t \in T} G\big(\delta(t)\big)\, m_\beta(t), \tag{1}$$
where the assumed model $m_\beta$ is a probability mass function. When the model is continuous, the MDE of the parameter vector is obtained by minimizing over $\beta \in \Theta$ the quantity
$$\rho\big(f^*, m^*_\beta\big) = \int G\big(\delta(t)\big)\, m^*_\beta(t)\, dt, \tag{2}$$
where $f^*(t) = \int k(t; y, h)\, dF_n(y)$, $m^*_\beta(t) = \int k(t; y, h)\, dM_\beta(y)$, $F_n$ is the empirical distribution function obtained from the data, $M_\beta$ is the model distribution function and $k$ is a smooth family of kernel functions. One example is the normal density with mean $t$ and standard deviation $h$. Furthermore, $\delta(t)$ is the Pearson residual defined as $\delta(t) = f^*(t)/m^*_\beta(t) - 1$. Lindsay [19] and Basu and Lindsay [2] discuss the efficiency and robustness properties of these estimators.
If
$$G(\delta) = \frac{(\delta + 1)^{\lambda + 1} - (\delta + 1)}{\lambda(\lambda + 1)},$$
we obtain the class of power divergence measures, indexed by $\lambda$. Notice that we have $G(0) = 0$. Different values of $\lambda$ offer different measures; for example, when $\lambda = -2$ we obtain Neyman's chi-squared divided by 2 measure, while $\lambda \to -1$ and $\lambda = -1/2$ return the Kullback-Leibler and Hellinger distances, respectively.
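As a check (standard algebra, not reproduced from the paper; $\rho_\lambda$ is our label for the disparity generated by the $G$ above), the $\lambda \to 0$ and $\lambda = -1/2$ members reduce to the familiar forms

$$\lim_{\lambda \to 0} \rho_\lambda(d, m_\beta) = \sum_{t} d(t)\, \log \frac{d(t)}{m_\beta(t)} \quad \text{(likelihood disparity)}, \qquad \rho_{-1/2}(d, m_\beta) = 2 \sum_{t} \Big(\sqrt{d(t)} - \sqrt{m_\beta(t)}\Big)^{2} \quad \text{(twice-squared Hellinger)}.$$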
Under appropriate conditions, the estimating equations obtained from (1) and (2) can be written as
$$\sum_{t} A\big(\delta(t)\big)\, \nabla_\beta m_\beta(t) = 0$$
or
$$\int A\big(\delta(t)\big)\, \nabla_\beta m^*_\beta(t)\, dt = 0,$$
where $A(\delta) = (1 + \delta)\, G'(\delta) - G(\delta)$ and the prime denotes differentiation with respect to $\delta$.
Lindsay [19] has shown that the structural characteristics of the function $A(\delta)$ play an important role in the robustness and efficiency properties of these methods. Furthermore, without loss of generality, we can center and rescale $A(\delta)$, and define the RAF as follows.
Definition 1
(Lindsay [19]). Let $A(\delta)$ be an increasing and twice differentiable function on $[-1, \infty)$ defined as
$$A(\delta) = (1 + \delta)\, G'(\delta) - G(\delta),$$
where $G$ is strictly convex and twice differentiable with respect to δ on $[-1, \infty)$, with $A(0) = 0$ and $A'(0) = 1$. Then, $A(\delta)$ is called the residual adjustment function.
Remark 1.
The second-order differentiability of $G$, in addition to its strict convexity, implies that $A(\delta)$ is a strictly increasing function of δ on $[-1, \infty)$. Thus, we can define $A(\delta)$ as above without changing the solutions of the aforementioned estimating equations in the discrete case (see Lindsay [19], p. 1089). In the continuous case, such standardization does not change the estimating properties of the associated disparities (see Basu and Lindsay [2], p. 687).
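Two standard examples from Lindsay [19] make the contrast concrete: the likelihood disparity and the (twice-squared) Hellinger distance have residual adjustment functions

$$A_{\mathrm{LD}}(\delta) = \delta, \qquad A_{\mathrm{HD}}(\delta) = 2\big(\sqrt{\delta + 1} - 1\big).$$

Both satisfy $A(0) = 0$ and $A'(0) = 1$, but $A_{\mathrm{HD}}$ increases much more slowly for large δ, so surprising observations receive a dampened adjustment; this is the mechanism behind the robustness of the Hellinger-type methods.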
Two fundamental and at the same time conflicting goals in robust statistics are robustness and efficiency. In the traditional literature on robustness, first-order efficiency is sacrificed and, instead, safety of the estimation or testing method against outliers is guaranteed. Here, one adheres to the notion that information about the robustness of a method is carried by the influence function. In our setting, using the influence function to characterize the robustness properties of the associated estimation procedures is misleading. Instead, the shape of the RAF, $A(\delta)$, provides information on the extent to which our procedures can be characterized as robust. The interested reader is directed to Lindsay [19] for further discussion on this topic.
3. Pearson Residual Systems
In this section, we define various Pearson residuals, appropriate for the measurement scale of the data. We introduce our notation first.
Let $(x_1, y_1), \ldots, (x_n, y_n)$ be realizations from $n$ independent and identically distributed random vectors that follow a distribution with density $m_\beta(x, y)$. Recall that we use the word density to denote a general probability function, independently of whether the random variables are discrete, continuous or mixed. In what follows, we define different Pearson residual systems that account for the measurement scale of the data and study their properties.
Case 1: Both X and Y are discrete.
In this case, the pairs $(x_i, y_i)$ follow a discrete probability mass function $m_\beta(x, y)$. Define the Pearson residual as
$$\delta(x, y) = \frac{d(x, y)}{m_\beta(x, y)} - 1,$$
where $d(x, y) = n(x, y)/n$, and $n(x, y)$ is the number of observations in the cell with $X = x$ and $Y = y$.
Note that this definition of the Pearson residual is nonparametric on the discrete support of $X$. In the case of regression, one can carry out a semiparametric argument to obtain the estimators of the vector $\beta$ and of the probabilities $\pi_x = P(X = x)$.
We now establish that, under correct model specification, the residual converges, almost surely, to zero.
Proposition 1.
When the model is correctly specified, $\delta(x, y) \to 0$ almost surely as $n \to \infty$.
Proof.
Write
$$d(x, y) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{x_i = x,\ y_i = y\}.$$
Then
$$E\big[\mathbb{1}\{x_i = x,\ y_i = y\}\big] = P(X = x, Y = y) = m_\beta(x, y),$$
where $\mathbb{1}\{\cdot\}$ is the indicator function. Furthermore, the summands are independent and bounded, and by the strong law of large numbers
$$d(x, y) \xrightarrow{a.s.} m_\beta(x, y), \quad n \to \infty.$$
Similarly,
$$\delta(x, y) = \frac{d(x, y)}{m_\beta(x, y)} - 1 \xrightarrow{a.s.} \frac{m_\beta(x, y)}{m_\beta(x, y)} - 1 = 0,$$
therefore the Pearson residual converges to zero almost surely under correct model specification. □
Case 2: Y is continuous and X is discrete.
This is the case in some ANOVA models. We can still define the Pearson residual in this setting as
$$\delta(x, t) = \frac{f^*(x, t)}{m^*_\beta(x, t)} - 1,$$
where
$$f^*(x, t) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{x_i = x\}\, k(t; y_i, h)$$
and
$$m^*_\beta(x, t) = \pi_x \int k(t; y, h)\, m_\beta(y \mid x)\, dy.$$
Then, we have the following proposition.
Proposition 2.
Assume the model is correctly specified and $k(t; y, h)$ is a continuous function. Then, $\delta(x, t) \to 0$ almost surely as $n \to \infty$.
Proof.
Under the strong law of large numbers,
$$\frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{x_i = x\}\, k(t; y_i, h) \xrightarrow{a.s.} \pi_x\, E\big[k(t; Y, h) \mid X = x\big].$$
Under the correct model specification, continuity of the kernel function and the fact that $F_n$ converges completely to $F$ (an implication of the Glivenko-Cantelli theorem),
$$E\big[k(t; Y, h) \mid X = x\big] = \int k(t; y, h)\, m_\beta(y \mid x)\, dy$$
(an extension of the Helly-Bray lemma). Therefore,
$$f^*(x, t) \xrightarrow{a.s.} m^*_\beta(x, t),$$
and hence $\delta(x, t) \to 0$, almost surely.
□
Case 3: Y is continuous and X is continuous.
In this case, the pairs $(x_i, y_i)$ follow a continuous probability distribution. The Pearson residual is then defined as
$$\delta(t) = \frac{f^*(t)}{m^*_\beta(t)} - 1,$$
where
$$f^*(t) = \int k(t; y, h)\, dF_n(y) \quad \text{and} \quad m^*_\beta(t) = \int k(t; y, h)\, dM_\beta(y).$$
As an example, we take the linear regression model with random carriers $X$, $y_i = x_i^{\mathsf T} \beta + \epsilon_i$ and $\epsilon_i \sim N(0, \sigma^2)$. Furthermore, assume that the random carriers follow a normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$. In this case, the quantities $z_i = (y_i - x_i^{\mathsf T} \beta)/\sigma$ are independent, identically distributed random variables when $\beta$ represents the vector of true parameters. Hence, the $z_i$'s represent realizations of a random variable $Z$ that has a completely known density, the standard normal. Thus,
$$f^*(z) = \frac{1}{n} \sum_{i=1}^{n} k(z; z_i, h),$$
and hence
$$\delta(z) = \frac{f^*(z)}{m^*(z)} - 1,$$
where $m^*(z)$ is the standard normal density smoothed by the same kernel.
The kernel is selected so that it facilitates easy computation. Kernels that do not entail loss of information when they are used to smooth the assumed parametric model are called transparent kernels (Basu and Lindsay [2]). Basu and Lindsay [2] provide a formal definition of transparent kernels and an insightful discussion of why transparent kernels do not exhibit information loss when convolved with the hypothesized model (see Section 3.1 of Basu and Lindsay [2]).
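For example (a standard computation), the normal kernel is transparent for normal models: smoothing $M_\beta = N(\mu, \sigma^2)$ with a normal kernel of bandwidth $h$ gives

$$m^{*}_{\beta}(t) = \int \frac{1}{h}\, \phi\!\left(\frac{t - y}{h}\right) dM_\beta(y) = N\big(t;\ \mu,\ \sigma^{2} + h^{2}\big),$$

so the smoothed model stays within the normal family and the same smoothing is applied to both the data and the model, with no information about the parameters lost in the convolution.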
4. Estimating Equations
In this section, we concentrate on Cases 1 and 2 presented in the previous section. We carefully outline the optimization problems and discuss the associated estimating equations for these two cases. The case where both X and Y are continuous has been discussed in the literature; see, for example, Markatou et al. [21].
Case 1: Both X and Y are discrete.
In this case, the minimum distance estimators of the parameter vector $\beta$ and of the probabilities $\pi_x$ are obtained by solving the following optimization problem:
$$\min_{\beta,\, \pi}\ \sum_{x} \sum_{y} G\big(\delta(x, y)\big)\, m_\beta(y \mid x)\, \pi_x \tag{3}$$
subject to
$$\sum_{x} \pi_x = 1, \qquad \pi_x \geq 0,$$
where $\delta(x, y) = d(x, y)/\big(\pi_x\, m_\beta(y \mid x)\big) - 1$. Optimization problem (3) is equivalent to the problem of solving the estimating equations obtained by setting the gradient of the corresponding Lagrangian equal to zero, subject to the same constraints.
The class of $G$ functions that we use creates distances that belong to the family of ϕ-divergences.
Proposition 3.
The estimating equations for β and $\pi_x$ are given as:
$$\sum_{x} \sum_{y} w\big(\delta(x, y)\big)\, d(x, y)\, u_\beta(y \mid x) = 0, \qquad \sum_{y} w\big(\delta(x, y)\big)\, d(x, y) = \pi_x \sum_{x'} \sum_{y} w\big(\delta(x', y)\big)\, d(x', y), \tag{4}$$
where $u_\beta(y \mid x) = \nabla_\beta \log m_\beta(y \mid x)$ is the score function. The function $w(\delta)$ is a weight function, such that $0 \leq w(\delta) \leq 1$, and it is defined as
$$w(\delta) = \frac{\big[A(\delta) + 1\big]^{+}}{\delta + 1},$$
with $[\,\cdot\,]^{+}$ indicating the positive part of the function.
Proof.
The main steps of the proof are provided in the Appendix A.1. □
Remark 2.
1. The above two estimating equations can be solved with respect to β and $\pi_x$. In an iterative algorithm, we can solve the second equation in (4) explicitly for $\pi_x$ to obtain
$$\hat{\pi}_x = \frac{\sum_{y} w\big(\delta(x, y)\big)\, d(x, y)}{\sum_{x} \sum_{y} w\big(\delta(x, y)\big)\, d(x, y)}.$$
This means that if the model does not fit well any of the y observed at a particular x, the weight for this x will drop as well; a computational sketch is given after this remark.
2. When G corresponds to the likelihood disparity, the estimating equation for β becomes
$$\sum_{x} \sum_{y} d(x, y)\, u_\beta(y \mid x) = 0$$
and the MLE is obtained. This is because the corresponding weight function is $w(\delta) \equiv 1$. In this case, the estimating equations for the $\pi_x$'s become $\hat{\pi}_x = d(x)$, the observed proportions, which are the estimating equations for the MLEs of $\pi_x$.
3. The Fisher consistency property of the function that introduces the estimates guarantees that the expectation of the corresponding estimating function is 0, under correct model specification.
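The following sketch (ours; the function names are hypothetical and the formulas are our reconstruction of (4)) computes the Hellinger-type weights and the explicit $\hat{\pi}_x$ update of item 1 for a table of observed proportions:

```r
# Sketch: weights w(delta) = [A(delta) + 1]^+ / (delta + 1) under the
# Hellinger RAF, and the explicit pi-hat update (our reconstruction of (4)).
A_hd <- function(delta) 2 * (sqrt(delta + 1) - 1)        # Hellinger RAF
w_hd <- function(delta)
  ifelse(delta <= -1, 0, pmax(A_hd(delta) + 1, 0) / (delta + 1))

# d: I x J matrix of observed proportions d(x, y); m: conditional model
# m_beta(y | x) with rows indexed by x; pi0: current estimates of pi_x.
update_pi <- function(d, m, pi0) {
  delta <- d / (pi0 * m) - 1                # cell-wise Pearson residuals
  wd <- w_hd(delta) * d                     # downweighted proportions
  rowSums(wd) / sum(wd)                     # normalized pi-hat
}
```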
Case 2: Y is continuous and X is discrete.
In this case, the estimates of the parameters $\beta$ and $\pi_x$ are obtained by solving the following optimization problem:
$$\min_{\beta,\, \pi}\ \sum_{x} \int G\big(\delta(x, t)\big)\, m^*_\beta(x, t)\, dt \tag{5}$$
subject to
$$\sum_{x} \pi_x = 1, \qquad \pi_x \geq 0.$$
In general, $m^*_\beta(x, t)$ depends on $x$; in the case where $X$, $Y$ are independent, $m^*_\beta(x, t) = \pi_x\, m^*_\beta(t)$, and the optimization problem stated above is equivalent to
$$\min_{\beta,\, \pi}\ \sum_{x} \pi_x \int G\big(\delta(x, t)\big)\, m^*_\beta(t)\, dt$$
subject to the same constraints.
Proposition 4.
The estimating equations for β and $\pi_x$ in the case of independence of $X$, $Y$ are given as follows:
$$\sum_{x} \pi_x \int A\big(\delta(x, t)\big)\, \nabla_\beta\, m^*_\beta(t)\, dt = 0, \qquad \int A\big(\delta(x, t)\big)\, m^*_\beta(t)\, dt = \lambda \quad \text{for all } x,$$
where $\lambda$ is the Lagrange multiplier associated with the constraint $\sum_x \pi_x = 1$, $A(\delta) = (1 + \delta)\, G'(\delta) - G(\delta)$ is the residual adjustment function (RAF) that corresponds to the function $G$, and $G'$ is the derivative of $G$ with respect to δ.
Proof.
Straightforward, after differentiating the Lagrangian with respect to β and $\pi_x$. □
Case 3: Y is continuous and X is continuous.
In this case, we refer the reader to Basu and Lindsay [2].
5. Robustness Properties
Hampel et al. [29] and Hampel [30,31] define robust statistics as the “statistics of approximate parametric models”, and introduce one of the fundamental tools of robust statistics, the concept of the influence function, in order to investigate the behavior of a statistic expressed as a functional $T(F)$. The influence function is a heuristic tool with the intuitive interpretation of measuring the bias, caused by an infinitesimal contamination at a point $x$, on the estimate, standardized by the mass of the contamination. Its formal definition is as follows:
Definition 2.
The influence function of a functional T at the distribution F is given as
$$\operatorname{IF}(x; T, F) = \lim_{\epsilon \to 0^{+}} \frac{T\big((1 - \epsilon) F + \epsilon \Delta_x\big) - T(F)}{\epsilon}$$
in those $x$ where the limit exists, and $\Delta_x$ is the Dirac measure defined as
$$\Delta_x(A) = \begin{cases} 1, & x \in A, \\ 0, & x \notin A. \end{cases}$$
If an estimator has a bounded influence function, the estimator is considered to be robust to outliers, that is, to data points that lie away from the pattern set by the majority of the data. The effect of bounding the influence function is the sacrifice of efficiency; estimators with bounded influence functions, while not affected by outlying points, are not fully efficient under correct model specification.
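As a standard illustration, for the mean functional $T(F) = \int y\, dF(y)$ the definition gives

$$\operatorname{IF}(x; T, F) = \frac{\partial}{\partial \epsilon}\Big[(1 - \epsilon)\, T(F) + \epsilon\, x\Big]\Big|_{\epsilon = 0} = x - T(F),$$

which is unbounded in $x$: a single gross outlier can displace the sample mean arbitrarily far, whereas, for example, the median has a bounded influence function.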
Our goal in calculating the influence function is to show the full efficiency of the proposed estimators. That is, the influence function of the proposed estimators, under correct model specification, equals the influence function of the corresponding maximum likelihood estimators. In our context, robustness of the estimators is quantified by the associated RAFs (see Lindsay [19] and Basu and Lindsay [2]).
In what follows, we will derive the influence function of the estimators for the parameter vector $\beta$ in the case where both $X$ and $Y$ are discrete. Similar calculations provide the influence functions of estimators obtained under the remaining scenarios. To do so, we need to resort to the estimators' functional form, denoted by $T(F_\epsilon)$, with corresponding estimating equations
$$\sum_{x} \sum_{y} w\big(\delta_\epsilon(x, y)\big)\, d_\epsilon(x, y)\, u_\beta(y \mid x)\Big|_{\beta = T(F_\epsilon)} = 0,$$
where $d_\epsilon(x, y) = (1 - \epsilon)\, d(x, y) + \epsilon\, \Delta_{(x_0, y_0)}(x, y)$ and $\delta_\epsilon$ is the corresponding Pearson residual. The influence function is then obtained by differentiating the aforementioned estimating equations with respect to $\epsilon$ and then evaluating the derivative at $\epsilon = 0$.
Proposition 5.
The influence function of the β estimator is given by
$$\operatorname{IF}(x_0, y_0; T, F) = D_0^{-1}\, N_0(x_0, y_0),$$
where
$$D_0 = -\,\nabla_\beta \Big[\sum_{x} \sum_{y} w\big(\delta_0(x, y)\big)\, d(x, y)\, u_\beta(y \mid x)\Big]\Big|_{\beta = \beta_0}, \qquad N_0(x_0, y_0) = \frac{\partial}{\partial \epsilon} \Big[\sum_{x} \sum_{y} w\big(\delta_\epsilon(x, y)\big)\, d_\epsilon(x, y)\, u_{\beta_0}(y \mid x)\Big]\Big|_{\epsilon = 0},$$
with $\delta_0$ the Pearson residual evaluated at $\beta_0$, and the subscript 0 indicates evaluation at a parametric model.
Proof.
The proof is obtained via straightforward differentiation and its main steps are provided in the Appendix A.2. □
Proposition 6.
Under the assumption that the model is correct, the influence function derived above reduces to the influence function of the MLE of β.
Proof.
Under the assumption that the adopted model is the correct model, the density is $\pi_x\, m_{\beta_0}(y \mid x)$, so that $\delta_0(x, y) = 0$ for all $(x, y)$. Now recall that $A(0) = 0$ and $A'(0) = 1$, so that $w(0) = 1$ and $w'(0) = 0$, and the expression for $N_0$ reduces to
$$N_0(x_0, y_0) = u_{\beta_0}(y_0 \mid x_0).$$
Furthermore, the expression for $D_0$ reduces to $E_0\big[u_{\beta_0}(Y \mid X)\, u_{\beta_0}^{\mathsf T}(Y \mid X)\big] = I(\beta_0)$, where we assume exchangeability of differentiation and integration and use the fact that $E_0\big[\nabla_\beta\, u_\beta(Y \mid X)\big]\big|_{\beta_0} = -\,E_0\big[u_{\beta_0}(Y \mid X)\, u_{\beta_0}^{\mathsf T}(Y \mid X)\big]$. Hence, the influence function is given as
$$\operatorname{IF}(x_0, y_0; T, F) = I^{-1}(\beta_0)\, u_{\beta_0}(y_0 \mid x_0),$$
which is exactly the influence function of the MLE. Therefore, full efficiency is preserved under the model. □
6. Asymptotic Properties
In what follows, we establish asymptotic normality of the estimators in the case of discrete variables. The techniques for obtaining asymptotic normality in the mixed-scale case are similar and not presented here.
Case 1: Both X and Y are discrete.
Recall that the $j$th estimating equation is given as
$$S_j(\beta) = \sum_{x} \sum_{y} w\big(\delta(x, y)\big)\, d(x, y)\, u_{\beta, j}(y \mid x) = 0,$$
which can be expanded in a Taylor series in the neighborhood of the true parameter $\beta_0$ to obtain:
$$0 = S_j(\hat{\beta}) = S_j(\beta_0) + (\hat{\beta} - \beta_0)^{\mathsf T}\, \nabla S_j(\beta_0) + \frac{1}{2} (\hat{\beta} - \beta_0)^{\mathsf T}\, H_j(\beta^*)\, (\hat{\beta} - \beta_0),$$
where $\beta^*$ lies between $\hat{\beta}$ and $\beta_0$ and
$$H_j(\beta) = \nabla^2 S_j(\beta)$$
is a Hessian matrix whose $(k, l)$th element is given as $\partial^2 S_j(\beta)/\partial \beta_k\, \partial \beta_l$.
Under assumptions 1–8, listed in the Appendix A.3, we have the following theorem.
Theorem 1.
The minimum disparity estimators of the parameter vector β are asymptotically normal, that is, $\sqrt{n}\, (\hat{\beta} - \beta_0)$ converges in distribution to $N\big(0, I^{-1}(\beta_0)\big)$, with asymptotic variance $I^{-1}(\beta_0)$, where $I(\beta_0)$ indicates the Fisher information matrix.
7. Simulations
The simulation study presented below has two aims. The first is to indicate the versatility of the disparity methods for different data measurement scales. The second is to exemplify and study the robustness of these methods under different contamination scenarios.
Case 1: Both X and Y are discrete.
The Cressie-Read family of power divergences is given by
$$\mathrm{PWD}_\lambda(d, m_\beta) = \frac{1}{\lambda(\lambda + 1)} \sum_{t} d(t) \left[\left(\frac{d(t)}{m_\beta(t)}\right)^{\lambda} - 1\right], \qquad \lambda \in \mathbb{R},$$
where $d(t)$ is the proportion of observations with value $t$ and $m_\beta(t)$ is the density function of the model of interest.
To evaluate the performance of our algorithmic procedure, we use the following disparity measures: the likelihood disparity (LD, $\lambda \to 0$), the twice-squared Hellinger disparity (HD, $\lambda = -1/2$), Pearson's chi-squared disparity divided by 2 (PCS, $\lambda = 1$) and the symmetric chi-squared disparity (SCS).
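A sketch of these measures as R functions (ours, not the paper's code; `pwd` implements the Cressie-Read family above with empty-cell terms set to zero, matching the convention used in the computations, and `scs` uses one common normalization of the symmetric chi-squared):

```r
# Sketch: power divergences between observed proportions d and model
# probabilities m; terms with d = 0 are set to zero (empty-cell convention).
pwd <- function(d, m, lambda) {
  if (abs(lambda) < 1e-10)                         # LD: the lambda -> 0 limit
    return(sum(ifelse(d > 0, d * log(d / m), 0)))
  sum(ifelse(d > 0, d * ((d / m)^lambda - 1), 0)) / (lambda * (lambda + 1))
}
ld  <- function(d, m) pwd(d, m, 0)                 # likelihood disparity
hd  <- function(d, m) pwd(d, m, -0.5)              # twice-squared Hellinger
pcs <- function(d, m) pwd(d, m, 1)                 # Pearson's chi-squared / 2
scs <- function(d, m)                              # symmetric chi-squared
  sum(ifelse(d + m > 0, 2 * (d - m)^2 / (d + m), 0))
```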
The data are generated in four different ways using three different sample sizes $N$, the largest being 10,000. The data format used can be represented in a contingency table, with $n_{ij}$ denoting the count in the $(i, j)$-th cell, and $n_{i \cdot}$, $n_{\cdot j}$ representing the row and column totals, respectively. Furthermore, the variable $x$ indicates columns, while $y$ indicates the rows. In each of the aforementioned cases/scenarios, 10,000 tables were generated, and that corresponds to the number of Monte Carlo (MC) replications. Our purpose is to obtain the mean values of the estimates of the row and column marginal probabilities (the $\pi_i$'s and $\gamma_j$'s, say) along with their corresponding standard deviations (SDs). Notice that, in this setting, the estimation of the $\pi_i$'s and $\gamma_j$'s is completely nonparametric; that is, no model is assumed for estimating the marginal probabilities of X and Y.
The table was generated by using either a fixed total sample size N or fixed marginal probabilities. These two data-generating schemes imply two different sampling schemes that could have generated the data, with consequences for the probability model one would use. For example, with fixed total sample size the distribution of the counts is multinomial, while if the row margin is fixed in advance the distribution of the counts is a product multinomial distribution. In the former case of fixed N, we explored two different scenarios: a balanced and an imbalanced one. The imbalanced scenario allows for the presence of one zero cell in the contingency table, whereas the balanced scenario does not. In the latter case of fixed marginal probabilities, the row marginal probabilities (the $\pi_i$'s) were fixed, while the column marginals (the $\gamma_j$'s) were randomly chosen, and these values were used to obtain the contingency table. In this case, we also explored a balanced and an imbalanced scenario, based on whether the row marginal probabilities were chosen to be equal to each other or not, respectively.
Specifically, under Scenario Ia, where the total sample size N was fixed and the balanced design was exploited, none of the cell probabilities was set equal to zero, and the row and column marginal probabilities were equal. Table 1 presents the mean of 10,000 estimates and the corresponding SDs for all four distances (LD, HD, PCS, SCS) when N is fixed under the balanced scenario. Table 1 clearly shows that all distances provide estimates approximately equal to 0.200 regardless of the sample size used. Furthermore, as the sample size increases, the SDs decrease noticeably.
Table 1.
Scenario Ia: Means and standard deviations (SDs) of 4 distances (LD, HD, PCS, SCS). A contingency table was generated having fixed the total sample size N under a balanced design with equal row and column marginal probabilities. The number of Monte Carlo (MC) replications used is 10,000.
In Scenario IIa, where the total sample size N was fixed and the contingency table was structured using the imbalanced design, the presence of a zero cell was allowed. The results of this scenario are presented in Table 2, where the estimates were calculated exploiting all disparity measures. In the computations, the terms corresponding to the empty cell were set equal to zero. The presence of zero cells in contingency tables has a long history in the literature on contingency table analysis, where several options are provided for the analysis of these tables (Fienberg [32], Agresti [33], Johnson and May [34], Poon et al. [35]). From Table 2, one could infer that the different distances handle the zero cell differently. This difference is reflected in the estimate of the marginal probability affected by the zero cell. The strongest control is provided by the Hellinger and symmetric chi-squared distances. All distances estimate the parameters similarly, with the bias in their estimation remaining small. The SDs are almost the same for all distances per estimate, and their values improve for N = 10,000.
Table 2.
Scenario IIa: Means and SDs of 4 distances (LD, HD, PCS, SCS). A contingency table was generated having fixed the total sample size N under an imbalanced design with one zero cell. The number of MC replications used is 10,000.
A referee suggested that in certain cases interest may be centered on smaller samples. We generated tables with fixed total sample sizes of 50 and 70 observations. Table 3 and Table 4 describe the results when the contingency tables were generated under a balanced and an imbalanced design, with associated respective Scenarios Ib and IIb. More precisely, Table 3 presents the estimators of the marginal row and column probabilities obtained when the LD, HD, PCS and SCS distances are used. We notice that the increase in the sample size provides for a decrease in the overall absolute bias in estimation, defined as $\sum_{\ell} |\hat{\theta}_\ell - \theta_\ell|$, where $\hat{\theta}_\ell$ is the estimate of the ℓ-th component of a parameter vector $\theta$ and $\theta_\ell$ is the corresponding true value; in our case, $\theta$ collects the row and column marginal probabilities. This observation applies to all distances used in our calculations. Table 4 presents results associated with the imbalanced case. The generated tables contain two empty cells. Once again, the terms corresponding to the empty cells were set equal to zero in the calculations. We notice that the bias associated with the estimates is rather large for all the distances, and an increased sample size does not alleviate the observed bias. Basu and Basu [9] have proposed an empty-cell penalty for the minimum power-divergence estimators. This penalty leads to estimators with improved small-sample properties. See also Alin and Kurt [36] for a discussion of the need for penalization in small samples.
Table 3.
Scenario Ib: Means and Biases of 4 distances (LD, HD, PCS, SCS). A contingency table was generated having fixed the total sample size N under a balanced design with equal row and column marginal probabilities. The number of MC replications used is 10,000.
Table 4.
Scenario IIb: Means and Biases of 4 distances (LD, HD, PCS, SCS). A contingency table was generated having fixed the total sample size N under an imbalanced design with two empty cells. The number of MC replications used is 10,000.
Table 5 provides the results obtained under Scenario III. In this case, the parameter estimates were calculated using the LD, HD, PCS and SCS distances when the contingency table was constructed by fixing the row marginal probabilities so that they were all set at 0.20, that is, $(0.20, 0.20, 0.20, 0.20, 0.20)$. The column marginals were randomly chosen in the interval $(0, 1)$ and summed to 1; the resulting values were used as the true column marginal probabilities. The simulation study reveals that the estimates of the $\pi_i$'s and $\gamma_j$'s do not differ substantially from the respective row and column marginal probabilities for any of the four distances utilized. The SDs are approximately the same and they take lower values for larger N.
Table 5.
Scenario III: Means and SDs of 4 distances (LD, HD, PCS, SCS). A contingency table was generated having fixed the row marginal probabilities at (0.20, 0.20, 0.20, 0.20, 0.20). The number of MC replications used is 10,000.
Finally, in Table 6 the data generation was done by exploiting Scenario IV, that is, by fixing the row marginal probabilities, which were not equal to each other, while the column marginals were randomly chosen in the interval $(0, 1)$ so that they sum to 1. In particular, the row marginal probabilities were fixed at the values $(0.04, 0.20, 0.20, 0.20, 0.36)$. For the smallest sample size, the estimate of the smallest row marginal probability is approximately 0.07, rather than the true value 0.04, for all distances. However, for the larger sample sizes, up to 10,000, we get better estimates irrespective of the disparity measure chosen. The SDs are approximately the same and they become smaller as the sample size increases.
Table 6.
Scenario IV: Means and SDs of 4 distances (LD, HD, PCS, SCS). A contingency table was generated having fixed the row marginal probabilities at (0.04, 0.20, 0.20, 0.20, 0.36). The number of MC replications used is 10,000.
We also notice from Table 1, Table 5 and Table 6 that in all cases the standard deviation associated with the estimates obtained when we use distances other than the likelihood disparity is approximately the same as the standard deviation that corresponds to the likelihood estimates, thereby illustrating the asymptotic efficiency of the disparity estimators.
All calculations were performed using the R language. Given that the problem described in this section can be viewed as a general non-linear optimization problem, the solnp function of the Rsolnp package (Ye [37]) was used to obtain the aforementioned estimates. For our calculations, we tried a variety of different initial values for the $\pi_i$'s and $\gamma_j$'s; we noticed that, no matter how the initial values were chosen, the estimates were always very similar and very close to the observed marginal proportions. Only the number of iterations needed for convergence was slightly affected. Consequently, random numbers from a Uniform distribution on the interval $(0, 1)$ were set as initial values (which were not necessarily summing to 1). The solnp function has a built-in stopping rule and there was no need to set our own stopping rule. We only set the boundary constraints to be in the interval $[0, 1]$ for all estimates, which were also subject to the constraint that they sum to 1.
Other functions may also be used to obtain the estimates. For example, we used the auglag function of the nloptr package with local solvers “lbfgs” or “SLSQP” (Conn et al. [38], Birgin and Martínez [39]), which implements an augmented Lagrangian method. However, convergence using the solnp function (the number of iterations was on average 2) was considerably faster than using the auglag function (the average number of iterations was approximately 100). For this reason, the results presented in Table 1, Table 2, Table 3, Table 4, Table 5 and Table 6 were based only on the function solnp.
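For illustration, a minimal sketch of the estimation (our construction, not the authors' exact code; it assumes, as our reading of the setup, an independence model $m_{ij} = \pi_i \gamma_j$ for the cell probabilities, and `hd_objective`, `eq_fun` are hypothetical names):

```r
library(Rsolnp)

# Sketch: minimum Hellinger distance estimation of row/column marginals
# for a 5 x 5 table under an assumed independence model m_ij = p_i * q_j.
hd_objective <- function(theta, tab) {
  I <- nrow(tab); J <- ncol(tab)
  p <- theta[1:I]                       # row marginal probabilities
  q <- theta[(I + 1):(I + J)]           # column marginal probabilities
  d <- tab / sum(tab)                   # observed cell proportions
  m <- outer(p, q)                      # model cell probabilities
  2 * sum((sqrt(d) - sqrt(m))^2)        # twice-squared Hellinger disparity
}

eq_fun <- function(theta, tab) {        # each probability block sums to 1
  I <- nrow(tab); J <- ncol(tab)
  c(sum(theta[1:I]), sum(theta[(I + 1):(I + J)]))
}

set.seed(1)
tab <- matrix(rmultinom(1, 1000, rep(1 / 25, 25)), 5, 5)
fit <- solnp(pars = runif(10),          # random uniform initial values
             fun = hd_objective, eqfun = eq_fun, eqB = c(1, 1),
             LB = rep(0, 10), UB = rep(1, 10), tab = tab)
round(fit$pars, 3)                      # estimated (p, q), close to 0.20 each
```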
Case 2: X is discrete and Y is continuous
In this section, we are interested in solving the optimization problem (5) when X is discrete, Y is continuous and X, Y are independent of each other. To evaluate the performance of our procedure, we used Hellinger's distance, which in this case takes on the following form:
$$\mathrm{HD} = 2 \sum_{x} \int \Big(\sqrt{f^*(x, t)} - \sqrt{\pi_x\, m^*_\mu(t)}\Big)^{2}\, dt.$$
The aim of this simulation is to obtain the minimum Hellinger distance estimators of $\pi$ and $\mu$, assuming (without loss of generality) that the covariance matrix is known and equal to the identity. All calculations were performed in the R language.
For this purpose, we generated mixed-type data of size N using the package OrdNor (Amatya and Demirtas [40]). More precisely, the data are comprised of one categorical variable X with three levels and probability vector $\pi$, while the continuous part comes from a trivariate normal distribution, $Y \sim N_3(\mu, \Sigma)$ with $\Sigma = I_3$. We used two different mean vectors: $\mu = \mathbf{0}$ and a second, nonzero mean vector. The set of ordinal and normal variables was generated concurrently using an overall correlation matrix $\Sigma^*$, which consists of three components/sub-matrices: $\Sigma_{OO}$, $\Sigma_{ON}$ and $\Sigma_{NN}$, with O and N corresponding to “Ordinal” and “Normal” variables, respectively. Here, $\Sigma_{OO}$ and $\Sigma_{NN}$ are the within-scale correlation blocks, while $\Sigma_{ON}$ represents the polyserial correlations for the (ordinal, normal) combinations (for more information on polyserial correlations refer to Olsson et al. [41]). Since X, Y were assumed to be independent, we set $\Sigma_{ON} = \mathbf{0}$. However, we also used weak nonzero polyserial correlations to investigate whether the estimates we obtain in these cases remain reasonable.
The kernel function was the multivariate normal density, with the bandwidth matrix estimated from the data using the kde function of the ks package (Duong [42]); the smoothed model $m^*_\mu(t)$ was the correspondingly smoothed multivariate normal density, and the discrete part was the multinomial mass function. This choice of smoothing parameter stemmed from the fact that we were interested in evaluating the performance, in terms of robustness, of standard bandwidth selection.
To solve the optimization problem, the solnp function of the Rsolnp package (Ye [37]) was used. Specifically, the initial values set for the probabilities associated with the X variable were random uniform numbers in the interval $(0, 1)$, while the initial values for the means were random numbers between the 25th and the 75th quantiles per component of the continuous part. Following the same procedure as the one of Basu and Lindsay [2] in the univariate continuous case, here (in the mixed case) the numerical evaluation of the integrals was also done on the basis of Simpson's 1/3rd rule using the sintegral function of the Bolstad2 package (Bolstad [43]). Moreover, we calculated the mean values, the SDs, as well as the biases of the mean and the probability vectors for three different sample sizes, over 1000 MC replications. The bias is defined as the difference of the estimates from their “true” values, that is, $\hat{\pi}_x - \pi_x$ and $\hat{\mu}_j - \mu_j$ for each component. The results are shown in Table 7 and Table 8.
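To fix ideas, here is a heavily simplified sketch of the objective (ours; it reduces the continuous part to one dimension, uses a single plug-in bandwidth from ks::hpi rather than a full bandwidth matrix, and exploits the fact that a normal model smoothed by a normal kernel remains normal; all function names are hypothetical):

```r
library(ks); library(Bolstad2); library(Rsolnp)

# Sketch: mixed-scale Hellinger objective for X (3 levels) and ONE continuous
# component Y, under independence; Simpson's rule via sintegral on a grid.
set.seed(2)
x <- sample(1:3, 500, replace = TRUE, prob = c(0.2, 0.5, 0.3))
y <- rnorm(500)

h  <- hpi(y)                            # plug-in bandwidth (standard selection)
tt <- seq(-5, 5, length.out = 201)      # integration grid

hd_mixed <- function(theta) {
  pih <- theta[1:3]; mu <- theta[4]
  obj <- 0
  for (lev in 1:3) {
    # kernel-smoothed data density restricted to X = lev: f*(x, t)
    fstar <- sapply(tt, function(t) mean((x == lev) * dnorm(t, y, h)))
    # smoothed model: N(mu, 1) convolved with the normal kernel stays normal
    mstar <- pih[lev] * dnorm(tt, mu, sqrt(1 + h^2))
    obj <- obj + sintegral(tt, 2 * (sqrt(fstar) - sqrt(mstar))^2)$int
  }
  obj
}

fit <- solnp(pars = c(rep(1/3, 3), mean(y)), fun = hd_mixed,
             eqfun = function(th) sum(th[1:3]), eqB = 1,
             LB = c(rep(0, 3), -10), UB = c(rep(1, 3), 10))
round(fit$pars, 3)                      # estimates of (pi, mu)
```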
Table 7.
Means, Absolute Biases and Overall Absolute Bias under the Hellinger distance (HD). The data were concurrently generated with a given correlation structure (an overall correlation matrix $\Sigma^*$) and consist of a discrete variable X with marginal probability vector $\pi$ and a continuous vector $Y \sim N_3(\mu, \Sigma)$, where $\mu = \mathbf{0}$ and $\Sigma = I_3$ is a $3 \times 3$ identity matrix. The number of MC replications used is 1000.
Table 8.
Means, Absolute Biases and Overall Absolute Bias under the Hellinger distance (HD). The data were concurrently generated with a given correlation structure (an overall correlation matrix $\Sigma^*$) and consist of a discrete variable X with marginal probability vector $\pi$ and a continuous vector $Y \sim N_3(\mu, \Sigma)$, where $\mu$ is a nonzero mean vector and $\Sigma = I_3$ is a $3 \times 3$ identity matrix. The number of MC replications used is 1000.
In particular, Table 7 illustrates the mean values, the SDs and the biases of the corresponding minimum Hellinger distance estimators, over 1000 MC replications, for the three different sample sizes and polyserial correlations, when $\mu = \mathbf{0}$. The estimates for the $\pi_x$'s are approximately equal to their true values, while the $\mu$ estimates are almost zero, even in the cases of weak correlations. When $\mu = \mathbf{0}$, the sample size choice does not seem to affect the values of the estimates, either overall or per component. Specifically, we observe that the total absolute biases, computed as the sums of the individual component-wise absolute biases of the vectors $\pi$ and $\mu$, are approximately the same, with larger samples providing slightly smaller biases at the expense of a higher computational cost.
In Table 8, analogous results are presented, with the difference that the mean vector used was nonzero. The $\pi$ estimates are very close to the true values for all X components, no matter which sample size or correlation is used. On the contrary, the behavior of the $\mu$ estimates differs slightly in this case. We also calculated the overall absolute bias as well as the individual, per-parameter, absolute biases. In this case, larger samples clearly provide estimates with smaller bias for both parameter vectors $\pi$ and $\mu$, and for both cases, the case of independence as well as the case of weak correlations. However, the computational time increases.
In what follows, we also present, for illustration purposes, a small simulation example using a mixed-type, contaminated data set of size N, which was generated using the OrdNor package with the polyserial correlations set to zero. Once again, the data were comprised of one categorical variable X with three levels and probability vector $\pi$, and a trivariate continuous vector Y. The contamination occurs only in the continuous part, on the basis of the mixture
$$Y \sim \alpha\, N_3(\mathbf{0}, I_3) + (1 - \alpha)\, N_3(\mu_c, I_3),$$
where $0 \le \alpha \le 1$ and $\mu_c$ has all components equal to 3. This means that a proportion α of the data were generated with Y coming from a multivariate standard normal, and the remaining subset of the data followed a multivariate normal distribution with the shifted mean vector. It goes without saying that, when $\alpha = 1$, there is no contamination. Here, we are still considering the same optimization problem as the one described above and, consequently, we are interested in evaluating the minimum Hellinger distance estimators over 1000 MC replications by examining/studying to what extent the contamination level affects these estimates.
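For instance, the contaminated continuous part can be generated as follows (our sketch; `rcontam` is a hypothetical helper, with α the proportion of clean observations):

```r
# Sketch: alpha = proportion of clean N3(0, I3) observations; the remaining
# rows are shifted to have mean (3, 3, 3).
rcontam <- function(n, alpha, shift = 3) {
  clean <- rbinom(n, 1, alpha)              # 1 = clean, 0 = contaminated
  y <- matrix(rnorm(3 * n), n, 3)           # N3(0, I3) draws
  y[clean == 0, ] <- y[clean == 0, ] + shift
  y
}
y <- rcontam(500, alpha = 0.9)              # 10% contamination on average
```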
As indicated in Table 9, when there is no contamination in the data ($\alpha = 1$), the estimates for the $\pi_x$'s are almost equal to their true values, while the $\mu$ estimates are almost equal to zero. As the data become more contaminated (i.e., as the value of α decreases), the minimum disparity estimators corresponding to the X variable remain quite consistent with their true values. However, this is not the case with the estimates for the $\mu$'s, which deteriorate as the contamination level shifts away from the target/null value, that is, $\alpha = 1$.
Table 9.
Means and SDs under the Hellinger distance (HD). The data were concurrently generated with a given correlation structure (an overall correlation matrix $\Sigma^*$) and consist of a discrete variable X with marginal probability vector $\pi$ and a continuous trivariate vector Y, where $\mu = \mathbf{0}$, $I_3$ is a $3 \times 3$ identity matrix and α controls the contamination level (α = 1 corresponds to no contamination). The number of MC replications used is 1000.
The mean parameters are estimated with reasonable bias when the contamination is small, the maximum bias occurring for the second component of the mean. As the contamination increases, the bias of the mean components grows, and for high contamination levels it becomes substantial. This is the result of using standard density estimation to obtain the smoothing parameters for the different mean components. Smaller values of these component smoothing parameters result in substantial bias reduction.
We also looked at the case where the continuous model was contaminated by a trivariate normal with a different mean vector and covariance matrix. In this case (results not shown), the maximum bias of the mean components remains moderate at low contamination levels and increases with the contamination level. Again, in this case the bandwidth parameters were obtained by fitting a unimodal density to the data.
The above results are not surprising. A judicious selection of the smoothing parameter decreases the bias of the component estimates of the mean. Agostinelli and Markatou [44] provide suggestions on how to select the smoothing parameter that can be extended and applied in this context.
8. Discussion and Conclusions
In this paper, we discuss Pearson residual systems that conform to the measurement scale of the data. We place emphasis on the mixed-scale measurements scenario, which is equivalent to having both discrete (ordinal or nominal) and continuous random variables, and obtain robust estimators of the parameters of the joint probability distribution that describes those variables. We show that disparity methods can be used to control against model misspecification and the presence of outliers, and that these methods provide reasonable results.
The scale and nature of measurement of the data imposes additional challenges, both computationally and statistically. Detecting outliers in this multidimensional space is an open research question (Eiras-Franco et al. [45]). The concept of outliers has a long history in the field of statistics and outlier detection methods have broad applications in many scientific fields such as security (Diehl and Hampshire [46], Portnoy et al. [47]), health care (Tran et al. [48]) and insurance (Konijn and Kowalczyk [49]) to mention just a few.
Classical outlier detection methods are largely designed for single measurement scale data. Handling mixed measurement scales is a challenge, with few works coming from both the field of statistics (Fraley and Wilkinson [50], Wilkinson [51]) and the fields of engineering and computer science (Do et al. [52], Koufakou et al. [53]). All these works use some version of a probabilistic outlier, either looking for regions in the space of data that have low density (Do et al. [52], Koufakou et al. [53]) or attaching a probability, under a model, to the suspicious data point (Fraley and Wilkinson [50], Wilkinson [51]).
The concept of a probabilistic outlier discussed here, expressed via the construction of appropriate Pearson residuals, can unify the different measurement scales, and the class of disparity functions discussed above can provide estimators for the model parameters that are not unduly influenced by potential outliers.
One of the important parameters that control the robustness of these methods is the smoothing parameter(s) used to compute the density estimator of the continuous part of the model. In our computations, we use standard smoothing parameters obtained from appropriate R functions for density estimation. The results show that, depending on the level of contamination and the type of contaminating probability model, the performance of the methods is satisfactory. Specifically, a small simulation study using the model reported in the caption of Table 9 shows that the overall bias associated with the mean components of the standard multivariate normal model is low when the level of contamination with a multivariate normal model with mean components equal to 3 is small. But when the percentage of contamination grows larger, the bias increases if the smoothing parameter used is the one obtained from standard density estimation in R. Here, smaller values of the smoothing parameter guarantee reduction of the bias.
Devising rules for selecting the smoothing parameter(s) in the context of mixed-scale measurements that can guarantee robustness for larger levels of contamination may be possible. However, it is the opinion of the authors that greater levels of data inhomogeneity may indicate model failure, a case where assessing model goodness of fit is of importance.
Author Contributions
The authors of this paper have contributed as follows. Conceptualization: M.M.; Methodology: M.M., E.M.S., R.L.; Software: E.M.S., H.W.; Writing-original draft presentation: M.M., E.M.S., R.L., H.W.; Supervision, funding acquisition and project administration: M.M. All authors have read and agreed to the published version of the manuscript.
Funding
This research was funded by the Troup Fund, KALEIDA Health Foundation, under award number 82114, to Markatou; the award supported the work of the first and the third authors of the paper.
Conflicts of Interest
The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
| ALT | Alanine Aminotransferase |
| HD | Twice-Squared Hellinger’s Disparity |
| LD | Likelihood Disparity |
| MC | Monte Carlo |
| MDE | Minimum Distance Estimators |
| MLE | Maximum Likelihood Estimator |
| PCS | Pearson’s Chi-Squared Disparity Divided by 2 |
| PWD | Power Divergence Disparity |
| RAF | Residual Adjustment Function |
| SCS | Symmetric Chi-Squared Disparity |
| SD | Standard Deviation |
Appendix A
Appendix A.1. Proof of Proposition 3
Proof.
The equations (4) are obtained by solving optimization problem (3). To solve this problem, we need to form the corresponding Lagrangian, which is
$$L(\beta, \pi, \lambda) = \sum_{x} \sum_{y} G\big(\delta(x, y)\big)\, m_\beta(y \mid x)\, \pi_x + \lambda \Big(\sum_{x} \pi_x - 1\Big).$$
(i) Let $\nabla_\beta$ denote the gradient with respect to β. The estimators of β are obtained as solutions of the set of equations
$$\nabla_\beta L(\beta, \pi, \lambda) = 0,$$
which can be equivalently expressed as follows:
$$\sum_{x} \sum_{y} \pi_x \Big[G'\big(\delta(x, y)\big)\, \nabla_\beta \delta(x, y)\, m_\beta(y \mid x) + G\big(\delta(x, y)\big)\, \nabla_\beta m_\beta(y \mid x)\Big] = 0.$$
Notice that the gradient of $\delta(x, y)$ is given by
$$\nabla_\beta \delta(x, y) = -\big(\delta(x, y) + 1\big)\, u_\beta(y \mid x),$$
where the superscript “′” denotes derivative with respect to δ, $\delta(x, y)$ is the Pearson residual and
$$u_\beta(y \mid x) = \nabla_\beta \log m_\beta(y \mid x)$$
is the score for β in the conditional distribution of y given x. Therefore, the equations become
$$\sum_{x} \sum_{y} \pi_x\, A\big(\delta(x, y)\big)\, \nabla_\beta m_\beta(y \mid x) = 0,$$
where
$$A(\delta) = (1 + \delta)\, G'(\delta) - G(\delta).$$
By making use of the fact that $A(\delta) + 1 = w(\delta)\, (\delta + 1)$ and that $\sum_{y} \nabla_\beta m_\beta(y \mid x) = 0$, the resulting equations can be represented as
$$\sum_{x} \sum_{y} \Big[w\big(\delta(x, y)\big)\, d(x, y) - \pi_x\, m_\beta(y \mid x)\Big]\, u_\beta(y \mid x) = 0,$$
or equivalently,
$$\sum_{x} \sum_{y} w\big(\delta(x, y)\big)\, d(x, y)\, u_\beta(y \mid x) = 0.$$
Without loss of generality, we can take
$$w(\delta) = \frac{\big[A(\delta) + 1\big]^{+}}{\delta + 1}.$$
(ii) We now need to obtain $\hat{\pi}_x$, which can be obtained by setting the gradient of the Lagrangian with respect to $\pi_x$ equal to zero, that is, by the following equations:
$$\sum_{y} \Big[G\big(\delta(x, y)\big) + \pi_x\, G'\big(\delta(x, y)\big)\, \frac{\partial \delta(x, y)}{\partial \pi_x}\Big]\, m_\beta(y \mid x) + \lambda = 0.$$
Recalling that $\delta(x, y) + 1 = d(x, y)/\big(\pi_x\, m_\beta(y \mid x)\big)$ and $\partial \delta(x, y)/\partial \pi_x = -\big(\delta(x, y) + 1\big)/\pi_x$, the above equations are reduced to
$$-\sum_{y} A\big(\delta(x, y)\big)\, m_\beta(y \mid x) + \lambda = 0,$$
and we readily conclude that
$$\sum_{y} w\big(\delta(x, y)\big)\, d(x, y) = (1 + \lambda)\, \pi_x.$$
Furthermore, to satisfy the constraint $\sum_{x} \pi_x = 1$, we obtain
$$1 + \lambda = \sum_{x} \sum_{y} w\big(\delta(x, y)\big)\, d(x, y).$$
Therefore, we get
$$\hat{\pi}_x = \frac{\sum_{y} w\big(\delta(x, y)\big)\, d(x, y)}{\sum_{x} \sum_{y} w\big(\delta(x, y)\big)\, d(x, y)},$$
and by making use of the fact that $d(x, y) = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{x_i = x,\ y_i = y\}$, the above equation can be represented as
$$\hat{\pi}_x = \frac{\sum_{i=1}^{n} w\big(\delta(x, y_i)\big)\, \mathbb{1}\{x_i = x\}}{\sum_{i=1}^{n} w\big(\delta(x_i, y_i)\big)},$$
for any x, where $\mathbb{1}\{x_i = x\}$ is the indicator function of the event $\{x_i = x\}$. □
Appendix A.2. Proof of Proposition 5
Recall that $T(F_\epsilon)$ is a solution of the set of estimating equations
$$\sum_{x} \sum_{y} w\big(\delta_\epsilon(x, y)\big)\, d_\epsilon(x, y)\, u_\beta(y \mid x)\Big|_{\beta = T(F_\epsilon)} = 0, \tag{A1}$$
where $d_\epsilon(x, y) = (1 - \epsilon)\, d(x, y) + \epsilon\, \Delta_{(x_0, y_0)}(x, y)$ and $u_\beta(y \mid x)$ is a p-dimensional vector.
The influence function of $T$ is calculated by differentiating, with respect to ε, the quantity (A1), and evaluating the derivative at $\epsilon = 0$. Thus, we need
$$\frac{\partial}{\partial \epsilon} \Big[\sum_{x} \sum_{y} w\big(\delta_\epsilon(x, y)\big)\, d_\epsilon(x, y)\, u_{T(F_\epsilon)}(y \mid x)\Big]\Big|_{\epsilon = 0} = 0.$$
Taking into account that $\partial T(F_\epsilon)/\partial \epsilon\, \big|_{\epsilon = 0} = \operatorname{IF}(x_0, y_0; T, F)$, the aforementioned evaluation implies
$$N_0(x_0, y_0) - D_0\, \operatorname{IF}(x_0, y_0; T, F) = 0,$$
which implies that
$$\operatorname{IF}(x_0, y_0; T, F) = D_0^{-1}\, N_0(x_0, y_0).$$
Appendix A.3. Assumptions of Theorem 1
The following assumptions are needed to be able to establish asymptotic normality of the estimators.
1. The weight functions $w(\delta(x, y))$ are nonnegative, bounded and differentiable with respect to β.
2. The weight function is regular, that is, $w'(\delta)\,(\delta + 1)$ is bounded, where $w'(\delta)$ is the derivative of w with respect to δ.
3.
4. The elements of the Fisher information matrix are finite and the Fisher information matrix is nonsingular.
5.
6. If $\beta_0$ denotes the true value of β, there exist functions $M_{jkl}(x, y)$ such that the third-order partial derivatives of the estimating function are bounded by $M_{jkl}(x, y)$ in a neighborhood of $\beta_0$, with $E_{\beta_0}\big[M_{jkl}(X, Y)\big] < \infty$.
7. If $\beta_0$ denotes the true value of β, there is a neighborhood $N(\beta_0)$ such that, for $\beta \in N(\beta_0)$, the quantities entering the estimating equations and their derivatives are bounded by functions whose corresponding expectations are finite.
8. $A''(\delta)$ is bounded, where $A''(\delta)$ denotes the second derivative of A with respect to δ.
References
- Beran, R. Minimum Hellinger Distance Estimates for Parametric Models. Ann. Stat. 1977, 5, 445–463.
- Basu, A.; Lindsay, B.G. Minimum Disparity Estimation for Continuous Models: Efficiency, Distributions and Robustness. Ann. Inst. Stat. Math. 1994, 46, 683–705.
- Pardo, J.A.; Pardo, L.; Pardo, M.C. Minimum ϕ-Divergence Estimator in Logistic Regression Models. Stat. Pap. 2005, 47, 91–108.
- Pardo, J.A.; Pardo, L.; Pardo, M.C. Testing in Logistic Regression Models Based on ϕ-Divergence Measures. J. Stat. Plan. Inference 2006, 136, 982–1006.
- Pardo, J.A.; Pardo, M.C. Minimum ϕ-Divergence Estimator and ϕ-Divergence Statistics in Generalized Linear Models with Binary Data. Methodol. Comput. Appl. Probab. 2008, 10, 357–379.
- Simpson, D.G. Minimum Hellinger Distance Estimation for the Analysis of Count Data. J. Am. Stat. Assoc. 1987, 82, 802–807.
- Simpson, D.G. Hellinger Deviance Tests: Efficiency, Breakdown Points, and Examples. J. Am. Stat. Assoc. 1989, 84, 104–113.
- Markatou, M.; Basu, A.; Lindsay, B.G. Weighted Likelihood Estimating Equations: The Discrete Case with Applications to Logistic Regression. J. Stat. Plan. Inference 1997, 57, 215–232.
- Basu, A.; Basu, S. Penalized Minimum Disparity Methods for Multinomial Models. Stat. Sin. 1998, 8, 841–860.
- Gupta, A.K.; Nguyen, T.; Pardo, L. Inference Procedures for Polytomous Logistic Regression Models Based on ϕ-Divergence Measures. Math. Methods Stat. 2006, 15, 269–288.
- Martín, N.; Pardo, L. New Influence Measures in Polytomous Logistic Regression Models Based on Phi-Divergence Measures. Commun. Stat. Theory Methods 2014, 43, 2311–2321.
- Castilla, E.; Ghosh, A.; Martín, N.; Pardo, L. New Robust Statistical Procedures for Polytomous Logistic Regression Models. Biometrics 2018, 74, 1282–1291.
- Martín, N.; Pardo, L. Minimum Phi-Divergence Estimators for Loglinear Models with Linear Constraints and Multinomial Sampling. Stat. Pap. 2008, 49, 2311–2321.
- Pardo, L.; Martín, N. Minimum Phi-Divergence Estimators and Phi-Divergence Test Statistics in Contingency Tables with Symmetric Structure: An Overview. Symmetry 2010, 2, 1108–1120.
- Pardo, L.; Pardo, M.C. Minimum Power-Divergence Estimator in Three-Way Contingency Tables. J. Stat. Comput. Simul. 2003, 73, 819–831.
- Pardo, L.; Pardo, M.C.; Zografos, K. Minimum ϕ-Divergence Estimator for Homogeneity in Multinomial Populations. Sankhyā Indian J. Stat. Ser. A (1961–2002) 2001, 63, 72–92.
- Basu, A.; Harris, I.A.; Hjort, N.L.; Jones, M.C. Robust and Efficient Estimation by Minimising a Density Power Divergence. Biometrika 1998, 85, 549–559.
- Csiszár, I. Information-Type Measures of Difference of Probability Distributions and Indirect Observations. Stud. Sci. Math. Hung. 1967, 2, 299–318.
- Lindsay, B.G. Efficiency Versus Robustness: The Case for Minimum Hellinger Distance and Related Methods. Ann. Stat. 1994, 22, 1081–1114.
- Tamura, R.N.; Boos, D.D. Minimum Hellinger Distance Estimation for Multivariate Location and Covariance. J. Am. Stat. Assoc. 1986, 81, 223–229.
- Markatou, M.; Basu, A.; Lindsay, B.G. Weighted Likelihood Equations with Bootstrap Root Search. J. Am. Stat. Assoc. 1998, 93, 740–750.
- Haberman, S.J. Generalized Residuals for Log-Linear Models. In Proceedings of the 9th International Biometrics Conference, Boston, MA, USA, 22–27 August 1976; pp. 104–122.
- Haberman, S.J.; Sinharay, S. Generalized Residuals for General Models for Contingency Tables with Application to Item Response Theory. J. Am. Stat. Assoc. 2013, 108, 1435–1444.
- Pierce, D.A.; Schafer, D.W. Residuals in Generalized Linear Models. J. Am. Stat. Assoc. 1986, 81, 977–986.
- Aerts, M.; Molenberghs, G.; Geys, H.; Ryan, L. Topics in Modelling of Clustered Data; Monographs on Statistics and Applied Probability; Chapman & Hall/CRC Press: New York, NY, USA, 2002; Volume 96.
- Olkin, I.; Tate, R.F. Multivariate Correlation Models with Mixed Discrete and Continuous Variables. Ann. Math. Stat. 1961, 32, 448–465; with correction in 1965, 36, 343–344.
- Genest, C.; Nešlehová, J. A Primer on Copulas for Count Data. ASTIN Bull. 2007, 37, 475–515.
- Lauritzen, S.; Wermuth, N. Graphical Models for Associations between Variables, Some of Which Are Qualitative and Some Quantitative. Ann. Stat. 1989, 17, 31–57.
- Hampel, F.R.; Ronchetti, E.M.; Rousseeuw, P.J.; Stahel, W.A. Robust Statistics: The Approach Based on Influence Functions; Wiley Series in Probability and Mathematical Statistics; Wiley: New York, NY, USA, 1986.
- Hampel, F.R. Contributions to the Theory of Robust Estimation. Ph.D. Thesis, Department of Statistics, University of California, Berkeley, Berkeley, CA, USA, 1968. Unpublished.
- Hampel, F.R. The Influence Curve and Its Role in Robust Estimation. J. Am. Stat. Assoc. 1974, 69, 383–393.
- Fienberg, S.E. The Analysis of Incomplete Multi-Way Contingency Tables. Biometrics 1972, 28, 177–202.
- Agresti, A. Categorical Data Analysis, 3rd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2013.
- Johnson, W.D.; May, W.L. Combining 2 × 2 Tables That Contain Structural Zeros. Stat. Med. 1995, 14, 1901–1911.
- Poon, W.Y.; Tang, M.L.; Wang, S.J. Influence Measures in Contingency Tables with Application in Sampling Zeros. Sociol. Methods Res. 2003, 31, 439–452.
- Alin, A.; Kurt, S. Ordinary and Penalized Minimum Power-Divergence Estimators in Two-Way Contingency Tables. Comput. Stat. 2008, 23, 455–468.
- Ye, Y. Interior Algorithms for Linear, Quadratic, and Linearly Constrained Convex Programming. Ph.D. Thesis, Department of Engineering-Economic Systems, Stanford University, Stanford, CA, USA, 1987. Unpublished.
- Conn, A.R.; Gould, N.I.M.; Toint, P. A Globally Convergent Augmented Lagrangian Algorithm for Optimization with General Constraints and Simple Bounds. SIAM J. Numer. Anal. 1991, 28, 545–572.
- Birgin, E.G.; Martínez, J.M. Improving Ultimate Convergence of an Augmented Lagrangian Method. Optim. Methods Softw. 2008, 23, 177–195.
- Amatya, A.; Demirtas, H. OrdNor: An R Package for Concurrent Generation of Correlated Ordinal and Normal Data. J. Stat. Softw. 2015, 68, 1–14.
- Olsson, U.; Drasgow, F.; Dorans, N.J. The Polyserial Correlation Coefficient. Psychometrika 1982, 47, 337–347.
- Duong, T. ks: Kernel Density Estimation and Kernel Discriminant Analysis for Multivariate Data in R. J. Stat. Softw. 2007, 21, 1–16.
- Bolstad, W.M. Understanding Computational Bayesian Statistics; John Wiley & Sons: Hoboken, NJ, USA, 2010.
- Agostinelli, C.; Markatou, M. Test of Hypotheses Based on the Weighted Likelihood Methodology. Stat. Sin. 2001, 11, 499–514.
- Eiras-Franco, C.; Martínez-Rego, D.; Guijarro-Berdiñas, B.; Alonso-Betanzos, A.; Bahamonde, A. Large Scale Anomaly Detection in Mixed Numerical and Categorical Input Spaces. Inf. Sci. 2019, 487, 115–127.
- Diehl, C.; Hampshire, J. Real-Time Object Classification and Novelty Detection for Collaborative Video Surveillance. In Proceedings of the 2002 International Joint Conference on Neural Networks, IJCNN’02, Honolulu, HI, USA, 12–17 May 2002; Volume 3, pp. 2620–2625.
- Portnoy, L.; Eskin, E.; Stolfo, S. Intrusion Detection with Unlabeled Data Using Clustering. In Proceedings of the ACM CSS Workshop on Data Mining Applied to Security (DMSA-2001), Philadelphia, PA, USA, 5–8 November 2001; pp. 5–8.
- Tran, T.; Phung, D.; Luo, W.; Harvey, R.; Berk, M.; Venkatesh, S. An Integrated Framework for Suicide Risk Prediction. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Chicago, IL, USA, 11–14 August 2013; ACM: New York, NY, USA, 2013; pp. 1410–1418.
- Konijn, R.M.; Kowalczyk, W. Finding Fraud in Health Insurance Data with Two-Layer Outlier Detection Approach. In Data Warehousing and Knowledge Discovery, DaWaK 2011; Cuzzocrea, A., Dayal, U., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 394–405.
- Fraley, C.; Wilkinson, L. Package ‘HDoutliers’. R Package, 2020. Available online: https://cran.r-project.org/web/packages/HDoutliers/index.html (accessed on 31 December 2020).
- Wilkinson, L. Visualizing Outliers. 2016. Available online: https://www.cs.uic.edu/~wilkinson/Publications/outliers.pdf (accessed on 31 December 2020).
- Do, K.; Tran, T.; Phung, D.; Venkatesh, S. Outlier Detection on Mixed-Type Data: An Energy-Based Approach. In Advanced Data Mining and Applications; Li, J., Li, X., Wang, S., Li, J., Sheng, Q.Z., Eds.; Springer: Cham, Switzerland, 2016; pp. 111–125.
- Koufakou, A.; Georgiopoulos, M.; Anagnostopoulos, G.C. Detecting Outliers in High-Dimensional Datasets with Mixed Attributes. In Proceedings of the 2008 International Conference on Data Mining, DMIN, Las Vegas, NV, USA, 14–17 July 2008; pp. 427–433.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).