Distance-Based Estimation Methods for Models for Discrete and Mixed-Scale Data

Pearson residuals aid the task of identifying model misspecification because they compare the model estimated from the data with the model assumed under the null hypothesis. We present different formulations of the Pearson residual system that account for the measurement scale of the data and study their properties. We further concentrate on the case of mixed-scale data, that is, data measured on both categorical and interval scales. We study the asymptotic properties and the robustness of minimum disparity estimators obtained in the case of mixed-scale data and exemplify the performance of the methods via simulation.


Introduction
Minimum disparity estimation has been studied extensively in models where the scale of the data is either interval or ratio (Beran [1], Basu and Lindsay [2]). It has also been studied in the discrete outcomes case. Specifically, when the response variable is discrete and the explanatory variables are continuous, Pardo et al. [3] introduced a general class of distance estimators based on φ-divergence measures, the minimum φ-divergence estimators, and studied their asymptotic properties. These estimators can be viewed as an extension/generalization of the Maximum Likelihood Estimator (MLE). Pardo et al. [4] used the minimum φ-divergence estimator in a φ-divergence statistic to perform goodness-of-fit tests in logistic regression models, while Pardo and Pardo [5] extended the previous works to address testing problems in generalized linear models with binary scale data.
The case where data are measured on a discrete scale (either ordinal or more generally categorical) has also attracted the interest of other researchers. For instance, Simpson [6] demonstrated that minimum Hellinger distance estimators fulfill desirable robustness properties and for this reason can be effective in the analysis of count data prone to outliers. Simpson [7] also suggested robust tests for parametric inference based on the minimum Hellinger distance, in which the density of the (parametric) model is estimated nonparametrically. Markatou et al. [8] used weighted likelihood equations to obtain efficient and robust estimators in discrete probability models and applied their methods to logistic regression, whereas Basu and Basu [9] considered robust penalized minimum disparity estimators for multinomial models with good small sample efficiency.
Moreover, Gupta et al. [10], Martín and Pardo [11] and Castilla et al. [12] used the minimum φ-divergence estimator to provide solutions to testing problems in polytomous regression models. Working in a similar fashion, Martín and Pardo [13] studied the properties of the family of φ-divergence estimators for log-linear models with linear constraints under multinomial sampling in order to identify potential associations between variables in multi-way contingency tables. Pardo and Martín [14] presented an overview of works associated with contingency tables of symmetric structure on the basis of minimum φ-divergence estimators and minimum φ-divergence test statistics. Additional works include Pardo and Pardo [15] and Pardo et al. [16]. Alternative power divergence measures have been introduced by Basu et al. [17].
The class of f- or φ-divergences was originally introduced by Csiszár [18]. The structural characteristics of this class and their relationship to the concepts of efficiency and robustness were studied, for the case of discrete probability models, by Lindsay [19]. Basu and Lindsay [2] studied the properties of estimators derived by minimizing f-divergences between continuous models and presented examples showing the robustness of these estimates. We also note that Tamura and Boos [20] studied minimum Hellinger distance estimation for multivariate location and covariance. Additionally, formal robustness results were presented in Markatou et al. [8,21] in connection with the introduction of weighted likelihood estimation.
If G is a real valued, convex function defined on [0, ∞) and such that G(1) = 0, 0G(0/0) = 0, 0G(u/0) = uG_∞, with G_∞ = lim_{u→∞} G(u)/u, the class of φ-divergences is defined as

ρ(τ, m_{β_0}) = Σ_{t ∈ T} m_{β_0}(t) G(τ(t)/m_{β_0}(t)),

where τ(·), m_{β_0}(·) are two probability models. Notice that we define ρ(τ, m_{β_0}) on discrete probability models first, where T = {0, 1, 2, . . . , T} is a discrete sample space, T possibly infinite, and m_{β_0}(t) ∈ M = {m_β(t) : β ∈ B}, with B the parameter space, B ⊆ R^d. Furthermore, different forms of the function G(u) provide different statistical distances or divergences. We can change the argument of the function G from u = τ(t)/m_{β_0}(t) to δ(t) = τ(t)/m_{β_0}(t) − 1. Then, G is a function of the Pearson residual, which is defined as δ(t) = τ(t)/m_{β_0}(t) − 1 and takes values in [−1, ∞). If the measurement scale is interval/ratio, then the Pearson residuals are modified to reflect and adjust for the discrepancy of scale between the data, which are always discrete, and the assumed continuous probability model (see Basu and Lindsay [2]).
The Pearson residual is used by Lindsay [19], Basu and Lindsay [2] and Markatou et al. [8,21] in investigating the robustness of the minimum disparity and weighted likelihood estimators, respectively. This residual system allows one to identify distributional errors. If, in the equation of the Pearson residual, we replace τ(t) with its best nonparametric representative d(t), the proportion of observations in a sample with value t, then δ(t) = d(t)/m_{β_0}(t) − 1. We note that the Pearson residuals are called so because n Σ_t δ²(t) m_{β_0}(t) is Pearson's chi-squared distance. Furthermore, these residuals are not symmetric, since they take values in [−1, ∞), and are not standardized to have identical variances.
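The chi-squared identity above can be checked numerically. The following is a small sketch of ours (the counts and the uniform model are made up for illustration):

```python
import numpy as np

def pearson_residuals(counts, model_probs):
    """delta(t) = d(t)/m(t) - 1, with d(t) the observed proportions."""
    d = counts / counts.sum()
    return d / model_probs - 1.0

# toy discrete sample space with a hypothesized model m_{beta_0}
counts = np.array([18.0, 35.0, 30.0, 17.0])
m = np.array([0.25, 0.25, 0.25, 0.25])
n = counts.sum()

delta = pearson_residuals(counts, m)

# n * sum_t delta(t)^2 m(t) equals Pearson's chi-squared statistic
chi2_via_delta = n * np.sum(delta**2 * m)
chi2_classic = np.sum((counts - n * m)**2 / (n * m))
```

Note also that the residuals are bounded below by −1 (attained at empty cells) but unbounded above, which is the asymmetry mentioned in the text.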
How does robustness fit into this picture? In the robustness literature, there is a denial of the model's truth. Following this logic, the framework based on disparities starts with goodness-of-fit by identifying a measure that assesses whether the model fits the data adequately. Then, we examine whether this measure of adequacy is robust and in what sense. A fundamental tool that assists in measuring the degree of robustness is the Pearson residual, because it measures model misspecification. That is, Pearson residuals provide information about the degree to which the specified model m β fits the data. In this context, outliers are defined as those data points that have a low probability of occurrence under the hypothesized model. Such probabilistic outliers are called surprising observations (Lindsay [19]). Furthermore, the robustness of estimators obtained via minimization of the divergence measures we discuss here is indicated by the shape of the associated Residual Adjustment Function (RAF), a concept that is reviewed in Section 2. Of note is that in contingency table analysis, the generalized residual system is used for examination of sources of error in models for contingency tables, see, for example, Haberman [22], Haberman and Sinharay [23]. The concept of generalized residuals in the case of generalized linear models is discussed, for example, in Pierce and Schafer [24].
Many data sets comprise data measured on both categorical (ordinal or nominal) and interval/ratio scales. We can think of these data as realizations of discrete and continuous random variables, respectively. Examples of data sets that include mixed-scale data are electronic health records containing diagnostic codes (discrete) and laboratory measurements (e.g., blood pressure, alanine amino transferase (ALT) measurements on interval/ratio scale) and marketing data (customer records include income and gender information). Additional examples include data from developmental toxicology (Aerts et al. [25]), where fetal data from laboratory animals include binary, categorical and continuous outcomes. In this context, the joint density of the discrete and continuous random variables can be written as a product of the conditional density of the outcome given x and the probability density function of x, where the two factors are indexed by separate parameter vectors.
Work on the analysis of mixed-scale data is complicated by the fact that it is difficult to identify suitable joint probability distributions that describe both measurement scales of the data, although a number of ad hoc methods for the analysis of mixed-scale data have been used in applications. Olkin and Tate [26] proposed multivariate correlation models for mixed-scale data. Copulas also provide an attractive approach to modeling the joint distribution of mixed-scale data, though copulas are less straightforward to implement, and there are subtle identifiability issues that complicate the specification of a model (Genest and Neslehová [27]).
To formulate the joint distribution in the mixed-scale variables case, one can either specify the marginal distribution of the discrete variables and the conditional distribution of the continuous variables given the discrete variables, or specify the marginal distribution of the continuous variables and the conditional distribution of the discrete variables given the continuous variables. Of note here is that the direction of factorization generally yields distinct model interpretations and results. The first approach has received much attention in the literature, in the context of the analysis of data with mixtures of categorical and continuous variables. Here, the continuous variables follow different multivariate normal distributions for each possible setting of the categorical variable values; the categorical variables then follow an arbitrary marginal multinomial distribution. This model is known in the literature as the conditional Gaussian distribution model and is central in the discussion of graphical association models with mixed-scale variables (Lauritzen and Wermuth [28]). A very special case of this model is used in our simulations.
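As a minimal sketch of sampling from a simple conditional Gaussian distribution model of the kind just described (our own toy parameters, not those of the simulations below): a categorical X with a multinomial marginal, and, given X = x, a normal Y with a cell-specific mean.

```python
import numpy as np

rng = np.random.default_rng(0)

# marginal multinomial distribution for the categorical variable X
pi = np.array([0.2, 0.5, 0.3])         # P(X = x)
mu = np.array([-1.0, 0.0, 2.0])        # cell-specific normal means
sigma = 1.0                            # common within-cell standard deviation

n = 5000
x = rng.choice(len(pi), size=n, p=pi)  # categorical part
y = rng.normal(mu[x], sigma)           # continuous part, conditional on x
```

The joint density factorizes as P(X = x) times the N(mu_x, sigma^2) density, which is exactly the marginal-times-conditional construction of the first approach.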
In this paper, we develop robust methods for mixed-scale data. Specifically, Section 2 reviews basic concepts in minimum disparity estimation, and Section 3 defines Pearson residuals for data measured on discrete, interval/ratio and mixed scales and studies their properties. Section 4 establishes the optimization problem for obtaining estimators of the model parameters, while Sections 5 and 6 establish the robustness and asymptotic properties of these estimators. Finally, Section 7 presents simulations showing the performance of these methods and Section 8 offers a discussion. Appendix A includes proofs of the theoretical results.

Concepts in Minimum Disparity Estimation
Beran [1] introduced a robust method to estimate the parameters of a statistical model, called minimum Hellinger distance estimation. The parameter estimator is obtained by minimizing the Hellinger distance between a parametric model density and a nonparametric density estimator. Lindsay [19] extended the aforementioned method to incorporate many other distances, and introduced the concept of the residual adjustment function in the context of minimum disparity estimation. The Minimum Distance Estimators (MDE) of a parameter vector β are obtained by minimizing over β the distance (or disparity)

ρ(d, m_β) = Σ_t G(δ(t)) m_β(t),   (1)

where the assumed model m_β is a probability mass function and δ(t) = d(t)/m_β(t) − 1. When the model m_β is continuous, the MDE of the parameter vector β is obtained by minimizing over β the quantity

ρ(f*, m*_β) = ∫ G(δ(x)) m*_β(x) dx,   (2)

where f*(x) = ∫ k(x; t, h) dF̂(t), m*_β(x) = ∫ k(x; t, h) m_β(t) dt, F̂ is the empirical distribution function obtained from the data and k is a smooth family of kernel functions. One example is the normal density with mean t and standard deviation h. Furthermore, δ(x) is the Pearson residual defined as δ(x) = f*(x)/m*_β(x) − 1. Lindsay [19] and Basu and Lindsay [2] discuss the efficiency and robustness properties of these estimators. Under appropriate conditions, the estimating equations obtained from (1) and (2) can be written as

Σ_t A(δ(t)) ∇_β m_β(t) = 0 and ∫ A(δ(x)) ∇_β m*_β(x) dx = 0,

where A(δ) = (δ + 1)G′(δ) − G(δ) and the prime denotes differentiation with respect to δ. Lindsay [19] has shown that the structural characteristics of the function A(δ) play an important role in the robustness and efficiency properties of these methods. Furthermore, without loss of generality, we can center and rescale A(δ), and define the RAF as follows.
Definition 1 (Lindsay [19]). Let A(δ) = (1 + δ)G′(δ) − G(δ) be an increasing and twice differentiable function on [−1, ∞), centered and rescaled so that A(0) = 0 and A′(0) = 1, where G is strictly convex and twice differentiable with respect to δ on [−1, ∞) with G(0) = 0. Then, A(δ) is called the residual adjustment function.

Remark 1.
Since A′(δ) = (1 + δ)G″(δ), the second order differentiability of G, in addition to its strict convexity, implies that A(δ) is a strictly increasing function of δ on [−1, ∞). Thus, we can define A(δ) as above without changing the solutions of the aforementioned estimating equations in the discrete case (see Lindsay [19], p. 1089). In the continuous case, such standardization does not change the estimating properties of the associated disparities (see Basu and Lindsay [2], p. 687).
Two fundamental and at the same time conflicting goals in robust statistics are the goals of robustness and efficiency. In the traditional literature on robustness, first order efficiency is sacrificed and, instead, safety of the estimation or testing method against outliers is guaranteed. Here, one adheres to the notion that information about the robustness of a method is carried by the influence function. In our setting, using the influence function to characterize the robustness properties of the associated estimation procedures is misleading. Instead, the shape of the RAF, A(·), provides information about the extent to which our procedures can be characterized as robust. The interested reader is directed to Lindsay [19] for further discussion on this topic.
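To make the role of the RAF concrete, here is a small numerical sketch of ours. It uses the closed forms G(δ) = (δ+1)ln(δ+1) − δ for the likelihood disparity (whose RAF is A(δ) = δ) and G(δ) = 2(√(δ+1) − 1)² for the twice-squared Hellinger distance (whose RAF is A(δ) = 2(√(δ+1) − 1)); these forms are standard, not taken from this paper's displays.

```python
import numpy as np

# G functions written in terms of the Pearson residual delta in [-1, inf):
# likelihood disparity (LD) and twice-squared Hellinger distance (HD)
G_ld = lambda d: (d + 1) * np.log(d + 1) - d
G_hd = lambda d: 2 * (np.sqrt(d + 1) - 1) ** 2

def raf(G, delta, eps=1e-6):
    """A(delta) = (delta + 1) G'(delta) - G(delta), G' by central differences."""
    Gp = (G(delta + eps) - G(delta - eps)) / (2 * eps)
    return (delta + 1) * Gp - G(delta)

delta = np.linspace(-0.9, 5.0, 60)
A_ld = raf(G_ld, delta)   # should match delta exactly: A(delta) = delta
A_hd = raf(G_hd, delta)   # should match 2 * (sqrt(delta + 1) - 1)
```

The flatter growth of A_hd for large positive δ is the downweighting of surprising observations that underlies the robustness of the Hellinger-type estimators.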

Pearson Residual Systems
In this section, we define various Pearson residuals, appropriate for the measurement scale of the data. We introduce our notation first. Let (y i , x i ), i = 1, 2, . . . , n be realizations from n independent and identically distributed random variables that follow a distribution with density m β (x, y). Recall that we use the word density to denote a general probability function, independently of whether the random variables X, Y are discrete, continuous or mixed. In what follows, we define different Pearson residual systems that account for the measurement scale of the data and study their properties.
Case 1: Both X and Y are discrete. In this case, the pairs (y_i, x_i) follow a discrete probability mass function m_β(x_i, y_i). Define the Pearson residual as

δ(x, y) = n_{x,y}/(n m_β(y|x) π_x) − 1,

where π_x = P(X = x) = g(x), and n_{x,y} is the number of observations in the cell with Y = y and X = x.
Note that this definition of the Pearson residual is nonparametric on the discrete support of X. In the case of regression, one can carry out a semiparametric argument to obtain the estimators of the vector β and π x .
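A toy computation of this residual (our own numbers, with a uniform conditional model as the hypothesis and the column totals as nonparametric estimates of the π_x's):

```python
import numpy as np

# toy 3x3 contingency table; rows index y, columns index x (illustrative only)
n_xy = np.array([[20.0, 10.0,  5.0],
                 [15.0, 25.0, 10.0],
                 [ 5.0, 15.0, 20.0]])
n = n_xy.sum()

pi_x = n_xy.sum(axis=0) / n                   # nonparametric estimate of P(X = x)
m_y_given_x = np.full((3, 3), 1.0 / 3.0)      # hypothesized conditional model

# delta(x, y) = n_{x,y} / (n * m_beta(y|x) * pi_x) - 1
delta = n_xy / (n * m_y_given_x * pi_x[None, :]) - 1.0
```

Cells that are over-represented relative to the model produce large positive residuals, while empty cells give δ = −1.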
We now establish that, under correct model specification, the residual δ(x, y) converges, almost surely, to zero.

Proposition 1.
When the model is correctly specified and as n → ∞, δ(x, y) converges to zero almost surely. To see this, note that n_{x,y}/n converges almost surely to m_{β_0}(y|x)π_x by the strong law of large numbers. Then

n_x/n = (# of observations in the sample equal to x)/n = (1/n) Σ_{i=1}^n I(x_i = x),

where I(·) is the indicator function, converges almost surely to π_x. Furthermore, δ(x, y) = n_{x,y}/(n m_{β_0}(y|x) π_x) − 1 → 0 almost surely.

Case 2: Y is continuous and X is discrete. This is the case in some ANOVA models. We can still define the Pearson residual in this setting as

δ(x, y) = f*(x, y)/m*_β(x, y) − 1,

where f*(x, y) smooths the data over the continuous coordinate y only and m*_β(x, y) is the correspondingly smoothed model. Then, under the correct model specification, continuity of the kernel function and the fact that F̂_n converges completely to F (an implication of the Glivenko-Cantelli theorem), f*(x, y) converges to m*_{β_0}(x, y) (an extension of the Helly-Bray lemma). Therefore, δ(x, y) → 0 almost surely.
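A numerical sketch of Case 2-style smoothed residuals (our own construction, not the paper's code): within one cell of the discrete variable, take Y ~ N(0, 1) and a normal kernel with bandwidth h. Smoothing the N(0, 1) model with a N(0, h²) kernel gives N(0, 1 + h²), so under correct specification the smoothed Pearson residuals should be close to zero.

```python
import numpy as np

rng = np.random.default_rng(1)

# one discrete cell x, with Y | X = x ~ N(0, 1); normal kernel, bandwidth h
n, h = 4000, 0.3
y = rng.normal(0.0, 1.0, size=n)

def normal_pdf(z, mean, sd):
    return np.exp(-0.5 * ((z - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

grid = np.linspace(-2, 2, 9)
# kernel-smoothed data density f*(y) and kernel-smoothed model m*(y);
# convolving N(0, 1) with the N(0, h^2) kernel yields N(0, 1 + h^2)
f_star = np.array([normal_pdf(g, y, h).mean() for g in grid])
m_star = normal_pdf(grid, 0.0, np.sqrt(1 + h**2))

delta = f_star / m_star - 1.0   # Pearson residuals on the smoothed scale
```

Because both the data and the model are smoothed with the same kernel, no scale mismatch is introduced, and the residuals reflect only sampling noise here.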

Case 3: Y is continuous and X is continuous.
In this case, the pairs (y_i, x_i) follow a continuous probability distribution. The Pearson residual is then defined as

δ(x, y) = f*(x, y)/m*_β(x, y) − 1.

As an example, we take the linear regression model with random carriers X and errors ε_i ∼ N(0, 1). Furthermore, assume that the random carriers follow a normal distribution with mean vector µ and covariance matrix Σ. In this case, y_i = x_i^T β + σε_i and the quantities z_i = (y_i − x_i^T β)/σ are independent, identically distributed random variables when β represents the vector of true parameters. Hence, the z_i's represent realizations of a random variable Z that has a completely known density f(z). Thus, the Pearson residual can be formed on the z-scale by comparing the kernel-smoothed density of the z_i's with the model density f smoothed by the same kernel. The kernel k(z, t, h) is selected so that it facilitates easy computation. Kernels that do not entail loss of information when they are used to smooth the assumed parametric model are called transparent kernels (Basu and Lindsay [2]). Basu and Lindsay [2] provide a formal definition of transparent kernels and an insightful discussion of why transparent kernels do not exhibit information loss when convoluted with the hypothesized model (see Section 3.1 of Basu and Lindsay [2]).
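The standardization step in this example can be sketched as follows (our own toy code; the values of β, σ and the design are made up for illustration). At the true parameters, the z_i's behave like an i.i.d. N(0, 1) sample:

```python
import numpy as np

rng = np.random.default_rng(2)

# linear regression with random carriers: y = x^T beta + sigma * eps, eps ~ N(0, 1)
n, beta, sigma = 2000, np.array([1.5, -0.5]), 2.0
x = rng.normal(size=(n, 2))               # random carriers
y = x @ beta + sigma * rng.normal(size=n)

z = (y - x @ beta) / sigma                # z_i ~ N(0, 1) at the true parameters
```

Any disparity between a kernel density estimate of the z_i's and the (kernel-smoothed) standard normal density can then be used for estimation, exactly as in the univariate continuous case.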

Estimating Equations
In this section, we concentrate on cases 1, 2 presented in the previous section. We carefully outline the optimization problems and discuss the associated estimating equations for these two cases. The case where both X and Y are continuous has been discussed in the literature, see, for example, Markatou et al. [21].
Case 1: Both X and Y are discrete. In this case, the minimum distance estimators of the parameter vector β and the π_x's are obtained by solving the following optimization problem:

min_{β, π} Σ_{x,y} G(δ(x, y)) m_β(y|x) π_x, subject to Σ_x π_x = 1.

The class of G functions that we use creates distances that belong to the family of φ-divergences.

Proposition 3. The estimating equations for β and π_x are given as:

Σ_{x,y} w(δ(x, y)) n_{x,y} u(y|x; β) = 0,   (3)

Σ_y w(δ(x, y)) n_{x,y} − π_x Σ_{x',y} w(δ(x', y)) n_{x',y} = 0, for each x.   (4)

The function w(δ(x, y)) is a weight function, such that 0 ≤ w(δ(x, y)) ≤ 1, and it is defined as

w(δ(x, y)) = [A(δ(x, y)) + 1]_+ / (δ(x, y) + 1),

with [·]_+ indicating the positive part of the function A(δ(x, y)) + 1.
Proof. The main steps of the proof are provided in the Appendix A.1.

1.
The above two estimating equations can be solved with respect to β and π_x. In an iterative algorithm, we can solve the second equation (4) explicitly for π_x to obtain

π̂_x = Σ_y w(δ(x, y)) n_{x,y} / Σ_{x',y} w(δ(x', y)) n_{x',y}.

This means that if the model does not fit any of the y's observed at a particular x well, the weights for this x, and with them the estimate of π_x, will drop as well.

2.
When A(δ(x, y)) = δ(x, y), the corresponding estimating equation for β becomes Σ_{x,y} n_{x,y} u(y|x; β) = 0 and the MLE is obtained. This is because the corresponding weight function w(δ(x, y)) = 1. In this case, the estimating equations for the π_x's become Σ_y n_{x,y}/(n π_x) − 1 = 0, that is, π̂_x = n_x/n, the estimating equations and estimators of the MLEs of the π_x's.

3.
The Fisher consistency of the functions that define the estimators guarantees that the expectation of the corresponding estimating function is 0 under the correct model specification.
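The weight function w(δ) = [A(δ) + 1]_+/(δ + 1) can be sketched numerically (our own illustration). With A(δ) = δ (likelihood disparity) the weights are identically 1, recovering the MLE; with the Hellinger RAF A(δ) = 2(√(δ+1) − 1) (a standard closed form, not from this paper's displays), large residuals are downweighted:

```python
import numpy as np

def weight(delta, A):
    """w(delta) = [A(delta) + 1]_+ / (delta + 1), clipped to [0, 1]."""
    w = np.maximum(A(delta) + 1.0, 0.0) / (delta + 1.0)
    return np.minimum(w, 1.0)

delta = np.array([0.0, 1.0, 4.0, 24.0])          # increasingly surprising cells

w_ld = weight(delta, lambda d: d)                          # likelihood disparity
w_hd = weight(delta, lambda d: 2 * (np.sqrt(d + 1) - 1))   # Hellinger RAF
```

For example, a cell observed 25 times more often than the model predicts (δ = 24) receives a Hellinger weight of (8 + 1)/25 = 0.36, while the likelihood disparity still gives it full weight 1.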
Case 2: Y is continuous and X is discrete.
In this case, the estimates of the parameters β and π_x are obtained by solving the following optimization problem:

min_{β, π} Σ_x π_x ∫ G(δ(x, y)) m*_β(y|x) dy, subject to Σ_x π_x = 1,   (5)

and, once a Lagrange multiplier is introduced for the constraint, the optimization problem stated above is equivalent to minimizing the corresponding Lagrangian.

Proposition 4.
The estimating equations for β and π_x in the case of independence of y, x are given as follows:

Σ_x π_x ∫ A(δ(x, y)) ∇_β m*_β(y) dy = 0,

∫ A(δ(x, y)) m*_β(y) dy = λ, for each x,

where A(δ) = (δ + 1)G′(δ) − G(δ) is the residual adjustment function (RAF) that corresponds to the function G, G′(δ) is the derivative of G with respect to δ, and λ is the Lagrange multiplier associated with the constraint Σ_x π_x = 1.
Proof. Straightforward, after differentiating the Lagrangian with respect to β and π x .
Case 3: Y is continuous and X is continuous.
In this case, we refer the reader to Basu and Lindsay [2].

Robustness Properties
Hampel et al. [29] and Hampel [30,31] define robust statistics as the "statistics of approximate parametric models", and introduce one of the fundamental tools of robust statistics, the concept of the influence function, in order to investigate the behavior of a statistic T_n expressed as a functional T(G). The influence function is a heuristic tool with the intuitive interpretation of measuring the bias caused by an infinitesimal contamination at a point x on the estimate, standardized by the mass of contamination. Its formal definition is as follows: the influence function of a functional T at the distribution F is given as

IF(x; T, F) = lim_{t→0} [T((1 − t)F + t∆_x) − T(F)]/t,

in those x ∈ X where the limit exists, where 0 ≤ t ≤ 1 and ∆_x is the Dirac measure placing mass 1 at the point x. If an estimator has a bounded influence function, the estimator is considered to be robust to outliers, that is, to data points that lie away from the pattern set by the majority of the data. The effect of bounding the influence function is a sacrifice of efficiency; estimators with a bounded influence function, while not affected by outlying points, are not fully efficient under the correct model specification.
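The limit in the definition can be approximated by a finite contamination derivative. A sketch of ours for the mean functional, whose influence function is the well-known IF(x₀) = x₀ − T(F) and is therefore unbounded:

```python
import numpy as np

def mean_functional(support, probs):
    """T(F) = E_F[X] for a discrete distribution F."""
    return np.sum(support * probs)

def influence_numeric(x0, support, probs, t=1e-6):
    """Finite-difference version of IF(x0; T, F) at contamination mass t."""
    # contaminated distribution (1 - t) F + t * Dirac(x0)
    support_c = np.append(support, x0)
    probs_c = np.append((1 - t) * probs, t)
    return (mean_functional(support_c, probs_c) - mean_functional(support, probs)) / t

support = np.array([0.0, 1.0, 2.0])
probs = np.array([0.2, 0.5, 0.3])
mu = mean_functional(support, probs)          # T(F) = 1.1

# IF(x0) = x0 - mu for the mean: unbounded in x0, hence the mean is not robust
if_at_10 = influence_numeric(10.0, support, probs)
```

Since the mean is linear in F, the finite difference is exact here; for nonlinear functionals it approximates the Gâteaux derivative as t → 0.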
Our goal in calculating the influence function is to show the full efficiency of the proposed estimators. That is, the influence function of the proposed estimators, under correct model specification, equals the influence function of the corresponding maximum likelihood estimators. In our context, robustness of the estimators is quantified by the associated RAFs (see Lindsay [19] and Basu and Lindsay [2]).
In what follows, we will derive the influence function of the estimators for the parameter vector β in the case where both y, x are discrete. Similar calculations provide the influence functions of the estimators obtained under the remaining scenarios. To do so, we need to resort to the estimators' functional form, denoted by β_ε, defined by the corresponding estimating equations evaluated at the contaminated distribution (1 − ε)F + ε∆_{(s,t)}. The influence function is then obtained by differentiating the aforementioned estimating equations with respect to ε and then evaluating the derivative at ε = 0.

Proposition 5. The influence function of the β estimator is given by the expression derived in Appendix A.2, with u(t|s; β) = ∇_β ln m_β(t|s), where the subscript 0 indicates evaluation at the parametric model.

Proof.
The proof is obtained via straightforward differentiation and its main steps are provided in the Appendix A.2.

Proposition 6.
Under the assumption that the model is correct, the influence function derived reduces to the influence function of the MLE of β.
Proof. Under the assumption that the adopted model is the correct model, the density d(x, y) equals m_{β_0}(x, y). Furthermore, the expression B(x, y; d) reduces to u(y|x; β_0), where we assume exchangeability of differentiation and integration and use the fact that u(t|s; β_0) = u(s, t; β_0). Hence, the influence function is given as

IF(s, t) = I^{-1}(β_0) u(t|s; β_0),

which is exactly the influence function of the MLE. Therefore, full efficiency is preserved under the model.

Asymptotic Properties
In what follows, we establish asymptotic normality of the estimators in the case of discrete variables. The techniques for obtaining asymptotic normality in the mixed-scale case are similar and not presented here.
Recall that the k-th estimating equation is given as Σ_{x,y} w(δ_β(x, y)) n_{x,y} u_k(y|x; β) = 0, which can be expanded in a Taylor series in the neighborhood of the true parameter β_0; the expansion involves a p × p Hessian matrix C_n whose (t, e)-th element is the derivative of the t-th estimating function with respect to the e-th component of β. Under Assumptions 1-8, listed in Appendix A.3, we have the following theorem.

Theorem 1. The minimum disparity estimators of the parameter vector β are asymptotically normal, with asymptotic variance I^{-1}(β_0), where I(·) indicates the Fisher information matrix.

Simulations
The simulation study presented below has two aims. The first is to demonstrate the versatility of the disparity methods across different data measurement scales. The second is to exemplify and study the robustness of these methods under different contamination scenarios.
Case 1: Both X and Y are discrete. The Cressie-Read family of power divergences is given by

I^λ(d, m_β) = 1/(λ(λ + 1)) Σ_{x,y} d(x, y) [(d(x, y)/m_β(x, y))^λ − 1], λ ≠ 0, −1,

where d(x, y) = n_{x,y}/n is the proportion of observations with value (x, y) and m_β(x, y) = m_β(y|x)π_x is the density function of the model of interest.
To evaluate the performance of our algorithmic procedure, we use the following disparity measures:

Pearson's chi-squared divided by 2 (λ = 1): PCS(d, m_β) = Σ_{x,y} δ²(x, y) m_β(x, y)/2;

twice-squared Hellinger distance (λ = −1/2): HD(d, m_β) = 2 Σ_{x,y} (√d(x, y) − √m_β(x, y))²;

symmetric chi-squared: SCS(d, m_β) = 2 Σ_{x,y} δ²(x, y) m_β(x, y)/(δ(x, y) + 2);

likelihood disparity (λ → 0): LD(d, m_β) = Σ_{x,y} d(x, y) ln(d(x, y)/m_β(x, y)).

The data are generated in four different ways using three different sample sizes N, namely N = 100, N = 1000 and N = 10,000. The data format used can be represented in a 5 × 5 contingency table, with n_{i,j}, i = 1, 2, . . . , 5; j = 1, 2, . . . , 5 denoting the counts in the ij-th cell, and n_{i•} and n_{•j} representing the row and column totals, respectively. Furthermore, the variable x indicates columns, while y indicates rows. In each of the aforementioned cases/scenarios, 10,000 tables were generated, which corresponds to the number of Monte Carlo (MC) replications. Our purpose is to obtain the mean values of the estimates of the parameters m_β(y|x) and π_x along with their corresponding standard deviations (SDs). Notice that, in this setting, the estimation of π_x and m_β(y|x) is completely nonparametric, that is, no model is assumed for estimating the marginal probabilities of X and Y.
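The four disparities can be sketched in Python (rather than the paper's R code) as functions of the observed counts and the model probabilities; the function names below are ours:

```python
import numpy as np

def disparities(counts, m):
    """PCS/2, twice-squared Hellinger, symmetric chi-squared, likelihood disparity."""
    d = counts / counts.sum()
    delta = d / m - 1.0
    pcs = np.sum(delta**2 * m) / 2.0
    hd = 2.0 * np.sum((np.sqrt(d) - np.sqrt(m))**2)
    scs = np.sum(2.0 * delta**2 * m / (delta + 2.0))
    ld = np.sum(d[d > 0] * np.log(d[d > 0] / m[d > 0]))
    return pcs, hd, scs, ld

counts = np.array([30.0, 20.0, 25.0, 25.0])
m = np.full(4, 0.25)
pcs, hd, scs, ld = disparities(counts, m)

# all four disparities vanish when the observed proportions equal the model
zero = disparities(np.full(4, 25.0), m)
```

Note that the LD term is summed over nonzero cells only, which is why empty cells need special handling (set to a tiny value here as in the paper's 10^{-8} convention, or penalized as in Basu and Basu [9]).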
The table was generated by using either a fixed total sample size N or fixed marginal probabilities. These two data generating schemes imply two different sampling schemes that could have generated the data, with consequences for the probability model one would use. For example, with fixed total sample size the distribution of the counts is multinomial, whereas if the row margin is fixed in advance the distribution of the counts is a product binomial distribution. In the former case of fixed N, we explored two different scenarios: a balanced and an imbalanced one. The imbalanced scenario allows for the presence of one zero cell in the contingency table, whereas the balanced scenario does not. In the latter case of fixed marginal probabilities, the row marginal probabilities (the m_β(y|x)'s) were fixed, while the column marginals (the π_x's) were randomly chosen, and these values were used to obtain the contingency table. In this case, we also explored a balanced and an imbalanced scenario, based on whether the row marginal probabilities were chosen so as to be equal to each other or not, respectively.
Specifically, under Scenario Ia, where the total sample size N was fixed and the balanced design was exploited, none of the n_{ij}'s was set equal to zero (n_{ij} ≠ 0, ∀ i, j = 1, 2, 3, 4, 5), with equal row and column marginal probabilities. Table 1 presents the mean of 10,000 estimates and the corresponding SDs for all four distances (PCS, HD, SCS, LD) when N is fixed under the balanced scenario. Table 1 clearly shows that all distances provide estimates approximately equal to 0.200 regardless of the sample size used. Furthermore, as the sample size increases, the SDs decrease noticeably.
In Scenario IIa, where the total sample size N was fixed and the contingency table was structured using the imbalanced design, the presence of a zero cell (n_11 = 0) was allowed. The results of this scenario are presented in Table 2, where the estimates were calculated exploiting all disparity measures. For the LD, n_11 was set equal to 10^{-8}. The presence of zero cells in contingency tables has a long history in the relevant literature on contingency table analysis, where several options are provided for the analysis of these tables (Fienberg [32], Agresti [33], Johnson and May [34], Poon et al. [35]). From Table 2, one could infer that the different distances handle the zero cell differently. This difference is reflected in the estimate m̂_β(y_1|x) = m̂_{β_1}, because it is affected by the zero value of n_11. The strongest control is provided by the Hellinger and symmetric chi-squared distances. All distances estimate the parameters π_{x_i} similarly, with the bias in their estimation being between 2.7% and 5.2%. The SDs are almost the same for all distances per estimate, and they improve for N = 10,000.
A referee suggested that in certain cases interest may be centered on smaller samples. We generated 2 × 3 tables with fixed total sample sizes of 50 and 70 observations. Tables 3 and 4 describe the results when the contingency tables were generated under a balanced and an imbalanced design, with associated respective Scenarios Ib and IIb. More precisely, Table 3 presents the estimators of the marginal row and column probabilities obtained when the PCS, HD, SCS and LD distances are used. We notice that the increase in the sample size provides for a decrease in the overall absolute bias in estimation, defined as Σ_{ℓ=1}^L |θ̂_ℓ − θ_{0,ℓ}|, where θ̂_ℓ is the estimate of the ℓ-th component of an L × 1 vector θ and θ_{0,ℓ} is the corresponding true value. In our case, θ^T = (m_{β_1}, m_{β_2}, π_{x_1}, π_{x_2}, π_{x_3}). This observation applies to all distances used in our calculations. Table 4 presents results associated with the imbalanced case. The generated 2 × 3 tables contain two empty cells (n_12 = n_21 = 0). Once again, for calculating the LD, the cells n_12 and n_21 were set equal to 10^{-8}. We notice that the bias associated with the estimates is rather large for all the distances, and an increased sample size does not alleviate the observed bias. Basu and Basu [9] have proposed an empty cell penalty for the minimum power-divergence estimators. This penalty leads to estimators with improved small sample properties. See also Alin and Kurt [36] for a discussion of the need for penalization in small samples. Table 5 provides the results obtained under Scenario III. In this case, the parameter estimates were calculated using the PCS, HD, SCS and LD distances when the 5 × 5 contingency table was constructed by fixing all row marginal probabilities at 0.20, that is, (0.20, 0.20, 0.20, 0.20, 0.20). The column marginals were randomly chosen in the interval [0, 1] so that they summed to 1. In this case, the produced column marginal probabilities were (0.1472, 0.2365, 0.3196, 0.2370, 0.0597).
The simulation study reveals that the estimates of the parameters m β (y|x)'s and π x 's do not differ substantially from the respective row and column marginal probabilities for any of the four distances utilized. The SDs are approximately the same and they get lower values for larger N.
Finally, in Table 6 the data generation was done by exploiting Scenario IV, that is, by fixing the row marginal probabilities, which were not equal to each other, while the column marginals were randomly chosen in the interval [0, 1] so that they sum to 1. In particular, the row marginal probabilities were fixed at the values (0.04, 0.20, 0.20, 0.20, 0.36), while the column marginals used were (0.2171, 0.1676, 0.2347, 0.1178, 0.2628). When N = 100, the value of m̂_β(y_1|x) = m̂_{β_1} is approximately 0.07 rather than the true value of 0.04 for all distances. However, when N = 1000 or N = 10,000, we get better estimates irrespective of the disparity measure chosen. The SDs are approximately the same and they become smaller as the sample size increases.
We also notice from Tables 1, 5 and 6 that, in all cases, the standard deviation associated with the estimates obtained when we use distances other than the likelihood disparity is approximately the same as the standard deviation that corresponds to the likelihood estimates, thereby illustrating the asymptotic efficiency of the disparity estimators.
All calculations were performed using the R language. Given that the problem described in this section can be viewed as a general non-linear optimization problem, the solnp function of the Rsolnp package (Ye [37]) was used to obtain the aforementioned estimates. For our calculations, we tried a variety of different initial values (π_x^{(0)}'s and m_β^{(0)}(y|x)'s); we noticed that, no matter how the initial values were chosen, the estimates were always very similar and very close to the observed values (n_{i•}/N and n_{•j}/N for i, j = 1, 2, 3, 4, 5). Only the number of iterations needed for convergence was slightly affected. Consequently, random numbers from a Uniform distribution on the interval [0, 1] were used as initial values (which were not necessarily summing to 1). The solnp function has a built-in stopping rule, so there was no need to set our own. We only set the boundary constraints to the interval [0, 1] for all estimates, which were also subject to Σ_x π_x = Σ_y m_β(y|x) = 1.
Other functions may also be used to obtain the estimates. For example, we used the auglag function of the nloptr package with local solvers "lbfgs" or "SLSQP" (Conn et al. [38], Birgin and Martínez [39]), which implements an augmented Lagrangian method. However, convergence using the solnp function (the number of iterations was on average 2) was much faster than using the auglag function (the average number of iterations was approximately 100). For this reason, the results presented in Tables 1-6 were based only on the function solnp.
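For readers working outside R, a rough Python analogue of this constrained optimization uses scipy.optimize.minimize with an equality constraint in place of solnp. This is a sketch of ours with a toy saturated one-way model, not the paper's exact configuration; for a saturated model the Hellinger minimizer should recover the observed proportions:

```python
import numpy as np
from scipy.optimize import minimize

counts = np.array([40.0, 35.0, 25.0])
d = counts / counts.sum()

def hellinger(pi):
    """Twice-squared Hellinger distance between d and the candidate pi."""
    return 2.0 * np.sum((np.sqrt(d) - np.sqrt(pi))**2)

res = minimize(
    hellinger,
    x0=np.full(3, 1.0 / 3.0),                        # uniform starting values
    method="SLSQP",
    bounds=[(1e-8, 1.0)] * 3,                        # probabilities in [0, 1]
    constraints=[{"type": "eq", "fun": lambda p: p.sum() - 1.0}],
)
pi_hat = res.x   # for a saturated model, the minimizer is the observed proportions
```

As in the paper's experience with solnp, the result is insensitive to the starting values because the objective is convex in √π over the simplex.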

Case 2: X is discrete and Y is continuous
In this section, we are interested in solving the optimization problem (5) when X is discrete, Y is continuous and X, Y are independent of each other. To evaluate the performance of our procedure, we used Hellinger's distance, which in this case takes on the following form:

HD(f*, m*_β) = 2 ∑_x ∫ [ √(f*(x, y)) − √(π_x m*_Y(y)) ]² dy,

where f*(x, y) denotes the kernel-smoothed empirical density and m*_Y(y) the correspondingly smoothed model density. The aim of this simulation is to obtain the minimum Hellinger distance estimators of π_x and µ assuming (without loss of generality) that σ² is known and equal to 1. All calculations were performed in the R language.
For this purpose, we generated mixed-type data of size N using the package OrdNor (Amatya and Demirtas [40]). More precisely, the data are comprised of one categorical variable X with three levels and probability vector (1/3, 1/3, 1/3), while the continuous part comes from a trivariate normal distribution, symbolically Y = (Y_1, Y_2, Y_3) with mean vector µ^T = (µ_1, µ_2, µ_3). We used two different mean vectors: µ^T = (0, 0, 0) and µ^T = (0, 3, 6). The sets of ordinal and normal variables were generated concurrently using an overall correlation matrix Σ, which consists of three components/sub-matrices: Σ_OO, Σ_ON and Σ_NN, with O and N corresponding to "Ordinal" and "Normal" variables, respectively; ρ_ON represents the polyserial correlations for the ON combinations (for more information on polyserial correlations refer to Olsson et al. [41]). Since X, Y were assumed to be independent, we set ρ_ON = 0.0. However, we also used weak correlations, say ρ_ON = 0.1 and 0.2, to investigate whether the estimates obtained in these cases remain reasonable. The kernel function was the multivariate normal density MVN_3(0, H), with H being estimated from the data using the kde function of the ks package (Duong [42]); m*_Y(y) represented the multivariate normal density MVN_3(µ, Σ + H) and m_X(x) was the multinomial mass function. This choice of smoothing parameter stemmed from the fact that we were interested in evaluating the performance, in terms of robustness, of standard bandwidth selection.
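For readers without access to OrdNor, the NORTA-style construction it implements can be imitated in a few lines of Python. This is a hedged sketch for the independence setting ρ_ON = 0 only (OrdNor additionally matches specified polyserial correlations): one latent normal column is cut at its empirical tertiles to produce the three equiprobable levels of X, while the remaining columns, shifted to mean (0, 3, 6), form Y.

```python
import numpy as np

rng = np.random.default_rng(3)
N, rho_on = 1000, 0.0

# 4x4 overall correlation: one latent-normal column for X, three for Y
Sigma = np.eye(4)
Sigma[0, 1:] = Sigma[1:, 0] = rho_on  # ON block; zero under independence

Z = rng.multivariate_normal(np.zeros(4), Sigma, size=N)
# discretize the first latent column into three equiprobable levels
X = np.digitize(Z[:, 0], np.quantile(Z[:, 0], [1 / 3, 2 / 3]))
# shift the normal columns to the mean vector (0, 3, 6) from the text
Y = Z[:, 1:] + np.array([0.0, 3.0, 6.0])
```

Setting rho_on to 0.1 or 0.2 reproduces the weak-correlation settings examined in Tables 7 and 8, up to the polyserial-correlation calibration that OrdNor performs.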
To solve the optimization problem, the solnp function of the Rsolnp package (Ye [37]) was used. Specifically, the initial values set for the probabilities π_x1, π_x2, π_x3 associated with the X variable were random uniform numbers in the interval [0, 1], while the initial values for the means µ_y1, µ_y2, µ_y3 were random numbers in the interval [Q1(Y_i), Q3(Y_i)] for i = 1, 2, 3, where Q1 and Q3 stand for the 25th and 75th percentiles, respectively, per component of the continuous part. Following the same procedure as Basu and Lindsay [2] in the univariate continuous case, here (in the mixed case) the numerical evaluation of the integrals was also done on the basis of Simpson's 1/3 rule, using the sintegral function of the Bolstad2 package (Bolstad [43]). Moreover, we calculated the mean values, the SDs, as well as the percentages of bias of the mean and the probability vectors for three different sample sizes, N = 100, N = 1000 and N = 1500, over 1000 MC replications. The bias is defined as the difference of the estimates from their "true" values, that is, bias(µ_yi) = µ̂_yi − µ_i and bias(π_xi) = π̂_xi − 1/3 for i = 1, 2, 3. The results are shown in Tables 7 and 8. In particular, Table 7 illustrates the mean values, the SDs and the bias percentages of the corresponding minimum Hellinger distance estimators, over 1000 MC replications, for the three different sample sizes and polyserial correlations, when µ = (0, 0, 0)^T. The estimates of the π_xi are approximately equal to 1/3 = 0.333, while the µ_yi estimates are almost zero, even in the cases of weak correlations. When ρ_ON = 0.0, the sample size choice does not seem to affect the values of the estimates, either overall or per component of the X, Y variables.
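To make the numerical recipe concrete, here is a univariate Python analogue (an illustrative sketch, not the authors' R code): the Hellinger objective between a kernel-smoothed sample density and the correspondingly smoothed N(µ, 1) model is evaluated with Simpson's rule on a grid and then minimized over µ, with scipy.integrate.simpson playing the role of Bolstad2's sintegral.

```python
import numpy as np
from scipy.integrate import simpson
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=1.0, size=1000)  # sample with true mean 2

h = 1.06 * y.std() * len(y) ** (-0.2)          # rule-of-thumb bandwidth
grid = np.linspace(y.min() - 4.0, y.max() + 4.0, 401)

# kernel density estimate f*(y), a normal kernel at each observation
f_star = norm.pdf(grid[:, None], loc=y[None, :], scale=h).mean(axis=1)

def hd(mu):
    # smoothed model: N(mu, 1 + h^2), matching the kernel-convolved density
    m_star = norm.pdf(grid, loc=mu, scale=np.sqrt(1.0 + h * h))
    # Hellinger objective integrated numerically by Simpson's rule
    return simpson(2.0 * (np.sqrt(f_star) - np.sqrt(m_star)) ** 2, x=grid)

mu_hat = minimize_scalar(hd, bounds=(-5.0, 10.0), method="bounded").x
```

Convolving the model with the same kernel as the data, as Basu and Lindsay [2] do, keeps the disparity free of the kernel-induced bias, so mu_hat lands near the true mean of 2.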
Specifically, we observe that the total absolute biases, computed as the sums of the individual component-wise absolute biases of the vectors π^T = (π_1, π_2, π_3) and µ^T = (µ_1, µ_2, µ_3), are approximately the same, with larger samples providing slightly smaller biases at the expense of a higher computational cost.

Table 7. Means, Absolute Biases and Overall Absolute Bias of the Hellinger's distance (HD). The data were concurrently generated with a given correlation structure (an overall correlation matrix Σ) and consist of a discrete variable X with marginal probability vector (1/3, 1/3, 1/3) and a continuous vector Y with µ^T = (0, 0, 0).

In Table 8, analogous results are presented, with the difference that the mean vector used was µ = (0, 3, 6)^T. The π_xi estimates are very close to 1/3 (= 0.333) for all X components, no matter which sample size or correlation is used. On the contrary, the interpretation of the µ_i estimates slightly differs in this case. We also calculated the overall absolute bias as well as the individual, per parameter, absolute biases. In this case, larger samples clearly provide estimates with smaller bias for both parameter vectors π, µ and for both cases, the case of independence as well as the case of weak correlations. However, the computational time increases.
In what follows, we also present, for illustration purposes, a small simulation example using a mixed-type, contaminated data set of size N = 1000, which was generated using the OrdNor package with ρ_ON = 0.0. Once again, the data were comprised of one categorical variable X with three levels and probability vector (1/3, 1/3, 1/3), and a trivariate continuous vector Y = (Y_1, Y_2, Y_3). The contamination affects only the continuous part, on the basis of α ∈ {1.00, 0.95, 0.90, 0.85, 0.80}: N_1 = α × N data were generated with Y coming from the multivariate standard normal distribution, and the remaining N_2 = N − N_1 data followed a multivariate normal distribution with mean vector µ^T = (3, 3, 3). It goes without saying that when α = 1.00, there is no contamination. Here, we are still considering the same optimization problem as the one described above and, consequently, we are interested in evaluating the minimum Hellinger distance estimators over 1000 MC replications by examining to what extent the contamination level affects these estimates.

Table 8. Means, Absolute Biases and Overall Absolute Bias of the Hellinger's distance (HD). The data were concurrently generated with a given correlation structure (an overall correlation matrix Σ) and consist of a discrete variable X with marginal probability vector (1/3, 1/3, 1/3) and a continuous vector Y with µ^T = (0, 3, 6) and covariance I_3.

As indicated by Table 9, when there is no contamination in the data (α = 1.00), the estimates of the π_xi's are almost equal to 1/3, while the µ_y estimates are almost equal to zero. As the data become more contaminated (i.e., as the value of α decreases), the minimum disparity estimators corresponding to the X variable remain consistent with their true values. However, this is not the case with the estimates of the µ_yi's, which deteriorate as the contamination level α shifts away from the target value of 1.00.
The mean parameters are estimated with reasonable bias (the maximum bias is 9%, for the second component of the mean) when α = 0.95, that is, when the contamination is 5%. When the contamination is 10%, the bias of the mean components is relatively high but still below 19%. With higher contamination, the percentage of bias in the mean components is in the interval [28.3%, 47%]. This is the result of using standard density estimation to obtain the smoothing parameters for the different mean components. Smaller values of these component smoothing parameters result in substantial bias reduction.

Table 9. Means and SDs of the Hellinger's distance (HD). The data were concurrently generated with a given correlation structure (an overall correlation matrix Σ) and consist of a discrete variable X with marginal probability vector (1/3, 1/3, 1/3) and a continuous trivariate vector Y.

We also looked at the case where the continuous model was contaminated by a trivariate normal with mean µ^T = (1.5, 1.5, 1.5) and covariance matrix I_3. In this case (results not shown), when the contamination is 5% the maximum bias of the mean components is 6.6%, while when the contamination is 10% the maximum bias of the mean components is 13.5%. Again, in this case the bandwidth parameters were obtained by fitting a unimodal density to the data.
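The robustness mechanism, and the role of a smaller bandwidth, can be illustrated in one dimension (a hypothetical Python sketch, not the trivariate simulation reported in Table 9): with 10% contamination centered at 3, the minimum Hellinger distance estimate of the target mean stays near 0 while the sample mean is pulled toward the contaminant.

```python
import numpy as np
from scipy.integrate import simpson
from scipy.stats import norm

rng = np.random.default_rng(2)
alpha, N = 0.90, 1000
n_out = N - int(alpha * N)
# 90% of the data from the N(0,1) target, 10% contamination centered at 3
y = np.concatenate([rng.normal(0.0, 1.0, N - n_out),
                    rng.normal(3.0, 1.0, n_out)])

h = 0.5  # a deliberately small bandwidth, to curb contamination-induced bias
grid = np.linspace(-6.0, 8.0, 701)
f_star = norm.pdf(grid[:, None], loc=y[None, :], scale=h).mean(axis=1)

def hd(mu):
    # smoothed N(mu, 1) model, convolved with the same kernel as the data
    m_star = norm.pdf(grid, loc=mu, scale=np.sqrt(1.0 + h * h))
    return simpson(2.0 * (np.sqrt(f_star) - np.sqrt(m_star)) ** 2, x=grid)

# simple grid search for the minimum Hellinger distance estimate of mu
mus = np.linspace(-2.0, 4.0, 121)
mu_hd = mus[np.argmin([hd(m) for m in mus])]
mu_mle = y.mean()  # non-robust benchmark, pulled toward the contaminant
```

Increasing h toward a standard rule-of-thumb bandwidth lets the contaminated bump blend into the main mode and inflates the bias of mu_hd, mirroring the pattern seen in Table 9.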
The above results are not surprising. A judicious selection of the smoothing parameter decreases the bias of the component estimates of the mean. Agostinelli and Markatou [44] provide suggestions on how to select the smoothing parameter that can be extended and applied in this context.

Discussion and Conclusions
In this paper, we discuss Pearson residual systems that conform to the measurement scale of the data. We place emphasis on the mixed-scale measurements scenario, which is equivalent to having both discrete (categorical or nominal) and continuous type random variables, and obtain robust estimators of the parameters of the joint probability distribution that describes those variables. We show that disparity methods can be used to protect against model misspecification and the presence of outliers, and that these methods provide reasonable results.
The scale and nature of measurement of the data imposes additional challenges, both computationally and statistically. Detecting outliers in this multidimensional space is an open research question (Eiras-Franco et al. [45]). The concept of outliers has a long history in the field of statistics and outlier detection methods have broad applications in many scientific fields such as security (Diehl and Hampshire [46], Portnoy et al. [47]), health care (Tran et al. [48]) and insurance (Konijn and Kowalczyk [49]) to mention just a few.
Classical outlier detection methods are largely designed for single measurement scale data. Handling mixed measurement scales is a challenge, with few works coming from both the field of statistics (Fraley and Wilkinson [50], Wilkinson [51]) and the fields of engineering and computer science (Do et al. [52], Koufakou et al. [53]). All these works use some version of a probabilistic outlier, either looking for regions in the space of data that have low density (Do et al. [52], Koufakou et al. [53]) or by attaching a probability, under a model, to the suspicious data point (Fraley and Wilkinson [50], Wilkinson [51]).
Our concept of a probabilistic outlier discussed here and expressed via the construction of appropriate Pearson residuals can unify the different measurement scales, and the class of disparity functions discussed above can provide estimators for the model parameters that are not influenced unduly by potential outliers.
One of the important parameters that controls the robustness of these methods is the smoothing parameter(s) used to compute the density estimator of the continuous part of the model. In our computations, we use standard smoothing parameters obtained from utilizing appropriate R functions for density estimation. The results show that, depending on the level of contamination and the type of contaminating probability model, the performance of the methods is satisfactory. Specifically, a small simulation study using the model reported in the caption of Table 9 shows that the overall bias associated with the mean components of the standard multivariate normal model is low when contamination with a multivariate normal model with mean components equal to 3 is less than or equal to 10%. But even in this case, when the percentage of contamination is greater than 10%, the bias increases when the smoothing parameter used is the one obtained from the R density function. Here, smaller values of the smoothing parameter guarantee reduction of the bias.
Devising rules for selecting the smoothing parameter(s) in the context of mixed-scale measurements that can guarantee robustness for larger than 5% levels of contamination may be possible. However, it is the opinion of the authors that greater levels of data inhomogeneity may indicate model failure, a case where assessing model goodness of fit is of importance.

The estimating equations are obtained from

∇_β [ ∑_{x,y} G(δ(x, y)) m_β(y|x) π_x − λ( ∑_x π_x − 1 ) ] = 0,

which can be equivalently expressed as follows,

∑_{x,y} π_x [∇_β G(δ(x, y))] m_β(y|x) + ∑_{x,y} π_x G(δ(x, y)) ∇_β m_β(y|x) = 0.
Therefore, we get

∑_{x,y} A(δ(x, y)) m_β(x, y) [ I(X = z)/π_x − 1 ] = 0,

and, by making use of the fact that ∑_{x,y} m_β(x, y) [ I(X = z)/π_x − 1 ] = 0, the above equation can be represented as

∑_{x,y} w(δ(x, y)) n_{x,y} I(X = x) = 0 for any x,

where I(X = x) is the indicator function of the event {X = x}.