Bivariate Distributions Underlying Responses to Ordinal Variables

: The association between two ordinal variables can be expressed with a polychoric correlation coefﬁcient. This coefﬁcient is conventionally based on the assumption that responses to ordinal variables are generated by two underlying continuous latent variables with a bivariate normal distribution. When the underlying bivariate normality assumption is violated, the estimated polychoric correlation coefﬁcient may be biased. In such a case, we may consider other distributions. In this paper, we aimed to provide an illustration of ﬁtting various bivariate distributions to empirical ordinal data and examining how estimates of the polychoric correlation may vary under different distributional assumptions. Results suggested that the bivariate normal and skew-normal distributions rarely hold in the empirical datasets. In contrast, mixtures of bivariate normal distributions were often not rejected.


Introduction
Data in the social and behavioral sciences commonly include observations from variables with ordinal response scales, such as Likert items. If ordinal variables do not possess metric properties, alternative techniques than those employed with continuous variables are required. The polychoric correlation coefficient proposed by Pearson [1] is a recommended measure of association between two ordinal variables. The polychoric correlation coefficient is based on the assumption that responses to ordinal variables are generated by two latent underlying continuous variables. The underlying variables are conventionally assumed to follow a bivariate normal distribution, also referred to as underlying bivariate normality. In the present study, this assumption will be addressed.
In simulation studies, the polychoric correlation coefficient has been compared to other measures of association, including the product-moment correlation, Spearman's rank correlation, and Kendall's tau coefficient [2,3]. These studies showed that if the underlying bivariate normality held, the polychoric correlation coefficient was generally closer to the correlation between the two underlying continuous variables than other measures of association. However, bivariate normality of the underlying continuous variables has been considered unrealistic [4,5]. Indeed, experience with empirical data suggests that the underlying bivariate normality assumption may be questionable [6][7][8][9][10][11]. These findings give rise to the question whether alternative distributions may represent features of the underlying continuous latent variables more accurately.
Some studies suggested that the polychoric correlation coefficient is fairly robust against small to moderate departures from underlying bivariate normality [12][13][14]. However, Grønneberg and Foldnes [15] explained that these simulation studies used a generation method of the ordinal data that is equivalent to discretizing normal data. In recent evaluations using simulated data that are not compatible with underlying normality, it was found that the polychoric correlation coefficient is highly sensitive to underlying Psych 2021, 3 non-normality [16][17][18]. Similar results have been found by Jin and Yang-Wallentin [19], who examined the robustness of the polychoric correlation against non-normality with data generated from skew-normal [20], skew-t (υ) [21], and Pareto [22] distributions.
In accordance with Muthén and Hofacker [8], Jin and Yang-Wallentin [19] suggested that the estimate of the polychoric correlation should only be used if the underlying bivariate normality assumption holds. When the assumption of underlying bivariate normality does not hold, the polychoric correlation coefficient can be based on other distributional assumptions that represent features of the data more accurately than the bivariate normal distribution [6]. Several alternative underlying bivariate distributions have been proposed already, among which are Azzalini and Dalla Valle's skew-normal distribution [19,20,23] and the mixture of normal distributions [24]. Previous research indeed shows that the polychoric correlation coefficient provides an accurate estimate of the correlation between the two underlying continuous variables when the correct underlying distribution is assumed [10,19,23]. In a study by Roscino and Pollice [23], a polychoric correlation coefficient was introduced based on the hypothesis that the underlying latent variables follow Azzalini and Dalla Valle's bivariate skew-normal distribution. The polychoric correlation based on this distribution yielded better estimates of the correlation between the underlying continuous latent variables than the original polychoric correlation when the sample size was large, or the number of categories of the ordinal variables was small, or the skewness parameters were discordant [23]. More recently, Jin and Yang-Wallentin [19] examined the performance of several generalizations in an extensive simulation study, including Azzalini and Dalla Valle's skew-normal distribution. In line with Roscino and Pollice [23], their results suggest that assuming an underlying skew-normal distribution generally produces lower bias in the polychoric correlation estimate than assuming an underlying bivariate normal distribution.
In addition to the skew-normal distribution, the mixture of normal distributions has been considered for the estimation of the polychoric correlation. Uebersax and Grove [25] proposed a latent mixture model in which the distribution of a latent trait measured by an ordinal variable is defined as a combination of two subgroups' probability density functions. Although their proposed model was presented as an approach to analyze rating agreement, it could also be applied for modeling mixtures with observed ordinal variables [24]. That is, a polychoric correlation coefficient can be based on the assumption that the underlying continuous latent variables follow a mixture of two or more normal distributions contingent on the idea that data have been gathered from two or more subpopulations. Because the means and variances of each subgroup's normal distribution are allowed to differ, the mixture distribution can be a flexible tool to account for heterogeneous and asymmetric data. This distribution may therefore be suited to accurately reflect the features of the underlying continuous latent variables. The polychoric correlation based on an underlying mixture of normal distributions has not been studied yet with empirical or simulated data.
Although it is clear that the polychoric correlation coefficient can be accurately estimated as long as the underlying distribution giving rise to the observed ordinal responses is known, it is impossible to identify the correct underlying distribution for empirical data. The Ref. [26] showed that with two binary variables and underlying non-normality, there can exist a very wide range of tetrachoric correlations that are consistent with the observed data, depending on what distribution is assumed for the underlying latent variables. With more than two response options per variable, and by using substantive knowledge to add restrictions to the underlying distributions, the range of possible correlations will get smaller and eventually converge to an identified case. In practice, it can still be informative to test which distributions are consistent with the observed data and which are not, so that some distributions may be ruled out.

The Present Study
The aim of the present study was to examine the fit of the bivariate normal distribution, bivariate skew-normal distribution, and mixture of bivariate normal distributions to a large number of pairs of ordinal variables from empirical datasets. Knowledge of the fit of the proposed underlying distributions to empirical data may contribute in several ways. First, the results of this study may help to determine the degree of applicability of the fitted distributions. Second, the results could explicate how to generate more realistic data in future simulation studies, including ordinal variables, to increase generalizability. Third, we examined how estimates of the polychoric correlation vary under different distributional assumptions in empirical data. Fourth, we provide R-syntax that other researchers can adapt to fit the distributions to ordinal data.
The remainder of the present paper is organized as follows. The polychoric correlation assuming underlying bivariate normality is described first. The bivariate skew-normal distribution [20] and mixture of bivariate normal distributions distribution [27] are then introduced as alternatives to the bivariate normal distribution for the latent continuous variables underlying the observed ordered variables. Subsequently, we illustrate estimations of the polychoric correlation on the basis of the bivariate normal distribution, the bivariate skew-normal distribution, and the mixture of bivariate normal distributions with an empirical example. This illustration may serve as a guideline for researchers on how to test the underlying distributions when estimating the polychoric correlation between two observed ordinal variables. We then present the results of a more extensive study on the fit of these distributions to a large number of contingency tables, showing which distributions are rejected most often in real data. We explicitly focus on the bivariate case in this article, and we discuss issues around the multivariate case in the discussion.

The Bivariate Normal Distribution
Consider two observed ordinal variables X 1 and X 2 with response categories i = 1, 2, · · · , I and j = 1, 2, · · · , J. The polychoric correlation coefficient assumes that the responses to X 1 and X 2 are generated by latent underlying continuous variables ξ 1 and ξ 2 . The relationship between observed ordinal X 1 and X 2 and underlying continuous ξ 1 and ξ 2 may be written as where τ (1) and τ (2) are the thresholds parameters for X 1 and X 2 , respectively. The thresholds represent the bounds of the response categories such that −∞ = τ 0 < τ 1 < · · · < τ I−1 < τ I = ∞. An item with I categories has I − 1 threshold parameters.
The maximum likelihood estimate of the polychoric correlation between X 1 and X 2 is the value of ρ that minimizes where p ij is the observed proportion and π ij (γ) is the expected proportion for X 1 = i and X 2 = j. The expected proportion may be written as where f (·) is a bivariate normal distribution with means µ and covariance matrix Σ.
Because the underlying continuous latent variables are not directly observed, their means and variances are usually fixed at zero and one, respectively. Alternatively, the location and scale of the underlying variable could be identified by fixing two thresholds. There exist two approaches to estimating the polychoric correlation coefficient [28]. In the two-step approach, the threshold parameters are estimated from univariate information first, and the other parameters are estimated in a second step while fixing the thresholds to the values obtained in the first step. Another approach is to estimate all parameters simultaneously. Throughout this manuscript, we use the latter approach.

The Skew-Normal Distribution
The univariate skew-normal distribution was proposed by Azzalini [29] and extended to the multivariate skew-normal distribution by Azzalini and Dalla Valle [20]. The skewnormal distribution is a natural generalization of the normal distribution with extra shape parameters α that regulate skewness and kurtosis. The bivariate skew-normal distribution involves two shape parameters α 1 and α 2 , and simplifies to the bivariate normal distribution when α 1 = α 2 = 0. With larger absolute values of α, the skewness and kurtosis of the distribution increase. The distribution is right-skewed and leptokurtic if α > 0, and left-skewed and leptokurtic if α < 0. According to Azzalini and Dalla Valle [20], the skew-normal distribution is reasonably flexible with regard to empirical data fitting, and maintains some convenient formal properties of the normal distribution. An additional advantage of the bivariate skew-normal distribution is that its marginal distributions are skew-normal.
The density function of the bivariate skew-normal distribution for the underlying continuous latent variables ξ 1 and ξ 2 is given by where φ(·, ·; ω) is the density function of the standard normal distribution with correlation ω, and Φ(·) is the standard normal distribution function. Under the underlying bivariate skew-normal assumption, the expected probability π ij (γ) = P(X = i, Y = j) may be written as where g(·) denotes the bivariate skew-normal distribution. The polychoric correlation coefficient under the bivariate skew-normal distribution can be obtained by where π is the actual number π. The parameters δ 1 and δ 2 can be computed from α 1 α 2 and ω by and vary in (−1, 1).

Mixture of Normal Distributions
The mixture of normal distributions was first proposed by Pearson [27]. The mixture of normal distributions decomposes the population into a set of subpopulations, often called components. The mixture distribution is composed of a weighted sum of each subpopulation's normal distribution with means µ and covariance matrix Σ. An advantage of the mixture of normal distributions is that great flexibility can be achieved with only a few subpopulations [25]. In the present study, we only considered mixtures of two subpopulations.
The density function of the mixture of bivariate normal distributions for the underlying continuous latent variables ξ 1 and ξ 2 with two subpopulations can be written as where 0 < λ < 1 is the component probability, that is, the prevalence of the subpopulation in which the underlying continuous latent variables ξ 1 and ξ 2 follow a bivariate normal distribution with means µ and covariance matrix Σ. In the subpopulation with a prevalence of (1 − λ), the underlying continuous latent variables follow a bivariate normal distribution with means µ * and covariance matrix Σ * . The expected proportion π ij under the mixture of bivariate normal distributions in Equation (8) is where h(·) denotes the mixture distribution. For identification of the underlying continuous latent variables, the means and variances in the first subpopulation can be fixed at zero and unity, respectively, while constraining the thresholds to be equal across subpopulations. All other parameters can be freely estimated. The polychoric correlation in the first subpopulation is equal to the covariance σ 12 . The polychoric correlation in the second subpopulation ρ * is equal to the covariance σ * 12 divided by the product of the standard deviations of ξ 1 and ξ 2 , diag(Σ * ).
The number of parameters to be estimated for the mixture of bivariate normal distributions can be reduced by imposing restrictions on the polychoric correlations, means, or variances. One possibility is to restrict ρ and ρ * to be equal. With this restriction, the association between ξ 1 and ξ 2 is the same in each subpopulation. Another approach to reduce the number of parameters is to fix ρ * at zero, resulting in a mixture of bivariate normal distributions in which ξ 1 and ξ 2 are not associated in the second subpopulation. Additionally, one can restrict the means to be equal across the two subpopulations, µ = µ * = 0 0 . This restriction results in a mixture of bivariate normal distributions with a single mode. Another option is to restrict the variances of the underlying variables to be equal across the subpopulations, σ = σ * = 1 1 .

Testing Underlying Distributions
The fit of underlying distributions on empirical data can be tested for a given pair of ordinal variables with the likelihood ratio test (LRT; [7]). The LRT statistic is given by whereπ ij is the estimated expected proportion obtained from the tested distribution with estimated parameter vectorγ. When the model holds, the LRT statistic is asymptotically chi-squared distributed with degrees of freedom equal to where I × J is the number of response patterns and n is the number of estimated parameters [30].

Illustrative Example
Below, we briefly demonstrate estimating the polychoric correlation assuming an underlying bivariate normal distribution, underlying bivariate skew-normal distribution, and underlying mixture of bivariate normal distributions with an empirical example. The purpose of this illustration is to show how one can test whether an underlying distribution may be suitable or not when estimating the polychoric correlation. The R scripts of the example are available in Appendixes A-C.

Example: Type D Personality Data
The Ref. [31] gathered observations from 541 respondents on the Type D personality, which can be described as a tendency to experience negative affectivity and social inhibition. The DS14 was used as an instrument to measure these two tendencies, and observations on Item 2 from the negative affectivity subscale and Item 6 from the social inhibition subscale are presented in a contingency table given in Table 1. Assuming underlying bivariate normality, the estimated polychoric correlation between negative affectivity and social inhibition wasρ = 0.29. The bivariate normal distribution, however, did not fit the data at a 5% level of significance, χ 2 (15) = 52.69, p < 0.001. The bivariate skew-normal distribution with estimated correlationρ = 0.29 and shape parameters α 1 = 0.07 and α 2 = 1.67 did not fit the data either, χ 2 (13) = 52.23, p < 0.001. In addition, the increase in fit compared to the bivariate normal distribution was not significant, ∆χ 2 (2) = 0.50, p = 0.780. Figure 1 shows the estimated underlying bivariate skew-normal distribution and its marginal distributions. The marginal distribution underlying the observed social inhibition was skewed to the right (see Figure 1b) as indicated by the positive shape parameter.  The mixture of bivariate normal distributions with freely estimated means, variances, and correlation in the second subsample was not rejected at a 5% level of significance, χ 2 (9) = 9.36, p = 0.405. The estimated component probability wasλ = 0.50, reflecting equally large subsamples. In the first subsample, the underlying continuous variables followed a bivariate standard normal distribution with means of 0, variances of 1, and a polychoric correlation ofρ = 0.85. In the other subsample, the bivariate normal distribution of the underlying variables had means ofμ * = 0.34 −0.10 , a covariance matrix of Σ * = 1.14 −0.30 −0.30 0.72 , and a polychoric correlation ofρ * = −0.33. This mixture of bivariate normal distributions fitted significantly better than a mixture distribution with the correlation in the second subsample fixed at zero, ∆χ 2 (1) = 4.88, p = 0.027, indicating that the correlation in the second subsample significantly differs from zero.
The estimated mixture of bivariate normal distributions is illustrated in Figure 2.

Empirical Study on the Fit of Different Distributions
We now turn to our empirical study in which the fit of the bivariate normal distribution, bivariate skew-normal distribution, and the mixture of bivariate normal distributions was tested. These distributions were fitted to 700 contingency tables stemming from two datasets described below.

Type D Personality Data
The first empirical dataset that was used in the present study was gathered by [31] to examine whether negative affectivity and social inhibition predict cardiovascular events in 541 patients with coronary artery disease. The sample consisted of 541 patients, of which 473 were male. Negative affectivity and social inhibition were measured using the DS14 [32]. The DS14 is a widely used instrument for the assessment of the Type D personality, which is described as a tendency towards negative affectivity (i.e., experiencing negative emotions) and social inhibition (i.e., experiencing social discomfort, reticence, and lack of social poise). The DS14 contains 14 items with five ordered response categories each (1 = false, 2 = rather false, 3 = neutral, 4 = rather true, 5 = true). Negative affectivity and social inhibition were both assessed with seven items. Two social inhibition items, items 1 and 3, were negatively worded and were therefore recoded in this study such that a higher score indicates a higher level of social inhibition. With 14 items, there are 91 pairs of items to be analyzed.

Health Status Data
The second dataset concerned a nationwide health status survey conducted by the Netherlands Organisation of Applied Scientific Research TNO [33]. The Dutch version [33] of the SF-36 health survey [34] was used to assess the health status of 1742 adults. The SF-36 health survey consists of 36 items with ordered response categories, organized into eight different aspects of health: physical functioning (PF; ten items with three response categories each), role limitations due to physical health (RP; four items with two response categories each), bodily pain (BP; two items with five and six response categories each, respectively), general health perceptions (GH; five items with five response categories each), vitality (VT; four items with six response categories each), social functioning (SF; two items with five response categories each), role limitations due to emotional health (RE; three items with two response categories each), and general mental health (MH; five items with six response categories each). In addition, there is one item with five response categories to assess Health Comparison (HC). The observed response categories were coded such that higher scores indicate higher levels of functioning or well-being. The dataset consisted of 630 contingency tables in total. When both observed ordinal variables were dichotomous (I = J = 2), the null hypothesis of underlying bivariate normality could not be tested; therefore, the 21 contingency tables with two dichotomous variables were excluded from the analysis.

Fitted Distributions
The possible bivariate distributions that could be tested varied as a function of the number of response categories of each ordinal variable in the contingency table (see Table 2). This is because a distribution can only be tested for a given pair of ordinal variables if the degrees of freedom that partly depend on the number of response categories are positive. Consider, for example, a pair of ordinal variables with I × J = 2 × 3 = 6 possible response patterns. The total number of thresholds to be estimated for this pair of variables is I + J − 2 = 2 + 3 − 2 = 3. If the bivariate normal distribution is fitted, ρ is additionally free to be estimated. Hence, the degrees of freedom when fitting the bivariate normal distribution to a 2 × 3 contingency table are I × J − n − 1 = 6 − 4 − 1 = 1. However, we would not be able to fit the bivariate skew-normal distribution to a 2 × 3 contingency table, because with the two additional parameters α 1 and α 2 the degrees of freedom become negative, I × J − n − 1 = 6 − 6 − 1 = −1.
The bivariate normal distribution was fitted to each pair of ordinal variables, where the means µ were fixed at zero and the variances σ were fixed at unity. The polychoric correlation ρ and all thresholds τ (1) and τ (2) were free to be estimated. When the bivariate skew-normal distribution was fitted to the data, the location and scale parameters were fixed at zero and unity, respectively. The polychoric correlation ρ and all thresholds τ (1) and τ (2) were free to be estimated. Moreover, the shape parameters α 1 and α 2 were both estimated. When the mixture of bivariate normal distributions was fitted to the data, the means and variances in the first subpopulation were fixed at zero and unity, respectively, and the polychoric correlation ρ, the thresholds τ (1) and τ (2) , and the component probability λ were free parameters. Thresholds were constrained to be equal across subpopulations. The correlation in the second subpopulation was either fixed at zero (ρ * = 0), constrained to be equal to the correlation in the first subpopulation (ρ * = ρ), or set free to be estimated. Moreover, the second subpopulation's mean and variance of both underlying continuous variables were freely estimated. Note. C reflects the total number of contingency tables that could be analyzed in this study. The degrees of freedom are equal to I × J − n − 1, where I and J are numbers of response categories and n is the number of estimated parameters.

Analysis
The underlying bivariate distributions were fitted to the pairs of ordinal variables by minimizing Equation (10). The minimization was solved using a one-step procedure [28]. That is, all parameters in γ were simultaneously estimated. With k items, k(k − 1)/2 tests are conducted in each dataset. Therefore, the goodness of fit was not only tested at 0.05, but also at a Bonferroni-adjusted level of significance 2 × 0.05 k(k−1) to avoid inflated family-wise Type I error rates (note that this procedure is different from Raykov and Marcoulides [35] who applied a Benjamini-Hochberg procedure). The null hypothesis was rejected when the LRT statistic was significant. This indicated that the underlying continuous latent variables did not follow the bivariate distribution being tested.
In order to minimize Equation (10), the R function nlminb() from the PORT library [36] was used. This function uses a quasi-Newton algorithm that can be subjected to box constraints. Substantial non-convergence rates were observed when fitting the bivariate skew-normal and mixture of bivariate normal distributions with the default nlminb() method (see Appendix D). Inspection of the convergence messages suggested that the stopping tolerances were too tight. We therefore used default stopping tolerances, but adjusted the relative tolerance when we encountered non-convergence. As we also observed convergence to local minima, we used multiple starting values for the bivariate skewnormal distribution. For the mixture of bivariate normal distributions, we imposed lower and/or upper constraints on the polychoric correlations, the component probability, and the variances of the underlying continuous latent variables in order to avoid inadmissible values of these parameters.
We calculated the percentage of contingency tables for which each tested underlying distribution was rejected in each of the datasets. In addition, we evaluated the absolute difference in polychoric correlation estimates averaged across all contingency tables in both datasets as outcome variables. The absolute difference was defined as |ρ A −ρ N |, wherê ρ N is the polychoric correlation estimate assuming underlying bivariate normality, andρ A is the estimate of the polychoric correlation assuming an alternative underlying bivariate distribution (i.e., skew-normal distribution or mixture of bivariate normal distributions). For the two mixtures of bivariate normal distributions in which ρ * is allowed to differ from ρ, the correlation of the largest subsample was used for the calculation of the absolute difference. Note that the absolute differences do not reflect differences with regard to population values and therefore cannot be interpreted as reflecting estimation bias. Instead, the absolute differences in polychoric correlation estimates provide information about the range of estimates obtained using different distributions with empirical data. Results were analyzed with R version 3.4.3 [37]. Table 3 shows the rejection percentages of the bivariate distributions among all pairs of ordinal variables. In the Type D personality dataset, the bivariate normal distribution was rejected for 83.52% of the variable pairs when the level of significance was 0.05. With a Bonferroni-adjusted significance level, the bivariate normal distribution was rejected for 42.86% of the contingency tables. For most pairs of ordinal variables, the assumption of underlying bivariate normality is thus violated. The bivariate skew-normal distribution obtained comparably high percentages of rejection. With a significance level of 0.05, the bivariate skew-normal distribution was rejected for 79.55% of the variable pairs. The distributions that obtained the lowest rejection percentages were the mixtures of bivariate normal distributions. With a Bonferroni-adjusted significance level, the mixture of bivariate normal distributions with ρ * fixed at zero was not rejected for any pair of variables. In the health status dataset, the bivariate normal distribution was rejected for 71.43% of the pairs of variables using an unadjusted significance level. The null hypothesis of underlying bivariate normality was rejected for 35.06% pairs of variables when a Bonferroni-adjusted level of significance was used. The percentages of rejection of the bivariate skew-normal distribution were slightly lower. Again, the mixtures of bivariate normal distributions showed substantially lower rejection percentages. For instance, with a Bonferroni-adjusted significance level, the mixture of bivariate normal distributions in which ρ * is fixed at zero and the means and variances of both underlying variables are freely estimated was rejected for only 4.99% of contingency tables. When ρ * was additionally estimated, the rejection percentage was 3.60%.

Absolute Difference in ρ
The average absolute differences for the bivariate skew-normal distribution and mixtures of bivariate normal distributions are presented in Table 4. Compared to the other distributions, the skew-normal distribution and mixture of bivariate normal distributions with ρ * = ρ generally produced polychoric correlation estimates that were close to the polychoric correlation estimate assuming underlying bivariate normalityρ N . These distributions obtained comparable average absolute differences (i.e., 0.03 and 0.04). The largest average absolute differences were found for the mixture of bivariate normal distributions with ρ * as a free parameter and the mixture of bivariate normal distributions with ρ * fixed at zero. The average absolute differences of these distributions ranged from 0.26 to 0.31, and were substantially larger than those obtained by the bivariate skew-normal distribution and the mixture of bivariate normal distributions with ρ * = ρ.

Distributions Type D Personality Health Status
Skew-normal 0.03 0.04 Note. The absolute differences were averaged across the contingency tables in the dataset. In the Type D personality dataset, there was a total of 91 contingency tables. In the health status dataset, the total number of contingency tables was 609 for the bivariate normal distribution, 539 for the bivariate skew-normal distribution, and 361 for the mixtures of normal distributions.

Discussion
In line with existing literature [6][7][8][9][10][11], this study showed that the underlying bivariate normal distribution seldom holds in empirical data. This may indicate that in this study, the polychoric correlation based on underlying bivariate normality is an under-or overestimation of the correlation between the underlying continuous variables for most pairs of ordinal variables [19]. The bivariate skew-normal distribution was also frequently rejected in empirical data.
A possible explanation for the high rejection percentages of the bivariate skew-normal distribution may involve the dependency between skewness and kurtosis. In Azzalini and Dalla Valle's skew-normal distribution [20], skewness and kurtosis were regulated with the same parameter. Hence, a bivariate skew-normal distribution with high skewness cannot have low kurtosis, or the other way around. This might not be realistic, as in practice, data can be highly skewed with low kurtosis, or the other way around. Moreover, similar to Jin and Yang-Wallentin [19], we encountered non-convergence and local optima when estimating the parameters of the bivariate skew-normal distribution. Multiple starting values and adjusted stopping tolerances for minimizing the LRT statistic were used in this study to overcome these problems.
The mixture of bivariate normal distributions was not often rejected in the empirical datasets. Specifically, the mixture of bivariate normal distributions with a free or fixed-atzero correlation in the second subpopulation was often found to be consistent with the data. Although the mixture with a freely estimated correlation in the second subpopulation is less restrictive, the mixture with a correlation in the second subpopulation fixed at zero was rejected less often in one of the datasets. This was against our expectations, because in theory, a less restrictive distribution (with more parameters to be estimated) will always fit the data better than a nested, more restrictive distribution (with fewer parameters to be estimated). The higher rejection percentages of the less restrictive mixture distribution with a free correlation in the second subpopulation may be caused by an increase in fit of the distribution that is too small relative to the decrease in degrees of freedom, or by local optima. Overall, our results support Uebersax's [24] suggestion that the mixture distribution is a flexible tool to model non-normal distributions, but there is a need for further studies in which its performance is investigated. One of the issues of the mixture distribution is the risk of overfitting because of the large number of parameters. Mixture distributions may therefore not generalize easily to other samples.

Future Research and Limitations
The current paper only considers testing bivariate distributions. A logical next step would be to extend the procedures to a multivariate approach, and estimate a matrix of polychoric correlation coefficients based on all variable pairs, with various underlying distributions. The extension to the multivariate case brings about multiple issues to be considered. For example, it would make sense to add constraints to the thresholds, since it would be undesirable if a different set of thresholds would be estimated for each pair of ordinal variables [28]. Additionally, with the mixture distributions, it may be needed to constrain the component probabilities for the subpopulations to be equal across variable pairs, and to avoid estimating a free correlation coefficient in each subpopulation (but instead constrain them to be equal across subpopulations, or to fix one of the correlations to zero). In order to fit structural equation models to polychoric correlation matrices using weighted least squares estimators, the asymptotic covariance matrix of the estimated parameters will be needed. Currently, there are no straightforward methods to obtain the asymptotic covariance matrix based on other underlying distributions than the bivariate normal, although as a reviewer suggested, the approach by Monroe [38] might be extended. Correspondingly, the estimated polychoric correlation matrix should be positive semidefinite in order to fit structural equation models, which may not be the case without the appropriate constraints, or with a misspecified underlying distribution. Moreover, although the LRT has been evaluated to test for underlying multivariate normality [35,39], as well as a parametric bootstrap procedure [17], to our knowledge, there exist no simulation studies investigating the statistical performance of these tests when the tested distribution is non-normal. Such a simulation study would be useful to get information about the Type 1 and Type 2 errors, estimation bias in correlation coefficients, and the needed sample sizes for adequate performance of the tests.
A limitation of the present study is that although a large number of contingency tables were analyzed, they only stemmed from two datasets. This study provides a first insight into the distributions underlying the responses to ordinal variables, but more data must be investigated in order to verify the results. Moreover, the present study examined only a few non-normal bivariate distributions. It would be interesting to evaluate the fit of other non-normal bivariate distributions to empirical data as well. For example, Timofeeva and Khailenko [10] proposed a polychoric correlation assuming a generalized lambda distribution for the underlying continuous variables. The generalized lambda distribution is a non-symmetric extension of Tukey's lambda distribution and is known for its high flexibility in physical and social science settings, among others [40]. Research shows that the bivariate generalized lambda distribution is rejected less often for empirical data than the bivariate normal distribution [10]. It would be interesting to examine the rejection percentages of the generalized lambda distributions in other, possibly larger, empirical datasets that consist of ordinal responses.

Appendix D
In the nlminb() function, the relative tolerance defaults to 1e-10. Relative tolerance works as follows. The minimization stops if the algorithm is unable to reduce the value of the objective to be minimized by a factor of the sum of the absolute value of the objective and the relative tolerance. We prevented non-convergence by using nlminb() with a relative tolerance of 1e-5 or 1e-3. We found similar rejection percentages but less convergence problems with the adjusted relative tolerances. In the table below, the non-convergences rates under the different relative tolerances are presented.