This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).
The Chi-square test (χ^{2} test) is a family of tests based on a series of assumptions and is frequently used in the statistical analysis of experimental data. The aim of our paper was to present solutions to common problems when applying the Chi-square tests for testing goodness-of-fit, homogeneity and independence. The main characteristics of these three tests are presented along with various problems related to their application. The main problems identified in the application of the goodness-of-fit test were as follows: defining the frequency classes, calculating the X^{2} statistic, and applying the χ^{2} test. Several solutions were identified, presented and analyzed. Three different equations were identified as being able to determine the contribution of each factor under three hypotheses (minimization of the variance, minimization of the squared coefficient of variation, and minimization of the X^{2} statistic) in the application of the Chi-square test of homogeneity. The best solution was directly related to the distribution of the experimental error. Fisher's exact test proved to be the "golden test" in analyzing independence, while the Yates and Mantel-Haenszel corrections could be applied as alternative tests.
Statistical instruments are used to extract knowledge from the observation of real-world phenomena, as Fisher suggested: "… no progress is to be expected without constant experience in analyzing and interpreting observational data of the most diverse types" (where observational data are seen as information) [
The Chi-square test introduced by K. Pearson has been the subject of debate in much research. A series of papers analyzed Pearson's test [
It is well known that Pearson's Chi-square (χ^{2}) is a family of tests with the following assumptions [
Yates' correction is applied when the third assumption is not met [
Koehler and Larntz suggested the use of at least three categories if the number of observations is at least 10. Moreover, they suggested that the square of the number of observations be at least 10 times higher than the number of categories [
The Chi-square test has been applied in all research areas. Its main uses are: goodness-of-fit [
The aim of our paper was to present solutions to common problems when applying the Chi-square test for testing goodness-of-fit, homogeneity and independence.
The most frequently used Chisquare tests were presented (
The main characteristics of these tests are as follows:
Goodness-of-fit (Pearson's Chi-square test [
Is used to study similarities between groups of categorical data.
Tests if a sample of data came from a population with a specific distribution (compares the distribution of a variable with another distribution when the expected frequencies are known) [
Can be applied to any univariate distribution by calculating its cumulative distribution function (
Has as alternatives the Anderson-Darling [
The agreement between observation and hypothesis is analyzed by dividing the observations into a defined number of intervals (
where X^{2} = value of the Chi-square statistic; χ^{2} = value of the Chi-square parameter from the Chi-square distribution;
The probability of rejecting the null hypothesis is calculated based on the theoretical distribution (χ^{2}). The null hypothesis is rejected at the 5% significance level if the probability of observing a statistic at least as extreme (1 − χ^{2}_{CDF}(X^{2}, ft − 1)) is lower than 5%.
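The goodness-of-fit computation described above can be sketched in a few lines. The die-roll data below are hypothetical and used only for illustration; the critical value 11.07 is the standard χ^{2} 5% quantile for five degrees of freedom.

```python
# Goodness-of-fit sketch: X^2 = sum (O_i - E_i)^2 / E_i over frequency classes.

def chi_square_statistic(observed, expected):
    """Pearson's X^2 statistic for observed vs. expected class frequencies."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical example: a die rolled 60 times; H0 = fair die (10 per face).
observed = [8, 12, 9, 11, 10, 10]
expected = [10] * 6

x2 = chi_square_statistic(observed, expected)
df = len(observed) - 1      # ft - 1 degrees of freedom
critical_5pct = 11.07       # chi-square 5% critical value for df = 5

print(round(x2, 2))         # 1.0
print(x2 > critical_5pct)   # False -> H0 is not rejected
```

Since X^{2} = 1.0 is far below the 5% critical value, the fair-die hypothesis is not rejected here.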
The Chi-square test is the best-known statistic used to test the agreement between observed and theoretical distributions, independence and homogeneity. Defining the applicability domain of the Chi-square test is a complex problem [
At least three problems occur when the Chi-square test is applied in order to compare observed and theoretical distributions:
Defining the frequency classes.
Calculating the X^{2} statistic.
Applying the χ^{2} test.
The test of homogeneity:
Is used to analyze if different populations are similar (homogenous or equal) in terms of some characteristics.
Is applied to verify the homogeneity of: data, proportions, variance (more than two variances are tested; for two variances the F test is applied), error variance, sampling variances.
The Chi-square test of homogeneity is used to determine whether frequency counts are identically distributed across different populations or across different subgroups of the same population. An important assumption is made for the test of homogeneity in populations coming from a contingency of two or more categories (this is the link between the test of homogeneity and the test of independence): under the assumption of homogeneity, each count should be observed in a quantity proportional to the product of the probabilities given by the categories (assumption of independence between categories). When the number of categories is two, the expectations are calculated using the E_{i,j} formula (mean of the expected value (for the Chi-square test of homogeneity) or frequency counts (Chi-square test of independence) for the (i,j) pair of factors) [
The observed contingency table is constructed; the values for the first factor/population/subgroup are in the rows and the values for the second variable/factor/population/subgroup are in the columns. The observed frequencies are counted at the intersection of rows with columns and the hypothesis of homogeneity is tested.
The value of the X^{2} statistic is computed using the formula presented in
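The contingency-table construction described above, with expected counts proportional to the product of the marginal totals (E_{i,j} = row total × column total / grand total), can be sketched as follows; the 2 × 3 table of counts is hypothetical.

```python
# Sketch: Chi-square statistic for a contingency table (homogeneity /
# independence). Expected counts: E_ij = (row total * column total) / n.

def chi_square_contingency(table):
    rows = [sum(r) for r in table]          # row totals
    cols = [sum(c) for c in zip(*table)]    # column totals
    n = sum(rows)                           # grand total
    x2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = rows[i] * cols[j] / n
            x2 += (o - e) ** 2 / e
    df = (len(rows) - 1) * (len(cols) - 1)  # (r - 1)(c - 1) degrees of freedom
    return x2, df

# Hypothetical counts from two populations across three categories.
x2, df = chi_square_contingency([[30, 20, 10],
                                 [20, 20, 20]])
print(round(x2, 2))  # 5.33
print(df)            # 2
```

The resulting X^{2} is then compared against the χ^{2} distribution with (r − 1)(c − 1) degrees of freedom, as in the goodness-of-fit case.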
The test of independence (also known as the Chi-square test of association):
Is used to determine whether two characteristics are dependent or not.
Compares the frequencies of one nominal variable for different values of a second nominal variable.
Is an alternative to the G-test of independence (also known as the Likelihood Ratio Chi-square test) [
Fisher's exact test of independence [
The Chi-square test of independence is applied in order to compare frequencies of nominal or ordinal data for a single population/sample (two variables at the same time).
The Chi-square test for independence also faces some difficulties when applied to experimental data [
Glass and Hopkins [
The first problem of the Chi-square goodness-of-fit test is how to establish the number of frequency classes. At least two approaches could be applied:
The number of frequency classes (a discrete number) is computed from Hartley's entropy [
The number of frequency classes is obtained based on the histogram of observed values as estimator of density [
One rule-of-thumb suggests dividing the sample into a number of frequency classes equal to
The second problem refers to the width of the frequency classes. Two approaches could be applied here:
Data could be grouped in probability frequency classes (theoretical or observed) with equal width. This approach is frequently used when the observed data are grouped.
Data could be grouped in intervals with equal width.
The third problem is the number of observations in each frequency class. Every class must contain at least five observations; otherwise the frequencies from two neighboring classes are combined.
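The three practical steps above can be sketched on hypothetical data: choose a class count, bin the sample into equal-width classes, then merge neighboring classes until every class holds at least five observations. The Sturges-type class-count rule and the merging strategy below are illustrative choices, not the paper's prescribed procedure.

```python
# Sketch: build frequency classes and enforce the minimum-five rule.
import math

def bin_sample(data, n_classes):
    """Count observations in n_classes equal-width intervals."""
    lo, hi = min(data), max(data)
    width = (hi - lo) / n_classes
    counts = [0] * n_classes
    for x in data:
        i = min(int((x - lo) / width), n_classes - 1)  # clamp the maximum
        counts[i] += 1
    return counts

def merge_small_classes(counts, minimum=5):
    """Cumulate neighboring classes until every class has >= minimum counts."""
    merged = []
    for c in counts:
        if merged and merged[-1] < minimum:
            merged[-1] += c          # cumulate with the previous class
        else:
            merged.append(c)
    while len(merged) > 1 and merged[-1] < minimum:
        merged[-2] += merged.pop()   # fold a too-small last class backwards
    return merged

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 7, 8, 9, 9, 10, 12, 15, 20]
k = 1 + int(math.log2(len(data)))    # Sturges-type rule for the class count
counts = merge_small_classes(bin_sample(data, k))
print(sum(counts) == len(data))      # True: no observation lost
print(all(c >= 5 for c in counts))   # True after merging
```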
The investigation of homogeneity of the values associated to a class (row or column in the contingency table) could be carried out by decomposing the X^{2} expression (see
One assumption is that the O_{i,j} observations are the result of multiplying two factors; repeated observations better approximate the effect of the multiplication. Thus, the formula of expected frequencies (E_{i,j} [
Three mathematical assumptions could be formulated in terms of the squared error ((O_{i,j} − E_{i,j})^{2}) of an observation:
The measurement is affected by chance errors, absolute values (S^{2},
The measurement is affected by chance errors, relative values (CV^{2},
The measurement is affected by chance errors on a scale with values (X^{2},
The first hypothesis (chance errors, absolute values) leads mathematically to the minimization of the variance (S^{2}) obtained between model and observation.
where a_{i}, 1 ≤ i ≤ r = contribution of the first factor to the expected value E_{i,j}; b_{j}, 1 ≤ j ≤ c = contribution of the second factor to the expected value E_{i,j}; E_{i,j} = a_{i}·b_{j}.
The second hypothesis (chance errors, relative values) leads to the minimization of the squared coefficient of variation (CV^{2}) (see
One possible solution for the third hypothesis is the minimization of the X^{2} statistic (see
The contribution of each factor (A = (a_{i})_{1 ≤i ≤ r}, and B = (b_{j})_{1 ≤ j ≤ c}) could be determined through the minimization of values given by
The calculations revealed the following:
The relation in
The relation in
The relation in
The relations presented in
Dealing directly with
This is solvable in (a_{2}/a_{1}). Thus, there are infinitely many solutions (for any non-null value of a_{1} there is a value
The application of successive approximations using the solution offered by
The method of successive approximations rapidly converged towards the optimal solution. Thus, three iterations are necessary in order to obtain a residual value of 282.11735 for the relation presented in
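The successive-approximation idea can be sketched for the multiplicative model E_{i,j} = a_{i}·b_{j} under the first hypothesis (minimization of S^{2}). The alternating closed-form updates and the small 2 × 3 table below are illustrative assumptions, not the paper's exact iterative procedure: for fixed b the least-squares optimum is a_{i} = Σ_{j} O_{i,j} b_{j} / Σ_{j} b_{j}^{2}, and symmetrically for b_{j}.

```python
# Sketch: alternating (successive-approximation) fit of E_ij = a_i * b_j
# minimizing S^2 = sum (O_ij - a_i b_j)^2.

def fit_multiplicative(O, iterations=20):
    r, c = len(O), len(O[0])
    b = [sum(O[i][j] for i in range(r)) / r for j in range(c)]  # column means
    a = [1.0] * r
    for _ in range(iterations):
        a = [sum(O[i][j] * b[j] for j in range(c)) / sum(x * x for x in b)
             for i in range(r)]
        b = [sum(O[i][j] * a[i] for i in range(r)) / sum(x * x for x in a)
             for j in range(c)]
    return a, b

# Hypothetical table that is exactly multiplicative, so the residual vanishes.
O = [[2.0, 4.0, 6.0],
     [4.0, 8.0, 12.0]]
a, b = fit_multiplicative(O)
residual = sum((O[i][j] - a[i] * b[j]) ** 2 for i in range(2) for j in range(3))
print(residual < 1e-12)  # True
```

On exactly multiplicative data the iteration recovers the factors in one pass, which mirrors the rapid convergence reported above; only the product a_{i}·b_{j} is identified, since any rescaling (a/k, k·b) gives the same fit.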
The experimental data reported by Fisher [
The values resulted when the iterative approach was applied to obtain the solution for
The analysis of the results presented in
The values for all three types of experimental errors (square absolute S^{2}, square relative CV^{2} and Pearson's X^{2}) and for all four analyzed cases are presented in
The experimental errors estimated by
The graphical representation in
The intersection between the contingency area and the error areas occurs through the absolute square error. Therefore, the contingency defined by
The triangle of the X^{2} statistics variation intersects only with the X^{2} statistics triangle. This fact recommends the use of optimization defined in
The analysis of the error distributions obtained from the above association analysis is presented in the Supplementary Material.
The relative position of the solution proposed in
The results of the representations showed in
A single degree of freedom is known to exist for a 2 × 2 contingency table.
The probability to observe the situation presented in
The range in which
In order to exemplify this problem, the experimental data reported by Fisher in 1935 [
The space of possible observations regarding the
Two possible approaches could be applied in relation to the objective of the comparison in a contingency table:
If the statistic cumulates all outcomes at a distance from homogeneity at least as large as the one observed (in any direction), then the probability associated with the observation is obtained by cumulating the probabilities for x = 0, x = 10, x = 11 and x = 12 (red and blue dots on
If the statistic cumulates only the outcomes at a distance from homogeneity at least as large as the one observed, strictly in the direction of the observed value, then the probability associated with the observation is obtained by cumulating the probabilities for x = 10, x = 11 and x = 12 (red dots on
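The one-sided cumulation described above can be sketched with the hypergeometric probabilities implied by the margins used in the paper (p_{MN}(x, 13, 12, 18): observed x = 10, row total n_{1} = 13, column totals n_{2} = 12 and n_{3} = 18, grand total 30).

```python
# Sketch of Fisher's exact test for a 2 x 2 table with fixed margins.
from math import comb

def hypergeom_p(x, n1, n2, n):
    """P(upper-left cell = x) for fixed margins (hypergeometric distribution)."""
    return comb(n1, x) * comb(n - n1, n2 - x) / comb(n, n2)

n1, n2, n = 13, 12, 30
observed = 10
# one-sided probability: cumulate x = 10, 11, 12 (the direction of the observed)
p_one_sided = sum(hypergeom_p(x, n1, n2, n)
                  for x in range(observed, min(n1, n2) + 1))
print(round(p_one_sided, 6))  # 0.000465
```

The two-sided variant adds the probabilities of the extreme outcomes in the opposite direction (here x = 0), matching the first approach above.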
Frank Yates proposed in 1934 [
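Yates' continuity correction can be sketched for a 2 × 2 table: each absolute deviation |O_{i,j} − E_{i,j}| is reduced by 0.5 before squaring, which always shrinks the statistic. The table of counts below (10, 3; 2, 15) is the one implied by the margins used in the paper.

```python
# Sketch: Pearson's X^2 for a 2 x 2 table, with optional Yates correction
# X^2_Yates = sum (|O_ij - E_ij| - 0.5)^2 / E_ij.

def chi_square_2x2(table, yates=False):
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = (a + b, c + d), (a + c, b + d)
    x2 = 0.0
    for i, o in enumerate((a, b, c, d)):
        e = rows[i // 2] * cols[i % 2] / n   # expected count for this cell
        dev = abs(o - e) - 0.5 if yates else o - e
        x2 += dev * dev / e
    return x2

table = [[10, 3], [2, 15]]
print(round(chi_square_2x2(table), 2))                            # 13.03
print(chi_square_2x2(table, yates=True) < chi_square_2x2(table))  # True
```

The uncorrected value reproduces the X^{2} = 13.03 reported for these data, and the corrected value is smaller, i.e., the correction makes the test more conservative.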
The application of the Chi-square test is directly related to certain assumptions and to the design of the experiment. Three problems were identified in the application of the Chi-square goodness-of-fit test, and solutions were identified, presented and analyzed.
Three different equations were identified as able to determine the contribution of each factor under three hypotheses (minimization of the variance, minimization of the squared coefficient of variation, and minimization of the X^{2} statistic) in the application of the Chi-square test of homogeneity. The best solution proved to be directly related to the distribution of the experimental error.
Fisher's exact test proved to be the "golden test" in analyzing independence, while the Yates and Mantel-Haenszel corrections could be applied as alternative tests.
Euclidean distances among estimations of experimental errors.
Position of empirical estimation (
Value of X^{2} statistic as function of independent observation
Value of the statistical probability of the observed according to the observable.
Summary of Chisquare tests.
Goodness-of-fit
One sample. Compares the expected and observed values to determine how well the experimenter's predictions fit the data.
H_{0}: The observed values are equal to the theoretical (expected) values. (The data follow the assumed distribution.)

Homogeneity
Two different populations (or subgroups). Applied to one categorical variable.
H_{0}: The investigated populations are homogeneous.

Independence
One population. Types of variables: nominal, dichotomous, ordinal or grouped interval. Each population is at least 10 times as large as its respective sample [
Research hypothesis: The two variables are dependent (or related).
Experimental values: response to fertilization with manure on different potato varieties.
DS  25.3  28.0  23.3  20.0  22.9  20.8  22.3  21.9  18.3  14.7  13.8  10.0  
DC  26.0  27.0  24.4  19.0  20.6  24.4  16.8  20.9  20.3  15.6  11.0  11.8  
DB  26.5  23.8  14.2  20.0  20.1  21.8  21.7  20.6  16.0  14.3  11.1  13.3  
US  23.0  20.4  18.2  20.2  15.8  15.8  12.7  12.8  11.8  12.5  12.5  8.2  
UC  18.5  17.0  20.8  18.1  17.5  14.4  19.6  13.7  13.0  12.0  12.7  8.3  
UB  9.5  6.5  4.9  7.7  4.4  2.3  4.2  6.6  1.6  2.2  2.2  1.6  
Σ 
TV: Treatment
Values of (a_{i}b_{j})_{1 ≤ i ≤ 6; 1 ≤ j ≤ 12} calculated with
DS  27.61  26.30  22.68  22.51  21.71  21.33  20.86  20.69  17.36  15.28  13.57  11.40 
DC  27.21  25.92  22.35  22.18  21.40  21.02  20.55  20.39  17.11  15.06  13.37  11.24 
DB  25.56  24.35  21.00  20.84  20.10  19.75  19.31  19.15  16.07  14.15  12.56  10.56 
US  21.04  20.04  17.28  17.15  16.55  16.25  15.90  15.76  13.23  11.65  10.34  8.69 
UC  21.24  20.23  17.44  17.31  16.70  16.41  16.04  15.91  13.35  11.76  10.44  8.77 
UB  6.14  5.85  5.05  5.01  4.83  4.75  4.64  4.60  3.86  3.40  3.02  2.54 
TV: Treatment
Optimized value of the (a_{i}b_{j})_{1≤i≤6;1≤j≤12} calculated with
DS  27.07  26.42  22.64  21.85  21.85  21.94  20.94  20.63  17.93  15.48  13.54  11.61 
DC  26.66  26.02  22.29  21.52  21.52  21.60  20.62  20.32  17.66  15.24  13.33  11.43 
DB  24.91  24.32  20.83  20.11  20.11  20.19  19.27  18.99  16.50  14.25  12.46  10.69 
US  20.64  20.15  17.26  16.66  16.66  16.73  15.96  15.73  13.67  11.80  10.32  8.85 
UC  20.58  20.09  17.21  16.61  16.61  16.68  15.92  15.69  13.63  11.77  10.29  8.83 
UB  6.29  6.14  5.26  5.08  5.08  5.10  4.86  4.79  4.17  3.60  3.14  2.70 
TV: Treatment
Optimized value of the (a_{i}b_{j})_{1 ≤ i ≤ 6; 1 ≤ j ≤ 12} calculated with
DS  27.57  26.08  23.04  22.61  21.48  21.61  21.13  20.69  17.66  15.23  13.79  11.56 
DC  27.38  25.90  22.88  22.45  21.34  21.46  20.99  20.55  17.54  15.13  13.69  11.48 
DB  25.84  24.44  21.59  21.19  20.14  20.26  19.80  19.40  16.56  14.28  12.92  10.83 
US  21.23  20.08  17.74  17.40  16.54  16.64  16.27  15.93  13.60  11.73  10.62  8.90 
UC  21.47  20.31  17.94  17.61  16.73  16.83  16.46  16.12  13.76  11.86  10.74  9.00 
UB  7.02  6.64  5.87  5.76  5.47  5.51  5.38  5.27  4.5  3.88  3.51  2.94 
TV: Treatment
Optimized value of (a_{i}b_{j})_{1 ≤ i ≤ 6; 1 ≤ j ≤ 12} calculated with
DS  27.64  26.19  22.85  22.60  21.59  21.44  20.98  20.71  17.49  15.24  13.67  11.47 
DC  27.35  25.91  22.61  22.36  21.36  21.22  20.76  20.50  17.30  15.08  13.52  11.35 
DB  25.74  24.40  21.28  21.05  20.11  19.97  19.55  19.29  16.29  14.20  12.73  10.68 
US  21.17  20.06  17.50  17.31  16.53  16.42  16.07  15.87  13.39  11.68  10.47  8.78 
UC  21.40  20.28  17.69  17.50  16.71  16.60  16.25  16.04  13.54  11.80  10.58  8.88 
UB  6.57  6.23  5.43  5.37  5.13  5.10  4.99  4.93  4.16  3.63  3.25  2.73 
TV: Treatment
Comparative value for chance experimental errors.
 

DS  23.4  18.76  24.12  57.97  1.10  0.937  1.127  2.308  0.056  0.0515  0.0573  0.0971 
DC  59.7  48.48  59.86  104.95  3.08  2.497  3.052  4.847  0.164  0.133  0.1611  0.2365 
DB  69.8  66.77  71.47  95.21  3.78  3.596  3.796  4.803  0.221  0.2078  0.2167  0.2633 
US  41.6  49.03  41.66  35.34  2.72  3.190  2.709  2.358  0.186  0.2158  0.183  0.1635 
UC  57.6  59.01  56.53  82.16  3.46  3.660  3.339  4.367  0.218  0.2375  0.2065  0.2444 
UB  37.5  40.1  37.13  28.26  7.89  8.295  7.659  5.956  1.751  1.8018  1.6696  1.3512 
UD  30.3  26.3  28.20  78.9  2.66  2.35  2.15  3.58  0.335  0.293  0.235  0.232 
KK  15.3  13.5  15.80  18.7  0.76  0.64  0.73  0.88  0.045  0.033  0.035  0.044 
KP  63.0  62.7  64.00  67.5  3.11  3.15  3.13  3.19  0.155  0.162  0.159  0.155 
TP  34.3  31.4  33.30  76.5  2.79  2.69  2.37  3.67  0.357  0.340  0.256  0.242 
ID  3.4  3.9  4.00  4.5  0.21  0.27  0.28  0.26  0.017  0.028  0.029  0.021 
GS  26.2  25.6  26.90  28.6  2.29  2.45  2.52  2.42  0.319  0.349  0.352  0.327 
AJ  45.0  47.0  45.30  43.4  2.56  2.71  2.60  2.44  0.152  0.168  0.164  0.148 
BQ  21.5  20.4  21.00  31.8  1.93  1.71  1.67  2.19  0.253  0.205  0.182  0.193 
ND  18.3  17.9  19.10  20.5  2.13  2.29  2.35  2.27  0.393  0.424  0.427  0.403 
EP  2.9  3.2  3.30  3.8  0.53  0.64  0.66  0.62  0.133  0.158  0.163  0.142 
AC  18.2  18.8  18.70  19.3  1.76  1.87  1.84  1.83  0.209  0.232  0.233  0.221 
DY  11.1  11.5  11.20  10.6  1.31  1.40  1.39  1.27  0.228  0.255  0.258  0.227 
Σ  289.5  290.8  404.1  22.04  22.17  24.62  2.596  2.647  2.493 
Tt = type of treatment; S^{2} =
Transformation of the residuals presented in
E  289.5  22.04  2.596 
S^{2} = min.  22.17  2.647  
X^{2} = min.  290.8  2.493  
CV^{2} = min.  404.1  24.62  
 
E  1.026  1.016  1.102 
S^{2} = min.  1.022  1.124  
X^{2} = min.  1.030  1.059  
CV^{2} = min.  1.432  1.135 
E = use of
2 × 2 contingency table with one degree of freedom.
              Class A    Class Ω_{1}\A      Total
Class B       x          n_{1} − x          n_{1}
Class Ω_{2}\B n_{2} − x  n_{3} − n_{1} + x  n_{2} + n_{3} − n_{1}
Total Ω_{2}   n_{2}      n_{3}              n_{2} + n_{3}
X^{2} = ChiSquare. Class A = first value of first category. Ω_{1} = whole first category. Class B = first value of second category. Ω_{2} = whole second category.
Probability of observation.
p_{X2}  χ^{2}_{CDF}(X^{2} = 13.03, df = 1)  3.063 × 10^{−4} 
p_{O2} (x^{2} ≥ X^{2})  p_{MN}(10,13,12,18) + p_{MN}(11,13,12,18) + p_{MN}(12,13,12,18)  4.625 × 10^{−4} 
p_{O2} (x^{2} > X^{2})  p_{MN}(11,13,12,18) + p_{MN}(12,13,12,18)  1.548 × 10^{−5} 
p_{D2} (x^{2} ≥ X^{2})  p_{O2}(x^{2} ≥ X^{2}) + p_{MN}(0,13,12,18)  5.367 × 10^{−4} 
p_{D2} (x^{2} > X^{2})  p_{O2}(x^{2} > X^{2}) + p_{MN}(0,13,12,18)  8.702 × 10^{−5} 
p_{X2} = probability of χ^{2} distribution; p_{O2} = probability of observing a higher distance from homogeneity in the direction of the observed value; p_{D2} = probability of observing a higher distance from homogeneity in any direction; χ^{2}_{CDF} = probability of cumulative distribution function; p_{MN} = probability from multinomial distribution.
The study was supported by UEFISCSU/ID1105/2008 for R. Sestraş and by POSDRU/89/1.5/S/62371 through a fellowship for L. Jäntschi.