Over the last two decades, there have been growing concerns about the so-called replication crisis in psychology and other fields [1]. As a result, the scientific community has paid increasing attention to the issue of replicability in science as well as to good research and statistical practices.
In this context, many have highlighted the limitations of null-hypothesis significance testing and called for more modern approaches to statistics. One such recommendation, from the “New Statistics” initiative, is to report effect sizes and their corresponding confidence intervals and to rely increasingly on meta-analyses to increase confidence in those estimates [3]. These recommendations are meant to complement (or even replace, according to some) null-hypothesis significance testing and would help the transition toward a “cumulative quantitative discipline”.
These so-called “New Statistics” are synergistic because effect sizes are not only useful for interpreting study results in themselves but are also necessary for meta-analyses, which aggregate effect sizes and their confidence intervals to create a summary effect size of their own [4]. (The title of this paper is an allusion to the rhyme spoken by the giant in the English fairy tale Jack and the Beanstalk (“Fee-fi-fo-fum”).)
Unfortunately, popular software applications do not always implement the specialized effect sizes that a given research design requires, or their confidence intervals. In this paper, we focus on effect sizes for categorical data, which are probably less well known than popular effect sizes like Cohen’s $d$ or Pearson’s $r$. For categorical data, $d$ and $r$ are inappropriate measures of effect size. Cohen’s $d$ refers to the standardized difference between the means of two populations, while Pearson’s correlation coefficient $r$ measures linear correlations. Hence, both measures refer to continuous, not categorical, data.
To compare categorical data, for instance, where associations can be presented as contingency tables, several effect size metrics are available. Common effect sizes for 2-by-2 tables are odds ratios (OR), risk ratios (RR), or the phi ($\phi$) coefficient. While $\phi$ can be interpreted similarly to a correlation coefficient, OR and RR are harder to interpret as they are not bounded between zero and one. Furthermore, RR is not symmetrical [8]: the size of the effect can change when columns and rows are exchanged. For tables with larger dimensions than 2-by-2, other effect sizes (like Cramér’s $V$) are available that, like $\phi$, can be interpreted like a correlation coefficient; these are discussed later.
The observed distribution of categorical data—usually measured as multinomial variables—can also be compared to an expected distribution. Again, effect sizes to measure the strength of such associations show some limitations regarding ease of interpretation. What is missing here is an effect size whose metric is comparable to those for contingency tables.
This paper aims to review the most commonly used effect sizes for analyses of categorical variables that use the $\chi^2$ (chi-square) test statistic and to introduce a new effect size, פ (Fei, pronounced “fay”), which closes the gap of a missing effect size measure in a correlation-like metric that is appropriate for categorical data.
Importantly, we offer researchers an applied walkthrough on how to use these effect sizes in practice thanks to the effectsize package in the R programming language, which implements these measures and their confidence intervals [9]. The presented effectsize package closes another gap related to the aforementioned effect sizes because the uncertainty of such measures, expressed by their confidence intervals, is often not included in the output of statistical software. We cover, in turn, tests of independence (φ/phi, Cramér’s $V$, and Tschuprow’s $T$) and tests of goodness-of-fit (Cohen’s $w$ and a new proposed effect size, פ/Fei).
2. Effect Sizes for Tests of Independence
The $\chi^2$ test of independence between two categorical variables examines whether the frequency distribution of one of the variables depends on the other. That is, are the two variables correlated such that, for example, members of group 1 on variable X are more likely to be members of group A on variable Y, rather than evenly spread across groups A and B of variable Y? Formally, the test examines how likely the observed conditional frequencies (cell frequencies) are under the null hypothesis of independence. This is done by examining the degree to which the observed cell frequencies deviate from the frequencies that would be expected if the variables were indeed independent. The test statistic for these tests is the $\chi^2$, which is computed as:
$$\chi^2 = \sum_{i=1}^{l \times k} \frac{(O_i - E_i)^2}{E_i}$$
where $O_i$ are the observed frequencies, $E_i$ are the frequencies expected under independence, and $l$ and $k$ are the number of rows and columns, respectively, in the contingency table.
Instead of the deviations between the observed and expected frequencies, we can write $\chi^2$ in terms of observed and expected cell probabilities and the total sample size $N$:
$$\chi^2 = N \times \sum_{i=1}^{l \times k} \frac{(p_{O_i} - p_{E_i})^2}{p_{E_i}}$$
where $p_{O_i}$ are the observed cell probabilities and $p_{E_i}$ the probabilities expected under independence.
Table 1 gives a short example in R that tests whether the probability of survival depends on the sex of the passenger aboard the Titanic. The null hypothesis tested here is that the probability of survival is independent of the passenger’s sex.
The performed $\chi^2$-test is statistically significant. Thus, we can reject the hypothesis of independence. However, the output includes no effect size, so we cannot draw any conclusion about the strength of the association between sex and survival.
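The test described above can be sketched in base R. This is a minimal illustration and not the paper's accompanying code: it assumes the built-in `Titanic` dataset, collapses it to a sex-by-survival table, and turns off the continuity correction (`correct = FALSE`) so that the reported statistic matches the two formulas for $\chi^2$ given earlier. The variable names are our own.

```r
# Collapse the built-in 4-way Titanic table to Sex x Survived counts.
tab <- margin.table(Titanic, c(2, 4))

# Pearson chi-square test without Yates' continuity correction,
# so the statistic matches the textbook formula.
test <- chisq.test(tab, correct = FALSE)

# Manual computation from observed and expected counts.
N <- sum(tab)
expected <- outer(rowSums(tab), colSums(tab)) / N
chisq_counts <- sum((tab - expected)^2 / expected)

# Same statistic from observed and expected cell probabilities, scaled by N.
p_obs <- tab / N
p_exp <- expected / N
chisq_probs <- N * sum((p_obs - p_exp)^2 / p_exp)

print(test)
print(c(counts = chisq_counts, probs = chisq_probs))
```

Both manual computations should agree with the statistic reported by `chisq.test()`, which illustrates that the count-based and probability-based forms are the same quantity.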
For a 2-by-2 contingency table analysis, like the one used above, the $\phi$ (phi) coefficient is a correlation-like measure of effect size indicating the strength of association between the two binary variables. One way to compute this effect size is to recode the binary variables as dummy (“0” and “1”) variables and compute the Pearson correlation between them [11].
Another way to compute $\phi$ is by using the $\chi^2$ statistic:
$$\phi = \sqrt{\frac{\chi^2}{N}}$$
This value ranges between 0 (no association) and 1 (complete dependence), and its values can be interpreted in the same way as Pearson’s correlation coefficient. Table 2 shows the correlation coefficient and the effect size $\phi$ for the data shown in Table 1.
Note that $\phi$ cannot be negative, so we take the absolute value of Pearson’s correlation coefficient. Also note that the effectsize package gives a one-sided confidence interval by default, to match the positive direction of the associated $\chi^2$ test at $\alpha = 0.05$ (that the association is larger than zero at a 95% confidence level).
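The equivalence of the two routes to $\phi$ can be checked with a short base R sketch. It assumes the built-in `Titanic` data and uses only base functions rather than the effectsize package; the dummy coding (Female = 1, Survived = 1) is an arbitrary choice made for illustration.

```r
tab <- margin.table(Titanic, c(2, 4))   # Sex x Survived counts
N <- sum(tab)

# Route 1: phi from the (uncorrected) chi-square statistic.
chisq <- unname(chisq.test(tab, correct = FALSE)$statistic)
phi_chisq <- sqrt(chisq / N)

# Route 2: expand the table to one row per passenger, dummy-code both
# variables as 0/1, and take the absolute Pearson correlation.
cells <- as.data.frame(tab)
idx <- rep(seq_len(nrow(cells)), cells$Freq)
sex <- as.integer(cells$Sex[idx] == "Female")
survived <- as.integer(cells$Survived[idx] == "Yes")
phi_cor <- abs(cor(sex, survived))

print(c(phi_chisq = phi_chisq, phi_cor = phi_cor))
```

The absolute value is needed because the sign of the correlation depends on which category is coded as 1, whereas $\phi$ computed from $\chi^2$ is never negative.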
2.2. Cramér’s V (and Tschuprow’s T)
When the contingency table is larger than 2-by-2, using $\sqrt{\chi^2 / N}$ can produce values larger than 1, which loses its interpretability as a correlation-like effect size. Cramér showed that while for 2-by-2 tables the maximal possible value of $\chi^2$ is $N$, for larger tables the maximal possible value for $\chi^2$ is $N \times (\min(k, l) - 1)$. Therefore, he suggested the $V$ effect size (also sometimes known as Cramér’s phi and denoted as $\phi_c$):
$$V = \sqrt{\frac{\chi^2}{N \times (\min(k, l) - 1)}}$$
$V$ is 1 when the columns are completely dependent on the rows or the rows are completely dependent on the columns (and 0 when rows and columns are completely independent).
Table 3 gives a short example in R that tests whether the probability of survival depends on the person’s travel class or position aboard the Titanic. The null hypothesis tested here is that the probability of survival is independent of the travel class or position.
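As a rough base R sketch of this computation (again assuming the built-in `Titanic` data, and not reproducing the paper's own listing), Cramér’s $V$ can be obtained directly from the $\chi^2$ statistic. The helper name `cramers_v_sketch()` is hypothetical and is used here only to avoid confusion with the function provided by effectsize.

```r
# Hypothetical helper: Cramer's V from a two-way table of counts.
cramers_v_sketch <- function(tab) {
  chisq <- unname(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  # The maximum possible chi-square is n * (min(rows, cols) - 1).
  sqrt(chisq / (n * (min(dim(tab)) - 1)))
}

# Class (1st, 2nd, 3rd, Crew) x Survived (No, Yes): a 4-by-2 table.
tab_class <- margin.table(Titanic, c(1, 4))
v_class <- cramers_v_sketch(tab_class)

# For a 2-by-2 table, V reduces to phi.
tab_sex <- margin.table(Titanic, c(2, 4))
v_sex <- cramers_v_sketch(tab_sex)

print(c(class = v_class, sex = v_sex))
```

Because $\min(k, l) - 1 = 1$ for any table with two rows or two columns, $V$ and $\sqrt{\chi^2 / N}$ coincide for both tables in this sketch; the two only differ once both dimensions exceed 2.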
Tschuprow devised an alternative measure,
$$T = \sqrt{\frac{\chi^2}{N \times \sqrt{(k - 1)(l - 1)}}}$$
which is 1 only when the columns are completely dependent on the rows and the rows are completely dependent on the columns, which is only possible when the contingency table is square [13].
For example, in Table 4, each row is dependent on the column value; that is, if we know whether the food is a soy, milk, or meat product, we also know whether the food is vegan or not. However, the columns are not fully dependent on the rows: knowing the food is vegan tells us the food is soy-based; however, knowing it is not vegan does not allow us to classify the food, as it can be either a milk product or a meat product.
Accordingly, as can be seen in Table 4, Cramér’s $V$ will be 1, but Tschuprow’s $T$ will not be.
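The contrast between $V$ and $T$ can be illustrated with a small base R sketch. The counts below are hypothetical, chosen by us only to mimic the structure described above (every soy product vegan, every milk and meat product not vegan); they are not the data of Table 4.

```r
# Hypothetical counts: food type fully determines vegan status,
# but "not vegan" does not determine the food type.
food <- matrix(c(50,  0,  0,
                  0, 20, 30),
               nrow = 2, byrow = TRUE,
               dimnames = list(Vegan = c("Yes", "No"),
                               Food  = c("Soy", "Milk", "Meat")))

chisq <- unname(chisq.test(food, correct = FALSE)$statistic)
N <- sum(food)
n_rows <- nrow(food)
n_cols <- ncol(food)

# Cramer's V scales by the smaller table dimension.
V <- sqrt(chisq / (N * (min(n_rows, n_cols) - 1)))

# Tschuprow's T scales by the geometric mean of both dimensions minus 1.
T_stat <- sqrt(chisq / (N * sqrt((n_rows - 1) * (n_cols - 1))))

print(c(V = V, T = T_stat))
```

In this non-square table $\chi^2$ reaches $N$, so $V$ attains its maximum, while $T$ is divided by the larger quantity $N \sqrt{2}$ and therefore stays below 1.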
We can generalize $\phi$, $V$, and $T$ to: $\sqrt{\chi^2 / \chi^2_{\max}}$. That is, they express the square root of the ratio of the sample $\chi^2$ to the maximum possible $\chi^2$ given the study design.
These coefficients can also be used for confusion matrices, which are 2-by-2 contingency tables used to assess machine learning algorithms’ classification abilities by comparing true outcome classes with the model-predicted outcome class. A popular metric is the Matthews correlation coefficient (MCC) for binary classifiers, which is often presented in terms of true and false positives and negatives but is nothing more than $\phi$.
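A brief base R sketch can make this identity concrete. The confusion matrix below is hypothetical (the counts are invented for illustration), and the MCC is written out in its usual true/false positive/negative form so that it can be compared with $\phi$ computed from $\chi^2$.

```r
# Hypothetical confusion matrix: rows = actual class, columns = predicted class.
conf_mat <- matrix(c(40, 10,
                      5, 45),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(actual    = c("pos", "neg"),
                                   predicted = c("pos", "neg")))

TP <- conf_mat["pos", "pos"]
FN <- conf_mat["pos", "neg"]
FP <- conf_mat["neg", "pos"]
TN <- conf_mat["neg", "neg"]

# Matthews correlation coefficient in its usual form.
mcc <- (TP * TN - FP * FN) /
  sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))

# phi from the uncorrected chi-square statistic of the same table.
chisq <- unname(chisq.test(conf_mat, correct = FALSE)$statistic)
phi <- sqrt(chisq / sum(conf_mat))

print(c(mcc = mcc, phi = phi))
```

The MCC keeps a sign (a classifier can be worse than chance), so strictly it is the absolute value of the MCC that equals $\phi$.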
4. Simulation Study of the Distributional Form of the Fei Effect Size
In the previous section, we showed some results for the effect size פ (Fei) and its confidence intervals for different distributions of a multinomial variable. Like all effect sizes discussed in this paper, פ follows a scaled non-central $\chi$ distribution: the $\chi^2$ statistic follows a non-central $\chi^2$ distribution, and its square root follows a non-central $\chi$ distribution; this random variable is then scaled by a constant that is a function of the sample size and the study design. The noncentrality parameter of the non-central $\chi$ distribution can be found by applying the inverse of the scale to the population effect size. Therefore,
$$c \times X \sim \chi_{k-1}(\lambda = c \times \text{פ})$$
where $c$ is the inverse of the $\chi$ to פ conversion:
$$c = \sqrt{N \times \left(\frac{1}{\min(p_E)} - 1\right)}$$
with $\min(p_E)$ the smallest expected probability, פ is the population effect size, $k$ is the number of classes, and $X$ is the random variable of possible observed effect sizes in a random sample. This can also be formulated in terms of a non-central $\chi^2$ distribution:
$$c^2 \times X^2 \sim \chi^2_{k-1}(\lambda = c^2 \times \text{פ}^2)$$
where $c^2$ is the inverse of the $\chi^2$ to פ$^2$ conversion:
$$c^2 = N \times \left(\frac{1}{\min(p_E)} - 1\right)$$
To validate our assumptions, we conducted a simulation study in which we simulated data from multinomial distributions for known true effect sizes of 0.1, 0.3, and 0.5. The datasets contained 500 simulations per effect size, for three different sets of expected probabilities (the same as in Table 6) and three different sample sizes of 50, 100, and 350, resulting in 13,500 simulated data points (500 simulations × 3 effect sizes × 3 expected probabilities × 3 sample sizes). Figure 1 shows the results from the simulation study.
The smallest sample size is more affected by noise, and results show more variation (and less continuity) of simulation-based פ (Fei) values around the true effect sizes. For sample sizes N = 100 and N = 350, פ values closely replicate the true effect sizes and clearly follow a non-central $\chi$ distribution, indicating that Fei, like $\phi$, $V$, $T$, and $w$, is a scaled $\chi$ value.
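One cell of such a simulation can be sketched in base R as follows. This is only an illustration under our own assumptions, not the simulation code used for Figure 1: the expected and true probabilities, the seed, and the helper `fei_sketch()` are chosen by us, with פ computed as $\sqrt{\chi^2 / (N \times (1/\min(p_E) - 1))}$.

```r
set.seed(123)                 # arbitrary seed, for reproducibility only

p_null <- c(0.35, 0.65)       # expected probabilities (illustrative)
p_alt  <- c(0.545, 0.455)     # true probabilities (illustrative)
N      <- 350                 # sample size per simulated dataset
n_sims <- 500                 # number of simulated datasets

# Hypothetical helper: Fei from observed counts and expected probabilities.
fei_sketch <- function(counts, p_exp) {
  n <- sum(counts)
  chisq <- sum((counts - n * p_exp)^2 / (n * p_exp))
  sqrt(chisq / (n * (1 / min(p_exp) - 1)))
}

# Population effect size: Fei does not depend on N, so the expected
# counts under the true probabilities can be plugged in directly.
true_fei <- fei_sketch(N * p_alt, p_null)

# Draw multinomial samples under the true probabilities; each column
# of `draws` is one simulated dataset of counts.
draws <- rmultinom(n_sims, size = N, prob = p_alt)
sim_fei <- apply(draws, 2, fei_sketch, p_exp = p_null)

print(true_fei)
print(summary(sim_fei))
```

Repeating this for several sample sizes and plotting the simulated values against the population value is one way to inspect how the spread around the true effect size shrinks as N grows.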
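פ (Fei) following a non-central $\chi$ distribution also allows for power calculation. For example, if the null probabilities are [0.35, 0.65] and the alternative probabilities are [0.545, 0.455], the scaling constant is:
$$c = \sqrt{N \times \left(\frac{1}{0.35} - 1\right)} \approx \sqrt{1.857 \times N}$$
and the population effect size is
$$\text{פ} = \sqrt{\frac{\frac{(0.545 - 0.35)^2}{0.35} + \frac{(0.455 - 0.65)^2}{0.65}}{\frac{1}{0.35} - 1}} = 0.3$$
Therefore, the sample פ, denoted $X$, will follow the following distribution:
$$\sqrt{1.857 \times N} \times X \sim \chi_1(\lambda = 0.3 \times \sqrt{1.857 \times N})$$
One must then find the N that produces the desired power for the significance level that will be used to reject the null. For example, for a significance level of 0.01 and a power of at least 0.85, an N of approximately 78 is required.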
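The search for N can be sketched in base R using the non-central $\chi^2$ formulation, where the noncentrality parameter equals $N \times w^2$ (equivalently $c^2 \times \text{פ}^2$). This is a minimal sketch under the stated null and alternative probabilities; the function `power_at()` and the search interval passed to `uniroot()` are our own choices, not part of the paper's accompanying code.

```r
p_null <- c(0.35, 0.65)      # null (expected) probabilities
p_alt  <- c(0.545, 0.455)    # alternative (true) probabilities
alpha  <- 0.01               # significance level
target <- 0.85               # desired power
df     <- length(p_null) - 1

# Cohen's w squared and the corresponding population Fei.
w2  <- sum((p_alt - p_null)^2 / p_null)
fei <- sqrt(w2 / (1 / min(p_null) - 1))

# Power at sample size N: probability that a non-central chi-square
# variable with ncp = N * w2 exceeds the critical value.
power_at <- function(N) {
  crit <- qchisq(1 - alpha, df)
  pchisq(crit, df, ncp = N * w2, lower.tail = FALSE)
}

# Solve power_at(N) = target for a (possibly fractional) N.
N_needed <- uniroot(function(N) power_at(N) - target, c(2, 1000))$root

print(c(fei = fei, N_needed = N_needed))
```

The root returned by `uniroot()` is fractional; in practice it would be rounded up with `ceiling()` to obtain a whole-number sample size.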
The pwr package in R provides a function (pwr.chisq.test()) that can be used to calculate the power for goodness-of-fit tests. Although the function uses Cohen’s w as input/output, פ can easily be converted to Cohen’s w (e.g., by using the fei_to_w() function from effectsize), allowing for the pwr function to be used with פ. An example can be found in the accompanying R code.
Effect sizes are essential to interpreting the magnitude of observed effects; they are frequently required by scientific journals; and they are necessary for a cumulative quantitative science relying on meta-analyses. In this paper, we have covered the mathematics and implementation in R of four different effect sizes for analyses of categorical variables that specifically use the $\chi^2$ (chi-square) statistic. Furthermore, with our proposal of the effect size פ (Fei), we fill in the missing effect size for all cases of a $\chi^2$ test, as can be seen in Table 7.
Thus, we now have effect sizes to accompany 1-dimensional or 2-dimensional contingency tables of any size. Each represents the sample’s $\chi^2$ relative to the maximally possible $\chi^2$, ranges from 0 to 1, and can be easily interpreted on the scale of a correlation coefficient.