3. Examples to Illustrate Good Statistical Practices in Categorical Data
Outline of analysis of our two alfalfa experiments, the nodules per root (N/R) and phosphorus-winter survival (P_WS). The structured approach to the analysis of data from these experiments includes determining the objective, detailing the experiment design, carefully determining quantities measured, specifying analyses and potential analysis issues, and finally making conclusions and communicating results. We give this outline for each of the experiments, first with the alfalfa nodules/root (N/R) experiment and next with the phosphorous winter survival (P_WS) experiment.
Example 1. Nodules/root (N/R) experiment to compare alfalfa cultivars and nodulation strains
Ex1.1. Objective. Objectives of the alfalfa nodules study were: (1) to test whether there is a difference in mean number of nodules on roots between two alfalfa cultivars, one with branching or fibrous root and the other with taproot architecture; (2) to test whether there are differences for different nodulation strains.
Ex1.2. Experiment Design. The design of this experiment is a randomized complete block for six treatments in a (2 × 3) factorial, two cultivars each having three inoculation treatments (two nodulation strains plus a control). The three blocks are repeats of greenhouse runs undertaken at different dates (called Times), with eight plants per treatment, each grown in separate pots, all completely randomized within the greenhouse. This results in 3 × 2 × 3 × 8 = 144 total plants. However, six plants are missing due to plant death, so the total number of plants on which measurements are made is 138.
Ex1.3. Data. Nodules are counted on the roots of each plant after plants have been established and grown for 14 days (original data in supplemental file “nodules.csv” for detail). Nodule numbers range from 0 to 78, with 4 of the plants having 0 nodules. We note that the counts are likely to have a higher variance for higher mean count numbers, a potential issue for the data analysis if we use the NID assumptions. This type of count data are often considered to be Poisson-distributed [
3,
15].
Ex1.4. Statistical Analysis. The first analysis for this experiment uses the nodule count data (Y) in a linear model assuming normal independent distribution (NID), an analysis familiar to many scientists. Models in these statistical analysis subsections are written in the same way for both JMP and R, except for interactions (in JMP the A-B interaction is depicted as A*B and in R as A:B. In R, A*B means A + B + A:B). The model, written in R, is:
where Time (=greenhouse run considered as a Rep), its interactions are assumed random effects, and Cultivar and Inoculum are fixed.
The second analysis uses the square-root data transformation and averages over pots to obtain an anova to test the null hypothesis of equal cultivar means and to estimate cultivar and inoculum treatment means, as well as to test the cultivar-inoculum interaction. Its model, written in JMP, is:
where all factors, including the nuisance factor, Time, are considered fixed.
The third analysis has a generalized linear mixed model for Poisson distributed Y, with a model equation similar to that in Equation (2), except using glmer (see R Markdown supplement), and for this glmm, there is a normal random effect for pot to account for this recognized source of variation. This model also has omitted terms not statistically significant.
where Time and Pot are random and other factors are fixed. This equation refers to modeling the standard Poisson log link function on these explanatory variables.
There are several issues related to these statistical analyses. We examine proper error terms for testing treatments in this design for different scopes of inference. Count data do not fit the usual anova normal independently distributed (NID) assumptions, and we have categorical data analysis software in R and JMP available for glm analyses for the Poisson distributed count data. We used the glmer function to fit the glmm (4) with log link and independent normal pot error term. The R codes can be found in a R markdown file “nodules.rmd”, and the line-by-line results can be found in the supplemental file “nodules.docx” for details.
Ex1.5. Interpretation. A difference in numbers of nodules per plant may be associated with higher nitrogen (N) fixation, and it is important to find if a difference in nodules exists. A 95% confidence interval (CI) for the cultivar mean nodule difference would be valuable scientific information. These results might influence the direction of an alfalfa breeding program.
Example 1 Results of Nodules/Root (N/R) Experiment
Over-simplified analyses can give misleading first impressions. A researcher had requested a simple independent-sample t-test for cultivar difference. An issue with this test is that the
p-value for cultivar difference from this did not agree with probabilities from a linear model analysis-of-variance. The anova table of the nodule counts, using Equation (2), is shown in
Table 1. This table has a cultivar null hypothesis probability of 0.026, which differed from the independent sample t-test with probability of 0.19. The independent-sample t-test for cultivar differences was shown to be equivalent to a one-factor (cultivar) anova (see
Figure 1), as discussed in the literature review. This example illustrates that the independent sample t-test comparing cultivars, because it does not take account of important sources of variation, inflates the error term. These sources of variation include blocks (greenhouse runs) and inoculum treatments, which, when part of the residual, makes the error term too high for the independent sample t-test compared with the multi-factor anova model.
Figure 1 below contains two R programs, one being the anova with a single two-level factor, cultivar, and the other a simple independent-sample t-test for differences of the two cultivars. Both programs output the same results. However, as we mentioned earlier, this analysis is faulty because it does not take into account important contributions to the variance from blocks (Times of greenhouse runs) and inoculum treatment differences. We present these incorrect analyses to illustrate a pitfall and to show equality of the results of the two methods.
The desired inference requires a proper error term. The analysis of variance with fixed effects in
Table 1 also has the problem that the test of cultivars uses a residual mean square based on 90 degrees of freedom (df), which is not the proper error term for inferring results to new greenhouse environments. Crawley ([
11], pp. 173–182) discusses considerations in random vs. fixed effects, and Stroup et al. ([
2], p. 10) discuss the scope of inference when making this choice in models. Plants are nested within each replication or greenhouse run (Time), and to infer results to future greenhouse conditions, we need to assume the greenhouse runs are a random sample from a population of such runs. The error needed for this broader inference should include the Time interaction with treatment combinations. This can be seen in the expected values of mean squares in the analysis of variance table, assuming a random Time-cultivar interaction. (Our example has the anomaly that the cultivar-block mean square is smaller than that of residual even though it is expected to be larger). The main issue is that the residual error term in
Table 1 measures plant-to-plant variation, which has inflated degrees of freedom (df). Many researchers mistakenly use such error terms with degrees of freedom (df) higher than warranted by the experiment design, called pseudoreplication by Crawley ([
11], pp. 176–182). With mixed model analysis designating the block-treatment interactions as random, some software packages such as JMP automatically perform correct tests of treatment differences for broader inference.
Another learning point about the error term in the analysis of this experiment is the difference in testing with type I and type III sums of squares for cultivars. Type I sum of squares refers to sequential testing of each successive factor in the model and is the default method for R anova. On the other hand, JMP and SAS use Type III sums of squares, which account for all other factors in the model, including interactions. Consequently, it is necessary to consider the default F-tests for cultivar differences when using different software for analyses.
The data do not fit the usual anova assumptions. A third issue is that data are counts of nodules on the alfalfa roots, and count data do not usually follow a normal distribution. This leads first to the question of whether to transform the data to solve this issue. Later we also explore using generalized linear models (glm) for analysis. The square root (sqrt) transformation is recommended by Snedecor and Cochran to better approximate the NID anova assumptions ([
3], pp. 287–289). Data distributions for original data y and recommended transformation sqrt(y) are given in
Figure 2 below. JMP graphically depicts the data distributions, including the original nodule count data and the square root of nodule count. Tests of goodness of fit for normal distributions are below the histograms and normal quantile plots, and the square-root transformation distribution does not significantly differ from the normal distribution.
We know that taking averages of the eight plants per treatment combination results in means that are more normally distributed than the original count data. Analysis using averages of the original data is performed with Time (=block factor), cultivar, inoculum, and cultivar-inoculum interaction, given in
Figure 3. The same analysis with the averages of square root-transformed nodule counts is also shown. These type analyses are called two-stage analyses [
10]. From the output, we see two different estimates of probability of the null hypothesis that cultivars are equal,
p = 0.029 for the averages of sqrt transformed and
p = 0.134 for the averages of original data.
Using Equation (3) with the sqrt transformation and averages of the eight plants, the cultivars are statistically significantly different (p-value = 0.029). More important to report is the 95% confidence interval for the difference between cultivars, for which cultivar 3233 has an average of 3.2 more nodules per root than cultivar 3234, and 95% CI (3.31 to 4.13). This difference is computed by squaring the square-root-transformed cultivar means and subtracting. The standard error of the difference (SED) of means is found by squaring and adding the standard errors of means to obtain the variance of a difference, and taking the square root. However, this is the SED of transformed values, and so must be squared to obtain the SED in the original scale.
The best methods for analyzing non-normal data continue to evolve. The generalized linear model in JMP for nodule counts was modeled on fixed Time (greenhouse runs), Cultivar, Rhizobium inoculants, and Cultivar-Rhizobium interaction. This generalized linear model option is found in the upper right corner of the Fit Model window labeled “personality”. We use the Poisson distribution with log link and the output is presented in
Figure 4. The same analysis in R using the function glm is shown in
Figure 5, with the same probability for a test of cultivar differences, 0.000004. These figures indicate very high statistical significance, i.e., a low probability of cultivars having the same average counts of nodules, but this uses an error term with inflated degrees of freedom with the narrow inference restricted to the three fixed greenhouse runs (Times) of this experiment instead of the broader inference across future greenhouse environments. A correct error term for this broader inference uses the random time-cultivar interaction term as specified in expected values of mean squares in
Table 1. With this term in the linear R model, we obtain a probability of 0.0259 for the test of cultivar differences, and in the glmm model in Equation (4) this probability is 0.0234 (see the R markdown file).
Summary of the results from different N/R example analyses. Results of the analyses performed for our example are consolidated in
Table 2, which shows the model type, number of data points (either 138 individual or 18 averages for the nodulation data, advantages and disadvantages of each model method, and cultivar
p-value to illustrate similarities or differences of the software and methods.
Discussion for Example 1, the Nodules per Root (N/R) Experiment
One point of emphasis statisticians convey to researchers is to not strictly adhere to a red line p = 0.05 for making decisions for testing hypotheses. A more meaningful method is to use confidence intervals for the difference in means of counts for the two cultivars. This difference is an average of 3.7 nodules per root, with 95% CI (3.3, 4.1). We simply use p-values here as statistics for comparisons of models.
Several lessons were learned in testing the null hypothesis of equal true average nodule counts for the two cultivars. One is that SAS/JMP generally gives the same results for the same model and assumptions, as does R. This is illustrated in
Table 2 with generalized linear model fits of the two software programs. Of course, there may be differences due to different floating-point accuracies of programs, even if the algorithms are the same. Another lesson learned is that the correct error term for testing treatment differences is dictated by randomization in the experiment design, the scope of inference desired (e.g., wishing to infer cultivar differences for a population of potential greenhouse runs), and inherent testing protocols for the particular software used. A third lesson is that assumptions for the analysis can affect
p-values. For example, averaging nodule counts results in distributions closer to normal compared with original data. The square root transformation recommended by Snedecor and Cochran, in our example, has more nearly normal data than the original observations. Stroup has recommended to use generalized linear models rather than transformations such as the square root [
6]. Finally, we use the generalized linear model in JMP and glm in R for nodule counts for the 138 plants. For this paper, these models have fixed effects for Cultivar, Inoculum, and their interaction, but additional analyses consider Time (of greenhouse run) and its interactions as random, leading to mixed models (using GLIMMIX in SAS or glmer in R). Generalized linear mixed models (GLIMMIX in SAS or glmer in R) are helpful for modeling to account for random factors, and the recommended glmm in R in Equation (4) is in the R markdown supplemental file. Using other modeling distributions, such as negative binomial, can handle over-dispersion relative to the Poisson, whose mean is the same as variance [
4,
6].
Example 2: Phosphorous fertilizer application and winter-survival (P_WS).
Ex2.1. Objective. The objective of this experiment is to test whether different phosphorous treatments will increase winter survival of alfalfa plants. The response variable is stand count or survival rate. Response of survival to quantitative phosphorous applied (P), i.e., P-application, for each of two cultivars is to be estimated. Tests of cultivar differences are not a major objective because we already know they differ.
Ex2.2. Experiment design and conduct. This is a field plot experiment with three reps whose design may be found in Wang, et al. (2022) [
16]. Each plot was thinned to 1000 plants/plot in the fall. Stand count was determined following the winter after phosphorous treatments had been applied. Phosphorous treatments applied the previous year were whole-plot treatments in a split-plot in randomized complete block design with subplots two alfalfa cultivars, one dormant and one semi-dormant. In September, 0, 50, 100, and 150 kg ha
−1P
2O
5 (P0, P1, P2, and P3) as calcium phosphate was applied and incorporated into the soil at 5 to 8 cm depth, with irrigation after soil mulching. No harvest was undertaken in the establishment year. Plants were kept well-watered and weed and insect control was conducted as necessary.
Ex2.3. Data. These data are an example of a binomially distributed response variable (survival count of n = 1000 plants, see supplemental data “winterSurvival.csv” for detail). The binomial data are a categorical response data example. The survival count for 1000 plants ranges from low of 682 up to 973. Survival rates are correspondingly 0.682 to 0.973. Categorical explanatory factors are “Reps” (blocks of the Randomized Block Design), “Treatments” (the four phosphorous rates randomized on the four whole-plots), and “Cultivars” (two subplot cultivars). The four nominal phosphorous treatments have quantitative phosphorous application rates (P = 0, 50, 100, 150 kg/ha) for estimating survival response to applied phosphorous. Alfalfa yield from the year before the winter survival and its three component cuttings are included in the data set.
The binomial distribution is appropriate for describing the number of successes of a fixed number n of Bernoulli independent trials, such as coin flips with heads or tails. Here, the binary data are success (plant survives) or failure (it dies) of the
n = 1000 total plants. The mean, or expected value, of number of plants surviving for the binomial distribution is np and the variance is npq, where p is probability of success and q = (1 − p). For example, if the average proportion surviving is 0.852, the mean number of plants surviving per plot is 852 and the variance is 0.852 × 0.148 × 1000 = 126.096. As you can see, the variance is not independent of the mean, violating the NID assumption. However, for large n values, we can often use the normal approximation to the binomial ([
3], p. 130). We have a large enough n (=1000) to meet Snedecor and Cochran’s “rule” of np and nq larger than 15 (discussed on page 130 of their book). For the smallest q, we have nq = 1000 × [1 − (973/1000)] = 27, which is larger than 15.
The binomial and Poisson distributions are related. S. D. Poisson derived this distribution ([
3], p. 130) for rare events of the binomial distribution by letting n approach infinity as p approaches zero for np, the mean, being constant [
3]. For the Poisson, the mean equals the variance, both approaching np compared with the binomial distribution where mean = np and variance = npq. (As n-> infinity and p -> 0, q -> 1, and Poisson variance -> np = mean). Both distributions approach the normal distribution for n being sufficiently large.
A more practical matter for breeders and scientists is to see what aspects of the distributions apply to modelling experiment data. Both distributions, binomial and Poisson, have a relationship of mean to their variance. In the binomial, we often model proportion of successes, p. Binomial has least variance when p is close to 0 or 1 and highest variance when p = 0.5. Poisson is useful for modeling counts and has larger variance for larger counts [
4]. A potential problem if we try to model survival count using Poisson and we actually have a fixed number of plants per plot (here, 1000) is that the Poisson distribution has variance continually increasing as survival increases, but binomial survival count variances decrease as we move closer to 1000. Both distributions have variance determined when we estimate the mean.
In contrast, the normal distribution with variance independent of mean allows estimation of mean and variance separately. This is an advantage because we expect soil and weather error variation in addition to the variance from the relationship with mean of binomial or Poisson random variables. Therefore, it seems reasonable to use a linear model with a separate variance parameter when the normal approximation is justified. Agresti ([
4], pp. 75–76) gives examples of the Poisson distribution with overdispersion when any variables affecting the data are missed, and he proposes the negative binomial distribution (with mean µ and variance µ + Dµ
2 for dispersion parameter D) to model the extra variance (p. 220). He also illustrates generalized linear mixed models with random error structures to better capture this variance in Chapter 10 of the book [
4].
Other variables measured in this experiment are total alfalfa yield the season prior to P-treatment and overwintering. Total yield Y = Y1 + Y2 + Y3, the sum of alfalfa yield of the three cuttings.
Ex2.4. Statistical Analysis
These data analyses are also conducted using R and JMP for comparison, the R codes can be found in the R markdown file “winterSurvival.rmd”, and the line by line results from the R codes can be found in the supplemental file “winterSurvival.docx” for details.
We potentially could use total yield or yield for third cutting as a covariate in the modeling of overwinter survival. This variable might be positively associated with survival, indicating healthier plants with higher yield which store more photosynthate in the roots for the winter, or might be negatively related if removal of nutrients from high yield the previous season weakens plants for winter survival. These models are run, but the yield covariate is not statistically significant.
Because of the large sample size (1000), we can analyze assuming the normal distribution (see the criteria quoted above). First, we fit the linear mixed model with JMP as:
using Treatment (Trt) and Cultivar as fixed nominal factors, and both Rep and Rep-Treatment interaction as random. This way of writing a linear model corresponds to the sequential fitting of factors in a split-plot analysis with whole-plot factor phosphorous treatment (Trt) having whole-plot error estimated as the Rep-Trt interaction.
Regression of survival on quantitative phosphorous applied (P) provides a response curve which can be used to estimate ideal amount of applied P for best winter survival. The 95% CI for optimum rate of applied P is computed from this equation.
The generalized linear mixed model for the binomial variable p = survivalcount/1000 can be fit in R with glmer and model:
where Rep, Rep:Trt, and Plot are considered random and the other factors are fixed. The equation refers to modeling the proportion on explanatory factors using the binomial canonical logit link function ([
4], p. 67).
Ex2.5. Interpretation. The response of winter survival to applied P gives farmers a tool for achieving better alfalfa stands and, potentially, a longer-lasting alfalfa stand. With two cultivars that differ in their inherent survivability, we can see if there is cultivar-P rate interaction, and if not, may give a more general recommendation. If the optimum P-rate for survival differs for the cultivars, different optimum rates could be recommended.
Example 2 Results of P Fertilizer Application and Winter-Survival (P_WS) Experiment
By using the model in Equation (5) above, the random-effect variance component estimate (from JMP) for Treatment*Rep is estimated as 0, as shown (
Figure 6) below.
The treatment-rep interaction component is negligible and the residual error term for testing Cultivar and the Treatment-Cultivar interaction, Error (b) based on eight error degrees of freedom (df), is 2974.9. In the split-plot experiment, the Treatment effects are tested with the error (a) term based on six df, the random Rep-Treatment interaction mean square. However, with this component negative, we pool the error (a) and Error (b) terms to obtain a single residual error based on 14 df, which includes all the Rep interactions with other factors. Its estimate is 2476.4. The analysis-of-variance linear model in JMP is Survival_count = Rep + Treatment + Cultivar*Treatment, with results in
Figure 7.
The anova has statistically significant rep and cultivar differences, and the test of phosphorous treatment differences has a probability of 0.09. Because our objective is to find the survival relationship with applied P, it is better to model the response to the quantitative Phosphorous applied than to test statistical significance. The treatment-cultivar interaction is practically nonexistent, and that term is omitted from the model.
The three degrees of freedom for treatments can be examined using three orthogonal contrasts whose sums of squares (SS) will add to the treatment SS in
Figure 7 above. We fit the linear, quadratic, and cubic orthogonal contrasts using JMP, and results are shown in
Table 3. These use the same error term as in the anova
Figure 7, and their SS add to 19,262, the same as the three-df treatment SS in
Figure 7. The tests of significance for linear, quadratic, and cubic are given in
Table 3 and use our best estimate of error. As stated previously, do not put too much emphasis on significance tests and a “red line” at a probability of 0.05. Here, our principal objective is to fit survival response to P-application. The cubic term is not necessary, so we fit a quadratic response curve.
Without treatment-cultivar interaction, we can model survival on quantitative P for both cultivars. As seen in
Figure 8 below, the quadratic fit to the data for cultivars combined is:
Y (Survival) = 806.4 + 1.592P − 0.0084 P2, where P is kg/ha applied P2O5. Standard errors for the coefficients of P and P2 are 0.569 and 0.0036, respectively.
To find the optimal rate of P to apply, we set the first derivative of Y with respect to P to zero, 1.592 − 0.0168 P = 0, and solve for the optimum, which is P = 94.8, or 95 kg/ha applied P as P2O5. The estimated standard error of the optimum P rate is 1.325, and a 95% confidence interval for the optimum is (92.0, 97.6).
The figure from JMP shown below has a correct equation in the key in the upper left, but does not have a proper root mean squared error (RMSE) or significance test. The model fits the equation correctly but does not remove recognized variance sources reps and cultivars, inflating the error and rendering
p-values too high. The correct anova error variance is 2476.4, with RMSE = 49.8. The linear and quadratic contrast probabilities given in
Table 3 above are correctly computed, with 0.09 and 0.06 as
p-values for linear and quadratic, respectively. The cubic term is lack of fit for the quadratic equation and is not significant, with P = 0.69.
The model to fit response to applied P uses the quantitative variable P and quadratic term P
2. This is undertaken for cultivars together (the cultivars had similar response curves). The result is graphed in
Figure 8 below.
Summary of the results from different P_WS analyses. Results of the analyses performed for our example are consolidated in
Table 4, which shows the model type, number of data points (24 for the winter survival data), advantages and disadvantages of each model method, and cultivar P values to illustrate similarities or differences of the methods.
Discussion for Example 2, the P Fertilizer Application and Winter Survival (P_WS) Experiment
The R code for analysis and updates to this analysis is contained in the R markdown file “winterSurvival.rmd”. We compare models with R analyses and results for the Gaussian linear model, for a generalized linear model (glm) using a binomial distribution with logit link, and finally for the generalized linear mixed model glmm with random field variance components for Reps and the Rep-Treatment interaction using binomial distribution modeling. There are differences in
p-values for the test of Treatment mean differences for these methods. The standard tests for different scopes of inference are dictated by the model fitting of, for example, the random Rep-Treatment component which has correct error df for the broader inference (over future reps for which this set is a random sample). Fang and Loughin used a simulation for a split-plot design and found methods based on linear mixed models and generalized linear mixed models held Type I error rates better than generalized linear models [
17].
We see many of the same principles illustrated in this binomial data example as were shown in the Poisson nodule count example, such as the need to pay attention to the scope of inference and to the error terms for correctly testing hypotheses in deciding fixed versus random effects ([
2], p. 12). Such principles are a main point of emphasis for these examples.
We have seen that there are several ways that statistical models can be specified, some more true to the underlying distributions, and others which are approximations but may have more familiarity to the data analyst. For this winter survival example, the normal approximation turns out to be very good, and is straightforward to use for those trained in the use of linear model methods, such as regression and analysis of variance. The survival count data may at first seem to be an example in which the Poisson modeling for count data seems appropriate, but the experiment is to count surviving plants from an initial stand of 1000. This makes the underlying distribution binomial, which differs from the unbounded count data which have a mean equal to the variance. In fact, as we discussed earlier, the closer the counts are to 1000 the less variance they have. The binomial distribution seems a much better option to match the actual conditions of this experiment.
A second consideration in modeling these data is the actual experiment design and spatial variation generating what we assume are random error terms. The (P_WS) experiment was arranged as a split plot with phosphorous treatments as whole plots and cultivars as subplots. This subsumes an error structure with one error term for whole plots and another for the subplots. This leads to random error designations as discussed by Crawley ([
11], pp. 173–184). The generalized linear models, even with random terms (glmms), would need the random error terms corresponding to those of the linear models which assume normal distributions. The normal distribution model has a variance parameter as well as the parameter for mean, and modeling using glms and glmms is often performed with some provision for this over-dispersion that is inherent when we assume variance is only dependent on the mean. Although this extra variance can be modeled with a distribution such as negative binomial having an extra dispersion parameter, our approach is to include a random plot variance to account for the residual field variance and random phosphorous Trt-Rep interaction for the whole-plot error variance. As stated already, the methods for these CDA analyses are continually evolving.