# Good Statistical Practices in Agronomy Using Categorical Data Analysis, with Alfalfa Examples Having Poisson and Binomial Underlying Distributions


## Abstract


## 1. Introduction

- To raise awareness of categorical data analysis methods.
- To demonstrate proper CDA analyses.
- To highlight potential pitfalls and corrections in statistical analyses for the example data sets.

## 2. Literature Review

**Pitfall 1. Using too simple an analysis, e.g., a simple t-test when we need to include other recognizable sources of variation in the model.** There is always a delicate balance between over-simplifying and including too many independent variables in a statistical model. James et al. discuss the bias-variance trade-off in statistical modeling, stating that neither over-simplified nor overly complicated models (which can over-fit to the point of zero error degrees of freedom) are good for prediction ([9], pp. 33–36). For example, suppose we have an experiment with two cultivars and three other treatments in a 2 × 3 factorial. An over-simplified initial t-test can give a misleading first impression. All sources of variation other than that due to cultivar would be grouped into the error variance estimate, but important factors such as the other treatment and perhaps a blocking factor should have been recognized and excluded from the residual error. Without these sources of variation accounted for in the model, the error term would be greatly inflated, a major pitfall. The relation F = t^{2} holds for analysis of variance with a single two-level factor ([3], p. 224), so in this instance a single-factor anova model would be just as faulty as the t-test because important other factors would be left out. Failure to include all important terms makes an overly simple model less flexible, potentially resulting in large prediction bias [9].
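The identity F = t^{2} for a single two-level factor can be checked numerically. The sketch below is in Python (standard library only) rather than the R and JMP used in the paper, and the counts are hypothetical values chosen only for illustration.

```python
import math
from statistics import mean

# Hypothetical nodule counts for two cultivars (illustrative values only).
cultivar_a = [12, 15, 9, 20, 14, 11, 17, 13]
cultivar_b = [18, 22, 16, 25, 19, 21, 15]


def pooled_t(x, y):
    """Independent-sample t statistic with a pooled variance estimate."""
    nx, ny = len(x), len(y)
    mx, my = mean(x), mean(y)
    ss_within = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    pooled_var = ss_within / (nx + ny - 2)
    return (mx - my) / math.sqrt(pooled_var * (1 / nx + 1 / ny))


def one_way_f(groups):
    """F statistic of a one-way analysis of variance."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


t_stat = pooled_t(cultivar_a, cultivar_b)
f_stat = one_way_f([cultivar_a, cultivar_b])
print(f"t^2 = {t_stat ** 2:.6f}, F = {f_stat:.6f}")
```

Because both statistics use the same pooled within-group variance, the two tests share every weakness of a model that ignores blocks and other treatment factors.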

**Pitfall 2. Using the wrong error term for desired inference.** A common mistake in analyses with “cookbook” methods is to use an error term having inflated degrees of freedom to test for statistical significance. For example, in a cultivar yield trial with eight locations, each having a randomized complete block experiment, we generally wish to infer results to a new environment or set of environments. Running a linear model with location, cultivar, and location-cultivar interaction is reasonable, but using the residual error term from this model to test for cultivar differences can be misleading. We want to infer results not just to the exact environments tested but to new ones, potentially in a future year or on a different field, of which the tested environments are a sample ([2], p. 12). One approach that avoids the wrong error term is to first obtain cultivar averages or least squares means for each location and then use these means in an across-locations analysis with cultivar and location as terms in the model [10]. Möhring and Piepho call this a two-stage analysis. Two-stage analyses typically require using standard errors from the first stage to weight the averages in the second stage, but equal variances are often assumed when all first-stage means have the same or nearly the same numbers of observations. Split-plot experiment designs require models with more complicated error terms ([11], pp. 174–175). Stroup discusses the error structure in split-plot experiments, including an error component for whole-plots and another for subplots ([6], p. 821). Crawley even discusses a three-level split-split plot example ([11], p. 174). Expected values of mean squares guide researchers on error terms for hypothesis tests in these split-plot designs ([3], pp. 322–323).

Automatic tests of treatment effects in nested designs may differ between R functions and those in SAS and JMP, and it is important to know whether Type I, Type II, or Type III sums of squares are being used for these tests. Type I, or sequential, sums of squares are used in the R functions aov and anova, for example in the command anova(lm(y ~ A*B, data = D)). They differ from the Type III sums of squares and mean squares in the function Anova (written with a capital A for distinction) in the R car package [12,13]. Type III sums of squares are the default in JMP [1]. Type I sums of squares are helpful for analyzing split-plot experiments, as Crawley illustrates with the R function aov ([11], p. 174). For a split-plot design with whole-plot factor A and subplot factor B, both nested in rep R, the split-plot analysis command is:

**Pitfall 3. Placing too much emphasis on achieving probabilities less than 0.05 in hypothesis tests.** Scientists and other practitioners seem obsessed with obtaining statistically significant test results, generally meaning a probability below 0.05 for their particular hypothesis test. This practice is dangerous for two reasons. First, results that are important but show no statistically significant difference, for example between treatment and control, often go unpublished. Second, the real point of the experiment is often missed. For example, we may want to know the optimum phosphorus fertilizer rate for maximizing alfalfa plant over-winter survival, but simple hypothesis tests of each treatment versus control might not be statistically significant even when there is a true quadratic relationship.

**Pitfall 4. Confusion in choosing the correct software function to analyze data.** A programming language such as R offers so many functions that it is hard for scientists to decide which to use when writing a program to analyze their data: more than 10,000 R packages are available to download [14]. Examples of well-vetted R functions are available in good textbooks that describe statistical procedures and illustrate how to use the commands, for example Agresti [4] and James et al. [9]. Another guiding principle is to always check with your statistician, who is an expert in data analysis, in the same way you would check with a plant pathologist if you found an unfamiliar disease on your crop of interest.

## 3. Examples to Illustrate Good Statistical Practices in Categorical Data

**Outline of analysis of our two alfalfa experiments, the nodules per root (N/R) and phosphorus-winter survival (P_WS).** The structured approach to the analysis of data from these experiments includes determining the objective, detailing the experiment design, carefully determining the quantities measured, specifying analyses and potential analysis issues, and finally making conclusions and communicating results. We give this outline for each experiment, first the alfalfa nodules/root (N/R) experiment and then the phosphorus winter survival (P_WS) experiment.

**Example 1. Nodules/root (N/R) experiment to compare alfalfa cultivars and nodulation strains**

**Ex1.1. Objective.** Objectives of the alfalfa nodules study were: (1) to test whether there is a difference in mean number of nodules on roots between two alfalfa cultivars, one with branching or fibrous root architecture and the other with taproot architecture; (2) to test whether there are differences among nodulation strains.

**Ex1.2. Experiment Design.** The design of this experiment is a randomized complete block for six treatments in a (2 × 3) factorial, two cultivars each having three inoculation treatments (two nodulation strains plus a control). The three blocks are repeats of greenhouse runs undertaken at different dates (called Times), with eight plants per treatment, each grown in separate pots, all completely randomized within the greenhouse. This results in 3 × 2 × 3 × 8 = 144 total plants. However, six plants are missing due to plant death, so the total number of plants on which measurements are made is 138.
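The plant count implied by this layout can be enumerated directly. The following is a minimal Python sketch, not part of the original analysis; the run labels and inoculum level names are hypothetical placeholders, while the cultivar numbers are those given in the Figure 1 caption.

```python
from itertools import product

times = ["Time1", "Time2", "Time3"]          # three greenhouse runs (hypothetical labels)
cultivars = ["3233", "3234"]                 # cultivar numbers from the Figure 1 caption
inocula = ["StrainA", "StrainB", "Control"]  # hypothetical level names
plants_per_cell = 8

# One record per planned plant: (Time, Cultivar, Inoculum, plant number).
planned = [
    (t, c, i, plant)
    for t, c, i in product(times, cultivars, inocula)
    for plant in range(1, plants_per_cell + 1)
]

n_cells = len({(t, c, i) for t, c, i, _ in planned})
n_planned = len(planned)
n_missing = 6                                # plants lost to death, as stated in the text
n_measured = n_planned - n_missing

print(n_cells, n_planned, n_measured)
```

The 18 Time-by-treatment cells are the units that matter for inference to new greenhouse runs; the 138 plants are subsamples within those cells.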

**Ex1.3. Data.** Nodules are counted on the roots of each plant after plants have been established and grown for 14 days (see supplemental file “nodules.csv” for the original data). Nodule numbers range from 0 to 78, with 4 of the plants having 0 nodules. We note that the counts are likely to have a higher variance for higher mean counts, a potential issue for the data analysis if we use the NID assumptions. Count data of this type are often considered to be Poisson-distributed [3,15].
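The link between mean and variance follows directly from the Poisson distribution. As a reminder, with $\mu$ denoting the mean count (notation assumed here, not taken from the original):

$$
P(Y = y) = \frac{e^{-\mu}\,\mu^{y}}{y!}, \quad y = 0, 1, 2, \ldots, \qquad E(Y) = \operatorname{Var}(Y) = \mu .
$$

A treatment that raises the mean count therefore also raises the variance, which conflicts with the constant-variance part of the NID assumptions.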

**Ex1.4. Statistical Analysis.** The first analysis for this experiment uses the nodule count data (Y) in a linear model assuming normal independent distribution (NID), an analysis familiar to many scientists. Models in these statistical analysis subsections are written in the same way for both JMP and R, except for interactions (in JMP the A-B interaction is written A*B and in R it is written A:B; in R, A*B means A + B + A:B). The model, written in R, is:
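Reading the terms off Table 1 (Time, Cultivar, Inoculum, and all of their interactions), the fixed-effects linear model can be sketched in symbols as follows. The subscripts and effect symbols are notation assumed here for illustration, not the authors' own formula:

$$
y_{ijkl} = \mu + T_{i} + C_{j} + I_{k} + (TC)_{ij} + (TI)_{ik} + (CI)_{jk} + (TCI)_{ijk} + e_{ijkl}, \qquad e_{ijkl} \sim \mathrm{NID}(0, \sigma^{2}),
$$

where $i$ indexes the three Times, $j$ the two cultivars, $k$ the three inoculum treatments, and $l$ the plants within a Time-by-treatment cell.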

**Ex1.5. Interpretation.** A difference in numbers of nodules per plant may be associated with higher nitrogen (N) fixation, and it is important to find out whether a difference in nodules exists. A 95% confidence interval (CI) for the cultivar mean nodule difference would be valuable scientific information. These results might influence the direction of an alfalfa breeding program.

**Example 1 Results of Nodules/Root (N/R) Experiment**

**Over-simplified analyses can give misleading first impressions.** A researcher had requested a simple independent-sample t-test for cultivar difference. An issue with this test is that its p-value for cultivar difference did not agree with the probability from a linear model analysis of variance. The anova table of the nodule counts, using Equation (2), is shown in Table 1. This table has a cultivar null hypothesis probability of 0.026, which differed from the independent-sample t-test probability of 0.19. The independent-sample t-test for cultivar differences was shown to be equivalent to a one-factor (cultivar) anova (see Figure 1), as discussed in the literature review. This example illustrates that the independent-sample t-test comparing cultivars inflates the error term because it does not take account of important sources of variation. These sources of variation include blocks (greenhouse runs) and inoculum treatments, which, when left in the residual, make the error term too high for the independent-sample t-test compared with the multi-factor anova model.

**The desired inference requires a proper error term**. The analysis of variance with fixed effects in Table 1 also has the problem that the test of cultivars uses a residual mean square based on 90 degrees of freedom (df), which is not the proper error term for inferring results to new greenhouse environments. Crawley ([11], pp. 173–182) discusses considerations in random vs. fixed effects, and Stroup et al. ([2], p. 10) discuss the scope of inference when making this choice in models. Plants are nested within each replication or greenhouse run (Time), and to infer results to future greenhouse conditions, we need to assume the greenhouse runs are a random sample from a population of such runs. The error needed for this broader inference should include the Time interaction with treatment combinations. This can be seen in the expected values of mean squares in the analysis of variance table, assuming a random Time-cultivar interaction. (Our example has the anomaly that the cultivar-block mean square is smaller than that of residual even though it is expected to be larger). The main issue is that the residual error term in Table 1 measures plant-to-plant variation, which has inflated degrees of freedom (df). Many researchers mistakenly use such error terms with degrees of freedom (df) higher than warranted by the experiment design, called pseudoreplication by Crawley ([11], pp. 176–182). With mixed model analysis designating the block-treatment interactions as random, some software packages such as JMP automatically perform correct tests of treatment differences for broader inference.
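The expected mean squares listed in Table 1 show which ratio tests cultivars for the broader inference. A short sketch of the reasoning, where $F_{C}$ and the $MS$ subscripts are labels introduced here:

$$
E(MS_{C}) = \sigma^{2} + 8\sigma^{2}_{TC} + \kappa^{2}_{C}, \qquad
E(MS_{TC}) = \sigma^{2} + 8\sigma^{2}_{TC}, \qquad
E(MS_{\mathrm{Res}}) = \sigma^{2} .
$$

Under the null hypothesis of no cultivar effect, $\kappa^{2}_{C} = 0$, so $E(MS_{C}) = E(MS_{TC})$ and the appropriate statistic is

$$
F_{C} = \frac{MS_{C}}{MS_{TC}}, \qquad \text{with } 1 \text{ and } 2 \text{ degrees of freedom}.
$$

Dividing by $MS_{\mathrm{Res}}$ instead matches the null expectation only when $\sigma^{2}_{TC} = 0$, which is why the residual with 90 df supports inference only to the three greenhouse runs actually used.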

**The data do not fit the usual anova assumptions.** A third issue is that the data are counts of nodules on the alfalfa roots, and count data do not usually follow a normal distribution. This leads first to the question of whether to transform the data to solve this issue. Later we also explore using generalized linear models (glm) for analysis. The square root (sqrt) transformation is recommended by Snedecor and Cochran to better approximate the NID anova assumptions ([3], pp. 287–289). Data distributions for the original data y and the recommended transformation sqrt(y) are given in Figure 2 below. JMP graphically depicts the data distributions, including the original nodule count data and the square root of nodule count. Tests of goodness of fit for normal distributions are below the histograms and normal quantile plots, and the square-root transformation distribution does not significantly differ from the normal distribution.
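The effect of the square-root transformation on Poisson counts can be seen in a small simulation. The sketch below uses Python's standard library rather than JMP or R, with simulated counts (not the nodule data) and arbitrarily chosen means; the helper `poisson_draw` is written here for illustration.

```python
import math
import random
from statistics import variance


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    threshold = math.exp(-lam)
    count, running_product = 0, rng.random()
    while running_product > threshold:
        count += 1
        running_product *= rng.random()
    return count


rng = random.Random(2022)  # fixed seed so the run is repeatable
summary = {}
for lam in (4, 16, 64):
    counts = [poisson_draw(rng, lam) for _ in range(5000)]
    roots = [math.sqrt(c) for c in counts]
    summary[lam] = (variance(counts), variance(roots))
    print(f"mean {lam:>2}: var(y) = {summary[lam][0]:6.2f}, "
          f"var(sqrt(y)) = {summary[lam][1]:.3f}")
```

The variance of the raw counts grows with the mean, while the variance of sqrt(y) stays close to 1/4 across the three means, which is the sense in which the transformation brings count data closer to the constant-variance assumption.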

**The best methods for analyzing non-normal data continue to evolve.** The generalized linear model in JMP for nodule counts included fixed Time (greenhouse runs), Cultivar, Rhizobium inoculants, and Cultivar-Rhizobium interaction. This generalized linear model option is found in the upper right corner of the Fit Model window, labeled “personality”. We use the Poisson distribution with log link, and the output is presented in Figure 4. The same analysis in R using the function glm is shown in Figure 5, with the same probability for a test of cultivar differences, 0.000004. These figures indicate very high statistical significance, i.e., a low probability of cultivars having the same average counts of nodules. However, this test uses an error term with inflated degrees of freedom, so the inference is restricted to the three fixed greenhouse runs (Times) of this experiment rather than extending to future greenhouse environments. A correct error term for the broader inference uses the random Time-cultivar interaction term, as specified in the expected values of mean squares in Table 1. With this term in the linear R model, we obtain a probability of 0.0259 for the test of cultivar differences, and in the glmm model in Equation (4) this probability is 0.0234 (see the R markdown file).
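Why a fixed-effects Poisson model can return such a small probability is easiest to see in a toy case. The sketch below is Python (standard library only), not the JMP or R analysis of the paper. It uses hypothetical counts and covers only the special case of a Poisson log-link model with a single two-level factor, where the maximum-likelihood estimates have a closed form: the log rate ratio is the log of the ratio of group means, and its Wald standard error is sqrt(1/total_a + 1/total_b).

```python
import math

# Hypothetical, deliberately variable nodule counts (illustrative only).
counts_a = [5, 30, 12, 45, 8, 22, 60, 3]
counts_b = [10, 55, 25, 70, 15, 40, 78, 20]


def poisson_two_group_wald(x, y):
    """Closed-form Poisson log-link fit for one two-level factor.

    Returns the log rate ratio, its Wald standard error, the z statistic,
    and the two-sided normal p-value.
    """
    log_ratio = math.log((sum(y) / len(y)) / (sum(x) / len(x)))
    se = math.sqrt(1 / sum(x) + 1 / sum(y))
    z = log_ratio / se
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return log_ratio, se, z, p_value


def pearson_dispersion(groups):
    """Pearson chi-square divided by residual df; near 1 for Poisson data."""
    chi_square = 0.0
    for g in groups:
        fitted = sum(g) / len(g)
        chi_square += sum((v - fitted) ** 2 / fitted for v in g)
    residual_df = sum(len(g) for g in groups) - len(groups)
    return chi_square / residual_df


log_ratio, se, z, p_value = poisson_two_group_wald(counts_a, counts_b)
dispersion = pearson_dispersion([counts_a, counts_b])
print(f"log rate ratio = {log_ratio:.3f}, SE = {se:.3f}, p = {p_value:.2e}")
print(f"Pearson dispersion = {dispersion:.1f}")
```

The standard error depends only on the count totals, so the test looks very precise even though the plants vary far more than a Poisson model allows, as the dispersion estimate well above 1 indicates. A random term, as in the glmm discussed above, is one way to account for that extra variation.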

**Summary of the results from different N/R example analyses.** Results of the analyses performed for our example are consolidated in Table 2, which shows the model type, number of data points (either 138 individual values or 18 averages for the nodulation data), advantages and disadvantages of each model method, and cultivar p-value to illustrate similarities or differences of the software and methods.

**Discussion for Example 1, the Nodules per Root (N/R) Experiment**

**Example 2: Phosphorus fertilizer application and winter-survival (P_WS).**

**Ex2.1. Objective**. The objective of this experiment is to test whether different phosphorus treatments will increase winter survival of alfalfa plants. The response variable is stand count or survival rate. The response of survival to the quantitative amount of phosphorus applied (P), i.e., P-application, is to be estimated for each of two cultivars. Tests of cultivar differences are not a major objective because we already know the cultivars differ.

**Ex2.2. Experiment design and conduct.** This is a field plot experiment with three reps; the design may be found in Wang et al. (2022) [16]. Each plot was thinned to 1000 plants/plot in the fall. Stand count was determined following the winter after phosphorus treatments had been applied. Phosphorus treatments applied the previous year were the whole-plot treatments of a split-plot in a randomized complete block design, with two alfalfa cultivars, one dormant and one semi-dormant, as subplots. In September, 0, 50, 100, and 150 kg ha^{−1} P_{2}O_{5} (P0, P1, P2, and P3) as calcium phosphate was applied and incorporated into the soil at 5 to 8 cm depth, with irrigation after soil mulching. No harvest was undertaken in the establishment year. Plants were kept well-watered, and weed and insect control was conducted as necessary.

**Ex2.3. Data.** These data are an example of a binomially distributed response variable (survival count of n = 1000 plants; see supplemental data “winterSurvival.csv” for detail). Binomial data are one example of categorical response data. The survival count for 1000 plants ranges from a low of 682 up to 973. Survival rates are correspondingly 0.682 to 0.973. Categorical explanatory factors are “Reps” (blocks of the Randomized Block Design), “Treatments” (the four phosphorus rates randomized on the four whole-plots), and “Cultivars” (two subplot cultivars). The four nominal phosphorus treatments have quantitative phosphorus application rates (P = 0, 50, 100, 150 kg/ha) for estimating the survival response to applied phosphorus. Alfalfa yield from the year before the winter survival and its three component cuttings are included in the data set.
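For a sense of scale, the binomial standard error of a survival rate from a single plot can be computed from the two extreme counts quoted above. This is a minimal Python sketch, not part of the original analysis; it assumes the 1000 plants in a plot survive independently, which is the assumption that plot-to-plot variation is likely to violate.

```python
import math

n_plants = 1000
survivors = {"lowest plot": 682, "highest plot": 973}  # extremes quoted in the text


def binomial_se(count, n):
    """Survival rate and its binomial standard error, sqrt(p(1 - p) / n)."""
    p = count / n
    return p, math.sqrt(p * (1 - p) / n)


results = {label: binomial_se(count, n_plants) for label, count in survivors.items()}
for label, (rate, se) in results.items():
    print(f"{label}: rate = {rate:.3f}, binomial SE = {se:.4f}")
```

The standard error shrinks as the rate approaches 1, so the variance depends on the mean, and both standard errors are small relative to the 0.682 to 0.973 spread among plots.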

^{2} for dispersion parameter D) to model the extra variance (p. 220). Agresti also illustrates generalized linear mixed models with random error structures to better capture this variance in Chapter 10 of the book [4].
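For reference, the negative binomial variance function that such a dispersion parameter enters can be written, in the notation of Agresti [4], as

$$
E(Y) = \mu, \qquad \operatorname{Var}(Y) = \mu + D\mu^{2},
$$

so that $D = 0$ recovers the Poisson case and larger $D$ allows more variance than the mean alone would imply.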

**Ex2.4. Statistical Analysis**

**Ex2.5. Interpretation.** The response of winter survival to applied P gives farmers a tool for achieving better and, potentially, longer-lasting alfalfa stands. With two cultivars that differ in their inherent survivability, we can see whether there is a cultivar-P rate interaction and, if not, give a more general recommendation. If the optimum P rate for survival differs between the cultivars, different optimum rates could be recommended.

**Example 2 Results of P Fertilizer Application and Winter-Survival (P_WS) Experiment**

…^{2}, where P is kg/ha applied P_{2}O_{5}. Standard errors for the coefficients of P and P^{2} are 0.569 and 0.0036, respectively.

… P_{2}O_{5}. The estimated standard error of the optimum P rate is 1.325, and a 95% confidence interval for the optimum is (92.0, 97.6).
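An optimum rate and its standard error of this kind are commonly obtained from a fitted quadratic by the delta method. The Python sketch below illustrates that calculation under stated assumptions: all coefficient values, standard errors, and the covariance are hypothetical, and nothing here reproduces the authors' fitted equation or their interval.

```python
import math

# Hypothetical fitted quadratic: response = b0 + b1 * P + b2 * P**2
b1, b2 = 1.9, -0.01                     # hypothetical coefficients of P and P^2
se_b1, se_b2 = 0.2, 0.001               # hypothetical standard errors
cov_b1_b2 = -0.95 * se_b1 * se_b2       # hypothetical covariance (correlation -0.95)


def quadratic_optimum(b1, b2, var_b1, var_b2, cov_b1_b2):
    """Vertex of the quadratic and its delta-method standard error.

    P_opt = -b1 / (2 * b2); the gradient of P_opt with respect to
    (b1, b2) is (-1 / (2 * b2), b1 / (2 * b2**2)).
    """
    p_opt = -b1 / (2 * b2)
    g1 = -1 / (2 * b2)
    g2 = b1 / (2 * b2 ** 2)
    var_opt = g1 ** 2 * var_b1 + g2 ** 2 * var_b2 + 2 * g1 * g2 * cov_b1_b2
    return p_opt, math.sqrt(var_opt)


p_opt, se_opt = quadratic_optimum(b1, b2, se_b1 ** 2, se_b2 ** 2, cov_b1_b2)
# Large-sample Wald interval; a t multiplier would widen it slightly.
ci_low, ci_high = p_opt - 1.96 * se_opt, p_opt + 1.96 * se_opt
print(f"optimum P = {p_opt:.1f} kg/ha, SE = {se_opt:.2f}, "
      f"95% CI = ({ci_low:.1f}, {ci_high:.1f})")
```

The covariance term matters: the estimates of the linear and quadratic coefficients are usually strongly negatively correlated, and ignoring that correlation would overstate the uncertainty of the optimum.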

…^{2}. This is undertaken for both cultivars together (the cultivars had similar response curves). The result is graphed in Figure 8 below.

**Summary of the results from different P_WS analyses.** Results of the analyses performed for our example are consolidated in Table 4, which shows the model type, number of data points (24 for the winter survival data), advantages and disadvantages of each model method, and cultivar p-values to illustrate similarities or differences of the methods.

**Discussion for Example 2, the P Fertilizer Application and Winter Survival (P_WS) Experiment**

## 4. Summary of Analysis Approaches and Models for Both Examples

## 5. Conclusions

## Supplementary Materials

## Author Contributions

## Funding

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## References

1. Sall, J.; Stephens, M.L.; Lehman, A.; Loring, S. JMP Start Statistics: A Guide to Statistics and Data Analysis Using JMP; SAS Institute: Cary, NC, USA, 2017.
2. Stroup, W.W.; Milliken, G.A.; Claassen, E.A.; Wolfinger, R.D. SAS for Mixed Models: Introduction and Basic Applications; SAS Institute: Cary, NC, USA, 2018.
3. Snedecor, G.W.; Cochran, W.G. Statistical Methods, 8th ed.; I.S.U. Press: Ames, IA, USA, 1989.
4. Agresti, A. An Introduction to Categorical Data Analysis, 3rd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2019.
5. Nelder, J.A.; Wedderburn, R.W. Generalized linear models. J. R. Stat. Soc. Ser. A (Gen.) **1972**, 135, 370–384.
6. Stroup, W.W. Rethinking the analysis of non-normal data in plant and soil science. Agron. J. **2015**, 107, 811–827.
7. Brooks, M.E.; Kristensen, K.; Van Benthem, K.J.; Magnusson, A.; Berg, C.W.; Nielsen, A.; Skaug, H.J.; Mächler, M.; Bolker, B.M. glmmTMB balances speed and flexibility among packages for zero-inflated generalized linear mixed modeling. R J. **2017**, 9, 378–400.
8. Bolker, B.; Skaug, H.; Magnusson, A.; Nielsen, A. Getting Started with the glmmTMB Package; R Foundation for Statistical Computing: Vienna, Austria, 2012.
9. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning with Applications in R, 1st ed.; Springer: New York, NY, USA, 2013.
10. Möhring, J.; Piepho, H.P. Comparison of weighting in two-stage analysis of plant breeding trials. Crop Sci. **2009**, 49, 1977–1988.
11. Crawley, M.J. Statistics: An Introduction Using R, 2nd ed.; John Wiley & Sons, Inc.: Chichester, UK, 2014.
12. Mangiafico, S.; Mangiafico, M.S. Package ‘rcompanion’. CRAN Repos **2017**, 20, 1–71.
13. Langsrud, Ø. ANOVA for unbalanced data: Use Type II instead of Type III sums of squares. Stat. Comput. **2003**, 13, 163–167.
14. Smith, D. CRAN Now Has 10,000 R Packages. Here’s How to Find the Ones You Need. Revolutions. Daily News about Using Open Source R for Big Data Analysis, Predictive Modeling, Data Science, and Visualization Since 2008. 2017. Available online: https://blog.revolutionanalytics.com/2017/01/cran-10000.html (accessed on 12 June 2021).
15. Steel, R.G.D.; Torrie, J.H. Principles and Procedures of Statistics: A Biometrical Approach, 3rd ed.; McGraw-Hill: New York, NY, USA, 1997.
16. Wang, Y.; Zhang, J.; Yu, L.; Xu, Z.; Samac, D.A. Overwintering and Yield Responses of Two Late-Summer Seeded Alfalfa Cultivars to Phosphate Supply. Agronomy **2022**, 12, 327.
17. Fang, L.; Loughin, T.M. Analyzing binomial data in a split-plot design: Classical approach or modern techniques? Commun. Stat.-Simul. Comput. **2013**, 42, 727–740.

**Figure 1.** Under-fitted model of the data. One-way ANOVA with cultivar (two levels) gives the same probability for the test of equal means as the independent-sample t-test. Cultivar_3233 and Cultivar_3234 stand for the nodule data from cultivars 3233 and 3234; noduleSD is for the data from both cultivars.

**Figure 3.** JMP output for anova using averages of original nodule counts y (**top**) and sqrt(y) (**bottom**).

**Figure 8.** Quadratic regression line between the fertilizer levels and stand counts. The blue dots are from cultivar number 1, and the red ones are from cultivar number 2. The letter P in the equation is the phosphorus rate in kg/ha.

**Table 1.** ANOVA table for nodule numbers, assuming a normal distribution, from R code for a multi-way ANOVA.

Terms * | Df | Sum Square | Mean Square | F Value | p Value | Exp (MS) |
---|---|---|---|---|---|---|
Time (T) | 2 | 102.07 | 51.03 | 45.51 | 2.20 × 10^{−14} | |
Cultivar (C) | 1 | 5.75 | 5.75 | 5.13 | 0.0259 | (σ^{2} + 8σ^{2}_{TC}) + κ^{2}_{C} |
Inoculum (I) | 2 | 45.17 | 22.59 | 20.141 | 5.90 × 10^{−8} | |
T:C | 2 | 0.76 | 0.38 | 0.341 | 0.7121 | σ^{2} + 8σ^{2}_{TC} |
T:I | 4 | 3.95 | 0.99 | 0.881 | 0.4785 | |
C:I | 2 | 0.48 | 0.24 | 0.212 | 0.8092 | |
T:I:C | 4 | 1.36 | 0.34 | 0.303 | 0.8751 | |
Residuals | 90 | 100.92 | 1.12 | | | σ^{2} |

**Table 2.** Consolidation of results for models from the nodulation data, including cultivar p-values, to show differences among models.

Model Name | No. DP | Pros | Cons | Cultivar p-Value | R/JMP |
---|---|---|---|---|---|
Fixed linear model | 138 | Easy; all means and CI available | Not correct error if infer to pop. of runs | 0.0237 | JMP |
Linear Mixed model | 138 | Random effects | Not best specific distribution (Poisson) | 0.0356 | JMP |
Avg(Y) linear model | 18 | Easy; all means and CI available | Not best approx. to normal | 0.1338 | JMP |
Avg(Sqrt(Y)) linear model | 18 | Best normal approximation | Need to back transform to obtain means | 0.0290 | JMP |
GLM with Poisson | 138 | Best specific distribution | No random term for broad inference | 0.000004 | JMP |
Group T test | 138 | For 1 factor | Not for >1 factor | 0.1447 | R |
One-factor ANOVA | 138 | 1 factor | Normal distribution | 0.1438 | R |
3-factor ANOVA | 138 | Can be >1 factors | Normal distribution | 0.0465 | R |
Linear Mixed Model (lme4) | 138 | Random effects | Not best specific distribution (Poisson) | 0.0287 | R |
GLM w/Poisson | 138 | Specific distribution | No random term for broad inference | 0.000004 | R |
GLMM w/Poisson | 138 | Specific distribution | With random | 0.0234 | R |

Contrast Type | Sum Square (SS) | Num DF | F Ratio | Prob > F |
---|---|---|---|---|
Linear Contrast | 8267 | 1 | 3.3382 | 0.0891 |
Quadratic Contrast | 10,584 | 1 | 4.2739 | 0.0577 |
Cubic Contrast | 411 | 1 | 0.1658 | 0.6900 |
Total Treatment | 19,262 | 3 | | |

**Table 4.** Consolidation of results for models from the winter survival data, including cultivar p-values, to show differences among models.

Model Name | No. DP | Pros | Cons | Cultivar p-Value |
---|---|---|---|---|
LM w/split plot | 24 | Familiar LM, good approximation, accounts for error structure | Normal distribution assumption needed | 0.0875 |
GLM w/binomial | 24 | Specific family distribution | No random error; using inflated df | 5.90 × 10^{−15} |
GLMM w/binomial | 24 | Specific family distribution, accounts for error structure | Hard to explain | 0.0311 |


© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Mowers, R.P.; Bucciarelli, B.; Cao, Y.; Samac, D.A.; Xu, Z.
Good Statistical Practices in Agronomy Using Categorical Data Analysis, with Alfalfa Examples Having Poisson and Binomial Underlying Distributions. *Crops* **2022**, *2*, 154-171.
https://doi.org/10.3390/crops2020012
