# Guidelines for Assessing Enological and Statistical Significance of Wine Tasters’ Binary Judgments

## Abstract


## 1. Introduction

## 2. The Role of Chance in Scientific Research

## 3. Criteria for Assessing Levels of Practical Significance of the Reliability of Wine Judgments

## 4. The Landis and Koch (1977), Fleiss (1981) and Cicchetti (1994) Enological Criteria

Although framed in terms of Cohen’s kappa (k), weighted kappa (k_{w}), and the intra-class correlation coefficient (ICC), the criteria apply regardless of the type of variable under investigation. First, Fleiss [15] demonstrated the mathematical equivalence between Cohen’s kappa statistic (k) for nominal binary variables and the intra-class correlation coefficient for variables deriving from interval scales; second, Fleiss and Cohen [16] demonstrated the mathematical equivalence between Cohen’s weighted kappa coefficient [3] and the ICC [6]. This prompted Fleiss and colleagues to describe these three statistics, correctly, as a family of mathematically inter-related coefficients. An analogy in the broader bio-statistical world is the often-cited mathematical equivalence between the standard correlation coefficient (r) for interval variables and the Phi coefficient for nominal dichotomous variables [17].
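The r–Phi equivalence is easy to verify numerically. The sketch below (an illustration, not code from the original paper) expands a hypothetical 2 × 2 table of binary judgments into paired 0/1 scores and confirms that the Pearson correlation computed on those scores equals the Phi coefficient computed directly from the cell counts:

```python
from math import sqrt

def phi_from_table(a, b, c, d):
    """Phi coefficient from 2x2 cell counts:
    a = (+,+), b = (+,-), c = (-,+), d = (-,-)."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

def pearson_r(x, y):
    """Plain Pearson correlation on paired scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sx = sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = sqrt(sum((yi - my) ** 2 for yi in y))
    return cov / (sx * sy)

# Hypothetical counts: expand the table into two tasters' 0/1 scores
a, b, c, d = 35, 15, 15, 35
x = [1] * a + [1] * b + [0] * c + [0] * d   # taster 1
y = [1] * a + [0] * b + [1] * c + [0] * d   # taster 2
# phi_from_table(a, b, c, d) and pearson_r(x, y) are identical
```

For binary data the two formulas are algebraically the same quantity, so the equality holds exactly, not just approximately.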

## 5. The Agreement or (A) Index and Its Mathematical Relationship to the ICC

This mathematical relationship makes it possible to convert any k, k_{w}, or ICC value into its Agreement (A) equivalent. The relevance this type of thinking has for enological research is explained in the next section of this report.

## 6. Revising the Criteria for the Enological Significance of Research Findings

| k, k_{w}, ICC | Agreement (A) | Strength of Agreement |
|---|---|---|
| <0.00 | <50% | Poor |
| 0.00–0.20 | 50–60% | Slight |
| 0.21–0.40 | 60.5–70% | Fair |
| 0.41–0.60 | 70.5–80% | Moderate |
| 0.61–0.80 | 80.5–90% | Substantial |
| 0.81–1.00 | 90.5–100% | Almost Perfect |

The Fleiss category defining Fair to Good levels of k, k_{w}, or ICC as 0.40 to 0.74 was changed to 0.40 to 0.79; and the last category was revised to define Excellent as ≥0.80 instead of ≥0.75.

## 7. Specific Category Agreement Levels

| Taster A | Taster B: Oaked (+) | Taster B: Unoaked (−) | Totals |
|---|---|---|---|
| Oaked (+) | 60 | 20 | 80 |
| Unoaked (−) | 0 | 20 | 20 |
| Totals | 60 | 40 | 100 |
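Working through the table above: observed agreement is (60 + 20)/100 = 80%; chance agreement, from the marginals, is (0.80 × 0.60) + (0.20 × 0.40) = 56%; and kappa is (0.80 − 0.56)/(1 − 0.56) ≈ 0.55. A minimal sketch of that computation (illustrative only):

```python
def cohens_kappa(a, b, c, d):
    """Cohen's kappa from a 2x2 table of two tasters' binary calls:
    a = both (+), b = A(+)/B(-), c = A(-)/B(+), d = both (-)."""
    n = a + b + c + d
    po = (a + d) / n                                        # observed agreement
    pc = ((a + b) * (a + c) + (c + d) * (b + d)) / n ** 2   # chance agreement
    return (po - pc) / (1 - pc)

cohens_kappa(60, 20, 0, 20)  # ~ 0.55, Moderate by the Landis and Koch criteria
```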

## 8. The Sensitivity-Specificity Model in an Enological Context

## 9. Criteria for Assessing Levels of Statistical Significance

**For OA = 70%:**

**For OA = 80%:**

**For OA = 90%:**

## 10. Overall Results: Correlations between the Reliability and Accuracy of Wine Tasters’ Hypothetical Binary Judgments

## 11. Hypothetical Results on a Case by Case Basis

- Overall Agreement = 23/27 = 85.2%. Chance Agreement = 13.2/27 = 48.9%.
- Kappa = (85.2 ‒ 48.9)/51.1 = 0.71; p = 0.001.
- From an enological viewpoint, these results are Substantial by the Landis and Koch criteria, acceptable by the Fleiss et al. criteria, and Good by the Cicchetti criteria.
- Se = 12/12 = 100% (Perfect)
- Sp = 11/13 = 85% (Good)
- PPA = 12/16 = 75% (Fair) and
- PNA = 11/11 = 100% (Perfect)
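These four accuracy indices can be sketched as follows, using the standard epidemiological definitions and treating one taster as the criterion. Note that the denominators here follow the textbook formulas; the worked case above uses the source data's own denominators, which differ slightly for Sp and PPA.

```python
def binary_accuracy_indices(a, b, c, d):
    """Sensitivity (Se), specificity (Sp), and positive/negative
    agreement (PPA, PNA) from 2x2 counts, treating the column
    taster as the criterion:
    a = (+,+), b = (+,-), c = (-,+), d = (-,-)."""
    return {
        "Se":  100 * a / (a + c),   # criterion-positives detected
        "Sp":  100 * d / (b + d),   # criterion-negatives detected
        "PPA": 100 * a / (a + b),   # row-taster (+) calls confirmed
        "PNA": 100 * d / (c + d),   # row-taster (-) calls confirmed
    }
```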

## 12. Summary and Conclusions

## Conflicts of Interest

## References

1. Cicchetti, D.V. Opinions versus facts: A bio-statistical paradigm shift in oenological research. *Proc. J. Wine Res.* **2017**, 1, 1–8.
2. Cohen, J. A coefficient of agreement for nominal scales. *Educ. Psychol. Meas.* **1960**, 20, 37–46.
3. Cohen, J. Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit. *Psychol. Bull.* **1968**, 70, 213–220.
4. Bartko, J.J. The intraclass correlation coefficient as a measure of reliability. *Psychol. Rep.* **1966**, 19, 3–11.
5. Bartko, J.J. Corrective note to “The intraclass correlation coefficient as a measure of reliability”. *Psychol. Rep.* **1974**, 34, 1–11.
6. Shrout, P.E.; Fleiss, J.L. Intraclass correlations: Uses in assessing rater reliability. *Psychol. Bull.* **1979**, 86, 420–428.
7. Cicchetti, D.V.; Klin, A.; Volkmar, F.R. Assessing binary diagnoses of bio-behavioral disorders: The clinical relevance of Cohen’s kappa. *J. Nerv. Ment. Dis.* **2017**, 205, 58–65.
8. Landis, J.R.; Koch, G.G. The measurement of observer agreement for categorical data. *Biometrics* **1977**, 33, 159–174.
9. Fleiss, J.L. *Statistical Methods for Rates and Proportions*, 2nd ed.; Wiley: New York, NY, USA, 1981.
10. Fleiss, J.L.; Levin, B.; Paik, M.C. *Statistical Methods for Rates and Proportions*, 3rd ed.; Wiley: New York, NY, USA, 2003.
11. Cicchetti, D.V.; Sparrow, S.S. Developing criteria for establishing interrater reliability of specific items: Applications to assessment of adaptive behavior. *Am. J. Ment. Defic.* **1981**, 86, 127–137.
12. Cicchetti, D.V. Guidelines, criteria, and rules of thumb for evaluating normed and standardized assessment instruments in psychology. *Psychol. Assess.* **1994**, 6, 284–290.
13. Cohen, J. *Statistical Power Analysis for the Behavioral Sciences*, 2nd ed.; Erlbaum: Hillsdale, NJ, USA, 1988.
14. Cicchetti, D.V.; Volkmar, F.R.; Klin, A.; Showalter, D. Diagnosing autism using ICD-10 criteria: A comparison of neural networks and standard multivariate procedures. *Child Neuropsychol.* **1995**, 1, 26–37.
15. Fleiss, J.L. Measuring agreement between two judges on the presence or absence of a trait. *Biometrics* **1975**, 31, 651–659.
16. Fleiss, J.L.; Cohen, J. The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. *Educ. Psychol. Meas.* **1973**, 33, 613–619.
17. Kaltenhauser, J.; Lee, Y. Correlation coefficients for binary data. *Geogr. Anal.* **1976**, 8, 305–313.
18. Robinson, W. The statistical measurement of agreement. *Am. Sociol. Rev.* **1957**, 22, 17–25.
19. Borenstein, M. The shift from significance testing to effect size estimation. *Res. Methods Compr. Clin. Psychol.* **1998**, 3, 319–349.
20. Fleiss, J.L.; Cohen, J.; Everitt, B.S. Large sample standard errors of kappa and weighted kappa. *Psychol. Bull.* **1969**, 72, 323–327.
21. Cohen, J.; Cohen, P.; West, S.G.; Aiken, L.S. *Applied Multiple Regression/Correlation Analysis for the Behavioral Sciences*, 3rd ed.; Lawrence Erlbaum: Mahwah, NJ, USA, 2003.
22. Cicchetti, D.V.; Cicchetti, A.F. As wine experts disagree, consumers’ taste buds flourish: How two experts rate the 2004 Bordeaux vintage. *J. Wine Res.* **2013**, 24, 311–317.
23. Cicchetti, D.V.; Cicchetti, A.F. Two enological titans rate the 2009 Bordeaux wines. *Wine Econ. Policy* **2014**, 3, 28–36.

**Table 1.** (**A**) The Landis and Koch (1977) Criteria for Assessing Enological Significance; (**B**) The Fleiss (1981) Criteria for Assessing Enological Significance; (**C**) The Cicchetti and Sparrow (1981) Criteria for Assessing Enological Significance.

**(A)**

| k, k_{w} or ICC | Strength of Agreement |
|---|---|
| <0.00 | Poor |
| 0.00–0.20 | Slight |
| 0.21–0.40 | Fair |
| 0.41–0.60 | Moderate |
| 0.61–0.80 | Substantial |
| 0.81–1.00 | Almost Perfect |

**(B)**

| k, k_{w} or ICC | Clinical Significance |
|---|---|
| <0.40 | Poor |
| 0.40–0.74 | Fair to Good |
| ≥0.75 | Excellent |

**(C)**

| k, k_{w} or ICC | Clinical Significance |
|---|---|
| <0.40 | Poor |
| 0.40–0.59 | Fair |
| 0.60–0.74 | Good |
| ≥0.75 | Excellent |

**Table 2.** The Relationship between ICC Values and Percent Agreement (A) ^{1}.

| ICC Value | Percent Agreement (A) |
|---|---|
| 0.00 (P) | 50 (P) |
| 0.05 (P) | 52.5 (P) |
| 0.10 (P) | 55 (P) |
| 0.15 (P) | 57.5 (P) |
| 0.20 (P) | 60 (P) |
| 0.25 (P) | 62.5 (P) |
| 0.30 (P) | 65 (P) |
| 0.35 (P) | 67.5 (P) |
| 0.40 (F) | 70 (F) |
| 0.45 (F) | 72.5 (F) |
| 0.50 (F) | 75 (F) |
| 0.55 (F) | 77.5 (F) |
| 0.60 (G) | 80 (G) |
| 0.65 (G) | 82.5 (G) |
| 0.70 (G) | 85 (G) |
| 0.75 (G) | 87.5 (G) |
| 0.80 (E) | 90 (E) |
| 0.85 (E) | 92.5 (E) |
| 0.90 (E) | 95 (E) |
| 0.95 (E) | 97.5 (E) |
| 1.00 (E) | 100 (E) |

^{1} Because of the mathematical equivalencies among the ICC, kappa, and weighted kappa, this relationship holds for each of these three statistics when assessing wine tasters’ binary judgments, as well as inter-rater agreement levels more generally. See text for more details. The letters P, F, G and E refer, in this context, to Poor, Fair, Good and Excellent wine quality, respectively, as defined by the Robert Parker and similar wine rating scales.
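The table expresses a simple linear mapping: percent agreement A = 50 + 50 × ICC (and likewise for k and k_{w}, given their mathematical equivalence). A one-line sketch:

```python
def to_percent_agreement(coef):
    """Convert a k, k_w, or ICC value into its percent-agreement (A)
    equivalent, per the linear relationship in the table above."""
    return 50 + 50 * coef

to_percent_agreement(0.70)  # -> 85.0
```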

**Table 3.** (**A**) Revised Landis and Koch Criteria [8] for Assessing Enological Significance; (**B**) Revised Fleiss, Levin and Paik (2003) Criteria for Assessing Enological Significance; (**C**) Revised Cicchetti (1994) Criteria for Assessing Enological Significance.

**(A)**

| k, k_{w} or ICC Value | Percent Agreement (A) | Strength of Agreement |
|---|---|---|
| <0.00 | <50 | Poor |
| 0.00–0.19 | 50–59.5 | Slight |
| 0.20–0.39 | 60–69.5 | Fair |
| 0.40–0.59 | 70–79.5 | Moderate |
| 0.60–0.79 | 80–89.5 | Substantial |
| ≥0.80 | ≥90 | Almost Perfect |

**(B)**

| k, k_{w} or ICC Value | Percent Agreement (A) | Clinical Significance |
|---|---|---|
| <0.40 | <70 | Poor |
| 0.40–0.79 | 70–89.5 | Fair to Good |
| ≥0.80 | ≥90 | Excellent |

**(C)**

| k, k_{w} or ICC Value | Percent Agreement (A) | Clinical Significance |
|---|---|---|
| <0.40 | <70 | Poor |
| 0.40–0.59 | 70–79.5 | Fair |
| 0.60–0.79 | 80–89.5 | Good |
| ≥0.80 | ≥90 | Excellent |
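The revised Cicchetti criteria in Table 3C, which the footnotes of Tables 4–6 use to classify kappa values, amount to a simple threshold function. A sketch (illustrative only):

```python
def cicchetti_label(kappa):
    """Classify a k, k_w, or ICC value by the revised Cicchetti
    criteria (Table 3C): <0.40 Poor, 0.40-0.59 Fair,
    0.60-0.79 Good, >=0.80 Excellent."""
    if kappa < 0.40:
        return "Poor"
    if kappa < 0.60:
        return "Fair"
    if kappa < 0.80:
        return "Good"
    return "Excellent"

cicchetti_label(0.71)  # -> "Good", as in the worked case in Section 11
```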

**Table 4.** Relationship between the Reliability and Accuracy of Pairs of Hypothetical Tasters Judging Whether a Wine is Oaked (+) or Unoaked (−) When the Tasters Are in 70% Agreement.

| Case | (+ +) | (− −) | (+ −) | (− +) | PC | Kappa ^{1} | PO^{+} | PO^{−} | PO^{+}/PO^{−} Agreement | p Value ^{3} |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 35 | 35 | 15 | 15 | 50 | 0.40 (F) | 70 (F) | 70 (F) | 100 | <0.0005 |
| 2 | 40 | 30 | 15 | 15 | 50.5 | 0.39 (P) | 67 (P) | 73 (F) | 94 | 0.002 |
| 3 | 45 | 25 | 15 | 15 | 52 | 0.375 (P) | 62.5 (P) | 75 (F) | 87.5 | 0.004 |
| 4 | 50 | 20 | 15 | 15 | 54.5 | 0.34 (P) | 57 (P) | 77 (F) | 80 | 0.01 |
| 5 | 55 | 15 | 15 | 15 | 58 | 0.29 (P) | 50 (P) | 79 (F) | 71 | NS ^{2} |
| 6 | 60 | 10 | 15 | 15 | 62.5 | 0.20 (P) | 40 (P) | 80 (G) | 60 | NS |
| 7 | 65 | 5 | 15 | 15 | 68 | 0.06 (P) | 25 (P) | 81 (G) | 44 | NS |
| 8 | 70 | 0 | 15 | 15 | 74.5 | −0.18 (P) | 0 (P) | 82 (G) | 18 | NS |

^{1} Kappa values are classified as Poor (P), Fair (F), Good (G) or Excellent (E) by the revised Cicchetti criteria in Table 3C; ^{2} NS = not statistically significant at p ≤ 0.05; ^{3} Statistical significance is found by dividing kappa by its standard error, as derived by Fleiss, Cohen and Everitt [20]. Values of Z are interpreted in the standard manner, whereby ±1.96 corresponds to p = 0.05; ±2.58 to p = 0.01; ±3 to p = 0.003; ±4 to p = 0.0005; and ±5 to p = 0.0001 [20,21].

**Table 5.** Relationship between the Reliability and Accuracy of Pairs of Hypothetical Tasters Judging Whether a Wine is Oaked (+) or Unoaked (−) When the Tasters Are in 80% Agreement.

| Case | (+ +) | (− −) | (+ −) | (− +) | PC | Kappa ^{1} | PO^{+} | PO^{−} | PO^{+}/PO^{−} Agreement | p Value ^{3} |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 40 | 40 | 10 | 10 | 50 | 0.60 (G) | 80 (G) | 80 (G) | 100 | <0.0005 |
| 2 | 45 | 35 | 10 | 10 | 50.5 | 0.60 (G) | 82 (G) | 78 (F) | 96 | <0.0005 |
| 3 | 50 | 30 | 10 | 10 | 52 | 0.58 (F) | 83 (G) | 75 (F) | 92 | 0.001 |
| 4 | 55 | 25 | 10 | 10 | 55 | 0.56 (F) | 85 (G) | 71 (F) | 86 | <0.005 |
| 5 | 60 | 20 | 10 | 10 | 58 | 0.52 (F) | 86 (G) | 67 (P) | 81 | <0.005 |
| 6 | 65 | 15 | 10 | 10 | 63 | 0.47 (F) | 87 (G) | 60 (P) | 73 | <0.005 |
| 7 | 70 | 10 | 10 | 10 | 68 | 0.38 (P) | 88 (G) | 50 (P) | 62 | 0.01 |
| 8 | 75 | 5 | 10 | 10 | 74.5 | 0.22 (P) | 88 (G) | 33 (P) | 45 | NS ^{2} |
| 9 | 80 | 0 | 10 | 10 | 82 | −0.11 (P) | 89 (G) | 0 (P) | 11 | NS |

^{1} Kappa values are classified as Poor (P), Fair (F), Good (G) or Excellent (E) by the revised Cicchetti criteria in Table 3C; ^{2} NS = not statistically significant at p ≤ 0.05; ^{3} Statistical significance is found by dividing kappa by its standard error, as derived by Fleiss, Cohen and Everitt [20]. Values of Z are interpreted in the standard manner, whereby ±1.96 corresponds to p = 0.05; ±2.58 to p = 0.01; ±3 to p = 0.003; ±4 to p = 0.0005; and ±5 to p = 0.0001 [20,21].

**Table 6.** Relationship between the Reliability and Accuracy of Pairs of Hypothetical Tasters Judging Whether a Wine is Filtered (+) or Not Filtered (−) When the Tasters Are in 90% Agreement.

| Case | (+ +) | (− −) | (+ −) | (− +) | PC | Kappa ^{1} | PO^{+} | PO^{−} | PO^{+}/PO^{−} Agreement | p Value ^{3} |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 45 | 45 | 5 | 5 | 50 | 0.80 (E) | 90 | 90 | 100 | <0.0005 |
| 2 | 50 | 40 | 5 | 5 | 51 | 0.80 (E) | 89 | 91 | 98 | <0.0005 |
| 3 | 55 | 35 | 5 | 5 | 52 | 0.79 (G) | 88 | 92 | 96 | <0.0005 |
| 4 | 60 | 30 | 5 | 5 | 55 | 0.78 (G) | 86 | 92 | 94 | <0.0005 |
| 5 | 65 | 25 | 5 | 5 | 58 | 0.76 (G) | 83 | 93 | 90 | <0.0005 |
| 6 | 70 | 20 | 5 | 5 | 63 | 0.73 (G) | 80 | 93 | 87 | <0.0005 |
| 7 | 75 | 15 | 5 | 5 | 68 | 0.69 (G) | 75 | 94 | 81 | <0.0005 |
| 8 | 80 | 10 | 5 | 5 | 75 | 0.61 (G) | 67 | 94 | 73 | <0.0005 |
| 9 | 85 | 5 | 5 | 5 | 82 | 0.44 (F) | 50 | 94 | 56 | 0.001 |
| 10 | 90 | 0 | 5 | 5 | 90.5 | −0.05 (P) | 0 | 95 | 5 | NS ^{2} |

^{1} Kappa values are classified as Poor (P), Fair (F), Good (G) or Excellent (E) by the revised Cicchetti criteria in Table 3C; ^{2} NS = not statistically significant at p ≤ 0.05; ^{3} Statistical significance is found by dividing kappa by its standard error, as derived by Fleiss, Cohen and Everitt [20]. Values of Z are interpreted in the standard manner, whereby ±1.96 corresponds to p = 0.05; ±2.58 to p = 0.01; ±3 to p = 0.003; ±4 to p = 0.0005; and ±5 to p = 0.0001 [20,21].

**Table 7.** Illustrating the Relationship between the Reliability and Accuracy of Wine Tasters’ Hypothetical Binary Judgments of Whether a Wine is Oaked (+) or Not Oaked (−), Expressed in Percentages.

| Taster 1 | Taster 2: (+) | Taster 2: (−) | Totals |
|---|---|---|---|
| (+) | 12 | 4 | 16 |
| (−) | 0 | 11 | 11 |
| Totals | 12 | 15 | 27 |

© 2017 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Cicchetti, D. Guidelines for Assessing Enological and Statistical Significance of Wine Tasters’ Binary Judgments. *Beverages* **2017**, *3*, 53.
https://doi.org/10.3390/beverages3040053
