# Logistic Regression Model for a Bivariate Binomial Distribution with Applications in Baseball Data Analysis


## Abstract


## 1. Introduction

## 2. Models and Notations


- Model 0: Model with no random effect. We consider the following two logit link functions to model the two inter-related success probabilities $p_{\ell t}$ and $q_{\ell t}$ with the covariates: $$\begin{array}{rcl}\mathrm{logit}(p_{\ell t})&=&\beta_{p0}^{(0)}+\beta_{p1}^{(0)}z_{\ell t1}+\dots+\beta_{pK}^{(0)}z_{\ell tK}=\mathbf{z}_{\ell t}\boldsymbol{\beta}_{p}^{(0)\prime},\\ \mathrm{logit}(q_{\ell t})&=&\beta_{q0}^{(0)}+\beta_{q1}^{(0)}z_{\ell t1}+\dots+\beta_{qK}^{(0)}z_{\ell tK}=\mathbf{z}_{\ell t}\boldsymbol{\beta}_{q}^{(0)\prime}.\end{array}$$
- Model 1: Model with joint random effect. We consider the following two logit link functions to model the two inter-related success probabilities $p_{\ell t}$ and $q_{\ell t}$ with the covariates: $$\begin{array}{rcl}\mathrm{logit}(p_{\ell t})&=&\beta_{p0}^{(1)}+\beta_{p1}^{(1)}z_{\ell t1}+\dots+\beta_{pK}^{(1)}z_{\ell tK}+a_{\ell}^{(1)}=\mathbf{z}_{\ell t}\boldsymbol{\beta}_{p}^{(1)\prime}+a_{\ell}^{(1)},\\ \mathrm{logit}(q_{\ell t})&=&\beta_{q0}^{(1)}+\beta_{q1}^{(1)}z_{\ell t1}+\dots+\beta_{qK}^{(1)}z_{\ell tK}+\beta^{*(1)}a_{\ell}^{(1)}\\ &=&\mathbf{z}_{\ell t}\boldsymbol{\beta}_{q}^{(1)\prime}+\beta^{*(1)}a_{\ell}^{(1)}.\end{array}$$
- Model 2: Model with joint random effect and unobserved heterogeneity. We now extend Model 1 by incorporating an additional random effect term $\kappa_{\ell}$ in $\mathrm{logit}(q_{\ell t})$. The extended model, denoted by Model 2, is given by $$\begin{array}{rcl}\mathrm{logit}(p_{\ell t})&=&\beta_{p0}^{(2)}+\beta_{p1}^{(2)}z_{\ell t1}+\dots+\beta_{pK}^{(2)}z_{\ell tK}+a_{\ell}^{(2)}=\mathbf{z}_{\ell t}\boldsymbol{\beta}_{p}^{(2)\prime}+a_{\ell}^{(2)},\\ \mathrm{logit}(q_{\ell t})&=&\beta_{q0}^{(2)}+\beta_{q1}^{(2)}z_{\ell t1}+\dots+\beta_{qK}^{(2)}z_{\ell tK}+\beta^{*(2)}a_{\ell}^{(2)}+\kappa_{\ell}\\ &=&\mathbf{z}_{\ell t}\boldsymbol{\beta}_{q}^{(2)\prime}+\beta^{*(2)}a_{\ell}^{(2)}+\kappa_{\ell}.\end{array}$$
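
The three link specifications above differ only in which random-effect terms enter the linear predictors. The following is a minimal sketch of how the two success probabilities could be evaluated under each model; the function names and all numerical values are hypothetical and chosen purely for illustration, not taken from the fitted models.

```python
import numpy as np

def inv_logit(x):
    """Map a linear predictor to a probability."""
    return 1.0 / (1.0 + np.exp(-x))

def success_probs(z, beta_p, beta_q, a=0.0, beta_star=0.0, kappa=0.0):
    """Return (p, q) for one player at one time point.

    z is the covariate vector with a leading 1 for the intercept.
    Model 0: a = 0 and kappa = 0.
    Model 1: shared random effect a, scaled by beta_star in logit(q).
    Model 2: Model 1 plus the extra random effect kappa in logit(q).
    """
    eta_p = z @ beta_p + a
    eta_q = z @ beta_q + beta_star * a + kappa
    return inv_logit(eta_p), inv_logit(eta_q)

# Hypothetical covariates and coefficients (illustration only)
z = np.array([1.0, 0.5, -0.2])
beta_p = np.array([-1.0, 0.1, 0.3])
beta_q = np.array([-0.4, -0.1, 0.5])

p0, q0 = success_probs(z, beta_p, beta_q)                                    # Model 0
p1, q1 = success_probs(z, beta_p, beta_q, a=0.2, beta_star=0.4)              # Model 1
p2, q2 = success_probs(z, beta_p, beta_q, a=0.2, beta_star=0.4, kappa=-0.1)  # Model 2
```

In this sketch, a positive $a_{\ell}$ raises $p_{\ell t}$ directly and moves $q_{\ell t}$ in the direction given by the sign of $\beta^{*}$, while $\kappa_{\ell}$ shifts only $q_{\ell t}$.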

## 3. Bayesian Inference

### 3.1. Prior and Posterior Distributions

### 3.2. Markov Chain Monte Carlo (MCMC) Procedures

- **Step 1.** Given the current estimate $\theta^{(h)}$ in the $h$-th iteration, generate a candidate $\theta^{*}$ from a standard normal distribution with density $\pi$: $$\theta^{*}\sim N(0,1),\qquad \pi(\theta^{*})=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{(\theta^{*})^{2}}{2}\right).$$
- **Step 2.** Compute the ratio $\alpha$, formed from the full conditional density $p$ and the prior density $\pi$, each evaluated at the current and candidate values: $$\alpha(\theta^{(h)},\theta^{*})=\frac{p(\theta^{*})\,\pi(\theta^{(h)})}{p(\theta^{(h)})\,\pi(\theta^{*})}.$$
- **Step 3.** Draw $u\sim \mathrm{Uniform}(0,1)$. If $u\le \alpha(\theta^{(h)},\theta^{*})$, set $\theta^{(h+1)}=\theta^{*}$; otherwise, set $\theta^{(h+1)}=\theta^{(h)}$.
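
The three steps above describe an independence-type Metropolis-Hastings update in which candidates are drawn from $N(0,1)$ regardless of the current state. The sketch below implements those steps for a single scalar parameter, working on the log scale for numerical stability. The target density `log_target` is a hypothetical stand-in for the full conditional $p$, which depends on the model and data and is not reproduced here.

```python
import numpy as np

def log_std_normal(theta):
    """Log density of N(0, 1), the density pi used in Step 1."""
    return -0.5 * theta**2 - 0.5 * np.log(2.0 * np.pi)

def independence_mh(log_p, n_iter, theta0=0.0, seed=0):
    """Metropolis-Hastings with an N(0, 1) candidate distribution.

    log_p: log of the (possibly unnormalized) full conditional density p.
    Returns the array of sampled values theta^(1), ..., theta^(n_iter).
    """
    rng = np.random.default_rng(seed)
    draws = np.empty(n_iter)
    theta = theta0
    for h in range(n_iter):
        # Step 1: candidate from the standard normal distribution
        theta_star = rng.standard_normal()
        # Step 2: log of the ratio alpha
        log_alpha = (log_p(theta_star) + log_std_normal(theta)
                     - log_p(theta) - log_std_normal(theta_star))
        # Step 3: accept the candidate if u <= alpha
        if np.log(rng.uniform()) <= log_alpha:
            theta = theta_star
        draws[h] = theta
    return draws

# Hypothetical target for illustration: unnormalized N(0.5, 0.8^2)
def log_target(theta):
    return -0.5 * ((theta - 0.5) / 0.8) ** 2

draws = independence_mh(log_target, n_iter=20000)
posterior_mean = draws[2000:].mean()  # discard the first 2000 draws as burn-in
```

Because the candidate distribution is fixed, this kind of sampler tends to mix well only when the target is not much more dispersed than $N(0,1)$; the burn-in length of 2000 is an arbitrary choice for the example.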

## 4. Practical Data Analysis

- Win Probability Added (WPA): The change in a team's win probability from one plate appearance to the next;
- Center percentage (Cent%): The percentage of balls in play that batters hit to center field;
- Pull percentage (Pull%): The percentage of balls in play that hitters pulled;
- Opposite percentage (Oppo%): The percentage of balls in play that batters hit to the opposite field;
- Batting Average on Balls in Play (BABIP): A statistic indicating how often a ball in play goes for a hit;
- Walk to strikeout ratio (BB/K): The number of walks per strikeout. The higher the ratio, the better the performance;
- Home run to fly ball ratio (HR/FB): The number of home runs hit against a pitcher for every fly ball the pitcher allows;
- Line drive percentage (LD%): The percentage of balls hit into the field of play that are line drives;
- Ground ball percentage (GB%): The percentage of batted balls hit as ground balls against a pitcher;
- Fly ball percentage (FB%): The percentage of balls hit into the field of play that are fly balls.
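
To make the ratio-type covariates above concrete, the following sketch computes them from raw counts for a single batter. The dictionary keys and the counts are hypothetical, and the BABIP formula used, $(H-HR)/(AB-K-HR+SF)$, is the conventional one and is assumed here rather than taken from the text. The sketch also takes the batter's perspective for HR/FB and GB%.

```python
def batting_rates(c):
    """Compute rate statistics from a dict of raw counts (hypothetical keys)."""
    batted = c["LD"] + c["GB"] + c["FB"]             # batted balls by type
    directed = c["pull"] + c["center"] + c["oppo"]   # batted balls by direction
    return {
        "BABIP": (c["H"] - c["HR"]) / (c["AB"] - c["K"] - c["HR"] + c["SF"]),
        "BB/K": c["BB"] / c["K"],
        "HR/FB": c["HR"] / c["FB"],
        "LD%": c["LD"] / batted,
        "GB%": c["GB"] / batted,
        "FB%": c["FB"] / batted,
        "Pull%": c["pull"] / directed,
        "Cent%": c["center"] / directed,
        "Oppo%": c["oppo"] / directed,
    }

# Hypothetical season totals for one batter (illustration only)
counts = {
    "H": 150, "HR": 20, "AB": 550, "K": 100, "SF": 5, "BB": 60,
    "LD": 90, "GB": 180, "FB": 160,
    "pull": 170, "center": 150, "oppo": 110,
}
rates = batting_rates(counts)
```

By construction, LD%, GB%, and FB% sum to one, as do Pull%, Cent%, and Oppo%, so including every member of either group alongside an intercept would make the covariates linearly dependent.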

- Variables influencing both p and q: WPA, Cent%;
- Variables influencing p only: BABIP, BB/K, LD%, GB%, Oppo%;
- Variables influencing q only: FB%, HR/FB, Pull%.
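
The grouping above means that the two link functions use different subsets of the ten covariates. A minimal sketch of how the corresponding design matrices could be assembled is given below; the column order within each group, the helper `design_matrix`, and the randomly generated records are all assumptions made for illustration.

```python
import numpy as np

# Covariates entering each link function, following the grouping above
P_VARS = ["WPA", "Cent%", "BABIP", "BB/K", "LD%", "GB%", "Oppo%"]
Q_VARS = ["WPA", "Cent%", "FB%", "HR/FB", "Pull%"]
COVARIATES = P_VARS + Q_VARS[2:]  # all ten covariates, shared ones listed once

def design_matrix(data, names, intercept=True):
    """Stack the named columns (plus an optional intercept) into a matrix."""
    cols = [np.asarray(data[name], dtype=float) for name in names]
    if intercept:
        cols.insert(0, np.ones_like(cols[0]))
    return np.column_stack(cols)

# Hypothetical records for three player-periods (illustration only)
rng = np.random.default_rng(1)
data = {name: rng.uniform(0.0, 1.0, size=3) for name in COVARIATES}

Z_p = design_matrix(data, P_VARS)  # covariates for logit(p)
Z_q = design_matrix(data, Q_VARS)  # covariates for logit(q)
```

Keeping the two variable lists separate makes it straightforward to drop the intercept column for the random-effect models by passing `intercept=False`.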

## 5. Monte Carlo Simulation Studies

## 6. Concluding Remarks

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest


**Figure 2.** The posterior means for $a_{\ell}^{(2)}$ under Model 2 in conjunction with success probability $p$ (**left panel**), and the posterior means for two different random effects based on the player labels according to $a_{\ell}^{(2)}$ (**right panel**).

| Parameter | MLE | Posterior Mean | 95% HPD Interval | p-Value |
|---|---|---|---|---|
| $\beta_{p0}^{(0)}$ | $-1.0362$ | $-1.0376$ | $(-1.0618, -1.0135)$ | $<0.0001$ |
| $\beta_{p1}^{(0)}$ | $0.1334$ | $0.1330$ | $(0.0847, 0.1814)$ | $<0.0001$ |
| $\beta_{p2}^{(0)}$ | $0.0609$ | $0.0663$ | $(-0.4381, 0.5708)$ | $0.4032$ |
| $\beta_{p3}^{(0)}$ | $1.6197$ | $1.6186$ | $(1.1803, 2.0569)$ | $<0.0001$ |
| $\beta_{p4}^{(0)}$ | $0.1327$ | $0.1339$ | $(0.0436, 0.2242)$ | $0.0024$ |
| $\beta_{p5}^{(0)}$ | $0.2536$ | $0.2796$ | $(-0.3921, 0.9514)$ | $0.2247$ |
| $\beta_{p6}^{(0)}$ | $-0.0994$ | $-0.0826$ | $(-0.5843, 0.4190)$ | $0.3465$ |
| $\beta_{p7}^{(0)}$ | $-0.2794$ | $0.2993$ | $(-0.2889, 0.8878)$ | $0.1684$ |
| $\beta_{q0}^{(0)}$ | $-0.4136$ | $-0.4145$ | $(-0.4591, -0.3701)$ | $<0.0001$ |
| $\beta_{q1}^{(0)}$ | $-0.0729$ | $-0.0713$ | $(-0.1575, 0.0147)$ | $0.0609$ |
| $\beta_{q2}^{(0)}$ | $-0.5332$ | $-0.5551$ | $(-1.5886, 0.4782)$ | $0.1564$ |
| $\beta_{q8}^{(0)}$ | $3.1443$ | $3.1411$ | $(2.3109, 3.9712)$ | $<0.0001$ |
| $\beta_{q9}^{(0)}$ | $2.7858$ | $2.7795$ | $(2.1279, 3.4311)$ | $<0.0001$ |
| $\beta_{q10}^{(0)}$ | $0.2409$ | $0.1981$ | $(-0.8541, 1.2504)$ | $0.3229$ |

| Parameter | Posterior Mean | SD | 95% HPD Interval |
|---|---|---|---|
| $\beta_{p1}^{(1)}$ | $0.1331$ | $0.0247$ | $(0.0847, 0.1815)$ |
| $\beta_{p2}^{(1)}$ | $0.0594$ | $0.2452$ | $(-0.4212, 0.54)$ |
| $\beta_{p3}^{(1)}$ | $1.5778$ | $0.2335$ | $(1.1202, 2.0355)$ |
| $\beta_{p4}^{(1)}$ | $0.1320$ | $0.0467$ | $(0.0404, 0.2236)$ |
| $\beta_{p5}^{(1)}$ | $0.3470$ | $0.3343$ | $(-0.3082, 1.0022)$ |
| $\beta_{p6}^{(1)}$ | $-0.0776$ | $0.2540$ | $(-0.5754, 0.4202)$ |
| $\beta_{p7}^{(1)}$ | $0.2890$ | $0.2830$ | $(-0.2657, 0.8436)$ |
| $\beta_{q1}^{(1)}$ | $-0.0727$ | $0.0442$ | $(-0.1593, 0.0139)$ |
| $\beta_{q2}^{(1)}$ | $-0.5374$ | $0.5697$ | $(-1.6539, 0.5792)$ |
| $\beta_{q8}^{(1)}$ | $3.1403$ | $0.4117$ | $(2.3334, 3.9471)$ |
| $\beta_{q9}^{(1)}$ | $2.7825$ | $0.3460$ | $(2.1044, 3.4606)$ |
| $\beta_{q10}^{(1)}$ | $0.2389$ | $0.5597$ | $(-0.8581, 1.3359)$ |
| $\beta^{*(1)}$ | $0.4022$ | $0.0235$ | $(0.3562, 0.4482)$ |

| Parameter | Posterior Mean | SD | 95% HPD Interval |
|---|---|---|---|
| $\beta_{p1}^{(2)}$ | $0.1347$ | $0.0253$ | $(0.085, 0.1843)$ |
| $\beta_{p2}^{(2)}$ | $0.0657$ | $0.2553$ | $(-0.4347, 0.5661)$ |
| $\beta_{p3}^{(2)}$ | $1.6181$ | $0.2349$ | $(1.1576, 2.0785)$ |
| $\beta_{p4}^{(2)}$ | $0.1332$ | $0.0454$ | $(0.0441, 0.2223)$ |
| $\beta_{p5}^{(2)}$ | $0.3275$ | $0.3317$ | $(-0.3227, 0.9777)$ |
| $\beta_{p6}^{(2)}$ | $-0.0688$ | $0.2603$ | $(-0.5789, 0.4413)$ |
| $\beta_{p7}^{(2)}$ | $0.3164$ | $0.2902$ | $(-0.2524, 0.8852)$ |
| $\beta_{q1}^{(2)}$ | $-0.0599$ | $0.0464$ | $(-0.1509, 0.0312)$ |
| $\beta_{q2}^{(2)}$ | $-0.5264$ | $0.5452$ | $(-1.595, 0.5421)$ |
| $\beta_{q8}^{(2)}$ | $3.2221$ | $0.3872$ | $(2.4631, 3.9811)$ |
| $\beta_{q9}^{(2)}$ | $2.7660$ | $0.3507$ | $(2.0787, 3.4533)$ |
| $\beta_{q10}^{(2)}$ | $0.3569$ | $0.5355$ | $(-0.6927, 1.4064)$ |
| $\beta^{*(2)}$ | $0.3344$ | $0.1131$ | $(0.1127, 0.5561)$ |

**Table 4.** Simulated biases, MSEs for point estimation, coverage probabilities (CP) and average widths (AW) of 95% credible intervals of all parameters for sample sizes of $m=100$ and $L=30$ for 6 time points with 200 replications in Model 0.

| Parameter | True Value | Posterior Mean | Bias | MSE | CP | AW |
|---|---|---|---|---|---|---|
| $\beta_{p0}^{(0)}$ | $1$ | $1.0007$ | $0.0007$ | $0.0006$ | $0.9750$ | $0.0700$ |
| $\beta_{p1}^{(0)}$ | $-1$ | $-1.0031$ | $-0.0031$ | $0.0064$ | $0.9700$ | $0.2304$ |
| $\beta_{p2}^{(0)}$ | $2$ | $2.0030$ | $0.0030$ | $0.0085$ | $0.9400$ | $0.2427$ |
| $\beta_{q0}^{(0)}$ | $-1$ | $-1.0002$ | $-0.0002$ | $0.0009$ | $0.9500$ | $0.0829$ |
| $\beta_{q1}^{(0)}$ | $1$ | $0.9984$ | $-0.0016$ | $0.0096$ | $0.9550$ | $0.2755$ |
| $\beta_{q2}^{(0)}$ | $-2$ | $-2.0074$ | $-0.0074$ | $0.0110$ | $0.9350$ | $0.2880$ |

**Table 5.** Simulated biases, MSEs for point estimation, coverage probabilities (CP) and average widths (AW) of 95% credible intervals of all parameters for sample sizes $m=100$ and $L=30$ for 6 time points with 200 replications for Model 1.

| Parameter | True Value | Posterior Mean | Bias | MSE | CP | AW |
|---|---|---|---|---|---|---|
| $\beta_{p1}^{(1)}$ | $-1$ | $-1.0223$ | $-0.0223$ | $0.0161$ | $0.9375$ | $0.3376$ |
| $\beta_{p2}^{(1)}$ | $2$ | $1.9982$ | $-0.0018$ | $0.0155$ | $0.9625$ | $0.3577$ |
| $\beta_{q1}^{(1)}$ | $1$ | $0.9900$ | $-0.0100$ | $0.0275$ | $0.9500$ | $0.4707$ |
| $\beta_{q2}^{(1)}$ | $-2$ | $-1.9895$ | $0.0105$ | $0.0335$ | $0.9625$ | $0.4960$ |
| $\beta^{*(1)}$ | $1$ | $1.0017$ | $0.0017$ | $0.0055$ | $0.9500$ | $0.2050$ |
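
The summary measures reported in the simulation tables (bias, MSE, CP, and AW) can be computed from per-replication output as sketched below. The sketch assumes that each replication yields one point estimate and one 95% interval per parameter; the function name and the synthetic replicate data are hypothetical and are not the simulation output of the study.

```python
import numpy as np

def summarize_replications(estimates, lowers, uppers, true_value):
    """Summarize point estimates and intervals across replications."""
    estimates = np.asarray(estimates, dtype=float)
    lowers = np.asarray(lowers, dtype=float)
    uppers = np.asarray(uppers, dtype=float)
    return {
        # average estimate minus the true value
        "bias": estimates.mean() - true_value,
        # mean squared error of the point estimates
        "mse": np.mean((estimates - true_value) ** 2),
        # proportion of intervals that contain the true value
        "cp": np.mean((lowers <= true_value) & (true_value <= uppers)),
        # average interval width
        "aw": np.mean(uppers - lowers),
    }

# Synthetic replicate output for one parameter (illustration only)
rng = np.random.default_rng(42)
true_beta = 1.0
se = 0.05
est = rng.normal(true_beta, se, size=200)
summary = summarize_replications(est, est - 1.96 * se, est + 1.96 * se, true_beta)
```

With these definitions, the Bias column equals the Posterior Mean column minus the True Value column, which can be checked row by row in the tables.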


© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Han, Y.; Kim, J.; Ng, H.K.T.; Kim, S.W.
Logistic Regression Model for a Bivariate Binomial Distribution with Applications in Baseball Data Analysis. *Entropy* **2022**, *24*, 1138.
https://doi.org/10.3390/e24081138
