2.1. Gradient Boosted Machine Learning
The concept to be learned in this investigation is this: At any point within a football game, what is the likelihood that a turnover will be observed on the next play from scrimmage?
The specific objective is to learn an unknown function $F$ that maps explanatory variables $\mathbf{x}=\{{x}_{1},\dots ,{x}_{d}\}$ to the response $y$, or $F:\mathbf{x}\to y$, where $\mathbf{x}$ represents the game situation, and $y\in \{0,1\}$ is the binary decision (no turnover, turnover). A collection of training examples $T=\{({\mathbf{x}}_{i},{y}_{i}),\ i=1,\dots ,N\}$ is used to estimate an approximation to $F$, or $\widehat{F}\left(\mathbf{x}\right)=y$, by an adaptive learning algorithm known as the “gradient boosting machine” [7, 8].
Gradient boosting machines (GBMs) are learning algorithms that reconstruct a decision function $\widehat{F}$ based on the consensus of an ensemble of classification or regression trees. New decision tree models are sequentially added to the ensemble in order to increase the estimation accuracy of the response variable. The numerical optimization minimizes an expected loss, $\widehat{F}\left(\mathbf{x}\right)={\text{argmin}}_{F}{E}_{y,\mathbf{x}}L(y,F\left(\mathbf{x}\right))$, of a group of trees, conditioned over the entire training data set [7]. The loss function can be selected according to a given learning concept and joint probability distribution $f(\mathbf{x},y)$ under study. Here, we use a Bernoulli distribution loss function, convert the classification to a continuous value via logistic regression and estimate the turnover probability $\widehat{p}\left({\mathbf{x}}_{i}\right)=p({y}_{i}=1|{\mathbf{x}}_{i})$, $\widehat{p}\in [0,1]$.
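As a minimal illustration of the Bernoulli loss and logistic link described above, the following sketch maps a real-valued ensemble score to a probability and evaluates the Bernoulli negative log-likelihood. The function names and the toy scores are hypothetical, and the code is written in Python purely for illustration; it is not the R implementation used in this study.

```python
import math

def logistic(f):
    """Map a real-valued ensemble score f to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-f))

def bernoulli_nll(y, f):
    """Negative Bernoulli log-likelihood for a label y in {0, 1} and score f.

    Equivalent to -[y*log(p) + (1-y)*log(1-p)] with p = logistic(f).
    """
    return math.log(1.0 + math.exp(f)) - y * f

# Toy scores and labels (made up for illustration only).
scores = [-4.0, -2.0, 0.5]
labels = [0, 0, 1]

probs = [logistic(f) for f in scores]
mean_loss = sum(bernoulli_nll(y, f) for y, f in zip(labels, scores)) / len(labels)
```

Strongly negative scores yield probabilities near zero, which is the expected regime for most plays given how rare turnovers are.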
A useful property of GBMs in applications is interpretability through calculation of the relative influence of explanatory variables in constructing a consensus decision. The influence of each input variable ${x}_{j}$ in a given tree is based on the frequency of its selection for splitting in non-terminal nodes, and its contribution to successful model classification over the data sample. This influence is averaged over the ensemble of trees to estimate the variable’s overall importance to the decision function $\widehat{F}$ [7]. In the current investigation, this interpretation may provide insight into the game conditions under which turnovers might be expected to occur.
Gradient boosting machine models were developed and evaluated in R, using the “gbm” package [9, 10].
2.2. Sample, Segmentation and Features
The population under study consists of NFL season, game, player and play-level data for complete seasons 2009 through 2015, covering all 32 teams. Game data were downloaded from the site www.nfl.com using utilities provided by “nflscrapR” [11]. These data were preprocessed by (1) sampling by season and team; (2) filtering by play type, to include only plays from scrimmage (run, pass or sack); (3) annotating by status of turnovers (true, false) observed on each play; (4) constructing feature vectors using attributes of the play-by-play and game contextual data.
This sample comprised 300,450 plays. Running plays represented $31.7\%$ of all plays, passes $42.1\%$, and sacks only $2.9\%$. Although sack-fumbles lost are significant events ($5.1\%$ of sacks produce turnovers), we decided to exclude sacks from further consideration due to their negligible numbers relative to run and pass plays. To make this predictive analysis useful in practice, it is prudent to categorize turnover events in association with scrimmage plays that could reasonably be anticipated by a defensive team, based on offensive formation.
After excluding sack plays, the sample contained 291,675 plays, with an overall turnover prevalence of $1.633\%$ for pass and run plays, combined. Pass plays made up $43.4\%$ and runs $32.6\%$ of the resultant dataset.
Two partitioning schemes were applied to the sample. First, an aggregate sample of all 32 NFL teams was created to assess whether invariant patterns of turnover predictability could be determined. Second, individual team samples were assembled to develop team-specific models of turnovers. Seven full season-long records were used for all sample datasets.
Predictive models were trained and evaluated for each sample. These samples were segmented by distinct event types: (1) run plays; (2) pass plays; and (3) run or pass plays combined.
Feature vectors for learning were constructed from available fields in the play-by-play data. Numeric data were normalized by characteristic length and time scales. Categorical and ordinal variables were represented as binary valued quantities using one-hot encoding. The features and their corresponding nominal dimensions upon encoding are summarized in Table 1. Not all dimensions listed in the table were used in model development, due to their low variation across certain limited subsamples.
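The feature construction described above can be sketched as follows. The field names (`down`, `yards_to_go`, `seconds_left`) and the characteristic scales (10 yards, 3600 seconds) are assumptions chosen for illustration; the actual features are those listed in Table 1, and the actual scales are not stated in the text.

```python
def one_hot(value, categories):
    """Return a binary indicator vector with a single 1 at the position of value."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories]

def scale(value, characteristic):
    """Normalize a numeric field by a characteristic length or time scale."""
    return value / characteristic

DOWNS = [1, 2, 3, 4]        # hypothetical ordinal field, one-hot encoded
YARD_SCALE = 10.0           # assumed characteristic length (yards)
TIME_SCALE = 3600.0         # assumed characteristic time (seconds)

def encode_play(play):
    """Build a feature vector from one play record (a dict of raw fields)."""
    numeric = [
        scale(play["yards_to_go"], YARD_SCALE),
        scale(play["seconds_left"], TIME_SCALE),
    ]
    return numeric + one_hot(play["down"], DOWNS)

example = encode_play({"down": 3, "yards_to_go": 5, "seconds_left": 900})
```

Here a third-down play with 5 yards to go and 900 seconds remaining becomes a six-dimensional vector: two scaled numeric values followed by four indicator values for the down.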
2.3. Modeling and Analysis
The incidence of turnovers as a percentage of all plays from scrimmage is very low, around $1.6\%$. For this reason, the distribution of class labels ${y}_{i}$ in a training set $T=\{({\mathbf{x}}_{i},{y}_{i}),\ i=1,\dots ,N\}$ randomly sampled from the true population is highly skewed. Learning the parameters of a useful statistical estimator of turnover probability $\widehat{p}\left({\mathbf{x}}_{i}\right)=p({y}_{i}=1|{\mathbf{x}}_{i})$ suggests the use of specific learning techniques to avoid trivially predicting “no turnover” on every decision [12].
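The effect of this skew can be made concrete with the sample figures reported in Section 2.2. The short calculation below (an illustrative sketch; the turnover count is an approximation derived from the reported prevalence, not a figure from the data) shows that a rule predicting “no turnover” on every play scores high accuracy while detecting no turnovers at all.

```python
n_plays = 291_675       # run and pass plays in the sample (Section 2.2)
prevalence = 0.01633    # overall turnover prevalence (Section 2.2)

# Approximate number of turnover plays implied by the prevalence.
n_turnovers = round(n_plays * prevalence)
n_clean = n_plays - n_turnovers

# A trivial rule that always predicts "no turnover".
true_positives = 0
accuracy = n_clean / n_plays            # fraction of plays labeled correctly
tpr = true_positives / n_turnovers      # fraction of turnovers detected
```

Accuracy is therefore a poor guide to model utility on this problem, which motivates both the re-balancing of the training set and the choice of evaluation metrics.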
To address this, the approach taken in this study was to re-balance the distribution of classes in the training set, over-representing the distribution of the minority class in order to present sufficient examples to the learning algorithm. During validation of the models, examples closely representative of the true distribution within the population were used to assess model predictive power when applied out-of-sample.
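One simple way to over-represent the minority class is random resampling with replacement within each class. The sketch below assumes this scheme and a hypothetical target of equal class proportions; the exact re-balancing ratio used in this study is not specified in the text.

```python
import random

def rebalance(examples, minority_fraction=0.5, seed=0):
    """Resample with replacement so positives make up minority_fraction of the output.

    examples: list of (x, y) pairs with y in {0, 1}. The output has the same
    length as the input. minority_fraction is an assumed, illustrative target.
    """
    rng = random.Random(seed)
    positives = [e for e in examples if e[1] == 1]
    negatives = [e for e in examples if e[1] == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    n_total = len(examples)
    n_pos = round(n_total * minority_fraction)
    sample = rng.choices(positives, k=n_pos) + rng.choices(negatives, k=n_total - n_pos)
    rng.shuffle(sample)
    return sample

# Toy data: 2 turnovers among 100 plays (made up for illustration).
toy = [((i,), 1 if i < 2 else 0) for i in range(100)]
balanced = rebalance(toy, minority_fraction=0.5)
```

Only the training set is re-balanced in this way; validation examples keep the natural class distribution so that out-of-sample estimates remain realistic.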
The modeling strategy included bootstrap resampling [13], cross validation analysis, and receiver operating characteristic curve (ROC) analysis [14]. The latter technique enabled error estimation, model comparison and selection from the large number of hypotheses generated by the gradient boosting machines during training. ROC curves are often used to trade off false positive rate ($FPR$) and true positive rate ($TPR$) for evaluation of classifiers. In this study, the false discovery rate ($FDR$) was substituted for $FPR$ for analysis. $FDR$ is the fraction of all positive decisions (i.e., turnover predicted) made by a model that are incorrect [15]. $FDR$ is a more informative metric than $FPR$ in diagnostic or predictive applications where confidence in a positive prediction is preferred, especially when the class distribution is skewed [16]. $FDR$ is related to the positive predictive value statistic by $PPV=1-FDR$. High $PPV$ (low $FDR$) values are desirable. $TPR$ denotes the sensitivity of the model, or the likelihood that actual turnover events are detected within a testing distribution.
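The metrics defined above follow directly from the counts of a binary confusion matrix. The following sketch computes them for a made-up set of labels and predictions; the function names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Count (tp, fp, fn, tn) for binary labels, where 1 means turnover."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def rates(y_true, y_pred):
    """Return FDR, PPV, TPR and FPR as a dict; raise if any rate is undefined."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    if tp + fp == 0 or tp + fn == 0 or fp + tn == 0:
        raise ValueError("a rate is undefined for these counts")
    fdr = fp / (tp + fp)              # incorrect share of positive predictions
    return {
        "FDR": fdr,
        "PPV": 1.0 - fdr,             # PPV = 1 - FDR
        "TPR": tp / (tp + fn),        # sensitivity
        "FPR": fp / (fp + tn),
    }

# Toy example: 3 actual turnovers in 10 plays, 3 predicted turnovers.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = rates(y_true, y_pred)
```

In this toy case one of three positive predictions is wrong, so $FDR=1/3$, while only one of seven non-turnover plays is flagged, so $FPR=1/7$. The gap between the two widens as the negative class grows, which is why $FDR$ is the more demanding measure under class skew.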
In ROC space ($FDR$, $TPR$), an optimal decision threshold $D{T}_{opt}$ is determined experimentally for a given distribution. Our objective is to minimize $FDR$ for tactical reasons. A second pass through the training data with this fixed threshold is used to train and evaluate model performance. The gradient boosted model outputs a probability $\widehat{p}$; the turnover prediction algorithm is then [17]
$$\widehat{y}\left(\mathbf{x}\right)=\begin{cases}1, & \widehat{p}\left(\mathbf{x}\right)>D{T}_{opt}\\ 0, & \text{otherwise}\end{cases}$$
where $\widehat{y}\left(\mathbf{x}\right)=1$ means a turnover will be observed given input $\mathbf{x}$.
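A minimal sketch of threshold selection and the resulting prediction rule is given below. Two points are assumptions made for illustration rather than details from this study: the rule uses a strict inequality at the threshold, and the threshold is chosen as the candidate with the lowest $FDR$ subject to a hypothetical minimum sensitivity `min_tpr`.

```python
def predict(p_hat, threshold):
    """Predict a turnover (1) when the estimated probability exceeds the threshold."""
    return 1 if p_hat > threshold else 0

def fdr_tpr(y_true, p_hats, threshold):
    """Return (FDR, TPR) at a threshold; a rate is None when undefined."""
    preds = [predict(p, threshold) for p in p_hats]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    fdr = fp / (tp + fp) if tp + fp else None
    tpr = tp / (tp + fn) if tp + fn else None
    return fdr, tpr

def choose_threshold(y_true, p_hats, candidates, min_tpr=0.5):
    """Pick the candidate with the lowest FDR among those with TPR >= min_tpr."""
    best = None
    for dt in candidates:
        fdr, tpr = fdr_tpr(y_true, p_hats, dt)
        if fdr is None or tpr is None or tpr < min_tpr:
            continue
        if best is None or fdr < best[1]:
            best = (dt, fdr, tpr)
    return best  # (threshold, FDR, TPR), or None if no candidate qualifies

# Toy validation data (made up for illustration).
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
p_hats = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.90]
best = choose_threshold(y_true, p_hats, candidates=[0.1, 0.3, 0.5, 0.65, 0.8])
```

Raising the threshold lowers $FDR$ at the cost of sensitivity, so some floor on $TPR$ (or another tie-breaking rule) is needed to avoid selecting a threshold so high that almost no turnovers are predicted.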
Model learning for the aggregate sample used the “bootstrap” [13] to repeatedly draw samples from the entire training set. Data were partitioned according to the play type segment under consideration, and a stratified sample was constructed for training. The validation data were sampled at random from the entire sample, according to the natural distribution of turnovers. A two-step procedure was followed for each of $B=100$ bootstrap replicates. The first step estimated the detection threshold ($DT$) for optimal $FDR$ and $TPR$ via ROC analysis, training GBMs comprising 1500 trees (nominally). In the second step, the threshold was held constant such that $DT=D{T}_{opt}$ and the entire sample was modeled again.
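The sampling side of this procedure can be sketched as follows. The sketch only generates index sets for each replicate; model fitting is omitted. The per-class training size, the validation size, equal class proportions in the stratified draw, and sampling the validation set with replacement are all assumptions made for illustration.

```python
import random

def stratified_bootstrap(labels, n_per_class, rng):
    """Draw n_per_class indices with replacement from each class."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    return rng.choices(pos, k=n_per_class) + rng.choices(neg, k=n_per_class)

def natural_sample(labels, n, rng):
    """Draw n indices uniformly at random, keeping the natural class mix on average."""
    return rng.choices(range(len(labels)), k=n)

def bootstrap_replicates(labels, n_replicates=100, n_per_class=50, n_valid=200, seed=0):
    """Yield (replicate, train_idx, valid_idx) for each bootstrap replicate."""
    rng = random.Random(seed)
    for b in range(n_replicates):
        train_idx = stratified_bootstrap(labels, n_per_class, rng)
        valid_idx = natural_sample(labels, n_valid, rng)
        yield b, train_idx, valid_idx

# Toy labels: 20 turnovers among 1000 plays (made up for illustration).
labels = [1] * 20 + [0] * 980
replicates = list(bootstrap_replicates(labels))
```

Within each replicate, a model would first be fitted on `train_idx` to locate the threshold on `valid_idx`, and then refitted with that threshold held fixed.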
The learning procedure used for the team-wise samples was notionally similar, differing slightly in the numerical mechanics. Stratified sampling (with respect to the class labels $y$) of individual teams produced untenably small sample counts. This required an alternate sampling strategy. The decision was made to use 10-fold cross validation, nested within a 10-trial bagging procedure. Prediction rules were developed by finally averaging the performance results. Modeling therefore included all of the available instances, and benefited from the variance-reduction properties of bagging for model performance estimation [18].
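The structure of this nested scheme, 10 trials each containing 10 folds, can be sketched with a repeated k-fold partition. Note that the sketch substitutes a fresh random shuffle per trial for the bagging step, whose exact resampling mechanics are not specified in the text; it illustrates only how 100 train/test splits arise and how every instance is used for testing once per trial.

```python
import random

def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1 and split into k disjoint, nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def repeated_k_fold(n, n_trials=10, k=10, seed=0):
    """Yield (trial, fold, train_idx, test_idx) for k-fold CV repeated n_trials times."""
    rng = random.Random(seed)
    for trial in range(n_trials):
        folds = k_fold_indices(n, k, rng)
        for f, test_idx in enumerate(folds):
            # Training indices are all folds except the held-out one.
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield trial, f, train_idx, test_idx

# Hypothetical team-level sample size, chosen only for illustration.
splits = list(repeated_k_fold(n=205))
```

Performance statistics computed on each of the 100 held-out folds would then be averaged to form the team-level estimate.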
Performance statistics $FDR\left(D{T}_{opt}\right)$ and $TPR\left(D{T}_{opt}\right)$ were accumulated and finally averaged over the $B$ replicates (or the trials and folds $k$) to estimate the generalization performance of the ensemble of trees. Sampling distributions of the sample mean and standard error values for $FDR$ and $TPR$ observed in out-of-sample testing were recorded for each sample and segment under investigation.
In this investigation, we define a “good” false discovery rate to be $FDR<0.15$. In other words, a positive prediction made by the model ($\widehat{y}\left(\mathbf{x}\right)=1$) is correct at least $85\%$ of the time to meet this criterion of model utility. This means that when a turnover is predicted on the impending play from scrimmage, a high degree of confidence can be associated with that prediction.
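The averaging step and the utility criterion can be sketched together. The per-replicate $FDR$ values below are made up for illustration, and the standard error is assumed to be the sample standard deviation divided by the square root of the number of replicates.

```python
import math
import statistics

def mean_and_se(values):
    """Return the sample mean and the standard error of the mean."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two values are required")
    return statistics.mean(values), statistics.stdev(values) / math.sqrt(n)

GOOD_FDR = 0.15   # criterion stated in the text: FDR < 0.15

# Hypothetical per-replicate FDR values (not results from this study).
fdr_by_replicate = [0.10, 0.12, 0.14, 0.16]
mean_fdr, se_fdr = mean_and_se(fdr_by_replicate)

meets_criterion = mean_fdr < GOOD_FDR
mean_ppv = 1.0 - mean_fdr   # share of positive predictions that are correct
```

With these toy values the mean $FDR$ is 0.13, so the criterion is met and positive predictions would be correct about 87% of the time on average.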
A pseudo-code outline of the model training and evaluation procedure appears in the Appendix, as Algorithm A1.