Predicting the Outcome of NBA Playoffs Based on the Maximum Entropy Principle

Abstract: Predicting the outcome of National Basketball Association (NBA) matches poses a challenging problem of interest to the research community as well as the general public. In this article, we formalize the problem of predicting NBA game results as a classification problem and apply the principle of Maximum Entropy to construct an NBA Maximum Entropy (NBAME) model that fits to discrete statistics for NBA games, and then predict the outcomes of NBA playoffs using the model. Our results reveal that the model is able to predict the winning team with 74.4% accuracy, outperforming other classical machine learning algorithms that could only afford a maximum prediction accuracy of 70.6% in the experiments that we performed.


Introduction
The National Basketball Association (NBA), the highest-level basketball league in the world, was founded in 1946 and has a 70-year history. NBA games are now among the most professional, most heavily marketed, and best attended, and the league is one of the most popular in the world. The NBA enjoys a large following around the world, with many fans anticipating results and a multitude of betting companies offering vast amounts of money to gamblers on the odds of one team winning against another [1,2]. Most participants place their bets subjectively, based on personal preference for teams and without any scientific basis, so the accuracy of their predictions is often very poor. With the rapid advance in science and technology, specifically sophisticated data mining and machine learning algorithms, forecasting the outcome of a game with high precision is highly feasible and of great economic significance to various players in the betting industry.
By 1950, the popularity of the NBA had increased globally, creating a need to forecast the results of NBA games; experts therefore began to focus on historical records of game statistics in a bid to turn the data into useful information. In the early days, most researchers applied simple statistical principles that combined technical features of past games into a ranked list of teams, which was used to forecast the likelihood of a home team winning an upcoming game [3,4]. However, the accuracy of these methods is low compared to probabilistic machine learning methods. As data for past games became more widely available, researchers began to look for more methods to apply to the large amounts of data; thus, a large number of articles on the analysis and forecasting of sports results were published. With advances in statistics and in the processing power of personal computers, researchers leveraged this power to improve prediction accuracy. Bhandari et al. [5] developed Advanced Scout on a Windows personal computer in 1996, which brought NBA game data into the field of data mining and knowledge discovery and enabled coaches to find interesting patterns in basketball games based on data.
By the end of the 20th century, scientists had started using a variety of machine learning algorithms to forecast NBA games. Existing research that used neural networks and decision trees is limited by small datasets, which lead to overfitting of both models. Consequently, the models perform very well on the training data but poorly on the test dataset [6][7][8]. The Maximum Entropy model overcomes this limitation by making use of the few known facts and making no assumptions about the unknown. Similarly, the support vector machine is limited by its failure to output a probability value; it outputs only a win or loss, which makes the results difficult to explain [9]. Lack of independence between some features used in sports forecasting is a major limitation of research, such as [10], that uses the Naive Bayes method.
Recently, many scholars have used a variety of probabilistic graphical models to simulate games [11][12][13], and their results are promising. However, their major focus is the difference between the simulation and the real game, not the prediction of the final outcome of the game, and they do not compute their prediction accuracy. Stekler et al. [14] examined several evaluation procedures and compared the prediction accuracy of some forecasting methods. Haghighat et al. [15] reviewed the use of data mining technologies (neural networks, support vector machines, Bayesian methods, decision trees and fuzzy systems) to forecast the results of sports events and evaluated the advantages and disadvantages of each method. However, they did not evaluate the Maximum Entropy method, and, to the best of our knowledge, this is the first piece of research to apply the Maximum Entropy model to sports forecasting.
The Maximum Entropy model depends chiefly on the construction of feature functions and the preprocessing of the feature values of the data. In this paper, using the Maximum Entropy principle, we attempt to overcome the feature independence assumption that limits the Naive Bayes model. We apply the Maximum Entropy principle to a set of features and establish the NBA Maximum Entropy (NBAME) model. Then, we use the model to calculate the probability that the home team wins an upcoming game and make predictions based on this probability. Our results show that the prediction accuracy is high compared with other machine learning algorithms.
The rest of this paper is arranged as follows: Section 2 describes the Maximum Entropy model and k-means clustering. Section 3 gives an overview of the NBAME model. Section 4 presents the experiment results and compares them with results from other algorithms. Finally, concluding remarks and suggestions for future work are given in Section 5.

Background
Before exploring the use of the entropy-based scheme in NBA prediction, we discuss the Maximum Entropy model and the k-means clustering algorithm, which we used to discretize continuous-valued attributes.

Maximum Entropy Model
The concept of "information entropy" dates back to 1948, when Shannon [16] first put it forward. Information entropy is the expected value of the information contained in a message. As a measure of the uncertainty of random events, information entropy can be written explicitly as

$$H(p) = -\sum_{i} p_i \log p_i, \tag{1}$$

where $H(p)$ is the information entropy, and $p_i$ is the probability of the $i$th random event.
Jaynes [17] proposed the criterion that, subject to precisely stated prior data, the probability distribution which best represents the current state of knowledge is the one with the largest entropy. This criterion is known as the "Maximum Entropy principle". The principle identifies the best approximation to an unknown probability distribution: the one that satisfies all the constraints on the unknown distribution that we are aware of and makes no subjective assumptions about unknown conditions. In this case, the probability distribution is the most uniform, and the risk of making a wrong prediction is at its lowest level.
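The link between entropy and uniformity can be checked numerically. The following is a minimal sketch, assuming base-2 logarithms (the text does not fix a base); the three example distributions are invented for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy H(p) = -sum_i p_i * log2(p_i), in bits.

    Terms with p_i = 0 are skipped, following the convention 0 * log 0 = 0.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Three illustrative distributions over four outcomes (made-up numbers).
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.7, 0.1, 0.1, 0.1]
certain = [1.0, 0.0, 0.0, 0.0]

h_uniform = entropy(uniform)  # largest: nothing is assumed beyond normalization
h_skewed = entropy(skewed)    # smaller: the distribution commits to one outcome
h_certain = entropy(certain)  # zero: no uncertainty remains
```

With no constraint other than normalization, the uniform distribution has the largest entropy of the three, which is the sense in which the Maximum Entropy principle prefers the "most uniform" distribution compatible with what is known.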
The Maximum Entropy model, also known as a log-linear model, is based on the principle of Maximum Entropy. Unlike the Naive Bayes classifier, the Maximum Entropy model does not assume that the features are conditionally independent of each other. The Maximum Entropy approach is superior to similar approaches in many circumstances [18,19], especially when the number of samples is small [20]; this is partly because it is a regression approach whose optimization routine is guaranteed to converge on the Maximum Entropy solution.
In recent years, Maximum Entropy based models have been widely used for Natural Language Processing (NLP) tasks, especially for tagging sequential data [21][22][23]. These models have a great advantage over traditional Hidden Markov Models (HMMs) and Naive Bayes models. For example, Maximum Entropy models can incorporate richer features in a well-founded fashion, which HMMs do not. Maximum Entropy based models have also been applied to many other areas lately: (1) Tseng and Tuszynski [24] gave several examples of applications of Maximum Entropy in different stages of drug discovery; (2) Xu et al. [25] proposed a continuous Maximum Entropy method to investigate the robust optimal portfolio selection problem for the market with transaction costs and dividends; and (3) Phillips et al. [26] studied the problem of modeling the geographic distribution of a given animal or plant species by maximum-entropy techniques. Since the Maximum Entropy model is designed to solve problems for which information is insufficient, we argue that it may provide a very appropriate approach to NBA playoffs prediction.

K-Means Clustering
Like many supervised machine learning algorithms, the Maximum Entropy model requires a discrete feature space. In order to train the Maximum Entropy model with a very limited training dataset, we need to convert attributes that have continuous numeric values into discrete ones. A lot of research has been done on continuous feature discretization [27][28][29][30][31][32]. Methods for discretization are broadly classified into Supervised vs. Unsupervised, Global vs. Local, and Static vs. Dynamic. Recursive minimal entropy partitioning, error-based discretization and Self Organized Map (SOM) based discretization are several supervised discretization processes [33]. Unsupervised methods, by contrast, do not make use of class labels for discretization. Equal width binning is one of the simplest approaches to unsupervised discretization, together with equal frequency binning [34]. Other methods are based on clustering principles and include k-means clustering discretization [35].
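The two simplest unsupervised schemes mentioned above can be sketched in a few lines. This is an illustrative Python sketch, not the preprocessing used in this study; the function names and the sample scores are assumptions made for the example.

```python
def equal_width_bins(values, k):
    """Assign each value to one of k bins of equal width over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins

# Hypothetical team scores, used only to exercise the two functions.
scores = [80, 85, 90, 96, 97, 100, 103, 104, 110, 125]
width_bins = equal_width_bins(scores, 3)
frequency_bins = equal_frequency_bins(scores, 3)
```

Neither scheme looks at how the values group together, which is the motivation for clustering-based discretization.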
Jain [36] provided an overview of clustering algorithm development and application. k-means clustering is a method of vector quantization that originated in signal processing. The standard algorithm was published by Lloyd in 1982 [37], and its main concept is to partition n observations $\{x_1, x_2, \ldots, x_n\}$ into k clusters, in which each observation belongs to the cluster with the nearest mean.
Algorithmic steps for k-means clustering:
1. Randomly select "c" cluster centers;
2. Calculate the distance between each data point and the cluster centers;
3. Assign each data point to the cluster center whose distance from the data point is the minimum over all the cluster centers;
4. Recalculate each new cluster center as the mean of the data points assigned to it, $v_i = \frac{1}{c_i} \sum_{j=1}^{c_i} x_j$, where $c_i$ represents the number of data points in the $i$th cluster and the sum runs over those points;
5. Recalculate the distance between each data point and the newly obtained cluster centers;
6. If no data point was reassigned, then stop; otherwise, repeat from step 3.
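The steps above can be sketched for one-dimensional data, which is the case that arises when a single feature is discretized. This is a minimal illustration, assuming a fixed random seed and absolute difference as the distance; the function name `kmeans_1d` and its parameters are hypothetical and do not come from the study.

```python
import random

def kmeans_1d(values, k, seed=0, max_iter=100):
    """Lloyd-style k-means on 1-D data; returns (assignment, centers)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data values as the initial centers.
    centers = rng.sample(sorted(set(values)), k)
    assignment = None
    for _ in range(max_iter):
        # Steps 2-3: assign every point to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: abs(v - centers[c])) for v in values
        ]
        # Step 6: stop when no point changed cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(values, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = sum(members) / len(members)
    return assignment, centers

# Hypothetical scores; the cluster index serves as the discretized value.
sample_scores = [80, 82, 84, 100, 101, 102, 120, 122, 125]
labels, centers = kmeans_1d(sample_scores, 3)
```

The result depends on the random initial centers, so different seeds can give different partitions; this is a known property of the algorithm rather than of this sketch.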
Nowadays, k-means clustering is very popular and is one of the most effective unsupervised discretization algorithms [38] in the data mining field [39][40][41], which motivated our decision to use it to discretize our feature values. Kanungo et al. [42] presented a simple and efficient implementation of the k-means clustering algorithm.

Materials and Methods
In this section, we describe basic technical features of each game and apply the Maximum Entropy principle to build the NBAME model.

Basic Technical Features
We formalized the "outcome predicting" problem as a two-class classification problem. Each game is described by a vector of 29 entries: 28 features of the participating teams and the outcome of the game (the label). Table 1 shows the complete feature set with the corresponding abbreviations used in this article. The statistics shown in Table 1 were used since they are common to basketball and any typical fan should be able to understand what each statistic represents.

NBAME Model Overview
Before building the NBAME model, we construct a feature function. The choice of feature function is vital to the performance of the Maximum Entropy model: it directly affects the structure of the optimal probability model, and it is also what makes the Maximum Entropy model superior to other models. There is flexibility in choosing the feature function, which enables the designer to make full use of the known facts from the data to improve the performance of the model. In general, a feature function is a binary function of the form $f(x, y) \in \{0, 1\}$, where x is the set of features and y is the label.
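As an illustration of what a binary feature function can look like, the sketch below builds indicator functions that fire when one discretized attribute takes a given value together with a given label. The text does not spell out the exact form used for the NBAME model, so this form, the helper name `make_feature_function`, and the attribute encoding are assumptions.

```python
def make_feature_function(attribute_index, attribute_value, label):
    """Return f(x, y) that is 1 when x[attribute_index] == attribute_value
    and y == label, and 0 otherwise."""
    def f(x, y):
        return 1 if x[attribute_index] == attribute_value and y == label else 0
    return f

# Hypothetical encoding: attribute 0 is the discretized home-team score,
# cluster id 2 means "high score", and label 1 means "home team wins".
f_high_score_home_win = make_feature_function(0, 2, 1)

fires = f_high_score_home_win((2, 0, 1), 1)        # attribute and label match
wrong_label = f_high_score_home_win((2, 0, 1), 0)  # label differs
wrong_value = f_high_score_home_win((1, 0, 1), 1)  # attribute value differs
```

One such function per (attribute, value, label) combination gives a feature set whose weights can be read as the evidence each discretized statistic contributes to a win or a loss.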
After constructing the feature functions, we build the NBAME model using the Maximum Entropy principle. We count the games with the same features $x_i$ and the same outcome $y_i$ in the training dataset, and then divide the count by the training dataset size N. This gives the empirical joint probability distribution $\tilde{p}(x, y)$:

$$\tilde{p}(x, y) = \frac{1}{N} \times \text{number of times that } (x, y) \text{ occurs in the training dataset}. \tag{3}$$

For each feature function $f_k$, the expectation with respect to the empirical joint probability distribution $\tilde{p}(x, y)$ is:

$$E_{\tilde{p}}(f_k) = \sum_{(x,y)} \tilde{p}(x, y)\, f_k(x, y). \tag{4}$$

We calculate the number of games with the same feature vector x and then divide this number by the training dataset size N to get the empirical marginal probability distribution $\tilde{p}(x)$:

$$\tilde{p}(x) = \frac{1}{N} \times \text{number of times that } (x) \text{ occurs in the training dataset}, \tag{5}$$

and the expectation of feature function $f_k$ relative to the model $p(y|x)$ and the empirical marginal probability distribution $\tilde{p}(x)$ is:

$$E_{p}(f_k) = \sum_{(x,y)} \tilde{p}(x)\, p(y|x)\, f_k(x, y). \tag{6}$$

By constraining the expected value to be equal to the empirical value, from Equations (4) and (6) we have:

$$E_{p}(f_k) = E_{\tilde{p}}(f_k). \tag{7}$$

Equation (7) is called the constraint, and we have as many constraints as the number of feature functions.
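The two expectations that the constraint equates can be computed directly from counts. The following sketch assumes that each training example is a pair `(x, y)` with `x` a tuple of discretized values; the helper names and the four toy examples are invented for illustration.

```python
from collections import Counter

def empirical_joint(samples):
    """Empirical joint distribution: count of (x, y) divided by N."""
    n = len(samples)
    return {xy: c / n for xy, c in Counter(samples).items()}

def empirical_marginal(samples):
    """Empirical marginal distribution: count of x divided by N."""
    n = len(samples)
    return {x: c / n for x, c in Counter(x for x, _ in samples).items()}

def empirical_expectation(samples, f):
    """Expectation of f under the empirical joint distribution."""
    return sum(p * f(x, y) for (x, y), p in empirical_joint(samples).items())

def model_expectation(samples, f, cond_prob, labels=(0, 1)):
    """Expectation of f under the empirical marginal times the model p(y|x)."""
    return sum(px * cond_prob(y, x) * f(x, y)
               for x, px in empirical_marginal(samples).items()
               for y in labels)

# Toy data: one discretized attribute, label 1 = home win (made-up values).
toy = [((1,), 1), ((1,), 1), ((1,), 0), ((0,), 0)]
f = lambda x, y: 1 if x[0] == 1 and y == 1 else 0
uniform_model = lambda y, x: 0.5  # a model that knows nothing yet

observed = empirical_expectation(toy, f)
expected = model_expectation(toy, f, uniform_model)
```

Here the uniform model gives an expectation of 0.375 against an observed value of 0.5, so it violates the constraint; training adjusts the model until the two agree for every feature function.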
The above constraints can be satisfied by an infinite number of models. Thus, in order to build our model, we need to select the best candidate based on a specific criterion. According to the principle of Maximum Entropy, we should select the model that is as close as possible to uniform. That is, we should select the model $p^*$ with Maximum Entropy:

$$p^* = \arg\max_{p} H(p) = \arg\max_{p} \left( -\sum_{(x,y)} \tilde{p}(x)\, p(y|x) \log p(y|x) \right), \tag{8}$$

given that:
1. $p(y|x) \geq 0$ for all x, y;
2. $\sum_{y} p(y|x) = 1$ for all x;
3. $\sum_{(x,y)} \tilde{p}(x, y)\, f_k(x, y) = \sum_{(x,y)} \tilde{p}(x)\, p(y|x)\, f_k(x, y)$ for $k \in \{1, 2, \ldots, K\}$.
To solve the above optimization problem, we introduce Lagrangian multipliers, focus on the unconstrained dual problem, and estimate the free variables $\{\lambda_1, \lambda_2, \ldots, \lambda_K\}$ with the Maximum Likelihood Estimation method. It can be proved that if we find the parameters $\{\lambda_1, \lambda_2, \ldots, \lambda_K\}$ that maximize the dual problem, the probability that a game with statistics x is classified as y is equal to:

$$p^*(y|x) = \frac{1}{\pi(x)} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(x, y) \right), \tag{9}$$

where $\pi(x)$ is a normalization factor:

$$\pi(x) = \sum_{y} \exp\left( \sum_{k=1}^{K} \lambda_k f_k(x, y) \right). \tag{10}$$

Parameter $\lambda_k$ can be perceived as the weight of feature function $f_k(x, y)$, and the Maximum Entropy algorithm learns by adjusting $\lambda_k$. The parameter $\lambda_k$ cannot be obtained analytically and must be computed numerically, the most popular method being Generalized Iterative Scaling (GIS) [43]. In this paper, we use the GIS method to calculate parameter $\lambda_k$. Thus, once we have found the $\lambda_k$ parameters of our model, all we need to do in order to classify the outcome of a new game as a win or a loss for the home team is to use the "maximum a posteriori" decision rule and select the category with the highest probability.
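A compact sketch of GIS for this kind of model follows. It is not the implementation used in this study: it assumes indicator features of the form (attribute index, attribute value, label), drops features that never occur in the training data, omits the GIS correction feature, and runs a fixed number of iterations. The names `train_gis`, `predict_proba`, and `toy_games` are hypothetical.

```python
import math
from collections import Counter

def predict_proba(weights, x, labels=(0, 1)):
    """p(y|x) = exp(sum_k lambda_k f_k(x, y)) / pi(x) for indicator features."""
    scores = {
        y: math.exp(sum(weights.get((j, v, y), 0.0) for j, v in enumerate(x)))
        for y in labels
    }
    pi_x = sum(scores.values())  # normalization factor pi(x)
    return {y: s / pi_x for y, s in scores.items()}

def train_gis(samples, labels=(0, 1), iterations=200):
    """samples: list of (x, y) pairs, x a tuple of discretized attributes."""
    n = len(samples)
    c = len(samples[0][0])  # GIS constant: at most one active feature per attribute
    # Empirical expectation of every feature observed in the training data.
    empirical = Counter()
    for x, y in samples:
        for j, v in enumerate(x):
            empirical[(j, v, y)] += 1 / n
    weights = {key: 0.0 for key in empirical}
    for _step in range(iterations):
        # Expectation of every feature under the current model.
        model = Counter()
        for x, _y in samples:
            probs = predict_proba(weights, x, labels)
            for y in labels:
                for j, v in enumerate(x):
                    if (j, v, y) in weights:
                        model[(j, v, y)] += probs[y] / n
        # GIS update: lambda_k += (1 / C) * log(empirical / model).
        for key in weights:
            weights[key] += math.log(empirical[key] / model[key]) / c
    return weights

# Made-up games with two discretized attributes; label 1 = home win.
toy_games = [((2, 0), 1), ((2, 1), 1), ((1, 0), 1), ((1, 1), 0),
             ((0, 0), 0), ((0, 1), 0), ((2, 1), 0), ((0, 0), 1)]
toy_weights = train_gis(toy_games)
p_home_win = predict_proba(toy_weights, (2, 0))[1]
```

Each update moves a weight up when its feature is observed more often than the model expects and down otherwise, so the iteration stops changing once the constraints hold.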

Results
In order to test the performance of the NBAME model, after collecting and preprocessing the games' statistics, we turn to the problem of predicting the outcomes of NBA playoff games for each season individually from the 2007-08 season to the 2014-15 season. We ran experiments on the dataset using the NBAME model and some other machine learning algorithms.

Data Collection and Preprocessing
We created a crawler program to extract the 14 basic technical features of both teams and the home team's win or loss from http://www.stat-nba.com/, collected a total of 10,271 records for all games for seasons ranging from the 2007-08 season to the 2014-15 season, and stored them in a MySQL database.
After the original dataset was obtained, we cleaned it using Java 1.7. First, we combined the two teams' 14 basic technical features of the same game into a single record for the game. Each game record therefore contained 28 basic technical features and a label indicating a win or loss for the home team. Second, we calculated the mean of each basic technical feature over the most recent six games prior to the candidate game being predicted. If a team had played fewer than six games before the candidate game, we took the mean of the basic technical feature over all of its games prior to the candidate game. We cannot predict the outcome of the first game of each season because of the absence of prior data. Table 2 shows the home team's most recent six games' basic technical features obtained from the website and their mean values that we used for predicting the upcoming game.

Subscripts h and a in Table 3 indicate the home team and away team, respectively; for example, $FGM_h$ means Field Goal Made by the home team. The abbreviations are derived from Table 1. As shown in Table 3, each training example is of the form $(x_i, y_i)$, which corresponds to the statistics and outcome of a game. $x_i$ is a 28-dimensional vector that contains the input variables, and $y_i$ indicates whether the home team won ($y_i = 1$) or lost ($y_i = 0$) that game. The first 28 columns hold the basic technical features for each team, obtained by averaging the previous six games played by the corresponding team. The 29th column, labeled "Home team win", is the actual outcome of the game and takes only two values: 1 indicates that the home team won and 0 indicates otherwise. We used this basic technical features dataset to train the NBAME model by the principle of Maximum Entropy and to predict the result of the coming game during the NBA playoffs for each season.

According to the Maximum Entropy principle, the NBAME model needs to be trained on a sufficient amount of training data. However, training data in each season is limited, and thus there is a threat of over-fitting: if there are so many feature functions that the number of training samples is lower than the number of feature functions, the probability distribution model will over-fit, resulting in high variance. Consequently, we get better performance on the training data but low accuracy on testing data.
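The averaging rule described above (mean over the most recent six games, or over all earlier games when fewer than six exist) can be sketched as follows. The cleaning in this study was done in Java; this Python sketch is only an illustration, and the function name `recent_means` and the sample numbers are assumptions.

```python
def recent_means(history, window=6):
    """Mean of each feature over a team's most recent `window` games.

    history: list of per-game feature lists, oldest first, containing only
    games played before the candidate game.
    Returns None when there is no prior game (first game of the season).
    """
    if not history:
        return None
    recent = history[-window:]  # uses all games when fewer than `window` exist
    n_features = len(recent[0])
    return [sum(game[j] for game in recent) / len(recent)
            for j in range(n_features)]

# Hypothetical two-feature histories, e.g. [points, field goals made].
short_history = [[100, 40], [90, 38], [110, 42]]
long_history = [[1000, 0]] + [[100, 40]] * 6  # the oldest game must be ignored

short_means = recent_means(short_history)
long_means = recent_means(long_history)
no_history = recent_means([])
```

Applying the function to the home and away team separately and concatenating the two results gives the 28-value input vector for one game.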
We used k-means clustering for data discretization with R version 3.2.2. We applied the clustering software package [44], using the Partitioning Around Medoids (PAM) function to cluster the data of each feature. The number of clusters is an input parameter, and its value affects the quality of the clustering. A crucial choice was therefore the number of clusters to use; the Silhouette Coefficient (SC) [45], which combines the degree of cohesion and the degree of separation, can be used to make this choice. The SC takes a value between −1 and +1 and indicates the effectiveness of the clustering: the greater the value, the better the clustering result. According to this principle, we can try several candidate numbers of clusters, calculate the SC for each of them, and then choose the number of clusters with the highest SC.
We calculate the SC of the away teams' score when k ranges from 3 to 10 (two clusters are not enough to distinguish the data clearly). Figure 1 shows the relationship between the k value and the SC when k-means clustering is used to discretize the away teams' score in the 2014-15 season; the SC value changes irregularly as the number of clusters increases from 3 to 10. We note that when k is 3, the SC is at a maximum with a value of 0.545. Thus, the number of clusters for the away teams' score is set to 3.
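The selection rule can be sketched as follows for one-dimensional data. The study computed the SC in R; this Python version is an illustrative stand-in that uses absolute difference as the distance and treats singleton clusters as having silhouette 0. The function names and the toy labelings are assumptions, and each labeling must contain at least two clusters.

```python
def silhouette_coefficient(values, labels):
    """Mean silhouette over all points of a 1-D clustering."""
    clusters = {}
    for v, lab in zip(values, labels):
        clusters.setdefault(lab, []).append(v)
    total = 0.0
    for v, lab in zip(values, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(abs(v - u) for u in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(abs(v - u) for u in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(values)

def choose_k(values, labelings):
    """labelings maps k to a list of cluster labels; return the best k by SC."""
    return max(labelings,
               key=lambda k: silhouette_coefficient(values, labelings[k]))

# Toy data with two obvious groups and two candidate labelings.
data = [1, 2, 3, 11, 12, 13]
candidates = {2: [0, 0, 0, 1, 1, 1], 3: [0, 0, 1, 2, 2, 2]}
best_k = choose_k(data, candidates)
sc_two = silhouette_coefficient(data, candidates[2])
sc_three = silhouette_coefficient(data, candidates[3])
```

In practice the labelings would come from running the clustering algorithm once per candidate k, and the same comparison would be repeated for every feature.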
Figure 2 shows the discrete values of the away teams' score after k-means clustering when the SC is 0.545; the distribution in each cluster is indicated by a different color. The top (blue) cluster contains games whose away team scores range between 104 and 125. Ranges for the green (middle) and red (bottom) clusters are 97 to 103 and 80 to 96, respectively. We use k-means clustering to discretize the home teams' score values and the other basic technical features for each game in the same way. Some samples of the experimental dataset can be seen in Table 4.

Subscripts h and a in Table 4 indicate the home team and away team, respectively; for example, $FGM_h$ means Field Goal Made by the home team. The abbreviations are derived from Table 1. In Table 4, the first 14 columns represent the home teams' basic technical feature values after k-means clustering discretization. The last column is the home teams' actual win or loss in the game. The remaining columns represent the away teams' basic technical feature values after k-means clustering discretization. This is also the final dataset that is used to train the NBAME model and make predictions for the NBA playoffs. We sort the records by date, separate them by season, save the data for each season to a file, and then use the data in each file to train and test the NBAME model repeatedly.

We used the feature vectors to construct the NBAME model with the Maximum Entropy principle and trained the parameter $\lambda_k$ with the GIS algorithm. Then, we applied the 28 basic technical features of the coming game to the NBAME model and calculated the probability of the home team's victory in the game, $p(y|x)$. Since $p(y|x)$ is a continuous value, the model makes a prediction based on a defined threshold: with a threshold of 0.5, it makes a prediction based on the conditions set in Equation (11) (meaning that if our model outputs a probability greater than or equal to 0.5, we decide that the home team wins; otherwise, we decide that the home team loses):

$$\text{prediction} = \begin{cases} 1\ (\text{win}), & p(y|x) \geq 0.5, \\ 0\ (\text{lose}), & p(y|x) < 0.5. \end{cases} \tag{11}$$
Finally, we compared the decision of our model to the true outcome of the game. If they were the same, the prediction of the NBAME model was counted as correct, and we added 1 to the count of correct predictions. In this way we obtained the total number of correct predictions, which we divided by the number of instances in the test dataset to obtain the model's forecast accuracy. Accuracy was used as the performance measure, and it was calculated by the following formula:

$$\text{Accuracy} = \frac{\text{number of correct predictions}}{\text{number of predictions}}. \tag{12}$$
The NBAME model outputs the probability of the home team's win in the upcoming game given the coming game's features. The home team is more likely to win if the model outputs a probability greater than the threshold value. It is important to note that setting a higher confidence threshold improves the accuracy of our model's predictions, with the drawback that fewer games are predicted. For example, if we set a threshold of 0.6, the model makes predictions based on the conditions defined in Equation (13), implying that it will not make a prediction for games with output probabilities between 0.4 and 0.6:

$$\text{prediction} = \begin{cases} 1\ (\text{win}), & p(y|x) \geq 0.6, \\ 0\ (\text{lose}), & p(y|x) \leq 0.4. \end{cases} \tag{13}$$

Tables 5 and 6 show the prediction results and the number of predicted games for each season using the defined thresholds of 0.5, 0.6, and 0.7. In Table 5, the first row shows the prediction results for eight seasons of NBA playoff games by the NBAME model using a threshold of 0.5 (with a 0.5 threshold, the model makes predictions for all the playoff games). We notice that at the 0.5 threshold, the prediction accuracy of the model reaches as high as 74.4% in the 2007-08 season. If we increase the threshold, the number of games for which we can make a decision decreases in every season. For example, the number of predicted games decreased from 86 to 48 when we increased the threshold from 0.5 to 0.6 in the 2007-08 season; however, prediction accuracy improved from 74.4% to 77.1%. Similarly, when we increased the threshold from 0.6 to 0.7 in the 2007-08 season, the number of predicted games fell from 48 to 6, with a 22.9 percentage-point increase in prediction accuracy. This shows that we can trade the number of games for which we make a prediction for improved prediction accuracy, which can be of great commercial value. The results show that the proposed model is suitable for forecasting the outcome of NBA playoffs while achieving high prediction accuracy.
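The trade-off between the number of predicted games and accuracy can be sketched as follows. The function names `decide` and `evaluate`, the treatment of the exact boundary values, and the probabilities in the example are assumptions made for illustration; they are not results from this study.

```python
def decide(p_home_win, threshold=0.5):
    """Return 1 (home win), 0 (home loss), or None (no decision).

    A decision is made only when one of the two outcomes has probability
    at least `threshold`; thresholds are assumed to lie in [0.5, 1).
    """
    if p_home_win >= threshold:
        return 1
    if 1 - p_home_win >= threshold:
        return 0
    return None

def evaluate(probabilities, outcomes, threshold=0.5):
    """Return (accuracy over decided games, number of decided games)."""
    decided = [(decide(p, threshold), y) for p, y in zip(probabilities, outcomes)]
    decided = [(d, y) for d, y in decided if d is not None]
    if not decided:
        return None, 0
    correct = sum(1 for d, y in decided if d == y)
    return correct / len(decided), len(decided)

# Made-up model outputs and true outcomes for five games.
probs = [0.9, 0.55, 0.45, 0.2, 0.65]
truth = [1, 0, 0, 0, 0]

acc_05, n_05 = evaluate(probs, truth, 0.5)  # every game is predicted
acc_06, n_06 = evaluate(probs, truth, 0.6)  # games near 0.5 are skipped
acc_07, n_07 = evaluate(probs, truth, 0.7)  # only confident games remain
```

On this toy input, raising the threshold shrinks the set of predicted games while the accuracy on the remaining games rises, mirroring the pattern reported in Tables 5 and 6.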
Figure 3 shows the effect of varying thresholds on the number of predicted games and prediction accuracy for playoffs during the 2007-08 season and the 2014-15 season.
We also used Receiver Operating Characteristics (ROCs) [46,47] and the Area Under Curve (AUC) [48,49] to evaluate the quality of our NBAME model. We imported the probability of the home team's winning and the true outcome of each game into R, used the prediction and performance functions of the ROCR package 1.0-7 [50] to plot the ROC curves, and calculated AUC values for the eight seasons; the results are presented in Figure 4.
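The AUC has a simple interpretation that can be computed without plotting the curve: it is the probability that a randomly chosen home win receives a higher predicted probability than a randomly chosen home loss, with ties counted as one half. The sketch below uses this pairwise definition in Python as a stand-in for the R workflow described above; the function name and the example scores are assumptions.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Each positive/negative pair scores 1 if ranked correctly, 0.5 on a tie.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

# Made-up predicted home-win probabilities and true outcomes.
perfect = auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
partial = auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0])
uninformative = auc([0.5, 0.5, 0.5, 0.5], [1, 1, 0, 0])
```

A value of 0.5 corresponds to a model that ranks games no better than chance, which is the baseline against which the AUC values in this section should be read.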

Comparison of NBAME Model with Some Selected Existing Machine Learning Algorithms
To evaluate the NBAME model, we compared its performance with other selected machine learning algorithms (Naive Bayes, Logistic Regression, Back Propagation (BP) Neural Networks, Random Forest) in the Waikato Environment for Knowledge Analysis (WEKA 3.6) [51]. Table 7 shows the results obtained when the features in Table 1 were used together with these algorithms to predict the outcome of NBA playoffs between 2007 and 2015, and Figure 5 presents a graphical representation of the results. From Table 7 and Figure 5, we see that our model outperformed all of the other classifiers for all seasons under consideration except the 2010-11 season and the 2012-13 season, where it was outperformed by Neural Networks and Random Forest, respectively. The Random Forest algorithm follows closely in second position. Naive Bayes had the lowest prediction accuracy, with an average of about 60%, which may have been caused by its assumption that all the features were independent, which was not the case. Accuracy results from the Neural Networks vary widely between seasons. For example, the Neural Networks registered a prediction accuracy of 67.9% in the 2010-11 season but only 52.4% in the 2009-10 season. These variations could be explained by the small size of the training dataset, which may have caused the model to overfit the data. Standard Logistic Regression, also a log-linear algorithm, had a relatively stable prediction accuracy for all seasons, similar to the NBAME. The NBAME outperformed the standard Logistic Regression because the former avoids overfitting by using regularisation techniques.
We give the AUC values in Table 8, which let us view the performance of our NBAME model from another perspective, and Figure 6 shows a graphical representation of the same values. Figure 6 shows that each algorithm's AUC value is not very high, owing to the high number of features combined with the small size of the training dataset [52]. The NBAME model is at or near the top in all seasons except 2012-13 and 2013-14. All algorithms show similar trends across seasons. For example, they all performed very well in the 2012-13 season while experiencing their worst performance in the 2011-12 season. This indicates that some seasons are more difficult to predict than others. The difficulty in accurately forecasting the results of a particular season may be triggered by unanticipated factors in that season; for example, the low performance in the 2011-12 season can be explained by the lockout that reduced the number of games from 82 to 66, thus reducing the training dataset size; in the same season, Derrick Rose, Joakim Noah, and David West were injured, leading to their failure to participate in the playoffs. Similarly, the controversy regarding Clippers' owner Donald Sterling's racist comments, which arose during the 2013-14 season playoffs and attracted protests from the Clippers and players of all NBA teams, could have reduced the players' morale, resulting in a very unpredictable season.

Conclusions
We applied the Maximum Entropy principle to construct the NBAME model and used the model to predict the outcome of the NBA playoffs from the 2007-08 season to the 2014-15 season. As seen in Section 4, the NBAME model is a good probability model for the prediction of NBA games. The prediction of NBA playoff outcomes is a very difficult problem because many unforeseeable factors, such as the relative strengths of the two teams, the presence of injured players, players' attitudes, and team managers' operations, determine the winner or loser. Overall, the NBAME model is able to match or perform better than other machine learning algorithms.
The predictive model in this research used the mean of each basic technical feature over the most recent six games of both sides before the game started to accurately predict the outcome of the upcoming game. Possible extensions to this research include exploring better methods to calculate the values of the features for the coming game, such as using more effective algorithms to preprocess the features of the NBA dataset or looking for measures of comprehensive strength to use as features.

Figure 1. Silhouette Coefficient (SC) with the change of clusters.

Figure 3. The number and accuracy of predictions with different confidence by the NBAME model from the 2007-08 season to the 2014-15 season playoffs.

Figure 4. ROC curves and AUC values of prediction using the NBAME model from the 2007-08 season to the 2014-15 season playoffs.

Figure 5. Comparison of the accuracy of the NBAME model against some machine learning algorithms.

Figure 6. Comparison of AUC of the NBAME model against some machine learning algorithms.

Table 1. Basic technical features used by the model.

Table 3 shows sample records of the mean values of features computed as demonstrated in Table 2 for games on 31 December 2014.

Table 3. Sample records of the experimental dataset obtained by getting averages of the previous six games.

Table 4. Discretized sample records of the experimental dataset.

Table 5. Prediction accuracy (in percentages) of the NBAME model with different thresholds.

Table 6. The number of prediction games of the NBAME model with different thresholds.

Table 7. Prediction accuracy (in percentages) of selected algorithms for NBA playoffs for seasons between 2007 and 2015.

Table 8. AUC (in percentages) values of selected algorithms for NBA playoffs for seasons between 2007 and 2015.