Regression Tree Model for Predicting Game Scores for the Golden State Warriors in the National Basketball Association

Abstract: Data mining is increasingly used in sports. Sport data analyses help fans to understand games and teams’ results. Information provided by such analyses is useful for game lovers. Specifically, the information can help fans to predict which team will win a game. Many scholars have devoted attention to predicting the results of various sporting events. In addition to predicting wins and losses, scholars have explored team scores. Most studies on score prediction have used linear regression models to predict the scores of ball games; nevertheless, studies have yet to use regression tree models to predict basketball scores. Therefore, the present study analyzed game data of the Golden State Warriors and their opponents in the 2017–2018 season of the National Basketball Association (NBA). Strong and weak symmetry requirements were identified for each team. We developed a regression tree model for score prediction. After predicting the scores of each player on two teams, we summed and compared the predicted total scores to obtain the predicted results (lose or win) of the team of interest. The results of this study revealed that the regression tree model can effectively predict the score of each player and the total score of the team. The model achieved a predictive accuracy of 87.5%.


Introduction
Advanced statistical methods are commonly used across research fields. For example, a soft computing model with a learning approach has been used to address data management over social networks [1]. Dulebenets et al. [2] applied regression models to estimate the effects of various factors on the driving ability of individuals. Andrée et al. [3] estimated a penalized non-parametric model of environmental output across economic development. A multivariate random parameter Tobit model was utilized to determine the factors that drive both the crash occurrence probability and the crash rate of 65+ roadway users [4]. Narasingam et al. [5] applied sparse regression to determine the structure of a reduced-order model of a hydraulic fracturing process.
Since the 20th century, sport has grown globally [6,7]. Professional sporting games, such as basketball, baseball, tennis, and golf, and events such as the World Football Championship and the Olympic Games not only attract the attention of many fans but also continue to create extremely high output value for the sport industry. The Internet has enabled sport betting to develop rapidly. In most countries, sport betting is inextricably linked to professional sport. In sport betting, increasing the betting win rate requires, in addition to subjective judgements, predictions from historical data on the game, such as predictions of total results, total over or under, handicaps, and point spreads.
Scholars in most studies have applied linear regression models to predict the scores of various ball games. However, studies have yet to use the regression tree method to predict basketball scores.

Related Studies
Several scholars have devoted attention to predicting the results of various sporting events and have conducted research on basketball, baseball, football, cricket, and other ball games [13][14][15][16]. Thabtah et al. [13] applied naive Bayes, neural network-like, and decision tree machine learning methods to various feature sets, in order to construct prediction models. By comparing the respective prediction accuracy rates, they could select the model with superior performance and determine the key factors affecting the results of the game. Valero [14] analyzed 10 years of Major League Baseball (MLB) regular season game data, using four data mining methods, namely lazy learners, artificial neural networks, support vector machines, and decision trees; the goal was to evaluate the abilities of the aforementioned classification- and regression-based methods in predicting game outcomes (home team win or lose) in MLB games. Razali et al. [15] used Bayesian networks to predict home victories, away victories, and draws in the English Premier League. Pathak et al. [16] applied modern classification techniques, namely naive Bayes, support vector machine, and Random Forest, to predict the outcomes of One Day International (ODI) cricket matches.
Loeffelholz et al. [17] collected 620 NBA games and used neural networks to predict the success of basketball teams. The selection of features input to the neural networks as the most salient features for prediction from signal-to-noise ratios and expert opinions was also discussed in this study. Cao [18] collected the data of five regular NBA seasons and applied machine learning algorithms to build models for predicting the NBA game outcomes.
In addition to the prediction of victory or defeat, many scholars have studied team scores. For example, Harville [19] proposed a linear model for predicting differences in scores for college basketball or football games, using the difference in team effects plus or minus the home-field or field advantage. To fit a relevant linear model, Harville proposed an improved method of least-squares estimation and applied team estimates, in order to rank teams in the league. The study findings revealed that the results of the playoffs could be effectively predicted. Karlis and Ntzoufras [20] proposed bivariate Poisson models for analyzing goals scored by two teams and adjusted the models to increase the probability of a draw. Adam [21] attempted to extend the bivariate Poisson method through a generalized linear model. The score was modeled as the joint probability of a Poisson distribution representing the total number of goals and a binomial distribution representing a team goal.
Wheeler [22] first used the chi-square test to screen input variables; the threshold was set to 0.05, excluding features not exceeding the threshold, and 16 variables were finally obtained. Linear regression was then used to calculate the average score of each player according to the characteristic variables and to obtain the sum of the predicted scores for the two players on the field. Finally, the results of the two teams' matches (win/lose) were compared. In addition, to compare the performance of the linear regression model with other benchmark models, feature variables were input into naive Bayes and support vector machine (SVM) classifiers, to obtain classification results. The linear regression output value was converted into two classification values for comparison. The linear regression model predicted a player's average score, and the converted win-loss classification result had an error rate of 53%. The naive Bayes and SVM classifier error rates were as low as 31%.
Singh et al. [23] proposed two separate models for predicting the match results for the ODI. They used data for the 2013 and 2014 ODI competitions as training and testing sets and conducted 10-fold cross-validation. Linear regression was used to predict the final score of the first round of ODI, and a naive Bayes classifier was used to estimate the probability of a team winning the second round. The results showed that, for prediction of the final score, the errors of the linear regression model were less than those of the current Run Rate method. As the game progressed, the accuracy of the naive Bayes prediction of the game results increased from 70% to 91%.
Wiseman [24] predicted the winning score of events on the PGA Tour, using first-round data. The author used linear regression, neural network regression, Bayesian linear regression, decision forest regression, and boosted decision tree regression models and compared the performance of the methods. Models were constructed by using data from 2004 to 2015 and validated by using the 2016 tournament. Correlation matrix analysis was conducted for various features. The first-round lead score, first-round average score, event, course yardage, and total prize money were selected as forecast indicators, and the R-squared and mean square error (MSE) values were used as evaluation indicators. The results revealed that the linear regression and Bayesian linear regression models were superior to the other models.
Lu et al. [25] analyzed games from five seasons (2012 to 2016) and established a least squares fit model, using previous game results, team ability, and home advantage, to predict the point difference for each team. In another study [26], linear regression models depending on total over/under were fit to data from before the all-star break, checked for adequacy, and used to predict the final score difference between home and away NBA teams for the 2011-2012 regular season.

Data
In this study, we considered the Golden State Warriors (GSW), one of the 30 NBA teams, for analysis. All teams competing with the GSW were regarded as opponents. We chose the GSW because, in the traditional view of basketball, the closer to the basket a shot is taken, the more reliably a team scores and wins, and the three-pointer is merely an auxiliary way of scoring. The rise of Stephen Curry overturned this traditional view and opened the modern era of "three-pointer" basketball. In addition, the Golden State Warriors were the champions of the NBA's first season, and the team has won a total of six league championships. Other teams can also be modeled and predicted by using the method proposed in this article.
We executed the data collection step by capturing GSW player match data for the 2017-2018 season from the Basketball Reference website [27]. Because the final purpose was to obtain prediction results through prediction scores, GSW opponents' data must also be collected. After the collection of the data, the missing values were deleted. The missing data in this dataset comprised only two major items: Inactive and Player Suspended. We manually removed data fields indicating "Inactive" and "Player Suspended". The number of records is not fixed for each player on the field. Some teams change players frequently. Each team has a total of 30 database fields, as shown in Table 1. The first eight items are related to the event and were not considered in the prediction model in this study. For the ninth item (i.e., Games Started), players in the starting lineup are usually the best players on the team (see [28]). If the best player acts as a starter but does not contribute to the team, the player is considered to have hindered the team. The corresponding variable was therefore selected to establish whether it is relevant to the score. Regarding the 10th item (i.e., Minutes Played), Martínez and Martínez [29] indicated that no linear correlation exists between score and playing time. However, we used the M5Prime model tree algorithm (M5P) [30], which predicts nonlinear continuous data; therefore, this variable was selected and used in the prediction model. The 11th to 28th items pertain to personal data contributed by players to the team in the game. Among them, Field Goals, 3-Point Field Goals, and Free Throws are absolutely linearly related to the score (2 × FG + 3P + FT = PTS). Therefore, the above three items were excluded in the prediction model. In addition, the 28th item represents the player's score, which we used as an output item. Game Score (29th item) and Plus/Minus (30th item) are the player efficiency level and personal goal difference. 
Both items are calculated based on the player's personal data; therefore, they were not used in the prediction model. Finally, the items removed from the dataset were player uniform number, rank, season game, date, age, team, home/away, opponent, field goals, 3-point field goals, free throws, game score, and plus/minus.
Table 1 presents a summary of the variables considered in the regression tree model for this study, including the variable fields and variable abbreviations. The last column in the table indicates whether the variables were included in the model in this study. Sixteen input variables were used, and the output variable was the actual points scored.
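The exclusion of Field Goals, 3-Point Field Goals, and Free Throws follows directly from the identity 2 × FG + 3P + FT = PTS. The following minimal sketch checks that identity on one record and drops the excluded fields. It assumes Basketball Reference-style abbreviations, in which FG counts every made field goal (three-pointers included); the stat line, the `EXCLUDED` set, and both function names are hypothetical illustrations, not part of the original workflow.

```python
# Hypothetical box-score line; abbreviations are assumed to follow
# Basketball Reference, where FG includes made three-pointers.
stat_line = {"MP": 34, "FG": 10, "3P": 4, "FT": 6, "AST": 7, "GmSc": 25.1, "PTS": 30}

def points_from_makes(fg: int, three_p: int, ft: int) -> int:
    """Points implied by made shots: 2 per field goal, 1 extra per three, 1 per free throw."""
    return 2 * fg + three_p + ft

# Fields the text lists as removed because they determine (or derive from) the score.
EXCLUDED = {"FG", "3P", "FT", "GmSc", "+/-"}

def model_inputs(row: dict) -> dict:
    """Drop the excluded fields and the target (PTS) from one player-game record."""
    return {k: v for k, v in row.items() if k not in EXCLUDED and k != "PTS"}

implied = points_from_makes(stat_line["FG"], stat_line["3P"], stat_line["FT"])
inputs = model_inputs(stat_line)  # only MP and AST remain in this toy record
```

Because the three shooting totals reproduce PTS exactly, keeping them as inputs would let any regression model recover the target trivially instead of learning from the remaining statistics.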
The dataset was divided into a training set, validation set, and test set in an approximate ratio of 6:2:2. Across the 82 games, the number of players appearing in each game differs. To avoid a scenario in which data of the 3rd player in the 50th game are assigned to the training set and data of the 4th player in the same game are assigned to the validation set, we divided the dataset according to Season Game: games 1-50 formed the training set, games 51-66 formed the validation set, and games 67-82 formed the test set.
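The game-level split can be sketched as follows. This is a minimal illustration, assuming each record carries a `season_game` key (a hypothetical field name) and using synthetic rows; it only shows that all players from one game land in the same subset.

```python
def split_by_game(records, train_end=50, valid_end=66):
    """Split player-game records by season game number so that every
    player row from a given game falls into the same subset."""
    train, valid, test = [], [], []
    for rec in records:
        game = rec["season_game"]
        if game <= train_end:
            train.append(rec)
        elif game <= valid_end:
            valid.append(rec)
        else:
            test.append(rec)
    return train, valid, test

# Synthetic data: 82 games, each with a varying number of player rows.
records = [
    {"season_game": g, "player": p}
    for g in range(1, 83)
    for p in range(8 + g % 3)
]
train, valid, test = split_by_game(records)
```

Splitting on the game number rather than on individual rows means the subset sizes in rows only approximate 6:2:2, but no game is shared between subsets.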

Methods
The flowchart of the study procedure is illustrated in Figure 1. We considered three regression methods, namely regression tree, linear regression, and support vector regression models, for modeling, prediction, and comparison. After training the three regression models on the training set, we used them to predict the validation set and used the root mean square error (RMSE) as an error index (loss function) for the models. We determined the best of the three models and used it to predict player scores. This step was executed by using the test set. After predicting and summing up the scores of the players in each game, we obtained the team's predicted total score. By comparing the two teams' predicted total scores, we could obtain the predicted match result; finally, we could compare the predicted result with the actual result and calculate the accuracy rate of the predicted match results.
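The final steps of this procedure (summing player predictions per team, comparing the two totals, and scoring the win/loss predictions) can be sketched as below. All numbers are hypothetical; the 16-game example only mirrors a case with two wrong calls and does not reproduce the actual game outcomes. The function names and the tuple layout are assumptions made for illustration.

```python
from collections import defaultdict

def team_totals(predictions):
    """Sum predicted player points per (game, team).
    `predictions` is an iterable of (game, team, predicted_points)."""
    totals = defaultdict(float)
    for game, team, pts in predictions:
        totals[(game, team)] += pts
    return dict(totals)

def predicted_win(totals, game, team, opponent):
    """True if the team's predicted total exceeds the opponent's.
    A tie in predicted totals is treated as a loss in this sketch."""
    return totals[(game, team)] > totals[(game, opponent)]

def outcome_accuracy(predicted, actual):
    """Share of games whose predicted win/loss matches the actual result."""
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(actual)

# Hypothetical player-level predictions for one game.
preds = [(67, "GSW", 25.5), (67, "GSW", 18.0), (67, "OPP", 20.0), (67, "OPP", 21.0)]
totals = team_totals(preds)
gsw_wins = predicted_win(totals, 67, "GSW", "OPP")

# Hypothetical 16-game test set with two wrong calls.
predicted = [True] * 16
actual = [True] * 14 + [False] * 2
accuracy = outcome_accuracy(predicted, actual)  # 14 / 16
```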

Regression Tree
The overall process of the regression tree method is similar to that of the classification tree method, and a prediction value is obtained at each node. Classification trees are used to process discrete data, whereas regression trees are used to process continuous data.
We constructed the model employed in this study by using the M5P tree regression algorithm in Weka software. M5P is a machine learning algorithm published by Wang and Witten in 1996 [31]. Its predecessor was M5, which was developed by Quinlan in 1992. Compared with traditional linear regression algorithms, M5P can accurately predict nonlinear data, and the rules and regression models are easy to interpret.
M5P is a binary regression tree model in which each leaf node holds a linear regression function that produces a continuous numerical output. The M5P algorithm includes four main steps. The first step entails dividing the input space into several subspaces to create a tree. Splitting criteria are used to minimize the variability within each subspace from the root to the nodes, where variability is measured by the standard deviation of the values reaching a node. The tree is constructed by using the standard deviation reduction (SDR) factor, which maximizes the expected reduction in error at a node, as expressed in the following equation:
$$\mathrm{SDR} = sd(S) - \sum_{i} \frac{|S_i|}{|S|}\, sd(S_i) \tag{1}$$
where $S$ is the set of data records arriving at the node, $S_i$ is the $i$th set obtained by dividing the node according to a given attribute, and $sd$ is the standard deviation. The second step entails developing a linear regression model in each subspace, using the data associated with that subspace. The third step involves applying pruning techniques to overcome the problem of overtraining. However, the pruning process may cause sharp discontinuities between adjacent linear models. The final step entails performing a smoothing process to compensate for these discontinuities. The smoothing process combines all models from leaf to root to create the final model of the leaf. In this process, the predicted value of a leaf is filtered along the path back to the root; at each node, the value passed up from below is combined with the value predicted by the linear regression model of that node, as follows:
$$E = \frac{n e + k a}{n + k} \tag{2}$$
where $E$ is the estimated value passed to the next highest node, $e$ is the estimated value passed from below to the current node, $a$ is the predicted value of the model at this node, $n$ is the number of training examples that have reached the node, and $k$ is a constant (see [31,32]).
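The splitting criterion and the smoothing step can be illustrated numerically. The sketch below assumes the standard forms used for M5-style model trees, SDR = sd(S) − Σ (|S_i|/|S|) sd(S_i) and E = (n·e + k·a)/(n + k); it is not the Weka implementation. The use of the population standard deviation and the default k = 15 are assumptions chosen for illustration, and the point values are made up.

```python
from statistics import pstdev

def sdr(parent, subsets):
    """Standard deviation reduction of a candidate split.
    Uses the population standard deviation (an assumption of this sketch)."""
    n = len(parent)
    return pstdev(parent) - sum(len(s) / n * pstdev(s) for s in subsets)

def smooth(e, a, n, k=15.0):
    """Blend the estimate passed up from below (e) with this node's own
    prediction (a); n is the number of training cases reaching the node."""
    return (n * e + k * a) / (n + k)

# Made-up point totals reaching a node, and two candidate binary splits.
points = [2, 4, 20, 22]
good_split = sdr(points, [[2, 4], [20, 22]])   # separates low and high scorers
poor_split = sdr(points, [[2, 20], [4, 22]])   # mixes them

smoothed = smooth(e=10.0, a=20.0, n=15)        # equal weight when n == k
```

The split that separates low from high values yields the larger SDR and would be preferred; with n equal to k, the smoothed value lies midway between the two estimates.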

Linear Regression
Linear regression is the simplest and most commonly used prediction model. A linear regression model describes the linear relationship between a continuous target variable and one or more predictor variables, and many datasets fulfil its basic assumptions of normality and linearity.
Linear regression models can be divided into simple linear regression and multiple linear regression models. Simple linear regression models entail the use of a single independent variable (X) to predict a dependent variable (Y). The regression equation can be expressed as follows:
$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i, \quad i = 1, \ldots, n \tag{3}$$
where $Y_i$ is the $i$th observed value of the dependent variable, Y; $X_i$ is the $i$th observation of the independent variable, X; $\beta_0$ is a parameter of the regression model (termed the intercept or constant term); $\beta_1$ is a parameter of the regression model (termed the regression coefficient or slope); $n$ is the number of observations; and $\varepsilon_i$ is the random error of the $i$th observation. Multiple linear regression models entail the use of two or more independent variables to predict a dependent variable (Y). The regression equation can be expressed as follows:
$$Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + \varepsilon_i, \quad i = 1, \ldots, n \tag{4}$$
where $Y_i$ is the $i$th observed value of the dependent variable, Y; $X_{ki}$ is the $i$th observation of the $k$th independent variable; $\beta_0$ is a parameter of the regression model (termed the intercept); $\beta_1, \ldots, \beta_k$ are parameters of the multiple regression model (termed the regression coefficients); $\varepsilon_i$ is the random error of the $i$th observation; $n$ is the number of observations; and $k$, a positive integer, is the number of independent variables.
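For the single-predictor case, the least-squares estimates of the intercept and slope have a closed form. The sketch below assumes the usual simple linear model Y = β0 + β1·X + ε and ordinary least squares; the minutes-played and points values are made up, and the function name is hypothetical.

```python
def fit_simple_linear(xs, ys):
    """Ordinary least-squares estimates (beta0, beta1) for Y = beta0 + beta1 * X."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta1 = sxy / sxx                  # slope
    beta0 = mean_y - beta1 * mean_x    # intercept
    return beta0, beta1

# Made-up example: minutes played (X) versus points scored (Y).
minutes = [10, 20, 30, 40]
points = [4, 9, 14, 19]
beta0, beta1 = fit_simple_linear(minutes, points)
predicted_25 = beta0 + beta1 * 25      # prediction for 25 minutes
```

The multiple-regression case generalizes this by solving the normal equations for all coefficients at once, which is what a statistics package does internally.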

Support Vector Regression
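Support vector regression (SVR) extends the support vector machine, which was originally developed to solve binary classification problems, and has been proven to be an effective tool in real-value function estimation [33]. As a regression method, SVR outputs a real number. SVR finds an optimal hyperplane that balances model complexity and prediction error. Its main advantages are that its computational complexity does not depend on the dimensionality of the input space and that it has excellent generalization capability, with high prediction accuracy. The prediction function of SVR is defined as follows [34]:
$$f(x) = \langle w, x \rangle + b, \quad w \in X,\ b \in \mathbb{R} \tag{5}$$
where $X$ denotes the space of the input patterns, and $\langle w, x \rangle$ denotes the dot product in $X$. The parameters $w$ and $b$ are estimated by solving the following optimization problem:
$$\min_{w,\,b}\ \frac{1}{2}\|w\|^2 + C\,\frac{1}{n}\sum_{i=1}^{n} L_\varepsilon(d_i, y_i) \tag{6}$$
$$L_\varepsilon(d_i, y_i) = \begin{cases} |d_i - y_i| - \varepsilon, & |d_i - y_i| \ge \varepsilon \\ 0, & \text{otherwise} \end{cases} \tag{7}$$
where $C\,\frac{1}{n}\sum_{i=1}^{n} L_\varepsilon(d_i, y_i)$ is the empirical error risk, which is obtained from the ε-insensitive loss function in Equation (7), with $d_i$ the target value and $y_i = f(x_i)$ the model output; $\frac{1}{2}\|w\|^2$ is a regularization term; and $C$ is a regularization constant. By introducing positive slack variables $\xi_i$ and $\xi_i^*$ into Equation (6), we get the following:
$$\min_{w,\,b,\,\xi,\,\xi^*}\ \frac{1}{2}\|w\|^2 + \frac{C}{n}\sum_{i=1}^{n}(\xi_i + \xi_i^*) \quad \text{subject to} \quad \begin{cases} d_i - \langle w, x_i \rangle - b \le \varepsilon + \xi_i \\ \langle w, x_i \rangle + b - d_i \le \varepsilon + \xi_i^* \\ \xi_i,\ \xi_i^* \ge 0 \end{cases} \tag{8}$$
The method of Lagrange multipliers is used to solve the optimization problem, and Equation (5) becomes the following:
$$f(x) = \sum_{i=1}^{n} (a_i - a_i^*) \langle x_i, x \rangle + b \tag{9}$$
where $a_i$ and $a_i^*$ are Lagrange multipliers, satisfying $a_i a_i^* = 0$, $a_i \ge 0$, and $a_i^* \ge 0$ for $i = 1, \ldots, n$, and the dual optimization problem is as follows:
$$\max_{a,\,a^*}\ -\frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}(a_i - a_i^*)(a_j - a_j^*)\langle x_i, x_j \rangle - \varepsilon\sum_{i=1}^{n}(a_i + a_i^*) + \sum_{i=1}^{n} d_i (a_i - a_i^*) \quad \text{subject to} \quad \sum_{i=1}^{n}(a_i - a_i^*) = 0,\ \ 0 \le a_i,\ a_i^* \le \frac{C}{n} \tag{10}$$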
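The two ingredients of the SVR objective described in this section, the ε-insensitive loss and the regularized risk, can be evaluated directly for a given linear model. The sketch below assumes the standard ε-insensitive loss (zero inside a tube of half-width ε, linear outside) and a risk of the form ½‖w‖² + C × (mean loss); it evaluates the objective only and does not solve the optimization problem. The function names and the data are hypothetical.

```python
def eps_insensitive(d, y, eps):
    """Epsilon-insensitive loss: zero if |d - y| <= eps, else the excess over eps."""
    return max(0.0, abs(d - y) - eps)

def regularized_risk(w, b, xs, ds, C, eps):
    """0.5 * ||w||^2 + C * (mean epsilon-insensitive loss) for the
    linear predictor f(x) = <w, x> + b."""
    preds = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in xs]
    empirical = sum(eps_insensitive(d, y, eps) for d, y in zip(ds, preds)) / len(ds)
    return 0.5 * sum(wi * wi for wi in w) + C * empirical

# Made-up one-dimensional example: the first target lies inside the tube
# (no penalty) and the second lies 0.5 outside it.
risk = regularized_risk(w=[1.0], b=0.0, xs=[[1.0], [2.0]], ds=[1.2, 3.0], C=2.0, eps=0.5)
```

Larger values of C put more weight on fitting errors outside the tube, whereas smaller values favor a flatter function with a smaller ‖w‖.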

Performance Evaluation
Several loss functions are used for regression, the most common being the MSE and the mean absolute error (MAE). The MAE averages the absolute deviations between the target and output values; therefore, positive and negative errors cannot cancel out, and the MAE effectively reflects the actual size of the prediction error. Nevertheless, the absolute value is not differentiable at 0, so the correction direction of a model cannot always be determined through differentiation. The MSE overcomes this disadvantage, but it is expressed in squared units and is therefore difficult to interpret. The solution is to use the RMSE, which is expressed in the same units as the target variable.
Accordingly, we used the RMSE to evaluate predictive performance. The RMSE is the square root of the sum of the squared deviations of the predicted values from the actual values divided by the number of observations, n. It describes the degree of dispersion of the prediction errors and can also be minimized in nonlinear fitting. The RMSE formula is as follows:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{Y}_i - Y_i\right)^2} \tag{11}$$
where $\hat{Y}_i$ is the predicted value, and $Y_i$ is the actual value.
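The three error measures discussed in this section can be computed side by side. The following is a minimal sketch using the standard definitions of MAE, MSE, and RMSE; the scores are made up, and the function names are chosen for illustration.

```python
import math

def mae(actual, predicted):
    """Mean absolute error: average size of the deviations."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    """Mean squared error: average squared deviation (squared units)."""
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error: square root of the MSE (original units)."""
    return math.sqrt(mse(actual, predicted))

# Made-up player scores: deviations are +2, -2, and +3 points.
actual = [10, 20, 30]
predicted = [12, 18, 33]
errors = {"MAE": mae(actual, predicted),
          "MSE": mse(actual, predicted),
          "RMSE": rmse(actual, predicted)}
```

Because squaring weights large deviations more heavily, the RMSE is never smaller than the MAE on the same data.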

Results
In this study, we used the training set to construct three regression models (regression tree, linear regression, and support vector regression) in Weka software. We subsequently used the three regression models to predict the validation data and then calculated the RMSE values of the models, in order to determine the optimal model. Finally, the optimal model obtained from the training and validation sets was employed for prediction, using the test set to measure model performance. Results obtained by using data for the GSW as an example are described in the following sections. Table 2 lists the opponent data corresponding to Season Game for the test set. The total scores of the last 16 GSW games were predicted; subsequently, the total scores of the 11 teams in Table 2 were also predicted. By comparing the predicted total scores of the two teams, we obtained the predicted outcomes. Finally, we compared the actual results with the predicted results and calculated the accuracy of win or loss predictions.

Model Validation Results
We applied the training set to construct the models and used the validation set to establish the best model; we then obtained the prediction results of the models, as well as the equations for the regression tree, linear regression, and support vector regression models. The modeling results are presented in Table 3. Figure 2 illustrates the regression tree, and Table 4 presents the regression equations. The linear regression equation is represented by Equation (12). According to Table 3, the regression tree model, which had the lowest RMSE, was the optimal model. Although constructing this model required a longer time than constructing the linear regression model, its RMSE value was the lowest among the three models. Therefore, the regression tree model was used to predict player scores in the subsequent step.

Model Test Results
Consider, for example, the test set data of GSW players for March 11, 2018; each row in Figure 3 contains information on players who played for the GSW on that day. According to the rules derived from the training set in Figure 2, we used the program to determine the applicable rule (LM1-LM6) for each row of test set data. Once the rule was determined, the input variables were entered into the corresponding equation, to obtain the predicted PTS in the last column of Figure 4. The actual total score of the GSW team on March 11, 2018, was 103 points, and the team's total score based on predictions was 108 points (Figure 4). We used the data for the remaining 15 test set games of the GSW and the test set data for the opponent teams, to obtain the total score from each prediction. Each team input into the M5P training model was subject to several rules, with each rule having a corresponding regression equation. The corresponding results for the opponents are presented in Appendices A to C. After obtaining the predicted scores of both parties in all test sets, we could predict the GSW match results by comparing the predicted GSW scores with the predicted opponent scores. Finally, we compared the predicted results with the actual win or loss results and then calculated the accuracy of the predicted match results. Table 5 lists the predicted results, on the basis of the test set, of the last 16 games of the GSW and their opponents in the 2017-2018 season. Figure 5 presents a line chart comparing the actual and predicted scores of the GSW and their opponents. Except for the prediction errors for the 67th and 77th games, all predictions were accurate (14 of 16 games), and the prediction accuracy was 87.5% (Table 5). In addition, the actual scores for some games were not significantly different from the corresponding predicted scores.
For some games, the actual scores were accurately predicted; for example, the predictions for the GSW in game 74, for both teams in game 81, and for the opponents in game 82 were accurate. Therefore, the regression tree model can effectively predict team scores.
Table 6 provides results from relevant studies. Several scholars have devoted attention to predicting game wins or losses. The use of machine learning models in earlier studies, to predict competition results, and the development of models based on other principles in recent years have engendered an increase in the accuracy of predictions. Miljkovic et al. [35] used four machine learning algorithms to predict the competition results and reported that the naive Bayes classifier had the best prediction accuracy rate (67%). Moreover, Cao [18] used four machine learning methods for prediction, including a naive Bayes classifier, and revealed that the logistic regression model achieved higher prediction accuracy than did the naive Bayes classifier. Cheng et al. [36] developed an NBAME model based on the principle of maximum entropy, to predict the outcome of games. They compared the performance of the NBAME model with traditional machine learning classifiers and reported that the NBAME model achieved a higher prediction accuracy rate (74.4%). To address the inability of SVMs to generate rules, Pai et al. [37] and Kaur et al. [38] combined SVMs with decision rules and fuzzy rules, respectively, to develop new models for predicting match outcomes. The results demonstrated that the models achieved higher accuracy than did the conventional SVMs. Linear regression was used in another study [22], and the accuracy achieved was 47%. Although this is much lower than the result from our linear regression model, the two results are not directly comparable, because different databases were used.

Discussion
Due to the uncertainty of game results, linear regression cannot be used to generate a regression equation that illustrates the linear relationship between variables and scores. The M5P regression tree algorithm used in this study could establish multiple regression models based on the distribution of data, and the prediction accuracy was determined to be higher than those of the linear regression model and support vector regression model. The present study differs from other studies in that it applied a regression tree model to predict the scores of players in two opposing teams for each match and then summed and compared the scores, to obtain the game results of the team of interest.

Conclusions
We conducted this study to develop regression tree and linear regression models by using data from two competing teams in a single NBA season. We predicted the scores of each player on the two teams and summed and compared the predicted scores of the two competing teams; thus, the win or loss results of the team of interest could be obtained. The results reveal that the regression tree model could predict player scores more accurately than the linear regression model. Any game is a complex system. The model proposed in this study yields favorable results for the prediction of the outcome of NBA games. This model can thus provide valuable prediction information for NBA team leaders and players. The limitations of our method and directions for future study include the following: (1) The procedures to determine the data rules and to obtain the corresponding equations for the prediction scores of each team were manual and had to be debugged to avoid errors, which was time-consuming. (2) Other factors may not have been considered in this study. For example, if a team's key player does not play due to injury, the team may score lower in the relevant match. Future studies can examine factors that were overlooked in the present study, to determine whether their inclusion can further improve predictive accuracy.
Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflict of interest.

Figure A1 presents the comparison of the prediction performance of two regression models of 11 opponents of GSW.

Appendix C
Tables A1-A11 present the regression equations of several regression trees of 11 GSW opponents.
