1. Introduction
Basketball is one of the most popular sports in the world, especially in the United States and Canada. Leagues such as the National Basketball Association (NBA) play a role comparable to that of the Champions League in soccer: they are extremely competitive and bring together the best teams and players in the world.
Its popularity in America and Europe makes basketball one of the main sources of sports data. The information available on this sport is highly varied, with a wealth of statistics from which patterns can be identified. Compared to sports such as soccer, basketball is far more quantifiable.
Its emphasis on individual and collective statistics makes basketball well suited for statistical analysis and an attractive domain for developing machine learning (ML) algorithms to identify patterns.
Basketball is a highly popular sport all over the world that draws significant attention from diverse audiences; it was the second most watched sport worldwide in 2023, behind only soccer [1]. Several fiercely competitive basketball leagues exist, but the NBA stands out from the rest due to its immense popularity and the consistent entertainment it provides year after year [2].
The NBA is the world’s foremost men’s professional basketball league, established in 1946 with 11 teams. Over the years, the league has expanded to 30 teams. Since the 2004–2005 season, the teams have been split into two conferences, the East and the West, each containing three divisions of five teams. A team roster can hold a maximum of 15 players, of whom only 5 may be on the court at the same time.
The professional basketball season consists of two parts, the regular season and the playoffs, which together span roughly eight months. During the regular season, each team plays 82 games, half at home and half on the road, and every team faces every other team at least twice. The main goal of this phase is to determine which teams qualify for the playoffs: the top eight teams in each conference advance, and their regular-season ranking determines the matchups. For instance, the first-placed team in the Eastern Conference plays the eighth-placed team from the same conference in a best-of-seven series, with four games at the higher-ranked team’s home and three at the lower-ranked team’s. The first team to win four games advances to the next round of the playoffs.
Sports such as basketball offer detailed statistical information on individual and collective performance, as can be seen in Figure 1 [3]. In the NBA, numerous individual awards are given throughout the season; they are based on athletic prowess, which is in turn reflected in statistical data.
These awards are given to outstanding basketball players who have performed exceptionally well in their respective positions throughout the regular season. The Most Valuable Player (MVP) award goes to the player who leads several statistical categories while keeping their team high in the standings, and awards such as Defensive Player of the Year go to players who lead the main defensive categories. The All-NBA teams recognize the top players in each position based on their statistical performance.
In terms of team statistics, the NBA maintains several rankings during the season, such as the best offensive team (teams that typically lead offensive metrics such as points, assists, or shooting efficiency) and the best defensive team (based on defensive metrics such as steals, blocks, and defensive rebounds).
All of these statistical metrics make the NBA an area of interest for ML and DL models, as such algorithms can readily uncover correlations in the data.
This article aims to provide insight into the current state of the NBA. The works analyzed in the literature use data from 2020 at the latest, which leaves a considerable gap to the present. Our aim was therefore to build a more up-to-date dataset including games up to the 2023/2024 season and to test various ML algorithms on it. The main contributions of this work are as follows:
From 2020 to 2024, several events may have changed the dynamics of the most competitive league in the world. Between seasons, the NBA introduces numerous changes, from player trades to rule modifications, which can cause entire teams to restructure. COVID-19, in particular, altered the league for a significant period; we intend to study and observe the disruptions caused by the pandemic.
Deep learning (DL) algorithms will also be studied to see if they can compete in predictive capacity with ML algorithms.
A prediction model for the Women’s National Basketball Association (WNBA) is also developed, in order to compare the two leagues and provide predictions for this up-and-coming league.
The article is structured as follows.
Section 1 introduces the NBA basketball league.
Section 2 presents a review of the literature on the application of ML in sports.
Section 3 outlines the process of developing the prediction algorithm for basketball games, including the creation of features through Feature Engineering. This section also provides an in-depth analysis of web scraping methods and data organization.
Section 4 explains the feature selection process and describes how the data were partitioned to preserve temporal integrity.
Section 5 discusses the results and examines the impact of COVID-19 on the NBA and WNBA.
Section 6 offers a brief discussion of the results and compares them with current trends in the scientific community. Finally,
Section 7 presents the main conclusions.
2. Related Work
ML is used in several forecasting studies [4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20]. Cao [4] built a model to predict the results of NBA games over the 2005/2006 to 2010/2011 seasons, using methods such as Logistic Regression (LR), Support Vector Machines (SVM), Naïve Bayes (NB), and Artificial Neural Networks (ANNs). The best accuracy obtained was 67.82%, with LR. Lidder [5] explored the ability of the LR algorithm to predict the results of NBA games between the 2014 and 2018 seasons, obtaining a prediction accuracy of around 68%. Houde [6] compared various ML methods for predicting NBA game outcomes using collective team statistics from the 2018 to 2021 seasons. An exploratory analysis identified significant features that influence game outcomes, and the author suggests adding variables such as recent team performance and the Elo rating to improve accuracy. The models tested were LR, the Random Forest (RF) Classifier, the k-Nearest Neighbors (KNN) Classifier, the Support Vector Classifier (SVC), Gaussian Naïve Bayes (GNB), and the XGBoost (XGB) Classifier. Among the evaluated classifiers, the best performing model was GNB, with 65% accuracy; the author notes that parameter tuning could enhance performance.
Ma [7] studied the predictive capacity of various algorithms for NBA playoff games. The playoffs have an interesting feature: each team plays its opponent in a best-of-seven series, which may make it easier to capture relationships arising from successive confrontations between two teams in a short space of time. This study used a dataset from the 1996 season to the 2020 season and tested several ML algorithms: LR, KNN, Linear Regression (LinR), GNB, and the Gaussian Process Classifier (GPC). The best-performing algorithm was LR, predicting playoff games with 92.2% accuracy. Josh [8] compared predictions based on team and player statistics by scraping data from the Synergy Sports website, which offers detailed NBA statistics. They scraped twelve seasons and faced the challenge of cleaning the data to remove irrelevant information. Features such as team form over the last ten games, the Elo rating, and individual player statistics were added to enhance prediction accuracy. LR and RF models were used; the RF model achieved the higher accuracy of 67.15%. Furthermore, player statistics were analyzed separately, with LinR achieving a lower accuracy of 58.66%, highlighting the greater predictive power of team statistics. Lunelli [9] tested various ML algorithms to predict the outcomes of NBA games and developed a betting system to exploit market inefficiencies for profit. Using a 15-year dataset covering around 20,000 matches from the 2003–04 to 2017–18 seasons, the author applied feature engineering to include game averages, Elo ratings, squad fitness, and top-player indicators. The Decision Tree algorithm achieved the highest accuracy at 68.6%. Despite this, the betting model only achieved a 4% return on investment, highlighting the challenge of building a consistently profitable AI-based betting system.
Bunker and Thabtah [10] proposed a data mining model that predicts the outcome of a game using team features. They argued that models based on ANNs are more accurate in predicting game outcomes, and they also discussed the challenges of processing data in a live setting. Malamatinos et al. [11] investigated ML techniques for predicting the results of soccer matches in the Greek Super League. Using historical match data and collective statistics, the authors evaluated algorithms such as ANN, SVM, and RF. The ANN showed the highest accuracy, at 55%, with "playing at home" among the most important factors.
Nassis et al. [
12] offer a comprehensive review of ML applications in soccer, with a focus on injury prediction and risk. Various ML techniques are discussed to analyze performance metrics such as player workload and biometric data. Key models include decision trees, SVMs, and DL networks, each offering unique advantages in processing complex datasets. The review highlights challenges such as data quality, privacy concerns, and the dynamic nature of soccer, suggesting the integration of ML algorithms into injury prediction to improve player safety and team performance.
Wu [
13] examines the causes of injuries in the NBA by analyzing game statistics from the 2011 to 2019 seasons, using data from the Pro Sports Transactions and Basketball Reference websites. Focusing only on in-game injuries, the authors used an RF algorithm to determine which variables most significantly predict injuries. They found that game overload is a critical factor, with high-risk variables including minutes played, the frequency of three-point shots, and top defensive performances. The study suggests that reducing the number of games per season could help mitigate injuries while maintaining the NBA’s financial model.
Ke et al. [
14] developed an ML framework that optimizes the construction of basketball teams in the NBA and WNBA. This prediction involves individual player performance, team chemistry, and collective success. Player statistics, individual attributes, and team data are collected and then processed, and feature engineering plays an important role in deriving performance indicators, quantifying player chemistry, and predicting injuries. A variety of algorithms are used, ranging from supervised learning to clustering and, finally, reinforcement learning. The study shows the breadth of tasks for which artificial intelligence can be useful in the world of sport.
Papageorgiou et al. [
15] aimed to evaluate the effectiveness of different ML models in predicting the performance of basketball players and teams. After the data were obtained and prepared, different ML models were applied, including supervised models, neural networks, and ensemble models. The results reveal the advantages and limitations of each model in the specific context of the analysis: simpler models such as LinR may not capture all the complexity of the data, while neural networks can offer more accurate predictions, although they are harder to interpret and require more computing power.
Du et al. [
16] propose a real-time analysis of basketball players’ foul actions using ML and image processing. Video is captured in real time and processed by computer vision algorithms to extract the most relevant characteristics. The extracted features are then analyzed by ML models to recognize patterns corresponding to different types of fouls, such as pushing, illegal blocking, or excessive physical contact. Once a foul has been identified, the system alerts the referees in real time.
Kai [
17] developed a predictive model of injuries in youth soccer players using ML-based text classification. The research focuses on analyzing large volumes of textual data, such as medical reports, training records, and injury histories, to identify patterns that may indicate the likelihood of future injuries. The model uses natural language processing techniques to transform these unstructured data into useful information, with the main objective of preventing injuries and creating more effective strategies to protect athletes.
Thabtah et al. [18] propose an intelligent ML framework to predict NBA game outcomes by identifying the influential features that affect results. They used a dataset covering all NBA Finals games from 1980 to 2017 and applied different feature selectors, such as Multiple Regression, the Correlation Feature Set (CFS), and RIPPER, to choose the best variables for the ML algorithms. The study evaluates several ML methods, including NB, ANNs, and Decision Trees, compares performance across different feature sets, and finds defensive rebounds (DRBs) to be the most significant factor. Other key features, such as three-point percentage (TPP), free throws made (FT), and total rebounds (TRB), improve prediction accuracy by a further 2–4%. Lundberg and Lee [
19] introduce SHAP (SHapley Additive exPlanations), a unified framework for interpreting the predictions of ML models. This framework consolidates and extends six prior explanation methods, addressing limitations in computational performance and interpretability found in earlier approaches. The novel SHAP methodology improves consistency with human intuition while enabling efficient computation of feature contributions. Li et al. [
20] present the Cell Transformer (CeT) which integrates the Cell Transmission Model (CTM) with a Transformer-based encoder–decoder to predict traffic states at signalized intersections. By modeling discretized lane cells as graph nodes and incorporating vehicle-type attributes, dynamic signal phases, and temporal embeddings, CeT captures complex spatial–temporal dependencies. Using the pNEUMA dataset, it significantly outperforms baseline models, reducing Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) while improving accuracy, demonstrating strong potential for autonomous vehicle trajectory planning.
3. Materials and Methods
Initially, it is necessary to define the seasons from which the data will be collected.
Figure 2 shows how metrics related to the 3-point shot have gained enough importance to become a key factor when deciding which data to collect [21]. As can be observed, from the 2009–2010 season to the 2017–2018 season, the number of three-pointers per season almost doubled. As an example, in the 2024 season, the team in first place in the Western Conference was the one with the highest three-point shooting efficiency in the entire league. Considering earlier seasons might not be worthwhile, because their 3-point statistics would not have the desired impact when training the model. Data were therefore collected from the 2015–2016 season to the recent 2023–2024 season; this ensures that 3-point statistics have the proper impact during training and gives the model eight seasons to train on, conveying robustness to the dataset.
The Basketball Reference website is a good source of data for training ML models [8].
The work will focus on the block diagram represented in
Figure 3. Although all processes are essential, feature engineering will be an essential step in obtaining the best results. Below is a brief description of the processes depicted:
Web Scraping: The data will be scraped from an NBA statistics website, and a series of features will be extracted using the Python programming language (version 3.12).
Data Processing: After obtaining the data, it is necessary to process it: remove unnecessary information and filter what is needed into a dataset.
Feature Engineering: New features will be derived from existing ones; it is important to simplify the information as much as possible so that the ML model has access to the relevant features it needs.
ML: Different ML techniques will be tested on two different datasets, one from the NBA and the other from the WNBA.
Results: Analysis of results and subsequent conclusions of the study.
Figure 3.
Diagram of the process.
Basketball statistics differ considerably from those of other sports: basketball is a fast-paced game in which the score fluctuates constantly, making it crucial to analyze all the metrics available during the game.
Figure 4 shows a standard box score of an NBA game, where 33 metrics will be considered from this website.
Table 1 shows a detailed description of the box score statistics. All of these metrics are important, but some are more prominent among popular analysts of the game. The field goal (FG) metric, which indicates a team’s shooting efficiency, is an important statistic, especially when evaluating individual players. Due to the popularity of the 3-point shot, statistics related to this particular aspect of the game have gained importance in recent years. The parameters related to rebounds mainly indicate how aggressively a team fights for loose balls; this metric is very relevant because it allows teams to obtain points through second-chance opportunities or to win possession from the opposing team.
The advanced box stats provide more detailed information used by sports experts and analysts, revealing aspects of the game that basic statistics hide. They need to be explained in more detail because they are derived from mathematical formulas and their impact is often undervalued.
Table 2 indicates the meaning of the advanced stats that will be used to perform the NBA and WNBA match prediction algorithm.
3.1. Feature Engineering
Feature engineering is the process of transforming raw data into features that fit the machine learning model. It is a method of selecting, extracting, or transforming the most relevant features from the available data to make the ML model as accurate as possible [
22].
The success of ML models is highly dependent on the quality of the features used to train them.
Figure 5 shows where feature engineering fits into the ML process. This technique helps to highlight the most important patterns and relationships in the data, allowing the model to learn from the data more effectively.
It is important to explain the features that will be created based on feature engineering so that it is clear what they represent.
Table 3 indicates the features that will be introduced with a brief description. The primary goal of these features is to help the ML model in visualizing statistics that may not be fully comprehensible using statistics obtained solely from the current dataset.
3.1.1. Last Games
Basketball is a high-volume scoring sport with very short breaks between games compared to other sports [
23]. The feature "Last Games" explores how teams arrive at the game to be predicted. Since basketball is a sport of momentum, there are periods in a season when nothing seems to go right for a team and others when everything goes perfectly. Wins are very important for morale, while player injuries can have a big impact on form, so this feature gives the model accurate information about the team’s current state.
In several of the works discussed, it is common to use the average of the last 10 games as the basis for this feature. Here, however, the number of games will be varied to see what impact it has on the algorithm.
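A minimal pandas sketch of this feature, assuming a chronologically sorted game log with hypothetical `team`, `points`, and `won` columns; the `shift(1)` keeps the current game out of its own average, so the feature only uses information available before tip-off:

```python
import pandas as pd

def add_last_n_form(games: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Append rolling form features over each team's last `n` games.

    Assumes `games` is sorted chronologically and has hypothetical
    columns `team`, `points`, and `won` (1 for a win, 0 for a loss).
    """
    out = games.copy()
    grouped = out.groupby("team")
    # shift(1) excludes the current game, so the feature reflects
    # only games played *before* the one being predicted.
    out[f"points_last_{n}"] = grouped["points"].transform(
        lambda s: s.shift(1).rolling(n, min_periods=1).mean()
    )
    out[f"win_rate_last_{n}"] = grouped["won"].transform(
        lambda s: s.shift(1).rolling(n, min_periods=1).mean()
    )
    return out
```

Varying `n` then amounts to calling this helper with different window sizes and comparing the resulting model accuracies.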
3.1.2. Rest Days Between Matches
The physical condition of teams is very important for sporting results. The NBA is widely criticized for the number of games each team plays per season, which is at least 82. Playing 82 games from the end of October to the beginning of April is quite a task; by comparison, a soccer team plays at least 30 league games a season, starting in mid-August and ending in May, which gives an idea of how brutal the NBA schedule can be. There are also back-to-backs, pairs of games on consecutive days, which are grueling ordeals that put teams to the test [24]: far more tired than usual, players often compete in different cities with no rest in between. It is logical for teams to perform worse in the second game of a back-to-back than in other games.
The ML algorithm will be provided with the interval of days between matches so that it can assess the physical condition of the players before they step on the court.
3.1.3. Information About Next Game
Information about the game to be predicted will be provided to the algorithm, since it would be impossible to infer this information from the statistics of the last match alone. There is a lot of relevant information that can be included, such as home-court advantage. This small detail has long been a topic of discussion in the NBA: past statistics have shown that the home team wins around 60% of the time [25], which makes it relevant for the algorithm to know which team is playing at home and which is playing away.
In American sports, teams playing away games do not have the same allocation of tickets for their fans as in European sports. Consequently, the atmosphere in the stadium tends to be more favorable for home fans. In addition, teams playing away often engage in a series of away games, leading to an extended period of time away from home and family.
Statistics on past matches will also be provided, especially regarding the next opponent. Care is needed when passing this information to the ML model: we must not include information that only becomes available during the game, as that would feed the model data that has not happened yet. Statistics such as who the next opponent is and their recent performance, however, are known before the game starts.
3.1.4. Elo Rating
Elo Rating is the most effective method for relativizing a team’s current form and performance in the NBA. The Elo Ratings are calculated as follows: all teams start with a score of 1500, and over the course of the games, points are added or subtracted based on the final result of each game. The exact formula for calculating the Elo Rating is defined in Equation (1):

R_new = R_i + k (S_team - E_team)    (1)

where
k: constant representing the maximum adjustment possible in a single game;
S_team: the actual score of the team (usually 1 for a win, 0 for a loss);
E_team: the expected win probability of the team, given by the standard Elo expectation E_team = 1 / (1 + 10^((R_opp - R_team) / 400));
R_i: the former Elo Rating, with R_new the updated one.
The Elo Rating is maintained from season to season. Teams tend to remain constant over time, and if a team is a title contender, it tends to stay so for a while before starting to fall.
As NBA teams play a large number of games and a single game does not have a big impact on the season, k tends to have a lower value than in sports with fewer games per season. For this type of work, k usually ranges from 16 to 32 and is chosen based on the desired sensitivity of the Elo Rating [26,27].
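The update in Equation (1), together with the standard Elo expectation, can be sketched as follows; the default `k = 20.0` is just one value inside the 16 to 32 range discussed above:

```python
def expected_score(r_team: float, r_opp: float) -> float:
    """Expected win probability of a team under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_opp - r_team) / 400.0))

def update_elo(r_team: float, r_opp: float, s_team: float, k: float = 20.0):
    """One Elo update per Equation (1): R_new = R + k * (S - E).

    `s_team` is 1.0 if the team won and 0.0 if it lost; the opponent
    receives the symmetric update, so total rating is conserved.
    """
    e_team = expected_score(r_team, r_opp)
    new_team = r_team + k * (s_team - e_team)
    new_opp = r_opp + k * ((1.0 - s_team) - (1.0 - e_team))
    return new_team, new_opp
```

For two evenly rated teams (E = 0.5), a win moves the winner up by k/2 and the loser down by the same amount, which matches the intuition that upsets move ratings more than expected results.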
3.2. Web Scraping and Data Organization
The first step in retrieving the data is to create a function that accesses the page from which the data are to be retrieved. Playwright will be used for this function. This library will emulate a browser, so it is possible to access the page and extract the code through an HTML selector. This function will work asynchronously, allowing the program to perform multiple tasks without having to wait for one task to finish before starting the next [
28].
It is necessary to set several retries and a time interval between requests to the website. The chosen sleep interval is very important because scraping a website too quickly can result in a server ban, which stops the scraping from working because the website can no longer be accessed. Equation (2) shows how the waiting time grows across attempts:

WT = S × R    (2)

where
WT: waiting time;
S: sleeping time;
R: number of the current retry.

If the program has to retry retrieving a page, it means an error has occurred, such as a temporary ban or a connection error, so it is important to wait between attempts; if a temporary ban has been issued, the program can re-enter the page once the suspension has been lifted.
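A minimal sketch of this retry logic, assuming the waiting time grows linearly with the attempt number (consistent with the variables WT, S, and R defined for Equation (2)); the `fetch` callable is a stand-in for the actual Playwright page request:

```python
import time

def fetch_with_retries(fetch, retries: int = 3, sleep_s: float = 5.0):
    """Retry a scraping call, waiting sleep_s * attempt between tries.

    `fetch` is any zero-argument callable that raises on failure
    (e.g. a temporary server ban or a connection error).
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except Exception as exc:  # temporary ban, timeout, HTTP error, ...
            last_error = exc
            # Wait longer on each retry so a temporary ban can expire.
            time.sleep(sleep_s * attempt)
    raise RuntimeError(f"all {retries} attempts failed") from last_error
```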
BeautifulSoup will be the library used to parse the site, converting the HTML into a navigable tree of Python objects. Looking at the site season by season, it is clear that the matches are organized by month and that each month has a table with links to the statistics for each match. It is therefore necessary to navigate through the months to obtain the tables in which the matches are listed. For example, when entering the website for the 2022–2023 season, there are boxes for the months of the season; clicking on one of these opens a page where the box score table contains links to the statistics of all the matches that took place in that month.
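A minimal BeautifulSoup sketch of pulling box-score links out of such a month table; the HTML snippet, the `schedule` table id, and the link paths are illustrative stand-ins, not the real site markup:

```python
from bs4 import BeautifulSoup

# Illustrative month page: a schedule table whose rows link to box scores.
html = """
<table id="schedule">
  <tr><td><a href="/boxscores/202210180BOS.html">Box Score</a></td></tr>
  <tr><td><a href="/boxscores/202210180GSW.html">Box Score</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Keep only anchors inside the schedule table that point at box scores.
links = [a["href"] for a in soup.select("table#schedule a")
         if "/boxscores/" in a["href"]]
```

In the real pipeline, `html` would be the page content returned by the Playwright fetch, and each collected link would then be visited to scrape the individual game statistics.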
After obtaining the individual statistics pages for each game, they need to be organized and stored in a dataset to be used later by an ML model.
The main objective will be to extract the totals from the table containing the basic statistics of the teams and the table containing the more advanced statistics; however, the page provides other relevant useful information such as which team played at home, the season in which it took place, and the date. This last piece of information will be essential to be able to organize the dataset chronologically later on. If the aim is to make predictions, it is important to maintain chronological order and use past matches to predict future matches.
The statistics provided indicate many parameters, but it is only possible to see who won based on the points scored and conceded. It is also necessary to provide this information to the dataset so that the model can identify winning and losing patterns when it is trained.
Basketball is a team sport in which only five players from each team can be on the court at the same time. Because of this limited number of players, the impact a single player can have on the game is greater than in a sport like soccer, for example, where eleven players from each team are on the pitch simultaneously. It therefore becomes necessary to obtain the numbers of the player who stands out from each team in each statistic. These data will have names similar to the other statistics, with the suffix “_max” added. For example, “Points” indicates the points the team scored, while “Points_max” indicates the points scored by that team’s highest-scoring player. In addition to the team statistics, this adds an extra signal for the common case in which a team has a key player who makes a difference in the game, a difference that translates mathematically into the statistics and that the model can identify. Moreover, a drop or rise in this parameter can indicate the absence of, or return from injury of, a key player.
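Deriving the team totals and the per-team “_max” columns from a player-level box score can be sketched with pandas as follows; the team codes and column names here are hypothetical stand-ins for the scraped statistics:

```python
import pandas as pd

# Illustrative player-level box score for one game.
players = pd.DataFrame({
    "team":     ["BOS", "BOS", "LAL", "LAL"],
    "points":   [35, 20, 28, 30],
    "rebounds": [10, 5, 12, 4],
})

# Team totals plus the best individual value, stored with a "_max" suffix,
# mirroring the "Points" / "Points_max" naming described above.
totals = players.groupby("team").agg(
    points=("points", "sum"),
    points_max=("points", "max"),
    rebounds=("rebounds", "sum"),
    rebounds_max=("rebounds", "max"),
).reset_index()
```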
It is also important to include all the opponent’s parameters in the match statistics. When trying to predict whether a team will win, the model also needs to know everything about its opponent. In this way, the model has information not only about the team whose result is being predicted but also about the other team, greatly reducing the probability of predicting that both teams will win or that both will lose.
3.3. Data Analysis
The analysis of the data obtained is important. Understanding how many games there are and how they are divided up in each dataset can be very important later when analyzing the data.
Figure 6 shows the distribution of games in the NBA and WNBA datasets by season. There are more than 2500 NBA games per season, except in three of the seasons. The 2019/2020 and 2020/2021 seasons were affected by COVID-19: the NBA reduced the number of games in those two years to protect players from contracting the virus. The 2023/2024 season will be used exclusively for testing.
For the WNBA, there are more than 400 games in most seasons, but one season, 2020, has substantially fewer. As in the NBA, COVID-19 affected this league, and preventive measures were taken to reduce the number of games. We also draw attention to the 2022 and 2023 seasons, when there was a significant increase in games due to the Commissioner’s Cup, a tournament within the league that resulted in a greater number of games in those seasons.
There are significant differences in the number of matches between the two datasets. The WNBA has only 12 teams, while the NBA has 30, and a WNBA regular season has only 40 games per team, compared to 82 in the NBA.
The playoff model also differs between the two leagues. The NBA uses best-of-seven series in all of its rounds, while the WNBA used best-of-three series until the 2021 season; only in the 2022 and 2023 seasons was a hybrid model implemented, in which the first round of the playoffs is best-of-three and the remaining rounds are best-of-five, resulting in more games to analyze.
Both competitions are highly competitive, but the structure of the NBA provides more data than that of the WNBA.
4. Data Preprocessing and Model Evaluation
This section provides insight into the development and operation of the ML model. In this case, it is essential to adhere to the chronological order of the data to ensure that the algorithm does not receive information about the future it is intended to predict.
For the development of the system, we used Scikit-learn, a widely used open-source library for ML in Python.
Figure 7 illustrates the sequence in which this library is used in the prediction process of basketball matches. The concepts are explained as follows:
Normalization: Scales all values in the dataset to the same range to facilitate interpretation by the ML model.
Feature Selection: Selects the features most relevant to predicting the target, to prevent overfitting.
Data Splitting: Splits the data in chronological order for the training and testing of the various ML algorithms.
Model Evaluation: Evaluates the accuracy of each ML model.
Figure 7.
Machine learning architecture.
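The data-splitting step can be sketched as a simple cut with no shuffling, so the test set always contains the most recent games and the chronological order is preserved:

```python
def chronological_split(games, test_fraction=0.2):
    """Split an already date-sorted sequence of games into train and test
    sets without shuffling, so the model never trains on future games."""
    cut = int(len(games) * (1 - test_fraction))
    return games[:cut], games[cut:]
```

This is the key difference from a standard random split: shuffling would leak future games into training, inflating the measured accuracy.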
4.1. Normalization
Normalization is crucial for this work because it avoids feature dominance. In datasets where features have different scales, those with numerically larger values can dominate during model training, resulting in a model biased towards large-scale features that ignores small-scale ones. Normalization also makes interpretation easier: by placing all features on the same scale, they become equally comparable with respect to the variable to be predicted.
Min-max normalization, one of the most popular normalization techniques, restricts the range of variables to between 0 and 1 (or −1 to 1, if there are negative values), adjusting the data accordingly. This approach is described by Equation (3): x′ = (x − x_min) / (x_max − x_min). It is particularly useful when the data do not follow a normal distribution or when the standard deviation is low. However, it is important to note that min-max normalization can be sensitive to extreme values, undermining its effectiveness in datasets with pronounced outliers.
The data will be normalized using this method with the Python Scikit-learn package. Features such as Team Averages were first normalized and only then was the feature created. This makes it possible to reduce the influence of outliers. Without normalization, outliers can distort the average or other features. Normalization before creating certain statistics can result in more balanced features and more robust models.
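As a minimal sketch (the statistic values below are illustrative), this scaling step with scikit-learn's MinMaxScaler could look as follows, fitting only on the training portion so that future games never influence the scaling parameters:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative raw features: rows = games, columns = (points scored, FG%)
X_train = np.array([[98.0, 0.45],
                    [112.0, 0.51],
                    [105.0, 0.48]])
X_test = np.array([[120.0, 0.50]])   # a later, unseen game

scaler = MinMaxScaler(feature_range=(0, 1))
# Fit only on the training data so the scaler never "sees" future games
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)   # may fall outside [0, 1]
```

Note that a test value above the training maximum (here, 120 points against a training maximum of 112) scales to a value above 1, which is one way the sensitivity to outliers mentioned above manifests itself.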
4.2. Feature Selection
In datasets with a high number of features, it is necessary to use a Feature Selector before applying the ML model. The NBA dataset has 993 features and the WNBA dataset has 951. Passing this many features to an ML model risks several problems that could jeopardize its accuracy in predicting games. Dimensionality reduction is very important in datasets with many features, some of which may be irrelevant or even detrimental to the model's performance. By removing these irrelevant or harmful features, the feature selector can improve the model's performance by reducing the noise in the data and focusing on the relevant features.
Complex models can fit the training data quite well, which is called overfitting, capturing the noise in the data. A feature selector helps mitigate overfitting by removing features that do not contribute significantly to the model’s ability to generalize. Fewer features also mean less training time, which is very beneficial in terms of computational performance.
ElasticNetCV is an algorithm used to determine the most critical features in ML. This method combines the regression regularization techniques of Lasso and Ridge, which are beneficial for feature selection. ElasticNetCV uses cross-validation to determine the optimal parameters of the model. A set of 50 logarithmically spaced values of the regularization parameter was evaluated to capture a broad spectrum of model complexities. Furthermore, we tested five mixing ratios to balance between the Lasso (L1) and Ridge (L2) regularization effects. The maximum number of iterations was set to 10,000 to ensure algorithmic convergence, and a fixed random_state of 100 was used to guarantee reproducibility.
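A sketch of this configuration with scikit-learn is shown below. The α bounds and the five l1_ratio values are assumptions (the exact grids are not reproduced in this excerpt), while max_iter and random_state follow the text; the fitted data are synthetic:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import TimeSeriesSplit

# 50 log-spaced regularization strengths; these bounds are an assumption,
# as the paper's exact range is not reproduced in this excerpt
alphas = np.logspace(-4, 1, 50)
l1_ratios = [0.1, 0.3, 0.5, 0.7, 0.9]  # assumed grid of five Lasso/Ridge mixes

selector = ElasticNetCV(
    alphas=alphas,
    l1_ratio=l1_ratios,
    cv=TimeSeriesSplit(n_splits=5),  # chronological splits, no future leakage
    max_iter=10_000,
    random_state=100,
)

# Synthetic example: only the first feature truly drives the target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)
selector.fit(X, y)

# Features whose coefficient was shrunk to exactly zero are discarded
kept = np.flatnonzero(selector.coef_)
```

The L1 component of the penalty is what drives coefficients to exactly zero, which is why the surviving nonzero coefficients can be read directly as the selected feature set.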
Cross-validation is a widely used method in ML to assess a model's ability to generalize, dividing the dataset into training and test subsets. However, when working with time series, it is important to respect the temporal order of the data; using "TimeSeriesSplit" instead of standard cross-validation is essential due to the sequential nature of time series data, as shown in Figure 8. In time series, the order of the data is crucial because each data point is related to the previous point and the next point. Standard cross-validation can scramble the data and break the temporal order, which would amount to data leakage: we would be looking into the future to find the relevant features. "TimeSeriesSplit", by maintaining temporal order, allows us to split the training and test data to find the most relevant features, with the training data guaranteed to precede the test data.
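The behavior described above can be illustrated with a small example, in which game indices stand in for dates:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten games in chronological order (indices stand in for dates)
games = np.arange(10)

folds = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(games):
    # Every training index precedes every test index: no peeking ahead
    folds.append((list(train_idx), list(test_idx)))
```

With ten samples and three splits, the training window expands fold by fold ([0–3], then [0–5], then [0–7]) while each test window contains only later games, exactly the property standard shuffled cross-validation would violate.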
Pearson’s correlation was also used to understand the correlation between the variables. Understanding correlation is essential, as highly correlated variables, despite their significance, may provide redundant information to the algorithm. Consequently, in instances where two features are highly similar, they may not provide as much relevant information to the ML algorithm due to their overlapping significance. It is important to remove one of the highly correlated features to avoid redundancy between the chosen variables.
Figure 9 shows a schematic of how the Feature Selector was developed. Initially, ElasticNetCV is used to remove all features that have an importance value of 0 for the “Target” variable, thus eliminating unnecessary variables that would not be useful.
After obtaining the features that have some importance, Pearson’s correlation is used to eliminate the highly correlated variables. The pairwise correlation of all variables is then calculated; those with more than 90% correlation are compared to the “Target” variable and the one with the least correlation to this variable is eliminated, thus keeping the feature that can most help predict the final result of the game.
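A hypothetical helper implementing this pruning rule could look as follows; the `drop_correlated` function and the column names are illustrative sketches, not the authors' exact code:

```python
import pandas as pd

def drop_correlated(df: pd.DataFrame, target: str, threshold: float = 0.9) -> list:
    """For each feature pair correlated above `threshold`, keep the one
    more correlated with the target (illustrative sketch)."""
    corr = df.drop(columns=[target]).corr().abs()      # pairwise |Pearson|
    target_corr = df.corr()[target].abs()              # |corr with Target|
    dropped = set()
    cols = corr.columns
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if a in dropped or b in dropped:
                continue
            if corr.loc[a, b] > threshold:
                # Drop the member of the pair less correlated with the target
                dropped.add(a if target_corr[a] < target_corr[b] else b)
    return [c for c in cols if c not in dropped]

# Tiny illustration: "pts_copy" is an almost exact duplicate of "pts"
df = pd.DataFrame({
    "pts":      [10, 20, 30, 40],
    "pts_copy": [11, 21, 31, 41],
    "ast":      [5, 3, 8, 1],
    "target":   [0, 0, 1, 1],
})
kept = drop_correlated(df, "target")
```

Here "pts_copy" is eliminated because it carries the same information as "pts" while adding nothing to the prediction of "target".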
After applying ElasticNetCV, we compile a comprehensive list of the most significant features to implement in our ML models. We expect to have 100 variables available for the NBA and 17 for the WNBA. It appears that establishing relationships between the “Target” variable and the features in the WNBA dataset may present a higher level of complexity.
This approach is of great importance as it helps simplify the model, enhance interpretability, and potentially improve predictive performance by focusing on the most significant features in the dataset.
4.3. Data Splitting
It is essential to ensure that historical data is used to predict future data. This approach is quite common in predictive modeling, where the model is trained on historical data up to a certain point and tested on future data to evaluate its performance. The training data is used to train the model, allowing it to learn patterns about the historical data. After training the model, it is tested on future data to make predictions. The predictions will be compared with the actual data values to calculate metrics such as accuracy, providing information about the model’s ability to make accurate predictions about the data.
The data were then divided by season.
Figure 10 illustrates how the division of data for training and testing works. Initially, the first two seasons contained in the 2015/2016 and 2016/2017 datasets will only be used for training and to make forecasts for the 2017/2018 season. After that, the 2015/2016, 2016/2017, and 2017/2018 seasons will be used to make forecasts for the 2018/2019 season and so on until the 2023/2024 season. This ensures that the training data contain only past events concerning the test data.
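This expanding-window scheme can be sketched as follows; the season labels follow the text, while the loop itself is an illustrative reconstruction:

```python
# Expanding-window split by season: train on all earlier seasons,
# test on the next one (the first two seasons are training-only)
seasons = ["2015/2016", "2016/2017", "2017/2018", "2018/2019", "2019/2020"]

splits = []
for i in range(2, len(seasons)):
    train_seasons = seasons[:i]   # everything strictly before the test season
    test_season = seasons[i]
    splits.append((train_seasons, test_season))

for train_seasons, test_season in splits:
    print("train:", train_seasons, "-> test:", test_season)
```

The first split trains on 2015/2016 and 2016/2017 to predict 2017/2018; each later split simply appends the previous test season to the training pool, so the training set always contains only past events.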
4.4. Model Evaluation
Several parameters could be evaluated, but the most predominant is the accuracy of the model. When predicting the outcome of a sports game such as basketball, in which there are only two outcomes, winning or losing, accuracy ends up being the most important. The accuracy is given by Equation (4): Accuracy = (TP + TN) / (TP + TN + FP + FN), where
TP = True Positives: Number of cases correctly classified as positives.
TN = True Negatives: Number of cases correctly classified as negatives.
FP = False Positives: Number of cases incorrectly classified as positives.
FN = False Negatives: Number of cases incorrectly classified as negatives.
Although accuracy is an important measure, other metrics can be evaluated, such as precision, which is given in Equation (5): Precision = TP / (TP + FP). Precision in this context is the proportion of correct win predictions (true positives) relative to the total number of win predictions made by the model (true positives plus false positives). This metric tells us, of all the times the model predicted a win, how many of those predictions were correct, making it possible to assess the accuracy of the model's win predictions.
Recall is another metric that can be used in the basketball classification model. This metric, defined by Equation (6) as Recall = TP / (TP + FN), measures the model's ability to detect all real wins (that is, those that actually occurred) in the official dataset. A high recall means that the model is good at identifying the majority of real wins.
The F1-Score, defined in Equation (7) as F1 = 2 · (Precision · Recall) / (Precision + Recall), is a performance metric that considers both precision and recall. It is useful for finding a balance between these two aspects, especially when there is an imbalance between the classes.
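All four metrics can be computed directly with scikit-learn; the labels below are illustrative (1 = win, 0 = loss):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Illustrative game outcomes and model predictions (1 = win, 0 = loss)
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)    # (TP + TN) / all games
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
```

In this toy example there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so all four metrics happen to equal 0.75.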
We will conduct rigorous tests with varying numbers of features, ranging from 15 to 100, to comprehensively analyze their impact on the accuracy of the ML model for the NBA. Because the WNBA has fewer relevant features, a study with the 7 and the 17 best features will be carried out to understand the impact of feature variation. Our objective is to meticulously evaluate different algorithms and their combinations to achieve optimal results.
The behavior of the algorithm with the best metrics will also be studied throughout the test seasons to understand the levels of accuracy that would be obtained for each season.
An exclusive forecast for the 2024 season will also be tested, in which the previous seasons will be the scope of the training set and the 2024 season the test set. This will make it possible to understand which algorithm is best for predicting the season in which we are currently in.
5. Results
In this section, we will present the results of the various algorithms and features tested for the NBA and WNBA models.
5.1. NBA
First, the NBA results will be presented. Subsequently, various algorithms will be tested using the top 15, 30, 50, 75, and 100 features selected by the Feature Selector.
The algorithms covered are LR, RR, RF, NB, KNN, SVM, STACK, BAG, MLP, AB, and XGB.
Table 4 indicates the meaning of the letters in the statistics that will be selected by the Feature Selector.
5.1.1. Top NBA Features
Figure 11 represents the features selected by the ElasticNetCV algorithm. All of the features shown are derived from feature engineering, which demonstrates the importance of these engineered features in obtaining the best possible prediction results.
The best results were obtained using the top 75 features.
Table 5 shows that, with the top 75 features, LR and STACK achieve slightly higher accuracy than MLP did with the Top 50 Features. Although their accuracy is similar, they exhibit varying values in the other metrics. STACK has a fractionally higher recall (66.03% versus 65.87%), indicating that it is slightly more effective at identifying real wins. However, LR has a relatively higher precision (65.38% versus 65.32%), suggesting that it makes fewer incorrect win predictions. The tie-breaker is the F1-Score: STACK is marginally better than LR on this metric (65.68% vs. 65.62%), reflecting a better balance between precision and recall.
RR and BAG are also excellent candidates in terms of predicting NBA results, with percentages of 65.41% and 65.40%, respectively. The fact that BAG has a higher recall close to 1% indicates that it is more effective at capturing real wins, which is reflected in the F1-Score parameter.
MLP shows a decrease of more than 0.5% in its accuracy compared to the previous measurement. More complex models such as neural networks may have greater capacity for nonlinear and complex data relationships. However, this also means that they are more susceptible to overfitting when there are too many irrelevant or noisy features for the algorithm. The MLP stands out for its high recall of 69.61%, even beating algorithms with higher accuracy. This is because there is a kind of trade-off with precision; this parameter, which stands at 63.13%, suggests that MLP makes more predictions of wins, which is why it has such a high recall.
SVM and XGB also achieve percentages greater than 65.00%, but these are two very different cases. SVM slightly increased its accuracy relative to the Top 50 Features, which indicates the algorithm's ability to deal with high dimensionality. XGB, by contrast, is already declining: the additional features introduced are no longer very relevant to the algorithm, leading to a drop in accuracy.
Figure 12 represents the performance of the top algorithms in the variation of features. It can be seen that accuracy increases to a certain number of features (in most algorithms, 75 features, and in MLP, 50 features). This shows that adding new features, even if they are important to the feature selection algorithm, will only be relevant up to a certain point. After that, these features may not contribute relevant information, or the algorithm may be overfitted and cause a decrease in accuracy.
5.1.2. Analysis by Season
To understand the accuracy of this algorithm in each season contained in the dataset, from 2016 to 2024, a study was carried out on its variation from season to season. The algorithm chosen was the Stacking Classifier with 75 features, as it obtained the best performance. It stands out compared to Logistic Regression, also with 75 features, because it had better recall and F1-Score.
Figure 13 shows the accuracy obtained over the seasons. The seasons with the best levels of accuracy are 2017/2018 and 2018/2019. A fairly sharp decline is observed in the following seasons, which may be explained by COVID-19. The seasons from 2019/2020 to 2021/2022 were severely affected by the pandemic. The 2019/2020 season was stopped halfway through and resumed months later at a single resort site, where the teams that still had a chance of making the playoffs gathered to finish the season. As a result, there were no spectators in the stands and all the games were played in the same place, which eliminated the home factor.
Statistically, NBA teams win 58% of the games they play at home, which poses a significant limitation for the ML algorithm. The pandemic also severely affected the 2020/2021 and 2021/2022 seasons. In the 2020/2021 season, regular-season games were reduced from 82 to 72. Despite playing in their arenas, the teams did not have fans, which diminished the home-court advantage. Unvaccinated players faced various restrictions on their ability to perform, as some states did not allow these athletes to participate in certain events. When a player contracted COVID-19, several other players on the same team had to quarantine, forcing teams to use players who were not on the regular rotation.
In the 2021/2022 season, the NBA returned to 82 games in the regular season, but COVID-19-related protocols continued to have an impact. Fans began to return to the stadium, but full capacity was not allowed. In the 2022/2023 season, everything returned to normal, but the ability to predict this season is severely affected by previous seasons in which there was a pattern of abnormality in the world and the league.
In the 2023/2024 season, the league’s predictive capacity increases, and statistics such as the home factor begin to gain the right weight for ML algorithms.
5.1.3. Season 2023/2024
Upon careful examination, it was observed that although the 2023/2024 season contains the most data for training, it does not necessarily yield the highest accuracy. Subsequently, a comprehensive study was initiated to evaluate the top five ML and two DL algorithms (LSTM and CNN) to understand if these algorithms could be effective in better predicting the current NBA season.
The parameters used for the DL models are presented in Table 6. Because the DL algorithms involve randomness in certain parameters, and since the objective was a rigorous study, ten runs were made for each of LSTM and CNN. In these ten runs, the different feature variations were evaluated and the variation that showed the best results was chosen. The averages of these ten runs were then calculated and compared with those of the other ML algorithms.
Table 7 shows that the STACK and MLP algorithms, with 75 and 50 features, respectively, have the best accuracy values. However, STACK has an advantage because it has a higher recall and F1-Score, which suggests a better balance between identifying true positives and the accuracy of positive predictions. In a context applied to the NBA, it is better at capturing the majority of wins, even if it accepts a small number of wrong predictions. The other algorithms have slightly lower prediction capabilities than STACK.
LSTM achieved highly positive metrics, competing closely with the two ML algorithms that had the highest prediction accuracies (STACK and MLP). It obtained an average prediction accuracy of 65.26%, and in one run it surpassed STACK, reaching 65.62%, as can be seen in Table 8. This variability arises because the weights of the neural networks are initialized randomly, and the way the learning rate is adjusted over time can also vary between runs.
In light of the challenges posed by irregular patterns that emerged during the pandemic, it was decided to exclude the specific time periods that represent the peak of COVID-19 activity (2019/2020, 2020/2021, and 2021/2022 seasons) from the training set. This was performed to ensure the effectiveness of the ML algorithms.
Table 9 shows the results, where we can see both the increase and the decrease in accuracy by the algorithms. MLP achieves the highest accuracy of 65.74%, and even the highest for the 2023/2024 season. MLP appears to be more robust to the noise present in the COVID-19 data. Removing these data allows the model to focus on more consistent and predictable patterns. Removing atypical data makes the dataset more homogeneous and consistent, which leads to greater accuracy. The BAG algorithm also showed improvements in removing data affected by the pandemic.
The other ML and DL algorithms showed slight drops in accuracy. Although the seasons affected by the pandemic are different in many ways from the others, they have relevant information about the game’s evolution. These three years of the eight used for training are already significantly reduced, as certain algorithms rely on a large volume of data to capture complex variations. Also, because prediction methods are based on time series, erasing a period of history may not be appropriate, leading to the removal of accuracy from these algorithms.
5.1.4. Model Interpretability Analysis
To perform an interpretability analysis that quantifies the impact of different features on prediction results and uses metrics to assess probabilistic prediction performance, we use SHAP values and Brier score. These methods are especially important for applications such as betting or strategy optimization. The models analyzed are as follows: MLP, LR, RR, and CNN.
SHAP is a method based on cooperative game theory and is used to increase the transparency and interpretability of ML models [
19,
29].
Figure 14 shows the SHAP values for the analyzed models. It is observed that “home_next”, “team_elo_5_y”, and “team_elo” consistently emerge as the three most influential features across all models. This highlights the importance of incorporating information about the next game and the current form and performance of the team into the prediction results. The consistency of these results across all models emphasizes the importance of feature engineering, demonstrating that specifically created features play a crucial role in improving prediction performance.
The Brier score is a scoring rule that is used to measure the accuracy of probabilistic predictions [
30]. It calculates the mean squared difference between predicted probabilities and the actual outcomes. It ranges from 0 to 1, with a lower score indicating better calibration and accuracy, making it especially useful in sports betting.
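Both quantities are available in scikit-learn; the win probabilities below are illustrative:

```python
import numpy as np
from sklearn.metrics import brier_score_loss
from sklearn.calibration import calibration_curve

# Illustrative predicted win probabilities and actual outcomes
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.3, 0.8, 0.4, 0.1, 0.55, 0.35])

# Mean squared difference between predicted probability and outcome
brier = brier_score_loss(y_true, y_prob)

# Reliability-diagram data: fraction of actual wins per probability bin,
# versus the mean predicted probability in that bin
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=5)
```

A perfectly calibrated model would have each bin's observed win fraction equal its mean predicted probability, which corresponds to the diagonal in Figure 15; here the Brier score works out to 0.0925, well below the 0.25 a constant 0.5 prediction would yield.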
Figure 15 presents the calibration curves (reliability diagrams) for the models evaluated. These diagrams illustrate model calibration by binning predicted probabilities and comparing them to actual outcomes, providing a clear visualization of prediction reliability. To quantitatively assess how well the predicted probabilities reflect the actual outcomes, Brier scores are reported alongside the curves. As can be seen, the models' calibration curves remain close to perfect calibration, demonstrating good performance. Among the models, the CNN exhibits a slightly more irregular calibration curve, while still achieving the lowest Brier score of 0.221.
The results demonstrate consistent performance across all models, indicating well-calibrated models that support more informed decision-making based on predicted probabilities. SHAP and Brier score analyses serve as essential tools for evaluating model confidence and reliability, providing insights that extend beyond conventional accuracy metrics.
5.2. WNBA
A comparative study mirroring that of the NBA has been undertaken for the WNBA. However, it is important to note that the WNBA dataset is considerably smaller and lacks the 2024 season compared to the NBA.
Although the WNBA dataset is smaller than the NBA dataset, it was necessary to increase the iteration limit so that ElasticNetCV would not stop before converging. Iterations, in the context of ElasticNetCV, refer to the number of steps the optimization algorithm takes to find the model that minimizes the cost function. The reasons why ElasticNetCV needs more iterations and is more computationally demanding on the WNBA dataset are as follows:
Less Information: In smaller datasets, there are less data available to inform parameter estimation. This can make it difficult to identify clear patterns and, consequently, require more iterations for the algorithm to find a stable minimum of the cost function.
Greater Variability: With less data, the variability in the estimators can be greater. This means that coefficient updates can be less accurate, resulting in an optimization process that needs more steps to stabilize and converge.
Less Effective Regularization: In a smaller dataset, the effects of regularization can be less pronounced. Regularization helps to avoid overfitting and simplifies the model; but, with less data, the balance between adjustment and regularization can be more difficult to achieve, requiring more iterations to find the optimal coefficients.
The features chosen by ElasticNetCV and Pearson's correlation comprised 17 elements. From these, subsets of the top 7 and top 17 features, which demonstrated varying levels of importance for the "Target" variable, were studied.
5.2.1. Top WNBA Features
For the Top 17 Features chosen by ElasticNetCV, no additional feature-engineered variables were added, but more statistics from the previous game seem to have a greater impact, as can be seen in Figure 16. A metric related to the three-point shot is highlighted, which shows that in the WNBA this game strategy is not given as much importance as it is in the NBA.
Table 10 shows the results for the Top 17 Features. STACK showed an increase in accuracy to 67.32% (from the Top 7 features) while also maintaining the highest precision, which means the model is reliable in correctly identifying positive examples (wins) while minimizing incorrect win predictions.
The MLP shows a sharp drop from 67.48% (Top 7 features) to 62.61%, indicating what can be considered a “curse of dimensionality”, a phenomenon in which an increase in the number of features can lead to an exponential increase in the volume of the feature space, making it more difficult for the ML model to generalize well.
The SVM algorithm showed a commendable accuracy of 66.47%. Furthermore, it achieved the highest recall among all the algorithms, at 66.56%. Recall measures the algorithm's ability to correctly identify all actual positive cases.
Figure 17 shows the variation in accuracy as the number of features changes. MLP suffers the most as the number of features increases, almost decreasing by 5%. The other algorithms maintain fairly close percentages, varying between 66.5% and 67.3%. The fact that there are not as many features as in the NBA does not allow for such a detailed study of the impact of the number of features.
5.2.2. Analysis by Season
A study was conducted using the algorithm with the highest predictive accuracy to evaluate its performance in various seasons. In the context of the NBA, the pandemic has been noted to have an impact on the league. Here, the objective is to analyze and understand the extent of this impact on the WNBA.
Figure 18 shows the results of the MLP algorithm with the Top 7 Features in each WNBA season studied. A sharp decrease is visible in the 2020 season, coinciding with the COVID-19 pandemic. As in the NBA, several changes at all levels of the WNBA are responsible for this drop. In the 2020 season, the number of regular-season games was reduced from 36 to 22, which means there is less information. That season was also played in a "bubble" at the IMG Academy in Bradenton, Florida, to minimize the risk of contagion, very similar to what happened in the NBA. The "home_next" feature is one of the most important features for the feature selector algorithm, and with no crowd and all games played on a neutral court, one of the most important features lost its influence. Some players also did not participate due to health concerns, which means the information from that season may be less consistent than usual.
The 2021 season saw a return to more normal percentages. The schedule grew back to 32 games, still short of the traditional 36. Matches were also held in the teams' usual arenas, which means the "home_next" feature became relevant again, although arenas were still limited to a certain number of spectators.
The pandemic had a huge impact on this league, just as it had on the NBA. The advantage of the WNBA is that it takes place exclusively in the summer, avoiding the more controversial periods experienced by other leagues due to COVID-19.
5.2.3. Season 2023
A recent study was conducted on the WNBA for the data available from the 2023 season. All seasons before this are used for training purposes. The top 5 ML algorithms with the best previous performance and two DL techniques, LSTM and CNN, are used to test the capacity of these algorithms for this type of prediction. The parameters used for the WNBA DL models are the same as those used for the NBA. The set of features that obtained the best results was also used and ten attempts were made to obtain the average metrics of the DL algorithms.
LR was the method that obtained the best classification for this season, as shown in
Table 11. Despite having the same accuracy as BAG, LR stands out for having a better F1-score, indicating a better harmonic mean relationship between precision and recall.
The LSTM algorithm obtained excellent prediction percentages; although its average was 69.05%, in the ten runs there were values that equaled the 69.33% of the top ML algorithms, as shown in Table 12. Despite CNN having the worst average accuracy across its ten runs, it reached 69.59% in its best run, beating all the ML algorithms. This is because the weights of neural networks are initialized randomly: each run starts with a different set of weights, which can lead to different training paths and, consequently, different performance levels.
Table 13 shows the results with the 2020 season removed from the training set to understand the impact that this season, which was severely affected by the pandemic, had on the algorithm predictions. There were some increases, especially in the RR algorithm, which reached 69.59% and was previously at 68.81%.
6. Discussion
This section provides some remarks on the study conducted and its connection to previous research. The quality of the dataset is very important to allow ML algorithms to capture relationships that may be important for making predictions; different datasets with different statistical parameters will lead to different results. The features created by Feature Engineering are extremely important because they facilitate the analysis of information for ML models. The feature selector used in this study primarily selects these features. Also, and most importantly, it is related to the periods studied. The NBA is a league that is always evolving and there may be more competitive or less competitive seasons, which will certainly influence the ability to predict. COVID-19 also had a significant impact on the seasons during which it was present, and its effects may still be felt at various levels within the league today.
The way in which the training and test data are divided is also very important. In this study, a different approach was tried: dividing the dataset by seasons and using the seasons before the one being tested for training. The approach taken by previous work involves dividing the dataset into a training set and a testing set. This allows a portion of a season to be utilized for testing within the training set, potentially leading to improved results as training incorporates patterns specific to that season.
An additional consideration worth noting is the use of cross-validation for feature selection. Standard cross-validation may not be suitable for this type of study because its random division of the dataset could place future data in the training set and past data in the test set; the selected features would then not reflect realistic conditions. To address this issue, this study employs time-split cross-validation to preserve temporal order.
For the WNBA, the highest prediction result obtained in the dataset was 67.48% with MLP, which beats the results obtained in the NBA dataset in this study. Although we have fewer games and slightly fewer statistics for the WNBA, it is notable that the algorithm finds it easier to capture relationships between past data and predict future data in this context. This is perhaps due to the heightened level of competition in the NBA compared to its women’s league.
7. Conclusions
This article has effectively showcased the capability of ML and DL algorithms to forecast the outcomes of basketball games within the NBA and WNBA leagues.
This study contributes to the growing role of artificial intelligence in sports analytics by providing interpretable predictive models for NBA and WNBA outcomes, and is particularly valuable for coaches, team managers, analysts and stakeholders aiming to improve performance and strategic planning.
Despite the existence of datasets for the NBA, WebScraping was used to obtain a more up-to-date dataset so that it could be compared with the results of other works on predicting NBA games using ML. The league has been significantly affected by the pandemic, resulting in complex patterns that are challenging to interpret.
For the WNBA, WebScraping is a necessary tool, as there are no datasets with relevant information for this study. The WNBA proves to be less unpredictable and less competitive than the NBA, which is to be expected as it is a league that is still several years behind in terms of both financial and human resources.
For future work, several parameters could be explored over time to make the study more up-to-date and possibly improve the results in terms of the accuracy of the two models. Hyperparameter optimization techniques such as GridSearch and RandomSearch could be used to find the ideal combination of hyperparameters and features to maximize the model’s performance. The addition of a database indicating the available players and their impact could be very useful, since this is a five-on-five sport and one really good player can have a huge impact on the game. Exploring new feature engineering techniques used in other sports and applying them in this context could potentially enhance the results.
In addition, it would be valuable to incorporate methods with stronger temporal and interaction modeling capabilities, such as team–player interaction modeling based on Graph Neural Networks (GNNs) or Transformer architectures. Exploring hybrid architectures, such as combining LSTM with Attention mechanisms or Transformer-based models, represents a promising direction, as these integrated approaches provide enhanced capacity to capture both temporal dynamics and complex interactions. Furthermore, future studies could extend the validation beyond the NBA and WNBA datasets to include cross-season transfer prediction and cross-league transfer performance analysis, allowing greater generalizability and robustness.