1. Introduction
It is widely accepted that traditional statistics cannot accurately describe some aspects of basketball. For this reason, an advanced statistics revolution has taken place in basketball research, producing statistics that are more meaningful and useful for analyzing the game. These advanced statistics are presented in [1,2]. However, such analyses are valid only in a league format, where every team plays every other team and the season is long. In a tournament, which is a fast-track competition where teams do not face all other teams, these statistics can be misleading. Moreover, the view of this work is “macroscopic”, i.e., the aim is to identify factors that lead to overall (performance-based) success in the tournament rather than to winning a single game. This paper offers a quantitative method for answering questions in a tournament setting, such as the FIBA World Cup, and is intended as a starting point for analyzing tournaments in other sports. Most previous research on basketball has focused on league settings and on comparisons, or discriminating factors, between winning and losing teams. The focus here is on overall tournament performance and not only on single-game winning factors.
Previously published related works include [3], which explored the factors that influenced the performance of the Chinese team in the 15th Men’s World Basketball Championship; the authors found that the team’s ability was imbalanced, that a flexible attack strategy was needed to increase attacking ability, and that the players’ mental regulation needed great improvement. In [4], the authors analyzed the causes of the losses and gains in the matches of the Chinese basketball team in the 14th Men’s World Basketball Championship and indicated that speed, agility, precision, and antagonism are the lasting trends in world basketball. In [5], the authors used regression analysis to examine the influence of certain basketball elements (FIBA standard indicators of performance) on the final result of a basketball game, considering games from the 13th, 14th, and 15th Men’s World Basketball Championships. In [6], the authors compared the Chinese team with the six other top teams of the 2006 Men’s World Basketball Championship in terms of statistics, analyzing the gaps and identifying the weaknesses of the Chinese team. In [7], the author determined which basketball performance indicators can discriminate winners from losers using a dataset of 76 matches from the 2014 world championship in Spain, for which the official statistical parameters were downloaded from FIBA. Finally, in [8], the authors compared and analyzed differences between the technical styles of the Chinese and American men’s basketball teams in the 15th FIBA World Cup.
The aims of this paper are as follows. A crucial term is team performance. Considering only the final ranking of a team in the tournament is clearly misleading: a team may play very well in all games but have a blackout in a knock-out match, in which case the ranking is unfair to that team. On the other hand, if performance is measured by the extent of victory, then a strong team might be lucky in the group draw and easily win against its first opponents but be unable to cope when facing another strong team. These examples lead us to treat performance as a multivariate measure, with the target being to extract a single value for the performance of each team. To achieve this, we used principal component analysis (PCA). The next goal of this work was to determine which factors contributed to the performance of a team. The basis for this analysis is the concept of the four factors of Dean Oliver as a standard for determining the winner of a game. Another big debate is whether offense or defense is more important for success in such tournaments. Both questions are answered with the use of multivariate regression. Additionally, we studied the effects of other factors, such as (i) the height of the team, (ii) the age of the team, (iii) the coach’s experience with the team, (iv) the players’ percentage (pcg.) usage of the ball (for the leading player or the first five players), (v) the distance shooting in a team, (vi) the balance in team scoring, and (vii) the efficiency of small players. Multiple regression and the correlation of variables with team performance were the tools for measuring these effects. Another very popular debate is whether a team performed as well as expected in the tournament.
In this manuscript, we make an attempt to determine whether team performance is compatible with pre-tournament expectations, which are specified with the help of hierarchical k-means clustering based on variables that were found to affect the performance of teams. Groups of teams were formed according to their pre-tournament characteristics, and post-tournament actual performance was compared with the expected performance of the teams. A final question that was studied is whether we can have better pre-tournament predictions than power rankings. We employed machine learning models (Random Forests and neural networks) for the prediction of the final positions (based on performance) of the teams in the tournament. Power rankings incorporate information and knowledge from experts that should not be wasted, and this is the reason why they were considered among the inputs in our models. Moreover, they were the benchmark for our models, i.e., we were interested in whether a model could enhance power rankings, and if so, then the model was considered useful. The models were compared in terms of correlation (through the pseudo-R-square measure) with performance-based final positions.
A brief overview of the problems that are studied in this work is the following: the measurement of the performance of a team in a tournament, the detection of which factors played an important role in the performance of a team, whether a team fulfilled expectations in the tournament, and, finally, the suggestion of an improvement in ‘Power Rankings’.
The rest of the paper is organized as follows:
Section 2 presents the definitions and meanings of the statistical measures and the statistical methods that were used to tackle the problems in this work.
Section 3 is an overview of the questions and problems we considered for the tournament and a detailed description of the procedures we used to deal with them.
Section 4 presents the data analysis, and
Section 5 contains the summary and the conclusions of the paper.
2. Statistical Definitions, Measures, and Tools
This section briefly presents the elements that are used throughout this work. Principal component analysis (PCA) is a statistical method that was introduced by Pearson and later independently developed and named by Hotelling; its aim is to express multivariate data in fewer dimensions. A detailed analysis of this method can be found in the book [9].
The correlation of two variables can be measured using a coefficient that quantifies it: the value of the coefficient lies between −1 and 1, its magnitude shows the strength of the correlation, and its sign shows the direction. In this work, we use two such coefficients for completeness: the Pearson correlation coefficient (r) (details can be found in [10]) and the Spearman rank correlation coefficient (rho) (details can be found in [11]).
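The two coefficients can be illustrated with a minimal, standard-library sketch. The function names and the data below are hypothetical and are not taken from the paper; Spearman's rho is computed here as the Pearson coefficient of the (tie-averaged) ranks.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient r of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: possessions per game and a performance score for 5 teams.
pace = [70.0, 72.5, 75.0, 78.0, 90.0]
score = [0.1, 0.4, 0.5, 0.9, 1.0]
r = pearson(pace, score)
rho = spearman(pace, score)
```

In this made-up example the relationship is monotone but not linear, so rho equals 1 while r is smaller, which is one reason to report both coefficients.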
In statistics, linear regression is a linear approach of the form y = Xb + ε, which is used to model the relationship between a (dependent) variable and one or more explanatory (independent) variables. Details about linear regression can be found in many statistics books, such as [12]. The factors known to affect the outcome of a game are the shooting factor, turnover factor, rebounding factor, and free throw factor, which are introduced and described in [1,2]. Their formulas are stated briefly. The shooting factor (Sh.F.) formula for both offense and defense is (FG + 0.5 × 3P)/FGA. The turnover factor (TO.F.) formula for both offense and defense is TOV/(FGA + 0.44 × FTA + TOV). The rebounding factor (Reb.F.) formula for offense is ORB/(ORB + Opp DRB), while the formula for defense is DRB/(Opp ORB + DRB). The free throw factor (FT.F.) formula for both offense and defense is FT/FGA.
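The four factor formulas above can be written as small functions. This is a minimal sketch: the function and parameter names are illustrative, the box-score numbers are hypothetical, and it assumes that the defensive versions of the shooting, turnover, and free throw factors are obtained by passing the opponent's statistics to the same functions.

```python
def shooting_factor(fg, three_p, fga):
    """Sh.F. = (FG + 0.5 * 3P) / FGA."""
    return (fg + 0.5 * three_p) / fga

def turnover_factor(tov, fga, fta):
    """TO.F. = TOV / (FGA + 0.44 * FTA + TOV)."""
    return tov / (fga + 0.44 * fta + tov)

def offensive_rebounding_factor(orb, opp_drb):
    """Reb.F. (offense) = ORB / (ORB + Opp DRB)."""
    return orb / (orb + opp_drb)

def defensive_rebounding_factor(drb, opp_orb):
    """Reb.F. (defense) = DRB / (Opp ORB + DRB)."""
    return drb / (opp_orb + drb)

def free_throw_factor(ft, fga):
    """FT.F. = FT / FGA."""
    return ft / fga

# Hypothetical per-game team statistics (not from the tournament data).
sh_f = shooting_factor(fg=30, three_p=10, fga=70)        # (30 + 5) / 70
to_f = turnover_factor(tov=12, fga=70, fta=20)           # 12 / 90.8
oreb_f = offensive_rebounding_factor(orb=10, opp_drb=30)
dreb_f = defensive_rebounding_factor(drb=30, opp_orb=10)
ft_f = free_throw_factor(ft=14, fga=70)
```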
The possessions of a team are computed through the formula FGA + 0.475 × FTA − ORB + TOV. The possessions are calculated for both the offensive and defensive teams, and their average is taken as the game’s overall possessions.
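As a worked illustration of the possessions formula and of the averaging over the two teams, consider the following minimal sketch. The function names and the statistics are hypothetical.

```python
def team_possessions(fga, fta, orb, tov):
    """Possessions of one team: FGA + 0.475 * FTA - ORB + turnovers."""
    return fga + 0.475 * fta - orb + tov

def game_possessions(team_stats, opponent_stats):
    """Average of the two teams' possession estimates for one game."""
    return (team_possessions(**team_stats) + team_possessions(**opponent_stats)) / 2

# Hypothetical single-game statistics for the two teams.
team = {"fga": 70, "fta": 20, "orb": 10, "tov": 12}      # 70 + 9.5 - 10 + 12
opponent = {"fga": 68, "fta": 16, "orb": 8, "tov": 14}   # 68 + 7.6 - 8 + 14
pace = game_possessions(team, opponent)
```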
Random Forests, which are described in [13] and were introduced in [14], are ensembles of decision trees in which each node is split using the best split among a subset of predictors randomly chosen at that node. For regression, the output is the mean of the predictions of all trees. This strategy performs very well against other classifiers and is robust against overfitting. Neural networks are computing systems inspired by the biological neural networks that constitute animal brains. An overview of neural networks can be found in [15]. K-means clustering [16] is a popular method for cluster analysis in data mining. In this work, we use hierarchical k-means clustering [17], as implemented in the R package ‘factoextra’ [18].
3. FIBA World Cup 2019: Problems and Procedures to Solve Them
Several metrics are used to decide the success of a team in a tournament. Because the central concept is the performance of a team, the winning percentage is naturally the first metric to consider. However, it is not a suitable metric in a tournament because teams do not face all other teams (only a subset of them after a draw). Another measure of the performance of a team is the point difference (PD) between the team and its opponents, which reflects the dominance of the team. A further metric of success could be the final ranking of the team in the tournament, but this metric is also inappropriate when used alone. To achieve a complete metric of success, we consider all the above metrics and derive an overall metric of success (team score) using principal component analysis (the resulting score explains a large portion of the variance).
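One way such a team score could be obtained is sketched below with the standard library only. Several choices here are assumptions rather than the paper's procedure: the three inputs are standardized (so the PCA is on the correlation matrix), the final rank is negated so that larger values mean better performance for every variable, the first component is found by power iteration, and all data are hypothetical.

```python
from math import sqrt

def standardize(column):
    """Center a column and scale it to unit sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]

def first_principal_component(columns, iterations=200):
    """Return (loadings, scores, explained_share) of the first component."""
    z = [standardize(c) for c in columns]
    p, n = len(z), len(z[0])
    # Correlation matrix of the standardized variables.
    corr = [[sum(a * b for a, b in zip(z[i], z[j])) / (n - 1) for j in range(p)]
            for i in range(p)]
    # Power iteration converges to the eigenvector of the largest eigenvalue.
    v = [1.0 / sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * corr[i][j] * v[j] for i in range(p) for j in range(p))
    # Project each team onto the first component to get its score.
    scores = [sum(v[j] * z[j][k] for j in range(p)) for k in range(n)]
    # The trace of a correlation matrix is p, so eigenvalue / p is the variance share.
    return v, scores, eigenvalue / p

# Hypothetical data for six teams.
win_pct = [1.0, 0.875, 0.75, 0.6, 0.4, 0.2]
point_diff = [15.0, 12.0, 8.0, 2.0, -5.0, -14.0]
final_rank = [2, 1, 3, 5, 4, 6]
negated_rank = [-r for r in final_rank]  # assumption: higher is better for all inputs

loadings, team_score, explained = first_principal_component(
    [win_pct, point_diff, negated_rank])
```

With all three variables oriented the same way, the loadings are positive and a larger team score corresponds to better overall performance; `explained` reports the share of total variance captured by the single score.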
Furthermore, this work examines whether the four factors (Sh.F., TO.F., Reb.F., and FT.F.) affect the overall performance of a team. To this end, we use a multiple regression model (Model 1) with these factors as independent variables and performance as the dependent variable. The factors for each team are calculated from team statistics per game, which were extracted from the Basketball Reference site (https://www.basketball-reference.com/international/fiba-world-cup/2019.html, accessed on 1 March 2023).
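A multiple regression of this kind can be sketched as ordinary least squares solved through the normal equations. The code below is illustrative only: it uses two hypothetical factor columns instead of the paper's full set, the performance values are synthetic (generated from known coefficients so the fit can be checked), and the helper names are not from the paper.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(rows, y):
    """Least-squares coefficients of y = X b + e, with the intercept first."""
    x = [[1.0] + list(r) for r in rows]
    p = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(x, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Hypothetical (shooting factor, turnover factor) pairs for five teams.
factors = [(0.50, 0.12), (0.55, 0.15), (0.48, 0.10), (0.52, 0.18), (0.45, 0.14)]
# Synthetic performance generated from known coefficients 1, 2, and -3.
performance = [1.0 + 2.0 * sh - 3.0 * to for sh, to in factors]
coefficients = ols(factors, performance)
```

Because the synthetic response has no noise, the fitted coefficients recover the generating values; with real tournament data the residuals and significance of each factor would also need to be examined.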
Additionally, this work addresses another very interesting question: whether offense or defense played the more important role in the performance of a team in the tournament. To answer this question, two multiple regression models (Models 2 and 3) are formulated, applied, and compared.
Furthermore, several effects are tested for their influence on the performance of a team in the tournament. First, we test the effects related to player usage percentage (usg%), which is defined as usg% = 100 × ((FGA + 0.44 × FTA + TOV) × (Tm MP/5))/(MP × (Tm FGA + 0.44 × Tm FTA + Tm TOV)). The usage percentage is an estimate of the percentage (pcg.) of the team’s offensive attempts (plays) that are used by a player while he is on the floor.
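The usage percentage formula translates directly into code. In this minimal sketch the function and parameter names are illustrative and the statistics are hypothetical; the prefix tm_ denotes team totals (Tm in the formula).

```python
def usage_percentage(fga, fta, tov, mp, tm_fga, tm_fta, tm_tov, tm_mp):
    """usg% = 100 * ((FGA + 0.44*FTA + TOV) * (Tm MP / 5))
              / (MP * (Tm FGA + 0.44*Tm FTA + Tm TOV))."""
    player_plays = fga + 0.44 * fta + tov
    team_plays = tm_fga + 0.44 * tm_fta + tm_tov
    return 100 * (player_plays * (tm_mp / 5)) / (mp * team_plays)

# Hypothetical player and team totals for one 40-minute game (5 x 40 = 200 team minutes).
usg = usage_percentage(fga=15, fta=5, tov=3, mp=30,
                       tm_fga=70, tm_fta=20, tm_tov=12, tm_mp=200)
```

Here the player uses 20.2 of the team's 90.8 plays while being on the floor for 30 of 40 minutes, which gives a usage of roughly 29.7%.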
In addition to usg%, we consider the position of the player with the greatest usg% in the team (or the avg. position of the five players with the greatest usg%), the minutes played by the player with the greatest usg% in the team (or the avg. minutes played by the five players with the greatest usg%), and the percentage of plays of the player with the greatest usg% in the team (or the avg. percentage of plays of the five players with the greatest usg%). The effect of the player with the greatest usg% (or of the five players with the greatest usg%) is tested with multiple regression models (Models 4 and 5, respectively).
Next, we test whether players who compete in specific leagues (League Effect) affect the performance of a team in the competition. The choice of the most important leagues (and of their weights for building an overall League Effect score) is an ad hoc decision. We consider players who play in the NBA, the Euroleague, the Eurocup, the Basketball Champions League (BCL), and the NCAA; in this work, the scores for these leagues are 1, 1, 0.5, 0.5, and 0.5, respectively. Other effects tested for their influence on the performance of a team include the heights of the players (measured by the average height of the players of the team and by the number of players in the team taller than 200 cm), the ages of the players (measured by the average age of the players of the team and by the number of players in the team older than 30 years), the coach’s experience on the bench of the team (in years), and the importance of shooting (measured by the percentage of 3-point attempts over all attempts and by the points scored by players in positions 1, 2, and 3 (small players) versus the points scored by players in positions 4 and 5 (tall players)). These effects are tested with regression models (Models 6–10).
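A League Effect score could be built from the league weights as in the sketch below. The text gives the weights but not the aggregation rule, so summing the weights over the roster is an assumption made here for illustration; the roster is hypothetical, and players from leagues outside the list are assumed to contribute 0.

```python
# League weights as stated in the text.
LEAGUE_WEIGHTS = {"NBA": 1.0, "Euroleague": 1.0, "Eurocup": 0.5, "BCL": 0.5, "NCAA": 0.5}

def league_effect_score(player_leagues):
    """Sum of league weights over a roster (assumed aggregation rule)."""
    return sum(LEAGUE_WEIGHTS.get(league, 0.0) for league in player_leagues)

# Hypothetical 12-player roster, one league label per player.
roster = ["NBA", "NBA", "Euroleague", "Eurocup", "BCL", "NCAA",
          "Other", "Other", "Other", "Other", "Other", "Other"]
score = league_effect_score(roster)  # 1 + 1 + 1 + 0.5 + 0.5 + 0.5
```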
Moreover, two formulas are defined:
- (i)
.
- (ii)
.
Additionally, we check whether team pace (tempo) affects performance, i.e., whether faster or slower teams tend to perform well; pace is measured by the number of possessions of a team per game of the competition. These additional effects are tested with regression models (Models 11–13) and with Spearman and Pearson correlations.