1. Introduction
In the realm of sports, analytics has brought about a major change. The use of advanced tracking technologies not only offers organizations and coaches vital insights regarding athlete performance but also generates a wealth of data [1]. The extensive data produced serve as a catalyst, empowering coaches to refine their decision-making and strategic approaches [2]. This insight extends beyond shaping roster composition, cost reduction, and increasing team value [3,4]. Moreover, this wave of innovation not only amplifies team competitiveness but also injects a new level of excitement into sports for fans. Real-time access to detailed statistical information improves the fan experience by providing a stronger connection between spectators and the complexity of the game.
Football, as the most widely recognized and followed sport worldwide, provides an ideal arena for the application and research of sports analytics. The game’s complexity, combined with its massive global fan base, provides a rich setting for the application of advanced analytical methods. In this context, this research focuses on forecasting a player’s performance by predicting the number of goals a player is likely to score in the upcoming season based on historical data.
Noteworthy studies in the field of sports analytics are mentioned below. The authors in [
5] conducted two experiments related to football, focusing on team and player performance prediction. In the first experiment, they employed two tactics. The primary objective of the first approach was to forecast whether a team would secure a better position in the table for the 2017–2018 season compared to the previous two seasons. Using the random forest algorithm, this method achieved an accuracy of 70%. The second strategy involved simulating football matches for the 2018–2019 season to categorize results as home victories, away wins, or draws. The English Premier League exhibited the highest match outcome accuracy at 57%, while the Spanish La Liga had the lowest root mean squared error (RMSE). In their second experiment, the researchers explored the characteristics and moves during a game that could impact a defender’s rating. The dataset included 59 central defenders from the English Premier League during the 2016–2017 season. They employed the multiple linear regression model with backward elimination, achieving an R-squared metric of 0.867. In our study, we focus on player performance by considering all the various positions that players occupy on the field, aiming to predict the total number of goals scored.
In [6], the researchers utilized the Wyscout public dataset to forecast player positions using sports performance and psychological attributes. Six key indicators, encompassing accuracy of shot, accuracy of simple pass, accuracy of glb (ground loose ball), accuracy of defending duel, accuracy of air duel, and accuracy of attacking duel, were selected as input variables to train a BP neural network. The model’s hyperparameter combinations were evaluated using k-fold cross-validation. Ultimately, the model attained an accuracy rate of 77%. Compared with this study, our research advances by using player positions, along with other variables, to enhance the prediction of the total number of goals scored.
Furthermore, injuries in sports pose a threat for both individuals and teams, with possible long-term consequences for players’ careers and the overall effectiveness and achievements of sports clubs. These injuries frequently necessitate extensive recovery times, affecting team performance and match outcomes. Thus, injuries are of great importance in the world of sports.
In 2020, a study [7] aimed to investigate the effectiveness of machine learning (ML) in detecting injury risk factors among elite male youth footballers. The research involved analyzing 355 athletes who underwent a series of neuromuscular tests (anthropometric measurements, single-leg countermovement jump (SLCMJ), and tuck jump assessments). The results highlighted various factors associated with injury risk. The most common were asymmetry in the SLCMJ, 75% hop, Y-balance, tuck jump knee valgus, and anthropometric measures.
Additionally, in 2022, researchers conducted a study focusing on predicting injury risk in professional football players using body composition parameters and physical fitness evaluations. Their research, which comprised 36 male players from the First Portuguese Soccer League during the 2020–2021 season, looked at 22 distinct characteristics. Using elastic net analysis, sectorial postures, body height, sit-and-reach performance, one-minute push-up count, handgrip strength, and 35 m linear speed were found to be the most important variables in predicting injury risk for elite football players. Notably, ridge regression was the most accurate model, with an RMSE of 0.591 for predicting the frequency of potential injury occurrences [8]. This study differs in focus from our research; however, both studies utilize regression models, among other techniques, to predict their target variables.
Football teams are also using wearable gadgets during training and matches to track players’ physical abilities. These devices help experts analyze data and provide useful insights to clubs for better player management and strategic planning. The rising use of wearable technology highlights its growing importance in influencing football-related decisions.
Specifically, in 2022, researchers in [9] attempted to construct a model for predicting lower-body injuries in male footballers resulting from over- or undertraining, leveraging wearable technology. It is widely recognized that predicting injuries remains challenging due to individual biological variations and players’ psychophysical conditions. The study utilized Catapult wearable global positioning trackers to gather data during both training sessions and matches. Among the algorithms, XGBoost produced the highest accuracy, reaching 90%. The utilization of wearable devices can improve player performance analysis by delivering real-time data on metrics such as heart rate and movement patterns. This information can be essential for obtaining more accurate results.
1.1. Related Work
Understanding and predicting football players’ performance is an important aspect of sports analytics. Extensive research has been conducted in this area, with the goal of uncovering crucial findings that will benefit the broader field of football analytics. This subsection provides a brief overview of the related work that has influenced our understanding of predicting a player’s performance.
In [10], the researchers undertook a study on predicting football player performance, specifically focusing on overall performance value. They developed separate models based on player position, with linear regression achieving an accuracy of 84.34%. Additionally, when predicting a player’s future market value based on the performance values of the first model, the algorithm demonstrated 91% accuracy. With this approach, coaches should be able to identify football potential without bias stemming from factors such as team budget or league competitiveness.
In 2018, a study was developed with the goal of predicting English Premier League football outcomes [
11]. The dataset covered a period of 11 seasons, with the training phase comprising 9 seasons (from 2005 to 2014), followed by two seasons of testing (from 2014 to 2016). The home/away attribute emerged as one of the most important features. This attribute depicts whether a team plays at its home stadium or not. Predicting football outcomes posed challenges, notably due to the substantial occurrence of draws, which constitute 25% of the testing dataset. Various models, including Gaussian naïve Bayes, support vector machine, random forest, and gradient boosting, were evaluated during the experimentation. The best model was gradient boosting, which achieved a ranked probability score (RPS) of 0.2158 from weeks 6 to 38 in the English Premier League over the 2 seasons.
Another study examined how situational variables and performance indicators affect match outcomes in the English Premier League during the 2017–2018 season. Using decision trees, it was discovered that scoring first was the most important factor. Clearances, shots, and possession percentage had varying importance depending on the opponent’s quality. The findings can assist coaches and managers in setting goals for players and teams during training and games [12].
1.2. Research Overview
This study focuses on football, a globally known sport. Its aim is to predict a player’s performance in terms of goals using historical data from the preceding four seasons (2018–2019 to 2021–2022) and to conduct the evaluation on the final season (2022–2023). Specifically, this study includes players from four leagues: Bundesliga, Premier League, La Liga, and Serie A. Additionally, a dataset comprising players from all four leagues was constructed.
Data collection relied on a reliable source, Sports Reference. Data were collected for seasons 2017–2018 to 2022–2023 and covered more than 5000 players. Furthermore, preprocessing and feature engineering were necessary to format the dataset appropriately. This process included transforming the data into a historical format (season lag features) and creating a second version of the dataset restricted to players within the top 30% in terms of scoring performance. Subsequently, each version was subdivided into three cases based on the attributes utilized in the training phase, as detailed in Section 2.2.1 (Data Collection).
Various ML algorithms were evaluated, including linear and ridge regression, random forest, gradient boosting, XGBoost, and multilayer perceptron. The effectiveness of the models was measured using metrics like mean absolute error (MAE), mean square error (MSE), root mean square error (RMSE), mean absolute percentage error (MAPE), and R-squared.
1.3. Contributions
This paper presents a comparative study of the four major European leagues: Bundesliga, Premier League, La Liga, and Serie A. This comparison underscores the strengths and weaknesses of various ML models, providing insights into their effectiveness. Our findings suggest that XGBoost should be considered a strong candidate for predicting the total number of goals for datasets structured similarly to those in this study.
Additionally, our research identifies the league-specific datasets that yield the most effective performance prediction outcomes. By analyzing attributes such as player positions, historical performance metrics, and other relevant variables, we pinpoint the key factors that contribute to accurate goal prediction across different leagues.
2. Materials and Methods
This section outlines the processes and complications involved in the methodology. It examines the entire data collection process, starting from scraping data to cleansing and feature engineering. The goal is to illustrate the modifications made to the dataset before implementing ML algorithms. Alongside, the research hypotheses that guide our investigation are presented.
Model comparisons are assessed using metrics such as MAE, MSE, RMSE, MAPE, and R-squared. These metrics provide a comprehensive evaluation of the performance of our predictive models.
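As a rough illustration of how these five metrics relate to one another, the following sketch computes them with NumPy for a handful of made-up goal counts. The function name `regression_metrics` and the treatment of zero actual values in MAPE are assumptions made for this example; the study does not state how players with zero goals were handled in that metric.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE, MAPE (%) and R-squared for two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred

    mae = np.mean(np.abs(err))
    mse = np.mean(err ** 2)
    rmse = np.sqrt(mse)

    # Assumption: MAPE is undefined when the actual value is zero, so this
    # sketch averages only over players with a non-zero actual goal count.
    nonzero = y_true != 0
    mape = np.mean(np.abs(err[nonzero] / y_true[nonzero])) * 100 if nonzero.any() else float("nan")

    # R-squared: 1 - (residual sum of squares / total sum of squares).
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"MAE": mae, "MSE": mse, "RMSE": rmse, "MAPE": mape, "R2": r2}

# Illustrative (made-up) actual and predicted goals for four players.
metrics = regression_metrics([10, 5, 0, 2], [8, 5, 1, 3])
```

MAE is expressed directly in goals, which is why it is the easiest of the five to interpret for this task.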
2.1. Research Questions/Hypothesis
How do different ML models, including linear regression, ridge regression, random forest, gradient boosting, XGBoost, and multilayer perceptron (MLP), compare in predicting football player performance?
Which league-specific dataset demonstrates the most effective performance prediction outcomes, and based on what attributes?
2.2. Methodology
This section provides an in-depth exploration of the procedures that contribute to the effectiveness of the analytical process.
The initial step involves data collection through scraping from Sports Reference, a platform offering athlete statistics across various sports. The dataset includes football players from the 2017–2018 to 2022–2023 seasons, exceeding 5000 players with a total of 35 features. The dataset, comprising a diverse range of football players representing various nations, teams, and leagues, has been narrowed down to exclusively include players from four leagues: Bundesliga, Premier League, La Liga, and Serie A. The ML algorithms are trained using data covering the 2018–2019 to 2021–2022 seasons, with the subsequent 2022–2023 season employed as the test dataset.
During the initial phases of data preprocessing, the dataset was refined to include only players who participated in all seasons, resulting in a significant reduction in data. Moreover, two versions of the dataset were created: one containing all players and another containing only players in the top 30% based on goal performance. Finally, three different cases were developed regarding the training features. Case 1 considered the features most strongly correlated with the target variable ‘Goals’; case 2 involved the removal of one attribute from highly correlated pairs; and case 3 retained all available columns.
To enhance realism in evaluating football player performance predictions, we converted the dataset to include past statistics. Season lag features from previous seasons were introduced, allowing models to forecast season goals using data from preceding seasons. Additionally, we introduced a ‘Previous_Gls’ column, indicating the player’s goal count in the prior season.
To fulfill the primary objective of this study, various ML models were used, including linear regression, ridge regression, random forest, gradient boosting, XGBoost, and multilayer perceptron (MLP).
For this study, several libraries were used, including Pandas for data manipulation and analysis and Matplotlib for data visualization. Furthermore, scikit-learn (sklearn) was utilized to implement and evaluate the ML models. Ultimately, the assessment of the outcomes was conducted using five metrics: MAE, MSE, RMSE, MAPE, and R-squared. These metrics provide valuable insights into the dependability and effectiveness of the models, and their analysis is presented in the following section.
Figure 1 presents a flowchart for the proposed methodology.
2.2.1. Data Collection
Data collection is a central part of this study. In the realm of football statistics, a plethora of websites offer information on clubs and players. Consequently, ensuring the legitimacy of the acquired data becomes critical, as any inaccuracies could jeopardize the precision of the results.
Specifically, this study’s dataset was obtained from Sports Reference [13], a renowned organization that provides significant data coverage across a wide range of sports. The data were scraped using the Octoparse tool.
Table 1 presents information about the number of records and features in the initial scraped dataset per league, and Table 2 provides feature descriptions.
The dataset incorporates a set of features detailed below. Nevertheless, it is important to note that the dataset employed for the ML algorithms underwent significant transformations, resulting in a format distinct from the one described above. A detailed analysis of these changes is presented in Section 2.2.2 (Pre-Processing) and Section 2.2.3 (Feature Engineering).
2.2.2. Pre-Processing
First, a series of modifications were applied to the dataset to improve its suitability for the prediction models. An initial adjustment involved converting object-type columns into strings. Furthermore, attributes ‘Rank’ and ‘90s-Minutes played divided by 90’ were eliminated due to their lack of meaningful information.
Another observation was that some players played for multiple football clubs during the same season. For such players, the decision was made to calculate the average value of each numeric column. In the ‘Squad’ column, a composite string was generated by concatenating the team names for these players.
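The merging of multi-club seasons described above could be sketched in Pandas as follows. This is only an illustrative sketch: the column names `Player`, `Season`, `Gls`, and `Min`, the sample values, and the `/` separator are assumptions, and only the ‘Squad’ column name comes from the text.

```python
import pandas as pd

# Hypothetical rows: player A appears twice in one season (mid-season transfer).
rows = pd.DataFrame({
    "Player": ["A", "A", "B"],
    "Season": ["2020-2021", "2020-2021", "2020-2021"],
    "Squad": ["Club X", "Club Y", "Club Z"],
    "Gls": [4, 2, 7],
    "Min": [900, 600, 2700],
})

numeric_cols = ["Gls", "Min"]

# Average every numeric column and concatenate the squad names
# (the "/" separator is an assumption).
aggregations = {col: "mean" for col in numeric_cols}
aggregations["Squad"] = lambda squads: "/".join(squads)

merged = rows.groupby(["Player", "Season"], as_index=False).agg(aggregations)
```

After this step each player has exactly one row per season, which the later season lag transformation relies on.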
A key criterion in this phase was the inclusion of players who participated in all six seasons, leading to the removal of those who did not meet this criterion. Consequently, the dataset underwent a significant reduction. Bundesliga experienced a reduction from 1185 distinct players to 109 players, while the Premier League saw a decrease from 1298 unique players to 112 players. Likewise, La Liga witnessed a decline from 1431 individual players to 97 players, and Serie A had a decrease from 1441 unique players to 106. Additionally, a supplementary dataset was introduced, encompassing players from all leagues (424 players in total). To distinguish players and their respective leagues, a new ‘League’ column was introduced, featuring numerical codes (e.g., League = 1 for Bundesliga, League = 2 for Premier League, League = 3 for La Liga, League = 4 for Serie A).
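A minimal sketch of this filtering step is shown below, assuming one row per player and season and hypothetical column names `Player` and `Season`. The league code 1 for Bundesliga follows the example given in the text; everything else is illustrative.

```python
import pandas as pd

ALL_SEASONS = [
    "2017-2018", "2018-2019", "2019-2020",
    "2020-2021", "2021-2022", "2022-2023",
]

# Hypothetical records: player A appears in all six seasons, player B in only five.
records = pd.DataFrame(
    [{"Player": "A", "Season": s} for s in ALL_SEASONS]
    + [{"Player": "B", "Season": s} for s in ALL_SEASONS[:5]]
)

# Count the distinct seasons per player and keep only players present in every season.
seasons_played = records.groupby("Player")["Season"].transform("nunique")
complete = records[seasons_played == len(ALL_SEASONS)].copy()

# Numerical league code used to tell leagues apart in the combined dataset.
complete["League"] = 1  # 1 = Bundesliga in the text's example coding
```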
A further distinction was made, focusing on players’ goal performance during the most recent season (2022–2023). Two versions of the dataset were created: one containing all players and another containing only those ranked in the top 30% based on their goals in the last season.
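One possible way to select the elite subset is a quantile threshold on last-season goals, as sketched below. The data are made up, and the use of the 70th percentile with a strict inequality is an assumption; the text does not specify how ties at the cut-off were handled.

```python
import pandas as pd

# Hypothetical goal counts for ten players in the 2022-2023 season.
last_season = pd.DataFrame({
    "Player": list("ABCDEFGHIJ"),
    "Goals": [0, 0, 1, 1, 2, 3, 4, 6, 9, 15],
})

# Players scoring above the 70th percentile form the top 30% ("elite") subset.
threshold = last_season["Goals"].quantile(0.70)
elite = last_season[last_season["Goals"] > threshold]
```

With many players on zero goals, a percentile cut like this removes most of them, which is consistent with the stated motivation of testing models on players whose goal counts are harder to predict.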
Subsequently, the focus shifted towards determining the features to be included in the algorithms, a critical process known as dimensionality reduction. Dimensionality reduction decreases the total number of input variables in a dataset [14].
Case 1 contained the 10 columns with the highest correlation to the target variable ‘Goals’. The selection criteria were based on the Pearson correlation coefficient. In Case 2, a distinctive approach involved calculating the percentage of correlation for each pair of attributes. As a result, one column from each highly associated pair was kept. Finally, case 3 included all available columns from the dataset. Notably, in the dataset containing the total number of players across all leagues, the column ‘League’ was introduced to distinguish the players.
Table 3 presents the features for Case 1, Case 2, and Case 3.
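The selection logic behind Case 1 and Case 2 can be sketched as follows. This is a hedged illustration rather than the study’s code: the helper names, the toy columns `xG`, `npxG`, and `Age`, and the 0.9 correlation threshold are all assumptions, since the text does not report the threshold used for “highly correlated” pairs.

```python
import numpy as np
import pandas as pd

def top_k_correlated(df, target, k):
    """Case 1 style: the k features with the highest absolute Pearson correlation to the target."""
    corr = df.select_dtypes("number").corr()[target].drop(target).abs()
    return corr.sort_values(ascending=False).head(k).index.tolist()

def drop_one_of_correlated_pairs(df, target, threshold):
    """Case 2 style: for each highly correlated feature pair, keep one column and drop the other."""
    features = df.select_dtypes("number").drop(columns=[target])
    corr = features.corr().abs()
    # Look only at the upper triangle so that each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return [col for col in corr.columns if col not in to_drop]

# Toy data: xG and npxG are nearly collinear, Age is only weakly related to Goals.
toy = pd.DataFrame({
    "xG": [1, 2, 3, 4, 5, 6],
    "npxG": [1, 2, 3, 4, 5, 7],
    "Age": [30, 22, 28, 25, 31, 24],
    "Goals": [1, 2, 2, 5, 5, 7],
})

case1 = top_k_correlated(toy, "Goals", k=2)
case2 = drop_one_of_correlated_pairs(toy, "Goals", threshold=0.9)
```

In the study, Case 1 uses k = 10, and Case 3 simply keeps every column.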
2.2.3. Feature Engineering
As outlined earlier, the initial objective was to train the ML algorithms using the dataset of the first four seasons (2018–2019 to 2021–2022) and subsequently evaluate their performance on the test set from the last season (2022–2023). Nonetheless, since key same-season statistics such as expected goals and assists are included among the inputs, using this approach could produce results that are too optimistic and do not reflect realistic outcomes.
To address this concern, an alternative methodology was implemented. To avoid reliance on current-season statistics, the dataset underwent transformation to incorporate historical data. Each row displayed past statistics, enabling the algorithm to predict, for example, a player’s goal count for the 2018–2019 season using data from the previous season (2017–2018). Additionally, a new column, ‘Previous Goals,’ was introduced, denoting the player’s goals for the 2017–2018 season, while the ‘Goals’ column indicated the goals for the subsequent season (2018–2019). Therefore, in the final dataset, each row depicts the seasonal performance statistics of a player from the preceding season, aiming to forecast the upcoming season’s goals.
The primary goal was to anticipate how many goals a player would score in the 2022–2023 season using data from the previous season (2021–2022). This strategy, known as season lag features, uses historical data to identify patterns that contribute to accurate predictions.
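A minimal Pandas sketch of the season lag transformation is given below. The column names `Season_Start`, `Gls`, and `xG`, the sample values, and the three-season toy range are assumptions for illustration; `Previous_Gls` and `Goals` follow the column names mentioned in the text.

```python
import pandas as pd

# Hypothetical per-season statistics, one row per player and season.
stats = pd.DataFrame({
    "Player": ["A", "A", "A", "B", "B", "B"],
    "Season_Start": [2018, 2019, 2020, 2018, 2019, 2020],
    "Gls": [5, 7, 6, 0, 1, 2],
    "xG": [4.2, 6.1, 5.8, 0.3, 0.9, 1.5],
})

stats = stats.sort_values(["Player", "Season_Start"])

# Target: the same player's goals in the FOLLOWING season.
stats["Goals"] = stats.groupby("Player")["Gls"].shift(-1)

# Rows without a following season have no target and are dropped.
# The remaining features (Previous_Gls, xG, ...) all describe the previous season.
lagged = (
    stats.dropna(subset=["Goals"])
    .rename(columns={"Gls": "Previous_Gls"})
    .assign(
        Goals=lambda d: d["Goals"].astype(int),
        Target_Season_Start=lambda d: d["Season_Start"] + 1,
    )
)

# Chronological split: the most recent target season is held out for testing.
train = lagged[lagged["Target_Season_Start"] < 2020]
test = lagged[lagged["Target_Season_Start"] == 2020]
```

Because every feature in a row predates the season being predicted, no same-season statistic can leak into the model.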
2.2.4. Modeling
Six different ML algorithms were used to predict the number of goals each player would score in the 2022–2023 season: linear regression, ridge regression, random forest, gradient boosting, XGBoost, and multilayer perceptron (MLP).
Linear regression is a statistical approach for modeling the relationship between a dependent variable and one or more independent variables by fitting a straight line through the data points. The goal is to select the best-fitting line that minimizes the discrepancy between observed and predicted values [15]. Linear regression was chosen for its simplicity, providing a strong baseline for comparison. Ridge regression, in turn, is a statistical technique that mitigates multicollinearity in linear regression, which arises when independent variables are strongly correlated [16]. A ridge regression model estimates coefficients using a biased estimator instead of ordinary least squares (OLS), resulting in lower variance and reduced standard error, making it useful for addressing multicollinearity issues [2].
Random forest is an ensemble method that can be used for both regression and classification problems. It constructs many decision trees during training and returns the average prediction (regression) or most frequent class (classification) of the individual trees. It is robust, scalable, and good at handling complicated datasets while minimizing overfitting [17].
Another algorithm is gradient boosting. It is a model that combines an ensemble of weak learners, most commonly decision trees. It works by fitting each new tree to the residual errors of the preceding ones, progressively increasing the model’s prediction accuracy [18]. The XGBoost algorithm is an optimized implementation of gradient boosting. It integrates advanced features such as regularization and tree pruning techniques [19]. These models were chosen because of their ability to handle different data distributions through ensemble techniques.
Lastly, MLP is an artificial neural network with multiple layers of neurons. It includes an input layer, one or more hidden layers, and an output layer. Except for the input layer, each layer employs nonlinear activation functions to capture complex data relationships. MLP is good for capturing intricate patterns in data [20].
Furthermore, grid search was employed for all algorithms to optimize hyperparameters, aiming to find the most effective combination of values for each model [21]. The hyperparameter values for the best models of each scenario are available in Appendix A, in Table A1, Table A2, Table A3, Table A4 and Table A5. Additionally, a feature importance analysis was performed to determine the impact of input variables on model prediction. All algorithms produced feature importance scores, except for MLP. Furthermore, predictions were rounded to integers to ensure compatibility with the discrete structure of goal counts. Finally, metrics were calculated for both training and testing datasets to enable thorough evaluation and comparison.
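The tuning and rounding steps can be sketched with scikit-learn as follows, using ridge regression as a stand-in for any of the six models. The synthetic data, the `alpha` grid, the 5-fold cross-validation, and the MAE-based scoring are assumptions for illustration; the actual grids used in the study are those reported in Appendix A.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data: three features and a non-negative integer "goals" target.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = np.clip(np.rint(3 + 2 * X[:, 0] + rng.normal(scale=0.5, size=60)), 0, None)

X_train, X_test = X[:45], X[45:]
y_train, y_test = y[:45], y[45:]

# Exhaustive search over a small, hypothetical grid of regularization strengths.
search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0]},
    scoring="neg_mean_absolute_error",
    cv=5,
)
search.fit(X_train, y_train)

# Goal counts are discrete, so continuous predictions are rounded to integers.
predicted_goals = np.rint(search.predict(X_test)).astype(int)

# Computing the error on both splits helps reveal overfitting.
train_mae = np.mean(np.abs(y_train - np.rint(search.predict(X_train))))
test_mae = np.mean(np.abs(y_test - predicted_goals))
```

The same pattern applies to the other estimators by swapping `Ridge()` and the parameter grid.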
4. Discussion
This section comprehensively examines the outcomes of each league and its cases. The final section offers an in-depth comparison of these findings, providing valuable insights into the performance across the different datasets. Specifically, the XGBoost algorithm performed best with Serie A’s dataset and attributes from case 1, featuring the 10 most correlated features related to the target variable ‘Goals’, with a MAE of 1.29. Next, we address the feature importance conclusions and the threats to the validity of our research.
4.1. Implications
Random forest proves to be the most effective algorithm within the Bundesliga dataset when utilizing the elite players dataset in case 2. Contrary to expectations, an algorithm trained on a significantly reduced dataset demonstrates superior results. Attributes in case 2 were selected by evaluating the correlation for each pair of features, with one feature chosen from each highly correlated pair. Notably, the feature with the highest importance value is ‘Previous goals’, indicating its critical role in predicting future outcomes. Additionally, ‘Age’ emerges as another significant feature, suggesting potential variations in player performance based on age. Another observation arises from the standard deviation calculation. The reduced dataset has a standard deviation of 3.79 goals, while the MAE is 1.71. For a player who scores 10 goals, the prediction therefore deviates by about 1.71 goals on average, which is well within the natural spread of the data and indicates a good outcome.
In the analysis of the Premier League dataset encompassing all players, superior performance was observed in case 1 with the XGBoost algorithm. Features such as expected goals, non-penalty expected goals, and previous goals emerged as the most influential factors, highlighting their significant impact on the target variable. Expected goals represent a statistical metric in football used to assess the likelihood of a goal being scored from a given shot. Furthermore, the disparity between the train and test metrics values suggests a minimal occurrence of overfitting.
Case 1, along with its associated attributes, was utilized to achieve the best results, employing the entire La Liga’s dataset with MLP.
Once again, for Serie A, the XGBoost algorithm in case 1 delivered the best performance. It is worth mentioning that in this league, the values between the training and testing sets are closely related to those in other leagues.
Lastly, XGBoost emerged as the top-performing algorithm in case 3, including data from all four leagues. Conversely, for the reduced dataset, gradient boosting yielded the best metric errors. Expected goals played a pivotal role in both scenarios. As anticipated, the dataset containing all players exhibited superior results.
To better understand their distinction, we will provide an illustration involving one player from each league and their performances. These players were T. Müller (Bundesliga), D. Welbeck (Premier League), K. Benzema (La Liga), and N. Barella (Serie A).
Typically, predictions for center backs and other defenders are more precise than those for forwards, because defenders rarely score during a season.
Figure 2 summarizes the performance predictions for players from all leagues.
T. Müller, a forward, scored a total of 7 goals in the 2022–2023 season. The random forest algorithm predicted 8 goals when applied to the elite players dataset, resulting in a 1-goal discrepancy. Similarly, the XGBoost algorithm, using data from all four leagues, also predicted 8 goals for Müller, aligning closely with the random forest prediction. For D. Welbeck, who is a midfielder, XGBoost precisely predicted the number of goals he scored in the final season using the league-specific dataset. In contrast, XGBoost, employing data from all four leagues, predicted 5 goals. In the instance of K. Benzema, a striker who scored 19 goals, the MLP model accurately predicted the actual goals on the La Liga dataset, while the XGBoost algorithm on the extended dataset predicted 14 goals, deviating from the actual count by 5 goals. Lastly, N. Barella, a skillful midfielder in Serie A, contributed 6 goals during the 2022–2023 season. XGBoost was the best algorithm in both datasets, predicting 5 goals.
Certainly, there are cases where the actual goals of players perfectly match the predictions made by all models. Conversely, there are also instances where the predicted goals for players show significant deviations from their actual achievements.
4.2. Comparative Insights across Leagues
As already mentioned, the comparison of algorithms across the different cases is a crucial aspect of this study. The results are summarized in Table 24, unveiling interesting insights.
The Serie A dataset exhibited the lowest error metric values compared to any other league or the combined dataset. Among these metrics, MAE stands out as the most crucial, indicating the proximity of predictions to the actual number of goals. For instance, with XGBoost, MAE was recorded at 1.29, suggesting that if a player scored 10 goals, the prediction would typically fall within the range of 9 to 11 goals. While not all predictions achieve perfect accuracy, overall, they demonstrate high efficacy.
Moreover, it is noteworthy that only in the Bundesliga did an algorithm utilizing the reduced dataset yield superior results. Specifically, when the XGBoost algorithm was tested on the entire dataset across all leagues, MAE reached its second-lowest value. This underscores the notion that, despite cultural and gameplay differences among leagues, the comprehensive dataset generally produces more accurate predictions. Nevertheless, it is essential to recognize that the other error metrics do not exhibit similar trends.
Most significantly, the XGBoost algorithm emerges as the overall victor in three out of five scenarios, indicating its effectiveness across diverse datasets. Consequently, data scientists are advised to prioritize this algorithm when dealing with similar datasets. Furthermore, we validate key factors for each league regarding feature importance that contribute to achieving more accurate results.
Additionally, researchers must acknowledge the significance of their studies, as the results obtained surpass many previous endeavors. Although some studies may report marginally superior error metrics, it is crucial to acknowledge the challenge of comparing results across different datasets, given the substantial variations present in different sports and their dynamics.
4.3. Feature Importance
We discuss here the key features that consistently demonstrated high importance values and significantly contributed to the accuracy of our goal prediction models. Previous research [22] identified expected goals and previous goals as the most influential features in goal prediction.
Our feature analysis concluded that a player’s goal-scoring performance is significantly related to previous goals, expected goals, and expected goals per 90 min. Previous goals represent a player’s historical scoring performance, which is a reliable predictor of future goal-scoring potential. Expected goals (xG) provide a player’s probability of scoring based on the opportunities presented to them. Finally, expected goals per 90 min (xG per 90′) is a normalized metric allowing for a fair comparison of players with variable minutes of presence on the pitch.
4.4. Threats to Validity
As previously mentioned, our experiments yielded high accuracy and good results. However, it is important to note several potential threats to the validity of this research.
One of the initial assumptions was to divide the dataset by picking the top 30% of athletes based on goal scoring. This choice was intended to assess the performance of algorithms on players whose actual goals were not zero. Positions like goalkeepers and defenders often have few scoring opportunities, making it easier for algorithms to anticipate their scored goals for the 2022–2023 season. By focusing on players with non-zero goals, we hoped to generate a more challenging and informative evaluation of the prediction models.
Another potential threat was the selection of variables. We considered three different scenarios regarding feature selection. In the first scenario, we selected the 10 features with the highest Pearson correlation to the target variable ‘Goals’. In the second scenario, we calculated the correlation percentage for each pair of attributes and retained only one column from each highly correlated pair, resulting in a total of 13 features. The third scenario included all available features from the dataset.
5. Conclusions and Future Work
In this study, our primary objective was to predict the scoring performance of football players, meaning the total goals, using historical data. Data were scraped from 4 leagues: Bundesliga, Premier League, La Liga, and Serie A, reaching more than 5000 players originally for six seasons. Seasons 2018–2019 to 2021–2022 were used to train the models, while season 2022–2023 was used as the testing dataset.
We assessed the performance of six ML algorithms: linear regression, ridge regression, random forest, gradient boosting, XGBoost, and multilayer perceptron. We trained two versions of each algorithm, one using the entire dataset and another using only the elite players (top 30%). A further division was conducted based on the features utilized for training. The effectiveness of each model was evaluated through various metrics such as MAE, MSE, RMSE, MAPE, and R-squared.
The findings revealed that the XGBoost algorithm outperformed the other models in 3 out of 5 categories. Specifically, the best results were found in the Serie A dataset, where the MAE was 1.29. It is evident that sports analytics will play a crucial role in the future, driven by the large volumes of data. Sports clubs will progressively employ more data scientists to optimize player performance across metrics like physical fitness, technical skills like striking accuracy, and other aspects essential for maximizing on-field contributions.
In summary, this study provides significant knowledge for football clubs, managers, and coaches. It allows them to make better decisions and predict player performance, resulting in overall team improvement. Our research findings indicate the feasibility of accurately predicting a player’s performance in the upcoming season based on historical data. However, further improvements should be made to obtain greater precision and efficacy.
Data scientists can explore various ways to improve their work, including leveraging more advanced and complex statistics or using player statistics for every match of the season to enrich the dataset utilized for model training. Furthermore, insights from our study can be used to estimate a team’s total goals by aggregating individual player performance. This could offer information on the team’s ranking prospects.
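As a hedged illustration of the aggregation idea, the sketch below sums hypothetical per-player goal predictions by club. The column name `Predicted_Goals`, the club names, and the values are invented for the example; this aggregation is proposed as future work and was not evaluated in the study.

```python
import pandas as pd

# Hypothetical per-player predictions for the upcoming season.
predictions = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "Squad": ["Club X", "Club X", "Club Y", "Club Y"],
    "Predicted_Goals": [12, 3, 8, 0],
})

# Estimated team total: the sum of the predicted goals of each squad's players.
team_totals = (
    predictions.groupby("Squad")["Predicted_Goals"]
    .sum()
    .sort_values(ascending=False)
)
```

Such totals ignore transfers, own goals, and players outside the dataset, so they would only be a rough indicator of a team’s prospects.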
As discussed in the Introduction, wearable devices or cameras can provide valuable insights into players’ physical movements and conditions. These devices track a range of statistics, including heart rate and breathing patterns, which can improve goal-scoring performance analysis [23]. Additionally, analyzing Twitter data using sentiment analysis offers a novel approach to understanding the psychological factors influencing player performance [24]. This assists coaches and teams in decision-making and morale management [25]. Finally, injury analytics play a crucial role in optimizing player performance and reducing injury risks. Teams can leverage data on player fitness and movement patterns to enhance player well-being and maintain peak physical condition throughout the season [26].