6.1. SHAP-Based Feature Importance Analysis
To improve interpretability and performance, SHapley Additive exPlanations (SHAP) analysis was employed for feature selection. SHAP assigns importance values to features based on their contribution to model predictions, ensuring consistent and locally accurate attributions [45,46]. We applied SHAP to four models (LR, DT, GB, and XGB) and identified the top 25 features influencing salary prediction. For all SHAP analyses, the feature importance values are computed on the test dataset, so that the interpretation reflects the models’ predictive behavior on unseen data.
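The ranking step described above can be illustrated with a minimal sketch. It assumes a linear model with independent features, for which the SHAP value of feature i on a sample x has the closed form w_i (x_i − mean_i). The weights, feature values, and the use of the test rows as the background distribution are hypothetical choices for illustration. The tree-based models in this study would instead require a tree explainer from a SHAP library.

```python
from statistics import fmean


def linear_shap(weights, background_means, row):
    """SHAP values of one row for a linear model with independent features."""
    return [w * (x - m) for w, x, m in zip(weights, row, background_means)]


def rank_features(names, weights, test_rows, top_k=25):
    """Rank features by mean absolute SHAP value over the test rows."""
    # Assumption: the background expectation is the per-feature test mean.
    means = [fmean(col) for col in zip(*test_rows)]
    shap_rows = [linear_shap(weights, means, row) for row in test_rows]
    importance = [fmean(abs(v) for v in col) for col in zip(*shap_rows)]
    ranked = sorted(zip(names, importance), key=lambda p: p[1], reverse=True)
    return ranked[:top_k]


# Hypothetical toy data, not taken from the study.
names = ["goals", "xG", "age"]
weights = [2.0, 1.0, -0.5]
test_rows = [[10.0, 8.0, 24.0], [2.0, 3.0, 31.0], [6.0, 7.0, 27.0]]

ranking = rank_features(names, weights, test_rows)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

For any single row, the SHAP values returned by `linear_shap` sum to the difference between the model output for that row and the output at the background mean, which is the local accuracy property mentioned in the text.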
Figure 2a–d show these features ranked by SHAP values. LR serves as a baseline, highlighting only three meaningful features: goals, xG, and xGdiff. This reflects its limitation to direct linear relationships, ignoring collinear or weaker predictors.
DT captures non-linear interactions, identifying xGChain, age, and league_weight as top features. It emphasizes players’ involvement in buildup play, career stage, and league strength, with xGBuildup and assistsRatio also being influential. Traditional metrics like goals, xG, and key_passes show little impact, indicating that DT values broader offensive and career factors over isolated scoring stats.
GB similarly ranks xGChain, age, and league_weight highest, reinforcing the importance of dynamic offensive roles and league quality. Features like assistsRatio, xGBuildup, and xGg support the significance of playmaking and scoring potential. Moderate influence appears for shooting and passing metrics (kppm, shpg, gpm), while defensive and discipline metrics have less effect. Lower-ranked features include key_passes and npg, highlighting a focus on overall attacking contribution.
XGB, an optimized GB variant, also prioritizes xGChain, age, and league_weight, emphasizing passing sequences, career stage, and league competitiveness. Additional important features include assistsRatio, aGgRatio, xGg, and encoded categorical variables such as teamEncoded and positionen, reflecting team strength and positional roles, respectively.
Overall, SHAP analysis consistently identifies league competitiveness (league_weight), goal-related contributions, and positional context as key drivers in predicting player salaries. The growing importance of league_weight across more complex models highlights their enhanced capacity to capture contextual performance factors. Notably, league_weight ranks among the top three contributors in all three tree-based models, with SHAP values of approximately 0.20, 0.31, and 0.35 for DT, GB, and XGB, respectively. This indicates that our newly engineered league weighting feature contributes substantially to the models’ ability to explain salary discrepancies, supporting its relevance in capturing league-specific salary effects.
Individual goal contributions, specifically goals and assists, rank among the key predictors, either directly (goals in LR) or through derived features such as assistsRatio in the tree-based models, owing to their fundamental role in evaluating player performance and value. Goals directly impact match results and are often the primary factor fans, clubs, and sponsors associate with a player’s effectiveness. Assists highlight a player’s ability to create scoring opportunities for teammates, reflecting creativity and vision, qualities highly prized in professional soccer. These two statistics are widely recognized as the most concrete and quantifiable measures of offensive contribution, which strongly influence contract negotiations and salary levels. Their recurring importance across models underscores the centrality of direct involvement in goal-scoring actions as a primary driver of player salary.
Moreover, two advanced performance metrics, xGChain and xGBuildup, emerge as key contributors in the SHAP analysis. xGChain quantifies the total expected goals (xG) value of all attacking actions a player participates in during a possession sequence culminating in a shot. This metric effectively captures the player’s cumulative influence throughout buildup and passing sequences, highlighting their integral role in offensive plays beyond merely taking shots. Conversely, xGBuildup isolates the expected goals generated during the buildup phase by excluding key passes and shots. This emphasizes a player’s contribution in advancing the ball and facilitating scoring opportunities prior to decisive attacking actions. Collectively, these metrics enrich the evaluation of a player’s offensive impact by encompassing both direct and indirect contributions to goal-scoring chances, underscoring the multifaceted nature of attacking involvement.
While we acknowledge the presence of highly correlated features such as xG and xGg or aG and aGg, these engineered per-game metrics provide important standardization for fair comparisons across players with differing playtime. Given the robustness of the tree-based models used in our study to multicollinearity, we retain these features to capture complementary information.
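The definitions of xGChain and xGBuildup given above can be sketched in a few lines. This is a simplified illustration, not the data provider's implementation. The possession format, the player labels, and the rule that a player is credited once per possession are assumptions made for the example.

```python
from collections import defaultdict


def chain_and_buildup(possessions):
    """Return per-player (xGChain, xGBuildup) totals.

    Each possession is a dict holding the xG of its final shot and a list of
    (player, action) events, where action is "pass", "key_pass", or "shot".
    """
    xg_chain = defaultdict(float)
    xg_buildup = defaultdict(float)
    for poss in possessions:
        # xGChain: every player involved in the possession gets the shot's xG.
        involved = {player for player, _ in poss["events"]}
        # xGBuildup: only involvement other than the key pass or the shot counts.
        buildup = {
            player
            for player, action in poss["events"]
            if action not in ("key_pass", "shot")
        }
        for player in involved:
            xg_chain[player] += poss["shot_xg"]
        for player in buildup:
            xg_buildup[player] += poss["shot_xg"]
    return dict(xg_chain), dict(xg_buildup)


# Hypothetical possessions, not taken from the study's dataset.
possessions = [
    {"shot_xg": 0.30, "events": [("A", "pass"), ("B", "key_pass"), ("C", "shot")]},
    {"shot_xg": 0.10, "events": [("B", "pass"), ("A", "key_pass"), ("A", "shot")]},
]
chain, buildup = chain_and_buildup(possessions)
```

In this toy example, player C only takes a shot, so C accumulates xGChain but no xGBuildup, which mirrors the distinction the two metrics are meant to draw.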
6.2. Salary Discrepancy Analysis Using GB
As shown in Section 5, the GB model is the most effective for predicting player salaries. Using this model, we analyze salary discrepancies by comparing actual versus predicted earnings. Players are classified as underestimated (predicted salary higher than actual) or overestimated (actual salary higher than predicted). To improve prediction reliability, we use the top 25 influential features identified via SHAP analysis. This approach highlights potential market inefficiencies and league-specific salary patterns. We also examine factors such as age, league, and position to better understand the drivers behind these discrepancies.
Table 10 lists the top five underestimated and overestimated players. Each of the underestimated players has an actual salary of GBP 2.1 M, yet their predicted salaries reveal large gaps. L. Messi shows the largest shortfall with a predicted GBP 26.3 M, a GBP 24.2 M difference. Neymar (GBP 21.4 M) and A. Griezmann (GBP 11.9 M) also have significant underestimations. These gaps suggest valuation misalignments influenced by contracts, wage policies, salary structures, and aging effects. Among overestimated players, actual salaries exceed predicted values, indicating potential overvaluation. Marcelo earns GBP 21.95 M compared to a predicted GBP 1.99 M, a GBP 19.96 M gap. E. Hazard (GBP 24.9 M), L. Suárez (GBP 24.7 M), David de Gea (GBP 16.82 M), and G. Bale (GBP 20.7 M) follow with similar discrepancies. Furthermore, Table 11 shows that most players (3449) are underestimated, while 1789 are overestimated, indicating that, under the definition above, the model’s predictions exceed actual salaries more often than they fall below them.
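The classification rule defined earlier in this section reduces to the sign of the difference between predicted and actual salary. The sketch below applies it to the two extreme cases quoted from Table 10. The function name and the "matched" label for a zero gap are assumptions introduced for illustration.

```python
def classify(actual, predicted):
    """Label a player by the sign of (predicted - actual); salaries in GBP M."""
    gap = predicted - actual
    if gap > 0:
        # Predicted salary higher than actual: underestimated player.
        return "underestimated", gap
    if gap < 0:
        # Actual salary higher than predicted: overestimated player.
        return "overestimated", -gap
    return "matched", 0.0


# Values quoted in the text for Table 10 (GBP millions).
messi_label, messi_gap = classify(actual=2.1, predicted=26.3)
marcelo_label, marcelo_gap = classify(actual=21.95, predicted=1.99)

print(messi_label, round(messi_gap, 2))
print(marcelo_label, round(marcelo_gap, 2))
```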
We analyzed how salary predictions vary across different age groups and player positions to gain deeper insights into the internal factors influencing salary estimations. Both features consistently emerge as significant predictors in our ML models, underscoring their strong influence on salary outcomes. Moreover, age and position are well-established determinants of player compensation in professional football, significantly affecting a player’s market value, performance expectations, and career trajectory. These factors are critical for understanding salary disparities.
Figure 3a presents a detailed comparison of predicted and actual salary differences across career stages. The results show that the model effectively captures salary trends by age, particularly for mid-career players (26–30 years), where predictions closely match actual salaries. This indicates that the model incorporates key performance factors influencing peak-year salaries. For younger players (15–25 years), some underestimations likely reflect real-world contracts, where emerging talents earn lower base wages before renegotiations. Thus, these predictions may realistically represent early-stage salary trajectories rather than systematic bias. For older players (31+ years), occasional overestimations may stem from legacy contracts or wage structures extending beyond peak performance, indicating that the model accounts for salary stability due to long-term agreements.
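A group-level comparison of this kind can be sketched as a mean signed gap (predicted minus actual) per career stage, using the three age ranges named above. The player records are hypothetical and chosen only to show the mechanics. They are not values from the study.

```python
from collections import defaultdict
from statistics import fmean


def age_group(age):
    """Map an age to the career-stage buckets used in the text."""
    if age <= 25:
        return "15-25"
    if age <= 30:
        return "26-30"
    return "31+"


def mean_gap_by_age(players):
    """players: iterable of (age, actual_salary, predicted_salary) tuples."""
    gaps = defaultdict(list)
    for age, actual, predicted in players:
        gaps[age_group(age)].append(predicted - actual)
    # Positive mean: predictions exceed actual salaries on average.
    return {group: fmean(values) for group, values in gaps.items()}


# Hypothetical records (age, actual GBP M, predicted GBP M).
players = [
    (21, 1.0, 1.6),
    (24, 0.8, 1.0),
    (28, 5.0, 5.1),
    (29, 4.0, 3.9),
    (33, 6.0, 4.0),
]
gap_by_age = mean_gap_by_age(players)
```

A mean gap near zero for a group corresponds to the close match reported for mid-career players, while a clearly positive or negative mean points to systematic under- or overestimation for that group.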
Figure 3b analyzes the impact of player position on salary predictions, revealing close alignment between predicted and actual salaries across positions. Goalkeepers and defensive midfielders show consistent salary differences, reflecting more predictable salary structures, likely driven by standardized contracts. In contrast, forwards and midfielders exhibit greater variability, with some earning significantly less than predicted, indicating more fluctuation in salary expectations for these roles.
In summary, the distribution of salary discrepancies reveals a predominance of underestimated players, that is, players whose predicted salary exceeds their actual salary. Predictions align more closely with actual salaries for mid-career players, while younger and older players experience some under- and overestimations, respectively. Positional analysis further highlights that goalkeepers and defensive midfielders have more stable salary patterns, whereas forwards and midfielders show greater variability. Moreover, the group-level patterns illustrated in Figure 3a,b provide important context for the individual-level discrepancies reported in Table 10. The consistent underestimation of prominent forwards and attacking midfielders in their mid-career stage (e.g., Messi, Neymar) appears to stem from the model’s reliance on measurable performance indicators, which may not fully capture broader determinants of salary such as commercial value or contractual nuances. Conversely, the overestimation of older defenders and goalkeepers (e.g., Marcelo, David de Gea) suggests that the model does not always reflect the influence of long-term or legacy contracts that may not align with current on-field contributions. This alignment between group-level trends and individual-level deviations enhances the interpretability of the results and offers insight into the structural factors influencing salary prediction accuracy.
6.3. Comparison with State-of-the-Art
Several prior studies investigated player salary prediction using traditional ML techniques, often relying on basic features like goals, assists, and appearances. As shown in Table 12, these models generally lack advanced performance metrics or contextual feature engineering, which limits their predictive power.
For example, Smith et al. [47] reported a moderate $R^2$ of 0.71 using standard performance indicators, though error metrics such as RMSE were not provided. Huang and Zhang [10] reported an $R^2$ of 0.90 together with an error metric, but did not incorporate predicted performance metrics such as expected goals. In contrast, the proposed model integrates refined predicted metrics like xGg and aGg, along with league and position adjustments, which enhances its predictive capability. It achieves an $R^2$ of 0.91, the highest among the compared studies, and reduces prediction error as measured by MAE. Although our dataset and models differ from those used in prior work, the proposed model outperforms the results reported in those studies. This comparison highlights the benefits of integrating domain-specific features with advanced ML techniques, leading to more precise and dependable salary predictions in soccer.
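For reference, the two metrics used in this comparison can be computed as in the sketch below, which follows the standard definitions of the coefficient of determination and the mean absolute error. The salary values are hypothetical and serve only to make the example runnable.

```python
def r2_score(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical salaries, not taken from the study.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

r2 = r2_score(actual, predicted)
mae = mean_absolute_error(actual, predicted)
```

Because $R^2$ depends on the variance of the target in each dataset, values reported on different datasets are indicative rather than directly comparable.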