1. Introduction
Baseball analytics has evolved dramatically from traditional box-score statistics to sophisticated sabermetric analysis and machine learning applications, fundamentally transforming how teams evaluate player performance and construct competitive rosters. Sabermetrics, the systematic analysis of baseball through objective evidence and statistical methods, emerged from pioneering work seeking to quantify player value beyond conventional metrics [
1,
2,
3]. Machine learning has emerged as the next frontier in sports analytics, with studies demonstrating superior predictive accuracy compared to traditional statistical approaches [
4,
5,
6,
7,
8]. For example, Huang and Li [
9] achieved 94.18% accuracy in predicting Major League Baseball (MLB) game outcomes using artificial neural networks, establishing the viability of computational approaches for baseball prediction. Moreover, Li et al. [
10] developed machine learning models for predicting MLB game outcomes by systematically exploring and selecting relevant features from comprehensive performance metrics. The superiority of machine learning methodologies stems from their capacity to model non-linear relationships and complex interactions between variables without imposing restrictive distributional assumptions [
11,
12].
However, a critical gap persists in sports analytics research: the near-exclusive geographic concentration of baseball analytics research on MLB [
5,
7,
9]. When findings emerge exclusively from MLB, a league with specific characteristics including unrestricted international player markets, multi-divisional structures, and an established luxury-tax (rather than hard salary cap) payroll framework, implicit assumptions arise that these findings generalize universally to other professional baseball contexts. This assumption remains largely untested empirically. The question of whether machine learning models trained on MLB data effectively generalize to structurally different professional baseball environments has received minimal investigation.
The Korean Baseball Organization (KBO), South Korea’s premier professional baseball league since 1982, represents an ideal context to examine this generalizability question [
13]. Several Korean players have achieved notable success in MLB, such as Shin-soo Choo and Dae-ho Lee. Despite its competitive quality and international significance, academic research on KBO performance analytics remains virtually nonexistent. The KBO exhibits several structural characteristics that fundamentally distinguish it from MLB: a 10-team single-division format with all teams competing directly against one another, distinctive foreign player limitations (maximum three foreign players per team, two pitchers maximum), a unique step-ladder tournament playoff system, and ownership by major Korean conglomerates rather than independent private investors [
14]. Furthermore, the KBO provides a unique opportunity to examine a fundamental unresolved question in sabermetrics: whether defense-dependent metrics (ERA, WHIP) or defense-independent metrics (FIP) better predict team playoff qualification, and whether this relationship differs from MLB contexts.
A particularly important unresolved question in sabermetric research concerns the relationship between defense-dependent and defense-independent pitching metrics for predicting team success. Within sabermetrics, Fielding Independent Pitching (FIP) is considered the “gold standard” for evaluating individual pitcher talent because it isolates pitcher skill from defensive factors [
15]. FIP was developed by Tom Tango based on foundational research by Voros McCracken, establishing that pitchers have minimal control over balls in play [
15,
16,
17]. Unlike ERA, which reflects both pitcher performance and team defense, FIP focuses exclusively on events entirely within pitcher control: strikeouts, walks, hit-by-pitches, and home runs [
18]. Empirical research demonstrates that FIP is more predictive of future pitcher performance than current-season ERA, making it superior for evaluating true pitcher talent independent of defensive context [
19,
20,
21].
While it may seem intuitively plausible that team-level outcomes depend on combined pitching-defense performance rather than isolated pitcher skill, this hypothesis has not been systematically tested empirically. Prior research on FIP focuses almost exclusively on individual pitcher evaluation and future performance projection within MLB contexts [
19,
20,
21]. Crucially, no prior study has quantified the relative predictive importance of defense-dependent (ERA, WHIP) versus defense-independent (FIP) metrics for team playoff qualification using machine learning feature importance analysis, nor has any research examined whether the hierarchical structure of metric importance differs across professional baseball leagues with distinct organizational characteristics. Furthermore, the KBO’s structural differences from MLB, including a single-division format (eliminating divisional imbalances), strict foreign player limits (three per team, with a maximum of two pitchers), and a step-ladder playoff system (the top five teams qualify), may fundamentally alter which performance metrics predict postseason success. Testing whether MLB-derived sabermetric principles generalize to structurally different leagues constitutes a critical test of the universality versus context-specificity of baseball analytics frameworks.
This study focuses exclusively on pitching metrics rather than comprehensive team performance indicators for several reasons. First, the research addresses a specific theoretical question in sabermetrics—whether defense-dependent (ERA, WHIP) or defense-independent (FIP) metrics better predict team playoff qualification—which requires isolating pitching performance from offensive and other factors. Second, extensive baseball analytics literature emphasizes pitching’s dominant role in team success, often summarized by the axiom “pitching wins championships,” making it a natural focal point for classification analysis [
3,
7]. Third, isolating pitching enables a clean interpretation of patterns in metric importance without confounding from hitting or managerial factors. While comprehensive models integrating offense, defense, and organizational characteristics would provide more complete prediction frameworks and constitute valuable future research directions, the pitching-only approach provides conceptual clarity for addressing the ERA-FIP question while establishing baseline discriminatory capability.
This study addresses these critical gaps by developing and evaluating machine learning models for classifying KBO playoff advancement using comprehensive pitching metrics from 2015 to 2024. While this study employs established machine learning algorithms rather than introducing novel methodological innovations, its contribution lies in three areas: (1) systematic application of these methods to the under-researched KBO context, enabling the first comprehensive playoff prediction framework for this league; (2) direct empirical comparison of defense-dependent (ERA, WHIP) versus defense-independent (FIP) metrics at the team level, addressing a fundamental theoretical question in sabermetrics; and (3) examination of whether metric importance patterns observed in MLB generalize to structurally different professional baseball leagues. This approach prioritizes domain-specific analytical insights and cross-league generalizability over algorithmic novelty. Unlike traditional forecasting studies that aim to predict future outcomes, this study adopts a retrospective classification design. By utilizing season-end pitching metrics, we aim to identify the underlying performance constructs that historically distinguished playoff qualifiers from non-qualifiers in the KBO.
2. Methodology
2.1. Data Collection and Dataset Construction
All data were compiled from two authoritative sources: (1) the official KBO league website (
www.koreabaseball.com), which maintains comprehensive official statistics verified by the league, and (2) KBReport (
http://www.kbreport.sbs), a publicly available baseball statistics website providing detailed team-level metrics consistent with official league records. These dual sources were utilized to ensure data accuracy and enable manual cross-referencing of reported statistics. Specifically, team-level metrics for all 100 team-seasons were compared across both platforms to verify data integrity. This verification process yielded a 100% match across all 1000 data points (100 team-seasons × 10 variables), with zero discrepancies observed. Consequently, no manual error correction or tolerance-based cleaning was necessary, providing strong assurance of data provenance and reliability. The dataset covers ten consecutive KBO regular seasons spanning 2015–2024, providing comprehensive coverage of the league’s recent competitive history. This ten-year period was strategically selected for several methodological reasons. First, it ensures consistency, as it encompasses only seasons following the KBO’s stabilization in its current 10-team structure. Second, this timeframe provides sufficient temporal breadth (N = 100 team-seasons) to enable robust machine learning model training and validation, as machine learning algorithms typically require at least 50–100 observations for stable performance estimation [
22]. Third, the ten-year span captures varied competitive conditions, managerial approaches, and strategic philosophies, enhancing the generalizability of findings across diverse contexts within the KBO.
The fundamental unit of analysis is the team-season. With 10 teams competing across 10 consecutive seasons, the final dataset comprises 100 unique observations. All 100 team-seasons contain complete data with no missing values, as pitching statistics represent season-aggregated official league records compiled from comprehensive game-by-game data. The 2020 season, despite COVID-19 disruptions, is included as all teams played the same 144-game schedule, ensuring comparability. This team-level aggregation is appropriate for the research question, as the study examines how team-level pitching performance characteristics predict team-level playoff qualification outcomes.
The dependent variable is a binary categorical variable, Postseason_Qualification, operationalized as follows: 1 = Postseason Qualification (teams finishing in the top five of the regular season standings, qualifying for postseason competition) and 0 = No Postseason Qualification (teams finishing sixth through tenth, failing to qualify for postseason competition). The top-five cutoff reflects the current KBO playoff format, wherein the top five teams advance to the step-ladder tournament.
2.2. Independent Variables and Measurement
The model utilizes ten continuous, team-level pitching statistics as independent variables. Traditional metrics (defense-dependent) include ERA (Earned Run Average), the average number of earned runs allowed per nine innings pitched calculated as ERA = (Earned Runs × 9)/Innings Pitched, and WHIP (Walks and Hits per Inning Pitched), the average number of baserunners allowed per inning pitched calculated as WHIP = (Walks + Hits)/Innings Pitched [
23].
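For illustration, the two traditional metrics defined above can be computed directly from season totals. The following sketch uses hypothetical staff totals, not values from the study dataset:

```python
def era(earned_runs: float, innings_pitched: float) -> float:
    """Earned Run Average: earned runs allowed per nine innings pitched."""
    return earned_runs * 9 / innings_pitched

def whip(walks: float, hits: float, innings_pitched: float) -> float:
    """Walks and Hits per Inning Pitched: baserunners allowed per inning."""
    return (walks + hits) / innings_pitched

# Illustrative season totals for a hypothetical KBO pitching staff
print(round(era(640, 1280.0), 3))         # 4.5 earned runs per nine innings
print(round(whip(470, 1350, 1280.0), 3))  # 1.422 baserunners per inning
```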
Advanced sabermetric metrics (defense-independent) include FIP (Fielding Independent Pitching), an estimate of pitcher ERA based exclusively on strikeouts, walks, and home runs; K/9 (Strikeouts per 9 Innings), the average number of strikeouts recorded per nine innings pitched; BB/9 (Walks per 9 Innings), the average number of walks issued per nine innings pitched; K/BB (Strikeout-to-Walk Ratio), calculated as Strikeouts/Walks; and HR/9 (Home Runs per 9 Innings), the average number of home runs allowed per nine innings [
17,
23].
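The advanced metrics follow the same per-nine-innings pattern. The sketch below uses the conventional Tango FIP formula; the study does not state its exact FIP implementation or league constant, so the constant of 3.10 and all input totals are purely illustrative assumptions:

```python
def fip(hr: int, bb: int, hbp: int, k: int, ip: float,
        constant: float = 3.10) -> float:
    """Fielding Independent Pitching (conventional Tango formula).

    The constant is league- and season-specific, chosen so that
    league-average FIP equals league-average ERA; 3.10 is illustrative.
    """
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant

def per_nine(events: int, ip: float) -> float:
    """Generic rate stat: events per nine innings (K/9, BB/9, HR/9)."""
    return events * 9 / ip

def k_per_bb(k: int, bb: int) -> float:
    """Strikeout-to-walk ratio."""
    return k / bb

ip = 1280.0
print(round(fip(hr=120, bb=470, hbp=60, k=1000, ip=ip), 3))  # team FIP
print(round(per_nine(1000, ip), 3))                          # K/9
print(round(k_per_bb(1000, 470), 3))                         # K/BB
```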
Opponent performance metrics include Batting Average Against (BAA), calculated as Hits/At-Bats; Opponent On-Base Percentage (OBP), calculated as (Hits + Walks + HBP)/(At-Bats + Walks + HBP + Sacrifice Flies); and Opponent On-Base Plus Slugging (OPS), calculated as OBP + SLG, providing holistic assessment of offensive production against the pitching staff. These metrics were deliberately selected to provide a comprehensive assessment of pitching staff performance, incorporating both traditional baseball statistics and advanced sabermetric indicators. Complete computational formulas and operational definitions for all ten metrics are provided in
Supplementary Table S1.
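The opponent performance metrics can likewise be sketched from the formulas given above; the counting totals here are invented for illustration and are not drawn from the KBO dataset:

```python
def batting_average_against(hits: int, at_bats: int) -> float:
    """BAA: opponent hits divided by opponent at-bats."""
    return hits / at_bats

def opponent_obp(hits: int, walks: int, hbp: int,
                 at_bats: int, sac_flies: int) -> float:
    """Opponent on-base percentage, per the formula in the text."""
    return (hits + walks + hbp) / (at_bats + walks + hbp + sac_flies)

def opponent_ops(obp: float, slg: float) -> float:
    """Opponent on-base plus slugging."""
    return obp + slg

# Hypothetical opponent totals against one pitching staff
print(round(batting_average_against(1350, 4850), 3))
print(round(opponent_obp(1350, 470, 60, 4850, 40), 3))
print(round(opponent_ops(0.347, 0.394), 3))
```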
2.3. Machine Learning Models and Algorithms
Five well-established machine learning algorithms were selected for evaluation. These algorithms were chosen not for methodological novelty, but rather to enable systematic comparison with prior sports analytics research and to ensure that findings regarding metric importance are robust across diverse algorithmic approaches. The selected algorithms represent different modeling paradigms, parametric linear models, tree-based methods, kernel-based approaches, and neural architectures, providing comprehensive coverage of standard machine learning techniques commonly applied in sports analytics [
24,
25,
26,
27,
28,
29,
30].
Logistic Regression is a foundational statistical model for binary classification that models the probability of a categorical outcome by fitting a linear equation to the log-odds of the event, then transforming using a sigmoid function [
24]:
P(y = 1 | x) = 1 / (1 + e^(−(β0 + β1x1 + … + βkxk)))
A classification decision is made by comparing this probability to a predetermined threshold, typically 0.5 [
25]. It provides interpretability and computational efficiency but assumes linear separability in log-odds space, limiting performance on complex non-linear problems [
26].
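The log-odds-to-probability transformation and threshold rule described above can be sketched in a few lines. The coefficients below are arbitrary placeholders, not fitted values from the study:

```python
import math

def predict_proba(x, intercept, coefs):
    """Sigmoid transform of the linear log-odds: P(y = 1 | x)."""
    log_odds = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-log_odds))

def classify(x, intercept, coefs, threshold=0.5):
    """Binary decision by comparing the probability to a threshold."""
    return 1 if predict_proba(x, intercept, coefs) >= threshold else 0

# Hypothetical model in which lower ERA and WHIP raise playoff probability
p = predict_proba([4.4, 1.42], intercept=10.0, coefs=[-1.5, -2.0])
print(round(p, 3))  # probability strictly between 0 and 1
```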
Decision Trees recursively partition the dataset into increasingly homogeneous subsets based on feature values, selecting splits that maximize information gain and yielding an interpretable, rule-based model that can capture non-linear relationships but is prone to overfitting with high variance [
27]. Random Forest extends Decision Trees through an ensemble of trees trained on bootstrap samples with feature subsampling at each split, aggregating predictions via majority voting to reduce variance and improve generalization while maintaining the ability to model complex interactions [
28].
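The two ensemble ingredients named above, bootstrap sampling and majority voting, can be sketched without any modeling library. This is a conceptual illustration of the mechanism, not the Random Forest implementation used in the study:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Sample n observations with replacement (one tree's training set)."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Aggregate ensemble member predictions by majority voting."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
sample = bootstrap_sample(list(range(10)), rng)
print(sample)                          # duplicates are expected
print(majority_vote([1, 0, 1, 1, 0]))  # three of five trees vote 1 -> 1
```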
Support Vector Machines identify an optimal separating hyperplane that maximizes the margin between classes, and, through kernel functions such as polynomial or radial basis function kernels, can handle non-linearly separable data by implicitly mapping observations into higher-dimensional feature spaces [
29]. Artificial Neural Networks consist of interconnected layers of nodes (input, hidden, and output layers) with weighted connections updated via backpropagation, enabling the learning of highly non-linear patterns and interactions, but requiring careful specification of architecture and regularization to avoid overfitting, particularly in relatively small datasets [
30].
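The kernel trick mentioned above can be made concrete with the radial basis function kernel, which measures similarity between two observations without ever constructing the high-dimensional mapping explicitly. The feature values and gamma below are illustrative:

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """RBF kernel: similarity that implicitly maps observations into an
    infinite-dimensional feature space."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([4.4, 1.42], [4.4, 1.42]))           # identical points -> 1.0
print(round(rbf_kernel([4.4, 1.42], [4.9, 1.51]), 3))  # similarity decays with distance
```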
2.4. Model Evaluation Metrics
Model performance was evaluated using several standard metrics derived from the confusion matrix: Classification Accuracy, Precision, Recall, F1-Score, and Area Under the Receiver Operating Characteristic Curve (AUC) [
31].
In the confusion-matrix terminology used throughout, TP (True Positives) denotes correctly predicted postseason teams; FP (False Positives), non-postseason teams incorrectly predicted as postseason; FN (False Negatives), postseason teams incorrectly predicted as non-postseason; and TN (True Negatives), correctly predicted non-postseason teams.
Classification Accuracy, computed as (TP + TN)/(TP + TN + FP + FN), measures the overall proportion of correctly classified team-seasons. Precision, defined as TP/(TP + FP), quantifies the proportion of predicted postseason teams that actually qualified, while Recall, defined as TP/(TP + FN), quantifies the proportion of actual postseason teams correctly identified by the model. F1-Score, computed as 2 × (Precision × Recall)/(Precision + Recall), provides a harmonic mean that balances Precision and Recall, making it particularly informative under class imbalance. AUC summarizes the model’s ability to discriminate between postseason and non-postseason teams across all possible probability thresholds, with values closer to 1 indicating stronger discriminative performance.
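The four threshold-based metrics defined above reduce to a few lines of arithmetic over the confusion-matrix counts. The counts below are illustrative, not the study's results:

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, Precision, Recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts for a balanced 100-observation task
acc, prec, rec, f1 = classification_metrics(tp=40, fp=10, fn=10, tn=40)
print(acc, prec, rec, round(f1, 3))
```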
2.5. Model Training and Validation
All models were trained and evaluated using Orange3’s Test and Score widget (version 3.36) with stratified 5-fold cross-validation. This validation strategy partitions the dataset into five equal folds, iteratively training on four folds (
n = 80 team-seasons) and testing on the held-out fold (
n = 20 team-seasons) until all observations have served as test cases exactly once. Stratification ensures that each fold maintains the dataset’s class distribution (50% playoff-qualifying, 50% non-qualifying), preventing biased performance estimates due to unbalanced partitions. This cross-validation approach provides a robust estimate of generalization performance by averaging results across five independent train-test splits, which is particularly important for modest-sized datasets where single train-test splits may yield unstable estimates [
32]. Orange3 reports mean cross-validated performance across folds; fold-level predictions were exported via the Predictions output port and analyzed externally to compute fold-level variability (
Table 1). The dataset exhibits perfect class balance with 50 playoff-qualifying and 50 non-qualifying team-seasons, reflecting KBO’s playoff structure, where exactly 5 of 10 teams qualify annually. This natural balance eliminates the need for resampling techniques or class-weighted loss functions. Orange3’s graphical interface does not expose random seed parameters; the software manages random states internally through its default algorithmic implementations. For each model, we report mean performance with standard deviations across the five folds, and 95% confidence intervals were computed using the t-distribution with four degrees of freedom.
This study employs a retrospective classification design rather than a prospective forecasting approach. Season-end aggregate statistics are used to identify pitching performance characteristics that historically distinguished playoff-qualifying from non-qualifying teams, rather than to forecast outcomes prior to or during the season. Cross-validation partitions may therefore include team-seasons from different years within the same fold; this is appropriate for retrospective classification but would not constitute valid temporal forecasting.
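The fold-level summary described above (mean ± SD with a t-based 95% CI, four degrees of freedom for five folds) can be sketched as follows. The fold scores are hypothetical, and the two-tailed critical value t(0.975, df = 4) ≈ 2.776 is hard-coded rather than computed:

```python
import statistics

T_CRIT_95_DF4 = 2.776  # two-tailed 97.5th percentile of t with 4 df

def fold_summary(fold_scores):
    """Mean, sample SD, and t-based 95% CI across k cross-validation folds."""
    k = len(fold_scores)
    mean = statistics.mean(fold_scores)
    sd = statistics.stdev(fold_scores)           # sample SD (k - 1 df)
    half_width = T_CRIT_95_DF4 * sd / k ** 0.5   # assumes k == 5
    return mean, sd, (mean - half_width, mean + half_width)

# Hypothetical per-fold AUC values from one model
mean, sd, ci = fold_summary([0.75, 0.83, 0.71, 0.80, 0.78])
print(round(mean, 3), round(sd, 3), ci)
```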
2.6. Hyperparameter Specifications
All models were trained using Orange3 (version 3.36) default hyperparameters without manual tuning. This approach was deliberately adopted given the modest sample size (N = 100), where hyperparameter optimization risks overfitting to specific data partitions, and because the primary research objective concerns relative metric importance rather than maximizing classification accuracy. The use of untuned default parameters also enhances reproducibility, as results are fully determined by the software defaults without researcher degrees of freedom in parameter selection. The default specifications are as follows:
Logistic Regression employed L2 (Ridge) regularization with regularization strength C = 1.0.
Decision Tree used the entropy criterion (information gain) with no maximum depth restriction and a minimum of 2 instances per leaf.
Random Forest comprised 10 trees with no maximum depth constraint.
Support Vector Machine used the Radial Basis Function (RBF) kernel with cost parameter C = 1.0 and automatic gamma scaling.
Neural Network employed a single hidden layer with 100 neurons, ReLU activation function, Adam optimizer, and a maximum of 200 training iterations.
All models used Orange3’s default data preprocessing, which includes feature standardization (z-score normalization) applied within each cross-validation fold to prevent data leakage.
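For reference, the stated defaults can be recorded as a plain configuration mapping. The key names below are descriptive labels chosen for readability, not Orange3 API identifiers:

```python
# Orange3 (v3.36) default hyperparameters as stated in the text
DEFAULT_HYPERPARAMETERS = {
    "Logistic Regression": {"penalty": "L2 (Ridge)", "C": 1.0},
    "Decision Tree": {"criterion": "entropy", "max_depth": None,
                      "min_instances_per_leaf": 2},
    "Random Forest": {"n_trees": 10, "max_depth": None},
    "SVM": {"kernel": "RBF", "C": 1.0, "gamma": "auto-scaled"},
    "Neural Network": {"hidden_layers": (100,), "activation": "ReLU",
                       "optimizer": "Adam", "max_iterations": 200},
}

for model, params in DEFAULT_HYPERPARAMETERS.items():
    print(f"{model}: {params}")
```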
2.7. Collinearity Considerations
Several pitching metrics share mathematical components and exhibit intercorrelations. Notably, ERA and WHIP both incorporate hits and walks allowed, resulting in conceptual and statistical overlap. While the Variance Inflation Factor (VIF) analysis is commonly used to detect multicollinearity in regression contexts, it is designed specifically for detecting coefficient inflation in ordinary least squares estimation and is less directly applicable to the classification algorithms employed in this study [
33]. Tree-based ensemble methods, such as Random Forest, handle collinearity through their inherent architecture: bootstrap sampling and random feature subsampling at each split cause correlated features to compete for inclusion rather than produce inflated importance estimates [
34]. Similarly, regularization mechanisms in Logistic Regression (L2 penalty) and Neural Network (weight decay) mitigate collinearity effects by constraining coefficient magnitudes [
11]. Given these algorithmic properties, we interpret the importance of the correlated metrics (ERA, WHIP) collectively as reflecting defense-dependent run prevention rather than as independent contributions. Readers should note that individual importance values for highly correlated predictors cannot be cleanly separated and should be interpreted with appropriate caution.
2.8. Feature Importance Assessment
Feature importance was assessed using Orange3’s Rank widget, which computes Information Gain scores for each predictor variable. Information Gain measures the reduction in entropy when the dataset is split based on each feature, with higher values indicating greater discriminatory power for distinguishing between playoff-qualifying and non-qualifying teams [
35]. The reported importance values are raw scores computed by the Rank widget and are not normalized to sum to 1.0, allowing direct comparison of the relative discriminatory contribution of each pitching metric.
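Information Gain as described above is an entropy reduction, which the following sketch computes for a toy binary split. Note that Orange3's Rank widget discretizes continuous predictors before scoring; that preprocessing step is omitted here:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a class-label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction from partitioning `labels` into `groups`."""
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - weighted

# Toy example: a split that perfectly separates the two classes
labels = [1, 1, 0, 0]
print(information_gain(labels, [[1, 1], [0, 0]]))  # 1.0 bit
```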
3. Results
3.1. Comparative Model Performance
The five machine learning models were trained and tested on the KBO dataset from 2015 to 2024. Their classification performance was evaluated across the five key metrics described in the methodology, with results summarized in
Table 1.
Table 1 presents the classification performance of five machine learning models evaluated using stratified 5-fold cross-validation. Logistic Regression achieved the highest discriminatory performance with an AUC of 0.804 ± 0.057, followed by Neural Network (AUC = 0.799 ± 0.072), SVM (AUC = 0.770 ± 0.067), Random Forest (AUC = 0.740 ± 0.090), and Decision Tree (AUC = 0.643 ± 0.056). In terms of classification accuracy, SVM performed best (CA = 74.0 ± 5.5%), followed by Logistic Regression (CA = 73.0 ± 9.1%), Neural Network (CA = 72.0 ± 6.7%), Random Forest (CA = 65.0 ± 11.2%), and Decision Tree (CA = 63.0 ± 5.7%). The F1-scores followed a similar pattern, with SVM and Logistic Regression achieving the highest values (0.740 and 0.730, respectively).
These results indicate that simpler models (Logistic Regression) and neural approaches achieved comparable or superior performance to ensemble methods (Random Forest) for this classification task. The moderate performance levels (AUC ranging from 0.643 to 0.804; fold-level standard deviations of 0.056–0.090) suggest that pitching metrics provide meaningful but not deterministic information for distinguishing playoff-qualifying from non-qualifying teams. Decision Tree showed the lowest performance across all metrics, indicating that single tree structures are insufficient to capture the complex relationships between pitching metrics and playoff outcomes.
Figure 1 presents the Receiver Operating Characteristic (ROC) curves for all five classification models. The ROC curve plots True Positive Rate (Sensitivity) against False Positive Rate (1 − Specificity) across classification thresholds, with curves closer to the upper-left corner indicating superior discrimination.
As shown in
Figure 1, Logistic Regression (blue) demonstrates the highest discriminatory performance, with its curve positioned closest to the upper-left corner. Neural Network (green) and SVM (purple) show comparable intermediate performance, while Random Forest (orange) exhibits lower discrimination. Decision Tree (blue-green) shows the weakest performance, with its curve closest to the diagonal reference line representing random classification. This visual pattern is consistent with the AUC values reported in
Table 1.
3.2. Confusion Matrix Analysis
Given its highest AUC performance (0.804 ± 0.057), Logistic Regression was selected for detailed confusion matrix analysis.
Table 2 presents the classification results aggregated across 5-fold stratified cross-validation.
As shown in
Table 2, of the 50 team-seasons that achieved postseason qualification (Actual = 1), the model correctly predicted 36 (True Positives), misclassifying 14 as non-qualifying (False Negatives), yielding a sensitivity of 72.0%. Of the 50 team-seasons that did not achieve postseason qualification (Actual = 0), the model correctly predicted 37 (True Negatives), with 13 incorrectly predicted as playoff qualifiers (False Positives), yielding specificity of 74.0%.
The relatively balanced distribution between False Positives (13) and False Negatives (14) indicates that the model does not systematically favor either class, consistent with the balanced dataset structure. This moderate classification performance (overall accuracy = 73.0%) suggests that while pitching metrics provide meaningful discriminatory information, they alone cannot fully determine playoff qualification.
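The reported figures follow directly from the aggregated confusion-matrix counts in Table 2:

```python
# Aggregated counts across the five folds (Table 2)
tp, fn, tn, fp = 36, 14, 37, 13

sensitivity = tp / (tp + fn)    # recall for the playoff class
specificity = tn / (tn + fp)    # correct rejection of non-qualifiers
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"sensitivity={sensitivity:.1%}, specificity={specificity:.1%}, "
      f"accuracy={accuracy:.1%}")
# sensitivity=72.0%, specificity=74.0%, accuracy=73.0%
```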
The error patterns reveal the inherent complexity of playoff prediction based solely on pitching performance. Misclassified teams likely reflect cases in which other performance factors—such as hitting, fielding, or situational performance—compensated for or undermined pitching contributions. Teams with strong pitching but weak hitting may fail to qualify despite favorable pitching metrics (False Positives), while teams with moderate pitching but exceptional offensive performance may qualify despite suboptimal pitching profiles (False Negatives).
3.3. Feature Importance Analysis
To identify which pitching metrics exercised the most significant influence on classification outcomes, feature importance analysis was conducted using three complementary importance measures: Information Gain, Information Gain Ratio, and Gini Decrease. Results are presented in
Table 3, ranked by importance magnitude. As reported in
Table 3, across all three importance measures, a consistent hierarchy emerges. ERA and WHIP are unequivocally the two most important predictors of postseason qualification, with ERA demonstrating slightly greater importance (Information Gain = 0.176) than WHIP (Information Gain = 0.161). These defense-dependent metrics collectively account for 33.7% of total information gain, approximately one-third of the model’s predictive power.
These are followed by opponent OBP (Information Gain = 0.133) and K/BB (Information Gain = 0.106). Notably, FIP, the sabermetrically advanced metric specifically designed to isolate pitcher skill from defensive influence, ranks only fifth with an Information Gain of 0.098, substantially below the defense-dependent metrics. This positioning indicates that FIP’s defense-independent property, while theoretically compelling for individual pitcher evaluation, provides considerably less predictive power for team postseason qualification in the KBO context.
At the lower end of the importance hierarchy, K/9 demonstrates minimal predictive utility with an Information Gain of only 0.015. This finding aligns with prior research suggesting that raw strikeout rates, when disaggregated from supporting context (walk rates, home run rates), provide limited information about overall pitching effectiveness.
3.4. Descriptive Analysis of Pitching Metrics
To validate and contextualize the feature importance findings, descriptive statistical analysis compared pitching performance between postseason and non-postseason KBO teams using independent-samples
t-tests. As reported in
Table 4, postseason teams demonstrated significantly superior pitching performance across the most critical metrics. For ERA, postseason teams exhibited a substantially lower average (M = 4.389) than non-postseason teams (M = 4.944), with this difference statistically significant (
p < 0.001) and representing a meaningful difference of 0.555 runs per nine innings. Similarly, for WHIP, postseason teams allowed substantially fewer baserunners (M = 1.418) than non-postseason teams (M = 1.511), a highly significant difference (
p < 0.001).
As shown in
Table 4, this superior performance pattern held consistently across other high-importance metrics. OBP showed that postseason teams limited opposing batters to a mean OBP of 0.341, compared to 0.355 for non-postseason teams (
p < 0.001), representing approximately 4% fewer on-base opportunities allowed. K/BB showed postseason teams with higher ratios (M = 2.135) compared to non-postseason teams (M = 1.930), a statistically significant difference (
p = 0.001), indicating postseason pitching staffs struck out approximately 2.1 batters for every walk issued versus 1.9 for non-postseason teams. OPS revealed postseason teams limiting opponents to an aggregate OPS of 0.741 compared to 0.777 for non-postseason teams (
p < 0.001). At the same time, BAA demonstrated that postseason teams held opponents to 0.269, compared with 0.279 for non-postseason teams (
p < 0.001).
Defense-independent metrics demonstrated differential patterns. FIP showed a mean of 4.488 for postseason teams, compared with 4.731 for non-postseason teams, with this difference statistically significant (p = 0.011). However, this FIP difference was less pronounced than the corresponding ERA and WHIP differences, consistent with FIP’s lower rank in the feature importance analysis. BB/9 showed postseason teams issuing fewer walks (M = 3.418) than non-postseason teams (M = 3.774), a statistically significant difference (p = 0.001), while HR/9 demonstrated postseason teams allowing fewer home runs (M = 0.888) compared to non-postseason teams (M = 0.983), a significant difference (p = 0.027).
Finally, for the lowest-importance metric K/9, there was no statistically significant difference between postseason and non-postseason teams (p = 0.689), confirming its limited utility in predicting team success. Postseason teams averaged 7.080 strikeouts per nine innings, compared to 7.147 for non-postseason teams, a negligible difference, affirming that the strikeout rate alone, absent supporting context, provides minimal discriminatory information.
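The independent-samples t statistic underlying these comparisons can be sketched as follows. The text does not specify whether a pooled-variance or Welch test was used; the pooled form is shown here, and the input samples are toy data rather than KBO metrics:

```python
import statistics

def pooled_t_statistic(a, b):
    """Independent-samples t statistic assuming equal group variances
    (pooled form; the study does not specify pooled vs. Welch)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = (pooled_var * (1 / na + 1 / nb)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se

# Toy data: clearly separated groups yield a large-magnitude t
print(round(pooled_t_statistic([1, 2, 3], [4, 5, 6]), 2))  # -3.67
```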
5. Conclusions
This study establishes a machine learning framework for classifying KBO playoff qualification based on pitching metrics and contributes several key insights to baseball analytics. First, we demonstrate that machine learning models achieve moderate classification accuracy (73.0%) for KBO playoff qualification using pitching metrics alone, with Logistic Regression (AUC = 0.804 ± 0.057) and Neural Network (AUC = 0.799 ± 0.072) showing the strongest discriminatory performance. This moderate performance level indicates that pitching metrics provide meaningful but not deterministic information for playoff classification, consistent with the multifactorial nature of team success in professional baseball [
3,
4]. Second, we provide empirical resolution of a fundamental theoretical question in sabermetrics: while FIP is superior for individual pitcher evaluation [
16,
17], ERA and WHIP, defense-dependent metrics capturing pitching-defense synergy, are substantially more predictive of team playoff success, collectively accounting for 33.7% of model information gain versus FIP’s 9.8%. This highlights a critical distinction: team success emerges from pitching-defense synergy rather than the aggregation of individual pitchers’ isolated skills [
18]. Third, we establish the cross-cultural applicability of machine learning to non-MLB contexts [
7]. However, the specific metric importance patterns differ from conventional sabermetric wisdom, indicating that comprehensive analytics frameworks must account for league-specific characteristics rather than assuming universal MLB applicability [
9,
10].
The findings suggest potential implications for KBO team management: prioritizing pitching-defense synergy in roster construction, emphasizing walk avoidance and contact management alongside strikeout ability, and using empirical metric importance patterns to guide analytical resource allocation [
10]. The competitive performance of Logistic Regression affirms the practical value of simpler parametric models for sports classification tasks with modest sample sizes [
37]. Future research should incorporate offensive, defensive, and organizational factors; conduct comparative analyses across multiple professional baseball leagues; extend temporal analysis beyond 2024; and explore alternative modeling approaches for outcome prediction [
11,
26]. Ultimately, this research advances evidence-based approaches to player valuation and roster construction in professional sports, demonstrating that while machine learning principles are applicable across leagues, context-specific characteristics influence which metrics discriminate successful teams.