1. Introduction
Baseball analytics has evolved dramatically from traditional box-score statistics to sophisticated sabermetric analysis and machine learning applications, fundamentally transforming how teams evaluate player performance and construct competitive rosters. Sabermetrics, the systematic analysis of baseball through objective evidence and statistical methods, emerged from pioneering work seeking to quantify player value beyond conventional metrics [
1,
2,
3]. Machine learning has emerged as the next frontier in sports analytics, with studies demonstrating superior predictive accuracy compared to traditional statistical approaches [
4,
5,
6,
7,
8]. For example, Huang and Li [
9] achieved 94.18% accuracy in predicting Major League Baseball (MLB) game outcomes using artificial neural networks, establishing the viability of computational approaches for baseball prediction. Moreover, Li et al. [
10] developed machine learning models for predicting MLB game outcomes by systematically exploring and selecting relevant features from comprehensive performance metrics. The superiority of machine learning methodologies stems from their capacity to model non-linear relationships and complex interactions between variables without imposing restrictive distributional assumptions [
11,
12].
However, a critical gap persists in sports analytics research: the near-exclusive geographic concentration of baseball analytics research on MLB [
5,
7,
9]. When findings emerge exclusively from MLB, a league with specific characteristics including unrestricted international player markets, multi-divisional structures, and an established luxury-tax (rather than hard salary cap) payroll framework, implicit assumptions arise that these findings generalize universally to other professional baseball contexts. This assumption remains largely untested empirically. The question of whether machine learning models trained on MLB data effectively generalize to structurally different professional baseball environments has received minimal investigation.
The Korean Baseball Organization (KBO), South Korea’s premier professional baseball league since 1982, represents an ideal context to examine this generalizability question [
13]. Several Korean players have achieved notable success in MLB, such as Shin-soo Choo and Dae-ho Lee. Despite its competitive quality and international significance, academic research on KBO performance analytics remains virtually nonexistent. The KBO exhibits several structural characteristics that fundamentally distinguish it from MLB: a 10-team single-division format with all teams competing directly against one another, distinctive foreign player limitations (maximum three foreign players per team, two pitchers maximum), a unique step-ladder tournament playoff system, and ownership by major Korean conglomerates rather than independent private investors [
14]. Furthermore, the KBO provides a unique opportunity to examine a fundamental unresolved question in sabermetrics: whether defense-dependent metrics (ERA, WHIP) or defense-independent metrics (FIP) better predict team playoff qualification, and whether this relationship differs from MLB contexts.
A particularly important unresolved question in sabermetric research concerns the relationship between defense-dependent and defense-independent pitching metrics for predicting team success. Within sabermetrics, Fielding Independent Pitching (FIP) is considered the “gold standard” for evaluating individual pitcher talent because it isolates pitcher skill from defensive factors [
15]. FIP was developed by Tom Tango based on foundational research by Voros McCracken, establishing that pitchers have minimal control over balls in play [
15,
16,
17]. Unlike ERA, which reflects both pitcher performance and team defense, FIP focuses exclusively on events entirely within pitcher control: strikeouts, walks, hit-by-pitches, and home runs [
18]. Empirical research demonstrates that FIP is more predictive of future pitcher performance than current-season ERA, making it superior for evaluating true pitcher talent independent of defensive context [
19,
20,
21].
While it may seem intuitively plausible that team-level outcomes depend on combined pitching-defense performance rather than isolated pitcher skill, this hypothesis has not been systematically tested empirically. Prior research on FIP focuses almost exclusively on individual pitcher evaluation and future performance projection within MLB contexts [
19,
20,
21]. Crucially, no prior study has quantified the relative predictive importance of defense-dependent (ERA, WHIP) versus defense-independent (FIP) metrics for team playoff qualification using machine learning feature importance analysis, nor has any research examined whether the hierarchical structure of metric importance differs across professional baseball leagues with distinct organizational characteristics. Furthermore, the KBO’s structural differences from MLB, including a single-division format (eliminating divisional imbalances), strict foreign player limits (three per team, with a maximum of two pitchers), and a step-ladder playoff system (the top five teams qualify), may fundamentally alter which performance metrics predict postseason success. Testing whether MLB-derived sabermetric principles generalize to structurally different leagues constitutes a critical test of the universality versus context-specificity of baseball analytics frameworks.
This study focuses exclusively on pitching metrics rather than comprehensive team performance indicators for several reasons. First, the research addresses a specific theoretical question in sabermetrics—whether defense-dependent (ERA, WHIP) or defense-independent (FIP) metrics better predict team playoff qualification—which requires isolating pitching performance from offensive and other factors. Second, extensive baseball analytics literature emphasizes pitching’s dominant role in team success, often summarized by the axiom “pitching wins championships,” making it a natural focal point for classification analysis [
3,
7]. Third, isolating pitching enables a clean interpretation of patterns in metric importance without confounding from hitting or managerial factors. While comprehensive models integrating offense, defense, and organizational characteristics would provide more complete prediction frameworks and constitute valuable future research directions, the pitching-only approach provides conceptual clarity for addressing the ERA-FIP question while establishing baseline discriminatory capability.
This study addresses these critical gaps by developing and evaluating machine learning models for classifying KBO playoff advancement using comprehensive pitching metrics from 2015 to 2024. While this study employs established machine learning algorithms rather than introducing novel methodological innovations, its contribution lies in three areas: (1) systematic application of these methods to the under-researched KBO context, enabling the first comprehensive playoff prediction framework for this league; (2) direct empirical comparison of defense-dependent (ERA, WHIP) versus defense-independent (FIP) metrics at the team level, addressing a fundamental theoretical question in sabermetrics; and (3) examination of whether metric importance patterns observed in MLB generalize to structurally different professional baseball leagues. This approach prioritizes domain-specific analytical insights and cross-league generalizability over algorithmic novelty. Unlike traditional forecasting studies that aim to predict future outcomes, this study adopts a retrospective classification design. By utilizing season-end pitching metrics, we aim to identify the underlying performance constructs that historically distinguished playoff qualifiers from non-qualifiers in the KBO.
2. Methodology
2.1. Data Collection and Dataset Construction
All data were compiled from two authoritative sources: (1) the official KBO league website (
www.koreabaseball.com), which maintains comprehensive official statistics verified by the league, and (2) KBReport (
http://www.kbreport.sbs), a publicly available baseball statistics website providing detailed team-level metrics consistent with official league records. These dual sources were utilized to ensure data accuracy and enable manual cross-referencing of reported statistics. Specifically, team-level metrics for all 100 team-seasons were compared across both platforms to verify data integrity. This verification process yielded a 100% match across all 1000 data points (100 team-seasons × 10 variables), with zero discrepancies observed. Consequently, no manual error correction or tolerance-based cleaning was necessary, providing strong assurance of data provenance and reliability. The dataset covers ten consecutive KBO regular seasons spanning 2015–2024, providing comprehensive coverage of the league’s recent competitive history. This ten-year period was strategically selected for several methodological reasons. First, it ensures consistency, as it encompasses only seasons following the KBO’s stabilization in its current 10-team structure. Second, this timeframe provides sufficient temporal breadth (N = 100 team-seasons) to enable robust machine learning model training and validation, as machine learning algorithms typically require at least 50–100 observations for stable performance estimation [
22]. Third, the ten-year span captures varied competitive conditions, managerial approaches, and strategic philosophies, enhancing the generalizability of findings across diverse contexts within the KBO.
The fundamental unit of analysis is the team-season. With 10 teams competing across 10 consecutive seasons, the final dataset comprises 100 unique observations. All 100 team-seasons contain complete data with no missing values, as pitching statistics represent season-aggregated official league records compiled from comprehensive game-by-game data. The 2020 season, despite COVID-19 disruptions, is included as all teams played the same 144-game schedule, ensuring comparability. This team-level aggregation is appropriate for the research question, as the study examines how team-level pitching performance characteristics predict team-level playoff qualification outcomes.
The dependent variable is a binary categorical variable, Postseason_Qualification, operationalized as follows: 1 = Postseason Qualification (teams finishing in the top five of the regular season standings, qualifying for postseason competition) and 0 = No Postseason Qualification (teams finishing sixth through tenth, failing to qualify for postseason competition). The top-five cutoff reflects the current KBO playoff format, wherein the top five teams advance to the step-ladder tournament.
2.2. Independent Variables and Measurement
The model utilizes ten continuous, team-level pitching statistics as independent variables. Traditional metrics (defense-dependent) include ERA (Earned Run Average), the average number of earned runs allowed per nine innings pitched calculated as ERA = (Earned Runs × 9)/Innings Pitched, and WHIP (Walks and Hits per Inning Pitched), the average number of baserunners allowed per inning pitched calculated as WHIP = (Walks + Hits)/Innings Pitched [
23].
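For illustration, the two traditional metrics defined above can be computed directly from season totals. The following sketch uses hypothetical staff totals, not values from the study dataset:

```python
def era(earned_runs: float, innings_pitched: float) -> float:
    """Earned Run Average: earned runs allowed per nine innings pitched."""
    return earned_runs * 9 / innings_pitched

def whip(walks: float, hits: float, innings_pitched: float) -> float:
    """Walks and Hits per Inning Pitched: baserunners allowed per inning."""
    return (walks + hits) / innings_pitched

# Illustrative season totals for a hypothetical KBO pitching staff
print(round(era(640, 1280.0), 3))         # 4.5 earned runs per nine innings
print(round(whip(470, 1350, 1280.0), 3))  # 1.422 baserunners per inning
```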
Advanced sabermetric metrics (defense-independent) include FIP (Fielding Independent Pitching), an estimate of pitcher ERA based exclusively on strikeouts, walks, and home runs; K/9 (Strikeouts per 9 Innings), the average number of strikeouts recorded per nine innings pitched; BB/9 (Walks per 9 Innings), the average number of walks issued per nine innings pitched; K/BB (Strikeout-to-Walk Ratio), calculated as Strikeouts/Walks; and HR/9 (Home Runs per 9 Innings), the average number of home runs allowed per nine innings [
17,
23].
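The advanced metrics follow the same per-nine-innings pattern. The sketch below uses the conventional Tango FIP formula; the study does not state its exact FIP implementation or league constant, so the constant of 3.10 and all input totals are purely illustrative assumptions:

```python
def fip(hr: int, bb: int, hbp: int, k: int, ip: float,
        constant: float = 3.10) -> float:
    """Fielding Independent Pitching (conventional Tango formula).

    The constant is league- and season-specific, chosen so that
    league-average FIP equals league-average ERA; 3.10 is illustrative.
    """
    return (13 * hr + 3 * (bb + hbp) - 2 * k) / ip + constant

def per_nine(events: int, ip: float) -> float:
    """Generic rate stat: events per nine innings (K/9, BB/9, HR/9)."""
    return events * 9 / ip

def k_per_bb(k: int, bb: int) -> float:
    """Strikeout-to-walk ratio."""
    return k / bb

ip = 1280.0
print(round(fip(hr=120, bb=470, hbp=60, k=1000, ip=ip), 3))  # team FIP
print(round(per_nine(1000, ip), 3))                          # K/9
print(round(k_per_bb(1000, 470), 3))                         # K/BB
```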
Opponent performance metrics include Batting Average Against (BAA), calculated as Hits/At-Bats; Opponent On-Base Percentage (OBP), calculated as (Hits + Walks + HBP)/(At-Bats + Walks + HBP + Sacrifice Flies); and Opponent On-Base Plus Slugging (OPS), calculated as OBP + SLG, providing holistic assessment of offensive production against the pitching staff. These metrics were deliberately selected to provide a comprehensive assessment of pitching staff performance, incorporating both traditional baseball statistics and advanced sabermetric indicators. Complete computational formulas and operational definitions for all ten metrics are provided in
Supplementary Table S1.
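The opponent performance metrics can likewise be sketched from the formulas given above; the counting totals here are invented for illustration and are not drawn from the KBO dataset:

```python
def batting_average_against(hits: int, at_bats: int) -> float:
    """BAA: opponent hits divided by opponent at-bats."""
    return hits / at_bats

def opponent_obp(hits: int, walks: int, hbp: int,
                 at_bats: int, sac_flies: int) -> float:
    """Opponent on-base percentage, per the formula in the text."""
    return (hits + walks + hbp) / (at_bats + walks + hbp + sac_flies)

def opponent_ops(obp: float, slg: float) -> float:
    """Opponent on-base plus slugging."""
    return obp + slg

# Hypothetical opponent totals against one pitching staff
print(round(batting_average_against(1350, 4850), 3))
print(round(opponent_obp(1350, 470, 60, 4850, 40), 3))
print(round(opponent_ops(0.347, 0.394), 3))
```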
2.3. Machine Learning Models and Algorithms
Five well-established machine learning algorithms were selected for evaluation. These algorithms were chosen not for methodological novelty, but rather to enable systematic comparison with prior sports analytics research and to ensure that findings regarding metric importance are robust across diverse algorithmic approaches. The selected algorithms represent different modeling paradigms, parametric linear models, tree-based methods, kernel-based approaches, and neural architectures, providing comprehensive coverage of standard machine learning techniques commonly applied in sports analytics [
24,
25,
26,
27,
28,
29,
30].
Logistic Regression is a foundational statistical model for binary classification that models the probability of a categorical outcome by fitting a linear equation to the log-odds of the event, then transforming using a sigmoid function [
24]:
P(y = 1 | x) = 1 / (1 + e^(−(β0 + β1x1 + … + βkxk)))
A classification decision is made by comparing this probability to a predetermined threshold, typically 0.5 [
25]. It provides interpretability and computational efficiency but assumes linear separability in log-odds space, limiting performance on complex non-linear problems [
26].
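The log-odds-to-probability transformation and threshold rule described above can be sketched in a few lines. The coefficients below are arbitrary placeholders, not fitted values from the study:

```python
import math

def predict_proba(x, intercept, coefs):
    """Sigmoid transform of the linear log-odds: P(y = 1 | x)."""
    log_odds = intercept + sum(b * xi for b, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-log_odds))

def classify(x, intercept, coefs, threshold=0.5):
    """Binary decision by comparing the probability to a threshold."""
    return 1 if predict_proba(x, intercept, coefs) >= threshold else 0

# Hypothetical model in which lower ERA and WHIP raise playoff probability
p = predict_proba([4.4, 1.42], intercept=10.0, coefs=[-1.5, -2.0])
print(round(p, 3))  # probability strictly between 0 and 1
```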
Decision Trees recursively partition the dataset into increasingly homogeneous subsets based on feature values, selecting splits that maximize information gain and yielding an interpretable, rule-based model that can capture non-linear relationships but is prone to overfitting with high variance [
27]. Random Forest extends Decision Trees through an ensemble of trees trained on bootstrap samples with feature subsampling at each split, aggregating predictions via majority voting to reduce variance and improve generalization while maintaining the ability to model complex interactions [
28].
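The two ensemble ingredients named above, bootstrap sampling and majority voting, can be sketched without any modeling library. This is a conceptual illustration of the mechanism, not the Random Forest implementation used in the study:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Sample n observations with replacement (one tree's training set)."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Aggregate ensemble member predictions by majority voting."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
sample = bootstrap_sample(list(range(10)), rng)
print(sample)                          # duplicates are expected
print(majority_vote([1, 0, 1, 1, 0]))  # three of five trees vote 1 -> 1
```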
Support Vector Machines identify an optimal separating hyperplane that maximizes the margin between classes, and, through kernel functions such as polynomial or radial basis function kernels, can handle non-linearly separable data by implicitly mapping observations into higher-dimensional feature spaces [
29]. Artificial Neural Networks consist of interconnected layers of nodes (input, hidden, and output layers) with weighted connections updated via backpropagation, enabling the learning of highly non-linear patterns and interactions, but requiring careful specification of architecture and regularization to avoid overfitting, particularly in relatively small datasets [
30].
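The kernel trick mentioned above can be made concrete with the radial basis function kernel, which measures similarity between two observations without ever constructing the high-dimensional mapping explicitly. The feature values and gamma below are illustrative:

```python
import math

def rbf_kernel(x, z, gamma=1.0):
    """RBF kernel: similarity that implicitly maps observations into an
    infinite-dimensional feature space."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

print(rbf_kernel([4.4, 1.42], [4.4, 1.42]))           # identical points -> 1.0
print(round(rbf_kernel([4.4, 1.42], [4.9, 1.51]), 3))  # similarity decays with distance
```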
2.4. Model Evaluation Metrics
Model performance was evaluated using several standard metrics derived from the confusion matrix: Classification Accuracy, Precision, Recall, F1-Score, and Area Under the Receiver Operating Characteristic Curve (AUC) [
31].
In the confusion-matrix terminology used throughout, TP (True Positives) denotes correctly predicted postseason teams; FP (False Positives), non-postseason teams incorrectly predicted as postseason; FN (False Negatives), postseason teams incorrectly predicted as non-postseason; and TN (True Negatives), correctly predicted non-postseason teams.
Classification Accuracy, computed as (TP + TN)/(TP + TN + FP + FN), measures the overall proportion of correctly classified team-seasons. Precision, defined as TP/(TP + FP), quantifies the proportion of predicted postseason teams that actually qualified, while Recall, defined as TP/(TP + FN), quantifies the proportion of actual postseason teams correctly identified by the model. F1-Score, computed as 2 × (Precision × Recall)/(Precision + Recall), provides a harmonic mean that balances Precision and Recall, making it particularly informative under class imbalance. AUC summarizes the model’s ability to discriminate between postseason and non-postseason teams across all possible probability thresholds, with values closer to 1 indicating stronger discriminative performance.
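The four threshold-based metrics defined above reduce to a few lines of arithmetic over the confusion-matrix counts. The counts below are illustrative, not the study's results:

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, Precision, Recall, and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Illustrative counts for a balanced 100-observation task
acc, prec, rec, f1 = classification_metrics(tp=40, fp=10, fn=10, tn=40)
print(acc, prec, rec, round(f1, 3))
```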
2.5. Model Training and Validation
All models were trained and evaluated using Orange3’s Test and Score widget (version 3.36) with stratified 5-fold cross-validation. This validation strategy partitions the dataset into five equal folds, iteratively training on four folds (
n = 80 team-seasons) and testing on the held-out fold (
n = 20 team-seasons) until all observations have served as test cases exactly once. Stratification ensures that each fold maintains the dataset’s class distribution (50% playoff-qualifying, 50% non-qualifying), preventing biased performance estimates due to unbalanced partitions. This cross-validation approach provides a robust estimate of generalization performance by averaging results across five independent train-test splits, which is particularly important for modest-sized datasets where single train-test splits may yield unstable estimates [
32]. Orange3 reports mean cross-validated performance across folds; fold-level predictions were exported via the Predictions output port and analyzed externally to compute fold-level variability (
Table 1). The dataset exhibits perfect class balance with 50 playoff-qualifying and 50 non-qualifying team-seasons, reflecting KBO’s playoff structure, where exactly 5 of 10 teams qualify annually. This natural balance eliminates the need for resampling techniques or class-weighted loss functions. Orange3’s graphical interface does not expose random seed parameters; the software manages random states internally through its default algorithmic implementations. For each model, we report mean performance with standard deviations across the five folds, and 95% confidence intervals were computed using the t-distribution with four degrees of freedom.
This study employs a retrospective classification design rather than a prospective forecasting approach. Season-end aggregate statistics are used to identify pitching performance characteristics that historically distinguished playoff-qualifying from non-qualifying teams, rather than to forecast outcomes prior to or during the season. Cross-validation partitions may therefore include team-seasons from different years within the same fold; this is appropriate for retrospective classification but would not constitute valid temporal forecasting.
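The fold-level summary described above (mean ± SD with a t-based 95% CI, four degrees of freedom for five folds) can be sketched as follows. The fold scores are hypothetical, and the two-tailed critical value t(0.975, df = 4) ≈ 2.776 is hard-coded rather than computed:

```python
import statistics

T_CRIT_95_DF4 = 2.776  # two-tailed 97.5th percentile of t with 4 df

def fold_summary(fold_scores):
    """Mean, sample SD, and t-based 95% CI across k cross-validation folds."""
    k = len(fold_scores)
    mean = statistics.mean(fold_scores)
    sd = statistics.stdev(fold_scores)           # sample SD (k - 1 df)
    half_width = T_CRIT_95_DF4 * sd / k ** 0.5   # assumes k == 5
    return mean, sd, (mean - half_width, mean + half_width)

# Hypothetical per-fold AUC values from one model
mean, sd, ci = fold_summary([0.75, 0.83, 0.71, 0.80, 0.78])
print(round(mean, 3), round(sd, 3), ci)
```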
2.6. Hyperparameter Specifications
All models were trained using Orange3 (version 3.36) default hyperparameters without manual tuning. This approach was deliberately adopted given the modest sample size (N = 100), where hyperparameter optimization risks overfitting to specific data partitions, and because the primary research objective concerns relative metric importance rather than maximizing classification accuracy. The use of untuned default parameters also enhances reproducibility, as results are fully determined by the software defaults without researcher degrees of freedom in parameter selection. The default specifications are as follows:
Logistic Regression employed L2 (Ridge) regularization with regularization strength C = 1.0.
Decision Tree used the entropy criterion (information gain) with no maximum depth restriction and a minimum of 2 instances per leaf.
Random Forest comprised 10 trees with no maximum depth constraint.
Support Vector Machine used the Radial Basis Function (RBF) kernel with cost parameter C = 1.0 and automatic gamma scaling.
Neural Network employed a single hidden layer with 100 neurons, ReLU activation function, Adam optimizer, and a maximum of 200 training iterations.
All models used Orange3’s default data preprocessing, which includes feature standardization (z-score normalization) applied within each cross-validation fold to prevent data leakage.
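For reference, the stated defaults can be recorded as a plain configuration mapping. The key names below are descriptive labels chosen for readability, not Orange3 API identifiers:

```python
# Orange3 (v3.36) default hyperparameters as stated in the text
DEFAULT_HYPERPARAMETERS = {
    "Logistic Regression": {"penalty": "L2 (Ridge)", "C": 1.0},
    "Decision Tree": {"criterion": "entropy", "max_depth": None,
                      "min_instances_per_leaf": 2},
    "Random Forest": {"n_trees": 10, "max_depth": None},
    "SVM": {"kernel": "RBF", "C": 1.0, "gamma": "auto-scaled"},
    "Neural Network": {"hidden_layers": (100,), "activation": "ReLU",
                       "optimizer": "Adam", "max_iterations": 200},
}

for model, params in DEFAULT_HYPERPARAMETERS.items():
    print(f"{model}: {params}")
```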
2.7. Collinearity Considerations
Several pitching metrics share mathematical components and exhibit intercorrelations. Notably, ERA and WHIP both incorporate hits and walks allowed, resulting in conceptual and statistical overlap. While the Variance Inflation Factor (VIF) analysis is commonly used to detect multicollinearity in regression contexts, it is designed specifically for detecting coefficient inflation in ordinary least squares estimation and is less directly applicable to the classification algorithms employed in this study [
33]. Tree-based ensemble methods, such as Random Forest, handle collinearity through their inherent architecture: bootstrap sampling and random feature subsampling at each split cause correlated features to compete for inclusion rather than produce inflated importance estimates [
34]. Similarly, regularization mechanisms in Logistic Regression (L2 penalty) and Neural Network (weight decay) mitigate collinearity effects by constraining coefficient magnitudes [
11]. Given these algorithmic properties, we interpret the importance of the correlated metrics (ERA, WHIP) collectively as reflecting defense-dependent run prevention rather than as independent contributions. Readers should note that individual importance values for highly correlated predictors cannot be cleanly separated and should be interpreted with appropriate caution.
2.8. Feature Importance Assessment
Feature importance was assessed using Orange3’s Rank widget, which computes Information Gain scores for each predictor variable. Information Gain measures the reduction in entropy when the dataset is split based on each feature, with higher values indicating greater discriminatory power for distinguishing between playoff-qualifying and non-qualifying teams [
35]. The reported importance values are raw scores computed by the Rank widget and are not normalized to sum to 1.0, allowing direct comparison of the relative discriminatory contribution of each pitching metric.
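Information Gain as described above is an entropy reduction, which the following sketch computes for a toy binary split. Note that Orange3's Rank widget discretizes continuous predictors before scoring; that preprocessing step is omitted here:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (bits) of a class-label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction from partitioning `labels` into `groups`."""
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - weighted

# Toy example: a split that perfectly separates the two classes
labels = [1, 1, 0, 0]
print(information_gain(labels, [[1, 1], [0, 0]]))  # 1.0 bit
```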
3. Results
3.1. Comparative Model Performance
The five machine learning models were trained and tested on the KBO dataset from 2015 to 2024. Their classification performance was evaluated across the five key metrics described in the methodology, with results summarized in
Table 1.
Table 1 presents the classification performance of five machine learning models evaluated using stratified 5-fold cross-validation. Logistic Regression achieved the highest discriminatory performance with an AUC of 0.804 ± 0.057, followed by Neural Network (AUC = 0.799 ± 0.072), SVM (AUC = 0.770 ± 0.067), Random Forest (AUC = 0.740 ± 0.090), and Decision Tree (AUC = 0.643 ± 0.056). In terms of classification accuracy, SVM performed best (CA = 74.0 ± 5.5%), followed by Logistic Regression (CA = 73.0 ± 9.1%), Neural Network (CA = 72.0 ± 6.7%), Random Forest (CA = 65.0 ± 11.2%), and Decision Tree (CA = 63.0 ± 5.7%). The F1-scores followed a similar pattern, with SVM and Logistic Regression achieving the highest values (0.740 and 0.730, respectively).
These results indicate that simpler models (Logistic Regression) and neural approaches achieved comparable or superior performance to ensemble methods (Random Forest) for this classification task. The moderate performance levels (AUC ranging from 0.643 to 0.804; fold-level standard deviations of 0.056–0.090) suggest that pitching metrics provide meaningful but not deterministic information for distinguishing playoff-qualifying from non-qualifying teams. Decision Tree showed the lowest performance across all metrics, indicating that single tree structures are insufficient to capture the complex relationships between pitching metrics and playoff outcomes.
Figure 1 presents the Receiver Operating Characteristic (ROC) curves for all five classification models. The ROC curve plots True Positive Rate (Sensitivity) against False Positive Rate (1 − Specificity) across classification thresholds, with curves closer to the upper-left corner indicating superior discrimination.
As shown in
Figure 1, Logistic Regression (blue) demonstrates the highest discriminatory performance, with its curve positioned closest to the upper-left corner. Neural Network (green) and SVM (purple) show comparable intermediate performance, while Random Forest (orange) exhibits lower discrimination. Decision Tree (blue-green) shows the weakest performance, with its curve closest to the diagonal reference line representing random classification. This visual pattern is consistent with the AUC values reported in
Table 1.
3.2. Confusion Matrix Analysis
Given its highest AUC performance (0.804 ± 0.057), Logistic Regression was selected for detailed confusion matrix analysis.
Table 2 presents the classification results aggregated across 5-fold stratified cross-validation.
As shown in
Table 2, of the 50 team-seasons that achieved postseason qualification (Actual = 1), the model correctly predicted 36 (True Positives), misclassifying 14 as non-qualifying (False Negatives), yielding a sensitivity of 72.0%. Of the 50 team-seasons that did not achieve postseason qualification (Actual = 0), the model correctly predicted 37 (True Negatives), with 13 incorrectly predicted as playoff qualifiers (False Positives), yielding specificity of 74.0%.
The relatively balanced distribution between False Positives (13) and False Negatives (14) indicates that the model does not systematically favor either class, consistent with the balanced dataset structure. This moderate classification performance (overall accuracy = 73.0%) suggests that while pitching metrics provide meaningful discriminatory information, they alone cannot fully determine playoff qualification.
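The reported figures follow directly from the aggregated confusion-matrix counts in Table 2:

```python
# Aggregated counts across the five folds (Table 2)
tp, fn, tn, fp = 36, 14, 37, 13

sensitivity = tp / (tp + fn)    # recall for the playoff class
specificity = tn / (tn + fp)    # correct rejection of non-qualifiers
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"sensitivity={sensitivity:.1%}, specificity={specificity:.1%}, "
      f"accuracy={accuracy:.1%}")
# sensitivity=72.0%, specificity=74.0%, accuracy=73.0%
```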
The error patterns reveal the inherent complexity of playoff prediction based solely on pitching performance. Misclassified teams likely reflect cases in which other performance factors—such as hitting, fielding, or situational performance—compensated for or undermined pitching contributions. Teams with strong pitching but weak hitting may fail to qualify despite favorable pitching metrics (False Positives), while teams with moderate pitching but exceptional offensive performance may qualify despite suboptimal pitching profiles (False Negatives).
3.3. Feature Importance Analysis
To identify which pitching metrics exercised the most significant influence on classification outcomes, feature importance analysis was conducted using three complementary importance measures: Information Gain, Information Gain Ratio, and Gini Decrease. Results are presented in
Table 3, ranked by importance magnitude. As reported in
Table 3, across all three importance measures, a consistent hierarchy emerges. ERA and WHIP are unequivocally the two most important predictors of postseason qualification, with ERA demonstrating slightly greater importance (Information Gain = 0.176) than WHIP (Information Gain = 0.161). These defense-dependent metrics collectively account for 33.7% of total information gain, approximately one-third of the model’s predictive power.
These are followed by opponent OBP (Information Gain = 0.133) and K/BB (Information Gain = 0.106). Notably, FIP, the sabermetrically advanced metric specifically designed to isolate pitcher skill from defensive influence, ranks only fifth with an Information Gain of 0.098, substantially below the defense-dependent metrics. This positioning indicates that FIP’s defense-independent property, while theoretically compelling for individual pitcher evaluation, provides considerably less predictive power for team postseason qualification in the KBO context.
At the lower end of the importance hierarchy, K/9 demonstrates minimal predictive utility with an Information Gain of only 0.015. This finding aligns with prior research suggesting that raw strikeout rates, when disaggregated from supporting context (walk rates, home run rates), provide limited information about overall pitching effectiveness.
3.4. Descriptive Analysis of Pitching Metrics
To validate and contextualize the feature importance findings, descriptive statistical analysis compared pitching performance between postseason and non-postseason KBO teams using independent-samples
t-tests. As reported in
Table 4, postseason teams demonstrated significantly superior pitching performance across the most critical metrics. For ERA, postseason teams exhibited a substantially lower average (M = 4.389) than non-postseason teams (M = 4.944), with this difference statistically significant (
p < 0.001) and representing a meaningful difference of 0.555 runs per nine innings. Similarly, for WHIP, postseason teams allowed substantially fewer baserunners (M = 1.418) than non-postseason teams (M = 1.511), a highly significant difference (
p < 0.001).
As shown in
Table 4, this superior performance pattern held consistently across other high-importance metrics. OBP showed that postseason teams limited opposing batters to a mean OBP of 0.341, compared to 0.355 for non-postseason teams (
p < 0.001), representing approximately 4% fewer on-base opportunities allowed. K/BB showed postseason teams with higher ratios (M = 2.135) compared to non-postseason teams (M = 1.930), a statistically significant difference (
p = 0.001), indicating postseason pitching staffs struck out approximately 2.1 batters for every walk issued versus 1.9 for non-postseason teams. OPS revealed postseason teams limiting opponents to an aggregate OPS of 0.741 compared to 0.777 for non-postseason teams (
p < 0.001). At the same time, BAA demonstrated that postseason teams held opponents to 0.269, compared with 0.279 for non-postseason teams (
p < 0.001).
Defense-independent metrics demonstrated differential patterns. FIP showed a mean of 4.488 for postseason teams, compared with 4.731 for non-postseason teams, with this difference statistically significant (p = 0.011). However, this FIP difference was less pronounced than the corresponding ERA and WHIP differences, consistent with FIP’s lower rank in the feature importance analysis. BB/9 showed postseason teams issuing fewer walks (M = 3.418) than non-postseason teams (M = 3.774), a statistically significant difference (p = 0.001), while HR/9 demonstrated postseason teams allowing fewer home runs (M = 0.888) compared to non-postseason teams (M = 0.983), a significant difference (p = 0.027).
Finally, for the lowest-importance metric K/9, there was no statistically significant difference between postseason and non-postseason teams (p = 0.689), confirming its limited utility in predicting team success. Postseason teams averaged 7.080 strikeouts per nine innings, compared to 7.147 for non-postseason teams, a negligible difference, affirming that the strikeout rate alone, absent supporting context, provides minimal discriminatory information.
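The independent-samples t statistic underlying these comparisons can be sketched as follows. The text does not specify whether a pooled-variance or Welch test was used; the pooled form is shown here, and the input samples are toy data rather than KBO metrics:

```python
import statistics

def pooled_t_statistic(a, b):
    """Independent-samples t statistic assuming equal group variances
    (pooled form; the study does not specify pooled vs. Welch)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    se = (pooled_var * (1 / na + 1 / nb)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se

# Toy data: clearly separated groups yield a large-magnitude t
print(round(pooled_t_statistic([1, 2, 3], [4, 5, 6]), 2))  # -3.67
```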
5. Conclusions
This study establishes a machine learning framework for classifying KBO playoff qualification based on pitching metrics and contributes several key insights to baseball analytics. First, we demonstrate that machine learning models achieve moderate classification accuracy (73.0%) for KBO playoff qualification using pitching metrics alone, with Logistic Regression (AUC = 0.804 ± 0.057) and Neural Network (AUC = 0.799 ± 0.072) showing the strongest discriminatory performance. This moderate performance level indicates that pitching metrics provide meaningful but not deterministic information for playoff classification, consistent with the multifactorial nature of team success in professional baseball [
3,
4]. Second, we provide empirical resolution of a fundamental theoretical question in sabermetrics: while FIP is superior for individual pitcher evaluation [
16,
17], ERA and WHIP, defense-dependent metrics capturing pitching-defense synergy, are substantially more predictive of team playoff success, collectively accounting for 33.7% of model information gain versus FIP’s 9.8%. This highlights a critical distinction: team success emerges from pitching-defense synergy rather than the aggregation of individual pitchers’ isolated skills [
18]. Third, we establish the cross-cultural applicability of machine learning to non-MLB contexts [
7]. However, the specific metric importance patterns differ from conventional sabermetric wisdom, indicating that comprehensive analytics frameworks must account for league-specific characteristics rather than assuming universal MLB applicability [
9,
10].
The findings suggest potential implications for KBO team management: prioritizing pitching-defense synergy in roster construction, emphasizing walk avoidance and contact management alongside strikeout ability, and using empirical metric importance patterns to guide analytical resource allocation [
10]. The competitive performance of Logistic Regression affirms the practical value of simpler parametric models for sports classification tasks with modest sample sizes [
37]. Future research should incorporate offensive, defensive, and organizational factors; conduct comparative analyses across multiple professional baseball leagues; extend temporal analysis beyond 2024; and explore alternative modeling approaches for outcome prediction [
11,
26]. Ultimately, this research advances evidence-based approaches to player valuation and roster construction in professional sports, demonstrating that while machine learning principles are applicable across leagues, context-specific characteristics influence which metrics discriminate successful teams.