2.1. Data and Pre-Processing
The dataset used in this study consists of 1284 injection-molding production cycles collected from a plastic washing-machine control panel manufacturing line. Each observation includes the machine’s process settings—comprising 24 different parameters such as injection pressure and speed stages, holding (packing) pressure and time, melt and mold temperatures, and relevant cycle times—along with binary labels indicating the presence of various defect types in the produced part. A total of 22 distinct defects were originally recorded as binary (0/1) outcomes (e.g., gas mark, sink mark, flash, short shot, color defect, etc.).
Among these, four major defects were selected as the focus of this work: gas mark (referred to below as gas trap burn, GTB), sink mark, flash, and short shot. Their positive sample counts in the dataset are 29, 24, 12, and 31, respectively (96 occurrences in total), which was considered the minimum frequency sufficient for training meaningful classification models. Extremely rare defects (e.g., color defect = 3 observations, scratch = 1 observation) were excluded from modeling, since such sparse targets would introduce statistical noise and unnecessarily expand the Pareto search space.
During data pre-processing, the raw Excel file was imported, and all variable names containing Turkish characters were standardized. Measurement units and scales were harmonized across all parameters. Train–test partitioning followed the “split” indicator provided in the data source, resulting in approximately 70% training (899 samples) and 30% testing (385 samples). The split was performed in a setting-wise stratified manner, ensuring that no identical machine setting combination appears in both training and test sets. This prevents the model from memorizing specific setting profiles.
The dataset consisted of 1284 production cycles grouped into eight distinct machine setups. A value-based (setup-based) stratified split was applied, in which approximately 70% of samples from each setup were assigned to the training set and 30% to the test set, while preserving the defective–non-defective ratio across all critical quality characteristics (CTQs). Because production cycles within the same setup share identical machine settings, material conditions, and mold states, they are highly correlated. A random split would, therefore, introduce information leakage by placing nearly identical samples in both training and test sets, leading to overly optimistic performance estimates. The setup-based partitioning ensures that the test set represents unseen but statistically consistent operating conditions, providing a realistic assessment of model generalization for industrial deployment.
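The setup-based stratified partitioning described above can be sketched as follows. This is a minimal illustration rather than the procedure actually used in the study: the record keys "setup" and "defect", the random seed, and the toy data are all hypothetical.

```python
import random
from collections import defaultdict


def setup_stratified_split(records, train_frac=0.7, seed=0):
    """Split records roughly 70/30 inside every (setup, label) stratum.

    `records` is a list of dicts with the hypothetical keys "setup" and "defect".
    Returns two sorted lists of row indices (train, test).
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for idx, rec in enumerate(records):
        strata[(rec["setup"], rec["defect"])].append(idx)

    train_idx, test_idx = [], []
    for key in sorted(strata):
        members = strata[key]
        rng.shuffle(members)
        n_train = round(train_frac * len(members))
        train_idx.extend(members[:n_train])
        test_idx.extend(members[n_train:])
    return sorted(train_idx), sorted(test_idx)


# Synthetic toy data (not the study's data): 8 setups, 100 cycles each, 5% defective
toy = [{"setup": s, "defect": int(i % 20 == 0)} for s in range(8) for i in range(100)]
train_idx, test_idx = setup_stratified_split(toy)
print(len(train_idx), len(test_idx))
```

Stratifying on the (setup, label) pair keeps both the per-setup proportion and the defective to non-defective ratio close to the target in each partition.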
The data were checked for missing or out-of-scope (NA) values, and no outliers requiring removal were detected among the continuous variables.
The distributional characteristics of key process parameters are summarized in Table 1. For example, the injection pressure in the first stage exhibits a median of approximately 135 bar, with 50% of observations falling within the 125–150 bar range (IQR ≈ 25 bar), whereas the fifth-stage pressure has a lower median of around 90 bar. Injection speed similarly decreases across stages, starting from a median of ~44% in the first stage and dropping to ~9% in the final stage. The holding (packing) time shows a median of ~6 s, with most observations lying between 5 and 7 s.
Injection speed is expressed as a percentage of the machine’s maximum programmable screw speed; 100% corresponds to the maximum allowable screw velocity.
2.2. Initial Quality Level
The initial defect distribution in the dataset was examined using a Pareto chart based on the total occurrence count of each defect type (Figure 1). Among the 1284 produced parts, 192 contained at least one defect, corresponding to a long-term defect rate of 14.95%. A total of 195 defects occurred across these 192 parts, with three parts exhibiting two distinct defect types.
In terms of frequency, the short shot (SS) defect ranked first with 31 occurrences, followed by gas trap burn (GTB) with 29, sink mark (SK) with 24, and flash (FL) with 12. These four major defects together accounted for approximately 49% of all 195 defect instances. Pareto analysis revealed that a small number of leading defect types contributed to the majority of total defects: for instance, the top seven defect types cumulatively represented nearly 80% of all defect cases. Accordingly, the selected focus defects—GTB, SS, SK, and FL—constitute the most critical components influencing total scrap, both in frequency and in process interactions.
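The cumulative shares behind a Pareto chart can be reproduced with a few lines. The sketch below uses only the four counts and the total of 195 reported in the text; the counts of the remaining defect types are not listed here, so they are deliberately left out. The function name `pareto_table` is an illustrative choice.

```python
def pareto_table(counts, total=None):
    """Return (name, count, cumulative_share) rows sorted by descending count.

    `total` lets the shares be expressed against a grand total that includes
    categories not present in `counts`.
    """
    total = sum(counts.values()) if total is None else total
    rows, running = [], 0
    for name, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        running += count
        rows.append((name, count, running / total))
    return rows


TOTAL_DEFECTS = 195                                   # all defect instances
major = {"GTB": 29, "SK": 24, "FL": 12, "SS": 31}     # counts reported in the text

rows = pareto_table(major, total=TOTAL_DEFECTS)
for name, count, cum_share in rows:
    print(f"{name:>3}  {count:3d}  cumulative share: {cum_share:.1%}")
```

The last row gives the combined share of the four focus defects, 96 of 195 instances.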
In Six Sigma methodology, a defect is defined as any outcome that fails to meet a Critical-to-Quality (CTQ) requirement. A quality opportunity corresponds to one CTQ per produced part. Defects per Million Opportunities (DPMO) expresses how many CTQ defects are expected per one million such opportunities, providing a standardized measure of process quality. The sigma level represents the distance of the process from defect-free performance, with higher sigma values indicating lower defect risk. For reference, a Six-Sigma process corresponds to approximately 3.4 defects per million opportunities, whereas lower sigma levels imply substantially higher scrap and rework rates on the shop floor. In this study, SS, GTB, and SK are treated as CTQs because they directly generate scrap or customer-visible defects.
The initial process quality was also quantified using the Six Sigma methodology, with short shot (SS), gas trap burn (GTB), and sink mark (SK) as the critical-to-quality (CTQ) characteristics. Flash defects, which can be manually trimmed on-site within 3–5 s without generating scrap, were excluded from the initial sigma-level calculation but were included later as a constraint variable in the GTB optimization framework. This choice reflects the actual cost-of-quality implications and customer risk seen in production.
Sigma Level Calculation:
N = 1284 parts, OPU = 3 opportunities per unit (CTQs: SS, GTB, SK). Total CTQ defects D = 84 (SS = 31, GTB = 29, SK = 24).
DPMO = D / (N × OPU) × 10⁶ = 84 / (1284 × 3) × 10⁶ ≈ 21,807.
This DPMO value indicates that the process quality at baseline is relatively low. Based on the long-term performance assumption, the calculated sigma level of approximately 2.02 σ reflects a process that is far from the ideal Six Sigma benchmark. In other words, given OPU = 3, the probability of a part containing at least one CTQ defect is approximately 6.4%.
Table 2 provides a detailed summary of the initial sigma-level computation.
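The baseline figures can be checked with a short calculation. The sketch below assumes the long-term convention stated in the text (the sigma level is the standard normal quantile of the yield, with no 1.5σ shift added) and treats the three CTQ opportunities as independent when computing the per-part defect probability. Variable names are illustrative.

```python
from statistics import NormalDist

N_PARTS = 1284   # produced parts
OPU = 3          # opportunities per unit (CTQs: SS, GTB, SK)
DEFECTS = 84     # total CTQ defects (31 + 29 + 24)

dpo = DEFECTS / (N_PARTS * OPU)                  # defects per opportunity
dpmo = dpo * 1_000_000                           # defects per million opportunities
sigma_long_term = NormalDist().inv_cdf(1 - dpo)  # long-term Z, no 1.5-sigma shift
p_any_defect = 1 - (1 - dpo) ** OPU              # assumes independent opportunities

print(f"DPMO  = {dpmo:.0f}")                # 21807
print(f"sigma = {sigma_long_term:.2f}")     # 2.02
print(f"P(at least one CTQ defect) = {p_any_defect:.1%}")  # 6.4%
```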
2.3. Statistical Screening: Welch ANOVA
To identify process variables significantly influencing the occurrence of gas trap burn (GTB) defects, all continuous process parameters were statistically compared between the GTB-absent (0) and GTB-present (1) groups. Since the classical Student’s t-test can be misleading when the homogeneity of variances is violated, Levene’s test was first applied to assess variance equality. For variables with Levene’s p < 0.05, group means were compared using Welch’s ANOVA (assuming unequal variances), whereas for variables with Levene’s p ≥ 0.05, results from the classical t-test/ANOVA were considered valid.
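With only two groups (GTB-absent and GTB-present), Welch's ANOVA reduces to Welch's t-test, with F = t². The sketch below computes the Welch statistic and the Welch–Satterthwaite degrees of freedom from first principles. It is not the authors' implementation, the sample values are made up, and the p-value step (a t-distribution tail probability) is omitted.

```python
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    Uses sample variances (n - 1 denominator) and does not assume equal variances.
    """
    va = variance(a) / len(a)   # squared standard error of mean(a)
    vb = variance(b) / len(b)   # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / (va + vb) ** 0.5
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Made-up pressure readings (bar), for illustration only
gtb_absent = [140, 150, 145, 155, 160, 150]
gtb_present = [120, 135, 110, 140]

t_stat, dof = welch_t(gtb_absent, gtb_present)
print(f"t = {t_stat:.2f}, df = {dof:.1f}, F = {t_stat ** 2:.2f}")
```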
To control the false discovery rate (FDR) arising from multiple hypothesis testing, the Benjamini–Hochberg correction was applied to all p-values, producing adjusted q-values. As an effect size measure, Hedges’ g statistic was calculated for each variable (|g| ≈ 0.2 = small, ≈0.5 = medium, ≈0.8 = large). A positive g value indicates that the mean of the variable is higher for non-defective parts (i.e., lower values increase GTB risk), whereas a negative g implies that higher variable values correspond to higher GTB risk.
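Both quantities are simple to compute directly. The following is a minimal sketch, not the authors' code: `hedges_g` follows the sign convention used here (non-defective group first, so a positive value means a higher mean for non-defective parts) and applies the usual small-sample correction factor, while `benjamini_hochberg` returns step-up adjusted q-values. The input values are made up.

```python
from statistics import mean, variance


def hedges_g(group0, group1):
    """Hedges' g; positive when group0 (e.g., GTB-absent) has the higher mean."""
    n0, n1 = len(group0), len(group1)
    pooled_var = ((n0 - 1) * variance(group0) + (n1 - 1) * variance(group1)) / (n0 + n1 - 2)
    d = (mean(group0) - mean(group1)) / pooled_var ** 0.5   # Cohen's d
    correction = 1 - 3 / (4 * (n0 + n1) - 9)                 # small-sample bias factor
    return d * correction


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):            # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min                  # enforces monotone q-values
    return q


print(hedges_g([5, 6, 7], [1, 2, 3]))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```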
Welch’s ANOVA was applied as a univariate screening tool to identify candidate variables prior to model-based learning; therefore, interaction terms were not included at this stage. Potential interaction effects among variables were subsequently captured by the machine-learning models and analyzed through SHAP. In addition, multicollinearity among the selected variables was assessed using correlation analysis and variance inflation factors (VIF), and no problematic collinearity was observed.
The analysis identified 11 process variables that showed statistically significant relationships with GTB formation (p < 0.05, FDR-adjusted): Injection Pressure 1–5 (5 variables), Injection Speed 1–5 (5 variables), and Holding Time (1 variable). Table 3 summarizes the statistical test results.
For instance, the mean difference for Injection Pressure 1 was highly significant (p ≈ 2.62 × 10⁻⁵, q ≈ 5.8 × 10⁻⁵) with a large effect size (g ≈ +0.98). The positive sign of g indicates that non-defective parts exhibited much higher Pressure 1 values, meaning that low injection pressure is statistically associated with a higher GTB risk. This trend was consistent across all five pressure stages (g = +0.64–0.98).
Similarly, all injection speed stages exhibited significant differences (p < 0.001) with g values ranging from +0.4 to +0.7, confirming that lower injection speeds correspond to higher GTB probability. Finally, holding (packing) time was also identified as a significant variable (p ≈ 0.022, q ≈ 0.022, g = –0.36). The negative g indicates that GTB-positive parts had longer average holding times, implying that excessively long holding can increase GTB risk—possibly due to material thermal degradation or localized burning from prolonged pressure and heat exposure.
These findings quantitatively demonstrate that maintaining sufficiently high injection pressure and speed while avoiding excessively long holding times is essential to reducing GTB risk. Additionally, the remaining 13 process variables (e.g., cooling time, mold temperature, melt temperature) showed no statistically significant differences (p > 0.05). Hence, process control for GTB prevention should primarily focus on the pressure–speed profile during the filling stage. In subsequent sections, the predictive models and optimization frameworks are developed using these 11 critical variables.
2.4. Modeling and Explainability
Using the statistically screened process variables, several machine learning (ML) classifiers were developed to predict the occurrence of gas trap burn (GTB) defects in injection molding. The main objective was to enable early warning systems for operators or machine control logic by forecasting potential GTB occurrences during production and to use these predictive models for process-parameter optimization.
The following algorithms were evaluated on the training dataset: Logistic Regression (L1 and L2 regularization), Support Vector Machine (SVM) with an RBF kernel (calibrated using Platt scaling), Random Forest (RF), Balanced Random Forest (BRF), and LightGBM gradient boosting. Because the dataset suffered from strong class imbalance (only ~2% GTB-positive samples), several resampling and weighting strategies were applied. For example, Logistic Regression employed class_weight = “balanced” and Random Oversampling (ROS) to increase the weight of the minority class. In addition, SMOTE and SMOTE-Tomek methods were integrated with the RF and LightGBM models to assess their effectiveness under resampled training sets.
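Random oversampling itself is a simple operation. The sketch below duplicates minority-class rows at random until both classes are the same size. This is an assumption about the target ratio, since the exact meaning of the oversampling percentage is not spelled out here. The function name and toy data are illustrative, and resampling of this kind should be applied to the training partition only.

```python
import random


def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows at random until both classes have equal size.

    X is a list of feature rows, y a list of 0/1 labels of the same length.
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # Draw with replacement from the minority class to close the gap
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = list(range(len(y))) + extra
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy training set (not the study's data): 50 rows, 2 positives
X_toy = [[float(i)] for i in range(50)]
y_toy = [1 if i in (7, 31) else 0 for i in range(50)]

X_res, y_res = random_oversample(X_toy, y_toy)
print(len(y_res), sum(y_res))
```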
For model evaluation, PR-AUC (area under the precision–recall curve) was used as the primary performance metric, as it is more informative than ROC-AUC in highly imbalanced classification problems. As secondary indicators, Recall, Precision, and F1-score for the positive class were reported, together with ROC-AUC and overall Accuracy.
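One common way to estimate the area under the precision–recall curve is average precision, the mean of the precision values at the rank of each true positive. The sketch below assumes that estimator; the study may have used a different one, such as trapezoidal integration. Tied scores are not treated specially, and the example labels and scores are made up.

```python
def average_precision(y_true, scores):
    """Average precision: mean precision at the rank of each positive sample."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank    # precision among the top `rank` samples
    return total / n_pos


y_demo = [1, 0, 1, 0]
s_demo = [0.9, 0.8, 0.7, 0.1]
print(average_precision(y_demo, s_demo))
```

For a classifier that ranks at random, the expected value of this metric is close to the positive-class prevalence, which is why PR-AUC values should be read against that baseline rather than against 0.5.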
Random Oversampling (ROS) and SMOTE were intentionally selected as imbalance-handling techniques due to their model-agnostic nature, compatibility with SHAP-based explainability, and suitability for practical deployment in real shop-floor environments. While more advanced approaches, such as focal loss or cost-sensitive ensemble models, can further enhance classification performance, they typically require algorithm-specific loss functions or customized training procedures, which may reduce model transparency and complicate industrial implementation. Since this study prioritizes not only predictive performance but also interpretability and operational feasibility within a Six Sigma-oriented decision-making framework, such approaches were deliberately not employed.
A comprehensive comparison revealed that the L2-penalized Logistic Regression model achieved the best overall performance. Specifically, the L2 + ROS (100% oversampling) configuration yielded PR-AUC ≈ 0.15 and ROC-AUC ≈ 0.89 on the test set, outperforming other algorithms. This model successfully detected ~78% of GTB cases (Recall = 0.78) but had relatively low Precision ≈ 0.09 (and F1 ≈ 0.16) due to the low prevalence of the positive class.
In contrast, the SVM-RBF model performed poorly under data imbalance, achieving only PR-AUC ≈ 0.107 even after calibration. At the default threshold, it failed to identify any positive cases (Recall = 0), a result of an overly conservative decision function and the default threshold (0.5) being unsuitable for rare events.
The Random Forest model showed moderate performance (PR-AUC ≈ 0.109, ROC-AUC ≈ 0.87) but tended to ignore the minority class unless resampling was applied. When SMOTE was introduced, RF’s precision improved, though PR-AUC remained unchanged (~0.109). Balanced Random Forest (BRF) produced nearly identical results, maintaining the same recall (78%) and precision (~8.7%). LightGBM achieved a similar PR-AUC (~0.11), indicating comparable discriminative capacity but limited improvement under imbalance.
Although SVM-RBF achieved a high overall accuracy (97.7%), it failed to detect any GTB cases (Recall = 0) at the default decision threshold. This behavior indicates that the model optimized majority-class correctness rather than rare-defect sensitivity, making it unsuitable for an early-warning system where missing a defect is far more costly than raising a false alarm. Similarly, tree-based models such as Random Forest and LightGBM produced reasonable ROC-AUC values but limited PR-AUC, showing that they ranked samples well but did not sufficiently separate the rare GTB class from the majority. In contrast, L2-regularized logistic regression achieved a better precision–recall balance by producing well-calibrated probabilities and smoother decision boundaries, which is critical for rare-event detection and shop-floor risk prioritization.
Under severe class imbalance, complex tree-based models such as LightGBM tend to overfit sparse minority-class patterns and optimize overall accuracy rather than precision–recall trade-offs. In contrast, L2-regularized logistic regression provides a smoother decision boundary, penalizes extreme coefficients, and directly optimizes probability estimates, which leads to more stable and better-calibrated predictions for rare-event classification. This explains why logistic regression achieved superior PR-based performance in the GTB prediction task.
Overall, Logistic Regression emerged as the top model for ranking rare GTB cases. Its high ROC-AUC (0.89) suggests strong ranking ability, yet the low Precision implies many false alarms in practice. This limitation arises because only 9 of 385 test samples (≈2.3%) were GTB-positive. In such rare-event problems, even high-AUC models tend to overpredict positives. The obtained PR-AUC ≈ 0.15 is roughly 6.4× better than random chance, yet still modest in absolute terms. A practical solution is to optimize the decision threshold (e.g., via Top-K or Precision@Recall strategies), rather than fixing it at 0.5. Given the strong ranking performance, one can capture most GTB cases by inspecting only a small subset of the highest-risk parts, which significantly reduces inspection effort and cost.
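The lift over chance and the two thresholding strategies mentioned above can be sketched as follows. The prevalence and PR-AUC figures are those reported in the text; the helper names and the small example are hypothetical and do not reproduce the study's predictions.

```python
import math


def recall_at_top_k(y_true, scores, k):
    """Fraction of all positives found among the k highest-scoring samples."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    found = sum(label for _, label in ranked[:k])
    return found / sum(y_true)


def threshold_for_recall(y_true, scores, target_recall):
    """Largest threshold t such that flagging score >= t reaches target_recall."""
    pos_scores = sorted((s for s, label in zip(scores, y_true) if label == 1), reverse=True)
    needed = max(math.ceil(target_recall * len(pos_scores)), 1)
    return pos_scores[needed - 1]


# Lift of the reported PR-AUC over the chance level (test-set prevalence)
prevalence = 9 / 385
lift = 0.15 / prevalence
print(f"prevalence = {prevalence:.1%}, lift = {lift:.1f}x")   # 2.3%, 6.4x

# Made-up scores, for illustration only
y_demo = [1, 0, 0, 1, 0, 1]
s_demo = [0.9, 0.8, 0.3, 0.6, 0.2, 0.1]
print(recall_at_top_k(y_demo, s_demo, k=3))
print(threshold_for_recall(y_demo, s_demo, target_recall=0.6))
```

Choosing k from inspection capacity, or the threshold from a required recall, ties the operating point to a shop-floor constraint instead of the default 0.5 cut-off.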
The comparative performance of the evaluated machine-learning classifiers under different resampling strategies is summarized in Table 4.
To interpret model decisions, SHAP (SHapley Additive exPlanations) analysis was performed. For the L2-regularized Logistic Regression model, a model-agnostic SHAP Explainer was used to compute global and local feature attributions. The global SHAP beeswarm plot (Figure 2) reveals that injection pressures (especially stages 1–4) and injection speeds dominate the GTB predictions, while holding time plays a secondary but measurable role. The local waterfall plot (Figure 3) for a representative defective sample shows how individual feature contributions lead to a final predicted GTB probability of f(x) = 0.52, as positive and negative SHAP values are aggregated from the baseline.
These explainability results indicate that, although pressure, speed, and packing duration each shift the predicted GTB probability only modestly, the model retains good ranking ability (ROC-AUC ≈ 0.89, PR-AUC ≈ 0.15) and captures the underlying relationships between key process variables and defect risk.
In addition, SHAP patterns indicate that the joint configuration of injection pressure and injection speed plays a decisive role in GTB formation, highlighting strong interaction effects between filling dynamics.
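For intuition about how a waterfall plot accumulates contributions, the sketch below uses the closed-form SHAP values of a linear model in log-odds space, φ_i = w_i (x_i − E[x_i]), which holds when features are treated as independent. This is a swapped-in simplification, not the model-agnostic explainer used in the study, and the coefficients, intercept, and feature values are hypothetical.

```python
import math


def linear_shap_log_odds(weights, x, background_mean):
    """SHAP values of a linear logit under feature independence: w_i * (x_i - E[x_i])."""
    return [w * (xi - mu) for w, xi, mu in zip(weights, x, background_mean)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical coefficients on standardized features, for illustration only
names = ["pressure_1", "speed_1", "holding_time"]
weights = [-0.9, -0.6, 0.3]         # lower pressure/speed and longer holding raise risk
intercept = -3.0
background_mean = [0.0, 0.0, 0.0]   # standardized features have zero mean
x = [-1.5, -1.0, 1.0]               # one low-pressure, low-speed, long-holding cycle

phi = linear_shap_log_odds(weights, x, background_mean)
base_logit = intercept + sum(w * mu for w, mu in zip(weights, background_mean))
logit = base_logit + sum(phi)       # additivity: baseline plus contributions

for name, contribution in zip(names, phi):
    print(f"{name:>12}: {contribution:+.2f} log-odds")
print(f"baseline p = {sigmoid(base_logit):.3f}, predicted p = {sigmoid(logit):.3f}")
```

The additivity shown in the last step is the property a waterfall plot visualizes: the baseline output plus the signed contributions equals the model output for that sample.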
Figure 2 shows the global contribution of each injection-molding process parameter to the prediction of gas trap burn (GTB) defects, based on SHAP (SHapley Additive exPlanations) values. Each point represents one production cycle. The horizontal axis denotes the SHAP value, which indicates the impact of a given feature on the model output (positive values increase GTB risk, negative values decrease it). Color indicates the actual feature value, ranging from low (blue) to high (red). The results reveal that injection pressures (especially stages 1–4) and injection speeds are the most influential variables governing GTB formation, whereas holding time plays a secondary but still relevant role.
Figure 3 illustrates how individual process parameters contributed to the GTB prediction for a representative defective part from the test set (sample index 12). Starting from the baseline model output E[f(X)], each feature shifts the prediction either upward (increasing GTB probability, red bars) or downward (reducing GTB probability, blue bars), leading to the final predicted probability f(x) = 0.52. In this defective part, injection pressures and injection speed stages are the dominant contributors to the predicted GTB probability, confirming that GTB formation is governed by the coupled pressure–speed profile rather than by a single parameter alone.