Article

Toward Green Manufacturing: A Heuristic Hybrid Machine Learning Framework with PSO for Scrap Reduction

Department of Industrial Engineering, Ankara Yıldırım Beyazıt University, Keçiören, Ankara 06010, Türkiye
*
Author to whom correspondence should be addressed.
Sustainability 2025, 17(20), 9106; https://doi.org/10.3390/su17209106
Submission received: 11 September 2025 / Revised: 2 October 2025 / Accepted: 9 October 2025 / Published: 14 October 2025
(This article belongs to the Section Waste and Recycling)

Abstract

Accurate scrap forecasting is essential for advancing green manufacturing, as reducing defective output not only lowers production costs but also prevents unnecessary resource consumption and environmental impact. Effective scrap prediction enables manufacturers to take proactive measures to minimize waste generation, thereby supporting sustainability goals and improving production efficiency. This study proposes a hybrid ensemble framework that integrates CatBoost and XGBoost, combined with Particle Swarm Optimization (PSO), to enhance prediction accuracy in industrial applications. The model exploits the complementary strengths of both algorithms by applying weighted averaging and stacked generalization, allowing it to process heterogeneous datasets containing both categorical and numerical variables. A case study in the aerospace manufacturing sector demonstrates the effectiveness of the proposed approach. Compared to standalone models, the PSO-enhanced hybrid ensemble achieved more than a 30% reduction in Root Mean Squared Error (RMSE), confirming its ability to capture complex interactions among diverse process parameters. Feature importance analysis further showed that categorical attributes, such as machine type and operator, are as influential as numerical parameters, underscoring the need for hybrid modeling. Although the model requires higher computational effort, the integration of PSO significantly improves robustness and scalability. By reducing scrap and optimizing resource utilization, the proposed framework provides a data-driven pathway toward greener, more resource-efficient, and resilient manufacturing systems.

1. Introduction

In recent years, rare but highly impactful events, such as pandemics and earthquakes, have highlighted the critical need for proactive solutions in various industries. As these events demonstrate, the unpredictability and scale of their consequences often challenge traditional methods. Simultaneously, the increasing reliance on artificial intelligence (AI) has created opportunities to address large volumes of data more efficiently, especially when conventional approaches fall short due to limitations in managing uncertainty and complexity [1,2]. The evolution of AI applications in industries such as manufacturing has shown that advanced algorithms can deliver more accurate and reliable predictions in areas like scrap management, where traditional methods often struggle to adapt.
However, traditional methods for optimizing production and reducing waste face significant limitations, particularly when dealing with large, complex datasets. Conventional statistical techniques often fail to capture the intricacies of high-dimensional data and require considerable manual intervention in data preprocessing and model adjustment. Furthermore, traditional approaches’ inefficiency not only hampers operational effectiveness but also contributes to unsustainable practices, as the inability to accurately predict and minimize scrap leads to excess material waste and increased environmental impact [3,4]. These shortcomings highlight the necessity for advanced, data-driven solutions capable of handling large-scale problems while promoting sustainable production practices.
The industrial sector has long faced the challenge of reducing waste and optimizing production to ensure both economic viability and environmental sustainability. Scrap refers to material that has no direct economic value in itself but whose fundamental components can be recovered and recycled [5]. Research on scrap therefore demands a strong focus on environmental responsibility. In recent years, recycling-related topics have garnered significant attention, becoming a prominent area of interest in the academic literature, in industry, and among authorities and governments. As technology evolves and societal priorities shift, waste management studies have continued to expand and diversify [6]. Scrap is already a vital raw material in industry and is expected to grow in importance as a secondary raw material for companies in the years ahead [7]. Scrap generation, a direct consequence of inefficiencies within production processes, not only increases operational costs but also exacerbates environmental concerns. Effective scrap prediction promotes environmentally sustainable production by minimizing waste, optimizing resource utilization, and saving both time and costs [8]; as global industries push for sustainable practices, minimizing scrap has therefore become an essential objective. Machine learning (ML) offers the ability to predict and manage scrap generation in real time, thereby enhancing production processes and reducing waste [9]. Working with scrap requires end-to-end process management [10].
In the modern industrial landscape, the imperative to balance economic growth with environmental stewardship has gained unprecedented attention. Manufacturing processes, while critical for meeting the demands of global markets, often generate significant waste that contributes to environmental degradation and resource depletion. This waste not only impacts operational costs but also exacerbates broader challenges, such as climate change, loss of biodiversity, and pollution of ecosystems. Efficient waste reduction strategies, therefore, have become a cornerstone of sustainable manufacturing, aligning with global initiatives like the United Nations Sustainable Development Goals (SDGs), which emphasize responsible production and consumption.
Minimizing waste is pivotal not just for environmental sustainability but also for enhancing resource efficiency, reducing greenhouse gas emissions, and promoting a circular economy. For example, effective scrap reduction directly decreases the demand for raw materials, conserving finite natural resources such as minerals, water, and energy. Moreover, lower scrap levels reduce the carbon footprint of manufacturing processes by curbing energy-intensive recycling and disposal activities. These improvements not only enhance the ecological performance of production systems but also foster economic resilience by mitigating supply chain vulnerabilities linked to material scarcity and price volatility.
As industries transition toward greener practices, ML has emerged as a transformative tool for addressing these sustainability challenges. ML algorithms offer unparalleled capabilities in analyzing complex datasets, predicting inefficiencies, and optimizing production systems in real time. By leveraging advanced models like CatBoost and XGBoost, manufacturers can achieve a dual objective of reducing waste while maintaining or even enhancing productivity. This hybrid approach not only supports sustainable development goals but also empowers industries to proactively adapt to stringent environmental regulations and evolving market demands for eco-friendly products.
These considerations underscore the broader significance of scrap reduction beyond its immediate operational benefits, framing it as a critical enabler of sustainable manufacturing. By integrating innovative ML-based solutions, the proposed hybrid model seeks to address these multifaceted challenges, offering a pathway toward achieving sustainable, cost-effective, and environmentally responsible production practices.
Among the wide array of ML techniques, ensemble models like CatBoost and XGBoost are particularly noteworthy for their flexibility in handling diverse types of data, including both categorical and numerical variables. These algorithms excel in addressing typical data challenges such as overfitting, missing values, and multicollinearity, making them well-suited for complex predictive tasks like scrap estimation. Their ability to perform without extensive data preprocessing further underscores their efficiency in real-world manufacturing environments [11].
This study introduces a hybrid ML framework, incorporating the strengths of CatBoost and XGBoost, to predict scrap generation within industrial settings. By combining the predictive power of these algorithms with a robust exploratory data analysis (EDA) approach and comprehensive preprocessing techniques, this model aims to significantly improve accuracy and reliability in scrap prediction. Furthermore, the integration of EDA ensures that the model can effectively handle the complexities of large, high-dimensional datasets—overcoming what is often referred to as the “curse of dimensionality” [12].
The logic of the proposed algorithm lies in its hybrid structure that integrates both CatBoost and XGBoost, with a novel weighted averaging and stacked ensemble methodology. This approach combines the strengths of both algorithms, optimizing performance through dynamic boosting. By integrating both numeric and categorical variables efficiently, the model is tailored to the complexities of scrap prediction, accounting for intricate relationships in the dataset that traditional methods or standalone models might overlook. Furthermore, the blending of stacked ensemble models and the use of weighted averaging ensure that the prediction model is both flexible and robust, making it highly suitable for industrial scrap forecasting. This hybrid design not only enhances accuracy but also allows the model to adapt dynamically to changes in production processes, providing manufacturers with a predictive tool that reduces waste and improves operational efficiency.
The remainder of this paper is structured as follows. Section 2 reviews the relevant literature with a focus on scrap forecasting and hybrid machine learning approaches. Section 3 outlines the exploratory data analysis and data preprocessing techniques used in this study. Section 4 presents the proposed hybrid ensemble framework that integrates XGBoost and CatBoost with PSO-based hyperparameter tuning. Section 5 introduces a case study from the aerospace manufacturing sector to validate the model. Section 6 reports the results and discussion. Finally, Section 7 concludes with key findings and directions for future research.

2. Literature Review

Over the years, production processes have undergone remarkable transformations driven by increasing human needs. As industries expanded, companies sought to remain competitive by producing high-quality products at lower costs and within shorter delivery times [13,14]. This demand stimulated the adoption of innovative approaches to meet quality requirements and enhance efficiency [15]. However, as manufacturing complexity increased, the likelihood of generating scrap also rose, negatively influencing both production costs and product reliability [16]. Consequently, detecting scrap and minimizing its associated costs have emerged as critical challenges for manufacturers striving to maintain competitiveness [17,18].
The growing complexity of supply chains has resulted in an exponential increase in production-related data, leading to the rise of big data [19]. This massive data flow requires advanced technological tools offering features such as interconnectedness, transformability, and real-time sharing [20]. With advances in computer and network technologies, the efficient processing of such data has become feasible [21]. In modern industrial ecosystems, production-related issues can propagate through entire supply chains, triggering a cascading effect commonly referred to as the “bullwhip effect” in the literature [22].
The advent of Industry 4.0 has further transformed manufacturing practices, positioning machine learning (ML) as an essential component for accurate forecasting and production data management [23,24]. ML algorithms have demonstrated superior performance in managing large-scale and complex problems [25,26]. This paradigm shift towards data-driven decision-making has revolutionized industrial operations, particularly in production forecasting, quality assurance, and scrap management [27,28].
Recent progress in nature-inspired optimization algorithms has also strengthened the ability to address high-dimensional and nonlinear problems in engineering and ML contexts. For instance, the Greylag Goose Optimization (GGO) algorithm leverages the dynamic behavior of geese to balance exploration and exploitation, demonstrating remarkable success in feature selection and constrained optimization tasks [29]. Similarly, the iHow Optimization Algorithm (iHowOA) employs human-inspired processes such as learning and decision-making to enhance search efficiency, showing promise in feature reduction and continuous optimization problems [30]. The Football Optimization Algorithm (FbOA), inspired by the dynamic teamwork strategies in football, effectively balances exploration and exploitation by mimicking passing and positioning strategies, proving successful in benchmark optimization tasks [31]. These advancements highlight the potential of metaheuristic optimization approaches to improve hybrid ML frameworks, especially in feature selection and hyperparameter tuning stages.
Recent studies have also emphasized the role of machine learning and data-driven methods in improving predictive quality and reducing scrap in manufacturing. For example, Alexopoulos et al. [32] applied deep learning techniques to estimate the fill-level of industrial waste containers in a copper tube plant, demonstrating the potential of computer vision approaches for scrap management and sustainable production. Similarly, Knott et al. [33] investigated the integration of predictive models into manufacturing processes and inspection sequences, highlighting how such approaches can contribute to scrap reduction and process optimization. In addition, de Souza et al. [34] proposed a machine learning-based framework for surface roughness prediction in hard turning operations, showing how predictive modeling can effectively prevent rework and scrap generation. These contributions further reinforce the relevance of hybrid and ensemble-based learning approaches, such as the one proposed in this study, to address the challenges of defect prediction and waste minimization in industrial manufacturing contexts.

2.1. Scrap Forecasting

Scrap, as an inevitable byproduct of manufacturing, poses significant environmental and economic challenges. With increasing amounts of scrap generated, effective management has become a cornerstone of sustainable production [8]. Accurate scrap forecasting not only reduces waste but also optimizes resource consumption and improves production efficiency [35]. Recent studies have highlighted the capability of ML-based models to outperform traditional approaches in scrap prediction and management [17,36].
Techniques such as Random Forest, CatBoost, and XGBoost have been successfully applied to scrap prediction, particularly in complex datasets where conventional statistical methods often fail [16,37]. For example, Xu and Zhang (2023) developed neural networks for forecasting regional scrap prices in China, achieving high predictive accuracy [38]. Likewise, Daigo et al. (2023) proposed a deep learning-based PSPNet approach for the visual detection of unwanted materials in steel scrap, thus improving material classification [39].
In specialized sectors, ML techniques have been used to predict surface defects and classify scrap categories. Chen et al. (2022) applied CatBoost, XGBoost, and other algorithms to forecast casting defects in the steel industry, demonstrating considerable improvements in quality control [40]. Similarly, ANN-based approaches have been employed to forecast production and scrap quantities simultaneously, thereby supporting efficiency gains in manufacturing [41]. Table 1 illustrates that while numerous studies have explored hybrid or heuristic approaches for prediction tasks, relatively few have focused on scrap reduction in manufacturing, underscoring the research gap addressed in this study.
Beyond manufacturing, scrap prediction has also been linked to broader environmental sustainability studies, especially in material recycling. For instance, research on aluminum scrap showed that combining Random Forest with neural networks enhanced classification performance by improving recall, precision, and overall accuracy [42,43]. As illustrated in Figure 1, nearly half of the reviewed studies focus on scrap and quality improvement, while only a small fraction address areas such as healthcare, mining, or environmental applications.

2.2. Hybrid Algorithms

Hybrid ML algorithms combine the strengths of different models to enhance robustness and predictive power. These algorithms are particularly effective in real-world applications where datasets are complex, nonlinear, and heterogeneous. By integrating multiple learners, hybrid frameworks improve generalization, mitigate overfitting, and ensure stability in predictive outcomes. Such benefits make them well-suited for complex industrial processes such as scrap forecasting.
An early example of dynamic integration was presented by Tsymbal et al. (2000), who combined AdaBoost and bagging into a hybrid ensemble, demonstrating improved accuracy through equal and weighted voting schemes [44].
In short-term load forecasting, Zhang et al. (2024) tested hybrid combinations of CatBoost and XGBoost with optimization methods. Their study revealed that while the CatBoost–AOA hybrid excelled in training datasets, XGBoost-based hybrids outperformed during testing phases [45]. In healthcare, Nagassou et al. (2023) applied LightGBM–CatBoost hybrids to predict type-2 diabetes, achieving superior results compared to RF, GBM, and AdaBoost [46]. Hybrid models have also been applied in mining, where LightGBM and CatBoost were used to predict blast toe volume, contributing to safer and more sustainable mining operations [47,48].
The energy sector has benefited significantly from hybrid frameworks as well. Studies incorporating XGBoost, CatBoost, and Random Forest have shown improved energy demand forecasting by handling datasets with missing data more effectively [48]. Hybrid models such as WOA–XGBoost and GWO–XGBoost have been successfully utilized for predicting blast-induced ground vibrations, surpassing the accuracy of conventional learners like RF and CatBoost [49]. In construction, hybrid models integrating Bayesian Optimization with CatBoost and NSGA-III have proven effective for shield construction parameter optimization, especially with limited sample sizes [50].
The versatility of hybrid algorithms extends to environmental monitoring. Ahn et al. (2023) employed a hybrid ensemble combining gradient boosting with attention-based deep learning to forecast harmful algal blooms (HABs), showing strong predictive capabilities in ecological applications [51]. Table 2 summarizes the distribution of methodologies across different application domains, indicating that hybrid models and boosting algorithms (CatBoost and XGBoost) are most frequently employed in diverse contexts such as energy, environment, and mining.

3. Exploratory Data Analysis, Data Preprocessing, and Machine Learning Foundations

3.1. Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) is a crucial initial step in understanding the structure of a dataset before implementing machine learning (ML) models. The primary objective of EDA is to summarize the main characteristics of the data through both statistical and visual methods, enabling the identification of patterns, anomalies, and relationships that are not immediately visible in raw data [52]. This process supports researchers in detecting potential data quality issues, such as missing values or outliers, while providing insights into variable distributions and dependencies [53].
Graphical EDA techniques, including histograms, scatter plots, and box plots, are particularly effective for visualizing variable distributions and detecting anomalies [54]. Non-graphical EDA complements this approach by summarizing the dataset with statistical measures, offering a concise numerical overview that supports assumption testing and anomaly detection [55].
Depending on the number of variables analyzed, EDA is categorized into univariate, bivariate, and multivariate analysis. Univariate analysis focuses on individual variables; bivariate analysis examines the relationship between two variables; and multivariate approaches such as Principal Component Analysis (PCA) and regression models allow the exploration of high-dimensional interactions [56,57,58]. These analyses are particularly valuable in manufacturing-related datasets where multiple features simultaneously affect scrap generation.
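As a concrete illustration of the graphical and non-graphical steps above, the short sketch below runs basic univariate, bivariate, and outlier checks with pandas on a toy production table. The column names and values are invented for illustration and are not taken from the paper's dataset.

```python
import pandas as pd

# Toy production table; "Machine", "WorkingHours", "ScrapTotal" are
# illustrative names, not the paper's schema.
df = pd.DataFrame({
    "Machine": ["M1", "M2", "M1", "M3", "M2", "M1"],
    "WorkingHours": [8.0, 7.5, 9.0, 6.0, 8.5, 7.0],
    "ScrapTotal": [12, 30, 15, 45, 28, 10],
})

# Non-graphical EDA: summary statistics and missing-value counts.
print(df.describe())
print(df.isna().sum())

# Univariate: binned distribution of a single variable.
print(df["ScrapTotal"].value_counts(bins=3))

# Bivariate: correlation between two numeric variables.
print(df["WorkingHours"].corr(df["ScrapTotal"]))

# Simple outlier flag via the interquartile-range (IQR) rule.
q1, q3 = df["ScrapTotal"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["ScrapTotal"] < q1 - 1.5 * iqr) | (df["ScrapTotal"] > q3 + 1.5 * iqr)]
print(len(outliers))
```

In a real scrap dataset, the same checks would be run per feature; histograms, scatter plots, and box plots would accompany these numerical summaries.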

3.2. Data Preprocessing

Data preprocessing is one of the most resource-intensive stages of ML projects, often consuming over half of the total project effort [59]. Despite its importance, it is frequently underemphasized in research discussions. Real-world data commonly contain missing values, outliers, noise, and inconsistencies, which, if left unaddressed, can significantly degrade model performance [60]. Empirical studies confirm that preprocessing can yield substantial performance improvements; for example, Chandrasekar and Qian [61] demonstrated a 25.39% increase in model accuracy when preprocessing was applied.
The key preprocessing tasks include data cleaning, which involves the identification and handling of missing values, smoothing noisy data, and detecting inconsistencies [62]. Moreover, data transformation refers to converting variables into formats suitable for machine learning, including normalization, scaling, feature engineering, and categorical encoding [63,64]. In addition, data reduction simplifies large datasets through dimensionality reduction techniques such as PCA, or through sample reduction to improve computational efficiency [65]. Finally, data integration merges multiple datasets from heterogeneous sources into a unified representation for analysis and modeling [66].
Preprocessing ensures data reliability, consistency, and interpretability, thereby enhancing the generalizability of predictive models. In the context of scrap prediction, preprocessing steps play a central role in converting raw manufacturing data into usable forms for ML-driven sustainability analytics.
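The cleaning and transformation tasks above can be sketched in a few lines of pandas. The frame, column names, and imputation choices below are illustrative assumptions, not the paper's actual pipeline.

```python
import pandas as pd

# Synthetic raw frame; the columns and values are invented for illustration.
raw = pd.DataFrame({
    "Machine": ["M1", "M2", None, "M1"],
    "Speed": ["120", "abc", "95", "110"],   # noisy strings, as in real logs
    "Scrap": [5.0, None, 3.0, 7.0],
})

# Data cleaning: coerce noisy strings to numbers, then impute missing values.
raw["Speed"] = pd.to_numeric(raw["Speed"], errors="coerce")
raw["Speed"] = raw["Speed"].fillna(raw["Speed"].median())
raw["Scrap"] = raw["Scrap"].fillna(raw["Scrap"].mean())
raw["Machine"] = raw["Machine"].fillna("Unknown")

# Data transformation: z-score scaling for numerics, one-hot for categoricals.
for col in ["Speed", "Scrap"]:
    raw[col] = (raw[col] - raw[col].mean()) / raw[col].std()
clean = pd.get_dummies(raw, columns=["Machine"])
print(clean.shape)
```

Dimensionality reduction (e.g., PCA) and integration of multiple sources would follow the same pattern of transforming the frame in place before modeling.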

3.3. Machine Learning (ML)

Machine learning (ML) is defined as the process of enabling computational systems to improve their performance by learning from data or past experiences [67]. By extracting patterns from historical data, ML provides predictive insights applicable across diverse domains such as quality control, production optimization, and scrap reduction [68,69].
As part of Industry 4.0, ML has gained increasing prominence due to its scalability and adaptability, surpassing traditional statistical techniques that are often constrained in handling high-dimensional, noisy, or unstructured data [70,71]. ML algorithms can uncover hidden patterns, automate decision-making, and improve the robustness of forecasting tasks, positioning them as indispensable tools for sustainable production [72,73].
ML approaches can be broadly categorized into three paradigms [74]. Supervised learning trains models on labeled datasets and is commonly applied to classification and regression tasks, while unsupervised learning identifies hidden patterns in unlabeled data, often used for clustering and dimensionality reduction. Finally, reinforcement learning optimizes sequential decision-making through feedback signals from the environment.
Recent surveys highlight the growing adoption of ML in industry, with reports indicating that over two-thirds of companies actively deploy ML, and nearly all plan to expand its usage [75]. This growing relevance emphasizes the role of ML not only in operational efficiency but also as a critical enabler of sustainable, intelligent manufacturing practices.

4. Proposed Hybrid Weighted–Stacked Ensemble (HWSE) Framework

4.1. CatBoost Algorithm

CatBoost is a gradient boosting algorithm developed by Yandex, specifically designed to efficiently handle categorical features [76]. Unlike traditional ML algorithms, which often rely on preprocessing techniques such as one-hot encoding, CatBoost incorporates categorical data directly into the training process. A core innovation is its ordered boosting scheme, which mitigates overfitting by preventing the target leakage that typically arises in standard boosting methods.
CatBoost also supports GPU acceleration, enabling scalable training on large industrial datasets. For regression tasks such as scrap prediction, it minimizes loss functions (e.g., RMSE and MAE) effectively while maintaining high generalization. Its robustness against hyperparameter sensitivity makes it especially suitable for manufacturing environments where rapid deployment is essential.
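CatBoost's core innovation, ordered target statistics, can be made concrete in a few lines: each row's categorical encoding is computed only from the target values of rows that precede it in a permutation, which is what prevents target leakage. The function below is a simplified sketch of that idea, not CatBoost's actual implementation; the prior and smoothing values are illustrative.

```python
# Simplified sketch of CatBoost-style ordered target statistics: the encoding
# for each row uses only the target history of *preceding* rows (here, the
# given order stands in for a random permutation), smoothed toward a prior.
def ordered_target_encode(categories, targets, prior=0.5, smoothing=1.0):
    sums, counts, encoded = {}, {}, []
    for cat, y in zip(categories, targets):
        s, c = sums.get(cat, 0.0), counts.get(cat, 0)
        # Encode from history seen so far; unseen categories fall back to the prior.
        encoded.append((s + prior * smoothing) / (c + smoothing))
        sums[cat] = s + y
        counts[cat] = c + 1
    return encoded

cats = ["M1", "M1", "M2", "M1", "M2"]
ys = [1.0, 0.0, 1.0, 1.0, 0.0]
enc = ordered_target_encode(cats, ys)
print(enc)  # -> [0.5, 0.75, 0.5, 0.5, 0.75]
```

Because no row ever sees its own target, the encoded feature cannot leak label information into training, unlike naive mean-target encoding.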

4.2. XGBoost Algorithm

XGBoost (Extreme Gradient Boosting) is a scalable and highly efficient variant of the gradient boosting framework, widely recognized for its predictive performance [77]. It leverages advanced regularization techniques (L1 and L2) to prevent overfitting and employs a sparsity-aware split finding method to optimize tree growth on large, sparse datasets.
A key advantage of XGBoost is its ability to model complex nonlinear relationships in high-dimensional data, making it suitable for regression problems such as scrap forecasting. Despite its relatively higher computational cost compared to simpler models, its superior predictive accuracy has established it as one of the most widely adopted ML methods in both academia and industry.
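The effect of XGBoost's L2 regularization can be made concrete with its leaf-weight rule: for squared-error loss, the optimal weight of a leaf is w* = -G / (H + λ), where G is the sum of gradients in the leaf, H the sum of hessians (one per sample for squared error), and λ the L2 penalty. A minimal sketch with illustrative numbers:

```python
# XGBoost's regularized leaf weight for squared-error loss:
# w* = -G / (H + lambda); larger lambda shrinks the update toward zero.
def leaf_weight(gradients, lam=1.0):
    G = sum(gradients)
    H = len(gradients)  # hessian is 1 per sample for squared error
    return -G / (H + lam)

# Gradients g_i = y_hat - y for three samples falling into one leaf.
grads = [-2.0, -1.0, -3.0]
print(leaf_weight(grads, lam=0.0))  # -> 2.0, unregularized mean correction
print(leaf_weight(grads, lam=3.0))  # -> 1.0, shrunk by the L2 penalty
```

This shrinkage, together with the L1 term and sparsity-aware splits, is what lets XGBoost grow deep trees on high-dimensional data without overfitting as readily as unregularized boosting.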

4.3. Proposed Method

While both XGBoost and CatBoost provide state-of-the-art performance, each has distinct advantages: XGBoost excels in modeling high-dimensional numeric variables, whereas CatBoost is optimized for categorical variables. In real-world manufacturing datasets, both variable types coexist and contribute to scrap generation. Thus, a hybrid ensemble framework was designed to leverage the strengths of both algorithms.
The proposed method integrates three ensemble paradigms: weighted averaging, stacked generalization, and blending. Weighted averaging balances the predictions of CatBoost and XGBoost using optimized weights, while stacking introduces a meta-model that learns second-level representations of the base models’ outputs. Finally, a blending step combines both results, ensuring robust performance and minimizing bias from a single ensemble strategy. The operation of the model is provided in pseudocode in Algorithm 1.
Algorithm 1: HWSE for scrap prediction.
Input: Dataset D with numeric features X_num, categorical features X_cat, and target variable y
Output: Final predictions ŷ_Final, model performance (RMSE)
Step 1: Input the Dataset
Load dataset D.
Step 2: Separate Features
X_num ← numeric features
X_cat ← categorical features
Step 3: Train Base Models
Train XGBoost on X_num: ŷ_XGB = f_XGB(X_num)
Train CatBoost on X_cat: ŷ_Cat = f_Cat(X_cat)
Step 4: Optimize Weights with PSO
Run Particle Swarm Optimization to find w_XGB, w_Cat that minimize validation RMSE, subject to:
w_XGB, w_Cat ∈ [0, 1],  w_XGB + w_Cat = 1
Step 5: Weighted Averaging
ŷ_Weighted = w_XGB · ŷ_XGB + w_Cat · ŷ_Cat
Step 6: Create Stacking Dataset
Z = [ŷ_XGB, ŷ_Cat]
Step 7: Train Meta-Model
Train f_Meta (Ridge, Lasso, or GBM) on Z: ŷ_Meta = f_Meta(Z)
Step 8: Blend Predictions
ŷ_Final = α · ŷ_Weighted + β · ŷ_Meta, where α + β = 1
Step 9: Evaluate Model
Evaluate using Root Mean Squared Error (RMSE):
RMSE = √( (1/n) Σ_{i=1}^{n} (y_i − ŷ_Final,i)² )
Step 10: Benchmark with Baselines
Train and evaluate: XGBoost, CatBoost, AdaBoost, LightGBM, GBM, Weighted Averaging, Stacked Ridge, Stacked Lasso, PSO–XGBoost, PSO–CatBoost. Compare RMSE values with the proposed model.
In this study, we propose a Hybrid Weighted–Stacked Ensemble (HWSE) framework to improve scrap prediction performance. The framework is designed as a multi-layered ensemble in which base models specialize in different feature types: XGBoost is trained exclusively on numerical variables, while CatBoost is applied to categorical variables. Their predictions are first combined through a weighted averaging scheme, where the weights are optimized using Particle Swarm Optimization (PSO). In the second stage, the predictions of both base models are used to create a stacking dataset on which a meta-model (Ridge, Lasso, or GBM) is trained to capture residual patterns. Finally, the weighted averaging and stacking outputs are blended through an optimized linear combination to generate the final prediction. This layered design enables the HWSE to exploit the strengths of both averaging- and stacking-based ensembles, providing robustness against feature heterogeneity and improved accuracy compared to single or conventional ensemble models.
In the proposed HWSE framework, the weights w_XGB and w_Cat are tuned via Particle Swarm Optimization (PSO) under the constraints w_XGB, w_Cat ∈ [0, 1] and w_XGB + w_Cat = 1. The optimization minimizes the validation RMSE with early stopping and multiple restarts to ensure stability. Base-model out-of-fold predictions are used to feed the stacking meta-model, thereby avoiding data leakage. The final predictions are obtained by a constrained blend of the PSO-optimized weighted average and the meta-model output, which provides both robustness and improved generalization. As illustrated in Figure 2, the framework efficiently combines numerical and categorical features to improve predictive accuracy.
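The weighting, stacking, and blending stages can be sketched end to end. The snippet below is a minimal, self-contained illustration on synthetic data: the arrays y_xgb and y_cat stand in for the out-of-fold predictions of the two base models; the PSO loop optimizes the single free weight w_XGB (with w_Cat = 1 − w_XGB, so the sum-to-one constraint holds by construction); a closed-form ridge regression plays the role of the stacking meta-model; and a simple grid search stands in for the optimized α/β blend. All hyperparameters (swarm size, inertia, ridge penalty) are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for out-of-fold base predictions: in the paper, y_xgb
# comes from XGBoost on numeric features and y_cat from CatBoost on
# categorical features.
y_true = rng.normal(10.0, 2.0, size=200)
y_xgb = y_true + rng.normal(0.0, 1.0, size=200)
y_cat = y_true + rng.normal(0.0, 1.5, size=200)

def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

# Steps 4-5: PSO over the scalar w = w_XGB, with w_Cat = 1 - w, so
# w_XGB + w_Cat = 1 holds by construction.
def pso_weight(y, p1, p2, n_particles=20, iters=50):
    pos = rng.uniform(0.0, 1.0, n_particles)
    vel = np.zeros(n_particles)
    cost = lambda w: rmse(y, w * p1 + (1 - w) * p2)
    pbest = pos.copy()
    pbest_val = np.array([cost(w) for w in pos])
    gbest = pbest[np.argmin(pbest_val)]
    for _ in range(iters):
        r1, r2 = rng.uniform(size=n_particles), rng.uniform(size=n_particles)
        # Inertia 0.7, cognitive/social coefficients 1.5 (illustrative values).
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, 0.0, 1.0)
        vals = np.array([cost(w) for w in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[np.argmin(pbest_val)]
    return float(gbest)

w = pso_weight(y_true, y_xgb, y_cat)
y_weighted = w * y_xgb + (1 - w) * y_cat

# Steps 6-7: stacking meta-model; closed-form ridge on Z = [y_xgb, y_cat, 1]
# stands in for the Ridge/Lasso/GBM meta-learner.
Z = np.column_stack([y_xgb, y_cat, np.ones_like(y_xgb)])
coef = np.linalg.solve(Z.T @ Z + 1.0 * np.eye(3), Z.T @ y_true)
y_meta = Z @ coef

# Step 8: blend with alpha + beta = 1; a grid search stands in for the
# optimized linear combination.
alphas = np.linspace(0.0, 1.0, 21)
best_alpha = min(alphas, key=lambda a: rmse(y_true, a * y_weighted + (1 - a) * y_meta))
y_final = best_alpha * y_weighted + (1 - best_alpha) * y_meta

print(round(w, 3), rmse(y_true, y_final))
```

On this synthetic data the blended output achieves a lower RMSE than either base model alone, mirroring the behavior reported for the full HWSE framework.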

4.4. Advantages of the Proposed Framework

The proposed hybrid design offers several benefits, including the efficient handling of both categorical and numeric features without excessive preprocessing, enhanced predictive accuracy through the integration of weighted averaging and stacking ensembles, improved robustness and generalization by preventing reliance on a single ensemble method, and scalability and adaptability to large industrial datasets with heterogeneous feature structures. This framework provides a novel solution for manufacturing scrap prediction, outperforming conventional single-model approaches by capturing complex interactions across diverse feature types.

5. Case Study: Scrap Prediction in the Aerospace Manufacturing Industry

The dataset used in this study was obtained from a defense and aviation company and covers a five-year period (January 2019–November 2023). It consists of 41,003 rows and 64 variables, compiled from monthly production records. The features include general production information such as the production date, machine identity (Machine, MachineCode), operator, product code (Item, Description), lot number, working hours, working speed, production amount, number of machine hits, and energy consumption. Table 3 shows a summary of the dataset features and their descriptions. The analyses were performed on a personal computer (Acer Nitro 5, Acer Inc., Taipei, Taiwan) using Python 3.12, PyCharm 2024.2, and JupyterLab 4.1.
Scrap-related variables capture different types of defects leading to scrap generation, including form errors, burrs, scratches, mold traces, missing holes, bending or slitting errors, and raw material issues, while aggregated indicators such as total scrap count, production scrap rate, and overall scrap in kilograms are also included. In addition, downtime and failure variables record process interruptions in minutes, caused by mold, mechanical, electrical, pneumatic, hydraulic, or transfer failures, as well as operational issues such as planned maintenance, cleaning, training, material delays, forklift waiting, repair waiting, and power outages. Higher-level performance metrics such as total downtime, total number of stops, and model change frequency are also part of the dataset. This comprehensive structure enables both a granular analysis of defect sources and a holistic evaluation of production efficiency.
The content of each variable in the dataset was examined individually. This inspection revealed that several variables had been assigned the wrong data type at import, so the Amount, Hit, RollEndScrap, MoldChangeFailure, RollChangeError, AutocontrolFailure, CleaningError, and MoldSettingError variables had to be converted to numeric types. Direct conversion failed because some values had been entered with letters by mistake or contained extraneous punctuation; these entries were either corrected or cleared, after which the variables were converted from string to float. Since the Consumption and LotNumber variables contain largely meaningless data and many missing values, they were judged unnecessary for the model and were left untouched at this stage, to be dropped later during preprocessing. After cleaning, the dataset contained 41,001 rows and 64 columns. The Date variable is stored as day.month.year, a format that does not allow relationships between variables to be examined on a daily, weekly, monthly, or annual basis. Five datetime-typed variables were therefore derived from Date and added to each row; under this scheme, for example, the date 17 March 2022 appears in the dataset as year 2022, month 3, week 12, enabling periodic comparisons. The dataset was then analyzed statistically in terms of its numerical and categorical variables. To protect company confidentiality, Table 4 presents only a portion of the numerical analysis, with operator and product names removed.
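The type-conversion and date-derivation steps described above can be sketched with pandas as follows. The column names follow Table 3, but the exact correction rules (stripping stray letters and punctuation before coercing to float) are assumptions, and ISO week numbering may differ by one from the convention used in the study.

```python
import pandas as pd

# Columns reported as wrongly typed strings (names follow the paper's Table 3).
NUMERIC_FIX = ["Amount", "Hit", "RollEndScrap", "MoldChangeFailure",
               "RollChangeError", "AutocontrolFailure", "CleaningError",
               "MoldSettingError"]

def clean_and_enrich(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    for col in NUMERIC_FIX:
        if col in df:
            # strip stray letters/punctuation, normalise decimal commas,
            # then coerce anything still unparseable to NaN
            df[col] = (df[col].astype(str)
                              .str.replace(r"[^\d.,-]", "", regex=True)
                              .str.replace(",", ".", regex=False))
            df[col] = pd.to_numeric(df[col], errors="coerce")
    # Date is stored as day.month.year
    dt = pd.to_datetime(df["Date"], format="%d.%m.%Y")
    iso = dt.dt.isocalendar()
    df["Year"], df["Month"], df["Week"] = dt.dt.year, dt.dt.month, iso.week
    df["Day"], df["Quarter"] = dt.dt.day, dt.dt.quarter
    return df
```

The derived Year/Month/Week/Day/Quarter columns then support the periodic comparisons used in the EDA.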
Next, EDA was performed, examining the data through univariate, bivariate, and multivariate analyses. Production by day and by week is presented in Figure 3 and Figure 4. On an annual basis, the highest production volumes occurred in 2022, followed by 2021, 2023, 2020, and 2019. Production therefore increased over the years, although 2023 deviated from this trend.
Scrap amounts by year, month, and day are given in Figure 5, Figure 6 and Figure 7. As Figure 5 shows, the values in 2019 are closely clustered, whereas broader distributions are observed in 2020, 2021, 2022, and 2023. In the monthly distributions, March, November, and December show the widest spread; the daily distributions in Figure 7 can be interpreted similarly.
In the data preprocessing stage, irrelevant features were removed following the insights from EDA. Accordingly, the variables Machine, MachineCode, Operator, Description, Hit, LotNumber, WorkingSpeed, Consumption, ProductionScrapRate, and SpEnergyConsumption were removed from the dataset. Machine, which contained the machine names, was removed. MachineCode was dropped because it contained too many NA (not available) values that could not be filled with any meaningful data. Operator was removed because it contained incorrect business information as well as personal names and surnames. Description, which contained brand and model information for the product, was also removed. Hit was dropped because it duplicated the information in the Amount variable. LotNumber was removed because it had too many missing and irrelevant values. WorkingSpeed was dropped because it contained unreliable data. Consumption and SpEnergyConsumption were removed because they contained too many NA values and added meaningless complexity to the dataset. Finally, ProductionScrapRate was dropped; although its values can be recovered from other variables in the dataset, retaining it could cause confusion.
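In pandas, the removals above amount to a single drop call; this is a sketch, and `remove_irrelevant` is a hypothetical helper name.

```python
import pandas as pd

# Variables removed after EDA (names as listed in the text)
DROP_COLS = ["Machine", "MachineCode", "Operator", "Description", "Hit",
             "LotNumber", "WorkingSpeed", "Consumption",
             "ProductionScrapRate", "SpEnergyConsumption"]

def remove_irrelevant(df: pd.DataFrame) -> pd.DataFrame:
    # errors="ignore" keeps the call safe if a column was already removed upstream
    return df.drop(columns=DROP_COLS, errors="ignore")
```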
The proposed methodology relies on a hybrid ensemble framework that integrates XGBoost and CatBoost. Specifically, XGBoost was dedicated to modeling numerical variables, while CatBoost handled categorical variables. Their predictions were first combined using a weighted averaging strategy, where the weights were optimized through Particle Swarm Optimization (PSO). Subsequently, a stacking dataset was constructed using the base model outputs, on which meta-learners such as Ridge, Lasso, or GBM were trained to capture residual patterns. Finally, a blending step was applied to merge the PSO-optimized weighted average with the stacking outputs, thereby producing the final predictions. This multi-layered design allowed the framework to exploit the complementary strengths of both averaging- and stacking-based ensembles, offering robustness and improved predictive accuracy.
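A dependency-light sketch of this multi-layered design is given below. To keep the example self-contained, scikit-learn's GradientBoostingRegressor stands in for both XGBoost (numeric features) and CatBoost (encoded categorical features), the PSO-tuned weights are replaced by a fixed coefficient, and Ridge serves as the meta-learner; all names and settings are illustrative, not the authors' implementation.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

def hybrid_predict(X_num, X_cat_enc, y, X_num_test, X_cat_test, alpha=0.5, seed=0):
    """Weighted average of two base learners + Ridge stacking, then a blend.

    `alpha` plays the role of the PSO-optimized blend coefficient between the
    weighted-average branch and the stacking branch.
    """
    kf = KFold(n_splits=5, shuffle=True, random_state=seed)
    oof = np.zeros((len(y), 2))   # leakage-free out-of-fold base predictions
    for tr, va in kf.split(X_num):
        m1 = GradientBoostingRegressor(random_state=seed).fit(X_num[tr], y[tr])
        m2 = GradientBoostingRegressor(random_state=seed).fit(X_cat_enc[tr], y[tr])
        oof[va, 0] = m1.predict(X_num[va])
        oof[va, 1] = m2.predict(X_cat_enc[va])

    # refit base models on all training data for test-time predictions
    m1 = GradientBoostingRegressor(random_state=seed).fit(X_num, y)
    m2 = GradientBoostingRegressor(random_state=seed).fit(X_cat_enc, y)
    test_base = np.column_stack([m1.predict(X_num_test), m2.predict(X_cat_test)])

    # weighted-average branch (w and 1 - w would come from PSO; fixed here)
    w = 0.5
    avg = w * test_base[:, 0] + (1 - w) * test_base[:, 1]

    # stacking branch: meta-model trained on out-of-fold predictions
    meta = Ridge().fit(oof, y)
    stacked = meta.predict(test_base)

    return alpha * avg + (1 - alpha) * stacked   # final constrained blend
```

In the actual framework, the two base learners would be PSO-tuned XGBoost and CatBoost models and the meta-learner could equally be Lasso or a GBM, as described above.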
The comparative performance of the proposed model against baseline algorithms is illustrated in Figure 8. As observed, XGBoost and CatBoost individually yielded Root Mean Squared Error (RMSE) values of approximately 1050, while AdaBoost performed the weakest with an RMSE close to 1500. In contrast, the proposed hybrid model achieved an RMSE of 721, corresponding to an error reduction exceeding 30%. This significant improvement highlights the effectiveness of combining PSO-optimized weighted averaging with stacked generalization in addressing the heterogeneity of industrial datasets. All experiments were conducted on a workstation equipped with an Intel i9 processor, 64 GB RAM, and an NVIDIA RTX GPU. Hyperparameter tuning was performed using grid search and Particle Swarm Optimization (PSO) for selected models. The dataset was divided into training (70%) and testing (30%) sets.
A more focused comparison is presented in Figure 9, where PSO-optimized XGBoost and CatBoost were evaluated against the proposed model. The RMSE of PSO–XGBoost was measured as 1068, while PSO–CatBoost achieved 1062. In contrast, the proposed hybrid framework substantially outperformed both single models with an RMSE of 721. This result confirms that although PSO provides slight performance gains for individual models, the major improvement is achieved through the hybrid integration of both algorithms.
The computational efficiency of the models was also evaluated as shown in Figure 10. Training times for PSO–XGBoost and PSO–CatBoost were 6 and 7 s, respectively, whereas the proposed hybrid model required 71 s. Although this represents nearly a tenfold increase in training cost, the significant reduction in forecasting error more than justifies the additional computational effort. Particularly in batch-level industrial applications, such training times remain within acceptable limits, ensuring practicality alongside accuracy.
In summary, the case study results clearly demonstrate that the proposed PSO-enhanced hybrid ensemble not only delivers superior predictive accuracy but also provides robustness against feature heterogeneity. While the training overhead is higher compared to single models, the reduction in forecasting error directly translates into cost savings and improved production quality. Thus, the framework offers a scalable and reliable pathway for integrating data-driven decision support into aerospace manufacturing and other industrial contexts.

6. Results and Discussion

From a computational perspective, the hybrid PSO–CatBoost–XGBoost ensemble requires approximately 71 s of training time, compared to 6 and 7 s for standalone PSO–XGBoost and PSO–CatBoost models, respectively. While this overhead may limit applications that demand real-time updates, it is acceptable for batch-level industrial scrap forecasting where accuracy is paramount. Importantly, the reduction in RMSE from 1068 and 1062 to 721 directly translates into significant cost savings and quality improvements, which outweigh the additional computational burden. For real-time Industrial IoT (IIoT) deployment, lightweight versions of the ensemble (e.g., omitting stacking or using compressed models) could be employed to balance speed and accuracy.
From a numerical perspective, the proposed hybrid model demonstrates clear superiority over the benchmark algorithms. As noted above, its training time of 71 s represents roughly a tenfold increase over the 6 and 7 s required by the PSO-based XGBoost and CatBoost models, but the performance gains achieved fully justify this overhead.
The identification of categorical factors, particularly operator and machine type, as highly influential variables provides actionable guidance for industrial practice. In real-world settings, managers can operationalize these findings by introducing targeted training programs for operators whose assignments correlate with higher scrap levels, or by standardizing machine calibration and maintenance schedules for machine types associated with higher defect rates. Furthermore, scrap monitoring dashboards integrated with IIoT platforms could flag operator–machine combinations that historically yield higher scrap, thereby enabling preventive interventions. In this way, the model’s feature importance analysis does not remain a purely statistical output but becomes a decision-support tool for reducing variability and ensuring consistent quality.
With respect to prediction accuracy, the Root Mean Squared Error (RMSE) values indicate substantial improvements. The PSO–XGBoost and PSO–CatBoost models yielded RMSE values of 1068 and 1062, respectively, whereas the proposed model achieved an RMSE of 721, corresponding to an error reduction exceeding 30 percent. A broader comparison further confirmed this trend: while AdaBoost exhibited the weakest performance with an RMSE of 1492, and conventional XGBoost and CatBoost models recorded 1071 and 1083, the proposed model consistently outperformed all competitors, attaining the lowest RMSE overall. To assess the robustness of the performance improvements, paired t-tests were conducted between the proposed hybrid ensemble and the baseline models. The results confirmed that the improvements were statistically significant (all p-values < 0.01). Notably, when compared to the strongest baseline, the paired t-test produced a p-value of 0.000121, confirming that the error reduction achieved by the hybrid model is highly significant.
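The significance check can be reproduced with SciPy's paired t-test applied to per-sample squared errors of two models on the same test set. The snippet below is a sketch using that standard procedure; it does not use the study's data, and `compare_models` is a hypothetical helper name.

```python
import numpy as np
from scipy.stats import ttest_rel

def compare_models(y_true, pred_a, pred_b):
    """Paired t-test on per-sample squared errors of two models.

    A positive t-statistic with a small p-value indicates that model A's
    errors are significantly larger than model B's on the same samples.
    """
    err_a = (y_true - pred_a) ** 2
    err_b = (y_true - pred_b) ** 2
    t_stat, p_value = ttest_rel(err_a, err_b)
    return float(t_stat), float(p_value)
```

Pairing the errors sample-by-sample (rather than comparing aggregate RMSE values) is what allows the test to control for the shared difficulty of individual test points.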
These quantitative findings demonstrate that, despite its longer training time, the proposed approach provides a significant improvement in predictive accuracy. In practical manufacturing contexts, such as scrap forecasting, a 30–35 percent reduction in prediction error translates into considerable cost savings and enhanced production efficiency. Therefore, the computational cost of the proposed hybrid ensemble is outweighed by its superior generalization capability and its potential to support sustainable manufacturing decision-making.

7. Conclusions

The purpose of this research was to explore the potential of a hybrid approach combining CatBoost and XGBoost for scrap prediction in industrial settings. By integrating the strengths of both algorithms, the hybrid model demonstrated improved accuracy and performance, particularly when handling complex datasets containing both categorical and numerical variables. The methodology employed, which involved weighted averaging and stacked ensemble techniques, allowed for more robust predictions compared to traditional single-algorithm models. To further enhance performance, Particle Swarm Optimization (PSO) was employed as a metaheuristic optimization strategy for hyperparameter tuning, ensuring that both base models and ensemble weights were efficiently calibrated.
The proposed hybrid model demonstrates significant potential for scalability and adaptability, making it applicable across various manufacturing environments. Its design, which integrates advanced ensemble learning techniques with heuristic optimization, provides a flexible framework that can be tailored to specific operational contexts. For instance, the model’s feature selection and parameter tuning processes can be adjusted to align with the unique characteristics of different datasets, such as variations in product types, production volumes, and manufacturing technologies.
In terms of scalability, the modular structure of the model allows it to handle larger datasets by leveraging distributed computing frameworks or cloud-based architectures. This ensures that the model maintains its performance and efficiency even as the scale of the manufacturing environment expands. Furthermore, the model’s reliance on scalable algorithms like XGBoost and CatBoost ensures computational efficiency, which is critical for real-time applications.
For real-time monitoring, the model can be adapted to integrate with Industrial Internet of Things (IIoT) systems, enabling the continuous collection and analysis of production data. This adaptation would involve implementing real-time data pipelines and incremental learning mechanisms to update the model dynamically without requiring complete retraining. Such capabilities would allow manufacturers to detect and respond to inefficiencies or quality issues in real time, enhancing operational resilience and productivity.
Although validated in the aerospace manufacturing sector, the hybrid PSO–CatBoost–XGBoost framework is applicable to a wide range of industries such as automotive, energy, and process manufacturing. Since the ensemble structure integrates both numerical and categorical variables, it can adapt to diverse data structures and production dynamics without significant modifications. The modularity of the framework allows the reconfiguration of preprocessing pipelines and meta-models according to sector-specific requirements, thereby supporting generalizability across different manufacturing environments.
The benefits of this hybrid approach include its ability to efficiently process diverse data types, handle overfitting issues, and provide scalable solutions. However, one drawback is the longer computational time required due to the model’s complexity, which could hinder its applicability in scenarios demanding real-time predictions. Despite this limitation, the enhanced predictive accuracy and flexibility of the hybrid model make it a valuable tool for industries aiming to minimize scrap and optimize production processes.
While the hybrid model demonstrates significant improvements in predictive accuracy and robustness, implementing it in real-time production settings presents notable challenges. One of the primary concerns is the computational cost associated with the ensemble design and PSO optimization, as multiple stages of training and prediction are resource intensive. Another challenge lies in the scalability of the model across diverse manufacturing environments. Variations in data characteristics, production processes, and resource availability may require additional customization and fine-tuning, which can further increase the implementation effort. Integrating the model into the existing production infrastructure may also necessitate investments in hardware, software, and personnel training.
Despite the fact that PSO outperformed other metaheuristics tested in this study, future research may consider hybridizing multiple optimization algorithms to exploit their complementary strengths. Such hybrid strategies could enhance convergence speed and accuracy in optimizing ensemble models for industrial scrap prediction. Although the proposed framework has strong potential for large-scale deployment, several barriers may hinder its seamless adoption. From a technical perspective, the main challenges include the computational demands of ensemble models, real-time data integration with IIoT devices, and the need for robust data pipelines to ensure low-latency processing. Data quality issues such as missing values, sensor errors, and inconsistent formats may also compromise performance. From an organizational standpoint, industries may face resistance to adopting AI-driven decision-making due to a lack of trust, limited technical expertise, or concerns about interpretability. In addition, implementing such systems often requires workforce training and cultural adaptation to integrate data-driven insights into established workflows. Finally, infrastructural barriers such as inadequate IT infrastructure, cybersecurity risks, and costs associated with upgrading legacy systems may slow down adoption. Overcoming these challenges requires not only technical refinements, such as model compression and incremental learning, but also organizational strategies that emphasize capacity building, change management, and cross-functional collaboration.
Addressing these challenges will be critical for maximizing the practical utility of the proposed hybrid model. Future studies should investigate strategies to reduce computational time while maintaining accuracy, such as lightweight ensembles, model compression, or adaptive optimization. Additionally, exploring the application of this hybrid approach in other domains, such as energy forecasting or environmental monitoring, would demonstrate its versatility. Importantly, by reducing scrap generation, conserving raw materials, and lowering energy use, the proposed framework directly contributes to green manufacturing practices, the circular economy, and the United Nations Sustainable Development Goals (SDGs) on responsible production and consumption. Overall, the PSO-enhanced hybrid model represents a scalable, optimization-driven, and environmentally responsible solution for industrial scrap prediction.

Author Contributions

Conceptualization, E.N.N. and B.E.; Methodology, E.N.N.; Software, E.N.N.; Writing—review and editing, E.N.N. and E.E.; Supervision, B.E. and E.E. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data that support this study’s findings are available on request from the corresponding author.

Acknowledgments

This study is derived from the corresponding author’s doctoral dissertation.

Conflicts of Interest

The authors declare that they have no competing interests.

Abbreviations

The following abbreviations are used in this manuscript:
CatBoost: Categorical Boosting
EDA: Exploratory Data Analysis
ML: Machine Learning
PSO: Particle Swarm Optimization
RMSE: Root Mean Squared Error
XGBoost: Extreme Gradient Boosting

  65. Fan, C.; Chen, M.; Wang, X.; Wang, J.; Huang, B. A Review on Data Preprocessing Techniques Toward Efficient and Reliable Knowledge Discovery from Building Operational Data. Front. Energy Res. 2021, 9, 652801. [Google Scholar] [CrossRef]
  66. Dong, X.L.; Rekatsinas, T. Data Integration and Machine Learning: A Natural Synergy. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Houston, TX, USA, 10–15 June 2018; Association for Computing Machinery: New York, NY, USA, 2018; pp. 1645–1650. [Google Scholar] [CrossRef]
  67. Alpaydin, E. Introduction to Machine Learning, 3rd ed.; MIT Press: Cambridge, MA, USA, 2014. [Google Scholar]
  68. Ni, D.; Xiao, Z.; Lim, M.K. A Systematic Review of the Research Trends of Machine Learning in Supply Chain Management. Int. J. Mach. Learn. Cybern. 2020, 11, 1463–1482. [Google Scholar] [CrossRef]
  69. Ali, M.F.B.M.; Ariffin, M.K.A.B.M.; Supeni, E.E.B.; Mustapha, F.B. An Unsupervised Machine Learning-Based Framework for Transferring Local Factories into Supply Chain Networks. Mathematics 2021, 9, 3114. [Google Scholar] [CrossRef]
  70. Ben Elmir, W.; Hemmak, A.; Senouci, B. Smart Platform for Data Blood Bank Management: Forecasting Demand in Blood Supply Chain Using Machine Learning. Information 2023, 14, 31. [Google Scholar] [CrossRef]
  71. Hirata, E.; Lambrou, M.; Watanabe, D. Blockchain Technology in Supply Chain Management: Insights from Machine Learning Algorithms. Marit. Bus. Rev. 2020, 6, 114–128. [Google Scholar] [CrossRef]
  72. Sharma, N.; Sharma, R.; Jindal, N. Machine Learning and Deep Learning Applications—A Vision. Glob. Trans. Proc. 2021, 2, 24–28. [Google Scholar] [CrossRef]
  73. Lee, A.; Taylor, P.; Kalpathy-Cramer, J. Machine Learning Has Arrived! Ophthalmology 2017, 124, 1726–1728. [Google Scholar] [CrossRef] [PubMed]
  74. Sokolova, M.; Lapalme, G. A Systematic Analysis of Performance Measures for Classification Tasks. Inf. Process. Manag. 2009, 45, 427–437. [Google Scholar] [CrossRef]
  75. Brown, S. Machine Learning, Explained. MIT Sloan School of Management. 2021. Available online: https://mitsloan.mit.edu/ideas-made-to-matter/machine-learning-explained (accessed on 19 October 2024).
  76. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased Boosting with Categorical Features. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, QC, Canada, 3–8 December 2018; Curran Associates, Inc.: Red Hook, NY, USA, 2018; pp. 6639–6649. [Google Scholar]
  77. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; Association for Computing Machinery: New York, NY, USA, 2016; pp. 785–794. [Google Scholar] [CrossRef]
Figure 1. Distribution of studies in the literature by field.
Figure 2. Architecture of the proposed Hybrid Weighted–Stacked Ensemble (HWSE) framework.
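As a rough illustration of the weighted-averaging stage of the HWSE framework, the sketch below blends the predictions of two base regressors and tunes the single blend weight with a one-dimensional PSO. It is a minimal, self-contained approximation: the inertia and acceleration coefficients (0.7, 1.5) and the function names are illustrative assumptions, not the authors' implementation, and in the actual framework `pred_a`/`pred_b` would come from trained CatBoost and XGBoost models.

```python
import math
import random

def rmse(y_true, y_pred):
    """Root Mean Squared Error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def blend(pred_a, pred_b, w):
    """Weighted average of two models' predictions, with weight w in [0, 1]."""
    return [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]

def pso_blend_weight(y_true, pred_a, pred_b, n_particles=10, n_iters=30, seed=0):
    """Search the blend weight that minimises RMSE using a 1-D PSO."""
    rng = random.Random(seed)
    pos = [rng.random() for _ in range(n_particles)]          # particle positions (weights)
    vel = [0.0] * n_particles                                 # particle velocities
    pbest = pos[:]                                            # personal best positions
    pbest_err = [rmse(y_true, blend(pred_a, pred_b, w)) for w in pos]
    g = min(range(n_particles), key=lambda i: pbest_err[i])
    gbest, gbest_err = pbest[g], pbest_err[g]                 # global best
    for _ in range(n_iters):
        for i in range(n_particles):
            r1, r2 = rng.random(), rng.random()
            # Standard PSO velocity update: inertia + cognitive + social terms.
            vel[i] = 0.7 * vel[i] + 1.5 * r1 * (pbest[i] - pos[i]) + 1.5 * r2 * (gbest - pos[i])
            pos[i] = min(1.0, max(0.0, pos[i] + vel[i]))      # clamp weight to [0, 1]
            err = rmse(y_true, blend(pred_a, pred_b, pos[i]))
            if err < pbest_err[i]:
                pbest[i], pbest_err[i] = pos[i], err
                if err < gbest_err:
                    gbest, gbest_err = pos[i], err
    return gbest, gbest_err
```

On held-out validation data, the returned weight would then be fixed and applied to the test-set predictions of both base learners; the paper's full pipeline additionally feeds the base predictions into a stacked meta-learner.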
Figure 3. Number of products produced per day.
Figure 4. Number of products produced per week.
Figure 5. Total scrap by year.
Figure 6. Total scrap by month.
Figure 7. Total scrap by day.
Figure 8. Comparison of model performances in terms of RMSE across baseline and hybrid algorithms. Lower values indicate better accuracy.
Figure 9. Performance comparison of PSO-optimized single models (XGBoost and CatBoost) versus the proposed hybrid model.
Figure 10. Training time comparison of PSO–XGBoost, PSO–CatBoost, and the proposed hybrid model.
Table 1. Summary of literature by application area, methodology, and contribution.

| Reference | Application Area | Methodology | Key Findings/Contribution |
|---|---|---|---|
| [13] | Production engineering | Uncertainty management | Improved handling of uncertainties in manufacturing planning |
| [14] | Smart factories | ML + Edge Computing | Predictive quality inspection in real-time settings |
| [15] | Manufacturing quality | Supervised ML | Enhanced defect monitoring using product state data |
| [16] | Textile manufacturing | Online ML model | Prognostic defect detection reduced scrap generation |
| [17] | Manufacturing | Decision models | Optimized inspection for short-run productions |
| [36] | Additive manufacturing | ML-based estimation | Accurate prediction of hardness-related defects in SLM |
| [18] | 3D printing (FDM) | Parameter analysis | Identified critical process parameters for efficiency |
| [35] | Machining sustainability | Process data + ML | Scrap reduction via predictive quality analytics |
| [37] | Composite manufacturing | Deep learning | Automated defect detection in fiber placement |
| [38] | Iron & steel industry | ANN | Regional scrap steel price forecasting with high accuracy |
| [39] | Steel recycling | PSPNet (DL) | Visual classification of unwanted materials in scrap |
| [40] | Foundry industry | CatBoost, XGBoost | Boosting algorithms outperform conventional methods in defect prediction |
| [41] | General manufacturing | ANN | Simultaneous prediction of production output and scrap levels |
| [42] | Recycling (Aluminum) | Deep learning CV | High-accuracy classification of aluminum scrap |
| [43] | Recycling (Aluminum) | LIBS + RF/NN | Real-time scrap classification enhances recycling efficiency |
| [29] | Optimization | GGO metaheuristic | Improved feature selection and constrained optimization |
| [30] | Optimization | iHowOA metaheuristic | Human-inspired search balances exploration and exploitation |
| [31] | Optimization | FbOA metaheuristic | Football-strategy inspired optimization for global search |
| [44] | General ML | Hybrid AdaBoost + Bagging | Early dynamic integration improved classifier accuracy |
| [45] | Energy (Load forecasting) | CatBoost + XGBoost | Hybrid boosting improved short-term load prediction |
| [46] | Healthcare (Diabetes) | LightGBM + CatBoost | Hybrid ensemble reduced overfitting and improved accuracy |
| [47] | Mining | LightGBM + CatBoost | Blast toe prediction supporting sustainable practices |
| [48] | Energy forecasting | Hybrid (CatBoost, XGBoost, RF) | Robust prediction under missing data conditions |
| [49] | Environmental engineering | WOA/GWO + XGBoost | Hybrid models outperform conventional learners in vibration prediction |
| [50] | Construction (Tunneling) | CatBoost + Bayesian Opt. + NSGA-III | High reliability in tunneling parameter optimization |
| [51] | Environmental monitoring | Hybrid boosting + CNN-LSTM | Forecasting harmful algal blooms with strong predictive accuracy |
Table 2. Classification of studies by methodology and application domain.

| Methodology | Scrap/Quality | Energy | Healthcare | Environment | Mining/Construction |
|---|---|---|---|---|---|
| Random Forest (RF) | [42,43] | | | | |
| Artificial Neural Networks (ANN) | [38,41] | | | | |
| CatBoost | [40] | [45] | | | [50] |
| XGBoost | [40] | [45,48] | | [49,51] | [50] |
| Hybrid Models | | [45,48] | [46] | [49,51] | [47,50] |
| Deep Learning (DL) | [37,39] | | | | |
Table 3. Summary of dataset features and their descriptions.

| Variable | Description |
|---|---|
| General Production Information | |
| Date | The date the product is produced |
| Machine, MachineCode | Machine identity and corresponding code |
| Operator | Worker operating the machine |
| Item, Description | Product code and name |
| LotNumber | Code of the product family |
| WorkingHours | Machine’s total operating hours |
| WorkingSpeed | Operating speed of the machine |
| Amount | Number of produced parts |
| Hit | Number of machine hits |
| Consumption | Energy consumption in kWh |
| SpEnergyConsumption | Specific energy consumption in kWh |
| Scrap-related Variables | |
| FormError, BurrHole, BurrSurface, GradualCut, … | Different types of scrap causes (e.g., form error, surface burr, scratches, mold traces, missing holes, bending/slitting errors, etc.) |
| RawMatScrap, RollEndScrap | Scrap caused by raw material or roll ends |
| TotalNumberScrap, TotalProductionScrap | Total number and amount of scrap parts |
| ProductionScrapRate | Scrap ratio in production |
| TotalEngScrap, TotalScrap | Engineering scrap and overall scrap (kg) |
| Downtime and Failure Variables | |
| MoldFailure, MechanicalFault, ElectricalFault, … | Downtime (minutes) due to different machine or process failures (e.g., pneumatic, hydraulic, transfer, mold change, roll change, machine setting, etc.) |
| PlannedMaintenanceFailure, CleaningError, TrainingError | Downtime due to maintenance, cleaning, or training |
| MaterialDelay, ForkliftDelay, RepairDelay, BoxDelay | Downtime due to logistic delays |
| PowerCut, ChiefDecision | Downtime due to external or managerial causes |
| TotalDowntime, TotalNumberStop | Overall downtime and number of stops |
| ModelChange, ModelChangeCount, ModelChangePeriod | Frequency and duration of model changes |
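Because the dataset in Table 3 mixes categorical identifiers (machine, operator, product) with numerical process measurements, a preprocessing step must route each group appropriately: CatBoost can consume raw category values, whereas XGBoost expects numerically encoded input. The sketch below shows one minimal, assumed way to perform that split; the column subsets and function names are illustrative, not the authors' exact pipeline.

```python
# Illustrative subsets of the Table 3 variables (not the full feature list).
CATEGORICAL = ["Machine", "MachineCode", "Operator", "Item", "LotNumber"]
NUMERICAL = ["WorkingHours", "WorkingSpeed", "Amount", "Hit", "Consumption"]

def split_record(record):
    """Partition one production record into categorical and numerical parts.

    The categorical part can be fed to CatBoost as-is, while the numerical
    part (plus encoded categories) forms the XGBoost input.
    """
    cats = {k: record[k] for k in CATEGORICAL if k in record}
    nums = {k: float(record[k]) for k in NUMERICAL if k in record}
    return cats, nums

def ordinal_encode(values):
    """Minimal ordinal encoding of a categorical column for XGBoost-style input.

    Each distinct value is mapped to the order of its first appearance.
    """
    mapping = {}
    return [mapping.setdefault(v, len(mapping)) for v in values]
```

In practice a leakage-safe encoder fitted only on the training split (e.g., target or one-hot encoding) would replace the naive ordinal mapping shown here.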
Table 4. Descriptive statistics of selected numerical variables.

| Variable | Count | Mean | Std | Min | 25% | 50% | 75% | Max |
|---|---|---|---|---|---|---|---|---|
| WorkingHours | 40,748 | 1.12 × 10¹⁰ | 2.27 × 10¹² | 0.0 | 4.5 | 5.75 | 9.0 | 4.58 × 10¹⁴ |
| Amount | 40,956 | 6722.6 | 10,249.0 | 0.0 | 580.0 | 2450.0 | 9041.0 | 144,000.0 |
| Hit | 40,940 | 3821.6 | 5083.5 | 0.0 | 400.0 | 1325.0 | 5650.0 | 42,940.0 |
| WorkingSpeed | 5731 | 75.4 | 7280.8 | 20.0 | 28.0 | 33.0 | 40.0 | 46,000.0 |
| FormError | 854 | 28.4 | 122.2 | 2.0 | 2.0 | 5.0 | 16.0 | 3000.0 |
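A Table 4-style summary can be reproduced with the Python standard library alone, as sketched below. Note that quartile conventions differ slightly across tools: `statistics.quantiles` defaults to the "exclusive" method, whereas pandas' `DataFrame.describe` uses linear interpolation, so boundary values may not match exactly.

```python
import statistics

def describe(values):
    """Compute a Table 4-style summary: count, mean, std, min, quartiles, max."""
    q1, q2, q3 = statistics.quantiles(values, n=4)  # default 'exclusive' method
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "std": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "25%": q1,
        "50%": q2,
        "75%": q3,
        "max": max(values),
    }
```

Summaries like this make the heavy right skew of variables such as WorkingSpeed (75th percentile 40.0 versus a maximum of 46,000.0) immediately visible, flagging outliers for the preprocessing stage.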
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Nacar, E.N.; Erdebilli, B.; Eraslan, E. Toward Green Manufacturing: A Heuristic Hybrid Machine Learning Framework with PSO for Scrap Reduction. Sustainability 2025, 17, 9106. https://doi.org/10.3390/su17209106
