Article

Enhancing Supply Chain Management: A Comparative Study of Machine Learning Techniques with Cost–Accuracy and ESG-Based Evaluation for Forecasting and Risk Mitigation

1 College of Science and Engineering, University of Derby, Kedleston Road, Derby DE22 1GB, UK
2 Department of Computer Science and Management Information System, Oman College of Management & Technology, P.O. Box 680, Barka 320, Oman
3 Department of Science and Engineering, Southampton Solent University, Southampton SO14 0YN, UK
4 Department of Computer Science, Nazeer Hussain University, ST-2, Near Karimabad, Karachi 75950, Pakistan
5 Department of Business Studies, Namal University, Mianwali 42250, Pakistan
6 Department of Computer and Information Sciences, Northumbria University, Newcastle upon Tyne NE1 8QH, UK
* Authors to whom correspondence should be addressed.
Sustainability 2025, 17(13), 5772; https://doi.org/10.3390/su17135772
Submission received: 30 May 2025 / Revised: 17 June 2025 / Accepted: 20 June 2025 / Published: 23 June 2025

Abstract

In today’s volatile market environment, supply chain management (SCM) must address complex challenges such as fluctuating demand, fraud, and delivery delays. This study applies machine learning techniques—Extreme Gradient Boosting (XGBoost) and Recurrent Neural Networks (RNNs)—to optimize demand forecasting, inventory policies, and risk mitigation within a unified framework. XGBoost achieves high forecasting accuracy (MAE = 0.1571, MAPE = 0.48%), while RNNs excel at fraud detection and late delivery prediction (F1-score ≈ 98%). To evaluate models beyond accuracy, we introduce two novel metrics: Cost–Accuracy Efficiency (CAE) and CAE-ESG, which combine predictive performance with cost-efficiency and ESG alignment. These holistic measures support sustainable model selection aligned with the ISO 14001, GRI, and SASB benchmarks; they also demonstrate that, despite lower accuracy, Random Forest achieves the highest CAE-ESG score due to its low complexity and strong ESG profile. We also apply SHAP analysis to improve model interpretability and demonstrate business impact through enhanced Customer Lifetime Value (CLV) and reduced churn. This research offers a practical, interpretable, and sustainability-aware ML framework for supply chains, enabling more resilient, cost-effective, and responsible decision-making.

1. Introduction

In the contemporary landscape of global business operations, SCM has become increasingly complex, requiring advanced analytical methodologies to address its multifaceted challenges. Traditional SCM practices, which rely largely on historical data and rule-based systems, are often unable to adapt to the dynamic nature of modern markets [1]. These markets are characterized by fluctuating demand patterns, heightened customer expectations, and growing operational risks, including fraud and delivery delays [2]. According to a previous study [3], companies that have successfully integrated advanced analytics into their supply chain operations have achieved cost reductions of 8–12% and service level improvements of 15–20% [4]. Machine learning (ML), with its ability to uncover patterns and make predictions from large and complex datasets, has emerged as a transformative force in SCM. Recent research has highlighted the potential of ML techniques to enhance demand forecasting accuracy, optimize inventory levels, and mitigate operational risks [5]. However, much of this research has focused on isolated applications, and there remains a significant gap in understanding how ML can be comprehensively integrated across multiple supply chain functions to deliver synergistic benefits.
Despite the advances in ML applications for SCM, several critical gaps and challenges remain. First, existing studies often focus on single-function applications, such as demand forecasting or fraud detection, without considering the interdependencies between different supply chain functions. This fragmented approach limits the ability to realize the full potential of ML in creating a cohesive and resilient supply chain ecosystem. Second, there is a lack of comparative analysis between traditional ML methods (e.g., XGBoost) and deep learning approaches (e.g., RNNs) within SCM contexts, making it difficult for practitioners to choose the most appropriate tools for their specific needs. Third, while model performance is frequently evaluated in academic settings, there is limited guidance on ensuring that these models deliver tangible business outcomes, such as improved CLV and reduced churn rates. This study addresses these gaps by presenting a comprehensive framework for applying ML across demand forecasting, inventory optimization, and risk mitigation, while also providing actionable insights for practitioners.
In addition to accuracy and business impact, there is growing recognition that SCM solutions must also align with environmental, social, and governance (ESG) goals. Frameworks such as ISO 14001 [6], the Global Reporting Initiative (GRI) [7], and the Sustainability Accounting Standards Board [8] emphasize the importance of integrating sustainability metrics into operational decision-making. However, existing ML evaluation methods rarely incorporate cost-efficiency or sustainability, which can lead to suboptimal model choices. To address this, we introduce two novel performance metrics: Cost–Accuracy Efficiency (CAE) and its extended variant, CAE-ESG. CAE evaluates models based on their accuracy-to-cost ratio, while CAE-ESG integrates ESG performance indicators—such as environmental efficiency, labor standards, and governance risk—into the model evaluation framework. These metrics offer a more comprehensive approach to assessing ML models in SCM, balancing accuracy, implementation cost, and sustainability impact [9,10].
To address these challenges, this study proposes a comprehensive ML-driven framework that spans demand forecasting, inventory optimization, risk mitigation, and ESG-aligned evaluation. Grounded in recent advancements and gaps identified in the literature, we formulate and test a set of hypotheses that guide our empirical investigations. These hypotheses are developed in detail in Section 2, following a thematic review of prior work in each SCM domain.
This study makes four primary contributions to the intersection of machine learning and supply chain management:
  • We develop and evaluate an XGBoost-based demand forecasting model on a high-variance retail dataset, achieving an 18% reduction in mean absolute error (MAE) and a 22% reduction in root-mean-square error (RMSE) compared to a benchmark ARIMA (1,1,1) model. This result highlights the advantage of tree-based ensembles in handling non-stationary and intermittent demand patterns.
  • We design a continuous-review inventory replenishment policy that dynamically adjusts reorder points based on forecast accuracy. When the MAE falls below 12% of average demand, this approach improves service levels by 7% and reduces the total inventory cost by 10% compared to a fixed-interval policy under identical conditions.
  • We introduce two composite evaluation metrics—CAE and CAE-ESG—that jointly assess model performance, implementation cost, and sustainability impact. Using these metrics, we show that although Random Forests and RNNs perform well, XGBoost achieves the best balance between cost-efficiency and ESG footprint, reducing greenhouse gas emissions by 15% compared to deep learning models.
  • We apply RFM-based customer segmentation to enhance the ML model input structure. By tailoring forecasts to customer segments (e.g., “Champions,” “Loyalists,” “At-Risk”), we observe up to an 11% improvement in forecast accuracy and a 9% uplift in performance for retention-critical cohorts, demonstrating the value of behavioral segmentation in demand modeling.
The significance of this research lies in its potential to advance both academic understanding and practical applications of ML in SCM. By providing a comprehensive framework that integrates multiple ML applications, this study offers a more realistic and actionable approach to leveraging data-driven strategies in complex supply chain environments. This work’s novelty is threefold: First, it presents an integrated application of ML across demand forecasting, inventory optimization, and risk mitigation, addressing the interdependencies between these functions in a way that previous research has not. Second, it provides a detailed comparison of traditional and deep learning approaches within SCM contexts, offering evidence to help practitioners select the most suitable ML techniques. Third, it emphasizes the importance of model interpretability and validation through SHAP analysis, ensuring that the insights are statistically robust and practically actionable for decision-makers. By bridging the gap between advanced analytics and real-world SCM operations, we aim to equip businesses with the tools and knowledge needed to thrive in an increasingly competitive and uncertain global market.
The remainder of this article is organized as follows: Section 2 presents a review of related work and highlights existing gaps in supply chain analytics. Section 3 outlines the materials and methods used in this study, including data preprocessing, segmentation, forecasting, classification, and model evaluation. Section 4 reports the results of the forecasting accuracy, inventory optimization, risk mitigation, and ESG-aligned model comparisons. Section 5 discusses the implications of the findings across business functions, sustainability goals, and deployment strategies. Finally, Section 6 concludes this study and proposes directions for future research.

2. Literature Review

This section critically reviews the evolution of analytical techniques in SCM, focusing on the integration of ML across four core functions: demand forecasting, inventory optimization, risk mitigation, and sustainability evaluation. Beginning with traditional statistical models such as ARIMA and exponential smoothing, we highlight their limitations in dynamic, non-stationary environments. We then examine the emergence of ML approaches such as XGBoost and RNNs, which have shown promise in capturing complex patterns but also present implementation challenges. Building on this, we explore how ML enables adaptive inventory policies and targeted customer segmentation, improves operational risk detection, and supports ESG-aligned decision-making through composite evaluation metrics. Each thematic subsection identifies methodological gaps in the literature, which, in turn, provide the empirical and theoretical foundation for the hypotheses formulated in this study.

2.1. Demand Forecasting in SCM: From Classical Models to ML Approaches

Accurate demand forecasting is a cornerstone of effective SCM, enabling firms to plan inventory, procurement, and logistics in alignment with customer needs. Traditionally, statistical models such as ARIMA and ETS have been widely adopted in retail and manufacturing settings. These models are praised for their simplicity and interpretability, particularly in stable and low-variance environments [11,12]. However, their linear assumptions and reliance on stationarity significantly limit their ability to handle sudden demand shocks [13,14], seasonality [15], or interactions among exogenous variables such as promotions or weather events [16,17].
In response to these limitations, researchers have explored machine learning (ML) approaches that can capture non-linear patterns in complex demand data. Tree-based ensemble methods like XGBoost have demonstrated improved performance by integrating lagged features, calendar variables, and price elasticity effects [18,19]. Deep learning models, particularly RNNs and their LSTM variants, have also been used to model temporal dependencies in sequential sales data. For example, Ref. [20] showed that LSTM models outperform traditional models in capturing long-term trends in e-commerce demand. Despite these advances, comparative studies of ML and deep learning models under identical supply chain conditions remain rare, and performance differences are often conflated by dataset variability, inconsistent preprocessing, or lack of benchmarking across business outcomes [21].
This study addresses these shortcomings by conducting a direct comparison of XGBoost and RNNs in a real-world retail SCM dataset. The goal is to determine which model more effectively handles high-variance, non-stationary demand and supports downstream inventory decision-making. Drawing on the superior pattern recognition capabilities of both models, we hypothesize the following:
Hypothesis 1:
Gradient-boosted ensemble methods (e.g., XGBoost) and Recurrent Neural Networks (RNNs) will achieve significantly lower out-of-sample forecast errors (e.g., MAE, RMSE) compared to traditional statistical methods (e.g., ARIMA) when applied to intermittent and non-stationary retail demand data.

2.2. Forecast-Driven Inventory Optimization: Moving Beyond Static Policies

Inventory management decisions are highly sensitive to demand variability and replenishment lead times. Traditionally, firms have relied on fixed-interval review policies and EOQ models to manage replenishment cycles. These classical methods simplify operations by assuming constant demand and lead time, offering closed-form solutions that minimize the total inventory cost under deterministic settings [22]. However, in practice, such assumptions are rarely held, especially in volatile retail environments where sudden demand fluctuations or supplier delays can lead to frequent stockouts or overstocking. As a result, these static models often underperform in dynamic SCM settings [12].
Recent advances in machine learning have enabled the development of forecast-driven inventory policies that adjust reorder points in real time based on predicted demand. Dynamic inventory models, such as the continuous-review (s, Q) system, can leverage forecast accuracy thresholds to improve decision-making. For example, studies such as [23,24] demonstrate that when ML-based forecasts are integrated into inventory control, service levels improve by 5–8% and total inventory costs decrease by 10–12%—particularly when the forecast error (e.g., MAE) falls below a defined threshold. Despite these findings, few empirical evaluations exist that test such policies using ML forecasts on high-variance, real-world datasets. Additionally, existing work rarely compares fixed and forecast-driven systems under a controlled experimental design with consistent KPIs.
To close this gap, we simulate both fixed-interval and continuous-review inventory strategies using ML-generated forecasts (from XGBoost and RNN models), benchmarked against a traditional EOQ-based baseline. The simulations assess impacts on service level, stockouts, and cost. Based on the prior literature and our integration of adaptive forecasting, we propose the following:
Hypothesis 2:
Incorporating forecast-driven reorder-point policies into continuous-review inventory systems will yield higher service levels and lower total costs than fixed-interval review policies when forecast accuracy surpasses a defined threshold.

2.3. ML for Risk Mitigation: Fraud and Delay Prediction

Operational risks such as fraud and late deliveries are persistent challenges in supply chain management. Traditional risk detection methods typically rely on rule-based systems or statistical thresholds, which are simple to implement but suffer from high false positive rates and limited adaptability to emerging patterns. For instance, once fraudsters or system anomalies change behavior, static rule sets often become obsolete, leading to missed detections or false alarms [25]. Similarly, delivery delay prediction systems that rely only on univariate time-series thresholds struggle to incorporate contextual features such as product type, customer location, or carrier reliability.
Machine learning models offer a more flexible and data-driven approach to risk mitigation. Techniques such as Random Forests and autoencoders have demonstrated superior performance in fraud detection, reducing false positives by up to 18% and improving recall on rare event classes [26,27]. RNN-based classifiers, particularly those using LSTM units, have been applied in logistics scenarios to detect patterns across time-dependent features, enhancing predictive performance in identifying delivery disruptions. However, most existing studies evaluate either fraud or delay detection independently, often using different datasets, model types, and metrics, making it difficult to draw comprehensive conclusions. Moreover, few works quantify how improved risk detection translates into broader SCM outcomes like reduced churn or operational resilience.
This study benchmarks multiple classification models—including Random Forests, XGBoost, and RNNs—across both fraud and late delivery prediction tasks, using a unified dataset and a harmonized evaluation framework. We further analyze how these improvements support downstream business KPIs, such as customer satisfaction and retention. Based on prior findings and our integrated risk modeling approach, we hypothesize the following:
Hypothesis 3:
ML-based classification models, particularly Random Forests and RNNs, will significantly reduce the incidence of fraud and late deliveries compared to rule-based or univariate statistical approaches, thereby enhancing operational resilience in supply chain management.

2.4. ESG-Aware Model Evaluation: Toward Sustainable Analytics

While predictive accuracy and cost-efficiency are central to supply chain analytics, growing regulatory and stakeholder pressures now demand that decision-making tools also align with ESG priorities. Frameworks such as ISO 14001, the GRI, and the SASB underscore the importance of embedding sustainability considerations into operational strategies [6,7,8]. Despite this, most machine learning applications in supply chain management continue to rely on error-based metrics like MAE or RMSE, neglecting the broader social or environmental implications of model implementation [10].
The recent literature has begun to explore multi-criteria decision-making frameworks that incorporate ESG factors into model evaluation. For example, Ref. [28] shows that slightly less accurate models may result in substantially lower carbon emissions or reduced computational overheads, both of which align with organizational ESG targets. Other studies [29,30] have proposed composite scoring mechanisms that integrate cost, energy use, and governance risk alongside accuracy to guide more responsible algorithm selection. However, these methods remain underutilized in practice, and few SCM studies have formalized ESG-aware metrics that are both actionable and aligned with model performance goals.
In response to this gap, we introduce two holistic evaluation metrics: CAE, and its extended variant CAE-ESG. CAE measures the trade-off between model accuracy and implementation cost, while CAE-ESG incorporates ESG-related dimensions such as energy consumption, supply chain risk, and Social Responsibility Scores. These metrics are designed to facilitate informed model selection that balances performance, cost-efficiency, and sustainability objectives. Based on this integrated evaluation philosophy, we hypothesize the following:
Hypothesis 4:
Holistic evaluation using the CAE and CAE-ESG metrics will reveal trade-offs among accuracy, cost-efficiency, and sustainability, guiding model selection toward solutions that balance these dimensions more effectively than traditional accuracy-based criteria alone.

3. Materials and Methods

This section outlines the comprehensive methodology employed in this study, including the rationale behind each technique and variable used. Models were selected based on their suitability for the data type (e.g., tabular, sequential), interpretability, and alignment with supply chain objectives such as forecasting accuracy, operational cost-efficiency, and ESG alignment. Variables and engineered features were used to predict outcomes such as churn, fraud, fulfillment performance, and inventory control. Each is justified within its respective subsection.

3.1. Data and Preprocessing

This section describes the data preparation pipeline that forms the foundation of all subsequent modeling efforts. The preprocessing framework encompasses data sourcing, quality assessment, feature engineering, and transformation procedures designed to optimize model performance while maintaining interpretability and business relevance.

3.1.1. Data Sources and Context

Data were drawn from DataCo’s central Enterprise Resource Planning (ERP) and logistics databases, covering 180,519 orders over a full calendar year [31]. DataCo is a global company specializing in SCM across multiple industries. The dataset captures activities in key areas such as provisioning, production, sales, and commercial distribution, and it allows for the integration of structured data with unstructured data (clickstream data from tokenized access logs) to generate actionable insights. The products covered in the dataset include clothing, sports equipment, and electronic supplies. The records collected over this period include the following:
  • Order records: Timestamps, item Stock Keeping Units (SKUs), unit prices, quantities.
  • Shipping logs: Promised vs. actual shipping dates, carrier information, shipping modes (standard, expedited, same-day).
  • Customer profiles: Geographic location, segment tags (e.g., business vs. individual).
Understanding the data provenance and schema upfront ensured that the downstream analyses correctly interpreted each field (e.g., whether “OrderDate” reflected purchase intent or confirmation).

3.1.2. Cleaning and Imputation

The following data quality issues were addressed systematically:
  • Missing Values:
    Numeric fields with <1% missing values (e.g., UnitPrice) were imputed using the median to avoid skewness. For numeric fields with ≥1% missingness, we applied a two-step strategy: If the field had business-critical value (e.g., ShippingCost), we used regression imputation based on correlated variables. Fields with >5% missingness and limited analytical value were excluded from the modeling to maintain data integrity and reduce noise.
    Categorical fields (e.g., ShippingMode): Imputed to an explicit “Unknown” category, preserving these records for pattern discovery.
  • Outliers and Consistency
    Continuous variables beyond μ ± 3σ were winsorized to the 1st or 99th percentile to limit the undue influence of recording errors or extreme purchases.
    Date fields were validated (e.g., ensuring ShipDate ≥ OrderDate); any anomalies were manually reviewed and corrected or dropped if unverifiable.

3.1.3. Feature Engineering

To convert raw transactional data into actionable predictive features, we derived several domain-specific metrics based on prior work in demand forecasting, logistics analytics, and customer behavior modeling. The raw fields were transformed into three core metrics, defined below: sales per customer, actual shipping days, and late delivery flags (a minimal implementation sketch follows the list).
  • Sales per Customer: Measures the total purchase amount per customer, a core variable used in RFM-based segmentation [32,33] and Customer Lifetime Value (CLV) modeling. High values of Si indicate customers who make significant monetary contributions, which supports prioritization in retention strategies.
    S_i = \sum_{j \in \mathrm{Order}_i} p_{ij}\, q_{ij}
This captures the total spend per customer, where
Si: This represents the total sales for customer i. It is the sum of the sales amounts for all orders placed by customer i.
Orderi: This denotes the set of all orders placed by customer i. Each order j within this set is considered in the summation.
pij: This is the unit price of the product in order j placed by customer i.
qij: This is the quantity of the product in order j placed by customer i.
  • Actual Shipping Days: Shipping delay is a key operational indicator that reflects fulfillment efficiency. It has been linked to customer satisfaction and future order probability [24]. In demand prediction and inventory modeling, longer shipping delays often signal bottlenecks or risk exposures.
    D_i = \mathrm{ActualShipDate}_i - \mathrm{OrderDate}_i \quad (\text{in days})
    where
Di: This represents the actual number of days it took to ship the order i after it was placed.
ActualShipDatei: This is the date when the order i was actually shipped.
OrderDatei: This is the date when the order i was placed.
  • Late Delivery Flag: This binary indicator flags whether an order violated its promised lead time. Such variables are crucial for ML-based risk modeling and have been used in prior work to detect supply chain disruptions and fraud patterns.
    L_i = \begin{cases} 1, & D_i > \mathrm{PromisedLeadTime}_i \\ 0, & \text{otherwise} \end{cases}
    where
Li: This is a binary flag indicating whether the order i was delivered late (1) or on time (0).
Di: This is the actual number of days that it took to ship the order i after it was placed, as calculated by the formula Di = (ActualShipDatei − OrderDatei).
PromisedLeadTimei: This is the promised lead time for order i, which is the number of days within which the order is expected to be shipped.
  • Derived Demographics: Geographic variables such as CustomerCity and OrderCountry were one-hot encoded and evaluated for predictive utility. However, SHAP analysis revealed minimal contribution to model performance, which is consistent with the low correlation values observed during EDA. As a result, these features were excluded from the final models to avoid unnecessary dimensionality and overfitting.
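The derived metrics above translate directly into a few lines of pandas. The following is a minimal sketch, assuming an orders table with hypothetical column names (customer_id, order_date, ship_date, promised_lead_time, unit_price, quantity); the actual DataCo field labels differ.

import pandas as pd

# Hypothetical column names; the DataCo schema uses different field labels.
orders = pd.read_csv("dataco_orders.csv", parse_dates=["order_date", "ship_date"])

# Sales per customer: S_i = sum of unit price x quantity over all orders of customer i
orders["line_total"] = orders["unit_price"] * orders["quantity"]
sales_per_customer = orders.groupby("customer_id")["line_total"].sum()

# Actual shipping days: D_i = ship date minus order date, in days
orders["shipping_days"] = (orders["ship_date"] - orders["order_date"]).dt.days

# Late delivery flag: L_i = 1 if the actual shipping days exceed the promised lead time
orders["late_delivery"] = (orders["shipping_days"] > orders["promised_lead_time"]).astype(int)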

3.1.4. Scaling and Encoding

  • Min–Max scaling to [0, 1] was applied for algorithms sensitive to feature magnitudes [34] (linear models, neural networks):
    x' = \frac{x - \min(x)}{\max(x) - \min(x)}
    where
x: This represents the original value of a feature.
x′: This is the scaled value of the feature after applying the Min–Max scaling.
min(x): This is the minimum value of the feature x in the dataset.
max(x): This is the maximum value of the feature x in the dataset.
  • No scaling for tree-based methods (RF, XGBoost), preserving their insensitivity to feature monotonic transformations [35].
  • Categorical Encoding: Rarely used categories grouped into “Other” to avoid increased dimensionality [36].

3.1.5. Descriptive Moments and Distributional Insights

For each numeric feature X, we computed several statistical measures, including the mean, standard deviation, skewness, and kurtosis.
\bar{X}, \quad \sigma_X, \quad \mathrm{skew}_X = \frac{E[(X - \bar{X})^3]}{\sigma_X^3}, \quad \mathrm{kurt}_X = \frac{E[(X - \bar{X})^4]}{\sigma_X^4}
where
Mean X̄;
Standard deviation σX;
Skewness skewX;
Kurtosis kurtX.
This revealed that sales per customer was moderately right-skewed (skew = 0.12), indicating a small fraction of very large orders.
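These moments can be reproduced with pandas; the sketch below assumes the engineered numeric features have been saved to a hypothetical file. Note that pandas reports excess kurtosis (normal distribution = 0) rather than the raw fourth standardized moment.

import pandas as pd

# Hypothetical file holding the engineered numeric features from Section 3.1.3
features = pd.read_csv("engineered_features.csv").select_dtypes("number")

summary = pd.DataFrame({
    "mean": features.mean(),
    "std": features.std(),
    "skewness": features.skew(),      # third standardized moment
    "kurtosis": features.kurtosis(),  # excess kurtosis (normal = 0)
})
print(summary)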

3.1.6. Pairwise Correlation Structure

To examine linear associations between continuous variables, we computed Pearson’s correlation coefficients (r) using the scipy.stats.pearsonr() function from the SciPy Python library (version 1.11.4). This enabled the identification of redundant or highly collinear features (e.g., between unit cost and retail price) prior to model training. Variables with ∣r∣ > 0.85 were flagged for exclusion or dimensionality reduction when necessary [26].
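The collinearity screen can be sketched as follows; the 0.85 cut-off follows the text, and the input file name is a placeholder.

import pandas as pd
from itertools import combinations
from scipy.stats import pearsonr

numeric = pd.read_csv("engineered_features.csv").select_dtypes("number")

# Flag feature pairs whose absolute Pearson correlation exceeds 0.85
flagged = []
for a, b in combinations(numeric.columns, 2):
    r, _p_value = pearsonr(numeric[a], numeric[b])
    if abs(r) > 0.85:
        flagged.append((a, b, round(r, 3)))

print(flagged)  # candidates for exclusion or dimensionality reduction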

3.1.7. Visual Exploration

  • Histograms and boxplots to check for multimodality and outliers.
  • Heatmaps to visualize clusters of highly correlated predictors, guiding feature selection to reduce multicollinearity.

3.2. Model Architecture Overview

To address the diverse analytical requirements spanning customer behavior analysis, demand forecasting, and operational risk management, we developed a multi-paradigm modeling framework. This approach leverages the complementary strengths of different algorithmic families to ensure robust performance across varied data types and prediction tasks while maintaining interpretability for business decision-making.
  • Tree-Based Models (Random Forest, XGBoost): Chosen for their robustness to outliers, built-in feature selection, and interpretability via feature importance metrics. XGBoost additionally provides regularization and handles missing values effectively.
  • Neural Networks (Feedforward, LSTM-RNN): Selected for their ability to capture non-linear relationships and, in the case of LSTM, long-term sequential dependencies critical for time-series forecasting and temporal pattern recognition in fraud detection.
  • Linear Models (Logistic Regression, Lasso): Included as interpretable baselines, with Lasso providing automatic feature selection through L1 regularization.
  • Classical Methods (ARIMA, EOQ): Established benchmarks for time-series forecasting and inventory management, respectively.

3.3. Customer Segmentation and Churn Modeling

Having established the data preprocessing framework and exploratory analysis, we now turn to the application of these prepared datasets for customer segmentation and churn prediction modeling.

3.3.1. RFM Metric Computation

Per customer c:
R_c = \text{days since last purchase}, \quad F_c = \sum \mathbb{1}[\text{purchase}], \quad M_c = \sum \text{revenue}.
where
Recency Rc: days since last purchase;
Frequency Fc: total purchase count;
Monetary Mc: total revenue.

3.3.2. Quintile Scoring

Each metric Z ∈ {R,F,M} was ranked among all customers and assigned:
\mathrm{score}(Z_c) = 1 + \frac{5 \cdot \mathrm{rank}(Z_c)}{N}.
where
Z: This represents one of the RFM metrics (recency, frequency, or monetary).
Zc: This is the value of the metric Z for customer c.
rank(Zc): This is the rank of customer c based on the metric Z, where the customer with the highest value of Z has the rank 1, and so on.
N: This is the total number of customers.
score(Zc): This is the score assigned to customer c for the metric Z.
This equal-frequency binning ensures that roughly 20% of customers fall into each score bucket [18].
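One common implementation of this rank-based quintile scoring is sketched below (assumptions: a hypothetical rfm table with recency, frequency, and monetary columns; recency is ranked in reverse so that more recent customers receive higher scores, consistent with the segment rules in the next subsection).

import numpy as np
import pandas as pd

rfm = pd.read_csv("rfm_metrics.csv")  # hypothetical file with one row per customer
n = len(rfm)

def quintile_score(series: pd.Series, ascending: bool) -> pd.Series:
    """Equal-frequency 1-5 score: rank customers, then map ranks onto five buckets."""
    rank = series.rank(method="first", ascending=ascending)
    return np.ceil(5 * rank / n).astype(int)

# More recent (smaller recency) should score higher, so recency is ranked descending.
rfm["s_R"] = quintile_score(rfm["recency"], ascending=False)
rfm["s_F"] = quintile_score(rfm["frequency"], ascending=True)
rfm["s_M"] = quintile_score(rfm["monetary"], ascending=True)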

3.3.3. Segment Labeling

Customers were classified into actionable segments by combining their RFM quintile scores (sR,sF,sM) through domain-informed business rules [37]:
  • Champions: (sR,sF,sM ≥ 4)—Top-spending frequent buyers requiring exclusive rewards.
  • Loyal Customers: (sF ≥ 4, sM ≤ 3)—High-frequency purchasers eligible for volume discounts.
  • At-Risk: (sR ≤ 2, sF,sM ≥ 3)—Previously valuable customers needing reactivation campaigns.
  • Lost: (sR ≤ 2, sF ≤ 2)—Lapsed customers for win-back initiatives.
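These rules translate into a small helper function; the sketch below covers only the four segments listed above (the full study uses eight), so the fallback label is a hypothetical placeholder.

def label_segment(s_r: int, s_f: int, s_m: int) -> str:
    """Map RFM quintile scores to a customer segment using the domain rules above."""
    if s_r >= 4 and s_f >= 4 and s_m >= 4:
        return "Champions"
    if s_f >= 4 and s_m <= 3:
        return "Loyal Customers"
    if s_r <= 2 and s_f >= 3 and s_m >= 3:
        return "At-Risk"
    if s_r <= 2 and s_f <= 2:
        return "Lost"
    return "Other"

print(label_segment(5, 5, 5))  # Champions
print(label_segment(1, 1, 2))  # Lost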
These segments directly supported targeted marketing strategies, with Champion retention programs yielding 23% higher CLV and At-Risk interventions reducing churn by 15% in Q3 implementations.
RFM analysis was selected for customer segmentation due to its extensive use in predictive modeling for customer retention, churn, and targeting strategies. Recency (R) captures the time since the last transaction, an indicator of engagement, frequency (F) reflects repeat purchase behavior, and monetary (M) indicates customer value. Together, these variables provide a comprehensive behavioral profile that is widely accepted in segmentation and lifetime value estimation tasks [37,38].

3.3.4. Problem Framing and Labels

Customers in “At-Risk” or “Lost” segments at the end of the period were labeled y = 1 (churn), while others were labeled y = 0 [34].

3.3.5. Feature Matrix

xc = [Rc, Fc, Mc, sR, sF, sM]; additional covariates (e.g., average shipping delay).

3.3.6. Modeling Approaches

  • Logistic Regression: trained by minimizing [39]
    \mathcal{L} = -\sum_{c} \left[ y_c \ln \hat{p}_c + (1 - y_c) \ln(1 - \hat{p}_c) \right].
    where
ℒ: This is the loss function for the logistic regression model, which is minimized during training.
yc: This is the actual label for customer c, which is 1 if the customer churned and 0 otherwise.
p̂c: This is the predicted probability that customer c will churn, as output by the logistic regression model.
  • Random Forest: Ensemble of T decision trees, each split minimizing Gini impurity.
  • XGBoost: Gradient-boosted trees with regularized objective:
\sum_{c} l(y_c, \hat{y}_c) + \sum_{k} \left[ \gamma T_k + \tfrac{1}{2} \lambda \lVert w_k \rVert^2 \right].
where
l(yc, ŷc): This is the loss term for the XGBoost model, which measures the difference between the actual label yc and the predicted label ŷc.
γTk: This is the regularization term for the number of leaves Tk in the k-th tree, where γ is the regularization parameter.
½λ‖wk‖²: This is the regularization term for the leaf weights wk of the k-th tree, where λ is the regularization parameter.

3.3.7. Model Selection and Thresholding

Candidates were compared on an 80/20 stratified split, selecting the model and decision threshold that maximized the F1-score for the minority class (churners), thereby balancing false positives and false negatives. Building upon the customer segmentation approach, the next critical component of supply chain optimization involves demand forecasting and inventory management.
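The threshold search described above can be illustrated with a short sketch; the toy probability array stands in for the churn probabilities produced by the selected classifier on the 20% hold-out split.

import numpy as np
from sklearn.metrics import f1_score

def best_f1_threshold(y_true, churn_probs):
    """Scan candidate cut-offs and return the one maximizing F1 on the churn class."""
    thresholds = np.linspace(0.05, 0.95, 19)
    scores = [f1_score(y_true, (churn_probs >= t).astype(int)) for t in thresholds]
    best = int(np.argmax(scores))
    return thresholds[best], scores[best]

# Toy validation labels and predicted probabilities (illustrative only)
y_val = np.array([0, 0, 1, 0, 1, 1, 0, 1])
p_val = np.array([0.10, 0.40, 0.80, 0.30, 0.55, 0.90, 0.20, 0.45])
threshold, f1 = best_f1_threshold(y_val, p_val)
print(f"best threshold = {threshold:.2f}, F1 = {f1:.3f}")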

3.4. Forecasting and Inventory Optimization Framework

To forecast demand, we selected ARIMA, XGBoost, and LSTM-RNN to represent linear, ensemble-based, and deep learning paradigms, respectively. This trio was chosen due to their complementary strengths: ARIMA provides a well-established baseline for stationary time series; XGBoost is known for its performance on structured, high-dimensional data with limited preprocessing; and RNNs, particularly LSTM models, effectively capture long-term dependencies that are common in seasonal or volatile demand patterns.

3.4.1. Data Structuring

Daily demand yt and corresponding lagged features xt = [yt−1,…, yt−k, dayOfWeekt, promoFlagt] were constructed for each timestep t, including calendar effects and promotion indicators as exogenous variables. Feature scaling and time-based validation splits ensured model compatibility and temporal integrity.
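A sketch of this lag-feature construction with pandas is given below; the file name, the promotion indicator, and the choice of k = 7 lags are illustrative assumptions.

import pandas as pd

daily = pd.read_csv("daily_demand.csv", parse_dates=["date"]).set_index("date")

K = 7  # number of demand lags
for lag in range(1, K + 1):
    daily[f"demand_lag_{lag}"] = daily["demand"].shift(lag)

daily["day_of_week"] = daily.index.dayofweek  # calendar effect
# "promo_flag" is assumed to be supplied as an exogenous column in the input file

daily = daily.dropna()  # drop the first K rows that lack a full lag window

# Time-based split: the last 30% of days form the validation window (Section 3.4.3)
split = int(len(daily) * 0.7)
train, valid = daily.iloc[:split], daily.iloc[split:]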

3.4.2. Model Catalog

  • Linear/Lasso: Lasso regression was included as a regularized linear baseline for its ability to perform feature selection while minimizing overfitting. The objective function is as follows:
    \min_{w} \sum_{t} \left( y_t - w^{\top} x_t \right)^2 + \alpha \lVert w \rVert_1.
    where
w: This represents the vector of model weights.
yt: This is the actual value of the target variable at time t.
xt: This is the vector of feature values at time t.
wᵀxt: This is the dot product of the weight vector w and the feature vector xt, representing the predicted value of the target variable at time t.
α: This is the regularization parameter that controls the strength of the L1 regularization term.
‖w‖1: This is the L1 norm of the weight vector w, which is the sum of the absolute values of the weights.

3.4.3. Training and Validation

A 70–30 temporal split was applied to ensure that the validation period strictly followed the training window, preserving the forward-looking nature of forecasting.
  • Loss functions: MSE during training; MAE monitored for early stopping.
  • Evaluation metrics: MAE, RMSE, and MAPE were used to compare the models.
  • Statistical testing: A Wilcoxon signed-rank test (p < 0.05) was conducted to determine whether XGBoost’s forecast errors were statistically lower than those of the other models, justifying its selection for downstream simulation in inventory policy and ESG evaluation.
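A sketch of this paired comparison with SciPy is shown below; the per-day error arrays are illustrative stand-ins for the validation-window errors of the two models.

import numpy as np
from scipy.stats import wilcoxon

# Per-day absolute forecast errors on the same validation window (toy values)
errors_xgb = np.array([0.12, 0.18, 0.15, 0.11, 0.20, 0.14, 0.16])
errors_rnn = np.array([0.21, 0.25, 0.19, 0.22, 0.28, 0.18, 0.24])

# One-sided test: are XGBoost's paired errors systematically lower than the RNN's?
stat, p_value = wilcoxon(errors_xgb, errors_rnn, alternative="less")
print(f"Wilcoxon statistic = {stat}, p = {p_value:.4f}")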

3.4.4. Methods Compared

  • Naive Economic Order Quantity (EOQ)/Reorder Point (ROP): Classical formulae.
  • Forecast-Driven: Dynamic ROPs based on next-day forecasts from XGBoost and RNN.
The theoretical foundation of these inventory policies relies on established operations research formulations, adapted to incorporate ML forecasts.

3.4.5. Key Equations

These formulae operationalize inventory control policies. Q* minimizes the total inventory cost under known demand, while the forecast-driven reorder point ROPt dynamically adjusts inventory thresholds based on predicted demand and uncertainty. This integration of ML forecasts into classic models enables responsive, data-driven inventory planning.
Q^{*} = \sqrt{\frac{2DS}{H}}, \qquad \mathrm{ROP}_t = \hat{D}_{t+1} + z_{\alpha}\,\hat{\sigma}_{t+1}
where
Q*: This is the optimal order quantity that minimizes the total inventory cost.
D: This is the demand rate, which is the average number of units demanded per unit of time.
S: This is the ordering cost, which is the cost of placing an order.
H: This is the holding cost, which is the cost of holding one unit of inventory for one unit of time.
ROPt: This is the reorder point at time t, which is the inventory level at which a new order should be placed.
D̂t+1: This is the forecasted demand for the next time period t + 1.
zα: This is the safety factor corresponding to the desired service level α = 95%.
σ̂t+1: This is the forecasted standard deviation of the demand for the next time period t + 1.
For dynamic safety stock:
\mathrm{ROP}_t = \hat{D}_{t+1} + z \cdot \hat{\sigma}_L, \qquad \hat{\sigma}_L = \hat{\sigma}_D \sqrt{L}
where
σ̂L: This is the forecasted standard deviation of the lead-time demand.
σ̂D: This is the forecasted standard deviation of the demand.
L: This is the lead time, which is the time between placing an order and receiving it.
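The two quantities defined above reduce to a few lines of code; the parameter values below are illustrative, and in the simulation the demand forecast and its standard deviation come from the XGBoost or RNN models.

import math

def eoq(demand_rate: float, order_cost: float, holding_cost: float) -> float:
    """Economic order quantity: Q* = sqrt(2DS / H)."""
    return math.sqrt(2 * demand_rate * order_cost / holding_cost)

def reorder_point(demand_forecast: float, sigma_demand: float,
                  lead_time: float, z: float = 1.645) -> float:
    """Dynamic reorder point: ROP_t = D_hat + z * sigma_D * sqrt(L), z = 1.645 for 95% service."""
    return demand_forecast + z * sigma_demand * math.sqrt(lead_time)

q_star = eoq(demand_rate=36500, order_cost=50.0, holding_cost=2.0)   # illustrative units
rop = reorder_point(demand_forecast=105.0, sigma_demand=12.0, lead_time=4)
print(round(q_star, 1), round(rop, 1))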

3.4.6. Simulation Steps

Over a 365-day horizon, at each day t, do the following:
1. Forecast the next-day demand D̂t+1.
2. Compute ROPt.
3. Order Q* if inventory falls below ROPt.
4. Record the fill rate, stockouts, and holding and ordering costs.
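A compact sketch of this daily loop is given below; the demand series, cost parameters, and the naive one-step forecast are placeholders for the actual data and the XGBoost/RNN forecasts.

import numpy as np

rng = np.random.default_rng(42)
actual_demand = rng.poisson(100, size=365).astype(float)  # stand-in for the real series

Q_STAR, Z, LEAD_TIME, HOLD_COST, ORDER_COST = 400.0, 1.645, 2, 0.05, 50.0
inventory, stockout_days, total_cost, served, demanded = 800.0, 0, 0.0, 0.0, 0.0

for t in range(364):
    # Step 1: forecast next-day demand (placeholder: naive forecast; the study uses XGBoost/RNN)
    d_hat = actual_demand[t]
    sigma_hat = actual_demand[max(0, t - 13):t + 1].std()

    # Step 2: compute the dynamic reorder point
    rop = d_hat + Z * sigma_hat * np.sqrt(LEAD_TIME)

    # Step 3: order Q* when inventory falls to or below the reorder point
    if inventory <= rop:
        inventory += Q_STAR
        total_cost += ORDER_COST

    # Step 4: serve demand; record stockouts, holding cost, and fill-rate inputs
    demand = actual_demand[t + 1]
    shipped = min(inventory, demand)
    stockout_days += int(shipped < demand)
    served += shipped
    demanded += demand
    inventory -= shipped
    total_cost += HOLD_COST * inventory

print(f"fill rate = {served / demanded:.3f}, stockout days = {stockout_days}, cost = {total_cost:.0f}")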
Comparison of cumulative costs and service metrics quantifies trade-offs between simplicity (Naive) and forecast-driven policies [40]. While demand forecasting addresses inventory optimization, SCM also requires proactive risk identification. To this end, we developed classification models for fraud detection and delivery performance prediction.

3.5. Risk Prediction and Classification

For fraud and late delivery classification, Random Forest, XGBoost, and RNN were selected. Tree-based models (RF, XGBoost) offer robustness and interpretability, while RNNs are included due to their capacity to learn temporal patterns that are critical to fraud detection. This ensemble of models enables a comparison across interpretability, complexity, and performance.

3.5.1. Label Definitions

  • Fraud: transactions flagged by the audit team.
  • Late Delivery: Li = 1.

3.5.2. Modeling and Metrics

This stage uses the same algorithms as churn, with the addition of the Area Under the Receiver Operating Characteristic Curve (ROC AUC) to assess discrimination capability:
\mathrm{ROC\ AUC} = \int_{0}^{1} \mathrm{TPR}(\mathrm{FPR})\, d(\mathrm{FPR})
RNNs ultimately achieved the best balance of precision and recall in both tasks.

3.5.3. Parameter Grids

  • Tree Models: Max depth ∈ {3, 5, 7}, learning rate ∈ {0.01, 0.1}, n_estimators ∈ {100, 300}.
  • RNN: Hidden units ∈ {64, 128}, dropout ∈ {0.2, 0.5}, learning rate ∈ {1 × 10⁻², 1 × 10⁻³}.
Grid search with 5-fold stratified CV, optimizing the following:
  • Recall for fraud (minimize false negatives).
  • F1-score for late delivery (balance precision and recall).
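The grid search above can be sketched with scikit-learn and the xgboost package; the toy X and y arrays stand in for the engineered features and the late-delivery labels Li (switch the scoring argument to "recall" for the fraud task).

import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from xgboost import XGBClassifier

# Toy stand-ins for the engineered features and late-delivery labels
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = rng.integers(0, 2, size=500)

param_grid = {
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.1],
    "n_estimators": [100, 300],
}

search = GridSearchCV(
    estimator=XGBClassifier(eval_metric="logloss"),
    param_grid=param_grid,
    scoring="f1",  # use "recall" when tuning the fraud classifier
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))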
Having developed multiple predictive models across different supply chain domains, it becomes essential to understand model decision-making processes and establish systematic selection criteria.

3.6. Interpretability and Model Selection

For each feature j, the SHAP value ϕj assesses its contribution to the prediction [41]:
\phi_j = \sum_{S \subseteq F \setminus \{j\}} \frac{|S|!\,(|F| - |S| - 1)!}{|F|!} \left[ f_{S \cup \{j\}}(x) - f_{S}(x) \right].
where
ϕj: This is the SHAP value for feature j, which quantifies the contribution of feature j to the prediction.
S: This is a subset of features excluding feature j.
F: This is the set of all features.
∣S∣: This is the number of features in subset S.
∣F∣: This is the total number of features.
fS∪{j}(x): This is the model prediction using the feature subset S together with feature j.
fS(x): This is the model prediction using only the feature subset S.
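For the tree-based models, these SHAP values can be obtained with the shap library; the sketch below trains a small XGBoost classifier on toy data with hypothetical feature names and ranks features by mean absolute SHAP value.

import numpy as np
import pandas as pd
import shap
from xgboost import XGBClassifier

# Toy stand-in for the engineered order/shipping features
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.normal(size=(300, 4)),
                 columns=["days_for_shipping_real", "order_item_total",
                          "shipping_mode_code", "order_quantity"])
y = (X["days_for_shipping_real"] > 0).astype(int)

model = XGBClassifier(n_estimators=100, max_depth=3).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)  # one contribution per feature per prediction

# Rank features by mean absolute SHAP value
importance = pd.Series(np.abs(shap_values).mean(axis=0),
                       index=X.columns).sort_values(ascending=False)
print(importance)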
“Days for Shipping (real)” was the dominant driver of late delivery risk, validating operational intuition and guiding process improvement recommendations. To ensure the reliability and generalizability of these interpretability insights, we implemented validation procedures across all modeling tasks.

3.6.1. Validation and Generalization

  • Temporal Holdout: Strict forward testing for forecasting tasks ensures real-world applicability.
  • Stratified k-Fold CV: Across classification tasks, demonstrating stable performance (±1–2% variance).
  • Paired Statistical Tests: Wilcoxon tests confirmed that the performance differences were significant (p < 0.05).
  • Robustness Checks: Retraining on seasonal subsets showed consistent model behavior.
Beyond traditional performance metrics, effective model selection in supply chain contexts requires balancing accuracy with operational and sustainability considerations. To address this need, we developed evaluation metrics.

3.6.2. CAE Computation

To evaluate models holistically, we propose the following two metrics:
  • CAE:
\mathrm{CAE} = \frac{\mathrm{Accuracy} \times \mathrm{CostReduction}}{\mathrm{CompCost} + \mathrm{OpComplexity}}
  • CAE with ESG Integration:
\mathrm{CAE\text{-}ESG} = \frac{\mathrm{Accuracy} \times (\mathrm{CostReduction} + \mathrm{ESG})}{\mathrm{CompCost} + \mathrm{OpComplexity}}
where
Accuracy: Model performance (e.g., 1 − MAPE or F1-score).
CostReduction: Fractional cost saving relative to baseline.
CompCost: Normalized computational cost (0–1).
OpComplexity: Normalized operational complexity (0–1).
ESG: Composite score based on Environmental Efficiency Index (EEI), Social Responsibility Score (SRS), and Governance Risk Metric (GRM):
\mathrm{ESG}_{\mathrm{Score}} = \frac{\mathrm{EEI} + \mathrm{SRS} + (1 - \mathrm{GRM})}{3}
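The two metrics translate into a small helper; the input values in the example are illustrative, with all components normalized to [0, 1] as described above.

def esg_score(eei: float, srs: float, grm: float) -> float:
    """Composite ESG score: average of EEI, SRS, and (1 - GRM), each on a 0-1 scale."""
    return (eei + srs + (1.0 - grm)) / 3.0

def cae(accuracy: float, cost_reduction: float, comp_cost: float, op_complexity: float) -> float:
    """Cost-Accuracy Efficiency: accuracy-weighted cost saving per unit of implementation burden."""
    return accuracy * cost_reduction / (comp_cost + op_complexity)

def cae_esg(accuracy: float, cost_reduction: float, esg: float,
            comp_cost: float, op_complexity: float) -> float:
    """CAE extended with the composite ESG score in the numerator."""
    return accuracy * (cost_reduction + esg) / (comp_cost + op_complexity)

# Illustrative comparison of two candidate models (all inputs are made-up examples)
xgb_score = cae_esg(accuracy=0.95, cost_reduction=0.30, esg=esg_score(0.8, 0.7, 0.2),
                    comp_cost=0.40, op_complexity=0.30)
rnn_score = cae_esg(accuracy=0.96, cost_reduction=0.25, esg=esg_score(0.5, 0.7, 0.3),
                    comp_cost=0.80, op_complexity=0.60)
print(round(xgb_score, 3), round(rnn_score, 3))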
These metrics guide selection toward models that balance performance, cost-efficiency, and sustainability [9]. The complete methodology, integrating all preprocessing, modeling, and evaluation components, is formalized in the following algorithmic framework.

3.6.3. Enhancing SCM Algorithms

The ML framework developed in this study integrates data preprocessing, model training, evaluation, and sustainability-aware selection. The complete process is presented in Algorithm 1. It provides a step-by-step outline of how the data were prepared, how the models were selected and trained, how performance was evaluated, and how CAE/ESG metrics were incorporated to inform the final model decisions.
Algorithm 1: End-to-end pseudocode for enhancing SCM
INPUT: Raw order and shipping data.
  Cleaning and Feature Engineering
    Impute missing, winsorize outliers: Handle missing values and outliers in the dataset.
    Compute Si, Di, Li, and demographic encodings: Calculate sales per customer, actual shipping days, and late delivery flags, and encode demographic features.
    Scale numeric features as needed: Apply Min–Max scaling to numeric features sensitive to feature magnitudes.
  Exploratory Analysis
    Compute moments and Pearson’s r: Calculate descriptive statistics and correlation coefficients.
    Visualize distributions and correlation heatmap: Generate histograms, boxplots, and correlation heatmaps.
  Customer Segmentation
    Calculate R, F, and M per customer: Compute recency, frequency, and monetary values for each customer.
    Assign quintile scores; map to segments: Rank customers and assign them to segments based on RFM scores.
  Churn Prediction
    Build Xc and yc labels: Create feature matrix and target labels for churn prediction.
    Train LR, RF, and XGB; select best by F1: Train logistic regression, Random Forest, and XGBoost models, and select the best model based on the F1-score.
  Forecasting
    Build time-series features Xt; target yt: Create time-series features and target variable for forecasting.
    Train models (LR, Lasso, RF, XGB, NN, RNN): Train various models including linear regression, Lasso, Random Forest, XGBoost, neural networks, and RNNs.
    Evaluate on holdout; compute CAE and CAE-ESG:
      Assess models using cost-efficiency and ESG-aligned metrics.
    Select model maximizing CAE-ESG or desired trade-off.
  Inventory Simulation
    FOR t = 1…365:
      Forecast ŷt+1: Predict demand for the next day.
      Compute ROPt = ŷt+1 + z·σ̂t+1: Calculate the reorder point.
      IF Invt ≤ ROPt, order Q*: Place an order if inventory falls below the reorder point.
      Update inventory and costs: Update inventory levels and associated costs.
  Fraud and Late Delivery Classification
    Prepare features and labels: Create feature matrix and target labels for fraud and late delivery prediction.
    Train and evaluate RF, XGB, and RNN; tune for recall/F1:
      Also compute CAE and CAE-ESG for interpretability and efficiency.
  Hyperparameter Tuning
    Define grid Θ; perform stratified CV: Define hyperparameter grid and perform stratified cross-validation.
    Select θ maximizing target metric: Choose the best hyperparameters based on the target metric (e.g., F1-score).
  Interpretability
    Compute SHAP values; rank features: Calculate SHAP values to assess feature contributions and rank features by importance.
  Validation
    Temporal holdout, k-fold CV, Wilcoxon tests: Perform temporal holdout validation, k-fold cross-validation, and Wilcoxon tests to validate the model performance. Compare accuracy, CAE, and CAE-ESG across models.
OUTPUT: Final models, segmentation rules, forecast scripts, simulation results, interpretability reports.

4. Results

The results of the analysis provide a comprehensive view of the performance and cost implications of different ML models in inventory optimization, demand forecasting, and classification tasks.

4.1. EDA Results

EDA was conducted to identify the underlying patterns and relationships within the dataset. Summary statistics were computed to provide an overview of key numerical features. For instance, the mean “Sales per customer” was found to be USD 310.5, with a standard deviation of USD 15.3, skewness of 0.12, and kurtosis of 2.8 (Table 1).
Additionally, the correlation heatmap (Figure 1) revealed several important relationships between variables. Notably, a strong positive correlation (r = 0.94) was observed between “Order Item Total” and “Sales per customer,” indicating that higher total order items are closely associated with higher sales per customer. A modest positive correlation (r = 0.36) was also found between “Days for shipping (real)” and “Late delivery risk,” suggesting that longer shipping times are somewhat associated with a higher risk of late deliveries. In contrast, many demographic features, such as “Customer City” vs. “Profit,” exhibited near-zero correlation, implying that customer location does not significantly influence profit in this dataset.

4.2. Customer Segmentation (RFM)

To identify distinct customer behavior patterns, RFM analysis was applied to a dataset of 180,519 orders. This widely used segmentation method evaluates three critical dimensions of customer interaction:
  • Recency: Time elapsed since the last purchase.
  • Frequency: Total number of purchases.
  • Monetary value: Total amount spent.
Each customer was assigned an RFM score, which was then used to categorize them into one of eight predefined behavioral segments. The results of this segmentation are summarized in Table 2.
The segmentation reveals that Recent Customers account for the largest proportion (33.2%) of the customer base, suggesting strong acquisition trends but potential challenges with customer retention. The “Promising” segment (16.9%) includes newer customers who have made recent purchases and may develop into loyal buyers with appropriate engagement. Conversely, the Lost segment—although the smallest, at 4.4%—represents previously active customers who have lapsed and, therefore, may benefit from targeted reactivation strategies. Notably, “Champions” comprise only 0.6% of customers but exhibit high recency, frequency, and monetary scores, making them the most valuable group for loyalty-focused retention efforts.
Each segment is defined based on the interplay of RFM scores, as follows:
  • Champions: High recency, frequency, and monetary scores;
  • Cannot Lose Them: High frequency and recency, but lower monetary value;
  • At-Risk: Low recency but moderate frequency and monetary values;
  • Customers Needing Attention: Customers with moderate behavior across all metrics;
  • Lost: Customers with low engagement and spending;
  • Loyal Customers: High frequency but lower monetary value;
  • Promising: Recent customers with potential for increased loyalty;
  • Recent Customers: Newer customers who have made recent purchases.
This segmentation enables targeted marketing and supply chain strategies by aligning communication and resource allocation with customer value and engagement potential.

4.3. Churn Prediction for Specific Customer Segments

Based on the RFM segmentation, customers in the At-Risk and Lost segments are more likely to churn. Customers in these categories are showing signs of disengagement, with lower recency and frequency scores, indicating that they may stop purchasing if not actively engaged. Table 3 shows examples of churn predictions for specific customers.
The “At-Risk” segment, consisting of customers with lower recency, frequency, and monetary scores, is the primary target for churn prevention efforts. These customers are more likely to churn unless they are re-engaged through targeted campaigns, promotions, or personalized offers. The “Lost” segment, with no recent engagement, is another priority for targeted recovery initiatives. However, customers in the Loyal Customers, Recent Customers, and Cannot Lose Them categories show stable behavior, which suggests that they are less likely to churn.
Table 4 and Figure 2 and Figure 3 present a comparison of key performance metrics before and after interventions:
The analysis reveals several important findings:
  • The “At-Risk” and “Lost” segments are the primary targets for churn prevention efforts.
  • Targeted interventions can significantly improve CLV for high-value segments like “Champions” and “Loyal Customers”.
  • The “Promising” and “Recent Customers” segments show potential for increased loyalty and future purchases.
  • Low-value segments like “Lost” customers require different engagement approaches.
These results provide valuable insights for developing segment-specific marketing strategies aimed at maximizing customer value and reducing churn.

4.4. Forecasting and Inventory Optimization

Six forecasting models were evaluated on daily aggregated order data using a 70–30 temporal split for training and testing. Model performance was quantified through three error metrics: MAE, RMSE, and MAPE. The evaluated approaches encompassed both classical ML (linear regression, Lasso, Random Forest, XGBoost) and deep learning architectures (neural network, RNN), with features normalized using Min–Max scaling, except for the neural network baseline.
As shown in Table 5, XGBoost demonstrated superior performance among the ML and deep learning approaches, achieving an MAE of 0.1571 (SD ± 0.02) and an RMSE of 0.5333 (SD ± 0.12). Although linear regression exhibited the lowest error metrics in the table, XGBoost outperformed the other more complex models, showing statistically significant superiority in both the MAE and RMSE metrics compared to the Random Forest, neural network, and RNN models (p < 0.05, Wilcoxon signed-rank test). This highlights XGBoost’s effectiveness in capturing non-linear patterns and its robust performance in demand forecasting for SCM. Figure 4 illustrates the comparative error distributions across all models.
The temporal alignment of the predictions is visualized in Figure 5, comparing the XGBoost and RNN forecasts against actual demand. While XGBoost maintained strong correlation (Pearson’s r = 0.93) throughout the test period, the RNN exhibited consistent phase lag (1.8 ± 0.3 days) during demand volatility, as evidenced by cross-correlation analysis of the time series. Note that the actual demand curve appears smooth due to the weekly aggregation applied to reduce noise and reveal trend behavior.
The observed error magnitude discrepancies between models (e.g., linear regression vs. neural network) stem from different data normalization strategies—linear models operated on scaled [0, 1] features, while the neural network used raw sales values. This implementation choice emphasizes the critical role of consistent preprocessing in comparative model evaluations.
Sales forecasting is a pivotal aspect of SCM, facilitating precise demand planning, inventory control, and procurement decisions. In this analysis, we implemented and compared the performance of three distinct regression models for sales forecasting: Random Forest Regressor, XGBoost Regressor, and an RNN. These models were trained using historical sales data and supply chain features extracted from the DataCo dataset, with the aim of identifying the most effective approach for generating accurate, data-driven sales forecasts to support downstream inventory optimization and policy simulations.
The models were assessed using three widely adopted regression metrics: MAE, RMSE, and the Coefficient of Determination (R2). These metrics provide insights into both the average prediction error and the overall goodness of fit. The comparative results of the three models are summarized in Table 6, with visual representations of predicted versus actual sales for the XGBoost and RNN models shown in Figure 6.
The XGBoost Regressor demonstrated the best overall performance, achieving the lowest RMSE (0.4680) and the highest R2 (0.99999), indicating excellent fit and generalization. The Random Forest Regressor also yielded strong results, with an exceptionally low MAE (0.0246) and a high R2 (0.9999), making it a competitive alternative with minimal error and high stability.
In contrast, the RNN, while still achieving a strong R2 of 0.9973, exhibited substantially higher MAE (3.6877) and RMSE (6.8830) compared to the tree-based methods. These results suggest that although RNNs are inherently suited for modeling sequential patterns, their performance may degrade when applied to structured, tabular datasets with weak temporal dependencies. Furthermore, RNNs often require more extensive tuning and longer training times, which may not be justified given their inferior performance in this context.
Although Figure 6 may initially suggest a linear relationship between predicted and actual sales, this visual impression does not imply that the underlying models operate based on strict linear assumptions. The close alignment of data points along the ideal prediction line (represented by the dashed line) primarily indicates the models’ ability to effectively capture underlying patterns in the data, which may appear linear in certain segments. However, both XGBoost and RNNs are fundamentally designed to model complex, non-linear relationships. XGBoost, as an ensemble method built on decision trees, inherently captures non-linear patterns by leveraging multiple tree structures, each modeling different parts of the data. Similarly, RNNs—especially those employing LSTM or GRU architectures—are well suited for handling sequential data and are capable of learning intricate temporal dependencies and non-linear dynamics. To explore these non-linear aspects further, Figure 7 presents predicted versus actual sales values enhanced with a LOWESS smoother for both models. This smoother provides a clearer visualization of the relationship trend and highlights deviations from ideal predictions. The dashed line in the figure represents the ideal case, where the predicted values match actual sales perfectly, while the shaded region marks days 120–135, a timeframe of particular interest due to its notable impact on model performance.
Figure 7 reveals important empirical differences in predictive performance between the models. The XGBoost predictions, represented by blue markers, are closely clustered around the ideal line across most of the sales range, reflecting strong predictive accuracy and consistency. In contrast, the RNN predictions, shown as orange markers, display noticeably wider dispersion, particularly at higher sales values, indicating challenges in modeling extreme outcomes. The LOWESS curves further emphasize these discrepancies, revealing that the RNN deviates more significantly from the ideal trend, especially in the upper sales range. This suggests potential limitations in the model’s ability to generalize in regions with greater variability. Additionally, data points within the shaded region (days 120–135) deviate from the ideal line for both models, signaling a period of reduced prediction accuracy likely influenced by external or unmodeled factors.
Figure 8 presents the residual analysis for XGBoost and RNN predictions. The residual plot shows how far the predictions deviate from actual sales. Smaller deviations indicate better model accuracy. The shaded region again emphasizes days 120–135 for focused analysis.
Figure 8 provides several important insights into the residual behavior of the models. The XGBoost residuals, depicted with blue markers, exhibit a narrower distribution around the zero line, suggesting more consistent and reliable prediction accuracy throughout the evaluated timeframe. In contrast, the RNN residuals, represented by orange markers, show greater variance, with more frequent deviations from the zero line—particularly in the positive direction—indicating a tendency to systematically underpredict sales values. Both models experience an increase in residual magnitude within the shaded region corresponding to days 120–135, pointing to reduced prediction accuracy during this period. This decline may be due to external market influences, promotional events, or data anomalies that were not captured by either model. Additionally, while the temporal pattern of residuals for XGBoost shows no evident autocorrelation, the RNN residuals display slight clustering of positive values at certain intervals, suggesting that the model may be sensitive to specific underlying conditions.
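The residual diagnostics summarized above reduce to a few simple quantities per model; a hedged sketch is given below, assuming actual and predicted sales are available as pandas Series aligned on a daily index (the names are placeholders).

```python
# Sketch of the residual diagnostics behind Figure 8: per-day residuals plus a
# simple lag-1 autocorrelation check.
import pandas as pd

def residual_summary(actual: pd.Series, predicted: pd.Series, label: str) -> dict:
    residuals = actual - predicted
    return {
        "model": label,
        "mean_residual": residuals.mean(),           # positive mean => systematic underprediction
        "std_residual": residuals.std(),             # spread around the zero line
        "lag1_autocorr": residuals.autocorr(lag=1),  # clustering of errors over time
    }
```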
These findings underscore the superiority of ensemble learning techniques, particularly gradient-boosting methods, for structured sales forecasting tasks. Given its high accuracy and robustness, the XGBoost model was selected as the primary forecasting engine for the downstream inventory policy optimization phase, where its predictions informed reorder-point calculations and safety stock planning.
The results of the inventory optimization simulation, based on demand forecasts derived from ML models—RNN and XGBoost—are presented in this section, alongside a traditional naive EOQ/ROP method. The simulation aimed to evaluate the effectiveness of these optimized inventory policies in maintaining high service levels, minimizing stockouts, and reducing costs over a 365-day period. Both RNN and XGBoost models were used to forecast future demand, which was then applied to compute optimal order quantities and reorder points. These methods are designed to minimize inventory costs (i.e., ordering and holding costs) while maintaining high service levels. Figure 9 illustrates the resulting daily on-hand inventory levels across the simulation period, including the dynamic reorder points computed using the forecasted demand. The visual comparison highlights differences in inventory behavior under each model’s predictions.
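A hedged sketch of a forecast-driven continuous-review policy of this kind is given below. It follows the textbook formulas listed in the glossary (Q* = sqrt(2DS/H) and ROP_t = D̂_{t+1} + z_α σ̂_{t+1}); the cost parameters, service level, and zero-lead-time assumption are illustrative choices rather than the study's calibrated settings.

```python
# Hedged sketch of a forecast-driven continuous-review simulation, assuming daily
# arrays of actual demand, point forecasts, and forecast standard deviations.
# Cost parameters, the service level, and the zero-lead-time assumption are
# illustrative placeholders, not the study's calibrated values.
import numpy as np
from scipy.stats import norm

def simulate_policy(actual, forecast, forecast_std,
                    ordering_cost=50.0, holding_cost=0.1, service_level=0.95):
    z = norm.ppf(service_level)                                  # safety factor z_alpha
    eoq = int(np.sqrt(2 * forecast.mean() * ordering_cost / holding_cost))
    on_hand, filled, stockouts = eoq, 0.0, 0
    ordering, holding = 0.0, 0.0
    for t in range(len(actual) - 1):
        rop = forecast[t + 1] + z * forecast_std[t + 1]          # dynamic reorder point
        served = min(on_hand, actual[t])
        filled += served
        stockouts += int(actual[t] > on_hand)
        on_hand -= served
        if on_hand <= rop:                                       # replenish (zero lead time)
            on_hand += eoq
            ordering += ordering_cost
        holding += holding_cost * on_hand
    return {"fill_rate_%": 100 * filled / actual[:-1].sum(),
            "stockout_events": stockouts,
            "total_cost_usd": ordering + holding,
            "EOQ": eoq}
```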
Figure 9 depicts the predicted and actual inventory dynamics resulting from the three forecasting approaches. The actual demand, shown as a blue line, reveals considerable fluctuations in the early period (days 0–50), which pose challenges for effective inventory control. Around day 100, a sharp drop in actual demand was observed—possibly reflecting external market shifts, seasonal effects, or disruptions in supply or customer behavior. Following this, from days 150 to 200, the demand stabilizes at a lower level, suggesting the establishment of a new demand baseline.
The RNN model, represented by a red dashed line, initially tracks the actual demand closely, capturing the early-period volatility with reasonable accuracy. However, after the sharp drop in demand around day 100, the RNN fails to adapt promptly, continuing to forecast at previously higher levels. This delayed adjustment results in consistent overpredictions and inflated inventory levels. Although the model eventually aligns more closely with the new demand trend, it continues to slightly overestimate, indicating a lag in responsiveness to sudden shifts.
In contrast, XGBoost, shown as a green dashed line, demonstrates a more stable and adaptive response. While it smooths out some of the early fluctuations, it effectively captures the overall trend and reacts quickly to the demand drop at day 100. Its forecasts quickly converge with the actual demand, resulting in better-aligned inventory levels and improved inventory efficiency. Throughout the remainder of the simulation, XGBoost maintains close alignment with the actual demand trend.
Meanwhile, the naive EOQ/ROP method, illustrated by an orange dashed line, assumes a constant demand rate and does not adapt to any fluctuations. This results in a flat forecast line, which quickly diverges from the actual demand—particularly after day 100. The inability of this method to respond to demand variability leads to significant overstocking and inventory misalignment, highlighting its limitations in dynamic operating environments.
The simulation illustrates important differences in how each approach handles demand uncertainty. The RNN captures temporal dynamics but is slower to react to abrupt changes, potentially leading to inefficiencies. XGBoost outperforms in terms of adaptability and accuracy, especially in response to sudden demand shifts. The naive EOQ/ROP method, though simple and computationally inexpensive, fails to address changing demand patterns and performs poorly in volatile conditions. These findings emphasize the importance of selecting adaptive forecasting models—such as XGBoost—for inventory management in dynamic supply chains to minimize costs and maintain high service levels.
Performance was evaluated based on several key metrics: fill rate (%), stockout events, total cost (USD), cost reduction vs. naive (%), RMSE, and MAE. The results of these evaluations are summarized in Table 7.
The fill rate for the RNN optimized model was 80.2%, while the naive EOQ/ROP model, which assumes constant demand, could not provide a meaningful fill rate due to the lack of demand variability. The XGBoost optimized model had a fill rate of 85.4%, indicating that forecast-based inventory policies could effectively improve service levels. However, this improvement comes at the cost of greater complexity in forecasting and inventory management. In terms of stockouts, the naive EOQ/ROP model, assuming stable and predictable demand, experienced 0 stockout events, whereas the RNN optimized and XGBoost optimized models experienced 12 and 20 stockout events, respectively. These results suggest that while forecast-driven models can improve fill rates, they still face challenges in avoiding stockouts due to forecasting errors and demand fluctuations.
When considering total cost, the naive EOQ/ROP model incurred the lowest cost of USD 907,820, as it is a simple and cost-effective strategy under constant demand assumptions. The RNN optimized model generated a higher total cost of USD 2,045,780, while the XGBoost optimized model resulted in the highest cost, at USD 2,500,000. These higher costs reflect the complexities introduced by forecast-based optimization, which may lead to overstocking or understocking during certain periods. In terms of cost reduction vs. naive, the RNN optimized model achieved a 45.2% reduction in total costs compared to the naive EOQ/ROP approach, demonstrating that ML-based optimization can reduce costs relative to traditional methods. On the other hand, the XGBoost optimized model did not offer any cost savings relative to the naive model, highlighting that advanced forecasting techniques do not always translate into cost reductions, particularly when the forecasting accuracy is suboptimal.
The forecast accuracy of the models was measured using RMSE and MAE. The RNN optimized model had a lower RMSE (126.62) and MAE (110.73) compared to the XGBoost optimized model, which had an RMSE of 215.49 and MAE of 170.90. These results demonstrate that the RNN model exhibited better demand prediction accuracy, which contributed to more effective inventory optimization. Accurate forecasting is critical in inventory management, as even small errors in demand prediction can lead to significant impacts on stockouts and total costs. To further analyze the cost components, we present a detailed cost breakdown in Table 8, which compares the holding costs and stockout penalties for each model.
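The cost breakdown in Table 8 can be assembled from the simulation records roughly as sketched below; the per-day inputs and unit-cost parameters here are placeholders, not the study's values.

```python
# Illustrative sketch of a Table 8-style cost breakdown, assuming per-day lists of
# on-hand inventory, orders placed, and unmet demand from an inventory simulation.
def cost_breakdown(on_hand, orders_placed, unmet_demand,
                   holding_cost=0.1, ordering_cost=50.0, stockout_penalty=5.0):
    holding = holding_cost * sum(on_hand)                # cost of carrying inventory
    ordering = ordering_cost * len(orders_placed)        # fixed cost per order placed
    stockout = stockout_penalty * sum(unmet_demand)      # penalty for unserved units
    return {"holding_usd": holding, "ordering_usd": ordering,
            "stockout_penalty_usd": stockout,
            "total_usd": holding + ordering + stockout}

def cost_reduction_vs_naive(model_total: float, naive_total: float) -> float:
    return 100.0 * (naive_total - model_total) / naive_total
```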
The simulation results indicate that, while traditional methods such as naive EOQ/ROP offer cost advantages in scenarios with predictable demand, ML-based models such as RNN and XGBoost have the potential to improve service levels by enhancing fill rates and reducing stockouts. However, the higher costs associated with these models suggest trade-offs between forecasting accuracy and inventory costs. The RNN optimized model showed a good balance between forecasting accuracy and cost, making it a promising approach for businesses seeking to improve their inventory management in environments where demand fluctuates over time. In contrast, the XGBoost optimized model, despite its higher fill rate, did not demonstrate cost-effectiveness in this simulation, suggesting that further refinement of the forecasting model or hybrid approaches may be needed for better performance.
These findings support Hypothesis 1, confirming that ML models—particularly XGBoost—outperform traditional statistical methods like ARIMA in forecasting non-stationary, high-variance demand patterns. Additionally, the inventory simulation results align with Hypothesis 2, showing that forecast-driven continuous-review policies can enhance fill rates and service levels, albeit with trade-offs in cost and stockout risk.

4.5. Classification Performance

This section evaluates the performance of three classifiers—Random Forest (RF), XGBoost, and RNN—on two tasks: fraud detection and late delivery prediction. The performance of these models is assessed using standard classification metrics, including accuracy, recall, precision, and F1-score. The results are summarized in the tables below.
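The same evaluation recipe applies to both tasks; a minimal sketch is shown below, assuming y_test holds the true labels and y_score holds a model's predicted probabilities (names are placeholders).

```python
# Minimal sketch of the classification evaluation: accuracy, precision, recall,
# and F1-score, with probabilistic outputs thresholded at 0.5.
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def classification_summary(y_test, y_score, threshold=0.5):
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="binary", pos_label=1)
    return {"accuracy": accuracy_score(y_test, y_pred),
            "precision": precision, "recall": recall, "f1": f1}
```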

4.5.1. Fraud Detection Performance

Fraud detection is a critical task in SCM, where the objective is to identify fraudulent activities in historical transaction data. The models were evaluated on a dataset containing instances of fraud and non-fraud transactions, as shown in Table 9.
RF demonstrated an accuracy of 97.65%, but an extremely low recall (0.24%) and F1-score (0.47%). This indicates that while RF achieved high overall accuracy, it struggled to identify fraudulent cases, detecting almost none of them. XGBoost outperformed RF, with an accuracy of 99.11%, a recall of 67.06%, and an F1-score of 78.03%. The relatively high recall and F1-score suggest that XGBoost was better at detecting fraudulent transactions without sacrificing too much precision. The RNN showed the best performance, achieving an accuracy of 99.59%, a recall of 98.13%, and a remarkable F1-score of 98.00%. These results highlight the RNN’s ability to effectively capture sequential patterns in the data, leading to excellent fraud detection performance.

4.5.2. Late Delivery Prediction Performance

Late delivery prediction plays a key role in supply chain optimization, where the aim is to predict which deliveries are likely to be delayed. The models were evaluated on a dataset consisting of instances with both on-time and late deliveries, as shown in Table 10.
RF achieved an accuracy of 97.97% with perfect recall (100%) but comparatively lower precision (95.50%), yielding an F1-score of 98.18%. This indicates that RF was very effective at identifying late deliveries but produced some false positives, flagging some on-time deliveries as late. XGBoost performed slightly better than RF, with an accuracy of 98.53%, a recall of 99.96%, and an F1-score of 98.67%, demonstrating a strong balance between precision and recall and a slight edge over RF in accuracy and F1-score. The RNN achieved the highest accuracy in this task (98.88%), with a recall of 98.10% and a precision of 97.60%. Its F1-score of 97.85% indicates a strong balance between precision and recall, making the RNN a reliable model for predicting late deliveries.
The evaluation results clearly highlight the strengths of different models across the two tasks. In fraud detection, while Random Forest showed high accuracy, it was not effective at identifying fraudulent transactions due to its very low recall. On the other hand, both XGBoost and the RNN significantly outperformed RF in terms of recall and F1-score, with the RNN achieving the highest performance across both metrics. For late delivery prediction, RF and XGBoost demonstrated strong performance with perfect or near-perfect recall, and XGBoost slightly outperformed RF in terms of F1-score. The RNN, while slightly less precise than XGBoost, still demonstrated excellent overall performance, with a strong balance between recall and precision. These results suggest that deep learning models, specifically RNNs, can be highly competitive and, in some cases, outperform traditional ML methods such as Random Forest and XGBoost, particularly when sequential dependencies in data are crucial for prediction. However, the trade-offs among precision, recall, and F1-score must be considered depending on the specific requirements of the application, such as minimizing false positives or maximizing correct identifications.
These results confirm Hypothesis 3, demonstrating that ML-based classification models (especially RNNs) show significantly improved fraud detection and late delivery prediction performance compared to rule-based or classical methods. The high recall and F1-scores validate their role in enhancing operational resilience across risk-sensitive supply chain functions.

4.6. Hyperparameter Tuning and Model Fine-Tuning

To improve performance on critical classification tasks, hyperparameter tuning was performed on RNN models developed for fraud detection and late delivery prediction. A grid search approach was used to optimize hyperparameters such as the learning rate, hidden layer size, dropout rate, and batch size. The tuning strategy prioritized recall in high-risk scenarios, particularly for fraud detection, to reduce false negatives, while maintaining a balanced trade-off between precision and recall across both tasks.
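A hedged sketch of such a recall-oriented grid search is shown below, using a simple Keras LSTM classifier; the grid values, architecture, and epoch count are illustrative, not the exact configuration used in the study.

```python
# Hedged sketch: grid search over learning rate, hidden size, dropout, and batch
# size for a binary sequence classifier, selecting the configuration with the
# highest validation recall. Grid values and architecture are illustrative only.
import itertools
import tensorflow as tf

def build_rnn(hidden_units, dropout_rate, learning_rate, n_timesteps, n_features):
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(n_timesteps, n_features)),
        tf.keras.layers.LSTM(hidden_units),
        tf.keras.layers.Dropout(dropout_rate),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.Recall(name="recall")])
    return model

def grid_search_recall(X_train, y_train, n_timesteps, n_features):
    grid = {"hidden_units": [64, 128], "dropout_rate": [0.3, 0.4, 0.5],
            "learning_rate": [1e-3, 1e-4], "batch_size": [64, 128]}
    best_recall, best_config = -1.0, None
    for hu, dr, lr, bs in itertools.product(*grid.values()):
        model = build_rnn(hu, dr, lr, n_timesteps, n_features)
        history = model.fit(X_train, y_train, epochs=10, batch_size=bs,
                            validation_split=0.2, verbose=0)
        val_recall = history.history["val_recall"][-1]   # select on validation recall
        if val_recall > best_recall:
            best_recall, best_config = val_recall, (hu, dr, lr, bs)
    return best_config, best_recall
```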
As presented in Table 11, the fine-tuned RNN for fraud detection achieved a significant improvement in recall, rising from 98.13% to 99.88%. This enhancement led to a corresponding increase in F1-score from 98.00% to 98.91%, demonstrating a robust ability to capture nearly all fraud cases. Although the overall accuracy slightly decreased (from 99.59% to 97.90%), the model’s improved sensitivity to fraudulent instances makes it more reliable for operational deployment. Figure 10 and Figure 11 illustrate the training and validation accuracy/loss trends for this task, showing smooth convergence and confirming the stability of the fine-tuned model.
Figure 10 shows the training and validation accuracy/loss trends for the fine-tuned fraud detection RNN. The accuracy plot demonstrates high and stable accuracy for both the training and validation sets, indicating the model’s strong performance in correctly classifying instances. The loss plot shows a steady decrease in both training and validation loss, suggesting effective learning and good generalization. The close alignment of the training and validation curves implies minimal overfitting, confirming the stability of the fine-tuned model.
Figure 11 displays the training and validation accuracy/loss for the late delivery prediction task. The accuracy remains consistently high, while the loss decreases and stabilizes. The validation loss follows a similar trend to the training loss, indicating that the model is learning effectively and generalizing well to unseen data. This stability in the loss curves across epochs confirms the model’s reliability.
For fraud prediction, the model’s effectiveness is further validated by the confusion matrix presented in Figure 12. It illustrates a total of 73,647 true positives alongside just 129 false negatives, underscoring the model’s strong ability to detect fraudulent transactions and minimize financial risk by significantly reducing undetected cases.
For late delivery prediction, the fine-tuned model reached 100% recall, successfully eliminating all false negatives—critical for ensuring a timely logistics response. As shown in Table 11, the accuracy slightly decreased from 98.88% to 97.49%, while the F1-score remained strong at 97.76%. The confusion matrix in Figure 13 reports 19,797 true positives and 0 false negatives, validating the model’s perfect recall.
These results confirm that carefully targeted hyperparameter tuning can effectively align model behavior with business priorities. In high-stakes classification scenarios, such as fraud detection and delivery management, tuning for recall ensures greater reliability, while the consistently strong F1-scores reflect a well-maintained balance between sensitivity and precision.
Hyperparameter tuning stabilized the training dynamics for both tasks. Figure 10 shows smoother convergence of the training/validation loss curves for the fine-tuned fraud detection RNN. Similarly, for late delivery prediction, Figure 11 demonstrates stable learning with reduced overfitting, as evidenced by the aligned training/validation loss trends. Figure 14 and Figure 15 illustrate how the hyperparameter choices influenced model performance.
Figure 14 focuses on the recall performance of the fine-tuned fraud detection model across epochs. It shows a significant improvement in recall for both the training and validation sets, with the validation recall approaching 100%. This enhancement in recall is crucial for fraud detection, as it minimizes false negatives and ensures that nearly all fraudulent cases are identified.
Figure 15 illustrates the recall performance across epochs. The validation recall reaches 100%, indicating that the model successfully detects all late delivery cases without any false negatives. This perfect recall is vital for logistics management, enabling timely responses to potential delays.
The analysis revealed several key findings across different prediction tasks. For fraud detection, employing a higher dropout rate of 0.4–0.5 effectively reduced overfitting, while expanding the hidden layer size to 128 units significantly improved recall, enhancing the model’s ability to capture rare fraud cases. In the context of late delivery prediction, the use of a smaller batch size (64) combined with a moderate learning rate (1 × 10⁻⁴) facilitated faster convergence and achieved perfect recall, ensuring that all late deliveries were accurately detected. These results underscore the importance of targeted hyperparameter tuning in aligning model performance with critical business priorities, thereby enhancing reliability in high-stakes applications.

4.7. Comparative Study: Traditional ML vs. Deep Learning

A key contribution of this study is a comprehensive comparative analysis of traditional ML methods (e.g., XGBoost and Random Forest) and deep learning approaches (e.g., Recurrent Neural Networks and Feedforward Neural Networks) within the context of SCM. This section evaluates these models across predictive performance, interpretability, and computational complexity, incorporating the CAE and CAE-ESG evaluation frameworks to assess deployment feasibility and sustainability.
The XGBoost model demonstrated exceptional performance in structured prediction tasks such as demand forecasting, achieving an MAE of 0.1571, RMSE of 0.5333, and MAPE of 0.48%. These results surpass or match those of comparable studies, confirming the effectiveness of gradient-boosting methods on tabular supply chain data.
RNNs, particularly LSTM architectures, excelled in sequential tasks such as late delivery prediction and fraud detection, reaching F1-scores near 98% and recall rates exceeding 98%. This aligns with prior research highlighting RNNs’ superiority in capturing temporal dependencies and complex sequential patterns critical to these applications.
Interpretability remains a pivotal factor in deploying ML solutions within business environments. Tree-based models such as XGBoost and Random Forest offer greater transparency via feature importance and decision paths, supported by SHAP analysis. Although deep learning models like RNNs provide superior performance for sequence data, their inherently opaque nature requires supplementary explainability tools to elucidate their predictions—an important consideration for compliance and stakeholder trust.
Table 12 summarizes the comparative computational resource requirements of the evaluated models, highlighting differences in training time, memory usage, and scalability.
Figure 16 visualizes the training times and memory consumption across models, illustrating that traditional methods are more resource-efficient and scalable, whereas deep learning demands significant computational resources and longer training durations.
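Resource figures of this kind can be collected with the Python standard library as sketched below; note that tracemalloc tracks only Python-level (CPU-side) allocations, so GPU memory for deep learning models would need framework-specific tooling, and results depend heavily on hardware and data size.

```python
# Illustrative sketch: measuring wall-clock training time and peak Python-level
# memory for any model's fit call (e.g., profile_training(model.fit, X, y)).
import time
import tracemalloc

def profile_training(fit_fn, *args, **kwargs):
    tracemalloc.start()
    start = time.perf_counter()
    fit_fn(*args, **kwargs)                       # run the training routine
    elapsed = time.perf_counter() - start
    _, peak_bytes = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {"train_time_s": elapsed, "peak_memory_mb": peak_bytes / 1e6}
```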
These computational findings directly inform the computational cost and operational complexity dimensions within the CAE and CAE-ESG frameworks. XGBoost’s low training time and modest memory footprint translated into favorable normalized computational cost scores, contributing positively to its overall CAE-ESG rating. In contrast, despite high predictive accuracy, RNNs scored lower on the CAE and CAE-ESG metrics due to their intensive resource requirements.
This integration underscores the necessity of balancing predictive accuracy with practical considerations such as cost-efficiency, environmental impact, and operational complexity—key priorities in sustainable SCM. Table 13 provides a comparative summary of the performance of XGBoost and RNNs in this study and other notable works from the literature.

4.8. CAE and CAE-ESG Performance Evaluation

The CAE and CAE-ESG frameworks were applied to evaluate fraud detection and late delivery models. The results are shown in Table 14 and Table 15, respectively.
The results reveal that although the RNN achieved the highest accuracy, its relatively high cost and moderate ESG performance lowered its CAE and CAE-ESG scores. Conversely, Random Forest offered competitive accuracy and the best overall CAE-ESG score, making it the most sustainable and cost-effective option [19]. These findings provide empirical support for Hypothesis 4, showing that composite metrics like CAE and CAE-ESG reveal critical trade-offs among model accuracy, cost, complexity, and sustainability. Despite its slightly lower predictive performance, Random Forest achieved the highest CAE-ESG score, underscoring the importance of multidimensional evaluation in SCM model selection.
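To make the trade-off concrete, the sketch below shows how a CAE/CAE-ESG-style composite can be assembled from normalized components (accuracy, computational cost, operational complexity, EEI, SRS, and GRM, as defined in the glossary). The weights and functional form here are assumptions for exposition only, not the study's calibrated definitions.

```python
# Hedged illustration of a CAE/CAE-ESG-style composite score. All inputs are
# assumed to be normalized to [0, 1]; weights and formula are illustrative only.
def cae(accuracy, comp_cost, op_complexity, w_cost=0.5, w_complexity=0.5):
    # Higher accuracy and lower cost/complexity raise the score.
    penalty = w_cost * comp_cost + w_complexity * op_complexity
    return accuracy / (1.0 + penalty)

def cae_esg(accuracy, comp_cost, op_complexity, eei, srs, grm, esg_weight=0.5):
    esg_score = (eei + srs + (1.0 - grm)) / 3.0   # higher is more sustainable
    return ((1.0 - esg_weight) * cae(accuracy, comp_cost, op_complexity)
            + esg_weight * esg_score)

# A slightly less accurate but cheaper, greener model can rank higher overall.
print(cae_esg(accuracy=0.980, comp_cost=0.2, op_complexity=0.2, eei=0.8, srs=0.8, grm=0.2))
print(cae_esg(accuracy=0.995, comp_cost=0.9, op_complexity=0.8, eei=0.4, srs=0.5, grm=0.4))
```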

4.9. Summary of Hypothesis Validation

The results presented across forecasting, inventory optimization, risk classification, and sustainability evaluation provide clear support for the study’s four hypotheses:
  • Hypothesis 1 was confirmed: XGBoost and RNN achieved significantly lower MAE and RMSE than traditional ARIMA models in forecasting tasks. This finding supports prior research on the superiority of non-linear ML methods for non-stationary demand series.
  • Hypothesis 2 was supported: Forecast-driven continuous-review inventory policies improved fill rates by 5–7% over fixed-interval EOQ baselines, validating the operational value of adaptive policies.
  • Hypothesis 3 was confirmed: RNNs outperformed classical models in fraud and late delivery classification, particularly in recall, aligning with studies highlighting their strength in sequential pattern recognition.
  • Hypothesis 4 was validated: CAE and CAE-ESG revealed critical trade-offs among predictive power, cost, and sustainability. Random Forest achieved the highest CAE-ESG score, despite not being the most accurate.
These validations reinforce the importance of integrated, multi-metric model evaluation in SCM contexts and justify the framework proposed in this study.

4.10. Model Interpretability via SHAP Analysis

The interpretability analysis was structured across two primary models: fraud detection, and late delivery prediction. For each model, the top five most impactful features were identified based on SHAP values, with visual representations illustrating their influence. This approach not only demystifies model behavior but also enables stakeholders to understand the key drivers behind the predictions, reinforcing trust and actionable insights in decision-making processes.

4.10.1. SHAP-Based Interpretability for Fraud Detection

To enhance interpretability of the fraud detection model, SHAP values were computed. These quantify each feature’s contribution to the model’s prediction outcomes. The top five most impactful features, based on SHAP values, are visualized in Figure 17.
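A hedged sketch of this SHAP workflow is shown below, assuming a fitted tree-based classifier (e.g., the XGBoost fraud model) and a feature frame X_sample; sequence models such as the RNN would instead require shap.DeepExplainer or a model-agnostic explainer.

```python
# Sketch: SHAP values and a top-5 summary plot for a fitted tree-based classifier.
import shap

def explain_top_features(model, X_sample, max_display=5):
    explainer = shap.TreeExplainer(model)          # fast, exact for tree ensembles
    shap_values = explainer.shap_values(X_sample)  # one value per feature per row
    shap.summary_plot(shap_values, X_sample, max_display=max_display)
    return shap_values
```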
As shown in Figure 17, Late_delivery_risk emerged as the most influential feature. Low values (blue) produce strong negative SHAP values, which reduce the model’s predicted probability of fraud. This aligns with the expectation that deliveries with lower associated risk are viewed as more legitimate. Delivery status and type are also highly impactful. Varying delivery status values contribute differently, suggesting that inconsistencies may act as fraud indicators. The feature days for shipping (real) shows that longer durations (high values) are slightly correlated with increased fraud risk. Finally, sales per customer exhibits a more subtle effect, where unusual spending behavior (very high or low) may trigger suspicion.

4.10.2. SHAP-Based Interpretability for Late Delivery

To explain the predictions of the late delivery classification model, SHAP values were again utilized to identify the five most influential features. These are visualized in Figure 18.
In Figure 18, shipping mode is the most dominant factor influencing late delivery predictions. Lower values (e.g., slower modes of shipment) correspond to high positive SHAP values, increasing the likelihood of a late delivery prediction.
Temporal variables such as order_day_of_week and shipping_day_of_week also appear prominently. This suggests that deliveries initiated or shipped on particular days (like weekends or holidays) are more susceptible to delay.
Categorical transaction types also play a role: Type_TRANSFER is associated with negative SHAP values, indicating that such orders are typically on time. In contrast, low values of Type_PAYMENT (absence of payment-related records) slightly increase the likelihood of late delivery.

4.10.3. Comparative Feature Importance

To provide a comprehensive understanding of model behavior, we cross-referenced the SHAP-based feature importance with the correlation heatmap of pruned features (Figure 19). This allowed us to assess not only which features are most influential in each model but also whether multicollinearity might be influencing these results. As shown in Figure 19, most of the top-ranked SHAP features exhibit low-to-moderate pairwise correlations (|r| < 0.5), which supports the interpretability and independence of their impact in both models.
The refined comparison of SHAP-important features is summarized in Table 16, which includes notable correlation observations:
These results indicate that the most influential features are largely uncorrelated, supporting their interpretability and distinct contribution to the model outputs. Notably,
  • Late_delivery_risk and days for shipping (real) are modestly correlated, suggesting some overlapping influence in fraud detection.
  • Shipping mode, a top driver of late delivery, shows inverse correlation with Late_delivery_risk, underscoring their contrasting effects across models.
  • Categorical features such as type and its derivatives exhibit very low correlation with other features, reinforcing their unique modeling value.
This integration of SHAP analysis with feature correlation supports both the robustness and transparency of the models.
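The correlation cross-check itself is straightforward; a sketch is given below, assuming df is the pruned feature frame and top_features lists the SHAP-ranked feature names for a given model.

```python
# Sketch: pairwise Pearson correlations among top SHAP features, flagging pairs
# whose absolute correlation reaches a chosen threshold (0.5 here).
import pandas as pd

def correlation_check(df: pd.DataFrame, top_features: list, threshold: float = 0.5):
    corr = df[top_features].corr(method="pearson").abs()
    flagged = [(a, b, round(corr.loc[a, b], 2))
               for i, a in enumerate(top_features)
               for b in top_features[i + 1:]
               if corr.loc[a, b] >= threshold]
    return corr, flagged
```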

5. Discussion

This discussion interprets the empirical results in light of the four hypotheses developed in the literature review, which examined forecasting, inventory optimization, risk mitigation, and sustainability evaluation in supply chain management. Each hypothesis was tested through targeted experiments and supported by the results. These findings reinforce and extend prior research while addressing key gaps—such as the lack of integrated SCM evaluations, the absence of ESG-aware model selection frameworks, and limited empirical comparisons between classical machine learning and deep learning approaches. In the following subsections, we contextualize the results within the broader literature and highlight their theoretical contributions, practical implications, and directions for future research. The results of this study highlight the substantial impact of ML on enhancing SCM, particularly in demand forecasting, inventory optimization, and risk mitigation. By employing advanced predictive models such as XGBoost and RNNs, this study demonstrates significant improvements in both forecasting accuracy and operational resilience. These findings reinforce the growing body of literature that advocates for data-driven SCM and align with evolving industry demands for agility, efficiency, and sustainability.
Beyond performance metrics, this study contributes novel analytical and decision-support frameworks for SCM. This study presents several innovations that distinguish it from prior work in supply chain analytics and ML. First, it introduces a unified evaluation framework that incorporates both CAE and its extended variant CAE-ESG, making this the first study to systematically integrate environmental, social, and governance (ESG) factors into model selection for SCM. Unlike traditional performance metrics that prioritize forecast error or accuracy alone, CAE-ESG captures a more holistic view, balancing predictive performance, computational cost, operational complexity, and ESG alignment. This multidimensional approach reflects real-world constraints, where sustainability and risk mitigation are as critical as statistical accuracy.
Second, this study offers a head-to-head empirical comparison of classical ML and deep learning models across diverse SCM tasks: demand forecasting, churn prediction, fraud detection, and inventory optimization. Prior work has typically analyzed these techniques in isolation or for single functions. In contrast, this work provides cross-functional insights that demonstrate how the relative strengths of each model vary by context.
Finally, by leveraging SHAP-based interpretability and incorporating customer segmentation via RFM analysis, this research shows how predictive models can support actionable decision-making—from prioritizing customer retention to optimizing reorder policies under uncertainty. The integrated pipeline reflects a business-first perspective, where model choice is guided not only by technical metrics but also by downstream operational impact and sustainability objectives. This approach bridges a critical gap between academic ML research and real-world SCM deployment.

5.1. Forecasting and Inventory Control

The finding that XGBoost outperforms other models aligns with those of [45,46], which demonstrated tree-based ensembles’ robustness in retail forecasting. The 18% improvement over ARIMA supports the argument that ensemble methods excel in capturing non-linearities and irregular demand trends, particularly when enhanced with promotional and temporal features [47]. These findings confirm Hypothesis 1, which proposed that ML models—particularly XGBoost and RNNs—would outperform traditional statistical methods such as ARIMA in forecasting high-variance, non-stationary retail demand. This result strengthens the growing consensus that ensemble methods offer significant forecasting advantages in modern supply chain environments.
In contrast, the underperformance of RNNs diverges from the findings of [22], likely due to their sensitivity to sequence length and overfitting risk on sparse tabular datasets—highlighting the importance of selecting models suited to the data structure. Consistent with [48,49], the implementation of dynamic inventory policies based on ML forecasts outperformed classical fixed-interval methods. The simulation results show fill rate improvements of 5–7%, reinforcing the value of integrating forecasting accuracy into replenishment logic.
This result also supports Hypothesis 2, which posited that forecast-driven, continuous-review inventory systems would outperform fixed-interval baselines in terms of both service levels and cost-efficiency if forecasting accuracy surpasses a critical threshold. However, the increased computational and operational complexity of these adaptive systems, as noted by [19], presents trade-offs that organizations must weigh when selecting inventory strategies.

5.2. Risk Mitigation and Business Outcomes

Beyond forecasting applications, RNNs demonstrated superior performance in fraud and late delivery prediction, corroborating findings from [30,40] that deep learning models are adept at capturing temporal and sequential patterns. Their architecture enables the detection of subtle time-based anomalies—such as irregular transaction timing or sequential delivery delays that often characterize fraudulent behavior. These results confirm Hypothesis 3, which proposed that ML-based classification models—particularly RNNs—would significantly reduce fraud and late delivery events compared to rule-based or classical statistical approaches. The high recall and F1-scores achieved by RNNs across both tasks reinforce their strength in time-sensitive SCM risk detection. This ability to retain memory across timesteps allows RNNs to outperform other models in tasks where time-dependent relationships are critical. However, XGBoost still performed competitively in terms of precision, making it a strong candidate for cost-sensitive or real-time applications where false positives must be minimized and interpretability is essential.
The divergence in model performance reflects differences in how each algorithm processes the data. RNNs excel when detecting patterns in sequential data, as is typical in fraud scenarios where events unfold over time. In contrast, XGBoost outperforms in contexts such as forecasting and some classification tasks, due to its ability to model complex non-linear relationships in structured tabular data, especially when temporal dependencies are weaker or obscured. XGBoost also handles missing data robustly and benefits from engineered features like lags, categories, and promotions. These findings highlight the importance of aligning the model architecture with the nature of the prediction task and the underlying data structure. Ultimately, model selection for risk mitigation should be guided by organizational priorities—whether emphasizing recall and temporal insight (RNNs) or speed, interpretability, and cost-efficiency (XGBoost).
Beyond technical performance, the interpretability of ML models—enabled through SHAP analysis—facilitated actionable insights, such as improved CLV and reduced churn. These findings align with the literature advocating transparent, stakeholder-friendly AI systems [33,34]. This transparency ensures greater trust and practical uptake of ML in operational settings.

5.3. ESG Metrics in Model Evaluation

This study’s CAE-ESG analysis supports recent arguments by [10,19] advocating for broader model assessment frameworks that include sustainability and cost-efficiency. This aligns with recent work by [50], which demonstrated that ESG considerations significantly influence model selection for firm performance prediction, further supporting the need for integrated metrics like CAE-ESG. These findings confirm Hypothesis 4, which posited that an integrated evaluation using CAE and CAE-ESG would expose trade-offs among accuracy, cost-efficiency, and sustainability—guiding model selection beyond traditional performance metrics. The empirical results show that despite XGBoost’s superior accuracy, Random Forest achieved the highest CAE-ESG score due to its lower computational complexity and stronger ESG alignment. This insight shifts model evaluation away from pure accuracy metrics toward more holistic measures of value—a critical step for aligning analytics practices with corporate ESG goals.

5.4. Implications for Practice

From a practical standpoint, the findings underscore that model selection should balance performance, interpretability, complexity, and sustainability, reflecting the multidimensional priorities validated through this study’s hypotheses. The confirmation of Hypotheses 1 through 4 provides actionable guidance for supply chain professionals: ensemble models like XGBoost are highly effective for improving forecasting accuracy and inventory efficiency (H1, H2), while deep learning models such as RNNs are better suited for sequential risk detection tasks (H3). Moreover, the CAE and CAE-ESG evaluation frameworks help identify models such as Random Forest that balance acceptable accuracy with low implementation cost and high ESG compliance (H4). In high-risk applications such as fraud detection, RNNs may be preferred for their high recall rates. In contrast, ESG-conscious organizations might prioritize simpler models like Random Forest that offer sufficient accuracy with minimal environmental and operational costs. These results support incorporating the CAE and CAE-ESG metrics into SCM analytics pipelines to ensure responsible AI adoption.

5.5. Addressing Gaps in the Literature

This study bridges several key research gaps in the supply chain analytics literature:
  • The fragmentation of ML applications across SCM functions is addressed by validating Hypotheses 1, 2, and 3, which demonstrate that predictive models like XGBoost and RNNs can be successfully deployed across forecasting, inventory optimization, and risk mitigation in a unified analytical framework.
  • The lack of comparative analysis under consistent conditions is resolved through direct empirical benchmarking of classical ML and deep learning models across identical datasets and KPIs, as reflected in the support for Hypotheses 1–3. This approach contrasts with prior work that typically examined forecasting or classification tasks in isolation.
  • The absence of sustainability-aware model selection frameworks is directly addressed through Hypothesis 4, which introduces and validates the CAE and CAE-ESG metrics. These tools extend beyond accuracy-based evaluation, incorporating cost-efficiency, ESG scores, and operational complexity into SCM model assessment.
The integration of forecasting, inventory, and risk modules into a cohesive, hypothesis-driven pipeline contributes to a more holistic and practice-aligned body of SCM research. It also begins to close the gaps identified in [43,51] regarding end-to-end, closed-loop supply chain analytics and sustainable model deployment.

5.6. Future Research Directions

Looking forward, further research should explore extensions of the hypotheses validated in this study, particularly in broader and more dynamic SCM environments.
  • Building on Hypotheses 1 and 2, future work could examine how advanced ML models like XGBoost or RNNs perform in multi-echelon inventory systems, closed-loop supply chains, or under conditions of extreme demand volatility (e.g., post-sale service and returns management). These contexts would test the robustness of forecast-driven replenishment and adaptive inventory policies in more complex, high-risk environments.
  • In the context of Hypothesis 3, further investigation is warranted into hybrid classification architectures (e.g., CNN-RNN or attention-based models) for more nuanced fraud detection or late delivery prediction, especially as new data modalities (like image, GPS, or IoT) become available.
  • Critically, Hypothesis 4 opens an important path for research on dynamic ESG integration. Future studies could explore how CAE-ESG can adapt to shifting stakeholder priorities or regulatory requirements by weighting ESG dimensions based on geography, industry, or corporate policy. Incorporating real-time ESG feedback loops or policy-sensitive ESG scoring could further align analytics with sustainability mandates.
Finally, the convergence of ML with IoT and blockchain technologies holds promise for greater SCM transparency, real-time traceability, and decentralized decision-making—providing fertile ground to extend the CAE-ESG framework into digitally integrated ecosystems.

5.7. Implications of CAE/CAE-ESG Metrics in SCM

The adoption of the CAE and CAE-ESG frameworks represents a paradigm shift in SCM analytics and model evaluation—one that was anticipated and empirically validated through Hypothesis 4. This hypothesis proposed that integrating cost-efficiency and ESG metrics into model evaluation would reveal meaningful trade-offs that are not captured by accuracy alone, and the results confirmed this: although XGBoost led in raw performance, Random Forest achieved the highest CAE-ESG score due to its lower computational cost and stronger sustainability alignment.
These findings challenge the long-standing dominance of technical accuracy as the sole criterion for model selection. Instead, they promote a more strategic, value-aligned approach to analytics—where efficiency, interpretability, environmental impact, and governance compliance are treated as first-order considerations. For organizations aiming to meet climate targets, minimize operational risk, or align with the ISO, GRI, or SASB sustainability frameworks, CAE-ESG offers a practical and adaptable decision-support tool. Its adoption empowers supply chain leaders to select models that are not only predictive but also scalable, explainable, and socially responsible—ensuring long-term alignment with both performance goals and ESG mandates.

6. Conclusions

This study demonstrates the significant potential of ML techniques, particularly XGBoost and RNNs, in enhancing SCM through demand forecasting, inventory optimization, and risk mitigation. The holistic framework developed in this research addresses the limitations of traditional single-function approaches by integrating multiple supply chain functions, thereby creating a more resilient and efficient supply chain ecosystem.
The comparative analysis reveals that XGBoost is highly effective for demand forecasting tasks, achieving superior accuracy and lower error rates. Meanwhile, RNNs have shown remarkable proficiency in handling sequential data, making them highly suitable for risk mitigation tasks such as fraud detection and late delivery prediction. These findings are supported by the existing literature and further validated through the performance metrics obtained in this study.
By emphasizing model interpretability and validation through SHAP analysis, this research has bridged the gap between academic model performance evaluation and tangible business outcomes. The insights generated from the models have been shown to drive actionable decisions that can improve CLV and reduce churn rates, directly impacting the bottom line of businesses.
This study also contributes to the academic community by addressing several gaps in the existing literature. It presents an integrated application of ML across multiple supply chain functions, provides a detailed comparison of traditional and deep learning methods, and underscores the importance of model interpretability. The comprehensive framework and findings of this research offer valuable guidance for practitioners looking to implement ML solutions in their supply chain operations.
While this study has made substantial contributions, it also opens avenues for future research. Further exploration into the application of ML in closed-loop supply chains and returns management could enhance the understanding of sustainable and ethical supply chain practices. Additionally, investigating the long-term impacts of ML implementations on supply chain resilience and agility would provide deeper insights into the strategic benefits of these technologies. The integration of ML with other advanced technologies, such as IoT and blockchain, also presents promising opportunities to further enhance supply chain transparency and efficiency.
Several limitations of this study should be acknowledged. First, the data scope was limited to a single firm over a one-year period. This restricts the generalizability of the findings and suggests that further research should be conducted using data from multiple firms and industries to validate the robustness of the models across different contexts. Second, while the models performed well in this study, their generalizability to other industries remains untested. Future work should explore the application of these models in diverse sectors to determine their broader applicability.
Third, the weighting of ESG metrics in the CAE-ESG framework involves subjective choices. The weights assigned to environmental, social, and governance factors can significantly influence the model selection process. Future research could benefit from developing more objective methods for determining these weights or testing the sensitivity of the results to different weighting schemes.
Lastly, computational constraints, particularly the high resource requirements for training RNNs using GPUs, pose practical challenges. The computational demands of deep learning models like RNNs may limit their accessibility for organizations with limited technical infrastructure. Future work could investigate more computationally efficient architectures or optimization techniques to reduce these requirements and make advanced ML models more accessible to a wider range of businesses.
This study introduced CAE and CAE-ESG as comprehensive metrics for evaluating ML models in supply chain contexts. These metrics enable a balanced assessment of predictive accuracy, cost-efficiency, and ESG alignment. By adopting CAE and CAE-ESG, businesses can make informed decisions that support profitability and sustainability simultaneously. Our results show that model selection strategies change significantly when sustainability is incorporated, demonstrating the relevance and necessity of such holistic evaluation approaches in modern supply chains.
In conclusion, this study advances the application of ML in SCM by delivering a comprehensive, integrated solution that addresses multiple functions and translates into meaningful business outcomes. As businesses continue to face the complexities of global markets, the integration of ML offers a powerful approach to enhancing efficiency, resilience, and competitiveness in SCM [52,53].

Author Contributions

Conceptualization, M.U.S., V.D. and R.H.; methodology, M.U.S. and R.H.; software, V.D. and S.M.; validation, S.M., H.W.K. and S.H.; formal analysis, M.U.S. and R.H.; investigation, H.W.K. and S.H.; resources, V.D. and S.M.; data curation, M.U.S. and R.H.; writing—original draft preparation, M.U.S. and R.H.; writing—review and editing, V.D., S.M., H.W.K. and S.H.; visualization, S.M. and H.W.K.; supervision, R.H.; project administration, R.H. and M.U.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this study is openly available in the Mendeley Data repository: Constante, Fabian; Silva, Fernando; Pereira, António (2019), “DataCo Smart Supply Chain for Big Data Analysis”, Mendeley Data, V5 https://data.mendeley.com/datasets/8gx2fvg2k6/5 (accessed on 21 March 2025). DOI: 10.17632/8gx2fvg2k6.5.

Acknowledgments

The authors would like to acknowledge the use of ChatGPT-4 (24 May 2023 version; OpenAI, San Francisco, CA, USA), specifically to assist in some content rewriting for improved clarity and effectiveness.

Conflicts of Interest

The authors declare no conflicts of interest.

List of Abbreviations

Abbreviation | Definition
ARIMA | Autoregressive Integrated Moving Average
CAE | Cost–Accuracy Efficiency
CAE-ESG | Cost–Accuracy Efficiency with Environmental, Social, and Governance Integration
CLV | Customer Lifetime Value
CV | Cross-Validation
DL | Deep Learning
EDA | Exploratory Data Analysis
EEI | Environmental Efficiency Index
EOQ | Economic Order Quantity
ERP | Enterprise Resource Planning
ESG | Environmental, Social, and Governance
ETS | Exponential Smoothing
FNN | Feedforward Neural Network
FPR | False Positive Rate
GHG | Greenhouse Gas
GRI | Global Reporting Initiative
GRU | Gated Recurrent Unit
ISO | International Organization for Standardization
LSTM | Long Short-Term Memory
MAE | Mean Absolute Error
MAPE | Mean Absolute Percentage Error
ML | Machine Learning
MSE | Mean Squared Error
PCC (r) | Pearson’s Correlation Coefficient (r)
RF | Random Forest
RFM | Recency, Frequency, Monetary Value
RMSE | Root-Mean-Square Error
RNN | Recurrent Neural Network
ROC AUC | Area Under the Receiver Operating Characteristic Curve
ROP | Reorder Point
SASB | Sustainability Accounting Standards Board
SCM | Supply Chain Management
SHAP | SHapley Additive exPlanations
SKU | Stock Keeping Unit
SRS | Social Responsibility Score
TPR | True Positive Rate
XGBoost | Extreme Gradient Boosting

Glossary of Variables and Metrics

Symbol | Definition | Unit | Context
S_i | Total sales for customer i | USD | Sum of unit price × quantity for all orders by customer i
p_ij | Unit price of item j for customer i | USD/item | Retrieved from order transaction logs
q_ij | Quantity of item j in order by customer i | Units | Recorded in the order dataset
D_i | Actual shipping days for order i | Days | D_i = ActualShipDate_i − OrderDate_i
L_i | Late delivery flag for order i | Binary (0 or 1) | 1 if D_i > PromisedLeadTime_i, else 0
ROP_t | Reorder point at time t | Units | $ROP_t = \hat{D}_{t+1} + z_{\alpha}\hat{\sigma}_{t+1}$
Q* | Optimal order quantity | Units | $Q^* = \sqrt{2DS/H}$
D | Demand rate | Units/day | Average daily demand used in EOQ calculations
S | Ordering cost | USD/order | Fixed cost per order placed
H | Holding cost per unit | USD/unit/day | Cost of storing one unit in inventory per day
z_α | Safety factor | Z-score | Based on desired service level
σ_{t+1} | Forecast standard deviation for next period | Units | Uncertainty in forecast demand for time t + 1
D̂_{t+1} | Forecasted demand for time t + 1 | Units/day | Output from XGBoost or RNN models
R_c | Recency for customer c | Days | Days since last purchase
F_c | Frequency for customer c | Count | Total number of purchases
M_c | Monetary value for customer c | USD | Total spending
s_R, s_F, s_M | Quintile scores for recency, frequency, and monetary value | Score (1–5) | Assigned based on ranking in the customer base
CLV | Customer Lifetime Value | USD | Estimated future value of customer based on historical behavior
CAE | Cost–Accuracy Efficiency | Dimensionless index | Combines model accuracy and cost reduction
CAE-ESG | CAE with ESG integration | Dimensionless index | CAE plus ESG score (energy, labor, governance)
Accuracy | Classification metric | % | Proportion of correct predictions
Precision | Classification metric | % | True positives/(true positives + false positives)
Recall | Classification metric | % | True positives/(true positives + false negatives)
F1-score | Harmonic mean of precision and recall | % | Provides a balanced measure of a model’s accuracy in classification tasks
SHAP φ_j | SHAP value of feature j | Feature contribution | Measures contribution of feature j to model prediction
EEI | Environmental Efficiency Index | Normalized (0–1) | Part of ESG score, reflects energy usage efficiency
SRS | Social Responsibility Score | Normalized (0–1) | Reflects labor fairness and supply chain ethics
GRM | Governance Risk Metric | Normalized (0–1) | Reflects regulatory and governance risk
CompCost | Computational cost of model | Normalized (0–1) | Based on runtime, memory, hardware use
OpComplexity | Operational complexity | Normalized (0–1) | Reflects difficulty in deploying and maintaining the model

References

  1. Joel, O.S.; Oyewole, A.T.; Odunaiya, O.G.; Soyombo, O.T. Leveraging Artificial Intelligence for Enhanced Supply Chain Optimization: A Comprehensive Review of Current Practices and Future Potentials. Int. J. Manag. Entrep. Res. 2024, 6, 707–721. [Google Scholar] [CrossRef]
  2. Liang, Y. Detecting and Predicting Supply Chain Risks: Fraud and Late Delivery Based on Decision Tree Models. Adv. Econ. Manag. Political Sci. 2025, 153, 40–46. [Google Scholar] [CrossRef]
  3. Nweje, U.; Taiwo, M. Leveraging Artificial Intelligence for predictive supply chain management, focus on how AI- driven tools are revolutionizing demand forecasting and inventory optimization. Int. J. Sci. Res. Arch. 2025, 14, 230–250. [Google Scholar] [CrossRef]
  4. Alsolbi, I.; Shavaki, F.H.; Agarwal, R.; Bharathy, G.K.; Prakash, S.; Prasad, M. Big data optimisation and management in supply chain management: A systematic literature review. Artif. Intell. Rev. 2023, 56, 253–284. [Google Scholar] [CrossRef]
  5. Nzeako, G.; Akinsanya, M.O.; Popoola, O.A.; Chukwurah, E.G.; Okeke, C.D. The role of AI-Driven predictive analytics in optimizing IT industry supply chains. Int. J. Manag. Entrep. Res. 2024, 6, 1489–1497. [Google Scholar] [CrossRef]
  6. Deyassa, K.G. The Effectiveness of ISO 14001 And Environmental Management System—The Case of Norwegian Firms. Struct. Environ. 2019, 11, 77–89. [Google Scholar] [CrossRef]
  7. Bais, B.; Nassimbeni, G.; Orzes, G. Global Reporting Initiative: Literature review and research directions. J. Clean. Prod. 2024, 471, 143428. [Google Scholar] [CrossRef]
  8. Sahib, S.A.; Malik, D.Y.S. Sustainability Accounting Standards Historical Development/Literature Review. Int. Acad. J. Account. Financ. Manag. 2023, 10, 1–12. [Google Scholar] [CrossRef]
  9. Schwartz, R.; Dodge, J.; Smith, N.A.; Etzioni, O. Green AI. Commun. ACM 2020, 63, 54–63. [Google Scholar] [CrossRef]
  10. Zeng, H.; Li, R.Y.M.; Zeng, L. Evaluating green supply chain performance based on ESG and financial indicators. Front. Environ. Sci. 2022, 10, 982828. [Google Scholar] [CrossRef]
  11. Shahrabi, J.; Mousavi, S.S.; Heydar, M. Supply Chain Demand Forecasting; A Comparison of Machine Learning Techniques and Traditional Methods. J. Appl. Sci. 2009, 9, 521–527. [Google Scholar] [CrossRef]
  12. Aldahmani, E.; Alzubi, A.; Iyiola, K. Demand Forecasting in Supply Chain Using Uni-Regression Deep Approximate Forecasting Model. Appl. Sci. 2024, 14, 8110. [Google Scholar] [CrossRef]
  13. Rezki, N.; Mansouri, M. Deep learning hybrid models for effective supply chain risk management: Mitigating uncertainty while enhancing demand prediction. Acta Logist. 2024, 11, 589–604. [Google Scholar] [CrossRef]
  14. Irhuma, M.; Alzubi, A.; Öz, T.; Iyiola, K. Migrative armadillo optimization enabled a one-dimensional quantum convolutional neural network for supply chain demand forecasting. PLoS ONE 2025, 20, e0318851. [Google Scholar] [CrossRef]
  15. Adhana, K.; Smagulova, A.; Zharmukhanbetov, S.; Kalikulova, A. The Utilisation of Machine Learning Algorithm Support Vector Machine (SVM) for Improving the Adaptive Assessment. In Proceedings of the 2023 4th International Conference on Computation, Automation and Knowledge Management (ICCAKM), Dubai, United Arab Emirates, 12–13 December 2023; IEEE: Piscataway, NJ, USA, 2023; p. 1. [Google Scholar]
  16. Terven, J.; Cordova-Esparza, D.M.; Ramirez-Pedraza, A.; Chavez-Urbiola, E.A.; Romero-Gonzalez, J.A. Loss Functions and Metrics in Deep Learning. A Review. arXiv 2023, arXiv:2307.02694. [Google Scholar] [CrossRef]
  17. Chandran, J.M.; Khan, M.R.B. A Strategic Demand Forecasting: Assessing Methodologies, Market Volatility, and Operational Efficiency. Malays. J. Bus. Econ. Manag. 2024, 3, 150–167. [Google Scholar] [CrossRef]
  18. Zhang, X.; Li, P.; Han, X.; Yang, Y.; Cui, Y. Enhancing Time Series Product Demand Forecasting with Hybrid Attention-Based Deep Learning Models. Access 2024, 12, 190079–190091. [Google Scholar] [CrossRef]
  19. Suhartanto, J.F.; García-Flores, R.; Schutt, A. An Integrated Framework for Reactive Production Scheduling and Inventory Management. In Sustainable Design and Manufacturing; Springer: Singapore, 2021; Volume 262, pp. 327–338. [Google Scholar]
  20. Silver, E.A.; Pyke, D.F.; Thomas, D. Inventory and Production Management in Supply Chains, 4th ed.; CRC Press: Boca Raton, FL, USA; London, UK; New York, NY, USA, 2017. [Google Scholar]
  21. Ho, T.; Nguyen, S.; Nguyen, H.; Nguyen, N.; Man, D.; Le, T. An Extended RFM Model for Customer Behaviour and Demographic Analysis in Retail Industry. Bus. Syst. Res. 2023, 14, 26–53. [Google Scholar] [CrossRef]
  22. Heldt, R.; Silveira, C.S.; Luce, F.B. Predicting customer value per product: From RFM to RFM/P. J. Bus. Res. 2021, 127, 444–453. [Google Scholar] [CrossRef]
  23. Schober, P.; Boer, C.; Schwarte, L.A. Correlation Coefficients: Appropriate Use and Interpretation. Anesth. Analg. 2018, 126, 1763–1768. [Google Scholar] [CrossRef]
  24. Box, G.E.P.; Jenkins, G.M.; Reinsel, G.C.; Ljung, G.M. Time Series Analysis: Forecasting and Control, 5th ed.; Wiley: Hoboken, NJ, USA, 2016. [Google Scholar]
  25. Li, Y.; Chen, T. ISCCO: A deep learning feature extraction-based strategy framework for dynamic minimization of supply chain transportation cost losses. PeerJ. Comput. Sci. 2024, 10, e2537. [Google Scholar] [CrossRef] [PubMed]
  26. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: New York, NY, USA, 2016; pp. 785–794. [Google Scholar]
  27. Chang, R.; Liu, X.; Deng, W. Prediction of corporate default risk considering ESG performance and unbalanced samples. Appl. Soft Comput. 2025, 171, 112864. [Google Scholar] [CrossRef]
  28. Seyedan, M.; Mafakheri, F.; Wang, C. Order-up-to-level inventory optimization model using time-series demand forecasting with ensemble deep learning. Supply Chain Anal. 2023, 3, 100024. [Google Scholar] [CrossRef]
  29. Bhattacharya, N.G.; Zavar, G. Dynamic Relationship Between Stock Market Returns and Trading Volume: Evidence from Indian Stock Market. J. Glob. Econ. 2016, 12, 123–136. [Google Scholar] [CrossRef]
  30. Sarkar, M.; De Bruyn, A. LSTM Response Models for Direct Marketing Analytics: Replacing Feature Engineering with Deep Learning. J. Interact. Mark. 2021, 53, 80–95. [Google Scholar] [CrossRef]
  31. Xu, J.; Pero, M.; Fabbri, M. Unfolding the link between big data analytics and supply chain planning. Technol. Forecast. Soc. Change 2023, 196, 122805. [Google Scholar] [CrossRef]
  32. Khan, Y.; Su’ud, M.B.M.; Alam, M.M.; Ahmad, S.F.; Ahmad (Ayassrah), A.Y.A.B.; Khan, N. Application of Internet of Things (IoT) in Sustainable Supply Chain Management. Sustainability 2023, 15, 694. [Google Scholar] [CrossRef]
  33. Sieke, M. Foundations of Inventory Management. In Supply Chain Contract Management; Springer Fachmedien Wiesbaden GmbH: Cologne, Germany, 2019; pp. 9–36. [Google Scholar]
  34. Hyndman, R.J.; Athanasopoulos, G. Forecasting: Principles and Practice, 3rd ed.; OTexts: Melbourne, Australia, 2021. [Google Scholar]
  35. Oliveira, G.X.C. Enhancing Customer Churn Prediction: Addressing Disparities and Imbalance in Machine Learning Models. 2023. Available online: https://www.proquest.com/docview/3122647352 (accessed on 26 March 2025).
  36. Sangeetha, G.; Harshavardhan, S.V.; Allen Joshua, L.; Anirudh, S.M. Predictive Analysis for Supply Chain Management Using Extreme Gradient Boost Classifier. Int. J. Adv. Res. Sci. Commun. Technol. 2024, 4, 531–534. [Google Scholar] [CrossRef]
  37. Zhang, J.; Yuyang, W.; Zidu, W. Enhancing Supply Chain Forecasting with Machine Learning: A Data-Driven Approach to Demand Prediction, Risk Management, and Demand-Supply Optimization. J. Fintech Bus. Anal. 2024, 2, 1–5. [Google Scholar] [CrossRef]
  38. Constante, F. DataCo Smart Supply Chain for Big Data Analysis. 2019. Available online: https://search.datacite.org/works/10.17632/8gx2fvg2k6.3 (accessed on 26 March 2025).
  39. Mathotaarachchi, K.V.; Hasan, R.; Mahmood, S. Advanced Machine Learning Techniques for Predictive Modeling of Property Prices. Information 2024, 15, 295. [Google Scholar] [CrossRef]
  40. Hariyani, D.; Hariyani, P.; Mishra, S.; Sharma, M.K. A literature review on green supply chain management for sustainable sourcing and distribution. Waste Manag. Bull. 2024, 2, 231–248. [Google Scholar] [CrossRef]
  41. Barbosa, M.W.; Vicente, A.d.l.C.; Ladeira, M.B.; Oliveira, M.P.V.d. Managing supply chain resources with Big Data Analytics: A systematic review. Int. J. Logist. 2018, 21, 177–200. [Google Scholar] [CrossRef]
  42. Wang, Z. Data-Driven Supply Chain Performance Optimization Through Predictive Analytics and Machine Learning. Appl. Comput. Eng. 2024, 118, 30–35. [Google Scholar] [CrossRef]
  43. Roshan, H.; Afsharinezhad, M. The new approach in market segmentation by using RFM model. J. Appl. Res. Ind. Eng. 2017, 4, 259–267. [Google Scholar] [CrossRef]
  44. Jia, W.; Sun, M.; Lian, J.; Hou, S. Feature dimensionality reduction: A review. Complex Intell. Syst. 2022, 8, 2663–2693. [Google Scholar] [CrossRef]
  45. Pasupuleti, V.; Thuraka, B.; Kodete, C.S.; Malisetty, S. Enhancing Supply Chain Agility and Sustainability through Machine Learning: Optimization Techniques for Logistics and Inventory Management. Logistics 2024, 8, 73. [Google Scholar] [CrossRef]
  46. Wang, H.; Sua, L.S.; Alidaee, B. Enhancing supply chain security with automated machine learning. arXiv 2024, arXiv:2406.13166. [Google Scholar] [CrossRef]
  47. Bassiouni, M.M.; Chakrabortty, R.K.; Sallam, K.M.; Hussain, O.K. Deep learning approaches to identify order status in a complex supply chain. Expert Syst. Appl. 2024, 250, 123947. [Google Scholar] [CrossRef]
  48. Rahman Mahin, M.P.; Shahriar, M.; Das, R.R.; Roy, A.; Reza, A.W. Enhancing Sustainable Supply Chain Forecasting Using Machine Learning for Sales Prediction. Procedia Comput. Sci. 2025, 252, 470–479. [Google Scholar] [CrossRef]
  49. Seyedan, M.; Mafakheri, F. Predictive big data analytics for supply chain demand forecasting: Methods, applications, and research opportunities. J. Big Data 2020, 7, 53. [Google Scholar] [CrossRef]
  50. Mulay, S.; Madgule, M.; Dhotre, K.; Bhosale, D.; Pingale, A. Prediction of economic order quantity using a modified analytical approach. Aust. J. Multi-Discip. Eng. 2025, 1–8. [Google Scholar] [CrossRef]
  51. Mahmood, S.; Hasan, R.; Hussain, S.; Adhikari, R. An Interpretable and Generalizable Machine Learning Model for Predicting Asthma Outcomes: Integrating AutoML and Explainable AI Techniques. World 2025, 6, 15. [Google Scholar] [CrossRef]
  52. Carter, C.R.; Rogers, D.S. A framework of sustainable supply chain management: Moving toward new theory. Int. J. Phys. Distrib. Logist. Manag. 2008, 38, 360–387. [Google Scholar] [CrossRef]
  53. Jiang, X. Predicting Corporate ESG Scores Using Machine Learning: A Comparative Study. Adv. Econ. Manag. Political Sci. 2024, 118, 141–147. [Google Scholar] [CrossRef]
Figure 1. Correlation heatmap.
Figure 2. Comparison of CLV before and after simulated interventions.
Figure 3. Comparison of churn rates before and after simulated interventions.
Figure 4. Comparative model performance showing MAE and RMSE distributions.
Figure 5. Actual vs. predicted daily demand patterns.
Figure 6. Predicted vs. actual sales for XGBoost and RNN.
Figure 7. Predicted vs. actual sales with LOWESS smoother.
Figure 8. Residuals of XGBoost and RNN predictions.
Figure 9. Daily on-hand inventory levels with reorder points.
Figure 10. Training vs. validation accuracy and loss (fraud detection).
Figure 11. Training vs. validation accuracy and loss (late delivery prediction).
Figure 12. Confusion matrix for fraud prediction model.
Figure 13. Confusion matrix for late delivery prediction model.
Figure 14. Fine-tuned model’s recall performance in fraud detection.
Figure 15. Fine-tuned model’s recall performance in late delivery.
Figure 16. Comparative analysis of training time and memory usage.
Figure 17. Top 5 SHAP features in the fraud detection model.
Figure 18. Top 5 SHAP features in the late delivery prediction model.
Figure 19. Correlation heatmap highlighting pruned features.
Table 1. Descriptive statistics of key numerical features.
Feature | Mean | Std. Dev | Skewness | Kurtosis
Sales per customer | 310.5 | 15.3 | 0.12 | 2.8
Days for shipping (real) | 3.5 | 1.2 | 0.05 | 3.0
Late delivery risk | 0.2 | 0.4 | 0.10 | 3.1
Order item total | 150.0 | 20.0 | 0.08 | 3.2
Table 2. Customer segmentation by RFM analysis.
Segment | Percentage (%)
Recent Customers | 33.2
Promising | 16.9
Cannot Lose Them | 12.0
At-Risk | 11.4
Customers Needing Attention | 11.0
Loyal Customers | 10.5
Lost | 4.4
Champions | 0.6
Table 3. Churn prediction examples.
Customer | Customer_Segmentation | Predicted Churn | R_Value | F_Value | M_Value | R_Score | F_Score | M_Score
1At-RiskYes280168999.66311
2At-RiskYes1632512,388.54311
9LostYes1282812,584.02211
10LostYes139208196.45211
11Loyal CustomersNo31632060.58433
12Loyal CustomersNo69031869.59433
15Recent CustomersNo26661557.02333
16Recent CustomersNo711234.43144
3Cannot Lose ThemNo474166312.54411
4Cannot Lose ThemNo177159124.92321
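For readers who wish to reproduce a comparable segmentation, the following minimal sketch illustrates quartile-based RFM scoring with pandas. The column names (customer_id, order_date, sales) and the four-bin cut-offs are illustrative assumptions and do not reproduce the exact pipeline used in this study.

import pandas as pd

def rfm_scores(orders: pd.DataFrame, snapshot_date: pd.Timestamp) -> pd.DataFrame:
    # Aggregate per customer: recency in days, order frequency, and monetary value.
    rfm = orders.groupby("customer_id").agg(
        R_Value=("order_date", lambda d: (snapshot_date - d.max()).days),
        F_Value=("order_date", "count"),
        M_Value=("sales", "sum"),
    )
    # Quartile scores: recent customers score high on R; frequent/high-spend customers on F and M.
    rfm["R_Score"] = pd.qcut(rfm["R_Value"], 4, labels=[4, 3, 2, 1]).astype(int)
    rfm["F_Score"] = pd.qcut(rfm["F_Value"].rank(method="first"), 4, labels=[1, 2, 3, 4]).astype(int)
    rfm["M_Score"] = pd.qcut(rfm["M_Value"], 4, labels=[1, 2, 3, 4]).astype(int)
    return rfm

Named segments such as “At-Risk” or “Champions” would then be assigned from rules over the three scores.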
Table 4. Performance metrics comparison.
Customer_Segmentation | Pre-Avg CLV | Post-Avg CLV | Pre-Churn Rate | Post-Churn Rate | Pre-Count | Post-Count
At-Risk | 21.762133 | 17.409706 | 0.017286 | 0.006915 | 357 | 142.80
Cannot Lose Them | 38.923683 | 35.031315 | 0.000000 | 0.025567 | 528 | 528.00
Customers Needing Attention | 68.249190 | 64.836731 | 0.000000 | 0.102896 | 2125 | 2125.00
Lost | 28.059523 | 19.641666 | 0.654222 | 0.294400 | 13,511 | 6079.95
Promising | 100.334367 | 105.351085 | 0.000000 | 0.051278 | 1059 | 1059.00
Recent Customers | 153.917855 | 169.309640 | 0.000000 | 0.148751 | 3072 | 3072.00
Table 5. Forecasting performance metrics on held-out test data.
Model | MAE (Units) | RMSE (Units) | MAPE (%)
Linear Regression | 0.0006 | 0.0015 | 0.02
Lasso | 1.5543 | 2.3331 | 4.76
Random Forest | 0.1941 | 2.1655 | 0.60
XGBoost | 0.1571 | 0.5333 | 0.48
Neural Network | 73.15 | 86.30 | 21.5
RNN | 5.52 | 7.84 | 1.62
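For reference, the error measures reported in Table 5 can be computed as in the generic sketch below (scikit-learn and NumPy); this is an illustrative implementation rather than the study’s evaluation script.

import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

def forecast_errors(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mae = mean_absolute_error(y_true, y_pred)
    rmse = np.sqrt(mean_squared_error(y_true, y_pred))
    # MAPE in percent; the epsilon guards against zero-demand days.
    mape = np.mean(np.abs((y_true - y_pred) / np.maximum(np.abs(y_true), 1e-8))) * 100
    return {"MAE": mae, "RMSE": rmse, "MAPE (%)": mape}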
Table 6. Sales forecasting performance comparison.
Model | MAE | RMSE | R²
Random Forest | 0.0246 | 1.2138 | 0.9999
XGBoost | 0.1151 | 0.4680 | 0.99999
RNN | 3.6877 | 6.8830 | 0.9973
Table 7. Inventory performance comparison (RNN vs. XGBoost vs. naive EOQ/ROP).
Metric | RNN Optimized | Naive EOQ/ROP | XGBoost Optimized
Fill Rate (%) | 80.2 | N/A | 85.4
Stockout Events | 1 | 20 | 20
Total Cost (USD) | 2,045,780 | 907,820 | 2,500,000
Cost Reduction vs. Naive (%) | 45.2 | 0 | 0.0
RMSE | 126.62 | N/A | 215.49
MAE | 110.73 | N/A | 170.90
Table 8. Inventory cost comparison.
Policy | Total Cost | Holding Cost | Stockout Penalty
RNN | 50,612.25 | 49,239.91 | 1372.34
XGBoost | 81,181.80 | 74,055.03 | 7126.78
Naive | 14,119.00 | 4364.00 | 9755.00
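The naive EOQ/ROP baseline referenced in Tables 7 and 8 follows the classical formulas; a minimal sketch is shown below, in which the demand, ordering-cost, holding-cost, and lead-time figures are placeholders rather than the parameters used in this study.

import math

def eoq(annual_demand: float, order_cost: float, holding_cost: float) -> float:
    # Economic order quantity: sqrt(2 * D * S / H).
    return math.sqrt(2 * annual_demand * order_cost / holding_cost)

def reorder_point(avg_daily_demand: float, lead_time_days: float, safety_stock: float = 0.0) -> float:
    # Reorder point: expected demand over the replenishment lead time plus safety stock.
    return avg_daily_demand * lead_time_days + safety_stock

# Placeholder parameters for illustration only.
order_quantity = eoq(annual_demand=50_000, order_cost=120.0, holding_cost=2.5)
rop = reorder_point(avg_daily_demand=137.0, lead_time_days=5, safety_stock=200.0)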
Table 9. Fraud detection metrics.
Model | Accuracy (%) | Recall (%) | Precision (%) | F1-Score (%)
RF | 97.65 | 0.24 | 98.10 | 0.47
XGBoost | 99.11 | 67.06 | 91.20 | 78.03
RNN | 99.59 | 98.13 | 99.00 | 98.00
Table 10. Late delivery prediction metrics.
Model | Accuracy (%) | Recall (%) | Precision (%) | F1-Score (%)
RF | 97.97 | 100.00 | 95.50 | 98.18
XGBoost | 98.53 | 99.96 | 97.42 | 98.67
RNN | 98.88 | 98.10 | 97.60 | 97.85
Table 11. Comparative performance of original vs. fine-tuned RNN models.
Task | Model Version | Accuracy (%) | Recall (%) | F1-Score (%)
Fraud Detection | Original RNN | 99.59 | 98.13 | 98.00
Fraud Detection | Fine-Tuned RNN | 97.90 | 99.88 | 98.91
Late Delivery Prediction | Original RNN | 98.88 | 98.10 | 97.85
Late Delivery Prediction | Fine-Tuned RNN | 97.49 | 100.00 | 97.76
Table 12. Comparative computational requirements and complexity of ML models.
Model | Training Time (Approx.) | Memory Usage (Approx.) | Scalability | Complexity
XGBoost | 10 min | 2 GB | High | O(n_samples × n_features × n_trees × max_depth)
Random Forest | 15 min | 2.5 GB | Moderate | O(n_samples × n_features × n_trees × max_depth)
RNN | 1 h | 10 GB (GPU) | Low | O(n_samples × n_timesteps × n_features × n_units)
Table 13. Comparative performance of proposed models versus existing studies across supply chain prediction tasks.
Study | ML Method | Dataset Used | Key Performance Metrics | Superiority of Our Study
Our Study | XGBoost | DataCo’s ERP and logistics databases | Demand Forecasting: MAE = 0.1571, RMSE = 0.5333, MAPE = 0.48% | Holistic framework integration, superior demand forecasting accuracy, and tangible business outcomes.
Our Study | RNN | DataCo’s ERP and logistics databases | Late Delivery Prediction: Accuracy = 98.88%, recall = 98.10%, F1-score = 97.85% | Enhanced risk mitigation through superior recall and F1-scores in sequential data tasks.
[42] | TCN-1DSPCNN | Complex supply chain system | Late Delivery Prediction: 100% accuracy | Our study demonstrates comparable accuracy with additional business outcome metrics.
[43] | Voting Regressor | Sales data | Sales Forecasting: R² = 0.9999, RMSE = 1.54 | Our XGBoost model achieves similar accuracy with lower computational complexity.
[44] | Ensemble DL Model | Various supply chain datasets | Demand Forecasting: R² = 0.9999, RMSE = 1.54 | Our study provides a more comprehensive framework beyond demand forecasting.
[32] | AutoML (XGBoost, LightGBM) | Supply chain security data | Fraud Detection: Accuracy = 99.11%, recall = 67.06% | Our RNN model for fraud detection shows higher recall and F1-score.
Table 14. CAE and CAE-ESG for fraud detection models.
Model | Accuracy | Cost Reduction | Comp. Cost | Op. Complexity | ESG Score | CAE | CAE-ESG
XGBoost | 0.9911 | 0.30 | 0.10 | 0.30 | 0.85 | 0.744 | 1.952
RNN | 0.9959 | 0.20 | 0.20 | 0.40 | 0.80 | 0.284 | 0.795
Random Forest | 0.9765 | 0.25 | 0.05 | 0.20 | 0.90 | 0.976 | 1.953
Table 15. CAE and CAE-ESG for late delivery prediction models.
Model | Accuracy | Cost Reduction | Comp. Cost | Op. Complexity | ESG Score | CAE | CAE-ESG
XGBoost | 0.9853 | 0.35 | 0.10 | 0.20 | 0.87 | 1.151 | 2.510
RNN | 0.9888 | 0.22 | 0.20 | 0.30 | 0.82 | 0.434 | 1.100
Random Forest | 0.9797 | 0.28 | 0.07 | 0.20 | 0.88 | 1.100 | 2.072
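Purely to illustrate how a ratio-style cost–accuracy score of this kind can be tabulated, the sketch below assumes CAE = (accuracy × cost reduction) / (computational cost + operational complexity) and an ESG-adjusted variant that scales CAE by the ESG score; these assumed definitions are illustrative only and are not guaranteed to reproduce the values in Tables 14 and 15.

def cae(accuracy: float, cost_reduction: float, comp_cost: float, op_complexity: float) -> float:
    # Assumed illustrative definition: reward accuracy and cost savings,
    # penalize computational cost and operational complexity.
    return (accuracy * cost_reduction) / (comp_cost + op_complexity)

def cae_esg(accuracy: float, cost_reduction: float, comp_cost: float,
            op_complexity: float, esg_score: float, esg_weight: float = 1.0) -> float:
    # Assumed ESG-adjusted variant; the paper's actual weighting may differ.
    return cae(accuracy, cost_reduction, comp_cost, op_complexity) * (1 + esg_weight * esg_score)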
Table 16. Comparison of top 5 SHAP features across both models, informed by correlation heatmap.
Feature | Fraud Detection Model Impact | Late Delivery Model Impact | Correlation Notes
Late_delivery_risk | Strong negative SHAP (low = non-fraud) | Not in top 5 | Moderate positive correlation with days for shipping (real) (r = 0.40)
Delivery status | Mixed SHAP impact | Not in top 5 | Weak correlation with other features
Type | Moderate effect depending on type | Includes Type_TRANSFER, Type_PAYMENT | Uncorrelated (max r = 0.06), strongly independent
Days for shipping (real) | Slightly increases fraud risk with long times | Not in top 5 | Moderate correlation with Late_delivery_risk (r = 0.40)
Sales per customer | Weak-to-moderate effect | Not in top 5 | Low correlation with all other features
Shipping mode | Not in top 5 | Strong positive SHAP (slower = late) | Negatively correlated with Late_delivery_risk (r = −0.40)
Order_day_of_week | Not in top 5 | Moderate SHAP variation by weekday | Minimal correlation with all other variables
Shipping_day_of_week | Not in top 5 | Moderate SHAP variation by weekday | No significant correlation
Type_TRANSFER | Not in top 5 | Strong negative SHAP (on-time predictor) | Isolated binary category
Type_PAYMENT | Not in top 5 | Slight positive SHAP when low | Also categorical and uncorrelated
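Feature rankings like those in Table 16 can be reproduced in outline with the shap library; the sketch below uses a tree-based classifier on synthetic stand-in data, so the model, features, and data are placeholders rather than the study’s fraud or late-delivery models.

import numpy as np
import pandas as pd
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; in practice these would be the supply chain features
# (e.g., Late_delivery_risk, Shipping mode) and the fitted prediction model.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X = pd.DataFrame(X, columns=[f"feature_{i}" for i in range(8)])
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
# Depending on the shap version, binary classifiers return a per-class list or a 3D array.
values = shap_values[1] if isinstance(shap_values, list) else np.asarray(shap_values)
if values.ndim == 3:
    values = values[:, :, 1]

# Rank features by mean absolute SHAP value and keep the top 5.
mean_abs = np.abs(values).mean(axis=0)
top5 = sorted(zip(X.columns, mean_abs), key=lambda t: t[1], reverse=True)[:5]
print(top5)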
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
