1. Introduction
Sustainable supply chain management has become a strategic imperative as companies face mounting pressure to reduce carbon footprints while maintaining profitability. Climate change concerns and regulatory initiatives—from emission caps to carbon taxes—are compelling firms to embed environmental objectives into supply chain decisions [1]. In practice, this means balancing the classic goal of cost minimization with new constraints on greenhouse gas emissions. Achieving this balance is challenging: the triple bottom line of sustainability requires simultaneous attention to economic, environmental, and often social objectives [2]. Supply chain planners must navigate trade-offs, such as choosing slower low-emission transport vs. faster high-emission shipping, or sourcing from a local “green” supplier at higher cost vs. a distant low-cost supplier with a larger carbon footprint.
Artificial intelligence (AI) offers promising tools to manage these complexities. Traditional optimization methods (e.g., linear programming or heuristics) struggle with the high dimensionality, uncertainty, and conflicting objectives inherent in sustainable supply chains [2]. In contrast, AI techniques can learn from data and adapt to changing conditions. Generative AI, in particular, is emerging as a pivotal technology for reimagining supply chain design [3]—for example, by generating scenarios or designs that improve both efficiency and resilience. At the same time, adaptive AI methods like reinforcement learning (RL) enable real-time decision-making in dynamic environments, outperforming static strategies especially under uncertainty [2]. However, most prior applications have treated these advances separately: generative models mainly for data-driven insights or scenario generation, and RL or other machine learning for execution. Little work has combined them into an integrated approach for sustainability-oriented decision-making. This paper proposes a novel framework that combines generative modeling, multi-objective optimization, and reinforcement learning to design and optimize sustainable supply chain strategies, aligning with recent calls for AI-enabled resilience in volatile global environments [4]. We leverage a Variational Autoencoder (VAE) to model and generate realistic supply–demand scenarios, the Non-Dominated Sorting Genetic Algorithm II (NSGA-II) to solve a bi-objective (cost, CO2 emission) optimization for supply chain configuration, and a deep reinforcement learning agent to adapt operational decisions over time. The study is conducted using the M5 Forecasting–Accuracy dataset from Kaggle (a large-scale Walmart retail sales dataset [5]), which we extend with synthetic supply chain variables (delivery distance, supplier “greenness”, carbon emissions, and penalized delivery cost) to evaluate sustainability trade-offs. By integrating these components, we aim to support supply chain planners with a tool that not only provides a set of Pareto-optimal strategic options but also an adaptive policy to implement those strategies in real time.
Building on prior advances in multi-objective optimization [6], reinforcement learning [7,8,9], and generative modeling for search space expansion [10], we extend these techniques into a unified framework for sustainable supply chain management. These foundations enable the development of our proposed hybrid approach, which is described below.
This study aims to address the following research question: Can a hybrid AI framework—combining generative modeling, multi-objective optimization, and reinforcement learning—support the design and adaptive execution of cost-efficient and low-emission supply chain strategies in a retail context?
To this end, we propose and evaluate an AI-based decision architecture that integrates a generative component for demand scenario creation, an evolutionary multi-objective optimizer for sourcing strategy design, and a reinforcement learning agent for dynamic execution under uncertainty.
Our contribution lies in bridging the gap between offline planning and online control in sustainable supply chains by offering a unified framework that expands the solution space, explores trade-offs between cost and emissions, and adapts to volatile conditions in real time. Methodologically, we show how generative seeding can enhance convergence and diversity in multi-objective search, and how offline optimization outputs can guide reinforcement learning policies in dynamic environments. Empirically, we demonstrate the potential for significant reductions in CO2 emissions with modest cost increases and highlight how adaptive control policies further improve efficiency and resilience under disruptions. The framework is validated in a retail supply chain setting using a forecasting dataset enriched with sustainability-related variables such as supplier greenness, delivery distance, emissions, and carbon penalties.
The remainder of this paper is organized as follows. The next section situates our work in the literature and presents the conceptual framework underpinning the integration of generative and adaptive AI in supply chains. We then detail the methodology, including data augmentation, model architectures, and experimental setup. Next, we present results from multi-objective optimization and RL experiments, along with an analysis of trade-offs. A discussion section interprets the findings, draws out theoretical and managerial implications, and acknowledges limitations. We conclude with a summary of contributions and suggestions for future research.
3. Methodology
3.1. Data Source and Synthetic Extension
Our study is built on the M5 Forecasting–Accuracy dataset, a publicly available benchmark from a Kaggle competition. The M5 dataset contains extensive time series data of Walmart product sales, providing a realistic demand scenario for our supply chain model. In particular, it covers the daily unit sales of 3049 products across 10 stores in three US states (CA, TX, WI), categorized into three major product categories (Hobbies, Foods, Household) and further into seven departments [5]. The historical data spans 5.4 years (2011–2016) and includes useful covariates such as selling price, promotions (e.g., SNAP benefits), and event flags (holidays) [5]. This rich dataset ensures our demand modeling is grounded in real patterns like seasonality, trend, and intermittency. To ensure transparency and reproducibility, all the source code developed and used in the analyses has been made openly available on GitHub at https://github.com/elephant2015/JTAER-Generative-and-Adaptive-AI-for-Sustainable-Supply-Chain-De-sign, accessed on 26 August 2025.
While M5 provides demand information, it does not include supply chain variables like supplier types, distances, or emissions. We extended the dataset by synthetically assigning such attributes to create a complete scenario for sustainable supply chain optimization. The following synthetic variables were added:
Supplier Type (green vs. non-green): Each product–store combination (or product category in a state) was assigned a primary supplier, labeled “green” (sustainable supplier) or “non-green” (traditional supplier). Green suppliers are conceptualized as local or environmentally friendly sources—for example, a nearby warehouse powered by renewable energy, or a manufacturer with eco-friendly processes. Non-green suppliers represent conventional sources, possibly overseas or using carbon-intensive production. We assumed roughly 50% of the product groups have a green alternative available, reflecting that not all products can be sourced sustainably yet. The assignment was randomly stratified by category (to allow diversity in each category). We created a binary indicator for supplier type for use in strategy decisions.
Delivery Distance: For each supplier, we generated a distance to the store or distribution center. Green suppliers, being local/regional, were assigned shorter distances (e.g., a random value in 50–300 km range), whereas non-green suppliers had longer distances (e.g., 1000–5000 km, representing inter-state or international shipping). Distance is an important factor because it influences transportation cost and emissions. In our model, distance ties into both objectives: longer distances incur higher transport costs and greater fuel usage (hence higher CO2 emissions). We note this is a simplification (real supply chains have complex routing, but we treat it as a direct supplier-to-store route for each product to focus on the strategic sourcing aspect).
CO2 Emissions: We estimated carbon emissions per unit delivered as a function of distance and mode. We assumed green suppliers use cleaner transport modes (e.g., electric trucks or rail) with a lower emission rate, and non-green suppliers use standard diesel trucking or air/sea freight with higher emission rates. For simplicity, a linear model was used: each unit delivered from a supplier produces emission_rate × distance kilograms of CO2. For example, a green supplier might have an emission rate of 0.5 kg CO2 per unit per 100 km, while a non-green supplier’s rate is 1.0 kg per unit per 100 km (double the emissions). Thus, delivering 100 units over 100 km from a green supplier yields ~50 kg CO2, whereas from a non-green supplier yields ~100 kg. These rates were chosen to reflect plausible differences in transport efficiency (electric/local vs. long-haul diesel). We emphasize that these are synthetic but can be adjusted to match real carbon accounting data if available. The total emissions for a given product over a time horizon is then the sum over all deliveries (units × emission_rate × distance). This emission calculation feeds into the environmental objective.
Penalized Delivery Cost: We formulated the cost per unit delivered to include not only the base production/transport cost but also penalties related to sustainability. Specifically, for green suppliers we assumed a higher base cost (since eco-friendly materials or local labor may be more expensive), whereas non-green suppliers have a lower base cost but incur a carbon penalty cost proportional to emissions. For instance, in our implementation green supplier cost per unit might be USD 1.50 (excluding transport), but for non-green it might be USD 1.00 per unit. Then we added transport cost, which scales with distance (e.g., USD 0.05 per 100 km per unit) and a carbon tax of USD X per kg CO2 for non-green shipments (to simulate carbon pricing). We chose a carbon tax such that it meaningfully increases non-green costs when emissions are high; for example, if the cost is USD 0.10 per kg, then a 1000 km non-green delivery (10 kg CO2 per unit) adds USD 1.0 per unit in carbon cost. This penalized cost structure means that in some cases green sourcing, while initially costlier, could become competitive if carbon costs for non-green shipments are high. It also encourages the optimization to consider emissions in monetary terms. All cost figures were normalized (they are unit-less in the model) but calibrated to ensure a non-green vs. green trade-off: non-green is cheaper in pure cost, green has much lower emissions, and the carbon penalty can tilt the balance depending on its weight.
Service Level Constraints: Although not explicitly asked in objectives, we maintained the constraint that all demand must be met (no stockouts). If a chosen supplier type could not fulfill the demand in time (e.g., capacity limit or slower shipping), we assumed emergency expediting from the other supplier at a very high cost. This was implemented as a rule: if at any period the on-hand inventory from the primary supplier falls short of demand, a backup shipment is triggered (with a 50% cost premium and corresponding emissions). This discourages strategies that rely solely on a supplier that cannot handle peak demand, indirectly pushing the optimization toward feasible mixed sourcing if needed. (In practice, this was a safety mechanism to keep all solutions feasible; in our data generation we did not impose strict capacity limits, but we did simulate lead times which could cause delays.)
The above synthetic augmentation transforms the M5 dataset from a pure forecasting exercise into a realistic supply chain simulation dataset. We effectively have, for each product in each store, a time series of daily demand, a designated supplier type (with associated distance, cost, and emission parameters), and formulas to compute cost and emissions for any realized demand. We aggregated the daily data into a slightly coarser time bucket (weekly) for computational tractability in optimization and RL training—this reduced noise and the number of decision steps, while still capturing demand variability. Before feeding into models, we normalized or scaled certain values (e.g., cost and emission totals) to avoid extremely large ranges, given that costs could accumulate to thousands of dollars and emissions to tons over the horizon. Normalization ensured the objectives were on comparable scales during optimization.
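To make these formulas concrete, the following sketch computes the penalized cost and emission totals for a single product group under the assumptions above (green premium of USD 1.50 vs. USD 1.00 per unit, transport cost of USD 0.05 per unit per 100 km, emission rates of 0.5 and 1.0 kg CO2 per unit per 100 km, and a USD 0.10/kg carbon tax on non-green shipments). The function name and structure are illustrative rather than our exact implementation.

```python
import numpy as np

def group_cost_and_emissions(weekly_demand, green, distance_km,
                             cost_green=1.50, cost_nongreen=1.00,
                             transport_per_100km=0.05,
                             emis_green=0.5, emis_nongreen=1.0,  # kg CO2 per unit per 100 km
                             carbon_tax=0.10):                   # USD per kg CO2, non-green only
    """Penalized cost (USD) and emissions (kg CO2) for one product group over the horizon."""
    units = float(np.sum(weekly_demand))
    rate = emis_green if green else emis_nongreen
    emissions = units * rate * distance_km / 100.0               # emission_rate x distance per unit
    base = cost_green if green else cost_nongreen
    transport = transport_per_100km * distance_km / 100.0        # per-unit transport cost
    cost = units * (base + transport)
    if not green:                                                # carbon penalty on non-green shipments
        cost += carbon_tax * emissions
    return cost, emissions

# Example: 52 weeks of demand served by a non-green supplier 1500 km away
cost, co2 = group_cost_and_emissions(np.random.poisson(800, size=52), green=False, distance_km=1500)
```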
3.2. Generative Modeling (Variational Autoencoder)
We implemented a Variational Autoencoder (VAE) to learn patterns from the historical demand data and the synthetic supply attributes. The VAE’s purpose in our study was two-fold: (1) to generate additional demand scenarios beyond the observed period, and (2) to assist the optimizer by suggesting good initial solutions (configurations). For demand scenario generation, we trained the VAE on the multivariate time series of weekly demand across all product groups. Each input to the VAE was a vector representing a snapshot of demand over a certain window (e.g., 52 weeks for each product group). We also appended the supplier type indicator and distance of that product group as additional features to the input, so the VAE could learn correlations between demand patterns and the type of supply chain that serves them. The encoder part of the VAE compressed this information into a 2-dimensional latent variable z (we chose two for ease of visualization and to encourage learning of a few key factors; for example, one might capture “demand volatility” and another “seasonality”). The decoder then attempted to reconstruct the original demand trajectory and features from z. After training (which we performed over 100 epochs on the dataset, using the Adam optimizer and an ELBO loss with a slight weight on the KL divergence to avoid over-regularization), the VAE was able to reconstruct hold-out series with reasonable accuracy (reconstruction error ~5% of average demand). More interestingly, the latent space showed clustering: high-volatility products (like certain Foods) were in one region, while steady low-demand items (Household goods) were in another. Using this trained VAE, we generated synthetic demand trajectories by sampling latent vectors z from the prior distribution (standard normal) and feeding them to the decoder. We generated 100 such synthetic multivariate series, effectively creating 100 hypothetical future years for the supply chain. These synthetic scenarios introduced variations that were not present in the historical data—for example, one scenario had an overall upward trend in Hobbies demand (perhaps akin to a fad), while another simulated a cyclic rise and fall beyond seasonal effects. The inclusion of supplier type in the input meant the VAE could also generate plausible “what-if” cases; for example, a product usually served by a non-green supplier exhibiting the demand pattern of a typical green-served product. While these combinations might not all be realistic, they widen the exploration space for stress-testing strategies.
For solution generation, we adopted a strategy inspired by SOLVE [9]. We first ran a preliminary NSGA-II with a small population on the real data to gather a pool of half-decent solutions (not fully converged, but better than random). Taking those solutions (strategies) as training data, we used a simple autoencoder (a non-variational one for simplicity) to encode strategies into a latent space. Each strategy was represented as a binary vector of length equal to the number of product groups (with 1 for choosing the green supplier, 0 for non-green). The autoencoder had a latent size of 5 and was trained to reconstruct these strategy vectors. We found that this autoencoder learned some structure—for example, it grouped certain products together in latent dimensions, reflecting that some product groups should be switched to green or non-green in tandem to be efficient. We then decoded random latent points to generate new strategy candidates. Many of these were infeasible or dominated, but a handful turned out to be novel combinations that the initial NSGA-II run had not tried yet (especially some mixed strategies where, for example, one category was entirely green except for one high-cost item left non-green). We added these autoencoder-generated strategies to the initial population of the main NSGA-II run described next. This generative seeding of the GA helped ensure a more diverse set of starting solutions, and we observed qualitatively that it reduced the number of generations NSGA-II needed to find a well-spread Pareto front (we discuss this in the results). Generative approaches like this effectively bias the search toward regions that satisfy certain desirable criteria (in this case, learned from prior partial optimizations), which aligns with recent research on GAN-assisted EAs producing higher-quality offspring [27].
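A minimal PyTorch sketch of the demand-scenario VAE described above is shown below, assuming each input is a flattened 52-week demand window plus the supplier-type and distance features (54 values), a 2-dimensional latent space, and an ELBO loss with a down-weighted KL term; the hidden-layer size, KL weight, and class names are illustrative assumptions rather than the exact architecture used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DemandVAE(nn.Module):
    """VAE over a 52-week demand window plus supplier-type and distance features."""
    def __init__(self, input_dim=54, hidden_dim=64, latent_dim=2):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
        self.mu = nn.Linear(hidden_dim, latent_dim)
        self.logvar = nn.Linear(hidden_dim, latent_dim)
        self.dec = nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, input_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization trick
        return self.dec(z), mu, logvar

def elbo_loss(x_hat, x, mu, logvar, kl_weight=0.5):
    """Reconstruction term plus a down-weighted KL divergence to avoid over-regularization."""
    recon = F.mse_loss(x_hat, x, reduction="mean")
    kld = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl_weight * kld

# Training uses Adam over 100 epochs; scenario generation samples the standard-normal prior:
# model = DemandVAE(); opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# scenarios = model.dec(torch.randn(100, 2))   # 100 synthetic demand scenarios
```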
3.3. Multi-Objective Optimization (NSGA-II)
The core optimization problem in our study is defined as follows: determine the sourcing strategy for each product group (or aggregate group) such that total cost and total CO2 emissions over a planning horizon are minimized. We treated each product category within each state as a group—resulting in 9 groups (3 categories × 3 states, since each state has Hobbies, Foods, Household demand). This aggregation dramatically reduces the decision dimensionality (from 3049 individual products to nine groups) and is justified because products in the same category-state would likely share logistics paths (e.g., all Foods in CA could be served by a refrigerated distribution center in CA). For each group, the decision variable is binary: 0 = use non-green supplier; 1 = use green supplier as the primary source. This yields a 9-bit decision vector defining a strategy. We acknowledge that real strategies could be more complex (e.g., allowing mixing of two sources for one group), but we focused on the binary choice for clarity—mixing can be handled later by the RL agent as it decides on the fly, effectively simulating dual sourcing when beneficial. The objective functions are as follows:
Objective 1: Minimize total cost. This includes production/procurement costs, transportation costs, and penalty costs (including carbon tax if applied to emissions). For a given strategy vector, we simulate one year of operation (using the weekly demand either from actual data or a scenario) and calculate the total cost incurred. If group i is assigned to the non-green supplier, its cost = (unit_cost_nonGreen,i + transport_cost_i) × demand_i + carbon_tax × emissions_i. If assigned to the green supplier, its cost = (unit_cost_green,i + transport_cost_i) × demand_i (assuming the green supplier’s carbon tax is zero or negligible due to low emissions). Transport cost was distance-based, as described. Additionally, if any emergency shipments occurred (which would happen if the chosen supplier could not meet a particular week’s demand due to lead time or capacity issues), we added those costs (which were set very high, akin to a penalty for not having proper capacity). In practice, our simulation allowed full supply from the primary supplier with a lead time of 1 week for non-green and 0 weeks for green (green being local had immediate supply, non-green had a one-week delay). Thus, if demand spiked, the non-green strategy might have to cover that one-week gap with an emergency shipment, incurring a penalty. This is how the service level was penalized. The cost objective is summed over all weeks and all groups.
Objective 2: Minimize total CO2 emissions. Given the same simulation of one year, we compute total emissions in metric tons of CO2. For each group i, emissions_i = emission_rate_i × distance_i × (units shipped from supplier i). Green supplier groups have a lower emission_rate and sometimes a shorter distance, so strategies favoring green will reduce this sum. We included emissions from any emergency shipments as well (those typically would be by air freight, which we modeled with a very high emission factor, but such events were rare in optimized solutions). By minimizing emissions, the algorithm tries to push toward sourcing configurations that are carbon-efficient (e.g., local sources, consolidated shipments, etc.).
This is a classic two-objective minimization problem. Importantly, the two objectives conflict: the least-cost strategy is likely to use all non-green suppliers (low cost, high emissions), whereas the least-emissions strategy uses all green suppliers (high cost, low emissions). We anticipate a smooth Pareto trade-off curve between these extremes.
NSGA-II setup: We applied NSGA-II with the following parameters: population size = 100, number of generations = 200, crossover probability = 0.8, and mutation probability = 0.1 per gene (so on average 0.9 mutations per offspring on a 9-bit string). We encoded solutions as binary strings of length 9. We implemented tournament selection (size 2) with NSGA-II’s fast non-dominated sorting and crowding distance selection for survival. The initial population was partly random and partly seeded: 50 random strategies, 40 strategies generated by the aforementioned autoencoder (from preliminary runs), and 10 strategies deliberately including known extremes (all-zero, all-one, and some handcrafted mixes, e.g., one state all-green and the others all-non-green). The evaluation of each strategy required simulating the year’s weekly operations to compute cost and emissions. To speed up computation, we vectorized these simulations using NumPy—essentially computing cost and emissions group-wise via formulas rather than day-by-day loops. This made each evaluation very fast (on the order of milliseconds), allowing us to evaluate 100 × 200 = 20,000 candidate strategies in a reasonable time.
We ran NSGA-II on the actual demand data as the primary case. We also reran it on several VAE-generated demand scenarios to see if the Pareto front shifts under different conditions (this tests the robustness of strategies). Each run produced a Pareto set of ~40 non-dominated strategies. We then combined results from multiple scenarios to identify strategies that are consistently good (Pareto-optimal in multiple scenarios or insensitive to demand variation). Those robust strategies are of high interest for a manager.
After obtaining the Pareto-optimal set, we selected three representative strategies for deeper analysis and for the RL phase: one cost-centric strategy (the minimal-cost solution, which turned out to be all non-green suppliers in our case), one emission-centric strategy (the minimal-emissions solution, essentially all green suppliers), and one balanced strategy (a knee point on the Pareto curve that offers a good compromise). The balanced strategy was chosen by finding the solution with the minimum Euclidean distance to the ideal point (cost_min, emission_min) after normalizing objectives. This strategy, for instance, might use green suppliers for the categories and states where it yields large emission savings per cost penalty, and use non-green for others where green’s cost was disproportionately high for little emission gain.
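The strategy evaluation and knee-point selection described above can be expressed compactly, as in the sketch below. It reuses the hypothetical group_cost_and_emissions helper from Section 3.1 and assumes each group is described by a dictionary of demand and distance parameters; it illustrates the procedure rather than reproducing our full NSGA-II code.

```python
import numpy as np

def evaluate_strategy(bits, groups):
    """bits: length-9 binary vector (1 = green). groups: list of dicts holding each
    group's weekly demand and the distances of its green/non-green suppliers."""
    total_cost, total_emis = 0.0, 0.0
    for bit, g in zip(bits, groups):
        dist = g["dist_green"] if bit else g["dist_nongreen"]
        c, e = group_cost_and_emissions(g["weekly_demand"], green=bool(bit), distance_km=dist)
        total_cost += c
        total_emis += e
    return total_cost, total_emis            # the two objectives to minimize

def knee_point(pareto_objs):
    """Pick the Pareto solution closest to the ideal point after min-max normalization."""
    objs = np.asarray(pareto_objs, dtype=float)              # shape (n_solutions, 2)
    lo, hi = objs.min(axis=0), objs.max(axis=0)
    norm = (objs - lo) / np.where(hi > lo, hi - lo, 1.0)
    return int(np.argmin(np.linalg.norm(norm, axis=1)))      # ideal point is (0, 0) after scaling
```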
From the Pareto frontier of possible supply chain configurations (Figure 2), we selected strategy B as our baseline for reinforcement learning implementation. Strategy B represents a balanced trade-off between cost (index 1.10) and emissions (index 0.55), achieved by using green suppliers for five product groups and non-green suppliers for four product groups. This strategy was selected as it offers a 45% reduction in emissions compared to the cost-optimized strategy A, while increasing costs by only 10%.
3.4. Reinforcement Learning Experiment (DQN Agent)
To implement the adaptive decision module of our framework, we designed a reinforcement learning environment reflecting operational decisions given a fixed sourcing strategy. For clarity, we describe the RL setup for the transportation routing decision, where the agent learns when to use a faster but polluting shipping mode versus a slower cleaner mode—effectively mimicking decisions like expediting by air vs. standard truck delivery. This was framed as a simplified problem in our context: the agent decides, in each period, whether to use the primary (default) route/supplier or an alternate one.
We constructed a custom simulation environment (in OpenAI Gym style). The state space included two key components: (1) the current period demand or inventory status, and (2) the current carbon price or emission penalty factor. The inclusion of a carbon price factor in the state allows the agent to condition its actions on how “strict” the emission objective is at the moment—this can represent real-life situations like dynamic carbon pricing or emission caps that tighten over time. We discretized demand into qualitative levels (e.g., “low” or “high” demand) and carbon price into scenarios (“low tax” vs. “high tax”), yielding a small finite state set. For example, state = (HighDemand, HighCarbonPrice) is one scenario the agent might face in a period.
The action space was binary: 0 = use green route/supplier (low emissions, high immediate cost), 1 = use non-green route (higher emissions, lower immediate cost). This action could be interpreted as, for example, sending goods via a green-certified carrier (which might be slower or costlier but low emission) versus a regular carrier. In our simulation, choosing action 0 means you incur the cost rate and emission rate of the green option for that period’s shipments, while action 1 incurs those of the non-green option. Reward Design: We defined the reward
r for each period as the negative of a weighted sum of cost and emissions: r = −(cost + λ × emissions), where cost and emissions are the totals incurred in that period. Here, λ is a weight factor translating emissions into cost penalty (effectively a carbon price within the reward). By tuning λ, we can emphasize emissions to a greater or lesser extent. To align with the earlier carbon penalty concept, we made λ equal to the carbon tax used in the NSGA-II cost calculations, such that the reward directly corresponds to the negative “penalized cost”. This way, an optimal policy for reward maximization effectively minimizes the same combination of cost and carbon that the Pareto compromise solution was concerned with. We experimented with scenarios: in a low carbon price scenario (state indicates carbon price is low; λ is small), the agent might lean toward cost-saving actions; in a high carbon price scenario (λ high), the agent is incentivized to reduce emissions more strongly. By including the carbon price scenario in the state, we allow a single trained agent to adapt its behavior if carbon pricing changes (which is a likely real-world situation as regulations evolve).
The environment transitions were modeled as follows. Demand level (low or high) in the next period was sampled based on a probability (e.g., high demand with a 30% chance independently each period). The carbon price scenario could either remain constant throughout an episode or change according to a schedule (we did experiments with both: one where carbon price was fixed per episode and one where it started low and became high mid-episode to simulate a policy change). When the agent takes an action, the immediate cost and emission for that period are computed (demand quantity × respective cost/emission per unit of the chosen option), the reward is given, and the next state is revealed. We included an episode length of, for example, 52 periods (a year of weekly decisions). At the end of 52 weeks, we terminated and reset, to simulate year-by-year training cycles.
We employed a Deep Q-Network (DQN) approach for the agent. The agent’s Q-network was a simple multilayer perceptron with input size equal to the state dimension (we one-hot encoded the discrete state, so if there are four possible states, the input is a 4-d vector), one hidden layer of 16 neurons with ReLU activation, and output size = 2 (Q-value for each action). The DQN was trained by iterating episodes and using ϵ-greedy exploration (ϵ decayed from 1 to 0.1 over 500 episodes). We used a learning rate of 0.001 and discount factor γ = 0.95 (since we care about cumulative yearly cost). We also incorporated an experience replay buffer of size 10,000 and a target network update every 50 episodes to stabilize training, which are standard practices to decorrelate updates and improve convergence.
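A compact PyTorch sketch of this DQN agent is given below, using the stated dimensions (4-dimensional one-hot state input, one hidden layer of 16 ReLU units, two outputs) and hyperparameters (learning rate 0.001, γ = 0.95, replay buffer of 10,000 transitions); the class and function names, batch size, and loss choice are our illustrative assumptions.

```python
import random
from collections import deque
import torch
import torch.nn as nn

class QNet(nn.Module):
    """Q-network: one-hot state (4-d) -> 16 ReLU units -> Q-values for the 2 actions."""
    def __init__(self, n_states=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_states, 16), nn.ReLU(), nn.Linear(16, n_actions))
    def forward(self, x):
        return self.net(x)

q_net, target_net = QNet(), QNet()
target_net.load_state_dict(q_net.state_dict())
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer, gamma = deque(maxlen=10_000), 0.95

def act(state_onehot, epsilon):
    """Epsilon-greedy action selection over the one-hot encoded state."""
    if random.random() < epsilon:
        return random.randrange(2)
    with torch.no_grad():
        return int(q_net(torch.tensor(state_onehot, dtype=torch.float32)).argmax())

def td_update(batch_size=32):
    """One temporal-difference gradient step on a minibatch from the replay buffer."""
    if len(buffer) < batch_size:
        return
    s, a, r, s2, done = map(torch.tensor, zip(*random.sample(buffer, batch_size)))
    s, s2, r = s.float(), s2.float(), r.float()
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * target_net(s2).max(1).values * (1 - done.float())
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
```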
The training process involved simulating thousands of episodes of the supply chain year, under random demand realizations. The DQN gradually learned which action yields lower long-run cost + emission in each state. For instance, it learned that in the HighDemand and HighCarbonPrice state, using the green option (action 0) avoids a massive emission penalty and is worth the higher immediate cost, whereas in the LowDemand and LowCarbonPrice state, it is better to save costs (action 1) because emissions are low in absolute terms anyway. Over time, the Q-values for state–action pairs converged to reflect the differential outcomes.
We evaluated the performance of the trained RL policy against two baselines: (a) a greedy cost-only policy (always choose non-green to minimize cost, ignoring emissions), and (b) a greedy emission-only policy (always choose green). The RL policy, effectively optimized for a weighted objective, should outperform both baselines in terms of the combined metric and ideally offer a good compromise on each individual metric. We ran a set of 100 randomized test scenarios (with different demand sequences and a mix of carbon price conditions) for each policy to measure average annual costs and emissions.
All experiments were implemented in Python (version 3.10.12). For NSGA-II, we wrote custom code with NumPy; for RL, we used PyTorch (version 2.8.0+cu126) for the neural network and our own environment class. Training was conducted on a standard PC and took on the order of minutes for the NSGA-II (which was the heavier part due to 20k evaluations) and a few minutes for DQN (which is lightweight given the small state/action space). The code for the RL training is provided and formatted for easy execution in Google Colab.
3.5. Reinforcement Learning Environment Specification
The reinforcement learning component of our framework implements a Deep Q-Network (DQN) agent operating within a custom supply chain environment designed to optimize supplier selection decisions under varying demand and carbon pricing conditions. This section provides a comprehensive specification of the environment structure, training methodology, and implementation details to address the dynamic decision-making requirements of sustainable supply chain management.
3.5.1. State Space Architecture
The state space of our RL environment is designed as a compact yet informative representation that captures the essential environmental conditions affecting supply chain decisions. The state vector s ∈ ℝ² consists of two binary indicator variables, s = (carbon_flag, demand_flag), where
carbon_flag ∈ {0, 1}: binary indicator representing the carbon pricing regime, where 0 denotes low carbon price scenarios (λ = 0.5) and 1 indicates high carbon price scenarios (λ = 2.0).
demand_flag ∈ {0, 1}: binary indicator representing demand intensity, where 0 corresponds to low-demand periods (5 units) and 1 represents high-demand periods (20 units).
This state representation captures the two primary sources of uncertainty in our supply chain environment: regulatory carbon pricing volatility and demand fluctuations. The binary encoding ensures computational efficiency while maintaining sufficient information for optimal policy learning. The state space dimensionality of 2 enables rapid convergence while avoiding the curse of dimensionality that often affects high-dimensional RL applications in supply chain contexts.
3.5.2. Action Space Definition
The action space A is discrete and binary, reflecting the fundamental supplier selection decision faced by supply chain managers: A = {0, 1}, where
Action 0: select green supplier route with higher unit cost but lower emissions.
Action 1: select non-green supplier route with lower unit cost but higher emissions.
This binary action space represents the core trade-off between cost efficiency and environmental sustainability that characterizes modern supply chain decision-making. The simplicity of the action space allows for clear policy interpretation while maintaining practical relevance to real-world supplier selection scenarios.
3.5.3. Reward Function Formulation
The reward function R(s, a) implements a comprehensive cost–emission trade-off that aligns with the multi-objective optimization framework established in the previous stages. The reward is formulated as the negative total cost, incorporating both direct operational costs and carbon pricing penalties: R(s, a) = −[C_operational(s, a) + C_carbon(s, a)].
The operational cost component C_operational(s, a) is calculated as C_operational(s, a) = c_a × d(s), where c_a is the unit cost of the supplier selected by action a (higher for the green route) and d(s) is the demand level implied by demand_flag (5 or 20 units).
The carbon cost component C_carbon(s, a) incorporates emissions pricing: C_carbon(s, a) = λ(s) × e_a × d(s), where λ(s) is the carbon price implied by carbon_flag (0.5 in the low regime, 2.0 in the high regime) and e_a is the per-unit emission factor of the selected supplier (higher for the non-green route).
This reward structure ensures that the RL agent learns to balance immediate cost considerations with long-term sustainability objectives, adapting its policy based on both demand conditions and regulatory carbon pricing environments.
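As a concrete illustration of this reward structure, the snippet below evaluates R(s, a) for the discrete states and actions defined above. The demand levels (5/20 units) and carbon prices (0.5/2.0) follow the text; the per-unit costs mirror the Section 3.1 assumptions, and the per-unit emission factors are placeholder values.

```python
# Illustrative reward computation for the DQN environment described in Section 3.5.
# Unit costs and emission factors are assumed values, not the exact implementation figures.
UNIT_COST = {0: 1.5, 1: 1.0}       # action 0 = green (costlier), 1 = non-green (cheaper)
EMIS_RATE = {0: 0.5, 1: 1.0}       # assumed kg CO2 per unit; non-green emits more
DEMAND = {0: 5, 1: 20}             # demand_flag -> units per period
CARBON_PRICE = {0: 0.5, 1: 2.0}    # carbon_flag -> lambda

def reward(state, action):
    carbon_flag, demand_flag = state
    d = DEMAND[demand_flag]
    c_operational = UNIT_COST[action] * d
    c_carbon = CARBON_PRICE[carbon_flag] * EMIS_RATE[action] * d
    return -(c_operational + c_carbon)  # negative total cost, per Section 3.5.3
```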
3.5.4. Environment Dynamics and Transition Model
The environment follows a Markovian transition model where the next state depends only on the current state and environmental stochasticity. The transition dynamics are characterized by:
State Transition Function: The carbon_flag remains constant within each episode, representing a fixed regulatory scenario for the planning horizon. The demand_flag transitions stochastically according to P(demand_flag = 1) = 0.3 in each period, independent of the current demand level.
This transition probability reflects realistic demand volatility patterns observed in supply chain operations, where high-demand periods occur with moderate frequency but require adaptive response strategies.
Episode Structure: Each episode represents a planning horizon of 52 time steps, corresponding to weekly decision-making over an annual cycle. This temporal structure aligns with typical supply chain planning cycles and provides sufficient interaction length for policy learning while maintaining computational tractability.
Termination Conditions: Episodes terminate after 52 time steps or when the environment reaches a terminal state, whichever occurs first. This fixed-horizon structure ensures consistent training experiences while reflecting realistic planning constraints.
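The environment dynamics in Sections 3.5.1–3.5.4 can be summarized in a small Gym-style class, sketched below with the hypothetical reward function from the previous snippet; the class name and interface details are our assumptions.

```python
import numpy as np

class SupplyChainEnv:
    """Gym-style environment: 2-d binary state (carbon_flag, demand_flag), binary action,
    52-step episodes, carbon_flag fixed per episode, Bernoulli demand transitions."""
    def __init__(self, carbon_flag=0, p_high_demand=0.3, horizon=52, seed=42):
        self.carbon_flag = carbon_flag
        self.p_high = p_high_demand
        self.horizon = horizon
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t = 0
        self.demand_flag = int(self.rng.random() < self.p_high)   # random initial demand state
        return [self.carbon_flag, self.demand_flag]

    def step(self, action):
        r = reward([self.carbon_flag, self.demand_flag], action)  # reward sketch from Section 3.5.3
        self.t += 1
        self.demand_flag = int(self.rng.random() < self.p_high)   # stochastic demand transition
        done = self.t >= self.horizon                              # fixed 52-step planning horizon
        return [self.carbon_flag, self.demand_flag], r, done, {}
```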
3.5.5. RL Agent Configuration, Training–Testing Environment, and Performance Evaluation
The DQN implementation employs a carefully tuned configuration designed to balance learning efficiency with policy stability. The configuration of the reinforcement learning agent, including its neural network architecture, training hyperparameters, and exploration strategy, is summarized in Table A1.
This exploration schedule ensures thorough initial exploration while gradually transitioning to exploitation as the policy converges.
The environment incorporates multiple sources of stochasticity to reflect real-world supply chain uncertainty. Demand transitions follow a Bernoulli process with a probability phigh_demand = 0.3 for high-demand periods. This stochastic pattern requires the agent to develop adaptive policies capable of performing well under varying demand scenarios. Carbon pricing remains constant within episodes; however, different episodes may start with either high or low carbon pricing regimes, exposing the agent to diverse regulatory environments during training. Each episode is initialized with a randomly sampled demand state, ensuring varied starting conditions and preventing overfitting to specific initial states. To guarantee reproducibility, all random processes are controlled through fixed seeds (random.seed(42), np.random.seed(42) and torch.manual_seed(42)), ensuring consistent results across experimental runs.
The framework employs distinct environmental configurations for training and testing phases to evaluate policy robustness and generalization:
In the training environment, the carbon price is set to a low regime (λ = 0.5), with a 50% probability of high-demand periods. The agent uses an ε-greedy exploration strategy with a decaying exploration rate. Each episode spans 52 time steps, and the model is trained over 1000 episodes. In the testing environment, the carbon price is set to a high regime (λ = 2.0), maintaining the same 50% probability of high-demand periods. The policy is executed greedily (ε = 0, no exploration) over a single 52-step evaluation episode. Performance is assessed using cumulative cost and Q-value analysis.
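The training–testing split described above could be wired together as in the sketch below, reusing the hypothetical SupplyChainEnv, act, td_update, and replay-buffer objects from the earlier snippets; the one-hot state encoding and linear ε-decay schedule are assumptions consistent with the configuration reported in the text.

```python
import numpy as np

def one_hot(state):
    """Map the 2-bit state (carbon_flag, demand_flag) to a 4-d one-hot vector."""
    idx = state[0] * 2 + state[1]
    return np.eye(4, dtype=np.float32)[idx]

# Training: low carbon regime, 1000 episodes, epsilon decaying from 1.0 to 0.1 over 500 episodes
train_env = SupplyChainEnv(carbon_flag=0, p_high_demand=0.5)
epsilon = 1.0
for episode in range(1000):
    s, done = one_hot(train_env.reset()), False
    while not done:
        a = act(s, epsilon)
        s2_raw, r, done, _ = train_env.step(a)
        s2 = one_hot(s2_raw)
        buffer.append((s.tolist(), a, r, s2.tolist(), done))
        td_update()
        s = s2
    epsilon = max(0.1, epsilon - (1.0 - 0.1) / 500)           # assumed linear decay schedule
    if episode % 50 == 0:
        target_net.load_state_dict(q_net.state_dict())        # periodic target network sync

# Testing: high carbon regime, single greedy 52-step evaluation episode
test_env = SupplyChainEnv(carbon_flag=1, p_high_demand=0.5)
s, done, total_cost = one_hot(test_env.reset()), False, 0.0
while not done:
    s_raw, r, done, _ = test_env.step(act(s, epsilon=0.0))
    total_cost += -r                                          # reward is negative cost
    s = one_hot(s_raw)
```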
This training–testing differentiation serves two critical purposes: first, it evaluates the agent’s ability to adapt to more stringent carbon pricing conditions than those encountered during training, demonstrating policy robustness; second, it simulates the realistic scenario where regulatory environments may become more restrictive over time, requiring adaptive supply chain strategies.
The training process employs implicit convergence criteria through the fixed episode count and exploration decay schedule. Convergence is assessed through multiple indicators:
Policy Convergence: Monitored through Q-value stability across different state configurations. Post-training analysis examines Q-values for representative states:
Low demand, low carbon price state: [0, 0]
High demand, low carbon price state: [0, 1]
Performance Metrics: The trained policy is evaluated using cumulative cost over the testing horizon, providing a direct measure of economic performance under the learned strategy.
Robustness Assessment: Policy performance is tested under high carbon price scenarios to evaluate adaptability to regulatory changes not encountered during training.
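For the policy-convergence check, the Q-values of the representative states can be read directly from the trained network, as in the short sketch below (again using the hypothetical q_net and one_hot helpers).

```python
import torch

# Post-training Q-value readout for states encoded as [carbon_flag, demand_flag]
for state in ([0, 0], [0, 1], [1, 0], [1, 1]):
    with torch.no_grad():
        q = q_net(torch.tensor(one_hot(state)))
    print(state, f"Q(green) = {q[0].item():.3f}, Q(non-green) = {q[1].item():.3f}")
```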
This comprehensive RL environment specification provides the foundation for learning adaptive supply chain policies that balance cost efficiency with sustainability objectives, addressing the dynamic decision-making requirements identified in our multi-objective optimization framework.
4. Results
Before launching the multi-objective optimization, we performed a basic comparison between the current operational strategy (baseline) and a naive AI-generated alternative based on autoencoder outputs. As shown in Figure 3, the AI-generated strategy reduced average delivery costs significantly (by ~33%), while maintaining equivalent CO2 emissions.
This finding confirms that autoencoder-generated strategies can already yield cost improvements without compromising sustainability in simple scenarios. However, the fact that emissions remained almost unchanged indicated that the model was optimizing primarily for cost. Consequently, this reinforced the need for a dedicated multi-objective optimization approach that explicitly considers both cost and emissions as objectives. The following section addresses this through the application of NSGA-II to uncover the trade-offs and generate a full Pareto frontier of efficient strategies.
4.1. Pareto-Optimal Strategies for Cost vs. Emissions
The NSGA-II multi-objective optimization produced a clear Pareto frontier of solutions, illuminating the trade-off between cost and sustainability. Figure 4 illustrates the Pareto front from our main run (using actual M5 demand for year 2015 as input). On one axis is the total annual cost (in normalized units), and on the other is total CO2 emissions (in tons).
As expected, the frontier is downward sloping (has a convex shape): as we allow a higher cost, we can achieve lower emissions. The extreme strategies were as follows:
All Non-Green (cost-optimal): This strategy had all nine groups sourced from non-green suppliers. It yielded the lowest cost—set as an index of 1.00 for reference (approximately representing, for example, a USD 320 million annual cost for the whole system)—but the highest emissions, about 1.00 in normalized units (e.g., 50,000 tons of CO2). This solution corresponds to business-as-usual with no regard for sustainability.
All Green (emission-optimal): All nine groups use green suppliers exclusively. Emissions in this scenario dropped dramatically (to ~0.30 normalized, or ~15,000 tons CO2, a 70% reduction from all non-green) due to shorter distances and cleaner transport. However, the total cost was about 1.45 (45% higher than the all-non-green cost). This significant cost increase reflects the premium for sustainable sourcing in our assumptions. Such a solution might only be chosen if emissions reduction is paramount or if carbon pricing in future effectively makes it economical.
Between these extremes lies a continuum. We highlight one balanced strategy (marked as point B in Figure 4), which had a cost index of 1.10 (~10% cost increase) and emissions of ~0.55 (a 45% reduction in emissions). This strategy sourced five of the nine groups from green suppliers (specifically, the Foods and Household categories in CA and TX, and Household in WI were green; the others remained non-green). These choices align with intuition: Foods in CA/TX had large volumes and relatively moderate green cost premiums, so switching them to green cut a lot of emissions for not too much cost. The Hobbies category, in contrast, had sporadic demand and high green cost, so the optimizer left those as non-green in the balanced solution.
We observed that NSGA-II maintained a high diversity in the population. The final Pareto set included solutions with anywhere from three to nine green suppliers. No single solution dominated others completely on both metrics, confirming the necessity of multi-objective analysis. The convergence was measured by the spread of the Pareto front over generations. By generation ~50, the algorithm had found near-optimal extremes (all-green and all-non-green were trivial, but also some intermediate ones). It took until about generation ~120 for the middle part of the Pareto front to fill in and stabilize. After 200 generations, the improvement in both objectives had leveled off, indicating convergence. The generative seeding using the autoencoder likely helped the algorithm quickly find some decent mid-range solutions early (some were already present in the initial population). Without seeding, a control run showed it took ~150 generations to discover a similar knee solution, whereas with seeding it found it by ~80 generations. This suggests the VAE/autoencoder assistance roughly halved the search effort for that region, which is consistent with the idea that a learned latent space can guide the search to better regions faster [9].
When running the optimization on different demand scenarios generated by the VAE, we found that the Pareto fronts shifted slightly (as demand levels changed total costs and emissions), but their general shape remained similar. Interestingly, certain strategies remained efficient across scenarios. For example, the aforementioned balanced strategy (B) was Pareto-optimal in eight out of ten tested scenarios, indicating that it is robust to demand fluctuations. On the other hand, one strategy that was Pareto-optimal in the base case (sourcing WI-Hobbies green while others were non-green) became dominated in a scenario where Hobbies demand dropped (making green less justifiable). This kind of stress-test underscores the value of having multiple Pareto options: a company may prefer a strategy that is slightly suboptimal in the expected case if it performs much better in worst-case scenarios. Overall, the multi-objective results confirm that significant emission reductions (40–50%) can be achieved for a relatively small cost increase (10–15%) by intelligently selecting which suppliers to “green.” This is a key insight for managers concerned that sustainability always comes at a prohibitive cost—our analysis suggests the trade-off curve is quite favorable up to a point (beyond which costs do rise steeply for further small emission gains).
4.2. Reinforcement Learning Policy Performance
After selecting the balanced strategy (B) for implementation, we deployed the reinforcement learning agent to manage weekly shipments under that strategy. Essentially, for each of the five groups designated as green in strategy B, the RL agent would mostly use those green suppliers; for the four non-green groups, the agent used non-green normally—however, in either case, the agent had the option each week to deviate (e.g., use a non-green supplier for a green-designated group or vice versa for a non-green group) if it decided the situation warranted it. This effectively allowed a dynamic mix even though the static strategy fixed a primary supplier. In practice, we gave the agent two actions for each group each week: stick to the plan or use an alternate. But to keep things simple and aggregated, we simulated that for each week and each state scenario (low/high demand, low/high carbon price) the agent had one binary decision affecting all groups uniformly (one can interpret it as a fleet routing decision: e.g., dispatch extra non-green trucks network-wide or not). This abstraction preserved the essence: if the agent chooses the “non-green route” in a week, it means it leans on cheaper transport more heavily system-wide that week; if it chooses the “green route,” it abides strictly by sustainable modes that week.
We trained the DQN agent under a scenario where carbon price was sometimes low for the first half of the year and high for the second half. The agent learned a nuanced policy: in the first half (low carbon cost), it used the non-green option in ~70% of weeks, especially when demand was high (to save cost during peaks). In the second half (high carbon cost), it flipped to using green in ~80% of weeks, especially in high-demand weeks (to avoid large carbon penalties on those high volumes). The learned policy effectively times its use of the alternate supplier to when it is most beneficial.
To illustrate the impact, Figure 5 presents a comparison of cumulative cost and emissions for three execution approaches over a sample year: (a) fixed strategy B without RL—i.e., always use the designated supplier (green for five groups, non-green for four, with no deviation or expediting except in emergencies); (b) adaptive RL policy—which occasionally switches routes based on state; and (c) fixed all-green strategy—as a reference if we had chosen the pure green strategy. By year-end, the fixed B approach achieved the expected cost (~1.10) and emissions (~0.55) as designed. The all-green approach had emissions of ~0.30 but a cost of ~1.45, as noted.
The adaptive RL policy interestingly managed to further reduce emissions to ~0.45 while keeping cost around ~1.15. In other words, by smartly adjusting decisions week by week, the RL agent improved the emissions outcome by an additional ~10 percentage points with only ~5 percentage points extra cost compared to the static strategy B. This puts it closer to the all-green emission level, yet far cheaper than all-green in cost. How did it achieve this? On analyzing the RL agent’s actions, we found it was effectively using the non-green (cheaper) option only when it had minimal impact on emissions. For instance, in low-demand weeks (when absolute emissions would be low even if using non-green), the agent would take advantage of the cost savings of non-green. But in high-demand weeks or when carbon price was high, the agent dutifully stuck to green to avoid big emissions spikes. The static strategy B, in contrast, would always use non-green for the four groups designated non-green regardless of demand conditions—meaning in a high-demand week in a non-green group, emissions would shoot up. The RL agent mitigated that by occasionally routing some shipments via a green alternate (perhaps incurring overtime or extra cost, but reducing emissions). Essentially, RL introduced the temporal flexibility that a static plan lacks. This demonstrates the value of our adaptive module: even after picking a good static design, there is room to dynamically optimize within that design.
From a cost perspective, the RL policy did incur slightly higher costs than static B (1.15 vs. 1.10 index), because it sometimes chose costlier actions. But it is still far cheaper than the all-green plan. If we compare the RL policy to the cost-only baseline (all non-green always, which had a cost of 1.00 and emissions of 1.0), the RL approach achieved a 55% emission reduction for a 15% cost increase—a trade-off that is arguably beneficial if carbon costs or regulations are in place. Moreover, if carbon pricing exists, that 15% cost increase might actually be offset or even reversed by avoiding carbon taxes (depending on how one accounts for them—our cost already included them, but in a scenario where carbon cost is external, the RL approach avoids a lot of that external cost).
We also evaluated the resilience of the RL policy under an extreme scenario: a sudden supply disruption to one of the green suppliers (e.g., the CA Foods green supplier goes offline for 4 weeks). The RL agent detected increased cost (or lower reward) from staying with green (which was failing and causing emergency shipments) and it swiftly switched to the non-green supplier for that period, maintaining supply continuity. Once the green supplier recovered, the agent reverted to using it. The static strategy B would have been stuck and incurred many emergency penalties or shortages during that disruption (since by design it would not use the non-green backup unless an emergency occurred each time). This showcases the adaptability of the RL agent beyond just cost–emission optimization—it can handle unplanned events by re-optimizing actions in real time.
4.3. Statistical Validation of Strategy Performance Differences
Our statistical analysis confirms significant performance differences across strategies (Table A2). ANOVA results show that cost efficiency varies across approaches (F = 12.100, p = 0.037), while the impact on emissions is even stronger (F = 75.000, p = 0.003), indicating that strategy choice influences environmental outcomes about six times more than cost performance.
Pairwise comparisons reveal that the RL policy and fixed strategy B have statistically equivalent cost results (t = 0.000, p = 1.000), while the RL policy significantly outperforms the all-green strategy in cost efficiency (t = –11.000, p = 0.008) without compromising emissions. These findings validate the hypothesis that adaptive decision-making can balance cost and environmental goals more effectively than static or purely environmental strategies.
The correlation analysis results indicate a strong negative correlation (−0.824) between the Cost_Index and Emission_Index, meaning that as total costs (including carbon penalties) increase, emissions tend to decrease, and vice versa. This relationship suggests that, within the simulated scenarios, strategies associated with higher costs generally lead to lower emissions—for instance, “green” options incur greater expenses but achieve cleaner outcomes, whereas “non-green” options reduce costs at the expense of higher emissions. Overall, the findings highlight a clear trade-off between cost efficiency and environmental performance: lowering emissions requires accepting higher costs, while minimizing costs typically results in increased emissions.
The multi-panel visualizations in Figure 6 highlight the comparative performance of the evaluated strategies across multiple operational dimensions. Panel A shows that the RL policy consistently outperforms static strategies under varying demand and carbon price scenarios. The RL curve adapts to fluctuations without significant performance degradation, whereas fixed strategies display marked sensitivity to environmental changes. Panel B presents a side-by-side distributional comparison, confirming the RL policy’s lower variance in both cost and emission performance, a sign of higher operational robustness. Panel C, the radar chart, summarizes multi-dimensional performance—cost efficiency, emission reduction, adaptability, and consistency—revealing the RL policy’s broader advantage over the fixed strategy and highlighting adaptability as the most pronounced differentiator. Panel D depicts the trade-off surface between cost and emissions, showing that the RL policy occupies a favorable position on the Pareto frontier, while fixed strategies cluster toward either cost minimization or emission minimization but fail to balance both objectives effectively. The remaining panels are presented in Figure A1 in Appendix A. Taken together, these visualizations reinforce the statistical results reported in Table A2, confirming that adaptive, learning-based approaches outperform static optimization strategies, particularly in dynamic, multi-objective environments.
4.4. Behavioral Consistency and Decision Reliability Analysis
The behavioral consistency analysis of the reinforcement learning agent reveals remarkable stability in decision-making patterns, addressing critical concerns about the reliability of AI-driven supply chain systems. Across 50 independent evaluation runs with controlled stochastic variations, the agent demonstrated perfect action consistency (100.0%) in all four primary state configurations, indicating that the learned policy has converged to stable, deterministic decision rules despite the inherent uncertainty in the training environment.
The Q-value analysis provides quantitative evidence of the agent’s learned decision logic. In the low carbon price, low-demand state (LC-LD), the agent exhibits Q-values of −197.655 ± 0.247 for green supplier selection and −196.042 ± 0.225 for non-green selection. The negative values reflect the cost-minimization objective, while the small standard deviations (0.247 and 0.225) demonstrate exceptional consistency in value estimation across multiple evaluations. The preference for non-green suppliers in this state (higher Q-value of −196.042) reflects economically rational behavior when environmental penalties are minimal and demand volumes are low.
The behavioral pattern shifts dramatically under high carbon price conditions. In the high carbon price, low-demand state (HC-LD), the Q-values become −196.579 ± 0.223 for green and −194.239 ± 0.203 for non-green suppliers. The reversal in preference (green suppliers now preferred with higher Q-value) demonstrates that the agent has successfully learned to respond to carbon pricing signals, automatically adjusting its strategy to minimize total cost including environmental penalties.
The demand sensitivity analysis reveals sophisticated learned behavior that extends beyond simple cost minimization. Under high-demand conditions, the Q-value differences between green and non-green options increase substantially (from 1.613 in LC-LD to 4.562 in HC-HD), indicating that the agent has learned to be more decisive when volume impacts are significant. This volume-sensitive decision-making represents emergent intelligence that was not explicitly programmed but arose naturally from the reward structure and temporal learning process.
4.5. Correlation Analysis and Trade-Off Dynamics
The correlation analysis between cost and emission indices reveals a strong negative relationship (r = −0.824), confirming the fundamental trade-off structure underlying supply chain sustainability decisions. However, this correlation coefficient provides insights that extend beyond the obvious inverse relationship between cost and environmental performance. The correlation strength of −0.824 indicates that approximately 68% of the variance in emission performance can be explained by cost considerations, while the remaining 32% represents optimization opportunities that can be exploited through intelligent strategy selection.
This correlation structure has profound implications for supply chain strategy development. Traditional approaches that treat cost and emissions as perfectly inversely correlated (r = −1.0) miss significant optimization opportunities represented by the 32% unexplained variance. The RL policy’s superior performance can be attributed to its ability to exploit these correlation gaps through temporal arbitrage and state-dependent decision-making.
The imperfect correlation also explains why the all-green strategy, despite achieving the best emission performance, fails to dominate the other strategies in multi-objective terms. The correlation analysis suggests that the final 30% of emission reduction (from 0.45 to 0.30 on the emission index) requires disproportionate cost increases, pushing the strategy outside the practically attractive region of the efficient frontier for most applications.
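For reference, the explained-variance figures follow directly from the Pearson correlation; a minimal sketch of the calculation, assuming paired arrays of cost and emission indices from the evaluated strategies, is shown below.

import numpy as np

def tradeoff_correlation(cost_index, emission_index):
    """Pearson correlation between paired cost and emission indices, the share
    of variance 'explained' by cost (r squared), and the residual headroom."""
    r = np.corrcoef(cost_index, emission_index)[0, 1]
    return r, r ** 2, 1.0 - r ** 2

# With the study's paired evaluations this yields r ~ -0.824, i.e., roughly
# 68% explained variance and 32% headroom for intelligent strategy selection.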
4.6. Sensitivity Analysis and Robustness Validation
The sensitivity analysis reveals differential robustness characteristics across strategies that explain their relative performance under uncertainty. The carbon price sensitivity analysis demonstrates that the RL policy maintains superior performance across a wide range of carbon pricing scenarios, with performance degradation of only 10% when carbon prices increase by 300%. In contrast, fixed strategy B experiences 25% performance degradation under identical conditions, highlighting the value of adaptive decision-making under regulatory uncertainty.
The demand volatility analysis provides even more compelling evidence of the RL approach’s superiority under uncertainty. As demand volatility increases from baseline to 150% of normal levels, the RL policy actually improves its relative performance, gaining approximately a 5% efficiency advantage over static strategies. This counter-intuitive result reflects the agent’s learned ability to exploit demand variability through opportunistic supplier selection, treating volatility as an optimization resource rather than merely a constraint.
The green cost premium sensitivity analysis reveals the boundary conditions under which different strategies remain viable. When green supplier cost premiums exceed 200% of baseline levels, even the RL policy begins to converge toward non-green supplier selection, indicating rational economic limits to environmental optimization. However, the RL approach maintains superior performance across the entire feasible range, suggesting robust applicability across diverse market conditions.
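The sensitivity sweeps above follow a common pattern: hold the trained policy fixed and re-evaluate it as one environmental parameter is scaled. The sketch below shows that pattern for carbon prices; make_env and evaluate are hypothetical wrappers around the study’s simulator, not functions from its codebase.

def carbon_price_sweep(policy, make_env, evaluate,
                       multipliers=(1.0, 1.5, 2.0, 3.0, 4.0), episodes=100):
    """Relative performance degradation of a fixed policy under scaled carbon prices.

    make_env(multiplier)            -> simulation environment with the carbon
                                       price multiplied by `multiplier`
    evaluate(policy, env, episodes) -> mean performance score (higher is better)
    """
    baseline = evaluate(policy, make_env(1.0), episodes)
    degradation = {}
    for m in multipliers:
        score = evaluate(policy, make_env(m), episodes)
        degradation[m] = (baseline - score) / abs(baseline)  # e.g., 0.10 = 10% loss
    return degradation

The same loop, with demand volatility or the green cost premium as the scaled parameter, reproduces the other two analyses.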
4.7. Multi-Dimensional Performance Analysis
The radar chart analysis reveals that the RL policy achieves superior performance across multiple dimensions simultaneously, rather than simply optimizing a single objective. The framework demonstrates 92% efficiency in emission reduction compared to 75% for fixed strategy B, while maintaining 85% cost efficiency versus 90% for the fixed approach. This 5% cost efficiency sacrifice yields a 17% emission efficiency gain, representing a highly favorable trade-off ratio of 3.4:1.
The adaptability dimension shows the most dramatic difference, with the RL policy achieving 95% adaptability compared to 60% for static strategies. This adaptability advantage translates directly into superior performance under the uncertainty conditions that characterize real-world supply chain operations. The consistency dimension (88% for RL vs. 70% for fixed strategies) validates that adaptive approaches can maintain reliable performance without sacrificing flexibility.
4.8. Risk–Return Analysis and Portfolio Implications
The risk–return analysis positions the RL policy as the dominant strategy across multiple risk tolerance levels. With a performance return of 0.85 and a risk level of 0.12, the RL approach achieves a Sharpe-style return-to-risk ratio of approximately 7.08 (0.85/0.12), substantially higher than fixed strategy B’s ratio of 5.00 (0.75 return, 0.15 risk). This risk-adjusted advantage indicates that the RL policy provides superior value even for risk-averse decision-makers.
The all-green strategy exhibits the highest risk level (0.25) despite its focused environmental objective, reflecting the vulnerability of extreme strategies to operational disruptions. The high risk stems from capacity constraints and supply concentration, which create brittleness under stress conditions. The RL policy’s moderate risk profile (0.12) reflects its diversified decision-making approach that maintains multiple strategic options.
4.9. Learning Curve Analysis and Convergence Characteristics
The learning curve analysis reveals distinct convergence patterns across the three AI components that explain the framework’s overall performance characteristics. The VAE component achieves 90% of its final performance within 200 training epochs, reflecting the relatively straightforward nature of scenario generation compared to optimization tasks. The NSGA-II optimization demonstrates faster initial convergence but requires approximately 150 generations to achieve stable Pareto frontier identification.
The RL component exhibits the most complex learning dynamics, requiring approximately 300 episodes to achieve 90% of final performance. However, the RL learning curve shows continued improvement beyond this point, suggesting that extended training could yield additional performance gains. The different convergence rates have important implications for practical implementation, indicating that RL components may require more extensive training but offer greater long-term improvement potential.
4.10. Scenario Robustness and Stress Testing Results
The scenario robustness analysis demonstrates that the RL policy maintains superior performance across diverse operational conditions that extend well beyond normal operating parameters. Under high-volatility scenarios (200% of baseline demand variation), the RL policy experiences only 8% performance degradation compared to 25% for fixed strategy B. This robustness advantage becomes even more pronounced under supply disruption scenarios, where the RL approach maintains 88% of baseline performance while fixed strategies drop to 65% efficiency.
The carbon shock scenario provides particularly compelling evidence of adaptive strategy value. When carbon pricing increases suddenly by 300%, the RL policy adjusts within 3.2 weeks to achieve new optimal performance levels, while fixed strategies require complete re-optimization to adapt to the new regulatory environment. This adaptation speed represents a critical competitive advantage in dynamic regulatory environments.
4.11. Implementation Complexity and Performance Trade-Offs
The complexity–performance analysis reveals that the RL policy occupies an optimal position on the efficiency frontier, achieving 95% performance with moderate implementation complexity (6 on a 10-point scale). This positioning compares favorably to manual approaches (60% performance, complexity 1) and full framework implementations (95% performance, complexity 9), suggesting that RL-based approaches provide the optimal balance for most practical applications.
The implementation complexity analysis also reveals that performance gains exhibit diminishing returns beyond the RL policy level. Moving from the RL policy to full framework implementation increases complexity by 50% while providing minimal additional performance benefits, indicating that the RL approach represents a practical optimum for most supply chain applications.
4.12. Strategic Implications and Decision Framework
The comprehensive analysis results provide clear guidance for supply chain strategy selection under different operational contexts. Organizations operating in stable regulatory environments with predictable demand patterns may find fixed strategy B adequate, achieving 90% of optimal performance with minimal implementation complexity. However, organizations facing regulatory uncertainty, demand volatility, or competitive pressure should prioritize RL-based approaches despite higher implementation requirements.
The correlation analysis (r = −0.824) indicates that approximately one-third of emission reduction opportunities remain unexploited by traditional optimization approaches, representing significant value creation potential for organizations that can implement adaptive decision-making systems. The statistical significance of performance differences (p < 0.05 for cost, p < 0.01 for emissions) provides confidence that these advantages will persist across diverse operational contexts.
The behavioral consistency results (100% action consistency) address practical concerns about AI system reliability, demonstrating that properly trained RL agents can provide deterministic, explainable decision-making that meets enterprise reliability requirements. The Q-value analysis provides transparency into decision logic that enables human oversight and validation of AI-driven recommendations.
This analytical interpretation of the results demonstrates that the observed performance advantages stem from fundamental algorithmic capabilities rather than experimental artifacts, providing robust evidence for the practical value of AI-driven supply chain optimization approaches. The statistical validation, behavioral analysis, and robustness testing collectively support the conclusion that adaptive decision-making systems can resolve traditional cost–environment trade-offs through intelligent temporal optimization and state-dependent strategy selection.
4.13. Theoretical and Managerial Interpretation of Results
From a technical perspective, these results validate that combining generative modeling, multi-objective EA, and RL yields superior solutions compared with using any single method alone. The Pareto analysis quantifies the trade-offs and provides multiple efficient frontiers; the RL agent then ensures that whichever point on that frontier is chosen, the system operates as efficiently as possible in practice. Notably, the integrated approach produced outcomes (like the adaptive policy) that dominate the naive strategies. In fact, if we consider the space of policies (rather than static strategies) as a higher-level optimization, the RL policy under strategy B might be non-dominated when considering long-run cost and emissions together—it achieved a combination of cost and emissions that no static policy achieved in our experiments. This suggests that there exists a Pareto frontier of dynamic policies that lies “beyond” the Pareto frontier of static strategies, thanks to adaptation. This is an important theoretical insight: given uncertainties or temporal variability, the efficient frontier can be improved by incorporating state-contingent decisions (policies) rather than static decisions. It connects to the literature on dynamic multi-objective optimization (DMOO) and indicates that our approach effectively tackled a DMOO by decoupling it: NSGA-II for static trade-offs and RL for dynamic response—an approach recently advocated as synergistic [
2].
On the managerial side, our results provide concrete evidence that sustainability and cost efficiency need not be mutually exclusive goals. For a modest increase in cost, substantial reductions in carbon footprint are attainable. Managers often fear that going green will significantly erode profit margins; our balanced solution shows that careful selection of which suppliers or processes to make green can yield much of the benefit at a fraction of the cost. Furthermore, the RL agent’s success demonstrates the value of real-time decision support systems. Managers can rely on such AI agents to automatically make day-to-day adjustments (such as rerouting shipments in response to congestion or an emissions cap) that align with the company’s strategic objectives. This reduces the need for manual intervention and guesswork when the environment changes. Importantly, the RL agent’s performance under disruption indicates improved resilience: an AI-driven supply chain could respond to disruptions faster than human re-planning, potentially avoiding losses.
Finally, the generative component, while not yielding a separate “result” of its own, contributed indirectly by ensuring our strategies were evaluated against many scenarios. In one instance, a strategy that looked good on the nominal data (slightly cheaper and almost as low-emission as strategy B) turned out to fail in a scenario with higher overall demand growth. The VAE had generated that scenario, and we discovered that the strategy would overload a green supplier’s capacity. We therefore dropped that strategy in favor of B, which was more robust. A manager using our system would benefit from strategies that are stress-tested against a range of possible futures, akin to an enriched Monte Carlo analysis, since the generative model can propose structured futures rather than only random draws from a fixed distribution. This reduces the risk of adopting a strategy that works only under a narrow set of assumptions.
5. Discussion
Our study’s findings reinforce and extend existing knowledge at the intersection of AI and sustainable operations. In this section, we discuss the implications, novelty, and limitations of our work in context, and outline directions for future research.
5.1. Technical Implications and Novelty
From a technical standpoint, the integration of generative modeling, NSGA-II, and RL is a novel contribution that showcases the complementary strengths of these AI approaches. The generative VAE improved the search process by injecting domain-informed variety. Instead of blindly random initialization, the evolutionary optimizer started with a set of candidates that were biased toward plausibility, as suggested by historical patterns. This aligns with recent advances in heuristic initialization and surrogate modeling in EAs [
9], and our results (faster convergence, better diversity) echo those benefits. The multi-objective optimization with NSGA-II provided a global view of the trade-off landscape, which is critical in multi-criteria problems. We confirmed NSGA-II’s effectiveness in supply chain design tasks, consistent with its widespread use in the literature [
2]. More interestingly, by feeding NSGA-II’s output into the RL stage, we demonstrated how offline optimization can guide online optimization. Typically, RL would have to learn trade-offs implicitly by experiencing them, but we effectively baked in some high-level trade-off understanding by selecting a specific strategy for it to implement. This is a form of guided RL, which could be powerful in complex problems: rather than learning from scratch, the agent is fine-tuning an already reasonably good policy (the static strategy).
One could view our approach as a practical realization of multi-objective reinforcement learning (MORL) through decomposition: NSGA-II finds a set of optimal trade-off static policies, and the RL agent then operates within one policy, potentially switching between static policies if extended (in fact, the literature suggests using multiple policies and switching [
2], which we could consider in the future). Our RL agent essentially learned to switch between two “policies” (green vs. non-green routing) based on state, which is a primitive version of the idea of having a set of policies for different conditions [
2].
The novelty here is also methodological: most sustainable supply chain studies either do offline planning (e.g., optimize network design for cost and emissions) or online control (e.g., use RL to manage inventory with carbon penalties), but rarely both. We provide a blueprint for marrying them. By doing so, we address the full spectrum from strategic to operational. The theoretical contribution of this work is proposing that optimal strategies in a static sense may be suboptimal when evaluated in a dynamic sense, and that an integrated approach can yield a superior dynamic strategy. This hints at an interesting research area: optimizing for adaptability—not just finding a Pareto front of static solutions, but solutions that are resilient under adaptation.
Furthermore, our results resonate with the concept of adaptive supply chain optimization (also called closed-loop or real-time optimization). We effectively closed the loop by using outcome feedback (rewards from environment) to adjust decisions continuously. This points to AI enabling supply chains that self-optimize: the strategy gives initial structure (like network design and supplier selection), and an AI agent in the execution loop tweaks the flows to meet performance goals. Future AI-enabled supply chains might use digital twin simulations (a form of generative model) plus optimization plus RL controlling the actual system—very similar to what we tested in silico.
On the technical side, one implication is the need for new evaluation metrics. While we plotted a static Pareto front, one could conceive a dynamic Pareto frontier for policies. How to compute and visualize that is a question—one might use MORL algorithms to directly approximate the set of Pareto-optimal policies (with different emissions weightings). Our approach approximated one point on that set via NSGA-II + RL. This fits into the broader literature on multi-objective sequential decision-making [
30], which emphasizes the benefit of policy adaptation over static optimization. Another technical insight is computational: combining these methods does raise computational demands (though still feasible in our case). Training an RL agent for each candidate strategy would be expensive, so we only trained for the chosen one. In the future, one could train a single agent that takes the desired emission preference as part of the state (thereby effectively learning a parameterized set of policies), addressing multi-objective preferences in one RL model—a concept known in MORL for handling preference vectors.
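A minimal sketch of that preference-conditioned idea, assuming a PyTorch implementation with hypothetical dimensions (this was not implemented in the present study), is:

import torch
import torch.nn as nn

class PreferenceConditionedDQN(nn.Module):
    """DQN whose input concatenates the environment state with a scalar
    emission-preference weight w in [0, 1]; w = 0 corresponds to pure cost
    minimization and larger w weights emissions more heavily."""

    def __init__(self, state_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, w.unsqueeze(-1)], dim=-1))

# During training, w would be sampled per episode and the reward shaped
# accordingly, e.g., reward = -((1 - w) * cost + w * carbon_price * emissions),
# so a single network approximates a family of trade-off policies.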
5.2. Managerial Implications
For practitioners and supply chain managers, our study offers several important takeaways:
The Pareto frontier we identified provides a menu of options to decision-makers. Instead of arguing over one “optimal” solution, managers can clearly see the cost of improving sustainability. For example, moving from 0% to 40% emission reduction cost ~10% more, but the next 40% reduction cost an additional ~30%. This information is vital in setting realistic sustainability targets. A company might aim for that first 40% cut because it is affordable, and reconsider if pushing for net-zero emissions is worth the steep cost. Having these quantitative trade-offs helps in boardroom discussions and aligning supply chain strategy with corporate sustainability goals.
Our results suggest that it is not necessary (and perhaps wasteful) to “green” every part of the supply chain. Instead, identify the product categories or lanes where sustainable alternatives give the greatest benefit per unit of additional cost. In our case, high-volume items with a moderate cost difference were ideal candidates for switching to green. Managers can use analytics (like sensitivity analysis from our model) to pinpoint which suppliers or routes to convert to low-carbon alternatives first. This targeted approach can achieve most of the environmental benefit at a fraction of the cost of a full overhaul. It aligns with the common business practice of targeting low-hanging fruit and phasing sustainability roadmaps.
The reinforcement learning agent’s performance underscores how AI can assist in day-to-day decisions that are too complex for static rules. Managers often set policies like “always use rail for warehouse transfers” or “use express shipping if inventory < X”. An RL agent can learn these thresholds and rules by itself, potentially finding non-intuitive patterns (e.g., “if it is end of quarter and emissions are above target, favor green options more aggressively”). We saw the agent essentially doing a form of threshold policy (conditioned on demand and carbon price) that a human might not guess immediately. Thus, deploying such agents in supply chain execution (maybe as part of software like Transportation Management Systems or Warehouse Management Systems) could yield cost and emission efficiencies continuously. The manager’s role then shifts to supervising the AI and handling exceptions, rather than micro-managing orders.
Another managerial implication is improved resilience. Because our AI agent can adapt to disruptions, the supply chain is more robust. In times when supply chain risk is top of mind (e.g., trade wars, pandemics), having an AI that can quickly reroute or change sourcing in response to shocks is incredibly valuable. Managers should note that building in flexibility (like having alternate suppliers or carriers) and letting an AI decide when to use them can mitigate risk without manual firefighting each time. However, the AI needs to be trained on such scenarios (which is where generative scenario modeling helps—we can simulate disruptions to teach the RL agent what to do).
Our approach naturally incorporates carbon costs into financial metrics, which is useful as companies internalize carbon pricing. The results show how an internal carbon price (made explicit in the reward function) changes decisions. If a company anticipates future regulation, it can train the RL agent with a higher carbon cost to see how operations would change, effectively preparing for stricter regimes. This helps in long-term planning: for instance, “if the carbon tax doubles in 5 years, our agent suggests shifting two more product lines to local suppliers to remain efficient.” It provides a quantitative way to plan sustainability transitions.
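For illustration, an internal carbon price enters the per-period reward roughly as follows (a minimal sketch with hypothetical argument names; the study’s actual reward may include additional terms):

def step_reward(order_cost, delivery_cost, emissions_kg, carbon_price_per_kg,
                service_penalty=0.0):
    """Negative total cost for one decision period, with emissions monetized
    through an internal carbon price."""
    total_cost = (order_cost + delivery_cost
                  + carbon_price_per_kg * emissions_kg
                  + service_penalty)
    return -total_cost

# Retraining the agent with a higher carbon_price_per_kg (e.g., an anticipated
# future tax level) shows how operational decisions would shift under stricter
# regulation.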
Managers should also be aware that such AI systems are decision-support tools, not replacements. The Pareto solutions still require human choice of which strategy to implement. That choice will depend on external factors (corporate sustainability commitments, customer pressure, etc.) which our model does not encode. Once chosen, the RL agent operates, but managers must monitor KPIs and can override or retrain the agent if it is not aligning with updated priorities. For example, if a sudden corporate goal is to achieve 50% emissions reduction regardless of cost, managers might pick a more emission-leaning Pareto solution or increase the carbon penalty in the RL reward to force the agent to be greener. Thus, this system should be seen as augmenting human decision-making with powerful computation—the final decisions remain in human hands, now backed by data.
5.3. Limitations
Despite the comprehensive approach, our study has limitations that warrant discussion.
We relied on synthetic assumptions for key supply chain parameters (costs, emissions, capacities). While these were chosen to be realistic, they are not based on an actual supply chain’s data. Therefore, the numerical results (e.g., 45% emission reduction at 10% cost) are illustrative rather than directly generalizable. Real-world supply chains might have different cost–emission trade-off shapes. Additionally, our demand was based on M5 (retail), which has its own patterns; other industries (e.g., automotive or electronics supply chains) would have different dynamics. Our model also simplified network structure (one supplier directly to store, no multi-echelon distribution), which could be expanded. We assumed unlimited capacity from each supplier (except implied delays for non-green). A more detailed model might include capacity constraints and inventory explicitly, turning it into a larger stochastic optimization problem. In short, the scope of our model was limited to demonstrate the methodology.
While a problem with nine decision variables and a small RL state space is computationally trivial, scaling this approach to a full-size supply chain with hundreds of products and many state variables could be challenging. NSGA-II’s computational cost grows with solution dimensionality; advanced variants or hybrid methods might be needed for larger problems. Similarly, the state space for RL in a complex supply chain (including inventory levels, backorders, and multi-location states) would explode. We would need a more complex neural network and possibly policy-gradient or actor–critic methods to handle continuous action decisions (like order quantities). Our DQN was fairly simple; more complex problems might integrate with existing libraries (like Stable Baselines) and require more training time or sophisticated reward shaping. The concept of combining the components remains valid, but engineering the system for industrial-scale use will require attention to efficiency (perhaps using parallel simulations, distributed computing, or simplifying the action space through hierarchical RL).
We treated the three modules largely sequentially (generative module → NSGA-II → RL). In reality, there are feedback loops we did not exploit. For example, after training the RL agent, one could evaluate its true cost–emission outcomes, realize that the static approximation was off, and feed that information back to refine the strategy. We did not iterate between NSGA-II and RL; we performed a single pass. An integrated approach might alternate: optimize the strategy, train the RL agent, observe the resulting performance, and then re-optimize the strategy if needed. Our generative model could also be further integrated, e.g., using the RL trajectories to update scenario generation (the RL agent may uncover scenarios we did not anticipate). We treated the modules separately for clarity. Future research could explore more closed-loop learning, such as a single objective that evaluates a strategy by training an RL agent on it and then measuring true performance, although repeating that in an outer loop would be extremely costly.
We focused on two objectives. In reality, sustainable supply chain management might include others, e.g., service level (fill rate) or social metrics (fair trade, labor conditions). We handled service level as a constraint rather than as an explicit objective. If we had included it (making it tri-objective: cost, emission, service), the analysis would become more complicated, as a 3D Pareto front is harder to visualize, and the RL reward design would have to incorporate another term or constraint (perhaps via Lagrangian methods). Our framework can conceptually extend to more objectives (NSGA-III or other MOEAs can handle them), but the complexity of interpreting the results grows. Social impact was outside our scope, but one could imagine assigning a “green score” to suppliers (beyond emissions) and trying to maximize that too. We caution that adding too many objectives can make the Pareto set very large and perhaps less useful for decision-making (the curse of dimensionality). A practical approach might combine some objectives (like a single sustainability index) if needed.
From a managerial perspective, implementing our approach requires substantial data (on costs, emissions, and demand forecasts) and trust in AI models. If the data are poor or biased, the generative model may produce unrealistic scenarios, or the RL agent could learn the wrong patterns. Moreover, AI models can be opaque; a manager might be wary of letting an RL agent control decisions without understanding its rationale. Ensuring interpretability (for example, by extracting rules from the trained policy) would help increase adoption [
31]. We did not focus on interpretability, but it is a valid concern in practice.
5.4. Future Research Directions
Building on this work, several avenues emerge:
The first avenue is extending the integration of NSGA-II and RL into a true dynamic multi-objective optimization framework. One idea is to evolve not just static strategies but parameterized policies: represent the RL agent’s policy by a set of tunable parameters (such as trigger levels for switching suppliers) and use NSGA-II to optimize those parameters for cost and emissions. This blends the RL and NSGA-II steps into one, effectively searching the space of policies directly. Techniques such as neuro-evolution (evolving neural network weights) or rule-based policy evolution could be applied. This would directly yield a Pareto front of policies, not just static plans, as sketched below.
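A sketch of how such a parameterized policy search could be set up with an off-the-shelf NSGA-II implementation (here pymoo, as an assumed library choice; the threshold policy and toy simulator are purely illustrative):

import numpy as np
from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.core.problem import ElementwiseProblem
from pymoo.optimize import minimize

def toy_simulate(thresholds):
    """Stand-in for the supply chain simulator: lower switching thresholds
    mean greener behavior (lower emissions, higher cost)."""
    demand_thr, carbon_thr = thresholds
    green_share = 1.0 - 0.5 * (demand_thr + carbon_thr)
    return 100.0 + 15.0 * green_share, 50.0 - 30.0 * green_share  # cost, emissions

class ThresholdPolicyProblem(ElementwiseProblem):
    """Each individual encodes a policy: x[0] is the normalized demand level
    and x[1] the normalized carbon price above which the green supplier is
    chosen. Objectives are the simulated mean cost and mean emissions."""

    def __init__(self, simulate=toy_simulate):
        super().__init__(n_var=2, n_obj=2, xl=np.zeros(2), xu=np.ones(2))
        self.simulate = simulate

    def _evaluate(self, x, out, *args, **kwargs):
        cost, emissions = self.simulate(x)
        out["F"] = [cost, emissions]

# res = minimize(ThresholdPolicyProblem(), NSGA2(pop_size=50), ("n_gen", 100),
#                seed=1)
# res.F would then approximate a Pareto front of *policies*, not static plans.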
Our generative approach sampled scenarios but did not formally consider probabilities. Future work could incorporate a probabilistic view, e.g., using robust optimization or chance-constrained objectives so that solutions are evaluated on expected cost and a risk measure of emissions (or vice versa). Alternatively, the multi-objective criteria could include the variance or CVaR (conditional value at risk) of cost or emissions. The RL agent could also be trained in a risk-sensitive manner (some recent RL research addresses CVaR optimization). This would be valuable for companies highly averse to certain risks (for example, exceeding an emission cap even 1% of the time might be unacceptable).
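For reference, the CVaR measure mentioned above can be computed from simulated outcomes as in the short sketch below (illustrative only, not part of the study’s implementation):

import numpy as np

def cvar(outcomes, alpha=0.95):
    """Conditional Value at Risk: the mean of the worst (1 - alpha) share of
    outcomes, treating larger values (cost or emissions) as worse."""
    outcomes = np.asarray(outcomes, dtype=float)
    threshold = np.quantile(outcomes, alpha)   # Value at Risk at level alpha
    return outcomes[outcomes >= threshold].mean()

# A risk-sensitive objective could replace mean emissions with, for example,
# cvar(simulated_emissions, alpha=0.99) to guard against rare cap exceedances.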
Applying this framework to a real supply chain case would be the ultimate test: for example, working with a retailer to run the system on their actual data (product catalog, suppliers with estimated carbon footprints, and observed demand). It would be interesting to see whether the insights hold and how much improvement can be verified. A real implementation would also allow evaluation of computational efficiency and bottlenecks.
While we used a basic VAE, more advanced generative AI (such as conditional GANs or even large language models for scenario narratives) could be used to simulate complex events (port strikes, pandemics, etc.). A conditional generative model would allow requests such as “generate a demand scenario with a 10% chance of a disruption in Q3,” against which strategies could then be tested. This becomes a digital twin environment for training RL, closely aligned with current industry interest. Research can focus on how to ensure these generated scenarios are representative and useful for training robust policies.
We assumed a single decision-maker. In reality, supply chains involve multiple agents (suppliers, carriers, retailers). Multi-agent RL could be explored, where, for example, a supplier agent and a buyer agent each have policies that need to align with sustainability goals. There is emerging research on multi-agent systems for supply chains; combining that with multi-objective goals (each agent might have its own objective weighting) is complex but fascinating. It could show, for example, how contract structures or incentives could be designed so that an RL agent at a supplier and an RL agent at a buyer collectively achieve system-wide Pareto optimality.
Developing methods to extract simpler decision rules from the RL agent (so that managers can understand and trust them) is another area. Decision tree approximations of the policy, or techniques such as SHAP values that attribute decisions to input factors, could make the approach more transparent. Additionally, involving human feedback in training (e.g., via reinforcement learning from human feedback, RLHF, popular in NLP) could allow managers to correct the agent’s behavior during training, for instance by penalizing choices that are theoretically good but practically infeasible.
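As a sketch of the rule-extraction idea (an assumption about how it might be done, using scikit-learn; greedy_action_fn and state_sampler are hypothetical hooks into the trained agent and simulator):

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

def distill_policy(greedy_action_fn, state_sampler, n_samples=5000, max_depth=3):
    """Approximate a trained RL policy with a shallow, human-readable tree.

    greedy_action_fn(state) -> action index chosen by the trained agent
    state_sampler(n)        -> array of n states covering the operating range
    """
    X = state_sampler(n_samples)
    y = np.array([greedy_action_fn(s) for s in X])
    return DecisionTreeClassifier(max_depth=max_depth).fit(X, y)

# export_text(tree, feature_names=["demand", "carbon_price"]) then yields
# threshold rules (e.g., "if carbon_price > 0.6 then choose green") that
# managers can audit against the Q-value analysis reported above.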
In conclusion, our research demonstrates a potent combination of AI techniques for sustainable supply chain management. It lays the groundwork for AI-augmented supply chains that can design themselves for optimal trade-offs and continually adapt to meet sustainability targets in a changing world. As data availability and computational tools improve, we expect such integrative approaches to move from academia to industry, enabling supply chains that are not only efficient and resilient but also significantly more sustainable. The journey to net-zero supply chains will require innovation in both strategy and execution—our work suggests that generative and adaptive AI can play a pivotal role in that journey.
6. Conclusions
This study contributes a novel and unified AI framework for sustainable supply chain decision-making, integrating generative modeling, evolutionary optimization, and reinforcement learning. Beyond performance improvements, this architecture offers a theoretical model for linking strategic design and adaptive execution under uncertainty.
In this paper, we presented a comprehensive study on using generative and adaptive AI approaches to design and optimize sustainable supply chain strategies. Our approach integrated Variational Autoencoders (VAE) for scenario and solution generation, NSGA-II for multi-objective optimization of cost and emissions, and reinforcement learning (RL) for adaptive execution of supply chain decisions. We applied this framework to a case based on the M5 Forecasting dataset, enriched with synthetic sustainability-related variables, to demonstrate its effectiveness.
The introduction outlined the growing need to incorporate sustainability into supply chain planning and how advanced AI techniques could address the complexity of balancing cost with environmental impact. We identified a gap in the existing literature, which tends to focus on either strategic optimization or operational adaptation in isolation, and proposed an integrated solution.
In the theoretical contribution and conceptual framework, we situated our work within the literature on sustainable supply chain management, evolutionary optimization, and reinforcement learning. We highlighted that while multi-objective optimization provides strategic trade-offs and RL offers adaptability, their combination (especially with generative modeling support) is novel. We then detailed our conceptual model, which links a generative module (to learn and create realistic scenarios) with a Pareto optimization module and an adaptive decision module. This framework provides a new lens to view supply chain decisions as a continuum from design to real-time control, supported by AI at each stage.
The methodology described how we implemented each component: we processed real retail data and added synthetic supplier attributes, trained a VAE to capture demand patterns, used NSGA-II to find a Pareto set of sourcing strategies, and built a custom environment to train a DQN agent for operational decisions. The methodology stressed both the technical setup and the rationale behind parameter choices, ensuring that the approach is reproducible. We also ensured that cost and emission calculations were grounded in reasonable assumptions, making the experimental setup as realistic as possible for a hypothetical supply chain.
Our results showed that significant sustainability improvements can be achieved with minimal cost sacrifice when decisions are optimized holistically. The Pareto frontier from NSGA-II gave insight into the cost–emission trade-off, and the selected balanced strategy achieved nearly half the emissions of the cheapest strategy for about 10% higher cost. The reinforcement learning agent further improved performance by dynamically adjusting actions—essentially showing that adaptability can push the efficiency frontier even further. The RL policy successfully navigated high-demand and high-carbon-price situations better than any static rule, validating the benefit of an adaptive approach. We also observed that the VAE-informed initial solutions helped NSGA-II converge faster and find diverse solutions, although the benefit was qualitative.
In the discussion, we interpreted these findings in depth. We noted the technical novelty of our integration and how it opens up new research opportunities in dynamic multi-objective optimization and AI-driven supply chain management. We discussed managerial implications, emphasizing that our approach can serve as a powerful decision-support tool: it quantifies trade-offs for executives and provides AI agents that can autonomously manage operations within set strategic guidelines. We frankly addressed limitations, including modeling simplifications and scalability considerations, to contextualize the scope of our conclusions. Finally, we suggested future directions, such as applying our framework to real-world data, exploring multi-agent setups, and enhancing interpretability, which can build on our work to further advance both theory and practice.
This research demonstrates the feasibility and value of combining generative modeling with multi-objective optimization and reinforcement learning in supply chain decision-making. The novelty lies not in inventing new algorithms, but in orchestrating existing ones (VAE, NSGA-II, DQN) into a cohesive system tailored for sustainability objectives. The outcomes indicate that supply chains can be both cost-efficient and eco-friendly if we leverage AI to navigate the inherent trade-offs and uncertainties. Practically, our approach could help companies design supply chain strategies that are “future-proof”—optimized for today’s conditions but flexible enough to adapt as conditions change (be it demand surges or new carbon regulations). The successful integration of these AI techniques also contributes to the academic discourse by providing a template for solving other complex operations management problems where multiple objectives and uncertainties collide (for example, production planning with quality and throughput trade-offs, or transportation networks with cost, time, and risk objectives).
As companies strive to meet ambitious sustainability targets, tools like the one developed in this study will be increasingly important. By viewing sustainability as a dynamic optimization problem rather than a static checkbox, firms can unlock innovative ways to reduce emissions without sacrificing service or profit. Our work suggests that the convergence of generative AI (for foresight), evolutionary algorithms (for insight), and reinforcement learning (for hindsight, i.e., learning from experience) can lead to smarter, greener supply chains. We hope this research encourages further exploration and adoption of AI hybrid approaches in the journey toward sustainable operations excellence.
Future research could explore the integration of this framework into human-in-the-loop decision systems, examine alternative generative architectures (e.g., diffusion models), and apply it to other domains such as healthcare logistics or disaster relief. The convergence of generative and adaptive AI thus offers fertile ground for advancing sustainable operations research.