Operational Resilience Under Carbon Constraints: A Socio-Technical Multi-Agentic Approach to Global Supply Chains

Kaur, Rashanjot; Kundu, Triparna; Sharma, Bhanu; Park, Kathleen Marshall; Pinsky, Eugene

doi:10.3390/systems14040374


by Rashanjot Kaur ¹, Triparna Kundu ¹, Bhanu Sharma ², Kathleen Marshall Park ³,* and Eugene Pinsky ¹

¹ MET Department of Computer Science, Boston University, Boston, MA 02215, USA

² College of Science, Northeastern University, Boston, MA 02115, USA

³ MET Department of Administrative Sciences, Global Development Policy Center and Institute for Global Sustainability, Boston University, Boston, MA 02215, USA

* Author to whom correspondence should be addressed.

Systems 2026, 14(4), 374; https://doi.org/10.3390/systems14040374

Submission received: 14 February 2026 / Revised: 22 March 2026 / Accepted: 25 March 2026 / Published: 31 March 2026

(This article belongs to the Special Issue Artificial Intelligence and Big Data Strategies for Sustainable and Resilient Supply Chain Management)


Abstract

High-stakes logistics, defined as supply chains where delays, quality loss, or noncompliance have serious human, safety, financial, or geopolitical consequences, are a prominent case of a broader reality: global supply chains are safety-, cost-, and time-critical socio-technical systems where forecasting quality, vendor coordination, and operational decisions shape service levels and stakeholder welfare. At the same time, decarbonization pressures and the growing use of AI for planning and control introduce new risks and trade-offs across energy, computation, and physical logistics. We develop a multi-agent framework that models supply chain system-of-systems dynamics drawing on (1) supply chain decision functions (shipment planning, sourcing and vendor management), (2) national energy-transition conditions that determine grid carbon intensity, and (3) carbon-aware computation accounting for AI-enabled decision support. Methodologically, we combine predictive analytics, unsupervised segmentation, and a carbon-cost-of-intelligence layer in a scenario-based assessment of how national energy-transition profiles–from Norway to India–affect the intensity of AI compute carbon, meaning the carbon emissions generated by the hardware and data centers required to train and run AI models. We introduce the carbon-adjusted supply chain performance (CASP) metric that integrates physical transport carbon, cold-chain overhead where applicable, and AI compute carbon into a per-package-type performance measure. Our analysis yields three actionable outputs for systems engineering and environmental management: carbon, service, and cost trade-off frontiers; governance levers (sourcing portfolio rules, buffers, and compute policies); and system-level early-warning indicators for disruption amplification. 
This study implements a tool-augmented multi-agent system (orchestrator, risk, and sourcing agents) using AWS Bedrock and Strands Agents, where LLM-based agents orchestrate deterministic analytical engines through structured tool interfaces with adaptive query generation. Theoretically, we extend previous systems-of-systems and sustainable supply chain findings by formalizing package-type-specific carbon–service frontiers and by embedding AI compute carbon into a socio-technical resilience framework. Practically, the CASP benchmark, governance lever analysis, and multi-agent implementation provide decision-makers with concrete tools to compare carriers, routes, and compute strategies across countries while making transparent the trade-offs between service reliability and total carbon.

Keywords:

system-of-systems; agentic AI; carbon cost of intelligence; sustainable supply chain management; tool-augmented LLM agents; predictive logistics analytics; carbon–service trade-offs

1. Introduction

Global supply chains represent complex system-of-systems architectures where operational decisions propagate across interconnected networks of suppliers, carriers, and distribution centers [1,2]. The accelerating adoption of artificial intelligence (AI) for supply chain management (SCM), including demand forecasting, route optimization, and inventory management, precipitates new trade-offs between computational carbon costs and operational efficiency [3,4]. Whereas prior work focuses on isolated optimization objectives, real-world supply chains span heterogeneous package types, from pharmaceuticals to food and fashion, with fundamentally different constraints. Critical logistics, such as pharmaceuticals and food items, require ≥99% on-time delivery and cold-chain logistics where applicable, while standard logistics, such as fashion items ranging from clothing to cosmetics, can tolerate ≥85% service levels with flexible routing [5,6].

From a systems perspective, the core research problem we address is how to design supply chain architectures that remain operationally resilient, that is, capable of maintaining acceptable service levels under disruption, while simultaneously meeting stringent decarbonization constraints and accounting for the carbon cost of AI-enabled decision support. This study conceptualizes global supply chains as socio-technical systems-of-systems in which physical transport assets, digital decision-support infrastructures, and national energy systems co-evolve, and where improvements in one layer, for example, more accurate forecasting, can shift risk and carbon burdens to other layers.

Our conceptual framework builds on systems-of-systems theory and sustainability-oriented supply chain design. Following Ackoff [1], Boardman and Sauser [2], and Barbosa-Póvoa et al. [5], we view operational resilience as an emergent property of interactions between autonomous subsystems, including suppliers, carriers, digital agents, and energy infrastructures. In this view, resilience depends not only on redundancy and buffers but also on governance decisions, such as sourcing rules and sustainability compliance, and on the choice of AI models whose energy use is coupled to regional grid carbon intensity in the broader context of corporate and national interests [7].

Against this backdrop, the aim of this study is threefold: (i) to develop a carbon-adjusted performance metric that jointly captures service levels and emissions at the package-type level; (ii) to design and implement a tool-augmented multi-agent architecture that operationalizes this metric for real-world logistics decision support; and (iii) to quantify how carbon–service trade-offs and optimization potential differ across package types, governance levers, and national energy-transition profiles (e.g., Norway, France, the United States, India).

Current supply chain optimization frameworks report aggregate performance metrics without category-level breakdowns, preventing accurate carbon footprint estimation for product-specific operations. This limitation is particularly critical as organizations deploy AI-enabled planning systems across multiple product categories with varying service requirements and environmental constraints [8]. The environmental impact of supply chain operations has gained attention, with logistics accounting for approximately 8% of global greenhouse gas emissions [9]. However, the carbon cost of AI-enabled decision support, which constitutes a growing portion of operational overhead, remains poorly characterized across diverse supply chain contexts [10].

1.1. Research Motivation

Recent work in sustainable SCM and agentic AI highlights open questions at the intersection of carbon accounting, product-category heterogeneity, and AI compute, which we synthesize formally in the Research Gaps subsection of the Literature Review (Section 2). Aggregate carbon metrics, missing treatment of AI compute carbon, and limited empirical evidence on how governance levers operate across energy-transition scenarios all hinder the design of resilience- and sustainability-oriented global supply chain architectures. The motivation of this study is to provide a socio-technical, empirically grounded framework that addresses these limitations at the package-type level.

To this end, we introduce the carbon-adjusted supply chain performance (CASP) metric, a framework that integrates physical transport carbon, cold-chain overhead where applicable, and AI compute carbon into a per-package-type performance measure. Methodologically, we combine predictive analytics, vendor segmentation, and a carbon-cost-of-intelligence layer to estimate carbon and service performance for each package type under different national grid carbon intensities [6]. The cross-category carbon benchmark, constructed from a 25,000-record Indian logistics dataset [11] with nine package types and nine delivery partners, directly addresses the limitations summarized in Section 2: it replaces aggregate carbon metrics with category-level CASP, incorporates AI compute carbon alongside transport emissions, quantifies optimization potential and governance lever effectiveness by package type, and evaluates how these patterns vary across contrasting energy-transition profiles (e.g., Norway to India). In doing so, the CASP benchmark operationalizes the abstract research gaps into measurable indicators that can guide both academic inquiry and practitioner decision-making.

1.2. Research Questions

This study investigates the following interrelated questions:

1. How do carbon emissions and service performance vary across nine package types (pharmacy, electronics, groceries, automobile parts, furniture, documents, fragile items, clothing, cosmetics) in supply chain logistics?
2. What is the carbon-versus-service trade-off for each package type, and how do cold-chain requirements affect this relationship?
3. How does per-package-type CASP reveal carbon efficiency differences that aggregate metrics obscure, and how does the carbon cost of intelligence compare to physical logistics emissions across models and grid regions?
4. What governance levers (sourcing rules, buffer policies, compute strategies) most effectively reduce total carbon footprint while maintaining service levels?
5. How does the carbon cost of intelligence vary across national energy-transition profiles, and what early-warning indicators predict delivery delays and disruption amplification?

1.3. Contributions

This study makes multiple conceptual and practical contributions to systems engineering and sustainable SCM, most notably by introducing a unified metric and system-of-systems framework for analyzing operational resilience under carbon constraints.

First, we propose the CASP metric, a per-package-type measure that integrates physical transport carbon, cold-chain overhead, and AI compute carbon into a single, decision-ready performance framework. CASP is motivated by the logic of carbon-aware intelligence measurement—linking performance to energy use and carbon emissions—while translating that logic to an end-to-end supply chain setting where digital and physical processes jointly determine sustainability outcomes. Second, building on the CASP metric, we establish carbon benchmarks for nine package types across three “stakes” tiers, demonstrating that carbon intensity varies substantially from critical logistics (e.g., pharmacy with cold-chain requirements) to more routine delivery categories (e.g., clothing). Finally, we develop a tool-augmented, multi-agent system-of-systems framework in which deterministic analytical engines—predictive models, vendor segmentation, carbon calculators, and optimization routines—are incorporated as structured algorithms that LLM-based agents can invoke, thereby connecting supply chain decision support, national energy-transition conditions, and carbon-aware computation in a single assessment architecture.

We furthermore contribute practical decision heuristics that translate the framework into actionable managerial and engineering guidance. First, we derive carbon–service–cost trade-off frontiers that allow organizations to select feasible operating points aligned with service requirements and sustainability targets, rather than treating decarbonization as a purely aspirational constraint. Second, we quantify the effects of key governance levers—including sourcing portfolio rules, buffer policies, and compute strategies—on total carbon footprint, making explicit which levers meaningfully move outcomes under different package types and operational constraints. Third, we further introduce system-level early-warning indicators to anticipate disruption amplification across interdependent logistics networks, supporting proactive risk mitigation rather than reactive crisis response.

Fourth, operationally, we implement these ideas as a three-agent, LLM-powered architecture (Orchestrator, Risk, and Sourcing) featuring semantic package-type classification, adaptive tool cascades (API/web retrieval where needed), and constraint-aware route optimization, implemented using AWS Bedrock (Claude-3-Sonnet) and Strands Agents for tool-augmented reasoning. Unlike agentic supply chain approaches that rely primarily on web search and generic API tools, our agents execute quantitative, ML-backed decision support inside the reasoning loop by invoking trained Gradient Boosting models and k-means clustering—enabling reproducible, data-driven risk evaluation and sourcing recommendations rather than narrative or impressionistic outputs.

The remainder of the paper is organized as follows: Section 2 reviews related work, including agentic AI and multi-agent systems. Section 3 details the methodology (CASP, dataset, predictive analytics, vendor segmentation, carbon cost of intelligence, risk scoring, route optimization). Section 4 describes system architecture and implementation (agents, tools, communication). Section 5 presents experimental evaluation and results (delay/on-time/cost prediction, vendor segmentation, carbon and CASP analysis), followed by case studies of shipments in the pharmaceuticals and fashion industries. Section 7 discusses implications, error propagation, limitations, and threats to validity. Section 8 concludes the paper. Appendix A provides full agent prompt templates (A1 to A3).

2. Literature Review

2.1. System-of-Systems in Supply Chain Management

The system-of-systems (SoS) paradigm provides a conceptual foundation for analyzing supply chains as interconnected networks of autonomous yet interdependent components [12,13,14,15]. Ackoff’s foundational work on systems thinking [1] established that complex organizational challenges require holistic approaches that account for emergent behaviors arising from component interactions. Boardman and Sauser [2] extended this framework to engineering contexts, identifying five distinguishing characteristics of SoS: operational independence, managerial independence, evolutionary development, emergent behavior, and geographic distribution, all of which are applicable to modern supply chains.

Recent applications of SoS theory to SCM have focused on resilience and disruption propagation [8,16,17,18,19,20]. Ivanov [8] introduced the concept of “digital supply chain twins” as SoS models enabling real-time visibility and predictive analytics across supply chain networks. Dolgui et al. [16] demonstrated how disruptions propagate through supply chain networks via the “ripple effect,” with amplification patterns dependent on network topology and inventory policies; large-scale disruptions (e.g., maritime blockages) illustrate the systemic impact of such propagation [21]. These studies highlight the importance of treating supply chains as emergent systems where operational reliability arises from the interaction of multiple autonomous subsystems [22].

2.2. Carbon-Aware Supply Chain Optimization

The environmental impact of supply chain operations has emerged as a critical research area, with logistics contributing approximately 8% of global greenhouse gas emissions [9,23]. Barbosa-Póvoa et al. [5] provided a comprehensive review of sustainable supply chain design, identifying three pillars: economic viability, environmental responsibility, and social equity. They demonstrated that carbon optimization often conflicts with cost minimization, necessitating multi-objective approaches that generate Pareto-optimal trade-off frontiers.

Recent work has examined carbon-aware routing and scheduling in logistics networks [24,25]. Bektaş and Laporte [24] introduced the pollution-routing problem, incorporating vehicle emissions into traditional vehicle routing formulations. Demir et al. [25] reviewed green vehicle routing variants, finding that fuel consumption models significantly impact optimal route selection. However, these studies focus exclusively on physical transport carbon, neglecting the computational overhead of AI-enabled planning systems that increasingly drive operational decisions.

Cold-chain logistics present particular carbon challenges due to refrigeration requirements [26,27,28,29]. Mercier et al. [27] found that cold-chain operations consume 2–3× more energy than ambient logistics due to continuous temperature maintenance. The pharmaceutical industry exemplifies high-stakes cold-chain requirements, with strict 2–8 °C storage mandates and 95%+ on-time delivery expectations [30,31]. These constraints become particularly salient in crisis times [32,33] and severely limit routing flexibility, reducing carbon optimization potential compared to ambient-temperature goods.

2.3. AI-Enabled Supply Chain Planning and Carbon Cost

The adoption of AI for supply chain planning has accelerated rapidly, with applications spanning demand forecasting, inventory optimization, and route planning [34,35,36]. Ni et al. [34] conducted a systematic review of machine learning applications in SCM, identifying predictive analytics as the dominant use case. Riahi et al. [35] demonstrated that AI-enabled demand sensing can reduce forecast error by 20–50%, enabling leaner inventory policies with corresponding carbon benefits.

However, the carbon cost of AI computation itself has received limited attention in supply chain contexts. Kaur et al. [10] introduced the carbon-cost-of-intelligence (CCI) metric for AI workloads, demonstrating 4.3× variation in energy consumption across application domains. Their work established that AI compute carbon, while smaller than physical logistics carbon, represents a non-negligible and growing contribution to total supply chain environmental impact. Patterson et al. [37] showed that AI inference operations, which constitute 80–90% of production energy usage, accumulate to significant carbon footprints at scale.

2.4. Agentic AI and Multi-Agent Systems in Supply Chain Management

Agentic AI refers to systems where large language models (LLMs) act as reasoning agents that plan, call tools, and iterate based on outcomes [38]. Unlike traditional decision-support systems with fixed workflows, agentic frameworks allow adaptive query generation, external API and web-search integration, and structured tool use (e.g., risk assessment, carrier lookup, optimization). Implementations are supported by frameworks such as LangChain/LangGraph [39], CrewAI [40], and Strands Agents [41]. Recent supply chain and sustainability applications have begun to adopt such architectures: multi-agent designs coordinate specialized agents for disruption monitoring, sourcing, and carbon analysis, with agents exchanging structured outputs (e.g., JSON) in a sequential or hierarchical pipeline. Representative work includes a seven-agent framework for supply chain disruption monitoring (F1 0.962 to 0.991) [42], autonomous LLM-based consensus-seeking in supply chains [43], agentic AI for sustainable supply chain process automation (SustAI-SCM) [44], and LLM-based multi-agent inventory management (InvAgent) [45].

Traditional multi-agent systems (MASs) in supply chains rely on predefined protocols, ontologies, and often hand-crafted negotiation logic [8]. Prior work has reviewed the limitations of agent-based SCM and the path toward autonomous supply chains [46,47]. The Cambridge agentic LLM work [43], disruption monitoring [42], and SustAI-SCM [44] illustrate the shift toward LLM-powered, tool-augmented agents in SCM. LLM-powered agents differ in that they use natural language and tool-augmented reasoning to interpret user queries, generate adaptive search queries (e.g., weather, disruption news, carrier rates), and fuse multiple data sources before invoking deterministic backend engines (predictive models, optimization, carbon calculators). This hybrid design (LLM orchestration with local ML and API tools) is well-suited to supply chain contexts where inputs are heterogeneous (e.g., “ship insulin from Mumbai to Delhi”) and where reasoning over risk, sourcing, and carbon trade-offs benefits from structured tool calls rather than pure retrieval. This study’s three-agent design (Orchestrator → Risk → Sourcing → optimization → carbon) positions itself alongside such agentic SCM frameworks, with explicit integration of carbon-cost-of-intelligence and package-type-specific constraints (critical vs. standard) that prior agentic SCM work does not formalize.

A common architectural limitation across these agentic frameworks is the nature of their tool ecosystems. AlMahri et al. [42] delegate critical computations to deterministic functions (graph traversals, formula-based risk calculators), while InvAgent [45] and the Cambridge consensus framework [43] rely on LLM-only reasoning without external predictive tools. SustAI-SCM [44] proposes but does not implement trained model integration. None of these systems wrap domain-trained machine learning models (classifiers, regressors, or clustering engines with independently validated accuracy) as structured tools that LLM agents invoke during reasoning. This leaves a gap between the qualitative reasoning capabilities of LLM agents and the quantitative precision (e.g., delay probability, cost prediction, on-time estimation) needed for operational logistics decision-making.

2.5. Research Gaps

Despite growing attention to supply chain sustainability, significant gaps remain. First, existing carbon benchmarks report aggregate metrics without product-category breakdowns, preventing accurate assessment of heterogeneous supply chain portfolios. Second, while the existing literature extensively applies AI to reduce transportation emissions [38,48], these studies generally do not quantify the computational carbon footprint of the AI systems themselves within their logistics emissions models. This creates a methodological gap: the energy demands of AI computation are acknowledged as an implementation barrier [49], but are not integrated into end-to-end supply chain carbon accounting [50]. Third, the relationship between product-category constraints (e.g., cold-chain requirements, service-level targets) and carbon optimization potential remains unquantified. Fourth, governance levers for reducing supply chain carbon footprint lack empirical validation across different energy-transition scenarios. Table 1 summarizes these gaps and how the framework in this study addresses them.

Table 1 maps four key research gaps in the sustainable supply chain literature to the specific contributions of this study. It shows how the framework in this study addresses limitations in carbon metrics, AI integration, optimization potential quantification, and governance validation across energy scenarios.

This study’s framework addresses these gaps by introducing a carbon-cost-of-intelligence layer that quantifies AI inference emissions alongside physical logistics emissions, and through systematic empirical measurement using the CASP metric for category-specific carbon assessment.

3. Methodology

Figure 1 illustrates the layered system-of-systems framework, showing four distinct layers: Output (CASP, Governance, Early Warning), Agent Layer (LLM reasoning with Orchestrator, Risk Agent, Sourcing Agent), Analytical Backend (seven analytical engines with descriptive names), and Data Layer (datasets, APIs, carriers, routes, grid carbon). This layered architecture demonstrates that analytical engines are the backend that tools wrap and agents invoke, not standalone components.

For readability, we use descriptive backend component names throughout: predictive analytics, vendor segmentation, carbon cost of intelligence, governance lever analysis, and early-warning risk scoring.

This study uses this four-layer architecture to operationalize CASP: LLM agents handle reasoning over constraints, while deterministic analytical engines compute carbon and performance through structured tool interfaces.

3.1. Carbon-Adjusted Supply Chain Performance (CASP) Metric

This study introduces the CASP metric to enable package-type-specific carbon assessment that accounts for service performance. For a supply chain operation with package type $i$, CASP is defined as

$$\mathrm{CASP}_{i} = \frac{\text{Service Performance}_{i}}{\text{Total Carbon}_{i}} \tag{1}$$

where Service Performance is measured as on-time delivery percentage (0 to 100%). Total Carbon in Equation (1) is the sum of transport carbon and AI compute carbon (Equation (2)); transport carbon includes cold-chain overhead as a multiplicative factor $λ_{cc}$ per Equation (3), not as a separate additive component.

$$\text{Total Carbon}_{i} = C_{transport, i} + C_{AI, i} \tag{2}$$

Transport carbon is calculated as follows:

$$C_{transport, i} = d_{i} \times {EF}_{v} \times λ_{cc} \tag{3}$$

where $d_{i}$ is route distance (km); ${EF}_{v}$ is vehicle emission factor (gCO₂/km) sourced from CPCB [54], BEE [55], and ICCT [56], with the EV van retained as a UK proxy from the 2024 UK government GHG conversion factors [57]; and $λ_{cc}$ is the cold-chain multiplier: 1.0 (ambient), 2.0 (groceries), or 2.5 (pharmacy). Cold-chain logistics increases transport carbon by a multiplicative factor $λ_{cc}$ (2.0 to 2.5×) due to refrigeration energy requirements, rather than as a separate carbon component. This study uses per-vehicle emission factors expressed in gCO₂/km, not gCO₂/ton-km, so the model implicitly assumes typical loading and average cargo weight and does not explicitly represent weight, load factor, or return trips. As a result, absolute transport carbon for very long routes (e.g., >1000 km) should be interpreted as an approximation suitable for comparative analysis across package types and scenarios rather than as an exact life-cycle inventory. Table 2 summarizes the vehicle emission factors and source provenance used in Equation (3).
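To make Equation (3) concrete, a minimal Python sketch follows. The cold-chain multipliers are those stated in the text; the 1400 km distance and the 300 gCO₂/km emission factor are hypothetical placeholders (not values from Table 2), and the function and dictionary names are illustrative only.

```python
# Cold-chain multipliers lambda_cc as stated for Equation (3).
COLD_CHAIN_MULTIPLIER = {"ambient": 1.0, "groceries": 2.0, "pharmacy": 2.5}


def transport_carbon_g(distance_km: float, ef_g_per_km: float,
                       cold_chain: str = "ambient") -> float:
    """Transport carbon in gCO2 per shipment: d_i * EF_v * lambda_cc."""
    if distance_km < 0 or ef_g_per_km < 0:
        raise ValueError("distance and emission factor must be non-negative")
    return distance_km * ef_g_per_km * COLD_CHAIN_MULTIPLIER[cold_chain]


# Hypothetical inputs: a 1400 km route and a placeholder emission factor
# of 300 gCO2/km (an assumption for illustration, NOT a Table 2 value).
ambient_g = transport_carbon_g(1400.0, 300.0)
pharmacy_g = transport_carbon_g(1400.0, 300.0, "pharmacy")
print(f"ambient:  {ambient_g:,.0f} gCO2")
print(f"pharmacy: {pharmacy_g:,.0f} gCO2")
```

Because the multiplier is the only term that differs between the two calls, the pharmacy shipment carries exactly 2.5 times the ambient transport carbon on the same route and vehicle.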

AI compute carbon is calculated as follows:

$$C_{AI, i} = \frac{E_{model} \times N_{inferences}}{1000} \times G_{country} \tag{4}$$

where $E_{model}$ is energy per inference (Wh) from Patterson et al. [37], $N_{inferences}$ includes both LLM calls (3 per optimization: Orchestrator + Risk Agent + Sourcing Agent) and ML predictions (9 per optimization: 3 routes × 3 predictions per route for cost, carbon, on-time), and $G_{country}$ is grid carbon intensity (gCO₂/kWh) from Electricity Maps 2024 [53]. For the multi-agent system, LLM energy dominates (3 × 0.0015 Wh = 0.0045 Wh) compared to ML predictions (9 × 0.00001 Wh = 0.00009 Wh).

$C_{transport, i}$ : Transport carbon (gCO₂/shipment) per Equation (3), where cold-chain overhead is embedded as the multiplier $λ_{cc}$ within the transport calculation.
$C_{AI, i}$ : AI compute carbon (gCO₂/optimization) per Equation (4), from route planning and inference (carbon-cost-of-intelligence).
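The arithmetic behind Equation (4) can be sketched as below. The per-call energies and call counts are the figures quoted in the text, and the grid intensities are the regional values quoted in Section 3.3. Summing the LLM and ML terms separately (each with its own energy per call) is our reading of how $E_{model} \times N_{inferences}$ applies when two model types are involved; it is an assumption, not a confirmed detail of the authors' implementation.

```python
# Energy per call (Wh) and calls per optimization, as quoted in the text.
LLM_WH_PER_CALL = 0.0015   # Claude Sonnet figure
ML_WH_PER_CALL = 0.00001   # local Gradient Boosting prediction
LLM_CALLS = 3              # Orchestrator + Risk Agent + Sourcing Agent
ML_CALLS = 9               # 3 routes x 3 predictions per route

# Grid carbon intensity (gCO2/kWh), regional values quoted in Section 3.3.
GRID_G_PER_KWH = {"India": 708.0, "USA": 386.0, "France": 56.0}


def ai_compute_carbon_g(grid_g_per_kwh: float) -> float:
    """AI compute carbon (gCO2 per optimization) following Equation (4)."""
    energy_wh = LLM_CALLS * LLM_WH_PER_CALL + ML_CALLS * ML_WH_PER_CALL
    # Dividing by 1000 converts Wh to kWh to match the grid-intensity units.
    return energy_wh / 1000.0 * grid_g_per_kwh


per_country = {c: ai_compute_carbon_g(g) for c, g in GRID_G_PER_KWH.items()}
for country, grams in per_country.items():
    print(f"{country}: {grams:.6f} gCO2 per optimization")
```

Since Equation (4) is linear in $G_{country}$, the ratio between any two countries equals the ratio of their grid intensities, independent of the model chosen.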

The per-package-type CASP structure is motivated by the carbon-cost-of-intelligence (CCI) framework [10], which uses per-domain accuracy-to-energy ratios to characterize AI workload efficiency. Where CCI measures AI accuracy per unit energy across computational domains, CASP measures delivery service performance per unit carbon across logistics package types.
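Combining Equations (1) and (2), a CASP calculation might look like the following sketch. All input numbers are hypothetical and chosen only to show the mechanics; they are not results from the paper, and the function name is illustrative.

```python
def casp(on_time_pct: float, transport_g: float, ai_g: float) -> float:
    """CASP_i = Service Performance_i / Total Carbon_i (Equations (1), (2))."""
    if not 0.0 <= on_time_pct <= 100.0:
        raise ValueError("on-time percentage must lie in [0, 100]")
    total_carbon_g = transport_g + ai_g
    if total_carbon_g <= 0:
        raise ValueError("total carbon must be positive")
    return on_time_pct / total_carbon_g


# Hypothetical shipments (illustrative numbers, not results from the paper).
shipments = {
    "pharmacy": {"on_time_pct": 99.0, "transport_g": 1_050_000.0, "ai_g": 0.00325},
    "clothing": {"on_time_pct": 88.0, "transport_g": 420_000.0, "ai_g": 0.00325},
}
scores = {name: casp(**s) for name, s in shipments.items()}
for name, value in scores.items():
    print(f"{name}: {value:.3e} on-time % per gCO2")
```

In this form CASP has units of on-time percentage points per gram of CO₂, so a higher value indicates more service delivered per unit of carbon.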

3.2. Package-Type Classification

This study classifies supply chain operations into nine package types across three stakes tiers based on operational constraints, regulatory requirements, and carbon profiles (aligned with industry SLAs [58,59,60] and WHO Good Distribution Practice where applicable [30,61]):

Tier 1: Critical (≥99% on-time):

Pharmacy: Cold chain required (2 to 8 °C), carbon multiplier 2.5×; WHO/FDA-regulated.
Groceries: Perishable, refrigeration often required; carbon multiplier 2.0×.

Tier 2: High-Value (≥95% on-time [59,60,62]):

Automobile Parts, Furniture, Documents, Fragile Items, Electronics: Business-critical; carbon multiplier 1.0×.

Tier 3: Standard (≥85% on-time [59,60,63,64]):

Clothing, Cosmetics: Consumer goods; carbon multiplier 1.0×; flexible routing enables carbon optimization.

Package type is assigned by a semantic classifier: rule-based extraction from the user query (e.g., origin, destination, product keywords) is applied first; when the product description is ambiguous or would default to a generic type (e.g., “clothing”), an LLM-based fallback (AWS Bedrock Claude) classifies the query into one of the nine package types. This ensures that life-safety cases (e.g., “insulin” → pharmacy, cold chain) are correctly identified rather than misclassified as standard logistics.
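The rules-first, LLM-fallback flow described above can be sketched as follows. The tier targets and multipliers come from the list in this section; the keyword table, the `classify` function, and the `llm_fallback` callable (a stand-in for the AWS Bedrock call, whose actual interface is not shown in the text) are hypothetical.

```python
import re
from typing import Callable, Optional

# (tier, on-time target in %, cold-chain multiplier), from the tier list above.
PACKAGE_PROFILE = {
    "pharmacy": ("critical", 99, 2.5),
    "groceries": ("critical", 99, 2.0),
    "automobile parts": ("high-value", 95, 1.0),
    "furniture": ("high-value", 95, 1.0),
    "documents": ("high-value", 95, 1.0),
    "fragile items": ("high-value", 95, 1.0),
    "electronics": ("high-value", 95, 1.0),
    "clothing": ("standard", 85, 1.0),
    "cosmetics": ("standard", 85, 1.0),
}

# Hypothetical keyword rules; the authors' actual rule set is not given.
KEYWORDS = {"insulin": "pharmacy", "vaccine": "pharmacy",
            "vegetables": "groceries", "laptop": "electronics",
            "lipstick": "cosmetics"}
GENERIC_DEFAULT = "clothing"  # generic type mentioned in the text


def classify(query: str,
             llm_fallback: Optional[Callable[[str], str]] = None) -> str:
    """Rule-based match first; otherwise defer to the fallback classifier."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    for keyword, package_type in KEYWORDS.items():
        if keyword in tokens:
            return package_type
    if llm_fallback is not None:
        candidate = llm_fallback(query).strip().lower()
        if candidate in PACKAGE_PROFILE:  # accept only the nine known types
            return candidate
    return GENERIC_DEFAULT


package = classify("ship insulin from Mumbai to Delhi")
tier, target_pct, multiplier = PACKAGE_PROFILE[package]
print(package, tier, target_pct, multiplier)
```

Matching on whole tokens rather than substrings avoids accidental hits inside unrelated words, and validating the fallback's answer against the nine known types keeps a malformed LLM response from propagating downstream.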

3.3. Dataset and Measurement Framework

We now describe the dataset and measurement framework that ground the empirical analysis. This study uses the Indian logistics dataset [11], comprising 25,000 delivery records; Table 3 summarizes dataset statistics. Features include package type, origin, destination, distance, weight, vehicle type, delivery partner, delivery mode, region, weather condition, and delivery rating. Nine delivery partners are represented (Delhivery, XpressBees, Shadowfax, DHL, Amazon Logistics, BlueDart, FedEx, Ecom Express, Ekart). Package types are mapped to the nine categories and three stakes tiers described above; cold-chain multipliers (2.5× for pharmacy, 2.0× for groceries, 1.0× otherwise) are applied where applicable.

Table 3 provides a comprehensive overview of the dataset [11] structure, including the number of records, train/test split, package types, delivery partners, vehicle types, and predictive features. It establishes the data foundation for all subsequent ML models and analyses reported in this paper. Grid carbon intensity is obtained from configurable regional values (e.g., India 708, USA 386, France 56 gCO₂/kWh) consistent with EPA eGRID [52] and Electricity Maps [53]. AI compute carbon is derived from the carbon-cost-of-intelligence layer, as described in Section 3.6.

3.4. Predictive Analytics

This study uses Gradient Boosting for cost regression and on-time/delay classification [34,36]. The choice is motivated by mixed feature types (categorical and numerical), interpretability of feature importance, and training speed. Categorical features (delivery partner, package type, vehicle type, delivery mode, region, weather condition) are one-hot encoded; numerical features (distance_km, package_weight_kg, delivery_rating) are standardized (StandardScaler). The same preprocessing is used for predictive analytics and early-warning delay prediction. This study reports both a single 80%/20% train/test split (random_state=42) and 5-fold cross-validation (mean ± std), stratified for delay classification and standard KFold for on-time classification, to assess stability across folds. No target leakage (e.g., delayed, delivery_status) is used as input.
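To illustrate what the two preprocessing steps do to a record, the sketch below implements one-hot encoding and z-score standardization with the Python standard library only. It is a stand-in for illustration, not the authors' pipeline (the text names scikit-learn's StandardScaler); the three-row sample is hypothetical, and the column names follow the feature list above.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-score a numeric column using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # constant column carries no signal
    return [(v - mu) / sigma for v in values]


def one_hot(values):
    """Return (sorted categories, one 0/1 indicator row per input value)."""
    categories = sorted(set(values))
    return categories, [[int(v == c) for c in categories] for v in values]


# Hypothetical mini-sample; real inputs come from the 25,000-record dataset.
distance_km = [100.0, 200.0, 300.0]
weather_condition = ["Rainy", "Clear", "Rainy"]

z = standardize(distance_km)
cats, rows = one_hot(weather_condition)
features = [[zi] + row for zi, row in zip(z, rows)]
print(cats)
print(features)
```

In a real train/test workflow the mean, standard deviation, and category list would be estimated on the training split only and then reused on the test split, which is consistent with the text's statement that no target leakage enters the inputs.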

The additive model objective is $F(x) = \sum_{m = 1}^{M} h_{m}(x)$, where each $h_{m}$ is a weak learner (regression tree), and the loss $L(y, F)$ is minimized by Gradient Boosting. For on-time and early-warning delay tasks, we use GradientBoostingClassifier with the same feature set and binary targets (on_time_label, is_delayed). Hyperparameters are given in Table 4.
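For readers less familiar with how the additive model is built, the textbook stagewise procedure is sketched below. This is the generic formulation, not a statement of the authors' exact configuration: the sample index $j$, the sample count $n$, and the constant initializer $F_{0}$ are notation introduced here, and any shrinkage (learning-rate) factor is assumed to be folded into $h_{m}$.

```latex
\begin{aligned}
F_{0}(x) &= \arg\min_{c} \sum_{j=1}^{n} L(y_{j}, c), \\
r_{jm} &= -\left[ \frac{\partial L(y_{j}, F(x_{j}))}{\partial F(x_{j})} \right]_{F = F_{m-1}},
  \qquad j = 1, \dots, n, \\
h_{m} &\approx \text{regression tree fitted to } \{(x_{j}, r_{jm})\}_{j=1}^{n}, \\
F_{m}(x) &= F_{m-1}(x) + h_{m}(x), \qquad m = 1, \dots, M.
\end{aligned}
```

Unrolling the last line gives $F_{M}(x) = F_{0}(x) + \sum_{m=1}^{M} h_{m}(x)$, which matches the additive form in the text up to the constant initial term.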

Table 4 specifies the hyperparameters used for training Gradient Boosting models for on-time and delay prediction. It shows the number of estimators, train/test split ratio, random state for reproducibility, and cross-validation strategy, which are essential for understanding model configuration and ensuring reproducibility.

3.5. Vendor Segmentation

Vendor (carrier) segmentation uses K-Means ($k = 4$) on per-partner aggregates: avg_cost, delay_rate, avg_rating, on_time_pct, and a composite reliability score (Equation (5)):

$$\text{reliability\_score} = 0.6 \times \text{on\_time\_pct} + 0.4 \times (\text{avg\_rating} \times 20). \tag{5}$$

Features are scaled (StandardScaler) before clustering. The choice of $k = 4$ aligns with interpretable roles (e.g., premium vs. budget vs. unreliable). Results are used to label carrier clusters (e.g., “Budget/Cheap”, “Unreliable”) and to inform the Sourcing Agent’s carrier recommendations.

The reliability_score is a weighted composite (on-time performance 60%, delivery rating 40%) on a 0–100 scale; StandardScaler normalizes it before clustering, so the raw range does not affect segmentation results.
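As a worked illustration of Equation (5), the short sketch below computes the composite score for two made-up carriers. The carrier names and their aggregates are hypothetical, and the comment about a five-point rating scale is an assumption inferred from the factor of 20.

```python
def reliability_score(on_time_pct: float, avg_rating: float) -> float:
    """Equation (5): 0.6 * on_time_pct + 0.4 * (avg_rating * 20)."""
    # avg_rating * 20 maps the rating onto 0-100, assuming a 5-point rating scale.
    return 0.6 * on_time_pct + 0.4 * (avg_rating * 20.0)


# Hypothetical per-partner aggregates (on-time %, average rating); not dataset values.
partners = {"Carrier A": (75.0, 4.0), "Carrier B": (71.7, 3.5)}
scores = {name: reliability_score(*aggregates) for name, aggregates in partners.items()}
```

For Carrier A the two terms are 0.6 × 75 = 45 and 0.4 × 80 = 32, so on-time performance contributes the larger share of the composite, as the 60/40 weighting intends.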

3.6. Carbon Cost of Intelligence

This study now turns to quantifying AI compute carbon, which is essential for understanding the full carbon footprint of AI-augmented supply chain systems. AI compute carbon is calculated per Equation (4) (see Equation (3) for transport carbon decomposition). Energy per inference is model-dependent (Table 5). Industry benchmarks report median values on the order of 0.24 Wh per text prompt with substantial recent efficiency gains [65]; comprehensive benchmarking of LLM inference energy and carbon is provided by Jegham et al. [66] and by foundational work on inference energy costs [67]. For route optimization we use local Gradient Boosting ($E \approx 0.00001$ Wh per inference); the multi-agent system requires 3 LLM calls (Orchestrator + Risk Agent + Sourcing Agent) using Claude models available in AWS Bedrock (Haiku 0.0006 Wh, Sonnet 0.0015 Wh, Opus 0.0027 Wh per inference). Thus, $C_{AI}$ varies by model and country; the often-cited 50 gCO₂ per optimization corresponds to a specific (model, country) pair and is replaced in the implementation in this study by Equation (4). To test robustness, we also consider a simple sensitivity check in which $E_{model}$ is doubled or tripled for a given LLM: even when energy per inference is increased 3×, per-optimization AI carbon remains below $10^{-2}$ gCO₂ in the India grid scenario, compared to transport carbon of $\sim 10^{5}$ gCO₂ for long-haul shipments. This confirms that our qualitative conclusion, namely that transport dominates single-shipment CASP and that AI carbon only becomes material at scale or under much heavier workloads than those modeled here, is not an artifact of the baseline energy assumptions.

Carbon ROI answers, “Is AI worth the carbon?” This study defines it as

$${ROI}_{carbon} = \frac{\Delta C_{transport}}{C_{AI}} \tag{6}$$

where $\Delta C_{transport} = C_{transport, baseline} - C_{transport, optimized}$. Net savings $= \Delta C_{transport} - C_{AI}$; when net savings $> 0$, AI reduces total carbon.
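Equation (6) and the net-savings rule can be checked numerically with a small sketch. The function names are hypothetical; the inputs reuse figures reported later in the paper for a clothing shipment (57,922 gCO₂ transport, a 5% saving, and roughly 0.003 gCO₂ of AI compute per optimization).

```python
def carbon_roi(baseline_g: float, optimized_g: float, ai_carbon_g: float) -> float:
    """Equation (6): transport carbon saved per gram of AI compute carbon."""
    if ai_carbon_g <= 0:
        raise ValueError("ai_carbon_g must be positive")
    return (baseline_g - optimized_g) / ai_carbon_g


def net_savings(baseline_g: float, optimized_g: float, ai_carbon_g: float) -> float:
    """Delta C_transport minus C_AI; a positive value means AI reduces total carbon."""
    return (baseline_g - optimized_g) - ai_carbon_g


baseline = 57_922.0            # gCO2, clothing shipment average reported in the paper
optimized = baseline * 0.95    # illustrative 5% transport saving
ai_carbon = 0.003              # gCO2 per optimization, as reported in the paper

roi = carbon_roi(baseline, optimized, ai_carbon)
savings = net_savings(baseline, optimized, ai_carbon)
```

With these inputs the ROI is on the order of 10⁵ to 10⁶, so the sign of the net savings is determined almost entirely by whether the optimization reduces transport carbon at all.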

3.7. Risk Scoring and Early-Warning Indicators

The risk score combines predicted delay probability with an impact multiplier and optional adjustment factors:

$$\text{RiskScore} = P(\text{delay}) \times M_{impact} \times \prod \text{adjustment\_factors}, \tag{7}$$

where $M_{impact}$ is package-type-specific (critical types have higher impact). Adjustment factors include stormy weather (+50%), long distance and heavy weight (+30%), and high-risk partner (+20%). Following the ISO 31000 risk matrix approach [68], we define four risk levels based on this composite risk score: CRITICAL ($> 5$), HIGH ($> 3$), MEDIUM ($> 1.5$), and LOW ($\leq 1.5$). These thresholds are framework parameters calibrated to the Indian logistics dataset; organizations should adjust them to their own risk tolerance.

The risk score is a composite severity index (not bounded by 0–1) that scales with package criticality and environmental factors; the downstream output used by agents is the discrete risk level, not the raw score. Buffer days are produced via a hybrid policy: the Risk Agent recommends a context-aware buffer from multi-source reasoning (weather, disruptions, distance, package criticality), while deterministic risk-level floors enforce minimum safety bounds.
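The scoring rule in Equation (7) and the four thresholds can be sketched as below. Reading "+50%" as a multiplicative factor of 1.5 is an assumption consistent with the product form; the factor keys, function names, and the impact multiplier of 3.0 in the usage note are hypothetical.

```python
import math

# Adjustment factors from the text, read as multipliers (+50% -> 1.5); an assumption.
ADJUSTMENTS = {
    "stormy_weather": 1.5,
    "long_distance_heavy_weight": 1.3,
    "high_risk_partner": 1.2,
}


def risk_score(p_delay: float, impact_multiplier: float, active_factors=()) -> float:
    """Equation (7): P(delay) x M_impact x product of active adjustment factors."""
    # math.prod over an empty sequence is 1, so no factors means no adjustment.
    adjustment = math.prod(ADJUSTMENTS[name] for name in active_factors)
    return p_delay * impact_multiplier * adjustment


def risk_level(score: float) -> str:
    """Map the composite score to the four discrete levels from the text."""
    if score > 5:
        return "CRITICAL"
    if score > 3:
        return "HIGH"
    if score > 1.5:
        return "MEDIUM"
    return "LOW"
```

For example, `risk_score(0.8, 3.0, ["stormy_weather", "high_risk_partner"])` gives 0.8 × 3.0 × 1.5 × 1.2 = 4.32, which `risk_level` maps to HIGH. Because the thresholds use strict inequalities, a score of exactly 5 stays HIGH and a score of exactly 1.5 stays LOW.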

The three early-warning indicators for disruption amplification are: (1) supplier concentration index, max_share / total, with a threshold of 40%, above which single-supplier dependence amplifies disruption risk; (2) geographic clustering, max_region_share / total, with a threshold of 60%, above which regional concentration increases weather-related delay correlation; and (3) cold-chain fragility, a function of the tier and cold-chain multiplier, where critical/cold-chain types (e.g., pharmacy) have higher spoilage risk. Quantified amplification risk is computed per indicator (0–1 scale when threshold is exceeded), and the overall score is overall_amp = (supplier_amp + geo_amp + cold_amp) / 3.
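The concentration indices and the averaged amplification score can be sketched as follows. The text does not give the exact mapping from an index to its 0–1 amplification value, so the linear rescaling above the threshold used here is an assumption, as are the function names and the example shares.

```python
def concentration(shares) -> float:
    """max_share / total for supplier or regional volumes."""
    total = sum(shares)
    if total <= 0:
        raise ValueError("shares must sum to a positive value")
    return max(shares) / total


def amplification(index: float, threshold: float) -> float:
    """Assumed mapping: 0 at or below the threshold, linear up to 1 at full concentration."""
    if index <= threshold:
        return 0.0
    return (index - threshold) / (1.0 - threshold)


def overall_amp(supplier_amp: float, geo_amp: float, cold_amp: float) -> float:
    """Unweighted mean of the three indicators, as stated in the text."""
    return (supplier_amp + geo_amp + cold_amp) / 3.0


# Hypothetical portfolio: shipment volumes per supplier and per region.
supplier_index = concentration([50, 30, 20])   # largest supplier holds half the volume
geo_index = concentration([40, 35, 25])        # largest region holds 40%
supplier_amp = amplification(supplier_index, threshold=0.40)
geo_amp = amplification(geo_index, threshold=0.60)
```

In this example only the supplier index exceeds its threshold, so the geographic term contributes zero. Because the overall score is a plain average, a single exceeded indicator can raise it by at most one third.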

3.8. Route Optimization

The route optimizer (used after the Sourcing Agent returns carrier options) evaluates all routes with cost and on-time predictors, applies the package-type-specific on-time threshold (e.g., ≥99% for critical, ≥85% for standard), and selects the best route based on the objective. For critical and high-value types, the objective is $\min \text{cost}$, subject to on-time ≥ threshold. For standard types, it is $\min \text{total carbon}$ (transport + AI), subject to on-time ≥ threshold. Routes that do not meet the SLA are filtered out when possible; if none meet it, the route with the highest predicted on-time delivery percentage is chosen. This constraint-aware selection aligns with the stakes tiers defined in Section 3.2.
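The selection rule described above (SLA filter, tier-dependent objective, and a best-effort fallback) can be sketched in a few lines. The `Route` fields, the `select_route` name, and the candidate values are hypothetical; the predicted cost, on-time percentage, and carbon are assumed to be supplied by upstream models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    carrier: str
    cost: float            # predicted cost (Rs.)
    on_time_pct: float     # predicted on-time percentage
    total_carbon_g: float  # transport + AI carbon (gCO2)


def select_route(routes, on_time_threshold: float, objective: str) -> Route:
    """Pick the cheapest or lowest-carbon route that meets the on-time threshold."""
    if objective not in ("cost", "carbon"):
        raise ValueError("objective must be 'cost' or 'carbon'")
    if not routes:
        raise ValueError("no candidate routes")
    feasible = [r for r in routes if r.on_time_pct >= on_time_threshold]
    if not feasible:
        # Fallback from the text: no route meets the SLA, so take the most punctual one.
        return max(routes, key=lambda r: r.on_time_pct)
    if objective == "cost":
        return min(feasible, key=lambda r: r.cost)
    return min(feasible, key=lambda r: r.total_carbon_g)


# Hypothetical candidates for illustration only.
candidates = [
    Route("Carrier A", cost=900.0, on_time_pct=99.2, total_carbon_g=150_000.0),
    Route("Carrier B", cost=850.0, on_time_pct=90.0, total_carbon_g=60_000.0),
    Route("Carrier C", cost=870.0, on_time_pct=86.0, total_carbon_g=55_000.0),
]
```

With these candidates, a critical shipment (`on_time_threshold=99.0`, `objective="cost"`) has only Carrier A as a feasible option, while a standard shipment (`on_time_threshold=85.0`, `objective="carbon"`) selects Carrier C. This illustrates why tight SLAs leave little room for carbon optimization.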

3.9. Scenario-Based Systems Analysis

This study evaluates AI compute carbon across multiple grid carbon scenarios reflecting different energy-transition profiles (e.g., Norway 20, France 56, UK 193, Germany 350, USA 386, China 555, India 708 gCO₂/kWh) [4]. For each scenario, we calculate AI compute carbon by model and country (Section 3.6), showing ∼35× variation between Norway and India (708/20 ≈ 35). This study also calculates CASP across package types and quantifies governance lever effectiveness for carbon reduction.

3.10. Implementation: Multi-Agent System

The research implementation supports a multi-agent architecture (Orchestrator → Risk Agent → Sourcing Agent → Optimizer + Carbon) that mirrors the layered system-of-systems design. Figure 2 shows the full agent architecture with tools, JSON outputs (o1 to o5), and external data sources: the Orchestrator (LLM #1) has five tools (extract_features, risk_agent_tool, sourcing_agent_tool, run_optimization, carbon_analysis); the Risk Agent (LLM #2) has four tools (weather_api, news_api, web_search, calculate_risk_score) and outputs JSON o1 (risk_level, delay_prob, buffer_days); the Sourcing Agent (LLM #3) has four tools (distance_api, routes_lookup, web_search, get_carrier_options) and outputs JSON o2 (carrier_options); and the Optimizer + Carbon (local) processes these inputs and outputs JSON o3 to o5 (best_route, total_carbon, CASP, early_warning). External APIs (OpenWeatherMap, NewsAPI, OpenRouteService) provide real-time data. For cost-aware decision support, recommended shipment cost is compared to an industry benchmark (e.g., from web search or carrier averages), with efficiency reported as below or above the benchmark.

Figure 2 details the complete multi-agent system flow, showing how the Orchestrator coordinates the Risk and Sourcing Agents, each with their specific tools and external API integrations. The structured JSON outputs (o1 to o5) demonstrate the data flow from agent reasoning through deterministic optimization to final recommendations.

4. System Architecture and Implementation

4.1. Overview and Architecture

This study now provides detailed implementation specifications for the multi-agent system. The implementation comprises three LLM-based agents (Agent 1: Orchestrator; Agent 2: Risk Agent; Agent 3: Sourcing Agent) plus a semantic package-type classifier and deterministic backend engines (predictive models, optimization, carbon service). The agent framework follows a model-driven, tool-augmented design supported by Strands Agents (Amazon Bedrock, LiteLLM) [41].

Table 6 breaks down each agent’s responsibilities, showing which LLM powers each agent, what tools it uses, and the data flow from inputs to outputs. It clarifies how Agent 1 (Orchestrator) coordinates the pipeline while Agent 2 (Risk Agent) and Agent 3 (Sourcing Agent) handle specialized tasks.

Table 7 catalogs all tools available in the system, showing their functions, which agents use them, and whether they are API-based, local computations, or ML models. It provides a complete reference for understanding the tool ecosystem that enables agent functionality.

Table 6 and Table 7 provide a comprehensive overview of the agent architecture and tool ecosystem.

4.2. Communication Protocol and Design Patterns

Communication protocol: Agents exchange structured JSON (e.g., features_json, risk_assessment_json, carrier_options_json). The flow is sequential: Agent 1 (Orchestrator) → extract features → Agent 2 (Risk Agent) → Agent 3 (Sourcing Agent) → run_optimization → carbon_analysis. Error handling: if an API fails (e.g., weather), Agent 2 (Risk Agent) fuses whatever data is available (e.g., web search only) and still calls calculate_risk_score_tool with a canonical weather value.

Prompt engineering: Each agent has a role-based system prompt that defines its task, lists its tools, and specifies a workflow (e.g., “Gather from both API and web search; fuse; call calculate_risk_score_tool; return JSON”). Outputs are constrained to JSON where needed so that Agent 1 (Orchestrator) can parse carrier options and pass them to the optimizer. Adaptive query generation is required: Agent 1 (Orchestrator) generates risk_queries_json and sourcing_queries_json from the user request and features (e.g., “Mumbai weather today”, “Delhivery rates Mumbai Delhi 2025”).

Table 7. Tool catalog: Function and type (API / Local / ML), grouped by agent.

Tool	Function	Type
Agent 1: Orchestrator
`extract_features`	Semantic package-type classification	Local + LLM
`risk_agent`	Run Risk Agent (weather, news, risk score)	API + LLM
`sourcing_agent`	Run Sourcing Agent (carriers, benchmark)	API + LLM
`run_optimization`	Route optimization (cost/carbon, SLA filter)	Local (ML)
`carbon_analysis`	Carbon & governance analysis	Local
Agent 2: Risk Agent
`weather_api`	OpenWeatherMap for city weather	API
`news_api`	NewsAPI for disruption headlines	API
`web_search` *	Web search (risk & sourcing queries)	API
`calculate_risk_score`	Risk score & early-warning	Local (ML)
Agent 3: Sourcing Agent
`distance_api`	OpenRouteService distance	API
`routes_lookup`	Local routes (distance, region, metro)	Local
`web_search` *	(shared with Agent 2)	API
`get_carrier_options`	Carrier options (cost, on-time, carbon)	Local (ML)

* web_search is shared by Agent 2 (Risk Agent) and Agent 3 (Sourcing Agent).

LLM and API cascade: This study uses AWS Bedrock Claude-3-Sonnet for Agent 1 (Orchestrator), Agent 2 (Risk Agent), and Agent 3 (Sourcing Agent), chosen for structured reasoning and tool use. Agent 2 (Risk Agent) resolves weather via a cascade: OpenWeatherMap API → NewsAPI (disruption context) → Web search → local fallback (e.g., “clear”). This ensures that even when APIs are unavailable, the pipeline still returns a valid risk assessment.
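The cascade-with-fallback pattern can be sketched generically as below. This is an illustration of the pattern only: the provider callables are hypothetical stand-ins and do not call OpenWeatherMap, NewsAPI, or any real search service.

```python
from typing import Callable, Iterable

# A provider takes a city name and returns a weather condition string.
WeatherProvider = Callable[[str], str]


def resolve_weather(city: str, providers: Iterable[WeatherProvider], fallback: str = "clear") -> str:
    """Try each provider in order; return the first non-empty answer, else the fallback."""
    for provider in providers:
        try:
            condition = provider(city)
        except Exception:
            # Any provider failure (timeout, quota, parse error) moves to the next source.
            continue
        if condition:
            return condition
    return fallback


def failing_api(city: str) -> str:
    """Hypothetical stand-in for an unavailable weather API."""
    raise ConnectionError("service unavailable")


def search_stub(city: str) -> str:
    """Hypothetical stand-in for a web-search based lookup."""
    return "rainy"
```

Here `resolve_weather("Mumbai", [failing_api, search_stub])` falls through to the second source, and an empty or fully failing provider list yields the canonical default. This is what allows the downstream risk-scoring call to always receive a valid weather value.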

Software and data sources: Implementation is in Python 3.12 with Strands Agents [41] (agent framework), scikit-learn (Gradient Boosting, K-Means, preprocessing), and boto3 (Bedrock). Data sources are the Delivery Logistics dataset [11] (25K records), reference data files, and external APIs (OpenWeatherMap, NewsAPI, OpenRouteService, optional web search). The reference data files are loaded as follows: config/carriers.py loads data/reference/carriers.json (carrier profiles from industry reports and carrier public information); config/routes.py loads data/reference/cities.json and data/reference/routes.csv [51] (route distances from Google Maps v.25 Distance Matrix API); config/grid_carbon.py loads data/reference/grid_carbon.json [10,52,53] (grid carbon intensities from EPA eGRID [52], Electricity Maps [53], and Kaur et al. [10]) and data/reference/ai_model_energy.json [10,37] (AI model energy consumption from Patterson et al. [37] and Kaur et al. [10]); and config/vehicle_emissions.py loads data/reference/vehicle_emissions.csv [54,55,56,57] (vehicle emission factors using India sources for two-wheelers and ICE freight, with a UK EV-van proxy).

Table 8 lists all data sources used by the pipeline, including the main dataset [11], reference data files (stored in data/reference/ and loaded by configuration files in config/) with proper citations to original sources [10,24,25,37,51,52,53,54,55,56,57], and external APIs. It provides transparency about data provenance and helps readers understand what information feeds into the multi-agent system. Reference values are compiled from published datasets, government reports, and peer-reviewed literature, with provenance files documented in data/reference/README.md.

4.3. Agent 1: Orchestrator

Agent 1: Orchestrator is the central coordinator of the multi-agent pipeline. It receives natural language user queries, extracts shipment features (origin, destination, package type, weight), and orchestrates the sequential execution of specialized agents and optimization tools. The Orchestrator decides the tool execution order, invokes Agent 2 (Risk Agent) for risk assessment and Agent 3 (Sourcing Agent) for carrier options, runs route optimization, performs carbon analysis, and synthesizes the final recommendation. Its system prompt defines a step-by-step pipeline workflow, ensuring structured reasoning and deterministic tool invocation. Figure 3 shows the Orchestrator’s input, tool sequence, and output contract.

The Orchestrator uses five main tools: extract_features (semantic package-type classification), risk_agent_tool (invokes Agent 2), sourcing_agent_tool (invokes Agent 3), run_optimization (route optimization with SLA filtering), and carbon_analysis (carbon footprint and governance recommendations). It generates adaptive queries (risk_queries_json, sourcing_queries_json) from user input to enable context-aware API and web searches by downstream agents. The final output is a structured JSON recommendation containing carrier, cost, predicted on-time percentage, total carbon, and risk level.

4.4. Agent 2: Risk Agent

Agent 2: Risk Agent specializes in delivery risk assessment by fusing multiple data sources (weather APIs, news APIs, web search) and computing risk scores using the early-warning system. It receives features_json and risk_queries_json from Agent 1 (Orchestrator), adaptively queries external APIs and web sources, fuses the results, and invokes calculate_risk_score_tool, which uses a trained Gradient Boosting classifier to predict delay probability and compute risk indicators. Figure 4 details the Risk Agent data flow and structured output fields.

The Risk Agent employs a cascading fallback strategy: OpenWeatherMap API → NewsAPI (disruption context) → Web search → local fallback. This ensures robust operation even when APIs are unavailable. Its system prompt instructs it to gather data from both API and web sources, fuse the information, call the risk-scoring tool, and return structured JSON containing risk_level (LOW/MEDIUM/HIGH), delay_probability, recommended_buffer_days, and risk_factors (list of identified risks). The output feeds into Agent 3 (Sourcing Agent) and the optimizer to inform carrier selection and route planning.
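A downstream consumer might validate the Risk Agent's JSON before using it. The sketch below is one hedged way to do so, using the four field names listed above. The validation rules, the `parse_risk_assessment` name, and the decision to also accept the CRITICAL level defined in Section 3.7 are assumptions, not the authors' implementation.

```python
import json

REQUIRED_FIELDS = ("risk_level", "delay_probability", "recommended_buffer_days", "risk_factors")
# LOW/MEDIUM/HIGH are listed for the agent output; CRITICAL is also accepted
# here because Section 3.7 defines it (an assumption).
KNOWN_LEVELS = {"LOW", "MEDIUM", "HIGH", "CRITICAL"}


def parse_risk_assessment(raw: str) -> dict:
    """Parse and sanity-check a risk assessment; raise ValueError on any problem."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("risk assessment must be a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if data["risk_level"] not in KNOWN_LEVELS:
        raise ValueError(f"unknown risk_level: {data['risk_level']!r}")
    if not 0.0 <= float(data["delay_probability"]) <= 1.0:
        raise ValueError("delay_probability must lie in [0, 1]")
    if not isinstance(data["risk_factors"], list):
        raise ValueError("risk_factors must be a list")
    return data


# Hypothetical agent output for illustration.
example = '{"risk_level": "HIGH", "delay_probability": 0.62, "recommended_buffer_days": 2, "risk_factors": ["stormy weather"]}'
assessment = parse_risk_assessment(example)
```

Failing fast on malformed output keeps errors at the agent boundary rather than letting them surface later inside the optimizer.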

4.5. Agent 3: Sourcing Agent

Agent 3: Sourcing Agent identifies viable carrier options by reasoning over risk assessments, service level agreements (SLAs), and industry benchmarks. It receives features_json, risk_assessment_json from Agent 2 (Risk Agent), and sourcing_queries_json from Agent 1 (Orchestrator). The agent uses distance APIs, local route lookups, web search for industry benchmarks, and get_carrier_options_tool, which leverages vendor segmentation and carrier profiles to return options matching package-type constraints and risk tolerance. Figure 5 shows how candidate carriers are assembled before optimization.

The Sourcing Agent’s system prompt guides it to consider risk levels, SLA requirements (e.g., ≥99% on-time for critical types), carrier reliability from vendor segmentation, and industry benchmark rates. It returns a structured JSON array of carrier_options, each containing a carrier name, estimated cost, predicted on-time percentage, and carbon footprint. This array is passed to the route optimizer, which selects the best route based on the optimization objective (cost minimization for critical/high-value types, carbon minimization for standard types) while respecting SLA thresholds.

5. Experimental Evaluation and Results

We now present our comprehensive experimental results evaluating the multi-agent system performance in delay prediction, on-time forecasting, vendor segmentation, carbon assessment, and optimization capabilities.

5.1. Delay Prediction

Five-fold stratified cross-validation results are presented in Table 9: F1 = 0.954 ± 0.003, precision = 0.939 ± 0.004, and recall = 0.971 ± 0.004, confirming stability across folds. Cross-validation results are reported alongside the single-split results throughout. This model is used by the Risk Agent to compute delay probability and risk score.

Table 9 reports cross-validation performance metrics for both delay and on-time classification, showing mean and standard deviation across five folds. The low standard deviations indicate stable model performance across different data splits.

The delay prediction model (GradientBoostingClassifier; early-warning risk scoring) achieves strong performance on the held-out test set (5000 samples, 26.7% delay rate). Table 10 reports precision, recall, and F1 for the Delayed class.

Table 10 shows precision, recall, and F1-scores for both On-Time and Delayed classes, along with overall accuracy. The high F1-score (0.956) for the Delayed class demonstrates the model’s ability to identify delayed shipments while maintaining high precision and recall.

Table 11 gives the confusion matrix.

Table 11 shows the actual versus predicted classifications, breaking down true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). The low false negative rate (38 FN out of 1334 actual delays) indicates that the model effectively identifies delayed shipments.

Table 12 reports delay rate (fraction of deliveries delayed) by package type from the same dataset [11] (25,000 records). Pharmacy and groceries show slightly higher delay rates (27.5% and 27.4%) than furniture (24.8%); the model uses these distributions along with partner and service-tier features for prediction.

Table 12 shows delay rates broken down by package type, revealing that pharmacy and groceries have slightly higher delay rates than other categories. These per-type statistics inform the model’s predictions and help explain why certain package types may require different risk assessment strategies.

To deliver on our methodological promise to characterize where and why forecasts fail, we report feature importance for both the early-warning delay model and the predictive on-time model in Table 13.

5.2. Feature Importance: Delay and On-Time Prediction

Delivery rating accounts for ∼82% of the explanatory power in both models: the models rely heavily on historical or expected partner performance (rating) to predict delay and on-time percentage. Delivery mode (express, two-day, standard, same-day) contributes ∼16% combined in the delay model and ∼15% in the on-time model, with express alone ∼11%; service tier is the second major driver. Distance contributes ∼1%; weather (stormy, rainy, foggy, cold) contributes <1% each. So in this dataset [11], both delay and on-time prediction are driven primarily by who delivers and which service tier is involved, not by distance or weather alone. This implies that forecasts are most uncertain when a rating is missing or when the portfolio includes a mix of many delivery modes; weather and distance add secondary signals for “where and why” forecasts may fail under disruption.

5.3. On-Time, Cost, and Carbon Prediction

Predictive analytics trains two gradient boosting models: a GradientBoostingRegressor for cost prediction and a GradientBoostingClassifier for on-time classification. On-time classification yields accuracy = 0.9778, F1 = 0.9848, and MAE = 3.43% on the single split (MAE computed against predicted on-time probability in percent). Five-fold cross-validation (Table 9) gives accuracy = 0.976 ± 0.002 and F1 = 0.984 ± 0.002, indicating stable performance across folds. Cost prediction attains R² = 0.9998 and MAE = Rs.5.30 (single split; CV R² = 0.9998 ± 0.0000, MAE = Rs.5.40 ± Rs.0.07). Transport carbon is computed deterministically via Equation (3) (distance × vehicle emission factor × cold-chain multiplier $\lambda_{cc}$), so it does not require an additional ML regressor and should be interpreted as physics-based rather than an ML fit. Table 13 summarizes the top drivers for both models.
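The deterministic transport-carbon product described above can be written directly. In this sketch the cold-chain multipliers are those stated in the dataset description; the emission factor value and its unit (gCO₂ per km) are assumptions for illustration, since the actual factors live in the reference data files.

```python
# Cold-chain multipliers as stated in the paper; every other type uses 1.0.
COLD_CHAIN_MULTIPLIER = {"pharmacy": 2.5, "groceries": 2.0}


def transport_carbon_g(distance_km: float, emission_factor_g_per_km: float, package_type: str) -> float:
    """Distance x vehicle emission factor x cold-chain multiplier (lambda_cc)."""
    lambda_cc = COLD_CHAIN_MULTIPLIER.get(package_type, 1.0)
    return distance_km * emission_factor_g_per_km * lambda_cc


# Hypothetical 1000 km trip with an assumed factor of 60 gCO2/km.
pharmacy_g = transport_carbon_g(1000.0, 60.0, "pharmacy")
clothing_g = transport_carbon_g(1000.0, 60.0, "clothing")
```

For identical distance and vehicle, the pharmacy shipment carries 2.5 times the transport carbon of the clothing shipment. The multiplier is the only package-type-dependent term in the product.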

The forecast failure of the on-time model varies by weather condition and by delivery partner. MAE ranges by weather condition from about 5.2% (foggy) to 6.4% (clear) and by partner from 5.5% to 6.2%, indicating that forecast failure is somewhat higher for certain weather and partner combinations, although those features have lower importance overall. This finding is consistent with the importance analysis determination that rating and mode dominate, while weather and partner add context.

5.4. Segmentation Results

This study now analyzes vendor segmentation results, which help the Sourcing Agent categorize carriers for recommendation. K-means clustering ($k = 4$) on the nine delivery partners yields a silhouette score of 0.4517 (moderate separation). Figure 6 visualizes the nine vendors in (avg cost, on-time %) space, with carrier names labeled and colored by cluster. Clusters are labeled by dominant behavior: Cluster 0 (Amazon, Ekart, Shadowfax; avg cost Rs.869, on-time 72.7%) is labeled standard performance; Cluster 1 (Delhivery, FedEx; avg cost Rs.853, on-time 75.0%) is labeled budget/value; Cluster 2 (BlueDart, DHL, Ecom; avg cost Rs.868, on-time 73.4%) is labeled mid-range; and Cluster 3 (XpressBees; avg cost Rs.868, on-time 71.7%) is labeled lower reliability. Average cost ranges from Rs.853 to Rs.869, and on-time rates from 71.7% to 75.0%. The limited spread reflects relatively homogeneous carrier performance in the Indian logistics dataset [11]; the segmentation remains useful for the Sourcing Agent to reason about carrier tiers.

Figure 6 visualizes vendor clusters in the cost-performance space, showing how carriers group together based on average cost and on-time percentage. The clustering reveals four distinct carrier segments that inform sourcing decisions: Budget/Value (Delhivery, FedEx) offers the best balance of cost and reliability, while Standard Performance (Amazon, Ekart, Shadowfax), Mid-Range (BlueDart, DHL, Ecom), and Lower Reliability (XpressBees) represent different performance tiers.

5.5. Cross-Package-Type Carbon Analysis

Table 14 presents carbon intensity and CASP scores across nine package types (Indian logistics dataset [11]; India’s grid scenario, 708 gCO₂/kWh). Critical types (pharmacy, groceries) with cold chains show higher total carbon; standard types (clothing, cosmetics) show lower carbon and higher CASP due to flexible routing.

Table 14 presents the complete carbon breakdown (transport with cold-chain multiplier $\lambda_{cc}$ embedded per Equation (3), AI compute) and CASP scores for all nine package types, computed as averages across all records in the Indian logistics dataset [11]. It demonstrates how cold-chain multipliers (2.0–2.5×) increase transport carbon for critical types like pharmacy and groceries compared to standard types, while standard types achieve higher CASP due to routing flexibility and absence of refrigeration overhead ($\lambda_{cc} = 1.0$). AI compute carbon for runtime CASP uses three LLM calls (Claude-3-Sonnet), totaling ∼0.003 gCO₂ per optimization for India’s grid (708 gCO₂/kWh); the additional local ML prediction term is negligible at this scale.

Results reveal substantial variation in total carbon between critical types with cold-chain multipliers (e.g., pharmacy 148,437 gCO₂ with $\lambda_{cc} = 2.5$) and standard types (e.g., clothing 57,922 gCO₂ with $\lambda_{cc} = 1.0$), computed as averages across the dataset. Critical logistics achieve lower CASP despite high service targets; standard types achieve higher CASP due to absence of cold-chain overhead ($\lambda_{cc} = 1.0$) and greater routing flexibility.

5.6. Optimization Potential by Package Type

Figure 7 illustrates carbon optimization potential across representative package types, defined as achievable carbon reduction through route optimization while maintaining service-level targets. The carbon-versus-service trade-off frontier (Figure 8) is Pareto-optimal: improving service level typically requires accepting higher carbon (or cost), and vice versa [5,69].

Figure 7 quantifies achievable carbon reduction through route optimization for different package types, showing that critical types have limited optimization potential (4–12%), while standard types can achieve 35–40% reduction. This visualization helps readers understand which package types offer the greatest opportunities for carbon reduction.

Figure 8 plots the Pareto frontier showing the trade-off between carbon emissions and service level across all package types. It demonstrates that critical types cluster at high service levels (99%) with higher carbon, while standard types achieve lower carbon at acceptable service levels (85%), illustrating the fundamental trade-off in supply chain optimization.

Critical logistics (e.g., pharmacy) exhibit only 4% optimization potential, constrained by cold-chain and ≥99% on-time requirements that limit routing alternatives. In contrast, standard types (e.g., clothing) achieve up to 40% optimization potential through flexible routing, consolidated shipments, and tolerance for slower transport modes. This order-of-magnitude difference in optimization potential represents a key finding of this study: AI-enabled supply chain optimization has fundamentally different ROI across package types.

5.7. Carbon-Cost-of-Intelligence Results

This study reports carbon-cost-of-intelligence results (Section 3.6): the LLM × Country carbon matrix and an AI vs. transport comparison for a representative shipment. These results show (i) how grid intensity drives AI compute carbon across regions and (ii) that AI carbon per optimization is orders of magnitude smaller than transport carbon for a single shipment. Therefore, the main lever for total carbon emissions remains physical logistics, while CCI matters for scaling many inferences and carbon-aware placement of data centers.

5.7.1. LLM × Country Matrix

Table 15 gives AI compute carbon (mgCO₂ per single LLM inference) for three Claude models and seven countries, computed from the carbon-cost-of-intelligence layer as $C_{AI} = (E_{model} \times N_{inferences} / 1000) \times G_{country}$, with $E_{model}$ in Wh per inference and $G_{country}$ in gCO₂/kWh. Figure 9 plots the same data. Norway (grid 20 gCO₂/kWh) yields the lowest AI carbon; India (708 gCO₂/kWh) yields the highest. For a fixed model (e.g., Claude-3-Opus), the India/Norway ratio is ∼35× (708/20), illustrating that carbon-aware placement of inference, such as running in a low-carbon grid, can substantially reduce AI-related emissions when many optimizations are run. Note that each optimization requires three LLM calls (Orchestrator + Risk Agent + Sourcing Agent), so the total AI carbon per optimization is 3× the values shown in this table.

Table 15 presents AI compute carbon per single LLM inference across different Claude models and countries, showing how grid carbon intensity dramatically affects AI emissions. The ∼35× difference between Norway and India for the same model demonstrates the importance of carbon-aware inference placement for organizations running many optimizations. Each optimization requires three LLM calls, so multiply table values by 3 to get total AI carbon per optimization.
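The LLM × Country matrix can be recomputed from the per-inference energies (Section 3.6) and the grid intensities (Section 3.9) quoted in the paper. The sketch below is a plain restatement of the formula above; the dictionary and function names are hypothetical, and the model keys are shortened labels.

```python
# Energy per inference (Wh) and grid intensity (gCO2/kWh), as quoted in the paper.
ENERGY_WH = {"Haiku": 0.0006, "Sonnet": 0.0015, "Opus": 0.0027}
GRID_G_PER_KWH = {
    "Norway": 20, "France": 56, "UK": 193, "Germany": 350,
    "USA": 386, "China": 555, "India": 708,
}


def ai_carbon_mg(energy_wh: float, grid_g_per_kwh: float, n_inferences: int = 1) -> float:
    """C_AI = (E_model x N_inferences / 1000) x G_country, converted from g to mg."""
    grams = (energy_wh * n_inferences / 1000.0) * grid_g_per_kwh
    return grams * 1000.0


# One row per model, one column per country (mgCO2 per single inference).
matrix = {
    model: {country: ai_carbon_mg(energy, grid) for country, grid in GRID_G_PER_KWH.items()}
    for model, energy in ENERGY_WH.items()
}

# Three LLM calls per optimization (Orchestrator + Risk Agent + Sourcing Agent).
sonnet_india_per_optimization_mg = ai_carbon_mg(ENERGY_WH["Sonnet"], GRID_G_PER_KWH["India"], 3)
```

Since the formula is linear in grid intensity, the ratio between any two countries is the same for every model, and the per-optimization figure is the single-inference value multiplied by the number of LLM calls.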

5.7.2. AI vs. Transport Comparison

For a representative single shipment (e.g., clothing, 57,922 gCO₂ transport from Table 14, dataset average), one route-optimization run requires three LLM calls (Orchestrator + Risk Agent + Sourcing Agent) using Claude-3-Sonnet in India, adding ∼3.18 mgCO₂ ≈ 0.003 gCO₂ per optimization (3 × 1.06 mgCO₂ per inference from Table 15). Transport carbon dominates: AI is ∼0.000005% of transport for that shipment (0.003/57,922 ≈ 5 × 10⁻⁸). Using the carbon ROI formula (Section 3.6), if AI-assisted optimization saves 5% of transport (2896 gCO₂) at an AI cost of 0.003 gCO₂, the ROI is extremely high (on the order of 10⁶, with positive net carbon savings). Table 16 summarizes this comparison. The takeaway is that, for single-shipment decisions, AI compute carbon is negligible relative to transport; CCI becomes relevant when (a) many optimizations are run at scale, (b) inference is in a high-carbon grid, or (c) one wishes to compare placement (e.g., Norway vs. India) for sustainability reporting. Sensitivity check (2×/3× energy per inference): Using the same India scenario and three-agent-call optimization, total AI carbon increases from 0.00325 gCO₂ (1×) to 0.00650 gCO₂ (2×) and 0.00975 gCO₂ (3×). Recomputing CASP with Table 14’s transport values gives pharmacy 6.669496 → 6.669496 → 6.669496 and clothing 14.674907 → 14.674906 → 14.674905 (all on a $\times 10^{-6}$ scale), indicating negligible change at the single-shipment scale and preserving the qualitative conclusion that transport dominates.

Table 16 compares AI compute carbon (per optimization with three LLM calls) to transport carbon for a single representative shipment, demonstrating that transport carbon dominates by orders of magnitude. It shows that for single-shipment decisions, AI carbon is negligible, but becomes relevant at scale or when comparing inference placement across countries. The effect of grid carbon intensity is shown in Figure 9.

Grid carbon intensity, which is substantially higher in India than in Norway, emerges as the dominant driver of AI compute emissions for a fixed model/inference workload.

5.8. Governance Lever Effectiveness

This study now examines how different governance strategies can reduce carbon emissions across package types. Table 17 quantifies the effectiveness of three governance levers for carbon reduction.

Table 17 quantifies carbon reduction potential from three governance levers (sourcing rules, buffer policies, compute policies) across different package types. This lever framing follows the prior logistics governance literature on carrier/3PL selection and reliability criteria [70,71], inventory-buffer and service-level control [72], and AI energy/carbon accounting for compute-policy decisions [10,37]. It shows that sourcing rules provide the largest reductions, especially for standard types like clothing (18%), while critical types like pharmacy have limited potential (2%) due to service constraints.

Figure 10 illustrates the three governance levers and their typical carbon reduction ranges. Sourcing portfolio rules (e.g., preferring regional suppliers, consolidating shipments) provide the largest carbon reduction, particularly for standard package types [70,71]. Buffer policies (strategic inventory positioning) offer secondary benefits by reducing expedited shipments [72]. AI compute policies provide minimal impact (<0.1%) on total carbon, as physical transport dominates the carbon footprint [5,10,37]. EV-vehicle-switching governance is currently a last-mile lever only in India; for intercity freight (the majority of high-stakes pharmacy/cold-chain routes), the EV governance lever yields 0% improvement because no EV truck option exists at present.

Governance results emphasize that carbon reductions primarily come from sourcing and buffer decisions, while compute-policy choices have smaller effects because transport carbon dominates single-shipment totals.

5.9. Regional Energy-Transition Impact

This study now analyzes how regional energy-transition progress affects AI compute carbon intensity. Table 15 (Section 3.6) and Figure 9 present AI compute carbon per single LLM inference across different Claude models and countries, showing 20× variation between Norway (0.03 mgCO₂ for Claude-3-Sonnet) and India (1.06 mgCO₂ for Claude-3-Sonnet). This demonstrates how grid carbon intensity dramatically affects AI emissions, though for single-shipment decisions, AI compute carbon (0.003 gCO₂) is negligible compared to transport carbon (148,437 gCO₂), so total CASP is dominated by transport and shows minimal variation across countries. The CCI layer becomes decision-relevant at scale (many optimizations) or when comparing inference placement strategies.
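The dependence of AI compute carbon on grid intensity can be illustrated with a minimal sketch. The grid intensities are those quoted in Section 7.3; the energy per inference (1.5 × 10⁻⁶ kWh) is an assumption back-derived from the 0.03 mgCO₂ Norway figure above, and the function names are hypothetical rather than taken from the project code.

```python
# Grid carbon intensities in gCO2/kWh, as quoted in Section 7.3.
GRID_INTENSITY_G_PER_KWH = {"Norway": 20.0, "United States": 386.0, "India": 708.0}

# ASSUMPTION: energy per Claude-3-Sonnet inference, back-derived from the
# 0.03 mgCO2 (Norway) figure in the text; it is not a measured value.
ENERGY_PER_INFERENCE_KWH = 1.5e-6


def ai_carbon_mg(country: str, n_calls: int = 1) -> float:
    """AI compute carbon in mgCO2 for n_calls inferences run in `country`."""
    grams = ENERGY_PER_INFERENCE_KWH * GRID_INTENSITY_G_PER_KWH[country] * n_calls
    return grams * 1000.0


def ai_share_of_total(country: str, transport_g: float, n_calls: int = 3) -> float:
    """Fraction of total shipment carbon attributable to AI compute."""
    ai_g = ai_carbon_mg(country, n_calls) / 1000.0
    return ai_g / (ai_g + transport_g)


# Three LLM calls in India against the representative 148,437 gCO2 shipment.
share_india = ai_share_of_total("India", transport_g=148_437.0)
```

Under these assumptions, three calls in India come to about 3.2 mgCO₂ (0.003 gCO₂), a vanishing share of the representative shipment total.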

5.10. Early-Warning Indicators

Analysis of delay and disruption risk using early-warning risk scoring yields three early-warning indicators for disruption amplification:

1. Supplier Concentration Index: Portfolios with >40% single-supplier dependence exhibit higher disruption amplification.
2. Geographic Clustering: Supply chains with >60% regional concentration show higher weather-related disruption correlation.
3. Cold-Chain Fragility: Critical (e.g., pharmacy) supply chains with limited temperature buffer exhibit higher spoilage risk during disruptions.

Figure 11 illustrates the three early-warning indicators that help identify supply chain disruption risk: supplier concentration, geographic clustering, and cold-chain fragility. It shows the thresholds (40% and 60%) that indicate when portfolios are at higher risk of disruption amplification. Figure 12 reports computed values from early-warning risk scoring for pharmacy and clothing: supplier concentration and geographic clustering are below their thresholds (40% and 60%); cold-chain fragility yields amplification risks of 0.5 for pharmacy (critical/cold chain) and 0 for clothing (ambient).

Figure 12 shows the computed values of the three early-warning indicators for the two case-study portfolios (pharmacy: Mumbai–Delhi insulin; clothing: Mumbai–Delhi T-shirts). Both portfolios are below the risk thresholds; pharmacy shows some cold-chain fragility risk (25), while clothing shows none, demonstrating how package type affects disruption risk.
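The first two indicators can be sketched as simple share calculations. This is a minimal illustration, assuming that concentration is measured as the largest share of shipments held by one supplier or one region; the record layout and function names are hypothetical, and the cold-chain fragility score is omitted because its formula is not given in this section.

```python
from collections import Counter

SUPPLIER_THRESHOLD = 0.40  # >40% single-supplier dependence
REGION_THRESHOLD = 0.60    # >60% regional concentration


def max_share(labels):
    """Largest share held by any single label (0.0 for an empty list)."""
    if not labels:
        return 0.0
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def early_warning_flags(shipments):
    """shipments: list of dicts with 'supplier' and 'region' keys (assumed layout)."""
    supplier_share = max_share([s["supplier"] for s in shipments])
    region_share = max_share([s["region"] for s in shipments])
    return {
        "supplier_concentration": supplier_share,
        "supplier_flag": supplier_share > SUPPLIER_THRESHOLD,
        "geographic_clustering": region_share,
        "geographic_flag": region_share > REGION_THRESHOLD,
    }
```

A portfolio in which one supplier handles three of five shipments would be flagged on supplier concentration (60% > 40%) even if its regional spread is below the 60% threshold.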

5.11. Pipeline Runtime and Cost per Query

This study reports estimated pipeline runtime and cost-per-query for the full Orchestrator → Risk → Sourcing → optimization → carbon flow. The current implementation does not log wall-clock time per step in production; Table 18 and Table 19 provide estimates based on typical LLM latency (Claude-3-Sonnet), external API response times, and local ML/optimization runtimes.

Table 18 gives estimated time per pipeline step for a single user query, for instance, “Ship insulin from Mumbai to Delhi”. The Orchestrator invokes extract_features (local + LLM fallback for semantic classification), then the Risk Agent (LLM + weather/news/web + risk score tool), then the Sourcing Agent (LLM + distance/routes/web + carrier options), then run_optimization (local ML), and then carbon_analysis (local). LLM steps dominate: each agent call can take 5–20 s, depending on context length and tool use; external APIs add 1–5 s when used. Local steps (extraction, optimization, carbon) are at the sub-second scale. The total estimated runtime per query is approximately 45–90 s, with most variance from LLM and API latency.

Table 19 estimates cost per query using AWS Bedrock (Claude-3-Sonnet) list pricing (approximately $3 per 1 M input tokens and $15 per 1 M output tokens). A typical run uses three LLM agents (Orchestrator, Risk, Sourcing), each with system prompt + user/tool context; a conservative estimate is ∼15–25 K input tokens and ∼2–4 K output tokens per query, yielding approximately $0.05–0.10 per query. Local ML and APIs (when not billed per call) add no direct monetary cost. These estimates support comparison with manual or rule-based baselines (e.g., analyst time) and with other LLM-based systems.
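The token-based cost arithmetic can be written out as a short sketch. The prices and token ranges are those stated above; the function name is hypothetical. At the upper token counts, the list prices give a figure somewhat above the approximate range quoted, which is consistent with the token counts being a conservative estimate.

```python
INPUT_USD_PER_M = 3.0    # list price per 1 M input tokens (from the text)
OUTPUT_USD_PER_M = 15.0  # list price per 1 M output tokens (from the text)


def query_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one query given total input and output tokens across all agents."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


# Lower and upper ends of the conservative token estimate.
low_cost = query_cost_usd(15_000, 2_000)
high_cost = query_cost_usd(25_000, 4_000)
```

Because output tokens are five times more expensive than input tokens, limiting agent verbosity affects cost more than trimming prompt context of the same size.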

Table 18 breaks down the estimated pipeline runtime per step, showing that LLM agent calls dominate the total time (15–30 s each), while local ML and optimization steps take less than a second. The total estimated runtime of 45–90 s per query provides practical guidance for deployment planning.

Table 19 estimates the cost per query for the multi-agent system, showing that LLM calls (Orchestrator, Risk, Sourcing Agents) dominate costs at approximately $0.05–0.10 per query. This cost structure helps organizations evaluate the economic feasibility of deploying the system at scale.

6. Case Studies: Pharmaceuticals and Fashion Shipments

To demonstrate the practical application of the multi-agent system, this section presents two detailed case studies. They illustrate the end-to-end pipeline with two concrete runs on the same route (Mumbai–Delhi, 1400 km): a critical cold-chain shipment (insulin) and a standard logistics shipment (T-shirts, 20 kg). Both runs use the same Python backend and models as Section 5; JSON outputs are in code/casestudy_output.json and code/casestudy_fashion_output.json.

6.1. Case 1: Pharmacy Shipment—Mumbai–Delhi Insulin (Critical)

This study runs the Pharmacy Shipment—Mumbai–Delhi Insulin query. (1) Extraction: The semantic classifier maps “insulin” to package type pharmacy (critical, cold chain). Features: Origin Mumbai, destination Delhi, package type pharmacy, with defaults for weight and distance if not specified. (2) Risk assessment: The Risk Agent invokes the delay predictor and risk aggregation. Output: risk_level: LOW, delay_probability: 0.0016, recommended_buffer_days: 0, risk_factors: [“High-risk partner: delhivery (+20% risk)”]. (3) Sourcing: Distance 1400 km; three carrier options; features and risk passed to the optimizer. (4) Optimization: With priority carbon, best route: Delhivery, EV van; cost Rs.1509.30; predicted on-time 99.92%; total carbon 525,000 gCO₂. Early-warning: risk_level: LOW, delay_probability: 0.0002, risk_score: 0.0. (5) Carbon and governance: Greenest viable option (EV van), transport and cold-chain carbon, governance notes.

Representative JSON from the run. Risk assessment:

{"risk_level": "LOW", "delay_probability": 0.0016, "recommended_buffer_days": 0, "risk_factors": ["High-risk partner: delhivery (+20% risk)"]}

Optimization result:

{"best_route": {"delivery_partner": "delhivery", "vehicle_type": "ev van"}, "cost": 1509.3, "predicted_on_time_pct": 99.92, "total_carbon_gco2": 525000}

Figure 13 shows the Orchestrator-hub architecture, in which Agent 1 (Orchestrator) controls the reasoning loop through five sequential tool invocations within a single agent’s reasoning process. Unlike a linear pipeline, this layout shows that the Orchestrator (powered by the Strands model-driven loop) decides when and how to call each tool, with intelligent context propagation (e.g., risk context from step 2 passed to step 3). The dashed separators between tool calls indicate steps within one agent’s reasoning, not separate components. This visualization aligns with Section 4.2, emphasizing that the system uses LLM-based orchestration rather than hardcoded sequential scripts.

Table 20 summarizes the complete pipeline flow for the pharmacy shipment—Mumbai–Delhi insulin case study, showing outputs from each step (extraction, risk assessment, sourcing, optimization, early warning, carbon analysis). It demonstrates how the multi-agent system processes a critical cold-chain shipment from query to final recommendation.

6.2. Case 2: Clothing Shipment—Mumbai–Delhi T-Shirts (Standard)

This study runs the same route for a clothing logistics scenario: T-shirts, 20 kg, Mumbai to Delhi, representing the standard stakes tier (≥85% on-time, no cold chain). (1) Extraction: Package type clothing, origin Mumbai, destination Delhi, weight 20 kg. (2) Risk assessment: risk_level: LOW, delay_probability: 0.0016, recommended_buffer_days: 0. (3) Sourcing: Distance 1400 km; six carrier options before feasibility checks. (4) Optimization: With priority carbon, long-haul two-wheeler options are filtered by feasibility (>200 km), leaving five feasible candidates; best route: Ekart, EV van; predicted on-time 99.98%; total carbon 210,000 gCO₂. Early-warning: risk_level: LOW, delay_probability: 0.0002, risk_score: 0.0. (5) Carbon and governance: No cold chain; governance recommendations for standard logistics.

Representative JSON from the Clothing Shipment—Mumbai–Delhi T-Shirts run. Risk assessment:

{"risk_level": "LOW", "delay_probability": 0.0016, "recommended_buffer_days": 0, "risk_factors": ["High-risk partner: delhivery (+20% risk)"]}

Optimization result:

{"best_route": {"delivery_partner": "ekart", "vehicle_type": "ev van"}, "predicted_on_time_pct": 99.98, "total_carbon_gco2": 210000}

Table 21 summarizes the pipeline flow for the clothing shipment—Mumbai–Delhi T-shirts case study, showing how the system handles a standard logistics shipment differently from the critical cold-chain case. The standard type allows more carrier options than the critical case (6 vs. 3 before feasibility filtering), but long-haul two-wheelers are removed by feasibility rules.

6.3. Comparison of the Two Case Studies

To highlight the differences between critical and standard logistics, we now compare the two case studies side by side. Table 22 compares the two runs on the same route (Mumbai–Delhi, 1400 km). The critical (pharmacy) case yields higher total carbon (525,000 gCO₂) due to the cold chain and constrained carrier/vehicle choice (EV van); the standard (clothing) case yields 210,000 gCO₂ with an EV van and no cold chain. Note: Case-study values represent specific route optimizations (Mumbai–Delhi, 1400 km) with feasibility-aware vehicle filtering, while Table 14 shows dataset averages computed across all records in the Indian logistics dataset [11], which include various routes, distances, and vehicle types. The case-study route (Mumbai–Delhi, 1400 km with cold chain λ = 2.5) is substantially longer than the dataset average distance, which includes shorter intra-regional routes, explaining the higher case-study totals.
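The two case-study totals can be reproduced with a minimal sketch of the distance-based carbon calculation. The EV-van factor (150 gCO₂/km) is the proxy stated in Section 7.6 and λ = 2.5 is the cold-chain multiplier above; the bike factor (50 gCO₂/km) is inferred from the 70,000 gCO₂ example in Appendix A.1, and the function name is hypothetical.

```python
# Emission factors in gCO2/km. The EV-van value is the proxy from Section 7.6;
# the bike value is inferred from the Appendix A.1 example (70000 / 1400 km).
EMISSION_G_PER_KM = {"ev van": 150.0, "bike": 50.0}

COLD_CHAIN_LAMBDA = 2.5  # cold-chain multiplier used for the pharmacy case
AMBIENT_LAMBDA = 1.0     # no cold-chain overhead


def transport_carbon_g(vehicle: str, distance_km: float, cold_chain: bool = False) -> float:
    """Transport carbon in gCO2: emission factor x distance x cold-chain multiplier."""
    lam = COLD_CHAIN_LAMBDA if cold_chain else AMBIENT_LAMBDA
    return EMISSION_G_PER_KM[vehicle] * distance_km * lam


pharmacy_g = transport_carbon_g("ev van", 1400, cold_chain=True)   # insulin run
clothing_g = transport_carbon_g("ev van", 1400, cold_chain=False)  # T-shirt run
```

With the same vehicle and distance, the entire gap between the two cases comes from the cold-chain multiplier.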

The high predicted on-time percentages in both case studies (near 100%) should be interpreted as optimized route-level service outcomes, not as re-statements of the historical average on-time rate in the dataset (∼72–75% across partners). The Gradient Boosting models are trained on historical data, but in the case studies the optimizer explicitly selects carrier/vehicle/service combinations that satisfy stakes-tier constraints whenever feasible. Quantitatively, the Mumbai–Delhi pharmacy run retained 3/3 candidates after SLA filtering (0 rejected), while the clothing run retained 5/6 candidates after feasibility and SLA filtering (1 rejected by long-haul vehicle feasibility, 0 additional SLA rejections). This explains why optimized recommendations can achieve near-SLA or above-SLA performance even when the historical portfolio average is substantially lower.

Table 22 highlights the differences between critical and standard logistics on the same route. The Pharmacy Shipment—Mumbai–Delhi Insulin case (critical) requires a cold chain and achieves 99.92% on-time with higher carbon (525,000 gCO₂), while the Clothing Shipment—Mumbai–Delhi T-Shirts case (standard) achieves lower carbon (210,000 gCO₂) with no cold chain and 99.98% predicted on-time after feasibility-aware filtering. This contrast demonstrates why package-type-specific assessment and operational feasibility constraints are imperative from a global socio-technical perspective for maintaining balance, flexibility and optimization across items and routes.

7. Discussion

7.1. Implications for Systems Engineering

The results of this study demonstrate that treating supply chains as systems-of-systems reveals emergent trade-offs invisible to component-level analysis. The order-of-magnitude difference in optimization potential between critical types (e.g., pharmacy ∼4%) and standard types (e.g., clothing ∼40%) reflects fundamental constraints embedded in the socio-technical system: regulatory requirements, cold-chain physics where applicable, and customer expectations jointly determine the feasible optimization space. This finding aligns with Ackoff’s [1] observation that system performance cannot be optimized by optimizing individual components in isolation.

The per-package-type CASP metric enables category-level carbon assessment that reveals heterogeneity hidden by aggregate metrics. The substantial variation in CASP between critical types (e.g., pharmacy, 6.67 × 10⁻⁴) and standard types (e.g., documents, 16.72 × 10⁻⁴) demonstrates that product-category-specific assessment is essential for accurate carbon reporting and sustainability planning. This per-category approach is motivated by the CCI framework [10], which similarly demonstrated that domain-level breakdowns reveal efficiency variations masked by aggregate metrics.
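As a worked illustration, CASP can be computed for the two case-study runs. This sketch assumes CASP is the predicted on-time percentage divided by total carbon in gCO₂; Equation (1) is not reproduced in this section, so the form is an assumption, chosen because it matches the ~1.9e-4 pharmacy Mumbai–Delhi value mentioned in Appendix A.1.

```python
def casp(on_time_pct: float, total_carbon_g: float) -> float:
    """Service performance per unit carbon (ASSUMED form: on-time % / gCO2)."""
    if total_carbon_g <= 0:
        raise ValueError("total carbon must be positive")
    return on_time_pct / total_carbon_g


# Case-study values from Section 6.
casp_pharmacy = casp(99.92, 525_000)  # insulin, cold chain
casp_clothing = casp(99.98, 210_000)  # T-shirts, ambient
```

Both case-study values sit well below the dataset-average figures quoted above because the 1400 km route is much longer than the typical route in the dataset.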

From a comparative perspective, the findings of this study complement prior sustainability-oriented supply chain design studies [5,24,25] and AI carbon accounting work [10,37]. While earlier work typically optimized routing or facility location under fixed emission factors, the results of this study show that once package-type constraints and AI compute carbon are made explicit, the feasible carbon–service frontier tightens markedly for critical cold-chain categories and that governance levers such as sourcing rules and buffer policies have asymmetric impacts across stakes tiers. This underscores that sustainability-oriented logistics strategies cannot be evaluated in isolation from their socio-technical and energy-system context.

In the multi-agent system, a risk of error propagation exists, where failure or delay in one step can affect downstream steps. If the Risk Agent returns an incorrect or missing weather value, the risk score may be biased, and the reasoning of the Sourcing Agent may change. If the Sourcing Agent returns no carrier options, the optimizer has nothing to evaluate. We mitigate this risk with fallbacks (e.g., default weather “clear”, local routes when APIs fail) and deterministic backend tools (early-warning risk score and predictive models) that do not depend on LLM correctness. A full sensitivity analysis of cascading errors is left for future work.
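The fallback pattern described above can be sketched as a small wrapper. This is an illustrative assumption about how such a fallback might look, not the project's implementation; `weather_with_fallback` and the `fetch_weather` callable are hypothetical names.

```python
def weather_with_fallback(fetch_weather, city: str, default: str = "clear") -> dict:
    """Call a weather fetcher; on any failure, return the default condition.

    The error is kept in the result so that downstream steps can see that a
    fallback was used instead of treating the default as observed weather.
    """
    try:
        condition = fetch_weather(city)
        if not condition:
            raise ValueError("empty weather response")
        return {"condition": condition, "source": "api", "error": None}
    except Exception as exc:  # broad on purpose: any tool failure triggers the fallback
        return {"condition": default, "source": "default", "error": str(exc)}
```

Recording the source alongside the value keeps the fallback visible, in line with the Risk Agent prompt's requirement that tool errors must not be silent.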

7.2. Theoretical Implications

Theoretically, this study extends three strands of literature reviewed in Section 2. First, relative to system-of-systems and resilience work [1,2,8,16], CASP formalizes package-type-specific carbon–service frontiers and shows that emergent constraints differ markedly across critical and standard categories, which cannot be seen from aggregate metrics alone. Second, building on carbon-aware routing and sustainable supply chain design [5,24,25,26,27], we integrate cold-chain multipliers and AI compute carbon into a single socio-technical performance metric, clarifying when AI optimization improves or worsens total carbon. Third, in the context of agentic AI and multi-agent systems [38,39,40,41,42,43,44,45], we show how wrapping trained predictive models and carbon calculators as tools allows LLM-based agents to reason over quantitative performance and carbon outcomes, thereby enriching prior qualitative orchestration frameworks with a formal, measurable notion of carbon-adjusted performance.

7.3. Generalizability of the Results

The numerical findings in this paper are grounded in a single 25,000-record Indian logistics dataset and a specific carrier ecosystem, so they should not be interpreted as universal benchmarks. However, the underlying structure of the CASP metric (service performance per unit carbon), the decomposition of total carbon into transport and AI compute components, and the multi-agent orchestration pattern are intentionally model- and country-agnostic.

Generalization to other geographies primarily requires substituting region-specific inputs: alternative route networks and cost structures, country- or region-level grid carbon intensities (for example, 20 gCO₂/kWh in Norway vs. 386 in the United States vs. 708 in India), and locally relevant governance levers (such as modal shifts to rail in Europe or coastal shipping in East Asia). Similarly, while the case studies in this study focus on high-stakes Indian routes, the same framework can be instantiated for other critical supply chains, such as vaccine logistics in sub-Saharan Africa or cold-chain food distribution in the European Union. This study therefore views the present results as a proof of concept for a generalizable socio-technical design pattern, not as a fixed catalog of carbon values.

Without the AI pipeline, a planner would typically use fixed rules (such as choosing the cheapest carrier that meets the SLA) or manual lookup. Our analytic pipeline adds adaptive risk and sourcing reasoning, semantic package-type classification, and explicit carbon and early-warning outputs. The quantitative comparison (e.g., carbon with vs. without optimization, or time-to-recommendation vs. manual process) depends on the deployment context; the evaluation in this paper establishes that the delay predictor (F1 = 0.954 ± 0.003 over 5-fold CV), on-time predictor (accuracy = 0.976 ± 0.002 and F1 = 0.984 ± 0.002 over 5-fold CV), and CASP-based trade-offs are fit for use in the framework.

7.4. Governance Recommendations

Based on our governance lever analysis, we recommend a set of prioritized interventions translating the carbon–service–cost trade-offs into actionable policies. First, package-type-specific optimization should concentrate AI-enabled route and planning improvements on standard package categories (e.g., clothing, cosmetics), where the analysis indicates that approximately 35–40% carbon reduction is achievable, rather than on critical cold-chain or high-stakes categories, where binding constraints reduce the marginal return on optimization to <5%. Second, sourcing portfolio diversification should be operationalized through regional sourcing rules for ambient-temperature goods, reducing transport distance while preserving supplier resilience and continuity of supply. Third, energy-transition alignment suggests locating warehouses and distribution centers and, where feasible, placing compute workloads in regions with lower grid carbon intensity to improve CASP without compromising service levels. Fourth, early-warning monitoring should be institutionalized by deploying supplier concentration and geographic clustering indicators within the early-warning risk-scoring layer, enabling decision-makers to identify disruption amplification risk proactively and intervene before cascading failures propagate through the network.

7.5. Implications for Agentic AI Architecture

The implementation in this study demonstrates that wrapping trained ML models as agent tools provides three advantages over retrieval-only tool ecosystems. First, quantitative grounding: The Risk Agent’s delay predictions (F1 = 0.954) and the Sourcing Agent’s on-time estimates (accuracy = 0.976, F1 = 0.984) provide numerical confidence that text-based reasoning cannot match. Second, deterministic reproducibility: Given the same input features, the analytical engines always produce the same predictions, providing auditability that pure LLM reasoning cannot guarantee, which is a requirement for operational supply chain decision-making. Third, carbon accountability: Because the analytical engines are deterministic, their computational carbon (Equation (4)) can be precisely measured, enabling the carbon-cost-of-intelligence accounting that is central to this framework.

7.6. Limitations

This study focuses on per-package-type carbon assessment using average grid intensity and the Indian logistics dataset [11] (25,000 records, nine delivery partners). Transport carbon estimates rely on distance-based per-vehicle emission factors (gCO₂/km) rather than gCO₂/ton-km, with implicit assumptions about typical loading and average cargo weight and no explicit representation of backhaul or partial loads. Similarly, AI compute carbon is modeled using literature-based median energy-per-inference values and a small number of LLM calls per optimization. Very compute-heavy deployments or substantially different hardware could yield higher C_AI than reported here, even if transport still dominates. The EV-van emission factor (150 gCO₂/km) uses UK Government GHG 2024 as a proxy and can be updated upon availability of further India-specific EV LCV fleet data. Future work should extend CASP to incorporate Scope 3 emissions, real-time marginal grid intensity optimization, and validation across additional geographic and carrier ecosystems.

7.7. Future Work

Future work should extend CASP in several directions: (1) Scope 3 emissions incorporation from supplier operations for end-to-end carbon accounting, (2) real-time carbon optimization using marginal grid intensity forecasts to enable temporal load shifting, (3) portfolio-level aggregation using weighted harmonic means across package-type mixes, analogous to the CCI workload aggregation [10], and (4) validation across additional datasets and geographies. Production deployment with continuous monitoring of pipeline runtime and cost (Section 5.11) would provide precise per-step metrics.
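The portfolio-level aggregation in item (3) could be sketched as a weighted harmonic mean of per-package-type CASP values. The CASP values below are the dataset-average references quoted in Appendix A.1; the weights are hypothetical shipment shares chosen only for illustration.

```python
def weighted_harmonic_mean(values, weights):
    """Weighted harmonic mean: sum(w) / sum(w / x)."""
    if not values or len(values) != len(weights):
        raise ValueError("values and weights must be non-empty and of equal length")
    if any(v <= 0 for v in values) or any(w < 0 for w in weights):
        raise ValueError("values must be positive and weights non-negative")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("at least one weight must be positive")
    return total_weight / sum(w / v for v, w in zip(values, weights))


# Dataset-average CASP references (pharmacy, clothing) from Appendix A.1.
casp_values = [6.67e-4, 14.67e-4]
# HYPOTHETICAL portfolio mix: 20% pharmacy, 80% clothing shipments.
mix_weights = [0.2, 0.8]
portfolio_casp = weighted_harmonic_mean(casp_values, mix_weights)
```

The harmonic mean suits a ratio metric such as CASP because it weights categories by their carbon denominators, so a low-efficiency category pulls the portfolio value down more than an arithmetic mean would.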

8. Conclusions

This study introduces an integrated multi-agent framework that models supply chain system-of-systems dynamics for carbon-aware supply chain assessment. It integrates physical transport carbon, cold-chain overhead where applicable, and AI compute carbon into the carbon-adjusted supply chain performance (CASP) metric (Equation (1)), which measures service performance per unit carbon for each package type. The metric reveals that critical types with cold chains exhibit substantially lower carbon efficiency than standard types. The framework is implemented as an LLM-powered multi-agent system (Orchestrator, Risk, Sourcing) with semantic package-type classification, constraint-aware route optimization, and carbon-cost-of-intelligence accounting.

Empirical analysis across nine package types (three stakes tiers) using a 25,000-record Indian logistics dataset [11] yields concrete results: early-warning delay prediction achieves F1 = 0.954 ± 0.003 (five-fold stratified CV) and accuracy 0.976; predictive on-time classification achieves accuracy = 0.976 ± 0.002 and F1 = 0.984 ± 0.002 (five-fold CV), with single-split MAE = 3.43%; and vendor segmentation yields silhouette 0.45 and four interpretable clusters. Critical types with cold chains (e.g., pharmacy) show much higher total carbon and lower optimization potential (∼4%) than standard types (e.g., clothing, ∼40%). Per-package-type CASP varies from 6.67 × 10⁻⁴ (pharmacy, cold chain) to 17.13 × 10⁻⁴ (furniture), confirming that category-level assessment is essential. Regional grid intensity significantly affects AI compute carbon (20× Norway vs. India for Claude-3-Sonnet), although transport carbon dominates total CASP for single-shipment decisions.

A key finding of this study (that AI-enabled supply chain optimization has order-of-magnitude different ROIs across package types) has immediate practical implications. Organizations should prioritize optimization investments in standard package types where routing flexibility enables substantial carbon reduction, rather than applying uniform optimization strategies across heterogeneous portfolios. The CASP metric and the multi-agent architecture together enable package-type-specific assessment with carbon, service, and cost trade-off frontiers and early-warning indicators for disruption amplification.

By treating operational reliability as an emergent outcome of coupled energy, digital, and logistics dynamics, this work advances systems theory from an abstract principle to a practical tool for sustainability-oriented supply chain design.

Author Contributions

Conceptualization: E.P. and K.M.P.; Methodology: R.K., T.K., K.M.P. and E.P.; Software: R.K. and B.S.; Data Curation: R.K. and B.S.; Investigation: R.K., T.K. and B.S.; Formal Analysis: R.K., T.K., K.M.P. and E.P.; Visualization: R.K., B.S. and K.M.P.; Writing (Original Draft Preparation): R.K. and T.K.; Writing (Review and Editing): K.M.P. and E.P.; Project Administration and Supervision: K.M.P. and E.P. All authors contributed equally to the paper. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The source code, reference data, and reproduction scripts for the CASP framework are publicly available at https://github.com/anacodicAI-labs/casp-agent.git (accessed on 12 February 2026). The repository includes all analytical components (ml/ and analytics/), agent orchestration (agents/), service layer components (services/), reference data with provenance documentation (data/reference/), and scripts to reproduce all the paper’s figures and tables. The primary dataset (Delivery Logistics, 25,000 records) is available at [11]. A live demonstration of the multi-agent system is available at https://anacodicai.com/casp/ (accessed on 1 February 2026).

Acknowledgments

The authors thank the Boston University Metropolitan College Department of Administrative Sciences and Department of Computer Science for their research support.

Conflicts of Interest

The authors declare no conflicts of interest regarding the publication of this paper.

Appendix A. Agent Prompt Templates

The following are the system prompts used by the three LLM agents in the implementation (see agents/). Validation blocks were added to each prompt so that agents check feasibility of returned values (e.g., vehicle–distance compatibility, cost plausibility, and buffer-day consistency) and include explicit validation outputs (e.g., validation_summary or pipeline_sanity_flags).

Appendix A.1. Orchestrator (A1)

You are a supply chain optimization orchestrator that coordinates

specialized agents and tools. You read and understand the user query

and decide which tools or agents to call.

Your tools:

1. extract_features_tool(query) - Extract shipment features from

natural language. Returns features + defaults_used.

2. risk_agent_tool(features_json, risk_queries_json) - Assess delivery

risk via the Risk Agent. You MUST pass risk_queries_json: a JSON

array of adaptive search queries (e.g., weather, disruptions).

3. sourcing_agent_tool(features_json, risk_assessment_json,

sourcing_queries_json) - Get carrier options via the Sourcing Agent.

You MUST pass sourcing_queries_json: a JSON array of queries.

4. run_optimization_tool(features_json, carrier_options_json) - Run

route optimization. Use after sourcing_agent_tool.

5. carbon_analysis_tool(carrier_options_json, package_type,

optimization_result_json) - Carbon and governance analysis.

Guidelines:

- Always use the step-by-step path: extract_features_tool →

risk_agent_tool → sourcing_agent_tool → run_optimization_tool →

carbon_analysis_tool. Generate 4-6 risk queries and 2-4 sourcing

queries.

- Always summarize the outcome: recommended carrier, cost, on-time

probability, carbon, risk level, and any governance notes.

- Package types are classified semantically (e.g., insulin → pharmacy).

Validation after extract_features_tool:

- Check distance plausibility (e.g., Mumbai→Delhi ~1400 km;

Mumbai→Pune ~150 km; Delhi→Kolkata ~1500 km; Chennai→Bangalore ~350 km).

If a cross-state route is <100 km, add ANOMALY_DISTANCE.

- Check package-type semantics (e.g., insulin→pharmacy, shirt→clothing).

If package_type came from defaults but query has specific product cues,

add ANOMALY_PACKAGE_TYPE.

- Check weight plausibility for package_type.

- Add anomalies under validation_flags; do not stop pipeline.

Final pipeline sanity check after all tools:

- CASP range check (dataset-average references: pharmacy around 6.67e-4;

clothing around 14.67e-4). Note: a single case-study route can differ

(e.g., pharmacy Mumbai-Delhi example ~1.9e-4);

>50% deviation should be flagged.

- Carbon vs distance/vehicle plausibility (e.g., EV van 1400 km with

lambda=2.5 is ~525000 gCO2; bike 1400 km with lambda=1.0 is ~70000 gCO2).

- Enforce cost > 0 and on_time in [0,100].

- Flag any zero/default where a real value is expected.

- Include pipeline_sanity_flags in final summary.
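The final sanity checks in the Orchestrator prompt are applied by the LLM, but the same rules can be expressed deterministically. The following is a minimal sketch under that assumption; the function name and the flag strings are hypothetical, while the CASP references and the 50% deviation rule come from the prompt above.

```python
# Dataset-average CASP references quoted in the prompt.
CASP_REFERENCE = {"pharmacy": 6.67e-4, "clothing": 14.67e-4}
MAX_RELATIVE_DEVIATION = 0.5  # ">50% deviation should be flagged"


def pipeline_sanity_flags(package_type, casp_value, cost, on_time_pct):
    """Return a list of sanity flags (flag names are hypothetical)."""
    flags = []
    reference = CASP_REFERENCE.get(package_type)
    if reference is not None:
        deviation = abs(casp_value - reference) / reference
        if deviation > MAX_RELATIVE_DEVIATION:
            flags.append("CASP_DEVIATION_GT_50PCT")
    if cost <= 0:
        flags.append("INVALID_COST")
    if not 0 <= on_time_pct <= 100:
        flags.append("INVALID_ON_TIME")
    return flags
```

Applied to the pharmacy Mumbai–Delhi example (CASP of about 1.9e-4 against a 6.67e-4 reference), the deviation is roughly 71%, so the run is flagged for review, as the prompt anticipates for long single routes.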

Appendix A.2. Risk Agent (A2)

You are a supply chain risk assessment agent. You reason about

delivery risk by gathering data from multiple sources and fusing them.

Your tools:

1. weather_api_tool(city) - Get weather from OpenWeatherMap.

2. news_api_tool(query) - Search news for disruption context.

3. web_search_tool(queries) - Run web search (single query or JSON

array of queries).

4. calculate_risk_score_tool(package_type, weather_condition,

risk_factors_json, route_dict_json) - Compute risk using the

early-warning model; returns risk_level, delay_probability,

and min_buffer_days floor.

Workflow:

1. Gather from BOTH APIs and web search (union). Call weather_api_tool

for origin and destination, news_api_tool for disruption,

web_search_tool with the provided risk_queries.

2. Fuse and check consistency. Decide canonical weather_condition and

final list of risk_factors.

3. Call calculate_risk_score_tool with package_type, fused

weather_condition, risk_factors as JSON array, and minimal

route_dict (include origin, destination, weather_condition,

package_type, distance_km, etc.; use defaults where needed).

4. Return the risk assessment as JSON: risk_level, risk_score,

risk_factors, delay_probability, recommended_buffer_days,

buffer_rationale, warnings, alert_required.

Buffer recommendation policy:

- Reason over delay_probability, weather (origin/destination),

package criticality, route distance, and disruption signals.

- Guidelines:

0 days: delay_prob < 5% and no weather/disruption concerns

1 day: delay_prob 5-15% OR adverse weather OR long route (>500 km)

2 days: delay_prob 15-30% OR severe weather OR active disruption

3 days: delay_prob > 30% OR critical package with any risk signal

- For critical packages (pharmacy, groceries), add +1 day safety margin.

- Ensure recommended_buffer_days >= min_buffer_days.

- Include concise buffer_rationale. Final message must be valid JSON.

Validation after all tools return (before final JSON):

- delay_probability must be in [0.0, 1.0]; cap out-of-range values and add

validation_flag: delay_probability_capped.

- recommended_buffer_days must align with risk_level:

LOW: 0-1, MEDIUM: 1-2, HIGH: >=2, CRITICAL: >=3.

If HIGH/CRITICAL has 0 buffer days, add ANOMALY_BUFFER_TOO_LOW.

- weather_condition in output must match weather_api_tool evidence

(not assumptions). If weather is inferred from web fallback, include

weather_source=web_search_fallback.

- Tool errors must not be silent; include them in risk_factors.
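The validation rules at the end of the Risk Agent prompt can also be sketched as a deterministic post-check on the agent's JSON output. This is an illustrative sketch, not the project's code; the function name and the `buffer_risk_mismatch` flag are hypothetical, while `delay_probability_capped` and `ANOMALY_BUFFER_TOO_LOW` come from the prompt.

```python
# Allowed buffer-day ranges per risk level, from the prompt (None = no upper bound).
BUFFER_RANGE = {"LOW": (0, 1), "MEDIUM": (1, 2), "HIGH": (2, None), "CRITICAL": (3, None)}


def validate_risk_output(assessment: dict) -> dict:
    """Return a copy of the assessment with capped probability and validation flags."""
    out = dict(assessment)
    flags = list(out.get("validation_flags", []))

    # Cap delay_probability to [0.0, 1.0].
    prob = out["delay_probability"]
    capped = min(1.0, max(0.0, prob))
    if capped != prob:
        out["delay_probability"] = capped
        flags.append("delay_probability_capped")

    # Check that the buffer is consistent with the risk level.
    low, high = BUFFER_RANGE[out["risk_level"]]
    days = out["recommended_buffer_days"]
    if out["risk_level"] in ("HIGH", "CRITICAL") and days == 0:
        flags.append("ANOMALY_BUFFER_TOO_LOW")
    elif days < low or (high is not None and days > high):
        flags.append("buffer_risk_mismatch")  # hypothetical flag name

    out["validation_flags"] = flags
    return out
```

Running such a check outside the LLM gives an auditable backstop for cases where the agent skips its own validation step.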

Appendix A.3. Sourcing Agent (A3)

You are a supply chain sourcing agent. You reason about the best carrier for a shipment given risk and constraints.

Your tools:
1. distance_api_tool(origin, destination) - Get road distance (km).
2. routes_lookup_tool(origin, destination) - Get route info from local routes (distance_km, region, is_metro_to_metro).
3. web_search_tool(queries) - Run web search for pricing/context.
4. get_carrier_options_tool(features_json, risk_assessment_json) - Get carrier options (cost, on_time_pct, carbon, route).

Workflow:
1. Call distance_api_tool and routes_lookup_tool for origin and destination.
2. Call web_search_tool with the provided sourcing_queries.
3. Call get_carrier_options_tool(features_json, risk_assessment_json).
4. Reason about which carrier best fits given risk_level, package_type (e.g., pharmacy needs high SLA), and cost/carbon trade-offs.
5. Get industry benchmark via web_search_tool; compare recommended cost to benchmark; set efficiency and efficiency_percentage.
6. Return final message with a valid JSON block containing: carrier_options, recommendation, industry_benchmark, efficiency, efficiency_percentage. The orchestrator parses carrier_options from this block for the next step.

Validation after get_carrier_options_tool (before final JSON):
- Vehicle-distance feasibility:
  bike/ev bike/scooter valid for <80 km.
  If distance >200 km and vehicle is bike/scooter/ev bike, flag vehicle_range_mismatch and prefer van/truck.
  ev van typically 80-400 km; van/truck: any distance.
- Cost plausibility checks:
  bike short-haul <INR 500;
  van 1400 km ~INR 1500-2500;
  truck 1400 km ~INR 3000-6000.
  If cost <INR 100 for 1400 km, flag ANOMALY_COST_TOO_LOW.
- on_time_pct must be in [0,100] for each option; exclude out-of-range.
- Include validation_summary (ok or list of flags) in final JSON.
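The Sourcing Agent's validation rules can also be sketched as a plain post-processing step. The Python below is a minimal illustration under stated assumptions: the option fields `carrier`, `vehicle`, and `cost_inr` are hypothetical names (only `on_time_pct` appears in the prompt), and the low-cost anomaly is applied to any route of at least 1400 km, whereas the prompt states it only for a 1400 km route.

```python
SHORT_RANGE_VEHICLES = {"bike", "ev bike", "scooter"}


def validate_carrier_options(options, distance_km):
    """Return (kept_options, validation_summary) for a list of option dicts.

    validation_summary is "ok" or a list of flag strings, mirroring the
    "ok or list of flags" wording in the prompt.
    """
    kept, flags = [], []

    def add_flag(flag):
        if flag not in flags:  # report each flag once
            flags.append(flag)

    for opt in options:
        # Options with an out-of-range on-time percentage are excluded.
        if not 0 <= opt["on_time_pct"] <= 100:
            continue
        # Short-range vehicles on long routes are flagged, not dropped,
        # so a downstream step can still prefer van/truck.
        if opt["vehicle"] in SHORT_RANGE_VEHICLES and distance_km > 200:
            add_flag("vehicle_range_mismatch")
        # Assumption: the INR 100 floor applies to routes >= 1400 km.
        if distance_km >= 1400 and opt["cost_inr"] < 100:
            add_flag("ANOMALY_COST_TOO_LOW")
        kept.append(opt)
    return kept, (flags or "ok")


if __name__ == "__main__":
    sample = [
        {"carrier": "A", "vehicle": "bike", "cost_inr": 90, "on_time_pct": 92},
        {"carrier": "B", "vehicle": "van", "cost_inr": 2000, "on_time_pct": 101},
        {"carrier": "C", "vehicle": "truck", "cost_inr": 4500, "on_time_pct": 88},
    ]
    print(validate_carrier_options(sample, distance_km=1400))
```

The prompt leaves the 80-200 km band for two-wheelers unspecified, so the sketch flags only the explicit >200 km case.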

References

1. Ackoff, R.L. Towards a System of Systems Concepts. Manag. Sci. 1971, 17, 661–671.
2. Boardman, J.; Sauser, B. System of systems—The meaning of. In Proceedings of the IEEE/SMC International Conference on System of Systems Engineering, Los Angeles, CA, USA, 24–26 April 2006; pp. 118–123.
3. Wasi, A.T.; Islam, M.S.; Akib, A.R. SupplyGraph: A Benchmark Dataset for Supply Chain Planning using Graph Neural Networks. arXiv 2024, arXiv:2401.15299.
4. Park, K.M.; Liew, N.; Pattnaik, S.; Kures, A.O.; Pinsky, E. Exploring the Transition to Low-Carbon Energy: A Comparative Analysis of Population, Economic Growth, and Energy Consumption in Oil-Producing OECD and BRICS Nations. Sustainability 2025, 17, 6221.
5. Barbosa-Póvoa, A.P.; da Silva, C.; Carvalho, A. Opportunities and challenges in sustainable supply chain: An operations research perspective. Eur. J. Oper. Res. 2018, 268, 399–431.
6. Park, K.M.; Pattnaik, S.; Liew, N.; Kundu, T.; Kures, A.O.; Pinsky, E. Smarter Chains, Safer Medicines: From Predictive Failures to Algorithmic Fixes in Global Pharmaceutical Logistics. Forecasting 2025, 7, 78.
7. AlReshaid, F.; Park, K.M.; Vogel, B.; Graca, A.; Ikwegbu, O. Collaborative Leadership Dynamics: Joint Evolution of Chair and CEO Roles. J. Strategy Manag. 2025, 18, 793–820.
8. Ivanov, D.; Dolgui, A. Digital supply chain twins: Managing the ripple effect, resilience and disruption risks by data-driven optimization, simulation, and visibility. In Handbook of Ripple Effects in the Supply Chain; Springer: Cham, Switzerland, 2021; pp. 309–332.
9. McKinnon, A.; Browne, M.; Whiteing, A.; Piecyk, M. Green Logistics: Improving the Environmental Sustainability of Logistics; Kogan Page Publishers: London, UK, 2015; Available online: https://www.koganpage.com/product/green-logistics-9780749471859 (accessed on 1 February 2026).
10. Kaur, R.; Kundu, T.; Park, K.M.; Pinsky, E. The Carbon Cost of Intelligence: A Domain-Specific Framework for Measuring AI Energy and Emissions. Energies 2026, 19, 642.
11. Seherr, A.; Kaggle. Delivery Logistics Dataset. Kaggle Dataset, 2025. Available online: https://www.kaggle.com/datasets/ayeshaseherr/delivery-logistics-dataset/data (accessed on 6 February 2026).
12. Maier, M.W. Architecting Principles for Systems-of-Systems. Syst. Eng. 1999, 1, 267–284.
13. Jamshidi, M. System of Systems Engineering: Innovations for the 21st Century; John Wiley & Sons: Hoboken, NJ, USA, 2011.
14. Rana, R.; Sauser, B.; Gligor, D.; Prybutok, V.R.; Hiatt, B. A systematic review of systems thinking in supply chain research to manage complexity, resilience and sustainability. Syst. Res. Behav. Sci. 2025.
15. Wilden, D.; Hopkins, J.; Sadler, I. Systems thinking skills and their effect upon supply chain resilience: A practitioner perspective. Syst. Res. Behav. Sci. 2025, Early View.
16. Dolgui, A.; Ivanov, D.; Sokolov, B. Ripple effect in the supply chain: An analysis and recent literature. Int. J. Prod. Res. 2018, 56, 414–430.
17. Li, Y.; Xia, X.; Wang, C.; Huang, Q. Manufacturing supply chain resilience amid global value chain pressures and sustainability mechanisms. Systems 2025, 13, 873.
18. Sufi, F.; Alsulami, M. From Events to Systems: Modeling disruption dynamics and resilience in global supply chains. Mathematics 2025, 13, 3471.
19. Paul, A. A systematic literature review on flexible strategies and the impact on supply chain resilience performance. J. Ind. Eng. Int. 2025, 26, 207–231.
20. Tiwari, M.; Bryde, D.J.; Stavropoulou, F.; Dubey, R.; Kumari, S.; Foropon, C. Modelling supply chain visibility, digital technologies, environmental dynamism and healthcare supply chain resilience: An Organisation information processing theory perspective. Transp. Res. Part E Logist. Transp. Rev. 2024, 188, 103613.
21. Tran, N.; Haralambides, H.; Notteboom, T.; Cullinane, K. The costs of maritime supply chain disruptions: The case of the Suez Canal blockage by the ‘Ever Given’ megaship. Int. J. Prod. Econ. 2025, 279, 109464.
22. Pattnaik, S.; Liew, N.; Kures, A.O.; Pinsky, E.; Park, K.M. Catalyzing Supply Chain Evolution: A Comprehensive Examination of Artificial Intelligence Integration in Supply Chain Management. Eng. Proc. 2024, 68, 57.
23. Younes, S.; Adedokun, M.W.; Alzubi, A.B.; Aljuhmani, H.Y. Impact of supply chain management on energy transition and environmental sustainability: The role of knowledge management and green innovations. Sustainability 2025, 17, 9249.
24. Bektaş, T.; Laporte, G. The pollution-routing problem. Transp. Res. Part B Methodol. 2011, 45, 1232–1250.
25. Demir, E.; Bektaş, T.; Laporte, G. A review of recent research on green road freight transportation. Eur. J. Oper. Res. 2014, 237, 775–793.
26. James, S.J.; James, C. The food cold-chain and climate change. Food Res. Int. 2010, 43, 1944–1956.
27. Mercier, S.; Villeneuve, S.; Mondor, M.; Uysal, I. Time-temperature management along the food cold chain: A review of recent developments. Compr. Rev. Food Sci. Food Saf. 2017, 16, 647–667.
28. Flammini, A.; Adzmir, H.; Pattison, R.; Karl, K.; Allouche, Y.; Tubiello, F.N. Greenhouse gas emissions from cold chains in agrifood systems. Sustainability 2024, 16, 9184.
29. Mohan, M.; Amin, S. Green cold chain logistics: Minimising greenhouse gas emissions of fresh food products in transport refrigeration units. Logistics 2025, 9, 112.
30. World Health Organization. Vaccine Management Handbook: How to Monitor Temperatures in the Vaccine Supply Chain; WHO: Geneva, Switzerland, 2015; Available online: https://www.who.int/publications-detail-redirect/WHO-IVB-15.04 (accessed on 1 February 2026).
31. Ashworth, B.; du Plessis, M.J.; Goedhals-Gerber, L.L.; Van Eeden, J. The carbon footprint of pharmaceutical logistics: Calculating distribution emissions. Sustainability 2025, 17, 760.
32. Park, K.M. Navigating the digital revolution and crisis times: Humanitarian and innovation-inspired leadership through the pandemic. J. Strategy Manag. 2021, 14, 360–377.
33. Durugbo, C.M.; Al-Balushi, Z. Supply Chain Management in Times of Crisis: A Systematic Review. Manag. Rev. Q. 2023, 73, 1179–1235.
34. Ni, D.; Xiao, Z.; Lim, M.K. A systematic review of the research trends of machine learning in supply chain management. Int. J. Mach. Learn. Cybern. 2020, 11, 1463–1482.
35. Riahi, Y.; Saikouk, T.; Gunasekaran, A.; Badraoui, I. Artificial intelligence applications in supply chain: A descriptive bibliometric analysis and future research directions. Expert Syst. Appl. 2021, 173, 114702.
36. Liew, N.; Pattnaik, S.; Kures, A.O.; Park, K.M.; Pinsky, E. Transforming Global Supply Chains with Artificial Intelligence, Machine Learning, and Next-Generation Technologies. In Next Generation Entrepreneurship: Convergence of Innovation, Technology, and Society; Rajagopal, F., Goncalves, M., Zlatev, V., Eds.; Springer Nature: Cham, Switzerland, 2025.
37. Patterson, D.; Gonzalez, J.; Le, Q.; Liang, C.; Munguia, L.M.; Rothchild, D.; So, D.; Texier, M.; Dean, J. Carbon emissions and large neural network training. arXiv 2021, arXiv:2104.10350.
38. Chen, W.; Men, Y.; Fuster, N.; Osorio, C.; Juan, A.A. Artificial Intelligence in Logistics Optimization with Sustainable Criteria: A Review. Sustainability 2024, 16, 9145.
39. LangChain. LangChain and LangGraph: Multi-Agent Workflows, Flows, and Parallelism. LangGraph for multi-Agent Design. Available online: https://www.langchain.com (accessed on 1 February 2026).
40. CrewAI. CrewAI: A Multi-Agent Platform. Available online: https://www.crewai.com (accessed on 1 February 2026).
41. AWS. Strands Agents: Open-Source AI Agents SDK for Amazon Bedrock and LiteLLM. Available online: https://strandsagents.com (accessed on 1 February 2026).
42. AlMahri, S.; Xu, L.; Brintrup, A. Automating Supply Chain Disruption Monitoring via an Agentic AI Approach. arXiv 2026, arXiv:2601.09680.
43. Jannelli, V.; Schoepf, S.; Bickel, M.; Netland, T.; Brintrup, A. Agentic LLMs in the Supply Chain: Towards Autonomous Multi-Agent Consensus-Seeking. arXiv 2024, arXiv:2411.10184.
44. Aylak, B.L. SustAI-SCM: Intelligent Supply Chain Process Automation with Agentic AI for Sustainability and Cost Efficiency. Sustainability 2025, 17, 2453.
45. Quan, Y.; Liu, Z. InvAgent: A Large Language Model Based Multi-Agent System for Inventory Management in Supply Chains. arXiv 2024, arXiv:2407.11384.
46. Xu, L.; Mak, S.; Brintrup, A. Will Bots Take Over the Supply Chain? Revisiting Agent-Based Supply Chain Automation. Int. J. Prod. Econ. 2021, 241, 108279.
47. Xu, L.; Mak, S.; Minaricova, M.; Brintrup, A. On Implementing Autonomous Supply Chains: A Multi-Agent System Approach. Comput. Ind. 2024, 161, 104120.
48. Mandal, J.; Mohammed, I.A. Implementation of AI Transportation Routing in Reverse Logistics to Reduce CO₂ Footprint. Int. J. Supply Chain. Manag. 2024, 9, 1–12.
49. Mrad, M.; Frikha, M.; Boujelbene, Y. A Comprehensive Survey of Artificial Intelligence and Robotics for Reducing Carbon Emissions in Supply Chain Management. Logistics 2025, 9, 104.
50. Parthasarathy, V. AI-Driven Carbon Footprint Tracking and Emission Reduction in Logistics Networks. Int. J. Artif. Intell. Data Sci. Mach. Learn. (IJAIDSML) 2024, 5, 47–56.
51. Google. Google Maps Distance Matrix API. Google LLC. 2024; Available online: https://developers.google.com/maps/documentation/distance-matrix (accessed on 1 February 2026).
52. U.S. Environmental Protection Agency. Emissions & Generation Resource Integrated Database (eGRID) 2023; EPA: Washington, DC, USA, 2025. Available online: https://www.epa.gov/egrid (accessed on 1 February 2026).
53. Electricity Maps. Electricity Maps: Real-time Carbon Intensity API. Electricity Maps 2025. Available online: https://www.electricitymaps.com/ (accessed on 1 February 2026).
54. Central Pollution Control Board (CPCB); Government of India. India Two-Wheeler Emission Inventory 2023. Technical report, Central Pollution Control Board, Ministry of Environment, Forest and Climate Change, Government of India, 2023. Available online: https://cpcb.nic.in/ (accessed on 1 February 2026).
55. Bureau of Energy Efficiency (BEE); Government of India. India EV Adoption and Grid Emission Impact 2023. Technical report, Bureau of Energy Efficiency, Ministry of Power, Government of India, 2023. Available online: https://beeindia.gov.in/ (accessed on 1 February 2026).
56. International Council on Clean Transportation (ICCT). India Freight Transport: Baseline Emissions and Mitigation Pathways. Technical report, International Council on Clean Transportation, 2023. Available online: https://theicct.org/ (accessed on 1 February 2026).
57. UK Department for Energy Security and Net Zero. Greenhouse Gas Reporting: Conversion Factors 2024. GOV.UK, 2024. Available online: https://www.gov.uk/government/publications/greenhouse-gas-reporting-conversion-factors-2024 (accessed on 1 February 2026).
58. World Courier. Cold Chain Logistics for Pharmaceutical Industry. World Courier. 2025. Available online: https://www.worldcourier.com/solutions/pharmaceutical-cold-chain (accessed on 1 February 2026).
59. Service Club. On-Time Delivery Rate Benchmarks: How Your Business Stacks up in 2025. Service Club. 2025. Available online: https://serviceclub.com/on-time-delivery-rate-benchmarks/ (accessed on 1 February 2026).
60. Opensend. 7 On-time Delivery Rate Statistics For eCommerce Stores. Opensend. 2025. Available online: https://www.opensend.com/post/on-time-delivery-rate-statistics-ecommerce (accessed on 1 February 2026).
61. World Health Organization. Model guidance for the storage and transport of time- and temperature-sensitive pharmaceutical products. In WHO Technical Report Series; 961; World Health Organization: Geneva, Switzerland, 2011; Available online: https://www.who.int/publications/m/item/trs961-annex9-modelguidanceforstoragetransport (accessed on 1 February 2026).
62. Amazon Multi-Channel Fulfillment. Guide to the 5 Most Important Ecommerce fulfillment KPIs. Amazon Supply Chain. 2025. Available online: https://supplychain.amazon.com/learn/5-ecommerce-fulfillment-kpis-guide (accessed on 1 February 2026).
63. FCBCO. Benchmarking Metrics for Warehouse Operations and Fulfillment Centers. FCBCO, 2024. Available online: https://www.fcbco.com/blog/bid/156213/benchmarking-metrics-of-warehouse-operations (accessed on 1 February 2026).
64. SmartRoutes. Delivery Success Rates: Key Retail & eCommerce Stats. SmartRoutes. 2025. Available online: https://smartroutes.io/blogs/delivery-success-rates/ (accessed on 1 February 2026).
65. Google. Measuring the Environmental Impact of Delivering AI at Google Scale. Median Gemini prompt: 0.24 Wh, 0.03 gCO₂e; 33× Energy and 44× Carbon Reduction over 12 Months. 2025. Available online: https://cloud.google.com/blog/products/infrastructure/measuring-the-environmental-impact-of-ai-inference (accessed on 1 February 2026).
66. Jegham, N.; Abdelatti, M.; Koh, C.Y.; Elmoubarki, L.; Hendawi, A. How Hungry is AI? Benchmarking Energy, Water, and Carbon Footprint of LLM Inference. arXiv 2025, arXiv:2505.09598.
67. Samsi, S.; Zhao, D.; McDonald, J.; Li, B.; Michaleas, A.; Jones, M.; Bergeron, W.; Kepner, J.; Tiwari, D.; Gadepally, V. From Words to Watts: Benchmarking the Energy Costs of Large Language Model Inference. In Proceedings of the IEEE High Performance Extreme Computing (HPEC), Boston, MA, USA, 25–29 September 2023.
68. ISO 31000:2018; Risk Management—Guidelines. International Organization for Standardization: Geneva, Switzerland, 2018. Available online: https://www.iso.org/standard/65694.html (accessed on 1 February 2026).
69. Kuo, T.; Lee, Y. Using Pareto Optimization to Support Supply Chain Network Design within Environmental Footprint Impact Assessment. Sustainability 2019, 11, 452.
70. Aguezzoul, A. Third-party logistics selection problem: A literature review on criteria and methods. Omega 2014, 49, 69–78.
71. Jharkharia, S.; Shankar, R. Selection of logistics service provider: An analytic network process (ANP) approach. Omega 2007, 35, 274–289.
72. Chopra, S.; Reinhardt, G.; Dada, M. The Effect of Lead Time Uncertainty on Safety Stocks. Decis. Sci. 2004, 35, 1–24.

Figure 1. Layered system-of-systems framework: Four-layer architecture showing Output → Agent Layer (LLM reasoning) → Analytical Backend (predictive engine, vendor segmentation, carbon calculator, grid scenarios, trade-off analyzer, governance advisor, early-warning engine) → Data Layer (Delivery Logistics dataset [11]; reference data files: carriers.json (industry reports), cities.json [51], routes.csv [51], grid_carbon.json [10,52,53], vehicle_emissions.csv [54,55,56,57]; APIs). The analytical engines form the backend that services and tools wrap and that agents invoke through structured interfaces. Reference data is stored in data/reference/ and loaded by configuration files in config/. Data sources are cited in Section 4.2.

Figure 2. Full multi-agent architecture: Orchestrator (LLM #1) with five tools calls Risk Agent (LLM #2) with four tools and external APIs (OpenWeatherMap, NewsAPI) and outputs JSON o1; Orchestrator calls Sourcing Agent (LLM #3) with four tools and external API (OpenRouteService) and outputs JSON o2; Optimizer + Carbon (local) processes inputs and outputs JSON o3 to o5; and final output provides recommendation. Figure shows tools, JSON outputs (o1 to o5), and external data sources.

Figure 3. Agent 1: Orchestrator (horizontal layout): Input LEFT (user query), Agent CENTER (prompt: step-by-step pipeline; tools: extract_features, risk_agent_tool, sourcing_agent_tool, run_optimization, carbon_analysis), Output RIGHT (JSON: carrier, cost, on_time, carbon, risk). The diagram emphasizes orchestration logic rather than raw prediction: this agent sequences sub-agents, validates intermediate outputs, and composes the final recommendation. It also shows that adaptive query generation and structured tool invocation are centralized in one controller.

Figure 4. Agent 2: Risk Agent (horizontal layout): Input LEFT (features_json, risk_queries_json from Orchestrator), Agent CENTER (prompt: gather API + web, fuse, then score; tools: weather_api, news_api, web_search, calculate_risk_score), Output RIGHT (JSON: risk_level, delay_prob, buffer_days, risk_factors). The figure highlights multi-source fusion: weather and disruption signals are collected first and then translated into a quantitative delay estimate with explicit risk factors. This structure supports robust fallbacks when one external source is unavailable.

Figure 5. Agent 3: Sourcing Agent (horizontal layout): Input LEFT (features_json, risk_assessment, sourcing_queries from Orchestrator), Agent CENTER (prompt: carriers + benchmark; tools: distance_api, routes_lookup, web_search, get_carrier_options), Output RIGHT (JSON: carrier_options array with cost, on_time, carbon). The decomposition clarifies how operational feasibility (distance and route context) is combined with economic and service benchmarks before optimization. It also makes explicit that this stage returns multiple candidates, not a single fixed route.

Figure 6. Vendor segmentation: Scatter of delivery partners in (avg cost, on-time %) space, with carrier names labeled and colored by cluster. Cluster characteristics: Cluster 0: Standard Performance (Amazon, Ekart, Shadowfax; Rs.869, 72.7%); Cluster 1: Budget/Value (Delhivery, FedEx; Rs.853, 75.0%); Cluster 2: Mid-Range (BlueDart, DHL, Ecom; Rs.868, 73.4%); Cluster 3: Lower Reliability (XpressBees; Rs.868, 71.7%). Silhouette score: 0.4517.

Figure 7. Carbon optimization potential by package type (all nine types). Critical types (pharmacy, groceries, shown in red-orange) exhibit limited optimization potential due to cold-chain and on-time constraints; high-value types (automobile parts, furniture, documents, fragile items, electronics, shown in orange) show moderate optimization (25–35%); standard types (clothing, cosmetics, shown in green) enable 40% carbon reduction through flexible routing.

Figure 8. Carbon footprint by package type, grouped by service-level tier (from Table 14). Each bar shows total transport carbon per package type; dashed rules separate the three tiers. Critical types (pharmacy, groceries) carry cold-chain multipliers that raise carbon $1.9\times$–$2.6\times$ above the standard tier, illustrating the Pareto trade-off between service reliability and emissions.

Figure 9. AI compute carbon per single LLM inference by model and country, in milligrams of CO₂ per request. Bars show three Claude models (Haiku, Sonnet, Opus) evaluated under seven grid carbon intensity scenarios (Norway, France, UK, USA, Germany, China, India). Values are derived from model energy-per-inference estimates combined with country-level grid carbon intensities (see Table 15). Each optimization in the multi-agent system requires 3 LLM calls, so per-optimization AI carbon is three times the per-inference values shown.

Figure 10. Governance levers: Sourcing rules, buffer policies, and compute policies and their indicative carbon reduction potential. Bars summarize how sourcing portfolio rules (e.g., favoring regional suppliers or shipment consolidation) deliver the largest reductions, particularly for standard package types. Buffer policies, implemented via strategic inventory positioning and safety stocks, reduce reliance on expedited, carbon-intensive shipments [72]. Compute policies (model choice and inference placement) have comparatively small effects on total carbon because transport emissions dominate, but become more relevant when many optimizations are executed at scale [10,37].

Figure 11. Early-warning indicators for disruption amplification: Supplier concentration (threshold 40%), geographic clustering (threshold 60%), and cold-chain fragility.

Figure 12. Early-warning indicators with real computed values: Supplier concentration and geographic clustering (value_pct) for Pharmacy Shipment Mumbai to Delhi Insulin and Clothing Shipment Mumbai to Delhi T-Shirts portfolios; cold-chain fragility (amplification risk: pharmacy $0.5 \times 50 = 25$, clothing 0). Both portfolio types fall below the supplier concentration threshold (40%, dashed) and geographic clustering threshold (60%, dotted).

Figure 13. Orchestrator’s reasoning loop (Pharmacy Shipment—Mumbai–Delhi Insulin): Agent 1 (Orchestrator) executes five sequential tool calls within a single Strands model-driven loop: extract_features → item: insulin, route: MUM→DEL, temp: 2–8 °C, weight: 50 kg; risk_agent_tool [Agent 2: Risk Agent] → risk_level: LOW, delay_prob: 0.16%, risk_factors: Delhivery +20%; sourcing_agent_tool [Agent 3: Sourcing Agent] → Delhivery · EV Van · Rs.1509, BlueDart · Diesel · Rs.2100; run_optimization [Pareto solver] → carrier: Delhivery, carbon: 525,000 gCO₂; carbon_analysis [Carbon Engine] → CASP: 1.90 × 10⁻⁴, casp_tier: CRITICAL. Risk context from risk_agent_tool is propagated to sourcing_agent_tool. Dashed separators indicate steps within one agent’s reasoning process, not separate pipeline stages.

Table 1. Research gaps and framework contributions.

Gap	Study Approach
Aggregate carbon metrics without product-category breakdowns	CASP metric with nine package types and three stakes tiers; per-package-type assessment (portfolio aggregation is future work)
AI compute carbon not integrated into logistics emissions	Carbon cost of intelligence: $C_{AI}$ by model and country; AI vs. transport comparison; Carbon ROI formula
Product-category constraints vs. optimization potential unquantified	Empirical measurement: critical types ∼4% vs. standard ∼40% optimization potential; governance levers by package type
Governance levers lack validation across energy-transition scenarios	AI compute carbon by model × country (Norway to India); transport carbon dominates total CASP; governance lever effectiveness table
Agentic tools limited to web search, APIs, and retrieval; no trained ML models as agent tools	Trained Gradient Boosting models (delay classifier F1 = 0.954, on-time classifier accuracy = 0.976 and F1 = 0.984 in 5-fold CV) and K-Means clustering wrapped as structured `@tool` interfaces for LLM agents via Strands Agents
Positioning: Tool-augmented multi-agent system (Orchestrator, Risk, Sourcing) with trained ML models as agent tools, CCI integration, and package-type-specific constraints.

Table 2. Vehicle emission factors used in Equation (3), using India-specific sources where available (CPCB/BEE/ICCT) and a UK proxy only for EV vans due to sparse India EV LCV data.

Vehicle	EF_v (gCO₂/km)	Source
Bike	50	CPCB India Two-Wheeler Emission Inventory 2023 [54]
EV bike	22	BEE India EV Report 2023 [55]
Scooter	50	CPCB India Two-Wheeler Emission Inventory 2023 [54]
EV van	150	UK Government GHG 2024 (proxy) ^† [57]
Van	750	ICCT India 2023 LCV emission factor [56]
Truck	1400	ICCT India 2023 HCV emission factor [56]

^† No India-specific EV LCV factor was available; UK value is used as a proxy.
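As a rough illustration of how the factors in Table 2 might be applied, the sketch below assumes that Equation (3) multiplies the emission factor by route distance and the cold-chain multiplier. The equation itself is defined outside this section, so the functional form, the function name, and the 1400 km Mumbai–Delhi distance are assumptions. Under those assumptions, an EV van carrying a pharmacy shipment with a multiplier of 2.5 gives 150 × 1400 × 2.5 = 525,000 gCO₂, which matches the carbon value shown for the Delhivery option in Figure 13.

```python
# Emission factors from Table 2 (gCO2 per km).
EMISSION_FACTORS_G_PER_KM = {
    "bike": 50,
    "ev bike": 22,
    "scooter": 50,
    "ev van": 150,   # UK proxy value, see the table footnote
    "van": 750,
    "truck": 1400,
}


def transport_carbon_g(vehicle, distance_km, cold_chain_multiplier=1.0):
    """Transport carbon in gCO2 under the ASSUMED multiplicative form
    EF_v * distance * lambda_cc (not quoted from the paper)."""
    if distance_km < 0 or cold_chain_multiplier <= 0:
        raise ValueError("distance must be >= 0 and multiplier > 0")
    return EMISSION_FACTORS_G_PER_KM[vehicle] * distance_km * cold_chain_multiplier


if __name__ == "__main__":
    # Hypothetical 1400 km route, pharmacy cold chain (lambda = 2.5).
    print(transport_carbon_g("ev van", 1400, cold_chain_multiplier=2.5))
    # Same route by diesel van without cold chain, for comparison.
    print(transport_carbon_g("van", 1400))
```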

Table 3. Dataset statistics for the Delivery Logistics dataset [11].

Attribute	Value
Total records	25,000
Train/test split	20,000/5000 (80%/20%)
Package types	9 (pharmacy, electronics, groceries, automobile parts, furniture, documents, fragile items, clothing, cosmetics)
Delivery partners	9
Vehicle types	6 (bike, ev bike, scooter, ev van, van, truck)
Features (predictive)	15 (categorical: partner, package_type, vehicle, mode, region, weather; numerical: distance_km, weight, delivery_rating)
Delay rate (`is_delayed` target)	26.7% (binary: delayed = yes → 1)
Carbon intensity: EPA eGRID [52], Electricity Maps [53].

Table 4. ML hyperparameters.

Parameter	Value	Note
n_estimators	100	Gradient Boosting trees
test_size	0.2	80% train, 20% test
random_state	42	Reproducibility
stratify	yes	For delay classification
cv	5-fold	Stratified for delay classification; KFold for on-time classification; reported as mean ± std

Table 5. Energy per inference (Wh) by model (sources: Patterson et al. [37], Kaur et al. [10], Jegham et al. [66]). All models are available in AWS Bedrock.

Model	Energy/Inference (Wh)	Type
Gradient Boosting (local)	0.00001	ML
Claude-3-Haiku	0.0006	LLM
Claude-3-Sonnet	0.0015	LLM
Claude-3-Opus	0.0027	LLM
Default (unknown)	0.001	Conservative

Table 6. Agent decomposition: Roles, tools, and data flow across the three-agent pipeline. The table maps each agent to its tool set and explicitly shows the structured JSON handoffs between stages. This decomposition makes it clear where LLM reasoning is used versus where deterministic analytical tools are executed. All agents use Claude-3-Sonnet (AWS Bedrock).

Agent & Role	Tools	Data Flow
Agent 1: Orchestrator Coordinates pipeline; decides tool order; summarizes outcome	`extract_features` `risk_agent` `sourcing_agent` `run_optimization` `carbon_analysis`	In: User query (natural language) Out: carrier, cost, on-time, carbon, risk
Agent 2: Risk Agent Assess delivery risk; fuse weather & news; return risk level and buffer	`weather_api` `news_api` `web_search` `calculate_risk_score`	In: features_json, risk_queries_json Out: risk_level, delay_prob, buffer_days
Agent 3: Sourcing Agent Get carrier options; reason over risk & SLA; industry benchmark	`distance_api` `routes_lookup` `web_search` `get_carrier_options`	In: features_json, risk_assessment_json Out: carrier_options (for optimizer)
Semantic classifier Map product description to package type (e.g., insulin → pharmacy)	Rule-based + LLM fallback (Bedrock Claude, on demand)	In: Query text Out: package_type

Table 8. Data sources for the pipeline. Reference data files are stored in data/reference/ and loaded by the corresponding config/ files.

Source	Content	Provenance
Reference Data
Delivery Logistics dataset [11]	25K records; 9 package types, 9 partners, 6 vehicle types	Kaggle [11]
`carriers.json`	10 carriers: names, tiers, rates, carbon footprint	Industry reports
`cities.json`, `routes.csv`	22 cities, 33 routes; origin–destination, distance, region	Google Maps API [51]
`grid_carbon.json`	12 country grid intensities (gCO₂/kWh)	EPA eGRID [52], Elec. Maps [53], Kaur [10]
`ai_model_energy.json`	15 AI model energy values (Wh per inference)	Patterson [37], Kaur [10]
`vehicle_emissions.csv`	6 vehicle emission factors (gCO₂/km)	CPCB [54], BEE [55], ICCT [56], UK proxy [57]
External APIs
OpenWeatherMap API	Real-time weather by city	REST API
NewsAPI	Disruption/news headlines	REST API
OpenRouteService API	Road distance (origin–dest.)	REST API
Web search (optional)	Weather, rates, disruption context	Bedrock tool

Table 9. Five-fold cross-validation results (mean ± std). Delay classification uses stratified CV; on-time classification uses KFold.

Model	Metric	Single Split	5-Fold CV (Mean ± Std)
Delay classification	F1	0.956	0.954 ± 0.003
	Precision	0.941	0.939 ± 0.004
	Recall	0.972	0.971 ± 0.004
On-time classification	Accuracy	0.978	0.976 ± 0.002
	F1	0.985	0.984 ± 0.002
	MAE (%)	3.43	–

Table 10. Delay prediction: Classification report (single split, test set).

Class	Precision	Recall	F1-Score
On-Time	0.990	0.978	0.984
Delayed	0.941	0.972	0.956
Accuracy	0.976

Table 11. Confusion matrix: Delay prediction (test set).

		Predicted
		On-Time	Delayed
Actual	On-Time	3585 (TN)	81 (FP)
	Delayed	38 (FN)	1296 (TP)
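The per-class figures in Table 10 follow directly from the counts in Table 11. The short Python check below recomputes them from the four cells; the function name is illustrative, and "Delayed" is treated as the positive class.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall and F1 for the positive class, plus accuracy."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Counts from Table 11 with "Delayed" as the positive class.
delayed = binary_metrics(tp=1296, fp=81, fn=38, tn=3585)

# Swapping the roles of the two classes gives the "On-Time" row.
on_time = binary_metrics(tp=3585, fp=38, fn=81, tn=1296)

if __name__ == "__main__":
    print({k: round(v, 3) for k, v in delayed.items()})
    print({k: round(v, 3) for k, v in on_time.items()})
```

Rounded to three decimals, the delayed class gives precision 0.941, recall 0.972, and F1 0.956, with overall accuracy 0.976, in agreement with Table 10.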

Table 12. Delay rate by package type (dataset [11]; source: figdata/delay_rate_by_package_type.csv).

Package Type	Delay Rate (%)	n Delayed	n Total
Automobile parts	26.19	732	2795
Clothing	26.09	722	2767
Cosmetics	27.04	742	2744
Documents	27.17	762	2805
Electronics	27.18	759	2792
Fragile items	26.65	759	2848
Furniture	24.76	680	2746
Groceries	27.44	739	2693
Pharmacy	27.54	774	2810

Table 13. Top drivers of delay and on-time predictions. Feature importance from Gradient Boosting; interpretation for “where and why” forecasts fail.

Model	Top feature	Interpretation
Delay classification	delivery_rating (∼82%)	Historical/expected partner reliability is the main signal for delay risk.
	delivery_mode (∼16% combined)	Express vs. standard vs. same-day strongly affects predicted delay.
	distance_km, weather (<2%)	Secondary; forecasts fail more when rating/mode are ambiguous.
On-time classification	delivery_rating (∼82%)	Same as delay: partner and service tier drive on-time prediction.
	delivery_mode (∼15% combined)	Service level choice (express, etc.) is the second driver.
	distance_km, weather (<1%)	Refine predictions; not the primary cause of forecast failure.

Table 14. Carbon intensity and CASP by package type (nine types, three stakes tiers; cold-chain multiplier $λ_{cc}$ applied where applicable per Equation (3)). AI compute is reported for 3 LLM calls (Claude-3-Sonnet: Orchestrator + Risk Agent + Sourcing Agent) and India (708 gCO₂/kWh) and is calculated per Equation (4) as ∼0.003 gCO₂; ML prediction energy is negligible at this scale and excluded from runtime CASP computation. CASP service performance values in this table use package-tier SLA targets for comparability (critical = 99%, high-value = 95%, standard = 85%). See Section 3.6 and Section 5.9 for LLM × Country values.

Package Type	Transport (gCO₂)	AI Compute (gCO₂)	Total Carbon (gCO₂)	CASP (×10⁻⁴)
Pharmacy (critical, λ = 2.5)	148,437	0.003	148,437	6.67
Electronics (high-value, λ = 1.0)	57,355	0.003	57,355	16.56
Groceries (critical, λ = 2.0)	110,190	0.003	110,190	8.98
Automobile parts (λ = 1.0)	55,888	0.003	55,888	17.00
Furniture (λ = 1.0)	55,457	0.003	55,457	17.13
Documents (λ = 1.0)	56,819	0.003	56,819	16.72
Fragile items (λ = 1.0)	56,163	0.003	56,163	16.92
Clothing (standard, λ = 1.0)	57,922	0.003	57,922	14.67
Cosmetics (standard, λ = 1.0)	57,285	0.003	57,285	14.84
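
The CASP column appears to be reproducible from the other columns as the tier SLA target (in percent) divided by total carbon (gCO₂), scaled by 10⁴. The Python sketch below checks this reading. Two points are assumptions rather than statements from the paper: the ratio form itself is inferred from the tabulated numbers, not copied from the metric's formal definition, and the four rows without an explicit tier label are treated as high-value (95%) because that target reproduces their reported values.

```python
# SLA targets per stakes tier, in percent, from the Table 14 caption.
SLA_TARGET_PCT = {"critical": 99.0, "high-value": 95.0, "standard": 85.0}
AI_COMPUTE_G = 0.003  # gCO2 per optimization (3 LLM calls), from Table 14

# (package type, tier, transport gCO2, reported CASP x 1e-4) from Table 14.
# ASSUMPTION: rows without a tier label in the table are set to "high-value".
ROWS = [
    ("Pharmacy", "critical", 148_437, 6.67),
    ("Electronics", "high-value", 57_355, 16.56),
    ("Groceries", "critical", 110_190, 8.98),
    ("Automobile parts", "high-value", 55_888, 17.00),
    ("Furniture", "high-value", 55_457, 17.13),
    ("Documents", "high-value", 56_819, 16.72),
    ("Fragile items", "high-value", 56_163, 16.92),
    ("Clothing", "standard", 57_922, 14.67),
    ("Cosmetics", "standard", 57_285, 14.84),
]


def casp_e4(tier, transport_g, ai_g=AI_COMPUTE_G):
    """Inferred CASP in units of 1e-4: SLA target (%) per gCO2 of total carbon."""
    total_carbon_g = transport_g + ai_g
    return SLA_TARGET_PCT[tier] / total_carbon_g * 1e4


for name, tier, transport_g, reported in ROWS:
    print(f"{name}: {casp_e4(tier, transport_g):.2f} (reported {reported:.2f})")
```

Under these assumptions, the AI compute term changes the denominator by less than one part in ten million, so CASP differences across package types are driven by the SLA tier and by transport plus cold-chain carbon.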

Table 15. LLM × Country matrix: AI compute carbon (mgCO₂ per inference). All models are Claude variants available in AWS Bedrock.

Country (Grid)	Claude-3-Haiku (mgCO₂)	Claude-3-Sonnet (mgCO₂)	Claude-3-Opus (mgCO₂)
Norway (20)	0.01	0.03	0.05
France (56)	0.03	0.08	0.15
UK (193)	0.12	0.29	0.52
Germany (350)	0.21	0.53	0.95
USA (386)	0.23	0.58	1.04
China (555)	0.33	0.83	1.50
India (708)	0.42	1.06	1.91
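
Each cell of Table 15 scales linearly with grid carbon intensity, which suggests a simple energy-times-intensity calculation. The Python sketch below reproduces the matrix under that assumption. The per-inference energy values (0.0006, 0.0015, and 0.0027 Wh) are back-calculated from the table entries and are not quoted from the paper; the function name `compute_carbon_mg` is likewise illustrative.

```python
# Grid carbon intensity in gCO2/kWh, from the row labels of Table 15.
GRID_G_PER_KWH = {
    "Norway": 20, "France": 56, "UK": 193, "Germany": 350,
    "USA": 386, "China": 555, "India": 708,
}

# ASSUMPTION: energy per inference in Wh, back-calculated from Table 15
# (table value in mgCO2 divided by grid intensity); not stated in the text.
ENERGY_WH = {
    "Claude-3-Haiku": 0.0006,
    "Claude-3-Sonnet": 0.0015,
    "Claude-3-Opus": 0.0027,
}


def compute_carbon_mg(energy_wh, grid_g_per_kwh):
    """mgCO2 per inference: Wh -> kWh (/1000) and g -> mg (*1000) cancel."""
    return energy_wh * grid_g_per_kwh


matrix = {
    country: {m: compute_carbon_mg(e, grid) for m, e in ENERGY_WH.items()}
    for country, grid in GRID_G_PER_KWH.items()
}

for country, row in matrix.items():
    cells = "  ".join(f"{value:.2f}" for value in row.values())
    print(f"{country:8s} {cells}")
```

For a fixed model, the India-to-Norway ratio equals the ratio of grid intensities (708/20, about 35), so in this reading the location of the compute matters more than the choice among the three models (Opus to Haiku is a factor of about 4.5).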

Table 16. AI vs. transport: Representative single shipment (clothing, 57,922 gCO₂ transport, dataset average). AI carbon per optimization (3 LLM calls) for India using Claude-3-Sonnet.

Quantity	Value	Note
Transport carbon (single shipment)	57,922 gCO₂	From Table 14 (clothing, dataset average)
AI carbon (per optimization, 3 LLM calls, India, Claude-3-Sonnet)	∼0.003 gCO₂	3 × 1.06 mgCO₂ = 3.18 mgCO₂ from Table 15
Ratio (transport/AI)	∼1.9×10⁷	Transport dominates
Carbon ROI (if 5% transport saved)	Net positive	2896 gCO₂ saved vs. 0.003 gCO₂ AI
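
The arithmetic behind Table 16 can be traced in a few lines. This Python sketch uses only values from Tables 14 and 15; the 5% saving is the table's hypothetical scenario, and the variable names are illustrative.

```python
TRANSPORT_G = 57_922      # gCO2, clothing shipment (Table 14)
AI_MG_PER_CALL = 1.06     # mgCO2 per inference, Claude-3-Sonnet, India (Table 15)
LLM_CALLS = 3             # Orchestrator + Risk Agent + Sourcing Agent
SAVING_FRACTION = 0.05    # hypothetical transport saving used in Table 16

ai_g = LLM_CALLS * AI_MG_PER_CALL / 1000   # mg -> g
ratio = TRANSPORT_G / ai_g                 # transport carbon per unit AI carbon
saved_g = SAVING_FRACTION * TRANSPORT_G    # transport carbon avoided
net_g = saved_g - ai_g                     # net benefit after AI overhead
break_even_fraction = ai_g / TRANSPORT_G   # saving needed to offset the AI carbon

print(f"AI carbon per optimization: {ai_g:.5f} gCO2")
print(f"Transport/AI ratio: {ratio:.2e}")
print(f"Saved at 5%: {saved_g:.0f} gCO2, net {net_g:.0f} gCO2")
print(f"Break-even transport saving: {break_even_fraction:.1e}")
```

With the unrounded 3.18 mgCO₂ the ratio is about 1.8×10⁷; the ∼1.9×10⁷ in Table 16 follows from the rounded 0.003 gCO₂. Either way, a transport saving of well under one millionth of the shipment's carbon would offset the AI overhead.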

Table 17. Governance lever effectiveness: Carbon reduction potential by package type (representative).

Package Type	Sourcing Rules (% Reduction)	Buffer Policies (% Reduction)	Compute Policies (% Reduction)
Pharmacy	2%	1%	0.01%
Groceries	5%	3%	0.01%
Electronics	12%	8%	0.02%
Clothing	18%	15%	0.02%
Documents	15%	12%	0.02%

Table 18. Estimated pipeline runtime per step (single query). LLM = Claude-3-Sonnet; API = weather/news/distance; local = Python/ML.

Step	Estimated Time (s)	Type
Feature extraction	1–3	Local/LLM
Risk Agent (LLM + tools)	15–30	LLM + API
Sourcing Agent (LLM + tools)	15–30	LLM + API
Route optimization	<1	Local (ML)
Carbon analysis	<1	Local
Total	45–90

Table 19. Estimated cost per query (AWS Bedrock Claude-3-Sonnet; list pricing).

Component	Estimated Cost per Query
Orchestrator (1 call)	$0.02–0.04
Risk Agent (1 call)	$0.01–0.03
Sourcing Agent (1 call)	$0.02–0.04
Local ML/APIs	$0 (or API-specific)
Total	∼$0.05–0.10
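
The total in Table 19 is the sum of the component ranges, which also makes it easy to scale to a batch of queries. The Python sketch below works in integer cents to avoid floating-point drift; it assumes zero cost for local ML and APIs (the table notes that API charges may apply), and `cost_range_usd` is a hypothetical helper.

```python
# (component, low, high) in US cents per query, from Table 19.
# ASSUMPTION: "Local ML/APIs" is taken as 0; API-specific fees are excluded.
COMPONENTS_CENTS = [
    ("Orchestrator", 2, 4),
    ("Risk Agent", 1, 3),
    ("Sourcing Agent", 2, 4),
    ("Local ML/APIs", 0, 0),
]

low_cents = sum(low for _, low, _ in COMPONENTS_CENTS)
high_cents = sum(high for _, _, high in COMPONENTS_CENTS)


def cost_range_usd(n_queries):
    """Lower and upper cost bounds in USD for n_queries pipeline runs."""
    return n_queries * low_cents / 100, n_queries * high_cents / 100


print(f"Per query: ${low_cents / 100:.2f} to ${high_cents / 100:.2f}")
print("Per 1000 queries (USD):", cost_range_usd(1000))
```

The component ranges add up to $0.05 to $0.11 per query, which Table 19 reports approximately as ∼$0.05–0.10.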

Table 20. Case 1: Pharmacy Shipment—Mumbai–Delhi Insulin (Critical): pipeline steps and actual outputs (casestudy_output.json).

Step	Key Output (Actual Run)
Extraction	package_type: pharmacy, origin: Mumbai, destination: Delhi
Risk	risk_level: LOW, delay_probability: 0.0016, risk_factors: [High-risk partner: delhivery (+20%)]
Sourcing	distance: 1400 km; 3 carrier options
Optimization	best: Delhivery, EV van; cost Rs.1509.30; on-time 99.92%; carbon 525,000 gCO₂
Early warning	risk_level: LOW, delay_probability: 0.0002, risk_score: 0.0
Carbon & governance	Greenest viable: EV van; transport + cold-chain carbon; governance notes

Table 21. Case 2: Clothing Shipment—Mumbai–Delhi T-Shirts (Standard): Pipeline steps and actual outputs (casestudy_fashion_output.json).

Step	Key Output (Actual Run)
Extraction	package_type: clothing, origin: Mumbai, destination: Delhi, weight: 20 kg
Risk	risk_level: LOW, delay_probability: 0.0016, risk_factors: [High-risk partner: delhivery (+20%)]
Sourcing	distance: 1400 km; 6 options before feasibility check; 5 retained
Optimization	best: Ekart, EV van; on-time 99.98%; carbon 210,000 gCO₂
Early warning	risk_level: LOW, delay_probability: 0.0002, risk_score: 0.0
Carbon & governance	No cold chain; standard logistics governance

Table 22. Comparison of Mumbai–Delhi case studies: Pharmacy Shipment—Mumbai–Delhi Insulin (Critical) vs. Clothing Shipment—Mumbai–Delhi T-Shirts (Standard). Same route; outputs from casestudy_output.json and casestudy_fashion_output.json.

Aspect	Case 1: Pharmacy Shipment—Mumbai–Delhi Insulin (Critical)	Case 2: Clothing Shipment—Mumbai–Delhi T-Shirts (Standard)
Package type	Pharmacy (critical, cold chain)	Clothing (standard)
Stakes tier	Critical (≥99% on-time)	Standard (≥85% on-time)
Cold chain	Yes (2–8 °C)	No
Best carrier	Delhivery, EV van	Ekart, EV van
Cost	Rs.1509.30	(reported in JSON output)
Total carbon	525,000 gCO₂	210,000 gCO₂
Predicted on-time	99.92%	99.98%
Risk level	LOW	LOW
Carrier options (sourcing)	3	6 (5 retained after feasibility)
Optimization potential	∼4% (constrained)	∼40% (flexible routing)

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
