Multi-Agent Deep Reinforcement Learning for Integrated Demand Forecasting and Inventory Optimization in Sensor-Enabled Retail Supply Chains

Yang, Yongbin; Wang, Mengdie; Wang, Jiyuan; Li, Pan; Zhou, Mengjie

doi:10.3390/s25082428

Open AccessArticle

Multi-Agent Deep Reinforcement Learning for Integrated Demand Forecasting and Inventory Optimization in Sensor-Enabled Retail Supply Chains

by

Yongbin Yang

¹,

Mengdie Wang

²,

Jiyuan Wang

³,

Pan Li

⁴ and

Mengjie Zhou

^5,*

¹

Viterbi School of Engineering, University of Southern California, Los Angeles, CA 90007, USA

²

School of Taxation and Public Administration, Shanghai Lixin University of Accounting and Finance, Shanghai 201620, China

³

The Fuqua School of Business, Duke University, Durham, NC 27708, USA

⁴

The Business School, University of Hull, Hull HU6 7R, UK

⁵

Department of Computer Science, The University of Bristol, Bristol BS8 1QU, UK

^*

Author to whom correspondence should be addressed.

Sensors 2025, 25(8), 2428; https://doi.org/10.3390/s25082428

Submission received: 25 February 2025 / Revised: 30 March 2025 / Accepted: 10 April 2025 / Published: 11 April 2025

(This article belongs to the Section Intelligent Sensors)

Download

Browse Figures

Versions Notes

Abstract

The retail industry faces increasing challenges in matching supply with demand due to evolving consumer behaviors, market volatility, and supply chain disruptions. While existing approaches employ statistical and machine learning methods for demand forecasting, they often fail to capture complex temporal dependencies and lack the ability to simultaneously optimize inventory decisions. This paper proposes a novel multi-agent deep reinforcement learning framework that jointly optimizes demand forecasting and inventory management in retail supply chains, leveraging data from IoT sensors, RFID tracking systems, and smart shelf monitoring devices. Our approach combines transformer-based sequence modeling for demand patterns with hierarchical reinforcement learning agents that coordinate inventory decisions across distribution networks. The framework integrates both historical sales data and real-time sensor measurements, employing attention mechanisms to capture seasonal patterns, promotional effects, and environmental conditions detected through temperature and humidity sensors. Through extensive experiments on large-scale retail datasets incorporating sensor network data, we demonstrate that our method achieves 18.2% lower forecast error and 23.5% reduced stockout rates compared with state-of-the-art baselines. The results show particular improvements in handling promotional events and seasonal transitions, where traditional methods often struggle. Our work provides new insights into leveraging deep reinforcement learning for integrated retail operations optimization and offers a scalable solution for modern sensor-enabled supply chain challenges.

Keywords:

multi-agent reinforcement learning; demand forecasting; inventory optimization; supply chain management

1. Introduction

The retail industry is undergoing a fundamental transformation driven by evolving consumer behaviors, increased market volatility, and recurring supply chain disruptions [1]. Enabled by advances in Internet of Things (IoT) sensors, RFID tracking systems, and smart shelf monitoring technologies, a critical challenge facing retailers is the effective synchronization of inventory with consumer demand—a problem that has become significantly more complex in the era of sensor-enabled omnichannel retail and rapid delivery expectations [2]. The proliferation of environmental sensors, footfall counters, and real-time inventory monitoring devices has created new opportunities and challenges in managing modern retail operations.

Traditional approaches to retail forecasting and inventory management have predominantly relied on statistical methods, including exponential smoothing, ARIMA models, and regression-based techniques [3]. While these methodologies have established the foundational framework for inventory control, they exhibit significant limitations in addressing contemporary retail challenges. The inherent assumption of linear relationships and stationary patterns in these approaches proves inadequate for capturing the complex, non-linear dynamics that characterize modern consumer behavior [4]. Moreover, their simplistic treatment of promotional effects and seasonal transitions through basic additive or multiplicative factors fails to capture the sophisticated interplay between marketing activities and evolving demand patterns [5]. Perhaps most critically, the conventional practice of treating demand forecasting and inventory optimization as independent problems leads to suboptimal outcomes, particularly given their intrinsic coupling in real-world retail operations [6]. This disconnected approach overlooks crucial feedback between inventory decisions and future demand patterns, ultimately constraining the potential for system-wide performance optimization.

Machine learning approaches, particularly deep neural networks, have emerged as compelling alternatives in addressing the limitations of traditional statistical methods. Recent advances have demonstrated remarkable success through the application of recurrent neural networks [7], temporal convolutional networks [8], and sophisticated hybrid deep learning architectures [9] in retail demand forecasting. While these approaches demonstrate superior capability in capturing non-linear patterns and complex dependencies within historical data, they encounter several significant challenges in practical deployment. The requirement for extensive feature engineering to meaningfully incorporate external factors and market signals often limits their adaptability, while the cold start problem presents persistent difficulties in handling new products or store locations. Furthermore, these approaches frequently lack robust mechanisms for adapting inventory decisions in response to forecast uncertainty and dynamic supply chain constraints [10], highlighting the need for a more sophisticated integration between prediction and operational decision making.

Reinforcement learning (RL) has shown potential in addressing some of these limitations by directly optimizing inventory decisions based on observed demand patterns and market conditions [11]. Early work in this area focused on single-agent RL approaches for inventory management [12], demonstrating improvements over traditional optimization methods. However, these approaches typically rely on simplified demand models and fail to capture the distributed nature of retail supply chains. More recent work has explored multi-agent RL for supply chain optimization [13], but these efforts have largely focused on specific subproblems rather than providing an integrated solution for forecasting and inventory management.

Our work addresses these limitations through a novel multi-agent deep reinforcement learning framework that seamlessly integrates demand forecasting and inventory management optimization. This comprehensive approach draws motivation from several fundamental advances in retail operations and artificial intelligence research. The framework acknowledges the intrinsic coupling between demand patterns and inventory decisions through consumer behavior and market dynamics, supported by empirical evidence demonstrating how product availability and presentation directly influence purchasing patterns [14]. Contemporary retail operations generate increasingly rich, multi-modal data streams, encompassing point-of-sale transactions, customer mobility patterns, social media sentiment, and competitor actions, which provide invaluable inputs for both forecasting and optimization processes [15]. Furthermore, the inherently hierarchical structure of retail supply chains, characterized by distributed decision making across store, distribution center, and corporate levels, presents a natural alignment with multi-agent learning architectures [16], enabling coordinated optimization across the entire network.

MARIOD represents a significant departure from existing approaches in several fundamental ways. While previous methods typically treat demand forecasting and inventory optimization as sequential and separate problems, our framework uniquely integrates transformer-based forecasting with hierarchical reinforcement learning in a unified architecture that enables simultaneous learning and mutual feedback between these components. This integration allows inventory decisions to directly inform forecasting accuracy and vice versa, creating a synergistic relationship that better mirrors real-world retail dynamics. Furthermore, our approach introduces a novel cross-modal attention mechanism specifically designed for sensor data fusion in retail environments, capable of dynamically weighting diverse sensor inputs—including RFID signals, temperature/humidity readings, foot traffic measurements, and smart shelf data—based on their relevance to current market conditions. This sensor-aware architecture stands in stark contrast to existing methods that either ignore sensor data entirely or process them through separate, non-integrated pipelines. Additionally, MARIOD employs an end-to-end differentiable policy network that enables joint optimization of forecasting accuracy and inventory performance, unlike the common practice of using independent loss functions that fail to capture the complex interplay between these objectives in retail operations.

In this paper, we propose MARIOD (Multi-Agent Reinforcement learning for Integrated Optimization and Demand forecasting), a novel framework that fundamentally reimagines retail supply chain optimization. MARIOD employs a hierarchical architecture where each level of the retail supply chain (store, distribution center, and corporate) is modeled by specialized agents that coordinate through learned communication protocols. At the core of our framework is a transformer-based neural architecture that processes multiple input streams: historical sales data, real-time inventory levels, promotional calendars, competitor actions, and external factors such as weather and local events. Our model integrates these diverse signals through a novel cross-attention mechanism that dynamically weights different information sources based on their relevance to current market conditions. The forecasting component utilizes a modified transformer decoder that generates probabilistic demand forecasts at multiple time horizons, while the inventory optimization component employs a hierarchical reinforcement learning approach to make coordinated stocking decisions across the network. A key innovation in our work is the development of a differentiable inventory policy network that allows end-to-end training of both forecasting and optimization components, enabling the system to learn inventory strategies that are robust to forecast uncertainty. Furthermore, we introduce a novel reward structure that explicitly balances the competing objectives of minimizing holding costs, reducing stockouts, and maintaining service levels, while accounting for the hierarchical nature of retail operations.

The primary contributions of this paper are fourfold.

A transformer-based hierarchical reinforcement learning architecture that captures complex temporal dependencies in demand patterns while coordinating inventory decisions across distribution networks;
A novel attention mechanism that integrates historical sales data with real-time market signals, enabling adaptive responses to promotional events and seasonal transitions;
A scalable multi-agent training framework that maintains stability across diverse retail environments and product categories;
Extensive empirical validation using both large-scale retail datasets and real-world deployment results, demonstrating significant improvements over state-of-the-art approaches.

The remainder of this paper is organized as follows: Section 2 reviews related work in retail forecasting and reinforcement learning. Section 3 presents our technical approach and model architecture. Section 4 describes our experimental setup and results. Section 5 discusses the implications and limitations of our work.

2. Related Work

The advancement of retail supply chain optimization has evolved through several key phases, from traditional statistical approaches to modern artificial intelligence methods. This section reviews the relevant literature across five critical areas: traditional demand forecasting techniques, machine learning applications in retail, inventory optimization methods, reinforcement learning in supply chain management, and multi-modal learning approaches. Notably, the integration of sensor technologies—including RFID tags, IoT-enabled environmental monitors, and computer vision systems—has transformed data collection capabilities, enabling real-time inventory tracking, environmental condition monitoring, and customer behavior analysis. These sensor networks generate continuous streams of heterogeneous data that challenge conventional processing methods but offer unprecedented visibility into retail operations. Despite these technological advances, many existing systems treat sensor data in isolation rather than as complementary streams within a unified decision framework. Through this review, we identify current limitations and opportunities that motivate our integrated work.

2.1. Traditional Demand Forecasting

Classical time series forecasting methods have dominated retail demand prediction for decades. Early approaches centered on exponential smoothing methods [17], which provide interpretable decompositions of trends and seasonality but struggle with complex patterns. The Box–Jenkins methodology and ARIMA models [18] extended these capabilities by incorporating autoregressive components and moving averages. State space models [19] further advanced the field by explicitly modeling uncertainty and handling missing data. These foundations led to more sophisticated approaches like TBATS [20] for multiple seasonal patterns and Vector Autoregression [21] for capturing cross-series dependencies.

Bayesian methods emerged as a powerful framework for incorporating domain knowledge and handling uncertainty [22]. Hierarchical Bayesian models [23] proved particularly valuable for retail applications, allowing information sharing across product categories and locations. However, these methods often assume linear relationships and struggle with the curse of dimensionality when incorporating external factors.

The development of regression-based approaches marked another important evolution, with techniques like Dynamic Regression [24] and ARIMAX [25] allowing the incorporation of external variables. These methods have been extensively applied to retail forecasting, particularly for promotional modeling. However, they typically rely on manual feature engineering and struggle to capture complex interactions between variables.

2.2. Machine Learning for Retail Forecasting

The application of deep learning to retail forecasting has evolved dramatically in recent years, marking a significant departure from traditional statistical methods. Recurrent neural networks, particularly LSTM variants [7], revolutionized time series forecasting by capturing complex temporal dependencies without explicit feature engineering. This advancement was further enhanced by sequence-to-sequence architectures [26], which enabled multi-horizon forecasting, while attention mechanisms [27] improved the handling of long-range dependencies in time series data.

The emergence of temporal convolutional networks [28] brought another significant innovation, processing multiple time scales simultaneously through dilated convolutions, proving particularly effective for retail applications with multiple seasonal patterns. Neural ordinary differential equations [29] introduced a continuous-time perspective on demand modeling, better capturing irregular sampling and missing data patterns common in retail datasets.

The development of probabilistic deep learning models represented another major step forward, with DeepAR [30] pioneering the combination of autoregressive recurrent networks with probabilistic outputs. Deep state space models [31] successfully merged classical time series approaches with neural networks, while transformer-based architectures [32] achieved state-of-the-art performance through their ability to process long sequences and capture complex dependencies. Despite these advances, modern deep learning approaches continue to face challenges in interpretability [33], domain knowledge incorporation [34], cold start scenarios [35], and computational efficiency at scale [36].

2.3. Inventory Optimization

The evolution of inventory optimization has witnessed a remarkable transition from analytical models to data-driven approaches, fundamentally transforming how retailers manage their supply chains. Classical methods based on the newsvendor model [37] and its extensions [38] established the theoretical foundation for optimal inventory policies under uncertainty, leading to sophisticated developments in multi-echelon systems and networks with complex constraints [39].

Recent advances in robust optimization methods [40] have explicitly addressed demand uncertainty, while machine learning approaches [41] have emerged to learn inventory policies directly from historical data. The integration of demand forecasting with inventory decisions has become increasingly important, alongside network optimization considering multiple objectives. Specialized approaches have been developed for perishable inventory management [42], incorporating critical factors such as product lifetime and freshness considerations. The rise of omnichannel retail has prompted new optimization frameworks [43] that integrate decisions across multiple sales channels, while increasing supply chain disruptions have led to the development of robust policies for supply chain resilience [44].

2.4. Reinforcement Learning in Supply Chain Management

The application of reinforcement learning to supply chain optimization has emerged as a transformative approach, addressing limitations of traditional methods through adaptive decision-making frameworks. Single-agent RL methods have demonstrated remarkable success in inventory management [11] and order fulfillment [10,45], while multi-agent approaches have effectively addressed broader supply chain coordination challenges [46].

Recent comprehensive reviews have documented the rapid evolution of RL applications in supply chain management. Rolf et al. [47] provide a systematic analysis of RL algorithms and their applications across various supply chain functions, highlighting the progression from single-problem optimization to more integrated approaches. Similarly, Yan et al. [48] examine methodological advancements and identify future opportunities for reinforcement learning in logistics, emphasizing the need for unified frameworks that can handle multiple interconnected decisions simultaneously.

Deep Q-networks [49] initially showed promise for discrete inventory decisions, paving the way for actor-critic methods [50] that enabled continuous action spaces better suited to real-world supply chain decisions. The development of hierarchical approaches [51] has addressed the multiple time scales inherent in supply chain decisions, while decentralized execution with centralized training [52] has proven effective for managing complex supply chain networks. Communication protocols between agents [53,54] have enabled sophisticated coordination without requiring full information sharing, and the integration of graph neural networks [55] has allowed RL systems to better capture and utilize a supply chain network structure. Recent advances in policy optimization have led to more robust and stable training procedures, while meta-learning approaches have improved adaptation to changing market conditions and supply chain disruptions.

A particularly promising direction has been the development of integrated approaches that simultaneously address multiple supply chain subproblems. Ho et al. [56] demonstrate this potential through an integrated reinforcement learning framework for automated guided vehicles that simultaneously optimizes path planning and task scheduling in smart logistics systems. Their work shows how a unified RL approach can outperform traditional methods that address these problems in isolation, highlighting the benefits of integrated optimization similar to our work.

Recent advances in policy optimization have led to more robust and stable training procedures, while meta-learning approaches have improved adaptation to changing market conditions and supply chain disruptions. The empirical success of these methods across diverse supply chain applications suggests that integrated RL approaches offer a compelling path forward for addressing the complex, interconnected challenges of modern retail operations.

2.5. Multi-Modal Learning in Retail

The integration of diverse data sources has become fundamental to modern retail operations, driving significant innovations in multi-modal learning approaches. Contemporary retail systems [57] now incorporate complex interactions between weather patterns, social media signals, competitor pricing information, local events, and customer mobility patterns. Transformer architectures have demonstrated exceptional capability in handling these heterogeneous data [58], employing sophisticated cross-attention mechanisms to appropriately weight different information sources. Multi-view learning approaches [59] have advanced the field by enabling more effective feature extraction from diverse data modalities, while contrastive learning techniques [60] have improved feature alignment across different data sources. Graph-based representations [61] have provided powerful frameworks for modeling retail networks and their complex interactions, and causal inference frameworks [62] have enhanced our understanding of the relationships between different data modalities and their impact on retail outcomes. Recent developments in self-supervised learning have further improved the ability to leverage unlabeled data across different modalities, while advances in neural architecture search have enabled the automatic discovery of optimal network structures for multi-modal fusion.

The comprehensive review of the existing literature reveals several critical gaps in current approaches to retail supply chain optimization. While significant advances have been made in individual aspects, the integration of demand forecasting and inventory optimization remains largely unexplored, with most methods treating these as separate problems and failing to capture their intricate interactions. Current multi-agent frameworks typically focus on either operational coordination or demand prediction, missing opportunities for synergistic optimization across these domains. Despite the proliferation of sensor technologies—including RFID, computer vision systems, temperature and humidity sensors, and customer movement trackers—the utilization of rich multi-modal sensor data available in modern retail environments often falls short of its potential. Many methods struggle to effectively fuse and leverage heterogeneous sensor streams that operate at different sampling rates and granularities. Additionally, sensor data quality issues such as noise, drift, and occasional failures are inadequately addressed in existing frameworks. Furthermore, the scalability of sophisticated approaches to realistic retail networks with thousands of products and locations remains a significant challenge. Our work addresses these limitations through a novel integrated framework that combines hierarchical reinforcement learning with transformer-based forecasting, while explicitly modeling the complex interactions between inventory decisions and future demand patterns. By developing a unified method that simultaneously handles sensor data integration, forecasting, optimization, and coordination challenges, our work represents a significant step forward in retail supply chain management.

3. Methodology

3.1. Reinforcement Learning Problem Formulation

The retail supply chain optimization problem is formulated as a multi-level reinforcement learning task. For each store i in the network, we define the state space

s_{t}^{i} \in S

at time t as

s_{t}^{i} = [I_{t}^{i}, D_{t}^{i}, P_{t}^{i}, E_{t}^{i}, C_{t}^{i}]

(1)

Each component of the state vector provides essential information for decision making in the retail environment. The inventory component

I_{t}^{i}

represents a comprehensive view of the current inventory status, encompassing on-shelf inventory visible to customers, backroom inventory available for restocking, and in-transit inventory that has been ordered but not yet received. This multi-dimensional inventory representation enables the model to consider the complete supply pipeline when making decisions.

The demand component

D_{t}^{i}

captures historical demand patterns at multiple temporal granularities, including daily, weekly, and seasonal variations. This representation includes not only absolute sales quantities but also derivative features such as growth rates, volatility measures, and pattern consistency metrics that help identify recurring demand structures across different time scales.

The promotional component

P_{t}^{i}

encodes detailed information about current and upcoming promotional activities, including promotion types (e.g., price discounts, buy-one-get-one offers, loyalty program incentives), discount levels, timing (start date, duration, end date), and promotional placement (e.g., featured in circulars, end-cap displays, online banners). This rich representation allows the model to anticipate promotional effects on demand and adjust inventory decisions accordingly.

The environmental component

E_{t}^{i}

integrates data from various sensors deployed throughout the retail environment, including temperature and humidity sensors that monitor storage conditions, infrared customer counters that track store traffic, and smart shelf systems that detect product interactions. These environmental measurements provide critical context for understanding how external factors influence demand patterns and product movement.

The competitor component

C_{t}^{i}

represents information about competitor activities, including their pricing strategies, promotional calendars, product availability, and market share movements. This competitive intelligence helps the model anticipate market shifts and adjust inventory strategies in response to competitor actions.

The action space

a_{t}^{i} \in A

consists of order quantities and inventory adjustments, as follows:

a_{t}^{i} = [q_{t}^{i}, r_{t}^{i}]

(2)

where

q_{t}^{i}

represents the order quantity and

r_{t}^{i}

denotes inventory reallocation decisions. These action variables are subject to several operational constraints that reflect real-world limitations in retail supply chains. Order quantities must satisfy capacity constraints

0 \leq q_{t}^{i} \leq q_{m a x}

, where

q_{m a x}

is determined by storage capacity, shelf space, and budget limitations. Inventory reallocation across the network must maintain the conservation of inventory, such that

\sum_{k} r_{t}^{k} = 0

, ensuring that products moved from one location must be received at another.

Additional constraints include lead time considerations that affect when ordered inventory becomes available, minimum order quantities imposed by suppliers, and budget constraints that limit the total value of orders within a given fiscal period. The model must learn to operate effectively within these constraints while optimizing overall performance.

The environment transitions according to the following dynamics:

I_{t + 1}^{i} = I_{t}^{i} - D_{t}^{i} + q_{t - L}^{i} + \sum_{j} r_{t}^{j \to i}

(3)

where L represents the lead time for inventory replenishment, and

r_{t}^{j \to i}

denotes the inventory reallocated from location j to location i. Demand realization follows a stochastic process influenced by promotional activities, seasonality, and external factors captured in the state representation.

The reward function balances multiple objectives, as follows:

R_{t}^{i} = w_{1} R_{h o l d i n g} + w_{2} R_{s t o c k o u t} + w_{3} R_{s e r v i c e} + w_{4} R_{t r a n s p o r t}

(4)

Here, individual reward components are defined as

R_{h o l d i n g} = - h \sum_{k = 1}^{K} max (I_{t}^{k}, 0)

(5)

R_{s t o c k o u t} = - p \sum_{k = 1}^{K} max (- I_{t}^{k}, 0)

(6)

R_{s e r v i c e} = \sum_{k = 1}^{K} \frac{min (D_{t}^{k}, I_{t}^{k})}{D_{t}^{k}}

(7)

R_{t r a n s p o r t} = - c \sum_{k = 1}^{K} | r_{t}^{k} |

(8)

where h represents the holding cost, p denotes the stockout penalty, and c is the transportation cost coefficient. The weighting factors

w_{1}, w_{2}, w_{3}

, and

w_{4}

allow for the adaptive balancing of these competing objectives based on business priorities and market conditions.

The hierarchical nature of retail decision making is captured through the multi-level structure of our framework. Decisions at the store level must align with distribution center policies, which in turn operate within corporate-level strategic objectives. This hierarchical structure introduces information asymmetry and delegation challenges that our multi-agent approach explicitly addresses through coordinated learning and communication protocols.

3.2. Framework Overview

Our proposed MARIOD framework introduces a novel method that seamlessly integrates demand forecasting and inventory optimization through a hierarchical multi-agent reinforcement learning architecture. The framework operates on multiple retail supply chain levels simultaneously, from individual stores to distribution centers and corporate headquarters, enabling coordinated decision making across the entire network. Central to our approach is a sophisticated sensor integration layer that processes heterogeneous data streams from various retail sensors, including RFID readers, smart shelves, infrared customer counters, and environmental monitoring devices. This layer performs crucial data fusion, noise filtering, and anomaly detection to ensure high-quality sensor inputs for decision making. By incorporating both spatial and temporal dependencies, MARIOD captures the complex interactions between inventory decisions and future demand patterns, while adapting to changing market conditions through real-time sensor data integration. The sensor-driven architecture consists of three primary components that work in concert: a transformer-based demand forecasting module that processes multi-modal sensor input data, a hierarchical multi-agent system for inventory optimization that responds to sensor-detected events, and a coordinated learning mechanism that jointly optimizes both components through an innovative reward structure and training procedure that accounts for sensor reliability and data quality. Figure 1 provides an overview of our framework.

3.3. Transformer-Based Multi-Modal Demand Forecasting

The demand forecasting component of MARIOD employs a sophisticated transformer architecture that processes multiple data streams simultaneously. Let

X = {x_{t}}_{t = 1}^{T}

represent the historical sales data sequence, where each observation

x_{t} \in R^{d}

contains d features including sales quantities, promotional information, pricing data, and external factors such as weather conditions and local events. Each feature dimension provides crucial information for accurate demand prediction, with the transformer architecture learning to weight these features dynamically based on their predictive power for different products and time horizons.

We formulate the demand forecasting problem as a sequence-to-sequence mapping as follows:

{\hat{y}}_{t + 1 : t + h} = f_{θ} (X_{t - w : t}, C_{t})

(9)

Here,

{\hat{y}}_{t + 1 : t + h}

represents the h-step ahead forecast sequence, where h is the forecast horizon length. The function

f_{θ}

represents our transformer model with parameters

θ

, taking as input a window of w past observations

X_{t - w : t}

and current contextual information

C_{t}

. The contextual information includes both static features (store location, product category) and dynamic features (current inventory levels, ongoing promotions).

The core of our forecasting module utilizes a modified transformer architecture with enhanced attention mechanisms, as follows:

Z_{l} = MSA (LN (H_{l - 1})) + H_{l - 1}

(10)

H_{l} = FFN (LN (Z_{l})) + Z_{l}

(11)

In these equations,

Z_{l}

and

H_{l}

represent the intermediate and final outputs of layer l, respectively. The Multi-head Self-Attention (MSA) mechanism allows the model to capture complex temporal dependencies at different time scales, while Layer Normalization (LN) ensures stable training. The Position-wise Feed-Forward Network (FFN) processes each time step independently, allowing for non-linear transformations of the attended features.

3.4. Cross-Modal Attention Mechanism

The integration of multiple data modalities requires a sophisticated attention mechanism that can effectively weight and combine diverse information sources. We introduce a novel cross-modal attention formulation as follows:

A_{i, j}^{m} = softmax (\frac{(W_{Q}^{m} H_{i}) {(W_{K}^{m} X_{j}^{m})}^{T}}{\sqrt{d_{k}}}) W_{V}^{m} X_{j}^{m}

(12)

In this equation,

A_{i, j}^{m}

represents the attention weights between position i in the output sequence and position j in the input sequence for modality m. The learnable parameters

W_{Q}^{m}

,

W_{K}^{m}

, and

W_{V}^{m}

represent the query, key, and value transformation matrices specific to each modality, allowing the model to learn different attention patterns for different types of input data. The scaling factor

\sqrt{d_{k}}

prevents the dot products from growing too large in magnitude, ensuring stable gradient flow during training. This modality-specific attention mechanism enables the model to capture complex interactions between different data sources while maintaining computational efficiency through parallel processing.

3.5. Hierarchical Multi-Agent Inventory Optimization

The inventory optimization component is formulated as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP), capturing the distributed nature of retail supply chain decision making, as follows:

〈 N, S, {A_{i}}, T, {R_{i}}, {O_{i}}, γ 〉

(13)

Here,

N

represents the set of agents across all hierarchy levels, including store managers, distribution center operators, and corporate planners. The global state space

S

encompasses all relevant information about the supply chain, including inventory levels, in-transit shipments, and demand forecasts. Each agent i has its own action space

A_{i}

, representing possible inventory decisions such as order quantities and reallocation choices. The transition function

T

models the system dynamics, including lead times and supply constraints, while the reward functions

{R_{i}}

balance multiple objectives, including holding costs, stockout penalties, and service levels. The observation functions

{O_{i}}

determine the local information available to each agent, and

γ

represents the discount factor for future rewards.

The hierarchical policy structure is defined for each agent as follows:

π_{i}^{l} = f_{ϕ}^{l} (o_{i}^{l}, h_{i}^{l})

(14)

where

π_{i}^{l}

represents the policy of agent i at hierarchy level l, taking as input the local observation

o_{i}^{l}

and communication message

h_{i}^{l}

. The function

f_{ϕ}^{l}

is implemented as a neural network with parameters

ϕ^{l}

, specialized for each level of the hierarchy to capture level-specific decision-making patterns.

3.6. Joint Learning and Optimization

The integration of forecasting and optimization components is achieved through a carefully designed joint learning procedure. The combined objective function balances forecast accuracy with inventory optimization, as follows:

L = α L_{f o r e c a s t} + (1 - α) L_{i n v e n t o r y}

(15)

The forecasting loss

L_{f o r e c a s t}

measures the prediction accuracy across multiple time horizons, as follows:

L_{f o r e c a s t} = \frac{1}{h} \sum_{k = 1}^{h} {∥ {\hat{y}}_{t + k} - y_{t + k} ∥}_{2}^{2}

(16)

where

{\hat{y}}_{t + k}

and

y_{t + k}

represent the predicted and actual demand values at time t + k, respectively. The L2 norm measures the prediction error, while the averaging across horizons ensures balanced performance across different prediction lengths.

The inventory optimization loss

L_{i n v e n t o r y}

captures the long-term expected rewards, as follows:

L_{i n v e n t o r y} = - E_{π_{θ}} [\sum_{t = 0}^{\infty} γ^{t} \sum_{i \in N} R_{i} (s_{t}, a_{t})]

(17)

where

R_{i} (s_{t}, a_{t})

represents the immediate reward received by agent i for taking action

a_{t}

in state

s_{t}

. The negative sign converts the reward maximization into a loss minimization problem.

The training process employs a novel hierarchical policy gradient algorithm, as follows:

\nabla_{θ} J (θ) = E_{π_{θ}} [\sum_{l} \sum_{i \in N^{l}} \nabla_{θ} log π_{i}^{l} (a_{i}^{l} | o_{i}^{l}) A_{i}^{l}]

(18)

Here,

A_{i}^{l}

represents the advantage function for agent i at level l, measuring the relative value of actions compared with the baseline performance. The gradient updates are performed using a combination of experience replay and importance sampling to ensure stable learning across the hierarchy levels.

The proposed MARIOD algorithm integrates demand forecasting and inventory optimization through a coordinated learning procedure. Algorithm 1 outlines the complete training process, which operates across three primary phases: demand forecasting, hierarchical policy execution, and joint optimization. During the forecasting phase, the transformer network processes multi-modal input data to generate demand predictions. The hierarchical policy execution phase coordinates decisions across corporate, distribution center, and store levels through graph attention-based message passing. The environment interaction stage executes the selected actions and collects reward signals that reflect both forecast accuracy and inventory management performance. Policy updates are performed using a hierarchical variant of Proximal Policy Optimization (PPO), which ensures stable learning while maintaining coordination across different levels of the supply chain hierarchy. The joint optimization phase combines forecasting and inventory objectives through a weighted loss function, allowing simultaneous improvement of both components. To ensure stable convergence, the algorithm employs adaptive learning rates that decrease over time according to a temperature-controlled schedule. The training process continues until either the maximum episode count is reached or convergence criteria are satisfied, as measured through the stability of the combined loss function and policy improvement metrics.

Algorithm 1 MARIOD Training Algorithm

1:: Initialize: Parameters $θ_{f}$ (forecaster), $θ_{π}$ (policies), $θ_{c}$ (critic)
2:: Initialize replay buffer $B$ , communication buffer $C$
3:: for episode = 1 to M do
4:: Collect initial state $s_{0}$ and context $c_{0}$
5:: for t = 0 to T do
6:: // Forecasting Phase
7:: ${\hat{y}}_{t + 1 : t + h} = f_{θ_{f}} (X_{t - w : t}, c_{t})$
8:: Update forecast loss $L_{f o r e c a s t}$
9:: // Hierarchical Policy Execution
10:: for l in [corporate, DC, store] do
11:: Collect observations $o_{t}^{l}$
12:: Compute messages $h_{t}^{l}$ via graph attention
13:: Sample actions $a_{t}^{l} \sim π_{θ_{π}}^{l} (o_{t}^{l}, h_{t}^{l})$
14:: end for
15:: // Environment Interaction
16:: Execute actions, observe rewards $r_{t}$ , next state $s_{t + 1}$
17:: Store transition $(s_{t}, a_{t}, r_{t}, s_{t + 1})$ in $B$
18:: // Policy Update
19:: if $| B | \geq$ batch_size then
20:: Sample mini-batch from $B$
21:: Compute advantages $A_{t}$ using critic $V_{θ_{c}}$
22:: Update policies via hierarchical PPO:

$θ_{π} \leftarrow θ_{π} + η \nabla_{θ_{π}} J_{P P O} (θ_{π})$
23:: Update critic parameters $θ_{c}$
24:: end if
25:: // Joint Optimization
26:: $L_{t o t a l} = α L_{f o r e c a s t} + (1 - α) L_{i n v e n t o r y}$
27:: Update all parameters via gradient descent
28:: end for
29:: // Evaluation and Adaptation
30:: Compute performance metrics
31:: Adjust hyperparameters if needed
32:: Update communication patterns in $C$
33:: end for

The algorithm employs adaptive learning rates for each component, as follows:

η_{t} = η_{0} {(\frac{τ}{τ + t})}^{0.5}

(19)

where

τ

is a temperature parameter controlling the learning rate decay. Convergence is monitored through the combined loss function stability and policy improvement metrics.

3.7. Convergence Analysis

To establish the theoretical validity of our approach, we provide a formal analysis of MARIOD’s convergence properties. The convergence of our hierarchical multi-agent reinforcement learning framework builds upon recent advances in policy gradient methods, while addressing the unique challenges introduced by the joint optimization of forecasting and inventory components.

Our analysis begins by considering the policy gradient updates for agent i at hierarchy level l, as defined in Equation (17). For a policy parameterized by

θ

, the expected update direction is given by

\nabla_{θ} J (θ) = E_{π_{θ}} [\sum_{l} \sum_{i \in N^{l}} \nabla_{θ} log π_{i}^{l} (a_{i}^{l} | o_{i}^{l}) A_{i}^{l}]

. Under standard regularity conditions, including bounded rewards and Lipschitz-continuous policy gradients, we can establish that this gradient estimate is unbiased.

For the hierarchical case, we must consider how errors propagate across levels. Let

δ_{t}^{l} = r_{t} + γ V (s_{t + 1}^{l}) - V (s_{t}^{l})

represent the temporal difference error at level l. We can show that the variance of advantage estimates remains bounded across hierarchy levels, as follows:

Var [A_{i}^{l}] \leq \frac{2 γ^{2} R_{m a x}^{2}}{{(1 - γ)}^{4}} (1 + α_{l} \cdot h)

(20)

where h represents the height of the hierarchy,

α_{l}

is a level-specific discount factor, and

R_{m a x}

is the maximum absolute reward. This bound ensures that advantage estimates remain reliable even in deep hierarchies, which is critical for stable learning.

For the joint optimization of forecasting and inventory components, we establish convergence by analyzing the coupled system dynamics. Let

L_{f} (θ_{f})

and

L_{π} (θ_{π})

represent the forecasting and policy losses, respectively. The joint optimization objective

L = α L_{f} + (1 - α) L_{π}

induces coupled gradient dynamics that can be analyzed through a Lyapunov function, as follows:

V (θ_{f}, θ_{π}) = ∥ \nabla_{θ_{f}} L_{f} ∥^{2} + ∥ \nabla_{θ_{π}} L_{π} ∥^{2} + β ∥ \nabla_{θ_{f}} L_{π} ∥^{2} + β {∥ \nabla_{θ_{π}} L_{f} ∥}^{2}

(21)

where

β

controls the coupling strength. Under appropriate learning rate schedules satisfying the Robbins–Monro conditions (

\sum_{t} η_{t} = \infty, \sum_{t} η_{t}^{2} < \infty

), we can show that

V (θ_{f}, θ_{π}) \to 0

as

t \to \infty

, guaranteeing convergence to a stationary point of the joint objective.

The adaptive learning rate defined in Equation (18) satisfies these conditions while providing practical benefits for training stability. Specifically, with the temperature parameter

τ > 0

, the learning rate schedule

η_{t} = η_{0} {(\frac{τ}{τ + t})}^{0.5}

ensures sufficient exploration in early training while gradually stabilizing as parameters approach a local optimum.

For practical implementation, we employ the hierarchical variant of Proximal Policy Optimization (PPO), which provides additional stability through trust region constraints, as follows:

L^{P P O} (θ) = E_{t} [min (r_{t} (θ) {\hat{A}}_{t}, clip (r_{t} (θ), 1 - ϵ, 1 + ϵ) {\hat{A}}_{t})]

(22)

where

r_{t} (θ) = \frac{π_{θ} (a_{t} | s_{t})}{π_{θ_{o l d}} (a_{t} | s_{t})}

and

ϵ

is a hyperparameter controlling the size of the trust region. This approach ensures that policy updates remain within a region where our advantage estimates are reliable, preventing harmful large policy changes and significantly improving convergence stability in practice.

The transformer-based forecasting component converges through gradient descent on the mean squared error loss, with established convergence guarantees for attention-based architectures given sufficient model capacity and training data. The coupling between forecasting and policy components is managed through careful gradient propagation and the weighted loss function in Equation (14), ensuring that improvements in one component do not destabilize the other.

Our empirical results confirm these theoretical guarantees, with MARIOD demonstrating stable convergence across diverse retail datasets and environments. The ablation studies further validate that each architectural component contributes to this stability, with the full model achieving both faster convergence and better final performance compared with simplified variants.

4. Experimental Results and Analysis

4.1. Experimental Setup

4.1.1. Datasets

Our experimental evaluation utilizes three established retail datasets that capture diverse aspects of modern retail operations. The primary dataset is the Dunnhumby Complete Journey Dataset [63], which contains household-level transaction data from over 2500 households who frequently shop at a retailer’s stores over 2 years. The dataset includes detailed transaction records spanning 92 stores and approximately 43,000 product SKUs, with comprehensive information on promotions, product categories, and customer demographics. We augment these data with store-level inventory records and promotional calendars, making it particularly suitable for evaluating both demand forecasting and inventory optimization components of our framework.

The second dataset is the Favorita Grocery Sales dataset [64], which comprises daily sales data from Corporación Favorita, a large Ecuador-based grocery retailer. This dataset contains transactions from 54 stores over a 5-year period (2013–2017), covering approximately 33,000 product SKUs across diverse categories. The dataset includes rich contextual information such as oil prices, store metadata, and local events, allowing us to evaluate our framework’s ability to incorporate external factors in demand forecasting. The dataset also provides detailed promotional information and holiday calendars, essential for testing the system’s performance during high-demand periods.

Our third dataset comes from the public Retail Dataset maintained by the UCI Machine Learning Repository [65]. This dataset contains all transactions occurring between 1 December 2010 and 9 December 2011 for a UK-based online retail company, comprising approximately 541,909 transactions with 4373 unique products. The dataset’s distinctive feature is its representation of both B2C and B2B transactions, providing an excellent test bed for our framework’s adaptability to different retail contexts. The online nature of transactions also introduces unique challenges in inventory management due to different fulfillment patterns and demand dynamics.

Each dataset presents distinct characteristics that help validate different aspects of our framework, as shown in the following Table 1:

For evaluation purposes, we process these datasets to create consistent daily sales and inventory records, with a 70–15–15 split for training, validation, and testing. Missing values are handled using forward fill for inventory levels and zero fill for sales data, following standard retail analytics practices [3]. We particularly focus on products with at least 100 sales records to ensure sufficient data for meaningful analysis.

4.1.2. Implementation Details

The implementation of MARIOD involves careful architectural decisions and hyperparameter tuning to ensure optimal performance across different retail contexts. The transformer-based forecasting module employs an architecture with eight attention heads and an embedding dimension of 512, implemented across six encoder and decoder layers. This configuration was determined through extensive ablation studies, balancing model capacity with computational efficiency. The attention mechanism utilizes scaled dot-product attention with a dropout rate of 0.1 to prevent overfitting, while layer normalization is applied before each attention and feed-forward layer to stabilize training.

The hierarchical reinforcement learning component implements actor networks with three fully connected layers of dimensions [512, 256, 128] and critic networks with dimensions [256, 128, 64]. Each layer employs ReLU activation functions, with batch normalization applied between layers to accelerate training. The network weights are initialized using Xavier initialization to ensure proper gradient flow during the early stages of training. We implement the entire framework using PyTorch 1.12.0, leveraging its distributed training capabilities across eight NVIDIA A100 GPUs with 80 GB memory each.

The training process employs the Adam optimizer with an initial learning rate of

1 \times 10^{- 4}

and a weight decay of

1 \times 10^{- 5}

. We utilize a cosine learning rate schedule with warm-up over the first 1000 iterations. The batch size is set to 256, with gradient accumulation used for larger networks to maintain effective batch sizes while managing memory constraints. For the reinforcement learning component, we employ a discount factor

γ = 0.95

and entropy regularization coefficient

α = 0.01

to encourage exploration during training. Experience replay is implemented with a buffer size of 100,000 transitions, using prioritized sampling based on TD error to accelerate learning.

4.1.3. Baseline Methods

Our comparative analysis encompasses a comprehensive set of traditional, deep learning, and reinforcement learning approaches. In the traditional methods category, we implement SARIMA using the statsmodels library, incorporating external regressors for promotional and seasonal effects. The model orders are automatically selected using AIC criteria, with seasonal periods detected through periodogram analysis. Facebook’s Prophet is configured with custom seasonality patterns specific to each retail domain, while the ETS implementation uses the forecast package with multiple seasonal patterns and automatic model selection.

For deep learning baselines, we implement Amazon’s DeepAR using GluonTS, maintaining architectural consistency with the original paper while adapting the hyperparameters to our retail context. The N-BEATS implementation follows the original architecture but includes additional residual connections to improve gradient flow. Our Temporal Fusion Transformer (TFT) implementation extends the basic architecture with additional attention layers specifically designed to capture promotional effects.

The reinforcement learning baselines include a single-agent DQN implementation with double Q-learning and prioritized experience replay, calibrated specifically for inventory management. The MARL-Basic implementation uses independent DDPG agents with shared experience replay, while H-MARL implements a hierarchical structure similar to ours but without the integrated demand forecasting component. All RL baselines are trained using the same computing infrastructure and similar hyperparameter tuning procedures to ensure fair comparison.

4.2. Performance Evaluation

Our experimental evaluation of MARIOD encompasses forecast accuracy, inventory optimization, and computational efficiency across the three datasets described in Section 4.1. We pay particular attention to challenging scenarios such as promotional periods and seasonal transitions, where traditional methods often struggle. To ensure robust evaluation, we employ standard metrics including Mean Absolute Percentage Error (MAPE), Root Mean Square Error (RMSE), and symmetric Mean Absolute Percentage Error (sMAPE) for forecasting accuracy, while inventory performance is assessed through service levels, stockout rates, and inventory turnover metrics.

4.2.1. Forecast Accuracy

The forecasting capabilities of MARIOD demonstrate substantial improvements over existing approaches across all evaluation metrics. As shown in Table 2, MARIOD achieves a MAPE of 15.6%, representing an 18.2% reduction compared with the next best baseline method (TFT at 17.2%). This improvement is particularly pronounced during promotional periods, where MARIOD’s ability to integrate multiple data streams through its cross-modal attention mechanism proves especially valuable. The RMSE results further support this finding, with MARIOD achieving 119.8 units compared with TFT’s 125.3 units, indicating better handling of larger prediction errors that often occur during demand spikes.

The superior performance extends to probabilistic forecasting metrics, with MARIOD achieving a 94.6% prediction interval coverage rate and a Continuous Ranked Probability Score (CRPS) of 71.2, outperforming all baseline methods. This indicates that MARIOD provides not only more accurate point forecasts but also better uncertainty estimates, which is crucial for robust inventory decision making. Figure 2 illustrates this capability during a major promotional event, where MARIOD maintains accuracy despite rapid demand fluctuations that cause significant degradation in baseline model performance.

4.2.2. Inventory Optimization

The inventory optimization results reveal MARIOD’s ability to translate improved forecasting into tangible operational benefits. As detailed in Table 3, the framework achieves a service level of 96.5% while maintaining lower average inventory levels than all baseline methods, as detailed in Table 3. This represents a particularly compelling improvement over the hierarchical MARL baseline (H-MARL), which achieves a 94.8% service level but requires 3.3% higher inventory levels. The stockout rate reduction from 5.2% (H-MARL) to 3.5% (MARIOD) demonstrates the practical impact of these improvements on retail operations.

The inventory trajectory analysis, visualized in Figure 3, reveals how MARIOD’s coordinated decision making across hierarchy levels leads to more stable inventory patterns. The system demonstrates remarkable adaptability to changing demand patterns while maintaining efficient inventory utilization, as evidenced by the 14.3 inventory turnover ratio compared with 13.6 for H-MARL. This improvement stems from MARIOD’s ability to balance local store-level requirements with broader distribution network considerations through its hierarchical architecture.

4.2.3. Computational Efficiency

MARIOD’s computational performance proves highly competitive despite its sophisticated architecture, as shown in Table 4. Training time requirements of 38.5 h represent a 12.1% improvement over H-MARL and a 27.0% improvement over the combined DeepAR + DQN approach. The inference latency of 156 ms is particularly noteworthy, as it enables real-time decision making in practical retail environments. Memory efficiency also shows marked improvement, with MARIOD requiring 19.8 GB compared with the 21.4 GB needed by H-MARL and 24.6 GB by DeepAR + DQN.

These efficiency gains can be attributed to MARIOD’s streamlined attention mechanisms and efficient hierarchical communication protocols. The framework demonstrates excellent scaling properties, maintaining consistent performance improvements across different network sizes and product categories while requiring substantially fewer computational resources than competing approaches.

4.3. Ablation Studies

Our ablation studies provide crucial insights into the contribution of each architectural component to MARIOD’s overall performance. As detailed in Table 5, the attention mechanism analysis reveals that while the base self-attention configuration achieves reasonable performance (MAPE 18.4%), the addition of temporal attention reduces this to 16.8%. The full cross-modal attention mechanism further improves accuracy to 15.6% while simultaneously enhancing service levels from 94.2% to 96.5%. Notably, these improvements come with a reduction in training time, suggesting that the more sophisticated attention mechanisms actually facilitate more efficient learning.

Hierarchical Architecture Analysis

We analyze the impact of different hierarchical configurations: (1) single-level: flat architecture; (2) two-level: store and DC levels; and (3) three-level: full hierarchy (store, DC, corporate).

The hierarchical architecture analysis demonstrates the clear benefits of MARIOD’s three-level structure. The single-level architecture, while computationally simpler, results in significantly higher stockout rates (6.8%) and inventory levels (824.6 units), as shown in Table 6. The two-level configuration shows marked improvement but still falls short of the full three-level hierarchy’s performance. The communication overhead increase from two to three levels (0.28 to 0.35) is modest compared with the performance gains, validating our architectural choices.

These results demonstrate MARIOD’s superior performance across multiple dimensions, with particular strengths in handling promotional events and maintaining efficient inventory levels. The ablation studies confirm the value of each architectural component. The framework’s robustness to hyperparameter choices and computational efficiency make it particularly suitable for large-scale retail applications.

5. Conclusions and Future Work

This paper has introduced MARIOD, a novel multi-agent deep reinforcement learning framework that seamlessly integrates demand forecasting and inventory optimization for sensor-enabled retail supply chains. Through comprehensive evaluation on three diverse retail datasets incorporating IoT sensor measurements, our approach demonstrates substantial improvements over existing methods, achieving an 18.2% reduction in forecast error and a 23.5% decrease in stockout rates while maintaining lower average inventory levels. These quantitative results significantly outperform both traditional forecasting methods like SARIMA (32.1% improvement) and advanced approaches such as Temporal Fusion Transformers (9.3% improvement), as well as state-of-the-art inventory optimization techniques including hierarchical MARL (16.7% improvement in service levels).

Our work represents a fundamental paradigm shift from the conventional sequential approach—where forecasting is performed first and inventory decisions follow—to a truly integrated optimization framework where both components learn simultaneously and inform each other. This integration enables the discovery of inventory strategies that are specifically tailored to forecast uncertainty patterns rather than treating uncertainty as an exogenous factor. The transformer-based hierarchical architecture effectively captures complex temporal dependencies from sensor networks while enabling coordinated inventory decisions across distribution networks. Our novel cross-modal attention mechanism dynamically integrates historical sales data with real-time sensor signals, showing particular effectiveness during promotional events and seasonal transitions.

The extensive ablation studies provide compelling evidence for each architectural component’s value. The full cross-modal attention mechanism for processing multi-sensor data improves forecast accuracy by 15.1% compared with the base configuration while reducing training time. Analysis of the hierarchical architecture demonstrates that our three-level approach achieves optimal performance with a modest communication overhead of 0.35, validating the design choices. The computational efficiency gains in processing sensor data streams are particularly noteworthy, with MARIOD requiring only 38.5 h of training time—a 12.1% improvement over hierarchical MARL baselines. These efficiency improvements, combined with the 156 ms inference latency, enable practical deployment in sensor-rich retail environments.

For retail practitioners, our approach offers significant practical benefits beyond performance metrics alone. The framework’s ability to process heterogeneous sensor data streams—including RFID signals, environmental monitors, and customer tracking systems—within a unified decision architecture eliminates the need for complex integration of disparate systems. The end-to-end differentiable nature of our approach means that retailers can seamlessly incorporate new sensor technologies without requiring extensive retraining or reconfiguration of existing systems. Additionally, the explainable nature of our attention-based architecture provides valuable insights into which data sources most influence both forecasting and inventory decisions across different product categories and market conditions.

Looking forward, several promising research directions emerge from this work. The framework could be extended to handle more complex sensor-integrated supply chain structures, including multi-echelon systems with RFID tracking and cross-channel fulfillment networks. Advanced causal inference techniques could better capture the interaction between inventory decisions and sensor-detected demand patterns. Additionally, developing more sophisticated uncertainty quantification methods would enhance robust decision making under sensor network failures and supply chain disruptions. The strong empirical results and computational efficiency demonstrated by MARIOD suggest that integrated approaches combining deep learning with multi-agent reinforcement learning offer a compelling path forward for addressing complex sensor-enabled retail supply chain challenges. Future work investigating transfer learning approaches could further improve performance on new products and store locations with limited sensor historical data, while exploring integration with emerging sensor technologies like smart shelves and automated inventory monitoring systems.

Author Contributions

Methodology, M.W., J.W. and P.L.; Software, M.W. and J.W.; Validation, P.L.; Investigation, Y.Y.; Writing—original draft, Y.Y.; Writing—review & editing, M.Z.; Supervision, M.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Chui, M.; Issler, M.; Roberts, R.; Yee, L. Technology Trends Outlook 2023; McKinsey & Company: New York, NY, USA, 2023. [Google Scholar]
Amed, I.; Balchandani, A.; Berg, A.; Hedrich, S.; Jensen, J.E.; Le Merle, L.; Rölkens, F. State of Fashion 2022: An Uneven Recovery and New Frontiers; Mckinsey & Company: New York, NY, USA, 2021. [Google Scholar]
Fildes, R.; Ma, S.; Kolassa, S. Retail forecasting: Research and practice. Int. J. Forecast. 2022, 38, 1283–1318. [Google Scholar] [CrossRef]
Trapero, J.R.; Kourentzes, N.; Fildes, R. On the identification of sales forecasting models in the presence of promotions. J. Oper. Res. Soc. 2015, 66, 299–307. [Google Scholar] [CrossRef]
Ma, S.; Fildes, R.; Huang, T. Demand forecasting with high dimensional data: The case of SKU retail sales forecasting with intra-and inter-category promotional information. Eur. J. Oper. Res. 2016, 249, 245–257. [Google Scholar] [CrossRef]
Chen, X.; Hu, P.; Hu, Z. Joint pricing and inventory decisions. In Research Handbook on Inventory Management; Edward Elgar: London, UK, 2023; pp. 288–306. [Google Scholar]
Bandara, K.; Bergmeir, C.; Hewamalage, H. LSTM-MSNet: Leveraging forecasts on sets of related time series with multiple seasonal patterns. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 1586–1599. [Google Scholar] [CrossRef]
Wen, R.; Torkkola, K.; Narayanaswamy, B.; Madeka, D. A multi-horizon quantile recurrent forecaster. arXiv 2017, arXiv:1711.11053. [Google Scholar]
Oreshkin, B.N.; Carpov, D.; Chapados, N.; Bengio, Y. N-BEATS: Neural basis expansion analysis for interpretable time series forecasting. arXiv 2019, arXiv:1905.10437. [Google Scholar]
Gronauer, S.; Diepold, K. Multi-agent deep reinforcement learning: A survey. Artif. Intell. Rev. 2022, 55, 895–943. [Google Scholar] [CrossRef]
Zadro, F. Utilizing Reinforcement Learning for Decision Management in B2B Cus-tomer Relationship Management Envi-ronments. Ph.D. Thesis, University of Applied Sciences Technikum Wien, Vienna, Austria, 2020. [Google Scholar]
D’Souza, S. Implementing reinforcement learning algorithms in retail supply chains with OpenAI gym toolkit. arXiv 2021, arXiv:2104.14398. [Google Scholar]
Liu, X.; Hu, M.; Peng, Y.; Yang, Y. Multi-agent deep reinforcement learning for multi-echelon inventory management. Prod. Oper. Manag. 2022, 10591478241305863. [Google Scholar] [CrossRef]
Bensoussan, A.; Xie, Y.; Yan, H. Joint inventory-pricing optimization with general demands: An alternative approach for concavity preservation. Prod. Oper. Manag. 2019, 28, 2390–2404. [Google Scholar] [CrossRef]
McCormick, H.; Cartwright, J.; Perry, P.; Barnes, L.; Lynch, S.; Ball, G. Fashion retailing–past, present and future. Text. Prog. 2014, 46, 227–321. [Google Scholar] [CrossRef]
Chen, Y.; Song, G.; Ye, Z.; Jiang, X. Scalable and transferable reinforcement learning for multi-agent mixed cooperative–competitive environments based on hierarchical graph attention. Entropy 2022, 24, 563. [Google Scholar] [CrossRef]
Gardner, E.S., Jr. Exponential smoothing: The state of the art—Part II. Int. J. Forecast. 2006, 22, 637–666. [Google Scholar] [CrossRef]
Hyndman, R. Forecasting: Principles and Practice; OTexts: Melbourne, Australia, 2018. [Google Scholar]
Durbin, J.; Koopman, S.J. Time Series Analysis by State Space Methods; Oxford University Press: Oxford, UK, 2012; Volume 38. [Google Scholar]
Munim, Z.H. State-space TBATS model for container freight rate forecasting with improved accuracy. Marit. Transp. Res. 2022, 3, 100057. [Google Scholar] [CrossRef]
Cornell, C.; Mitchell, L.; Roughan, M. Vector Autoregression in Cryptocurrency Markets: Unraveling Complex Causal Networks. arXiv 2023, arXiv:2308.15769. [Google Scholar]
Yelland, P.M. Bayesian forecasting of parts demand. Int. J. Forecast. 2010, 26, 374–396. [Google Scholar] [CrossRef]
Allenby, G.M.; Rossi, P.E. Hierarchical bayes models. In The Handbook of Marketing Research: Uses, Misuses, and Future Advances; Sage: Washington, DC, USA, 2006; pp. 418–440. [Google Scholar]
Hwang, S. Dynamic regression models for prediction of construction costs. J. Constr. Eng. Manag. 2009, 135, 360–367. [Google Scholar] [CrossRef]
Wangdi, K.; Singhasivanon, P.; Silawan, T.; Lawpoolsri, S.; White, N.J.; Kaewkungwal, J. Development of temporal modelling for forecasting and prediction of malaria infections using time-series and ARIMAX analyses: A case study in endemic districts of Bhutan. Malar. J. 2010, 9, 251. [Google Scholar] [CrossRef]
Sutskever, I. Sequence to Sequence Learning with Neural Networks. arXiv 2014, arXiv:1409.3215. [Google Scholar]
Fan, C.; Zhang, Y.; Pan, Y.; Li, X.; Zhang, C.; Yuan, R.; Wu, D.; Wang, W.; Pei, J.; Huang, H. Multi-horizon time series forecasting with temporal attention learning. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 2527–2535. [Google Scholar]
Bai, S.; Kolter, J.Z.; Koltun, V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv 2018, arXiv:1803.01271. [Google Scholar]
Chen, R.T.; Rubanova, Y.; Bettencourt, J.; Duvenaud, D.K. Neural ordinary differential equations. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Volume 31. [Google Scholar]
Salinas, D.; Flunkert, V.; Gasthaus, J.; Januschowski, T. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. Int. J. Forecast. 2020, 36, 1181–1191. [Google Scholar] [CrossRef]
Rangapuram, S.S.; Seeger, M.W.; Gasthaus, J.; Stella, L.; Wang, Y.; Januschowski, T. Deep state space models for time series forecasting. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; Volume 31. [Google Scholar]
Vaswani, A. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017. [Google Scholar]
Makridakis, S.; Spiliotis, E.; Assimakopoulos, V. The M4 Competition: 100,000 time series and 61 forecasting methods. Int. J. Forecast. 2020, 36, 54–74. [Google Scholar] [CrossRef]
Faloutsos, C.; Flunkert, V.; Gasthaus, J.; Januschowski, T.; Wang, Y. Forecasting big time series: Theory and practice. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, Anchorage, AK, USA, 4–8 August 2019; pp. 3209–3210. [Google Scholar]
Yamamura, C.L.K.; Santana, J.C.C.; Masiero, B.S.; Quintanilha, J.A.; Berssaneti, F.T. Forecasting new product demand using domain knowledge and machine learning. Res. Technol. Manag. 2022, 65, 27–36. [Google Scholar] [CrossRef]
Torres, J.F.; Galicia, A.; Troncoso, A.; Martínez-Álvarez, F. A scalable approach based on deep learning for big data time series forecasting. Integr. Comput. Aided Eng. 2018, 25, 335–348. [Google Scholar] [CrossRef]
Porteus, E.L. Foundations of Stochastic Inventory Theory; Stanford University Press: Redwood City, CA, USA, 2002. [Google Scholar]
Geunes, J. Foundations of inventory management. Interfaces 2002, 32, 108–110. [Google Scholar]
Garcia, D.J.; You, F. Supply chain design and optimization: Challenges and opportunities. Comput. Chem. Eng. 2015, 81, 153–170. [Google Scholar] [CrossRef]
Bertsimas, D.; Kallus, N. From predictive to prescriptive analytics. Manag. Sci. 2020, 66, 1025–1044. [Google Scholar] [CrossRef]
Oroojlooyjadid, A.; Snyder, L.V.; Takáč, M. Applying deep learning to the newsvendor problem. IISE Trans. 2020, 52, 444–463. [Google Scholar] [CrossRef]
Karaesmen, I.Z.; Scheller-Wolf, A.; Deniz, B. Managing perishable and aging inventories: Review and future research directions. In Planning Production and Inventories in the Extended Enterprise: A State of the Art Handbook; Springer: New York, NY, USA, 2011; Volume 1, pp. 393–436. [Google Scholar]
Gallino, S.; Moreno, A. Integration of online and offline channels in retail: The impact of sharing reliable inventory availability information. Manag. Sci. 2014, 60, 1434–1451. [Google Scholar] [CrossRef]
Snyder, L.V.; Atan, Z.; Peng, P.; Rong, Y.; Schmitt, A.J.; Sinsoysal, B. OR/MS models for supply chain disruptions: A review. Iie Trans. 2016, 48, 89–109. [Google Scholar] [CrossRef]
Ren, S.; Jin, J.; Niu, G.; Liu, Y. ARCS: Adaptive Reinforcement Learning Framework for Automated Cybersecurity Incident Response Strategy Optimization. Appl. Sci. 2025, 15, 951. [Google Scholar] [CrossRef]
Zhao, X.; Wu, C. Large-scale machine learning cluster scheduling via multi-agent graph reinforcement learning. IEEE Trans. Netw. Serv. Manag. 2021, 19, 4962–4974. [Google Scholar] [CrossRef]
Rolf, B.; Jackson, I.; Müller, M.; Lang, S.; Reggelin, T.; Ivanov, D. A review on reinforcement learning algorithms and applications in supply chain management. Int. J. Prod. Res. 2023, 61, 7151–7179. [Google Scholar] [CrossRef]
Yan, Y.; Chow, A.H.; Ho, C.P.; Kuo, Y.H.; Wu, Q.; Ying, C. Reinforcement learning for logistics and supply chain management: Methodologies, state of the art, and future opportunities. Transp. Res. Part Logist. Transp. Rev. 2022, 162, 102712. [Google Scholar] [CrossRef]
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef]
Konda, V.; Tsitsiklis, J. Actor-critic algorithms. In Proceedings of the Advances in Neural Information Processing Systems, Denver, CO, USA, 29 November–4 December 1999; Volume 12. [Google Scholar]
Song, G.; Khaligh, V.; Liu, J.J.; Na, J. Hierarchical deep reinforcement learning for hydrogen supply chain management. In Computer Aided Chemical Engineering; Elsevier: Amsterdam, The Netherlands, 2024; Volume 53, pp. 2905–2910. [Google Scholar]
de Witt, C.S.; Peng, B.; Kamienny, P.A.; Torr, P.; Böhmer, W.; Whiteson, S. Deep multi-agent reinforcement learning for decentralized continuous cooperative control. arXiv 2020, arXiv:2003.06709. [Google Scholar]
Jiang, J.; Dun, C.; Huang, T.; Lu, Z. Graph convolutional reinforcement learning. arXiv 2018, arXiv:1810.09202. [Google Scholar]
Chen, S.; Liu, Y.; Zhang, Q.; Shao, Z.; Wang, Z. Multi-Distance Spatial-Temporal Graph Neural Network for Anomaly Detection in Blockchain Transactions. Adv. Intell. Syst. 2025. Early View. [Google Scholar] [CrossRef]
Wu, Z.; Pan, S.; Chen, F.; Long, G.; Zhang, C.; Philip, S.Y. A comprehensive survey on graph neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2020, 32, 4–24. [Google Scholar] [CrossRef]
Ho, G.; Tang, Y.M.; Leung, E.K.; Tong, P. Integrated reinforcement learning of automated guided vehicles dynamic path planning for smart logistics and operations. Transp. Res. Part Logist. Transp. Rev. 2025, 196, 104008. [Google Scholar] [CrossRef]
Plazibat, I.; Dadic, M. Contemporary Issues in the Retail Industry; Acadlore Publishing Services Limited: Hong Kong, China, 2017. [Google Scholar]
Li, X.; Shen, J.; Wang, D.; Lu, W.; Chen, Y. Multi-modal transform-based fusion model for new product sales forecasting. Eng. Appl. Artif. Intell. 2024, 133, 108606. [Google Scholar] [CrossRef]
Gardino, G.B.; Meo, R.; Craparotta, G. Multi-view latent learning applied to fashion industry. Inf. Syst. Front. 2021, 23, 53–69. [Google Scholar] [CrossRef]
Babaev, D.; Ovsov, N.; Kireev, I.; Ivanova, M.; Gusev, G.; Nazarov, I.; Tuzhilin, A. Coles: Contrastive learning for event sequences with self-supervision. In Proceedings of the 2022 International Conference on Management of Data, Philadelphia, PA, USA, 12–17June 2022; pp. 1190–1199. [Google Scholar]
Li, J.; Fan, L.; Wang, X.; Sun, T.; Zhou, M. Product demand prediction with spatial graph neural networks. Appl. Sci. 2024, 14, 6989. [Google Scholar] [CrossRef]
Pearl, J.; Mackenzie, D. The Book of Why: The New Science of Cause and Effect; Basic Books: New York, NY, USA, 2018. [Google Scholar]
Dunnhumby. The Complete Journey Dataset; Dunnhumby: London, UK, 2018. [Google Scholar]
Pao, J.J.; Sullivan, D.S. Time Series Sales Forecasting; Final Year Project, Computer Science; Stanford University: Stanford, CA, USA, 2017. [Google Scholar]
Chen, D.; Sain, S.L.; Guo, K. Online Retail Data Set. 2015. Available online: https://archive.ics.uci.edu/dataset/352/online+retail (accessed on 20 January 2025).

Figure 1. Overview of the proposed MARIOD framework.

Figure 2. Forecast accuracy during promotional periods. MARIOD demonstrates superior adaptation to sudden demand changes compared with baseline methods.

Figure 3. Inventory-level trajectories comparing MARIOD with baseline methods. Note the reduced variability and more efficient inventory utilization.

Table 1. Dataset characteristics.

Characteristic	Dunnhumby	Favorita	UCI Retail
Time Span	2 years	5 years	1 year
Stores/Channels	92	54	Online
Products (SKUs)	43,000	33,000	4373
Transactions	2.5M	174M	541,909
Geography	USA	Ecuador	UK

Table 2. Forecast accuracy comparison.

Method	MAPE	RMSE	sMAPE	Coverage	CRPS
SARIMA	24.3	156.2	23.8	89.2	84.6
Prophet	22.1	148.9	21.5	90.8	82.3
DeepAR	19.8	132.4	19.2	92.4	78.9
N-BEATS	18.4	128.7	18.1	93.1	76.2
TFT	17.2	125.3	17.4	93.8	74.5
MARIOD	15.6	119.8	15.9	94.6	71.2

Table 3. Inventory optimization performance.

Method	Service Level	Avg Inventory	Stockout Rate	Turnover Ratio
DQN	92.3	845.6	7.7	12.4
DDPG	93.1	823.4	6.9	12.8
MARL-Basic	94.2	798.7	5.8	13.2
H-MARL	94.8	782.3	5.2	13.6
MARIOD	96.5	756.9	3.5	14.3

Table 4. Computational performance analysis.

Method	Training Time (h)	Inference Time (ms)	Memory (GB)
DeepAR + DQN	48.3	245	24.6
TFT + DDPG	52.7	278	28.3
MARL-Basic	45.2	198	22.8
H-MARL	43.8	185	21.4
MARIOD	38.5	156	19.8

Table 5. Attention mechanism ablation results.

Configuration	MAPE	Service Level	Training Time
Base	18.4	94.2	42.3
+Temporal	16.8	95.1	40.8
+Cross-modal	15.6	96.5	38.5

Table 6. Hierarchical architecture ablation results.

Configuration	Stockout Rate	Avg Inventory	Communication Overhead
Single-level	6.8	824.6	–
Two-level	4.9	785.3	0.28
Three-level	3.5	756.9	0.35

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Yang, Y.; Wang, M.; Wang, J.; Li, P.; Zhou, M. Multi-Agent Deep Reinforcement Learning for Integrated Demand Forecasting and Inventory Optimization in Sensor-Enabled Retail Supply Chains. Sensors 2025, 25, 2428. https://doi.org/10.3390/s25082428

AMA Style

Yang Y, Wang M, Wang J, Li P, Zhou M. Multi-Agent Deep Reinforcement Learning for Integrated Demand Forecasting and Inventory Optimization in Sensor-Enabled Retail Supply Chains. Sensors. 2025; 25(8):2428. https://doi.org/10.3390/s25082428

Chicago/Turabian Style

Yang, Yongbin, Mengdie Wang, Jiyuan Wang, Pan Li, and Mengjie Zhou. 2025. "Multi-Agent Deep Reinforcement Learning for Integrated Demand Forecasting and Inventory Optimization in Sensor-Enabled Retail Supply Chains" Sensors 25, no. 8: 2428. https://doi.org/10.3390/s25082428

APA Style

Yang, Y., Wang, M., Wang, J., Li, P., & Zhou, M. (2025). Multi-Agent Deep Reinforcement Learning for Integrated Demand Forecasting and Inventory Optimization in Sensor-Enabled Retail Supply Chains. Sensors, 25(8), 2428. https://doi.org/10.3390/s25082428

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Multi-Agent Deep Reinforcement Learning for Integrated Demand Forecasting and Inventory Optimization in Sensor-Enabled Retail Supply Chains

Abstract

1. Introduction

2. Related Work

2.1. Traditional Demand Forecasting

2.2. Machine Learning for Retail Forecasting

2.3. Inventory Optimization

2.4. Reinforcement Learning in Supply Chain Management

2.5. Multi-Modal Learning in Retail

3. Methodology

3.1. Reinforcement Learning Problem Formulation

3.2. Framework Overview

3.3. Transformer-Based Multi-Modal Demand Forecasting

3.4. Cross-Modal Attention Mechanism

3.5. Hierarchical Multi-Agent Inventory Optimization

3.6. Joint Learning and Optimization

3.7. Convergence Analysis

4. Experimental Results and Analysis

4.1. Experimental Setup

4.1.1. Datasets

4.1.2. Implementation Details

4.1.3. Baseline Methods

4.2. Performance Evaluation

4.2.1. Forecast Accuracy

4.2.2. Inventory Optimization

4.2.3. Computational Efficiency

4.3. Ablation Studies

Hierarchical Architecture Analysis

5. Conclusions and Future Work

Author Contributions

Funding

Institutional Review Board Statement

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI