Proceeding Paper

Cascading Multi-Agent Policy Optimization for Demand Forecasting †

by
Saeed Varasteh Yazdi
AIM Research Center on Quantitative Methods in Business (QUANT), Emlyon Business School, 69007 Lyon, France
Presented at the 11th International Conference on Time Series and Forecasting, Gran Canaria, Spain, 16–18 July 2025.
Comput. Sci. Math. Forum 2025, 11(1), 18; https://doi.org/10.3390/cmsf2025011018
Published: 31 July 2025

Abstract

Reliable demand forecasting is crucial for effective supply chain management, where inaccurate forecasts can lead to frequent out-of-stock or overstock situations. While numerous statistical and machine learning methods have been explored for demand forecasting, reinforcement learning approaches, despite their significant potential, remain little known in this domain. In this paper, we propose a multi-agent deep reinforcement learning solution designed to accurately predict demand across multiple stores. We present empirical evidence that demonstrates the effectiveness of our model using a real-world dataset. The results confirm the practicality of our proposed approach and highlight its potential to improve demand forecasting in retail and potentially other forecasting scenarios.

1. Introduction

A large retailer operates multiple stores across a wide geographical area, with each store selling thousands of products (e.g., Walmart). Demand planning for such retailers often relies on fixed, rule-based approaches for forecasting and replenishment order management. While this method works adequately for stable and predictable product categories, it falls short in optimizing the inventory for products with nonlinear and volatile demand patterns. In these scenarios, accurate demand forecasting becomes critical. Demand forecasting as a key component of the supply chain has been extensively studied due to its practical importance in production planning and inventory control. Traditionally, demand forecasting has relied heavily on statistical time series methods, such as exponential smoothing and ARIMA, which have been widely applied in various domains, including energy [1], transportation [2], fashion [3], retail [4], and finance [5]. Over the past two decades, forecasting techniques have evolved significantly, with machine learning emerging as a powerful tool for developing more sophisticated models. Machine learning and deep learning algorithms, with their ability to process large amounts of data and capture complex nonlinear relationships, can uncover patterns and trends that traditional statistical methods may miss [6].
Reinforcement learning (RL) [7], particularly when combined with deep learning, known as deep reinforcement learning (DRL) [8], forms a powerful approach to making decisions in dynamic environments. The primary goal of RL methods is to discover an optimal policy that maximizes the expected cumulative reward over time, which can be achieved using techniques such as Q-learning [9] and policy gradients [10]. Recently, RL has demonstrated promising potential in various challenging scenarios that necessitate dynamic modeling and long-term planning, including gaming [11], real-time ad bidding [12], and recommender systems [13]. RL methods have also been explored for addressing several inventory management problems [14]. However, their application to demand forecasting remains relatively underexplored.
Although RL methods generally depend on online data collection, offline RL algorithms [15] offer a data-centric approach that focuses on learning from large, static datasets. This paradigm enables the applicability of RL to areas where real-time data collection is impractical or impossible, such as forecasting. In this context, the model relies exclusively on past experience from an unknown policy to be discovered. As a result, a well-designed RL system can generate forecasting policies optimized for complex scenarios, often surpassing the adaptability of conventional forecasting techniques. In this paper, we seek to address the forecasting problem using RL by investigating the following research questions:
  • RQ1: Can RL be used for forecasting problems?
  • RQ2: How does the performance of an RL-based forecasting system compare to that of state-of-the-art gradient boosting models?
To address these research questions, we introduce a novel deep reinforcement learning multi-agent system, where agents collaborate to make predictive decisions, resulting in a highly accurate demand forecasting solution for multi-store retailers. Our proposed method effectively predicts future retail sales by analyzing historical sales data. To evaluate the performance of our framework, we conduct a comparative analysis against other state-of-the-art forecasting methods using a historical sales dataset from Walmart. The main contributions of our work can be summarized as follows:
  • We propose a novel multi-agent DRL architecture to solve the real-world problem of multi-store demand forecasting, where agents interact and share knowledge in a cascading fashion. It is, to the best of our knowledge, the first such study.
  • We highlight the potential of DRL methods in forecasting, an area that has not been well studied.
  • We conducted extensive experiments to validate the model’s rationale and practicality, as well as to compare its performance against that of the existing methods.
The remainder of this study is organized as follows. Section 2 provides a brief overview of related studies in the literature on the use of machine learning and reinforcement learning for demand forecasting. Section 3 introduces our proposed multi-agent DRL architecture and its components. Section 4 presents the experimental settings, followed by the results in Section 5. Finally, Section 6 concludes this study and discusses future research directions.

2. The Literature Review

Researchers have employed various mathematical modeling and optimization techniques to tackle a wide range of inventory-related problems, including forecasting, procurement, production, distribution, and transportation [16,17]. Demand forecasting, which is the primary focus of this paper, has a rich history, evolving from simple rule-based or statistical methods [18,19] to more advanced predictive analytics powered by machine learning and deep learning algorithms [20]. Among the machine learning approaches, support vector machines [21], decision trees, gradient boosting trees [22], and ensemble methods [23] have emerged as some of the most widely used techniques. A recent study by Huber et al. [24] demonstrates that machine learning methods can deliver more accurate forecasts for large-scale demand applications compared to traditional statistical approaches. Neural-network-based demand forecasting methods have also been extensively explored in the literature for their capacity to capture complex demand dynamics [25]. Among these, recurrent neural networks (RNNs) are particularly popular due to their effectiveness in modeling sequential dependencies and temporal patterns. RNNs and their variants, such as LSTMs and GRUs, are well suited to demand forecasting because they can learn from historical data patterns and adapt to fluctuations, making them valuable for industries with seasonally or cyclically influenced demand.
RL methods have been widely studied in the context of various inventory management problems, including demand forecasting. For example, Q-learning [9] has traditionally been a prominent reinforcement learning method for tackling inventory management challenges [26,27], while Deep Q-Networks (DQNs) [28] have been employed to address the beer game problem [14]. Recently, a multi-agent DRL approach [29] was used to optimize replenishment decisions in inventory management; however, the proposed approach was limited to a single store and focused primarily on optimizing shared inventory resources. Similar studies have demonstrated the potential of DRL in demand forecasting. However, these approaches differ significantly in terms of the problem formulation and solution when applied to multi-store demand forecasting, which is the focus of this study. For example, Liu et al. [30] showcased the practicality and efficiency of various DRL methods in predicting building energy consumption, while a related study [31] demonstrated the effectiveness of a hybrid DRL-LSTM model in forecasting next-day electricity usage. To date, the use of DRL for demand forecasting has seen limited exploration. To fill this gap in the literature, specifically the absence of a strong DRL-based approach to retail demand forecasting, we developed a multi-agent DRL model tailored to multi-store retailers.

3. The Problem Statement and the Proposed Solution

We formulate the demand forecasting problem for a multi-store retailer as a multi-agent reinforcement learning system, where each agent is responsible for predicting sales of multiple items in each store. To do this, we propose a set of cascading agents, each responsible for one store, and train them using a policy gradient optimization strategy. We assume that there is an upstream warehouse capable of fulfilling all orders. In the following, we outline the key components of our model, the environment, and the proposed objective function. An overview of the model’s architecture is provided in Figure 1.

3.1. The Multi-Agent Environment

In a multi-agent learning domain, the Markov Decision Process (MDP) [32] is generalized to a stochastic game, or Markov game (MG) [33], in which multiple agents interact simultaneously within a shared environment. The MG is simply an extension of the MDP where a set of states $s \in S$ is observed by $N > 1$ agents. The Partially Observable Markov Game (POMG) is a generalization of both the MG and the MDP in which agents do not perceive the whole state space $S$ but only a subset $O \subseteq S$, called the observation space. Our environment is defined in a similar way to the MG and POMG, where we denote $S = S^1 \times \cdots \times S^N$ as the collection of the individual local state spaces of the agents $i \in \{1, \dots, N\}$ and $A = A^1 \times \cdots \times A^N$ as their joint action space. The transition probability function describes the probability of the whole state transition, and finally, $r^i : S^i \times A^i \times S^i \to \mathbb{R}$ represents the associated reward function of agent $i$. At each step $t$, each agent $i$ selects an action according to its individual policy $\pi^i : S^i \to p(A^i \mid S^i)$. After the transition to the next state, each agent receives $r_t^i$ as immediate feedback on the state transition. As in the single-agent problem, each agent's goal is to update its policy to obtain higher rewards. Our multi-agent RL framework has the following components (a minimal code sketch follows the list):
  • $s_t^i \in S^i$ is the state representation of agent $i$ at step $t$, consisting of the item embedding, the state feature vector, and the action information of the other agents. Further elaboration on this representation is provided in the following sections.
  • The joint actions $a_t = (a_t^1, a_t^2, \dots, a_t^N)$ are continuous, non-negative vectors of dimension $N$; $a_t^i$ is the action (predicted sales) provided by agent $i$ at step $t$.
  • At each step $t$, agent $i$ observes state $s_t^i$ and takes action $a_t^i$. The corresponding reward value $r_t^i$ is computed as the difference between its action (predicted sales) and the target (actual sales), and is unique to the agent.
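To make these components concrete, here is a minimal sketch of the per-agent state and reward. The class and attribute names are illustrative, not the author's implementation, and the negative-squared-error reward anticipates the definition given in Section 4.1, with the sign chosen so that a higher reward means a better forecast.

```python
import numpy as np

class CascadingForecastEnv:
    """Minimal multi-store environment sketch: one agent per store, states
    built from per-step features plus the actions of preceding agents."""

    def __init__(self, base_states, targets):
        # base_states: (T, N, d) per-step, per-agent feature vectors
        # targets:     (T, N)    actual (normalized) sales
        self.base_states = np.asarray(base_states)
        self.targets = np.asarray(targets)
        self.T, self.N, _ = self.base_states.shape

    def state(self, t, i, previous_actions):
        """State of agent i at step t, augmented with the (predicted or true)
        sales already produced by the preceding agents in the cascade."""
        return np.concatenate([self.base_states[t, i],
                               np.asarray(previous_actions, dtype=float)])

    def reward(self, t, i, action):
        """Negative squared error: higher reward = more accurate forecast."""
        return -(float(action) - self.targets[t, i]) ** 2
```

At step $t$ the environment is queried agent by agent, with each newly produced action appended to `previous_actions` before the next agent builds its state.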

3.2. Cascading Agents

A key challenge in multi-agent systems is identifying the interdependencies between agents, especially when the actions of one agent may affect the others; in our case, the impact that selling an item in one store has on the other stores. Therefore, rather than training the agents individually, we train them in a cascading fashion, using the actions taken by the previous agents as an additional input to the subsequent ones. First, a random permutation of agents is chosen. Given the current arrangement, state representations are generated for each agent. The first agent is constrained to estimate its action $a_t^1$ at step $t$ based on $s_t^1$ alone. Subsequently, for $i > 1$, the agents are given an augmented state that includes the action values of the previous agents. Given the state of the first agent, $s_t^1$, the states of the remaining agents are obtained as
$$s_t^i = \big(s_t^1, \{a_t^j\}_{j=1}^{i-1}\big), \quad 1 < i \le N.$$
Note that at training time, it is advantageous to use the true action values instead of those predicted by the agents, while at execution time, each agent uses the predicted action values of the preceding agents. Since the prediction performance of the agents may be sensitive to the order of the chain, we optimize the order of the agents by training them on several randomly chosen permutations.
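The cascading pass itself can be sketched as follows, reusing the hypothetical environment above. `predict` is an assumed method returning a point forecast, and the `true_actions` argument implements the teacher forcing used at training time; this is an illustration, not the author's code.

```python
def cascade_step(env, policies, t, order, true_actions=None):
    """Run one cascading pass over all agents at step t.
    `order` is a permutation of agent indices. During training the true
    sales are fed downstream (teacher forcing); at execution time each
    agent's own prediction is passed to the agents that follow it."""
    shared, actions = [], {}
    for i in order:
        s_ti = env.state(t, i, shared)      # state augmented with previous actions
        a_ti = policies[i].predict(s_ti)    # predicted sales for store i
        actions[i] = a_ti
        shared.append(true_actions[i] if true_actions is not None else a_ti)
    return actions
```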

3.3. Parameter Sharing

The use of multiple agents introduces additional training overhead, as each agent requires its own policy to be trained, leading to increased computational and memory demands when the policies are represented by neural networks. For homogeneous agents, training efficiency can be improved through parameter sharing, which is particularly beneficial when the policies are updated with the policy gradient method. We use a soft parameter sharing technique: every fixed number of episodes, the parameters of the best-performing policy are shared with the others; i.e., at predefined intervals, for agent $i$,
$$W^i = \eta W^i + (1 - \eta) W_{best},$$
where $W^i$ is the weight matrix of the policy network of agent $i$, and $W_{best}$ is the weight matrix of the best-performing policy network, defined as the one with the highest cumulative reward so far. $\eta$ is a hyperparameter between 0 and 1 that controls the strength of the sharing and needs to be tuned. Note that sharing the network parameters serves the dual purpose of transferring the knowledge acquired by the other agents and mitigating the risk of overfitting the current policy.
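A minimal sketch of this soft sharing step, assuming the policies are PyTorch modules (again with illustrative names, not the author's code):

```python
import torch

@torch.no_grad()
def soft_share(policies, cumulative_rewards, eta):
    """Blend every policy's weights with the best-performing policy's:
    W_i <- eta * W_i + (1 - eta) * W_best."""
    best = max(range(len(policies)), key=lambda i: cumulative_rewards[i])
    best_params = [p.detach().clone() for p in policies[best].parameters()]
    for i, policy in enumerate(policies):
        if i == best:
            continue
        for p, p_best in zip(policy.parameters(), best_params):
            p.mul_(eta).add_((1.0 - eta) * p_best)
    return best
```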

3.4. The Objective Function

Given the proposed architecture and the parameter sharing technique discussed above, we extend the proximal policy optimization (PPO) algorithm [34] to multiple agents. At each epoch, the policies are fixed, and the transitions $(s_t^i, \dot{a}_t^i, \pi_{\theta_t^i}(a_t^i \mid s_t^i), r_t^i)$ of all agents are recorded in memory for an episode of length $T$, where $\dot{a}_t^i$ is the target action of agent $i$ in state $s_t^i$. After computing the returns of each agent over the episode, the new policy parameters $\theta_{t+1}$ are trained over mini-batches of the stored transitions to optimize the clipped PPO objective, maximizing the reward $r_t^i(s_t^i, a_t^i)$. For each agent $i$, the policy objective is defined as
$$L_\pi^i(\theta^i) = \mathbb{E}_{\tau \sim \pi_{\theta^i}} \left[ \sum_t \min\Big( l(\theta^i)\, r_t^i(s_t^i, a_t^i),\ \mathrm{clip}\big(l(\theta^i), 1-\epsilon, 1+\epsilon\big)\, r_t^i(s_t^i, a_t^i) \Big) \right]$$
under the constraint of proximity to the previous policy imposed by the probability ratio $l(\theta^i) = \dfrac{\pi_{\theta^i}(a_t^i \mid s_t^i)}{\pi_{\theta^i_{old}}(a_t^i \mid s_t^i)}$.
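As a concrete illustration, the following minimal PyTorch sketch computes this clipped surrogate for one agent's mini-batch. The tensor names are assumptions, and the per-step reward is used in place of the usual advantage estimate, exactly as in the objective above.

```python
import torch

def clipped_objective(new_log_probs, old_log_probs, rewards, eps=0.2):
    """Clipped PPO surrogate for one agent's mini-batch.
    `new_log_probs`/`old_log_probs`: log pi(a|s) under the current and
    previous policies; `rewards`: r_t^i(s_t^i, a_t^i). Returns a scalar
    to be maximized (negate it for a gradient-descent optimizer)."""
    ratio = torch.exp(new_log_probs - old_log_probs)            # l(theta)
    unclipped = ratio * rewards
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * rewards
    return torch.min(unclipped, clipped).mean()
```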
The resulting functions $L_\pi^i$ are smooth and differentiable, allowing the objective to be maximized by standard gradient backpropagation on the model parameters $\theta^i$. Note that at each step $t$, all $N$ agents act sequentially, and all transitions are stored in memory; this memory is cleared before each new episode. Our multi-agent framework falls roughly into the commonly used category of centralized-training, decentralized-execution multi-agent systems [35]. Algorithm 1 describes the complete training procedure for our system.
Algorithm 1 Cascading multi-agent PPO
Input: initial policy parameters $\theta_0^i$, $\epsilon$, $k$, and $\eta$.
 1: Initialize the cumulative reward vector $reward = 0_N$.
 2: for episode $e = 1, 2, \dots$ do
 3:     for $t = 1, 2, \dots, T$ do
 4:         Build the state representations $s_t^i$ for each agent $i$.
 5:         Save the transitions $(s_t^i, \dot{a}_t^i, \pi_{\theta_e^i}(a_t^i \mid s_t^i), r_t^i)$ in memory using the current policies $\pi_{\theta_e^i}$.
 6:     end for
 7:     Update the agent rewards: $reward^i = reward^i + \sum_{m=1}^{e} \sum_{t=1}^{T} r_t^i$.
 8:     for $i = 1, 2, \dots, N$ do
 9:         Update the policy by maximizing the objective function via stochastic gradient ascent:
            $\theta_{e+1}^i = \arg\max_{\theta^i} \frac{1}{T} \sum_{t=0}^{T} \min\!\left( \frac{\pi_{\theta^i}(a_t^i \mid s_t^i)}{\pi_{\theta_e^i}(a_t^i \mid s_t^i)}\, r_t^i(s_t^i, a_t^i),\ \mathrm{clip}\!\left( \frac{\pi_{\theta^i}(a_t^i \mid s_t^i)}{\pi_{\theta_e^i}(a_t^i \mid s_t^i)}, 1-\epsilon, 1+\epsilon \right) r_t^i(s_t^i, a_t^i) \right)$
10:     end for
11:     Every $k$ episodes:
12:         Compute the agents' cumulative rewards and find the best-performing agent: $best = \arg\max(reward)$.
13:         Update the policy parameters: $\forall i \le N,\ W^i = \eta W^i + (1 - \eta) W_{best}$.
14:         $reward = 0_N$.
15: end for
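For readability, the sketch below mirrors the control flow of Algorithm 1 using the helper functions sketched earlier (`cascade_step`, `soft_share`) plus a hypothetical `ppo_update` that maximizes the clipped objective over the stored transitions; it is an illustration under those assumptions, not the author's code.

```python
import random

def train(env, policies, episodes, T, k=50, eta=0.05, eps=0.2):
    """Cascaded rollouts, per-agent PPO updates, and soft parameter sharing
    every k episodes (simplified: a complete implementation would also store
    states and old log-probabilities for the PPO update)."""
    N = len(policies)
    cum_rewards = [0.0] * N
    for episode in range(episodes):
        memory = [[] for _ in range(N)]                  # per-agent transitions
        order = random.sample(range(N), N)               # random agent permutation
        for t in range(T):
            actions = cascade_step(env, policies, t, order,
                                   true_actions=env.targets[t])
            for i in order:
                r = env.reward(t, i, actions[i])
                cum_rewards[i] += r
                memory[i].append((t, actions[i], r))
        for i in range(N):
            ppo_update(policies[i], memory[i], eps=eps)  # maximize clipped objective
        if (episode + 1) % k == 0:                       # soft parameter sharing
            soft_share(policies, cum_rewards, eta)
            cum_rewards = [0.0] * N
```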

4. Experiments

We evaluate the performance of the proposed method on real sales data provided under the M5 competition [36] by the University of Nicosia, which aims to predict future retail sales at various Walmart stores in three US states: California (CA), Texas (TX), and Wisconsin (WI).

4.1. Data and Methods

The sales data spans from January 2011 to April 2016 and includes detailed information on items, departments, categories, and store specifics. Additionally, it includes explanatory variables such as price, promotions, day of the week, and special events. The dataset contains a total of 3049 individual items across three main categories (Hobbies, Food, and Household) and seven subcategories (Hobbies 1 and 2; Food 1, 2, and 3; and Household 1 and 2), sold in 10 stores located in three different states. Figure 2 presents the daily and weekly sales profiles of two randomly selected items in the dataset. It is apparent from the figure that, with so many factors affecting daily sales, the data exhibits erratic behavior.
We trained the proposed model with 10 agents, one per store, in environments with $n \in \{50, 300, 700\}$ randomly selected items, to predict sales for the next 7 days. The states at step $t$ are represented by the extracted sales features, the item embedding, and the action information. The agent actions are the continuous normalized sales values, and the reward of each agent $i$ at step $t$ is set to $r_t^i = -(a_t^i - \dot{a}_t^i)^2$. We compared the prediction performance of our method with that of the most popular forecasting approaches, using Multiple Linear Regression (MLR) and a recurrent neural network (RNN) as baselines and the Light Gradient Boosting Machine (LGBM) [37] as a strong baseline. MLR is a statistical technique that uses multiple explanatory variables to predict the outcome of multiple response variables. Long Short-Term Memory networks (LSTMs) are an improved variant of RNNs that have been adopted as strong predictors due to their ability to capture long-range dependencies more accurately. Finally, gradient boosting decision trees (GBDTs) are among the most successful machine learning techniques used in forecasting competitions; according to the final results of the M5 competition, LightGBM (LGBM for short) was the most used and best-performing method.

4.2. The Experimental Setup

We followed a commonly used machine learning pipeline to train the models, starting with feature extraction and normalization. The extracted features include calendar information such as time of day, day of the week, and month of the year, as well as the item price and lagged values, i.e., the values of previous sales. Specifically, we incorporate lagged information from the previous 7, 14, and 28 days, along with rolling averages over the same periods. For the LSTM model, which requires temporal data as input, we transformed the data into multivariate time series with a sliding window operation, creating series that give the daily sales of an item over 14 consecutive days. Data normalization was performed for all models by scaling all features to lie between 0 and 1. Totals of 94 K, 564 K, and 1.3 M historical instances were considered for the environments with 50, 300, and 700 items, respectively. The datasets are divided into two parts, with sales from the last 7 days (i.e., 18 to 24 April) kept for testing. The same test set is used across all models; therefore, the predictions from the individual models are directly comparable.
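For illustration, a minimal sketch of this feature pipeline is given below, assuming a long-format pandas DataFrame with `store_id`, `item_id`, `date` (datetime), `sales`, and `sell_price` columns; the column names are assumptions, not those of the original code, and in practice the min-max scaling should be fitted on the training portion only.

```python
import pandas as pd

def build_features(df: pd.DataFrame) -> pd.DataFrame:
    """Lagged sales, rolling means, calendar fields, and [0, 1] scaling,
    computed per store/item series."""
    df = df.sort_values(["store_id", "item_id", "date"]).copy()
    grp = df.groupby(["store_id", "item_id"])["sales"]
    for lag in (7, 14, 28):
        df[f"lag_{lag}"] = grp.shift(lag)                      # sales `lag` days ago
        df[f"roll_mean_{lag}"] = grp.transform(
            lambda s, w=lag: s.shift(1).rolling(w).mean())     # trailing average
    df["day_of_week"] = df["date"].dt.dayofweek
    df["month"] = df["date"].dt.month
    feature_cols = [c for c in df.columns
                    if c.startswith(("lag_", "roll_mean_"))] + \
                   ["day_of_week", "month", "sell_price"]
    # min-max normalization to [0, 1]
    mins, maxs = df[feature_cols].min(), df[feature_cols].max()
    df[feature_cols] = (df[feature_cols] - mins) / (maxs - mins)
    return df.dropna(subset=feature_cols)
```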
For all methods, a line/grid search is performed over the hyperparameters given in Table 1, using 10% of the training data as the validation set. For the RNN architecture, we experiment with different combinations of stacked LSTM layers and hidden units, along with the batch size; the predictions are produced by a fully connected network over the final LSTM output. For LGBM, the best-performing parameters were selected. For our model, the batch size and $\eta$ are the only parameters that need to be tuned. The policy networks are two-layer fully connected neural networks with 100 hidden neurons per layer. We trained the model using the Adam optimizer with a learning rate of $1 \times 10^{-4}$. In our experiments, the $\epsilon$ and $k$ defined in Algorithm 1 are set to 0.2 and 50, respectively. After identifying the parameters that lead to the best prediction performance on the validation set, we merge the validation set back into the training set and retrain the models on this merged set.

4.3. Measuring the Forecasting Performance

To evaluate the effectiveness of the proposed method, we considered two metrics widely used in forecasting: the root mean squared error (RMSE) and the coefficient of determination $R^2$. Smaller RMSE values indicate more accurate predictions. $R^2$ can be interpreted as the proportion of the variance explained by the model, which is particularly relevant in our sales prediction problem; its best possible value is 1.0, and it can be negative, since a model can be arbitrarily worse.
$$R^2 = 1 - \frac{\mathrm{Var}(\dot{a}_t - \hat{a}_t)}{\mathrm{Var}(\dot{a}_t)},$$
where $\dot{a}_t$ denotes the actual and $\hat{a}_t$ the predicted sales.
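As a reference, a minimal implementation of the two metrics, consistent with the variance-based definition of $R^2$ above:

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error between actual and predicted sales."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def r2(actual, predicted):
    """Coefficient of determination: 1 - Var(actual - predicted) / Var(actual)."""
    actual, predicted = np.asarray(actual), np.asarray(predicted)
    return float(1.0 - np.var(actual - predicted) / np.var(actual))
```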
Our experiments are conducted with a fixed random seed to reduce the stochasticity associated with model training. The item selection process is repeated three times, and the performances are averaged. The final results are the mean and standard deviation across all targets (stores), reported in Table 2 and Table 3, with the best significantly different values (t-test at the 5% level) in bold.

5. Results and Discussion

Table 2 and Table 3 summarize the performance of the models in the three environments with $n \in \{50, 300, 700\}$ items. To compare the models better and to answer RQ1 and RQ2, the evaluation indices are calculated on two scales: daily and cumulative 7-day sales. First, we observe that our method, together with LGBM, achieves the best average $R^2$ and RMSE values on both the daily and 7-day scales, for all indices and all three environments, followed by MLR in third place and LSTM in last. Our method is competitive with LGBM, which is known to be a strong predictive method: it may not achieve the best score in some cases, but its results are statistically equivalent. Both MLR and LSTM perform inadequately across all scenarios, especially when dealing with small numbers of items, highlighting their limitations in this setting. For LSTM, this can be attributed to its reliance on a substantial amount of data to train effectively and capture the internal temporal relationships; alternatively, the poor performance could be due to overfitting. As shown in Table 3, the MLR model exhibits a large RMSE of 9.22, nearly twice the error of our model when $n$ is 50, confirming that linear solvers such as MLR also require a lot of data to accurately capture input–output relationships. Nevertheless, across all scenarios, the predictions improve as $n$ increases, and both the MLR and LSTM models converge towards nearly the same performance level.
For the 7-day forecast, our method and LGBM produce the results closest to the actual target sales, outperforming the other methods by a significant margin. The prediction performance of our model for $n = 50$, measured by the RMSE and $R^2$, is 5.74 and 0.82, respectively, far superior to LSTM and MLR. Similarly superior outcomes are achieved as $n$ increases: our method gives $R^2$ values of 0.90 and 0.92 compared to MLR's 0.83 and 0.89. Remarkable $R^2$ values of up to 0.92 indicate that the algorithm appears to be very good at 7-day forecasting. Meanwhile, for 300 and 700 items, our method does not match LGBM in terms of the RMSE, with values of 5.90 compared to LGBM's 4.99 and 6.45 compared to 5.15. However, this result should not be misinterpreted: the performance of our method is not inadequate, and both methods capture sales variability similarly, as indicated by their equivalent $R^2$ values. There are two potential reasons for the observed differences: the training mode and the policy architecture. Our model was trained episodically, following standard reinforcement learning principles, which may be a disadvantage compared to the conventional batch training used in supervised learning. The other reason could be the compact policy architecture (two fully connected layers). These factors represent avenues for future research and need to be explored further.
To better illustrate the accuracy of the proposed model, the cumulative 7-day predictions for two randomly selected items are presented in Figure 3 and Figure 4. The proposed framework performs similarly to LGBM: both methods efficiently capture the volatility observed across the stores, while our method performs better in predicting peak periods. The figures clearly show that the proposed multi-agent system accurately reflects the mixed patterns in both cases, with a high correlation with the actual values.
The results confirm the efficiency and practicality of our model. However, the advantage of our model is not limited to its predictive power. Our model can be used to provide a probabilistic demand forecast, allowing demand planners to incorporate more uncertain and random factors into their decisions in order to reduce operating costs and improve system reliability. Our model treats the daily sales of a product at each store as an independent random variable and estimates its probability density function (PDF). This approach allows us to evaluate sales behavior more effectively, as it takes into account the distribution of sales rather than relying solely on point estimates.
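As an illustration of how such a probabilistic forecast could be consumed, the sketch below draws Monte Carlo samples from a trained stochastic policy to obtain prediction quantiles. Here `policy.sample(state)` is an assumed interface returning one draw from the learned action distribution; the distribution family itself is not prescribed by the paper.

```python
import numpy as np

def forecast_quantiles(policy, state, n_samples=1000, quantiles=(0.1, 0.5, 0.9)):
    """Monte Carlo quantile forecast for one store/item state, e.g. lower,
    median, and upper sales estimates for safety-stock decisions."""
    draws = np.array([policy.sample(state) for _ in range(n_samples)])
    return dict(zip(quantiles, np.quantile(draws, quantiles)))
```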
In terms of computational time, it is not surprising that LGBM has the fastest training, followed by MLR, the simplest of the methods considered. Our method demonstrates remarkable sample efficiency, as even a fraction of the training data is sufficient to train the model, and its training time is the next fastest after MLR. In contrast, the training time of the LSTM models is considerably longer, approximately 20 times slower than that of the other methods, making it the most computationally demanding approach. Finally, it is worth mentioning that the efficiency of the proposed approach may in certain cases depend on an appropriate ordering of the agents in the permutation; however, our analysis shows that the model is robust to changes in the order, which further supports its reliability.

6. Conclusions

This research addresses the problem of demand forecasting in a supply chain replenishment management system. We propose a novel solution based on multi-agent deep reinforcement learning for demand forecasting to efficiently manage a multi-store retailer by taking full advantage of historical sales. Our approach incorporates three key techniques: cascading agents, parameter sharing, and proximal policy optimization. Furthermore, the proposed method goes beyond demand forecasting and can be applied to various other forecasting applications. The proposed approach was validated on real data, and its performance was assessed by comparing its predictions with those of the other models. The evaluation confirmed the performance of our method in predictive applications. Notably, our approach strikes a balance between performance and explainability. As for future research directions, we plan to investigate the interactions between the agents and the environment; in this context, the problem can be extended to incorporate various environmental constraints associated with inventory management. There is also potential for further studies to explore the predictive performance of the proposed method in diverse domains, including energy consumption prediction.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original data presented in this study are openly available from the M5 competition [36].

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Yang, Q.; Tian, Z. A hybrid load forecasting system based on data augmentation and ensemble learning under limited feature availability. Expert Syst. Appl. 2024, 261, 125567. [Google Scholar] [CrossRef]
  2. Xu, S.; Chan, H.K.; Zhang, T. Forecasting the demand of the aviation industry using hybrid time series SARIMA-SVR approach. Transp. Res. Part E Logist. Transp. Rev. 2019, 122, 169–180. [Google Scholar] [CrossRef]
  3. Swaminathan, K.; Venkitasubramony, R. Demand forecasting for fashion products: A systematic review. Int. J. Forecast. 2023, 40, 247–267. [Google Scholar] [CrossRef]
  4. Fildes, R.; Ma, S.; Kolassa, S. Retail forecasting: Research and practice. Int. J. Forecast. 2022, 38, 1283–1318. [Google Scholar] [CrossRef]
  5. Lu, C.J.; Lee, T.S.; Chiu, C.C. Financial time series forecasting using independent component analysis and support vector regression. Decis. Support Syst. 2009, 47, 115–125. [Google Scholar] [CrossRef]
  6. Aamer, A.; Eka Yani, L.; Alan Priyatna, I. Data analytics in the supply chain management: Review of machine learning applications in demand forecasting. Oper. Supply Chain. Manag. Int. J. 2020, 14, 1–13. [Google Scholar] [CrossRef]
  7. Kaelbling, L.P.; Littman, M.L.; Moore, A.W. Reinforcement learning: A survey. J. Artif. Intell. Res. 1996, 4, 237–285. [Google Scholar] [CrossRef]
  8. Arulkumaran, K.; Deisenroth, M.P.; Brundage, M.; Bharath, A.A. Deep reinforcement learning: A brief survey. IEEE Signal Process. Mag. 2017, 34, 26–38. [Google Scholar] [CrossRef]
  9. Watkins, C.J.; Dayan, P. Q-learning. Mach. Learn. 1992, 8, 279–292. [Google Scholar] [CrossRef]
  10. Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. Adv. Neural Inf. Process. Syst. 1999, 12, 1057–1063. [Google Scholar]
  11. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A.A.; Veness, J.; Bellemare, M.G.; Graves, A.; Riedmiller, M.; Fidjeland, A.K.; Ostrovski, G.; et al. Human-level control through deep reinforcement learning. Nature 2015, 518, 529–533. [Google Scholar] [CrossRef] [PubMed]
  12. Cai, H.; Ren, K.; Zhang, W.; Malialis, K.; Wang, J.; Yu, Y.; Guo, D. Real-time bidding by reinforcement learning in display advertising. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, Cambridge, UK, 6–10 February 2017; pp. 661–670. [Google Scholar]
  13. Zheng, G.; Zhang, F.; Zheng, Z.; Xiang, Y.; Yuan, N.J.; Xie, X.; Li, Z. DRN: A deep reinforcement learning framework for news recommendation. In Proceedings of the 2018 World Wide Web Conference, Lyon, France, 23–27 April 2018; pp. 167–176. [Google Scholar]
  14. Oroojlooyjadid, A.; Nazari, M.; Snyder, L.V.; Takáč, M. A deep q-network for the beer game: Deep reinforcement learning for inventory optimization. Manuf. Serv. Oper. Manag. 2022, 24, 285–304. [Google Scholar] [CrossRef]
  15. Levine, S.; Kumar, A.; Tucker, G.; Fu, J. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv 2020, arXiv:2005.01643. [Google Scholar] [CrossRef]
  16. Taleizadeh, A.A.; Niaki, S.T.A.; Aryanezhad, M.B.; Shafii, N. A hybrid method of fuzzy simulation and genetic algorithm to optimize constrained inventory control systems with stochastic replenishments and fuzzy demand. Inf. Sci. 2013, 220, 425–441. [Google Scholar] [CrossRef]
  17. Daniel, J.S.R.; Rajendran, C. A simulation-based genetic algorithm for inventory optimization in a serial supply chain. Int. Trans. Oper. Res. 2005, 12, 101–127. [Google Scholar] [CrossRef]
  18. Syntetos, A.A.; Boylan, J.E. The accuracy of intermittent demand estimates. Int. J. Forecast. 2005, 21, 303–314. [Google Scholar] [CrossRef]
  19. Ferbar, L.; Čreslovnik, D.; Mojškerc, B.; Rajgelj, M. Demand forecasting methods in a supply chain: Smoothing and denoising. Int. J. Prod. Econ. 2009, 118, 49–54. [Google Scholar] [CrossRef]
  20. Gonçalves, J.N.; Cortez, P.; Carvalho, M.S.; Frazao, N.M. A multivariate approach for multi-step demand forecasting in assembly industries: Empirical evidence from an automotive supply chain. Decis. Support Syst. 2021, 142, 113452. [Google Scholar] [CrossRef]
  21. Bolandnazar, E.; Rohani, A.; Taki, M. Energy consumption forecasting in agriculture by artificial intelligence and mathematical models. Energy Sources Part A Recover. Util. Environ. Eff. 2020, 42, 1618–1632. [Google Scholar] [CrossRef]
  22. Deng, S.; Su, J.; Zhu, Y.; Yu, Y.; Xiao, C. Forecasting carbon price trends based on an interpretable light gradient boosting machine and Bayesian optimization. Expert Syst. Appl. 2024, 242, 122502. [Google Scholar] [CrossRef]
  23. Huang, Y.; Yuan, Y.; Chen, H.; Wang, J.; Guo, Y.; Ahmad, T. A novel energy demand prediction strategy for residential buildings based on ensemble learning. Energy Procedia 2019, 158, 3411–3416. [Google Scholar] [CrossRef]
  24. Huber, J.; Stuckenschmidt, H. Daily retail demand forecasting using machine learning with emphasis on calendric special days. Int. J. Forecast. 2020, 36, 1420–1438. [Google Scholar] [CrossRef]
  25. Gutierrez, R.S.; Solis, A.O.; Mukhopadhyay, S. Lumpy demand forecasting using neural networks. Int. J. Prod. Econ. 2008, 111, 409–420. [Google Scholar] [CrossRef]
  26. Ravulapati, K.K.; Rao, J.; Das, T.K. A reinforcement learning approach to stochastic business games. IIE Trans. 2004, 36, 373–385. [Google Scholar] [CrossRef]
  27. Sui, Z.; Gosavi, A.; Lin, L. A reinforcement learning approach for inventory replenishment in vendor-managed inventory systems with consignment inventory. Eng. Manag. J. 2010, 22, 44–53. [Google Scholar] [CrossRef]
  28. Mnih, V.; Kavukcuoglu, K.; Silver, D.; Graves, A.; Antonoglou, I.; Wierstra, D.; Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv 2013, arXiv:1312.5602. [Google Scholar] [CrossRef]
  29. Ding, Y.; Feng, M.; Liu, G.; Jiang, W.; Zhang, C.; Zhao, L.; Song, L.; Li, H.; Jin, Y.; Bian, J. Multi-Agent Reinforcement Learning with Shared Resources for Inventory Management. arXiv 2022, arXiv:2212.07684. [Google Scholar] [CrossRef]
  30. Liu, T.; Tan, Z.; Xu, C.; Chen, H.; Li, Z. Study on deep reinforcement learning techniques for building energy consumption forecasting. Energy Build. 2020, 208, 109675. [Google Scholar] [CrossRef]
  31. Zhou, X.; Lin, W.; Kumar, R.; Cui, P.; Ma, Z. A data-driven strategy using long short term memory models and reinforcement learning to predict building electricity consumption. Appl. Energy 2022, 306, 118078. [Google Scholar] [CrossRef]
  32. Bellman, R. A Markovian decision process. J. Math. Mech. 1957, 6, 679–684. [Google Scholar] [CrossRef]
  33. Littman, M.L. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994; Elsevier: New Brunswick, NJ, USA, 1994; pp. 157–163. [Google Scholar]
  34. Schulman, J.; Wolski, F.; Dhariwal, P.; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347. [Google Scholar] [CrossRef]
  35. Gronauer, S.; Diepold, K. Multi-agent deep reinforcement learning: A survey. Artif. Intell. Rev. 2022, 55, 895–943. [Google Scholar] [CrossRef]
  36. Makridakis, S.; Spiliotis, E.; Assimakopoulos, V. M5 accuracy competition: Results, findings, and conclusions. Int. J. Forecast. 2022, 38, 1346–1364. [Google Scholar] [CrossRef]
  37. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.Y. LightGBM: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3149–3157. [Google Scholar]
Figure 1. A generic architecture for the proposed multi-agent DRL method. In the proposed method, agents share their predicted actions, sampled from their learned distributions, with subsequent agents. In addition, a parameter sharing mechanism enhances training by allowing policies to adapt to the experience of others.
Figure 2. Daily and weekly sales profiles of two randomly selected items in the dataset. Above: FOODS_3_090_CA_3; below: HOUSEHOLD_2_464_WI_3.
Figure 3. Actual and predicted sales of FOODS_3_776 over the 7-day period.
Figure 4. Actual and predicted sales of HOBBIES_1_209 over the 7-day period.
Table 1. Hyperparameter search space.

Method | Parameter            | Line/Grid Values
LSTM   | hidden layers        | {1, 2, 3}
LSTM   | hidden units         | {32, 64}
LSTM   | batch size           | {256, 512}
LGBM   | number of estimators | {100, 300}
LGBM   | learning rate        | {0.01, 0.1, 0.3}
LGBM   | max depth            | {4, 8, 12, -1}
LGBM   | min child samples    | {10, 20, 30}
Ours   | regularization η     | [0.01, 0.1], step of 0.03
Ours   | batch size           | {256, 512}
Table 2. Performance comparison of the models (daily scale).

n   | Index | MLR         | LSTM        | LGBM        | Ours
50  | RMSE  | 2.06 ± 0.86 | 1.92 ± 0.49 | 1.71 ± 0.40 | 1.72 ± 0.33
50  | R²    | 0.35 ± 0.26 | 0.42 ± 0.14 | 0.54 ± 0.09 | 0.54 ± 0.11
300 | RMSE  | 1.94 ± 0.46 | 2.03 ± 0.50 | 1.83 ± 0.36 | 1.88 ± 0.38
300 | R²    | 0.58 ± 0.12 | 0.54 ± 0.11 | 0.61 ± 0.13 | 0.60 ± 0.13
700 | RMSE  | 1.95 ± 0.38 | 2.02 ± 0.45 | 1.84 ± 0.32 | 1.94 ± 0.34
700 | R²    | 0.69 ± 0.09 | 0.67 ± 0.09 | 0.72 ± 0.09 | 0.70 ± 0.09
Table 3. Performance comparison of the models (7-day scale).

n   | Index | MLR         | LSTM        | LGBM        | Ours
50  | RMSE  | 9.22 ± 6.94 | 7.81 ± 3.64 | 5.26 ± 2.62 | 5.74 ± 1.74
50  | R²    | 0.57 ± 0.36 | 0.69 ± 0.18 | 0.85 ± 0.08 | 0.82 ± 0.11
300 | RMSE  | 6.99 ± 2.66 | 7.63 ± 3.03 | 4.99 ± 1.02 | 5.90 ± 1.34
300 | R²    | 0.83 ± 0.06 | 0.81 ± 0.07 | 0.91 ± 0.03 | 0.90 ± 0.05
700 | RMSE  | 6.88 ± 1.9  | 7.74 ± 2.81 | 5.15 ± 0.85 | 6.45 ± 1.17
700 | R²    | 0.89 ± 0.04 | 0.87 ± 0.05 | 0.93 ± 0.02 | 0.92 ± 0.03
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

