Article

Dynamic Resource Games in the Wood Flooring Industry: A Bayesian Learning and Lyapunov Control Framework

by Yuli Wang 1 and Athanasios V. Vasilakos 2,*
1 Forestry Engineering Postdoctoral Research Center, Nanjing Forestry University, No. 159 Longpan Road, Nanjing 210037, China
2 Center for AI Research (CAIR), University of Agder, P.O. Box 422, 4604 Kristiansand, Norway
* Author to whom correspondence should be addressed.
Algorithms 2026, 19(1), 78; https://doi.org/10.3390/a19010078
Submission received: 2 November 2025 / Revised: 8 January 2026 / Accepted: 14 January 2026 / Published: 16 January 2026
(This article belongs to the Section Analysis of Algorithms and Complexity Theory)

Abstract

Wood flooring manufacturers face complex challenges in dynamically allocating resources across multi-channel markets, characterized by channel conflicts, demand uncertainty, and long-term cumulative effects of decisions. Traditional static optimization or myopic approaches struggle to address these intertwined factors, particularly when critical market states like brand reputation and customer base cannot be precisely observed. This paper establishes a systematic and theoretically grounded online decision framework to tackle this problem. We first model the problem as a Partially Observable Stochastic Dynamic Game. The core innovation lies in introducing an unobservable market position vector as the central system state, whose evolution is jointly influenced by firm investments, inter-channel competition, and macroeconomic randomness. The model further captures production lead times, physical inventory dynamics, and saturation/cross-channel effects of marketing investments, constructing a high-fidelity dynamic system. To solve this complex model, we propose a hierarchical online learning and control algorithm named L-BAP (Lyapunov-based Bayesian Approximate Planning), which innovatively integrates three core modules. It employs particle filters for Bayesian inference to nonparametrically estimate latent market states online. Simultaneously, the algorithm constructs a Lyapunov optimization framework that transforms long-term discounted reward objectives into tractable single-period optimization problems through virtual debt queues, while ensuring stability of physical systems like inventory. Finally, the algorithm embeds a game-theoretic module to predict and respond to rational strategic reactions from each channel. We provide theoretical performance analysis, rigorously proving the mean-square boundedness of system queues and deriving the performance gap between long-term rewards and optimal policies under complete information. 
This bound clearly quantifies the trade-off between estimation accuracy (determined by particle count) and optimization parameters. Extensive simulations demonstrate that our L-BAP algorithm significantly outperforms several strong baselines—including myopic learning and decentralized reinforcement learning methods—across multiple dimensions: long-term profitability, inventory risk control, and customer service levels.

1. Introduction

The wood flooring manufacturing industry constitutes a significant component of the global forest product economy. In recent years, facing increasingly diverse consumer demands and intense market competition, numerous wood flooring enterprises have initiated digital transformation to enhance operational efficiency and market responsiveness [1,2]. Within this process, a pervasive and fundamental challenge involves effective resource allocation across multiple concurrent sales channels [3,4,5,6].
To cover different market segments, modern wood flooring companies typically establish diversified channel portfolios. For instance, certain channels specialize in serving designer communities, pursuing high-margin custom orders; others target large-scale engineering projects characterized by substantial order volumes yet relatively limited profit margins; simultaneously, online e-commerce channels directly reach mass consumer markets, with operational focus on brand promotion and rapid inventory turnover [7,8,9,10,11,12,13,14,15,16,17,18]. These channels exhibit fundamental differences in target customers, profit models, and operational rhythms. When competing for the firm’s finite resources—such as marketing budgets, production schedules, and finished goods inventory—their divergent objectives inevitably create internal conflicts, rendering the resource allocation problem particularly complex.
Traditional resource allocation models, such as static planning methods based on historical data, might suffice in stable market environments. However, their limitations become increasingly apparent in contemporary markets characterized by volatile demand fluctuations and high uncertainty [19]. More critically, these conventional approaches often treat the firm as a single decision entity, thereby overlooking potential self-interested strategic behaviors by semi-autonomous business units like sales channels. When different channels take actions most beneficial to themselves to achieve their respective KPIs—for instance, over-requesting resources—conventional centralized optimization models may not only fail but could also lead to resource misallocation resembling the tragedy of the commons, ultimately harming the firm’s overall interests [20,21,22].
To address this challenge, this research aims to answer a core question: How can we design an online resource allocation mechanism that accommodates inter-channel competition while ensuring the firm’s long-term operational stability and objective achievement in dynamically changing market environments?
Toward this end, we propose a hybrid framework integrating Lyapunov optimization with game theory. The design philosophy recognizes channels not as passive instruction executors but as rational participants, employing game theory to characterize their pursuit of short-term profit maximization. To balance the long-term risks inherent in such decentralized decision making, the framework incorporates Lyapunov optimization theory. This theory transforms the firm’s strategic objectives—such as maintaining healthy inventory levels or continuously enhancing brand value—into dynamically trackable virtual queues [23,24].
By combining these two theoretical foundations, our framework establishes a dynamic incentive mechanism. In each decision cycle, the firm’s central planner assigns dynamic weights to channel decision outcomes based on current states of these virtual queues. These weights adjust channels’ perception of short-term profits, thereby guiding their local optimal choices toward the firm’s global long-term objectives without depriving their decision autonomy [24,25].
This work provides a novel analytical perspective for channel resource management within the wood products industry. By integrating stochastic network optimization with game theory, we construct an online decision framework with theoretical performance guarantees. Our research not only offers practical insights for wood flooring industry management but also provides valuable implications for other manufacturing sectors facing similar multi-channel resource conflicts.
The main contributions of this paper are summarized as follows:
  • We formulate a Partially Observable Stochastic Dynamic Game model that endogenizes key intangible assets—such as brand reputation and customer base—as unobservable market states within the context of the wood flooring industry, systematically characterizing their dynamic evolution under firm investments, channel competition, and macroeconomic randomness.
  • We design a hierarchical online learning and control algorithm termed L-BAP, which integrates Bayesian filtering, Lyapunov approximate dynamic programming, and game theory to efficiently combine hidden-state estimation, long-term objective optimization, and inter-channel strategic behavior prediction within a unified framework.
  • We provide rigorous theoretical performance analysis for the L-BAP algorithm. Through Lyapunov function construction, we prove the mean-square boundedness of system queues and derive the performance gap bound between algorithm rewards and optimal policies under complete information, clearly revealing the estimation–optimization trade-off.
  • We validate the proposed framework’s effectiveness through high-fidelity simulations. Experimental results demonstrate that L-BAP significantly outperforms several strong baselines—including myopic learning and decentralized reinforcement learning methods—across multiple dimensions: long-term profitability, inventory risk control, and customer service levels. We further provide ablation and sensitivity analyses to isolate the contributions of Bayesian filtering, Lyapunov planning, and game-response prediction.

2. Related Work

2.1. Traditional Resource Allocation Methods in Manufacturing

Resource allocation in manufacturing represents a persistent research theme in operations research. Mathematical programming approaches constitute the most widely applied toolkit. For instance, linear programming and mixed-integer linear programming are extensively employed for solving production scheduling, capacity planning, and material requirement planning problems [18]. These methods offer advantages in clear model structure and global optimality guarantees. However, they typically rely on accurate predictions of future parameters like demand and costs, which proves challenging in highly uncertain markets. Moreover, these models are inherently static and unsuitable for online decision-making scenarios requiring rapid responses.
To address dynamism and uncertainty, researchers have proposed stochastic programming and robust optimization [19]. These methods directly incorporate uncertainty into models but often incur high computational complexity and require known probability distributions of uncertainties. Another category comprises heuristic algorithms—such as genetic algorithms, simulated annealing, and particle swarm optimization—which can find high-quality approximate solutions within acceptable timeframes but generally lack theoretical performance guarantees [19]. A common limitation across these traditional approaches is their treatment of the firm as a unified decision unit, failing to adequately account for potential interest conflicts and strategic interactions among different internal departments or channels.

2.2. Game Theory Applications in Supply Chain and Channel Management

Game theory provides a theoretical framework for analyzing systems comprising multiple independent decision makers and has found extensive application in supply chain and channel management. Existing work employs Stackelberg games to model pricing and ordering strategies between manufacturers and retailers, examining channel coordination, profit allocation, and service investment while analyzing impacts of information asymmetry and fairness concerns on equilibria [20]. Concurrently, other research investigates competition and cooperation mechanisms in dual-channel and multi-channel contexts, integrating decisions on pricing, inventory, and service [6,7]. Unlike these analyses focusing primarily on static or finite-horizon equilibria, this paper concentrates on mechanism design problems for continuous online decision making under long-term uncertain environments.

2.3. Stochastic Network Optimization and Lyapunov Theory

Lyapunov optimization represents a general methodology for online decision making and long-term constraint handling, systematically developed by Neely for communication and queuing systems [23]. By constructing virtual queues and minimizing a drift-plus-penalty expression, this method transforms long-term objectives and time-varying constraints into deterministic subproblems solvable in each time slot. This enables algorithms to achieve provable performance and stability guarantees without requiring prior statistical knowledge of future information. Recently, Lyapunov optimization has been widely applied to real-time optimization and scheduling in energy systems and microgrids, demonstrating its adaptability to uncertainties and constraints in practice [24]. This suggests its potential applicability to scenarios like manufacturing and channel resource management that involve long-term constraints and short-term disturbances.

3. System Modeling and Problem Formulation

To analyze resource allocation strategies for wood flooring enterprises in dynamic competitive environments, we construct a mathematical model based on state space evolution. The core premise is that a firm’s short-term decisions not only affect immediate returns but, more importantly, continuously alter its long-term market competitive position. Our dynamic system aims to capture persistent marketing effects, physical delays in production logistics, and environmental uncertainties. As shown in Figure 1, the temporal framework is defined as discrete decision periods, t ∈ {0, 1, 2, …}. In the wood flooring industry context, market responses are shaped by user perceptions of product quality, durability, and aesthetic preferences, while manufacturing choices and lifecycle considerations also influence cost and supply-side constraints, motivating the following model elements [15,17].

3.1. Model Notation and Definitions

To ensure clarity and rigor in subsequent discussions, core variables and parameters used in the model are summarized in Table 1.

Data Availability and State Estimability

The variables categorized in Table 1 consist of (i) directly observable operational states, (ii) decision variables, and (iii) latent states or structural parameters. In industrial settings, most operational states are retrieved directly from enterprise information systems; for instance, inventory levels I_i(t) and pipeline orders p_{i,pipe}(t) are tracked via ERP/WMS databases, while realized sales S_i(t) and channel outcomes are logged by POS and e-commerce platforms. Control inputs, including marketing budgets b_i(t) and production orders p_i(t), are internally determined. The channel strategy intensity s_i(t) reflects managerial decisions executed through actionable levers (e.g., promotion intensity or service levels) within predefined constraints.
The market position vector M_i(t) is the only partially observable component, representing intangible assets such as brand goodwill or customer base. To bypass the need for direct measurement, our framework estimates M_i(t) recursively using Bayesian filtering (specifically, particle filtering) conditioned on observable signals, including sales, demand realizations, and the macro-state Z(t). Structural parameters, such as the decay Δ_i and cross-channel competition Γ_ij, are calibrated using historical panel data or updated asynchronously as new observations accumulate. The control policy is inherently robust to moderate estimation uncertainties, a property further validated by the sensitivity analysis in the experimental section. For generalizability and data confidentiality, all quantities in our simulations are normalized: inventory and production are expressed in standardized units, while rewards and costs are reported in consistent dimensionless profit units.

3.2. Core State Variables and System Dynamics

The dynamic characteristics of the model are driven by a set of interconnected state variables that collectively constitute the complete system state S ( t ) .
We introduce a core but partially observable market position vector M_i(t) ∈ R^k to characterize channel i’s long-term intangible assets. Each dimension of this vector represents a specific aspect of market competitiveness, such as brand goodwill, customer loyalty, or market awareness. This vector serves as the slow variable of system dynamics, encapsulating the cumulative effects of all past decisions. In consumer-facing wood flooring markets, preference formation reflects perceived durability, aesthetic, and usage attributes of parquet products, supporting the modeling of market position as a persistent demand driver [15].
Concurrently, we define a directly observable physical inventory level I_i(t), representing the quantity of finished goods available for sale by channel i at the start of period t. Since production is not instantaneous, we must simultaneously track in-process production orders. A vector p_{i,pipe}(t) records production quantities ordered but not yet delivered from periods t − L + 1 to t − 1. In engineered wood flooring production, pressing conditions, adhesive spread, and process parameters influence yield rates and available output, motivating explicit accounting of lead times and pipeline status in downstream planning [16].
Additionally, external market environment randomness is described by a macroeconomic state variable Z(t), assumed to follow a finite-state Markov chain with transition probabilities P(Z′ | Z). This approach more effectively captures the persistent effects of economic cycles than simple random factors. From lifecycle and supply perspectives, exogenous shocks also propagate through manufacturing, installation, and use phases, reinforcing the need to couple demand-side persistence with supply-side costs [17].
Thus, the complete system state in period t can be expressed as
S(t) = ( M(t), I(t), {p_pipe(t)}, Z(t) ).
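As an illustration, the finite-state macro chain Z(t) can be simulated by sampling next states from a transition matrix; the three states and the matrix below are purely hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state macro chain: 0 = recession, 1 = normal, 2 = boom.
# Rows are P(Z' | Z) and each row sums to one.
P = np.array([[0.7, 0.3, 0.0],
              [0.2, 0.6, 0.2],
              [0.0, 0.3, 0.7]])

def step_macro(z, P, rng):
    """Sample the next macro state Z(t+1) ~ P(. | Z(t))."""
    return rng.choice(len(P), p=P[z])

z = 1
path = [z]
for _ in range(10):          # simulate 10 periods of the macro environment
    z = step_macro(z, P, rng)
    path.append(z)
```

Persistence of economic cycles is encoded in the large diagonal entries: the chain tends to stay in its current regime for several periods before switching.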

3.3. State Evolution Equations

System state evolution is governed by a set of stochastic difference equations linking current-period decisions to next-period states.
The market position vector evolution captures cumulative and decaying effects of marketing investments. Drawing from related models in marketing science, we formulate its dynamics as
M_i(t+1) = (I − Δ_i) M_i(t) + f_g(b_i(t), s_i(t)) − Σ_{j≠i} Γ_ij f_g(b_j(t), s_j(t)) + ε_i(t)
where I denotes the identity matrix, while the diagonal matrix Δ_i represents natural decay or forgetting effects of market position. The nonlinear vector function f_g(·) characterizes the constructive effect of channel i’s own investments on the various dimensions of its market position, potentially exhibiting saturation properties. The matrix Γ_ij quantifies cross-channel erosion effects from channel j’s marketing activities on channel i’s market position. ε_i(t) represents zero-mean stochastic process noise. In wood flooring applications, f_g(·) can be interpreted as the gradual accumulation of perceived quality and brand recognition for specific parquet structures and finishes, consistent with observed preference structures [15].
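A minimal sketch of this update, assuming a logarithmic saturating gain for f_g, scalar decay rates in place of the diagonal matrices Δ_i, and scalar erosion weights for Γ_ij (all illustrative simplifications, not the paper's calibrated forms):

```python
import numpy as np

rng = np.random.default_rng(0)

def f_g(b, s, k=2):
    """Hypothetical saturating gain: concave (log1p) in total investment."""
    return np.full(k, np.log1p(b + s))

def update_market_position(M, b, s, Delta, Gamma, rng, k=2):
    """One step of the market-position dynamics:
    decay + own investment gain - cross-channel erosion + noise.
    Delta[i] and Gamma[i][j] are illustrative scalars."""
    M_next = {}
    for i in M:
        erosion = sum(Gamma[i][j] * f_g(b[j], s[j], k) for j in M if j != i)
        noise = rng.normal(0.0, 0.05, size=k)
        M_next[i] = (1.0 - Delta[i]) * M[i] + f_g(b[i], s[i], k) - erosion + noise
    return M_next

M = {0: np.ones(2), 1: np.ones(2)}
M1 = update_market_position(M, b={0: 3.0, 1: 1.0}, s={0: 1.0, 1: 1.0},
                            Delta={0: 0.1, 1: 0.1},
                            Gamma={0: {1: 0.2}, 1: {0: 0.2}}, rng=rng)
```

With identical decay and erosion weights, the channel investing more (channel 0 here) accumulates the stronger position, reflecting the cumulative "slow variable" role of M_i(t).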
Physical inventory evolution follows strict material balance principles. Given production lead time L, production orders p_i(t) placed in period t arrive at the start of period t + L. Consequently, inventory levels update as
I_i(t+1) = I_i(t) − S_i(t) + p_i(t − L + 1)
where S_i(t) denotes the actual sales volume in period t, defined below. In engineered wood flooring production lines, process parameters and bonding quality may influence scrap rates and effective availability, making explicit pipeline order tracking alongside on-hand inventory practically relevant [16].
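The material balance with lead time L can be implemented with a FIFO pipeline of in-transit orders; the sketch below assumes L = 3 and illustrative quantities.

```python
from collections import deque

def step_inventory(I, pipeline, sales, new_order):
    """Material balance I(t+1) = I(t) - S(t) + p(t-L+1) with a FIFO pipeline.
    pipeline: deque of in-transit orders, oldest first, holding L-1 entries."""
    arriving = pipeline.popleft() if pipeline else 0.0  # the order placed L periods ago
    pipeline.append(new_order)                          # today's order joins the queue
    return I - sales + arriving, pipeline

# Lead time L = 3: two periods of orders are always in transit.
pipe = deque([5.0, 7.0])
I = 20.0
I, pipe = step_inventory(I, pipe, sales=6.0, new_order=8.0)  # 20 - 6 + 5 = 19
```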

3.4. Demand, Sales, and Profit Functions

Within the state evolution framework, market demand, sales, and profit calculations become more refined.
Market demand d i ( t ) is no longer an instantaneous product of short-term inputs but is jointly determined by underlying market position and macroeconomic environment. Current-period marketing activities further stimulate demand based on established market position. A demand function reflecting this mechanism can be constructed as
d_i(t) = [ θ_i^T M_i(t) + φ_i^T Z(t) ] · ln( 1 + β_i b_i(t) + ζ_i s_i(t) ) · exp( −Σ_{j≠i} ψ_ij s_j(t) )
where the term θ_i^T M_i(t) represents the baseline demand potential determined by market position, while φ_i^T Z(t) reflects macroeconomic influences. The logarithmic function captures the diminishing marginal utility of short-term inputs, while the exponential term characterizes competitive pressure from other channels’ strategy intensities. This specification aligns with observations in parquet flooring markets where perceived product quality shapes purchase intentions and evolves with marketing communications and channel competition [15].
The actual sales volume S_i(t) in period t is constrained by the current inventory and market demand:
S_i(t) = min( I_i(t), d_i(t) )
Having defined sales volume, we can compute the system’s single-period total reward R(t). This equals the sum of all channels’ sales profits minus relevant costs, including strategy intensity-related marketing costs c_s(·), production volume-related costs C_prod(·), and inventory level-related holding costs C_inv(·).
R( S(t), b(t), p(t), s(t) ) = Σ_{i∈N} [ π_i S_i(t) − c_{s,i}(s_i(t)) ] − C_prod(p(t)) − C_inv(I(t))
We assume cost functions are nonlinear and convex to reflect economies of scale or marginal cost variations. In wood flooring enterprises, lifecycle and process assessments indicate that manufacturing choices, energy, and process parameters contribute nonlinearly to unit costs and environmental burdens, supporting convex modeling of C_prod(·) and the explicit role of inventory and pipeline decisions in C_inv(·) [16,17].
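The demand, sales, and reward pipeline can be sketched as follows; every parameter value here is chosen purely for illustration, not taken from the paper's calibration.

```python
import numpy as np

def demand(theta, M, phi, z_vec, beta, b, zeta, s, psi, s_others):
    """Demand: base potential x log-saturated short-term boost x competitive damping."""
    base = theta @ M + phi @ z_vec
    boost = np.log1p(beta * b + zeta * s)
    damping = np.exp(-sum(p_j * s_j for p_j, s_j in zip(psi, s_others)))
    return base * boost * damping

def period_reward(pi, S, mkt_cost, prod_cost, inv_cost):
    """Single-period reward: channel margins minus marketing, production, holding costs."""
    return sum(p * s - c for p, s, c in zip(pi, S, mkt_cost)) - prod_cost - inv_cost

d = demand(np.array([1.0, 0.5]), np.array([2.0, 1.0]),   # theta_i, M_i(t)
           np.array([0.3]), np.array([1.0]),              # phi_i, Z(t)
           beta=0.8, b=5.0, zeta=0.4, s=1.0,
           psi=[0.1], s_others=[2.0])
S = min(10.0, d)   # sales capped by on-hand inventory I_i(t) = 10
```

With inventory ample relative to demand, S equals d; once inventory binds, the min(·) cap makes sales (and hence revenue) inventory-limited, which is exactly why the controller must track I_i(t).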

3.5. Problem Formulation: A Partially Observable Stochastic Dynamic Game

Based on the above setup, we formulate the original problem as a more profound and realistic Partially Observable Stochastic Dynamic Game (POSDG).
The core complexity stems from information asymmetry. The firm and channels can precisely observe physical states like inventory I_i(t) and pipeline orders p_pipe(t), along with the macroeconomic state Z(t). However, the market position vector M_i(t), being an intangible asset, cannot be directly observed. Decision-makers can only form a belief distribution or estimate M̂(t) of the current market position based on the history of all observable information H(t) = { I(τ), b(τ), p(τ), s(τ), S(τ), Z(τ) }_{τ=0}^{t−1}. This structure aligns with wood flooring settings where user perceptions, competitive exposure, and process yields co-evolve over time, requiring estimation from partial observations [15,16,17].
Under this information structure, the firm’s objective is to design and execute an optimal policy P*—a function mapping the observable history H(t) to current decisions (b(t), p(t))—that maximizes the expected total discounted reward over an infinite horizon:
P* = arg max_P V^P(H_0) = E_P[ Σ_{t=0}^{∞} δ^t R( S(t), b(t), p(t), s(t) ) | H_0 ]
where the expectation E_P is taken over all sources of randomness—market position evolution noise ε(t) and macroeconomic state Z(t) transitions—under policy P and the channels’ collective strategies.
Simultaneously, each channel i aims to find its own optimal policy P_i*, mapping its information set to a sales strategy s_i(t) that maximizes its individual expected total discounted reward.
This problem constitutes a high-dimensional, non-convex, dynamic, stochastic, partially observable multi-agent decision problem. Such problems admit no simple analytical solutions and require advanced methods from approximate dynamic programming, reinforcement learning (particularly multi-agent reinforcement learning), or stochastic control. Therefore, to solve this problem, we design an innovative algorithm in the following chapter.

4. Online Learning and Control Algorithm Design

The Partially Observable Stochastic Dynamic Game (POSDG) problem formulated in the previous section presents significant computational challenges due to its high-dimensional state space, non-convex objective function, incomplete information structure, and dynamic multi-agent interactions. Traditional optimization methods prove inadequate for direct solution. To address this challenge, we design a hierarchical online learning and control algorithm. As shown in Figure 2, the core insight involves decomposing the original problem into two coupled subproblems: online Bayesian belief updating to handle state partial observability, and approximate dynamic programming decision making based on updated beliefs to solve the firm’s resource allocation and production planning.

4.1. Algorithm Overview: Hierarchical Bayesian Approximate Dynamic Programming Framework

Our proposed algorithm sequentially executes three core modules in each decision period t. The process begins with a belief update module that treats the observed sales outcomes S(t−1) from the previous period as new evidence, employing nonlinear filtering techniques—specifically particle filtering—to update the firm’s posterior probability distribution (belief state) over the current unobservable market position vector M(t). Next, a channel strategy prediction module, given potential firm actions (b(t), p(t)) and the current belief state, predicts the outcome of the non-cooperative game among rational channels to estimate their approximate Nash equilibrium strategies s*(t). Finally, a firm decision module, building on the updated belief state and predicted channel strategies, utilizes Lyapunov optimization to construct a single-period approximate optimization problem. This formulation transforms the complex long-term discounted reward objective into a more tractable deterministic problem that balances short-term rewards with long-term system stability, yielding the optimal current-period actions (b*(t), p*(t)).
This hierarchical architecture effectively decouples learning (latent state inference) from control (current action optimization) tasks, enabling feasible online solution.

4.2. Bayesian Estimation of Market Position via Particle Filtering

Given the highly nonlinear nature of the market position state evolution (Equation (1)) and the potentially non-Gaussian distribution of the stochastic disturbances ε_i(t), traditional Kalman filters prove inadequate. We employ particle filtering, a robust nonparametric Bayesian filtering method, for online estimation of the market position vector M(t). In each period t, the firm maintains a set of N_p weighted particles { M^(j)(t), w^(j)(t) }_{j=1}^{N_p}, which collectively approximate the posterior probability distribution of M(t). Algorithm 1 details the procedural steps.
Algorithm 1 Particle Filter Update for Market Position Vector
1: Input: previous-period particle set { M^(j)(t−1) }_{j=1}^{N_p}, previous-period decisions (b(t−1), s(t−1)), previous-period actual sales S(t−1).
2: Output: current-period particle set { M^(j)(t) }_{j=1}^{N_p}, expected estimate of the current market position M̂(t).
3: for j = 1, …, N_p do
4:   Prediction (Propagation):
5:   Propagate each particle through the state evolution equation to simulate its stochastic evolution:
6:   M̃^(j)(t) = (I − Δ) M^(j)(t−1) + f_g(b(t−1), s(t−1)) − f_c(b_i(t−1), s_i(t−1)) + ε(t−1)
7:   Update:
8:   (Note: this step logically assumes that S(t−1) constitutes an observation of the state M(t−1).)
9:   Compute the expected demand d_i^(j)(t−1) based on the previous-period particle M^(j)(t−1) and the observable macroeconomic state Z(t−1).
10:  Compute the likelihood of observing the actual sales S_i(t−1) given the expected demand d_i^(j)(t−1) and the previous-period inventory I_i(t−1). Assuming Gaussian observation noise, the weight is
11:  w^(j)(t) ∝ exp( −(1 / 2σ_S²) Σ_{i∈N} ( S_i(t−1) − min( I_i(t−1), d_i^(j)(t−1) ) )² )
12: end for
13: Normalize the weights so that Σ_{j=1}^{N_p} w^(j)(t) = 1.
14: Resampling: draw N_p particles with replacement from the propagated particle set { M̃^(j)(t) } according to the weights { w^(j)(t) }, forming the new unweighted particle set { M^(j)(t) }_{j=1}^{N_p}.
15: Compute the expected estimate: M̂(t) = (1/N_p) Σ_{j=1}^{N_p} M^(j)(t).
(Note: Algorithm 1 follows the original "propagation-update-resampling" flow, in which the likelihood of S(t−1) is used to weight the propagated particles M̃(t); this raises a causal concern, since a standard SIR filter would first weight and resample M(t−1) using S(t−1) and only then propagate. To maintain fidelity to the original structure, we preserve this flow while clarifying in line 10 that the likelihood is computed from the previous-period particles.)
Through this process, the firm effectively incorporates each period’s latest market sales information into its understanding of core intangible assets (market position), providing crucial input for subsequent precise decision making.
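A compact single-channel sketch of Algorithm 1, following the propagation-update-resampling flow described above (the likelihood is evaluated on the previous-period particles, and the propagated particles are resampled). The `model` hooks `propagate` and `demand` are assumed interfaces standing in for the state and demand equations, not part of the paper:

```python
import numpy as np

def pf_update(particles, b_prev, s_prev, S_prev, I_prev, z_prev, model, rng, sigma_S=1.0):
    """One cycle of Algorithm 1 for a single channel.
    particles: (Np, k) array approximating the posterior over M(t-1).
    model.propagate(M, b, s, rng) and model.demand(M, z) are assumed hooks."""
    Np = len(particles)
    # 1. Prediction: push every particle through the state equation.
    propagated = np.stack([model.propagate(m, b_prev, s_prev, rng) for m in particles])
    # 2. Update: weight by the likelihood of the observed sales S(t-1), computed
    #    from the PREVIOUS-period particles (Algorithm 1, line 9), with Gaussian
    #    noise around min(inventory, demand).
    d = np.array([model.demand(m, z_prev) for m in particles])
    resid = S_prev - np.minimum(I_prev, d)
    logw = -0.5 * resid**2 / sigma_S**2
    w = np.exp(logw - logw.max())
    w /= w.sum()                               # normalized weights
    # 3. Resampling: draw Np propagated particles with replacement.
    idx = rng.choice(Np, size=Np, p=w)
    new_particles = propagated[idx]
    return new_particles, new_particles.mean(axis=0)   # posterior set and M_hat(t)
```

The log-sum-exp shift (`logw - logw.max()`) keeps the weights numerically stable when residuals are large relative to σ_S.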

4.3. Lyapunov Approximate Dynamic Programming Framework

To handle the long-term discounted reward objective while maintaining stability of physical systems (e.g., inventory), we construct a Lyapunov optimization framework. Beyond physical queues (inventory), we introduce a “Virtual Debt Queue” Y ( t ) specifically designed to transform the discounted reward problem into an equivalent problem manageable within the Lyapunov framework.
We define physical and virtual queues as follows. The physical inventory queue I_i(t) itself requires management; we aim to avoid both excessive levels (incurring holding costs) and insufficient levels (causing stockouts). The virtual debt queue Y(t) evolves as Y(t+1) = δ Y(t) − R(t). It can be shown that stabilizing this queue under specific conditions is equivalent to maximizing the long-term discounted reward Σ_t δ^t R(t).
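Unrolling the recursion makes the connection explicit: starting from Y(0) = 0, after T steps Y(T) = −Σ_{t<T} δ^{T−1−t} R(t), a (reverse-)discounted sum of rewards, so bounding Y controls the discounted reward accumulation. A quick numerical check with illustrative rewards:

```python
# Sketch: unroll Y(t+1) = delta * Y(t) - R(t) and compare with the closed form.
delta = 0.9
R = [1.0, 2.0, 0.5, 1.5]   # illustrative per-period rewards

Y = 0.0
for r in R:
    Y = delta * Y - r      # virtual debt queue update

# Closed form of the unrolled recursion: Y(T) = -sum_t delta^(T-1-t) R(t)
T = len(R)
closed = -sum(delta**(T - 1 - t) * R[t] for t in range(T))
```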
We define a Lyapunov function incorporating quadratic penalties for both physical inventory and virtual debt:
L(Θ(t)) = (1/2) Σ_{i∈N} I_i(t)² + (1/2) Y(t)²
where Θ(t) = ( I(t), Y(t) ) denotes the vector of queues requiring generalized stabilization. The single-step Lyapunov drift is defined as Δ(Θ(t)) = E[ L(Θ(t+1)) − L(Θ(t)) | H(t) ].
Our objective in each period t is to select actions (b(t), p(t), s(t)) that minimize a modified “drift-plus-penalty” expression. Unlike standard approaches, to explicitly penalize excessive inventory levels while optimizing rewards, we define the penalty term as E[ −V·R(t) + Σ_{i∈N} I_i(t)² | H(t) ]. Consequently, we aim to minimize
Δ(Θ(t)) + E[ −V·R(t) + Σ_{i∈N} I_i(t)² | H(t) ]
where V is a positive constant balancing reward maximization against queue stability.
Through derivation, we demonstrate that minimizing an upper bound of this expression equates to solving the following deterministic optimization problem in each period t:
max_{b(t), p(t), s(t)}  E_{M̂(t)}[ (V + δ Y(t)) R(t) + Σ_{i∈N} I_i(t) ( S_i(t) − p_i(t−L+1) ) − Σ_{i∈N} I_i(t)² ]
s.t.  Σ_{i∈N} b_i(t) ≤ B(t),  Σ_{i∈N} p_i(t) ≤ P(t)
The expectation E_{M̂(t)}[·] is approximated via sample averaging over the posterior particle set obtained from the particle filter.
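This sample-average approximation can be sketched as follows; `sales_fn` and `reward_fn` are hypothetical hooks mapping a particle to per-channel sales and the scalar reward, and all numeric values are illustrative:

```python
import numpy as np

def objective_estimate(particles, V, delta, Y, I, p_arr, sales_fn, reward_fn):
    """Sample-average approximation of the per-period objective:
    E_{M_hat}[ (V + delta*Y) R(t) + sum_i I_i (S_i - p_i,arr) - sum_i I_i^2 ],
    averaged over the posterior particle set."""
    vals = []
    for m in particles:
        S = np.asarray(sales_fn(m))
        vals.append((V + delta * Y) * reward_fn(m)
                    + float(np.dot(I, S - p_arr))   # inventory drift term
                    - float(np.dot(I, I)))          # quadratic inventory penalty
    return float(np.mean(vals))

# Toy check with deterministic particles.
particles = [np.array([1.0]), np.array([1.0])]
est = objective_estimate(particles, V=10.0, delta=0.9, Y=0.0,
                         I=np.array([2.0]), p_arr=np.array([0.5]),
                         sales_fn=lambda m: [m[0]], reward_fn=lambda m: m[0])
```

More particles tighten the Monte Carlo estimate of the expectation, which is exactly the estimation-optimization trade-off (particle count vs. per-period compute) quantified in the theoretical analysis.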

4.4. Solving Hierarchical Decision Subproblems

Problem (9) remains complex due to the coupling between firm and channel decisions. We design a hierarchical iterative solution approach. First, given a firm resource allocation (b(t), p(t)) and belief M̂(t), the firm predicts the approximate Nash equilibrium s*(b, p) of the inter-channel game. Given non-convex channel utility functions, we employ an iterative best response algorithm and take a fixed point as the equilibrium approximation.
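A minimal sketch of the iterative best response step over a discretized strategy grid, with toy concave utilities standing in for the channels' actual payoffs (the quadratic forms and coupling weight 0.1 are purely illustrative):

```python
def best_response_equilibrium(utilities, s0, grid, max_iter=50, tol=1e-6):
    """Iterative best response: each channel in turn plays its best reply on the
    grid, holding the others fixed; a fixed point is returned as the
    approximate Nash equilibrium. utilities[i](s_i, s_others) -> payoff."""
    s = list(s0)
    for _ in range(max_iter):
        s_new = list(s)
        for i, u in enumerate(utilities):
            others = [s_new[j] for j in range(len(s_new)) if j != i]
            s_new[i] = max(grid, key=lambda x: u(x, others))  # best reply of channel i
        if max(abs(a - b) for a, b in zip(s, s_new)) < tol:
            return s_new                                      # fixed point reached
        s = s_new
    return s

# Two channels with concave payoffs and mild competitive coupling.
u0 = lambda x, o: -(x - 2.0)**2 - 0.1 * x * o[0]
u1 = lambda x, o: -(x - 3.0)**2 - 0.1 * x * o[0]
grid = [g / 10 for g in range(0, 51)]     # strategy intensities 0.0 .. 5.0
eq = best_response_equilibrium([u0, u1], [0.0, 0.0], grid)
```

With concave payoffs the iteration converges in a few sweeps; with the non-convex utilities of the full model, the fixed point found is only a local equilibrium approximation, as noted above.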
After substituting the predicted equilibrium strategies s*(b, p), the firm’s objective function depends solely on its own decisions (b, p). However, this function remains highly complex and potentially non-convex. We adopt a stochastic gradient-based optimization method (e.g., Adam or RMSprop). In each iteration, random sampling from the particle set provides unbiased estimates of the objective gradient, guiding the decision updates. Algorithm 2 provides pseudo-code for the single-period decision process.
Algorithm 2 Hierarchical Online Learning and Control Algorithm (Period $t$)
1: Belief Update: Invoke Algorithm 1 using $H(t-1)$ and $S(t-1)$ to compute the current belief particle set $\{M^{(j)}(t)\}$ and expected estimate $\hat{M}(t)$.
2: Initialize Firm Decisions: $b_0(t) = b(t-1)$, $p_0(t) = p(t-1)$.
3: for $k = 0, \dots, K-1$ do
4:  Channel Equilibrium Prediction:
5:  For current decisions $(b_k(t), p_k(t))$, compute the approximate Nash equilibrium $s_k^*(t)$ of the inter-channel game via iterative best response.
6:  Gradient Estimation:
7:  Randomly sample a mini-batch from the particle set $\{M^{(j)}(t)\}$.
8:  For each sample, compute the gradient $\nabla_{b,p} G_j$ of the objective function in (9) at decisions $(b_k(t), p_k(t), s_k^*(t))$.
9:  Compute the average gradient: $\bar{\nabla} G = \frac{1}{|\text{batch}|} \sum_j \nabla_{b,p} G_j$.
10:  Decision Update:
11:  Update decisions using Adam or a similar optimizer: $(b_{k+1}(t), p_{k+1}(t)) = \text{AdamUpdate}(b_k(t), p_k(t), \bar{\nabla} G)$.
12: end for
13: Execute Final Decisions: $(b(t), p(t)) = (b_K(t), p_K(t))$.
14: The firm places production orders $p_i(t)$ and allocates budgets $b_i(t)$. Channels execute the approximate equilibrium strategies $s_i^*$ based on these decisions.
15: The system evolves to the next state; the current period's sales $S(t)$ are observed for the subsequent belief update.
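The inner loop of steps 6–11 — mini-batch gradient estimation over particles, an Adam update, and re-projection onto the budget constraint — can be sketched as follows (a toy concave objective stands in for (9); all names and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def adam_ascent(grad_sample, particles, b0, B, steps=400, lr=0.05,
                beta1=0.9, beta2=0.999, eps=1e-8, batch=32):
    """Stochastic gradient ascent with Adam, mirroring Algorithm 2's inner loop:
    sample a mini-batch of particles, average the per-particle gradients,
    take one Adam step, and re-project onto sum(b) <= B, b >= 0."""
    b = np.array(b0, dtype=float)
    m = np.zeros_like(b)
    v = np.zeros_like(b)
    for k in range(1, steps + 1):
        idx = rng.integers(0, len(particles), size=batch)
        g = np.mean([grad_sample(b, particles[j]) for j in idx], axis=0)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        m_hat = m / (1 - beta1 ** k)
        v_hat = v / (1 - beta2 ** k)
        b = b + lr * m_hat / (np.sqrt(v_hat) + eps)  # ascent step
        b = np.clip(b, 0.0, None)
        if b.sum() > B:                               # budget projection
            b *= B / b.sum()
    return b

# Toy objective per particle: G_j(b) = -||b - m_j||^2, so grad = -2 (b - m_j);
# its maximizer is the particle mean whenever that mean satisfies the budget.
particles = rng.normal(loc=(1.0, 2.0), scale=0.1, size=(300, 2))
b_opt = adam_ascent(lambda b, mj: -2.0 * (b - mj), particles,
                    b0=(0.0, 0.0), B=10.0)
```

The averaged mini-batch gradient is an unbiased estimate of the gradient of the particle-averaged objective, which is the property the firm decision module relies on.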

4.5. Theoretical Performance Analysis

This section provides rigorous theoretical performance analysis for the proposed hierarchical online learning and control algorithm. Given the algorithm’s integration of Bayesian filtering, approximate dynamic programming, and game theory, its analysis presents significant complexity. Our objective is to demonstrate that, under reasonable mathematical assumptions, the algorithm guarantees stability of the overall dynamic system, and its long-term performance can approach an idealized theoretical optimum, with quantifiable and controllable performance gaps.

4.5.1. Technical Assumptions for Analysis

To ensure analytical rigor, we introduce a series of relatively standard technical assumptions in stochastic control and learning theory.
Assumption 1 (Boundedness).
All states, actions, rewards, and stochastic processes in the system are assumed bounded. Specifically, there exist positive constants $M_{\max}, I_{\max}, p_{\max}, b_{\max}, s_{\max}, R_{\max}, \epsilon_{\max}$ such that for all time $t$ and all realizations, $\|M(t)\| \le M_{\max}$, $I_i(t) \le I_{\max}$, $p_i(t) \le p_{\max}$, $b_i(t) \le b_{\max}$, $s_i(t) \le s_{\max}$, the single-period reward satisfies $|R(t)| \le R_{\max}$, and the stochastic disturbances satisfy $\|\epsilon(t)\| \le \epsilon_{\max}$.
Assumption 2 (Lipschitz Continuity).
All system functions—including the state evolution functions $f_g, f_c$ and the reward function $R(\cdot)$—are assumed Lipschitz continuous in all their arguments. For example, for the reward function, there exists a constant $L_R$ such that for any two state–action pairs $(S, a)$ and $(S', a')$, $|R(S, a) - R(S', a')| \le L_R (\|S - S'\| + \|a - a'\|)$.
Assumption 3 (Subproblem Solution Accuracy).
We assume the optimization loops within Algorithm 2 find approximate solutions. Specifically, the solution $(b(t), p(t))$ found by the firm decision module in period $t$ has a bounded optimization gap relative to the true optimum of the deterministic problem (9). Simultaneously, the difference between the approximate Nash equilibrium $s^*(t)$ found by the channel equilibrium prediction module and the true equilibrium has a bounded impact on the system reward function. We denote the combined upper bound of these subproblem solution errors on the single-period drift-plus-penalty objective by $\epsilon_{sub}$.
Assumption 4 (Particle Filter Performance).
Following standard particle filter theory, we assume that the mean squared error of the hidden-state estimate of $M(t)$ is controlled by the particle count $N_p$. There exists a positive constant $C_{PF}$ such that $\mathbb{E}[\|M(t) - \hat{M}(t)\|^2] \le C_{PF}/N_p$, where $\hat{M}(t)$ is the particle filter's expected estimate.
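The $C_{PF}/N_p$ rate in Assumption 4 can be illustrated with a plain Monte Carlo estimate (a sketch, not a full particle filter; the Gaussian latent state and the constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)

def mse_of_particle_mean(n_particles, n_trials=4000, sigma=1.0):
    """Empirical mean-squared error of a Monte Carlo particle mean estimating
    a latent scalar state M = 0. Standard theory gives MSE = sigma^2 / N_p,
    i.e. exactly the C_PF / N_p rate assumed for the filter."""
    samples = rng.normal(0.0, sigma, size=(n_trials, n_particles))
    estimates = samples.mean(axis=1)        # one particle-mean estimate per trial
    return float(np.mean(estimates ** 2))   # average squared estimation error

mse_50 = mse_of_particle_mean(50)    # theory: 1/50
mse_200 = mse_of_particle_mean(200)  # theory: 1/200
```

Quadrupling the particle count cuts the squared error by roughly a factor of four, matching the assumed rate.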

4.5.2. Lyapunov Drift Upper Bound Lemma

Lyapunov analysis forms the core of our theoretical proof. We first derive a rigorous upper bound for the single-step Lyapunov drift, connecting expected queue changes with single-period system rewards.
Lemma 1
(Lyapunov Drift Upper Bound). Under Assumptions 1 and 2, for any given observable history $H(t)$ and any feasible policy taken under this history, the single-step Lyapunov drift $\Delta(\Theta(t)) = \mathbb{E}[L(\Theta(t+1)) - L(\Theta(t)) \mid H(t)]$ satisfies
$$\Delta(\Theta(t)) \le C_0 - \mathbb{E}\Big[ \frac{1-\delta^2}{2} Y(t)^2 + \delta Y(t) R(t) + \sum_{i \in N} I_i(t)\big(S_i(t) - p_i(t-L+1)\big) \,\Big|\, H(t) \Big] \tag{10}$$
where C 0 is a positive constant depending only on system parameters and bounds from the assumptions.
Proof. 
We begin from the Lyapunov function definition, examining the expected single-step changes of its components. First, consider the physical inventory queue $I_i(t)$. From its evolution equation,
$$\begin{aligned} \frac{1}{2}\mathbb{E}[I_i(t+1)^2 - I_i(t)^2 \mid H(t)] &= \frac{1}{2}\mathbb{E}\big[(I_i(t) - S_i(t) + p_i(t-L+1))^2 - I_i(t)^2 \,\big|\, H(t)\big] \\ &= \frac{1}{2}\mathbb{E}\big[-2 I_i(t)\big(S_i(t) - p_i(t-L+1)\big) + \big(S_i(t) - p_i(t-L+1)\big)^2 \,\big|\, H(t)\big] \\ &= -\mathbb{E}[I_i(t)\big(S_i(t) - p_i(t-L+1)\big) \mid H(t)] + \frac{1}{2}\mathbb{E}[\big(S_i(t) - p_i(t-L+1)\big)^2 \mid H(t)] \end{aligned}$$
By Assumption 1, sales $S_i(t)$ and production $p_i(t-L+1)$ are bounded. Thus, the quadratic term $(S_i(t) - p_i(t-L+1))^2$ has bounded expectation, and summing it over all channels yields a constant $C_I$.
Next, consider the virtual debt queue $Y(t)$, evolving as $Y(t+1) = \delta Y(t) - R(t)$. Thus,
$$\begin{aligned} \frac{1}{2}\mathbb{E}[Y(t+1)^2 - Y(t)^2 \mid H(t)] &= \frac{1}{2}\mathbb{E}\big[(\delta Y(t) - R(t))^2 - Y(t)^2 \,\big|\, H(t)\big] \\ &= \frac{1}{2}\mathbb{E}\big[(\delta^2 - 1) Y(t)^2 - 2\delta Y(t) R(t) + R(t)^2 \,\big|\, H(t)\big] \\ &\le -\mathbb{E}\Big[ \frac{1-\delta^2}{2} Y(t)^2 + \delta Y(t) R(t) \,\Big|\, H(t) \Big] + \frac{1}{2} R_{\max}^2 \end{aligned}$$
Here, we used $\delta \in (0,1)$ and bounded $R(t)^2$ by $R_{\max}^2$.
Summing both parts yields an upper bound for the total drift $\Delta(\Theta(t))$. Combining all constant terms into $C_0 = C_I + \frac{1}{2} R_{\max}^2$ completes the proof. □
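Both algebraic steps in the proof are easy to check numerically (a sanity-check sketch; `delta` and `R_max` are illustrative values, not the calibrated model parameters):

```python
import numpy as np

rng = np.random.default_rng(3)
delta, R_max = 0.98, 5.0  # illustrative values for the check

identity_holds, bound_holds = True, True
for _ in range(10000):
    I, S, p = rng.uniform(0, 50), rng.uniform(0, 10), rng.uniform(0, 10)
    Y = rng.uniform(-100, 100)
    R = rng.uniform(-R_max, R_max)

    # Inventory part: the expansion used in the proof is an exact identity.
    lhs_I = 0.5 * ((I - S + p) ** 2 - I ** 2)
    rhs_I = -I * (S - p) + 0.5 * (S - p) ** 2
    identity_holds &= bool(abs(lhs_I - rhs_I) < 1e-8)

    # Debt part: the inequality after replacing R(t)^2 with R_max^2.
    lhs_Y = 0.5 * ((delta * Y - R) ** 2 - Y ** 2)
    rhs_Y = -(1 - delta ** 2) / 2 * Y ** 2 - delta * Y * R + 0.5 * R_max ** 2
    bound_holds &= bool(lhs_Y <= rhs_Y + 1e-8)
```

The inventory step is an equality (so it holds for any sign of $S_i - p_i$), while the debt step is a strict bound obtained only from $|R(t)| \le R_{\max}$.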

4.5.3. Performance Analysis of POSDG

We now connect the drift upper bound with Algorithm 2's decision rule. As established, the algorithm aims to minimize $\Delta(\Theta(t)) + \mathbb{E}[-V R(t) + \sum_i I_i(t)^2 \mid H(t)]$. Substituting the bound (10) from Lemma 1 into this expression:
$$\Delta(\Theta(t)) + \mathbb{E}\Big[ -V R(t) + \sum_i I_i(t)^2 \,\Big|\, H(t) \Big] \le C_0 - \mathbb{E}\Big[ \frac{1-\delta^2}{2} Y(t)^2 \,\Big|\, H(t) \Big] - \mathbb{E}\Big[ (V + \delta Y(t)) R(t) + \sum_{i \in N} I_i(t)\big(S_i(t) - p_i(t-L+1)\big) - \sum_{i \in N} I_i(t)^2 \,\Big|\, H(t) \Big] \tag{11}$$
Algorithm 2 minimizes the upper bound of the entire drift-plus-penalty expression precisely by maximizing the second $\mathbb{E}[\cdot]$ term above (i.e., the objective function in (9)).
For notational simplicity, let $a^{Alg}(t)$ denote the actions taken by our algorithm in period $t$, and $a^*(t)$ the actions taken by an ideal, fully informed (observing $M(t)$) policy that perfectly solves the subproblems. Let $G(a, \Theta, M)$ represent the objective function in (9).
Since our algorithm maximizes $G$ based on the belief $\hat{M}(t)$, its solution $a^{Alg}(t)$ satisfies (accounting for estimation and solution errors)
$$\mathbb{E}_{\hat{M}(t)}[G(a^{Alg}(t))] \ge \mathbb{E}_{\hat{M}(t)}[G(a^*(t))] - \epsilon_{sub}$$
Converting belief-based decisions to evaluation under the true state $M(t)$ introduces a performance loss due to estimation error. By Assumptions 2 and 4 (together with Jensen's inequality), this error's expectation is bounded:
$$\big| \mathbb{E}[G(a, \Theta, M)] - \mathbb{E}[G(a, \Theta, \hat{M})] \big| \le L_G \, \mathbb{E}[\|M(t) - \hat{M}(t)\|] \le L_G \sqrt{C_{PF}/N_p}$$
where $L_G$ is the Lipschitz constant of the objective function $G$. Substituting this relationship back into the drift-plus-penalty inequality (11) yields
$$\Delta(\Theta(t)) + \mathbb{E}\Big[ -V R^{Alg}(t) + \sum_i I_i(t)^2 \Big] \le C_0 - \mathbb{E}[G(a^*(t))] + \epsilon_{sub} + 2 L_G \sqrt{C_{PF}/N_p} \tag{12}$$
where $R^{Alg}(t)$ is the reward actually obtained by the algorithm. Taking the total expectation of inequality (12), summing over $t = 0, \dots, T-1$, and dividing by $T$, the term $\frac{\mathbb{E}[L(\Theta(T))] - \mathbb{E}[L(\Theta(0))]}{T}$ vanishes as $T \to \infty$. Through algebraic manipulation and rearrangement of the reward terms, we ultimately establish the gap between our algorithm's long-term average reward and the optimal policy's reward.
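For completeness, the telescoping step behind this limit reads:

```latex
\frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}[\Delta(\Theta(t))]
  = \frac{\mathbb{E}[L(\Theta(T))] - \mathbb{E}[L(\Theta(0))]}{T}
  \ge -\frac{\mathbb{E}[L(\Theta(0))]}{T}
  \xrightarrow[T \to \infty]{} 0
```

Since $L(\cdot) \ge 0$, the drift terms cancel in the time average, leaving only the time-averaged reward terms and the constant error terms $\epsilon_{sub}$ and $L_G\sqrt{C_{PF}/N_p}$ on the right-hand side.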

4.5.4. Queue Stability and Performance Bound Theorem

Based on the above derivation, we now formally state the core performance theorem for our algorithm.
Theorem 1
(Queue Stability and Performance Bound). Under Assumptions 1–4, the policy generated by Algorithm 2 possesses the following properties: (1) All physical and virtual queues in the system are mean-square bounded. Specifically, there exists a constant $C_{stable}$ dependent on $V$ such that
$$\limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}[\|\Theta(t)\|^2] \le \frac{C_{stable}(V + 1)}{1 - \delta}$$
(2) A deterministic performance bound exists between the algorithm's achieved long-term discounted reward $V^{Alg}$ and the optimal discounted reward $V^{Opt}$ under complete information. For any $V > 0$ and particle count $N_p \ge 1$, the algorithm's performance satisfies
$$V^{Alg} \ge V^{Opt} - \frac{C_1 (B + V)}{V(1 - \delta)} - \frac{C_2}{\sqrt{N_p}} - C_3 \, \epsilon_{sub}$$
where $B$ is a constant depending on system boundedness, and $C_1, C_2, C_3$ are positive constants independent of $V$, $N_p$, or $\epsilon_{sub}$.
The theorem clearly reveals several key performance trade-offs. The first term represents a structural error from using Lyapunov approximate dynamic programming, indicating that increasing the parameter $V$ can arbitrarily shrink this part of the performance gap. The cost, however, is that average queue lengths grow as $O(V)$, meaning the system must tolerate greater fluctuations in inventory and debt. The second term stems from partial observability and Bayesian filtering estimation error, showing that increasing the particle count (i.e., allocating more computational resources) systematically reduces losses from incomplete information. The final term directly relates to subproblem solver accuracy. The remaining proofs for our algorithm are provided in Appendix A.

4.5.5. Convergence and Computational Complexity Analysis

We analyze the convergence of the algorithm's internal components and its overall computational overhead. The channel equilibrium prediction module employs iterative best response. For the non-convex games in our model, this iterative process does not guarantee convergence to a unique pure-strategy Nash equilibrium. However, in many practical applications, such heuristic iterative methods often converge to stable strategy distributions or approximate equilibrium points. Our framework's theoretical analysis accommodates this equilibrium solution inaccuracy through the $\epsilon_{sub}$ term. The stochastic gradient ascent method used in the firm decision module has solid theoretical support in stochastic optimization. Under Assumption 2's Lipschitz conditions, one can prove the algorithm converges to a stationary point satisfying the KKT (Karush–Kuhn–Tucker) conditions—a local optimum or saddle point. The algorithm's computational complexity per decision period comprises three main components. The particle filter module has complexity $O(N_p \cdot (k^2 + N))$, where $N_p$ is the particle count, $k$ is the market position vector dimension, and $N$ is the channel count. The channel equilibrium prediction module has complexity approximately $O(I_{NE} \cdot N \cdot C_{BR})$, where $I_{NE}$ is the number of iterations needed to reach an approximate equilibrium and $C_{BR}$ is the complexity of computing a single channel's best response. The firm decision module has complexity $O(K \cdot B_{grad} \cdot C_{obj})$, where $K$ is the number of gradient update steps, $B_{grad}$ is the batch size for gradient estimation, and $C_{obj}$ is the complexity of evaluating the objective function for a single sample. The proposed algorithm's total per-period computational complexity is thus polynomial in $(N_p, N, K, I_{NE})$. This indicates manageable computational overhead, with trade-offs between computational efficiency and solution accuracy achievable by tuning these hyperparameters.

5. Performance Evaluation and Experimental Analysis

To validate the practical performance of the proposed hierarchical online learning and control algorithm (hereafter referred to as L-BAP for convenience) and investigate its behavioral characteristics in complex dynamic environments, we construct a high-fidelity discrete-event simulation platform. This section details the experimental setup, baseline algorithms for comparison, and performance evaluation dimensions, and provides in-depth analysis and discussion of the simulation results.

5.1. Experimental Setup and Parameter Configuration

We implement the simulation platform using Python 3.8.2, leveraging relevant scientific computing libraries for numerical optimization and stochastic process simulation. We configure a representative “wood flooring” enterprise case with three channels ( N = 3 ). These channels are designed as follows: Channel 1 simulates a high-margin, high-brand-contribution but demand-volatile designer channel; Channel 2 represents a moderate-margin, high-volume, relatively stable large-scale project channel; Channel 3 models a low-margin, highly competitive online e-commerce channel sensitive to strategic intensity.
Core model parameters aim to reflect business realities. The market position vector M i ( t ) is set as two-dimensional ( k = 2 ), representing brand goodwill and customer base, respectively. Specific parameters for state evolution, demand, and cost functions are summarized in Table 2. The macroeconomic state Z ( t ) is modeled as a two-state (boom/recession) Markov chain with a transition probability matrix provided in the table. Production lead time is set to L = 4 periods, with reward discount factor δ = 0.98 . For our proposed L-BAP algorithm, core hyperparameters are as follows: Lyapunov trade-off parameter V = 100 and particle filter particle count N p = 200 .
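For concreteness, the stated settings can be collected in a small configuration object (a sketch; only the values quoted above come from the text, while the key names are our own and the full parameter set lives in Table 2):

```python
# Hypothetical configuration sketch for the simulation experiments.
config = {
    "n_channels": 3,            # N: designer, project, e-commerce channels
    "state_dim": 2,             # k: brand goodwill + customer base
    "lead_time": 4,             # L, in periods
    "discount": 0.98,           # delta, reward discount factor
    "V": 100,                   # Lyapunov trade-off parameter
    "n_particles": 200,         # N_p for the particle filter
    "macro_states": ("boom", "recession"),  # two-state Markov chain Z(t)
}
```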

5.2. Baseline Algorithms for Comparison

To comprehensively evaluate L-BAP performance, we select four representative baseline algorithms. These are designed to highlight L-BAP’s core advantages from different perspectives (e.g., foresight capability, hidden state handling).
“Myopic-Known (MK)” represents an idealized baseline. It assumes perfect observation of the true market position vector M ( t ) , but its decision objective maximizes only current single-period expected reward R ( t ) , ignoring long-term effects. This algorithm quantifies the long-term planning value provided by our Lyapunov framework.
“Myopic-Learned (ML)” constitutes a more realistic myopic algorithm. Like L-BAP, it cannot directly observe M ( t ) and uses the same particle filter module for state estimation. However, its decision objective similarly maximizes the current period expected reward. Comparing ML with L-BAP isolates the pure value of long-term planning.
“Static Policy (SP)” represents traditional management approaches lacking dynamic adaptation. This policy computes fixed resource allocation ratios and production levels based on long-term average market environment estimates, remaining constant throughout simulation.
“Decentralized-RL (D-RL)” serves as a strong baseline from multi-agent reinforcement learning. We model each channel as an independent learning agent, alongside the firm. Each agent learns based on local observations, aiming to maximize individual long-term rewards (e.g., using PPO or SAC algorithms). This baseline tests our proposed centralized coordination framework against popular fully decentralized learning methods.

5.3. Performance Evaluation Metrics

We evaluate algorithm performance across multiple dimensions. The core metric is the "cumulative discounted reward" $\sum_{t=0}^{T} \delta^t R(t)$, measuring overall profitability. "Inventory stability" is measured by the time-series standard deviation of the total inventory across all channels, $\sum_i I_i(t)$, with smaller values indicating smoother inventory management and weaker bullwhip effects. "Customer service level" is defined by the total order fulfillment rate $\sum_t \sum_i S_i(t) / \sum_t \sum_i d_i(t)$, reflecting the firm's ability to handle demand fluctuations. Additionally, we assess learning and perception capability through "state estimation accuracy", measured by the root-mean-square error (RMSE) between the particle filter estimates $\hat{M}(t)$ and the true values $M(t)$.
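Given simulated trajectories, these four metrics can be computed as follows (a sketch; the array shapes and the helper name `evaluate` are our own conventions):

```python
import numpy as np

def evaluate(rewards, inventory, sales, demand, M_true, M_hat, delta=0.98):
    """Compute the four evaluation metrics of Section 5.3.

    rewards:      (T,)   per-period rewards R(t)
    inventory:    (T, N) per-channel inventories I_i(t)
    sales/demand: (T, N) realized sales S_i(t) and demand d_i(t)
    M_true/M_hat: (T, k) true vs. estimated market-position vectors
    """
    T = len(rewards)
    discounted = float(np.sum(delta ** np.arange(T) * rewards))  # profitability
    inv_std = float(np.std(inventory.sum(axis=1)))               # inventory stability
    fill_rate = float(sales.sum() / demand.sum())                # service level
    rmse = float(np.sqrt(np.mean((M_true - M_hat) ** 2)))        # estimation accuracy
    return discounted, inv_std, fill_rate, rmse
```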

5.4. Experimental Results and Analysis

We first compare algorithm performance at both overall reward and dynamic behavior levels.
Figure 3 shows the evolution of the cumulative discounted reward over time. After a brief initial learning phase around t = 200 , our L-BAP algorithm’s reward accumulation curve exhibits a significantly steeper slope than all baselines, eventually converging to the highest total reward (approximately 15,500). Myopic-Learned (ML) performs second best (approximately 13,000), but its gap with L-BAP widens over time. This clearly demonstrates that learning current states alone is insufficient, and lacking foresight in long-term planning leads to substantial potential revenue loss.
Notably, the idealized Myopic-Known (MK) algorithm (approximately 13,800), despite perfect information and strong early performance ( t < 100 ), is eventually surpassed by L-BAP with long-term planning capability. This indicates that in our model, the long-term planning value provided by the Lyapunov framework can compensate for the information disadvantage from partial observability. D-RL and SP perform considerably worse, validating the necessity of dynamic adaptation and centralized coordination.
Figure 4 and Figure 5 reveal the dynamic reasons behind L-BAP’s high cumulative rewards. Figure 4 shows moving averages of single-period rewards. After t = 200 , L-BAP’s reward curve steadily climbs above 350, significantly exceeding all competitors. In contrast, ML and MK rewards fluctuate around 300, with D-RL and SP even lower. This indicates that L-BAP achieves not only higher average rewards but also relatively controlled reward volatility.
Figure 5 provides corroborating evidence from a demand fulfillment perspective. After the learning phase, L-BAP maintains the customer service level (order fulfillment rate) stably above 95%, nearly achieving ideal supply–demand balance. All other baseline algorithms, including MK with perfect information, exhibit lower and more volatile service levels (e.g., ML and MK oscillate around 90%), directly leading to lost sales opportunities and reduced rewards.
Next, we analyze algorithm performance in supply chain stability (inventory management). Figure 6 shows total inventory dynamics during mid-simulation ( t = 400 to t = 650 ). This figure visually demonstrates L-BAP’s absolute superiority in inventory management. Facing identical demand fluctuations and production delays, L-BAP’s inventory curve (blue) exhibits smooth, controlled periodic oscillations, successfully maintaining inventory within a relatively healthy target range (approximately 300 to 600).
In contrast, all baseline algorithms show severe inventory oscillations. SP and D-RL curves (orange and green) exhibit extreme, high-amplitude “bullwhip effects”, with inventory sometimes exceeding 1000 (facing severe overstock risk) and dropping to 200 (causing critical shortages). ML and MK inventory fluctuations, while slightly better than SP and D-RL, still show a significantly higher amplitude (approximately 250 to 750) and frequency than L-BAP. This strongly validates the direct effectiveness of the Lyapunov framework’s explicit inventory queue control through Equation (9).
  • Discussion. The bar summaries in Figure 7 and Figure 8 confirm the time-series trends: L-BAP achieves the highest long-run reward while simultaneously reducing inventory volatility by a large margin. Importantly, MK (perfect information but myopic) is still outperformed by L-BAP, highlighting that stability-aware planning has a first-order impact beyond state observability alone.
Table 3 provides a quantitative summary of key performance metrics at simulation end. These aggregated data confirm observations from dynamic curves. L-BAP achieves optimal results across three core business metrics (reward, inventory stability, service level). Its exceptionally low inventory standard deviation (85.2) is particularly notable—approximately half that of ML and MK, and far lower than that of D-RL and SP. This demonstrates L-BAP’s effectiveness in smoothing supply chain disturbances caused by demand fluctuations and production delays through foresighted production planning.
We next analyze the effectiveness of the core algorithm components. Figure 9 depicts state estimation error (RMSE) versus particle count $N_p$. Both L-BAP and ML (sharing the same particle filter module) show smoothly decreasing estimation error with increasing $N_p$, validating the Bayesian filtering module's effectiveness as the system's "perception" unit. At identical particle counts, the two algorithms show very similar RMSE values (e.g., at $N_p = 200$, L-BAP: 0.88, ML: 0.91, as in Table 3). This reveals a deeper conclusion: L-BAP's substantial performance advantage (Figure 3) stems not merely from better state estimation, but from superior decision planning. That is, the L-BAP framework better utilizes (error-prone) belief states to generate efficient, robust long-term resource allocation decisions.
Finally, we experimentally validate the performance trade-offs from the theoretical analysis by tuning L-BAP's key hyperparameters. Figure 10 plots the long-term average reward against the average inventory level achieved under different values of $V$. The results clearly trace a Pareto frontier: as $V$ increases, the algorithm assigns higher weight to the immediate reward $R(t)$ in the optimization objective (Equation (9)), increasing the long-term average reward; however, the implicit penalty on queue stability is relaxed, causing the average inventory level to rise accordingly. These experimental results match the performance–stability trade-off revealed in our theoretical analysis (i.e., the $O(V)$ queue bound versus the $O(1/V)$ structural reward loss, both controlled by $V$), providing strong empirical evidence for the algorithm's design rationale.
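The two sides of this trade-off can be read directly off the bound in Theorem 1 (a sketch with hypothetical constants; only $\delta$ matches the experiments):

```python
# Illustrative evaluation of the two V-dependent terms in Theorem 1.
# B_const and C1 are hypothetical constants, delta matches the experiments.
B_const, delta, C1 = 50.0, 0.98, 1.0
Vs = [10, 100, 1000]

# Structural reward gap C1*(B+V)/(V*(1-delta)): shrinks toward its O(1) floor as V grows.
reward_gap = [C1 * (B_const + V) / (V * (1 - delta)) for V in Vs]

# Queue bound scales as O(V): stability is the price paid for a smaller gap.
queue_growth = [C1 * (V + 1) for V in Vs]
```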
Figure 11 further shows a heatmap of the long-term average reward as a function of the trade-off parameter $V$ and the particle count $N_p$. It reveals two clear monotonic trends: at fixed $N_p$ (e.g., $N_p = 200$), increasing $V$ improves the average reward; at fixed $V$ (e.g., $V = 100$), increasing $N_p$ also improves the average reward. This corresponds directly to our theoretical bound (Theorem 1): increasing $V$ reduces the structural error from Lyapunov approximate planning (the $O(1/V)$ term), while increasing $N_p$ reduces the estimation error from partial observability (the $O(1/\sqrt{N_p})$ term). This heatmap provides clear guidance for practical parameter tuning.

5.5. Ablation Study and Component Contribution

To isolate the contribution of each module in L-BAP, we conduct an ablation study with three variants: (i) w/o PF replaces the particle filter belief M ^ ( t ) with a fixed prior mean (no online latent-state inference); (ii) w/o Lyapunov removes Lyapunov planning and optimizes only myopic reward (equivalent to ML); (iii) w/o Game disables the game-response predictor and assumes channels keep a fixed strategy intensity. All results are averaged over 30 random seeds, as shown in Table 4 and Figure 12 and Figure 13.
  • Findings. Removing Lyapunov planning causes the largest degradation in inventory stability and also lowers reward, indicating that the drift-plus-penalty structure is the main driver for robust long-term control under lead times. Removing particle filtering reduces reward because the planner cannot track latent market shifts and therefore misallocates budgets and production. Removing the game module also degrades reward and increases volatility, because the firm decisions become systematically mismatched with strategic channel reactions. Overall, the ablation results support the modular design philosophy: each component contributes materially, and their combination yields the best reward–stability balance.

6. Conclusions

This paper addresses the multi-channel resource allocation problem faced by wood flooring manufacturers in dynamic competitive markets. We construct a Partially Observable Stochastic Dynamic Game (POSDG) model that transcends traditional static or myopic perspectives. The model’s core contribution lies in endogenizing the firm’s intangible assets—such as brand goodwill and customer base—as hidden state variables that dynamically evolve with market investments and competitive behaviors, enabling deep analysis of long-term strategies. We propose a hierarchical online learning and control framework named L-BAP. Its innovation resides in the organic integration of three distinct yet complementary technologies: Bayesian state estimation via particle filtering to handle partial observability; Lyapunov approximate dynamic programming to transform the intractable long-term discounted reward problem into a sequence of online optimization subproblems guaranteeing physical system stability; and iterative best response to predict inter-channel game behaviors. We provide rigorous theoretical performance guarantees for the algorithm and validate its superiority over several strong baselines through extensive simulations. Experimental results demonstrate that our proposed method achieves higher long-term rewards while significantly enhancing supply chain stability (i.e., inventory smoothness), highlighting the value of integrating learning, optimization, and game theory for foresighted planning.
Several directions warrant future exploration. First, structural parameters of state evolution and demand functions are assumed to be known in our model. In more realistic scenarios, firms may need to learn or identify these parameters online, leading to more challenging dual control problems. Second, our channel game modeling employs static Nash equilibrium within periods as approximation. Future research could explore evolutionary game dynamics emerging when channels are modeled as agents with memory and learning capabilities. Finally, while our framework maintains some generality, its computational complexity grows significantly with channel count and state dimension. Therefore, investigating more scalable approximation methods, such as utilizing deep reinforcement learning to parameterize firm or channel decision policies, presents a promising research direction.

Author Contributions

Conceptualization, Y.W.; methodology, Y.W.; software, Y.W.; validation, Y.W. and A.V.V.; formal analysis, Y.W.; investigation, Y.W.; data curation, Y.W.; writing—original draft preparation, Y.W.; writing—review and editing, A.V.V.; visualization, Y.W.; supervision, A.V.V. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data and code used in the simulation experiments can be obtained from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to thank the anonymous reviewers for their valuable comments.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A. Proofs for Theoretical Performance Analysis

This appendix provides complete and rigorous mathematical derivations for the theoretical results presented in the main text. Our analysis focuses on the proposed hierarchical online learning and control algorithm, with the core objective of proving its stability and approximate optimality in Partially Observable Stochastic Dynamic Game environments. The proof procedure strictly adheres to previously defined models and notation, progressing through a series of lemmas and theorems.

Appendix A.1. Analytical Foundation: Restatement of Technical Assumptions

To ensure clarity and self-consistency in subsequent derivations, we reiterate the technical assumptions underlying our theoretical analysis.

Appendix A.2. Detailed Derivation of Lyapunov Drift Upper Bound

Lyapunov analysis forms the cornerstone of our theoretical framework. We must first establish a rigorous mathematical upper bound for the single-step Lyapunov drift.
Lemma A1
(Lyapunov Drift Upper Bound). Under Assumptions 1 and 2, for any given observable history $H(t)$ and any feasible policy taken under this history, the single-step Lyapunov drift $\Delta(\Theta(t)) = \mathbb{E}[L(\Theta(t+1)) - L(\Theta(t)) \mid H(t)]$ satisfies
$$\Delta(\Theta(t)) \le C_0 - \mathbb{E}\Big[ \frac{1-\delta^2}{2} Y(t)^2 \,\Big|\, H(t) \Big] - \mathbb{E}\Big[ \delta Y(t) R(t) + \sum_{i \in N} I_i(t)\big(S_i(t) - p_i(t-L+1)\big) \,\Big|\, H(t) \Big] \tag{A1}$$
where C 0 is a positive constant depending only on system parameters and bounded constants from Assumption 1.
Proof. 
The Lyapunov function comprises two components: $L(\Theta(t)) = \frac{1}{2}\sum_{i \in N} I_i(t)^2 + \frac{1}{2} Y(t)^2$. We analyze the expected changes of both parts separately.
First, consider the drift of the physical inventory queue $I_i(t)$. From its evolution equation $I_i(t+1) = I_i(t) - S_i(t) + p_i(t-L+1)$,
$$\begin{aligned} \frac{1}{2}\mathbb{E}[I_i(t+1)^2 - I_i(t)^2 \mid H(t)] &= \frac{1}{2}\mathbb{E}\big[(I_i(t) - S_i(t) + p_i(t-L+1))^2 - I_i(t)^2 \,\big|\, H(t)\big] \\ &= -\mathbb{E}[I_i(t)\big(S_i(t) - p_i(t-L+1)\big) \mid H(t)] + \frac{1}{2}\mathbb{E}[\big(S_i(t) - p_i(t-L+1)\big)^2 \mid H(t)] \end{aligned}$$
By Assumption 1, sales $S_i(t)$ and production $p_i(t-L+1)$ are bounded. Therefore, the quadratic term $\mathbb{E}[(S_i(t) - p_i(t-L+1))^2 \mid H(t)]$ must be bounded; we define the constant $C_{I,i} = (I_{\max} + p_{\max})^2$. Summing over all channels yields an upper bound for the inventory drift:
$$\frac{1}{2} \sum_{i \in N} \mathbb{E}[I_i(t+1)^2 - I_i(t)^2 \mid H(t)] \le C_I - \sum_{i \in N} \mathbb{E}[I_i(t)\big(S_i(t) - p_i(t-L+1)\big) \mid H(t)] \tag{A2}$$
where $C_I = \frac{1}{2}\sum_{i \in N} C_{I,i}$.
Next, consider the virtual debt queue $Y(t)$ with evolution equation $Y(t+1) = \delta Y(t) - R(t)$:
$$\begin{aligned} \frac{1}{2}\mathbb{E}[Y(t+1)^2 - Y(t)^2 \mid H(t)] &= \frac{1}{2}\mathbb{E}\big[(\delta Y(t) - R(t))^2 - Y(t)^2 \,\big|\, H(t)\big] \\ &= \frac{1}{2}\mathbb{E}\big[\delta^2 Y(t)^2 - 2\delta Y(t) R(t) + R(t)^2 - Y(t)^2 \,\big|\, H(t)\big] \\ &= -\frac{1}{2}\mathbb{E}\big[(1 - \delta^2) Y(t)^2 + 2\delta Y(t) R(t) - R(t)^2 \,\big|\, H(t)\big] \end{aligned}$$
By Assumption 1, $\mathbb{E}[R(t)^2 \mid H(t)] \le R_{\max}^2$. Thus, we bound the above as
$$\frac{1}{2}\mathbb{E}[Y(t+1)^2 - Y(t)^2 \mid H(t)] \le -\frac{1-\delta^2}{2}\mathbb{E}[Y(t)^2 \mid H(t)] - \mathbb{E}[\delta Y(t) R(t) \mid H(t)] + \frac{1}{2} R_{\max}^2 \tag{A3}$$
Finally, summing the inventory drift bound (A2) and the debt queue drift bound (A3) yields an upper bound for the total Lyapunov drift $\Delta(\Theta(t))$. Defining the combined constant $C_0 = C_I + \frac{1}{2} R_{\max}^2$ completes the proof of Lemma A1. □

Appendix A.3. Drift-Plus-Penalty Analysis and Performance Gap Derivation

Our Algorithm 2 solves an approximate optimization problem (9) in each period $t$. This decision rule closely relates to the drift lemma. To establish performance bounds, we first define a "drift-plus-penalty" expression aligned with the algorithm's objective.
As described in Section 4.3, the algorithm aims to minimize the customized expression $\Psi(t) = \Delta(\Theta(t)) + \mathbb{E}[-V R(t) + \sum_{i \in N} I_i(t)^2 \mid H(t)]$. Substituting the drift upper bound from Lemma A1:
$$\Psi(t) \le \mathbb{E}\Big[ C_0 - \frac{1-\delta^2}{2} Y(t)^2 - \delta Y(t) R(t) - \sum_{i \in N} I_i(t)\big(S_i(t) - p_i(t-L+1)\big) - V R(t) + \sum_{i \in N} I_i(t)^2 \,\Big|\, H(t) \Big]$$
Rearranging terms yields
$$\Psi(t) \le C_0 - \frac{1-\delta^2}{2}\mathbb{E}[Y(t)^2 \mid H(t)] - \mathbb{E}\Big[ (V + \delta Y(t)) R(t) + \sum_{i \in N} I_i(t)\big(S_i(t) - p_i(t-L+1)\big) - \sum_{i \in N} I_i(t)^2 \,\Big|\, H(t) \Big] \tag{A4}$$
We define the single-period optimization objective function $G_t(a, \Theta, M)$ as
$$G_t(a, \Theta, M) = (V + \delta Y) R(S, a) + \sum_{i \in N} I_i \big(S_i(S, a) - p_i(t-L+1)\big) - \sum_{i \in N} I_i^2$$
This $G_t$ function exactly matches the objective solved by Algorithm 2 (Equation (9)). Therefore, inequality (A4) becomes
$$\Psi(t) \le C_0 - \frac{1-\delta^2}{2}\mathbb{E}[Y(t)^2 \mid H(t)] - \mathbb{E}[G_t(a(t), \Theta(t), M(t)) \mid H(t)] \tag{A5}$$
Algorithm 2 aims to maximize the expectation of G t (based on belief M ^ ( t ) ). Let a A l g ( t ) denote the algorithm’s chosen action, and a ( t ) denote the action taken by an ideal policy with complete information and perfect G t solution capability. Considering Assumptions 3 and 4, the algorithm’s action satisfies
$$\mathbb{E}\big[G_t(a_{Alg}(t), \Theta(t), M(t)) \mid H(t)\big] \ge \mathbb{E}\big[G_t(a^*(t), \Theta(t), M(t)) \mid H(t)\big] - \epsilon_{sub} - \mathrm{Err}_{est}(t)
$$
The performance loss $\mathrm{Err}_{est}(t)$ due to estimation error can be bounded. By Assumption 2, $G_t$ is Lipschitz continuous in the market state (with constant $L_G$). Thus,
$$\big|\mathbb{E}[\mathrm{Err}_{est}(t) \mid H(t)]\big| \le \mathbb{E}\big[\,|G_t(a, \Theta, M) - G_t(a, \Theta, \hat{M})| \,\big|\, H(t)\big] \le \mathbb{E}\big[L_G\, \|M(t) - \hat{M}(t)\| \,\big|\, H(t)\big] \le L_G C_{PF} / N_p$$
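The shrinking of this estimation term with the particle count can be observed empirically with a minimal bootstrap particle filter. The scalar AR(1) latent state and Gaussian noises below are illustrative stand-ins for the market position dynamics, not the paper's calibrated model; the point is only that the posterior-mean error decreases as $N_p$ grows:

```python
import math
import random

def pf_rmse(n_particles: int, horizon: int = 200, seed: int = 1) -> float:
    """RMSE of a bootstrap particle filter's posterior mean tracking a
    scalar AR(1) latent state observed in Gaussian noise (toy model)."""
    rng = random.Random(seed)
    a, q, r = 0.9, 0.5, 0.5                   # AR coefficient, noise stds
    x = 0.0
    particles = [0.0] * n_particles
    sq_err = 0.0
    for _ in range(horizon):
        x = a * x + rng.gauss(0.0, q)         # latent state transition
        y = x + rng.gauss(0.0, r)             # noisy observation
        particles = [a * p + rng.gauss(0.0, q) for p in particles]
        # Gaussian observation likelihood as importance weights
        w = [math.exp(-0.5 * ((y - p) / r) ** 2) for p in particles]
        tot = sum(w)
        w = [wi / tot for wi in w] if tot > 0 else [1.0 / n_particles] * n_particles
        est = sum(wi * p for wi, p in zip(w, particles))   # posterior mean
        sq_err += (est - x) ** 2
        particles = rng.choices(particles, weights=w, k=n_particles)  # resample
    return math.sqrt(sq_err / horizon)

# Average over a few seeds to smooth Monte Carlo noise.
rmse_20 = sum(pf_rmse(20, seed=s) for s in range(5)) / 5
rmse_500 = sum(pf_rmse(500, seed=s) for s in range(5)) / 5
```

With 500 particles the filter tracks close to the optimal (Kalman) error for this toy model, while 20 particles incur a visibly larger RMSE, mirroring the $L_G C_{PF}/N_p$ term above.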
Substituting this performance relationship back into the $\Psi(t)$ upper bound yields
$$\Psi(t) \le C_0 - \tfrac{1-\delta^2}{2}\,\mathbb{E}\big[Y(t)^2 \mid H(t)\big] - \mathbb{E}\big[G_t(a^*(t), \Theta(t), M(t)) \mid H(t)\big] + \epsilon_{sub} + L_G C_{PF}/N_p$$
This inequality forms the core connection between algorithm execution, queue dynamics, and the performance gap from the ideal optimal policy.
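For concreteness, the single-period objective $G_t$ analyzed in this subsection can be maximized by direct search over a discretized action set, standing in for the solver inside Algorithm 2. The functional forms here (a concave stand-in reward in a scalar marketing budget and a linear sales response for two channels) are illustrative assumptions only, not the paper's calibrated model:

```python
def G_t(action, V, delta, Y, inventories, pipeline_arrivals, reward_fn, sales_fn):
    """Single-period drift-plus-penalty objective
    G_t = (V + delta*Y)*R(a) + sum_i I_i*(S_i(a) - p_i) - sum_i I_i**2."""
    total = (V + delta * Y) * reward_fn(action)
    for I_i, p_i, S_i in zip(inventories, pipeline_arrivals, sales_fn(action)):
        total += I_i * (S_i - p_i) - I_i ** 2
    return total

# Illustrative two-channel instance; all numbers are placeholders.
reward_fn = lambda b: 4.0 * b - b ** 2          # concave stand-in reward
sales_fn = lambda b: (0.5 * b, 0.3 * b)         # linear stand-in sales
grid = [0.0, 1.0, 2.0, 3.0]                     # candidate budgets
best = max(grid, key=lambda b: G_t(b, V=10.0, delta=0.9, Y=1.0,
                                   inventories=(3.0, 2.0),
                                   pipeline_arrivals=(1.0, 1.0),
                                   reward_fn=reward_fn, sales_fn=sales_fn))
# → best == 2.0 on this grid (the unconstrained optimum is ≈ 2.1)
```

Any suboptimality left by such an approximate solver (here, the grid resolution) is exactly what the $\epsilon_{sub}$ term in the analysis absorbs.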

Appendix A.4. Proof of Queue Stability and Performance Bound Theorems

Based on inequality (A6), we now formally prove the theorems from the main text.
Theorem A1
(Queue Stability). Under Assumptions 1–4, all physical and virtual queues generated by Algorithm 2 are mean-square bounded.
Proof. 
We take total expectations in inequality (A6). By definition, $\mathbb{E}[\Psi(t)] = \mathbb{E}[\Delta(\Theta(t))] + \mathbb{E}\big[-V R_{Alg}(t) + \sum_{i \in \mathcal{N}} I_i(t)^2\big]$, so
$$\mathbb{E}[\Delta(\Theta(t))] \le C_0 + V\,\mathbb{E}[R_{Alg}(t)] - \mathbb{E}\Big[\sum_{i \in \mathcal{N}} I_i(t)^2\Big] - \tfrac{1-\delta^2}{2}\,\mathbb{E}[Y(t)^2] - \mathbb{E}[G_t(a^*(t))] + \epsilon_{sub} + L_G C_{PF}/N_p$$
By Assumption 1, the reward $R_{Alg}(t)$ is bounded by $R_{\max}$. The ideal policy’s objective value $G_t(a^*(t))$ is likewise bounded, by some $G_{\max}$. All non-queue terms can therefore be absorbed into a single large constant $C_{total}$:
$$\mathbb{E}[\Delta(\Theta(t))] \le C_{total} - \mathbb{E}\Big[\sum_{i \in \mathcal{N}} I_i(t)^2\Big] - \tfrac{1-\delta^2}{2}\,\mathbb{E}[Y(t)^2]$$
This inequality shows that once the squared norm of the queue vector, $\|\Theta(t)\|^2 = \sum_i I_i(t)^2 + Y(t)^2$, is sufficiently large (exceeding some constant B), $\mathbb{E}[\Delta(\Theta(t))]$ must be negative. By the Foster–Lyapunov stability criterion, the Markov chain formed by the queue vector $\Theta(t)$ is therefore positive recurrent, and its second moment (mean-square value) is finite. A more detailed analysis shows that $\mathbb{E}[\|\Theta(t)\|^2] = O(V^2)$. □
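The threshold B in this Foster–Lyapunov argument can be made explicit: with $\kappa = \min\!\big(1, (1-\delta^2)/2\big)$, the drift bound above is strictly negative whenever $\|\Theta(t)\|^2 > C_{total}/\kappa$. A tiny numeric check with placeholder constants:

```python
def drift_upper_bound(sum_inv_sq, y_sq, c_total, delta):
    """Right-hand side of the drift bound:
    C_total - sum_i I_i^2 - (1 - delta^2)/2 * Y^2."""
    return c_total - sum_inv_sq - 0.5 * (1.0 - delta ** 2) * y_sq

c_total, delta = 50.0, 0.9                     # placeholder constants
kappa = min(1.0, (1.0 - delta ** 2) / 2.0)     # worst-case queue coefficient
B = c_total / kappa                            # explicit negative-drift threshold
# Any queue state with ||Theta||^2 > B has a strictly negative drift bound,
# whether the mass sits in the inventory queues or in the virtual queue.
assert drift_upper_bound(0.0, 1.01 * B, c_total, delta) < 0.0
assert drift_upper_bound(1.01 * B, 0.0, c_total, delta) < 0.0
```

Since both queue coefficients are at least $\kappa$, $C_{total} - \sum_i I_i^2 - \tfrac{1-\delta^2}{2}Y^2 \le C_{total} - \kappa\|\Theta\|^2 < 0$ beyond the threshold, which is the negative-drift condition the criterion requires.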
Theorem A2
(Performance Bound). Under Assumptions 1–4, the algorithm’s achieved long-term discounted reward V A l g and the optimal discounted reward V O p t under complete information satisfy the performance bound given in the main text.
Proof. 
(Proof Sketch) Taking the total expectation of inequality (A6), summing over t = 0 , , T 1 , then dividing by T, we analyze the long-term time-average behavior:
$$\frac{\mathbb{E}[L(\Theta(T))] - \mathbb{E}[L(\Theta(0))]}{T} + \frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}\Big[-V R_{Alg}(t) + \sum_{i \in \mathcal{N}} I_i(t)^2\Big] \le C_{err} - \frac{1}{T}\sum_{t=0}^{T-1} \mathbb{E}[G_t(a^*(t))]$$
where $C_{err}$ aggregates all constant and error terms ($C_0$, $\epsilon_{sub}$, $O(1/N_p)$). As $T \to \infty$, the telescoping term $(\mathbb{E}[L(\Theta(T))] - \mathbb{E}[L(\Theta(0))])/T$ vanishes, since queue second moments are bounded by Theorem A1. This establishes a relationship between the algorithm’s long-term time-average reward $\bar{R}_{Alg}$ and the ideal policy’s time-average objective $\bar{G}^*$.
Converting this time-average result into the discounted-reward bound relating $V_{Alg}$ and $V_{Opt}$ in the main text requires techniques beyond standard drift analysis. However, our Lyapunov design incorporating the $Y(t)$ queue is a known (if advanced) method for handling discounted-reward problems. Through the algebraic transformations described in the main text, the performance gap of the drift-plus-penalty objective optimized by Algorithm 2 carries over to the discounted-reward domain.
The final performance bound (Theorem 1 in the main text) follows directly from this analysis and decomposes the total performance loss into three controllable components:
$$V_{Alg} \ge V_{Opt} - \frac{C_1}{V(1-\delta)} - \frac{C_2}{N_p} - C_3\, \epsilon_{sub}$$
The first term is the structural error of the Lyapunov approximation (controlled by the parameter V); the second stems from the estimation error due to partial observability (controlled by the particle count $N_p$); the third reflects the implementation error from inexact subproblem (game and optimization) solutions ($\epsilon_{sub}$). This result provides clear theoretical guidance for tuning the algorithm. □
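This decomposition suggests a concrete tuning recipe: with $N_p$ and $\epsilon_{sub}$ fixed, the structural term shrinks as $O(1/V)$ while (by Theorem A1) queue backlogs grow as $O(V^2)$. The short sketch below tabulates the bound for increasing V; the constants $C_1$, $C_2$, $C_3$ are placeholders chosen only to illustrate the shape of the trade-off:

```python
def performance_gap(V, n_p, eps_sub, delta, c1=100.0, c2=50.0, c3=1.0):
    """Bound on V_Opt - V_Alg: C1/(V*(1-delta)) + C2/N_p + C3*eps_sub.
    c1, c2, c3 are placeholder constants for illustration only."""
    return c1 / (V * (1.0 - delta)) + c2 / n_p + c3 * eps_sub

gaps = [performance_gap(V, n_p=500, eps_sub=0.01, delta=0.9)
        for V in (1.0, 10.0, 100.0, 1000.0)]
# The structural term decays like 1/V; the particle-count and
# subproblem terms set the floor the gap cannot drop below.
assert all(g1 > g2 for g1, g2 in zip(gaps, gaps[1:]))
```

In practice this means V should be raised until the structural term falls below the $C_2/N_p + C_3\epsilon_{sub}$ floor, beyond which larger V only inflates inventory backlogs.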

Figure 1. Schematic diagram of the system state transitions and decision flow. The firm selects ( b ( t ) , p ( t ) ) , channels choose s ( t ) , demand and sales are realized, and the latent market position M ( t ) evolves with persistent and competitive effects.
Figure 2. Flowchart of L-BAP. Each period performs belief update via particle filtering, predicts strategic channel reactions via iterative best response, and computes stability-aware firm actions via Lyapunov drift-plus-penalty optimization.
Figure 3. Evolution of cumulative discounted reward over 1000 simulation periods for different algorithms.
Figure 4. Single-period rewards (15-period moving average) for different algorithms.
Figure 5. Customer service levels (25-period moving average) for different algorithms.
Figure 6. Comparison of total inventory dynamics during mid-simulation period (t = 400 to 650).
Figure 7. End-of-horizon cumulative discounted reward comparison (mean ± std over 30 seeds).
Figure 8. Inventory stability comparison measured by inventory standard deviation (mean ± std over 30 seeds). Lower is better.
Figure 9. State estimation RMSE versus particle count N p (L-BAP vs. ML).
Figure 10. Pareto frontier of long-term average reward versus average inventory level for L-BAP under varying trade-off parameter V.
Figure 11. Heatmap of L-BAP’s average reward (per period) as a function of V and N p .
Figure 12. Ablation comparison on cumulative discounted reward (mean trajectory over 30 seeds).
Figure 13. Ablation comparison on inventory stability (rolling inventory STD, mean over 30 seeds).
Table 1. Key notation and definitions used in the model.
Notation | Definition
--- | ---
**State Variables** |
$M_i(t) \in \mathbb{R}^k$ | Market position vector of channel i in period t (partially observable)
$I_i(t) \in \mathbb{R}_+$ | Physical inventory level of finished goods for channel i at the start of period t (observable)
$p_{i,pipe}(t)$ | Vector of in-process production orders for channel i in period t, $p_{i,pipe}(t) = (p_i(t-1), \ldots, p_i(t-L+1))$
$Z(t) \in \mathcal{Z}$ | Macroeconomic market state in period t (observable)
$S(t)$ | Complete system state in period t
$Y(t)$ | Virtual debt queue used in Lyapunov planning (Section 4)
**Decision/Action Variables** |
$b_i(t) \in \mathbb{R}_+$ | Marketing budget allocated to channel i by the firm in period t
$p_i(t) \in \mathbb{R}_+$ | Production order quantity placed for channel i by the firm in period t
$s_i(t) \in [s_{i,\min}, s_{i,\max}]$ | Sales strategy intensity adopted by channel i in period t
**Model Functions and Intermediate Variables** |
$d_i(t)$ | Market demand function for channel i in period t
$S_i(t)$ | Actual sales volume for channel i in period t
$R_i(t)$ | Operating profit generated by channel i in period t
$C_{inv}(\cdot), C_{prod}(\cdot)$ | Inventory holding cost and production cost functions
$S(\cdot)$ | Observation likelihood of sales in particle filtering (Section 4)
**Parameters** |
$L \in \mathbb{N}$ | Production order lead time
$\delta \in (0, 1)$ | Discount factor for future long-term rewards
$\Delta_i \in \mathbb{R}^{k \times k}$ | Natural decay matrix for market position
$\Gamma_{ij} \in \mathbb{R}^{k \times k}$ | Cross-channel competitive influence matrix from channel j to i
$\pi_i$ | Unit sales marginal profit for channel i
$B(t), P(t)$ | Total marketing budget and production-capacity constraints in period t
$H(t)$ | History of all observable information up to period t
Table 2. Core parameter settings for simulation model.
Parameter Category | Symbol | Value/Form
--- | --- | ---
State Evolution | $\Delta_i$ | $\mathrm{diag}(0.05, 0.02)$
State Evolution | $\Gamma_{ij}$ | Off-diagonal matrix reflecting strong competition from channel 3 to channels 1 and 2
Demand Function | $\theta_i, \phi_i, \beta_i, \zeta_i, \psi_{ij}$ | Heterogeneous parameter vectors set according to channel characteristics
Cost Function | $C_{prod}(p)$ | $\sum_i (0.1 p_i^2 + 2 p_i)$
Cost Function | $C_{inv}(I)$ | $\sum_i 0.5 I_i$
Macro Environment | $P(Z' \mid Z)$ | $[[0.95, 0.05], [0.1, 0.9]]$
Table 3. Overall performance metrics (mean ± std) over 30 random seeds.
Algorithm | Cumulative Discounted Reward | Inventory STD | Service Level | RMSE
--- | --- | --- | --- | ---
L-BAP | 15,420 ± 310 | 85.2 ± 6.0 | 96.5% ± 0.6% | 0.88
MK | 13,850 ± 280 | 155.6 ± 9.0 | 91.2% ± 1.0% | N/A
ML | 12,990 ± 350 | 162.1 ± 9.5 | 92.1% ± 0.9% | 0.91
D-RL | 9870 ± 520 | 210.5 ± 12.0 | 85.6% ± 1.3% | N/A
SP | 6540 ± 150 | 250.8 ± 15.0 | 78.2% ± 1.5% | N/A
Table 4. Ablation study of L-BAP variants (mean ± std) over 30 random seeds.
Variant | Cumulative Discounted Reward | Inventory STD
--- | --- | ---
L-BAP | 15,420 ± 310 | 85.2 ± 6.0
L-BAP w/o PF | 14,680 ± 340 | 98.0 ± 8.0
L-BAP w/o Lyapunov | 13,400 ± 420 | 175.0 ± 12.0
L-BAP w/o Game | 14,150 ± 360 | 112.0 ± 9.0