Abstract
Wood flooring manufacturers face complex challenges in dynamically allocating resources across multi-channel markets, characterized by channel conflicts, demand uncertainty, and long-term cumulative effects of decisions. Traditional static optimization or myopic approaches struggle to address these intertwined factors, particularly when critical market states like brand reputation and customer base cannot be precisely observed. This paper establishes a systematic and theoretically grounded online decision framework to tackle this problem. We first model the problem as a Partially Observable Stochastic Dynamic Game. The core innovation lies in introducing an unobservable market position vector as the central system state, whose evolution is jointly influenced by firm investments, inter-channel competition, and macroeconomic randomness. The model further captures production lead times, physical inventory dynamics, and saturation/cross-channel effects of marketing investments, constructing a high-fidelity dynamic system. To solve this complex model, we propose a hierarchical online learning and control algorithm named L-BAP (Lyapunov-based Bayesian Approximate Planning), which innovatively integrates three core modules. It employs particle filters for Bayesian inference to nonparametrically estimate latent market states online. Simultaneously, the algorithm constructs a Lyapunov optimization framework that transforms long-term discounted reward objectives into tractable single-period optimization problems through virtual debt queues, while ensuring stability of physical systems like inventory. Finally, the algorithm embeds a game-theoretic module to predict and respond to rational strategic reactions from each channel. We provide theoretical performance analysis, rigorously proving the mean-square boundedness of system queues and deriving the performance gap between long-term rewards and optimal policies under complete information. 
This bound clearly quantifies the trade-off between estimation accuracy (determined by particle count) and optimization parameters. Extensive simulations demonstrate that our L-BAP algorithm significantly outperforms several strong baselines—including myopic learning and decentralized reinforcement learning methods—across multiple dimensions: long-term profitability, inventory risk control, and customer service levels.
1. Introduction
The wood flooring manufacturing industry constitutes a significant component of the global forest product economy. In recent years, facing increasingly diverse consumer demands and intense market competition, numerous wood flooring enterprises have initiated digital transformation to enhance operational efficiency and market responsiveness [1,2]. Within this process, a pervasive and fundamental challenge involves effective resource allocation across multiple concurrent sales channels [3,4,5,6].
To cover different market segments, modern wood flooring companies typically establish diversified channel portfolios. For instance, certain channels specialize in serving designer communities, pursuing high-margin custom orders; others target large-scale engineering projects characterized by substantial order volumes yet relatively limited profit margins; simultaneously, online e-commerce channels directly reach mass consumer markets, with operational focus on brand promotion and rapid inventory turnover [7,8,9,10,11,12,13,14,15,16,17,18]. These channels exhibit fundamental differences in target customers, profit models, and operational rhythms. When competing for the firm’s finite resources—such as marketing budgets, production schedules, and finished goods inventory—their divergent objectives inevitably create internal conflicts, rendering the resource allocation problem particularly complex.
Traditional resource allocation models, such as static planning methods based on historical data, might suffice in stable market environments. However, their limitations become increasingly apparent in contemporary markets characterized by volatile demand fluctuations and high uncertainty [19]. More critically, these conventional approaches often treat the firm as a single decision entity, thereby overlooking potential self-interested strategic behaviors by semi-autonomous business units like sales channels. When different channels take actions most beneficial to themselves to achieve their respective KPIs—for instance, over-requesting resources—conventional centralized optimization models may not only fail but could also lead to resource misallocation resembling the tragedy of the commons, ultimately harming the firm’s overall interests [20,21,22].
To address this challenge, this research aims to answer a core question: How can we design an online resource allocation mechanism that accommodates inter-channel competition while ensuring the firm’s long-term operational stability and objective achievement in dynamically changing market environments?
Toward this end, we propose a hybrid framework integrating Lyapunov optimization with game theory. The design philosophy recognizes channels not as passive instruction executors but as rational participants, employing game theory to characterize their pursuit of short-term profit maximization. To balance the long-term risks inherent in such decentralized decision making, the framework incorporates Lyapunov optimization theory. This theory transforms the firm’s strategic objectives—such as maintaining healthy inventory levels or continuously enhancing brand value—into dynamically trackable virtual queues [23,24].
By combining these two theoretical foundations, our framework establishes a dynamic incentive mechanism. In each decision cycle, the firm’s central planner assigns dynamic weights to channel decision outcomes based on current states of these virtual queues. These weights adjust channels’ perception of short-term profits, thereby guiding their local optimal choices toward the firm’s global long-term objectives without depriving their decision autonomy [24,25].
This work provides a novel analytical perspective for channel resource management within the wood products industry. By integrating stochastic network optimization with game theory, we construct an online decision framework with theoretical performance guarantees. Our research not only offers practical insights for wood flooring industry management but also provides valuable implications for other manufacturing sectors facing similar multi-channel resource conflicts.
The main contributions of this paper are summarized as follows:
- We formulate a Partially Observable Stochastic Dynamic Game model that endogenizes key intangible assets—such as brand reputation and customer base—as unobservable market states within the context of the wood flooring industry, systematically characterizing their dynamic evolution under firm investments, channel competition, and macroeconomic randomness.
- We design a hierarchical online learning and control algorithm termed L-BAP, which tightly integrates Bayesian filtering, Lyapunov approximate dynamic programming, and game theory, combining hidden state estimation, long-term objective optimization, and inter-channel strategic behavior prediction within a unified framework.
- We provide rigorous theoretical performance analysis for the L-BAP algorithm. Through Lyapunov function construction, we prove the mean-square boundedness of system queues and derive the performance gap bound between algorithm rewards and optimal policies under complete information, clearly revealing the estimation–optimization trade-off.
- We validate the proposed framework’s effectiveness through high-fidelity simulations. Experimental results demonstrate that L-BAP significantly outperforms several strong baselines—including myopic learning and decentralized reinforcement learning methods—across multiple dimensions: long-term profitability, inventory risk control, and customer service levels. We further provide ablation and sensitivity analyses to isolate the contributions of Bayesian filtering, Lyapunov planning, and game-response prediction.
2. Related Work
2.1. Traditional Resource Allocation Methods in Manufacturing
Resource allocation in manufacturing represents a persistent research theme in operations research. Mathematical programming approaches constitute the most widely applied toolkit. For instance, linear programming and mixed-integer linear programming are extensively employed for solving production scheduling, capacity planning, and material requirement planning problems [18]. These methods offer advantages in clear model structure and global optimality guarantees. However, they typically rely on accurate predictions of future parameters like demand and costs, which proves challenging in highly uncertain markets. Moreover, these models are inherently static and unsuitable for online decision-making scenarios requiring rapid responses.
To address dynamism and uncertainty, researchers have proposed stochastic programming and robust optimization [19]. These methods directly incorporate uncertainty into models but often incur high computational complexity and require known probability distributions of uncertainties. Another category comprises heuristic algorithms—such as genetic algorithms, simulated annealing, and particle swarm optimization—which can find high-quality approximate solutions within acceptable timeframes but generally lack theoretical performance guarantees [19]. A common limitation across these traditional approaches is their treatment of the firm as a unified decision unit, failing to adequately account for potential interest conflicts and strategic interactions among different internal departments or channels.
2.2. Game Theory Applications in Supply Chain and Channel Management
Game theory provides a theoretical framework for analyzing systems comprising multiple independent decision makers and has found extensive application in supply chain and channel management. Existing work employs Stackelberg games to model pricing and ordering strategies between manufacturers and retailers, examining channel coordination, profit allocation, and service investment while analyzing impacts of information asymmetry and fairness concerns on equilibria [20]. Concurrently, other research investigates competition and cooperation mechanisms in dual-channel and multi-channel contexts, integrating decisions on pricing, inventory, and service [6,7]. Unlike these analyses focusing primarily on static or finite-horizon equilibria, this paper concentrates on mechanism design problems for continuous online decision making under long-term uncertain environments.
2.3. Stochastic Network Optimization and Lyapunov Theory
Lyapunov optimization represents a general methodology for online decision making and long-term constraint handling, systematically developed by Neely for communication and queuing systems [23]. By constructing virtual queues and minimizing a drift-plus-penalty expression, this method transforms long-term objectives and time-varying constraints into deterministic subproblems solvable in each time slot. This enables algorithms to achieve provable performance and stability guarantees without requiring prior statistical knowledge of future information. Recently, Lyapunov optimization has been widely applied to real-time optimization and scheduling in energy systems and microgrids, demonstrating its adaptability to uncertainties and constraints in practice [24]. This suggests its potential applicability to scenarios like manufacturing and channel resource management that involve long-term constraints and short-term disturbances.
3. System Modeling and Problem Formulation
To analyze resource allocation strategies for wood flooring enterprises in dynamic competitive environments, we construct a mathematical model based on state space evolution. The core premise is that a firm’s short-term decisions not only affect immediate returns but, more importantly, continuously alter its long-term market competitive position. Our dynamic system aims to capture persistent marketing effects, physical delays in production logistics, and environmental uncertainties. As shown in Figure 1, the temporal framework is defined as discrete decision periods $t = 1, 2, \ldots$. In the wood flooring industry context, market responses are shaped by user perceptions of product quality, durability, and aesthetic preferences, while manufacturing choices and lifecycle considerations also influence cost and supply-side constraints, motivating the following model elements [15,17].
Figure 1.
Schematic diagram of the system state transitions and decision flow. The firm selects its resource-allocation actions, channels choose their strategy intensities, demand and sales are realized, and the latent market position evolves with persistent and competitive effects.
3.1. Model Notation and Definitions
To ensure clarity and rigor in subsequent discussions, core variables and parameters used in the model are summarized in Table 1.
Table 1.
Key notation and definitions used in the model.
Data Availability and State Estimability
The variables categorized in Table 1 consist of (i) directly observable operational states, (ii) decision variables, and (iii) latent states or structural parameters. In industrial settings, most operational states are retrieved directly from enterprise information systems; for instance, inventory levels and pipeline orders are tracked via ERP/WMS databases, while realized sales and channel outcomes are logged by POS and e-commerce platforms. Control inputs, including marketing budgets and production orders, are internally determined. The channel strategy intensity reflects managerial decisions executed through actionable levers (e.g., promotion intensity or service levels) within predefined constraints.
The market position vector is the only partially observable component, representing intangible assets such as brand goodwill or customer base. To bypass the need for direct measurement, our framework estimates the market position recursively using Bayesian filtering (specifically, particle filtering) conditioned on observable signals, including sales, demand realizations, and the macroeconomic state. Structural parameters, such as the decay and cross-channel competition matrices, are calibrated using historical panel data or updated asynchronously as new observations accumulate. The control policy is inherently robust to moderate estimation uncertainties, a property further validated by the sensitivity analysis in the experimental section. For generalizability and data confidentiality, all quantities in our simulations are normalized: inventory and production are expressed in standardized units, while rewards and costs are reported in consistent dimensionless profit units.
3.2. Core State Variables and System Dynamics
The dynamic characteristics of the model are driven by a set of interconnected state variables that collectively constitute the complete system state.
We introduce a core but partially observable market position vector to characterize channel i’s long-term intangible assets. Each dimension of this vector represents a specific aspect of market competitiveness, such as brand goodwill, customer loyalty, or market awareness. This vector serves as the slow variable of the system dynamics, encapsulating the cumulative effects of all past decisions. In consumer-facing wood flooring markets, preference formation reflects the perceived durability, aesthetics, and usage attributes of parquet products, supporting the modeling of market position as a persistent demand driver [15].
Concurrently, we define a directly observable physical inventory level, representing the quantity of finished goods available for sale by channel i at the start of period t. Since production is not instantaneous, we must simultaneously track in-process production orders: a pipeline vector records the quantities ordered in the preceding L periods but not yet delivered. In engineered wood flooring production, pressing conditions, adhesive spread, and process parameters influence yield rates and available output, motivating explicit accounting of lead times and pipeline status in downstream planning [16].
Additionally, external market environment randomness is described by a macroeconomic state variable, assumed to follow a finite-state Markov chain with a known transition probability matrix. This approach captures the persistent effects of economic cycles more effectively than simple i.i.d. random factors. From lifecycle and supply perspectives, exogenous shocks also propagate through manufacturing, installation, and use phases, reinforcing the need to couple demand-side persistence with supply-side costs [17].
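As a minimal illustration, the macroeconomic state can be simulated as a finite-state Markov chain. The two-state transition matrix below is a hypothetical example; the paper’s calibrated values are not reproduced here.

```python
import random

def simulate_macro_state(P, z0, T, rng):
    """Sample a path of length T from a finite-state Markov chain.

    P  : row-stochastic transition matrix, P[z][z_next] = Pr(z(t+1)=z_next | z(t)=z)
    z0 : initial state index
    """
    path = [z0]
    for _ in range(T - 1):
        z = path[-1]
        u, cum = rng.random(), 0.0
        for z_next, p in enumerate(P[z]):
            cum += p
            if u < cum:
                path.append(z_next)
                break
        else:
            path.append(len(P[z]) - 1)  # guard against floating-point rounding
    return path

# Hypothetical two-state economy: 0 = expansion, 1 = recession.
P = [[0.9, 0.1],
     [0.3, 0.7]]
rng = random.Random(0)
path = simulate_macro_state(P, 0, 200, rng)
```

The self-transition probabilities (0.9 and 0.7) are what encode the persistence of economic cycles that an i.i.d. shock model would miss.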
Thus, the complete system state in period t collects the market position vectors, inventory levels, pipeline order vectors, and the macroeconomic state, and can be expressed as

$$\mathbf{S}(t) = \Big( \{\mathbf{x}_i(t)\}_{i},\ \{I_i(t)\}_{i},\ \{\mathbf{P}_i(t)\}_{i},\ z(t) \Big).$$
3.3. State Evolution Equations
System state evolution is governed by a set of stochastic difference equations linking current-period decisions to next-period states.
The market position vector evolution captures the cumulative and decaying effects of marketing investments. Drawing from related models in marketing science, we formulate its dynamics as

$$\mathbf{x}_i(t+1) = (\mathbf{I} - \boldsymbol{\Delta}_i)\,\mathbf{x}_i(t) + \mathbf{f}_i\big(b_i(t)\big) - \sum_{j \ne i} \mathbf{C}_{ij}\, b_j(t) + \boldsymbol{\omega}_i(t),$$

where $\mathbf{I}$ denotes the identity matrix, and the diagonal matrix $\boldsymbol{\Delta}_i$ represents natural decay or forgetting effects of market position. The nonlinear vector function $\mathbf{f}_i(\cdot)$ characterizes the constructive effect of channel i’s own investment $b_i(t)$ on the various dimensions of its market position, potentially exhibiting saturation properties. The matrix $\mathbf{C}_{ij}$ quantifies cross-channel erosion effects from channel j’s marketing activities on channel i’s market position, and $\boldsymbol{\omega}_i(t)$ represents zero-mean stochastic process noise. In wood flooring applications, the components of $\mathbf{x}_i(t)$ can be interpreted as the gradual accumulation of perceived quality and brand recognition for specific parquet structures and finishes, consistent with observed preference structures [15].
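A minimal numerical sketch of the market-position dynamics described above, with hypothetical decay, saturation, and erosion parameters (the paper’s calibrated values are not given), might look like:

```python
import math
import random

def step_market_position(x, b_own, b_others, decay, gain, erosion, sigma, rng):
    """One-period update of a channel's market position vector.

    x        : current position (list of floats, e.g., goodwill, loyalty)
    b_own    : channel's own marketing investment this period
    b_others : competitors' investments (list)
    decay    : per-dimension forgetting rates (diagonal of the decay matrix)
    gain     : per-dimension build-up scale; log1p gives saturation
    erosion  : per-dimension sensitivity to each competitor's spend
    sigma    : std of the zero-mean process noise
    """
    new_x = []
    for d in range(len(x)):
        built = gain[d] * math.log1p(b_own)             # saturating own effect
        eroded = sum(erosion[d] * b for b in b_others)  # cross-channel erosion
        noise = rng.gauss(0.0, sigma)
        new_x.append((1.0 - decay[d]) * x[d] + built - eroded + noise)
    return new_x

rng = random.Random(1)
x = [1.0, 0.5]
for t in range(50):
    x = step_market_position(x, b_own=2.0, b_others=[1.0, 1.5],
                             decay=[0.1, 0.2], gain=[0.3, 0.2],
                             erosion=[0.02, 0.03], sigma=0.01, rng=rng)
```

Because the decay rates are strictly positive, repeated application drives the position toward a stochastic steady state, which is the “slow variable” behavior described in the text.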
Physical inventory evolution follows strict material balance principles. Given production lead time L, a production order $u_i(t)$ placed in period t arrives at the start of period $t+L$. Consequently, the inventory level updates as

$$I_i(t+1) = I_i(t) + u_i(t-L) - s_i(t),$$

where $s_i(t)$ denotes the actual sales volume in period t, defined below. In engineered wood flooring production lines, process parameters and bonding quality may influence scrap rates and effective availability, making explicit pipeline order tracking alongside on-hand inventory practically relevant [16].
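The material-balance bookkeeping with a production lead time can be sketched as follows; the class name and the illustrative quantities are ours, not the paper’s.

```python
from collections import deque

class ChannelInventory:
    """Tracks on-hand inventory plus a pipeline of in-transit production orders."""

    def __init__(self, initial_inventory, lead_time):
        self.on_hand = initial_inventory
        # Pipeline slot 0 holds the order arriving next period.
        self.pipeline = deque([0.0] * lead_time, maxlen=lead_time)

    def step(self, new_order, demand):
        """Advance one period: receive the oldest order, sell, place a new order."""
        arrived = self.pipeline.popleft() if self.pipeline.maxlen else new_order
        self.on_hand += arrived
        sales = min(self.on_hand, demand)   # sales limited by available stock
        self.on_hand -= sales
        self.pipeline.append(new_order)
        return sales

inv = ChannelInventory(initial_inventory=100.0, lead_time=2)
sales = [inv.step(new_order=40.0, demand=50.0) for _ in range(5)]
```

Note how the first two periods sell from the initial stock while the pipeline fills; from period three onward sales are capped by the steady 40-unit arrivals, illustrating why pipeline state must be tracked explicitly.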
3.4. Demand, Sales, and Profit Functions
Within the state evolution framework, market demand, sales, and profit calculations become more refined.
Market demand is no longer an instantaneous product of short-term inputs but is jointly determined by the underlying market position and the macroeconomic environment; current-period marketing activities further stimulate demand on top of the established market position. A demand function reflecting this mechanism can be constructed as

$$D_i(t) = \phi_i\big(\mathbf{x}_i(t)\big)\, \mu\big(z(t)\big)\, \log\big(1 + p_i(t)\big)\, \exp\Big(-\kappa \sum_{j \ne i} p_j(t)\Big),$$

where the term $\phi_i(\mathbf{x}_i(t))$ represents the baseline demand potential determined by market position, while $\mu(z(t))$ reflects macroeconomic influences. The logarithmic function captures the diminishing marginal utility of the short-term input $p_i(t)$, while the exponential term characterizes competitive pressure from the other channels’ strategy intensities, with sensitivity $\kappa > 0$. This specification aligns with observations in parquet flooring markets where perceived product quality shapes purchase intentions and evolves with marketing communications and channel competition [15].
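One concrete instantiation consistent with this description is sketched below; the multiplicative composition and all parameter values are illustrative assumptions, not the paper’s calibration.

```python
import math

def channel_demand(base_potential, macro_factor, own_intensity,
                   rival_intensities, competition_sensitivity):
    """Demand = (position-driven potential) x (macro factor)
       x (saturating own-marketing lift) x (competitive damping)."""
    lift = math.log1p(own_intensity)                        # diminishing returns
    damping = math.exp(-competition_sensitivity * sum(rival_intensities))
    return base_potential * macro_factor * lift * damping

d = channel_demand(base_potential=120.0, macro_factor=1.1,
                   own_intensity=3.0, rival_intensities=[2.0, 1.0],
                   competition_sensitivity=0.1)
```

Raising the channel’s own intensity increases demand at a decreasing rate, while higher rival intensities shrink it exponentially, matching the two mechanisms named in the text.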
The actual sales volume in period t is constrained by the currently available inventory and the realized market demand:

$$s_i(t) = \min\big\{ I_i(t) + u_i(t-L),\ D_i(t) \big\}.$$
Having defined the sales volume, we can compute the system’s single-period total reward

$$R(t) = \sum_{i} \Big[ r_i\, s_i(t) - c_i^{m}\big(p_i(t)\big) - c_i^{p}\big(u_i(t)\big) - c_i^{h}\big(I_i(t)\big) \Big],$$

equal to the sum of all channels’ sales profits (with unit margin $r_i$) minus the relevant costs: the strategy-intensity-related marketing cost $c_i^{m}(\cdot)$, the production-volume-related cost $c_i^{p}(\cdot)$, and the inventory-level-related holding cost $c_i^{h}(\cdot)$.
We assume the cost functions are nonlinear and convex to reflect economies of scale or marginal cost variations. In wood flooring enterprises, lifecycle and process assessments indicate that manufacturing choices, energy use, and process parameters contribute nonlinearly to unit costs and environmental burdens, supporting convex modeling of the cost terms and the explicit role of inventory and pipeline decisions in the reward [16,17].
3.5. Problem Formulation: A Partially Observable Stochastic Dynamic Game
Based on the above setup, we formulate the original problem as a more profound and realistic Partially Observable Stochastic Dynamic Game (POSDG).
The core complexity stems from information asymmetry. The firm and channels can precisely observe physical states, such as inventory levels and pipeline orders, along with the macroeconomic state. However, the market position vector, being an intangible asset, cannot be directly observed. Decision-makers can only form a belief distribution or estimate of the current market position based on the history of all observable information. This structure aligns with wood flooring settings where user perceptions, competitive exposure, and process yields co-evolve over time, requiring estimation from partial observations [15,16,17].
Under this information structure, the firm’s objective is to design and execute an optimal policy, a function mapping the observable history to current decisions, that maximizes the expected total discounted reward over an infinite horizon:

$$\max_{\pi}\ \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^{t}\, R(t) \Big], \qquad \gamma \in (0, 1),$$

where $R(t)$ is the single-period total reward and the expectation is taken over all sources of randomness, namely market position evolution noise and macroeconomic state transitions, under the firm’s policy and the channels’ collective strategies.
Simultaneously, each channel i aims to find its own optimal policy mapping its information set to a sales strategy that maximizes its individual expected total discounted reward.
This problem constitutes a high-dimensional, non-convex, dynamic, stochastic, partially observable multi-agent decision problem. Such problems admit no simple analytical solutions and require advanced methods from approximate dynamic programming, reinforcement learning (particularly multi-agent reinforcement learning), or stochastic control. Therefore, to solve this problem, we design an innovative algorithm in the following chapter.
4. Online Learning and Control Algorithm Design
The Partially Observable Stochastic Dynamic Game (POSDG) problem formulated in the previous section presents significant computational challenges due to its high-dimensional state space, non-convex objective function, incomplete information structure, and dynamic multi-agent interactions. Traditional optimization methods prove inadequate for direct solution. To address this challenge, we design a hierarchical online learning and control algorithm. As shown in Figure 2, the core insight involves decomposing the original problem into two coupled subproblems: online Bayesian belief updating to handle state partial observability, and approximate dynamic programming decision making based on updated beliefs to solve the firm’s resource allocation and production planning.
Figure 2.
Flowchart of L-BAP. Each period performs belief update via particle filtering, predicts strategic channel reactions via iterative best response, and computes stability-aware firm actions via Lyapunov drift-plus-penalty optimization.
4.1. Algorithm Overview: Hierarchical Bayesian Approximate Dynamic Programming Framework
Our proposed algorithm sequentially executes three core modules in each decision period t. The process initiates with a belief update module that leverages observed sales outcomes from the previous period as new evidence, employing nonlinear filtering techniques, specifically particle filtering, to update the firm’s posterior probability distribution (belief state) over the current unobservable market position vector. Subsequently, a channel strategy prediction module, given potential firm actions and the current belief state, predicts non-cooperative game outcomes among rational channels to estimate their approximate Nash equilibrium strategies. Finally, a firm decision module, building upon the updated belief state and predicted channel strategies, utilizes Lyapunov optimization to construct a single-period approximate optimization problem. This formulation effectively transforms the complex long-term discounted reward objective into a more tractable deterministic problem that balances short-term rewards with long-term system stability, yielding the optimal current-period actions.
This hierarchical architecture effectively decouples learning (latent state inference) from control (current action optimization) tasks, enabling feasible online solution.
4.2. Bayesian Estimation of Market Position via Particle Filtering
Given the highly nonlinear nature of the market position state evolution (Equation (1)) and the potentially non-Gaussian distribution of the stochastic disturbances, traditional Kalman filters prove inadequate. We instead employ particle filtering, a robust nonparametric Bayesian filtering method, for online estimation of the market position vector. In each period t, the firm maintains a set of N weighted particles, collectively approximating the posterior probability distribution of the market position. Algorithm 1 details the procedural steps.
| Algorithm 1 Particle Filter Update for Market Position Vector |
(Note: Algorithm 1 follows the original “propagation-update-resampling” flow, in which the likelihood evaluates the propagated particles, raising causal concerns. A more standard SIR flow would first update and resample using the current observation before propagation. To maintain fidelity to the original structure, we preserve this flow while clarifying the logical dependency for the likelihood computation.)
Through this process, the firm effectively incorporates each period’s latest market sales information into its understanding of core intangible assets (market position), providing crucial input for subsequent precise decision making.
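A compact, bootstrap-style particle filter step for a scalar latent position is sketched below. The linear-Gaussian dynamics and the Gaussian observation likelihood are illustrative stand-ins, since the paper’s exact densities are not reproduced here.

```python
import math
import random

def particle_filter_step(particles, weights, observation, propagate, obs_std, rng):
    """One propagate -> weight -> resample cycle (SIR-style).

    particles  : list of latent-state samples
    weights    : current normalized weights
    observation: realized signal (e.g., sales) for this period
    propagate  : function sampling x(t+1) given x(t)
    obs_std    : std of the assumed Gaussian observation model
    """
    # 1. Propagate each particle through the (stochastic) dynamics.
    particles = [propagate(x, rng) for x in particles]
    # 2. Re-weight by the observation likelihood.
    weights = [w * math.exp(-0.5 * ((observation - x) / obs_std) ** 2)
               for w, x in zip(weights, particles)]
    total = sum(weights) or 1e-300
    weights = [w / total for w in weights]
    # 3. Multinomial resampling to combat weight degeneracy.
    particles = rng.choices(particles, weights=weights, k=len(particles))
    weights = [1.0 / len(particles)] * len(particles)
    return particles, weights

rng = random.Random(2)
N = 500
particles = [rng.gauss(0.0, 1.0) for _ in range(N)]
weights = [1.0 / N] * N
propagate = lambda x, r: 0.9 * x + r.gauss(0.0, 0.1)
for obs in [0.5, 0.6, 0.55]:
    particles, weights = particle_filter_step(particles, weights, obs,
                                              propagate, obs_std=0.2, rng=rng)
estimate = sum(p * w for p, w in zip(particles, weights))
```

After a few observations near 0.5, the weighted posterior mean concentrates around that value even though the prior was centered at zero, which is exactly how the filter folds each period’s sales signal into the belief about the intangible state.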
4.3. Lyapunov Approximate Dynamic Programming Framework
To handle the long-term discounted reward objective while maintaining stability of physical systems (e.g., inventory), we construct a Lyapunov optimization framework. Beyond physical queues (inventory), we introduce a “Virtual Debt Queue” specifically designed to transform the discounted reward problem into an equivalent problem manageable within the Lyapunov framework.
We define physical and virtual queues as follows. The physical inventory queue itself requires management; we aim to avoid both excessive levels (incurring holding costs) and insufficient levels (causing stockouts). The virtual debt queue accumulates, period by period, the shortfall between a reference reward level and the realized reward; keeping this queue stable forces realized rewards to keep pace with the reference. It can be shown that stabilizing this queue under specific conditions is equivalent to maximizing the long-term discounted reward.
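As a sketch, a classical Neely-style max-plus virtual queue tracking reward shortfall can be maintained as follows. The paper’s exact discount-aware recursion is not reproduced; this standard form and the reference level of 10 are illustrative assumptions.

```python
def update_debt_queue(q, realized_reward, reference_reward):
    """Max-plus virtual queue: debt grows when reward falls short of the
    reference and is paid down when reward exceeds it, never going negative."""
    return max(q + reference_reward - realized_reward, 0.0)

q = 0.0
history = []
for reward in [8.0, 12.0, 9.0, 15.0, 7.0]:
    q = update_debt_queue(q, reward, reference_reward=10.0)
    history.append(q)
```

The queue backlog at any time equals the accumulated unmet reward target, so penalizing its growth inside the drift term is what couples the per-period decision to the long-term objective.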
We define a Lyapunov function incorporating quadratic penalties for both physical inventory and virtual debt:

$$\mathcal{L}(t) = \frac{1}{2}\, \|\boldsymbol{\Theta}(t)\|^{2}, \qquad \boldsymbol{\Theta}(t) = \big(I_1(t), \ldots, I_n(t),\, Q(t)\big),$$

where $\boldsymbol{\Theta}(t)$ represents the vector of queues requiring generalized stabilization. The single-step Lyapunov drift is defined as $\Delta(t) = \mathbb{E}\big[\mathcal{L}(t+1) - \mathcal{L}(t) \,\big|\, \boldsymbol{\Theta}(t)\big]$.
Our objective in each period t is to select actions that minimize a modified drift-plus-penalty expression. Unlike standard approaches, to explicitly penalize excessive inventory levels while optimizing rewards, we define the penalty term as $\mathrm{pen}(t) = -R(t) + \eta \sum_i I_i(t)$, with a weight $\eta \ge 0$ on the explicit inventory penalty. Consequently, we aim to minimize

$$\Delta(t) + V\, \mathbb{E}\big[ \mathrm{pen}(t) \,\big|\, \boldsymbol{\Theta}(t) \big],$$

where V is a positive constant balancing reward maximization against queue stability.
Through derivation, we demonstrate that minimizing an upper bound of this expression equates to solving, in each period t, a deterministic single-period problem, denoted Problem (9), which maximizes the belief-averaged reward weighted by V minus the queue-weighted drift terms.
The expectation is approximated via sample averaging over the posterior particle set obtained from the particle filter.
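The per-period decision can be sketched as a search over candidate actions trading off belief-averaged reward against queue-weighted drift. The discrete candidate grid, the payoff forms, and all weights below are illustrative; the paper solves Problem (9) with stochastic gradients rather than enumeration.

```python
def drift_plus_penalty_action(candidates, particles, inventory, debt_queue, V,
                              expected_reward, inventory_change, reward_shortfall):
    """Pick the action minimizing queue-weighted drift minus V x expected reward.

    expected_reward(a, x)  : one-period reward under action a and latent state x
    inventory_change(a)    : net inventory increment induced by action a
    reward_shortfall(a, x) : reference reward minus realized reward
    """
    best_action, best_value = None, float("inf")
    for a in candidates:
        # Belief expectation approximated by averaging over the particle set.
        avg_reward = sum(expected_reward(a, x) for x in particles) / len(particles)
        avg_short = sum(reward_shortfall(a, x) for x in particles) / len(particles)
        value = (inventory * inventory_change(a)
                 + debt_queue * avg_short
                 - V * avg_reward)
        if value < best_value:
            best_action, best_value = a, value
    return best_action

particles = [0.8, 1.0, 1.2]                    # posterior samples of market position
action = drift_plus_penalty_action(
    candidates=[0.0, 1.0, 2.0, 3.0],           # candidate marketing budgets
    particles=particles, inventory=5.0, debt_queue=2.0, V=10.0,
    expected_reward=lambda a, x: 4.0 * x * (a ** 0.5) - a,   # concave payoff
    inventory_change=lambda a: 0.5 * a,
    reward_shortfall=lambda a, x: 3.0 - 4.0 * x * (a ** 0.5) + a)
```

A larger V tilts the choice toward immediate reward, while larger queue backlogs tilt it toward stabilization, mirroring the V trade-off analyzed in Section 4.5.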
4.4. Solving Hierarchical Decision Subproblems
Problem (9) remains complex due to the coupling between firm and channel decisions. We design a hierarchical iterative solution approach. First, given a candidate firm resource allocation and the current belief, the firm predicts the approximate Nash equilibrium of the inter-channel game. Given non-convex channel utility functions, we employ an iterative best response algorithm to locate a fixed point as an equilibrium approximation.
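An iterative best-response loop for the channel sub-game might be sketched as follows; the quadratic utilities are chosen purely for illustration, so each best response has a closed form.

```python
def best_response_equilibrium(n, own_gain, cross, max_iters=100, tol=1e-8):
    """Iterate channel-wise best responses until a fixed point.

    Each channel i maximizes the illustrative quadratic utility
        u_i(s) = own_gain[i]*s_i - 0.5*s_i**2 - cross*s_i*sum_{j!=i} s_j,
    whose best response is s_i = max(own_gain[i] - cross*sum_{j!=i} s_j, 0).
    """
    s = [0.0] * n
    for _ in range(max_iters):
        max_change = 0.0
        for i in range(n):
            others = sum(s) - s[i]
            new_si = max(own_gain[i] - cross * others, 0.0)
            max_change = max(max_change, abs(new_si - s[i]))
            s[i] = new_si
        if max_change < tol:
            break
    return s

# Three hypothetical channels (designer, engineering, e-commerce).
eq = best_response_equilibrium(3, own_gain=[4.0, 3.0, 2.0], cross=0.2)
```

With a small cross-effect the best-response map is a contraction and the sweep converges quickly; for the non-convex utilities in the paper, the same loop yields only an approximate fixed point, as Assumption 3 anticipates.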
After substituting the predicted equilibrium strategies, the firm’s objective function depends solely on its own decisions. However, this function remains highly complex and potentially non-convex. We adopt a stochastic gradient-based optimization method (e.g., Adam or RMSprop). In each iteration, random sampling from the particle set provides unbiased estimates of the objective gradient, guiding the decision updates. Algorithm 2 provides pseudo-code for the single-period decision process.
| Algorithm 2 Hierarchical Online Learning and Control Algorithm (Period t) |
4.5. Theoretical Performance Analysis
This section provides rigorous theoretical performance analysis for the proposed hierarchical online learning and control algorithm. Given the algorithm’s integration of Bayesian filtering, approximate dynamic programming, and game theory, its analysis presents significant complexity. Our objective is to demonstrate that, under reasonable mathematical assumptions, the algorithm guarantees stability of the overall dynamic system, and its long-term performance can approach an idealized theoretical optimum, with quantifiable and controllable performance gaps.
4.5.1. Technical Assumptions for Analysis
To ensure analytical rigor, we introduce a series of relatively standard technical assumptions in stochastic control and learning theory.
Assumption 1 (Boundedness).
All states, actions, rewards, and stochastic processes in the system are assumed bounded. Specifically, there exist positive constants such that, for all time t and all realizations, the market position vectors, inventory levels, pipeline orders, marketing investments, and strategy intensities remain within fixed bounds, the single-period reward satisfies $|R(t)| \le R_{\max}$, and the stochastic disturbances satisfy $\|\boldsymbol{\omega}_i(t)\| \le \omega_{\max}$.
Assumption 2 (Lipschitz Continuity).
All system functions, including the state evolution functions and the reward function R, are assumed Lipschitz continuous in all their arguments. For example, for the reward function there exists a constant $L_R$ such that for any two state-action pairs $(\mathbf{S}, \mathbf{a})$ and $(\mathbf{S}', \mathbf{a}')$, $|R(\mathbf{S}, \mathbf{a}) - R(\mathbf{S}', \mathbf{a}')| \le L_R\, \|(\mathbf{S}, \mathbf{a}) - (\mathbf{S}', \mathbf{a}')\|$.
Assumption 3 (Subproblem Solution Accuracy).
We assume the optimization loops within Algorithm 2 find approximate solutions. Specifically, the solution found by the firm decision module in period t has a bounded optimization gap relative to the true optimum of the deterministic problem (9). Simultaneously, the difference between the approximate Nash equilibrium found by the channel equilibrium prediction module and the true equilibrium has a bounded impact on the system reward function. We denote the combined upper bound of these subproblem solution errors on the single-period drift-plus-penalty objective by $\epsilon_{\mathrm{opt}}$.
Assumption 4 (Particle Filter Performance).
Following standard particle filter theory, we assume that the mean squared error of the hidden state estimate is controlled by the particle count N: there exists a positive constant $C_{\mathrm{PF}}$ such that $\mathbb{E}\big[\|\hat{\mathbf{x}}(t) - \mathbf{x}(t)\|^2\big] \le C_{\mathrm{PF}}/N$, where $\hat{\mathbf{x}}(t)$ is the particle filter’s posterior-mean estimate.
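The $O(1/N)$ scaling can be checked empirically with a toy Monte Carlo estimate of a Gaussian mean, a simple stand-in for the filter’s posterior-mean estimate; the sample sizes and trial counts below are illustrative.

```python
import random

def mc_mse(n_particles, n_trials, true_mean, rng):
    """Mean squared error of an n_particles-sample Monte Carlo mean estimate."""
    total_sq_err = 0.0
    for _ in range(n_trials):
        samples = [rng.gauss(true_mean, 1.0) for _ in range(n_particles)]
        estimate = sum(samples) / n_particles
        total_sq_err += (estimate - true_mean) ** 2
    return total_sq_err / n_trials

rng = random.Random(3)
mse_small = mc_mse(n_particles=10, n_trials=400, true_mean=1.0, rng=rng)
mse_large = mc_mse(n_particles=1000, n_trials=400, true_mean=1.0, rng=rng)
```

With unit-variance noise the theoretical MSE is $1/N$, so moving from 10 to 1000 particles should shrink the error by roughly two orders of magnitude, which is the computation-versus-accuracy dial the theorem exposes.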
4.5.2. Lyapunov Drift Upper Bound Lemma
Lyapunov analysis forms the core of our theoretical proof. We first derive a rigorous upper bound for the single-step Lyapunov drift, connecting expected queue changes with single-period system rewards.
Lemma 1
(Lyapunov Drift Upper Bound). Under Assumptions 1 and 2, for any given observable history and any feasible policy taken under this history, the single-step Lyapunov drift satisfies a bound of the form

$$\Delta(t) \le B_0 + \mathbb{E}\big[ \boldsymbol{\Theta}(t)^{\top} \boldsymbol{\delta}(t) \,\big|\, \boldsymbol{\Theta}(t) \big],$$

where $\boldsymbol{\delta}(t)$ collects the bounded single-period queue increments and $B_0$ is a positive constant depending only on system parameters and the bounds from the assumptions.
Proof.
We begin from the Lyapunov function definition and examine the expected single-step change of each component. First, consider the physical inventory queue. From its evolution equation,

$$\tfrac{1}{2} I_i(t+1)^2 - \tfrac{1}{2} I_i(t)^2 = I_i(t)\big(u_i(t-L) - s_i(t)\big) + \tfrac{1}{2}\big(u_i(t-L) - s_i(t)\big)^2.$$

By Assumption 1, sales and production are bounded, so the quadratic term has bounded expectation; summing this term over all channels yields a constant. The remaining cross term is linear in $I_i(t)$.

Next, consider the virtual debt queue, whose max-plus evolution implies

$$\tfrac{1}{2} Q(t+1)^2 - \tfrac{1}{2} Q(t)^2 \le Q(t)\, \delta_Q(t) + \tfrac{1}{2}\, \delta_Q(t)^2,$$

where $\delta_Q(t)$ is the single-period queue increment; here we used the inequality $\max\{x, 0\}^2 \le x^2$ together with the boundedness of $\delta_Q(t)$ implied by the reward bound in Assumption 1.

Summing both parts yields an upper bound for the total drift $\Delta(t)$. Combining all constant terms into $B_0$ completes the proof. □
4.5.3. Performance Analysis of POSDG
We now connect the drift upper bound with Algorithm 2’s decision rule. As established, the algorithm aims to minimize the drift-plus-penalty expression. Substituting the bound (10) from Lemma 1 into this expression shows that Algorithm 2 minimizes an upper bound of the entire drift-plus-penalty expression precisely by maximizing the objective function in (9).
For notational simplicity, let $\mathbf{a}_t^{\mathrm{alg}}$ denote the actions taken by our algorithm in period $t$, and $\mathbf{a}_t^{*}$ the actions taken by an ideal, fully informed policy (observing the true state $\mathbf{x}_t$) that perfectly solves all subproblems. Let $G(\cdot)$ represent the objective function in (9).
Since our algorithm maximizes $G$ based on the belief $b_t$, its solution satisfies (accounting for estimation and solution errors)
$$\mathbb{E}\big[G(\mathbf{a}_t^{\mathrm{alg}}, \hat{\mathbf{x}}_t) \,\big|\, \mathcal{H}_t\big] \;\ge\; \mathbb{E}\big[G(\mathbf{a}_t^{*}, \hat{\mathbf{x}}_t) \,\big|\, \mathcal{H}_t\big] - \epsilon.$$
Converting belief-based decisions to evaluation under the true state introduces a performance loss due to estimation error. By Assumptions 2 and 4, this error's expectation is bounded:
$$\mathbb{E}\big|G(\mathbf{a}, \hat{\mathbf{x}}_t) - G(\mathbf{a}, \mathbf{x}_t)\big| \;\le\; L_G\,\mathbb{E}\|\hat{\mathbf{x}}_t - \mathbf{x}_t\| \;\le\; L_G\sqrt{C_{\mathrm{PF}}/N_p},$$
where $L_G$ is the Lipschitz constant of the objective function $G$. Substituting this relationship back into the drift-plus-penalty inequality (11) yields
$$\Delta(t) - V\,\mathbb{E}[r(t) \mid \mathcal{H}_t] \;\le\; C \;-\; \mathbb{E}\big[G(\mathbf{a}_t^{*}, \mathbf{x}_t) \mid \mathcal{H}_t\big] \;+\; \epsilon \;+\; 2L_G\sqrt{C_{\mathrm{PF}}/N_p}.$$
(Note: $r(t)$ is the reward actually obtained by the algorithm.) Taking the total expectation of inequality (12), summing over $t = 0, \ldots, T-1$, and dividing by $T$, the telescoping term $\mathbb{E}[L(T) - L(0)]/T$ approaches 0 as $T \to \infty$. Through algebraic manipulation and rearrangement of the reward terms, we ultimately establish the gap between our algorithm's long-term average reward and the optimal policy's reward.
4.5.4. Queue Stability and Performance Bound Theorem
Based on the above derivation, we now formally state the core performance theorem for our algorithm.
Theorem 1
(Queue Stability and Performance Bound). Under Assumptions 1–4, the policy generated by Algorithm 2 possesses the following properties: (1) All physical and virtual queues in the system are mean-square bounded. Specifically, there exists a constant $C_Q$ dependent on $V$ such that
$$\limsup_{T \to \infty} \frac{1}{T} \sum_{t=0}^{T-1} \mathbb{E}\Big[\sum_{i=1}^{N} Q_i(t)^2 + D(t)^2\Big] \;\le\; C_Q, \qquad C_Q = O(V).$$
(2) A deterministic performance bound exists between the algorithm's achieved long-term discounted reward and the optimal discounted reward under complete information. For any $V > 0$ and particle count $N_p$, the algorithm's performance satisfies
$$R^{\mathrm{opt}} - R^{\text{L-BAP}} \;\le\; \frac{B}{V} \;+\; \frac{\kappa_1}{\sqrt{N_p}} \;+\; \kappa_2\,\epsilon,$$
where $B$ is a constant depending on system boundedness, and $\kappa_1, \kappa_2$ are positive constants independent of $V$ or $N_p$.
(Note: this bound form is standard for Lyapunov analysis of discounted problems and differs slightly from the form given in the original manuscript.)
The theorem clearly reveals several key performance trade-offs. The first term represents the structural error from Lyapunov approximate dynamic programming: increasing the parameter $V$ can make this performance gap arbitrarily small. The cost, however, is that average queue lengths (system stability) grow as $O(V)$, meaning the system must tolerate greater fluctuations in inventory and debt. The second term stems from partial observability and Bayesian filtering estimation error: increasing the particle count $N_p$ (i.e., allocating more computational resources) systematically reduces losses from incomplete information. The final term relates directly to subproblem solver accuracy. The remaining proofs for our algorithm are provided in Appendix A.
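The three-term structure of the bound can be explored numerically, assuming the $O(1/V)$ and $O(1/\sqrt{N_p})$ scalings discussed above. The constants `B`, `C1`, `C2` and the error `eps` below are arbitrary illustrative values, not quantities estimated from the model.

```python
def performance_gap(V, Np, eps, B=100.0, C1=5.0, C2=1.0):
    """Illustrative Theorem-1-style gap: B/V + C1/sqrt(Np) + C2*eps.

    B, C1, C2 are placeholder constants; V is the Lyapunov trade-off
    parameter, Np the particle count, eps the subproblem solution error.
    """
    return B / V + C1 / Np ** 0.5 + C2 * eps
```

Sweeping these arguments reproduces the qualitative trends of Figures 10 and 11: the gap shrinks monotonically in both $V$ and $N_p$, while (per part (1) of the theorem) the queue bound grows linearly in $V$.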
4.5.5. Convergence and Computational Complexity Analysis
We analyze the convergence of the algorithm's internal components and its overall computational overhead. The channel equilibrium prediction module employs iterative best response. For the non-convex games in our model, this iterative process does not guarantee convergence to a unique pure-strategy Nash equilibrium. However, in many practical applications, such heuristic iterative methods often converge to stable strategy distributions or approximate equilibrium points. Our framework's theoretical analysis accommodates this equilibrium solution inaccuracy through the $\epsilon$ term. The stochastic gradient ascent method used in the firm decision module has solid theoretical support in stochastic optimization. Under Assumption 2's Lipschitz conditions, one can prove the algorithm converges to a stationary point satisfying the KKT (Karush–Kuhn–Tucker) conditions, i.e., a local optimum or saddle point. The algorithm's per-period computational cost comprises three main components. The particle filter module costs $O(N_p\,k\,N)$, where $N_p$ is the particle count, $k$ is the market position vector dimension, and $N$ is the channel count. The channel equilibrium prediction module costs approximately $O(T_{\mathrm{br}}\,N\,C_{\mathrm{br}})$, where $T_{\mathrm{br}}$ is the number of best-response iterations needed to reach an approximate equilibrium and $C_{\mathrm{br}}$ is the cost of computing a single channel's best response. The firm decision module costs $O(K\,B_s\,C_{\mathrm{obj}})$, where $K$ is the number of gradient update steps, $B_s$ is the batch size for gradient estimation, and $C_{\mathrm{obj}}$ is the cost of evaluating the objective function for a single sample. The total per-period complexity is therefore polynomial in $(N_p, k, N, T_{\mathrm{br}}, K, B_s)$. This indicates manageable computational overhead, with trade-offs between computational efficiency and solution accuracy achievable by tuning these hyperparameters.
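A back-of-the-envelope tally of the per-period cost follows directly from the three module counts; the unit costs `C_br` and `C_obj` are abstract placeholders for one best-response computation and one objective evaluation, respectively.

```python
def per_period_ops(Np, k, N, T_br, C_br, K, Bs, C_obj):
    """Rough per-period operation count for the three L-BAP modules.

    Particle filter      ~ Np * k * N
    Equilibrium module   ~ T_br * N * C_br
    Firm decision module ~ K * Bs * C_obj
    (all up to constant factors).
    """
    pf = Np * k * N
    eq = T_br * N * C_br
    fd = K * Bs * C_obj
    return pf + eq + fd
```

The total is additive across modules, so the cost grows only linearly when any single hyperparameter (e.g., the particle count) is scaled up.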
5. Performance Evaluation and Experimental Analysis
To validate the practical performance of the proposed hierarchical online learning and control algorithm (hereafter referred to as L-BAP for convenience) and investigate its behavioral characteristics in complex dynamic environments, we construct a high-fidelity discrete-event simulation platform. This section details the experimental setup, baseline algorithms for comparison, and performance evaluation dimensions, and provides in-depth analysis and discussion of the simulation results.
5.1. Experimental Setup and Parameter Configuration
We implement the simulation platform using Python 3.8.2, leveraging relevant scientific computing libraries for numerical optimization and stochastic process simulation. We configure a representative “wood flooring” enterprise case with three channels ($N = 3$). These channels are designed as follows: Channel 1 simulates a high-margin, high-brand-contribution but demand-volatile designer channel; Channel 2 represents a moderate-margin, high-volume, relatively stable large-scale project channel; and Channel 3 models a low-margin, highly competitive online e-commerce channel that is sensitive to strategic intensity.
Core model parameters aim to reflect business realities. The market position vector is two-dimensional ($k = 2$), representing brand goodwill and customer base, respectively. Specific parameters for the state evolution, demand, and cost functions are summarized in Table 2. The macroeconomic state is modeled as a two-state (boom/recession) Markov chain whose transition probability matrix is provided in the table. The production lead time $L_p$ and the reward discount factor $\gamma$ are set as listed in Table 2, as are the core hyperparameters of our L-BAP algorithm: the Lyapunov trade-off parameter $V$ and the particle count $N_p$.
Table 2.
Core parameter settings for simulation model.
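For reproducibility, the experimental configuration might be organized as below. Only $N = 3$ and $k = 2$ come from the text; every other numeric value is a placeholder to be substituted with the settings listed in Table 2, not the paper's actual configuration.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SimConfig:
    """Sketch of the simulation configuration (placeholder values flagged)."""
    n_channels: int = 3           # N = 3 (designer, project, e-commerce)
    state_dim: int = 2            # k = 2 (brand goodwill, customer base)
    # --- placeholders below: substitute the values from Table 2 ---
    lead_time: int = 2            # production lead time L_p (placeholder)
    discount: float = 0.95        # reward discount factor gamma (placeholder)
    V: float = 50.0               # Lyapunov trade-off parameter (placeholder)
    n_particles: int = 1000       # particle filter size N_p (placeholder)
    macro_transition: List[List[float]] = field(
        default_factory=lambda: [[0.9, 0.1], [0.2, 0.8]]  # boom/recession (placeholder)
    )
```

Keeping the configuration in one dataclass makes parameter sweeps (e.g., over `V` and `n_particles`, as in Figures 10 and 11) a matter of constructing variants.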
5.2. Baseline Algorithms for Comparison
To comprehensively evaluate L-BAP performance, we select four representative baseline algorithms. These are designed to highlight L-BAP’s core advantages from different perspectives (e.g., foresight capability, hidden state handling).
“Myopic-Known (MK)” represents an idealized baseline. It assumes perfect observation of the true market position vector , but its decision objective maximizes only current single-period expected reward , ignoring long-term effects. This algorithm quantifies the long-term planning value provided by our Lyapunov framework.
“Myopic-Learned (ML)” constitutes a more realistic myopic algorithm. Like L-BAP, it cannot directly observe and uses the same particle filter module for state estimation. However, its decision objective similarly maximizes the current period expected reward. Comparing ML with L-BAP isolates the pure value of long-term planning.
“Static Policy (SP)” represents traditional management approaches lacking dynamic adaptation. This policy computes fixed resource allocation ratios and production levels based on long-term average market environment estimates, remaining constant throughout simulation.
“Decentralized-RL (D-RL)” serves as a strong baseline from multi-agent reinforcement learning. We model each channel as an independent learning agent, alongside the firm. Each agent learns based on local observations, aiming to maximize individual long-term rewards (e.g., using PPO or SAC algorithms). This baseline tests our proposed centralized coordination framework against popular fully decentralized learning methods.
5.3. Performance Evaluation Metrics
We evaluate algorithm performance across multiple dimensions. The core metric is the “cumulative discounted reward” $\sum_{t} \gamma^{t} r(t)$, measuring overall profitability. “Inventory stability” is measured by the time-series standard deviation of total inventory across all channels, with smaller values indicating smoother inventory management and weaker bullwhip effects. “Customer service level” is defined as the total order fulfillment rate, reflecting the firm's ability to handle demand fluctuations. Additionally, we assess learning and perception capability through “state estimation accuracy”, measured by the root-mean-square error (RMSE) between the particle filter estimates $\hat{\mathbf{x}}_t$ and the true values $\mathbf{x}_t$.
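The four metrics above translate directly into code; this is a straightforward sketch of the definitions applied to simulation traces.

```python
import numpy as np

def cumulative_discounted_reward(rewards, gamma):
    """sum_t gamma^t * r_t over the simulation horizon."""
    t = np.arange(len(rewards))
    return float(np.sum(gamma ** t * np.asarray(rewards)))

def inventory_stability(inventory_totals):
    """Std. dev. of total inventory over time; lower = smoother."""
    return float(np.std(inventory_totals))

def service_level(fulfilled, demanded):
    """Total order fulfillment rate across the whole run."""
    return float(np.sum(fulfilled) / np.sum(demanded))

def estimation_rmse(estimates, truths):
    """RMSE between filtered state estimates and true hidden states."""
    err = np.asarray(estimates, dtype=float) - np.asarray(truths, dtype=float)
    return float(np.sqrt(np.mean(err ** 2)))
```

Computing all four on the same traces is what allows the reward/stability/service trade-offs in Table 3 to be compared on equal footing.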
5.4. Experimental Results and Analysis
We first compare algorithm performance at both overall reward and dynamic behavior levels.
Figure 3 shows the evolution of the cumulative discounted reward over time. After a brief initial learning phase, our L-BAP algorithm's reward accumulation curve exhibits a significantly steeper slope than all baselines, eventually converging to the highest total reward (approximately 15,500). Myopic-Learned (ML) performs second best (approximately 13,000), but its gap with L-BAP widens over time. This clearly demonstrates that learning current states alone is insufficient: lacking foresight in long-term planning leads to substantial potential revenue loss.
Figure 3.
Evolution of cumulative discounted reward over 1000 simulation periods for different algorithms.
Notably, the idealized Myopic-Known (MK) algorithm (approximately 13,800), despite perfect information and strong early performance, is eventually surpassed by L-BAP with its long-term planning capability. This indicates that, in our model, the long-term planning value provided by the Lyapunov framework can compensate for the information disadvantage from partial observability. D-RL and SP perform considerably worse, validating the necessity of dynamic adaptation and centralized coordination.
Figure 4 and Figure 5 reveal the dynamic reasons behind L-BAP's high cumulative rewards. Figure 4 shows moving averages of single-period rewards. After the initial learning phase, L-BAP's reward curve steadily climbs above 350, significantly exceeding all competitors. In contrast, ML and MK rewards fluctuate around 300, with D-RL and SP even lower. This indicates that L-BAP achieves not only higher average rewards but also relatively controlled reward volatility.
Figure 4.
Single-period rewards (15-period moving average) for different algorithms.
Figure 5.
Customer service levels (25-period moving average) for different algorithms.
Figure 5 provides corroborating evidence from a demand fulfillment perspective. After the learning phase, L-BAP maintains the customer service level (order fulfillment rate) stably above 95%, nearly achieving ideal supply–demand balance. All other baseline algorithms, including MK with perfect information, exhibit lower and more volatile service levels (e.g., ML and MK oscillate around 90%), directly leading to lost sales opportunities and reduced rewards.
Next, we analyze algorithm performance in supply chain stability (inventory management). Figure 6 shows total inventory dynamics during the mid-simulation window ($t = 400$ to $t = 650$). The figure visually demonstrates L-BAP's clear superiority in inventory management. Facing identical demand fluctuations and production delays, L-BAP's inventory curve (blue) exhibits smooth, controlled periodic oscillations, successfully maintaining inventory within a relatively healthy target range (approximately 300 to 600).
Figure 6.
Comparison of total inventory dynamics during mid-simulation period (t = 400 to 650).
In contrast, all baseline algorithms show severe inventory oscillations. SP and D-RL curves (orange and green) exhibit extreme, high-amplitude “bullwhip effects”, with inventory sometimes exceeding 1000 (facing severe overstock risk) and dropping to 200 (causing critical shortages). ML and MK inventory fluctuations, while slightly better than SP and D-RL, still show a significantly higher amplitude (approximately 250 to 750) and frequency than L-BAP. This strongly validates the direct effectiveness of the Lyapunov framework’s explicit inventory queue control through Equation (9).
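The inventory dynamics underlying these curves can be reproduced with a short pipeline simulation in which production ordered today arrives only after the lead time; the two-period lead time and the input sequences are illustrative assumptions.

```python
from collections import deque

def simulate_inventory(production, demand, lead_time=2, init_inventory=0.0):
    """Inventory trace when production arrives after `lead_time` periods.

    Per period: inventory += arrivals; sales = min(inventory, demand)
    (unmet demand is lost); inventory -= sales. Returns the end-of-period
    inventory trace.
    """
    pipeline = deque([0.0] * lead_time)   # orders currently in transit
    inv, trace = float(init_inventory), []
    for p, d in zip(production, demand):
        inv += pipeline.popleft()         # today's arrivals (ordered L_p ago)
        pipeline.append(float(p))         # today's order arrives later
        sales = min(inv, float(d))        # shortfall => lost sales
        inv -= sales
        trace.append(inv)
    return trace
```

The lead-time lag between ordering and arrival is precisely what naive reactive policies amplify into the bullwhip oscillations observed for SP and D-RL, and what the Lyapunov queue term explicitly damps.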
- Discussion. The bar summaries in Figure 7 and Figure 8 confirm the time-series trends: L-BAP achieves the highest long-run reward while simultaneously reducing inventory volatility by a large margin. Importantly, MK (perfect information but myopic) is still outperformed by L-BAP, highlighting that stability-aware planning has a first-order impact beyond state observability alone.
Figure 7. End-of-horizon cumulative discounted reward comparison (mean ± std over 30 seeds).
Figure 8. Inventory stability comparison measured by inventory standard deviation (mean ± std over 30 seeds). Lower is better.
Table 3 provides a quantitative summary of key performance metrics at simulation end. These aggregated data confirm observations from dynamic curves. L-BAP achieves optimal results across three core business metrics (reward, inventory stability, service level). Its exceptionally low inventory standard deviation (85.2) is particularly notable—approximately half that of ML and MK, and far lower than that of D-RL and SP. This demonstrates L-BAP’s effectiveness in smoothing supply chain disturbances caused by demand fluctuations and production delays through foresighted production planning.
Table 3.
Overall performance metrics (mean ± std) over 30 random seeds.
We next analyze the effectiveness of the core algorithm components. Figure 9 depicts state estimation error (RMSE) versus particle count $N_p$. Both L-BAP and ML (sharing the same particle filter module) show smoothly decreasing estimation error with increasing $N_p$, validating the Bayesian filtering module's effectiveness as the system's “perception” unit. At identical particle counts, both algorithms show very similar RMSE values (e.g., at the default particle count, L-BAP: 0.88, ML: 0.91, as in Table 3). This reveals a deeper conclusion: L-BAP's substantial performance advantage (Figure 3) stems not merely from better *state estimation*, but from superior *decision planning*. That is, the L-BAP framework better utilizes (error-prone) belief states to generate efficient, robust long-term resource allocation decisions.
Figure 9.
State estimation RMSE versus particle count $N_p$ (L-BAP vs. ML).
Finally, we experimentally validate the performance trade-offs from theoretical analysis by tuning L-BAP’s key hyperparameters. Figure 10 plots the relationship between long-term average reward and average inventory level achieved under different V values. The results clearly show a Pareto frontier: as V increases, the algorithm assigns higher weight to immediate reward in the optimization objective (Equation (9)), thereby increasing the long-term average reward; however, the implicit penalty for queue stability is relatively relaxed, causing the average inventory level to rise accordingly. These experimental results perfectly match the performance–stability trade-off revealed in our theoretical analysis (i.e., V-controlled queue bound versus reward loss), providing strong empirical evidence for the algorithm’s design rationale.
Figure 10.
Pareto frontier of long-term average reward versus average inventory level for L-BAP under varying trade-off parameter V.
Figure 11 further shows a heatmap of long-term average reward as a function of the trade-off parameter $V$ and the particle count $N_p$. It reveals two clear monotonic trends: at fixed $N_p$, increasing $V$ improves the average reward; at fixed $V$, increasing $N_p$ also improves the average reward. This corresponds directly to our theoretical bound (Theorem 1): increasing $V$ reduces the structural error from Lyapunov approximate planning (the $O(1/V)$ term), while increasing $N_p$ reduces the estimation error from partial observability (the $O(1/\sqrt{N_p})$ term). The heatmap provides clear guidance for practical parameter tuning.
Figure 11.
Heatmap of L-BAP’s average reward (per period) as a function of $V$ and $N_p$.
5.5. Ablation Study and Component Contribution
To isolate the contribution of each module in L-BAP, we conduct an ablation study with three variants: (i) w/o PF replaces the particle filter belief with a fixed prior mean (no online latent-state inference); (ii) w/o Lyapunov removes Lyapunov planning and optimizes only myopic reward (equivalent to ML); (iii) w/o Game disables the game-response predictor and assumes channels keep a fixed strategy intensity. All results are averaged over 30 random seeds, as shown in Table 4 and Figure 12 and Figure 13.
Table 4.
Ablation study of L-BAP variants (mean ± std) over 30 random seeds.
Figure 12.
Ablation comparison on cumulative discounted reward (mean trajectory over 30 seeds).
Figure 13.
Ablation comparison on inventory stability (rolling inventory STD, mean over 30 seeds).
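The three ablation switches can be represented as feature flags, a sketch of the experimental design rather than the actual implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LBAPVariant:
    """One row of the ablation study: which L-BAP modules are enabled."""
    use_particle_filter: bool = True  # off => fixed prior-mean belief (w/o PF)
    use_lyapunov: bool = True         # off => myopic reward only (== ML)
    use_game_module: bool = True      # off => fixed channel strategy intensity
    name: str = "L-BAP (full)"

ABLATIONS = [
    LBAPVariant(),
    LBAPVariant(use_particle_filter=False, name="w/o PF"),
    LBAPVariant(use_lyapunov=False, name="w/o Lyapunov"),
    LBAPVariant(use_game_module=False, name="w/o Game"),
]
```

Each variant disables exactly one module relative to the full algorithm, so any performance drop in Table 4 is attributable to that module alone.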
- Findings. Removing Lyapunov planning causes the largest degradation in inventory stability and also lowers reward, indicating that the drift-plus-penalty structure is the main driver for robust long-term control under lead times. Removing particle filtering reduces reward because the planner cannot track latent market shifts and therefore misallocates budgets and production. Removing the game module also degrades reward and increases volatility, because the firm decisions become systematically mismatched with strategic channel reactions. Overall, the ablation results support the modular design philosophy: each component contributes materially, and their combination yields the best reward–stability balance.
6. Conclusions
This paper addresses the multi-channel resource allocation problem faced by wood flooring manufacturers in dynamic competitive markets. We construct a Partially Observable Stochastic Dynamic Game (POSDG) model that transcends traditional static or myopic perspectives. The model’s core contribution lies in endogenizing the firm’s intangible assets—such as brand goodwill and customer base—as hidden state variables that dynamically evolve with market investments and competitive behaviors, enabling deep analysis of long-term strategies. We propose a hierarchical online learning and control framework named L-BAP. Its innovation resides in the organic integration of three distinct yet complementary technologies: Bayesian state estimation via particle filtering to handle partial observability; Lyapunov approximate dynamic programming to transform the intractable long-term discounted reward problem into a sequence of online optimization subproblems guaranteeing physical system stability; and iterative best response to predict inter-channel game behaviors. We provide rigorous theoretical performance guarantees for the algorithm and validate its superiority over several strong baselines through extensive simulations. Experimental results demonstrate that our proposed method achieves higher long-term rewards while significantly enhancing supply chain stability (i.e., inventory smoothness), highlighting the value of integrating learning, optimization, and game theory for foresighted planning.
Several directions warrant future exploration. First, the structural parameters of the state evolution and demand functions are assumed known in our model. In more realistic scenarios, firms may need to learn or identify these parameters online, leading to more challenging dual control problems. Second, our channel game modeling employs a static Nash equilibrium within each period as an approximation. Future research could explore the evolutionary game dynamics that emerge when channels are modeled as agents with memory and learning capabilities. Finally, while our framework retains some generality, its computational complexity grows significantly with channel count and state dimension. Investigating more scalable approximation methods, such as deep reinforcement learning to parameterize firm or channel decision policies, is therefore a promising research direction.
Author Contributions
Conceptualization, Y.W.; methodology, Y.W.; software, Y.W.; validation, Y.W. and A.V.V.; formal analysis, Y.W.; investigation, Y.W.; data curation, Y.W.; writing—original draft preparation, Y.W.; writing—review and editing, A.V.V.; visualization, Y.W.; supervision, A.V.V. All authors have read and agreed to the published version of the manuscript.
Funding
This research received no external funding.
Data Availability Statement
The data and code used in the simulation experiments can be obtained from the corresponding author upon reasonable request.
Acknowledgments
The authors would like to thank the anonymous reviewers for their valuable comments.
Conflicts of Interest
The authors declare no conflicts of interest.
Appendix A. Proofs for Theoretical Performance Analysis
This appendix provides complete and rigorous mathematical derivations for the theoretical results presented in the main text. Our analysis focuses on the proposed hierarchical online learning and control algorithm, with the core objective of proving its stability and approximate optimality in Partially Observable Stochastic Dynamic Game environments. The proof procedure strictly adheres to previously defined models and notation, progressing through a series of lemmas and theorems.
Appendix A.1. Analytical Foundation: Restatement of Technical Assumptions
To ensure clarity and self-consistency in subsequent derivations, we reiterate the technical assumptions underlying our theoretical analysis.
Appendix A.2. Detailed Derivation of Lyapunov Drift Upper Bound
Lyapunov analysis forms the cornerstone of our theoretical framework. We must first establish a rigorous mathematical upper bound for the single-step Lyapunov drift.
Lemma A1
(Lyapunov Drift Upper Bound). Under Assumptions 1 and 2, for any given observable history $\mathcal{H}_t$ and any feasible policy taken under this history, the single-step Lyapunov drift $\Delta(t) \triangleq \mathbb{E}[L(t+1) - L(t) \mid \mathcal{H}_t]$ satisfies
$$\Delta(t) \;\le\; C \;+\; \sum_{i=1}^{N} Q_i(t)\,\mathbb{E}\big[P_i(t-L_p) - S_i(t) \,\big|\, \mathcal{H}_t\big] \;+\; D(t)\,\mathbb{E}\big[\delta(t) \,\big|\, \mathcal{H}_t\big],$$
where $C$ is a positive constant depending only on system parameters and the bounded constants from Assumption 1.
Proof.
The Lyapunov function comprises two components: $L(t) = \tfrac{1}{2}\sum_{i=1}^{N} Q_i(t)^2 + \tfrac{1}{2} D(t)^2$, the physical inventory part and the virtual debt part. We analyze the expected changes of both parts separately.
First, consider the drift of the physical inventory queue $Q_i(t)$. From its evolution equation $Q_i(t+1) = \max\{Q_i(t) + P_i(t-L_p) - S_i(t),\, 0\}$ and the bound $(\max\{q+a,0\})^2 \le q^2 + a^2 + 2qa$,
$$\tfrac{1}{2}\,\mathbb{E}\big[Q_i(t+1)^2 - Q_i(t)^2 \,\big|\, \mathcal{H}_t\big] \;\le\; \tfrac{1}{2}\,\mathbb{E}\big[(P_i(t-L_p) - S_i(t))^2 \,\big|\, \mathcal{H}_t\big] \;+\; Q_i(t)\,\mathbb{E}\big[P_i(t-L_p) - S_i(t) \,\big|\, \mathcal{H}_t\big].$$
By Assumption 1, sales and production are bounded, so the quadratic term must be bounded; we define the constant $C_1 \triangleq \tfrac{N}{2}\,\sup_{i,t}\mathbb{E}\big[(P_i(t-L_p) - S_i(t))^2\big]$. Summing over all channels yields an upper bound for the inventory drift:
$$\sum_{i=1}^{N} \tfrac{1}{2}\,\mathbb{E}\big[Q_i(t+1)^2 - Q_i(t)^2 \,\big|\, \mathcal{H}_t\big] \;\le\; C_1 \;+\; \sum_{i=1}^{N} Q_i(t)\,\mathbb{E}\big[P_i(t-L_p) - S_i(t) \,\big|\, \mathcal{H}_t\big].$$
Next, consider the virtual debt queue with evolution equation $D(t+1) = \max\{D(t) + \delta(t),\, 0\}$:
$$\tfrac{1}{2}\,\mathbb{E}\big[D(t+1)^2 - D(t)^2 \,\big|\, \mathcal{H}_t\big] \;\le\; \tfrac{1}{2}\,\mathbb{E}\big[\delta(t)^2 \,\big|\, \mathcal{H}_t\big] \;+\; D(t)\,\mathbb{E}\big[\delta(t) \,\big|\, \mathcal{H}_t\big].$$
By Assumption 1, $\mathbb{E}[\delta(t)^2]$ is bounded by a constant $C_2$. Summing both parts and setting $C = C_1 + C_2$ completes the proof. □
Appendix A.3. Drift-Plus-Penalty Analysis and Performance Gap Derivation
Our Algorithm 2 solves an approximate optimization problem (9) each period t. This decision rule closely relates to the drift lemma. To establish performance bounds, we first define a “drift-plus-penalty” expression aligned with the algorithm’s objective.
As described in Section 4.3, the algorithm aims to minimize the customized drift-plus-penalty expression $\Delta(t) - V\,\mathbb{E}[r(t) \mid \mathcal{H}_t]$. Substituting the drift upper bound from Lemma A1:
$$\Delta(t) - V\,\mathbb{E}[r(t) \mid \mathcal{H}_t] \;\le\; C \;+\; \mathbb{E}\Big[\sum_{i=1}^{N} Q_i(t)\big(P_i(t-L_p) - S_i(t)\big) \;+\; D(t)\,\delta(t) \;-\; V\,r(t) \,\Big|\, \mathcal{H}_t\Big].$$
Rearranging terms yields an upper bound of the form $C - \mathbb{E}[G(\mathbf{a}_t) \mid \mathcal{H}_t]$, where we define the single-period optimization objective function as
$$G(\mathbf{a}_t) \;\triangleq\; V\,r(t) \;-\; \sum_{i=1}^{N} Q_i(t)\,\big(P_i(t-L_p) - S_i(t)\big) \;-\; D(t)\,\delta(t).$$
This function exactly matches the objective solved by Algorithm 2 (Equation (9)). Therefore, inequality (A4) becomes
$$\Delta(t) - V\,\mathbb{E}[r(t) \mid \mathcal{H}_t] \;\le\; C \;-\; \mathbb{E}\big[G(\mathbf{a}_t) \mid \mathcal{H}_t\big].$$
Algorithm 2 aims to maximize the expectation of $G$ based on the belief $b_t$. Let $\mathbf{a}_t^{\mathrm{alg}}$ denote the algorithm's chosen action, and $\mathbf{a}_t^{*}$ the action taken by an ideal policy with complete information and perfect solution capability. Considering Assumptions 3 and 4, the algorithm's action satisfies
$$\mathbb{E}\big[G(\mathbf{a}_t^{\mathrm{alg}}, \hat{\mathbf{x}}_t)\big] \;\ge\; \mathbb{E}\big[G(\mathbf{a}_t^{*}, \hat{\mathbf{x}}_t)\big] - \epsilon.$$
The performance loss due to estimation error can be bounded. By Assumption 2, $G$ is also Lipschitz continuous in the state (with constant $L_G$). Thus,
$$\mathbb{E}\big|G(\mathbf{a}, \hat{\mathbf{x}}_t) - G(\mathbf{a}, \mathbf{x}_t)\big| \;\le\; L_G\,\mathbb{E}\|\hat{\mathbf{x}}_t - \mathbf{x}_t\| \;\le\; L_G\sqrt{C_{\mathrm{PF}}/N_p},$$
where the last step uses Jensen's inequality together with Assumption 4.
Substituting this performance relationship back into the upper bound yields
$$\Delta(t) - V\,\mathbb{E}[r(t) \mid \mathcal{H}_t] \;\le\; C \;-\; \mathbb{E}\big[G(\mathbf{a}_t^{*}, \mathbf{x}_t) \mid \mathcal{H}_t\big] \;+\; \epsilon \;+\; 2L_G\sqrt{C_{\mathrm{PF}}/N_p}. \quad \text{(A6)}$$
This inequality forms the core connection between algorithm execution, queue dynamics, and the performance gap from the ideal optimal policy.
Appendix A.4. Proof of Queue Stability and Performance Bound Theorems
Based on inequality (A6), we now formally prove the theorems from the main text.
Theorem A1
(Queue Stability). Under Assumptions 1–4, all physical and virtual queues generated by Algorithm 2 are mean-square bounded.
Proof.
We take the total expectation of inequality (A6). By Assumption 1, the reward is bounded, $|r(t)| \le r_{\max}$, and the ideal policy's objective function is likewise bounded (denote its bound by $G_{\max}$). Therefore, all non-queue quadratic terms can be absorbed into a single large constant $B_0$. The resulting inequality indicates that when the quadratic norm $\|\boldsymbol{\Theta}(t)\|^2$ of the queue vector $\boldsymbol{\Theta}(t) \triangleq (Q_1(t), \ldots, Q_N(t), D(t))$ is sufficiently large (exceeding some constant $B$), the expected drift $\mathbb{E}[L(t+1) - L(t)]$ must be negative. By the Foster–Lyapunov stability criterion, this ensures that the Markov chain formed by the queue vector is positive recurrent; hence its second moment (mean-square value) is finite. A more detailed analysis shows that the time-average second moment is $O(V)$. □
Theorem A2
(Performance Bound). Under Assumptions 1–4, the algorithm’s achieved long-term discounted reward and the optimal discounted reward under complete information satisfy the performance bound given in the main text.
Proof.
(Proof Sketch) Taking the total expectation of inequality (A6), summing over $t = 0, \ldots, T-1$, and dividing by $T$, we analyze the long-term time-average behavior:
$$\frac{\mathbb{E}[L(T)] - \mathbb{E}[L(0)]}{T} \;-\; \frac{V}{T}\sum_{t=0}^{T-1}\mathbb{E}[r(t)] \;\le\; \tilde{B} \;-\; \frac{1}{T}\sum_{t=0}^{T-1}\mathbb{E}\big[G(\mathbf{a}_t^{*}, \mathbf{x}_t)\big],$$
where $\tilde{B}$ aggregates all constant and error terms ($C$, $\epsilon$, and the estimation error $2L_G\sqrt{C_{\mathrm{PF}}/N_p}$). As $T \to \infty$, $\mathbb{E}[L(T)]/T \to 0$ by the mean-square boundedness of Theorem A1. This establishes a relationship between the algorithm's long-term time-average reward and the ideal policy's time-average objective.
Converting this time-average result to the “discounted reward” bound in the main text requires more sophisticated techniques beyond standard drift analysis. However, our Lyapunov function design incorporating the virtual debt queue $D(t)$ represents a known (though advanced) method for handling discounted reward problems. Through algebraic transformations (as described in the main text), one can prove that the performance gap of the drift-plus-penalty objective optimized by Algorithm 2 translates to the discounted reward domain.
The final performance bound (as shown in Theorem 1 of the main text) is a deterministic outcome of this analysis, decomposing the total performance loss into three controllable components:
$$R^{\mathrm{opt}} - R^{\text{L-BAP}} \;\le\; \frac{B}{V} \;+\; \frac{\kappa_1}{\sqrt{N_p}} \;+\; \kappa_2\,\epsilon,$$
where $B$, $\kappa_1$, $\kappa_2$ are the constants of Theorem 1. The first term represents the structural error from the Lyapunov approximation (parameter $V$); the second stems from estimation error due to partial observability (particle count $N_p$); the third results from implementation error due to inexact subproblem (game and optimization) solutions ($\epsilon$). This result provides clear theoretical guidance for algorithm tuning. □